Artificial Intelligence (AI) Full-Stack System (6)

Chapter 1 How Neural Networks are Implemented

In recent years neural networks have grown more and more complex, their application fields have become wider and wider, and their performance has become better and better. However, training still relies on the BP algorithm. There are improved variants of BP, but the general idea is essentially the same; they only make small improvements to individual parts of the algorithm, such as variable or adaptive step sizes. In addition, because the training data contain noise, a smaller loss function is not always better when training a neural network: when the loss function becomes extremely small, the so-called "overfitting" problem may occur, causing the performance of the neural network to degrade seriously in actual use.

6. Overfitting problem


1. What is the overfitting problem?

[Figure: six sample points (blue dots) taken from an unknown curve]

  • The blue dots in the figure above are 6 sample points. Assume these points were sampled from some curve, but we do not know what the original curve looks like. How can we "restore" the original curve from these 6 sample points? This is a fitting problem. The figure below shows 3 fitting solutions. The green one is a straight line; obviously the fit is a bit rough. The blue curve is more complicated: it passes through every sample point, perfectly fitting all 6 of them, which looks like a good result. But the price paid is that the curve twists and turns; it feels like fitting for the sake of fitting, without considering the distribution trend of the 6 sample points. Considering that the sampling process often contains noise, this so-called perfect fit is actually not perfect. The red curve does not pass through every sample point, but it better reflects the distribution trend of the six sample points and is likely to be closer to the original curve. Therefore, there is reason to believe that the red curve is closer to the original curve and is the fitting result we want. If we use the sum of squared errors between the fitting function and the sample points as the evaluation of the fit, that is, the loss function, then the green line has the largest loss because it is far from the sample points; the blue curve passes through every sample point, so its error is 0 and its loss is the smallest; and the loss of the red curve is somewhere in between. The green line is said to underfit because the fit is insufficient, the blue curve is said to overfit because the fit is excessive, and the red curve is the fitting result we hope for. In the training of neural networks, similar underfitting and overfitting problems also occur.

[Figure: three fitting results: a green straight line (underfitting), a blue curve through every sample point (overfitting), and a red curve following the overall trend (a good fit)]

  • Underfitting is obviously a bad result. What problems will overfitting bring?

2. Overfitting problem of neural network

[Figure: error rate (vertical axis) vs. number of training iterations (horizontal axis); red curve: error rate on the training set, blue curve: error rate on the test set]

  • We divide the sample set into two sets: a training set and a test set. The training set is used to train the neural network, and the test set is used to test its performance. As shown in the figure above, the vertical axis is the error rate and the horizontal axis is the number of training iterations. The red curve is the error rate on the training set and the blue curve is the error rate on the test set; after every certain number of training iterations, the error rates on both sets are measured. The figure shows that at the beginning of training, because the network is still underfitting, both the training-set error rate and the test-set error rate gradually decrease as training progresses. However, once the number of training iterations exceeds N, the error rate on the test set gradually increases again: this is the phenomenon of overfitting. The error rate on the test set reflects the performance of the neural network in actual use, so we want a fit that minimizes the test-set error rate, and training should therefore end when the number of iterations reaches N, to prevent overfitting.
  • During training, a smaller loss function is not always better.
  • It is not easy to tell when overfitting begins. A simple method is to use the test set: plot an error-rate curve like the one above, find the point N, and use the parameter values obtained at point N as the final parameter values of the neural network (a minimal sketch of this early-stopping idea follows this list).
  • However, this method requires a relatively large sample set, because both training and testing then need more samples, and in actual use we often face the problem of insufficient samples.
  • To solve the overfitting problem, researchers have proposed several methods that can effectively alleviate it. Of course, none of them is a cure-all; each only weakens the overfitting problem to a certain extent.
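A minimal sketch of the "stop at point N" idea above, written in Python. It assumes a hypothetical model object with `train_one_epoch`, `error_rate`, `get_params`, and `set_params` helpers (these names are placeholders, not part of the original text); the point is only the logic of tracking the test/validation error and keeping the best weights.

```python
import copy

def train_with_early_stopping(model, train_set, val_set, max_epochs=200, patience=10):
    """Stop training when the error rate on held-out data stops improving.

    `model` is assumed to expose train_one_epoch(), error_rate(dataset),
    get_params(), and set_params() -- hypothetical helpers used for illustration.
    """
    best_error = float("inf")
    best_params = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        model.train_one_epoch(train_set)        # one pass of BP weight updates
        val_error = model.error_rate(val_set)   # error rate on the held-out set

        if val_error < best_error:              # still improving: remember these weights
            best_error = val_error
            best_params = copy.deepcopy(model.get_params())
            epochs_without_improvement = 0
        else:                                   # past the "N" point: count how long
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # held-out error keeps rising, so stop

    model.set_params(best_params)               # roll back to the best weights found
    return best_error
```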

3. Method to reduce overfitting: regularization term method

  • In the BP algorithm, the loss function used is:

$$E_d(w) = \sum_{k=1}^{M} (t_{kd} - o_{kd})^2$$

  • Add a regularization term $\lVert w \rVert_2^2$ to this loss function, so that it becomes:
    $$E_d(w) = \sum_{k=1}^{M} (t_{kd} - o_{kd})^2 + \lVert w \rVert_2^2$$
  • Here $\lVert w \rVert_2$ denotes the 2-norm of the weight vector w, and $\lVert w \rVert_2^2$ denotes the square of the 2-norm.
  • The 2-norm of w is obtained by squaring every weight $w_i$, summing the squares, and taking the square root. Since the square of the 2-norm is used here, the term is simply the sum of the squared weights. If $w_i\,(i = 1, 2, \ldots, N)$ denotes the i-th weight, then:
    $$\lVert w \rVert_2^2 = w_1^2 + w_2^2 + \cdots + w_N^2$$
  • Of course, regularization is not limited to the 2-norm; other norms can also be used. (A small sketch of computing this regularized loss follows.)
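A minimal NumPy sketch of the regularized loss, as referenced above. The scaling coefficient `lam` is an assumption on my part: the formula in the text adds the squared 2-norm directly, while in practice it is usually multiplied by a small hyperparameter.

```python
import numpy as np

def regularized_loss(t, o, weights, lam=0.01):
    """Sum-of-squared-errors loss plus an L2 penalty on the weights.

    t, o    : target and output vectors for one sample d (length M)
    weights : list of weight arrays holding all network weights w_i
    lam     : regularization strength (an assumed hyperparameter)
    """
    w = np.concatenate([np.ravel(wi) for wi in weights])
    data_term = np.sum((t - o) ** 2)   # E_d(w) = sum_k (t_kd - o_kd)^2
    reg_term = np.sum(w ** 2)          # ||w||_2^2 = w_1^2 + ... + w_N^2
    return data_term + lam * reg_term

# Example usage with made-up numbers:
t = np.array([1.0, 0.0])
o = np.array([0.8, 0.2])
ws = [np.array([[0.5, -0.3], [0.1, 0.7]])]
print(regularized_loss(t, o, ws))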

4. The role of regularization terms: reducing model complexity

  • Why can overfitting be avoided by adding a regularization term?
    • Adding a regularization term to the loss function is equivalent to minimizing the loss function while requiring the weight to be as small as possible, which is equivalent to limiting the range of change of the weight.
    • Take the curve fitting shown in the figure below as an example. In the general case, a curve-fitting function f(x) can be written in the following form:
      $$f(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_n x^n$$
    • The more $x^n$ terms f(x) contains and the larger n is, the more complex the curves f(x) can represent, the stronger its fitting ability, and the easier it is to cause overfitting.

[Figure: three fitting curves: a green straight line, a blue fifth-order curve, and a red second-order curve]

  • For example, among the three curves shown in the figure above, the green curve is a straight line of the form:
    $$f(x) = w_0 + w_1 x$$
  • It contains only the x term and can only represent a straight line, so it underfits. The blue curve has the form:
    $$f(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$$
    It contains 5 $x^n$ terms, so its expressive ability is relatively strong, which leads to overfitting. The red curve has the form:
    $$f(x) = w_0 + w_1 x + w_2 x^2$$
    It contains 2 $x^n$ terms, which may be just right for this problem, so it gives a better fit. In practice, however, it is hard to know in advance how many $x^n$ terms are appropriate, and there may well be too many of them. By adding a regularization term to the loss function, the weights w are kept as small as possible, which limits overfitting to a certain extent (a small numerical sketch follows this list). For example, the blue curve:
    $$f(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$$
    contains 5 $x^n$ terms, but if training ends with $w_3$, $w_4$, and $w_5$ all very small, then it is quite close to the red curve:
    $$f(x) = w_0 + w_1 x + w_2 x^2$$
  • A complex neural network generally has strong expressive ability; if no special measures are taken to constrain it, it can easily overfit.
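The sketch below makes the argument above concrete: it fits a degree-5 polynomial to a handful of noisy samples with and without an L2 penalty on the coefficients (ridge regression). The sample data and the penalty strength `lam` are invented for illustration; the point is only that the penalty shrinks the higher-order coefficients $w_3$, $w_4$, $w_5$ toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 6)                                      # 6 sample points
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 0.1, x.size)    # noisy quadratic

degree = 5
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ..., x^5

# Ordinary least squares: minimizes only the squared error, so with 6 points
# and 6 coefficients it interpolates every sample (like the blue curve).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge: minimizes squared error + lam * ||w||_2^2, which shrinks the
# high-order coefficients and behaves more like the red quadratic.
lam = 0.1
I = np.eye(degree + 1)
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print("OLS coefficients  :", np.round(w_ols, 3))
print("Ridge coefficients:", np.round(w_ridge, 3))
```

Running this, the ridge solution keeps $w_3$, $w_4$, $w_5$ small, illustrating how the regularization term limits the effective complexity of the fitting function.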

5. L2 (2-norm) regularization term

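  • As a brief note (a conventional formulation; the coefficient λ, the factor 1/2, and the learning rate η below are standard hyperparameter choices not specified earlier): with an L2 term the loss can be written as
    $$\tilde{E}_d(w) = E_d(w) + \frac{\lambda}{2} \lVert w \rVert_2^2$$
    and the gradient-descent weight update becomes
    $$w_i \leftarrow w_i - \eta \left( \frac{\partial E_d}{\partial w_i} + \lambda w_i \right) = (1 - \eta \lambda)\, w_i - \eta \frac{\partial E_d}{\partial w_i}$$
    so every update first shrinks each weight by the factor $(1 - \eta \lambda)$, which is why L2 regularization is also known as weight decay.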

6. L1 (1-norm) regularization term

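  • As a brief note (again a conventional formulation with a hyperparameter λ): the L1 regularization term is the sum of the absolute values of the weights,
    $$\lVert w \rVert_1 = |w_1| + |w_2| + \cdots + |w_N|$$
    giving the loss $\tilde{E}_d(w) = E_d(w) + \lambda \lVert w \rVert_1$. Its gradient with respect to a nonzero weight is $\lambda\,\mathrm{sign}(w_i)$, so each weight is pushed toward zero by a constant amount regardless of its size; compared with the L2 term, this tends to drive many weights exactly to zero and yields sparser networks.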

7. Method to reduce overfitting: Dropout

  • The so-called discarding method (dropout) randomly and temporarily deletes some neurons during training and trains only the remaining neurons. Which neurons are discarded is random and temporary: they are discarded only for the current weight update, and which neurons are discarded in the next update is chosen at random again. In other words, every weight update uses a fresh random discard. The figure below gives a schematic diagram of discarding; the neurons drawn with dotted lines have been temporarily discarded and can be regarded as temporarily deleted from the network. Discarding happens only during training; after training is completed, all neurons are used when the neural network is applied.
  • The more neurons a neural network contains, the stronger its expressive ability and the easier it is to overfit. A simple way to understand dropout is that, in the training stage, discarding neurons yields a simplified neural network with reduced expressive ability. Since the neurons discarded each time are different, training is in effect carried out on many different simplified networks. When the network is used, all neurons participate, which is equivalent to integrating these simplified networks together; this both reduces overfitting and preserves the performance of the neural network. An example illustrates the rationale. Suppose 10 students form a group to do an experiment. If all 10 do it together every time, two or three top students will likely play the major role and the other students will not be trained sufficiently. If instead a "discarding mechanism" is introduced and 5 of the 10 students are randomly selected for each experiment, more students get fully trained. When the 10 students are later grouped together to do research, since each has been fully trained, the group's research capability is stronger.

[Figure: schematic diagram of dropout; neurons drawn with dotted lines are temporarily discarded]

  • Discarding is performed at every layer of the neural network except the input layer and output layer. The discard ratio is about 50%, that is, in each layer roughly 50% of the neurons are dropped (a minimal sketch follows).
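A minimal sketch of the discarding step for one layer. It uses the common "inverted dropout" convention (an assumption here, not stated in the text): surviving activations are scaled by 1/(1 − p) during training so that no rescaling is needed at test time, when all neurons are used.

```python
import numpy as np

def dropout_forward(activations, p=0.5, training=True, rng=None):
    """Apply dropout to one layer's activations.

    p        : probability of discarding (zeroing) each neuron, ~50% as above
    training : dropout is applied only during training; at test time all
               neurons are used and the activations are returned unchanged
    """
    if not training:
        return activations
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= p   # which neurons survive this update
    return activations * keep_mask / (1.0 - p)       # inverted-dropout scaling

# Example: a hidden layer's output for a mini-batch of 2 samples (made-up numbers)
h = np.array([[0.2, 1.3, 0.7, 0.5],
              [0.9, 0.1, 1.1, 0.4]])
print(dropout_forward(h, p=0.5, training=True))   # a fresh random mask each call
print(dropout_forward(h, p=0.5, training=False))  # unchanged at test time
```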

8. Methods to reduce overfitting: data enhancement method

  • In curve fitting, if there is enough data, the risk of overfitting becomes smaller, because enough data limits drastic changes of the fitting function and makes it closer to the original function.

9. How to get more data?

  • In addition to collecting as much data as possible, new data can be generated from existing data. For example, if we want to recognize cats and dogs and already have some pictures of cats and dogs, then each picture can be transformed into many pictures through rotation, scaling, cropping, color changes, and so on, increasing the number of training samples by dozens or even hundreds of times. Experiments show that data augmentation can effectively improve the performance of neural networks (a small sketch follows).
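A minimal NumPy sketch of this idea, using a few simple array-level transforms (horizontal flip, 90° rotation, border crop, brightness change). The image below is a random array standing in for a real photo of a cat or dog; a real system would typically use an image library's augmentation utilities instead.

```python
import numpy as np

def augment(image, rng=None):
    """Generate several new training images from one H x W x C image array."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    mh, mw = h // 8, w // 8                          # margins to crop away
    variants = [
        np.fliplr(image),                            # mirror left-right
        np.rot90(image),                             # rotate by 90 degrees
        image[mh:h - mh, mw:w - mw],                 # crop away the borders
        np.clip(image * rng.uniform(0.7, 1.3), 0, 1) # change the brightness
    ]
    return variants

# Example: a fake 32x32 RGB "photo" expands into 4 extra training samples
img = np.random.default_rng(0).random((32, 32, 3))
for v in augment(img):
    print(v.shape)
```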


10. Summary


  • Because of noise and other factors in the data, a smaller loss function is not always better during neural network training: once training reaches a certain point, further reducing the error on the training set increases the error on the test set. This phenomenon is called overfitting.
  • There are three ways to reduce overfitting:

(1) Regularization term method. A regularization term is added to the loss function to keep the weights as small as possible and thereby suppress overfitting.

(2) Dropout (discarding) method. During training, a portion of the neurons are randomly and temporarily discarded, so each update effectively trains only one sub-network. The result is equivalent to training multiple sub-networks and integrating them together for use; every part of the network is fully trained, which improves the overall performance of the neural network.

(3) Data augmentation method. Generally speaking, the more training data, the better the performance of the trained neural network. When there is not enough training data, it can be increased by processing existing data to generate new data; this is called data augmentation. For example, for image data, one picture can be transformed into many pictures through rotation, scaling, cropping, color changes, and so on, increasing the number of training samples by dozens or hundreds of times.


Origin blog.csdn.net/sgsgkxkx/article/details/133281146