Optimization of Fully Connected Neural Networks

Foreword

  After working through the fully connected network from concept to structure, we finally need to train it on input data to obtain good results. Training raises several issues that require continuous optimization and monitoring: the non-linearity introduced by activation functions, vanishing and exploding gradients, batch normalization, the choice of loss function, overfitting and dropout, model regularization, and so on. This article summarizes these points.

1. Gradient disappearance

  Vanishing gradients: backpropagation multiplies local gradients along the chain, so if those local gradients are too small their product shrinks toward zero. The updates to the weights W become negligible, i.e., after each gradient step the initial weights are basically unchanged.
  How should we choose an activation function to address vanishing gradients?
  ① The maximum value of the derivative of the sigmoid activation function is 0.25, which is very small. Suppose there are 100 hidden layers and each sigmoid derivative evaluates to 0.1; then backpropagation multiplies in a factor of 0.1 per layer, so the gradient reaching the earliest layers is on the order of 0.1 to the 100th power, essentially zero. The tanh activation function has the same problem, although tanh converges faster than sigmoid.
  ② The slope of the ReLU function is 1 for inputs greater than 0, which solves the vanishing-gradient problem, because 1 to the k-th power is still 1. But for inputs less than 0 the slope is 0, so the corresponding weights stop updating.
  Advantage: ReLU provides unilateral inhibition; since the network is fully connected and prone to overfitting, connections that stop updating can alleviate overfitting.
  Disadvantage: such a neuron can no longer transmit information, because its W is frozen at a fixed value; the neuron "dies" and some connections are lost.
  ③ Leaky ReLU: for inputs less than 0 the weights keep updating, but the function is not differentiable at 0 (the derivative is undefined there).
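The vanishing-gradient arithmetic above can be checked numerically. A minimal sketch (the depth of 100 and the per-layer factor of 0.25 follow the figures in the text; everything else is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Backpropagation multiplies one local gradient per layer. Even in the
# best case for sigmoid (0.25 per layer), 100 layers collapse the
# gradient toward zero; ReLU's slope of 1 (for x > 0) does not.
depth = 100
sigmoid_product = sigmoid_grad(0.0) ** depth  # 0.25 ** 100
relu_product = 1.0 ** depth                   # still 1.0
```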

2. Gradient Explosion

  Exploding gradients are due to the multiplicative nature of the chain rule.
  Gradient explosion: at a cliff in the loss surface, the gradient multiplied by the learning rate yields a very large step, which "flies" out of the reasonable region, and the algorithm ultimately fails to converge.
  Solution :
  1. Set the learning rate value to a smaller value.
  2. Gradient clipping.

2.1 Fixed threshold clipping

  Set a threshold; when a gradient component falls below the negative threshold or rises above the positive threshold, the updated gradient is clipped to that threshold, as the figure shows:
  However, it is difficult to find a suitable threshold.

2.2 Clipping by the norm of the gradient

  Clip by the L2 norm of the gradient, i.e., the square root of the sum of squares of all parameters' partial derivatives.
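A minimal sketch of this norm-based clipping, assuming the gradients are a list of NumPy arrays and `max_norm` is a hypothetical threshold:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their joint L2 norm is at most max_norm."""
    # L2 norm: square root of the sum of squared partial derivatives.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm  # shrink every gradient uniformly
        grads = [g * scale for g in grads]
    return grads
```

Unlike a fixed per-component threshold, rescaling by the norm preserves the direction of the gradient and only limits its length.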

3. Loss function

3.1 Softmax

Softmax first applies the exponential function e^s to each category's score, then normalizes (each exponentiated score is divided by the sum over all categories), and finally outputs a probability value for each category.
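This exponentiate-then-normalize step can be sketched directly (the scores are made up for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max score before exponentiating is a standard
    # numerical-stability trick; it leaves the probabilities unchanged.
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

probs = softmax(np.array([3.0, 1.0, 0.2]))  # one probability per category
```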

3.2 Cross-entropy loss

  The concept of entropy: intuitively, entropy reflects the amount of information. When an event is completely certain, it carries no information; when every outcome of the event is equally probable, entropy is at its maximum.
  How do we measure the difference between two random distributions?
  Only when the true distribution is one-hot are cross-entropy and relative entropy (KL divergence) equal; when the true distribution is not one-hot, relative entropy should be used to compare the difference between the two.
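The relationship between the two can be sketched numerically (the distributions below are made up; `eps` guards against log(0)):

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def cross_entropy(p_true, p_pred, eps=1e-12):
    return -np.sum(p_true * np.log(p_pred + eps))

def relative_entropy(p_true, p_pred):
    # KL divergence = cross-entropy minus the entropy of the true distribution
    return cross_entropy(p_true, p_pred) - entropy(p_true)

one_hot = np.array([0.0, 1.0, 0.0])  # entropy 0, so CE equals KL
soft = np.array([0.3, 0.4, 0.3])     # entropy > 0, so CE and KL differ
pred = np.array([0.1, 0.7, 0.2])
```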

3.3 Cross-entropy loss and multi-class support vector machine loss

  ① Their computation logic differs: the multiclass support vector machine loss focuses only on whether the correct category's score is higher than every other category's score, while the cross-entropy loss measures the gap between each category's predicted probability and the true distribution.
  ② Cross-entropy loss requires not only that the true category's score be larger than the others, but also that the true category's probability be driven higher than the others.
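A sketch contrasting the two losses on hypothetical scores (margin fixed at 1, which is a common choice):

```python
import numpy as np

def multiclass_svm_loss(scores, y, margin=1.0):
    # Hinge loss: only cares whether the correct score beats every
    # other score by `margin`; beyond that it is exactly 0.
    diffs = scores - scores[y] + margin
    diffs[y] = 0.0
    return np.sum(np.maximum(0.0, diffs))

def cross_entropy_loss(scores, y):
    # Softmax + negative log-likelihood: stays positive until the
    # correct class's probability reaches 1, so it keeps pushing.
    e = np.exp(scores - np.max(scores))
    return -np.log(e[y] / np.sum(e))

scores = np.array([10.0, 2.0, 1.0])  # correct class 0 wins by a wide margin
```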

4. Gradient descent optimization

4.1 Momentum method

  The momentum method reduces oscillation by accumulating velocity, and it can also find better solutions:
  Phenomenon: the loss surface often has bad local minima or saddle points.
  Gradient descent: at a local minimum or saddle point the gradient is 0, so the algorithm cannot pass through.
  Advantage of the momentum method: thanks to the accumulated momentum, the algorithm can break out of local minima and saddle points and find a better solution.
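One common formulation of the momentum update can be sketched as follows (some texts scale the gradient by 1 − β instead; the numbers below are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    # Velocity accumulates past gradients; beta controls the memory.
    v = beta * v + grad
    theta = theta - lr * v
    return theta, v

# At a saddle point or flat local minimum the current gradient is 0,
# but the accumulated velocity still carries the parameters forward.
theta, v = 1.0, 0.0
theta, v = momentum_step(theta, v, grad=2.0)  # builds velocity
theta, v = momentum_step(theta, v, grad=0.0)  # grad is 0, yet theta still moves
```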

4.2 Adaptive gradient method

  The adaptive gradient method reduces oscillation by shrinking the step size along the oscillating direction and enlarging it along the flat direction, accelerating progress toward the valley bottom.
  How do we distinguish the oscillating direction from the flat direction?
  Answer: the direction in which the squared gradient magnitude is larger is the oscillating direction; the direction in which it is smaller is the flat direction.
  But AdaGrad has a drawback: the accumulated squared-gradient term r only grows as iterations proceed, so the step size shrinks toward zero until the parameters can barely move and the adaptation loses its meaning. RMSProp, an improvement on AdaGrad, multiplies the accumulator by a decay coefficient less than 1.
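The difference between the two accumulators can be sketched side by side (constant gradient of 1 used purely for illustration):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.1, eps=1e-8):
    r = r + grad ** 2                    # r only ever grows...
    return theta - lr * grad / (np.sqrt(r) + eps), r

def rmsprop_step(theta, r, grad, lr=0.1, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * grad ** 2  # ...so RMSProp decays it instead
    return theta - lr * grad / (np.sqrt(r) + eps), r

# With a constant gradient, AdaGrad's accumulator grows without bound and
# its step size shrinks toward zero; RMSProp's accumulator levels off.
ra = rr = 0.0
for _ in range(1000):
    _, ra = adagrad_step(0.0, ra, 1.0)
    _, rr = rmsprop_step(0.0, rr, 1.0)
```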

4.3 Adam

  Adam uses the momentum and adaptive-gradient ideas at the same time, and its bias-correction step greatly alleviates the cold-start problem at the beginning of the algorithm.
  When the weights are first updated, using v directly gives a step of roughly 0.1g (with β = 0.9), which is too small, so the weights update slowly. Using the corrected ṽ = v / (1 − β^t) gives a step of roughly g. After more than about 10 iterations, β^t becomes very small and the correction loses its effect; at that point ṽ is approximately equal to v.
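A sketch of one Adam step with the commonly used default hyperparameters, showing the bias correction described above:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # momentum-style first moment
    v = b2 * v + (1 - b2) * grad ** 2  # adaptive second moment
    m_hat = m / (1 - b1 ** t)          # bias correction: at t = 1,
    v_hat = v / (1 - b2 ** t)          # m = 0.1*grad but m_hat = grad
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```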

5. Batch Normalization

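The figure for this section is unavailable; as a minimal reference sketch, the standard training-time batch-normalization transform (per-feature normalization over the batch, followed by a learnable rescale and shift) looks like:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the batch,
    # then rescale/shift with the learnable parameters gamma and beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```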

6. Overfitting and underfitting

  Overfitting : It refers to the phenomenon that the model selected during learning contains too many parameters, so that the model predicts the known data well, but predicts the unknown data poorly. In this case, the model may just memorize the training set data instead of learning the data features.
  Underfitting : The model lacks the capacity to describe the data, so it cannot learn the patterns in the data well. Underfitting is usually caused by an overly simple model.

7. Dealing with overfitting

  The optimal solution - get more training data.
  Suboptimal solution - adjust the amount of information that the model allows to store or impose constraints on the information that the model allows to store. This type of method is called regularization.

7.1 L2 regularization

  L2 regularization makes the model's decision boundary smoother and avoids producing overly complex boundaries.

7.2 Weight Decay

  Weight decay reduces model capacity by limiting the range of parameter values.
  The smaller θ is, the smaller the regularization term. Starting from the constrained problem min_θ L(θ) subject to ‖θ‖² ≤ r², solving with the Lagrange multiplier method gives the construction:

  min_θ L(θ) + λ(‖θ‖² − r²)

  Since for λ and θ, knowing one determines the other, this is equivalent to:

  min_θ L(θ) + λ‖θ‖²

  As λ → ∞, w* → 0; that is, λ controls the size of θ.

  Writing the regularizer as (λ/2)‖θ‖² (absorbing the factor of 2 into λ), the gradient calculation and parameter update become:

  θ ← θ − η(∇L(θ) + λθ) = (1 − λη)θ − η∇L(θ)

  When λη < 1, the weights shrink at every step, which achieves the weight decay.
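This update can be sketched directly (λ, η, and the starting weight below are made up for illustration):

```python
import numpy as np

def weight_decay_step(theta, grad, lr=0.1, lam=0.1):
    # Gradient of L(theta) + (lam/2)*||theta||^2 yields the factor
    # (1 - lam*lr) on the weights; with lam*lr < 1 they decay each step.
    return (1 - lam * lr) * theta - lr * grad

theta = 10.0
for _ in range(50):
    theta = weight_decay_step(theta, grad=0.0)  # no data gradient: pure decay
```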

7.3 Random deactivation (dropout method)

  Random deactivation : Let the neurons in the hidden layer not be activated with a certain probability.
  Implementation method : During the training process, using Dropout for a certain layer is to randomly discard some outputs of the layer (the output value is set to 0), and these discarded neurons seem to be deleted by the network.
  Dropout ratio : the proportion of features that are set to 0, usually in the range 0.2~0.5.
  Dropout is only enabled during training, when the parameters are being adjusted; it is not used at inference time.
  Why can random deactivation prevent overfitting?
  ① Random deactivation reduces the network parameters involved in the calculation each time the gradient is updated, reducing the model capacity, so it can prevent overfitting.
  ② Random deactivation encourages weight dispersion, that is, it plays a role of regularization, thereby preventing overfitting.
  ③Random deactivation is also equivalent to model integration. Model integration generally improves accuracy and prevents overfitting.
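The implementation described above can be sketched as "inverted" dropout, one common variant that scales the surviving outputs at training time so inference needs no change:

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True):
    # Zero each output with probability p and scale the survivors by
    # 1/(1-p), so the expected activation matches inference.
    if not training:
        return x                      # dropout is disabled at inference
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask
```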

  The above covers the points that need optimizing in a fully connected neural network. The figures come from the Computer Vision and Deep Learning course of teacher Lu Peng at Beijing University of Posts and Telecommunications.

Origin blog.csdn.net/m0_58807719/article/details/128168384