Why Neural Network Training Can Be Ineffective: An Analysis

Table of contents

1. Local minima and saddle points

2. Batch and momentum

3. Automatically adjusting the learning rate

4. The loss function may also have an impact

5. Batch Normalization


1. Local minima and saddle points

        When the loss stops decreasing, the gradient may be close to 0. In that case the parameters may be at either a local minimum or a saddle point; both are collectively referred to as critical points.

        To determine which case it is, compute the Hessian of the loss at that point:
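        A standard way to see this (the notation below is chosen here for illustration, not taken from the original post): around a point θ' the loss can be approximated by a second-order Taylor expansion,

$$ L(\theta) \approx L(\theta') + (\theta-\theta')^{\top} g + \frac{1}{2}(\theta-\theta')^{\top} H (\theta-\theta'), \qquad g = \nabla L(\theta') $$

        At a critical point g ≈ 0, so the eigenvalues of the Hessian H decide the case: if they are all positive, the point is a local minimum; if all negative, a local maximum; if they have mixed signs, it is a saddle point, and moving along an eigenvector with a negative eigenvalue still decreases the loss.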

  

 2. Batch and momentum

        From a physical point of view, if a ball rolling down the error surface reaches a local minimum but carries enough momentum, it can roll over the small slope and escape.

        In vanilla gradient descent, each time the gradient is computed, the parameters are updated by moving in the opposite direction of the gradient.

        With momentum, the update direction can be understood as the negative gradient plus the direction of the previous movement.

        Another interpretation is that with momentum the update direction takes into account not only the current gradient but a decaying sum of all past gradients.
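        A minimal sketch of the two update rules in Python, assuming a generic gradient function grad(theta), a learning rate eta, and a momentum weight lam (all names and values here are illustrative, not from the original post):

```python
# Vanilla gradient descent vs. gradient descent with momentum (illustrative sketch).

def gradient_descent(theta, grad, eta=0.01, steps=100):
    for _ in range(steps):
        theta = theta - eta * grad(theta)      # move opposite to the current gradient
    return theta

def gradient_descent_momentum(theta, grad, eta=0.01, lam=0.9, steps=100):
    m = 0.0                                    # the previous movement
    for _ in range(steps):
        m = lam * m - eta * grad(theta)        # previous movement plus negative gradient
        theta = theta + m                      # equivalently, a decaying sum of all past gradients
    return theta

# Example: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
print(gradient_descent(0.0, lambda x: 2 * (x - 3)))
print(gradient_descent_momentum(0.0, lambda x: 2 * (x - 3)))
```

        With lam = 0 the momentum version reduces to vanilla gradient descent; with lam > 0 the accumulated movement can carry the parameters over small bumps such as shallow local minima.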

 3. Automatically adjusting the learning rate

        When the loss stops decreasing, we said earlier that the parameters may be at a critical point, but the gradient may in fact not be small at all: with a fixed learning rate, the update can keep bouncing back and forth across a steep valley, so the loss stalls even though the gradient stays large.

        The learning rate should therefore be customized for each parameter; the common approaches are listed below (a small sketch follows the list):

(1) The most common way to adapt the learning rate is to divide it by the root mean square of the past gradients, which is what Adagrad does.

(2) RMSProp lets you choose how much weight the current gradient gets relative to the past ones.

(3) Adam: RMSProp + momentum.

(4) Learning rate scheduling: let the learning rate η itself change over time.
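        The sketch below shows the per-parameter step size for a single scalar parameter; the names g, state, eta, alpha and eps are illustrative choices, not fixed by the original post:

```python
import math

def adagrad_step(theta, g, state, eta=0.1, eps=1e-8):
    # Adagrad: divide the learning rate by the root mean square of all past gradients.
    state["sum_sq"] = state.get("sum_sq", 0.0) + g * g
    state["count"] = state.get("count", 0) + 1
    sigma = math.sqrt(state["sum_sq"] / state["count"])
    return theta - eta / (sigma + eps) * g

def rmsprop_step(theta, g, state, eta=0.01, alpha=0.9, eps=1e-8):
    # RMSProp: an exponentially weighted average, so alpha controls how much the
    # current gradient matters compared with the accumulated history.
    state["avg_sq"] = alpha * state.get("avg_sq", 0.0) + (1 - alpha) * g * g
    sigma = math.sqrt(state["avg_sq"])
    return theta - eta / (sigma + eps) * g
```

        Adam combines the RMSProp-style denominator with a momentum-style numerator; in PyTorch it is available as torch.optim.Adam, and learning rate scheduling can be layered on top with the classes in torch.optim.lr_scheduler.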

4. The loss function may also have an impact

        In classification, a softmax is usually applied to the network's outputs to turn them into probabilities:
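        Concretely, the softmax maps the raw outputs (logits) $y_i$ to

$$ y_i' = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)} $$

so that every $y_i'$ lies between 0 and 1 and they sum to 1.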

        For binary classification, a sigmoid is more commonly used instead, but the two approaches in fact give the same result, as the short derivation below shows.
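        To see the equivalence, write the two-class softmax for logits $y_1, y_2$:

$$ \frac{e^{y_1}}{e^{y_1} + e^{y_2}} = \frac{1}{1 + e^{-(y_1 - y_2)}} = \sigma(y_1 - y_2) $$

so a softmax over two classes is just a sigmoid applied to the difference of the two logits.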

        As for the loss function, cross-entropy is the most common choice for classification. In PyTorch, CrossEntropyLoss already includes the softmax, so the two are bound together. Comparing the two error surfaces, when the loss is large MSE is very flat, so gradient descent gets stuck and cannot reach the region where the loss is small, whereas cross-entropy still provides a useful gradient all the way down.
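        A minimal PyTorch sketch of this point (the tensor values are made up for illustration): CrossEntropyLoss expects raw logits and applies log-softmax internally, so no softmax layer should be added before it.

```python
import torch
import torch.nn as nn

# nn.CrossEntropyLoss = log-softmax + negative log-likelihood, so it takes raw logits.
logits = torch.tensor([[2.0, -1.0, 0.5]])   # one sample, three classes (illustrative values)
target = torch.tensor([0])                  # index of the correct class

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)            # no explicit softmax on `logits`

# The same value, spelled out by hand:
manual = -torch.log_softmax(logits, dim=1)[0, target[0]]
print(loss.item(), manual.item())           # the two numbers match
```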

 5. Batch Normalization

        Ideally, different parameters should influence the loss on a similar scale, so that the error surface is not much steeper in one direction than in another.

        The way to achieve this is feature normalization:

        After normalization, each feature dimension has a mean of 0 and a variance of 1; in general, feature normalization makes gradient descent converge faster. The outputs of the intermediate layers should also be normalized, and in practice these statistics are computed over a batch, which is what batch normalization does. Other well-known normalization variants include layer normalization, instance normalization, and group normalization.
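        A short PyTorch sketch of both steps; the batch size, feature dimensions and layer sizes below are illustrative choices, not from the original post:

```python
import torch
import torch.nn as nn

# Feature normalization: standardise each input dimension to mean 0 and variance 1.
x = torch.randn(64, 16) * 5 + 3                         # 64 samples, 16 features (illustrative)
x_norm = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)
print(x_norm.mean(dim=0).abs().max(), x_norm.std(dim=0).mean())   # close to 0 and 1

# Inside the network, nn.BatchNorm1d applies the same idea to a layer's outputs,
# with statistics computed over the current batch plus a learnable scale and shift.
net = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),     # normalizes the 32 hidden outputs over the batch
    nn.ReLU(),
    nn.Linear(32, 10),
)
logits = net(x_norm)
print(logits.shape)         # torch.Size([64, 10])
```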

