Deep Learning (5) - Introduction to the Gradient Descent Algorithm and Its Optimizations

     A solid grasp of the fundamentals determines how far your research can go. When we first encounter deep learning, we usually rely on other people's summaries. This is a good way to get started quickly, but it has a major drawback: the understanding it gives is not thorough, which leaves us confused when it comes to algorithm optimization. I began this exploration of the essentials of deep learning with the idea of summarizing what I know, and I hope it helps others. If anything in this article is unclear, I hope fellow deep learning researchers will point it out, and I will work to improve the article.

1. Introduction to Gradient Descent Algorithm

     The purpose of the loss function introduced earlier is to optimize the network by continuously adjusting its parameters so as to reduce the loss. Clearly, this is an extremum-finding problem. Finding the extremum of a single-variable function is simple: take the derivative, set it to zero, and check the candidates. For a function of several variables we need partial derivatives, and if the function is not convex we face the problem of multiple extrema. For a high-dimensional, non-convex loss function such as the one in deep learning, finding the minimum analytically is therefore impractical. In addition, deep learning requires a large number of training samples, and how many samples participate in each update gives rise to different gradient descent methods with different behavior. First, let us introduce the most commonly used variants:

     1. Global gradient descent (batch gradient descent)

           Each iteration computes the gradient over the entire training set and then updates the model once.

     for i in range(epochs):
          # compute the gradient of the loss over the entire training set
          params_grad = evaluate_gradient(loss_function, data, params)
          # step the parameters against the gradient direction
          params = params - learning_rate * params_grad

The above is the code implementation. We can now summarize the advantages and disadvantages of global gradient descent. Since all of the data participates in every iteration, each update is guaranteed to move in the correct direction, so the function converges to an extremum (the minimum if the loss is convex, possibly a local minimum if it is non-convex). However, each iteration can take a long time, and training on all of the data at once can exhaust memory.

   2. Stochastic gradient descent (SGD)

         Stochastic gradient descent randomly selects a single sample from the training set for each update.

        Advantages and disadvantages: One disadvantage of SGD is that the loss value fluctuates heavily, and there is no guarantee that each individual update moves in the correct direction. But the fluctuation also has an upside: it may let the function jump out of a local optimum and land in a better one, sometimes even near the global optimum, giving a better final result. On the other hand, because the fluctuations increase the number of iterations needed, convergence is slower; it will nevertheless still converge to a local or global optimum in the end.
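        As a minimal sketch in the same pseudocode style as the snippet above (evaluate_gradient, data, params, learning_rate and epochs are the same assumed names; np is NumPy), SGD reshuffles the data every epoch and updates on one example at a time:

     import numpy as np

     for i in range(epochs):
          np.random.shuffle(data)   # reshuffle so each epoch sees a new order
          for example in data:
               # gradient computed from a single training example
               params_grad = evaluate_gradient(loss_function, example, params)
               params = params - learning_rate * params_grad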

   3. Mini-batch gradient descent

      Mini-batch gradient descent strikes a balance between global gradient descent and stochastic gradient descent. First we shuffle the data and split it into random groups, then feed one small batch of data into each training step.

    Advantages and disadvantages: Compared with stochastic gradient descent, mini-batch gradient descent reduces the volatility of convergence and makes updates more stable. Compared with global gradient descent, mini-batch improves learning speed without running into memory limitations. The batch size itself is a design choice; an appropriate number of samples can be selected after several trials.
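    A sketch under the same conventions as above (get_batches is a hypothetical helper that yields consecutive chunks of the shuffled data; the batch size of 50 is only an example):

     for i in range(epochs):
          np.random.shuffle(data)
          for batch in get_batches(data, batch_size=50):
               # gradient computed from one small batch of samples
               params_grad = evaluate_gradient(loss_function, batch, params)
               params = params - learning_rate * params_grad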

2. Problems and challenges of gradient descent algorithm

  1. Choosing the learning rate is particularly difficult. If the learning rate is too small, convergence is slow; if it is too large, convergence suffers, that is, the loss keeps oscillating around the extremum but cannot descend to it.

  2. Learning rate scheduling, that is, setting a strategy that changes the learning rate over the course of training, helps, but the thresholds for the schedule must still be fixed in advance.

  3. Every parameter in the model is updated with the same learning rate. However, because data features can be sparse, and the statistical characteristics and value ranges of different features differ, different parameters really need different learning rates.

  4. For non-convex functions, gradient descent easily falls into local optima. An even more serious problem arises not at local optima but at saddle points: points where the gradient is zero yet which are neither maxima nor minima, curving upward in some directions and downward in others (for example, the origin of f(x, y) = x² − y²).

3. Optimization methods to alleviate the above problems

  Below we introduce the gradient descent optimization algorithms:

   1. Momentum

    For functions with many local extrema, gradient descent wanders in those regions and convergence slows down. In this case we add a momentum term. The principle is that when the direction of the gradient at this moment matches the direction at the previous moment, the step in that direction is amplified; when it is opposite to the previous moment, the step in that direction is damped. This prevents drifting too far in a wrong direction and makes gradient descent faster overall.
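    The standard momentum update is $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$ followed by $\theta = \theta - v_t$, with the momentum coefficient $\gamma$ typically around 0.9. A minimal sketch in the same pseudocode style (gamma is an assumed constant; the velocity v starts at zero):

     v = 0
     for i in range(epochs):
          params_grad = evaluate_gradient(loss_function, data, params)
          # consistent gradient directions build up speed; reversals damp it
          v = gamma * v + learning_rate * params_grad
          params = params - v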

  2. Adagrad

  Adagrad is a gradient-based optimization algorithm that adapts the learning rate to each parameter individually: it makes large updates for parameters tied to infrequent (sparse) features and smaller updates for parameters tied to frequent features. The algorithm is therefore well suited to sparse feature data and can improve the robustness of the network. A typical application is training word vectors, where infrequently occurring words need larger updates than frequent ones.

The equation is as follows:

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon}\, g_{t,i}$$

  where $g_{t,i}$ is the gradient of the loss with respect to parameter $i$ at step $t$, and the diagonal entry $G_{t,ii}$ accumulates the squares of all past gradients of parameter $i$. It can be seen from the equation that, unlike momentum, which accumulates the raw first-order gradients, Adagrad accumulates a second-order quantity, the squared gradients, and uses it to scale each parameter's learning rate individually.
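  A minimal sketch under the same conventions (eps is a small constant such as 1e-8 for numerical stability; cache plays the role of the diagonal of $G_t$):

     cache = 0
     for i in range(epochs):
          params_grad = evaluate_gradient(loss_function, data, params)
          # accumulate the squared gradient of every parameter
          cache = cache + params_grad ** 2
          # parameters with large accumulated gradients take smaller steps
          params = params - learning_rate * params_grad / (np.sqrt(cache) + eps)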

  3. Adam

  Adaptive Moment Estimation (Adam) is also an adaptive learning rate method; like momentum, it keeps an exponentially decaying average of past gradients.

    The equations, in effect, combine the first-order moment (as in momentum) with the second-order moment (as in Adagrad):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

    where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates of the mean and uncentered variance of the gradients.
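    A minimal sketch with the commonly cited defaults assumed (beta1 = 0.9, beta2 = 0.999, eps = 1e-8):

     m, v = 0, 0
     for t in range(1, epochs + 1):
          params_grad = evaluate_gradient(loss_function, data, params)
          m = beta1 * m + (1 - beta1) * params_grad        # first moment (mean)
          v = beta2 * v + (1 - beta2) * params_grad ** 2   # second moment (variance)
          m_hat = m / (1 - beta1 ** t)                     # bias correction
          v_hat = v / (1 - beta2 ** t)
          params = params - learning_rate * m_hat / (np.sqrt(v_hat) + eps)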

4. How to choose the SGD optimizer

    If your data features are sparse, you are better off using an adaptive learning rate method (Adagrad, Adadelta, RMSprop, Adam), because you do not need to tune the learning rate manually during training. That said, many recent papers still use plain SGD with a simple learning rate annealing schedule. The evidence shows that SGD can reach a minimum, but it relies on a robust initialization and a well-tuned annealing strategy, may take longer than the adaptive optimizers, and is more easily trapped at local minima and saddle points. Therefore, when fast convergence is the goal, an adaptive learning rate variant of SGD should be chosen.

5. More SGD optimization strategies

    The aforementioned Momentum, Adagrad and Adam all work by adjusting the learning rate and steering the gradient in the right direction. The following optimization strategies complement those methods and make training more stable.

   1. Randomly shuffle the samples, or sort them according to some meaningful order.

     Shuffling the order of the training set ensures there is no bias in the learning process. In some cases, however, we solve a problem step by step, and a meaningful ordering of the training set (curriculum learning) is more conducive to convergence and to SGD performance.

    2. Batch normalization (BN algorithm)

      To help the network converge and to improve training speed, we usually normalize the input data to zero mean and unit variance at initialization. As training proceeds, however, the parameters are updated to different degrees, and the activations lose that zero-mean, unit-variance distribution. This shift in the data slows convergence and easily leaves too many neurons inactive, which in turn forces excessive parameter adjustment and hurts convergence. Batch normalization restores the normalization inside the network for every mini-batch.
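      A minimal sketch of the batch-norm forward pass in training mode (all names here are assumptions: x is an (N, D) NumPy array holding one mini-batch of activations, and gamma and beta are learnable scale and shift parameters):

     mu = x.mean(axis=0)                     # per-feature mean over the batch
     var = x.var(axis=0)                     # per-feature variance over the batch
     x_hat = (x - mu) / np.sqrt(var + eps)   # restore zero mean, unit variance
     out = gamma * x_hat + beta              # learnable scale and shift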

    3. Early stopping (ending training early)

     If the loss no longer decreases significantly over multiple iterations, training should be terminated early. Likewise, if overfitting begins to increase, training must be stopped early.
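     A minimal early-stopping sketch (train_one_epoch and evaluate are hypothetical helpers; patience is the number of epochs to wait without improvement on the validation loss):

     best, wait, patience = float("inf"), 0, 10
     for i in range(epochs):
          params = train_one_epoch(params)                       # hypothetical
          val_loss = evaluate(loss_function, val_data, params)   # hypothetical
          if val_loss < best:
               best, wait = val_loss, 0   # improvement: reset the counter
          else:
               wait += 1
               if wait >= patience:
                    break                 # no improvement for `patience` epochs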

  4. Gradient noise

   Adding Gaussian-distributed random noise to the gradient at each iteration can increase the robustness of the network and is well suited to training deep networks, because the added noise makes it more likely to skip over local extrema and reach better convergence.
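   A minimal sketch, assuming the annealed noise schedule $\sigma_t^2 = \eta / (1 + t)^{0.55}$ from Neelakantan et al. (eta is an assumed noise scale; the exponent 0.55 is the value reported in that paper):

     eta = 0.3   # assumed noise scale
     for t in range(epochs):
          params_grad = evaluate_gradient(loss_function, data, params)
          sigma = np.sqrt(eta / (1 + t) ** 0.55)   # annealed standard deviation
          noise = np.random.normal(0.0, sigma, np.shape(params_grad))
          params = params - learning_rate * (params_grad + noise)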

 

    The above is a summary of the knowledge related to gradient descent. If any expressions are unclear or mistaken, please correct me. Next, we will introduce in detail the principle of BN (batch normalization) mentioned in this article and its role in deep networks.
