Summary of gradient descent algorithms: full gradient descent, stochastic gradient descent, mini-batch gradient descent, and stochastic average gradient descent

1. Common gradient descent algorithms

  • Full gradient descent (FGD)
  • Stochastic gradient descent (SGD)
  • Stochastic average gradient descent (SAGD)
  • Mini-batch gradient descent (MGD)
  • Difference: they differ in how the training samples are used
  • In common: all of them adjust the weight vector by computing a gradient for each weight and using it to update the weights, so that the objective function is minimized as much as possible

2. Full Gradient Descent Algorithm (FGD)

  • Computes the error of every sample in the training set, sums them, and takes the average as the objective function
  • The weight vector is moved in the direction opposite to its gradient, so that the current objective function decreases the most
  • Because every update requires computing gradients over the entire dataset, FGD is slow and cannot handle datasets that exceed memory capacity
  • FGD also cannot update the model online, i.e. new samples cannot be added while it is running
  • Each update computes the gradient of the loss function with respect to the parameters θ over the entire training set:

\theta=\theta-\eta \cdot \nabla_{\theta} J(\theta)
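As a minimal sketch of one FGD step (assuming a linear regression model with mean-squared-error loss and NumPy; the names X, y, theta, and eta are illustrative, not from the original post):

```python
import numpy as np

def fgd_step(theta, X, y, eta=0.01):
    """One full-gradient step for linear regression with MSE loss.

    theta : (d,)  parameter vector
    X     : (n, d) full training matrix
    y     : (n,)  full training labels
    eta   : learning rate
    """
    n = X.shape[0]
    residual = X @ theta - y             # predictions minus labels, shape (n,)
    grad = (2.0 / n) * (X.T @ residual)  # gradient averaged over ALL samples
    return theta - eta * grad            # move against the gradient
```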

3. Stochastic gradient descent algorithm (SGD)

Since FGD must compute the errors of all samples at every iteration to update the weights, and practical problems often involve hundreds of millions of training samples, it is inefficient and prone to getting stuck in a local optimum. The stochastic gradient descent algorithm was proposed to address this.

The objective function computed in each round is no longer the error over the whole sample set but the error of a single sample: each update uses the gradient of only one sample's objective function, then moves on to the next sample, repeating until the loss stops decreasing or falls below a tolerable threshold.

This process is simple and efficient, and it can often prevent the iterations from converging to a local optimum. Its update takes the form

\theta=\theta-\eta \cdot \nabla_{\theta} J\left(\theta ; x^{(i)} ; y^{(i)}\right)

Because only one sample is used per iteration, the method is easily misled by noisy samples and may still fall into a local optimum.

Here x^{(i)} denotes the feature values of a training sample and y^{(i)} denotes its label value.
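A minimal sketch of one SGD pass under the same illustrative assumptions (linear regression with MSE loss, NumPy):

```python
import numpy as np

def sgd_epoch(theta, X, y, eta=0.01, rng=np.random.default_rng(0)):
    """One pass of SGD: update theta using one random sample at a time."""
    n = X.shape[0]
    for i in rng.permutation(n):        # visit samples in random order
        residual = X[i] @ theta - y[i]  # error of this single sample
        grad = 2.0 * residual * X[i]    # gradient of that sample's loss only
        theta = theta - eta * grad
    return theta
```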

4. Mini-batch gradient descent algorithm (MGD)

Mini-batch gradient descent is a compromise between FGD and SGD that combines, to some extent, the advantages of both methods.

  • At each step, a small batch of samples is drawn at random from the training set, and an FGD-style update is performed on that small batch
  • The number of samples in the batch, called batch_size, is usually set to a power of 2, which is more convenient for GPU acceleration
  • Special cases:
    • If batch_size = 1, the method reduces to SGD
    • If batch_size = n (the whole training set), the method reduces to FGD. Its iteration takes the form

 \theta=\theta-\eta \cdot \nabla_{\theta} J\left(\theta ; x^{(i: i+n)} ; y^{(i: i+n)}\right)
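A minimal sketch of one mini-batch epoch under the same illustrative assumptions (linear regression, MSE loss, NumPy); setting batch_size=1 recovers the SGD behavior above, and batch_size=n recovers FGD:

```python
import numpy as np

def minibatch_gd_epoch(theta, X, y, eta=0.01, batch_size=32,
                       rng=np.random.default_rng(0)):
    """One pass of mini-batch gradient descent."""
    n = X.shape[0]
    order = rng.permutation(n)                       # shuffle once per epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]        # indices of this mini-batch
        Xb, yb = X[idx], y[idx]
        residual = Xb @ theta - yb
        grad = (2.0 / len(idx)) * (Xb.T @ residual)  # gradient averaged over the batch
        theta = theta - eta * grad
    return theta
```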

5. Stochastic average gradient descent algorithm (SAGD)

Although SGD avoids the high computational cost per update, its results on large-scale training data are often unsatisfactory, because each round's gradient update is completely unrelated to the data and gradients of previous rounds.

The stochastic average gradient algorithm overcomes this problem: it keeps a stored (old) gradient for every sample in memory, randomly selects a sample i and recomputes only that sample's gradient while keeping the others unchanged, then takes the average of all stored gradients and uses it to update the parameters.

In this way, each update only needs to compute the gradient of a single sample, so the per-step cost is comparable to SGD, but convergence is much faster.
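A minimal sketch of this bookkeeping in the same illustrative linear-regression/MSE setting (the per-sample gradient memory is initialized to zero, as the summary below mentions):

```python
import numpy as np

def sag(X, y, eta=0.01, n_iters=1000, rng=np.random.default_rng(0)):
    """Stochastic average gradient for linear regression with MSE loss."""
    n, d = X.shape
    theta = np.zeros(d)
    grad_memory = np.zeros((n, d))   # one stored gradient per sample (initialized to 0)
    grad_sum = np.zeros(d)           # running sum of the stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)                            # pick one sample at random
        new_grad = 2.0 * (X[i] @ theta - y[i]) * X[i]  # recompute only sample i's gradient
        grad_sum += new_grad - grad_memory[i]          # refresh the running sum
        grad_memory[i] = new_grad
        theta = theta - eta * grad_sum / n             # step with the average of all stored gradients
    return theta
```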

6. Summary

  • FGD uses the entire dataset for every update, so it has the highest time cost and the largest memory footprint.
  • SAGD performs poorly in the early stage of training and optimizes more slowly, because the stored gradients are usually initialized to 0 and each SAGD update mixes in gradient values from earlier rounds.
  • Considering both iteration count and running time, SGD performs very well: it quickly gets away from the initial gradient values early in training and rapidly drives the average loss down to a very low level. Note, however, that the step size must be chosen carefully with SGD, otherwise it is easy to miss the optimal solution.
  • Mini-batch combines the "boldness" of SGD with the "carefulness" of FGD, and its behavior lies between the two. It is currently the most widely used gradient descent algorithm in machine learning, precisely because it avoids both the poor running efficiency and high cost of FGD and the unstable convergence of SGD.

7. Gradient descent optimization algorithms (extension)

The following algorithms are mainly used for deep learning optimization

  • Momentum method: SGD with momentum is essentially a sibling of SAGD. SAGD averages the gradients of the past K steps, while SGD with momentum takes a weighted average of all past gradients (a minimal sketch follows this list).
  • Nesterov accelerated gradient: like a "smart" ball, it knows to slow down before it runs up the slope again.
  • Adagrad: adapts the learning rate per parameter, using a larger learning rate for features that occur rarely and a smaller one for features that occur frequently.
  • Adadelta: an extension of Adagrad that addresses Adagrad's monotonically decreasing learning rate.
  • RMSProp: uses an exponential moving average of the squared gradients to adjust the learning rate, so it converges well even when the objective function is non-stationary.
  • Adam: combines the advantages of AdaGrad and RMSProp; it is an adaptive learning rate algorithm.
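As a sketch of the classical momentum update referenced in the first bullet (illustrative names only; gamma is the momentum coefficient, eta the learning rate, grad the current gradient):

```python
def momentum_step(theta, velocity, grad, eta=0.01, gamma=0.9):
    """One SGD-with-momentum update.

    velocity accumulates an exponentially weighted average of past gradients,
    so each step combines the current gradient with the direction of
    previous steps (gamma controls how much history is kept).
    """
    velocity = gamma * velocity + eta * grad
    theta = theta - velocity
    return theta, velocity
```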



Origin blog.csdn.net/qq_43874317/article/details/128247578