[Learning Series 6] Common optimization algorithms

Table of contents

1 Common optimization algorithms

1.1 Gradient descent algorithm (batch gradient descent BGD)

1.2 Stochastic gradient descent method (Stochastic gradient descent SGD)

1.3 Mini-batch gradient descent (Mini-batch gradient descent MBGD)

1.4 Momentum method (Momentum)

1.5 AdaGrad

1.6 RMSProp

1.7 Adam


1 Common optimization algorithms

1.1 Gradient descent algorithm (batch gradient descent BGD)

In each iteration, the entire training set is fed to the model. The advantage is that every sample is taken into account in each update, so every step is a step of global optimization over all the data; the cost is that a single iteration becomes expensive when the dataset is large.
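As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent on a least-squares problem; the data, learning rate, and iteration count are made-up values for the example, not from the original post.

```python
# Minimal sketch of batch gradient descent (BGD) on least-squares linear
# regression; data and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    grad = X.T @ (X @ w - y) / len(X)  # gradient computed over ALL samples
    w -= lr * grad                     # one global update per full pass
print(w)                               # should approach true_w
```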

1.2 Stochastic gradient descent method (Stochastic gradient descent SGD)

To address the slow training speed of batch gradient descent, the stochastic gradient descent algorithm was proposed. SGD randomly draws one sample (or a small group) from the training set, updates the parameters once according to its gradient, then draws another and updates again. When the sample size is extremely large, it may not be necessary to train on every sample to obtain a model whose loss is within an acceptable range.
The API in PyTorch is torch.optim.SGD().
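A small usage sketch, assuming a toy linear model and random data; each call to optimizer.step() here follows the gradient of a single sample.

```python
# Per-sample SGD with torch.optim.SGD; model, data and learning rate
# are assumptions made for illustration.
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X = torch.randn(100, 3)
y = torch.randn(100, 1)

for i in torch.randperm(len(X)).tolist():  # visit samples in random order
    pred = model(X[i:i + 1])               # a single sample per update
    loss = loss_fn(pred, y[i:i + 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```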

1.3 Mini-batch gradient descent (Mini-batch gradient descent MBGD)

SGD is relatively fast, but it has problems of its own. Training on a single sample introduces a lot of noise, so SGD does not move toward the overall optimum at every iteration; it may converge quickly at the beginning of training but slow down after a while. Mini-batch gradient descent builds on this: each update uses a small batch randomly drawn from the samples instead of a single one, which preserves both the quality of the updates and the speed.
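A sketch of the mini-batch version using a DataLoader; the batch size of 32, the toy model, and the random data are assumptions for illustration.

```python
# Mini-batch gradient descent: the DataLoader hands out random batches,
# and each optimizer step uses the averaged gradient of one batch.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 3)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:                  # each xb is a random mini-batch
    loss = loss_fn(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```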

1.4 Momentum method (Momentum)

Although mini-batch SGD gives a good training speed, it does not always reach the optimum; instead it tends to hover around the optimal point.
Another disadvantage is that mini-batch SGD requires us to choose an appropriate learning rate. A learning rate that is too small makes the network converge too slowly; a learning rate that is too large makes the loss function oscillate over a wide range during optimization, and the optimal point may even be skipped. What we want is for the loss function to converge quickly during optimization while not swinging too much.

The Momentum optimizer addresses exactly these problems. It applies an exponentially weighted moving average to the gradients, smoothing the parameter updates so that the oscillation of the gradient becomes smaller.

v = 0.8v + 0.2\Delta w, where \Delta w is the current gradient and v accumulates the previous gradients

w = w - \alpha v, where \alpha is the learning rate
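The two formulas above translate directly into code; the sketch below uses NumPy, with a made-up quadratic objective and placeholder hyperparameters.

```python
# The two momentum formulas written out in NumPy; grad_fn, the starting
# point and the coefficients are placeholders for illustration.
import numpy as np

def momentum_update(w, grad_fn, lr=0.1, beta=0.8, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        dw = grad_fn(w)                 # current gradient
        v = beta * v + (1 - beta) * dw  # v = 0.8*v + 0.2*dw
        w = w - lr * v                  # w = w - alpha*v
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w
w = momentum_update(np.array([5.0, -3.0]), lambda w: 2 * w)
print(w)   # approaches [0, 0]
```

In PyTorch, the same idea is exposed through the momentum argument of torch.optim.SGD (its exact formulation differs slightly from the simplified one above).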

1.5 AdaGrad

AdaGrad accumulates the square of each parameter's gradient at every iteration, takes the square root of the accumulated value, and divides the global learning rate by it, using the result as a dynamically updated learning rate. This achieves an adaptive, per-parameter learning rate.

gradient = history\_gradient + (\Delta w)^2

w=w-\frac{\alpha }{\sqrt{gradient}+\delta }\Delta w

\delta is a small constant, set to approximately 10^{-7}
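The same two-line update written out in NumPy, again on a made-up quadratic objective with placeholder hyperparameters; PyTorch provides this optimizer as torch.optim.Adagrad.

```python
# The two AdaGrad formulas above in NumPy; grad_fn and the hyperparameters
# are illustrative placeholders.
import numpy as np

def adagrad_update(w, grad_fn, lr=0.5, delta=1e-7, steps=100):
    history = np.zeros_like(w)
    for _ in range(steps):
        dw = grad_fn(w)
        history = history + dw ** 2                   # accumulate squared gradients
        w = w - lr / (np.sqrt(history) + delta) * dw  # per-parameter step shrinks over time
    return w

w = adagrad_update(np.array([5.0, -3.0]), lambda w: 2 * w)
print(w)   # step sizes shrink as the squared-gradient history grows
```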

1.6 RMSProp

The Momentum algorithm initially solves the problem of large oscillations during optimization, but the updates can still swing too much. To further suppress these oscillations and speed up convergence of the loss function, RMSProp applies an exponentially weighted average to the square of the parameter gradients.

gradient = 0.8*history\_gradient + 0.2*(\Delta w)^2

w=w-\alpha \frac{\Delta w}{\sqrt{gradient}+\delta}
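A NumPy sketch of the RMSProp update above; the 0.8/0.2 weights follow the formula in the text, the rest are illustrative placeholders. PyTorch provides this optimizer as torch.optim.RMSprop.

```python
# RMSProp in NumPy: the squared gradient is averaged with an exponential
# weight instead of summed. grad_fn and hyperparameters are placeholders.
import numpy as np

def rmsprop_update(w, grad_fn, lr=0.1, beta=0.8, delta=1e-7, steps=100):
    sq = np.zeros_like(w)
    for _ in range(steps):
        dw = grad_fn(w)
        sq = beta * sq + (1 - beta) * dw ** 2   # weighted average of dw^2
        w = w - lr * dw / (np.sqrt(sq) + delta)
    return w

w = rmsprop_update(np.array([5.0, -3.0]), lambda w: 2 * w)
print(w)   # moves toward the minimum at the origin
```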

1.7 Adam

The Adam (Adaptive Moment Estimation) algorithm combines the Momentum algorithm with the RMSProp algorithm; it prevents the gradient from swinging too much and at the same time increases the convergence speed.

  1. Initialize the gradient accumulator and the squared-gradient accumulator: v_w=0, s_w=0
  2. In the t-th round of training, first compute the two accumulators: v_w=0.8v_w+0.2\Delta w (the gradient term from Momentum) and s_w=0.8s_w+0.2(\Delta w)^2 (the squared-gradient term from RMSProp)
  3. Combining the two, the parameter update is: w=w-\alpha \frac{v_w}{\sqrt{s_w}+\delta }

The API in PyTorch is torch.optim.Adam().
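A NumPy sketch that follows the three steps listed above (the simplified form used in this post; torch.optim.Adam additionally applies bias correction to v_w and s_w). The objective and hyperparameters are illustrative placeholders.

```python
# Simplified Adam following the three steps above: Momentum accumulator,
# RMSProp accumulator, and the combined update.
import numpy as np

def adam_update(w, grad_fn, lr=0.1, beta=0.8, delta=1e-7, steps=100):
    v = np.zeros_like(w)   # gradient accumulator          (step 1)
    s = np.zeros_like(w)   # squared-gradient accumulator  (step 1)
    for _ in range(steps):
        dw = grad_fn(w)
        v = beta * v + (1 - beta) * dw          # Momentum term   (step 2)
        s = beta * s + (1 - beta) * dw ** 2     # RMSProp term    (step 2)
        w = w - lr * v / (np.sqrt(s) + delta)   # combined update (step 3)
    return w

w = adam_update(np.array([5.0, -3.0]), lambda w: 2 * w)
print(w)   # moves toward the minimum at the origin
```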

Here are intuitive animations comparing the behavior of the optimization algorithms above (figures from the original post):

  • The six optimizers running on a loss surface
  • The six optimizers on a surface with saddle points
  • The six optimizers converging toward the target point (the five-pointed star)


Origin: blog.csdn.net/WakingStone/article/details/129646973