Optimizer algorithms

To be honest, I didn't originally plan to write about optimizers, since there is already plenty of material online. But I found many of those articles disorganized: they may read easily for people who already know the concepts, yet they are hard going for beginners or readers with little prior exposure. I am reorganizing the material here as my own notes. Most of it comes from earlier expert blogs, and the relevant reference links are at the end.

1. Related background

1.1. Exponentially Weighted Moving Average (EMA)

1.1.1. Evolution and overview

Arithmetic average (equal weights) -> weighted average (unequal weights) -> moving average (roughly, only the most recent N data points enter the calculation) -> EMA, which underlies batch normalization (BN) and many optimization algorithms.
EMA is an exponentially decaying weighted moving average: the weight of each value decreases exponentially with time, so the closer a value is to the current moment, the larger its influence.


1.1.2. Understanding the formula

  • $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$, where $\theta_t$ is the actual temperature at time t; the coefficient $\beta$ controls how fast the weights decay (the smaller $\beta$ is, the faster the weights decline); and $v_t$ is the value of the EMA at time t.
  • When $v_0 = 0$, unrolling the recursion gives $v_t = (1-\beta)(\theta_t + \beta\theta_{t-1} + \beta^2\theta_{t-2} + \dots + \beta^{t-1}\theta_1)$. The formula shows that the weight of each day's temperature $\theta$ decays exponentially: the closer a data point is to the current moment, the larger its weight.
  • In optimization algorithms we generally take $\beta \ge 0.9$. Since $1 + \beta + \beta^2 + \dots + \beta^{t-1} = \frac{1-\beta^t}{1-\beta}$, when t is large enough we have $\beta^t \approx 0$ and the weights effectively sum to 1; at that point it is an exponentially weighted moving average in the strict sense.
  • With $\beta \ge 0.9$ we also have $\beta^{\frac{1}{1-\beta}} \approx \frac{1}{e} \approx 0.36$; that is, after $N = \frac{1}{1-\beta}$ days the weight drops to roughly $\frac{1}{3}$ of its original height, and the weights of older $\theta$ keep shrinking as time moves on. So it is fair to say that only the latest $N = \frac{1}{1-\beta}$ days of data effectively enter the EMA at the current moment; this is where the name "moving average" comes from (a minimal sketch follows below).
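
To make the recursion concrete, here is a minimal Python sketch of the EMA above (the temperature series is made up for illustration):

```python
import numpy as np

def ema(values, beta=0.9):
    """Exponentially weighted moving average with v_0 = 0."""
    v = 0.0
    out = []
    for theta in values:
        v = beta * v + (1 - beta) * theta  # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        out.append(v)
    return np.array(out)

# With beta = 0.9, roughly only the latest N = 1 / (1 - 0.9) = 10 values
# still carry noticeable weight.
temps = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0, 17.0, 19.0]
print(ema(temps))
```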

1.1.3. EMA bias correction

[Figure: EMA curves for β = 0.98, the ideal bias-corrected (green) curve vs. the actual (purple) curve]
When $\beta = 0.98$, ideally we should get the green curve. In reality, however, we get the purple curve: its starting point is far below the true values, so the temperature near the start is poorly estimated. This is called the cold-start problem, and it is caused by initializing $v_0 = 0$.
Solution: divide the EMA at every moment by $1 - \beta^t$ and use the result as the corrected EMA. When t is small, this makes the estimates in the initial stage much more accurate; when t is large, $\beta^t \approx 0$, so the bias correction has almost no effect and the original formula is barely changed. Note: we generally take $\beta \ge 0.9$, and when computing the bias-corrected EMA at time t, the recursion still uses the uncorrected EMA from time t-1 (see the sketch below).
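
A minimal sketch of the correction (illustrative, not from the original post); it makes the note above concrete, since the stored value stays uncorrected:

```python
def ema_bias_corrected(values, beta=0.98):
    """EMA with bias correction: report v_t / (1 - beta**t)."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta  # the recursion keeps using the uncorrected v
        out.append(v / (1 - beta ** t))    # corrected value, accurate even for small t
    return out
```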

1.1.4. How EMA is used in the Momentum optimization algorithm

Assume every gradient has the same value g, and $\gamma = 0.95$. Then the magnitude of the parameter update keeps growing while its rate of growth slows; when n reaches about 150, the velocity hits its upper bound (about $\frac{g}{1-\gamma}$), after which the parameter moves at a constant speed (refer to the formulas in 1.1.2 to see why).
If the gradient directions of some parameters during a period disagree with the earlier directions, the real update magnitude shrinks; conversely, if the gradient directions stay consistent over a period, the real update magnitude grows, which accelerates convergence. In the late stages of iteration, the updates often oscillate near the convergence point due to random noise; there the momentum method acts as a damper and increases stability.

2. Gradient descent algorithms

2.1. BGD, SGD, MBGD

For details, please refer to earlier articles, or this reference link:
Deep Learning - Detailed Explanation of Optimizer Algorithm Optimizer (BGD, SGD, MBGD, Momentum, NAG, Adagrad, Adadelta, RMSprop, Adam)

Recent research suggests that deep neural networks are hard to train not because they easily fall into local minima. On the contrary, because the network structure is so complex, even a local minimum usually yields very good results. The real difficulty is that learning easily gets stuck on saddle surfaces, i.e., places where the surface rises in some directions and falls in others. This is especially likely in flat regions, where the gradient is almost 0 in every direction.

2.2. Momentum

Momentum methods aim to speed up learning, especially in the face of small but consistent gradients carrying a lot of noise. Momentum simulates the inertia of a moving object: each update takes the direction of the previous update into account to some extent, and uses the gradient of the current batch to fine-tune the final direction. This adds a degree of stability and lets learning proceed faster.
Formula:
$m_t = \mu \cdot m_{t-1} + g_t$
$\Delta\theta_t = -\eta \cdot m_t$

The momentum term makes updates faster along dimensions whose gradient direction stays constant and slower along dimensions whose gradient direction keeps changing, thus speeding up convergence and reducing oscillation.

Hyperparameter setting: the momentum coefficient $\mu$ (often written $\gamma$) is generally about 0.9. A minimal sketch of one update step follows.
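
This one-step sketch follows the formula above; the quadratic objective and constants are illustrative:

```python
import numpy as np

def momentum_step(theta, m, grad, mu=0.9, eta=0.01):
    """m_t = mu * m_{t-1} + g_t;  theta_{t+1} = theta_t - eta * m_t."""
    m = mu * m + grad
    theta = theta - eta * m
    return theta, m

# Toy usage on f(theta) = theta**2, whose gradient is 2 * theta.
theta, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=2 * theta)
print(theta)  # approaches the minimum at 0
```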

Advantages:

  • In the early stage of descent, the previous update is carried forward, which accelerates learning when the descent direction stays consistent.
  • In the middle and late stages, when the parameters oscillate back and forth near a local minimum and the gradient approaches 0, the momentum term keeps the update magnitude up, helping to jump out of the trap;
  • Momentum reduces the update when the gradient changes direction. Overall, it accelerates learning along the relevant directions and suppresses oscillation, thereby speeding up convergence.

Disadvantages:

  • Momentum is like a ball blindly rolling down the slope. If it had some foresight, such as knowing to slow down just before the terrain rises again, it would adapt better.

2.3. Nesterov Accelerated Gradient

Note that in plain Momentum, the previously accumulated momentum $m_{t-1}$ does not affect where the current gradient $g_t$ is evaluated. Nesterov's improvement is to let the previous momentum directly influence the current gradient, i.e. (the standard look-ahead form):
$m_t = \mu \cdot m_{t-1} + g(\theta_t - \eta \mu m_{t-1})$
$\Delta\theta_t = -\eta \cdot m_t$
So with the Nesterov term, the gradient is computed after first taking a big jump along the previous momentum, and is then used to correct the current direction.

The difference between Nesterov momentum and standard momentum lies in where the gradient is computed: Nesterov momentum evaluates the gradient after the current velocity has been applied. Nesterov momentum can therefore be interpreted as standard momentum plus a correction factor.
Put plainly: from the current position, tentatively update $\theta$ with the previous momentum, then compute the gradient at that point. Peeking one step ahead like this shows the effect of applying the previous update, and combining it with the accumulated momentum plays a corrective role.

So why does this work? How does it correct things mathematically? Look at the equivalent form below.
[Figure: an equivalent form of the NAG update]
The difference between this equivalent form of NAG and Momentum is that the update direction gains an extra term $\beta[g(\theta_{i-1}) - g(\theta_{i-2})]$. Its intuitive meaning is obvious: if this gradient is larger than the last one, there is reason to believe it will keep growing, so the expected growth is added in advance; if it is smaller than last time, the opposite adjustment is made. This explanation may sound as mysterious as the original one, but notice: this extra term approximates the second derivative of the objective function! So NAG essentially uses second-order information about the objective, which is no wonder it can accelerate convergence. The "looking ahead" often mentioned for second-order methods such as Newton's method is, metaphorically, exactly this; mathematically it means exploiting the second-derivative information of the objective. A sketch of the look-ahead form follows.
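
A minimal sketch of the look-ahead form (grad_fn stands for the gradient of the objective; names and values are illustrative):

```python
import numpy as np

def nag_step(theta, m, grad_fn, mu=0.9, eta=0.01):
    """Nesterov: evaluate the gradient at the point the accumulated
    momentum is about to carry us to, then update as in Momentum."""
    g = grad_fn(theta - eta * mu * m)  # gradient after the "big jump"
    m = mu * m + g
    theta = theta - eta * m
    return theta, m

# Toy usage on f(theta) = theta**2.
theta, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, m = nag_step(theta, m, grad_fn=lambda x: 2 * x)
```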

Other related references: Faster than Momentum: Uncovering the true face of Nesterov Accelerated Gradient

2.4. Adagrad

The algorithms from here on are gradient descent methods with an adaptive learning rate.
At the beginning of training we are far from the final optimum and need a larger learning rate; after several rounds of training, the learning rate should be reduced.
With mini-batch gradient descent there is noise in the iterations: although the cost function keeps decreasing, the algorithm converges to a swing around the minimum, and reducing the learning rate brings the final value closer to the convergence point near the minimum.

Adagrad divides the learning rate of each parameter by the root of the sum of squares of all of that parameter's previous gradients:
$n_t = n_{t-1} + g_t^2, \qquad \Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \cdot g_t$
Here the recursion for $n_t$ starts from t = 1 and forms a constraint regularizer (in fact the sum of squares of all previous gradients), and $\epsilon$ keeps the denominator from being zero;

Notice an apparent tension in the formula: as the gradient $g_t$ grows, the numerator pushes the update up; but at the same time the growing gradient enlarges the denominator, which pulls the overall learning rate down. Why is this?
Because as the number of updates increases, we want the learning rate to become slower and slower: at the start of training we are far from the optimum of the loss function, and as updates accumulate we believe we are ever closer to it, so the learning rate should decay.

Features:
  (1) Early on, when $g_t$ is small, the regularizer is large and amplifies the gradient;
  (2) Later, when $g_t$ is large, the regularizer is small and constrains the gradient;
  (3) Well suited to handling sparse gradients.
Disadvantages:
  (1) A global learning rate $\eta$ still has to be set manually;
  (2) If $\eta$ is set too large, the regularizer becomes too sensitive and adjusts the gradient too strongly;
  (3) In the middle and late stages, the squared gradients accumulating in the denominator grow ever larger, driving the effective step toward 0 and ending training prematurely.
A minimal update sketch follows this list.
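
This one-step sketch follows the formulation above; the constants are illustrative:

```python
import numpy as np

def adagrad_step(theta, n, grad, eta=0.01, eps=1e-8):
    """n_t = n_{t-1} + g_t**2;  theta_{t+1} = theta_t - eta / sqrt(n_t + eps) * g_t."""
    n = n + grad ** 2                              # lifetime sum of squared gradients
    theta = theta - eta / np.sqrt(n + eps) * grad  # per-parameter shrinking step
    return theta, n
```

Because n only ever grows, the effective step eta / sqrt(n) can only shrink, which is exactly disadvantage (3) above.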

2.5. Adadelta

Adadelta is an extension of Adagrad. It still adaptively constrains the learning rate, but simplifies the computation: Adagrad accumulates all previous squared gradients, whereas Adadelta accumulates only over a fixed-size window, and instead of storing those items directly it approximates the window with a running average. That is:

The AdaDelta algorithm mainly addresses the defects of the AdaGrad algorithm. First, a recap of AdaGrad's strengths and problems.

The iteration formula of AdaGrad is as follows:
$\Delta x_t = \frac{\eta}{\sqrt{\sum_{i=1}^{t} g_i^2}} \cdot g_t$
$x_t = x_{t-1} - \Delta x_t$

where $g_t$ is the gradient value at the current iteration.

Advantages

  • The learning rate scales with the reciprocal of the accumulated gradient magnitude: a larger gradient gets a smaller learning rate and a smaller gradient a larger one, which fixes the fixed learning rate of the plain SGD method.

Disadvantages

  • The initial learning rate must still be specified manually, and since historical gradients accumulate in the denominator, the learning rate gradually decays to 0; moreover, if the initial gradients are large, the learning rate stays small for the whole run, which lengthens training.

AdaDelta was proposed to solve these problems, with two approaches:

Improvement method one: Accumulate Over Window

  • Accumulate the squared gradients over a window of size w instead of over all time.
  • Because storing the previous w gradients is inefficient, an exponentially decaying average of all previous squared gradients (via the RMS, root mean square) can be used as an alternative implementation.

The update formula is as follows:
① Replace the accumulation over all historical gradients with an accumulation over a window extending back from the current time:
$E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho) \cdot g_t^2$

This is equivalent to multiplying the accumulated historical gradient information by a decay coefficient $\rho$ and then adding the current squared gradient weighted by $(1-\rho)$.

② Then take the square root of $E[g^2]_t$ and use it as the learning-rate decay factor at each iteration:
$x_{t+1} = x_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t$

where
$RMS(g_t) = \sqrt{E[g^2]_t + \epsilon}$
and $\epsilon$ is a tiny value added to keep the denominator from being zero.
This update solves the problem of accumulated historical gradients driving the learning rate ever downward, but at this point the initial learning rate $\eta$ still has to be chosen manually.

Improvement method two: Correct Units with Hessian Approximation
(I found this part hard to follow: how does Newton's formula turn into the one below? See [AdaDelta algorithm](https://blog.csdn.net/xiangjiaojun_/article/details/83960136#comments).)
From Newton's method we know that the iteration step is governed by $f''(x)$; the first-order Newton iteration formula is:
$x_{t+1} = x_t - \frac{f'(x)}{f''(x)}$

So Newton's step size is an analytical solution from the second-order approximation, and no learning rate needs to be specified manually.
In higher dimensions, the Newton step uses the inverse of the $\boldsymbol{Hessian}$ matrix.
The AdaDelta algorithm adopts this idea, using a diagonal approximation of the $\boldsymbol{Hessian}$ matrix.
The formula is as follows:
$\Delta x \approx \frac{\frac{\partial f}{\partial x}}{\frac{\partial^2 f}{\partial x^2}}$
Hence:
$\frac{1}{\frac{\partial^2 f}{\partial x^2}} = \frac{\Delta x}{\frac{\partial f}{\partial x}}$
The update formula is:
$x_{t+1} = x_t - \frac{1}{\frac{\partial^2 f}{\partial x^2}} \cdot g_t = x_t - \frac{\Delta x}{\frac{\partial f}{\partial x}} \cdot g_t$
Applying the same RMS treatment as before to the numerator and denominator gives the following:

  • Assuming the curvature near x is smooth, $x_{t+1}$ can be approximated by $x_t$:
    $\Delta x = \frac{RMS[\Delta x]_{t-1}}{RMS[g]_t} \cdot g_t$
    $x_{t+1} = x_t - \Delta x$
    where $g_t$ is the gradient of the current iteration.
  • Since RMS values are always positive, the update direction is guaranteed to be the negative gradient direction.
  • The numerator acts as an acceleration term which, like momentum, accumulates previous gradients over the time window w.
    The following is a demonstration of the algorithm in the paper:
    [Figure: the AdaDelta algorithm listing from the paper]
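
Assuming the listing matches Zeiler's paper, here is a minimal sketch of one AdaDelta step; note that there is no learning rate argument:

```python
import numpy as np

def adadelta_step(theta, eg2, edx2, grad, rho=0.95, eps=1e-6):
    """eg2 tracks E[g^2]; edx2 tracks E[dx^2]; no global learning rate."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    dx = np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad  # RMS[dx]_{t-1} / RMS[g]_t * g_t
    theta = theta - dx
    edx2 = rho * edx2 + (1 - rho) * dx ** 2               # updated *after* dx is used
    return theta, eg2, edx2
```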

2.6 RMSProp

Both RMSprop and Adadelta were designed to solve the problem of Adagrad's sharply decaying learning rate.
RMSprop can be regarded as a special case of Adadelta: it is identical to Adadelta's first form. It uses an exponentially weighted average of squared gradients, which, like Momentum, aims to eliminate oscillation in gradient descent: if the derivative along some dimension is large, its exponentially weighted average is large; if small, small. This keeps the per-dimension derivatives on the same order of magnitude, reduces the swing, and allows a larger learning rate $\eta$.
$E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho) \cdot g_t^2$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t$
Advantages:
  (1) Thanks to the exponentially weighted average of squared gradients, the problem of AdaGrad ending prematurely in deep learning is fixed; the effect lies between the two (Adagrad and Adadelta);
  (2) Suitable for non-stationary processes (i.e., processes that depend on time; the exponentially weighted average handles these better), so it works well for RNNs.
Disadvantages:
  (1) Still depends on a global learning rate $\eta$.
A minimal sketch of one step follows.
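
A one-step sketch under the standard formulation (rho = 0.9 and eta = 0.001 are the commonly cited defaults):

```python
import numpy as np

def rmsprop_step(theta, eg2, grad, eta=0.001, rho=0.9, eps=1e-8):
    """An EMA of squared gradients replaces Adagrad's ever-growing sum."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    theta = theta - eta / np.sqrt(eg2 + eps) * grad
    return theta, eg2
```

Because eg2 is a moving average rather than a lifetime sum, the effective step no longer decays to 0, which is exactly the fix for Adagrad's early stopping.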

2.7 Adam

This algorithm is yet another way to compute an adaptive learning rate for each parameter. It is effectively RMSprop + Momentum.

In addition to storing an exponentially decaying average of past squared gradients $v_t$, like Adadelta and RMSprop, Adam also maintains an exponentially decaying average of past gradients $m_t$, like momentum:
$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

If $m_t$ and $v_t$ are initialized as 0 vectors, they are biased toward 0, so a bias correction is performed to offset these biases, computing bias-corrected $\hat{m}_t$ and $\hat{v}_t$:

$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$

Gradient update rule:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$
Hyperparameter settings:
Recommended: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

Practice shows that Adam works better than the other adaptive-learning-rate methods. A minimal sketch of one step follows.
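
This one-step sketch combines the pieces above (t is the 1-based step count used in the bias correction):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum-style EMA of gradients (m) plus RMSProp-style EMA of
    squared gradients (v), both bias-corrected as in section 1.1.3."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```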

Here is a comparison of Adadelta, RMSprop and Adam, as I understand them:

  • RMSprop adapts the step size with a moving average of second moments (squared gradients), but still depends on the global learning rate $\eta$;
  • Adadelta also adapts the step size with a second-moment moving average, but additionally maintains the state variable $\Delta x_t$ and uses $\Delta x_{t-1}$ to compute the change of the variable, so it does not rely on the hyperparameter $\eta$;
  • Adam likewise adapts the step size with a second-moment moving average, and in addition smooths the gradient itself with a momentum moving average. It is a combination of RMSprop and momentum, and it does rely on the global learning rate $\eta$.

2.8 How to choose an optimization algorithm

If the data is sparse, use an adaptive method: Adagrad, Adadelta, RMSprop, or Adam.

RMSprop, Adadelta, and Adam have similar effects in many cases.

Adam adds bias-correction and momentum to RMSprop.

As the gradient becomes sparse, Adam will perform better than RMSprop.
Overall, Adam is the best choice.

Many papers use plain SGD without momentum or other extensions. Although SGD can reach a minimum, it takes longer than the other algorithms and may get trapped at saddle points.

If faster convergence is required, or a deeper and more complex neural network is trained, an adaptive algorithm needs to be used.


Other related optimization methods

  • Several common optimization methods (gradient descent, Newton's method, quasi-Newton methods, conjugate gradient, etc.)
  • How to explain Newton's iteration method in an easy-to-understand way?

Reference link

Origin blog.csdn.net/u014665013/article/details/88847536