Summary of various optimizers for deep learning

1. Design principle of optimization algorithm

The core principle of optimization algorithms in deep learning is gradient descent, which minimizes an objective function $J(\theta)$. The optimization proceeds by first computing the gradient of the objective function, $\nabla J(\theta)$, and then updating the parameter $\theta$ via $\theta_{t}=\theta_{t-1}-\eta\nabla J(\theta)$, where $\eta$ is the learning rate, which sets the step size of each gradient update. The algorithm that carries out this optimization process is called an optimizer. The two cores of a deep learning optimizer are therefore the gradient and the learning rate: the former determines the direction of the parameter update, while the latter determines its magnitude.

We define $\theta$ as the parameter to be optimized, $J(\theta)$ as the objective function, and $\eta$ as the initial learning rate. The execution framework of the optimization algorithm during gradient descent is as follows:

1. Calculate the gradient of the objective function with respect to the current parameters:

$g_t = \nabla J(\theta_t)$

2. Calculate the first-order and second-order momentum of the historical gradient as needed:

$m_t = \phi(g_1, g_2, \cdots, g_t)$

$V_t = \psi(g_1, g_2, \cdots, g_t)$

3. Calculate the descending gradient at the current moment:

$p = \eta \cdot m_t / \sqrt{V_t}$ (adaptive optimizer)

$p = \eta \cdot g_t$ (non-adaptive optimizer)

4. Perform gradient descent update

$\theta_{t+1} = \theta_t - p$

For the various optimizers, steps 3 and 4 are the same; the main differences lie in steps 1 and 2. Let's explain the design ideas of these optimizers in detail.
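To make the framework concrete, here is a minimal Python sketch of one generic update step, assuming `grad_fn` returns the gradient of $J$ at the current parameters and `phi`/`psi` stand in for whatever first- and second-order momentum rules a concrete optimizer defines.

```python
import numpy as np

def optimizer_step(theta, grad_fn, history, phi, psi, lr=0.01, adaptive=True, eps=1e-8):
    """One generic update following the four-step framework above."""
    g_t = grad_fn(theta)          # step 1: gradient at the current parameters
    history.append(g_t)
    m_t = phi(history)            # step 2: first-order momentum from the gradient history
    V_t = psi(history)            #         second-order momentum from the gradient history
    if adaptive:
        p = lr * m_t / (np.sqrt(V_t) + eps)   # step 3 (adaptive optimizer)
    else:
        p = lr * g_t                          # step 3 (non-adaptive optimizer)
    return theta - p              # step 4: gradient descent update
```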

1. Non-adaptive optimizer

If the learning rate stays fixed throughout the optimization process, or changes over time according to a predefined learning schedule, the optimizer is called non-adaptive. This category includes the most common SGD (Stochastic Gradient Descent), SGD with Momentum, SGD with Nesterov momentum, and others.

1. BGD (Batch gradient descent)

Its gradient update formula is:

$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)$

BGD uses the entire training set to compute the gradient of the loss with respect to the parameters at every update, so each update is very slow. It becomes intractable for large datasets, and it cannot update the model online with new data.

2. SGD (Stochastic gradient descent)

Its gradient update formula is similar to that of BGD.

The gradient descent process of SGD is similar to a small ball rolling down a hill: its direction of motion always follows the direction of steepest descent at the current point (the largest negative gradient direction), and its initial velocity at each moment is 0. The update rule is to perform a gradient update for each individual sample, which makes parameter updates very fast. Its shortcomings are also obvious: because SGD updates the gradient so frequently, the cost function oscillates severely, and it may oscillate back and forth around the global minimum or even jump away from the global optimum.

3. MSGD (Mini-batch gradient descent)

In practice, what we usually call SGD is actually MSGD; its gradient update formula is similar to that of SGD. The update rule is: MSGD uses a batch of $n$ samples for each update. This reduces the variance of the parameter updates, making convergence more stable, and it allows the highly optimized matrix operations in deep learning libraries to be exploited for more efficient gradient computation. The difference is that each gradient update uses neither the entire dataset (as in BGD) nor a single sample (as in the SGD of point 2), but a batch of $n$ samples. The SGD mentioned in the rest of this article refers to MSGD.
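As a rough illustration of this update rule, here is a minimal NumPy sketch of one epoch of mini-batch SGD; `loss_grad` is an assumed user-supplied function that returns the gradient of the loss averaged over a batch.

```python
import numpy as np

def msgd_epoch(theta, X, y, loss_grad, lr=0.01, batch_size=32):
    """One epoch of mini-batch SGD (MSGD): shuffle the data, then update once per batch."""
    n = X.shape[0]
    idx = np.random.permutation(n)                # shuffle the sample order each epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        g = loss_grad(theta, X[batch], y[batch])  # gradient averaged over the batch
        theta = theta - lr * g                    # plain gradient step
    return theta
```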
[Figure 1: oscillation of the loss during SGD/mini-batch updates]
Advantages:

1. Although the cost function of MSGD fluctuates noticeably during updates and the path takes many detours, the cost of computing each gradient is very low. As for the noise it introduces, much theoretical and practical work has shown that as long as the noise is not especially large, MSGD converges well.

2. When applied to large datasets, training is very fast. For example, taking a few hundred data points at a time from millions of samples, computing one MSGD gradient, and updating the model parameters is much faster than standard gradient descent, which must traverse all samples for a single update.

Shortcomings:

1. Good convergence is not guaranteed: mini-batch gradient descent uses only part of the dataset for each step, so each individual step is not strictly in the direction of the minimum. The overall trend is toward the minimum, but it can easily fall into a local minimum.

2. If the learning rate is too small, convergence is very slow; if it is too large, the loss function keeps oscillating or even diverges from the minimum. (One remedy is to start with a larger learning rate and reduce it once the change between two iterations falls below a threshold, but this threshold must be set in advance and therefore cannot adapt to the characteristics of the dataset.)

3. For non-convex functions, it is also necessary to avoid getting trapped in a local minimum or a saddle point: the gradient around a saddle point is close to 0, so MSGD can easily get stuck there. A saddle point is a point of a smooth function whose neighborhood on the curve, surface, or hypersurface lies on different sides of the tangent at that point, as shown in Figure 2 below.

[Figure 2: a saddle point]
4. SGDM (SGD with Momentum)

To address the problems of the algorithms above, SGDM was born: first-order momentum is added to the original SGD. Intuitively, this adds inertia. Where the slope is steep, the accumulated inertia is larger and descent is faster; where the slope is gentle, descent is slower.

Its gradient update formula is:

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta)$

$\theta = \theta - v_t$

SGDM adds the term $\gamma v_{t-1}$, defined as the momentum; a common value for $\gamma$ is 0.9. With momentum, descent is faster along directions where the gradient does not change sign, and slower along directions where the gradient keeps changing sign. A concrete example: in plain gradient descent we keep descending in one direction until we cross a ravine (an inflection point) and land on the opposite hillside, where the gradient points the other way. With momentum, the accumulated previous gradients partially cancel the change between the two hillsides, so the trajectory does not keep oscillating between them and finds its way down the valley more easily; in short, momentum reduces the oscillation. The state of SGDM's gradient updates is shown in Figure 3; compared with Figure 1, the oscillation is clearly reduced.
[Figure 3: gradient updates with SGDM]
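A minimal sketch of this momentum update, assuming a user-supplied `grad_fn(theta)` that returns $\nabla_\theta J(\theta)$:

```python
def sgdm_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """SGD with momentum: v_t = gamma * v_{t-1} + lr * g_t, then theta -= v_t."""
    g = grad_fn(theta)
    v = gamma * v + lr * g     # accumulate first-order momentum
    theta = theta - v
    return theta, v
```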
5. NAG (Nesterov accelerated gradient)

In the SGDM introduced above, each descent step is composed of the accumulated previous descent direction plus the gradient at the current point. But when the trajectory has just descended to the vicinity of an inflection point, continuing to update the parameters this way makes it cross the inflection point with a large step; in other words, the model does not automatically shrink its update when it meets an inflection point. NAG improves the momentum method for this problem, and its expressions are as follows:

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$

$\theta = \theta - v_t$

NAG first uses the previously accumulated gradient at the current position to take a provisional parameter step, computes the gradient at that look-ahead position, and then adds this gradient to the previously accumulated gradient vector. Simply put, it uses the accumulated direction to simulate where the parameters will be after the next step, and then replaces the gradient at the current position (used in the momentum method) with the gradient at that simulated position. Why does this solve the problem mentioned before? Because NAG has a step that predicts the gradient at the next position, when it descends near an inflection point it can anticipate that it is about to cross it, and this look-ahead gradient corrects the accumulated gradient, which prevents the step from being too large.
[Figure 4: comparison of SGDM and NAG update vectors]
Figure 4 above briefly illustrates NAG's update rule. The blue vectors represent the SGDM method: the short blue vector is the gradient update at the current position, and the long blue vector is the previously accumulated gradient. For NAG, the brown vector is the previously accumulated gradient, the red vector is the predicted gradient update at the next position, and their vector sum (the green vector) is the direction of the parameter update. The state of NAG's gradient updates is shown in Figure 5; the figure shows that, compared with SGDM and SGD, NAG performs better.
[Figure 5: gradient updates with NAG]
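A minimal sketch of NAG under the same `grad_fn` assumption; the only change from SGDM is that the gradient is evaluated at the look-ahead point `theta - gamma * v`:

```python
def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the simulated next position."""
    g_lookahead = grad_fn(theta - gamma * v)   # gradient at the look-ahead position
    v = gamma * v + lr * g_lookahead
    theta = theta - v
    return theta, v
```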

2. Adaptive optimizer

If, during optimization, the learning rate adapts to the gradients, eliminating as much as possible the influence of a fixed global learning rate, the optimizer is called adaptive. Common examples include Adagrad, Adadelta, RMSprop, and Adam.
1. Adagrad

Adagrad is essentially a constraint on the learning rate. For parameters that are updated frequently, we have already accumulated a lot of information about them; we do not want them to be influenced too much by any single sample, so we want their learning rate to be smaller. For parameters that are updated only occasionally, we know very little about them and hope to learn more from each occasional sample, i.e. we want their learning rate to be larger. The use of second-order momentum in this method marked the arrival of the era of "adaptive learning rate" optimization algorithms.

Here we give the definition of the second-order momentum $V_t$: it measures how frequently a parameter has been updated historically, and is the sum of the squares of all gradient values so far. The expressions for Adagrad are:

$m_t = g_t$

$V_t = \sum_{i=1}^{t}{g_i}^2$

$\theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{V_t}}$

where $g_t$ is the parameter gradient at time $t$. Let us explain why Adagrad can assign different learning rates to features of different frequencies. The second-order momentum $V_t$ is the cumulative sum of squared gradients. For features with little training data, the corresponding parameters are updated rarely, so their accumulated sum of squared gradients is relatively small, and the effective learning rate in the update equation above becomes larger; thus the parameters of such infrequent features are updated with larger steps. To avoid the denominator being 0, a smoothing term $\epsilon$ is usually added, and the parameter update equation becomes:

$\theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{V_t+\epsilon}}$

However, Adagrad also has a problem: its denominator keeps growing as training proceeds, so the learning rate becomes smaller and smaller and eventually approaches 0, making it impossible to update the parameters effectively.
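A minimal sketch of the Adagrad update, again assuming a user-supplied `grad_fn`; `G` holds the running sum of squared gradients ($V_t$):

```python
import numpy as np

def adagrad_step(theta, G, grad_fn, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients and scale each parameter's step by them."""
    g = grad_fn(theta)
    G = G + g ** 2                              # V_t: sum of squared gradients so far
    theta = theta - lr * g / np.sqrt(G + eps)   # per-parameter effective learning rate
    return theta, G
```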

2. Adadelta

To address this drawback of Adagrad, Adadelta improves the second-order momentum $V_t$: compared with Adagrad, the denominator is replaced by a decaying average of the past squared gradients. This denominator is effectively the root mean square (RMS) of the gradients. Its expressions are as follows:

$m_t = g_t$

$V_{g,t} = \gamma V_{g,t-1} + (1-\gamma){g_t}^2$

$V_{\Delta\theta,t} = \gamma V_{\Delta\theta,t-1} + (1-\gamma){\Delta\theta_t}^2$

$RMS[g]_t = \sqrt{V_{g,t}+\epsilon}$

$RMS[\Delta\theta]_t = \sqrt{V_{\Delta\theta,t}+\epsilon}$

$\theta_{t+1} = \theta_{t} - \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}m_t$

where $RMS[g]_t$ is the second-order momentum of the gradients and $RMS[\Delta\theta]_t$ is the second-order momentum of the parameter updates; their ratio replaces the learning rate. With the Adadelta optimization algorithm we do not even need to set a default learning rate, since it has been removed from the update rule.
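A minimal sketch of the Adadelta update under the same `grad_fn` assumption; note that no learning rate appears, only the decay rate `gamma`:

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad_fn, gamma=0.9, eps=1e-6):
    """Adadelta: step size is RMS of past updates divided by RMS of past gradients."""
    g = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2               # decaying average of g^2
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * g   # RMS[dtheta]_{t-1} / RMS[g]_t * g_t
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2            # decaying average of dtheta^2
    return theta + dx, Eg2, Edx2
```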

3. RMSprop

Both RMSprop and Adadelta were proposed to solve the problem of Adagrad's sharply decaying learning rate. RMSprop's formula is very similar to Adadelta's, but it was proposed independently at about the same time. Its expressions are:

$m_t = g_t$

$V_t = \gamma V_{t-1} + (1-\gamma){g_t}^2$

$\theta_{t+1} = \theta_{t} - \eta \frac{m_t}{\sqrt{V_t + \epsilon}}$

It can be seen that RMSprop still needs an initial learning rate $\eta$ to be set manually. The author suggests setting $\gamma$ to 0.9 and the learning rate $\eta$ to 0.001.
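A minimal sketch of the RMSprop update with the suggested defaults, assuming `grad_fn` as before:

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad_fn, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: divide the step by a decaying RMS of recent gradients."""
    g = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2    # decaying average of squared gradients
    theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```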

4. Adam (Adaptive Moment Estimation)

Adam is another way of computing an adaptive learning rate for each parameter. It combines Momentum and RMSprop, and introduces two parameters $\beta_1$ and $\beta_2$. Its expressions are:

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ (first-order momentum)

$V_t = \beta_2 V_{t-1} + (1-\beta_2){g_t}^2$ (second-order momentum)

$\tilde{m}_t = \frac{m_t}{1-\beta_1^t}$

$\tilde{V}_t = \frac{V_t}{1-\beta_2^t}$

$\theta_{t+1} = \theta_t - \eta \frac{\tilde{m}_t}{\sqrt{\tilde{V}_t + \epsilon}}$

The default value of $\beta_1$ is 0.9, the default value of $\beta_2$ is 0.999, and $\epsilon$ is $10^{-8}$. Adam combines the merits of Momentum and RMSprop, and experience has shown that Adam performs well in practice and has advantages over other adaptive learning-rate algorithms.
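A minimal sketch of the Adam update with the default hyperparameters above, again assuming `grad_fn`; `t` is the update count starting at 1:

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first- and second-order moments drive the step."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g            # first-order momentum
    v = beta2 * v + (1 - beta2) * g ** 2       # second-order momentum
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v
```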

Summary

Let's take a look at the performance of the above optimization algorithms on saddle points and contour lines:
[Figure 6: optimizer behavior near a saddle point]
[Figure 7: optimizer trajectories on a contour plot]
As can be seen from Figures 6 and 7, Adagrad, Adadelta, and RMSprop find the right direction almost immediately and converge fairly quickly, while the other methods are either slow or take many detours before finding it.

1. First of all, there is no firm conclusion about which of the major algorithms is best. If you are just getting started, prefer SGD + Nesterov Momentum or Adam.

2. Adaptive learning-rate algorithms such as Adam have advantages for sparse data and converge quickly, but SGDM tends to achieve better final results.

3. Choose according to your needs: during model design and experimentation, to verify the effect of a new model quickly, you can first use Adam for rapid experimental optimization; before the model is deployed or the results are published, a carefully tuned SGD-family optimizer can be used to push the model to its best performance.

4. Consider combining different algorithms: first use Adam for a quick descent, then switch to an SGD-family optimizer for thorough fine-tuning.

Origin blog.csdn.net/qq_52302919/article/details/131626516