The principle of the Adam optimizer and its variants

This article will start with SGD to introduce the principle of the Adam optimizer and the background of its variants.

1. The principle of SGD

SGD (Stochastic Gradient Descent) is based on the steepest descent method. Assume we have a loss function L(\theta ), where \theta denotes the parameters to be learned, and define the iterative update \theta^{k+1}=\theta^{k}+t^k\Delta(\theta^{k}),\ k=0,1,2,\dots so as to minimize the value of the loss function L(\theta ). This is a process of repeatedly updating the parameters \theta, where k denotes the update step, t^k the update step size (i.e. the learning rate), and \Delta(\theta^{k}) the update direction.

Assume there is an optimal parameter \theta^* and the current parameter \theta^k is close to it. We choose an appropriate update step so that \theta^{k+1}=\theta^{k}+t^k\Delta(\theta^{k}) approaches the optimal parameter. Performing a first-order Taylor expansion of the objective loss function L(\theta ) around \theta^k:

L(\theta^*) = L(\theta^k+v)\approx L(\theta^k) + \nabla L(\theta^k) v 

Because \theta^* is the optimal parameter, we have:

L(\theta^*) < L(\theta^k) \rightarrow \nabla L(\theta^k) v< 0

The steepest descent method chooses, among all normalized directions v, the one that makes the directional derivative \nabla L(\theta^k) v smallest, so that L(\theta^k) decreases as quickly as possible toward the optimal value L(\theta^*). Under the L2-norm constraint \left \| v \right \|\leq 1, the directional derivative is minimized when v = -\nabla L(\theta^k). Therefore, the update rule of the steepest descent method can be written as:

\theta^{k+1}=\theta^{k} - t^k \nabla L(\theta^k),\ k=0,1,2,\dots

Here t^k represents the update step size. Because the Taylor expansion above is only valid near the current parameters, the step size must be kept small enough; in SGD it is called the learning rate.
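A minimal sketch of the SGD update above in NumPy (the toy quadratic loss and all names here are illustrative, not from the original article):

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One SGD step: move against the gradient with step size lr (the learning rate)."""
    return theta - lr * grad

# Illustrative usage on the toy quadratic loss L(theta) = 0.5 * ||theta||^2,
# whose gradient is simply theta.
theta = np.array([1.0, -2.0])
for k in range(100):
    grad = theta                      # stand-in for the stochastic gradient g_k
    theta = sgd_step(theta, grad, lr=0.1)
print(theta)                          # approaches the minimum at [0, 0]
```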

2. The principle of SGD with Momentum

Because the gradient direction in SGD, g_k = \nabla L(\theta^k), may cause oscillations in parameter learning due to deviations at individual points, momentum is introduced to smooth the update:

m_k = \beta m_{k-1} + (1-\beta )g_k

\theta^{k+1}=\theta^{k} - t^k m_k,\ k=0,1,2,\dots
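A minimal sketch of the momentum update above (beta = 0.9 and the zero initialization of m are common conventions, used here only for illustration):

```python
import numpy as np

def momentum_sgd_step(theta, m, grad, lr, beta=0.9):
    """Momentum SGD: m_k = beta * m_{k-1} + (1 - beta) * g_k, then step along -m_k."""
    m = beta * m + (1.0 - beta) * grad
    theta = theta - lr * m
    return theta, m

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)              # m_0 = 0
for k in range(200):
    grad = theta                      # toy gradient of 0.5 * ||theta||^2
    theta, m = momentum_sgd_step(theta, m, grad, lr=0.1)
```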

3. The principle of Adam

Momentum SGD alleviates the oscillation caused by gradient deviations at individual points, but the learning-rate setting still affects training: when the gradient is small, a learning rate that is too small slows training down, and when the gradient is large, a learning rate that is too large makes training oscillate. Adam therefore adds an adaptive learning rate (i.e. an adaptive update step size) on top of momentum SGD:

m_k = \beta_1 m_{k-1} + (1-\beta_1 )g_k

v_k = \beta_2 v_{k-1} + (1-\beta_2 )g_k^2

\theta^{k+1}=\theta^{k} - t^k m_k / \sqrt{v_k},\ k=0,1,2,\dots

Adam adds a second-order moment v_k on top of momentum SGD and uses it to adaptively control the step size: when the gradient is small, the effective learning rate t^k /\sqrt{v_k} becomes larger, and vice versa. Therefore, in general, Adam converges faster than SGD.

At the same time, in order to avoid oscillation of the learning rate caused by gradient deviations at individual points, the momentum coefficient \beta_2 is introduced for the second-order moment (because it accumulates the squared gradient, generally \beta_2 > \beta_1).
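A minimal sketch of the Adam update given by the three formulas above (the bias-correction terms from the original Adam paper are omitted to match the simplified formulas in this article; eps is only for numerical stability):

```python
import numpy as np

def adam_step(theta, m, v, grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified Adam: first moment m_k, second moment v_k, adaptive step lr / sqrt(v_k)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for k in range(200):
    grad = theta                      # toy gradient of 0.5 * ||theta||^2
    theta, m, v = adam_step(theta, m, v, grad, lr=0.01)
```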

4. Principle of AdamW

But Adam has another problem: when the loss function contains an L2 regularization term, Adam does not optimize it well. The main reason is that Adam's effective learning rate changes with the gradient, becoming smaller when the gradient becomes larger. As a result, weights with large gradients are decayed less than weights with small gradients, which runs counter to the intent of L2 regularization. We illustrate this with the formulas below.

Assume the L2 regularization term is added to the objective loss function:

\hat{L}(\theta ) = L(\theta ) + \frac{1}{2}\left \| \theta \right \|^2

If momentum SGD is used as the optimizer, the parameter update can be written as the following formulas, from which it can be seen that the L2 regularization term is equivalent to weight decay:

\hat{m_k} = \beta m_{k-1} + (1-\beta )(g_k + \theta^k) = m_k + (1-\beta )\theta^k

\theta^{k+1}=\theta^{k} - t^k \hat{m_k} = (1 - t^k(1-\beta )) \theta^{k} - t^k m_k,\ k=0,1,2,\dots
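To make the equivalence concrete, here is a small numerical check (toy values, assuming the same \beta, learning rate, and previous momentum for both forms) showing that feeding the L2 gradient g_k + \theta^k into momentum SGD produces exactly the same update as shrinking the weights directly:

```python
import numpy as np

beta, lr = 0.9, 0.1
theta = np.array([1.0, -2.0])
m_prev = np.array([0.3, 0.1])
g = np.array([0.5, -0.4])

# (a) Momentum SGD on the L2-regularized loss: the gradient is g + theta.
m_hat = beta * m_prev + (1.0 - beta) * (g + theta)
theta_a = theta - lr * m_hat

# (b) Plain momentum SGD plus an explicit weight-decay factor on theta.
m = beta * m_prev + (1.0 - beta) * g
theta_b = (1.0 - lr * (1.0 - beta)) * theta - lr * m

print(np.allclose(theta_a, theta_b))  # True: the L2 term acts as weight decay
```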

But when Adam is applied, the effective weight decay becomes smaller wherever the gradient is larger, so Adam does not optimize the L2 regularization term well. AdamW therefore adds a decoupled weight-decay term to Adam instead of relying on the L2 term in the loss:

\theta^{k+1}=\theta^{k} - t^k (m_k / \sqrt{v_k} + \omega \theta^{k}),\ k=0,1,2,\dots

\omega =\omega_{norm}\sqrt{\frac{b}{BT}}

In the formula above, \omega is the weight-decay coefficient, b is the batch size, B is the number of batches per epoch, and T is the total number of epochs. It can be seen that the weight-decay coefficient is tied to the total amount of training.
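A minimal sketch of the decoupled AdamW update above, where the decay term \omega \theta is applied directly to the parameters rather than being added to the gradient (the normalized-decay formula for \omega is not computed here; weight_decay is simply passed in as a constant):

```python
import numpy as np

def adamw_step(theta, m, v, grad, lr, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified AdamW: weight decay is decoupled from the adaptive gradient step."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    theta = theta - lr * (m / (np.sqrt(v) + eps) + weight_decay * theta)
    return theta, m, v
```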

5. The principle of AdamWR

AdamWR adds warm restarts on top of AdamW to keep training from getting stuck in local optima. Because the learning rate and the gradient keep shrinking as training converges, it is difficult, or at least slow, to escape a local optimum once one is reached, so AdamWR periodically raises the learning rate to enlarge the model's exploration space.

This periodic adjustment of the learning rate is called cosine annealing and can be expressed as:

t^k = t^i_{min} + 0.5(t^i_{max} - t^i_{min})(1 + cos(\pi T_{cur}/T_i))

AdamWR divides the entire training process into multiple warm-restart cycles. In the formula above, i denotes the i-th warm-restart cycle, t^i_{min} and t^i_{max} are the minimum and maximum learning rates in that cycle, T_i is the total number of epochs in the current cycle, and T_{cur} is the number of epochs completed so far in the cycle.
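A minimal sketch of the cosine-annealing schedule above (the learning-rate bounds and cycle lengths are illustrative values):

```python
import math

def cosine_annealing_lr(t_min, t_max, T_cur, T_i):
    """Cosine annealing within one warm-restart cycle of length T_i epochs."""
    return t_min + 0.5 * (t_max - t_min) * (1.0 + math.cos(math.pi * T_cur / T_i))

# Two warm-restart cycles of 10 epochs each: within a cycle the learning rate
# decays from t_max toward t_min, then jumps back to t_max at the restart.
for i in range(2):                      # i-th warm-restart cycle
    for T_cur in range(10):             # epochs completed in the current cycle
        lr = cosine_annealing_lr(t_min=1e-5, t_max=1e-2, T_cur=T_cur, T_i=10)
        print(i, T_cur, round(lr, 6))
```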

AdamWR thus gives the model a larger exploration space. The figure in the original article evaluates the regions of good solutions that AdamWR can find under different initial learning rates and L2 regularization weights.

 

 


Origin blog.csdn.net/tostq/article/details/130597333