[pytorch optimizer] Detailed explanation of Adam optimization algorithm

Reprinted from: https://blog.csdn.net/weixin_39228381/article/details/108548413
for learning records only

1. Description

Each time the model back-propagates, a partial derivative g_t is computed for each learnable parameter p and is used to update that parameter. Usually the partial derivative g_t is not applied to the learnable parameter p directly; instead the optimizer processes it into a new value \widehat{g}_t. This processing is represented by a function F (whose content differs from optimizer to optimizer), i.e. \widehat{g}_t=F(g_t). The result is then combined with the learning rate lr to update the learnable parameter p, i.e. p=p-\widehat{g}_t*lr.
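A minimal sketch of this generic update pattern (the function F and the names below are illustrative, not PyTorch internals):

# Generic optimizer update: g_hat = F(g_t), then p = p - g_hat * lr.
# F is whatever processing a particular optimizer applies to the raw gradient;
# for plain SGD it is the identity, for Adam it involves moving averages and
# bias correction.
def generic_step(p, g, lr, F=lambda g: g):
    g_hat = F(g)           # optimizer-specific processing of the raw gradient
    return p - lr * g_hat  # apply the processed gradient with the learning rate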

Adam builds on RMSProp and AdaGrad. Once you understand the principle of RMSProp, Adam is easy to understand.
For a detailed explanation of the AdaGrad and RMSProp optimization algorithms, you can read this blog.

2. Adam principle

On the basis of RMSProp, Adam makes two improvements: a gradient moving average and bias correction.

1. Gradient moving average

In RMSProp, the square of the gradient is smoothed with a smoothing constant, i.e. v_t=\beta_2*v_{t-1}+(1-\beta_2)*g_t^2 (following the paper, the moving average of the squared gradient is denoted by v; in the PyTorch source code the smoothing constants of Adam are called β, while the one in RMSProp is called α), but the gradient itself is not smoothed.

In Adam, the gradient itself is also smoothed, and its moving average is denoted by m, i.e. m_t=\beta_1*m_{t-1}+(1-\beta_1)*g_t. Adam therefore has two smoothing constants, β_1 and β_2.

2. Bias correction

Consider the moving average m above: when t=1, m_1=\beta_1*m_0+(1-\beta_1)*g_1. Since the initial value m_0 is 0 and β_1 is close to 1, m is biased toward 0 when t is small, and the same holds for v. Bias correction fixes this by dividing by 1-\beta^t, i.e. \widehat{m}_t=\frac{m_t}{1-\beta_1^t} and \widehat{v}_t=\frac{v_t}{1-\beta_2^t}.
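A quick numeric illustration of the bias (assumed values: β = 0.9 and a constant gradient of 1.0):

beta = 0.9
g = 1.0            # pretend every gradient equals 1.0
m = 0.0            # m_0 = 0
for t in range(1, 4):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta ** t)   # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))
# t=1: m=0.1,   m_hat=1.0
# t=2: m=0.19,  m_hat=1.0
# t=3: m=0.271, m_hat=1.0
# Without correction, m starts far below the true gradient scale of 1.0.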

3. Adam calculation process

For ease of understanding, the pseudocode below differs slightly from the paper; the steps added on top of RMSProp (shown in blue in the original figure) are the gradient moving average m and the bias correction.
[Figure: Adam update pseudocode, with the additions relative to RMSProp highlighted in blue]
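Since the figure is not reproduced here, the following Python sketch summarizes one Adam update for a single scalar parameter (names and defaults follow the parameter list in the next section; this is an illustration, not the actual PyTorch implementation):

import math

def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m, v are the moving averages carried over from the previous step; t is 1-based.
    m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v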

3. PyTorch Adam parameters

torch.optim.Adam(params,
                lr=0.001,
                betas=(0.9, 0.999),
                eps=1e-08,
                weight_decay=0,
                amsgrad=False)
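
A typical usage sketch (the tiny model and data below are made up purely for illustration):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

x, y = torch.randn(8, 10), torch.randn(8, 1)
for _ in range(100):
    optimizer.zero_grad()                             # clear old partial derivatives
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                   # compute g_t for every parameter
    optimizer.step()                                  # apply the Adam update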

1. params

The learnable parameters in the model that need to be updated

2. lr

learning rate

3. betas

Smoothing constants \beta_1 and \beta_2.

4. eps

A small constant added to the denominator (\sqrt{\widehat{v}_t}+\epsilon) to prevent division by zero.

5. weight_decay

weight_decay modifies the partial derivative using the value of the current learnable parameter p, namely g_t=g_t+(p*weight_decay), where g_t is the partial derivative of the learnable parameter p being updated.

The function of weight_decay is L2 regularization; it is not directly related to Adam itself.
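A sketch of this gradient modification (an illustrative helper, not PyTorch internals):

def apply_weight_decay(g, p, weight_decay=0.0):
    # g_t = g_t + p * weight_decay, i.e. the gradient of the L2 penalty
    # (weight_decay / 2) * p**2 is added to the raw gradient.
    return g + p * weight_decay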

6. amsgrad

If amsgrad is True, then on top of the above pseudocode the historical maximum of v_t is kept, denoted v_{max}, and v_{max} is used in each update instead of the current v_t.

amsgrad is an extension and is not part of the original Adam algorithm.
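A sketch of the AMSGrad variant, extending the adam_step sketch above (assuming, as in recent PyTorch versions, that bias correction is applied to v_max):

import math

def adam_amsgrad_step(p, g, m, v, v_max, t,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = max(v_max, v)                  # keep the historical maximum of v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v_max / (1 - beta2 ** t)       # use v_max instead of the current v
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v, v_max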
