Reprinted from: https://blog.csdn.net/weixin_39228381/article/details/108548413
For learning records only.
1. Description
Each time the model back-propagates, a partial derivative g_t is computed for each learnable parameter p, and g_t is used to update the corresponding parameter p. Usually g_t is not applied to p directly; instead, the optimizer first processes it to obtain a new value, and this processing is represented by a function F (the content of F differs between optimizers). The processed gradient F(g_t) is then used together with the learning rate lr to update the learnable parameter p, i.e. p ← p − lr·F(g_t).
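As a rough illustration of this pattern (a minimal sketch, not how any PyTorch optimizer is actually implemented; `generic_step` and `F` are placeholder names):

```python
import torch

def generic_step(p: torch.Tensor, lr: float, F):
    """One generic optimizer step: p <- p - lr * F(g_t)."""
    with torch.no_grad():
        g_t = p.grad          # partial derivative computed by loss.backward()
        p -= lr * F(g_t)      # the optimizer-specific function F processes g_t

# For plain SGD, F is the identity; Adam builds F from the moving averages
# m and v described below.
```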
Adam is an improvement built on top of RMSProp and AdaGrad. Once you understand the principle of RMSProp, Adam is easy to follow. For a detailed explanation of the AdaGrad and RMSProp optimization algorithms, you can read this blog.
2. Adam principle
On the basis of RMSProp, Adam makes two improvements: a gradient moving average and bias correction.
1. Gradient moving average
In RMSProp, the square of the gradient is smoothed with a smoothing constant, i.e. v_t = β·v_{t-1} + (1−β)·g_t² (following the paper, v denotes the moving average of the squared gradient; in the PyTorch source code the smoothing constant is called β in Adam and α in RMSProp), but the gradient itself is not smoothed.
In Adam, the gradient itself is also smoothed, and its moving average is denoted m: m_t = β_1·m_{t-1} + (1−β_1)·g_t. Adam therefore has two βs: β_1 for m and β_2 for v, so v_t = β_2·v_{t-1} + (1−β_2)·g_t².
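A minimal sketch of these two moving averages for one parameter's gradient (the function name is illustrative; the default values match torch.optim.Adam's betas=(0.9, 0.999)):

```python
def update_moments(m, v, g_t, beta1=0.9, beta2=0.999):
    """One step of the two exponential moving averages kept by Adam."""
    m = beta1 * m + (1 - beta1) * g_t        # moving average of the gradient itself (new in Adam)
    v = beta2 * v + (1 - beta2) * g_t * g_t  # moving average of the squared gradient (as in RMSProp)
    return m, v
```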
2. Bias correction
For the moving average m above, at t = 1 we have m_1 = β_1·m_0 + (1−β_1)·g_1. Since the initial value m_0 is 0 and β_1 is close to 1, m is biased toward 0 while t is small; the same holds for v. Bias correction is therefore applied by dividing by 1 − β^t, i.e. m̂_t = m_t / (1 − β_1^t) and v̂_t = v_t / (1 − β_2^t).
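A tiny numeric illustration of the bias at t = 1 and how the correction removes it (the values are chosen only for the example):

```python
beta1, g_1 = 0.9, 1.0
m_1 = beta1 * 0.0 + (1 - beta1) * g_1   # roughly 0.1, strongly biased toward 0
m_1_hat = m_1 / (1 - beta1 ** 1)        # 1.0, bias removed
print(m_1, m_1_hat)
```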
3. Adam calculation process
For ease of understanding, the update sketched below differs slightly from the paper's pseudocode; the gradient moving average and the bias-correction steps are the ones Adam adds on top of RMSProp.
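A Python sketch of a single Adam update in the notation above (the function name is assumed; the defaults mirror torch.optim.Adam, but this is not the library implementation and it omits weight_decay and amsgrad):

```python
import torch

def adam_step(p, g_t, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor p (sketch)."""
    m = beta1 * m + (1 - beta1) * g_t            # gradient moving average (added vs. RMSProp)
    v = beta2 * v + (1 - beta2) * g_t * g_t      # squared-gradient moving average (as in RMSProp)
    m_hat = m / (1 - beta1 ** t)                 # bias correction (added vs. RMSProp)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (torch.sqrt(v_hat) + eps)   # parameter update
    return p, m, v
```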
3. PyTorch Adam parameters
```python
torch.optim.Adam(params,
                 lr=0.001,
                 betas=(0.9, 0.999),
                 eps=1e-08,
                 weight_decay=0,
                 amsgrad=False)
```
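A typical usage pattern with the default arguments shown above (the model and data here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # any model with learnable parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-08,
                             weight_decay=0, amsgrad=False)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()      # computes g_t for every learnable parameter
optimizer.step()     # applies the Adam update described above
```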
1. params
The learnable parameters in the model that need to be updated
2. lr
The learning rate.
3. betas
The two smoothing constants β_1 and β_2 (for the moving averages m and v, respectively).
4. eps
A small constant ε added to the denominator to prevent division by 0.
5. weight_decay
The function of weight_decay is to modify the partial derivative using the value of the current learnable parameter p, namely g_t ← g_t + weight_decay·p, where g_t is the partial derivative of the learnable parameter p to be updated.
weight_decay implements L2 regularization and is not directly related to Adam.
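A one-line sketch of what weight_decay does to the raw gradient before the Adam update (the function name is illustrative):

```python
def apply_weight_decay(g_t, p, weight_decay):
    """Return the gradient with the L2-regularization term added: g_t + weight_decay * p."""
    return g_t + weight_decay * p
```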
6. amsgrad
If amsgrad is True, then on top of the update described above, the largest value of v seen so far during training is kept, recorded as v_max, and v_max is used in place of the current v_t in the denominator of every update; otherwise the current v_t is used.
amsgrad is not directly related to Adam.
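A sketch of how the AMSGrad variant changes the denominator (the names are illustrative and follow the description above rather than the exact PyTorch source):

```python
import torch

def amsgrad_denominator(v, v_max, t, beta2=0.999, eps=1e-8):
    """Use the historical maximum of v instead of the current v in the denominator."""
    v_max = torch.maximum(v_max, v)                   # keep the largest v seen so far
    denom = torch.sqrt(v_max / (1 - beta2 ** t)) + eps
    return denom, v_max
```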