Adam optimizer (common understanding)

There are many explanations of the Adam optimizer on the Internet, but they tend to get stuck at the same few points. Here I summarize the parts that are usually hard to explain; please point out any misunderstandings.

Adam's name comes from Adaptive Moment Estimation. It is a general-purpose optimizer proposed in 2014. It is very convenient to use and descends quickly, but it tends to oscillate near the optimum, so its final performance in competitions is often slightly inferior to well-tuned SGD; sometimes the simplest method is the most effective. Even so, its ease of use has made Adam very widely adopted.

Adam's update formulas:

\begin{align*}
g_{t}&=\nabla_{\theta}L(\theta_{t-1})\\
m_{t}&=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\\
v_{t}&=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^2\\
\hat{m_{t}}&=\frac{m_{t}}{1-\beta_{1}^{t}}\\
\hat{v_{t}}&=\frac{v_{t}}{1-\beta_{2}^{t}}\\
\theta_{t}&=\theta_{t-1}-\frac{\alpha\,\hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\epsilon}
\end{align*}

Explanation:

The first term g_{t} is the partial derivative (gradient) of the loss function L with respect to the parameters \theta.

The second term m_{t} is the first moment estimate of the gradient at step t, maintained as an exponential moving average (momentum form).

The third term v_{t} is the second moment estimate of the gradient, maintained in the same momentum-like form.

The fourth term \hat{m_{t}} is the first moment estimate after bias correction, where \beta_{1}^{t} is the t-th power of \beta_{1} (the same convention applies below).

The fifth term \hat{v_{t}} is the second moment estimate after bias correction.

The last line is the parameter update formula; compare it with RMSProp and the earlier algorithms.
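The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy quadratic, not a production implementation; the function name `adam_step` and the toy problem are my own choices.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    # m_t: first moment estimate of the gradient (momentum form)
    m = beta1 * m + (1 - beta1) * grad
    # v_t: second moment estimate of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias-corrected estimates m_hat, v_hat
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # parameter update
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(theta) = theta^2 (gradient 2*theta), starting from theta = 5
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to the minimum at 0
```

Note that the effective step size is roughly lr early on (since \hat{m_{t}}/\sqrt{\hat{v_{t}}} is close to ±1), which is also why Adam can oscillate near the optimum.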

Questions:

1. Gradient descent: if you don't understand gradient descent, it is recommended to learn the SGD optimizer first.

2. Momentum: it was already used in the earlier SGDM (SGD with momentum) optimizer.

3. Moment estimation: if you are unfamiliar with it, consult a university textbook on probability theory and mathematical statistics.

4. Why bias correction is needed:

Here is just my own understanding. Take the second moment estimate v_{t} as an example; expanding the recurrence term by term gives:

\begin{align*}
v_{1}&=(1-\beta_{2})g_{1}^2\\
v_{2}&=\beta_{2}v_{1}+(1-\beta_{2})g_{2}^2\\
&=\beta_{2}(1-\beta_{2})g_{1}^2+(1-\beta_{2})g_{2}^2\\
&=(1-\beta_{2})(\beta_{2}g_{1}^2+g_{2}^2)\\
&=(1-\beta_{2})(\beta_{2}^{2-1}g_{1}^2+\beta_{2}^{2-2}g_{2}^2)\\
&=(1-\beta_{2})\sum_{i=1}^{2}\beta_{2}^{2-i}g_{i}^2\\
v_{t}&=(1-\beta_{2})\sum_{i=1}^{t}\beta_{2}^{t-i}g_{i}^2
\end{align*}
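The equivalence between the recurrence and the closed-form sum is easy to verify numerically. This is a quick sanity check with made-up random gradients, not part of the algorithm itself:

```python
import random

beta2 = 0.999
random.seed(0)
grads = [random.gauss(0, 1) for _ in range(50)]

# recurrence: v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
v = 0.0
for g in grads:
    v = beta2 * v + (1 - beta2) * g ** 2

# closed form: v_t = (1 - beta2) * sum_{i=1}^{t} beta2^(t-i) * g_i^2
t = len(grads)
v_closed = (1 - beta2) * sum(beta2 ** (t - i) * g ** 2
                             for i, g in enumerate(grads, start=1))

print(abs(v - v_closed))  # essentially zero (floating-point noise)
```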

But what we actually want is the second moment of the gradient itself, i.e. E(g_{i}^{2}). The momentum-style estimate v_{t} is therefore biased and needs correction. Taking the expectation of v_{t} (treating E(g_{i}^{2}) as roughly stationary, with \xi absorbing the approximation error) and applying the geometric series formula gives the relationship between E(v_{t}) and E(g_{i}^{2}):

\begin{align*}
E(v_{t})&=(1-\beta_{2})E\left(\sum_{i=1}^{t}\beta_{2}^{t-i}g_{i}^2\right)+\xi \\
&=(1-\beta_{2})(1+\beta_{2}^{1}+\beta_{2}^{2}+\dots+\beta_{2}^{t-1})E(g_{i}^2)+\xi \\
&=(1-\beta_{2})\left(\frac{1-\beta_{2}^{t}}{1-\beta_{2}}\right)E(g_{i}^2)+\xi \\
&=(1-\beta_{2}^{t})E(g_{i}^2)+\xi
\end{align*}

Therefore, to recover E(g_{i}^{2}), you need to divide by the coefficient (1-\beta_{2}^{t}), where \beta_{2}^{t} is the t-th power of \beta_{2} and t is the time step. This is exactly the bias correction \hat{v_{t}}=v_{t}/(1-\beta_{2}^{t}); the same reasoning applies to \hat{m_{t}}.
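The bias and its correction can also be seen numerically. Below, gradients are drawn with a known second moment \sigma^2 = 4, and E(v_{t}) is approximated by averaging over many runs; the numbers (\beta_2 = 0.999, t = 100, 2000 runs) are arbitrary choices for the demonstration:

```python
import random

beta2, sigma2, t_max, runs = 0.999, 4.0, 100, 2000
random.seed(0)

# average v_t over many independent runs to approximate E(v_t)
total_v = 0.0
for _ in range(runs):
    v = 0.0
    for _ in range(t_max):
        g = random.gauss(0, sigma2 ** 0.5)   # E(g^2) = sigma2
        v = beta2 * v + (1 - beta2) * g ** 2
    total_v += v
mean_v = total_v / runs

print(mean_v)                          # biased: about (1 - beta2**t_max) * sigma2
print(mean_v / (1 - beta2 ** t_max))   # after correction: close to sigma2
```

With \beta_2 = 0.999 and t = 100, the factor (1-\beta_{2}^{t}) is only about 0.1, so the raw v_{t} badly underestimates E(g^{2}) early in training; this is why the correction matters most in the first steps.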

These are the main difficulties. For the rest, read up on the optimizers that preceded Adam; many of them build on the same ideas.


Source: blog.csdn.net/BeiErGeLaiDe/article/details/126059488