There are many explanations of the Adam optimizer on the Internet, but they always get stuck on certain points. Here I summarize the parts I found difficult. Please point out any misunderstandings.
The name Adam comes from Adaptive Moment Estimation. It is a jack-of-all-trades optimizer proposed in 2014. It is very convenient to use and descends quickly, but it tends to oscillate near the optimum, so its final performance in competitions is often slightly inferior to SGD; after all, the simplest method is often the most effective. Still, its great ease of use has made Adam widely adopted.
Adam's update formulas:

$$g_t = \nabla_\theta L(\theta_{t-1})$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

Explanation:
The first formula, $g_t$, is the gradient: the partial derivative of the loss function with respect to the parameters.
The second, $m_t$, is the first moment estimate of the gradient at time $t$, kept in momentum form.
The third, $v_t$, is the second moment estimate of the gradient, also in momentum form.
The fourth, $\hat{m}_t$, is the first moment estimate after bias correction; here $\beta_1^t$ means $\beta_1$ raised to the power $t$ (the same notation applies below).
The fifth, $\hat{v}_t$, is the second moment estimate after bias correction.
The last formula is the parameter update itself; you can compare it with RMSProp and the algorithms before it.
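The formulas above can be sketched in a few lines of Python. This is a minimal scalar illustration, not a production implementation; the quadratic loss $f(\theta)=\theta^2$ and the hyperparameter values are example choices of mine, not anything prescribed by Adam itself.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the formulas above (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment, momentum form
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment, momentum form
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)
print(theta)  # ends up close to 0, the minimum of f
```

Note that near the minimum the update $\alpha\,\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$ stays on the order of $\alpha$, which is exactly the oscillation near the optimum mentioned above.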
Questions:
1. Gradient descent: if you are not familiar with gradient descent, it is recommended to study the SGD optimizer first.
2. Momentum: this was already applied in the earlier SGDM (SGD with momentum) optimizer.
3. Moment estimation: if this is unfamiliar, review a university textbook such as "Probability Theory and Mathematical Statistics".
4. Why bias correction is needed:
Here is just my understanding. Take the second moment estimate as an example. Unrolling the recursion $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ with $v_0 = 0$ gives:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2$$

What we actually want is the second moment of the gradient itself, i.e. $\mathbb{E}[g_t^2]$. The momentum-style estimate $v_t$ is therefore biased and needs correction. Taking the expectation of $v_t$ (and assuming the gradient distribution is roughly stationary), the geometric-series formula gives the relationship between $\mathbb{E}[v_t]$ and $\mathbb{E}[g_t^2]$:

$$\mathbb{E}[v_t] \approx \mathbb{E}[g_t^2]\,(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i} = \mathbb{E}[g_t^2]\,(1 - \beta_2^t)$$

Therefore, to recover $\mathbb{E}[g_t^2]$, you need to divide out the factor $(1 - \beta_2^t)$; at a given time $t$ it is a constant ($\beta_2^t$ is $\beta_2$ raised to the power $t$).
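The bias and its correction can be checked numerically: feed a stream of gradients with a constant second moment and compare the raw running average $v_t$ against the corrected estimate. The constant value $g_t^2 = 4$ is an illustrative choice of mine.

```python
beta2 = 0.999
g2 = 4.0          # every gradient has g_t^2 = 4, so the true second moment is 4
v = 0.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g2
    v_hat = v / (1 - beta2 ** t)   # divide out the geometric-series factor

# After 10 steps the raw estimate is still heavily biased toward its zero init:
print(v)       # about 4 * (1 - 0.999**10), i.e. roughly 0.04
# ... while the corrected estimate recovers the true second moment:
print(v_hat)   # approximately 4.0
```

This is why the bias correction matters most in the first steps: as $t$ grows, $\beta_2^t \to 0$ and the correction factor $1 - \beta_2^t$ approaches 1.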
These are the main sticking points. For anything else, you can read up on the optimizers that came before Adam; many of them belong to the same lineage.