Optimizer summary
https://zhuanlan.zhihu.com/p/22252270
SGD optimization algorithm
SGD here refers to mini-batch gradient descent; the specific differences among batch gradient descent, stochastic gradient descent, and mini-batch gradient descent will not be elaborated here. SGD now generally refers to mini-batch gradient descent.
SGD computes the gradient on a mini-batch at each iteration and then updates the parameters; it is the most common optimization method:

g_t = ∇_θ f(θ_{t-1})
Δθ_t = -η · g_t

where η is the learning rate. SGD depends entirely on the gradient of the current batch, and η can be understood as how far the current batch's gradient is allowed to influence the parameter update.
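A minimal sketch of this update rule in Python (the quadratic toy objective, function names, and learning rate are illustrative assumptions, not part of the original):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # SGD: move the parameters opposite the current mini-batch gradient.
    return theta - lr * grad

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
for _ in range(100):
    grad = 2 * theta
    theta = sgd_step(theta, grad, lr=0.1)
# theta ends up close to the minimizer 0
```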
Shortcomings
- Choosing an appropriate learning rate is difficult, and the same learning rate is used to update all parameters. For sparse data or features, we may sometimes want to update rarely occurring features faster and frequently occurring features more slowly; SGD cannot meet this requirement.
- SGD tends to converge to a local optimum, and in some cases may get stuck at a saddle point.
Momentum
Momentum simulates the concept of momentum in physics: an accumulation of past gradients replaces the raw gradient. The formula is as follows:

m_t = μ · m_{t-1} + g_t
Δθ_t = -η · m_t

where μ is the momentum factor.
Characteristics:
- In the early phase of descent, the update direction is consistent with the previous parameter update, so multiplying by a larger μ provides good acceleration.
- In the middle and late phases, when the iterate oscillates back and forth around a local minimum and the gradient approaches 0, μ increases the update magnitude and helps jump out of the trap.
- When the gradient changes direction, μ reduces the update. In summary, the momentum term accelerates SGD in the relevant direction and suppresses oscillation, thereby speeding up convergence.
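The momentum update above can be sketched as follows (the toy objective and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def momentum_step(theta, m, grad, lr=0.01, mu=0.9):
    # Accumulate an exponentially weighted sum of past gradients,
    # then update the parameters using the accumulated momentum.
    m = mu * m + grad
    theta = theta - lr * m
    return theta, m

# Toy example: minimize f(theta) = theta^2 (gradient = 2 * theta).
theta, m = np.array([1.0]), np.zeros(1)
for _ in range(500):
    theta, m = momentum_step(theta, m, 2 * theta, lr=0.01, mu=0.9)
# theta oscillates toward 0 as the momentum term damps out
```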
Adam optimization algorithm
https://www.jianshu.com/p/aebcaf8af76e
First, compute the gradient at time step t:

g_t = ∇_θ f(θ_{t-1})

Then compute the exponential moving average of the gradient; m_0 is initialized to 0:

m_t = β1 · m_{t-1} + (1 − β1) · g_t

Similar to the Momentum algorithm, this takes into account the gradient momentum of previous time steps. The exponential decay rate β1 controls the weighting between the previous momentum and the current gradient, and generally takes a value close to 1. The default is 0.9.
[Figure: for time steps 1 to 20, the weight of each step's gradient in the moving average decays as more steps accumulate.]
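The decaying weights can be reproduced numerically. A short sketch (β1 = 0.9 and the 20-step horizon follow the text; the expansion itself is standard):

```python
# Expanding m_t = beta1 * m_{t-1} + (1 - beta1) * g_t shows that the
# gradient from step i contributes weight (1 - beta1) * beta1**(t - i).
beta1, t = 0.9, 20
weights = [(1 - beta1) * beta1 ** (t - i) for i in range(1, t + 1)]
# The most recent gradient carries the largest weight;
# older gradients decay geometrically toward 0.
```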
Next, compute the exponential moving average of the squared gradient; v_0 is initialized to 0:

v_t = β2 · v_{t-1} + (1 − β2) · g_t²

The exponential decay rate β2 controls the influence of the previous squared gradients. Similar to the RMSProp algorithm, this is a weighted mean of the squared gradients. The default is 0.999.
Third, since m_0 is initialized to 0, m_t will be biased toward 0, especially in the early stages of training. So we correct the bias of the gradient mean m_t to reduce its impact on early training:

m̂_t = m_t / (1 − β1^t)
Fourth, similarly, since v_0 is initialized to 0, v_t is biased toward 0 in the initial stage of training, and it is corrected in the same way:

v̂_t = v_t / (1 − β2^t)
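The effect of the bias correction can be checked numerically. A sketch assuming a constant gradient g = 1 (an illustrative assumption, chosen so the true mean is known to be 1):

```python
# With m_0 = 0 and a constant gradient g = 1, the uncorrected average is
# m_t = 1 - beta1**t, which is biased toward 0 for small t.
beta1, g = 0.9, 1.0
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate
    # m is far below the true mean 1.0 early on; m_hat recovers it exactly.
```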
Fifth, update the parameters: the initial learning rate α is multiplied by the ratio of the corrected gradient mean to the square root of the corrected squared-gradient mean:

θ_t = θ_{t-1} − α · m̂_t / (√v̂_t + ε)

where the default learning rate is α = 0.001 and ε = 10⁻⁸ avoids division by zero.
As the expression shows, the update step size is adjusted adaptively from two angles, the gradient mean and the squared gradient, rather than being determined directly by the current gradient.
Adam code implementation
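A minimal NumPy sketch of the five steps above, using the stated defaults (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸); the quadratic toy objective and function name are illustrative assumptions:

```python
import numpy as np

def adam_step(theta, m, v, grad, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Steps 1-2: exponential moving average of the gradient (momentum-like).
    m = beta1 * m + (1 - beta1) * grad
    # Step 3: exponential moving average of the squared gradient (RMSProp-like).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Steps 4-5: bias correction for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 6: parameter update.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2 (gradient = 2 * theta).
theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 5001):  # note: t starts at 1 for the bias correction
    theta, m, v = adam_step(theta, m, v, 2 * theta, t)
```

With the default learning rate the per-step movement is roughly α, so the toy problem needs a few thousand iterations; in practice the decay rates and learning rate are tuned per task.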