Optimization Introduction

Optimizer summary

https://zhuanlan.zhihu.com/p/22252270

 

SGD optimization algorithm

SGD here refers to mini-batch gradient descent; the specific differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent will not be elaborated here. These days, SGD generally means mini-batch gradient descent.

In SGD, each iteration computes the gradient on a mini-batch and then updates the parameters. It is the most common optimization method, namely:

g_{t}=\nabla_{\theta_{t-1}}f\left ( \theta_{t-1} \right )

\Delta \theta_{t}=-\eta g_{t}

where \eta is the learning rate and g_{t} is the gradient. SGD depends entirely on the gradient of the current batch, so \eta can be understood as how much the current batch's gradient is allowed to influence the parameter update.
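
A minimal NumPy sketch of this update (the helper grad_fn, which returns the mini-batch gradient g_t, and the toy example are illustrative placeholders, not from the original post):

import numpy as np

def sgd_step(params, grad_fn, lr=0.01):
    g = grad_fn(params)     # g_t: gradient computed on the current mini-batch
    return params - lr * g  # theta_t = theta_{t-1} - eta * g_t

# e.g. one step on f(theta) = theta^2, whose gradient is 2 * theta
theta = np.array([1.0])
theta = sgd_step(theta, lambda p: 2 * p)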

Shortcomings

  • Choosing an appropriate learning rate is difficult, and the same learning rate is used to update all parameters. For sparse data or features, we sometimes want to update rarely occurring features faster and frequently occurring features more slowly; plain SGD cannot really meet this requirement.
  • SGD is prone to converging to a local optimum.

Momentum

Momentum simulates the concept of momentum in physics: the accumulated momentum of previous steps replaces the raw gradient. The formula is as follows:

m_{t}=\mu\ast m_{t-1}+g_{t}

\Delta \theta_{t}=-\eta m_{t}
 

where \mu is the momentum factor.
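
A minimal sketch of this update in the same style (again with an illustrative grad_fn returning the current mini-batch gradient; the momentum buffer m is carried between calls):

def momentum_step(params, m, grad_fn, lr=0.01, mu=0.9):
    g = grad_fn(params)        # current mini-batch gradient g_t
    m = mu * m + g             # m_t = mu * m_{t-1} + g_t
    return params - lr * m, m  # theta_t = theta_{t-1} - eta * m_t; return m_t for the next step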

Features:

  • In the early stage of descent, the direction is consistent with the previous parameter update, so multiplying by a larger \mu gives good acceleration.
  • In the middle and late stages of descent, when the iterate oscillates around a local minimum and the gradient \rightarrow 0, \mu keeps the update amplitude large enough to jump out of the trap.
  • When the gradient changes direction, \mu reduces the size of the update. In summary, the momentum term accelerates SGD in the relevant direction and suppresses oscillation, thereby speeding up convergence.

Adam optimization algorithm

https://www.jianshu.com/p/aebcaf8af76e

Adam update rules

Compute the gradient at time step t.
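
In the notation of the SGD section, this is:

g_{t}=\nabla_{\theta_{t-1}}f\left ( \theta_{t-1} \right )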

First, compute the exponential moving average of the gradient; m0 is initialized to 0.
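
Written out, this moving average is:

m_{t}=\beta_{1}m_{t-1}+\left ( 1-\beta_{1} \right )g_{t}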

Similar to the Momentum algorithm, it takes into account the gradient momentum accumulated over previous time steps.

The coefficient β1 is the exponential decay rate, controlling the weight distribution between the momentum and the current gradient; it usually takes a value close to 1.

The default is 0.9

(Figure: a simple illustration for time steps 1 to 20 of how each time step's gradient weight shrinks the longer it has been accumulated.)

 

Next, compute the exponential moving average of the squared gradient; v0 is initialized to 0.
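
Analogously, this average of the squared gradients is:

v_{t}=\beta_{2}v_{t-1}+\left ( 1-\beta_{2} \right )g_{t}^{2}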

The coefficient β2 is the exponential decay rate, controlling the influence of the previous squared gradients.

Similar to the RMSProp algorithm, it is a weighted average of the squared gradients.

The default is 0.999

Third, since m0 is initialized to 0, mt will be biased toward 0, especially during the early stages of training.

So the bias of the gradient mean mt needs to be corrected here, reducing the impact of this bias on the early stage of training.
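
The standard bias-corrected estimate is:

\hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}}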

 

Fourth, similar to m0, since v0 is initialized to 0, vt is biased toward 0 in the initial stage of training and is corrected in the same way.
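
The corresponding correction is:

\hat{v}_{t}=\frac{v_{t}}{1-\beta_{2}^{t}}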

Fifth, update the parameters: the initial learning rate α is multiplied by the ratio of the gradient mean to the square root of the gradient variance.
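
Putting the corrected estimates together, the update expression is:

\theta_{t}=\theta_{t-1}-\alpha \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon}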

where the default learning rate is α = 0.001,

and ε = 10^-8 prevents the divisor from becoming zero.

As can be seen from the expression, the update step size is adjusted adaptively using both the gradient mean and the squared gradient, rather than being determined directly by the current gradient.

Adam code implementation
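
A minimal NumPy sketch of the update rules described above (the class name AdamOptimizer and the toy usage at the end are illustrative, not the original post's code):

import numpy as np

class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None  # first moment estimate m_t
        self.v = None  # second moment estimate v_t
        self.t = 0     # time step

    def step(self, params, grad):
        if self.m is None:
            self.m = np.zeros_like(params)  # m_0 = 0
            self.v = np.zeros_like(params)  # v_0 = 0
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad       # moving average of the gradient
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2  # moving average of the squared gradient
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction for m_t
        v_hat = self.v / (1 - self.beta2 ** self.t)  # bias correction for v_t
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# e.g. minimize f(x) = x^2, whose gradient is 2 * x
opt = AdamOptimizer()
x = np.array([5.0])
for _ in range(2000):
    x = opt.step(x, 2 * x)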


Origin blog.csdn.net/pursuit_zhangyu/article/details/100067391