Neural network optimization steps and commonly used neural network optimizers

Optimizing the parameters of a neural network:
$w$ denotes the parameters to be optimized, $loss$ the loss function, $lr$ the learning rate, $batch$ the data used in each iteration, and $t$ the index of the current batch iteration (how many steps have been taken so far).
The steps of neural network parameter optimization:
1. Compute the gradient of the loss function with respect to the current parameters at step $t$: $g_t = \nabla loss = \frac{\partial loss}{\partial w_t}$
2. Compute the first-order momentum $m_t$ and the second-order momentum $V_t$ at step $t$
3. Compute the descent step at step $t$: $\eta_t = lr \cdot m_t / \sqrt{V_t}$
4. Compute the parameters at step $t+1$: $w_{t+1} = w_t - \eta_t = w_t - lr \cdot m_t / \sqrt{V_t}$

First-order momentum: a function of the gradient.
Second-order momentum: a function of the square of the gradient.
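
As a concrete illustration, here is a minimal sketch of this four-step template in Python/NumPy. The names `momentum_fn` and `second_moment_fn` are placeholders introduced for this sketch (not part of any library); each optimizer below is obtained by plugging in a different pair of rules.

```python
import numpy as np

def optimizer_step(w, g_t, m_prev, V_prev, lr, t, momentum_fn, second_moment_fn):
    """One generic optimizer step: momentum -> descent step -> update.
    g_t is the gradient of the loss at w_t (step 1 is assumed done by the caller);
    t is passed through for rules that depend on the step index."""
    m_t = momentum_fn(m_prev, g_t, t)        # step 2a: first-order momentum
    V_t = second_moment_fn(V_prev, g_t, t)   # step 2b: second-order momentum
    eta_t = lr * m_t / np.sqrt(V_t)          # step 3: descent step
    w_next = w - eta_t                       # step 4: parameter update
    return w_next, m_t, V_t
```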
Commonly used optimizers:
(1) SGD (stochastic gradient descent), without momentum:
$m_t = g_t$, $V_t = 1$
$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t$
$w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t$
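
A minimal sketch of one SGD update in NumPy, assuming the gradient `g_t` has already been computed; `lr` is just an illustrative value:

```python
import numpy as np

def sgd_step(w, g_t, lr=0.01):
    # m_t = g_t, V_t = 1, so the step is simply lr * g_t
    return w - lr * g_t

w = np.array([1.0, -2.0])
g_t = np.array([0.5, 0.3])
w = sgd_step(w, g_t, lr=0.1)   # -> [0.95, -2.03]
```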

(2) SGDM (Stochastic gradient descent with momentum) adds first-order momentum on the basis of SGD.
$m_t = \beta \cdot m_{t-1} + (1-\beta) \cdot g_t$, $V_t = 1$ (the second-order momentum is the constant 1). Here $m_t$ is the exponential moving average of the gradient direction at each step, and $\beta$ is a hyperparameter with a value close to 1.
$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot m_t = lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$
$w_{t+1} = w_t - \eta_t = w_t - lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$
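
A minimal sketch of one SGDM update; `m_prev` is the momentum carried over from the previous step, and the defaults for `lr` and `beta` are only illustrative:

```python
import numpy as np

def sgdm_step(w, g_t, m_prev, lr=0.01, beta=0.9):
    # first-order momentum: exponential moving average of the gradient
    m_t = beta * m_prev + (1 - beta) * g_t
    # V_t = 1, so the descent step is simply lr * m_t
    return w - lr * m_t, m_t
```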

(3) Adagrad adds second-order momentum on the basis of SGD.
$m_t = g_t$, $V_t = \sum_{\tau=1}^{t} g_\tau^2$ (the sum of the squared gradients over all steps so far)
$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t / \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$
$w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t / \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$
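
A minimal sketch of one Adagrad update; `V_prev` holds the running sum of squared gradients, and the small `eps` added to the denominator is a common numerical-stability assumption, not part of the formula above:

```python
import numpy as np

def adagrad_step(w, g_t, V_prev, lr=0.01, eps=1e-7):
    # second-order momentum: running sum of squared gradients over all steps
    V_t = V_prev + g_t ** 2
    # m_t = g_t, so the effective step shrinks as V_t accumulates
    return w - lr * g_t / (np.sqrt(V_t) + eps), V_t
```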

(4) RMSProp adds second-order momentum on the basis of SGD.
$m_t = g_t$, $V_t = \beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2$ (the second-order momentum $V_t$ is an exponential moving average of the squared gradient, i.e. an average over a recent period of time)
$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t / \sqrt{V_t}$
$w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t / \sqrt{V_t}$
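
A minimal sketch of one RMSProp update; as above, `eps` in the denominator is a stability assumption and the defaults are only illustrative:

```python
import numpy as np

def rmsprop_step(w, g_t, V_prev, lr=0.01, beta=0.9, eps=1e-7):
    # second-order momentum: exponential moving average of the squared gradient
    V_t = beta * V_prev + (1 - beta) * g_t ** 2
    # m_t = g_t
    return w - lr * g_t / (np.sqrt(V_t) + eps), V_t
```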

(5) Adam combines the first-order momentum of SGDM with the second-order momentum of RMSProp.
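The original figure is not reproduced here; in the standard Adam formulation (which is presumably what it showed), $m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$ and $V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2$, bias-corrected as $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{V}_t = V_t / (1-\beta_2^t)$, giving $w_{t+1} = w_t - lr \cdot \hat{m}_t / \sqrt{\hat{V}_t}$. A minimal sketch of one such step (the hyperparameter defaults are the commonly used values, and `eps` is again a stability assumption):

```python
import numpy as np

def adam_step(w, g_t, m_prev, V_prev, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-7):
    # first-order momentum (as in SGDM) and second-order momentum (as in RMSProp)
    m_t = beta1 * m_prev + (1 - beta1) * g_t
    V_t = beta2 * V_prev + (1 - beta2) * g_t ** 2
    # bias correction for the first few steps (t starts at 1)
    m_hat = m_t / (1 - beta1 ** t)
    V_hat = V_t / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(V_hat) + eps), m_t, V_t
```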

Origin blog.csdn.net/weixin_45187794/article/details/108101498