Common Deep Learning Optimization Algorithms (Overview)

Here is a brief introduction to the principles and implementation of various optimization algorithms for deep learning.

Gradient Descent

Before discussing gradient descent, we first need to understand the learning rate (denoted here as $\eta$).

The learning rate is the parameter that determines whether the objective function can converge to a local minimum, and how quickly it does so. If the learning rate is too small, gradient descent is very slow and training takes a long time. If the learning rate is too large, the updates overshoot and the algorithm fails to converge.

Although gradient descent is rarely used directly in deep learning, understanding it is key to understanding the stochastic gradient descent algorithm in the next section.

To build intuition, take the objective function (loss function) $f(x) = x^2$.

Its derivative is $f'(x) = 2x$.

We set the initial value $x = 10$ and the learning rate $\eta = 0.2$.

$x_1 = 10$, $f(x_1) = 100$

$x_2 = x_1 - \eta f'(x_1) = 6$, $f(x_2) = 36$

$x_3 = x_2 - \eta f'(x_2) = 3.6$, $f(x_3) = 12.96$

The loss drops as shown in the figure:
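The iteration above can be reproduced with a short Python sketch of the update rule $x \leftarrow x - \eta f'(x)$; the step count and printing are arbitrary choices for illustration:

```python
def f(x):
    return x ** 2          # objective (loss) function


def f_grad(x):
    return 2 * x           # its derivative f'(x)


eta = 0.2                  # learning rate
x = 10.0                   # initial value x_1

for step in range(2, 6):
    x = x - eta * f_grad(x)                        # gradient descent update
    print(f"x_{step} = {x:.4f}, f(x_{step}) = {f(x):.4f}")
# x_2 = 6.0000, f(x_2) = 36.0000
# x_3 = 3.6000, f(x_3) = 12.9600
# ...
```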

Batch Gradient Descent (BGD)

Each iteration uses all training samples to compute the gradient and update the parameters.

Disadvantages: when there are many samples, each update takes a long time, and the time and memory overhead are very large.

What we usually call gradient descent is batch gradient descent: the gradient descent update is applied using all training samples at every iteration.

Stochastic Gradient Descent (SGD)

With gradient descent, the computational cost of each iteration is $\mathcal{O}(n)$, which grows linearly with the number of samples $n$. Therefore, when the training dataset is large, the per-iteration cost of gradient descent is high.

Stochastic Gradient Descent (SGD) reduces the computational cost of each iteration. In each iteration, SGD randomly selects a single sample to compute the gradient and update the parameters, so the per-iteration cost drops to $\mathcal{O}(1)$.

Disadvantages: because each step sees only a limited amount of information, the gradient estimate of stochastic gradient descent is noisy. As a result, the objective function curve converges very unstably, with violent fluctuations, and sometimes does not converge at all.

Mini-batch Gradient Descent (MGD)

Mini-batch gradient descent is a compromise between batch gradient descent and SGD. Each iteration uses a small subset of samples, say 20 or 30, to approximate the gradient over all samples. This is much more accurate than using a single random sample, and a mini-batch still reflects the distribution of the data. In deep learning this method is used the most, because it does not converge too slowly and the local optimum it converges to is generally acceptable. Note: when the mini-batch size is 1, it reduces to SGD.

Disadvantages: like SGD, it is still somewhat unstable near convergence and oscillates around the optimal solution.
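To make the three variants concrete, here is a minimal NumPy sketch of mini-batch gradient descent on a synthetic least-squares problem; the data, learning rate, and batch size are made up for illustration. Setting batch_size = 1 recovers SGD, and batch_size = n recovers batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + rng.normal(scale=0.01, size=n)  # synthetic data

w, b = np.zeros(2), 0.0
eta, batch_size = 0.05, 32        # batch_size = 1 -> SGD, batch_size = n -> batch GD

for epoch in range(5):
    idx = rng.permutation(n)                       # shuffle samples each epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w + b - y[batch]          # prediction error on the mini-batch
        grad_w = X[batch].T @ err / len(batch)     # gradient of the mean squared error
        grad_b = err.mean()
        w -= eta * grad_w                          # parameter update
        b -= eta * grad_b
    loss = np.mean((X @ w + b - y) ** 2) / 2
    print(f"epoch {epoch + 1}: loss = {loss:.6f}")
```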

Momentum Gradient Descent

SGD is prone to oscillation when it encounters ravines, i.e. regions where the loss surface is much steeper in one direction than in another. Introducing momentum accelerates SGD in the consistent descent direction and suppresses the oscillation.

The main idea is to replace the current gradient with an exponentially weighted average of past gradients, which greatly speeds up convergence and helps prevent the stagnation that can occur in stochastic gradient descent.

An exponentially weighted average is added to stochastic gradient descent (SGD) when updating the parameters with the gradients, where the weight $\beta$ is usually set to 0.9:
$$
\begin{aligned}
S_{dW^{[l]}} &= \beta S_{dW^{[l]}} + (1-\beta)\, dW^{[l]} \\
S_{db^{[l]}} &= \beta S_{db^{[l]}} + (1-\beta)\, db^{[l]} \\
W^{[l]} &:= W^{[l]} - \alpha\, S_{dW^{[l]}} \\
b^{[l]} &:= b^{[l]} - \alpha\, S_{db^{[l]}}
\end{aligned}
$$
The effect on the gradient descent process is as follows: when successive gradients point in the same direction, momentum gradient descent accelerates learning; when successive gradients point in opposite directions, it suppresses the oscillation.
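As a rough sketch, the momentum update above ($S = \beta S + (1-\beta)\,dW$, then $W := W - \alpha S$) might look like this in plain NumPy; the parameter and gradient values are invented for illustration:

```python
import numpy as np

def momentum_update(params, grads, velocities, alpha=0.01, beta=0.9):
    """One momentum step: S = beta*S + (1-beta)*grad, then param -= alpha*S."""
    for key in params:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        params[key] -= alpha * velocities[key]
    return params, velocities

# toy usage with made-up parameters and gradients
params = {"W": np.array([1.0, 2.0]), "b": np.array([0.5])}
grads = {"W": np.array([0.1, -0.2]), "b": np.array([0.05])}
velocities = {k: np.zeros_like(v) for k, v in params.items()}
params, velocities = momentum_update(params, grads, velocities)
print(params)
```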

AdaGrad Algorithm

The AdaGrad algorithm maintains an accumulator variable $s_t$ of the element-wise squares of the mini-batch stochastic gradient $g_t$. At time step 0, AdaGrad initializes every element of $s_0$ to 0. At time step $t$, the mini-batch stochastic gradient $g_t$ is first squared element-wise and accumulated into the variable $s_t$:
$$s_t \leftarrow s_{t-1} + g_t \odot g_t$$
where $\odot$ denotes element-wise multiplication. Next, the learning rate of each element of the independent variable is rescaled element-wise:
$$w_t \leftarrow w_{t-1} - \frac{\alpha}{\sqrt{s_t + \epsilon}} \odot g_t$$
The accumulator $s_t$ of element-wise squared gradients appears in the denominator of the learning-rate term:
1. If the partial derivative of the objective function with respect to some parameter is consistently large, that parameter's effective learning rate decreases quickly.
2. Conversely, if the partial derivative with respect to some parameter is consistently small, that parameter's effective learning rate decreases slowly.
However, since $s_t$ keeps accumulating the element-wise squared gradients, the effective learning rate of every element only decreases (or stays unchanged) as the iterations proceed. Therefore, if the learning rate drops rapidly in the early iterations while the current solution is still poor, AdaGrad may find it hard to reach a useful solution later on because the learning rate has become too small.
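A minimal sketch of the AdaGrad update, assuming plain NumPy arrays; the gradient values are made up to show how a coordinate with consistently large gradients gets its effective learning rate reduced faster:

```python
import numpy as np

def adagrad_update(w, grad, s, alpha=0.1, eps=1e-6):
    """One AdaGrad step: accumulate squared gradients, then rescale the step."""
    s = s + grad * grad                        # s_t = s_{t-1} + g_t ⊙ g_t
    w = w - alpha / np.sqrt(s + eps) * grad    # per-element rescaled learning rate
    return w, s

w = np.array([1.0, 1.0])
s = np.zeros_like(w)
for t in range(3):
    grad = np.array([1.0, 0.01])               # made-up gradients: large vs. small coordinate
    w, s = adagrad_update(w, grad, s)
    print(f"step {t + 1}: w = {w}, effective lr = {0.1 / np.sqrt(s + 1e-6)}")
```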

RMSProp Algorithm

When the learning rate drops rapidly in the early iterations and the current solution is still poor, AdaGrad may have difficulty finding a useful solution later because the learning rate has become too small. To address this problem, the RMSProp algorithm makes a small modification to AdaGrad.
Unlike the AdaGrad state variable $s_t$, which is the sum of the element-wise squared mini-batch stochastic gradients $g_t$ up to time step $t$, the RMSProp (Root Mean Square Prop) algorithm keeps an exponentially weighted moving average of the element-wise squared gradients:
$$
\begin{aligned}
s_{dw} &= \beta s_{dw} + (1-\beta)(dw)^2 \\
s_{db} &= \beta s_{db} + (1-\beta)(db)^2 \\
w &:= w - \frac{\alpha}{\sqrt{s_{dw} + \epsilon}}\, dw \\
b &:= b - \frac{\alpha}{\sqrt{s_{db} + \epsilon}}\, db
\end{aligned}
$$
where $\epsilon$ is a small constant added for numerical stability. With this change, the effective learning rate of each element of the independent variable no longer decreases monotonically during the iterations. RMSProp helps reduce oscillation on the path to the minimum and allows a larger learning rate $\alpha$, which speeds up learning.
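For comparison, a sketch of the RMSProp update under the same assumptions (plain NumPy arrays, invented gradient values); the only change from AdaGrad is the exponentially weighted average of squared gradients:

```python
import numpy as np

def rmsprop_update(w, grad, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp step: exponentially weighted average of squared gradients."""
    s = beta * s + (1 - beta) * grad * grad
    w = w - alpha / np.sqrt(s + eps) * grad
    return w, s

w = np.array([1.0, 1.0])
s = np.zeros_like(w)
for t in range(3):
    grad = np.array([0.5, -0.1])               # made-up gradients
    w, s = rmsprop_update(w, grad, s)
    print(f"step {t + 1}: w = {w}")
```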

Adam Algorithm

The Adam optimization algorithm (Adaptive Moment Estimation) combines the Momentum and RMSProp algorithms: on top of RMSProp, Adam also keeps an exponentially weighted moving average of the mini-batch stochastic gradient itself.

Assume each mini-batch is used to compute $dW$ and $db$. At the $t$-th iteration, the moving averages and their bias-corrected versions are:
$$
\begin{aligned}
v_{dW} &= \beta_1 v_{dW} + (1-\beta_1)\, dW \\
v_{db} &= \beta_1 v_{db} + (1-\beta_1)\, db \\
v_{dW}^{\text{corrected}} &= \frac{v_{dW}}{1 - \beta_1^t}, \qquad
v_{db}^{\text{corrected}} = \frac{v_{db}}{1 - \beta_1^t} \\
s_{dW} &= \beta_2 s_{dW} + (1-\beta_2)(dW)^2 \\
s_{db} &= \beta_2 s_{db} + (1-\beta_2)(db)^2 \\
s_{dW}^{\text{corrected}} &= \frac{s_{dW}}{1 - \beta_2^t}, \qquad
s_{db}^{\text{corrected}} = \frac{s_{db}}{1 - \beta_2^t}
\end{aligned}
$$
where these updates are applied per layer $l$, and $t$ counts the steps of the moving average.
The parameters of the Adam algorithm are updated:
$$
\begin{aligned}
W &:= W - \alpha \frac{v_{dW}^{\text{corrected}}}{\sqrt{s_{dW}^{\text{corrected}} + \epsilon}} \\
b &:= b - \alpha \frac{v_{db}^{\text{corrected}}}{\sqrt{s_{db}^{\text{corrected}} + \epsilon}}
\end{aligned}
$$
Hyperparameters of the Adam algorithm:
Learning rate $\alpha$: needs tuning; try a range of values to find a suitable one.
$\beta_1$: the commonly used default value is 0.9.
$\beta_2$: the Adam authors recommend 0.999.
$\epsilon$: the Adam authors recommend the default value $10^{-8}$.
Note: $\beta_1$, $\beta_2$, and $\epsilon$ usually do not need tuning.
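Putting the formulas together, here is a sketch of a single Adam step with bias correction, assuming plain NumPy arrays and made-up gradient values (the function name adam_update is just for illustration):

```python
import numpy as np

def adam_update(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction; t is the iteration count starting at 1."""
    v = beta1 * v + (1 - beta1) * grad            # momentum term (first moment)
    s = beta2 * s + (1 - beta2) * grad * grad     # RMSProp term (second moment)
    v_corr = v / (1 - beta1 ** t)                 # bias correction
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / np.sqrt(s_corr + eps)
    return w, v, s

w = np.array([1.0, -1.0])
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    grad = np.array([0.2, -0.3])                  # made-up gradients
    w, v, s = adam_update(w, grad, v, s, t)
    print(f"step {t}: w = {w}")
```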

Comparison of the effects of all optimization algorithms
(Comparison figure omitted; image source: Alec Radford.)

Learning Rate Scheduler

Learning Rate Annealing

If a fixed learning rate $\alpha$ is used, then near the minimum the noise in different mini-batches prevents precise convergence: the parameters keep fluctuating in a fairly wide region around the minimum.

If instead $\alpha$ is slowly reduced over time, the large initial $\alpha$ gives large step sizes and fast descent early on, while gradually reducing $\alpha$ later shrinks the step size, which helps the algorithm converge and approach the optimal solution.
The most commonly used learning rate annealing methods:
1. Step decay (decay the learning rate every fixed number of steps)
2. Exponential decay: $\alpha = 0.95^{\text{epoch}} \cdot \alpha_0$
3. $1/t$ decay: $\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \cdot \alpha_0$

where decay_rate is the decay rate (a hyperparameter) and epoch is the number of complete passes over the training set.
For large models and datasets it is usually necessary to decay the learning rate automatically with one of these schedules, while for small networks the learning rate can simply be adjusted by hand.
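A small sketch of the exponential and $1/t$ decay schedules above; the initial learning rate and decay rate values are arbitrary:

```python
def exponential_decay(alpha0, epoch, rate=0.95):
    """Exponential decay: alpha = rate**epoch * alpha0."""
    return rate ** epoch * alpha0

def inverse_time_decay(alpha0, epoch, decay_rate=1.0):
    """1/t decay: alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

alpha0 = 0.1
for epoch in range(5):
    print(epoch,
          round(exponential_decay(alpha0, epoch), 4),
          round(inverse_time_decay(alpha0, epoch), 4))
```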

References:

https://zhuanlan.zhihu.com/p/32626442

Hands-on Deep Learning

Overview of Gradient Descent Algorithm
