[Miscellaneous notes] Summary of basic optimization algorithms for deep learning

0. Common problems in training

In deep learning training, three problems are commonly encountered: local minima, saddle points, and vanishing gradients.

Local minima: if the training process falls into a local minimum, the final result is likely to be a local optimum rather than the global optimum. With a certain amount of noise, however, it may be possible to escape the local minimum. This is one advantage of mini-batch stochastic gradient descent: the variation in the mini-batch gradient can push the parameters away from local minima.

Vanishing gradients: there are many ways to alleviate vanishing gradients, such as using the ReLU activation function, BatchNorm, etc.
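As a small illustration of why saturating activations cause vanishing gradients while ReLU does not, here is a toy NumPy sketch (my own example, ignoring weights and using the same pre-activation value at every layer):

```python
import numpy as np

# Toy illustration: backprop through a deep chain of activations.
# With sigmoid, each layer multiplies the gradient by sigmoid'(z) <= 0.25,
# so the gradient shrinks geometrically; with ReLU the factor is 1 on the
# active side, so the gradient survives.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 30
z = 0.5                       # assumed pre-activation value at every layer
grad_sigmoid, grad_relu = 1.0, 1.0
for _ in range(depth):
    grad_sigmoid *= sigmoid(z) * (1 - sigmoid(z))   # sigmoid'(z)
    grad_relu    *= 1.0 if z > 0 else 0.0           # ReLU'(z)

print(f"sigmoid chain gradient: {grad_sigmoid:.3e}")  # ~1e-19, vanished
print(f"ReLU    chain gradient: {grad_relu:.3e}")     # 1.0
```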

1. Gradient descent, stochastic gradient descent, mini-batch stochastic gradient descent

The first thing to note is that most deep learning problems are non-convex. For a loss function $l$, the Hessian matrix is:

$$H = \left[\frac{\partial^2 l}{\partial x_i \partial x_j}\right]_{n\times n}$$

where $n$ is the dimension of the variable $\bm x$ (a vector).
If all the eigenvalues of the Hessian are negative, the point is a local maximum; if all are positive, it is a local minimum; if some are positive and some are negative, it is a saddle point. When $n$ is very large, it is difficult for all eigenvalues to have the same sign, so saddle points are common. For a convex function, the eigenvalues of the Hessian must be non-negative, so the optimization problem in deep learning is not convex optimization.

Supplement: convex functions.

Definition (convex function):
Given a convex set $X$, a function $f: X \rightarrow \mathbb{R}$ (note: convex functions are defined on convex sets) is said to be convex if:

$$\lambda f(x_1) + (1-\lambda) f(x_2) \ge f(\lambda x_1 + (1-\lambda) x_2), \quad \forall x_1, x_2 \in X,\ \lambda \in [0,1]$$

Convex functions have the following properties:
1. Jensen's inequality (the mean of the function values is greater than or equal to the function of the mean, i.e. $E[f(x)] \ge f(E[x])$).
2. A local minimum is a global minimum (proof: by contradiction).
3. The sublevel set is still a convex set, namely
$$S \triangleq \{x \mid x\in X,\ f(x)\le b\}$$
is still convex. (Proof: for any $x, x'\in S$ and $\lambda \in [0,1]$, convexity gives $f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda)f(x') \le b$, so $\lambda x + (1-\lambda)x' \in S$.)
4. As mentioned above, the Hessian matrix is positive semi-definite.
5. Constrained problems can be handled:
$$\min f(x) \quad \mathrm{s.t.} \quad c_i(x)\le 0,\ i\in I$$
Common methods are: 1. the Lagrange multiplier method, 2. penalty terms, 3. projection (i.e. projecting back onto the constraint set; the projection of a point onto a set is the element of the set closest to that point: $P_X(x) \triangleq \arg\min_{x'\in X} \|x - x'\|$). A small sketch of the projection operator is given below.
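As referenced above, here is a minimal sketch (my own example) of the projection operator for two common convex sets; this is the building block of projected gradient descent:

```python
import numpy as np

# Projection onto a Euclidean ball {x : ||x - c|| <= r}: if x is outside,
# move it to the closest boundary point along the line toward the center.
def project_ball(x, center, radius):
    d = x - center
    norm = np.linalg.norm(d)
    if norm <= radius:
        return x
    return center + radius * d / norm

# Projection onto a box {x : lo <= x <= hi}: clip each coordinate.
def project_box(x, lo, hi):
    return np.clip(x, lo, hi)

x = np.array([3.0, 4.0])
print(project_ball(x, np.zeros(2), 1.0))   # [0.6, 0.8]
print(project_box(x, -1.0, 1.0))           # [1., 1.]
```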

1.1 Gradient Descent

Intuitively, we should be able to approach the optimal point by updating the parameters in the direction opposite to the gradient, that is:

$$\bm x \leftarrow \bm x - \eta \nabla f(\bm x), \quad \text{where } f:\mathbb{R}^d\rightarrow \mathbb{R}$$

Here $\bm x$ denotes the parameters and $\eta$ is the learning rate. If $\eta$ is too small, convergence is slow; if it is too large, the iterates may oscillate or even diverge.
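A minimal sketch (my own example) of gradient descent on $f(x)=x^2$, showing the effect of the learning rate described above:

```python
# Gradient descent on f(x) = x^2, whose gradient is f'(x) = 2x.
def gradient_descent(eta, steps=10, x0=10.0):
    x = x0
    for _ in range(steps):
        x -= eta * 2 * x          # x <- x - eta * grad f(x)
    return x

print(gradient_descent(eta=0.05))   # small eta: converges slowly
print(gradient_descent(eta=0.4))    # moderate eta: converges quickly
print(gradient_descent(eta=1.1))    # too large: diverges (|x| grows)
```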

Newton's method can adapt the update step (effectively the learning rate), but it is not practical here because of its excessive storage cost; this is explained in detail below.

The Taylor expansion of $f$ (to second order, with an infinitesimal remainder):

$$f(\bm x + \bm\epsilon) = f(\bm x) + \bm\epsilon^T \nabla f(\bm x) + \frac{1}{2}\bm\epsilon^T \bm H \bm\epsilon + O(\|\bm\epsilon\|^3)$$

Differentiating both sides with respect to $\bm\epsilon$ and setting the gradient to zero gives:

$$\bm 0 = \nabla f(\bm x) + \frac{1}{2}(\bm H + \bm H^T)\bm\epsilon = \nabla f(\bm x) + \bm H \bm\epsilon$$

(using $\frac{d(x^T A x)}{dx} = (A^T + A)x$ and the symmetry of the Hessian matrix)

Then: $\bm\epsilon = -\bm H^{-1}\nabla f(\bm x)$

Updating in this way does indeed get closer to the optimal point, but it takes $O(d^2)$ space to store the Hessian matrix. Moreover, for a non-convex function the Hessian may have negative eigenvalues, in which case the update may move in the wrong direction.
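To make this concrete, a sketch (my own example) of one Newton step on a quadratic, where the full $d\times d$ Hessian must be formed and solved against:

```python
import numpy as np

# One Newton step on f(x) = 0.5 * x^T A x - b^T x (A symmetric positive definite):
# gradient = A x - b, Hessian = A.  Storing the Hessian costs O(d^2) memory.
d = 4
rng = np.random.default_rng(0)
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)               # make A positive definite
b = rng.standard_normal(d)

x = np.zeros(d)
grad = A @ x - b
hess = A                                  # d x d matrix: the O(d^2) storage cost
x_new = x - np.linalg.solve(hess, grad)   # epsilon = -H^{-1} grad f(x)

print(np.allclose(A @ x_new, b))          # True: one step lands on the minimizer
```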

1.2 Stochastic Gradient Descent (SGD)

Note that the gradient descent method above takes the loss over the entire sample set as input. If there are $n$ samples in total, the cost of computing one gradient is $O(n)$, which is very expensive.

If we randomly pick one sample $i$ at a time and denote the loss of that sample under parameters $\bm x$ as $f_i(\bm x)$, then we can use $\nabla f_i(\bm x)$ to estimate the gradient over the whole sample set, and update as follows:

$$\bm x \leftarrow \bm x - \eta \nabla f_i(\bm x)$$

In addition, if the overall loss function is the average of the per-sample losses, i.e. $f(\bm x) = \frac{1}{n}\sum_{i=1}^n f_i(\bm x)$, then $\nabla f_i(\bm x)$ is an unbiased estimator of $\nabla f(\bm x)$,
because: $E[\nabla f_i(\bm x)] = \frac{1}{n}\sum_{i=1}^n \nabla f_i(\bm x) = \nabla f(\bm x)$

The main disadvantages of SGD are that it cannot exploit the parallel computing power of the hardware, and that because only the gradient of a single sample is used, the gradient estimate is noisy and convergence behaviour is poor.
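A minimal SGD sketch (my own example) for a least-squares loss, sampling one index $i$ per step:

```python
import numpy as np

# SGD for f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2.
# Each step uses the gradient of one sample: grad f_i(x) = (a_i^T x - y_i) * a_i.
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true                            # noiseless targets

x, eta = np.zeros(d), 0.01
for t in range(5000):
    i = rng.integers(n)                   # pick one sample at random
    x -= eta * (A[i] @ x - y[i]) * A[i]   # x <- x - eta * grad f_i(x)

print(np.linalg.norm(x - x_true))         # close to 0 (the system is noiseless)
```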

1.3 Mini-batch SGD

To make good use of the parallel computing capability of GPUs, taking only one sample per step as in SGD is somewhat wasteful. Instead, we can use the average gradient over a batch of samples in place of the single-sample gradient:

$$\bm x \leftarrow \bm x - \frac{\eta}{|I_t|}\sum_{i\in I_t}\nabla f_i(\bm x)$$

where $I_t \subset I$.
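The same least-squares example as above with a mini-batch (my own sketch); batch_size = 1 recovers SGD and batch_size = n recovers full gradient descent:

```python
import numpy as np

# Mini-batch SGD for the least-squares loss: average the per-sample
# gradients over a randomly drawn index set I_t of size batch_size.
rng = np.random.default_rng(0)
n, d, batch_size = 1000, 5, 32
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true

x, eta = np.zeros(d), 0.1
for t in range(1000):
    idx = rng.choice(n, size=batch_size, replace=False)   # I_t
    grad = (A[idx] @ x - y[idx]) @ A[idx] / batch_size    # mean of grad f_i
    x -= eta * grad

print(np.linalg.norm(x - x_true))   # close to 0
```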

2. Momentum method

The gradient is the direction in which the function value changes fastest, but following it directly can sometimes cause oscillation. If we also take the gradients of past steps (i.e. past update directions) into account when updating the weights, the update behaves like "inertia" in physics and does not swing too far all at once.

Consider the mini-batch SGD from before, and write
$$\bm g_{t,t-1} = \frac{1}{|I_t|}\sum_{i\in I_t}\nabla f_i(\bm x_{t-1})$$

For convenience, time information is added to the subscript: $\bm g_{t,t-1}$ denotes the update from time $t-1$ to time $t$, and $\bm x_{t-1}$ is the parameter at the previous step.

We define the momentum:
$$\bm v_t = \beta \bm v_{t-1} + \bm g_{t,t-1}, \qquad \bm v_0 = \bm 0$$

That is, the momentum at the current step is composed of the momentum at the previous step and the current gradient. Clearly, the influence of older gradients on the momentum decays over time. To see this, expand the recursion:
$$\bm v_t = \beta\bm v_{t-1} + \bm g_{t,t-1} = \beta(\beta\bm v_{t-2} + \bm g_{t-1,t-2}) + \bm g_{t,t-1} = \dots \\ = \beta^{t-1}(\beta\bm v_0 + \bm g_{1,0}) + \beta^0\bm g_{t,t-1} + \beta\bm g_{t-1,t-2} + \dots + \beta^{t-2}\bm g_{2,1} = \sum_{\tau=0}^{t-1}\beta^\tau \bm g_{t-\tau,t-\tau-1}$$

Therefore, the gradient information is replaced by momentum information, and the weights are updated as follows:

$$\bm v_t \leftarrow \beta\bm v_{t-1} + \bm g_{t,t-1}, \qquad \bm x_t \leftarrow \bm x_{t-1} - \eta\bm v_t$$

When $\beta = 0$, this degenerates into ordinary mini-batch SGD.
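A sketch (my own example) of the momentum update from the two formulas above, applied to a toy quadratic:

```python
import numpy as np

def momentum_step(x, v, grad, eta=0.01, beta=0.9):
    """One momentum update: v <- beta*v + g, x <- x - eta*v."""
    v = beta * v + grad
    x = x - eta * v
    return x, v

# Toy use on f(x) = 0.5 * ||x||^2 (gradient = x); beta = 0 is plain mini-batch SGD.
x, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    x, v = momentum_step(x, v, grad=x)
print(x)   # close to the minimizer [0, 0]
```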

3. RMSprop

The AdaGrad algorithm updates the state variable as follows:
$$\bm s_t = \bm s_{t-1} + \bm g_t^2$$

$\bm g_t$ has the same meaning as before (the square $\bm g_t^2$ is element-wise). The weights are then updated according to:
$$\bm x_t \leftarrow \bm x_{t-1} - \frac{\eta}{\sqrt{\bm s_t + \epsilon}} * \bm g_t$$

where $*$ denotes element-wise multiplication. This means each parameter gets its own learning rate (the vector $\frac{\eta}{\sqrt{\bm s_t + \epsilon}}$ has a different value for each element).

But as time goes on, $\bm s_t$ is likely to keep growing, which drives the effective learning rate toward zero. To constrain this, consider:

$$\bm s_t = \lambda\bm s_{t-1} + (1-\lambda)\bm g_t^2$$

This is the only difference between RMSprop and AdaGrad.
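A side-by-side sketch (my own example) of the two accumulation rules, fed with a constant gradient stream, showing how the per-coordinate effective learning rate $\eta/\sqrt{\bm s_t+\epsilon}$ behaves under each:

```python
import numpy as np

# AdaGrad keeps accumulating squared gradients, so its effective learning rate
# eta/sqrt(s) decays; RMSprop's exponential moving average saturates, so the
# rate stays bounded.  One large and one small gradient coordinate are used.
eta, lam, eps = 0.1, 0.9, 1e-8
grad = np.array([1.0, 0.1])

s_ada = np.zeros(2)
s_rms = np.zeros(2)
for t in range(100):
    s_ada = s_ada + grad ** 2                      # AdaGrad
    s_rms = lam * s_rms + (1 - lam) * grad ** 2    # RMSprop

print(eta / np.sqrt(s_ada + eps))   # keeps shrinking: ~[0.01, 0.1] after 100 steps
print(eta / np.sqrt(s_rms + eps))   # saturates near eta/|grad|: ~[0.1, 1.0]
```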

4. Adam

Combining the previous ideas, the Adam algorithm is produced.

Consider both momentum and state variables:

$$\bm v_t = \beta_1\bm v_{t-1} + (1-\beta_1)\bm g_t, \qquad \bm s_t = \beta_2\bm s_{t-1} + (1-\beta_2)\bm g_t^2$$

In general $1 > \beta_2 > \beta_1$, which means the state variable is updated more slowly than the momentum.

But initializing both to 0 as before introduces a relatively large bias: even though there are gradients from the very beginning, the values of $\bm v_t$ etc. will be small for small $t$. To correct for this early-step bias, they can be normalized as follows:

$$\hat{\bm v}_t = \frac{\bm v_t}{1-\beta_1^t}, \qquad \hat{\bm s}_t = \frac{\bm s_t}{1-\beta_2^t}$$

The denominators are less than 1, so the values are scaled up in the early steps.

As in RMSprop, a different learning rate is assigned to each parameter:

$$\bm g_t' = \frac{\eta}{\sqrt{\hat{\bm s}_t} + \epsilon} * \hat{\bm v}_t$$

Practice shows that $\sqrt{\hat{\bm s}_t} + \epsilon$ works better than $\sqrt{\hat{\bm s}_t + \epsilon}$.

Finally update as follows:

$$\bm x_t \leftarrow \bm x_{t-1} - \bm g_t'$$

In practice, Adam is relatively insensitive to the choice of learning rate.
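Putting the pieces together, a NumPy sketch (my own example) of one Adam parameter update following the formulas above:

```python
import numpy as np

def adam_step(x, v, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t starts at 1 for the bias correction."""
    v = beta1 * v + (1 - beta1) * grad            # momentum
    s = beta2 * s + (1 - beta2) * grad ** 2       # state variable
    v_hat = v / (1 - beta1 ** t)                  # bias correction
    s_hat = s / (1 - beta2 ** t)
    x = x - eta / (np.sqrt(s_hat) + eps) * v_hat  # sqrt(s_hat) + eps, as noted above
    return x, v, s

# Toy use on f(x) = 0.5 * ||x||^2 (gradient = x).
x, v, s = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    x, v, s = adam_step(x, v, s, grad=x, t=t, eta=0.01)
print(x)   # ends up in a small neighborhood of the minimizer [0, 0]
```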

Origin: blog.csdn.net/wjpwjpwjp0831/article/details/122244594