Deep Learning: Gradient Descent Algorithms (continuously updated)

Gradient Descent (GD) algorithm


1. Gradient descent

(1) Basics

1. Gradient

$$\nabla f=\left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right)$$
 

2. Gradient descent

  • Gradient descent is a common method for minimizing risk and loss functions. The gradient points in the direction in which the function rises fastest at a given point; since what is usually required is to minimize the loss function, the update carries a negative sign.
  • In practice, finding a set of parameters that brings the loss function to its global minimum is the ideal case; more commonly, one settles for bringing the model's loss (or accuracy) to a value within an acceptable range.
     

3. Gradient descent algorithm

$$\theta_{1}=\theta_{0}-\alpha \nabla J(\theta_{0})$$
where $\theta_{0}$ and $\theta_{1}$ denote the starting and ending positions, $\alpha$ is the step size (learning rate), and $J(\theta)$ is a function of $\theta$.

The actual steps of the algorithm:

  • Initialize weights and biases with random values
  • Pass the input to the network and get the output value
  • Calculate the error between predicted and true values
  • For each neuron that produces an error, adjust the corresponding (weight) value to reduce the error
  • Repeat iterations until the optimal value of network weights is obtained

 
Code:
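As a minimal sketch of the loop above, assume a toy one-dimensional loss $J(\theta)=\theta^{2}$ (so $\nabla J(\theta)=2\theta$); the loss, starting point, and step size here are illustrative assumptions only:

```python
# Toy loss J(theta) = theta^2 (an assumption for illustration); grad J = 2 * theta.
def grad_J(theta):
    return 2.0 * theta

theta = 5.0   # initial position theta_0 (would normally be random)
alpha = 0.1   # step size / learning rate
for _ in range(100):
    theta -= alpha * grad_J(theta)   # theta_{t+1} = theta_t - alpha * grad J(theta_t)

print(theta)  # approaches the minimizer theta* = 0
```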
 

(2) Classification

1. Batch Gradient Descent (BGD)

$$\theta_{t+1}=\theta_{t}-\alpha \nabla J_{t}(\theta)$$

If MSE is used as the loss function, then
$$\hat{y}_{i}=\sum_{j=0}^{m} w_{j}\, x_{i,j} \quad (w_{0}=b,\ x_{i,0}=1)$$
$$J(w)=\frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_{i}-y_{i})^{2}$$
$$\nabla J(w_{j})=\frac{\partial J(w)}{\partial w_{j}}=\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{i}-y_{i})\,x_{i,j}$$
$$w_{j,t+1}=w_{j,t}-\alpha \nabla J_{t}(w_{j}),\quad t=1,2,\ldots$$
where $j$ indexes the features, $m$ is the total number of features, $i$ indexes the samples, $n$ is the total number of samples, $t$ is the iteration (epoch) number, $\hat{y}_{i}$ is the predicted value, $y_{i}$ is the actual value, and $J(w)$ is the cost function.
 
(1) Advantages: the descent direction is the average gradient over all samples, so the global optimum can be found (for a convex objective)
(2) Disadvantages: the entire sample set must be processed for every update, so it is relatively slow
(3) Features:

  • All samples contribute to each parameter update, so the computed gradient is the full-batch gradient and a single update makes a large step (see the sketch below)
  • When there are not many samples, convergence is very fast
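
A sketch of the update rule above; the synthetic linear-regression data (shapes, seed, iteration count) is an illustrative assumption, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3                                               # n samples, m features
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # x_{i,0} = 1 absorbs the bias b
y = X @ rng.normal(size=m + 1)                              # targets from a random linear model

w = np.zeros(m + 1)
alpha = 0.1
for t in range(500):
    y_hat = X @ w                      # predictions for ALL n samples
    grad = X.T @ (y_hat - y) / n       # (1/n) * sum_i (y_hat_i - y_i) * x_{i,j}
    w -= alpha * grad                  # one full-batch update per iteration
```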
     

2. Mini-batch Gradient Descent (MBGD)

Each update uses $k$ samples, which reflects the sample distribution to some extent:
$$\theta_{t+1}=\theta_{t}-\frac{\alpha}{k} \sum_{i=k(t-1)+1}^{kt} \nabla J_{i,t}(\theta)$$

If MSE is used as the loss function, then
$$\hat{y}_{i}=\sum_{j=0}^{m} w_{j}\, x_{i,j} \quad (w_{0}=b,\ x_{i,0}=1)$$
$$J_{i}(w)=\frac{1}{2}(\hat{y}_{i}-y_{i})^{2}$$
$$\nabla J_{i}(w_{j})=\frac{\partial J_{i}(w)}{\partial w_{j}}=(\hat{y}_{i}-y_{i})\,x_{i,j}$$
$$w_{j,t+1}=w_{j,t}-\frac{\alpha}{k} \sum_{i=k(t-1)+1}^{kt} \nabla J_{i,t}(w_{j}),\quad t=1,2,3,\ldots,\left[\tfrac{n}{k}\right]$$
where $j$ indexes the features, $m$ is the total number of features, $i$ indexes the samples, $n$ is the total number of samples, $t$ is the iteration (epoch) number, $k$ is the batch size, $\hat{y}_{i}$ is the predicted value, $y_{i}$ is the actual value, and $J_{i}(w)$ is the loss function for sample $i$.
 
(1) Advantages: balances training speed with the accuracy of the final convergence (see the sketch below)
(2) Disadvantages: it is difficult to choose an appropriate learning rate
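
A mini-batch sketch under the same assumed synthetic setup as the BGD example; the batch size and epoch count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 3, 10                                        # batch size k = 10 (assumed)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
y = X @ rng.normal(size=m + 1)

w = np.zeros(m + 1)
alpha = 0.1
for epoch in range(100):
    perm = rng.permutation(n)          # shuffle so each batch reflects the data distribution
    for start in range(0, n, k):
        idx = perm[start:start + k]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / k   # average gradient over the k samples
        w -= alpha * grad
```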
 

3. Stochastic Gradient Descent (SGD)

Each iteration updates the parameters using only one randomly chosen sample $p$:
$$\theta_{t+1}=\theta_{t}-\alpha \nabla J_{i=p,t}(\theta),\quad p \in \{1,2,\ldots,n\}$$

If MSE is used as the loss function, then
$$\hat{y}_{i}=\sum_{j=0}^{m} w_{j}\, x_{i,j} \quad (w_{0}=b,\ x_{i,0}=1)$$
$$J_{i}(w)=\frac{1}{2}(\hat{y}_{i}-y_{i})^{2}$$
$$\nabla J_{i}(w_{j})=\frac{\partial J_{i}(w)}{\partial w_{j}}=(\hat{y}_{i}-y_{i})\,x_{i,j}$$
$$w_{j,t+1}=w_{j,t}-\alpha \nabla J_{i=p,t}(w_{j}),\quad p \in \{1,2,\ldots,n\},\ t=1,2,\ldots$$
where $j$ indexes the features, $m$ is the total number of features, $i$ indexes the samples, $n$ is the total number of samples, $t$ is the iteration (epoch) number, $p$ is the index of the randomly chosen sample, $\hat{y}_{i}$ is the predicted value, $y_{i}$ is the actual value, and $J_{i}(w)$ is the loss function for sample $i$.
 
(1) Advantages: only one sample's gradient must be computed per update, so training is very fast
(2) Disadvantages: it easily jumps from one local optimum to another, and accuracy drops
(3) Features:

  • When there are many samples, convergence is fast
  • Each update uses one sample to approximate all samples, so the computed gradient is only a noisy approximation; the interference can leave the iterate stuck in a local optimum (see the sketch below)
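
A sketch of SGD under the same assumed synthetic setup; the smaller learning rate compensates for the noisier single-sample gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
y = X @ rng.normal(size=m + 1)

w = np.zeros(m + 1)
alpha = 0.01
for t in range(5000):
    p = rng.integers(n)                 # pick one random sample p
    grad = (X[p] @ w - y[p]) * X[p]     # (y_hat_p - y_p) * x_{p,j}: a noisy estimate
    w -= alpha * grad
```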
     
3.1. Averaged Stochastic Gradient Descent (ASGD, SAG)

If MSE is used as the loss function and only one random sample $p$ is used to refresh the stored gradient, then
$$\hat{y}_{i,t}=\sum_{j=0}^{m} w_{j,t}\, x_{i,j} \quad (w_{0}=b,\ x_{i,0}=1)$$
$$J_{i,t}(w)=\frac{1}{2}(\hat{y}_{i,t}-y_{i})^{2}$$
$$\nabla J_{i,t}(w_{j})=\frac{\partial J_{i,t}(w)}{\partial w_{j}}=(\hat{y}_{i,t}-y_{i})\,x_{i,j}$$
Initialization:
$$\nabla J_{i,t=1}(w_{j})=\Big(\sum_{j=0}^{m} w_{j,t=1}\, x_{i,j}-y_{i}\Big)x_{i,j}$$
$$w_{j,t+1}=w_{j,t}-\frac{\alpha}{n}\Big[\nabla J_{i=p,t}(w_{j})+\sum_{i\neq p}\nabla J_{i,t-1}(w_{j})\Big],\quad p \in \{1,2,\ldots,n\},\ t=1,2,\ldots$$
where $j$ indexes the features, $m$ is the total number of features, $i$ indexes the samples, $n$ is the total number of samples, $t$ is the iteration (epoch) number, $\hat{y}_{i,t}$ is the predicted value, $y_{i}$ is the actual value, and $J_{i,t}(w)$ is the loss function for sample $i$.
 
Although SGD avoids the high per-step computational cost, its results on large training sets are often unsatisfactory, because each round's gradient update is completely independent of the data and gradients seen in previous rounds. The stochastic average gradient algorithm overcomes this by keeping an old gradient in memory for each sample: at each step it randomly selects the $p$-th sample, refreshes only that sample's stored gradient, leaves the other samples' gradients unchanged, then takes the average of all stored gradients and uses it to update the parameters (see the sketch below).
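
A sketch of that bookkeeping under the same assumed synthetic setup: one gradient is stored per sample, only sample $p$'s entry is refreshed, and the step uses the average of all stored gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
y = X @ rng.normal(size=m + 1)

w = np.zeros(m + 1)
alpha = 0.1
G = (X @ w - y)[:, None] * X          # stored per-sample gradients, initialized at w_0
for t in range(5000):
    p = rng.integers(n)
    G[p] = (X[p] @ w - y[p]) * X[p]   # refresh only sample p's gradient; keep the rest
    w -= alpha * G.mean(axis=0)       # step with the average of all n stored gradients
```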
 

3.2 Stochastic Gradient Descent with Momentum (SGDM)

 

3.3 SGDW

 

3.4 SGDWM

 

3.5 Cyclical LR

 

3.6 SGDR

 

2. Gradient descent optimization algorithm

(1) Supplementary theoretical knowledge

1. Exponentially weighted average

$$V_{t}=\beta V_{t-1}+(1-\beta)\theta_{t}$$
 

2. Bias correction in exponentially weighted average

$$V_{t}=\frac{1}{1-\beta^{t}}\left[\beta V_{t-1}+(1-\beta)\theta_{t}\right],\qquad V_{0}=0$$

3. Exponentially weighted moving average

The correction mainly adjusts the early values; as $t\to \infty$, $\beta^{t}\to 0$, so it has little influence on later values (see the numeric sketch below).
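
A small numeric sketch of the two formulas above, using an assumed constant signal so the early-step bias is easy to see:

```python
beta = 0.9
signal = [1.0] * 50                # constant theta_t = 1 (an illustrative assumption)
V = 0.0                            # V_0 = 0
for t, theta in enumerate(signal, start=1):
    V = beta * V + (1 - beta) * theta        # exponentially weighted average
    V_corrected = V / (1 - beta ** t)        # bias correction
    if t in (1, 2, 10, 50):
        print(t, round(V, 4), round(V_corrected, 4))
# At t = 1, V = 0.1 badly underestimates the signal while the corrected value is 1.0;
# by t = 50, beta^t is tiny and the two values nearly coincide.
```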
 

4. Nesterov acceleration algorithm

5. Newton's method

(2) Momentum optimization algorithm

Momentum borrows the idea of momentum from physics to accelerate gradient descent: when successive gradients point along the same direction, the parameter update gets faster, and when the gradient direction changes, the update gets slower. This accelerates convergence and reduces oscillation.
 

1. Momentum

When the parameters are updated, the direction of previous updates is retained to a certain extent:
$$v_{t+1}=\beta v_{t}+\alpha \nabla J(\theta_{t})$$
$$\theta_{t+1}=\theta_{t}-v_{t+1}$$
 
(1) Features:

  • When the gradient direction changes, momentum reduces the parameter update speed, thereby reducing oscillation
  • When the gradient direction stays the same, momentum speeds up the parameter updates, thereby accelerating convergence (see the sketch below)
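
A sketch of the momentum update on the same assumed toy loss $J(\theta)=\theta^{2}$:

```python
def grad_J(theta):                 # toy gradient (assumes J(theta) = theta^2)
    return 2.0 * theta

theta, v = 5.0, 0.0
alpha, beta = 0.1, 0.9             # illustrative values
for _ in range(200):
    v = beta * v + alpha * grad_J(theta)   # v_{t+1} = beta * v_t + alpha * grad J(theta_t)
    theta -= v                             # theta_{t+1} = theta_t - v_{t+1}
```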
     

2. Nesterov Accelerated Gradient (NAG)

NAG improves on momentum by applying a correction when the gradient is updated: the gradient is evaluated after first adding the momentum of the previous moment to the current position:
$$v_{t+1}=\beta v_{t}-\alpha \nabla J(\theta_{t}+\beta v_{t})$$
$$\theta_{t+1}=\theta_{t}+v_{t+1}$$
Disadvantages:
Implemented directly, this algorithm runs very slowly (roughly twice as slow as momentum), so in practice almost no one uses it in this form; a transformed version is used instead.
 
Derivation:
Let $\theta'_{t}=\theta_{t}+\beta v_{t}$. Then
$$\theta_{t+1}=\theta_{t}+\beta v_{t}-\alpha \nabla J(\theta_{t}+\beta v_{t})$$
$$\Rightarrow \theta'_{t+1}-\beta v_{t+1}=\theta'_{t}-\alpha \nabla J(\theta'_{t})$$
$$\Rightarrow \theta'_{t+1}=\theta'_{t}+\beta\left[\beta v_{t}-\alpha \nabla J(\theta_{t}+\beta v_{t})\right]-\alpha \nabla J(\theta'_{t})$$
$$\Rightarrow \theta'_{t+1}=\theta'_{t}+\beta^{2} v_{t}-\alpha(1+\beta)\nabla J(\theta'_{t})$$
$$\Rightarrow \theta_{t+1}=\theta_{t}+\beta^{2} v_{t}-\alpha(1+\beta)\nabla J(\theta_{t})\quad(\text{renaming } \theta' \text{ back to } \theta)$$

Transformed version:
$$v_{t+1}=\beta^{2} v_{t}-\alpha(1+\beta)\nabla J(\theta_{t})$$
$$\theta_{t+1}=\theta_{t}+v_{t+1}$$
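
A sketch of the transformed version on the same assumed toy loss; note it needs only one gradient evaluation per step, at $\theta_{t}$ itself:

```python
def grad_J(theta):                 # toy gradient (assumes J(theta) = theta^2)
    return 2.0 * theta

theta, v = 5.0, 0.0
alpha, beta = 0.05, 0.9            # illustrative values
for _ in range(300):
    v = beta**2 * v - alpha * (1 + beta) * grad_J(theta)   # transformed NAG velocity
    theta += v                                             # theta_{t+1} = theta_t + v_{t+1}
```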



[Figure: comparison of momentum and NAG update directions]
Momentum (blue): first compute the gradient, then take a big step in the original momentum direction.
NAG (green): first take a big step in the original momentum direction (brown), then compute the gradient there (red); combining them gives the corrected green update.
 

(3) Propagation optimization algorithm

1. Resilient propagation (Rprop)

 

2. Root Mean Square Propagation (RMSprop)

RMSprop is a special case of Adadelta: with $\rho=0.5$, $E$ becomes an average of squared gradients, and taking the square root gives the RMS (root mean square):
$$g_{t}=\nabla J(\theta_{t})$$
$$G_{t}=\beta G_{t-1}+(1-\beta)g_{t}^{2}$$
$$\theta_{t+1}=\theta_{t}-\frac{\alpha}{\sqrt{G_{t}+\varepsilon}}\, g_{t}$$
 
(1) Features: RMSprop is a development of AdaGrad and a variant of Adadelta; its behavior tends to sit between the two. It is suitable for non-stationary objectives and works well for RNNs (see the sketch below).
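
A sketch of the RMSprop update on the same assumed toy loss; because the step is normalized by the RMS, the iterate ends up oscillating within roughly $\alpha$ of the minimum:

```python
import math

def grad_J(theta):                 # toy gradient (assumes J(theta) = theta^2)
    return 2.0 * theta

theta, G = 5.0, 0.0
alpha, beta, eps = 0.01, 0.9, 1e-7
for t in range(2000):
    g = grad_J(theta)
    G = beta * G + (1 - beta) * g**2          # running average of squared gradients
    theta -= alpha / math.sqrt(G + eps) * g   # step divided by the root mean square
```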
 

(4) Adaptive learning rate algorithm

1. Adaptive Gradient (AdaGrad)

$$g_{t}=\nabla J(\theta_{t})$$
$$G_{t}=G_{t-1}+g_{t}^{2}$$
$$\theta_{t+1}=\theta_{t}-\frac{\alpha}{\sqrt{G_{t}+\varepsilon}}\, g_{t}$$
where $\alpha$ is the learning rate, generally $0.01$, and $\varepsilon$ prevents the denominator from being $0$, generally $10^{-7}$.
 
(1) Advantages: in the early stage, larger progress is made along gentle directions of the parameter space; parameters with larger gradients get a smaller learning rate, while for parameters with small gradients the effect is reversed. Parameters can therefore move a bit faster through flat regions instead of stagnating.
(2) Disadvantages: the learning rate is reduced prematurely and excessively; because squared gradients are accumulated without decay, the effective gradient vanishes in the later stages (see the sketch below).
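
A sketch of AdaGrad on the same assumed toy loss; $\alpha$ here is chosen larger than the typical $0.01$ so the shrinking effective step is visible within a few iterations:

```python
import math

def grad_J(theta):                 # toy gradient (assumes J(theta) = theta^2)
    return 2.0 * theta

theta, G = 5.0, 0.0
alpha, eps = 0.5, 1e-7             # alpha chosen large for this toy problem
for t in range(2000):
    g = grad_J(theta)
    G += g**2                                 # accumulate ALL past squared gradients
    theta -= alpha / math.sqrt(G + eps) * g   # effective step shrinks as G only grows
```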
 

2. Adadelta

Adadelta imposes a constraint on AdaGrad's accumulated gradient:
Initialization: $E(g^{2})_{t=0}=0,\ E(h^{2})_{t=0}=0$
$$g_{t}=\nabla J(\theta_{t})$$
$$E(g^{2})_{t}=\beta E(g^{2})_{t-1}+(1-\beta) g_{t}^{2}$$
$$h_{t}=\frac{\sqrt{E(h^{2})_{t-1}+\varepsilon}}{\sqrt{E(g^{2})_{t}+\varepsilon}}\, g_{t}$$
$$E(h^{2})_{t}=\beta E(h^{2})_{t-1}+(1-\beta) h_{t}^{2}$$
$$\theta_{t+1}=\theta_{t}-h_{t}$$
where $\varepsilon$ prevents the denominator from being $0$, generally $10^{-6}$.
 
(1) Features:

  • In the early and middle stages of training, the acceleration effect is good and very fast
  • In the later stage of training, it repeatedly jitters around local minima
  • It does not rely on a global learning rate; manually setting one does not affect the final result (see the sketch below)
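
A sketch of Adadelta on the same assumed toy loss; note that no learning rate appears anywhere in the update:

```python
import math

def grad_J(theta):                 # toy gradient (assumes J(theta) = theta^2)
    return 2.0 * theta

theta = 5.0
Eg2, Eh2 = 0.0, 0.0                # E(g^2)_0 = 0 and E(h^2)_0 = 0
beta, eps = 0.9, 1e-6
for t in range(5000):
    g = grad_J(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g**2                 # decayed average of squared gradients
    h = math.sqrt(Eh2 + eps) / math.sqrt(Eg2 + eps) * g  # step scaled by past update size
    Eh2 = beta * Eh2 + (1 - beta) * h**2                 # decayed average of squared updates
    theta -= h                                           # no global learning rate needed
```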
     

3. Adaptive Moment Estimation (Adam)

Initialization: $m_{0}=0,\ v_{0}=0$
$$g_{t}=\nabla J(\theta_{t})$$
$$m_{t}=\frac{1}{1-\beta_{1}^{t}}\left[\beta_{1} m_{t-1}+(1-\beta_{1})g_{t}\right]$$
$$v_{t}=\frac{1}{1-\beta_{2}^{t}}\left[\beta_{2} v_{t-1}+(1-\beta_{2})g_{t}^{2}\right]$$
$$\theta_{t+1}=\theta_{t}-\frac{\alpha}{\sqrt{v_{t}}+\varepsilon}\, m_{t}$$
where $\alpha$ is the learning rate, generally $0.001$; $\beta_{1}$ and $\beta_{2}$ are smoothing constants (decay rates), generally $0.9$ and $0.999$ respectively; $\varepsilon$ prevents the denominator from being $0$, generally $10^{-8}$.
 
(1) Explanation:

  • Like RMSprop, Adam maintains a running sum of squared gradients $v$ (the $G$ in RMSprop)
  • A new variable $m$ smooths the gradient used in the update
  • Both accumulators are updated while partially retaining their previous values

(2) Features:

  • The parameter updates are relatively stable
  • Good at handling sparse gradients and non-stationary objectives
  • Computes a different adaptive learning rate for each parameter
  • Also suitable for most non-convex optimization problems, for large data sets, and for high-dimensional spaces (see the sketch below)
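
A sketch of Adam on the same assumed toy loss, keeping the uncorrected moments and applying the bias corrections separately (an equivalent way of writing the formulas above):

```python
import math

def grad_J(theta):                 # toy gradient (assumes J(theta) = theta^2)
    return 2.0 * theta

theta = 5.0
m1, v = 0.0, 0.0                   # m_0 = 0, v_0 = 0
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
for t in range(1, 20001):          # t starts at 1 so the corrections are defined
    g = grad_J(theta)
    m1 = beta1 * m1 + (1 - beta1) * g        # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * g**2       # second moment: squared-gradient average
    m_hat = m1 / (1 - beta1**t)              # bias corrections from the formulas above
    v_hat = v / (1 - beta2**t)
    theta -= alpha / (math.sqrt(v_hat) + eps) * m_hat
```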
     
3.1 Adamax

Adds an upper bound (cap) on the learning rate to Adam
 

3.2 AdamW

Adds weight decay to Adam
 

3.3 AMSGrad

 

3.4 NAdam

Introduces the Nesterov acceleration effect into Adam
 

3.5 SparseAdam

Adam for sparse tensors
 

3.6 AdaBound

 

(5) Others

1. SWATS

 

2. RAdam

 

3. Lookahead

 

4. Nesterov accelerated gradient (NAG)

 

Method comparison

| Algorithm | Advantages | Disadvantages | Applicable situations |
| --- | --- | --- | --- |
| Batch gradient descent | When the objective function is convex, the global optimum can be found | Convergence is slow; all the data must be used, so memory consumption is large | Not suitable for large data sets; cannot update the model online |
| Stochastic gradient descent | Avoids the interference of redundant data, accelerates convergence, and can learn online | The variance of the updates is large, so convergence fluctuates and may fall into a local minimum; it is difficult to choose an appropriate learning rate | Models that need to be updated online; large-scale training samples |
| Mini-batch gradient descent | Reduces the variance of the updates; convergence is more stable | Choosing the right learning rate is difficult | |
| Momentum | Accelerates SGD in the relevant directions and suppresses oscillation, thus speeding up convergence | The learning rate must be set manually | Reliable initialization parameters |
| Nesterov | Corrects the current gradient by computing the gradient after a large jump | The learning rate must be set manually | |
| AdaGrad | No need to manually tune each parameter's learning rate | Still relies on a manually set global learning rate: if it is too large, the adjustment to the gradient is too aggressive; in the middle and late stages the accumulated gradient drives updates toward 0, ending training early | When fast convergence is required or when training complex networks; suitable for sparse gradients |
| Adadelta | No default learning rate needs to be preset; the acceleration effect in the early and middle stages of training is good and fast; avoids the problem of inconsistent units on the two sides of the parameter update | In the later stage of training, it repeatedly jitters near a local minimum | When fast convergence is required or when training complex networks |
| RMSprop | Fixes the overly aggressive learning rate decay of AdaGrad | Still depends on a global learning rate | When fast convergence is required or when training complex networks; good for non-stationary objectives, works well for RNNs |
| Adam | Small memory requirements; computes different adaptive learning rates for different parameters | | When fast convergence is required or when training complex networks; good at handling sparse gradients and non-stationary objectives; also suitable for most non-convex optimization, large data sets, and high-dimensional spaces |
  • For sparse data, try to use an optimization method with adaptive learning rate
  • SGD usually takes longer to train, but results are more reliable
  • If you care about faster convergence, it is recommended to use the learning rate adaptive optimization method
  • Adadelta, RMSprop, Adam are relatively similar algorithms

The above is not yet complete and will continue to be updated. It is for personal study only; contact me for removal in case of infringement. If there are any mistakes or omissions, please point them out so they can be corrected.
