[Deep Learning] Detailed Explanation of Optimizers


A deep learning model measures how far its predictions are from the targets by introducing a loss function. Based on the error computed by the loss function, small changes to the model parameters (i.e. weights and biases) are required in order to reduce the prediction error. The question is how to know when the parameters should change and, if so, by how much. This is where the optimizer comes in. In simple terms, the optimizer minimizes the loss function: its job is to change the trainable parameters so that the loss decreases, while the loss function guides the optimizer in the right direction.

The optimizer, i.e. the optimization algorithm, searches for the best model parameters by reducing the gap between the network's own prediction and the true label, which is measured by the loss function.

To find the smallest loss (in practice, a local optimum reached through backpropagation during training), gradient descent (Gradient Descent) is usually used; gradient descent is one kind of optimization algorithm.

1. SGD (gradient descent method)

1.1 Principle

The gradient of a function at a point is the direction along which the directional derivative attains its maximum value, i.e. the direction in which the function changes fastest at that point; the rate of change in that direction is the gradient's magnitude (its modulus).

1.2 Iterative steps of gradient descent method

An intuitive explanation of gradient descent:
Suppose we are somewhere on a large mountain and do not know the way down, so we decide to proceed one step at a time: at each position we compute the gradient of the current position and take a step downhill along the negative gradient direction, i.e. the steepest direction at that point, then compute the gradient again at the new position and take another step. Proceeding step by step in this way, we keep going until we feel we have reached the foot of the mountain. Of course, if we continue like this we may not actually reach the foot of the mountain but only some low point partway down the mountainside (a local minimum).

Take MSE as the objective:

$$J(\theta)=\frac 1m\sum^m_{i=1}(x_i\theta-y_i)^2$$
The goal is to find a suitable set of parameters θ = (w1, w2, w3, …, wn) that minimizes the objective function J(θ), and to find this optimal solution as quickly and efficiently as possible.
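
As a minimal sketch (not from the original post), gradient descent on this MSE objective for a simple linear model might look like the following; the data X, y and the learning rate are made-up illustrative values:

import numpy as np

# Toy data for y ≈ X @ theta; the first column of X is the bias term (illustrative values)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.zeros(2)
lr = 0.1

for step in range(100):
    residual = X @ theta - y                 # prediction error
    grad = (2.0 / len(y)) * X.T @ residual   # gradient of J(theta) w.r.t. theta
    theta -= lr * grad                       # step along the negative gradient

print(theta)  # approaches [1, 1] for this toy data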

1.3 Three different gradient descent methods

The difference between the three lies in how much sample data is used to compute each parameter update.

1.3.1 Batch gradient descent

Batch Gradient Descent works on the entire data set: the gradient direction is computed using all samples.
$$\theta = \theta - \eta \nabla_{\theta}J(\theta)$$

# Pseudocode: evaluate_gradient, loss_function, data and params are placeholders
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)  # gradient over the full dataset
    params = params - learning_rate * params_grad                 # one update per epoch

1.3.2 Stochastic gradient descent

For each parameter update, only one randomly chosen sample $(x^{(i)}, y^{(i)})$ needs to be used to compute the gradient:

$$\theta = \theta - \eta \nabla_{\theta}J(x^{(i)}, y^{(i)}; \theta)$$

# Pseudocode: the data is reshuffled every epoch, then parameters are updated one example at a time
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

1.3.3 Mini-batch gradient descent

Each parameter update uses a small batch of n samples:

$$\theta = \theta - \eta \nabla_{\theta}J(x^{(i:i+n)}, y^{(i:i+n)}; \theta)$$

# Pseudocode: one parameter update per mini-batch of 50 examples
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

1.3.4 Comparison of three methods

  • Batch gradient descent converges too slowly and involves a lot of redundant computation (for example, recomputing gradients on very similar samples).
  • Stochastic gradient descent greatly speeds up convergence, but its updates fluctuate heavily (high variance).
  • Mini-batch gradient descent balances the strengths and weaknesses of the two, so what is usually called the SGD algorithm defaults to mini-batch gradient descent in practice.

1.3.5 Disadvantages of Mini-batch Gradient Descent Method

However, Mini-batch gradient descent cannot guarantee good convergence. There are mainly the following disadvantages:

  • It is very difficult to choose an appropriate learning rate. If the learning rate is too low, the convergence will be slow. If the learning rate is too high, the fluctuation will be too large.

  • All parameters share the same learning rate. For sparse data or features, we sometimes want to update the parameters of infrequently occurring features faster and those of frequently occurring features slower; SGD cannot meet this requirement.

  • SGD easily converges to a local optimum and in some cases may get trapped at a saddle point, although with appropriate initialization and step size the influence of saddle points is not that great.

1.3.6 How does adjusting Batch_Size affect the training effect?

  • If Batch_Size is too small, model performance is extremely poor (the error soars).
  • As Batch_Size increases, the same amount of data is processed faster.
  • As Batch_Size increases, the number of epochs required to reach the same accuracy increases.

Because the last two factors pull in opposite directions, there is a Batch_Size beyond which the training time is no longer optimal; and because the final convergence can land in different local extrema, there is likewise a Batch_Size beyond which the final convergence accuracy no longer improves. If the training set is small (fewer than 2000 samples), simply use batch gradient descent (BGD) directly.

It is precisely because of these shortcomings of SGD that the various algorithms described below were proposed.

1.4 SGD in PyTorch:

torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, *, maximize=False, foreach=None)

Parameters:

  • params (iterable): the parameters to optimize
  • lr (float): learning rate
  • momentum (float, optional): momentum factor (default: 0)
  • weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
  • dampening (float, optional): dampening for momentum (default: 0)
  • nesterov (bool, optional): whether to use Nesterov momentum (default: False)
  • maximize (bool, optional): maximize the objective with respect to the params instead of minimizing it (default: False)
  • foreach (bool, optional): whether to use the foreach (multi-tensor) implementation of the optimizer (default: None)
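
A minimal usage sketch of torch.optim.SGD (the model and data here are made up purely for illustration):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # a toy model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(32, 10)                       # a dummy mini-batch
y = torch.randn(32, 1)

optimizer.zero_grad()                         # clear old gradients
loss = criterion(model(x), y)                 # forward pass and loss
loss.backward()                               # backpropagate
optimizer.step()                              # update the parameters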

2. Momentum

Momentum borrows the idea of momentum from physics: it accelerates the current gradient step by accumulating the previous momentum ($m_{t-1}$).
$$m_t = \mu \cdot m_{t-1}+\eta \nabla_{\theta}J(\theta)$$
$$\theta_{t} = \theta_{t-1} - m_t$$
where μ is the momentum factor, usually set to 0.9 or an approximate value.

Features

  • In the early stage of descent, the previous parameter update is added to the current one; if two successive descent directions agree, multiplying by a relatively large μ gives good acceleration.
  • In the middle and late stages, when the parameters oscillate back and forth near a local minimum and the gradient approaches 0, μ enlarges the update step and helps jump out of the trap.
  • When the gradient direction changes, momentum reduces the update speed and thus the oscillation; when the gradient direction stays the same, momentum speeds up the update and thus accelerates convergence.
  • All in all, momentum can accelerate SGD convergence and suppress oscillations.
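
A minimal NumPy sketch of the momentum update above, assuming a caller-supplied grad(theta) function (a hypothetical placeholder):

import numpy as np

def sgd_momentum(grad, theta, lr=0.01, mu=0.9, steps=100):
    # m_t = mu * m_{t-1} + lr * grad(theta); theta_t = theta_{t-1} - m_t
    m = np.zeros_like(theta)
    for _ in range(steps):
        m = mu * m + lr * grad(theta)   # accumulate velocity
        theta = theta - m               # step with the accumulated momentum
    return theta

# Example: minimize f(theta) = theta^2, whose gradient is 2*theta
print(sgd_momentum(lambda th: 2 * th, np.array([5.0])))  # approaches 0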

3. NAG

Nesterov accelerated gradient (NAG): first take a small step in the direction of the previous momentum, then evaluate the gradient at that point and take another step.

An intuition: with plain momentum, the ball blindly rolls downhill following the gradient, which is error-prone. We would rather have a smarter ball that knows in advance where it is going, so it can slow down before the slope rises again instead of charging up the next slope.

  • Advantage: the descent direction is more accurate
  • Disadvantage: it does not have a large effect on the convergence rate

NAG makes a correction when the gradient is updated, avoiding going too fast, and improving sensitivity at the same time.

In Momentum, the accumulated momentum does not directly affect the current gradient $\nabla_{\theta}J(\theta)$, so NAG's improvement is to use the previous momentum $(-\mu \cdot m_{t-1})$ to correct the point at which the current gradient $\nabla_{\theta}J(\theta)$ is evaluated.

$$m_t = \mu \cdot m_{t-1}+\eta \nabla_{\theta}J(\theta-\mu \cdot m_{t-1})$$
$$\theta_{t} = \theta_{t-1} - m_t$$
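
A minimal NumPy sketch of the NAG update, again with a hypothetical grad(theta) placeholder; the only change from the momentum sketch above is that the gradient is evaluated at the look-ahead point:

import numpy as np

def nag(grad, theta, lr=0.01, mu=0.9, steps=100):
    # Nesterov: evaluate the gradient at the look-ahead point theta - mu * m
    m = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - mu * m          # where momentum alone would take us
        m = mu * m + lr * grad(lookahead)   # gradient measured at the look-ahead point
        theta = theta - m
    return theta

print(nag(lambda th: 2 * th, np.array([5.0])))  # approaches 0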

The comparison between Momentum and NAG is as follows:

  • Momentum: the blue vectors
    Momentum first computes the current gradient (short blue vector) and then adds the previously accumulated gradient/momentum (long blue vector).
  • NAG: the green vector
    NAG first applies the previously accumulated gradient/momentum (long brown vector), then adds the gradient evaluated at the corrected point $\theta - \mu \cdot m_{t-1}$ (red vector); the sum is the final NAG update (green vector).

Both Momentum and NAG are intended to make gradient updates more flexible. However, a manually designed learning rate is always somewhat blunt, so the following sections introduce several adaptive learning rate methods.


4. Adagrad

Adagrad constrains the learning rate. It maintains an accumulator variable $n_t$ holding the element-wise squares of the mini-batch stochastic gradient $g_t$. At time step 0, every element of $n_0$ is initialized to 0. At time step $t$, the element-wise square of the mini-batch stochastic gradient $g_t$ is added to $n_t$:

$$g_t = \nabla_{\theta}J(\theta)$$
$$n_t = n_{t-1}+ (g_t)^2$$
$$\theta_{t} = \theta_{t-1} - \frac{\eta}{\sqrt{n_t+\epsilon}} \cdot g_t = \theta_{t-1} - \frac{\eta}{\sqrt{\sum^t_{r=1}(g_r)^2+\epsilon}} \cdot g_t$$
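
A minimal NumPy sketch of the Adagrad update (the grad(theta) function is a hypothetical placeholder as before):

import numpy as np

def adagrad(grad, theta, lr=0.1, eps=1e-8, steps=100):
    # Divide each step by the root of the accumulated squared gradients
    n = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        n = n + g ** 2                             # accumulate squared gradients
        theta = theta - lr * g / np.sqrt(n + eps)  # per-element adaptive step
    return theta

print(adagrad(lambda th: 2 * th, np.array([5.0])))  # decays toward 0, ever more slowly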

Features

  • In the early stage, when $n_t$ is small, the regularizer $\frac{\eta}{\sqrt{n_t+\epsilon}}$ is large and amplifies the gradient.
  • In the late stage, when $n_t$ is large, the regularizer is small and shrinks the gradient.
  • In the middle and late stages, the accumulated squared gradients in the denominator keep growing, driving the update toward 0 and ending training prematurely.

Shortcomings

  • As the formula shows, it still depends on a manually set global learning rate η.
  • If η is set too large, the regularizer becomes too sensitive and adjusts the gradient too aggressively.
  • Most importantly, the accumulation of squared gradients in the denominator keeps growing in the middle and late stages, driving the update toward 0, so training ends early and learning cannot continue.

Adadelta has mainly improved on the last shortcoming.


5. Adadelta

Adadelta still constrains the learning rate, but simplifies the calculation.

$$g_t = \nabla_{\theta}J(\theta)$$
$$n_t = \upsilon \cdot n_{t-1}+(1-\upsilon)(g_t)^2$$
$$\theta_{t}=\theta_{t-1} - \frac{\eta}{\sqrt{n_t+\epsilon}} \cdot g_t$$

Its state variable is an exponentially weighted moving average of the squared gradient $g_t^2$, which can be viewed as a weighted average of the squared mini-batch stochastic gradients over roughly the most recent $\frac{1}{1-\upsilon}$ time steps. As a result, the learning rate of each parameter element no longer only decreases (or stays unchanged) during the iterations.
Up to this point Adadelta still relies on the global learning rate, so the author uses an approximate Newton iteration to make a further improvement:

$$E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho)(g_t)^2$$
$$\Delta\theta_{t} = - \frac{\sqrt{\sum^{t-1}_{r=1}(\Delta\theta_r)^2}}{\sqrt{E[g^2]_t+\epsilon}} \cdot g_t$$

Among them, E stands for expectation.

At this point, it can be seen that Adadelta no longer depends on the global learning rate.
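
A minimal NumPy sketch of the learning-rate-free Adadelta update (a simplified reading of Zeiler's algorithm; grad(theta) is again a hypothetical placeholder):

import numpy as np

def adadelta(grad, theta, rho=0.95, eps=1e-6, steps=1000):
    # Step size = RMS of past updates / RMS of past gradients (no global learning rate)
    Eg2 = np.zeros_like(theta)   # running average of squared gradients
    Edx2 = np.zeros_like(theta)  # running average of squared updates
    for _ in range(steps):
        g = grad(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        theta = theta + dx
    return theta

print(adadelta(lambda th: 2 * th, np.array([5.0])))  # drifts slowly toward 0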

Features

  • In the early and middle stages of training, the acceleration effect is good and very fast.
  • In the later stage of training, it repeatedly jitters around the local minimum.

6. RMSprop

RMSprop can be seen as a special case of Adadelta.

When ρ = 0.5, $E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho)(g_t)^2$ becomes an average of the squared gradients.

Taking the square root of this average gives the RMS (Root Mean Square):

$$RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$$
$$\Delta\theta_{t} = - \frac{\eta}{RMS[g]_t} \cdot g_t$$
A good default set of hyperparameters is η = 0.001, ρ = 0.9.
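
A minimal NumPy sketch of the RMSprop update with these defaults (hypothetical grad(theta) placeholder):

import numpy as np

def rmsprop(grad, theta, lr=0.001, rho=0.9, eps=1e-8, steps=1000):
    # Divide the step by the RMS of an exponential moving average of squared gradients
    Eg2 = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2         # moving average of g^2
        theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta

print(rmsprop(lambda th: 2 * th, np.array([5.0])))  # moves slowly toward 0 with these cautious defaults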

Features

  • In fact, RMSprop still depends on the global learning rate
  • The effect of RMSprop is between Adagrad and Adadelta
  • Good for handling non-stationary targets - works well for RNNs.

7. Adam

Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term.

$$m_t = \mu \cdot m_{t-1}+(1-\mu) \cdot g_t$$
$$n_t = v \cdot n_{t-1}+(1-v) \cdot (g_t)^2$$
$$\hat{m_t} = \frac{m_t}{1-\mu^t}, \qquad \hat{n_t} = \frac{n_t}{1-v^t}$$
$$\Delta \theta_t = - \frac{\hat{m_t}}{\sqrt{\hat{n_t}}+\epsilon} \cdot \eta$$

$m_t$ and $n_t$ are the first-order and second-order moment estimates of the gradient, and can be regarded as estimates of the expectations $E[g]_t$ and $E[g^2]_t$;

$\hat{m_t}$ and $\hat{n_t}$ are bias corrections of $m_t$ and $n_t$ respectively, and can be viewed approximately as unbiased estimates of those expectations.

Estimating the moments directly from the gradient requires no extra memory and adapts dynamically to the gradient, while the term $-\frac{\hat{m_t}}{\sqrt{\hat{n_t}}+\epsilon}$ forms a dynamic constraint on the learning rate with a clear range.

The default parameter settings proposed by the authors are: μ = 0.9, v = 0.999, ϵ = 10^{-8}.
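
A minimal NumPy sketch of the Adam update with these defaults (hypothetical grad(theta) placeholder):

import numpy as np

def adam(grad, theta, lr=0.001, mu=0.9, v=0.999, eps=1e-8, steps=1000):
    # Bias-corrected first and second moment estimates drive the update
    m = np.zeros_like(theta)
    n = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = mu * m + (1 - mu) * g        # first moment (mean) estimate
        n = v * n + (1 - v) * g ** 2     # second moment (uncentered variance) estimate
        m_hat = m / (1 - mu ** t)        # bias corrections
        n_hat = n / (1 - v ** t)
        theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta

print(adam(lambda th: 2 * th, np.array([5.0])))  # moves toward 0 by roughly lr per step on this toy problem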

Features

  • After the Adam gradient is bias-corrected, the learning rate of each iteration has a fixed range, making the parameters relatively stable.
  • It combines Adagrad's strength at handling sparse gradients with RMSprop's strength at handling non-stationary targets.
  • Compute different adaptive learning rates for different parameters
  • Also suitable for most non-convex optimization problems - suitable for large data sets and high-dimensional spaces.

8. Adamax

Adamax is a variant of Adam that provides a simpler upper bound on the learning rate.
$$n_t = \max(v \cdot n_{t-1}, |g_t|)$$
$$\Delta \theta_t = - \frac{\hat{m_t}}{n_t+\epsilon} \cdot \eta$$

The resulting bound on the Adamax learning rate is simpler.
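
A minimal NumPy sketch of the Adamax update (hypothetical grad(theta) placeholder):

import numpy as np

def adamax(grad, theta, lr=0.002, mu=0.9, v=0.999, eps=1e-8, steps=1000):
    # Replace the second-moment RMS with an exponentially weighted infinity norm
    m = np.zeros_like(theta)
    n = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = mu * m + (1 - mu) * g
        n = np.maximum(v * n, np.abs(g))   # infinity-norm style accumulator
        m_hat = m / (1 - mu ** t)          # bias-correct only the first moment
        theta = theta - lr * m_hat / (n + eps)
    return theta

print(adamax(lambda th: 2 * th, np.array([5.0])))  # moves toward 0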

9. Nadam

Nadam is similar to Adam with a NAG momentum term.

$$\hat{g_t} = \frac{g_t}{1-\prod^t_{i=1}\mu_i}$$
$$m_t = \mu_t \cdot m_{t-1}+(1-\mu_t) \cdot g_t$$
$$\hat{m_t} = \frac{m_t}{1-\prod^{t+1}_{i=1}\mu_i}$$
$$n_t = v \cdot n_{t-1}+(1-v) \cdot (g_t)^2, \qquad \hat{n_t} = \frac{n_t}{1-v^t}$$
$$\bar{m_t} = (1-\mu_t) \cdot \hat{g_t}+\mu_{t+1} \cdot \hat{m_t}$$
$$\Delta \theta_t = - \frac{\bar{m_t}}{\sqrt{\hat{n_t}}+\epsilon} \cdot \eta$$

It can be seen that Nadam has a stronger constraint on the learning rate, and also has a more direct impact on the update of the gradient.

In general, on problems where one would use RMSprop or Adam with momentum, Nadam tends to achieve better results.


10. Visualization of the descent process of several algorithms

10.1. Comparison of the algorithms' gradient descent trajectories:


It can be seen that:

Adagrad, Adadelta, and RMSprop all head toward the optimal solution on the right very quickly, while Momentum and NAG are initially led off course and descend slowly at first. Soon, however, NAG finds the correct descent direction and approaches the optimum more quickly.

SGD descends the most slowly, but its descent direction is always the most correct.

10.2. Comparison at a saddle point:


It can be seen that:

SGD gets stuck at the saddle point and cannot continue optimizing.

SGD, Momentum, and NAG all wobble back and forth around the saddle point, but Momentum and NAG eventually escape it.

Meanwhile, Adagrad, RMSprop, and Adadelta leave the saddle point very quickly.


11. Choice of optimization algorithm

  • For sparse data, prefer an algorithm with an adaptive learning rate; no manual tuning is needed, and it is usually best to use the default parameters.
  • SGD usually takes the longest to train, but with a good initialization and learning rate scheduling scheme the results tend to be more reliable. However, SGD is easily trapped at saddle points, a shortcoming that cannot be ignored.
  • If you care about the speed of convergence and need to train a deeper and more complex network, it is recommended to use the learning rate adaptive optimization method.
  • Adagrad, Adadelta and RMSprop are relatively similar algorithms with similar performance.
  • Where RMSprop or Adam with momentum can be used, Nadam tends to achieve better results.

12. Other Strategies for Optimizing SGD

12.1. Shuffling and Curriculum Learning

Shuffling means re-shuffling the data once after every epoch, which prevents the order of the training samples from biasing the optimization result.

But on the other hand, on some problems, giving the training data a meaningful order may lead to better performance and better convergence. This method of establishing a meaningful order for the training data is called Curriculum Learning.

12.2. Batch Normalization

To learn parameters effectively, we generally initialize them at the start with zero mean and unit variance. As training proceeds, however, the parameters get updated into different value ranges and lose this normalization, which slows down training or causes exploding gradients (especially as the network gets deeper).

BN re-normalizes the data of each mini-batch, and these changes to the data are reversible; that is, the intermediate layers are normalized without losing their expressive power.

After using BN, we can use a higher learning rate and no longer need to spend so much attention on parameter initialization.

BN also has the effect of regularization, and also weakens the need for Dropout.
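
For reference, a minimal PyTorch sketch of inserting BN into a small network (the layer sizes are illustrative, not from the post):

import torch.nn as nn

# A toy MLP with BatchNorm between the linear layer and the activation
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes each mini-batch of activations
    nn.ReLU(),
    nn.Linear(256, 10),
)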

12.3. Early Stopping

During training we monitor the validation error and, with some patience, stop training early if the validation error no longer improves significantly.
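
A minimal sketch of a patience-based early stopping loop; train_one_epoch and validate are hypothetical caller-supplied callables:

def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5, min_delta=1e-4):
    # Stop when the validation error has not improved by min_delta for `patience` consecutive epochs
    best_val, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()                     # caller-supplied training step
        val_error = validate()                # caller-supplied validation step
        if val_error < best_val - min_delta:  # significant improvement: reset the counter
            best_val, bad_epochs = val_error, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_val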

12.4. Gradient noise

Add Gaussian noise to each gradient update:

$$g_{t,i} = g_{t,i} + N(0,\sigma^2_t)$$

The annealing schedule for the variance is:

$$\sigma^2_t = \frac{\eta}{(1+t)^{\gamma}}$$

Neelakantan et al. show that such noise makes the network more robust and helps when training deep and complex networks. They conjecture that adding noise gives the model more chances to escape local optima (which deep models are prone to falling into).
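
A minimal NumPy sketch of injecting annealed Gaussian noise into the gradient (hypothetical grad(theta) placeholder; the values of eta and gamma here are only illustrative):

import numpy as np

def sgd_with_gradient_noise(grad, theta, lr=0.01, eta=0.3, gamma=0.55, steps=1000):
    # Add N(0, sigma_t^2) noise to each gradient, with sigma_t^2 = eta / (1 + t)^gamma
    rng = np.random.default_rng(0)
    for t in range(steps):
        sigma2 = eta / (1 + t) ** gamma   # annealed noise variance
        g = grad(theta) + rng.normal(0.0, np.sqrt(sigma2), size=theta.shape)
        theta = theta - lr * g
    return theta

print(sgd_with_gradient_noise(lambda th: 2 * th, np.array([5.0])))  # ends near 0, with small residual noise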


Source: blog.csdn.net/LogosTR_/article/details/126530513