[Summary and analysis of CV knowledge points] | optimizer and learning rate

【Written in front】

This series of articles is aimed at readers who already know Python and have some programming experience, as well as students and professionals preparing for jobs in artificial intelligence, algorithms, and machine learning. The series covers deep learning, machine learning, computer vision, feature engineering, and more. I believe it can help beginners get started with deep learning quickly, and help job seekers fully master the relevant algorithm knowledge points.

1. What are the common optimizers?

1. SGD (Stochastic Gradient Descent)

Batch Gradient Descent

During each round of training, the batch gradient descent algorithm uses the entire training set to compute the gradient of the cost function, and updates the model parameters with that gradient:

$\Theta = \Theta - \alpha \cdot \nabla_{\Theta} J(\Theta)$

Advantages:

  • If the cost function is convex, convergence to the global optimum is guaranteed; if it is non-convex, it converges to a local optimum.

Disadvantages:

  • Batch gradient descent can be very slow, since each iteration must be computed over the entire dataset.

  • When the training set is large, more memory is required.

  • Batch gradient descent does not allow online updates of the model, such as adding new instances.

Stochastic Gradient Descent

In contrast to the batch gradient descent algorithm, stochastic gradient descent computes the gradient of the cost function and updates the parameters immediately after reading each training example:

$\Theta = \Theta - \alpha \cdot \nabla_{\Theta} J\left(\Theta ; x^{(i)}, y^{(i)}\right)$

Advantages:

  • The algorithm converges quickly (in batch gradient descent, each round computes the gradients of many similar samples, and this computation is redundant).

  • It can be updated online.

  • The noisy updates give it a chance to jump out of a poor local optimum and converge to a better local optimum or even the global optimum.

Disadvantages:

  • It can also converge to a poor local optimum and easily gets stuck at saddle points.

Mini-batch Gradient Descent

Mini-batch gradient descent is a compromise between the two methods above: each update takes a subset (mini-batch) of the training data to compute the gradient:

$\Theta = \Theta - \alpha \cdot \nabla_{\Theta} J\left(\Theta ; x^{(i: i+n)}, y^{(i: i+n)}\right)$

Mini-batch gradient descent computes the gradient of only one mini-batch per iteration, which is both computationally efficient and relatively stable in convergence. It is currently the mainstream method for training deep learning models.
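As a concrete illustration, here is a minimal NumPy sketch of one mini-batch update; the grad function and the data X, y are assumed placeholders, not defined in this article:

import numpy as np

def minibatch_sgd_step(theta, X, y, grad, lr=0.01, batch_size=32):
    # sample a random mini-batch from the training data
    idx = np.random.choice(len(X), batch_size, replace=False)
    # theta = theta - alpha * grad_Theta J(theta; x(i:i+n), y(i:i+n))
    return theta - lr * grad(theta, X[idx], y[idx])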

The main challenges faced by the above three methods are as follows:

  • Choosing an appropriate learning rate α is difficult. A learning rate that is too small leads to slow convergence, while one that is too large causes large fluctuations that hinder convergence.

  • A common remedy is to adjust the learning rate during training, e.g. with an annealing schedule: pre-define a number of iterations m and reduce the learning rate every m steps, or reduce it once the value of the cost function drops below a threshold. However, m and the threshold must be fixed in advance, so this cannot adapt to the characteristics of the dataset.

  • In the methods above, every parameter shares the same learning rate, which is unreasonable: if the training data is sparse and the frequencies of different features vary greatly, it is more sensible to set a larger learning rate for the low-frequency features and a smaller one for the high-frequency features.

  • Recent studies suggest that deep neural networks are hard to train not because they easily fall into local minima; on the contrary, with very complex network structures, even a local minimum gives good results in most cases. The real difficulty is that learning easily gets stuck on saddle surfaces, where the loss rises in some directions and falls in others. This happens most often in flat regions, where the gradient is almost 0 in every direction.

2. Momentum

A disadvantage of the SGD method is that its update direction depends entirely on the gradient of the current batch, so it is very unstable. The Momentum algorithm borrows the concept of momentum from physics: it simulates the inertia of a moving object, i.e. each update retains some of the previous update direction while using the current batch's gradient to fine-tune the final direction. This increases stability to some extent, speeds up learning, and gives some ability to escape local optima:

$v_{t} = \gamma \cdot v_{t-1} + \alpha \cdot \nabla_{\Theta} J(\Theta)$
$\Theta = \Theta - v_{t}$

The Momentum algorithm observes the historical gradient $v_{t-1}$: if the current gradient points in the same direction as the historical one (suggesting the current sample is unlikely to be an outlier), the step in that direction is strengthened; if the directions disagree, the step is attenuated. A vivid picture: we push a ball downhill, and it accumulates momentum, getting faster and faster on the way; γ can be seen as air resistance, and if the ball changes direction, the momentum decays.
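A minimal sketch of the update above, assuming a generic grad_fn that returns the current gradient of J at theta (a placeholder, not defined in this article):

def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad_fn(theta)  # v_t = gamma * v_{t-1} + alpha * gradient
    theta = theta - v                    # theta = theta - v_t
    return theta, v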

3. Nesterov Momentum

As the ball rolls down, we would like it to know in advance where the slope will rise, so that it starts slowing down before reaching the upward slope. This method is Nesterov Momentum, which has strong theoretical convergence guarantees in convex optimization, and in practice it also works better than plain Momentum:

$v_{t} = \gamma \cdot v_{t-1} + \alpha \cdot \nabla_{\Theta} J\left(\Theta - \gamma v_{t-1}\right)$
$\Theta = \Theta - v_{t}$

The core idea: in the momentum method, looking only at the γ·v term, the current θ will become θ − γ·v after the momentum acts. The position θ − γ·v can therefore be regarded as a "look-ahead" position for the current update, so the derivative is taken at θ − γ·v instead of at the original θ.
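A minimal sketch of Nesterov momentum under the same assumed grad_fn: the only change from plain momentum is that the gradient is evaluated at the look-ahead point θ − γ·v:

def nesterov_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * v            # where momentum alone would take us
    v = gamma * v + lr * grad_fn(lookahead)  # gradient at the look-ahead point
    theta = theta - v
    return theta, v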

4. Adagrad

In the methods above, every parameter $\theta_i$ is trained with the same learning rate α. The Adagrad algorithm adjusts the learning rate automatically during training, applying larger updates to parameters that occur with low frequency and smaller updates to parameters that occur with high frequency. Adagrad is therefore well suited to sparse data.

Let $g_{t, i}$ be the gradient of the i-th parameter in round t, i.e. $g_{t, i} = \nabla_{\Theta} J\left(\Theta_{i}\right)$. The SGD parameter update can then be written as:

$\Theta_{t+1, i} = \Theta_{t, i} - \alpha \cdot g_{t, i}$

Adagrad updates the learning rate of each parameter $\theta_i$ in every round of training, with the following update formula:

$\Theta_{t+1, i} = \Theta_{t, i} - \frac{\alpha}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i}$

Here $G_{t} \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose i-th diagonal entry is the sum of the squared gradients of the corresponding parameter $\theta_{i}$ from round 1 to round t. ϵ is a smoothing term that avoids a zero denominator and usually takes the value 1e−8. Adagrad's drawback is that in the middle and late stages of training, the accumulated squared gradients in the denominator grow ever larger, so the update approaches 0 and training effectively stops early.
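A minimal sketch of the Adagrad update, keeping only the diagonal of $G_t$ as a per-parameter accumulator G; grad_fn is again an assumed placeholder:

import numpy as np

def adagrad_step(theta, G, grad_fn, lr=0.01, eps=1e-8):
    g = grad_fn(theta)
    G = G + g ** 2                             # accumulate squared gradients
    theta = theta - lr / np.sqrt(G + eps) * g  # per-parameter step size
    return theta, G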

5. RMSprop

RMSprop is an adaptive learning rate method proposed by Geoff Hinton. Whereas Adagrad accumulates all previous squared gradients, RMSprop keeps only a moving average, which alleviates Adagrad's rapidly shrinking learning rate:

$E\left[g^{2}\right]_{t} = 0.9\, E\left[g^{2}\right]_{t-1} + 0.1\, g_{t}^{2}$
$\Theta_{t+1} = \Theta_{t} - \frac{\alpha}{\sqrt{E\left[g^{2}\right]_{t} + \epsilon}} \cdot g_{t}$
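A minimal sketch under the same assumptions, with Eg2 holding the moving average $E[g^2]_t$:

import numpy as np

def rmsprop_step(theta, Eg2, grad_fn, lr=0.001, eps=1e-8):
    g = grad_fn(theta)
    Eg2 = 0.9 * Eg2 + 0.1 * g ** 2               # moving average of squared gradients
    theta = theta - lr / np.sqrt(Eg2 + eps) * g
    return theta, Eg2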

6. Adam

Adam (Adaptive Moment Estimation) is another adaptive learning rate method. It uses first-moment and second-moment estimates of the gradient to dynamically adjust the learning rate of each parameter. Adam's advantage is that, after bias correction, the learning rate of each iteration stays within a known range, which keeps the parameter updates relatively stable. The formulas are as follows:

$m_{t} = \beta_{1} m_{t-1} + \left(1-\beta_{1}\right) g_{t}$
$v_{t} = \beta_{2} v_{t-1} + \left(1-\beta_{2}\right) g_{t}^{2}$
$\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}}$
$\hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^{t}}$
$\Theta_{t+1} = \Theta_{t} - \frac{\alpha}{\sqrt{\hat{v}_{t}} + \epsilon} \hat{m}_{t}$

In practice Adam also tends to converge better than RMSprop, so it is the most commonly used method and usually reaches a good result relatively quickly.
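A minimal sketch of the Adam update above, including the bias correction; grad_fn, the step counter t, and the zero-initialized states m and v are assumptions for illustration:

import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v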

Comparison of different optimizers

[Figure: descent trajectories of six optimizers on a loss surface]

The figure above shows the performance of six optimizers on a loss surface, from which we can roughly see:

① Descent speed:

The three adaptive learning rate optimizers, Adagrad, RMSprop, and AdaDelta, descend noticeably faster than SGD; Adagrad and RMSprop are neck and neck, both faster than AdaDelta.
The two momentum optimizers, Momentum and NAG, descend slowly at first because they take a detour at the beginning; as they gradually adjust, their descent gets faster and faster, and NAG even overtakes the leading Adagrad and RMSprop in the later stage.

② Descent trajectory:

SGD and the three adaptive optimizers follow roughly the same trajectory. The two momentum optimizers take a "fork in the road" at the start and correct course later.

[Figure: the six optimizers on a surface with a saddle point]

The figure above compares the six optimizers on a surface with a saddle point, from which we can roughly see:

The three adaptive learning rate optimizers do not enter the saddle point; among them, AdaDelta descends fastest, with Adagrad and RMSprop neck and neck.
The two momentum optimizers, Momentum and NAG, as well as SGD, all enter the saddle point. But the two momentum optimizers jitter around the saddle point for a while, escape it, descend rapidly, and come from behind to overtake Adagrad and RMSprop.
Unfortunately, SGD enters the saddle point and stays there without descending further.

[Figure: the six optimizers converging to the target point (pentagram)]

The figure above compares how the six optimizers converge to the target point (the pentagram), from which we can roughly see:

① Running speed

The two momentum optimizers, Momentum and NAG, are the fastest, followed by the three adaptive learning rate optimizers AdaGrad, AdaDelta, and RMSprop; the slowest is SGD.

② Convergence trajectory

Although the two momentum optimizers run very fast, they take a long "fork in the road" in the early and middle stages.
Among the three adaptive optimizers, Adagrad takes a detour early on but quickly corrects it, though it travels the longest path of the three; AdaDelta and RMSprop have similar trajectories, but RMSprop jitters noticeably when close to the target.
Compared with the other optimizers, SGD takes the shortest and most direct path.

2. How to choose an optimizer?

1) If the training data is sparse, choose an adaptive learning rate algorithm (Adagrad, Adadelta, RMSprop, Adam).

2) RMSprop is an extension of Adagrad that addresses its rapidly diminishing learning rate.

3) Adam adds bias correction and momentum to RMSprop. As gradients become sparser, Adam slightly outperforms RMSprop toward the end of optimization. Adam is probably the best choice among the algorithms above.

4) Adadelta, RMSprop, and Adam are very similar algorithms, and in similar situations they all perform well.

3. Using the Optimizer in PyTorch

The fixed collocations during model training are as follows:

loss.backward()
optimizer.step()
optimizer.zero_grad()

To put it simply, loss.backward() back-propagates to compute the gradient of each parameter, optimizer.step() then updates the parameters in the network, and optimizer.zero_grad() clears this round's gradients so they do not affect the next round's update.

Commonly used optimizers are in the torch.optim package, so you need to import the package first:

from torch.optim import Adam, SGD
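Putting the pieces together, here is a minimal runnable sketch of the training collocation above; the linear model, random data, and MSE loss are placeholders for illustration:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder data
for epoch in range(10):
    loss = criterion(model(x), y)
    loss.backward()        # compute the gradient of each parameter
    optimizer.step()       # update the network parameters
    optimizer.zero_grad()  # clear this round's gradients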

4. Optimizer basic properties

Some basic properties common to all Optimizers:

  • lr: the learning rate.

  • eps: a small smoothing term added to the denominator in adaptive optimizers (e.g. Adam, RMSprop) to avoid division by zero and keep the update numerically stable.

  • weight_decay: weight decay, equivalent to L2 regularization on the parameters (keeping model complexity low to prevent overfitting); this value can be understood as the coefficient of the regularization term.

Each Optimizer maintains a param_groups list, which holds the parameters to be optimized and their corresponding attribute settings.

5. Optimizer basic methods

  • **add_param_group(param_group):** adds a parameter group to the optimizer's param_groups. This is useful when fine-tuning a pretrained network: frozen layers can be made trainable and added to the optimizer as training progresses (see the sketch after this list).

  • **load_state_dict(state_dict):** loads the optimizer state. The argument must be an object returned by optimizer.state_dict().

  • **state_dict():** returns a dict containing the optimizer's state: state and param_groups.

  • **step(closure):** performs one parameter update.

  • **zero_grad():** clears the gradients of all parameters being updated.
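A minimal sketch of add_param_group() and state_dict()/load_state_dict(); the two-layer model and the checkpoint path are placeholders for illustration:

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 1))
optimizer = torch.optim.Adam(model[0].parameters(), lr=0.1)  # train layer 0 only

# later in training, unfreeze layer 1 and give it its own learning rate
optimizer.add_param_group({'params': model[1].parameters(), 'lr': 0.01})

# checkpoint the optimizer state and restore it to resume training
torch.save(optimizer.state_dict(), 'optimizer.pth')
optimizer.load_state_dict(torch.load('optimizer.pth'))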

When we construct an optimizer, the simplest method is usually as follows:

model = Net()
optimizer_Adam = torch.optim.Adam(model.parameters(), lr=0.1) 

**model.parameters()** returns all parameters of the model; passing them into the Adam constructor builds an Adam optimizer with learning rate 0.1.

Therefore, the param_groups of this Adam optimizer holds all parameters of model with a learning rate of 0.1, so calling optimizer_Adam.step() updates all of the model's parameters.

6. param_groups

The optimizer's param_groups is a list; each element is an independent parameter group stored as a dict. The structure is as follows:

- param_groups
    - 0 (dict)  # first parameter group
        params:        # the parameters to be updated
        lr:            # learning rate for this group
        betas:
        eps:           # smoothing term for this group
        weight_decay:  # weight decay coefficient for this group
        amsgrad:
    - 1 (dict)  # second parameter group
    - 2 (dict)  # third parameter group
    ...

In this way, many flexible operations can be realized, such as:

1) Train only part of the model's parameters

For example, if you only want to train the layer1 parameters of the model above while keeping the layer2 parameters fixed, the Optimizer can be set up as follows:

model = Net()
# Pass in only layer1's parameters; then only layer1 is updated and the other parameters are unaffected.
optimizer_Adam = torch.optim.Adam(model.layer1.parameters(), lr=0.1)

2) Set different learning rates for different parts of the parameters

For example, to give the model's layer1 parameters a learning rate of 0.1 and layer2's parameters a learning rate of 0.2, the Optimizer can be set up as follows:

model = Net()
params_dict = [{'params': model.layer1.parameters(), 'lr': 0.1},
               {'params': model.layer2.parameters(), 'lr': 0.2}]
optimizer_Adam = torch.optim.Adam(params_dict)

This method is more flexible: a params_dict list is constructed manually to initialize the Optimizer. Note that the key for the parameters in each dict must be **'params'**.

7. Dynamically update the learning rate

After understanding the basic structure and usage of the Optimizer, we will introduce how to dynamically update the learning rate during the training process.

1. Manually modify the learning rate

As mentioned earlier, each parameter group of the Optimizer maintains its own lr, so the most direct way is to manually modify the corresponding lr values in the Optimizer during training.

model = Net()  # build the network
optimizer_Adam = torch.optim.Adam(model.parameters(), lr=0.1)  # build the optimizer

for epoch in range(100):  # suppose we train for 100 epochs
    if epoch % 5 == 0:  # update the learning rate every 5 epochs
        for params in optimizer_Adam.param_groups:
            # iterate over every parameter group and multiply its learning rate by 0.9
            params['lr'] *= 0.9
            # params['weight_decay'] = 0.5  # other attributes can of course be modified too

2. torch.optim.lr_scheduler

The torch.optim.lr_scheduler package provides some classes for dynamically modifying lr.

  • torch.optim.lr_scheduler.LambdaLR

  • torch.optim.lr_scheduler.StepLR

  • torch.optim.lr_scheduler.MultiStepLR

  • torch.optim.lr_scheduler.ExponentialLR

  • torch.optim.lr_scheduler.CosineAnnealingLR

  • torch.optim.lr_scheduler.ReduceLROnPlateau

Since PyTorch 1.1.0, the first lr update is performed automatically when the lr_scheduler object is created (it can be understood as executing scheduler.step() once).

Therefore, when using a scheduler, you should call optimizer.step() first and then scheduler.step().

If, after creating the lr_scheduler object, scheduler.step() is called before optimizer.step(), the first value of the learning rate schedule is skipped.

# call order
loss.backward()
optimizer.step()
scheduler.step()
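As a minimal runnable sketch of this call order, assuming a placeholder linear model and random data, here is an example with StepLR, which multiplies the learning rate by gamma every step_size epochs:

import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# multiply lr by 0.9 every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

x, y = torch.randn(8, 10), torch.randn(8, 1)  # placeholder data
for epoch in range(100):
    loss = F.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # update the parameters first
    scheduler.step()   # then update the learning rate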

【Project recommendation】

A core-code library of top-conference papers for beginners: https://github.com/xmu-xiaoma666/External-Attention-pytorch

A YOLO object detection library for beginners: https://github.com/iscyy/yoloair

Analyses of top-journal and top-conference papers for beginners: https://github.com/xmu-xiaoma666/FightingCV-Paper-Reading

