Optimizer

The optimizer adjusts the model parameters along the gradient-descent direction, using the gradients of the parameters, so that the loss keeps decreasing and, ideally, reaches a minimum. By continuously fine-tuning the parameters in this way, the model learns to extract features from the training data. Commonly used optimizers include SGD, Adam, and AdamW.

SGD

stochastic gradient descent

Randomly select one training sample, compute the gradient on it, and then update the model parameters by subtracting the gradient multiplied by the learning rate.

Advantages: fast training; works reasonably well on non-convex functions.

Disadvantages: at a local optimum or saddle point the gradient is 0, so the parameters can no longer be updated; progress along flat directions is slow while updates oscillate along steep directions, making fast convergence difficult.

$w_{t+1} = w_t - \alpha \cdot g_t$

where $\alpha$ is the learning rate and $g_t$ is the gradient of the loss with respect to the parameters at step $t$.
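
A minimal sketch of this update rule in plain Python/NumPy (the function and variable names, the toy loss, and the learning rate are illustrative, not from the post):

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """One SGD step: w_{t+1} = w_t - lr * g_t."""
    return w - lr * g

# Toy example: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array(0.0)
for _ in range(100):
    g = 2.0 * (w - 3.0)        # gradient of the loss at the current w
    w = sgd_step(w, g)
print(w)                       # converges toward 3.0
```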

SGD-M

SGD with Momentum introduces first-order momentum on top of SGD.
$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$

The parameters are then updated with $m_t$ in place of the raw gradient: $w_{t+1} = w_t - \alpha \cdot m_t$. The momentum term damps the oscillation, but does not eliminate it entirely; when a local basin is deep enough, the optimizer can still get trapped at the local optimum.
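
A sketch of the momentum update following the formula above (names are illustrative; the $(1-\beta_1)$ moving-average form matches the formula here, though some implementations omit that factor):

```python
def sgd_momentum_step(w, g, m, lr=0.1, beta1=0.9):
    """SGD with momentum: m_t = beta1 * m_{t-1} + (1 - beta1) * g_t, then w_{t+1} = w_t - lr * m_t."""
    m = beta1 * m + (1 - beta1) * g   # first-order momentum (EMA of gradients)
    w = w - lr * m
    return w, m
```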

SGD with Nesterov Acceleration

The full name of NAG is Nesterov Accelerated Gradient. It first advances in the direction of the accumulated momentum, computes the gradient at that look-ahead point, and then combines the accumulated momentum with the look-ahead gradient to obtain the new cumulative momentum and the update.
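
The post gives no formula for NAG, so this is one common formulation as a sketch; `grad_fn`, the exact look-ahead point, and the hyperparameters are assumptions:

```python
def nag_step(w, grad_fn, m, lr=0.1, beta1=0.9):
    """Nesterov: evaluate the gradient at the look-ahead point the momentum would reach,
    then fold that gradient back into the cumulative momentum."""
    g_lookahead = grad_fn(w - lr * beta1 * m)   # gradient at the look-ahead point
    m = beta1 * m + g_lookahead                 # updated cumulative momentum
    w = w - lr * m
    return w, m
```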

AdaGrad

An adaptive learning-rate algorithm. To measure how often each model parameter has historically been updated, it introduces second-order momentum: the sum of squares of all gradient values seen so far. Parameters that are updated frequently get an adaptively smaller learning rate, while rarely updated parameters get a larger one.

$V_t = \sum_{r=1}^{t} g_r^2$

Disadvantages: because $V_t$ increases monotonically, the learning rate keeps shrinking, so useful information can no longer be learned from later data; and since the second-order momentum accumulates over the whole training run, it cannot reflect only the recent update frequency.
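
A sketch of the AdaGrad accumulation and per-parameter step (`eps` is the usual small constant for numerical stability; the names and defaults are illustrative):

```python
import numpy as np

def adagrad_step(w, g, v, lr=0.1, eps=1e-8):
    """AdaGrad: v accumulates the sum of squared gradients, so the effective
    per-parameter learning rate lr / sqrt(v) only ever shrinks."""
    v = v + g ** 2
    w = w - lr * g / (np.sqrt(v) + eps)
    return w, v
```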

AdaDelta/RMSProp

Change the second-order momentum to consider only the gradients of a recent window, in the same spirit as the first-order momentum of SGD-M: an exponential moving average, roughly an average over the recent past:

$V_t = \beta_2 \cdot V_{t-1} + (1 - \beta_2) \cdot g_t^2$
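
A sketch of the exponential-moving-average second moment ($\beta_2 = 0.9$ here is a typical choice, an assumption rather than a value from the post):

```python
import numpy as np

def rmsprop_step(w, g, v, lr=0.01, beta2=0.9, eps=1e-8):
    """RMSProp/AdaDelta-style: v is an exponential moving average of squared gradients,
    so only the recent window of gradients sets the effective step size."""
    v = beta2 * v + (1 - beta2) * g ** 2
    w = w - lr * g / (np.sqrt(v) + eps)
    return w, v
```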

Adam

Combining the first-order momentum of SGD-M with the second-order momentum of AdaDelta/RMSProp gives Adam. Together, the first- and second-order momentum effectively control the step size and the update direction, preventing both gradient oscillation and stalling.

Adjustment: to keep the effective learning rate monotonically non-increasing, the second-order momentum can be adjusted as

$V_t = \max(\beta_2 \cdot V_{t-1} + (1 - \beta_2) \cdot g_t^2,\ V_{t-1})$

However, since the learning rate is then still low in the later stages, this may hinder effective convergence.
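
A sketch combining both momenta, with the max-based adjustment above as an option; the bias-correction terms are standard Adam details not shown in the formulas above, and the names and defaults are illustrative:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              use_max=False, v_max=0.0):
    """Adam: first-order momentum m and second-order momentum v, with bias correction.
    t is the 1-based step count. If use_max is True, v is replaced by its running
    maximum so the effective step size never grows."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    if use_max:
        v_max = np.maximum(v_max, v)
        v_eff = v_max
    else:
        v_eff = v
    m_hat = m / (1 - beta1 ** t)          # bias correction for the first moment
    v_hat = v_eff / (1 - beta2 ** t)      # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v, v_max
```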

Nadam

On the basis of Adam, the Nesterov-acceleration strategy of NAG is added: when computing the first-order momentum, first look ahead in the direction of the next step, and then fold the gradient at that point back into the cumulative momentum.

AdamW

In Adam, the weight decay (weight_decay) term does not behave as intended, because it is scaled by the same adaptive learning rate as the gradient.

AdamW decouples weight decay from the gradient update, which makes the weight decay more effective and improves the optimizer's performance.
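
A sketch of the decoupled weight decay: the decay term is applied directly to the weights and is not scaled by the adaptive $1/\sqrt{V_t}$ factor (names and default values are illustrative):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: standard Adam step for the gradient, plus a separate, decoupled decay of the weights."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w                    # decoupled weight decay
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)      # adaptive gradient step
    return w, m, v
```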


Scheduler

Develop an appropriate learning-rate decay strategy. You can use a regular decay schedule, such as decaying every fixed number of epochs, or monitor a performance metric such as accuracy or AUC and reduce the learning rate when the metric on the validation/test set stops improving or declines. A scheduler adjusts the learning rate every certain number of steps or epochs according to such a strategy.
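
The metric-monitoring strategy described above corresponds to PyTorch's ReduceLROnPlateau; a minimal sketch, with a placeholder model and a placeholder validation metric:

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Cut the learning rate by 10x when the monitored metric has not improved for `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=3)

for epoch in range(20):
    optimizer.step()                                # a real loop would train on batches here
    val_loss = 0.5                                  # placeholder validation metric
    scheduler.step(val_loss)                        # the scheduler decides whether to decay the lr
    print(epoch, optimizer.param_groups[0]["lr"])
```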

  • LambdaLR

    Adjust the learning rate with a user-defined function of the epoch

  • StepLR
    $new_{lr} = initial_{lr} \cdot \gamma^{epoch\,//\,size}$
    γ is a hyperparameter in (0, 1); size is the number of epochs between learning-rate updates (see the usage sketch after this list)

  • MultiStepLR

    $new_{lr} = initial_{lr} \cdot \gamma^{bisect\_right(milestones,\ epoch)}$

    Instead of the fixed size interval in StepLR, the learning rate is decayed at the epochs given in the milestones list

  • ExponentialLR
    $new_{lr} = initial_{lr} \cdot \gamma^{epoch}$
    simple exponential learning rate adjustment, updated every epoch

  • LinearLR

    Linear learning-rate adjustment: given the initial and final learning-rate factors and the total number of update steps, the learning rate changes linearly over those steps and is then held at the final value for the remaining steps

  • CyclicLR

    The learning rate cycles continuously up and down between a base (bottom) learning rate and a peak learning rate, with the number of rounds per cycle configurable

  • CosineAnnealingLR

    Cosine-annealing learning rate; T_max sets the half period of the cosine and eta_min the minimum learning rate

  • SequentialLR

    Cascading multiple learning rate schedulers

  • ConstantLR

    Multiplies the learning rate by factor for the first total_iters rounds, then restores the original learning rate afterwards
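
A minimal usage sketch for a few of these PyTorch schedulers (the model, epoch counts, and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Pick one: decay by gamma every step_size epochs, at fixed milestones, or along a cosine curve.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 30, 60], gamma=0.5)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)

for epoch in range(50):
    optimizer.step()                                # a real loop would train on batches here
    scheduler.step()                                # update the learning rate once per epoch
    print(epoch, scheduler.get_last_lr())
```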

warmup

Since the model's weights are randomly initialized at the start of training, choosing a large learning rate right away may make the model unstable (oscillate). With warmup, the learning rate is kept small for the first few epochs or steps, so the model can gradually stabilize under this small preheating learning rate; once the model is relatively stable, training switches to the preset learning rate, which makes the model converge faster and reach a better result.
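
One common way to implement warmup is a LambdaLR whose factor rises linearly over the first few epochs; a sketch, where the warmup length and base learning rate are illustrative:

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_epochs = 5
# The factor grows linearly from 1/warmup_epochs to 1, then stays at 1 (the preset learning rate).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))

for epoch in range(10):
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```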


idea

Here the optimizer takes only one step in one direction based on a single loss value. If some extra storage were added, one could take a few steps in several different directions from the same point, compare the resulting losses, and then continue from the point with the lowest loss, repeating this process; the candidate directions could come from gradient descent or from random modifications.

Since the learning rate of Adam-style optimizers decreases over time, the order in which the training data is fed in has some impact on the final model; perhaps this is one of the ways the random seed affects results. A simple adjustment that puts high-quality and diverse training data first might therefore make the final model better.

Summary

For Adam and AdamW, I still don't fully understand the mathematics of where the weight decay is applied; I will come back and think about it again later.
