Model training techniques: learning rate setting and warmup


A very important point in the model training process is how to adjust the learning rate (lr). There are many techniques for optimizing the learning rate, and warmup is one of the more important and effective ones.

1. Warmup definition

Warmup, which literally means preheating, is a learning rate warm-up method mentioned in the ResNet paper. It uses a smaller learning rate for the first part of training, runs some epochs or steps with it (for example 4 epochs or 10,000 steps), and then switches to the preset learning rate for the rest of training.

2. Learning rate related

-Learning rate setting

  In theory, the learning rate usually behaves as follows:
    learning rate too small → convergence is too slow; learning rate too large → the local optimum is missed.
  In practice, however, a learning rate that is too small → no convergence, and a learning rate that is too large → also no convergence; there is currently no general theory that explains why this happens.

-Some ways to set the learning rate

  1. Under normal circumstances, the learning rate and the other hyperparameters affect each other, but the influence is not very large, so it is generally recommended to pick a suitable learning rate at the start. "Suitable" here does not mean the learning rate that achieves the best accuracy, but one that gives up a little accuracy while converging much faster; this saves a lot of time when tuning the other parameters.
  2. The sample size has a relatively large impact on the learning rate. Generally, there is no need to adjust the learning rate when the sample size remains unchanged. If the sample size is expanded, the learning rate needs to be reconsidered.
  3. Reduce the learning rate in the final stage of training. "Babysitting" the run in this way makes the end of training more stable, and it is easier to settle into the local optimum with a reduced learning rate. You can also increase the batch size, which likewise makes training more stable.
  Babysit: watch the entire training process, and when accuracy on the validation set stops improving, reduce the learning rate and continue training. This amounts to manual fine-tuning; a minimal sketch of this strategy is given after this list.
  4. Common learning rates include 1e-05, 2e-05, and 5e-05.
  When training from scratch, it is common to start trying from 0.01 and adjust when problems appear: if convergence is too fast and the model quickly overfits the training set, reduce the learning rate; if training is too slow or does not converge, increase the learning rate.
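
  As a rough illustration of the "babysit" strategy in item 3, here is a minimal sketch. The patience of 3 epochs, the decay factor of 0.1, the improvement threshold, and the validation-accuracy curve are all assumptions made purely so the example runs on its own; they are not from the original article.

# Minimal "babysit" simulation: watch validation accuracy, and reduce the
# learning rate once it stops improving for `patience` consecutive epochs.
# The accuracy curve is fabricated purely to make the sketch runnable.
learning_rate = 0.01       # starting value suggested above for training from scratch
patience = 3               # stagnant epochs tolerated before cutting the lr (assumption)
min_improvement = 1e-3     # gains smaller than this count as "no improvement" (assumption)
best_val_acc = 0.0
stagnant_epochs = 0

for epoch in range(1, 41):
    val_acc = 0.9 * (1 - 0.8 ** epoch)      # fake curve: fast gains, then a plateau
    if val_acc - best_val_acc > min_improvement:
        best_val_acc = val_acc
        stagnant_epochs = 0
    else:
        stagnant_epochs += 1
        if stagnant_epochs >= patience:
            learning_rate *= 0.1            # babysit: cut the lr and keep training
            stagnant_epochs = 0
            print("epoch %d: val_acc stalled at %.4f, lr reduced to %.6f"
                  % (epoch, val_acc, learning_rate))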

3. The role of warmup

-constant warmup

  Since the weights of the model are randomly initialized at the beginning of training, choosing a large learning rate at that point may make the model unstable (oscillate). Warmup therefore keeps the learning rate small for the first few epochs or steps of training; under this small warm-up learning rate the model can gradually stabilize, and once it is relatively stable the preset learning rate is used, which makes the model converge faster and gives a better final result.
  Example: in the ResNet paper, when training a 110-layer ResNet on CIFAR-10, a learning rate of 0.01 is used first until the training error falls below 80% (roughly 400 training steps), and then training continues with a learning rate of 0.1.
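
  A minimal sketch of this rule follows. The 0.01/0.1 learning rates and the 80% error threshold come from the example above; the training-error curve itself is fabricated just to make the sketch runnable.

# Constant warmup sketch: train at a small lr until the training error drops
# below a threshold, then switch to the preset lr (values from the ResNet example).
warmup_lr = 0.01
preset_lr = 0.1
error_threshold = 0.80

learning_rate = warmup_lr
training_error = 1.0
for step in range(1, 1001):
    training_error *= 0.999              # fabricated error curve, illustration only
    if learning_rate == warmup_lr and training_error < error_threshold:
        learning_rate = preset_lr        # warmup done: jump to the preset lr
        print("step %d: training error %.3f, switching to lr %.2f"
              % (step, training_error, preset_lr))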

-gradual warmup

  The disadvantage of constant warmup is that jumping from a small learning rate to a relatively large one may cause the training error to increase suddenly. In 2018, Facebook therefore proposed gradual warmup to solve this problem: starting from the small initial learning rate, the learning rate is increased a little at each step until it reaches the originally set (relatively large) learning rate, which is then used for the rest of training.
  1. Simulation code for gradual warmup is as follows:

"""
Implements gradual warmup, if train_steps < warmup_steps, the
learning rate will be `train_steps/warmup_steps * init_lr`.
Args:
    warmup_steps:warmup步长阈值,即train_steps<warmup_steps,使用预热学习率,否则使用预设值学习率
    train_steps:训练了的步长数
    init_lr:预设置学习率
"""
import numpy as np

warmup_steps = 2500
init_lr = 0.1
# simulate 15000 training steps
max_steps = 15000
for train_steps in range(max_steps):
    if warmup_steps and train_steps < warmup_steps:
        warmup_percent_done = train_steps / warmup_steps
        warmup_learning_rate = init_lr * warmup_percent_done  # gradual warmup lr
        learning_rate = warmup_learning_rate
    else:
        # learning_rate = np.sin(learning_rate)  # after warmup, the lr follows a sin-style decay
        learning_rate = learning_rate ** 1.0001  # after warmup, the lr decays (approximately exponential decay)
    if (train_steps + 1) % 100 == 0:
        print("train_steps:%.3f--warmup_steps:%.3f--learning_rate:%.3f" % (
            train_steps + 1, warmup_steps, learning_rate))

  2. The warm-up learning rate produced by the code above, together with its decay (sin or exponential) after the warm-up phase ends, is plotted below:
[Figure: warm-up learning rate followed by sin/exponential decay]

4. Common warmup methods

4.1 Constant warmup

The learning rate increases linearly from a very small value to the preset value and then remains unchanged. The coefficient of the learning rate is shown in the figure below:
[Figure: constant warmup learning-rate coefficient]
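
A minimal sketch of this coefficient (the function name and the default of 2500 warm-up steps, borrowed from the simulation above, are illustrative assumptions; multiplying the preset learning rate by the returned coefficient gives the learning rate for that step):

# Constant warmup coefficient: rises linearly to 1.0, then stays at 1.0.
def constant_warmup_coef(step, warmup_steps=2500):
    if step < warmup_steps:
        return step / warmup_steps
    return 1.0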

4.2 Linear warmup

The learning rate increases linearly from a very small value to a preset value, and then decreases linearly. The coefficient of the learning rate is shown in the figure below.
[Figure: linear warmup learning-rate coefficient]
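
A minimal sketch under the same illustrative assumptions (function name and step counts are not from the original article):

# Linear warmup coefficient: rises linearly to 1.0, then decays linearly to 0.
def linear_warmup_coef(step, warmup_steps=2500, total_steps=15000):
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))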

4.3 Cosine warmup

The learning rate first increases linearly from a small value to the preset learning rate, and then decays following a cosine curve. The learning rate coefficient is shown in the figure below.
[Figure: cosine warmup learning-rate coefficient]
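
A minimal sketch, again with illustrative function name and step counts:

import math

# Cosine warmup coefficient: rises linearly to 1.0, then follows a cosine decay to 0.
def cosine_warmup_coef(step, warmup_steps=2500, total_steps=15000):
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
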
This article refers to: [tuning method]-warmup and Warmup warm-up learning rate

Origin blog.csdn.net/weixin_43624728/article/details/112794849