Summary of learning rate adjustment methods for PyTorch model training

Here we mainly record two commonly used learning rate adjustment strategies: learning rate warm-up and learning rate decay.

Learning rate warm-up

Learning rate warm-up gradually increases the learning rate at the beginning of training to help the model converge more stably. The warm-up phase usually covers the first few epochs of training, after which the learning rate decays according to a predefined decay strategy.
The sample code is as follows:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model, optimizer and learning rate scheduler
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

# Warm-up parameters
warmup_epochs = 5
warmup_lr_init = 0.01
warmup_lr_end = 0.1
num_epochs = 30  # total number of training epochs (placeholder)

# Warm-up phase: linearly increase the learning rate
for epoch in range(warmup_epochs):
    # Compute the learning rate for the current warm-up epoch
    warmup_lr = warmup_lr_init + (warmup_lr_end - warmup_lr_init) * epoch / warmup_epochs

    # Write the warm-up learning rate into every parameter group
    for param_group in optimizer.param_groups:
        param_group['lr'] = warmup_lr

    # ... training code for this epoch ...
    # Note: the decay scheduler is intentionally not stepped during warm-up,
    # otherwise it would overwrite the manually set warm-up learning rate

# Normal training phase: the scheduler decays the learning rate
for epoch in range(warmup_epochs, num_epochs):
    # ... training code for this epoch ...

    # Update the learning rate scheduler
    scheduler.step()

In the above example, we first define a model and an optimizer, and then create a learning rate scheduler (StepLR is used here as an example). Next, we set the learning rate warm-up parameters: the number of warm-up epochs (warmup_epochs), the initial learning rate (warmup_lr_init), and the final learning rate (warmup_lr_end).

During the first warmup_epochs epochs of the training loop, we increase the learning rate linearly from warmup_lr_init towards warmup_lr_end. In each epoch, we write the current warm-up learning rate into every parameter group of the optimizer.

After that, we enter the normal training phase and let the learning rate scheduler decay the learning rate (with StepLR and step_size=1, the learning rate is multiplied by gamma after every epoch).

With such a warm-up mechanism, the model adapts to the data more gently in the early stage of training, which improves training stability and performance. Depending on actual needs, the warm-up behavior can be tuned through the number of warm-up epochs, the initial learning rate, and the final learning rate.
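
Newer PyTorch releases (roughly 1.10 and later) also ship LinearLR and SequentialLR schedulers, which can express the same warm-up-then-decay schedule without writing into param_group['lr'] by hand. The following is only a minimal sketch of that approach: the nn.Linear stand-in model, the epoch counts and the StepLR settings are illustrative and not part of the original example.

import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

model = nn.Linear(10, 2)  # stand-in model, just for illustration
optimizer = optim.SGD(model.parameters(), lr=0.1)

warmup_epochs = 5
num_epochs = 30  # illustrative total number of epochs

# Linear warm-up: scale the base lr from 10% to 100% over the first warmup_epochs epochs
warmup = lr_scheduler.LinearLR(optimizer, start_factor=0.1, end_factor=1.0,
                               total_iters=warmup_epochs)
# Decay phase: step decay applied once the warm-up has finished
decay = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# Chain the two schedulers, switching from warm-up to decay at epoch warmup_epochs
scheduler = lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, decay],
                                      milestones=[warmup_epochs])

for epoch in range(num_epochs):
    # ... training code for this epoch ...
    scheduler.step()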

Learning rate decay

Learning rate decay is a common strategy in model training that gradually reduces the learning rate as training progresses.
Common learning rate decay methods include:

  1. Fixed-schedule decay: decay at artificially predefined intervals or step counts;
  2. Adaptive decay: monitor a chosen metric (such as the loss) and adjust the learning rate when the metric stops changing or changes very little;
  3. Custom decay: the Lambda-based adjustment strategy is very flexible. We can set a different learning rate adjustment rule for each layer, which is very useful for fine-tuning: different layers can be given not only different learning rates but also different learning rate schedules (a concrete fine-tuning sketch is given under LambdaLR below).

For these three decay strategies, PyTorch provides built-in learning rate schedulers; the six most commonly used ones are described below:

1) StepLR learning rate decay (decay at a fixed interval)

Function definition

class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)

Function: adjusts the learning rate at equal intervals; every step_size steps the learning rate is multiplied by gamma. Note that a "step" here usually refers to an epoch, not an iteration.

Parameters:
step_size(int): the interval, in epochs, between learning rate adjustments. If it is 30, the learning rate is multiplied by gamma at epochs 30, 60, 90, and so on.
gamma(float): the multiplicative factor applied to the learning rate; the default is 0.1, i.e. the learning rate is divided by 10.
last_epoch(int): the index of the last epoch, used to determine whether the learning rate needs to be adjusted. When last_epoch reaches the configured interval, the learning rate is adjusted. When it is -1, the learning rate is set to the initial value.

Sample code:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model, optimizer and learning rate scheduler
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Update the learning rate inside the training loop
for epoch in range(num_epochs):
    # ... training code for this epoch ...

    # Update the learning rate
    scheduler.step()
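
To see the schedule this produces, the following minimal sketch (not part of the original post) replaces the model with a single dummy parameter and prints the learning rate used in each epoch via get_last_lr():

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# A single dummy parameter stands in for a real model
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # Prints 0.1 for epochs 0-9, then ~0.01 for 10-19, then ~0.001 for 20-29
    print(epoch, scheduler.get_last_lr())
    scheduler.step()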

2) MultiStepLR learning rate decay

Function definition

class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)

Function: decays the learning rate at the specified epochs.

Parameters:
milestones(list): a list of epoch indices at which the learning rate is adjusted; the elements must be increasing, for example milestones=[30, 90, 120].
gamma(float): the multiplicative factor applied to the learning rate; the default is 0.1, i.e. the learning rate is divided by 10.
last_epoch(int): same as above.

Sample code:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model, optimizer and learning rate scheduler
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)
milestones = [30, 90, 120]  # list of milestone epochs
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)

# Update the learning rate inside the training loop
for epoch in range(num_epochs):
    # ... training code for this epoch ...

    # Update the learning rate
    scheduler.step()
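
As with StepLR, the milestone behavior is easy to verify with a dummy parameter. The sketch below is illustrative only and prints the learning rate just before and after each milestone:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter instead of a real model
optimizer = optim.SGD([param], lr=0.1)
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[30, 90, 120], gamma=0.1)

for epoch in range(150):
    if epoch in (29, 30, 89, 90, 119, 120):
        # The learning rate drops by a factor of 10 at epochs 30, 90 and 120
        print(epoch, scheduler.get_last_lr())
    scheduler.step()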

3) CosineAnnealingLR learning rate decay (cosine annealing decay)

Function definition

class torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)

Function: adjusts the learning rate along a cosine curve and restarts it at its maximum value at the beginning of each cycle. The idea is to gradually reduce the learning rate from a large initial value down to a small minimum, and then gradually raise it back to the initial value, so that the learning rate changes smoothly over the course of training.
Specifically, CosineAnnealingLR computes the learning rate for each epoch from the given half-period (T_max) and minimum learning rate (eta_min). The learning rate follows a decreasing cosine curve in the first half of a cycle and gradually increases back to the initial value in the second half; a full cycle therefore lasts 2 * T_max epochs, after which the learning rate is back at its initial value.
Parameters:
T_max(int): the number of epochs in half a cosine cycle, i.e. the learning rate reaches eta_min after T_max epochs.
eta_min(float): the minimum learning rate, i.e. the lowest value the learning rate reaches within a cycle; the default is 0.

Sample code:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model and optimizer
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Define the learning rate scheduler
T_max = 10  # half-period of the cosine cycle, in epochs
eta_min = 0.01  # minimum learning rate
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=T_max, eta_min=eta_min)

# Update the learning rate inside the training loop
for epoch in range(num_epochs):
    # ... training code for this epoch ...

    # Update the learning rate scheduler
    scheduler.step()
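
For reference, the cosine schedule can also be written in closed form as eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T_max)). The short sketch below (plain Python, no scheduler involved) evaluates this formula to show that the learning rate reaches eta_min after T_max epochs and returns to the initial value after a full 2 * T_max cycle:

import math

base_lr = 0.1   # initial learning rate (eta_max)
eta_min = 0.01  # minimum learning rate
T_max = 10      # epochs needed to go from the maximum down to the minimum

def cosine_lr(epoch):
    # Closed-form cosine annealing curve for a single parameter group
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max))

print(cosine_lr(0))          # 0.1  -> the initial value
print(cosine_lr(T_max))      # 0.01 -> the minimum, reached after T_max epochs
print(cosine_lr(2 * T_max))  # 0.1  -> back to the initial value after a full cycle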

4) ExponentialLR learning rate decay (exponential decay)

Function definition

class torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)

Function: adjusts the learning rate with exponential decay; the adjustment formula is lr = initial_lr * gamma ** epoch (equivalently, the learning rate is multiplied by gamma after every epoch).

Parameters:
gamma(float): the base of the exponential decay; the exponent is the epoch index, i.e. the learning rate is scaled by gamma ** epoch.
last_epoch(int): same as above.

Sample code:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model, optimizer and learning rate scheduler
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Update the learning rate inside the training loop
for epoch in range(num_epochs):
    # ... training code for this epoch ...

    # Update the learning rate
    scheduler.step()
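
A quick numeric illustration of the formula (plain Python, not part of the original post): with an initial learning rate of 0.1 and gamma = 0.95,

initial_lr = 0.1
gamma = 0.95

# ExponentialLR multiplies the learning rate by gamma after every scheduler.step(),
# so after `epoch` steps the learning rate equals initial_lr * gamma ** epoch
for epoch in (0, 1, 10, 50):
    print(epoch, initial_lr * gamma ** epoch)
# prints roughly: 0.1, 0.095, 0.0599, 0.0077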

5) ReduceLROnPlateau learning rate decay (based on a monitored metric)

Function definition

class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)

Function: adjusts the learning rate when a monitored metric stops improving (decreasing or increasing). This is a very practical strategy: for example, reduce the learning rate when the validation loss no longer decreases, or monitor the validation accuracy and reduce the learning rate when the accuracy no longer increases.

Parameters:

mode(str): mode selection, either min or max. min means the monitored metric should decrease (e.g. a loss), max means it should increase (e.g. an accuracy).

factor(float): the multiplicative factor applied to the learning rate (equivalent to gamma in the other schedulers), i.e. the learning rate is updated as lr = lr * factor.

patience(int): literally "patience", i.e. how many epochs without improvement to tolerate before the learning rate is reduced; the counter is reset whenever the metric improves.

verbose(bool): whether to print a message when the learning rate is adjusted.

threshold(float): the threshold for counting a new optimum, used together with threshold_mode; the default is 1e-4. It controls how large the difference between the current metric and the best metric must be to count as an improvement.

threshold_mode(str): the mode for deciding whether the metric has reached a new optimum. There are two modes, rel and abs.
When threshold_mode = rel and mode = max, dynamic_threshold = best * (1 + threshold); when threshold_mode = rel and mode = min, dynamic_threshold = best * (1 - threshold); when threshold_mode = abs and mode = max, dynamic_threshold = best + threshold; when threshold_mode = abs and mode = min, dynamic_threshold = best - threshold.

cooldown(int): the "cool-down" period; after the learning rate has been reduced, monitoring is paused for cooldown epochs so the model can train for a while before plateau detection resumes.
min_lr(float or list): the lower bound on the learning rate; it can be a float or a list (use a list when there are multiple parameter groups).
eps(float): the minimal decay applied to the learning rate; if the difference between the new and old learning rate is smaller than eps, the update is ignored.

Sample code:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model, optimizer and learning rate scheduler
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

# Update the learning rate inside the training loop
for epoch in range(num_epochs):
    # ... training code for this epoch ...

    # Compute the validation loss
    loss = ...

    # Update the learning rate based on the monitored metric
    scheduler.step(loss)
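
Because ReduceLROnPlateau is driven by the monitored metric rather than by the epoch counter, its behavior is easiest to see with synthetic numbers. The sketch below is illustrative only, with a dummy parameter and made-up validation losses, and shows the reduction kicking in once the loss has stopped improving for more than patience epochs:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter instead of a real model
optimizer = optim.SGD([param], lr=0.1)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)

# Made-up validation losses: improvement stops after the third epoch
val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7]

for epoch, loss in enumerate(val_losses):
    scheduler.step(loss)  # pass the monitored metric to step()
    print(epoch, optimizer.param_groups[0]['lr'])
# The learning rate stays at 0.1 until the loss has failed to improve for more than
# `patience` epochs in a row, then it is multiplied by `factor` (0.1 -> 0.01)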

6) LambdaLR learning rate decay (using a custom decay function)

Function definition

class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)

Function: sets a different learning rate adjustment strategy for each parameter group.
The adjustment rule is lr = base_lr * lr_lambda(last_epoch).
It allows the learning rate to be adjusted dynamically according to a custom decay function. LambdaLR provides a flexible way to define how the learning rate changes, so it can be adjusted based on training progress or other custom conditions.
With LambdaLR, we define a decay function (lambda_fn in the example below) that takes the current epoch number as input and returns a learning rate scaling factor. The scaling factor is multiplied by the initial learning rate to obtain the adjusted learning rate.

Parameters:
lr_lambda(function or list): a function that computes the learning rate scaling factor; its input is the epoch index. When there are multiple parameter groups, pass a list with one function per group.
last_epoch(int): same as above.

Sample code:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Define the model and optimizer
model = YourModel()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Define the learning rate scheduler
lambda_fn = lambda epoch: 0.5 ** epoch  # custom decay function: halve the lr every epoch
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_fn)

# Update the learning rate inside the training loop
for epoch in range(num_epochs):
    # ... training code for this epoch ...

    # Update the learning rate scheduler
    scheduler.step()

In the above example, we define a model and an optimizer and create a LambdaLR learning rate scheduler. We use lambda_fn as the decay function; it takes the current epoch number as input and returns a scaling factor. After each epoch, we call scheduler.step() to update the learning rate.

By using LambdaLR, we can flexibly adjust the learning rate with a custom decay function. In the example we use an exponential decay in which the learning rate is halved every epoch, but other decay rules can be defined as needed.

Note that when using LambdaLR, the design of the decay function is critical: make sure it adjusts the learning rate in a way that leads to good convergence during training. The form and parameters of the decay function can be tuned based on the specific problem and experimental results.
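
As mentioned in the custom-decay category at the top of this post, lr_lambda can also be a list with one function per parameter group, which is handy for fine-tuning. The following is only an illustrative sketch: the FineTuneNet model with its backbone and classifier sub-modules is a made-up stand-in, and the learning rates are arbitrary.

import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

# Made-up fine-tuning model with a "backbone" and a "classifier" head
class FineTuneNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 8)   # stands in for a pretrained backbone
        self.classifier = nn.Linear(8, 2)  # stands in for a freshly initialized head

model = FineTuneNet()

# Two parameter groups with different base learning rates
optimizer = optim.SGD([
    {'params': model.backbone.parameters(), 'lr': 0.01},  # smaller lr for the pretrained layers
    {'params': model.classifier.parameters()},            # uses the default lr below
], lr=0.1)

# One lambda per parameter group: keep the backbone lr constant,
# halve the classifier lr every epoch
lambdas = [lambda epoch: 1.0, lambda epoch: 0.5 ** epoch]
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambdas)

num_epochs = 10  # illustrative
for epoch in range(num_epochs):
    # ... training code for this epoch ...
    scheduler.step()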

Reference:
https://zhuanlan.zhihu.com/p/69411064

These are the commonly used learning rate adjustment methods.

Origin: https://blog.csdn.net/JingpengSun/article/details/131399807