PyTorch gradient accumulation implementation

Preface

Gradient accumulation is mainly used to work around insufficient GPU memory: it lets a single card achieve the effect of training with a larger batch size.

Gradient accumulation principle

The mini-batches are executed in sequence while their gradients are accumulated; after the last mini-batch is computed, the accumulated (averaged) gradient is used to update the model parameters:

$$\color{green}accumulated=\sum_{i=0}^{N}grad_{i}$$

Gradient accumulation splits the data samples used to train the neural network into several small batches and computes them in sequence.
Because the model parameters are not updated in between, the original batch is effectively divided into several mini-batches, and each step processes a smaller subset of the data.
Since the parameters stay fixed for N steps, every mini-batch computes its gradient against the same model weights, which guarantees the same gradient and weight information. The algorithm is therefore equivalent to training with the original, unsplit batch size. That is:
$$\color{green}\theta_{t}=\theta_{t-1}-lr\cdot\sum_{i=0}^{N}grad_{i}$$
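The equivalence claimed above can be checked numerically. The following is a minimal sketch (model, data, and batch sizes are made up for illustration): the full-batch gradient of a mean loss equals the sum of per-mini-batch gradients once each mini-batch loss is divided by the number of mini-batches.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)

# Full-batch gradient (mean loss over all 8 samples)
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated gradient over 4 mini-batches of 2 samples each
model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):
    # divide by the number of mini-batches so the accumulated sum
    # equals the mean over the full batch
    loss = torch.nn.functional.mse_loss(model(xb), yb) / 4
    loss.backward()  # gradients accumulate in .grad

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True
```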

Code

Code without gradient accumulation

for i, (images, labels) in enumerate(train_data):
    # 1. forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)

    # 2. backward pass to compute gradients, then update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Code with gradient accumulation added

# gradient accumulation parameter
accumulation_steps = 4


for i, (images, labels) in enumerate(train_data):
    # 1. forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)

    # 2.1 normalize the loss so the accumulated gradient is an average
    loss = loss / accumulation_steps

    # 2.2 backward pass: gradients accumulate in .grad across iterations
    loss.backward()

    # 3. update the network parameters every accumulation_steps iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # reset gradients

Setting accumulation_steps = 4 in the code effectively multiplies the batch size by four, because the gradient is only cleared and the parameters are only updated once every 4 iterations.
With loss = loss / accumulation_steps, the gradient accumulated over four mini-batches is an average rather than a sum. Because four batches' worth of gradients are averaged, the learning rate can also be scaled up by 4× so that the update step stays comparable to training with a true 4× batch.
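The scaling relationship above can be sketched with plain arithmetic (the gradient values here are hypothetical): averaging the gradients and multiplying the learning rate by accumulation_steps yields the same step as summing the raw gradients with the original learning rate.

```python
# hypothetical per-mini-batch gradient magnitudes
grads = [0.2, 0.4, 0.1, 0.3]
lr, steps = 0.01, 4

# step from summing raw gradients (no loss scaling)
summed_step = lr * sum(grads)

# step from averaged gradients (loss / steps) with a 4x learning rate
averaged_step = (lr * steps) * sum(g / steps for g in grads)

print(abs(summed_step - averaged_step) < 1e-12)  # True
```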
Reference blogs:
1. Gradient accumulation, a neat PyTorch trick that effectively increases the batch size
2. How to clearly understand gradient accumulation

Origin blog.csdn.net/fcxgfdjy/article/details/133294760