Preface
Gradient accumulation is mainly used to work around insufficient GPU memory: it lets a single card achieve the effect of training with a larger batch size.
Gradient accumulation principle
The Mini-Batches are executed in sequence while their gradients are accumulated. After the last Mini-Batch has been processed, the accumulated gradient is averaged and used to update the model parameters:
$$accumulated=\sum_{i=0}^{N}grad_{i}$$
Gradient accumulation splits the data samples used to train the network into several small batches, which are then processed in sequence. The model parameters are not updated during this phase: the original batch is effectively divided into several Mini-Batches, so each step runs on a smaller set of samples. Because the parameters stay fixed for N steps, every Mini-Batch computes its gradient against the same model parameters, which guarantees the same gradient and weight information as the unsplit batch. The algorithm is therefore equivalent to training with the original batch size. That is:
$$\theta_{i}=\theta_{i-1}-lr\cdot\sum_{i=0}^{N}grad_{i}$$
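The equivalence above can be checked numerically. The following is a minimal, self-contained sketch (not the article's code): it uses a hypothetical 1-D linear model y_hat = w * x with MSE loss, whose gradient is computed by hand, and shows that summing per-Mini-Batch gradients, each divided by the number of steps N, reproduces the full-batch gradient exactly.

```python
# Gradient of MSE for a 1-D linear model y_hat = w * x, derived by hand:
#   d/dw mean((w*x - y)^2) = mean(2 * x * (w*x - y))
def mse_grad(xs, ys, w):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

# Toy data (illustrative values only)
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0,  0.0, 3.5, 2.0, -1.0, 5.0, 0.5, 1.5]
w = 0.3

# Gradient over the full batch of 8 samples
full_grad = mse_grad(xs, ys, w)

# Accumulate over N = 4 Mini-Batches of 2 samples each,
# dividing each Mini-Batch gradient by N (like loss / accumulation_steps)
N = 4
accumulated = 0.0
for k in range(N):
    xb, yb = xs[2 * k:2 * k + 2], ys[2 * k:2 * k + 2]
    accumulated += mse_grad(xb, yb, w) / N

assert abs(full_grad - accumulated) < 1e-12
```

The division by N is what makes the sum an average: each Mini-Batch gradient is a mean over 2 samples, so (1/4) of it contributes exactly its share of the mean over all 8 samples.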
Code
Code without gradient accumulation
for i, (images, labels) in enumerate(train_data):
    # 1. forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    # 2. backward pass to compute gradients
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Code with gradient accumulation added
# gradient accumulation parameter
accumulation_steps = 4

for i, (images, labels) in enumerate(train_data):
    # 1. forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    # 2.1 normalize the loss by the number of accumulation steps
    loss = loss / accumulation_steps
    # 2.2 backward pass: gradients accumulate in .grad across iterations
    loss.backward()
    # 3. update the network parameters every accumulation_steps iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update parameters
        optimizer.zero_grad()  # reset gradients
Setting accumulation_steps = 4 in the code effectively expands the batch size four times, because the gradients are cleared and the parameters are updated only once every 4 iterations. Since the gradients of 4 Mini-Batches are accumulated, each loss must be divided by 4 (loss = loss / accumulation_steps) so that the accumulated gradient is their average. And because each update now covers 4 batches' worth of samples, the learning rate can also be scaled up by 4 to take a correspondingly larger update step.
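The loop structure above can be mimicked in plain Python to confirm that accumulate-then-step produces the same parameter update as a single full-batch step. This is an illustrative sketch with a hypothetical scalar model y_hat = w * x and a hand-derived MSE gradient standing in for loss.backward(); the variable names mirror the PyTorch code but nothing here is PyTorch itself.

```python
# Hand-derived gradient of mean((w*x - y)^2) w.r.t. w over a batch
def grad(xs, ys, w):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

# Toy data (illustrative values only)
xs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5, 0.0]
ys = [2.0, 4.1, -1.9, 1.2, 5.8, -4.2, 3.1, 0.1]
lr, w0 = 0.01, 0.0

# Baseline: one full-batch SGD update
w_full = w0 - lr * grad(xs, ys, w0)

# Accumulated version: 4 Mini-Batches of 2 samples each
accumulation_steps = 4
w, g = w0, 0.0
for k in range(accumulation_steps):
    xb, yb = xs[2 * k:2 * k + 2], ys[2 * k:2 * k + 2]
    g += grad(xb, yb, w) / accumulation_steps  # like loss / accumulation_steps
    if (k + 1) % accumulation_steps == 0:      # like optimizer.step() + zero_grad()
        w = w - lr * g
        g = 0.0

assert abs(w - w_full) < 1e-12
```

Note that w is only read inside the loop, never written until the final step, which is exactly the "parameters stay fixed for N steps" condition that makes the two updates match.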