In a traditional training loop, a batch is processed like this:
for i, (images, target) in enumerate(train_loader):
    # 1. input output
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)
    # 2. backward
    optimizer.zero_grad()  # reset gradients
    loss.backward()
    optimizer.step()
- Obtain the loss: feed the images and labels to the model, compute the predictions in the forward pass, and evaluate the loss function;
- optimizer.zero_grad() clears the gradients from the previous step;
- loss.backward() backpropagates and computes the current gradients;
- optimizer.step() updates the network parameters according to the gradients.
Simply put: a batch of data comes in, the gradient is computed once, and the network is updated once.
With gradient accumulation, the same loop is written like this:
for i, (images, target) in enumerate(train_loader):
    # 1. input output
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)
    # 2.1 loss normalization: divide by the number of accumulation steps
    loss = loss / accumulation_steps
    # 2.2 back propagation (gradients accumulate in .grad)
    loss.backward()
    # 3. update parameters of the net every accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update parameters with the accumulated gradient
        optimizer.zero_grad()  # reset gradients for the next accumulation cycle
- Obtain the loss: feed the images and labels to the model, compute the predictions in the forward pass, and evaluate the loss function;
- loss.backward() backpropagates and computes the current gradients;
- Repeat steps 1-2 several times without clearing the gradients, so that new gradients are accumulated on top of the existing ones;
- After the gradients have been accumulated a certain number of times, optimizer.step() first updates the network parameters using the accumulated gradients, and then optimizer.zero_grad() clears them to prepare for the next round of accumulation.
In summary: with gradient accumulation, the gradient is computed for every batch but not cleared, so it keeps accumulating. After a set number of batches, the network parameters are updated from the accumulated gradient, the gradient is then cleared, and the next cycle begins.
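To see why the division by accumulation_steps matters, here is a minimal, self-contained sketch (the toy nn.Linear model and MSELoss below are made up for illustration) checking that accumulating the gradients of loss / accumulation_steps over several small batches reproduces the gradient of the corresponding large batch. The check assumes a mean-reduced loss and a model without batch-dependent layers such as BatchNorm:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)      # toy stand-in for the real network
criterion = nn.MSELoss()     # mean-reduced loss, as in the loops above
accumulation_steps = 4

# one "large" batch, later split into accumulation_steps equal small batches
x = torch.randn(8 * accumulation_steps, 4)
y = torch.randn(8 * accumulation_steps, 1)

# (a) gradient from the full batch in a single backward pass
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# (b) gradient accumulated over the small batches, with the loss scaled down
model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(xb), yb) / accumulation_steps
    loss.backward()          # gradients keep adding up in .grad
acc_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, acc_grad, atol=1e-6))  # expected: True

Note that no optimizer.step() happens between the small batches in (b), which is exactly why the accumulated gradient matches the large-batch gradient.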
Under certain conditions, the larger the batch size, the better the training results. Gradient accumulation expands the effective batch size 'in disguise': with accumulation_steps set to 8, the effective batch size is 8 times larger. It is a good trick for budget-limited labs to work around GPU memory constraints; when using it, note that the learning rate should be enlarged accordingly.
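As a rough illustration of that last point, one common heuristic is the linear scaling rule: grow the learning rate in proportion to the effective batch size. The sketch below assumes that rule, a toy model, and a base_lr of 0.01 purely as examples; whether and how much to scale in practice depends on the model and optimizer:

import torch

model = torch.nn.Linear(4, 1)   # stand-in for the real network
base_lr = 0.01                  # learning rate tuned for the original small batch (assumed value)
accumulation_steps = 8          # effective batch size grows 8x

# linear scaling rule (a heuristic, not a guarantee): scale lr with the effective batch size
scaled_lr = base_lr * accumulation_steps
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)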