PyTorch gradient accumulation and backpropagation

In a traditional training loop, a batch is trained like this:

for i, (images, target) in enumerate(train_loader):
    # 1. forward pass: move data to the GPU, compute predictions and loss
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)

    # 2. backward pass and parameter update
    optimizer.zero_grad()   # reset gradients
    loss.backward()         # compute current gradients
    optimizer.step()        # update parameters

  1. Obtain the loss: feed the images and labels to the model, run the forward pass to get predictions, and compute the loss;
  2. optimizer.zero_grad() clears the old gradients;
  3. loss.backward() backpropagates and computes the current gradients;
  4. optimizer.step() updates the network parameters according to the gradients.

Simply put: one batch of data comes in, the gradient is computed once, and the network is updated once.
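
For reference, here is a minimal self-contained version of that loop (a sketch with a toy linear model and random data on the CPU; the model, loss, optimizer and loader are placeholders, not the ones from the original code):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy setup so the snippet runs on its own
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# fake dataset: 64 samples of dimension 10, batch size 8
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
train_loader = DataLoader(dataset, batch_size=8)

for i, (images, target) in enumerate(train_loader):
    outputs = model(images)            # forward pass
    loss = criterion(outputs, target)  # compute loss

    optimizer.zero_grad()              # clear old gradients
    loss.backward()                    # compute current gradients
    optimizer.step()                   # update parameters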

 

With gradient accumulation, the loop is written like this:

for i, (images, target) in enumerate(train_loader):
    # 1. forward pass: move data to the GPU, compute predictions and loss
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)

    # 2.1 normalize the loss so the accumulated gradient matches one large batch
    loss = loss / accumulation_steps
    # 2.2 backward pass: gradients accumulate in the .grad buffers
    loss.backward()
    # 3. update the parameters every accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()        # update parameters with the accumulated gradients
        optimizer.zero_grad()   # reset gradients for the next accumulation cycle

  1. Obtain the loss: feed the images and labels to the model, run the forward pass to get predictions, and compute the loss (divided by accumulation_steps);
  2. loss.backward() backpropagates and computes the current gradients;
  3. Repeat steps 1-2 several times without clearing the gradients, so each new gradient is added to the existing ones;
  4. After accumulating for accumulation_steps batches, optimizer.step() updates the network parameters using the accumulated gradients, and then optimizer.zero_grad() clears them to prepare for the next round of accumulation.

In summary: with gradient accumulation, every batch of data still produces one gradient computation, but the gradient is not cleared; it keeps accumulating. After a set number of batches, the network parameters are updated with the accumulated gradient, the gradient is cleared, and the next cycle begins.
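
To see why the loss is divided by accumulation_steps, here is a small check of my own (not from the original post): with a mean-reduced loss and equal-sized batches, the accumulated gradient equals the gradient computed on one large batch of accumulation_steps times the batch size samples:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()                     # default reduction='mean'
accumulation_steps = 4
big_images = torch.randn(accumulation_steps * 8, 10)   # 32 samples in total
big_target = torch.randn(accumulation_steps * 8, 1)

# (a) gradient of one large batch of 32 samples
model.zero_grad()
criterion(model(big_images), big_target).backward()
grad_large_batch = model.weight.grad.clone()

# (b) accumulated gradient over 4 small batches of 8, each loss scaled by 1/accumulation_steps
model.zero_grad()
for x, y in zip(big_images.chunk(accumulation_steps), big_target.chunk(accumulation_steps)):
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()                          # gradients add up in .grad

print(torch.allclose(grad_large_batch, model.weight.grad, atol=1e-6))   # True

The two match only because no optimizer.step() runs between the small batches and the loss uses mean reduction; with a sum-reduced loss the division by accumulation_steps would not be needed.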

Under certain conditions, a larger batch size gives better training results. Gradient accumulation enlarges the batch size in disguise: if accumulation_steps is 8, the effective batch size is 8 times larger. This is a good trick for resource-strapped labs like ours to work around the GPU memory limit; when using it, note that the learning rate should be scaled up accordingly.
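
As a rough sketch of that last point (the numbers and the linear-scaling rule below are my own illustrative assumptions, not from the original post), one common way to wire this up is:

per_gpu_batch_size = 16                 # what actually fits in GPU memory
target_batch_size = 128                 # the effective batch size we want to train with
accumulation_steps = target_batch_size // per_gpu_batch_size   # -> 8

base_lr = 0.01                          # learning rate tuned for per_gpu_batch_size
lr = base_lr * accumulation_steps       # linear scaling rule: a starting point, often paired with warmup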

Origin blog.csdn.net/wi162yyxq/article/details/106054613