In a traditional training loop, a batch is processed like this:
for i, (images, target) in enumerate(train_loader):
    # 1. input output
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)
    # 2. backward
    optimizer.zero_grad()  # reset gradients
    loss.backward()
    optimizer.step()
- Obtain the loss: feed the images and labels to the model, compute the predictions in the forward pass, and evaluate the loss function;
- optimizer.zero_grad() clears the gradients from the previous step;
- loss.backward() backpropagates and computes the current gradients;
- optimizer.step() updates the network parameters according to the gradients.
Simply put: a batch of data comes in, the gradient is computed once, and the network is updated once.
With gradient accumulation, the same loop is written like this:
for i, (images, target) in enumerate(train_loader):
    # 1. input output
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)
    # 2.1 loss normalization: divide by the number of accumulation steps
    loss = loss / accumulation_steps
    # 2.2 back propagation (gradients accumulate in .grad)
    loss.backward()
    # 3. update parameters of the net every accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update parameters with the accumulated gradient
        optimizer.zero_grad()  # reset gradients for the next accumulation cycle
- Obtain the loss: feed the images and labels to the model, compute the predictions in the forward pass, and evaluate the loss function;
- loss.backward() backpropagates and computes the current gradients;
- Repeat steps 1-2 several times without clearing the gradients, so that new gradients are accumulated on top of the existing ones;
- After the gradients have been accumulated a certain number of times, optimizer.step() first updates the network parameters using the accumulated gradients, and then optimizer.zero_grad() clears them to prepare for the next round of accumulation.
In summary: with gradient accumulation, the gradient is computed for every batch but not cleared, so it keeps accumulating. After a set number of batches, the network parameters are updated from the accumulated gradient, the gradient is then cleared, and the next cycle begins.
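To see why the division by accumulation_steps matters, here is a minimal, self-contained sketch (the toy nn.Linear model and MSELoss below are made up for illustration) checking that accumulating the gradients of loss / accumulation_steps over several small batches reproduces the gradient of the corresponding large batch. The check assumes a mean-reduced loss and a model without batch-dependent layers such as BatchNorm:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)      # toy stand-in for the real network
criterion = nn.MSELoss()     # mean-reduced loss, as in the loops above
accumulation_steps = 4

# one "large" batch, later split into accumulation_steps equal small batches
x = torch.randn(8 * accumulation_steps, 4)
y = torch.randn(8 * accumulation_steps, 1)

# (a) gradient from the full batch in a single backward pass
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# (b) gradient accumulated over the small batches, with the loss scaled down
model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(xb), yb) / accumulation_steps
    loss.backward()          # gradients keep adding up in .grad
acc_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, acc_grad, atol=1e-6))  # expected: True

Note that no optimizer.step() happens between the small batches in (b), which is exactly why the accumulated gradient matches the large-batch gradient.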
Under certain conditions, the larger the batch size, the better the training results. Gradient accumulation expands the effective batch size 'in disguise': with accumulation_steps set to 8, the effective batch size is 8 times larger. It is a good trick for budget-limited labs to work around GPU memory constraints; when using it, note that the learning rate should be enlarged accordingly.
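As a rough illustration of that last point, one common heuristic is the linear scaling rule: grow the learning rate in proportion to the effective batch size. The sketch below assumes that rule, a toy model, and a base_lr of 0.01 purely as examples; whether and how much to scale in practice depends on the model and optimizer:

import torch

model = torch.nn.Linear(4, 1)   # stand-in for the real network
base_lr = 0.01                  # learning rate tuned for the original small batch (assumed value)
accumulation_steps = 8          # effective batch size grows 8x

# linear scaling rule (a heuristic, not a guarantee): scale lr with the effective batch size
scaled_lr = base_lr * accumulation_steps
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)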