PyTorch gradient accumulation

Gradient accumulation

In deep learning training, the batch size is limited by GPU memory, and the batch size in turn affects both the final accuracy of the model and the speed of training. With a fixed amount of GPU memory, a larger model leaves less room for data, so the batch size has to shrink. Gradient accumulation is a simple way around this problem.

Gradient accumulation is a training technique that increases the effective batch size without additional hardware. It trades time for space: the gradients of several batches are accumulated, and once the specified number of batches has been processed, the model parameters are updated once using the accumulated gradient, which approximates training with a larger batch size. When each batch's loss is divided by the number of accumulation steps, the accumulated gradient equals the average of the gradients of those batches.

The process is simple. The gradient used for gradient descent is the average of the gradients computed over the samples in a batch. Take batch_size=128 as an example: you can compute the gradients of all 128 samples at once and average them. Alternatively, you can compute the average gradient of 16 samples at a time and accumulate it; after 8 such computations, you divide the accumulated gradient by 8 and then update the parameters. The update must happen only after all 8 accumulations, using their average. If you update after every 16 samples instead, you are simply training with batch_size=16.
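The arithmetic in the batch_size=128 example can be checked with a small plain-Python sketch. The toy model (fitting y = w * x with squared error), the random data, and the helper name `mean_gradient` are assumptions made for illustration only; they do not come from the original article.

```python
import math
import random

random.seed(0)

# Toy problem (assumed for illustration): fit y = w * x with squared error.
# The per-sample gradient with respect to w is 2 * (w * x - y) * x.
w = 0.5
samples = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(128)]


def mean_gradient(batch):
    """Average gradient of the squared error over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Option 1: one pass over all 128 samples.
full_batch_grad = mean_gradient(samples)

# Option 2: eight micro-batches of 16 samples each.
# Sum the per-batch averages, then divide the total by 8.
micro_batch_size = 16
accumulation_steps = len(samples) // micro_batch_size
accumulated = 0.0
for k in range(accumulation_steps):
    micro_batch = samples[k * micro_batch_size:(k + 1) * micro_batch_size]
    accumulated += mean_gradient(micro_batch)
accumulated_grad = accumulated / accumulation_steps

# The two gradients agree up to floating-point rounding.
same = math.isclose(full_batch_grad, accumulated_grad, rel_tol=1e-9, abs_tol=1e-12)
```

The equality holds because every micro-batch has the same size. With unequal micro-batches, a plain average of the per-batch averages would weight samples unevenly.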

Traditional deep learning training

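# net, criterion, optimizer, trainloader, evaluation_steps and evaluate_model
# are assumed to be defined elsewhere
for i, (inputs, labels) in enumerate(trainloader):
    optimizer.zero_grad()                   # clear the gradients
    outputs = net(inputs)                   # forward pass
    loss = criterion(outputs, labels)       # compute the loss
    loss.backward()                         # backward pass, compute the gradients
    optimizer.step()                        # update the parameters
    if (i+1) % evaluation_steps == 0:
        evaluate_model()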

The steps are:

  1. optimizer.zero_grad() clears the gradients left over from the previous batch
  2. Forward pass: the data is passed through the network to obtain the predictions
  3. Compute the loss from the predictions and the labels
  4. loss.backward() backpropagates the loss to compute the parameter gradients
  5. optimizer.step() updates the network parameters using those gradients

Simply put, for each batch of data the gradient is computed once and the network is updated once.

Gradient accumulation method

for i, (inputs, labels) in enumerate(trainloader):
    outputs = net(inputs)                   # forward pass
    loss = criterion(outputs, labels)       # compute the loss
    loss = loss / accumulation_steps        # scale the loss so the accumulated gradient is an average
    loss.backward()                         # backward pass, gradients are added to the existing .grad

    # update the parameters only after the specified number of steps
    if (i+1) % accumulation_steps == 0:
        optimizer.step()                    # update the parameters
        optimizer.zero_grad()               # clear the gradients
        if (i+1) % evaluation_steps == 0:
            evaluate_model()

The steps are:

  1. Forward pass: the data is passed through the network to obtain the predictions
  2. Compute the loss from the predictions and the labels
  3. Divide the loss by accumulation_steps, then backpropagate it to compute the parameter gradients
  4. Repeat steps 1-3 without clearing the gradients, so that they accumulate
  5. After the specified number of accumulations, update the parameters, then clear the gradients

During gradient accumulation, each batch still goes through the forward and backward pass as usual, but the gradients are not cleared after backpropagation. Because backward() in PyTorch adds the newly computed gradients to those already stored, calling loss.backward() N times leaves the sum of the gradients of those N batches. What we need, however, is the average gradient (equivalently, the average loss), so the loss from each batch should be divided by accumulation_steps.
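The accumulation behaviour of backward() can be checked directly. The following is a minimal sketch, assuming a toy `nn.Linear` model, random data, and `nn.MSELoss` with its default mean reduction; none of these come from the original article. It compares the gradient of one batch of 32 samples with the gradient accumulated over four micro-batches of 8.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data, assumed for illustration only.
net = nn.Linear(4, 1)
criterion = nn.MSELoss()  # default reduction averages over the batch
inputs = torch.randn(32, 4)
labels = torch.randn(32, 1)
accumulation_steps = 4

# Reference: gradient of a single large batch of 32 samples.
net.zero_grad()
criterion(net(inputs), labels).backward()
full_grad = net.weight.grad.clone()

# Accumulated: four micro-batches of 8 samples, gradients never cleared in between.
net.zero_grad()
micro_inputs = inputs.chunk(accumulation_steps)
micro_labels = labels.chunk(accumulation_steps)
for x, y in zip(micro_inputs, micro_labels):
    loss = criterion(net(x), y) / accumulation_steps  # scale so the sum becomes an average
    loss.backward()                                   # adds to net.weight.grad
accumulated_grad = net.weight.grad.clone()

# The two gradients agree up to floating-point rounding.
grads_match = torch.allclose(full_grad, accumulated_grad, atol=1e-5)
```

Without the division by accumulation_steps, the accumulated gradient would be four times the reference, which has the same effect as multiplying the learning rate by four under plain SGD.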

To sum up, with gradient accumulation the gradients are not cleared after each batch but are accumulated instead. After a set number of batches (accumulation_steps), the network parameters are updated and the gradients are then cleared.
This delayed parameter update gives an effect similar to training with a large batch size. I generally use gradient accumulation in my own experiments, and in most cases the model trained with it performs much better than one trained with the small batch size alone.
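Putting the pieces together, the following self-contained sketch runs one epoch with gradient accumulation on synthetic data. The dataset, the `nn.Linear` model, the SGD optimizer, and the counter `num_updates` are all assumptions chosen to keep the example small. They stand in for whatever net, criterion, optimizer, and trainloader are used in practice.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: 64 samples, 4 features (made up for illustration).
inputs = torch.randn(64, 4)
labels = torch.randn(64, 1)
trainloader = DataLoader(TensorDataset(inputs, labels), batch_size=8)  # 8 micro-batches

net = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
accumulation_steps = 4  # effective batch size: 8 * 4 = 32

initial_weight = net.weight.detach().clone()
num_updates = 0

optimizer.zero_grad()
for i, (x, y) in enumerate(trainloader):
    outputs = net(x)                                    # forward pass
    loss = criterion(outputs, y) / accumulation_steps   # scaled loss
    loss.backward()                                     # accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per accumulation_steps micro-batches
        optimizer.zero_grad()  # start the next accumulation cycle from zero
        num_updates += 1
```

With 8 micro-batches and accumulation_steps = 4, the parameters are updated twice per epoch. If the number of micro-batches is not a multiple of accumulation_steps, the gradients from the trailing batches are left unused by this loop, so a real training script has to decide whether to apply or discard them.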

Notes:

  • Under certain conditions, a larger batch size gives better training results, and gradient accumulation effectively enlarges the batch size: with accumulation_steps set to 8, the effective batch size is 8 times larger. This is a useful trick when GPU memory is limited, as it often is in a lab. When using it, the learning rate should also be increased appropriately.
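As a rough illustration of the note above, the sketch below computes the effective batch size and a correspondingly enlarged learning rate. The linear scaling rule used here is a common heuristic and an assumption on my part, not a prescription from the original article. The function names and the base learning rate of 0.01 are hypothetical.

```python
import math


def effective_batch_size(micro_batch_size, accumulation_steps):
    """Number of samples contributing to each parameter update."""
    return micro_batch_size * accumulation_steps


def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate in proportion to the batch size (heuristic)."""
    return base_lr * new_batch_size / base_batch_size


# A micro-batch of 16 accumulated over 8 steps behaves like a batch of 128.
batch = effective_batch_size(16, 8)

# Hypothetical base learning rate of 0.01 tuned for batch size 16.
lr = linearly_scaled_lr(0.01, 16, batch)  # roughly 0.08
```

How much to enlarge the learning rate depends on the model and optimizer, so the scaled value is best treated as a starting point for tuning.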

Origin blog.csdn.net/ytusdc/article/details/128523906