PyTorch memory management: storage of intermediate activations in forward propagation, and torch.utils.checkpoint

Intermediate activations for forward pass

Recently I wanted to modify intermediate activation values during model training, so that the modified values would be used in the subsequent backpropagation.
Backpropagation uses the intermediate variables of the forward pass to compute gradients, and these intermediate variables are stored in GPU memory. I have not yet found a way to extract them from GPU memory (let me know if you have one).
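
One possible way to get at these tensors, offered as a sketch rather than a definitive answer: recent PyTorch releases (1.10 and later, as far as I know) provide torch.autograd.graph.saved_tensors_hooks, a context manager that calls a pack function whenever autograd saves a tensor for the backward pass, and an unpack function when the backward pass reads it back. The tiny layer, the shapes, and the names pack, unpack and saved below are made up for illustration, and the example runs on the CPU.

```python
import torch

saved = []  # tensors that autograd stores for the backward pass

def pack(t):
    # Called each time autograd saves a tensor during the forward pass.
    saved.append(t)
    return t

def unpack(t):
    # Called when the backward pass needs the saved tensor again.
    return t

layer = torch.nn.Linear(3, 4)  # hypothetical tiny layer
x = torch.randn(2, 3)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = layer(x).relu().sum()

y.backward()
print([tuple(t.shape) for t in saved])
```

Whatever unpack returns is what the backward pass uses, so returning a modified tensor there is one way to change an activation before the gradient is computed.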

  • We define a simple model as follows:
import torch
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.net1 = torch.nn.Linear(3, 300)
        self.net2 = torch.nn.Linear(300, 300)
        self.net3 = torch.nn.Linear(300, 400)
        self.net4 = torch.nn.Linear(400, 300)
        self.net5 = torch.nn.Linear(300, 100)
        self.activation_sum = 0
        self.activation_size = 0

    def forward(self, x):
        x = self.net1(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net2(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net3(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net4(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net5(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        return x

As the forward function shows, every x is an intermediate variable; the last x is the result, and it is an intermediate variable too. All of these intermediate variables are stored in GPU memory. During the forward pass we accumulate the number of elements in the intermediate results and their total size in bytes as activation_sum and activation_size.
Next we verify that the model's intermediate variables are stored in GPU memory.

  • Measuring GPU memory usage:
    torch.cuda.memory_allocated() returns the GPU memory currently occupied by tensors, in bytes.
  • A function that returns the model size in bytes:
def modelSize(model):
    param_size = 0
    param_sum = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
        param_sum += param.nelement()
    buffer_size = 0
    buffer_sum = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
        buffer_sum += buffer.nelement()
    all_size = (param_size + buffer_size)
    return all_size
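
As a sanity check on modelSize, the expected value can be worked out by hand. This sketch assumes the default float32 dtype (4 bytes per element) and that each Linear layer holds an out × in weight matrix plus a bias of length out; the names layer_shapes and BYTES_PER_FLOAT32 are only for illustration.

```python
# (in_features, out_features) of net1 .. net5 in MyModel
layer_shapes = [(3, 300), (300, 300), (300, 400), (400, 300), (300, 100)]
BYTES_PER_FLOAT32 = 4  # assumption: parameters use the default float32 dtype

# Each Linear layer has in * out weights plus out biases.
param_count = sum(i * o + o for i, o in layer_shapes)
expected_bytes = param_count * BYTES_PER_FLOAT32

print(param_count, expected_bytes)  # 362300 1449200
```
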
  • Define the input and label:
device = torch.device("cuda:0")

input = torch.randn(10, 3).to(device)
label = torch.randn(10, 100).to(device)
  • Forward propagation and backpropagation:
torch.cuda.empty_cache()
before = torch.cuda.memory_allocated()
model = MyModel().to("cuda:0")
after = torch.cuda.memory_allocated()
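print("GPU memory increase after creating the model: {}".format(after - before))

print("Model size: {}".format(modelSize(model)))

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model.train()
optimizer.zero_grad()

before = torch.cuda.memory_allocated()
print("GPU memory in use before the forward pass: {}".format(before))

output = model(input)  # forward pass

after = torch.cuda.memory_allocated()
print("GPU memory in use after the forward pass: {}, difference (intermediate activations): {}".format(after, after - before))

loss = loss_fn(output, label)
torch.autograd.backward(loss)  # equivalent to loss.backward()
optimizer.step()

The result is:

GPU memory increase after creating the model: 1452544
Model size: 1449200
GPU memory in use before the forward pass: 1457152
GPU memory in use after the forward pass: 1514496, difference (intermediate activations): 57344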

Print the intermediate-activation statistics collected during the forward pass:

print(model.activation_sum)
print(model.activation_size)

The result is:

14000
56000

The model's footprint in GPU memory (1452544 bytes) is essentially the same as the size of its parameters (1449200 bytes). GPU memory grows during the forward pass, and the increase (57344 bytes) closely matches the counted size of the intermediate results (56000 bytes).
This indicates that the intermediate results are kept in GPU memory during the forward pass; they are released once the backpropagation computation has finished.
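
The small gap between 56000 and 57344 bytes is consistent with the allocator rounding each tensor up to a multiple of 512 bytes. The sketch below assumes that granularity for PyTorch's CUDA caching allocator, along with float32 activations and the batch size of 10 used above; the names BLOCK and round_up are only for illustration.

```python
BLOCK = 512  # assumed allocation granularity of the CUDA caching allocator
batch = 10
out_features = [300, 300, 400, 300, 100]  # outputs of net1 .. net5

def round_up(nbytes, block=BLOCK):
    """Round a byte count up to the next multiple of block."""
    return (nbytes + block - 1) // block * block

raw = [batch * n * 4 for n in out_features]  # float32: 4 bytes per element
rounded = [round_up(b) for b in raw]

print(sum(raw), sum(rounded))  # 56000 57344
```

Under that assumption the rounded total reproduces the measured difference exactly.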

Using checkpoints

Checkpointing trades time for memory: for the layer wrapped by checkpoint (or a run of consecutive layers), intermediate results are not stored during the forward pass; they are recomputed during backpropagation when they are needed.
For example, we rewrite the model as follows:

import torch
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.net1 = torch.nn.Linear(3, 300)
        self.net2 = torch.nn.Linear(300, 300)
        self.net3 = torch.nn.Linear(300, 400)
        self.net4 = torch.nn.Linear(400, 300)
        self.net5 = torch.nn.Linear(300, 100)
        self.activation_sum = 0
        self.activation_size = 0

    def forward(self, x):
        x = self.net1(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())

        x = checkpoint(torch.nn.Sequential(self.net2, self.net3, self.net4), x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net5(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        return x

We wrap self.net2, self.net3 and self.net4 with checkpoint. In the forward pass, the result of self.net1(x), which is the input to self.net2, is still stored; the outputs of self.net2 and self.net3 are not stored; and the output of self.net4, which is the input to self.net5, is stored.
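
Because the wrapped layers are recomputed rather than approximated, the gradients should match those of an ordinary forward pass. The following is a minimal CPU sketch of that check, not part of the experiment above: the small block and its sizes are hypothetical, and the use_reentrant=False argument is assumed to be available (it exists in recent PyTorch releases; older ones may not accept it).

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Hypothetical small block standing in for net2..net4.
block = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
)
x = torch.randn(5, 4, requires_grad=True)

# Ordinary forward/backward: activations inside block are stored.
block(x).sum().backward()
ref = [p.grad.clone() for p in block.parameters()]

# Checkpointed forward/backward: activations inside block are recomputed.
block.zero_grad()
checkpoint(block, x, use_reentrant=False).sum().backward()
ckpt = [p.grad.clone() for p in block.parameters()]

same = all(torch.allclose(a, b) for a, b in zip(ref, ckpt))
print(same)
```
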

This time the result is:

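GPU memory increase after creating the model: 1452544
Model size: 1449200
GPU memory in use before the forward pass: 1457152
GPU memory in use after the forward pass: 1485824, difference (intermediate activations): 28672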

The counted activation_sum and activation_size are 7000 and 28000, which is essentially consistent with the activation size measured from GPU memory, 28672. The two are not exactly equal; the gap is consistent with PyTorch's CUDA caching allocator rounding each allocation up to a multiple of 512 bytes (12288 + 12288 + 4096 = 28672).

Origin blog.csdn.net/qq_43219379/article/details/124206922