What should I do if the GPU memory is insufficient?

Preface

The models I have been running recently are relatively large, especially BERT, and my 1080 Ti really struggles with them. In the official BERT example, some tricks are provided to help speed up training, which is considerate, but I felt it wasn't enough, so I spent some time putting together a collection of tricks to fall back on when GPU memory runs short.

This article is divided into two parts. The first part covers how to estimate the GPU memory a model requires; the second covers various tricks to use when GPU memory is insufficient.

Monitoring the GPU

The most commonly used tool for monitoring the GPU is of course nvidia-smi, but there is another tool that presents the information better: gpustat.

nvidia-smi
watch --color -n1 gpustat -cpu   # real-time GPU monitoring

It is recommended to set up an alias in your shell configuration file; that way a single gpu command brings up all the information, which is very convenient.

Some readers have recommended nvtop in the comments. I tried it briefly and it is really good; the information it presents is very rich. Worth a try.

How to estimate a model's GPU memory usage [1]

First of all, think about a question: what in the model is occupying my GPU memory, and what can I do to reduce that usage?

In fact, the GPU memory occupied by a model mainly consists of three parts: the parameters of the model itself, the parameters kept by the optimizer, and the inputs and outputs of each layer of the model.

Model parameters

The parameters of the model itself are the weights and biases of each network layer, which occupy memory as soon as the model is loaded. Note that some layers have parameters, such as convolutional and recurrent layers (CNN, RNN), while others have none, such as activation layers and pooling layers.

From PyTorch's point of view, once you execute model.to(device), your model has been loaded onto the device and its parameters are occupying GPU memory.

For PyTorch, the model parameters are stored in model.parameters(), so we don't need to work them out by hand; we can count the elements and print the size directly (here para is the total number of parameter elements and type_size is the number of bytes per element, 4 for float32):

para = sum([np.prod(list(p.size())) for p in model.parameters()])
print('Model {} : params: {:4f}M'.format(model._get_name(), para * type_size / 1000 / 1000))
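As a quick sanity check, here is a minimal sketch applying the same formula to a toy model (the layer and numbers are purely illustrative):

import numpy as np
import torch.nn as nn

model = nn.Linear(1000, 1000)   # toy model: 1000*1000 weights + 1000 biases
type_size = 4                   # float32 -> 4 bytes per element
para = sum([np.prod(list(p.size())) for p in model.parameters()])
print('Model {} : params: {:4f}M'.format(model._get_name(), para * type_size / 1000 / 1000))
# -> Model Linear : params: 4.004000M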

Optimizer parameters

Optimizer parameters are the quantities produced during optimization, i.e. during backpropagation. This part mainly refers to dw, that is, the gradients. In SGD their size is the same as that of the parameters, so during optimization the memory occupied by the model's parameters roughly doubles.

It is worth noting that different optimizers need to keep different optimization state. For Adam, because it also stores additional buffers, the memory needed for the model's parameters roughly quadruples during optimization.
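To see where the 4x comes from, here is a minimal sketch (toy model, purely illustrative) that inspects the state Adam keeps for each parameter:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model(torch.randn(2, 10)).sum().backward()
optimizer.step()

p = next(model.parameters())
print(list(optimizer.state[p].keys()))   # ['step', 'exp_avg', 'exp_avg_sq']
# exp_avg and exp_avg_sq each have the same shape as the parameter, so together
# with the gradient the total is roughly 4x the parameter memory.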

Input and output of each layer of the model

First of all, there is the memory occupied by the input data. This part is actually not large, because we usually read data with an iterator, loading only one batch into memory at a time, which keeps the memory taken by each input insignificant compared with the network parameters.

Then, when the model performs forward and backward propagation, a very important point is that the output of each layer and its corresponding gradient are computed and saved, which means this also occupies a large part of the memory.

Finally, the memory usage of the model's layer outputs can be summarized as:

  • The output of each layer (a multi-dimensional array) and its corresponding gradient. It is worth noting that the layer outputs do not need to store momentum information (that is, even if Adam is used, the memory for the layer outputs is still 2x rather than 4x; I don't know why, and would welcome an explanation).

  • The memory taken by the outputs is proportional to the batch size.

So is there a way to calculate this part with PyTorch? Yes: we can feed in a dummy batch, traverse each layer via model.modules() to get each layer's output shape, and from that obtain the total size of the outputs for one batch of data. [2]

Total memory usage

GPU memory usage = model parameters × n + batch size × layer outputs per sample × 2 + input data for one batch (often negligible)

where n depends on the optimization algorithm: with SGD, n = 2; with Adam, n = 4.

A nice implementation is shown below. I was too lazy to rewrite it from scratch; you can tweak it to fit your needs without much trouble.

# Function to estimate a model's GPU memory usage
#   model:     the model to inspect
#   input:     an example input Tensor
#   type_size: bytes per element, default 4 (i.e. float32)

import numpy as np
import torch.nn as nn

def modelsize(model, input, type_size=4):
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    print('Model {} : params: {:4f}M'.format(model._get_name(), para * type_size / 1000 / 1000))

    input_ = input.clone()
    input_.requires_grad_(requires_grad=False)

    mods = list(model.modules())
    out_sizes = []

    # Walk the layers in order, feeding each layer's output into the next
    # (this assumes a flat, purely sequential model)
    for i in range(1, len(mods)):
        m = mods[i]
        if isinstance(m, nn.ReLU):
            if m.inplace:
                continue  # in-place ReLU produces no new output buffer
        out = m(input_)
        out_sizes.append(np.array(out.size()))
        input_ = out

    # Sum the element counts of all intermediate outputs
    total_nums = 0
    for i in range(len(out_sizes)):
        s = out_sizes[i]
        nums = np.prod(np.array(s))
        total_nums += nums

    print('Model {} : intermediate variables: {:3f} M (without backward)'
          .format(model._get_name(), total_nums * type_size / 1000 / 1000))
    print('Model {} : intermediate variables: {:3f} M (with backward)'
          .format(model._get_name(), total_nums * type_size * 2 / 1000 / 1000))
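For example, a usage sketch with a toy sequential CNN (note that the layer-by-layer traversal above only works for flat sequential models; branching or nested architectures would need a forward-hook based approach instead):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1),
    nn.ReLU(),
)
modelsize(net, torch.randn(1, 3, 224, 224))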

Tricks when GPU memory is insufficient [2]

Multi-GPU and distributed setups are not discussed here; only some common tricks, and the list will be updated from time to time.

Reduce the batch size

This should be easy to understand: lowering the batch size appropriately makes the inputs and outputs of every layer shrink linearly, and the effect is quite noticeable. One thing to note is that adjusting the dev batch size also helps reduce memory; at the same time, do not set the dev or test batch size to the length of the whole dataset. I did exactly this silly thing recently, and it took me a whole day of debugging to track the problem down.

Use a smaller data type

By default the whole network uses 32-bit floating point. Switching to 16-bit floats cuts the memory usage roughly in half.
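One common way to do this in PyTorch is automatic mixed precision (torch.cuda.amp); a minimal sketch, assuming a CUDA device and an already-defined model, optimizer, criterion and train_loader:

import torch

scaler = torch.cuda.amp.GradScaler()
for feature, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run ops in float16 where it is safe to do so
        outputs = model(feature)
        loss = criterion(outputs, target)
    scaler.scale(loss).backward()            # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)                   # unscale gradients and update the parameters
    scaler.update()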

Simplify the model

When designing the model, simplify it where you can: turn a two-layer LSTM into a one-layer one; replace an LSTM with a GRU; reduce the number of convolution kernels; use Linear layers as sparingly as possible; and so on.
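For instance, a hypothetical before/after sketch (the sizes are made up for illustration):

import torch.nn as nn

# Before: a 2-layer LSTM encoder
encoder = nn.LSTM(input_size=300, hidden_size=512, num_layers=2, batch_first=True)
# After: a 1-layer GRU with a smaller hidden size -> far fewer parameters and activations
encoder = nn.GRU(input_size=300, hidden_size=256, num_layers=1, batch_first=True)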

From the data side

For text data, the memory brought by long sequences grows linearly with sequence length, so shortening the sequence length appropriately can greatly reduce it.

total_loss

Considering that loss itself is a tensor that carries gradient information, the correct way to accumulate the total loss is:

total_loss += loss.item()
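A minimal before/after sketch of the difference (model, criterion, optimizer and train_loader are assumed to exist); accumulating the raw tensor keeps every iteration's computation graph alive, while .item() stores only a plain Python float:

total_loss = 0.0
for feature, target in train_loader:
    loss = criterion(model(feature), target)
    # total_loss += loss         # wrong: keeps every iteration's computation graph alive
    total_loss += loss.item()    # right: stores only a plain Python float

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()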

Release tensors and variables you no longer need

Use del to release tensors and variables you no longer need. This also requires some discipline when writing models: pay attention to how variables are used instead of letting them pile up everywhere.
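A minimal sketch, where encoder, classifier and feature are hypothetical names:

import torch

hidden = model.encoder(feature)    # large intermediate tensor
logits = model.classifier(hidden)
del hidden                         # drop the reference as soon as it is no longer needed
torch.cuda.empty_cache()           # optional: hand cached but unused blocks back to the driver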

The inplace argument of ReLU

The activation function ReLU() has an argument inplace, which defaults to False. When set to True, the new values computed by relu() do not take up new space but directly overwrite the original values, so setting it to True can save some memory.
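For example (an illustrative block):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(inplace=True),   # overwrites the conv output instead of allocating a new buffer
)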

Gradient accumulation

First, some PyTorch basics you should know:

  • In PyTorch, when we call loss.backward(), a gradient is computed for every parameter and stored in parameter.grad. Note that parameter.grad is a tensor, and it accumulates the gradient from every backward call (see the short sketch after this list).

  • In PyTorch, the network parameters are only updated by gradient descent when optimizer.step() is called.
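A tiny sketch of that accumulation behaviour (the tensor is purely illustrative):

import torch

w = torch.ones(3, requires_grad=True)
w.sum().backward()
w.sum().backward()
print(w.grad)   # tensor([2., 2., 2.]) -- gradients from both backward calls were accumulated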

We know that batch size is closely tied to memory usage, but sometimes the batch size cannot be set too small either. What then? The answer is gradient accumulation.

Let's first look at traditional training:

for i,(feature,target) in enumerate(train_loader):
    outputs = model(feature)           # forward pass
    loss = criterion(outputs, target)  # compute the loss

    optimizer.zero_grad()              # clear the gradients
    loss.backward()                    # backward pass, compute gradients
    optimizer.step()                   # update the network parameters

With gradient accumulation added, the code looks like this:

for i,(features,target) in enumerate(train_loader):
    outputs = model(features)          # forward pass
    loss = criterion(outputs, target)  # compute the loss
    loss = loss/accumulation_steps     # optional, if the loss should be averaged over samples

    loss.backward()                    # compute gradients (accumulated into parameter.grad)
    if ((i+1) % accumulation_steps) == 0:
        optimizer.step()               # update the network parameters
        optimizer.zero_grad()          # clear the accumulated gradients

Actually, there are two ways to understand this (inspired by readers in the comments); I'll talk about the one most commonly seen with BERT.

Comparing the two, we see that gradient accumulation essentially accumulates the gradients of accumulation_steps mini-batches (each of size batch_size/accumulation_steps) and then updates the network parameters from the accumulated gradient, so as to approximate the effect of a true batch of size batch_size. When using it, note that the learning rate should be enlarged appropriately.

In more detail, suppose batch size = 4 and accumulation steps = 8. Gradient accumulation first computes gradients in the forward/backward pass with batch_size = 4, but does not update the parameters; it keeps accumulating the gradients until accumulation_steps batches have been processed, and only then updates the parameters. This is essentially equivalent to:

effective batch_size = batch_size * accumulation_steps

Gradient accumulation can greatly alleviate GPU memory shortage and is highly recommended.

The BERT repository uses this trick. It is extremely practical, a real godsend for poor labs like ours.

Gradient checkpointing

I haven't used this trick myself, since my models aren't that big yet.

I'll update this section once I've tried it; consider this a placeholder for now.

Finally

Well, if you've read this far, it proves one thing: friend, you don't have enough GPUs either. Sigh, a beggar's lab isn't cut out for deep learning. Time to cry.

Reference

[1] A popular-science post: analyzing GPU and memory usage in deep learning (科普帖:深度学习中GPU和显存分析)

[2] How to make fine-grained use of GPU memory in PyTorch (如何在Pytorch中精细化利用显存)

[3] GPU memory too tight but still want to train large-batch models? Who says you can't (GPU捉襟见肘还想训练大批量模型?谁说不可以)

[5] Why do we manually zero the gradients before backpropagation in PyTorch? (PyTorch中在反向传播前为什么要手动将梯度清零?)

[6]From zero to research — An introduction to Meta-learning

[7] Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups




Original article: https://zhuanlan.zhihu.com/p/65002487

