What to do when GPU memory is not enough for training deep neural network models?

Foreword

Recent models are relatively large, especially BERT, which is really tough on my 1080 Ti. BERT's official repository does provide some training tricks to help us speed things up, which is very conscientious, but it still doesn't feel like enough, so I took some time to put together a collection of tricks to help us get by when GPU memory is insufficient.

This post is divided into two parts. The first topic: how to estimate the GPU memory a model needs. The second topic: various tricks for when GPU memory is insufficient.

Monitor GPU

The most commonly used tool for monitoring the GPU is of course nvidia-smi, but there is a tool that displays the information more clearly: gpustat.

nvidia-smi
watch --color -n1 gpustat -cpu   # dynamically monitor the GPU in real time

I recommend setting up an alias for this in your shell configuration file; that way, every time I check the GPU, all the information comes out at once, which is very convenient.

Some readers have also recommended nvtop. I gave it a quick try and it is indeed very good; the information it displays is very rich, and it is worth trying.

How to estimate a model's GPU memory usage [1]

First, let's ponder a question: what exactly in my model is occupying the GPU memory and yelling "out of memory" at every turn?

In fact, the GPU memory occupied by a model mainly consists of three parts: the model's own parameters, the optimizer's parameters, and the inputs and outputs of each layer of the model.

The model's own parameters

The model's own parameters refer to the Weight and Bias of each network layer. This part of memory is occupied as soon as the model finishes loading. Note that some layers have parameters, such as CNN and RNN layers, while some layers have no parameters, such as activation layers and pooling layers.

From PyTorch's point of view, once you execute model.to(device), your model is loaded, and from then on its parameters occupy GPU memory.

For PyTorch, the model parameters are stored in model.parameters(), so we do not need to count them ourselves; they can be printed directly:

para = sum(p.numel() for p in model.parameters())  # total number of parameter elements
print('Model {} : params: {:4f}M'.format(model._get_name(), para * type_size / 1000 / 1000))  # type_size = 4 bytes for float32

Optimizer parameters

Optimizer parameters refer to the parameters generated during back-propagation in the optimization process, mainly the gradients dW. In SGD, the gradient has the same size as the parameters, so during optimization the memory occupied by the model parameters is doubled.

Notably, different optimizers need to store different optimization parameters. For Adam, since it also needs to save additional state (its momentum and second-moment estimates), the parameter-related memory is quadrupled during optimization.
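If you want to see this extra state for yourself, here is a minimal sketch (my own addition, not from the original post; the nn.Linear model is just a stand-in) that inspects what Adam keeps per parameter after one step:

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

model(torch.randn(8, 1024)).sum().backward()
optimizer.step()

# Adam keeps exp_avg and exp_avg_sq for every parameter, on top of the
# parameter itself and its .grad, hence the roughly 4x parameter memory.
for p, state in optimizer.state.items():
    print(tuple(p.shape), sorted(state.keys()))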

The inputs and outputs of each layer

First of all, there is the memory occupied by the input data. This part is not large, because we usually read data iteratively, which means we never actually load the whole dataset into memory at once; this ensures that the memory taken by each input is negligible compared with the network parameters.

Then, when the model performs forward and back-propagation, a very important thing is that it computes and saves the output of each layer as well as its corresponding gradient, which means this also occupies a large part of the memory.

Finally, the memory usage of the model's outputs can be summarized as follows:

  • The output of each layer (a multi-dimensional array) and its corresponding gradient. Note that the model outputs do not need to store the corresponding momentum information (i.e., even if Adam is used here, the memory for the model outputs is still a factor of 2, not 4; I don't know why, advice from experts is welcome)
  • The memory occupied by the outputs is proportional to the batch size

Is there a way to compute this quantity with PyTorch? The answer is yes: we can assume a batch of samples, then traverse each layer via model.modules(), get the output shape of each layer, and thereby obtain the number of output values for one batch of data. [2]

Total memory usage calculation

GPU memory usage = model's own parameters × n + batch size × output size per sample × 2 + one batch of input data (usually negligible)

Here, n is determined by the optimization algorithm: if SGD is chosen, then n = 2; if Adam is chosen, then n = 4.
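As a rough back-of-envelope example (my own, not from the original post): BERT-base has roughly 110M parameters, so in float32 the parameters alone take about 110M × 4 bytes ≈ 440 MB, and with Adam (n = 4) the parameter-related memory is already around 1.76 GB before any layer outputs are counted.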

Someone has already written a great implementation, so I won't bother rewriting it; you can read it and modify it as needed, no problem.

import numpy as np
import torch
import torch.nn as nn

# GPU memory usage estimation function
# model: the model to inspect
# input: a Tensor with the shape actually fed to the model
# type_size: bytes per element, default 4 (float32)
# Note: the layer-by-layer loop below assumes a simple sequential model,
# where each sub-module's output can be fed directly into the next one.

def modelsize(model, input, type_size=4):
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    print('Model {} : params: {:4f}M'.format(model._get_name(), para * type_size / 1000 / 1000))

    input_ = input.clone()
    input_.requires_grad_(requires_grad=False)

    mods = list(model.modules())
    out_sizes = []

    for i in range(1, len(mods)):
        m = mods[i]
        if isinstance(m, nn.ReLU):
            if m.inplace:
                continue   # in-place ReLU does not allocate a new output
        out = m(input_)
        out_sizes.append(np.array(out.size()))
        input_ = out

    total_nums = 0
    for i in range(len(out_sizes)):
        s = out_sizes[i]
        nums = np.prod(np.array(s))
        total_nums += nums

    print('Model {} : intermediate variables: {:3f} M (without backward)'
          .format(model._get_name(), total_nums * type_size / 1000 / 1000))
    print('Model {} : intermediate variables: {:3f} M (with backward)'
          .format(model._get_name(), total_nums * type_size * 2 / 1000 / 1000))
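A quick usage sketch (my own example, reusing the imports above and assuming a simple sequential stack so that the layer-by-layer loop is valid):

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=False),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)
modelsize(model, torch.randn(1, 3, 224, 224))   # one sample; scale by batch size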

Tricks for when GPU memory is insufficient [2]

We won't discuss multi-GPU or distributed computing here, only some conventional tricks; this list will be updated from time to time.

Reduce batch size

This should be easy to understand: appropriately reducing the batch size makes the inputs and outputs of every layer shrink linearly, and the effect is quite obvious. One thing to note is that the dev batch size also affects memory; do not set the batch size to the length of the dev or test set. I recently did this stupid thing and spent a whole day of debugging before I figured out the problem.

Selecting a smaller data type

By default, the entire network uses 32-bit floating point numbers; switching to 16-bit floating point will cut the memory usage by close to half.
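For reference, here is a minimal mixed-precision sketch of my own using torch.cuda.amp (it assumes a PyTorch version that ships it, and reuses the model / criterion / optimizer / train_loader names from the training loops later in this post):

scaler = torch.cuda.amp.GradScaler()

for features, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run the forward pass in float16 where it is safe
        outputs = model(features)
        loss = criterion(outputs, target)
    scaler.scale(loss).backward()         # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()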

Compact models

When designing the model, compact it appropriately: for example, turn an originally two-layer LSTM into one layer; replace the original LSTM with a GRU; reduce the number of convolution kernels; minimize the use of Linear layers, and so on.

From the data angle

For text data, the amount of memory brought by the sequence length increases linearly, so appropriately reducing the sequence length can reduce it greatly.

total_loss

Considering that the loss itself is a tensor carrying gradient information (it holds on to the computation graph), the correct way to accumulate losses is:

total_loss += loss.item()
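A small illustration of the difference (my own sketch, reusing the training-loop names from below): accumulating the tensor keeps every iteration's computation graph alive, while .item() extracts a plain Python float.

total_loss = 0.0
for features, target in train_loader:
    loss = criterion(model(features), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # total_loss += loss        # wrong: retains each iteration's graph in GPU memory
    total_loss += loss.item()   # right: accumulates a detached Python float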

Releasing tensors and variables

Use del to release tensors and variables that you no longer need. This also requires us to be careful about how variables are used when writing model code; don't let them fly all over the place.
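A tiny sketch of the idea (the encoder / classifier split here is hypothetical, just to show a large intermediate being dropped early):

hidden = encoder(features)        # a large intermediate tensor
logits = classifier(hidden)
del hidden                        # drop the reference as soon as it is no longer needed
torch.cuda.empty_cache()          # optionally return cached blocks to the GPU driver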

The inplace parameter of ReLU

The activation function ReLU() has a parameter called inplace, which defaults to False. When it is set to True, the new value computed by relu() does not take up new space but directly overwrites the original value, which means setting it to True can save some memory.
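For example (a small sketch of my own):

import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),   # overwrites the conv output in place instead of allocating a new tensor
)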

Gradient accumulation

First, we must understand some basic knowledge of Pytorch:

  • In PyTorch, when we execute loss.backward(), it computes the gradient for each parameter and stores it in parameter.grad. Note that parameter.grad is a tensor, and the gradients from successive backward calls are added together (see the small demo after this list).
  • In PyTorch, the network parameters are only updated by gradient descent when optimizer.step() is called.
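A tiny demo of the first point (my own sketch): calling backward() twice without zeroing the gradients adds them up.

import torch

w = torch.ones(3, requires_grad=True)
(w * 2).sum().backward()
print(w.grad)   # tensor([2., 2., 2.])
(w * 2).sum().backward()
print(w.grad)   # tensor([4., 4., 4.])  -- gradients accumulate, they are not overwritten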

We know that batch size and memory footprint are closely related, but sometimes the batch size that fits in memory is smaller than what we would like to train with. What can we do? The answer is gradient accumulation.

Let's take a look at traditional training:

for i, (features, target) in enumerate(train_loader):
    outputs = model(features)          # forward pass
    loss = criterion(outputs, target)  # compute the loss

    optimizer.zero_grad()   # clear the gradients
    loss.backward()         # compute the gradients
    optimizer.step()        # update the network parameters

After adding gradient accumulation, the code looks like this:

for i, (features, target) in enumerate(train_loader):
    outputs = model(features)          # forward pass
    loss = criterion(outputs, target)  # compute the loss
    loss = loss / accumulation_steps   # optional, if the loss should be averaged over the accumulated samples

    loss.backward()                    # compute the gradients
    if ((i + 1) % accumulation_steps) == 0:
        optimizer.step()               # update the network parameters
        optimizer.zero_grad()          # clear the accumulated gradients

Comparing the two, we find that gradient accumulation essentially accumulates the gradients of accumulation_steps batches and then updates the network parameters based on the accumulated gradient, thereby achieving an effect similar to a batch size of accumulation_steps * batch_size. When using it, note that the learning rate may need to be scaled up appropriately.

More concretely, suppose the original batch size = 32 and accumulation steps = 8. With gradient accumulation, we first split the batch into accumulation steps parts, obtaining small batches of size 4; we compute the gradient on each small batch without updating the parameters, keep accumulating the gradients, and only update the parameters once the gradients of all accumulation steps small batches have been computed.

Gradient accumulation can largely alleviate GPU memory shortage problems and is highly recommended.

BERT's official repository uses this trick; it is very practical, truly a conscientious trick for our beggar-budget labs.

Gradient Checkpoint

I haven't used this trick myself; after all, my models have never been that big.

I'll update this section once I've used it; consider this a placeholder for now.
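For readers who want a starting point, here is a minimal sketch of my own using torch.utils.checkpoint (not something tested in this post): checkpointed blocks do not store their intermediate activations during the forward pass and recompute them during backward, trading compute for memory.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

x = torch.randn(8, 1024, requires_grad=True)
h = checkpoint(block1, x)     # block1's activations are recomputed during backward
out = checkpoint(block2, h)   # same for block2
out.sum().backward()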

Finally

Hey, if you have read this far, it means one thing: **young man, your card is not enough.** Sigh, beggar labs don't deserve deep learning, crying.

Reference

[1] Science post: an analysis of GPUs and GPU memory in deep learning

[2] How to make fine-grained use of GPU memory in PyTorch

[3] GPU too strapped but still want to train large-batch models? Who says you can't

[4] Why does PyTorch require manually zeroing the gradients before back-propagation?

[5] From zero to research — An introduction to Meta-learning

[6] Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups


Origin: blog.csdn.net/Zserendipity/article/details/105301983