A Summary of GPU Memory and Utilization in Deep Learning

I believe many people, myself included, have plenty of complaints about GPU memory; errors such as CUDA out of memory keep bothering us. This article tries to break the problem down, and hopefully it will be useful.

First, a brief word on why the GPU matters. Taking PyTorch as an example, it uses CUDA and cuDNN to turn the computations performed during training and inference into matrix multiplications, which speeds them up by roughly ten to a hundred times compared with running on the CPU alone.
As for the memory we love to hate, let's first look at how memory behaves while data is being read. The usual implementation is worker processes plus a queue: multiple workers read and preprocess data asynchronously and push it onto the queue, while the main training process fetches batches from the other end. If the queue is full and the training process has not yet taken anything out, the workers block, so memory does not grow without bound.

torch.utils.data.DataLoader(datasets[x], batch_size=batch_size, shuffle=True, num_workers=8, pin_memory=True)

Note: auxiliary processing and computation such as printing logs and calculating the ETA also eat up a noticeable amount of CPU time.
In short, batch_size mainly controls video memory usage (not strictly proportional, since the model's own parameters and intermediate data also occupy memory), while num_workers mainly controls GPU utilization (it lets the CPU's data-loading speed keep up with the GPU's processing speed; think of the producer-consumer pattern from operating systems). Both are discussed below.

Obviously, data loading, preprocessing, postprocessing, and other read/write I/O tasks belong on the CPU, and tuning num_workers is a good way to adjust GPU utilization. A moderately large value (4-8 is reasonable) works well, but bigger is not always better: more workers also mean more per-worker overhead, which increases the CPU load and can end up lowering GPU utilization. num_workers is generally tuned together with batch_size.
In addition, if your server or workstation has plenty of RAM and decent performance, it is recommended to enable pin_memory=True. Batches are then placed in page-locked (pinned) host memory, so the copy to the GPU skips the extra staging step through pageable memory and can run asynchronously, saving a bit of transfer time.
Note here that when loading data, it is better to load the annotation data once in the dataset's __init__ (or in the data pipeline via a dedicated import method) rather than reading it inside __getitem__.
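For illustration, here is a minimal sketch of that pattern (the annotation file name and its JSON layout are made up for the example): the annotation list is read once in __init__, and __getitem__ only does the cheap per-sample decoding.

import json
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, ann_file, transform=None):
        # Read the annotation list once up front, not inside __getitem__.
        with open(ann_file) as f:
            self.samples = json.load(f)  # e.g. [{"path": ..., "label": ...}, ...]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Only per-sample work (decode + transform) happens here, so each
        # worker process stays cheap.
        item = self.samples[idx]
        img = Image.open(item["path"]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, item["label"]

loader = DataLoader(MyDataset("annotations.json"), batch_size=64,
                    shuffle=True, num_workers=8, pin_memory=True)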

Going one step further, a rough approach when there are multiple GPUs is to use DDP mode (one process per GPU rather than many threads), have the rank-0 process put the annotation data into shared memory, and let all other processes map that memory read-only, achieving zero-copy sharing.
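As a rough sketch of this idea (one possible way, not the only one), rank 0 can dump the annotation array into a file under /dev/shm and every rank can then memory-map it read-only; the path and the build_annotations() helper are hypothetical and only stand in for your own annotation loading.

import numpy as np
import torch.distributed as dist

SHM_PATH = "/dev/shm/annotations.npy"  # hypothetical shared-memory path for this sketch

def load_shared_annotations(rank):
    # Assumes dist.init_process_group(...) has already been called (standard DDP setup).
    if rank == 0:
        # Rank 0 builds the annotation array once and writes it into shared memory.
        ann = np.asarray(build_annotations(), dtype=np.float32)  # build_annotations() is a placeholder
        np.save(SHM_PATH, ann)
    dist.barrier()  # make sure the file exists before other ranks map it
    # Every rank maps the same pages read-only: no extra copies are made.
    return np.load(SHM_PATH, mmap_mode="r")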

If GPU utilization is low, the first step is to increase the batch size and use more of the GPU memory; try to use it fully rather than leaving half of it idle (if the idle half gets taken by another program, both jobs run slowly). Second, when loading data, set num_workers somewhat larger (8 or 16 are reasonable) and enable pin_memory=True. Do not run the whole job in the main process: that hogs the CPU and makes both speed and performance terrible.

Secondly, consider the influence of the model itself, namely the model parameters and the responses (activations) of each layer, which depend on the model's complexity. Taking ResNet-50 as an example, it has about 26 million parameters; at 32-bit floating-point precision they occupy 26M × 32 bit ≈ 99 MB. It also has about 16 million response values, occupying 16M × 32 bit = 64 MB.
The number of responses is not directly tied to the number of parameters: convolutional layers can produce very large responses with few parameters, and activation layers produce responses while having no parameters at all.
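Numbers like these are easy to check yourself; a small sketch, assuming torchvision is installed:

import torchvision

model = torchvision.models.resnet50()
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")                 # roughly 25-26M
print(f"fp32 weight memory: {n_params * 4 / 1e6:.0f} MB")   # about 100 MB at 4 bytes each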

You may feel that a bit over 100 MB is no big deal. Not so fast: this is where batch size enters the picture.

To make effective use of the GPU's SIMD hardware, data must be fed to the network in mini-batches.
For example, to fill a common 1024-bit wide data path with 32-bit floating-point numbers, 32 samples need to be processed at the same time.
With a mini-batch, only one copy of the model parameters is kept, but the responses of every layer scale with the mini-batch size.

A 50-layer ResNet with mini-batch = 32: the layer responses occupy 64 MB × 32 = 2 GB of memory.

However, 2 GB is not the end of it. Deep learning libraries generally use the lowering approach to convert convolutions into matrix multiplications (im2col).
When a convolution is computed this way, the input response X of the layer is expanded by a factor of K^2, where K is the kernel size.

For the 50-layer ResNet, once this lowering effect is taken into account, the layer responses occupy about 7.5 GB of memory.
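The K^2 blow-up can be seen directly with torch.nn.functional.unfold, which performs the im2col step of the lowering; the tensor shape below is just an arbitrary ResNet-like feature map chosen for illustration.

import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)                 # a ResNet-like feature map
cols = F.unfold(x, kernel_size=3, padding=1)   # im2col for a 3x3 convolution, stride 1
print(x.numel(), cols.numel())                 # cols is about 9x (= K^2) larger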

Another thing to watch is the stopping criterion for training: is it a fixed number of epochs or a fixed number of iterations? If iterations are the criterion, then increasing batch_size naturally increases total training time, so the iteration count should be halved when batch_size doubles. If epochs are the criterion, nothing else needs to change when batch_size changes.

Put simply, video memory usage ≈ memory taken by the model + batch_size × memory taken per sample.
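To check this rule of thumb against reality, the counters of PyTorch's CUDA allocator can be read directly; a minimal sketch, assuming a CUDA device and torchvision are available:

import torch
import torchvision

device = "cuda"
model = torchvision.models.resnet50().to(device)
print("model only:", torch.cuda.memory_allocated(device) / 2**20, "MiB")

x = torch.randn(32, 3, 224, 224, device=device)
loss = model(x).sum()       # the forward pass keeps activations alive for backward
loss.backward()
print("peak fwd+bwd:", torch.cuda.max_memory_allocated(device) / 2**20, "MiB")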

ResNet-50 served as the example above; switching to GPT-2 below is probably closer to the networks in use today.

Introduction

For a GPT-2 model with 1.5B parameters, 3 GB of memory is enough to store its weights (parameters) at 16-bit precision, yet it cannot be trained on a single GPU with 32 GB of video memory. During training, most of the memory overhead goes to model states: optimizer states, gradients, and parameters. The remaining overhead comes from residual states: activation values, temporary buffers, and memory fragmentation.

Model States

Taking Adam as an example, it stores two pieces of optimizer state: the time-averaged momentum and the variance of the gradients. Training with Adam therefore needs enough memory for the momentum estimate and the gradient-variance estimate, in addition to the model's own gradients and weights. The optimizer state often accounts for a large share of memory, especially in mixed-precision training.

In mixed-precision (fp16/fp32) training, the model parameters and activations are stored in fp16, and the forward and backward passes also compute with the fp16 weights and activations. However, to compute efficiently while still keeping the weight updates correct (fp16 introduces rounding errors), an fp32 copy of the weights and of the optimizer state is usually kept as well.

Example: take a model with parameter count γ, trained with Adam and mixed precision. First, 2γ bytes are needed for the fp16 parameters and another 2γ bytes for the fp16 gradients. On top of that, 4γ bytes each are needed for the fp32 copy of the parameters, the momentum estimate, and the gradient variance. Writing K for the multiplier of this extra optimizer-state memory, K = 12 for mixed-precision Adam. In total, training such a model needs 2γ + 2γ + Kγ = 16γ bytes. For a GPT-2 model with 1.5B parameters, that is at least 24 GB, far more than the 3 GB the fp16 parameters alone occupy.
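The same arithmetic as a tiny script, where num_params plays the role of γ and the byte counts follow the fp16/fp32 breakdown above:

def mixed_precision_adam_bytes(num_params):
    fp16_params = 2 * num_params
    fp16_grads  = 2 * num_params
    fp32_params = 4 * num_params   # master copy of the weights
    momentum    = 4 * num_params   # Adam first moment, fp32
    variance    = 4 * num_params   # Adam second moment, fp32
    return fp16_params + fp16_grads + fp32_params + momentum + variance  # 16 * num_params

print(mixed_precision_adam_bytes(1.5e9) / 1e9, "GB")  # 24.0 GB for GPT-2 1.5B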

Residual States

Activation

Activation values can take up a large share of memory during training. For a Transformer-style model, activation memory is proportional to transformer_layers × hidden_dim × sequence_length × batch_size; for GPT-2 it is roughly 12 × hidden_dim × batch_size × seq_length × transformer_layers values. So a GPT-2 model with 1.5B parameters, a sequence length of 1K, and a batch size of 32 needs about 60 GB.
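Plugging in the GPT-2 1.5B configuration (48 transformer layers, hidden size 1600, taking 1K as 1024 tokens and fp16 activations at 2 bytes each) reproduces that figure:

layers, hidden, seq_len, batch = 48, 1600, 1024, 32
bytes_per_value = 2  # fp16
activation_bytes = 12 * hidden * batch * seq_len * layers * bytes_per_value
print(activation_bytes / 1e9, "GB")  # about 60 GB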

Activation checkpointing can reduce activation memory by roughly a square-root factor, at the cost of about 33% extra computation to recompute activations (this brings the activation memory of the model above down to about 8 GB). For larger models, though, the effect is still limited: a GPT model with 100B parameters still needs about 60 GB of activation memory even with checkpointing.
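In PyTorch this compute-for-memory trade-off is exposed through torch.utils.checkpoint; a minimal sketch with a toy stack of layers (the layer sizes and segment count are arbitrary):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy deep network standing in for a real model.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                         for _ in range(24)])

x = torch.randn(32, 1024, requires_grad=True)
# Split into 4 segments: only the segment-boundary activations are stored,
# everything in between is recomputed during the backward pass.
out = checkpoint_sequential(layers, 4, x)
out.sum().backward()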

Temporary Buffers

Temporary buffers hold intermediate results, for example for the all-reduce of gradients, or for gradient norm computation, where all gradients are fused into a single flattened buffer before the all-reduce to improve throughput. For a model with 1.5B parameters, such a flattened buffer in fp32 alone takes 1.5 × 10^9 × 4 bytes = 6 GB.

Memory Fragmentation

Even when the total memory is, in theory, larger than what training requires, memory fragmentation (free memory scattered across non-contiguous blocks, so no single block is large enough for a requested allocation) can still trigger out-of-memory errors. When training very large models, in extreme cases an OOM error can occur with 30% of memory still free.

A few simple ways to make every bit of video memory count

1. To make effective use of SIMD, halving the precision means the batch size has to double, so lowering precision by itself cannot reduce memory consumption. However, NVIDIA's apex mixed-precision acceleration can cut PyTorch's memory use roughly in half directly (see the sketch after this list).
2. In-place operations: overwrite the original response directly instead of allocating new memory. Many activation functions support this, e.g. nn.ReLU(inplace=True), so use the inplace flag wherever possible.
A more involved variant: by analyzing the whole network graph, responses that are used only once can share memory with later responses, as in MXNet's memory-sharing mechanism.
3. Trading computation for storage: responses that are cheap to recompute (such as the output of an activation layer) are not stored but recomputed when needed. With this approach, the MXNet example reduced the memory footprint of a 50-layer ResNet by roughly a factor of four.
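Combining points 1 and 2 above, here is a minimal sketch of mixed-precision training with an in-place activation, using PyTorch's built-in torch.cuda.amp rather than apex (the tiny model and random data are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                      nn.ReLU(inplace=True),        # point 2: overwrite the response in place
                      nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(),
                      nn.Linear(64, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()                # point 1: mixed precision

x = torch.randn(32, 3, 224, 224, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                     # fp16 where safe, fp32 where needed
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()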

There are also 10+ further tricks worth checking out (see the references below).

Finally, if you optimize the operations in the network graph and the memory used per unit of batch size, you can naturally fit a larger batch and make fuller use of the GPU cores.

References

  1. https://www.zhihu.com/question/476086761
  2. https://blog.csdn.net/shenxiaolu1984/article/details/71522141
  3. https://zhuanlan.zhihu.com/p/362556069
  4. https://blog.csdn.net/qq_45756171/article/details/122910838
  5. https://zhuanlan.zhihu.com/p/348122818
  6. https://blog.csdn.net/qq_34405401/article/details/108519823
  7. https://zhuanlan.zhihu.com/p/520898642
  8. https://zhuanlan.zhihu.com/p/31558973
  9. https://www.cnblogs.com/jiangkejie/p/10098995.html
