GPU and CPU issues when running a model

1. Common tricks when GPU memory is insufficient

1.1 Monitoring the GPU

  • nvidia-smi, which is installed together with the NVIDIA driver and prints a one-shot snapshot of GPU status
  • gpustat, which can monitor the GPU dynamically, refreshing at a fixed interval
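
Besides these command-line tools, PyTorch exposes its own memory counters, which are handy inside a training script. A minimal sketch (the helper name report_gpu_memory is just an illustration):

    import torch

    def report_gpu_memory(tag):
        # Bytes held by live tensors vs. bytes reserved by the caching allocator.
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"[{tag}] allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB")

    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device="cuda")  # ~4 MiB of float32
        report_gpu_memory("after allocation")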

1.2 Estimating model memory usage

The GPU's memory usage is composed mainly of two parts:

  • a static part: the model's own parameters, the optimizer's state, and the intermediate activations cached by each layer of the model;
  • a dynamic part that scales with the batch size (the per-sample activations). A rough estimate of the static part is sketched below.
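
A back-of-the-envelope estimate of the static part can be scripted. A minimal sketch, assuming float32 training with Adam (which keeps two extra state tensors per parameter); the toy model is only an illustration:

    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))

    n_params = sum(p.numel() for p in model.parameters())
    bytes_per = 4                          # float32 is 4 bytes per element
    weights = n_params * bytes_per         # the parameters themselves
    grads = n_params * bytes_per           # one gradient per parameter
    adam_state = 2 * n_params * bytes_per  # Adam's exp_avg and exp_avg_sq

    total_mib = (weights + grads + adam_state) / 1024**2
    print(f"{n_params} params -> ~{total_mib:.1f} MiB before activations")

Activation memory comes on top of this and grows roughly linearly with batch size, so it is usually measured empirically, e.g., with torch.cuda.max_memory_allocated().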

1.3 Techniques for insufficient GPU memory

  • Reduce the batch size
  • Choose a smaller data type (e.g., float16 instead of float32)
  • Use a more compact model
  • Shrink the data itself (e.g., lower the image resolution or shorten sequences)
  • Accumulate total_loss as a plain number via loss.item(), not as a tensor, so that each batch's computation graph is not kept alive
  • Release unneeded tensors and variables (del them, then call torch.cuda.empty_cache() if needed)
  • Set ReLU's inplace=True parameter so the activation overwrites its input instead of allocating a new output tensor
  • Gradient accumulation (see the first sketch after this list)
    In more detail, assume batch_size = 4 and accumulation_steps = 8. Gradient accumulation computes the gradient for each batch of 4 (forward and backward passes) but does not update the parameters; the gradients are accumulated until accumulation_steps batches have been processed, and only then are the parameters updated. In essence this is equivalent to:
    effective batch_size = batch_size * accumulation_steps
  • Gradient checkpointing (trade compute for memory by recomputing activations in the backward pass instead of caching them)
  • Mixed precision training (see the second sketch after this list)
  • Distributed training
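
The gradient-accumulation recipe above translates directly into a PyTorch loop. A minimal sketch, where model, optimizer, criterion, and dataloader are placeholder names for an existing training setup:

    accumulation_steps = 8

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs.cuda())
        loss = criterion(outputs, targets.cuda())
        # Divide so the accumulated gradient matches one large-batch gradient.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one parameter update per accumulation_steps batches
            optimizer.zero_grad()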
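
Likewise, a minimal mixed-precision sketch using PyTorch's torch.cuda.amp, with the same placeholder names:

    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        with autocast():  # runs eligible ops in float16 to save memory
            outputs = model(inputs.cuda())
            loss = criterion(outputs, targets.cuda())
        scaler.scale(loss).backward()  # scale the loss to avoid float16 underflow
        scaler.step(optimizer)
        scaler.update()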

1.4 Improving GPU utilization

  When the number of CPU worker processes is not set, Volatile GPU-Util oscillates repeatedly: 0% → 95% → 0%. This happens because the GPU is waiting for data from the CPU: once a batch arrives over the bus, the GPU starts computing and utilization spikes; since the GPU's compute power is so great that it can finish the batch in roughly half a second, utilization then drops back while it waits for the next batch. The main bottleneck for utilization is therefore the CPU's data throughput.

  • Use faster RAM and a more powerful CPU;
  • num_workers
      To improve utilization, first set num_workers appropriately; 4, 8, and 16 are commonly chosen values. Testing shows that very large values such as 24 or 32 actually reduce efficiency, because the data must be distributed evenly across that many worker processes for preprocessing and dispatch, and the coordination overhead dominates. Conversely, with a single worker one CPU process does all the preprocessing and transfer to the GPU, which is also inefficient.
  • pin_memory
     When the machine has ample memory and good performance, setting pin_memory=True is recommended. Each batch is then staged in page-locked (pinned) host memory, from which copies to the GPU are faster, saving some data-transfer time. A DataLoader sketch combining both settings follows this list.
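
A minimal DataLoader sketch combining both settings; dataset is a placeholder for an existing Dataset object:

    from torch.utils.data import DataLoader

    loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=8,     # several CPU workers preprocess batches in parallel
        pin_memory=True,   # stage batches in page-locked host memory
    )

    for inputs, targets in loader:
        # non_blocking=True lets the copy overlap with GPU compute when memory is pinned.
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)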
