Analysis and optimization of common causes of low GPU utilization


Source: Tencent Technology Engineering (腾讯技术工程), via Deep Learning Enthusiasts
This article is about 2,200 words; suggested reading time is 5 minutes.
It shares some solutions to the problem of wasted GPU resources.

Recently, a number of users have reported that their GPU utilization is low and GPU resources are being seriously wasted. After analyzing several cases, I am sharing the solutions in this document, hoping it will help others who use GPUs.


1. Definition of GPU Utilization

The GPU utilization in this article mainly refers to utilization over time slices, i.e. the GPU-Util metric shown by nvidia-smi. It is computed as the percentage of time, within the sampling period, during which a kernel was executing on the GPU.

2. The nature of low GPU utilization

The flow chart of common GPU tasks is as follows:

[Figure: flow of a typical GPU task, alternating between CPU computation and GPU computation]

As shown in the figure above, a GPU task alternates between CPU computation and GPU computation. When the CPU computation becomes the bottleneck, the GPU has to wait and sits idle, so its utilization drops. The direction of optimization is therefore to shorten the time spent in all CPU-side computation and reduce how much it blocks the GPU. Common CPU-side operations are:

  • Data loading

  • Data preprocessing

  • Model saving

  • Loss calculation

  • Evaluation metric calculation

  • Log printing

  • Metric reporting

  • Progress reporting


3. Analysis of Common Causes of Low GPU Utilization

1. Data loading related

1) Storage and compute are in different cities, so loading data across cities is too slow, resulting in low GPU utilization

Explanation: For example, if the data is stored in "Shenzhen ceph" but the GPU computing cluster is in "Chongqing", every read crosses cities, which has a large impact on performance.

Optimization: Either migrate the data or switch computing resources so that storage and compute are in the same city.

2) The performance of the storage medium is too poor

Note: Comparison of read and write performance of different storage media: native SSD > ceph > cfs-1.5 > hdfs > mdfs.

Optimization: Synchronize the data to the local SSD first, then read from the local SSD during training. The local SSD is mounted at "/dockerdata"; you can sync data from other media to this disk for a test run to rule out the storage medium as the cause.

3) There are too many small files, so file I/O takes too long

Explanation: Many small files are not stored contiguously, so reading them wastes a lot of time on seeks.

Optimization: Pack the data into one large file, e.g. convert many small image files into a single hdf5/pth/lmdb/TFRecord file.

Example of lmdb format conversion:

https://github.com/Lyken17/Efficient-PyTorch#data-loader

Please Google for other format conversion methods.
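
As a rough illustration of the packing idea (the folder layout and key scheme here are made up; the repository linked above has a complete implementation), many small JPEG files can be written into a single LMDB file like this:

import glob
import lmdb

image_paths = sorted(glob.glob("train_images/*.jpg"))   # hypothetical folder of small files

env = lmdb.open("train.lmdb", map_size=1 << 40)          # map_size: upper bound on database size
with env.begin(write=True) as txn:
    for idx, path in enumerate(image_paths):
        with open(path, "rb") as f:
            # store the raw encoded bytes; decode them inside the Dataset at training time
            txn.put(str(idx).encode("ascii"), f.read())
env.close()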

4) Multiple processes are not enabled to read data in parallel

Explanation: num_workers and similar parameters are not set, or are set unreasonably, so the CPU's parallelism is not exploited; the CPU becomes the bottleneck and blocks the GPU.

Optimization: Set the num_workers parameter of the torch.utils.data.DataLoader method, the num_parallel_reads parameter of the tf.data.TFRecordDataset method, or the num_parallel_calls parameter of the tf.data.Dataset.map method.

5) Prefetching is not enabled, so CPU and GPU work does not overlap

Explanation: prefetch_factor and similar parameters are not set, or are set unreasonably, so the CPU and GPU run serially; GPU utilization drops to 0 whenever the CPU is working.

Optimization: Set the prefetch_factor parameter of torch.utils.data.DataLoader, or use the tf.data.Dataset.prefetch() method. prefetch_factor is the number of batches loaded in advance by each worker (this parameter requires PyTorch 1.7 or above); the buffer_size parameter of Dataset.prefetch() is generally set to tf.data.experimental.AUTOTUNE, which lets TensorFlow choose an appropriate value automatically.

6) Page-locked memory (pin_memory) is not set

Note: If pin_memory of torch.utils.data.DataLoader is not set or is set to False, data has to be copied from pageable host memory into a pinned staging buffer before being transferred to the GPU.

Optimization: If host memory is plentiful, set pin_memory=True so that batches land directly in page-locked memory and can be copied to the GPU faster (and asynchronously), saving some data-transfer time.
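
Putting points 4), 5), and 6) together, a DataLoader might be configured as in the sketch below (the dataset and batch size are placeholders, not from the original article):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 3, 224, 224),
                        torch.randint(0, 10, (10000,)))   # placeholder data

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,        # several worker processes load data in parallel
    prefetch_factor=4,    # each worker keeps 4 batches ready in advance (PyTorch >= 1.7)
    pin_memory=True,      # page-locked host memory enables faster, async copies to the GPU
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)   # non_blocking pairs with pin_memory=True
    labels = labels.cuda(non_blocking=True)
    # ... forward / backward ...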

2. Data preprocessing related

1) Data preprocessing logic is too complex

Note: As a rule of thumb, if the data preprocessing logic contains more than a single for loop, it should not run inline with the GPU training step.

Optimization:

a. Set the num_parallel_calls parameter of tf.data.Dataset.map to increase the degree of parallelism. Generally, it is set to tf.data.experimental.AUTOTUNE, which allows TensorFlow to automatically select the appropriate value.

b. Move part of the data preprocessing out of the training task. For example, for image normalization, run a Spark distributed job in advance (or finish the CPU-side processing ahead of time), and then train on the preprocessed data.

c. Load the configuration files and other information needed by the preprocessing step into memory once at start-up, instead of reading them on every computation.

d. For lookups, prefer dicts to speed up queries; reduce for and while loops to lower preprocessing complexity.
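
A minimal sketch of points c and d (the vocabulary file and its format are invented for illustration): load lookup data once at start-up and use dict lookups instead of scanning inside the per-sample loop.

import json

with open("vocab.json") as f:      # hypothetical config file: read once at start-up, not per sample
    vocab = json.load(f)           # e.g. {"hello": 0, "world": 1, ...}

def preprocess(tokens, unk_id=1):
    # dict lookup is O(1); scanning a list inside a for/while loop would be O(n) per token
    return [vocab.get(t, unk_id) for t in tokens]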

2) Use GPU for data preprocessing -- Nvidia DALI

Description: Nvidia DALI is a library dedicated to accelerating data preprocessing, supporting both GPU and CPU.

Optimization: Use DALI to move the CPU-based data preprocessing pipeline onto the GPU.

A DALI introduction (in Chinese): https://zhuanlan.zhihu.com/p/105056158.
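
Below is a sketch of what a DALI image pipeline can look like on a recent DALI release (the folder layout, image size, and normalization constants are assumptions, not from the article); see the official NVIDIA DALI documentation for the authoritative API.

from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def jpeg_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")          # JPEG decoding runs on the GPU
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT, output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels

pipe = jpeg_pipeline("train_images/")                          # hypothetical image folder
pipe.build()
train_loader = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")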

3. Model saving related

1) The model is saved too frequently

Note: Saving the model is a CPU operation; doing it too often easily makes the GPU wait.

Optimization: Reduce the frequency of saving the model (checkpoint).
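
A simple sketch of reduced checkpoint frequency (the model, optimizer, and epoch count are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
SAVE_EVERY = 5                                             # save once every 5 epochs, not every epoch

for epoch in range(20):
    # ... one epoch of training goes here ...
    if (epoch + 1) % SAVE_EVERY == 0:
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"checkpoint_epoch{epoch + 1}.pth")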

4. Metrics related

1) Loss calculation is too complicated

Description: Complicated loss calculations with for loops cause the CPU to take too long to calculate and block the GPU.

Optimization: Use a lower-complexity loss, or speed up the calculation with multiprocessing or multithreading.
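
As an illustration of the difference (random tensors stand in for real predictions and targets), the loop version below is dominated by Python interpreter overhead and many tiny tensor operations, while the vectorized version does the same computation with a few large operations:

import torch

pred = torch.randn(1024, 10)
target = torch.randn(1024, 10)

# Slow: per-sample Python loop
loss_loop = 0.0
for i in range(pred.shape[0]):
    loss_loop = loss_loop + ((pred[i] - target[i]) ** 2).mean()
loss_loop = loss_loop / pred.shape[0]

# Fast: one vectorized expression, independent of batch size
loss_vec = ((pred - target) ** 2).mean()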

2) Metrics are reported too frequently

Description: Metric reporting is performed too often; the frequent switching between CPU and GPU leads to low GPU utilization.

Optimization: Report on a sampled schedule instead, for example once every 100 steps.


5. Log related

1) Log printing is too frequent

Description: Log printing is performed too often; the frequent switching between CPU and GPU leads to low GPU utilization.

Optimization: Print on a sampled schedule instead, for example once every 100 steps.
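
A minimal sketch of sampled reporting/printing for both of the cases above (the loss value and the metric-reporting call are placeholders):

LOG_EVERY = 100

for step in range(10000):
    loss = 0.0                       # placeholder: compute the real training loss here
    if step % LOG_EVERY == 0:
        print(f"step={step} loss={loss:.4f}")
        # report_metric("train/loss", loss, step)   # hypothetical metric-reporting call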


4. Description of Common Data Loading Methods

1. PyTorch's torch.utils.data.DataLoader

 
  
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)


From the parameter definition, we can see that DataLoader mainly supports the following functions:

  • Supports loading both map-style and iterable-style datasets; the main parameter involved is dataset

  • Custom data loading order, the main parameters involved are shuffle, sampler, batch_sampler, collate_fn

  • Automatically organize data into batch sequences, the main parameters involved are batch_size, batch_sampler, collate_fn, drop_last

  • Single-process and multi-process data loading, the main parameters involved are num_workers, worker_init_fn

  • Automatic memory pinning (page-locked memory); the main parameter involved is pin_memory

  • Support data preloading, mainly related to the parameter prefetch_factor
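
As a small illustration of batch assembly with collate_fn (the variable-length data here is invented), a custom collate function can pad sequences of different lengths into one batch tensor:

import torch
from torch.utils.data import Dataset, DataLoader

class VarLenDataset(Dataset):
    """Map-style dataset of random, variable-length integer sequences (placeholder data)."""
    def __init__(self, n=100):
        self.seqs = [torch.randint(0, 50, (int(torch.randint(5, 20, (1,))),)) for _ in range(n)]
    def __len__(self):
        return len(self.seqs)
    def __getitem__(self, idx):
        return self.seqs[idx]

def pad_collate(batch):
    # pad every sequence in the batch to the length of the longest one
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)

loader = DataLoader(VarLenDataset(), batch_size=8, shuffle=True,
                    collate_fn=pad_collate, drop_last=True)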

Reference documents:

https://pytorch.org/docs/stable/data.html


2. TensorFlow's tf.data.Dataset

 
  
ds_train = tf.data.Dataset.from_tensor_slices((x,y))\
    .shuffle(5000)\
    .batch(batchs)\
    .map(preprocess,num_parallel_calls=tf.data.experimental.AUTOTUNE)\
    .prefetch(tf.data.experimental.AUTOTUNE)

  • Dataset.prefetch(): lets the Dataset prefetch a number of elements during training, so the CPU can prepare the next data while the GPU is training, improving training efficiency

  • Dataset.map(f): maps the transformation function f over every element of the dataset; it can exploit multi-core CPUs and transform data in parallel. Set num_parallel_calls to tf.data.experimental.AUTOTUNE to let TensorFlow choose an appropriate value automatically

  • Dataset.shuffle(buffer_size): shuffles the dataset by filling a buffer with the first buffer_size elements, sampling randomly from the buffer, and replacing each sampled element with a subsequent one

  • Dataset.batch(batch_size): splits the dataset into batches, i.e. every batch_size consecutive elements are merged along a new 0th dimension (as with tf.stack()) into a single element

Reference documents:

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#methods_2

5. Low GPU utilization common to distributed tasks

Compared with stand-alone tasks, distributed tasks have one more inter-machine communication link. If it runs well on a single machine, but problems such as low GPU utilization and slow running speed occur after expanding to multiple machines, it is likely that the communication time between machines is too long. Please check the following points:

1. Are the machine nodes in the same module?

Answer: When the machine nodes are in different modules, inter-machine communication takes much longer. For the DeepSpeed component, the platform already schedules jobs onto the same module, so no user action is needed; for other components, contact us to enable this.

2. Is GDRDMA enabled for multi-machine jobs?

Answer: Whether GDRDMA can be enabled depends on the NCCL version. In our tests, with PyTorch 1.7 (bundling NCCL 2.7.8), GDRDMA could not be enabled; after discussing with NVIDIA engineers, this was confirmed to be a bug in the newer NCCL version and was temporarily worked around by runtime injection. With PyTorch 1.6 (bundling NCCL 2.4.8), GDRDMA can be enabled. In our tests, "NCCL 2.4.8 + GDRDMA enabled" performs about 4% better than "NCCL 2.7.8 + GDRDMA disabled". Set export NCCL_DEBUG=INFO and check whether the log contains [receive] via NET/IB/0/GDRDMA and [send] via NET/IB/0/GDRDMA; if these lines appear, GDRDMA was enabled successfully, otherwise it was not.


3. Does pytorch use DistributedDataParallel for data parallelism?

Answer: Data parallel training in PyTorch involves nn.DataParallel (DP) and nn.parallel.DistributedDataParallel (DDP). We recommend using nn.parallel.DistributedDataParallel (DDP).
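
A minimal single-node DDP sketch (the model and batch are placeholders; launch with torchrun --nproc_per_node=<num_gpus> train.py, or torch.distributed.launch on older PyTorch versions):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # the launcher sets RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)     # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(32, 10).cuda(local_rank)       # placeholder batch
    loss = model(x).sum()
    loss.backward()                                 # gradients are all-reduced across ranks here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()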

Editor: Wang Jing



Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131587630