Low GPU utilization, low CPU utilization, and slow model training in deep learning (PyTorch, TensorFlow): analysis and summary

        During deep learning model training, you can run nvidia-smi on the server or your local PC to observe the card's GPU memory usage (Memory-Usage) and GPU utilization (GPU-Util), and use top to check the CPU process IDs (PIDs) and their utilization (%CPU). Problems often show up here: low GPU memory usage, low GPU utilization, a low CPU percentage, and so on. The sections below analyze these problems and their solutions in detail.

1. GPU memory usage problem

        This indicator is mainly determined by the size of the model and the batch size. When you find that your GPU memory occupancy is low, say 40% or 70%, and your network structure is already fixed, you only need to increase the batch size to use as much of the GPU's memory as possible. GPU memory is consumed first by the model itself: the width, depth, and parameter count of the network, plus the cached activations of every intermediate layer, all take up space in GPU memory, so the model alone occupies a large share. The second factor is the batch size. A batch size of 256 uses close to twice the memory of a batch size of 128: if occupancy is 40% at batch size 128, it will be roughly 80% at 256, with little deviation. Therefore, once the model structure is fixed, set the batch size as large as possible to make full use of the GPU memory. (The GPU itself computes the data you feed it very quickly; the main bottleneck is the data throughput of the CPU.)
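        As a rough illustration of how batch size drives memory usage, the sketch below (with a hypothetical toy model and input size, nothing taken from this article) runs one forward/backward pass at several batch sizes and prints the peak GPU memory PyTorch reports; the parameter memory stays fixed while the activation memory grows roughly in proportion to the batch size.

import torch
import torch.nn as nn

# A hypothetical toy model, only to illustrate how memory scales with batch size.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 64 * 64, 10),
).cuda()

def peak_memory_mb(batch_size):
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 3, 64, 64, device="cuda")
    model(x).sum().backward()          # activations and gradients are allocated here
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

for bs in (32, 64, 128):
    print(f"batch_size={bs}: peak GPU memory {peak_memory_mb(bs):.0f} MB")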

2. GPU utilization issues

        This is the Volatile GPU-Util column. When the number of DataLoader workers is not set, this value keeps jumping: 0%, 20%, 70%, 95%, back to 0%, resting for 1-2 seconds and then repeating. What is actually happening is that the GPU is waiting for data from the CPU. Once the data has been transferred over the bus to the GPU, the GPU starts computing and utilization shoots up; but the GPU is so fast that it typically finishes a batch in about 0.5 seconds, so utilization drops again while it waits for the next batch. The bottleneck for GPU utilization therefore lies in memory bandwidth and the memory modules themselves, as well as CPU performance. The ideal fix is of course better hardware: faster memory (for example, DDR4 or better) paired with a stronger CPU.

        Another approach is to tune the data loading side, the DataLoader in PyTorch, in particular num_workers (the number of worker processes) and pin_memory. This speeds up loading and addresses both the data-transfer bandwidth bottleneck and the resulting low GPU computing efficiency. TensorFlow has equivalent settings for its data loading.

import torch

train_loader = torch.utils.data.DataLoader(image_datasets[x],      # dataset for split x (e.g. 'train')
                                            batch_size=batch_size,  # as large as GPU memory allows
                                            shuffle=True,
                                            num_workers=8,          # worker processes for loading/preprocessing
                                            pin_memory=True)        # page-locked host memory for faster copies to the GPU

        To improve utilization, num_workers (the number of worker processes) must be set appropriately; 4, 8, and 16 are common choices. In my tests, setting num_workers very high, for example 24 or 32, actually reduces efficiency, because the main process has to split the data evenly across that many workers for preprocessing and distribution, and the coordination overhead hurts throughput. On the other hand, with num_workers set to 1, a single process does all the preprocessing and transfers the data to the GPU, which is also inefficient. Secondly, if your server or PC has plenty of RAM and good performance, it is recommended to enable pin_memory=True: batches then sit in page-locked (pinned) host memory and can be copied to the GPU directly, instead of going through ordinary pageable RAM first, which saves some data-transfer time.
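        One rough way to pick num_workers is simply to time a fixed number of batches for a few candidate values. The sketch below is a minimal version of that idea; it assumes the image_datasets[x] dataset and batch_size variable from the snippet above and is only meant as a comparison, not a rigorous benchmark.

import time
from torch.utils.data import DataLoader

# Hypothetical: reuses image_datasets[x] and batch_size from the snippet above.
for workers in (0, 4, 8, 16):
    loader = DataLoader(image_datasets[x], batch_size=batch_size,
                        shuffle=True, num_workers=workers, pin_memory=True)
    start = time.time()
    for i, (images, labels) in enumerate(loader):
        if i == 50:                       # a fixed number of batches is enough to compare
            break
    print(f"num_workers={workers}: {time.time() - start:.1f} s for 50 batches")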

3. CPU utilization problem

        During model training, many people focus on the GPU's performance metrics, but it is just as important to check how well the CPU is being used. However, a very high CPU occupancy is not something to blindly pursue. As shown in the figure, the process with PID 14339 has a CPU usage of 2349% (my server has 32 cores, so the maximum is 3200%). In other words, roughly 24 cores' worth of CPU is being used to load data and do pre- and post-processing, and most of that CPU time goes into loading and transferring data. Measuring the data loading time shows that, even with CPU utilization this high, the actual loading time is more than 20 times that of a properly configured DataLoader; in other words, data loading is about 20 times slower. This happens when the DataLoader's num_workers is 0 or the parameter is not set at all.

CPU utilization check results

        As the figure below shows, loading the data actually takes 12.8 s, the model's computation on the GPU takes 0.16 s, and the loss backpropagation and parameter update take 0.48 s. Even with the CPU at 2349%, model training is still very slow, and the GPU sits idle and waiting most of the time.

num_workers=0, running time statistics of each stage of the model
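        For reference, per-stage numbers like these can be collected by wrapping the training loop with a few timers. The sketch below is one minimal way to do it; model, loader, criterion, and optimizer are placeholders, and torch.cuda.synchronize() is called so that the asynchronous GPU work is actually counted in the right bucket.

import time
import torch

def timed_epoch(model, loader, criterion, optimizer, device="cuda"):
    """Rough per-stage timing: data loading, forward pass, backward pass + update."""
    load_t = fwd_t = bwd_t = 0.0
    end = time.time()
    for images, labels in loader:
        load_t += time.time() - end                 # time spent waiting on the DataLoader

        images, labels = images.to(device), labels.to(device)
        torch.cuda.synchronize()
        t0 = time.time()
        loss = criterion(model(images), labels)
        torch.cuda.synchronize()
        fwd_t += time.time() - t0                   # forward computation on the GPU

        t1 = time.time()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        bwd_t += time.time() - t1                   # loss backpropagation + parameter update

        end = time.time()
    print(f"load {load_t:.2f} s | forward {fwd_t:.2f} s | backward/update {bwd_t:.2f} s")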

        When I set num_workers=1, the timing statistics are as follows: the data loading time drops to 6.3 s, doubling the data loading efficiency. CPU utilization is now only 170%, so far less CPU is used while performance doubles.

When num_workers=1, the running time statistics of each stage of the model

        Checking the GPU status at this point (my model is trained on cards 1, 2, and 3) shows that although the memory usage of GPUs 1, 2, and 3 is very high, basically 98%, their compute utilization is around 0%. The network is apparently still waiting for data to be transferred from the CPU to the GPU: the CPU is loading data frantically while the GPU sits idle.

Screenshot of the memory usage and computing efficiency of GPUs 1, 2, and 3

        This shows that higher CPU utilization is not necessarily better.

        The solution to this problem is to increase the DataLoader's num_workers, that is, to spawn more worker processes to share the data-processing load of the main process, so that preprocessing and transfer are handled cooperatively by multiple workers instead of a single process being responsible for everything.

        I got good results with num_workers=8 or 16. Checking with top at this point, with num_workers=8 there are 8 worker processes with consecutive PIDs, each at around 100% CPU. This shows that the CPU side is now distributing the work well, which improves data throughput. The effect is shown in the figure below: CPU utilization is even and efficient, and every worker runs close to its full capacity.

When num_workers=8, CPU utilization and 8 continuous PID tasks

        Checking with nvidia-smi now shows that all the GPUs are running at full load: GPU memory is full and GPU utilization stays close to 100%, so the model's processing speed has improved greatly.

Result after optimizing data loading with num_workers=8 and a larger batch size

        As the figure above shows, the GPU's memory usage is now maximized. The recipe is: set the batch size large enough to fill the GPU memory, set num_workers=8 to spread the work across several worker processes, and set pin_memory=True so that batches sit in pinned host memory and can be copied to the GPU quickly. This resolves the data bottleneck between the CPU and the GPU and balances overall performance.
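        A small detail worth noting: pinned host memory mainly pays off when the copy to the GPU is issued with non_blocking=True. A minimal sketch, assuming the train_loader from the earlier snippet (built with pin_memory=True) and the same placeholder model, criterion, and optimizer as in the timing sketch above:

import torch

device = torch.device("cuda")
for images, labels in train_loader:
    # With pinned host memory, non_blocking=True lets the host-to-GPU copy run
    # asynchronously instead of stalling the Python loop.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()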

        The running times at this point are summarized in the table below:

Processing time statistics

Processing stage                               Time
Data loading                                   0.25 s
Forward computation on the GPU                 0.21 s
Loss backpropagation and parameter update      0.43 s

4. Summary

        To summarize the analysis above: first, increase the batch size to raise the GPU memory usage, and try to use up the memory rather than leave half of it idle; if the leftover memory is taken by another program, both tasks will run very inefficiently. Second, when loading data, set num_workers to a somewhat larger value, 8 or 16 are recommended, and enable pin_memory=True. Do not leave the whole job to the main process, which burns CPU while delivering very poor speed and performance.

Supplement: since several questions keep coming up in the comments, here are some additional notes.

Before opening that many workers, first check your batch_size. If the batch size is small, the main process loads and processes the data directly and it is not distributed across multiple GPUs (if you are using several); with a single GPU, the CPU is busy reading and loading data and the GPU then processes it all at once. In that case your model is probably very small, or its FLOPs are very low, so check the model itself. Also, in this situation opening 8 workers versus 1 worker makes no difference, and if loading is already fast there is no need for multiple workers. It is when the amount of data is large that a higher num_workers greatly reduces the time spent in the data loading phase; the setting has to match the workload.

While debugging, use the command top to view your CPU's per-process utilization in real time; this corresponds to your num_workers setting.

The command watch -n 0.5 nvidia-smi refreshes and displays the GPU status every 0.5 seconds.

This lets you watch your GPU usage in real time, which reflects your GPU-related settings. The two commands work well together, along with the batch_size setting.

Time: September 20, 2019

5. Further additions

Many readers have been discussing various problems. Sometimes, besides checking the code and the processing in each module, it is worth checking which slot your memory module (DIMM) is plugged into. The slot position can also significantly affect how efficiently your code runs on the GPU.

Beyond my small suggestions above, the comments also contain a lot of useful information: readers described and discussed the different situations they ran into, and out of those exchanges came some new findings and solutions for CPU and GPU efficiency problems during training.

Here is some brief supplementary explanation for the following questions:

Question 1. The CPU is busy and the GPU is idle.

Preprocessing the data and loading it into GPU memory both take time; balance the batch size and num_workers accordingly.

Problem 2. CPU utilization is low; when the GPU runs, its utilization fluctuates, rising, then falling, then waiting, and the CPU utilization fluctuates as well.

  • 2.1 The specific situation and countermeasures:

The following situation occurred while training a model with PyTorch. Environment: 2080Ti + i7-10700K, torch 1.6, CUDA 10.2, driver 440. Settings: shuffle=True, num_workers=8, pin_memory=True. Phenomenon 1: on another computer, the same code keeps GPU utilization stable at around 96%. Phenomenon 2: on the personal computer, CPU utilization is relatively low, so data loading is slow, GPU utilization fluctuates, and training is about 4 times slower. Interestingly, sometimes when training starts the CPU utilization happens to be high and the GPU really gets going, but after only a few minutes the CPU utilization drops, never recovers, and training returns to a crawl.

  •  Methods that can be used:

Are the configurations of the two machines the same, the other computer versus yours? Taken as a whole, the setups look a bit different: hardware, CPU cores, memory size. Compare the two devices first. Second, look at the configuration in the code and the efficiency of the code itself. Since the CPU utilization is low, go through each step, find where it gets stuck, where the bottleneck is, and which steps are the most time-consuming. Record the time taken by each major stage, mainly data loading, forward propagation, and the backward update, and then analyze where the time is being spent.

  • 2.2 Analysis of the problem after testing:

Testing shows that this machine gets stuck at the image loading step: sometimes it takes more than 1 s to load an image of roughly 10 KB, which makes loading the whole batch slow. The code should be fine, because it runs at full speed on other computers; in terms of hardware, the GPU and CPU of this machine are powerful and the software environment is identical. The only difference is 16 GB of RAM versus 32 GB on the other test machine. Could this phenomenon be directly related to the memory?

  • Situation analysis

That is most likely where the problem is. You can test the full computation directly with a batch size of 1, or try several different batch sizes, and compare the data loading and compute times between them; the issue is most likely in the data loading / data reading stage. As for the machine's RAM, 16 GB or 32 GB is really enough, and once the data is on the GPU, as long as it fits in GPU memory the impact is small. So the guess is that the relatively small memory is somehow causing the problem; give it a try.
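To localize a bottleneck like this, it can help to time the image decoding step by itself, outside the training loop. A minimal sketch, with a hypothetical file path standing in for one of the training images:

import time
from PIL import Image

path = "data/train/sample_0001.jpg"    # hypothetical path to one ~10 KB training image

start = time.time()
for _ in range(100):
    img = Image.open(path)
    img = img.convert("RGB")           # Image.open is lazy; convert() forces the actual decode
elapsed = time.time() - start
print(f"average decode time: {elapsed / 100 * 1000:.1f} ms per image")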

  • 2.3 Problem location and solution:
  • The memory module in this computer was installed in the wrong slot. On a motherboard with 4 slots, a single memory module should go into the second slot (counting from the CPU side), but the shop that assembled the computer unprofessionally put it in the 4th slot, which hurt performance. After moving it to the correct slot, the speed flies! For the details of slot population rules, some friends I talked to looked them up online; there is plenty of material.

When training on a GPU on your own computer or your own host machine, remember to check which slot your memory module is plugged into.

Supplementary time: January 15, 2021

Source: blog.csdn.net/qq_32998593/article/details/92849585