7. During training, CPU usage is often at 100% while GPU utilization is only about 5%

Phenomenon:
(Screenshot: CPU utilization close to 100% while GPU utilization stays around 5%)
Probable cause: after each GPU computation finishes, a large amount of time is spent writing logs and saving .pth checkpoint files, so GPU utilization stays low while CPU usage remains high.

For a detailed analysis of the cause, see the post [Deep Learning] Diary of stepping on pits: model training is too slow and GPU utilization is low.

Here are the direct solutions:

  1. Reduce the frequency of log I/O operations
  2. Use pin_memory and num_workers in the DataLoader (if num_workers is set inappropriately, problems such as running out of memory can appear, so adjust it to your actual situation); items 1–3 are illustrated in the sketch after this list
  3. Train with half precision
  4. Use a better graphics card, or a lighter model
  5. Increase the batch size: each epoch runs faster, but convergence slows down, so the learning rate needs to be raised appropriately
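To make items 1–3 concrete, here is a minimal PyTorch sketch (not the original author's code): `model`, `train_dataset`, `criterion`, `optimizer` and `num_epochs` are placeholders assumed to exist elsewhere, and the logging/saving intervals are arbitrary example values.

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

# Item 2: pin_memory speeds up host-to-GPU copies, and num_workers loads
# batches in parallel. Too many workers can exhaust RAM, so tune the value.
loader = DataLoader(train_dataset, batch_size=10, shuffle=True,
                    num_workers=4, pin_memory=True)

# Item 3: half/mixed precision via automatic mixed precision (AMP).
scaler = torch.cuda.amp.GradScaler()

log_every = 100   # Item 1: write a log line every 100 steps, not every step
save_every = 5    # Item 1: save a .pth checkpoint every 5 epochs

for epoch in range(num_epochs):
    for step, (inputs, targets) in enumerate(loader):
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():       # forward pass in half precision
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()         # scaled backward to avoid underflow
        scaler.step(optimizer)
        scaler.update()

        if step % log_every == 0:             # reduce log I/O frequency
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")

    if (epoch + 1) % save_every == 0:         # reduce checkpoint I/O frequency
        torch.save(model.state_dict(), f"checkpoint_epoch{epoch + 1}.pth")
```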

The solution used in this article: I adjusted batch_size from 8 to 10 (I originally wanted to raise it to 16, but that ran out of GPU memory, so I could only go up to 10):
(Screenshot: training output after changing batch_size to 10)
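As a hedged illustration of that change (placeholder names, not the original code), only the batch_size argument of the DataLoader needs to be edited; if the batch size is raised much further, a common heuristic is to scale the learning rate roughly in proportion, as noted in item 5:

```python
from torch.utils.data import DataLoader

# Sketch only: train_dataset is a placeholder. The original run used
# batch_size=8; raising it to 10 keeps the GPU busier per step.
loader = DataLoader(train_dataset, batch_size=10,   # was batch_size=8
                    shuffle=True, num_workers=4, pin_memory=True)

# Item 5: if the batch size grows a lot, scale the learning rate roughly
# linearly with it (illustrative values only).
base_lr = 1e-3
lr = base_lr * 10 / 8
```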

Source: blog.csdn.net/panchang199266/article/details/129681692