Problem Solving | A summary of common CUDA errors and their solutions

This post summarizes common CUDA runtime errors and how to fix them~

1. RuntimeError: runtime errors

1.1.RuntimeError: CUDA error: out of memory

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
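One way to apply this tip from inside a Python entry point (a sketch; the variable can equally be exported in the shell, and it must be set before torch is imported):

```python
import os

# CUDA kernels launch asynchronously, so the reported stack trace may point
# at an unrelated API call. Forcing synchronous launches makes the trace
# stop at the kernel that actually failed. Set this BEFORE importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```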

Error analysis:

The program had been running normally and the code had not changed, and free GPU memory looked sufficient, yet the error still occurred; most likely the GPU was occupied by another process.

It may also be caused by cached state left over from the previous training run. Since the job runs inside a Docker container, stopping the container and then starting it again cleared the error~

1.2.RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Possible causes:

  • Mismatched PyTorch and CUDA versions

  • Insufficient GPU memory

Test snippet adapted from other blog posts:

import torch

# True: cuDNN returns a deterministic (default) convolution algorithm every time.
torch.backends.cudnn.deterministic = True
# Spend extra time at startup searching for the fastest convolution
# implementation for each conv layer, to speed up the network afterwards.
torch.backends.cudnn.benchmark = True

Final solution:

Set the DataLoader's num_workers to 0
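A minimal sketch of that fix: with num_workers=0 the data loading runs in the main process instead of worker subprocesses, which are a common trigger for this class of cuDNN crashes. The dataset here is a toy stand-in, not from the original post.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 scalar samples.
dataset = TensorDataset(torch.arange(100).float())

# num_workers=0: load batches in the main process, no worker subprocesses.
loader = DataLoader(dataset, batch_size=10, num_workers=0)

for (batch,) in loader:
    pass  # the training step would go here
```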

1.3.RuntimeError: CUDA out of memory

①RuntimeError: CUDA out of memory. Tried to allocate 152.00 MiB (GPU 0; 23.65 GiB total capacity; 13.81 GiB already allocated; 118.44 MiB free; 14.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The GPU's memory has been exhausted, even though the local GPU's capacity seemed more than sufficient. During PyTorch training, the forward activations and the gradients kept for backpropagation (e.g. for gradient descent) occupy a large amount of GPU memory, so the batch size needs to be reduced.

Solution:

  • Reduce the batch size, i.e. the number of samples processed per training step

  • Release cached GPU memory: torch.cuda.empty_cache()
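The "reduce the batch size" advice can be automated: retry the training step with half the batch whenever it raises a CUDA out-of-memory error. A hedged sketch, where run_step is a hypothetical stand-in for one forward/backward pass (not a function from the original post):

```python
def train_with_backoff(run_step, batch_size, min_batch=1):
    """Halve the batch size until run_step(batch_size) fits in GPU memory."""
    while batch_size >= min_batch:
        try:
            run_step(batch_size)
            return batch_size  # this batch size fits
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated error: re-raise it
            batch_size //= 2  # OOM: halve the batch and retry
    raise RuntimeError("out of memory even at the minimum batch size")
```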

②torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.65 GiB total capacity; 22.73 GiB already allocated; 116.56 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

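The message above suggests setting max_split_size_mb when reserved memory far exceeds allocated memory (a sign of allocator fragmentation). A sketch of how to set it; it must happen before torch is imported, and 128 is an example value, not a recommendation from the original post:

```python
import os

# Limit the size of splittable blocks in PyTorch's caching allocator to
# reduce fragmentation. Must be set BEFORE importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```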

Cause analysis:

During deep-learning training, the code does not release GPU memory after each run, so memory held by an earlier (possibly dead) process is still occupied.

Solution:

Check with nvidia-smi.

Here no training program is running on the GPU, yet its memory is still occupied (figure omitted).

Query the occupying processes with fuser:

fuser -v /dev/nvidia*

(Optional) If the command reports that fuser is not found, install it:

apt-get install psmisc

If "Unable to locate package XXX" appears, first run

apt-get update

Then force-kill (-9) the occupying process, replacing PID with the process ID reported by fuser:

kill -9 PID


After that, the GPU memory is released~


Origin blog.csdn.net/weixin_44649780/article/details/128911586