RuntimeError: CUDA out of memory

Today, while training a model, I suddenly ran into a GPU out-of-memory error. After analyzing it I found a solution, which I record here for future reference.

Note: the solution below applies when this error occurs during model testing (evaluation), not during model training!

RuntimeError: CUDA out of memory

Complete error message:

Traceback (most recent call last):
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/main.py", line 420, in <module>
    main()
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/main.py", line 414, in main
    train_with_cross_validate(training_epochs, kfolds, train_indices, eval_indices, X_train, Y_train, model, losser, optimizer)
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/main.py", line 77, in train_with_cross_validate
    val_probs = model(inputs)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 235, in forward
    x = self.camlp_mixer(x) # (batch_size, F, C, L)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 202, in forward
    x = self.time_mixing_unit(x)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 186, in forward
    x = self.mixing_unit(x)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 147, in forward
    x = self.activate(x)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 772, in forward
    return F.leaky_relu(input, self.negative_slope, self.inplace)
  File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1633, in leaky_relu
    result = torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 0; 23.70 GiB total capacity; 21.49 GiB already allocated; 550.81 MiB free; 21.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Because my program runs a validation pass after each round of training, this error occurred during the model prediction (evaluation) stage.

Key error message:

RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 0; 23.70 GiB total capacity; 21.49 GiB already allocated; 550.81 MiB free; 21.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In general, it means that the GPU memory is insufficient.
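The tail of the message also suggests a mitigation: when reserved memory is much larger than allocated memory, the caching allocator may be fragmented, and you can cap the split size through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch (the value 128 is only an example and should be tuned for your setup):

import os

# Must be set before the CUDA allocator is initialized (safest: before importing torch).
# max_split_size_mb caps the size of cached blocks the allocator may split,
# which can reduce fragmentation when reserved memory >> allocated memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch  # noqa: E402

The same setting can also be exported as an environment variable in the shell before launching the script.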

Use the following command to monitor the GPU status while the program is running:

nvidia-smi -l 1
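Besides nvidia-smi (the -l 1 flag refreshes the output every second), the allocator can also be queried from inside the script. A small sketch using PyTorch's built-in counters (the helper name print_gpu_memory is my own):

import torch

def print_gpu_memory(tag=""):
    """Print how much GPU memory PyTorch has allocated and reserved (cached)."""
    allocated = torch.cuda.memory_allocated() / 1024**3  # memory occupied by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3    # memory cached by the allocator
    print(f"{tag} allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

For example, calling print_gpu_memory("after training epoch") between the training and evaluation steps shows where the usage jumps.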

After the model is loaded, the status of the graphics card at this time:

[screenshot: nvidia-smi output after the model is loaded]

The status of the graphics card during model training:

[screenshot: nvidia-smi output during model training]

After training completes, the prediction stage starts. As soon as the data is fed into the model, the following GPU status appears, and it lasts only a very short time; in the nvidia-smi output it shows up just once:

[screenshot: nvidia-smi output at the start of the prediction stage]

Immediately afterwards the program raised the error, the GPU memory was released, and the running process disappeared from the nvidia-smi process list:

[screenshot: nvidia-smi output after the error, with the memory released]

This seemed strange to me, and I suspected a gradient problem: training works fine while prediction fails, yet training is the stage that needs gradient information and prediction does not. In other words, during evaluation PyTorch was still building the computation graph and keeping intermediate activations for a backward pass that never happens. So I tried disabling gradients:

Simply wrap the model evaluation (validation) code in the following statement:

with torch.no_grad():

The changed code looks like this:

with torch.no_grad():  # disable gradient tracking so intermediate activations are not stored
    # validation
    model.eval()       # evaluation mode: disables dropout, uses running BatchNorm statistics
    inputs = x_eval.to(device)
    val_probs = model(inputs)
    val_acc = (val_probs.argmax(dim=1) == y_eval.to(device)).float().mean()
    # print(f"Eval : Epoch : {iter} - kfold : {kfold+1} - acc: {val_acc:.4f}\n")
    epoch_val_acc += val_acc

After the change, the GPU status during the model prediction stage is as follows:

[screenshot: nvidia-smi output during the prediction stage after the fix]

Then a new round of training starts, and the GPU memory usage does not change.

With this change, the error is no longer reported!
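As a side note, torch.no_grad() can also be used as a decorator, and PyTorch 1.9+ additionally provides torch.inference_mode(), which disables gradient tracking even more strictly. A minimal sketch of the decorator form (the evaluate helper is my own wrapper, not part of the original code):

import torch

@torch.no_grad()  # no computation graph is built inside this function
def evaluate(model, x_eval, y_eval, device):
    model.eval()  # evaluation mode: disables dropout, uses running BatchNorm statistics
    inputs = x_eval.to(device)
    probs = model(inputs)
    return (probs.argmax(dim=1) == y_eval.to(device)).float().mean()

On PyTorch 1.9 or newer, with torch.inference_mode(): can be used in place of with torch.no_grad(): for a further small memory and speed benefit.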

Origin: blog.csdn.net/qq_41990294/article/details/128999777