Today, while training a model, I suddenly hit a GPU out-of-memory error. After analyzing it, I found a solution, which I am recording here for future reference.
Note: the solution below applies when this error occurs during model testing (evaluation), not during model training!
RuntimeError: CUDA out of memory
Complete error message:
Traceback (most recent call last):
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/main.py", line 420, in <module>
main()
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/main.py", line 414, in main
train_with_cross_validate(training_epochs, kfolds, train_indices, eval_indices, X_train, Y_train, model, losser, optimizer)
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/main.py", line 77, in train_with_cross_validate
val_probs = model(inputs)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 235, in forward
x = self.camlp_mixer(x) # (batch_size, F, C, L)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 202, in forward
x = self.time_mixing_unit(x)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 186, in forward
x = self.mixing_unit(x)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pytorch/LiangXiaohan/MI_Same_limb/Joint_Motion_Decoding/SelfAten_Mixer/model/S_CAMLP_Net.py", line 147, in forward
x = self.activate(x)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 772, in forward
return F.leaky_relu(input, self.negative_slope, self.inplace)
File "/home/pytorch/anaconda3/envs/pytorch_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1633, in leaky_relu
result = torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 0; 23.70 GiB total capacity; 21.49 GiB already allocated; 550.81 MiB free; 21.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Because my program runs an evaluation pass after each round of training, this error appeared during the model prediction stage.
Key error message:
RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 0; 23.70 GiB total capacity; 21.49 GiB already allocated; 550.81 MiB free; 21.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This generally means the GPU memory (video memory) is insufficient. Note that here the reserved memory (21.53 GiB) is almost equal to the allocated memory (21.49 GiB), so the fragmentation hint about max_split_size_mb does not really apply; the memory is genuinely exhausted.
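When comparing such messages across runs, it can help to pull the numbers out programmatically. A small sketch — the regex and the function name `parse_cuda_oom` are my own, not part of PyTorch:

```python
import re

def parse_cuda_oom(msg: str) -> dict:
    """Extract the GiB figures from a 'CUDA out of memory' message."""
    pattern = (r"Tried to allocate ([\d.]+) GiB .*?"
               r"([\d.]+) GiB total capacity; "
               r"([\d.]+) GiB already allocated")
    m = re.search(pattern, msg)
    if m is None:
        return {}
    tried, total, allocated = (float(g) for g in m.groups())
    return {"tried": tried, "total": total, "allocated": allocated}
```

Applied to the message above, this yields tried=2.49, total=23.70, allocated=21.49, which makes the near-exhaustion obvious at a glance.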
Use the following command to monitor the GPU status while the program is running (refreshing once per second):
nvidia-smi -l 1
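The same numbers can also be logged from inside the script using PyTorch's own memory counters — a minimal sketch, assuming torch is installed; `gpu_mem_summary` is my own helper name:

```python
import torch

def gpu_mem_summary(device: int = 0) -> str:
    """One-line summary of PyTorch's allocated/reserved GPU memory,
    or a note when no CUDA device is present."""
    if not torch.cuda.is_available():
        return "no CUDA device available"
    gib = 1024 ** 3
    alloc = torch.cuda.memory_allocated(device) / gib
    reserved = torch.cuda.memory_reserved(device) / gib
    return f"allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB"
```

Printing this before and after the evaluation step narrows down exactly where the memory jump happens.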
After the model is loaded, the GPU status looks like this:
The GPU status during model training:
After training finishes, the prediction stage starts. As soon as the data is fed into the model, the following GPU status appears; this state is very short-lived, and during monitoring only one such line of output shows up:
Right after the program crashes, the GPU memory is released and the process disappears from nvidia-smi's process list:
This struck me as strange: training worked fine, and the problem only appeared during model prediction. Since training needs gradient information but prediction does not, I suspected the gradients were the cause and tried disabling them.
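This reasoning can be checked on the CPU: inside torch.no_grad(), results do not record an autograd graph, so the intermediate activations that the graph would keep alive for a future backward() are freed immediately — exactly the memory that was missing at prediction time. A minimal illustration (assuming torch is installed):

```python
import torch

x = torch.randn(8, requires_grad=True)

# Normal forward pass: the result is attached to the autograd graph,
# so intermediates are retained for a future backward().
y = (x * 2).sum()
assert y.requires_grad and y.grad_fn is not None

# Under no_grad, no graph is recorded and nothing extra is retained.
with torch.no_grad():
    z = (x * 2).sum()
assert not z.requires_grad and z.grad_fn is None
```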
Simply wrap the model evaluation (prediction) code in the following statement:
with torch.no_grad():
The changed code looks like this:
with torch.no_grad():
    # validation
    model.eval()
    inputs = x_eval.to(device)
    val_probs = model(inputs)
    val_acc = (val_probs.argmax(dim=1) == y_eval.to(device)).float().mean()
    # print(f"Eval : Epoch : {iter} - kfold : {kfold+1} - acc: {val_acc:.4f}\n")
    epoch_val_acc += val_acc
The GPU status during the prediction stage after the change is as follows:
Then a new round of training starts, and the GPU's memory usage remains unchanged.
With this change, the error is no longer reported!
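As a side note, newer PyTorch versions (1.9+) also offer torch.inference_mode(), a slightly stricter and faster variant of no_grad for pure inference; like no_grad, it can be applied as a decorator. A sketch under that assumption — `evaluate` is a hypothetical function name, not from the original code:

```python
import torch

@torch.inference_mode()
def evaluate(model, x_eval, y_eval, device):
    """Run one validation pass without recording gradients."""
    model.eval()
    inputs = x_eval.to(device)
    val_probs = model(inputs)
    return (val_probs.argmax(dim=1) == y_eval.to(device)).float().mean()
```

Tensors created inside inference_mode cannot later be used in autograd at all, which is fine for a pure evaluation pass like this one.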