[Exception error] NaN error during training: tensor(nan, device='cuda:0', dtype=torch.float64, grad_fn=<MseLossBackward>)

Error example:

The loss became NaN during training.

[train epoch 0] loss: 27.854:   6%|███████   | 7/126 [00:00<00:09, 12.64it/s]
WARNING: non-finite loss, ending training  tensor(nan, device='cuda:0', dtype=torch.float64, grad_fn=<MseLossBackward>)
[train epoch 0] loss: nan:   6%|███████▏

To resolve the error:

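1. The learning rate is too large

If each update step is too large, the loss can grow instead of shrinking until it overflows to inf or NaN. Lower the learning rate; when you do, also reduce the batch size and increase the number of epochs accordingly.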
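To make the effect of the step size concrete, here is a minimal sketch in plain Python rather than PyTorch. It assumes a toy loss f(w) = w**2 and a hypothetical helper named `train`; neither comes from the original training script. It only illustrates how a large learning rate can drive the loss to a non-finite value, which is the condition the warning in the log reports.

```python
import math

def train(lr, steps=1000):
    """Plain gradient descent on the toy loss f(w) = w**2 (gradient 2*w).

    Returns (last_loss, step), where step is the iteration at which the
    loss first became non-finite, or None if it stayed finite throughout.
    """
    w = 1.0
    for step in range(steps):
        loss = w * w
        # Same idea as the "non-finite loss" check in the log above.
        if not math.isfinite(loss):
            return loss, step
        w -= lr * 2.0 * w  # gradient descent update
    return w * w, None

print(train(lr=1.5))  # |w| doubles every step, so the loss eventually overflows
print(train(lr=0.1))  # |w| shrinks every step, so the loss stays finite
```

In a real PyTorch loop the analogous check would be something like `torch.isfinite(loss)` before calling `backward()`; the rest of the loop depends on your own code.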

2. There is a problem with your data; check the dataset

If changing the learning rate repeatedly does not help, there is usually something wrong with your training dataset. If you are using supervised data, check whether the dataset contains empty labels. In my case, this was the cause of the error.
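A quick scan for empty or non-finite entries can confirm or rule out the data as the cause. The sketch below is an assumption about how such a check might look: the function name `find_bad_samples` and the `(features, label)` layout are hypothetical, so adapt them to however your dataset is actually stored.

```python
import math

def find_bad_samples(samples):
    """Return indices of samples with a missing label or non-finite values.

    `samples` is assumed to be a list of (features, label) pairs, where
    features is a list of floats and label is a float or None.
    """
    bad = []
    for i, (features, label) in enumerate(samples):
        label_missing = label is None
        label_non_finite = (not label_missing) and not math.isfinite(label)
        features_non_finite = any(not math.isfinite(x) for x in features)
        if label_missing or label_non_finite or features_non_finite:
            bad.append(i)
    return bad

samples = [
    ([0.1, 0.2], 1.0),             # valid sample
    ([0.3, 0.4], None),            # empty label
    ([float("nan"), 0.5], 0.0),    # NaN in the features
    ([0.6, 0.7], float("inf")),    # infinite label
]
print(find_bad_samples(samples))   # → [1, 2, 3]
```

Running a check like this once before training is cheaper than debugging a NaN loss part-way through an epoch.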


Reposted from: blog.csdn.net/weixin_43135178/article/details/133313549