PyTorch parameter loading error: RuntimeError: cuda runtime error (10) : invalid device ordinal

I haven't posted in a while, but today I ran into a fairly rare bug while loading saved parameters in PyTorch.

RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:87

I searched for a long time without finding a solution; nobody on StackOverflow or in PyTorch's GitHub issues seemed to have hit the same problem. In the end I had to debug it myself, stepping through the code until I could see the problem at the lowest level, so I'm recording it here in the hope that it helps someone else.


1. Problem scenario
The error occurred when I trained a model on a server and then loaded the saved parameters on my local machine to analyze the results. During the parameter-loading call, I was told that the CUDA device could not be found. The detailed error message is as follows:

THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=87 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "/home/sw/Shin/Codes/DL4SS_Keras/Torch_multi/main_run_multi_selfSS_subeval.py", line 557, in <module>
    main()
  File "/home/sw/Shin/Codes/DL4SS_Keras/Torch_multi/main_run_multi_selfSS_subeval.py", line 482, in main
    mix_hidden_layer_3d.load_state_dict(torch.load('params/param_mix101_dbag1nosum_WSJ0_hidden3d_190'))
  File "/usr/local/lib/python2.7/dist-packages/torch/serialization.py", line 231, in load
    return _load(f, map_location, pickle_module)
  File "/usr/local/lib/python2.7/dist-packages/torch/serialization.py", line 379, in _load
    result = unpickler.load()
  File "/usr/local/lib/python2.7/dist-packages/torch/serialization.py", line 350, in persistent_load
    data_type(size), location)
  File "/usr/local/lib/python2.7/dist-packages/torch/serialization.py", line 85, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python2.7/dist-packages/torch/serialization.py", line 67, in _cuda_deserialize
    return obj.cuda(device_id)
  File "/usr/local/lib/python2.7/dist-packages/torch/_utils.py", line 58, in _cuda
    with torch.cuda.device(device):
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/__init__.py", line 128, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:87

At first I thought something was wrong with CUDA on my local machine, but parameters I had saved earlier loaded without any problem. So the issue had to be with the particular parameter file being read.


After digging into the internals, I found the cause: when PyTorch saves parameters, it records the device each tensor was on at save time. For example, if you trained on GPU 1 of the server, the recorded location is likely to be 'cuda:1'. If your desktop has only one GPU (GPU 0), the location information carried by the saved file refers to a device that does not exist locally, and loading fails with this "invalid device ordinal" error.
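To make this concrete, here is a minimal, CPU-only sketch (the tiny `torch.nn.Linear` layer is a hypothetical stand-in for the real network). Saving and reloading a state dict round-trips fine on one machine; the `map_location` argument is what lets you force every saved tensor onto a device that actually exists locally:

```python
import io
import torch

# A tiny layer stands in for the real model (hypothetical sizes).
model = torch.nn.Linear(4, 2)

# torch.save records, for every tensor, the device it lived on at save time.
# On the server in the post, that device was cuda:1.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

# map_location forces all tensors onto the given device regardless of where
# they were saved; 'cpu' always exists, so it is a safe target on a machine
# whose GPU layout differs from the server's.
state = torch.load(buf, map_location=torch.device('cpu'))
model.load_state_dict(state)
print(all(p.device.type == 'cpu' for p in model.parameters()))  # True
```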

2. Solution
When loading the parameters, remap the saved device to one that exists on the current machine; for example, the original GPU 1 can be mapped to GPU 0. The torch.load function has an optional parameter for exactly this:

torch.load(f, map_location=None, pickle_module=pickle)

In my experiments, the final line looks like this:

att_speech_layer.load_state_dict(torch.load('params—xxxxx', map_location={'cuda:1': 'cuda:0'}))

That solved it. Of course, you have to adjust the mapping to match the GPUs on your own machine.
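The dict form shown above ({'cuda:1': 'cuda:0'}) is not the only option: per torch.load's API, map_location can also be a string, a torch.device, or a callable taking (storage, saved_location) and returning the storage on the desired device. A small runnable sketch of the callable form (returning the storage unchanged keeps everything on CPU):

```python
import io
import torch

# Save a small dict of tensors, then load it back with a callable map_location.
buf = io.BytesIO()
torch.save({'w': torch.ones(3)}, buf)
buf.seek(0)

# The callable receives (storage, saved_location); returning the storage
# as-is leaves everything on CPU, whatever device it was saved from.
state = torch.load(buf, map_location=lambda storage, loc: storage)
print(state['w'].device.type)  # cpu
```

The callable form is handy when the dict form gets unwieldy, e.g. when files may come from servers with different numbers of GPUs.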
