A baffling ("metaphysical") CUDA problem that I don't understand

Recently, while running code based on Stable Diffusion, I ran into a mysterious problem several times. The fix turned out to be just as inexplicable, and it left me confused the whole way through.

My goal:

I run the program on a server with 8 GPUs. By default it has always run on the first card, cuda:0. I want to make it run on a different card.

My attempts:

  1. Stable Diffusion appears to let you specify the GPU index through its config file, so I changed it there first. It had no effect; the program still ran on cuda:0 by default.

  2. Specify the GPU with torch.cuda.set_device(6) in the script:

if __name__ == "__main__":
    torch.cuda.set_device(6)
    main()

Error: CUDA device ordinal 6 does not exist:

 File "scripts/inference.py", line 587, in <module>
    torch.cuda.set_device(6)
 File "anaconda3/envs/PbE/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

So I printed torch.cuda.device_count() just before the torch.cuda.set_device(6) call, and found that only one CUDA device was available. But this is an 8-card machine!
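A defensive pattern that avoids this crash is to validate the ordinal against the visible device count before selecting it. This is a sketch; pick_device is a hypothetical helper, not something from the original script, and the count is passed in explicitly so the logic is clear:

```python
def pick_device(ordinal, device_count):
    # "invalid device ordinal" is raised when ordinal >= the number of
    # devices the *process* can see (torch.cuda.device_count()), which
    # is not necessarily the number of cards physically in the machine.
    if not 0 <= ordinal < device_count:
        raise ValueError(
            f"requested cuda:{ordinal} but only {device_count} device(s) are visible"
        )
    return f"cuda:{ordinal}"

print(pick_device(6, 8))  # → cuda:6
# pick_device(6, 1) would raise ValueError, matching the crash above.
```

In a real script the count would come from torch.cuda.device_count(), and the returned string could be passed to torch.device or torch.cuda.set_device.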

In my other .py files, print(torch.cuda.device_count()) returns 8 as expected, and typing the following two lines into a Python shell on the same machine also prints 8. Only this one script misbehaves.

>>>import torch
>>>print(torch.cuda.device_count())

On a hunch, I located the import torch line among the pile of imports at the top of the script and inserted a print(torch.cuda.device_count()) immediately after it. And the result!!! It printed the correct answer, 8, and torch.cuda.set_device(6) then succeeded, sending the program to run on cuda:6.

But the reason behind this is genuinely hard to understand. Placed right after import torch, print(torch.cuda.device_count()) reports 8 and nothing goes wrong; placed after the other imports, it reports the wrong number. Can an import really change how many CUDA devices are available? It feels like magic.
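One plausible mechanism (an assumption on my part; the post never identifies the responsible module) is that something in the import list sets CUDA_VISIBLE_DEVICES at import time. PyTorch reads that variable when CUDA is first initialized, so a device_count() placed before the offending import can still see all 8 cards, while one placed after it sees only what the variable allows. A minimal simulation of that import-time side effect, with a made-up function standing in for the offending module:

```python
import os

def sneaky_import():
    # Stand-in for a library that silently restricts visible GPUs when
    # it is imported. The name is invented; the side effect is the point.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"  # before: 8 visible
before = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))

sneaky_import()                                          # after: 1 visible
after = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))

print(before, after)  # → 8 1
```

If this is what happens, moving the diagnostic print above the other imports only changes what the print observes; it is the first actual CUDA call after the variable is overwritten that gets the restricted view.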

  3. Specify the GPU directly from the terminal with CUDA_VISIBLE_DEVICES=6 python xxx.py

It turns out the program actually runs on cuda:1. Changing the 6 to 4, or to 0, still gives cuda:1. Dropping CUDA_VISIBLE_DEVICES entirely and running plain python xxx.py: still cuda:1. Why did the default card change from cuda:0 to cuda:1? What kind of supernatural event is this?
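This also fits the environment-variable theory (again an assumption, not something the post confirms): CUDA renumbers whatever CUDA_VISIBLE_DEVICES lists, so inside the process the listed cards appear as cuda:0, cuda:1, ... in list order. With CUDA_VISIBLE_DEVICES=6 a literal cuda:1 cannot even exist, so seeing cuda:1 regardless of the variable suggests the script itself overwrites the device selection after startup. A sketch of the renumbering, using a hypothetical helper:

```python
def visible_mapping(cuda_visible_devices):
    # CUDA renumbers the listed physical GPUs from 0, in list order:
    # with CUDA_VISIBLE_DEVICES=6 the process sees exactly one device,
    # cuda:0, which is physical GPU 6.
    physical = [int(x) for x in cuda_visible_devices.split(",")]
    return {f"cuda:{i}": f"GPU {p}" for i, p in enumerate(physical)}

print(visible_mapping("6"))    # → {'cuda:0': 'GPU 6'}
print(visible_mapping("6,3"))  # → {'cuda:0': 'GPU 6', 'cuda:1': 'GPU 3'}
```

Note also that CUDA_VISIBLE_DEVICES only takes effect if it is set before the process first initializes CUDA; a value assigned on the command line can be clobbered by an os.environ write inside the script before that first CUDA call.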


Origin blog.csdn.net/qq_43522986/article/details/129643788