Install fabricmanager to fix the NumCudaDevices() error from print(torch.cuda.is_available())

Install fabricmanager

Problem: print(torch.cuda.is_available()) reports an error even though CUDA and cuDNN are both installed and their versions match. The error is as follows:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
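
To tell this failure apart from other CUDA initialization problems, the warning text can be checked for the "Error 802" pattern. The following is a minimal sketch, not part of the original post: the helper names looks_like_fabric_error and cuda_init_warnings are hypothetical, and the match relies only on the wording of the message quoted above.

```python
import warnings
from typing import List


def looks_like_fabric_error(message: str) -> bool:
    """Return True if a warning message matches the Error 802 text quoted above."""
    text = message.lower()
    return "error 802" in text and "system not yet initialized" in text


def cuda_init_warnings() -> List[str]:
    """Collect warning messages emitted while probing CUDA.

    Returns an empty list if torch is not installed.
    """
    try:
        import torch
    except ImportError:
        return []
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        torch.cuda.is_available()
    return [str(w.message) for w in caught]


if __name__ == "__main__":
    for msg in cuda_init_warnings():
        if looks_like_fabric_error(msg):
            print("CUDA reported Error 802; check the nvidia-fabricmanager service.")
```

If the script prints the hint, the explanation below most likely applies to your machine.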

Explanation: NVIDIA A100 GPUs that are interconnected over NVLink through NVSwitch need the nvidia-fabricmanager service in addition to the GPU driver, and its version must match the installed driver version. If only the NVIDIA GPU driver is installed, the GPUs will not function properly. The installation steps are as follows:

Download the fabricmanager package that matches your driver version from the website: Index of /compute/cuda/repos/ubuntu2204/x86_64 (nvidia.cn)
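
To work out which file to look for in that index, the package name can be derived from the installed driver version. This is a rough sketch: the function names are hypothetical, and the "-1" revision suffix and "amd64" architecture are assumptions taken from the single file name used in this post, so other releases may be named differently.

```python
import shutil
import subprocess
from typing import Optional


def fabricmanager_deb_name(driver_version: str) -> str:
    """Build the .deb file name for a driver version such as '535.104.05'.

    Assumes the naming pattern of the file used in this post.
    """
    branch = driver_version.split(".")[0]  # major driver branch, e.g. '535'
    return f"nvidia-fabricmanager-{branch}_{driver_version}-1_amd64.deb"


def installed_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()  # one line per GPU; all GPUs share the same driver


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("Could not read the driver version from nvidia-smi.")
    else:
        print(fabricmanager_deb_name(version))
```

For driver 535.104.05 this yields the file name installed in the commands below.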

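# If an old version is present, remove it and download the package again

# Install manually
sudo apt-get install ./nvidia-fabricmanager-535_535.104.05-1_amd64.deb
# Enable the service
sudo systemctl enable nvidia-fabricmanager
# Restart the service
sudo systemctl restart nvidia-fabricmanager
# Check the status
sudo systemctl status nvidia-fabricmanager
# Installation complete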
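
Once the service has been restarted, the original check can be repeated from Python. The sketch below is only an illustration: service_is_active and cuda_status are hypothetical helper names, and it assumes a systemd-based system where systemctl is on the PATH.

```python
import shutil
import subprocess


def service_is_active(unit: str = "nvidia-fabricmanager") -> bool:
    """Return True if systemd reports the unit as active."""
    if shutil.which("systemctl") is None:
        return False  # no systemd tooling found, so nothing can be confirmed
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit],
        check=False,
    )
    return result.returncode == 0


def cuda_status() -> str:
    """Report whether torch can see CUDA, or that torch is not installed."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return "available" if torch.cuda.is_available() else "unavailable"


if __name__ == "__main__":
    print("fabricmanager active:", service_is_active())
    print("CUDA:", cuda_status())
```

If the service is active but CUDA is still unavailable, it is worth confirming that the fabricmanager version matches the driver version exactly.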

Origin blog.csdn.net/gary101818/article/details/132687029