Troubleshooting the nvidia-smi error "Failed to initialize NVML: Driver/library version mismatch"

Problem Description:

The system is CentOS 7.6. After the server was rebooted, running the nvidia-smi command reported the error "Failed to initialize NVML: Driver/library version mismatch".

Cause Analysis:

Thinking back over what had been done on the server a few days earlier, I had upgraded CUDA and cuDNN, so I suspected that was the cause. Checking the current CUDA version showed 10.2, whereas I remembered it being 9.0 before. Most likely the NVIDIA driver was updated along with CUDA, leaving the user-space driver libraries and the NVIDIA kernel module at different versions, which is what the error message reports.

nvcc -V

There are three possible solutions:

1. Change the BIOS settings to disable the driver that comes with the graphics card;
2. Remove the NVIDIA graphics driver modules from the command line;
3. Reinstall the graphics driver.
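To confirm the suspected mismatch before picking a fix, one option is to compare the version of the NVIDIA kernel module that is currently loaded with the version of the module installed on disk. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the loaded module exposes its version at /sys/module/nvidia/version and that the module on disk was installed together with the user-space libraries.

```shell
#!/bin/sh
# Sketch: compare the loaded NVIDIA kernel module version with the
# version of the module installed on disk. Function names are
# hypothetical helpers, not part of any NVIDIA tool.

# Version of the module currently loaded in the kernel
# (prints nothing if the module is not loaded).
loaded_version() {
    cat /sys/module/nvidia/version 2>/dev/null || true
}

# Version of the module file installed on disk
# (prints nothing if no such module is installed).
installed_version() {
    modinfo -F version nvidia 2>/dev/null || true
}

# Print "match", "mismatch", or "unknown" for two version strings.
compare_versions() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

loaded=$(loaded_version)
installed=$(installed_version)
echo "loaded:    ${loaded:-none}"
echo "installed: ${installed:-none}"
compare_versions "$loaded" "$installed"
```

If the last line prints "mismatch", the running kernel module is older (or newer) than what is installed, which is consistent with the explanation above.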


Solution:

Since this is a server, changing the BIOS settings is inconvenient and reboots are rare, so I chose the second option. Enter the following commands in sequence.

List the NVIDIA kernel modules that are currently loaded, along with the modules that depend on them:

lsmod | grep nvidia

Then unload the modules with the following commands, removing the dependent modules first and the nvidia module last:

sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
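The four commands above can also be wrapped in a small script. This is a sketch rather than a tested procedure: the DRY_RUN variable and the unload_modules function are hypothetical names introduced here, and the script only prints what it would do unless DRY_RUN=0 is set. It skips modules that are not loaded and stops at the first failure, which typically means some process is still using the GPU.

```shell
#!/bin/sh
# Sketch: unload the NVIDIA kernel modules, dependents first.
# DRY_RUN and unload_modules are illustrative names.
# Dry run by default; run as root with DRY_RUN=0 to actually unload.
DRY_RUN="${DRY_RUN:-1}"

# Same order as the manual commands: the base "nvidia" module goes last.
MODULES="nvidia_drm nvidia_modeset nvidia_uvm nvidia"

unload_modules() {
    for mod in $MODULES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: rmmod $mod"
        elif grep -q "^$mod " /proc/modules; then
            # Only try to remove modules that are actually loaded.
            rmmod "$mod" || { echo "failed to unload $mod" >&2; return 1; }
        fi
    done
}

unload_modules
```

Running it without any variables set just lists the rmmod commands in order, so the sequence can be checked against the lsmod output before anything is changed.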

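After running the commands above, enter nvidia-smi again and it displays the GPU information normally, presumably because the kernel module that is loaded afterwards matches the installed libraries.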

Summary:

The commands above fix the problem for the current session, but the error may come back the next time the server is rebooted and nvidia-smi is run. If you don't mind the trouble, simply repeat the steps. For a permanent fix, change the BIOS settings or reinstall the NVIDIA driver.

Origin blog.csdn.net/h363924219/article/details/120260637