Problem Description:
The system is CentOS 7.6. After restarting the server, running nvidia-smi reports the error "Failed to initialize NVML: Driver/library version mismatch".
Cause Analysis:
Recalling what I did on the server a few days earlier, I had upgraded CUDA and cuDNN, which is the most likely cause. Checking the current CUDA version:
nvcc -V
It reports version 10.2, whereas I remember it being 9.0 before. The CUDA upgrade most likely updated the NVIDIA driver as well, so the updated driver libraries no longer match the nvidia kernel module that the system is using.
Possible solutions:
1. Set the BIOS to disable the driver that comes with the graphics card;
2. Uninstall the NVIDIA graphics driver from the command line;
3. Reinstall the graphics driver.
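To check the suspected mismatch more directly than by looking at the CUDA version, the following is a minimal sketch that compares the version of the nvidia kernel module currently loaded with the version of the module installed on disk. It assumes the module is named nvidia and that /sys/module/nvidia/version is available while it is loaded; the function names are made up for illustration. A difference only hints at a mismatch, it is not proof of one.

```shell
#!/bin/sh
# Compare the loaded nvidia kernel module version with the on-disk one.
# Function names are hypothetical helpers, not part of any NVIDIA tool.

# Print "match" or "mismatch" for two version strings.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Version of the module currently loaded in the kernel (empty if not loaded).
loaded_version() {
    cat /sys/module/nvidia/version 2>/dev/null || true
}

# Version of the module file installed on disk (empty if not installed).
installed_version() {
    modinfo -F version nvidia 2>/dev/null || true
}

loaded=$(loaded_version)
installed=$(installed_version)
if [ -n "$loaded" ] && [ -n "$installed" ]; then
    echo "loaded=$loaded installed=$installed: $(compare_versions "$loaded" "$installed")"
else
    echo "nvidia module not loaded or not installed; nothing to compare"
fi
```

If the two versions differ, unloading the old module as described in the solution lets the newer one be loaded the next time the GPU is used.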
Solution:
Since this is a server, changing BIOS settings is difficult and restarts are infrequent, so I chose the second option and entered the following commands in sequence.
First, list the nvidia kernel modules that are currently loaded, together with their dependencies:
lsmod | grep nvidia
Then unload the modules one by one, dependent modules first:
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
After running the commands above, run nvidia-smi again; its output should now be displayed normally.
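Not every machine has all four modules loaded, and rmmod reports an error for a module that is not loaded or is still in use by a process. As a hedged sketch, the hypothetical helper plan_unload below reads lsmod output and prints only the rmmod commands that apply, in the same order as above. It is a dry run: it prints the commands and does not unload anything.

```shell
#!/bin/sh
# Print the rmmod commands for the nvidia modules that are actually
# loaded, dependents first. Reads `lsmod` output on stdin.
# plan_unload is a hypothetical helper name used for illustration.
plan_unload() {
    # First column of lsmod output is the module name.
    loaded=$(awk '{print $1}')
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "sudo rmmod $mod"
        fi
    done
}

# Dry run for this machine: show the plan without executing it.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | plan_unload
fi
```

If rmmod complains that a module is in use, stop the processes that hold the GPU (for example a display manager or a running training job) before trying again.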
Summary:
The commands above fix the problem for the current session, but the error may appear again the next time the server is restarted and nvidia-smi is run. If you don't mind the trouble, simply repeat the steps. For a permanent fix, change the BIOS setting or reinstall the NVIDIA driver.
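Because the error can come back after a restart, a small check can tell you whether the steps need repeating. This is only a sketch: has_version_error is a hypothetical helper that looks for the "Driver/library version" text quoted in the problem description, and the script reports the result without changing anything.

```shell
#!/bin/sh
# Succeed if the text on stdin contains the NVML version error.
# has_version_error is a hypothetical helper name.
has_version_error() {
    grep -q "Driver/library version"
}

# Only run the check when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi 2>&1 | has_version_error; then
        echo "version error reported: unload the nvidia modules again or reinstall the driver"
    else
        echo "no version error reported"
    fi
fi
```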