Fixing an NVIDIA driver version mismatch on Ubuntu

Ubuntu 18.04

nvidia-smi is NVIDIA's System Management Interface (SMI). It collects information about the GPU at several levels and shows video memory usage. It can also enable and disable GPU configuration options such as ECC memory.

Trying to view the GPU information fails with the following error:

root@iZ2zeiflf48wp1ved7nnnmZ:~# nvidia-smi

Failed to initialize NVML: Driver/library version mismatch

Check the version of the NVIDIA kernel module that is currently loaded:

cat /proc/driver/nvidia/version

Check the version of the installed client (user-space) driver packages:

grep nvidia /var/log/dpkg.log

 

The two versions clearly differ: one is 400.82 and the other is 400.100, so the kernel module is older than the client driver.
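The comparison above can also be scripted. The following is a minimal sketch, not a definitive check: the function names are hypothetical, it assumes the first dotted number on the "NVRM version:" line of /proc/driver/nvidia/version is the loaded module's version, and it assumes that `modinfo -F version nvidia` (the version of the module installed on disk) is a reasonable stand-in for the client driver version.

```shell
#!/usr/bin/env bash

# Read the text of /proc/driver/nvidia/version on stdin and print the
# first dotted version number found on the "NVRM version:" line.
# Assumption: that number is the version of the loaded kernel module.
kernel_module_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+(\.[0-9]+)+$/) { print $i; exit }
    }'
}

# Compare two version strings; return non-zero when they differ.
compare_driver_versions() {
    local kernel_ver="$1" userspace_ver="$2"
    if [ "$kernel_ver" = "$userspace_ver" ]; then
        echo "match: $kernel_ver"
    else
        echo "mismatch: kernel module $kernel_ver, installed driver $userspace_ver"
        return 1
    fi
}

# Run the check only on a machine where the NVIDIA module is loaded.
if [ -r /proc/driver/nvidia/version ] && command -v modinfo >/dev/null 2>&1; then
    loaded=$(kernel_module_version < /proc/driver/nvidia/version)
    installed=$(modinfo -F version nvidia 2>/dev/null || true)
    compare_driver_versions "$loaded" "$installed" || true
fi
```

A mismatch reported here points to the same condition that makes nvidia-smi fail with "Driver/library version mismatch".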

Check the system log again:

 

The direct cause: the version of the loaded NVIDIA kernel module does not match the version of the installed driver.

Solution:

Unload the driver's kernel module:

root@iZ2zeiflf48wp1ved7nnnmZ:~# sudo rmmod nvidia

rmmod: ERROR: Module nvidia is in use by: nvidia_uvm nvidia_modeset

Unloading fails, and the message says that the modules depending on it must be unloaded first:

root@iZ2zeiflf48wp1ved7nnnmZ:~# sudo rmmod nvidia_uvm

root@iZ2zeiflf48wp1ved7nnnmZ:~# sudo rmmod nvidia_modeset

rmmod: ERROR: Module nvidia_modeset is in use by: nvidia_drm

Continue unloading the dependent modules as the messages indicate:

root@iZ2zeiflf48wp1ved7nnnmZ:~# sudo rmmod nvidia_drm

root@iZ2zeiflf48wp1ved7nnnmZ:~# sudo rmmod nvidia_modeset

root@iZ2zeiflf48wp1ved7nnnmZ:~# sudo rmmod nvidia
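The unload sequence above can be collected into one small script. This is a sketch under assumptions: the function name and the DRY_RUN variable are hypothetical, the module order is the one that worked in the session above (dependents first, nvidia last), and nothing is actually unloaded unless DRY_RUN is set to 0 and the script runs as root.

```shell
#!/usr/bin/env bash

# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Unload the NVIDIA kernel modules, dependent modules first.
unload_nvidia_modules() {
    local mod
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "rmmod $mod"
        elif grep -q "^$mod " /proc/modules; then
            # Only try modules that are currently loaded; stop on failure.
            rmmod "$mod" || return 1
        fi
    done
}

unload_nvidia_modules
```

If rmmod still reports that a module is in use, some process is probably holding the GPU and has to be stopped before the modules can be unloaded.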

Finally, check the GPU information again. Running nvidia-smi loads the kernel module from the newly installed driver:

root@iZ2zeiflf48wp1ved7nnnmZ:~# nvidia-smi

 

Check again that the kernel module version and the client driver version now match:

 

This is the nvidia-smi output on the Ubuntu 18.04 server.
In the table above:
First column: the 0 at the top is the GPU index (this server has a single GPU, numbered 0), and the N/A below it is the fan speed, which ranges from 0 to 100%. This is the speed the driver requests; if the fan is physically blocked, the actual speed may be lower. Some devices report no speed because they do not rely on a fan and are kept cool by other means (for example, cloud hosts).
Temp in the second column: the GPU temperature in degrees Celsius.
Perf in the third column: the performance state, from P0 (maximum performance) to P12 (minimum performance).
Pwr at the bottom of the fourth column: the power draw; 28W / 250W means the current draw and the power limit. Persistence-M at the top: persistence mode. It uses more power, but new GPU applications start faster when it is on; it is off here.
Bus-Id in the fifth column: the PCI bus address of the GPU in the form domain:bus:device.function, here 00000000:00:09.0.
Disp.A in the sixth column: Display Active, which indicates whether a display is initialized on the GPU.
Memory-Usage, below the fifth and sixth columns: the video memory usage; 0MiB / 16280MiB means memory in use / total memory.
GPU-Util in the seventh column: the volatile GPU utilization.
The top of the eighth column shows ECC information; it is off here.
Compute M. at the bottom of the eighth column: the compute mode, which is Default here.
If any processes are using the GPU, a second table lists the memory usage of each process.
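The same values can be read in machine-friendly form with the `--query-gpu` option of nvidia-smi. The sketch below assumes the listed field names map onto the columns described above (the FIELDS variable is just a name chosen for this example), and it skips the query when nvidia-smi is not installed.

```shell
#!/usr/bin/env bash

# Query fields that roughly correspond to the columns of the default table.
FIELDS="index,fan.speed,temperature.gpu,pstate,power.draw,power.limit,persistence_mode,pci.bus_id,display_active,memory.used,memory.total,utilization.gpu,ecc.mode.current,compute_mode"

if command -v nvidia-smi >/dev/null 2>&1; then
    # Print one CSV row per GPU, with a header line.
    nvidia-smi --query-gpu="$FIELDS" --format=csv || echo "nvidia-smi query failed" >&2
else
    echo "nvidia-smi not found; skipping query" >&2
fi
```

Running `nvidia-smi --help-query-gpu` lists every field the installed driver supports, which is the place to check if one of the names above is rejected.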


Origin blog.csdn.net/Doudou_Mylove/article/details/108355182