Graphics card fault: nvidia-smi reports a failure. The solution process includes uninstalling and reinstalling the NVIDIA driver.

1 Problem

The program ran fine at first, then suddenly stopped producing detection output. I ran the following command to check the graphics card:

nvidia-smi

It reported the following error, referred to below as [Error1]:

Unable to determine the device handle for GPU 8000:01:00.0: Unknown Error

After rebooting the machine, the output changed to the following, referred to below as [Error2]:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
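Because the two messages point to different states, it can help to tell them apart automatically, for example in a watchdog script. The following is a minimal sketch: the function name classify_smi and the labels error1, error2 and other are made up for this article and are not NVIDIA terms.

```shell
# Classify nvidia-smi output (read from stdin) by the two messages above.
# The labels are this article's own names, chosen for illustration.
classify_smi() {
    text=$(cat)
    case "$text" in
        *"Unable to determine the device handle"*)
            echo "error1" ;;   # driver is loaded but cannot reach the GPU
        *"couldn't communicate with the NVIDIA driver"*)
            echo "error2" ;;   # nvidia-smi cannot reach the driver at all
        *)
            echo "other" ;;    # anything else, including normal output
    esac
}
```

One possible use is `nvidia-smi 2>&1 | classify_smi`, which prints one of the three labels.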

2 Solution process

(1) My first thought was that the driver could not be reached, so I reinstalled it. After that, the graphics card information and usage status were output normally again.

(2) Not long afterwards the program failed again, and nvidia-smi reported [Error1] again. After restarting the machine, it still reported [Error2]. I suspected that the graphics card was not seated properly in the server and that its contacts might be loose, so we ran the following experiment:

(2.1) On the faulty server, run nvidia-smi. The graphics card information is output normally. Shut down.

(2.2) Unplug the graphics card from the server, boot, and run nvidia-smi. It reports the aforementioned [Error2]. Shut down.

(2.3) Plug the graphics card back in, boot, and run nvidia-smi. The graphics card information is output normally, without reinstalling the driver. I had always assumed that removing a graphics card and plugging it back in required reinstalling the driver, so I was surprised that it was not needed here.
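One additional check, which was not part of the experiment above: the NVIDIA kernel module writes its messages to the kernel log with an NVRM prefix, and GPU faults are reported there as Xid events. The following minimal sketch filters those lines out of log text supplied on stdin. The function name xid_lines is hypothetical.

```shell
# Print only the NVIDIA Xid event lines from kernel log text on stdin.
# Prints nothing (and still succeeds) when there are no such lines.
xid_lines() {
    grep 'NVRM: Xid' || true
}
```

It could be fed from the kernel log, for example `sudo dmesg | xid_lines` or `journalctl -k | xid_lines`. Xid lines appearing around the time of the failure would support the suspicion of a hardware problem.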

3 Commands used in the process

(1) If the output of the following command contains (rev ff), the graphics card may be physically loose.

lspci | grep -i nvidia

An example of normal output is as follows:

02:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
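The (rev ff) check can be scripted. The sketch below reads lspci output on stdin and prints only the NVIDIA lines whose revision is ff. The function name find_rev_ff is made up for illustration.

```shell
# Read lspci output on stdin and print NVIDIA lines reported as "(rev ff)".
# The exit status is 0 if at least one such line exists, non-zero otherwise.
find_rev_ff() {
    grep -i 'nvidia' | grep -i '(rev ff)'
}
```

For example, `lspci | find_rev_ff && echo "card may be loose"` prints the warning only when a suspicious line is found.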

(2) The command to check whether a physical graphics card is present is as follows:

sudo lshw -C display

In the faulty state, this command printed nothing. Normally it prints the physical device information. An example of normal output is as follows:

  *-display
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:02:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:137 memory:a2000000-a2ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:4000(size=128) memory:c0000-dffff
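The configuration line in this output shows which kernel driver is bound to the card. As a minimal sketch, the value can be extracted as follows. The function name display_driver is hypothetical, and the sed pattern assumes the line layout shown in the example above.

```shell
# Read `lshw -C display` output on stdin and print the value after "driver="
# on each configuration line. Prints nothing if no display device is listed.
display_driver() {
    sed -n 's/^ *configuration:.*driver=\([^ ]*\).*$/\1/p'
}
```

With the normal output above, `sudo lshw -C display | display_driver` would print nvidia. Empty output corresponds to the faulty state, where lshw lists no display device.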

(3) Display the kernel release

uname -r

(4) Display the kernel name, node (host) name, kernel release, kernel version, hardware architecture, etc.

uname -a

(5) List all PCI devices. Normal output contains NVIDIA entries; abnormal output does not.

lspci
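To make this comparison quick to repeat after each reboot, the NVIDIA entries can simply be counted. This is a minimal sketch, and the function name count_nvidia is made up for illustration.

```shell
# Read lspci output on stdin and print how many lines mention NVIDIA.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status at 0 in that case.
count_nvidia() {
    grep -ci 'nvidia' || true
}
```

Running `lspci | count_nvidia` prints 0 when the card is not visible on the PCI bus, as in step (2.2), and a positive number otherwise.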

4 Solution

In the end, we concluded that the problem had to be in the physical graphics card or the card slot.

4.1 Re-insert the graphics card

How to uninstall the NVIDIA driver (restart the machine after uninstalling, then install it again):

Method 1

sudo bash NVIDIA-Linux-x86_64-510.47.03.run --uninstall

Method 2

sudo apt-get --purge remove 'nvidia*'
sudo apt autoremove

After the uninstall finishes, remember to restart the machine before installing. Then install with the following command:

sudo ./NVIDIA-Linux-x86_64-510.47.03.run --no-x-check

After reinstalling the driver, run nvidia-smi again. The output was correct.
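To confirm that the running driver is the one that was just installed, the version in the installer file name can be compared with the version the driver reports. This is a minimal sketch. The function names installer_version and versions_match are hypothetical, and the file name layout is assumed to match the installer used above.

```shell
# Extract the driver version from an installer file name such as
# NVIDIA-Linux-x86_64-510.47.03.run (prints 510.47.03 for that name).
installer_version() {
    basename "$1" .run | sed 's/^NVIDIA-Linux-x86_64-//'
}

# Succeed if the installer version equals the version the driver reports.
# $1: installer file name, $2: version string reported by the driver
versions_match() {
    [ "$(installer_version "$1")" = "$2" ]
}
```

The reported version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1` and passed as the second argument to versions_match.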

4.2 Try a different graphics card

Since the problem was solved in 4.1, I did not try this.

4.3 Put the graphics card into another machine of the same model to test it

Since the problem was solved in 4.1, I did not try this.

5 Conclusion

As for the hardware problem: the equipment is installed outdoors, and it is still summer. When we took the equipment apart, we found that the graphics card's power adapter cable sat too close to the fan and obstructed its rotation. The rotating fan had also damaged the power adapter cable, leaving it with poor contact. There were therefore two causes: ① the temperature was too high, and short-lived hardware failures led to driver anomalies; ② the fan caused poor contact in the power adapter cable, so the graphics card lost power.
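Since overheating was one of the causes, it may be worth watching the GPU temperature on outdoor equipment. The following is a minimal sketch. The function name temp_status is made up, and the default limit of 85 degrees Celsius is an assumed value, not a figure from this investigation or from NVIDIA.

```shell
# Compare a GPU temperature reading with a limit.
# $1: temperature in degrees Celsius (integer)
# $2: limit in degrees Celsius (optional; 85 is an assumed default)
temp_status() {
    limit=${2:-85}
    if [ "$1" -ge "$limit" ]; then
        echo "hot"
    else
        echo "ok"
    fi
}
```

The reading can come from `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits`, which prints one integer per GPU. Each value can then be passed to temp_status.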

Origin blog.csdn.net/qq_42835363/article/details/132305212