The Docker container hangs and the graphics driver is broken: nvidia-container-cli: initialization error: nvml error: driver not loaded...

The Docker container can't start: NVIDIA-driver-related issues

1. The specific errors

Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Error: failed to start containers: xxxxxxxxx
xxx@xxx:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

2. Problem analysis
The container depends on the NVIDIA graphics driver, and the driver has stopped working.
Aside: the client's server has been online for years, and the graphics driver has failed more than once over the past few months. The root cause turned out to be automatic Linux/Ubuntu kernel updates: when the kernel is upgraded, the driver module no longer matches the running kernel and fails to load. Reinstalling the driver fixes it temporarily, but the next automatic kernel update can break the driver again.
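
A quick way to confirm this failure mode (generic diagnostic commands, not from the original post): if lsmod shows no nvidia module and dkms status shows the module built only for an older kernel, the driver no longer matches the running kernel.

# Check whether the NVIDIA kernel module is loaded (no output = not loaded)
lsmod | grep nvidia
# Compare the running kernel against the kernels the module was built for
uname -r
dkms status   # only informative if the driver was installed via DKMS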

3. Solution
Turn off automatic kernel updates: change all of the values in the following two configuration files to "0", then save and reboot.

xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ uname -r
5.15.0-58-generic
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ cat /etc/apt/apt.conf.d/10periodic
# Change all of the values below to "0"
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
   
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ cat /etc/apt/apt.conf.d/20auto-upgrades 
# Change all of the values below to "0"
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";

xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo vim /etc/apt/apt.conf.d/10periodic 
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo vim /etc/apt/apt.conf.d/20auto-upgrades 
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo reboot -i
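
As an alternative (or in addition) to editing the apt configuration, the kernel packages themselves can be pinned so that apt never upgrades them. This is a sketch I have not tested on this machine; the package names are the standard Ubuntu kernel meta-packages:

# Hold the kernel meta-packages so automatic upgrades skip them
sudo apt-mark hold linux-image-generic linux-headers-generic linux-generic
# Verify which packages are held
apt-mark showhold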

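After the reboot, the driver has to be reinstalled. A minimal sketch for Ubuntu, assuming the distribution-packaged driver is acceptable (the explicit version number is a placeholder, not necessarily what was used here):

# Let Ubuntu detect and install the recommended driver packages
sudo ubuntu-drivers autoinstall
# ...or install a specific packaged driver ("535" is a placeholder version)
# sudo apt install nvidia-driver-535
sudo reboot
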
After reinstalling the driver, the container could start, and nvidia-smi worked outside the container, but not inside it, and the program could not run:

RuntimeError: No CUDA GPUs are available
(xxxxai) root@xxxxxx:/workspace/projects/xxxxx/xxxxai/xxxxx# nvidia-smi
No devices were found

Restart the Docker service (the daemon was still running against the state of the old, broken driver):

sudo systemctl restart docker
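
To confirm the fix, a quick check (the container name is a placeholder):

# Start the container again and check that the GPU is visible inside it
docker start xxxxxxxxx
docker exec -it xxxxxxxxx nvidia-smi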

OK, it works fine!


Someone also suggested that installing the driver with the dkms option can help, so that it survives kernel updates. I haven't tested this myself; for reference: https://blog.csdn.net/wtlll/article/details/126541686
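
For reference, a DKMS-based install registers the driver source with the kernel so the module is rebuilt automatically whenever a new kernel is installed, which would address the root cause here. A sketch I have not tested (versions are placeholders):

# Ubuntu packaged driver with DKMS support ("535" is a placeholder version)
sudo apt install nvidia-driver-535
# ...or, with NVIDIA's .run installer, pass the --dkms flag
# sudo sh NVIDIA-Linux-x86_64-xxx.run --dkms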

Origin: blog.csdn.net/ghcony/article/details/129702942