Docker container fails to start: NVIDIA driver issues
1. The error messages
Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Error: failed to start containers: xxxxxxxxx
xxx@xxx:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
2. Problem analysis
The container depends on the NVIDIA GPU driver, and the driver is no longer loaded.
Off-topic: the customer's server has been online for years, and the GPU driver has broken more than once in recent months. Investigation showed the cause: the Linux/Ubuntu kernel updates automatically, which invalidates the driver built against the old kernel. Reinstalling the driver fixes it each time, but the next automatic kernel update may break the driver again.
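The "driver not loaded" symptom above can be confirmed quickly: a kernel module that is loaded always has a directory under /sys/module. A minimal sketch (the function name is my own, not from any tool; it works even on minimal systems without lsmod):

```shell
# Sketch: check whether the nvidia kernel module is loaded for the
# running kernel. /sys/module/<name> exists only while a module is
# loaded, so no lsmod/modinfo binaries are needed.
check_nvidia_module() {
    if [ -d /sys/module/nvidia ]; then
        echo "nvidia module loaded for kernel $(uname -r)"
    else
        echo "nvidia module NOT loaded for kernel $(uname -r)"
    fi
}
check_nvidia_module
```

If the module is missing right after a kernel update, reinstalling the driver for the new kernel is the usual fix.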
3. Solution
Turn off automatic kernel updates: change all the values in the following two configuration files to "0", save, and reboot.
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ uname -r
5.15.0-58-generic
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ cat /etc/apt/apt.conf.d/10periodic
# Change all of the values below to "0"
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ cat /etc/apt/apt.conf.d/20auto-upgrades
# Change all of the values below to "0"
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo vim /etc/apt/apt.conf.d/10periodic
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo vim /etc/apt/apt.conf.d/20auto-upgrades
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo reboot -i
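The two file edits above can also be scripted instead of done by hand in vim. A minimal sketch (the function name is my own; run it as root against the real files, and a .bak backup is kept next to each file):

```shell
# Set every APT::Periodic value in the given files to "0", keeping a
# .bak backup of each file. Typical invocation (as root):
#   disable_periodic /etc/apt/apt.conf.d/10periodic /etc/apt/apt.conf.d/20auto-upgrades
disable_periodic() {
    sed -i.bak -E 's/(APT::Periodic::[A-Za-z-]+) "[0-9]+";/\1 "0";/' "$@"
}
```

Using sed keeps the edit idempotent: re-running it on files already set to "0" changes nothing.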
Then I reinstalled the driver. Afterwards the container could start, and nvidia-smi worked outside the container, but not inside it, so the program still could not run:
RuntimeError: No CUDA GPUs are available
(xxxxai) root@xxxxxx:/workspace/projects/xxxxx/xxxxai/xxxxx# nvidia-smi
No devices were found
Restart the Docker service:
systemctl restart docker
OK, now everything works fine!
An expert mentioned that "a driver installed with the dkms option" should survive kernel updates (DKMS rebuilds the module for each new kernel). I haven't tested this, but for reference: https://blog.csdn.net/wtlll/article/details/126541686