Docker container down because of an NVIDIA driver failure: nvidia-container-cli: initialization error: nvml error: driver not loaded...

The Docker container will not start: an NVIDIA driver problem

1. The error

Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Error: failed to start containers: xxxxxxxxx
xxx@xxx:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

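2. Analysis
The container depends on the NVIDIA GPU driver, and the driver on the host had stopped working.
Side note: this customer's server is permanently connected to the internet, and the GPU driver has failed more than once over the past few months. The cause turned out to be automatic Linux/Ubuntu kernel updates: once the kernel is updated, the installed driver stops working. Reinstalling the driver usually fixes it, but the next kernel update is likely to break the driver again.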
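One way to check this diagnosis on a host is to compare the running kernel with the kernels APT has installed, and to see whether the nvidia kernel module is loaded at all. The sketch below is only illustrative: the helper names are made up, and it assumes the usual Ubuntu log location /var/log/apt/history.log.

```shell
# Hypothetical helpers for illustration; not part of any NVIDIA or Docker tool.

# Print the kernel image packages mentioned in an apt history log.
# $1: path to the log (assumed: /var/log/apt/history.log on Ubuntu).
recent_kernel_installs() {
    grep -oE 'linux-image-[0-9][0-9A-Za-z.+~-]*' "$1" | sort -u
}

# Succeed if a module named exactly "nvidia" is listed as loaded.
# $1: module list to read (defaults to /proc/modules).
nvidia_module_loaded() {
    grep -q '^nvidia ' "${1:-/proc/modules}"
}

# Usage:
#   uname -r
#   recent_kernel_installs /var/log/apt/history.log
#   nvidia_module_loaded || echo "nvidia kernel module is not loaded"
```

If the log lists a kernel newer than the one the driver was installed under and the module is not loaded, that fits the automatic-update explanation above.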

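3. Solution
Disable automatic kernel updates by turning off APT's periodic upgrades
Set every value in the two configuration files below to "0", save, and reboot.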

xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ uname -r
5.15.0-58-generic
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ cat /etc/apt/apt.conf.d/10periodic
# Change all of the values below to "0"
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
   
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ cat /etc/apt/apt.conf.d/20auto-upgrades 
# Change all of the values below to "0"
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";

xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo vim /etc/apt/apt.conf.d/10periodic 
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo vim /etc/apt/apt.conf.d/20auto-upgrades 
xxxx@xxxx:/xxxxxx/xxxxxxxxxxx/xxxxx$ sudo reboot -i
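
The manual vim edits above can also be scripted. This is a minimal sketch, assuming GNU sed (the default on Ubuntu); the function name zero_apt_periodic is hypothetical. It rewrites every APT::Periodic::... value in the given files to "0".

```shell
# Hypothetical helper: set every APT::Periodic::<Name> "<value>"; line to "0".
# Arguments: one or more config files to edit in place.
zero_apt_periodic() {
    sed -i -E 's/^([[:space:]]*APT::Periodic::[A-Za-z-]+[[:space:]]+)"[^"]*";/\1"0";/' "$@"
}

# Usage (from a root shell, because sudo cannot see shell functions):
#   zero_apt_periodic /etc/apt/apt.conf.d/10periodic /etc/apt/apt.conf.d/20auto-upgrades
#   apt-config dump | grep Periodic   # show the values APT now sees
```

It may be worth backing up the two files before editing them in place.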

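Then reinstall the driver. Once it was installed, the container started and nvidia-smi worked on the host, but it still failed inside the container and the program could not run: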

RuntimeError: No CUDA GPUs are available
(xxxxai) root@xxxxxx:/workspace/projects/xxxxx/xxxxai/xxxxx# nvidia-smi
No devices were found

Restart the Docker service:

 systemctl restart docker 

OK, everything runs normally!
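
To confirm the fix without opening a shell in the container, the GPU can be checked from the host. A small sketch, with a hypothetical helper name and a placeholder container name; it relies on nvidia-smi -L, which prints one "GPU N: ..." line per visible device.

```shell
# Hypothetical helper: succeed if nvidia-smi inside the container lists a GPU.
# $1: name or ID of a running container (placeholder, supply your own).
gpu_visible_in_container() {
    docker exec "$1" nvidia-smi -L 2>/dev/null | grep -q '^GPU [0-9]'
}

# Usage:
#   sudo systemctl restart docker
#   docker start <container>      # may be needed if it has no restart policy
#   gpu_visible_in_container <container> && echo "GPU visible in container"
```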


Someone more experienced suggested that you "can install the driver with the dkms option". I have not tested this, but it may be worth a look: https://blog.csdn.net/wtlll/article/details/126541686
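
For anyone trying the DKMS route: DKMS is meant to rebuild registered modules when a new kernel is installed, so a useful check is whether an nvidia module is built for the kernel in question. The sketch below is untested against a real DKMS driver install, like the suggestion itself. The helper name is hypothetical, and it assumes dkms is installed and that each dkms status line contains the module name, the kernel release, and the word "installed".

```shell
# Hypothetical helper: reads `dkms status` output on stdin and succeeds if an
# nvidia module is reported as installed for the given kernel release.
# $1: kernel release, e.g. the output of `uname -r`.
nvidia_dkms_built_for() {
    grep '^nvidia' | grep -F -- "$1" | grep -q 'installed'
}

# Usage:
#   dkms status | nvidia_dkms_built_for "$(uname -r)" \
#       && echo "nvidia DKMS module is built for the running kernel"
```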

Reposted from blog.csdn.net/ghcony/article/details/129702942