docker study notes (9): nvidia-docker installation, deployment and use

introduction

The NVIDIA Deep Learning GPU Training System (aka DIGITS) is a web application for training deep learning models. It puts the power of deep learning into the hands of engineers and data scientists, and can be used to quickly train high-accuracy deep neural networks (DNNs) for image classification, segmentation, and object detection tasks. Currently supported frameworks are Caffe, Torch, and TensorFlow.

With the 19.03.0 beta release, you no longer need to download the nvidia-docker2 package and rely on the nvidia-docker wrapper to launch GPU containers: the new --gpus option in the docker run CLI lets containers use GPU devices seamlessly.
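For example, GPU requests now look like this (a minimal sketch; nvidia/cuda:10.0-base is the test image used later in this post):

# All GPUs:
docker run --gpus all --rm nvidia/cuda:10.0-base nvidia-smi
# Only GPU 0 (note the quoting Docker requires for device lists):
docker run --gpus '"device=0"' --rm nvidia/cuda:10.0-base nvidia-smi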

New Docker CLI API Support for NVIDIA GPUs under Docker Engine 19.03.0 Pre-Release

nvidia-docker deployment and usage

Prerequisites

[Image: CUDA / driver / gcc version compatibility table]

First of all, CUDA and its corresponding dependencies such as gcc and g++ are required. As of 2019 the current gcc is 8.3.1, and the NVIDIA driver is backward compatible, so the driver only needs to be at or above the minimum version that your CUDA release requires in the table above. You can also read my previous two articles on installing CUDA and Docker:

Starting from 0 under Linux to build the GPU environment and start the test

Docker usage notes (1): docker introduction and installation

If both CUDA and Docker are already installed, check the status of CUDA, the NVIDIA driver, and Docker. For CUDA:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

If you selected the CUDA samples during installation and haven't run anything else yet, you can compile deviceQuery under the samples path:

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

If the installation is successful, it will display:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          11.0 / 10.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15110 MBytes (15843721216 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

If you didn't install the samples, in Python you can call torch.cuda.is_available(), and in C++ you can call cudaSetDevice() directly to see whether CUDA is usable.
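From the shell, the same checks might look like this (a sketch; assumes PyTorch is installed on the host):

# Runtime-level check: prints True when PyTorch can see a GPU.
python -c "import torch; print(torch.cuda.is_available())"

# Driver-level check: lists the GPUs the NVIDIA driver can see.
nvidia-smi -L

The environment-setup article linked above also includes CUDA tests; the typical CUDA failure log is: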

CUDA driver version is insufficient for CUDA runtime version
Result = Fail

This is caused by a mismatch between the CUDA runtime and the current driver. Don't ask me why I remember this error so clearly; it showed up in one of my containers just a few days ago. But my versions were consistent, so I went to the GitHub issue, where the official answer was:

I suspect you somehow ended up with CUDA runtime libraries installed into your image from a host machine that are a mismatch with the driver version running on your current host. How did you generate the submarineas/centos:v0.1 image?
You can't do this. The image must run with the host CUDA libraries injected into it. This is one of the primary functionalities that nvidia-docker provides. To fix the situation, you need to go into /usr/lib/x86_64-linux-gnu/ inside your container and remove any files of the form *.so. (e.g. libnvidia-ml.so.410.104) that don't match the driver version on your host.
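Concretely, the check described in that answer might look like this (a sketch; 410.104 is just the example version from the quote):

# Inside the container: list driver-pinned libraries and note their versions.
ls /usr/lib/x86_64-linux-gnu/ | grep -E 'libnvidia.*\.so\.[0-9]'

# On the host: the driver version those libraries must match.
nvidia-smi --query-gpu=driver_version --format=csv,noheader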

Docker's status is viewed with systemctl status docker.service:

Sep 08 21:37:02 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:02.895532061+08:00" level=info msg="ccResolverWrapper: sending update ...ule=grpc
Sep 08 21:37:02 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:02.895547307+08:00" level=info msg="ClientConn switching balancer to \...ule=grpc
Sep 08 21:37:02 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:02.904539499+08:00" level=info msg="[graphdriver] using prior storage ...verlay2"
Sep 08 21:37:03 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:03.235120765+08:00" level=info msg="Loading containers: start."
Sep 08 21:37:05 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:05.503083212+08:00" level=info msg="Default bridge (docker0) is assign...address"
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.347242198+08:00" level=info msg="Loading containers: done."
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.507743081+08:00" level=info msg="Docker daemon" commit=633a0ea grap...=19.03.5
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.507838124+08:00" level=info msg="Daemon has completed initialization"
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.574231587+08:00" level=info msg="API listen on /var/run/docker.sock"

Here's where errors can also occur:

Sep 08 14:11:41 10-9-111-182 dockerd[1058]: time="2020-09-08T14:11:41.522856125+08:00" level=error msg="Handler for POST /v1.40/containers/a3d065de1ea9/restar>
Sep 08 15:57:46 10-9-111-182 dockerd[1058]: time="2020-09-08T15:57:46.154184854+08:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 08 15:57:46 10-9-111-182 dockerd[1058]: time="2020-09-08T15:57:46.154205023+08:00" level=error msg="stream copy error: reading from a closed fifo"

Don't ask why I ran into this; I was burned by all kinds of garbage blogs... If docker shows this kind of error, it means the daemon.json file was modified earlier and then updated or deleted by mistake, so the Docker daemon configuration has become invalid and docker needs to be restarted. And if this log appears for a container that is already running, you need to change all the ports and data volumes and recreate it.
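If daemon.json was mangled, restoring a known-good file and restarting usually clears this up. A minimal sketch for an nvidia-docker2 setup (back up and merge with your existing settings rather than blindly overwriting):

# Registers the nvidia runtime with the Docker daemon.
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker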

The last thing is the state of the driver, which is generally fine. If something goes wrong there, shut down and think about life first.
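Two quick ways to look at the driver from the host:

# Should print the GPU table; if it errors, fix the driver before anything else.
nvidia-smi

# Shows the kernel module and driver version currently loaded.
cat /proc/driver/nvidia/version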

nvidia-docker install

ubuntu:

curl https://get.docker.com | sh

sudo systemctl start docker && sudo systemctl enable docker

# Set up the stable repository and GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# To access experimental features such as CUDA on WSL or the new MIG capability
# on the A100, you may need to add the experimental branch to the repository list.
# This step is optional.
curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list

# After updating the package lists, install nvidia-docker2 (and its dependencies):
sudo apt-get update

sudo apt-get install -y nvidia-docker2

# After setting the default runtime, restart the Docker daemon to complete the installation:
sudo systemctl restart docker

So far, if nothing went wrong, nvidia-docker on Ubuntu has been installed successfully. The steps above are NVIDIA's official installation method, but I started out on CentOS, while Ubuntu is NVIDIA's favorite child. I searched for a good while; maybe my Google queries were wrong, but Baidu was definitely a pit, since most results didn't even copy the official site correctly. I figured NVIDIA simply hadn't written CentOS instructions, until I found another link on NVIDIA's official site, the more piecemeal nvidia-docker setup:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo

# Enable the experimental repositories (optional):
sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum-config-manager --enable nvidia-container-runtime-experimental

# And to disable them again:
sudo yum-config-manager --disable libnvidia-container-experimental
sudo yum-config-manager --disable nvidia-container-runtime-experimental
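The repository setup above doesn't install anything by itself. A sketch of the remaining steps, per the nvidia-container-runtime page linked below:

# Install the runtime package and restart Docker to pick it up.
sudo yum install -y nvidia-container-runtime
sudo systemctl restart docker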

Under the hood, nvidia-docker consists of three core pieces: nvidia-container-runtime, libnvidia-container, and the nvidia-container-toolkit, which is why several repositories appear above. Still, where official documentation exists it is best to follow it. The two references are:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker

https://nvidia.github.io/nvidia-container-runtime/

nvidia-docker deployment issues

We can run the test image given by NVIDIA. What needs to be distinguished here is nvidia-docker versus nvidia-docker2. If you didn't use the official installation method but went the wild way like me, you have to check whether your installation is version 1 or 2; the slightly different launch commands are:

# nvidia-docker: the nvidia-container-toolkit installation method
docker run --gpus=all --rm nvidia/cuda:10.0-base nvidia-smi

# nvidia-docker2
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm nvidia/cuda:10.0-base nvidia-smi
or 
nvidia-docker run -e NVIDIA_VISIBLE_DEVICES=all --rm nvidia/cuda:10.0-base nvidia-smi

If you can't tell which one you have, just try them all. For example, after running the first command on CentOS I got this error:

could not select device driver "" with capabilities: [[gpu]].

This problem does not happen with nvidia-docker2. According to one of the contributors on issue 1034:

Hello!
If you didn’t already make sure you’ve installed the nvidia-container-toolkit.
If this doesn’t fix it for you, make sure you’ve restarted docker systemctl restart dockerd

The same answer appears on Stack Overflow: the device driver cannot be found because nvidia-container-toolkit is missing. So install the corresponding packages following the nvidia-container installation steps I provided above:
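A sketch for CentOS, assuming the repository configured in the install section:

# Install the toolkit the error is complaining about, then restart Docker.
sudo yum install -y nvidia-container-toolkit
sudo systemctl restart docker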

ldcache error: open failed: /sbin/ldconfig.real: no such file or directory\\n\""": unknown.

I don't know whether this next error is unique to the deepin system or just to non-mainstream servers; either way, it happened to me on deepin. Don't ask why, after Ubuntu and CentOS, deepin shows up as well... it's a sad story. When this problem occurs, a pile of nvidia-docker-container logs is printed before it.

Solving it is simple: refresh the dynamic linker cache and create a symlink at the path the error complains about:

sudo ldconfig -v	# show all links
or
ldconfig	# no output if nothing is wrong

ln -s /sbin/ldconfig /sbin/ldconfig.real

Running ldconfig here refreshes the linker cache, a bit like re-sourcing a shell profile. The -v flag prints every link it processes; if plain ldconfig doesn't surface anything, add -v to see which library is the problem.
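For instance, to see only the entries for a suspect library (cudnn here, matching the example below):

# Filter the verbose linker-cache output for one library:
sudo ldconfig -v 2>/dev/null | grep -i cudnn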

For example, on this system ldconfig reported a problem with libcudnn, so the link is:

 sudo ln -sf /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2 /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7

Then create the /sbin/ldconfig.real link as above and start again; the error will no longer be reported.

starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"

This error is difficult to deal with, let's go step by step.

First of all, going by the message, the container can't find the executable on its PATH. If the message is followed by something like cuda >=, then you can conclude the CUDA version is wrong. Let's first check Docker's volumes:

$ nvidia-docker volume ls
DRIVER              VOLUME NAME
local               f32bc4d3933b47c923b0e3e86222e2476e7131566950daad756790bc4129626d
nvidia-docker       nvidia_driver_450.51.06

If there is no nvidia-docker driver volume, create one manually:

docker volume create --driver=nvidia-docker --name=nvidia_driver_$(modinfo -F version nvidia)
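The $(modinfo -F version nvidia) substitution resolves the host kernel module version so the volume name matches the driver; you can run it on its own to check:

# Prints the version of the loaded nvidia kernel module,
# e.g. 450.51.06, matching nvidia_driver_450.51.06 above.
modinfo -F version nvidia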

After creating it, if the problem is a version mismatch, upgrade; if not, check the service logs:

systemctl status docker.service		# check the docker logs
sudo systemctl status nvidia-docker.service		# check the nvidia-docker logs

The docker log is:

● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─override.conf
   Active: active (running) since Tue 2020-09-08 11:09:10 CST; 7min ago
     Docs: https://docs.docker.com
 Main PID: 30459 (dockerd)
    Tasks: 40
   Memory: 66.7M
   CGroup: /system.slice/docker.service
           ├─30459 /usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
           ├─30610 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 3306 -container-ip 172.18.0.2 -container-port 3306
           ├─30672 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 15672 -container-ip 172.18.0.4 -container-port 15672
           └─30688 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 5672 -container-ip 172.18.0.4 -container-port 5672

Sep 08 11:09:09 10-9-111-182 dockerd[30459]: time="2020-09-08T11:09:09.703146487+08:00" level=info msg="Loading containers: start."
Sep 08 11:09:09 10-9-111-182 dockerd[30459]: time="2020-09-08T11:09:09.817109186+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --

The nvidia-docker log looks similar. If you spot a problem in either, research that specific message, since many components are involved; if both services look healthy, keep going.

$ yum search libcuda
"""
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
Last metadata expiration check: 0:14:56 ago on Tue 08 Sep 2020 01:50:04 PM CST.
No matches found.
"""

So libcuda is fine. Next, check the three nvidia-docker components themselves:

nvidia-container-cli -k -d /dev/tty info

"""
-- WARNING, the following logs are for debugging purposes only --

I0908 06:06:06.277294 106114 nvc.c:282] initializing library context (version=1.3.0, build=af0220ff5c503d9ac6a1b5a491918229edbb37a4)
I0908 06:06:06.277332 106114 nvc.c:256] using root /
I0908 06:06:06.277337 106114 nvc.c:257] using ldcache /etc/ld.so.cache
I0908 06:06:06.277341 106114 nvc.c:258] using unprivileged user 65534:65534
I0908 06:06:06.277362 106114 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0908 06:06:06.277498 106114 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I0908 06:06:06.278499 106115 nvc.c:192] loading kernel module nvidia
I0908 06:06:06.278650 106115 nvc.c:204] loading kernel module nvidia_uvm
I0908 06:06:06.278713 106115 nvc.c:212] loading kernel module nvidia_modeset

.......

CUDA version:   11.0

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-8546d1d2-7f12-2014-2498-6738e7ac1d2b
Bus Location:   00000000:00:03.0
Architecture:   7.5
I0908 08:22:40.155167 15854 nvc.c:337] shutting down library context
I0908 08:22:40.223031 15856 driver.c:156] terminating driver service
I0908 08:22:40.223527 15854 driver.c:196] driver service terminated successfully

"""

There is one suspicious line, dxcore initialization failed, continuing assuming a non-WSL environment, but I have never used anything Windows-related; it is just libnvidia-container probing for WSL, as the earlier "attempting to load dxcore" line shows. If you have read the nvidia-container-runtime section above and yum search libcuda (or the apt equivalent) finds what it should, don't worry about it.

If you still have problems, run:

nvidia-docker run --rm nvidia/cuda:10.0-devel bash -c 'echo $PATH'

This prints the PATH that nvidia-docker assembled inside the container through its dependencies; if the expected directories are missing, add them or mount the directories in manually.
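A rough sketch of forcing the usual locations onto PATH (/usr/local/nvidia/bin is where the injected driver utilities typically land; adjust to your image):

# Run with an explicit PATH so injected binaries like nvidia-smi are found.
docker run --runtime=nvidia --rm \
  -e PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  nvidia/cuda:10.0-base nvidia-smi

Then I finally hit yet another problem: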
[Image: the error in question]
This problem has no resolution for now. I asked NVIDIA's official staff, and they had no way to help me either. If you hit the same error and manage to solve it, please post it in the comments or send me a private message; I'd be grateful. If it can't be worked out, then congratulations... it remains unsolved.

Origin: blog.csdn.net/submarineas/article/details/108477031