Using an NVIDIA GPU to train neural networks in Docker containers

One, NVIDIA K80 driver installation

1, View the NVIDIA graphics card information on the server with the command lspci | grep NVIDIA:

05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

85:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

2, Download and install the graphics driver. This step was handled by colleagues in the host operations group. The current version is Driver Version: 396.37; running nvidia-smi shows this information:

NVIDIA-SMI 396.37                 Driver Version: 396.37

For the next installation, it is recommended to install a newer driver version (410 or above), because the driver version affects many of the choices in the later steps.

Driver URL: https://www.nvidia.com/Download/index.aspx?lang=en-us .
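For reference, a typical installation from the .run package downloaded there might look like the sketch below. The filename is illustrative, not from the original post; on machines running a display manager, stop it before launching the installer.

chmod +x NVIDIA-Linux-x86_64-410.104.run          # filename is illustrative
sudo ./NVIDIA-Linux-x86_64-410.104.run --silent   # non-interactive install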

Two, NVIDIA CUDA Toolkit installation

1, Based on the driver installed above, choose a suitable CUDA Toolkit version to download and install.

Download URL: https://developer.nvidia.com/cuda-toolkit-archive

Each CUDA release requires a minimum driver version (for example, CUDA 10.0 needs driver 410.48 or later on Linux), so check the CUDA-driver compatibility table before choosing.

For the next installation, we recommend version 10.0 or above, because the mainstream deep learning frameworks have better performance and support on these versions.
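As an illustration, installing CUDA 10.0 from its runfile could look like this sketch (the filename matches the CUDA 10.0 download page; --silent and --toolkit skip the interactive prompts and install only the toolkit):

sudo sh cuda_10.0.130_410.48_linux.run --silent --toolkit
# make the toolkit visible to the shell
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH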

2, Use CUDA's deviceQuery sample to verify the installation and print the device information (see the sketch below):
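A minimal way to build and run it, assuming the samples were installed to the default location:

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery    # should list each Tesla K80 and end with "Result = PASS"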

Three, cuDNN installation

1, cuDNN is a library of GPU-accelerated primitives for deep neural networks. Archive URL:

https://developer.nvidia.com/rdp/cudnn-archive

2, Select the version matching your CUDA installation, then download it, extract it, and copy the files into the CUDA directory:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
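To confirm which cuDNN version the copied header provides (the version macros live in cudnn.h):

grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h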

*** If you only use the NVIDIA GPU directly on the host, then after the above steps you can install a deep learning framework and start working. But if you want applications to run more flexibly and cleanly inside containers, you also need the preparation steps below. ***

Four, nvidia-docker2 installation

1, NVIDIA's Docker support lives in the nvidia-docker project on GitHub: https://github.com/NVIDIA/nvidia-docker . nvidia-docker2 extends Docker's built-in functionality with an NVIDIA container runtime (a wrapper around runc that injects GPU access); the architecture diagram is in the project README.

2, nvidia-docker2 requires a fairly new Docker. The versions installed here are docker-ce 18.09.6-3 and nvidia-docker2 2.0.3-3. (I have already downloaded the required RPM packages and can share them at any time.)

Installation commands (RPM package dependencies need to be resolved first; they are not listed here):

sudo yum install docker-ce-18.09.6-3.el7.x86_64.rpm
sudo yum install nvidia-docker2-2.0.3-3.docker18.09.6.ce.noarch.rpm
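A quick check that both packages landed:

rpm -q docker-ce nvidia-docker2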

3, After the installation is complete, two configuration files need to be modified or created:

a, /etc/systemd/system/docker.service.d/docker.conf

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --graph=/docker_data --storage-driver=overlay --insecure-registry harbor.xxxx.com.cn

The --graph parameter specifies the directory where Docker images are stored; it needs a disk with plenty of space.

b, /etc/docker/daemon.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

This file is required by nvidia-docker2; it replaces Docker's default runc with the NVIDIA runtime.

Using two files to customize the Docker configuration lets us both replace runc and point at an internal registry. Quite reasonable~

4, After these changes are applied, restart the Docker service; docker info will then show that the changes have taken effect:
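The usual sequence for applying a systemd drop-in plus a daemon.json change:

sudo systemctl daemon-reload
sudo systemctl restart docker
docker info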

...
Server Version: 18.09.6
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Runtimes: nvidia runc
Default Runtime: nvidia
Docker Root Dir: /data05
Insecure Registries:
 harbor.xxxx.com.cn
 127.0.0.0/8

...
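Before moving on to the framework images, a quick sanity check that containers can see the GPUs. The image tag below follows the nvidia-docker README convention and is an assumption for this driver generation:

docker run --rm nvidia/cuda:9.0-base nvidia-smi
# should list the same four Tesla K80s as on the host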

Five, Docker image testing:

1, The images tested this time are as follows (smoke tests follow the list):

---anibali/pytorch:cuda-9.2

---tensorflow/tensorflow:1.12.0-gpu-py3
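A minimal smoke test for each image; since "nvidia" is the default runtime in daemon.json above, no --runtime flag is needed. The one-liners are illustrative, not from the original post:

docker run --rm anibali/pytorch:cuda-9.2 python -c "import torch; print(torch.cuda.is_available())"
docker run --rm tensorflow/tensorflow:1.12.0-gpu-py3 python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
# both should print True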

2, After a fresh server installation, if you have particularly customized requirements, consider building your own image on top of an anaconda base image.

3, With both of the images above, GPU acceleration reaches roughly 10x CPU speed.
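As a rough sketch of the kind of measurement behind that figure (the matrix size and loop count are arbitrary choices, not the author's benchmark; run the same snippet in the CPU-only tensorflow/tensorflow:1.12.0-py3 image for the baseline):

docker run --rm tensorflow/tensorflow:1.12.0-gpu-py3 python -c "
import tensorflow as tf, time
a = tf.random_normal([4096, 4096])   # random square matrices
b = tf.matmul(a, a)
with tf.Session() as sess:
    sess.run(b)                      # warm-up run
    t0 = time.time()
    for _ in range(10):
        sess.run(b)
    print('10 matmuls: %.3fs' % (time.time() - t0))
"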

 

Six, Using K8s to manage Docker container deployment

To be continued.
