Kubernetes GPU usage scenarios

Reposted from https://www.dazhuanlan.com/2019/08/25/5d623a2c4a7e6/


Kubernetes provides experimental support for managing AMD and NVIDIA GPUs spread across nodes. Support for NVIDIA GPUs was added in v1.6 and has gone through several
backwards-incompatible iterations; support for AMD GPUs was added in v1.9 through the device plugin mechanism.

From v1.8 onwards, the recommended way to consume GPUs is through device plugins. To enable GPU support via device plugins on versions prior to 1.10, the DevicePlugins feature gate
must be explicitly enabled across the whole system: --feature-gates="DevicePlugins=true". From v1.10 onwards this is no longer necessary.

You must then install the vendor's GPU driver on each node and run the corresponding device plugin from the GPU vendor (AMD / NVIDIA).
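
Once the driver and device plugin are in place, a pod requests GPUs through the extended resource name exposed by the plugin, e.g. nvidia.com/gpu. A minimal sketch of such a pod spec (pod and container names are placeholders; the image tag is only an assumption, any CUDA image matching the node's driver works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                        # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container                # placeholder name
    image: nvidia/cuda:10.0-base        # assumed image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1               # GPUs are requested in limits only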

2. Kubernetes GPU cluster deployment

Kubernetes cluster version: 1.13.5
Docker version: 18.06.1-ce
OS version: CentOS 7.5
Kernel version: 4.20.13-1.el7.elrepo.x86_64
NVIDIA GPU model:

2.1 Install the NVIDIA driver

2.1.1 Install gcc

[root@k8s-01 ~]# yum install -y gcc

2.1.2 Download the NVIDIA driver

Download Link: https://www.nvidia.cn/Download/driverResults.aspx/141795/cn

The version downloaded here is as follows:

[root@k8s-01 ~]# ls NVIDIA-Linux-x86_64-410.93.run -alh
-rw-r--r-- 1 root root 103M Jul 25 17:22 NVIDIA-Linux-x86_64-410.93.run

2.1.3 Modify /etc/modprobe.d/blacklist.conf to block loading of the nouveau module

[root@k8s-01 ~]# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf

2.1.4 Rebuild the initramfs image

[root@k8s-01 ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@k8s-01 ~]# dracut /boot/initramfs-$(uname -r).img $(uname -r)
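
After the initramfs is rebuilt and the node has been rebooted (the reboot itself is assumed here), you can confirm that nouveau is no longer loaded; the command should print nothing:

[root@k8s-01 ~]# lsmod | grep nouveau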

2.1.5 Run the driver installer

[root@k8s-01 ~]# sh NVIDIA-Linux-x86_64-410.93.run -a -q -s
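
After the installer finishes, a quick sanity check is to run nvidia-smi; if the driver is loaded correctly it prints the driver version (410.93 here) and the detected GPUs, otherwise it reports that it cannot communicate with the NVIDIA driver:

[root@k8s-01 ~]# nvidia-smi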

2.1.6 Install the toolkits

The driver alone is not enough; we also need the related toolkits, namely CUDA and cuDNN.

[root@k8s-01 ~]# cat /etc/yum.repos.d/cuda.repo 
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
[root@k8s-01 ~]#
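
With this repository in place the toolkit can be installed through yum. The exact package name depends on the CUDA release you want; a sketch, assuming the generic cuda metapackage (use a versioned package such as cuda-10-0 to pin a specific release):

[root@k8s-01 ~]# yum install -y cuda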

2.2 Install nvidia-docker2

nvidia-docker was a wrapper built on top of Docker for using GPUs; it has essentially been abandoned.
nvidia-docker2 provides a container runtime and has much better compatibility with Docker.

  • Add the nvidia-docker2 yum repository
[root@k8s-01 ~]# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
[root@k8s-01 ~]# curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
  • List the available nvidia-docker2 versions

We need to install the nvidia-docker2 build that matches docker-18.06.1-ce; otherwise it will not be supported.

[root@k8s-01 ~]# yum list nvidia-docker2 --showduplicates
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* epel: mirror01.idc.hinet.net
* extras: mirrors.aliyun.com
* updates: mirrors.163.com
Installed Packages
nvidia-docker2.noarch 2.0.3-1.docker18.06.1.ce @nvidia-docker
Available Packages
nvidia-docker2.noarch 2.0.0-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.06.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.12.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.12.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.12.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.03.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.03.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.2 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.2 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.3.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.4.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-2.docker18.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-2.docker18.09.5.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.06.3.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.5.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.6.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.7.ce nvidia-docker
nvidia-docker2.noarch 2.1.0-1 nvidia-docker
nvidia-docker2.noarch 2.1.1-1 nvidia-docker
nvidia-docker2.noarch 2.2.0-1 nvidia-docker

Here we install version 2.0.3-1.docker18.06.1.ce.

  • Install nvidia-docker2
[root@k8s-01 ~]# yum install -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
  • Configure nvidia as the default Docker runtime
[root@k8s-01 ~]# cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • Restart Docker
[root@k8s-01 ~]# systemctl restart docker
  • Check the Docker info
[root@k8s-01 wf-deploy]# docker info
Containers: 63
Running: 0
Paused: 0
Stopped: 63
Images: 51
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc nvidia
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340-dirty (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.20.13-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.79GiB
Name: k8s-01
ID: DWPY:P2I4:NWL4:3U3O:UTGC:PLJC:IGTO:7ZXJ:A7CD:SJGT:7WT5:WNGX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
192.168.50.2
127.0.0.0/8
Live Restore Enabled: false

You can see that Docker's default runtime is now nvidia.
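
To confirm that containers can actually see the GPU, a quick test is to run nvidia-smi inside a CUDA base image (the image tag is an assumption; pick one that matches the installed driver). Because nvidia is now the default runtime, no extra flag is needed:

[root@k8s-01 ~]# docker run --rm nvidia/cuda:10.0-base nvidia-smi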

2.3 Install the device plugin

  • Apply the latest device plugin yaml
[root@k8s-01 ~]# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
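
Once the device plugin DaemonSet is running on the GPU node, the node advertises nvidia.com/gpu in its Capacity and Allocatable resources, which can be checked with:

[root@k8s-01 ~]# kubectl describe node k8s-01 | grep nvidia.com/gpu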

3. CLI introduction

  • nvidia-container-cli

nvidia-container-cli is a command-line tool for configuring Linux containers to use GPU hardware. It supports three subcommands (a short example follows the list):
1) list: print the NVIDIA driver libraries and their paths
2) info: print all NVIDIA GPU devices
3) configure: enter a given process's namespaces and perform the operations needed for that container to use the specified GPUs and capabilities (with the designated NVIDIA driver libraries). configure is the main command we use; it maps the NVIDIA GPU devices, driver libraries, and related information into the container by way of file mounts.
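
A quick way to see what the tool exposes on the host is to run the info and list subcommands directly; the output depends on the installed driver and the GPUs present:

[root@k8s-01 ~]# nvidia-container-cli info
[root@k8s-01 ~]# nvidia-container-cli list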
