Source: https://www.dazhuanlan.com/2019/08/25/5d623a2c4a7e6/
Kubernetes provides experimental support for managing AMD and NVIDIA GPUs across nodes. Support for NVIDIA GPUs was added in v1.6 and went through several backwards-incompatible iterations; support for AMD GPUs was added in v1.9 via the device plugin mechanism.
From v1.8, the recommended way to consume GPUs is through device plugins. On versions prior to 1.10, the DevicePlugins feature gate must be enabled across the entire system: --feature-gates="DevicePlugins=true". From v1.10 onward this is no longer necessary.
You must then install the vendor's GPU driver on each node and run the corresponding (AMD/NVIDIA) device plugin.
2. Kubernetes GPU Cluster Deployment
Kubernetes cluster version: 1.13.5
Docker version: 18.06.1-ce
OS version: CentOS 7.5
Kernel version: 4.20.13-1.el7.elrepo.x86_64
NVIDIA GPU model:
2.1 Install the NVIDIA driver
2.1.1 Install gcc
[root@k8s-01 ~]# yum install -y gcc
2.1.2 Download the NVIDIA driver
Download Link: https://www.nvidia.cn/Download/driverResults.aspx/141795/cn
The version downloaded here is:
[root@k8s-01 ~]# ls -alh NVIDIA-Linux-x86_64-410.93.run
-rw-r--r-- 1 root root 103M Jul 25 17:22 NVIDIA-Linux-x86_64-410.93.run
2.1.3 Edit /etc/modprobe.d/blacklist.conf to block loading of the nouveau module
[root@k8s-01 ~]# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
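As a sanity check, the blacklist file can be staged in a scratch location and inspected before it is copied into /etc/modprobe.d/. This is a sketch; the mktemp scratch path is illustrative:

```shell
# Sketch: stage the two-line nouveau blacklist in a temp file and inspect
# it before installing as /etc/modprobe.d/blacklist.conf (temp path is
# illustrative only).
tmpfile=$(mktemp)
printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$tmpfile"
cat "$tmpfile"
```

The file must contain exactly these two lines; a stray character here (for example a lost `\n`) would leave nouveau loadable and the NVIDIA installer would fail.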
2.1.4 Rebuild the initramfs image
[root@k8s-01 ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@k8s-01 ~]# dracut /boot/initramfs-$(uname -r).img $(uname -r)
2.1.5 Run the driver installer
[root@k8s-01 ~]# sh NVIDIA-Linux-x86_64-410.93.run -a -q -s
2.1.6 Install the CUDA toolkit
The driver alone is not enough; we also need toolkits to work with, namely CUDA, cuDNN and related packages. Add the CUDA yum repository:
[root@k8s-01 ~]# cat /etc/yum.repos.d/cuda.repo
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
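Instead of editing by hand, the repo file can be generated with a heredoc. A sketch, writing to a temp path for illustration; the real target is /etc/yum.repos.d/cuda.repo:

```shell
# Sketch: generate cuda.repo with a heredoc (temp path is illustrative;
# write to /etc/yum.repos.d/cuda.repo on a real host).
repo=$(mktemp)
cat > "$repo" <<'EOF'
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
EOF
cat "$repo"
```

With gpgcheck=1, yum verifies packages against the key given by gpgkey before installing.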
2.2 Install nvidia-docker2
nvidia-docker is a wrapper layered on top of Docker for using GPUs; it has essentially been abandoned. nvidia-docker2 is a container runtime that integrates much better with Docker.
Add the nvidia-docker2 yum repository:
[root@k8s-01 ~]# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
[root@k8s-01 ~]# curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
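For reference, the $distribution variable is built by sourcing /etc/os-release and concatenating ID and VERSION_ID. The sketch below reproduces that with a sample file (the sample path and contents are illustrative; on the CentOS 7.5 host above the result would be centos7):

```shell
# Sketch: how $distribution is derived from /etc/os-release.
# We source a sample file here instead of the real /etc/os-release.
os_release=$(mktemp)
printf 'ID="centos"\nVERSION_ID="7"\n' > "$os_release"
distribution=$(. "$os_release"; echo $ID$VERSION_ID)
echo "$distribution"   # centos7
```

This string selects the right repo path under https://nvidia.github.io/nvidia-docker/ for the distribution.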
List the available nvidia-docker2 versions.
We need the nvidia-docker2 build that matches docker-18.06.1-ce; other builds will not work with this Docker version.
[root@k8s-01 ~]# yum list nvidia-docker2 --showduplicates
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * epel: mirror01.idc.hinet.net
 * extras: mirrors.aliyun.com
 * updates: mirrors.163.com
Installed Packages
nvidia-docker2.noarch    2.0.3-1.docker18.06.1.ce    @nvidia-docker
Available Packages
nvidia-docker2.noarch    2.0.0-1.docker1.12.6        nvidia-docker
nvidia-docker2.noarch    2.0.0-1.docker17.03.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.0-1.docker17.06.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.0-1.docker17.06.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.0-1.docker17.09.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.1-1.docker1.12.6        nvidia-docker
nvidia-docker2.noarch    2.0.1-1.docker1.13.1        nvidia-docker
nvidia-docker2.noarch    2.0.1-1.docker17.03.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.1-1.docker17.06.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.1-1.docker17.09.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.1-1.docker17.09.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker1.12.6        nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker1.13.1        nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker17.03.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker17.06.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker17.09.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker17.09.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.2-1.docker17.12.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker1.12.6        nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker1.13.1        nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker17.03.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker17.06.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker17.09.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker17.09.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker17.12.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker17.12.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.03.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.03.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.06.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.06.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.06.2      nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.09.0.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.09.1.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.09.2      nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.09.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.09.3.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-1.docker18.09.4.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-2.docker18.06.2.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-2.docker18.09.5.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-3.docker18.06.3.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-3.docker18.09.5.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-3.docker18.09.6.ce    nvidia-docker
nvidia-docker2.noarch    2.0.3-3.docker18.09.7.ce    nvidia-docker
nvidia-docker2.noarch    2.1.0-1                    nvidia-docker
nvidia-docker2.noarch    2.1.1-1                    nvidia-docker
nvidia-docker2.noarch    2.2.0-1                    nvidia-docker
Here we install the 2.0.3-1.docker18.06.1.ce version.
[root@k8s-01 ~]# yum install -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
Configure nvidia as the default Docker runtime:
[root@k8s-01 ~]# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
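A syntax error in daemon.json will stop the Docker daemon from starting, so it is worth validating the JSON before restarting. A sketch using Python's json.tool on a temp copy (on a real host, point it at /etc/docker/daemon.json; the interpreter may be `python` rather than `python3` on CentOS 7):

```shell
# Sketch: validate daemon.json syntax before restarting Docker.
# A temp copy is used here for illustration.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
python3 -m json.tool "$cfg" > /dev/null && echo "daemon.json: valid JSON"
```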
[root@k8s-01 ~]# systemctl restart docker
[root@k8s-01 wf-deploy]# docker info
Containers: 63
 Running: 0
 Paused: 0
 Stopped: 63
Images: 51
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc nvidia
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340-dirty (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.20.13-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.79GiB
Name: k8s-01
ID: DWPY:P2I4:NWL4:3U3O:UTGC:PLJC:IGTO:7ZXJ:A7CD:SJGT:7WT5:WNGX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 192.168.50.2
 127.0.0.0/8
Live Restore Enabled: false
As shown above, the default Docker runtime is now nvidia.
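Rather than scanning the whole docker info dump, the relevant line can be filtered out. The default_runtime helper below is our own illustrative function, not a docker subcommand; it is demonstrated on a captured sample of the output:

```shell
# default_runtime: print the "Default Runtime" value from `docker info`
# output read on stdin (helper name is ours, not part of docker).
default_runtime() {
  awk -F': ' '/^ *Default Runtime/ {print $2; exit}'
}

# On a live host: docker info | default_runtime
# Demonstrated here with a captured sample of the relevant lines:
printf 'Runtimes: runc nvidia\nDefault Runtime: nvidia\n' | default_runtime
```

If this prints runc instead of nvidia, recheck the "default-runtime" key in /etc/docker/daemon.json and restart Docker.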
2.3 Install the device plugin
Apply the latest device plugin manifest:
[root@k8s-01 ~]# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
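Once the plugin is running, pods request GPUs through the nvidia.com/gpu extended resource. A sketch manifest follows; the pod name and image tag are illustrative, and it is written to a temp file here rather than applied directly:

```shell
# Sketch: a pod requesting one NVIDIA GPU via the extended resource.
# Pod/image names are illustrative; apply with kubectl on a real cluster.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:10.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
echo "wrote $manifest"
# kubectl apply -f "$manifest"   # run on the cluster
```

The scheduler will only place this pod on a node whose device plugin has advertised at least one nvidia.com/gpu.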
3. CLI Introduction
nvidia-container-cli is a command-line tool for configuring Linux containers to use GPU hardware. It supports three commands:
1) list: print the NVIDIA driver libraries and their paths;
2) info: print all NVIDIA GPU devices;
3) configure: enter the namespace of a given process and perform the operations needed for that container to use the specified GPUs and capabilities (mounting the specified NVIDIA driver libraries).
configure is the main command we use; it maps the NVIDIA GPU devices and driver libraries into the container by means of file mounts.