NVIDIA environment setup and Kubernetes installation

Installation Environment

System Requirements

CPU: 2 cores

Memory: 2GB

Graphics: an NVIDIA series GPU

Install Docker

apt install docker.io
systemctl enable docker.service

Install Kubernetes (k8s)

Add the package source

For convenience, switch Ubuntu's package sources to the Alibaba Cloud mirror in the Software & Updates tool.

(screenshot: 2019-12-18 14-25-44)

Add the Kubernetes source to /etc/apt/sources.list:

deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main

Then update the package index:

apt update

Problem: NO_PUBKEY

apt update may fail with an error such as:

NO_PUBKEY BA300B7755AFCFAE

Import the missing key (substitute the key ID shown in your own error message):

apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 6A030B21BA07F4FB

Problem: broken dependencies

If apt reports unmet dependencies, run apt update again, or:

apt --fix-broken install

Modify /etc/hosts

vim /etc/hosts

Comment out the 127.0.0.1/127.0.1.1 lines that map to the local computer name, and add the IP addresses of the machines that will form the cluster:

#127.0.0.1      localhost
#127.0.1.1      dell3

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.9.103 master
192.168.9.104 node1
192.168.9.105 node2
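On a cluster with several nodes, the hosts entries can be appended with a small helper. A hypothetical sketch (the add_host function is ours, not part of any tool); it skips names that are already present, so it is safe to re-run:

```shell
# Append "IP name" to a hosts file unless the name is already listed.
add_host() {
    ip=$1; name=$2; file=$3
    grep -qw "$name" "$file" || echo "$ip $name" >> "$file"
}

# Example: add_host 192.168.9.104 node1 /etc/hosts
```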

Install kubeadm

apt install kubeadm

kubelet and kubectl are installed automatically along with kubeadm.

List the required images

kubeadm config images list

The results are:

k8s.gcr.io/kube-apiserver:v1.17.0
k8s.gcr.io/kube-controller-manager:v1.17.0
k8s.gcr.io/kube-scheduler:v1.17.0
k8s.gcr.io/kube-proxy:v1.17.0
k8s.gcr.io/pause:3.1
k8s.gcr.io/etcd:3.4.3-0
k8s.gcr.io/coredns:1.6.5

Pull these images from a domestic mirror (Aliyun) instead:

docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5

Retag them with docker tag so kubeadm finds them under the k8s.gcr.io names:

docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 k8s.gcr.io/pause:3.1
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0 k8s.gcr.io/kube-apiserver:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0 k8s.gcr.io/kube-controller-manager:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0 k8s.gcr.io/kube-scheduler:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0 k8s.gcr.io/kube-proxy:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0 k8s.gcr.io/etcd:3.4.3-0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5 k8s.gcr.io/coredns:1.6.5
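The fourteen commands above can be collapsed into one loop over the image list. A sketch that prints the commands first as a dry run (set run= to an empty string to execute them for real):

```shell
# Pull each image from the Aliyun mirror and retag it under k8s.gcr.io
# so that kubeadm finds it locally. run=echo makes this a dry run.
run=echo
ALIYUN=registry.cn-hangzhou.aliyuncs.com/google_containers
for img in kube-apiserver:v1.17.0 kube-controller-manager:v1.17.0 \
           kube-scheduler:v1.17.0 kube-proxy:v1.17.0 \
           pause:3.1 etcd:3.4.3-0 coredns:1.6.5; do
    $run docker pull "$ALIYUN/$img"
    $run docker tag "$ALIYUN/$img" "k8s.gcr.io/$img"
done
```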

Configure the master

Turn off swap

swapoff -a

Initialize the cluster:

root@dell3:~# kubeadm init --kubernetes-version=v1.17.0 --pod-network-cidr 192.168.0.0/16
W1218 14:48:40.560734   20883 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1218 14:48:40.560767   20883 validation.go:28] Cannot validate kubelet config - no validator is available
[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
    [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
    [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
    [ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

If you see the swap error above, disable swap:

swapoff -a

On success, kubeadm prints:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.9.103:6443 --token tn3e9a.6fgbdbu3vvus8ia9 \
    --discovery-token-ca-cert-hash sha256:ce5aa219f8fd1da40646997f2c3d27ee905989812b115146356ecfc9304036ba 

The join token printed above is valid for 24 hours by default; if it expires, run kubeadm token create --print-join-command on the master to get a fresh join command.

Follow the prompts and run the three commands:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Add the network plugin (flannel):

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

On success the output is:

podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created

View the pods

root@dell3:~# kubectl get pod
No resources found in default namespace.
root@dell3:~# kubectl get pod -n kube-system
NAME                            READY   STATUS     RESTARTS   AGE
coredns-6955765f44-gccbp        0/1     Pending    0          2m33s
coredns-6955765f44-gl7zg        0/1     Pending    0          2m33s
etcd-dell3                      1/1     Running    0          2m33s
kube-apiserver-dell3            1/1     Running    0          2m33s
kube-controller-manager-dell3   1/1     Running    0          2m33s
kube-flannel-ds-amd64-rrhng     0/1     Init:0/1   0          70s
kube-proxy-srnvg                1/1     Running    0          2m33s
kube-scheduler-dell3            1/1     Running    0          2m33s

Once the core components are all Running, Kubernetes is installed successfully. (coredns stays Pending until the flannel pod finishes initializing.)

Install CUDA

NVIDIA recommends installing the NVIDIA driver through the CUDA installer. Use the --override parameter to skip the gcc version check:

./cuda.run --override

Install NVIDIA drivers

Find the right version

Look up the driver version that matches your GPU model on the NVIDIA website.

Installation

For convenience, install the driver directly through the Additional Drivers tool that Ubuntu 19.10 provides.

(screenshot: 2019-12-18 16-50-06)

Signs of success

Open NVIDIA X Server Settings to see the details of the graphics card. If the window is blank, the NVIDIA driver is not installed.

(screenshot: 2019-12-18 15-09-12)

Run nvidia-smi to see more driver information.

root@dell3:~# nvidia-smi
Wed Dec 18 15:07:25 2019       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.107    Driver Version: 340.107        |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 610      Off  | 0000:01:00.0     N/A |                  N/A |
|100%   56C    P8    N/A /  N/A |    129MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 610      Off  | 0000:06:00.0     N/A |                  N/A |
|100%   46C    P8    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+

Problems

Login loop

After boot, press Ctrl+Alt+F[1-6] to switch to a text console and remove the NVIDIA driver:

apt remove --purge nvidia-*

If the driver was installed from a .run package, you can instead run:

./nvidia-*.run --uninstall

Reboot into the graphical interface again, disable the login password in Settings, and then reinstall the NVIDIA driver.

Cannot enter the graphical interface

Reinstall the NVIDIA driver.

Install the NVIDIA device plugin for Kubernetes

Install nvidia-docker2

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

Test nvidia-docker2

Run a CUDA container with nvidia-docker2:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

(screenshot: 2019-12-18 15-28-48)

The first run needs to download the image; if the download is too slow, you can configure a registry mirror (accelerator).

The output is:

root@dell:~# docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Wed Dec 18 08:55:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   30C    P8     1W /  38W |    359MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Deploy the device plugin

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

Modify the default runtime

Edit /etc/docker/daemon.json and add the default-runtime key:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
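A syntax error in daemon.json prevents Docker from starting at all, so it is worth validating the file before restarting. A small sketch (check_daemon_json is a hypothetical helper of ours and assumes python3 is available):

```shell
# Verify that a daemon.json parses as JSON and names nvidia as the
# default runtime; returns nonzero otherwise.
check_daemon_json() {
    python3 -c 'import json, sys
cfg = json.load(open(sys.argv[1]))
sys.exit(0 if cfg.get("default-runtime") == "nvidia" else 1)' "$1"
}

# Example: check_daemon_json /etc/docker/daemon.json && systemctl restart docker
```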

Restart docker

# systemctl daemon-reload

# systemctl restart docker

Usage

Check whether k8s recognizes the GPU

Run kubectl describe node <node_name> to see the node's details:

root@dell:~/mypod# kubectl describe nodes
Name:               dell
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    disktype=ssd
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=dell
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:9a:2d:50:03:4b"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.8.52
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 12 Dec 2019 10:25:16 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 18 Dec 2019 17:17:55 +0800   Thu, 12 Dec 2019 18:00:39 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 18 Dec 2019 17:17:55 +0800   Thu, 12 Dec 2019 18:00:39 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 18 Dec 2019 17:17:55 +0800   Thu, 12 Dec 2019 18:00:39 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 18 Dec 2019 17:17:55 +0800   Mon, 16 Dec 2019 10:30:19 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.8.52
  Hostname:    dell
Capacity:
 cpu:                4
 ephemeral-storage:  479152840Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             24568140Ki
 nvidia.com/gpu:     1
 pods:               110
Allocatable:
 cpu:                4
 ephemeral-storage:  441587256613
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             24465740Ki
 nvidia.com/gpu:     1
 pods:               110
System Info:
 Machine ID:                 833fac65cd12401db017c0b0033439e7
 System UUID:                28d52460-d7da-11dd-9d00-40167e218cad
 Boot ID:                    7a4a6548-28da-447c-845a-fab20ed82181
 Kernel Version:             5.3.0-24-generic
 OS Image:                   Ubuntu 19.10
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.2
 Kubelet Version:            v1.16.3
 Kube-Proxy Version:         v1.16.3
PodCIDR:                     192.168.0.0/24
PodCIDRs:                    192.168.0.0/24
Non-terminated Pods:         (13 in total)
  

If the Capacity section contains nvidia.com/gpu: 1, Kubernetes has recognized the GPU on this machine.

Using the GPU

Create a pod that requests the GPU

Create a file gpu-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: tf-pod
spec:
  containers:
    - name: tf-container
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

Then run kubectl apply -f gpu-pod.yaml and check the pod status with kubectl get pod.

root@dell:~/mypod# kubectl get pod
NAME            READY   STATUS              RESTARTS   AGE
busybox-6t962   0/1     Completed           0          6d2h
cuda3           0/1     Completed           47         2d7h
gpu-cuda        0/1     Completed           0          5d
gpu-pod         0/1     Completed           0          5d1h
gpu-pod23       0/2     Pending             0          2d6h
hello-world     0/1     ContainerCreating   0          6d5h
myjob-k9hx5     0/1     Completed           0          6d3h
myjob2-xmdm8    0/1     Completed           0          6d1h
pi-9cttz        0/1     Completed           0          6d2h

Use kubectl describe pod <pod_name> to view the pod's details.
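To confirm the container actually sees the device, nvidia-smi can be run inside the pod. A sketch against the tf-pod example above, printed as a dry run (drop the echo to run it against a live cluster):

```shell
# nvidia-smi inside the container should show the GPU handed over
# by the device plugin, matching the host's nvidia-smi output.
echo kubectl exec tf-pod -- nvidia-smi
```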

Problem: kubectl commands fail

An error like this:

root@dell3:~# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
The connection to the server 192.168.9.103:6443 was refused - did you specify the right host or port?

This mostly happens after Docker restarts or after a reboot. The fix:

# swapoff -a
# systemctl daemon-reload
# systemctl restart docker
# systemctl restart kubelet

The most important step is disabling swap.
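Note that swapoff -a only lasts until the next reboot, which is one reason this problem recurs. To keep swap off permanently, the swap entry in /etc/fstab must be commented out as well; a sketch, wrapped in a function so the target file is explicit:

```shell
# Comment out every uncommented fstab line whose type field is "swap".
# Writes a .bak backup alongside the file.
disable_fstab_swap() {
    sed -i.bak '/[[:space:]]swap[[:space:]]/ s/^[^#]/#&/' "$1"
}

# Example: swapoff -a && disable_fstab_swap /etc/fstab
```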


Origin www.cnblogs.com/liuluopeng/p/12098071.html