Installation Environment
System Requirements
CPU: 2 cores
Memory: 2GB
Graphics: NVIDIA Series
Install Docker
apt install docker.io
Install k8s
Add the software source
For convenience, switch the Ubuntu software download source to the Alibaba Cloud (Aliyun) mirror.
Add the k8s software source to /etc/apt/sources.list:
deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main
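Update:
apt update
Problem: NO_PUBKEY
apt update may fail with a missing public key error such as:
NO_PUBKEY BA300B7755AFCFAE
Import the missing key, using the key ID reported in your own error message: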
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 6A030B21BA07F4FB
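When several keys are missing, the key IDs can be pulled out of the apt output instead of being copied by hand. The following is a minimal sketch; the function name missing_apt_keys is made up for illustration, and it assumes the error text contains NO_PUBKEY followed by a hexadecimal key ID, as in the message above.

```shell
# Print the unique key IDs that follow NO_PUBKEY in apt output read from stdin.
# (missing_apt_keys is a hypothetical helper name, not part of apt.)
missing_apt_keys() {
    grep -oE 'NO_PUBKEY [0-9A-F]+' | awk '{print $2}' | sort -u
}

# Example use (needs root and network access):
#   apt update 2>&1 | missing_apt_keys | while read -r key; do
#       apt-key adv --recv-keys --keyserver keyserver.ubuntu.com "$key"
#   done
```

The function only parses text, so it can be tried on saved apt output before any key is imported.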
Problem: unmet dependencies
Run one of the following:
apt update
apt --fix-broken install
Modify the hosts file
vim /etc/hosts
Comment out the line
127.0.0.1 computer_name
and add the IP addresses of the nodes that will form the cluster:
192.168.9.103 master
192.168.9.104 node1
192.168.9.105 node2
......
#127.0.0.1 localhost
#127.0.1.1 dell3
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.9.103 master
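Since the same entries have to be added on every node, the edit can be scripted so that running it twice does not duplicate lines. This is only a sketch under assumptions: add_host_entry is a hypothetical helper, and it treats any uncommented line that already lists the name as an existing entry.

```shell
# add_host_entry IP NAME [FILE]
# Append "IP NAME" to FILE (default /etc/hosts) unless NAME already appears
# on an uncommented line. (Hypothetical helper for illustration.)
add_host_entry() {
    ip=$1
    name=$2
    file=${3:-/etc/hosts}
    if awk -v n="$name" '
            /^[[:space:]]*#/ { next }    # skip commented lines
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }' "$file"; then
        return 0                         # entry already present, nothing to do
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example use (as root):
#   add_host_entry 192.168.9.103 master
#   add_host_entry 192.168.9.104 node1
#   add_host_entry 192.168.9.105 node2
```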
Install kubeadm
apt install kubeadm
kubectl and kubelet are installed automatically when you install kubeadm.
List the required images
kubeadm config images list
The output is:
k8s.gcr.io/kube-apiserver:v1.17.0
k8s.gcr.io/kube-controller-manager:v1.17.0
k8s.gcr.io/kube-scheduler:v1.17.0
k8s.gcr.io/kube-proxy:v1.17.0
k8s.gcr.io/pause:3.1
k8s.gcr.io/etcd:3.4.3-0
k8s.gcr.io/coredns:1.6.5
Download these images from a domestic (China-based) mirror
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5
Use docker tag to retag the images with their k8s.gcr.io names
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 k8s.gcr.io/pause:3.1
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0 k8s.gcr.io/kube-apiserver:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0 k8s.gcr.io/kube-controller-manager:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0 k8s.gcr.io/kube-scheduler:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0 k8s.gcr.io/kube-proxy:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0 k8s.gcr.io/etcd:3.4.3-0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5 k8s.gcr.io/coredns:1.6.5
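The pull and tag steps above follow one pattern per image, so they can be generated from the image list rather than typed out. A minimal sketch, assuming every image sits directly under the mirror's google_containers path, as in the v1.17.0 list above (image names with an extra path level would need different handling). mirror_commands is a hypothetical helper name.

```shell
MIRROR=registry.cn-hangzhou.aliyuncs.com/google_containers

# Read image names (one per line) on stdin and print the matching
# docker pull / docker tag commands, without executing anything.
mirror_commands() {
    while read -r image; do
        [ -n "$image" ] || continue
        name=${image##*/}            # e.g. kube-proxy:v1.17.0
        echo "docker pull $MIRROR/$name"
        echo "docker tag $MIRROR/$name $image"
    done
}

# Review the generated commands first, then pipe them to sh to run them:
#   kubeadm config images list | mirror_commands
#   kubeadm config images list | mirror_commands | sh
```

Printing the commands instead of running them directly makes it easy to check the names before anything is downloaded.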
Configure master
Turn off swap
swapoff -a
Initialize:
root@dell3:~# kubeadm init --kubernetes-version=v1.17.0 --pod-network-cidr 192.168.0.0/16
W1218 14:48:40.560734 20883 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1218 14:48:40.560767 20883 validation.go:28] Cannot validate kubelet config - no validator is available
[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
Turn off swap as the error message instructs:
swapoff -a
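Before running kubeadm init it can help to confirm that swap is really off. The sketch below reads /proc/swaps, which lists active swap devices under a header line; swap_active is a hypothetical helper name. Note, as general Linux background rather than something from this walkthrough, that swapoff -a only lasts until the next reboot unless the swap entry in /etc/fstab is also commented out.

```shell
# swap_active [FILE]
# Succeed if FILE (default /proc/swaps) lists at least one swap device,
# i.e. contains a non-empty line after the header line.
swap_active() {
    awk 'NR > 1 && NF > 0 { found = 1 } END { exit (found ? 0 : 1) }' \
        "${1:-/proc/swaps}"
}

# Example use:
#   if swap_active; then echo "swap is still on, run swapoff -a"; fi
```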
Then run kubeadm init again.
Output when the installation completes:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.9.103:6443 --token tn3e9a.6fgbdbu3vvus8ia9 \
--discovery-token-ca-cert-hash sha256:ce5aa219f8fd1da40646997f2c3d27ee905989812b115146356ecfc9304036ba
Following the prompt, run the three commands:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
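Add the network configuration
This step deploys flannel. Its manifest assumes the pod network 10.244.0.0/16 by default, so if kubeadm init was run with a different --pod-network-cidr, the Network value in the kube-flannel-cfg ConfigMap should be changed to match.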
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The output is:
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created
View the pods
root@dell3:~# kubectl get pod
No resources found in default namespace.
root@dell3:~# kubectl get pod -n kube-system
NAME                            READY   STATUS     RESTARTS   AGE
coredns-6955765f44-gccbp        0/1     Pending    0          2m33s
coredns-6955765f44-gl7zg        0/1     Pending    0          2m33s
etcd-dell3                      1/1     Running    0          2m33s
kube-apiserver-dell3            1/1     Running    0          2m33s
kube-controller-manager-dell3   1/1     Running    0          2m33s
kube-flannel-ds-amd64-rrhng     0/1     Init:0/1   0          70s
kube-proxy-srnvg                1/1     Running    0          2m33s
kube-scheduler-dell3            1/1     Running    0          2m33s
If the core k8s components are Running, k8s was installed successfully.
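To spot the pods that are not up yet without reading the whole table, the STATUS column can be filtered. A minimal sketch, assuming the default kubectl get pod column order (NAME, READY, STATUS, ...); pods_not_ready is a hypothetical helper name.

```shell
# Read `kubectl get pod` output on stdin and print the name and status of
# every pod whose STATUS column is neither Running nor Completed.
pods_not_ready() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example use:
#   kubectl get pod -n kube-system | pods_not_ready
```

Applied to the output above, this would list the two coredns pods (Pending) and the flannel pod (Init:0/1), which were still starting when the listing was taken.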
Install CUDA
NVIDIA recommends installing the NVIDIA driver by installing CUDA.
./cuda.run --override
The --override parameter skips the gcc version check, which would otherwise cancel the installation.
Install the NVIDIA driver
Find the right version
Look up the right driver version for your GPU model on the NVIDIA website.
Installation
For convenience, install the driver directly with Additional Drivers, the driver management tool provided by Ubuntu 19.
Signs of success
Open NVIDIA X Server Settings to see the details of the graphics card. If the window opens blank, the NVIDIA driver is not currently installed.
Run nvidia-smi to see more information about the driver.
root@dell3:~# nvidia-smi
Wed Dec 18 15:07:25 2019
+------------------------------------------------------+
| NVIDIA-SMI 340.107 Driver Version: 340.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 610 Off | 0000:01:00.0 N/A | N/A |
|100% 56C P8 N/A / N/A | 129MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GT 610 Off | 0000:06:00.0 N/A | N/A |
|100% 46C P8 N/A / N/A | 3MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
Problems
Login loop
After booting, press Ctrl + Alt + F[1-6] to enter the text console, then uninstall the NVIDIA driver:
apt remove --purge 'nvidia-*'
If the driver was installed from the .run installation package, run instead:
./nvidia-*.run --uninstall
Reboot into the native graphical interface, then turn off the login password in the settings.
Install the NVIDIA driver again.
Cannot enter the graphical interface
Reinstall the NVIDIA driver.
Install the NVIDIA plug-in for k8s
Install nvidia-docker2
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
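The distribution variable in the first command is built by sourcing /etc/os-release and joining its ID and VERSION_ID fields; the result becomes part of the repository URL. The sketch below shows that step in isolation so the value can be checked before it is used. distribution_id is a hypothetical helper name, and the optional file argument exists only to make it easy to try on a sample file.

```shell
# distribution_id [FILE]
# Print $ID$VERSION_ID from an os-release style file (default /etc/os-release).
# The file is sourced in a subshell so its variables do not leak into the caller.
distribution_id() {
    ( . "${1:-/etc/os-release}" && echo "$ID$VERSION_ID" )
}

# Example use:
#   distribution=$(distribution_id)
#   echo "$distribution"
```

For an os-release file containing ID=ubuntu and VERSION_ID="19.10", the function prints ubuntu19.10.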
Test nvidia-docker2
Use nvidia-docker2 to run the cuda image:
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
The first run needs to download the image. If the download is too slow, you can add a registry mirror (accelerator).
The output is:
root@dell:~# docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Wed Dec 18 08:55:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 00000000:01:00.0 On | N/A |
| 33% 30C P8 1W / 38W | 359MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Add the device plugin configuration
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
Modify the runtime
Edit the file /etc/docker/daemon.json and add the default-runtime key.
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart docker
# systemctl daemon-reload
# systemctl restart docker
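A quick way to confirm the edit is to read the key back from the file. This is a rough sketch that uses sed on the text rather than a real JSON parser, so it assumes the key and its value sit on one line as in the file above; default_runtime is a hypothetical helper name.

```shell
# default_runtime [FILE]
# Print the value of the "default-runtime" key in a Docker daemon.json
# (default /etc/docker/daemon.json); prints nothing if the key is absent.
default_runtime() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
        "${1:-/etc/docker/daemon.json}"
}

# Example use:
#   default_runtime                      # reads /etc/docker/daemon.json
#   docker info | grep -i 'default runtime'   # what the running daemon reports
```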
Usage
Check whether k8s has detected the GPU
Run kubectl describe node node_name to see the details of the node:
root@dell:~/mypod# kubectl describe nodes
Name: dell
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
disktype=ssd
kubernetes.io/arch=amd64
kubernetes.io/hostname=dell
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:9a:2d:50:03:4b"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.8.52
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 12 Dec 2019 10:25:16 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 18 Dec 2019 17:17:55 +0800 Thu, 12 Dec 2019 18:00:39 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 18 Dec 2019 17:17:55 +0800 Thu, 12 Dec 2019 18:00:39 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 18 Dec 2019 17:17:55 +0800 Thu, 12 Dec 2019 18:00:39 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 18 Dec 2019 17:17:55 +0800 Mon, 16 Dec 2019 10:30:19 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.8.52
Hostname: dell
Capacity:
cpu: 4
ephemeral-storage: 479152840Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 24568140Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 441587256613
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 24465740Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 833fac65cd12401db017c0b0033439e7
System UUID: 28d52460-d7da-11dd-9d00-40167e218cad
Boot ID: 7a4a6548-28da-447c-845a-fab20ed82181
Kernel Version: 5.3.0-24-generic
OS Image: Ubuntu 19.10
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.2
Kubelet Version: v1.16.3
Kube-Proxy Version: v1.16.3
PodCIDR: 192.168.0.0/24
PodCIDRs: 192.168.0.0/24
Non-terminated Pods: (13 in total)
If the Capacity section contains nvidia.com/gpu: 1, k8s has recognized that this machine has a GPU.
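The node description is long, so the GPU line can be extracted instead of searched for by eye. A minimal sketch, assuming the usual kubectl describe layout in which Capacity: starts at the beginning of a line and its entries are indented beneath it; gpu_capacity is a hypothetical helper name.

```shell
# Read `kubectl describe node` output on stdin and print the nvidia.com/gpu
# value from the Capacity section (prints nothing if the resource is missing).
gpu_capacity() {
    awk '
        /^Capacity:/ { in_cap = 1; next }
        /^[A-Za-z]/  { in_cap = 0 }      # any other top-level section ends it
        in_cap && $1 == "nvidia.com/gpu:" { print $2 }'
}

# Example use:
#   kubectl describe node node_name | gpu_capacity
```

An empty result suggests the device plugin has not registered the GPU on that node.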
Use the GPU
Create a pod that uses the GPU
Create a file gpu-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: tf-pod
spec:
  containers:
    - name: tf-container
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
Then run kubectl apply -f gpu-pod.yaml and use kubectl get pod to view the pod state.
root@dell:~/mypod# kubectl get pod
NAME           READY   STATUS              RESTARTS   AGE
busybox-6t962  0/1     Completed           0          6d2h
cuda3          0/1     Completed           47         2d7h
gpu-cuda       0/1     Completed           0          5d
gpu-pod        0/1     Completed           0          5d1h
gpu-pod23      0/2     Pending             0          2d6h
hello-world    0/1     ContainerCreating   0          6d5h
myjob-k9hx5    0/1     Completed           0          6d3h
myjob2-xmdm8   0/1     Completed           0          6d1h
pi-9cttz       0/1     Completed           0          6d2h
Use kubectl describe pod pod_name to view the details of a pod.
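When trying several images or GPU counts, the gpu-pod.yaml manifest can be produced from a small template instead of being edited by hand each time. This is only a sketch: gpu_pod_manifest is a hypothetical helper, and the container name derived from the pod name is an assumption made for illustration.

```shell
# gpu_pod_manifest NAME IMAGE GPUS
# Print a single-container pod manifest with a limit of GPUS nvidia.com/gpu.
gpu_pod_manifest() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
    - name: $1-container
      image: $2
      resources:
        limits:
          nvidia.com/gpu: $3
EOF
}

# Example use:
#   gpu_pod_manifest tf-pod tensorflow/tensorflow:latest-gpu 1 | kubectl apply -f -
```

A pod that asks for more GPUs than the node's nvidia.com/gpu capacity cannot be scheduled and stays Pending.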
Problem: kubectl commands fail
The error looks like this:
root@dell3:~# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
The connection to the server 192.168.9.103:6443 was refused - did you specify the right host or port?
This mostly happens after Docker has been restarted. The solution:
# swapoff -a
# systemctl daemon-reload
# systemctl restart docker
# systemctl restart kubelet
The most important step is disabling swap.