Notes on scheduling NVIDIA GPUs with K8S on Ubuntu

Scenario requirements

A new GPU server recently arrived, running Ubuntu 20.04. We need to build a K8S cluster on it for some container-related business scenarios. Its CPU configuration is quite high, but the GPU shouldn't go to waste either, so this article simply records the whole process of putting its GPU to use.

nvidia-docker

nvidia-docker is a product from NVIDIA. I'm sure anyone who managed to find this little article hidden away in a corner already understands how it differs from plain Docker. Here is a portal to the related introduction on NVIDIA's official website: >>>Poke here<<< (honestly, I'm too lazy to write it up myself).

nvidia driver

To use GPU resources you must first have a GPU, and once you have one you need to install the corresponding driver before it can be used normally.

Check the GPU model

In Ubuntu, the GPU model can be viewed with the following command:

ubuntu-drivers devices

Here my running results are as follows:
(image: output of ubuntu-drivers devices)
(An RTX 2070 SUPER! I'm tempted to pull it out, take it home, and stick it in my own PC to play PUBG.)
In the penultimate line of the output you can see the word recommended, which indicates the suggested driver: nvidia-driver-470 - distro non-free
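If you want to grab just the recommended package name in a script, you can filter the output. A small sketch using sample output hard-coded as a variable (on a real machine you would pipe ubuntu-drivers devices directly into the same grep/awk):

```shell
# Sample of the relevant lines from `ubuntu-drivers devices`; on a real machine run:
#   ubuntu-drivers devices | grep recommended | awk '{print $3}'
sample='driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-460 - distro non-free'

# The third whitespace-separated field of the "recommended" line is the package name.
echo "$sample" | grep recommended | awk '{print $3}'
# prints: nvidia-driver-470
```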

Download the driver

Next, download the driver from NVIDIA's official website >>>NVIDIA driver official website<<< . After the page loads, fill in the form according to the graphics card model you queried:
(image: NVIDIA driver search form)
Click search to see the specific driver details, then click download:
(image: driver download page)

Install the driver

By default, Ubuntu usually installs the open-source nouveau driver for the GPU (it may not be present; if it isn't, don't worry about it). Now that we have a dedicated driver, we need to blacklist the open-source one:

sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo update-initramfs -u

Alternatively, directly purging any existing driver packages also works:

sudo apt-get remove --purge 'nvidia*'

Restart the machine after finishing this, and then you can install the driver. Substitute your own driver file name below:

chmod 755 NVIDIA-Linux-x86_64-470.74.run
sudo ./NVIDIA-Linux-x86_64-470.74.run

The installation may fail because some dependencies are missing. If that happens, run sudo apt install -f to install the missing dependencies automatically, then retry. Agree to everything the installer prompts for along the way. Once it finishes, running nvidia-smi should list the card.
A problem you may hit during installation: WARNING: Unable to find suitable destination to install 32-bit compatibility libraries
If so, execute the following commands to install the 32-bit compatibility libraries:

sudo dpkg --add-architecture i386
sudo apt update
sudo apt install libc6:i386

Install nvidia-docker

To install nvidia-docker, refer to the tutorial on the official website >>>Installation tutorial<<< ; I'm really too lazy to copy it over here :-)
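One detail worth calling out: for the K8S device plugin installed later to work, Docker's default runtime must be set to nvidia. A sketch of /etc/docker/daemon.json along the lines of the device-plugin README (the runtime path may differ on your system; restart Docker after editing):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```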

Create a K8S cluster

For convenience, I used Rancher to build the K8S cluster. The process was covered in a previous article, so I won't repeat it here and will go straight to the steps after the cluster is up. For the process of building K8S, here is a portal >>>Rancher installation and create K8S cluster<<< (alternate address: >>>CSDN: Rancher installation and create K8S cluster<<<).

Install the Nvidia GPU plugin for K8S

The relevant scheduling method is also documented on the K8S official website >>>K8S<<< . In essence, it comes down to installing one plugin into the K8S cluster:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
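Once the plugin's DaemonSet is running, pods can request GPUs through the nvidia.com/gpu resource. A minimal test pod sketch adapted from the K8S docs (the pod name and image tag are assumptions; any CUDA-enabled image works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                       # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.0-base     # assumed tag; substitute your own
      command: ["nvidia-smi"]          # prints GPU info if scheduling worked
      resources:
        limits:
          nvidia.com/gpu: 1            # request exactly one GPU
```

Apply it with kubectl apply -f and check the pod's logs for the nvidia-smi table.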

Test

Go to Docker Hub and pull a tensorflow-gpu image; here we use opensciencegrid/tensorflow-gpu:latest. I use the Rancher management interface to create a workload with a GPU allocation of 1:
(image: Rancher workload configuration)
After that, we use TensorFlow inside this workload's container to test whether the GPU can actually be scheduled.
First execute python3 to enter the Python interactive terminal, and enter the following code:

import tensorflow as tf
print('GPU', tf.test.is_gpu_available())

If GPU True is displayed, the scheduling is successful.
(image: TensorFlow reporting GPU True)
If any of the steps above were skipped (the driver not installed, the K8S NVIDIA device plugin missing, and so on), scheduling will fail and GPU False will be displayed.
The above is a simple record. If I find time later, I'll fill in the parts of the article I skipped, and that'll be a wrap.


Origin blog.csdn.net/u012751272/article/details/120488513