Scenario requirements
A new GPU server recently arrived, running Ubuntu 20.04. We need to build a K8s cluster on it for some container-related workloads. Its CPU configuration is quite high, but the GPU shouldn't go to waste either, so this article simply records the whole process of putting that GPU to use.
nvidia-docker
nvidia-docker is a product from Nvidia. If you managed to find this little article hidden away in a corner, you probably already understand how it differs from plain Docker. Here is a portal to Nvidia's official introduction: >>>Poke here<<< (honestly, I'm too lazy to rewrite it).
nvidia driver
To use GPU resources, you first need a GPU; and once you have one, you need to install the corresponding driver before it can be used normally.
Check the GPU model
In Ubuntu, the GPU model can be viewed with the following command:
ubuntu-drivers devices
My output looks like this:
(It's an RTX 2070 SUPER!! I'm tempted to take it home and put it in my own machine for gaming.)
In the second-to-last line you can see recommended, which indicates the recommended driver: nvidia-driver-470 - distro non-free
download driver
Download the driver from NVIDIA's official website >>>NVIDIA driver official website<<<. After opening the page, fill in the form according to the graphics card model you found above:
Click search to see the driver details, then click download:
install driver
By default, Ubuntu usually installs an open-source driver (nouveau) for the GPU (it may not be present; if not, don't worry about it). Since we now have the dedicated driver, we need to disable the open-source one:
sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
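On many setups the blacklist only takes effect after regenerating the initramfs and rebooting; this quick check (an optional extra step, not from the original article) confirms nouveau is no longer loaded:

```shell
# Regenerate the initramfs so the blacklist file takes effect,
# then reboot (e.g. `sudo reboot`) before checking.
sudo update-initramfs -u
# After the reboot this prints "nouveau not loaded" if the blacklist worked:
lsmod | grep nouveau || echo "nouveau not loaded"
```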
Alternatively, simply uninstalling it is also an option (note the quotes, so the shell doesn't expand the glob):
sudo apt-get remove --purge 'nvidia*'
Reboot the machine when that's done, and then you can install the driver. The driver file name below depends on the file you downloaded:
chmod 755 NVIDIA-Linux-x86_64-470.74.run
sudo ./NVIDIA-Linux-x86_64-470.74.run
The installation may fail because of missing dependencies. If it does, run sudo apt install -f to install the dependencies automatically, then retry. Agree to everything as you go through the installation process.
Possible problems during installation: WARNING: Unable to find suitable destination to install 32-bit compatibility libraries
Execute the following command to install:
sudo dpkg --add-architecture i386
sudo apt update
sudo apt install libc6:i386
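Once the installer finishes, you can verify that the driver is working with nvidia-smi, which should list the GPU (this verification step is my addition, not from the original article):

```shell
# Prints the driver version, supported CUDA version, and the detected GPUs;
# if this command fails, the driver is not installed correctly.
nvidia-smi
```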
nvidia-docker install
To install nvidia-docker, refer to the tutorial on the official website >>>Installation tutorial<<<; I really am too lazy to duplicate it here :-)
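After following the official steps, a quick sanity check is to run nvidia-smi inside a container. This is a sketch, assuming Docker 19.03+ with the NVIDIA container toolkit installed; the CUDA image tag is only an example, so pick one compatible with your driver:

```shell
# Runs nvidia-smi inside a throwaway container with all GPUs attached.
# The image tag is an example; use one that matches your driver version.
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

If this prints the same GPU table as nvidia-smi on the host, containers can see the GPU.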
Create a K8S cluster
For convenience, I used Rancher to build the K8s cluster. The process was covered in a previous article, so I won't repeat it here and will jump straight to the steps after the cluster is up. For the K8s setup itself, here is a portal: >>>Rancher installation and create K8S cluster<<< (alternate address: >>>CSDN: Rancher installation and create K8S cluster<<<)
Install the Nvidia GPU plugin for K8S
The relevant scheduling method is also documented on the K8s official website >>>K8S<<<. Essentially, it comes down to installing a device plugin into the K8s cluster:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
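Once the plugin's DaemonSet pods are running, each GPU node should advertise the nvidia.com/gpu resource. A quick way to check (an extra step I'm adding here; it assumes kubectl is already configured against the cluster):

```shell
# Each GPU node should report a non-zero nvidia.com/gpu allocatable count.
# The backslash escapes the dots inside the resource name for custom-columns.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```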
test
Go to Docker Hub and pull a tensorflow-gpu image; here I use opensciencegrid/tensorflow-gpu:latest. I created the workload through the Rancher management UI with a GPU allocation of 1. After that, inside this workload's container, we use TensorFlow to test whether the GPU can be scheduled:
First run python3 to enter the Python interactive shell, and enter the following code:
import tensorflow as tf
print('GPU', tf.test.is_gpu_available())
If GPU True is printed, scheduling succeeded.
If any of the above steps is missing (the driver isn't installed, the K8s Nvidia device plugin isn't installed, and so on), scheduling fails and GPU False is printed.
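Note that tf.test.is_gpu_available() is deprecated in newer TensorFlow releases. An equivalent check that works on TF 2.x (a minimal sketch, assuming TensorFlow is installed in the container) is:

```python
import tensorflow as tf

# Lists the physical GPUs TensorFlow can see; an empty list means
# no GPU is usable from inside this container.
gpus = tf.config.list_physical_devices('GPU')
print('GPU', len(gpus) > 0)
```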
The above is just a simple record. If there's time later, I'll come back and fill in the parts of the article I didn't migrate here. Done, scatter the flowers!