Installing the CUDA driver and nvidia-docker on an NVSwitch GPU server

1. Install the CUDA driver

For the basic procedure, refer to the author's earlier article: Guide to upgrading the CUDA driver version of a GPU server.

If the installer fails with the error below, gcc and kernel-devel are missing; install them as described in section 2.
(Screenshot: driver installer error caused by missing gcc / kernel-devel)
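
As an optional pre-check before launching the installer (not part of the original steps), you can confirm on CentOS/RHEL that both prerequisites are present:

# The driver installer needs gcc and the headers for the *running* kernel to build its kernel module.
gcc --version
rpm -q kernel-devel-$(uname -r)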

2. Install gcc and kernel-devel

1. Install gcc and kernel-devel

If you install with the following commands and the default kernel-devel version does not match the running kernel, the driver installer fails with the error shown below:

yum -y install gcc kernel-devel
./NVIDIA-Linux-x86_64-515.86.01.run

(Screenshot: driver installer error caused by a kernel / kernel-devel version mismatch)

2. The reason for the error

Use the following commands to check whether the running kernel and the installed kernel-devel versions match:

uname -r
rpm -q kernel-devel

When everything is normal, the two versions are identical, as in the example below. If they differ, go to the next step to resolve the mismatch.
(Screenshot: uname -r and rpm -q kernel-devel reporting the same version)
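
For reference, matching output on the CentOS 7.9 host used in this guide looks roughly like the following (the version number is illustrative; yours may differ):

uname -r
# -> 3.10.0-1160.el7.x86_64
rpm -q kernel-devel
# -> kernel-devel-3.10.0-1160.el7.x86_64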

3. Solve the problem of inconsistent kernel versions

To uninstall kernel-devel, execute the following command:

yum remove kernel-devel
4. Install the kernel-devel version that matches the kernel
1) View the kernel version number
uname -a

The highlighted part of the output is the kernel version number; on this host it is 3.10.0-1160.el7.x86_64, so the matching package is kernel-devel-3.10.0-1160.el7.x86_64.rpm.
(Screenshot: uname -a output with the kernel version highlighted)

2) Download the matching kernel-devel version

Kernel-devel download address: https://ftp.sjtu.edu.cn/sites/ftp.scientificlinux.org/linux/scientific/7.9/x86_64/os/Packages/
Find the kernel-devel version corresponding to the kernel version number: kernel-devel-3.10.0-1160.el7.x86_64.rpm

# Download kernel-devel with wget in the terminal
wget https://ftp.sjtu.edu.cn/sites/ftp.scientificlinux.org/linux/scientific/7.9/x86_64/os/Packages/kernel-devel-3.10.0-1160.el7.x86_64.rpm
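
If your kernel version differs from the one above, the package name can be derived directly from uname -r. A small sketch (the mirror path is the Scientific Linux 7.9 repository used above and may need to be adjusted for your distribution):

# Build the matching kernel-devel file name from the running kernel and download it.
KVER=$(uname -r)
wget "https://ftp.sjtu.edu.cn/sites/ftp.scientificlinux.org/linux/scientific/7.9/x86_64/os/Packages/kernel-devel-${KVER}.rpm"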
3) Install kernel-devel
rpm -ivh kernel-devel-3.10.0-1160.el7.x86_64.rpm
4) Check that the kernel development package kernel-devel was installed successfully
uname -a ; rpm -qa kernel\* | sort

If the installation succeeded, the output looks like the following:
(Screenshot: uname -a and rpm -qa kernel\* output after a successful installation, all at the same version)

3. Install gcc and g++

yum install gcc
yum install gcc-c++

4. Install the CUDA driver

./NVIDIA-Linux-x86_64-515.86.01.run
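
A typical invocation on a headless server might look like the following; the --silent and --no-opengl-files options of the .run installer are optional additions, not something this guide requires:

# Make the installer executable and run it non-interactively, skipping the OpenGL
# libraries (useful on compute-only servers).
chmod +x NVIDIA-Linux-x86_64-515.86.01.run
sudo ./NVIDIA-Linux-x86_64-515.86.01.run --silent --no-opengl-files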

5. Install the NVIDIA Fabric Manager software package

For A100 GPUs interconnected through NVSwitch, you must additionally install the nvidia-fabricmanager service that matches the driver version so the GPUs can communicate over NVSwitch. With only the NVIDIA GPU driver installed, the GPUs cannot be used normally.

1. Detailed nvidia-fabricmanager installation guide

A detailed guide to installing nvidia-fabricmanager is available here: https://www.volcengine.com/docs/6419/73634

GPU Cloud Server -> User Guide -> Install NVIDIA Driver -> Install NVIDIA-Fabric Manager Software Package

2. nvidia-fabricmanager installation steps
1) Check the system driver version
nvidia-smi

(Screenshot: nvidia-smi output showing the installed driver version)
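
If you only need the version string (which the fabric manager package has to match), the following query prints it without the full table:

# Prints only the driver version, e.g. 470.57.02
nvidia-smi --query-gpu=driver_version --format=csv,noheader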

2) Install from a downloaded package
  • CentOS 8.x
wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel8/x86_64/nvidia-fabric-manager-470.57.02-1.x86_64.rpm
rpm -ivh nvidia-fabric-manager-470.57.02-1.x86_64.rpm
  • CentOS 7.x
wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-470.57.02-1.x86_64.rpm
rpm -ivh nvidia-fabric-manager-470.57.02-1.x86_64.rpm
  • Ubuntu 20.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-470_470.57.02-1_amd64.deb
dpkg -i nvidia-fabricmanager-470_470.57.02-1_amd64.deb
  • Ubuntu 18.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/nvidia-fabricmanager-470_470.57.02-1_amd64.deb
dpkg -i nvidia-fabricmanager-470_470.57.02-1_amd64.deb
  • Debian 10, veLinux 1.0
wget https://developer.download.nvidia.cn/compute/cuda/repos/debian10/x86_64/nvidia-fabricmanager-470_470.57.02-1_amd64.deb
dpkg -i nvidia-fabricmanager-470_470.57.02-1_amd64.deb
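
The 470.57.02 packages above are only examples; substitute your own driver version into the file name. A hedged sketch for CentOS 7 (the exact file name for other driver versions may differ slightly, so check the repository listing if the download fails):

# Download and install the nvidia-fabric-manager rpm that matches the installed driver.
version=470.57.02   # replace with your driver version from nvidia-smi
wget "https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-${version}-1.x86_64.rpm"
rpm -ivh "nvidia-fabric-manager-${version}-1.x86_64.rpm"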
3) Install from the package repository
  • CentOS 8.x
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf module enable -y nvidia-driver:470
dnf install -y nvidia-fabric-manager-0:470.57.02-1
  • CentOS 7.x
yum -y install yum-utils 
yum-config-manager --add-repo https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
yum install -y nvidia-fabric-manager-470.57.02-1
  • Ubuntu 20.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
apt-key add 7fa2af80.pub
rm 7fa2af80.pub
echo "deb http://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 /" | tee /etc/apt/sources.list.d/cuda.list
apt-get update
apt-get -y install nvidia-fabricmanager-470=470.57.02-1
  • Ubuntu 18.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-key add 7fa2af80.pub
rm 7fa2af80.pub
echo "deb http://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 /" | tee /etc/apt/sources.list.d/cuda.list
apt-get update
apt-get -y install nvidia-fabricmanager-470=470.57.02-1
4) Start NVIDIA-Fabric Manager
# 1. Start the Fabric Manager service.
sudo systemctl start nvidia-fabricmanager

# 2. Check whether the Fabric Manager service started correctly; "active (running)" in the output means it started successfully.
sudo systemctl status nvidia-fabricmanager

# 3. Configure the Fabric Manager service to start automatically at boot.
sudo systemctl enable nvidia-fabricmanager
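
As an additional optional check, you can confirm that the GPUs now see each other over NVSwitch:

# With Fabric Manager active, the topology matrix should show NV# (NVLink/NVSwitch)
# links between GPUs rather than only PCIe paths.
nvidia-smi topo -m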

6. Install nvidia-docker

1. Detailed nvidia-docker installation guide

A detailed guide to installing nvidia-docker is available here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker

2. nvidia-docker installation steps
1) Install on Ubuntu and Debian


The following steps can be used to set up NVIDIA Container Toolkit on Ubuntu LTS (18.04, 20.04, and 22.04) and Debian (Stretch, Buster) distributions.

  • Set up Docker-CE
curl https://get.docker.com | sh && sudo systemctl --now enable docker
  • Set up the package repository and GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

To access experimental features and release candidates, you may need to add the experimental branch to your list of repositories:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  • Update the package list, then install the nvidia-docker2 package (and its dependencies):
sudo apt-get update
sudo apt-get install -y nvidia-docker2
  • After setting the default runtime (see the daemon.json sketch after this list), restart the Docker daemon to complete the installation:
sudo systemctl restart docker
  • At this point, a working setup can be tested by running a basic CUDA container:
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
  • This should produce console output that looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
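
The steps above mention setting the default runtime but do not show the configuration. Below is a sketch of /etc/docker/daemon.json with the NVIDIA runtime as the default; nvidia-docker2 normally writes an equivalent file during installation, so only adjust it if the runtime entry is missing:

# Register the NVIDIA runtime and make it Docker's default, then restart Docker.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker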
2) Install on CentOS 7/8

The following steps can be used to set up NVIDIA Container Toolkit on CentOS 7/8.

  • Set up the official Docker CE repository:

    • CentOS 8.x
    sudo dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
    
    • CentOS 7.x
    sudo yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
    
  • Now you can observe the packages available in the docker-ce repository:

    • CentOS 8.x
    sudo dnf repolist -v
    
    • CentOS 7.x
    sudo yum repolist -v
    


Since CentOS does not provide the specific containerd.io version required by newer Docker-CE releases, one option is to install the containerd.io package manually and then install the docker-ce package.

  • Install the containerd.io package:

    • CentOS 8.x
    sudo dnf install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm
    
    • CentOS 7.x
    sudo yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm
    
  • Now install the latest docker-ce package:

    • CentOS 8.x
    sudo dnf install docker-ce -y
    
    • CentOS 7.x
    sudo yum install docker-ce -y
    
  • Make sure the Docker service is running with the following command:

    sudo systemctl --now enable docker
    
  • Finally, test your Docker installation by running the hello-world container:

    sudo docker run --rm hello-world
    
  • This should produce console output that looks like this:

    Unable to find image 'hello-world:latest' locally
    latest: Pulling from library/hello-world
    0e03bdcc26d7: Pull complete
    Digest: sha256:7f0a9f93b4aa3022c3a4c147a449bf11e0941a1fd0bf4a8e6c9408b2600777c5
    Status: Downloaded newer image for hello-world:latest
    
    Hello from Docker!
    This message shows that your installation appears to be working correctly.
    
    To generate this message, Docker took the following steps:
    1. The Docker client contacted the Docker daemon.
    2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
       (amd64)
    3. The Docker daemon created a new container from that image which runs the
       executable that produces the output you are currently reading.
    4. The Docker daemon streamed that output to the Docker client, which sent it
       to your terminal.
    
    To try something more ambitious, you can run an Ubuntu container with:
    docker run -it ubuntu bash
    
    Share images, automate workflows, and more with a free Docker ID:
    https://hub.docker.com/
    
    For more examples and ideas, visit:
    https://docs.docker.com/get-started/
    
  • Set up the repository and GPG key:

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
       && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    
  • To access experimental features and release candidates, you may need to add the experimental branch to your list of repositories:

    yum-config-manager --enable libnvidia-container-experimental
    
  • Refresh the repository cache, then install the nvidia-docker2 package (and its dependencies):

    • CentOS 8.x
    sudo dnf clean expire-cache --refresh
    
    • CentOS 7.x
    sudo yum clean expire-cache
    
    • CentOS 8.x
    sudo dnf install -y nvidia-docker2
    
    • CentOS 7.x
    sudo yum install -y nvidia-docker2
    
  • After setting the default runtime (the same daemon.json configuration shown in the Ubuntu and Debian section applies), restart the Docker daemon to complete the installation:

    sudo systemctl restart docker
    
  • At this point, a working setup can be tested by running a basic CUDA container:

    sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
    
  • This should produce console output that looks like this:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
