NVIDIA graphics card driver update, NVIDIA Driver, CUDA Toolkit, cuDNN installation guide

This guide covers only the graphics card driver setup used for deep learning.
NVIDIA driver 515.105 and CUDA 11.7 are used as the examples throughout.

1. Uninstall the graphics card driver

CentOS / RHEL

Method 1: Find the .run file of the old graphics card driver and run it with --uninstall:

sh NVIDIA-Linux-x86_64-418.126.02.run --uninstall

Method 2: Clear all nvidia related files and dependencies

yum remove "nvidia-*"

Further cleaning (clean up all nvidia-driver related components):

rpm -qa|grep -i nvid|sort
yum remove "kmod-nvidia-*"

Remove CUDA

yum remove "*nvidia*"
yum remove "*cublas*" "cuda*"

After uninstalling the driver, reboot

sudo reboot

Ubuntu LTS

Note that the commands differ between distributions, because each uses a different package manager.
apt-get is the package management tool of Ubuntu and Debian.
yum is the package management tool of Red Hat and CentOS.
Before choosing a removal command, first determine which system you are on.
On Ubuntu, use sudo apt-get purge nvidia-* instead of yum remove nvidia-*.

sudo apt-get purge "nvidia-*"
sudo apt-get --purge remove cuda
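
The choice between the two command families can be sketched in Python. This is only an illustration, assuming that the presence of apt-get or yum on PATH is a good enough signal of the distribution family; the function name removal_commands is hypothetical, and the commands it returns are the ones listed in this section.

```python
import shutil


def removal_commands(which=shutil.which):
    """Return the removal commands that match the detected package manager.

    `which` is injectable so the logic can be exercised without touching
    the real system; by default it looks executables up on PATH.
    """
    if which("apt-get"):  # Ubuntu / Debian
        return [
            'sudo apt-get purge "nvidia-*"',
            "sudo apt-get --purge remove cuda",
        ]
    if which("yum"):  # Red Hat / CentOS
        return [
            'yum remove "nvidia-*"',
            'yum remove "*cublas*" "cuda*"',
        ]
    raise RuntimeError("neither apt-get nor yum was found on PATH")
```

Calling removal_commands() only returns the command strings; review them before running anything, since they remove packages.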

2. Installation of graphics card driver

Basic knowledge

Reference: What exactly are graphics cards, graphics driver, nvcc, cuda driver, cudatoolkit, and cudnn? - Zhihu
Some background knowledge:
CUDA Driver : The CUDA driver is a software component used to communicate with the GPU. It is responsible for managing the hardware resources of the GPU and executing CUDA applications.
CUDA Toolkit : The CUDA Toolkit is a software package for developing and optimizing CUDA applications. It includes the nvcc compiler and the CUDA runtime libraries, and its .run installer can optionally install a bundled driver as well.
CUDA runtime library : The CUDA runtime library is a software component used to execute CUDA applications on the GPU. It provides a set of CUDA API functions to manage GPU memory and execute CUDA kernels.

Preconditions

Verify that gcc, g++, tar, and make are installed on the system. If they are not, configure a yum source and install them.
Commands for checking the graphics card and the installed versions:

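# View your graphics card information
lspci | grep -i nvidia

# GPU driver version and driver API (the highest CUDA version supported)
nvidia-smi

# Monitor the graphics card status continuously
watch -t -n 1 nvidia-smi

# CUDA version, runtime API
nvcc -V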
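
For scripted checks, the same information can be read in machine-friendly form. The sketch below relies on the --query-gpu and --format=csv,noheader options of nvidia-smi; the helper names and the sample GPU name in the usage note are illustrative assumptions, not output from the system described here.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        rows.append({"name": name, "driver": driver, "memory": memory})
    return rows


def query_gpus():
    """Run nvidia-smi and parse the result.

    Raises FileNotFoundError if nvidia-smi is not installed and
    CalledProcessError if it exits with an error.
    """
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

For example, parse_gpu_rows("Tesla T4, 515.105.01, 15360 MiB") yields one dict with the name, driver and memory fields separated.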

Query and select matching versions of the graphics card driver, CUDA Toolkit and cuDNN:
For the version correspondence between the NVIDIA driver and CUDA, see the NVIDIA CUDA Toolkit Release Notes.
For the version correspondence between PyTorch and CUDA, see Previous PyTorch Versions.

This article selects NVIDIA-driver-515.105 plus cuda-11.7.

Install NVIDIA graphics card driver (NVIDIA Driver)

Download the NVIDIA driver. If the server is online, download it directly with wget; otherwise copy the download address, download the file on another machine and copy it to the server.
Grant execute permission and install:

chmod +x NVIDIA-Linux-x86_64-515.105.01.run
./NVIDIA-Linux-x86_64-515.105.01.run -no-x-check

Questions may appear during the installation; choose No to continue.
If a warning appears, you can ignore it and continue until the installation is complete.

> Install NVIDIA's 32-bit compatibility libraries?
>                   Yes             [No]


If there is a problem, check whether the old driver was fully uninstalled, or see error 1 in section 5: ./NVIDIA-Linux-x86_64-515.105.01.run -no-x-check

Test whether the graphics card driver is installed successfully

nvidia-smi

Install CUDA

Download the installer from the CUDA Toolkit download address, or look up an older release at the CUDA Toolkit old version download address.
Grant execute permission and install:

chmod +x cuda_11.7.1_515.65.01_linux.run
./cuda_11.7.1_515.65.01_linux.run

During the installation you will be asked whether to install the bundled driver (Driver). Since the driver is already installed, do not install it again: deselect the Driver entry.

After installation, the following will appear:

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11.7/

Please make sure that
 -   PATH includes /usr/local/cuda-11.7/bin  
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root  
  
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-11.7/bin  
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.7/doc/pdf for detailed information on setting up CUDA.  
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 515.105 is required for CUDA 11.7 functionality to work.  
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:  
    sudo <CudaInstaller>.run -silent -driver  
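
The warning in the summary names a minimum driver version. Comparing driver versions as plain strings is unreliable ("515.65.01" sorts after "515.105" as text), so a numeric comparison is safer. The following is a minimal sketch; the function names are hypothetical, and the version strings are the ones that appear in this article.

```python
def version_tuple(version):
    """Convert a dotted version such as '515.105.01' into (515, 105, 1)."""
    return tuple(int(part) for part in version.split("."))


def driver_is_sufficient(installed, required):
    """Return True when the installed driver is at least the required version."""
    return version_tuple(installed) >= version_tuple(required)
```

With the values used here, driver_is_sufficient("515.105.01", "515.105") is true, while a 515.65.01 driver would not satisfy a 515.105 requirement.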

Configure the environment variables by adding the following content to the ~/.bashrc file.
Open the file:

vim ~/.bashrc

Add the following two lines at the end of the file, replacing 11.7 with the CUDA version you installed (for example, cuda-12.2).

export PATH=/usr/local/cuda-11.7/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
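
The ${VAR:+:${VAR}} form in these exports appends a colon and the old value only when the variable is already set and non-empty. That avoids a trailing colon, which would leave an empty entry in the search path. The sketch below mirrors that rule in Python purely as an illustration; prepend_path is a hypothetical helper, not part of any CUDA tooling.

```python
def prepend_path(new_dir, current):
    """Mimic NEW${VAR:+:${VAR}}: add ':' + current only if current is non-empty."""
    return new_dir + (":" + current if current else "")
```

So prepend_path("/usr/local/cuda-11.7/lib64", "") returns just the CUDA directory, and with an existing value the CUDA directory is placed in front of it.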

Use the following command to reload the ~/.bashrc configuration file so that the changes take effect.

source ~/.bashrc

Query the nvcc version to check whether the installation succeeded:

nvcc -V
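
If the check needs to be scripted, the release number can be pulled out of the nvcc -V output. This is a rough sketch that assumes the output contains a phrase of the form "release X.Y"; the sample line in the usage note is illustrative rather than copied from the machine described here.

```python
import re


def parse_nvcc_release(text):
    """Extract (major, minor) from a 'release X.Y' phrase, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For a line such as "Cuda compilation tools, release 11.7, V11.7.99" the function returns (11, 7); for unrelated text it returns None.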

Install cuDNN

Download cuDNN from the cuDNN download address, then install the package:

rpm -i cudnn-local-repo-rhel7-8.9.2.26-1.0-1.x86_64.rpm

This rpm only registers a local cuDNN repository. The cuDNN libraries themselves are then installed from it with yum, as described in NVIDIA's cuDNN installation guide.

3. Docker graphics card adaptation

Software version:
Docker: Docker version 20.10.9, build c2ea9bc
CUDA: NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7
System: CentOS-7

Since version 19.03, Docker no longer needs a separately installed nvidia-docker to support graphics cards; Docker itself and the CUDA environment are enough. GPU access from containers is now provided by installing the NVIDIA Container Toolkit.
The setup therefore consists of two parts, Docker-CE and the NVIDIA Container Toolkit; the standalone nvidia-docker package is no longer required.

NVIDIA-Container-Toolkit Architecture

For an overview of the architecture, see NVIDIA's official website (Chrome's built-in page translator can help with reading it). This article only introduces it briefly.
The main components of the NVIDIA Container stack are nvidia-container-runtime, nvidia-container-toolkit and libnvidia-container. The CUDA driver must be installed before they are installed.
After version 3.6.0, the nvidia-container-runtime package is a meta-package that only depends on the nvidia-container-toolkit package (the container toolkit, not the NVIDIA CUDA Toolkit). The official website also points out that for general applications, nvidia-container-toolkit meets most needs.

Install package dependencies

The official website document dependency diagram is as follows.

├─ nvidia-container-toolkit (version)
│    ├─ libnvidia-container-tools (>= version)
│    └─ nvidia-container-toolkit-base (version)
│
├─ libnvidia-container-tools (version)
│    └─ libnvidia-container1 (>= version)
└─ libnvidia-container1 (version)

nvidia-container-toolkit-base is now included in nvidia-container-toolkit and no longer requires nvidia-container-runtime to be installed. (Previous nvidia-docker required the installation of two more packages, nvidia-container-runtime and nvidia-docker2.)

According to the above dependencies, install the three software packages in the order of

libnvidia-container1 -> libnvidia-container-tools -> nvidia-container-toolkit
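
This order is simply a topological sort of the dependency diagram. As an illustration, the sketch below derives it with the standard-library graphlib module (Python 3.9 or later); the DEPENDS_ON mapping is a hand-written simplification of the diagram that covers only the three packages installed here.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Each package maps to the set of packages it depends on.
DEPENDS_ON = {
    "nvidia-container-toolkit": {"libnvidia-container-tools"},
    "libnvidia-container-tools": {"libnvidia-container1"},
    "libnvidia-container1": set(),
}


def install_order(depends_on):
    """Return the packages with every dependency listed before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())
```

install_order(DEPENDS_ON) lists libnvidia-container1 first and nvidia-container-toolkit last, matching the order given in the text.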

Offline download and installation

Download the installation packages here for offline installation.
The official website provides GitHub links:
1. nvidia-container-toolkit installation package download address
Find the installation package that corresponds to your system version.
For example, for the CentOS 7 system used here, download nvidia-container-toolkit-1.5.1-2.x86_64.rpm from the nvidia-container-runtime/stable/centos7/x86_64/ directory.

2. libnvidia-container1 and libnvidia-container-tools installation package download address
Find the installation packages that correspond to your system version.
Similarly, for CentOS 7, download the libnvidia-container1 and libnvidia-container-tools rpm files from the nvidia-container-runtime/stable/centos7/x86_64/ directory (click a file and use the Download raw file button in the upper right corner; if nothing happens, check that your network can reach GitHub).

After downloading all three packages, copy them to the server and install them from the folder that contains them.
For rpm packages, install every rpm in the folder:

rpm -ivh *.rpm

deb package installation method

dpkg -i *.deb

After installation, restart the docker service.

systemctl restart docker
systemctl status docker

Success! You can now start a container with --gpus all to test whether the container can use the GPU.

4. Test

Use docker run --gpus all to start a container, then test inside the container whether the GPU works.
Checking from Python is the most common way to see whether the GPU is available (although available does not mean it is actually being used):

import torch
torch.cuda.is_available()

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')
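
As a complement to the PyTorch check, the container itself can be tested from the host by running nvidia-smi inside it. The sketch below only assembles and runs a docker run --gpus command; the helper names are hypothetical, and the image argument is a placeholder for whatever CUDA-enabled image you use.

```python
import subprocess


def gpu_test_command(image, gpus="all"):
    """Build a docker command that runs nvidia-smi in a throwaway container."""
    return ["docker", "run", "--rm", "--gpus", gpus, image, "nvidia-smi"]


def container_sees_gpu(image):
    """Return True if nvidia-smi exits successfully inside the container.

    Raises FileNotFoundError when the docker executable is not installed.
    """
    result = subprocess.run(
        gpu_test_command(image), capture_output=True, text=True
    )
    return result.returncode == 0
```

A non-zero exit status here usually corresponds to one of the Docker errors listed in section 5.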

5. Summary of errors

1.ERROR: You appear to be running an X server; please exit X before installing. For further details, …

This error occurs when installing the NVIDIA driver. It is mainly caused by a display manager such as lightdm (often installed for remote desktop access), which starts the X server.
Solution:

sudo chmod +x NVIDIA-Linux-x86_64-515.105.01.run
sudo ./NVIDIA-Linux-x86_64-515.105.01.run -no-x-check

Appending -no-x-check to the command skips the X server check, and the installation then succeeds.

Other parameters:
--no-opengl-files: Install only the driver files, not the OpenGL files. Omitting it can cause an infinite loop at the login screen, usually called a "login loop" or "stuck in login".
--no-x-check: Indicates that the X service is not checked when installing the driver, not required.
--no-nouveau-check: Indicates that nouveau is not checked when installing the driver, not required.
-Z, --disable-nouveau: Disable nouveau. This parameter is not required because nouveau has been manually disabled previously.
-A: See more advanced options.

Method 2: Switch the run level to text mode, as described in "upgrade nvidia driver" (EchoZQN, Blog Park).

2.Error response from daemon: could not select device driver “” with capabilities: [[gpu]]

After Nvidia Docker is installed, an error occurs when using the image to create a container. The error message is:

Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Solution: install the NVIDIA Container Toolkit.
If the NVIDIA driver is installed on the server and the GPU works on the host, but Docker cannot use it, check whether the NVIDIA Container Toolkit is installed. The NVIDIA Container Toolkit lets users build and run GPU-accelerated Docker containers (from Docker 19.03 on; version 18 and earlier used the nvidia-docker command), so the GPU can only be used inside Docker once it is installed.
Find the installation command for your system on the official website:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
The official documentation is detailed; just follow it step by step.
Section 3 above also covers the installation in detail.

3.file /usr/lib64/libnvidia-container.so.1 from install of libnvidia-container1-1.13.5-1.x86_64 conflicts with file from package libnvidia-container1-1.0.0-0.1.beta.1.x86_64

This error means that the file /usr/lib64/libnvidia-container.so.1 in the already installed package libnvidia-container1-1.0.0-0.1.beta.1.x86_64 conflicts with the file of the same name in the package being installed, libnvidia-container1-1.13.5-1.x86_64.
It occurs when a new version of a package is installed while an older package that owns the same files is still present. To resolve the conflict, uninstall the older version of the package first.
Remove the old package with the following command, then install the new one again:

sudo yum remove libnvidia-container1-1.0.0-0.1.beta.1.x86_64

4.ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution’s docum…

This problem is caused by the system currently using the Nouveau graphics driver, and the NVIDIA driver is incompatible with the Nouveau driver. To resolve this issue, the Nouveau driver needs to be disabled.

Method 1: Disable Nouveau driver through blacklisting
1) Add two lines to /usr/lib/modprobe.d/dist-blacklist.conf:

blacklist nouveau
options nouveau modeset=0

2) Make a backup of the current image

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak

3) Create a new image

dracut /boot/initramfs-$(uname -r).img $(uname -r)

4) Restart

sudo init 6
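
Before rebuilding the image it can be worth confirming that both lines really are in the configuration file. The sketch below checks a file's text for them; the function name is hypothetical, and the only assumption about the file format is that lines starting with # are comments. After the reboot, lsmod | grep nouveau should print nothing if the driver is disabled.

```python
REQUIRED_LINES = ("blacklist nouveau", "options nouveau modeset=0")


def missing_blacklist_lines(conf_text):
    """Return the required lines that are absent from a modprobe config text.

    Whitespace is normalized and commented-out lines are ignored.
    """
    present = set()
    for line in conf_text.splitlines():
        normalized = " ".join(line.split())
        if normalized and not normalized.startswith("#"):
            present.add(normalized)
    return [line for line in REQUIRED_LINES if line not in present]
```

An empty list means both entries are in place; otherwise the returned lines still need to be added.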

Method 2: Add a parameter
Add the --no-opengl-files parameter:

./NVIDIA-Linux-x86_64-515.105.01.run --no-opengl-files

5.docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver rpc error: timed out: unknown.

A strange error that is reported when starting a container with --gpus all, at the point where the graphics card is loaded. According to an answer that others obtained from NVIDIA, the cause is that Persistence Mode is not turned on for the graphics card. Enter the following command to solve the problem:

nvidia-smi -pm ENABLED

If a prompt appears after pressing Enter in the above command, just follow the prompts to install. After the installation is complete, execute the above command again and it will be fine.
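
Whether the setting took effect can be checked per GPU with the persistence_mode query field of nvidia-smi, which reports Enabled or Disabled. The following sketch parses that output; the helper name is hypothetical and the sample text in the usage note is illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,persistence_mode",
    "--format=csv,noheader",
]


def gpus_without_persistence(csv_text):
    """Return the indices of GPUs whose persistence mode is not 'Enabled'."""
    disabled = []
    for line in csv_text.strip().splitlines():
        index, mode = [field.strip() for field in line.split(",")]
        if mode != "Enabled":
            disabled.append(int(index))
    return disabled


def check_persistence():
    """Run the query on the host; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return gpus_without_persistence(result.stdout)
```

Given the text "0, Enabled" and "1, Disabled" on two lines, the parser reports GPU 1 as still needing persistence mode.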

Reference content

NVIDIA driver download address.cn
NVIDIA driver download address.com
NVIDIA driver and CUDA Toolkit compatible versions and minimum version requirements
CUDA Toolkit download address
CUDA Toolkit old version download address
CUDA Toolkit 11.7.1 download address
cuDNN download address

CentOS.7 Uninstall and install Nvidia Driver_centos Uninstall nvidia driver_Aaron_Qin Feng's blog-CSDN blog
Linux Centos7 installation and update GPU driver and cuda:_linux upgrade cuda version_Big data lsy's blog-CSDN blog
upgrade nvidia driver - EchoZQN - Blog Park
How to downgrade the cuda version - Python technology station
openpose environment to build ubuntu16.04+nvidia396.37+cuda9.2+cudnn7.1.4_tudou880306's blog - CSDN blog

How to check version information in linux - linux operation and maintenance - PHP Chinese website
cuda, cudnn, cudatoolkit all versions download URL_cudnn download_QT-Smile's blog - CSDN blog
python check graphics card information python check gpu
Docker offline installation Nvidia-container-toolkit Implementing GPU calls within the container_NekoTom's blog-CSDN blog

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error reporting_--gpus all Error reporting_Da Meow's Blog who wants to lie down every day - CSDN Blog
Error: you appear to be running an x server; please exit x before installing. for further details_error: you appear to be running an
Difficulties encountered during the process_Software applications_What is worth buying?

Origin blog.csdn.net/aiaidexiaji/article/details/131973342