1. Confirm the GPU model and operating system version. In this example, the GPU is an NVIDIA A100 and the operating system is CentOS 7.9.
Prepare the GPU driver and CUDA 11.2 packages, downloading both from NVIDIA's official website (select Linux 64-bit).
The CUDA download page, https://developer.nvidia.com/cuda-downloads, offers the latest CUDA Toolkit for all Linux systems. If you need an older version of CUDA, download it from the CUDA Toolkit archive instead. This example uses CUDA 11.2.
Select the runfile installer type.
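To confirm the operating system version from the shell, `cat /etc/redhat-release` and `uname -r` are the usual starting points. The sketch below, assuming the CentOS-style release line (for example `CentOS Linux release 7.9.2009 (Core)`), extracts the major.minor version so a script can check it; the function name `os_release_version` is hypothetical.

```shell
#!/usr/bin/env bash
# os_release_version (hypothetical helper): reads a /etc/redhat-release style
# line on stdin, e.g. "CentOS Linux release 7.9.2009 (Core)", and prints the
# major.minor version ("7.9"). Prints nothing if no version is found.
os_release_version() {
  sed -nE 's/.*release ([0-9]+)\.([0-9]+).*/\1.\2/p' | head -n 1
}
```

Usage: `os_release_version < /etc/redhat-release` should print `7.9` on the system used in this example.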
2. Check that the server recognizes the GPUs
Before installing the GPU driver, check that the operating system recognizes every GPU card. If a card is not recognized, inspect the hardware, for example by reseating the card or swapping it into another slot.
List all NVIDIA devices:
lspci | grep -i nvidia
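`lspci | grep -i nvidia` can also list non-GPU NVIDIA functions (bridges, audio), so a raw line count may overstate the number of GPUs. A minimal sketch, assuming GPUs appear with the PCI class "3D controller" or "VGA compatible controller" (A100 cards typically show as "3D controller"); both function names and the expected count are hypothetical:

```shell
#!/usr/bin/env bash
# count_nvidia_gpus (hypothetical helper): reads `lspci` output on stdin and
# counts NVIDIA entries whose class is a display controller (VGA or 3D).
# NVIDIA bridge or audio functions are deliberately not counted.
count_nvidia_gpus() {
  grep -i nvidia | grep -Eic 'vga compatible controller|3d controller' || true
}

# verify_gpu_count EXPECTED (hypothetical helper): reads `lspci` output on
# stdin; returns 0 if the GPU count equals EXPECTED, 1 otherwise.
verify_gpu_count() {
  local expected="$1" found
  found="$(count_nvidia_gpus)"
  if [ "$found" -eq "$expected" ]; then
    echo "OK: $found NVIDIA GPU(s) detected"
  else
    echo "WARNING: expected $expected NVIDIA GPU(s), found $found; check the hardware" >&2
    return 1
  fi
}
```

Usage on an assumed 8-GPU server: `lspci | verify_gpu_count 8`.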
3. Uninstall old software packages (optional)
Uninstall the GPU driver:
/usr/bin/nvidia-uninstall
Uninstall CUDA:
/usr/local/cuda/bin/cuda-uninstaller
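The two uninstallers only exist if a previous driver or CUDA runfile install is present. A minimal sketch that runs each one only if found, assuming the default install locations above; `ROOT` (a path prefix) and `DRY_RUN` are hypothetical knobs added so it can be tried safely:

```shell
#!/usr/bin/env bash
# remove_old_nvidia_stack (hypothetical helper): runs the driver and CUDA
# uninstallers only if they exist and are executable. ROOT is a path prefix
# (default: empty, i.e. /); DRY_RUN=1 prints instead of executing.
remove_old_nvidia_stack() {
  local root="${ROOT:-}" tool
  for tool in /usr/bin/nvidia-uninstall /usr/local/cuda/bin/cuda-uninstaller; do
    if [ -x "${root}${tool}" ]; then
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: ${root}${tool}"
      else
        "${root}${tool}"
      fi
    else
      echo "not found, skipping: ${root}${tool}"
    fi
  done
}
```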
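4. Install the gcc and g++ compilers
The NVIDIA driver installer uses gcc, make, and the kernel headers (kernel-devel) to build its kernel module. g++ is needed only to build the CUDA samples with make; installing the CUDA package itself does not require it.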
yum -y install gcc gcc-c++ kernel-devel make
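One thing worth checking before the driver install is that an installed kernel-devel package matches the running kernel, since the driver's kernel module is built against those headers. A minimal sketch, assuming an RPM-based system such as CentOS 7; the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# kernel_devel_matches RUNNING (hypothetical helper): reads installed
# kernel-devel versions on stdin, one per line, as printed by
#   rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel
# and succeeds if one of them equals RUNNING (normally `uname -r`).
kernel_devel_matches() {
  local running="$1"
  # -x: whole-line match, -F: fixed string, -q: quiet
  grep -qxF -- "$running"
}
```

Usage: `rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel | kernel_devel_matches "$(uname -r)" || yum -y install "kernel-devel-$(uname -r)"`.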
5. Disable the nouveau module that ships with the system
Check whether the nouveau module is loaded; if it is, disable it first:
lsmod | grep nouveau
6. If blacklist-nouveau.conf does not exist, create it with the following two lines:
vim /usr/lib/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
Regenerate the initramfs so the change is picked up at boot (nouveau is only actually disabled after the server restarts):
dracut --force
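The blacklist step can be made idempotent so that rerunning it does not duplicate lines. A minimal sketch, assuming the file path used above; `CONF` and `RUN_DRACUT` are hypothetical overrides for trying it outside a real server:

```shell
#!/usr/bin/env bash
# write_nouveau_blacklist (hypothetical helper): ensures both blacklist lines
# are present exactly once, creating the file if needed, then regenerates
# the initramfs with `dracut --force` unless RUN_DRACUT=0.
write_nouveau_blacklist() {
  local conf="${CONF:-/usr/lib/modprobe.d/blacklist-nouveau.conf}" line
  for line in 'blacklist nouveau' 'options nouveau modeset=0'; do
    # Append the line only if no identical whole line already exists.
    grep -qxF -- "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
  done
  if [ "${RUN_DRACUT:-1}" = "1" ]; then
    dracut --force
  fi
}
```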
7. Restart the operating system
reboot
8. After the restart, confirm that the nouveau module is no longer loaded (the command below should print nothing) and that the system is running in text mode.
lsmod | grep nouveau
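Both post-restart checks can be scripted. A minimal sketch, assuming `lsmod` output is piped in and the result of `systemctl get-default` is passed as an argument; the function names are hypothetical. Matching the first column exactly avoids false hits from other module names that merely contain "nouveau":

```shell
#!/usr/bin/env bash
# nouveau_loaded (hypothetical helper): reads `lsmod` output on stdin and
# succeeds if a module named exactly "nouveau" appears in the first column.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# check_pre_driver_state TARGET (hypothetical helper): TARGET is the output of
# `systemctl get-default`; lsmod output is read on stdin. Returns 1 if nouveau
# is still loaded or the default target is not multi-user.target.
check_pre_driver_state() {
  local target="$1" status=0
  if nouveau_loaded; then
    echo "nouveau is still loaded; check the blacklist file and initramfs" >&2
    status=1
  fi
  if [ "$target" != "multi-user.target" ]; then
    echo "default target is $target, not multi-user.target" >&2
    status=1
  fi
  return "$status"
}
```

Usage: `lsmod | check_pre_driver_state "$(systemctl get-default)"`.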
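9. Set the default run level to text mode, because the GPU driver must be installed in text mode. This setting takes effect at the next boot, so ideally run it before the restart above; otherwise, restart again before installing the driver.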
systemctl set-default multi-user.target
GPU driver installation
Install the GPU driver as the root user:
chmod +x NVIDIA-Linux-x86_64-450.80.02.run
./NVIDIA-Linux-x86_64-450.80.02.run --no-opengl-files --ui=none --no-questions --accept-license
Enable persistence mode by starting the NVIDIA persistence daemon:
nvidia-persistenced
Start it automatically at boot:
vim /etc/rc.d/rc.local
Add the following line to the file:
nvidia-persistenced
Make /etc/rc.d/rc.local executable:
chmod +x /etc/rc.d/rc.local
If /etc/rc.d/rc.local does not exist, edit /etc/rc.local instead:
vim /etc/rc.local
chmod +x /etc/rc.local
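Editing rc.local by hand can leave duplicate entries if the step is repeated. A minimal sketch that appends the line only once and applies the same fallback to /etc/rc.local; `RC_LOCAL` is a hypothetical override for testing:

```shell
#!/usr/bin/env bash
# enable_persistenced_at_boot (hypothetical helper): appends
# "nvidia-persistenced" to rc.local only if it is not already present,
# then makes the file executable.
enable_persistenced_at_boot() {
  local rc="${RC_LOCAL:-/etc/rc.d/rc.local}"
  # Fall back to /etc/rc.local when /etc/rc.d/rc.local does not exist.
  if [ -z "${RC_LOCAL:-}" ] && [ ! -e "$rc" ]; then
    rc=/etc/rc.local
  fi
  grep -qxF 'nvidia-persistenced' "$rc" 2>/dev/null || echo 'nvidia-persistenced' >> "$rc"
  chmod +x "$rc"
}
```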
After installing the GPU driver, check the GPU status and configuration:
nvidia-smi
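Beyond eyeballing `nvidia-smi`, its CSV query mode can be checked by a script. A minimal sketch, assuming the output of `nvidia-smi --query-gpu=index,name,driver_version,persistence_mode --format=csv,noheader` is piped in and the expected driver version (450.80.02, taken from the runfile name above) is passed as an argument; the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# check_smi_report WANT (hypothetical helper): reads nvidia-smi CSV rows
# "index, name, driver_version, persistence_mode" on stdin and fails if any
# GPU reports a driver other than WANT, has persistence mode off, or if no
# GPU rows are present at all.
check_smi_report() {
  local want="$1"
  awk -F', ' -v want="$want" '
    { n++ }
    $3 != want      { printf "GPU %s: driver %s, expected %s\n", $1, $3, want; bad = 1 }
    $4 != "Enabled" { printf "GPU %s: persistence mode is %s\n", $1, $4; bad = 1 }
    END { if (n == 0) { print "no GPUs reported"; bad = 1 } exit bad }
  '
}
```

Usage: `nvidia-smi --query-gpu=index,name,driver_version,persistence_mode --format=csv,noheader | check_smi_report 450.80.02`.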
CUDA installation
Take care when installing CUDA: if you have already installed the GPU driver, do not let the CUDA installer install its bundled driver. The commands below use the CUDA 11.1.1 runfile; substitute the file name of the version you downloaded.
chmod +x cuda_11.1.1_455.32.00_linux.run
sh cuda_11.1.1_455.32.00_linux.run --no-opengl-libs
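If you prefer a non-interactive install, the CUDA runfile also accepts `--silent` together with `--toolkit`, which installs only the toolkit and does not touch the already installed driver. A minimal sketch under that assumption, with a hypothetical `DRY_RUN` switch so the command can be inspected before it runs:

```shell
#!/usr/bin/env bash
# install_cuda_toolkit_only RUNFILE (hypothetical helper): installs only the
# CUDA toolkit from a runfile, without the bundled GPU driver.
# DRY_RUN=1 prints the command instead of running it.
install_cuda_toolkit_only() {
  local runfile="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sh $runfile --silent --toolkit"
  else
    sh "$runfile" --silent --toolkit
  fi
}
```

Usage: `DRY_RUN=1 install_cuda_toolkit_only cuda_11.1.1_455.32.00_linux.run` to preview, then run it again without `DRY_RUN` as root.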
Newer CUDA installers show a text menu. Watch the Driver entry, which controls whether the bundled GPU driver is installed; if the GPU driver is already installed, deselect it here.
Configure environment variables
Add the following lines to /etc/profile so they apply to all users, then reload it:
vim /etc/profile
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
source /etc/profile
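To confirm that the variables were actually picked up in the current shell, a colon-delimited membership test avoids false matches such as `/usr/local/cuda/bin2`. A minimal sketch; the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# path_has_dir LIST DIR (hypothetical helper): succeeds if DIR is one of the
# colon-separated entries of LIST (e.g. "$PATH" or "$LD_LIBRARY_PATH").
path_has_dir() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `path_has_dir "$PATH" /usr/local/cuda/bin && path_has_dir "${LD_LIBRARY_PATH:-}" /usr/local/cuda/lib64 && echo "CUDA paths are set"`.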
Verify that CUDA is installed correctly and that the environment variables are recognized:
nvcc -V
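To compare the installed toolkit against the version you intended, the release number can be pulled from the `Cuda compilation tools, release X.Y, ...` line that `nvcc -V` prints. A minimal sketch; the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# nvcc_release (hypothetical helper): reads `nvcc -V` output on stdin and
# prints the X.Y release from the "Cuda compilation tools, release X.Y, ..."
# line. Prints nothing if that line is missing.
nvcc_release() {
  sed -nE 's/.*release ([0-9]+\.[0-9]+),.*/\1/p' | head -n 1
}
```

Usage: `nvcc -V | nvcc_release`; with the 11.1.1 runfile used above this should print `11.1`.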