Guide to upgrading the CUDA driver version on a GPU server

First, download the corresponding driver version

1. Select the corresponding driver version

Driver download page: https://www.nvidia.in/Download/index.aspx?lang=en

This article takes the A30 (NCCL), A100 (NV-Link), and A100 (NV-Switch) as examples:
(screenshot: driver selection page for the example GPUs)

2. Obtain the download link of the selected driver
1) Confirm whether the following version information is correct

(screenshot: selected driver version details)
If it is correct, click Download.

2) Copy the download link

(screenshot: download link on the download page)

3. Download the driver on the server
1) Download the driver

Run in a terminal: wget [copied link]

e.g.:

wget https://cn.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
2) Make the file executable

chmod +x [driver file]

e.g.:

chmod +x NVIDIA-Linux-x86_64-515.65.01.run

Second, stop all applications and containers that are using GPU memory

1. Stop the container
# List the PIDs that are using GPU memory
nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv | grep '[0-9]' | sed 's/[[:space:]]//g' | sed 's/MiB//g'

# Look up the container name for a given <PID> from its cgroup
docker inspect -f '{{ .Name }}' $(ps -e -o pid,comm,cgroup | grep -v "/docker/" | grep <PID> | awk '{print $3}' | awk -F "[/.-]" '{print $5}') | sed 's/\///g'

# Stop all running containers
docker ps | awk '{print $1}' | grep -v CONTAINER | xargs docker stop
2. Check which processes are using the NVIDIA devices
sudo lsof -n -w /dev/nvidia*

Once you have the PIDs, you can end those processes with the kill command.
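For example, assuming <PID> is one of the PIDs reported by lsof above (replace it with the real value):

kill <PID>
# If the process does not exit, force-kill it
kill -9 <PID>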

3. Confirm that no NVIDIA processes are still running
ps -aux | grep nvidia
4. Check whether any k8s (kubelet) service is still running
# Check the kubelet status with the following command
systemctl status kubelet

# If it is running, stop it with the following command
systemctl stop kubelet

Third, perform the driver upgrade

./NVIDIA-Linux-x86_64-515.65.01.run 

Note:

  • Answer YES to all prompts (a non-interactive alternative is sketched after this list)
  • You can open another terminal to follow the installer log: tail -f /var/log/nvidia-installer.log
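
If you prefer an unattended upgrade instead of answering the prompts, the .run installer also has a silent mode; the exact flags can differ between driver versions, so confirm them with --help first. A minimal sketch:

# List the options supported by this installer version
./NVIDIA-Linux-x86_64-515.65.01.run --help
# Unattended install (assumes this installer version supports --silent)
./NVIDIA-Linux-x86_64-515.65.01.run --silent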

Fourth, enable persistence mode on all GPUs

nvidia-smi -pm 1
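
To confirm that persistence mode is now enabled, you can query it back from nvidia-smi, for example:

nvidia-smi -q | grep -i "Persistence Mode"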

Fifth, upgrade the fabric manager (not needed for the NCCL case; perform this step for NV-Link and NV-Switch machines)

1. Check the currently installed fabric manager package
rpm -qa | grep fabric
2. Uninstall the fabric manager
1) Run the uninstall command with the package name found above

e.g.:

yum remove nvidia-fabricmanager-465-465.19.01-1.x86_64
2) Verify that the uninstallation succeeded

Run the following command; if it returns nothing, the uninstallation succeeded.

rpm -qa | grep fabric
3. Modify the gpgcheck parameter
cd /etc/yum.repos.d/ && vim cuda-rhel7.repo

Modify gpgcheck as follows:

# gpgcheck=1
gpgcheck=0
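
If you prefer not to edit the file by hand, the same change can be made with sed (assuming the repo file currently contains gpgcheck=1):

sed -i 's/^gpgcheck=1/gpgcheck=0/' /etc/yum.repos.d/cuda-rhel7.repo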
4. yum upgrade
yum update -y 
5. Install the new version of the fabric manager
yum install -y cuda-drivers-fabricmanager nvidia-fabric-manager
6. Check whether the installation is successful
rpm -qa | grep fabric
7. Start the fabric manager
nv-fabricmanager
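
On systems where the package installs a systemd unit (typically nvidia-fabricmanager.service; check with systemctl list-unit-files | grep fabric), you can also start it and enable it at boot via systemd:

systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager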
8. Verify that the fabric manager is running
ps -ef | grep fabric

Sixth, restart and verify

1. Restart
reboot
2. Verification
1) Check that the GPU status is normal
nvidia-smi
2) Run a single-node, multi-GPU model to verify that training works normally (a quick sanity check is sketched below)
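
As a quick check before launching a full training job, you can inspect the GPU interconnect topology and confirm that your framework sees all the cards (the Python one-liner assumes PyTorch is installed in the environment):

# Show the GPU interconnect (NVLink / NVSwitch) topology matrix
nvidia-smi topo -m
# Confirm the framework sees all GPUs (assumes a Python environment with PyTorch)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"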

Source: blog.csdn.net/TFATS/article/details/126423823