First, download the corresponding driver version
1. Select the corresponding driver version
Driver download page: https://www.nvidia.in/Download/index.aspx?lang=en
This article uses the A30 (NCCL), A100 (NV-Link), and A100 (NV-Switch) as examples.
2. Obtain the download link of the selected driver
1) Confirm that the version information shown on the download page is correct
If it is correct, click Download
2) Copy the download link
3. Download the driver on the server
1) Download the driver
Run in a terminal: wget [copied link]
e.g.:
wget https://cn.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
2) Give the file execution permission
chmod +x [driver file]
e.g.:
chmod +x NVIDIA-Linux-x86_64-515.65.01.run
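The download and permission steps can be chained in a small script. This is a minimal sketch using the example link above; the helper name `driver_file_from_url` is hypothetical, and the actual `wget` and `chmod` calls are left as comments so the snippet can be tried anywhere.

```shell
#!/bin/sh
# Hypothetical helper: derive the local file name from the copied download link.
driver_file_from_url() {
    # Strip everything up to and including the last "/".
    printf '%s\n' "${1##*/}"
}

url="https://cn.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run"
file=$(driver_file_from_url "$url")
echo "$file"

# On the server, the two steps above would then become:
#   wget "$url" && chmod +x "$file"
```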
Second, stop all applications and containers that are using GPU memory
1. Stop the container
# nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv | grep '[0-9]' | sed 's/[[:space:]]//g' | sed 's/MiB//g'
# docker inspect -f '{{ .Name }}' $(ps -e -o pid,comm,cgroup | grep -v "/docker/" | grep <PID> | awk '{print $3}' | awk -F "[/.-]" '{print $5}') | sed 's/\///g'
docker ps | awk '{print $1}' | grep -v CONTAINER | xargs docker stop
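To decide which processes to look up before stopping containers, the PIDs can be pulled out of the nvidia-smi CSV output. The sketch below assumes the CSV layout `gpu_uuid, pid, used memory` with one header line; the helper name `gpu_pids_from_csv` and the sample data are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the CSV printed by
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv
# on stdin and print each PID once, in numeric order.
gpu_pids_from_csv() {
    awk -F',' 'NR > 1 { gsub(/[ \t]/, "", $2); if ($2 != "") print $2 }' | sort -un
}

# Illustrative sample, not captured from a real machine.
sample='gpu_uuid, pid, used_gpu_memory [MiB]
GPU-aaaa, 4321, 1024 MiB
GPU-bbbb, 4321, 2048 MiB
GPU-bbbb, 987, 512 MiB'

printf '%s\n' "$sample" | gpu_pids_from_csv
```

Each printed PID can then be substituted for <PID> in the docker inspect command.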
2. View the applications that are occupying the NVIDIA devices
sudo lsof -n -w /dev/nvidia*
After viewing the PID, you can use the kill command to end the process
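The PID column can also be extracted from the lsof listing automatically. This is a rough sketch that assumes the usual lsof layout (one header line, PID in the second column); the helper name `pids_from_lsof` and the sample listing are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `lsof` output on stdin and print each PID once.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -un
}

# Illustrative sample, not captured from a real machine.
sample='COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
python    4321 root  mem    CHR 195,255           473 /dev/nvidiactl
python    4321 root  mem    CHR   195,0           474 /dev/nvidia0
trainer    812 root    3u   CHR 195,255      0t0  473 /dev/nvidiactl'

printf '%s\n' "$sample" | pids_from_lsof

# On the server, review the list first, then end the processes, for example:
#   sudo lsof -n -w /dev/nvidia* | pids_from_lsof | xargs -r sudo kill
```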
3. Confirm that no NVIDIA-related process is still running
ps -aux | grep nvidia
4. Check whether Kubernetes (kubelet) is occupying the GPU
# Check with the following command
systemctl status kubelet
# If kubelet is running, stop it with the following command
systemctl stop kubelet
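The check-then-stop logic for kubelet can be wrapped in one function. A minimal sketch, assuming a systemd host; the function name `stop_if_active` is hypothetical, and the real call is left as a comment.

```shell
#!/bin/sh
# Stop a systemd unit only if it is currently active.
stop_if_active() {
    unit=$1
    if systemctl is-active --quiet "$unit"; then
        systemctl stop "$unit"
        echo "stopped $unit"
    else
        echo "$unit is not active"
    fi
}

# On the GPU server this would be called as:
#   stop_if_active kubelet
```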
Third, perform the driver upgrade
./NVIDIA-Linux-x86_64-515.65.01.run
Note:
- Select YES for all options
- You can open another terminal to follow the NVIDIA installer log:
tail -f /var/log/nvidia-installer.log
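After the installer finishes, the same log can be scanned for problems instead of being read by eye. This sketch assumes that the installer marks failures with the literal text `ERROR:`, which should be checked against your own log; the helper name `installer_errors` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) the lines marked as errors.
# Assumption: error lines contain the literal string "ERROR:".
installer_errors() {
    grep -n 'ERROR:' "$1"
}

log=/var/log/nvidia-installer.log
if [ -r "$log" ]; then
    if installer_errors "$log"; then
        echo "errors found in $log"
    else
        echo "no ERROR lines in $log"
    fi
else
    echo "cannot read $log"
fi
```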
Fourth, enable persistence mode on all GPUs
nvidia-smi -pm 1
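To confirm that the setting took effect on every card, the persistence state can be queried and checked. A minimal sketch, assuming the query prints one line per GPU in the form `0, Enabled`; the helper name `all_persistent` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read lines such as "0, Enabled" on stdin, as printed by
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# Succeeds only if at least one GPU is listed and every GPU reports "Enabled".
all_persistent() {
    awk -F', *' '$2 != "Enabled" { bad = 1 } END { exit ((NR == 0 || bad) ? 1 : 0) }'
}

# Illustrative input for a two-GPU machine.
if printf '0, Enabled\n1, Enabled\n' | all_persistent; then
    echo "persistence mode is on for all GPUs"
fi

# On the GPU server:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | all_persistent
```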
Fifth, upgrade fabric (the NCCL setup does not need this step; perform it for NV-Link and NV-Switch)
1. View the name of the installed fabric package
rpm -qa | grep fabric
2. Uninstall fabric
1) Run the uninstall command with the package name returned by the query
e.g.:
yum remove nvidia-fabricmanager-465-465.19.01-1.x86_64
2) Check whether the uninstallation is successful
Run the following command; if it prints nothing, the uninstallation succeeded
rpm -qa | grep fabric
3. Disable the GPG check in the CUDA repo file
cd /etc/yum.repos.d/ && vim cuda-rhel7.repo
Modify gpgcheck as follows:
# gpgcheck=1
gpgcheck=0
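The same change can be made without opening an editor. This is a sketch, not the author's method: it rewrites the line in place instead of commenting out the old one, and it keeps a `.bak` copy so the original setting can be restored later. The helper name `disable_gpgcheck` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: switch gpgcheck=1 to gpgcheck=0 in a repo file,
# keeping a backup copy with the suffix .bak next to it.
disable_gpgcheck() {
    sed -i.bak 's/^gpgcheck=1$/gpgcheck=0/' "$1"
}

# On the server this would be:
#   disable_gpgcheck /etc/yum.repos.d/cuda-rhel7.repo
```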
4. Update the packages with yum
yum update -y
5. Install the new version of fabric
yum install -y cuda-drivers-fabricmanager nvidia-fabric-manager
6. Check whether the installation is successful
rpm -qa | grep fabric
7. Start fabric
nv-fabricmanager
8. Check whether the startup is successful
ps -ef | grep fabric
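A plain `grep fabric` also matches its own entry in the process list, which can look like a running fabric process. One common workaround is to bracket the first letter of the pattern. The sketch below demonstrates this on a hypothetical sample listing; the helper name `fabric_processes` is also hypothetical.

```shell
#!/bin/sh
# The pattern '[f]abric' matches the text "fabric", but the grep process itself
# appears in the listing as "grep [f]abric", which the pattern does not match.
fabric_processes() {
    grep '[f]abric'
}

# Illustrative sample of `ps -ef` output, not captured from a real machine.
sample='root      1234     1  0 10:00 ?        00:00:01 nv-fabricmanager
root      5678  4321  0 10:05 pts/0    00:00:00 grep [f]abric'

printf '%s\n' "$sample" | fabric_processes

# On the server:
#   ps -ef | fabric_processes
```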
Sixth, restart and verify
1. Restart
reboot
2. Verification
1) Check whether the GPU status is normal
nvidia-smi
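Beyond reading the nvidia-smi table, the installed driver version can be compared with the version in the installer file name. A minimal sketch, assuming the file name pattern from the example above; the helper name `version_from_runfile` is hypothetical, and the comparison against the live driver is left as a comment.

```shell
#!/bin/sh
# Hypothetical helper: extract the version from a driver file name such as
# NVIDIA-Linux-x86_64-515.65.01.run.
version_from_runfile() {
    name=${1##*/}                 # drop any directory part
    name=${name%.run}             # drop the .run suffix
    printf '%s\n' "${name##*-}"   # keep the text after the last "-"
}

expected=$(version_from_runfile NVIDIA-Linux-x86_64-515.65.01.run)
echo "expected driver version: $expected"

# On the GPU server, the installed version can be compared like this:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   [ "$installed" = "$expected" ] && echo "driver version matches"
```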