Article Source | Hengyuan Cloud Community
Original Address | [Tips - Graphics Card]
1. How to check the graphics card usage?
Run the nvidia-smi command in the terminal to view the card's status, power consumption, memory usage, and other information.
root@I15b96311d0280127d:~# nvidia-smi
Mon Jan 11 13:42:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 On | 00000000:02:00.0 Off | N/A |
| 63% 55C P2 298W / 370W | 23997MiB / 24268MiB | 62% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Because instances run as Docker containers, nvidia-smi cannot list the processes: container PID isolation hides them.
Run the py3smi command in the terminal instead to see whether any process is using the graphics card.
root@I15b96311d0280127d:~# py3smi
Mon Jan 11 13:43:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI Driver Version: 460.27.04 |
+---------------------------------+---------------------+---------------------+
| GPU Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
+=================================+=====================+=====================+
| 0 63% 55C 2 284W / 370W | 23997MiB / 24268MiB | 80% Default |
+---------------------------------+---------------------+---------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU Owner PID Uptime Process Name Usage |
+=============================================================================+
| 0 ??? 10494 23995MiB |
+-----------------------------------------------------------------------------+
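For machine-readable monitoring, nvidia-smi also has a CSV query mode (`nvidia-smi --query-gpu=... --format=csv,noheader,nounits`) that is easier to parse than the table above. Below is a minimal parsing sketch; the sample string is hard-coded from the output shown above so it runs without a GPU.

```python
import csv
import io

# Sample output from:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#       --format=csv,noheader,nounits
# (hard-coded here so the sketch runs without a GPU)
sample = "0, 62, 23997, 24268\n"

def parse_gpu_stats(text):
    """Parse nvidia-smi CSV query output into a list of per-GPU dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        index, util, used, total = (int(x) for x in fields)
        rows.append({
            "index": index,
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
            "mem_pct": round(100 * used / total, 1),
        })
    return rows

stats = parse_gpu_stats(sample)
print(stats[0])  # memory is ~98.9% full
```

In a real script you would feed it the output of `subprocess.run(["nvidia-smi", ...])` instead of the sample string.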
2. Why won't GPU utilization go up during training?
If you check utilization during training and find that both the card's core utilization and its power draw are low, the GPU is not being fully used.
This usually means that in each training step only part of the time is spent on the GPU, while most of it is consumed on the CPU (data loading, preprocessing, etc.), causing GPU utilization to fluctuate periodically.
Solving the utilization problem requires improving the code; you can refer to Xi Xiaoyao's article "Low training efficiency? GPU utilization won't go up?".
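A common remedy is to overlap CPU-side data preparation with GPU compute, e.g. by prefetching batches in a background thread. The sketch below illustrates the pattern with the standard library only; `load_batch` and `train_step` are hypothetical stand-ins for real preprocessing and GPU work (with PyTorch you would instead raise `num_workers` and set `pin_memory=True` on the `DataLoader`).

```python
import queue
import threading

def load_batch(i):
    # Hypothetical CPU-heavy preprocessing (decode, augment, ...)
    return [x * x for x in range(i, i + 4)]

def train_step(batch):
    # Hypothetical GPU work; returns a scalar "loss"
    return sum(batch)

def prefetching_loader(num_batches, capacity=2):
    """Yield batches prepared by a background thread, so the CPU work
    for batch i+1 overlaps the GPU work on batch i."""
    q = queue.Queue(maxsize=capacity)
    sentinel = object()

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

losses = [train_step(b) for b in prefetching_loader(3)]
print(losses)
```

The bounded queue keeps a couple of batches ready at all times without preprocessing the whole dataset ahead of time.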
3. What are the versions of CUDA and CUDNN?
The CUDA Version reported by nvidia-smi is the highest version the current driver supports; it does not represent the version installed in the instance.
The actual installed version is determined by the official image selected when the instance was created.
# Check the CUDA version
root@I15b96311d0280127d:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
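To read the installed toolkit version programmatically, you can parse the `release` field from `nvcc -V`. A minimal sketch, using the output line shown above as sample input:

```python
import re

# Sample line from `nvcc -V` output (hard-coded so the sketch runs anywhere)
nvcc_output = "Cuda compilation tools, release 11.2, V11.2.152"

def cuda_release(text):
    """Extract the CUDA release (e.g. '11.2') from nvcc -V output."""
    m = re.search(r"release (\d+\.\d+)", text)
    return m.group(1) if m else None

print(cuda_release(nvcc_output))  # → 11.2
```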
# Check the cuDNN version
root@I15b96311d0280127d:~# dpkg -l | grep libcudnn | awk '{print $2}'
libcudnn8
libcudnn8-dev
# Check the cuDNN location
root@I15b96311d0280127d:~# dpkg -L libcudnn8 | grep so
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
...
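The cuDNN version can also be read from the shared-library filename listed above. A small sketch, using the path shown as sample input:

```python
import re

so_path = "/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1"

def cudnn_version(path):
    """Extract the cuDNN (major, minor, patch) tuple from a libcudnn.so.X.Y.Z path."""
    m = re.search(r"libcudnn\.so\.(\d+)\.(\d+)\.(\d+)", path)
    return tuple(int(x) for x in m.groups()) if m else None

print(cudnn_version(so_path))  # → (8, 1, 1)
```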
4. Training hangs when starting on an RTX 30-series graphics card?
Check whether the CUDA version your framework was built against is lower than 11.0.
RTX 3000-series cards require CUDA 11 or above; running with an older version causes the process to hang.
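A quick guard you can drop into a training script: compare the runtime's CUDA version against the minimum the card needs before training starts. The helper below is a hypothetical sketch; with PyTorch the version string would come from `torch.version.cuda`.

```python
def cuda_at_least(version_str, minimum=(11, 0)):
    """Return True if a 'major.minor' CUDA version string meets the minimum."""
    parts = tuple(int(p) for p in version_str.split(".")[:2])
    return parts >= minimum

# RTX 30-series (Ampere) needs CUDA >= 11.0
print(cuda_at_least("10.2"))  # → False
print(cuda_at_least("11.2"))  # → True
```

Failing fast with a clear error message is far easier to debug than a silently hung process.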