Common nvidia-smi commands:
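nvidia-smi dmon -s xxx
(device monitor) Prints per-device statistics, one line per sample; -s selects the metric groups to show. Valid values:
p: power usage and temperature (pwr: power draw, temp: temperature)
u: GPU utilization (sm: streaming multiprocessors, mem: GPU memory, enc: encoder, dec: decoder)
c: GPU processor and GPU memory clock frequencies (mclk: memory clock, pclk: processor clock)
v: power and thermal violations
m: FB memory and BAR1 memory
e: ECC errors and PCIe replay errors
t: PCIe read/write (Rx/Tx) throughput
For example, nvidia-smi dmon -s t -i 1 shows the PCIe read/write bandwidth of GPU 1.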
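As a sketch, the -s letters above can be validated before launching dmon. The helper name `dmon_command` is hypothetical; it only builds the argument list (it does not run nvidia-smi) and uses just the flags that appear in this note: -i (GPU index), -s (metric groups), -d (interval in seconds) and -o TD (date/time prefix).

```python
import shlex

# Metric groups accepted by `nvidia-smi dmon -s` (the list above).
DMON_GROUPS = {
    "p": "power usage and temperature",
    "u": "utilization: sm, mem, enc, dec",
    "c": "processor and memory clocks",
    "v": "power and thermal violations",
    "m": "FB and BAR1 memory",
    "e": "ECC errors and PCIe replay errors",
    "t": "PCIe Rx/Tx throughput",
}


def dmon_command(gpu: int, groups: str, interval_s: int = 1,
                 date_time: bool = True) -> list:
    """Build (but do not run) an argv list for `nvidia-smi dmon`."""
    if not groups:
        raise ValueError("at least one metric group is required")
    unknown = set(groups) - set(DMON_GROUPS)
    if unknown:
        raise ValueError(f"unknown dmon group(s): {''.join(sorted(unknown))}")
    argv = ["nvidia-smi", "dmon", "-i", str(gpu), "-s", groups,
            "-d", str(interval_s)]
    if date_time:
        argv += ["-o", "TD"]  # prefix every sample with date and time
    return argv


if __name__ == "__main__":
    print(shlex.join(dmon_command(1, "t")))
    print(shlex.join(dmon_command(0, "mutc")))
```

The second call reproduces the `nvidia-smi dmon -i 0 -s mutc -d 1 -o TD` command used below; to actually start monitoring, pass the list to `subprocess.run`.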
nvidia-smi dmon -i 0 -s mutc -d 1 -o TD
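Shows memory (m), utilization (u), PCIe throughput (t) and clock (c) statistics for GPU 0, sampled every second (-d 1), with the date and time prefixed to each sample (-o TD). (The sample output below shows Idx 1, i.e. it was captured on GPU 1.)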
#Date      Time      gpu     fb  bar1   sm  mem  enc  dec  rxpci  txpci  mclk  pclk
#YYYYMMDD  HH:MM:SS  Idx     MB    MB    %    %    %    %   MB/s   MB/s   MHz   MHz
 20221215  15:43:54    1   3217    13   10   15  100   30     15     14  6250  1455
 20221215  15:43:55    1   3217    13    9   14   88   32     45     12  6250  1507
 20221215  15:43:56    1   3217    13    9   13   80   30     23      9  6250  1260
 20221215  15:43:57    1   3217    13    9   14   95   31     33     22  6250  1372
 20221215  15:43:58    1   3217    13   10   15  100   30     44     25  6250  1440
 20221215  15:43:59    1   3217    13   10   15  100   28     14     12  6250  1530
 20221215  15:44:00    1   3217    13   10   15  100   30     39     15  6250  1297
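To compare runs like the ones in this note, a saved dmon log can be averaged per column. A minimal Python sketch, assuming the output was redirected to a file (e.g. `nvidia-smi dmon -i 1 -s mutc -d 1 -o TD > dmon.log`); `parse_dmon` and `column_mean` are hypothetical helper names, and SAMPLE holds the first two rows shown above:

```python
from statistics import mean


def parse_dmon(text: str) -> list:
    """Turn `nvidia-smi dmon` output into a list of {column: value} dicts.

    The first '#' line supplies the column names; later '#' lines (the units
    row and any header lines dmon repeats) are skipped.
    """
    columns = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            raise ValueError("data row found before the header line")
        rows.append(dict(zip(columns, line.split())))
    return rows


def column_mean(rows: list, name: str) -> float:
    """Average one numeric column, ignoring '-' (value not available)."""
    return mean(float(r[name]) for r in rows if r.get(name, "-") != "-")


SAMPLE = """\
#Date Time gpu fb bar1 sm mem enc dec rxpci txpci mclk pclk
#YYYYMMDD HH:MM:SS Idx MB MB % % % % MB/s MB/s MHz MHz
20221215 15:43:54 1 3217 13 10 15 100 30 15 14 6250 1455
20221215 15:43:55 1 3217 13 9 14 88 32 45 12 6250 1507
"""

if __name__ == "__main__":
    rows = parse_dmon(SAMPLE)
    for col in ("enc", "dec", "rxpci", "txpci"):
        print(col, column_mean(rows, col))
```

Keying each row by the header names rather than by position means the same code works for any -s combination, since dmon only prints the columns of the selected groups.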
nvidia-smi pmon -i 1
Shows the per-process usage (sm, mem, enc, dec) of the programs currently running on GPU 1 (type C = compute process, G = graphics process):
# gpu      pid  type   sm  mem  enc  dec  command
# Idx        #   C/G    %    %    %    %  name
    1  3524892     C    8   13   95   29  Pangu
    1  3524892     C    8   14   94   29  Pangu
    1  3524892     C    8   14   95   30  Pangu
    1  3524892     C    9   14   96   30  Pangu
    1  3524892     C    8   14   96   30  Pangu
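lspci -vv | grep xxx -C50
(xxx is the device name, e.g. A16; run lspci as root, otherwise the capability lines are hidden), or: nvidia-smi -q -i 1
Either one shows the PCI information of the device, including the GPU's PCIe link generation (speed) and width. Sample nvidia-smi -q -i 1 output: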
Timestamp : Thu Dec 15 15:51:49 2022
Driver Version : 515.43.04
CUDA Version : 11.7
Attached GPUs : 4
GPU 00000000:47:00.0
    Product Name : NVIDIA A16
    Product Brand : NVIDIA
    Product Architecture : Ampere
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : xxxxxxxxxxxx
    GPU UUID : xxxxxxxxxxxx
    Minor Number : 1
    VBIOS Version : 94.07.54.00.01
    MultiGPU Board : Yes
    Board ID : xxxxxxxxxxxxxxx
    GPU Part Number : xxxxxxxxxxxxxxx
    Module ID : 0
    Inforom Version
        Image Version : G171.0200.00.04
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : 515.43.04
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x47
        Device : 0x00
        Domain : 0x0000
        Device Id : xxxxxxxxxxx
        Bus Id : 00000000:47:00.0
        Sub System Id : xxxxxxxxxxx
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 1
            Link Width
                Max : 16x
                Current : 4x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 2000 KB/s
        Rx Throughput : 6000 KB/s
    Fan Speed : 0 %
    Performance State : P8
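In this output the link runs at Gen1 (of max Gen4) and x4 (of max x16) while the GPU idles in P8. The same four values can be read without scrolling through -q by using the query fields `pcie.link.gen.current`, `pcie.link.gen.max`, `pcie.link.width.current` and `pcie.link.width.max` (listed by `nvidia-smi --help-query-gpu`). A Python sketch; the helper names are hypothetical, and `query_link` needs the NVIDIA driver installed:

```python
import subprocess

# Query fields listed by `nvidia-smi --help-query-gpu`.
FIELDS = (
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
)


def parse_link_csv(line: str) -> dict:
    """Parse one `--format=csv,noheader` line such as '1, 4, 4, 16'.

    Non-numeric values (e.g. '[N/A]') raise ValueError.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {line!r}")
    return dict(zip(FIELDS, (int(v) for v in values)))


def query_link(gpu: int) -> dict:
    """Ask nvidia-smi for the link state of one GPU (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_link_csv(out.strip().splitlines()[0])


def describe(link: dict) -> str:
    """Format as 'current/max' for generation and width."""
    return (f"Gen{link['pcie.link.gen.current']}/Gen{link['pcie.link.gen.max']}, "
            f"x{link['pcie.link.width.current']}/x{link['pcie.link.width.max']}")
```

Assuming nvidia-smi returns the line `1, 4, 4, 16` for the GPU above, `describe(parse_link_csv(...))` yields "Gen1/Gen4, x4/x16". Sampling it while a workload runs shows whether the generation climbs as expected.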
Example: analyzing a performance problem
12 channels, cuda->nv12->nv12->cuda: codec performance and PCIe read/write throughput
#Date      Time      gpu     fb  bar1   sm  mem  enc  dec  rxpci  txpci  mclk  pclk
#YYYYMMDD  HH:MM:SS  Idx     MB    MB    %    %    %    %   MB/s   MB/s   MHz   MHz
 20221215  15:33:04    1   1149     7   22    6   31   11    566    635  6250  1755
 20221215  15:33:05    1   1149     7   23    7   46   13    670    672  6250  1755
 20221215  15:33:06    1   1149     7   22    6   35   11    619    738  6250  1755
 20221215  15:33:07    1   1149     7   19    5   31   10    565    548  6250  1755
 20221215  15:33:08    1   1149     7   20    6   36   11    485    641  6250  1755
 20221215  15:33:09    1   1149     7   18    6   36   11    466    555  6250  1755
 20221215  15:33:10    1   1149     7   20    5   31   10    481    595  6250  1755
 20221215  15:33:12    1   1149     7   21    7   43   12    512    518  6250  1755
 20221215  15:33:13    1   1149     7   18    6   32   10    564    593  6250  1755
 20221215  15:33:14    1   1149     7   18    6   35   10    383    605  6250  1755
 20221215  15:33:15    1   1149     7   21    6   39   11    497    601  6250  1755
 20221215  15:33:16    1   1149     7   19    6   35   11    488    565  6250  1755
 20221215  15:33:17    1   1149     7   20    6   36   11    504    539  6250  1755
 20221215  15:33:18    1   1149     7   20    6   37   11    486    655  6250  1755
 20221215  15:33:19    1   1149     7   19    6   36   10    643    703  6250  1755
 20221215  15:33:20    1   1149     7   19    6   34   11    408    609  6250  1755
 20221215  15:33:21    1   1149     7   21    6   36   11    356    580  6250  1755
 20221215  15:33:22    1   1149     7   23    6   41   11    513    582  6250  1755
 20221215  15:33:23    1   1149     7   21    6   36   11    691    654  6250  1755
30 channels, cuda->cuda: codec performance
#Date      Time      gpu     fb  bar1   sm  mem  enc  dec  rxpci  txpci  mclk  pclk
#YYYYMMDD  HH:MM:SS  Idx     MB    MB    %    %    %    %   MB/s   MB/s   MHz   MHz
 20221215  15:43:07    1   3203    13    9   14  100   30     47      8  6250  1260
 20221215  15:43:08    1   3203    13    9   15  100   30     28      9  6250  1500
 20221215  15:43:09    1   3203    13    9   15  100   30     27     13  6250  1567
 20221215  15:43:10    1   3203    13    9   15  100   30     37     15  6250  1552
 20221215  15:43:11    1   3203    13   10   15  100   30     19     20  6250  1710
 20221215  15:43:13    1   3203    13   10   15  100   30     15      4  6250  1747
 20221215  15:43:14    1   3205    13   10   15  100   30     62     12  6250  1102
 20221215  15:43:15    1   3207    13   10   15  100   30     22     14  6250  1432
 20221215  15:43:16    1   3209    13    8   13   88   31     39      6  6250  1590
 20221215  15:43:17    1   3209    13    8   13   82   31     21     13  6250  1485
 20221215  15:43:18    1   3215    13    8   13   87   31     42     16  6250  1372
 20221215  15:43:19    1   3215    13    9   13   83   31     43     23  6250  1447
 20221215  15:43:20    1   3215    13   10   15  100   30     16     12  6250  1590
 20221215  15:43:21    1   3215    13   10   15  100   30     47      6  6250  1470
With 12 channels, none of GPU memory, SM utilization or codec utilization is near its peak: enc stays at roughly 30-46% and dec at 10-13%, whereas 30 cuda->cuda channels keep enc at or near 100%. The transfer rate, suspected earlier, is only about 500-600 MB/s on average in each direction (rxpci/txpci). Next, check the card's PCIe link settings in the lspci -vv output:
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
                ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
                TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
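A downgraded link can also be spotted by script. A rough Python sketch (hypothetical helper names) that pulls Speed and Width from the LnkCap and LnkSta lines of `lspci -vv` text and flags any drop; if the grep context covers several devices, only the first LnkCap/LnkSta pair is kept:

```python
import re

# Speed and Width on "LnkCap:" / "LnkSta:" lines (LnkCap2/LnkSta2 do not match).
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def link_status(lspci_text: str) -> dict:
    """Return {'LnkCap': (GT/s, lanes), 'LnkSta': (GT/s, lanes)} from `lspci -vv` text."""
    found = {}
    for name, speed, width in LINK_RE.findall(lspci_text):
        # keep the first occurrence, i.e. the first device in the text
        found.setdefault(name, (float(speed), int(width)))
    return found


def is_degraded(status: dict) -> bool:
    """True if the negotiated speed or width is below what the device supports."""
    cap_speed, cap_width = status["LnkCap"]
    sta_speed, sta_width = status["LnkSta"]
    return sta_speed < cap_speed or sta_width < cap_width
```

On the lines above it reports (16.0, 16) for LnkCap and (16.0, 4) for LnkSta and flags the link as degraded. Run it while the workload is active: at idle the speed also drops to 2.5 GT/s, which this check would flag too.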
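LnkCap shows what the card supports and LnkSta the link actually negotiated. While a program is running, the speed rises from 2.5 GT/s (PCIe 1.0) to 16 GT/s (PCIe 4.0), but the width is downgraded from x16 to x4. Even so, according to the PCIe bandwidth table a Gen4 x4 link still offers about 7.88 GB/s per direction, more than ten times the throughput observed with 12 channels (under 750 MB/s per direction), so the narrower link alone does not explain the low throughput.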
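The 7.88 GB/s figure follows from the per-lane signalling rate, the line-code efficiency (8b/10b for PCIe 1.0/2.0, 128b/130b from PCIe 3.0 on) and the lane count. A small Python sketch; `pcie_bandwidth_gb_s` is a hypothetical name, and the result is a theoretical upper bound per direction that ignores packet and protocol overhead:

```python
# Per-lane signalling rate (GT/s) and line-code efficiency by PCIe generation.
# Gen1/Gen2 use 8b/10b encoding; Gen3 and later use 128b/130b.
PCIE_GENS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (10^9 bytes/s)."""
    rate, efficiency = PCIE_GENS[gen]
    return rate * efficiency * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for gen, lanes in [(4, 16), (4, 4), (1, 4)]:
        print(f"Gen{gen} x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

It prints 31.51 GB/s for the full Gen4 x16 link, 7.88 GB/s for the negotiated Gen4 x4 link, and 1.00 GB/s for the idle Gen1 x4 state seen in the nvidia-smi -q output.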
Temperature can also limit performance, but nvidia-smi -q -i 1 shows the GPU has not reached its slowdown temperature (87 C current vs. 95 C slowdown), although it is only 1 C below the 88 C maximum operating temperature:
    Temperature
        GPU Current Temp : 87 C
        GPU Shutdown Temp : 98 C
        GPU Slowdown Temp : 95 C
        GPU Max Operating Temp : 88 C
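The margin to each threshold can be computed from the temperature section of `nvidia-smi -q` (adding `-d TEMPERATURE` prints only that section). A Python sketch with hypothetical helper names; lines whose value is not a plain number, such as N/A, are skipped:

```python
import re

# "GPU ... Temp : <n> C" lines from the Temperature section of `nvidia-smi -q`.
TEMP_RE = re.compile(r"^[ \t]*(GPU [\w ]+?Temp)[ \t]*:[ \t]*(\d+) C[ \t]*$", re.MULTILINE)


def parse_temps(q_text: str) -> dict:
    """Map e.g. 'GPU Slowdown Temp' -> 95; N/A values do not match and are skipped."""
    return {name: int(value) for name, value in TEMP_RE.findall(q_text)}


def headroom(temps: dict) -> dict:
    """Degrees C left before each threshold (0 or less means it has been reached)."""
    current = temps["GPU Current Temp"]
    return {name: limit - current
            for name, limit in temps.items() if name != "GPU Current Temp"}
```

For the readings above it gives 11 C to shutdown, 8 C to slowdown and 1 C to the maximum operating temperature.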