Detailed explanation of the nvidia-smi command

nvidia-smi (NVIDIA System Management Interface) is a command-line utility for monitoring and managing the status and performance of NVIDIA GPUs (graphics processing units). It provides a simple and powerful way to get real-time information about the GPU and can be used to diagnose, optimize, and manage GPU resources.

Detailed information can be found in the manual: man nvidia-smi

In most cases nvidia-smi is installed together with the NVIDIA GPU driver: the tool is included in the driver package and placed on the system automatically during installation.
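A quick way to confirm that the tool is available after the driver install (a minimal sketch; paths may differ by distribution):

    # check that nvidia-smi is on PATH and which driver shipped it
    which nvidia-smi          # typically /usr/bin/nvidia-smi
    nvidia-smi | head -n 4    # the header of the normal output shows the driver and CUDA versions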

Table of contents

1. Analysis of nvidia-smi panel

2. nvidia-smi common options

3. The difference between video memory and GPU


1. nvidia-smi panel analysis

  • GPU: index of the GPU in this machine, starting from 0; the screenshot above shows four GPUs numbered 0, 1, 2, and 3
  • Fan: fan speed (0%-100%); N/A means there is no fan
  • Name: GPU name/model; the four cards above are all NVIDIA GeForce RTX 3080
  • Temp: GPU temperature (if the temperature gets too high, the GPU will lower its clock frequency)
  • Perf: performance state, from P0 (maximum performance) to P12 (minimum performance); the figure above shows P2
  • Pwr: Usage/Cap: GPU power consumption; Usage is the current draw and Cap is the power limit
  • Persistence-M: persistence mode state. Persistence mode draws more power when idle, but new GPU applications take less time to start. In the screenshots above it is On
  • Bus-Id: PCI bus ID of the GPU
  • Disp.A: Display Active, indicating whether a display is initialized on the GPU
  • Memory-Usage: video memory usage (used / total)
  • Volatile GPU-Util: GPU utilization; for the difference between GPU utilization and video memory usage, see section 3 below
  • Uncorr. ECC: whether error checking and correction (ECC) is enabled, 0/DISABLED or 1/ENABLED; in the screenshots above it is N/A
  • Compute M.: compute mode, 0/DEFAULT, 1/EXCLUSIVE_PROCESS, 2/PROHIBITED; in the screenshots above it is Default
  • Processes: shows, for each process, the GPU it occupies, its PID, and its video memory usage (a command-line query example for these fields follows this list)
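Most of the panel fields can also be queried as plain values for scripting. A minimal sketch, assuming a reasonably recent driver (the full list of queryable fields is printed by nvidia-smi --help-query-gpu):

    # print index, name, temperature, performance state, power draw/limit, utilization and memory use as CSV
    nvidia-smi --query-gpu=index,name,temperature.gpu,pstate,power.draw,power.limit,utilization.gpu,memory.used,memory.total --format=csv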

Reference: Adenialzz

Supplement: ECC error correction

ECC (Error Correction Code) is a technology used to detect and correct errors during data transmission or storage. During data transmission, data errors may occur due to reasons such as noise, interference, or equipment failure. ECC error-correcting codes are designed to detect these errors and, where possible, correct them to ensure data integrity and accuracy.

ECC error-correcting codes use a set of algorithms and techniques to encode the original data together with redundant data called check codes. These check codes are computed from the data according to specific rules and are transmitted or stored along with it. After receiving the data, the receiver computes the check code for the received data with the same algorithm and compares it with the check code that was transmitted. If an error is found, it tries to repair the error automatically with a correction algorithm and restore the accuracy of the original data.

The use of ECC error correction codes can improve the reliability of data transmission and storage systems. It is commonly used in storage media (such as hard drives, flash memory) and communication channels (such as network transmission) to ensure data integrity and reliability. The application fields of ECC error correction codes include computer storage systems, wireless communications, digital broadcasting, etc. Different types of ECC error correction codes have different error correction capabilities, and an appropriate error correction code can be selected according to specific requirements.
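On GPUs that support ECC (mostly data-center and workstation cards; consumer GeForce cards usually report N/A, as in the panel above), nvidia-smi can show and change the ECC mode. A hedged sketch; changing the mode requires root privileges and only takes effect after a reboot:

    # show the current ECC mode and the aggregated error counters
    nvidia-smi -q -d ECC
    # enable (1) or disable (0) ECC on GPU 0; requires root and a reboot to take effect
    sudo nvidia-smi -e 1 -i 0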

2. nvidia-smi common options

Note ⚠️: The available options and output of the command may vary depending on the NVIDIA driver version and GPU model; the complete list of options and usage instructions can be viewed with nvidia-smi --help. A combined query example follows the option list below.

  • -h View the help manual: nvidia-smi -h
  • Observe the state of the GPU dynamically (refreshing every 0.5 s): watch -n 0.5 nvidia-smi
  • -i View a specific GPU: nvidia-smi -i 0
  • -L View the list of GPUs and their UUIDs: nvidia-smi -L
  • -l Refresh at the given interval in seconds (5 seconds by default); stop with Ctrl+C: nvidia-smi -l 5
  • -q Query GPU details: nvidia-smi -q
  • List the details of a single GPU only, selected with the -i option: nvidia-smi -q -i 0
  • Enable persistence mode on all GPUs: nvidia-smi -pm 1
  • Enable persistence mode on a specific GPU: nvidia-smi -pm 1 -i 0
  • Monitor overall GPU usage with a 1-second update interval: nvidia-smi dmon
  • Monitor GPU usage per process with a 1-second update interval: nvidia-smi pmon
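The options above can be combined with the field queries mentioned earlier. A minimal sketch of a compact one-line "dashboard" (field names are listed by nvidia-smi --help-query-gpu; availability may vary by driver version):

    # refresh a compact CSV view of utilization, memory and temperature every second; stop with Ctrl+C
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv -l 1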

Supplement: UUID

The UUID (Universally Unique Identifier) of a GPU is a string of characters and digits that uniquely identifies the GPU device and distinguishes it from other GPU devices.

Each GPU device has a unique UUID, which is usually assigned by the hardware manufacturer or the driver and recorded in the system. How the UUID is generated can vary with the manufacturer and the operating system of the GPU device. UUIDs are widely used in computer systems; in GPU computing, a UUID can be used to identify and manage a particular GPU device and can serve as a device index, enabling software to interact and communicate explicitly with that specific GPU.
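In practice the UUID can be used wherever a GPU index is accepted, which is handy because enumeration order (and hence the index) may change while the UUID stays fixed. A minimal sketch (the UUID below is a placeholder, and train.py is a hypothetical script):

    # list GPUs together with their UUIDs
    nvidia-smi -L
    # -i accepts a UUID as well as an index
    nvidia-smi -q -i GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    # CUDA applications can be pinned to a specific card by UUID
    CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx python train.py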

3. The difference between video memory and GPU

Video memory (Video RAM, VRAM) and the GPU (Graphics Processing Unit) are two different concepts in computer graphics processing.

  • Video memory (VRAM): video memory is a special type of memory used to store data related to image display, such as graphics data and textures. It is usually located on a discrete graphics card (or shared with system memory when the graphics processor is integrated into the motherboard), and is also called graphics memory. With high bandwidth and low latency, video memory lets the GPU read and write image data quickly for graphics rendering and processing. Its capacity is usually measured in megabytes (MB) or gigabytes (GB).
  • GPU (Graphics Processing Unit): a GPU is a processor designed specifically to process graphics and image data, and it is the key component in computer graphics rendering and acceleration. The GPU executes the stages of the graphics rendering pipeline, including geometric calculations, rasterization, and pixel processing, to generate the final image. GPUs can also perform general-purpose computing tasks, so they are widely used in fields such as scientific computing, machine learning, and password cracking. Video memory works together with the GPU, storing the graphics data the GPU needs for processing.

To sum up, video memory is memory dedicated to storing graphics data, while the GPU is a processor dedicated to processing graphics and image data. The two are closely related: the GPU uses video memory to store and access graphics data in order to achieve high-performance graphics rendering and processing.
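The distinction is easy to see in the nvidia-smi metrics themselves: GPU utilization and video memory usage are reported separately, and one can be high while the other stays low (for example, a workload that fills VRAM but barely keeps the compute units busy). A minimal sketch for watching both side by side:

    # GPU compute utilization, memory-controller utilization and VRAM usage, refreshed every second
    nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1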

Origin: blog.csdn.net/daydayup858/article/details/131633445