[nvidia-smi: command not found] How to use nvidia-smi to view GPU information on a cluster server

1. nvidia-smi command output analysis

On an ordinary multi-GPU server, the nvidia-smi command displays detailed information about the NVIDIA driver and the GPUs. For example, entering

nvidia-smi

produces output like the following, in which you can see the driver version, CUDA version, GPU memory usage, and other information.
[Figure: example nvidia-smi output]
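If you only need a few specific fields rather than the full table, nvidia-smi also supports query options. The following is a minimal sketch using the standard --query-gpu and --format flags; the particular field list and the 2-second refresh interval are just illustrative choices, not from the original post:

# Print selected GPU fields in CSV form (field list is illustrative)
nvidia-smi --query-gpu=index,name,driver_version,memory.total,memory.used,utilization.gpu --format=csv
# Refresh the full table every 2 seconds to watch utilization over time
nvidia-smi -l 2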

2. Use the nvidia-smi command on the cluster

If you enter nvidia-smi directly on the command line after logging in to the server, the following error is reported:
bash: nvidia-smi: command not found
This is because, on a cluster, logging in only gives you a shell on a login node, which has no GPU allocated to it. You need to submit a job, run the nvidia-smi command inside that job, and then read the information from the job's output file.

Taking the LSF job scheduling system as an example, you first write a job script such as check_nvidia_smi.sh, as shown below:

#!/bin/bash
#BSUB -J nvidia-smi
#BSUB -n 1
#BSUB -q gpu
#BSUB -o your_output_directory/nvidia_smi.txt
#BSUB -gpu "num=1:mode=exclusive_process"
nvidia-smi

Here,
-J specifies the job name
-n specifies the number of cores used by the job
-q specifies the queue to submit the job to; gpu means the gpu queue is used (if you are not sure of the queue name, you can check it with the bqueues command, as shown in the example after this list)
-o specifies the path of the output file
-gpu "num=1:mode=exclusive_process" means the job exclusively uses one GPU
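To check which queues are available before filling in the -q option, you can use the standard LSF bqueues command. A minimal sketch, assuming your GPU queue is indeed named gpu as in the script above:

bqueues            # list all queues and their status
bqueues -l gpu     # show the detailed configuration of the gpu queue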

Then submit the job from the command line:

bsub < check_nvidia_smi.sh
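After submission, you can check whether the job has started and look at its output before it finishes, using the standard LSF bjobs and bpeek commands. A minimal sketch; the job ID 12345 is a placeholder for the ID printed by bsub:

bjobs              # list your pending/running jobs and their job IDs
bjobs -l 12345     # detailed status of a specific job (12345 is a placeholder ID)
bpeek 12345        # view the job's stdout while it is still running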

Once the job has run, you can find the output file at the specified path and view the nvidia-smi information there:
[Figure: nvidia-smi output captured in nvidia_smi.txt]
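For example, assuming the output path used in the script above, you can simply print the file:

cat your_output_directory/nvidia_smi.txt   # shows the captured nvidia-smi table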

Source: blog.csdn.net/a61022706/article/details/131799575