Day-to-day GPU management for Linux users (NVIDIA)

2020/05/19

        On our server, code is most often run on NVIDIA graphics cards, such as the RTX 2080. Below is a record of the commands I have found most useful.

        Once the NVIDIA graphics card is working normally, enter

nvidia-smi

This prints a comprehensive summary, including each card's model, memory usage, utilization, and the processes using it.
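For scripting, the same information can be read in a machine-friendly form. The following is a minimal sketch, assuming the installed nvidia-smi supports the `--query-gpu` and `--format=csv,noheader,nounits` options. The names `parse_gpu_csv` and `query_gpus` are made up for illustration and are not part of any NVIDIA tool.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV line.
FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]


def _to_int(value):
    """Convert a field to int, or None when the value is not a plain number."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        index, name, used, total, util = parts
        gpus.append({
            "index": _to_int(index),
            "name": name,
            "memory_used_mib": _to_int(used),
            "memory_total_mib": _to_int(total),
            "utilization_pct": _to_int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per card (needs a working driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

On a machine with a working driver, calling `query_gpus()` should return one dictionary per card, which is easier to filter or log than the full table.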

        nvidia-smi can be combined with watch for real-time monitoring. This is the command I use most often:

watch -n 1 nvidia-smi

Here, -n 1 makes watch rerun nvidia-smi every second. I keep the interval at 1 second rather than something shorter because some versions of nvidia-smi run slowly and consume noticeable CPU. Note that watch only shows what fits in the terminal window, so the output can be cut off: in the figure below, the list of processes occupying each card is not visible.

[Figure: output of watch -n 1 nvidia-smi]

        Press Ctrl+C to exit watch.


        When running a multi-process program, the main process sometimes dies while its child processes silently keep holding video memory. The typical symptom is a card whose video memory is in use while its utilization stays at 0%, and nvidia-smi lists no process on that card. For example, if card 1 in the figure above showed 5754MiB of memory in use with everything else unchanged, its memory would probably be held by such a leftover ("zombie") process.
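The symptom described above can be written down as a simple check. This is only a sketch of the heuristic, not a reliable detector: the function name `suspect_cards`, the dictionary keys, and the 500 MiB threshold are all assumptions chosen for illustration, and the readings are made up.

```python
def suspect_cards(gpus, visible_pids_by_card, min_used_mib=500):
    """Return indices of cards with memory in use, 0% utilization,
    and no process listed for them."""
    suspects = []
    for gpu in gpus:
        holds_memory = gpu["memory_used_mib"] >= min_used_mib
        idle = gpu["utilization_pct"] == 0
        no_process = not visible_pids_by_card.get(gpu["index"])
        if holds_memory and idle and no_process:
            suspects.append(gpu["index"])
    return suspects


# Made-up readings: card 1 holds 5754 MiB but shows no activity.
readings = [
    {"index": 0, "memory_used_mib": 7000, "utilization_pct": 95},
    {"index": 1, "memory_used_mib": 5754, "utilization_pct": 0},
    {"index": 2, "memory_used_mib": 3, "utilization_pct": 0},
]
visible = {0: [17521]}  # PIDs that nvidia-smi lists, keyed by card index

print(suspect_cards(readings, visible))  # prints [1]
```

A card flagged this way is only a candidate. A program that has loaded a model and is waiting for input looks the same, so it is worth checking which processes are involved before killing anything.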

        A handy command for this case lists all of the current user's processes that are using each card:

fuser -v /dev/nvidia*

 

[Figure: output of fuser -v /dev/nvidia*]

        The processes listed above are the ones through which user yiyuiii is currently using card 0. You can shut them down by passing their process IDs to the kill command.

kill -9 [PID]

        The kill command can also shut down processes in batches: just list the PIDs one after another to the right of the signal option. For the case above, the command would be

kill -9 17521 17540 17541 17542 17543 17544 17545 17546 17547 17551 17555 17556 17557 17558 17559 17560 17561 18304 18332 18333 18334 18335 18336 18337 18338 18339 18340 18344 18345 18346 18347 18348 18349
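The same batch kill can be done from Python with the standard `os.kill` call. Note that this sketch is a variation on the command above, not a direct translation: it first sends SIGTERM so processes get a chance to exit cleanly, and only then sends SIGKILL (the equivalent of `kill -9`). The function name `kill_pids` and the grace period are assumptions for illustration.

```python
import os
import signal
import time


def kill_pids(pids, grace_seconds=3.0):
    """Send SIGTERM to each PID, wait, then send SIGKILL to the same PIDs.

    Returns the PIDs that existed when SIGTERM was sent. os.kill raises
    PermissionError for processes owned by another user; that is left
    to propagate so the caller notices.
    """
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            continue  # no such process, nothing to do

    time.sleep(grace_seconds)

    for pid in signalled:
        try:
            os.kill(pid, signal.SIGKILL)  # same effect as `kill -9 PID`
        except ProcessLookupError:
            pass  # already exited after SIGTERM
    return signalled
```

For example, `kill_pids([17521, 17540, 17541])` would signal the first three processes from the list above.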

        In addition, as a fan of automation, I wrote a Python program that extracts the list of PIDs to pass to kill.

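import re

# rawtext.txt is assumed to hold the saved output of `fuser -v /dev/nvidia*`
with open('rawtext.txt', 'r') as f:
    s = f.read()

# Match standalone numbers only, so digits inside names such as
# "/dev/nvidia0" or "python3" are not mistaken for PIDs
pattern = re.compile(r'\b\d+\b')
results = pattern.findall(s)

# Print the PIDs separated by spaces, ready to paste after `kill -9`
print(' '.join(results))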
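The intermediate text file can be skipped by calling fuser from Python. This is a sketch under the assumption that the installed fuser (from psmisc) writes only PIDs to standard output and sends file names and other details to standard error, as its man page describes. The function names `parse_fuser_stdout` and `gpu_pids` are hypothetical.

```python
import glob
import re
import subprocess


def parse_fuser_stdout(stdout):
    """Extract unique PIDs from fuser's standard output, keeping their order."""
    pids = (int(token) for token in re.findall(r"\d+", stdout))
    return list(dict.fromkeys(pids))  # drop repeats across several devices


def gpu_pids(pattern="/dev/nvidia*"):
    """Return PIDs of the current user's processes that have a GPU device open."""
    devices = glob.glob(pattern)
    if not devices:
        return []  # no NVIDIA device files on this machine
    # fuser exits with a non-zero status when nothing is using the files,
    # so the return code is deliberately not checked here.
    result = subprocess.run(["fuser", *devices], capture_output=True, text=True)
    return parse_fuser_stdout(result.stdout)
```

Because one process usually opens several device files, the same PID tends to appear more than once in the raw output, which is why the parser removes duplicates.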

2020/06/03

        When a job is started with nohup and its input/output goes to a file, the following command returns the PID of the process directly and accurately:

fuser <filename>
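To see roughly what such a lookup involves, here is a Linux-only sketch that scans `/proc` for processes holding a given file open. It is not how fuser itself is implemented and it covers less (only open file descriptors, and only processes the current user is allowed to inspect). The function name `pids_with_file_open` is made up for illustration.

```python
import os


def pids_with_file_open(path):
    """Return sorted PIDs of visible processes that have `path` open."""
    target = os.path.realpath(path)
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or belongs to another user
        for fd in fds:
            try:
                # Each entry in /proc/<pid>/fd is a symlink to the open file.
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(entry))
                    break
            except OSError:
                continue  # descriptor was closed while scanning
    return sorted(pids)
```

For a nohup job, passing the path of its output file (nohup.out by default) should return the PID of the process writing to it.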

        To be continued... ->


Origin blog.csdn.net/hizcard/article/details/106222876