Foreword
This morning I wanted to test a model on multiple GPUs, so I used PyTorch's torch.nn.parallel.DistributedDataParallel
to run on multiple GPUs simultaneously (hereinafter referred to simply as Dist).
While the program was running, an error occurred in another part of the code (code unrelated to Dist) and the program exited. The code did not handle this kind of crash, so the processes created by Dist were never shut down before the program exited. As a result, the GPU memory held by those processes was not released (by observation, because Dist was not used to shut down all processes, some of the processes spawned by the run were still alive after the program exited).
The following describes how I solved the problem this time.
Main Text
MVE
A Minimal Verifiable Example of the problematic code is as follows:
import torch.distributed as dist
# some code: define the model, etc.
some code
# initialize distributed training
dist.init_process_group(xxxx) # arguments omitted
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
# some code: train and test the model, etc.
some code # my program errored in this part and exited directly, so the process-shutdown code below never ran
# shut down all processes
dist.destroy_process_group()
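One way to guard against this failure mode (a minimal sketch of my own, not code from the original run) is to wrap the training code in try/finally so that dist.destroy_process_group() executes even when an exception is raised. The stubs below (destroy_process_group and train_and_test are hypothetical stand-ins for the real calls) let the control flow be seen without a GPU:

```python
# Sketch: ensure the process group is destroyed even if training crashes.
# destroy_process_group() stands in for dist.destroy_process_group();
# train_and_test() stands in for the training/testing code that errored.

def destroy_process_group():
    print("process group destroyed")

def train_and_test():
    raise RuntimeError("simulated training error")

def main():
    # dist.init_process_group(...) would go here
    try:
        train_and_test()
    finally:
        # runs on normal exit AND on exception, so the
        # processes spawned by Dist are always cleaned up
        destroy_process_group()

try:
    main()
except RuntimeError as e:
    print(f"caught: {e}")
```

With the real torch.distributed calls substituted in, the cleanup runs before the interpreter exits, so the GPU memory is released even when the training code crashes.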
Problem Description
As shown below, the program has exited and no process of mine is using GPU 0, yet GPU 0's memory is still occupied. The reason is that Dist was not used to shut down all processes before the program exited; some of them are still running and holding GPU 0's memory.
The process occupying GPU 7 is another process of mine and is unrelated to the issue discussed here.
Locating the PID of the process occupying GPU memory
Execute the following command:
fuser -v /dev/nvidia*
The result of the command is shown in the figure below; you can see that the process with PID 285448 is occupying GPU 0.
I forgot to mosaic the screenshot below when I took it and later hid some information with black, which is why the USER column appears empty.
Execute the following command to view the process's information. Its PPID (parent process ID) is 1, indicating that it was not spawned by my process occupying GPU 7, and it is only using GPU 0. It can be inferred that this process was left running because the program crashed without shutting it down, so it can be killed manually.
ps -f -p 285448
I forgot to mosaic the screenshot below when I took it and later hid some information with black, so the screenshot is not very clear.
Execute the following two commands to kill the process and then check the GPU status. You can see that GPU 0's memory has been released and GPU memory usage is back to normal.
kill -9 285448
nvidia-smi
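The inspect-kill-verify sequence above can be rehearsed without a GPU. The sketch below is my own illustration, not part of the original post; it uses a background sleep process as a stand-in for the orphaned training process:

```shell
# Start a dummy background process standing in for the leftover trainer.
sleep 300 &
PID=$!

# Inspect it before killing, as with `ps -f -p 285448` above.
ps -f -p "$PID"

# Force-kill it and reap it.
kill -9 "$PID"
wait "$PID" 2>/dev/null

# Verify it is gone: ps now fails for that PID.
ps -p "$PID" > /dev/null 2>&1 || echo "process gone"
```

For a real leftover GPU process, the verification step is `nvidia-smi` rather than `ps`, as shown above.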
Author: @smelly salted fish
Please indicate the source when reposting: https://www.cnblogs.com/chouxianyu/
Welcome to discuss and share!