Solving the Problem of GPU Memory Not Being Released

Foreword

This morning I wanted to test a model on multiple GPUs, so I used PyTorch's torch.nn.parallel.DistributedDataParallel to run on several GPUs at once (referred to below simply as Dist).

While the program was running, an error in another part of the code (unrelated to Dist) caused it to exit. I had not considered how to handle such crashes when using Dist, so the processes spawned by Dist were not shut down before the program exited. As a result, the GPU memory held by those processes was never released (from what I observed, because Dist was not used to shut down all processes, some of the processes spawned by the run kept running after the main program exited).

The following describes how I solved the problem this time.

Main Text

MVE

A Minimal Verifiable Example of the problematic code is as follows:

import torch
import torch.distributed as dist

# some code: define the model, etc.
some code

# initialize distributed training
dist.init_process_group(xxxx)  # arguments omitted
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)

# some code: train and test the model, etc.
some code  # my program raised an error in this part and exited directly, so the cleanup code below never ran

# shut down all processes
dist.destroy_process_group()
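
One way to avoid this situation in the first place is to make the cleanup unconditional, for example by wrapping the training/testing code in try/finally so that dist.destroy_process_group() runs even when that code raises. The following is only a minimal sketch of that pattern, not my original program: the train() function, the toy Linear model, and the assumption that the script is launched with torchrun (which sets RANK, WORLD_SIZE, LOCAL_RANK, etc.) are placeholders added for illustration.

import os

import torch
import torch.distributed as dist


def train(model):
    # hypothetical stand-in for the real training/testing code that crashed
    raise RuntimeError("simulated error in the training code")


def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # arguments depend on your setup

    model = torch.nn.Linear(10, 10).cuda()  # toy model, just for the sketch
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], find_unused_parameters=True
    )

    try:
        train(model)
    finally:
        # runs even when train() raises, so every process leaves the group
        # and its GPU memory can be released
        dist.destroy_process_group()


if __name__ == "__main__":
    main()

Registering dist.destroy_process_group with atexit right after initialization would have a similar effect; either way, the point is that the shutdown call should not depend on the training code finishing normally.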

The Problem

As shown below, the program has exited and no process appears to be using GPU 0, yet GPU 0's memory is still occupied. The reason is that Dist was not used to shut down all processes before the program exited, so some of them are still running and holding GPU 0's memory.

The process occupying GPU 7 is another program of mine and is unrelated to the issue discussed here.

[Screenshot: nvidia-smi output showing GPU 0's memory occupied even though no process is listed for it]

Locating the PID of the Process Occupying GPU Memory

Execute the following command:

fuser -v /dev/nvidia*

The result is shown in the figure below: the process with PID 285448 is occupying GPU 0.

I forgot to blur the screenshot below and later covered some information with black, which is why the USER column appears empty.

[Screenshot: fuser -v /dev/nvidia* output showing PID 285448 holding /dev/nvidia0]

Execute the following command to inspect that process. Its PPID (the PID of its parent process) is 1, which shows that it was not spawned by the program of mine that currently occupies GPU 7, and that it is now only using GPU 0. We can infer that this process was left running because the program crashed without shutting it down, so it is safe to kill it manually.

ps -f -p 285448

I also forgot to blur the screenshot below and later covered some information with black, so the image is not very clear.

[Screenshot: ps -f -p 285448 output showing the process with PPID 1]

Then execute the following two commands to kill the process and check the GPUs again. GPU 0's memory has been released, and GPU memory usage is back to normal.

kill -9 285448
nvidia-smi

[Screenshot: nvidia-smi output after killing the process, showing GPU 0's memory released]


Author: chouxianyu

Please indicate the source: https://www.cnblogs.com/chouxianyu/

Welcome to discuss and share!

