PyTorch DistributedDataParallel (DDP) stuck and hanging

Problem Description:

1. Using an A30 GPU with DistributedDataParallel, GPU memory fills up as soon as the program runs, execution gets stuck at the local_rank setup, and the process group is never initialized.
2. Screenshots of the hung processes (omitted here).
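For context, a typical DDP entry point reads the local rank first and then creates the process group; the hang described above happens at the process-group step. A minimal, illustrative sketch (the function name and structure are mine, not from the original post):

```python
import argparse
import os

def get_local_rank(argv=None):
    # torch.distributed.launch passes --local_rank to each worker process;
    # newer launchers (torchrun) export the LOCAL_RANK environment variable.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", "0")))
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

# The real script would then call
#   torch.cuda.set_device(local_rank)
#   torch.distributed.init_process_group(backend="nccl")
# and it is the init_process_group call that never returns when P2P is blocked.
```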

Solutions:

0. The newest solution, for Supermicro motherboards: BIOS -> Advanced -> NB Configuration -> IOMMU -> Disabled
The BIOS of other motherboard models may also require disabling ACS:
https://zhuanlan.zhihu.com/p/607203976
https://www.supermicro.com/support/faqs/faq.cfm?faq=20264
https://www.supermicro.com/support/faqs/faq.cfm?faq=22226
If the BIOS fix works, you can skip solutions 1-4 below.
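On Linux you can check whether ACS is currently active without rebooting: the NCCL troubleshooting guide suggests inspecting `lspci -vvv` output for `ACSCtl` lines where `SrcValid+` appears. A small helper sketch (the parsing function is my own, illustrative):

```python
import subprocess

def acs_enabled_lines(lspci_output):
    """Return the ACSCtl lines where SrcValid is on (SrcValid+),
    i.e. bridges where ACS is active and may block GPU P2P traffic."""
    return [line.strip() for line in lspci_output.splitlines()
            if "ACSCtl" in line and "SrcValid+" in line]

def check_acs():
    # Root is usually required to see capability fields: sudo lspci -vvv
    out = subprocess.run(["lspci", "-vvv"],
                         capture_output=True, text=True).stdout
    return acs_enabled_lines(out)
```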

1. Change the backend to "gloo", then launch the program with the usual shell command:

torch.distributed.init_process_group(backend="gloo")
python -m torch.distributed.launch --nproc_per_node=7 --master_port 8888 main.py

2. Keep the "nccl" backend, but disable GPU peer-to-peer (P2P) communication by prefixing an environment variable to the shell command:

torch.distributed.init_process_group(backend="nccl")
NCCL_P2P_DISABLE=1 python -m torch.distributed.launch --nproc_per_node=7 --master_port 8888 main.py
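An equivalent alternative (my sketch, not from the original post) is to set the variable inside the training script itself, as long as it happens before the process group is created:

```python
import os

# NCCL reads NCCL_P2P_DISABLE when the communicator is created, so this
# must run before torch.distributed.init_process_group(backend="nccl").
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
```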

3. Keep the "nccl" backend, but set the environment variable permanently in your shell profile, then launch the program normally:

torch.distributed.init_process_group(backend="nccl")
vim ~/.bashrc        # add the line below
export NCCL_P2P_DISABLE=1
source ~/.bashrc
python -m torch.distributed.launch --nproc_per_node=7 --master_port 8888 main.py

4. I recommend the third solution. In my tests, the gloo backend's communication is slower than nccl's, so the program runs faster with nccl. Also, prefixing the environment variable to the command every time is tedious; setting it in ~/.bashrc handles it once and for all.

Bug analysis:

NCCL_P2P_DISABLE=1 disables direct peer-to-peer communication between GPUs (over NVLink or PCIe). Since NVIDIA's official specifications show that the A30 supports NVLink and PCIe P2P, the likely cause is a hardware fault or a software-version mismatch that blocks P2P communication, stalling the processes and hanging the program.
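To see which interconnect each GPU pair is actually using, `nvidia-smi topo -m` prints a link matrix ("NV#" cells mean NVLink, "PIX"/"PHB"/"SYS" mean various PCIe paths). A small parser sketch for that matrix (the function and the sample layout in the test are illustrative, not an official format guarantee):

```python
def parse_topo_links(topo_matrix):
    """Parse GPU-to-GPU link cells from `nvidia-smi topo -m` output.
    Returns {(row_gpu, col_gpu): link_type}, e.g. 'NV4' or 'SYS'."""
    lines = [l for l in topo_matrix.strip().splitlines() if l.strip()]
    header = lines[0].split()          # column labels: GPU0 GPU1 ...
    links = {}
    for line in lines[1:]:
        cells = line.split()
        row = cells[0]
        if not row.startswith("GPU"):
            continue                   # skip legend / CPU-affinity rows
        for col, cell in zip(header, cells[1:]):
            if col.startswith("GPU") and col != row:
                links[(row, col)] = cell
    return links
```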

References:

1: https://zhuanlan.zhihu.com/p/60054075
2: https://github.com/pytorch/pytorch/issues/23074

Origin: blog.csdn.net/qq_40947610/article/details/128118180