Basic concepts of PyTorch distributed DDP

Reference

Basic concepts and issues involved in PyTorch distributed DDP (strongly recommended)

These are beginner-level notes written from a beginner's perspective; corrections are welcome if anything is wrong.

1. Explanation of distributed parameters

  • rank: the serial number of a process within the whole distributed job; each process corresponds to one rank, and the entire distributed training is carried out jointly by many rank processes. In my own understanding, rank is essentially the index of a process, through which the corresponding process can be identified.
  • node: a physical node, generally meaning one machine; a machine can contain multiple GPUs.
  • local_rank: unlike rank, local_rank is numbered relative to a single node, and the local_rank values on different nodes are independent of each other. On a single machine, rank is generally equal to local_rank.
  • world_size: the total number of ranks in the distributed job.


Image from the reference link

In the figure, world_size = 12, i.e. there are 12 ranks in total.
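To make these terms concrete, below is a minimal sketch (my own illustration, not from the original post) of how a process launched with torchrun can query its rank, local_rank, and world_size; torchrun sets the corresponding environment variables for every process it starts.

import os
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the master address/port for each process
dist.init_process_group(backend="gloo")

rank = dist.get_rank()                      # global index of this process in the job
world_size = dist.get_world_size()          # total number of ranks in the job
local_rank = int(os.environ["LOCAL_RANK"])  # index of this process within its own node

print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
dist.destroy_process_group()

Running it on a single machine with, for example, torchrun --nproc_per_node=2 script.py (the file name is just a placeholder) starts two processes with ranks 0 and 1.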

1.1 Note:

1. There is no strict one-to-one correspondence between rank and GPU. A rank can use multiple GPUs, and a GPU can also serve multiple ranks (multiple processes sharing one GPU).

This matters when trying to understand how distributed communication works. Many materials that explain Ring-AllReduce, Parameter Server (PS-Worker), and other modes assume by default that one rank corresponds to one GPU, which leads many people to believe that rank is simply the GPU number.
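In the common one-process-per-GPU setup that these materials assume, the mapping is made explicit in code. Here is a minimal sketch (my own illustration, with a hypothetical MyModel) of binding each process to the GPU whose index equals its local_rank:

import os
import torch
import torch.distributed as dist

# One process per GPU: bind this process to the GPU indexed by its local_rank
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

device = torch.device("cuda", local_rank)
# model = MyModel().to(device)  # hypothetical model
# ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

This convention is what makes rank look like a GPU number, but it is only a convention, not a requirement.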

2. Communication Backend and Mode

The communication step mainly transfers parameter information (gradients and model parameters) during training. The two main choices are the communication backend and the communication mode; both have a large influence on how fast the whole training converges, and the difference can reach 2 to 10 times. DDP supports several common communication libraries. The data-handling mode is implemented in the lower layers of PyTorch, while the backend is left for the user to choose and needs to be set during initialization:

  • backend: the communication backend. The options are nccl (from NVIDIA), gloo (from Facebook), and mpi (OpenMPI). Based on test results, if the GPU supports nccl, nccl is the recommended choice; the end-to-end sketch at the end of this section shows one way to fall back to gloo when it is not available.

torch.distributed provides API calls to check whether each underlying communication library is available:

torch.distributed.is_nccl_available()  # check whether nccl is available
torch.distributed.is_mpi_available()   # check whether mpi is available
torch.distributed.is_gloo_available()  # check whether gloo is available
  • master_addr and master_port: the address and port of the master node, used by the tcp init_method. Because network communication in PyTorch is established by the other nodes connecting to the master node, running DDP only requires specifying the IP and port of the master; the IPs of the other nodes do not need to be filled in. These two parameters can be passed either through environment variables or through init_method, as shown below:
import os
import torch.distributed as dist

# Option 1: pass the master address and port through environment variables
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
dist.init_process_group("nccl",
                        rank=rank,
                        world_size=world_size)

# Option 2: pass them directly through init_method
dist.init_process_group("nccl",
                        init_method="tcp://localhost:12355",
                        rank=rank,
                        world_size=world_size)

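Putting the pieces together, here is a minimal single-machine sketch (my own illustration, not code from the referenced post) that chooses a backend, initializes the process group in each spawned process, and then tears it down; with nccl each process would normally also bind to its own GPU as shown earlier.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process runs this function with its own rank (0 .. world_size - 1)
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # Prefer nccl when CUDA and the nccl library are available; otherwise fall back to gloo
    backend = "nccl" if torch.cuda.is_available() and dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized with backend {backend}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # assumption: 2 processes on one machine
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

In a real multi-node job the same worker logic would typically be started by torchrun on every node, with MASTER_ADDR pointing at the first node instead of localhost.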

Origin: blog.csdn.net/REstrat/article/details/127181762