Deep learning: single-machine multi-card and multi-machine multi-card training

1. Conceptual distinction

Distributed: refers to multiple GPUs spread across multiple machines, i.e. multi-machine, multi-card.

Parallel: refers to multiple GPUs on one machine, i.e. single-machine, multi-card.

Synchronous update: after all GPUs have computed their gradients, the gradients are accumulated and averaged, the parameters are updated once with that average, and only then does the next round begin.
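The synchronous-update step above can be sketched in plain Python (the gradient values and learning rate here are hypothetical, just to show the accumulate-average-update order):

```python
# Each "GPU" computes a gradient on its own data shard; a synchronous update
# averages them before a single shared parameter step.
grads = [0.9, 1.1, 1.0, 1.2]        # per-card gradients for one parameter
avg_grad = sum(grads) / len(grads)  # accumulate, then divide by the number of cards
lr, w = 0.1, 5.0
w -= lr * avg_grad                  # every card applies the same update
print(avg_grad, w)
```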

rank is the global process index, local_rank is the process index within one machine, and world_size is the total number of processes.

For example, with three machines, using all four cards on each machine, we have group=1 and world_size=12:
machine one: node=0, rank=0,1,2,3, local_rank=0,1,2,3 (node=0, rank=0 is the master)
machine two: node=1, rank=4,5,6,7, local_rank=0,1,2,3
machine three: node=2, rank=8,9,10,11, local_rank=0,1,2,3
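The mapping in this example follows one formula; the helper below is a hypothetical illustration, not a PyTorch API:

```python
# Global rank = node index * cards per node + local rank.
def global_rank(node, local_rank, gpus_per_node=4):
    return node * gpus_per_node + local_rank

world_size = 3 * 4  # 3 machines x 4 cards = 12 processes

assert global_rank(0, 0) == 0    # node=0, local_rank=0: the master
assert global_rank(1, 0) == 4    # first card of machine two
assert global_rank(2, 3) == 11   # last card of machine three
```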

2. DP and DDP (in PyTorch)

DP (DataParallel) mode mainly supports multiple cards on a single machine. It uses one process with multiple threads, and is therefore limited by the global interpreter lock. The card with local_rank 0 is the master node: it acts like a parameter server and broadcasts the parameters to the other GPUs. After backpropagation, each card sends its gradients to the master node, which averages them, updates the parameters, and then sends the updated parameters back to the other cards. The GPUs and the master node must be specified for training.
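A minimal DP sketch: on a machine with GPUs, `nn.DataParallel` scatters each batch across the cards, replicates the module, and gathers the outputs on the first card; on a CPU-only machine it simply runs the wrapped module directly, so the snippet below is runnable anywhere.

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 2)
# Wrap the module; with GPUs present, device_ids can be passed explicitly,
# and device_ids[0] plays the master-node role described above.
model = nn.DataParallel(net)

out = model(torch.randn(4, 10))  # the batch of 4 is split across the cards
print(out.shape)                 # torch.Size([4, 2])
```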

DDP (DistributedDataParallel) supports distributed multi-machine multi-card training, and also supports single-machine multi-card training. Compared with DP, it trains with multiple parallel processes, so there is no global-interpreter-lock restriction. In addition, the model is broadcast only once, at initialization, rather than before every forward pass as in DP. Each DDP process has its own independent optimizer and performs its own update step; because the gradients are exchanged between processes via communication, every process ends up executing the same update.
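A minimal DDP sketch, assuming a single CPU process with the `gloo` backend so it runs without GPUs (in practice, `torchrun` launches one process per card and sets the rank/world-size environment variables; the address and port below are placeholder values):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# In real multi-process training these are set by the launcher (torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 2)
ddp_model = DDP(model)  # broadcasts parameters once, at wrap time

out = ddp_model(torch.randn(4, 10))
out.sum().backward()    # gradients are all-reduced across processes here
dist.destroy_process_group()
```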

Another difference between DDP and DP is that with DDP we need to use DistributedSampler on the dataset, so that each process reads only its own part of the data and there is no data exchange between processes. Otherwise, with DP's default data-allocation scheme, multi-machine communication becomes a problem and is very time-consuming. DP splits one batch across the cards: with batch_size=30 and two GPUs, the effective batch size on each GPU is 15. With DDP, batch_size=30 means each card processes a batch of 30.
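The partitioning done by DistributedSampler can be seen directly; passing `num_replicas` and `rank` explicitly lets this sketch run without initializing a process group (inside a real DDP job they are read from the group automatically):

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))

# Simulate two processes; shuffle=False to make the split easy to inspect.
s0 = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
s1 = DistributedSampler(dataset, num_replicas=2, rank=1, shuffle=False)

print(list(s0))  # rank 0 sees indices [0, 2, 4, 6]
print(list(s1))  # rank 1 sees indices [1, 3, 5, 7]
```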

3. SyncBN

Under data parallelism, standard BN normalizes only the samples on a single card, so the effective batch size is very small, which hurts model convergence. SyncBN instead computes a single mean and variance over the data on all cards (multi-machine multi-card or single-machine multi-card) during the forward pass, then sends these statistics to each card for the normalization operation. This avoids BN failing because the per-card batch size is too small. Note that for communication between GPUs, the NCCL framework is used by default.
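The statistic-merging step that SyncBN performs can be shown with toy numbers in plain Python: each card contributes its sample count, sum, and sum of squares, and the global mean and variance follow from those (this is the math, not the PyTorch API):

```python
# Samples held by each of two "cards" for one feature channel.
cards = [[1.0, 2.0], [3.0, 5.0]]

n = sum(len(c) for c in cards)                 # global sample count
total = sum(sum(c) for c in cards)             # global sum
total_sq = sum(x * x for c in cards for x in c)  # global sum of squares

mean = total / n                 # global mean, shared by every card
var = total_sq / n - mean ** 2   # global (biased) variance
print(mean, var)
```

Each card then normalizes its own samples with this shared mean and variance, instead of statistics computed from its two local samples alone.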

Origin blog.csdn.net/slamer111/article/details/132716482