[Notes] PyTorch DDP and Ring-AllReduce

Please indicate the source when reprinting: Senior Xiaofeng's Big Bang Life [xfxuezhang.cn]

 

If there are any mistakes in the text, please point them out!



        What I want to share with you today is an old but classic piece of work: a technique used in distributed training called ring-allreduce. Why this one? Because many frameworks, PyTorch included, use it internally for distributed training, so knowing how it works makes it easier to improve and optimize things later. It originally comes from HPC; in fact, many techniques in distributed machine learning are borrowed from HPC. Part of what follows comes from the paper, and the rest from material found online.

        Here's a little background first.
        Taking data parallelism as an example: in distributed training, the data is split across different GPUs, each GPU trains on its own shard, and the gradients are synchronized after each training iteration. These updates can be synchronous or asynchronous; to make this easier to follow, I have drawn a few pictures. Synchronous updates are easy to understand: each GPU, once it has finished, must wait for all the other GPUs to finish before the gradient update is performed. Asynchronous updates let each GPU update its gradients independently and exchange gradients at certain points in time, so no GPU has to wait for the others. How the gradient exchange is done in these two schemes has itself produced a lot of research.
        Broadly speaking there are two approaches: one uses a parameter server, the other uses reduce operations. The parameter-server approach designates a server that coordinates the gradients computed by each GPU. Its shortcoming is obvious: as the number of GPUs grows, communication with the parameter server becomes the bottleneck. The reduce approach removes the parameter server and lets the GPUs communicate with each other directly; it comes in several variants such as map-reduce, all-reduce, ring-reduce, and ring-allreduce.

https://zh.d2l.ai/chapter_computational-performance/parameterserver.html

[Figure: parameter server architecture]

        Here we first introduce some communication-primitive concepts that will be needed later.

[Figure: communication primitives]
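
        To make these primitives concrete, here is a small sketch using torch.distributed with the gloo backend on a single machine (the two-process setup, tensor values, and port number are made up for illustration). Note that dist.reduce collects a result onto a single rank, essentially the parameter-server pattern described above, while all_reduce leaves the result on every rank.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def demo(rank, world_size):
    # minimal single-machine setup; address/port are placeholders
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # broadcast: rank 0's tensor is copied to every rank
    t = torch.tensor([float(rank)])
    dist.broadcast(t, src=0)                      # t is now 0.0 everywhere

    # all_reduce: every rank ends up with the elementwise sum
    g = torch.tensor([float(rank + 1)])
    dist.all_reduce(g, op=dist.ReduceOp.SUM)      # g == 1 + 2 + ... + world_size
    # dist.reduce(g, dst=0) would instead gather the sum onto one rank only
    # (the parameter-server style of aggregation)

    # all_gather: every rank collects every rank's tensor
    out = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(out, torch.tensor([float(rank)]))

    print(f"rank {rank}: broadcast={t.item()}, all_reduce={g.item()}, "
          f"all_gather={[x.item() for x in out]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(demo, args=(2,), nprocs=2)
```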

        Now, back to the paper.
        The butterfly algorithm is the one more commonly used for allreduce. In the absence of network contention it is optimal in both latency and bandwidth, but in practice this communication pattern can cause network contention on many contemporary clusters, such as the widely deployed SMP/multi-core clusters, because these clusters often share network resources.
The main reasons the butterfly algorithm is optimal in latency and bandwidth when there is no network contention are the following:
        1. Peer-to-peer communication: the butterfly algorithm uses direct pairwise exchanges, i.e. each node establishes a direct connection with its partner. Without network contention, the communication paths between nodes are independent and do not interfere with one another, so messages reach their destination over the shortest path and latency is minimized.
        2. Stage-by-stage communication: the butterfly algorithm aggregates data over multiple communication stages. In each stage a node communicates with its nearest partner, then gradually with partners farther away. This staged aggregation makes the process efficient, reducing both the number of communication rounds and the total delay.
        3. Load balancing: by aggregating data in stages, the butterfly algorithm keeps the communication load balanced. Even when nodes differ in computing power or bandwidth, the load during communication stays relatively even, making the best use of each node's resources.
        4. Bandwidth optimization: in each stage only part of the data is transmitted, rather than everything at once, which reduces the amount of data per message and makes better use of the bandwidth. Without contention, the communication between a pair of nodes can usually occupy the full available bandwidth, so by sizing each message this way the butterfly algorithm maximizes bandwidth utilization.
        (The butterfly global sum works as follows: in the first step, adjacent nodes are paired and exchange their values, so each node of a pair holds that pair's partial sum. In the second step, nodes are grouped in fours, the first half exchanging with the second half, so each node of the group holds that group's partial sum. This is repeated, doubling the group size each time, until the group size exceeds the total number of processes; a small sketch of this recursive-doubling pattern follows below.)
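
        Here is a minimal pure-Python simulation of that recursive-doubling pattern: at stage s, node i exchanges its partial sum with node i XOR 2^s. It only illustrates the communication pattern, not how any real library implements it.

```python
# Minimal simulation of the butterfly (recursive-doubling) global sum described above.
# Each "node" holds one number; after log2(N) stages every node holds the global sum.
# Note the power-of-two restriction, which is exactly what the ring-based approach removes.
def butterfly_allreduce(values):
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "recursive doubling needs a power-of-two node count"
    vals = list(values)
    stride = 1
    while stride < n:
        # at this stage, node i exchanges partial sums with its partner node i XOR stride
        vals = [vals[i] + vals[i ^ stride] for i in range(n)]
        stride *= 2
    return vals

print(butterfly_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```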
        The ring-based approach proposed by the authors is claimed to achieve contention-free communication on almost all contemporary clusters, to need relatively little memory, and not to require a power-of-two node count. But it has problems of its own: it is optimal only in bandwidth, and it can have precision issues. The precision issue arises because the reduction involves floating-point arithmetic, and summing partial results in a different order on different nodes can produce slightly different rounding errors. In addition, although the ring algorithm is very attractive at medium scale, where the transmitted data volume per link is small, there is no bottleneck, and the bandwidth is fully utilized, on large clusters, with huge amounts of data on each server and an extremely long ring, the strategy of splitting the data into blocks and passing them around the ring is no longer advantageous.

[Figures 1–10]

        The paper is long and full of mathematical formulas, so we will skip the proofs and look directly at the implementation process. The paper itself does not say much about the process, though, so I gathered some material from the Internet.
        The proposed method is essentially a combination of three existing techniques. Taking the figure as an example, let's walk through its execution process.
        First, the data on each GPU is divided into N blocks, and each GPU is responsible for one block.
        Then, the N-th GPU sends the N-th block and receives the (N−1)-th block from its neighbor.
        ...
        What is the benefit of doing this?
        In the scatter-reduce stage each GPU sends and receives data N−1 times, where N is the number of GPUs; in the allgather stage each GPU again sends and receives N−1 times; and each transfer is a block of size K/N, where K is the total data size. Therefore each GPU's Data Transferred = 2(N−1)*K/N = (2(N−1)/N)*K, and as the number of GPUs N grows, the amount transferred per GPU stays essentially constant. (My understanding: as N gets larger, 1/N keeps shrinking, so the transfer volume approaches the fixed value 2K.) A communication cost per GPU that does not grow with the number of GPUs is what gives the system its theoretical linear scaling.
        The speed of allreduce is limited by the slowest (lowest-bandwidth) link between adjacent GPUs in the ring. With the right choice of neighbors for each GPU, this algorithm is bandwidth-optimal and the fastest way to perform allreduce (assuming the latency cost is negligible compared to the bandwidth cost).
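
        To make the two phases and the 2(N−1)*K/N figure concrete, here is a small self-contained simulation (an illustrative sketch, not the paper's or NCCL's actual implementation; indexing conventions vary between descriptions). Each of the N nodes starts with the gradient split into N chunks (the blocks above); after N−1 scatter-reduce steps node i holds the fully reduced chunk (i+1) mod N, and after N−1 more allgather steps every node holds every reduced chunk, having sent 2(N−1) chunks of size K/N in total.

```python
def ring_allreduce(per_node_chunks):
    """Simulate ring-allreduce: per_node_chunks[i][j] is node i's local value of chunk j
    (each chunk stands for K/N elements of the full gradient)."""
    n = len(per_node_chunks)
    data = [list(node) for node in per_node_chunks]
    sends_per_node = 0

    # Phase 1: scatter-reduce. In step s, node i sends chunk (i - s) mod n to node i+1,
    # which adds it into its own copy of that chunk.
    for s in range(n - 1):
        outgoing = [((i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i in range(n):
            c, chunk = outgoing[i]
            data[(i + 1) % n][c] += chunk
        sends_per_node += 1

    # Phase 2: allgather. In step s, node i sends chunk (i + 1 - s) mod n, which is
    # already fully reduced, to node i+1, which simply overwrites its copy.
    for s in range(n - 1):
        outgoing = [((i + 1 - s) % n, data[i][(i + 1 - s) % n]) for i in range(n)]
        for i in range(n):
            c, chunk = outgoing[i]
            data[(i + 1) % n][c] = chunk
        sends_per_node += 1

    return data, sends_per_node

# 4 nodes; node i contributes the value i+1 to every chunk, so each reduced chunk is 10.
n = 4
grads = [[float(i + 1)] * n for i in range(n)]
result, sends = ring_allreduce(grads)
print(result[0])  # every node ends up holding [10.0, 10.0, 10.0, 10.0]
print(sends)      # 2*(N-1) = 6 chunk sends per node, i.e. 2*(N-1)/N * K data in total
```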

[Figure]

        There are two ways to implement distributed data-parallel training in PyTorch: DP and DDP. Since DP dispatches the work and updates the parameters entirely in the main process, that process carries a heavy computation and communication load, and training efficiency is very low.
        DDP is what is commonly used now. Simply put, DDP replicates the model on every compute node, lets each replica generate its gradients independently, and then, in every iteration, exchanges and synchronizes these gradients to keep the model consistent across nodes.
        DDP in PyTorch uses Ring-AllReduce to implement its AllReduce.
        The execution process of DDP is roughly as follows:
        Each GPU first initializes its environment, and the model is broadcast so that every replica starts from the same state; then the buckets and the reducer are initialized. In the training phase, each process samples its data, runs the forward pass, then runs backpropagation while using all-reduce for gradient synchronization, and finally updates the parameters.
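
        As a minimal sketch of that flow (CPU-only, gloo backend, a toy model and random data; the address, port, and hyperparameters are placeholders), the skeleton below shows where the pieces sit: constructing DDP broadcasts the initial state and builds the buckets and reducer, and the all-reduce happens inside loss.backward().

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"      # placeholder single-machine setup
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)                       # broadcasts rank 0's weights, builds buckets/reducer
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(3):
        x, y = torch.randn(8, 10), torch.randn(8, 1)   # real code would use a DistributedSampler
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()                          # gradients are all-reduced bucket by bucket here
        optimizer.step()                         # each rank applies the same averaged gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train, args=(2,), nprocs=2)
```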
        Note that the allreduce for gradient synchronization happens after the forward pass has completed (i.e. during backpropagation), and a new term appears here: the bucket.
        (Model parameters are allocated into buckets in (roughly) the reverse order of Model.parameters() from the given model. The reverse order is used because DDP expects gradients to become ready during the backward pass in roughly that order.)

[Figure: DDP gradient bucketing]

        In fact, DDP divides all the model parameters into many small buckets and performs allreduce at the bucket level. For example, as soon as the gradients of bucket0 are ready in every process, their communication starts immediately, while the gradients in bucket1 are still being computed.
        This overlaps communication with computation in time. The design makes DDP training more efficient and yields a good speedup when the number of parameters is large.
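
        DDP exposes this bucket-level behaviour through a couple of knobs: the bucket size can be tuned with the bucket_cap_mb argument of the DDP constructor (25 MB by default), and a communication hook can be registered that DDP calls once per bucket as soon as that bucket's gradients are ready. The sketch below is modeled on the default allreduce-averaging hook; treat it as an illustration and check the comm-hook API of your PyTorch version (GradBucket.buffer() returns the flattened gradients of one bucket).

```python
import torch.distributed as dist

# Gradient-averaging comm hook: DDP invokes this once per bucket during the backward pass,
# so the all-reduce of bucket k overlaps with the computation that is still filling bucket k+1.
def allreduce_avg_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    fut = dist.all_reduce(bucket.buffer(), group=group, async_op=True).get_future()
    # average the summed gradients once the asynchronous all-reduce has completed
    return fut.then(lambda f: f.value()[0] / world_size)

# ddp_model = DDP(model, bucket_cap_mb=25)                       # 25 MB is the default bucket size
# ddp_model.register_comm_hook(state=None, hook=allreduce_avg_hook)
```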

        Briefly summarizing the content above:
        1. DDP is the usual way to do distributed training in PyTorch;
        2. The allreduce in DDP uses ring-allreduce, and uses buckets to introduce asynchrony;
        3. Allreduce happens in the gradient-synchronization stage after forward propagation, overlapping with the backward computation;
        4. Ring-allreduce optimizes for bandwidth and suits medium-scale clusters, but it may have precision issues and is less suitable for large-scale clusters;
        5. The speed of allreduce is limited by the slowest connection between adjacent GPUs in the ring.


Origin blog.csdn.net/sxf1061700625/article/details/132005688