NCCL (NVIDIA Collective Communications Library)

Overview of NCCL

NCCL: NVIDIA Collective Communications Library.
NCCL provides primitives for collective communication as well as point-to-point send/receive. It is not a full-fledged parallel programming framework; rather, it is a library focused on accelerating inter-GPU communication.

NCCL provides the following collective communication primitives:

  • AllReduce
  • Broadcast
  • Reduce
  • AllGather
  • ReduceScatter

It also supports point-to-point send and receive operations, which can be used to implement scatter, gather, or all-to-all patterns.

Tight synchronization between the communicating processors is a key aspect of collective communication. CUDA-based collectives have traditionally been implemented with a combination of CUDA memory copy operations and CUDA kernels for the local reduction. NCCL, in contrast, implements each collective in a single kernel that handles both the communication and the computation. This allows for fast synchronization and minimizes the resources needed to reach peak bandwidth.

NCCL conveniently removes the need for developers to optimize their applications for specific machines. It provides fast collectives over multiple GPUs both within a node and across nodes, and it supports a variety of interconnect technologies, including PCIe, NVLink, InfiniBand Verbs, and IP sockets.

In addition to performance, ease of use was a design goal of NCCL. NCCL exposes a familiar C API that can easily be called from a variety of programming languages.

NCCL is compatible with almost any multi-GPU parallelization model, such as:

  • Single thread controls all GPUs
  • Multi-threaded, i.e. each GPU is controlled by a thread
  • Multi-process, such as MPI
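As a rough sketch of the first model (a single thread driving all local GPUs), one NCCL communicator can be created per device with ncclCommInitAll. The device cap, array sizes, and lack of error checking here are illustrative assumptions, not part of the original post.

    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdio.h>

    int main(void) {
      // Single process, single thread: one communicator per local GPU.
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      if (ndev > 8) ndev = 8;              // cap to the fixed arrays below

      ncclComm_t comms[8];
      int devs[8];
      for (int i = 0; i < ndev; ++i) devs[i] = i;

      // Create ndev communicators (one per device) in a single call.
      ncclCommInitAll(comms, ndev, devs);
      printf("initialized %d NCCL communicators\n", ndev);

      // Collectives would be issued here, one call per communicator,
      // wrapped in ncclGroupStart()/ncclGroupEnd() when driven by one thread.

      for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
      return 0;
    }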

NCCL plays a major role in deep learning frameworks, where the AllReduce collective is widely used in neural network training. Through the multi-GPU and multi-node communication that NCCL provides, neural network training can be scaled efficiently.

Collective Operations

A collective operation must be called by every rank (a rank corresponds to a CUDA device) to form a complete collective operation. Failing to do so causes the other ranks to wait indefinitely.

AllReduce

The AllReduce operation performs a reduce operation on data across devices and writes the result to each rank's receive buffer.

AllReduce operations are independent of rank ordering.

AllReduce starts with K arrays V0, ..., VK-1 of N values each (one per rank) and ends with an identical array S of N values on every rank, where S[i] = V0[i] + V1[i] + ... + VK-1[i].
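As a minimal sketch of how this maps onto the NCCL C API (the wrapper function, buffer names, and float data type are illustrative assumptions), each rank calls ncclAllReduce with the same element count and ncclSum, and every receive buffer ends up holding S:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Sums N floats element-wise across all ranks; every rank receives S.
    // sendbuff/recvbuff are device pointers; comm and stream come from the
    // communicator setup sketched earlier.
    void all_reduce_sum(const float* sendbuff, float* recvbuff, size_t N,
                        ncclComm_t comm, cudaStream_t stream) {
      // After this call, recvbuff[i] = V0[i] + ... + VK-1[i] on every rank.
      ncclAllReduce(sendbuff, recvbuff, N, ncclFloat, ncclSum, comm, stream);
      cudaStreamSynchronize(stream);   // block until the collective has completed
    }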


Broadcast

The Broadcast operation copies a buffer of N elements from a root rank to all ranks.
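A corresponding sketch using ncclBroadcast (again with illustrative wrapper and buffer names); the root is passed as an argument, and every rank, not just the root, must make the call:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Copies N floats from the root rank's sendbuff into recvbuff on every rank.
    void broadcast_from_root(const float* sendbuff, float* recvbuff, size_t N,
                             int root, ncclComm_t comm, cudaStream_t stream) {
      // Only the root's sendbuff is read, but all ranks participate in the call.
      ncclBroadcast(sendbuff, recvbuff, N, ncclFloat, root, comm, stream);
      cudaStreamSynchronize(stream);
    }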


Reduce

The Reduce operation is similar to AllReduce, but the result is written only to the receive buffer of a specified root rank.

Performing a Reduce followed by a Broadcast from the same root produces the same result as an AllReduce.
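A sketch of Reduce with ncclReduce (wrapper and names are illustrative); the comment records the Reduce + Broadcast equivalence noted above:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Sums N floats across all ranks; only the root's recvbuff holds the result.
    // A reduce_sum_to_root followed by a broadcast_from_root with the same root
    // is equivalent to a single AllReduce.
    void reduce_sum_to_root(const float* sendbuff, float* recvbuff, size_t N,
                            int root, ncclComm_t comm, cudaStream_t stream) {
      ncclReduce(sendbuff, recvbuff, N, ncclFloat, ncclSum, root, comm, stream);
      cudaStreamSynchronize(stream);
    }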

AllGather

AllGather gathers the N values from each of the K ranks into an output of K*N values, ordered by rank index, and writes it to every rank.
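A sketch with ncclAllGather (illustrative wrapper); the receive buffer must be large enough to hold the contributions of all ranks:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Gathers sendcount floats from every rank into recvbuff on every rank.
    // recvbuff must hold nranks * sendcount elements; rank r's contribution
    // occupies recvbuff[r * sendcount .. (r + 1) * sendcount).
    void all_gather(const float* sendbuff, float* recvbuff, size_t sendcount,
                    ncclComm_t comm, cudaStream_t stream) {
      ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);
      cudaStreamSynchronize(stream);
    }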


ReduceScatter

ReduceScatter is similar to Reduce, except that the result is scattered in equal-sized blocks across the ranks: each rank receives the block corresponding to its index.
ReduceScatter will be affected by the rank layout.
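A sketch with ncclReduceScatter (illustrative wrapper); here the send buffer is the larger one, and each rank keeps only its own block of the reduced result:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Sums nranks * recvcount floats element-wise across ranks, then scatters
    // the result: rank r receives the recvcount elements starting at offset
    // r * recvcount of the reduced array.
    void reduce_scatter_sum(const float* sendbuff, float* recvbuff,
                            size_t recvcount, ncclComm_t comm,
                            cudaStream_t stream) {
      ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum,
                        comm, stream);
      cudaStreamSynchronize(stream);
    }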

ring-allreduce

References: a Zhihu article and a CSDN article.
In a simple data-parallel setup, GPUs 1~4 are the worker cards responsible for training the network parameters: the same deep learning network is replicated on each card, and each card is assigned a different minibatch of data. After each card finishes its training step, the parameter updates are synchronized to GPU 0, the reducer card, which averages them and sends the result back to each worker card. The whole process is somewhat like the MapReduce pattern.

Two issues are involved :

  • Each training iteration requires all cards to synchronize their data and perform a reduce before it can finish. With a small number of cards the impact is minor, but with many cards in parallel, fast cards end up waiting for slow cards, wasting compute resources.

  • Every worker GPU must exchange all model parameters with the reducer card in each iteration. The data volume is large and the communication overhead is high, and this overhead grows linearly with the number of cards.

Ring AllReduce arranges the GPUs' communication topology into a ring, which reduces the resource consumption that would otherwise grow with the number of cards.

The algorithm consists of two main steps: 1. scatter-reduce, 2. allgather.

1. Scatter-reduce. With n GPUs, divide the data on each GPU into n blocks and give every GPU a left and a right neighbor, then perform n-1 communication steps. In step i, GPU j sends its block (j - i) % n to GPU j+1 and receives block (j - i - 1) % n from GPU j-1, adding the received block into its own copy. When the n-1 steps are done, the first major phase of ring-allreduce, the scatter-reduce, is complete: block (j + 1) % n on GPU j now holds the sum of that block over all n GPUs. One more allgather then completes the algorithm.

2. Allgather. The fully reduced block (j + 1) % n held by GPU j is forwarded around the ring to all other GPUs through another n-1 transfers. Afterwards, every GPU holds the complete reduced result.
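To make the block indexing concrete, here is a small host-side simulation of the two phases for n = 4 simulated GPUs. It only follows the schedule described above (it is not NCCL's actual implementation), and the block size and initial values are arbitrary:

    #include <stdio.h>

    #define NGPU  4                      /* number of simulated GPUs          */
    #define CHUNK 2                      /* elements per block, illustrative  */
    #define LEN   (NGPU * CHUNK)         /* total elements per GPU            */

    /* Modulo that always returns a value in [0, NGPU). */
    static int wrap(int x) { return ((x % NGPU) + NGPU) % NGPU; }

    int main(void) {
      float data[NGPU][LEN];
      for (int g = 0; g < NGPU; ++g)     /* GPU g starts with all values g+1  */
        for (int i = 0; i < LEN; ++i) data[g][i] = (float)(g + 1);

      /* Phase 1: scatter-reduce. In step s, GPU g sends block (g - s) % n to
         its right neighbor, which accumulates it into its own copy.          */
      for (int s = 0; s < NGPU - 1; ++s)
        for (int g = 0; g < NGPU; ++g) {
          int c = wrap(g - s), dst = wrap(g + 1);
          for (int i = 0; i < CHUNK; ++i)
            data[dst][c * CHUNK + i] += data[g][c * CHUNK + i];
        }
      /* GPU g now holds the fully reduced block (g + 1) % n.                 */

      /* Phase 2: allgather. In step s, GPU g forwards block (g + 1 - s) % n
         to its right neighbor, which overwrites its own copy.                */
      for (int s = 0; s < NGPU - 1; ++s)
        for (int g = 0; g < NGPU; ++g) {
          int c = wrap(g + 1 - s), dst = wrap(g + 1);
          for (int i = 0; i < CHUNK; ++i)
            data[dst][c * CHUNK + i] = data[g][c * CHUNK + i];
        }

      /* Every element on every GPU should now equal 1 + 2 + 3 + 4 = 10.      */
      for (int g = 0; g < NGPU; ++g)
        printf("GPU %d: first element = %g\n", g, data[g][0]);
      return 0;
    }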

A 3-GPU example (figures from https://blog.csdn.net/dpppBR/article/details/80445569) walks through the scatter-reduce step first and then the allgather step.

Origin blog.csdn.net/greatcoder/article/details/125668186