NCCL related notes

This article represents only my personal views and is not guaranteed to be correct.

1. Introduction to NCCL

1. What is NCCL

NCCL is the abbreviation of NVIDIA Collective Communications Library. It is a library that accelerates communication between multiple GPUs, implementing both collective communication and point-to-point communication, with many optimizations on the communication path. It provides fast GPU communication within a single node and between nodes, and supports multiple interconnect technologies: within a node, NVLink, PCIe, shared memory, and GPU Direct P2P; between nodes, GPU Direct RDMA, InfiniBand, and sockets.

The following is a schematic diagram of the overall architecture of NCCL in neural network training. In this architecture, the top-level gray rectangular box represents the deep learning framework, which relies on three libraries: cuDNN, a deep learning library that implements and optimizes a variety of neural network algorithms to speed up training; cuBLAS, a linear algebra library that provides common linear algebra operations and supports deep learning frameworks well; and NCCL, which accelerates communication between multiple GPUs. NCCL, cuDNN, and cuBLAS all build on the CUDA library, which provides interfaces for conveniently writing GPU programs and running them in parallel on multiple GPUs.

2. GPU communication methods

Before the emergence of NCCL, the common way for GPUs to communicate was MPI (Message Passing Interface), a standard interface for writing parallel programs that supports message passing and synchronization between multiple computers. With MPI, a GPU copies its data into host memory and then sends it through MPI to the host memory attached to other GPUs. Within the same host, GPUs can instead use the point-to-point communication functions provided by the CUDA API; if the GPUs share the same PCIe bus, this approach skips the step of copying data from the GPU to host memory and reduces data-copy overhead.

3. Advantages of NCCL

NCCL supports communication among multiple GPUs both on a single machine and across machines. It integrates and optimizes the available communication methods, provides fast collective communication services across GPUs within and between nodes, and supports a variety of interconnect technologies, including PCIe, NVLink, InfiniBand Verbs, and IP sockets. NCCL is also well compatible with most multi-GPU parallelization models.

2. Common related technologies

1. NVLink

Before the launch of NVLink, building a more powerful compute node usually meant connecting multiple GPUs to the CPU through a PCIe switch, but that design was limited by PCIe bandwidth. To address this problem, a new interconnect architecture, NVLink, was proposed. NVLink can connect not only GPUs to each other and GPUs to CPUs, but also CPUs to each other.

 

NVLink is a full-duplex, dual-channel link. Second-generation NVLink reaches 25 GB/s per direction per link, or 50 GB/s bidirectional; with the number of links increased to six, the total bandwidth of each V100 reaches 300 GB/s. Fourth-generation NVLink reaches 900 GB/s.
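The bandwidth figures above follow from simple arithmetic, sketched here as a quick check (illustrative numbers only, taken from the text, not queried from hardware):

```python
# Illustrative arithmetic for V100 / NVLink 2.0 bandwidth, using the
# figures stated in the text above (not measured from real hardware).

per_direction_gb_s = 25                        # one direction of one NVLink 2.0 link
bidirectional_gb_s = per_direction_gb_s * 2    # full duplex -> 50 GB/s per link
links_per_v100 = 6                             # NVLink 2.0 raised the link count to 6
total_gb_s = bidirectional_gb_s * links_per_v100

print(bidirectional_gb_s)  # 50
print(total_gb_s)          # 300
```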

2. GDR

GDR (GPUDirect RDMA) is a technology for direct communication between a local GPU and a remote GPU. Built on RDMA, the GPU communicates with the remote GPU directly through an RNIC attached to the same PCIe switch. Compared with earlier approaches, this path does not involve the CPU, eliminates the system-memory copy, and reduces the number of PCIe transfers.

3. Selection of NCCL data communication link

The selectTransport function selects an available transport for the specified channel. It first obtains information about the local node and the remote node, then calls canConnect() for each candidate transport to check whether that transport can be used for communication. Transports are tried in the order P2P > SHM > netTransport > collnet. Once an available transport is found, the function calls setup() to configure the connection and stores the transport in the corresponding connector. Finally, it writes the selected transport into the transportType pointer for subsequent use.
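The priority-ordered selection described above can be sketched roughly as follows. This is not NCCL source code; the predicates are stand-ins for the real canConnect() checks, which inspect the hardware topology of both peers:

```python
# Illustrative sketch of NCCL's transport selection order
# (P2P > SHM > netTransport > collnet). The canConnect lambdas below are
# simplified stand-ins, not NCCL's real topology checks.

def select_transport(local, remote, transports):
    """Return the name of the first transport whose canConnect() accepts the pair."""
    for t in transports:
        if t["canConnect"](local, remote):
            return t["name"]  # in NCCL, setup() would then configure the connection
    raise RuntimeError("no transport can connect these peers")

# Stand-in rules: P2P needs same node plus peer-access capability,
# SHM needs same node, the network transport works for any pair.
transports = [
    {"name": "P2P",          "canConnect": lambda a, b: a["node"] == b["node"] and a["p2p"]},
    {"name": "SHM",          "canConnect": lambda a, b: a["node"] == b["node"]},
    {"name": "netTransport", "canConnect": lambda a, b: True},
]

gpu_a      = {"node": 0, "p2p": True}   # same node, P2P-capable
gpu_no_p2p = {"node": 0, "p2p": False}  # same node, no peer access
gpu_remote = {"node": 1, "p2p": False}  # different node

print(select_transport(gpu_a, gpu_a, transports))       # P2P
print(select_transport(gpu_no_p2p, gpu_a, transports))  # SHM
print(select_transport(gpu_a, gpu_remote, transports))  # netTransport
```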

ncclTopoSelectNets is used to select suitable NICs for the specified GPU devices in the topology system, generating a list of NICs each GPU can use for communication. For each GPU node, the function filters the NICs with matching types and stores the results in a local NIC counter (localNetCount) and a local NIC index list (localNets). It then shuffles the local NIC index list based on the GPU's NVML device index, to ensure that multiple GPUs on the same PCIe switch do not all use the same NIC at the same time. Finally, each newly found NIC is appended to nets.
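A toy illustration of the idea (the rotation rule here is an assumption modeled on the description above, not the exact NCCL code): shifting each GPU's local NIC list by its device index makes GPUs under the same PCIe switch prefer different NICs:

```python
# Illustrative sketch: rotate each GPU's local NIC index list by its
# device index so GPUs sharing a PCIe switch start from different NICs.
# The exact rotation rule is an assumption, not NCCL's real implementation.

def rotate_nics(local_nets, gpu_index):
    """Rotate the NIC index list so each GPU prefers a different first NIC."""
    if not local_nets:
        return []
    shift = gpu_index % len(local_nets)
    return local_nets[shift:] + local_nets[:shift]

local_nets = [0, 1, 2, 3]  # NIC indices reachable from this PCIe switch
for gpu in range(4):
    print(gpu, rotate_nics(local_nets, gpu))
# Each GPU's preferred (first) NIC is distinct: 0, 1, 2, 3
```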

 

Origin blog.csdn.net/eternal963/article/details/130754512