Detailed Explanation by Senior Technical Experts: How Alibaba Cloud eRDMA-Based GPU Instances Can Significantly Improve Multi-machine Training Performance

On March 23, 2023, the Alibaba Cloud developer community's viewing portal for the NVIDIA GTC developer conference officially opened. This article is based on the session "How Alibaba Cloud eRDMA-based GPU instances greatly improve multi-machine training performance".

eRDMA is developed on top of RDMA technology, so the features and advantages of RDMA also apply to eRDMA.
[Slide image]

Initially, RDMA technology emerged and saw wide adoption because of its clear advantages over the traditional TCP network communication protocol. Moving data from memory to the network card no longer requires continuous CPU intervention; instead, the DMA engine integrated in the physical NIC moves the data through direct memory access. Frequent switching between user mode and kernel mode is also avoided, because the application layer directly triggers DMA access on the physical NIC. This greatly reduces CPU overhead and latency, and ultimately improves communication efficiency substantially.

During data communication between server nodes, the CPU is only responsible for specifying the source and destination addresses of the RDMA operation and for triggering the communication; the DMA engine on the physical NIC handles the rest.

Compared with traditional TCP communication, the local and remote physical NICs cooperate to write local memory data directly into the remote memory address space, or to read data from remote physical memory into local memory. Once the transfer completes, the CPU is notified to take the next step. Data transmission and computation are thus decoupled, enabling efficient parallel processing.
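
As a concrete illustration, here is a minimal sketch of a one-sided RDMA write using the standard libibverbs API (the same Verbs interface eRDMA exposes). It assumes a connected RC queue pair already exists and that the peer's buffer address and rkey were exchanged out of band; names and error handling are simplified.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA WRITE: the NIC's DMA engine moves len bytes
 * from local_buf into the peer's memory; the CPU only builds and
 * posts the work request, then later polls the completion queue. */
int rdma_write_example(struct ibv_qp *qp, struct ibv_mr *mr,
                       void *local_buf, uint32_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf, /* local source buffer */
        .length = len,
        .lkey   = mr->lkey,             /* from ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer virtual address */
    wr.wr.rdma.rkey        = rkey;               /* peer memory key */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```
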
[Slide image]

RDMA efficiently handles memory data movement between distributed nodes; in essence, it is DMA access.

The DMA engine can also directly access device addresses. In heterogeneous computing scenarios, it is not only data in host memory but also data in GPU memory that must be synchronized between server nodes. GPUDirect RDMA (GDR) was introduced for this: the physical NIC is bound to the GPU's memory so that GPU memory data can be moved directly between server nodes.

To support this function, NVIDIA's GPU driver provides a set of standard interfaces for obtaining the physical addresses behind the GPU memory an application allocates. The physical NIC can use these physical addresses to perform direct DMA access. Detailed implementation notes and sample code for this part are available on NVIDIA's official website.
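
The user-space side of this flow can be sketched as follows: with NVIDIA's peer-memory support loaded (the nvidia-peermem module, formerly nv_peer_mem), a buffer allocated with cudaMalloc can be registered with the NIC much like host memory. This is a hedged sketch, not Alibaba Cloud's exact implementation; error handling is omitted.

```c
#include <stddef.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register GPU memory for GPUDirect RDMA.  With peer-memory support
 * in place, ibv_reg_mr() resolves the device pointer (via the GPU
 * driver's pinning interfaces) into addresses the NIC can DMA to. */
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;

    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return NULL;

    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}
```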

As a standard PCIe device, the GPU communicates with the RDMA device over different links depending on the hardware topology of the physical server. The figure above compares the GDR communication links and their efficiency under different hardware topologies.

In the figure, the RDMA device and GPU1 sit under the same PCIe Switch, and communication between the two can be completed through that switch. This type is called PIX and has the best communication efficiency.

The RDMA device and GPU2 are located under two different Root Complexes, so communication between them must traverse two PCIe Switches and both RCs. This type is called PHB, and its communication efficiency suffers accordingly.

The RDMA device and GPU3 are even farther apart: they sit under different CPU sockets, and communication between them must cross the UPI link between the CPUs. This type is called SYS and has the worst efficiency.

Some CPU architectures can support GDR even at this level, but in practice you should weigh it against your actual application and choose to enable or disable it accordingly.
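
For reference, a rough way to inspect topology programmatically is NVML's common-ancestor query (GPU-to-GPU only; the `nvidia-smi topo -m` matrix additionally reports GPU-to-NIC links in the same PIX/PHB/SYS style). A minimal sketch, assuming the NVML headers and at least two GPUs:

```c
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t gpu0, gpu1;
    nvmlGpuTopologyLevel_t level;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &gpu0);
    nvmlDeviceGetHandleByIndex(1, &gpu1);

    /* Lowest common PCIe ancestor of the two devices. */
    nvmlDeviceGetTopologyCommonAncestor(gpu0, gpu1, &level);

    switch (level) {
    case NVML_TOPOLOGY_SINGLE:      /* one PCIe switch: like PIX  */
        printf("same PCIe switch\n"); break;
    case NVML_TOPOLOGY_HOSTBRIDGE:  /* crosses a host bridge: PHB */
        printf("through host bridge\n"); break;
    case NVML_TOPOLOGY_SYSTEM:      /* crosses CPU sockets: SYS   */
        printf("across sockets (UPI)\n"); break;
    default:
        printf("level %d\n", (int)level); break;
    }

    nvmlShutdown();
    return 0;
}
```
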
[Slide image]

The eRDMA technology was released by Alibaba Cloud at the Yunqi Conference in 2021 and is fully compatible with the RDMA ecosystem. Application layer code running in an RDMA environment can be directly migrated to an eRDMA environment without modification.

In addition, building on the elasticity of Alibaba Cloud's VPC, eRDMA supports greater elastic scaling and reuses the VPC network's tenant isolation for secure data communication. Anywhere the VPC network reaches, an eRDMA network can be formed, breaking through cluster size limits and enabling ultra-large-scale networking on the order of 100,000 nodes.

eRDMA is based on the fourth-generation Shenlong CIPU, with bandwidth up to 200G; higher bandwidth can be achieved by scaling the hardware. Latency can be as low as 5 us, and throughput can reach 30M packets per second.

[Slide image]

How do you activate eRDMA on an Alibaba Cloud elastic computing instance?

This is done through an Elastic RDMA Interface (ERI). An ERI binds to the ECS instance's virtual network card: it must be attached to an elastic network interface (ENI), and it enables the RDMA function on top of that ENI.

The ERI reuses the network (VPC) to which the ENI belongs, so RDMA can be used on the original network without changing the business network, delivering RDMA's ultra-low latency. eRDMA can be used on Alibaba Cloud servers based on the fourth-generation Shenlong architecture.

eRDMA is based on Alibaba Cloud's VPC network, so whether on a pure CPU server or a GPU server, the eRDMA function can be activated simply by adding an eRDMA device. Existing businesses can be upgraded smoothly while fully inheriting Alibaba Cloud's user interface.

The right side of the slide above shows two perspectives: the lower one is the bottom-layer Shenlong view, and the upper one is the user's view.

In the bottom-layer Shenlong implementation, the Shenlong CIPU emulates a VirtIO network device and an ERI device and presents them to the elastic server through the Shenlong virtualization layer. From the user's perspective there are simply two devices, VirtIO and ERI; access to the underlying physical device is completely transparent. The user only needs to install the eRDMA driver provided by Alibaba Cloud to use them.
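
Once the driver is installed, one quick sanity check is to enumerate RDMA devices through libibverbs, the same check the `ibv_devices` utility performs; the ERI should appear in the list. A minimal sketch:

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);

    /* The ERI should show up here once the eRDMA driver is loaded. */
    for (int i = 0; i < num; i++)
        printf("RDMA device: %s\n", ibv_get_device_name(list[i]));

    ibv_free_device_list(list);
    return 0;
}
```
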
[Slide image]

What are the technical advantages of eRDMA?

First of all, the eRDMA implementation uses a self-developed congestion control algorithm that tolerates changes in transmission quality on the VPC network, such as delay jitter and packet loss, so it still performs well in a lossy network environment.

In addition, compared with IB and RoCE networks, no infrastructure needs to be redeployed and no special switch configuration is required. Relying on Alibaba Cloud's VPC network, eRDMA can be expanded and deployed anywhere within a region; users purchase on demand and scale elastically, ensuring efficient use of resources. More importantly, it inherits the high availability of cloud servers, so any node can be migrated on failure and business recovers quickly.

In addition, combined with the comprehensive performance-monitoring tools in the Alibaba Cloud console, users can quickly diagnose network anomalies, or analyze the monitoring data to optimize the business communication model and improve performance.

eRDMA is fully compatible with the RDMA ecosystem and the Verbs interface. The right side of the slide above lists the Verbs opcodes eRDMA currently supports, which cover the main RDMA operation codes. Note, however, that eRDMA communication is built on the reliable connection (RC) service, and both proxy and CM connection methods are supported, so users can choose according to their own application.
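
As one illustration of the CM path, the librdmacm API can set up an RC connection in a few calls. This is a minimal client-side sketch modeled on the standard rdma_client example; the address and port are placeholders and error handling is compressed:

```c
#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
    struct ibv_qp_init_attr attr = { .cap = { .max_send_wr  = 16,
                                              .max_recv_wr  = 16,
                                              .max_send_sge = 1,
                                              .max_recv_sge = 1 },
                                     .sq_sig_all = 1 };
    struct rdma_cm_id *id;

    /* Resolve the server (placeholder address/port).  RDMA_PS_TCP
     * maps to a reliable-connection (RC) queue pair. */
    if (rdma_getaddrinfo("192.0.2.1", "7471", &hints, &res))
        return 1;

    /* Create the CM id plus its QP, then connect. */
    if (rdma_create_ep(&id, res, NULL, &attr) || rdma_connect(id, NULL))
        return 1;

    printf("RC connection established\n");

    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```
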
[Slide image]

Alibaba Cloud launched ebmgn7ex, its first GPU instance supporting eRDMA and an upgraded version of ebmgn7e. It is equipped with eight NVIDIA Ampere-architecture GPUs interconnected via NVLink. The machine carries two eRDMA network cards, attached to the two CPU sockets respectively, providing balanced inter-node data paths. Each group of four GPUs shares one eRDMA device and its maximum 100G of eRDMA network bandwidth.

Unlike existing elastic computing instances, the eRDMA, EBS storage, and VPC network on ebmgn7ex all share one bandwidth pool. With no storage or VPC traffic, eRDMA can use up to 200G of bandwidth; when storage or VPC traffic is present, bandwidth is allocated among the different traffic types by weight. The Shenlong CIPU guarantees storage performance at the bottom layer, which makes bandwidth utilization more efficient and fully releases the CIPU's performance.
[Slide image]

Under the hood, mainstream AI frameworks perform multi-machine and multi-card communication through a variety of communication backends. Major collective communication libraries such as OpenMPI, Gloo, and NCCL each provide their own implementations of the various collective communication patterns and algorithms.

NCCL is NVIDIA's open-source communication library for GPUs. It achieves high-speed interconnection between machines and between GPU cards over channels such as PCIe and NVLink, and thanks to software-hardware co-design it has unique advantages in performance optimization; it is the recommended communication backend for mainstream AI frameworks. eRDMA supports NCCL well: the open-source NCCL library can be used without modification while the underlying eRDMA devices are enabled. Compared with traditional TCP communication, it delivers lower latency and higher throughput, achieving performance comparable to a dedicated RDMA communication link.
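
A minimal sketch of what such a backend does under the hood: an in-place all-reduce across the GPUs visible to one process using the NCCL API. Buffer size and setup are illustrative only; in a real framework the library issues these calls for you.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define COUNT 1024
#define MAX_GPUS 8

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > MAX_GPUS) ndev = MAX_GPUS;

    int devs[MAX_GPUS];
    ncclComm_t comms[MAX_GPUS];
    float *buf[MAX_GPUS];
    cudaStream_t streams[MAX_GPUS];

    /* One NCCL communicator per local GPU. */
    for (int i = 0; i < ndev; i++) devs[i] = i;
    ncclCommInitAll(comms, ndev, devs);

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&buf[i], COUNT * sizeof(float));
        cudaMemset(buf[i], 0, COUNT * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* In-place sum of COUNT floats across all GPUs. */
    ncclGroupStart();
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(buf[i], buf[i], COUNT, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce complete on %d GPUs\n", ndev);
    return 0;
}
```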

In addition, compared with conventional RDMA, eRDMA is better suited to large-scale deployment on the cloud, providing users with a highly cost-effective, broadly accessible large-bandwidth communication solution.
[Slide image]

Compared with ebmgn7e, ebmgn7ex adopts the same CPU and GPU configuration, but its underlying platform is the fourth-generation Shenlong CIPU, which raises whole-machine network bandwidth from 65G to 200G, an increase of roughly 200%. At the same time, eRDMA technology cuts communication latency by 80% relative to the VPC network.

We selected several typical multi-machine training scenarios for a comparative test of the two instances. Overall, ebmgn7ex improves performance by about 30%; in scenarios highly sensitive to latency and bandwidth, such as VGG-16, voice conversion, and MAE, the improvement exceeds 40%. With user costs roughly equal, ebmgn7ex clearly offers better cost performance: by preliminary estimates, depending on the business scenario, it can deliver 20%-30% cost savings.
[Slide image]

Alibaba Cloud released the CIPU cloud processing unit in 2022. Today, cloud service on Alibaba Cloud is far more than carving up a server with virtualization and handing it to the user: through the CIPU's intelligent scheduling and management capabilities, CPU, GPU, storage, and network are split into separate resource pools, from which services are flexibly composed according to user needs.

Since storage, network, and computing resources are independent of one another, several types of resources can be adjusted dynamically while computing services are in use. For example, the number of CPU cores, the capacity of mounted ESSD data disks, and the network bandwidth can all be changed as the business's performance requirements change, which offers great flexibility compared with a self-built IDC. Based on the CIPU, we do two things:

First, turn software into hardware. Functions originally implemented in software, such as virtualization, network transmission, and trusted computing, are hardware-accelerated by the CIPU, freeing cloud software functions from the server's CPU and memory so that the computing resource pool can provide services independently. This hardware offload also improves the performance of virtualization, network transmission, and similar functions.

Second, turn hardware into software. Through software-defined hardware, the RDMA function originally provided by dedicated hardware is now provided on the CIPU.
[Slide image]

Currently, three main solutions support the RDMA function: IB, RoCE, and iWARP. The CIPU provides its RDMA function via software-defined networking based on the iWARP protocol.

Few solutions on the market use iWARP. The reason is that iWARP runs over TCP; compared with RoCE and IB it requires a large amount of memory access, which degrades server CPU performance in a large network. In addition, because it rides on ordinary TCP/IP network facilities, RDMA transmission is hard to monitor, and congestion control is also problematic.

Alibaba Cloud, however, uses the CIPU to implement congestion control and traffic management of network data in a software-defined way. At the same time, the CIPU offloads the memory and computing work of the original network and virtualization layers onto the CIPU processor, leaving the performance of the computing nodes unaffected.

[Slide image]

Through the above methods, the CIPU sidesteps iWARP's shortcomings and forms the network with a TCP/IP overlay. Previously, Alibaba Cloud computing products using the RoCE protocol needed one TCP overlay network for the VPC and a separate RoCE network for RDMA data transmission. Compared with that approach, eRDMA merges RDMA into the VPC network, making the whole topology much simpler and more convenient.

Alibaba Cloud's network, storage, and computing resources are all attached to the VPC network, so through the CIPU, all resources in an AZ can exchange data via RDMA. Traditional RoCE and IB networks require a dedicated, independently built machine room, and the network topology is hard to change once built. An eRDMA instance needs no new machine room, so its facility cost is far lower than the RoCE approach, making it very cost-effective.

In addition, the CIPU's software-defined hardware enables elastic networking, so an eRDMA computing cluster can link all resources in an AZ at the overlay level without additional network configuration. Any resources deployed within a single AZ can be pulled into the same computing cluster, greatly enhancing elasticity. And because eRDMA is based on a general-purpose computing network, compatibility is excellent: almost all APIs are supported.
[Slide image]

Because eRDMA can merge computing resources of almost any scale within an AZ into an independent computing cluster, it is very well suited to building AI training clusters. Based on the eRDMA protocol, this instance uses an ICL (Ice Lake) processor, machines interconnect over the 200G eRDMA network, and all such instances requested within the availability zone can reach one another via RDMA.
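
A hedged sketch of how a multi-node communicator is typically bootstrapped on such a cluster, using MPI only to distribute the NCCL unique ID (the launcher, rank layout, and GPU binding are assumptions for illustration):

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, world;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world);

    /* Rank 0 creates the NCCL unique ID; everyone else receives it. */
    ncclUniqueId id;
    if (rank == 0)
        ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Bind each rank to a local GPU (one rank per GPU assumed). */
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);

    /* Join the cluster-wide communicator; inter-node traffic then
     * flows over whatever transport NCCL detects, e.g. eRDMA. */
    ncclComm_t comm;
    ncclCommInitRank(&comm, world, id, rank);

    /* ... collectives (ncclAllReduce etc.) go here ... */

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```
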
[Slide image]

Normally, resources on the cloud are provided through a virtualization layer, which costs some computing power. The bare metal instances Alibaba Cloud builds on the CIPU are different: the letters EBM at the front of the instance name indicate that the physical resources of the whole machine pass straight through to the user via the CIPU, with no virtualization-layer loss in between. With the CIPU's cloud management support, the user obtains the capabilities of a complete physical machine.

The price of this instance is lower than that of the previous bare metal instance ebmgn7e with its 64G network bandwidth, a cost-performance bonus brought by the CIPU's unified network. Nor is it limited to training clusters: general single-machine training and inference services can also choose it. For example, with ChatGPT currently so popular, a ChatGPT-scale model needs more than 500G of GPU memory for inference, so this instance is also a prime choice for large-model inference.

Currently, you can apply for eRDMA instances on the cloud through the invitation-test link. The options when applying for an instance also differ from before.

First, an option to automatically install the eRDMA driver has been added to the operating system selection. We recommend checking it; the eRDMA network driver is then installed in the instance automatically.

Second, in the network configuration, ebmgn7ex requires two fixed elastic network cards, which cannot be released; each provides 100G of network bandwidth. Checking the eRDMA box on the right enables RDMA transmission. If some applications behave abnormally during use or testing, you can uncheck it: the instance's network bandwidth stays the same, but the instance then supports only VPC network connections.
[Slide image]

Alibaba Cloud will gradually launch more products that support eRDMA, and the ebmgn7ex instance will form a complete ecosystem on Alibaba Cloud. Through eRDMA-based interconnection, users can access ECS general-purpose computing instances in the cluster and eRDMA-capable CPFS distributed storage at lower latency. And because bare metal and CPFS communicate via RDMA, the GPU Direct Storage function saves more compute-node resources while fetching training data quickly.
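
For reference, a minimal sketch of a GPU Direct Storage read with the cuFile API, pulling file data straight into GPU memory; the file path is a placeholder and error checks are trimmed:

```c
#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const size_t len = 1 << 20;
    void *gpu_buf = NULL;
    CUfileHandle_t fh;
    CUfileDescr_t desc = { 0 };

    cuFileDriverOpen();
    cudaMalloc(&gpu_buf, len);

    /* O_DIRECT bypasses the page cache so data is DMA'd to the GPU. */
    int fd = open("/data/train.bin", O_RDONLY | O_DIRECT);
    desc.handle.fd = fd;
    desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    cuFileHandleRegister(&fh, &desc);

    /* Read len bytes at file offset 0 straight into device memory. */
    cuFileRead(fh, gpu_buf, len, 0, 0);

    cuFileHandleDeregister(fh);
    cuFileDriverClose();
    return 0;
}
```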

At the same time, the existing security functions of the public cloud and the various AI-related PaaS services Alibaba Cloud provides, such as the ACK container service, serverless computing services, and all functions of the PAI platform, also run on eRDMA-based instances.
[Slide image]

Users of Alibaba Cloud instances that support eRDMA gain extremely high computing-cluster elasticity: an RDMA training cluster can be built on the cloud in minutes and scaled elastically on demand, greatly saving training time.

eRDMA provides very high bandwidth: all instances supporting eRDMA can perform low-latency RDMA communication at hundreds of gigabits per second without extra cost, saving the expense of building a machine room. More importantly, the public cloud ecosystem provides high-quality supporting services, such as business-driven elastic scaling and storage services like OSS and CPFS.

eRDMA also provides convenient management capabilities, along with the stability guarantees for computing, storage, and networking that the public cloud delivers on the Shenlong architecture; this high stability also saves operation and maintenance costs.

On Alibaba Cloud, the more you buy, the more you save.

Click "Read the original text" at the end of the article to watch the full video.
