Lin Lixiang, Senior Technical Expert at Alibaba Cloud: The Shenlong AI Acceleration Engine, Built on Alibaba Cloud's Elastic GPU Service, Seamlessly Improves AI Training Performance

At 14:00 on March 23, 2023, the Alibaba Cloud developer community's viewing channel for the NVIDIA GTC developer conference officially opened. Lin Lixiang, a senior technical expert at Alibaba Cloud, delivered a presentation titled "The Shenlong AI Acceleration Engine, Built on Alibaba Cloud's Elastic GPU Service, Seamlessly Improves AI Training Performance." The following is the content of his speech.

Alibaba Cloud Elastic GPU Service provides IaaS instances equipped with NVIDIA GPUs to customers on the cloud. The Shenlong AI Acceleration Engine is a software tool built on top of this GPU IaaS service, designed to help users make full use of GPU instances when running artificial intelligence workloads.

Understanding the scenarios and distribution of AI training among users on the cloud is of great value for analyzing their usage habits and pain points and for providing targeted optimizations.

First, in terms of framework choice, PyTorch's ease of use has made it deeply rooted in academia and has gradually spread to industry. Today, PyTorch is almost the default framework for cloud users training new AI models. TensorFlow still holds an important place in scenarios such as recommendation systems, while MXNet and JAX are used less and less frequently.

Second, in terms of machine scale for AI training tasks, Alibaba Cloud currently provides bare-metal and virtual machine instances with 8 GPUs per machine. Users can train on a single machine or interconnect machines through network cards to form a multi-machine training cluster. Alibaba Cloud also offers single-, two-, and four-GPU virtual machine instances.

At present, some users' training tasks can be completed on a single machine, such as fine-tuning existing models or training small and medium-sized models with acceptable training time. A considerable number of users combine multiple GPU instances into a training cluster to run data-parallel or model-parallel training, for example when pre-training models on massive datasets or training ultra-large models.

During AI training, users typically rely on the AI framework itself, together with the acceleration software that NVIDIA provides on top of CUDA beneath the framework, to train their models. In actual end-to-end training, however, three major bottlenecks commonly appear: compute, network, and storage. In particular, when distributed training runs across multiple machines with multiple GPUs, inter-machine network efficiency is the most common performance bottleneck.

After building an in-depth understanding of the scenarios and pain points of users who run training on cloud GPU instances, the goal of the Shenlong AI Acceleration Engine is to solve those pain points on top of Alibaba Cloud's Elastic GPU Service through combined software and hardware optimization, and ultimately to improve the performance of users' AI training, especially distributed training.

Alibaba Cloud Heterogeneous Computing is built on the Shenlong computing platform and provides a variety of heterogeneous instance products based on NVIDIA GPUs and other heterogeneous devices. In addition to traditional instances, it also offers instances that decouple the AI computing workload, such as EAIS.

As shown in the figure, the Shenlong AI acceleration engine AIACC is built on top of these heterogeneous instances. Sitting below the platform layer, it takes the form of an "IaaS+" software tool for accelerating AI computing workloads. Besides AIACC, this layer also provides tools such as FastGPU for quickly and easily building GPU clusters.

The software provided by the Shenlong AI acceleration engine AIACC covers both AI training and AI inference.

For AI training, AIACC targets the main pain points of users who train on cloud instances: it applies combined software and hardware optimization to distributed training and to the training computation graph, and packages the results as software tools that accelerate users' AI training.

The figure above shows the software architecture of AIACC for distributed training. This software is called Acspeed (hereinafter, Acspeed refers to AIACC's software optimization tool for distributed training).

In the era of massive data, distributed training is a common requirement for users running AI training. Compared with the NVLink/NVSwitch interconnect available inside a single machine, the NIC-based communication between cloud instances is more than an order of magnitude worse in both latency and bandwidth.

In distributed training scenarios such as data parallelism and model parallelism, this gap between inter-machine and intra-machine interconnect capability often makes data exchange between machines the weak point of overall training efficiency. This is the performance problem Acspeed focuses on.

In terms of user habits, PyTorch is currently the main framework for training new models, and PyTorch DDP (DistributedDataParallel) is PyTorch's native choice for distributed training.

Therefore, besides optimizing the performance of the underlying network, Acspeed must remain compatible with native DDP usage to minimize migration cost for users. Acspeed therefore adopts a layered, decoupled software architecture: at the top it is seamlessly compatible with the AI framework, at the bottom it optimizes the underlying infrastructure, and across layers it performs multi-dimensional tuning of communication efficiency for specific AI scenarios.
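
For context, a generic PyTorch DDP training script of the kind Acspeed is designed to plug into looks roughly like the sketch below. It uses only standard PyTorch APIs; the model, data, and hyperparameters are placeholders rather than anything specific to AIACC.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL is the usual backend for GPU training.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)          # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                                           # placeholder training loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = ddp_model(x).sum()
        optimizer.zero_grad()
        loss.backward()        # gradients are bucketed and all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=8 train.py
```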

As shown in the figure, Acspeed targets the bottlenecks of distributed training and achieves transparent, combined software-and-hardware performance optimization through a layered decoupling of the AI framework layer, the collective communication algorithm layer, and the network layer.

At the AI framework layer, Acspeed uses PyTorch's c10d-plugin interface and a corresponding wrapper to support PyTorch DDP usage transparently and to optimize at bucket granularity; for TensorFlow users, Acspeed also provides a Horovod-compatible API layer.

At the collective communication algorithm layer, Acspeed applies collective-communication compiler technology on top of the NCCL runtime, performing adaptive topology detection and algorithm optimization based on interconnect information such as NVLink, NVSwitch, PCIe, and NIC switches.
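
The NCCL runtime itself exposes a few of the knobs involved in this kind of algorithm selection through public environment variables; the minimal sketch below only illustrates what those knobs look like and does not reflect how Acspeed's compiler actually makes its choices.

```python
import os

# Stock NCCL lets you pin the collective algorithm and protocol via public environment
# variables (they must be set before the process group / NCCL communicator is created).
# Acspeed selects such parameters adaptively from the detected NVLink/NVSwitch/PCIe/NIC
# topology, so the fixed values here are illustrative only.
os.environ["NCCL_ALGO"] = "Ring"        # e.g. "Ring" or "Tree"
os.environ["NCCL_PROTO"] = "Simple"     # e.g. "Simple", "LL", or "LL128"
```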

At the network layer, Acspeed is deeply optimized for the infrastructure of Alibaba Cloud's VPC and eRDMA networks. Because PyTorch DDP is supported seamlessly, users enjoy Acspeed's distributed-training performance gains without changing their business code.

AIACC also compiles and optimizes the PyTorch training computation graph. This software is called Agspeed (hereinafter, Agspeed refers to AIACC's optimization work on the PyTorch training computation graph).

As is well known, PyTorch's eager mode has swept academia and industry with its outstanding ease of use. Compared with graph mode, eager mode is intuitive but lacks graph-fusion mechanisms, which can lead to varying degrees of low GPU efficiency in some scenarios. The PyTorch community recognized this problem and has been working on graph-mode optimization of the computation graph without compromising the user experience; excellent projects such as TorchScript, TorchDynamo, and TorchInductor have emerged and continue to be integrated and iterated. In the PyTorch 2.0 era, computation-graph compilation has become a standard capability of the framework.
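
For reference, the community-side compilation interface referred to here is exposed in PyTorch 2.x as torch.compile; the snippet below is a generic example of that public API, not anything Agspeed-specific.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 512),
).cuda()

# In PyTorch 2.x, graph compilation is exposed through torch.compile, which uses
# TorchDynamo for graph capture and TorchInductor as its default back end.
compiled_model = torch.compile(model)

x = torch.randn(64, 512, device="cuda")
y = compiled_model(x)   # the first call compiles; subsequent calls reuse the compiled graph
```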

At present, the PyTorch training computation-graph stack still needs improvement in end-to-end scenarios; issues such as frequent graph recompilation and graph breaks limit its adoption in user scenarios.

Building on this excellent community work, Agspeed has done a large amount of coverage and optimization work for end-to-end scenarios to address the current shortcomings of the PyTorch computation graph.

As shown in the figure, the upper layer of Agspeed keeps the user-facing PyTorch eager API unchanged and focuses on improving the coverage of the compiler front end and back end. On the front end, Agspeed introduces TorchDynamo's graph-capture mechanism, capturing and handling capture failures, while a front-end autotuner automatically selects the best-performing back-end optimizer. On the back end, Agspeed integrates back ends including TorchScript, AOTAutograd (functorch), nvFuser, and TorchInductor, adding them to the optimization path through plugins to address cases where a particular back end underperforms on end-to-end models.
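
Agspeed's autotuner is internal, but the general idea of trying several compiler back ends and keeping the fastest one can be sketched with public torch.compile back ends as stand-ins. The back-end names and timing loop below are illustrative assumptions, not Agspeed's implementation.

```python
import time
import torch
import torch._dynamo

def pick_fastest_backend(model, example_input, backends=("inductor", "aot_eager")):
    """Illustrative only: compile with each public torch.compile back end, time a few
    iterations, and keep the fastest. Agspeed's real autotuner and back-end plugins
    are internal to AIACC."""
    best_backend, best_time = None, float("inf")
    for backend in backends:
        torch._dynamo.reset()                       # drop caches from the previous back end
        compiled = torch.compile(model, backend=backend)
        compiled(example_input)                     # warm-up run triggers compilation
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            compiled(example_input)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_backend, best_time = backend, elapsed
    return best_backend

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")
print(pick_fastest_backend(model, x))
```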

Compared with eager mode and the current community compilation path, Agspeed guarantees end-to-end performance improvement for users and provides an SLA guarantee.

We divide Acspeed's optimization stack into the AI framework layer, the collective communication algorithm layer, and the network layer. In the specific case of AI training with PyTorch, these correspond to the PyTorch layer, the NCCL layer, and the operating system's TCP layer.

In distributed training, PyTorch, NCCL, and TCP each carry the communication workload with a different data structure. At the framework layer the carrier is the DDP bucket; at the NCCL layer it is the NCCL chunk; at the TCP layer it is the TCP buffer, which holds the data actually transmitted.

The factors that affect bandwidth utilization differ across these communication carriers, but they are also interrelated.

They differ in the following ways: for the DDP bucket, the main factor affecting communication efficiency is the bucket granularity used to overlap computation and communication; for the NCCL chunk, the main factors are slice overlap and the number of channels; for the TCP buffer, the main factor is the bandwidth-delay product.

They are interrelated because PyTorch's DDP buckets are split into NCCL chunks for transmission, and NCCL chunks are in turn split into TCP buffers for transmission.

The granularity and strategy with which an upper layer partitions its communication data therefore affect the communication efficiency and strategy choices of the layers below.
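
In stock PyTorch, the bucket granularity mentioned above is exposed as the bucket_cap_mb argument of DistributedDataParallel; the hand-set value below is only to make the concept concrete, since Acspeed adjusts this kind of granularity automatically.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# bucket_cap_mb is stock PyTorch's knob for DDP gradient-bucket granularity (default 25 MB):
# larger buckets mean fewer all-reduce calls, while smaller buckets let communication start
# earlier during the backward pass. This assumes init_process_group() has already been
# called, as in the DDP sketch earlier; the value 50 is purely illustrative.
model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()], bucket_cap_mb=50)
```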

Architecturally, Acspeed links the framework layer, the collective communication algorithm layer, and the network layer, and performs cross-layer autotuning over the influencing factors of all layers to improve overall communication efficiency. In addition, further combined software and hardware optimizations have been engineered to address the factors that limit communication efficiency at each layer.

At the AI framework layer, the communication granularity of the PyTorch reducer is adjusted dynamically to find the best fusion granularity; on the NCCL side, the pipelining of NCCL's communication primitives is improved through automatic adjustment of dynamic channels and slices; at the network layer, an adaptive bandwidth-delay-product buffer is implemented, and Alibaba Cloud's distinctive eRDMA capability is supported.
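
Some of these knobs have public counterparts among NCCL's environment variables; the values below are arbitrary examples meant only to show which parameters are in play, not settings recommended by Acspeed, which tunes them adaptively at runtime.

```python
import os

# Public NCCL counterparts of the knobs described above (set before NCCL initializes).
os.environ["NCCL_MIN_NCHANNELS"] = "4"                # lower bound on NCCL channels
os.environ["NCCL_MAX_NCHANNELS"] = "16"               # upper bound on NCCL channels
os.environ["NCCL_BUFFSIZE"] = str(8 * 1024 * 1024)    # per-channel buffer size in bytes
os.environ["NCCL_SOCKET_NTHREADS"] = "4"              # helper threads for the TCP transport
os.environ["NCCL_NSOCKS_PERTHREAD"] = "4"             # sockets per helper thread
```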

Unlike traditional TCP/IP, RDMA offloads the network protocol stack onto the network card, optimizing communication between machines. RDMA is also used differently from TCP/IP at the programming level.

eRDMA is Alibaba Cloud's innovative solution for using RDMA in large-scale cloud networks, providing large-scale RDMA cluster networking on top of the cloud network. It not only bypasses the OS kernel in the communication software stack to improve communication bandwidth utilization, but also, in GPU training scenarios, combines with GPU Direct RDMA technology to bypass the host and transfer GPU buffers efficiently across machines.

As shown in the figure, eRDMA meets the bandwidth and latency requirements of AI distributed training in large-scale cloud networks. Because eRDMA differs considerably from traditional TCP/IP programming logic, Acspeed provides an NCCL plugin at the communication layer that enables and optimizes NCCL's RDMA capability, so users can benefit from eRDMA in distributed training without changing any of their existing usage.
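
As a rough illustration, NCCL loads external network plugins through the public NCCL_NET_PLUGIN variable; whether Acspeed's eRDMA support is packaged under the plugin name used below is purely an assumption, so consult the AIACC-ACSpeed documentation linked at the end for the authoritative setup.

```python
import os

# NCCL looks for external network plugins named libnccl-net-<name>.so when the public
# NCCL_NET_PLUGIN variable is set. "aiacc" below is a hypothetical plugin name used only
# for illustration; the real name and installation steps are in the AIACC-ACSpeed docs.
os.environ["NCCL_NET_PLUGIN"] = "aiacc"   # hypothetical name
os.environ["NCCL_DEBUG"] = "INFO"         # logs which transport/plugin NCCL actually selected
```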

We selected benchmarks scaling distributed training from one machine to eight machines on Alibaba Cloud and compared the performance of native PyTorch DDP against DDP with AIACC's optimizations applied.

The horizontal axis in the figure above covers different models trained with PyTorch DDP, Acspeed, and Agspeed at different machine scales. The vertical axis shows normalized performance, with single-machine performance defined as 1 and ideal eight-machine performance as 8. Scaling from 8 GPUs on one machine to 64 GPUs on eight machines, AIACC exhibits better scalability and a significant performance improvement over native PyTorch DDP, ranging from 30% to 150%.
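
The arithmetic behind such a chart is simple; the sketch below uses made-up placeholder numbers (not the measured results from the talk) just to show how scaling efficiency and the relative improvement are computed from the normalized values.

```python
# Placeholder numbers only, normalized so that one 8-GPU machine = 1 and ideal 8 machines = 8.
ddp_8m = 4.0      # hypothetical native-DDP throughput at 8 machines
aiacc_8m = 6.8    # hypothetical AIACC throughput at 8 machines

scaling_efficiency = aiacc_8m / 8 * 100          # percentage of ideal linear scaling (8x)
improvement = (aiacc_8m / ddp_8m - 1) * 100      # percentage gain over native DDP
print(f"{scaling_efficiency:.0f}% of linear scaling, +{improvement:.0f}% vs native DDP")
```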

AIACC is compatible with customers' existing code, so customers can apply AIACC acceleration transparently, and, combined with Alibaba Cloud's IaaS resources, it delivers deep optimization of software and hardware together. All of these capabilities are provided to customers in a convenient, open-source, and compatible manner. They improve the business performance of customers' AI training, speed up algorithm development and model iteration, raise the efficiency of GPU resource usage, reduce the cost of computing resources, and ultimately strengthen the competitiveness of customers' products. This is the customer value of AIACC.

Finally, you are welcome to use AIACC to accelerate your training workloads on the cloud.

We provide several ways to get started, such as one-click selection and software package installation, so that users can run AIACC directly on heterogeneous instances or in containers. Taking Acspeed as an example, you only need to add one line of code, import acspeed, after import torch in your business code to start your AIACC acceleration journey.
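
Concretely, the one-line change described above looks like this:

```python
import torch
import acspeed   # the single added line; the rest of the training script stays unchanged
```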

AI distributed training communication optimization library AIACC-ACSpeed official documentation:
https://help.aliyun.com/document_detail/462031.html

AI training compute optimization compiler AIACC-AGSpeed official documentation:
https://help.aliyun.com/document_detail/467465.html

Click "Read the original text" at the end of the article to watch the full video.

