The AI Big Base: Baidu's answer sheet for the era of large models


1. The Birth of Wenxin Yiyan

"Wen Xin Yi Yan was trained on the largest high-performance GPU cluster in the AI ​​field in China."

As early as June 2021, in preparation for future large-scale model training, Baidu Smart Cloud began planning a new high-performance GPU cluster. Working with NVIDIA, it completed an IB network architecture design that can accommodate more than 10,000 cards, with every GPU card in the cluster interconnected over the IB network. Construction was completed in April 2022, delivering EFLOPS-level computing power in a single cluster.

In March 2023, Wenxin Yiyan was born on this high-performance cluster, where it continues to iterate new capabilities. The cluster is still being expanded today.

Dr. Lai Junjie, General Manager of Solutions and Engineering, NVIDIA China: GPU clusters interconnected by high-speed IB networks are key infrastructure in the era of large models. The largest high-performance GPU/IB cluster in China's cloud computing market, jointly built by NVIDIA and Baidu Smart Cloud, will accelerate Baidu's breakthroughs in large models.

2. High performance cluster design

A high-performance cluster is not simply an accumulation of computing power; it requires dedicated design and optimization before the cluster's overall compute can be fully exploited.

In distributed training, GPUs communicate constantly both across machines and within a machine. Besides using high-performance networks such as IB and RoCE to provide high-throughput, low-latency inter-machine communication, the internal interconnect of each server and the communication topology of the cluster network must also be designed to meet the communication requirements of large-scale model training.

Achieving the ultimate design optimization requires a deep understanding of what each operation in an AI task means for the infrastructure. Different parallel strategies in distributed training, that is, how the model, data, and parameters are split, generate different communication requirements. For example, data parallelism and model parallelism introduce large amounts of inter-machine and intra-machine Allreduce traffic respectively, expert parallelism generates inter-machine All2All operations, and 4D hybrid parallelism introduces the communication produced by all of these parallel strategies combined.
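As an illustration only (not Baidu's implementation), the sketch below shows, with PyTorch's torch.distributed API, the collective operations these strategies translate into on the network; it assumes the default process group has already been initialized, for example via torchrun.

```python
# Minimal sketch of the collectives behind the parallel strategies above
# (illustrative only; assumes dist.init_process_group() has already run).
import torch
import torch.distributed as dist

def data_parallel_grad_sync(grad: torch.Tensor) -> None:
    # Data parallelism: every rank sums (then averages) its local gradients
    # with all other ranks -- an Allreduce, typically across machines.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

def expert_parallel_dispatch(tokens: torch.Tensor) -> torch.Tensor:
    # Expert parallelism: each rank sends a slice of its tokens to every
    # other rank and receives one back -- an All2All across machines.
    routed = torch.empty_like(tokens)
    dist.all_to_all_single(routed, tokens)
    return routed
```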

To this end, Baidu Smart Cloud optimizes the design along two dimensions, the single server and the cluster network, to build a high-performance GPU cluster.

At the server level, Baidu Smart Cloud's super AI computer X-MAN has evolved to its fourth generation. X-MAN 4.0 establishes high-performance card-to-card interconnects for the GPUs, providing 134 GB/s of Allreduce bandwidth within a single machine. It is currently Baidu's most highly customized server product, using the most dedicated components. In the MLCommons 1.1 results, X-MAN 4.0 ranks in the TOP 2 for single-machine hardware performance among machines of the same configuration.

At the cluster network level, a three-tier Clos architecture optimized for large-scale model training was purpose-designed to ensure the performance and speedup of the cluster during large-scale training. Compared with traditional approaches, the architecture applies an eight-rail optimization so that communication between same-numbered cards in different machines crosses as few hops as possible, providing high-throughput, low-latency network service for the same-numbered-card Allreduce operations that account for the largest share of network traffic in AI training.

The network architecture can support an ultra-large cluster of up to 16,000 cards, the largest scale currently achievable with IB box-switch networking. The cluster's network performance stability and consistency reach 98%, close to a state of steady, uninterrupted communication. As verified by a large-model algorithm team, when training jobs for hundred-billion-parameter models are submitted on this ultra-large cluster, the overall training efficiency at the same machine scale is 3.87 times that of the previous-generation cluster.

However, building a large-scale, high-performance heterogeneous cluster is only the first step in the successful implementation of a large model. To ensure the smooth completion of AI large model training tasks, more systematic software and hardware optimizations are needed.

3. The challenge of large model training

Over the past few years, the parameter count of large models has grown at a rate of roughly ten times per year. Around 2020, tens of billions of parameters qualified as a large model; by 2022, a scale of hundreds of billions of parameters was required to be called one.

Before large models, training an AI model could usually be handled by a single card or a single machine with multiple cards, with training times ranging from hours to days. Now, completing the training of a model with hundreds of billions of parameters requires distributed training on large clusters of hundreds of servers and thousands of GPU/XPU cards, and the training cycle stretches to months.

Take GPT-3, with 175 billion parameters trained on 300 billion tokens, as an example: based on half-precision peak performance, a single A100 would need about 32 years, while 1,024 A100s at 45% resource utilization would need about 34 days. Of course, even setting time aside, a single A100 cannot train a hundred-billion-parameter model at all, because the model parameters alone already exceed the memory of a single card.
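As a rough sanity check on these figures, here is a back-of-envelope calculation assuming the commonly used estimate of about 6 × parameters × tokens total training FLOPs; the exact numbers depend on the FLOP accounting and overheads assumed.

```python
# Back-of-envelope estimate of GPT-3 training time (assumptions noted inline).
params = 175e9                      # GPT-3 parameters
tokens = 300e9                      # training tokens
total_flops = 6 * params * tokens   # ~3.15e23 FLOPs, using the common 6*N*D rule

a100_peak = 312e12                  # A100 half-precision peak, FLOP/s

single_card_years = total_flops / a100_peak / (365 * 24 * 3600)
cluster_days = total_flops / (1024 * a100_peak * 0.45) / (24 * 3600)

print(f"1 x A100 at peak:        ~{single_card_years:.0f} years")  # ~32 years
print(f"1024 x A100s, 45% util.: ~{cluster_days:.0f} days")        # ~25 days with
# this simple rule; the ~34 days cited above implies a somewhat larger FLOP
# budget or additional overheads.
```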

To train a large model in a distributed environment and shrink the training cycle from a single card's decades to tens of days, challenges such as the compute wall, the GPU memory wall, and the communication wall must all be overcome, so that every resource in the cluster is fully utilized, the training process is accelerated, and the training cycle is shortened.

The compute wall refers to the enormous gap between the compute of a single card and the total compute a model requires: a single A100 delivers only 312 TFLOPS, while training GPT-3 requires a total of 314 ZFLOPs of computation, a gap of 9 orders of magnitude.

The GPU memory wall refers to the fact that a single card cannot hold the full parameters of a large model: GPT-3's 175 billion parameters alone require about 700 GB of memory (counting 4 bytes per parameter), while a single NVIDIA A100 GPU has only 80 GB.
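The memory-wall arithmetic is straightforward; the short sketch below simply restates it, assuming 4 bytes per parameter as in the text and ignoring gradients, optimizer states, and activations, which make the real footprint several times larger.

```python
# GPU memory wall: even the raw parameters overflow a single card.
params = 175e9
bytes_per_param = 4                      # FP32, as assumed above
weights_gb = params * bytes_per_param / 1e9
print(f"Parameters alone: ~{weights_gb:.0f} GB vs. 80 GB on one A100")  # ~700 GB
```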

The essence of the compute wall and the GPU memory wall is the contradiction between the limited capability of a single card and the model's enormous storage and compute demands. They can be resolved with distributed training, but once training is distributed, the communication wall appears.

The communication wall arises because, under distributed training, the compute units in the cluster must synchronize parameters frequently, so communication performance limits the overall training speed. If the communication wall is not handled well, a larger cluster may well deliver lower training efficiency. Breaking through the communication wall shows up as strong scaling capability, that is, the cluster's multi-card acceleration keeps pace with its scale. The multi-card linear speedup ratio is the metric used to evaluate this acceleration capability; the higher, the better.
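For concreteness, the linear speedup ratio described above can be written as a one-line metric; the sketch below is a minimal illustration with arbitrarily chosen throughput numbers.

```python
# Linear speedup ratio: how close N cards come to N times single-card throughput.
def linear_speedup_ratio(throughput_n_cards: float,
                         throughput_one_card: float,
                         n_cards: int) -> float:
    return throughput_n_cards / (n_cards * throughput_one_card)

# Example: 1,000 cards delivering 900x the single-card throughput -> 0.9,
# i.e. the 90% multi-card acceleration ratio cited later for Baidu Baige.
print(linear_speedup_ratio(900.0, 1.0, 1000))   # 0.9
```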

These walls begin to appear with multi-machine, multi-card training. As model parameters grow and cluster sizes grow with them, the three walls rise higher and higher. Meanwhile, over the long training runs of large clusters, equipment failures can also occur, slowing or even interrupting training.

4. The process of large-scale training

Generally speaking, looking at large model training from the perspective of infrastructure, the whole process can be roughly divided into the following two stages:

Phase 1: Parallel strategy and training optimization

After a large model is submitted for training, the AI framework considers both the structure of the model and the capabilities of the training cluster, formulates a parallel training strategy for the task, and completes AI task placement. This stage is about splitting the model and placing the tasks: how the large model should be split, and where the split pieces should be placed on the cluster's GPUs/XPUs.

For the AI tasks placed on GPUs/XPUs, the AI framework works with the training cluster to optimize the whole pipeline, at both the single-card runtime level and the cluster communication level, so that every AI task in the large model training process runs efficiently, covering data loading, operator computation, communication strategy, and more. For example, ordinary operators in an AI task can be replaced with optimized high-performance operators, and communication strategies can be chosen to fit the current parallel strategy and the network capabilities of the training cluster.
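As an illustration of the operator-replacement idea (not the implementation used here), the sketch below swaps a naively composed attention operator for PyTorch's fused scaled_dot_product_attention, which reduces the number of separate kernels and memory round-trips.

```python
# Illustrative operator replacement: naive attention vs. a fused kernel.
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Separate matmul, softmax, and matmul kernels.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # A single fused kernel (PyTorch >= 2.0).
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(8, 16, 128, 64)   # (batch, heads, seq_len, head_dim)
assert torch.allclose(naive_attention(q, k, v), fused_attention(q, k, v), atol=1e-4)
```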

Phase 2: Resource Management and Task Scheduling

The large model training task then runs according to the parallel strategy formulated above, while the training cluster supplies the AI tasks with various high-performance resources: the environment an AI task runs in, how resources are connected to it, the storage it uses to read and save data, the type of network the GPUs/XPUs communicate over, and so on.

At the same time, during the run, the training cluster works with the AI framework, using elastic fault tolerance and similar mechanisms to provide a reliable environment for the long-running training of large models: for example, how the running state of the cluster's resources and AI tasks is observed and sensed, and how resources and AI tasks are rescheduled when the cluster changes.

From the breakdown of these two stages, it is clear that the entire large model training process depends on close cooperation between the AI framework and the training cluster to break through the three walls and jointly guarantee the efficiency and stability of large model training.

5. Full-stack integration: the "AI Big Base" accelerates large model training

Drawing on years of technical accumulation and engineering practice in AI and large models, Baidu launched the full-stack, self-developed AI infrastructure "AI Big Base" at the end of 2022. It comprises a three-layer technology stack of chip, framework, and model, with key self-developed technologies and leading products at every layer: the Kunlun chip, PaddlePaddle, and the Wenxin large model respectively.

On top of this three-layer technology stack, Baidu Smart Cloud has launched two AI engineering platforms, the "AI Middle Platform" and the "Baidu Baige · AI Heterogeneous Computing Platform", which improve efficiency at the development level and the resource level respectively, complete the breakthrough of the three walls, and accelerate the training process.

Among them, "AI Middle Station" relies on the AI ​​framework to formulate parallel strategies and optimized environments for the large model training process, covering the entire life cycle of training. "Baidu Baige" realizes efficient chip enablement and provides various AI resource management and task scheduling capabilities.


Baidu's "AI Big Base" has carried out full-stack integration and system optimization of the technology stacks of each layer, and completed the construction of cloud and smart technology integration, which can realize end-to-end optimization and acceleration of large model training.

Hou Zhenyu, Vice President of Baidu Group: Large model training is a systems engineering project; the cluster scale, training time, and cost are all far greater than in the past. Without full-stack optimization, it is hard to guarantee that large model training completes smoothly. Baidu's years of technical investment and engineering practice in large models have allowed us to build a complete software stack to accelerate the training of large models.

Next, following the two stages of the large model training process described above, we explain how the layers of the "AI Big Base" technology stack are integrated and systematically optimized to achieve end-to-end optimization and acceleration of large model training.

5.1 Parallel strategy and training optimization

Model splitting

For large model training, PaddlePaddle provides a rich set of parallel strategies, including data parallelism, model parallelism, pipeline parallelism, grouped parameter sharding, and expert parallelism. These strategies cover the training of large models from billions to hundreds of billions, and even trillions, of parameters, breaking through the compute wall and the GPU memory wall. In April 2021, PaddlePaddle was the first in the industry to propose the 4D hybrid parallel strategy, which supports training hundred-billion-parameter models on a timescale of months.
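As a rough illustration of how these degrees of parallelism are declared, here is a minimal sketch based on PaddlePaddle's public Fleet API; the degrees shown are arbitrary and must multiply to the number of cards, and field names may differ slightly across versions.

```python
# Minimal sketch: declaring a hybrid-parallel configuration with Paddle Fleet.
# The degrees here (2 x 2 x 2 on 8 cards) are illustrative only.
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 2,        # data parallelism
    "mp_degree": 2,        # tensor/model parallelism
    "pp_degree": 2,        # pipeline parallelism
    "sharding_degree": 1,  # grouped parameter sharding
}
fleet.init(is_collective=True, strategy=strategy)
hcg = fleet.get_hybrid_communicate_group()  # per-dimension communication groups
```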

Topology awareness

Baidu Baige provides cluster topology awareness built specifically for large model training scenarios, covering both intra-node and inter-node architecture awareness: information such as the compute inside each server, how the CPU connects to the GPUs/XPUs, how GPUs/XPUs link to each other within a server, and how GPUs/XPUs in different servers are linked over the network.
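At the single-node level, the kind of information involved can be inspected with standard NVIDIA tooling; the sketch below (illustrative only, not Baidu Baige's implementation) prints the GPU interconnect matrix that intra-node architecture awareness builds on.

```python
# Illustrative: query the intra-node GPU topology (NVLink / PCIe / NUMA affinity)
# via the standard `nvidia-smi topo -m` command.
import subprocess

topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True, check=True)
print(topo.stdout)
```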

Automatic parallelization

Before the large model training task starts, PaddlePaddle builds a unified distributed resource view of the cluster from Baidu Baige's topology awareness, and at the same time builds a unified logical computation view from the large model to be trained.

Combining these two views, PaddlePaddle automatically searches for the optimal model partitioning and hardware placement strategy, assigns model parameters, gradients, and optimizer states to the appropriate GPUs/XPUs accordingly, and completes AI task placement to maximize training performance.

For example, model-parallel AI tasks are placed on different GPUs within the same server, where the GPUs are linked through the server's internal NVSwitch; data-parallel and pipeline-parallel AI tasks are placed on same-numbered GPUs in different servers, linked through IB or RoCE. Placing AI tasks according to their type in this way makes efficient use of cluster resources and accelerates large model training.
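The sketch below illustrates why this placement matters, under an assumed 8-GPU-per-server layout rather than Baidu Baige's actual scheduler: model-parallel process groups are formed from the ranks inside one server, while data-parallel groups are formed from the same-numbered card in every server.

```python
# Minimal sketch of placement-aware process groups (assumes the default
# torch.distributed process group is already initialized on every rank).
import torch.distributed as dist

GPUS_PER_NODE = 8   # assumed server layout

def build_groups(world_size: int, rank: int):
    num_nodes = world_size // GPUS_PER_NODE

    # Model parallelism: the 8 cards of one server, linked by NVSwitch.
    mp_groups = [dist.new_group(list(range(n * GPUS_PER_NODE, (n + 1) * GPUS_PER_NODE)))
                 for n in range(num_nodes)]

    # Data parallelism: card k of every server, linked by IB / RoCE.
    dp_groups = [dist.new_group(list(range(k, world_size, GPUS_PER_NODE)))
                 for k in range(GPUS_PER_NODE)]

    # Each rank keeps the two groups it belongs to.
    return mp_groups[rank // GPUS_PER_NODE], dp_groups[rank % GPUS_PER_NODE]
```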

End-to-end adaptive training

During a training run, if the cluster changes, for example a resource fails or the cluster is scaled, Baidu Baige performs fault-tolerant replacement or elastic scaling. Because the positions of the nodes participating in the computation have changed, the communication pattern among them may no longer be optimal, so PaddlePaddle automatically adjusts the model partitioning and AI task placement strategy based on the latest cluster information, while Baidu Baige carries out the corresponding task and resource scheduling.

PaddlePaddle's unified resource and computation views and automatic parallelization, combined with Baidu Baige's elastic scheduling, deliver end-to-end adaptive distributed training of large models that covers the full life cycle of cluster training.

This deep interaction between the AI framework and the AI heterogeneous computing platform realizes systematic optimization across compute, framework, and algorithm, supports automatic and elastic training of large models, and delivers a measured end-to-end performance improvement of 2.1x, ensuring the efficiency of large-scale training.

Training optimization

After the model is split and the AI tasks are placed, the Baidu Baige platform's built-in AI acceleration suite ensures that operators can accelerate computation during training across mainstream AI frameworks such as PaddlePaddle and PyTorch and across different compute cards. The AI acceleration suite includes data-layer storage acceleration and the training and inference acceleration library AIAK, which optimize the full pipeline across data loading, model computation, distributed communication, and other dimensions.

Among these, optimizing data loading and model computation effectively improves single-card efficiency, while optimizing distributed communication, combined with the cluster's high-performance IB or RoCE network, the specially optimized communication topology, and a sensible AI task placement strategy, together addresses the communication wall.

Baidu Baige reaches a 90% multi-card acceleration ratio on a thousand-card cluster, allowing the cluster's overall computing power to be fully released.

In the MLPerf Training v2.1 results released in November 2022, the model training performance submitted by Baidu using PaddlePaddle plus Baidu Baige ranked first in the world under the same GPU configuration, with both end-to-end training time and training throughput surpassing the NGC PyTorch framework.

5.2 Resource management and task scheduling

Baidu Baige runs all AI tasks on the CCE container engine and, through related container plug-ins, provides capabilities such as AI resource management, architecture awareness, and elastic fault tolerance, breaking through the compute wall, the GPU memory wall, and the communication wall at the resource efficiency level.

Resource management

Baidu Baige provides various compute, network, and storage AI resources suited to large model training, including Baidu Taihang elastic bare-metal servers (BBC), IB networks, RoCE networks, the parallel file storage PFS, the object storage BOS, and the data lake storage acceleration RapidFS.

When a task runs, these high-performance resources can be combined appropriately to further improve the efficiency of AI operations and accelerate AI task computation across the whole pipeline. Before an AI task starts, training data in the object storage BOS can be pre-warmed and loaded into the data lake storage acceleration RapidFS over the elastic RDMA network, which reduces communication latency by 2 to 3 times compared with a traditional network and, on top of the high-performance storage, accelerates data reads for the AI task. Finally, the AI task's computation runs on high-performance Baidu Taihang elastic bare-metal servers (BBC) or cloud servers (BCC).

Elastic fault tolerance

When AI tasks are running, not only are high-performance resources needed; the stability of the cluster must also be ensured so that resource failures are minimized and training is not interrupted. But resource failures can never be ruled out entirely, so the AI framework and the training cluster must jointly ensure that an interrupted training task can resume from its latest state, providing a reliable environment for the long-running training of large models.

Baidu's self-developed heterogeneous collective communication library ECCL supports communication between Kunlun chips and other heterogeneous chips, as well as detection of slow nodes and faulty nodes. Through Baidu Baige's resource elasticity and fault-tolerance policies, slow and faulty nodes are removed, the latest architecture topology is fed back to PaddlePaddle, tasks are re-placed, and the corresponding training tasks are assigned to other XPUs/GPUs, ensuring that training runs smoothly and efficiently.

6. AI inclusiveness in the era of large models

The large model is a milestone technology on artificial intelligence's path toward general intelligence, and mastering large models is a question that must be answered on the road to intelligent upgrading. Ultra-large-scale computing power and full-stack, integrated software optimization are the best answers to that question.

To help society and industry quickly train their own large models and seize the opportunity of the era, Baidu Smart Cloud released the Yangquan Smart Computing Center at the end of 2022. Equipped with the full-stack capabilities of Baidu's "AI Big Base", it can deliver 4 EFLOPS of heterogeneous computing power and is currently the largest and most technologically advanced data center in Asia.

At present, Baidu Smart Cloud has opened all of the "AI Big Base"'s capabilities to the outside world, delivering them in various forms so that society and industry can conveniently obtain intelligent services, realizing AI inclusiveness in the era of large models.

—— END ——

Recommended reading:

Augmented language models - the road to general intelligence?

Realization of full message based on public mailbox

Baidu APP iOS terminal package size 50M optimization practice (2) Image optimization

On the recompute mechanism in distributed training

Analyze how the Dolly Bear business practices stability construction based on the distributed architecture

Software Quality and Testing Essays by Baidu Engineers
