The hottest topic in tech right now is large models
And the most talked-about model in China is Wenxin Yiyan (ERNIE Bot)
Thousands of engineers at Baidu ground away, day and night, to produce it
And have the dark circles under their eyes to prove it
Training a super-large model like Wenxin Yiyan
is a brutal, exhausting process
Rumor has it engineers were crying in the bathroom between rounds
Today, let's talk about the technology:
to train a large model like Wenxin Yiyan,
how hard is it? How brutal? How much sweat does it take?
First:
if you want to refine a large model → you must first build a large cluster
A "large cluster" here means an ultra-large-scale GPU computing cluster
Only a large cluster can hold a large model
A model is usually called "large" once it reaches hundreds of billions of parameters
For example, GPT-3 has 175 billion parameters
while the Wenxin large model (ERNIE 3.0 Titan)
has as many as 260 billion
↓
Because only when the parameter count reaches a certain magnitude,
as if crossing some mysterious tipping point,
does a large model suddenly "get it" and show emergent abilities
To train at the scale of hundreds of billions or even trillions of parameters,
if you set up computing power the traditional way
(a few GPU servers pooled together)
you'd be at it for a thousand and one nights
(and a thousand and one nights still wouldn't finish it)
↓
Here's an example
Suppose you send NVIDIA's flagship A100 GPU into battle
to train GPT-3's 175 billion parameters
In theory, a single card would need 32 years
Before it got anywhere at all,
it would crash into the "Walls of Sighs":
the compute wall丨the memory wall丨the communication wall
▌The compute wall: the huge gap between a single card's computing power and the model's total compute requirement. An A100 delivers only 312 TFLOPS, while training GPT-3 requires about 314 ZFLOPs of total compute, a gap of 9 orders of magnitude.
▌The memory wall: a single card cannot hold all the parameters of a large model. GPT-3's 175 billion parameters alone need 700 GB of memory (at 4 bytes per parameter), while an NVIDIA A100 has only 80 GB.
▌The communication wall: under distributed training, the compute units in a cluster must frequently synchronize parameters, and communication performance caps the overall training speed. Handle it badly, and a bigger cluster can actually mean lower training efficiency.
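To make the first two walls concrete, here is a back-of-the-envelope calculation in Python, using only the figures quoted above (314 ZFLOPs of total compute, 312 TFLOPS per A100, 175 billion parameters at 4 bytes each, 80 GB per card):

```python
# Back-of-the-envelope numbers for the compute wall and memory wall,
# using the figures quoted above (illustrative only).

TOTAL_FLOPS = 314e21          # ~314 ZFLOPs to train GPT-3
A100_FLOPS = 312e12           # 312 TFLOPS peak for one A100
PARAMS = 175e9                # GPT-3 parameter count
BYTES_PER_PARAM = 4           # fp32 storage
A100_MEM_GB = 80

# Compute wall: one card running flat out would need ~32 years.
years_single_card = TOTAL_FLOPS / A100_FLOPS / (365 * 24 * 3600)

# Memory wall: the parameters alone take 700 GB, so even just
# *storing* them needs at least 9 cards' worth of memory.
params_gb = PARAMS * BYTES_PER_PARAM / 1e9
min_cards_for_params = -(-params_gb // A100_MEM_GB)  # ceiling division

print(round(years_single_card), params_gb, min_cards_for_params)
```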
So, to train Wenxin Yiyan,
Baidu went all in
and built the largest high-performance GPU cluster
in China's cloud computing market
↓
This giant cluster packs overwhelming firepower
Let's break it down piece by piece
↓
The compute nodes in the cluster are AI servers
These AI servers have a name: X-MAN
a super AI computer custom-built by Baidu AI Cloud,
now in its 4th generation
First, single-node performance was pushed to the limit:
8 GPUs packed into one chassis,
delivering 134 GB/s of Allreduce bandwidth within a single machine
So each node is
a compact powerhouse firing on all cylinders
With single nodes sorted, it's time to form the team (the cluster)
To unlock the combat power of the whole group,
you can't just pile hardware together
It takes careful architecture design, like arraying troops in formation
↓
For the cluster network design, Baidu AI Cloud
started entirely from the real needs of large-model training
(For example, during training, Allreduce traffic between same-ranked GPUs dominates the network. How do you achieve high throughput and low latency?)
Baidu adopted a three-layer CLOS architecture with InfiniBand (IB) networking
to squeeze maximum performance out of the entire cluster
And so
Baidu rolled out the largest IB-networked GPU cluster in China:
up to 16,000 GPUs across 2,000+ AI servers,
delivering EFLOPS-level computing power in a single cluster
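The Allreduce operation mentioned above is the collective that dominates network traffic in distributed training: every GPU must end up with the sum of every GPU's gradients. A minimal simulation of *ring* all-reduce (one common implementation; whether Baidu's stack uses exactly this variant is not stated here) shows why a well-designed network matters, since each link only ever carries 1/n of the data per step:

```python
def ring_allreduce(vectors):
    """Toy simulation of ring all-reduce. Each of the n nodes starts
    with its own gradient vector, split into n equal chunks; after
    2*(n-1) ring steps, every node holds the elementwise sum."""
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "pad the vector so it splits into n equal chunks"
    c = size // n
    # buf[i][k] = node i's current copy of chunk k
    buf = [[list(v[k * c:(k + 1) * c]) for k in range(n)] for v in vectors]

    # Phase 1: scatter-reduce. After n-1 steps, node i holds the fully
    # summed chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            k = (i - 1 - s) % n              # chunk arriving from the left
            left = buf[(i - 1) % n][k]
            buf[i][k] = [a + b for a, b in zip(buf[i][k], left)]

    # Phase 2: all-gather. Completed chunks circulate once more around
    # the ring, overwriting stale copies.
    for s in range(n - 1):
        for i in range(n):
            k = (i - s) % n
            buf[i][k] = list(buf[(i - 1) % n][k])

    return [[x for chunk in b for x in chunk] for b in buf]
```

With 4 nodes, each link carries only a quarter of the vector per step, which is why Allreduce throughput (like the 134 GB/s intra-node figure above) is the headline number for a training fabric.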
Of course, this behemoth of a cluster
wasn't built overnight either
2021: Baidu AI Cloud begins building a new generation of high-performance GPU clusters
2022: the cluster is completed, holding over 10,000 cards and delivering EFLOPS-level computing power in a single cluster
2023: the cluster powers the rapid launch of Wenxin Yiyan
and it keeps on expanding...
At this point, the large cluster is Ready
But did you think that once you put the large model on the cluster,
it just runs happily ever after?
That the engineers can finally breathe?
Training a large model
is a process of full-stack, hardware-software collaboration
If any single link drops the ball, training fails
That's why so many people want to know:
how exactly was Wenxin Yiyan trained?
↓
Training a large model → relies on the AI foundation
Baidu's "AI foundation" (AI 大底座)
is Baidu's full-stack, self-developed AI infrastructure
Viewed as a cloud-AI integrated architecture,
it has three layers, bottom to top:
↓
the chip layer, the framework layer, and the model layer
The capabilities of these three layers of the tech stack
are integrated into two engineering platforms: Baidu Baige and the AI Zhongtai (AI middle platform)
Together, they form Baidu's AI foundation
So how is this foundation used?
Let's take a closer look at
the training process of a large model
↓
Step ❶: split the model and set the strategy
Large models must be trained with distributed training:
break one huge task into countless small tasks,
then place those small tasks
on different GPUs or XPUs across the cluster
When assigning tasks,
you need a "parallel strategy": where to put the knife, and how to cut
In this step, Baidu PaddlePaddle is the strategist
PaddlePaddle is one of the industry's top AI frameworks
Its "4D hybrid parallel strategy" is best-in-class,
supporting month-scale training of hundred-billion-parameter models
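One way to picture "4D hybrid parallelism": every GPU gets a coordinate along four axes of parallelism, for example data, pipeline, tensor (model), and sharding. The axis names, ordering, and sizes below are illustrative assumptions, not PaddlePaddle's actual API; the sketch just shows how a flat GPU rank maps onto a 4-axis device grid:

```python
def rank_to_coords(rank, dims):
    """Map a flat GPU rank to a coordinate in a 4-axis device grid.
    dims could be the (data, pipeline, tensor, sharding) parallel
    degrees, whose product equals the total number of GPUs."""
    coords = []
    for d in reversed(dims):        # peel off the fastest-varying axis first
        coords.append(rank % d)
        rank //= d
    return tuple(reversed(coords))

# e.g. 256 GPUs split as 2-way data x 4-way pipeline
#      x 4-way tensor x 8-way sharding (hypothetical layout)
GRID = (2, 4, 4, 8)
```

Each small task then lands on the GPU whose coordinate matches its slice of the model, which is what makes the later placement step tractable.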
Now the task has been cut into "shards",
waiting to be placed into the compute cluster for training
But in a cluster this large,
do you know how the devices are linked to one another?
Do you know which are healthy and which have failed?
Step ❷: perceive the cluster topology and take stock of compute resources
This is where Baidu Baige makes its entrance
It provides powerful AI computing power
and has keen cluster-topology awareness:
it can sense the computing power of every server,
how many GPUs, CPUs, and XPUs there are and which are idle or busy,
and how every node is connected
(server ↔ server, GPU ↔ GPU)
Then
Baidu Baige hands this "general ledger"
over to PaddlePaddle
↓
Next, based on this map, PaddlePaddle
draws another one: the "unified logical computing view"
OK, all preparations are Ready
With the maps in hand, there's nothing to panic about
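A topology "map" can start as simply as knowing which GPUs share a server, because that determines the bandwidth between any two of them. Here is a toy model; the 134 GB/s figure is the intra-node Allreduce bandwidth quoted earlier, while the inter-node number is a made-up placeholder, not a published spec:

```python
def link_bandwidth(gpu_a, gpu_b, gpus_per_node=8,
                   intra_gbps=134, inter_gbps=25):
    """Toy topology model: GPUs in the same server talk over the fast
    intra-node fabric; GPUs in different servers cross the IB network.
    intra_gbps matches the single-machine Allreduce bandwidth quoted
    above; inter_gbps is an illustrative placeholder."""
    same_node = gpu_a // gpus_per_node == gpu_b // gpus_per_node
    return intra_gbps if same_node else inter_gbps
```

Real topology awareness tracks far more (link health, switch tiers, busy/idle state), but even this two-level view is enough to make placement decisions meaningfully better than random assignment.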
↓
Step ❸: PaddlePaddle starts dispatching jobs automatically
The small tasks cut out earlier
are assigned to different GPUs/XPUs for training
This step is the most time-consuming and the most expensive;
both efficiency and cost have to be weighed
Here, PaddlePaddle follows the two maps obtained earlier
and executes an "optimal dispatch strategy"
↓
balancing communication needs against compute resources (bandwidth, links),
and cost against efficiency: fast, and money-saving
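A minimal flavor of communication-aware dispatch: co-locate the tasks that talk to each other the most, so their traffic stays on fast intra-node links. This greedy sketch is purely illustrative; the real scheduler weighs many more constraints (memory, link health, cost) than this:

```python
def place_tasks(traffic, n_nodes, slots_per_node):
    """Greedy sketch of communication-aware placement: visit task pairs
    in order of decreasing traffic and try to put the heaviest talkers
    on the same node. Falls back to the least-loaded node when the
    partner's node is full or the partner is unplaced."""
    placement, load = {}, [0] * n_nodes
    for (a, b), _vol in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for task, partner in ((a, b), (b, a)):
            if task not in placement:
                node = placement.get(partner)
                if node is None or load[node] >= slots_per_node:
                    node = min(range(n_nodes), key=lambda x: load[x])
                placement[task] = node
                load[node] += 1
    return placement
```

For example, if tasks 0↔1 and 2↔3 exchange heavy gradients while 0↔2 barely talk, the sketch puts 0 and 1 on one node and 2 and 3 on another.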
Step ❹: training and inference. Faster! Faster! Faster!
In just over a month of internal testing, Wenxin Yiyan
completed 4 major technical upgrades,
leaving people in the industry stunned
How can it iterate so fast?
On the one hand, the hardware cluster foundation is strong enough:
on Baidu's thousand-card-scale cluster,
multi-card linear speedup reaches 90%
On the other hand, the software acceleration capability is explosive
For the two most time-consuming steps, training and inference,
Baidu developed its own secret weapons: acceleration toolkits
During training, Wenxin Yiyan adopted
a range of optimized training-acceleration techniques, including…
↓
This acceleration is also NO.1 in head-to-head benchmarks! In the MLPerf Training v2.1 results released in November 2022, the model-training performance Baidu submitted using PaddlePaddle plus Baidu Baige ranked first in the world under the same GPU configuration, with both end-to-end training time and training throughput surpassing the NGC PyTorch framework.
During inference, Wenxin Yiyan adopted
a range of inference-acceleration optimizations
that can optimize models exported from AI frameworks,
speeding up inference and improving resource utilization
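One widely used inference optimization in this family is low-precision quantization: store weights as 8-bit integers plus a scale factor, cutting memory and bandwidth roughly 4x versus fp32. Whether Wenxin Yiyan's pipeline uses exactly this technique is not stated above, so treat this as a generic illustration:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-max, max] onto
    integers in [-127, 127], keeping one fp scale per tensor.
    (Generic illustration, not a description of ERNIE's pipeline.)"""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate fp values; error is at most scale / 2."""
    return [q * scale for q in quantized]
```

The accuracy cost is a rounding error of at most half a quantization step per weight, which is why int8 inference usually needs only light calibration rather than retraining.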
Step ❺: the endless training loop
Resource management and task scheduling interact continuously,
with Baidu Baige and Baidu PaddlePaddle as the left and right guardians
↓
Baidu Baige:
provides high-performance compute, network, and storage resources for AI tasks,
senses the resource demands of AI tasks in real time,
and schedules matching resources for each AI task
Baidu PaddlePaddle:
based on the latest cluster changes reported by Baige,
automatically adjusts model partitioning and AI-task placement
Together,
they keep large-scale training efficient
and greatly improve adaptive distributed-training performance
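The feedback loop described above can be sketched as: the platform reports live cluster state every step, and the framework re-plans placement only when that state actually changes. `FakeCluster` and `FakePlanner` below are hypothetical stand-ins, not real Baige or PaddlePaddle APIs:

```python
class FakeCluster:
    """Hypothetical stand-in for the resource platform: replays a
    scripted sequence of cluster states (e.g. a GPU dropping out)."""
    def __init__(self, states):
        self._states = iter(states)

    def poll(self):
        return next(self._states)

class FakePlanner:
    """Hypothetical stand-in for the framework's partitioning logic."""
    def replan(self, state):
        return f"plan-for-{state}"

def training_loop(cluster, planner, steps):
    """Each step: poll cluster state; if it changed (node failure,
    new capacity), re-split the model and re-place tasks."""
    plan, seen, replans = None, None, 0
    for _ in range(steps):
        state = cluster.poll()
        if state != seen:               # topology changed -> adapt
            plan = planner.replan(state)
            seen = state
            replans += 1
    return plan, replans
```

So if one GPU fails mid-run, the next poll triggers exactly one re-plan instead of restarting training from scratch, which is the essence of adaptive distributed training.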
So, large models look daunting,
but with the right tools, they're not so hard after all
(unfortunately, most people don't have the tools)
The biggest tool behind Wenxin Yiyan is
the "AI foundation" of Baidu AI Cloud
All of its capabilities are now open to the public
This foundation is extremely versatile,
handling all kinds of industries and niche scenarios with ease
↓
For most users,
building on a mature AI foundation
means the pitfalls have already been filled in for you
↓
① Faster AI R&D: good tools and platforms are ready-made, with full-stack software-hardware support, fewer pitfalls, and easier adoption
② Advanced technology: stand on the shoulders of giants to spot trends and do R&D without detours
③ Flexible delivery: central cloud, edge cloud (BEC), local computing cluster (LCC), private cloud (ABC Stack), and more
They say the AI singularity has arrived. Amid such a sea change,
you can have a taste of China's hottest large model, Wenxin Yiyan,
or tap the mysterious power behind it,
the "AI foundation" of Baidu AI Cloud,
to refine your own industry "elixir"
Easter egg: when in doubt, ask Yiyan
The secrets of training large models have all been laid out
But when it came to titling this post,
we hit a problem: we couldn't decide which title to pick
What to do? Ask Wenxin Yiyan, of course
↓
In the end, Wenxin Yiyan picked the title for us
Pretty sharp reasoning, right?
* This article comes from the official account: 特大号 (Extra Large)