Behind the large model: ten years of Baidu's hard work


The hottest topic in tech circles right now is the large model

And the most talked-about large model in China is Wenxin Yiyan

This model was ground out by thousands of engineers at our company

who already have the dark circles to prove it


For a super-large model like Wenxin Yiyan

the training process is brutal — everyone "coughs up blood"

The engineers have probably cried in the restroom more than a few rounds


Today, let's talk technology:

training a large model like Wenxin Yiyan —

how hard is it? How brutal? How much blood gets coughed up?


First

To refine a large model → you must first build a large cluster

A large cluster means an ultra-large-scale GPU computing cluster

Only a large cluster can hold a large model

Usually, a model with hundreds of billions of parameters counts as a large model.

For example, GPT-3 has 175 billion parameters

while the Wenxin large model (ERNIE 3.0 Titan)

has as many as 260 billion parameters



Because only when the parameter count reaches a certain magnitude,

as if crossing some mysterious tipping point,

does the large model suddenly "get it"


To train at the scale of hundreds of billions or even trillions of parameters,

if you provision compute the traditional way —

stack a few GPU servers into a compute pool —

that's One Thousand and One Nights

(a thousand and one nights wouldn't be enough to finish)


Take an example

If you send NVIDIA's flagship A100 GPU into battle

to train the 175-billion-parameter GPT-3,

a single card would in theory take 32 years


You haven't even gotten going

before failing miserably at the "Walls of Sighs":

Compute Wall丨Memory Wall丨Communication Wall


▌The compute wall refers to the huge gap between a single card's compute and the model's total compute requirement. An A100's single-card peak is only 312 TFLOPS, while training GPT-3 requires about 314 ZFLOPs of total compute — a gap of 9 orders of magnitude.

▌The memory wall refers to the fact that a single card cannot hold a large model's parameters. GPT-3's 175 billion parameters alone need 700 GB of memory (at 4 bytes per parameter), while an NVIDIA A100 GPU has only 80 GB.

▌The communication wall comes from the frequent parameter synchronization between compute units in distributed training; communication performance directly caps overall training speed. If it is not handled well, the larger the cluster grows, the lower the training efficiency.
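A quick back-of-the-envelope check of the figures behind these three walls (a sketch in Python; all numbers come from the text above, and the 100%-utilization assumption is deliberately optimistic):

```python
# Sanity-check the "three walls" arithmetic with the figures cited above.
import math

A100_FLOPS = 312e12           # A100 single-card peak: 312 TFLOPS
GPT3_TOTAL_FLOPS = 314e21     # GPT-3 total training compute: 314 ZFLOPs
GPT3_PARAMS = 175e9           # 175 billion parameters
BYTES_PER_PARAM = 4           # FP32, as assumed in the text
A100_MEMORY_GB = 80

# Compute wall: gap between one card's rate and the total work.
orders = math.log10(GPT3_TOTAL_FLOPS / A100_FLOPS)
print(f"compute gap: ~{orders:.0f} orders of magnitude")   # ~9

# Single-card training time at 100% utilization (wildly optimistic).
years = GPT3_TOTAL_FLOPS / A100_FLOPS / (3600 * 24 * 365)
print(f"single A100: ~{years:.0f} years")                  # ~32 years

# Memory wall: parameters alone vs. one card's memory.
params_gb = GPT3_PARAMS * BYTES_PER_PARAM / 1e9
print(f"parameters need {params_gb:.0f} GB vs. {A100_MEMORY_GB} GB per A100")
```

The numbers line up: roughly 9 orders of magnitude, about 32 years on one card, and 700 GB of parameters against 80 GB of memory.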


So, to train Wenxin Yiyan,

Baidu went all out

and built the largest high-performance GPU cluster

in China's cloud computing market


This giant cluster packs overwhelming firepower

Let's break it down and take a look


The compute nodes in the cluster are AI servers

These AI servers are named X-MAN —

a super AI computer custom-built by Baidu AI Cloud,

now in its 4th generation


First, we pushed single-node performance to the limit:

8 GPUs packed into one chassis,

delivering 134 GB/s of Allreduce bandwidth within a single machine

So each node is

a compact powerhouse running at full compute


With the single node done, it's time to form the team (the cluster)

To unleash the combat power of the whole group,

it's not enough to simply pile up hardware —

what matters is careful architecture design, like deploying troops in formation

In [cluster network design], Baidu AI Cloud

started entirely from the real needs of large-model training

(For example, during training, Allreduce among same-numbered cards across machines generates the largest network traffic. How do you achieve high throughput and low latency?)

Baidu adopted a three-tier CLOS architecture with InfiniBand (IB) networking,

instantly squeezing maximum performance out of the whole cluster
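To see why Allreduce traffic drives the network design, here is a rough sketch of the per-card data volume in a ring allreduce — the textbook formula, with hypothetical gradient size and worker count, not Baidu's actual configuration:

```python
# In a ring allreduce over N workers, each worker sends (and receives)
# about 2 * (N - 1) / N times the gradient size per synchronization.
# (Illustrative numbers only -- not Baidu's actual setup.)

def ring_allreduce_traffic_gb(grad_size_gb: float, n_workers: int) -> float:
    """Data each worker transmits in one ring allreduce, in GB."""
    return 2 * (n_workers - 1) / n_workers * grad_size_gb

# Hypothetical example: 1.4 GB of gradients (say, a 700M-parameter shard
# in FP16) synchronized across the 8 GPUs of one node.
traffic = ring_allreduce_traffic_gb(1.4, 8)
print(f"{traffic:.2f} GB per worker per allreduce")  # 2.45 GB

# At the 134 GB/s intra-node Allreduce bandwidth cited above, one such
# step takes ~18 ms inside a machine; across machines it is bounded by
# the network instead -- which is why the CLOS + IB topology matters.
print(f"~{traffic / 134 * 1000:.0f} ms at 134 GB/s")
```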


And so

Baidu rolled out the largest IB-networked GPU cluster in China,

supporting 16,000 GPU cards across 2,000+ AI servers

and delivering EFLOPS-level compute in a single cluster


Of course, this behemoth of a cluster

wasn't built overnight either

In 2021, Baidu AI Cloud began building this new generation of high-performance GPU clusters

In 2022, the cluster was completed, holding more than 10,000 cards and delivering EFLOPS-level compute in a single cluster

In 2023, the cluster powered the rapid launch of Wenxin Yiyan

And the cluster keeps expanding...


At this point, the large cluster is Ready

But did you think that once the model lands on the cluster,

it just runs happily ever after?

That the engineers can finally breathe?


Training a large model

is full-stack collaboration between software and hardware

If any link drops the ball, training grinds to a halt


So, many people want to know:

how exactly was Wenxin Yiyan trained?

Training a large model → relies on the AI Big Base

Baidu's "AI Big Base"

is Baidu's fully self-developed, full-stack AI infrastructure

Viewed as a cloud-AI integrated architecture,

it has three layers from bottom to top:

the chip layer, the framework layer, and the model layer

The capabilities of this three-layer stack

are integrated into two engineering platforms: Baidu Baige and the AI middle platform (AI Zhongtai)

Together they form Baidu's AI base


How is this base used?

Let's take a closer look at

the training process of a large model

Step ❶: split the model, set the strategy

Large-model training has to be distributed training:

break one big job into countless small tasks,

then place those small tasks

on different GPUs or XPUs in the cluster to train


When assigning tasks,

you need a "parallel strategy" — deciding where to make the cuts

In this step, Baidu PaddlePaddle is the strategist

PaddlePaddle is one of the industry's top three AI frameworks

Its "4D hybrid parallel strategy" is second to none:

it can train hundred-billion-parameter models in about a month
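As a rough illustration of what "4D" means here: the cluster's cards are factored across four parallel dimensions whose product must equal the card count. The dimension names and numbers below are a generic sketch of hybrid parallelism, not PaddlePaddle's actual strategy for any particular model:

```python
# Illustrative sketch of "4D" hybrid parallelism: cards are factored into
# four dimensions, and their product must equal the total card count.
# (Hypothetical numbers -- not PaddlePaddle's actual plan for any model.)

from dataclasses import dataclass

@dataclass
class ParallelPlan:
    data: int       # replicas training on different batches
    tensor: int     # cards splitting each layer's matrices
    pipeline: int   # stages splitting the layer stack
    sharding: int   # groups sharding optimizer state

    def total_cards(self) -> int:
        return self.data * self.tensor * self.pipeline * self.sharding

# e.g. a hypothetical plan for a 1024-card job:
plan = ParallelPlan(data=16, tensor=8, pipeline=8, sharding=1)
assert plan.total_cards() == 1024

# Each small task ("shard") is addressed by its coordinate in this 4D grid:
# (data_rank, tensor_rank, pipeline_stage, sharding_rank) -> one GPU/XPU.
```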


Now the job has been cut into "shards",

waiting to be placed on the compute cluster for training

But in a cluster this large,

do you know how the devices are linked to one another?

Do you know which ones are healthy and which have failed?

Step ❷: sense the cluster topology, take stock of compute resources

Enter Baidu Baige,

which provides powerful AI compute

and superb cluster topology awareness


It can sense each server's compute:

how many GPUs, CPUs, and XPUs there are, and whether they are idle or busy

It can sense how the nodes connect to each other:

Server↔Server, GPU↔GPU


Then

Baidu Baige hands this "general ledger"

over to PaddlePaddle for processing


Next, based on this map, PaddlePaddle

draws another one: the "unified logical computation view"

OK, all the prep work is Ready

With the maps in hand, there's no need to panic


Step ❸: PaddlePaddle starts dispatching jobs automatically

The small tasks cut out earlier

are assigned to different GPUs/XPUs for training

This is the most time-consuming and expensive step:

efficiency and cost both have to be weighed

So PaddlePaddle follows the two maps obtained earlier

and executes an "optimal placement strategy"



balancing communication needs against compute resources (bandwidth, links),

and cost against efficiency — fast, and cheap

Step ❹: training and inference — accelerate! Accelerate! Accelerate!

In just over a month of internal testing, Wenxin Yiyan

completed 4 major technical upgrades,

leaving the industry stunned


Why can it iterate so fast?

On one hand, the hardware cluster foundation is strong enough:

Baidu's thousand-card-scale cluster

achieves up to 90% multi-card linear scaling efficiency
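"90% multi-card linear speedup" means N cards deliver 90% of N times one card's throughput. A quick illustration — the single-card throughput below is made up; only the 90% figure comes from the text:

```python
# What "90% linear speedup" buys you, with a hypothetical single-card rate.

def effective_throughput(single_card: float, n_cards: int, efficiency: float) -> float:
    """Cluster throughput = ideal linear scaling x scaling efficiency."""
    return single_card * n_cards * efficiency

single = 100.0  # samples/sec on one card (made-up number)
ideal = effective_throughput(single, 1000, 1.0)    # 100,000 samples/sec
real = effective_throughput(single, 1000, 0.90)    # 90,000 samples/sec
print(f"1000 cards: {real:.0f} of an ideal {ideal:.0f} samples/sec")
```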

On the other hand, the software acceleration is explosive:

for the two most time-consuming stages — training and inference —

Baidu developed its own secret weapons: acceleration toolkits


During training, Wenxin Yiyan adopted

a variety of training acceleration techniques, including…


This kind of "acceleration" also ranks NO.1 in head-to-head benchmarks! In the MLPerf Training v2.1 results released in November 2022, the model training results Baidu submitted using PaddlePaddle plus Baidu Baige ranked first in the world under the same GPU configuration, with both end-to-end training time and training throughput surpassing the NGC PyTorch framework.


For inference, Wenxin Yiyan adopted

a variety of inference acceleration optimizations

that optimize the models output by the AI framework,

speeding up inference and improving resource utilization



Step ❺: the endless training loop

Resource management and task scheduling interact with each other continuously,

with Baidu PaddlePaddle and Baidu Baige standing guard on either side


Baidu Baige

provides AI tasks with all kinds of high-performance compute, network, and storage resources,

senses each AI task's resource demands in real time,

and schedules matching resources for every task

Baidu PaddlePaddle,

based on the latest cluster changes reported by Baige,

automatically adjusts model partitioning and AI task placement strategies


With all this in place,

large-scale training stays efficient,

and adaptive distributed training performance improves dramatically


So large models look hard,

but with the right tools, not so much

(unfortunately, most people don't have the tools)

The biggest tool behind Wenxin Yiyan is

Baidu AI Cloud's [AI Big Base]

All of the base's capabilities are now open to the public

This base is extremely versatile,

easily handling every industry and every niche scenario


For most users,

building on a mature AI base

means the pitfalls have already been filled in:

① Faster AI R&D: good tools and platforms are ready-made, with full-stack software and hardware support — fewer pitfalls, easier to use

② Advanced technology: stand on the shoulders of giants to spot trends and do R&D without detours

③ Flexible delivery: multiple delivery modes — central cloud, edge cloud BEC, local computing cluster LCC, private cloud ABC Stack, and more



They say the AI singularity has arrived. Amid such great change,

you can get a taste of China's hottest large model, Wenxin Yiyan,

or harness the mysterious power behind it —

Baidu AI Cloud's "AI base" —

to refine your own industry "elixir"



Easter egg: when in doubt, ask Yiyan

The secrets of training large models have all been laid out

But when it came to titling this post,

we were stuck — we didn't know which one to pick

What to do? Ask Wenxin Yiyan, of course



In the end, Wenxin Yiyan picked the title for us

Pretty impressive reasoning, right?


* This article comes from the official account: extra large



Origin blog.csdn.net/weixin_48493350/article/details/131179787