In the era of large models' "brute-force computing", how does Huawei Ascend break through the computing power bottleneck? | WAIC 2023

Text | Yao Yue
Editor | Wang Yisu

"In the past two years, the large-scale model has brought about a 750-fold increase in computing power demand, while the hardware computing power supply (increase) has only tripled." Zhang Dixuan, President of Huawei's Ascend Computing Business, unveiled the The truth about the huge computing power gap caused by the "violent calculation" of the large model.

And this computing power gap is still widening. Zhang Dixuan predicts that by 2030, the computing power required by AI will be 500 times that of 2020.

At the same time, for well-known reasons, the localization of computing power has become urgent.

As for how to make up for the computing power shortfall, Zhang Qingjie, KPMG China's digital empowerment lead, believes it needs to be addressed on three fronts: computing power construction together with infrastructure sharing and optimization; algorithm optimization; and data quality. Of these, computing power construction comes first.

Huawei has been quite active in computing power construction in recent years. According to a July research report from CITIC Securities, Huawei accounts for about 79% of China's existing urban intelligent computing centers by number built.

Beyond winning on quantity, improving the capability of the computing power cluster matters more. At the 2023 World Artificial Intelligence Conference, Huawei announced a full upgrade of the Ascend AI cluster: its scale has expanded from the initial 4,000 cards to 16,000 cards, ushering the computing power cluster into the ten-thousand-card era.

Hu Houkun, Huawei's rotating chairman, said that the Ascend AI cluster effectively designs the AI computing center as one supercomputer, which improves the cluster's performance efficiency by more than 10% and more than doubles system stability.

Zhang Dixuan also told Light Cone Intelligence in a group interview that as early as 2018, Huawei judged that artificial intelligence would develop rapidly and moved away from the small-model development pattern of the past, toward a paradigm of large computing power combined with big data to produce large models. That was when Huawei began developing computing power cluster products.

In the AI era, computing power can no longer be increased simply by stacking chips, as in the standalone-machine era; the computing power infrastructure must be systematically reshaped. While expanding the supply of massive computing power, the problems of computing power utilization and the high threshold of use must also be solved, ultimately turning computing power into an ecosystem.

The computing power cluster enters the "ten-thousand-card" era

After ChatGPT ignited demand for computing power this year, GPUs were the first hardware to take off. Nvidia's total market value has risen 66% this year, most recently reaching 1.05 trillion US dollars.

Nvidia A100 GPUs have become a must-have for large models, but stacking cards alone cannot cope with the outbreak of the "war of a hundred models". So how can precious computing resources be used to maximum effect?

Since a single server can hardly meet the computing demand, connecting many servers into a "supercomputer" is becoming the main direction of computing power infrastructure. That "supercomputer" is the computing power cluster.

In 2019, Huawei released the Atlas 900 AI training cluster, composed of thousands of Huawei's self-developed Ascend 910 AI chips (used mainly for training). By June this year it supported 8,000 cards, and at the just-concluded World Artificial Intelligence Conference, Huawei further announced plans to expand the cluster to more than 16,000 cards by the end of this year or early next year.

What does a ten-thousand-card cluster mean in practice?

Take training the 175-billion-parameter GPT-3 model as an example: with 8 V100 cards, training is estimated to take 36 years; with 512 V100 cards, close to 7 months; with 1,024 A100 cards, training time can be cut to about 1 month.

By Huawei's measurements, training a GPT-3-class model on 100B of data takes one day on an 8,000-card Atlas 900 AI cluster, and only half a day on a 16,000-card cluster.

But however large the computing power and efficiency of a ten-thousand-card cluster, actually using it to train models is not easy.

As Gao Wen, academician of the Chinese Academy of Engineering, put it: "Some say that only a few thousand people in the world can do model selection and joint debugging on 1,000 cards at once, that no more than 100 can train on 4,000 cards, and that even fewer can train models beyond that." Running training and inference on thousand-card and ten-thousand-card clusters is an enormous challenge for software planning and resource scheduling.

First of all, ten-thousand-card training places higher demands on distributed parallel training. Distributed parallel training is an efficient machine learning method that splits a large dataset into multiple parts and trains the model in parallel across multiple computing nodes, greatly reducing training time and improving model accuracy and reliability.
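To make the idea concrete, here is a minimal single-process sketch in plain Python/NumPy (illustrative only, not Ascend or MindSpore code): the dataset is split across hypothetical workers, each computes a gradient on its own shard, and the gradients are averaged just as an all-reduce would synchronize them on a real cluster.

```python
import numpy as np

# Toy linear-regression step simulated across 4 "workers" (data parallelism).
# On a real cluster each worker is a device; the mean below stands in for
# the all-reduce that synchronizes gradients every step.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))                  # full dataset
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1024)
w = np.zeros(8)                                 # model replicated on every worker

num_workers = 4
X_shards = np.array_split(X, num_workers)       # each worker sees one shard
y_shards = np.array_split(y, num_workers)

for step in range(100):
    grads = []
    for Xs, ys in zip(X_shards, y_shards):      # runs in parallel on real hardware
        err = Xs @ w - ys
        grads.append(2 * Xs.T @ err / len(ys))  # local gradient on local data
    g = np.mean(grads, axis=0)                  # "all-reduce": average gradients
    w -= 0.05 * g                               # identical update on every replica

print("final loss:", float(np.mean((X @ w - y) ** 2)))
```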

Distributed parallel training on the Ascend computing power cluster relies on Huawei's self-developed MindSpore AI framework.

MindSpore supports multiple model types and has developed an automatic hybrid-parallel solution that combines data parallelism with model parallelism in one training job.

Under the same computing power and network, this dual-parallel strategy achieves a higher compute-to-communication ratio, while removing the practical difficulty of hand-crafting parallel architectures and improving the efficiency of developing and tuning large models.
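Model parallelism, the other half of the hybrid strategy, splits the weights themselves across devices. A minimal sketch, again in plain NumPy and assuming a simple column split of one linear layer (not MindSpore's actual auto-parallel implementation):

```python
import numpy as np

# Model parallelism for one linear layer: the weight matrix is split by
# output columns across 2 "devices", each computing a partial result that
# is then concatenated (an all-gather on a real cluster). Combined with the
# data sharding above, this is the data+model hybrid the text describes.
rng = np.random.default_rng(1)
x_batch = rng.normal(size=(16, 32))   # activations for one local data shard
W = rng.normal(size=(32, 64))         # full layer weight, conceptually too big for one device

W_parts = np.split(W, 2, axis=1)                   # each device holds half the columns
partial_outs = [x_batch @ Wp for Wp in W_parts]    # computed on separate devices
out = np.concatenate(partial_outs, axis=1)         # "all-gather" of partial outputs

assert np.allclose(out, x_batch @ W)  # identical to the unsplit computation
```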

In addition, with distributed parallel training, all chips must synchronize every time a training result is produced, and each synchronization carries some probability of error. At ten-thousand-card scale, this places much higher demands on stability.

"Shengteng's reliability and usability design can achieve 30 days of long-term stable training. Compared with the industry's most advanced level of about 3 days, it has improved the performance stability and usability advantages by nearly 10 times." Zhang Dixuan said.

How to improve the efficiency of the computing power cluster?

A computing power cluster must not only grow in scale but also improve greatly in efficiency; otherwise, the more cards there are, the lower the computing power utilization becomes.

Take the thousand-card-scale AI cluster Huawei deployed in Ulanqab, Inner Mongolia: at the same computing power, computing efficiency is improved by more than 10%.

By Ascend's metrics, 1,000 cards deliver about 300P of computing power, so this efficiency gain yields roughly an extra 30P per 1,000 cards and an extra 300P per 10,000 cards.
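The arithmetic behind those figures, assuming the roughly 10% efficiency gain scales linearly with cluster size:

```python
# Assuming ~300P per 1,000 cards and a ~10% cluster-level efficiency gain
per_1000_cards = 300            # P of computing power per 1,000 cards
gain = 0.10                     # ~10% efficiency improvement

print(per_1000_cards * gain)        # 30.0  -> ~30P extra per 1,000 cards
print(per_1000_cards * 10 * gain)   # 300.0 -> ~300P extra per 10,000 cards
```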

"300P computing power can process billions of images, tens of millions of people's DNA, and about 10 years of autonomous driving data in 24 hours." A person engaged in cloud computing business said to Guangcone Intelligent that improving computing power efficiency is also Reduced computational cost.

If going from 300P at 1,000 cards to 3,000P at 10,000 cards can rely on stacking cards to "work miracles", this 10% efficiency improvement requires far more complex, systematic upgrades.

Beyond integrating Huawei's combined strengths in cloud, computing, storage, networking, and energy, the Ascend computing power cluster has also innovated architecturally.

Taking a server as one node, Huawei has pioneered a peer-to-peer architecture at the computing-node level, breaking through the performance bottleneck of traditional CPU-centric heterogeneous computing, raising overall computing bandwidth, and reducing latency; performance improves by 30%.

In addition, computing power consumes enormous amounts of electricity, and when hundreds of servers are combined, energy consumption must be reduced in step.

As computing power rises, server energy consumption keeps climbing, and traditional air cooling can no longer handle the heat. The urgent problem is how to guarantee server cooling capacity under strict policy limits on PUE (power usage effectiveness).

Among several heat dissipation routes, liquid cooling is considered to be one of the mainstream solutions.

A liquid cooling solution is inherently more power-efficient than traditional air cooling. Ascend adopts precision supply that delivers coolant directly to each chip, which also cuts maintenance costs and reduces the risk of coolant leaks polluting the environment.

"Accurate supply depends on the sensors and electronically controlled valves on the chip board, coupled with central control, which can provide refined cold delivery for different chips under different loads." Huawei computing staff Xiang Guangcone Intelligence introduce.

In November 2021, the National Development and Reform Commission and other departments issued a document requiring newly built large and ultra-large data centers to keep PUE below 1.3, and data centers in Inner Mongolia, Guizhou, Gansu, and Ningxia to keep PUE below 1.2. Ascend's computing power cluster has achieved a PUE below 1.15.
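For reference, PUE is the ratio of a facility's total energy draw to the energy consumed by IT equipment alone:

```latex
\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}
```

A PUE below 1.15 therefore means cooling, power conversion, and other overheads add less than 15% on top of the IT load, versus up to 30% at the 1.3 regulatory ceiling.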

Lowering the computing power threshold depends on the ecosystem

"Electricity is plug-and-play, and there is basically no need to teach ordinary people how to use it. And computing power, even if you provide it to enterprises, many people will not use it." Wu Hequan, academician of the Chinese Academy of Engineering and director of the China Internet Society Advisory Committee, said, Now the computing power (usage) threshold is too high.

An industry insider also told Light Cone Intelligence: "Small and medium-sized enterprises struggle to get technical support for training servers, and given the weak domestic software ecosystem, it is hard for them to manage on their own."

However powerful a computing power cluster is, if demand-side adoption cannot be unlocked, the development of computing power as a whole will be constrained. For AI computing power to reach electricity's "low-threshold" standard of use, the ecosystem is especially important.

This is also why Nvidia endured Wall Street's skepticism and invested in the CUDA software system regardless of cost. It is CUDA that lets an ordinary student program a graphics card; through software-hardware collaboration, Nvidia built an ecosystem and maximized the supply of computing power.

Beyond Nvidia, Apple demonstrated even earlier how much an ecosystem matters to delivering a good user experience.

Today, Huawei Ascend has built a fully self-developed full-stack software and hardware system, including the Ascend AI cluster hardware series, the heterogeneous computing architecture CANN, the all-scenario AI framework MindSpore, the Ascend application enablement suite MindX, and the one-stop development platform ModelArts. CANN is the core software layer, benchmarked against Nvidia's CUDA + cuDNN.

 Zhang Dixuan said, "Ascend AI supports the original innovation of nearly half of China's original large-scale models, and it is also the only technical route in China that has completed the development and commercialization of large-scale models with hundreds of billions of parameters. The measured training performance of open-source Transformer-type large-scale models can reach industry 1.2 times of that."

Behind all this, Huawei has made the above software open source and opened up the hardware.

First, on the basic software side, Ascend has open-sourced and supported tooling across the whole large-model workflow of development, training, fine-tuning, and inference.

Besides open-sourcing the MindSpore AI framework, Ascend also provides a large-model development kit that supports full-workflow script development in a dozen or so lines of code. In Zhang Dixuan's words, this is "to make large model development work out of the box".

Fine-tuning is the key step that gives a large model its industry character and largely determines application results. Here, Huawei Ascend provides a low-parameter fine-tuning module that integrates multiple fine-tuning algorithms, including LoRA and P-Tuning. According to Zhang Dixuan, tuning only 5% of the parameters can match the effect of full-parameter fine-tuning.
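The article does not show Huawei's module, but the core idea behind low-parameter methods such as LoRA is easy to sketch: freeze the pretrained weight matrix and train only a small low-rank correction. A NumPy illustration with hypothetical layer sizes:

```python
import numpy as np

# LoRA in one layer: freeze the pretrained weight W and learn a low-rank
# update B @ A instead, so only a small fraction of parameters is trained.
d, k, r = 1024, 1024, 8                 # hypothetical layer size, rank r << d
rng = np.random.default_rng(2)

W = rng.normal(size=(d, k))             # pretrained weight: frozen
A = rng.normal(size=(r, k)) * 0.01      # trainable low-rank factor
B = np.zeros((d, r))                    # zero init, so the update starts at 0

def forward(x):
    return x @ (W + B @ A).T            # effective weight = W + B @ A

trainable = A.size + B.size
print(f"trainable fraction: {trainable / W.size:.2%}")   # ~1.56% of the full weight
```

With rank 8 on a 1024x1024 layer, the trainable factors amount to roughly 1.6% of the full weight, which is why such methods can stay well under the 5% figure quoted above.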

In addition, to address problems such as the difficulty and cost of deploying large models for inference, Huawei Ascend has integrated automatic pruning, distillation, and quantization tools for model compression into the MindStudio development tool chain. Zhang Dixuan added that the inference stage supports online distributed inference, allowing applications to go live quickly with inference latency below 50 milliseconds.

On the hardware side, Huawei also provides motherboards, SSDs, network cards, RAID cards, and Atlas modules and boards to support partners in developing AI hardware products.

Given the current shortage of computing power supply, Huawei Ascend has also proposed migration and adaptation solutions focused on "operators and models".

Training-inference integration: the last mile into industry

Once a computing power ecosystem is initially in place, whether it can keep operating healthily ultimately comes back to large-scale commercialization.

"Don't write poetry, just do things." The large-scale model Pangu 3.0 just released by Huawei, like other domestic large-scale models, focuses on the "industry". Moreover, the Pangea model has been used in more than 1,000 projects in many industries such as weather forecasting, drug research and development, and coal preparation.

For domestic large models as a whole, however, deeply meeting industry needs remains a problem.

"The needs of enterprises are very specific, such as 'identify valuable metals in this pile of garbage', which can be done by trained elementary school students, but for large models, this kind of needs of enterprises is too heavy, and it may The final effect is not very good.” A staff member of the enterprise service business friend said to Guangcone Intelligent that directly invoking general AI capabilities cannot meet the widely existing differentiated intelligence needs in the industry.

Huawei divides large models into three levels: L0, L1, and L2. L0 is the basic general-purpose model; adding industry data to L0 and mixed-training yields the industry large model L1; and deploying L1 for the specific downstream scenarios of thousands of industries yields the scenario-specific task model L2.

Now, for Huawei and other large-model players alike, how to quickly produce L2 models from an industry's L1 model, and how to deploy those L2 models to the device, edge, and cloud, has become the last-mile problem for industry applications.

For this last mile, Ascend, together with upstream large-model partners such as iFLYTEK, Zhipu AI, and CloudWalk, has proposed a "training-inference integration" solution.

To put it simply, model training is like the university study phase, inference deployment (running the trained model in a specific environment) is formal employment, and training-inference integration is "learning while practicing".

General-purpose large models are usually trained on broad public literature and web data. The information is mixed, professional knowledge and industry data are thin, and the result is a model that lacks industry focus and accuracy, with too much "noise" in the data. Meanwhile, because industry data is hard to obtain and technology is hard to fuse with industry practice, large models have landed slowly in industry.

Training-inference integration lets the central node dispatch the model to enterprise edge nodes for inference, while the edge sites send data back to the center for algorithm updates and incremental training, giving the model the ability to evolve autonomously. In other words, "the student proactively studies in the direction best suited to employment."
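As a schematic, the loop looks something like the following (an illustrative pseudocode structure only; the function names are hypothetical, not Huawei's deployment API):

```python
# Schematic of the center-edge loop described above.

def central_train(model, new_samples):
    """Central node: incremental training on data fed back from the edge."""
    model["version"] += 1
    model["data_seen"] += len(new_samples)
    return model

def edge_infer(model, inputs):
    """Edge node: serve inference locally, collect samples worth sending back."""
    return [x for x in inputs if x["uncertain"]]   # e.g., low-confidence cases

model = {"version": 1, "data_seen": 0}
for cycle in range(3):                             # continuous train -> deploy -> feedback loop
    inputs = [{"uncertain": i % 4 == 0} for i in range(100)]
    feedback = edge_infer(model, inputs)           # model pushed to edge, runs inference
    model = central_train(model, feedback)         # edge data flows back for retraining

print(model)   # {'version': 4, 'data_seen': 75}
```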

This ensures the cyclical production process from training to inference is no longer severed. Moreover, greater initiative in developing industry large models is handed to industries and enterprises themselves, which best fits the industry's AI application and development scenarios and deeply integrates AI infrastructure with industry needs.

Compared with purely central training plus edge inference, training-inference integration also lowers deployment costs for small and medium-sized enterprises and will accelerate their "enrollment" in industry and scenario large models.

For the entire computing power ecosystem, opening this last mile as soon as possible means being truly activated, and with that, sustainable development.


Source: blog.csdn.net/GZZN2019/article/details/131657296