In the era of large models, what kind of AI computing power system do we need?

Currently, the "war of a hundred models" has triggered an explosion in demand for computing power, presenting the AI chip industry with enormous opportunities. "Innovative architecture + open-source ecosystem" is spurring a proliferation of diverse AI computing power products. Facing these new industry opportunities, the AI computing power industry chain urgently needs upstream and downstream collaboration to seize them together.

Recently, Stephen Zhang, senior product manager of Inspur Information's AI&HPC product line, shared his insights at the Open Computing China Summit on computing power demand trends in the AIGC era and the development of open accelerated computing. He pointed out that ecosystem collaboration around open accelerated computing will effectively empower the innovative development of diverse AI computing power products, offering practical solutions to the computing power challenges of the AIGC era.

The following are the key points of the speech:

  • Large models bring explosive demands for AI computing performance, interconnection bandwidth, and scalability;
  • Open accelerated computing technology is born for large-scale deep neural network training;
  • Application-oriented computing power infrastructure architecture design and collaborative design of computing power and algorithms can achieve more efficient large model training;
  • Open accelerated computing has accumulated fruitful results in performance, scalability, energy saving, and ecosystem compatibility.

The following is the original text of the speech:

Computing power requirements and trends in the era of large models

Since the release of ChatGPT, generative artificial intelligence has clearly drawn widespread attention across society. As ChatGPT broke into the mainstream, it brought in more participants, and both the number of models and their parameter counts have kept growing. According to incomplete statistics, the number of large models in China has exceeded 110, driving a sharp increase in demand for AI computing power.


In response to the severe computing power challenges brought by the development of large models, we have conducted extensive demand analysis and trend forecasting. Judging from how AI server computing power and power consumption have changed over time, the most direct way to address the computing power shortage for large models is to increase single-machine computing power. From 2016 to now, the computing power of a single AI server has increased nearly 100-fold, while power consumption has risen from 4 kilowatts to 12 kilowatts, and the next generation of AI servers will push it to 18 kilowatts or even beyond 20 kilowatts. The system architecture, power supply, and cooling methods of AI servers, as well as the data center infrastructure construction model, will struggle to meet the deployment needs of such high-power AI servers in the future.

Secondly, as the parameter count of large models grows, so does the number of GPUs required, demanding greater memory capacity. In 2021, a 100-billion-parameter model required about 3,000 GB of GPU memory to hold its weight parameters, gradients, optimizer states, and activations, which translates into nearly forty 80 GB GPUs. Today, many large models have exceeded one trillion parameters, and the memory requirement reaches 30,000 GB, requiring nearly 400 GPUs with 80 GB of memory each. This means an even larger computing platform is needed to carry out model training at such scale.
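As a rough, illustrative sketch (not from the speech): under mixed-precision Adam training, a common rule of thumb is about 16 bytes of optimizer, gradient, and weight state per parameter before activations, which lands in the same range as the figures quoted above:

```python
def estimate_training_memory_gb(num_params, bytes_per_param=16):
    """Rough memory for mixed-precision Adam training, excluding activations.

    16 bytes/param: fp16 weights (2) + fp16 gradients (2)
    + fp32 master weights, momentum, and variance (4 + 4 + 4).
    An illustrative rule of thumb, not the speaker's exact accounting.
    """
    return num_params * bytes_per_param / 1e9

def gpus_needed(total_mem_gb, mem_per_gpu_gb=80):
    # Ceiling division: how many 80 GB accelerators hold this footprint.
    return -(-int(total_mem_gb) // mem_per_gpu_gb)

states_100b = estimate_training_memory_gb(100e9)
print(states_100b, gpus_needed(states_100b))  # 1600.0 GB -> 20 GPUs, before activations
# With activations included, the speech cites ~3,000 GB for a 100B model,
# i.e. nearly forty 80 GB GPUs:
print(gpus_needed(3000))  # 38
```

Activation memory depends heavily on batch size and sequence length, so the speech's 3,000 GB total is plausible on top of the ~1,600 GB of parameter-related state.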

A larger platform brings another problem: more communication between cards and between nodes. Training large models requires combining multiple parallel strategies, which places higher demands on the P2P interconnect bandwidth between cards and on cross-node network bandwidth.

Take the engineering practice of training the 245.7-billion-parameter "Source 1.0" large model as an example. "Source 1.0" was trained on 180 billion tokens with a GPU memory requirement of 7.4 TB, and the training process combined three parallel strategies: tensor parallelism, pipeline parallelism, and data parallelism. Tensor-parallel communication within a single node occurred 82.4 times per second, requiring at least 194 GB/s of intra-node bandwidth. Pipeline parallelism ran across computing nodes, with cross-node communication reaching 26.8 GB/s; at least 300 Gbps of communication bandwidth is needed to satisfy pipeline-parallel training, and during "Source 1.0" training two 200 Gbps network cards were actually used for cross-node communication. Data-parallel communication is infrequent but large in volume, requiring at least 8.8 GB/s of bandwidth, which a single machine's 400 Gbps can satisfy.
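A quick unit-conversion sketch (illustrative, using the figures quoted above) shows why two 200 Gbps NICs per node suffice for the cross-node traffic:

```python
def gb_per_s_to_gbps(gb_per_s):
    # 1 GB/s of payload = 8 Gbit/s on the wire (before protocol overhead)
    return gb_per_s * 8

# Figures quoted above for "Source 1.0" cross-node traffic:
pipeline_gbps = gb_per_s_to_gbps(26.8)  # pipeline parallelism
data_gbps = gb_per_s_to_gbps(8.8)       # data parallelism
nic_gbps = 2 * 200                      # two 200 Gbps NICs per node

print(pipeline_gbps)  # 214.4 raw; the speech budgets >= 300 Gbps with overhead
print(data_gbps)      # 70.4
print(pipeline_gbps + data_gbps <= nic_gbps)  # True
```

Even summing both traffic classes, the raw payload requirement stays under the 400 Gbps the two NICs provide, leaving headroom for protocol overhead.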

As the number of model parameters further increases and GPU computing power increases exponentially, higher interconnect bandwidth will be required in the future to meet the training needs of larger-scale models.

Open accelerated computing is born for ultra-large-scale deep neural networks

A computing system for AIGC large model training needs three main characteristics: large computing power, high interconnection, and strong scalability. Traditional PCIe CEM accelerator cards struggle to deliver all three, so more and more chip manufacturers have developed non-PCIe accelerator cards.

In 2019, the open computing organization OCP released an accelerated computing system architecture specifically for large model training, with the UBB and OAM standards at its core and large computing power as its defining characteristic. Accelerators in the Mezz card form factor offer higher heat dissipation and interconnection capabilities and can carry higher-performance chips. At the same time, the architecture has very strong cross-node scalability and can easily be expanded to thousand-card and ten-thousand-card platforms to support large model training. It is a computing architecture naturally suited to ultra-large-scale deep neural network training.


However, as OAM was adopted across the industry, accelerator cards from different manufacturers still had inconsistent hardware interfaces, inconsistent interconnect protocols, and incompatible software ecosystems, resulting in long adaptation cycles and high customization costs for new AI accelerator card systems. This has widened the gap between computing power supply and demand. The industry urgently needs a more open computing platform and more diverse computing power to support large model training.

In this regard, Inspur Information has done a great deal of work, both in technical pre-research and in contributions to the industrial ecosystem. Since 2019, Inspur Information has taken the lead in formulating the OAM standard, released the first open acceleration baseboard UBB, developed the world's first open acceleration reference system MX1, and collaborated with leading chip manufacturers to complete the adaptation of OAM accelerator cards, proving the feasibility of this technical route. To promote the industrialization of systems complying with the OAM open acceleration specification, Inspur Information developed the first "ALL IN ONE" OAM server product, which integrates the CPU and OAM accelerator cards into a 19-inch chassis, enabling rapid data-center-scale deployment; it has been implemented in many customers' intelligent computing centers.

Since then, the computing power and power consumption of OAM chips have continued to climb, while data centers face ever higher requirements for green energy efficiency. In response, we developed the first liquid-cooled OAM server, which cools 8 OAM accelerators and two high-power CPUs with liquid, achieving a liquid cooling coverage rate of over 90%. The liquid-cooled OAM intelligent computing center solution built on this product achieves a PUE below 1.1 for a thousand-card platform in stable operation. The newly released next-generation OAM server NF5698G7 from Inspur Information is based on a full PCIe Gen5 link with a 4-fold increase in H2D (host-to-device) interconnection capability, providing a more advanced deployment platform for next-generation OAM development.

Solve energy consumption issues through platform architecture design and computing algorithm collaborative design

Merely providing a computing power platform is not enough. Data centers currently face huge energy consumption challenges, especially with AI servers for large model training, where single-machine power consumption can easily exceed 6-7 kilowatts.


A formula quickly estimates the overall power consumption (E) required to train a large model. The numerator, 6 times the number of model parameters times the number of tokens used in training, represents the computing power equivalent required for large model training; the denominator, the number of accelerator cards times the computing performance of a single card, represents the overall computing performance the intelligent computing infrastructure can provide. Dividing the two gives the time required to train the model, and multiplying by the Ecluster indicator (the daily power consumption of the training platform) yields the overall power consumption. Once a model is chosen and the number and scale of cards are fixed, the overall power consumption of training can only be reduced by raising the effective computing power of a single card or lowering the power consumption of the platform itself.
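The formula above can be written out directly. The numbers in the example below are placeholders for illustration, not figures from the speech:

```python
def training_energy_kwh(params, tokens, num_cards, flops_per_card, cluster_kw):
    """Total energy (kWh) to train a large model, per the formula described.

    Compute required : 6 * params * tokens         (FLOPs)
    Compute supplied : num_cards * flops_per_card  (sustained FLOPs/s)
    Training time    : their ratio, in seconds
    Energy           : time * average cluster power draw
    """
    train_seconds = (6 * params * tokens) / (num_cards * flops_per_card)
    return train_seconds / 3600 * cluster_kw  # hours * kW = kWh

# Hypothetical example: 100B parameters, 180B tokens, 1,000 cards at a
# sustained 100 TFLOPS each, cluster drawing 2,000 kW on average.
e = training_energy_kwh(100e9, 180e9, 1000, 100e12, 2000)
print(round(e))  # 600000 kWh over ~300 hours of training
```

The formula makes the two optimization levers explicit: increasing effective per-card FLOPs shortens the time term, while cutting platform draw lowers the power term.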

We conducted further research on optimizing these two parameters. Two tables compare platform power consumption, and the corresponding overall power consumption of large model training, under different network architecture designs of the training platform. Take the two-NIC-per-node networking scheme and the eight-NIC-per-node scheme as examples: although the number of network cards has little effect on single-machine power consumption, at the level of the whole computing platform more network cards mean more switches, and total power consumption differs significantly. The eight-NIC scheme can exceed 2,000 kilowatts in total, while the two-NIC scheme consumes just over 1,600 kilowatts, a saving of 18%.
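The 18% figure follows directly from the two totals. The exact values below are assumed for illustration, chosen to be consistent with the "more than 2,000" and "more than 1,600" kilowatt figures quoted above:

```python
# Illustrative platform totals (kW); exact values assumed, not from the speech.
power_8nic_kw = 2000  # eight-NIC-per-node scheme (more switches needed)
power_2nic_kw = 1640  # two-NIC-per-node scheme

saving = 1 - power_2nic_kw / power_8nic_kw
print(f"{saving:.0%}")  # 18%
```
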

Therefore, by carefully calculating the network bandwidth actually required for large model training based on real application demands, total power consumption can be significantly optimized without affecting performance. During training of the "Source" large model, only two 200G IB cards were needed to complete training of the 245.7-billion-parameter model. This is the first technical path we found for optimizing the total power consumption of a training platform.

Second, improving the computing power utilization of a single card to gain efficiency and save energy is also an important proposition. Our testing shows that by co-designing algorithms with the computing power architecture, and deeply optimizing the model's parameter structure and training strategy around the technical characteristics of the computing infrastructure, a model of the same scale can be trained in less time. Taking GPT-3 training as an example, training time can be cut from 15 days to 12 days, with total power consumption savings reaching 33%.

The above two points can illustrate that application-oriented architecture design and collaborative design of computing power and algorithms can achieve more efficient large model training and ultimately accelerate the realization of energy conservation and carbon reduction goals.

Green and open acceleration platform empowers large models to efficiently release computing power

Based on the above innovation and research in open computing and efficient computing technologies, products and methods, Inspur Information is actively building a green and open accelerated intelligent computing platform for generative AI.

The liquid-cooled open accelerated intelligent computing center solution released with partners last year offers, first of all, very high computing performance; second, it scales to thousands of cards and supports training models beyond the 100-billion-parameter scale; and at the same time, its advanced liquid cooling technology greatly optimizes the PUE of the entire platform.

At the same time, Inspur Information is actively building full-stack open accelerated intelligent computing capabilities. Above the underlying AI computing platform sits an AI resource platform, whose resource management layer provides unified scheduling and management of more than 30 types of diverse computing power chips through a unified interface. Above that is the AI algorithm platform, which provides open-source deep learning frameworks, large models, and open datasets. On top of this are computing power services, covering computing power, model data, delivery, operation and maintenance, and other service models. At the top is the YuanNao ecosystem of more than 4,000 partners, with whom Inspur Information jointly designs open accelerated computing solutions and successfully brings them to industrial implementation.

The AI computing platform based on the open acceleration specification has so far been adapted to more than 20 mainstream large models in the industry, including the familiar GPT series, LLaMA, ChatGLM, and "Source", and also supports adaptation of various diffusion models.

"Enabling a hundred chips, empowering a thousand models": accelerating the implementation of diverse computing power

As AIGC technology and industry develop rapidly, the industry has formulated specifications for open accelerated computing, but problems remain in their implementation. For example, open computing systems are still highly customized, and the specifications' coverage is insufficient in areas including system adaptation, management and scheduling of diverse computing power chips, and deployment of deep learning environments.

Based on the OAM specification, the "Open Acceleration Specification AI Server Design Guide" was recently released. Addressing customer pain points in the current AIGC industry context, it defines principles for open acceleration server design, including application orientation, diversified openness, green efficiency, and coordinated design. It also deepens and refines server design methods, including multi-dimensional collaborative design schemes from the node layer to the platform layer. The guide fully considers problems encountered during adaptation and development, further refining design parameters from node to platform, with the ultimate goal of improving the development, adaptation, and deployment efficiency of diverse computing chips.

Because servers for AIGC training pack in many high-power chips and high-bandwidth interconnect designs, stability is a serious concern; more comprehensive testing is required to ensure system stability and reduce interruptions and their impact on large model training efficiency. The "Guide" therefore provides comprehensive, systematic testing guidance covering structure, heat dissipation, stress, stability, software compatibility, and more.

Finally, to push diverse computing power into industrial applications, the most critical factor is performance, including chip performance, interconnection performance, model performance, and virtualization performance. Drawing on benchmark tuning experience accumulated earlier, the "Guide" proposes standards and methods for performance evaluation and tuning, helping partners bring their latest chip products into application faster and better and improving the usability of computing power. The ultimate goal is to promote innovation across the entire AI computing power industry, work with upstream and downstream partners across the industry chain to advance the whole open acceleration ecosystem, and jointly meet the computing power challenges of the AIGC era.

Thank you all!

 

Origin my.oschina.net/u/5547601/blog/10110166