AIStation Wins the 2023 Smart Expo Product Gold Award as Large-Model Computing Platform Efficiency Draws Attention

On June 25, 2023, the 2023 Global Artificial Intelligence Product Application Expo (Smart Expo) opened in Suzhou. AIStation, Inspur Information's intelligent computing production and innovation platform, won the expo's core honor, the Product Gold Award, for resource scheduling and platform management capabilities that markedly improve the efficiency of large-model computing platforms. The award reflects both AIStation's leadership in supporting large-model computing and business workloads and the industry's close attention to the efficiency of large-model computing platforms.

Generative AI, exemplified by large models, is developing rapidly and reshaping the path to intelligent transformation across industries. Generative AI innovation requires distributed training of models with hundreds of billions of parameters, on massive datasets, across AI server clusters with hundreds or thousands of accelerator cards. How to extract the most performance from such platforms, suppress efficiency losses, and complete the training and deployment of large models has become a new challenge of the AIGC era.

As an end-to-end platform providing full-process support for AI development and deployment, AIStation helps customers accelerate the development and deployment of large models through its resource scheduling and management capabilities. By managing computing resources, data resources, and the deep learning software stack in a unified way, it effectively improves the efficiency of large-model AI computing clusters.

One-stop management, millisecond-level scheduling, cluster utilization above 70%

Large-model training requires building a systematic distributed training environment spanning compute, network, storage, and frameworks. Traditional decentralized management not only has a high barrier to entry and low efficiency, but also lacks a targeted, optimized global scheduling system, leaving the large-model computing platform poorly coordinated and the effective training throughput low.

To address the large-scale, systematic nature of distributed training, AIStation pools heterogeneous computing clusters under unified management, automatically configures the underlying compute, storage, and network environments for training jobs through a self-developed adaptive system for distributed tasks, and supports customization of basic hyperparameters. With a variety of efficient resource management and scheduling strategies, AIStation achieves millisecond-level scheduling on clusters of 10,000-plus accelerator cards and raises overall resource utilization above 70%.
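The core idea behind scheduling a distributed training job across a pooled cluster can be illustrated with a gang-style, all-or-nothing allocation: the job starts only when every requested accelerator can be reserved at once. The following is a minimal sketch of that concept (the class and function names are illustrative, not AIStation's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    total_gpus: int
    free_gpus: int = field(init=False)

    def __post_init__(self):
        self.free_gpus = self.total_gpus

def gang_schedule(nodes, gpus_needed):
    """All-or-nothing allocation: the distributed job only starts when
    every requested GPU can be reserved; otherwise nothing is held."""
    plan, remaining = [], gpus_needed
    for node in nodes:
        if remaining == 0:
            break
        take = min(node.free_gpus, remaining)
        if take:
            plan.append((node, take))
            remaining -= take
    if remaining > 0:
        return None  # cluster cannot satisfy the request; queue the job
    for node, take in plan:  # commit the reservation atomically
        node.free_gpus -= take
    return [(node.name, take) for node, take in plan]

cluster = [Node("a100-01", 8), Node("a100-02", 8)]
print(gang_schedule(cluster, 12))  # [('a100-01', 8), ('a100-02', 4)]
print(gang_schedule(cluster, 8))   # None: only 4 free GPUs remain
```

Gang scheduling matters for distributed training because a partially placed job would hold GPUs idle while waiting for the rest, which is exactly the kind of utilization loss a unified scheduler is meant to avoid.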

At the same time, AIStation integrates mainstream large-model training frameworks and uses containerization to standardize and modularize the runtime environment and framework adaptation process, enabling runtime environments to be built in seconds and ensuring that AI development and AI business workloads run efficiently.

Bottleneck optimization and robust fault tolerance accelerate large-model training end to end

To address bottlenecks encountered in large-scale distributed training, such as compute-network construction, data acceleration, and network communication optimization, AIStation improves computing resource utilization while accelerating the entire training process through features including accelerated image distribution, data caching, network-topology-aware scheduling, and dynamic elastic scaling of resources. In particular, AIStation's data caching mechanism can improve model training efficiency by 200%-300%: it automatically schedules training tasks according to each node's cache state, avoiding repeated downloads of training data and saving data loading time. Combined with the self-developed scheduling system, the linear scaling efficiency of distributed training can reach 0.9, effectively suppressing the performance loss of multi-node collaboration.
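Cache-aware scheduling of the kind described above can be sketched in a few lines: prefer a node that already holds the dataset, and only fall back to fetching the data onto a new node when no cache hit is available. This is a simplified illustration under assumed data structures, not AIStation's actual scheduler:

```python
def pick_node(nodes, dataset):
    """Prefer a node that already caches the dataset; otherwise fall
    back to the free node with the most GPUs, so the dataset is only
    downloaded once and reused by later jobs."""
    free = [n for n in nodes if n["free_gpus"] > 0]
    cached = [n for n in free if dataset in n["cached"]]
    pool = cached or free
    if not pool:
        return None  # no capacity: queue the job
    return max(pool, key=lambda n: n["free_gpus"])["name"]

nodes = [
    {"name": "node-1", "free_gpus": 4, "cached": {"imagenet"}},
    {"name": "node-2", "free_gpus": 8, "cached": set()},
]
print(pick_node(nodes, "imagenet"))  # node-1: reuses the cached copy
print(pick_node(nodes, "laion"))     # node-2: no cache hit anywhere
```

The saving comes from data locality: reading a locally cached dataset is far cheaper than pulling it from remote storage at the start of every job.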

Robustness and stability are now hard requirements for completing large-model training efficiently. Here, AIStation provides integrated capabilities such as full-lifecycle management, fault tolerance, and cluster monitoring and operations to comprehensively detect and automatically handle training anomalies and faults. This shortens the time lost to training interruptions, reduces operational complexity, sustains stable training, and lowers the cost and duration of large-model training.
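The standard mechanism behind this kind of fault tolerance is periodic checkpointing with automatic resume: after a node failure, the job restarts from the last saved state instead of from scratch. A minimal sketch of the pattern, using a JSON file as a stand-in for a real model checkpoint (illustrative only; AIStation's internals are not public):

```python
import json
import os
import tempfile

def train(steps, ckpt_path, ckpt_every=100):
    """Resume from the last checkpoint if one exists, so a failure
    costs at most `ckpt_every` steps of recomputation."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    for step in range(start, steps):
        # ... one training step (forward/backward/update) would run here ...
        if (step + 1) % ckpt_every == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step + 1}, f)
    return start  # where this run actually began

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(train(250, path))  # 0: fresh run, checkpoints at steps 100 and 200
print(train(250, path))  # 200: a restart skips the completed work
```

In a real cluster the same loop saves model and optimizer state to shared storage, and the platform's monitoring layer triggers the restart automatically when a fault is detected.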

Efficient inference unlocks the application value of large models

For application deployment after large-model training, AIStation integrates training and inference to accelerate putting models into production. To handle the bursty request patterns of real-world large-model applications, AIStation adjusts resource allocation in step with the resource demands of inference services, scaling services up and down within seconds based on the real-time request volume. In large-model AI inference scenarios with millions of concurrent requests, the average service response latency stays below 1 ms, and responsiveness to sudden access peaks improves by 50%.
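Request-driven scaling of this kind typically reduces to computing a target replica count from the live request rate and clamping it to configured bounds. A minimal sketch (the function and its parameters are illustrative assumptions, not AIStation's API):

```python
import math

def target_replicas(current_rps, rps_per_replica, min_r=1, max_r=64):
    """Scale the number of inference replicas with the live request
    rate, clamped to the configured minimum and maximum."""
    if current_rps <= 0:
        return min_r  # idle: keep a warm minimum to avoid cold starts
    need = math.ceil(current_rps / rps_per_replica)
    return max(min_r, min(max_r, need))

print(target_replicas(0, 200))       # 1: idle, hold the minimum
print(target_replicas(2500, 200))    # 13: a burst absorbed in seconds
print(target_replicas(100000, 200))  # 64: capped at the cluster limit
```

Evaluating this rule every few seconds, rather than minutes, is what makes "second-level" expansion and contraction possible; the cap prevents a traffic spike from starving training jobs on the same cluster.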

AIStation has already been validated in the training of the 245.7-billion-parameter "Yuan" large model, where it supported an effective training compute efficiency of 44.8%, higher than the 21.3% reported for GPT-3. In addition, a large commercial bank's parallel computing cluster built on AIStation won IDC's 2022 "Future Digital Infrastructure Leader" award for its large-scale distributed training support. Going forward, the AIStation platform will continue to provide efficient computing platform management for the development and deployment of large models across industries and accelerate the iterative innovation of AIGC technology.
