Building an AIGC intelligent computing OS to help large models efficiently unleash computing power

Recently, the launch conference for OGAI, the large-model intelligent computing software stack, was held in Beijing. Intelligent Computing OS, the intelligent computing base for large-model computing power services and an important foundation of the stack, was released at the event. Intelligent Computing OS is a computing power operation and management platform built for large-model computing power services: by pooling general-purpose and AI computing power, it meets multi-tenant demand for elastic AI computing power, continuously drives the intelligent transformation and upgrading of industry, serves hundreds of industries, and starts a new journey for intelligent computing power.

OGAI (Open GenAI Infra) is a full-stack, full-process intelligent computing software stack that provides AI computing system environment deployment, computing power scheduling, and development management capabilities for large-model businesses.

Guided by the design principles of covering the full stack and full process, fully releasing computing power, and refinement through practical verification, OGAI is divided into five layers, L0 to L4, aiming to deliver efficient productivity for large-model development and application. Inspur Yunhai Intelligent Computing OS, as the base of the intelligent computing software stack, meets multi-tenant needs for elastic AI computing power operation and management on top of bare metal. Its efficient bare-metal service supports deploying thousands of bare-metal nodes within minutes and elastic expansion on demand, provides one-click access to heterogeneous computing chips, IB or RoCE high-speed networks, high-performance storage, and other environments, and enforces compute, network, and data isolation to keep business secure.


Intelligent Computing OS is the foundational base of the intelligent computing center

As a typical application of the intelligent computing era, AIGC (AI-generated content) has attracted wide attention since its emergence and has greatly accelerated the transformation of traditional data centers into intelligent computing centers. An intelligent computing center must provide not only general-purpose computing power but also diverse heterogeneous computing power such as GPUs, DPUs, and FPGAs, and must be able to deliver or sell computing power services according to different user needs.

Intelligent Computing OS focuses on the intelligent computing center scenario. Built on the integration of cloud, server, storage, network, AI, and other infrastructure products, it consolidates various computing resources to provide the intelligent computing center with basic hardware facilities and cloud, data, and AI software platforms, along with a unified portal for operation, maintenance, and intelligent management. This helps enterprises solve the problems of the intelligent computing era and meets the construction, operation, and maintenance needs of intelligent computing centers for customers in the Internet, education, scientific research, finance, and other industries.

Through the production, aggregation, scheduling, and release of computing power, it helps enterprises efficiently develop their own large models, shapes an AI development model suited to each enterprise, and facilitates the adoption of generative AI.

In large-model scenarios, the unified computing power platform of Intelligent Computing OS makes it easy to obtain GPU bare-metal services flexibly. Thousands of bare-metal servers can be deployed within minutes and elastically expanded on demand, and environments such as heterogeneous computing chips, IB or RoCE high-speed networks, and parallel storage can be obtained with one click. Compute, network, and data isolation ensures business security. The service is as easy to use as a virtual machine and fully unleashes the potential of the computing power.
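
As a rough illustration of what requesting such a bare-metal pool could look like, here is a minimal Python sketch; the BareMetalRequest structure, its fields, and the provision helper are hypothetical examples and not an actual Intelligent Computing OS API.

```python
from dataclasses import dataclass, field

@dataclass
class BareMetalRequest:
    """Hypothetical description of a GPU bare-metal pool request."""
    node_count: int                      # number of bare-metal nodes to deploy
    accelerators_per_node: int = 8       # e.g. 8 accelerators per AI server
    network: str = "IB"                  # "IB" or "RoCE" high-speed fabric
    storage: str = "parallel-fs"         # high-performance parallel storage
    tenant: str = "default"              # tenant used for compute/network/data isolation
    labels: dict = field(default_factory=dict)

def provision(req: BareMetalRequest) -> list[str]:
    """Pretend to allocate nodes and return their names (illustration only)."""
    return [f"{req.tenant}-gpu-node-{i:04d}" for i in range(req.node_count)]

if __name__ == "__main__":
    req = BareMetalRequest(node_count=1000, network="RoCE", tenant="llm-team")
    nodes = provision(req)
    print(f"Provisioned {len(nodes)} nodes, first: {nodes[0]}")
```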


Intelligent Computing OS maximizes resource utilization

In real user scenarios, general-purpose computing power and AI computing power are often built separately, which leads to a series of problems:

  1. Information islands: different types of resources are managed independently, so information cannot be shared or exchanged;
  2. Exclusive resources: users hold exclusive access to equipment, computing power cannot be shared, and resource utilization is low;
  3. Wasted manpower: different computing resources are operated and maintained separately, adding enormous complexity to management and O&M.

Intelligent Computing OS supports unified management of heterogeneous computing power, covering both general-purpose and AI computing power, and adopts an elastic computing framework to deliver elastic resource scheduling and a multi-tenant system. It automatically allocates and schedules resources based on tenant usage, significantly reducing the time spent waiting for resources. Logical systems are planned according to the user's actual situation, so that resources are isolated between tenant systems and data security is guaranteed. Automated operation and maintenance capabilities lower the expertise required of O&M staff and reduce O&M complexity, helping users focus on AI development and genuinely cut costs while improving efficiency.
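
To make the idea of elastic, demand-driven allocation across tenants concrete, here is a toy Python sketch; the proportional-sharing rule and the tenant names are illustrative assumptions, not the scheduler actually used by Intelligent Computing OS.

```python
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    requested_gpus: int   # current demand from the tenant's queued jobs

def allocate(tenants: list[Tenant], total_gpus: int) -> dict[str, int]:
    """Toy elastic allocation: satisfy everyone if the pool is large enough,
    otherwise share it in proportion to demand. Real schedulers also consider
    quotas, priorities, and gang scheduling; this only illustrates the idea."""
    demand = sum(t.requested_gpus for t in tenants)
    if demand <= total_gpus:
        return {t.name: t.requested_gpus for t in tenants}
    return {t.name: t.requested_gpus * total_gpus // demand for t in tenants}

if __name__ == "__main__":
    pool = [Tenant("research", 512), Tenant("finance", 256), Tenant("edu", 256)]
    print(allocate(pool, 768))   # {'research': 384, 'finance': 192, 'edu': 192}
```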


Intelligent Computing OS maximizes model training efficiency

To match the computing characteristics of large-model training, Intelligent Computing OS comprehensively optimizes the cluster architecture, high-speed interconnect, and computing power scheduling. At the system architecture level, it adopts AI servers that integrate 8 accelerators in a single node, achieving high-speed P2P communication between accelerators within a node and building an extremely low-latency, ultra-high-bandwidth InfiniBand network between nodes.
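
The small Python sketch below illustrates the resulting topology assumption: ranks on the same 8-accelerator node communicate over in-node P2P links, while ranks on different nodes cross the InfiniBand fabric. The helper names are hypothetical and only illustrate the mapping.

```python
ACCELERATORS_PER_NODE = 8   # one AI server integrates 8 accelerators

def placement(global_rank: int) -> tuple[int, int]:
    """Map a global rank to (node index, local accelerator index)."""
    return divmod(global_rank, ACCELERATORS_PER_NODE)

def link_type(rank_a: int, rank_b: int) -> str:
    """Classify the path between two ranks: intra-node P2P vs inter-node IB."""
    node_a, _ = placement(rank_a)
    node_b, _ = placement(rank_b)
    return "intra-node P2P" if node_a == node_b else "inter-node InfiniBand"

if __name__ == "__main__":
    # Ranks 0-7 live on node 0 and talk over in-node P2P links;
    # rank 0 and rank 8 sit on different nodes and cross the IB fabric.
    print(placement(11))      # (1, 3)
    print(link_type(0, 7))    # intra-node P2P
    print(link_type(0, 8))    # inter-node InfiniBand
```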

At the level of large-model training technology, the training optimization experience gained from the Chinese large-scale AI model "Source 1.0" was applied to optimize the distributed training strategy in a targeted way. Through careful design of tensor parallelism, pipeline parallelism, and data parallelism, and precise tuning of the model structure and training hyperparameters, a training efficiency of 53.5% was ultimately achieved for an AI model with 100 billion parameters, setting a new industry record for large-model training efficiency.
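
The 53.5% figure is a hardware utilization ratio (achieved training FLOPS versus the cluster's peak). A minimal sketch of how such a number can be estimated uses the common 6 x parameters FLOPs-per-token approximation for dense transformer training; the GPU count, per-GPU peak, and token throughput below are illustrative assumptions, not figures reported in the article.

```python
def training_efficiency(params: float, tokens_per_sec: float,
                        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate hardware utilization: achieved FLOPS / cluster peak FLOPS.
    Uses the common ~6 * params FLOPs-per-token approximation for dense
    transformer training (forward + backward)."""
    achieved = 6.0 * params * tokens_per_sec
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

if __name__ == "__main__":
    # Illustrative numbers only: a 100B-parameter model, 2048 GPUs,
    # 312 TFLOPS peak per GPU (FP16), and an assumed token throughput.
    eff = training_efficiency(params=100e9, tokens_per_sec=570_000,
                              num_gpus=2048, peak_flops_per_gpu=312e12)
    print(f"estimated training efficiency: {eff:.1%}")
```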


Intelligent Computing OS promotes computing power operation

To guarantee a steady supply of resources for model development, enterprises usually plan for peaks and troughs in computing power demand and purchase additional equipment as redundancy, which leaves computing power idle and wastes spending.

Intelligent Computing OS provides a commercial computing power rental service that can sell computing power over the Internet. Working with the built-in commercial billing system, it meters resource usage accurately and offers diversified billing packages such as pay-as-you-go and monthly or annual subscriptions, helping enterprises quickly build a mature computing power sales system and maximize the value of otherwise idle computing power resources.
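
As a simple illustration of how such billing modes compare, the sketch below contrasts pay-as-you-go metering with a monthly subscription; the prices and usage numbers are hypothetical and not Intelligent Computing OS rates.

```python
def pay_as_you_go(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """On-demand billing: pay only for metered GPU-hours."""
    return gpu_hours * price_per_gpu_hour

def subscription(months: int, gpus: int, price_per_gpu_month: float) -> float:
    """Monthly subscription: a fixed fee per reserved GPU per month."""
    return months * gpus * price_per_gpu_month

if __name__ == "__main__":
    # Hypothetical prices: 10 per GPU-hour on demand, 4000 per GPU-month reserved.
    usage = pay_as_you_go(gpu_hours=8 * 300, price_per_gpu_hour=10.0)      # 8 GPUs, ~300 h
    reserved = subscription(months=1, gpus=8, price_per_gpu_month=4000.0)
    print(f"on-demand: {usage:.0f}, subscription: {reserved:.0f}")
    print("cheaper plan:", "subscription" if reserved < usage else "on-demand")
```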

At present, Intelligent Computing OS has been widely deployed in many provincial and municipal intelligent computing centers. It has supported full-stack training of two large language models and has accumulated rich construction and optimization experience, providing efficient computing power support for thousands of industries.
