"Open Acceleration Specification AI Server Design Guide" released to cope with the challenge of explosive computing power of generative AI

On August 10, at the 2023 Open Compute Project China Summit (OCP China Day 2023), the "Open Acceleration Specification AI Server Design Guide" (hereinafter referred to as the "Guide") was released. Targeting generative AI application scenarios, the "Guide" further develops and refines the design theory and design methods for open acceleration specification AI servers. It will help community members efficiently develop AI accelerator cards that comply with the open acceleration specification, greatly shorten their adaptation cycle with AI servers, provide users with the AI computing power solutions that best match their application scenarios, and seize the enormous opportunities that the generative AI boom brings to the computing power industry.

Generative AI technology is currently developing rapidly, leading a new wave of AI innovation. Large AI models are the key foundation of generative AI and hold significant potential for improving production efficiency and for transforming and upgrading traditional industries. Efficiently training a large model usually requires an AI server cluster built from more than 1,000 high-computing-power AI chips. As generative AI gathers pace, industry demand for AI servers equipped with such chips continues to rise. Against this backdrop, hundreds of companies worldwide have invested in developing new AI acceleration chips, and the diversification of AI computing chips has become a prominent trend. Because unified industry standards are lacking, AI acceleration chips from different manufacturers differ significantly, so each chip requires a customized system hardware platform, driving up development costs and lengthening development cycles.

OCP is the world's largest and most influential open-source organization in the field of foundational hardware technology. In 2019, OCP established the OAI (Open Accelerator Infrastructure) group to define an AI accelerator card form factor better suited to ultra-large-scale deep learning training, addressing the inconsistent form factors and interfaces of AI accelerator cards. At the end of 2019, OCP officially released the OAI-UBB (Universal Baseboard) 1.0 design specification, and subsequently launched an open acceleration hardware platform based on the OAI-UBB 1.0 specification that can host OAM (OCP Accelerator Module) products from different manufacturers without hardware modification. In recent years, system manufacturers represented by Inspur Information have developed a number of AI servers that comply with the open acceleration specifications, putting open accelerated AI servers into industrial practice.

Drawing on product development and engineering practice in the field of open accelerated computing, the "Guide" further develops and refines the design theory and design methods for open acceleration specification AI servers. It proposes four major design principles and a full-stack design method covering hardware design references, management interface specifications, and performance testing standards, designed to help community members develop AI accelerator cards faster and better, adapt them to open accelerated AI servers, and meet the computing power challenges of generative AI.

The "Guide" points out that the design of open accelerated and standardized AI servers should follow four major design principles, namely application-oriented, diverse and open, green and efficient, and coordinated design. On this basis, design methods such as multi-dimensional collaborative design, comprehensive system testing, and performance evaluation and optimization should be adopted to improve adaptation deployment efficiency, system stability, and system availability.

Multi-dimensional collaborative design means that system manufacturers and chip manufacturers must collaborate comprehensively, across multiple dimensions, from the early planning stage to minimize customized development. Large model computing systems are usually highly integrated computing clusters spanning computing, storage, and network equipment; software, frameworks, and model components; and racks, cooling, power supply, and liquid cooling infrastructure. Only through multi-dimensional collaboration can globally optimal performance, energy efficiency, or TCO be achieved and system adaptation and cluster deployment efficiency be improved. The "Guide" provides a full-stack software and hardware reference design from nodes to clusters.
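To make the node-to-cluster scope concrete, here is a minimal sketch of how the layers of such a system might be modeled. The class names, the 8-module OAM node, and the 128-node cluster size are illustrative assumptions, not definitions from the "Guide".

```python
from dataclasses import dataclass, field

# Hypothetical model of the layers a large-model computing system spans,
# from a single open acceleration node up to a full cluster.

@dataclass
class AccelNode:
    oam_modules: int = 8        # OAM accelerator modules on one UBB baseboard
    scale_out_nics: int = 8     # one RDMA NIC per accelerator is a common ratio

@dataclass
class ClusterStack:
    compute: list = field(default_factory=list)      # AccelNode instances
    storage: str = "parallel file system"
    network: str = "RDMA fabric"
    software: list = field(default_factory=lambda: ["driver", "framework", "model"])
    facility: list = field(default_factory=lambda: ["rack", "power", "liquid cooling"])

# 128 nodes x 8 OAM modules = 1,024 accelerators, the "more than 1,000
# chips" scale cited above for efficient large model training.
cluster = ClusterStack(compute=[AccelNode() for _ in range(128)])
print(len(cluster.compute) * cluster.compute[0].oam_modules)  # 1024
```

The point of the sketch is that every field here is a dimension the "Guide" expects system and chip manufacturers to coordinate on, rather than an independently optimized component.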

Comprehensive system testing reflects the fact that heterogeneous accelerated computing nodes usually have a relatively high failure rate, so more comprehensive and rigorous testing is needed to minimize the risk of failure during system production, deployment, and operation, improve system stability, and reduce the impact of interruptions on long-running training jobs. The "Guide" comprehensively sorts out the test points covering structure, heat dissipation, stress, stability, software compatibility, and more.
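For a sense of what one such test point can look like in practice, here is a minimal stress-test sketch assuming a PyTorch/CUDA stack; it is an illustration, not the test suite defined in the "Guide". Sustained large matrix multiplies hold every accelerator at high load so that cooling and stability problems surface before production deployment.

```python
import time
import torch

def stress_device(device: torch.device, minutes: float = 10.0, n: int = 4096) -> None:
    """Run back-to-back GEMMs on one accelerator and check for silent corruption."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    deadline = time.time() + minutes * 60
    iters = 0
    while time.time() < deadline:
        c = a @ b
        iters += 1
        if not torch.isfinite(c).all():  # non-finite output signals a faulty part
            raise RuntimeError(f"non-finite result on {device} at iteration {iters}")
    torch.cuda.synchronize(device)
    print(f"{device}: {iters} GEMM iterations completed without error")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        stress_device(torch.device(f"cuda:{i}"))
```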

Performance evaluation and tuning refers to the need for multi-level performance evaluation and in-depth software and hardware tuning of large model accelerated computing systems. The "Guide" provides the key points and metrics for basic performance, interconnect performance, and model performance testing, and highlights the key points of performance tuning for large model training and inference, ensuring that open acceleration specification AI servers can effectively support innovative applications built on today's mainstream large models.
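Interconnect performance, for example, is commonly probed with an all-reduce bandwidth microbenchmark. The sketch below uses torch.distributed and the bus-bandwidth convention popularized by nccl-tests; the tensor size and iteration counts are illustrative assumptions, and the "Guide"'s actual metrics and thresholds are not reproduced here.

```python
import os
import time
import torch
import torch.distributed as dist

# Launch with torchrun, e.g.: torchrun --nproc_per_node=8 allreduce_bench.py
# Bus bandwidth convention: busbw = (2 * (world - 1) / world) * bytes / seconds

def main(numel: int = 256 * 1024 * 1024, iters: int = 20) -> None:
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()
    x = torch.randn(numel, device="cuda")

    for _ in range(5):                 # warm-up to exclude setup costs
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if dist.get_rank() == 0:
        nbytes = numel * 4             # float32 elements
        busbw = (2 * (world - 1) / world) * nbytes / elapsed / 1e9
        print(f"all-reduce {nbytes / 1e9:.1f} GB: {busbw:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Comparing the measured bus bandwidth against the fabric's rated speed is a quick way to tell whether a cluster is ready for large model training or whether tuning is still needed.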
