Revealing how NVIDIA A100, A800, H100, and H800 GPUs can achieve 100-fold training acceleration for high-performance large models

Keywords: Transformer; PLM; SLM; NLM; LLM; Galactica; OPT; OPT-IML; BLOOM; BLOOMZ; GLM; Reddit; H100; H800; A100; A800; MI200; MI250; LLaMA; OpenAI; GQA; RMSNorm; SFT; RTX 4090; A6000; AIGC; ChatGLM; LLVM; LLMs; AGI; HPC; GPU; CPU; CPU+GPU; NVIDIA; Intel; AMD; high-performance computing; high-performance server; Blue Ocean Brain; multi-heterogeneous computing power; large model training; general artificial intelligence; GPU server; GPU cluster; large model training GPU cluster; large language model; deep learning; machine learning; computer vision; generative AI; ML; DLC; image segmentation; pre-trained language model; AI server; GH200; L40S; HBM3e; Grace Hopper

Abstract: This article introduces the internal working principles of large models and the current state of computing power development in China. A large model is a deep learning model with a huge number of parameters, such as GPT-4. By training on large-scale datasets, it can produce more accurate and more creative results. The internal workings of a large model include the processing of input data, multi-layer neural network computation, and output generation. These models usually contain billions of parameters and require enormous computing resources and high-speed memory for training and inference.

With the rapid development of large models, China has made significant progress in computing power. In recent years, China has invested substantial resources in research and development in high-performance computing and artificial intelligence, and has built a series of supercomputing centers and cloud computing platforms. These measures not only enhance China's scientific research capabilities but also provide strong support for large model training and application. China's computing power development has entered the world's leading ranks, laying a solid foundation for advancing artificial intelligence.

The Blue Ocean Brain large model training platform is a high-performance computing platform independently developed by Blue Ocean Brain, dedicated to large model training and inference. The platform uses an advanced hardware architecture and optimized software algorithms to provide efficient computing and storage capabilities.

The internal working principles of large models

In recent years, pre-training Transformer models on large-scale corpora has produced pre-trained language models (PLMs), which demonstrate powerful language understanding and generation capabilities across a variety of natural language processing tasks. Research has found that scaling up model size improves model capability, giving rise to large language models (LLMs). Once model size exceeds a certain threshold, these large models not only show greatly improved performance but also exhibit language learning abilities that small models do not have.

The rapid progress of LLM technology has changed the development and application paradigm of AI systems. This article reviews the development history of LLM technology in recent years, and summarizes LLM's R&D resources, existing problems, and future directions.

1. Introduction

Language is the unique ability of human beings to express and communicate. It begins to be formed in early childhood and continues to develop and change throughout life. However, if a machine wants to naturally master the ability to understand and use language like humans, it must be equipped with powerful artificial intelligence algorithms. Achieving machines with human-like ability to read, write and communicate is a long-standing research challenge.

Technically speaking, language modeling is one of the main approaches to improving machine language intelligence. It models the generation probability of word sequences in order to predict words that have not yet been observed. Language modeling research has received widespread attention in academia, and its development can be divided into four main stages:

1. Statistical Language Model (SLM)

SLM (Statistical Language Model) emerged in the 1990s. Based on statistical learning methods, it uses the Markov assumption to build word prediction models. SLMs with a fixed context length n are also known as n-gram language models, such as bigram and trigram models. They are widely used in information retrieval and natural language processing but often suffer from the curse of dimensionality, so smoothing strategies such as back-off estimation and Good-Turing estimation have been introduced to alleviate the data sparsity problem.
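As a toy illustration of the n-gram idea, the sketch below estimates bigram probabilities from counts and falls back to unigram frequencies for unseen pairs (a crude stand-in for the smoothing strategies mentioned above); the corpus and the discount factor are illustrative assumptions.

```python
from collections import Counter

# Toy corpus; real n-gram models are estimated from very large corpora.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def bigram_prob(prev, word, alpha=0.4):
    """P(word | prev) with a simple back-off to the unigram estimate."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total   # back off, with a crude discount

print(bigram_prob("the", "cat"))  # seen bigram: relative frequency
print(bigram_prob("cat", "dog"))  # unseen bigram: backed-off estimate
```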


2. Neural Language Model (NLM)

In the field of natural language processing, neural network models such as recurrent neural networks (RNNs) are widely used to model the probability of word sequences. Early work introduced the concept of distributed word representations and built word prediction functions on top of distributed word vectors, an important contribution to the field. Subsequent research extended the idea of learning effective features for words and sentences, developing general neural network methods that provide unified solutions for various natural language processing tasks. In addition, word2vec proposed using simplified shallow neural networks to learn distributed word representations, which proved highly effective across a variety of natural language processing tasks. These studies applied language models to representation learning rather than just word sequence modeling, and had a profound influence on natural language processing.

3. Pre-trained language model (PLM)

PLMs acquire semantic representations through pre-training on large-scale corpora and are then fine-tuned for downstream tasks. The introduction of architectures such as the Transformer greatly improved performance, and "pre-train then fine-tune" became an important paradigm in natural language processing.

4. Large Language Model (LLM)

Large language models continue to expand the scale of models and data, demonstrating powerful language capabilities that small models do not have. Models such as GPT-3 exhibit amazing contextual learning capabilities. ChatGPT successfully applies large language models to open domain conversations.

Compared with the pre-trained language model (PLM), the large language model (LLM) has three key differences:

1) LLM exhibits amazing emergent capabilities that PLM does not have, making it powerful on complex tasks

2) LLM will change the way humans develop and use AI systems, and needs to be accessed through prompt interfaces

3) The boundaries between LLM research and engineering are no longer clear. LLM technology is leading the revolution in the fields of AI, natural language processing, information retrieval and computer vision, and a practical application ecology based on LLM is taking shape.

However, the intrinsic principles and key factors of LLM need to be further explored. It is very difficult to train large-scale LLM, and there are also challenges in aligning LLM with human values. Therefore, more attention needs to be paid to the research and application of LLM.

2. Overview

The background of the large language model (LLM) will be outlined below, and the technical evolution of the GPT series of models will be summarized.

1. Background of large language models

A large language model (LLM) usually refers to a Transformer-based language model trained on large-scale text data that contains hundreds of billions (or more) of parameters, such as GPT-3, PaLM, Galactica, LLaMA, and LLaMA 2. LLMs demonstrate strong language understanding and the ability to solve complex tasks through text generation. To quickly understand how LLMs work, their basic background is introduced below, covering scaling laws, emergent abilities, and key techniques.

1) Scaling laws of large language models

At present, large language models are mainly built on the Transformer architecture, in which multi-head attention layers are stacked in very deep networks. Existing large language models adopt similar Transformer structures and the same pre-training objectives as small language models (such as language modeling), but they greatly scale up the model size, training data volume, and total compute (by orders of magnitude). A large body of research shows that scaling can significantly improve the capabilities of language models, so it is of interest to establish a quantitative description of the scaling effect.

KM scaling law: In 2020, the OpenAI team first proposed that the performance of a neural language model has a power-law relationship with model size (N), dataset size (D), and training compute (C). Based on experiments, three formulas were proposed to describe the scaling behavior under a given compute budget:

L(N) = (N_c / N)^{α_N}, L(D) = (D_c / D)^{α_D}, L(C) = (C_c / C)^{α_C}

Here L is the cross-entropy loss measured in nats (natural-log units), and the fitted exponents are approximately α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.050. The three laws were obtained by fitting language model performance across different data volumes, model sizes, and training compute. The results show that model performance depends very strongly on all three factors.

Chinchilla scaling law: The Google DeepMind team proposed an alternative form of scaling law to guide compute-optimal training of large language models. They conducted rigorous experiments over a wider range of model sizes and data volumes and fitted a similar scaling law, but with different coefficients:

L(N, D) = E + A / N^α + B / D^β

In this law, E, A, B, α and β are empirically determined coefficients (reported values are roughly E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28). The researchers further showed how to optimally allocate the compute budget between model size and data volume by minimizing the loss L(N, D) under the training compute constraint C ≈ 6ND:

N_opt(C) = G · (C/6)^a, D_opt(C) = G⁻¹ · (C/6)^b, where a = β / (α + β) and b = α / (α + β)

Here G is a scaling coefficient computed from the coefficients A, B, α and β. Analysis in the literature shows that, as the compute budget grows, the KM scaling law tends to allocate more of the budget to model size, while the Chinchilla scaling law argues that model size and data volume should grow in roughly equal proportions. Despite some restrictive assumptions, these scaling laws give an intuitive picture of scaling effects and can be used to predict language model performance during training. However, some abilities (such as in-context learning) cannot be predicted from scaling laws alone and only appear once the model exceeds a certain scale.
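To make the allocation rule concrete, here is a minimal sketch in Python, using the published Chinchilla coefficients (α ≈ 0.34, β ≈ 0.28) purely as illustrative assumptions; the starting model and budget factor are made up for the example.

```python
# Chinchilla-style compute-optimal allocation: if the budget C ~= 6*N*D
# grows by some factor, scale parameters and tokens in fixed proportions.
alpha, beta = 0.34, 0.28          # assumed coefficients (Hoffmann et al.)
a = beta / (alpha + beta)         # model-size exponent:  N_opt ∝ C^a
b = alpha / (alpha + beta)        # data-size exponent:   D_opt ∝ C^b

def scale_up(n_params: float, n_tokens: float, budget_factor: float):
    """Scale model size and token count for a budget_factor-times-larger C."""
    return n_params * budget_factor**a, n_tokens * budget_factor**b

# Example: starting from 10B parameters / 200B tokens, with 100x more compute
n, d = scale_up(10e9, 200e9, 100.0)
print(f"N_opt ≈ {n / 1e9:.0f}B parameters, D_opt ≈ {d / 1e9:.0f}B tokens")
```

Under these coefficients both exponents are close to 0.5, which is why the Chinchilla recommendation is often summarized as "scale model and data roughly equally."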

One of the key characteristics of large language models is that they exhibit emergent abilities that pre-trained language models lack, that is, new abilities that appear only after the model reaches a certain scale. When emergent abilities arise, performance suddenly rises well above random levels, similar to a phase transition in physics. Emergent abilities can relate to complex tasks, but attention should be paid to the general abilities that apply across a wide range of tasks. Three typical emergent abilities of large language models, along with representative models, are briefly introduced below.

In-context learning: This ability was first identified in GPT-3: given only a natural language instruction and a few examples, the model can generate the expected output without additional training. The ability is related to model scale and only appears beyond a certain parameter count.

Instruction following: Through instruction fine-tuning, large language models can generalize to completely unseen tasks described only in natural language. This ability improves markedly once the model exceeds about 68 billion parameters, and different models master it to different degrees.

Step-by-step reasoning: Small models struggle with complex tasks that require multi-step reasoning, while large language models can complete such tasks when given chain-of-thought prompts containing intermediate reasoning steps. This prompting effect only becomes significant once the model exceeds about 60 billion parameters, and different tasks rely on the ability to varying degrees.

2) Key technologies of large language models

After a long period of development, large language models (LLM) have evolved to the current stage of general use and powerful capabilities. Major technological advances include:

Scaling: Increasing model size, data size, and training compute can significantly improve LLM capabilities. It is also important to use scaling laws to guide resource allocation rationally.

Training: Distributed training algorithms are critical to successfully training large models. Several optimization frameworks and techniques can facilitate large-scale distributed training.

Ability eliciting: Designing appropriate prompting strategies can elicit the latent capabilities of LLMs, although the same techniques may have different effects on small models.

Alignment fine-tuning: Making LLM-generated content consistent with human values through reinforcement learning from human feedback.

Tool operation: Using external tools to compensate for the limitations of LLM, similar to its "eyes and ears", can expand the scope of capabilities.

Additionally, many other factors, such as hardware upgrades, have contributed to LLM's success. However, we mainly discuss the main technical approaches and key findings in developing LLM.

2. Technical evolution of GPT series models

ChatGPT has received widespread attention for its excellent ability to communicate with humans. It is developed based on the powerful GPT model, and its conversational capabilities have been specially optimized. Considering the strong interest in ChatGPT and GPT models, this paper specifically summarizes the technical evolution of the GPT series of models in the past few years to improve public understanding. Generally speaking, OpenAI has gone through the following stages in large language model research:

1) Early exploration

According to an interview with OpenAI co-founder Ilya Sutskever, the idea of using language models to build intelligent systems was explored at OpenAI early on, though at the time the experiments used recurrent neural networks (RNNs). With the emergence of the Transformer architecture, OpenAI developed two early GPT models, GPT-1 and GPT-2, which can be regarded as the foundation for the later, more powerful GPT-3 and GPT-4.

GPT-1: In 2018, OpenAI developed the first GPT model based on the then-new Transformer architecture. GPT-1 adopts the Transformer decoder structure and uses unsupervised pre-training and supervised fine-tuning methods to lay the foundation for subsequent GPT models.

GPT-2: GPT-2 increased the parameter count of GPT-1 to 1.5 billion and was trained on a larger web-page dataset. It completes downstream tasks through unsupervised language modeling, without explicit fine-tuning on annotated data.

2) Capability leap

Although GPT-2 was intended to be a general-purpose multi-task learner through unsupervised training, its performance was still weak compared with the supervised fine-tuned state-of-the-art methods of the time. Because of its relatively small scale, GPT-2 was widely fine-tuned for downstream tasks, especially dialogue. Building on GPT-2, GPT-3 achieved a major leap in capability under a similar generative pre-training architecture by scaling up the model.

GPT-3, released in 2020, further expanded the model size to 175 billion parameters. The GPT-3 paper formally introduced the concept of in-context learning (ICL), in which a language model is used in a few-shot or zero-shot manner. ICL is still essentially language modeling, but what is predicted is the text output for a given task. GPT-3 not only performs strongly on NLP tasks but also shows remarkable adaptability on tasks requiring reasoning. Although the GPT-3 paper does not explicitly discuss emergent abilities, one can observe performance leaps that exceed what the basic scaling laws predict, marking an important evolution from pre-trained language models to large language models.

3) Ability enhancement

GPT-3 became the basis for OpenAI to develop more powerful language models, improving mainly in two ways:

Training on code data: The original GPT-3 was trained on plain text and had weak reasoning abilities. Fine-tuning on GitHub code (as in Codex) enhanced its programming and mathematical problem-solving abilities.

Aligning with humans: OpenAI began researching how to learn from human preferences as early as 2017, using reinforcement learning methods to train language models to match human expectations. This not only improves instruction following but also reduces the generation of harmful content. Aligning language models with human values through reinforcement learning from human feedback is important.

4) Important milestones in language models

Based on these earlier explorations, OpenAI achieved two important milestones, ChatGPT and GPT-4, which greatly improved the capabilities of AI systems:

ChatGPT: Released in November 2022, ChatGPT is a dialogue-optimized GPT model whose training method is similar to that of InstructGPT. It exhibits an excellent ability to converse with people and rich knowledge; it is currently the most capable chatbot and has had a great impact on AI research.

GPT-4: Released in March 2023, GPT-4 supports multi-modal input and is significantly improved over GPT-3.5, outperforming ChatGPT on a variety of difficult tasks. Through iterative alignment, its responses to malicious questions are also safer, and OpenAI employs various strategies to mitigate potential risks.

Despite great progress, these language models still have limitations and require continuous optimization to make them more powerful and secure. OpenAI uses an iterative deployment strategy to control risks.

3. Large language model resources

Given the technical difficulties and computing resource requirements for training large language models, it is very difficult to develop or reproduce large language models from scratch. A feasible approach is to conduct incremental development or experimental research based on existing language models. The following is a brief summary of publicly available resources for developing large language models, including public model checkpoints, corpora, and code libraries.

1. Publicly available model checkpoints or APIs

Given the high cost of pre-training models, public pre-training checkpoints are critical for research organizations working on large language models. Parameter scale is a key factor to consider when using these models. To help users choose appropriate research directions based on computing resources, the public models are divided into two levels: tens of billions and hundreds of billions of parameters. In addition, the public API can directly use the model for inference without running it locally. The following describes the exposed model checkpoints and APIs.

1) Models with tens of billions of parameters

Public language models with tens of billions of parameters include mT5, PanGu-α, T0, GPT-NeoX-20B, CodeGen, UL2, Flan-T5, and mT0, with parameter counts ranging roughly from 10 to 20 billion. Among them, Flan-T5 can be used for instruction fine-tuning research, CodeGen is designed for code generation, and mT0 supports multiple languages. For Chinese tasks, PanGu-α performs well. LLaMA, a more recently published model, exhibits remarkable ability on instruction-following tasks. Models of this size typically require hundreds or even thousands of GPUs/TPUs to train; to estimate the required computing resources in advance, compute metrics such as total training FLOPs can be used.
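As a rough illustration of such an estimate, the sketch below applies the widely used C ≈ 6ND rule of thumb for the training compute of dense Transformers; the GPU throughput and utilization figures are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope training-compute estimate using C ~= 6 * N * D.
def training_gpu_days(n_params, n_tokens, flops_per_gpu=312e12, utilization=0.4):
    """Rough GPU-days to train a dense Transformer.
    flops_per_gpu: peak throughput (312 TFLOPS ~ A100 BF16 peak).
    utilization: assumed fraction of peak sustained in practice."""
    total_flops = 6 * n_params * n_tokens          # total training FLOPs
    seconds = total_flops / (flops_per_gpu * utilization)
    return seconds / 86400

# Example: a 20B-parameter model trained on 400B tokens.
print(f"{training_gpu_days(20e9, 400e9):,.0f} GPU-days")  # ~4,451
```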

2) Models with hundreds of billions of parameters

There are few public language models with hundreds of billions of parameters; the main ones are OPT, OPT-IML, BLOOM, BLOOMZ, GLM, and Galactica. Among them, OPT was developed to replicate GPT-3, BLOOM and BLOOMZ perform well in multilingual modeling, and OPT-IML is an instruction-tuned version of OPT. Models of this type usually require thousands of GPUs/TPUs to train: for example, OPT used 992 A100 GPUs and GLM used a cluster of 96 DGX-A100 nodes.

3) Public API for large language models

Compared with using models directly, APIs provide a more convenient way to use large language models without running them locally. The APIs for the GPT series of models have been widely used, including ada, babbage, curie, and davinci, where davinci corresponds to the largest GPT-3 model. There is also a Codex-related code-generation API. The GPT-3.5 series adds new interfaces such as text-davinci-002, and gpt-3.5-turbo-0301 corresponds to ChatGPT. The GPT-4 API has also recently been released. In general, the choice of interface depends on the specific application scenario and response requirements.
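As a minimal sketch of such API-based usage (using the 2023-era openai Python package; the model name, prompt, and parameters are illustrative):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # set your own key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the ChatGPT-class interface mentioned above
    messages=[{"role": "user",
               "content": "Explain n-gram language models in one sentence."}],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```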

2. Common corpus

Unlike small-scale pre-trained language models, large language models require a larger amount and wide range of data for training. To meet this need, an increasing number of publicly available datasets are being released for research. Here is a brief overview of some commonly used large language model training corpora, which are divided into six categories according to content types: Books, CommonCrawl, Reddit Links, Wikipedia, Code, Others. 

1)Books

BookCorpus contains over 11,000 e-books covering a wide range of topics and is used by early small-scale models such as GPT and GPT-2. The Gutenberg corpus contains more than 70,000 literary works of various types and is currently one of the largest public book collections. It is used to train models such as MT-NLG and LLaMA. The unpublished Books1 and Books2 data sets used in GPT-3 are larger.

2)CommonCrawl

CommonCrawl is one of the largest open web-crawl databases and has been widely used in large-scale language model training. Existing filtered datasets based on CommonCrawl include C4, CC-Stories, CC-News, and RealNews. C4 includes five variants: en, en.noclean, realnewslike, webtextlike, and multilingual. Among them, the en version has been used to pre-train T5, LaMDA, Gopher, and UL2; CC-Stories and CC-News are subsets of CommonCrawl data consisting of story-form and news content respectively; and RealNews is also used as pre-training data for multiple models.

3)Reddit Links

Reddit is a social media platform where users can submit links and posts. WebText is a well-known Reddit-based corpus composed of content from highly upvoted links on Reddit, and OpenWebText is an easily accessible open-source alternative. Pushshift.io is a continuously updated dataset that includes Reddit's historical data since the site's creation; it provides useful utilities for searching, summarizing, and performing preliminary statistical analysis on the full dataset, so users can easily collect and process Reddit data.

4)Wikipedia

Wikipedia is an online encyclopedia containing a large number of high-quality articles on a wide variety of topics. Its articles are written in an expository style with supporting citations, covering many languages and a broad range of knowledge areas. The English version of Wikipedia is widely used by most LLMs (such as GPT-3, LaMDA, and LLaMA); it is also available in many languages and can be used in multilingual settings.

5)Code

Code data is mainly collected by crawling open-source-licensed code from the Internet, including public code repositories under open-source licenses (such as GitHub) and code-related question-and-answer platforms (such as StackOverflow). Google has publicly released the BigQuery dataset, which contains a large number of open-source-licensed code snippets in various programming languages and is a typical code dataset. BIGQUERY, used by CodeGen, is a subset of the BigQuery dataset used to train CodeGen-Multi, the multilingual version of CodeGen.

6)Others

The Pile is a large-scale, diverse open-source text dataset (over 800GB of data) that includes books, websites, code, scientific papers, and social media content. It consists of 22 high-quality subsets and has been widely used for models of different parameter scales, such as GPT-J (6B), CodeGen (16B), and Megatron-Turing NLG (530B). In addition, ROOTS is a large corpus composed of various smaller datasets covering 59 languages, and it was used to train BLOOM.

In order to pre-train LLM, it is usually necessary to mix different data sources, such as C4, OpenWebText, and The Pile, etc., and extract data from related sources (such as Wikipedia and BigQuery) to enrich the corresponding information in the pre-training data. To quickly understand the data sources used by existing LLMs, the pre-training corpora of three representative LLMs are introduced below:

GPT-3 (175B) is trained on a mixed dataset including CommonCrawl, WebText2, Books1, Books2, and Wikipedia.

PaLM (540B) uses a pre-trained dataset consisting of social media conversations, filtered web pages, books, Github, multilingual Wikipedia, and news, containing a total of 780 billion tokens.

LLaMA draws its training data from multiple sources, including CommonCrawl, C4, GitHub, Wikipedia, books, arXiv, and StackExchange. LLaMA (7B) and LLaMA (13B) were trained on 1.0 trillion tokens, while LLaMA (33B) and LLaMA (65B) were trained on 1.4 trillion tokens.

3. Code base resources

In this section, we briefly introduce some code libraries that can be used to develop LLM. 

1)Transformers

Transformers is a Python library developed by Hugging Face for building models with the Transformer architecture. It provides a simple and easy-to-use API that makes it convenient to customize various pre-trained models, and it has a large, active community of users and developers that regularly updates and improves the models and algorithms.
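A minimal sketch of the typical usage pattern, loading a small public checkpoint ("gpt2" is used here purely as an illustrative model) and generating text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```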

2)DeepSpeed

A deep learning optimization library developed by Microsoft (compatible with PyTorch) that has been used to train several LLMs, such as MT-NLG and BLOOM. It supports distributed training optimizations such as memory optimization (the ZeRO technique and gradient checkpointing) and pipeline parallelism.
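A minimal sketch of wrapping a PyTorch model with DeepSpeed's ZeRO sharding; the configuration values are illustrative assumptions rather than recommended settings, and in practice the script is launched with the deepspeed launcher across multiple GPUs:

```python
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for a real Transformer

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states + gradients
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# In the training loop, engine.backward(loss) and engine.step()
# replace the usual loss.backward() / optimizer.step() calls.
```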

3)Megatron-LM

A deep learning library developed by NVIDIA for training LLMs. It provides distributed training optimizations such as model and data parallelism, mixed-precision training, and FlashAttention, which improve training efficiency and enable efficient distributed training at scale.

4)JAX

A Python library developed by Google for high-performance machine learning computation. It supports efficient array operations with hardware acceleration across a variety of devices, along with features such as automatic differentiation and just-in-time compilation.

5) Colossal-AI

A deep learning library developed by HPC-AI Tech for training large-scale artificial intelligence models. It is implemented on top of PyTorch and supports a rich set of parallel training strategies, as well as the PatrickStar method for optimizing heterogeneous memory management. HPC-AI Tech recently released ColossalChat, a ChatGPT-like model (in 7B and 13B versions) built with it.

6)BMTrain

A distributed training library developed by OpenBMB that emphasizes concise code, low resource usage, and high availability. BMTrain has migrated common LLMs (such as Flan-T5 and GLM) into its ModelCenter, where users can use them directly.

7)FastMoE

FastMoE is a training library specifically for MoE models, developed based on PyTorch and focusing on efficiency and user-friendliness. It simplifies the process of converting Transformer models to MoE models and supports parallel training of data and models.

In addition to the resources provided by the above-mentioned deep learning frameworks, other frameworks such as PyTorch, TensorFlow, MXNet, PaddlePaddle, MindSpore and OneFlow also provide support for parallel algorithms, which are often used to train large-scale models.

4. Data Collection

LLM requires high-quality data for pre-training, and its model capabilities also rely on pre-processing methods and pre-training corpora. The following mainly discusses the collection and processing of pre-training data, including data sources, pre-processing methods and analysis of the impact on LLM performance.

1. Data source

The key to developing a capable LLM is to collect a large natural language corpus. The existing LLM mixes various public text datasets as a pre-training corpus, and the sources are divided into general text and special text. General text data (such as web pages, books, conversation texts, etc.) are large-scale, diverse and easy to obtain, and are utilized by most LLMs to enhance their language modeling and generalization capabilities. Specialized datasets (such as multilingual data, scientific data, code, etc.) give LLM the ability to solve specialized tasks.

Ratio of various data sources in existing LLM pre-training data

1) General text data

General-purpose pre-training data is an integral part of LLM training, providing rich text resources on diverse topics. Three important types of general text data are web pages, conversation text, and books.

Web pages include Wikipedia, news sites, and so on, but low-quality content must be filtered out. To obtain large amounts of data, researchers typically rely on web-crawled data such as CommonCrawl, which mixes high-quality and low-quality text and therefore needs filtering and processing.

Conversation text can enhance an LLM's dialogue ability and its performance on question-answering tasks. Researchers can use subsets of public conversation corpora or collect conversation data from online social media. Since conversations often involve multiple participants, an effective way to process the data is to convert each conversation into a tree structure in which every utterance is linked to the utterance it responds to. In this way, a multi-party conversation tree can be divided into multiple sub-conversations in the pre-training corpus. However, introducing too much conversational data may cause instructions to be mistakenly perceived as the start of a conversation, reducing the effectiveness of instructions.
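A minimal sketch of this tree-to-sub-conversation transformation; the dictionary-based structure and the example thread are illustrative assumptions:

```python
# Split a multi-party conversation tree into linear sub-conversations
# (one root-to-leaf path per sub-conversation).
def tree_to_subconversations(utterance, path=()):
    """Each utterance: {"text": str, "replies": [child utterances]}."""
    path = path + (utterance["text"],)
    if not utterance["replies"]:          # leaf: one complete sub-conversation
        return [list(path)]
    subs = []
    for reply in utterance["replies"]:
        subs.extend(tree_to_subconversations(reply, path))
    return subs

thread = {"text": "How do I start?", "replies": [
    {"text": "Read the docs.", "replies": []},
    {"text": "Try the tutorial.", "replies": [
        {"text": "Thanks, that worked!", "replies": []}]},
]}
for sub in tree_to_subconversations(thread):
    print(sub)
```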

Books are another important source of general text data, providing more formal long texts relative to other corpora. This has potential benefits for LLM to learn language knowledge, model long-term dependencies, and generate narrative and coherent texts. Existing open source datasets include Books3 and Bookcorpus2, which are available in the Pile dataset.

2) Specialized text data

Specialized datasets are very useful for improving the capabilities of LLM in specific tasks. Three specialized data types include multilingual text, scientific text, and code.

• Multilingual text: Integrating multilingual corpora can enhance a model's multilingual understanding and generation abilities. For example, BLOOM and PaLM include multilingual data covering 46 and 122 languages, respectively, in their pre-training corpora. These models perform excellently on multilingual tasks such as translation, multilingual summarization, and multilingual question answering, achieving results comparable to, or even better than, state-of-the-art models fine-tuned on the target languages.

• Scientific texts: The continuous growth of scientific publications bears witness to human exploration of science. To enhance LLM's understanding of scientific knowledge, scientific corpora can be incorporated into the model's pre-training corpus. By pre-training on a large number of scientific texts, LLM can achieve excellent performance in scientific and reasoning tasks. Existing work mainly collects arXiv papers, scientific textbooks, mathematics web pages and other related scientific resources. Due to the complexity of data in scientific fields, such as mathematical symbols and protein sequences, specific tokenization and preprocessing techniques are often required to convert these different formats of data into a unified form that can be processed by language models.

• Code: Programming has received much attention in academia and in PLM applications, but producing high-quality, accurate programs remains challenging. Recent research shows that pre-training LLMs on large code corpora can improve the quality of generated programs, as measured by passing unit tests or solving competitive programming problems. Code corpora for pre-training LLMs come mainly from two sources: programming Q&A communities and open-source software repositories. Unlike natural language text, code is written in a programming-language format with long-range dependencies and precise execution logic. Recent studies also suggest that training on code may be a source of complex reasoning ability, and that formatting reasoning tasks as code can help LLMs generate more accurate results.

2. Data preprocessing

After collecting a large amount of text data, it is necessary to preprocess the data, especially to eliminate noisy, redundant, irrelevant and potentially harmful data, because these data may affect the capability and performance of LLM. Data preprocessing strategies to improve data quality are reviewed below. The typical flow of preprocessing the pre-training data for LLM is illustrated in the figure.

A typical flow chart for preprocessing pre-training data

1) Quality filtering

To remove low-quality data, existing work usually adopts classifier-based or heuristic-based methods. Classifier-based methods train a classifier on high-quality text and predict a quality score for each document, filtering out low-scoring data; however, they may also remove high-quality text written in dialects, colloquial language, or sociolects, introducing bias and reducing diversity. Heuristic-based methods eliminate low-quality text through a set of hand-designed rules, which can be summarized as: remove duplicate, irrelevant, or incomplete text; remove text with spelling errors, grammatical errors, or unusual wording; and remove text that lacks contextual information.
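A minimal sketch of what such heuristic rules can look like in practice; the thresholds are illustrative assumptions, not values from any particular pipeline:

```python
def keep_document(text: str) -> bool:
    """Apply simple heuristic quality rules to one document."""
    words = text.split()
    if len(words) < 50:                        # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive text
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    if alpha_ratio < 0.7:                      # mostly symbols or boilerplate
        return False
    return True
```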

2) Deduplication

Existing research has found that duplicated data in the corpus reduces the model's diversity and can destabilize the training process, so the pre-training corpus needs to be deduplicated. Specifically, duplicates can be removed at several granularities: sentence level, document level, and dataset level. At the sentence level, low-quality sentences containing repeated words and phrases should be removed; at the document level, duplicate documents with similar content can be removed by detecting their overlap ratio; and at the dataset level, overlap between the training set and the evaluation set must be prevented. All three levels of deduplication help improve LLM training and should be used together.
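A minimal sketch of document-level deduplication via n-gram Jaccard overlap; production pipelines typically use scalable approximations such as MinHash/LSH rather than exact pairwise comparison:

```python
def ngrams(text, n=5):
    """The set of word n-grams in a document."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicate(doc_a, doc_b, threshold=0.8):
    """Flag two documents whose n-gram Jaccard similarity is high."""
    a, b = ngrams(doc_a), ngrams(doc_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold
```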

3) Privacy removal

Most pre-training text data comes from online sources and includes user-generated content involving sensitive or personal information, which increases the risk of privacy leaks. Personally identifiable information (PII) therefore needs to be removed from the pre-training corpus. A straightforward and effective approach is to use rule-based methods, such as keyword identification, to detect and delete PII. Researchers have also found that LLMs' vulnerability to privacy attacks may be attributable to duplicated PII in the pre-training corpus, so deduplication can also reduce privacy risk.
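A minimal sketch of a rule-based PII scrubber; the two regular expressions are deliberately simple illustrations, not an exhaustive rule set:

```python
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"(?:\+\d{1,3}[ -]?)?(?:\d[ -]?){9,12}\d"), "<PHONE>"),
]

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with placeholder tokens."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(scrub_pii("Contact alice@example.com or +1 555 123 4567."))
# -> "Contact <EMAIL> or <PHONE>."
```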

4) Tokenization

Tokenization is a key step in data preprocessing, dividing raw text into sequences of tokens that serve as input to the LLM. Although off-the-shelf tokenizers are convenient, it is often more effective to use a tokenizer trained specifically on the pre-training corpus, especially for corpora spanning multiple domains, languages, and formats. Several recent LLMs use SentencePiece to train customized tokenizers on the pre-training corpus, using the BPE algorithm to ensure that information is not lost. Note, however, that normalization techniques in such pipelines may slightly degrade tokenization performance.
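A minimal sketch of training a custom BPE tokenizer with SentencePiece; the corpus file name and vocabulary size are illustrative assumptions:

```python
import sentencepiece as spm

# Train a BPE tokenizer directly on the pre-training corpus
# (assumed to be a plain-text file with one sentence per line).
spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",
    model_prefix="custom_bpe",
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="custom_bpe.model")
print(sp.encode("Large language models are powerful.", out_type=str))
```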

3. The impact of pre-training data on large language models

Unlike small-scale PLM, large-scale LLM usually cannot perform multiple pre-training iterations, so it is very important to prepare an adequate pre-training corpus before training. Next, we will discuss how factors such as the quality and distribution of the pre-training corpus affect the performance of LLM.

1) Mixed sources

Pre-training data from different domains or scenarios carries different linguistic features and semantic knowledge, so the distribution over data sources must be set carefully when mixing them. Gopher experiments show that increasing the proportion of book data improves the model's ability to capture long-range dependencies in text, and increasing the proportion of the C4 dataset improves performance on the C4 validation set. However, training on too much data from a single domain hurts the LLM's generalization to other domains. Researchers should therefore carefully determine the proportion of data from different domains in the pre-training corpus in order to develop an LLM that better meets their needs.

2) The amount of pre-training data

To pretrain an effective LLM, it is important to collect enough high-quality data. Existing research has found that as the LLM parameter size increases, more data is needed to train the model. Many existing LLMs suffer from sub-optimal training due to lack of sufficient pre-training data. Extensive experiments show that it is necessary to use equal scales of model parameters and training tokens for a given computing budget. LLaMA studies show that smaller models can achieve good performance using more data and training for longer periods of time. Therefore, researchers are advised to focus on the amount of high-quality data when adequately training models.

3) Quality of pre-training data

Research shows that pre-training on a low-quality corpus can harm model performance. To develop a well-performing LLM, both the quantity and the quality of the collected training data are crucial. Recent research has demonstrated the impact of data quality on downstream task performance: comparing models trained on filtered versus unfiltered corpora leads to the same conclusion, namely that pre-training an LLM on cleaned data improves performance. More specifically, duplicated data may cause a "double descent" phenomenon (performance degrading before improving again) and may even destabilize the training process. In addition, duplicated data reduces the LLM's ability to copy from context, which further affects its generalization in ICL. Researchers therefore need to preprocess the pre-training corpus carefully to improve training stability and avoid harming model performance.

5. Adaptation and fine-tuning of large language models

The pre-trained LLM acquires general abilities for solving various tasks, and these abilities can be further adapted to specific goals. Two methods for adapting a pre-trained LLM are introduced below: instruction fine-tuning and alignment fine-tuning. The former aims to enhance the LLM's abilities, while the latter aims to align the LLM's behavior with human values or preferences.

1. Instruction fine-tuning 

Instruction fine-tuning is a method for fine-tuning a pre-trained LLM on a collection of natural-language-formatted instances. After collecting or constructing instruction-formatted instances, the LLM is fine-tuned in a supervised manner, for example using a sequence-to-sequence loss. After fine-tuning, the LLM can generalize to unseen tasks, even in multilingual scenarios.

1) Construction of formatted instances

An instruction-formatted instance typically includes a task description, an input-output pair, and optionally a few examples. Existing research has released a large amount of annotated data in natural language format, which is an important public resource.

Formatting existing datasets: Several early research efforts collected instances from different domains and created supervised multi-task training datasets for multi-task learning. Human-written natural language task descriptions are added to these datasets to guide the language model in understanding the different tasks; for example, every question-answering task carries a description such as "Please answer the following question." Instructions have been shown to be a key factor in the task-generalization ability of language models. To produce better annotation data for instruction tuning, some work reverses existing input-output pairs to design new instructions, and other work uses heuristic templates to convert large amounts of unlabeled text into labeled instances.

Formatting human needs: Although a large amount of training data has been formatted with instructions, this data mainly comes from public NLP datasets and lacks diversity and alignment with real user needs. To address this, some work uses real queries submitted by users to the OpenAI API as task descriptions. These queries, expressed in natural language, are well suited to eliciting a language model's ability to follow instructions. In addition, annotators are asked to write instructions for various real-life tasks, such as open-ended generation, question answering, brainstorming, and chatting; other annotators then write answers to these instructions directly, and each instruction and its desired output are paired as a training instance. Notably, these real-world tasks are also used for alignment fine-tuning. Still other work feeds existing instances into language models to generate instructions and data, reducing the burden of manual annotation and producing more diverse training data.

Key Factors for Building Instances: The quality of instruction instances has a significant impact on the performance of the model. Some key factors in the construction of the examples are discussed here.

Diagram of an instruction-formatted instance and two ways of constructing instruction-formatted instances

Increasing the number of instructions: A large body of results shows that expanding the number of tasks can significantly improve the generalization ability of large language models. As the number of tasks grows, model performance initially improves continuously, but once the task count reaches a certain level the gains become negligible. A reasonable conjecture is that a certain number of representative tasks already provides relatively sufficient knowledge, so adding more tasks brings limited benefit. It is also helpful to increase task diversity along dimensions such as the length, structure, and creativity of the task descriptions. As for the number of instances per task, previous research has found that a small number of instances can usually saturate the model's generalization performance, whereas drastically increasing the instance count for some tasks (e.g., into the hundreds) may lead to overfitting and hurt performance.

The design of the instruction format is also important: task descriptions and examples can often be added to the input-output pairs. An appropriate number of examples facilitates model understanding and reduces sensitivity to instruction engineering. But adding too much irrelevant content may be counterproductive. Instructions containing chained reasoning can improve the model's reasoning capabilities.

2) Instruction fine-tuning strategy

Unlike pre-training, instruction fine-tuning is usually more efficient because only a moderate number of instances are used for training. Instruction fine-tuning can be regarded as a supervised training process, and its optimization differs from pre-training in aspects such as the training objective (e.g., a sequence-to-sequence loss) and optimizer settings (e.g., smaller batch sizes and learning rates). These details require special attention in practice. Beyond optimizer settings, instruction fine-tuning also needs to consider two important aspects:

Data distribution balance: Since instruction fine-tuning mixes multiple tasks, the data proportions of the different tasks need to be balanced. One approach is to combine all datasets and sample instances proportionally. Usually, high-quality collections such as FLAN are given a higher sampling ratio, and a maximum cap is set to limit the number of examples any one dataset can contribute, preventing large datasets from dominating the sample (see the sketch after these two points).

Incorporating pre-training: Some methods add pre-training data to the instruction fine-tuning stage as a form of regularization. Others skip separate stages altogether and use multi-task learning to train on pre-training data and instruction-formatted data simultaneously from scratch. Some models also include instruction data as a small part of the pre-training corpus, gaining the advantages of both pre-training and instruction fine-tuning.
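A minimal sketch of the proportional-sampling-with-cap idea from the first point above; dataset names, sizes, and the cap are illustrative assumptions:

```python
def mixing_weights(dataset_sizes, cap=500_000):
    """Sampling weights proportional to dataset size, with each dataset's
    contribution capped so large collections cannot dominate the mix."""
    effective = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(effective.values())
    return {name: size / total for name, size in effective.items()}

sizes = {"flan": 400_000, "qa_small": 50_000, "web_instructions": 5_000_000}
print(mixing_weights(sizes))   # the 5M-example set no longer dominates
```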

3) The effect of instruction fine-tuning

Instruction fine-tuning has the following two main effects on the language model:

Performance improvement: Instruction fine-tuning can significantly improve the capabilities of language models of different sizes, and fine-tuning has obvious effects even on small data sets. The fine-tuned small model is sometimes even better than the original large model. Instruction fine-tuning provides a general and efficient method to improve the capabilities of existing language models.

Task generalization: Instruction fine-tuning gives the model the ability to follow human natural-language instructions to complete tasks and to generalize even to unseen tasks. It has been shown to enhance performance on both seen and unseen tasks, to help alleviate some weaknesses of language models, and to improve the ability to solve real-world tasks. A fine-tuned model can also generalize abilities learned on English tasks to tasks in other languages, and can even achieve satisfactory multilingual task performance using only English instructions.

2. Alignment fine-tuning

This part first introduces the background of alignment fine-tuning, including definitions and evaluation criteria; then focuses on the collection method of human feedback data for aligning language models; finally discusses the key techniques of using human feedback for reinforcement learning to achieve alignment fine-tuning.

1) Align fine-tuned background and standards

Language models have demonstrated powerful abilities on many natural language processing tasks, but they can sometimes exhibit undesired behaviors, such as generating false information, pursuing incorrect objectives, and producing harmful, misleading, or biased output. The pre-training objective of language modeling does not take human values into account, so alignment fine-tuning is required to make model behavior match human expectations.

The criteria for alignment fine-tuning differ from those of pre-training and other fine-tuning; they are more subjective and complex, for example usefulness, honesty, and harmlessness. These criteria are difficult to use directly as optimization objectives and require specific techniques to realize. Usefulness requires the model to solve the user's problem and answer questions in a concise and efficient way, and to show the ability to ask appropriate clarifying questions to obtain more information; defining and measuring usefulness is challenging. Honesty requires providing accurate content without fabrication and communicating uncertainty; it is relatively more objective and may rely less on human effort. Harmlessness requires not generating offensive or discriminatory language and detecting and refusing malicious requests; it depends on the context of use.

2) Collection of human feedback

Choosing the right annotators is very important: they should be native speakers with a high education level and strong English ability, ideally with relevant academic qualifications. It is also necessary to evaluate the consistency between annotators' outputs and the researchers' expectations, select the most consistent annotators for the work, and provide detailed guidance throughout the annotation process. There are three main ways to collect human feedback:

Ranking-based approach: Annotators rank the multiple candidate outputs generated by the model to obtain a preference ranking, and the model is adjusted according to this ranking to favor higher-ranked outputs. This yields richer preference information than selecting only a single best output.

Question-based approach: Researchers design specific questions, covering the various alignment criteria, that annotators answer to evaluate model outputs. This can provide more detailed feedback than ranking.

Rule-based approach: Researchers formulate a set of rules to test whether model outputs violate them, and annotators quantitatively score the degree of violation. This provides direct feedback on compliance with the alignment criteria.

Reinforcement learning is an important technique in alignment fine-tuning, which can learn and optimize the model to achieve alignment standards based on human feedback. Reinforcement learning methods based on human feedback are discussed in detail below.

RLHF algorithm workflow

3) Reinforcement learning based on human feedback

To ensure that LLMs are consistent with human values, methods have been proposed to fine-tune LLMs with collected human feedback data, known as reinforcement learning from human feedback (RLHF). These methods use reinforcement learning algorithms (such as PPO) to adapt the LLM to human feedback by learning a reward model. This approach puts humans into the training loop in order to develop well-aligned LLMs, as exemplified by InstructGPT.

Reinforcement learning systems based on human feedback: An RLHF system consists of three key components: the language model to be aligned, typically a generative model initialized from existing PLM parameters; a reward model, which provides guidance signals reflecting human preferences over text generated by the LM; and an RL algorithm designed for fine-tuning large-scale models with the reward model's signals. Existing work usually adopts a reward model at a different parameter scale from the LM being aligned, and PPO is the RL alignment algorithm most widely used in existing work.

Key steps in reinforcement learning based on human feedback:

Supervised fine-tuning: Collect a supervised dataset containing input prompts and desired outputs to fine-tune the LM. For example, InstructGPT asked human annotators to write prompts and the desired outputs.

Train the reward model: Train the RM on human feedback data: sample a set of model outputs, invite human annotators to label their preferences over these input-output pairs, and then train the RM to predict the human-preferred output (see the sketch after these steps).

Reinforcement learning fine-tuning: The alignment fine-tuning of the LM is formalized as an RL problem in which the policy is given by the PLM, the action space is the LM's vocabulary, the state is the token sequence generated so far, and the reward is provided by the RM. A penalty term is added to the reward function to prevent the policy from deviating too far from the initial model.
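A minimal PyTorch sketch of the two training signals described in these steps: the pairwise preference loss commonly used for reward models, and a KL-penalized reward of the kind used in the RL stage; tensor shapes and the penalty coefficient are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss: push the reward of the human-preferred output
    above that of the rejected output."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def penalized_reward(reward, logprob_policy, logprob_ref, beta=0.02):
    """RM reward minus a KL-style penalty that keeps the fine-tuned
    policy close to the initial (reference) model."""
    return reward - beta * (logprob_policy - logprob_ref)

# Dummy example: one comparison where the chosen output scores higher.
loss = reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3]))
print(loss.item())
```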

3. Efficient fine-tuning

This section discusses how to fine-tune large models (such as Transformers) efficiently. Below, we will review several representative parameter-efficient fine-tuning methods and summarize the existing work on parameter-efficient fine-tuning LLM.

1) Efficient parameter fine-tuning method

Several main methods for efficient fine-tuning of Transformer language model parameters:

Adapter fine-tuning: Insert small adapter modules (bottleneck layers that compress and then re-project feature vectors) into the Transformer model. Adapters can be placed serially or in parallel after the attention and feed-forward layers. During fine-tuning, only the adapter parameters are optimized while the original language model parameters stay fixed.

Prefix fine-tuning: Add a set of trainable prefix vectors in front of each Transformer layer as additional task-specific parameters. Reparameterization techniques are used to learn a small matrix that maps to the prefixes rather than optimizing them directly, and only the prefix parameters are optimized for downstream tasks.

Prompt fine-tuning: Add soft prompt tokens at the input layer, appended to the input text in embedded form, and optimize only the prompt embeddings to suit specific tasks, taking advantage of the free-form design of prompts.

Low-rank adaptation: Approximate each layer's parameter-update matrix with a low-rank decomposition. The original parameters are fixed, and only the two small matrices of the low-rank decomposition are trained.

The methods differ in their strengths, but they share a common point: only a small number of parameters are optimized to adapt to downstream tasks, while most of the language model's parameters remain fixed, achieving parameter-efficient fine-tuning (see the sketch below).
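As a concrete illustration of the low-rank adaptation idea, here is a minimal LoRA-style linear layer in PyTorch; the rank and scaling values are illustrative assumptions, and real implementations (e.g., the peft library) add considerably more machinery:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear augmented with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # original weights stay fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        # W'x = Wx + scale * B(Ax); only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
```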

2) Efficient fine-tuning of parameters on large language models

With the rise of large language models (LLMs), researchers have paid increasing attention to efficient fine-tuning methods in order to develop lightweight adaptation approaches for various downstream tasks. The LoRA method in particular is widely applied to open-source LLMs (such as LLaMA and BLOOM) for parameter-efficient fine-tuning, and LLaMA and its variants have attracted much attention in this context. For example, Alpaca-LoRA is a lightweight, LoRA-based version of Alpaca, a 7-billion-parameter LLaMA model fine-tuned on 52,000 instruction-following demonstrations. Alpaca-LoRA has been explored extensively across different languages and model sizes.

In addition, the LLaMA-Adapter method inserts learnable prompt vectors into each Transformer layer, proposing zero-initialized attention to alleviate the impact of under-fitted prompt vectors and thereby improve training. This approach has also been extended to multimodal settings such as visual question answering.

6. Summary and future directions

Understanding and interpreting the emergent capabilities of language models is an important yet challenging issue. As the size of the model increases, capabilities like chain reasoning will suddenly appear, but the mechanism is not yet clear. Exploring the influencing factors and theoretical explanations of emergent ability is a current research hotspot. However, more formal theories and principles still need to be established, such as explaining language models from the perspective of complex systems. Interpreting the capabilities and behavior of language models remains a fundamental question worth exploring and is key to developing the next generation of models. An interdisciplinary perspective is needed to gain deeper understanding and explanation. 

Building more efficient Transformer variants and mitigating catastrophic forgetting are two important directions for improving language model architecture in the future. Due to the high complexity of standard self-attention, more efficient attention mechanisms need to be explored. In addition, when fine-tuning a language model, original knowledge can easily be overwritten by new data and forgotten. Therefore, it is necessary to introduce more flexible mechanisms or modules to support model data updates and task specialization, while retaining the original general capabilities. Extending existing architectures to adapt to new tasks without forgetting old knowledge is a key challenge for language models.

Despite their powerful capabilities, large language models still face similar security challenges as small models, such as generating wrong information, being exploited to generate harmful content, and so on. The main countermeasure is alignment optimization through human feedback, but current reinforcement learning methods rely heavily on a large number of high-quality human annotations.

As large-scale language models (LLMs) have demonstrated powerful capabilities in various tasks, they are being widely used in various real-world applications, including specific tasks of following natural language instructions. As an important advancement, ChatGPT has changed the way people obtain information and has been reflected in the "New Bing" release. In the near future, it is foreseeable that LLM will have a significant impact on information retrieval technology, including search engines and recommendation systems. In addition, the development and use of intelligent information assistants will be widely promoted with the upgrade of LLM technology. From a broader perspective, this wave of technological innovation will form an application ecosystem supported by LLM, such as ChatGPT’s support for plug-ins, which is closely related to human life.

The current situation of my country’s computing power development

To promote the construction of computing power infrastructure and advance the digital transformation of all industries, the Ministry of Industry and Information Technology and the People's Government of the Ningxia Hui Autonomous Region held the 2023 China Computing Power (Infrastructure) Conference in Yinchuan, Ningxia, from August 18 to 19. The conference aims to continue promoting the deep integration of the digital economy with the real economy and to inject strong impetus into high-quality development.

1. The development of AI continues to deepen, driving the construction of computing power infrastructure to accelerate

The Ministry of Industry and Information Technology has been committed to promoting the construction of computing power infrastructure in recent years and continues to strengthen the top-level design of computing power. They have issued a number of policy documents, such as the "14th Five-Year Plan for the Development of the Information and Communications Industry" and the "Three-Year Action Plan for the Development of New Data Centers" to optimize the national computing power layout and promote the construction and application of computing power infrastructure. The Ministry of Industry and Information Technology also plans to issue policy documents based on the latest developments in the computing power industry to promote the high-quality development of computing power infrastructure and improve computing power supply capabilities. These measures have accelerated the construction of computing infrastructure and laid a solid foundation for the development of the digital economy.

At the 2023 China Computing Power Conference, two important aspects of development needs were pointed out. On the one hand, it is necessary to enhance independent innovation capabilities, promote innovation in computing architecture, computing methods and algorithms, strengthen the research and development of key products such as CPUs, GPUs and servers, and accelerate the application of new technologies and products. On the other hand, it is necessary to strengthen the construction of the computing power-related software and hardware ecosystem, enhance the advanced level of the industrial foundation, promote the coordinated development of the upstream and downstream of the industrial chain, and jointly build a good development ecology.

As of the end of 2022, my country had more than 6.5 million standard racks, a total computing power scale of 180 EFLOPS (second only to the United States), and a total storage scale of more than 1,000 EB (1 trillion GB). Riding the wave of artificial intelligence development, my country continues to strengthen the research and development of key products such as CPUs, GPUs, and servers. The momentum of computing power development is expected to keep increasing, and the upstream and downstream of the domestic computing power industry chain are expected to develop rapidly together.

Development of artificial intelligence application scenarios in China

China's artificial intelligence industry made significant progress in 2022, with application penetration continuing to increase and application scenarios expanding, especially in industries such as finance and telecommunications. The widespread application of smart customer service, physical robots, smart outlets, and cloud access points has raised the penetration rate of artificial intelligence in the financial industry to 62%, while penetration in the telecommunications industry has risen from 45% to 51%, with artificial intelligence technology providing important support for the construction of a new generation of smart networks. According to the International Data Corporation (IDC), by the end of 2023, 50% of China's manufacturing supply chains will use artificial intelligence technology. Over time, the implementation of intelligent scenarios across industries will become deeper and broader.

Figure: Artificial intelligence industry penetration rate (%)

With the rise of large models in the field of artificial intelligence, the demand for intelligent computing power has grown geometrically. China's Internet and technology giants have launched independently developed large models, such as Baidu's Wenxin model, Huawei's Pangu model, and Alibaba's Tongyi model. These large models have hundreds of billions or even trillions of parameters and require large amounts of high-quality training data and enormous computing power. As the complexity of large models continues to increase, data scale grows rapidly, and application scenarios expand and deepen, the demand for and scale of intelligent computing power will experience explosive growth in the next few years. According to OpenAI's estimates, since 2012 the computing power required to train the world's top AI models has doubled every 3-4 months, an annual growth rate of up to 10 times.
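
A quick back-of-the-envelope check of that estimate (our own arithmetic, not a figure from OpenAI):

```python
# Doubling every 3.5 months means 12 / 3.5 ≈ 3.43 doublings per year,
# i.e. a factor of 2 ** 3.43 ≈ 10.8 -- consistent with "up to 10x per year".
doublings_per_year = 12 / 3.5
print(2 ** doublings_per_year)  # ≈ 10.8
```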

Figure: Computing power requirements for large model training

The scale of intelligent computing power continues to expand, and building computing power infrastructure has become a consensus. According to the "2022-2023 China Artificial Intelligence Computing Power Development Assessment Report" jointly released by IDC and Inspur Information, China's artificial intelligence computing power will continue to grow rapidly: as of 2022, China's intelligent computing power had reached 268 EFLOPS, and it is expected to reach 1,271.4 EFLOPS by 2026, a compound growth rate of 52.3% over the next five years, versus a compound growth rate of 18.5% for general-purpose computing power. At the national level, plans have been launched to build national computing power hub nodes in 8 regions and 10 national data center clusters, to effectively integrate resources, promote industrial restructuring, and build more robust computing power and algorithm infrastructure.

Figure: China's intelligent computing power scale and forecast (EFLOPS)

2. There is a widening gap between computing power demand and chip capabilities, and AI development will place higher requirements on chip performance

Because of the needs of diverse artificial intelligence application scenarios, traditional CPU-based general-purpose computing power is no longer sufficient. Heterogeneous computing solutions combining CPUs with AI chips (such as GPUs, FPGAs, and ASICs) have therefore become the main approach for current and future intelligent computing. Heterogeneous computing requires large numbers of AI chips, whose excellent parallel computing capabilities and high interconnect bandwidth maximize the efficiency of AI computing. According to forecasts from the Qianzhan Industry Research Institute, China's artificial intelligence chip market will continue to grow from 2023 to 2027: by 2024 it will exceed 100 billion yuan, and by 2027 it will reach 288.19 billion yuan.

Figure: Forecast of China's artificial intelligence chip market size (100 million yuan)

The competition for AI chip computing power is in full swing, and companies are launching new products one after another. On June 13, AMD released its new artificial intelligence GPU, the Instinct MI300, and plans to ship it to some customers later this year. This processor is AMD's version optimized for large language models, with an astonishing 153 billion transistors, 192 GB of memory, 5.2 TB/s of memory bandwidth, and 896 GB/s of Infinity Fabric bandwidth. On August 8, NVIDIA announced the next-generation NVIDIA GH200 Grace Hopper platform, the world's first GPU chip equipped with HBM3e memory. HBM3e will make the next-generation GH200 3.5 times faster than the current model when running AI models. These high-capacity GPUs help reduce AI training costs.

Figure: NVIDIA GH200

The industry leaders are mainly European, American, Korean, and Japanese, and domestic substitution is imperative. According to data from the Zhongyan Puhua Industrial Research Institute, the current global top ten in the artificial intelligence chip industry are dominated by European, American, Korean, and Japanese companies, with the top three being Nvidia, Intel, and IBM. Among domestic chip companies, Huawei HiSilicon ranked 12th, Cambricon 23rd, and Horizon Robotics 24th. In the current competitive landscape, with the accelerated development and vertical integration of large models at home and abroad, domestic AI computing chip manufacturers will usher in industry development opportunities.

3. Three parties collaborate to support the computing power infrastructure and deepen the construction of the "Eastern Data, Western Computing" project

At the press conference of the 2023 China Computing Power Conference, Zhang Yunming, Vice Minister of the Ministry of Industry and Information Technology, introduced the positive results achieved in recent years in building a high-quality computing power supply system. In order to improve the comprehensive capabilities of the computing power infrastructure, all parties actively cooperated and adopted various measures, and achieved positive results in three aspects.

1) Computing power development planning policies have been introduced one after another, providing strong and effective institutional support. The Ministry of Industry and Information Technology, the National Development and Reform Commission, and other departments jointly issued the "National Integrated Big Data Center Collaborative Innovation System Computing Power Hub Implementation Plan", approving the construction of 8 national computing power hub nodes and planning 10 national data center clusters. At the same time, the "Three-Year Action Plan for the Development of New Data Centers (2021-2023)" was issued to continuously optimize the overall layout of national computing power.

2) The construction of computing power infrastructure is steadily advancing, and development momentum continues to increase. To support the development of the digital economy, all parties in the industry have worked closely together to accelerate infrastructure construction, computing power system construction, and green development. Since 2018, the compound annual growth rate of the number of racks in my country's data centers has exceeded 30%. As of the end of 2022, the number of standard racks exceeded 6.5 million, total computing power reached 180 EFLOPS (second only to the United States), and total storage exceeded 1,000 EB (1 trillion GB). These figures show that our country has made significant achievements in computing power infrastructure.

3) Computing power empowers the transformation and upgrading of traditional industries, and integrated applications are emerging at an accelerating pace. My country's computing power industry has initially taken shape, with enterprises across the industry chain cooperating midstream and downstream and forming a positive interaction. Computing power has not only become an important support for the transformation and upgrading of traditional industries, but has also spawned a number of new economic growth points. According to calculations by the China Academy of Information and Communications Technology, the scale of my country's core computing power industry reached 1.8 trillion yuan in 2022, and every 1 yuan invested in computing power drives 3 to 4 yuan of GDP growth. These figures show that the development prospects of our country's computing power industry are very broad, with huge economic potential.

Ningxia will expand the influence of its computing power hub by hosting the Western Digital Empowerment Conference and the second "Western Digital Valley" computing power industry conference. As the first industrial conference in the western region themed on digital empowerment, the first "Western Digital Valley" Computing Power Conference signed 24 projects in 2022, with a total investment of 72.7 billion yuan; 18 of these projects have been implemented so far. As a computing power hub node of "Eastern Data, Western Computing", Ningxia had built 349,000 standard data center racks as of June 2023, and its inter-provincial Internet egress bandwidth has reached 20.6 Tbps, a network level leading the western region.

At present, the computing power structure is dominated by general-purpose computing and storage services, which account for 61%. The country is advancing the "Eastern Data, Western Computing" project: by building a new computing power network system, computing demand from the east is directed to the west in an orderly manner, optimizing the layout of data center construction and promoting coordinated development between east and west. The eight national computing power hub nodes will become the key connection points of my country's computing power network, driving the development of data center clusters and the collaborative construction of data centers with networks, cloud computing, and big data, serving as the strategic fulcrum of the project and promoting the orderly transfer of computing power resources to the west.

Blue Ocean Brain Large Model Training Platform

The Blue Ocean Brain large model training platform provides powerful computing power support, including AI accelerators based on high-speed interconnection of open acceleration modules. It is configured with high-speed memory and supports a fully interconnected topology to meet the communication requirements of tensor parallelism in large model training. It supports high-performance I/O expansion and can scale to AI clusters of tens of thousands of accelerator cards to meet the communication needs of pipeline and data parallelism in large models. It features a powerful hot-swappable liquid cooling system and intelligent power management: when the BMC receives a PSU failure or error warning (such as power outage, surge, or overheating), it automatically forces the system's CPU into ULFM (ultra-low frequency mode) to minimize power consumption. The platform is committed to providing customers with environmentally friendly, green high-performance computing solutions through low-carbon, energy-saving design. It is mainly used in deep learning, academic education, biomedicine, earth exploration, meteorology and oceanography, supercomputing centers, AI, big data, and other fields.

1. Why do we need large models?

1. The model effect is better

Large models outperform ordinary models across a wide range of scenarios

2. Stronger creative ability

Large models can perform content generation (AIGC) to facilitate large-scale content production

3. Flexible customization of scenarios

By providing a few examples, a large model can be customized to a great number of application scenarios (see the few-shot prompt sketch after this list).

4. Less labeled data

By learning a small amount of industry data, the large model can respond to the needs of specific business scenarios
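
As an illustration of points 3 and 4, a few-shot prompt can adapt a large model to a new scenario with no retraining at all. The task and labels below are hypothetical, invented purely for demonstration:

```python
# A hypothetical few-shot prompt: two labeled examples steer the model
# toward a custom review-classification scenario without fine-tuning.
prompt = """Classify the sentiment of each customer review.

Review: The delivery was fast and the packaging was intact.
Sentiment: positive

Review: The product broke after two days of use.
Sentiment: negative

Review: Works as described, nothing special.
Sentiment:"""
# Sending `prompt` to an instruction-following LLM should yield "neutral".
```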

2. Platform features

1. Heterogeneous computing resource scheduling

A comprehensive solution based on general-purpose servers and dedicated hardware for scheduling and managing multiple heterogeneous computing resources, including CPUs and GPUs. With powerful virtualization management functions, underlying computing resources can be deployed easily and various models run efficiently, while the hardware acceleration capabilities of different heterogeneous resources are fully exploited to speed up model execution and generation.

2. Stable and reliable data storage

Supports multiple storage protocols, including block, file, and object storage services. Storage resources are pooled so that models and generated data can circulate freely, improving data utilization. Data protection mechanisms such as multiple replicas, multi-level fault domains, and automatic fault recovery ensure that models and data run safely and stably.

3. High-performance distributed network

Provides networking for computing and storage resources, forwarding traffic through a distributed network mechanism that passes physical network performance through transparently, significantly improving the efficiency and performance of model computing.

4. Comprehensive security guarantee

For model hosting, a strict permission management mechanism ensures the security of the model repository. For data storage, measures such as private deployment and data disk encryption keep data secure and controllable. During model distribution and operation, comprehensive account authentication and log auditing fully guarantee the security of models and data.

3. Common configurations

1. Processor, CPU:

Intel Xeon Gold 8358P, 32C/64T, 2.6GHz, 48MB cache, DDR4-3200, Turbo, HT, 240W

Intel Xeon Platinum 8350C, 32C/64T, 2.6GHz, 48MB cache, DDR4-3200, Turbo, HT, 240W

Intel Xeon Platinum 8458P, 28C/56T, 2.7GHz, 38.5MB cache, DDR4-2933, Turbo, HT, 205W

Intel Xeon Platinum 8468, 48C/96T, 2.1GHz, 105MB cache, 350W

AMD EPYC™ 7742, 64C/128T, 2.25GHz (boost up to 3.4GHz), 256MB cache, DDR4-3200, 225W

AMD EPYC™ 9654, 96C/192T, 2.4GHz (boost up to 3.55GHz, max 3.7GHz), 384MB cache, DDR5-4800, 360W

2. Graphics card, GPU:

NVIDIA NVLink A100 SXM 640GB

NVIDIA HGX A800 8-GPU 80GB

NVIDIA Tesla H800 80GB HBM2

NVIDIA A800 80GB 400W ×8 with NVLink Switch
