Breaking the Boundary: High-Performance Computing Leads LLMs into the Innovation Era of Artificial General Intelligence (AGI)

AGI | AIGC | Large Model Training | GH200

LLM | LLMs | Large Language Model | MI300

The success of ChatGPT has driven the development of the entire AIGC industry, especially in the fields of LLMs (Large Language Models), NLP, high-performance computing, and deep learning. The development of LLMs will provide a strong driving force for the growth of the global and Chinese AI chip and AI server markets. It is estimated that LLMs will open up about US$89.12 billion and US$33.82 billion of market space for global and Chinese AI servers, respectively.

Foreign manufacturers hold a leading position in the LLM field, but China's LLM products are also developing rapidly. Since 2023, many manufacturers have launched self-developed general-purpose LLMs, and domestic LLMs have made positive progress in industry applications and ecosystem building. Although China's LLMs still lag behind GPT-4 to some extent, they can be expected to reach or approach the level of ChatGPT in the short term.

It is worth noting that AMD launched the MI300 series of accelerator cards last week, aiming to compete with Nvidia. The MI300 series is AMD's latest line of APU accelerators for AI and high-performance computing, including the MI300A and MI300X. The MI300A integrates a CPU and GPU, while the MI300X is an accelerator designed specifically for generative AI and benchmarked against Nvidia's H100. In terms of performance parameters, the MI300 series is comparable to, and in some respects even surpasses, Nvidia's high-end accelerator cards, but overall it will still be difficult to shake Nvidia's dominant position in this field in the short term.

 

Looking ahead to the second half of the year, China's large-model products have initially achieved commercial viability. The announcement of favorable general-artificial-intelligence development policies in Beijing, Shanghai, and Shenzhen demonstrates China's emphasis on and support for the development of AIGC, and will encourage other cities to issue similar policies. With policy and technology reinforcing each other, China's AIGC industry has broad prospects for future development.

 

Today, the gap between domestic LLM technology and the state of the art is widening further. In the one to two years after Bert appeared, domestic work in this field caught up quickly and proposed some good improved models. The watershed at which the gap began to widen was the release of GPT 3.0, around the middle of 2020. At that time, only a few people realized that GPT 3.0 was not merely a specific technology, but embodied a development philosophy of where LLMs should go.

Large language models (LLMs) are a low-cost, high-efficiency technology that has attracted widespread attention in natural language processing (NLP) and artificial intelligence (AI). Has ChatGPT, as the representative LLM, brought about a paradigm shift in NLP and AI? If so, what impact will it have? LLMs have accumulated a wealth of knowledge by learning from massive amounts of data; how do they store and access this knowledge? As LLMs gradually scale up, what impact will that have on research and applications? In addition, In Context Learning is a mysterious technique closely related to Instruct. Do LLMs have the ability to reason, and how is chain-of-thought (CoT) prompting realized? The following sections address these questions in detail.

 

Background and Capabilities of LLMs

1. Background of LLM

An LLM (Large Language Model) is a language model trained on a large amount of text data and containing hundreds of billions or more parameters. LLMs adopt the Transformer architecture and the language-modeling pre-training objective, but their model size, pre-training data, and total compute are far larger than those of small models. This allows them to better understand natural language and generate high-quality text. The capability improvements of LLMs can be partially described by scaling laws, but some capabilities are only observed above a certain model size.

2. Emergent ability of LLM

The emergent abilities of LLMs are abilities that do not exist in small models but appear in large ones, and they are one of the most striking features distinguishing LLMs from earlier pre-trained language models (PLMs). When the scale reaches a certain level, LLM performance rises significantly above random, a pattern closely related to the phase-transition phenomenon in physics. Emergent abilities can relate to complex tasks, but people are more concerned with general abilities.

Three representative emergent abilities of LLMs are in-context learning, instruction following, and step-by-step reasoning. In-context learning enables a language model to generate the expected output for a test instance by completing the word sequence of the input text; instruction following enables an LLM to perform new tasks by understanding task instructions without explicit examples, thereby improving generalization; step-by-step reasoning enables an LLM to solve complex tasks by using prompts involving intermediate reasoning steps to arrive at the final answer. The sketch below illustrates the three prompt styles.
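As a concrete illustration, the toy prompts below show what the three abilities look like at the interface level; the strings are invented examples, not taken from any benchmark.

```python
# In-context learning: a few input-output demonstrations, then a new input.
in_context_prompt = (
    "Review: The film was wonderful. Sentiment: positive\n"
    "Review: I hated every minute. Sentiment: negative\n"
    "Review: A delightful surprise. Sentiment:"
)

# Instruction following: the task is stated abstractly, with no demonstrations.
instruction_prompt = "Classify the sentiment of this review as positive or negative: ..."

# Step-by-step reasoning: the prompt elicits intermediate reasoning steps.
cot_prompt = "Q: If a pen costs 2 yuan, how much do 3 pens cost? A: Let's think step by step."
```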

NLP Research Paradigm Shift

Modeling from Shallow Semantics to Deep Semantics

Over the past 10 years, the NLP field has undergone two important research paradigm shifts.

1. From deep learning to two-stage pre-training model

The introduction of deep learning into NLP began around 2013 and lasted until the emergence of GPT 3.0 (around May 2020). Before Bert and GPT appeared, the popular technologies in NLP were mainly deep learning models, relying on improved LSTM and CNN models as feature extractors and Sequence-to-Sequence + Attention as the overall technical framework. However, although these approaches increased model depth, they were not successful enough at solving specific tasks, mainly because of the limited amount of task training data and the insufficient expressive power of LSTM/CNN feature extractors.

It was not until the emergence of the two pre-trained models, Bert and GPT, that NLP saw a technological leap and a research paradigm shift across the entire field. The impact of this shift is reflected in two aspects: first, the decline and even gradual extinction of some NLP research subfields; second, the unification of technical routes across different subfields.

1. The decline and even gradual extinction of some NLP research subfields

NLP is an umbrella term for a macro research field with many specific subfields and subdirections. Analyzed carefully from the perspective of task nature, these tasks fall into two categories: intermediate tasks and final tasks.

1) Intermediate tasks

Typical intermediate tasks include Chinese word segmentation, part-of-speech tagging, NER, syntactic analysis, anaphora resolution, semantic parsing, and so on. These tasks generally do not address actual application needs; rather, they exist as intermediate or auxiliary stages for the tasks that do. For example, there is almost no application need to show a user the syntactic parse tree of a sentence. Users do not need to see the results of these intermediate NLP steps; they only care whether a specific task is done well.

2) Final tasks

This type of task (text classification, text similarity calculation, machine translation, text summarization, etc.) is characterized by the fact that each subfield addresses an actual need, and the task results can generally be presented to the user directly. For example, if a user gives you a sentence in English and asks for the Chinese translation, that is a genuine need that the output directly satisfies.

In principle, "intermediate tasks" should not exist; they exist only because the development level of NLP technology was not high enough. In the early days, the technology was relatively backward, and it was difficult to complete a difficult final task in one step. Take machine translation: it was very hard to do well with early technology, so researchers divided and conquered the problem, decomposing it into intermediate stages such as word segmentation, part-of-speech tagging, and syntactic analysis, and then assembling those pieces to complete the final task.

Since the emergence of Bert/GPT, there has been no need to do intermediate tasks. Through pre-training on a large amount of data, Bert/GPT absorbs these intermediate tasks as linguistic features into the Transformer's parameters. Final tasks can then be solved end-to-end directly, without modeling the intermediate processes explicitly.

2. Unification of technical routes in different research directions

In addition to "intermediate tasks", NLP tasks can be divided into two types: natural language understanding and natural language generation. Natural language understanding tasks include classification tasks such as text classification, sentence relationship judgment, and emotional tendency judgment. The model judges which category it belongs to according to the input text. Natural language generation tasks include chatbots, machine translation, text summarization, question answering systems and other generation tasks. The model generates corresponding output text based on the input text.

Since the emergence of Bert/GPT, there has been a trend of technical unification in NLP. The feature extractor has gradually converged from LSTM/CNN to the Transformer, and most tasks adopt either the pre-training + fine-tuning mode or the Zero/Few-Shot Prompt mode. Natural language understanding tasks adopted the bidirectional-language-model pre-training + fine-tuning mode represented by Bert, while natural language generation tasks adopted the autoregressive language model + Zero/Few-Shot Prompt mode represented by GPT 2.0. The development ideas and future directions behind these two models differ, and many people underestimated the potential of the GPT mode. The autoregressive language model of the GPT mode can generate high-quality text, can be applied to many natural language generation tasks, and transfers well. In contrast, the Bert mode performs poorly on generation tasks, and its fine-tuning method requires a large amount of labeled data, making it hard to adapt to new tasks. A minimal sketch of the two interface styles follows.
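The sketch below uses the Hugging Face transformers library to contrast the two interfaces; the model choices are arbitrary examples, and gpt2 is far too small to translate well, so the point is only the shape of the interface, not the quality of the output.

```python
from transformers import pipeline

# GPT mode: one frozen generative model; every task is phrased as a prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("Translate English to French: cheese =>", max_new_tokens=5))

# Bert mode: a task-specific fine-tuned head (here, a ready-made sentiment model).
classifier = pipeline("sentiment-analysis")
print(classifier("This movie was great!"))
```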

2. From pre-training models to artificial general intelligence (AGI)

This paradigm shift roughly covers the period after GPT 3.0 emerged, starting around June 2020 and continuing to the present. ChatGPT was the key node of the transformation, but before InstructGPT appeared, LLMs were still in the transition period of this paradigm shift.

1. The "autoregressive language model + prompting" mode represented by GPT 3.0 occupies a dominant position

In the early stage of pre-training model development, the technical framework converged to two paradigms: the Bert mode and the GPT mode, and people were generally more optimistic about the Bert mode; quite a few subsequent technical improvements followed Bert's path. However, as technology developed, it became clear that the largest LLMs are almost all "autoregressive language model + Prompting" models in the style of GPT 3.0 (such as GPT-3, PaLM, GLaM, Gopher, Chinchilla, MT-NLG, LaMDA, etc.). Why is this so? There must be some inevitability behind it, mainly for two reasons.

 

1) Google's T5 model unifies the external form of natural language understanding and natural language generation tasks

In the T5 model, text classification and the regression or classification problems of judging sentence similarity are typical natural language understanding problems. In T5, the input and output forms of these understanding problems are consistent with generation problems: a classification problem can be converted into having the LLM generate the string of the corresponding category, completely unifying the expression of understanding and generation tasks. This shows that natural language generation tasks are expressively compatible with natural language understanding tasks, whereas the reverse is difficult. The advantage is that the same generative LLM can solve almost any NLP problem. If the Bert mode is adopted instead, the LLM cannot handle generation tasks well. The text-to-text framing is illustrated below.
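For concreteness, the text-to-text framing looks like the pairs below, where the input and the target are both plain strings; the prefixes are illustrative of T5's style rather than exact templates from the paper.

```python
# (input text, target text) -- classification and regression become string generation.
examples = [
    ("sentiment: This movie was terrific.",         "positive"),
    ("stsb sentence1: ... sentence2: ...",          "3.8"),         # similarity score as a string
    ("translate English to German: That is good.",  "Das ist gut."),
]
```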

2) To do a good job with zero-shot or few-shot prompting, you must adopt the GPT mode

Studies have shown that the Bert mode performs better than the GPT mode when downstream tasks are solved by fine-tuning, but if downstream tasks are solved in zero-shot/few-shot prompting mode, the GPT mode outperforms the Bert mode. This shows that generative models more easily complete tasks in the zero-shot/few-shot prompting fashion, and the Bert mode is at a disadvantage when doing tasks this way.

So the question arises: why pursue zero-shot/few-shot prompting for doing tasks? To explain this clearly, we first need to answer another question: what kind of LLM is the most ideal?

 

For an LLM, first of all, it should have strong self-learning ability. If all the different types of data available in the world, such as text and images, are fed into the model, it should automatically learn all the knowledge contained in them; the learning process should require no human intervention, and the learned knowledge should be flexibly applicable to solving practical problems. Because the amount of data is huge, absorbing all this knowledge requires a very large number of parameters to store it, so such a model is bound to be a giant model.

Second, an LLM should be able to solve problems in any NLP subfield, not just a limited domain, and should even respond to problems outside NLP. Furthermore, when using an LLM to solve a problem in a specific domain, it should accept the expressions humans are accustomed to; that is, the LLM should understand human commands. This reflects adapting the LLM to humans rather than adapting humans to the LLM. A classic example of humans adapting to the LLM is people racking their brains to try out different prompts in an attempt to find ones that best solve the problem at hand.

Why pursue zero shot/few shot prompting to solve tasks? There are mainly two reasons.

1) The scale of the ideal LLM must be very large, and only a very small number of institutions have the ability to build such a model or change its parameters. Task demanders number in the tens of thousands, most of them small and medium organizations or even individuals; even if the model were open source, they could not deploy it, let alone use fine-tuning to modify the model parameters. Therefore, we should pursue a way for task demanders to complete tasks without modifying model parameters, that is, prompting rather than fine-tuning. The model maker runs the LLM as a public service, operating in LLM-as-a-Service mode (see the sketch after this list).

As a service provider, considering the diversity of user needs, the LLM maker must pursue enabling the LLM to complete as many types of tasks as possible; this is a side effect of the service model, and also a realistic reason why super-large models will inevitably pursue AGI.

2) Zero-shot prompting, few-shot prompting, and even the chain-of-thought (CoT) prompting that promotes LLM reasoning are all just existing interface technologies. Specifically, the original intention of zero-shot prompting was to have the LLM do tasks directly in the way humans usually express them, but it was found that LLMs could not understand such expressions well and the results were poor. With continued research, people found that for a given task, giving the LLM a few examples, and using these examples to represent the task description, works better than zero-shot prompting, so everyone began to study better few-shot prompting techniques.

In other words, we hope that LLMs can perform a task given the command style humans usually use, but current technology cannot achieve that, so the next best thing is to use these alternative techniques to express human task needs. Following this logic, it is easy to draw the conclusion that few-shot prompting (also known as In Context Learning) is just a transitional technique. Once a task can be described more naturally and the LLM can understand it, these transitional techniques will be abandoned without hesitation, because describing task requirements this way does not conform to human habits.
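A hypothetical sketch of the LLM-as-a-Service pattern follows: the endpoint URL, JSON schema, and `ask` helper are all invented for illustration. The point is that the task demander only ever sends prompts and never touches model parameters.

```python
import requests

def ask(prompt: str) -> str:
    """Send a prompt to a hosted LLM and return its completion (hypothetical API)."""
    resp = requests.post(
        "https://llm.example.com/v1/generate",      # invented endpoint
        json={"prompt": prompt, "max_tokens": 64},  # invented schema
    )
    return resp.json()["text"]

# The same frozen model serves unrelated tasks, distinguished only by the prompt.
print(ask("Translate to French: Where is the train station?"))
print(ask("Summarize in one sentence: ..."))
```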

2. A new interactive interface that adapts LLM to humans

ChatGPT is the technical approach closest to the ideal LLM: capable, and considerate of its users. ChatGPT's power comes mainly from the underlying GPT 3.5 model rather than from manually labeled data. Although manually labeled data was added, its magnitude is only in the tens of thousands, and it contributes little to enhancing GPT 3.5's basic capabilities.

The biggest contribution of ChatGPT is that it essentially realizes the interface layer of the ideal LLM, letting the LLM adapt to people's customary command expressions instead of making people adapt to the LLM. This increases the LLM's ease of use and improves user experience; it is a human-computer interface technology more in line with human expression habits. This contribution will surely inspire subsequent LLMs to keep working on easy-to-use human-machine interfaces.

3. Many NLP subfields no longer have independent research value

The paradigm shift will change the landscape of NLP: many independent research fields will be absorbed into the LLM technology system and gradually disappear. Although many "intermediate tasks" no longer need to exist independently, most "final tasks" will still exist as independent fields, albeit with new improvement schemes continually proposed under the "pre-training + fine-tuning" framework.

Studies have shown that as LLM model size increases, the performance of many NLP tasks improves greatly. Therefore, the so-called "unique" problems of many fields are likely just surface appearances caused by a lack of domain knowledge: given more domain data, the LLM can learn the needed knowledge and solve these problems well. The future technology trend will be to pursue ever-larger LLMs and to cover more and more fields by increasing the diversity of pre-training data. The focus of research will shift to how to build the ideal LLM rather than to solving field-specific problems, so more and more NLP subfields will be absorbed into the LLM technology system and gradually disappear.

To judge whether a specific field should immediately stop independent research, either of two methods can be used. The first is to check whether the LLM's performance on the field's research tasks exceeds human performance; for fields where it does, independent research is no longer necessary. The second is to compare the task performance of the two modes: if the few-shot prompting or instruction-based method reaches or exceeds the performance of fine-tuning with larger field-specific data, the field no longer needs to exist independently.

If this conjecture holds, many researchers in the NLP field will face a choice of where to go: continue working on the field's unique problems, or abandon that path and build better LLMs instead?

4. More research fields other than NLP will be included in the LLM technology system

An ideal LLM should be an artificial general intelligence model, not limited to a single subject area. The emergence of ChatGPT proves the feasibility of pursuing AGI, and now is the time to set aside the mental constraints of "domain disciplines". Besides demonstrating fluent dialogue across NLP tasks, ChatGPT also has powerful coding abilities.

 

LLM technology is expanding outward, and one natural direction is image processing and multimodal tasks. Some works already try to integrate multimodality into LLMs so that they can serve as a universal human-machine interface, such as DeepMind's Flamingo and Microsoft's "Language Models are General-Purpose Interfaces".

Applying pre-trained models to downstream tasks has had far less impact in the image field than in NLP. This may be because image pre-training models still need deeper exploration to unlock the potential of image data, so the integration of image processing into LLMs may be slower than imagined. Of course, once image pre-training is worked out, such models are likely to be integrated into one large LLM that completes terminal tasks directly, just as in NLP.

In addition to images and multimodality, other domains will gradually be incorporated into LLMs; these are high-value research topics. From my view of the paradigm shift, the main technical progress of LLMs falls into two categories: the first concerns how LLM models absorb knowledge from data, including the effect of model-scale growth on that ability; the second concerns the human-machine interface for using the LLM's intrinsic abilities to solve tasks, including In Context Learning and Instruct. Chain-of-thought (CoT) prompting, an LLM reasoning technique, essentially belongs to In Context Learning as well.

Learning Massive Knowledge from Endless Data

Current research shows that the Transformer is a powerful enough feature extractor and does not need special improvement. But what did the pre-training process teach the Transformer? How is knowledge stored? How can wrong knowledge be corrected? These questions are the focus of current research, and this section describes the progress in this area.

1. What knowledge has LLM learned

LLMs acquire a large amount of knowledge by learning from massive free text. This knowledge can be roughly divided into two categories: language knowledge and world knowledge. Language knowledge covers morphology, parts of speech, syntax, and semantics, which help humans or machines understand natural language. Research shows that LLMs can learn linguistic knowledge at various levels, and that this knowledge is stored in the lower and middle layers of the Transformer. World knowledge covers real events (factual knowledge) and common-sense knowledge.

Research shows that LLMs can absorb a large amount of world knowledge from training data, and that this knowledge is mainly distributed in the middle and upper layers of the Transformer; as the number of layers increases, the amount of knowledge that can be learned grows roughly exponentially. For a Bert-style language model, a corpus of only 10 million to 100 million words is enough to learn linguistic knowledge such as syntax and semantics, but learning factual knowledge requires much more training data. As the amount of training data increases, the pre-trained model performs better on various downstream tasks, which shows that what is learned from the incremental training data is mainly world knowledge.

2. How LLM stores and accesses knowledge

An LLM is a Transformer-based language model that learns rich linguistic and world knowledge from large amounts of free text. But for a specific piece of knowledge, how does the LLM store and retrieve it? Looking at the Transformer's structure, the model parameters consist of two parts: the multi-head attention (MHA) part accounts for roughly one third of the total parameters, while two thirds are concentrated in the FFN structures.

The first layer of the FFN is a wide MLP hidden layer, the Key layer; the second layer is a narrow MLP hidden layer, the Value layer. The input to the FFN is actually the MHA output embedding of a given word, i.e., the embedding that integrates the input context of the whole sentence via Self Attention and represents the overall information of the input sentence.

Each neuron node in the Key layer records a pair of <Key,Value> information. For example, the i-th node ki in the first hidden layer of the FFN might record the piece of knowledge <Beijing, is-capital-of, China>. The Key vector of node ki is the weight vector between ki and every node in the input layer; the corresponding Value vector is the weight vector connecting ki to every node in the FFN's second (Value) layer.

The Key vector of each neuron identifies a certain language or knowledge pattern in the input; it is a pattern detector. If the input contains the pattern the neuron wants to detect, the inner product between the input vector and the key weights of node ki, followed by Relu, produces a large response at ki, meaning ki has detected the pattern. This response value is then propagated to the second FFN layer through node ki's Value weight vector, which is equivalent to weighting the Value vector by the response value and passing it on to the output of every node in the second (Value) layer.

In this way, the FFN's forward computation looks like detecting a knowledge pattern via the Key, retrieving the corresponding Value, and reflecting that Value in the FFN's second-layer output. Of course, each node in the FFN's second layer collects information from all Key-layer nodes, so it is a mixed response, and the mixed responses of all Value-layer nodes can be interpreted as probability-distribution information for the output word. The idea of treating the FFN as a Key-Value memory may not be the final correct answer, but it is probably not far from it. A toy version of this computation follows.
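The following toy sketch (random weights, illustrative dimensions) shows the forward computation just described: a Relu'd inner product against the Key weights detects patterns, and the responses weight the Value vectors.

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 8, 32                  # illustrative sizes
W_key = torch.randn(d_model, d_ff)     # column i = Key pattern of neuron ki
W_value = torch.randn(d_ff, d_model)   # row i = Value vector of neuron ki

x = torch.randn(d_model)               # MHA output embedding for one token
scores = F.relu(x @ W_key)             # how strongly each Key pattern fires
out = scores @ W_value                 # response-weighted mix of Value vectors
```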

3. How to modify the knowledge stored in LLM

When using LLMs for natural language processing, you may encounter outdated or incorrect knowledge. To address this, three different approaches can be used to revise the knowledge stored in an LLM.

1. Correct knowledge from the source of training data

By tracing a piece of knowledge back to its training data, we can locate which data caused the LLM to learn it, then delete those data sources and retrain the entire LLM, thereby deleting the relevant knowledge from the LLM. However, this method is impractical for the common scenario of correcting a small amount of knowledge, since retraining the whole model is far too costly.

2. Correct knowledge through fine-tuning

Construct training data for the new knowledge to be corrected, and fine-tune the LLM on it to guide the model to remember the new knowledge and forget the old. However, there can be catastrophic forgetting: the model may forget not only the knowledge it should forget but also knowledge it should retain, causing performance drops on some downstream tasks. In addition, the cost is still quite high.

3. Directly modify the model parameters of LLM to correct knowledge

By locating the specific place where a piece of knowledge is stored, the corresponding model parameters in the FFN can be directly adjusted to replace old knowledge with new. However, this approach must address two key issues: first, how to locate the specific storage location of a piece of knowledge in the LLM parameter space; second, how to modify the model parameters so that old knowledge becomes new knowledge. A simplified sketch follows.
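A deliberately oversimplified sketch of this third approach, under the Key-Value view above: locate the neuron believed to store the fact and overwrite its Value vector. Real editing methods (e.g., ROME-style techniques) solve for a minimal parameter update instead; everything here is illustrative.

```python
import torch

def edit_knowledge(W_value: torch.Tensor, neuron_idx: int,
                   new_value_vec: torch.Tensor) -> None:
    """Overwrite one neuron's Value vector to inject new knowledge (toy version)."""
    with torch.no_grad():
        W_value[neuron_idx] = new_value_vec  # old knowledge replaced in place
```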

Understanding the process of revising LLM knowledge is helpful in gaining a deeper understanding of the inner workings of LLM. Although the three methods have their own advantages and disadvantages, all of them can help correct outdated or wrong knowledge in LLM and improve the performance of LLM in natural language processing tasks.

What happens when the LLM gets bigger and bigger

In recent years, LLM model scale has continued to grow; the best-performing LLMs now exceed 100 billion (100B) parameters. For example, OpenAI's GPT-3 has 175B parameters, Google's LaMDA 137B, PaLM 540B, and DeepMind's Gopher 280B. China also has giant models, such as the Tsinghua & Zhipu GLM at 130B, Huawei's "Pangu" at 200B, Baidu's "Wenxin" at 260B, and Inspur's "Yuan 1.0" at 245B.

So the question is: what happens as the size of the LLM model keeps growing? Applying a pre-trained model involves two stages: the pre-training stage and the scenario-specific application stage. In the pre-training stage, the optimization objective of the LLM is cross-entropy; for an autoregressive language model like GPT, this means checking whether the LLM correctly predicts the next word. In the application stage, evaluation depends on the specific scenario's metrics. Generally, the better an LLM's pre-training metrics, the stronger its ability to solve downstream tasks. However, that is not entirely the case.

Existing studies show that pre-training optimization metrics are positively, but not perfectly, correlated with downstream task performance. In other words, it is not enough to judge an LLM solely by its pre-training metrics; the model must be fully evaluated and tested in both the pre-training phase and the application phase.

In the pre-training phase, OpenAI's and DeepMind's research shows that increasing training data and model parameters simultaneously is the optimal choice; increasing only one of them is not good enough. DeepMind argues that training data and model parameters are equally important and should be increased in equal proportion: for example, if the total compute budget for training the LLM is increased 10 times, model parameters and training data should each be increased about 3.3 times for the model to work best. The Chinchilla model chose to increase training data 4 times while cutting parameters to a quarter of Gopher's, about 70B; as a result, Chinchilla outperforms the larger Gopher on pre-training metrics and many downstream task metrics. This shows that one can proportionally enlarge the training data and shrink the LLM's parameters, greatly reducing model size without hurting model performance. The arithmetic is sketched below.
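The proportional rule above amounts to splitting a compute increase evenly between parameters and data; a quick check of the arithmetic:

```python
import math

def equal_split(compute_multiplier: float) -> tuple[float, float]:
    """If compute grows by factor c, grow parameters and data by sqrt(c) each."""
    s = math.sqrt(compute_multiplier)
    return s, s  # (parameter multiplier, data multiplier)

print(equal_split(10))  # (3.162..., 3.162...) -- roughly the "3.3 times" cited above
```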

From the perspective of solving specific downstream tasks, as model size grows, different types of tasks behave differently. For simple tasks, such as language-model perplexity, performance keeps improving as scale grows: in OpenAI's research, when the amount of training data increased from 12B to 800B, the perplexity of the GPT-3 model dropped from 3.15 to 1.28.

For moderately difficult tasks, such as question answering and text classification, performance first improves with model size and then plateaus. In OpenAI's research, when the amount of training data increased from 12B to 800B, GPT-3's performance on tasks such as LAMBADA and SuperGLUE improved, but at a gradually decreasing rate. For complex tasks, such as machine translation and semantic understanding, performance first improves with scale and then saturates or even declines slightly. In Google's research, when model parameters increased from 1558M to 137B, the BLEU score rose from 36.8 to 37.5, but with further scale the BLEU score dropped slightly. Therefore, when choosing LLM size, one must weigh the difficulty and requirements of the specific task in order to obtain the best model performance.

 

The first type of task demonstrates the scaling law of LLMs: as model size increases, task performance improves steadily. Such tasks are usually knowledge-intensive: the more knowledge the LLM contains, the better the performance. Studies show that larger LLMs learn more efficiently: given the same amount of training data, a larger model learns more knowledge points. Most traditional natural language understanding tasks belong to this type, and the large gains on these tasks over the past two years are probably due to the growth of LLM scale.

The second type of task shows that LLMs have a certain "emergent ability": when model size crosses some threshold, performance on such tasks suddenly jumps. This emergence is key to the value of growing LLM scale: as models grow, LLMs gradually unlock new capabilities. The phenomenon is remarkable because even if an LLM cannot solve a task well today, continued scaling may one day suddenly unlock the ability. These tasks generally consist of multiple steps, requiring intermediate steps to be solved first, and logical reasoning plays an important role in their final solution. Chain-of-thought prompting is a typical technique for enhancing LLM reasoning, and it greatly improves such tasks. Why LLMs exhibit this emergence still needs further research.

 

Some task-performance curves show a U shape: as model scale grows, performance first worsens, but with further scale it starts to improve, a U-shaped growth trend. Hidden within these tasks are two different sub-task types: the real task and a "distractor task". When the model is small, it cannot identify either sub-task, so its performance is close to randomly choosing answers.

When the model grows to medium scale, it mainly performs the distractor task, which hurts the real task, reflected in falling performance on the real task. When the model grows further, the LLM can ignore the distractor and perform the real task, and performance starts to climb. If chain-of-thought (CoT) prompting is adopted, some tasks convert to following the scaling law (bigger is better), while others convert to a U-shaped growth curve. This suggests these tasks are reasoning-type tasks whose behavior changes qualitatively once CoT is added.

From In Context Learning to Instruct Understanding

The commonly mentioned interface technologies between humans and LLMs include Instruct and In Context Learning. Instruct is ChatGPT's interface: a person gives a task description in natural language, for example, "Translate this sentence from Chinese to English". In Context Learning is similar in meaning to few-shot prompting: give the LLM a few examples as a template, then ask it to solve a new problem.

Although both are ways of describing tasks, their ideas differ: Instruct is an abstract description, while In Context Learning is an illustration by examples. Despite some confusion in terminology, these two are the most common human-LLM interface technologies. Below, we focus on Instruct and In Context Learning, and no longer speak of zero shot and few shot.

1. The mysterious In Context Learning

In Context Learning is a remarkable technology. It is magical because you only need to give the LLM a few example pairs <x1,y1>, <x2,y2> ... <xn,yn>, and then a new input xn+1, and the LLM can successfully predict the corresponding output yn+1. This sounds somewhat similar to fine-tuning, but is actually more subtle.

Fine-tuning and In Context Learning both seem to provide examples to the LLM, but they differ qualitatively. Fine-tuning uses the examples as training data and corrects the LLM's parameters via backpropagation, so the LLM genuinely learns from the examples. In Context Learning merely shows the examples and asks the LLM to predict a new one; no backpropagation or parameter update occurs, so it seemingly undergoes no learning process at all. Yet In Context Learning can predict new examples correctly at a glance. The contrast is sketched below.
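The toy contrast below makes the distinction concrete: fine-tuning runs backpropagation and changes parameters, while In Context Learning only concatenates examples into the prompt. The model and data here are placeholders.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4, 2)                        # stand-in for an LLM

# Fine-tuning: examples are training data; backprop updates the parameters.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = F.cross_entropy(model(x), y)
loss.backward()
opt.step()                                           # parameters have changed

# In Context Learning: examples are just text; no gradient step ever runs.
prompt = "2+2=4\n3+5=8\n1+9="  # the model must infer the pattern at inference time
```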

At present, some studies offer different views on this question, with conflicting conclusions; the truth remains an unsolved mystery. Some studies argue that In Context Learning does not learn a mapping function from the examples but achieves prediction through the distributions of inputs and outputs; others argue that the LLM still learns the mapping function from the examples, only implicitly.

2. The magical Instruct understanding

Instruct is a task description expressed for human understanding. On this premise, current Instruct research falls into two categories: Instruct biased toward academic research, and Instruct oriented to descriptions of real human needs.

First, consider the academically oriented Instruct. Its core research topic is the LLM's ability to generalize in understanding Instructs across multi-task scenarios: take multiple NLP tasks, each with one or more Prompt templates serving as its Instruct, and fine-tune the LLM on the training data so that it learns many tasks simultaneously.

After training, give the LLM the Instruct for a brand-new task and have it solve the task zero shot; whether it succeeds gauges the LLM's ability to generalize over Instructs. Current research shows that factors such as increasing the number of multi-task tasks, increasing LLM size, providing CoT prompting, and increasing task diversity all effectively improve the LLM's understanding of Instructs.

The second is Instruct oriented to real human needs, represented by InstructGPT and ChatGPT. This approach is also multi-task based, but its biggest difference from the academic approach is its orientation toward real needs: it trains the LLM with task-description prompts sampled from real requests submitted by large numbers of users, rather than fixing the scope of research tasks and having researchers write the task-description prompts.

The advantages of this method are that it covers more diverse task types, closer to users' real needs, and that the task prompts come from actual user requests, reflecting how users really express task demands; an LLM trained this way can therefore better satisfy users. The InstructGPT paper compared this method to the academically oriented FLAN method, and the results show FLAN lags far behind InstructGPT, because FLAN involves relatively few task domains while InstructGPT's task types are more diverse and closer to real user needs. Collecting real needs from user data is thus very important for improving LLMs.

3. The connection between In Context Learning and Instruct

In Context Learning can be regarded as expressing a task command through specific examples, while Instruct is an abstract task description more in line with human habits. A natural question arises: is there a connection between the two? For example, given some concrete examples of a task, can the LLM find the corresponding natural-language Instruct command that describes it?

Some work is currently exploring the connection between concrete task examples and natural-language commands, a direction of high research value. On this question, the answer is yes: the LLM can indeed do this. A recent study used GPT-3 and InstructGPT as base models, had the LLM generate a natural-language command describing a task from a few concrete examples, and then used the generated description to test task performance. With this technique, the Instructs generated by the LLM improved greatly, even surpassing human performance on some tasks. This points to a mysterious intrinsic connection between concrete task examples and natural-language commands, though we cannot yet determine its exact nature.

How to enhance the reasoning ability of LLM

Many studies show that LLMs have strong memory, but we do not usually say a person is smart just because of strong memory; reasoning ability is often the key criterion for judging intelligence. Strong reasoning is therefore also essential for LLMs. Over the past year, LLM reasoning has become one of the most important and hottest research areas. Current research shows that when models are large enough, LLMs themselves exhibit reasoning ability and already do well on simple reasoning problems, but complex reasoning still needs deeper study.

Research on LLM reasoning falls into two categories: Prompt-based methods and methods that introduce program code. Prompt-based methods stimulate the LLM's own reasoning ability through appropriate prompts or prompted examples; Google has done much fruitful work in this direction. Code-based methods mix program code with text during pre-training to further enhance reasoning, an idea practiced by OpenAI. The directions differ fundamentally: code pre-training enhances reasoning at the source by providing new kinds of training data, while prompting is a technique for better eliciting the reasoning ability the LLM already has while solving problems. The two are complementary, but attacking the root cause matters more in the long run.

In summary, prompt-based approaches can be roughly divided into three technical routes.

 

1. Add an auxiliary reasoning prompt directly to the question

In various domains, prompt-based methods have proven effective for enhancing LLM reasoning. The method is very simple: append an auxiliary reasoning prompt to the question. Zero-shot CoT is a widely used example, which stimulates the LLM's own reasoning ability by appending the prompt "Let's think step by step" to the question asked.

Specifically, it has two stages. In the first stage, the prompt is appended to the question and the LLM outputs its concrete reasoning process; in the second stage, the reasoning process output by the LLM is spliced back into the question, an answer-extraction prompt is appended, and the LLM then gives the final answer. This simple operation greatly improves LLM performance across reasoning tasks. There is no settled conclusion on why LLMs have reasoning ability, but it may be that the pre-training data contains much text beginning with "Let's think step by step", and the LLM memorized these patterns during pre-training.

Thus, when we enter this prompt, the LLM imitates those examples, reasons step by step, and gives an answer. Zero-shot CoT naturally performs worse than standard CoT, since accuracy based on the LLM recalling such patterns cannot be too high. But both Zero-shot CoT and standard CoT demonstrate one truth: the LLM itself possesses reasoning ability; we simply have not had the means to stimulate it. The two stages are sketched below.
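A sketch of the two stages, reusing the hypothetical `ask` helper from the earlier LLM-as-a-Service example; the prompt wording is paraphrased from the published Zero-shot CoT recipe.

```python
question = "A store has 23 apples. It sells 9 and buys 6 more. How many apples now?"

# Stage 1: elicit the reasoning chain.
reasoning = ask(question + "\nA: Let's think step by step.")

# Stage 2: splice the reasoning back in and extract the final answer.
answer = ask(question + "\nA: Let's think step by step. " + reasoning
             + "\nTherefore, the answer is")
```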

2. Example-based chain-of-thought (few-shot CoT) prompting

At present, the Prompt-based method is the main direction of LLM reasoning research, and a lot of work is carried out based on this idea. In this direction, several representative works have achieved remarkable results, and these works can basically represent the development direction of CoT technology.

 

The main idea of CoT is simple and clear: to teach the LLM to reason, give it some manually written reasoning examples that detail the specific reasoning steps leading to the final answer; these manually written detailed reasoning processes are the chain-of-thought prompts. The purpose of CoT is to make the LLM understand that during reasoning the steps should not be too large: turn big problems into small ones, proceed step by step, and accumulate small wins into a big win. The earliest paper to explicitly propose the concept of CoT is "Chain of thought prompting elicits reasoning in large language models", published in January 2022. Although the method is very simple, the LLM's reasoning ability improves greatly with CoT: accuracy on the GSM8K mathematical reasoning test set rose to about 60.1%. Notably, the idea of giving detailed reasoning steps and intermediate processes was not first proposed by CoT; the earlier "scratchpad" technique used a similar idea.

 

Not long after CoT was proposed, in March 2022 an improved technique called "Self-Consistency" appeared, raising GSM8K accuracy to 74.4%. Its idea is also simple and clear: first use CoT to give several examples of reasoning processes, then ask the LLM to reason about the given problem; unlike CoT, "Self-Consistency" asks the LLM to output multiple different reasoning processes and answers and selects the best answer by vote. This teaches the LLM that a math problem can have many correct solution paths, each different derivation leading to the final answer. Simple methods often carry deep philosophical meaning. Later, building on "Self-Consistency", the work "On the Advance of Making Language Models Better Reasoners" integrated three further improvements: expanding from one prompt question to multiple prompt questions, checking the correctness of the intermediate reasoning steps, and weighted voting over the answers of all outputs; this raised GSM8K accuracy to about 83%. The voting core is sketched below.
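The voting core of Self-Consistency can be sketched in a few lines; `ask` is the assumed helper from earlier, the sampling temperature is assumed to be nonzero so reasoning paths differ, and the answer parser is a naive stub.

```python
from collections import Counter

def extract_answer(completion: str) -> str:
    return completion.strip().split()[-1]  # naive: take the last token as the answer

def self_consistency(question: str, n: int = 5) -> str:
    """Sample n chain-of-thought completions and majority-vote on the answers."""
    answers = [extract_answer(ask(question + "\nA: Let's think step by step."))
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```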

 

3. Divide and conquer algorithm

The core idea is to decompose a complex reasoning problem into several easy-to-solve sub-problems, solve them, and then derive the answer to the complex problem from the sub-answers. This line of thinking may be the authentic path that reveals the essence of the problem and ultimately solves complex LLM reasoning. Take the "Least-to-most prompting" technique as an example; it has two stages. In the first stage, we extract the final question Final Q from the original problem, construct a prompt template along the lines of "To solve Final Q, we first need to solve:", and let the LLM answer, obtaining the prerequisite sub-question Sub Q. In the second stage, the LLM first answers Sub Q; then the original problem, Sub Q, and its answer are stitched together, and the LLM is asked the final question Final Q, at which point it gives the final answer. This embodies dismantling sub-problems and gradually deriving the final answer from sub-answers, akin to a divide-and-conquer algorithm. A minimal sketch follows.
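A minimal sketch of the two stages, again with the assumed `ask` helper; the prompt templates are paraphrases, not the exact wording from the Least-to-most paper.

```python
def least_to_most(problem: str, final_q: str) -> str:
    # Stage 1: ask which sub-question must be solved before the final question.
    sub_q = ask(f"{problem}\nTo answer '{final_q}', what do we need to figure out first?")
    # Stage 2: answer the sub-question, splice it back, then ask the final question.
    sub_a = ask(f"{problem}\nQ: {sub_q}\nA:")
    return ask(f"{problem}\nQ: {sub_q}\nA: {sub_a}\nQ: {final_q}\nA:")
```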

Code pre-training enhances LLM reasoning ability

The above covered three mainstream methods of using prompts to stimulate LLM reasoning. Meanwhile, an interesting and puzzling phenomenon has been observed: pre-training the model on program code together with text, rather than on text alone, can significantly improve the LLM's reasoning ability.

In the paper "On the Advance of Making Language Models Better Reasoners", experimental data shows an interesting phenomenon: pre-training the model on program code plus text significantly improves the LLM's reasoning ability. The experiments show that merely switching from a plain-text pre-trained model to a mixed text-and-code pre-trained model improves reasoning by 20 to 50 percentage points on almost all test datasets.

In addition, the study found that GPT-3's plain-text pre-trained model actually possesses considerable reasoning ability that needs to be elicited by appropriate methods, and that adding instruct fine-tuning harms the LLM's reasoning ability while improving natural language understanding to some extent. As for why pre-training on code yields additional reasoning capability, there is no confirmed explanation yet; it may be that code training is essentially a multimodal alignment of two kinds of data, <text, Code>, in which a considerable proportion of the data contains mathematical or logical reasoning that helps solve downstream mathematical reasoning problems. These conclusions inspire further thinking and exploration.

Thoughts on Reasoning Ability of LLM

Over the past year, techniques for stimulating LLM reasoning have advanced rapidly, but the overall feeling is that we are still far from touching the real essence of the problem, which requires deeper thought and exploration. For complex reasoning, decomposition into several simple sub-questions works because sub-questions have a higher probability of being answered correctly by the LLM. Inspired by "Least-to-most prompting", LLM reasoning may essentially be either a graph-reasoning problem of continual interaction with the LLM, or a program-flow-chart-execution problem of continual interaction with the LLM.

Suppose we can decompose a complex problem into a graph structure composed of sub-problems or sub-steps, where nodes represent sub-problems or sub-steps and edges represent their dependencies. Following the dependencies, we can guide the LLM to answer, step by step, the sub-questions that must be answered first, until the final answer is derived. The graph may contain cycles, i.e., some sub-steps that must be executed repeatedly. If we can obtain such a sub-problem decomposition graph, we can effectively guide the LLM to reason.

Suppose further that we can decompose a complex problem into sub-problems or sub-steps and generate a structure like a program flow chart, with loops and conditional branches. We could then interact with the LLM while executing each sub-step, obtain its answer, and continue along the flow until the final answer is output. A multimodal pre-trained model of this sort could enhance the LLM's ability to construct an implicit flow chart from text and execute it, thereby enhancing its reasoning. A speculative sketch of the graph-guided idea appears below.
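A speculative sketch of the graph-guided idea: hand-write a tiny dependency graph of sub-problems, then query the LLM in topological order, feeding earlier answers into later prompts. The graph, node names, and `ask` helper are all assumptions for illustration.

```python
from graphlib import TopologicalSorter

deps = {                      # node -> the sub-problems it depends on
    "final": {"sub1", "sub2"},
    "sub2": {"sub1"},
    "sub1": set(),
}

answers: dict[str, str] = {}
for node in TopologicalSorter(deps).static_order():  # dependencies come first
    context = "\n".join(f"{k}: {v}" for k, v in answers.items())
    answers[node] = ask(f"{context}\nNow solve sub-problem '{node}':")
```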

However, how to obtain the graph or flow-chart structure from a textual description remains a difficult point. One possible idea is to implicitly learn the internal hidden structure through enhanced text and higher-quality code pre-training. Current CoT technology effectively tries to deduce the graph backward from the final graph node, but today's methods limit the depth of this backward deduction and can only deduce very simple graph structures, which is why their ability is limited.

LLM research trends and key directions worthy of research

Here are some important LLM research areas or research directions worthy of in-depth exploration.

1. Exploring the scale ceiling of the LLM model

Although pushing LLM model scale may seem to involve little technical content, it is in fact very important. From the advent of Bert through GPT 3 to ChatGPT, the core contributions of the impressive key breakthroughs have all come from the growth of LLM scale rather than from any one specific technique. For knowledge-intensive tasks, as model size grows, performance on various tasks keeps improving; for many hard reasoning-type tasks, once CoT prompting is added, performance also tends to follow the scaling law. So a natural question is: for these tasks, to what extent can the scale effect of LLMs solve them?

Considering the magical "emergent abilities" of LLMs, what unexpected new capabilities might be unlocked if model size keeps growing? That is another interesting question. It is therefore necessary to keep increasing model size and find out where the scale ceiling lies for solving various tasks. Of course, this is easier said than done: 99.99% of practitioners have neither the opportunity nor the resources to do it.

Doing this places extremely high demands on a research institution's financial resources and willingness to invest, engineering capability, and technical enthusiasm, all of which are indispensable. By rough estimate, no more than five institutions abroad, and no more than three in China, can do it, because building a super-large LLM requires very strong engineering implementation from the technical team along with very strong hardware and software support; it is highly technical work.

Nevertheless, the research significance of continuing to scale up LLMs remains great. Besides exploring how far the scale effect goes for various tasks, one can explore what new abilities larger scales will unlock. Answers to these questions help us better understand the nature and behavior of LLMs and provide important references for future research and applications, so it is very valuable for capable research institutions to continue this scaling research.

2. Enhance the complex reasoning ability of LLM

As described earlier regarding LLM reasoning, although great progress has been made in the past year, limitations remain: many studies show that LLMs still cannot solve complex reasoning problems well; for instance, when long strings or large numbers are involved, LLM reasoning ability drops significantly. Strengthening complex reasoning should therefore be one of the focuses of future research.

Earlier we mentioned one method for directly enhancing LLM reasoning: adding code to pre-training. Although this route has been validated by some practice, its underlying principles need deep exploration, and more new types of data should be introduced to enhance reasoning. That may be a more essential direction for improving LLM reasoning, not merely limited to adding code.

3. LLM incorporates more research fields besides NLP

The current ChatGPT is a model that excels at natural language processing (NLP) and programming tasks. As part of the cutting-edge research leading toward artificial general intelligence (AGI), combining image, video, audio, and other multimedia data with language models, and further applying AI to other fields such as scientific research and robot control, is an important path toward wider applications and differentiated development. Although this research direction is still in its infancy, it has extremely high research value.

4. An easier-to-use interactive interface between people and LLM

As discussed earlier, ChatGPT's main technical contribution is its excellent performance in specific domains such as NLP and programming. Yet current technology is still imperfect, and there are many commands and instructions that LLMs cannot understand. Therefore, a very promising new direction is to find better ways for LLMs to understand the command expressions humans habitually use. Exploration here will create new opportunities and provide more potential solutions for advancing the state of the art of LLMs.

5. Build a difficult comprehensive task evaluation data set

An excellent evaluation dataset is the foundation of technical progress. As LLMs scale up and task performance rapidly improves, many classic test sets quickly become too easy to effectively expose the flaws and blind spots of current techniques. Constructing difficult test datasets is therefore crucial to advancing LLM technology. Some new test sets have appeared in industry, such as BIGBench and OPT-IML; they are reasonably difficult, combine the requirements of multiple task types, and better reflect the challenges facing current LLM techniques.

Inspired by ChatGPT, beyond the difficulty and diversity of the test set, tasks should also reflect real user needs. In other words, the tasks should be proposed by real users; only an LLM built and evaluated this way can truly solve users' actual needs. In addition, LLMs will rapidly extend their capabilities beyond NLP, so it is worth considering in advance how to incorporate evaluation data from other fields. This will help further improve the broad adaptability of LLM models.
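As a small illustration of how such a suite is consumed, the sketch below scores a model on a task file with exact match. The example layout is loosely modeled on BIG-bench-style JSON tasks but is an assumption for illustration, not the benchmark's actual schema; `complete()` is the same hypothetical completion stub as above.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM text-completion call."""
    raise NotImplementedError("plug in a real LLM client here")

# Assumed task layout, loosely modeled on BIG-bench-style JSON tasks.
task = {
    "examples": [
        {"input": "What is 17 * 23?", "target": "391"},
        {"input": "Reverse the string 'abc'.", "target": "cba"},
    ]
}

def exact_match_score(task: dict) -> float:
    """Fraction of examples where the model's output exactly matches the target."""
    hits = sum(
        int(complete(ex["input"]).strip() == ex["target"])
        for ex in task["examples"]
    )
    return hits / len(task["examples"])

# score = exact_match_score(task)  # run once a real client backs complete()
```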

6. High-quality data engineering

Data is the core of pre-trained models, and pre-training is the process of acquiring knowledge from data. More attention therefore needs to be paid to mining, collecting and cleaning high-quality data. Quality and quantity are the two key dimensions. The ablation comparisons in the T5 work suggest that, between quality and quantity, quality should take priority; the right approach is to increase the data scale while keeping quality high.

In terms of data quality, multiple criteria need to be considered, such as information content and diversity. Wikipedia, for example, is high-quality data with very high information content. Increasing the diversity of data types is crucial for unlocking new LLM capabilities: adding data from a question-answering website, for instance, directly improves the question-answering ability of an LLM. Diverse data gives an LLM better ability to solve many types of tasks, so diversity is the most critical criterion of data quality.
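To make the cleaning step concrete, here is a minimal quality-filter and exact-deduplication pass, loosely in the spirit of C4-style pipelines. The heuristics (minimum length, terminal punctuation, a boilerplate keyword, hash-based dedup) are illustrative assumptions, not a production recipe.

```python
import hashlib

def quality_filter(line: str) -> bool:
    """Cheap heuristic quality checks, loosely in the spirit of C4-style cleaning."""
    line = line.strip()
    if len(line.split()) < 5:                     # drop very short fragments
        return False
    if not line.endswith((".", "!", "?", '"')):   # keep sentence-like lines only
        return False
    if "javascript" in line.lower():              # common web-boilerplate marker
        return False
    return True

def dedup(lines):
    """Exact dedup via content hashes; near-dedup (e.g. MinHash) is a common upgrade."""
    seen = set()
    for line in lines:
        h = hashlib.md5(line.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            yield line

raw = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate
    "click here",                                    # too short, no punctuation
]
print([l for l in dedup(raw) if quality_filter(l)])  # only the first sentence survives
```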

Regarding the amount of data, in principle anything publicly released on the Internet can be included in pre-training. However, the volume of available data has a limit. One study estimated the scalability of data volumes and concluded that high-quality NLP data will be exhausted by around 2026, low-quality NLP data between 2030 and 2050, and low-quality image data between 2030 and 2060. This implies that either new types of data sources must be developed or LLMs must use data more efficiently; otherwise, the current data-driven approach to model optimization will stall or yield diminishing returns. New solutions are needed for this data-limit problem.

7. Sparsification of super-large Transformer-based LLM models

Some of the largest LLMs, such as GLaM and Switch Transformer, adopt a sparse structure (unlike dense models such as GPT-3 and PaLM). The main advantage of a sparse model is that training and inference time can be greatly reduced: under the same compute budget, a sparse model can train roughly 4 to 7 times faster than a dense one. Although a sparse model has a huge number of parameters, for each training instance a routing mechanism selects only a small fraction of them to participate in training and inference, which is why it is faster.
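The routing idea is easy to see in code. Below is a minimal sketch of a top-1 Mixture-of-Experts layer in PyTorch, loosely in the spirit of Switch Transformer; it is an illustration under simplifying assumptions, since real systems add load-balancing losses, capacity limits, and expert parallelism, all omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparsely-gated MoE layer with top-1 routing (a simplified sketch)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to exactly one expert,
        # so only a small fraction of parameters is active per token.
        gate = F.softmax(self.router(x), dim=-1)     # (tokens, n_experts)
        weight, idx = gate.max(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

layer = MoELayer(d_model=64, d_ff=256, n_experts=4)
tokens = torch.randn(8, 64)                          # 8 tokens
print(layer(tokens).shape)                           # torch.Size([8, 64])
```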

Future ultra-large-scale LLMs are likely to trend toward sparse models, for two main reasons. First, studies show that standard dense models themselves exhibit sparse activation during training and inference: only some parameters are activated for any given input, and most do not participate. Migrating to an explicitly sparse model is therefore a reasonable step. Second, model sizes will continue to grow, and high training cost is the main obstacle to further scaling. Sparse models can significantly reduce the training cost of very large models, so the benefit of sparsity grows with model size. For these reasons, future larger-scale LLMs are likely to adopt sparse designs.

However, the reason many other large models have not yet adopted sparsity is that sparse models suffer from unstable training and a tendency to overfit, making them hard to train well. Solving these problems and designing sparse models that are easier to train is therefore an important direction for future research.

What should be paid attention to when replicating ChatGPT?

To reproduce an impressive LLM like ChatGPT, the following issues need to be weighed when selecting technology.

1. Regarding the pre-training mode, the options are an autoregressive language model such as GPT, a bidirectional language model such as Bert, or a mixed mode such as T5. Based on the analysis in this article, the GPT-style autoregressive language model is likely the better choice. However, many domestic LLM projects appear to have chosen the Bert bidirectional or T5 mixed mode, and that choice of direction may well prove to be a misstep.

2. Strong reasoning ability is an important basis for users to accept an LLM. According to current experience, the best way to achieve it is to introduce a large amount of code alongside text in the pre-training stage and train the LLM on both. The earlier analysis in this article explains this point.

3. If the model's parameter count should stay modest while still achieving good results, there are two options (see the retrieval sketch after this list). One is to strengthen high-level feature extraction and representation, through deeper network structures or more elaborate feature-extraction methods. The other is to combine a text retrieval model with the LLM: the retrieval model performs preliminary screening and matching, and the LLM then carries out further generation and reasoning over the retrieved material, which can greatly reduce the LLM's parameter scale.

4. Because the training cost of super-large models is high, few institutions can afford it, so reducing the training cost of LLMs is very important. One effective technical choice is to sparsify the LLM's feature extractor, which effectively reduces both training and inference cost. As model size increases, sparsification is therefore an option that should be considered.

5. The technical solution currently closest to the ideal LLM is ChatGPT. An ideal LLM should be a nearly omnipotent general-purpose large model that supports all kinds of task types. To approach this goal, more task types can be supported by increasing the diversity of the pre-training data: the better the diversity, the richer the range of tasks the LLM can support. Enhancing LLM capability through data diversity therefore deserves attention.

6. An easy-to-use human-machine interface is also very important. The LLM needs to understand the true meaning of tasks described in the way humans are accustomed to expressing them, and task expressions should be collected from real end users rather than relying on developers' imagination or guesswork. ChatGPT was very instructive to me in this regard, so whether or not reinforcement learning is used matters less than the outcome; other alternative techniques may achieve similar results.
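As promised in point 3 above, here is a minimal sketch of the retrieval-plus-LLM pattern. The retriever is a toy TF-IDF scorer and `complete()` is the same hypothetical completion stub used earlier; a production system would use dense vector search and a real model client, so treat this purely as an illustration of the division of labor.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM text-completion call."""
    raise NotImplementedError("plug in a real LLM client here")

docs = [
    "The Transformer architecture relies entirely on attention mechanisms.",
    "Sparse Mixture-of-Experts models activate only a few experts per token.",
    "RLHF fine-tunes language models on human preference data.",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_vecs = vectorizer.transform(docs)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retrieval: cosine similarity over TF-IDF vectors."""
    scores = linear_kernel(vectorizer.transform([query]), doc_vecs)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

query = "How do Mixture-of-Experts models save compute?"
context = "\n".join(retrieve(query))
prompt = f"Answer using the context.\nContext:\n{context}\n\nQ: {query}\nA:"
# answer = complete(prompt)  # the smaller LLM only reasons over retrieved text
print(prompt)
```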

In short, to replicate an impressive LLM like ChatGPT, factors such as the pre-training mode, reasoning ability, model size, training cost, data diversity, and the human-machine interface must all be weighed during technology selection, and the most suitable methods chosen to achieve the goal.

Factors required for LLM training

Training a large language model involves many challenges, which can be grouped into six aspects: hardware requirements, health checks, orchestration, data processing, model scale expansion, and cost management. Each has a significant impact on the effectiveness and efficiency of model training.


The first challenge is hardware. Using the latest hardware provides better performance, while failing to take full advantage of it leads to longer training times and suboptimal results.

Blue Ocean Brain's high-performance LLM training platform uses a working fluid as the intermediate heat-transfer medium, carrying heat away from the hot zone to be dissipated remotely. It supports a variety of hardware accelerators, including CPUs, GPUs, FPGAs and dedicated AI chips, meeting the needs of large-scale data processing and complex computing tasks. Its distributed computing architecture efficiently handles large-scale data and complex workloads, providing powerful computing support for deep learning, high-performance computing, large-scale model training, and the research and development of large language model (LLM) algorithms. The platform is highly flexible and scalable, can be customized for different application scenarios and requirements, and allows computing tasks to be quickly deployed and managed, improving the utilization and efficiency of computing resources.

Another challenge is health checking, to ensure hardware functions properly and to minimize disruption. Orchestration also needs to be considered, so that workloads within a team do not interfere with one another while networking and security remain well configured. Handling large-scale datasets is a further challenge, requiring efficient methods for storage, processing, and loading. Scaling the infrastructure and designing algorithms around its limits is equally important: these models usually do not fit on a single GPU, so you need to consider how to split the model across multiple GPUs.
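As a toy illustration of that splitting problem, the sketch below shows naive pipeline parallelism in PyTorch: the model's layers are partitioned across two GPUs and activations are moved between them. It assumes a two-GPU machine, and real systems (tensor parallelism, ZeRO-style sharding, pipeline schedules with micro-batches) are far more sophisticated.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive pipeline parallelism: half the layers per GPU, activations cross over."""

    def __init__(self, d: int = 1024, layers_per_stage: int = 4):
        super().__init__()
        def block():
            return nn.Sequential(*[nn.Linear(d, d) for _ in range(layers_per_stage)])
        self.stage0 = block().to("cuda:0")  # first half of the model
        self.stage1 = block().to("cuda:1")  # second half

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))
        x = self.stage1(x.to("cuda:1"))     # activations cross the GPU boundary here
        return x

if torch.cuda.device_count() >= 2:          # guard: requires two CUDA devices
    model = TwoStageModel()
    out = model(torch.randn(8, 1024))
    print(out.shape, out.device)            # torch.Size([8, 1024]) cuda:1
```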

Finally, cost management is a factor that cannot be ignored. Training large models is expensive, and the machine learning team's time should be used well, letting them focus on creating new models rather than spending too much time on infrastructure.
