[Natural Language Processing] ChatGPT-Related Core Algorithms


The outstanding performance of ChatGPT rests on several core algorithms working together. This article introduces the Transformer model that underlies its implementation, the prompt learning and instruction tuning algorithms that elicit the knowledge contained in the model, its emergent chain-of-thought ability, and the reinforcement learning from human feedback algorithm that aligns it with human intent.

1. Transformer-based pre-trained language models

The powerful base model of ChatGPT adopts the Transformer architecture. Transformer [26] is a deep neural network model based on the self-attention mechanism, which can process sequence data efficiently and in parallel. The original Transformer model consists of two key components: an encoder and a decoder. The encoder maps the input sequence to a set of intermediate representations, and the decoder converts the intermediate representations into the target sequence. Both the encoder and the decoder are stacks of layers, each containing a self-attention module and a feed-forward neural network. The self-attention module learns the dependencies between different positions in the sequence; that is, when processing the information at each position, the model considers the information at all other positions in the sequence. This mechanism enables the Transformer model to handle long-distance dependencies effectively. On the basis of the original Transformer model, three families of pre-trained language models have been successively derived: encoder-only pre-trained models, decoder-only pre-trained models, and encoder-decoder pre-trained models.
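As a minimal sketch of the self-attention mechanism described above (not code from the report; shapes and names are illustrative), single-head scaled dot-product attention can be written in a few lines of NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input token representations
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every position attends to every position
    return softmax(scores) @ V               # weighted sum of value vectors

# Toy usage: 5 tokens with 16-dimensional representations
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (5, 8)
```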

1.1 Encoder-only Pre-trained Language Models

Such models use only the encoder of the original Transformer during pre-training. The corresponding pre-training task is usually masked language modeling (Masked Language Modeling): a certain proportion of the words in the input sentence are masked (replaced with [MASK]), and the model is required to predict the masked words from the context. Representative works include BERT [2], ALBERT [27], and RoBERTa [28] (a concrete usage sketch follows the list below).

  • BERT is the most classic encoder-only pre-trained language model; it pre-trains the parameters of the Transformer model through the masked language modeling and next sentence prediction tasks.
  • ALBERT is a lightweight variant of BERT. The authors reduce the number of model parameters by factorizing the word embedding matrix and sharing parameters across Transformer layers.
  • RoBERTa, compared with BERT, uses a larger corpus and a dynamic masking mechanism in the pre-training stage (different words are masked in the same sample in different epochs), removes the next sentence prediction task, and uses a larger batch size.
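To make the masked-prediction setup concrete, here is a hedged sketch using the Hugging Face transformers library (assuming it and the public bert-base-uncased checkpoint are available; this example is ours, not part of the original report):

```python
from transformers import pipeline

# A fill-mask pipeline runs an encoder-only model (here BERT) on a sentence
# containing the special [MASK] token and ranks candidate fillers.
fill = pipeline("fill-mask", model="bert-base-uncased")

for cand in fill("The capital of France is [MASK]."):
    print(f'{cand["token_str"]:>10}  {cand["score"]:.3f}')
# expected top candidate: "paris"
```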

1.2 Decoder-only Pre-trained Language Models

GPT (Generative Pre-trained Transformer) is a decoder-only pre-trained model proposed by OpenAI. Compared with previous approaches, it is no longer necessary to design a different model architecture for each task; instead, a single model with strong generalization ability is fine-tuned for each downstream task. This chapter introduces the GPT series models, including GPT-1, GPT-2, and GPT-3.

1.2.1 GPT-1

GPT-1 was proposed in the paper "Improving Language Understanding by Generative Pre-Training" [1]. Before GPT was proposed, most deep learning methods required large amounts of manually labeled, high-quality data, but the cost of labeling data is huge, which greatly limited the ceiling of model performance on various tasks. How to use easily obtained, large-scale unlabeled data to guide the training of the model became the first problem for GPT-1. In addition, many tasks in natural language processing rely on representations of natural language in a hidden space, and the representations suited to different tasks are likely to differ, which makes it difficult to generalize a model learned from one task's data to other tasks. Therefore, how to transfer the representations learned from large-scale unlabeled data to different downstream tasks became the second problem for GPT-1.

The structure of GPT-1 is very simple: 12 stacked Transformer blocks (each a self-attention module plus a feed-forward neural network module). For the first problem, GPT-1 uses a left-to-right generative objective function to pre-train the model. This objective can be understood simply as: given the previous $i-1$ tokens, predict the $i$-th token. Based on such an objective, GPT-1 can be trained on unlabeled natural language data and learn deep grammatical and semantic information.

For the second problem, after completing unsupervised pre-training, GPT-1 uses supervised fine-tuning on labeled data to better adapt the model to downstream tasks. Given a dataset of input token sequences $x_1, x_2, ..., x_m$ with labels $y$, the parameters of the model are further adjusted so that, given the input sequence, the predicted label is as close as possible to the true value.
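The two objectives just described can be written out as follows (a sketch of the standard formulation from the GPT-1 paper, where $k$ is the context window size and $\Theta$ the model parameters):

```latex
% Unsupervised pre-training: maximize the left-to-right language modeling likelihood
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

% Supervised fine-tuning: maximize the label likelihood given the input sequence
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x_1, \ldots, x_m)

% The paper also combines the two, keeping language modeling as an auxiliary objective
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```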

Specifically, after pre-training on a large-scale unlabeled corpus, GPT-1 uses labeled data to fine-tune the model parameters on a specific target task, transferring the knowledge obtained during pre-training to downstream tasks. Before GPT-1 was proposed, the commonly used pre-training method in natural language processing was Word2vec [29]; afterwards, the two-step training method proposed by GPT-1 became the training paradigm for many large language models. From this point of view, GPT-1 plays a role similar to Word2vec in specific downstream tasks: an implicit representation of natural language is obtained by an unsupervised method and then transferred to other target tasks. But at a higher level, GPT-1 differs from previous word-vector representation methods: the increase in data volume and model scale enables it to learn representations of natural language across different scenarios. The figure below is an overview from the original paper: the left side shows the architecture and the objective function used during training; the right side shows how the model's inputs and outputs change when fine-tuning on different tasks. In general, the goal of GPT-1 is to learn a general-purpose natural language representation that can then be adapted to a wide range of tasks with simple adjustments. From today's perspective, there are two reasons behind GPT-1's success.
[Figure: GPT-1's Transformer architecture and training objective (left), and the input transformations used when fine-tuning on different tasks (right)]
The first is that the introduction of the Transformer in 2017 made it possible to capture long-distance dependencies in natural language; the second is that GPT-1 uses more data and more model parameters during pre-training, enabling the model to learn knowledge from large-scale corpora that previous models could not. Task-specific fine-tuning then builds a knowledge bridge between general pre-training and downstream tasks, making it feasible to solve multiple problems with one model.

1.2.2 GPT-2

Different from GPT-1, which solves multiple downstream tasks through the pre-training–fine-tuning paradigm, GPT-2 [3] focuses more on the zero-shot (Zero-shot) ability of the language model. Zero-shot means the model receives no training or fine-tuning on the downstream task; that is, the model no longer optimizes its parameters on downstream task data, but understands and completes the task on its own according to the given instruction.

Simply put, GPT-2 makes no architectural innovation over GPT-1. Rather, on the basis of GPT-1 it introduces task-related information as a condition for output prediction, changing the conditional probability $p(output \mid input)$ into $p(output \mid input; task)$, and it continues to scale up the training data and the number of model parameters, finally showing great potential on multiple tasks under the zero-shot setting.

Although GPT-2 makes no change to the model architecture, the idea of introducing the task into the model as a condition for output prediction, so as to handle multiple tasks under the zero-shot setting, has persisted to this day. This idea essentially conveys that as long as the model is large enough and has learned enough knowledge, any supervised task can be completed in an unsupervised manner; that is, any task can be regarded as a generation task.
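As an illustrative sketch of this zero-shot, everything-is-generation view (using the Hugging Face transformers library and the public gpt2 checkpoint; the example is ours, though the "TL;DR:" trigger for summarization does come from the GPT-2 paper):

```python
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

# The task is specified entirely in the input text, i.e. p(output | input; task):
# no fine-tuning is performed, and the model must infer "summarize" from the prompt.
article = "A long news article goes here ..."
prompt = article + "\nTL;DR:"
print(generate(prompt, max_new_tokens=40)[0]["generated_text"])
```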

1.2.3 GPT-3

GPT-3 [4] uses the same model and architecture as GPT-2. To explore the impact of model size on performance, the authors trained a total of 8 models of different sizes; the largest, with 175 billion parameters, is called GPT-3. Table 2.1 gives a comprehensive comparison of GPT-1, GPT-2, and GPT-3.

The most notable feature of GPT-3 is its size, which is reflected in two aspects. On one hand, the model itself is large, with many parameters: 96 Transformer decoder layers, each with 96 attention heads of 128 dimensions, and a word-embedding dimension of 12,288. On the other hand, the dataset used for training is large, reaching 45 TB. With such model size and data volume, GPT-3 shows excellent performance on many tasks. Continuing from GPT-2, GPT-3 significantly improves performance under the few-shot, one-shot, and zero-shot settings.

Although GPT-3 achieves surprising results, it still has many limitations. For example, its purely left-to-right generative training means its comprehension ability still needs improvement, and there are social and ethical concerns. At the same time, since the GPT series does not change the model structure but keeps increasing the amount of training data and model parameters to improve performance, the training cost is huge, making it impossible for ordinary institutions and individuals to undertake the training, or even the inference, of large language models; this cost greatly raises the threshold for adopting such models.

1.3 Encoder-decoder Pre-trained Language Models

Encoder-based architectures benefit from the global visibility of bidirectional encoding and perform well on language-understanding tasks, but they cannot be applied to generative tasks because they cannot perform variable-length generation. Decoder-based architectures adopt a unidirectional autoregressive mode and can complete generation tasks, but information flows only from left to right: the model sees only the preceding context, not the following one, and lacks bidirectional interaction. In response to these problems, some models use a sequence-to-sequence architecture to fuse the two structures, using the encoder to extract useful representations from the input to assist and constrain the decoder's generation. Table 2.1 lists several classic models under this framework.

  • BART: The specific structure of BART is a bidirectional encoder connected to a unidirectional autoregressive decoder. Its pre-training method is denoising reconstruction: text corrupted with various kinds of noise is given as input, and the model must reconstruct the original text. In the decoder, each layer performs cross-attention over the hidden representations of the last encoder layer to aggregate key information. BART is trained on Wikipedia and the BookCorpus dataset, totaling 160 GB of data [30].
  • T5: BART designs complex pre-training tasks in order to accommodate different downstream tasks. For the problem of how to achieve excellent transfer performance across multiple tasks, Google researchers proposed a new paradigm: unify all natural language processing tasks into a text-to-text generation task. By adding a prompt prefix to the input, T5 can use a single model to solve multiple tasks such as machine translation, text summarization, question answering, and classification (see the usage sketch after Table 2.1 below). For the huge, high-quality, and diverse pre-training data required for transfer learning, T5 is trained on Google's specially constructed C4 dataset [31].
  • Switch Transformers: As research on language models deepens, increasing the number of parameters can significantly improve model performance, but it also brings ever-growing computation at inference time. Switch Transformer introduces the conditional computation idea of the mixture-of-experts network (Mixture-of-Experts, MoE) into the fully connected layers of the Transformer, so that model size can be increased without increasing the amount of computation during inference [32].

[Table 2.1 (shown as an image in the original): statistics of the classic pre-trained models discussed above]
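To illustrate the text-to-text paradigm, here is a hedged sketch using the public t5-small checkpoint via Hugging Face transformers (the prompt prefixes are the ones T5 was trained with; the example itself is ours, not from the report):

```python
from transformers import pipeline

# One encoder-decoder model, several tasks, all expressed as text-to-text.
t5 = pipeline("text2text-generation", model="t5-small")

# The task is selected purely by the prompt prefix on the input text.
print(t5("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t5("summarize: " + "A very long article goes here ...")[0]["generated_text"])
```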

2. Prompt learning and instruction fine-tuning

2.1 Overview of Prompt Learning

Prompt learning (Prompt Learning), simply put, edits the input of a downstream task through certain methods so that it formally resembles the data and tasks used during model pre-training. For example, in a sentiment classification task, the supervised-learning approach inputs "I failed the test today" and has the model output a classification score or distribution, while the prompt-learning approach appends a natural-language description, giving "I failed the test today. I feel very ____", lets the model generate the continuation, and then maps the generated content to a classification label through some mapping function.

As can be seen, prompt learning shortens the distance between the test distribution and the pre-training distribution, and can thus exploit the powerful language-modeling ability that a large-scale pre-trained language model acquires during pre-training, achieving good results on a variety of downstream tasks without fine-tuning. Follow-up work proposed methods for automatic prompt search and continuous prompts, so that the prompt itself can also be fine-tuned, making it more flexible.
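A minimal sketch of this template-plus-verbalizer idea (the template, label words, and function names are our own illustrative assumptions), reusing the fill-mask pipeline from Section 1.1:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Template turns classification into the pre-training task (masked word prediction);
# the verbalizer maps generated label words back to class labels.
TEMPLATE = "{text} I feel very [MASK]."
VERBALIZER = {"happy": "positive", "good": "positive",
              "sad": "negative", "bad": "negative"}

def classify(text):
    for cand in fill(TEMPLATE.format(text=text), top_k=50):
        word = cand["token_str"].strip()
        if word in VERBALIZER:
            return VERBALIZER[word]
    return "unknown"

print(classify("I failed the test today."))  # expected: negative
```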

There are also various interesting uses of prompt learning, such as in-context learning (In-context Learning) in few-shot scenarios, where several complete examples are added to the prompt, e.g. "the capital of the United States is Washington, the capital of France is Paris, the capital of the United Kingdom is ____"; and chain-of-thought (Chain-of-Thought, COT) prompting on reasoning tasks, introduced in detail in the next section.

Compared with prompt learning, instruction tuning (Instruction Tuning) can be regarded as an enhanced version of prompt learning. The essential goal of both is to elicit the latent knowledge contained in the model itself by editing the input, so as to better complete downstream tasks. Unlike prompt learning, instruction learning is no longer satisfied with imitating the distribution of the pre-training data; rather, by constructing instructions (Instruction) and fine-tuning on them, it aims to learn the distribution of human interaction patterns so that the model can better understand human intent and align with human behavior. In instruction learning, what the model faces is no longer a simple completion task, but the "instructions" of various tasks, i.e. task requirements, and it must respond correctly according to each requirement. Examples of "instructions" are as follows:

  • Please translate the following sentence into English: "What core technologies does ChatGPT use?"
  • Please segment the following Chinese sentence into words: "I like ChatGPT so much!"
  • Please write a poem describing spring, with birds, flowers, and grass in the poem.

As the examples show, classic NLP tasks, once wrapped with task requirements, become "instructions" that better match human habits. Studies have shown that when the variety of "instruction" tasks reaches a certain level, a large model can even acquire good processing ability on zero-shot (Zero-shot) tasks it has never seen before. Instruction learning therefore helps a language model train deeper language-understanding ability as well as zero-shot ability on various tasks. The InstructGPT model proposed by OpenAI uses the idea of instruction learning, and ChatGPT follows InstructGPT's approach.
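A sketch of what instruction-formatted training samples might look like (the field names and contents are our own illustrative assumptions, not OpenAI's actual schema):

```python
# Each sample pairs a natural-language task requirement with the desired response.
instruction_data = [
    {"instruction": "Please translate the following sentence into English.",
     "input": "ChatGPT 用到了哪些核心技术？",
     "output": "What core technologies does ChatGPT use?"},
    {"instruction": "Write a poem describing spring that mentions birds, "
                    "flowers, and grass.",
     "input": "",
     "output": "..."},  # free-form generation target written by an annotator
]
```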

2.2 Instruction Learning in ChatGPT

According to OpenAI's blog (https://openai.com/blog/chatgpt/), the construction method and training method of the instruction-learning dataset used by ChatGPT are roughly the same as those of InstructGPT, so we introduce the details of how InstructGPT constructs its "instruction" dataset.

InstructGPT's "instruction" dataset consists of two parts. One part is real human-computer interaction data collected from users around the world through the OpenAI API; this data is deduplicated and filtered for sensitive information before use. The other part comes from manual annotation. To obtain high-quality annotations, OpenAI hired a team of 40 annotators selected through preliminary screening and interviews. The manually labeled data falls into three categories: first, to increase the task diversity of the dataset, "instructions" for arbitrary tasks written by the annotators; second, few-shot (Few-shot) data, where annotators write an "instruction" together with some corresponding question-answer pairs, used to train the model's few-shot learning ability; third, "instruction" data written by annotators in imitation of existing use cases in the OpenAI API. These data cover the common task types of language models (generation, question answering, chat, rewriting, summarization, classification, etc.); 45.6% of the "instructions" are generation tasks, the largest proportion among all types.

InstructGPT aligns the model with human needs by performing supervised fine-tuning (Supervised Fine-Tuning, SFT) and reinforcement learning from human feedback (Reinforcement Learning from Human Feedback, RLHF) on the constructed "instruction" dataset.

In the experiments, the instruction-tuned InstructGPT model with 175B parameters was compared against models fine-tuned on FLAN and T0, two classic instruction-learning datasets; InstructGPT showed a certain degree of improvement over both the FLAN- and T0-fine-tuned models. The reasons can be attributed to two points:

  • First, existing public NLP datasets tend to focus on NLP tasks that are easy to evaluate (such as classification, question answering, translation, or summarization). In practice, however, statistics show that among users of the OpenAI API, those using the model for classification or question answering account for only a small fraction of all tasks, while open-ended generation tasks account for the largest proportion. This leaves earlier models trained on public NLP datasets without effective training on open-ended tasks. InstructGPT has annotators label a large number of open-ended "instructions" related to generation and brainstorming and trains the model on them, greatly improving the model's performance in these areas.
  • Second, existing public NLP datasets are often processed for only one or a few language tasks. This ignores the reality that human users pose all kinds of tasks to a language model, so a model that can comprehensively handle various tasks achieves better results in practice. The instruction-learning technique used by InstructGPT makes up for this shortcoming of traditional models: by labeling a large amount of "instruction" data with task diversity, it helps the model acquire processing ability across all kinds of tasks.

3. Chain of Thought (COT)

When solving complex reasoning tasks such as math problems, humans usually decompose the problem into multiple intermediate steps, solve them one by one, and then give the final answer. For example, for the problem "Xiaohua reads 24 pages a day and finishes the book "Red Rock" in 12 days. Xiao Ming reads 36 pages a day; how many days will it take him to finish "Red Rock"?", a person decomposes it into (1) "Red Rock" has 24 × 12 = 288 pages, and (2) Xiao Ming needs 288 ÷ 36 = 8 days. Inspired by this, Google researcher Jason Wei (now at OpenAI) and colleagues proposed chain-of-thought prompting [33]: by inserting a series of intermediate reasoning steps into the exemplars used for few-shot prompt learning, the reasoning ability of large language models is effectively improved. Figure 2.2 shows the model correctly solving a math problem by generating a chain of thought.

Compared with ordinary few-shot prompt learning, chain-of-thought prompting has several attractive properties:

  • With the support of the chain of thought, the model can decompose a problem requiring multi-step reasoning into a series of intermediate steps, allocating additional computation to the problems that need it.
  • The chain of thought provides an interpretable window into the model's reasoning behavior, making it possible to probe a black-box language model by inspecting its reasoning path.
  • Chain-of-thought reasoning applies widely: not only to tasks such as math word problems, commonsense reasoning, and symbolic manipulation, but potentially to any problem that can be solved through language.
  • The chain of thought is very easy to use and can be readily integrated into in-context learning (In-context Learning), thereby inducing large language models to exhibit reasoning ability.

[Figure 2.2: an example of a model solving a math word problem correctly by generating a chain of thought]
Building on [33], [34] addresses the zero-shot scenario: it uses the prompt "Let's think step by step" to induce the model to generate the intermediate reasoning steps itself, thus avoiding the manual writing of intermediate steps required in [33].
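The two styles of chain-of-thought prompting can be sketched as plain prompt strings (the first problem is the book example above; the exact exemplar wording is our own):

```python
# Few-shot chain of thought [33]: the exemplar includes worked-out intermediate steps,
# which the model then imitates for the new question.
few_shot_cot = """Q: Xiaohua reads 24 pages a day and finishes "Red Rock" in 12 days.
Xiao Ming reads 36 pages a day. How many days does he need to finish it?
A: The book has 24 * 12 = 288 pages. Xiao Ming needs 288 / 36 = 8 days.
The answer is 8.

Q: A farm has 15 hens, and each hen lays 3 eggs a day. How many eggs per day?
A:"""

# Zero-shot chain of thought [34]: a single trigger phrase, no hand-written exemplar.
zero_shot_cot = """Q: A farm has 15 hens, and each hen lays 3 eggs a day.
How many eggs per day?
A: Let's think step by step."""
```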

4. Reinforcement Learning from Human Feedback (RLHF)

RLHF is an important technology by which ChatGPT/InstructGPT achieves alignment with human intent, that is, following human instructions while generating results with as few negative effects as possible [16]. The algorithm is implemented under the reinforcement learning framework and can be roughly divided into the following two stages:

  • Reward model training: this stage aims to obtain a reward model that fits human preferences. The reward model takes a prompt and a reply as input and outputs a scalar reward value. It is trained by fitting human preferences between different replies. Specifically, multiple different replies to the same prompt are first sampled from a model fine-tuned on human-written data. The replies are then combined in pairs to form training samples for the reward model, and humans provide a preference label for each pair. Finally, the reward model fits the human labels through the difference between the reward values of the two replies in each sample (a loss sketch follows this list), completing its training.
  • Generative policy optimization: given the learned reward model, the parameters of ChatGPT/InstructGPT are treated as a policy and trained under the reinforcement learning framework. First, the current policy samples a reply for an incoming prompt. The reward model then scores the quality of the reply, and the reward is fed back to update the current policy. Notably, to prevent this process from over-optimizing against the reward model, the loss function also includes a token-level KL penalty term. Furthermore, to avoid performance degradation on public NLP datasets, the policy update also incorporates the pre-training loss.
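A minimal sketch of the pairwise reward-model loss described above (following the formulation in [16]; tensor shapes and names are illustrative, and `reward_model` stands in for any network producing a scalar score per prompt-reply pair):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise loss: push the preferred reply's reward above the other's.

    loss = -log sigmoid( r(prompt, chosen) - r(prompt, rejected) )
    """
    r_chosen = reward_model(prompt, chosen)      # scalar reward per sample
    r_rejected = reward_model(prompt, rejected)  # scalar reward per sample
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# During policy optimization, the per-token reward is typically penalized by a KL
# term against the SFT model to prevent over-optimization, schematically:
#   R(x, y) = r(x, y) - beta * KL( policy(y|x) || sft_model(y|x) )
```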

5. References

[1] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[J]. 2018

[2] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proc. of NAACL. 2019: 4171-4186

[3] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI blog, 2019, 1: 9

[4] BROWN T B, MANN B, RYDER N, et al. Language Models are Few-Shot Learners[C]//Proc. of NeurIPS. 2020

[16] OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback[J]. ArXiv preprint, 2022, abs/2203.02155

[26] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All you Need[C]//Proc. of NeurIPS. 2017: 5998-6008

[27] LAN Z, CHEN M, GOODMAN S, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations[C]//Proc. of ICLR. 2020

[28] LIU Y, OTT M, GOYAL N, et al. RoBERTa: A robustly optimized BERT pretraining approach[J]. ArXiv preprint arXiv:1907.11692, 2019

[29] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient Estimation of Word Representations in Vector Space[C]//Proc. of ICLR. 2013

[30] LEWIS M, LIU Y, GOYAL N, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension[C]//Proc. of ACL. 2020: 7871-7880

[31] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1-67

[32] FEDUS W, ZOPH B, SHAZEER N. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. J. Mach. Learn. Res., 2021, 23: 1-40

[33] WEI J, WANG X, SCHUURMANS D, et al. Chain of thought prompting elicits reasoning in large language models[J]. ArXiv preprint, 2022, abs/2201.11903

[34] KOJIMA T, GU S S, REID M, et al. Large language models are zero-shot reasoners[J]. ArXiv preprint, 2022, abs/2205.11916

This article is reproduced from: Harbin Institute of Technology, Institute of Natural Language Processing (HIT-NLP), "ChatGPT Research Report"
