Friends of ChatGPT: Reading the classic large language model papers until you're sick of them

Zhihu: Ostrich

Position: Alibaba Algorithm Engineer

Original text: https://zhuanlan.zhihu.com/p/620360553

If there was one topic that dominated everyone's feeds in 2023, ChatGPT is second to none. With the recent release of GPT-4, the technology has clearly broken out of its original circle, spreading from academia to industry to the capital markets, and it is gradually affecting the daily life and work of ordinary people.

Frankly, I held a conservative attitude toward work on generative large language models for a long time, seeing the direction as more of an idealistic pursuit within deep learning. It turns out the clown was me; perhaps excellent work is called excellent precisely because it keeps pursuing an ideal state.

Getting back to the topic: this series follows the trend and discusses ChatGPT-related technology. The content is divided into three parts, published as three articles:

1. Intensive reading of the classic papers [this article]: after reading it, you should understand the general ideas of the classic works related to ChatGPT and the key conclusions of each period;

2. Open-source implementation techniques [coming soon]: a summary of the main directions and methods open-source contributors have pursued in the months since ChatGPT;

3. The past, present and future of natural language generation [later]: beyond large language models, a look at the "traditional" research directions of natural language generation and what may come next.

Because the related technology is developing so quickly, all three parts will be updated regularly. This article covers the first part, the classic papers. There are many related works (as shown in the figure below), and reading them one by one is unrealistic, so this article focuses on the most continuous lines of work, the OpenAI series and the Google series, plus LLaMA, which has been quite influential recently, and finally GLM and ChatGLM, which are better adapted to Chinese.

[Figure: large models above 10B parameters (yellow = open source)]

In addition, reading this article assumes some basic NLP concepts, such as knowing what BERT and the Transformer are, what the Encoder-Decoder architecture is, what pre-training and fine-tuning are, and what a language model is.

OpenAI series

The goal of this section is to work through the main principles behind ChatGPT via the OpenAI series of papers. The lineage of this work can be summarized in the figure below. Tracing back through the dependencies: ChatGPT requires understanding Codex and InstructGPT, which require GPT-3, which in turn requires GPT-2 and GPT-1. (For now, GPT-4 is simply treated as a Plus version of GPT-3.5 with added multi-modal input capability; we will discuss it once more details are public.)

[Figure: lineage of the OpenAI GPT series]

GPT-1

Link to the paper: "Improving Language Understanding by Generative Pre-Training"

Motivation

The task goal is the same as BERT's (though this work predates BERT): pre-train on large-scale unlabeled data and fine-tune on downstream tasks, in order to solve classic NLP tasks and ease the high cost of collecting supervised data. GPT-1 is not the first work to use the pre-training + fine-tuning framework, but it is a very early one to do so with the Transformer decoder.

Approach overview

  • Model structure: the decoder part of the Transformer.

  • Training method: autoregressive language model pre-training, followed by discriminative fine-tuning on downstream tasks.

Some details

  • Pre-training:

    • Loss: the classic language model objective. Express the unlabeled corpus as a token sequence U = {u_1, ..., u_n} and maximize the following likelihood, i.e. predict the next token from the preceding tokens, where k is the context window size:

L_1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ)

    • Model: P is modeled with a multi-layer Transformer decoder; in simplified form it can be written as below. W_e is the token embedding matrix and W_p the position embedding matrix; the input passes through multiple transformer blocks, each token ends up as an encoded vector h_n, and a final linear layer + softmax gives the predicted distribution of the next token.

h_0 = U W_e + W_p
h_l = transformer_block(h_{l−1}), l = 1 … n
P(u) = softmax(h_n W_e^T)

    • Data: the scale in this early work was modest. GPT-1 mainly used two datasets:

      • BooksCorpus: more than 7,000 unpublished books;

      • 1B Word Benchmark (optional).

  • Fine-tuning:

    • Model modification: special tokens are added to mark the beginning ([Start]) and end ([Extract]) of the input (plus delimiters where needed); the hidden state at the final [Extract] position is fed into a fully connected layer for classification and the other downstream task variants, as shown in the figure:

[Figure: GPT-1 input transformations for the downstream tasks]

    • Loss:

P(y | x^1, …, x^m) = softmax(h_l^m W_y),  L_2(C) = Σ_{(x,y)} log P(y | x^1, …, x^m)

    • Small detail: during fine-tuning, adding the pre-training objective on top of the downstream task objective works better (see the code sketch after this subsection):

L_3(C) = L_2(C) + λ · L_1(C)
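To make the objectives above concrete, here is a minimal PyTorch-style sketch of the combined fine-tuning loss L_3 = L_2 + λ·L_1 (an illustration only, not the original implementation; the model interface returning both token logits and a classification logit is a hypothetical one):

```python
import torch
import torch.nn.functional as F

def gpt1_finetune_loss(model, input_ids, labels, lm_weight=0.5):
    """Sketch of GPT-1 fine-tuning: task loss L_2 plus auxiliary LM loss L_1.

    Assumes a hypothetical `model(input_ids)` that returns (token_logits,
    cls_logits): a decoder whose hidden state at the final [Extract] token
    feeds a classification head.
    """
    token_logits, cls_logits = model(input_ids)
    # L_2: supervised task loss on the classification head
    task_loss = F.cross_entropy(cls_logits, labels)
    # L_1: auxiliary language-model loss, predicting token t+1 from tokens <= t
    lm_loss = F.cross_entropy(
        token_logits[:, :-1].reshape(-1, token_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    # L_3 = L_2 + lambda * L_1
    return task_loss + lm_weight * lm_loss
```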

Results and discussion

  • Main verification method: the paper validates the strategy through downstream task performance on several classic datasets. In short, it achieves good results on many of them; classification-style datasets are shown below as an example. Note that BERT does not yet appear among the comparison baselines.

[Figure: GPT-1 results on classification-style datasets]

GPT-2

Link to the paper: "Language Models are Unsupervised Multitask Learners"

Motivation

Shortly after GPT-1, BERT appeared and swept the leaderboards on all kinds of tasks. GPT-1 tried increasing the model size, but under the pre-training + fine-tuning framework it still could not beat BERT at the same parameter scale. Research had to continue, so the game was changed: Zero-Shot became the selling point, and the results were good.

Approach overview

GPT-2's way of achieving Zero-Shot looks fairly simple today: treat every NLP task as modeling p(output|input). Since a single model of sufficient capacity must serve many tasks, the model also has to be told which task to perform, so the objective becomes p(output|input, task).

With one unified large model (the network structure is the same as GPT-1), usage is very natural: both the task description and the input are given to GPT as natural language, and the model keeps predicting the most likely next token step by step until it finishes. For example, for translation the model is given "translate Chinese to English, the original text is '我爱深度学习'" and outputs "I love deep learning."; for reading comprehension the input is "answer the question, content 'xxx', question 'xxx?'" and the model outputs the answer.

That's right, this is the early Prompting approach (though not the earliest). The reason it works is that the training corpus contains a large amount of naturally prompt-structured text, so the model learns what should be generated after similar prompts (a toy prompt-building sketch follows the figure below).

[Figure: examples of prompt-structured text in the training data]
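A minimal illustration (my own, not from the paper) of casting tasks as p(output | input, task): the task description and the input are simply concatenated into one natural-language prompt, and the language model continues it token by token.

```python
def build_prompt(task: str, **fields: str) -> str:
    """Build a task-conditioned prompt; the task is stated in natural language."""
    if task == "translation":
        return (f"translate {fields['src_lang']} to {fields['tgt_lang']}: "
                f"{fields['text']} =")
    if task == "reading_comprehension":
        return f"{fields['document']}\nQuestion: {fields['question']}\nAnswer:"
    raise ValueError(f"unknown task: {task}")

prompt = build_prompt("translation", src_lang="French", tgt_lang="English",
                      text="je ne parle pas français")
# The LM is then asked to continue the prompt, e.g. producing "I do not speak French."
print(prompt)
```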

Some details

  • Training data: to support multi-task Zero-Shot, the model needs to see as much, and as rich, data as possible, and that is exactly the goal of data collection. The main points:

    • The open-source Common Crawl web data covers the whole web and is large and rich enough, but it has quality problems, so it is not used directly;

    • Instead a WebText dataset of web pages is built, with the focus on a clean, high-quality corpus: ideally only pages curated by humans would be kept, but full human filtering is far too expensive. The compromise is to take outbound links shared by users on Reddit (a social sharing platform, similar to domestic forums) and require the post to have at least 3 karma (roughly, upvotes). The assumption is that shared content tends to be interesting, useful, or at least entertaining (a toy filtering sketch follows this list).

    • WebText ends up with 45 million links. Post-processing: 1. extract the page content; 2. keep only links created up to December 2017; 3. deduplicate; 4. heuristic cleaning. This yields 8M+ documents, about 40GB. Wikipedia documents are excluded to avoid overlap with downstream test data (many test sets include Wikipedia; whether other data overlaps is another question).

  • Model: the GPT structure is kept, with adjustments to the input encoding, weight initialization, vocabulary size, input length, batch size, etc., mostly straightforward upgrades.
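A toy illustration (not the actual WebText pipeline) of the heuristic described above: keep only outbound links from Reddit posts with at least 3 karma, and deduplicate by URL.

```python
from dataclasses import dataclass

@dataclass
class RedditPost:
    karma: int
    outbound_url: str

def collect_webtext_links(posts, min_karma=3):
    """Keep de-duplicated outbound links from posts with karma >= min_karma."""
    seen, kept = set(), []
    for post in posts:
        if post.karma >= min_karma and post.outbound_url not in seen:
            seen.add(post.outbound_url)
            kept.append(post.outbound_url)
    return kept

posts = [RedditPost(5, "https://example.com/a"), RedditPost(1, "https://example.com/b")]
print(collect_webtext_links(posts))  # ['https://example.com/a']
```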

Conclusions and discussion

  • Main conclusions:

    • The paper tried four model sizes: 117M corresponds to BERT-base (and GPT-1), 345M corresponds to BERT-large, and the largest model has 1542M parameters (about 1.5 billion).

[Table: the four GPT-2 model configurations]

    • Model selection uses 5% of WebText as validation data. The experiments show that models of every size are still underfitting: as training time increases, validation performance keeps improving.

    • And of course, the best Zero-Shot results of the time were achieved on most task sets:

[Table: GPT-2 Zero-Shot results]

  • Secondary conclusion: on most tasks, plotting the core metric against model capacity shows that performance keeps getting stronger as capacity grows. 1.5 billion parameters does not look like a bottleneck; in other words, if capacity keeps increasing, how far the performance can go is left to the imagination. (Hence the later "miracle" of GPT-3.)

    • Children's Book Test:

[Figure: Children's Book Test accuracy vs. model size]

    • Winograd Schema Challenge:

[Figure: Winograd Schema Challenge accuracy vs. model size]

    • Other Zero-Shot tasks:

[Figure: other Zero-Shot task results vs. model size]

    • Language-model performance on the pre-training and validation sets (lower perplexity is better):

[Figure: perplexity on the pre-training and validation sets]

GPT-3

Link to the paper: "Language Models are Few-Shot Learners"

Motivation

After BERT, the pre-training + fine-tuning framework achieved amazing results (ones the GPT line could not match in the short term), but fine-tuning has many limitations:

  • Fine-tuning needs in-domain data, labeling is expensive, and some special tasks make it even harder (e.g. error correction, writing, question answering).

  • When fine-tuning works well on a small amount of data, it is likely just overfitting. Many tasks are claimed to exceed human performance, but the real capability is exaggerated (the model does not solve the task through knowledge and reasoning, so it is not truly intelligent).

  • Compare with how humans learn: once a person has enough knowledge (pre-training), they do not need to see a large amount of supervised data to do a task (the analogue of fine-tuning); a handful of examples is enough.

The paper argues that even though the no-fine-tuning approach cannot really beat fine-tuning yet, it is still worth pursuing. The recipe continues GPT-2's closing conclusion: a larger model, more data, and prompts that carry more information (In-Context Learning).

Approach overview

Mainly in comparison with GPT-2:

  • The GPT-2 model and training method are kept, but the model is scaled up to 175B parameters (175 billion vs. 1.5 billion); this 175B model is called GPT-3;

  • Unlike BERT/GPT-1, which validate via downstream fine-tuning, and GPT-2, which validates only Zero-Shot, GPT-3 mainly validates its In-Context Learning ability (no fine-tuning, no gradient updates; specific tasks are completed purely from the prompt and a few examples given as input);

  • It is not that GPT-3 cannot be fine-tuned; that is left for later work (this later work is Codex, InstructGPT and ChatGPT).

Some details

  • Training method: as mentioned, nothing new relative to GPT-2: a bigger model, more and richer data, longer training. Evaluation covers not only Zero-Shot but also One-Shot and Few-Shot settings (the x-Shot settings here do not fine-tune the model; this is so-called In-Context Learning, and nothing special is done in the pre-training stage), as shown in the figure. A few-shot prompt sketch appears after this list.

[Figure: Zero-Shot, One-Shot and Few-Shot (In-Context Learning) compared with fine-tuning]

  • Model: follows the GPT-2 structure, with some optimizations to initialization, normalization and tokenization, and "copies" some ideas that worked elsewhere, such as the Sparse Transformer (in short, adding small tricks already verified in contemporaneous work, or verified by the authors themselves). To study the effect of capacity, models of several sizes were trained; the largest, 175B, is called GPT-3. Quite some work on model parallelism and efficiency was needed to train models this large (this part actually matters, but the paper does not expand on it). The figures compare the model sizes and training compute of contemporaneous work:

[Figure: model sizes and training compute compared with contemporaneous models]

  • Training data preparation:

    • The paper finds that once the model is large enough, the negative impact of some dirty data is not that severe. So unlike GPT-2, GPT-3 does use the Common Crawl dataset, after some cleaning: 1. keep content similar to known high-quality datasets (using similarity/discriminative methods); 2. deduplicate;

    • Finally, the cleaned Common Crawl data is combined with the existing high-quality datasets to form the training set, which is sampled with different weights:

[Table: GPT-3 training data mixture and sampling weights]

  • Model training process:

    • Large models can use a larger batch size but need a smaller learning rate; the batch size is adjusted dynamically according to the gradient noise scale; to avoid OOM on such large models, full model parallelism is used, with Microsoft providing hardware and software support.
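A minimal illustration (my own, not OpenAI's) of few-shot In-Context Learning: the "training" examples are placed directly in the prompt, no gradients are computed, and the model weights are never updated.

```python
def few_shot_prompt(task_description, examples, query):
    """Concatenate a task description, demonstrations and the query into one prompt."""
    lines = [task_description]
    for x, y in examples:          # demonstrations shown to the model in context
        lines.append(f"{x} -> {y}")
    lines.append(f"{query} ->")    # the model is asked to continue from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```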

Conclusions and discussion

  • Main conclusion: the overall results are good. Across the datasets, GPT-3 performs well on NLU-style tasks (on some datasets it even surpasses supervised fine-tuning); on QA, translation, reasoning and similar tasks there is still an obvious gap to supervised fine-tuned models; for generation tasks, the output is basically hard for humans to tell apart from human-written text. A few representative tasks:

    • SuperGLUE: mainly understanding tasks

[Figure: GPT-3 results on SuperGLUE]

    • Winogrande: mainly reasoning tasks

[Figure: GPT-3 results on Winogrande]

    • TriviaQA: mainly reading-comprehension-style QA

[Figure: GPT-3 results on TriviaQA]

  • Secondary conclusions:

    • The main result curves clearly show that Few-Shot beats Zero-Shot, and that bigger models do better (no surprise). 175B does not look like the limit; even larger models would likely keep improving.

    • Another plot shows that the larger the model, the more room there is for the loss to drop; the largest model in this version still has not converged (the yellow curve).

[Figure: validation loss vs. training compute for different model sizes]

    • Because model-generated content is hard to distinguish from human text, the paper also devotes attention to bias, ethics and misuse of the model, and the decision was made not to open-source it (OpenAI firmly on the road to CloseAI)!

Codex

Link to the paper: "Evaluating Large Language Models Trained on Code"

Motivation

As mentioned in the GPT-3 paper, GPT can be fine-tuned, but that was left for later; Codex is one of those fine-tuning efforts. The task is to explore fine-tuning GPT for code generation, so it can be regarded as an application-direction paper.

Approach overview

Specifically, Codex generates code from code comments (docstrings). The training data comes from GitHub, mainly Python. To evaluate the model, Codex builds a new dataset (164 hand-written programming problems, think of them as classic LeetCode or interview questions) and verifies the correctness of generated code through unit tests.

In the end Codex reaches a 28% test pass rate (GPT-3 solves 0%); if repeated sampling is allowed, generating 100 candidates per problem and being able to pick from them reaches a 70% pass rate (think about how many you could pass); with some rerank strategies on top, the pass rate approaches 80%.

[Figure: Codex pass rates compared with GPT-3]

Some details

  • Evaluation set: since no benchmark for evaluating generated code existed, the paper designed HumanEval, with pass@k as the metric: a problem counts as solved if any of k sampled completions passes the unit tests, and the pass rate is computed over problems (a sketch of the unbiased estimator follows this list). Because the safety of generated code cannot be guaranteed, it is run in a sandbox environment (crashes are fine). A HumanEval sample, including the code comment and the reference answer, looks like this:

[Figure: a HumanEval example (prompt and reference solution)]

  • Training data: collected as of May 2020 from 5.4 million GitHub repositories, including 179GB of Python files each under 1MB. After filtering out likely auto-generated code, files with an average line length above 100 or a maximum line length above 1000, files with a high proportion of digits, etc., the final dataset is 159GB.

  • Model: since this is a generation task, starting from a pre-trained GPT-series model should help; a 13B GPT model was chosen as the main model for fine-tuning. Interestingly, fine-tuning from pre-trained GPT is not better than training from scratch on the code data (presumably because the code corpus is large enough), but fine-tuning converges faster. Model details:

    • The parameter configuration is similar to GPT-3; a special tokenizer is built for the characteristics of code, reducing the token count by 30%; special stop sequences ('\nclass', '\ndef', etc.) are used to keep the sampled code complete.
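The pass@k metric is usually computed with the unbiased estimator given in the Codex paper: for a problem where n samples were drawn and c of them pass the unit tests, estimate the probability that at least one of k samples would pass. A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).
    If fewer than k samples fail (n - c < k), every k-subset contains a pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples drawn for a problem, 30 of them pass the tests
print(pass_at_k(n=200, c=30, k=1))    # ~0.15
print(pass_at_k(n=200, c=30, k=100))  # very close to 1.0
```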

Conclusions and discussion

  • Main conclusions:

    • Different sampling parameters, and the number of samples drawn, significantly affect the pass rate of the generated code.

[Figure: pass rate as a function of sampling settings and number of samples]

    • If only one answer may be submitted, ranking candidates by model-derived scores such as the mean token log-probability beats random selection; selecting with the unit tests themselves (prior knowledge of the answer) gives the theoretical best (Oracle).

[Figure: comparison of candidate-ranking strategies]

  • Secondary conclusions: the results are decent, and the trend suggests larger models would do even better. The end of the paper discusses the worry about machines that can write code (self-improvement being the scariest scenario); in addition, code models show the familiar problems of discrimination and moral bias (which presumably stems from the humans behind the data: there is code out there with variables named Fxxk, and wherever there are people there are biases).

InstructGPT

Link to the paper: "Training language models to follow instructions with human feedback"

Motivation

Another fine-tuning exploration on top of GPT: use user instructions and preferred answers to fine-tune the GPT model, so that what it generates better matches the user's intent and is more truthful and useful (the Alignment process). The starting point is a classic application scenario: a user states an intent as an instruction and expects the model to generate helpful and harmless content; a large language model trained only on massive web data cannot directly meet this need, hence the fine-tuning.

Approach overview

The instruction fine-tuning process has three steps (RLHF, Reinforcement Learning from Human Feedback), as shown in the figure below:

1. Prepare a batch of prompts (handwritten by annotators and collected from OpenAI API requests); for these prompts, annotators write the expected answers by hand, and this prompt+answer data is used to fine-tune the GPT-3 generation model, here called the supervised policy;

2. Use the fine-tuned model to generate answers for more prompts (several sampled answers per prompt). The annotators now only need to rank the generated answers. This ranking data is used to train a reward model (RM): given a prompt and an answer, the model outputs a scalar score (a GPT model is used here as well);

3. Sample yet more prompts and continue training the generation model with reinforcement learning, where the reward is the score given by the step-2 model.

Steps 2 and 3 are iterated continuously: the better generation model (policy) trained in step 3 can be used to collect more ranking data, which trains a new RM model (step 2), which in turn trains a new generation model (step 3). Most of the ranking data comes from the step-1 policy, and some from the iterations of steps 2 and 3.

Incidentally, this is not the first work to use this recipe. An earlier OpenAI paper, "Learning to summarize from human feedback", applies a similar three-step method to summarization; it shows the continuity of the line of work, which did not appear overnight, and inspiration does not come out of nowhere.

[Figure: the three steps of InstructGPT's RLHF pipeline]

Some details

  • Data collection process:

    • Cold-start stage: an early version of InstructGPT is trained on some manually written prompt+answer data. Enrichment stage: a trial online service is deployed to collect more and richer real user prompts. This work does not use data from the production service; only the trial version's data is used for labeling and training, and users are informed in advance;

    • Collected prompts are deduplicated by longest common prefix;

    • Each user contributes at most 200 prompts, to keep the model from catering to individual users' preferences; the training, validation and test sets contain disjoint sets of users (emphasizing generalization across users);

    • Personally identifiable information is filtered out of the prompts, also to keep the model from learning user-specific traits;

    • The training data for the early InstructGPT version consists of prompts and answers handwritten by the outsourced annotators. The cold-start prompts cover three categories: 1. arbitrary common task questions, for task diversity; 2. multiple queries and answers written for the same type of prompt; 3. imitations of real users' prompt requests;

    • After the above, three datasets are obtained: 1. the SFT dataset, 13k training prompts (from the API and annotator handwriting), used to train the SFT model; 2. the RM dataset, 33k training prompts (from the API and annotator handwriting), with human rankings of the generation model's answers, used to train the RM model; 3. the PPO dataset, 31k prompts (API only), requiring no human labels, used for RLHF fine-tuning.

  • Prompt characteristics:

    • The distribution and examples of real users' prompt instruction types are shown in the figure; 96% are in English, yet the model turns out to generalize to other languages.

[Table: distribution of user prompt instruction types]

  • Annotation details (also critical, and worth borrowing):

    • An outsourced annotation team of 40 contractors, intended to increase annotation diversity and keep the model from becoming sensitive to any single annotator's style;

    • Annotators were screened (exams, personality tests), aiming to select people who are sensitive to different groups (religion, sexual orientation, race, etc.) and can identify potentially harmful content;

    • Annotators must be able to judge user intent accurately and skip ambiguous cases; they must consider implicit intent and be able to spot potentially harmful or leading profanity, bias and false information;

    • There is a slight conflict in intent alignment between training and evaluation: training emphasizes the helpfulness of the generated content, while evaluation focuses more on truthfulness and harmlessness;

    • Throughout labeling, the algorithm developers and the annotators stay in close communication; to make this possible an onboarding process was set up for the outsourced annotators (which may mean paying their social insurance =_=);

    • To test how well the model generalizes across annotators, a set of held-out annotators is reserved (true "tool people"; rigorous). Data they produce is not used for training, and these people did not go through the screening tests;

    • Although the labeling task is hard, inter-annotator agreement is decent: 72.6 ± 1.5% among the training annotators and 77.3 ± 1.3% among the held-out annotators.

  • Model implementation: matches the three training steps above (the RM loss and the RL reward are sketched in code after this detail list):

    • Supervised fine-tuning (SFT): the GPT model is fine-tuned on the annotator-written data for 16 epochs with cosine learning-rate decay. Model selection uses the RM score on the validation set (chicken and egg). Notably, the SFT model overfits the validation set after 1 epoch, yet training more epochs still helps the RM score and human preference.

    • Reward modeling (RM):

      • Model: the same GPT structure as SFT, but a separate 6B model is trained (175B is unstable and unsuitable for the RL training below). The input is the prompt and the generated answer; after pooling, a fully connected layer outputs a scalar reward score.

      • The loss function is:

loss(θ) = −1/C(K,2) · E_{(x, y_w, y_l)∼D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

      • K is the number of answers generated per prompt, and the annotators rank all K of them; K ranges from 4 to 9 (ranking 9 answers costs about the same effort as ranking 4);

      • A bias term is used to normalize the reward so that its mean is 0, which is convenient for the downstream RL (the bias can simply be the mean reward; a routine trick in RL);

    • Reinforcement learning (RL), two experimental models:

      • The 'PPO' model: directly uses the classic PPO algorithm. The objective is to maximize the reward given by the RM while penalizing the KL divergence between the policy being optimized and the SFT model used as the reference (if you are unfamiliar with RL, just take this as the objective); the reward is the RM score;

      • The 'PPO-ptx' model: PPO plus the pre-training objective (with this extra term, performance on public NLP tasks is preserved). The final objective to maximize:

objective(φ) = E_{(x,y)∼D_{π_φ^RL}} [ r_θ(x,y) − β·log( π_φ^RL(y|x) / π^SFT(y|x) ) ] + γ·E_{x∼D_pretrain} [ log π_φ^RL(x) ]
  • Evaluation method: the model is evaluated on two fronts: 1. whether people like the generated answers; 2. how well classic NLP tasks are solved.

    • Evaluation on prompts from API requests:

      • Prompts are downsampled from the real distribution to form the test set;

      • The 175B SFT GPT-3 model serves as the baseline;

      • Annotators rate how much they like/approve of each model's output on a 1-7 scale;

      • Approval is judged on being helpful, honest and harmless, with many detailed rules for each dimension.

    • On open-source NLP datasets, two categories are evaluated:

      • Datasets measuring safety, truthfulness, harm and bias;

      • Zero-shot performance on classic NLP task datasets, such as reading comprehension, question answering and summarization.
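To make the RM and RL objectives above concrete, here is a minimal PyTorch-style sketch (an illustration under simplifying assumptions, not OpenAI's implementation). The pairwise loss follows the RM formula above; the reward used during RL combines the RM score with a penalty that keeps the policy close to the SFT model, as in the PPO objective (the coefficient value is hypothetical).

```python
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """Pairwise RM loss for one prompt.

    `rewards_ranked` holds K scalar rewards ordered from the human-preferred
    answer to the least preferred one. For every (winner, loser) pair we
    maximize log(sigmoid(r_w - r_l)), averaged over the C(K, 2) pairs.
    """
    pairs = itertools.combinations(range(rewards_ranked.numel()), 2)
    losses = [-F.logsigmoid(rewards_ranked[w] - rewards_ranked[l]) for w, l in pairs]
    return torch.stack(losses).mean()

def rl_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """Reward during RL: RM score minus a KL-style penalty against the SFT model."""
    return rm_score - beta * (logprob_policy - logprob_sft)

# Toy usage with dummy numbers
print(rm_pairwise_loss(torch.tensor([1.2, 0.3, -0.5, -1.0])))
print(rl_reward(rm_score=0.8, logprob_policy=-12.0, logprob_sft=-13.5))
```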

Conclusions and discussion

  • Main conclusions:

    • On the test prompts collected through the API, RLHF clearly beats the baseline:

[Figure: human preference for InstructGPT (RLHF) vs. the SFT baseline]

    • Compared with other models, it also holds up well:

[Figure: comparison with other models]

  • Secondary conclusions (a plain-language assessment of InstructGPT's output):

    • Annotators generally find InstructGPT's output much better than GPT-3's;

    • InstructGPT's output is more truthful than GPT-3's (fewer factual errors);

    • InstructGPT is better than GPT-3 on harmlessness, but not much better on bias;

    • During RLHF fine-tuning, the "alignment tax" degrades performance on public NLP tasks, but adding the language-model pre-training objective on top of RLHF (PPO-ptx) lets the model keep both;

    • InstructGPT also generalizes well to the "held-out" annotators;

    • Performance on public NLP dataset tasks is not what InstructGPT optimizes for (nor is it for ChatGPT);

    • InstructGPT also generalizes reasonably to prompts outside the distribution of the RLHF fine-tuning data;

    • InstructGPT's output still contains some simple mistakes.

ChatGPT

No paper, official blog: "https://openai.com/blog/chatgpt"

OpenAI has not disclosed the details of ChatGPT; there are only two general method descriptions, which boil down to:

  • The method is roughly the same as InstructGPT, with slightly different data collection: ChatGPT uses data in dialogue form, i.e. multi-turn prompts with context, and the InstructGPT dataset is converted into dialogue format and mixed in as well.

  • To train the RM model, outputs generated by several models are sampled, annotators rank them by quality, and the RM is then used for the subsequent PPO fine-tuning. Again, this is an iterative process.

There are no further details, but more can be gleaned from a paper by OpenAI's relative Anthropic (whose founders also came from OpenAI). Judging from the continuity of OpenAI's work, those who left most likely continued along the same line.

Before turning to Anthropic, here is a summary table of the OpenAI line of work, worth saving. After reading the papers above you should be able to roughly understand its contents (from an external reference):

| Ability | Model name | Training method | OpenAI API |
| --- | --- | --- | --- |
| **Before GPT-3** | | | |
| Pretrain + Finetune like BERT | GPT-1 | Language Modeling + Task Finetune | - |
| Generation + Zero-shot task | GPT-2 | Language Modeling | - |
| **GPT-3 Series** | | | |
| Generation + World Knowledge + In-context Learning | GPT-3 Initial | Language Modeling | davinci |
| +Follow Human Instruction, +generalize to unseen tasks | Instruct-GPT initial | Instruction Tuning | Davinci-Instruct-Beta |
| +Code Understanding, +Code Generation | Codex initial | Training on Code | Code-Cushman-001 |
| **GPT-3.5 Series** | | | |
| ++Code Understanding, ++Code Generation, ++Complex Reasoning / Chain of Thought (why?), +long-term dependency (probably) | Current Codex (strongest model in the GPT-3.5 series) | Training on text + code, tuning on instructions | Code-Davinci-002 (free as of Dec. 2022) |
| ++Follow Human Instruction, --In-context learning, --Reasoning, ++Zero-shot generation | Instruct-GPT supervised (trades in-context learning for zero-shot generation) | Supervised instruction tuning | Text-Davinci-002 |
| +Follow human value, +More detailed generation, +in-context learning, +zero-shot generation | Instruct-GPT RLHF (more aligned than 002, less performance loss) | Instruction tuning w. RLHF | Text-Davinci-003 |
| ++Follow human value, ++More detailed generation, ++Reject questions beyond its knowledge (why?), ++Model dialog context, --In-context learning | ChatGPT (trades in-context learning for dialog history modeling) | Tuning on dialog w. RLHF | - |

Maybe, as some big names have said, ChatGPT contains no single innovation and is just a stack of strategies that together make a powerful model; others say ChatGPT is more a combination of engineering and algorithms. Either way, the recipe really works.

Anthropic's Claude

Reference paper link: "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"

Not long after ChatGPT came out, Anthropic quickly launched Claude, billed in the media as ChatGPT's strongest competitor. To follow up this quickly, the work was most likely done over the same period (or even earlier; the related paper predates ChatGPT by several months). Anthropic is a startup founded by people who left OpenAI, reportedly over disagreements with OpenAI's direction (perhaps around openness and social responsibility?).

Some early hands-on impressions: compared with ChatGPT, Claude is better at declining potentially harmful questions, slightly weaker at code generation, and about the same on ordinary prompts. In terms of effect, ChatGPT leans more toward usefulness, while Claude leans more toward being "harmless" (or, put differently, toward less potential negative impact on society), which is also reflected in the title of the reference paper.

Motivation

Introduce a preference model and RLHF (reinforcement learning from human feedback) to fine-tune a large language model (probably because of the split from OpenAI, GPT-3 is not mentioned), obtaining a helpful and harmless personal assistant (similar to ChatGPT). This alignment fine-tuning improves the pre-trained language model on almost all NLP tasks, and the model can handle specific skills such as coding, summarization and translation.

Approach overview

The idea is similar to InstructGPT's three-stage RLHF. The differences: 1. training is iterated online: the model and the RL policy are updated weekly with newly collected human feedback data, continuously refreshing both the data and the model; 2. the data is in dialogue format; 3. more attention is paid to making the model helpful and harmless.

Beyond the model and policy design, the paper focuses on the stability of RLHF, and analyzes issues such as model calibration, objective conflict, and OOD (out-of-distribution) detection.

The objective conflict is the conflict between helpfulness and harmlessness: a model that answers "I don't know" to every question is harmless but not helpful at all.

Some details

  • Dialogue preference dataset:

    • Helpfulness and harmlessness dialogue datasets were collected; the annotation was done by annotators interacting with various 52B language models on a dialogue annotation page, shown in the figure;

[Figure: the dialogue annotation interface]

    • Annotators hold open-ended conversations with the model: asking for help, giving instructions, or trying to induce the model to output harmful content (e.g. how to pull off a robbery). For the multiple answers the model produces, the annotator marks, in each turn, which is more helpful or which is more harmful;

    • Three slices of data were collected: one from the initial model (SFT), one sampled with an early preference model (RM), and one from the online reinforcement-learning-from-human-feedback model (updated weekly);

    • The three datasets are open-sourced: https://github.com/anthropics/hh-rlhf

  • Data collection and model training pipeline (the concepts involved require having read the earlier papers):

[Figure: Anthropic's data collection and RLHF training pipeline]

LLaMA and Alpaca

By this point a small problem has emerged: the models keep getting bigger while less and less is open-sourced (and frankly, even when something is, most people cannot run it). First, the GPT-3-series models are huge; both training and inference need a lot of GPUs. Second, the data GPT-3 used is not public, so even with the compute it is hard to reproduce; you would have to assemble the data yourself. After GPT-3, ChatGPT is closed even more tightly, perhaps because commercial interests now have to be considered.

Against this background, more and more work focuses on efficiency and openness of frontier models. The most influential recent examples are Meta AI's LLaMA and Stanford's LLaMA-based Alpaca: the former is the counterpart of the GPT-style large language model, the latter the counterpart of ChatGPT.

LLaMA

Paper: "LLaMA: Open and Efficient Foundation Language Models"

Code: https://github.com/facebookresearch/llama

Motivation

  • In work on large language models, the usual assumption has been that bigger models give better results. Recent work, however, shows that for a fixed compute budget the best results come not from the largest model but from a relatively smaller model trained on more data. The latter is also friendlier to inference and fine-tuning, and is therefore the better thing to pursue.

  • Accordingly, this paper trains a family of models that achieve better results at lower inference cost; one way to get there is to let the model see more tokens. These models are LLaMA.

Approach overview

LLaMA's idea is fairly simple and mostly covered in the motivation. Other features of the work, in brief:

  • Models from 7B to 65B are provided; the 13B model can beat GPT-3 (175B), and the 65B model is close to Google's PaLM (540B);

  • Training uses only open-source datasets, trillions of tokens in total;

  • SOTA on multiple tasks, and all model weights are open-sourced.

Some details

  • Training datasets (mostly English, so performance in Chinese, and after Chinese fine-tuning, is a concern):

[Table: LLaMA pre-training data mixture]

  • Model capacity overview:

[Table: LLaMA model sizes and hyper-parameters]

  • Model structure:

Like GPT, it is a Transformer decoder architecture, incorporating various small optimizations that had already been verified to work (e.g. pre-normalization, the SwiGLU activation, rotary position embeddings (RoPE), the AdamW optimizer). Training-efficiency optimizations were also made, both in the model implementation and in model parallelism. (A SwiGLU sketch follows this list.)

  • Training process: the 7B and 13B models are trained on 1T tokens; the 33B and 65B models on 1.4T tokens.

[Figure: training loss curves of the LLaMA models]
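As one concrete example of these architectural choices, here is a minimal PyTorch-style sketch of a SwiGLU feed-forward block (an illustration of the general idea, not Meta's implementation; the hidden size here is an arbitrary example, not necessarily LLaMA's exact ratio):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation:
    FFN(x) = W2( SiLU(W1 x) * (W3 x) ), biases omitted."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)                    # (batch, seq, dim)
print(SwiGLUFeedForward(512, 1376)(x).shape)   # torch.Size([2, 16, 512])
```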

Conclusions and discussion

  • LLaMA benchmarks itself against GPT and is evaluated mainly on Zero-Shot and Few-Shot tasks; since instruction fine-tuning has become a popular application, it is also evaluated after instruction fine-tuning.

    • Zero-Shot:

[Table: LLaMA Zero-Shot results]

    • Few-Shot:

[Table: LLaMA Few-Shot results]

    • Instruction fine-tuning (compared mainly with Google's Flan series):

[Table: instruction fine-tuning results compared with the Flan series]

Alpaca

Article: https://crfm.stanford.edu/2023/03/13/alpaca.html

Code: https://github.com/tatsu-lab/stanford_alpaca

Motivation

As seen above, instruction-fine-tuned models such as GPT-3.5, ChatGPT, Claude and Bing Chat have proven outstanding, but they still suffer from problems such as fabricated content, bias and potential misuse. To speed up solving these problems, academia (under-funded teachers, students and companies) needs to join the research, yet the GPT-3.5 models are both huge and closed-source.

The release of LLaMA a short while earlier offered hope. Alpaca is obtained by instruction fine-tuning on top of LLaMA; its effect is roughly comparable to GPT-3.5, and it is simple and cheap to reproduce.

Approach overview

  • Instruction fine-tuning on a budget needs two prerequisites: 1. a pre-trained language model with few parameters but good quality; 2. high-quality instruction training data.

  • Both now seem easy to satisfy: 1. the LLaMA 7B model is good enough; 2. the training data can be generated automatically by an existing strong language model (prepare prompts and call the OpenAI API for the GPT-3.5 series).

Some details

  • Training method: instruction fine-tuning on the LLaMA 7B model. The training data consists of 52,000 examples generated by calling the OpenAI GPT-3.5 API (costing about $500); fine-tuning took 3 hours on 8 80GB A100 GPUs (using cloud compute, about $100). The process is shown in the figure; a toy data-generation sketch follows it.

[Figure: the Alpaca pipeline (self-instruct-style data generation plus fine-tuning)]
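A toy sketch of this kind of instruction-data generation (my illustration, not the Stanford code; it assumes the pre-v1 `openai` Python SDK and the text-davinci-003 model, and the prompt wording is made up):

```python
import openai

openai.api_key = "sk-..."  # placeholder

def generate_instruction_pair(seed_instruction: str) -> str:
    """Ask a strong model to produce a new instruction + answer, seeded by an example."""
    prompt = (
        "Come up with one new task instruction and a high-quality answer to it.\n"
        f"Example instruction: {seed_instruction}\n"
        "New instruction and answer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=1.0,
    )
    return resp["choices"][0]["text"].strip()

# Repeating this (plus deduplication and filtering) yields an instruction
# dataset that is then used for supervised fine-tuning of the 7B base model.
print(generate_instruction_pair("Give me three tips for staying healthy."))
```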

Conclusions and discussion

Currently released: an interactive demo, the training dataset, the data-generation process, and the training code; the pre-trained weights will be released later (some external factors may be under consideration):

  • Demo: an interactive demo for everyone to try Alpaca.

  • Data: the 52K demonstrations used to fine-tune Alpaca.

  • Data generation process: the code for generating the data.

  • Training code: fine-tuning code based on the Hugging Face API.

Possible future directions (not counting improvements to reasoning ability, which are perhaps best left to the wealthy):

  • Model evaluation: evaluate the model more systematically and rigorously, starting from HELM (Holistic Evaluation of Language Models), to assess its generation and instruction-following abilities;

  • Model safety: assess the model's risks more comprehensively;

  • Understanding the model (interpretability): what exactly did the model learn? What does the choice of base model contribute? What does increasing the parameter count bring? Which instruction data matters most? Are there other ways to collect the data?

GLM and ChatGLM

LLaMA is good, but it is trained mostly on English data and does not perform well in Chinese; correspondingly, instruction fine-tuning on top of it is likely to have a fairly low ceiling in Chinese scenarios. Chinese therefore needs its own line of research. The most influential open-source versions at the moment are Tsinghua's GLM and ChatGLM.

There is plenty of introductory material on GLM and ChatGLM; the excerpts below give only a brief overview.

GLM

Papers:

  • "GLM: General Language Model Pretraining with Autoregressive Blank Infilling"

  • "GLM-130B: An Open Bilingual Pre-trained Model"

Approach overview

GLM-130B is Tsinghua's attempt at a large language model in the post-GPT-3 era. Unlike the BERT, GPT-3 and T5 architectures, GLM-130B is an autoregressive pre-training model with multiple objective functions, built around autoregressive blank infilling (a toy example of constructing such a training instance follows the figure).

[Figure: the GLM autoregressive blank infilling objective]
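To give a feel for autoregressive blank infilling, here is a toy sketch of how one training instance can be constructed (my illustration, not the GLM code; the special token names are placeholders): spans of the input are replaced by [MASK], and the model then generates the masked spans autoregressively.

```python
import random

def make_blank_infilling_example(tokens, span_len=2):
    """Corrupt a random span with [MASK]; the target is to regenerate that span."""
    start = random.randrange(0, len(tokens) - span_len)
    span = tokens[start:start + span_len]
    corrupted = tokens[:start] + ["[MASK]"] + tokens[start + span_len:]
    return {
        "input": corrupted + ["[sop]"],   # context, then a start-of-span marker
        "target": span + ["[eop]"],       # the span, closed by an end-of-span marker
    }

example = make_blank_infilling_example("the quick brown fox jumps over".split())
print(example)
```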

Some details

Released in August 2022, GLM-130B has some distinctive advantages:

  • Bilingual: supports both Chinese and English.

  • High accuracy (English): better than GPT-3 175B (API: davinci, the base model), OPT-175B and BLOOM-176B on the public English benchmarks LAMBADA, MMLU and Big-bench-lite.

  • High accuracy (Chinese): significantly better than ERNIE TITAN 3.0 260B and YUAN 1.0-245B on 7 zero-shot CLUE datasets and 5 zero-shot FewCLUE datasets.

  • Fast inference: the first 100-billion-scale model with INT4 quantization, supporting fast, largely lossless inference on a server with 4x RTX 3090 or 8x RTX 2080 Ti.

  • Reproducibility: all results (over 30 tasks) can be reproduced with the open-source code and model weights.

  • Cross-platform: supports training and inference on domestic Haiguang DCU, Huawei Ascend 910 and Sunway processors, as well as NVIDIA chips.

ChatGLM

Article: https://chatglm.cn/blog

Code: https://github.com/THUDM/ChatGLM-6B

Approach overview

ChatGLM borrows the design ideas of ChatGPT: code pre-training is injected into the 100-billion-parameter base model GLM-130B, and human intent alignment is achieved through Supervised Fine-Tuning and related techniques.

To better advance large-model technology together with the community, Tsinghua also open-sourced the ChatGLM-6B model. ChatGLM-6B is a Chinese-English bilingual language model with 6.2 billion parameters. Using the same techniques as ChatGLM (http://chatglm.cn), ChatGLM-6B supports Chinese question answering and dialogue, and can run inference on a single 2080Ti.

Some details

ChatGLM-6B has the following features:

  • Sufficient Chinese-English bilingual pre-training: ChatGLM-6B is trained on 1T tokens with a 1:1 ratio of Chinese to English material, giving it bilingual ability.

  • Optimized model architecture and size: drawing on the GLM-130B training experience, the 2D RoPE position-encoding implementation was revised and a conventional FFN structure is used. The 6B (6.2 billion) parameter size also makes it feasible for researchers and individual developers to fine-tune and deploy ChatGLM-6B themselves.

  • Lower deployment threshold: at FP16 half precision, ChatGLM-6B needs at least 13GB of GPU memory for inference; combined with model quantization, this drops to 10GB (INT8) or 6GB (INT4), so ChatGLM-6B can be deployed on consumer graphics cards (a rough back-of-the-envelope check follows this list).

  • Longer sequence length: compared with GLM-10B (sequence length 1024), ChatGLM-6B has a sequence length of 2048, supporting longer conversations and applications.

  • Human intent alignment training: Supervised Fine-Tuning, Feedback Bootstrapping, Reinforcement Learning from Human Feedback and other methods are used so the model can initially understand human instruction intent. Output is in Markdown format for easy display.
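A rough back-of-the-envelope check of the memory numbers above (my own estimate; the gap to the quoted figures is activations, KV cache and other runtime overhead):

```python
# Approximate weight memory of ChatGLM-6B at different precisions.
params = 6.2e9                      # 6.2 billion parameters
for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.1f} GB for the weights alone")
# FP16 ~11.5 GB, INT8 ~5.8 GB, INT4 ~2.9 GB; the quoted 13 / 10 / 6 GB
# figures additionally cover activations, cache and runtime overhead.
```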

So, under the right conditions, ChatGLM-6B has decent dialogue and question-answering ability. It also has quite a few known limitations and shortcomings:

  • Small model capacity: the small 6B capacity limits the model's memory and language ability. ChatGLM-6B may generate incorrect information on knowledge-heavy tasks, and it is not good at logic-heavy problems (such as mathematics or programming).

  • May generate harmful or biased content: ChatGLM-6B is only a language model preliminarily aligned with human intent, and may produce harmful or biased content.

  • Weak multi-turn dialogue ability: ChatGLM-6B's context understanding is limited; with long answers or long multi-turn conversations, it may lose context or misunderstand.

  • Insufficient English ability: most of the training instructions are in Chinese and only a small fraction are in English, so replies to English instructions may be worse than replies to Chinese ones, or even contradict them.

  • Easily misled: ChatGLM-6B's "self-perception" can be problematic, and it is easily misled into making false statements; for example, if the current version is led astray, its self-description drifts. Even though the model has gone through roughly 1 trillion tokens of bilingual pre-training plus instruction fine-tuning and RLHF, its small capacity means it may still produce misleading content under certain instructions.

Summary

That is already a lot of ground, and writing it up has been exhausting. The Google series of works will have to wait for a separate article: much like OpenAI, Google has produced models benchmarking GPT-3 and InstructGPT, as well as the Encoder-Decoder large language models of the T5 line, and it is by no means simply following along.

Meanwhile, in March and April a wave of open-source work blossomed, with plenty of exploration of ChatGPT-style applications, including openly released training data, models and training methods. On the training-efficiency side, combinations like ChatGLM + LoRA and LLaMA + LoRA have appeared to further reduce training costs.

That content will be covered and updated in a later summary; in the meantime, we look forward to more excellent work appearing. Corrections and discussion of any inaccuracies in this article are welcome.

References:

This article draws on many papers and blogs, as well as the paper walkthroughs in Li Mu's "Learning AI with Li Mu" series. Reference links for some of the content and figures:

1. https://crfm.stanford.edu/2023/03/13/alpaca.html

2. https://chatglm.cn/blog

3. https://space.bilibili.com/1567748478/channel/collectiondetail?sid=32744

4. A Survey of Large Language Models

