Want to become a master of NLP? Start with the 5 natural language models behind ChatGPT (LM, Transformer, GPT, RLHF, LLM): even complete beginners can follow along


Foreword

  If you want to stand out in Natural Language Processing (NLP), you cannot afford to miss the 5 natural language models behind ChatGPT: LM, Transformer, GPT, RLHF and LLM. These are the most important foundations of the field, covering key topics such as language models, pre-trained models, and generative models. Even if you are an NLP novice, you can easily grasp their principles and characteristics. So, if you want to become an expert in the NLP industry, start with the 5 natural language models behind ChatGPT!

ChatGPT Basics: A Little Why

  At the end of 2022, ChatGPT suddenly took off in the AI community, or more precisely, first in the natural language processing community. Looking back, I thought it would just be a brief fad within the circle, but I did not expect it to become the lifeline of AI. What AI engineers, and especially NLP engineers, should make of it is of course hard to say. After the excitement at home and abroad, everyone rushed to follow up; the result was that efforts to catch up fell short, and even the mighty Google and Facebook stumbled. But at least some things were produced, everyone now has a goal, and the field has new vitality. I hope OpenAI keeps pushing forward, and I am willing to wait a little longer.

  Whether it is ChatGPT or its later imitators, they are all language models, or more precisely, large language models. When using them, whether through an API or an open-source project, there are always some parameters that may need to be adjusted. For most insiders this is no problem, but for laypeople it can seem a bit mysterious. For that reason, this article briefly introduces the basic principles behind the technologies related to ChatGPT. It is written from a layperson's perspective, trying to keep the content as accessible as possible. I cannot go into every detail, but knowing the principles is enough to use these models well.

This article is divided into four main parts, with a brief outlook on LLMs at the end:

  • LM: the cornerstone of ChatGPT and the most basic concept. It cannot be avoided or skipped; there is simply no way around it.
  • Transformer: also a cornerstone of ChatGPT, or to be precise, part of it (the Decoder) is the cornerstone.
  • GPT: the model itself, from GPT-1 to today's GPT-4. According to OpenAI, it is still the same model; it has just grown up and gotten fatter, yet somehow looks better. Almost nobody anticipated this, and now it has climbed so high that the rest of us can hardly reach it.
  • RLHF: ChatGPT's secret weapon. With this sharp blade, ChatGPT became ChatGPT; without it, it would only be GPT-3.

1. LM

  LM, Language Model: in simple terms, a model built from natural language. Natural language is what people say every day, or the text they write down; as long as the text is produced by humans, it counts as language, including the text you are reading right now. A model is something that, given a specific input, produces a corresponding output through some computation. You can think of it as a human brain: the input is the words heard by your ears or seen by your eyes, and the output is the words spoken by your mouth or written by your hand. To sum up, a language model is built from natural language text and, given input text, outputs corresponding text.

  How do we build one? There are many ways. For example, I could write a template: "XX likes YY". If XX = me and YY = you, we get "I like you", and vice versa, "you like me". What we care about here, though, is the probabilistic language model. Its core is probability, more precisely, the probability of the next word. Such a language model works by predicting the next word from the words already given. A simple example: if the model has only ever been told "I like you", then when you enter "I", it knows the next word must be "like". Why? Because those are the only words in its head; you never told it anything else.

  OK, next let's upgrade. Suppose you have given the model many, many sentences, so many that it has seen everything that can be found on the Internet. Now if you enter "I" again, I bet it probably will not say "like". Why? Simple: having seen so much of the world, it is no longer the case that "you" are the only thing in its eyes. But since we always pick the most probable word, there is a good chance it will output the same words every time. That's right: if at every step you simply choose the single most probable next word, you always get the same text. This method is called Greedy Search; it is very greedy and very short-sighted. Therefore, language models use a different strategy here: at each step, keep several candidate words rather than only the top one. Then, as you continue, you may find that the word with the highest probability at this step leads to a path (the product of the per-step probabilities) that is actually worse than the path through a word with a lower probability. For an example, look at the figure below:


(Figure 1: How to predict the next word)

  Look at the first step. If we kept only the word with the highest probability, the sentence would start with "I miss/want" (0.4). But don't rush: let's give "like" (0.3) a chance too and keep both. Going one step further, we again keep the top two continuations for each, which finally gives a few sentences (with their probabilities attached):

  • "I like you" probability: 0.3×0.8=0.24
  • "I like to eat" probability: 0.3×0.1=0.03
  • "I miss you" probability: 0.4×0.5=0.2
  • "I want to go" probability: 0.4×0.3=0.12

  One more step makes a big difference! Look at which sentence now has the highest probability: "I like you" is back. The method above is called Beam Search. Simply put, at each step you keep several candidate words and compare the probabilities of the complete sentences (up to a period, exclamation mark, or other stop symbol). In our example just now, num_beams=2 (we only kept 2); the more beams you keep, the less likely you are to always generate the same fixed text. A minimal sketch of both strategies follows below.
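
  To make the two decoding strategies concrete, here is a minimal, self-contained sketch. The tiny probability table and the words in it are made up to mirror the Figure 1 example; a real model would produce such probabilities itself.

import math

# Toy "language model": given the words so far, return candidate next words
# with probabilities (made-up numbers following the Figure 1 example).
def next_word_probs(prefix):
    table = {
        ("I",): [("like", 0.3), ("miss", 0.4), ("am", 0.3)],
        ("I", "like"): [("you", 0.8), ("eating", 0.1), ("it", 0.1)],
        ("I", "miss"): [("you", 0.5), ("to-go", 0.3), ("it", 0.2)],
    }
    return table.get(tuple(prefix), [])

def greedy_search(prefix, steps=2):
    for _ in range(steps):
        candidates = next_word_probs(prefix)
        if not candidates:
            break
        prefix = prefix + [max(candidates, key=lambda x: x[1])[0]]
    return prefix

def beam_search(prefix, steps=2, num_beams=2):
    beams = [(prefix, 0.0)]                      # (words, log-probability)
    for _ in range(steps):
        expanded = []
        for words, logp in beams:
            for word, p in next_word_probs(words):
                expanded.append((words + [word], logp + math.log(p)))
        if not expanded:
            break
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams

print(greedy_search(["I"]))                      # ['I', 'miss', 'you'] (probability 0.2)
for words, logp in beam_search(["I"]):
    print(words, round(math.exp(logp), 2))       # 'I like you' (0.24) beats 'I miss you' (0.2)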

  In fact, early language models basically stopped here; the two methods above are also called decoding strategies. At the time, research focused more on the model itself, and we went from simple models to complex models to hugely complex models. The simplest approach is to cut a sentence into individual words and then count probabilities. This type of model is called an N-gram language model, the simplest kind of language model, where N is the length of the context used each time. To give an example again, take the sentence: "I like to miss you gently with the moon under the starry sky late at night." Commonly N=2 or 3; N=2 is called a Bi-Gram and N=3 a Tri-Gram (a minimal counting sketch follows the examples):

  • Bi-Gram: I/like like/to to/miss miss/you you/gently … under/the the/starry starry/sky …
  • Tri-Gram: I/like/to like/to/miss to/miss/you miss/you/gently … under/the/starry the/starry/sky …
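
  The following is a minimal sketch of a count-based Bi-Gram model: count adjacent Token pairs in a toy corpus and turn the counts into next-Token probabilities. The tiny corpus is made up for illustration.

from collections import defaultdict

# Toy corpus; in practice this would be a huge amount of tokenized text.
corpus = [
    ["I", "like", "you"],
    ["I", "like", "to", "eat"],
    ["I", "miss", "you"],
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        counts[prev][curr] += 1             # count how often curr follows prev

def next_token_probs(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("I"))                # roughly {'like': 0.67, 'miss': 0.33}
print(next_token_probs("like"))             # {'you': 0.5, 'to': 0.5}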

  In the former, the next word depends on the previous one word; in the latter, on the previous two words. That is the only difference. A small piece of background needs to be explained here: in practice, we usually do not speak of "words" but of "Tokens". You can think of a Token as a small piece of text; it might be a single character, one word, or a few characters joined together, depending on how you tokenize. In other words, given a sentence, there are many ways to tokenize it, splitting into characters or into words. English nowadays is usually split into subwords; for example, the word Elvégezhetitek becomes:

['El', '##vé', '##ge', '##zhet', '##ite', '##k']

  Chinese is now basically handled as characters + words. We will not go into exactly why, but it helps to think about the alternatives, and English is the more intuitive example. If we only used the 26 letters, the vocabulary would be tiny (maybe around 100 entries including symbols), but the granularity would be far too fine: a single Token can hardly express any meaning. If we used whole words, the granularity is a bit too coarse, especially in English with its different tenses: the meanings are similar but the endings differ. So subwords were born: a word is broken into semantic units of a suitable size, each of which carries some meaning and can be combined flexibly. Chinese is a little simpler, roughly characters + words: a character that can express meaning on its own counts by itself (such as "is", "you", "love"), while some words do not quite work when split apart (such as "Great Wall"). Of course, splitting Chinese into individual characters is not impossible either; in the end what matters is the effect. A small tokenization sketch follows.
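
  If you want to see subword tokenization in action, the sketch below uses the HuggingFace transformers library with a WordPiece-style multilingual model. This is only an assumed setup for illustration; the exact pieces you get depend entirely on the tokenizer and its vocabulary, so the output may differ from the example above.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.tokenize("Elvégezhetitek"))   # subword pieces, e.g. ['El', '##vé', ...]
print(tok.tokenize("I like you"))       # short English words may stay whole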

  The N-gram model has a fatal flaw: its representation is discrete. To explain a little: inside a computer, such a model can only represent a word with 0s and 1s. Assuming a vocabulary of 50,000 entries, in the Bi-Gram just mentioned the gram "I like" would be a sparse vector consisting of 49,999 zeros and a single 1. This representation has many disadvantages, which is where Embedding comes in. Embedding is a dense representation: a Token (we will just say "token" from now on) is represented by a list of decimals (in principle any number of them; this count is called the Embedding dimension and is determined by the model and its settings). Generally, the more dimensions, the bigger the model and the stronger its expressive power. If you insist on giving a tiny model a huge dimension, I can only say the results may disappoint you: even a Porsche 911 may not outrun a tractor on a country dirt road. A small sketch of both representations follows.
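
  Here is a minimal sketch contrasting the two representations: the same Token as a sparse one-hot vector versus a dense Embedding. The vocabulary size and the dimension d are made-up numbers.

import torch
import torch.nn as nn

vocab_size, d = 50000, 8
token_id = 42                               # some Token's position in the vocabulary

one_hot = torch.zeros(vocab_size)           # 49,999 zeros and a single 1
one_hot[token_id] = 1.0

embedding = nn.Embedding(vocab_size, d)     # a learnable 50,000 × d table
dense = embedding(torch.tensor(token_id))   # d decimals for this Token
print(one_hot.shape, dense.shape)           # torch.Size([50000]) torch.Size([8])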

  Next, assume each Token is represented by a d-dimensional vector, and let's briefly look at how to predict the next Token in this setting. It is still about probability, but slightly different from before. In the discrete case, the probability was just a count divided by the total number of occurrences. With dense vectors we need a small change: given a d-dimensional vector, the model ultimately outputs a vector of length N, where N is the vocabulary size and each of the N values is a probability, one per Token in the vocabulary, all adding up to 1. Written as a simple calculation:

X = [0.001, 0.002, 0.0052, ..., 0.0341]  # d values; they need not sum to 1; shape 1×d
Y = [0.1, 0.5, ..., 0.005, 0.3]          # N values; they sum to 1; shape 1×N
X·W = Y                                  # so W can naturally be a d×N matrix

The W above is a parameter of the model; in fact, X can also be regarded as a parameter that is learned automatically. Because we know the sizes of the input and the output, we can do essentially arbitrary computation in between. In short, it is all kinds of tensor (multi-dimensional array) operations, as long as the final shape stays the same. The various computations in the middle are what distinguish different models.

  In the early days of deep learning, the most famous language model was the RNN, Recurrent Neural Network. An RNN differs from other neural networks in that it has recurrent connections between its nodes, which allow it to remember previous information and apply it to the current input. This memory makes RNNs particularly useful for sequential data, such as forecasting time series or natural language processing. In layman's terms, an RNN is like a person with memory, who can respond to the current situation based on previous experience and knowledge and predict how things will develop. As shown below:

(Figure 2: RNN, from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

  The right side is the unrolled version of the left side: A is the parameters, X is the input, and h is the output. Since natural language arrives Token by Token, it forms a sequence. How are the parameters learned? Let me explain the learning process a little; please look at the following picture:


(Figure 3: Language model input and output)

  The first row is X and the second row is Y. SOS means Start of Sentence, and EOS (End of Sentence) needs no further explanation. Note that the h above is not the output probability but the hidden state. If you need probabilities, you can apply a tensor operation to h and normalize over the whole vocabulary. A simple code demonstration:

import torch
import torch.nn as nn

rnn = nn.RNN(32, 64)           # input dimension 32, hidden dimension 64
input = torch.randn(4, 32)     # 4 Tokens, each a 32-dim vector
h0 = torch.randn(1, 64)        # initial hidden state
output, hn = rnn(input, h0)
output.shape, hn.shape
# (torch.Size([4, 64]), torch.Size([1, 64]))

  The nn.RNN above is the RNN model. The input is a 4×32 tensor, in other words 4 Tokens with dimension d=32, and h0 is the initial hidden state. output contains four 64-dimensional vectors, one for each Token, and hn is the output of the last Token, which can also be regarded as a representation of the whole sentence. If you want probabilities over words, you first expand to the vocabulary size and then normalize:

wo = torch.randn(64, 1000)         # assume vocabulary size N=1000
logits = output @ wo               # 4×1000
probs = nn.Softmax(dim=1)(logits)  # 4×1000, each row sums to 1

  Each row of probs here is a probability distribution over the vocabulary, summing to 1, and gives, for this Token, the probability of each Token in the vocabulary being the next one.

  Because we know what the next Token actually is (the second row Y in the figure above), if the Token with the highest predicted probability happens to be exactly that Token, the prediction is correct and the parameters need no adjustment; otherwise, the model adjusts the parameters used so far (the rnn, h0 and wo above, plus input). You may wonder why input is also a parameter. In fact, we were being lazy above: the real parameter is a large 1000×32 matrix, and the 4 rows of input are simply the rows at the positions of those four Tokens. That 1000×32 matrix is the word-vector (word-embedding) table, one row per word; it is initialized randomly and then adjusted through training. A minimal training sketch follows.
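
  The following is a minimal training sketch (not anyone's actual code) that continues the toy RNN example: embed the input Tokens, run the RNN, project to the vocabulary, and compare the prediction with the true next Token using a cross-entropy loss whose gradients adjust all the parameters mentioned above. All sizes and Token ids are made up.

import torch
import torch.nn as nn

vocab_size, d, hidden = 1000, 32, 64
embedding = nn.Embedding(vocab_size, d)   # the 1000×32 word-vector table
rnn = nn.RNN(d, hidden)
wo = nn.Linear(hidden, vocab_size)        # maps hidden states to vocabulary logits

x = torch.tensor([2, 15, 37, 9])          # input Token ids (the X row in Figure 3)
y = torch.tensor([15, 37, 9, 3])          # target Token ids (the Y row: X shifted by one)

h, _ = rnn(embedding(x))                  # 4×64 hidden states
logits = wo(h)                            # 4×1000
loss = nn.CrossEntropyLoss()(logits, y)   # how far the predictions are from the true next Tokens
loss.backward()                           # gradients used to adjust embedding, rnn and wo
print(loss.item())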

  After training is complete, these parameters stay fixed, and the same steps as above can be used for prediction: given a Token, predict the next one. With Greedy Search, the same input Token will always produce the same output. The rest connects back to what was said before.

  That concludes the introduction to language models. It does not matter if you cannot follow every line of the code above, as long as you get the idea. There are many details, but the gist is there: you only need a general sense of how each Token is represented, trained on, and predicted.

2. Transformer

  Next up is the Transformer, an architecture that started in NLP, then crossed into speech and vision, and has by now unified almost all modalities. It comes from a paper Google published in 2017, titled "Attention Is All You Need", whose most important contribution is the Self-Attention mechanism. Simply put, during language modeling it lets the model focus on the Tokens that matter. Thinking about it, Google probably did not foresee today when it published that paper.

  The Transformer is an Encoder-Decoder architecture. Simply put, the input is first mapped through the Encoder; you can imagine the Encoder as the RNN introduced above, and likewise the Decoder. The left side handles encoding and the right side handles decoding. The difference is that on the Encoder side, because we already have the whole input, each Token can be modeled using both the Tokens before it and the Tokens after it; but on the Decoder side, because output is generated Token by Token, each Token can only be modeled from the Tokens already generated plus the Encoder's Token representations, never from future Tokens.

  From a more general point of view, the Transformer is a Seq2Seq architecture. Don't panic: this simply means a sequence-to-sequence model, where the input is a text sequence and the output is another text sequence. Translation is a good example. Look at the classic picture below from Google's GNMT (Google Neural Machine Translation):


(Figure 4: GNMT diagram, from GNMT GitHub: https://github.com/belvo/Google-Neural-Machine-Translation-GNMT-)

  As just mentioned, the Encoder and Decoder could each be an RNN, and in the end the Encoder produces a single vector as the representation of the whole input sentence. Speaking of which, how do you represent a whole sentence? As noted above, with an RNN the output of the last Token can serve as the sentence representation. Intuitively, you could also average all the Token vectors, or average the first and last, or the last N. All of these are fine; averaging is the most common and usually works well. Besides averaging, you could also sum, take the maximum, and so on; we will not go deeper here. Now the important point: look carefully at the Decoder. When generating each Token, it uses the information of every Encoder Token as well as the Tokens it has already generated. The mechanism of attending to Token information in the Encoder is Attention. Intuitively, when generating "Knowledge", the source word meaning "knowledge" is given more weight, and similarly for the other words.

  With that in mind, let's look at the structure of the Transformer, shown in the figure below:

(Figure 5: Transformer, from the Transformer paper)

  This figure shows more of the internal structure. On the left is one Block of the Encoder (there are N in total), and on the right one Block of the Decoder (also N in total). For simplicity, assume N=1; then the left structure is the Encoder and the right one the Decoder. You can still imagine each as an RNN to keep the macro picture, but now the imagining is over: what the Transformer actually uses has nothing to do with RNNs. As the figure shows, it relies mainly on two modules: Multi-Head Attention and Feed Forward. For the former, recall GNMT's Attention, which is the importance weight of each Encoder Token for each Decoder Token. Multi-Head Attention uses something called Self-Attention, which is very similar, except that the weights are between each Token of a sequence and every other Token of that same sequence. Put simply, it asks how important each word of a sentence is to each word of that same sentence. This idea is truly the essence: whether it is ChatGPT or non-text models, almost all of them use it; it really has unified everything. What does Multi-Head mean? Simply that the self-attention just described is repeated several times (multiple Heads), with each Head attending to different information, so that more can be captured. Take our earlier sentence, "I like to miss you gently with the moon under the starry sky late at night": in some Head, "I" attends to "like"; in another, "I" attends to "late at night"; in yet another, "I" attends to "miss you"... it makes more sense this way. As for Feed Forward, you can think of it as a "memory layer" where most of the large model's knowledge is stored, while Multi-Head Attention extracts knowledge according to differently weighted attention. A minimal self-attention sketch follows.
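
  Here is a minimal single-head Self-Attention sketch: every Token computes importance weights over every Token in the same sentence and mixes their information accordingly. The sizes are made up, and real implementations add multiple heads, masking for the Decoder, and other details.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 5, 16                          # 5 Tokens, each a 16-dim vector
x = torch.randn(seq_len, d)                 # Token representations

wq, wk, wv = (torch.randn(d, d) for _ in range(3))   # learnable projections
q, k, v = x @ wq, x @ wk, x @ wv

scores = q @ k.T / d ** 0.5                 # 5×5: each Token's score for each Token
weights = F.softmax(scores, dim=-1)         # each row sums to 1: the importance weights
out = weights @ v                           # 5×16: each Token as a weighted mix of all Tokens

# Multi-Head Attention repeats this with several different projections and
# concatenates the results; a Decoder additionally masks out future positions.
print(weights.shape, out.shape)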

  In practice, most NLP tasks are not actually Seq2Seq. The most common ones are sentence-level classification, token-level classification (also called sequence labeling), similarity matching, and generation, with the first three the most widely used. For these, the Encoder and Decoder can be taken apart and used separately. The Encoder on the left can use context in both directions when representing a sentence as a vector, so it can be regarded as bidirectional; the Decoder on the right cannot see future Tokens and generally only uses the preceding context, so it is unidirectional. Both can be used for the tasks just mentioned, but in terms of effect the Encoder is better suited to non-generation tasks and the Decoder to generation tasks. In NLP these are generally called NLU (Natural Language Understanding) tasks and NLG (Natural Language Generation) tasks.

  Let's introduce NLU tasks first. Sentence-level classification means: given a sentence, output a category. Because a sentence can be represented as a vector, after some tensor operations it can naturally be mapped to a probability distribution over the classes. This is not essentially different from the language model described earlier, except that the language model's "classes" number the whole vocabulary, while here the number of classes depends on the task: binary classification, multi-class, multi-label, and so on. Token-level classification means: given a sentence, output a category for each Token. This is even closer to the language model, except that the "next Token" is replaced by the Token's category. For example, named entity recognition extracts the entities in a sentence (person names, place names, works, and so on, typically the nouns you care about). The category scheme usually looks like this: for person names (PER), B-PER marks the beginning and I-PER the inside. For example, for "Liu Yifei looks good" (five characters in Chinese, with characters as Tokens), the corresponding categories are "B-PER, I-PER, I-PER, O, O", where O means Other. Note that for classification tasks the categories are usually called labels. The similarity-matching task takes two sentences and outputs whether they are similar; it can also be regarded as a special classification problem. A small classification sketch follows.
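
  A minimal sketch of the two NLU setups just described, assuming we already have Token vectors from some encoder; the sizes and the 3-class setup are invented for illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d, num_classes = 6, 64, 3
token_vectors = torch.randn(seq_len, d)        # output of some encoder for 6 Tokens

# Sentence-level classification: average the Tokens into one sentence vector.
sentence_vector = token_vectors.mean(dim=0)
classifier = nn.Linear(d, num_classes)
sentence_probs = nn.Softmax(dim=-1)(classifier(sentence_vector))
print(sentence_probs)                          # 3 probabilities summing to 1

# Token-level classification: apply the same kind of layer to every Token.
token_probs = nn.Softmax(dim=-1)(classifier(token_vectors))
print(token_probs.shape)                       # 6×3: one label distribution per Token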

  Next, NLG tasks. Besides pure generation, common tasks include text summarization, machine translation, rewriting, and error correction. The Seq2Seq structure is common here, reflecting a feeling of "understand first, then output". Pure generation tasks, such as writing poems, lyrics, or novels, almost all use a Decoder-only structure. This class of tasks is a bit more troublesome to evaluate automatically. For tasks other than pure generation, reference answers are usually available, so you can measure the overlap or similarity between the model output and the reference. Pure generation is harder: sometimes it is difficult to say whether the output is good or not. Still, for tasks with a concrete goal (such as task-oriented dialogue), you can design evaluations around whether the task was completed and the goal reached. For tasks without a concrete goal (such as open-domain chat), evaluation varies from person to person, and in many cases it is still done manually.

  The Transformer architecture, based on Seq2Seq, can handle both NLU and NLG tasks, and the feature-extraction ability of the Self-Attention mechanism is very strong. This gave NLP a phase breakthrough, and deep learning entered the era of fine-tuning pre-trained models. The general approach is to take an open-source pre-trained model and fine-tune it on your own data so it can handle a specific task. Such a pre-trained model is typically a language model, trained on a large corpus using the language modeling approach described earlier. The first landmark work on the NLU side was Google's BERT, which many people have probably heard of even outside this industry. BERT uses the Transformer Encoder architecture with 12 Blocks (see the figure above; each Block can also be called a layer) and over 100 million parameters. It does not predict the next Token; instead, it randomly masks 15% of the Tokens and then uses the remaining unmasked Tokens to predict the masked ones. This is similar to predicting the next Token from the preceding context, except that the following context can also be used. The first work on the NLG side was OpenAI's GPT, which uses the Transformer Decoder architecture, with a parameter count similar to BERT's. Both were published in 2018, and from there they took two different paths.

3. GPT

  GPT stands for Generative Pre-trained Transformer; yes, it is the GPT in ChatGPT. Generative means it generates text Token by Token, like the language model described above, using the Decoder. Pre-trained, as mentioned earlier, means it is a language model trained on a large corpus. GPT has gone through five versions, from GPT-1 to GPT-4, with ChatGPT (version 3.5) in between. Next we introduce their basic ideas in turn.

  GPT-1, like BERT, follows the pre-train-then-fine-tune routine for downstream tasks: take the pre-trained model as the base, then fine-tune a model for each downstream task, as shown in the figure below:


(Figure 6: GPT basic structure, from GPT paper)

  We have already covered the left side: it is the Transformer architecture (in GPT, the Decoder), and the sub-modules inside can be ignored. Focus on the right side. The noteworthy point is that the inputs for different tasks are all spliced into a single text sequence and fed to the Transformer Decoder, with the result produced through a Linear+Softmax layer. Linear is the most basic network structure; Softmax, which we introduced earlier, maps the output to a probability distribution (summing to 1). This input-splicing approach was very popular in the pre-trained-model era of the time, and BERT followed in a similar way. Such unified processing reduces how much the model must change across tasks: whatever the task, just try to turn it into a sequence.

  There are a few points in the GPT paper that may not have seemed important at the time but are interesting in hindsight (like the dots Jobs talked about connecting). The first is the relationship between the number of pre-trained layers used and task performance, shown in the left figure below; the second is the relationship between the amount of training (parameters) and Zero-Shot performance, shown on the right.


(Figure 7: The left picture is the GPT parameter quantity and effect picture, and the right picture is the Zero-Shot capability, from the GPT paper)

  Two basic conclusions can be drawn from the figure: first, every layer of the pre-trained model contains features useful for solving the target task, meaning more layers bring more capability; second, as training (parameters) increases, Zero-Shot performance improves. In short, a bigger model not only learns more knowledge, which helps downstream tasks, but also shows Zero-Shot capability.

Zero-Shot means giving the model only the task and letting it output the result directly; Few-Shot additionally provides a few examples before giving the task and asking for the output.

  With the conclusions above, what naturally comes next? You want to see how more layers (more parameters) perform, right? So a bit over half a year later GPT-2 arrived, with the parameter count growing from GPT's 110M to 1.5B, more than ten times larger. More interestingly, the blog post for the GPT paper listed some "future work": the first item was scaling up, and the other two were improving fine-tuning and better understanding why generative pre-training improves understanding (NLU) ability.

  GPT was published in June 2018 and GPT-2 in February 2019, as an upgraded version of the former in two respects: scaling up, and Zero-Shot. If the former merely observed these phenomena, the latter studied them further. See the picture below:


(Figure 8: Parameter amount and Zero-Shot performance, from the GPT-2 paper)

  The vertical axis is the evaluation metric of each task and the horizontal axis is the parameter count, so the effect is clear at a glance. Having further verified the earlier idea, the next step is naturally to keep scaling up... But wait: before that, let's look at GPT-2's Token generation strategies, that is, the methods for producing the next Token. We praised Beam Search in the first part, but it has two obvious problems. First, the generated content tends to repeat itself. Second, high-quality text is not necessarily high-probability text: people prefer content with some "difference", rather than something completely predictable. For example, Eileen Chang wrote that "lonely people have their own quagmire"; such distinctive wording cannot be obtained from high-probability words. In a nutshell, both problems boil down to one point: the generated content is too deterministic.

  Now we introduce sampling-based methods. Simply put, the next Token is sampled at random given the existing context. But pure randomness has its own problem: it may produce incoherent text (easy to see why). There is a trick that alleviates this: further increase the chance of high-probability words and reduce the chance of low-probability words, so the sampler is less likely to land on a very low-probability (and likely incoherent) word. Concretely, a temperature parameter adjusts the output probability distribution: the larger the value, the flatter the distribution, narrowing the gap between high- and low-probability words (the output is less certain); the smaller the value, the more pronounced the gap (the output is more certain); and as it approaches 0, it becomes the same as Greedy Search. This is a fairly common technique in deep learning; interested readers can read this StackOverflow explanation further.

  Besides that trick, a 2018 paper introduced a new sampling scheme, simple but very effective, which is the Top-K sampling used in GPT-2. In short, when selecting the next Token, sample only from the Top-K most probable words (Top-K=0 means the whole vocabulary). This works well, but there is still a small problem: Top-K is a hard cutoff that ignores whether the K-th word's probability is high or low. In the extreme case where one word already has probability 0.99, a slightly larger K will inevitably include words with very low probability, which leads to incoherence.

Therefore, a 2020 paper proposed another sampling scheme, Top-P, which is also available in GPT-2: choose among the words whose cumulative probability exceeds P. This way, when the probability distribution is relatively flat, more words are candidates (it may take dozens of words for the cumulative probability to exceed P); when the distribution is peaked, fewer words are candidates (perhaps the probability of 2-3 words already exceeds P).

  Top-P looks more elegant, and the two can be combined, but most of the time when adjustment is needed, adjusting just one of them (including the earlier temperature parameter) is enough. If you want to tune more than one, make sure you understand what each parameter does. Finally, note that no sampling strategy guarantees good results on every generation, nor can any fully avoid repeated words (post-processing can help alleviate that). No single strategy fits every scenario; it must be chosen flexibly for the situation at hand. A small sketch of these knobs follows.
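
  The sketch below shows the three knobs discussed above (temperature, Top-K, Top-P) applied to a made-up score vector over a tiny vocabulary; real libraries implement the same ideas with more care, but the logic is roughly this.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([3.0, 2.5, 1.0, 0.5, -1.0])   # made-up scores for 5 Tokens

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    logits = logits / temperature                   # >1 flattens, <1 sharpens the distribution
    probs = F.softmax(logits, dim=-1)

    if top_k > 0:                                   # keep only the K most probable Tokens
        top_probs, top_idx = probs.topk(top_k)
        probs = torch.zeros_like(probs).scatter(0, top_idx, top_probs)

    if top_p < 1.0:                                 # keep the smallest set whose mass reaches P
        sorted_probs, sorted_idx = probs.sort(descending=True)
        cumulative = sorted_probs.cumsum(0)
        keep = cumulative - sorted_probs < top_p    # drop Tokens once the mass before them reaches P
        mask = torch.zeros_like(probs, dtype=torch.bool).scatter(0, sorted_idx, keep)
        probs = probs * mask

    probs = probs / probs.sum()                     # renormalize, then sample one Token id
    return torch.multinomial(probs, 1).item()

print(sample(logits, temperature=0.7, top_k=3))
print(sample(logits, top_p=0.9))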

  GPT-3 was published in July 2020 and was big news at the time, because its parameter count reached an order of magnitude no other model matched: 175B, more than 100 times GPT-2. And it was not open-sourced, which was a bit awkward for those hoping to just grab the weights. GPT-3's thinking was: since the model has Zero-Shot ability, can we skip fine-tuning altogether? Fine-tuning for every task is a hassle. Look at humans: a few examples (Few-Shot) and some simple instructions are enough to handle a task, right? And didn't GPT-2 further confirm the Zero-Shot capability? So keep going, keep scaling, increase the parameters, and we get the 175B GPT-3. In other words: bring on whatever task you like, I will not adjust my parameters; at most I need a few examples (and in the next step not even those) and I will get it done for you. Looking back now, this paper is a milestone, because it fundamentally, and in a revolutionary way, challenged the existing paradigm. I wrote an article on this before: GPT-3 and its In-Context Learning | Yam; interested readers can read further. It is a pity I did not read it carefully at the time. Recalling it now, the reason was that 175B simply seemed too big and too expensive (several million dollars): no matter how good the results, what could I do with it? It was not just a small number of people who failed to realize its significance; perhaps the whole world except the OpenAI team failed to. First, look at the picture below:

(Figure 9: The performance of X-Shot at different parameter levels, from the GPT-3 paper)

This picture provides several pieces of information:

  • X-Shot performance differs hugely across model scales; large models have superpowers.
  • With the large model, One-Shot brings a clear jump in performance, and adding a Prompt improves it further and significantly.
  • Few-Shot has diminishing marginal returns. Below about 8-Shot, the Prompt's effect is obvious, but from One-Shot to 8-Shot the Prompt's marginal effect keeps shrinking, and beyond 10-Shot the Prompt hardly matters.

  All in all, large models have In-Context ability: they can handle different tasks without task-specific adaptive training (fine-tuning), relying on their own "understanding". This should have been shocking (even a little scary), but most people were probably shocked first by its price and size. Next, let's get an intuitive feel for how this In-Context ability is used to complete tasks, as shown in the figure below:

(Figure 10: How to use the In-Context ability to complete the task, from the GPT-3 paper)

  It does not look complicated at all: you just build the input in its expected format and feed it in. This is also a big part of why projects like this exist: AI has become accessible, and now you only need a pair of hands (maybe in the future not even that) to build AI applications with an LLM (Large Language Model). A small sketch of assembling such an input follows.
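
  A tiny sketch of assembling such an input as plain text: a task description, a couple of examples, and the new case to complete. The task, examples and format here are made up; the point is only that everything is spliced into one sequence and the model's continuation is the answer.

examples = [
    ("I love this movie!", "positive"),
    ("What a waste of time.", "negative"),
]
new_review = "The plot was clever and the acting superb."

prompt = "Decide whether each review is positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {new_review}\nSentiment:"

print(prompt)   # hand this string to the large model and read its continuation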

  The last thing worth mentioning is the outlook in GPT-3 (they really are moving forward along the stated outlook, not just writing it down). In the paper's "Limitations" section, they raise several problems GPT-3 still had, two of which deserve special attention, because they point to the next generation, InstructGPT (the sibling of ChatGPT), and more advanced versions:

  • The self-supervised training paradigm (i.e., the plain language-model objective) has reached its limit, and new methods are urgently needed. Future directions include learning objective functions from humans, fine-tuning with reinforcement learning, and multimodality.
  • It is unclear whether Few-Shot actually learns new tasks at inference time or merely recognizes tasks learned during training. Ultimately, it is not even clear what humans learn when learning from scratch versus learning from a few previous samples. Understanding exactly how Few-Shot works is a future direction.

  The first point will come up in the next section; here let's discuss the second. What it means is this: when we give the model a few examples (Few-Shot), we cannot tell whether it "learns" the new task at inference time (in that case, without examples there would be no ability; "learns" is in quotes because no parameters are adjusted), or whether it already acquired the ability during training and the examples merely make it "recall" what it learned before. This is a bit abstract, so take humans as an analogy (perhaps imperfect, but hopefully illustrative): when you read a poem and are moved to write a line of your own, is it because you "understood" something new while reading, or because you already had that accumulation (memory) inside you and the poem merely drew it out? You see, this touches on the brain, thinking, and consciousness, whose principles humans have not yet figured out, so we still do not know the answer.

4. RLHF

  RLHF, Reinforcement Learning from Human Feedback, learning from human feedback: it sounds unremarkable. Indeed, the idea is simple, but its effect cannot be ignored. As mentioned, GPT-3 said new methods would be needed, including learning from humans, fine-tuning with reinforcement learning, multimodality, and so on. Today, from InstructGPT to ChatGPT to GPT-4, those methods are being implemented step by step. One reminder: these directions were not laid out clearly from the start; there was a great deal of exploration and many intermediate results along the way (both their own research and that of other practitioners). Do not look at the outcome and conclude it was mediocre; especially for people outside the field (and some media in particular), the hard exploration in between always deserves respect. Besides, even when you know the method, actually doing it and getting the effect is very difficult. Also, given the popular-science nature of this article, I can only introduce a small part of the content; the overall structure is fairly complete, but it is still relatively simple (hence the subtitle "A Little Why"). In general, building it is very hard, but using it, as mentioned earlier, only requires a pair of hands.

  OK, down to business. RLHF became known mainly through OpenAI's InstructGPT paper, and of course much more widely through the release of ChatGPT. Because the latter has no paper and no open source, we can only glimpse ChatGPT through the peephole of InstructGPT, though according to the ChatGPT official page, the peephole may be fairly wide. Described in simple language, InstructGPT uses a reinforcement learning algorithm to fine-tune a language model so that it improves based on human feedback. What matters is the result it achieves: the 1.3B InstructGPT is comparable to the 175B GPT-3, as shown in the figure below:


(Figure 11: Comparison of the effects of different strategies and different models, from the InstructGPT paper)

  There are five curves in the figure; a little explanation: the top two (PPO) are the results of the InstructGPT settings; the SFT in the middle can be understood as GPT-3 + fine-tuning, and in theory (and in practice) fine-tuning beats Few-Shot, which beats Zero-Shot; the bottom two are GPT-3's results. Of course, this evaluation methodology may itself be worth scrutinizing.

  OK, now let's see how it works, what role RLHF plays, and how it does so. It is still the picture that has been posted everywhere; if you counted how often it is cited, I suspect the number would not be low.


(Figure 12: InstructGPT workflow, from InstructGPT paper)

  This picture shows the whole InstructGPT process quite intuitively. There are three steps:

  • Step 1: SFT, Supervised Fine-Tuning. As the name suggests, the model is fine-tuned on supervised (labeled) data. The supervision data here consists of prompts as input and the corresponding replies as output, where the replies are written by humans. This job demands more than ordinary labeling; it is really a kind of creation.
  • Step 2: RM, Reward Model. Concretely, a prompt is given to the SFT model from the previous step, which outputs several (4-9) replies, and annotators rank these replies. Pairs are then drawn from those replies (2 at a time); because the replies are ordered, the pairs can be used to train the reward model so that it learns to judge good from bad. This step is critical: it is the "Human Feedback" that guides the model's evolution in the next step (a small sketch of the reward-model loss follows this list).
  • Step 3: RL, Reinforcement Learning, trained with the PPO strategy. PPO, Proximal Policy Optimization, is a reinforcement-learning optimization method whose main idea is to avoid overly large updates at each step and so improve training stability. The process is roughly: initialize a language model, give it a prompt, and let it generate a reply; the RM from the previous step scores the reply, and the score is fed back to update the model's parameters (from the reinforcement-learning perspective, the model here is the policy). One very important detail in this step: when updating the model, the difference between the model's per-Token outputs and those of the Step 1 SFT model is taken into account, keeping the two as close as possible. This mitigates possible over-optimization in reinforcement learning.
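
  As a small illustration of Step 2, here is a sketch of the pairwise ranking idea behind the reward model: for two replies to the same prompt, the score of the reply humans ranked higher should exceed the score of the other. The scorer below is a hypothetical stand-in; in the real setting it is a language model with a scalar output head, and Step 3 then adds the PPO objective with the per-Token closeness constraint described above.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

def reward_model(reply_features):           # hypothetical scorer: features -> one scalar score
    w = torch.linspace(-1, 1, reply_features.numel())
    return (reply_features * w).sum()

better = torch.randn(16)                    # features of the reply annotators ranked higher
worse = torch.randn(16)                     # features of the reply annotators ranked lower

# Pairwise ranking loss of the kind used for such reward models:
# -log sigmoid(r(better) - r(worse)); minimizing it pushes the preferred
# reply's score above the other reply's score.
loss = -F.logsigmoid(reward_model(better) - reward_model(worse))
print(loss.item())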

  That's all? Yes, just like that. RLHF is all shown above, and everyone has seen the effect. Although ChatGPT has no published paper, we can fairly safely assume it is based on similar ideas. Of course, there are plenty of details here: even knowing the idea, you might not be able to reproduce it. This is normal in the deep learning era; there are too many small designs and details, and once they accumulate, the gap is hard to close at a stroke. If others do not tell you, you can only verify them slowly through your own experiments.

  Below, let's attempt to explain why RLHF works and why it has now become a basic paradigm. Reinforcement learning in NLP has actually been studied for a long time; the author happens to have followed text generation, and reinforcement learning in text generation, for a while. There are two main difficulties: one is training stability; the other is the design of the reward function. The former is considerably improved by the PPO strategy together with the difference measure against the SFT model. The latter: if you approach it from an "objective" angle, it is not easy to design rules. I also imagined many such methods, for example adding grammatical constraints, or even rules like the law of least effort.

The Law of Least Effort: proposed by Zipf in the book "Human Behavior and the Principle of Least Effort: an introduction to human ecology". Simply put, language is lazy and tends to express as much as possible with as few words as possible, and semantics evolve in that direction.

  InstructGPT instead uses human feedback directly as the "rules": the "rules" are made implicit and treated as a black box. We only care about the results, not about what the rules are, how many there are, or how they work, which is the same spirit as deep learning. In comparison, my earlier thoughts look a bit naive. After all, linguistics itself still has many unresolved controversies, for example whether language ability is innate. InstructGPT's approach is simpler, more direct, and more effective.

  What remains is how to measure "good or bad". After all, there is a result in the end, and where there is a result there must be a standard. Readers might pause and think: if it were you, how would you design metrics to judge which of two outputs is better? This step looks easy but is not, because the design of the metrics shapes the model's learning direction and ultimately its effect. There are simply too many possible criteria for output quality. Although it looks like just ranking the handful of results (Step 2 above), a great deal of human judgment is hidden in that process, and the training process is essentially a process of aligning the model with the measurement procedure of Step 2. Therefore, if the Step 2 metrics are badly designed, Step 3 is wasted effort. For InstructGPT, which is meant for nearly all tasks, measurement is even harder. For example, for a summarization task we mostly care about whether the original information is accurately summarized, while for a generation task we may care about fluency and logical consistency. InstructGPT covers some ten kinds of tasks; designing separate metrics for each would be troublesome, and the effect would not necessarily be good, because the metrics would not necessarily point in the same direction. Worse, if a new task appeared, would we have to design another set of metrics and retrain the model from scratch?

  Let's look at how InstructGPT designs its measurement criteria. I think this is the most valuable part of the InstructGPT paper, and the part most worth thinking about and practicing. For this the author spent a long time carefully studying related materials and wrote an article dedicated to its annotation scheme: ChatGPT Annotation Guide: Tasks, Data, and Specifications | Yam; interested readers can read further. Reports and studies on this aspect, however, still seem relatively few.

  Without further ado, let's continue. First, InstructGPT uses three general criteria: helpful, truthful, and harmless, somewhat reminiscent of Asimov's Three Laws of Robotics. In other words, whatever the task, the model should lean toward these three directions. This idea truly deserves applause. Seeing the result now, we naturally feel it is nothing special, but if you had to design it in advance, most people would easily slip back into task-specific thinking. The same goes for the OpenAI team's persistence with In-Context Learning, which we already mentioned in the GPT section. When someone tells you the answer, you may feel there is nothing to it, and indeed many research institutions and researchers had thought about it. But before it works, it is not easy to believe in a path few people walk and keep going unswervingly.

  With the three general guidelines in place, the next step is to refine them so they become operational. For example, for the first one, helpfulness, InstructGPT gives the following examples of "helpful" behavior:

  • Write in clear language.
  • Answer the question they intended to ask, even if they asked it incorrectly.
  • Be sensitive to internationality (e.g. "football" should not default to American football, and "president" does not necessarily mean the President of the United States).
  • If an instruction is too confusing, ask for clarification and explain why it is confusing.
  • Do not give overly long or rambling answers, and do not repeat information from the question.
  • Do not assume irrelevant extra context beyond what is given (other than facts about the world), unless it is an implied part of the task. For example, if asked to "reply politely to this email: {email body}", the output should not assume "I can't make it this time, but I'm free next weekend." But if asked to "write an email to Socrates", then such an assumption can safely be made.

  I believe this list is probably very long in practice, with new items added during actual labeling until the vast majority of cases are covered, that is, until for most data to be labeled it is easy to judge from the provided details whether an output is "helpful". Now pause and think again: could you have designed reward rules based on these details from the very beginning? Just thinking about it does not feel realistic. There are similar example lists for the other two criteria, which we will not repeat here; interested readers can read the article mentioned earlier and the references behind it (some documents are not cited in the paper).

  The detailed rules are not the end of it. The next thing to resolve is the conflicts and trade-offs between the criteria mentioned above. Because this is a comparison task (judging which output is better), once multiple criteria are involved there will inevitably be cases where one output is better on criterion A but worse on criterion B, and the more criteria there are, the more complicated this gets (fortunately there are only three). For this, InstructGPT also gives guidelines:

  • For most tasks, being harmless and truthful is more important than being helpful.
  • However, if (a) one output is much more helpful than the other, (b) that output is only slightly less truthful/harmless, and (c) the task does not seem to fall into a "high-stakes domain" (e.g. loan applications, medical or legal consultation), then the more helpful output scores higher.
  • When choosing between outputs that are equally helpful but untruthful/harmful in different ways, ask: which output is more likely to cause harm to the user (the people most affected by the task in the real world)? That output should rank lower. If this is not clear from the task, mark the outputs as tied.

  The general guideline for borderline cases is: which output would you rather receive from a customer assistant trying to help you with this task? This is a put-yourself-in-their-shoes principle: imagine you are the one who proposed the task, then ask yourself what kind of output you would expect.

  Looking at all this, do you still feel this step is easy? Although it does not look very "technical", doing it well requires excellent design ability, a grasp of the big picture, and sensitivity to detail. I am fairly convinced these details were built up gradually from the bottom, rather than envisioned all at once from the start: a complete system assembled by repeatedly hitting questions in practice, analyzing and weighing them carefully, and then adding rules. Personally, I suspect this body of know-how may be an even more precious asset than the data, and the barrier it creates can only be built with time and continuous practice.

  Compared with GPT-3, InstructGPT/ChatGPT has much stronger Zero-Shot ability. Few-Shot is not very necessary in many cases, but Prompts are still needed, which has even spawned a new trade: Prompt engineering. However, according to OpenAI's CEO, Prompt engineering may not be needed in a few years (perhaps lingering a little longer for image generation); what we will do instead is interact with AI directly through natural language. We cannot judge whether that will come true, but one thing is certain: the barrier to using AI will keep getting lower. In a few years, a junior high school student may be able to build good AI applications out of existing services.

5. LLM

  We are experiencing and entering a new era. As an external "most powerful brain", the LLM will surely become very easy for everyone to obtain; what you use it for depends on your imagination. Whatever your industry, I believe this is an exciting signal; the author himself is often too excited to sleep at night. In the face of such a change, there are too many possibilities ahead, and I believe the best way is to embrace it. Let's HuggingLLM, create the era, and create the future together. We believe the world will become a better place because of it.


