【GPT-4 Theory-1】Exploring the core technology of GPT-4 | JD Cloud technical team

Foreword

GPT-4 has been released for some time now, but for various reasons such as safety, OpenAI has not published GPT-4's technical details or code, giving only a roughly 100-page technical report [1].

The technical report focuses on showcasing GPT-4's strengths and only sketches a few technical directions, which is far from enough for those of us who want to understand the technical details.

In this article, I will analyze the technical details of GPT-4 in depth by combining several sources of information: the GPT-4 technical report, GPT-4's improvements over GPT-3.5/ChatGPT, comparisons between GPT-4 and ChatGPT, OpenAI's recent work, and research progress on large language models (LLMs) and multimodal models.

Because there is no clear evidence of how GPT-4 actually works, we mainly discuss which technologies OpenAI may have used to achieve GPT-4's capabilities. If my speculation turns out to be wrong, readers are welcome to discuss it in the comment section. Now, let us play Sherlock Holmes and start analyzing the principles behind GPT-4.

1. Improvements in GPT-4

GPT-4 is an iteration built on top of ChatGPT. I will not go into the principles of ChatGPT here; if you need background, see my article "ChatGPT/InstructGPT Detailed Explanation". In this article, we first discuss what functional improvements GPT-4 has made over ChatGPT, then which technologies OpenAI may have applied in GPT-4 to achieve these improvements, and finally other large language models and some thoughts on using GPT-4.

1.1 Zero-shot and few-shot learning ability

When we use GPT-4 for text generation, we are pleasantly surprised to find that it can answer all kinds of tricky questions almost perfectly, which shows that GPT-4 has very powerful zero-shot learning capabilities.

In addition, the GPT-4 technical report gives a large number of zero-shot and few-shot examples that approach or even exceed supervised SOTA methods in some scenarios. For example, on the HumanEval code generation dataset [3], GPT-4 surpasses the famous code generation model CodeX [3]. Likewise, GPT-4 approaches the SOTA Anthropic-LM [4] on TruthfulQA [26], a question-answering dataset that evaluates factual correctness.

1.2 Logical Reasoning Ability

GPT-4's technical report emphasizes its improvement over ChatGPT on many academic exams, as shown in Figure 1. The exam results reflect that GPT-4 has stronger logical reasoning ability than ChatGPT. @越玉涛Max compared GPT-4 and ChatGPT side by side on 19 logical reasoning questions [2]: ChatGPT answered 37% correctly, while GPT-4 answered 100% correctly. From the comparison we can clearly see that GPT-4 has made a qualitative leap in logical reasoning.

Figure 1: Comparison of GPT-4 and ChatGPT scores in many academic exams

I also found a simple logical reasoning problem. In the bridge-crossing example below, although ChatGPT gave a crossing sequence, its time calculation was wrong, while GPT-4 solved the problem simply and clearly.

Question: Four people, A, B, C, and D, need to cross a bridge. At most two people can cross at a time, and their flashlight lasts only 17 minutes. The four people need 1, 2, 5, and 10 minutes respectively to cross. What is the fastest time in which everyone can cross the bridge?

ChatGPT: In order for all four people to cross the bridge as quickly as possible, we need to get the slowest person across as quickly as possible. Therefore, we can let the two slowest people A and B cross the bridge first, then let A go back, then let the second-slowest people C and D cross, and finally let B go back, completing the whole crossing.

Specifically, we can follow these steps:

It takes 10 minutes for A and B to cross the bridge.

A returns and takes 1 minute.

C and D cross the bridge, it takes 5 minutes.

B returns, taking 2 minutes.

It takes 10 minutes for A and B to cross the bridge.

So the total time spent is 10 + 1 + 5 + 2 + 10 = 28 minutes. That's the fastest time it takes to get them all across the bridge.

GPT-4: To get all four people A, B, C, and D across the bridge as quickly as possible, we can adopt the following strategy:

A and B cross the bridge together in 2 minutes.

A returns and takes 1 minute.

C and D cross the bridge together, which takes 10 minutes.

B returns, taking 2 minutes.

A and B cross the bridge together in 2 minutes.

Total time: 2 + 1 + 10 + 2 + 2 = 17 minutes.

So you can get everyone across the bridge in as little as 17 minutes.
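
As a sanity check on the two answers, here is a small brute-force solver for the puzzle (not from the article; written for verification, with people represented by their crossing times):

```python
from functools import lru_cache
from itertools import combinations

TIMES = (1, 2, 5, 10)  # minutes needed by A, B, C, D

@lru_cache(maxsize=None)
def best(left, flashlight_left):
    """Minimum time to get everyone in `left` to the far side."""
    if not left:
        return 0
    if flashlight_left:
        if len(left) == 1:
            return left[0]  # last person walks across alone
        # Two people cross together at the slower one's pace.
        return min(
            max(a, b) + best(tuple(t for t in left if t not in (a, b)), False)
            for a, b in combinations(left, 2)
        )
    # Flashlight is on the far side: any finisher may carry it back.
    far = [t for t in TIMES if t not in left]
    return min(t + best(tuple(sorted(left + (t,))), True) for t in far)

print(best(TIMES, True))  # prints 17, confirming GPT-4's schedule
```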

1.3 Ability to understand diagrams

A major improvement in GPT-4 is that it begins to support multimodality. Since GPT-4's image interface has not yet been opened to the public, we borrow an example from the technical report. In Figure 2, GPT-4 accurately understands the incongruity of a VGA connector plugged into a Lightning port. This example shows that GPT-4 does not merely recognize the content of an image; its most impressive ability is spotting what is unusual in the image.

Figure 2: GPT-4 has powerful image and diagram understanding capabilities

1.4 Safer text generation capabilities

GPT-4's technical report pays close attention to the same safety issues as previous models, but GPT-4's safety has improved greatly: according to the report, ChatGPT is about 10 times more likely than GPT-4 to generate harmful content. Figure 3 shows content generated from risky prompts by an early version of GPT-4 and by the release version. The risk from the release version is greatly reduced, but this does not mean that GPT-4 is a completely harmless model.

Figure 3: Examples comparing the safety of content generated by early GPT-4 and the release version of GPT-4

A lot of work went into ensuring GPT-4's safety. First, OpenAI hired more than 50 domain experts from different fields to conduct adversarial testing and red-teaming. Second, they trained rule-based reward models (RBRMs) to assist model training. We will expand on the implementation details of this part later.

1.5 Stronger programming ability

Another important comparison in the GPT-4 technical report is code generation ability versus ChatGPT on LeetCode problems of easy, medium, and hard difficulty. In the zero-shot setting, GPT-4's accuracy on the HumanEval dataset improves from ChatGPT's 48.1% to 67.0%. The report also notes that ChatGPT correctly answered only 20 of LeetCode's 166 programming questions, while GPT-4 answered 55. Table 1 gives the detailed distribution of GPT-4 and ChatGPT across the 166 LeetCode questions.

Difficulty          GPT-4      GPT-4 (no vision)   ChatGPT
LeetCode (easy)     31 / 41    31 / 41             12 / 41
LeetCode (medium)   21 / 80    21 / 80             8 / 80
LeetCode (hard)     3 / 45     3 / 45              0 / 45

Table 1: Performance of GPT-4 and ChatGPT on LeetCode programming questions

1.6 Ability to handle other languages

GPT-4 has improved greatly in both English and non-English languages, and in most languages it even surpasses ChatGPT's performance in English. Here we ask ChatGPT and GPT-4 to write a poem in the manner of Master Tang from "Let the Bullets Fly", with the request: "There must be wind and there must be meat; there must be hot pot and there must be fog; there must be a beautiful woman and there must be a donkey!" Comparing the two poems, ChatGPT's reads like a modern poem without much literary flair, while GPT-4's reads essentially like a classical Chinese poem, except that the first two lines have too many characters, and it even maintains rhyme to a certain extent.

ChatGPT:

The wind blows over the hills, the aroma of meat overflows;
The hot pot steams, the mist lingers;
The beauty is graceful and charming, the donkey runs happily;
A unique picture scroll that refreshes the heart.
The aroma makes one salivate, the mist fills the air with mystery;
The beauty's face intoxicates, the donkey's cheer shows the vitality of life;
A picture full of vitality that leaves one fascinated.

GPT-4:

The tumbling wind blows over the mountain stream wantonly, and
the hot pot shabu-shabu is the icing on the cake.
Cooking smoke curls up in the mist, and
beauties and donkeys swim together.

The beautiful woman lightly strokes the back of the donkey,
blowing away worries and the world.
Playing with rouge, smiling and drunk,
donkey singing and rushing to the outside of the mountain.

Flaming red mushrooms are warmed up, and
chilli beef is boiled in water.
The mandarin duck hot pot sings and laughs,
time flies and stays in dreams.

The mist washes away the fate of the dust,
and the coir raincoat and chaifei are at risk.
The donkey's hoof crushes the customs of the world, and
Feihong's hoof prints leave glory.

1.7 Ability to handle longer sequences

The longest sequence ChatGPT can handle is 4K tokens, while OpenAI has opened two GPT-4 models with 8K and 32K contexts (a Chinese character takes roughly 2 to 2.5 tokens). GPT-4 tokens are priced separately for prompts and completions (Table 2), and the 32K model costs twice as much as the 8K model. Compared with ChatGPT's $0.02 per 1,000 tokens, GPT-4 is up to about 15 times more expensive.

Model Prompt Completion
8K context $0.03 / 1K tokens $0.06 / 1K tokens
32K context $0.06 / 1K tokens $0.12 / 1K tokens

Table 2: Details of charges for GPT-4
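
As a quick sanity check on these rates, here is a minimal cost calculation in Python (the per-token prices come from Table 2; the request sizes are invented for illustration):

```python
# Hypothetical request: 1,500 prompt tokens and 800 completion tokens
# on the 8K-context model, billed at the Table 2 rates.
PROMPT_RATE_8K = 0.03 / 1000      # dollars per prompt token
COMPLETION_RATE_8K = 0.06 / 1000  # dollars per completion token

cost = 1500 * PROMPT_RATE_8K + 800 * COMPLETION_RATE_8K
print(f"${cost:.3f}")  # $0.093 for this single request
```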

More of GPT-4's capabilities are probed extensively in the recent 155-page article [25] by Sébastien Bubeck, head of the Machine Learning Theory group at Microsoft Research Redmond, and his colleagues.

They point out that GPT-4's performance goes far beyond that of a typical text generation model, calling it a spark that ignites the flame of artificial general intelligence (AGI). GPT-4 already shows very strong abilities in reasoning, planning, problem solving, abstract thinking, comprehending complex ideas, rapid learning, and learning from experience.


2. Guesses at GPT-4's technical solutions

Given these improvements, we can combine current LLM progress with OpenAI's past work to guess at GPT-4's possible technical solutions. Since we can only base our guesses on published algorithms, it cannot be ruled out that OpenAI used unpublished ones; so if my guesses turn out wrong, you can at least treat this as a tour of several independent algorithms.

  1. Zero-shot and few-shot learning ability: the theoretical basis for this improvement is most likely the emergent abilities of large models [5];
  2. Logical reasoning ability: chain of thought (CoT) [6] and the self-improvement ability [7] of large models;
  3. Ability to understand images: speculated to draw on OpenAI's famous multimodal model CLIP [8] or Microsoft's multimodal model KOSMOS-1 [12];
  4. Safer text generation ability: the technical report says the most about this part, mainly expert testing, hallucination detection, and RBRM;
  5. Stronger programming ability: speculated to draw on OpenAI's famous code generation model CodeX [3];
  6. Ability to handle other languages: possibly borrowing ideas from cross-lingual pre-trained models such as XLM [9], or simply because emergent abilities strengthen GPT-4's performance in other languages;
  7. Ability to handle longer sequences: speculated to use Transformer-XL [10], which handles long inputs, or OpenAI's own Sparse Transformer [11], which reduces the complexity of long inputs.

Below we explain the basis for these speculations and briefly introduce each technique.

2.1 Emergent capabilities

Emergent ability is the core technology behind LLM breakthroughs. It refers to a model automatically learning advanced, complex functions or behaviors during training that were never directly encoded or specified.

This ability lets the model perform better on new, unknown tasks, because it can adaptively acquire new functions or behaviors without retraining or modification. Figure 4 shows that many LLMs, including GPT-3, exhibit very strong emergence: once a model's parameter count passes a certain threshold, its performance improves rapidly. From this we conclude that GPT-4's zero-shot and few-shot learning capabilities derive from the emergent abilities of large models.

A model's emergent abilities mainly depend on four factors:

  • the model's large parameter count;
  • the model's architecture;
  • high-quality training data;
  • advanced training strategies.

Among these, the model's parameter count is the most important factor.

Figure 4: Many large models such as GPT-3 have demonstrated emerging capabilities on multiple tasks

2.1.1 Model parameter count

The parameter count of GPT-4 is a hot topic. Considering that GPT-4 is stronger than ChatGPT and adds an image encoding module, its parameter count should not be smaller than ChatGPT's. Figure 5 shows per-token prediction times for ChatGPT Turbo and GPT-4 as measured by ARK Invest, where GPT-4 takes about 4 times as long as ChatGPT. GPT-4 also likely uses strategies to speed up inference, so the text-model portion of GPT-4 probably has parameters in the hundreds of billions, approaching a trillion.

If GPT-4 uses CLIP for image encoding, then according to the CLIP paper the largest image encoder is a residual network scaled up 64x, which would put GPT-4's image encoder at roughly 1.6 billion parameters. Of course, we cannot rule out that GPT-4 uses a different image encoding structure; KOSMOS-1 [12], which is also Transformer-based, would be a good candidate. The image encoder's parameter count will have to wait for more details to be disclosed.

Figure 5: Per-token prediction times of ChatGPT and GPT-4, according to statistics from ARK Invest

2.1.2 Architecture of the model

What we can be sure of is that the GPT-4 technical report states that GPT-4 adopts a Transformer-based architecture, i.e., the core is still the decoder-only structure of the GPT series. Few internal details can be confirmed. Considering GPT-4's speed and its ability to handle long text, its internal structure has two possibilities:

  1. Because GPT-4 greatly improves long-text ability, there is some probability it uses Transformer-XL or the Sparse Transformer;
  2. Because GPT-4 is more likely an iteration on ChatGPT, it may still use the original Transformer, with more layers, more heads, and more hidden units.

Because GPT-4 also supports image input, there must be an image encoding component, which we expand on in Section 2.3.

2.1.3 Training Strategy and Training Data

GPT-4 basically maintains the same training strategy as ChatGPT, i.e., the pre-training + prompting + prediction paradigm shown in Figure 6. Here we mainly introduce GPT-4's improvements, of which there are three:

  • it introduces a rule-based reward model (RBRM);
  • it introduces multimodal prompt learning;
  • it introduces the chain of thought.

Figure 6: Model training steps of ChatGPT

1. RBRM

The first improvement in GPT-4 is the introduction of RBRM. RBRM is a rule-based four-way classifier whose four categories are:

  • a refusal in the desired style;
  • a refusal in an undesired style;
  • containing disallowed content;
  • a safe, non-refusal response.

RBRM is used in the PPO stage (Step 3 in Figure 6). To improve model safety, ChatGPT uses reinforcement learning from human feedback (RLHF) to train the model in Step 3, with data coming from GPT-3 API users. GPT-4 adds RBRM at this point, so that correct rewards guide the model to refuse harmful requests while not refusing harmless ones.

Building NLP models from rules has a long history: the earliest NLP models were rule-based, followed by probability-based and then neural-network-based models.

For example, Shannon used the probabilistic model of a discrete Markov process to describe an automaton of language, and the regular expressions we use every day are typical rule-based text models. The advantage of a rule-based model is that no training data is needed; the disadvantage is that it usually requires domain experts to design the rules and can often only solve problems in one domain. My guess is that RBRM was designed by domain experts as a zero-shot classifier written from a set of textual rules such as regular expressions and finite state machines.

Rule-based reinforcement learning has also been widely discussed in recent years. An important optimization goal of reinforcement learning is to shrink the search space, and rule constraints do exactly that: after the rules prune the space, reinforcement learning searches only what remains, which can effectively improve convergence speed. The rough working principle of GPT-4's RBRM is shown in Figure 7.

Figure 7: How RBRM works
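
To make the idea concrete, here is a minimal sketch of a rule-based zero-shot classifier along these lines (the rules, patterns, and category strings are illustrative assumptions; OpenAI's actual RBRM rubric is not public):

```python
import re

# Illustrative rules only; the real RBRM rubric is not published.
DESIRED_REFUSAL = re.compile(r"^(I'm sorry|I can't help with)", re.IGNORECASE)
UNDESIRED_REFUSAL = re.compile(r"(as an AI language model|I am just an AI)", re.IGNORECASE)
DISALLOWED = re.compile(r"(how to build a bomb|synthesize .* at home)", re.IGNORECASE)

def rbrm_classify(response: str) -> str:
    """Map a model response to one of the four RBRM categories."""
    if DISALLOWED.search(response):
        return "contains disallowed content"
    if DESIRED_REFUSAL.search(response):
        return "refusal in the desired style"
    if UNDESIRED_REFUSAL.search(response):
        return "refusal in an undesired style"
    return "safe non-refusal response"

print(rbrm_classify("I'm sorry, but I can't help with that request."))
```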

2. Multimodal Prompt Learning

The GPT-4 report does not detail the technology behind its multimodal capabilities, and its image interface is not yet in public beta. But we can look for similar recent work in the multimodal field. Conveniently, KOSMOS-1 [12], announced by Microsoft at the beginning of this year, has very strong multimodal QA capability, and its approach closely resembles GPT-4's. We can speculate that GPT-4 uses a multimodal prompting approach similar to KOSMOS-1's.

KOSMOS-1 supports three types of datasets: text generation, image caption generation, and multimodal QA. Figure 8 shows examples of KOSMOS-1 on image captioning and QA. In the captioning example in Figure 8(a), the model's input is the image embedding and the output is the predicted caption. In the multimodal QA example in Figure 8(b), KOSMOS-1 takes both image embeddings and text embeddings as input and uses them to predict the answer.

Figure 8: Example of multimodal input for KOSMOS-1
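
For intuition, here is a minimal sketch of how such interleaved image-and-text input could be assembled for a decoder-only model (the tag-token scheme follows the KOSMOS-1 paper; the encoder and embedding modules are placeholder assumptions):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for a real image encoder and text embedder.
image_encoder = nn.Linear(2048, 768)    # e.g. pooled vision features -> model dim
text_embedding = nn.Embedding(50257, 768)

def build_multimodal_sequence(text_ids: torch.Tensor,
                              image_feats: torch.Tensor,
                              boi_id: int, eoi_id: int) -> torch.Tensor:
    """Interleave <image> ... </image> embeddings into a text sequence,
    in the spirit of KOSMOS-1's '<s> text <image> img </image> ... </s>'."""
    img_tok = image_encoder(image_feats).unsqueeze(0)   # (1, 768)
    boi = text_embedding(torch.tensor([boi_id]))        # (1, 768)
    eoi = text_embedding(torch.tensor([eoi_id]))        # (1, 768)
    txt = text_embedding(text_ids)                      # (T, 768)
    # The decoder then attends over this single mixed sequence.
    return torch.cat([boi, img_tok, eoi, txt], dim=0)

seq = build_multimodal_sequence(torch.tensor([11, 42, 7]),
                                torch.randn(2048), boi_id=1, eoi_id=2)
print(seq.shape)  # torch.Size([6, 768])
```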

3. Chain of thought

GPT-4's logical reasoning is significantly stronger than ChatGPT's; during training it likely used chain-of-thought prompting to construct training samples. Chain of thought supports not only plain text input but also multimodal image-and-text input. We devote Section 2.2 to this important topic.

4. Capability prediction

When we train a model on a specific task, we would like to predict its final performance on that task in advance; this is capability prediction. In NLP and large language models, capability prediction means forecasting and evaluating a model's performance on a given task, domain, or scenario.

The purpose of capability prediction is to understand model performance better in order to optimize, tune, or improve the model; it reveals the model's strengths and limitations and provides valuable feedback for further development. GPT-4 used capability prediction during training, which allowed OpenAI to estimate the final model's quality from small-scale runs and thus save training cost.
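
The GPT-4 report states that OpenAI predicted the final model's loss from much smaller training runs. A minimal sketch of the idea, fitting a simple power law in compute to small-run losses and extrapolating (the data points are invented, and the report's irreducible-loss term is omitted):

```python
import numpy as np

# Invented (compute, final loss) points from hypothetical small runs.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.9, 3.3, 2.85, 2.5])

# Fit loss ~ a * compute^b, i.e. a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
predict = lambda c: np.exp(log_a) * c ** b

print(predict(1e25))  # extrapolated loss of the full-scale run
```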

2.2 Logical Reasoning Ability

To improve GPT-4's reasoning ability, OpenAI likely drew on two important recent LLM techniques: chain of thought and self-improvement. Both can be seen as targeted optimizations of prompt learning for logical reasoning, and we introduce them separately below. The GPT-4 technical report contains evidence that much of GPT-4's training used chain of thought or self-improvement.

2.2.1 Chain of thought

Chain of thought originally refers to the series of related associations triggered in human thinking by a viewpoint, idea, or stimulus. Such associations are built and strengthened through memory, experience, knowledge, emotion, and awareness, and eventually form an organic chain of thinking that helps people understand and solve problems, make decisions, and act. The chain of thinking is an important part of human cognition, reflecting a person's way of thinking, habits, and efficiency; by building and strengthening it, people can better grasp the nature and laws of things and solve problems more effectively.

In artificial intelligence, researchers are likewise exploring how to use machine learning and natural language processing to simulate human chains of thought, helping machines better understand and process human language and behavior for more intelligent applications and systems. The chain-of-thought prompting paper [6] is of great significance in this direction, and its technique is very likely used by GPT-4: the authors propose improving a model's reasoning ability by constructing chain-of-thought prompts. Chain of thought is itself an emergent ability, and providing only a few exemplars can greatly improve a model's logical reasoning.

The difference between chain-of-thought prompting and traditional prompt learning is that a reasoning process is added to the prompt, forming a triplet of (input, chain of thought, output). Figure 9 shows a traditional prompt next to a chain-of-thought prompt.

Figure 9: Traditional prompt learning versus chain-of-thought prompting. The chain-of-thought prompt includes the reasoning process in the input to help the model learn to reason.
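
For instance, a minimal chain-of-thought few-shot prompt can be assembled like this (the exemplar follows the style of the examples in [6]):

```python
# One worked exemplar whose answer spells out the reasoning steps,
# followed by the new question we actually want solved.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more.
How many apples do they have?
A:"""
# Sent to the model, this elicits step-by-step reasoning before the answer.
print(cot_prompt)
```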

Chain of thought also supports multimodal input, and the GPT-4 technical report states that GPT-4 uses a multimodal chain of thought. The GPT-4 example in Figure 13 is a classic prediction that includes reasoning, precisely because the model was trained with chain of thought. Figure 10 shows the framework of Multimodal-CoT [14], a multimodal chain of thought recently published by Shanghai Jiao Tong University and Amazon.

It consists of two stages that share parameters. In the first stage, images and text are fed into the model to generate a rationale, i.e., a chain of thought. In the second stage, the original input and the generated rationale are fed into the model together to generate the answer.

Figure 10: Inference process of Multimodal-CoT
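
A minimal sketch of this two-stage pipeline (the `model` function is a hypothetical stand-in for the shared-parameter fusion model of [14]):

```python
def model(text: str, image_features=None) -> str:
    """Hypothetical stand-in for the shared-parameter model of [14]."""
    return "(model output for: " + text.splitlines()[-1] + ")"

def multimodal_cot(question: str, image_features) -> str:
    # Stage 1: generate the rationale (chain of thought) from image + text.
    rationale = model(f"Question: {question}\nRationale:", image_features)
    # Stage 2: feed the original input plus the rationale to infer the answer.
    return model(f"Question: {question}\nRationale: {rationale}\nAnswer:",
                 image_features)

print(multimodal_cot("Which object is heavier?", image_features=None))
```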

2.2.2 Self-improvement

An article published by Google in 2022 [7] pointed out that combining an LLM with chain of thought allows the model to self-improve using unsupervised data. The core method is shown in Figure 11. The GPT-4 report also states that OpenAI used the scheme of [7] to improve the model's ability to follow user intent.

Figure 11: An LLM can self-improve using its own generated reasoning paths

Its computation process is as follows (a code sketch follows the list):

  1. First, build prompts based on the chain of thought;
  2. With different temperature coefficients, the model generates multiple different paths, each containing a reasoning process;
  3. Use voting to select the most likely correct answer;
  4. Use all paths containing that answer to fine-tune the LLM.
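
Here is a minimal sketch of steps 2 and 3, sampling several reasoning paths and keeping the majority answer (the `generate` function is a hypothetical stand-in for an LLM sampling call, with dummy behavior):

```python
from collections import Counter
import random

def generate(prompt: str, temperature: float) -> tuple[str, str]:
    """Hypothetical LLM call returning (reasoning_path, final_answer)."""
    answer = random.choice(["17", "19", "17", "17", "28"])  # dummy behavior
    return f"step-by-step reasoning at T={temperature:.2f}", answer

def self_consistency(prompt: str, n: int = 8):
    paths = [generate(prompt, temperature=random.uniform(0.5, 1.0))
             for _ in range(n)]
    votes = Counter(answer for _, answer in paths)
    best_answer, _ = votes.most_common(1)[0]
    # Paths agreeing with the majority answer become new training data.
    winning_paths = [p for p in paths if p[1] == best_answer]
    return best_answer, winning_paths

answer, training_paths = self_consistency("Q: ... A: Let's think step by step.")
print(answer, len(training_paths))
```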

You may have noticed that this method does not always produce the right answer. The authors draw two important conclusions from their experiments:

  1. The correct rate of an answer is highly correlated with its vote confidence, meaning the majority answer is likely the most correct among those generated;
  2. Even when the answers are wrong, adding them to the training data still helps model training.

After obtaining the reasoning paths, the authors construct four kinds of training input from each path:

  1. standard chain-of-thought prompts, i.e., (question, chain of thought, answer) triplets;
  2. traditional prompt learning, i.e., only question and answer;
  3. the question plus a "Let's think step by step" prompt, letting the model predict the reasoning steps;
  4. traditional QA, i.e., input the question and predict the answer.

Finally, to enrich the dataset, the authors propose two data augmentation schemes: one randomly combines two questions and lets the model generate new questions; the other lets the model generate reasoning steps that are added to the training set.

2.3 Ability to understand diagrams

Because GPT-4 supports image input, OpenAI's famous multimodal algorithm CLIP [8] is a natural starting point: it maps images and text into the same feature space through contrastive learning, as shown in Figure 12. Combined with CLIP's image encoder, image input for GPT-4 could be realized: we would need to train an image encoder aligned with GPT's text features, use the CLIP image encoder's output as image tokens, and add an embedding layer that encodes these tokens as GPT-4 feature vectors.

Figure 12: The structure of CLIP, which projects images and text to the same feature space through contrastive learning
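
As a sketch of the contrastive objective, here is a simplified version of CLIP's symmetric loss, in which matching image-text pairs lie on the diagonal of a similarity matrix (the encoders are assumed to have already produced the feature vectors):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))            # matches on the diagonal
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2

# Dummy batch of 4 already-encoded image/text feature vectors.
print(clip_loss(torch.randn(4, 512), torch.randn(4, 512)))
```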

Besides understanding the photo in Figure 2, the most amazing thing is that GPT-4 can also understand the academic figure in Figure 13, which contains many details. In an academic figure, the referenced symbols and the positional relationships between objects are very important. If GPT-4 can capture these details from a single image encoding, then that image encoder must itself show very strong emergent ability, and it likely also has a parameter count on the order of a hundred billion.

Figure 13: GPT-4 has the ability to understand specific details in academic images

Another possibility is that GPT-4's multimodal ability works like a multimodal large language model (MLLM). Microsoft's KOSMOS-1 demonstrated multimodal language abilities similar to GPT-4's, and it also showed very strong emergent ability in multimodal question answering, as shown in Figure 14.

KOSMOS-1 is a Transformer-decoder-based multimodal model that stitches data of different modalities into one sequence: for example, <s> and </s> delimit text input, and <image> and </image> delimit image input, where the image embedding is the feature vector computed by Microsoft's METALM [13]. We speculate that GPT-4 may draw on the ideas of KOSMOS-1, combined with some of OpenAI's own multimodal work.

Figure 14: Microsoft's KOSMOS-1 has emerged with very strong image understanding capabilities

For more of GPT-4's multimodal technical details, we will have to wait until its image interface opens and can be tested extensively.

2.4 Safer output

Existing deep learning models work by fitting the training set with a large model. For a generative model, the output content is not fully controllable, and GPT-4 is no exception. The GPT-4 technical report points out that the model can produce several types of risky output, such as hallucinations, harmful content, discrimination, disinformation, violence, privacy violations, and cybersecurity issues. GPT-4 has done a lot of work to mitigate these problems.

GPT-4's first measure for mitigating risky output is hiring more than 50 experts from different fields to act as a red team for adversarial testing. The red team's job is to pose dangerous questions, test the output GPT-4 gives, and try to attack it. Through this adversarial testing, OpenAI also collected a large amount of domain-expert data in different directions to improve GPT-4's safety.

2.4.1 Hallucinations

Hallucination is a very difficult problem for generative models. It refers to absurd or untruthful content produced by the model, i.e., confidently stated nonsense. This behavior becomes especially harmful as the model's output grows more fluent and persuasive. Model hallucinations can be attributed to the following causes:

  1. Data bias: the training set may contain biases or inaccuracies, and errors may affect the model's understanding of natural language;
  2. Data sparsity: the training set may contain little data on some topic, making the model's generation on that topic uncontrollable;
  3. Model structure: the model's structure and parameter count may limit its generalization and representation capabilities, leading to hallucinations in some aspects.

GPT-4 adopts two strategies to address this problem:

The first is training on data from ChatGPT. The advantage is that ChatGPT already had some ability to refuse to generate harmful content, and its data is more reliable than data crawled from the Internet. The problem is that ChatGPT's own flaws may be inherited by GPT-4, and relying on one model's output as another model's training data may lead to overfitting.

The second is employing NLP techniques to detect hallucinated samples generated by the model, including automatic and human evaluation. The advantage is that hallucinations can be detected and corrected effectively. The disadvantages are that automatic evaluation may miss some hallucinated samples due to flaws in the evaluation model, and that human evaluation is very costly.

In hallucination detection, Meta has made very important contributions. On the one hand, they proposed the hallucination detection task and produced the HADES dataset [15] for it. On the other hand, they proposed a detection method [16] that synthesizes hallucination data to pre-train and fine-tune the model. The resulting model can detect hallucinated words in a sentence to assess the authenticity of generated content, thereby reducing the probability of hallucinations. Figure 15 shows an example of this method in machine translation, where the parts labeled 1 correspond to hallucinated content. We speculate that OpenAI may have adopted a method or data similar to Meta's.

Figure 15: An example of the hallucination detection method proposed by FAIR, applied to machine translation

Specifically, OpenAI designed a multi-step process that uses GPT-4 itself to generate comparison data with and without hallucinations and adds it to the reward model's training set in Step 2 of Figure 6 (a code sketch follows the list):

  1. Feed prompt p into GPT-4 and get a response r1;
  2. Feed p and r1 into GPT-4 and instruct it to list all hallucinations; if none are found, continue generating until it lists hallucinations h1;
  3. Feed p, r1, and h1 into GPT-4 and instruct it to generate a response r2 without hallucinations;
  4. Feed p and r2 into GPT-4 and let it list all hallucinations; if none are detected, r1 and r2 can be put into the reward model's training set as a comparison pair.
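
A minimal sketch of this loop (the `ask` function is a hypothetical stand-in for a GPT-4 API call, and the instruction wording is paraphrased):

```python
def ask(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 completion call."""
    raise NotImplementedError

def make_comparison_pair(p: str):
    r1 = ask(p)
    h1 = ask(f"Prompt: {p}\nResponse: {r1}\nList every hallucination in the response.")
    if h1.strip().lower() == "none":
        return None  # r1 is already clean; nothing to contrast
    r2 = ask(f"Prompt: {p}\nResponse: {r1}\nHallucinations: {h1}\n"
             f"Rewrite the response without these hallucinations.")
    h2 = ask(f"Prompt: {p}\nResponse: {r2}\nList every hallucination in the response.")
    if h2.strip().lower() == "none":
        return (r1, r2)  # reward model learns to prefer r2 over r1
    return None
```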

2.4.2 Other issues

For other potential risky outputs, OpenAI did not introduce its technical solutions in detail, but from what is described we can see they probably used the following kinds of methods:

  1. use RBRM to detect possible risks;
  2. teach the model to refuse such questions through prompt learning;
  3. use the red team to find these potential problems;
  4. filter the training data, deleting samples that may cause risk;
  5. train the reward model to penalize harmful output.

2.5 Programming ability

GPT-4's programming ability is hugely improved over ChatGPT's. On the one hand, chain of thought may give it stronger logical analysis ability; on the other hand, it likely draws on OpenAI's famous code generation algorithm CodeX [3]. CodeX is a derivative of GPT-3 for code generation and the underlying algorithm behind the Copilot plugin. CodeX adopts the decoder-only architecture of the GPT series, with model sizes ranging from 12M to 12B parameters. CodeX training has two stages: pre-training and fine-tuning.

In the pre-training phase, OpenAI first crawled a large number of Python files from GitHub and, after cleaning, obtained a 159 GB training set. Because CodeX is a code generation model, it neither uses GPT-3's trained weights nor copies GPT-3's hyperparameters exactly; instead, a code generation model was trained from scratch.

In the fine-tuning stage, OpenAI collected about 40,000 samples from competition websites, interview websites, and GitHub unit-test scripts. To evaluate code correctness, CodeX does not use the traditional BLEU score but the percentage of unit tests the code passes, establishing the HumanEval evaluation set and the pass@k metric.

To avoid data leakage, all of HumanEval's data was constructed by humans; it contains 164 problems and a large number of test cases. HumanEval divides each function into four parts: function signature, docstring, function body, and unit tests. In prompt learning, the signature and docstring serve as the input prompt, the function body as the required output, and the unit tests as the judge of the generated code.

CodeX's evaluation is similar to LeetCode's, i.e., how many test cases pass. Its metric pass@k means: randomly draw k of the model's generated answers, and measure the probability that at least one of the k is correct. With n answers generated per question and c of them passing the unit tests, it is computed as in Equation (1):

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \tag{1}$$
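
A numerically stable way to compute this estimator, following the product form suggested in the CodeX paper [3]:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem:
    n = total samples, c = correct samples, k = draw size."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))    # 0.05
print(pass_at_k(n=200, c=10, k=100))  # close to 1.0
```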

Both CodeX and GPT-4 are successors to GPT-3, so it is entirely reasonable for GPT-4 to reuse CodeX's ready-made ideas and data to improve its programming ability.

2.6 Multilingual ability

OpenAI gave no introduction to the substantial improvement of GPT-4's ability in other languages, and I found no relevant explanation elsewhere. Based on current technology, I guess OpenAI may have used the following solutions:

  1. improved training data for other languages;
  2. a larger model that gives GPT-4 more capability in low-resource languages;
  3. added tasks for low-resource languages, e.g., constructing prompt-based machine translation tasks from existing parallel corpora, or using machine translation engines to translate part of the data into low-resource languages.

There is indeed not much information on this part; you are welcome to offer your own guesses in the comment section.

2.7 Long sequence capability

Long sequences here cover two aspects: on the one hand, GPT-4 supports multi-round dialogue; on the other hand, it supports longer input data. Let us discuss the technologies each may use.

2.7.1 Multiple Rounds of Dialogue

Both ChatGPT and GPT-4 support continuous dialogue, but OpenAI has not revealed the technical solution behind it. The crude approach is to feed all previous dialogue turns back in as input at every round. Although this works in theory, its biggest problem is that the input grows rapidly with the number of rounds, which should make ChatGPT's or GPT-4's prediction slower and slower; yet I did not notice any such gradual slowdown in multi-round conversations with ChatGPT or GPT-4.

If we want to solve this problem from the model side, there happens to be an algorithm for it: Transformer-XL [10]. An important improvement of Transformer-XL is the segment-level recurrence mechanism shown in Figure 16. This mechanism is similar to a combination of Transformer and RNN: for arbitrarily long variable-length data, each segment has a fixed length; when computing a segment, the cached features of the previous segment are added to the current segment's computation, so the model can handle inputs of arbitrary length.

Figure 16: Transformer-XL's segment-level recurrence mechanism

Regarding multi-round dialogue in ChatGPT and GPT-4, I speculate that OpenAI borrowed the idea of Transformer-XL's segment recurrence. That is, when computing round t, GPT-4 adds the cached features of round t-1 to the current round's computation. Because round t-1 in turn considered round t-2, this method can in theory incorporate many previous rounds of dialogue without increasing prediction time.
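
Here is a minimal sketch of segment-level recurrence in the spirit of Transformer-XL, using a single attention layer with a detached memory (the dimensions and single-layer structure are simplifying assumptions):

```python
import torch
import torch.nn as nn

class RecurrentSegmentAttention(nn.Module):
    """One attention layer whose keys/values include cached hidden states
    of the previous segment, as in Transformer-XL's segment recurrence."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.memory = None  # hidden states of the previous segment

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        if self.memory is None:
            context = segment
        else:
            # Reuse the cached segment as extra context, without gradients.
            context = torch.cat([self.memory, segment], dim=1)
        out, _ = self.attn(segment, context, context)
        self.memory = segment.detach()  # cache for the next segment
        return out

layer = RecurrentSegmentAttention()
for _ in range(3):  # three fixed-length segments of one long input
    out = layer(torch.randn(1, 16, 64))
print(out.shape)  # torch.Size([1, 16, 64])
```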

2.7.2 Long sequence input

The traditional Transformer is not good at long sequences because its complexity in input length n is O(n^2). The default Transformer input length is 512; inputs longer than 512 are split into multiple 512-length blocks, which causes context fragmentation. The Transformer-XL introduced in the previous section is one solution to this problem.

Here we introduce OpenAI's own algorithm for long inputs: the Sparse Transformer [11]. Because GPT-3 mixes ordinary Transformer and Sparse Transformer layers, the Sparse Transformer is also very likely the model GPT-4 uses for long input text, although how it mixes with the ordinary Transformer is unknown. The Sparse Transformer's characteristic is that it attends only to the Top-k most contributive features, replacing the Transformer's dense attention with sparse attention and reducing the complexity of attention to O(n√n). The traditional Transformer's dense attention kernel is factorized into strided attention and fixed attention, with each attention kernel split into a row kernel and a column kernel. The factorized kernels are all sparse, greatly reducing the model's complexity, as shown in Figure 17.

Figure 17: Dense and sparse attention
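
For intuition, here is a minimal sketch that builds the strided sparse attention mask described above (only the boolean mask; the factorized-head details of [11] are omitted):

```python
import torch

def strided_sparse_mask(n: int, stride: int) -> torch.Tensor:
    """Boolean attention mask where position i attends to (a) the previous
    `stride` positions (row attention) and (b) every stride-th earlier
    position (column attention), instead of all i earlier positions."""
    i = torch.arange(n).unsqueeze(1)  # query positions
    j = torch.arange(n).unsqueeze(0)  # key positions
    causal = j <= i
    local = (i - j) < stride              # recent neighborhood
    columns = ((i - j) % stride) == 0     # strided column pattern
    return causal & (local | columns)

mask = strided_sparse_mask(n=16, stride=4)
print(mask.sum().item(), "of", 16 * 16, "entries attended")
```

With stride ≈ √n, each row attends to O(√n) positions, which is where the O(n√n) total complexity comes from.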

Because GPT-4 supports longer sequences, I have listed these two Transformer variants for efficiently handling long inputs. Since the GPT-4 technical report reveals so little, we can only wait for OpenAI's official announcement of GPT-4's actual network structure.

2.8 Summary of technical solutions

In this section we have discussed many technical solutions, some of which are fairly credible and others largely speculative. The table below gives the credibility of each option (increasing from 1 to 5).

Technique     Emergence  Chain of thought  Self-improvement  CLIP  KOSMOS-1  CodeX  XLM  Transformer-XL  Sparse Transformer
Credibility   5          5                 3                 3     3         4      1    1               4

Based on the above speculation, we can guess that GPT-4's technical pipeline is roughly as follows:

  • Stage 1: build and fine-tune a multimodal pre-trained model. The main goal of this stage is to train a first version of GPT-4 with basic capabilities from massive crawled data, with a training method similar to GPT-3's. The work focuses on two points: first, build a multimodal pre-trained model based on KOSMOS-1 or another multimodal model, using Transformer-XL to handle the high complexity of long text; second, collect data to train the model, including massive crawled data, single-modal and multimodal data, traditional prompt learning data, chain-of-thought prompt data, and code data.
  • Stage 2: GPT-4 behavior alignment. The main goal is to align the model's behavior with human preferences through human labeling and to reduce its risks. Two models are produced here: a rule-based reward model (RBRM) designed from expert knowledge, and a deep-learning-based reward model (RM) trained on human-labeled data and the output of the hallucination detection pipeline.
  • Stage 3: use RBRM and the RM as the reward function and train the model with RLHF. Stages 2 and 3 are similar to ChatGPT's training.
  • Stage 4: model self-improvement. GPT-4's training may be a cyclic, continuously prompted iteration: at this stage GPT-4 automatically generates more data, such as self-improvement training data and test cases from the expert red team's feedback, and this data flows back into Stage 1 to train the model again.

3. The development direction of GPT-4

Recently I have applied GPT-4 and ChatGPT to my daily work, and I am deeply impressed by GPT-4's power. It assists me with everyday programming and article writing, and even helps with daily chores, greatly improving my efficiency. The Internet is full of articles praising and criticizing GPT-4; here I will combine the technical solutions we analyzed to discuss GPT-4's development directions, or rather to predict what GPT-5 may look like.

3.1 Optimization directions for GPT-4

Although GPT-4 has demonstrated strong capabilities in text generation, code generation, image understanding, and logical reasoning, it still has much room for improvement. Future work will likely focus on the following directions:

  1. The cost of using GPT-4 is still very high: one round of dialogue costs about 1 yuan, and ChatGPT's operating cost is nearly 1 million dollars per day. Since we predict that GPT-4's parameter count may approach the trillion scale, we speculate its operating cost could be around 5 million dollars per day. Making the model lighter, so that more people can use GPT-4 or even train their own, will be a future research direction.
  2. GPT-4 is not absolutely safe and still hallucinates. Hallucination detection, red-team testing, RBRM, and the like are not the final answer to safety. Although no system is absolutely safe, OpenAI has invested heavily in safety to mitigate the legal risks it may face.
  3. GPT-4 is still an offline model. An important reason it cannot replace search engines is that its knowledge is not updated in real time: its knowledge depends on the cutoff date of its crawled data, so it cannot address news, concepts, or events that arise after that date.
  4. GPT-4 is still a preliminary exploration of multimodality. Multimodality and LLMs may be the two most important AGI directions of the next few years, and OpenAI itself has much excellent multimodal work. Further tapping GPT-4's multimodal abilities, covering more modalities and more applications, will be OpenAI's next key work.

3.2 Applications of GPT-4

GPT-4, with its powerful generation and logical reasoning capabilities, can greatly change the way we work. I believe many readers of this article do algorithm-related research and engineering, and I encourage everyone to use GPT-4, or at least ChatGPT. So which of GPT-4's functions help us most? Based on my experience, here are several directions I find most helpful:

  1. Writing utility code. Asking GPT-4 to write a complex framework for a specific purpose may require elaborate prompts, and you still need to check the generated code. But for less difficult utility functions, such as building a network layer or implementing a helper function, the usability of GPT-4's generated code is very high.
  2. Polishing text. As technical R&D staff, our writing may not be strong; GPT-4 can help polish the articles we write. Especially when writing papers or emails in English, GPT-4 can help us fix Chinglish.
  3. Reading papers. GPT-4 is not only a great machine translation tool, whose translations in my trials far exceed traditional machine translation models in professionalism and coherence; it can also summarize, condense, and extract, letting us quickly grasp a paper's core techniques. ChatPDF, built on ChatGPT, is a very powerful paper-reading assistant. Figure 18 shows ChatGPT helping me read about GPT-4.
    Figure 18: ChatPDF's summary, generated from the GPT-4 technical report, of GPT-4's work on improving safety
  4. Daily work. GPT-4 is very good at writing announcements, speeches, thank-you letters, and similar content, and at summarizing work, improving our efficiency in these areas. When I am out of ideas, I also ask GPT-4, which often helps me open my mind.

Note that GPT-4 has not completely solved safety issues such as hallucination. We had better strictly review content generated by GPT-4 before using it, otherwise inexplicable problems may follow. For the same reason, GPT-4 cannot replace the professionals who work in a given area: until its safety issues are resolved, professionals will always be needed to check its output, and these safety issues may accompany the entire life cycle of generative models.

4. Other LLMs

With the release of ChatGPT and GPT-4, companies at home and abroad quickly followed suit, setting off a wave of LLM development; many companies have announced their own LLMs.

Representative work in China includes the following.

  • Baidu's Wenxin Yiyan: Wenxin Yiyan (ERNIE Bot) is the earliest domestic follow-up to the large pre-trained models, but Baidu has kept its technology secret. Judging from the demo and many testers' results, Wenxin Yiyan looks like an engineering combination of Baidu's many existing AI capabilities.
  • Alibaba's Tongyi Qianwen: Tongyi Qianwen is a text generation model built on Transformer-XL with about 2 billion parameters. According to feedback from netizens with invitation codes, its text generation is slightly worse than Wenxin Yiyan's.
  • SenseTime's Ririxin: judging from its launch demonstration, SenseTime's Ririxin is currently among the best domestic LLMs, even achieving an effect similar to ChatGPT's. Ririxin includes five main functions: "Consultation", "Miaohua", "Ruying", "Qiongyu", and "Gewu", of which "Consultation" is benchmarked against GPT-4.
  • Tsinghua University's GLM: GLM [17] is an open-source bilingual (Chinese-English) language model jointly launched by Tsinghua University and Zhipu AI, with up to 130 billion parameters; GLM-130B's quality sits between GPT-3 and ChatGPT. The team also released ChatGLM and ChatGLM-6B, which can run and be fine-tuned on a single machine and is currently the best open-source Chinese pre-trained large model.
  • Fudan University's MOSS: MOSS comes from Professor Qiu Xipeng's team at Fudan's NLP laboratory, which recently open-sourced the related code. Judging from its current performance, MOSS is not yet very mature, but the good news is that the team keeps optimizing it.

Domestic companies are not the only fast followers; leading foreign companies have also launched their own LLMs, among which the representative ones are:

  1. Meta AI's LLaMA: LLaMA [19] comes in four sizes: 7 billion, 13 billion, 33 billion, and 65 billion parameters. Unlike OpenAI, Meta AI has open-sourced the code and models and supports single-machine deployment. Although LLaMA's quality falls short of GPT-4, its openness has attracted secondary development by many institutions and individuals.
  2. Google's PaLM and LaMDA: PaLM [20] is a Google language model with a structure similar to the GPT series and 540 billion parameters; Google recently extended it into the multimodal PaLM-E [21]. LaMDA [22] is a Google language model aimed at more natural, human-like dialogue; compared with models like GPT-3, it adds more dialogue scenarios and emotional understanding, simulating human conversation and thinking more closely. Google researcher Blake Lemoine even lamented, after testing LaMDA for a while, that it might already have a personality.
  3. Anthropic's Claude: Anthropic is an AI company founded by former OpenAI employees and backed by Google investment; it recently launched its LLM, Claude. At present, Claude appears slightly stronger than ChatGPT but clearly weaker than GPT-4.

Beyond these, foreign LLMs also include BigScience's BLOOM, Stanford's Alpaca, and Microsoft's METALM and KOSMOS-1, while domestic ones include Huawei's Pangu and Tencent's WeLM. In addition to these general-purpose models, LLMs are also applied in specialized fields, such as HuaTuo [23] in medicine and BloombergGPT [24] in finance.

5. Summary

Whether GPT-4 will bring about a fourth industrial revolution is a question that needs time to answer, and I am not qualified to give a conclusion here; but GPT-4 has had a huge impact on me personally.

First, it has shaken my understanding of traditional artificial intelligence to some extent. Just as many theorems of macroscopic physics do not hold in microscopic physics, much of the experience I accumulated in traditional artificial intelligence does not hold for GPT-4. Its powerful zero-shot learning and higher-level capabilities go far beyond my traditional understanding of deep learning.

Second, GPT-4 and ChatGPT are becoming the most powerful assistants in daily work. GPT-4 also provided great help in writing this article: it helped me write code and revise the text, and even solved some non-work problems for me. Finally, the many large models springing up like mushrooms have injected new confidence and vitality into my increasingly pessimistic view of deep learning.

As for GPT-4's technology, I suggest everyone get familiar with it and learn to use it. Whether or not your work is computer-related, it will bring you some help; even if you are a cook, it may generate a delicious recipe for you. At the same time, we must look at its generated content rationally: as long as GPT-4 carries even a slight risk, we cannot relax our review, lest hallucination problems cause us losses.

In the future, GPT-4 will surely affect us in many ways. First, a flood of GPT-4-generated content indistinguishable from human writing will appear on the Internet; whether the public will be shaped by GPT-4's uniform behavior patterns is worth pondering. Second, GPT-4 will greatly boost productivity in some jobs and may even replace them; whether we can seize this opportunity and spot new openings in this shifting environment matters a great deal. Finally, GPT-4 will affect each person differently; if GPT-4 really brings AGI, I hope my friends will not miss it.

Reference

  • [1] https://cdn.openai.com/papers/gpt-4.pdf

  • [2] https://zhuanlan.zhihu.com/p/614340292

  • [3] Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code[J]. arXiv preprint arXiv:2107.03374, 2021.

  • [4] Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).

  • [5] Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models[J]. arXiv preprint arXiv:2206.07682, 2022.

  • [6] Wei J, Wang X, Schuurmans D, et al. Chain of thought prompting elicits reasoning in large language models[J]. arXiv preprint arXiv:2201.11903, 2022.

  • [7] Huang J, Gu S S, Hou L, et al. Large language models can self-improve[J]. arXiv preprint arXiv:2210.11610, 2022.

  • [8] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.

  • [9] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.

  • [10] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V.Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

  • [11] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

  • [12] Huang, Shaohan, et al. "Language is not all you need: Aligning perception with language models." arXiv preprint arXiv:2302.14045 (2023).

  • [13] Hao, Yaru, et al. "Language models are general-purpose interfaces." arXiv preprint arXiv:2206.06336 (2022).

  • [14] Zhang, Zhuosheng, et al. "Multimodal chain-of-thought reasoning in language models." arXiv preprint arXiv:2302.00923 (2023).

  • [15] Liu, Tianyu, et al. "A token-level reference-free hallucination detection benchmark for free-form text generation." arXiv preprint arXiv:2104.08704 (2021).

  • [16] Zhou, Chunting, et al. "Detecting hallucinated content in conditional neural sequence generation." arXiv preprint arXiv:2011.02593 (2020).

  • [17] Du, Zhengxiao, et al. "GLM: General language model pretraining with autoregressive blank infilling." Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . 2022.

  • [18] Zhao, Wayne Xin, et al. "A Survey of Large Language Models." arXiv preprint arXiv:2303.18223 (2023).

  • [19] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).

  • [20] Chowdhery, Aakanksha, et al. "Palm: Scaling language modeling with pathways." arXiv preprint arXiv:2204.02311 (2022).

  • [21] Driess, Danny, et al. "Palm-e: An embodied multimodal language model." arXiv preprint arXiv:2303.03378 (2023).

  • [22] Thoppilan, Romal, et al. "Lamda: Language models for dialog applications." arXiv preprint arXiv:2201.08239 (2022).

  • [23] Wang, Haochun, et al. "HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge." arXiv preprint arXiv:2304.06975 (2023).

  • [24] Wu, Shijie, et al. "BloombergGPT: A Large Language Model for Finance." arXiv preprint arXiv:2303.17564 (2023).

  • [25] Bubeck, Sébastien, et al. "Sparks of artificial general intelligence: Early experiments with gpt-4." arXiv preprint arXiv:2303.12712 (2023).

  • [26] Lin, Stephanie, Jacob Hilton, and Owain Evans. "Truthfulqa: Measuring how models mimic human falsehoods." arXiv preprint arXiv:2109.07958 (2021).

Author: JD Retail Liu Yan

Content source: JD Cloud developer community
