Large Language Models, Part 2: A Brief History of GPT

Thanks to advances in data, model architectures, and parallel computing power, applications of large language models are developing rapidly, and large neural language models have become a technology that cannot be ignored.
GPT has made breakthrough progress in natural language processing (NLP) tasks, while the diffusion model has the potential to become the representative of the next generation of image-generation models, improving on both quality and efficiency with higher accuracy, scalability, and parallelism. The Transformer and the diffusion model are the popular players in AIGC. The main reason is not that they are intrinsically more efficient (for example, fewer parameters for better results) but that they bring scalability and parallelism; "brute force produces miracles" now has a concrete way to exert itself.

For a PyTorch-based toy example of the Transformer applied to translation, refer to the earlier blog post.

The AI wave since 2012 has three main characteristics:

  1. Versatility: one architecture can often solve multiple problems, and the same family of models can process both text and images
  2. Power: neural networks and deep learning go far beyond traditional methods, can reach human-like ability, and have surpassed humans on some specific tasks
  3. Scalability: scale works wonders; the larger the model, the stronger the performance (and, of course, the architectures applied to large models keep evolving)

This article mainly introduces the development of large language models from 2017 onward. That year, the cornerstone of the neural network architecture behind today's large language models was proposed, and the launch of ChatGPT looks like it may be a singularity-level event in the AI world, with the potential to rewrite human history. It is therefore worth spending some time on the development history of this technology.

Companies and applications built on language model technology are showing explosive growth. You can view the latest open-source large language models on the leaderboard on the Hugging Face website. From the current point of view, Meta's LLaMA is likely the first and most successful open-source, commercially usable large language model.
We will return to LLaMA later; let's first look at the development of large language models in chronological order.

A short history of development

Here we take the most influential models as examples. Many large language models are closed-source, and none of those closed-source models surpassed the OpenAI model of the same period, so they are not listed here. The best open-source large language model at present is LLaMA. The evolution of these two model lines, shown in the figure below, represents the development of large language models overall.
(Figure: timeline of influential large language models, centered on the GPT and LLaMA lines.)
The Transformer, proposed by Google in 2017, was an encoder-decoder seq2seq model for machine translation, and it happened to be noticed by OpenAI, founded at the end of 2015, in particular by its co-founder and chief scientist Ilya Sutskever. Judging from the public conversation between Ilya Sutskever and Jensen Huang the day after the GPT-4 launch, OpenAI had long held the view that large models plus large data could realize intelligence, and the Transformer architecture is very well suited to large-scale compute, so it was naturally adopted by OpenAI.

Judging from its current trajectory, OpenAI seems to want to develop the neural network into real artificial intelligence. The first step was to train a neural network model on text, collecting a large amount of text data from the Internet and other channels. This text data is, in essence, a mapping of the world, and OpenAI wants the model to understand the world through this layer of mapping. OpenAI then proposed a multi-modal model with GPT-4, which can process not only text but also images; its purpose is to endow the model with human-like visual ability, so that it can understand the world through the fused mapping of vision and text.

Table 1: Release time, parameter count, and pre-training data volume of GPT-era models

| Model | Release time | Layers | Heads | Word-vector length | Parameters | Pre-training data |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-1 | June 2018 | 12 | 12 | 768 | 117 million | ~5 GB |
| BERT base/large | October 2018 | 12/24 | 12/16 | 768/1024 | 110 million / 340 million | – |
| GPT-2 | February 2019 | 48 | – | 1600 | 1.5 billion | 40 GB |
| GPT-3 | May 2020 | 96 | 96 | 12288 | 175 billion | 45 TB |
| GPT-3.5 | – | – | – | – | – | – |
| LLaMA | February 2023 | – | – | – | 7 / 13 / 33 / 65 billion | 4.5 TB |
| LLaMA-2 | July 2023 | – | – | – | 7 billion to 70 billion | 6.3 TB |

GPT

GPT, or Generative Pre-trained Transformer, is a pre-trained model that uses the decoder part of the Transformer architecture proposed by Google in 2017 as its model infrastructure. The Transformer is a neural network architecture based on the self-attention mechanism, and natural language processing now basically adopts this architecture. Its advantage is that it completely abandons the traditional recurrent structure and instead uses the attention mechanism to compute implicit representations of the model's inputs and outputs. If you are not familiar with the structure of the Transformer, see the earlier blog article.

The published papers so far cover text pre-training with GPT-1, GPT-2, GPT-3, and ChatGPT; GPT-4, a multi-modal model (as of July 2023 there is no formal architecture paper); and image pre-training with iGPT. OpenAI's language model was originally just one of many and not conspicuous at all, until the release of GPT-3 made the industry realize that once a model is large enough it undergoes a qualitative change in both memorization and generalization.

GPT uses the Transformer's decoder structure with some changes: the original decoder block contains two multi-head attention sub-layers, while GPT keeps only the masked multi-head attention, as shown in the figure below.

(Figure: GPT's modified Transformer decoder block, which keeps only the masked multi-head attention sub-layer.)
GPT uses a sentence sequence to predict the next word, so masked multi-head attention is used to hide the words that follow the current one and prevent information leakage. For example, given a sentence containing four words [A, B, C, D], GPT needs to use A to predict B, [A, B] to predict C, and [A, B, C] to predict D. When using A to predict B, the positions [B, C, D] must be masked.


Why does GPT use only the decoder part? A language model uses the preceding context to predict the next word. Because the decoder applies masked multi-head self-attention to hide the content after the current word, the decoder is a ready-made language model. And because no encoder is used, there is no need for encoder-decoder (cross) attention.
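A minimal PyTorch sketch (illustrative only, not OpenAI's code) of the causal mask just described: each position may attend only to itself and to earlier positions, so A cannot see B, C, or D.

```python
import torch

# Toy sequence of 4 positions: [A, B, C, D]
seq_len = 4

# Lower-triangular causal mask: row i may attend to columns 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# Applying the mask to raw attention scores: disallowed positions are set to
# -inf so that softmax assigns them zero weight.
scores = torch.randn(seq_len, seq_len)                  # stand-in for q @ k^T
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                 # row A attends only to A, etc.
print(weights)
```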

GPT 1-3

GPT-1 is trained in two stages: unsupervised pre-training and supervised fine-tuning. A general model is first pre-trained and then fine-tuned on each sub-task, which removes the trouble, present in traditional methods, of designing a custom model for each task.
After the Transformer model has been pre-trained, the model itself does not change no matter how the sub-task changes; only the input formatting at the front and the output layer at the back are adjusted.
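A hedged sketch of this "fixed backbone, swappable head" idea. The names and sizes here (`PretrainedBackbone`, a two-layer Transformer, a classification head) are made up for illustration; a real GPT backbone is a causal decoder with pretrained weights loaded.

```python
import torch
import torch.nn as nn

class PretrainedBackbone(nn.Module):
    """Stand-in for a pretrained GPT-style model (weights assumed already loaded)."""
    def __init__(self, vocab_size=50000, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # kept small for the sketch

    def forward(self, input_ids):
        return self.encoder(self.embed(input_ids))       # (batch, seq, hidden)

class ClassificationFineTuner(nn.Module):
    """Fine-tuning: reuse the backbone unchanged and add a small task-specific head."""
    def __init__(self, backbone, num_labels, hidden_size=768):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_size, num_labels)   # the only new parameters

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)
        return self.head(hidden[:, -1])                  # predict from the last token

model = ClassificationFineTuner(PretrainedBackbone(), num_labels=2)
logits = model(torch.randint(0, 50000, (1, 16)))
print(logits.shape)   # torch.Size([1, 2])
```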

The GPT-1 and GPT-2 model structures are similar, except that GPT-2's model and dataset are both larger (Table 1). GPT-2 introduced the ideas of multi-task training and zero-shot learning, built on the view that "every supervised learning task is a subset of the unsupervised language model", a predecessor of prompt learning. The news articles generated by GPT-2 were good enough to fool most humans into taking them for real, and many portal sites banned the use of GPT-2-generated news.
The performance of GPT-3 far exceeds that of GPT-2. GPT-3 introduced the concept of few-shot learning and adopted the sparse self-attention of the Sparse Transformer, which can be understood as a more efficient self-attention layer. Besides common NLP tasks, it can also write code in languages such as SQL and JavaScript, and it performs reasonably well on simple mathematical operations. GPT-3 relies on in-context learning, a kind of meta-learning: the core idea is to use a small amount of data to find a suitable starting point so that the model can fit quickly on a limited data set, and it achieved good results. Compared with GPT-2, GPT-3's model and parameter count are much larger, which moved the industry's attitude toward "big" language models from understanding to action.
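A toy illustration of zero-shot versus few-shot prompting, using translation examples in the style of the GPT-3 paper: the "training examples" live in the prompt itself, and no gradient update takes place.

```python
# Zero-shot: instruction only.
zero_shot = "Translate English to French.\ncheese =>"

# Few-shot (in-context learning): a handful of demonstrations precede the query;
# the model's weights are never updated.
few_shot = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# Passing either string to a GPT-3-style completion API would be expected to
# continue with the French translation ("fromage"); the few-shot version
# usually succeeds more reliably.
print(few_shot)
```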

GPT-3 used almost all the text data available on the Internet as its training corpus; the filtered training data amounts to roughly 500 billion tokens, of which the seemingly huge Wikipedia corpus accounts for only 0.6%.

GPT-3 has also received criticism. The pre-trained model, with its enormous parameter count, is trained on massive data that has not been manually cleaned, so it contains false, biased, useless, and harmful samples that do not align with human values, and no one can guarantee that the pre-trained model will not output similar answers. This is the motivation for InstructGPT and ChatGPT. The InstructGPT paper summarizes the optimization goals as 3H: Helpful, Honest, Harmless.

The technical documents released by OpenAI at this stage are as follows:
2018, GPT-1: "Improving Language Understanding by Generative Pre-Training" (announced in the blog post "Improving Language Understanding with Unsupervised Learning")
2019, GPT-2: "Language Models are Unsupervised Multitask Learners"
2020, GPT-3: "Language Models are Few-Shot Learners" (the line that ChatGPT later builds on)
2022, InstructGPT (a warm-up for GPT-4, so it is also referred to as GPT-3.5): "Training language models to follow instructions with human feedback"

BERT

Paper: https://arxiv.org/abs/1810.04805

BERT stands for Bidirectional Encoder Representations from Transformers. It is the unsupervised pre-trained natural language processing model proposed by Google in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", and it is an acknowledged milestone model in NLP in recent years.

BERT is a natural language processing model that combines language-model pre-training with a fine-tuning stage for downstream tasks. In the year of its release it achieved state-of-the-art results on 11 NLP tasks, including named entity recognition (NER) and question answering.

The innovation of BERT is that it uses the Transformer Encoder as the feature extractor together with a masked-token (masked language model) training method. Because BERT uses a bidirectional encoding structure, it does not have the ability to generate text, but in encoding the input text it exploits the full context of every word; compared with a unidirectional encoder that can only use the preceding context to extract semantics, BERT's semantic extraction ability is stronger.
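A minimal sketch of the masked-token training signal, illustrative only: a fraction of the input tokens is replaced by a [MASK] symbol and the model must recover them using context from both directions.

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked = tokens.copy()
labels = [None] * len(tokens)

# BERT masks roughly 15% of tokens; here a single position is masked for clarity.
i = random.randrange(len(tokens))
labels[i] = masked[i]
masked[i] = "[MASK]"

print(masked)   # e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(labels)   # the model is trained to predict the original token at the
                # masked position, using both the left and the right context.
```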

Introduction to InstructGPT training process

This model is based on GPT-3 and was proposed in response to the criticism of GPT-3; it comes from a 2022 paper. Many later large language models, whether open-source or closed-source, use RLHF (reinforcement learning from human feedback); this model is essentially a fine-tuned version of GPT-3.

SFT and reinforcement learning are among the core ingredients that make this model commercially usable. Here is a brief introduction; see the next article for details.
(Figure: the InstructGPT training pipeline: self-supervised pre-training on the left, followed by three fine-tuning steps.)
The complete training process is shown in the figure above: the pre-training in the first column on the left is the self-supervised GPT-3 model, and the "Instruct" improvement of InstructGPT lies in the following three steps. Those three steps are also what many PhD students in labs without large budgets are researching, and of course many companies as well; well-funded companies want to keep all four parts in their own hands.

ChatGPT and InstructGPT are the same in model structure and training method: both use instruction learning and reinforcement learning from human feedback (RLHF) to guide training. They differ only in the way the data are collected.

The output of purely self-supervised learning is sometimes harmful, while fully supervised learning with labeled answers is costly and, because there is a single "correct" answer, its ceiling is not high. Reinforcement learning is different: humans no longer write the answers; there is no single correct answer, only a judgment of which answer is better, expressed through scoring. This is why reinforcement learning is one of the techniques used by essentially all top-ranked models on Hugging Face.

The fine-tuning process announced by OpenAI is divided into three parts:
1. Supervised fine-tuning (SFT): collect a data set of human-written demonstrations of the desired model output and use it to train a generative model (based on GPT-3.5).
2. Reward model (RM) training: collect a data set of human rankings over multiple model outputs, and train a reward model to predict which output humans prefer.
3. Reinforcement learning (PPO): iteratively optimize the generative model trained by supervised learning, using the reward model as the reward signal for fine-tuning.
The process is shown in the figure below, given on the OpenAI website.
(Figure: OpenAI's three-step fine-tuning process: SFT, reward-model training, and PPO-based reinforcement learning.)
The model is important, and the data is just as important. The data for the three fine-tuning steps are as follows:

Why the SFT model is needed: GPT-3 cannot necessarily generate answers that follow human instructions and are helpful and safe, so manually labeled data are needed for fine-tuning.
Why the RM model is needed: discriminative labeling (ranking candidate answers) costs much less than generative labeling (writing answers).
Why the RL model is needed: as the SFT model is fine-tuned, the distribution of its generated answers shifts, which biases the reward model's scores, so reinforcement learning is needed to keep optimizing against the reward model.

First of all, question and prompt sets need to be collected: annotators write questions and instructions, and users submit questions they want answered. A basic model is trained first and opened to users to try, while questions submitted by users continue to be collected. When splitting the data, it is divided by user ID, because questions from the same user tend to be similar and should not appear in both the training and validation sets.

SFT dataset

GPT-3 is a generative model driven by prompts, so the SFT data set also consists of prompt-answer pairs. Part of the SFT data comes from users of OpenAI's Playground, and the other part comes from about 40 labelers hired by OpenAI; in many cases, high-quality labeling is the precondition for model training. On this data set, the labelers' job is to write instructions, and the written instructions are required to satisfy the following three points:

  • Plain tasks: annotators write arbitrary simple tasks while ensuring diversity of the tasks;
  • Few-shot tasks: annotators write an instruction together with multiple query-response pairs for that instruction;
  • User-based: take use cases from the API interface and have annotators write instructions based on those use cases.
    About 13,000 samples. Annotators write the answers directly for the questions in the prompt set described above; this stage usually requires tens of thousands of high-quality labeled examples. A toy example of such a prompt-answer sample is sketched below.
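A hedged sketch of what one SFT sample might look like and how the loss is commonly restricted to the answer portion; the field names and the masking convention are illustrative, not OpenAI's actual data schema.

```python
# One SFT example: a prompt plus a human-written answer (field names are illustrative).
sft_example = {
    "prompt": "Explain the moon landing to a 6 year old in a few sentences.",
    "answer": "People built a big rocket, flew to the moon, walked on it, "
              "and came back to tell everyone about it.",
}

# Training text is typically prompt + answer concatenated; a common convention is
# to compute the language-modeling loss only on the answer tokens, so the model
# learns to produce answers rather than to reproduce prompts.
full_text = sft_example["prompt"] + "\n" + sft_example["answer"]
prompt_len = len(sft_example["prompt"]) + 1   # characters before the answer starts
print("context:", full_text[:prompt_len])
print("loss is computed on:", full_text[prompt_len:])
```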

RM dataset

The RM data set is used to train the reward model in the second step. The goal of the reward is to align with human evaluation, and this reward is provided by human scoring: human annotators can give low scores to harmful, useless, or biased generated content, which makes it hard for the model to keep generating such content. InstructGPT/ChatGPT's approach is to have the model generate 4 to 10 answers to the same question and then have humans rank these answers from best to worst. About 33,000 samples; annotators only rank the answers.
The InstructGPT paper illustrates how this ranking process works.
(Figure: the answer-ranking setup shown in the InstructGPT paper.)
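A minimal PyTorch sketch of the pairwise ranking loss typically used to train such a reward model: for every pair where humans ranked one answer above another, the preferred answer should receive the higher reward. The scores below are placeholders standing in for reward-model outputs.

```python
import torch
import torch.nn.functional as F

# Suppose a reward model maps (prompt, answer) to a scalar score. Here we use
# placeholder scores for one prompt with two human-ranked answers.
score_preferred = torch.tensor([1.3], requires_grad=True)  # answer humans ranked higher
score_rejected = torch.tensor([0.4], requires_grad=True)   # answer humans ranked lower

# Pairwise loss in the InstructGPT style: -log sigmoid(r(preferred) - r(rejected)).
loss = -F.logsigmoid(score_preferred - score_rejected).mean()
loss.backward()   # gradients push preferred scores up and rejected scores down
print(loss.item())
```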

PPO dataset

Reinforcement learning guides model training through a reward mechanism, which can be seen as playing the role of the loss function in conventional training. Computing a reward is more flexible and diverse than a loss function (AlphaGo's reward is the outcome of the game); the price is that the reward is not differentiable and cannot be used directly for backpropagation. The idea of reinforcement learning is to fit a training signal by sampling a large number of rewards and thereby train the model. Similarly, human feedback is not differentiable, so human feedback can also be used as the reward, and reinforcement learning from human feedback came into being.

RLHF can be traced back to "Deep Reinforcement Learning from Human Preferences", published by OpenAI and DeepMind in 2017, which used human annotations as feedback to improve the performance of reinforcement learning on simulated robotics and Atari games.

PPO (Proximal Policy Optimization) is a policy gradient algorithm. Plain policy gradient methods are very sensitive to the step size, yet a suitable step size is hard to choose, and if the difference between the old and new policy becomes too large during training, learning suffers. PPO proposes a new objective function that allows small-batch updates over multiple training steps, solving the problem that the step size in policy gradient algorithms is hard to determine.
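A small numeric sketch of PPO's clipped surrogate objective (all values made up): once the probability ratio between the new and old policy leaves the clipping range, moving it further contributes no extra gain, which is what keeps each update step bounded.

```python
import torch

# Probability ratio new_policy / old_policy for three sampled actions (tokens).
ratio = torch.tensor([0.7, 1.0, 1.6])        # made-up values
advantage = torch.tensor([1.0, -0.5, 2.0])   # made-up advantage estimates
eps = 0.2                                     # PPO clipping range

unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage

# PPO maximizes the element-wise minimum of the two terms, so large policy
# shifts (ratio far from 1) stop being rewarded.
objective = torch.min(unclipped, clipped).mean()
print(objective)
```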

The PPO data set of InstructGPT is not labeled; it comes from users of the GPT-3 API. About 31,000 samples; only the questions from the prompt set are needed, and no labeling is required, because the "labeling" in this step is the score given by the RM model.
96% of the data set is English, and the remaining 20-plus languages account for less than 4%. This completes the flow from pre-training to reinforcement learning.

GPT-4

For a neural network to better understand the world and thus become more intelligent, visual ability seems indispensable.

Multi-modality means learning from text and images and responding to requests containing both. Adding images greatly expands the network's usefulness, because humans are visual animals: about one third of the human cerebral cortex is devoted to visual processing.
Compared with ChatGPT, GPT-4 reaches human level in many respects, such as the GRE and bar and medical licensing exams. GPT-4 can predict the next word more accurately than ChatGPT, which means the model understands more; in addition, the model was further fine-tuned with a variant of reinforcement learning on high-quality data, including images.

The LLaMA open-source large language model

Meta's LLaMA is an open-source large language model whose code is publicly available. Its two main official papers are listed below. The Transformer structure is not changed much, yet LLaMA-13B outperforms OpenAI's GPT-3 (175B): a model more than ten times smaller is comparable to one with 175 billion parameters.
《LLaMA: Open and Efficient Foundation Language Models》
《Llama 2: Open Foundation and Fine-Tuned Chat Models》

| Model | Pre-training data | Parameters | Structure | License | Context length | Grouped-query attention (GQA) | Training tokens | GPU time (A100-80GB, 400 W) | Optimizer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA | 4.5 TB | 6.3B, 13B, 32.5B, 62.5B | auto-regressive Transformer | research | 2k | – | 1.0T (6.3B, 13B); 1.4T (33B, 65B) | 82,432 h (7B); 135,168 h (13B); 530,432 h (33B); 1,022,362 h (65B) | AdamW |
| LLaMA-2 / LLaMA-2-chat | 6.3 TB (40% more than LLaMA) | 6.3B, 13B, 70B | auto-regressive Transformer | research and commercial | 4k | 34B / 70B | 2.0T | 184,320 h (7B); 368,640 h (13B); 1,720,320 h (70B) | – |

The LLaMA 65B model was pre-trained on 2048 A100 GPUs for about 21 days, and the pre-training run cost roughly 5 million US dollars.

Llama-2 is first trained on publicly available text, then fine-tuned with SFT, and then refined with reinforcement learning from human feedback (RLHF), which includes rejection sampling and proximal policy optimization (PPO).

Llama-2-chat uses RLHF (reinforcement learning from human feedback) to ensure safety and helpfulness. Llama-2-chat is the result of several months of research and iteration, covering instruction tuning, RLHF, compute, and annotation resources.

Future directions of large language models

"Intelligence" comes from data, computing power and model structure. The output of the model is developing in two directions: instrumentalization and intelligent humanization.
1. First, the data. Within a few years the public data on the Internet will have been crawled for model training, and data generated by new language models is very likely to be crawled and fed back into models. Will this trap models in self-reinforcement? It appears that models for detecting (and filtering out) large-language-model output will also be needed.
2. The existing RLHF architecture has the following problems:
a. Fine-tuning is a process of sealing off (narrowing) the model's existing capabilities rather than creating abilities out of nothing, so the model's ability is still bounded by what was learned before fine-tuning, and fine-tuning degrades performance on general NLP tasks.
b. The values expressed in the model's output only fit the specific scenarios it was aligned for;
c. Manual labeling is costly; InstructGPT's reinforcement learning employed about 40 people for labeling.
3. The paradigm of large-model pre-training plus fine-tuning will continue to exist, but both will evolve. Large models are evolving toward general task handling, while fine-tuning, besides targeting specific scenarios, still needs to lower its threshold and cost further, for example with LoRA.
Zero-shot, one-shot, and few-shot learning are approaches proposed because labeled data are limited or expensive.
4. The cost function used by the PPO method does not capture flexible human preferences, so this algorithmic direction is more likely than other modules to bring breakthroughs.
5. The current RLHF approach tends to output the same preference distribution for everyone, but people's preferences are diverse, so the assumption of a single shared preference does not hold; this is part of why generated answers are not always pleasing. For example, with news, a human editor's layout determines what every reader sees, whereas Toutiao collects personal preferences so that everyone sees something different; that is, the recommendation distribution differs from person to person.
6. In the SFT stage, all parameters are normally trained (even though the training time can be shortened), which is unfriendly to university labs and start-ups that only have two or four A100 cards for SFT. Methods that freeze most parameters and fine-tune only relatively few, such as LoRA, will therefore be one direction; a minimal sketch of LoRA follows this list.
7. PCs and edge devices will deploy models locally: large language models, image generation models, and tool-integrated AI. Quantization, SIMD, GPU acceleration, pruning, and other optimization methods will continue to be used to attempt on-device deployment.
8. Beyond memorization and a degree of generalization, I personally think these models already have rudimentary perception, logic, and awareness; even if it is only the logic of inorganic matter, it is a form of awareness, and if endowed with a survival drive (which is possible to some extent) it would also exhibit one. Super-large multi-modal neural networks will keep expanding their application boundaries in perception, understanding, and general task handling, and generative AI tool applications will continue to lead. One future direction is a truly conscious "Homo sapiens".
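A minimal sketch of the LoRA idea mentioned in points 3 and 6 above (illustrative, not the reference implementation): the pretrained weight is frozen and only a low-rank update is trained, which sharply cuts the number of trainable parameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: a frozen base weight plus a trainable low-rank delta."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen projection + scaled low-rank update (B @ A applied to x).
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only the low-rank matrices A and B (8*768 + 768*8 = 12,288 values)
```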


Source: blog.csdn.net/shichaog/article/details/132156873