Article reading summary: GPT




GPT1

GPT1 uses unsupervised pre-training followed by supervised fine-tuning, and builds an effective NLP model on top of the Transformer decoder; it is the foundation of GPT2 and GPT3.

  1. Unsupervised framework
    1) Framework: a language model is used for pre-training; as in an n-gram setup, the previous k words are used to predict the current word, maximizing the likelihood of the observed sequence.
    2) Both GPT and BERT use the Transformer as the basis of the model, but GPT uses the Transformer decoder, while BERT uses the encoder.
    3) The Transformer decoder used by GPT can be expressed mathematically as follows (see the equations after this list):

      Here U is the context vector of the tokens, We is the token embedding matrix, and Wp is the position encoding matrix; h0 is the sum of the word embeddings and position embeddings (each token carries both meaning and position); hl is the output of the l-th Transformer block; finally the output is multiplied by WeT and passed through a softmax to obtain the prediction probabilities.

  2. Supervised fine-tuning
    1) Given a labeled dataset C whose examples have the form x^1, ..., x^m -> y, where x^1, ..., x^m are the tokens and y is the label, the token sequence is fed through the pre-trained model and the final hidden state is passed to a linear + softmax layer for classification, producing the model's prediction.
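For reference, here are the equations described above, reconstructed from the GPT1 paper in the notation used by this summary (U: context tokens, We: token embedding matrix, Wp: position embeddings, n: number of decoder layers, Θ: model parameters):

$$ h_0 = U W_e + W_p $$
$$ h_l = \text{transformer\_block}(h_{l-1}), \quad l = 1, \dots, n $$
$$ P(u) = \text{softmax}(h_n W_e^T) $$
$$ L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta) $$

For supervised fine-tuning, the final hidden state of the last token, $h_l^m$, is fed to an added linear layer $W_y$ and a softmax, and the language-model loss is kept as an auxiliary objective with weight $\lambda$:

$$ P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y) $$
$$ L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m) $$
$$ L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C}) $$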

    The left side of the figure above is a structural diagram of the Transformer block; since it is just the Transformer decoder, the paper's description of it is very brief. The right side shows how the GPT pre-trained model is adapted, through different input transformations and fine-tuning, to different downstream tasks.

GPT2

  1. Core idea: zero-shot, i.e. the things that supervised learning can do can be done without any supervised training!! (supervised tasks handled purely with unsupervised pre-training)
    1) The core of a language model is conditional modeling of sequences: $p(s_{n-k}, \dots, s_n \mid s_1, s_2, \dots, s_{n-k-1})$.
    2) Any supervised task amounts to estimating $p(\text{output} \mid \text{input})$. Usually we need a task-specific network structure for this modeling, but if we build one general model whose structure is the same across tasks, then the only difference between tasks is the input data. For NLP tasks both inputs and outputs can be represented as text, so the task description can simply be added to the input, expressed for example as (translate to french, english text, french text), or as (answer the question, document, question, answer) (see the prompt sketch after this list).
  2. Details
    1. Data collection: existing corpora have many problems in both volume and quality, so the OpenAI team collected 40GB of high-quality data (WebText).
    2. Word-level embeddings have to deal with OOV (out-of-vocabulary words that do not appear in the pre-training vocabulary), while char-level models do not perform as well as word-level ones. The authors chose a middle path: splitting rare words into subwords via byte-pair encoding (see the BPE sketch after this list).
    3. Model changes relative to GPT1:
      1. Layer norm is moved to the front of each sub-block.
      2. The initialization of the residual layers is scaled according to the depth of the network.
      3. The vocabulary, input sequence length, and batch size are all enlarged.
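As an illustration of the zero-shot idea above, the task is expressed purely as text and handed to the unchanged language model. This is only a sketch: it assumes the Hugging Face transformers library and the public gpt2 checkpoint, neither of which is part of this article, and the prompt wording is just an example.

```python
# Zero-shot sketch: the task description is part of the input text, and the same
# pre-trained language model is reused for every task with no fine-tuning.
# Assumes the Hugging Face `transformers` library and the public "gpt2" weights.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# (task description, input) flattened into a single text sequence
prompt = "Translate English to French: cheese =>"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy continuation; the model's parameters are never updated.
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```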

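The subword splitting mentioned in point 2 above can be illustrated with the classic byte-pair-encoding merge loop. This is only a toy sketch on a made-up three-word corpus, not the byte-level tokenizer GPT2 actually ships.

```python
# Toy byte-pair encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent symbol pair, so frequent words become single tokens while rare
# words fall back to subword pieces. The corpus below is a made-up example.
from collections import Counter

def pair_counts(word_freqs):
    """Count adjacent symbol pairs across the vocabulary, weighted by frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in word_freqs.items()}

# Words are pre-split into characters, with </w> marking the end of a word.
word_freqs = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
for _ in range(5):
    pairs = pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)
    word_freqs = merge_pair(best, word_freqs)
print(word_freqs)  # frequent pairs such as "lo" and "we" are now single subword symbols
```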
GPT3

GPT3 is the model that everyone has already played with extensively; its parameter count reaches 175 billion.

One of the selling points of GPT2 was zero-shot, but the GPT3 work found that once the parameter count is scaled up, few-shot works better; that is, giving the pre-trained model a small amount of supervised data yields much better results. So I personally think the innovation of GPT3 lies in the fact that one huge network can be applied to all kinds of tasks (note that the "fine-tuning" here does not change the network's parameters: with such a huge parameter count no gradients are computed, and the labeled examples are simply placed in the context). A sketch of such a few-shot prompt follows.
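A minimal sketch of this in-context few-shot setup (illustrative only: the task, the example pairs, and the "=>" separator are hypothetical choices, not taken from the paper; zero-shot is simply the special case with no examples):

```python
# Few-shot "in-context learning" in the GPT3 style: the labeled examples are
# placed directly in the prompt text, and the model's weights are never updated.
# The task, examples, and separator below are hypothetical placeholders.

def build_few_shot_prompt(task_description, examples, query):
    """Concatenate a task description, k labeled examples, and the query
    into one text sequence to be fed to a frozen autoregressive LM."""
    lines = [task_description]
    for source, target in examples:   # the k in-context demonstrations
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")       # the model is expected to complete this line
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")]
prompt = build_few_shot_prompt("Translate English to French:", examples, "dog")
print(prompt)
# The resulting text is passed to the language model as-is; no gradients are
# computed, so conditioning on examples replaces parameter fine-tuning.
```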

The network structure of GPT3 is the same as that of GPT2, but the training data has been enlarged by about 100 times. OpenAI put a lot of effort into data processing, including filtering out low-quality data and deduplication.

The different curves in the figure above correspond to GPT3 models of different sizes: the darker the color, the smaller the parameter count, and the lighter the color, the larger the parameter count. When the parameter count is small, the model converges after a certain amount of training, the loss stops decreasing, and adding more compute brings no obvious improvement; when the parameter count is very large, the loss keeps decreasing as the amount of computation grows. ==But the loss only declines linearly while the compute consumed grows exponentially==!!!
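One common way to write this trade-off, taken from the scaling-law literature rather than from this summary (the exponent $\alpha_C$ and constant $C_c$ are not specified here), is a power law in compute $C$:

$$ L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C} \quad\Longleftrightarrow\quad \log L \approx \alpha_C \log C_c - \alpha_C \log C $$

Each fixed reduction in loss therefore requires multiplying the compute by a constant factor, which is exactly the linear-loss versus exponential-compute behavior described above.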


Origin blog.csdn.net/jerry_liufeng/article/details/125679709