GPT1
GPT1 combines unsupervised pre-training with supervised fine-tuning, building an effective NLP model on top of the Transformer decoder; it is the basis of GPT2 and GPT3.
- Unsupervised pre-training
1) Framework: pre-train a language model that predicts the next token from a fixed window of context; given the previous $k$ tokens, predict the current one, maximizing the log-likelihood $L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k},\dots,u_{i-1}; \Theta)$;
2) Both GPT and BERT use the Transformer as the basis of the model, but GPT uses the Transformer's decoder, while BERT uses the encoder;
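The pre-training objective can be sketched in a few lines; the `cond_prob` callback and the uniform toy model below are illustrative stand-ins, not GPT's actual learned distribution:

```python
import math

def lm_objective(tokens, cond_prob, k):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over a sequence,
    the quantity GPT-1's pre-training maximizes."""
    total = 0.0
    for i in range(len(tokens)):
        context = tokens[max(0, i - k):i]   # window of at most k preceding tokens
        total += math.log(cond_prob(tokens[i], context))
    return total

# toy model over a 4-token vocabulary: every conditional is 1/4
uniform = lambda tok, ctx: 0.25
print(lm_objective(["a", "b", "a", "c"], uniform, k=2))  # 4 * log(0.25)
```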
3) The Transformer decoder used by GPT is expressed mathematically as:

$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}), \quad l \in [1, n]$$
$$P(u) = \text{softmax}(h_n W_e^T)$$

where $U$ is the context vector of tokens, $W_e$ is the token embedding matrix, and $W_p$ is the position encoding matrix; $h_0$ is the sum of each token's word embedding and position embedding (each token carries both meaning and position); $h_l$ is the output of the $l$-th decoder block; finally the last hidden state is multiplied by $W_e^T$ and passed through a softmax to obtain the classification probabilities.
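A minimal numpy sketch of these three equations; the identity block stands in for a real masked self-attention decoder block, and all names here are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gpt1_forward(U, We, Wp, blocks):
    """U: (k,) int token ids; We: (V, d) token embedding matrix;
    Wp: (k, d) position embeddings; blocks: list of functions h -> h
    standing in for the n transformer decoder blocks."""
    h = We[U] + Wp               # h0 = U We + Wp
    for block in blocks:         # h_l = transformer_block(h_{l-1})
        h = block(h)
    return softmax(h @ We.T)     # P(u) = softmax(h_n We^T), output tied to We

rng = np.random.default_rng(0)
V, d, k = 10, 4, 3
We, Wp = rng.normal(size=(V, d)), rng.normal(size=(k, d))
identity_block = lambda h: h     # placeholder for a real decoder block
probs = gpt1_forward(np.array([1, 2, 3]), We, Wp, [identity_block])
print(probs.shape)  # (3, 10): a next-token distribution at each position
```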
- Supervised fine-tuning
1) Given a labeled dataset $\mathcal{C}$ whose examples have the form $x^1, \dots, x^m \rightarrow y$, where $x^1, \dots, x^m$ are the tokens and $y$ is the label, the tokens are fed through the pre-trained model and the final activation is passed to an added softmax layer for classification, yielding the model's prediction.
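A sketch of the added classification head, assuming the final-token activation from the pre-trained model is already computed; the names (`h_last`, `Wy`) are illustrative:

```python
import numpy as np

def finetune_probs(h_last, Wy):
    """Fine-tuning head: the final transformer activation of the last
    token (shape (d,)) is multiplied by a newly added weight matrix
    Wy (d, num_classes) and softmaxed into P(y | x^1, ..., x^m)."""
    z = h_last @ Wy
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, num_classes = 8, 3
probs = finetune_probs(rng.normal(size=d), rng.normal(size=(d, num_classes)))
print(probs, probs.sum())  # a distribution over the 3 labels
```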
GPT2
- Core idea: zero-shot: doing what supervised learning can do, without any supervised training (solving supervised tasks with unsupervised pre-training alone)
1) The core of a language model is conditional modeling of sequences: $p(s_{n-k}, \dots, s_n \mid s_1, s_2, \dots, s_{n-k-1})$
2) Any supervised task estimates $p(\text{output} \mid \text{input})$. Usually we need a task-specific network structure for modeling, but if we build a general model whose network structure is the same across tasks, then the only difference is the input data. The inputs and outputs of NLP tasks can all be represented as token sequences, so for different tasks we can simply prepend the task description to the input, expressed as (translate to french, english text, french text), or as (answer the question, document, question, answer).
- Details
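Serializing the task description into the input can be sketched as plain string construction; the `format_task` helper below is invented for illustration:

```python
def format_task(description, *fields):
    """Serialize a supervised example as one token sequence: the task
    description is prepended to the inputs, so one language model can
    handle every task and only the text itself differs."""
    return " ".join([description, *fields])

print(format_task("translate to french:", "the cat sat on the mat"))
print(format_task("answer the question:", "document text", "who sat on the mat?"))
```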
- Data collection: existing corpora have many problems, including data volume and data quality; the OpenAI team collected 40GB of high-quality text
- Word-level embeddings must deal with OOV (out-of-vocabulary words that are not in the pre-trained vocabulary), while char-level models perform worse than word-level ones; the authors chose a middle path: splitting rare words into subwords
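A toy greedy longest-match segmenter illustrates the subword idea (GPT2 actually uses byte-level BPE with learned merges; this sketch and its tiny vocabulary are only illustrative):

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match segmentation: a rare word absent from the
    vocabulary is broken into the longest known subword pieces, so no
    input is ever truly out-of-vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):    # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                # fall back to a single character
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"trans", "form", "er", "un", "like", "ly"}
print(split_into_subwords("transformer", vocab))  # ['trans', 'form', 'er']
print(split_into_subwords("unlikely", vocab))     # ['un', 'like', 'ly']
```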
- Model changes relative to GPT1:
- layer norm is moved to the front of each sub-block
- the initialization of residual-layer weights is scaled according to network depth (by $1/\sqrt{N}$, where $N$ is the number of residual layers)
- expanded vocabulary, input sequence length, and batch size
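The first two changes can be sketched as follows: a pre-LN sub-block and a depth-scaled residual initialization. The $1/\sqrt{N}$ factor follows the GPT2 paper's description; the 0.02 base standard deviation and all function names are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_sublayer(x, sublayer):
    """GPT-2 moves layer norm to the input of each sub-block:
    x + sublayer(LN(x)) instead of LN(x + sublayer(x))."""
    return x + sublayer(layer_norm(x))

def scaled_residual_init(shape, n_residual_layers, rng):
    """Residual-projection weights scaled by 1/sqrt(N), where N is the
    number of residual layers, so activations don't grow with depth."""
    return rng.normal(0.0, 0.02, size=shape) / np.sqrt(n_residual_layers)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 8))
W = scaled_residual_init((8, 8), n_residual_layers=48, rng=rng)
out = pre_ln_sublayer(x, lambda h: h @ W)   # toy linear sub-layer
print(out.shape)  # (4, 8)
```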
GPT3
GPT3 is the work everyone has been playing with; its parameter count reaches 175 billion.
One of GPT2's selling points was zero-shot, but the GPT3 work found that as the parameter count grows, few-shot performs better; that is, giving the pre-trained model a small amount of supervised data yields much better results. So I personally think GPT3's innovation lies in applying one huge network structure to a wide variety of tasks (note that the "fine-tuning" here does not change the network's parameters: with such a huge parameter count no gradients are computed, and the supervised examples are simply placed in the context).
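In-context few-shot prompting amounts to plain string construction, since no parameters are updated; the format below (and the `=>` separator) is an invented illustration, not OpenAI's exact template:

```python
def few_shot_prompt(task, examples, query):
    """Build a GPT-3-style few-shot prompt: a handful of demonstrations
    are placed in the context window, and the model is asked to complete
    the final answer. No gradients are computed and no weights change."""
    lines = [task]
    for x, y in examples:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

print(few_shot_prompt(
    "Translate English to French:",
    [("cheese", "fromage"), ("cat", "chat")],
    "dog",
))
```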
GPT3's network structure is the same as GPT2's, but the training data is scaled up 100 times. OpenAI put a lot of effort into data processing, including quality filtering of low-quality data and deduplication.