1. Joint training tasks
1.1 NTP (Next Token Prediction)
GPT pre-training uses two objective functions. The first is the basic next-token prediction task: choose a context window of size K, and use the embeddings of the K tokens in the window as the condition for predicting the next token.
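The windowed next-token objective above can be sketched as an average negative log-likelihood over a token sequence. This is a toy illustration, not GPT's actual Transformer: the vocabulary, the window size K = 3, and the random score table standing in for the model are all assumptions made for the example.

```python
import math
import random

# Toy vocabulary and a stand-in "model" for illustration only;
# the real model is a Transformer decoder over learned embeddings.
VOCAB = ["the", "cat", "sat", "on", "mat"]
K = 3  # context window size (chosen for the example)

random.seed(0)
# Hypothetical score table: score(previous_word, candidate_next_word)
SCORES = {(c, n): random.uniform(-1, 1) for c in VOCAB for n in VOCAB}

def next_token_probs(window):
    """Softmax over scores for each candidate next token, conditioned
    (in this toy sketch) on the last word of the context window."""
    last = window[-1]
    logits = [SCORES[(last, w)] for w in VOCAB]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {w: e / z for w, e in zip(VOCAB, exps)}

def lm_loss(tokens, k=K):
    """Average negative log-likelihood of each token given the
    previous k tokens -- the next-token prediction objective."""
    loss = 0.0
    for i in range(1, len(tokens)):
        window = tokens[max(0, i - k):i]
        probs = next_token_probs(window)
        loss -= math.log(probs[tokens[i]])
    return loss / (len(tokens) - 1)

print(lm_loss(["the", "cat", "sat", "on", "the", "mat"]))
```

Training would adjust the scores to minimize this loss; here they are fixed random values, so the loss simply shows how the objective is computed.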
1.2 TC (Text Classification)
The second is a classification task: a paragraph is given a label, and the model is trained to predict that label.
During fine-tuning, the objective is a weighted sum of these two functions: the classification loss plus the pre-training (language-modeling) loss kept as an auxiliary term.
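The weighted sum can be written as L = L_cls + λ · L_lm. A minimal sketch, assuming a weight λ = 0.5 (the value reported in the GPT-1 paper; treat it as a tunable hyperparameter):

```python
def combined_loss(cls_loss, lm_loss, lambda_lm=0.5):
    """Fine-tuning objective: classification loss plus a weighted
    auxiliary language-modeling loss (lambda_lm is a hyperparameter)."""
    return cls_loss + lambda_lm * lm_loss

# e.g. cls_loss = 0.8, lm_loss = 2.0  ->  0.8 + 0.5 * 2.0 = 1.8
print(combined_loss(0.8, 2.0))
```

Keeping the language-modeling term during fine-tuning acts as a regularizer, so the model does not forget what it learned in pre-training.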
For downstream tasks, the input is fed into the Transformer decoder. Like BERT, GPT reuses the pre-trained parameters and then passes the resulting features to an added FFN head, as shown in the following figure:
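The added head can be sketched as a single linear projection over the final hidden state of the pretrained decoder. The 768-dimensional hidden size comes from the text; the 2-class setup and random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes = 768, 2  # 768 from the text; 2 classes assumed

# Stand-in for the last token's hidden state from the pretrained decoder.
h_last = rng.standard_normal(d_model)

# Newly initialized task head: a linear layer added on top of the
# pretrained features (only this part starts from scratch).
W = rng.standard_normal((d_model, n_classes)) * 0.02
b = np.zeros(n_classes)

logits = h_last @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # class probabilities for the downstream task
```

During fine-tuning both the head and the pretrained decoder weights are updated, but the head is the only newly introduced component.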
GPT uses 12 Transformer layers with a hidden dimension of 768. BERT-base was configured with the same parameters precisely to allow a direct comparative experiment with GPT.
2. GPT2
GPT-2 was OpenAI's response to BERT, so