Re45: Reading the Paper GPT-1 Improving Language Understanding by Generative Pre-Training


Full name of the paper: Improving Language Understanding by Generative Pre-Training
Paper download address: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf

This paper is OpenAI's 2018 work and is the original paper of the first-generation GPT.

The method first pre-trains a language model (a Transformer decoder) on unsupervised data, then fine-tunes it on supervised data (adding a linear prediction head and optimizing the language-model loss and the supervised-task loss jointly).
[Figure 1 in the paper: (left) the Transformer decoder architecture and training objective; (right) input transformations for fine-tuning on different tasks]

1. Introduction

NLU tasks include textual entailment, question answering, semantic similarity assessment, and document classification. The paper evaluates four types of tasks: NLI, QA, semantic similarity, and text classification.
Supervised data is scarce. The paper's solution is to use massive unlabeled data for generative pre-training of a language model and then perform discriminative fine-tuning on each specific subtask.
(This counts as semi-supervised learning.)

A common way to exploit unsupervised data for linguistic knowledge is to learn pre-trained word embeddings and use them to improve performance on NLP tasks. This approach leaves two open questions: 1. Which optimization objective for learning text representations is most effective for transfer? It is unclear; so far no single method is clearly superior. 2. How to most effectively transfer the learned representations to the target task is also unclear.

2. GPT-1

1. Unsupervised pre-training of the language model

The standard language model goal is to maximize the likelihood of a text:
$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

$k$ is the context window size, $P$ is the conditional probability, and $\Theta$ denotes the parameters of the neural network.
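A minimal sketch of this objective in PyTorch, assuming a generic `model` callable that maps a batch of token ids to next-token logits (the function and argument names are illustrative, not from any released GPT-1 code):

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens, k):
    """Negative of the LM objective L1: -sum_i log P(u_i | u_{i-k}..u_{i-1}).

    model:  callable mapping a (1, t) LongTensor of ids to (1, t, vocab) logits.
    tokens: (seq_len,) LongTensor of token ids from the unlabeled corpus.
    k:      context window size.
    """
    total_log_prob = torch.tensor(0.0)
    for i in range(1, tokens.size(0)):
        context = tokens[max(0, i - k):i].unsqueeze(0)    # at most k previous tokens
        logits = model(context)                           # (1, t, vocab)
        log_probs = F.log_softmax(logits[0, -1], dim=-1)  # distribution over the next token
        total_log_prob = total_log_prob + log_probs[tokens[i]]
    return -total_log_prob                                # minimize the negative log-likelihood
```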

This paper uses a multi-layer Transformer decoder¹ (masked multi-head self-attention followed by a position-wise feed-forward network, producing an output distribution over target tokens):
$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \text{softmax}(h_n W_e^\top)$$
$U = (u_{-k}, \dots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
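A rough sketch of this forward pass, using `nn.TransformerEncoderLayer` with a causal mask as a stand-in for the paper's masked decoder blocks (class and parameter names are illustrative; details such as activation and normalization placement differ from the original model, though the default sizes below match GPT-1's 12 layers, 12 heads, and 768-dimensional states):

```python
import torch
import torch.nn as nn

class GPTDecoder(nn.Module):
    """Sketch of h_0 = U W_e + W_p; h_l = transformer_block(h_{l-1}); P(u) = softmax(h_n W_e^T)."""

    def __init__(self, vocab_size, max_len, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)   # token embedding matrix W_e
        self.W_p = nn.Embedding(max_len, d_model)      # learned position embedding matrix W_p
        # Encoder layers with a causal mask stand in for the paper's masked decoder blocks.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, U):                              # U: (batch, seq_len) token ids
        T = U.size(1)
        pos = torch.arange(T, device=U.device)
        h = self.W_e(U) + self.W_p(pos)                # h_0 = U W_e + W_p
        causal = torch.triu(torch.full((T, T), float("-inf"), device=U.device), diagonal=1)
        for block in self.blocks:                      # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=causal)
        return h @ self.W_e.weight.T                   # logits; softmax gives P(u)
```

Note that the output projection reuses the token embedding matrix $W_e$, matching the $h_n W_e^\top$ term in the formula.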

Compared with LSTMs, the Transformer's advantage is its more structured memory for handling long-term dependencies in text.

2. Fine-tuning

Each task's input is converted into a task-specific token sequence (see Figure 1 of the paper) and passed through the pre-trained model; the final transformer block's activation $h_l^m$ is fed into a linear output layer with parameters $W_y$ to predict $y$:
$$P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$$
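For example, a textual entailment input is built by concatenating the premise and hypothesis with a delimiter token, wrapped in start and extract tokens; the extract token's final activation is what feeds the linear head. A sketch under the assumption of a generic `encode` tokenizer function (not from the paper's code):

```python
def build_entailment_input(encode, premise, hypothesis):
    """Entailment input transformation: <s> premise $ hypothesis <e>.

    encode: placeholder tokenizer mapping a string to a list of token ids.
    The final <e> (extract) token's activation h_l^m feeds the linear head.
    """
    return encode("<s>") + encode(premise) + encode("$") + encode(hypothesis) + encode("<e>")
```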

The supervised fine-tuning objective:
$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$

In practice, the two objectives are combined, with the language-model loss kept as an auxiliary objective during fine-tuning:
$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$
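A minimal sketch of the combined objective on a single labeled example, assuming one forward pass yields both the final hidden states and the next-token logits (tensor shapes and names are illustrative, not the paper's implementation):

```python
import torch.nn.functional as F

def finetune_loss(hidden, lm_logits, tokens, label, W_y, lam=0.5):
    """Combined fine-tuning objective L3 = L2 + lambda * L1 on one labeled example.

    hidden:    (seq_len, d_model) final transformer-block activations h_l.
    lm_logits: (seq_len, vocab) next-token logits from the same forward pass.
    tokens:    (seq_len,) input token ids (the task-specific transformed sequence).
    label:     scalar LongTensor holding the class index y.
    W_y:       (d_model, n_classes) linear output head.
    """
    # Supervised loss L2: -log P(y | x^1..x^m), predicted from the last token's activation h_l^m.
    task_logits = hidden[-1] @ W_y
    l2 = F.cross_entropy(task_logits.unsqueeze(0), label.unsqueeze(0))

    # Auxiliary LM loss L1 on the same labeled sequence: predict token i+1 from positions <= i.
    l1 = F.cross_entropy(lm_logits[:-1], tokens[1:])

    return l2 + lam * l1
```

In the paper, $\lambda$ is set to 0.5 during fine-tuning.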

3. Experiment

1. Dataset

  1. Upstream pre-training data: BooksCorpus and 1B Word Benchmark
  2. Downstream fine-tuning data
    [Table 1 in the paper: the different tasks and datasets used for fine-tuning]

2. Results on downstream tasks

  1. Experimental results on NLI tasks (Table 2 in the paper)
  2. Experimental results on QA and commonsense reasoning (Table 3 in the paper)
  3. Experimental results on semantic similarity and text classification (Table 4 in the paper)

3. Model analysis

  1. The effect of the number of transferred layers on fine-tuning results (the more layers transferred, the better) and the effect of the number of pre-training updates on zero-shot performance (Figure 2 in the paper)
    (the zero-shot values are normalized between a random-guess baseline and the current state of the art)
  2. Ablation study (Table 5 in the paper)

  1. Generating Wikipedia by Summarizing Long Sequences ↩︎

Origin blog.csdn.net/PolarisRisingWar/article/details/132670273