[Natural Language Processing | BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Paper Explanation

BERT is an NLP model proposed by Google in 2018 and has become one of the most notable breakthroughs in the NLP field in recent years, setting new state-of-the-art records on 11 NLP tasks, including GLUE, SQuAD v1.1, and MultiNLI.

The paper is available at:

https://arxiv.org/pdf/1810.04805.pdf

1. Introduction


Google proposed the BERT model in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". BERT is built on the Transformer's Encoder structure, and the Encoder blocks themselves follow the original Transformer design essentially unchanged. In general, BERT has the following characteristics:

1.1 Structure

BERT adopts the Transformer's Encoder structure, but the model is deeper than the original Transformer: the original Transformer encoder stacks 6 Encoder blocks, while BERT-base contains 12 Encoder blocks and BERT-large contains 24.
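For reference, the two standard configurations reported in the paper can be summarized as follows (a plain Python dictionary, purely for illustration; the key names are my own):

```python
# Standard BERT configurations from the paper (for reference only).
BERT_CONFIGS = {
    "bert-base":  {"encoder_blocks": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "110M"},
    "bert-large": {"encoder_blocks": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "340M"},
}
```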

1.2 Training

Training is divided into two stages: a pre-training stage and a fine-tuning stage. The pre-training stage is similar in spirit to Word2Vec and ELMo: the model is trained on a large corpus using a few pre-training tasks. In the fine-tuning stage, the pre-trained model is then fine-tuned for downstream tasks such as text classification, part-of-speech tagging, and question answering. BERT can be fine-tuned on different tasks without changing its structure.

1.3 Pre-training task 1

BERT's first pre-training task is Masked LM: a portion of the words in a sentence is randomly masked, and the model predicts the masked words using the context on both sides at the same time, so that the meaning of a word can be learned from the full text. Masked LM is the core of BERT, and it differs from biLSTM-based prediction, as discussed later.

1.4 Pre-training task 2

The second pre-training task of BERT is Next Sentence Prediction (NSP), the next sentence prediction task. This task is mainly to enable the model to better understand the relationship between sentences.

2. BERT structure

[Figure: BERT's overall pre-training and fine-tuning procedures]
The figure above shows the structure of BERT: the left part illustrates the pre-training process, and the right part shows fine-tuning for specific tasks.

2.1 Input of BERT

The input to BERT can consist of a sentence pair (sentence A and sentence B) or a single sentence. At the same time, BERT adds some flags with special functions:

  1. The [CLS] flag is placed at the beginning of the first sentence; the representation vector C that BERT produces for it can be used for subsequent classification tasks.
  2. The [SEP] flag is used to separate the two input sentences: when sentences A and B are input, a [SEP] flag is appended after sentence A and after sentence B.
  3. The [MASK] flag is used to mask some words in the sentence; after a word is replaced with [MASK], the vector that BERT outputs at the [MASK] position is used to predict the original word.

For example, given the two sentences "my dog is cute" and "he likes playing" as an input sample, BERT turns them into "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]". BERT uses the WordPiece method, which splits words into subword units (SubWord), so some words are broken into pieces; for example, "playing" becomes "play" + "##ing".
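As a quick illustration, the Hugging Face transformers tokenizer can reproduce this kind of input. This is a minimal sketch, assuming the transformers package is installed and the bert-base-uncased vocabulary can be downloaded; it is not part of the paper itself:

```python
# Minimal sketch: reproducing BERT's input format with the Hugging Face tokenizer.
# Assumes `pip install transformers` and network access to fetch the vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits out-of-vocabulary words into subword units marked with "##"
# (the exact split depends on the learned vocabulary).
print(tokenizer.tokenize("my dog is cute"))
print(tokenizer.tokenize("he likes playing"))

# Encoding a sentence pair automatically adds the [CLS] and [SEP] flags
# and produces segment ids (0 for sentence A tokens, 1 for sentence B tokens).
encoded = tokenizer("my dog is cute", "he likes playing")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])
```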

After receiving the input sentence, BERT converts each word into an Embedding, denoted E. Unlike the Transformer, BERT's input Embedding is the sum of three parts: Token Embedding, Segment Embedding, and Position Embedding.

[Figure: BERT input representation as the sum of token, segment, and position embeddings]

Token Embedding: the embedding of each word (or special flag such as [CLS]), learned during training.

Segment Embedding: used to distinguish whether each word belongs to sentence A or sentence B. If only one sentence is input, only E_A is used. It is also learned during training.

Position Embedding: encodes the position at which each word appears. Unlike the Transformer, which uses a fixed sinusoidal formula, BERT's Position Embedding is also learned. BERT assumes a maximum sequence length of 512.
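A minimal PyTorch sketch of this sum is shown below. It is illustrative only: the sizes follow BERT-base, the token ids are arbitrary example values, and the variable names are my own.

```python
# Illustrative sketch of BERT's input embedding: token + segment + position (BERT-base sizes).
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len, num_segments = 30522, 768, 512, 2

token_emb    = nn.Embedding(vocab_size, hidden_size)    # one vector per WordPiece token
segment_emb  = nn.Embedding(num_segments, hidden_size)  # E_A / E_B, distinguishes sentence A from B
position_emb = nn.Embedding(max_len, hidden_size)       # learned positions, max sequence length 512

token_ids   = torch.tensor([[101, 2001, 2002, 2003, 2004, 102]])  # arbitrary example ids, shape [batch, seq_len]
segment_ids = torch.zeros_like(token_ids)                         # here all tokens belong to sentence A
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

E = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(E.shape)  # torch.Size([1, 6, 768])
```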

2.2 BERT pre-training

After converting the words of the input sentence into Embeddings, BERT is trained with two pre-training tasks.

The first is Masked LM: some words in the sentence are randomly replaced with [MASK], the sentence is passed through BERT so that every word is encoded, and the encoding T[MASK] at each [MASK] position is then used to predict the original word at that position.

The second is next sentence prediction: sentences A and B are fed into BERT, and the model predicts whether B is the next sentence of A, using the output vector C of the [CLS] flag to make the prediction.
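Both pre-training heads are simple layers on top of the encoder outputs. Here is a rough PyTorch sketch of the idea, not the exact implementation from the paper's codebase; the variable names and the random stand-in for the encoder output are my own:

```python
# Rough sketch of the two pre-training heads on top of BERT's encoder outputs.
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522

mlm_head = nn.Linear(hidden_size, vocab_size)  # Masked LM: predict the original token at each [MASK]
nsp_head = nn.Linear(hidden_size, 2)           # NSP: binary "is B the next sentence of A?"

# Stand-in for BERT's output, shape [batch, seq_len, hidden]
encoder_output = torch.randn(1, 8, hidden_size)

T_mask = encoder_output[:, 3]   # output vector at a masked position (position 3 chosen arbitrarily)
C      = encoder_output[:, 0]   # output vector of the [CLS] flag (always position 0)

mlm_logits = mlm_head(T_mask)   # scores over the vocabulary for the masked word
nsp_logits = nsp_head(C)        # scores for {IsNext, NotNext}
```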

The process of BERT pre-training can be represented by the following figure.

[Figure: the BERT pre-training process (Masked LM and Next Sentence Prediction)]

2.3 BERT for specific NLP tasks

The pre-trained BERT model can be fine-tuned (Fine-tuning stage) when it is used for specific NLP tasks. The BERT model can be applied to a variety of different NLP tasks, as shown in the figure below.

[Figure: fine-tuning BERT on different downstream tasks (a)–(d)]

Classification tasks on a pair of sentences: such as natural language inference (MNLI) and sentence semantic equivalence judgment (QQP). As shown in figure (a) above, the two sentences are passed into BERT, and the output vector C of [CLS] is used to classify the sentence pair.

Single-sentence classification tasks: such as sentence sentiment analysis (SST-2) and judging whether a sentence is grammatically acceptable (CoLA). As shown in figure (b) above, only one sentence is input, so no [SEP] separator between two sentences is needed, and the output vector C of [CLS] is used for classification.

Question answering tasks: such as the SQuAD v1.1 dataset, where each sample is a sentence pair (Question, Paragraph). Question is a question and Paragraph is a passage from Wikipedia that contains the answer. The training goal is to find the start and end positions (Start, End) of the answer within the Paragraph. As shown in figure (c) above, Question and Paragraph are passed into BERT, and BERT predicts the Start and End positions from the outputs of all the words in the Paragraph.

Single-sentence tagging tasks: such as named entity recognition (NER). A single sentence is input, and the category of each word is predicted from BERT's per-word output T, i.e. whether it is a Person, Organization, Location, Miscellaneous, or Other (not a named entity).
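The fine-tuning heads for these tasks are all thin layers on top of BERT's outputs. Below is an illustrative sketch under that assumption (the layer shapes and variable names are my own, and a random tensor stands in for the encoder output):

```python
# Sketch of typical fine-tuning heads on top of BERT's outputs (illustrative only).
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 3                   # e.g. 3 labels for an MNLI-style task

encoder_output = torch.randn(1, 16, hidden_size)   # stand-in for BERT's output [batch, seq, hidden]
C = encoder_output[:, 0]                           # [CLS] output vector

# (a)/(b) Sentence(-pair) classification: a single linear layer on top of C.
classifier = nn.Linear(hidden_size, num_labels)
class_logits = classifier(C)

# (c) SQuAD-style QA: score every token as a possible answer Start or End.
span_head = nn.Linear(hidden_size, 2)
start_logits, end_logits = span_head(encoder_output).split(1, dim=-1)

# (d) Token tagging (e.g. NER): classify every per-word output T_i.
tagger = nn.Linear(hidden_size, 5)                 # Person / Organization / Location / Misc / Other
tag_logits = tagger(encoder_output)
```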

3. Pre-training tasks

The pre-training part is the core of BERT, so let's look at its details. BERT uses two pre-training tasks: Masked LM and next sentence prediction.

3.1 Masked LM

Let's first review earlier language-model pre-training methods, using the sentence "I / like / learning / natural / language / processing" as an example. When training a language model, some Mask operation is usually needed to prevent information leakage; information leakage here means that when predicting the word "natural", the model can already see "natural" in the input. The reason the Transformer Encoder leaks information in this way is discussed below.

Word2Vec's CBOW: predicts word i from the words around it, but it is a bag-of-words model and ignores word order. For example, when predicting the word "natural", it uses both the preceding "I / like / learning" and the following "language / processing". CBOW is therefore equivalent to masking the word "natural" during training.

ELMo: ELMo uses a biLSTM during training. When predicting "natural", the forward LSTM masks all the words after "natural" and predicts it from the preceding "I / like / learning"; the backward LSTM masks all the words before "natural" and predicts it from the following "language / processing". The outputs of the forward and backward LSTMs are then concatenated, so ELMo uses the left and right contexts separately rather than jointly.

OpenAI GPT: OpenAI GPT is another algorithm that uses the Transformer to train a language model, but it uses the Transformer's Decoder, which is a one-way (unidirectional) structure: when predicting "natural", it only uses the preceding "I / like / learning". The Decoder includes a Mask operation that masks all the words after the current position.
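To make that Mask operation concrete, here is a small sketch of a causal (lower-triangular) attention mask, which is what prevents a position from attending to later words. This is my own illustration, not code from GPT or BERT:

```python
# Causal attention mask used by a Transformer Decoder (as in GPT):
# position i may only attend to positions <= i, so prediction is strictly left-to-right.
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# BERT's Encoder uses no such mask, so every position can attend to the full context.
```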

The figure below shows the difference between BERT and ELMo, OpenAI GPT:

[Figure: differences in pre-training architectures — BERT, OpenAI GPT, and ELMo]

The authors of BERT argue that when predicting a word, it is best to use both its left context and its right context at the same time. A model like ELMo, which runs left-to-right and right-to-left separately and then concatenates the results, is only a shallow bidirectional model. BERT aims to train a deep bidirectional model on the Transformer Encoder structure, and this is why the Masked LM training method is proposed.

Masked LM is what prevents information leakage. For example, when predicting the word "natural", if "natural" in the input is not masked, the self-attention in the Encoder lets every position see it, so the model can simply copy "natural" to the output.

[Figure: Masked LM — predicting the original word at the masked position]

During training, BERT only predicts the words at the [MASK] positions, which allows it to use the context on both sides at the same time. However, [MASK] never appears in real sentences when the model is later used, and this mismatch would hurt performance. Therefore the following strategy is adopted during training: 15% of the words in a sentence are randomly selected for masking; of these selected words, 80% are actually replaced with [MASK], 10% are left unchanged, and the remaining 10% are replaced with a random word.

For example, in the sentence "my dog is hairy", if the word "hairy" is selected for masking, then:

  1. With 80% probability, the sentence "my dog is hairy" becomes "my dog is [MASK]".
  2. With 10% probability, the sentence "my dog is hairy" is kept unchanged.
  3. With 10% probability, "hairy" is replaced with another random word such as "apple", turning "my dog is hairy" into "my dog is apple".

The above is BERT's first pre-training task Masked LM.
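A minimal sketch of this 15% / 80-10-10 selection strategy is shown below. It is illustrative only; real implementations work on token ids rather than strings, and the function name and vocabulary are my own:

```python
# Illustrative sketch of the Masked LM corruption strategy (15% of tokens; 80/10/10 split).
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:              # select ~15% of the tokens
            targets[i] = tok                         # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: keep the original token unchanged
    return corrupted, targets

vocab = ["my", "dog", "is", "hairy", "apple", "cute"]
print(mask_tokens(["my", "dog", "is", "hairy"], vocab))
```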

3.2 Next sentence prediction

The second pre-training task of BERT is Next Sentence Prediction (NSP), which is the next sentence prediction. Given two sentences A and B, it is necessary to predict whether sentence B is the next sentence of sentence A.

The main reason BERT uses this pre-training task is that many downstream tasks, such as question answering (QA) and natural language inference (NLI), require the model to understand the relationship between two sentences, and this cannot be learned from a language-modeling objective alone.

During training, with 50% probability the two sentences A and B are consecutive, and with 50% probability B is a randomly chosen sentence that does not follow A. The output C of the [CLS] flag is then used to predict whether B is the next sentence of A:

  • Input = [CLS] I like to play League of [MASK] [SEP] My best [MASK] is Yasuo [SEP]

    Category = B is the next sentence of A

  • Input = [CLS] I like to play League of [MASK] [SEP] Today's weather is very [MASK] [SEP]

    Category = B is not the next sentence of A
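A rough sketch of how such 50/50 training pairs can be constructed from a document corpus is shown below; the function name and the tiny example corpus are my own illustration, not the paper's data pipeline:

```python
# Illustrative construction of Next Sentence Prediction pairs (50% IsNext / 50% NotNext).
import random

def make_nsp_pair(documents):
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"              # B really follows A
    other_doc = random.choice(documents)
    return sentence_a, random.choice(other_doc), "NotNext"   # B drawn from a random document

docs = [["I like to play League of Legends.", "My best champion is Yasuo."],
        ["Today's weather is very nice.", "Let's go for a walk."]]
print(make_nsp_pair(docs))
```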

4. Summary of BERT

Because BERT's pre-training uses Masked LM, each batch only trains on about 15% of the words, so more pre-training steps are needed than for sequential models such as ELMo, which make a prediction for every word.

BERT uses the Transformer's Encoder together with the Masked LM pre-training method, so it can make bidirectional predictions; OpenAI GPT uses the Transformer's Decoder structure with the Decoder's Mask, so it can only predict left to right.


Source: blog.csdn.net/wzk4869/article/details/130494954