Introduction to BERT

1. BERT pre-training steps:

  1. Use the Masked LM method to mask a certain proportion of the words in the corpus; the model predicts the masked words from their context, training a preliminary language model
  2. Select pairs of consecutive sentences from the corpus and have the Transformer encoder judge whether the two sentences are actually adjacent (Next Sentence Prediction)
  3. Through 1 and 2, obtain a pre-trained language representation model that predicts bidirectionally from context
  4. Fine-tune the model in a supervised manner on a small amount of labeled data
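
A minimal sketch of steps 1–3 together, assuming the Hugging Face transformers library and its public bert-base-uncased checkpoint (neither is mentioned in the original article); BertForPreTraining bundles both the masked-LM head and the next-sentence head:

```python
# Sketch: BERT's two pre-training heads via Hugging Face transformers.
# Assumes `pip install torch transformers`; "bert-base-uncased" is the public checkpoint.
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# A sentence pair, as used for Next Sentence Prediction (the sentences are invented).
inputs = tokenizer("The cat sat on the mat.", "It looked very comfortable.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # masked-LM logits: (1, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # next-sentence logits: (1, 2)
```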

2. Contextualized word embedding

BERT chooses the Transformer encoder as its bidirectional architecture. As in any Transformer encoder, positional embeddings are added at every position of the input sequence. However, unlike the original Transformer encoder, BERT uses learnable positional embeddings. The embedding of BERT's input sequence is the sum of the word (token) embedding, the segment embedding, and the position embedding.
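
A minimal sketch of that embedding sum in PyTorch (the sizes are those of the public bert-base configuration and are assumptions, not taken from this article):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, segment, and learnable position embeddings, as described above."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.segment_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        # Unlike the fixed sinusoidal encoding in the original Transformer,
        # BERT's position embeddings are learned parameters.
        self.position_embeddings = nn.Embedding(max_len, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        positions = positions.unsqueeze(0).expand_as(input_ids)
        emb = (self.word_embeddings(input_ids)
               + self.segment_embeddings(segment_ids)
               + self.position_embeddings(positions))
        return self.layer_norm(emb)

# Usage: a batch of one sequence of length 6, all tokens in segment 0.
emb = BertEmbeddings()
out = emb(torch.randint(0, 30522, (1, 6)), torch.zeros(1, 6, dtype=torch.long))
print(out.shape)  # torch.Size([1, 6, 768])
```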

3. Masking Input (cloze)——> self-supervised

In order to train deep bidirectional representations, BERT adopts a straightforward method: randomly mask a certain proportion of tokens, and then predict only those masked tokens. This task is Masked LM (MLM), also known as cloze. The final hidden vector at each masked position is fed into a softmax over the vocabulary, just like in a standard language model. During cloze training, the masked words are not always replaced with the actual [MASK] token. Instead, the training data generator randomly selects 15% of the tokens. For example, in the following sentence:

Taiwan University

After a token is selected, one of the following is performed:

  • 80% of the time, replace the selected word with the [MASK] token. For example: The [MASK] is cute.
  • 10% of the time, replace the selected word with a random word. For example: The playing is cute.
  • 10% of the time, keep the selected word unchanged.

The Transformer encoder does not know which word it will be asked to predict or which word has been replaced by a random one, so it must maintain a contextual representation of every input token. In addition, because random substitution only affects a small fraction of all tokens (10% of the 15% selected, i.e. 1.5%), it does not damage the model's language understanding ability.
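
A minimal sketch of this 15% selection with the 80/10/10 replacement rule, in plain Python over a list of token strings (the toy vocabulary is invented for illustration):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: pick ~15% of positions, then 80% [MASK],
    10% random word, 10% unchanged. Returns masked tokens and target labels."""
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not selected, no loss computed
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with the [MASK] token
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random word
        # else: 10% keep the token unchanged
    return masked, labels

vocab = ["the", "cat", "dog", "is", "cute", "playing"]
print(mask_tokens(["the", "cat", "is", "cute"], vocab))
```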


As shown in the figure, for BERT the input and output sequences have the same length. In the course taught by Professor Li Hongyi at National Taiwan University, taking the input sequence "台湾大学" (National Taiwan University) as an example, the model randomly masks the character "湾", passes the output vector at that position through a linear layer (MLP), and then applies softmax to classify it over the vocabulary, producing a prediction for the masked character.
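
The same cloze prediction can be reproduced with a pre-trained checkpoint. The sketch below assumes the Hugging Face transformers library and the public bert-base-uncased model, and uses the toy English sentence from the list above rather than the course example:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The [MASK] is cute.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Softmax-classify the output at the masked position over the whole vocabulary.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. "cat" or "dog"
```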

4. Next Sentence Prediction

The BERT input is a sequence pair, into which two special tokens are inserted. [CLS] is used to determine whether the two text sequences in the pair are adjacent (that is, whether the second text sequence is the next sentence of the first). [SEP] splits the text pair; it is the separator between the two text sequences.
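
For example, a standard BERT tokenizer inserts these two special tokens automatically when given a text pair (a sketch assuming the Hugging Face transformers tokenizer; the two sentences are invented):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("He went to the store.", "He bought some milk.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'he', 'went', 'to', 'the', 'store', '.',
#  '[SEP]', 'he', 'bought', 'some', 'milk', '.', '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0 for the first sentence, 1 for the second
```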


As shown in the figure, the output vector at the [CLS] position is classified into two categories to determine whether the second sequence in the current pair is the next sentence of the first.
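
A minimal sketch of that two-way classification on the [CLS] output, assuming the Hugging Face transformers library; the linear head here is freshly initialized purely for illustration (transformers also ships BertForNextSentencePrediction with a trained head):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
nsp_head = nn.Linear(bert.config.hidden_size, 2)  # 2-way classifier (untrained, illustrative)

inputs = tokenizer("He went to the store.", "He bought some milk.", return_tensors="pt")
with torch.no_grad():
    cls_vector = bert(**inputs).last_hidden_state[:, 0]  # output vector at the [CLS] position
logits = nsp_head(cls_vector)                             # shape (1, 2): next sentence or not
print(torch.softmax(logits, dim=-1))
```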

5. Downstream Tasks ——> Fine-tune

The BERT network is just an encoder and cannot complete a specific task by itself. However, thanks to BERT's architectural design, adding a decoder (output head) designed for the specific task after the pre-trained BERT network and fine-tuning the network on the task dataset gives the whole network excellent performance.

As shown in the figure, this is similar to a Backbone feature-extraction network in CV: you only need to add a decoder designed for the downstream task after BERT to complete the full network design.

There is no need to train the Backbone feature extractor from scratch. After designing the downstream-task decoder, fine-tuning the entire pre-trained network on the task-specific dataset solves the problem well.
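
A minimal fine-tuning sketch for one such downstream task, single-text classification, assuming the Hugging Face transformers library; the tiny dataset, label count, and learning rate are placeholders rather than anything from the original article:

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder plus a freshly initialized classification head for 2 labels.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)  # the whole network is fine-tuned

# Placeholder labeled data; in practice this comes from the downstream dataset.
texts = ["great movie", "terrible movie"]
labels = torch.tensor([1, 0])

model.train()
batch = tokenizer(texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # cross-entropy over the [CLS] classification head
loss.backward()
optimizer.step()
print(float(loss))
```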

Specific downstream tasks include single-text classification, text-pair classification or regression, text tagging (sequence labeling), and question answering. There are already many mature solutions and code for these specific methods, so, given my limited ability, I won't go into detail here.

Origin blog.csdn.net/qq_44733706/article/details/128972424