[Notes] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding


Reference notes


Summary

GPT is unidirectional (it uses only the left context to predict the future), while BERT uses the left and right context at the same time, i.e., it is bidirectional.
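
One way to see the difference concretely is in the attention mask each model uses. The sketch below is my own illustration (PyTorch assumed, not from the paper): a causal mask only lets position i attend to positions at or before i, while a bidirectional mask lets every position attend to the full left and right context.

```python
import torch

seq_len = 5

# GPT-style (unidirectional): position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style (bidirectional): every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```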

ELMo uses an RNN-based architecture, while BERT uses a Transformer. As a result, ELMo needs task-specific adjustments to its architecture for downstream tasks, whereas BERT is relatively simple: like GPT, only the top layer needs to be changed.

Preface

Pre-training language models can be used to improve many natural language processing tasks.

These tasks fall into two categories:

  • Sentence-level tasks: model the relationship between sentences, e.g. sentence sentiment classification or predicting the relationship between two sentences.
  • Token-level tasks: e.g. named entity recognition (deciding whether each word is an entity name, such as a person's name or a street name); these tasks require fine-grained, token-level output.

When using pre-trained models for feature representation, there are generally two types of strategies:

  • One strategy is feature-based; its representative work is ELMo. For each downstream task, a task-specific network is constructed (ELMo uses an RNN architecture), and the pre-trained representations (e.g. word embeddings) are fed into that model as additional features. The hope is that these features are already a good representation, so training the downstream model becomes easier. This was the most common way of using pre-trained models in NLP (the learned features are put together with the inputs as a good feature representation).
  • The other strategy is fine-tuning based; the example here is GPT. The pre-trained model is moved to the downstream task with only minimal changes, and its pre-trained parameters are fine-tuned on the downstream data (all weights are updated on the new data set). A sketch contrasting the two strategies follows this list.
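
To make the contrast concrete, here is a minimal sketch (PyTorch assumed; the tiny nn.Linear encoder is just a stand-in for a real pre-trained model such as ELMo or GPT) of how the two strategies differ in which parameters get updated:

```python
import torch
from torch import nn

# Hypothetical stand-in: any pre-trained encoder that outputs a 768-dim representation.
pretrained_encoder = nn.Linear(768, 768)
num_labels = 2  # assumed binary downstream task

# Feature-based (ELMo-style): freeze the pre-trained weights and train only a
# task-specific head on top of the extracted features.
for p in pretrained_encoder.parameters():
    p.requires_grad = False
feature_head = nn.Linear(768, num_labels)
feature_optimizer = torch.optim.Adam(feature_head.parameters(), lr=1e-3)

# Fine-tuning-based (GPT/BERT-style): initialize from the pre-trained weights and
# update *all* parameters on the labeled downstream data.
finetune_encoder = nn.Linear(768, 768)   # would be initialized from pre-trained weights
finetune_head = nn.Linear(768, num_labels)
finetune_optimizer = torch.optim.Adam(
    list(finetune_encoder.parameters()) + list(finetune_head.parameters()), lr=2e-5
)
```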

Both approaches use the same objective function during pre-training: a unidirectional language model (given some words, predict what the next word is). Because it predicts the future from the past, it is one-way.


Conclusion


Related work


This approach has long been common in computer vision: a model is pre-trained on ImageNet and then reused elsewhere, but it had not worked particularly well in NLP (perhaps partly because the tasks are quite different, and partly because the amount of data was still far from enough). BERT and its follow-up work showed that, in NLP, models trained on large unlabeled data sets perform better than models trained on smaller labeled data sets. The same idea is now slowly being adopted in computer vision: a model trained on a huge number of unlabeled images may perform better than one trained on a labeled set of about one million images such as ImageNet.

BERT model


All parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialized from the same pre-trained parameters.

There are two steps in BERT:

  • Pre-training: the model is trained on different pre-training tasks using unlabeled data.
  • Fine-tuning: a BERT model with the same architecture is used, but its weights are initialized from the pre-trained weights. All weights are then updated during fine-tuning, using labeled data.

During pre-training, the input is unlabeled sentences, and the BERT weights are learned from this unlabeled data.
For each downstream task, a new BERT model of the same architecture is created, its weights are initialized from the pre-trained weights, and it is then trained further on that task's labeled data, which yields a task-specific version of BERT.

Model architecture


Three hyperparameters control the model size:
L: the number of Transformer blocks
H: the hidden size
A: the number of heads in the multi-head self-attention mechanism

Digression:

In the self-attention block there are three projection matrices, for Q, K, and V.

  • Q, K and V projection matrices: each head's projection has dimension H × 64 (assuming each head has dimension 64, so that A × 64 = H). With A heads and the three projections for Q, K and V, this gives 3 × A × H × 64 parameters.
  • Output projection matrix of the self-attention block: its dimension is H × H.

Number of self-attention parameters = 3 × A × H × 64 + H × H = 4 × H² (using A × 64 = H)
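
As a check on this arithmetic, the small sketch below (my own, not from the paper) estimates the total parameter count from L and H: each Transformer block contributes roughly 4H² self-attention parameters plus 8H² for the feed-forward MLP (H → 4H → H), and the token embedding table contributes about 30,000 × H for BERT's WordPiece vocabulary. Note that A does not change the count, because each head has dimension H / A.

```python
def approx_bert_params(L, H, vocab_size=30_000):
    """Rough BERT parameter estimate: embeddings + L * (self-attention + MLP)."""
    embedding = vocab_size * H      # WordPiece token embedding table
    attention = 4 * H * H           # Q, K, V projections (3H^2) + output projection (H^2)
    mlp = 8 * H * H                 # feed-forward block: H -> 4H -> H
    return embedding + L * (attention + mlp)

# BERT-base  (L=12, H=768,  A=12) -> ~108M, reported as ~110M in the paper
# BERT-large (L=24, H=1024, A=16) -> ~333M, reported as ~340M in the paper
print(approx_bert_params(12, 768))
print(approx_bert_params(24, 1024))
```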

To illustrate: suppose the text contains the word "unhappiness". A traditional word-level tokenizer keeps it as a single token, "unhappiness". The WordPiece algorithm, however, may split it into two sub-words, "un" and "happiness", because both are common pieces: "un" usually expresses negation and "happiness" carries the core meaning. This segmentation captures the meaning of the text better.
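
Below is a minimal sketch of the greedy longest-match-first segmentation that WordPiece-style tokenizers use. The toy vocabulary is made up for illustration; BERT's real vocabulary has about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (simplified sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed with ##
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, purely illustrative.
toy_vocab = {"un", "##happiness", "##happy", "happy", "[UNK]"}
print(wordpiece_tokenize("unhappiness", toy_vocab))   # ['un', '##happiness']
```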

For each token fed into BERT, its vector representation is the sum of the embedding of the token itself, the embedding of the sentence (segment) it belongs to, and the embedding of its position; the figure in the paper illustrates BERT's embedding layer.
In other words, a sequence of tokens is turned into a sequence of vectors, and this vector sequence then enters the Transformer blocks. In the paper's figure, each square is one token.

  • Token embedding: an ordinary embedding layer that outputs the corresponding vector for each token.
  • Segment embedding: indicates whether the token belongs to the first or the second sentence.
  • Position embedding: its input is the position of each token in the sequence (starting from zero), up to the maximum sequence length, and it outputs the corresponding position vector.

In the end, each token's representation is its token embedding plus its segment embedding plus its position embedding.
In the original Transformer, the position information is a manually constructed matrix, but in BERT both the segment vectors and the position vectors are obtained through learning.
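
A minimal sketch of such an embedding layer (PyTorch assumed; the real BERT implementation additionally applies layer normalization and dropout on top of the sum):

```python
import torch
from torch import nn

class BertEmbeddings(nn.Module):
    """Token + segment + position embeddings, all learned (simplified sketch)."""
    def __init__(self, vocab_size=30_000, hidden=768, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(num_segments, hidden)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        # positions 0, 1, ..., seq_len-1, shared across the batch
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

# Example: one sequence of 6 tokens, the first 3 from sentence A, the last 3 from sentence B.
emb = BertEmbeddings()
token_ids = torch.randint(0, 30_000, (1, 6))
segment_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(emb(token_ids, segment_ids).shape)   # torch.Size([1, 6, 768])
```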

Pre-training

Masked LM

**Masked language model**

Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning, since the [MASK] symbol never appears during the fine-tuning stage.

The main purpose of the masking strategy is to create a more consistent training signal between pre-training and fine-tuning and thus improve the model's generalization. There is a certain data-distribution mismatch between the two stages: without this strategy, the model may perform poorly during fine-tuning because what it learned in pre-training cannot be transferred directly to the fine-tuning task.
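
Concretely, the paper masks 15% of the WordPiece tokens; of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, and the model must predict the original token at every selected position. A minimal sketch of this masking step (my own illustration):

```python
import random

MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style masking: pick 15% of tokens; of those, 80% -> [MASK],
    10% -> random token, 10% unchanged. Returns (inputs, labels)."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                  # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```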

Next Sentence Prediction (NSP)

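In this task, the input consists of two sentences A and B; 50% of the time B is the sentence that actually follows A (label IsNext), and 50% of the time B is a random sentence from another document (label NotNext). The prediction is made from the final hidden vector of the [CLS] token. Below is a minimal sketch of how such training pairs could be built (my own illustration; the example sentences loosely follow the example in the paper):

```python
import random

def make_nsp_example(doc, corpus):
    """Build one Next Sentence Prediction example from a document (a list of tokenized sentences)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"                  # 50%: the actual next sentence
    else:
        other_doc = random.choice([d for d in corpus if d is not doc])
        sent_b, label = random.choice(other_doc), "NotNext"   # 50%: a sentence from another document
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, label

corpus = [
    [["the", "man", "went", "to", "the", "store"], ["he", "bought", "a", "gallon", "of", "milk"]],
    [["penguins", "are", "flightless", "birds"], ["they", "live", "in", "the", "antarctic"]],
]
print(make_nsp_example(corpus[0], corpus))
```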

Use BERT for fine-tuning

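For a sentence-level task, fine-tuning simply puts a task-specific linear classifier on top of the final hidden vector of the [CLS] token, and all parameters are then trained end to end on the labeled data. A minimal sketch (PyTorch assumed; `bert_encoder` is a hypothetical stand-in for the pre-trained BERT body):

```python
import torch
from torch import nn

class BertForSequenceClassification(nn.Module):
    """Fine-tuning wrapper sketch: a linear classifier on the [CLS] vector."""
    def __init__(self, bert_encoder, hidden=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder            # the pre-trained BERT body (assumed given)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)  # (batch, seq_len, hidden)
        cls_vector = hidden_states[:, 0]                   # representation of [CLS]
        return self.classifier(cls_vector)                 # task-specific logits

# A dummy encoder stands in for the pre-trained BERT body, just to make the sketch runnable.
dummy_bert = lambda token_ids, segment_ids: torch.zeros(token_ids.size(0), token_ids.size(1), 768)
model = BertForSequenceClassification(dummy_bert)
logits = model(torch.zeros(2, 8, dtype=torch.long), torch.zeros(2, 8, dtype=torch.long))
print(logits.shape)   # torch.Size([2, 2])

# During fine-tuning, *all* parameters (BERT body + the new classifier) are updated, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```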


Origin blog.csdn.net/weixin_45751396/article/details/132752663