[Paper Notes] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is foundational to a lot of everyday NLP work, but I had never taken the opportunity to read the paper properly; this time I used the chance to study it.

1. Background

1.1 Training Strategy

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.

1.1.1 Unsupervised feature-based methods

These methods rely mainly on word embeddings, later generalized to coarser granularities such as sentence embeddings and paragraph embeddings. Building on word-embedding research, ELMo advances the state of the art on several major NLP benchmarks by combining contextual word embeddings with existing task-specific architectures.

[Reference] ELMo principle analysis and simple use - Zhihu (zhihu.com)

1.1.2 Unsupervised fine-tuning methods

Sentence or document encoders that produce contextual token representations are pre-trained on unlabeled text and then fine-tuned on supervised downstream tasks. The advantage of these methods is that very few parameters need to be learned from scratch.

1.2 MLM (Masked Language Model)

Randomly mask some tokens of the input, with the goal of predicting the original vocabulary ids of the masked words based only on their context.

1.3 Main Contributions

  • BERT uses a masked language model to achieve pre-trained deep bidirectional representations.
  • Pretrained representations reduce the need for many carefully designed task-specific architectures.

2. BERT

During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and then all parameters are fine-tuned using labeled data from the downstream task. A notable feature of BERT is its unified architecture across different tasks: there is only a minimal difference between the pre-trained architecture and the final downstream architecture.

2.1 Model Architecture

The model architecture of BERT is a multi-layer bidirectional Transformer encoder.
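As a concrete reference point, here is a minimal sketch of the BERT-base shape (L=12 layers, hidden size H=768, A=12 attention heads, roughly 110M parameters). It assumes the Hugging Face transformers library rather than the paper's original TensorFlow code:

```python
# Minimal sketch of the BERT-base encoder shape, assuming Hugging Face `transformers`.
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,          # WordPiece vocabulary used by BERT
    hidden_size=768,           # H
    num_hidden_layers=12,      # L
    num_attention_heads=12,    # A
    intermediate_size=3072,    # feed-forward inner dimension, 4 * H
)
model = BertModel(config)      # randomly initialized encoder with this shape
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```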

2.2 Input/Output Representation

  • A "sentence" can be any span of continuous text, rather than an actual sentence of language.
  • "Sequence" refers to BERT's input token sequence, which can be a sentence or two sentences combined.
  • "special token"
  1. classification token ([CLS]): the first token of each sequence is always a special classification token
  2. special token ([SEP]): Sentence pairs combined into a single sequence. (1) We separate them with a special token ([SEP]). (2) We add a learned embedding to each token indicating whether it belongs to sentence a or sentence B.
  3. [MASK]: "Mask" word

The input representation of each token is built by summing the corresponding token, segment, and position embeddings.
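A quick way to see this input format is to tokenize a sentence pair; the sketch below assumes the Hugging Face transformers tokenizer for bert-base-uncased (the model name and example sentences are mine, not from the paper):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("the man went to the store", "he bought a gallon of milk")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(enc["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```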

2.3 Pre-training BERT

2.3.1 Task 1: Masked Language Model (Masked LM)

A deep bidirectional model is strictly more powerful than a left-to-right model or a shallow concatenation of a left-to-right and a right-to-left model. A standard conditional language model, however, cannot simply be trained bidirectionally, because bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.

To train deep bidirectional representations, a percentage of the input tokens is masked at random, and the model then predicts those masked tokens. This procedure is called "masked LM" (MLM).

To mitigate the mismatch between pre-training and fine-tuning (the [MASK] token never appears during fine-tuning), a token chosen for masking is not always replaced with [MASK]:

(1) 80% of the time: replace it with the [MASK] token.

(2) 10% of the time: replace it with a random token.

(3) 10% of the time: keep the i-th token unchanged.
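A minimal sketch of this 80/10/10 rule, assuming PyTorch and the 15% masking rate used in the paper; handling of special tokens and padding is omitted for brevity:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Sketch of BERT-style MLM masking: choose ~15% of positions, then apply the 80/10/10 rule."""
    labels = input_ids.clone()
    chosen = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~chosen] = -100     # loss is computed only on the chosen positions

    # 80% of chosen positions: replace with [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & chosen
    input_ids[masked] = mask_token_id

    # 10% of chosen positions: replace with a random token
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & chosen & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # remaining 10%: keep the original token unchanged
    return input_ids, labels
```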

2.3.2 Task 2: Next Sentence Prediction

To train a model that understands sentence relationships, BERT is additionally pre-trained on a binarized next sentence prediction task: for each pair, sentence B is the actual next sentence 50% of the time and a random sentence from the corpus the other 50%.
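A rough sketch of how such IsNext/NotNext pairs could be built; the doc/corpus structures here are illustrative assumptions, not the paper's actual data pipeline:

```python
import random

def make_nsp_example(doc, corpus, idx):
    """doc: list of sentences; corpus: list of such documents (assumed structures)."""
    a = doc[idx]
    if random.random() < 0.5 and idx + 1 < len(doc):
        return a, doc[idx + 1], 1          # IsNext: B really follows A
    other = random.choice(corpus)
    return a, random.choice(other), 0      # NotNext: B is a random sentence
```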

2.4 Fine-tuning

For each task, we simply plug task-specific inputs and outputs into BERT and fine-tune all parameters end-to-end.
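As an illustration, the sketch below fine-tunes all BERT parameters plus a small classification head on a toy two-example batch, assuming the Hugging Face transformers library (model name, learning rate, and examples are mine):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # all parameters, end-to-end

batch = tok(["a delightful film", "a tedious mess"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

loss = model(**batch, labels=labels).loss   # classification head sits on top of [CLS]
loss.backward()
optimizer.step()
```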

The paper closes with experiments and ablation studies, which I won't go into here.

3. Code

A basic BertModel consists of three parts: BertEmbeddings, BertEncoder, and BertPooler.

3.1 Embeddings

Each input token's representation is the sum of three parts: word embeddings, position embeddings, and token type embeddings (a minimal sketch follows the list below).

  • Word embeddings: a learned vector for each WordPiece token in the vocabulary.
  • Position embeddings: each position in the input sequence is assigned an embedding vector encoding where the token sits in the sequence; in BERT these are learned absolute position embeddings.
  • Token type (segment) embeddings: mainly used for sentence-level tasks such as text classification and sentence-pair relationship judgment. They tell the model which sentence a token belongs to, helping it distinguish the relationship between the two sentences. In BERT the input sequence is typically split into two sentences (for example, [CLS] sentence1 [SEP] sentence2 [SEP]), where the [SEP] token separates them.
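The sketch below is a stripped-down, hypothetical version of what a BertEmbeddings-style module does in PyTorch: look up the three embeddings, sum them, then apply LayerNorm and dropout:

```python
import torch
import torch.nn as nn

class SimpleBertEmbeddings(nn.Module):
    """Illustrative sketch, not the actual BertEmbeddings implementation."""
    def __init__(self, vocab_size=30522, hidden=768, max_positions=512, type_vocab=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)         # one vector per WordPiece token
        self.position = nn.Embedding(max_positions, hidden)  # learned absolute positions
        self.token_type = nn.Embedding(type_vocab, hidden)   # sentence A vs. sentence B
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.word(input_ids) + self.position(positions) + self.token_type(token_type_ids)
        return self.dropout(self.norm(x))
```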

3.2 BertEncoder

The encoder is a stack of several BertLayer modules.

Each BertLayer is essentially multi-head self-attention (BertAttention) plus [a fully connected layer with an activation function + a fully connected output layer].

The [fully connected layer with activation + fully connected output] part can be viewed as the feed-forward block; together with the attention, this is exactly the structure of the Transformer encoder. A rough sketch of one such layer follows.
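Below is a rough PyTorch sketch of one such layer (illustrative, not the actual BertLayer code), stacked to form the encoder:

```python
import torch.nn as nn

class SimpleBertLayer(nn.Module):
    """One encoder layer: multi-head self-attention, then [dense + GELU + dense],
    each sub-block wrapped with a residual connection and LayerNorm."""
    def __init__(self, hidden=768, heads=12, ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))

# The encoder is simply a stack of such layers (12 for BERT-base).
encoder = nn.ModuleList([SimpleBertLayer() for _ in range(12)])
```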

 3.3 BertPooler
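The pooler takes the hidden state of the first token ([CLS]) and passes it through a dense layer with a tanh activation, producing a fixed-size sentence-level representation for classification heads. A minimal sketch (illustrative, not the library code):

```python
import torch.nn as nn

class SimpleBertPooler(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):        # (batch, seq_len, hidden)
        first_token = hidden_states[:, 0]    # the [CLS] position
        return self.activation(self.dense(first_token))
```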

 ————————————————————————————————————————

In fact, there are two core questions about BERT: 1. How does it jointly condition on context in all layers? 2. Where does the bidirectionality show up?

  1. In BERT, joint conditioning on context in all layers comes from the Transformer's multi-layer stacked structure: each layer consists of multi-head self-attention plus a feed-forward network, and the self-attention in every layer attends over the whole sequence, so the representation at every depth is conditioned on both left and right context, not just at the input layer. In the pre-training phase, the BERT model learns these contextual representations from a large-scale unlabeled corpus; in the fine-tuning stage, the same pre-trained parameters are reused and adjusted end-to-end for the specific task.

  2. Bidirectionality is an important feature of the BERT model, which is reflected in two aspects:

    a. Bidirectionality of the input representation: BERT uses a bidirectional Transformer structure, i.e. both left and right context are considered when processing the input sequence. This comes from the Transformer's multi-head self-attention, where the representation at each position is influenced by all positions to its left and right, allowing the model to capture global semantic information (see the sketch after this list).

    b. Bidirectionality of pre-training: BERT's pre-training uses two tasks, Masked Language Model (MLM) and Next Sentence Prediction (NSP). The MLM task randomly masks some words in the input sequence and asks the model to predict them; the NSP task asks the model to judge whether the second sentence actually follows the first. Both tasks require bidirectional modeling: the model must infer a masked word from the context on both sides, or reason about the relationship between two sentences. Through this pre-training, BERT learns bidirectional semantic information and benefits from it on downstream tasks.
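As an illustration of point (a) above (my own sketch, not from the paper), the only thing separating this bidirectional setup from a left-to-right language model such as GPT is the attention mask:

```python
import torch

seq_len = 5
bidirectional_mask = torch.ones(seq_len, seq_len)        # BERT: every position attends everywhere
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # left-to-right LM: no attending to the future

print(bidirectional_mask)
print(causal_mask)   # lower-triangular: position i only sees positions <= i
```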
