Introduction to the BERT model
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based model that jointly conditions on both the left and right context in all layers, so that every layer learns bidirectional semantic information.
The Transformer is a deep learning architecture that processes sequence positions in parallel, which lets it handle larger-scale data and speeds up model training. Its attention mechanism gathers contextual information about each word from every other word in the sequence, producing higher-quality embedding representations.
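The attention mechanism described above can be sketched as scaled dot-product attention. This is a minimal NumPy illustration of the core formula, softmax(QK^T / sqrt(d_k)) V, not BERT's actual multi-head implementation; the sizes are toy values chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity between every pair of positions
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a softmax over positions
    return weights @ V                            # each position mixes in context from all others

# Toy self-attention: 4 positions, hidden dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because every position attends to every other position in a single step, the computation for all positions can run in parallel, which is the parallelism advantage mentioned above.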
Bidirectional models are widely used in natural language processing: the text is read in both the left-to-right and the right-to-left order.
BERT is well suited to producing high-quality contextualized embedding representations, and it can be trained with self-supervised tasks such as language modeling, which require no manual annotation.
The following figure shows the direction of BERT's information flow (BERT applies bidirectional text representations in all layers).
BERT's input (based on Transformer)
BERT's input embedding is obtained by summing Token Embeddings, Segment Embeddings, and Position Embeddings.
$$Input\_Embeddings = Token\_Embeddings + Segment\_Embeddings + Position\_Embeddings$$
- Token Embeddings: words are split into subwords; a specific example is dividing "playing" into "play" and "##ing".
- Segment Embeddings: mainly used to distinguish different sentences. For example, if the input contains two sentences, there are two segment embeddings: $E_A$ and $E_B$.
- Position Embeddings: store position information. BERT's position embeddings are obtained through learning (rather than fixed sinusoids), and the maximum sequence length is assumed to be 512.
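The summation of the three embeddings can be sketched in NumPy as follows. The table sizes here are hypothetical toy values (real BERT-base uses a vocabulary of 30,522, hidden size 768, and maximum length 512), and the token IDs are made up for illustration.

```python
import numpy as np

# Toy sizes for illustration only; not BERT's real dimensions.
vocab_size, max_len, num_segments, hidden = 100, 16, 2, 8
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, hidden))      # one row per subword
segment_table = rng.normal(size=(num_segments, hidden))  # rows for E_A and E_B
position_table = rng.normal(size=(max_len, hidden))      # learned, one row per position

token_ids = np.array([2, 45, 7, 3])    # hypothetical IDs, e.g. [CLS] play ##ing [SEP]
segment_ids = np.array([0, 0, 0, 0])   # all tokens belong to sentence A
positions = np.arange(len(token_ids))  # 0, 1, 2, 3

# Input_Embeddings = Token_Embeddings + Segment_Embeddings + Position_Embeddings
input_embeddings = (token_table[token_ids]
                    + segment_table[segment_ids]
                    + position_table[positions])
print(input_embeddings.shape)  # (4, 8)
```

Each of the three lookups produces one vector per token, and the element-wise sum gives the final input representation fed into the first Transformer layer.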
BERT's pre-training task: Masked LM
BERT's pre-training tasks are mainly the Masked LM task (masked-token prediction) and the Next Sentence Prediction (NSP) task.
The Masked LM task alleviates the information-leakage problem of the bidirectional reading order, illustrated below (some bidirectional models can already "see" the word they are trying to predict while encoding it). During the Masked LM pre-training task, BERT predicts only the words at [MASK] positions, so it can better exploit contextual information and obtain higher-quality embedding representations. However, in downstream tasks the input is a complete sentence with no [MASK] tokens. To alleviate this mismatch between training and use, the following procedure is applied during training.
The example sentence is "my dog is hairy", and "hairy" is selected as the token to predict.
- With 80% probability, convert "my dog is hairy" into "my dog is [MASK]"
- With 10% probability, leave "my dog is hairy" unchanged
- With 10% probability, replace "hairy" with a random word such as "apple", converting the sentence into "my dog is apple"
BERT's pre-training task: Next Sentence Prediction (NSP)
The NSP (next sentence prediction) task works as follows. Given two sentences A and B, BERT concatenates them as [CLS] A1 A2 A3 … An [SEP] B1 B2 B3 … Bn [SEP]. With 50% probability, B is the sentence that actually follows A in the corpus (A and B are consecutive); with 50% probability, B is a random sentence that does not follow A. BERT then uses the output C at the [CLS] position to judge (predict) whether B is the next sentence of A. The specific situation is shown in the figure.
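Constructing one NSP training pair can be sketched as follows. The function name and sentence data are hypothetical, and the negative-sampling here is simplified: a real implementation draws the random sentence from a different document so it cannot accidentally be the true next sentence.

```python
import random

def make_nsp_example(sentences, idx, rng=random):
    """Build one NSP pair from a list of tokenized sentences.
    With probability 0.5, B is the true next sentence (label IsNext);
    otherwise B is a randomly chosen sentence (label NotNext)."""
    a = sentences[idx]
    if rng.random() < 0.5:
        b, label = sentences[idx + 1], "IsNext"
    else:
        b, label = rng.choice(sentences), "NotNext"  # simplified negative sampling
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    return tokens, label

random.seed(1)
sents = [["my", "dog", "is", "hairy"],
         ["he", "likes", "playing"],
         ["the", "sky", "is", "blue"]]
tokens, label = make_nsp_example(sents, 0)
print(tokens, label)
```

The classifier that predicts IsNext versus NotNext reads only the hidden state C at the [CLS] position, which is why that position is placed at the front of every input.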
BERT code
BERT model results
Experimental results on the nine GLUE tasks
Experimental results on the SQuAD 1.1 task
Big Bird model
References
- Big Bird
- NVIDIA BERT introduction
- Thoroughly understanding the Google BERT model