Comparison between BERT model and Big Bird model

Introduction to BERT model

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based model that jointly conditions on both left and right context in all layers, learning contextual semantic information.
The Transformer is a deep learning architecture that processes sequences in parallel, which lets it handle larger-scale data and speeds up model training. Its attention mechanism allows each word to gather information from the other words in its context, producing higher-quality embedding representations.
Bidirectional models are widely used in natural language processing: the text is read in both directions, left-to-right and right-to-left.
BERT is well suited for producing high-quality contextualized embedding representations, and it can be trained with self-supervised tasks (requiring no manual annotation) such as language modeling.
Figure: the direction of information flow in BERT (BERT applies bidirectional text representations at every layer).

BERT's input (based on Transformer)

BERT's input embedding is obtained by summing the Token Embeddings, Segment Embeddings, and Position Embeddings (a minimal sketch follows the list below).
Input_Embeddings = Token_Embeddings + Segment_Embeddings + Position_Embeddings

  • Token Embeddings: the input text is first split into subwords, for example "playing" is split into "play" and "##ing", and each subword is mapped to its token embedding.
  • Segment Embeddings: used to distinguish the different sentences in the input. With two input sentences there are two segment embeddings, $E_A$ and $E_B$.
  • Position Embeddings: store position information. BERT's position embeddings are also learned, and the maximum sequence length is assumed to be 512.
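As a rough illustration of the formula above, here is a minimal PyTorch sketch of how the three embeddings could be summed. It is a simplified sketch under stated assumptions, not the official implementation (which, for example, also applies dropout); the class name `BertInputEmbeddings` and the default sizes are illustrative only.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Minimal sketch: input embedding = token + segment + position embeddings."""

    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position=512, type_vocab_size=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)         # subword ids -> vectors
        self.segment = nn.Embedding(type_vocab_size, hidden_size)  # sentence A (0) / sentence B (1)
        self.position = nn.Embedding(max_position, hidden_size)    # learned positions, max length 512
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: tensors of shape (batch, seq_len)
        seq_len = token_ids.size(1)
        pos_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        emb = self.token(token_ids) + self.segment(segment_ids) + self.position(pos_ids)
        return self.layer_norm(emb)
```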

BERT's pre-training task: Masked LM (MLM)

BERT's pre-training tasks are the Masked LM (MLM) task, i.e. masked-token prediction, and the Next Sentence Prediction (NSP) task.
The Masked LM task alleviates the information-leakage problem of a bidirectional reading order, in which a model can already see the word it is asked to predict while encoding, as illustrated below.

Figure: illustration of information leakage in a bidirectional model.

In the Masked LM pre-training task, BERT only predicts the words at the [MASK] positions, which lets it exploit context from both directions and obtain higher-quality embedding representations. However, in downstream tasks the input is a complete sentence with no [MASK] tokens. To alleviate this mismatch, the following procedure is used during training.

The example sentence is "my dog is hairy", and "hairy" is selected for masking (a minimal sketch of the rule follows the list).

  • With 80% probability, the sentence "my dog is hairy" is converted into "my dog is [MASK]"
  • With 10% probability, the sentence "my dog is hairy" is left unchanged
  • With 10% probability, "hairy" is replaced with a random word such as "apple", converting the sentence into "my dog is apple"
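A minimal sketch of this 80/10/10 rule, assuming a token has already been selected for prediction (BERT selects 15% of input tokens); the function name `apply_mask` and the toy vocabulary are hypothetical.

```python
import random

def apply_mask(token, vocab, rng=random):
    """Sketch of BERT's 80/10/10 rule for a token already chosen for prediction."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"             # 80%: replace with the [MASK] token
    elif r < 0.9:
        return token                # 10%: keep the original token unchanged
    else:
        return rng.choice(vocab)    # 10%: replace with a random word, e.g. "apple"

# Example: the selected token "hairy" from "my dog is hairy".
print(apply_mask("hairy", vocab=["apple", "dog", "play"]))
```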

BERT's pre-training task: Next Sentence Prediction (NSP)

The NSP (Next Sentence Prediction) task works as follows. Suppose there are two sentences A and B. BERT concatenates them as [CLS] A1 A2 A3 … An [SEP] B1 B2 B3 … Bn. With 50% probability, B is the sentence that actually follows A in the corpus; with the remaining 50% probability, B is a randomly chosen sentence that does not follow A. BERT then uses the output C at the [CLS] position to predict whether B is the next sentence of A, as shown in the figure.
Figure: the NSP input format and the prediction made from the [CLS] output.
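A minimal sketch of how an NSP training pair could be built; the function name `make_nsp_example` is hypothetical, and the trailing [SEP] after sentence B follows the original BERT input format.

```python
import random

def make_nsp_example(sent_a, next_sent, corpus, rng=random):
    """Sketch: build one NSP pair (50% IsNext, 50% NotNext)."""
    if rng.random() < 0.5:
        sent_b, label = next_sent, "IsNext"            # B really follows A
    else:
        sent_b, label = rng.choice(corpus), "NotNext"  # B is a random sentence
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)  # for Segment Embeddings
    return tokens, segment_ids, label

# Example with toy tokenized sentences.
tokens, segments, label = make_nsp_example(
    ["my", "dog", "is", "hairy"],
    ["he", "likes", "play", "##ing"],
    corpus=[["penguins", "are", "flightless"]],
)
print(label, tokens)
```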

BERT code

BERT official code on GitHub
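The official repository is written in TensorFlow. As an alternative sketch, the snippet below uses the Hugging Face `transformers` library (assumed to be installed) to load a pre-trained BERT model and extract contextual embeddings; the checkpoint name `bert-base-uncased` is chosen for illustration.

```python
# Sketch: obtain contextual embeddings from a pre-trained BERT model
# via the Hugging Face `transformers` library.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("my dog is hairy", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```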

BERT model performance

Experimental results on the nine GLUE tasks


Experimental results on the SQuAD 1.1 task


Big Bird model

References

Big Bird
NVIDIA BERT introduction
Thoroughly understand the Google BERT model

Origin blog.csdn.net/xiaziqiqi/article/details/131876902