[DataWhale Learning Record 15-06] Introductory NLP from Scratch: News Text Classification - Task 06: Text Classification Based on Deep Learning (Part 3)

BERT principle:
BERT's key innovation is applying a bidirectional Transformer to language modeling. Previous models processed a text sequence from left to right, or combined separate left-to-right and right-to-left training.
Experimental results show that a bidirectionally trained language model gains a deeper sense of context than a unidirectional one.
To make this possible, the paper introduces a new technique called Masked LM (MLM); before MLM appeared, bidirectional language model training was not feasible.

BERT makes use of the encoder part of the Transformer.
The Transformer is an attention-based architecture that learns the contextual relationships between words in a text.
In its original form, the Transformer contains two independent mechanisms: an encoder that receives the text as input, and a decoder that produces the prediction for the task.
Since BERT's goal is to produce a language model, only the encoder mechanism is needed.

The Transformer encoder reads the entire token sequence at once, instead of reading it sequentially from left to right or right to left.
This property lets the model learn a word's representation from both its left and right context, which is what makes it bidirectional.

The following figure shows the encoder part of the Transformer. The input is a sequence of tokens, which are first embedded into vectors and then fed into the neural network. The output is a sequence of vectors of size H, each corresponding to the input token at the same index.
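To make the "token sequence in, vector sequence of size H out" behaviour concrete, here is a minimal sketch that runs one sentence through a pretrained BERT encoder. It assumes the Hugging Face `transformers` library and the publicly available `bert-base-chinese` checkpoint; neither is prescribed by the original post.

```python
import torch
from transformers import BertModel, BertTokenizer

# Load a pretrained tokenizer and encoder (checkpoint chosen only as an example).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

# The tokenizer turns raw text into a token-id sequence (adding [CLS]/[SEP]).
inputs = tokenizer("今天天气很好", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One hidden vector of size H (768 for the base model) per input token.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 8, 768])
```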
When training a language model, one challenge is defining the prediction target. Many models predict the next word in a sequence (e.g. "The child came home from ___"), a directional approach that inherently limits what the model can learn from context. To overcome this problem, BERT uses two training strategies:

  1. Masked LM (MLM)
    Before the word sequences are fed into BERT, 15% of the tokens in each sequence are replaced with a [MASK] token. The model then tries to predict the original value of the masked tokens from the context provided by the other, non-masked tokens in the sequence.
    This requires (see the sketch below):

Add a classification layer on top of the encoder output.
Multiply the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
Use softmax to calculate the probability of each word in the vocabulary.
BERT's loss function only considers the predictions for the masked tokens and ignores the predictions for the non-masked tokens. As a result, the model converges more slowly than directional models, but its context awareness is greater.
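The steps above can be illustrated with a toy sketch of the MLM objective: mask roughly 15% of the tokens, run the sequence through an encoder, project each output vector to vocabulary size, and compute the loss only at the masked positions. The encoder here is a small stand-in (`nn.TransformerEncoder`), not the real BERT network, a plain `nn.Linear` stands in for the tied embedding matrix, and all sizes are made up for illustration.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, seq_len, batch = 1000, 64, 16, 2
mask_token_id = 3  # hypothetical id reserved for [MASK]

embedding = nn.Embedding(vocab_size, hidden_size)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True),
    num_layers=2,
)
# Classification layer projecting each output vector to the vocabulary dimension.
mlm_head = nn.Linear(hidden_size, vocab_size)

tokens = torch.randint(10, vocab_size, (batch, seq_len))
labels = tokens.clone()

# Replace ~15% of the positions with [MASK]; only these positions are predicted.
mask = torch.rand(tokens.shape) < 0.15
mask[0, 0] = True          # guarantee at least one masked position in this toy example
tokens[mask] = mask_token_id
labels[~mask] = -100       # -100 is ignored by the loss, i.e. non-masked words are skipped

logits = mlm_head(encoder(embedding(tokens)))  # (batch, seq_len, vocab_size)
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.view(-1, vocab_size), labels.view(-1)
)
loss.backward()  # the softmax over the vocabulary is folded into CrossEntropyLoss
```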

  2. Next Sentence Prediction (NSP)
    During BERT training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the follow-up sentence of the first in the original document.
    During training, 50% of the input pairs are consecutive sentences from the original document, while the other 50% pair the first sentence with a sentence chosen at random from the corpus, unrelated to the first.
    To help the model distinguish the two sentences during training, the input is processed as follows before entering the model (sketched in code below):

Insert the [CLS] tag at the beginning of the first sentence, and insert the [SEP] tag at the end of each sentence.
Add a sentence embedding representing sentence A or sentence B to each token.
Add a position embedding to each token to indicate its position in the sequence.
To predict whether the second sentence is the follow-up of the first sentence, the following steps are performed:

The entire input sequence is fed into the Transformer model.
The output at the [CLS] position is transformed into a 2×1 vector by a simple classification layer.
Softmax is used to calculate the probability of IsNextSequence.
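The sentence-pair input format described above ([CLS], [SEP] and the sentence A/B distinction) can be seen directly in a tokenizer's output. This sketch assumes the Hugging Face `transformers` tokenizer with the `bert-base-uncased` checkpoint; `token_type_ids` plays the role of the sentence embedding, and the position embeddings are added inside the model itself.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair; [CLS] and [SEP] are inserted automatically.
encoded = tokenizer("The child came home from school.",
                    "He played with his friends.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'child', ..., '[SEP]', 'he', 'played', ..., '[SEP]']
print(encoded["token_type_ids"])  # 0 for sentence-A tokens, 1 for sentence-B tokens
```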
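A minimal sketch of the prediction step itself, assuming the encoder's output is already available: take the vector at the [CLS] position, map it to two classes with a simple linear layer, and apply softmax to obtain the IsNext probability. The random `sequence_output` below is only a placeholder for a real encoder's output.

```python
import torch
import torch.nn as nn

hidden_size = 768
# Placeholder for the encoder output: (batch, seq_len, H).
sequence_output = torch.randn(1, 12, hidden_size)

cls_vector = sequence_output[:, 0, :]   # the output vector at the [CLS] position
nsp_head = nn.Linear(hidden_size, 2)    # the "simple classification layer"
probs = torch.softmax(nsp_head(cls_vector), dim=-1)
print(probs)  # two probabilities: P(IsNext), P(NotNext)
```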

Source: blog.csdn.net/qq_40463117/article/details/107812100