Named Entity Recognition with Deep Learning (6): An Introduction to BERT

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It can be understood as a bidirectional encoding model built on the Transformer architecture. So, to understand how BERT works, we first need to understand what the Transformer is.
The Transformer is, simply put, a black box that converts one sequence into another. Inside the black box are an encoder and a decoder: the encoder is responsible for encoding the input sequence, and the decoder is responsible for converting the encoder's output into another sequence. For details, refer to the article "I want to study the BERT model? Take a look at this article!"

It should be noted that the Transformer used in BERT does not represent positional information with Positional Encoding but with Positional Embedding, so the positional information is learned during training. To let the model take both the left and right context of a word into account, BERT uses a bidirectional Transformer architecture. Because positional information is learned as embeddings, the maximum sequence length is fixed at training time; for the released BERT pre-trained models this maximum is 512 tokens. In other words, if a sample exceeds that length, it must be truncated or otherwise processed so that the sequence stays within 512 tokens.
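To make the 512-token limit concrete, here is a minimal sketch of truncating an over-long input at tokenization time. It assumes the Hugging Face transformers package and the public bert-base-uncased checkpoint, neither of which is mentioned in the original post.

```python
# A minimal sketch (assuming the Hugging Face `transformers` package and the
# public `bert-base-uncased` checkpoint) of keeping an input within BERT's
# 512-token limit by truncating at tokenization time.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

long_text = "some very long document " * 200  # clearly longer than 512 tokens

encoded = tokenizer(
    long_text,
    max_length=512,      # BERT's positional embeddings only cover 512 positions
    truncation=True,     # drop tokens beyond max_length instead of failing
    return_tensors="pt",
)

print(encoded["input_ids"].shape)  # -> torch.Size([1, 512])
```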

What can BERT do?

  • Text inference
    Given two sentences, predict the relationship of the second to the first: entailment, contradiction, or neutral.
  • Question answering
    Given a question and a passage, predict the span in the passage that serves as the answer.
  • Text classification
    For example, predicting the sentiment of movie reviews.
  • Text similarity matching
    Given two sentences, compute their semantic similarity.
  • Named Entity Recognition
    Given a sentence, output the specific entities it contains, such as names, addresses, and times (a minimal usage sketch follows this list).
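Since named entity recognition is the topic of this series, here is a minimal inference sketch. It assumes the Hugging Face transformers pipeline API and the community checkpoint dslim/bert-base-NER, which are illustration choices rather than anything from the original post.

```python
# A minimal sketch (assuming the Hugging Face `transformers` package and the
# community checkpoint `dslim/bert-base-NER`) of BERT-based NER inference.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces back into whole entities
)

for entity in ner("John lives in New York and works for the United Nations."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# Expected output (roughly): PER John / LOC New York / ORG United Nations
```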

How to use BERT?

There are two ways to use BERT:

  • feature-based
    Use the pre-trained BERT model directly to extract a sequence of feature vectors for a text. Example: text similarity matching.

  • fine-tuning
    Add new network layers on top of the pre-trained model; first freeze all layers of the pre-trained model and train only the new layers; once that training is complete, unfreeze the pre-trained layers and jointly train them together with the added layers (a rough sketch follows this list). Examples: text classification, named entity recognition.
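The fine-tuning item above describes a two-stage recipe: freeze the pre-trained layers while training the new head, then unfreeze and train jointly. The sketch below shows one way that could look in code, assuming Hugging Face transformers with PyTorch and a sequence classification head; the learning rates are placeholder values, not recommendations from the original post.

```python
# A minimal sketch (assuming Hugging Face `transformers` + PyTorch) of the
# two-stage fine-tuning pattern described above: first train only the newly
# added classification head with the BERT encoder frozen, then unfreeze the
# encoder and train everything jointly.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Stage 1: freeze every pre-trained BERT layer; only the new classifier trains.
for param in model.bert.parameters():
    param.requires_grad = False
head_optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
# ... run a few epochs of training with head_optimizer ...

# Stage 2: unfreeze the pre-trained layers and jointly train the whole model,
# typically with a much smaller learning rate.
for param in model.bert.parameters():
    param.requires_grad = True
joint_optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... continue training with joint_optimizer ...
```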

Why can BERT do this?

BERT is trained in an unsupervised way and mainly uses two strategies to learn representations of a sequence.

MLM

To train a deep bidirectional representation, the authors simply mask a certain percentage of the input tokens at random and then predict those masked tokens. This step is called "masked LM" (MLM); in the literature it is also known as the Cloze task. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM. In the experiments, the authors randomly mask 15% of the tokens in each sequence. Although this yields a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, because the [MASK] symbol never appears during fine-tuning. So a way is needed for the model to also learn representations for the words that get masked out, and the authors therefore use the following tactics:
Suppose the original sentence is "my dog is hairy". As mentioned in Task 1 of Section 3.1, 15% of the token positions in the sentence are randomly selected for masking; assume the randomly selected position here is the fourth token, i.e. "hairy" is to be masked. The masking process is then as follows:

  • 80% of the time: replace the target word with [MASK], for example: my dog is hairy -> my dog is [MASK].
  • 10% of the time: replace the target word with a random word, for example: my dog is hairy -> my dog is apple.
  • 10% of the time: leave the target word unchanged, for example: my dog is hairy -> my dog is hairy. (The purpose of this is to bias the representation toward the actually observed word.)

The process above needs to be understood together with how training proceeds over epochs: one epoch means the model has seen every sample once, and each sample is fed into the model repeatedly across multiple epochs. With this in mind, the 80%, 10%, 10% figures above are easy to understand: each time a sample is fed to the model, the probability of replacing the target word with [MASK] is 80%, the probability of replacing it with a random word is 10%, and the probability of leaving it unchanged is 10%.

Some articles introducing BERT explain the MLM process as if, of the 15% of tokens randomly selected from the original sentence, a fixed 80% have the target word replaced with [MASK], 10% have it replaced with a random word, and 10% are left unchanged. This understanding is wrong.

The paper then discusses the benefits of the masking strategy above. The bottom line is that, with this strategy, the Transformer encoder does not know which words it will be asked to predict, nor which words have been replaced by random ones, so it is forced to maintain a distributional contextual representation for every input token. In other words, if the model could learn in advance which words it has to predict, it would lose the incentive to learn contextual information; since it cannot, it must rely on the learned context to predict the target words, and a model trained this way gains the ability to represent sentences. Moreover, since random replacement only happens to 1.5% of all tokens in a sentence (i.e., 10% of 15%), it does not harm the model's language understanding.
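To make the per-feeding 80% / 10% / 10% decision concrete, here is a minimal sketch of a dynamic masking step. The token list and tiny vocabulary are made up for illustration; the real BERT preprocessing differs in detail (for one, it operates on WordPiece token ids rather than raw words).

```python
# A minimal sketch of the 80% / 10% / 10% masking decision described above.
# It re-samples the choice every time a sequence is fed to the model, so the
# same sentence can be masked differently in different epochs. Token strings
# and the vocabulary are made up purely to illustrate the idea.
import random

VOCAB = ["my", "dog", "is", "hairy", "apple", "store", "milk"]

def mask_tokens(tokens, mask_prob=0.15):
    tokens = list(tokens)
    labels = [None] * len(tokens)          # only masked positions get a label
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:   # 85%: leave this position alone
            continue
        labels[i] = token                  # the model must predict the original
        dice = random.random()
        if dice < 0.8:                     # 80%: replace with [MASK]
            tokens[i] = "[MASK]"
        elif dice < 0.9:                   # 10%: replace with a random word
            tokens[i] = random.choice(VOCAB)
        # remaining 10%: keep the original word unchanged
    return tokens, labels

print(mask_tokens(["my", "dog", "is", "hairy"]))
# e.g. (['my', 'dog', 'is', '[MASK]'], [None, None, None, 'hairy'])
```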

NSP

Many downstream tasks, such as question answering and natural language inference, depend on understanding the relationship between two sentences, and this relationship cannot be captured directly by language modeling. To train a model that can understand sentence relationships, the authors pre-train on a binary next sentence prediction task whose examples can be generated from any monolingual corpus. Specifically, when choosing sentences A and B for each prediction example, 50% of the time B is the actual sentence that follows A (labeled IsNext), and 50% of the time B is a random sentence from the corpus (labeled NotNext). In Figure 1, C is the output used for the next sentence prediction (NSP) label.

"The next sentence prediction," the task examples:

Input = [CLS] the man went to [MASK] store [SEP]
            he bought a gallon [MASK] milk [SEP]
            
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]
            penguin [MASK] are flight ##less birds [SEP]

Label = NotNext
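As a rough illustration of how IsNext / NotNext training pairs can be generated from a monolingual corpus, here is a minimal sketch; the toy corpus is invented for the example and this is not the authors' actual data pipeline.

```python
# A minimal sketch of building next-sentence-prediction pairs: 50% of the
# time B really follows A (IsNext), 50% of the time B is a random sentence
# from the corpus (NotNext). The toy corpus below is invented for illustration.
import random

corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]

def make_nsp_pair(corpus):
    i = random.randrange(len(corpus) - 1)   # pick a sentence that has a successor
    sentence_a = corpus[i]
    if random.random() < 0.5:
        return sentence_a, corpus[i + 1], "IsNext"
    # otherwise draw a random sentence that is not the true next one
    candidates = [s for j, s in enumerate(corpus) if j != i + 1]
    return sentence_a, random.choice(candidates), "NotNext"

print(make_nsp_pair(corpus))
```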

What other models can do this, and how do they differ from BERT?

The paper's authors mention two other models, OpenAI GPT and ELMo.
Figure 3 compares the architectures of the three models:

  • BERT uses a bidirectional Transformer architecture and is pre-trained with MLM and NSP.
  • OpenAI GPT uses a left-to-right Transformer.
  • ELMo trains left-to-right and right-to-left models independently and then concatenates their outputs to provide sequence features for downstream tasks.

Among the three architectures above, BERT is the only model in which every layer jointly conditions on both left and right context. Besides the architectural differences, it should also be noted that BERT and OpenAI GPT are fine-tuning based, while ELMo is feature-based.

More details

Read the original paper, or refer to the author's article "BERT paper interpretation".


Source: www.cnblogs.com/anai/p/11647115.html