Do you know about Google's latest model, BERT?


 

BERT stands for Bidirectional Encoder Representations from Transformers.

 

On October 11, 2018, Google AI Language published the paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

 

Let's first take a look at where BERT sits on the Stanford Question Answering Dataset (SQuAD) leaderboard:

https://rajpurkar.github.io/SQuAD-explorer/

 

What can BERT be used for?

 

BERT can be used for question answering systems, sentiment analysis, spam filtering, named entity recognition, document clustering, and other tasks, serving as the language model foundation for all of them.

 

BERT's code is open source:

https://github.com/google-research/bert

 

We can fine-tune it and apply it to our own objectives and tasks; fine-tuning BERT is fast and simple.

 

For example, for the NER task, BERT language models have been pre-trained on more than 100 languages; here is the list of supported languages:

https://github.com/google-research/bert/blob/master/multilingual.md

 

For any of these 100 languages, as long as you have NER data, you can quickly train an NER model.
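
Below is a minimal sketch of such a fine-tuning setup. It assumes the Hugging Face transformers library rather than the TensorFlow repository linked above, and the label set and example sentence are purely illustrative:

```python
# Hypothetical NER fine-tuning sketch using multilingual BERT.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=5)  # e.g. O, B-PER, I-PER, B-ORG, I-ORG

# One toy example; real fine-tuning loops over a labelled NER dataset.
enc = tokenizer("Angela Merkel besuchte Paris", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])  # all-"O" labels, just to keep shapes valid
loss = model(**enc, labels=labels).loss      # cross-entropy over the token labels
loss.backward()                              # an optimizer step would follow
```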

 

An overview of how BERT works

 

BERT's key innovation is that it applies bidirectional Transformer training to the language model.

 

Previous models looked at a text sequence either from left to right, or combined left-to-right and right-to-left training.

 

The experimental results show that a bidirectionally trained language model has a deeper sense of language context than a single-direction model. To make this possible, the paper introduces a novel technique called Masked LM (MLM); before this technique, bidirectional language model training was not feasible.

 

BERT uses the encoder part of the Transformer.

 

The Transformer is an attention mechanism that learns the contextual relationships between words in a text.

 

The original Transformer includes two separate mechanisms: an encoder that reads the text input, and a decoder that produces a prediction for the task.

 

Since BERT's goal is to produce a language model, only the encoder mechanism is needed.

 

The Transformer encoder reads the entire text sequence at once, rather than sequentially from left to right or right to left. This allows the model to learn each word from both its left and right context, which is what makes it effectively bidirectional.

 

When we train a language model, one challenge is defining a prediction target. Many models predict the next word in a sequence, e.g.:

“The child came home from ___”

 

A bidirectional approach is hard to reconcile with such a task. To overcome this limitation, BERT uses two training strategies:

 

1. Masked LM (MLM)

 

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
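
As a concrete illustration, here is a simplified sketch of that masking step in plain Python. (The actual BERT recipe is slightly more involved: of the chosen 15%, only 80% become [MASK], 10% become a random word, and 10% are left unchanged.)

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Replace roughly 15% of tokens with [MASK]; keep the originals as targets."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)   # hide the word from the model
            targets.append(tok)         # the model must recover this word
        else:
            masked.append(tok)
            targets.append(None)        # not masked: no prediction target
    return masked, targets

print(mask_tokens("the child came home from school".split()))
```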

 

This requires:

 

i) adding a classification layer on top of the encoder output;

ii) multiplying the output vectors by the embedding matrix, converting them to the vocabulary dimension;

iii) calculating the probability of each word in the vocabulary with softmax (a code sketch follows below).
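
Here is a minimal PyTorch sketch of steps i)-iii): a classification layer on top of the encoder output whose weights are tied to the token embedding matrix, followed by a softmax over the vocabulary. The sizes are illustrative, not BERT's exact configuration.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 30522, 768
embedding = nn.Embedding(vocab_size, hidden)         # token embedding matrix

encoder_output = torch.randn(1, 10, hidden)          # stand-in for the encoder's output
decoder = nn.Linear(hidden, vocab_size, bias=False)  # i) classification layer
decoder.weight = embedding.weight                    # ii) tie it to the embedding matrix
logits = decoder(encoder_output)                     #     -> vocabulary dimension
probs = torch.softmax(logits, dim=-1)                # iii) probability of each vocabulary word
print(probs.shape)                                   # (1, 10, vocab_size)
```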

 

BERT's loss function considers only the predictions for the masked words and ignores the non-masked ones. As a consequence, the model converges more slowly than a directional model, but in return it gains much better awareness of context.
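
A short, self-contained sketch of how that works in practice: non-masked positions are labelled with -100 so that the cross-entropy loss ignores them.

```python
import torch
import torch.nn.functional as F

vocab_size = 30522
logits = torch.randn(1, 10, vocab_size)               # stand-in for the MLM head's output
labels = torch.full((1, 10), -100, dtype=torch.long)  # -100 marks non-masked positions
labels[0, 3] = 2045                                   # true ids of the masked words
labels[0, 7] = 512
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss)   # only positions 3 and 7 contribute
```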

 

2. Next Sentence Prediction (NSP)

 

During training, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document.

 

During training, in 50% of the inputs the second sentence really is the next sentence in the original document, while in the other 50% it is a random sentence from the corpus, unrelated to the first.

 

To help the model distinguish the two sentences during training, the input is processed in the following way before entering the model:

 

A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is inserted at the end of each sentence.

A sentence embedding indicating sentence A or sentence B is added to each token.

A positional embedding is added to each token to indicate its position in the sequence (see the sketch after this list).
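
A plain-Python sketch of that layout (ignoring WordPiece tokenization, which BERT applies first):

```python
sent_a = "the man went to the store".split()
sent_b = "he bought a gallon of milk".split()

tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)  # sentence A vs. sentence B
position_ids = list(range(len(tokens)))                          # position of each token

for t, s, p in zip(tokens, segment_ids, position_ids):
    print(f"{t:10s} segment={s} position={p}")
```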

 

To predict whether the second sentence really follows the first, the following steps are performed:

 

The entire input sequence is fed into the Transformer model.

The output corresponding to the [CLS] token is transformed into a 2×1-shaped vector by a simple classification layer.

The probability of IsNextSequence is calculated with softmax (sketched in code below).
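
In PyTorch terms, the NSP head might look like the following sketch (sizes illustrative):

```python
import torch
import torch.nn as nn

hidden = 768
encoder_output = torch.randn(1, 12, hidden)   # stand-in for the Transformer output
cls_vector = encoder_output[:, 0]             # the [CLS] token sits at position 0

nsp_classifier = nn.Linear(hidden, 2)         # 2x1-shaped output per example
probs = torch.softmax(nsp_classifier(cls_vector), dim=-1)
print(probs)                                  # P(IsNextSequence) vs. P(NotNext)
```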

 

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, and the goal is to minimize the combined loss function of the two strategies.

 

How is BERT used?

 

BERT can be used for a wide variety of NLP tasks by simply adding a small layer on top of the core model.

 

For example:

 

a) In classification tasks such as sentiment analysis, we only need to add a classification layer on top of the Transformer output.
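
For instance, with the Hugging Face transformers library (an assumption here; the linked Google repository works just as well), a ready-made classification head can be attached like this:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer("this movie was wonderful", return_tensors="pt")
logits = model(**enc).logits                  # shape (1, 2); the head is random until fine-tuned
print(torch.softmax(logits, dim=-1))          # e.g. negative / positive probabilities
```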

 

b) In question answering tasks (e.g. SQuAD v1.1), the system receives a question about a text sequence and has to mark the answer within that sequence. With BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and end of the answer.
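
A PyTorch sketch of those two vectors: a single linear layer with two outputs scores every token as a possible start or end of the answer span (sizes illustrative).

```python
import torch
import torch.nn as nn

hidden, seq_len = 768, 20
encoder_output = torch.randn(1, seq_len, hidden)   # stand-in for BERT's output
qa_head = nn.Linear(hidden, 2)                     # start vector and end vector

start_logits, end_logits = qa_head(encoder_output).split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=-1)    # most likely answer start token
end = end_logits.squeeze(-1).argmax(dim=-1)        # most likely answer end token
print(start.item(), end.item())
```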

 

c) In named entity recognition (NER), the system receives a text sequence and has to mark the various types of entities (person, organization, date, etc.) that appear in it. With BERT, an NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.

 

During fine-tuning, most hyper-parameters stay the same as in BERT pre-training; the paper gives specific guidance on the few hyper-parameters that need adjustment (Section 3.5).

 

 
