An Overview of ELMO, BERT and GPT

1. Contextualized Word Embedding

The same word can have different meanings. In the sentences below, the word "bank" appears in each, yet it means something different every time. A trained Word2Vec model, however, gives "bank" a single vector, as if the word always meant the same thing, which is not the case. This is a shortcoming of Word2Vec.

In the sentences below, each occurrence of "bank" is a different token, but they are all the same type.

We would like every word token to have its own embedding, and we want each token's embedding to depend on its context. This approach is called contextualized word embedding.
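As a toy illustration (the vectors and sentences below are made up), a static lookup table such as Word2Vec returns exactly the same vector for "bank" regardless of the sentence it appears in:

```python
# Toy static embedding table (vectors are invented for illustration only).
static_embeddings = {
    "bank": [0.12, -0.48, 0.33],
    "river": [0.85, 0.10, -0.21],
    "money": [-0.30, 0.44, 0.57],
}

sentence_1 = "he deposited money in the bank"
sentence_2 = "she sat on the river bank"

# A static lookup ignores context: both occurrences of "bank" map to
# exactly the same vector, even though their meanings differ.
vec_1 = static_embeddings["bank"]
vec_2 = static_embeddings["bank"]
print(vec_1 == vec_2)  # True
```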

 

2. ELMO

ELMO is an abbreviation of Embeddings from Language Models. It is an RNN-based model that needs nothing more than a large collection of sentences to train (no labels required).

We can take a trained RNN's hidden layer and use its output for each word as that word's embedding. Because the RNN takes the context into account, the same word gets different vectors in different contexts. This describes a forward RNN; to also account for the words that follow, we can train a bidirectional RNN and likewise use its hidden-layer outputs as embeddings.
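Below is a minimal sketch of this idea in PyTorch (my own simplification, not the actual ELMO implementation): a bidirectional LSTM runs over a sentence, and its per-token hidden states serve as contextualized embeddings.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
# bidirectional=True runs a forward and a backward LSTM and concatenates them.
bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                 batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 6))   # one sentence, 6 tokens
hidden_states, _ = bilstm(embed(token_ids))

# One contextual vector per token: shape (1, 6, 2 * hidden_dim).
# The same token id would get a different vector in a different sentence,
# because the hidden state depends on the surrounding words.
print(hidden_states.shape)
```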

If the RNN has many layers, which layer's hidden output should we take as the embedding?

ELMO takes the vectors produced by every layer and combines them to obtain each word's embedding.

For example, as in the figure, suppose there are two layers, so each word yields two vectors. The simplest approach is to concatenate the two vectors and use the result as the word's embedding.

ELMO instead takes the two vectors, multiplies each by a different weight $\alpha$, and sums them to obtain the embedding used for the downstream task.

The weights $\alpha$ are learned: they are trained together with the downstream task, so different tasks end up with different values of $\alpha$.

For example, as shown in the figure, we can have three sources of embeddings:

  • The original, non-contextualized embedding of the token
  • The embedding extracted after the token passes through the first layer
  • The embedding extracted after the token passes through the second layer

The color intensity represents the magnitude of the weight; you can see that different tasks (SRL, Coref, etc.) learn different weights.
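As a sketch of this weighted combination (my own simplification of ELMO's scalar mix, not the lecture's code): each layer's vector for a token is multiplied by a learned weight and the results are summed, and the weights are ordinary parameters trained along with the downstream task.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Combine per-layer vectors with learned weights (ELMO-style, simplified)."""
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable alpha per layer; softmax keeps them comparable.
        self.alphas = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_vectors):
        # layer_vectors: (num_layers, seq_len, dim)
        weights = torch.softmax(self.alphas, dim=0)
        return (weights.view(-1, 1, 1) * layer_vectors).sum(dim=0)

# Toy example: 3 sources (token embedding + 2 RNN layers), 6 tokens, dim 128.
layers = torch.randn(3, 6, 128)
mix = ScalarMix(num_layers=3)
task_embedding = mix(layers)   # shape (6, 128)
# The alphas are updated by the downstream task's loss,
# so each task ends up with its own weighting of the layers.
```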

 

3. BERT

BERT is an abbreviation of Bidirectional Encoder Representations from Transformers. BERT is the encoder of the Transformer: it is formed by stacking multiple encoder layers.

BERT does not need labeled text; collecting a large pile of sentences is enough to train it.

Since BERT is an encoder, it can be viewed as taking a sentence as input and outputting an embedding for each word.
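A minimal sketch of this using the Hugging Face transformers library (the library is my assumption; the original only describes the model), obtaining one contextual embedding per input token from a pre-trained BERT:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I withdrew cash from the bank", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token (including [CLS] and [SEP]).
token_embeddings = outputs.last_hidden_state   # shape (1, num_tokens, 768)
print(token_embeddings.shape)
```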

The figure uses words as the unit; sometimes it is better to use characters as the unit. For example, Chinese has a huge number of words, but the set of commonly used characters is limited.

BERT has two training methods: one is Masked LM, and the other is Next Sentence Prediction. In general the two are used together, which achieves better results.

3.1 Masked LM

In Masked LM, a random 15% of the tokens in the input sentence are replaced with a special token called [MASK].

BERT's task is to guess which words were replaced.

It is like a fill-in-the-blank game: a word is dug out of a sentence and the model has to fill in the right word.

After BERT produces the embeddings, the embedding at the position that was replaced by [MASK] is fed to a linear classifier, which predicts what the word is.

Because this classifier is linear, its capacity is very weak, so BERT is forced to output very good embeddings in order for the classifier to predict which word was masked.

If two different words can both fill the same blank in a sentence, they will have similar embeddings, because their semantics are similar.
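A sketch of Masked LM at prediction time, again assuming the Hugging Face transformers library: a [MASK] token is placed in the sentence, and BERT's output at that position is decoded through its (linear) language-model head.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "He deposited the money in the [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # (1, num_tokens, vocab_size)

# Find the [MASK] position and take the most likely vocabulary entry there.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))        # a plausible word for the blank
```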

3.2 Next Sentence Prediction

In Next Sentence Prediction, we give BERT two sentences and ask it to predict whether the two sentences should follow one another.

[SEP]: a special token marking the boundary between the two sentences.

[CLS]: a special token whose output is used for classification.

We pass the output vector at the [CLS] position through a linear classifier, and the classifier decides whether the two sentences should be connected.

Because BERT is a Transformer encoder and uses self-attention, every position can read information from the whole sentence, so [CLS] can simply be placed at the beginning.

We can also feed this [CLS] vector directly to a classifier to determine the type of a text, for example to decide whether a message is spam, as shown below.
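A sketch of such a classifier (the two-class spam/not-spam head and the example sentence are my own illustrative assumptions): a small linear layer is put on top of the [CLS] embedding and trained jointly with BERT in practice.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Linear classifier on top of the [CLS] embedding; 2 classes (spam / not spam).
classifier = nn.Linear(bert.config.hidden_size, 2)

inputs = tokenizer("Congratulations, you won a free prize!", return_tensors="pt")
with torch.no_grad():
    cls_embedding = bert(**inputs).last_hidden_state[:, 0]   # vector at [CLS]

logits = classifier(cls_embedding)   # shape (1, 2); untrained here, for illustration
print(logits)
```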

3.3 ERNIE

ERNIE is an abbreviation of Enhanced Representation through kNowledge IntEgration.

ERNIE is designed specifically for Chinese. When Chinese is fed to BERT in units of characters, randomly masked characters are very easy to guess, as shown in the figure, so masking an entire word is more appropriate.

 

4. GPT

GPT is an abbreviation of Generative Pre-Training. Its parameter count is particularly large; as shown below, it has roughly 4.5 times as many parameters as BERT.

While BERT is the Transformer encoder, GPT is the Transformer decoder. GPT takes some words as input and predicts the next word. The calculation process is shown below.

We input the word "tide"; after many layers of self-attention we get the output "receded". "Receded" is then used as the next input to predict the following output.
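A sketch of this next-word loop using a pre-trained GPT-2 from the Hugging Face transformers library (GPT-2 and the English prompt are my substitutions; the lecture describes the original GPT): generate() repeatedly appends the predicted token and feeds it back in as input.

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The tide has", return_tensors="pt")

# Greedy decoding: at each step the model predicts the next token,
# which is appended to the sequence and fed back in as input.
output_ids = model.generate(input_ids, max_length=20, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```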

GPT can perform NLP tasks such as reading comprehension, sentence or paragraph generation, and translation.

 

References:

http://jalammar.github.io/illustrated-bert/

CHANG deep learning
