1.Contextualized Word Embedding
The same word can have different meanings. In the sentences below, each "bank" means something different, yet a trained Word2Vec model assigns "bank" a single vector, as if the word always meant the same thing. This is a shortcoming of Word2Vec.
In the following sentences, each "bank" is a different token, but they are all the same type.
We would like every word token to have its own embedding, dependent on its context. This approach is called contextualized word embedding.
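The distinction can be illustrated with a toy sketch (not a real model): a static lookup table gives "bank" the same vector everywhere, while anything that mixes in context yields different token embeddings. The vocabulary, sentences, and the crude "contextual" mixing function below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "river", "bank", "money", "deposit", "in"]
static = {w: rng.normal(size=4) for w in vocab}  # Word2Vec-style lookup table

s1 = ["the", "river", "bank"]            # "bank" = shore
s2 = ["deposit", "money", "in", "bank"]  # "bank" = institution

# A stand-in "contextual" encoder: blend each token's static vector with a
# summary of its sentence, so the same type yields different token vectors.
def contextual(sentence):
    vecs = np.stack([static[w] for w in sentence])
    ctx = vecs.mean(axis=0)  # crude context summary
    return {w: 0.5 * static[w] + 0.5 * ctx for w in sentence}

b1 = contextual(s1)["bank"]
b2 = contextual(s2)["bank"]
print(np.allclose(static["bank"], static["bank"]))  # True: one vector per type
print(np.allclose(b1, b2))  # False: one vector per token, context-dependent
```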
2.ELMO
ELMO is an abbreviation of Embeddings from Language Model. It is an RNN-based model, and all it needs for training is a large collection of sentences.
We train an RNN language model and take the hidden-layer output for each word as that word's embedding. Because the RNN takes the context into account, the same word gets different vectors in different contexts. The above describes a forward RNN, which ignores the context that follows a word; to use that information as well, we can train a bidirectional RNN and likewise take its hidden-layer outputs as the embeddings.
If the RNN has many layers, which layer's hidden output should we take as the embedding?
ELMO takes the vector from every layer and combines them to obtain each word's embedding.
In the figure, for example, there are two layers, so each word gets two vectors; the simplest approach is to concatenate the two vectors as the word's embedding.
Instead, ELMO takes the vectors from each layer, multiplies them by different weights $\alpha$, and sums them to obtain the embedding used for the downstream task.
The weights $\alpha$ are learned: they are trained together with the downstream task, so different tasks use different values of $\alpha$.
For example, we may have three sources of embeddings, as shown in the figure:
- the original, non-contextualized embedding (the Token row in the figure)
- the first embedding, extracted by the first layer
- the second embedding, extracted by the second layer
The color depth represents the magnitude of each weight; you can see that different tasks (SRL, Coref, etc.) learn different weights.
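ELMO's weighted combination of layer outputs can be sketched with NumPy. The layer outputs and weights below are made-up stand-ins; in the real model the weights $\alpha$ (and a task-specific scale $\gamma$) are learned jointly with the downstream task rather than fixed by hand.

```python
import numpy as np

num_layers, seq_len, dim = 3, 5, 4  # token layer + 2 RNN layers (hypothetical sizes)
rng = np.random.default_rng(1)
layer_outputs = rng.normal(size=(num_layers, seq_len, dim))

raw = np.array([0.2, 1.0, 0.5])          # pre-softmax task weights (made up)
alpha = np.exp(raw) / np.exp(raw).sum()  # normalized so the weights sum to 1
gamma = 1.0                              # task-specific scale

# Weighted sum over layers -> one task-specific embedding per token.
embeddings = gamma * np.einsum("l,lsd->sd", alpha, layer_outputs)
print(embeddings.shape)  # (5, 4)
```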
3.BERT
BERT is an abbreviation of Bidirectional Encoder Representations from Transformers; BERT is the Transformer encoder, formed by stacking multiple encoder layers.
To train BERT we do not need labeled text; collecting a large number of sentences is enough.
Since BERT is an encoder, it can be viewed as taking a sentence as input and outputting one embedding per token.
The figure uses words as the unit; sometimes using characters as the unit works better. In Chinese, for example, the number of words is huge, while the number of common characters is limited.
BERT has two training methods: Masked LM and Next Sentence Prediction. In practice, using the two together achieves better results.
3.1Masked LM
In Masked LM, a random 15% of the tokens in the input sentence are replaced with a special token called [MASK].
BERT's task is to guess what the replaced words were.
It is like a fill-in-the-lyrics game: a word is dug out of a sentence, and you must fill in the right one.
After BERT produces its embeddings, the embedding at each position that was replaced by [MASK] is fed into a linear classifier to predict what the original word was.
Because this classifier is linear, its capacity is very weak, so BERT is forced to output very good embeddings for the classifier to be able to predict the replaced word.
If two different words can fill the same slot in a sentence, they will end up with similar embeddings, because their semantics are similar.
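The input corruption described above can be sketched as follows. This is a simplified version: the sentence is hypothetical, and the full BERT recipe also sometimes substitutes a random word or leaves the chosen token unchanged instead of always writing [MASK].

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace ~mask_rate of the tokens with [MASK]; return the corrupted
    sequence and the original words the model must recover."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n)
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = "[MASK]"
    return corrupted, {p: tokens[p] for p in positions}  # prediction targets

tokens = "the cat sat on the mat because it was tired".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

During training, only the positions in `targets` contribute to the loss: the embedding at each masked position goes through the linear classifier and is compared against the original word.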
3.2Next Sentence Prediction
In Next Sentence Prediction, we give BERT two sentences and ask it to predict whether the two sentences follow each other.
[SEP]: a special token that marks the boundary between the two sentences
[CLS]: a special token whose output is used for classification
The output vector at the [CLS] position is passed through a linear classifier, which decides whether the two sentences should be connected.
Because BERT is a Transformer encoder, its self-attention mechanism lets every position read the whole sentence, so [CLS] can be placed at the very beginning.
We can also feed this [CLS] vector directly into a classifier to determine the type of a text, for example to detect spam as below.
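Packing a sentence pair for Next Sentence Prediction can be sketched like this; the example sentences are made up, and the helper name is hypothetical, but the token layout ([CLS] first, [SEP] at each sentence boundary) follows the scheme described above.

```python
def build_nsp_input(sent_a, sent_b):
    """Pack two tokenized sentences into one BERT-style input sequence."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment ids tell the model which sentence each token belongs to.
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids

tokens, segs = build_nsp_input("he opened the door".split(),
                               "the room was dark".split())
print(tokens[0], tokens.count("[SEP]"))  # [CLS] 2
```

The classifier then reads only the output vector at position 0 (the [CLS] slot) to decide whether `sent_b` really follows `sent_a`.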
3.3ERNIE
ERNIE is an abbreviation of Enhanced Representation through Knowledge Integration.
ERNIE is designed for Chinese. BERT takes Chinese input character by character, and randomly masking individual characters makes them very easy to guess, as shown in the figure, so masking an entire word is more appropriate.
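Whole-word masking can be sketched as below, assuming the sentence has already been segmented into words (the example sentence and segmentation are hypothetical): instead of masking one character, every character of a chosen word is masked together.

```python
import random

def mask_whole_word(words, seed=0):
    """Mask all characters of one randomly chosen word."""
    rng = random.Random(seed)
    i = rng.randrange(len(words))  # pick one word to mask entirely
    out = []
    for j, w in enumerate(words):
        out.extend(["[MASK]"] * len(w) if j == i else list(w))
    return out

words = ["黑龙江", "的", "省会", "是", "哈尔滨"]  # pre-segmented Chinese sentence
print(mask_whole_word(words))
```

With only one character of a word masked, the remaining characters often give the answer away; masking the whole word forces the model to rely on the surrounding context.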
4.GPT
GPT is an abbreviation of Generative Pre-Training. Its parameter count is particularly large; as shown below, it has roughly 4.5 times as many parameters as BERT.
While BERT is the Transformer encoder, GPT is the Transformer decoder. GPT takes some words as input and predicts the next word; the computation process is shown below.
For example, we input the word "tide"; after many self-attention layers the model outputs "recedes". That output is then fed back in as input to predict the next word.
GPT can perform NLP tasks such as reading comprehension, summarization, and translation.
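The feed-the-output-back-in loop can be sketched as follows. The Transformer decoder is replaced here by a toy bigram lookup table (entirely made up), since only the autoregressive loop is being illustrated.

```python
# Toy stand-in for the decoder: maps the last word to the predicted next word.
next_token = {"tide": "recedes", "recedes": "and", "and": "we", "we": "see"}

def generate(prompt, steps):
    """Autoregressive generation: predict, append, repeat."""
    seq = list(prompt)
    for _ in range(steps):
        seq.append(next_token[seq[-1]])  # model's prediction becomes new input
    return seq

print(generate(["tide"], 3))  # ['tide', 'recedes', 'and', 'we']
```

A real GPT conditions each prediction on the entire sequence so far via self-attention, not just on the last word, but the generation loop has exactly this shape.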
References:
- http://jalammar.github.io/illustrated-bert/
- CHANG deep learning