CS224n Notes: Subword Model (12)


Series of articles

Lecture 1: Introduction and Word Vectors
Lecture 2: Word Vectors and Word Senses
Lecture 12: Subword Model


A very nice entry-level explanation of BERT

1. ELMo

First, we introduce the ELMo (Embeddings from Language Models) algorithm; you can consult the original paper for more details. In earlier work such as word2vec and GloVe, each word corresponds to a single vector, which cannot handle polysemous words: as the linguistic context changes, these static vectors cannot accurately express the corresponding characteristics. The authors of ELMo believe that a good word representation model should take two issues into account at the same time:

  • One is the complex characteristics of word usage, both semantic and syntactic;
  • The other is that these usages should change as the linguistic context changes.

The ELMo algorithm process is:

  1. First, train a bidirectional LSTM model on a large corpus with a language-modeling objective;
  2. Then use this LSTM to generate a representation for each word.

The ELMo model contains a multi-layer bidirectional LSTM, which can be understood as follows:

The states of the higher LSTM layers capture context-dependent aspects of word meaning (for example, they can be used for word sense disambiguation), while the lower layers capture syntactic features (for example, they can be used for part-of-speech tagging).

Bidirectional language models

The ELMo model has two key formulas:

Here we can see that the predicted sentence probability $p(t_1, t_2, \dots, t_N)$, where $(t_1, t_2, \dots, t_N)$ is a sequence of tokens, is modeled in two directions: forward and backward.

The forward language model:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})$$

The backward language model:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)$$

Training jointly maximizes the log-likelihood of both directions:

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$

Here $\Theta_x$ represents the token embedding parameters and $\Theta_s$ represents the parameters of the softmax layer.
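As a quick numerical illustration of this joint objective, here is a minimal Python/NumPy sketch that uses made-up per-token probabilities in place of a trained biLM's outputs:

```python
import numpy as np

# Made-up conditional probabilities for a 4-token sentence, standing in for the
# outputs of a trained forward and backward LSTM language model.
# p_forward[k]  ~ p(t_k | t_1, ..., t_{k-1})
# p_backward[k] ~ p(t_k | t_{k+1}, ..., t_N)
p_forward = np.array([0.20, 0.35, 0.10, 0.50])
p_backward = np.array([0.25, 0.30, 0.15, 0.40])

# The biLM maximizes the summed log-likelihood of both directions; the token
# embeddings (Theta_x) and the softmax layer (Theta_s) are shared between the
# two directions, while each direction keeps its own LSTM parameters.
joint_log_likelihood = np.sum(np.log(p_forward)) + np.sum(np.log(p_backward))
print(joint_log_likelihood)
```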

Word features

For each token $t_k$, an $L$-layer biLM computes a total of $2L + 1$ representations:

$$R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \,\} = \{\, h_{k,j}^{LM} \mid j = 0, \dots, L \,\}$$

The second "=" can be understood as follows: when $j = 0$, $h_{k,0}^{LM}$ is the token (embedding) layer; when $j > 0$, $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ contains the hidden representations of both directions at layer $j$.
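In the ELMo paper, a downstream task collapses these $2L + 1$ representations into a single vector per token using softmax-normalized layer weights and a scalar; here is a minimal NumPy sketch of that kind of weighted combination, with toy sizes and random stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)

L, seq_len, dim = 2, 5, 8        # toy sizes: 2-layer biLM, 5 tokens, representation size 8
# h[j] holds the layer-j representation of every token: j = 0 is the token layer,
# j > 0 concatenates the forward and backward hidden states (random stand-ins here).
h = rng.normal(size=(L + 1, seq_len, dim))

# Task-specific scalars (learned jointly with the downstream model in practice).
s_raw = rng.normal(size=L + 1)                    # unnormalized layer weights
gamma = 1.0                                       # overall scaling factor

s = np.exp(s_raw) / np.exp(s_raw).sum()           # softmax-normalized weights
elmo_vectors = gamma * np.einsum('j,jtd->td', s, h)  # weighted sum over layers

print(elmo_vectors.shape)                         # (5, 8): one mixed vector per token
```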

2. GPT

This section introduces the GPT (Generative Pre-Training) model.

The training of GPT is divided into two stages: 1) unsupervised pre-training of a language model; 2) supervised fine-tuning for each downstream task.

Model structure diagram: see the figure in the original paper.

2.1 Unsupervised pretrain

Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$, the language model is trained to maximize the following objective:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

where $k$ is the context window size, $\Theta$ are the language model parameters, and a neural network is used to model the conditional probability $P$.

In the paper, a multi-layer Transformer decoder is used as the language model, which can be seen as a variant of the Transformer: the encoder-decoder attention sub-layer is removed from the decoder block, leaving masked self-attention and feed-forward layers as the main body of the model, and the decoder output is then passed through a softmax layer to obtain the output distribution over target words:

$$h_0 = U W_e + W_p$$

$$h_l = \mathrm{transformer\_block}(h_{l-1}), \quad l \in [1, n]$$

$$P(u) = \mathrm{softmax}(h_n W_e^{T})$$

Here $U = (u_{i-k}, \dots, u_{i-1})$ is the context of $u_i$, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
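A toy NumPy sketch of this forward computation, with the transformer block stubbed out and all sizes and values purely illustrative (not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, k = 100, 16, 4       # toy sizes; k is the context window
W_e = rng.normal(scale=0.02, size=(vocab_size, d_model))  # token embedding matrix
W_p = rng.normal(scale=0.02, size=(k, d_model))           # position embedding matrix

def transformer_block(h):
    # Stand-in for a masked self-attention + feed-forward decoder block;
    # the real block is omitted to keep the sketch short.
    return h

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

context = np.array([5, 17, 42, 7])        # token ids for u_{i-k}, ..., u_{i-1}
h = W_e[context] + W_p                    # h_0 = U W_e + W_p (U as one-hot rows)
for _ in range(3):                        # n stacked decoder blocks
    h = transformer_block(h)
p_next = softmax(h[-1] @ W_e.T)           # P(u) = softmax(h_n W_e^T), tied embeddings
print(p_next.shape)                       # (100,) distribution over the vocabulary
```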

2.2 Supervised finetune

In this step, the parameters $\Theta$ of the pre-trained language model are adjusted for the downstream task. Given a labeled dataset $\mathcal{C}$ in which each example consists of an input token sequence $x^1, \dots, x^m$ and a label $y$, the inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is fed into an added linear output layer with parameters $W_y$:

$$P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)$$

The supervised objective to maximize is:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$

Keeping language modeling as an auxiliary objective during fine-tuning, the final objective to optimize is:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

For a specific task, traversal-style input transformations are used to convert structured inputs into ordered token sequences that the pre-trained language model can process; for example, for textual entailment the premise and hypothesis are concatenated with a delimiter token.
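Below is a minimal sketch of how the fine-tuning objective combines the task term with the auxiliary language-modeling term; the activations, the LM term, and $\lambda$ are all toy stand-in values, not trained quantities:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, num_classes = 16, 3

# Random stand-in for h_l^m: the final transformer block's activation at the
# last token of one labeled example.
h_last = rng.normal(size=d_model)
W_y = rng.normal(scale=0.02, size=(d_model, num_classes))  # added task-specific layer
y = 1                                                      # gold label of this example

task_ll = np.log(softmax(h_last @ W_y)[y])   # this example's term in L_2(C)
lm_ll = -2.3                                 # its term in the auxiliary LM objective L_1(C) (toy value)
lam = 0.5                                    # weighting coefficient lambda (illustrative)
L3 = task_ll + lam * lm_ll                   # L_3(C) = L_2(C) + lambda * L_1(C)
print(L3)
```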

3. BERT

BERT (original paper) is Google's big move: the newly released BERT model from the company's AI team achieved striking results on the top-level machine reading comprehension benchmark SQuAD 1.1, surpassing human performance on both of its measures, and it also achieved the best results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (an absolute improvement of 7.6%) and MultiNLI accuracy to 86.7% (an absolute improvement of 5.6%). It is foreseeable that BERT will bring milestone changes to NLP; it is also the most important recent development in the field.

The full name of BERT is Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer. The main innovation of the model lies in its pre-training method, which uses a Masked Language Model (Masked LM) and Next Sentence Prediction to capture word-level and sentence-level representations, respectively.
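A rough sketch of the Masked LM input corruption, following the 15% selection and 80/10/10 replacement scheme described in the BERT paper; the vocabulary and tokens below are toy stand-ins for BERT's WordPiece vocabulary:

```python
import random

MASK = "[MASK]"
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:       # select ~15% of positions for prediction
            labels.append(tok)
            r = rng.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                corrupted.append(MASK)
            elif r < 0.9:                  # 10%: replace with a random token
                corrupted.append(rng.choice(vocab))
            else:                          # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```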

BERT uses the Transformer encoder as its language model. The Transformer comes from the classic paper "Attention Is All You Need"; it completely abandons RNN/CNN structures and relies entirely on the attention mechanism to compute the relationships between inputs and outputs (the encoder is the left half of the architecture diagram in that paper).

The structure of the BERT model is as follows:

The difference between the BERT model and OpenAI GPT is that BERT uses the Transformer encoder, so the attention computation at each position can see the inputs at all positions, whereas OpenAI GPT uses the Transformer decoder, where the attention at each position can only depend on that position and the positions before it, because OpenAI GPT is a unidirectional language model.
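This difference comes down to the attention mask: a minimal illustration, where entry (i, j) = 1 means position i is allowed to attend to position j:

```python
import numpy as np

seq_len = 5

# BERT (Transformer encoder): every position can attend to every position.
encoder_mask = np.ones((seq_len, seq_len), dtype=int)

# GPT (Transformer decoder): position i can attend only to positions j <= i.
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(encoder_mask)
print(decoder_mask)
```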

Source: blog.csdn.net/bosszhao20190517/article/details/107106958