The Road of Pre-Training in NLP: from word2vec and ELMo to BERT

Foreword

I remember a while back, reading the machine reading comprehension literature, when Microsoft and Alibaba surpassed humans on SQuAD with R-Net+ and SLQA respectively, and Baidu topped the MS MARCO leaderboard with V-Net and beat humans on BLEU. Those networks were each more complicated than the last, and it seemed that "how to design a more elaborate, task-specific network" had become the politically correct research direction in NLP. In that world, word2vec, GloVe, fastText and the like could only ever be icing on the cake. And the much-praised transfer learning and pre-training? In NLP they never seemed to get to play the lead.

Writing this article, Xiaoxi feels a bit sheepish. Having worked on transferable representations for a while, and despite having the early intuition that this should be a core problem of NLP, I never produced results I was satisfied with, right up until BERT came out a few days ago. It felt like "poverty limits my imagination" ╮(¯▽¯"")╭ (crossed out); really, my focus was just too narrow.

Everyone understands BERT differently; this article tries to talk about BERT from the perspective of word2vec and ELMo. Below is a brief recap of the essence of word2vec and ELMo; readers who already understand them thoroughly can scroll straight down to the BERT section.

word2vec

It feels a bit boring to repeat the same clichés yet again: in 2013 Google's word2vec came out and suddenly it was everywhere in NLP, to the point where a paper almost seemed embarrassing without pre-trained word vectors. So what exactly is word2vec?

Model

 

 

[Figure: the word2vec model architecture]

 

Obviously it is a "linear" language model. Since our goal is to learn word vectors, and we want the word vectors to semantically support "linear semantic operations" such as "emperor - empress = male - female" (Wu Zetian notwithstanding), then a linear model is naturally sufficient: it runs fast and gets the job done, very elegantly.

 



The other essence of word2vec is the way it optimizes the language model by bolting on a softmax acceleration method: it boldly replaces the traditional hierarchical softmax and NCE tricks with "negative sampling". So what exactly is this fancy-sounding "negative sampling"?

Negative sampling

We know that when training a language model, the softmax layer is very expensive to compute: after all, we want to predict which word occupies the current position, so the number of classes equals the vocabulary size, often tens or hundreds of thousands of classes, and computing a softmax over that is of course laborious. But if our goal is not to train a precise language model, but only to obtain its by-product, the word vectors, then it suffices to use a "sub-task" that is implicitly much cheaper to compute.

Think about it: if I give you 10,000 cards with numbers written on them and ask you to find the largest one, isn't that demanding? But if I pull the largest card out in advance, mix it with five randomly drawn cards, and ask you to pick the largest, isn't that much easier?

Negative sampling is exactly this idea: instead of making the model find the most likely word from the whole vocabulary, we directly give it the target word (the positive example) plus a few randomly sampled noise words (the sampled negative examples); as long as the model can pick the correct word out of this small set, the objective is considered met. The corresponding objective function is:

 

 

\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]

 

Here v'_{w_O} is (the output vector of) the positive example, the v'_{w_i} are the k randomly sampled negative examples, v_{w_I} is the vector of the input (center) word, and \sigma is the sigmoid function. The objective is thus to maximize the likelihood of the positive example while minimizing the likelihood of the negative examples.
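For concreteness, here is a minimal NumPy sketch of this skip-gram negative-sampling loss for a single (center word, context word) pair; the function and variable names are made up for illustration, not word2vec's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Skip-gram negative-sampling loss for one (center word, context word) pair.

    v_in       : vector of the input/center word w_I, shape (d,)
    v_out_pos  : output vector of the true context word w_O, shape (d,)
    v_out_negs : output vectors of the k sampled noise words, shape (k, d)
    Returns the negative log-likelihood to be minimized.
    """
    pos_term = np.log(sigmoid(v_out_pos @ v_in))            # pull the positive pair together
    neg_term = np.sum(np.log(sigmoid(-v_out_negs @ v_in)))  # push the k noise words away
    return -(pos_term + neg_term)

# toy usage with 5-dimensional vectors and k = 3 negatives
rng = np.random.default_rng(0)
print(negative_sampling_loss(rng.normal(size=5), rng.normal(size=5), rng.normal(size=(3, 5))))
```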

This negative sampling idea was later applied successfully in BERT as well, except that the unit grows from a word to a sentence. Don't worry, we will come back to it slowly ~

char-level and context

Although in 2015 and 2017 there was also plenty of work trying to start from the char-level and find another way around the pre-trained word vector game, in practice those attempts proved short-lived and were quickly put down [8][9]. Still, people realized that char-level text contains patterns that are hard to describe at the word level. So on the one hand FastText [5] appeared, which can learn char-level features for word vectors, and on the other hand shallow CNNs, HighwayNets, RNNs and other networks began to be used to bring char-level text representations into supervised tasks.

However, up to this point, word vectors were all context-free. That is, the same word always gets the same word vector in different contexts, which obviously leaves the word vector model without word sense disambiguation (WSD) capability. So, to make word vectors context-sensitive, people started to encode the word vector sequence inside the specific downstream task.

Of course, the most common encoding method is the RNN family, but deep CNNs have also been used successfully for encoding (e.g., text classification [6], machine translation [7], machine reading comprehension [4]). However! Google said: CNN is too vulgar, let's use a fully connected network! (crossed out) self-attention! And so there is the Transformer [11], a deep model customized for NLP; the Transformer was proposed for machine translation, but it has also shown great power in other areas such as retrieval-based dialogue [3].

But then, since basically every NLP task needs to do this encoding anyway, why not give the word vectors context-sensitive power from the very start? And so came ELMo [2].

ELMo

Of course, ELMo was in fact not the first model to try to produce context-sensitive word vectors, but it is the one that gives you a good enough reason to abandon word2vec (manual smile); after all, sacrificing a bit of inference speed for that much performance gain is worth it in most cases ~ At the model level, ELMo is a stacked bi-LSTM (strictly speaking, two stacked LSTMs trained separately in the two directions), so it naturally has good encoding capability. Its released implementation also supports using a Highway Net or CNN to additionally introduce char-level encoding. For training, the objective is naturally the standard language model likelihood, i.e.,

 

\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)

But of course the model layer is not the highlight of ELMo. Rather, it is that the paper shows, indirectly through experiments, that in a multi-layer RNN the different layers really do learn different kinds of features. Therefore, when the pre-trained ELMo is migrated into a downstream NLP task, one should set up a trainable parameter for the raw word-vector layer and for each RNN hidden layer; these parameters are normalized by a softmax and used to weight and sum their respective layers, and the resulting "weighted-sum" word vector is then multiplied by a scaling parameter so as to better fit the downstream task.

ps: this last scaling parameter is actually very important. In word2vec, for example, the variance of the word vectors learned by CBOW and by skip-gram generally differs quite a bit; when the word-vector variance matches the variance that suits the downstream task's subsequent layers, convergence is faster and better performance is more likely.

Mathematically,

 

ELMo_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where L = 2 in the ELMo paper, j = 0 denotes the raw word-vector layer, j = 1 the first LSTM hidden layer, and j = 2 the second. The s_j^{task} are the layer weights after softmax normalization (i.e., s_0 + s_1 + \ldots + s_L = 1), and \gamma^{task} is the scaling parameter mentioned above.
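A minimal NumPy sketch of this weighted combination follows; the function name and shape conventions are illustrative assumptions, not the official ELMo code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def elmo_combine(layer_outputs, s_logits, gamma):
    """Task-specific weighted combination of ELMo layers (sketch of the formula above).

    layer_outputs : list of L+1 arrays of shape (seq_len, dim); index 0 is the raw
                    word-vector layer, indices 1..L are the biLSTM hidden layers.
    s_logits      : trainable logits of shape (L+1,); softmax gives the s_j weights.
    gamma         : trainable task-specific scaling scalar.
    """
    s = softmax(s_logits)                                # s_0 + s_1 + ... + s_L = 1
    stacked = np.stack(layer_outputs, axis=0)            # (L+1, seq_len, dim)
    weighted = (s[:, None, None] * stacked).sum(axis=0)  # weighted sum over layers
    return gamma * weighted                              # scale to fit the downstream task

# toy usage: L = 2 (raw layer + two LSTM layers), sequence of 4 tokens, dim 8
layers = [np.random.randn(4, 8) for _ in range(3)]
print(elmo_combine(layers, np.zeros(3), 1.0).shape)      # (4, 8)
```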

Through this migration strategy, tasks that demand word sense disambiguation can more easily train a large weight for the second hidden layer, while tasks with clear part-of-speech or syntactic needs may learn a relatively large parameter for the first hidden layer (as the experimental results show). In short, this yields richer word-vector features that can be "customized by the downstream task", and it is no surprise that it works much better than word2vec.

That said, ELMo's goal is only to learn context-sensitive, more powerful word vectors; its aim is still to provide a solid foundation for downstream tasks, with no intention yet of killing the king and seizing the throne.

And as we know, merely encoding the text fully and powerfully (i.e., obtaining very precise and rich features for each word position) is far from enough to cover all NLP tasks. In QA, machine reading comprehension (MRC), natural language inference (NLI), dialogue and other tasks, there are many more complex patterns to capture, such as the relationships between sentences. For this, the downstream networks add all kinds of fancy attention (see the SOTA models in NLI, MRC, and chatbots).

As the patterns to capture become ever more exotic, researchers customize a bewildering variety of network structures for each downstream task, with the result that the same model collapses when the task changes slightly, and even within the same task a change in the dataset distribution causes a significant performance drop. This is clearly inconsistent with how humans handle language: human generalization ability is very strong. It suggests that perhaps NLP has been developing along the wrong path, especially under SQuAD's leadership, exhausting every trick and fancy structure to climb the leaderboard. Where is the real significance of NLP?

It seems I have digressed, but fortunately this increasingly skewed road has finally been cut off by a model, namely the Bidirectional Encoder Representations from Transformers (BERT) [1] that Google released a few days ago.

BERT

The most important significance of this paper lies not in what model it uses, nor in how it is trained, but in that it proposes a whole new set of rules for the game.

As said before, deeply customizing complex, poorly generalizing model structures for every NLP task is actually a very unwise direction. Since ELMo improves so much over word2vec, this shows that the potential of pre-trained models goes far beyond providing a more precise word vector for downstream tasks. So can we directly pre-train a skeleton-level model? If it has already fully described the character-level, word-level, sentence-level and even inter-sentence features, then for different NLP tasks we only need to customize an extremely lightweight output layer (such as a single-layer MLP), since the skeleton model has already done the heavy lifting.

And BERT did exactly that, or rather, it really pulled it off: as a general skeleton-level model it easily challenged the deeply customized models on 11 tasks...

So how does it do it?

Deep bidirectional encoding

First, it points out that for learning context-sensitive word vectors, the previous pre-training models are not enough! In downstream supervised tasks, the encoding is already done with every bell and whistle, and deep bidirectional encoding has basically become standard in many complex downstream tasks (such as MRC and dialogue). But in pre-training, the previous state-of-the-art models were still based on the traditional language model, and the traditional language model is one-directional (by its mathematical definition), i.e.,

p(s) = p(w_0) \cdot p(w_1 \mid w_0) \cdot p(w_2 \mid w_0, w_1) \cdot p(w_3 \mid w_0, w_1, w_2) \cdots p(w_n \mid w_0, \ldots, w_{n-1})

and often very shallow (imagine stacking three layers of LSTM: you can hardly train it without piling on all kinds of tricks), as in ELMo.

Moreover, although ELMo does use a bidirectional RNN for encoding, the RNNs in the two directions are actually trained separately, and only at the loss layer is a simple sum taken. As a result, when encoding a word in either direction, the model never sees the words on the other side of it. And obviously, the semantics of some words in a sentence depend on certain words both to their left and to their right; encoding from a single direction cannot describe them clearly.

So why not do true bidirectional encoding, as in downstream supervised tasks?

The reason should be clear by now: the traditional language model is trained with predicting the next word as the objective, but if you do bidirectional encoding, doesn't that mean the word to be predicted has already been seen? ╮(¯▽¯"")╭ Such a prediction is of course meaningless. Therefore, BERT proposes a new kind of task whose supervision objective can train a truly bidirectional encoding model. This task is called the Masked Language Model (Masked LM).

Masked LM

As the name suggests, Masked LM means that instead of predicting the next word given the words that have already appeared, as in a traditional LM, we directly cover up (mask) a randomly selected part of the words in the whole sentence; the model can then safely do bidirectional encoding and is asked to predict what those covered words are. This kind of task was in fact originally called the cloze test (roughly, "fill-in-the-blank test").

This obviously causes some minor problems. Although it allows bidirectional encoding with a clear conscience, the encoding now also encodes those mask tokens ╮(¯▽¯"")╭ and these mask tokens do not exist in downstream tasks... so what to do? To tune the model to ignore the influence of these tokens as much as possible, the authors tell the model "this is noise, this is noise, it's unreliable, ignore it!" by treating each covered word as follows (a minimal sketch of the procedure appears after the list):

  • with 80% probability, replace it with the "[mask]" token
  • with 10% probability, replace it with a randomly sampled word
  • with 10% probability, leave it unchanged (though unchanged, it still has to be predicted)
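Here is a toy Python sketch of this masking procedure, assuming the roughly 15% selection rate reported in the paper; the vocabulary and function names are invented for illustration and this is not BERT's actual data pipeline.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "dog", "runs", "fast"]     # made-up vocabulary for illustration

def mask_tokens(tokens, select_prob=0.15):
    """Toy BERT-style masking: pick ~15% of positions as prediction targets,
    then apply the 80% / 10% / 10% replacement rule described above."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= select_prob:
            continue                              # position not selected: nothing to predict
        targets[i] = tok                          # the model must recover the original word here
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                      # 80%: replace with the [MASK] token
        elif r < 0.9:
            inputs[i] = random.choice(TOY_VOCAB)  # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged (still has to be predicted)
    return inputs, targets

print(mask_tokens("the cat sat on the mat and the dog ran".split()))
```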

Encoder

For the encoder, the authors did not use the bi-LSTM that is all over the street, but a Transformer encoder that can be made deeper and parallelizes better. This way, each word position can encode every other word in the sentence regardless of direction and distance. On the other hand, I subjectively feel that the Transformer is less affected by the mask tokens than an LSTM would be: self-attention can, in principle, specifically weaken its attention weights on the mask tokens, whereas how an LSTM's input gate treats a mask token is anyone's guess.

Wait, didn't Xiaoxi say in a previous article that using a Transformer encoder directly loses the position information? Did they use the scary sin/cos positional encoding from the original Transformer paper? Nope. The authors very simply and crudely train a position embedding directly ╮(¯▽¯"")╭ That is, say sentences are truncated to length 50; then there are 50 positions, so there are 50 position "words", from position 0 up to position 49... Each position gets a randomly initialized vector that is then trained along with everything else (you have to wonder how something this simple and crude can even work...). Also, for combining the position embedding with the word embedding, BERT simply chooses direct addition.

Finally, in terms of depth, the final version of the BERT encoder stacks a completely frenzied 24 layers of multi-head attention blocks (remember that DAM, the SOTA dialogue model, used only 5 layers...), and each block contains 16 heads and 1024 hidden units ╮(¯▽¯"")╭ Insert poster here: money is all you need (crossed out).

Learning sentence and sentence-pair relationship representations

As said before, for many tasks encoding alone is not enough to complete the task (it only learns a bunch of token-level features); the model also needs to capture some sentence-level patterns to handle tasks such as NLI, QA and dialogue that require sentence representations and inter-sentence interaction and matching. For this, BERT introduces another extremely lightweight but extremely important task to try to make the model learn exactly that.

Sentence-level negative sampling

Remember how Xiaoxi said in the word2vec section that one essence of word2vec is the introduction of an elegant negative sampling task to learn word-level representations? What if we generalize this negative sampling process to the sentence level? That is exactly the key to how BERT learns sentence-level representations.

BERT does something similar to word2vec here, but constructs a sentence-level classification task. That is, given a first sentence (corresponding to the given context in word2vec), its actual next sentence is the positive example (corresponding to the correct word in word2vec), and a randomly sampled sentence serves as the negative example (corresponding to word2vec's random sampling); a sentence-level binary classification is then performed (i.e., decide whether the current sentence is the real next sentence or noise). Through this simple sentence-level negative sampling task, BERT can learn sentence representations as easily as word2vec learns word representations.
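As a rough illustration, here is a toy sketch of how such sentence-level training pairs could be constructed; the function name and document format are assumptions for illustration, not BERT's actual data pipeline.

```python
import random

def make_nsp_examples(documents):
    """Toy construction of sentence-level negative-sampling (next sentence prediction)
    examples; `documents` is a list of documents, each a list of sentence strings."""
    for doc_id, doc in enumerate(documents):
        others = [d for j, d in enumerate(documents) if j != doc_id]
        for i in range(len(doc) - 1):
            if random.random() < 0.5 or not others:
                yield doc[i], doc[i + 1], 1                            # positive: the true next sentence
            else:
                yield doc[i], random.choice(random.choice(others)), 0  # negative: a random sentence

docs = [["A storm hit the coast.", "Many flights were delayed."],
        ["She boiled the water.", "Then she added the tea leaves."]]
for a, b, label in make_nsp_examples(docs):
    print(label, "|", a, "->", b)
```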

Sentence representation

Wait, after talking for so long, we still haven't said how the sentence is actually represented...

Here BERT does not follow the common practice in downstream supervised tasks of adding a global pooling layer on top of the encoding; instead, it prepends a special token, denoted [CLS], to each sequence (for sentence-pair tasks the two sentences are concatenated; for other tasks it is a single sentence), as shown in the figure.

 

[Figure: BERT input sequence with the prepended [CLS] token and [SEP] separators]

 

ps: [SEP] here is the separator between sentences; since BERT supports learning sentence-pair representations, [SEP] marks the cut point that distinguishes the two sentences.

Then the encoder is made to encode [CLS] deeply, and the highest hidden layer of that deep encoding is taken as the representation of the whole sentence / sentence pair. This may seem hard to accept at first, but don't forget that the Transformer can encode global information into every position regardless of distance, and [CLS], as the sentence / sentence-pair representation, is directly connected to the output-layer classifier; it therefore sits as a "checkpoint" on the gradient back-propagation path and will of course learn to gather the upper-level features relevant to classification.

In addition, to let the model distinguish whether each word belongs to the "left sentence" or the "right sentence", the authors introduce the concept of a "segment embedding" to distinguish sentences. For sentence pairs, embedding A and embedding B are used to mark the left and right sentences respectively; for single sentences, only embedding A is used. Embeddings A and B are also trained together with the model.

ps: this approach feels as simple and crude as the position embedding, and it is very hard to understand why BERT still works on "Quora Question Pairs", a task for which the network should in theory stay symmetric. Mixed feelings.

 


Therefore, the final representation of each token in BERT is composed by adding three parts: the original word vector (token embedding), the position embedding mentioned earlier, and the segment embedding described here, as shown in the figure:

[Figure: BERT input representation as the sum of token, segment, and position embeddings]
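A minimal NumPy sketch of this three-way sum, with toy sizes rather than BERT's real configuration; the actual model additionally applies layer normalization and dropout to the sum, which the sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, MAX_LEN, N_SEGMENTS, DIM = 1000, 64, 2, 32   # toy sizes, not BERT's real config

# three trainable lookup tables (randomly initialized here, just as a sketch)
token_emb    = rng.normal(scale=0.02, size=(VOCAB_SIZE, DIM))
position_emb = rng.normal(scale=0.02, size=(MAX_LEN, DIM))
segment_emb  = rng.normal(scale=0.02, size=(N_SEGMENTS, DIM))

def bert_input_embedding(token_ids, segment_ids):
    """Token + position + segment embeddings, added element-wise per position."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + position_emb[positions] + segment_emb[segment_ids]

# "[CLS] sentence A tokens [SEP] sentence B tokens [SEP]" with toy ids;
# segment ids: 0 for the left sentence (incl. [CLS] and the first [SEP]), 1 for the right
x = bert_input_embedding(np.array([1, 17, 42, 2, 99, 2]), np.array([0, 0, 0, 0, 1, 1]))
print(x.shape)   # (6, 32)
```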

 

A downstream task interface that is simple to the point of excess

What truly shows that BERT is a skeleton-level model rather than a word vector model is its interface design for each downstream task, or, to use a more fashionable term, its migration strategy.

First, since the sentence and sentence-pair representations are already available at the top layer, then for text classification tasks and text matching tasks (text matching is really also a text classification task, just with two input texts), we only need to take that representation (i.e., the output of the top encoder layer at the [CLS] position) and feed it into an MLP ~

[Figure: BERT fine-tuning for sentence-level classification and sentence-pair matching tasks]
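A minimal sketch of such a lightweight task head on top of the [CLS] output (toy dimensions, hypothetical names; the head's weights are the only new task-specific parameters):

```python
import numpy as np

def classify_from_cls(cls_vector, W, b):
    """Single-layer task head on the top-layer [CLS] representation (a sketch)."""
    logits = cls_vector @ W + b                 # (num_classes,)
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # softmax over the task's classes

cls_vec = np.random.randn(32)                   # pretend top-layer [CLS] output (toy dim)
W, b = np.random.randn(32, 2) * 0.02, np.zeros(2)
print(classify_from_cls(cls_vec, W, b))         # e.g. P(not match), P(match)
```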


Since the text has already been deeply and bidirectionally encoded, for sequence labeling tasks you only need to add a softmax output layer; you don't even need a CRF ~

[Figure: BERT fine-tuning for sequence labeling tasks]
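And a correspondingly tiny sketch of the per-token tagging head (again toy dimensions and made-up names, no CRF):

```python
import numpy as np

def tag_tokens(token_states, W, b):
    """Independent per-token softmax over tag classes, as described above."""
    logits = token_states @ W + b                               # (seq_len, num_tags)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

states = np.random.randn(5, 32)                 # top-layer outputs for 5 tokens (toy dim)
probs = tag_tokens(states, np.random.randn(32, 9) * 0.02, np.zeros(9))
print(probs.argmax(axis=-1))                    # predicted tag id per token
```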


What Xiaoxi really did not expect is that on span extraction tasks like SQuAD, they even dared to drop the deep-encoding and deep-attention goodie bags, and even dared to throw away the pointer net at the output layer, simply using two linear classifiers, DrQA-style, to output the start and end of the span respectively? Nothing more to say, I kneel m(_ _)m

[Figure: BERT fine-tuning for span extraction tasks such as SQuAD]
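A sketch of such an output layer (toy dimensions, hypothetical names; in training one would apply a softmax over positions with a cross-entropy loss, omitted here):

```python
import numpy as np

def span_scores(token_states, w_start, w_end):
    """Two linear scorers over the token states: one for the span start, one for the end
    (a sketch of the DrQA-style outputs mentioned above)."""
    return token_states @ w_start, token_states @ w_end

states = np.random.randn(20, 32)                # toy passage of 20 tokens
start, end = span_scores(states, np.random.randn(32), np.random.randn(32))
print(int(start.argmax()), int(end.argmax()))   # predicted start / end positions
```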

 

Finally, let's look at the experimental results.

 

[Tables: BERT experimental results on the benchmark tasks]

Ah, it's Google.

Reading this paper made Xiaoxi very happy, because many previous ideas no longer need experiments: BERT has already crushed them (. ︿.) Classification, tagging and transfer tasks can all be restarted from scratch, the SQuAD tower-building race can stop, and thank goodness BERT didn't run generation tasks, which leaves a little room for imagination. Ah, manual smile through tears.

 

For more posts like this, you are welcome to follow Xiaoxi's WeChat subscription account [夕小瑶的卖萌屋] (·ω<)★

 

References

[1] 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2] 2018 NAACL | Deep contextualized word representations
[3] 2018 ACL | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network
[4] 2018 ICLR | Fast and Accurate Reading Comprehension by Combining Self-Attention and Convolution
[5] 2017 TACL | Enriching Word Vectors with Subword Information
[6] 2017 ACL | Deep Pyramid Convolutional Neural Networks for Text Categorization
[7] 2017 | Convolutional Sequence to Sequence Learning
[8] 2017 | Do Convolutional Networks need to be Deep for Text Classification?
[9] 2016 | Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level
[10] 2013 NIPS | Distributed Representations of Words and Phrases and their Compositionality
[11] 2017 NIPS | Attention Is All You Need

 


Source: blog.csdn.net/xixiaoyaoww/article/details/104553474