An overview of NLP pre-training models: from word2vec and ELMo to BERT

Table of Contents

 

Foreword

word2vec

The model

Negative sampling

char-level and context

ELMo

BERT

Deep bidirectional encoding

Learning sentence and sentence-pair representations

A downstream task interface that is almost too simple


Foreword

Recall that not long ago in machine reading comprehension, Microsoft and Alibaba surpassed human performance on SQuAD with R-Net+ and SLQA respectively, and Baidu dominated MS MARCO with V-Net, beating humans on the BLEU metric. These networks are each more complicated than the last, and it seems that "how to design a more task-specific network" has become the politically correct research direction in NLP. In this climate, word2vec, GloVe, and fastText can only serve as icing on the cake. And transfer learning and pre-training? They never seemed to be the protagonists in NLP.
Xiao Xi felt a bit ashamed writing this article. After spending a long time on representation and transfer learning, and despite the intuition that this should be a core problem of NLP, I never produced experimental results that satisfied me. The day before BERT came out I was still blaming poverty for limiting my imagination ╮( ̄▽ ̄"")╭ (crossed out); in hindsight my perspective was simply too narrow. Everyone understands BERT differently; this article will try to talk about BERT from the perspective of word2vec and ELMo. What follows is a brief review of the essence of word2vec and ELMo; readers already very familiar with them can skip straight down to the BERT chapter.

word2vec

These are clichés I am happy to write over and over again. When Google's word2vec came out in 2013, it blossomed across every corner of NLP; it seemed embarrassing to even write a paper without using pre-trained word vectors. So what exactly is word2vec?

The model

At its core it is a "linear" language model. Since our goal is to learn word vectors, and the word vectors should semantically support "linear semantic operations" such as "Emperor - Queen = Male - Female" (ignoring Wu Zetian), a linear model is naturally sufficient: it runs very fast and gets the job done, very elegantly.
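
To make the "linear semantic operation" concrete, here is a minimal sketch using gensim's KeyedVectors. This is my own illustration, not part of the original post; it assumes a pre-trained word2vec file (the GoogleNews file name below is a placeholder) is available on disk.

```python
# Sketch of the word-analogy property of word2vec vectors (illustration only).
# Assumes a pretrained word2vec binary is available locally.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# king - man + woman ≈ queen, i.e. "Emperor - Male + Female ≈ Queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```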

Besides that, one essential contribution of word2vec is its take on the softmax acceleration tricks of language modeling: it replaces the traditional hierarchical softmax and NCE methods with a seemingly wild idea called "negative sampling". So what exactly is this "negative sampling"?

Negative sampling

We know that the softmax layer is very expensive when training a language model. After all, what we want to predict is the word at the current position, so the number of classes equals the vocabulary size, often tens of thousands to hundreds of thousands; evaluating that softmax is of course laborious. However, if our goal is not to train an accurate language model but only to obtain its by-product, the word vectors, then we only need a much cheaper surrogate task to do the training.

Think about it: if I give you 10,000 numbered cards and ask you to find the largest one, that is quite laborious. But if I pull out the largest card in advance, mix it with five randomly drawn cards, and ask you to pick out the largest, isn't that much easier? Negative sampling is exactly this idea: instead of making the model find the most likely word over the entire vocabulary, we directly give it the true word (the positive example) together with several randomly sampled noise words (the negative examples); as long as the model can pick the correct word out of this small set, the objective is considered met. The objective function corresponding to this idea is

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right],$$

where $w_O$ is the positive example, the $w_i$ are the $k$ randomly sampled negative examples, and $\sigma$ is the sigmoid function: we maximize the likelihood of the positive example and minimize the likelihood of the negative examples.

This negative-sampling idea was later successfully applied in the BERT model, with the granularity changed from words to sentences. Don't worry, we will come back to it slowly ~
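
A minimal PyTorch sketch of the objective above, written by me for illustration (it is not word2vec's actual implementation): score the positive pair and the k sampled negatives, then maximize log σ(positive) plus log σ(−negatives).

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(center_vec, pos_vec, neg_vecs):
    """center_vec: (d,) input word vector; pos_vec: (d,) true context word;
    neg_vecs: (k, d) randomly sampled noise words."""
    pos_score = torch.dot(center_vec, pos_vec)   # score of the positive pair
    neg_scores = neg_vecs @ center_vec           # (k,) scores of the k negatives
    # maximize log σ(pos) + Σ log σ(-neg)  <=>  minimize its negative
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_scores).sum())

# toy usage with random vectors
d, k = 100, 5
loss = negative_sampling_loss(torch.randn(d), torch.randn(d), torch.randn(k, d))
```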

char-level and context

Although a lot of work from 2015 to 2017 tried to start from the char-level and find another way out of the pre-trained-word-vector game, the measured gains turned out to be a flash in the pan, and these attempts were quickly beaten back [8][9]. Still, people realized that char-level text contains patterns that are hard to describe at the word level. So, on the one hand, word vectors that can learn char-level features appeared, such as fastText [5]; on the other hand, in supervised tasks people began to introduce char-level text representations through shallow CNNs, Highway networks, RNNs and other networks.

However, up to this point word vectors were all context-free. That is, the same word always gets the same vector regardless of its context, which obviously means the word-vector model has no word sense disambiguation (WSD) ability. So, to make word vectors context-sensitive, people started to encode the word-vector sequence inside each specific downstream task. The most common encoder is of course an RNN, but deep CNNs have also been used successfully (for text classification [6], machine translation [7], machine reading comprehension [4]). And then! Google said: CNNs are too vulgar, we will use fully connected networks! (crossed out) self-attention! Hence the Transformer [11], a model deeply customized for NLP. It was proposed for machine translation, but it has also shown great power in other fields such as retrieval-based dialogue [3].

However, since it turns out that basically every NLP task needs such encoding anyway, why not give word vectors context-sensitivity from the very beginning? And so there is ELMo [2].

ELMo

Of course, ELMo is not actually the first model that tried to produce context-sensitive word vectors, but it is indeed a model that gives you a good reason to abandon word2vec (manual smile): sacrificing some inference speed for that much performance gain is, in most cases, worth it ~

At the model level, ELMo is a stacked bi-LSTM (strictly speaking, two stacked unidirectional LSTMs trained in the two directions), so of course it has good encoding capacity. Its source-code implementation also supports using a Highway network or a CNN to additionally introduce char-level encoding. Training it is, naturally, the standard maximum-likelihood objective of a language model. But the highlight of ELMo is not the model itself; rather, it shows, indirectly but experimentally, that in a multi-layer RNN the different layers really do learn different kinds of features. Therefore, when pre-training is complete and the model is transferred to a downstream NLP task, ELMo proposes to give the original word-vector layer and each RNN hidden layer a trainable scalar parameter. These parameters are normalized with a softmax, multiplied onto their corresponding layers, and summed, which amounts to a weighted average; the resulting "weighted-sum" word vector is then scaled by one more parameter so that it better fits the downstream task.

ps: this last scaling parameter actually matters quite a bit. For example, in word2vec the variance of the vectors learned by CBOW and skip-gram can differ considerably, and word vectors whose variance matches the variance expected by the subsequent layers of the downstream task tend to perform better.

The mathematical expression is as follows:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$

where L = 2 is the setting in the ELMo paper; j = 0 denotes the original word-vector layer, j = 1 the first LSTM hidden layer, and j = 2 the second. $s_j^{task}$ is the softmax-normalized version of the trainable parameters (that is, $\sum_j s_j^{task} = 1$), and $\gamma^{task}$ is the extra scaling parameter. With this migration strategy, tasks that need word sense disambiguation tend to learn a large weight for the second hidden layer, while tasks with obvious part-of-speech and syntax requirements tend to learn a relatively large weight for the first hidden layer (an experimental conclusion of the paper).

In short, this gives you a word vector that downstream tasks can customize and that carries richer features, so it is not surprising that it works much better than word2vec. That said, the goal of ELMo is still just to learn more context-sensitive, more powerful word vectors; its purpose is still to provide a solid foundation for downstream tasks, not to replace their models.

And we know that merely encoding the text sufficiently well (that is, obtaining very precise and rich features for each token position) is not enough to cover all NLP tasks. In tasks such as QA, machine reading comprehension (MRC), natural language inference (NLI), and dialogue, there are many more complex patterns to capture, such as relationships between sentences. Hence downstream networks keep piling on all kinds of fancy attention (see the SOTA models in NLI, MRC, chatbots). With ever more magical patterns to capture, researchers customize a different network structure for every downstream task, and the result is that the same model collapses as soon as the task changes slightly; even within the same task, switching to a dataset with a different distribution causes a significant performance drop. This is obviously not how human language ability behaves ~ human generalization is very strong, which suggests that the development trajectory of NLP may have gone wrong, especially under the lead of SQuAD, where people exhaust every trick and fancy structure just to climb the leaderboard. What is the point of that for NLP?

It seemed we had drifted far away, but fortunately this ever-growing deviation was finally blocked by one model: Bidirectional Encoder Representations from Transformers (BERT) [1], released by Google a few days before this post was written.
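
The weighted sum above can be written as a tiny module. This is a simplified sketch of my own (not the official ELMo/allennlp code): softmax-normalize the per-layer weights, mix the layer outputs, and scale by γ.

```python
import torch
import torch.nn.functional as F

class ScalarMix(torch.nn.Module):
    """ELMo-style weighted sum of layer representations (simplified sketch)."""
    def __init__(self, num_layers=3):  # L = 2 -> 3 layers including the word-vector layer (j = 0, 1, 2)
        super().__init__()
        self.weights = torch.nn.Parameter(torch.zeros(num_layers))  # normalized to s_j by softmax
        self.gamma = torch.nn.Parameter(torch.ones(1))               # task-specific scale γ

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per layer
        s = F.softmax(self.weights, dim=0)
        mixed = sum(s[j] * h for j, h in enumerate(layer_outputs))
        return self.gamma * mixed

# toy usage: 3 layers of fake ELMo outputs for 2 sentences, 7 tokens, 1024 dims
layers = [torch.randn(2, 7, 1024) for _ in range(3)]
word_vectors = ScalarMix()(layers)
```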

BERT

The most important contribution of this paper is not which model it uses, nor how that model is trained, but that it proposes a whole new set of game rules. Before the game starts, let Xiao Xi slip in a little advertisement, okay? \(//∇//)\

As said before, deeply customizing a complex model structure with extremely poor generalization ability for every single NLP task is actually very unwise, and it walks in the wrong direction. Since ELMo brought such a big improvement over word2vec, the potential of pre-trained models clearly goes far beyond supplying an accurate word vector to downstream tasks. So can we directly pre-train a keel-level (backbone) model? If it already fully describes features at the character level, word level, and sentence level, and even the relationships between sentences, then each NLP task would only need an extremely lightweight task-specific output layer (such as a single-layer MLP), because the skeleton of the model has already been built. And BERT did exactly this; or rather, it really pulled it off: as a general keel-level model, it easily swept 11 tasks previously held by deeply customized models. So how did it do that?

Deep bidirectional encoding

First, BERT points out that the previous pre-trained models were not sufficient for learning context-sensitive word vectors! Although in downstream supervised tasks the encoding methods are already full of bells and whistles, and deep bidirectional encoding has basically become standard for many complex tasks (such as MRC and dialogue), in pre-training the previous state-of-the-art models were still based on the traditional language model, which is unidirectional by its mathematical definition, and usually quite shallow as well (imagine stacking an LSTM to the third layer: training stalls and you have to resort to all kinds of tricks). ELMo is an example.

In addition, although ELMo uses bidirectional RNNs for encoding, the RNNs in the two directions are actually trained separately, with only a simple sum at the loss layer. As a result, for any given token, the encoder in one direction never sees the words on the other side of it. But obviously, the meaning of some words in a sentence depends on particular words both to the left and to the right, and encoding from a single direction cannot describe this clearly.

So why not do true bidirectional encoding, as in the downstream supervised tasks? The reason becomes clear once you think about it: the traditional language model is trained to predict the next word, but with bidirectional encoding, wouldn't that mean the word to be predicted has already been seen? ╮( ̄▽ ̄"")╭ Such a prediction is of course meaningless. Therefore, BERT proposes a new task to train a model that can truly encode in both directions, as in the supervised tasks: the Masked Language Model (Masked LM).

Masked LM

As the name implies, Masked LM means that instead of feeding in the words seen so far and predicting the next word like a traditional LM, we directly cover up (mask) a randomly selected part of the whole sentence; then we can safely do bidirectional encoding and safely ask the model to predict what those covered words are. This task was in fact known long ago as the cloze test.

This obviously introduces a small problem: although bidirectional encoding is now safe, the mask tags themselves get encoded too ╮( ̄▽ ̄"")╭ and these mask tags do not exist in downstream tasks. So what to do? To tune the model into ignoring the influence of these tags as much as possible, the authors essentially tell the model "these are noise, unreliable, ignore them!" Concretely, for a word chosen to be covered (see the sketch after this list):

  • With 80% probability, replace it with the "[MASK]" tag

  • With 10% probability, replace it with a randomly sampled word

  • With 10% probability, leave it unchanged (it still needs to be predicted)
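
The recipe above is easy to write down. This is my own illustrative sketch (not Google's actual pre-processing code), with the usual 15% selection rate as an assumed default.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of BERT's 80/10/10 masking recipe (illustration only).
    tokens: list of str; vocab: list of str to sample random replacements from."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:          # choose ~15% of positions to predict
            targets[i] = tok                     # the model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"             # 80%: replace with the [MASK] tag
            elif r < 0.9:
                inputs[i] = random.choice(vocab) # 10%: replace with a random word
            # else 10%: keep the original token (but still predict it)
    return inputs, targets

masked, labels = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "blue"])
```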

Encoder

For the encoder, the authors did not use the ubiquitous bi-LSTM, but the Transformer encoder, which can be made deeper and parallelizes better. In this way, the token at each position can directly encode every word in the sentence, regardless of direction and distance. On the other hand, I subjectively feel that the Transformer is easier to shield from the mask tags than an LSTM: the self-attention process can simply reduce the matching weight given to a mask tag, whereas how the input gate of an LSTM treats a mask tag is anyone's guess.

Wait a second. Didn't Xiao Xi say in an earlier article that a Transformer encoder loses position information? Did the authors also use the dreadful sin/cos positional encoding of the original Transformer paper? Not at all: here the authors very simply and crudely train a position embedding directly ╮( ̄▽ ̄"")╭ That is, if sentences are truncated to length 50, then there are 50 positions, and thus 50 "position words", from position 0 to position 49... Each position gets a randomly initialized embedding vector that is trained along with everything else (I want to say: this actually works?! Too simple and crude...). For combining the position embedding with the word embedding, BERT simply chooses addition.

Finally, in terms of depth, the full-size BERT encoder stacks 24 layers of multi-head attention blocks (remember that DAM, the SOTA model in dialogue, only uses 5 layers...), and each block uses 16 attention heads and 1024 hidden units ╮( ̄▽ ̄"")╭ so the slogan here is: money is all you need (crossed out).
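
For reference, these BERT-large dimensions can be written out with the Hugging Face transformers library, which of course postdates this post and is only used here as a convenient sketch; the values mirror the numbers quoted above.

```python
# Sketch of the BERT-large encoder dimensions mentioned above (illustration only).
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=24,        # 24 stacked Transformer encoder blocks
    num_attention_heads=16,      # 16 attention heads per block
    hidden_size=1024,            # 1024 hidden units
    intermediate_size=4096,      # feed-forward inner size (4x hidden, the usual choice)
    max_position_embeddings=512, # learned position-embedding table: one vector per position
)
model = BertModel(config)        # randomly initialized; the pre-training is the expensive part
```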

Learning sentence and sentence-pair representations

As mentioned earlier, for many tasks encoding alone is not enough (it only yields a bunch of token-level features); you also need to capture some sentence-level patterns to handle tasks like NLI, QA, and dialogue that require sentence representations, inter-sentence interaction, and matching. For this, BERT introduces another extremely important yet extremely lightweight task to learn such patterns.

Sentence-level negative sampling

Remember what Xiao Xi said in the word2vec chapter: one essence of word2vec is that it introduces an elegant negative-sampling task to learn word-level representations. What if we generalize this negative-sampling process to the sentence level? That is exactly the key to how BERT learns sentence-level representations.

BERT here is similar to word2vec, but constructs a sentence-level classification task: given the first sentence (analogous to the given context in word2vec), its actual next sentence is the positive example (analogous to the correct word in word2vec), and a randomly sampled sentence is the negative example (analogous to the randomly sampled word); a binary classification is then done at the sentence level, i.e. judging whether the candidate sentence is the true next sentence of the current sentence or just noise. Through this simple sentence-level negative-sampling task, BERT can learn sentence representations as easily as word2vec learns word representations.
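
A minimal sketch of how such training pairs could be built, written by me for illustration (the function name and the 50/50 split are assumptions, not BERT's actual data pipeline code):

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one sentence-level negative-sampling (next-sentence) example.
    doc_sentences: sentences of one document, in order (needs at least 2);
    all_sentences: a pool of sentences from the whole corpus."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], 1          # positive: the true next sentence
    else:
        sent_b, label = random.choice(all_sentences), 0  # negative: a randomly sampled sentence
    return sent_a, sent_b, label
```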

Sentence-level representation

Wait, all this talk and I still haven't said how the sentence is actually represented... Here BERT does not follow the common practice in downstream supervised tasks of applying a global pooling on top of the encoding. Instead, it prepends a special token, marked [CLS], to every sequence (for sentence-pair tasks this is the two sentences spliced together; for other tasks it is a single sentence), as shown in the figure.

ps: the [SEP] here is the separator between sentences. BERT also supports learning representations of sentence pairs, and [SEP] marks the cutting point between the two sentences of a pair.

Then the encoder encodes [CLS] deeply just like every other position, and the top hidden layer at that position is taken as the representation of the whole sentence or sentence pair. This looks puzzling at first, but don't forget that the Transformer can encode global information into every position regardless of distance, and [CLS] is connected directly to the output layer of the classifier, so as a "checkpoint" on the gradient back-propagation path it will of course find a way to learn the high-level features relevant to classification.

In addition, to let the model distinguish whether each word belongs to the "left sentence" or the "right sentence", the authors introduce the concept of a "segment embedding" to tell the sentences apart. For sentence pairs, embedding A and embedding B mark the left and right sentences respectively; for single sentences there is only embedding A. Embeddings A and B are also trained along with the model.
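
In code, using the top-layer [CLS] vector really is this simple. A sketch of my own, assuming the encoder output shape (batch, seq_len, hidden) with [CLS] at position 0:

```python
import torch

class ClsClassifier(torch.nn.Module):
    """Classify a sentence (or sentence pair) from the top-layer [CLS] vector (sketch)."""
    def __init__(self, hidden_size=1024, num_labels=2):
        super().__init__()
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, encoder_output):
        # encoder_output: (batch, seq_len, hidden); position 0 is the [CLS] token
        cls_vector = encoder_output[:, 0, :]
        return self.classifier(cls_vector)   # logits for the classification task

logits = ClsClassifier()(torch.randn(8, 128, 1024))  # toy batch of 8 sequences
```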

ps: this approach feels as simple and crude as the position embedding. It is genuinely hard to understand why BERT still works on tasks like Quora Question Pairs, which in theory require the network to be symmetric. Mixed feelings.

So in the end, BERT's representation of each token is formed by adding up the token's original word vector (the token embedding), the position embedding mentioned above, and the segment embedding described here, as shown in the figure:
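
Putting the three parts together, the input representation is just a sum of three embedding lookups. This is an illustrative sketch with assumed default sizes, not BERT's actual implementation:

```python
import torch

class BertEmbeddings(torch.nn.Module):
    """Input representation: token + position + segment embeddings, summed (sketch)."""
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, dim=1024):
        super().__init__()
        self.token = torch.nn.Embedding(vocab_size, dim)
        self.position = torch.nn.Embedding(max_len, dim)       # learned, not sin/cos
        self.segment = torch.nn.Embedding(num_segments, dim)   # embedding A / embedding B

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0..seq_len-1, broadcast over the batch
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

emb = BertEmbeddings()(torch.randint(0, 30522, (2, 16)),
                       torch.zeros(2, 16, dtype=torch.long))
```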

A downstream task interface that is almost too simple

What really shows that BERT is a keel-level model rather than just another word vector is its interface to the various downstream tasks, or, to use a fancier word, its migration (transfer) strategy.

First, since we already have a top-level representation of a sentence or sentence pair, for text classification tasks and text matching tasks (text matching is really just classification whose input is a text pair) we only need to take that representation (i.e. the encoder output at the top of the [CLS] position) and add a single MLP layer ~

Since the text has already been deeply and bidirectionally encoded, sequence labeling tasks only need a softmax output layer on top; not even a CRF is needed ~

What surprised Xiao Xi even more is that on span-extraction tasks like SQuAD, BERT dares to skip the two big gift packages of deep encoding plus deep attention entirely, and even throws away the pointer net at the output layer, simply using two linear classifiers, as in DrQA, to output the start and the end of the span respectively (a sketch follows below). Nothing more to say, already kneeling m(_ _)m

Finally, let's look at the experimental results.
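
The span-extraction head really is just two linear layers over every token position. A sketch of my own under that description (not the exact BERT SQuAD code):

```python
import torch

class SpanHead(torch.nn.Module):
    """SQuAD-style output layer: one linear classifier scoring span starts and
    one scoring span ends, applied to every token (sketch)."""
    def __init__(self, hidden_size=1024):
        super().__init__()
        self.start = torch.nn.Linear(hidden_size, 1)
        self.end = torch.nn.Linear(hidden_size, 1)

    def forward(self, encoder_output):
        # encoder_output: (batch, seq_len, hidden)
        start_logits = self.start(encoder_output).squeeze(-1)  # (batch, seq_len)
        end_logits = self.end(encoder_output).squeeze(-1)
        return start_logits, end_logits

starts, ends = SpanHead()(torch.randn(4, 384, 1024))  # toy batch
```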

Well, this is very Google. When the paper came out, Xiao Xi was actually quite happy, because many earlier ideas no longer need experimental verification; BERT has already crushed them (。́︿ ̀。). Classification, tagging, and transfer tasks can all be restarted from scratch, and the tower-building plans on SQuAD can also stop. Thankfully BERT did not run generation tasks, which leaves a little room for imagination. Well, smiling and crying, manually.

References

[1] 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2] 2018 NAACL | Deep contextualized word representations
[3] 2018 ACL | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network
[4] 2018 ICLR | Fast and Accurate Reading Comprehension by Combining Self-Attention and Convolution
[5] 2017 TACL | Enriching Word Vectors with Subword Information
[6] 2017 ACL | Deep Pyramid Convolutional Neural Networks for Text Categorization
[7] 2017 | Convolutional Sequence to Sequence Learning
[8] 2017 | Do Convolutional Networks need to be Deep for Text Classification?
[9] 2016 | Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level
[10] 2013 NIPS | Distributed Representations of Words and Phrases and their Compositionality
[11] 2017 NIPS | Attention Is All You Need
