BERT in Detail (1): From Word Embedding and ELMO to GPT and BERT


1. Pre-training in computer vision

In computer vision, pre-training usually works as follows. We first design a network structure, typically a multi-layer CNN, and train it on some large training set, say the data for task A or task B. The parameters learned on task A or B are saved for later use. When we face a new task C, we reuse the same structure for the relatively shallow lower CNN layers and initialize them with the parameters learned on task A or B, while the remaining higher layers are still randomly initialized. We then train the network on task C's data in one of two ways. One is to keep the loaded shallow-layer parameters fixed while training on task C; this is called "Frozen". The other is to use the loaded parameters only as an initialization and keep updating them as training on task C proceeds; this is called "Fine-Tuning", which, as the name suggests, adjusts the parameters so that they fit the current task C better. This is how pre-training is generally done for images and video.
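As a quick illustration of the two options, here is a minimal PyTorch sketch, assuming a torchvision ResNet-18 pre-trained on ImageNet stands in for the network trained on task A/B, and a hypothetical 10-class problem plays the role of task C:

```python
import torch.nn as nn
import torchvision.models as models

# Network whose parameters were pre-trained on a large task (ImageNet here).
net = models.resnet18(weights="IMAGENET1K_V1")

# Option 1: "Frozen" -- keep the loaded parameters fixed while training on task C.
for param in net.parameters():
    param.requires_grad = False

# The task-specific head is replaced and trained from scratch for task C
# (10 classes is a made-up number for this sketch).
net.fc = nn.Linear(net.fc.in_features, 10)

# Option 2: "Fine-Tuning" -- skip the freezing loop above; the pre-trained
# parameters only serve as initialization and keep being updated on task C's data.
```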

Benefits of pre-training:
(1) If task C has little training data, a deep network like today's ResNet/DenseNet/Inception architectures, with tens of millions of parameters, is hard to train well from so little data. But if the network is initialized with parameters pre-trained on a large collection such as ImageNet and then Fine-Tuned on the comparatively scarce data of task C so that it fits task C better, the problem becomes much easier, and tasks that could not be solved before become solvable.
(2) Even when the task at hand has plenty of training data, adding a pre-training step still greatly speeds up the convergence of training.

Why does pre-training deliver such good results?

A CNN has a hierarchical structure: neurons at different levels learn different kinds of image features, and the features build up level by level from the bottom. Take face recognition as an example. Visualizing the network (for instance with TensorBoard) shows that the lowest-level neurons learn simple line and edge segments, the second hidden layer learns the contours of facial parts such as eyes and noses, and the third layer learns the contour of the whole face, a feature hierarchy built up in three steps. The lower the layer, the more basic and universal the features: edges, corners, arcs and the like appear in images of any domain. The higher the layer, the more the extracted features are tied to the task at hand. Precisely because of this, the pre-trained lower-layer parameters extract features that have little to do with any specific task and are therefore broadly reusable, which is why a new task's network is usually initialized with the pre-trained lower-layer parameters. The higher-layer, task-specific features usually cannot be reused directly; they are either discarded or adjusted by Fine-Tuning on the new data, while the task-independent feature extractors below are kept.
The conclusion: low-level features are highly reusable, while high-level features are task-dependent.

2. Pre-training in NLP: Word Embedding

The figure below shows that the parameter matrix W learned during a training task is in fact the table of word embeddings: each word's embedding is the corresponding row of W, and multiplying a word's one-hot vector with the weight matrix simply picks out that word's vector.
(Figure: obtaining each word's embedding by multiplying its one-hot vector with the learned weight matrix W)
So what does Word Embedding have to do with pre-training? In fact, this is exactly the standard pre-training procedure, and it is widely used in downstream NLP tasks.

Using Word Embedding is equivalent to initializing the first layer of the network, the one-hot-to-embedding matrix Q, with pre-trained parameters. This is essentially the same low-level pre-training discussed above for images; the only difference is that Word Embedding only initializes the first layer of the network, and the higher layers get no help at all. When a downstream NLP task uses Word Embedding, there are the same two options as with images: one is Frozen, keeping the embedding layer's parameters fixed; the other is Fine-Tuning, where the embedding layer is updated along with the rest of the network while training on the new task's data.
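A small PyTorch sketch of both points above: the one-hot product is just a row lookup, and the pre-trained matrix Q (random numbers stand in for real pre-trained values here) can initialize an embedding layer in either the Frozen or the Fine-Tuning mode:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 5, 3             # toy sizes for illustration
Q = torch.randn(vocab_size, emb_dim)   # stand-in for a pre-trained embedding matrix

# Multiplying a one-hot row vector with Q just selects the corresponding row,
# i.e. the word's embedding.
word_id = 2
one_hot = torch.zeros(vocab_size)
one_hot[word_id] = 1.0
assert torch.allclose(one_hot @ Q, Q[word_id])

# Using the pre-trained matrix to initialize the embedding layer:
emb_frozen = nn.Embedding.from_pretrained(Q, freeze=True)    # "Frozen"
emb_tuned = nn.Embedding.from_pretrained(Q, freeze=False)    # "Fine-Tuning"
```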

Word Embedding suffers from the polysemy problem

Word Embedding encodes each word with a single vector, so it cannot distinguish the multiple senses of a polysemous word. Although the word appears in different contexts, during language-model training (for example with word2vec) every occurrence, whatever its context, is used to predict the very same word, and that word occupies a single row of the parameter matrix. As a result, information from quite different contexts is encoded into the same embedding, and Word Embedding cannot tell the senses of a polysemous word apart.

3. From Word Embedding to ELMO

The Word Embedding described above is essentially static: once training is finished, each word's representation is fixed, and when it is used later it stays the same no matter what new sentence and context the word appears in. Take the word "bank": its pre-trained Word Embedding is a mixture of its several senses, and when a new sentence comes along, even if the context (say, the sentence contains the word "money") makes it obvious that the intended meaning is the financial institution, the corresponding Word Embedding does not change; it remains a mixture of senses.

ELMO's essential idea is this: first learn each word's Word Embedding with a language model; at this point the senses of a polysemous word still cannot be distinguished, but that does not matter. When the Word Embedding is actually used, the word already sits in a specific sentence with a specific context, and the embedding can then be adjusted according to the semantics of that context. The adjusted embedding expresses the word's meaning in this particular context much better, which naturally solves the polysemy problem. So ELMO's idea is to dynamically adjust the Word Embedding according to the current context.

ELMO uses a typical two-stage procedure: the first stage pre-trains a language model; in the second stage, when a downstream task is being solved, the word representations are extracted from the layers of the pre-trained network and added to the downstream task as new features.
ELMO's network is a two-layer bidirectional LSTM. The training objective of the language model is to correctly predict a word Wi from its context: the word sequence before Wi, denoted Context-before, and the word sequence after it, denoted Context-after. In the figure, the forward two-layer LSTM encoder on the left reads the left-to-right Context-before of the word Wi to be predicted (excluding Wi itself); the backward two-layer LSTM encoder on the right reads the Context-after in right-to-left, reversed order. Each encoder is a stack of two LSTM layers, and at each layer a word's forward and backward encodings are concatenated, as shown below (a rough PyTorch stand-in for this encoder follows the figure).
(Figure: ELMO's two-layer bidirectional LSTM language model)
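The following is only a rough stand-in for that encoder, sketched with PyTorch's built-in bidirectional LSTM (a real ELMO implementation trains separate forward and backward language models and keeps each layer's output, which nn.LSTM does not expose):

```python
import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    """Simplified stand-in for ELMO's encoder: two stacked LSTM layers run in
    both directions, and each word's forward and backward states are concatenated."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embed(token_ids)
        out, _ = self.lstm(x)                  # (batch, seq_len, 2 * hidden)
        return out

encoder = ToyBiLM()
out = encoder(torch.randint(0, 10000, (1, 7)))   # shape (1, 7, 512)
```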

Once the network has been pre-trained in this way, how is it used in downstream tasks?

The figure below shows how a downstream task uses ELMO, taking question answering (QA) as the example. For a question sentence X, we first feed X into the pre-trained ELMO network, so that every word in X obtains three embeddings from the ELMO network: the word embedding itself, an embedding from the first LSTM layer (which mainly captures syntax), and an embedding from the second LSTM layer (which mainly captures semantics). Each of the three embeddings is given a weight, and these weights can be learned; the weighted sum merges the three embeddings into one. This merged embedding is then used as an additional input feature for the corresponding word in the downstream task's own network, supplementing its original features. The answer sentence Y in the QA task shown in the figure is treated the same way. Because ELMO provides features for every word to the downstream task, this kind of pre-training is called "Feature-based Pre-Training" (a sketch of the weighted combination follows the figure below).
(Figure: combining ELMO's three per-word embeddings with learned weights and feeding them into a downstream QA model)
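A hedged sketch of that weighted combination (the layer tensors below are random stand-ins for the word-level, syntax-level and semantics-level embeddings produced by the pre-trained ELMO network):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Combine the three ELMO layer representations of each word with learned
    scalar weights (a softmax over the weights plus a global scale)."""
    def __init__(self, num_layers=3):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps):               # list of (batch, seq_len, dim) tensors
        w = torch.softmax(self.weights, dim=0)
        mixed = sum(wi * rep for wi, rep in zip(w, layer_reps))
        return self.gamma * mixed                # extra features for the downstream task

# Toy usage: pretend these came from the pre-trained ELMO network.
reps = [torch.randn(1, 7, 1024) for _ in range(3)]
features = ScalarMix()(reps)                     # shape (1, 7, 1024)
```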
ELMO also has some shortcomings:
(1) The choice of feature extractor: ELMO uses LSTM rather than the newer Transformer, whose feature-extraction ability is far stronger than LSTM's.
(2) ELMO fuses the two directions by concatenating features, which may be a weaker way of integrating them than BERT's fusion; this, however, is only a suspicion inferred from general reasoning, and there is no specific experiment demonstrating the point.

4. From Word Embedding to GPT

GPT also uses a two-stage procedure: the first stage pre-trains a language model, and the second stage solves downstream tasks in Fine-Tuning mode. The figure below shows GPT's pre-training process, which is actually similar to ELMO's; the main differences are two:
(1) The feature extractor is not an RNN but a Transformer.
(2) Although GPT's pre-training objective is still a language-model task, the language model is one-way (unidirectional): the word Wi is predicted only from its context-before, and its context-after is discarded (a minimal sketch of this constraint follows the figure below).
(Figure: GPT pre-training with a Transformer as feature extractor and a unidirectional language model)
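A minimal sketch of what "one-way" means in practice: a causal attention mask so that position i can only attend to positions at or before i (the sequence length below is arbitrary):

```python
import torch

def causal_mask(seq_len):
    """Left-to-right language-model mask: position i may attend only to j <= i,
    so each word is predicted from its context-before and never sees what follows."""
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

mask = causal_mask(5)
# In a Transformer self-attention layer, the attention scores where mask is False
# are filled with -inf before the softmax, which hides all future words.
```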
How is GPT used in downstream tasks?
First, whereas before you could design whatever network structure you liked for each downstream task, now you cannot: the task's network has to be reshaped to match GPT's network structure. Then, when solving the downstream task, the first step is to initialize that network with the pre-trained GPT parameters, which brings the linguistic knowledge learned during pre-training into the task at hand; after that, the network is trained on the task's own data, Fine-Tuning the parameters so that the network fits the problem at hand better.

How does GPT reshape downstream tasks?
(1) For classification, almost nothing changes: just add a symbol at the beginning and at the end.
(2) For sentence-relation problems such as entailment, adding a delimiter between the two sentences is enough.
(3) For text-similarity judgments, build two inputs, one for each order of the two sentences, to tell the model that the sentence order does not matter.
(4) For multiple-choice questions, use multiple input channels, each concatenating the article with one answer option as its input (a toy sketch of these input formats follows the figure below).
(Figure: GPT's input transformations for classification, entailment, similarity and multiple-choice tasks)
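A toy sketch of these input transformations, where the special token names (<s>, <$>, <e>) are placeholders rather than GPT's actual vocabulary entries and each text is assumed to be a list of tokens:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"   # placeholder special tokens

def classification_input(text):
    # (1) classification: just wrap the text with start and end symbols
    return [START] + text + [EXTRACT]

def entailment_input(premise, hypothesis):
    # (2) sentence-relation tasks: a delimiter between the two sentences
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(a, b):
    # (3) similarity: two inputs, one per sentence order, since order should not matter
    return entailment_input(a, b), entailment_input(b, a)

def multiple_choice_inputs(article, options):
    # (4) multiple choice: one input channel per answer option
    return [entailment_input(article, opt) for opt in options]
```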
GPT's main drawback is that its language model is unidirectional.

5. BERT

BERT uses exactly the same two-stage model as GPT: first a language model is pre-trained, then downstream tasks are solved in Fine-Tuning mode. The main difference from GPT is that the pre-training stage uses a bidirectional language model similar in spirit to ELMO's; another difference is that the scale of the language-model training data is larger than GPT's.

  • How can a bidirectional language-model task be run on the Transformer structure?

The authors propose what they call the Masked Language Model to solve this problem. It is in fact very similar in spirit to the CBOW model of word2vec: the core idea is, while doing the language-model task, to knock out the word to be predicted and then predict it from its Context-before and Context-after. Concretely:
Randomly select 15% of the words in the corpus and knock them out, i.e. replace the original word with the [mask] token, and require the model to predict the removed word correctly. But there is a problem: during training the model sees [mask] tokens everywhere, yet at actual-use time this token never appears, which biases the model toward a symbol it will never see in practice. To avoid this, BERT adds a twist:
Of the 15% of words selected for masking, only 80% are actually replaced with the [mask] token, 10% are replaced with a random other word, and the remaining 10% are left unchanged.
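A toy implementation of this masking rule, assuming the input is already a list of word tokens and a small stand-in vocabulary is available for the random replacements:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy version of BERT's masking: pick ~15% of positions as prediction
    targets, then replace 80% of them with [MASK], 10% with a random word,
    and leave 10% unchanged."""
    tokens = list(tokens)
    targets = {}                                  # position -> original token
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"
            elif r < 0.9:
                tokens[i] = random.choice(vocab)  # random replacement
            # else: keep the original token unchanged
    return tokens, targets

masked, targets = mask_tokens(["my", "dog", "is", "hairy"], vocab=["cat", "apple", "runs"])
```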

  • BERT's other innovation is called Next Sentence Prediction

When selecting the two sentences for language-model pre-training, there are two cases: either the two sentences are genuinely consecutive in the corpus, or the second sentence is drawn at random from the corpus and appended after the first. Besides the Masked Language Model task above, the model is also required to predict the sentence relationship, i.e. to judge whether the second sentence really follows the first. This task was added because many NLP tasks are about judging the relationship between sentences, and word-level prediction does not reach the sentence level; the extra task helps downstream sentence-relationship tasks. So BERT's pre-training is really a multi-task process.
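A sketch of how such sentence pairs could be sampled, assuming the corpus is a list of documents and every document is a list of at least two sentences:

```python
import random

def make_nsp_example(corpus):
    """Return (sentence_a, sentence_b, is_next) for Next Sentence Prediction.
    corpus: list of documents, each a list of at least two sentences."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], True               # genuinely consecutive
    other = random.choice(corpus)                     # a full version would also ensure
    return sent_a, random.choice(other), False        # the random sentence is unrelated

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(corpus))
```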

  • BERT's input representation

BERT's input is a linear sequence: two sentences are separated by a delimiter, and two marker symbols are added, one at the very front and one at the end. Each token carries three embeddings: the token embedding, the segment embedding and the position embedding.
(1) Token embedding
The role of the token embedding layer is to convert each token into a vector representation of fixed dimension. In BERT, each token is represented as a 768-dimensional vector. Suppose the input text is "I like strawberries"; the figure below depicts what the token embedding layer does.
(Figure: the token embedding layer turning the tokenized input "I like strawberries" into 768-dimensional vectors)
Before the input text reaches the token embedding layer, it is first tokenized. In addition, extra tokens are added at the beginning ([CLS]) and end ([SEP]) of the token sequence. The [CLS] token serves as the input representation for classification tasks, and [SEP] separates the two texts of a paired input.
Tokenization uses a technique called WordPiece. It is similar to BPE and handles out-of-vocabulary words effectively; this is what splits "strawberries" into "straw" and "berries".
The token embedding layer converts each WordPiece token into a 768-dimensional vector, turning our six input tokens into a matrix of shape (6, 768), or a tensor of shape (1, 6, 768) if we include the batch dimension.
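As a quick check of this tokenization step, here is how it could look with the Hugging Face transformers library (not part of the original post; the exact WordPiece split depends on the vocabulary of the checkpoint used):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("I like strawberries")
# expected to look like ['i', 'like', 'straw', '##berries'] -- the rare word is split
tokens = ["[CLS]"] + tokens + ["[SEP]"]      # six tokens in total, matching the text above
print(tokens)
```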

(2) Segment embedding
BERT can solve NLP tasks that involve classifying a pair of texts, for example deciding whether two texts are semantically similar. The two texts are simply concatenated and fed to the model as one input, so how does BERT tell them apart? The answer is the segment embedding.
Suppose the input text pair is ("I like cats", "I like dogs"). Here is how the segment embedding helps BERT distinguish the tokens of the two inputs:
(Figure: segment embeddings distinguishing the tokens of the two input sentences)
The segment embedding layer has only two vector representations. The first vector (index 0) is assigned to every token of input 1, and the second vector (index 1) is assigned to every token of input 2. If the input is a single sentence, its segment embedding is just the vector at index 0 of the segment embedding table.

(3) Position embedding
Transformers do not encode the sequential order of their input, so a position encoding must be added. BERT is designed to handle input sequences of length up to 512, and it learns a vector representation for each position so that the sequential nature of the input is captured. This means the position embedding layer is a lookup table of size (512, 768), where the first row is the vector for any word appearing in the first position, the second row is the vector for any word in the second position, and so on. Thus if we input "Hello world" and "Hi there", "Hello" and "Hi" receive the same position embedding because each is the first word of its input sequence, and likewise "world" and "there" share the same position embedding.

(4) The overall word embedding
An input sequence of n tokens therefore ends up with three different representations, namely:

  • the token embeddings, of shape (1, n, 768), which are simply the word vector representations;
  • the segment embeddings, of shape (1, n, 768), which help BERT distinguish the two sequences of a paired input;
  • the position embeddings, of shape (1, n, 768), which let BERT know that its input has a sequential (temporal) order.

These representations are summed element-wise to produce a single representation of shape (1, n, 768), and this combined representation is what gets passed to BERT's encoder layers (a minimal sketch follows the figure below).
(Figure: token, segment and position embeddings summed element-wise into BERT's input representation)
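A minimal sketch of this summation, with the table sizes taken from the text (two segment vectors, 512 positions, 768 dimensions; 30522 is the vocabulary size of the standard BERT-base WordPiece vocabulary):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, segment and position embeddings, each 768-dimensional."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)          # only two segment vectors
        self.position = nn.Embedding(max_len, hidden)   # lookup table of size (512, 768)

    def forward(self, token_ids, segment_ids):          # both shaped (1, n)
        n = token_ids.size(1)
        positions = torch.arange(n, device=token_ids.device).unsqueeze(0)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

emb = BertEmbeddings()
token_ids = torch.randint(0, 30522, (1, 6))
segment_ids = torch.zeros(1, 6, dtype=torch.long)
out = emb(token_ids, segment_ids)                       # shape (1, 6, 768)
```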

  • Fine-Tuning stage

The practice in this stage is the same as GPT's. Of course, BERT also has to reshape the network structure of downstream tasks, and its transformations differ somewhat from GPT's:
(1) For sentence-relation tasks it is very simple: as in GPT, add a start symbol and an end symbol and put a delimiter between the two sentences. For the output, connect a softmax classifier to the top Transformer layer's output at the position of the starting symbol ([CLS]).
(Figure: adapting BERT to downstream tasks)
(2) For an ordinary single-sentence classification task, the input is one sequence and all tokens belong to the same segment (segment id = 0); again the last-layer output at the special [CLS] token is connected to a softmax classifier and the model is Fine-Tuned on the classification data.
(3) For sequence labeling, the input is the same as for single-sentence classification; the only difference is that the last Transformer layer's output at every word position is classified (a toy sketch of both kinds of task head follows this list).
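A toy sketch of the two kinds of task heads just described, taking the encoder's last-layer output as given (the label counts are arbitrary):

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Sentence(-pair) classification: softmax classifier on the top-layer
    Transformer output at the starting [CLS] position."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, sequence_output):                  # (batch, seq_len, hidden)
        return self.classifier(sequence_output[:, 0])    # [CLS] sits at position 0

class TaggingHead(nn.Module):
    """Sequence labeling: classify the top-layer output at every word position."""
    def __init__(self, hidden=768, num_tags=9):
        super().__init__()
        self.classifier = nn.Linear(hidden, num_tags)

    def forward(self, sequence_output):
        return self.classifier(sequence_output)          # (batch, seq_len, num_tags)

encoder_out = torch.randn(1, 6, 768)                     # stand-in for BERT's last layer
cls_logits = ClsHead()(encoder_out)                      # (1, 2)
tag_logits = TaggingHead()(encoder_out)                  # (1, 6, 9)
```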

6. Summary

From the perspective of models and methods, BERT draws on ELMO, GPT and CBOW; its main proposals are the Masked Language Model and Next Sentence Prediction, though Next Sentence Prediction hardly affects the overall result, and the Masked LM obviously borrows the idea of CBOW. So the BERT model itself contains no major innovation; it is more a masterful synthesis of the important NLP progress of recent years.
In essence, it designs a good language-model task on a well-chosen network architecture and pre-trains on vast amounts of unlabeled natural-language text, so that a large amount of linguistic knowledge is extracted and encoded into the network. When the task at hand has only limited labeled data, this prior linguistic knowledge greatly supplements the task's own features: with limited data many linguistic phenomena are simply not covered and generalization is weak, so integrating as much general linguistic knowledge as possible naturally strengthens the model's ability to generalize.

7. References

(1) "BERT, NLP's killer weapon: an interpretation of the model" (Chinese blog post)
(2) "How is BERT's embedding layer implemented? Read this and you will understand" (Chinese blog post)
