Three Feature Extractors (RNN / CNN / Transformer)


Brief introduction

In recent years, deep learning has produced SOTA results across a wide range of NLP tasks. In this section, we look at the feature extraction structures most commonly used in natural language processing.

Part of this article draws on Zhang Junlin's post "Abandon Your Fantasies and Embrace the Transformer: Comparing the Three Major NLP Feature Extractors (CNN / RNN / TF)" (very well written, a must-read post for anyone learning NLP). I summarize that post to some extent and add some personal understanding.

Since deep learning became popular, as our networks have grown deeper, neural network models have come to resemble black boxes: as long as we feed them data, the model automatically learns to extract the features most useful for the target task (this is especially pronounced in CV, where different layers of a convolutional neural network output image features at different levels of detail), thereby realizing the well-known "end-to-end" model. In this sense, we can regard CNN, RNN and Transformer as feature extractors for the data. This article briefly introduces the basic structure, advantages and disadvantages of RNN, CNN and NLP's new favorite, the Transformer.

Recurrent neural networks (RNN)

Traditional RNN

In 2018, the state-of-the-art results in most NLP sub-fields were obtained with RNNs (here including LSTM, GRU and other variants). Why do RNNs enjoy such a wide range of applications in NLP? We know that if a fully connected network is applied to an NLP task, it faces three major problems:

  • Different samples may have inputs and outputs of different lengths, so the number of neurons in the input and output layers cannot be fixed.
  • Features learned at one position of the input text cannot be shared with other positions.
  • The model has too many parameters and is too computationally expensive.

To solve these problems, we have the familiar RNN structure. It scans the input sequence so that the network parameters are shared across all time steps, and each time step receives not only the current input but also the output of the previous time step; this allows past information to assist the decision made at the current time. A minimal sketch of this recurrence is shown below.
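
To illustrate the parameter sharing and the recurrence just described, here is a minimal single-layer RNN cell in PyTorch-style Python. The module name and dimensions are chosen for illustration only; this is a sketch of the vanilla recurrence, not the code of any particular paper.

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """One step of a vanilla RNN: a_t = tanh(W_aa @ a_{t-1} + W_ax @ x_t + b)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_ax = nn.Linear(input_dim, hidden_dim, bias=True)
        self.W_aa = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_t, a_prev):
        return torch.tanh(self.W_ax(x_t) + self.W_aa(a_prev))

# Scan a sequence: the same cell (same parameters) is applied at every time step.
cell = VanillaRNNCell(input_dim=300, hidden_dim=128)
x = torch.randn(20, 32, 300)   # (seq_len, batch, input_dim)
a = torch.zeros(32, 128)       # initial hidden state
for x_t in x:                  # sequential loop, which is why RNNs are hard to parallelize
    a = cell(x_t, a)
```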

However, the original RNN has a problem: it aggregates the input sequentially from front to back through a linear chain structure, and this linear structure is not good at capturing long-range dependencies in text, as shown in the figure. The main reason is that the backpropagation path is too long, which easily leads to severe gradient vanishing or gradient explosion.

Long short-term memory networks (LSTM)

A traditional RNN simply passes everything it has extracted, without any processing, to the next time step. It is like preparing for an exam: if we try to memorize all the knowledge in all the books beforehand, then by exam time the early knowledge will most likely have been completely overwritten by the recent knowledge, and it is only natural that little information from distant time steps can be recovered. Is this how humans do it? Obviously not. Our usual practice is to judge knowledge rationally: important knowledge gets more weight and is memorized deliberately, while less important knowledge may be forgotten before long, so that we perform better when facing the exam.

In my opinion, the LSTM structure is closer to the way humans memorize knowledge. The key to understanding the LSTM lies in its two states \(c^t\) and \(a^t\) and its three internal gating mechanisms:

As the figure shows, at each time step the LSTM cell receives two inputs from the previous time step and passes two outputs to the next time step. Usually, we regard \(c^t\) as the global (cell) information and \(a^t\) as the hidden state of the cell under the influence of that global information.

The forget gate, the input gate (labeled update gate in the figure) and the output gate are each a small single-layer neural network with a sigmoid activation. Since the sigmoid takes values in \((0, 1)\), it is effective for deciding whether to keep or "forget" information (multiplying by a value close to 1 means keep, multiplying by a value close to 0 means forget), which gives us the ability to pass information on selectively. With this, the gates in the LSTM are easy to understand:

  • The forget gate has two inputs: the current input \(x^t\) and the previous hidden state \(a^{t-1}\). A gate function is trained on these two inputs, and its output lies in \((0, 1)\); multiplying this output element-wise with the global information \(c^{t-1}\) amounts to selectively forgetting part of the global information.
  • For the input gate, we train a gate function in the same way. At the same time, \(a^{t-1}\) and \(x^t\) are passed together through a small neural network with a tanh activation; this part is no different from a traditional RNN and integrates the information from the previous step with the current input. Multiplying the gate's output with this integrated information corresponds to selectively keeping the new information and adding it directly to the global information.
  • For the output gate, a gate function is again trained in the same way. Meanwhile, the new global state \(c^t\) is passed through a tanh (only an activation function this time, not a network) and multiplied by the gate's output, yielding the hidden state \(a^t\) of the cell under the influence of the global information (see the sketch after this list).
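
Putting the three gates together, a minimal LSTM cell following the description above might look like the sketch below. The gate names and dimensions are illustrative assumptions; PyTorch's built-in nn.LSTM implements the same equations more efficiently.

```python
import torch
import torch.nn as nn

class SimpleLSTMCell(nn.Module):
    """A didactic LSTM cell: forget, input and output gates acting on (c, a)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        cat_dim = input_dim + hidden_dim   # every gate sees [a_{t-1}; x_t]
        self.forget_gate = nn.Linear(cat_dim, hidden_dim)
        self.input_gate  = nn.Linear(cat_dim, hidden_dim)
        self.output_gate = nn.Linear(cat_dim, hidden_dim)
        self.candidate   = nn.Linear(cat_dim, hidden_dim)  # the "traditional RNN" part

    def forward(self, x_t, a_prev, c_prev):
        z = torch.cat([a_prev, x_t], dim=-1)
        f = torch.sigmoid(self.forget_gate(z))   # what to forget from c_{t-1}
        i = torch.sigmoid(self.input_gate(z))    # how much new information to keep
        o = torch.sigmoid(self.output_gate(z))   # how much of c_t shows up in a_t
        c_tilde = torch.tanh(self.candidate(z))  # candidate new information
        c_t = f * c_prev + i * c_tilde           # update the global (cell) state
        a_t = o * torch.tanh(c_t)                # hidden state passed to the next step
        return a_t, c_t
```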

Seen this way, isn't the LSTM already quite "smart"? In fact, the LSTM still has limitations: on the one hand, its sequential structure makes high parallelism difficult (the computation of the current state depends not only on the current input but also on the output of the previous state); on the other hand, the LSTM (like other RNN variants such as the GRU) makes the whole model resemble a Markov decision process, which makes it harder to extract global information.

I will not go into the details of the GRU here; there is plenty of material available for those who want to learn more. In short, the GRU can be seen as a simplified LSTM: it merges \(a^t\) and \(c^t\) into a single variable, combines the forget gate and the input gate into an update gate, and replaces the output gate with a reset gate; the overall idea does not change much. The two usually perform similarly, but the GRU has fewer parameters and converges faster. For smaller datasets I recommend the GRU; for large datasets you can try the LSTM, whose larger parameter count may bring unexpected gains.
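
As a quick illustration of the parameter-count difference mentioned above (the GRU has three gating blocks where the LSTM has four weight blocks), the snippet below compares PyTorch's built-in implementations; the dimensions are arbitrary example values.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=300, hidden_size=128)
gru  = nn.GRU(input_size=300, hidden_size=128)

# The LSTM keeps 4 weight blocks per layer, the GRU only 3,
# so the GRU is roughly 25% smaller at the same dimensions.
print(n_params(lstm), n_params(gru))
```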

Convolutional neural networks (CNN)

The CNN was a major breakthrough in computer vision and is currently the core model for CV tasks. CNNs can also be used for feature extraction in NLP tasks, although the usage scenario differs slightly from that of RNNs. I will write a bit more in this part, because, relatively speaking, we have done more work applying CNNs to NLP tasks and should understand them better.

The operation of a two-dimensional convolution kernel is shown below; I will not go into details.

From a data-structure point of view, the input to a CV task is a matrix of image pixels, and the correlation between pixels is roughly the same in every direction. The input to an NLP task is generally a text sequence: if the sentence length is \(n\) and the dimensionality of the word vectors is \(d\), the input becomes an \(n \times d\) matrix. Clearly, the correlations between the rows and the columns of this "pixel" matrix are not the same: each row is the vector representation of a single word, while different rows represent different words. For a convolutional network to properly "read" our text, we need to use one-dimensional convolutions in NLP. Kim first applied a CNN to the NLP task of text classification in 2014; the network structure he proposed is shown below:

As can be seen, a one-dimensional convolution differs from a two-dimensional one in that the width of each convolution kernel equals the word-vector dimensionality \(d\). This ensures that every kernel processes the complete word vectors of the words it covers, so the kernel only slides from top to bottom, and the output of this process is a feature vector. This is how a CNN extracts features. The convolution layer is typically followed by a max-pooling layer (which extracts the most salient feature) to reduce the dimensionality of the output feature vectors, and finally a fully connected layer performs the text classification.
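
Below is a minimal sketch of Kim-style text convolution in PyTorch: one-dimensional kernels whose width equals the embedding dimension \(d\), followed by max pooling over time and a linear classifier. The vocabulary size, kernel widths and number of filters are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, d=300, n_filters=100,
                 kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        # Conv1d treats the embedding dimension as "channels", so each kernel
        # spans the full word vector and slides only along the sentence.
        self.convs = nn.ModuleList(
            nn.Conv1d(d, n_filters, kernel_size=k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):                  # tokens: (batch, n)
        x = self.embed(tokens).transpose(1, 2)  # (batch, d, n)
        feats = [F.relu(conv(x)) for conv in self.convs]   # (batch, n_filters, n-k+1)
        pooled = [f.max(dim=2).values for f in feats]      # max pooling over time
        return self.fc(torch.cat(pooled, dim=1))

logits = TextCNN()(torch.randint(0, 10000, (32, 50)))  # batch of 32 sentences of length 50
```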

Although a traditional CNN can be successfully applied to NLP tasks after simple modifications, and the results are decent, the effect is only "decent": on many tasks it is still clearly outperformed. This suggests that the traditional CNN still has problems in the NLP domain.

The evolution of CNN models in the NLP community

Speaking of the evolution of CNNs in the NLP community, let us first look at the problems of the Kim version of the CNN.

  • Kim's CNN is essentially a k-gram model (k being the convolution window size, i.e. how many words each convolution covers at a time); a single layer of such a k-gram model has difficulty capturing features at distances \(d \ge k\);
  • The feature vectors output by the convolution layer still contain position information (the convolution is applied in order), but the max-pooling layer that follows it (which keeps only the maximum value of each feature) causes important features and position information to be lost.

To solve these problems, researchers adopted a series of improvements to the Kim version of the CNN.

  • One main method for extracting long-range information is simply to make the network deeper: deeper convolution kernels have larger receptive fields and can capture more distant features.
  • In addition, we can use dilated convolution (Dilated Convolution), which means the convolution window no longer covers a contiguous region but skips positions, so that a kernel of the same size can extract features over a longer distance. Of course, dilated convolution here is still not quite the same as in CV: the "holes" exist only between words; there are no holes inside a word vector. Su Jianlin's blog compares, for the same \(window = 3\) kernel, the receptive field of each neuron in a three-layer network with ordinary convolution versus dilated convolution, as shown below; it can be seen that the receptive field of the dilated-convolution neurons is greatly enlarged.

  • To prevent the loss of position information in the text, the trend in NLP is to discard the pooling layer, stack convolution layers into a deep fully convolutional network, and add positional encodings to the input, i.e. artificially add position features to the corresponding word vectors. The positional encoding can follow the scheme in "Attention is All You Need", which is described in more detail in the Transformer section below.

  • We know that in CV, networks this deep bring a series of problems, hence residual networks. Residual connections can also be used in NLP to solve the vanishing-gradient problem; in essence they accelerate the flow of information by giving it a simpler, shorter path, so that performance can still be guaranteed when the network is very deep.
  • The activation function starts to use the GLU (Gated Linear Unit), as shown below: two convolutions of exactly the same size but with unshared weights, one of which is passed through a sigmoid and the other not, and the two are multiplied together. Doesn't this feel familiar? It has the same effect as the gate mechanism in the LSTM: the activation function controls the strength of its own output features (a sketch combining this gating with dilated convolution appears after this list).

  • Another application from Su Jianlin's blog is to use a \(window = 1\) one-dimensional convolution to compress and recombine the features of the word embeddings, which is an effective way to obtain better word-vector representations.
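
As an illustration of the dilated convolution and GLU ideas from the list above, here is a sketch of one gated, dilated convolution block in PyTorch. Doubling the output channels and applying F.glu implements the "two convolutions, one through a sigmoid, then multiply" gating; the dilation value and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDilatedConvBlock(nn.Module):
    """1D convolution with dilation, GLU gating and a residual connection."""
    def __init__(self, d=256, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation     # keep the sequence length unchanged
        # 2*d output channels: half is the signal, half the gate (F.glu splits them).
        self.conv = nn.Conv1d(d, 2 * d, kernel_size, padding=pad, dilation=dilation)

    def forward(self, x):                 # x: (batch, d, seq_len)
        h = F.glu(self.conv(x), dim=1)    # h = a * sigmoid(b), like an LSTM gate
        return x + h                      # residual connection keeps gradients flowing

y = GatedDilatedConvBlock()(torch.randn(8, 256, 40))   # shape preserved: (8, 256, 40)
```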

In many places you will read that CNNs are mainly suitable for text classification tasks. In fact, judging from papers and practical reports such as "Convolutional Sequence to Sequence Learning" and "Fast Reading Comprehension with ConvNets", the CNN has developed into a mature feature extractor. Moreover, compared with the RNN, the sliding windows of a CNN do not depend on each other: there is no mutual influence between convolution kernels, so the CNN has a very high degree of parallelism, which is a great advantage.

Transformer

The Transformer was first proposed in the paper "Attention is All You Need".

For a detailed explanation of the Transformer, I recommend this article: https://zhuanlan.zhihu.com/p/54356280

Before introducing the Transformer, let us take a look at the Encoder-Decoder framework. At this stage we usually treat deep learning models as black boxes; the Encoder-Decoder framework splits this black box into two parts, one responsible for encoding and the other for decoding.

In various NLP tasks, both the Encoder and the Decoder are formed by stacking several individual feature extractors, for example the LSTM or CNN structures mentioned earlier. Starting from the initial one-hot vectors, the Encoder produces a matrix (or a vector), which can be regarded as the encoding of the input sequence. The Decoder structure is more flexible: depending on the task, we can decode this "feature" matrix or "feature" vector into whatever output the task requires. Therefore, for different tasks, if the stacked feature extractors can extract better features, then in theory we can obtain better performance on all NLP tasks.
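
To make the framework concrete, here is a minimal encoder-decoder skeleton using the LSTM extractor discussed earlier. It is only a structural sketch under assumed dimensions, not a training-ready model: the Encoder compresses the input sequence into a feature representation, and the Decoder turns that representation into task-specific outputs.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, d=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)   # stacked feature extractor
        self.decoder = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)                # task-specific output layer

    def forward(self, src_tokens, tgt_tokens):
        # Encoder: input sequence -> feature matrix (one vector per position)
        enc_feats, enc_state = self.encoder(self.src_embed(src_tokens))
        # Decoder: starts from the encoder's final state and produces the output sequence
        dec_feats, _ = self.decoder(self.tgt_embed(tgt_tokens), enc_state)
        return self.out(dec_feats)                        # (batch, tgt_len, tgt_vocab)

logits = Seq2Seq()(torch.randint(0, 8000, (4, 12)), torch.randint(0, 8000, (4, 10)))
```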

In 2018, Google released BERT, which broke records and set the whole NLP community ablaze. A key factor in its success is the powerful role of a new feature extraction structure: the Transformer.

The Transformer is the model structure proposed in the paper "Attention is All You Need", as shown in the figure: the red box is the Encoder and the yellow box is the Decoder, each formed by stacking several Transformer Blocks. The Transformer Block replaces the LSTM and CNN structures mentioned earlier as our feature extractor, and it is also the most critical part. A more detailed schematic is shown in the figure. We can see that the encoder Transformer and the decoder Transformer are slightly different, but the feature extraction structures we usually use (including BERT) are mainly based on the Encoder Transformer, so here we mainly try to understand how the Encoder Transformer works.

As the figure shows, a single Transformer Block mainly consists of two parts: the multi-head attention mechanism (Multi-Head Attention) and a feed-forward neural network (Feed Forward).

Multi-head attention mechanism (Multi-Head Attention)

The structure of the Multi-Head Attention block is shown below:

Here we can see why this part is called Multi-Head: it is itself a stack of \(h\) Scaled Dot-Product Attention sub-modules, also known as Self-Attention modules. There are a few key points to understand about the Multi-Head Attention as a whole:

  • Linear can be seen as a fully connected layer without an activation function; each one maintains a linear mapping matrix (the essence of a neural network is matrix multiplication).
  • Each Self-Attention module maintains its own three linear mapping matrices, with weights \(W^V_i\), \(W^K_i\) and \(W^Q_i\) (not shared across Self-Attention modules). Multiplying the input matrix \(X\) by these three mapping matrices yields the three inputs of Self-Attention: Queries, Keys and Values. The \(X\) behind Q, K and V is exactly the same (either the input sentence's Input Embedding + Positional Encoding, or the output of the previous Transformer layer), which can also be seen from the overall Encoder structure.
  • In the paper, the authors simply concatenate the outputs of the 8 Self-Attention modules and multiply the result by a mapping matrix \(W^O\) (to compress the output matrix), which gives the output of the entire Multi-Head Attention.

Within Multi-Head Attention, the most critical part is Self-Attention, which is the core formula of the entire model. Let us unfold it, as shown below.

As mentioned earlier, the inputs of Self-Attention are just three linear mappings of the input matrix \(X\). So what is the meaning of the operations inside Self-Attention? Let us walk slowly through the process of encoding a single word:

  • First, for the input word vector \(X\) we generate three corresponding vectors: Query, Key and Value. Note that these three vectors are much smaller than \(X\) (in the paper, \(X\) has length 512 and the three vectors have length 64; this is only an architectural choice, since the Multi-Head Attention in the paper has 8 Self-Attention modules and the 8 outputs are concatenated to restore a vector of length 512). This part of the operation is performed independently for each word.
  • Compute the dot product of the current word's Query with the Keys of all words (in the figure, the current word is "Thinking") to get a Score for each word; these scores decide how much the other words contribute when encoding the word "Thinking".
  • Divide the scores by the square root of the vector dimension (64) (this keeps the scores in a smaller range so that the softmax result does not saturate at 0 or 1), and then apply the Softmax (the scores of all words are normalized so that they are all positive and sum to 1). This gives the contribution of every word to the encoding of the current word; naturally the current word gets the highest score, but the contributions of other words are also taken into account.
  • Multiply each word's Value vector by its Softmax score (weakening the influence of words with low softmax scores, somewhat like the sigmoid gate functions discussed earlier).
  • Sum all the weighted vectors to obtain the Self-Attention output for the current word "Thinking".

In fact, if you think about it carefully, Self-Attention and CNN are quite similar. A CNN extracts features through a simple convolution operation; although increasing the depth or using Dilated Convolution enlarges the receptive field, it is essentially still an n-gram model. In Self-Attention, \(W^Q\), \(W^K\) and \(W^V\) cannot quite be regarded as three convolution kernels, but the results of these mappings are combined in a cleverer way to achieve the intuitive notion of "attention", so that the encoding of each word is the result of the joint action of all the words in the sentence; in essence it is a large bag-of-words model (covering all the words in the sentence). Clearly, the above process can be computed in parallel using the following matrix form:
\[Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Here Q, K and V are the Queries, Keys and Values matrices of the input sentence; each row of these matrices is the Query, Key or Value vector of the corresponding word, and \(d_k\) is the vector length. Thanks to this, the Transformer also has very efficient parallel computing capability.
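
A minimal sketch of this matrix form of scaled dot-product attention in PyTorch is given below; the tensor shapes are illustrative assumptions (a batch of sentences whose words have already been projected to Query/Key/Value).

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for a whole batch at once."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, n, n) word-to-word scores
    weights = torch.softmax(scores, dim=-1)             # each row is positive and sums to 1
    return weights @ V                                   # weighted sum of Value vectors

Q = torch.randn(2, 10, 64)   # 2 sentences, 10 words, d_k = 64
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)   # (2, 10, 64)
```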

Going back to Multi-Head Attention: we keep the eight independent Self-Attention outputs, concatenate them, and pass the result through one more mapping layer to obtain the output of the Multi-Head Attention. The whole process can be summarized by the following diagram:
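
The sketch below wires these pieces together into a multi-head attention module following the description above: per-head projection matrices \(W^Q_i\), \(W^K_i\), \(W^V_i\), the scaled dot-product attention from the previous sketch, concatenation of the heads and the final mapping \(W^O\). The dimensions and the explicit loop over heads are written for clarity rather than efficiency; real implementations batch all heads into one matrix multiplication.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.d_k = d_model // h                         # 512 / 8 = 64 per head
        # One (W^Q_i, W^K_i, W^V_i) triple per head; weights are not shared across heads.
        self.heads = nn.ModuleList(
            nn.ModuleDict({"q": nn.Linear(d_model, self.d_k),
                           "k": nn.Linear(d_model, self.d_k),
                           "v": nn.Linear(d_model, self.d_k)})
            for _ in range(h))
        self.W_O = nn.Linear(d_model, d_model)          # compresses the concatenated heads

    def forward(self, X):                               # X: (batch, n, d_model)
        outs = []
        for head in self.heads:
            Q, K, V = head["q"](X), head["k"](X), head["v"](X)
            scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
            outs.append(torch.softmax(scores, dim=-1) @ V)
        return self.W_O(torch.cat(outs, dim=-1))        # concat 8 x 64 -> 512, then W^O

Z = MultiHeadAttention()(torch.randn(2, 10, 512))       # (2, 10, 512)
```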

Positional encoding (Positional Encoding)

As mentioned earlier, the RNN has a natural positional encoding because of its sequential structure. A CNN can actually capture some positional information by itself, but after many layers are stacked this information weakens, so positional encoding can be regarded as an aid there. The Transformer's encoding process has essentially no positional information at all (shuffling the word order does not change the result of Self-Attention), so here a position vector is necessary, allowing the model to better express the distance between words. The positional encoding is constructed by the following formulas:

\[\begin{cases} PE_{2i}(p)=sin(p/10000^{2i/d_{pos}}) \\ PE_{2i+1}(p)=cos(p/10000^{2i/d_{pos}}) \end{cases}\]

If the word embedding has length \(d_{pos}\), then we need to construct a positional encoding vector \(PE\) of the same length \(d_{pos}\). Here \(p\) denotes the position of the word and \(PE_i(p)\) denotes the value of the i-th element of the position vector for the p-th word; the position vector is then added directly to the word vector. This encoding contains more than absolute position information: since \(\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta\) and \(\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta\), the position vector at \(p + k\) can be expressed as a linear transformation of the position vector at \(p\), so relative position information is expressed as well. The Transformer paper mentions that the effect of this positional encoding is very close to that of a learned positional encoding.
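
A direct implementation of the formulas above might look like the sketch below (the sequence length and \(d_{pos}\) are arbitrary example values); the resulting matrix is simply added to the word-embedding matrix.

```python
import torch

def positional_encoding(n_positions, d_pos):
    """PE[p, 2i] = sin(p / 10000^(2i/d_pos)), PE[p, 2i+1] = cos(p / 10000^(2i/d_pos))."""
    PE = torch.zeros(n_positions, d_pos)
    positions = torch.arange(n_positions, dtype=torch.float).unsqueeze(1)   # p = 0 .. n-1
    div = 10000 ** (torch.arange(0, d_pos, 2, dtype=torch.float) / d_pos)   # 10000^(2i/d_pos)
    PE[:, 0::2] = torch.sin(positions / div)
    PE[:, 1::2] = torch.cos(positions / div)
    return PE

PE = positional_encoding(n_positions=50, d_pos=512)   # added to the (50, 512) word embeddings
```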

Residual module (Residual Block)

We spoke earlier about the similarity between Self-Attention and the CNN. The residual operation here also borrows the idea from CNNs, and its basic principle is the same, so I will not repeat it. After the residual connection, a layer normalization operation is also needed; the specific procedure is the same as standard Layer Normalization.
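
Putting the pieces of this section together, the sketch below shows how one encoder block can wrap its two sub-layers (the multi-head attention and the feed-forward network) with a residual connection followed by layer normalization. It uses PyTorch's built-in nn.MultiheadAttention for brevity; the post-norm ordering and the feed-forward width follow the original paper, while the remaining hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, n, d_model)
        a, _ = self.attn(x, x, x)                      # Q, K, V all come from the same x
        x = self.norm1(x + a)                          # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))                # same pattern around the FFN
        return x

out = TransformerEncoderBlock()(torch.randn(2, 10, 512))   # (2, 10, 512)
```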

Transformer Summary

With this, the basic structure of the Transformer has been covered. As for how its performance compares with RNN and CNN, Zhang Junlin's article gives detailed experimental data; I only attach a brief summary:

  1. Semantic feature extraction ability: the Transformer significantly exceeds RNN and CNN, while RNN and CNN are not far apart.
  2. Ability to capture long-distance features: CNN is very significantly weaker than RNN and Transformer. The Transformer is slightly better than the RNN, but at relatively long distances (subject-predicate distance greater than 13) the RNN is slightly better than the Transformer, so overall the Transformer and the RNN can be considered comparable in this respect, while CNN is clearly weaker than both. As mentioned earlier, the CNN's ability to extract long-distance features is limited by the receptive field of its convolution kernels; experiments show that enlarging the kernel size and deepening the network can improve the CNN's long-distance ability. For the Transformer, this ability is mainly affected by the number of heads in Multi-Head Attention: the more heads, the stronger the Transformer's ability to capture long-distance features.
  3. Overall task feature extraction ability: machine translation is usually one of the NLP tasks with the most complex processing requirements; obtaining high-quality translations demands very strong lexical, syntactic, semantic and contextual processing for both languages, as well as long-distance feature capture. Measured by this overall feature extraction ability, the Transformer is significantly stronger than RNN and CNN, while RNN and CNN perform comparably.
  4. Parallel computing ability: as mentioned in many places above, parallel computation is a serious weakness of the RNN, while the Transformer and the CNN are roughly on par.

Since my usual research workload is fairly heavy, I have not had time to upload the code yet. When I publish articles on sequence labeling, text classification and so on, the code will be synced to GitHub, and the RNN / CNN / Transformer code will be included.

Reference material

https://zhuanlan.zhihu.com/p/54743941
http://www.ai-start.com/dl2017/html/lesson5-week1.html#header-n194
https://zhuanlan.zhihu.com/p/46327831
https://zhuanlan.zhihu.com/p/55386469
https://kexue.fm/archives/5409
https://zhuanlan.zhihu.com/p/54356280
http://jalammar.github.io/illustrated-transformer/
https://kexue.fm/archives/4765
