Convolutional Sequence to Sequence Learning: Paper Notes

Brief introduction

I am writing this post mainly to learn how a CNN can be used as the encoder structure of a Seq2Seq model, and this paper is a must-read on the topic. It shows that a fully convolutional feature extractor can match or even beat RNN-based Seq2Seq results, while the high parallelism of CNNs greatly reduces training time. (Some parts of the original paper are not laid out very clearly, so these notes mix the paper with other references.)

Original link: Convolutional Sequence to Sequence Learning

The model structure is shown below:

Each part of the model is described below, block by block:

Position Embeddings

Like the Transformer, a convolutional network is not a sequential model the way an RNN is, so positional encodings must be added to capture the positional relationships between words:

  • Input word vectors: \(w = (w_1, w_2, ..., w_n)\)
  • Position encodings: \(p = (p_1, p_2, ..., p_n)\)
  • Final word representation: \(e = (w_1 + p_1, w_2 + p_2, ..., w_n + p_n)\)
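A minimal sketch of this sum in PyTorch, assuming learned absolute position embeddings as in the paper (the sizes and tensor names are placeholders of mine):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d = 10000, 1024, 512        # illustrative sizes, not the paper's
word_emb = nn.Embedding(vocab_size, d)           # w_i
pos_emb = nn.Embedding(max_len, d)               # p_i, learned absolute positions

tokens = torch.randint(0, vocab_size, (1, 7))            # a dummy sentence of 7 token ids
positions = torch.arange(tokens.size(1)).unsqueeze(0)    # positions 0..6
e = word_emb(tokens) + pos_emb(positions)                # e_i = w_i + p_i, shape (1, 7, d)
```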

GLU and GTU

GLU and GTU were introduced in the same paper; GLU is the main building block of the CNN Seq2Seq model. It can be viewed as an activation function: the same input is fed into two convolutional networks with identical structure, one producing an output \(A\) and the other an output \(B\).
The difference between GLU and GTU is the activation applied to \(A\):
\[GLU: H_0 = A \otimes \sigma(B)\]

\[GTU: H_0 = \tanh(A) \otimes \sigma(B)\]

The CNN Seq2Seq model uses GLU as its activation function.
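A minimal sketch of the two gates, assuming \(A\) and \(B\) are the outputs of the two parallel convolutions (a toy illustration, not the authors' code):

```python
import torch

def glu(a, b):
    # GLU: A gated by sigmoid(B)
    return a * torch.sigmoid(b)

def gtu(a, b):
    # GTU: same gate, but A is squashed by tanh first
    return torch.tanh(a) * torch.sigmoid(b)

a, b = torch.randn(4, 512), torch.randn(4, 512)
h_glu, h_gtu = glu(a, b), gtu(a, b)
```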

Original link: Language Modeling with Gated Convolutional Networks

Convolutional Block Structure

Both the encoder and decoder are built from multiple convolutional layers (the paper calls them blocks, but they are effectively layers). Each layer consists of a one-dimensional convolution kernel and a gated linear unit (GLU). Suppose the input sequence length is \(m\), the kernel size is \(k\), and the padding is \(p\); then the output sequence length is \((m + 2p - k)/stride + 1\). As long as the kernel size, padding, and stride are set appropriately, the output sequence length can be kept equal to the input length. In the paper's example, the input length is 25 and the kernel size is 5, so with padding 2 the output length is \((25 + 2 \times 2 - 5)/1 + 1 = 25\).
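A quick check of the length formula with a 1-D convolution (the channel size of 512 is just a placeholder):

```python
import torch
import torch.nn as nn

m, k, p, stride = 25, 5, 2, 1
conv = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=k, padding=p, stride=stride)
x = torch.randn(1, 512, m)                    # (batch, channels, sequence length)
print(conv(x).shape[-1])                      # 25
print((m + 2 * p - k) // stride + 1)          # 25, matches the formula
```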

In addition, to connect each output position to the entire input sentence, multiple convolutional layers must be stacked, so that the receptive field of the topmost kernels becomes large enough to cover the whole sentence while the lower layers still capture local features.

The forward pass of the encoder works as follows:

  • The input to each convolution kernel is a window of the sentence \(X \in R^{k \times d}\): each kernel reads a span of \(k\) words, i.e. the kernel width is \(k\), and \(d\) is the word-vector dimension.
  • The convolution weight matrix has size \(W \in R^{2d \times k \times d}\) and the bias vector is \(b_w \in R^{2d}\), meaning each layer has \(2d\) convolution kernels, so the output at each position is \(2d\)-dimensional. Because the earlier design keeps the input and output sequence lengths equal, the convolution yields a matrix \(Y \in R^{k \times 2d}\).
  • The \(2d\) kernels above are split into two halves of identical size and number, so their outputs have exactly the same dimensions and can serve as the two inputs of the GLU; after the GLU combines them, the output sequence becomes \(\hat{Y} \in R^{k \times d}\).
  • To make the network deep, a residual connection is used between the input and output of each layer.
  • For the decoder we also need hidden representations of the target sequence, but decoding is autoregressive: we must not observe positions after the current prediction target. The authors therefore pad the decoder input sequence (and trim the end of the convolution output) so that each convolution only sees the current and earlier positions.

This design keeps the input and output sequences of every convolutional layer the same length, so each layer can be seen as a simple encoding unit, and stacking multiple layers gives a deeper encoder (a minimal code sketch follows).
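Below is a minimal sketch of one such layer in PyTorch, my own illustration of the idea (convolution to \(2d\) channels, GLU, scaled residual), not the authors' implementation; sizes are placeholders:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLULayer(nn.Module):
    """One convolutional block: 1-D convolution to 2*d channels, a GLU gate,
    and a residual connection (the sqrt(0.5) scaling is explained in the
    normalization section below)."""
    def __init__(self, d=512, k=5):
        super().__init__()
        # padding (k - 1) // 2 keeps the output length equal to the input length
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=(k - 1) // 2)

    def forward(self, x):              # x: (batch, d, seq_len)
        y = self.conv(x)               # (batch, 2*d, seq_len)
        y = F.glu(y, dim=1)            # split into A, B and compute A * sigmoid(B)
        return (y + x) * math.sqrt(0.5)

layers = nn.Sequential(*[ConvGLULayer() for _ in range(4)])   # stacking enlarges the receptive field
out = layers(torch.randn(1, 512, 25))                          # same shape as the input
```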

Multi-step Attention

The key to computing attention is determining the Query, Key, and Value. The schematic below illustrates the attention computation and the decoding process.

The attention is computed as follows:

  • The Query is determined jointly by the output of the current decoder convolution layer \(h_i^l\) and the embedding of the previous target element \(g_i\); \(W_d^l\) and \(b_d^l\) are the parameters of a linear mapping.
    \[d_i^l = W_d^l h_i^l + b_d^l + g_i\]

  • The Key is the encoder output \(z_j^u\); this is a classic dot-product matching scheme, where the dot product between Query and Key gives the attention score:
    \[a_{ij}^{l} = \frac{exp(d_i^l \cdot z_j^u)}{\sum_{t=1}^{m} exp(d_i^l \cdot z_t^u)}\]

  • The Value is the sum of the encoder output \(z_j^u\) and the input word embedding \(e_j\); adding \(e_j\) injects the input (and positional) information into the encoder output. The weighted value \(c_i^l\) is then added directly to the current decoder output \(h_i^l\) before being passed on (at the top layer, into the classifier).
    \[c_i^l = \sum_{j=1}^{m} a_{ij}^l (z_j^u + e_j)\]
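A sketch of this computation for a single decoder layer; the shapes, names, and usage values are assumptions of mine, not the authors' code:

```python
import torch

def multi_step_attention(h, g, z, e, W_d, b_d):
    """h: (tgt_len, d) decoder layer states     g: (tgt_len, d) previous-target embeddings
       z: (src_len, d) encoder outputs z_j^u    e: (src_len, d) encoder input embeddings e_j"""
    d_q = h @ W_d.T + b_d + g                 # d_i^l = W_d^l h_i^l + b_d^l + g_i
    scores = d_q @ z.T                        # dot products d_i^l . z_j^u
    a = torch.softmax(scores, dim=-1)         # a_ij^l, normalized over source positions
    c = a @ (z + e)                           # c_i^l = sum_j a_ij^l (z_j^u + e_j)
    return c

d_model = 512
h, g = torch.randn(6, d_model), torch.randn(6, d_model)      # 6 target positions
z, e = torch.randn(9, d_model), torch.randn(9, d_model)      # 9 source positions
c = multi_step_attention(h, g, z, e, torch.randn(d_model, d_model), torch.zeros(d_model))
h = h + c   # the context is added back to the decoder layer output
```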

Normalization Strategy

The model also relies on a normalization strategy to keep the variance of each layer from changing too much. I only record the idea briefly here; the concrete details require going back to the code. The main normalization strategies are:

  • The sum of the input and output of a residual block is multiplied by \(\sqrt{0.5}\) to halve its variance (this assumes the two have equal variance, which is not always true, but it works in practice).
  • Since the output of the attention module is a weighted sum of \(m\) vectors, it is multiplied by \(m\sqrt{1/m}\) to counteract the change in variance; the factor \(m\) scales the input back up to its original size (this assumption generally does not hold exactly, but it gives good results).
  • Because the decoder uses multiple attention mechanisms, the gradients propagated back to the encoder are scaled down by the number of attention mechanisms, so that the encoder does not receive an overly large gradient and training is smoother (a code sketch follows this list).
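A sketch of the last two scalings as I read them; the gradient-scaling trick below is written from scratch for illustration, not taken from the authors' code:

```python
import math
import torch

def scale_attention_output(c, m):
    # weighted sum of m vectors: multiply by m * sqrt(1/m) to restore scale and variance
    return c * (m * math.sqrt(1.0 / m))

class GradMultiply(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None

encoder_out = torch.randn(1, 512, 25, requires_grad=True)
num_attn = 3                                              # one attention per decoder layer
scaled = GradMultiply.apply(encoder_out, 1.0 / num_attn)  # encoder gradients divided by num_attn
```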

Initialization

The purpose of initialization is the same as that of normalization: to keep the variance of the data at a relatively stable level during forward and backward propagation. The initialization strategy is as follows:

  • Embedding layers are initialized from a normal distribution with mean 0 and standard deviation 0.1.
  • For layers whose output is not fed directly into a gated linear unit, the weights are initialized from \(N(0, \sqrt{1/n_l})\), where \(n_l\) is the number of input connections to each neuron. This preserves the variance of a normally distributed input.
  • For layers whose output is fed into a GLU, a different strategy is adopted. If the input of the GLU has mean 0 and sufficiently small variance, the output variance is approximately 1/4 of the input variance. The weights are therefore initialized so that the input of the GLU has four times the variance of the layer's input, i.e. the layer is initialized from \(N(0, \sqrt{4/n_l})\).
  • The biases \(b\) of all layers are set to 0.
  • Finally, dropout also affects the variance of the data: if the retention probability is \(p\), the variance is scaled up by \(1/p\), so the above initializations are corrected to \(N(0, \sqrt{p/n_l})\) and \(N(0, \sqrt{4p/n_l})\) respectively (a code sketch follows this list).
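A sketch of this scheme assuming PyTorch modules; which concrete layer types get which rule is my own mapping for illustration:

```python
import math
import torch.nn as nn

def init_weights(module, retain_p=0.9):
    """retain_p is the dropout *retention* probability p used in the correction above."""
    if isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)
    elif isinstance(module, nn.Linear):                       # output not fed into a GLU
        n_l = module.in_features                              # input connections per neuron
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(retain_p / n_l))
        nn.init.constant_(module.bias, 0.0)
    elif isinstance(module, nn.Conv1d):                       # output fed into a GLU
        n_l = module.in_channels * module.kernel_size[0]
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(4.0 * retain_p / n_l))
        nn.init.constant_(module.bias, 0.0)

# model.apply(init_weights) would visit every submodule of a model built from these layers
```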

The experiments in the last part of the paper are not covered here; interested readers can check the original.

References

https://zhuanlan.zhihu.com/p/26918935
https://zhuanlan.zhihu.com/p/27464080
https://www.zhihu.com/question/59645329/answer/170014414
