Technical Deep Dive | A Long-Form Interpretation of Machine Translation

Editor's Note

In "How to Make Machine Translation (on)," the article, we review the history of the development of machine translation. In this article, we will share practical theory algorithms and techniques of machine translation systems, machine translation nerve to explain exactly how to Make. After reading this article, you will learn:

  • How neural machine translation models evolved into the Transformer model so eagerly anticipated by NLP researchers;
  • How we build an industrial-grade neural machine translation system based on the Transformer model.

Around 2013-2014 the previously tepid field of natural language processing (NLP) underwent enormous changes: Mikolov and colleagues at Google proposed word2vec, a technique for large-scale word embeddings, and deep networks such as RNNs and CNNs began to be applied to NLP tasks. NLP researchers around the world rejoiced, eager and ready to bid farewell to a long, painful plateau and open a new era for NLP.

In those same two years, machine translation had its own "Big Bang." In 2013 Nal Kalchbrenner and Phil Blunsom at Oxford University proposed end-to-end neural machine translation (the Encoder-Decoder model); in 2014 Ilya Sutskever and colleagues at Google introduced LSTMs into the Encoder-Decoder model. These two events marked the moment when machine translation based on neural networks began to surpass statistical machine translation (SMT) and quickly became the mainstream standard for online translation systems. After Google deployed its neural machine translation system (GNMT) in 2016, a saying circulated widely on the Internet: "As a translator, when I saw this news I understood the worry and fear of 18th-century textile workers at the sight of the steam engine."

In 2015, attention mechanisms and memory-based neural networks eased the information bottleneck of the Encoder-Decoder representation, becoming the key to neural networks outperforming classical phrase-based machine translation. In 2017 Ashish Vaswani and colleagues at Google, drawing on the attention mechanism, proposed the Transformer model based on self-attention; the Transformer family has held the best results on NLP tasks ever since. To summarize, NMT over the last decade has developed mainly through three stages: the plain encoder-decoder model (Encoder-Decoder), the attention mechanism model, and the Transformer model.

Below we will analyze these three stages of NMT step by step. The small number of mathematical formulas and concept definitions in this article may feel rather "mechanical"; if the reading gets heavy going, feel free to jump straight to Part 4 to learn the key points of building your own industrial-grade NMT system.

01 A New Dawn: the Encoder-Decoder Model

As mentioned above, the end-to-end machine translation model was proposed in 2013. A natural-language sentence can be regarded as time-series data, and recurrent neural networks such as LSTM and GRU are comparatively well suited to processing sequential time-series data. If the source language and the target language are each treated as a separate time series, machine translation becomes a sequence-generation task. How do we implement sequence generation? Generally with the recurrent-neural-network-based encoder-decoder framework (also called Sequence to Sequence, or Seq2Seq). A Seq2Seq model consists of two sub-models, an encoder and a decoder, each of which is an independent recurrent neural network. Given a source-language sentence, the model first uses the encoder to map it into a continuous, dense vector, and then uses the decoder to convert that vector into a target-language sentence.

Encoder: the encoder takes the input source-language sentence and, through a non-linear transformation, converts it into an intermediate semantic representation C:
C = F(x₁, x₂, …, xₘ)

Decoder: at time step i, the decoder generates the next word yᵢ of the target-language sentence from the intermediate semantic representation C output by the encoder and the previously generated history y₁, y₂, …, yᵢ₋₁:
yᵢ = G(C, y₁, y₂, …, yᵢ₋₁)

Each yᵢ is generated in turn, so the Seq2Seq model produces the target-language translation from the input source-language sentence. Although the source sentence and the target sentence are in different languages with different word orders, they share the same semantics: after the Encoder converts the source sentence into the dense embedding-space vector C, the Decoder can use the semantic information carried in that vector to regenerate a target-language sentence with the same meaning. In short, the Seq2Seq neural translation model simulates the two main processes of a human translator:

  • The encoder (Encoder) interprets the source text in context;
  • The decoder (Decoder) re-expresses that interpretation in the target language.
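
To make these two processes concrete, here is a minimal sketch of a GRU-based Seq2Seq model in PyTorch. The vocabulary sizes, dimensions and class names are illustrative assumptions, not the configuration of any particular production system.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: encode the source sentence into one vector,
    then let the decoder regenerate the sentence in the target language."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: the final hidden state plays the role of the semantic vector C.
        _, c = self.encoder(self.src_emb(src_ids))
        # Decoder: conditioned on C, generate the target sentence step by step
        # (teacher forcing: the gold prefix y1..y(i-1) is fed in as input).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), c)
        return self.out(dec_states)        # logits over the target vocabulary

# Toy usage: a batch of 2 sentences, source length 7, target length 5.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1200, (2, 5)))
print(logits.shape)                        # torch.Size([2, 5, 1200])
```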

02 A Breakthrough Leap: the Attention Mechanism Model
2.1 Limitations of the Seq2Seq model

An important assumption of the Seq2Seq model is that the encoder can compress all the semantics of the input sentence into one fixed-dimension semantic vector, and that the decoder can use this vector to regenerate a sentence with the same meaning in another language. Because the fixed-dimension intermediate semantic vector output by the encoder loses a lot of detail, encoder-decoder performance drops sharply as the input sentence grows longer; recurrent neural networks therefore struggle with long input sentences, and the plain Seq2Seq model has an information-representation bottleneck.

The plain Seq2Seq model also processes the source sentence and the target sentence separately and cannot directly model the relationship between them. So how do we overcome this limitation? In 2015, Bahdanau et al. published the first work applying the attention mechanism to jointly translating and aligning words, solving the Seq2Seq bottleneck. The attention mechanism can compute the relationship between each target word and each source word, so the relationship between the source sentence and the target sentence is modeled directly. So what exactly is this attention mechanism, the secret weapon that let NMT make its name by winning machine translation competitions?

2.2 General principles of the attention mechanism
[Figure: general principle of the attention mechanism]
A popular way to explain it: in a database, a primary key (Key) uniquely identifies a record (Value); to access a record, we match a query (Query) against the primary keys and retrieve the matching Value. The attention mechanism follows a similar idea, a concept known as soft addressing: assume the data is stored as <Key, Value> pairs; given a Query, compute the degree of match between the Query and every Key, use those degrees as weights, and take the weighted sum of the Values as the result of the query; that result is the attention value. The general principle (see the figure above) is therefore: first, imagine the elements of the source sentence as a series of <Key, Value> pairs, while the target sentence is made up of Query elements; then, for a given Query element of the target sentence, compute the similarity or correlation between the Query and every Key to obtain a weight coefficient for each Key's corresponding Value; finally, take the weighted sum of the Values to obtain the final attention value. In essence, the attention mechanism is a weighted sum of the Values of the source-sentence elements, with Query and Key used to compute the weight coefficient of each Value. The general formula is:
Attention(Query, Source) = Σᵢ Similarity(Query, Keyᵢ) · Valueᵢ
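
As a rough illustration of this soft-addressing view, here is a small NumPy sketch; the dot product is used as the similarity function and the shapes are arbitrary illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Soft addressing: score the query against every key, normalize the
    scores into weights, and return the weighted sum of the values."""
    scores = keys @ query                 # similarity of the query to each key
    weights = softmax(scores)             # weight coefficients that sum to 1
    return weights @ values, weights      # attention value = weighted sum

# Toy example: 4 source positions with 8-dimensional keys and values.
keys, values = np.random.randn(4, 8), np.random.randn(4, 8)
query = np.random.randn(8)
ctx, w = attention(query, keys, values)
print(ctx.shape, w.round(2), w.sum())     # (8,), 4 weights, sum = 1.0
```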

In machine translation, the Seq2Seq model is generally built by stacking multiple layers of RNNs such as LSTM/GRU. In September 2016 Google released its neural machine translation system GNMT, which uses the Seq2Seq + attention framework: both the encoder network and the decoder network have 8 LSTM hidden layers, the encoder output is fed, as an attention-weighted average, into every hidden layer of the decoder LSTM, and finally a softmax layer connected to the decoder output yields a probability for every word in the target-language vocabulary.

How does attention boost GNMT's performance? Let (X, Y) be any source-target sentence pair in the parallel corpus. Then:

The source sentence is a string of length M:
X = (x₁, x₂, …, x_M)
The target sentence is a string of length N:
Y = (y₁, y₂, …, y_N)
The encoder output is a sequence of d-dimensional encoding vectors h:
h = (h₁, h₂, …, h_M)
By the chain rule of probability, the conditional probability of the target sentence decomposes as:
P(Y | X) = ∏ᵢ P(yᵢ | y₁, …, yᵢ₋₁, X)

At decoding time step i, the decoder uses the encoder output together with its own previous i-1 outputs; the target words are obtained by maximizing P(Y | X).
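
As a simplified illustration of that decoding loop, the sketch below performs greedy decoding, picking the single most probable next word at every step. GNMT itself uses beam search rather than pure greedy decoding, and the `step` function and token ids here are placeholders, not a real API.

```python
import numpy as np

def greedy_decode(step, bos_id=1, eos_id=2, max_len=20):
    """At each position pick the word that maximizes the conditional
    probability, feeding it back as history for the next step."""
    ys = [bos_id]
    for _ in range(max_len):
        probs = step(ys)                # P(y_i | y_1..y_{i-1}, X) over the vocab
        y = int(np.argmax(probs))       # the most probable next word
        ys.append(y)
        if y == eos_id:                 # stop once the end-of-sentence token wins
            break
    return ys

# Toy `step`: a random distribution, just to show the calling pattern.
vocab = 100
fake_step = lambda ys: np.random.dirichlet(np.ones(vocab))
print(greedy_decode(fake_step))
```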

The actual computation steps of GNMT's attention mechanism:
[Figure: GNMT attention computation steps]
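
The original figure is not reproduced here, but as a rough sketch of the kind of attention GNMT builds on, here is Bahdanau-style additive attention in NumPy. The weight names W_enc, W_dec, v are illustrative assumptions; GNMT's exact attention function differs in its details.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    """Bahdanau-style scoring: a small feed-forward net compares the current
    decoder state with every encoder state, softmax turns the scores into
    weights, and the context is the weighted average of the encoder states."""
    scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v   # one score per source position
    weights = softmax(scores)                                      # attention weights
    context = weights @ enc_states                                 # weighted average
    return context, weights

d = 8                                    # toy hidden size
enc_states = np.random.randn(5, d)       # M = 5 source positions
dec_state = np.random.randn(d)           # decoder state at step i-1
W_enc, W_dec, v = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d)
ctx, w = additive_attention(dec_state, enc_states, W_dec, W_enc, v)
print(ctx.shape, round(float(w.sum()), 3))   # (8,) 1.0
```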

If you have read this far you may be growing weary, ready even to cast this article aside as unintelligible. Please be patient, because the exciting part starts now: the protagonist of this article, the Transformer, takes the stage!

03 The Highlight Moment: the Transformer Model Based on Self-Attention

In Part 2 we mentioned that the Seq2Seq + attention architecture achieved better results than the plain Seq2Seq model, so what disadvantages remain in this combination? In fact, recurrent neural networks have long troubled researchers with one persistent problem: they cannot be parallelized effectively. Good news arrived soon, though. In June 2017 the Transformer burst onto the scene: in the paper "Attention Is All You Need", Google drew on the attention mechanism to propose self-attention and a new neural network architecture, the Transformer. The model has the following advantages:

  • Traditional RNN-based Seq2Seq models are mainly limited by how fast RNNs can be trained on GPUs; the Transformer is an attention-based model that computes in parallel and does away with RNNs and CNNs entirely;
  • The Transformer fixes the much-criticized slow training of RNNs: the self-attention mechanism enables fast parallel computation, and the Transformer can be stacked very deep, fully exploiting the capacity of DNN models to improve accuracy.

Let's take a close look at the Transformer model architecture.

3.1. Transformer model architecture

The Transformer is essentially still a Seq2Seq model, composed of an encoder, a decoder and the connection layers between them, as shown in the figure below. The encoder described in the original paper: the encoder is a stack of N = 6 identical encoder layers, each with two sub-layers. The first sub-layer is a Multi-Head Attention mechanism; the second is a simple, position-wise fully connected Feed-Forward Network. A residual connection is applied around each sub-layer, followed by layer normalization (Layer Normalization), so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.

"The Transformer" Decoder: Decoder The decoder also made of identical N = 6 th layer decoding Decoder Layer stacking. In addition to two sub-layers with the same layer in each of the encoders, the decoder further inserted into the third sub-layer (Encoder-Decoder Attention layer), the layer stack encoder output execution Multi-HeadAttention. Similar to the encoder, and then we use the residuals connected between each sublayer, and then normalized layer.

The Transformer computes attention in three ways:

  1. Encoder self-attention: every encoder layer has a Multi-Head Attention sub-layer;
  2. Decoder self-attention: every decoder layer has a Masked Multi-Head Attention sub-layer;
  3. Encoder-decoder attention: every decoder layer has an Encoder-Decoder Attention sub-layer, whose process is similar to the earlier Seq2Seq + attention models.
[Figure: Transformer model architecture]

3.2. The self-attention mechanism

The core idea of the Transformer is the self-attention mechanism, which attends to different positions of the input sequence in order to compute a representation of that sequence. As the name suggests, self-attention is not attention between the source sentence and the target sentence but attention among the elements within a single sentence. In the usual Seq2Seq attention computation, the decoder output serves as the query vector q and the encoder output sequence serves as the key vectors k and value vectors v, so attention holds between elements of the target sentence and elements of the source sentence.

In the self-attention computation, the vector at each position of the Encoder or Decoder input sequence is turned, through three linear transformations, into three vectors: a query vector q, a key vector k and a value vector v. The q of each position is matched against the k of the other positions in the sequence; the matching scores are passed through a softmax layer to obtain weights between 0 and 1, a weighted average of the value vectors v at every position is taken with those weights, and the output vector z for that position is obtained. The computation of self-attention is introduced below.

▶ Scaled dot-product attention
[Figure: scaled dot-product attention]

Scaled dot-product attention is how the self-attention vectors are computed; self-attention is calculated in four steps:

  1. Generate three vectors from each input vector of the encoder (the word vector of each word): a query vector q, a key vector k and a value vector v. In matrix form, they are created by multiplying the input matrix X by three weight matrices W^Q, W^K, W^V.
  2. Compute the scores. Take the input sentence "Thinking Machines" as an example: to compute the self-attention vector of the first word "Thinking", every word in the sentence must be scored against "Thinking". The score determines how much attention is paid to the other parts of the sentence while encoding the word "Thinking". It is computed as the dot product of the query vector q of "Thinking" with the key vector k of each word in the input sentence; for example, the first score is the dot product of q₁ and k₁, and the second score is the dot product of q₁ and k₂.
  3. Scale and normalize: multiply each score by the scaling factor 1/√dₖ (where dₖ = 64 is the dimension of the key vector) to keep gradients more stable, then pass the results through softmax. Softmax normalizes the scores so that they are all positive and sum to 1; the softmax score determines how much each word contributes to the encoding of the current position ("Thinking").
  4. Multiply each value vector v by its softmax score: we want to keep the focus on semantically relevant words and weaken irrelevant ones. Summing the weighted value vectors gives the output zᵢ of the self-attention layer at this position.

Therefore, scaled dot-product attention can be computed with the following formula:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

In practice, the attention computation is carried out in matrix form so that it runs faster. Let's look at how the self-attention mechanism is implemented with matrix operations.

First, the query matrix Q, the key matrix K and the value matrix V are obtained by multiplying the input matrix X by the weight matrices W^Q, W^K, W^V. Since the score of any word is the dot product of its query vector q with the key vectors k of all the words, we can stack the transposed key vectors of all the words into a key matrix Kᵀ and stack the query vectors of all the words into a query matrix Q; multiplying the two matrices gives the attention score matrix A = QKᵀ. Then softmax is applied to A to obtain the normalized score matrix Â, and multiplying Â by the value matrix V yields the output matrix Z.
[Figure: matrix computation of self-attention]
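
A minimal NumPy sketch of this matrix computation; the token count and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Matrix form of scaled dot-product self-attention:
    Z = softmax(Q Kᵀ / √dₖ) V with Q, K, V = X Wq, X Wk, X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)            # scaled attention score matrix QKᵀ/√dₖ
    A_hat = softmax(A)                    # normalized score matrix
    return A_hat @ V                      # output matrix Z

n, d_model, d_k = 2, 512, 64              # e.g. "Thinking Machines": 2 tokens
X = np.random.randn(n, d_model)           # input word embeddings
Wq, Wk, Wv = (np.random.randn(d_model, d_k) * 0.02 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (2, 64)
```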

▶ Multi-head attention
[Figure: multi-head attention]
If attention is computed only once, it is hard to capture all the information in the input sentence. To improve the model, the original paper proposes a new approach, Multi-Head Attention: instead of performing a single attention over the d_model-dimensional K, Q, V embeddings, it linearly projects K, Q, V h times into different spaces of dimensions d_q, dₖ, dᵥ and performs attention separately in each of them.

Here d_q = dₖ = dᵥ = d_model / h = 64, projected onto h heads. Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions; with only a single attention head, averaging would weaken this information.

Multi-Head Attention keeps a separate set of query/key/value weight matrices W^Qᵢ, W^Kᵢ, W^Vᵢ for each head, producing different query/key/value matrices (Qᵢ, Kᵢ, Vᵢ): X is multiplied by W^Qᵢ, W^Kᵢ, W^Vᵢ to generate Qᵢ, Kᵢ, Vᵢ. Self-attention is then computed exactly as above, only with eight different sets of weight matrices, yielding eight different Zᵢ matrices, each representing the input text projected into a different implicit vector space. Finally these 8 matrices are concatenated and multiplied by a weight matrix W^O to reduce them to a single output matrix Z.
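
A compact NumPy sketch of the multi-head computation. It loops over the heads for clarity, whereas real implementations reshape everything into one batched operation; the weight matrices here are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h=8):
    """Split d_model into h heads, run scaled dot-product attention per head,
    then concatenate the head outputs and project them back with Wo."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (n, d_model) each
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)           # the i-th subspace
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        A = softmax(q @ k.T / np.sqrt(d_k))          # (n, n) attention weights
        heads.append(A @ v)                          # (n, d_k) per-head output Zᵢ
    return np.concatenate(heads, axis=-1) @ Wo       # back to (n, d_model)

n, d_model = 5, 512                                  # toy: 5 tokens
X = np.random.randn(n, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)     # (5, 512)
```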

What information does each head of Multi-Head Attention actually attend to in a sentence? Do different heads focus on different things? Take the two sentences "The animal did not cross the street because it was too tired" and "The animal did not cross the street because it was too wide": what does "it" refer to in each sentence, the "street" or the "animal"? When we encode the word "it", part of its attention focuses on "animal" and part on "street"; in a sense the model's representation of "it" partly stands for both "animal" and "street". Semantically, however, "it" in the first sentence points more strongly to the animal, while in the second sentence it points more strongly to the street.

3.3. Other structures of the Transformer model

▶ Residual connections and layer normalization

The encoder and decoder layers have a special structure between the Multi-Head Attention sub-layer and the Feed-Forward sub-layer: a residual connection followed by layer normalization (LN). The residual connection builds a residual structure by rewriting the sub-layer's input-output relation, so that small changes remain noticeable during training; the method is borrowed from computer vision.

Data is normalized before it enters the activation function, because we do not want the input to fall into the saturation region of the activation. LN is a regularization technique in deep learning and is usually compared with batch normalization (BN). The main idea of BN is to normalize each layer over each batch of data, whereas LN computes the mean and variance within each individual sample. The advantage of LN is that the normalization is computed independently on a single sample, rather than along the batch direction as in BN.
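
A minimal sketch of the Add & Norm step; it leaves out the learnable gain and bias that a full LayerNorm carries.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LN: normalize each token vector with its own mean and variance,
    independently of the other samples in the batch (unlike BN)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """The Transformer sub-layer wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.random.randn(5, 512)                       # 5 tokens, d_model = 512
out = add_and_norm(x, lambda t: 0.1 * t)          # stand-in for attention or FFN
print(out.shape, out.mean(axis=-1).round(3))      # per-token mean ≈ 0
```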

▶ Position-wise feed-forward network

Each encoder/decoder layer feeds the output of its attention sub-layer into a fully connected network: a feed-forward network (FFN) consisting of two linear transformations with a ReLU in between. In the paper the FFN is applied separately to each position (each token of the input sentence), hence the name point-wise (position-wise) FFN. It is computed as follows:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
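
In code, the position-wise FFN is just two matrix multiplications with a ReLU in between; a NumPy sketch using the paper's dimensions d_model = 512 and d_ff = 2048:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied to every position separately."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                    # inner layer is wider than d_model
x = np.random.randn(5, d_model)              # 5 token positions
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)    # (5, 512)
```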

▶ Linear transformation and softmax layer

Finally, the decoder outputs a vector of real numbers. How do these floating-point numbers become a word? That is the job of the linear transformation layer, followed by the softmax layer. The linear layer is a simple fully connected neural network that projects the vector produced by the decoder into a much larger vector called the logits vector.

Suppose our model has learned ten thousand distinct English words (our model's "output vocabulary") from the training set. The logits vector then has ten thousand cells, each holding the score of one word. The softmax layer turns those scores into probabilities (all positive, summing to 1.0); the cell with the highest probability is selected, and its corresponding word is emitted as the output for that time step.
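
A toy sketch of this final projection: the 10,000-word vocabulary matches the example above, and since the weights are random the "chosen" word is meaningless here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_model, vocab_size = 512, 10000             # assume a 10,000-word output vocabulary
W_proj = np.random.randn(d_model, vocab_size) * 0.02

dec_out = np.random.randn(d_model)           # decoder output at one time step
logits = dec_out @ W_proj                    # one score (logit) per vocabulary word
probs = softmax(logits)                      # all positive, summing to 1.0
print(int(probs.argmax()), float(probs.max()))   # index and probability of the chosen word
```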

▶ Positional encoding

The input to a Seq2Seq model is just word vectors, but the Transformer abandons recurrence and convolution and so cannot extract the ordering of the sequence on its own; if the order information is missing, all the words may be right yet fail to form a meaningful sentence. How do the authors solve this? To let the model exploit the order of the sequence, information about the relative or absolute position of the words has to be injected. The paper introduces Positional Encoding: the position at which a word appears in the sequence is encoded. The figure below visualizes the positional encodings of 20 words with a 512-dimensional embedding.
[Figure: positional encodings of 20 words in 512 dimensions]

The "positional code" of each word in the sentence is added to the input embeddings at the bottom of both the encoder and the decoder stacks; the positional encoding has the same dimension d_model as the word embedding, so the two can be added element-wise. The paper uses sine and cosine functions of different frequencies to obtain the position information:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Here pos is the position and i is the dimension: sine is used at even dimensions and cosine at odd dimensions, so each dimension of the positional encoding corresponds to a sinusoid.
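
A NumPy sketch of the sinusoidal encoding; it produces the same 20 × 512 table as the visualization above and adds it to toy word embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position codes: sine on even dimensions, cosine on odd ones,
    with a different frequency for each dimension pair."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = positional_encoding(max_len=20, d_model=512)        # matches the figure above
embeddings = np.random.randn(20, 512)                    # toy word embeddings
x = embeddings + pe                                      # added element-wise
print(pe.shape, x.shape)                                 # (20, 512) (20, 512)
```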

References:

  1. Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of EMNLP 2013.
  2. Ilya Sutskever et al. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS 2014.
  3. Dzmitry Bahdanau et al. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR 2015.
  4. Ashish Vaswani et al. 2017. Attention Is All You Need. In Proceedings of NIPS 2017.
  5. Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
  6. Zhang Junlin. Attention Models in Deep Learning (2017 edition). https://zhuanlan.zhihu.com/p/37601161