Seq2Seq + Attention: An Interpretation

1. What is the attention mechanism?
Attention is a mechanism for improving the performance of the Encoder-Decoder (Seq2Seq) model.

2. The Principle of the Attention Mechanism

Before introducing the structure and principle of the attention mechanism, we first need to introduce the structure of the Seq2Seq model. The main problem that Seq2Seq tries to solve, for example in machine translation, is how to map a variable-length input X to a variable-length output Y. Its main structure is shown in the figure below.

[Figure 3: Conventional structure of the Seq2Seq model]

As can be seen from the figure, the Seq2Seq model is divided into two stages: an encoding stage and a decoding stage.

Encoding stage:

A variable-length input sequence x_1, x_2, x_3, ..., x_t is fed into an RNN, LSTM, or GRU model, and the hidden-layer outputs of every step are then aggregated to produce a semantic vector:

c = q(h_1, h_2, ..., h_t), where q is some aggregation function.

Alternatively, the hidden-layer output of the last step can be used directly as the semantic vector c:

c = h_t

The semantic vector c here has two functions: 1. it serves as the initial vector from which the decoder model predicts y_1; 2. as the semantic vector, it guides the output y at every step of the output sequence.
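To make the encoding stage concrete, here is a minimal PyTorch sketch (the class name Encoder and the size parameters are illustrative assumptions, not from the original post). It returns both the per-step hidden states h_1..h_t and the final hidden state used as the semantic vector c.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Minimal sketch: embed the source tokens and run them through a GRU.
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, x):
        # x: (batch, src_len) token ids
        embedded = self.embedding(x)        # (batch, src_len, embed_size)
        outputs, h_t = self.rnn(embedded)   # outputs holds h_1 .. h_t for every step
        c = h_t[-1]                         # take the last hidden state as the semantic vector c
        return outputs, c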

Decoding stage:

At each step of decoding, the decoder mainly decodes the output y_i at time i based on the semantic vector c and the output y_{i-1} from the previous time step:

y_i = g(y_{i-1}, s_i, c)

where s_i is the hidden-layer output of the decoder and g is a nonlinear activation function.

Decoding ends when the end-of-sequence token (<EOS>) is produced.
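Below is a minimal sketch of this decoding loop, assuming the Encoder above; the class name Decoder and the token ids sos_id / eos_id are illustrative assumptions. It greedily decodes one token per step, uses c as the initial hidden state, and stops at <EOS>.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    # Minimal sketch of the plain (no-attention) decoder.
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn_cell = nn.GRUCell(embed_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def greedy_decode(self, c, sos_id, eos_id, max_len=50):
        # c: (1, hidden_size) semantic vector from the encoder
        y_prev = torch.tensor([sos_id])
        s = c                                             # s_0 = c: c initializes the decoder
        result = []
        for _ in range(max_len):
            s = self.rnn_cell(self.embedding(y_prev), s)  # update the hidden state from y_{i-1}
            y_i = self.out(s).argmax(dim=-1)              # pick the most likely word y_i
            if y_i.item() == eos_id:                      # stop at <EOS>
                break
            result.append(y_i.item())
            y_prev = y_i
        return result

Note that in this sketch c enters only through the initial state s_0; many implementations also feed c into every step, which matches the formula y_i = g(y_{i-1}, s_i, c) more literally.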

These are the encoding and decoding stages of Seq2Seq. As can be seen from the above, this model has two obvious problems:

1. All of the information in the input X is compressed into a fixed-length hidden vector c. When the input sentence is very long, especially when it is longer than the sentences in the training set, the performance of the model drops sharply.

2. Encoding the input X to a fixed length gives every word in the sentence the same weight, which is unreasonable. For example, in machine translation, a word or phrase in the input sentence often corresponds to a specific word or phrase in the output sentence, so giving every input word the same weight, without any discrimination, often hurts the performance of the model.

Therefore, the attention mechanism is introduced to solve these problems.

With attention, the decoding formula for y_i is rewritten as follows:

y_i = g(y_{i-1}, s_i, c_i)

That is, the output y at different time steps uses a different semantic vector c_i.

Here s_i is the hidden state of the decoder RNN at time i, which is computed as:

s_i = f(s_{i-1}, y_{i-1}, c_i)
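As a sketch of this step (all names are illustrative assumptions), the context vector c_i can be concatenated with the embedding of y_{i-1} and fed into the recurrent cell, so the new state depends on s_{i-1}, y_{i-1}, and c_i; the computation of c_i itself is shown in the code after the alignment functions below.

import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    # One decoder step with attention: s_i = f(s_{i-1}, y_{i-1}, c_i), y_i = g(y_{i-1}, s_i, c_i).
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn_cell = nn.GRUCell(embed_size + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size + hidden_size, vocab_size)

    def forward(self, y_prev, s_prev, c_i):
        # y_prev: (batch,) previous output token ids; s_prev, c_i: (batch, hidden_size)
        rnn_input = torch.cat([self.embedding(y_prev), c_i], dim=-1)
        s_i = self.rnn_cell(rnn_input, s_prev)            # s_i depends on s_{i-1}, y_{i-1}, c_i
        logits = self.out(torch.cat([s_i, c_i], dim=-1))  # scores for y_i over the vocabulary
        return logits, s_i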

The computation of the semantic vector c_i here is different from the direct aggregation in the traditional Seq2Seq model: c_i is a weighted value, expressed as shown in Equation 5:

c_i = Σ_j a_ij · h_j        (Equation 5, summing over all encoder positions j)

Here i denotes the i-th word on the decoder side, h_j is the hidden representation of the j-th word on the encoder side, and a_ij is the weight between the i-th word on the decoder side and the j-th word on the encoder side; it indicates how strongly the j-th source word influences the i-th target word. a_ij is computed as shown in Equation 6:

a_ij = exp(e_ij) / Σ_k exp(e_ik)        (Equation 6)

In Equation 6, a_ij is the output of a softmax, so for each i the weights sum to 1. e_ij measures how much the j-th word on the encoder side influences the i-th word on the decoder side; in other words, how much attention is paid to the j-th encoder word when the decoder generates the word at position i. There are many ways to compute e_ij, and different choices correspond to different attention models. The simplest and most commonly used is the dot-product alignment model, i.e., a matrix multiplication between the decoder-side hidden state and the encoder-side hidden states. The common alignment functions are computed as follows:

e_ij = score(s_i, h_j) =
    s_i^T · h_j                          (dot)
    s_i^T · W_a · h_j                    (general)
    v_a^T · tanh(W_a · [s_i ; h_j])      (concat)

These are the common ways to compute the alignment score: the dot product (dot), a weighted mapping with a learned matrix (general), and a concat mapping (concat).
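A minimal sketch that puts Equations 5 and 6 together with the three alignment scores (the function name attention_context and the parameters W_a, v_a are illustrative assumptions; shapes are chosen for clarity):

import torch
import torch.nn.functional as F

def attention_context(s_i, H, score_type="dot", W_a=None, v_a=None):
    # s_i: (batch, hidden)           decoder hidden state at step i
    # H:   (batch, src_len, hidden)  encoder hidden states h_1 .. h_T
    # W_a: (hidden, hidden) for "general", (hidden, 2*hidden) for "concat"; v_a: (hidden,)
    if score_type == "dot":
        e = torch.bmm(H, s_i.unsqueeze(2)).squeeze(2)               # e_ij = s_i^T h_j
    elif score_type == "general":
        e = torch.bmm(H @ W_a.T, s_i.unsqueeze(2)).squeeze(2)       # e_ij = s_i^T W_a h_j
    else:  # "concat"
        s_exp = s_i.unsqueeze(1).expand(-1, H.size(1), -1)          # repeat s_i for every j
        e = torch.tanh(torch.cat([s_exp, H], dim=-1) @ W_a.T) @ v_a  # v_a^T tanh(W_a [s_i ; h_j])
    a = F.softmax(e, dim=-1)                          # Equation 6: weights sum to 1 over j
    c_i = torch.bmm(a.unsqueeze(1), H).squeeze(1)     # Equation 5: c_i = sum_j a_ij * h_j
    return c_i, a

# Example usage with random tensors (batch=2, src_len=7, hidden=16):
H = torch.randn(2, 7, 16)
s = torch.randn(2, 16)
c, a = attention_context(s, H, score_type="dot")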

Source: www.cnblogs.com/yangyanfen/p/11785964.html