Seq2Seq model
Fundamentals
- Core idea: map an input sequence to an output sequence
- Encode the input
- Decode the output
- At the first decoding step, the decoder takes the encoder's final state and generates the first output
- At each subsequent step, the decoder reads its previous output and generates the current step's output
- Components:
- Encoder
- Decoder
- A fixed-size state vector connecting the two
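The encode-then-decode loop can be sketched with toy vanilla-RNN cells (all parameters here are random and hypothetical; a real model would learn them):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 8, 5

# Hypothetical toy parameters for a vanilla-RNN encoder and decoder.
W_enc = rng.normal(size=(hidden, hidden)); U_enc = rng.normal(size=(hidden, vocab))
W_dec = rng.normal(size=(hidden, hidden)); U_dec = rng.normal(size=(hidden, vocab))
V_out = rng.normal(size=(vocab, hidden))

def one_hot(i):
    v = np.zeros(vocab); v[i] = 1.0
    return v

def encode(tokens):
    """Fold the whole input sequence into one fixed-size state vector."""
    h = np.zeros(hidden)
    for t in tokens:
        h = np.tanh(W_enc @ h + U_enc @ one_hot(t))
    return h  # final state = the fixed-size vector handed to the decoder

def decode(state, start=0, max_len=4):
    """First step starts from the encoder's final state; each later step
    feeds the previous output word back in."""
    h, y, out = state, start, []
    for _ in range(max_len):
        h = np.tanh(W_dec @ h + U_dec @ one_hot(y))
        y = int(np.argmax(V_out @ h))  # pick the current output word
        out.append(y)
    return out

print(decode(encode([1, 2, 3])))
```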
Decoding methods
- The core part; the target of most improvements
- Greedy search
- After choosing a metric, pick the single best result at each step until the end
- Low computational cost
- Prone to local optima
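A minimal sketch of greedy decoding, using a hypothetical toy "model" where the next-step probabilities depend only on the previous token:

```python
import numpy as np

# Hypothetical transition table: row i gives next-token probabilities
# after token i (stands in for a real decoder's output distribution).
log_probs = np.log(np.array([
    [0.1, 0.6, 0.3],   # after token 0
    [0.5, 0.2, 0.3],   # after token 1
    [0.3, 0.3, 0.4],   # after token 2
]))

def greedy_decode(start, steps):
    seq = [start]
    for _ in range(steps):
        # keep only the locally best choice at every step
        seq.append(int(np.argmax(log_probs[seq[-1]])))
    return seq

print(greedy_decode(0, 3))  # → [0, 1, 0, 1]
```

Each step commits to the locally best token, which is cheap but can miss a globally better sequence.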
- Beam search (Beam Search)
- A heuristic algorithm
- Keeps the current best `beam size` candidates; the beam size determines the amount of computation, with 8 to 12 usually working best
- At each step, extends each stored candidate, sorts the resulting decodings, and keeps the top `beam size` of them
- Iterates until the end, then selects the best output
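The extend-sort-prune loop above can be sketched over the same kind of hypothetical toy transition table (next-token probabilities depend only on the previous token):

```python
import numpy as np

# Hypothetical toy model, as in the greedy example.
log_probs = np.log(np.array([
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.3, 0.3, 0.4],
]))

def beam_search(start, steps, beam_size=2):
    # each hypothesis: (cumulative log-probability, token sequence)
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in enumerate(log_probs[seq[-1]]):
                candidates.append((score + lp, seq + [tok]))  # extend
        # sort all extensions and keep only the top beam_size of them
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]  # best-scoring hypothesis at the end

print(beam_search(0, 3, beam_size=2))
```

With `beam_size=1` this degenerates to greedy search; larger beams trade computation for a wider search.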
- Improvements
- Stacked RNNs
- Dropout
- Residual connections between encoder layers
- Attention mechanism
- Memory
Attention mechanism
Seq2Seq model with attention
- In practice, model performance is found to drop significantly as the input sequence grows longer
- Tricks
- Reversing the source-language sentence, or feeding it in twice, recovers some performance
- The reason: during encoding and decoding, the context and position information of the source-language words corresponding to the word currently being decoded is lost
- The attention mechanism is introduced to solve this problem:
- When decoding, each output word depends on the previous hidden state and on the hidden states corresponding to each position of the input sequence
\[s_i = f(s_{i-1}, y_{i-1}, c_i)\]
\[P(y_i \mid y_1, \cdots, y_{i-1}) = g(y_{i-1}, s_i, c_i)\]
where \(y\) is the output word, \(s\) is the current hidden state, and \(f, g\) are non-linear transformations, generally neural networks
- The context vector \(c_i\) is the weighted sum of all the input sequence's hidden states \(h_1, \cdots, h_T\)
\[c_i = \sum\limits_{j=1}^{T} a_{ij} h_j\]
\[a_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}\]
\[e_{ij} = a(s_{i-1}, h_j)\]
- The alignment network \(a\) takes the output-side hidden state \(s_{i-1}\) and the input-side hidden state \(h_j\) as input and computes an alignment score \(e_{ij}\) between \(x_j\) and \(y_i\)
- This takes into account how well each input word aligns with the current output word; better-aligned words receive larger weights and have more influence on the current output
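The three equations above can be sketched with random toy vectors; the alignment network \(a\) is taken here to be a small one-layer perceptron (an assumption — any scoring function works):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 4, 6                      # input length, hidden size
h = rng.normal(size=(T, dim))      # encoder hidden states h_1..h_T
s_prev = rng.normal(size=dim)      # previous decoder state s_{i-1}

# Hypothetical perceptron parameters for the alignment network a(., .).
W_s = rng.normal(size=(dim, dim))
W_h = rng.normal(size=(dim, dim))
v = rng.normal(size=dim)

# e_ij = a(s_{i-1}, h_j): one alignment score per input position
e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h[j]) for j in range(T)])
a = np.exp(e - e.max())            # a_ij = softmax over input positions j
a /= a.sum()
c = a @ h                          # c_i = sum_j a_ij h_j, shape (dim,)

print(a.round(3), c.shape)        # weights sum to 1; better-aligned words weigh more
```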
- Bidirectional recurrent neural networks
- With a single direction, \(h_i\) contains only the information from \(x_0\) to \(x_i\), so \(a_{ij}\) loses the information after \(x_i\)
- Bidirectional: the hidden state for the \(i\)-th input word includes both \(\overrightarrow{h}_i\) and \(\overleftarrow{h}_i\); the former encodes the information from \(x_0\) to \(x_i\), the latter encodes the information from \(x_i\) onward, preventing information loss
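A sketch of the bidirectional states with a toy tanh-RNN (hypothetical random parameters): the state for word \(i\) concatenates the forward state, which covers \(x_0..x_i\), and the backward state, which covers \(x_i..x_T\).

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 5, 4
x = rng.normal(size=(T, dim))          # toy input vectors x_0..x_{T-1}
W = rng.normal(size=(dim, dim)) * 0.1  # hypothetical shared RNN weights
U = rng.normal(size=(dim, dim)) * 0.1

def run(seq):
    h, out = np.zeros(dim), []
    for v in seq:
        h = np.tanh(W @ h + U @ v)
        out.append(h)
    return out

fwd = run(x)                           # forward states, left to right
bwd = run(x[::-1])[::-1]               # backward states, right to left, re-aligned
h_i = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(h_i), h_i[0].shape)          # T states, each of size 2*dim
```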
Common forms of attention
Essence: a mapping from a query to a series of key-value pairs
- Computation process
- Compute the similarity between the query and each key to obtain the weights; common similarity functions include the dot product, concatenation, and a perceptron
- Normalise these weights with a softmax function
- Take the weighted sum of the values using the weights to obtain the final attention
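The three steps above, sketched with dot-product similarity over toy random data (no learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
q = rng.normal(size=d)             # one query
K = rng.normal(size=(n, d))        # keys
V = rng.normal(size=(n, d))        # values

scores = K @ q                     # step 1: query-key similarity (dot product)
w = np.exp(scores - scores.max())  # step 2: softmax-normalise the weights
w /= w.sum()
attn = w @ V                       # step 3: weighted sum of the values

print(w.round(3), attn.shape)
```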