Seq2Seq model and attention mechanisms

Seq2Seq model

Fundamentals

  • The core idea: map an input sequence to an output sequence
    • Encode the input
    • Decode the output
    • At the first decoding step, the decoder takes the encoder's final state and generates the first output
    • At each later step, the decoder reads the previous output and generates the output of the current step
  • Components (a minimal sketch follows this list):
    • Encoder
    • Decoder
    • A fixed-size state vector connecting the two
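A minimal sketch of this encoder-decoder structure, assuming a GRU-based encoder and decoder with hypothetical vocabulary and hidden sizes; it is an illustrative skeleton, not the exact model of any particular paper:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and returns its final hidden state."""
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) of token ids
        _, h = self.rnn(self.emb(src))       # h: (1, batch, hid_dim)
        return h                             # the fixed-size state vector

class Decoder(nn.Module):
    """Generates one target token per step, conditioned on the running state."""
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, h):        # prev_token: (batch, 1)
        o, h = self.rnn(self.emb(prev_token), h)
        return self.out(o.squeeze(1)), h     # logits over the target vocabulary

# Encode once, then decode step by step starting from the encoder's final state.
enc, dec = Encoder(1000, 32, 64), Decoder(1000, 32, 64)
src = torch.randint(0, 1000, (2, 7))         # a batch of 2 source sentences
state = enc(src)                             # first decoder state = encoder final state
token = torch.zeros(2, 1, dtype=torch.long)  # assume token id 0 is <sos>
for _ in range(5):                           # greedy roll-out for 5 steps
    logits, state = dec(token, state)
    token = logits.argmax(dim=-1, keepdim=True)
```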

Decoding methods

  • The core part of the model; most improvements target it
  • Greedy decoding
    • After choosing a metric, select the best result at the current step, and repeat until the end
      • Low computational cost
      • Prone to local optima
  • Beam search
    • A heuristic algorithm
    • Keeps the beam-size currently best candidates; the beam size determines the amount of computation, and 8 to 12 usually works best
    • At each step, every stored candidate is extended, the extended decoding results are sorted, and the top beam-size candidates are kept
    • Iterate until the end, then select the best output (a sketch follows this list)
  • Improvements
    • Stacked RNNs
    • Dropout
    • Residual connections with the encoder
    • Attention mechanisms
    • Memory mechanisms
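A minimal beam-search sketch, assuming a hypothetical `step(prefix)` function that returns log-probabilities for the next token and an end-of-sequence id `eos`; it keeps only the beam-size best partial hypotheses at every step:

```python
import math

def beam_search(step, sos, eos, beam_size=8, max_len=20):
    """step(prefix) -> {token: log_prob} for the next token of a partial hypothesis."""
    beams = [([sos], 0.0)]                        # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:               # extend every stored hypothesis
            for tok, logp in step(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:   # keep the top beam_size
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                             # every hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy step function: a fixed bigram-style distribution (purely illustrative).
table = {0: {1: math.log(0.6), 2: math.log(0.4)},
         1: {2: math.log(0.7), 3: math.log(0.3)},
         2: {3: math.log(1.0)},
         3: {3: math.log(1.0)}}
best, score = beam_search(lambda p: table[p[-1]], sos=0, eos=3, beam_size=2)
print(best, score)
```

Greedy decoding is the special case of beam size 1; larger beams trade extra computation for a better chance of escaping local optima.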

Attention mechanisms

Attention mechanisms in the Seq2Seq model

  • In practice, model performance drops significantly as the input sequence grows longer
  • Tricks
    • Reversing the source-language sentence, or feeding it in twice, gives some performance improvement
  • During encoding and decoding, the context of the current decoding step and the positional correspondence between source-language words and their target words are lost
  • The attention mechanism is introduced to solve this problem:
    • When decoding, each output word depends on the previous hidden state and on a context vector built from the hidden states of the input sequence (a sketch follows this list):
      \[s_i = f(s_{i-1}, y_{i-1}, c_i)\]
      \[p(y_i \mid y_1, \cdots, y_{i-1}) = g(y_{i-1}, s_i, c_i)\]
      where \(y\) is the output word, \(s\) is the current hidden state, and \(f, g\) are non-linear transformations, usually neural networks
    • The context vector \(c_i\) is a weighted sum of all hidden states \(h_1, \cdots, h_T\) of the input sequence:
      \[c_i = \sum\limits_{j=1}^{T} a_{ij} h_j\]
      \[a_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}\]
      \[e_{ij} = a(s_{i-1}, h_j)\]
    • A neural network \(a\) takes the hidden state \(s_{i-1}\) of the output sequence and the hidden state \(h_j\) of the input sequence as input and computes an alignment score \(e_{ij}\) between \(x_j\) and \(y_i\)
      • It models how well each input word aligns with the current output word: the better a word aligns, the larger its weight and the greater its influence on the current output
    • Bidirectional recurrent neural networks
      • With a single direction, \(h_i\) only contains information from \(x_0\) to \(x_i\), so \(a_{ij}\) loses the information after \(x_i\)
      • With a bidirectional network, the hidden state of the \(i\)-th input word consists of \(\overrightarrow{h}_i\) and \(\overleftarrow{h}_i\); the former encodes the information from \(x_0\) to \(x_i\), the latter encodes the information from \(x_i\) onward, preventing information loss
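A sketch of the context-vector computation defined by the formulas above, assuming an additive (MLP-style) alignment function in the spirit of Bahdanau attention; the module name, dimensions, and toy inputs are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """e_ij = a(s_{i-1}, h_j), a_ij = softmax_j(e_ij), c_i = sum_j a_ij * h_j."""
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_prev, h):       # s_prev: (batch, dec_dim); h: (batch, T, enc_dim)
        # Alignment scores e_ij from the previous decoder state and every encoder state.
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h)))  # (batch, T, 1)
        a = torch.softmax(e, dim=1)     # alignment weights a_ij over the T input positions
        c = (a * h).sum(dim=1)          # context vector c_i: weighted sum of the h_j
        return c, a.squeeze(-1)

# h would come from a bidirectional encoder: concatenation of forward and backward states.
h = torch.randn(2, 6, 2 * 64)           # 2 sentences, T = 6, enc_dim = 2 * 64
s_prev = torch.randn(2, 128)            # previous decoder hidden state s_{i-1}
c, a = AdditiveAttention(128, 2 * 64, 64)(s_prev, h)
print(c.shape, a.shape)                 # torch.Size([2, 128]) torch.Size([2, 6])
```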

Common forms of Attention

  • Essence: a mapping from a query to a series of key-value pairs

  • Calculation process (a sketch follows this list)
    • Compute the similarity between the query and each key to obtain the weights; commonly used similarity functions include the dot product, concatenation, and a perceptron
    • Normalize these weights with a softmax function
    • Compute the weighted sum of the values with the normalized weights to obtain the final attention
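A sketch of these three steps using dot-product similarity (with the scaling popularized by the Transformer); the function name, shapes, and scaling choice are illustrative assumptions:

```python
import math
import torch

def dot_product_attention(query, keys, values):
    """query: (batch, d_k); keys: (batch, T, d_k); values: (batch, T, d_v)."""
    # 1. Similarity between the query and every key (here: scaled dot product).
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1) / math.sqrt(keys.size(-1))
    # 2. Normalize the weights with a softmax.
    weights = torch.softmax(scores, dim=-1)                    # (batch, T)
    # 3. Weighted sum of the values gives the final attention output.
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1), weights  # (batch, d_v)

q = torch.randn(2, 32)        # one query per example
k = torch.randn(2, 5, 32)     # 5 key-value pairs per example
v = torch.randn(2, 5, 16)
out, w = dot_product_attention(q, k, v)
print(out.shape, w.shape)     # torch.Size([2, 16]) torch.Size([2, 5])
```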
