Summary of Attention Module in NLP: state-of-the-art methods till 2019 (updating)


  • First proposed for NMT in 'Neural Machine Translation by Jointly Learning to Align and Translate' (Bahdanau et al., arXiv 2014, published at ICLR 2015), but the idea originated earlier in the computer vision field.
  • Computes the weights of a linear combination of the encoder hidden states; the resulting weighted sum (the context vector) makes up part of each decoder hidden state

The reason we need attention model:

    • Original NMT models failed to capture information in long sentences
    • RNN-based encoder-decoder structures were proposed to deal with the long-term memory problem
    • But the encoder-decoder still has an issue with very long sentences, because a fixed-length hidden state cannot store unbounded information, while the input at test time can be very long (you can enter 5,000 words into Google Translate). Bahdanau indicated that computing a single fixed context vector can be the bottleneck of generation performance.
    • Besides, information in nearby sentences is not necessarily more important than information in farther ones, so we need to take the relevance between words and sentences into account, rather than only considering the distance between words
    • Attention was first proposed in image processing, then applied to NMT in Bahdanau's 2014 paper:
    • [1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

Calculation of attention:

  • Alignment model (proposed by Bahdanau et al.)

    • α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik)

  • where e is a scoring (alignment) function judging how well the output word y_i matches the input word x_j
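The normalization above is just a softmax over the alignment scores; a minimal NumPy sketch (the scores below are made-up numbers for illustration):

```python
import numpy as np

def alignment_weights(e):
    # a_ij = exp(e_ij) / sum_k exp(e_ik): softmax over source positions
    e = e - e.max(axis=-1, keepdims=True)  # shift for numerical stability
    w = np.exp(e)
    return w / w.sum(axis=-1, keepdims=True)

e = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 source words
a = alignment_weights(e)       # weights sum to 1; largest score gets the largest weight
```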

  • the authors compute this score with a feedforward network whose weight matrix is learned:

    • We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system
  • Therefore, one component we can modify is the scoring function e; here are several classic choices:

  • Bahdanau's additive attention:

    • score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
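A minimal NumPy sketch of this additive score, with W_a and v_a randomly initialized for illustration (in the paper they are jointly trained with the rest of the network; the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_s = d_h = 4                               # decoder / encoder state sizes (arbitrary)
d_a = 8                                     # attention hidden size (arbitrary)
W_a = rng.normal(size=(d_a, d_s + d_h))     # learned in practice; random here
v_a = rng.normal(size=(d_a,))

def additive_score(s, h):
    # Bahdanau-style score: v_a^T tanh(W_a [s; h])
    return v_a @ np.tanh(W_a @ np.concatenate([s, h]))

s_t = rng.normal(size=(d_s,))               # current decoder state
H = rng.normal(size=(5, d_h))               # five encoder hidden states
e = np.array([additive_score(s_t, h) for h in H])
a = np.exp(e - e.max()); a /= a.sum()       # softmax -> attention weights
context = a @ H                             # context vector: weighted sum of encoder states
```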
  • Multiplicative attention (Luong): score(s_t, h_i) = s_t^T W_a h_i

  • Additive attention: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])

  • Location-based function (Luong): a_t = softmax(W_a s_t), i.e. the weights depend only on the target position

  • Scaled function:

    • Attention(Q, K, V) = softmax(Q K^T / √d_k) V
    • scaling addresses a backpropagation problem: when the attention logits are too large, the softmax saturates and its gradients become tiny, making the weights difficult to fine-tune during training
    • [1706.03762] Attention Is All You Need https://arxiv.org/abs/1706.03762
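A self-contained NumPy sketch of scaled dot-product attention as defined in that paper (the shapes below are arbitrary illustration values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaling keeps the softmax out of its saturated region
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))                      # 2 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))                      # 5 keys
V = rng.normal(size=(5, 8))                      # 5 values
out, w = scaled_dot_product_attention(Q, K, V)   # out: one weighted value per query
```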
  • Difference of the selection standard:

    • soft attention (considers all source states; fully parameterized and differentiable) vs. hard attention (selects only the most relevant state, via Monte Carlo stochastic sampling)

      • drawback of hard attention: non-differentiable
    • global attention (aligns over all source states) vs. local attention (focuses only on part of the source)

    • local attention is somewhat similar to hard attention

  • self attention:

  • Hierarchical model:

  • Difference between Memory mechanism and Attention mechanism?

  • UNDER modal:

  • ref:

Origin www.cnblogs.com/joezou/p/11247974.html