Attention mechanisms - "Dive into Deep Learning (PyTorch)"

Why attention mechanisms are introduced

In the "coder - decoder (seq2seq)" ⼀ section ⾥, the decoder the same step at various times dependent variable background (context vector) to obtain sequence information input START.

In practice, however, RNNs suffer from vanishing gradients over long ranges. For longer sentences it is hard to expect a fixed-length vector to encode the input sequence and preserve all of the useful information, so as the length of the sentences to be translated grows, the effectiveness of this structure drops significantly.

At the same time, a decoded target word may be related to only some of the words in the original input rather than to all of them; the attention mechanism can model this selection process explicitly.

The attention mechanism framework

Attention is a general weighted-pooling method. The input consists of two parts: a query and key-value pairs. The output vector o is a weighted sum of the values, where the weight computed from each key corresponds one-to-one to a value.
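As a minimal sketch of this pooling step (the function name and shapes are illustrative assumptions, not code from the original post), the weight of each key is obtained with a softmax over the scores, and the output is the corresponding weighted sum of the values:

```python
import torch
import torch.nn.functional as F

# Attention as weighted pooling: one query against n key-value pairs.
# query: (d_q,)   keys: (n, d_k)   values: (n, d_v)
def attention_pool(query, keys, values, score_fn):
    # score_fn(query, key) returns one scalar score per key.
    scores = torch.stack([score_fn(query, k) for k in keys])  # (n,)
    weights = F.softmax(scores, dim=0)                        # one weight per key-value pair
    return weights @ values                                   # (d_v,) weighted sum of values
```

The scoring function `score_fn` is what distinguishes the variants below, e.g. `attention_pool(q, K, V, lambda q, k: q @ k)` for a plain dot-product score.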

Dot-product attention
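A minimal sketch of (scaled) dot-product attention, assuming the queries and keys share the same dimension d; the score is the inner product of query and key divided by sqrt(d):

```python
import math
import torch

# Scaled dot-product score: alpha(q, k) = <q, k> / sqrt(d).
# queries: (batch, n_q, d)   keys: (batch, n_k, d)   values: (batch, n_k, d_v)
def dot_product_attention(queries, keys, values):
    d = queries.shape[-1]
    scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)  # (batch, n_q, n_k)
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, values)                                 # (batch, n_q, d_v)
```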

MLP attention
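MLP (additive) attention does not require the query and key dimensions to match: both are projected into a hidden space, added, passed through tanh, and mapped to a scalar score. A sketch under assumed shapes:

```python
import torch
from torch import nn

class MLPAttention(nn.Module):
    """Additive attention: alpha(q, k) = v^T tanh(W_q q + W_k k)."""
    def __init__(self, q_dim, k_dim, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(q_dim, hidden_dim, bias=False)
        self.W_k = nn.Linear(k_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, queries, keys, values):
        # queries: (batch, n_q, q_dim)   keys: (batch, n_k, k_dim)   values: (batch, n_k, d_v)
        q, k = self.W_q(queries), self.W_k(keys)
        # Broadcast so every (query, key) pair is scored: (batch, n_q, n_k, hidden_dim)
        features = torch.tanh(q.unsqueeze(2) + k.unsqueeze(1))
        scores = self.v(features).squeeze(-1)      # (batch, n_q, n_k)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights, values)          # (batch, n_q, d_v)
```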

Introducing the attention mechanism into the seq2seq model

In this section the attention mechanism is added to the sequence-to-sequence model so that the encoder states are aggregated explicitly with weights. The figure below shows the encoding and decoding structure of the model at time step t. The attention layer now stores all the information the encoder has seen, that is, the encoder output at every time step. During decoding, the decoder's hidden state at time step t serves as the query, and the encoder's hidden states at each time step serve as the keys and values for attention pooling. The output of the attention model is used as the context vector and is concatenated with the decoder input D_t before being fed into the decoder.
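A rough sketch of one decoding step under this scheme (function and variable names are assumptions for illustration, not the post's code); an unscaled dot-product score is used for brevity, and `rnn_cell` could be, for example, `nn.GRUCell(embed_dim + hidden_dim, hidden_dim)`:

```python
import torch

# One decoder time step with attention.
# enc_outputs: (batch, src_len, hidden)  -- encoder outputs, used as keys and values
# dec_hidden:  (batch, hidden)           -- decoder hidden state at step t, used as query
# dec_input:   (batch, embed)            -- embedding of the decoder input D_t
def decoder_step(dec_input, dec_hidden, enc_outputs, rnn_cell):
    query = dec_hidden.unsqueeze(1)                          # (batch, 1, hidden)
    scores = torch.bmm(query, enc_outputs.transpose(1, 2))   # (batch, 1, src_len)
    weights = torch.softmax(scores, dim=-1)
    context = torch.bmm(weights, enc_outputs).squeeze(1)     # (batch, hidden) context vector
    rnn_in = torch.cat([dec_input, context], dim=-1)         # concatenate context with D_t
    return rnn_cell(rnn_in, dec_hidden)                      # next decoder hidden state
```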

Transformer

In the previous sections, we introduced mainstream neural network architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Let us briefly review their characteristics:

  • CNNs are easy to parallelize, but they are not well suited to capturing dependencies across longer sequences.
  • RNNs are well suited to capturing long-range dependencies in a sequence, but they are hard to parallelize across the sequence.

To combine the advantages of CNNs and RNNs, [Vaswani et al., 2017] innovatively designed the Transformer model using the attention mechanism. The model uses attention to capture sequence dependencies in parallel and processes the tokens at every position of the sequence simultaneously; these advantages allow the Transformer to achieve excellent performance while greatly reducing training time.

Related concepts

1. Multi-head attention layer

2. Position-wise feed-forward network (a sketch follows after this list)

3. Add and Norm (residual addition and layer normalization)

4. Positional encoding and the decoder
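As an illustration of item 2 above (a sketch with assumed dimensions, not code from the original post), the position-wise feed-forward network applies the same two-layer MLP independently to the representation at every position of the sequence:

```python
import torch
from torch import nn

class PositionWiseFFN(nn.Module):
    """Two-layer MLP applied independently at each sequence position."""
    def __init__(self, model_dim, ffn_hidden_dim):
        super().__init__()
        self.dense1 = nn.Linear(model_dim, ffn_hidden_dim)
        self.dense2 = nn.Linear(ffn_hidden_dim, model_dim)

    def forward(self, x):
        # x: (batch, seq_len, model_dim); nn.Linear acts on the last dimension,
        # so every position is transformed with the same weights.
        return self.dense2(torch.relu(self.dense1(x)))
```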

Origin blog.csdn.net/serenysdfg/article/details/104501490