Introduction
In sequence encoding and decoding, the common schemes are:
- An RNN cannot learn global structural information well, because it is essentially a Markov decision process: each step only conditions on the previous hidden state.
- The CNN scheme is also very natural: it traverses the sequence with a sliding window, e.g. a convolution of size 3, so each position only interacts with its local neighborhood (see the sketch after this list).
- Google proposes attention instead, where every position can attend to every other position directly.
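A minimal sketch of the window-traversal idea mentioned above (a depthwise size-3 convolution in NumPy; the array sizes and the depthwise simplification are assumptions made here for illustration, not details from the references):

```python
import numpy as np

def conv1d_size3(x, w):
    """Slide a size-3 kernel over a sequence: each output position
    only sees its immediate left/right neighbors (local receptive field)."""
    T, d = x.shape                          # sequence length, feature dim
    x_pad = np.pad(x, ((1, 1), (0, 0)))     # pad ends so output length == T
    out = np.zeros((T, d))
    for t in range(T):
        window = x_pad[t:t + 3]             # positions t-1, t, t+1
        out[t] = np.einsum('kd,kd->d', w, window)
    return out

x = np.random.randn(6, 4)   # toy sequence: 6 steps, 4 features
w = np.random.randn(3, 4)   # size-3 depthwise kernel (toy values)
print(conv1d_size3(x, w).shape)  # (6, 4)
```

Each output here depends on a window of only 3 inputs, so many such layers must be stacked before distant positions can interact.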
The attention process:
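A minimal sketch of the scaled dot-product attention from "Attention is All You Need", Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the shapes and variable names below are illustrative assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every query to every key
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys: each row sums to 1
    return weights @ V                               # weighted sum of the values

Q = np.random.randn(5, 8)    # 5 queries, d_k = 8 (toy sizes)
K = np.random.randn(7, 8)    # 7 keys
V = np.random.randn(7, 16)   # 7 values, d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 16)
```

The softmax weights connect every query position to every key position in a single step, which is what gives attention its global view compared with the step-by-step RNN and the local CNN window.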
References:
1. Attention Mechanism in NLP + Detailed Explanation of the Transformer
2. A Brief Reading of "Attention is All You Need"
3. [Exploration of Linear Attention: Does Attention Have to Have a Softmax?](https://kexue.fm/archives/7546)