Video link:
https://www.youtube.com/watch?v=ugWDIIOHtPA&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=60
Seq2Seq
RNNs are hard to parallelize: each time step depends on the previous hidden state, so the sequence must be processed serially.
One proposal is to replace the RNN with a CNN. A CNN can be parallelized, but many layers must be stacked before any single output can see the entire input (a rough calculation follows below).
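As a rough illustration (the kernel size and sequence length here are assumed numbers, not from the lecture): with 1D convolutions of kernel size k, the receptive field after L layers is L·(k − 1) + 1, so covering a 100-token input with k = 3 takes roughly 50 layers, whereas a single self-attention layer already connects every pair of positions.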
Self-Attention layer
The outputs b^1 to b^4 can all be computed at the same time, since no output waits on another.
A self-attention layer can therefore be used to replace an RNN layer.
Source: Attention Is All You Need (Vaswani et al., 2017)
Each input a^i is first multiplied by three weight matrices to produce a query q^i = W^q a^i, a key k^i = W^k a^i, and a value v^i = W^v a^i. Each query q then does attention against every key k, and the resulting weights combine the values v into the output b.
The whole process can be packed into matrix multiplications, which GPUs accelerate well: stacking the vectors into matrices Q, K, V gives Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V.
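To make this concrete, here is a minimal NumPy sketch of single-head self-attention in matrix form. The function name, shapes, and random data are illustrative assumptions rather than the lecture's code; the √d_k scaling follows the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k). Illustrative shapes.
    Q = X @ Wq  # queries q^1..q^n, one per row
    K = X @ Wk  # keys k^1..k^n
    V = X @ Wv  # values v^1..v^n
    d_k = Q.shape[-1]
    # Every query attends to every key in one matrix product.
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Each output b^i is a weighted sum of all values -> all b computed at once.
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # four inputs a^1..a^4
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
B = self_attention(X, Wq, Wk, Wv)
print(B.shape)                                     # (4, 8): rows are b^1..b^4
```

Note that nothing here is sequential: the two matrix products and the softmax all operate on whole matrices at once, which is exactly what makes self-attention parallelizable.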
Multi-head self-attention
Different heads can focus on different kinds of information (for example, one head on local context and another on long-range dependencies), giving a richer attention result; a sketch follows below.
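Below is a minimal multi-head sketch under the same assumptions (NumPy, row vectors, d_model divisible by num_heads; all names are illustrative). Each head projects into a smaller subspace, attends independently, and the heads are concatenated and mixed by an output projection:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model). Illustrative shapes.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model % num_heads == 0

    def project(W):
        # Project once, then split features into (num_heads, seq_len, d_head).
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)   # per-head attention weights
    heads = A @ V                           # each head attends independently
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=2)  # shape (4, 8)
```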
Positional encoding
Self-attention by itself ignores position: permuting the input order permutes the outputs in exactly the same way.
Therefore a positional vector e^i is added to each input a^i to inject position information; in the original paper the e^i are hand-crafted (set by hand, not learned).
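A minimal sketch of the paper's hand-crafted sinusoidal encoding (assuming an even d_model; the function name is illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# e^i is simply added to the input a^i before self-attention.
X = np.random.default_rng(0).normal(size=(4, 8))
X_with_pos = X + positional_encoding(4, 8)
```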