Meet Transformer: Getting Started

Video link:
https://www.youtube.com/watch?v=ugWDIIOHtPA&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=60

Seq2Seq

An RNN is hard to parallelize, because each time step depends on the previous hidden state.
CNNs have been proposed as a replacement: a CNN can be parallelized, but many layers have to be stacked before a single output can see the whole input sequence.
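A minimal NumPy sketch of that sequential dependency (the weight names W_x, W_h and all sizes are illustrative, not from the lecture): each hidden state needs the previous one, so the loop over time steps cannot be run in parallel.

```python
import numpy as np

# Sketch of why an RNN is hard to parallelize over time:
# h_t depends on h_{t-1}, so the loop below must run step by step.
# Weight names and sizes are illustrative placeholders.
d_in, d_hidden, seq_len = 4, 8, 6
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_in, d_hidden))
W_h = rng.normal(size=(d_hidden, d_hidden))

x = rng.normal(size=(seq_len, d_in))   # input sequence
h = np.zeros(d_hidden)                 # initial hidden state

hidden_states = []
for t in range(seq_len):               # strictly sequential: step t needs h from step t-1
    h = np.tanh(x[t] @ W_x + h @ W_h)
    hidden_states.append(h)
```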

Self-Attention layer

The outputs b^1 through b^4 can all be computed at the same time.
A self-attention layer can therefore be used to replace an RNN layer.

Source: Attention Is All You Need (Vaswani et al., 2017)

Each input a^i is first projected into a query q^i, a key k^i, and a value v^i. Each query then does attention with every key: the scaled dot-product scores are passed through a softmax, and the resulting weights take a weighted sum of the values to produce the output b^i.
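A minimal NumPy sketch of this per-position computation, assuming the standard scaled dot-product formulation from the paper; the projection matrices W_q, W_k, W_v are random placeholders.

```python
import numpy as np

# Per-position self-attention for one query index i (sketch; weights are placeholders).
d_model, d_k, seq_len = 8, 8, 4
rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

a = rng.normal(size=(seq_len, d_model))      # inputs a^1 ... a^4
q = a @ W_q                                  # queries
k = a @ W_k                                  # keys
v = a @ W_v                                  # values

i = 0                                        # compute b^1 as an example
scores = np.array([q[i] @ k[j] for j in range(seq_len)]) / np.sqrt(d_k)
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over all keys
b_i = alpha @ v                                 # weighted sum of the values
```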

The whole computation can be packed into a few matrix multiplications, which is what makes it easy to accelerate (e.g. on a GPU).
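The same computation written in matrix form, as a sketch: stack the queries, keys, and values into matrices Q, K, V, and the whole layer becomes softmax(Q K^T / sqrt(d_k)) V, i.e. a few matrix products that compute every output position at once.

```python
import numpy as np

def self_attention(a, W_q, W_k, W_v):
    """Scaled dot-product self-attention in matrix form (NumPy sketch)."""
    Q, K, V = a @ W_q, a @ W_k, a @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # all pairwise query-key scores at once
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)  # row-wise softmax
    return alpha @ V                            # every output b^i in one product

# Example with the same shapes as before; all positions are computed in parallel.
rng = np.random.default_rng(1)
a = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
B = self_attention(a, W_q, W_k, W_v)            # shape (4, 8): b^1 ... b^4
```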

Multi-head self-attention

Different heads can focus on different kinds of information (for example local versus long-range context), which gives a better attention result.
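A sketch of multi-head self-attention under the usual formulation: the model dimension is split across heads, each head runs scaled dot-product attention on its own slice, and the concatenated head outputs are mixed by an output projection (called W_o here; all weights are random placeholders).

```python
import numpy as np

def multi_head_self_attention(a, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention (NumPy sketch; weights are random placeholders)."""
    seq_len, d_model = a.shape
    d_head = d_model // num_heads
    Q, K, V = a @ W_q, a @ W_k, a @ W_v

    # Split the model dimension into (num_heads, d_head) and attend per head.
    def split(x):
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)              # softmax per head
    heads = alpha @ Vh                                       # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(2)
a = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
B = multi_head_self_attention(a, W_q, W_k, W_v, W_o, num_heads=2)
```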

Positional encoding

Self-attention by itself does not take position into account.
A positional encoding e^i is therefore added to each input a^i to carry the position information; in the original paper e^i is hand-crafted rather than learned.
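A sketch of the hand-crafted sinusoidal encoding from the original paper: even dimensions use sine and odd dimensions use cosine of position-dependent angles, and the resulting e^i is simply added to a^i.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings (NumPy sketch of the original paper's scheme)."""
    pos = np.arange(seq_len)[:, None]              # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    e = np.zeros((seq_len, d_model))
    e[:, 0::2] = np.sin(angles)                    # even dims: sine
    e[:, 1::2] = np.cos(angles)                    # odd dims: cosine
    return e

# The encoding e^i is added to the input a^i before self-attention.
rng = np.random.default_rng(3)
a = rng.normal(size=(4, 8))
a_with_pos = a + positional_encoding(4, 8)
```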


Seq2Seq with Attention


Transformer


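As a rough sketch of how the pieces above fit together, the following assembles one encoder layer in the standard "Attention Is All You Need" layout: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. It reuses the multi_head_self_attention sketch from above; all weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization over the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(a, attn_weights, W1, b1, W2, b2, num_heads=2):
    """One Transformer encoder layer (NumPy sketch; weights are placeholders)."""
    # Sub-layer 1: multi-head self-attention + residual connection + layer norm.
    x = layer_norm(a + multi_head_self_attention(a, *attn_weights, num_heads))
    # Sub-layer 2: position-wise feed-forward network + residual + layer norm.
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2     # ReLU feed-forward
    return layer_norm(x + ffn)

rng = np.random.default_rng(4)
a = rng.normal(size=(4, 8))
attn_weights = tuple(rng.normal(size=(8, 8)) for _ in range(4))   # W_q, W_k, W_v, W_o
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
out = encoder_layer(a, attn_weights, W1, b1, W2, b2)
```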

Universal Transformer

