RNNs are hard to parallelize: each time step depends on the previous hidden state, so the sequence must be processed serially.
Transformer
1. Input vectors x1-x4 are each multiplied by an embedding matrix W to obtain embedding vectors a1-a4.
2. Each ai is multiplied by Wq, Wk, and Wv to obtain the query qi, key ki, and value vi (i = 1,2,3,4).
3. Use q1 to attend to every key ki: compute the scores α1,i = q1·ki (i = 1,2,3,4), normalized by dividing by √d, where d is the key dimension.
4. Apply softmax to the α1,i to get α̂1,i, multiply each α̂1,i by its corresponding vi, and sum them to get b1; repeating with q2-q4 gives the remaining bi (see the sketch after this list).
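A minimal NumPy sketch of the four steps above; the dimensions and the random W, Wq, Wk, Wv matrices are illustrative assumptions, not values from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # embedding/key dimension (assumed)
x = rng.normal(size=(4, d))        # input vectors x1..x4 as rows

W  = rng.normal(size=(d, d))       # step 1: embedding matrix W
Wq = rng.normal(size=(d, d))       # step 2: query projection
Wk = rng.normal(size=(d, d))       # step 2: key projection
Wv = rng.normal(size=(d, d))       # step 2: value projection

a = x @ W                          # step 1: a1..a4
q, k, v = a @ Wq, a @ Wk, a @ Wv   # step 2: qi, ki, vi

alpha = q @ k.T / np.sqrt(d)       # step 3: α(i,j) = qi·kj / √d

# step 4: row-wise softmax, then weighted sum of the vi
e = np.exp(alpha - alpha.max(axis=1, keepdims=True))
alpha_hat = e / e.sum(axis=1, keepdims=True)
b = alpha_hat @ v                  # rows are b1..b4
print(b.shape)                     # (4, 8)
```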
Multi-head self-attention
differs in step 2 of self-attention: each ai is multiplied by several sets of Wq, Wk, and Wv matrices (one set per head j), giving qi,j, ki,j, and vi,j; each head attends independently and the head outputs are concatenated and projected back, as sketched below.
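A minimal sketch of the multi-head variant, continuing the NumPy example above; the head count, per-head dimension, and the output projection Wo are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, heads = 4, 8, 2
d_head = d // heads
a = rng.normal(size=(n, d))        # embeddings a1..a4 from step 1

# One (Wq, Wk, Wv) triple per head j, giving q(i,j), k(i,j), v(i,j)
Wq = rng.normal(size=(heads, d, d_head))
Wk = rng.normal(size=(heads, d, d_head))
Wv = rng.normal(size=(heads, d, d_head))
Wo = rng.normal(size=(d, d))       # output projection to recombine heads

outs = []
for j in range(heads):
    q, k, v = a @ Wq[j], a @ Wk[j], a @ Wv[j]
    alpha = q @ k.T / np.sqrt(d_head)
    e = np.exp(alpha - alpha.max(axis=1, keepdims=True))
    outs.append((e / e.sum(axis=1, keepdims=True)) @ v)

b = np.concatenate(outs, axis=1) @ Wo   # concatenate heads, project back
print(b.shape)                          # (4, 8)
```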
Positional encoding
A positional embedding ei is added to ai before attention so the model can use order information, since self-attention itself is order-invariant.
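As a concrete example, here is the fixed sinusoidal encoding from the original Transformer paper (learned position embeddings are also common); d is assumed even.

```python
import numpy as np

def positional_encoding(n_pos, d):
    """e[pos, 2i] = sin(pos / 10000^(2i/d)), e[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_pos)[:, None]          # positions 0..n_pos-1
    i = np.arange(d // 2)[None, :]           # index of each sin/cos pair
    angles = pos / (10000 ** (2 * i / d))
    e = np.zeros((n_pos, d))
    e[:, 0::2] = np.sin(angles)              # even dimensions: sine
    e[:, 1::2] = np.cos(angles)              # odd dimensions: cosine
    return e

a = np.zeros((4, 8))                         # stand-in embeddings a1..a4
a = a + positional_encoding(4, 8)            # ei is added to ai
```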
DETR
1. Use a CNN backbone to extract image features.
2. Use the transformer encoder to encode the features and capture global context.
3. Use the transformer decoder (with learned object queries) to generate predicted boxes.
4. Match predicted boxes to the ground-truth (GT) boxes with bipartite matching and compute the loss on the matched pairs (sketched below).
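A rough PyTorch sketch of the four-step pipeline; the tiny CNN, model sizes, query count, and the L1-only matching cost are placeholder assumptions, not the official DETR implementation (whose matching cost also includes classification and GIoU terms, and which adds positional encodings to the features). Requires scipy for the Hungarian matching.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

num_queries, d_model, num_classes = 10, 64, 3

backbone = nn.Sequential(                    # step 1: CNN feature extractor
    nn.Conv2d(3, d_model, 7, stride=4, padding=3), nn.ReLU(),
    nn.Conv2d(d_model, d_model, 3, stride=2, padding=1),
)
transformer = nn.Transformer(d_model, nhead=4, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)
queries = nn.Parameter(torch.randn(num_queries, d_model))  # object queries
cls_head = nn.Linear(d_model, num_classes + 1)  # +1 class for "no object"
box_head = nn.Linear(d_model, 4)                # (cx, cy, w, h) in [0, 1]

img = torch.randn(1, 3, 64, 64)
feat = backbone(img).flatten(2).transpose(1, 2)  # (1, H*W, d_model) tokens
# steps 2-3: encoder over feature tokens, decoder over the object queries
hs = transformer(feat, queries.unsqueeze(0))     # (1, num_queries, d_model)
logits, boxes = cls_head(hs), box_head(hs).sigmoid()

# step 4: bipartite matching of predictions to GT (Hungarian algorithm)
gt_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.3, 0.3]])
cost = torch.cdist(boxes[0], gt_boxes, p=1)      # L1 box cost only (assumed)
pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
loss = cost[torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)].mean()
```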