Transformer and DETR

RNNs
are hard to parallelize: each step depends on the previous hidden state, so the sequence must be processed serially.

Transformer
1. Input vectors x1-x4 are each multiplied by an embedding matrix W to obtain embedding vectors a1-a4.
2. Each ai is multiplied by Wq, Wk, and Wv to obtain the query, key, and value vectors qi, ki, and vi (i = 1, 2, 3, 4).
3. q1 is dotted with each key ki (q1·k1, q1·k2, ...) to get attention scores a1,i, which are then scaled (divided by the square root of the key dimension).
4. A softmax over the scores a1,i yields weights ~a1,i; multiplying each vi by its weight ~a1,i and summing gives b1. Repeating the same computation with q2, q3, q4 gives the remaining bi.
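The four steps above can be sketched in NumPy; the function and variable names here are illustrative, and the toy dimensions (4 inputs of size 3) match the x1-x4 example:

```python
import numpy as np

def self_attention(A, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of embeddings A (n, d)."""
    Q = A @ Wq                     # queries q_i (step 2)
    K = A @ Wk                     # keys    k_i
    V = A @ Wv                     # values  v_i
    d_k = K.shape[-1]
    # step 3: dot-product scores q_i . k_j, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # step 4: softmax over each row, then weighted sum of the values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V             # rows are b_1 ... b_n

# toy example: 4 input embeddings a1..a4 of dimension 3
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
Wq, Wk, Wv = (rng.normal(size=(3, 3)) for _ in range(3))
B = self_attention(A, Wq, Wk, Wv)
print(B.shape)  # (4, 3): one output vector b_i per input
```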
Multi-head self-attention
modifies the second step of self-attention: each ai is multiplied by several sets of matrices Wq_j, Wk_j, Wv_j (one set per head j), giving qi,j, ki,j, vi,j. Attention is computed independently per head, and the heads' outputs are concatenated and projected back to the model dimension.
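A minimal NumPy sketch of the multi-head variant, under the common convention that each head works on a slice of size d/n_heads (names and dimensions are illustrative):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention: project, split into heads, attend, concat."""
    n, d = X.shape
    d_h = d // n_heads
    # project once, then split the last dimension into heads: (h, n, d_h)
    Q = (X @ Wq).reshape(n, n_heads, d_h).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, n_heads, d_h).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, n_heads, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # per-head (n, n) scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    heads = w @ V                                      # (h, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # re-join the heads
    return concat @ Wo                                 # final output projection

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2)
print(out.shape)  # (4, 8)
```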

Positional encoding
Self-attention is order-invariant, so a positional embedding is added to each input embedding ai to inject sequence-order information.
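One common choice (used in the original Transformer paper) is a fixed sinusoidal encoding; a sketch, with illustrative sizes:

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """Fixed sinusoidal positional embeddings: sin on even dims, cos on odd."""
    pos = np.arange(n_pos)[:, None]        # positions 0 .. n_pos-1
    i = np.arange(d_model)[None, :]        # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimensions: cosine
    return pe

pe = sinusoidal_positions(10, 16)
print(pe.shape)  # (10, 16)
# the position embedding is simply added to the input embedding: a_i + pe_i
```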

DETR
1. Use a CNN backbone to extract image features.
2. Use the transformer encoder to model global context over those features.
3. Use the transformer decoder to generate a fixed set of predicted boxes.
4. Match the predicted boxes to the ground-truth (GT) boxes with a bipartite-matching loss.
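Step 4's bipartite matching can be sketched as below. This is a simplified illustration: the cost here is just the pairwise L1 distance between boxes, whereas DETR's actual matching cost also includes a classification term and a generalized-IoU term, and uses the Hungarian algorithm rather than brute force; the names and toy boxes are made up for the example:

```python
import itertools
import numpy as np

def match_predictions(pred_boxes, gt_boxes):
    """Assign a distinct prediction to each GT box, minimizing total cost.

    Simplified cost: pairwise L1 distance between (cx, cy, w, h) boxes.
    Brute force over permutations stands in for the Hungarian algorithm.
    """
    # cost[i, j] = L1 distance between prediction i and ground truth j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    n_gt = len(gt_boxes)
    best, best_cost = None, np.inf
    for perm in itertools.permutations(range(len(pred_boxes)), n_gt):
        c = cost[list(perm), list(range(n_gt))].sum()
        if c < best_cost:
            best, best_cost = perm, c
    return sorted((p, g) for g, p in enumerate(best))

# toy example: 3 predictions, 2 ground-truth boxes, all (cx, cy, w, h)
preds = np.array([[0.5, 0.5, 0.2, 0.2],
                  [0.1, 0.1, 0.1, 0.1],
                  [0.9, 0.9, 0.3, 0.3]])
gts = np.array([[0.12, 0.1, 0.1, 0.1],
                [0.88, 0.9, 0.3, 0.3]])
print(match_predictions(preds, gts))  # [(1, 0), (2, 1)]
```

Unmatched predictions (here, prediction 0) are trained to predict the "no object" class; matched pairs contribute box and class losses.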

Origin blog.csdn.net/threestooegs/article/details/129678202