Anchor DETR

Anchor DETR (AAAI 2022)

Improvements:

  1. Proposed anchor-based object queries
  2. Proposed an attention variant: Row-Column Decoupled Attention (RCDA)

In the original DETR, the object queries are a set of learnable embeddings. However, each learnable embedding has no clear meaning (it is randomly initialized), so there is no way to explain where a given query will end up concentrating. In addition, since each object query does not focus on a specific region, optimization during training is difficult.

Note on the visualization in DETR: each slot is one of the 100 object queries.

The three prediction patterns at a given point may predict the same object or different ones.

Model overview

There are no particularly big changes from DETR.

The model uses 6 encoder layers and 6 decoder layers; the anchor points are shown in the lower-right corner of the figure.

The position embedding is added to the q and k of the decoder.

Object queries [100, 256]: the anchor points are encoded into position embeddings, which replace the original object queries.
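
As a rough illustration of that wiring, here is a minimal sketch of a DETR-style cross-attention call in which position embeddings are added to q and k only; all sizes and tensor names here are hypothetical, not from the repo:

```python
import torch
import torch.nn as nn

C, Nq, HW = 256, 100, 1350               # hypothetical: channels, queries, H*W tokens
cross_attn = nn.MultiheadAttention(C, num_heads=8)

tgt = torch.zeros(Nq, 1, C)              # decoder content stream (zeros in DETR)
query_pos = torch.randn(Nq, 1, C)        # in Anchor DETR, derived from anchor points
memory = torch.randn(HW, 1, C)           # encoder output
pos_emb = torch.randn(HW, 1, C)          # 2-D position encoding of the memory

# position embeddings go into q and k, but not into the value
out, _ = cross_attn(tgt + query_pos, memory + pos_emb, memory)   # out: [100, 1, 256]
```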


There are two ways to generate anchor points:


(a) Fixed anchors: the points are placed on a uniform grid over the image, sampled uniformly along the width and height.

(b) Learned anchors: a tensor of points is randomly initialized from a uniform distribution on [0, 1] and used as a learnable parameter (an embedding). This variant works better in experiments.
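
A minimal sketch of both options, assuming anchors live in normalized [0, 1] coordinates (grid_anchor_points and the variable names are mine, not the repo's):

```python
import torch
import torch.nn as nn

def grid_anchor_points(n_side):
    # (a) fixed anchors: a uniform n_side x n_side grid over [0, 1] x [0, 1]
    ticks = (torch.arange(n_side) + 0.5) / n_side
    gy, gx = torch.meshgrid(ticks, ticks, indexing="ij")
    return torch.stack([gx, gy], dim=-1).reshape(-1, 2)   # [n_side^2, 2]

# (b) learned anchors: uniformly initialized in [0, 1], trained end to end
learned_anchors = nn.Parameter(torch.rand(100, 2))        # [100, 2]
```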


Converting anchor points into object queries


First, obtain the learned anchor points of shape [100 (N_A), 2];

Then convert them into [100, 256] sinusoidal (sin/cos) position encodings (the function in the code is pos2posemb2d);

Finally, a two-layer MLP (adapt_pos2d in the code) maps them to the position queries Q_p: [100, 256].
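
Putting the three steps together, a sketch of the pipeline; the body of pos2posemb2d is paraphrased from my reading of the official code and may differ in detail, and the two-layer MLP below merely stands in for adapt_pos2d:

```python
import math
import torch
import torch.nn as nn

def pos2posemb2d(pos, num_pos_feats=128, temperature=10000):
    # sin/cos encoding of 2-D points: each coordinate gets num_pos_feats
    # dimensions, concatenated into [..., 2 * num_pos_feats]
    pos = pos * 2 * math.pi
    dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_pos_feats)
    pos_x = pos[..., 0, None] / dim_t
    pos_y = pos[..., 1, None] / dim_t
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=-1).flatten(-2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat((pos_y, pos_x), dim=-1)

# two-layer MLP standing in for adapt_pos2d
adapt_pos2d = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

anchors = torch.rand(100, 2)                  # [N_A, 2] anchor points in [0, 1]
q_p = adapt_pos2d(pos2posemb2d(anchors))      # [100, 256] position queries Q_p
```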

There are some differences between the code and the paper, as follows:

[Code screenshot: differences between the code and the paper.]

Multiple Predictions for Each Anchor Point

Assume there are 100 reference points and each point predicts one target. In a real image, there may be multiple targets near the same point.

Anchor DETR is therefore designed to predict multiple patterns for each point: N_p patterns per point, with N_p = 3.

In the original DETR, the object queries are [100, 256], i.e., each query is [1, 256].

Anchor DETR adds a pattern embedding:

$$Q_f^i = \operatorname{Embedding}(N_p, C)$$
That is, each point has N_p (= 3) patterns, giving [3, 256]. In the paper, N_A = 300 anchor points with N_p = 3 patterns each yields 900 queries in total.

In the end, you only need to add the pattern embedding Q_f and the anchor-point query Q_p to get the final object query. The pattern-position query can be expressed as:

$$Q^{ij} = Q_f^i + Q_p^j, \quad i = 1, \dots, N_p,\ j = 1, \dots, N_A$$

In fact, the above formula is not used in the code.

In the code (see the code screenshot above), the reference points are instead directly repeated from 300 to 900.

Please correct me if I have misunderstood something.

In the code, the pattern embedding is the input (tgt) of the first decoder layer, whereas in the original DETR the tgt is all zeros.
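
A sketch of how this looks, assuming N_A = 300 anchors and N_p = 3 patterns (the exact tiling order in the official repo may differ):

```python
import torch
import torch.nn as nn

num_anchors, num_patterns, C = 300, 3, 256

pattern = nn.Embedding(num_patterns, C)               # Q_f: [3, 256]
anchors = nn.Parameter(torch.rand(num_anchors, 2))    # learned anchor points

# reference points repeated from 300 to 900 (one copy per pattern)
reference_points = anchors.repeat(num_patterns, 1)            # [900, 2]

# tgt of the first decoder layer: one pattern embedding per repeated anchor
tgt = pattern.weight.repeat_interleave(num_anchors, dim=0)    # [900, 256]
```

This pairs each pattern with every anchor by plain repetition, matching the repeat-based implementation rather than the paper's formula.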


Row-Column Decoupled Attention

What RCDA reduces is memory overhead!

The anchor design accelerates convergence, but the query length grows to 900; row-column decoupled attention reduces the resulting memory overhead.


The original transformer flattens the H×W input tokens into a one-dimensional sequence; RCDA instead attends over the two axes separately:

A_x (over W): row attention, computed first along the width dimension; applying it to V gives an intermediate feature Z.

A_y (over H): column attention, computed along the height dimension.

A_y and Z are then combined by a weighted sum along the height dimension.

Q and K are both decomposed into row and column components; V is not decomposed. (For comparison, the standard attention weight matrix is [Nq, H*W].)
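
A minimal single-head sketch of the decoupled computation; how the 1-D keys k_row / k_col are produced from the key feature map (e.g. 1-D pooling plus 1-D position encodings) is glossed over, and all names are mine:

```python
import torch
import torch.nn.functional as F

def rcda_single_head(q_row, q_col, k_row, k_col, v):
    # q_row, q_col: [Nq, C]; k_row: [W, C]; k_col: [H, C]
    # v: [H, W, C] -- the value is NOT decoupled
    C = q_row.shape[-1]
    # row attention A_x over the width dimension: [Nq, W]
    a_x = F.softmax(q_row @ k_row.t() / C ** 0.5, dim=-1)
    # weighted sum of V over the width -> intermediate Z: [Nq, H, C]
    z = torch.einsum("qw,hwc->qhc", a_x, v)
    # column attention A_y over the height dimension: [Nq, H]
    a_y = F.softmax(q_col @ k_col.t() / C ** 0.5, dim=-1)
    # weighted sum of Z over the height -> output: [Nq, C]
    return torch.einsum("qh,qhc->qc", a_y, z)

out = rcda_single_head(torch.randn(900, 256), torch.randn(900, 256),
                       torch.randn(32, 256), torch.randn(32, 256),
                       torch.randn(32, 32, 256))              # [900, 256]
```

With these sizes the two attention maps are [900, 32] each, versus a single [900, 1024] map for standard attention over the flattened 32×32 features.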

Memory of the attention weight maps:

Original attention: Nq * H * W * M (M = number of heads)

RCDA:

A_x: Nq * W * M

A_y: Nq * H * M

You only need to compare the dominant terms of the two approaches. The right side of the figure gives the ratio: on the RCDA side the dominant term is the intermediate result Z of size Nq * H * C, so the ratio is (Nq * H * W * M) / (Nq * H * C) = W * M / C. Assuming W = 32 (a DC5 feature map), M = 8 and C = 256, we get W * M = 256 = C, so the memory cost is about the same; in general, compare C against W * M.

DC5 means that a dilated (atrous) convolution is used in the last stage of the backbone (ResNet-50 by default) and one downsampling step is removed, doubling the feature-map resolution.
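
A quick back-of-the-envelope check of that ratio, taking the dominant RCDA term to be the intermediate Z of size Nq * H * C (sizes below are illustrative):

```python
Nq, H, W, M, C = 900, 32, 32, 8, 256   # illustrative DC5-like sizes
std_attn  = Nq * H * W * M             # standard attention weight maps
rcda_main = Nq * H * C                 # dominant RCDA intermediate (Z)
print(std_attn / rcda_main)            # 1.0
print(W * M / C)                       # 1.0 -> the ratio reduces to W*M/C
```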

Experiments

1. Comparing the memory usage and AP of different linear attention variants.


2. Pattern (a) tends to detect large objects, pattern (b) small objects, and pattern (c) is more balanced.


Reference

https://www.bilibili.com/video/BV148411M7ev/?spm_id_from=333.788&vd_source=4e2df178682eb78a7ad1cc398e6e154d
