Paper reading notes 9 - Deformable DETR: Deformable Transformers for End-to-End Object Detection


2022.4.29 Supplement
I was writing a paper today and briefly reviewed the complexity comparison between DETR and Deformable DETR, so I'm posting it here for reference.

DETR complexity: (figures omitted; the derivation is reproduced in Section 3 below: the encoder self-attention alone is $O((HW)^2C)$.)

Deformable DETR complexity: (figure omitted; see Section 4.1: the deformable attention module is $O(2N_qC^2+\min(HWC^2,N_qKC^2))$, which becomes $O(HWC^2)$, i.e. linear in $HW$, in the encoder.)



To address DETR's slow convergence and long training time, this paper analyzes the causes and proposes Deformable DETR, which combines the sparse spatial sampling of deformable conv with the Transformer's strength at modeling the relationships between elements.

The paper is well written, but not easy to read.


-1.Preliminaries: deformable conv and DETR

You need to understand these two first.

-1.1 deformable conv

Deformable conv was originally designed to let CNNs adapt to geometric variations of the target (scale, pose, etc.). Previously, this problem was handled either with data augmentation or with hand-crafted methods such as SIFT.

Compared with the traditional square convolution, deformable convolution simply adds a 2D offset to each sampling position. This 2D offset is learned by passing the feature map through an additional convolution layer.

Expressed as a formula, the traditional 2D convolution can be written as:
$$y(p_0)=\sum_{p_n\in R}w(p_n)\,x(p_0+p_n)$$
where $p_0,p_n\in\mathbb R^2$ and $R$ is the index set of the kernel.

If we add a 2D offset $\Delta p_n$ (note that it is added inside $x(\cdot)$, i.e. it shifts the sampling location), we get:
$$y(p_0)=\sum_{p_n\in R}w(p_n)\,x(p_0+p_n+\Delta p_n)$$
where the set $\{\Delta p_n\}$ has the same size as $R$, that is, each sampling point gets its own offset.

But the learned offsets are usually fractional, so bilinear interpolation is used to compute the sampled values. Simply put, bilinear interpolation computes the value at an arbitrary (non-integer) location from its four nearest grid neighbors.
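As a concrete illustration, here is a minimal sketch of this sampling step in PyTorch (my own toy code, not the official deformable conv implementation; `bilinear_sample` and `deform_conv_at` are hypothetical helper names, and the offsets `dp` would in practice come from an extra conv layer applied to the feature map):

```python
import torch

def bilinear_sample(x, p):
    """Sample feature map x (H, W) at fractional locations p (N, 2) = (row, col)."""
    H, W = x.shape
    # four nearest integer neighbors, clamped to the map
    r0 = p[:, 0].floor().clamp(0, H - 1).long()
    c0 = p[:, 1].floor().clamp(0, W - 1).long()
    r1 = (r0 + 1).clamp(0, H - 1)
    c1 = (c0 + 1).clamp(0, W - 1)
    # fractional parts act as interpolation weights
    dr = (p[:, 0] - r0.float()).clamp(0, 1)
    dc = (p[:, 1] - c0.float()).clamp(0, 1)
    top = x[r0, c0] * (1 - dc) + x[r0, c1] * dc
    bot = x[r1, c0] * (1 - dc) + x[r1, c1] * dc
    return top * (1 - dr) + bot * dr

def deform_conv_at(x, w, p0, R, dp):
    """Deformable conv output at one location p0: weighted sum over the kernel grid R,
    each tap shifted by its learned offset dp.
    w: (K,) kernel weights, R: (K, 2) integer kernel offsets, dp: (K, 2) fractional offsets."""
    samples = bilinear_sample(x, p0 + R + dp)
    return (w * samples).sum()
```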

-1.2 DETR

The overall pipeline of DETR: a CNN backbone extracts a feature map, which is flattened into a one-dimensional sequence and fed into the Encoder (with positional encodings added); the learned object queries and the Encoder output then go into the Decoder; finally, the output embeddings produced by the Decoder are passed through FFNs for the different prediction tasks.

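To make the pipeline concrete, here is a shape-level sketch of such a forward pass (a toy model under my own assumptions, e.g. a single-conv stand-in backbone and PyTorch's built-in `nn.Transformer`; this is not the official DETR code, and positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

class TinyDETR(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in for a CNN backbone
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.bbox_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h), normalized

    def forward(self, images):                       # images: (B, 3, H, W)
        f = self.backbone(images)                    # (B, C, h, w) feature map
        B, C, h, w = f.shape
        src = f.flatten(2).transpose(1, 2)           # flatten to (B, h*w, C)
        tgt = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)  # (B, N, C)
        hs = self.transformer(src, tgt)              # encoder over pixels, decoder over queries
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```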


0.Abstract

DETR is a good paradigm for end-to-end object detection, but it suffers from slow convergence and limited feature spatial resolution, which stems from the limitations of Transformer attention modules in processing image feature maps.

To solve this problem, the authors propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference point.

Compared with DETR it achieves better performance, especially on small objects, with about 10x fewer training epochs.


1.Introduction

DETR has two main flaws:

  1. It takes many training epochs to converge, roughly 10~20 times as many as Faster R-CNN.
  2. Its ability to detect small objects is poor. Small objects are usually detected on high-resolution feature maps, but for DETR, high-resolution feature maps lead to prohibitively large complexity.

The author attributes these problems to the Transformer's handling of image feature maps. Specifically, at initialization the attention module assigns nearly uniform weights to all pixels of the feature map, and many epochs are needed before the attention learns to focus on the sparse meaningful locations. Moreover, the attention computation is quadratic in the number of pixels, so the computational cost is very high.

Regarding complexity, suppose the feature map is $c\times h\times w$ (below I write the channel dimension as $d$):
1. Convolution: computing one output element of a $k\times k$ convolution costs $O(k^2d)$; assuming the output dimension is also $d$, there are $hwd$ output elements, so the complexity is $O(hwk^2d^2)$.
2. Self-attention: from ${\rm SelfAttn}(Q,K,V)={\rm softmax}(\frac{QK^T}{\sqrt{d_k}})V$ with $Q,K,V\in\mathbb R^{n\times d}$:
$QK^T$: each of the $n^2$ resulting elements needs $d$ multiplications, so $n^2d$ multiplications in total;
the softmax itself costs $O(n^2)$;
multiplying the softmax result by $V$: likewise, each of the $nd$ resulting elements needs $n$ multiplications, so another $n^2d$ multiplications;
so the total complexity is $O(n^2d)$.
3. DETR: after flattening the feature map, $n=hw$, so the complexity is $O((hw)^2d)$, which is why DETR's complexity is quadratic in the number of pixels.
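To get a feel for these numbers, here is a rough multiply count for an illustrative feature-map size (my own example values, not from the paper):

```python
h, w, d, k = 32, 32, 256, 3        # illustrative feature-map size and kernel size
n = h * w

conv_mults = h * w * k**2 * d**2   # O(hw k^2 d^2) for a k x k convolution
attn_mults = 2 * n**2 * d          # O(n^2 d): QK^T plus softmax(...) V
print(f"conv: {conv_mults:.2e}")   # ~6.0e+08
print(f"attn: {attn_mults:.2e}")   # ~5.4e+08, but it grows 16x if h and w both double
```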

As mentioned later, deformable conv is very effective in the detection of sparse targets, avoiding the problems mentioned above. But it lacks the ability to model the relationship between elements, which is exactly what DETR (Transformer) is good at.

Deformable DETR combines the advantages of deformable conv's sparse spatial sampling with the ability of Transformer to learn the relationship between inputs.

The author proposes a deformable attention module that attends only to a small set of sampling locations, acting as a pre-filter that picks out the important key elements among all the feature-map pixels. The module can be naturally extended to aggregate multi-scale features, without needing a feature pyramid network.

In Deformable DETR, the author replaces the Transformer attention modules that process feature maps with the deformable attention module.


2.Related Work

2.1 Efficient attention mechanism.

Traditional Transformers have high time and memory complexity. Many works have been devoted to solving this problem, which can be roughly divided into three categories:

  1. The first type uses a predefined sparse attention pattern on the keys. The most direct form restricts attention to a fixed local window (i.e. attention is computed only over a small region), but this still loses global information. So some works attend to key elements at fixed intervals (similar in spirit to kernel dilation) to enlarge the receptive field, some introduce special tokens that can access all key elements to retain global awareness, and some use preset sparse patterns to directly attend to distant key elements.

  2. The second category learns data-dependent sparse attention. (I don't understand these methods.)

  3. The third category explores the low-rank property of self-attention. Wang et al. reduce the number of key elements with a linear projection along the size (sequence-length) dimension instead of the channel dimension. Others rewrite the computation of self-attention through kernelization approximations (there is a view that the Transformer is essentially a kernel method).

Deformable DETR belongs to the second way of reducing computation (learning sparse attention from the data): it only attends to a small, fixed-size set of sampling points, which is predicted from the feature of the query element.

2.2 Multi-scale feature representation for object detection.

The representative multi-scale method is FPN, and there are a bunch of other networks that I have never studied anyway. The author's point is that the proposed multi-scale deformable attention module can naturally aggregate multi-scale feature maps through the attention mechanism itself, without going through FPN.

3. Revisiting Transformers and DETR

This part mainly further analyzes the computational complexity of Multi-head self attention and DETR.

The author writes multi-head self-attention as:

$${\rm MultiHeadAttn}(z_q,x)=\sum_{m=1}^M W_m\Big[\sum_{k\in\Omega_k}A_{mqk}\,W_m^Tx_k\Big]$$

where $z_q,x_k\in\mathbb R^C$, and $C$ is the channel dimension of the feature map (both queries and keys are pixels of the feature map);
$A_{mqk}\propto\exp\{\frac{z_q^TU_m^TV_mx_k}{\sqrt{C_v}}\}$ is the attention weight (normalized over $k$), with $U_m,V_m\in\mathbb R^{C_v\times C}$ and $C_v=C/M$;
$W_m\in\mathbb R^{C\times C_v}$ is the output projection of the $m$-th head;
$z_q,x_k$ are usually the sum or the concatenation of the element content and the positional encoding.

This formula looks a bit confusing. The more familiar form of multi-head attention is the following (Cordonnier et al., On the Relationship between Self-Attention and Convolutional Layers):

$${\rm MHSA}(X)={\rm concat}_{h\in[N_h]}\big[{\rm SelfAttn}_h(X)\big]W_{out}+b_{out}$$

The two formulas are the same thing. The paper's formula takes two arguments and computes the attention of a single query against all keys, while the formula above takes a matrix containing all the vectors, i.e. it is the matrix form of attention. The only differences are that the paper's formula ignores the bias and uses different letters ($U$, $V$) for the projections.
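A minimal sketch of the paper's per-query form (my own code, biases omitted; the value projection is taken as $W_m^T$, matching the formula above):

```python
import torch

def multi_head_attn(z_q, x, U, V, W):
    """Per-query multi-head attention in the paper's notation (a sketch).
    z_q: (C,) query feature; x: (Nk, C) key features.
    U, V: (M, Cv, C) per-head projections; W: (M, C, Cv) per-head output projections."""
    M, Cv, C = U.shape
    out = torch.zeros(C)
    for m in range(M):
        # attention logits z_q^T U_m^T V_m x_k / sqrt(Cv), softmax-normalized over k
        logits = (U[m] @ z_q) @ (V[m] @ x.T) / Cv ** 0.5       # (Nk,)
        A = torch.softmax(logits, dim=0)                        # A_{mqk}
        values = W[m].T @ x.T                                   # (Cv, Nk): the W_m^T x_k terms
        out = out + W[m] @ (values @ A)                         # W_m [ sum_k A_{mqk} W_m^T x_k ]
    return out

# usage: C = 256 channels, M = 8 heads, Cv = C // M, Nk = h * w flattened pixels
```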

Complexity Analysis

How is the complexity of multi-head attention computed? The paper gives $O(N_qC^2+N_kC^2+N_qN_kC)$.

It can be understood as follows: projecting each query costs $C^2$ multiplications (e.g. $U_mz_q$ over all $M$ heads is $M\cdot C_vC=C^2$), giving $N_qC^2$ in total; projecting each key/value (e.g. $V_mx_k$ and $W_m^Tx_k$) likewise gives $N_kC^2$; and computing the $N_qN_k$ attention weights plus the weighted sums costs $N_qN_kC$. Dropping constants, the total is $N_qC^2+N_kC^2+N_qN_kC$.

Generally $N_q=N_k\gg C$, so the complexity simplifies to $O(N_qN_kC)$.
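For reference, here is my own term-by-term count (constants dropped) behind that expression:

```latex
\begin{aligned}
\text{project queries } (U_m z_q):\quad & N_q \cdot M \cdot C_vC = N_qC^2\\
\text{project keys/values } (V_m x_k,\ W_m^T x_k):\quad & N_k \cdot M \cdot C_vC = N_kC^2\\
\text{attention weights } (z_q^TU_m^TV_mx_k):\quad & N_qN_k \cdot M \cdot C_v = N_qN_kC\\
\text{weighted sums and output } (W_m[\cdot]):\quad & N_qN_kC + N_qC^2\\[2pt]
\text{total:}\quad & O(N_qC^2+N_kC^2+N_qN_kC)
\end{aligned}
```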

Look at DETR again

As mentioned earlier, the Encoder of DETR takes the (flattened) feature map as input. Following the calculation above, the complexity of its self-attention is $O((hw)^2C)$.

The Decoder's attention consists of two parts: self-attention, and cross-attention with the Encoder output.

In cross-attention, the keys are the pixels of the feature map, so as before $N_k=hw$. The Decoder input is a fixed number of object queries, $N$ of them (usually 100), so $N_q=N$. Substituting into the formula above gives a cross-attention complexity of $O(hwC^2+NhwC)$.

One term seems to be missing: the full expression is $hwC^2+NC^2+NhwC$, but since $N\ll hw$ the $NC^2$ term is dominated by $hwC^2$ and is dropped.

In self-attention, the relationships among the object queries are learned, so $N_q=N_k=N$, and substituting gives $O(2NC^2+N^2C)$.
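Plugging in illustrative sizes (my own numbers, e.g. a $32\times 32$ feature map with $C=256$ and $N=100$ queries) shows how the three parts compare:

```python
h, w, C, N = 32, 32, 256, 100           # illustrative sizes, not from the paper
hw = h * w

enc_self  = hw * hw * C                 # encoder self-attention:   ~2.7e8
dec_cross = hw * C**2 + N * hw * C      # decoder cross-attention:  ~9.3e7
dec_self  = 2 * N * C**2 + N**2 * C     # decoder self-attention:   ~1.6e7
print(enc_self, dec_cross, dec_self)    # the encoder term dominates and scales as (hw)^2
```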

4.Method

4.1 Deformable attention module

The core problem with applying Transformer attention to image feature maps is that it looks over all possible spatial positions (which is unnecessary).

To solve this problem, the author proposes the deformable attention module.
Inspired by deformable convolution, deformable attention only attends to a small subset of keys rather than to all of the feature map. By assigning only a small number of keys to each query, the problems of slow convergence and limited feature spatial resolution can be alleviated.

The formula is expressed as follows:

$${\rm DeformAttn}(z_q,p_q,x)=\sum_{m=1}^M W_m\Big[\sum_{k=1}^{K}A_{mqk}\,W_m^Tx(p_q+\Delta p_{mqk})\Big]$$

where:

$x\in\mathbb R^{C\times H\times W}$ is the feature map; if $q$ indexes a query, then $z_q$ is its query feature and $p_q=(p_{qx},p_{qy})$ is its 2D reference point (that is, the formula computes the attention of query $z_q$, located at reference point $p_q$, over the feature map $x$). $W_m$ and $A_{mqk}$ have the same meaning as before, and $\Delta p_{mqk}\in\mathbb R^2$ is a learnable offset relative to the reference point, a real-valued quantity with unrestricted range.

In addition, since only a small subset of keys is attended to, $K\ll HW$.

As in deformable convolution, $p_q+\Delta p_{mqk}$ is fractional, so bilinear interpolation is used to compute $x(p_q+\Delta p_{mqk})$.

In the implementation, $z_q$ goes through a linear layer with $3MK$ output channels: the first $2MK$ channels predict the $MK$ offsets $\Delta p_{mqk}$, and the remaining $MK$ channels give the $MK$ attention weights $A_{mqk}$ (normalized with a softmax).

The computational complexity of deformable attention is $O(2N_qC^2+\min(HWC^2,N_qKC^2))$. In the encoder, where $N_q=HW$, this becomes $O(HWC^2)$, i.e. linear in $HW$ rather than DETR's $O((HW)^2C)$, which is where the reduction in complexity comes from.

(Figure: the entire deformable attention module; linear layers on the query feature predict the sampling offsets and attention weights, which are then used to sample and aggregate the value feature map.)
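Putting the pieces together, here is a simplified single-scale sketch of the module (my own PyTorch approximation using `F.grid_sample`, not the official CUDA implementation; for simplicity the offsets are predicted directly in normalized coordinates, and reference points are (x, y) pairs in [0, 1]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    """Simplified single-scale deformable attention (a sketch, not the official op)."""
    def __init__(self, C=256, M=8, K=4):
        super().__init__()
        self.M, self.K, self.Cv = M, K, C // M
        self.offset_proj = nn.Linear(C, 2 * M * K)  # 2MK channels -> MK 2D offsets
        self.weight_proj = nn.Linear(C, M * K)      # MK channels -> attention weights A_mqk
        self.value_proj = nn.Linear(C, C)           # the W_m^T value projections, fused over heads
        self.out_proj = nn.Linear(C, C)             # the W_m output projections, fused over heads

    def forward(self, z_q, ref_points, x):
        # z_q: (B, Nq, C), ref_points: (B, Nq, 2) as (x, y) in [0, 1], x: (B, C, H, W)
        B, Nq, C = z_q.shape
        H, W = x.shape[-2:]
        offsets = self.offset_proj(z_q).view(B, Nq, self.M, self.K, 2)   # normalized-unit offsets
        A = self.weight_proj(z_q).view(B, Nq, self.M, self.K).softmax(-1)
        loc = (ref_points.view(B, Nq, 1, 1, 2) + offsets) * 2 - 1        # to [-1, 1] for grid_sample
        # split values into M heads of Cv channels and bilinearly sample them at the K locations
        v = self.value_proj(x.flatten(2).transpose(1, 2))                # (B, HW, C)
        v = v.transpose(1, 2).reshape(B * self.M, self.Cv, H, W)
        grid = loc.permute(0, 2, 1, 3, 4).reshape(B * self.M, Nq, self.K, 2)
        sampled = F.grid_sample(v, grid, align_corners=False)            # (B*M, Cv, Nq, K)
        sampled = sampled.view(B, self.M, self.Cv, Nq, self.K)
        out = (sampled * A.permute(0, 2, 1, 3).unsqueeze(2)).sum(-1)     # weighted sum over K
        return self.out_proj(out.permute(0, 3, 1, 2).reshape(B, Nq, C))  # concat heads, project
```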
4.2 Multi-scale deformable attention module

It was said earlier that plain Transformer attention cannot exploit multi-scale information and therefore does not detect small objects well. How does the author handle this? By taking feature maps of several scales as input, which only requires a slight modification of the formula in 4.1:

Let $\{x^l\}_{l=1}^L$ denote the $L$ feature maps of different scales (extracted from different stages of the CNN backbone, e.g. several stages of ResNet-50), and let $\hat{p_q}\in[0,1]^2$ denote the normalized coordinates of the reference point. Then:

$${\rm MSDeformAttn}(z_q,\hat{p_q},\{x^l\}_{l=1}^L)=\sum_{m=1}^M W_m\Big[\sum_{l=1}^L\sum_{k=1}^{K}A_{mlqk}\,W_m^Tx^l\big(\phi_l(\hat{p_q})+\Delta p_{mlqk}\big)\Big]$$

where the function $\phi_l$ is the inverse transformation, mapping the normalized coordinates back to the coordinate frame of the $l$-th feature map.

When $L=1$ and $K=1$, the module degenerates into deformable convolution; when the sampling points traverse all possible locations, it becomes equivalent to ordinary Transformer attention.
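A small sketch of the $\phi_l$ rescaling and the extra dimensions this adds (shapes only, continuing the assumptions of the single-scale sketch above; reference points are taken as (x, y) pairs):

```python
import torch

def phi(ref_hat, spatial_shapes):
    """Map reference points normalized to [0, 1]^2 back to each level's own coordinates.
    ref_hat: (B, Nq, 2) as (x, y); spatial_shapes: list of (H_l, W_l) per level."""
    return [ref_hat * ref_hat.new_tensor([W_l, H_l]) for (H_l, W_l) in spatial_shapes]

# With L levels, the linear layer on z_q now outputs 3*M*L*K channels per query:
# 2*M*L*K sampling offsets Delta p_{mlqk} plus M*L*K attention weights A_{mlqk},
# and the sampled values are summed over both l and k before the output projection W_m.
```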

4.3 Encoder of the deformable transformer

The author replaces the attention modules in the DETR Encoder with the multi-scale deformable attention described above. The Encoder's input is a set of multi-scale feature maps, and its output is a set of multi-scale feature maps with the same resolutions.

The reason is that self-attention keeps the number of elements unchanged between input and output.

As mentioned before, the feature maps generally come from features extracted at different stages of ResNet. The author uses the C3~C5 stages of ResNet-50 and applies 1x1 convolutions to change the number of channels to 256, as shown in the following figure:

(Figure: the multi-scale feature maps, taken from ResNet-50 stages C3~C5 and each projected to 256 channels by a 1x1 convolution.)

Naturally, for 2D input, both the queries and the keys are the pixels of the feature maps. However, because multiple scales are used, in addition to the positional encoding, an extra scale-level embedding is introduced to indicate which scale each pixel comes from.

This scale-level embedding is also learnable.
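A sketch of how such encoder inputs could be assembled (my own code using torchvision's `resnet50` layer names; the actual repository uses its own backbone wrapper, and the 2D positional encodings are omitted here):

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)
body = nn.ModuleDict({
    "stem": nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool),
    "C3": nn.Sequential(backbone.layer1, backbone.layer2),    # stride 8,  512 channels
    "C4": backbone.layer3,                                     # stride 16, 1024 channels
    "C5": backbone.layer4,                                     # stride 32, 2048 channels
})
proj = nn.ModuleList([nn.Conv2d(c, 256, 1) for c in (512, 1024, 2048)])  # 1x1 conv to 256
level_embed = nn.Parameter(torch.zeros(3, 256))               # learnable scale-level embedding

def build_encoder_input(images):
    x = body["stem"](images)
    c3 = body["C3"](x); c4 = body["C4"](c3); c5 = body["C5"](c4)
    feats = []
    for lvl, f in enumerate([c3, c4, c5]):
        f = proj[lvl](f)                                  # (B, 256, H_l, W_l)
        f = f.flatten(2).transpose(1, 2)                  # (B, H_l*W_l, 256)
        feats.append(f + level_embed[lvl])                # add scale-level embedding
        # (2D positional encodings would also be added here)
    return torch.cat(feats, dim=1)                        # tokens from all levels, concatenated
```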

4.4 Decoder of the deformable transformer

Like DETR, the decoder contains both self-attention and cross-attention. Self-attention still learns the relationships among the object queries, which has nothing to do with the benefit of deformable attention, so the self-attention module is left unchanged; only the cross-attention module is replaced with multi-scale deformable attention. In cross-attention, the pixels of the feature maps output by the Encoder serve as the keys, and the object queries extract features from those maps.

DETR attaches different networks afterwards to perform the category and bbox predictions. Here, since we already have the learned reference point $\hat{p_q}$, the author uses it to reduce the difficulty of detection: specifically, the reference point serves as the initial guess of the box center, and the head's raw outputs are combined with it by a simple transformation. For example:

$$\hat{b_q}=\big(\sigma(b_{qx}+\sigma^{-1}(\hat{p_{qx}})),\ \sigma(b_{qy}+\sigma^{-1}(\hat{p_{qy}}))\big)$$

where $\sigma$ denotes the sigmoid function (so $\hat{b_q}$ is also normalized), and $b_{qx},b_{qy}$ are the learned (predicted) quantities.
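A tiny numeric illustration of this re-parameterization (toy numbers of my own):

```python
import torch

def refine_center(b_q, p_hat):
    """Box center from predicted quantities b_q and reference point p_hat (both normalized)."""
    inv_sigmoid = torch.log(p_hat / (1 - p_hat))       # sigma^{-1}(p_hat)
    return torch.sigmoid(b_q + inv_sigmoid)

p_hat = torch.tensor([0.30, 0.60])    # reference point = initial guess of the box center
b_q   = torch.tensor([0.00, 0.00])    # zero prediction keeps the center at the reference point
print(refine_center(b_q, p_hat))      # tensor([0.3000, 0.6000])
```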

I haven't read the additional improvements yet; I'll come back to them if needed.
This paper proposes multi-scale deformable attention, which is quite clever, and the writing is not bad, but I still haven't properly learned how to compute the computational complexity. There is no end to learning; this stuff is deep.


Origin blog.csdn.net/wjpwjpwjp0831/article/details/121888314