Computer vision algorithm - Transformer-based object detection (DETR / Deformable DETR / DETR3D)

DETR stands for DEtection TRansformer. The method was published at ECCV 2020 in the paper "End-to-End Object Detection with Transformers".

Traditional object detectors are built on proposals, anchors, or anchor-free designs, and at minimum require non-maximum suppression to post-process the network outputs, which brings a complicated tuning process. DETR instead uses a Transformer encoder-decoder structure and a set prediction loss to achieve truly end-to-end object detection. How is the Transformer encoder-decoder implemented? What is the set prediction loss? Details are given below.

Readers who are not very familiar with object detection can refer to Computer Vision Algorithm - Summary of Object Detection Networks.

1. DETR

The DETR network structure is shown in the figure below:
[Figure: DETR network structure]
In the first step, a CNN extracts features from the input image and the feature map is flattened and fed into the Transformer encoder-decoder. In the second step, the Transformer encoder helps the network learn global features. In the third step, the Transformer decoder uses the Object Queries to extract the objects to be detected from these features. In the fourth step, the Object Query outputs are matched against the ground truth by bipartite matching (the Set-to-Set loss), and the classification loss and box regression loss are computed only on the matched pairs.

The above is the basic training process. Inference differs only in the fourth step: the final detections are obtained by simply thresholding the Object Query outputs. This result needs no post-processing and is used directly as the final output.
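As a concrete illustration, here is a minimal sketch of that thresholding step, assuming the output format of the official DETR repository (a dict with pred_logits and pred_boxes, where the last class is "no object"); the function name and threshold value are illustrative, not part of the original code:

import torch

@torch.no_grad()
def detr_postprocess(outputs, score_thresh=0.7):
    # outputs["pred_logits"]: [batch, num_queries, num_classes + 1] (last class = "no object")
    # outputs["pred_boxes"]:  [batch, num_queries, 4] in normalized (cx, cy, w, h)
    probs = outputs["pred_logits"].softmax(-1)[..., :-1]   # drop the "no object" class
    scores, labels = probs.max(-1)                          # best class per query
    keep = scores > score_thresh                            # simple score threshold, no NMS
    results = []
    for b in range(scores.shape[0]):
        results.append({
            "scores": scores[b][keep[b]],
            "labels": labels[b][keep[b]],
            "boxes": outputs["pred_boxes"][b][keep[b]],
        })
    return results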

Below we go through the details of the Transformer Encoder-Decoder and the Set-to-Set Loss together with the code:

1.1 Transformer Encoder-Decoder

The Transformer Encoder-Decoder structure is shown in the figure below, where the red annotations give the feature size at each step for an input image of size $3 \times 800 \times 1066$.
[Figure: DETR Transformer Encoder-Decoder with feature sizes]
The forward function of Transformer Encoder-Decoder is as follows:

def forward(self, src, mask, query_embed, pos_embed):
    # flatten NxCxHxW to HWxNxC
    bs, c, h, w = src.shape
    src = src.flatten(2).permute(2, 0, 1)                     # [H*W, N, C]
    pos_embed = pos_embed.flatten(2).permute(2, 0, 1)         # [H*W, N, C]
    query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)   # [num_queries, N, C]
    mask = mask.flatten(1)                                    # [N, H*W]

    tgt = torch.zeros_like(query_embed)                       # decoder input starts from zeros
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
    hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                      pos=pos_embed, query_pos=query_embed)
    return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)

Here:
src is the feature map extracted by the backbone; it is flattened before being fed into the encoder.
pos_embed is the positional encoding; in DETR it has fixed values, see the introduction in Section 1.3 below.
query_embed is a learnable positional encoding, i.e. the Object Queries mentioned above; in the decoder it repeatedly performs cross attention with the encoder output features, and each entry of query_embed yields one detection result.
mask marks the zero-padded regions: to be compatible with inputs of different resolutions, DETR zero-pads every image in a batch to a fixed resolution; the padded area carries no information and must not take part in the attention computation, so it is passed to the encoder as src_key_padding_mask.
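To make the mask concrete, here is a minimal sketch (not the official NestedTensor code) of how a batch of differently sized images could be zero-padded and the corresponding padding mask built:

import torch

def pad_and_mask(images):
    # images: list of [3, H_i, W_i] tensors with different H_i, W_i
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, max_h, max_w)
    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)  # True = padded pixel
    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img
        mask[i, :h, :w] = False  # real pixels
    return batch, mask

# After the backbone, the mask is downsampled to the feature resolution and
# flattened to [batch, H*W] so it can be passed as src_key_padding_mask.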

The encoder and decoder that follow are almost the same as in "Attention is All You Need". The encoder layer structure is shown in the figure below:
[Figure: DETR encoder layer]
The code is as follows:

def forward_post(self,
                 src,
                 src_mask: Optional[Tensor] = None,
                 src_key_padding_mask: Optional[Tensor] = None,
                 pos: Optional[Tensor] = None):
    # self-attention: the positional encoding is added to query and key, not to value
    q = k = self.with_pos_embed(src, pos)
    src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                          key_padding_mask=src_key_padding_mask)[0]
    src = src + self.dropout1(src2)
    src = self.norm1(src)
    # feed-forward network with residual connection and LayerNorm
    src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
    src = src + self.dropout2(src2)
    src = self.norm2(src)
    return src

The decoder layer structure is shown in the figure below:
[Figure: DETR decoder layer]
The code is as follows:

def forward_post(self, tgt, memory,
                 tgt_mask: Optional[Tensor] = None,
                 memory_mask: Optional[Tensor] = None,
                 tgt_key_padding_mask: Optional[Tensor] = None,
                 memory_key_padding_mask: Optional[Tensor] = None,
                 pos: Optional[Tensor] = None,
                 query_pos: Optional[Tensor] = None):
    # self-attention among the object queries
    q = k = self.with_pos_embed(tgt, query_pos)
    tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                          key_padding_mask=tgt_key_padding_mask)[0]
    tgt = tgt + self.dropout1(tgt2)
    tgt = self.norm1(tgt)
    # cross-attention: the queries attend to the encoder output (memory)
    tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                               key=self.with_pos_embed(memory, pos),
                               value=memory, attn_mask=memory_mask,
                               key_padding_mask=memory_key_padding_mask)[0]
    tgt = tgt + self.dropout2(tgt2)
    tgt = self.norm2(tgt)
    # feed-forward network
    tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
    tgt = tgt + self.dropout3(tgt2)
    tgt = self.norm3(tgt)
    return tgt

One detail worth noting: except in the first layer, query_embed performs a round of self attention before the cross attention, which lets each query know what information the other queries already hold.

To conclude, what does the Transformer Encoder-Decoder actually bring?
I think the Transformer Encoder-Decoder is one of the reasons the Set-to-Set idea succeeds. Before DETR, several works had already proposed set-to-set prediction, but because the backbones they used were not strong enough, they did not achieve very good results. The Transformer Encoder-Decoder learns global features, letting each feature interact with every other feature across the whole image, so the network knows much more clearly where one object is and where another is, and that each object should correspond to exactly one output, which fits the set-to-set assumption.

1.2 Set-to-Set Loss

The so-called Set-to-Set Loss adds a bipartite matching step before computing the network loss, so that each prediction is only compared against the ground-truth object it is matched to:
$$\hat{\sigma}=\underset{\sigma \in \mathfrak{S}_{N}}{\arg \min } \sum_{i}^{N} \mathcal{L}_{\text{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right)$$
where $y_{i}$ is the ground truth, $\hat{y}_{\sigma(i)}$ is the prediction, and $\mathcal{L}_{\text{match}}$ is the matching cost of the bipartite matching. Readers unfamiliar with bipartite matching can refer to the introduction in Visual SLAM Summary - SuperPoint / SuperGlue. The difference is that the DETR code calls the linear_sum_assignment function from the scipy library, which takes an $M \times N$ cost matrix and computes the matching between the $M$ and $N$ elements. In DETR the cost matrix consists of a classification cost $\hat{p}_{\sigma(i)}\left(c_{i}\right)$ and a box cost $\mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\sigma(i)}\right)$: the classification cost is the negative softmax probability of the target class, and the box cost is an L1 loss plus a Generalized IoU loss. The code is as follows:

def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets])
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be ommitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost betwen boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]

After obtaining the matching result $\hat{\sigma}$, the loss is computed as
$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y})=\sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)\right]$$
which differs slightly from the cost used during matching: the classification term here is the ordinary cross-entropy loss. The paper does not seem to explain why this difference exists. As mentioned above, the bipartite matching only happens during training; at test time the final detections are obtained by simply thresholding the network outputs.
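For illustration, here is a simplified sketch of how these losses could be computed from the matcher's output; it omits the GIoU term, the auxiliary losses, and the normalization by the total number of boxes used in the official SetCriterion, and the function and argument names are illustrative:

import torch
import torch.nn.functional as F

def detr_losses(outputs, targets, indices, num_classes, no_object_weight=0.1):
    pred_logits = outputs["pred_logits"]            # [bs, num_queries, num_classes + 1]
    pred_boxes = outputs["pred_boxes"]              # [bs, num_queries, 4]
    bs, num_queries = pred_logits.shape[:2]
    device = pred_logits.device

    # Classification: every query gets a label; unmatched queries -> "no object".
    target_classes = torch.full((bs, num_queries), num_classes,
                                dtype=torch.int64, device=device)
    for b, (src_idx, tgt_idx) in enumerate(indices):
        target_classes[b, src_idx] = targets[b]["labels"][tgt_idx]
    class_weights = torch.ones(num_classes + 1, device=device)
    class_weights[num_classes] = no_object_weight   # down-weight the frequent "no object" class
    loss_ce = F.cross_entropy(pred_logits.flatten(0, 1), target_classes.flatten(), class_weights)

    # Box regression: only matched queries contribute the L1 term (GIoU omitted here).
    src_boxes = torch.cat([pred_boxes[b, src_idx] for b, (src_idx, _) in enumerate(indices)])
    tgt_boxes = torch.cat([targets[b]["boxes"][tgt_idx] for b, (_, tgt_idx) in enumerate(indices)])
    loss_bbox = F.l1_loss(src_boxes, tgt_boxes)
    return loss_ce, loss_bbox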

1.3 Positional Embedding

The Positional Embedding in DETR is a fixed value. The code of Positional Embedding is as follows. Let's briefly analyze it:

class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperature = temperature
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        mask = tensor_list.mask
        assert mask is not None
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        x_embed = not_mask.cumsum(2, dtype=torch.float32)
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

        pos_x = x_embed[:, :, :, None] / dim_t
        pos_y = y_embed[:, :, :, None] / dim_t
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
        return pos

To let the network perceive the position of different inputs, the most intuitive way is to assign the value 1 to the first feature, 2 to the second feature, and so on, but this kind of assignment is unfriendly to long inputs where the values grow large. So a sine function was proposed to keep the values between $-1$ and $1$; however, the sine function is periodic, which may produce the same value at different positions.

So the authors extend the sine function to a $d$-dimensional vector in which different channels have different wavelengths:
$$PE_{(pos, 2i)}=\sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
$$PE_{(pos, 2i+1)}=\cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
where $i$ is the channel index. For example, if we set $d=6$, the per-channel frequencies are
$$w=\left[\frac{1}{10000^{1/6}}, \frac{1}{10000^{2/6}}, \frac{1}{10000^{3/6}}, \frac{1}{10000^{4/6}}, \frac{1}{10000^{5/6}}, \frac{1}{10000^{6/6}}\right]$$
and for position $pos=2$ we get
$$PositionEncoding=\left[\sin\left(2 w_{1}\right), \cos\left(2 w_{2}\right), \sin\left(2 w_{3}\right), \cos\left(2 w_{4}\right), \sin\left(2 w_{5}\right), \cos\left(2 w_{6}\right)\right]$$
A multi-dimensional vector obtained this way is very unlikely to repeat at different positions, so the goal of encoding position is achieved.
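A tiny numerical check of the $d=6$ example above (purely illustrative):

import math

d = 6
pos = 2
w = [1.0 / (10000 ** (k / d)) for k in range(1, d + 1)]       # w_1 ... w_6
encoding = [math.sin(pos * w[k]) if k % 2 == 0 else math.cos(pos * w[k])
            for k in range(d)]                                 # [sin(2*w_1), cos(2*w_2), ...]
print([round(v, 4) for v in encoding])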

After DETR was proposed, its simple structure quickly attracted a lot of attention. But DETR itself also has several problems: training converges slowly, the results are not quite SOTA, and detection of small objects is poor. Many DETR-related algorithms followed, such as Deformable DETR and Anchor DETR, as well as DETR3D applied to autonomous driving. Below I briefly summarize some of these algorithms.

2. Deformable DETR

Deformable DETR mainly addresses two problems of the original DETR: slow training and poor detection of small objects. The slow convergence of DETR is largely because training the attention maps from a uniform distribution to a sparse one takes a long time. The poor small-object performance is mainly because the backbone provides no multi-scale features, and even if it did, feeding multi-scale features into the Transformer would be impractical: the computational complexity of the Transformer is $O(n^2)$, so high-resolution features bring enormous memory and time costs.

To this end, Deformable DETR proposes the Deformable Attention module, which solves the above problems well.

2.1 Deformable Attention Module

The authors first recall the multi-head attention formula of the original Transformer:
$$\text{MultiHeadAttn}\left(z_{q}, x\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{k \in \Omega_{k}} A_{mqk} \cdot W_{m}^{\prime} x_{k}\right]$$
This is the same formula we usually see, just written slightly differently. Here $z_{q}$ and $x$ are the two sets of vectors attending to each other; $V_{m} x$ gives the Key Embedding and $U_{m} z_{q}$ the Query Embedding; $A_{mqk}$ is the normalized weight obtained from the dot product of Query and Key Embeddings, proportional to $\exp \left\{\frac{z_{q}^{T} U_{m}^{T} V_{m} x_{k}}{\sqrt{C_{v}}}\right\}$; $W_{m}^{\prime} x_{k}$ is the Value Embedding; and $W_{m}$ aggregates the concatenated multi-head outputs. $U_{m}, V_{m}, W_{m}^{\prime}, W_{m}$ are all learned parameters. In DETR, this original multi-head attention is used for the encoder's self-attention and the decoder's cross-attention.

Next the authors introduce the Deformable Attention Module:
$$\operatorname{DeformAttn}\left(\boldsymbol{z}_{q}, \boldsymbol{p}_{q}, \boldsymbol{x}\right)=\sum_{m=1}^{M} \boldsymbol{W}_{m}\left[\sum_{k=1}^{K} A_{mqk} \cdot \boldsymbol{W}_{m}^{\prime} \boldsymbol{x}\left(\boldsymbol{p}_{q}+\Delta \boldsymbol{p}_{mqk}\right)\right]$$
where $\Delta p_{mqk}$ is a sampling offset predicted from the Query Embedding, and $A_{mqk}$ is no longer the weight obtained from the dot product of Query and Key Embeddings but is predicted directly from the Query Embedding. The process can be understood from the figure below:
[Figure: Deformable Attention Module]
The differences from the multi-head attention used in DETR are:

  1. The multi-head attention used in DETR takes the global features as keys, while Deformable Attention independently selects $K$ key positions near each query, with the offsets predicted from the Query Embedding;
  2. The multi-head attention used in DETR obtains the weights from the inner product of Key Embedding and Query Embedding, while Deformable Attention obtains them directly from the Query Embedding through a linear layer.

It is exactly these two differences that make Deformable Attention more efficient than the original attention mechanism. Note that Deformable Attention also differs from Deformable Convolution: Deformable Attention predicts multiple offsets directly at the query position, while Deformable Convolution predicts an offset for each pixel of the convolution kernel.

On top of the Deformable Attention Module, the authors further propose the Multi-Scale Deformable Attention Module:
$$\operatorname{MSDeformAttn}\left(z_{q}, \hat{\boldsymbol{p}}_{q},\left\{x^{l}\right\}_{l=1}^{L}\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot \boldsymbol{W}_{m}^{\prime} \boldsymbol{x}^{l}\left(\phi_{l}\left(\hat{\boldsymbol{p}}_{q}\right)+\Delta \boldsymbol{p}_{mlqk}\right)\right]$$
Compared with the Deformable Attention Module, the main difference is that instead of sampling $K$ positions from a single feature map, the Multi-Scale Deformable Attention Module samples $K$ positions from each of the $L$ feature levels, for $LK$ sampling positions in total. In this way the network fuses multi-scale features at a very small cost.
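To make the idea concrete, here is a simplified single-scale, single-head sketch of deformable attention built from standard PyTorch ops. The real implementation is a multi-scale, multi-head CUDA kernel; the offset scaling factor and module name here are illustrative choices, not the official code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Simplified single-scale, single-head deformable attention sketch."""
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 2)   # offsets predicted from the query only
        self.weight_proj = nn.Linear(dim, num_points)       # attention weights from the query only
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query:      [bs, num_query, dim]
        # ref_points: [bs, num_query, 2], (x, y) normalized to [0, 1]
        # feat:       [bs, dim, H, W]
        bs, nq, dim = query.shape
        h, w = feat.shape[2:]
        value = self.value_proj(feat.flatten(2).transpose(1, 2))      # [bs, H*W, dim]
        value = value.transpose(1, 2).reshape(bs, dim, h, w)          # back to [bs, dim, H, W]
        offsets = self.offset_proj(query).view(bs, nq, self.num_points, 2)
        weights = self.weight_proj(query).softmax(-1)                 # [bs, nq, num_points]
        # sampling locations around the reference point, mapped to [-1, 1] for grid_sample
        locs = (ref_points[:, :, None, :] + offsets * 0.1).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, locs, align_corners=False)     # [bs, dim, nq, num_points]
        out = (sampled * weights[:, None]).sum(-1).transpose(1, 2)    # weighted sum -> [bs, nq, dim]
        return self.out_proj(out)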

2.2 Deformable Transformer Encoder-Decoder

The Deformable Transformer Encoder-Decoder structure is shown in the figure below:
[Figure: Deformable Transformer Encoder-Decoder]
In the Encoder, the authors replace all self-attention modules with Deformable Attention Modules; the input and output of each encoder layer are multi-scale feature maps of the same resolutions, and the multi-scale feature maps come directly from the last three stages of ResNet, as shown below:
[Figure: multi-scale features taken from the last three ResNet stages]
In addition, on top of the positional embedding, a learnable Scale-Level Embedding associated with each feature level is added.
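A minimal sketch of what such a scale-level embedding could look like (shapes and names are illustrative, not the exact Deformable DETR code):

import torch
import torch.nn as nn

num_levels, dim = 4, 256
level_embed = nn.Parameter(torch.zeros(num_levels, dim))   # one learnable vector per feature level
nn.init.normal_(level_embed)

def add_level_embed(pos_embeds):
    # pos_embeds: list of flattened positional encodings, each [bs, H_l*W_l, dim], one per level
    out = []
    for lvl, pos in enumerate(pos_embeds):
        out.append(pos + level_embed[lvl].view(1, 1, -1))   # positional + scale-level encoding
    return torch.cat(out, dim=1)                            # concatenate levels along the token axis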

In the Decoder, the authors keep the self-attention unchanged and replace the cross-attention with Deformable Attention. For each Object Query Embedding, a linear layer followed by a Sigmoid predicts its reference point, the corresponding Value Embeddings are sampled from the encoder output features around this point, and finally a weighted sum is taken with weights predicted by a linear layer and a Softmax. When I first read this I had a question: the Object Query Embedding is a preset value, so if the reference point positions and weights are both inferred from the Object Query Embedding (in DETR they are all obtained together with the encoder features), how can the network guarantee that it detects the right targets? After thinking it over: the reference points and weights of the first decoder layer may indeed be effectively fixed or random, but the Object Query Embedding output by the first decoder layer already contains encoder feature information, so the reference points generated by the following decoder layers start to correlate with the image, and this correlation grows stronger as more decoder layers are stacked.

Another small difference from DETR: when predicting bounding boxes, Deformable DETR does not regress absolute coordinates directly from the Object Query but regresses offsets relative to the reference point. Since the attention weights and the reference point positions are both inferred from the same Object Query Embedding, regressing relative to the reference point speeds up convergence, as sketched below.
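A minimal sketch of this relative regression, assuming 2D reference points in normalized coordinates (illustrative, not the exact implementation):

import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))

def refine_boxes(box_deltas, reference_points):
    # box_deltas:       [bs, num_query, 4] raw regression output of the decoder head
    # reference_points: [bs, num_query, 2] normalized (cx, cy) reference points
    # The center is predicted as an offset from the reference point in logit space,
    # so each query only has to learn a small correction rather than absolute coordinates.
    new_boxes = box_deltas.clone()
    new_boxes[..., :2] += inverse_sigmoid(reference_points)
    return new_boxes.sigmoid()   # normalized (cx, cy, w, h)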

2.3 Conclusion

There is far more to Deformable DETR than the two points summarized above; the paper also introduces further variants such as Two-Stage Deformable DETR, which readers can dig into. Here we compare the improvements of Deformable DETR over DETR.
The figure below compares training curves; the convergence speed of Deformable DETR is much better:
[Figure: training convergence comparison between DETR and Deformable DETR]
The table below shows that Deformable DETR improves considerably on small-object detection accuracy:
[Table: detection accuracy comparison, including small objects]
DETR drew attention mainly for its end-to-end network structure, but its performance did not quite reach SOTA at the time; with Deformable DETR, this family of methods can now compete with SOTA methods. The comparison results are as follows:
[Table: comparison with state-of-the-art detectors]

3. DETR3D

DETR3D applies DETR to autonomous driving to perform 3D object detection from multi-camera input under the BEV (bird's-eye view) setting, as shown below:
[Figure: DETR3D multi-camera 3D detection]
Its principle is very similar to the Deformable Transformer described above. Here we briefly summarize it; the application of Transformers to BEV perception tasks is left for another post.

The structure diagram of the network is shown in the figure below:
[Figure: DETR3D network structure]
The network first extracts multi-scale features for each camera input with ResNet and FPN, which is the Image Feature Extraction part in the figure. These features are then fed into the 2D-to-3D Feature Transformation, which is introduced in detail below.

3.1 2D to 3D Transformer

For each camera input we extract features at four scales, denoted $\mathcal{F}_{1}, \mathcal{F}_{2}, \mathcal{F}_{3}, \mathcal{F}_{4}$. Under the paper's configuration there are six camera inputs in total, so each level is a set
$$\mathcal{F}_{k}=\left\{\boldsymbol{f}_{k 1}, \ldots, \boldsymbol{f}_{k 6}\right\} \subset \mathbb{R}^{H \times W \times C}$$

In DETR3D there is no Transformer-based encoder; the image features extracted above are fed directly into the decoder (I suspect the reason is that an encoder over all camera features would be too expensive). The decoder works like DETR or Deformable DETR: cross attention is performed between the Object Query Embeddings and the input features, and the corresponding objects are then regressed from the final Object Query Embeddings. The biggest difference in DETR3D is that the regressed positions live in 3D space while the input features come from 2D images, which is why a 2D-to-3D Transformer decoder is needed.

The DETR3D decoder is still composed of self-attention and cross-attention. The self-attention is basically the same as in DETR; its main role is to let each query know what the others are doing and to avoid extracting the same object twice. The cross-attention is quite different; its concrete steps are as follows:

  1. First, an independent network $\Phi^{\text{ref}}$ regresses a 3D position from the Object Query Embedding, somewhat similar to the reference points of Deformable DETR: $\boldsymbol{c}_{\ell i}=\Phi^{\mathrm{ref}}\left(\boldsymbol{q}_{\ell i}\right)$, where $\boldsymbol{c}_{\ell i}$ can be regarded as the center of the $i$-th box.
  2. Using the camera parameters, the position $\boldsymbol{c}_{\ell i}$ is projected onto the feature map of every camera to fetch the features that serve as Key/Value Embeddings for refining the prediction: $\boldsymbol{c}_{\ell i}^{*}=\boldsymbol{c}_{\ell i} \oplus 1$ and $\boldsymbol{c}_{\ell m i}=T_{m} \boldsymbol{c}_{\ell i}^{*}$, where $T_{m}$ is the projection matrix of camera $m$ (steps 2 to 4 are sketched in code after this list).
  3. Since each input is a multi-scale feature map, bilinear interpolation is used to sample the feature maps so that the differing resolutions do not matter, and positions that fall outside the image are filled with zeros: $\boldsymbol{f}_{\ell k m i}=f^{\text{bilinear}}\left(\mathcal{F}_{k m}, \boldsymbol{c}_{\ell m i}\right)$.
  4. The sampled features are averaged and then added to the Object Query Embedding for refinement: $\boldsymbol{f}_{\ell i}=\frac{1}{\sum_{k} \sum_{m} \sigma_{\ell k m i}+\epsilon} \sum_{k} \sum_{m} \boldsymbol{f}_{\ell k m i} \sigma_{\ell k m i}$ and $\boldsymbol{q}_{(\ell+1) i}=\boldsymbol{f}_{\ell i}+\boldsymbol{q}_{\ell i}$, where $\sigma_{\ell k m i}$ indicates whether the projected point falls inside image $m$. I think this step plays the role of the weighted sum in cross attention. In DETR the query is updated by weighting the Value Embeddings with weights from the dot product of Query and Key Embeddings; Deformable DETR drops the dot product and regresses the weights directly from the Query Embedding before weighting the Value Embeddings; here the authors simply add the averaged Value Embeddings to the Query Embedding, saving even more computation. (This is my conclusion based on the paper and other blogs; it may differ from the code implementation, and I welcome corrections.)
  5. After iterating the above several times, the class and box are finally regressed from the Query Embedding: $\hat{\boldsymbol{b}}_{\ell i}=\Phi_{\ell}^{\mathrm{reg}}\left(\boldsymbol{q}_{\ell i}\right)$ and $\hat{c}_{\ell i}=\Phi_{\ell}^{\mathrm{cls}}\left(\boldsymbol{q}_{\ell i}\right)$.
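Here is a simplified, single-scale sketch of steps 2 to 4 (projection, bilinear sampling, and averaging). The tensor layouts and the assumption that the projection matrices map directly to normalized image coordinates are mine, not the official mmdet3d implementation:

import torch
import torch.nn.functional as F

def sample_image_features(ref_points_3d, feats, cam_projs, eps=1e-5):
    # ref_points_3d: [bs, num_query, 3]   3D centers c_i predicted from the queries
    # feats:         [num_cam, bs, C, H, W] image features from each camera
    # cam_projs:     [bs, num_cam, 4, 4]  camera projection matrices T_m (assumed to output
    #                                      normalized image coordinates in [0, 1])
    bs, nq, _ = ref_points_3d.shape
    num_cam = feats.shape[0]
    ones = torch.ones_like(ref_points_3d[..., :1])
    pts = torch.cat([ref_points_3d, ones], dim=-1)                   # homogeneous: c_i (+) 1
    sampled, masks = [], []
    for m in range(num_cam):
        cam_pts = torch.einsum('bij,bqj->bqi', cam_projs[:, m], pts)  # project to camera m
        z = cam_pts[..., 2:3].clamp(min=eps)
        uv = cam_pts[..., :2] / z                                     # normalized image coords
        grid = uv * 2 - 1                                             # grid_sample range [-1, 1]
        valid = (grid.abs() <= 1).all(-1, keepdim=True) & (cam_pts[..., 2:3] > eps)
        f = F.grid_sample(feats[m], grid.unsqueeze(2), align_corners=False)  # [bs, C, nq, 1]
        sampled.append(f.squeeze(-1).transpose(1, 2) * valid)         # zero out invalid points
        masks.append(valid)
    feat_sum = torch.stack(sampled).sum(0)
    count = torch.stack(masks).sum(0).clamp(min=1)
    return feat_sum / count    # f_i, averaged over the cameras that see the point; added to q_i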

The above is my understanding of the DETR3D decoder. Since I have not had time to read the source code, this understanding may be incomplete and many details may be missing. Since Tesla's AI Day last year there has been a lot of research on applying Transformers in autonomous driving, and I plan to look into it further; how to associate 2D features in 3D is very interesting.

That is all for this summary of DETR for now; I will add more when I have time. If you have any questions, feel free to point them out and discuss~
