DETR study notes

End-to-End Object Detection with Transformers

Abstract.

We propose a new approach that treats object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively eliminating the need for many hand-engineered components such as non-maximum suppression procedures or anchor generation that explicitly encode prior knowledge about the task. The main components of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that enforces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a small fixed set of learned object queries, DETR reasons about the relations of the objects and the global image context, and directly outputs the final set of predictions in parallel. (Parallel decoding is possible because detection imposes no ordering on the targets, and it improves speed.) Unlike many other modern detectors, the new model is conceptually simple and does not require specialized libraries. On the challenging COCO object detection dataset, DETR demonstrates accuracy and runtime performance on par with the well-established and highly optimized Faster R-CNN baseline. Furthermore, DETR generalizes easily, producing panoptic segmentation in a unified manner, and we show that it significantly outperforms competitive baselines. The training code and pretrained models are available at https://github.com/facebookresearch/detr.

1 Introduction

The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37, 5], anchors [23] or window centers [53, 46]. Their performance is significantly influenced by post-processing steps that collapse near-duplicate predictions, by the design of the anchor sets, and by the heuristics that assign target boxes to anchors [52]. To simplify these pipelines, we propose a direct set prediction approach that bypasses the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43, 16, 4, 39] either added other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks. This paper aims to bridge this gap.

We simplify the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction. The self-attention mechanism of transformers, which explicitly models all pairwise interactions between elements in a sequence, makes these architectures particularly suitable for specific constraints of set prediction, such as removing duplicate predictions.

Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors or non-maximum suppression. Unlike most existing detection methods, DETR does not require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.

(Figure 1: DETR directly predicts, in parallel, the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions to ground-truth boxes; predictions with no match yield a "no object" class prediction.)

Compared with most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29, 12, 10, 8]. Our matching loss function uniquely assigns a prediction to a ground-truth object, and is invariant to a permutation of the predicted objects, so we can emit them in parallel. In contrast, previous work focused on autoregressive decoding with RNNs [43, 41, 30, 36, 42].

We evaluate DETR on one of the most popular object detection datasets, COCO [24], against the very competitive Faster R-CNN baseline [37]. Faster R-CNN has undergone many design iterations and its performance has improved greatly since its original publication. Our experiments show that the new model achieves comparable performance. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performance on small objects. We expect that future work will improve this aspect in the same way that FPN [22] did for Faster R-CNN.

The training settings for DETR differ from standard object detectors in multiple ways. The new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer. We thoroughly explore which components are crucial for the final performance.

The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pre-trained DETR outperforms competitive baselines on panoptic segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.

2 Related work

Our work builds on previous work in several areas: bipartite matching loss for set prediction, transformer-based encoder-decoder architectures, parallel decoding, and object detection methods.

2.1 Set Prediction

There is no canonical deep learning model to directly predict sets. The basic set prediction task is multi-label classification (see e.g. [40, 33] for references in the context of computer vision), for which the baseline approach, one-vs-rest, does not apply to problems where there is an underlying structure between elements (i.e. near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use post-processing such as non-maximum suppression to address this issue, but direct set prediction is post-processing-free. It requires global inference schemes that model interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant under a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [20], to find a bipartite matching between ground truth and predictions. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work, however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.
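
As a concrete illustration of the bipartite matching idea, here is a minimal sketch using SciPy's linear_sum_assignment (a standard Hungarian-algorithm implementation). The L1 cost between random boxes is a toy stand-in for DETR's full matching cost:

import torch
from scipy.optimize import linear_sum_assignment

preds = torch.rand(5, 4)    # 5 predicted boxes as (cx, cy, w, h)
targets = torch.rand(3, 4)  # 3 ground-truth boxes

# cost[i, j]: L1 distance between prediction i and ground truth j
cost = torch.cdist(preds, targets, p=1)

# Hungarian algorithm: a unique, permutation-invariant assignment
pred_idx, tgt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx, tgt_idx)))  # each ground truth matched to exactly one prediction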

2.2 Transformers and Parallel Decoding

Vaswani et al. [47] introduced transformers as a new attention-based building block for machine translation. Attention mechanisms [2] are neural network layers that aggregate information from the entire input sequence. Transformers introduced self-attention layers which, similarly to non-local neural networks [49], scan through each element of a sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computations and perfect memory, which makes them more suitable than RNNs on long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing and computer vision [8, 27, 45, 34, 31].

Transformers were first used in autoregressive models, following early sequence-to-sequence models [44], generating output tokens one by one. However, the prohibitive inference cost (proportional to the output length, and hard to batch) led to the development of parallel sequence generation, in the domains of audio [29], machine translation [12, 10], word representation learning [8], and more recently speech recognition [6]. We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.

2.3 Object detection

Most modern object detection methods make predictions relative to some initial guesses. Two-stage detectors [37, 5] predict boxes w.r.t. proposals, whereas single-stage methods make predictions w.r.t. anchors [23] or a grid of possible object centers [53, 46]. Recent work [52] demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set. In our model we are able to remove this hand-crafted process and streamline the detection process by directly predicting the set of detections with absolute box prediction w.r.t. the input image rather than an anchor.

Set-based loss

Several object detectors [9, 25, 35] used the bipartite matching loss. However, in these early deep learning models, the relations between different predictions were modeled with convolutional or fully-connected layers only, and a hand-designed NMS post-processing could improve their performance. More recent detectors [37, 23, 53] use non-unique assignment rules between ground truth and predictions together with an NMS.

Learnable NMS methods [16, 4] and relation networks [17] explicitly model relations between different predictions with attention. Using direct set losses, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features, such as proposal box coordinates, to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.

Recurrent detectors

Closest to our approach are end-to-end set predictions for object detection [43] and instance segmentation [41, 30, 36, 42]. Similarly to us, they use bipartite matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.

3 The DETR model

Two ingredients are essential for direct set prediction in detection: (1) a set prediction loss that forces unique matching between predicted and ground-truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models the relations between them. We describe our architecture in detail in Figure 2.

3.1 Object detection set prediction loss

DETR infers a fixed-size set of N predictions in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground-truth objects, and then optimizes the object-specific (bounding box) losses.

Let $y$ denote the ground-truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, we consider $y$ also as a set of size $N$ padded with $\varnothing$ (no object). To find a bipartite matching between these two sets, we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

Classic bipartite graph matching algorithm – Hungarian algorithm

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} \mathcal{L}_{\text{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right) \tag{1}$$

where $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ is a pairwise matching cost between the ground truth $y_i$ and a prediction with index $\sigma(i)$. Following prior work (e.g. [43]), this optimal assignment is computed efficiently with the Hungarian algorithm.

The matching cost takes into account both the class prediction and the similarity of predicted and ground-truth boxes. Each element $i$ of the ground-truth set can be seen as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector that defines the ground-truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define the probability of class $c_i$ as $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ as $-\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$. (This matching cost combines a classification term and a box term.)

This procedure of finding a matching plays the same role as the heuristic assignment rules used in modern detectors to match proposals [37] or anchors [22] to ground-truth objects. The main difference is that we need to find a one-to-one matching for direct set prediction without duplicates.
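
A sketch of how this matching step might look in code, under simplifying assumptions: only the class-probability term and the $\ell_1$ part of $\mathcal{L}_{\text{box}}$ are used (the full cost also includes a generalized IoU term, described below), and l1_weight is a hypothetical hyperparameter:

import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_classes, tgt_boxes, l1_weight=5.0):
	# pred_logits: (N, num_classes + 1); pred_boxes: (N, 4) in cxcywh format
	# tgt_classes: (M,) long tensor; tgt_boxes: (M, 4); assumes N >= M
	probs = pred_logits.softmax(-1)
	# -p_hat_{sigma(i)}(c_i): negative probability of each target's class
	cost_class = -probs[:, tgt_classes]                  # (N, M)
	# l1 box distance, the first part of L_box
	cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # (N, M)
	cost = cost_class + l1_weight * cost_bbox
	pred_idx, tgt_idx = linear_sum_assignment(cost.detach().numpy())
	return pred_idx, tgt_idx                             # the optimal assignment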

The second step is to compute the loss function: the Hungarian loss for all pairs matched in the previous step. We define the loss similarly to the losses of common object detectors, i.e. a linear combination of the negative log-likelihood for class prediction and the box loss defined later:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right) + \mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\text{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right) \right] \tag{2}$$

where $\hat{\sigma}$ is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term by a factor of 10 when $c_i = \varnothing$ to account for class imbalance. This is analogous to how the Faster R-CNN training procedure balances positive/negative proposals by subsampling [37]. Notice that the matching cost between an object and $\varnothing$ does not depend on the prediction, which means that in that case the cost is a constant. In the matching cost we use probabilities $\hat{p}_{\sigma(i)}(c_i)$ instead of log-probabilities. This makes the class prediction term commensurable to $\mathcal{L}_{\text{box}}(\cdot, \cdot)$ (described below), and we observed better empirical performance.
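
As an illustration, a minimal sketch of this down-weighting (the class indexing and the use of F.cross_entropy are assumptions; the weight 0.1 reflects the 10x reduction described above):

import torch
import torch.nn.functional as F

num_classes = 91            # assumption: COCO-style label space
no_object = num_classes     # the extra "no object" class uses the last index

# Down-weight the log-probability term for the no-object class by a factor 10
class_weights = torch.ones(num_classes + 1)
class_weights[no_object] = 0.1

def classification_loss(pred_logits, target_classes):
	# pred_logits: (N, num_classes + 1); target_classes: (N,) after matching,
	# with every unmatched prediction assigned the no-object label
	return F.cross_entropy(pred_logits, target_classes, weight=class_weights)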

Bounding box loss.

The second part of the matching cost and of the Hungarian loss is $\mathcal{L}_{\text{box}}(\cdot, \cdot)$, which scores the bounding boxes. Unlike many detectors that make box predictions as a $\Delta$ w.r.t. some initial guesses, we make box predictions directly. While this approach simplifies the implementation, it poses an issue with the relative scaling of the loss. The most commonly used $\ell_1$ loss will have different scales for small and large boxes even if their relative errors are similar. To mitigate this issue we use a linear combination of the $\ell_1$ loss and the generalized IoU loss [38] $\mathcal{L}_{\text{iou}}(\cdot, \cdot)$, which is scale-invariant. Overall, our box loss is $\lambda_{\text{iou}} \mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}} \| b_i - \hat{b}_{\sigma(i)} \|_1$, where $\lambda_{\text{iou}}, \lambda_{\text{L1}} \in \mathbb{R}$ are hyperparameters. Both losses are normalized by the number of objects inside the batch.
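
A sketch of this box loss is shown below, using torchvision's generalized_box_iou and box_convert; the particular lambda values are placeholders for the tuned hyperparameters:

import torch
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_boxes, tgt_boxes, lambda_iou=2.0, lambda_l1=5.0):
	# pred_boxes, tgt_boxes: (M, 4) matched pairs in normalized cxcywh format
	l1 = (pred_boxes - tgt_boxes).abs().sum(-1)           # per-pair l1 loss
	# the generalized IoU loss is scale-invariant; it needs corner format
	giou = generalized_box_iou(
		box_convert(pred_boxes, "cxcywh", "xyxy"),
		box_convert(tgt_boxes, "cxcywh", "xyxy"),
	).diag()                                              # keep matched pairs only
	loss = lambda_iou * (1 - giou) + lambda_l1 * l1
	return loss.sum() / len(tgt_boxes)                    # normalize by object count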

3.2 DETR architecture

The overall DETR architecture is surprisingly simple and depicted in Figure 2. It contains three main components, which we describe below: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed-forward network (FFN) that makes the final detection predictions.

(Figure 2: DETR uses a conventional CNN backbone to learn a 2D representation of the input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, called object queries, and additionally attends to the encoder output. Each output embedding of the decoder is passed to a shared feed-forward network (FFN) that predicts either a detection (class and box) or a "no object" class.)

(The class/box and "no object" outputs after the FFNs in the figure correspond to the classification loss and the box loss, respectively.)

(For the decoder, each transformer decoder layer can be followed by an FFN and a loss computation as an auxiliary signal, which strengthens training.)

Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation with just a few hundred lines of code. In PyTorch [32], inference code for DETR can be implemented in less than 50 lines. We hope that the simplicity of our method will attract new researchers to the detection community.

Backbone

Starting from an initial image $x_{\text{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typical values are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.

Transformer encoder

First, a 1×1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, so we collapse the spatial dimensions of $z_0$ into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [31, 3] that are added to the input of each attention layer. We defer to the supplementary material the detailed definition of the architecture, which follows the one described in [47].
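
A minimal sketch of this encoder input preparation (sizes are hypothetical, and a random tensor stands in for the fixed sine positional encodings, which in the paper are added at every attention layer rather than once at the input):

import torch
from torch import nn

d, H, W = 256, 25, 34            # assumed sizes after the backbone and 1x1 conv
z0 = torch.randn(1, d, H, W)     # new feature map, batch size 1

# Collapse the spatial dimensions: (1, d, H, W) -> (HW, 1, d),
# the sequence-first layout expected by nn.TransformerEncoder
src = z0.flatten(2).permute(2, 0, 1)

pos = torch.randn(H * W, 1, d)   # stand-in for the fixed positional encodings

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src + pos)      # (HW, 1, d) encoded image features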

Transformer decoder

The decoder follows the standard architecture of the transformer, transforming N embeddings of size d using multi-head self- and encoder-decoder attention mechanisms. The difference from the original transformer is that our model decodes the N objects in parallel at each decoder layer, while Vaswani et al. [47] use an autoregressive model that predicts the output sequence one element at a time. We refer readers unfamiliar with these concepts to the supplementary material. Since the decoder is also permutation-invariant, the N input embeddings must be different to produce different results. These input embeddings are learned positional encodings that we refer to as object queries, and similarly to the encoder, we add them to the input of each attention layer. The N object queries are transformed into output embeddings by the decoder. They are then independently decoded into box coordinates and class labels by a feed-forward network (described in the next subsection), resulting in N final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pairwise relations between them, while being able to use the whole image as context.
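
A sketch of this parallel decoding with learned object queries (a simplification: here the queries serve directly as the decoder input, whereas the paper adds them to the input of each attention layer):

import torch
from torch import nn

d, N = 256, 100                       # assumed hidden dim and number of queries
object_queries = nn.Parameter(torch.rand(N, d))  # learned positional encodings

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(850, 1, d)       # encoder output, (HW, batch, d)
tgt = object_queries.unsqueeze(1)     # (N, 1, d); all N decoded in parallel
hs = decoder(tgt, memory)             # (N, 1, d) output embeddings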

Prediction feed-forward networks (FFNs)

The final prediction is computed by a 3-layer perceptron with ReLU activation and hidden dimension d, and a linear projection layer. The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function. Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the actual number of objects of interest in the image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a similar role to the "background" class in standard object detection approaches.
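
A sketch of these prediction heads (the demo code in Section 6 simplifies the box head to a single linear layer; shown here is the 3-layer perceptron described above):

import torch
from torch import nn

d, num_classes = 256, 91               # assumed hidden dim and label space

# 3-layer perceptron with ReLU for boxes, plus a linear layer for classes
bbox_head = nn.Sequential(
	nn.Linear(d, d), nn.ReLU(),
	nn.Linear(d, d), nn.ReLU(),
	nn.Linear(d, 4),                   # normalized (cx, cy, w, h)
)
class_head = nn.Linear(d, num_classes + 1)  # +1 for the "no object" class

hs = torch.randn(100, 1, d)            # decoder output embeddings
boxes = bbox_head(hs).sigmoid()        # center/size normalized to [0, 1]
logits = class_head(hs)                # softmax is applied at loss time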

Auxiliary decoding losses

We found it helpful to use auxiliary losses [1] in the decoder during training, especially to help the model output the correct number of objects of each class. We add prediction FFNs and the Hungarian loss after each decoder layer. All prediction FFNs share their parameters. We use an additional shared layer-norm to normalize the inputs to the prediction FFNs from different decoder layers.
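
A sketch of these auxiliary decoding losses, with hypothetical helpers: hungarian_loss stands for the matched loss of Section 3.1, and the heads and layer-norm are shared across layers as described:

import torch
from torch import nn

d = 256
shared_norm = nn.LayerNorm(d)   # shared layer-norm across decoder layers

def total_loss(intermediate_hs, targets, class_head, bbox_head, hungarian_loss):
	# intermediate_hs: one (N, batch, d) output per decoder layer
	loss = 0.0
	for hs in intermediate_hs:
		hs = shared_norm(hs)            # normalize each layer's output
		logits = class_head(hs)         # shared prediction FFNs
		boxes = bbox_head(hs).sigmoid()
		loss = loss + hungarian_loss(logits, boxes, targets)
	return loss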

4 Experiments


5 Conclusion

We presented DETR, a new design for object detection systems based on transformers and a bipartite matching loss for direct set prediction. The approach achieves comparable results to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is straightforward to implement, has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by self-attention.

This new design for detectors also comes with new challenges, in particular regarding training, optimization and performance on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.

6 Pseudocode

import torch
from torch import nn
from torchvision.models import resnet50

class DETR(nn.Module):

	def __init__(self, num_classes, hidden_dim, nheads,
			num_encoder_layers, num_decoder_layers):
		super().__init__()
		# We take only the convolutional layers from the ResNet-50 model
		# (dropping the average pool and the classification head)
		self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
		# 1x1 convolution: reduce channels from 2048 to hidden_dim
		self.conv = nn.Conv2d(2048, hidden_dim, 1)
		self.transformer = nn.Transformer(hidden_dim, nheads,
			num_encoder_layers, num_decoder_layers)
		# Prediction heads: class logits (+1 for "no object") and boxes
		self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
		self.linear_bbox = nn.Linear(hidden_dim, 4)
		# 100 learned object queries
		self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
		# Learned 2D positional encodings, split between rows and columns
		self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
		self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

	def forward(self, inputs):
		x = self.backbone(inputs)            # (1, 2048, H, W)
		h = self.conv(x)                     # (1, hidden_dim, H, W)
		H, W = h.shape[-2:]
		# Build the (HW, 1, hidden_dim) positional encoding
		pos = torch.cat([
			self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
			self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
		], dim=-1).flatten(0, 1).unsqueeze(1)
		# Flatten the spatial dims into a sequence and run the transformer;
		# all object queries are decoded in parallel
		h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
			self.query_pos.unsqueeze(1))
		# Class logits and normalized (cx, cy, w, h) boxes per query
		return self.linear_class(h), self.linear_bbox(h).sigmoid()

detr = DETR(num_classes=91, hidden_dim=256, nheads=8,
	num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
