Interpretation of the paper: End-to-End Object Detection with Transformers

Date of publication: 2020
Paper address: https://arxiv.org/pdf/2005.12872.pdf
Project address: https://github.com/facebookresearch/detr

A new method is proposed that treats object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-engineered components such as non-maximum suppression or anchor generation that explicitly encode prior knowledge about the task. The new framework, called DEtection TRansformer or DETR, is built on a transformer encoder-decoder architecture and a set-based global loss that forces unique predictions via bipartite matching with the ground-truth boxes. Given a fixed small set of learned object queries, DETR reasons about the relations between the objects and the global image context to directly output the final set of predictions in parallel. Unlike many other modern detectors, the new model is conceptually simple and does not require a specialized NMS library. DETR demonstrates accuracy and runtime performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner; we show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.

Key interpretation

Background introduction

1. Current object detectors rely on surrogate tasks (dense anchor-box assignment and NMS over predicted boxes) to approximate set prediction of objects

2. A bipartite matching loss frees object detection from dense anchor-box/anchor-point assignment strategies, so the model can drop hand-crafted label assignment and NMS post-processing

3. Existing NMS-free models still have shortcomings and have not been shown to scale to large datasets such as COCO

4. Transformers parallelize efficiently and outperform RNNs; they have been adopted in many NLP scenarios

Introduction to DETR

NMS-free object detection models already exist, but they do not perform well on large datasets. DETR rebuilds this line of work on top of transformers. DETR consists of a Backbone, a Transformer encoder, a Transformer decoder, and Prediction feed-forward networks. Building on the bipartite matching idea, it uses a Hungarian-algorithm-based loss to match the prediction set to the ground-truth set during training.
[Figure: DETR architecture overview — backbone, transformer encoder-decoder, and prediction FFNs]

1. DETR treats object detection as a set prediction problem and uses a set-based global loss to put predictions in one-to-one correspondence with ground-truth boxes, removing anchor-box assignment and NMS

2. DETR performs well on large objects (likely thanks to the non-local self-attention), but its performance on small objects still needs improvement

3. DETR is slow to converge and relies on auxiliary decoding losses

4. In the experiments, DETR's advantage over Faster R-CNN is not dramatic; applying DETR's training recipe to Faster R-CNN improves its AP by 1-2 points

DETR features

1. Compared with YOLO-style and other object detection methods, DETR adds a bipartite matching loss that matches the prediction set to the ground-truth set

2. DETR's regular losses are cls_loss and boxes_loss (= L1_loss + GIoU_loss)

Additional knowledge

1. Random crop augmentation improves performance by about 1 AP; a longer training schedule also improves AP; at inference time, DETR replaces empty (∅) predictions with the second-highest-scoring class, improving performance by about 2 AP

2. Transformer models are usually trained with Adam-style optimizers; DETR sets the transformer learning rate to 1e-4, the backbone learning rate to 1e-5 (1/10 of the encoder/decoder), and weight decay to 1e-4

3. DETR offers the following practical guidance for multi-task training (e.g., adding instance segmentation): train one head first (one task, object detection), then freeze the backbone and the shared parts, and train the other head (the other task, segmentation) separately

1 Introduction

The goal of detection is to predict a set of bounding boxes and class labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37, 5], anchor boxes [23], or object centers [53, 46]. Their performance is significantly affected by the post-processing step (NMS, to collapse near-duplicate predictions), by the design of the anchor sets, and by the heuristics that assign object boxes to anchors [52]. To simplify these pipelines, we propose a direct set prediction approach that bypasses the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43, 16, 4, 39] either added other forms of prior knowledge or were not shown to be competitive on challenging benchmarks. This paper aims to bridge this gap.

We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on the transformer [47], a popular architecture for sequence prediction. The self-attention mechanism of transformers, which explicitly models all pairwise interactions between elements in a sequence, makes these architectures particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions.

Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors or non-maximum suppression. Unlike most existing detection methods, DETR does not require any customized layers and can therefore be reproduced easily in any framework that provides standard CNN and transformer classes.
[Figure 1: DETR combines a CNN with a transformer to directly predict the final set of detections in parallel]

Compared with most previous work on direct set prediction, the main feature of DETR is the combination of the bipartite matching loss with transformers and (non-autoregressive) parallel decoding [29, 12, 10, 8]. In contrast, previous work focused on autoregressive decoding with RNNs [43, 41, 30, 36, 42]. Our matching loss function uniquely assigns a prediction to a ground-truth object and is invariant to the permutation of the predicted objects, so we can emit them in parallel. We evaluate DETR on COCO [24], one of the most popular object detection datasets, using Faster R-CNN as the baseline [37]. Faster R-CNN has gone through many design iterations and its performance has improved greatly since the original publication. Our experiments show that the new model achieves comparable performance. More precisely, DETR shows significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. However, it obtains lower performance on small objects. We hope that future work will improve this aspect in the same way that FPN [22] did for Faster R-CNN.

The training settings for DETR differ from standard object detectors in several ways. The new model requires a very long training schedule and benefits from auxiliary decoding losses. We thoroughly explore which components are crucial for the demonstrated performance. The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pre-trained DETR outperforms competitive baselines on panoptic segmentation [19], a challenging pixel-level recognition task.

1. Current object detectors rely on surrogate tasks (dense anchor-box assignment and NMS over predicted boxes) to approximate set prediction of objects.
2. DETR treats object detection as a set prediction problem and uses a set-based global loss to put predictions in one-to-one correspondence with ground-truth boxes, removing anchor-box assignment and NMS.
3. DETR performs well on large objects (likely thanks to the non-local self-attention) but still needs improvement on small objects.
4. DETR is slow to converge and relies on auxiliary decoding losses.

2 Related work

Our work builds on previous work in several areas: bipartite matching losses for set prediction, transformer-based encoder-decoder architectures, parallel decoding, and object detection methods.

2.1 Set Prediction

There is no canonical deep learning model that directly predicts sets. The basic set prediction task is multi-label classification (see e.g. [40, 33] in computer vision), for which the baseline one-vs-rest approach does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty in these tasks is avoiding near-duplicates. Most current detectors use post-processing such as non-maximum suppression to address this issue, whereas direct set prediction is post-processing-free. It requires a global inference scheme that models the interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use autoregressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant to permutations of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [20] to find a bipartite matching between ground truth and predictions. This enforces permutation invariance and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work, however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.

1. A loss based on the Hungarian algorithm performs matched training between the ground-truth set and the prediction set; a toy example of the assignment step is sketched below
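As a toy illustration (not DETR code), SciPy's `linear_sum_assignment` implements this Hungarian assignment: given a cost matrix between ground-truth elements and predictions, it returns the minimum-cost one-to-one matching.

```python
# Toy example: the Hungarian algorithm finds the minimum-cost one-to-one
# assignment between ground-truth elements (rows) and predictions (columns).
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.9, 0.1, 0.5],
                 [0.4, 0.8, 0.2],
                 [0.3, 0.6, 0.7]])
row_idx, col_idx = linear_sum_assignment(cost)   # optimal permutation
print(list(zip(row_idx, col_idx)))               # [(0, 1), (1, 2), (2, 0)]
print(cost[row_idx, col_idx].sum())              # minimal total cost: 0.1 + 0.2 + 0.3
```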

2.2 Transformers and Parallel Decoding

The transformer was proposed by Vaswani et al. [47] as a new attention-based building block for machine translation. An attention mechanism [2] is a neural network layer that aggregates information from the entire input sequence. The transformer introduces a self-attention layer, similar to non-local neural networks [49], which scans each element in a sequence and updates it by aggregating information from the entire sequence. One of the main advantages of attention-based models is their global computation and perfect memory utilization, which makes them more suitable for long sequences than RNNs. Transformers are now replacing RNNs in many problems in natural language processing, speech processing, and computer vision [8, 27, 45, 34, 31].

The transformer was first used in autoregressive models, following early seq2seq models [44], generating output tokens one by one. However, the prohibitive inference cost (proportional to output length and hard to batch) led to the development of parallel sequence generation in the fields of audio [29], machine translation [12, 10], word representation learning [8], and more recently speech recognition [6]. We also combine transformers and parallel decoding for a suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.

1. Transformers parallelize efficiently and outperform RNNs; they have been adopted in many NLP scenarios

2.3 Object detection

Most modern object detection methods make predictions relative to some initial guess (pre-assigned anchor boxes). Two-stage detectors [37, 5] predict boxes w.r.t. proposals, while single-stage methods make predictions w.r.t. anchor boxes or a grid of possible object centers. Recent work [52] shows that the final performance of these systems strongly depends on exactly how these anchor points or anchor boxes are set. In our model we remove this hand-crafted process and simplify detection by directly predicting the set of detections: training requires only the input images, with no anchor-box or anchor-point design.

Set-based loss. Several object detectors [9, 25, 35] used a bipartite matching loss. However, in these early deep learning models the relations between different predictions were modeled only with convolutional or fully-connected layers, and hand-designed NMS post-processing could still improve their performance. More recent detectors [37, 23, 53] use non-unique assignment rules between ground truth and predictions together with NMS. Learnable NMS methods [16, 4] and relation networks [17] explicitly model the relations between different predictions; using direct set losses, they do not require any post-processing step. However, these methods rely on additional hand-crafted context features such as proposal box coordinates to model the relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.

Recurrent detectors. Closest to our approach are end-to-end set prediction methods for object detection [43] and instance segmentation [41, 30, 36, 42]. Similarly to us, they use bipartite matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage recent transformers with parallel decoding.

1. The bipartite matching loss frees object detection from dense anchor-box/anchor-point assignment strategies, so the model can drop hand-crafted label assignment and NMS post-processing. 2. Existing NMS-free models still have shortcomings and have not been shown to scale to large datasets such as COCO.

3 The DETR model

Two ingredients are essential for direct set prediction in detection: (1) a set prediction loss that forces a unique matching between predicted boxes and ground truth; (2) an architecture that predicts (in a single forward pass) a set of objects and models their relations. We describe our architecture in detail in Figure 2.

3.1 Object detection set prediction loss

DETR infers a fixed-size set of N predictions in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score the predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground-truth objects, and then optimizes the object-specific (bounding box) losses.

Let $y$ denote the ground-truth set of objects and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of N predictions. We also view $y$ as a set of size N, padded with ∅ (no object) when it contains fewer than N objects (assuming N is larger than the number of objects in the image). To find a bipartite matching between these two sets, we search for a permutation of N elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:
$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} L_{match}\big(y_i, \hat{y}_{\sigma(i)}\big) \tag{1}$$

where $L_{match}(y_i, \hat{y}_{\sigma(i)})$ is the pairwise matching cost between the ground truth $y_i$ and the prediction with index $\sigma(i)$. Following previous work (e.g. [43]), this optimal assignment is computed efficiently with the Hungarian algorithm.

The matching cost takes into account both the class prediction and the similarity between predicted and ground-truth boxes. Each element i of the ground-truth set can be seen as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be ∅) and $b_i \in [0, 1]^4$ is a vector defining the box center coordinates and its height and width relative to the image size (cx, cy, w, h, i.e. the YOLO format). For the prediction with index $\sigma(i)$ we define the probability of class $c_i$ as $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define $L_{match}(y_i, \hat{y}_{\sigma(i)})$ as

$$L_{match}\big(y_i, \hat{y}_{\sigma(i)}\big) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, L_{box}\big(b_i, \hat{b}_{\sigma(i)}\big)$$
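A rough sketch of how this cost could be assembled for one image and handed to the Hungarian solver is shown below. The weights and helper names are illustrative, not taken from the official matcher; only real (non-∅) objects enter the cost matrix, which realizes the indicator terms implicitly.

```python
# Sketch of the pairwise matching cost for one image. Boxes are assumed to be
# normalized (cx, cy, w, h); lambda weights here are illustrative.
import torch
from torchvision.ops import box_convert, generalized_box_iou
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes, l_l1=5.0, l_giou=2.0):
    """pred_logits: [N, K+1], pred_boxes: [N, 4], gt_labels: [M], gt_boxes: [M, 4]."""
    prob = pred_logits.softmax(-1)                       # class probabilities
    cost_class = -prob[:, gt_labels]                     # -p_hat(c_i) for real objects only
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)     # L1 distance between boxes
    giou = generalized_box_iou(box_convert(pred_boxes, 'cxcywh', 'xyxy'),
                               box_convert(gt_boxes, 'cxcywh', 'xyxy'))
    cost = cost_class + l_l1 * cost_l1 + l_giou * (-giou)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```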

This procedure of finding a matching plays the same role as the heuristic assignment rules (anchor-box assignment mechanisms such as OTA or PAA) used to match proposals [37] or anchors [22] to ground-truth objects in modern detectors. The main difference is that we need to find a one-to-one matching for direct set prediction, without duplicate predictions.

The second step is to compute the loss function, a Hungarian loss for all pairs matched in the previous step. Our definition of the loss is similar to that of ordinary object detectors: a linear combination of a negative log-likelihood for class prediction and a box loss:
$$L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, L_{box}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]$$

where $\hat{\sigma}$ is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term when $c_i = \varnothing$ by a factor of 10 to account for class imbalance. This is analogous to how the Faster R-CNN training procedure balances positive/negative proposals by subsampling [37]. Note that the matching cost between an object and ∅ does not depend on the prediction, which means that in this case the cost is a constant. In the matching cost we use the probability $\hat{p}_{\hat{\sigma}(i)}(c_i)$ instead of the log-probability. This keeps the classification term commensurable with the box loss, and we observed better empirical performance.

Bounding box loss. The second part of the matching cost and of the Hungarian loss is $L_{box}(\cdot)$, which scores the bounding boxes. Unlike many detectors that predict boxes as offsets w.r.t. some initial guesses, we make box predictions directly. While this simplifies the implementation, it raises an issue with the relative scale of the loss: the commonly used L1 loss has different scales for small and large boxes even when their relative errors are similar. To mitigate this we use a linear combination of the L1 loss and the scale-invariant generalized IoU loss [38] $L_{iou}(\cdot, \cdot)$. Overall, our box loss $L_{box}(b_i, \hat{b}_{\sigma(i)})$ is defined as

$$\lambda_{iou}\, L_{iou}\big(b_i, \hat{b}_{\sigma(i)}\big) + \lambda_{L1}\, \big\|b_i - \hat{b}_{\sigma(i)}\big\|_1,$$

where $\lambda_{iou}, \lambda_{L1} \in \mathbb{R}$ are hyperparameters. These two losses are normalized by the number of objects in the batch.
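Continuing the sketch above, and assuming `pred_idx`/`gt_idx` come from the matcher and the extra ∅ class sits at index `num_classes`, the Hungarian loss for one image might look roughly like this (illustrative weights, not the official implementation):

```python
# Sketch of the Hungarian loss for one image, given the matched indices.
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

def hungarian_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   pred_idx, gt_idx, num_classes,
                   l_l1=5.0, l_giou=2.0, eos_weight=0.1):
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    gt_idx = torch.as_tensor(gt_idx, dtype=torch.long)

    # Every slot gets a class target: matched slots get their GT label,
    # all remaining slots get the "no object" (∅) class.
    N = pred_logits.shape[0]
    target_classes = torch.full((N,), num_classes, dtype=torch.long)
    target_classes[pred_idx] = gt_labels[gt_idx]
    class_weights = torch.ones(num_classes + 1)
    class_weights[num_classes] = eos_weight          # down-weight ∅ by 10x
    loss_cls = F.cross_entropy(pred_logits, target_classes, weight=class_weights)

    # Box losses are computed on matched pairs only, normalized by #objects.
    src, tgt = pred_boxes[pred_idx], gt_boxes[gt_idx]
    num_obj = max(len(gt_idx), 1)
    loss_l1 = F.l1_loss(src, tgt, reduction='sum') / num_obj
    giou = generalized_box_iou(box_convert(src, 'cxcywh', 'xyxy'),
                               box_convert(tgt, 'cxcywh', 'xyxy'))
    loss_giou = (1 - torch.diag(giou)).sum() / num_obj
    return loss_cls + l_l1 * loss_l1 + l_giou * loss_giou
```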

1. Compared with YOLO-style detectors, DETR adds a bipartite matching loss that matches the prediction set to the ground-truth set. 2. DETR's regular losses are cls_loss and boxes_loss (= L1_loss + GIoU_loss)

3.2 DETR architecture

The overall DETR architecture is surprisingly simple and is depicted in Figure 2. It contains three main components, which we describe below: a CNN backbone that extracts a compact feature representation, an encoder-decoder transformer, and simple feed-forward networks (FFNs) that make the final detection predictions. Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation, in just a few hundred lines. Inference code for DETR can be implemented in fewer than 50 lines of PyTorch [32]. We hope that the simplicity of our method will attract new researchers to the detection community.
[Figure 2: detailed DETR architecture — backbone, transformer encoder, transformer decoder with object queries, and prediction FFNs]

Backbone. Starting from the initial image $x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typical values we use are $C = 2048$ and $H, W = H_0/32, W_0/32$.

Transformer encoder. First, a 1x1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, so we collapse the spatial dimensions of $z_0$ into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has a standard architecture consisting of a multi-head self-attention module and a feed-forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [31, 3] that are added to the input of each attention layer. The detailed definition of the architecture, which follows that described in [47], is given in the supplementary material.
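A minimal sketch of this input preparation, assuming C=2048 and d=256 and using PyTorch's stock `nn.TransformerEncoder`; unlike DETR, which adds the positional encoding inside every attention layer, this simplification adds a precomputed encoding `pos` once at the input.

```python
# Sketch: project the backbone feature map, flatten it into a sequence,
# and run a standard transformer encoder over it.
import torch
import torch.nn as nn

class EncoderInputSketch(nn.Module):
    def __init__(self, c=2048, d=256, nhead=8, num_layers=6):
        super().__init__()
        self.input_proj = nn.Conv2d(c, d, kernel_size=1)   # 1x1 conv: C -> d
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, f, pos):
        # f: [B, C, H, W], pos: fixed sine positional encoding [HW, 1, d]
        z0 = self.input_proj(f)                 # [B, d, H, W]
        seq = z0.flatten(2).permute(2, 0, 1)    # [HW, B, d], sequence of image features
        return self.encoder(seq + pos)          # encoder "memory", [HW, B, d]
```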

Transformer decoder. The decoder follows the standard architecture of the transformer, transforming N embeddings of size d using multi-head self-attention and encoder-decoder attention mechanisms. The difference from the original transformer is that our model decodes the N objects in parallel at each decoder layer, whereas Vaswani et al. [47] use an autoregressive model that predicts the output sequence one element at a time. Since the decoder is also permutation-invariant, the N input embeddings must be different to produce different results. These input embeddings are learned positional encodings that we refer to as object queries, and, similarly to the encoder, we add them to the input of each attention layer. The N object queries are transformed into N output embeddings by the decoder. They are then independently decoded into box coordinates and class labels by a feed-forward network (described in the next subsection), resulting in N final predictions. Using self-attention and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pairwise relations between them, while being able to use the whole image as context.
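A correspondingly simplified decoder sketch with N learned object queries is shown below; here the query embeddings are used directly as the target sequence, whereas the official implementation feeds zeros and re-adds the queries at every attention layer.

```python
# Sketch: N learned object queries decoded in parallel against the encoder memory.
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, d=256, nhead=8, num_layers=6, num_queries=100):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, d)   # learned object queries
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=nhead)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, memory):
        # memory: encoder output, [HW, B, d]
        b = memory.shape[1]
        # All N queries are decoded in one pass: no causal mask, non-autoregressive.
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, b, 1)  # [N, B, d]
        return self.decoder(tgt, memory)                            # [N, B, d]
```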

Prediction feed-forward networks (FFNs). The final predictions are computed by a 3-layer perceptron with ReLU activations plus a linear projection layer with a softmax. The FFN predicts the normalized center coordinates, height and width of the box, and the linear layer predicts the class label. Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the number of objects of interest in the image, an additional special class label ∅ is used to indicate that no object is detected within a slot. This class plays a role similar to the "background" class in standard object detection methods.
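A minimal sketch of these heads (class names and the COCO class count are illustrative):

```python
# Sketch of the prediction heads applied to each of the N output embeddings:
# a linear classifier over K classes plus the ∅ ("no object") class, and a
# 3-layer MLP that regresses normalized (cx, cy, w, h) boxes.
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, d=256, num_classes=91):
        super().__init__()
        self.class_embed = nn.Linear(d, num_classes + 1)   # +1 for the ∅ class
        self.bbox_embed = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 4))

    def forward(self, hs):                     # hs: [N, B, d] decoder output
        logits = self.class_embed(hs)          # [N, B, num_classes + 1]
        boxes = self.bbox_embed(hs).sigmoid()  # normalized boxes in [0, 1]
        return logits, boxes
```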

Auxiliary decoding losses. We found it helpful to use auxiliary losses [1] in the decoder during training, especially to help the model output the correct number of objects of each class. We add prediction FFNs and the Hungarian loss after each decoder layer. All prediction FFNs share their parameters. We use an additional shared layer-norm to normalize the inputs to the prediction FFNs from the different decoder layers.
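Schematically, this auxiliary supervision amounts to applying the shared heads and the Hungarian loss to every decoder layer's output; `hungarian_loss_fn` below is a hypothetical wrapper around the matching and loss of Section 3.1.

```python
# Sketch of auxiliary decoding losses: supervise the prediction of every
# decoder layer with the same (shared) heads and the same Hungarian loss.
def total_loss(decoder_outputs, heads, targets, hungarian_loss_fn):
    """decoder_outputs: list of [N, B, d] tensors, one per decoder layer."""
    loss = 0.0
    for hs in decoder_outputs:              # intermediate layers + final layer
        logits, boxes = heads(hs)           # shared prediction FFNs
        loss = loss + hungarian_loss_fn(logits, boxes, targets)
    return loss
```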

DETR consists of Backbone, Transformer encoder, Transformer decoder, and Prediction feed-forward networks.

4 Experiments

The results show that DETR achieves competitive results compared to Faster R-CNN in the quantitative evaluation on COCO. We then provide a detailed ablation study of the architecture and loss. Finally, to show that DETR is a versatile and extensible model, we present results on panoptic segmentation, training only a small extension on top of a fixed DETR model. We provide code and pretrained models to reproduce our experiments at https://github.com/facebookresearch/detr.

Dataset We conduct experiments on the COCO 2017 detection and panoptic segmentation datasets [24, 18], which contain 118k training images and 5k validation images. Each image is annotated with a bounding box and panoptic segmentation. On average there are 7 instances per image, and as many as 63 instances in one image in the training set, ranging from small to large on the same image. If not specified, we report AP as bbox AP, the integral metric over multiple thresholds. For comparison with Faster R-CNN, we report the validation AP for the last training epoch, and for ablation we report the median of the validation results from the past 10 epochs.

Technical details. We train DETR with AdamW [26], setting the initial transformer learning rate to 1e-4, the backbone's to 1e-5, and weight decay to 1e-4. All transformer weights are initialized with Xavier init [11], and the backbone is a ResNet model [15] pretrained on ImageNet. We report results with two different backbones: ResNet-50 and ResNet-101; the corresponding models are called DETR and DETR-R101, respectively. Following [21], we also increase the feature resolution by adding a dilation to the last stage of the backbone and removing the stride from the first convolution of this stage (removing conv_stride_2). The corresponding models are called DETR-DC5 and DETR-DC5-R101 (dilated C5 stage). This modification increases the resolution by a factor of two, which improves performance for small objects, but at the cost of a 16x increase in the encoder's self-attention cost, leading to an overall 2x increase in computational cost. Table 1 gives a full comparison of FLOPs for these models and Faster R-CNN.
[Table 1: comparison with Faster R-CNN on COCO val (AP, parameters, FLOPs)]
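The optimizer setup described above can be sketched as follows, assuming a model whose backbone parameter names contain "backbone" (as in the reference implementation); the exact grouping is illustrative.

```python
# Sketch: separate learning rates for backbone (1e-5) and transformer (1e-4),
# AdamW with weight decay 1e-4, and a 10x learning-rate drop after 200 epochs.
import torch

def build_optimizer(model):
    param_groups = [
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" in n and p.requires_grad], "lr": 1e-5},
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" not in n and p.requires_grad], "lr": 1e-4},
    ]
    optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200)  # step once per epoch
    return optimizer, scheduler
```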

We use multi-scale augmentation, resizing the input images so that the shortest side is at least 480 and at most 800 pixels, while the longest side is at most 1333 [50]. To help learning global relationships through the self-attention of the encoder, we also apply random crop augmentation during training, improving performance by approximately 1 AP. Specifically, a training image is cropped with probability 0.5 to a random rectangular patch, which is then resized again to 800-1333. The transformer is trained with a dropout of 0.1. At inference time, some slots predict the empty class. To optimize for AP, we override the prediction of these slots with the second-highest-scoring class, using the corresponding confidence. This improves AP by 2 points compared to filtering out empty slots. Other training hyperparameters are detailed in Section A.4. For our ablation experiments we use a training schedule of 300 epochs with a learning-rate drop by a factor of 10 after 200 epochs, where a single epoch is one pass over all training images. Training the baseline model for 300 epochs takes 3 days on 16 V100 GPUs with 4 images per GPU (hence a total batch size of 64). For the longer schedule used to compare with Faster R-CNN we train for 500 epochs with a learning-rate drop after 400 epochs. This schedule adds 1.5 AP compared to the shorter schedule.
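The ∅-replacement trick at inference time can be sketched as follows (illustrative, assuming the ∅ class occupies the last logit index):

```python
# Sketch: when a slot's top class is ∅, fall back to the second-highest
# scoring class and its confidence instead of discarding the slot.
import torch

def postprocess_slot_classes(logits, no_object_index):
    probs = logits.softmax(-1)                       # [N, K+1]
    scores, labels = probs.max(-1)                   # best class per slot
    top2 = probs.topk(2, dim=-1)
    is_empty = labels == no_object_index
    labels[is_empty] = top2.indices[is_empty, 1]     # second-best class
    scores[is_empty] = top2.values[is_empty, 1]      # its confidence
    return scores, labels
```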

1. From the experimental results, DETR's advantage over Faster R-CNN is not dramatic; applying DETR's training recipe to Faster R-CNN improves its AP by 1-2 points. 2. Random crop augmentation improves performance by about 1 AP; longer training schedules also improve AP; replacing empty (∅) predictions with the second-highest-scoring class improves performance by about 2 AP

4.1 Comparison with Faster R-CNN

Transformers are typically trained with Adam or Adagrad optimizers, very long training schedules, and dropout, and this is true for DETR as well. Faster R-CNN, however, is trained with SGD and minimal data augmentation, and we are not aware of successful applications of Adam or dropout to it. Despite these differences we attempt to make the Faster R-CNN baseline stronger. To align it with DETR, we add generalized IoU [38] to the box loss, the same random crop augmentation, and long training, known to improve results [13]. Results are presented in Table 1. In the top section we show Faster R-CNN results from the Detectron2 Model Zoo [50] for models trained with the 3x schedule. In the middle section we show results (marked with a "+") for the same models but trained with the 9x schedule (109 epochs) and the described enhancements, which in total add 1-2 AP.

In the last section of Table 1, we show the results of several DETR models. For a comparable number of parameters we choose a model with 6 encoder and 6 decoder transformer layers of width 256 with 8 attention heads. Like Faster R-CNN with FPN, this model has 41.3M parameters, of which 23.5M are in ResNet-50 and 17.8M are in the transformer. Although both Faster R-CNN and DETR might still improve with even longer training, we can conclude that DETR competes with Faster R-CNN, reaching 42 AP on the COCO val subset. DETR achieves this by improving $AP_L$ (+7.8), but note that it still lags behind in $AP_S$ (-5.5). DETR-DC5, with a similar number of parameters and FLOP count, has higher overall AP, but is still significantly behind in $AP_S$. Faster R-CNN and DETR with ResNet-101 backbones also show comparable results.

4.2 Ablations

The attention mechanism in the transformer decoder is a key component in modeling the relationship between feature representations of different detections. In our ablation analysis, we explore how other components and losses of our architecture affect the final performance. In this study, we chose the resnet-50 based DETR model with 6 encoder, 6 decoder layers and a width of 256. The model has 41.3M parameters and achieves 40.6 and 42.0 AP on short and long schedules, respectively, running at 28 FPS, similar to Faster R-CNN-FPN with the same backbone.

Number of encoder layers. We evaluate the importance of global image-level self-attention by varying the number of encoder layers (Table 2). Without any encoder layers, the overall AP drops by 3.9 points, with a larger drop of 6.0 AP on large objects. We hypothesize that, by using global scene reasoning, the encoder is important for disentangling objects. In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image. The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.
[Table 2: effect of the number of encoder layers; Figure 3: encoder self-attention for a set of reference points]

Number of decoder layers.
We apply an auxiliary loss after each decoding layer (see Section 3.2), hence the prediction FFNs are trained to predict objects from the output of every decoder layer. We analyze the importance of each decoder layer by evaluating the objects that would be predicted at each stage of decoding (Fig. 4). Both AP and AP50 improve after every layer, totalling a very significant +8.2/9.5 AP improvement between the first and the last layer. With its set-based loss, DETR does not need NMS by design. To verify this, we run a standard NMS procedure with default parameters [50] on the outputs after each decoder layer. NMS improves the performance for the predictions from the first decoder layer. This can be explained by the fact that a single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object. In the second and subsequent layers, the self-attention mechanism over the activations allows the model to suppress duplicate predictions. We observe that the improvement brought by NMS diminishes as depth increases. At the last layer, we observe a small loss in AP because NMS incorrectly removes true positive predictions.
[Figure 4: AP and AP50 after each decoder layer, with and without NMS]

Similar to visualizing encoder attention, we visualize decoder attention in Fig. 6, coloring the attention map of each predicted object with different colors. We observe that the decoder's attention is quite local, meaning it mainly focuses on the extremities of objects, such as heads or legs. We hypothesize that after the encoder separates out instances via global attention, the decoder only needs to pay attention to the ends to extract class and object boundaries.
[Figure 6: decoder attention maps for predicted objects]

Importance of FFN. The FFNs inside the transformer can be seen as 1×1 convolutional layers, making the encoder similar to an attention-augmented convolutional network [3]. We attempt to remove them completely, leaving only the attention in the transformer layers. By reducing the number of network parameters from 41.3M to 28.7M, leaving only 10.8M in the transformer, performance drops by 2.3 AP; we thus conclude that the FFN is important for good results.

Importance of positional encodings. There are two kinds of positional encodings in our model: spatial positional encodings and output positional encodings (object queries). We experiment with various combinations of fixed and learned encodings, and the results are presented in Table 3. The output position encoding is required and cannot be removed, so we experiment with a single pass at the decoder input, or adding queries at each decoder attention layer. In the first experiment, we completely removed the spatial position encoding and passed the output position encoding at the input position, interestingly, the model still achieved more than 32 AP, a loss of 7.8 AP compared to the baseline. We then pass a fixed sinusoidal spatial position encoding on input and output encoding once, as in the original transformer [47], and find that this results in a 1.4 AP drop compared to passing the position encoding directly. Similar results are obtained for the learned spatial encoding passed to attention. Surprisingly, we find that not passing any spatial encoding in the encoder results in only a small AP drop of 1.3 AP. When we pass encodings to attention, they are shared across all layers and output encodings (object queries) are always learned.
[Table 3: ablation of positional encodings]
Considering these issues, we conclude that the transformer components: encoder's global self-attention, FFN, multiple decoder layers, and positional encoding, all contribute significantly to the final object detection performance.

Loss ablations. To assess the importance of the different components of the matching cost and the loss, we train several models turning them on and off. There are three components to the loss: classification loss, L1 box loss, and GIoU [38] loss. The classification loss is essential for training and cannot be turned off, so we train a model without the box distance loss and a model without the GIoU loss, and compare them with the baseline trained with all three losses. Results are shown in Table 4. The GIoU loss largely dominates the model's box performance (training without it performs particularly poorly), and adding the L1 box loss on top improves AP by only 0.7. We only present simple ablations of these losses; other ways of combining them may achieve different results.
[Table 4: effect of loss components on AP]

4.3 Analysis

Decoder output slot analysis. In Figure 7 we visualize the boxes predicted by different slots for all images in the COCO 2017 val set. Each of DETR's output slots focuses on objects in different regions. We observe that each slot has several operating modes, focusing on different areas and box sizes. In particular, all slots have a mode for predicting large, image-wide boxes (visible as the aligned red dots in the middle of the plots). We hypothesize that this is related to the distribution of objects in COCO.
[Figure 7: boxes predicted by different decoder output slots on COCO 2017 val]

Generalization to unseen numbers of instances. Some classes in COCO are not well represented by many instances of the same class in the same image. For example, there is no image with more than 13 giraffes in the training set. We create a synthetic image to verify the generalization ability of DETR (see Figure 5). Our model finds all 24 giraffes in the image, which is clearly out of distribution. This experiment confirms that there is no strong class specialization in the object queries.
[Figure 5: out-of-distribution generalization to a synthetic image with 24 giraffes]

4.4 DETR for panoptic segmentation

Panoptic segmentation [19] has recently attracted a lot of attention from the computer vision community. Similarly to the extension of Faster R-CNN [37] to Mask R-CNN [14], DETR can be extended by adding a mask head on top of the decoder outputs. In this section we demonstrate that such a head can be used to produce panoptic segmentation by treating stuff and thing classes in a unified way [19]. We perform our experiments on the panoptic annotations of the COCO dataset, which has 53 stuff categories in addition to the 80 thing categories.

We train DETR to predict boxes around both stuff and thing classes on COCO, using the same recipe. Predicting boxes is required for the training to be possible, since the Hungarian matching is computed using distances between boxes. We also add a mask head that predicts a binary mask for each of the predicted boxes, see Figure 8. It takes as input the output of the transformer decoder for each object and computes multi-head (with M heads) attention scores of this embedding over the output of the encoder, generating M attention heatmaps per object at a small resolution. To make the final prediction and increase the resolution, an FPN-like architecture is used. We describe the architecture in more detail in the supplement. The final resolution of the masks has stride 4, and each mask is supervised independently using the DICE/F-1 loss [28] and the focal loss [23].
[Figure 8: panoptic mask head producing per-object attention heatmaps and an FPN-style upsampling path]

The mask head can be trained either jointly or in a two-step process, in which we first train DETR for boxes only, then freeze all the weights and train only the mask head for 25 epochs. Experimentally, the two approaches give similar results; we report results using the latter approach (separate training, shorter total training time).
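A sketch of the freezing step in this two-stage recipe, assuming the panoptic model exposes the mask head as a `mask_head` attribute (an illustrative name):

```python
# Sketch: freeze the box-trained DETR weights and leave only the mask head trainable.
import torch

def freeze_for_mask_training(model):
    for p in model.parameters():
        p.requires_grad_(False)               # backbone + transformer + box heads frozen
    for p in model.mask_head.parameters():    # `mask_head` is an assumed attribute name
        p.requires_grad_(True)
    return torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad],
        lr=1e-4, weight_decay=1e-4)
```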

To predict the final panoptic segmentation, we simply take an argmax over the mask scores at each pixel and assign the corresponding category to the resulting masks. This procedure guarantees that the final masks have no overlaps; therefore, DETR does not require the heuristics [19] that are often used to align different masks.

Training details. We train the DETR, DETR-DC5 and DETR-R101 models to predict boxes around stuff and thing classes in the COCO dataset, following the recipe for box detection. The new mask head is trained for 25 epochs (see the supplement for details). During inference we first filter out detections with a confidence below 85%, then compute the per-pixel argmax to determine which mask each pixel belongs to. We then collapse different mask predictions of the same stuff category into one, and filter out empty ones (fewer than 4 pixels).
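A rough sketch of this inference pipeline (illustrative names and shapes; the merging of stuff masks of the same category is omitted):

```python
# Sketch: drop low-confidence detections, assign each pixel to the mask with
# the highest score via argmax, and discard tiny masks.
import torch

def panoptic_postprocess(scores, labels, mask_logits, threshold=0.85, min_pixels=4):
    """scores: [Q], labels: [Q], mask_logits: [Q, H, W] per-query mask scores."""
    keep = scores > threshold                              # confidence filter
    labels, mask_logits = labels[keep], mask_logits[keep]  # [M], [M, H, W]
    pixel_owner = mask_logits.argmax(0)                    # [H, W], winning mask per pixel
    segments = []
    for i, cls in enumerate(labels.tolist()):
        mask = pixel_owner == i
        if mask.sum() >= min_pixels:                       # drop empty / tiny masks
            segments.append({"category_id": cls, "mask": mask})
    return segments
```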

Main results. Qualitative results are shown in Figure 9. In Table 5 we compare our unified panoptic segmentation approach with several established methods that treat things and stuff differently. We report panoptic quality (PQ) and the breakdown on things ($PQ_{th}$) and stuff ($PQ_{st}$). We also report the mask AP (computed on the thing classes) before any panoptic post-processing (in our case, before taking the pixel-wise argmax). We show that DETR outperforms published results on COCO val 2017, as well as our strong PanopticFPN baseline (trained with the same data augmentation as DETR, for a fair comparison).
[Figure 9: qualitative panoptic segmentation results]
[Table 5: comparison with established panoptic segmentation methods on COCO val]

The breakdown of the results shows that DETR is especially dominant on stuff classes, and we hypothesize that the global reasoning allowed by the encoder attention is the key factor behind this result. For the thing classes, despite a severe deficit of up to 8 mAP compared to the baseline on the mask AP computation, DETR still obtains a competitive $PQ_{th}$. We also evaluate our method on the test set of the COCO dataset and obtain 46 PQ. We hope that our approach will inspire the exploration of fully unified models for panoptic segmentation in future work.

5 Conclusion

We propose a novel design for object detection systems based on transformers and a bipartite matching loss for direct set prediction. The approach achieves results comparable to the Faster R-CNN baseline on the challenging COCO dataset. DETR is easy to implement and has a flexible architecture that is readily extended to panoptic segmentation, with competitive results. In addition, its performance on large objects is significantly better than Faster R-CNN's, likely thanks to the processing of global information performed by self-attention.

This new detector design also introduces new challenges, in particular regarding training, optimization, and performance on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.
