A close reading of the DINO paper: parsing its model structure and the DETR variants it builds on

As of July 25, 2022, DINO is the SOTA in object detection.

I wrote this post based on my experience reproducing the source code and my close reading of the DINO paper, and I hope it helps you.

Table of contents

1. Summary

2. Conclusion

3. Analyzing the DINO model

(1) Overview: how DINO draws on previous work

(2) Overview of the DINO model

4. Innovative methods

(1) What is Contrastive DeNoising Training?

(2) What is Mixed Query Selection?

(3) What is Look Forward Twice?

5. Experimental aspects

(1) Dataset and Network Backbone

(2) Implementation details

6. Brilliant data visualization


1. Summary

We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in both performance and efficiency by using:

  • a contrastive denoising training method;
  • a mixed query selection method for anchor initialization;
  • a look forward twice scheme for box prediction.

This article will walk through these three innovative methods one by one.

Using a ResNet-50 backbone and multi-scale features, DINO reaches 49.4 AP in 12 epochs and 51.3 AP in 24 epochs (very fast convergence!), a significant gain of +6.0 AP and +2.7 AP, respectively, over DN-DETR, the previous best DETR-like model. DINO also scales well in both model size and data size. Without bells and whistles, DINO achieves the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP) after pre-training on the Objects365 dataset with a SwinL backbone. Compared with other models on the leaderboard, DINO significantly reduces the model size and pre-training data size while achieving better results.

Keywords: Object Detection; Detection Transformer; End-to-End Detector

Paper link: https://arxiv.org/abs/2203.03605

Source link: https://github.com/IDEACVR/DINO

Supplement: noun + "-like" forms an adjective meaning "resembling" or "in the style of", as in "DETR-like models".

2. Conclusion

In this paper, we propose DINO, a strong end-to-end Transformer detector with contrastive denoising training, mixed query selection, and look forward twice, which significantly improves both training efficiency and final detection performance. As a result, DINO outperforms all previous ResNet-50-based models on COCO val2017 in both the 12-epoch and 36-epoch settings using multi-scale features. Motivated by this improvement, we further explored training DINO with a stronger backbone on a larger dataset and achieved a state-of-the-art 63.3 AP on COCO 2017 test-dev. This result establishes DETR-like models as a mainstream detection framework, not only because of their novel end-to-end detection optimization but also because of their superior performance.

The authors show off their SOTA results right at the beginning of the paper, as shown in Figure 1:

Figure 1 Comparison of DINO with other detection models on the COCO dataset

  • Figure 1 (a): comparison with ResNet-50-backbone models w.r.t. training epochs. Models marked DC5 use a dilated, higher-resolution feature map; the other models use multi-scale features.
  • Figure 1 (b): comparison with SOTA models w.r.t. pre-training data size and model size. The SOTA models come from the COCO test-dev leaderboard. In the legend, the paper lists the backbone pre-training data size (first number) and the detection pre-training data size (second number).

Summary: after only a few epochs, DINO reaches an accuracy the other models cannot match, while its pre-training data size and model size are also smaller.

3. Analyzing the DINO model

(1) Overview: how DINO draws on previous work

As studied in Conditional DETR [25] and DAB-DETR [21], a query in DETR [3] consists of two parts: a positional part and a content part, referred to as the positional query and the content query in this paper. DAB-DETR [21] explicitly represents each positional query in DETR as a 4D anchor box (x, y, w, h), where x and y are the center coordinates of the box, and w and h are its width and height. This explicit anchor box format makes it easy to dynamically refine anchor boxes layer by layer in the decoder.

How to solve the problem of slow convergence of DETR?

DN-DETR [17] introduces a denoising (DN) training method to speed up the training convergence of DETR-like models. It argues that the slow convergence of DETR is caused by the instability of bipartite matching. To alleviate this problem, DN-DETR proposes to add noised ground-truth (GT) labels and boxes in the Transformer decoder and train the model to reconstruct the ground truth. The added noise (\Delta x, \Delta y, \Delta w, \Delta h) is constrained by |\Delta x| < \frac{\lambda}{2} w, |\Delta y| < \frac{\lambda}{2} h, |\Delta w| < \lambda w, and |\Delta h| < \lambda h, where (x, y, w, h) denotes a GT box and λ is a hyperparameter controlling the noise scale. Since DN-DETR follows DAB-DETR in treating decoder queries as anchors, and λ is usually small, a noised GT box can be regarded as a special anchor with a GT box nearby. Besides the original DETR queries, DN-DETR adds a DN part to the decoder, which feeds the noised GT labels and boxes into the decoder to provide an auxiliary DN loss. The DN loss effectively stabilizes and speeds up DETR training and can be plugged into any DETR-like model.

Note on λ: DN-DETR actually uses separate noise scales λ1 (center shifting) and λ2 (box scaling) but sets λ1 = λ2, so for simplicity both are written as λ in this paper.
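To make the constraint concrete, here is a minimal sketch of this kind of box noising in PyTorch, assuming boxes in normalized (cx, cy, w, h) format; noise_gt_boxes is a hypothetical helper for illustration, not DN-DETR's official implementation:

```python
import torch

def noise_gt_boxes(gt_boxes: torch.Tensor, lam: float = 0.4) -> torch.Tensor:
    """DN-style noising of GT boxes in normalized (cx, cy, w, h) format.

    Center shift is bounded by lam * w / 2 and lam * h / 2; width and
    height are rescaled by factors in [1 - lam, 1 + lam], so |dw| < lam * w
    and |dh| < lam * h, matching the constraint above.
    """
    cx, cy, w, h = gt_boxes.unbind(-1)
    dx = (torch.rand_like(cx) * 2 - 1) * lam * w / 2   # |dx| < lam * w / 2
    dy = (torch.rand_like(cy) * 2 - 1) * lam * h / 2   # |dy| < lam * h / 2
    sw = 1 + (torch.rand_like(w) * 2 - 1) * lam        # scale in [1-lam, 1+lam]
    sh = 1 + (torch.rand_like(h) * 2 - 1) * lam
    noisy = torch.stack([cx + dx, cy + dy, w * sw, h * sh], dim=-1)
    return noisy.clamp(0.0, 1.0)                       # keep boxes normalized
```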

Deformable DETR [41] is another early work that speeds up the convergence of DETR. To compute deformable attention, it introduces the concept of a reference point, which lets deformable attention attend to a small set of key sampling points around the reference point. The reference-point concept makes it possible to develop several techniques that further improve DETR performance. The first technique is "two-stage", which selects features and reference boxes directly from the encoder as input to the decoder. The second technique is iterative bounding-box refinement, with a careful gradient-detachment design between consecutive decoder layers. In this paper, we refer to the "two-stage" technique and the gradient-detachment technique as "query selection" and "look forward once", respectively.

Following DAB-DETR and DN-DETR, DINO formulates positional queries as dynamic anchor boxes and trains with an additional DN loss. Note that DN-DETR also adopts several techniques from Deformable DETR to achieve better performance, including its deformable attention mechanism and its "look forward once" implementation for layer-wise parameter updates. DINO further adopts the query selection idea from Deformable DETR to better initialize the positional queries. On top of this strong baseline, DINO introduces three new methods to further improve detection performance, described in Sec. 3.3, Sec. 3.4, and Sec. 3.5 of the paper, respectively.

(2) Overview of the DINO model

As shown in Figure 2, our improvements are mainly reflected in the Transformer encoder and decoder. The top-K encoder features in the last layer are selected to initialize the positional queries of the Transformer decoder, while the content queries are kept as learnable parameters. Our decoder also contains a Contrastive DeNoising (CDN) part with positive and negative samples.

Keyword explanation:

Flatten : flatten the 2D feature maps into a token sequence

Matching : match

Pos Neg: positive and negative samples

Init Anchors : Initialize the anchor box

CDN:Contrastive DeNoising

Position Embeddings : position embedding

Multi-Scale Feature : multi-scale features

Encoder Layers × N : encoder with N encoder layers

Decoder Layers × N : decoder with N decoder layers

GT + Noise : ground-truth labels and boxes with added noise

Learning Content Queries : learnable content queries

K, V, Q in the Transformer: Key, Value, Query

Figure 2 Framework of DINO model

As a DETR-like model, DINO is an end-to-end architecture consisting of a backbone, a multi-layer Transformer encoder, a multi-layer Transformer decoder, and multiple prediction heads. The overall pipeline is shown in Figure 2.

The propagation process of the DINO model, with its improved modules (a simplified code sketch follows the list):

  1. Given an image, we extract multi-scale features with a backbone such as ResNet or Swin Transformer.
  2. The features, together with the corresponding position embeddings, are fed into the Transformer encoder for feature enhancement.
  3. After feature enhancement by the encoder layers, a new mixed query selection strategy initializes anchors as the positional queries for the decoder. Note that this strategy does not initialize the content queries; it keeps them learnable.
  4. With the initialized anchors and the learnable content queries, deformable attention [41] is used to combine the encoder's output features and update the queries layer by layer.
  5. The final outputs are formed from the refined anchor boxes and the classification results predicted from the refined content features.
  6. Like DN-DETR, an additional DN branch performs denoising training. Beyond the standard DN method, a new contrastive denoising training approach is proposed, which additionally considers hard negative samples.
  7. To make full use of the refined box information from later layers to help optimize the parameters of their adjacent earlier layers, a new look forward twice method is proposed to pass gradients between adjacent layers.
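The following is a highly simplified sketch of this forward pipeline (CDN branch and losses omitted). All module and head names here (backbone, encoder, decoder, enc_class_head, enc_box_head) are illustrative stand-ins, not names from the released code:

```python
import torch
from torch import nn

class DINOSketch(nn.Module):
    """Simplified DINO forward pass, for illustration only."""

    def __init__(self, backbone, encoder, decoder, num_queries=900,
                 hidden_dim=256, num_classes=91):
        super().__init__()
        self.backbone = backbone                      # ResNet / Swin, multi-scale
        self.encoder = encoder                        # deformable Transformer encoder
        self.decoder = decoder                        # deformable Transformer decoder
        self.content_queries = nn.Embedding(num_queries, hidden_dim)  # learnable
        self.enc_class_head = nn.Linear(hidden_dim, num_classes)
        self.enc_box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):
        feats = self.backbone(images)                 # step 1: multi-scale features
        memory = self.encoder(feats)                  # step 2: enhancement, [B, L, C]
        # Step 3: mixed query selection -- anchors from top-K encoder features,
        # while content queries stay as learnable embeddings.
        scores = self.enc_class_head(memory).max(dim=-1).values
        k = self.content_queries.num_embeddings
        idx = scores.topk(k, dim=1).indices.unsqueeze(-1).expand(-1, -1, memory.size(-1))
        anchors = self.enc_box_head(memory.gather(1, idx)).sigmoid()
        content = self.content_queries.weight
        # Steps 4-5: the deformable decoder refines anchors and content layer by layer.
        boxes, logits = self.decoder(content, anchors, memory)
        return boxes, logits
```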

4. Innovative methods

(1) What is Contrastive DeNoising Training?

DN-DETR is very effective in stabilizing training and accelerating convergence. With the help of DN queries, it learns to make predictions based on anchors with GT boxes nearby. However, it lacks the ability to predict "no object" for anchors with no nearby objects. To address this issue, we propose a Contrastive DeNoising (CDN) method to reject useless anchors.

Figure 3 The structure of the CDN group and the demonstration of positive and negative examples

As shown in the figure above, although both positive and negative examples are 4D anchors and can be represented as points in 4D space, we represent them as points in 2D space on concentric squares for simplicity. Assuming that the center of the square is a GT box, then:

  • Points within the inner square are considered positive examples.
  • Points between the inner and outer squares are considered negative examples.

a) CDN implementation: DN-DETR has a single hyperparameter λ to control the noise scale, and the generated noise is no larger than λ, since DN-DETR expects the model to reconstruct the ground truth (GT) from moderately noised queries. In our method, we have two hyperparameters λ1 and λ2, where λ1 < λ2. As shown by the concentric squares in Figure 3, we generate two types of CDN queries: positive queries and negative queries. Positive queries within the inner square have a noise scale smaller than λ1 and are expected to reconstruct their corresponding GT boxes. Negative queries between the inner and outer squares have a noise scale larger than λ1 and smaller than λ2, and are expected to predict "no object". We usually adopt a small λ2, because hard negative samples closer to the GT boxes are more helpful for improving performance. (Careful λ selection matters.)

As shown in Figure 3, each CDN group has a set of positive queries and a set of negative queries. If an image has n GT boxes, a CDN group will have 2 × n queries, since each GT box generates one positive and one negative query. Like DN-DETR, we also use multiple CDN groups to improve the effectiveness of the method.
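As a concrete illustration, here is a minimal sketch of generating one CDN group; make_cdn_queries and its per-coordinate noising scheme are simplified assumptions, not the paper's exact sampling procedure:

```python
import torch

def make_cdn_queries(gt_boxes: torch.Tensor, lam1: float = 0.4,
                     lam2: float = 1.0) -> torch.Tensor:
    """Sketch of one Contrastive DeNoising (CDN) group.

    Positive queries get noise scale below lam1 (expected to reconstruct the
    GT box); negative queries get noise scale in [lam1, lam2) (expected to
    predict "no object"). Boxes are normalized (cx, cy, w, h), shape [n, 4].
    """
    def jitter(boxes, lo, hi):
        # Per-coordinate noise with magnitude in [lo, hi) and random sign,
        # bounded by the box size as in the lambda constraint.
        scale = lo + (hi - lo) * torch.rand_like(boxes)
        sign = torch.randint_like(boxes, 0, 2) * 2 - 1
        wh = boxes[..., 2:].repeat(1, 2)
        return (boxes + sign * scale * wh / 2).clamp(0.0, 1.0)

    positives = jitter(gt_boxes, 0.0, lam1)   # inner square: positives
    negatives = jitter(gt_boxes, lam1, lam2)  # ring between squares: negatives
    return torch.cat([positives, negatives], dim=0)  # 2 x n queries per group
```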

b) Choice of loss functions:

  • The box-regression reconstruction loss uses the l_{1} and GIoU losses.
  • Classification uses focal loss ("Focal Loss for Dense Object Detection").
  • The loss for classifying negative samples as background is also focal loss.
  • Note: focal loss was proposed by Lin et al. (with Kaiming He as a co-author) to address the sample-imbalance problem; a minimal sketch follows this list.
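Here is a minimal sketch of the binary focal loss from "Focal Loss for Dense Object Detection"; the function name and mean reduction are my choices for illustration:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss (Lin et al., 2017), minimal sketch.

    Down-weights easy examples so training focuses on hard ones; DETR-like
    models use it for classification, including the "no object" (background)
    predictions of negative CDN queries.
    """
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)          # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)    # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```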

c) Why the CDN approach works: it suppresses confusion and selects high-quality anchors (queries) for predicting bounding boxes. Confusion arises when multiple anchors are close to one object; in this case it is hard for the model to decide which anchor to choose. This confusion causes two problems.

  1. The first problem is duplicate predictions. Although DETR-like models can suppress duplicate boxes with the help of the set-based loss and self-attention ["DETR: End-to-End Object Detection with Transformers"], this ability is limited. As shown in the left panel of Figure 8, when our CDN queries are replaced with DN queries, the boy indicated by the arrow gets 3 duplicate predictions. With CDN queries, our model can distinguish subtle differences between anchors and avoid duplicate predictions, as shown in the right panel of Figure 8.
  2. The second problem is that an undesired anchor farther from the GT box may be selected. While denoising training improves the model's ability to select nearby anchors, CDN further improves this ability by teaching the model to reject farther anchors.

Figure 8: left, detection results of the model trained with DN queries; right, results with CDN queries. In the left image, the boy indicated by the arrow has 3 duplicate bounding boxes. For clarity, only boxes of class "person" are shown.

d) Verifying the effectiveness of CDN: To quantify the effectiveness of CDN, we define the Average Top-K Distance (ATD(k)) and use it in the matching part to evaluate how far anchors are from their target GT boxes. As in DETR, each anchor corresponds to a prediction that may match a GT box or the background; here we only consider those matched with GT boxes. Suppose there are N GT bounding boxes b_0, b_1, ..., b_{N-1} in a validation set. For each b_i, we can find its corresponding anchor a_i, namely the initial anchor box of the decoder whose refined box after the last decoder layer is assigned to b_i during matching. Then we have:

ATD(k) = \frac{1}{k} \sum \mathrm{topK}\left(\{\|b_0 - a_0\|_1, \|b_1 - a_1\|_1, \ldots, \|b_{N-1} - a_{N-1}\|_1\}, k\right)

where \|b_i - a_i\|_1 is the l_{1} distance between b_i and a_i, and topK(x, k) is a function that returns the set of the k largest elements in x. We choose the top-K elements because the confusion problem is more likely to occur for GT boxes matched with farther anchors. As shown in Figure 4 (a) and (b), DN is good enough at selecting anchors overall, but CDN finds better anchors for small objects. Figure 4 (c) shows that CDN queries bring a +1.3 AP improvement over DN queries on small objects with ResNet-50 and multi-scale features at 12 epochs.
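Computing ATD(k) is straightforward once GT boxes and their matched anchors are paired; a minimal sketch, assuming both come as [N, 4] tensors in the same normalized format:

```python
import torch

def atd(gt_boxes: torch.Tensor, matched_anchors: torch.Tensor,
        k: int = 100) -> torch.Tensor:
    """Average Top-K Distance between GT boxes and their matched anchors."""
    dists = (gt_boxes - matched_anchors).abs().sum(dim=-1)  # ||b_i - a_i||_1
    topk = dists.topk(min(k, dists.numel())).values          # k largest distances
    return topk.mean()
```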

Figure 4 (a) and (b): ATD(100) on all objects and small objects, respectively; (c): AP on small objects. 

(2) What is Mixed Query Selection?

Figure 5 Comparison of three different query initialization methods (note the English nouns in the figure)

The term "static" means that, in inference, they will remain the same for different images. A common implementation for these static queries is to make them learnable. 

Static Queries: In DETR and DN-DETR, decoder queries are static embeddings that take no encoder features from an individual image, as shown in Figure 5(a). They learn anchors (in DN-DETR and DAB-DETR) or positional queries (in DETR) directly from the training data, and all content queries are set to 0 vectors.

Pure Query Selection: Deformable DETR learns both positional and content queries, which is another implementation of static query initialization. To further improve performance, Deformable DETR proposes a query selection variant ("two-stage") that selects the top-K encoder features from the last encoder layer as priors to enhance the decoder queries. As shown in Figure 5(b), both positional and content queries are generated by a linear transform of the selected features. In addition, the selected features are fed into an auxiliary detection head to obtain predicted boxes, which are used to initialize the reference boxes. Similarly, Efficient DETR also selects the top-K features based on the objectness (class) score of each encoder feature.

Mixed Query Selection: In our model, the dynamic 4D anchor box formulation makes queries closely related to the decoder positional queries, which can be improved by query selection. We follow the practice above and propose a mixed query selection method. As shown in Figure 5(c), we initialize the anchor boxes using only the positional information associated with the selected top-K features, and keep the content queries static as before. Note that Deformable DETR uses the top-K features to enhance not only the positional queries but also the content queries. Since the selected features are preliminary content features that have not been further refined, they could be ambiguous and misleading to the decoder; for example, a selected feature may contain multiple objects or only part of an object. In contrast, our mixed query selection method only enhances the positional queries with the top-K selected features and keeps the content queries learnable. This helps the model use better positional information to pool more comprehensive content features from the encoder.
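The contrast between pure and mixed query selection can be sketched as follows; all helper and head names are illustrative assumptions, not the official API:

```python
import torch
from torch import nn

def select_topk_features(memory, class_head, k):
    """Pick the top-K encoder features by objectness score. memory: [B, L, C]."""
    scores = class_head(memory).max(dim=-1).values               # [B, L]
    idx = scores.topk(k, dim=1).indices                          # [B, K]
    return memory.gather(1, idx.unsqueeze(-1).expand(-1, -1, memory.size(-1)))

def pure_query_selection(memory, class_head, box_head, content_proj, k=900):
    """Deformable-DETR style: positional AND content queries from features."""
    feats = select_topk_features(memory, class_head, k)
    return box_head(feats).sigmoid(), content_proj(feats)        # anchors, content

def mixed_query_selection(memory, class_head, box_head, content_queries, k=900):
    """DINO style: anchors from top-K features; content queries stay learnable."""
    feats = select_topk_features(memory, class_head, k)
    anchors = box_head(feats).sigmoid()                          # positional part only
    content = content_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
    return anchors, content
```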

(3) What is Look Forward Twice?

Figure 6 Comparison between box update in Deformable DETR and the method in this paper

Look Forward Once: We propose a new box prediction method in this section. The iterative box refinement in Deformable DETR blocks gradient backpropagation between layers to stabilize training. We name that method Look Forward Once, because the parameters of layer i are updated based only on the auxiliary loss of box b_{i}, as shown in Fig. 6(a).

Look Forward Twice: However, we conjecture that the improved box information from a later layer could be more helpful in correcting the box predictions of its adjacent earlier layer. Hence we propose another method called Look Forward Twice to perform box updates, where the parameters of layer i are affected by the losses of both layer i and layer (i + 1), as shown in Fig. 6(b). Each predicted offset \Delta b_{i} is used to update a box twice, once for b'_{i} and once for b_{i+1}^{(pred)}, hence the name look forward twice.

The specific implementation process of Look Forward Twice is as follows:

The final precision of a predicted box b_{i}^{(pred)} is determined by two factors: the quality of the initial box b_{i-1} and the predicted box offset \Delta b_{i}.

The look forward once scheme only optimizes the latter, since the gradient information is detached from layer i to layer (i-1). In contrast, we improve both the initial box b_{i-1} and the predicted box offset \Delta b_{i}. A simple way to improve the quality is to supervise the final box b'_{i} of layer i with the output \Delta b_{i+1} of the next layer. Therefore, we use the sum of b'_{i} and \Delta b_{i+1} as the predicted box of layer (i + 1). (This is similar in spirit to unrolling a recurrent neural network through time.)

More specifically, given an input box b_{i-1} for layer i, we obtain the final predicted box b_{i}^{(pred)} as follows:

b'_{i} = \mathrm{Update}(b_{i-1}, \Delta b_{i}),
b_{i} = \mathrm{Detach}(b'_{i}),
b_{i}^{(pred)} = \mathrm{Update}(b'_{i-1}, \Delta b_{i}),

where:

  • b'_{i} is the undetached version of b_{i}.
  • Gradient detach: b_{i} is obtained from b'_{i} by detaching the gradient.
  • Update(·,·) is the function that refines the box b_{i-1} with the predicted box offset \Delta b_{i}.

We use the same box update method as Deformable DETR, which represents boxes in normalized form, so each box coordinate is a floating-point number between 0 and 1. Given a box and a predicted offset, the box is mapped through the inverse sigmoid, the offset is added, and the result is mapped back through the sigmoid.
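Putting the update rule and the detach pattern together, here is a minimal sketch that follows the three equations above; the helper names are mine, not the released code's:

```python
import torch

def inverse_sigmoid(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def update(box: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Deformable-DETR-style update: add the offset in inverse-sigmoid space."""
    return torch.sigmoid(inverse_sigmoid(box) + delta)

def look_forward_twice_boxes(init_anchor, deltas):
    """Unroll the look-forward-twice updates over decoder layers.

    deltas: list of per-layer offsets Delta b_i. Returns the predictions
    b_i^(pred) used for the per-layer losses; each loss reaches both
    Delta b_i and, through b'_{i-1}, the previous layer's offset.
    """
    preds = []
    b_det = init_anchor      # b_{i-1}: detached reference fed to layer i
    b_undet = init_anchor    # b'_{i-1}: undetached version
    for delta in deltas:
        b_prime = update(b_det, delta)        # b'_i = Update(b_{i-1}, Delta b_i)
        preds.append(update(b_undet, delta))  # b_i^(pred) = Update(b'_{i-1}, Delta b_i)
        b_undet = b_prime
        b_det = b_prime.detach()              # b_i = Detach(b'_i)
    return preds
```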

5. Experimental aspects

(1) Dataset and Network Backbone

Datasets: We evaluate on the COCO 2017 object detection dataset [20], which is split into train2017 and val2017 (also known as minival).

Network Backbone: We report results using two different backbones:

  • ResNet-50 pretrained on ImageNet-1k.
    "Deep residual learning for image recognition"
  • SwinL pretrained on ImageNet-22k.
"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"

DINO with ResNet-50 is trained on train2017 without extra data, while DINO with SwinL is first pretrained on Objects365 ("Objects365: A Large-Scale, High-Quality Dataset for Object Detection") and then fine-tuned on train2017. We report standard average precision (AP) results on val2017 under different IoU thresholds and object scales, as well as test-dev results for DINO with SwinL.

(2) Implementation details

DINO consists of a backbone, a Transformer encoder, a Transformer decoder, and multiple prediction heads. Appendix D of the paper provides more implementation details, including all hyperparameters and engineering techniques used in the model, for those who want to reproduce the results. The code was promised after the blind review (it has already been released; I have run it and will post an update later).

Appendix D covers training optimization techniques, hyperparameter choices, and the GPUs used. It is a useful reference when reading the source code; some hyperparameters are shown in Table 8:

 Table 8 Hyperparameters used by the DINO model

6. Brilliant data visualization

Table 1. Results of DINO and other detection models on COCO val2017 with a ResNet-50 backbone, trained for 12 epochs (the so-called 1× setting). For models without multi-scale features, GFLOPs and FPS are tested on their best variant, ResNet-50-DC5.

  • DINO uses 900 queries.
  • Marked models use 900 queries, or 300 queries with 3 patterns, which has a similar effect to 900 queries.
  • The other DETR-like models use 300 queries, except DETR, which uses 100.
  • * indicates testing with the mmdetection framework.
  • 4scale and 5scale: the number of multi-scale feature maps used.

Supplementary notes:

GFLOPs: giga (billions of) floating-point operations; in detection tables like this one it measures a model's computational cost per forward pass. The related unit GFLOPS (operations per second) is often quoted as a GPU performance parameter, though it does not necessarily reflect the GPU's real-world performance.

MMDetection is an open-source object detection toolbox launched by SenseTime and the Chinese University of Hong Kong. It implements a large number of object detection algorithms on top of PyTorch and encapsulates dataset construction, model construction, and training strategy into modules. By composing modules, a new algorithm can be implemented with a small amount of code, which greatly improves code reuse. The DINO source code uses the config.py mechanism from mmcv.

Table 2. Results of DINO and other detection models with a ResNet-50 backbone on COCO val2017, trained with more epochs

Figure 7. Training convergence curves of DINO and two previous state-of-the-art ResNet-50 models on COCO val2017 using multi-scale features. It shows vividly that DINO improves accuracy while its convergence jumps from antelope speed to cheetah speed: on a fast GPU, I trained DETR for 30 hours to reach roughly the accuracy DINO achieves in a bit over three hours.

Table 3 Comparison of DINO with previous best detection models on MS-COCO

 Table 4 Ablation experiment results for the proposed innovative modules

>>> If you have any questions, feel free to discuss them in the comments.
