[CVPR2022] A Close Reading of the QueryDet Paper


Paper: https://arxiv.org/abs/2103.09136

Source code: https://github.com/ChenhongyiYang/QueryDet-PyTorch

1 Introduction

Recently, I have been thinking about how to improve the detection of small targets in remote sensing images, and I came across QueryDet, a small target detection work proposed by TuSimple. The main idea of the paper is to use a cascaded sparse query to accelerate small target detection on high-resolution features, which greatly reduces the computing and storage overhead of the network. Below I mainly share my understanding of and thoughts on this paper.

2 Research Background

What I like about the writing is that the authors do not mechanically list the work of their predecessors. Instead they focus on the task itself and lay out its inherent challenges and difficulties very clearly.

2.1 The challenge of low detection accuracy of small targets

On the COCO dataset, the mainstream detector RetinaNet reaches 51.2 mAP on large targets and 44.1 mAP on medium targets, but only 24.1 mAP on small targets. The authors attribute the lower accuracy on small targets to three main reasons:

  1. Downsampling in the CNN washes out the feature information of small targets and also lets the background contaminate their features;
  2. The receptive field of low-resolution feature maps may not match the size of small targets;
  3. A perturbation of the bounding box affects the detection result far more for a small target than for a large one, so localization is more difficult.
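To make the third reason concrete, here is a small sketch of my own (not from the paper): the same 4-pixel diagonal shift is applied to a 16x16 box and to a 128x128 box, and the IoU with the original box is compared. The box sizes and the shift are arbitrary values chosen for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def shifted_iou(size, shift):
    """IoU between a size x size box and the same box shifted diagonally by `shift` pixels."""
    gt = (0.0, 0.0, float(size), float(size))
    pred = (float(shift), float(shift), float(size + shift), float(size + shift))
    return iou(gt, pred)


# Illustrative values only: a 4-pixel localization error on a small and a large box.
small = shifted_iou(size=16, shift=4)
large = shifted_iou(size=128, shift=4)
print(f"16x16 box:   IoU = {small:.3f}")
print(f"128x128 box: IoU = {large:.3f}")
```

With these numbers the small box falls below the common IoU threshold of 0.5, while the large box stays well above it, even though the pixel error is identical.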

Detection Challenges of Small Objects

2.2 Motivation for improvement

Existing small target detection methods usually keep larger-resolution features by enlarging the input image or reducing the downsampling rate. This introduces a large amount of redundant computation and makes detection on low-level features computationally expensive, as shown in the figure below.
Comparison of computational overhead of different structures
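A rough back-of-envelope sketch of why the low-level layers dominate the cost. It assumes a RetinaNet-style shared head on P3 to P7 (strides 8 to 128) and, as my own simplification, that the head cost of a level is proportional to its number of spatial positions, i.e. to 1 / stride^2. The numbers are not taken from the paper's figure.

```python
def head_positions(strides):
    """Relative number of head positions per level, proportional to 1 / stride^2."""
    return [1.0 / (s * s) for s in strides]


def cost_share(strides):
    """Fraction of the total head cost spent on each level (simplified model)."""
    positions = head_positions(strides)
    total = sum(positions)
    return [p / total for p in positions]


baseline_strides = [8, 16, 32, 64, 128]      # P3..P7
with_p2_strides = [4] + baseline_strides     # add a higher-resolution P2 level

p3_share = cost_share(baseline_strides)[0]
extra_cost = sum(head_positions(with_p2_strides)) / sum(head_positions(baseline_strides))

print(f"share of head cost spent on P3: {p3_share:.1%}")
print(f"head cost after adding P2: {extra_cost:.2f}x the baseline")
```

Under this simplified model, the finest level already takes about three quarters of the head cost, and adding one more high-resolution level roughly quadruples it.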

The authors reject this brute-force approach (I had planned to use it myself for my own improvements, and fortunately I read this paper first) and describe two key findings:

  1. Feature computation on high-resolution, low-level feature maps is highly redundant: small targets are spatially sparse and occupy only a small part of the feature map. The aircraft in the remote sensing image shown below, for example, take up a very small fraction of the image;
  2. In the FPN structure, even if a low-resolution feature layer cannot detect small targets accurately, it can still determine with high confidence whether a small target exists and roughly where it is. The sampling behaviour of the feature pyramid resembles the properties of convolutional neural networks (invariance to translation, scaling, and distortion), so features can be inferred across levels through downsampling and upsampling.

Example of a small target image
Based on these observations, QueryDet proposes the Cascade Sparse Query (CSQ) mechanism. Query means that the query passed from the previous layer (higher-level features at lower resolution) guides small target detection on the current layer, which in turn predicts a query that is passed on to guide small target detection on the next layer. Cascade refers to this chaining across layers. Sparse means that sparse convolution is used to significantly reduce the computational overhead of the detection head on the low-level feature layers.

Put plainly, the feature map of the previous layer has high-level features but low resolution, and is responsible for the initial screening of small targets. The resulting query is passed to the lower layer, which carries high-resolution information, and is refined there. This two-stage "glance and focus" structure performs dynamic inference efficiently and produces the final detections.
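To illustrate the "focus" step, here is a toy NumPy sketch of my own. It evaluates a single 3x3 filter only at a few key positions and checks that the values match a dense evaluation at those positions. This is not an actual sparse convolution library and not the authors' implementation; the feature map, the filter, and the key positions are all made-up values.

```python
import numpy as np


def dense_head(feat, weight):
    """Apply one 3x3 filter at every position of a (C, H, W) map, with zero padding."""
    _, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(padded[:, y:y + 3, x:x + 3] * weight)
    return out


def sparse_head(feat, weight, keys):
    """Apply the same filter only at the given (y, x) key positions."""
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    return {(y, x): float(np.sum(padded[:, y:y + 3, x:x + 3] * weight))
            for (y, x) in keys}


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 32, 32))   # toy high-resolution feature map
weight = rng.standard_normal((8, 3, 3))   # one 3x3 filter standing in for the head
keys = [(4, 5), (4, 6), (20, 11)]         # hypothetical key positions from a coarser level

dense = dense_head(feat, weight)
sparse = sparse_head(feat, weight, keys)
evaluated_fraction = len(keys) / dense.size
print(f"positions evaluated: {evaluated_fraction:.2%} of the dense map")
```

The outputs at the key positions are identical to the dense ones; the saving comes purely from skipping every position that no query points to.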

2 Model structure

As mentioned earlier, in previous feature-pyramid-based detectors, small objects tend to be detected on high-resolution, low-level feature maps. However, since small objects are usually sparsely distributed in space, dense computation on high-resolution feature maps is very inefficient. Inspired by this observation, the authors propose a coarse-to-fine approach to reduce the computational cost of the low-level pyramid layers: first predict the coarse locations of small objects on the coarse feature maps, then concentrate the computation on the corresponding locations of the fine feature maps. This can be regarded as a query process: the coarse locations are the query keys, and the high-resolution features used to detect small objects are the query values. The whole process is shown in the figure below.
QueryDet detection process
The original text defines this process strictly with formulas, which is not easy to follow. Below I borrow the picture from the author's homepage and try to explain the detection process in plain language.
QueryDet detection process diagram
The picture above contains two cascaded query operations, Large->Medium and Medium->Small; take Large->Medium as an example. First, the network marks the small targets in the image at the Large level (an object whose size is smaller than the preset threshold s is defined as a small target), and the Large level predicts a small target confidence for each position, which gives the grid cells that contain small targets. Second, at inference time, the network selects the positions whose predicted score is greater than a score threshold as queries and maps them onto the Medium feature map; the exact mathematical definition is given in formula (1). Finally, the three heads on the Medium level are computed only at the positions in the resulting key position set, producing the detections and the queries for the next layer, and this computation is implemented with sparse convolution.
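The Large->Medium step can be sketched in a few lines of NumPy. This is my own minimal illustration, assuming the Medium map has exactly twice the resolution of the Large map so that each query cell (y, x) covers the four cells (2y+i, 2x+j) with i, j in {0, 1}. The function names, the score map, and the threshold value 0.15 are hypothetical.

```python
import numpy as np


def select_queries(score_map, score_thresh):
    """Return (y, x) positions on the coarse level whose small-target score exceeds the threshold."""
    ys, xs = np.nonzero(score_map > score_thresh)
    return list(zip(ys.tolist(), xs.tolist()))


def queries_to_keys(queries, fine_shape):
    """Map each coarse query (y, x) to its four children (2y+i, 2x+j) on the finer level."""
    h, w = fine_shape
    keys = set()
    for y, x in queries:
        for i in (0, 1):
            for j in (0, 1):
                ky, kx = 2 * y + i, 2 * x + j
                if ky < h and kx < w:      # guard against odd-sized fine maps
                    keys.add((ky, kx))
    return sorted(keys)


# Toy small-target confidence map predicted on the coarse (Large) level.
score_map = np.array([[0.05, 0.10, 0.02, 0.01],
                      [0.08, 0.90, 0.03, 0.02],
                      [0.01, 0.04, 0.02, 0.60],
                      [0.02, 0.01, 0.03, 0.05]])

queries = select_queries(score_map, score_thresh=0.15)
keys = queries_to_keys(queries, fine_shape=(8, 8))
print("queries:", queries)
print("keys:   ", keys)
print(f"fine-level positions to compute: {len(keys)} of {8 * 8}")
```

In this toy case only 8 of the 64 positions on the finer map need to be passed to the heads. Repeating the same two functions on the Medium-level scores would give the Medium->Small step of the cascade.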

concrete mathematical description
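For readers who cannot see the formula image, the mapping can be written roughly as follows. This is my own reconstruction under the assumption that adjacent pyramid levels differ in resolution by a factor of 2; the symbols are chosen for illustration, with $q_l^o = (x_l^o, y_l^o)$ a query position on level $P_l$ and $\{k_{l-1}^o\}$ the key position set on the finer level $P_{l-1}$. Please check the paper for the exact notation.

```latex
\{k_{l-1}^{o}\} = \left\{ \left(2x_{l}^{o} + i,\; 2y_{l}^{o} + j\right) \;\middle|\; i, j \in \{0, 1\} \right\}
```

Each query therefore expands to its four nearest neighbours on the finer level, and the heads on $P_{l-1}$ are evaluated only on the union of these key positions.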

3 Experimental results

The paper reports fairly thorough experiments and ablations, mainly including:

  • Comparison of RetinaNet and QueryDet on COCO mini-val
  • Comparison of RetinaNet and QueryDet on VisDrone
  • Ablation on COCO mini-val over HR (high-resolution features), RB (loss
    re-balance, i.e. weighting the losses of different layers), and QH (extra Query Head)
  • Trade-off between AP, AR, and FPS under different query thresholds on COCO and VisDrone 2018
  • Comparison on COCO mini-val of no query versus three different query schemes; CSQ performs best
  • AP and FPS on COCO mini-val when the query starts from different layers
  • Results with different backbones (MobileNet V2 and ShuffleNet V2)
  • Results of combining QueryDet with FCOS on COCO mini-val
  • Comparison with other methods on COCO test-dev and VisDrone validation

CSQ, CQ, CCQ performance comparison
I will not list the numbers here; just look at the visualization.
Visualization

4 Summary

QueryDet uses high-resolution features to improve small target detection. Through the CSQ mechanism, it uses the high-level, low-resolution features for an initial screening of the regions that contain small targets, then applies the screened positions on the high-resolution feature layer with sparse convolution, which greatly reduces computation. That said, the SOTA claim in the paper is still open to discussion. I will share the actual performance after I study the source code carefully.

Origin blog.csdn.net/weixin_43427721/article/details/125116134