CVPR 2022 Oral | New Work in Object Detection! Nanjing University Open-Sources AdaMixer: A Fast-Converging Query-Based Object Detector

Author: Wang Limin | Reprinted with permission (source: Zhihu) | Editor: CVer

https://zhuanlan.zhihu.com/p/493049779


AdaMixer: A Fast-Converging Query-Based Object Detector

Code (just open-sourced): https://github.com/MCG-NJU/AdaMixer

Paper: https://arxiv.org/abs/2203.16507

This post introduces our new work in object detection, AdaMixer, which improves both the convergence speed and the final performance of query-based detectors (DETR-like detectors and Sparse R-CNN) by strengthening the detector's adaptive modeling ability, while keeping the architecture relatively simple. We propose a set of techniques to strengthen the decoding part of query-based detectors, namely 3D feature space sampling and a dynamic MLP-Mixer detection head. These free us from introducing heavily hand-designed, computation-intensive attention encoders or FPN-style multi-scale interaction networks while maintaining accuracy (in fact, we surpass many previous models), further simplifying the structure of query-based detectors.

Research motivation

First, we briefly introduce our research motivation. Query-based detectors have become a hot topic in academic research. They extract features through iterative interaction between a query set (also called a proposal set in some papers) and the image feature maps, continuously refining the semantics of the queries themselves, so that one-to-one classification and box predictions from query to object can be made under a matching loss. Query-based detectors require no subsequent NMS step, making the whole detection pipeline simpler and more elegant.

However, we found that query-based detectors, especially DETR-like ones, usually introduce multiple layers of attention encoders that densely perform global or local attention over every pixel. This attention computation is expensive and does not extend easily to high-resolution feature maps, which makes small objects hard to detect and can also lengthen training. The Sparse R-CNN line of work introduces an explicit feature pyramid network (FPN) to strengthen the modeling of small objects, but the FPN again adds extra computation. In our view, inserting an extra network between the backbone and the decoder is somewhat inelegant, and it runs against the goal of detecting with queries: if the detector needs a thick, dense encoder, then detecting objects with a small number of queries through the decoder is no longer really the model's selling point. The root cause of these problems is that the decoder is not strong enough, so the encoder's modeling capacity has to compensate. The fundamental motivation of our method is therefore to strengthen the decoder, so that the detector can avoid introducing such encoders as much as possible.

But how do we strengthen the decoder, especially its ability to model different images and different objects adaptively? This question is critical for decoders that rely only on sparse, limited queries. Looking back at the typical query decoder, it is built on the Transformer decoder: first self-attention is performed among the queries, then each query interacts with the image features, and finally each query passes through an FFN. The initial queries are generally learnable vectors that are fixed at inference time and cannot change with the input (although there is a trend of generating initial queries from an RPN-like module), so how to make the decoding mechanism itself adapt to different objects in different images becomes the problem. To this end, we propose to improve query-based detectors along two axes: adaptivity of the sampling locations and adaptivity of feature decoding, which correspond to our proposed 3D feature space sampling and dynamic MLP-Mixer detection head, respectively.

Method

We briefly introduce the two representative innovations of our AdaMixer detector so that readers can quickly grasp the gist of the method; some details are omitted here, please refer to the paper for the full picture.

Adaptive feature sampling locations

[Figure: adaptive sampling in the 3D feature space formed by multi-level features]

Like other recent methods, we decouple each query into two vectors, a content vector and a positional vector, where the box represented by the query can be decoded from the positional vector. At each stage, the decoder refines both vectors. Notably, we parameterize the positional vector not with the usual ltrb or ccwh box coordinates but in an xyzr form, where z is the logarithm of the box scale and r the logarithm of its aspect ratio. This xyz parameterization directly connects a query with the 3D feature space formed by the multi-level features: as shown in the figure above, a query's coordinates in this space are naturally given by xyz. Adaptive 3D feature sampling first lets each query generate several groups of offsets from its own content vector, then interpolates the corresponding points in the 3D feature space to obtain features. The 3D feature space helps our method learn changes in object location and scale in a unified, adaptive way. Note that this step requires no multi-scale interaction network at all.
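To make the xyzr parameterization concrete, here is a minimal numpy sketch. The exact conventions used (base-2 logarithms, r as the log aspect ratio h/w, x/y offsets scaled by the box scale) are our reading of the parameterization and are labeled in the comments; this is illustrative, not the released implementation.

```python
import numpy as np

def xyzr_to_box(x, y, z, r):
    # Decode a query's positional vector (x, y, z, r) into a box.
    # Assumed convention: z = log2 of the box scale sqrt(w*h),
    # r = log2 of the aspect ratio h/w.
    w = 2.0 ** (z - 0.5 * r)
    h = 2.0 ** (z + 0.5 * r)
    return x, y, w, h

def sample_points(x, y, z, offsets):
    # offsets: (P, 3) per-query offsets generated from the content vector.
    # x/y offsets are scaled by the box scale 2^z, so sampling follows the
    # object's size; the third channel offsets the z (level) coordinate,
    # letting each point land on a different pyramid level.
    pts_x = x + offsets[:, 0] * (2.0 ** z)
    pts_y = y + offsets[:, 1] * (2.0 ** z)
    pts_z = z + offsets[:, 2]
    return np.stack([pts_x, pts_y, pts_z], axis=-1)
```

Sampled (x, y, z) points would then be gathered by interpolation in the multi-level feature maps; with zero offsets, a query simply samples at its own 3D coordinates.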

Adaptive sample content decoding

For each query, the features collected in the step above form a matrix of shape P × C, where P is the number of sampling points and C the number of channels. Inspired by MLP-Mixer, we propose query-wise adaptive channel mixing (ACM) and adaptive spatial mixing (ASM): the decoder mixes the collected features along the channel and spatial dimensions with weights that dynamically depend on the query. Since the collected features may come from feature maps of different levels, this mixing naturally gives the decoder the ability to model multi-scale interactions.
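The mixing can be sketched in a few lines of numpy. Here random projection matrices stand in for the learned linear layers that generate the mixing weights from the query; only the data flow (weights generated per query, then applied first along the channel axis and then along the spatial axis) matches the description above.

```python
import numpy as np

def adaptive_mixing(feats, query, rng):
    # feats: (P, C) features sampled for one query; query: (C,) content vector.
    # The mixing weights are generated *from the query* -- here via random
    # projections standing in for learned linear layers (an illustrative
    # sketch, not the trained model).
    P, C = feats.shape
    W_cm = rng.standard_normal((query.size, C * C)) * 0.01
    W_sm = rng.standard_normal((query.size, P * P)) * 0.01
    M_c = (query @ W_cm).reshape(C, C)   # adaptive channel-mixing weights
    M_s = (query @ W_sm).reshape(P, P)   # adaptive spatial-mixing weights
    mixed = feats @ M_c                  # ACM: mix along the channel axis
    mixed = M_s @ mixed                  # ASM: mix along the spatial axis
    return mixed
```

Because M_s mixes across sampling points that may come from different pyramid levels, the spatial mixing step is also where cross-scale interaction happens "for free".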

[Figure: adaptive channel mixing (ACM) and adaptive spatial mixing (ASM)]

General structure

[Figure: overall structure of the AdaMixer decoder]

The overall structure of the AdaMixer decoder is shown in the figure above. Although it looks somewhat involved, the operations on the content vector keep the basic structure of a Transformer decoder; the positional vector can simply be seen as participating in coordinate transformations within a stage and then being updated at the end of the stage.

The overall AdaMixer detector consists of only two main parts: a backbone network and our proposed AdaMixer decoder. It requires neither an extra attention encoder nor an explicit multi-scale modeling network.
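The stage flow described above can be sketched as follows. The sub-modules are passed in or stubbed (the self-attention, FFN, and box update here are toy stand-ins), so only the control flow mirrors our reading of the figure, not the released code.

```python
import numpy as np

def decoder_stage(content, xyzr, sample_fn, mix_fn):
    # content: (N, C) content vectors; xyzr: (N, 4) positional vectors.
    # 1) self-attention among queries (stub: softmax similarity over queries)
    sim = content @ content.T / np.sqrt(content.shape[1])
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    content = content + attn @ content
    # 2) adaptive 3D feature sampling, conditioned on content and position
    feats = sample_fn(content, xyzr)             # (N, P, C)
    # 3) adaptive channel + spatial mixing folds samples back into the query
    content = content + mix_fn(feats, content)   # (N, C)
    # 4) per-query FFN (stub: ReLU)
    content = np.maximum(content, 0.0)
    # 5) the positional vector is only updated at the end of the stage
    xyzr = xyzr + 0.01 * content[:, :4]          # toy box-head stand-in
    return content, xyzr
```

Stacking several such stages refines the content and positional vectors iteratively, after which classification and box heads read them out.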

Results

[Table: 12-epoch results; N denotes the number of queries]

The experimental results were quite strong at submission time. Under a 12-epoch training schedule, our performance surpasses other detectors (both traditional and query-based ones; N denotes the number of queries), demonstrating both the convergence speed and the final accuracy of our method. The 12-epoch schedule is also fast in wall-clock time: only about 9 hours on 8 V100 GPUs.

[Table: comparison with other query-based detectors]

We also outperform other query-based detectors, and ours is the only model in the table that requires neither an extra attention encoder nor a feature pyramid network.


Ablation experiments

We conducted fairly extensive ablation experiments to verify the effectiveness of the proposed modules; here we discuss a few representative ones.

[Table: ablations (a)–(c)]

Table (a) examines the adaptivity at the core of our approach: adaptivity in both the sampling locations (loc.) and the content decoding (cont.) has a large impact on the final model's performance.

Table (b) examines our proposed adaptive mixing: applying adaptive channel mixing (ACM) followed by adaptive spatial mixing (ASM) in sequence is the best choice.

Table (c) shows the effect of attaching different multi-scale interaction networks to AdaMixer. We were surprised to find that using no extra pyramid network still works best. We conjecture this is because the AdaMixer decoder already has multi-scale interaction ability built in, while an extra pyramid network brings more parameters and needs more training time to converge.

[Table 8: ablation on 3D feature space sampling]

Table 8 further explores 3D feature space sampling. Note that none of the models in Table 8 are equipped with an FPN, so it is unsurprising to us that RoIAlign performs poorly in this setting. The model with adaptive 2D sampling (i.e., without learning offsets along z) lags 3D feature space sampling by nearly 1.5 AP, showing the necessity of learning offsets in 3D, especially along the z direction. Another interesting finding is that using only C4 features works better than using only C5, which we attribute to C4's higher resolution. Moreover, when only C4 is used, the later stages of ResNet can simply be cut off (since there is no FPN and the C5 feature map is unused), which may point to a direction for making such detectors lightweight; we have not explored this much yet.

Summary

We propose AdaMixer, a detector with a relatively simple structure, fast convergence, and strong performance. By improving the decoder's adaptive decoding ability for target objects, AdaMixer needs neither a heavy attention encoder nor an explicit multi-scale interaction network. We hope AdaMixer can serve as a simple and effective baseline for future query-based detectors.


Origin blog.csdn.net/qq_29462849/article/details/124030625