CVPR 2022 | Plug and Play! Nanyang Technological University & SenseTime Open-Source SAM-DETR: Using Semantic-Aligned Matching to Achieve Fast Convergence

Author丨Qingchen smile @zhihu

Source丨https://zhuanlan.zhihu.com/p/489839282

Editor丨CVer

Introduction: At CVPR 2022, researchers from Nanyang Technological University, Singapore and SenseTime Research propose SAM-DETR, which uses semantic-aligned matching to accelerate the convergence of the DETR detector. It introduces only a simple plug-and-play module that aligns the semantics of object queries and image features by sampling features at objects' salient points, allowing DETR to converge quickly on the MS-COCO dataset. Because the method is plug-and-play, SAM-DETR can easily be combined with other existing convergence-acceleration methods to achieve even better results. According to the authors' open-source code, with only a ResNet-50 backbone, the proposed method reaches 42.8% AP on MS-COCO within 12 training epochs and 47.1% AP within 50 epochs.

Paper Title: Accelerating DETR Convergence via Semantic-Aligned Matching


Paper: https://arxiv.org/abs/2203.06883

Code (now open source): https://github.com/ZhangGongjie/SAM-DETR

Issues and Challenges

DEtection TRansformer (DETR) [1] is a novel object detection framework. Compared with traditional detectors based on Faster R-CNN or YOLO, DETR requires no hand-designed components (such as anchors, non-maximum suppression, or positive/negative sample assignment rules during training) while delivering competitive detection accuracy, which has attracted a lot of attention. However, one of DETR's biggest problems is its very slow convergence, so it takes a long time to train to reach high accuracy. On the MS-COCO dataset, Faster R-CNN generally needs only 12~36 training epochs to converge, while DETR needs about 500 epochs to reach the desired accuracy. Such a high training cost limits the widespread use of DETR-based detectors.

Motivation

[Figure: DETR's cross-attention viewed as a "matching + feature extraction" process (left); object queries failing to focus on their corresponding regions due to semantic misalignment (right)]

DETR [1] uses a set of object queries to represent latent objects at different positions in the image and feeds them to the Transformer decoder as input. As shown in the left part of the figure above, the cross-attention (encoder-decoder attention) module in DETR's Transformer decoder can be understood as a "matching + feature extraction" process: each object query first matches its corresponding region, then extracts features from that region for subsequent predictions. This process can be described by the formula:

$Q' = \mathrm{Softmax}(Q F^{\top})\, F$

where Q denotes the object queries, F denotes the image features output by the Transformer encoder, and Q' denotes the object queries carrying the features extracted from F.

However, the authors observe that, in the cross-attention module of the Transformer decoder, object queries have difficulty accurately matching their corresponding regions, so they cannot accurately extract the features of those regions in cross-attention. This directly makes DETR hard to train. As shown in the right part of the figure above, the reason an object query cannot focus on a specific region is that the modules between consecutive cross-attention layers (self-attention and FFN) repeatedly re-project the object query, so that the object query and the image features F are no longer semantically aligned; that is, they are mapped into different embedding spaces. This makes it difficult for the dot product + Softmax between the object query and the image features F to concentrate on a specific region.
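To make the "matching + feature extraction" view concrete, below is a minimal PyTorch sketch of the dot-product + Softmax operation between object queries and image features. The shapes, the single-head simplification, the absence of position embeddings, and the added scaling factor are all illustrative assumptions, not DETR's actual implementation.

```python
import torch

def cross_attention_matching(Q, F):
    """Q: (num_queries, d) object queries; F: (H*W, d) flattened encoder features."""
    d = Q.size(-1)
    attn = torch.softmax(Q @ F.t() / d ** 0.5, dim=-1)  # matching: where each query looks
    return attn @ F                                      # extraction: gather features there

Q = torch.randn(100, 256)       # 100 object queries
F = torch.randn(32 * 32, 256)   # flattened image feature map from the encoder
Q_new = cross_attention_matching(Q, F)
print(Q_new.shape)              # torch.Size([100, 256])
```

If Q and F live in different embedding spaces, the Softmax above spreads attention over many locations instead of concentrating on the query's region, which is exactly the misalignment problem the paper targets.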

Method introduction

Based on the above observations, the authors propose Semantic-Aligned-Matching DETR (SAM-DETR) to make DETR converge quickly. The core idea is to exploit the excellent performance of Siamese networks in various matching tasks, so that object queries in cross-attention can more easily focus on specific regions. A Siamese network maps both sides of a match into the same embedding space using two identical sub-networks, so their similarity is computed under the same semantics. This reduces the difficulty of matching and improves its accuracy.
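As a toy illustration of the Siamese idea (not the authors' code), the sketch below maps both sides of the match through the same sub-network so their similarity is computed in one shared embedding space:

```python
import torch
import torch.nn as nn

shared = nn.Linear(256, 256)            # one sub-network, identical weights for both branches
queries = torch.randn(100, 256)         # one side of the match (object queries)
features = torch.randn(32 * 32, 256)    # the other side of the match (image features)
similarity = shared(queries) @ shared(features).t()  # (100, 1024) scores under the same semantics
```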

[Figure: Overall architecture of SAM-DETR, with a Semantics Aligner inserted before the cross-attention of each Transformer decoder layer]

Specifically, as shown in the figure above, SAM-DETR inserts a plug-and-play module, the Semantics Aligner, before the cross-attention of each Transformer decoder layer. The Semantics Aligner resamples the image features F for each object query fed into cross-attention, ensuring that the two sides of the match are semantically aligned. In addition, unlike DETR [1], which assigns each object query a learnable position embedding, the authors directly model a reference box for each object query to restrict the resampling range. Other settings are basically the same as the original DETR [1]. Since the core of SAM-DETR is a plug-and-play module that requires no intrusive modification of cross-attention, SAM-DETR can easily be combined with existing DETR convergence solutions to achieve better results.
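A small sketch of the reference-box idea mentioned above: each object query is paired with a learnable box rather than a learnable position embedding. The box format, number of queries, and normalization below are assumptions for illustration, not the official code.

```python
import torch
import torch.nn as nn

num_queries, d_model = 300, 256                   # illustrative sizes
query_embed = nn.Embedding(num_queries, d_model)  # content part of each object query
ref_boxes = nn.Embedding(num_queries, 4)          # learnable reference box per query
boxes = ref_boxes.weight.sigmoid()                # assumed (cx, cy, w, h), normalized to [0, 1]
```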

1. Using resampling to achieve semantically aligned matching

[Figure: Main structure of the proposed Semantics Aligner]

The figure above shows the main structure of the proposed Semantics Aligner. For each object query, the Semantics Aligner uses RoIAlign to extract the 2D features of the query's corresponding region from the image features, according to its reference box, and then resamples from them to form the object query embedding fed into cross-attention. The authors tried a variety of resampling methods (including AvgPool, MaxPool, etc.; see the experimental results section) and found that sampling the features of multiple searched salient points works best.
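A minimal sketch of this resampling step, under assumed shapes and names (not the official code): RoIAlign crops each query's reference box from the encoder feature map, and a simple AvgPool variant squeezes the region into the new query embedding. The paper's best variant instead samples salient points, as described next.

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 32, 32)               # encoder feature map (B, C, H, W)
boxes = torch.tensor([[0., 4., 4., 20., 20.],    # (batch_idx, x1, y1, x2, y2) in feature-map coords
                      [0., 10., 8., 30., 28.]])  # one row per object query's reference box
region = roi_align(feat, boxes, output_size=(7, 7))  # (num_queries, 256, 7, 7) region features
new_query = region.mean(dim=(2, 3))                   # AvgPool resampling -> (num_queries, 256)
print(new_query.shape)
```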

2. Resampling with salient point features

For detection tasks, the salient points of an object (boundary points, end points, semantically strong points, etc.) are the key to recognizing and localizing it. Therefore, the authors sample the features of these salient points as the output of the Semantics Aligner.

Concretely, the authors apply a convolution + MLP head to the region features obtained by RoIAlign to predict the coordinates of 8 salient points, then use bilinear interpolation to sample the features at these positions and concatenate them as the new object query embedding. Similarly, the coordinates of these salient points are used to generate the corresponding position embeddings, which are also concatenated as output. The salient points found are shown in the figure below. The new object queries and their position embeddings obtained in this way can be fed into the subsequent multi-head attention without any modification.

[Figure: Visualization of the salient points found by SAM-DETR]
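The following is a hedged sketch of the salient-point resampling just described. The head sizes, the choice to sample within the RoIAligned region, and the tensor layouts are illustrative assumptions rather than the authors' exact implementation; only the number of points (8) follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_queries, C, R = 2, 256, 7
region = torch.randn(num_queries, C, R, R)       # RoIAligned region features per query

point_head = nn.Sequential(                      # conv + MLP head predicting 8 (x, y) locations
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(C * R * R, 8 * 2))
points = point_head(region).view(num_queries, 8, 1, 2).tanh()  # 8 points per query in [-1, 1]

# bilinear interpolation (grid_sample) reads out the feature at each salient point
sampled = F.grid_sample(region, points, align_corners=False)   # (num_queries, C, 8, 1)
new_query = sampled.squeeze(-1).permute(0, 2, 1).reshape(num_queries, 8 * C)
print(new_query.shape)  # (2, 2048): the 8 salient-point features concatenated per query
```

Position embeddings generated from the same point coordinates would be concatenated in the same way before being fed to the multi-head attention.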

3. Utilizing previous information through feature reweighting

Through the above operations, the Semantics Aligner outputs new object queries as the input to cross-attention. However, the previous object queries still contain information useful to cross-attention. To exploit this information effectively, the authors use the previous object queries to generate reweighting coefficients and apply feature reweighting to the new object queries. In this way, the previous information is effectively utilized while the semantics of the output object queries remain aligned with the image features.
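A minimal sketch of feature reweighting under assumed names: the previous object query generates per-channel gates (a sigmoid-activated linear layer here) that modulate the newly resampled query, so earlier information is carried over without breaking semantic alignment.

```python
import torch
import torch.nn as nn

d_model = 256
reweight = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())  # gate generator (assumed)

prev_query = torch.randn(100, d_model)        # object query before the Semantics Aligner
new_query = torch.randn(100, d_model)         # resampled, semantically aligned query
out_query = new_query * reweight(prev_query)  # reweighted query fed into cross-attention
```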

4. Integration with existing methods

Since SAM-DETR only introduces a plug-and-play module, it can easily be combined with existing methods. The authors take SMCA-DETR [2] as an example to demonstrate the good extensibility of SAM-DETR.

Experimental results

The figure below shows a visual comparison between SAM-DETR and DETR [1]. SAM-DETR successfully finds meaningful salient points, and its cross-attention responses are more concentrated than DETR's. This shows that SAM-DETR effectively reduces the difficulty of matching object queries with image features.

[Figure: Visual comparison of cross-attention responses and salient points between SAM-DETR and DETR]

The table below shows ablation experiments demonstrating the effectiveness of semantic alignment and of searching for salient-point features.

[Table: Ablation study on semantic alignment and salient-point features]

Finally, compared with the current state of the art, SAM-DETR converges within a very short training schedule. Notably, when combined with SMCA [2], its detection accuracy surpasses Faster R-CNN even when trained on MS-COCO for only 12 epochs.

[Table: Comparison with state-of-the-art methods under short training schedules on MS-COCO]

It is worth mentioning that the authors' open-source code additionally provides a multi-scale version of SAM-DETR together with trained models. On MS-COCO, it reaches 42.8% AP with a 12-epoch schedule and 47.1% AP with a 50-epoch schedule.

Epilogue

This paper introduces SAM-DETR to accelerate the convergence of DETR. Its core is a simple plug-and-play module that semantically aligns object queries with image features to make matching between them easier. The module also explicitly searches for salient-point features for semantic-aligned matching. SAM-DETR can easily be integrated with existing DETR convergence solutions for further gains. Even when trained for only 12 epochs, the proposed method surpasses Faster R-CNN's detection accuracy on MS-COCO.

References

[1] Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. "End-to-end object detection with Transformers." In European Conference on Computer Vision (ECCV), pp. 213-229. 2020.

[2] Gao, Peng, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. "Fast convergence of DETR with spatially modulated co-attention." In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3621-3630. 2021.

