ICCV 2023 Oral | One article to understand how SLAM/SfM should handle visually similar, non-loop-closure scenes

Author: Pickled pepper flavored chewing gum | Source: 3D Vision Workshop


0. The author’s personal takeaway

Similar structures have long been a difficult but unavoidable problem in SLAM and SfM. When a robot encounters structures that look very similar but are actually different, the sheer number of feature matches can easily trigger false positive loop closures and reconstruction failures. Traditional methods rely on thresholds on the number of matches, or on ratio thresholds derived from other geometric relationships, and these heuristics break down easily in highly symmetric scenes.

Today the author introduces a new solution to this problem: Doppelgangers, an ICCV 2023 Oral paper that can automatically determine whether two views show the same structure or merely a similar one. The method models visual disambiguation as a binary classification task on image pairs, and contributes both a learning-based solution and a dataset.

1. Results

Below are some typical similar structures that are easy to mistake for the same scene even with the human eye. When SLAM or SfM encounters such scenes, false positive loop closures are easily triggered, leading to tracking loss or reconstruction failure.

[Figure: typical visually similar but distinct structures]

Doppelgangers targets matching and reconstruction in exactly this type of scene. Even when the scene contains highly symmetric, repeated structures, a complete 3D reconstruction can still be produced, without structures appearing or disappearing out of thin air.

[Figures: complete reconstructions produced despite symmetric, repeated structures]

2. Abstract

We consider the visual disambiguation task of determining whether a pair of visually similar images depicts the same or distinct 3D surfaces (e.g., the same or opposite sides of a symmetric building). Illusory image matches, where two images observe distinct but visually similar 3D surfaces, are challenging for humans to differentiate and can also cause 3D reconstruction algorithms to produce erroneous results. We propose a learning-based approach to visual disambiguation, formulated as a binary classification task on image pairs. To that end, we introduce a new dataset for this problem, Doppelgangers, consisting of pairs of visually similar images with ground-truth labels. We also design a network architecture that takes the spatial distribution of local keypoints and matches as input, allowing for better reasoning about both local and global cues. Our evaluation shows that our method can distinguish illusory matches in difficult cases, and can be integrated into SfM pipelines to produce correct, disambiguated 3D reconstructions.

3. Algorithm analysis

Now let’s break the task down step by step. We want SLAM and SfM to keep tracking and reconstructing correctly even in scenes with very similar structures: the reconstruction should not gain or lose pieces because of mismatches, let alone suffer false positive loop closures.

Specific task description:

Given two very similar images, decide whether they show the same surface of the same structure, or two distinct 3D structures (which the authors call doppelgangers). The problem mainly arises with symmetric buildings, repeated visual elements, and multiple identical landmarks.

Analysis shows that although two images of a highly symmetric building look very similar overall, making it hard to label the pair positive or negative from the images alone, some internal details still differ. The regions containing these details yield few correspondences during image matching, and this subtle difference can be used to tell the images apart. In effect, this mirrors the human strategy of "spot the difference".

[Figure: subtle local differences between a doppelganger image pair]

Based on this finding, the authors propose a learning-based method for visual disambiguation, and also release a dataset of similar structures with ground-truth labels.

The algorithm first uses RANSAC to estimate the fundamental matrix and filter out outlier matches (in the experiments, correspondences come directly from LoFTR followed by RANSAC). It then feeds the original images, the extracted keypoints, and the matched keypoints into a network whose output is the probability that the two images show the same surface, turning similarity recognition into a binary classification problem.

[Figure: overview of the Doppelgangers pipeline]
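
For concreteness, here is a minimal sketch of how such a network input could be assembled with OpenCV: geometric verification with RANSAC on LoFTR correspondences, then binary masks for keypoints and verified matches stacked with the (grayscale) images. The function name, mask encoding, and channel layout are illustrative assumptions, not the authors' exact code.

```python
import cv2
import numpy as np

def build_classifier_input(img0, img1, kpts0, kpts1):
    """Assemble the network input for one image pair (a sketch).

    img0, img1: grayscale float32 images of the same size.
    kpts0, kpts1: (N, 2) float32 arrays of putative LoFTR correspondences.
    """
    # Geometric verification: fundamental matrix + RANSAC, keep inliers.
    F, inlier_mask = cv2.findFundamentalMat(kpts0, kpts1, cv2.FM_RANSAC,
                                            3.0, 0.999)
    inliers = (np.zeros(len(kpts0), dtype=bool) if inlier_mask is None
               else inlier_mask.ravel().astype(bool))

    def to_mask(points, shape):
        # Binary image marking keypoint locations.
        mask = np.zeros(shape[:2], dtype=np.float32)
        pts = np.round(points).astype(int)
        pts[:, 0] = np.clip(pts[:, 0], 0, shape[1] - 1)  # x
        pts[:, 1] = np.clip(pts[:, 1], 0, shape[0] - 1)  # y
        mask[pts[:, 1], pts[:, 0]] = 1.0
        return mask

    # One mask for all keypoints, one for RANSAC-verified matches, per image.
    x0 = np.stack([img0, to_mask(kpts0, img0.shape),
                   to_mask(kpts0[inliers], img0.shape)], axis=0)
    x1 = np.stack([img1, to_mask(kpts1, img1.shape),
                   to_mask(kpts1[inliers], img1.shape)], axis=0)
    return np.concatenate([x0, x1], axis=0)  # (6, H, W) network input
```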

The idea is to feed keypoint and match locations to the network, so that it knows not only which keypoints are matched but also which are not; unmatched regions may indicate missing content or different objects. This amounts to exploiting the spatial distribution of keypoints and matches: visually, regions with genuinely shared structure produce dense matches, while regions that actually differ produce sparse ones.

To better compare corresponding regions, the authors also geometrically align the input image pair: they estimate an affine transformation and then warp one image and its binary masks. The alignment need not be especially accurate; it only has to roughly bring the overlapping regions into correspondence.
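
A rough sketch of that alignment step, reusing the inlier correspondences from above: estimate an affine transform with RANSAC and warp the second image and its masks into the first image's frame. The 5-pixel reprojection threshold is an assumed value.

```python
import cv2

def align_to_reference(img1, masks1, kpts1_in, kpts0_in, out_hw):
    """Warp image 1 and its masks toward image 0 via an estimated affine.
    The alignment is deliberately approximate; it only needs to bring the
    overlapping regions into rough correspondence."""
    A, _ = cv2.estimateAffine2D(kpts1_in, kpts0_in, method=cv2.RANSAC,
                                ransacReprojThreshold=5.0)
    h, w = out_hw
    img1_warped = cv2.warpAffine(img1, A, (w, h))
    masks1_warped = [cv2.warpAffine(m, A, (w, h)) for m in masks1]
    return img1_warped, masks1_warped
```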

For training, the authors use focal loss. This is easy to motivate: pairs involving similar structures are bound to be imbalanced between positive and negative samples, and focal loss increases the contribution of hard-to-distinguish samples.
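
A standard binary focal loss in PyTorch; the paper does not spell out its hyperparameters here, so α = 0.25 and γ = 2 below are the common defaults from Lin et al. (2017):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so hard, ambiguous
    pairs dominate the gradient. targets: float tensor of 0/1 labels."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model's probability for the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```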

4. Experiments

The binary classification network is very simple: three residual blocks, an average pooling layer, and a fully connected layer. Training runs for only 10 epochs with a batch size of 16 and a learning rate decayed from 0.0005 to 0.000005.
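
A minimal PyTorch sketch matching that description: three residual blocks, global average pooling, and one fully connected layer. Channel widths, strides, and the 6-channel input (matching the input sketch above) are assumptions; the paper only specifies the block counts and the training schedule.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, cin, cout, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, 1, 1, bias=False),
            nn.BatchNorm2d(cout))
        self.skip = nn.Conv2d(cin, cout, 1, stride)  # projection shortcut
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class PairClassifier(nn.Module):
    """Three residual blocks -> global average pooling -> one FC layer."""
    def __init__(self, in_ch=6):
        super().__init__()
        self.blocks = nn.Sequential(ResBlock(in_ch, 64), ResBlock(64, 128),
                                    ResBlock(128, 256))
        self.head = nn.Linear(256, 1)

    def forward(self, x):                       # x: (B, 6, H, W)
        feat = self.blocks(x).mean(dim=(2, 3))  # global average pooling
        return self.head(feat)                  # logit for "same surface"
```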

The experiments first evaluate visual disambiguation on the proposed Doppelgangers dataset, then integrate the trained pair classifier into SfM and assess disambiguation through reconstruction quality, and finally run ablations to show that each module is useful.

The first experiment uses only local feature matching to predict whether an image pair is a positive (true) match. Baselines include SIFT+RANSAC, LoFTR, and DINO-ViT (a self-supervised SOTA model for classification/segmentation). Two decision rules are compared, as sketched below: (1) thresholding the number of matches after geometric verification; (2) thresholding the ratio of matches to keypoints. The intuition behind (2) is that very few matches relative to the number of keypoints suggests a false match. The authors' trained model reaches 95.2% AP and 93.8% ROC AUC. DINO performs poorly, mainly because its features suit semantic classification rather than visual disambiguation.
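
Both baseline decision rules are one-liners; the thresholds below are placeholders, not values from the paper:

```python
def baseline_decisions(num_inliers, num_kpts0, num_kpts1,
                       min_matches=50, min_ratio=0.1):
    # (1) Threshold the count of geometrically verified matches.
    rule_count = num_inliers >= min_matches
    # (2) Threshold matches relative to keypoints: few matches per detected
    #     keypoint suggests the pair is a doppelganger (false match).
    rule_ratio = num_inliers / max(min(num_kpts0, num_kpts1), 1) >= min_ratio
    return rule_count, rule_ratio
```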

[Table: visual disambiguation results on the Doppelgangers dataset]

Below are test image pairs with the predicted probability of being a positive match. The qualitative comparison is quite detailed and covers many scenarios: the left column shows negative (wrong) pairs, the right column positive (correct) ones. Notably, the classifier stays correct under challenging conditions such as lighting changes, viewpoint changes, and weather changes.

[Figure: test image pairs with predicted positive-match probabilities]

Next comes 3D reconstruction. The authors integrate their trained binary classifier into COLMAP and evaluate reconstruction in scenes with repeated or symmetric structure. Two sets of landmarks are used: 13 landmarks that are hard to reconstruct due to symmetry and repeated structure, and 3 scenes with repeated structure that differ significantly from the training data, mainly to test generalization.
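
Conceptually, the integration acts as a pre-filter on the scene graph: score every matched image pair with the classifier and drop low-probability pairs before the incremental mapper runs. A sketch under that assumption (the pair representation and the 0.8 threshold are illustrative, not the authors' exact interface):

```python
import torch

def filter_image_pairs(pairs, classifier, threshold=0.8):
    """pairs: iterable of (name_a, name_b, input_tensor) built as above.
    Returns only the pairs the classifier considers the same 3D surface;
    the surviving matches are then passed on to SfM."""
    kept = []
    classifier.eval()
    with torch.no_grad():
        for name_a, name_b, x in pairs:
            prob = torch.sigmoid(classifier(x.unsqueeze(0))).item()
            if prob >= threshold:
                kept.append((name_a, name_b))
    return kept
```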

The table below shows the SfM reconstruction results. The second column gives the number of scenes per dataset; √ and ❌ indicate whether reconstruction succeeded, with success judged by comparison against the corresponding structure in Google Earth (a novel reference for evaluating 3D reconstruction quality).

[Table: SfM reconstruction results per landmark]

Below is a comparison between models reconstructed directly with COLMAP and with the authors' method. Plain COLMAP conjures many redundant towers, domes, and other structures out of thin air, whereas the proposed method resolves the ambiguity of symmetric, similar structures and reconstructs a complete, correct 3D model.

[Figure: reconstructions from plain COLMAP vs. the proposed method]

The last experiment is the ablation study. There is not much to add: it compares different network architectures for the binary classifier, the effect of data augmentation, and the performance impact of different network inputs.

[Table: ablation results]

5. Summary

Doppelgangers is a paper that tackles one specific task, binary classification of similar structures, and in doing so solves a very important problem. Although its experiments focus on image matching and SfM, the author personally feels it would also be easy to apply to SLAM; interested readers can give it a try.

—END—


Origin: blog.csdn.net/Yong_Qi2015/article/details/133053627