Literature Review: Improving Image-Based Localization by Active Correspondence Search

Abstract

Input: A query image

Source: A point cloud reconstruction of a large scene (there are more than one million 3D points)

Result: the camera pose

Key: an efficient and effective search method for establishing the matches between image features and scene points that pose estimation needs.


The paper presents a framework that actively searches for additional matches, based on both 2D-to-3D and 3D-to-2D search.

Treating both search directions in one formulation exploits the advantages of each while avoiding their disadvantages.

The method achieves the best registration performance, with run-times comparable to the fastest existing solutions.

1. Introduction

When higher localization accuracy is required, a 3D point cloud is a more suitable scene representation.

Using correspondences between 2D image features and 3D scene points, the full camera pose can be estimated with high precision.


In this paper, the scene is represented by a 3D point cloud obtained offline with Structure-from-Motion (SfM), and each 3D point has associated descriptors.

The key challenge is to establish correspondences between the thousands of 2D features in the query image and the millions of 3D scene points.

RANSAC can then be used to estimate the pose; the number of RANSAC iterations depends on the quality of the matches found.

[11] describes a simple tree-based approximate search that achieves good registration performance. However, it needs a few seconds per search.

We propose a method that performs similarly to the tree-based scheme of [11] but is almost an order of magnitude faster. Its core idea is to use established 2D-to-3D correspondences to trigger 3D-to-2D matching.

Related Work: ...

2. The Image-Based Localization Problem

image-20200223164750547

As shown above, because the point cloud comes from SfM, each point has at least two SIFT descriptors.

2D-to-3D vs. 3D-to-2D search. An image contains several orders of magnitude fewer features than there are points in the model, so:

  • Matching a single point against the image features (3D-to-2D) is more efficient.

  • The slower 2D-to-3D search yields higher-quality matches (presumably because, with many more candidates, the ratio test is harder to pass: if several points lie very close to a feature in descriptor space, the ratio test rejects the ambiguous feature). In return, the denser descriptor space also makes it more likely that a correct match is rejected.

We combine the advantages of both strategies.

Prioritized Search. Considering only a subset of the features / points makes matching more efficient. This requires a prioritization scheme that evaluates the most promising features / points first.

  • For 3D-to-2D matching, Li et al. prioritize based only on the co-visibility of 3D points. When a match is established for a point \(p\), the priority of all other points seen together with \(p\) is increased. (Which makes sense.)

  • For 2D-to-3D matching, Sattler et al. propose a scheme based only on appearance [11]. They assign both features and points to a set of visual words. Because processing a feature means computing descriptor distances to find its nearest neighbors, the number of computations is proportional to the number of points assigned to the feature's visual word. Features are then processed in order of increasing cost, which favors features with more discriminative appearance, since only a few points can be matched to them.

We build on [11] and then combine appearance-based and co-visibility-based prioritization in a strategy that exploits the intrinsic similarity of 2D-to-3D and 3D-to-2D search.

Bridging the Performance Gap ...

Map points are assigned to visual words in an offline stage.

Given a query image, its SIFT features are mapped to the vocabulary. As a result, each feature has a list of candidate points, and the features are processed in increasing order of the length of their candidate lists.

For each feature \(f\), we take all points in its word and find the two nearest neighbors \(p_1, p_2\). If they pass the SIFT ratio test \(\|f - p_1\|_2 / \|f - p_2\|_2 < 0.7\), a match \((f, p_1, \|f - p_1\|_2)\) is established.
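The matching step can be illustrated with a minimal pure-Python sketch. It is not the implementation of [11]: the data layout (`features`, `word_of_feature`, `points_in_word`) and the function names are hypothetical, and two choices are assumptions made here, namely that several descriptors of one point are merged by keeping the smallest distance, and that a feature with fewer than two candidate points is skipped.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_features(features, word_of_feature, points_in_word, ratio=0.7):
    """Return matches (feature_id, point_id, distance) from 2D-to-3D search.

    features:        {feature_id: descriptor}
    word_of_feature: {feature_id: visual_word_id}
    points_in_word:  {visual_word_id: [(point_id, descriptor), ...]}
    """
    def cost(f):
        # Search cost = number of descriptors stored in the feature's word.
        return len(points_in_word.get(word_of_feature[f], []))

    matches = []
    for f in sorted(features, key=cost):  # cheapest features first
        best = {}  # smallest distance per point (assumption: merge per point)
        for pid, desc in points_in_word.get(word_of_feature[f], []):
            d = l2(features[f], desc)
            if pid not in best or d < best[pid]:
                best[pid] = d
        ranked = sorted(best.items(), key=lambda item: item[1])
        if len(ranked) < 2:
            continue  # assumption: the ratio test needs two distinct points
        (p1, d1), (_, d2) = ranked[0], ranked[1]
        if d2 > 0 and d1 / d2 < ratio:
            matches.append((f, p1, d1))
    return matches
```

Sorting by `cost` mirrors the prioritization described above: features whose word holds few points are cheap to evaluate and tend to be more discriminative.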

Active Correspondence Search

Using a visual vocabulary induces quantization effects, which limit the number of matches that can be found. A simple way to counter these effects is soft assignment, which maps a feature to multiple visual words. We instead use active correspondence search, a more efficient hierarchical approach.

image-20200224122255941

When a 2D-to-3D match between \(f\) and \(p\) is found, there is a high probability that the 3D points in the region around \(p\) are also visible in the query image. The method of [11] ignores this information and simply continues with 2D-to-3D matching.

We actively search for matches among the \(N_{3D}\) points nearest to \(p\) in 3D space. Each such point \(p'\) is inserted into the prioritization scheme. Once it is activated (does this mean its turn to be processed has come?), we look for a matching feature in the image by 3D-to-2D search. The set of features with similar appearance can again be identified with the visual vocabulary.

While 2D-to-3D matching needs a fine vocabulary to limit the search space, the 3D-to-2D search phase needs a coarser vocabulary to ensure that enough features are considered.

A key observation is that a coarser vocabulary can be obtained without any additional computation by using a vocabulary tree.

An additional benefit of using a coarser level is that matches lost to quantization effects can be recovered.
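To make the "no additional computation" point concrete, here is a small sketch. It assumes the tree described later in these notes (100k leaf words, branching factor 10, hence depth 5) and, as a simplifying assumption that is not from the paper, that leaves are numbered in tree order so that the descendants of an inner node form a contiguous id range. The name `coarse_word` is hypothetical.

```python
def coarse_word(fine_word, level, branching=10, depth=5):
    """Map a leaf (fine) word id to its ancestor at `level` of the tree.

    Assumes leaves are numbered 0 .. branching**depth - 1 in tree order,
    so the ancestor id is obtained by integer division alone.
    """
    if not 0 <= level <= depth:
        raise ValueError("level must lie between 0 (root) and depth (leaves)")
    return fine_word // branching ** (depth - level)
```

Under this numbering, fine word 12345 belongs to coarse word 12 at level 2 and to coarse word 123 at level 3. A real vocabulary tree would more likely record the path taken while quantizing a descriptor, which is equally free of extra distance computations.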

Note that active search is triggered only by 2D-to-3D matches, not by 3D-to-2D matches.

Prioritization

Prioritization is the key to efficient matching. In our framework, matching in either search direction amounts to finding the two nearest neighbors among the entries stored in a visual word.

In [11], the priority is based on the number of descriptor comparisons needed to find the nearest neighbors. (Details omitted here; see [11] for the full scheme.) We stop the search once \(N\) matches have been found. The remaining question is when to perform active search, i.e. whether to favor appearance (2D-to-3D) or co-visibility information (3D-to-2D).

  • When co-visibility matters more, the direct prioritization strategy performs active search as soon as a 2D-to-3D match is found. Only after the \(N_{3D}\) points have been matched against the features does it resume 2D-to-3D matching. As a result, the matches may cluster in a small region of the image, which makes the computed pose unstable.
  • When appearance is favored, 2D-to-3D matching is executed first, followed by active search and 3D-to-2D matching. This strategy may therefore benefit only little from active search.

Therefore, we propose a strategy that balances both directions. When a new match is found, active search is performed. The resulting points are then sorted by predicted search cost into a prioritization scheme shared by both directions. This strategy treats both kinds of information equally and always prefers whichever candidate is cheaper to evaluate.
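A shared priority queue is one way to picture this balanced strategy. The sketch below is an illustration under stated assumptions, not the paper's Alg. 1: `neighbours_of_match` is a hypothetical lookup standing in for "this feature found a 2D-to-3D match, and these nearby points with these predicted costs were returned by active search", and the early stop after \(N\) matches is left out.

```python
import heapq

def process_in_cost_order(feature_costs, neighbours_of_match):
    """Process candidates of both directions from one cost-ordered queue.

    feature_costs:       {feature_id: predicted 2D-to-3D search cost}
    neighbours_of_match: {feature_id: [(point_id, predicted 3D-to-2D cost), ...]}
                         (hypothetical stand-in for matching + active search)
    Returns the processing order as (direction, id) pairs.
    """
    # Heap entries: (cost, tie_breaker, direction, id); cheapest entry first.
    heap = [(cost, i, "2d3d", f)
            for i, (f, cost) in enumerate(feature_costs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    processed = []
    while heap:
        _, _, direction, ident = heapq.heappop(heap)
        processed.append((direction, ident))
        if direction != "2d3d":
            continue  # only 2D-to-3D matches trigger active search
        for point_id, point_cost in neighbours_of_match.get(ident, []):
            heapq.heappush(heap, (point_cost, counter, "3d2d", point_id))
            counter += 1
    return processed
```

Because features and points sit in the same heap, a cheap 3D point is handled before an expensive feature and vice versa, which is the sense in which neither direction is favored.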

image-20200224134404602


Computational Complexity

Given a point cloud with \(P\) points and a vocabulary of size \(W\), a word contains \(P / W\) points on average. When \(C + 1\) words are considered per feature instead of one, the additional search cost for a feature is \(C \cdot P / W\). For an image with \(F\) features, this amounts to \(\mathcal{O}\left(C \cdot F \cdot \frac{P}{W}\right)\).

In comparison, active search is triggered at most \(N\) times. With a kd-tree, the \(N_{3D}\) nearest points can be found in \(\mathcal{O}\left(N_{3D} \log_2(P)\right)\) time. With a coarser vocabulary of size \(W'\), a point is compared against \(\frac{F}{W'}\) features on average. Each trigger therefore costs one kd-tree query plus \(N_{3D}\) word look-ups, and since \(N\) and \(N_{3D}\) are constants, the additional cost of active search is \(\mathcal{O}\left(\log_2(P) + \frac{F}{W'}\right)\).

Comparison with Existing Methods

Compared with [11], our method can recover matches that are lost because of the fine vocabulary. Sec. 6 shows how important these matches are.

Compared with [10], our active search is based on 2D-to-3D matches, which are more reliable than 3D-to-2D matches. This gives our method better performance and also makes it more efficient.

Active search can even outperform tree-based search: the density of the descriptor space grows with the number of 3D points, so on large datasets the ratio test rejects many correct matches.

4. Efficient Implementation

A fine visual vocabulary of 100k words is used for 2D-to-3D matching. It is generated from SIFT descriptors with approximate k-means [15]. On top of this vocabulary, we build a vocabulary tree with branching factor 10.

Active search is integrated into the 2D-to-3D pipeline of [11].

After \(N = 100\) matches have been found, we estimate the pose with the RANSAC variant of [22], using the 6-point DLT algorithm [16].

image-20200224064944579

Because each point is observed in several images during reconstruction, a candidate point \(p'\) for 3D-to-2D matching has several descriptors \(D(p')\), which may be stored in different words of the fine vocabulary. Let \(AW\) be the set of visual words at level \(L\) that are activated by these descriptors. Among the features assigned to the words in \(AW\), we find the two features \(f_1, f_2\) that are closest to the descriptors in \(D(p')\).

The two loops in Alg. 1 then determine the search cost of \(p'\). We require \(f_1\) and \(f_2\) to be sufficiently distinct from each other. To account for the uncertainty of 3D-to-2D matches, a 3D-to-2D match cannot replace an existing 2D-to-3D match, and a feature matched by 3D-to-2D search is not considered again for 2D-to-3D matching.

The choice of level \(L\) directly controls how many image features are assigned to each word, and hence the search cost of a candidate point \(p'\). A query image typically has 1k-20k features (depending on its resolution), so we use level 2 (100 words) when it has at most 5k features and level 3 (1000 words) otherwise. A word then contains 5-50 features on average, so the search cost of \(p'\) can be considered constant.
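The "5-50 features per word" figure follows from simple arithmetic, sketched below. The function names are hypothetical, and whether an image with exactly 5k features falls on level 2 or level 3 is an assumption (level 2 here).

```python
def choose_level(num_features, threshold=5000):
    """Return (level, number of words at that level) for a query image.

    Level 2 has 100 words and level 3 has 1000 words; the threshold of
    5k features is taken from the text, its inclusiveness is assumed.
    """
    return (2, 100) if num_features <= threshold else (3, 1000)

def mean_features_per_word(num_features):
    """Average number of query features that fall into one coarse word."""
    _, words = choose_level(num_features)
    return num_features / words
```

For example, 5000 features at level 2 give 50 features per word, just above 5000 features at level 3 give about 5, and 20000 features at level 3 give 20, so the average stays within a narrow range regardless of image size.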

5. Incorporating Additional Visibility Information

Although more efficient than soft matching, active search still requires additional computation. This section discusses how to speed up the localization pipeline to compensate for the increased run-time.

The images used in the reconstruction approximate the set of viewpoints from which a point is visible. This information can accelerate both 3D-to-2D matching and RANSAC-based pose estimation, by filtering out points that cannot be seen. Because the visibility information is only approximate, the filtering also removes some correct points. We propose a simple strategy to recover the lost performance.

The filtering steps can be expressed on a bipartite graph \(G\) defined by the 3D points and the cameras.

image-20200224085823027

Filtering 3D points

Close proximity in 3D space does not imply co-visibility: as in Figure 4.a, two nearby points may never be observed from the same camera. Our point filter removes from the \(N_{3D}\) nearest neighbors every point that is not directly visible together with \(p\). Only points that are two edges away from \(p\) in \(G\), i.e. connected through a shared camera, are used for 3D-to-2D matching.
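A minimal sketch of this filter, assuming the bipartite graph is stored as one set of camera ids per point (the name `cameras_of_point` and this layout are hypothetical):

```python
def point_filter(p, neighbours, cameras_of_point):
    """Keep only the neighbours that share at least one camera with p.

    cameras_of_point: {point_id: set of camera ids observing the point},
    i.e. the bipartite visibility graph stored as adjacency sets.
    Sharing a camera is the same as being two edges away from p.
    """
    cams_p = cameras_of_point.get(p, set())
    return [q for q in neighbours if cams_p & cameras_of_point.get(q, set())]
```

With `cameras_of_point = {"p": {1, 2}, "a": {2, 3}, "b": {4}}`, the neighbour `"a"` survives because it shares camera 2 with `"p"`, while `"b"` is dropped.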

RANSAC Pre-Filter (review needed)

As shown in Figure 4.c, the established matches define a subgraph \(g_c\) of \(G\). Our RANSAC pre-filter finds the connected component of \(g_c\) that contains the most 3D points and keeps only the matches whose points belong to this component. This accelerates RANSAC-based pose estimation, because outliers are removed. Note that this filter also has some effect on 3D-to-2D matches, because a wrong 2D-to-3D match is unlikely to lead to further matches.
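The pre-filter can be sketched as a connected-component search over the matched points and the cameras observing them. The data layout and names below are hypothetical, and measuring component size by the number of matched 3D points follows the reading given above.

```python
from collections import defaultdict

def ransac_prefilter(matches, cameras_of_point):
    """Keep the matches whose 3D points lie in the largest component.

    matches:          list of (feature_id, point_id)
    cameras_of_point: {point_id: set of camera ids observing the point}
    """
    matched_points = {p for _, p in matches}
    points_of_camera = defaultdict(set)
    for p in matched_points:
        for cam in cameras_of_point.get(p, set()):
            points_of_camera[cam].add(p)

    best_component, seen = set(), set()
    for start in matched_points:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first search: point -> camera -> point
            p = stack.pop()
            for cam in cameras_of_point.get(p, set()):
                for q in points_of_camera[cam]:
                    if q not in component:
                        component.add(q)
                        stack.append(q)
        seen |= component
        if len(component) > len(best_component):
            best_component = component
    return [(f, p) for f, p in matches if p in best_component]
```

Matches to isolated points, which no camera links to the main group, are discarded before RANSAC ever samples them. If two components tie in size, this sketch keeps whichever it encounters first.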

Using Camera Sets (review needed)

The filtering steps may be too aggressive (see the green point in Figure 4.b). By merging cameras, we hope to obtain a better, more continuous approximation of the viewpoints. For each image \(I_j\), we define its set of similar images \(sim(I_j)\) as the \(k\) nearest cameras whose viewing direction differs from that of \(I_j\) by at most \(60°\). The resulting set cover problem is solved with a greedy algorithm [10].
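For reference, the standard greedy heuristic for set cover looks as follows. How exactly the paper sets up the instance is not spelled out in these notes, so the sketch assumes the universe is the set of cameras and the candidate sets are the \(sim(I_j)\); the names are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedy set cover: repeatedly pick the set covering most uncovered items.

    universe:       iterable of camera ids that must be covered
    candidate_sets: {set_id: set of camera ids}, e.g. sim(I_j) keyed by j
    Returns the chosen set ids in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (smallest id wins).
        best = max(sorted(candidate_sets),
                   key=lambda s: len(candidate_sets[s] & uncovered))
        gained = candidate_sets[best] & uncovered
        if not gained:
            raise ValueError("remaining cameras cannot be covered")
        chosen.append(best)
        uncovered -= gained
    return chosen
```

The greedy choice is not guaranteed to be optimal, but it is simple and gives the usual logarithmic approximation bound for set cover.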

6. Experimental Evaluation

The longest side of every query image is 1600 pixels. An image is considered registered if its best pose has more than 12 inliers.

image-20200224094252980

Evaluation of Active Search

Fig. 5 shows the mean registration time and the mean number of registered images.

The direct strategy, which considers visibility information first, has the best registration performance, but it tends to find matches concentrated in one part of the image, resulting in poor pose accuracy, as shown in Table 2. Our strategy does not sacrifice localization accuracy for registration performance.

image-20200224095057262

Faster Registration using Filtering

Although this paper focuses on registration performance, we are still interested in making localization as fast as possible. The point filter and the RANSAC pre-filter are designed for this purpose.

image-20200224112222593

As expected, filtering slightly reduces the average registration performance. We found that the filters behave differently on different datasets.

The RANSAC pre-filter has the greatest impact on Vienna, because it is the dataset with the fewest descriptors, so wrong 2D-to-3D matches are more likely to be found. Since the mismatches are spread over the whole model, the RANSAC pre-filter can remove most of them.

In contrast, the filters have the least impact on Rome. Its model consists of reconstructions of distinct landmarks, each with many similar viewpoints, which reduces the effectiveness of the RANSAC pre-filter. In addition, the denser descriptor space makes it hard for the pre-filter to remove mismatches.

Using Camera Sets

image-20200224113019208

Comparison with State-of-the-Art

Table 3 compares our method (\(N_{3D} = 200, k = 10\)) with other methods.

\(P2F\) denotes the 3D-to-2D matching of Li et al.

\(P2F + F2P\) additionally performs 2D-to-3D matching if \(P2F\) fails.

image-20200224113526469

7. Conclusion & Future Work

Combining 2D-to-3D and 3D-to-2D search works best, and the resulting method is an order of magnitude faster than a tree-based scheme.

Origin www.cnblogs.com/tweed/p/12358398.html