The binary bag-of-words model in SLAM: generation process and working principles

picture

One of the most important requirements of long-term visual SLAM (Simultaneous Localization and Mapping) is robust place recognition. After a period of exploration, when areas that have not been observed for a long time are revisited, standard matching algorithms fail.

When loop closures are detected robustly, they provide the correct data association needed to build a consistent map. The same methods used for loop detection can also relocalize the robot after trajectory loss, for example due to sudden motion, severe occlusion, or motion blur.

The basic bag-of-words technique builds a database from images collected online by the robot, so that the most similar images can be retrieved when a new image is acquired. If the images are similar enough, a loop closure is detected. Traditional text classification mainly uses methods based on the bag-of-words (BoW) model. However, the BoW model has an important problem: data sparsity.

Since a language contains many words while any one text contains only a small fraction of them, the feature vectors constructed by the BoW model are mostly zeros and are very sparse. This results in poor classification and computational inefficiency. To solve this sparsity problem, researchers proposed the bag of binary words (BoBW) model, based on binary features. The BoBW method uses fixed-size binary codes, instead of high-dimensional word-frequency vectors, as the representation.

In this way, the sparsity problem of the BoW model is overcome. The BoBW model also improves computational efficiency: because it uses low-dimensional binary features, it greatly reduces computation and memory requirements, giving it significant advantages in classification speed and efficiency.

Binary bag of words is a feature representation that maps the words in a text to a binary vector of fixed length. Specifically: first, build a vocabulary from all unique words that appear in the corpus. Then, for a given text, check whether each vocabulary word appears in it: 1 if present, 0 otherwise. This yields a fixed-length binary vector representing the text, where each element corresponds to one vocabulary word.
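As an illustration, the construction above can be sketched in a few lines of Python (the vocabulary and sentence are made-up examples):

```python
# Minimal sketch of the binary bag-of-words idea using the text analogy
# described above. Vocabulary and sentence are made-up examples.

def binary_bow(text, vocabulary):
    """Return a fixed-length 0/1 vector: 1 if the word occurs in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["robot", "map", "loop", "camera", "feature"]
vec = binary_bow("The robot builds a map with a camera", vocabulary)
print(vec)  # [1, 1, 0, 1, 0]
```

Note that the vector length is fixed by the vocabulary, not by the text, which is exactly what makes the representation compact.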

The binary feature representation uses the FAST algorithm to detect corners. FAST detects a corner by comparing pixel intensities on a Bresenham circle of radius 3 around the candidate point; only a small number of pixels need to be compared, so it is very efficient. A BRIEF descriptor is then computed for each FAST corner. The BRIEF descriptor is a binary vector in which each element is the result of comparing the brightness of two pixels in the patch around the corner. The BRIEF descriptor is defined as:

B_i(p) = 1 if I(p + a_i) < I(p + b_i), and 0 otherwise

Where B_i(p) is the i-th element of the descriptor, I(·) is the brightness at a pixel, and a_i and b_i are the offsets of the two compared pixels relative to the patch center. Given the patch size S_b and the descriptor length L_b, a_i and b_i are chosen randomly in an offline stage. The distance between two BRIEF descriptors is their Hamming distance. These binary descriptors are used to build a bag-of-words model, discretizing the binary descriptor space into visual words through binary clustering (k-medians). A direct index and an inverse index are maintained to speed up similar-image retrieval and geometric verification. By checking consistency with previous matches, mismatches between visually similar places are handled effectively. Feature extraction and matching in the final algorithm take only 22 ms, an order of magnitude faster than features such as SURF.
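A minimal numpy sketch of the BRIEF test and Hamming distance described above (the patch size, descriptor length, and random test image are illustrative values, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative patch size S_b and descriptor length L_b.
# Offsets a_i, b_i are drawn once, in an "offline" stage.
S_B, L_B = 31, 256
half = S_B // 2
a = rng.integers(-half, half + 1, size=(L_B, 2))
b = rng.integers(-half, half + 1, size=(L_B, 2))

def brief(image, cx, cy):
    """BRIEF: B_i = 1 if I(p + a_i) < I(p + b_i), else 0."""
    ia = image[cy + a[:, 1], cx + a[:, 0]]
    ib = image[cy + b[:, 1], cx + b[:, 0]]
    return (ia < ib).astype(np.uint8)

def hamming(d1, d2):
    """Number of differing bits between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))

img = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)
d1 = brief(img, 40, 40)
d2 = brief(img, 70, 60)
print(hamming(d1, d1))  # 0 (identical descriptors)
print(hamming(d1, d2))  # large for unrelated patches
```

Because the descriptors are bit vectors, the distance reduces to counting differing bits, which real implementations compute with XOR and popcount instructions.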

picture

Figure 1: Example of a vocabulary tree and the direct and inverse indexes that make up the image database. The visual words are the leaf nodes of the tree. The inverse index stores the weight of each word in the images in which it appears. The direct index stores the features of each image and their associated nodes at some level of the vocabulary tree.

1. Image database modeling

This section introduces the use of the bag-of-words model to convert image features into sparse numerical vectors, which makes it practical to process large numbers of images. A vocabulary tree is used to discretize the descriptor space into W visual words. Unlike other features, the space being discretized here is a binary descriptor space, so the model is more compact. The vocabulary tree is built through hierarchical k-medians clustering.

First, k-medians clustering is performed on the training descriptors and the cluster centers are kept. The process is then repeated recursively within each cluster, building a tree of L_w levels whose W leaf nodes are the final visual words. Each word is weighted by its frequency in the training corpus, using tf-idf values so that frequent, less discriminative words are suppressed. An image I_t is converted into a bag-of-words vector v_t: each of its binary descriptors traverses the tree from the root, selecting at each level the intermediate node with the smallest Hamming distance, until it reaches a leaf. The similarity of two bag-of-words vectors v1 and v2 is computed as:

s(v1, v2) = 1 - (1/2) | v1/|v1| - v2/|v2| |, where | · | denotes the L1 norm
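A minimal sketch of this L1-based similarity score, assuming the vectors are normalized by their L1 norms as in DBoW-style systems:

```python
import numpy as np

def score(v1, v2):
    """L1 similarity in [0, 1]: s = 1 - 0.5 * || v1/|v1| - v2/|v2| ||_1."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    v1 = v1 / np.abs(v1).sum()  # normalize by L1 norm
    v2 = v2 / np.abs(v2).sum()
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()

print(score([1, 0, 2], [1, 0, 2]))  # 1.0 (identical vectors)
print(score([1, 0, 0], [0, 1, 0]))  # 0.0 (no shared words)
```

The score is 1 for identical word distributions and 0 when the two images share no words at all.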

In addition to the bag of words and the inverse index, the paper also proposes a direct index that stores, for each image, its words and their corresponding features. The direct index is used to quickly find corresponding points: only features whose ancestors coincide at a given level of the vocabulary tree are compared.
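The word lookup described in this section, descending the vocabulary tree by smallest Hamming distance at each level, can be sketched as follows (the toy tree, its centroids, and the 8-bit descriptors are made-up illustrative values):

```python
import numpy as np

# A toy vocabulary tree: each node holds a binary centroid and children;
# leaves are visual words. Structure and centroid values are illustrative.
class Node:
    def __init__(self, centroid, children=None, word_id=None):
        self.centroid = np.asarray(centroid, dtype=np.uint8)
        self.children = children or []
        self.word_id = word_id  # set on leaves only

def descriptor_to_word(root, desc):
    """Descend the tree, picking the child with the smallest Hamming distance."""
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda c: np.count_nonzero(c.centroid != desc))
    return node.word_id

# Depth-2 tree, branching factor 2, 4 leaf words over 8-bit descriptors.
leaves = [Node([0, 0, 0, 0, 0, 0, 0, 0], word_id=0),
          Node([0, 0, 0, 0, 1, 1, 1, 1], word_id=1),
          Node([1, 1, 1, 1, 0, 0, 0, 0], word_id=2),
          Node([1, 1, 1, 1, 1, 1, 1, 1], word_id=3)]
root = Node([0] * 8, children=[
    Node([0, 0, 0, 0, 0, 0, 1, 1], children=leaves[:2]),
    Node([1, 1, 1, 1, 0, 0, 1, 1], children=leaves[2:])])

print(descriptor_to_word(root, np.array([1, 1, 1, 0, 1, 1, 1, 1])))  # 3
```

With a branching factor k and L_w levels, each descriptor is compared against only k centroids per level instead of all W words, which is what makes the conversion fast.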

2. Loop closure detection
1. Database query

When the latest image I_t arrives, it is converted into the bag-of-words vector v_t. Querying the database returns the images most similar to v_t as candidate matches <vt, vt1>, <vt, vt2>, ..., together with their scores s(vt, vtj). The score is then normalized by that of the best expected match:

η(vt, vtj) = s(vt, vtj) / s(vt, vt-Δt)

where s(vt, vt-Δt) is the score against the previous image, used to approximate the best score I_t can achieve.

2. Match grouping

To prevent consecutive images from competing with each other, similar consecutive images are grouped. If the time difference between two matched images is small, they belong to the same group. A group's score is computed as:

H(vt, VTi) = Σ η(vt, vtj), where the sum runs over the matches vtj in group Ti

The group with the highest score is taken as the initial match.
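A sketch of the grouping step, assuming matches arrive as (timestamp, η) pairs sorted by timestamp; the gap threshold and all values are made-up:

```python
# Consecutive candidate matches whose timestamps differ by less than `gap`
# form one group; a group's score is the sum of its members' eta values.

def group_matches(matches, gap=2.0):
    """matches: list of (timestamp, eta) pairs sorted by timestamp."""
    groups, current = [], [matches[0]]
    for m in matches[1:]:
        if m[0] - current[-1][0] < gap:
            current.append(m)      # close in time: same group
        else:
            groups.append(current)  # gap too large: start a new group
            current = [m]
    groups.append(current)
    return groups

matches = [(10.0, 0.5), (10.5, 0.75), (11.0, 0.25), (25.0, 0.9)]
groups = group_matches(matches)
scores = [sum(eta for _, eta in g) for g in groups]
best = groups[scores.index(max(scores))]
print(scores)      # [1.5, 0.9]
print(best[0][0])  # 10.0 (start of the best group)
```

Summing η over a group rewards sequences of consistently similar images, so a single high-scoring outlier cannot outcompete a sustained match.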

3. Temporal consistency

Consistency is checked across consecutive queries. The match <vt, VT'> must be consistent with the k previous matches <vt-Δt, VT1>, ..., <vt-kΔt, VTk>, and the time intervals between consecutive groups must be small. Only the <vt, vt'> with the largest η score is retained as a candidate loop closure match.
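A simplified reading of this check, treating each matched group as a time interval and requiring the last k matched groups to lie close to the candidate; the interval-distance formulation and thresholds are assumptions for illustration:

```python
def interval_distance(a, b):
    """Gap between two time intervals (start, end); 0 if they overlap."""
    return max(0.0, a[0] - b[1], b[0] - a[1])

def temporally_consistent(candidate_group, previous_groups, k=3, max_gap=3.0):
    """Accept the candidate only if the last k queries matched groups
    that lie close in time to it (a simplified reading of the check)."""
    recent = previous_groups[-k:]
    if len(recent) < k:
        return False  # not enough history to support the match yet
    return all(interval_distance(candidate_group, g) <= max_gap for g in recent)

prev = [(8.0, 9.0), (9.0, 11.0), (11.0, 13.0)]
print(temporally_consistent((10.0, 12.0), prev))  # True
print(temporally_consistent((30.0, 31.0), prev))  # False
```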

4. Effective geometric consistency

Given a matching image pair <It, It'>, we first look up It' in the direct index. The direct index stores the words associated with each image and their corresponding features. We only compare features whose ancestors coincide at level l of the vocabulary tree.

The parameter l trades off the number of matched points against time cost. When l = 0, only features belonging to the same word are compared (fastest), but fewer correspondences are obtained. When l = Lw, the number of correspondences is unrestricted, but no time is saved. Once enough correspondences are obtained, the RANSAC algorithm is used to find the fundamental matrix. Although the fundamental matrix is only needed to verify the match, once computed it also provides the data association between images to the SLAM algorithm at no extra cost.
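The direct-index lookup can be sketched as follows, with made-up node ids standing in for the level-l nodes of each feature; a real implementation would then run RANSAC on the surviving candidate pairs:

```python
from collections import defaultdict

# Each image stores, per feature, the vocabulary-tree node it passed
# through at level l. Candidate correspondences are only sought between
# features that share the same node. Node ids here are made-up.

def build_direct_index(node_ids):
    """Map node-at-level-l -> list of feature indices for one image."""
    index = defaultdict(list)
    for feat, node in enumerate(node_ids):
        index[node].append(feat)
    return index

def candidate_pairs(nodes_a, nodes_b):
    """Feature pairs (i, j) that fall under the same level-l node."""
    idx_b = build_direct_index(nodes_b)
    return [(i, j) for i, node in enumerate(nodes_a) for j in idx_b.get(node, [])]

nodes_a = [3, 7, 3, 9]  # node id at level l for each feature of image A
nodes_b = [7, 3, 5]
print(candidate_pairs(nodes_a, nodes_b))  # [(0, 1), (1, 0), (2, 1)]
```

Instead of comparing every feature of A against every feature of B, only features under a shared node are considered, which is where the speed-up comes from.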

3. Experimental testing

The evaluation uses 5 public datasets covering indoor and outdoor, static and dynamic environments. Loop-closure ground truth, including matching time intervals, was created manually. Correctness is measured with precision and recall. Parameters are tuned on some datasets and the effects evaluated on others, to demonstrate the robustness of the algorithm.

Compared with SURF, the results show that BRIEF performs close to SURF, and better than SURF64 and U-SURF128 on Bicocca25b. BRIEF is faster, but sensitive to scale and rotation changes. BRIEF is better suited to matching distant objects, while SURF suits large viewpoint changes at close range.

picture

Figure 2: Precision-recall curves obtained by BRIEF, SURF64 and U-SURF128 on the training datasets, without geometric checking.

Second, a certain number of temporally consistent detections is required to detect loop closures. k = 3 gives the best results and is stable across different processing frequencies, as shown below:

picture

Figure: Effect of the similarity threshold α, the number of temporally consistent matches k, and the processing frequency f.

In terms of running time, the complete algorithm takes only 22 ms, an order of magnitude faster than SURF. Feature extraction takes the most time. A larger vocabulary takes longer for the conversion, but makes queries faster.

picture

picture

Figure: Examples of words matched using BRIEF (pairs on the left) and SURF64 descriptors (pairs on the right).

4. Conclusion

Binary features are very effective, and extremely efficient, in the bag-of-words approach. In particular, the results show that FAST+BRIEF features are as reliable as SURF (64- or 128-dimensional, without rotation invariance) for the loop-detection problem with in-plane camera motion that is common in mobile robots.

The execution time and memory requirements are an order of magnitude smaller, without requiring special hardware. The public datasets cover indoor, outdoor, static, and dynamic environments, with front- or side-facing cameras. Unlike most previous work, to avoid over-tuning, all results are presented using the same vocabulary, obtained from independent datasets, and the same parameter configuration, obtained from a set of training datasets, without peeking at the evaluation data.

Therefore, we can claim that our system provides robust and efficient performance in a wide range of real-world situations without any additional tuning. The main limitation of this technique is the use of features that lack rotation and scale invariance.


Origin blog.csdn.net/soaring_casia/article/details/132872881