ORB-SLAM3 step-by-step tutorial: DBoW2 from theory to implementation

1. The Bag of Words Model

The Bag of Words model is a document representation method commonly used in information retrieval. Put simply, a piece of text is composed of different words; if we ignore the order of these words, the text can be represented by a histogram of their frequencies. The words are the "Words" and the histogram is the "Bag of Words".

PS: This article provides out-of-the-box code for practicing loop-closure detection; see the Code link.

1.1 Two examples

(From Wikipedia's **Bag of Words** article):

Text 1: John likes to watch movies. Mary likes movies too
Text 2: John also likes to watch football games.

Dictionary: ["John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too"]

Then the Bag of Words of these two texts is:
Bag of Words of text 1: [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
Bag of Words of text 2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each number indicates how many times the word at the corresponding position occurs.
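The construction above is easy to reproduce. A minimal sketch in plain Python (the tokenization is simplified, and `bag_of_words` is a name chosen here for illustration, not a DBoW2 API):

```python
# Build Bag of Words vectors for the two example texts above.
dictionary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

def bag_of_words(text, dictionary):
    # Strip punctuation, split into words, and count occurrences
    # of each dictionary word (word order is ignored).
    words = text.replace(".", "").split()
    return [words.count(w) for w in dictionary]

text1 = "John likes to watch movies. Mary likes movies too"
text2 = "John also likes to watch football games."

print(bag_of_words(text1, dictionary))  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(bag_of_words(text2, dictionary))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```

Note that word order is lost: "Mary likes movies" and "movies likes Mary" produce the same vector.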

1.2 Too many words

If there are too many words in the text, the Bag of Words vector becomes very sparse and very long, and computing distances between such vectors becomes expensive. Therefore, we need to cluster the words and group similar words together, so that the number of words can be reduced.

One of the simplest approaches is hashing: hash the words into a fixed-length vector, so that the number of words is controlled at a fixed length. The drawback is that similar words are hashed to unrelated positions, so similar words can no longer be grouped together.
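The hashing idea can be sketched as follows — a toy illustration only (the bucket count and hash function are arbitrary choices here, and this is not how DBoW2 works):

```python
# Toy feature hashing: map an unbounded vocabulary into a
# fixed-length vector by hashing each word to a bucket.
def hashed_bow(text, n_buckets=8):
    vec = [0] * n_buckets
    for word in text.replace(".", "").split():
        # Simple deterministic hash; note that similar words
        # generally land in unrelated buckets, which is exactly
        # the drawback described above.
        h = sum(ord(c) for c in word) % n_buckets
        vec[h] += 1
    return vec

v = hashed_bow("John likes to watch movies. Mary likes movies too")
print(len(v), sum(v))  # 8 buckets, 9 words counted
```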

2. Bag of Words for Images

The Bag of Words model is very commonly used in text retrieval, and it can also be used in image retrieval. But there is a big difference between images and text. Text can be regarded as a discrete signal composed of different words; an image, however, is a two-dimensional matrix that cannot directly be treated as a collection of discrete elements. Therefore, the Bag of Words model for images requires some preprocessing before it can be used.

Put simply, a feature extraction algorithm extracts feature points from the image; these feature points then serve as the words, and the Bag of Words model can be applied.

2.1 Construction of Image Bag of Words Vocabulary

After extracting feature points from a large amount of data offline, we obtain a large number of feature descriptors. Running K-means/K-means++ clustering on these descriptors yields a set of cluster centers, called nodes. Because clustering averages descriptors, if the descriptors are binary vectors the cluster centers will not be exactly 0 or 1 but floating-point values; in that case the center values must be binarized to obtain the nodes. These nodes form the vocabulary of the Bag of Words model, i.e. the words.

The Bag of Words of an image is then the frequency histogram (or frequency vector) of the image's feature points over these words.

Specifically, take a new image and extract its feature points. For each feature point, compute its distance to each node, find the nearest node, and use the index of that node as the word of this feature point. This yields the set of words of this image. Performing histogram statistics on this set gives the Bag of Words of the image.
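The quantization step can be sketched like this (illustrative Python; real DBoW2 descriptors are binary and compared with the Hamming distance, whereas this toy version uses small integer vectors and the L1 distance for readability):

```python
# Quantize each feature descriptor to its nearest cluster center
# (node), then histogram the resulting word ids.
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def quantize(descriptor, nodes):
    # word id = index of the nearest node
    return min(range(len(nodes)), key=lambda i: l1(descriptor, nodes[i]))

def bow_histogram(descriptors, nodes):
    hist = [0] * len(nodes)
    for d in descriptors:
        hist[quantize(d, nodes)] += 1
    return hist

nodes = [[0, 0], [10, 10], [20, 0]]            # toy cluster centers
features = [[1, 1], [9, 11], [19, 1], [0, 2]]  # toy descriptors of one image
print(bow_histogram(features, nodes))  # [2, 1, 1]
```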

3. Word Weights

Different words carry different amounts of information. For example, a rare domain-specific word is far more discriminative than the word "the". Highly discriminative words should be given greater weight, and vice versa. Here are some commonly used weight calculation methods.

Term Frequency : TF

$w_i^{TF} = \frac{n_{id}}{n_d}$

where

  • $n_{id}$ : number of occurrences of word $i$ in document $d$
  • $n_d$ : total number of words in document $d$

Intuitive meaning: the frequency of a certain feature appearing in the current picture

3.1 Inverse Document Frequency : IDF

Inverse document frequency (IDF):

$w_i^{IDF} = \log\left(\frac{N}{N_i}\right)$

where

  • $N$ : total number of documents
  • $N_i$ : number of documents containing word $i$

Intuitive meaning: how rarely a certain feature appears across the whole data set

3.2 TF-IDF

This is the most commonly used weighting method, and it is also the default weighting method in **DBoW2**.

Term Frequency – Inverse Document Frequency

$w_i = \frac{n_{id}}{n_d} \log\left(\frac{N}{N_i}\right)$

Intuitive meaning: words that appear frequently across the entire data set, such as "the" (or, for images, features that appear everywhere), are less discriminative and should get a smaller weight. But if a word appears frequently within a single document, its weight for that document should be larger.

Through TF-IDF weighting, features that appear rarely, i.e. the more discriminative features, receive greater weight.

The relevant weight definitions are explained in the **README of the initial version of DBoW**.
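A quick numeric illustration of the formula (all numbers here are made up for the example):

```python
import math

# TF-IDF weight of word i in document d:
#   w_i = (n_id / n_d) * log(N / N_i)
def tf_idf(n_id, n_d, N, N_i):
    return (n_id / n_d) * math.log(N / N_i)

# A word occurring 3 times among 100 features of an image,
# present in 5 of 1000 database images: rare -> large weight.
w_rare = tf_idf(3, 100, 1000, 5)
# Same TF, but the word occurs in 900 of 1000 images: common -> small weight.
w_common = tf_idf(3, 100, 1000, 900)
print(w_rare > w_common)  # True
```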

4. Creation of dictionary and vocabulary tree

The so-called dictionary is the set of nodes obtained by clustering a large number of feature descriptors. These nodes form the vocabulary of the Bag of Words model, i.e. the words. The words represent the most representative features of the images.

In the traditional Bag of Words, the dictionary has no tree structure; it is just a collection of words (features). In **DBoW2**, however, the dictionary is organized as a tree, so lookups can be accelerated through the tree structure. The tree is only used to speed up search: the set of all leaf nodes of the tree is the dictionary, and the leaf nodes are also called words.

4.1 Create vocabulary tree

Finally, the following vocabulary tree will be constructed

[Figure: structure of the vocabulary tree]

The paper describes how to build a vocabulary tree like this:

To build it, we extract a rich set of features from some training
images, independently of those processed online later. The descriptors
extracted are first discretized into kw binary clusters by performing
k-medians clustering with the k-means++ seeding [22]. The medians
that result in a non binary value are truncated to 0. These clusters
form the first level of nodes in the vocabulary tree. Subsequent levels
are created by repeating this operation with the descriptors associated
to each node, up to Lw times.

There is some inconsistency between the description in the paper and the figure. In the figure, the leaf nodes are at level 0. But the paper's description is top-down: node 0 is the root, and clustering proceeds downward, so the leaf level should be level $L_w$. Here $L_w$ (level, word) is the number of levels of the vocabulary tree, and $k_w$ is the number of children of each node.
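A vocabulary tree with branching factor $k_w$ and $L_w$ levels has at most $k_w^{L_w}$ leaves, i.e. words. For instance, for the k=9, L=3 configuration that appears in the YAML dictionary later in this article:

```python
# Maximum number of words (leaf nodes) of a vocabulary tree
# with branching factor k and L levels: k ** L.
def max_words(k, L):
    return k ** L

print(max_words(9, 3))   # 729
print(max_words(10, 6))  # 1000000, e.g. a large 10^6-word ORB vocabulary
```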

In the code implementation, the actual construction order is as shown below:

[Figure: actual construction order of the vocabulary tree]

4.2 Creating the dictionary

The process of creating a dictionary is actually just traversing the vocabulary tree and collecting all leaf nodes.

4.3 Calculate word weight (i.e. leaf node weight)

Here we take the default weighting method, TF-IDF, as an example. The TF-IDF calculation was explained above. When creating the dictionary, only the IDF term can be computed; the TF term can only be computed later, when the dictionary is used to compute the Bag of Words of an image.

To review: in IDF, $N$ is the total number of images, and $N_i$ is the number of images containing word $i$.

Therefore, when computing IDF, we also need the vocabulary tree created in the first step, to count for each leaf node the number of images containing it. The process is roughly as follows:

for each image:
    counted = empty set
    for each feature in image:
        word_id = tree.find(feature)
        if word_id not in counted:
            counted.add(word_id)
            word_Ni[word_id] += 1

for each word_id:
    word_idf[word_id] = log(N / word_Ni[word_id])
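The same counting logic as runnable Python (the vocabulary-tree lookup is abstracted away: each image is given directly as the list of word ids its features quantize to):

```python
import math

# Each "image" is the list of word ids its features fall into.
images = [
    [0, 0, 1, 2],  # image 0
    [0, 1, 1],     # image 1
    [2, 2, 3],     # image 2
]

def compute_idf(images, vocab_size):
    N = len(images)
    Ni = [0] * vocab_size        # number of images containing each word
    for words in images:
        for w in set(words):     # count each word at most once per image
            Ni[w] += 1
    # Guard against log(N/0) for words never seen; assigning them
    # weight 0 is a choice made here for the sketch.
    return [math.log(N / n) if n > 0 else 0.0 for n in Ni]

idf = compute_idf(images, vocab_size=4)
# words 0, 1, 2 each appear in 2 of 3 images; word 3 in 1 of 3
print([round(x, 3) for x in idf])
```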

4.4 Save dictionary and vocabulary tree

**DBoW2** saves the dictionary and vocabulary directly as YAML, using cv::FileStorage for reading and writing, which can also write compressed files directly. The generated dictionary/vocabulary-tree file looks like this:

%YAML:1.0
---
vocabulary:
   k: 9
   L: 3
   scoringType: 0
   weightingType: 0
   nodes:
      - { nodeId:1, parentId:0, weight:0.,
          descriptor:"114 237 18 190 93 135 214 232 143 132 232 11 110 202 37 208 248 251 235 227 78 3 218 255 179 244 143 59 17 63 47 142 " }
      - { nodeId:2, parentId:0, weight:0.,
          descriptor:"190 116 236 103 236 127 255 216 123 238 247 191 91 125 223 59 255 221 247 231 173 246 255 138 127 251 229 181 231 253 94 243 " }
      - ...
      - { nodeId:9, parentId:0, weight:0.,
          descriptor:"208 188 159 182 168 236 17 7 190 159 130 25 247 183 64 91 161 119 109 16 58 77 147 35 217 227 126 89 128 64 135 43 " }
      - { nodeId:717, parentId:9, weight:0.,
          descriptor:"208 62 159 183 170 106 52 55 190 191 130 245 247 183 192 91 167 119 108 0 57 205 131 2 200 163 127 144 168 112 135 27 " }
      - { nodeId:718, parentId:9, weight:0.,
          descriptor:"208 189 31 254 43 204 85 103 185 19 226 56 190 183 32 115 162 247 108 0 105 203 146 111 209 243 86 27 0 16 167 43 " }
      - ...
      - # all subsequent nodes omitted...
   words:
      - { wordId:0, nodeId:19 }
      - { wordId:1, nodeId:20 }
      - { wordId:2, nodeId:21 }
      - { wordId:3, nodeId:22 }
      - { wordId:4, nodeId:23 }
      - # all subsequent words omitted...

There are a few points worth noting above:

  1. Because of the traversal order used when creating the vocabulary tree, mentioned above (it is actually neither breadth-first nor depth-first), the children of the root's last child node (nodeId:9 above) have particularly large ids (717, 718, ...). The ids follow the construction/traversal order, not the structure of the tree.
  2. In **DBoW2** the feature descriptor type (using ORB features as an example) is cv::Mat with type CV_8UC1. One ORB descriptor has 256 bits in total, so it is stored as 32 unsigned chars. You can see that the descriptor field above is a string of 32 numbers, each between 0 and 255.
  3. Only leaf nodes (i.e. words) have weights; all other nodes have weight 0.

5. Adding images to the database

Adding an image to the database can be understood as computing the Bag of Words for the newly added image at runtime, and at the same time building the so-called direct index and inverted index.

The inverted index stores, for each word, the indices of the images containing it. When a new image comes in and its Bag of Words is computed, the indices of the other images containing each of its words can be obtained directly. This greatly speeds up retrieval.

The direct index stores, for each image, the feature points contained in each word of that image. It is mainly used to accelerate feature matching. When a frame $I_t$ comes in and a corresponding image $I_t^\prime$ is obtained by some means (previous frame or loop-closure detection), the direct index can be used to find the feature points that fall into the same word in the two images; retrieving and matching among these feature points is much faster than computing distances between all pairs of feature points.

5.1 Computing the Bag of Words from the feature set

In fact, a more accurate description is converting a feature set into a Bag of Words. First, a feature point detector extracts feature points, and a descriptor is computed for each one. The Bag of Words is then computed over these feature descriptors.

Each feature is passed down the vocabulary tree to its final leaf node, and the vote of that node is incremented by 1. Performing this operation on all features yields a histogram over the words; this histogram is the Bag of Words.

The feature set here refers to the descriptors of all feature points in a picture.

The specific process is actually very simple:

  1. For each feature point, find its corresponding node, i.e. the word of this feature point.
  2. Look up the weight of the word and add it to the corresponding position of the BoW vector.
  3. After traversing all feature points, normalize the BoW vector. The normalized vector is the Bag of Words of this image.
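The three steps can be sketched as follows (illustrative Python; the word lookup is assumed to have been done already, and the weights are the per-word IDF values stored with the dictionary):

```python
# Build a normalized, weighted BoW vector from the word ids of
# one image's features.
def bow_vector(word_ids, word_weights):
    v = [0.0] * len(word_weights)
    for w in word_ids:
        v[w] += word_weights[w]    # accumulate the word's (IDF) weight
    s = sum(abs(x) for x in v)     # L1 normalization
    return [x / s for x in v] if s > 0 else v

weights = [0.5, 1.0, 2.0, 0.1]     # toy per-word weights
v = bow_vector([0, 0, 2, 1], weights)
print(v)       # [0.25, 0.25, 0.5, 0.0]
print(sum(v))  # 1.0
```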

5.2 Calculation method of TF-IDF weight in implementation

The IDF computation was explained above: the IDF of each word is fixed when the dictionary is built. The TF term, i.e. the frequency of a word in a given image, is never computed explicitly. Instead, when the BoW vector is computed, the weight of each feature's word is accumulated directly. After the final normalization, the weight of each word in the BoW vector is effectively its TF-IDF weight.

5.3 Database storage structure

The storage structure of the database adds a database field after the dictionary/vocabulary structure:

%YAML:1.0
---
vocabulary:
   k: 9
   L: 3
   scoringType: 0
   weightingType: 0
   nodes:...
   words:...
database:
   nEntries: 4
   usingDI: 1
   diLevels: 0
   invertedIndex:
      -
         - { imageId:0, weight:1.5644960771493644e-03 }
         - { imageId:2, weight:2.0755306357238515e-03 }
      -
         - { imageId:3, weight:4.0272643792627289e-03 }
      -
         - { imageId:0, weight:1.5644960771493644e-03 }
         - { imageId:2, weight:4.1510612714477030e-03 }
      -
         - { imageId:0, weight:1.5644960771493644e-03 }
         - { imageId:1, weight:3.7071104700705723e-03 }
      - ...
   directIndex:
      - # image 1
         -
            nodeId: 19
            features:
               - [ 20, 40, 256, 300, ..., feature_id]
         -
            nodeId: 21
            features:
               - [ 34 ]
         -
            nodeId: 22
            features:
               - [ 207 ]
      - # image 2
         -
            nodeId: 19
            features:
               - [ 256 ]
         - ...
      - # image 3
         -
            nodeId: 19
            features:
               - [ 256 ]
         - ...
      - # image 4
         -
            nodeId: 19
            features:
               - [ 256 ]
         - ...

6. Use the database to retrieve similar images and match feature points

6.1 Image retrieval

The process can be broken down into two steps:

  1. Compute the Bag of Words of the query image to get its BoW vector, denoted $v_t$ (the length of $v_t$ is the size of the vocabulary, i.e. the number of leaf nodes of the vocabulary tree)
  2. Retrieve the image closest to this BoW vector from the database

In fact, thanks to the inverted index, for each word in $v_t$ we can directly obtain the indices of the images containing that word. These images can then be fetched directly from the database, and the distance between each of them and $v_t$ is computed to find the nearest image. The actual computation proceeds as follows:

image_value: map<ImageId, AccumulatedValue>
for each word in v_t:
    for each (imageId, weight) in invertedIndex[word]:
        coeff = fabs(word.weight - weight) - fabs(word.weight) - fabs(weight)
        image_value[imageId] += coeff

sort image_value by value (ascending)  // more negative = more similar

select top N images from image_value
for each selected image:
    score = -0.5 * image_value[imageId]  // final score in [0, 1], 1 = best

The only strange thing here is the coefficient computation. Theoretically, the distance between BoW vectors is computed as follows (taking the L1 distance as an example):

$d(v_t, v_q) = ||v_t - v_q||_1 = \sum_{i=1}^{N} |v_t[i] - v_q[i]|$

The distance computed this way lies between 0 and 2, with 0 best and 2 worst. To map the final score into [0, 1] with 0 worst and 1 best, **DBoW2** applies a transformation to the distance above:

First, the L1 distance can be rewritten as follows (this rewriting holds only when $v_t$ and $v_q$ are both L1-normalized):

$d(v_t, v_q) = ||v_t - v_q||_1 = 2 + \sum_{i=1}^{N} \left( |v_t[i] - v_q[i]| - |v_t[i]| - |v_q[i]| \right)$

Split the distance calculation into two steps:

$d^1(v_t, v_q) = \sum_{i=1}^{N} \left( |v_t[i] - v_q[i]| - |v_t[i]| - |v_q[i]| \right), \quad [-2\ \mathrm{best}\ ...\ 0\ \mathrm{worst}]$

$s(v_t, v_q) = -0.5 \cdot d^1(v_t, v_q), \quad [0\ \mathrm{worst}\ ...\ 1\ \mathrm{best}]$
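A quick check, in plain Python, that this transformed score equals $1 - 0.5\,||v_t - v_q||_1$ for L1-normalized vectors:

```python
# DBoW2-style L1 score between two L1-normalized BoW vectors:
# accumulate |a-b| - |a| - |b| per component, then scale by -0.5.
def l1_score(v1, v2):
    d1 = sum(abs(a - b) - abs(a) - abs(b) for a, b in zip(v1, v2))
    return -0.5 * d1

v1 = [0.5, 0.5, 0.0]
v2 = [0.5, 0.0, 0.5]
direct = 1 - 0.5 * sum(abs(a - b) for a, b in zip(v1, v2))
print(l1_score(v1, v2), direct)  # both 0.5
print(l1_score(v1, v1))          # 1.0 for identical vectors
```

The per-component form matters in practice: words absent from both vectors contribute 0, so only the words actually stored in the inverted index need to be visited.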

6.2 Feature point matching

Consider images $I_q$ and $I_d$ (q for query, d for destination), with feature sets $F_q = \{f_i, i \in N_q\}$ and $F_d = \{f_j, j \in N_d\}$, where $N_q$ and $N_d$ are the numbers of feature points in the two images and $f$ is a feature descriptor (for ORB features, $f \in \{0,1\}^{256}$). Regardless of the descriptor type, there are roughly two traditional ways of matching feature points:

  1. Exhaustive search: for each $f_i$, compute its distance to every $f_j$ in $F_d$ and find the nearest $f_j$; if this distance is below a threshold, the two feature points are considered a match. The drawback is the computational cost, since distances to all feature points in $F_d$ must be computed.
  2. FLANN: first build a KD-tree on $F_d$, then for each $f_i$ search the KD-tree for the nearest feature point. Generally, FLANN (Fast Library for Approximate Nearest Neighbors) is used to speed up KD-tree construction and search. The paper's author also experimented with this method; in practice, because of the extra time spent building the KD-tree, the FLANN approach is not actually more efficient than exhaustive search.

In **DBoW2**, to speed up feature matching, when an image is added to the database the set of feature points corresponding to each word of the image is additionally stored; this is the so-called direct index. When matching feature points between two images, we only need to walk through the words of the two images, and for each word shared by both, exhaustively match the feature points it contains. Since the number of features per word is very limited, this greatly improves matching efficiency.

Worth studying here is the design of the data structures in the code. The main data structures **DBoW2** uses to store an image are as follows.

Feature matching in **DBoW2** is mainly used after a candidate image with the highest score has been retrieved via the BoW vector (or obtained by some other means); the candidate image is then matched against the query feature by feature. From the matched feature point sets, the fundamental matrix (Fundamental Matrix) between the two images can be computed. The computation of the fundamental matrix will be explained in detail later; this section focuses on the feature matching process.

6.3 Feature point matching procedure

6.4 Input

  • Images $I_1, I_2$: the two images whose feature points are to be matched
  • $F_1, F_2$: the feature sets of the two images, $F_1 = \{f_i, i \in N_1\}$, $F_2 = \{f_j, j \in N_2\}$
  • $D_1, D_2$: the direct indexes of the two images, storing the feature ids contained in each word

6.5 Output

  • $M$: the set of matched feature points of the two images, $M = \{m_k, k \in N_m\}$, $m_k = (i, j)$, where $i$ is a feature id in image 1 and $j$ is a feature id in image 2; $N_m$ is the number of matches

6.6 Procedure

for each word present in both D1 and D2:  // d1, d2: the feature ids of this word in each image
   for f1_id in d1.featureIds:
      best_dist_1 = DOUBLE_MAX
      best_dist_2 = DOUBLE_MAX
      best_f2_id = -1
      for f2_id in d2.featureIds:
         dist = distance(F1[f1_id], F2[f2_id]) // distance between the binary descriptors (Hamming)
         if dist < best_dist_1:
            best_dist_2 = best_dist_1
            best_dist_1 = dist
            best_f2_id = f2_id
         else if dist < best_dist_2:
            best_dist_2 = dist

      if best_dist_1 < best_dist_2 * THRESHOLD: // nearest must beat the second nearest by a ratio
         if f1_id not in M:
            M.push_back((f1_id, best_f2_id))
         else:
            compare the distances of (F1[f1_id], F2[M[f1_id]]) and (F1[f1_id], F2[best_f2_id])
            and keep the smaller one
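The same procedure as runnable Python, using the Hamming distance for binary descriptors (toy 4-bit descriptors; the ratio 0.6 is a typical value for this kind of test, not necessarily DBoW2's default):

```python
# Match features of two images through their direct indexes:
# only features falling into the same word are compared.
def hamming(a, b):
    return bin(a ^ b).count("1")

def match(F1, F2, D1, D2, ratio=0.6):
    matches = {}                     # f1_id -> f2_id
    for word in set(D1) & set(D2):   # words shared by both images
        for i in D1[word]:
            best, second, best_j = float("inf"), float("inf"), -1
            for j in D2[word]:
                d = hamming(F1[i], F2[j])
                if d < best:
                    second, best, best_j = best, d, j
                elif d < second:
                    second = d
            # nearest must beat the second nearest by the ratio
            if best < second * ratio:
                old = matches.get(i)
                if old is None or hamming(F1[i], F2[best_j]) < hamming(F1[i], F2[old]):
                    matches[i] = best_j
    return matches

F1 = [0b1111, 0b0001]      # toy 4-bit "descriptors" of image 1
F2 = [0b1110, 0b0000]      # toy 4-bit "descriptors" of image 2
D1 = {5: [0], 7: [1]}      # word id -> feature ids (direct index)
D2 = {5: [0, 1], 7: [1]}
print(match(F1, F2, D1, D2))  # matches feature 0 -> 0 and 1 -> 1
```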

7. Using DBoW2 for loop-closure detection

Loop-closure detection is implemented in **DLoopDetector**, another repository by the author of **DBoW2**. DLoopDetector uses the DBoW2 library to build the Bag of Words, uses the direct index to accelerate feature matching, and uses the matched feature pairs to solve for the fundamental matrix in order to check whether the retrieved image satisfies the physical constraints, i.e. the solved fundamental matrix should make the feature reprojection error very small. This step is called geometric verification.

Loop-closure detection is essentially retrieving a similar image, but the precision must be very high (close to 100%, because a wrong loop closure will corrupt the entire trajectory; the recall does not need to be as high — put simply, it is acceptable to miss a loop, but a detected loop must be correct). To achieve such high precision, some special processing is applied during loop detection to screen candidate loop frames:

7.1 Score normalization

The score of a candidate frame is normalized using the score between the current frame and the previous frame as the denominator. The reason for normalizing is that the BoW-vector distance is strongly affected by various conditions; with the normalized score it is easy to set a unified threshold. In the implementation, the matching score between the current frame and the previous frame is obtained by caching the previous frame's BoW vector. The threshold is generally not large; the default is 0.3.

7.2 Candidate frame aggregation

"aggregate" a batch of retrieved candidate frames according to the frame sequence number, which is called "island" in the paper. That is to say, frames with similar serial numbers are aggregated together. When the distance between the two frames is less than the threshold, it is considered to belong to the same island. When the number of aggregated frames is sufficient, it is considered a qualified candidate "island". The score of eachisland is the sum of the scores of all frames in it. Select the one with the largest scoreisland as the final matchisland. At the same time, the frame with the largest score in this island is the matching frame.

7.3 Temporal consistency

Check the temporal consistency of the island against the island matched for the previous frame: the current match $(v_t, V_{T^\prime})$ and the previous match $(v_{t-\Delta t}, V_{T_1})$ should satisfy that $V_{T^\prime}$ is very close to, or overlaps, $V_{T_1}$.

7.4 Calculation steps

  1. Compute the BoW vector of the current frame, denoted $v_t$
  2. Retrieve from the database a batch of images closest to $v_t$, denoted $V$
  3. Compute the normalized scores between $v_t$ and all candidate frames in $V$, and remove frames below the threshold; the remainder is denoted $V^\prime$
  4. Aggregate the frames in $V^\prime$ by frame index, and find the island with the highest score, denoted $V^\prime_{best}$
  5. PS: in the code, the temporal consistency check does not seem to actually take effect
  6. Take the highest-scoring frame in the island as the candidate frame and perform geometric verification; if it passes, the loop-closure frame is considered found

8. Code and practice

The loop-detector code for DBoW2 lives in DLoopDetector. To actually run it, you first need to resolve the dependency issues, and there may be some bugs to fix. One unbearable point: the default dictionary format of **DBoW2** is yml; when reading the dictionary file provided by DLoopDetector, it ran for several hours without finishing. In ORB_SLAM, the author saves the dictionary in txt format, which makes reading very fast, costing only seconds.

The author has packaged **DBoW2** and **DLoopDetector**, fixed the dependency issues and some bugs, so they can be used directly: Code link

Improvement points:

  • Read dictionary from txt
  • Fixed some bugs in DLoopDetector
  • Integrate DBoW2, DLib, and DLoopDetector into one project to avoid dependency issues and compilation issues
  • The entire project only needs to rely on OpenCV
  • On the basis of the demo_brief.cpp provided by DLoopDetector, demo_orb.cpp was added, and a dictionary is provided so the test can be run directly

running result:


Origin blog.csdn.net/lovely_yoshino/article/details/134873924