1. Bag of Words
The Bag of Words model is a commonly used document representation in the field of information retrieval. Put simply, a piece of text is composed of different words; if we ignore the order of these words, the text can be represented by a histogram of their frequencies. The words are the "Words", and the histogram is the "Bag of Words".
PS: This article provides out-of-the-box code for loop-closure detection practice; see the Code link at the end.
1.1 An example
(from the Wikipedia entry on **Bag of Words**):
Text 1: John likes to watch movies. Mary likes movies too
Text 2: John also likes to watch football games.
Dictionary: ["John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too"]
The Bag of Words of these two pieces of text are then:
Bag of Words of text 1: [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
Bag of Words of text 2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each number is the number of occurrences of the word at the corresponding dictionary position.
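As a concrete check, the two vectors above can be reproduced with a few lines (a minimal sketch; the tokenizer simply strips periods and splits on whitespace):

```python
# Build Bag of Words vectors for the two example texts, using the
# dictionary order listed above.
dictionary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

def bag_of_words(text, dictionary):
    # Strip punctuation, split on whitespace, then count occurrences
    # of each dictionary word.
    tokens = text.replace(".", "").split()
    return [tokens.count(word) for word in dictionary]

bow1 = bag_of_words("John likes to watch movies. Mary likes movies too", dictionary)
bow2 = bag_of_words("John also likes to watch football games.", dictionary)
```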
1.2 Too many words
If a corpus contains too many distinct words, the Bag of Words vector becomes very long and very sparse, and computing distances between such vectors becomes expensive. A common remedy is to cluster similar words together, so that the number of dimensions is reduced.
One of the simplest alternatives is hashing: hash each word into a fixed-length vector, so the number of dimensions is bounded by a fixed constant. The drawback is that similar words are hashed to different positions (and unrelated words may collide), so similar words can no longer be grouped together.
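The hashing trick can be sketched as follows (illustrative only; `hashed_bow` and the toy string hash are hypothetical helpers, not any library's API):

```python
# A minimal sketch of the hashing trick: map an unbounded vocabulary
# into a fixed-length vector. Note the collision problem described
# above: unrelated words may land in the same bucket.
def hashed_bow(tokens, num_buckets=8):
    vec = [0] * num_buckets
    for tok in tokens:
        # Simple deterministic string hash (illustrative; not Python's
        # built-in hash, which is salted per process).
        h = 0
        for ch in tok:
            h = (h * 31 + ord(ch)) % num_buckets
        vec[h] += 1
    return vec

v = hashed_bow("John likes to watch movies".split())
```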
2. Bag of Words for Images
The Bag of Words model is very commonly used in text retrieval, and it can also be used in image retrieval. But there is a big difference between images and text: text is naturally a discrete signal composed of distinct words, while an image is a two-dimensional matrix whose elements cannot directly form such a collection. The image Bag of Words model therefore requires some preprocessing before it can be used.
Put simply, a feature extraction algorithm extracts feature points from the image; these feature points then play the role of words, and the Bag of Words model applies as before.
2.1 Constructing the Image Bag of Words Vocabulary
Extracting feature points offline from a large amount of data yields a large number of feature points and feature descriptors. Performing K-means / K-means++ clustering on these descriptors produces a batch of cluster centers, called nodes. Because clustering averages descriptors, if the descriptors are binary vectors the cluster centers will contain floating-point values rather than 0s and 1s; in that case the center values must be binarized to obtain the nodes. These nodes are the vocabulary, the words, of the Bag of Words model.
The Bag of Words of an image is then the frequency histogram (or frequency vector) of the image's feature point set over these words.
Specifically: take a new picture and extract its feature points. For each feature point, compute its distance to every node, find the nearest node, and use the index of that node as the word of this feature point. This yields the set of words of the picture. Computing a histogram over this set gives the picture's Bag of Words.
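The quantization step just described can be sketched as follows (toy integers stand in for 256-bit ORB descriptors; `hamming` and `quantize` are illustrative helpers, not DBoW2 functions):

```python
# Quantize binary descriptors against cluster centers (nodes): each
# feature is assigned the index of its nearest node by Hamming
# distance; the histogram over these indices is the image's Bag of Words.
def hamming(a, b):
    return bin(a ^ b).count("1")

def quantize(descriptors, nodes):
    words = [min(range(len(nodes)), key=lambda i: hamming(d, nodes[i]))
             for d in descriptors]
    hist = [0] * len(nodes)
    for w in words:
        hist[w] += 1
    return hist

nodes = [0b00000000, 0b11111111]            # two toy cluster centers
feats = [0b00000001, 0b11101111, 0b00010000]
bow = quantize(feats, nodes)
```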
3. Word Weights
Different words carry different amounts of information. For example, a rare word such as "frequency" is far more discriminative than "the". Highly discriminative words should therefore be given greater weight, and vice versa. Below are some commonly used weight calculation methods.
Term Frequency: TF

$$w_i^{TF} = \frac{n_{id}}{n_d}$$

where
- $n_{id}$: number of occurrences of word $i$ in document $d$
- $n_d$: number of words in document $d$

Intuitive meaning: the frequency with which a certain feature appears in the current picture.
3.1 Inverse Document Frequency: IDF

$$w_i^{IDF} = \log\left(\frac{N}{N_i}\right)$$

where
- $N$: number of documents
- $N_i$: number of documents containing word $i$

Intuitive meaning: the (inverse of the) frequency with which a certain feature appears across the data set.
3.2 TF-IDF
This is the most commonly used weighting method, and it is also the default in **DBoW2**.

Term Frequency – Inverse Document Frequency:

$$w_i = \frac{n_{id}}{n_d}\log\left(\frac{N}{N_i}\right)$$

Intuitive meaning: words that appear frequently across the entire data set, such as "the" (or frequently occurring image features), have little discriminative power, so their weight should be reduced; but if a word appears frequently within one document, its weight for that document should grow.
With tf-idf weighting, features that appear rarely across the data set, i.e. the more recognizable features, receive greater weight.
The relevant weight definitions are explained in **README of the initial version of DBoW**
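As a quick numeric check of the formula above, here is a minimal sketch (the function name `tf_idf` is illustrative, not a DBoW2 API):

```python
import math

# TF-IDF as defined above: w_i = (n_id / n_d) * log(N / N_i).
def tf_idf(n_id, n_d, N, N_i):
    return (n_id / n_d) * math.log(N / N_i)

# A word that appears in every document gets weight 0 (log(N/N) = 0),
# matching the intuition that ubiquitous words carry no discrimination.
w_common = tf_idf(n_id=5, n_d=10, N=100, N_i=100)
w_rare = tf_idf(n_id=5, n_d=10, N=100, N_i=1)
```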
4. Creating the Dictionary and Vocabulary Tree
The so-called dictionary is the set of nodes obtained by clustering a large number of feature descriptors. These nodes are the vocabulary, the words, of the Bag of Words model; they represent the most representative image features.
In the traditional Bag of Words model the dictionary has no tree structure; it is just a collection of words (features). In **DBoW2**, however, the dictionary is organized as a tree, so that lookup can be accelerated through the tree structure. The tree is used only to speed up search: the set of all leaf nodes of the tree is the dictionary, and the leaf nodes are also called words.
4.1 Creating the vocabulary tree
The construction ultimately yields a vocabulary tree (figure omitted here).
The paper describes how to build the vocabulary tree as follows:
To build it, we extract a rich set of features from some training
images, independently of those processed online later. The descriptors
extracted are first discretized into kw binary clusters by performing
k-medians clustering with the k-means++ seeding [22]. The medians
that result in a non binary value are truncated to 0. These clusters
form the first level of nodes in the vocabulary tree. Subsequent levels
are created by repeating this operation with the descriptors associated
to each node, up to Lw times.
There is some ambiguity between the description in the paper and the figure. In the figure the leaf nodes are at level 0, but the text above describes construction top-down: node 0 is the root, and clustering proceeds downwards, so the leaf level should be level $L_w$. Here $L_w$ is the number of levels of the vocabulary tree, and $k_w$ is the number of children of each node.
In the code implementation, the actual construction order differs from the paper's description (figure omitted here).
4.2 Creating the dictionary
Creating the dictionary is simply a matter of traversing the vocabulary tree and collecting all of its leaf nodes.
4.3 Computing word weights (i.e. leaf node weights)
Take the default weighting method, TF-IDF, as an example. Its formula was given above. When creating the dictionary, only the IDF term can be computed; the TF term can only be computed later, when the dictionary is used to build a Bag of Words vector.
Recall that in the IDF term, $N$ is the total number of pictures and $N_i$ is the number of pictures containing word $i$.
Therefore, when computing IDF, the vocabulary tree built in the first step is also used, to count for each leaf node the number of pictures that reach it. The computation can proceed as follows:
for each image:
    counted = empty set
    for each feature in image:
        word_id = tree.find(feature)
        if word_id not in counted:
            counted.add(word_id)
            word_Ni[word_id] += 1
for each word_id:
    word_idf[word_id] = log(N / word_Ni[word_id])
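A runnable version of the pseudocode above, assuming each image has already been reduced to a list of word ids (the output of the per-feature tree lookup):

```python
import math

# Count, per word, the number of images containing it (N_i), then
# idf = log(N / N_i). Each word is counted at most once per image.
def compute_idf(images_words, vocab_size):
    N = len(images_words)
    word_Ni = [0] * vocab_size
    for words in images_words:
        for word_id in set(words):      # count each word once per image
            word_Ni[word_id] += 1
    return [math.log(N / n) if n > 0 else 0.0 for n in word_Ni]

idf = compute_idf([[0, 0, 1], [0, 2], [1, 2]], vocab_size=3)
```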
4.4 Saving the dictionary and vocabulary tree
**DBoW2** saves the dictionary and vocabulary directly in a YAML structure, using cv::FileStorage for reading and writing, which can also write compressed files directly. The generated dictionary / vocabulary tree file looks like this:
%YAML:1.0
---
vocabulary:
   k: 9
   L: 3
   scoringType: 0
   weightingType: 0
   nodes:
      - { nodeId:1, parentId:0, weight:0.,
          descriptor:"114 237 18 190 93 135 214 232 143 132 232 11 110 202 37 208 248 251 235 227 78 3 218 255 179 244 143 59 17 63 47 142 " }
      - { nodeId:2, parentId:0, weight:0.,
          descriptor:"190 116 236 103 236 127 255 216 123 238 247 191 91 125 223 59 255 221 247 231 173 246 255 138 127 251 229 181 231 253 94 243 " }
      - ...
      - { nodeId:9, parentId:0, weight:0.,
          descriptor:"208 188 159 182 168 236 17 7 190 159 130 25 247 183 64 91 161 119 109 16 58 77 147 35 217 227 126 89 128 64 135 43 " }
      - { nodeId:717, parentId:9, weight:0.,
          descriptor:"208 62 159 183 170 106 52 55 190 191 130 245 247 183 192 91 167 119 108 0 57 205 131 2 200 163 127 144 168 112 135 27 " }
      - { nodeId:718, parentId:9, weight:0.,
          descriptor:"208 189 31 254 43 204 85 103 185 19 226 56 190 183 32 115 162 247 108 0 105 203 146 111 209 243 86 27 0 16 167 43 " }
      - ... (all remaining nodes omitted)
   words:
      - { wordId:0, nodeId:19 }
      - { wordId:1, nodeId:20 }
      - { wordId:2, nodeId:21 }
      - { wordId:3, nodeId:22 }
      - { wordId:4, nodeId:23 }
      - ... (all remaining words omitted)
A few points worth noting:
- Because of the traversal order used when creating the vocabulary tree (it is in fact neither breadth-first nor depth-first), the children of the last child of the root node (nodeId:9 above) have particularly large ids (717, 718, ...). Node ids follow the construction order, not the structure of the tree.
- In **DBoW2** the feature descriptor type (using ORB features as the example) is a cv::Mat with type = CV_8UC1. One ORB feature has 256 bits in total, so it is stored as 32 unsigned char values. That is why the descriptor field above is a string of 32 numbers, each between 0 and 255.
- Only leaf nodes (i.e. words) have weights; all other nodes have weight 0.
5. Adding Pictures to the Database
Adding a picture to the database means computing the Bag of Words for the newly added picture at run time, while also building the so-called Direct index and Inverted index.
The Inverted index stores, for each word, the indices of the pictures that contain it. When a new picture comes in and its Bag of Words is computed, the indices of all other pictures sharing each of its words can be obtained immediately, which greatly speeds up search.
The Direct index stores, for each image, the indices of the feature points contained in each of its words. It is mainly used to accelerate feature point matching: when a frame $I_t$ comes in and its corresponding picture $I_{t'}$ is obtained by some means (previous frame, or loop-closure detection), the Direct index can be used to find the words shared by the two frames, and matching feature points only within shared words is much faster than computing distances between all pairs of feature points.
5.1 Computing the Bag of Words from a feature set
A more precise description is: converting a feature set into a Bag of Words. First a feature point detector extracts the feature points, then a descriptor is computed for each of them, and the Bag of Words is computed over these feature descriptors.
Each feature is pushed down the vocabulary tree to find its final node (its word), and the vote of that node is increased by 1. Doing this for all features yields a histogram over the words: this histogram is the Bag of Words.
The feature set here means the descriptors of all feature points in one picture.
The specific process is quite simple:
- For each feature point, find its corresponding node, which also gives the word of this feature point.
- Look up the weight of that word and add it to the corresponding position of the BoW vector.
- After traversing all feature points, normalize the BoW vector. The normalized vector is the picture's Bag of Words.
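The three steps above can be sketched as follows (`compute_bow` is an illustrative helper, not the DBoW2 API; the weights play the role of the per-word IDF):

```python
# Accumulate each feature's word weight into the BoW vector, then
# L1-normalize so scores are comparable across images.
def compute_bow(feature_words, word_weights, vocab_size):
    v = [0.0] * vocab_size
    for w in feature_words:              # word id of each feature point
        v[w] += word_weights[w]          # accumulate the word's (IDF) weight
    norm = sum(abs(x) for x in v)        # L1 norm
    return [x / norm for x in v] if norm > 0 else v

v = compute_bow([0, 0, 2], word_weights=[1.0, 1.0, 2.0], vocab_size=3)
```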
5.2 How the TF-IDF weight is computed in the implementation
The IDF computation was explained above: the IDF of each word is fixed once the dictionary is built. The TF term, i.e. the frequency of a word within a given picture, is never computed explicitly. Instead, while building the BoW vector, the IDF weight of each feature's word is simply accumulated; the subsequent normalization of the BoW vector then makes the weight of each word in the vector effectively TF-IDF.
5.3 Database storage structure
The storage structure of the database adds a database field after the dictionary/vocabulary structure:
%YAML:1.0
---
vocabulary:
   k: 9
   L: 3
   scoringType: 0
   weightingType: 0
   nodes:...
   words:...
database:
   nEntries: 4
   usingDI: 1
   diLevels: 0
   invertedIndex:
      -
         - { imageId:0, weight:1.5644960771493644e-03 }
         - { imageId:2, weight:2.0755306357238515e-03 }
      -
         - { imageId:3, weight:4.0272643792627289e-03 }
      -
         - { imageId:0, weight:1.5644960771493644e-03 }
         - { imageId:2, weight:4.1510612714477030e-03 }
      -
         - { imageId:0, weight:1.5644960771493644e-03 }
         - { imageId:1, weight:3.7071104700705723e-03 }
      - ...
   directIndex:
      - # image 1
         -
            nodeId: 19
            features:
               - [ 20, 40, 256, 300, ..., feature_id ]
         -
            nodeId: 21
            features:
               - [ 34 ]
         -
            nodeId: 22
            features:
               - [ 207 ]
      - # image 2
         -
            nodeId: 19
            features:
               - [ 256 ]
         - ...
      - # image 3
         -
            nodeId: 19
            features:
               - [ 256 ]
         - ...
      - # image 4
         -
            nodeId: 19
            features:
               - [ 256 ]
         - ...
6. Using the Database to Retrieve Similar Images and Match Feature Points
6.1 Image retrieval
The process breaks down into two steps:
- Compute the Bag of Words of the query image to get its BoW vector, denoted $v_t$ (the length of $v_t$ is the vocabulary size, i.e. the number of leaf nodes of the vocabulary tree).
- Retrieve from the database the image closest to this BoW vector.
In practice, thanks to the inverted index, for each word in $v_t$ the indices of the images containing that word can be obtained directly. Those pictures are then fetched from the database, the distance between each of them and $v_t$ is computed, and the nearest picture is selected. The actual computation steps are as follows:
image_value: map<ImageId, RelevantCoefficient>
result: map<ImageId, RelevantCoefficient>
for each word in v_t:
    for each image in invertedIndex[word]:
        coeff = fabs(word.weight - image.weight) - fabs(word.weight) - fabs(image.weight)
        image_value[image.imageId] += coeff
sort image_value by value (ascending)   // more negative = more similar
select top N images from image_value -> result
for each image in result:
    image.coeff = -image_value[image.imageId] * 0.5
The only slightly strange thing here is the weight computation. In theory, the distance between BoW vectors is computed as follows (taking the L1 distance as an example):

$$d(v_t, v_q) = ||v_t - v_q||_1 = \sum_{i=1}^{N} |v_t[i] - v_q[i]|$$

A score computed this way lies between 0 and 2, with 0 the best and 2 the worst. But **DBoW2** wants the final score in [0, 1], with 0 the worst and 1 the best, so it applies some transformations to the distance above.
First, the L1 distance can be rewritten as follows (the premise of this conversion is that $v_t$ and $v_q$ are both normalized):

$$d(v_t, v_q) = ||v_t - v_q||_1 = 2 + \sum_{i=1}^{N} \left( |v_t[i] - v_q[i]| - |v_t[i]| - |v_q[i]| \right)$$

The distance computation is then split into two steps:

$$d^1(v_t, v_q) = \sum_{i=1}^{N} |v_t[i] - v_q[i]| - |v_t[i]| - |v_q[i]|, \quad [-2\ best\ ...\ 0\ worst]$$
$$s(v_t, v_q) = -0.5 \cdot d^1(v_t, v_q), \quad [0\ worst\ ...\ 1\ best]$$
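The two ranges can be verified numerically for L1-normalized vectors (a minimal sketch, not the DBoW2 implementation):

```python
# Transformed L1 score: for L1-normalized vectors,
# s(v, w) = -0.5 * sum(|v_i - w_i| - |v_i| - |w_i|) lies in [0, 1],
# with 1 for identical vectors and 0 for disjoint support.
def score(v, w):
    return -0.5 * sum(abs(a - b) - abs(a) - abs(b) for a, b in zip(v, w))

s_same = score([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])      # identical vectors
s_disjoint = score([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])  # no shared words
```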
6.2 Feature point matching
For pictures $I_q$ and $I_d$ (q for query, d for destination), with feature sets $F_q = \{f_i, i \in N_q\}$ and $F_d = \{f_j, j \in N_d\}$, where $N_q$ and $N_d$ are the numbers of feature points in the two pictures and $f$ is a feature descriptor (for ORB features, $f \in \{0,1\}^{256}$), there are, regardless of the descriptor type, roughly two traditional ways of matching feature points:
- Exhaustive: for each $f_i$, compute its distance to every $f_j$ in $F_d$ and find the nearest $f_j$; if this distance is below a threshold, the two feature points are considered a match. The disadvantage is the amount of computation, since distances to all feature points in $F_d$ must be evaluated.
- FLANN: first build a KD-tree over $F_d$, then for each $f_i$ search the KD-tree for the nearest feature point. Generally, FLANN (Fast Library for Approximate Nearest Neighbors) is used to speed up KD-tree construction and search. The paper's authors also experimented with this method; in practice, because of the extra time spent building the KD-tree, the overall efficiency of the FLANN approach turns out not to be higher.
In **DBoW2**, to speed up the search step of feature point matching, when an image is added to the database the set of feature points corresponding to each of its words is additionally stored: the so-called direct index. When matching the feature points of two images, it is then enough to walk through the words of the two images, and for each word present in both images, exhaustively match only the feature points within that word. Since each word contains very few features, feature point matching becomes much faster.
What is worth studying here is the design of the data structures in the code; the main structures **DBoW2** uses to store a picture are shown in the source.
In **DBoW2**, feature point matching is mainly used after a candidate picture with the highest score has been retrieved via the BoW vector (or obtained through some other means): the candidate is then matched against the query at the feature point level. From the matched feature point sets, the fundamental matrix between the two images can be computed (Fundamental Matrix); its computation will be explained in detail later. This section focuses on the feature point matching procedure.
6.3 Feature point matching procedure
6.4 Input
- Pictures $I_1, I_2$: the two pictures whose feature points are to be matched
- $F_1, F_2$: the feature sets of the two pictures, $F_1 = \{f_i, i \in N_1\}$, $F_2 = \{f_j, j \in N_2\}$
- $D_1, D_2$: the direct indexes of the two pictures, storing the feature ids contained in each word
6.5 Output
- $M$: the set of matched feature point pairs, $M = \{m_k, k \in N_m\}$, $m_k = (i, j)$, where $i$ is a feature id in picture 1, $j$ is a feature id in picture 2, and $N_m$ is the number of matches
6.6 Procedure
for (d1, d2) in shared nodes of (D1, D2):
    for f1_id in d1.featureIds:
        best_dist_1 = DOUBLE_MAX
        best_dist_2 = DOUBLE_MAX
        best_f2_id = -1
        for f2_id in d2.featureIds:
            dist = distance(F1[f1_id], F2[f2_id])  // distance between (binary) descriptors
            if dist < best_dist_1:
                best_dist_2 = best_dist_1
                best_dist_1 = dist
                best_f2_id = f2_id
            else if dist < best_dist_2:
                best_dist_2 = dist
        if best_dist_1 < best_dist_2 * THRESHOLD:  // nearest must beat second nearest by a ratio
            if f1_id not in M:
                M.push_back((f1_id, best_f2_id))
            else:
                compare the distances of (F1[f1_id], F2[M[f1_id]]) and (F1[f1_id], F2[best_f2_id])
                and keep the smaller one
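A runnable sketch of the inner loop above, using toy integer descriptors and Hamming distance (the duplicate-match resolution branch is omitted for brevity; `THRESHOLD` = 0.6 is an assumed value, not necessarily DBoW2's default):

```python
THRESHOLD = 0.6  # ratio-test threshold (assumed value)

def hamming(a, b):
    return bin(a ^ b).count("1")

# Match features of one shared node: for each feature in image 1, find
# the nearest and second-nearest descriptor in image 2 and keep the
# match only if the ratio test passes.
def match_node(f1_ids, f2_ids, F1, F2, matches):
    for i in f1_ids:
        best, second, best_j = float("inf"), float("inf"), -1
        for j in f2_ids:
            d = hamming(F1[i], F2[j])
            if d < best:
                second, best, best_j = best, d, j
            elif d < second:
                second = d
        if best < second * THRESHOLD:
            matches.append((i, best_j))

F1 = [0b0000, 0b1111]
F2 = [0b0001, 0b1110, 0b1001]
M = []
match_node([0, 1], [0, 1, 2], F1, F2, M)
```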
7. Loop-Closure Detection with DBoW2
Loop-closure detection is implemented in **DLoopDetector**, another repository by the author of **DBoW2**. **DLoopDetector** uses the **DBoW2** library to build the Bag of Words, uses the Direct Index to accelerate feature point matching, and uses the matched feature point pairs to solve for the fundamental matrix, checking whether the retrieved image satisfies the physical constraints: the solved fundamental matrix should make the feature point reprojection error very small. This step is called geometric verification.
Loop-closure detection essentially means finding a similar image, but the precision must be close to 100%: a wrong loop closure corrupts the entire trajectory. The recall does not need to be as high; simply put, missing a loop is acceptable, but a detected loop must be correct. To achieve such high precision, loop-closure detection applies some special processing to screen candidate loop-closure frames:
7.1 Score normalization
The score between the current frame and the previous frame is used as the denominator to normalize the scores of candidate frames. The reason is that the raw BoW vector distance is strongly affected by many conditions; a normalized score makes it easy to set a unified threshold. In practice, the previous frame's BoW vector is cached so that the matching score between the current frame and the previous frame can be computed. The threshold is generally not large; the default is 0.3.
7.2 Candidate frame aggregation
A batch of retrieved candidate frames is "aggregated" by frame index; in the paper this is called an "island". That is, frames with nearby indices are grouped together: when the index gap between two frames is below a threshold, they are considered to belong to the same island, and when an island contains enough frames it is considered a qualified candidate island. The score of each island is the sum of the scores of all its frames. The island with the largest score is selected as the final matching island, and the highest-scoring frame within that island is the matching frame.
7.3 Temporal consistency
Check the temporal continuity of the island against the candidate island of the previous frame: the current matching result $v_t, V_{T'}$ and the previous frame's matching result $v_{t-\Delta t}, V_{T_1}$ should satisfy that $V_{T'}$ is very close to, or overlaps with, $V_{T_1}$.
7.4 Computation steps
- Compute the BoW vector of the current frame, denoted $v_t$.
- Retrieve from the database the batch of pictures closest to $v_t$, denoted $V$.
- Compute the normalized scores between $v_t$ and all candidate frames in $V$, and remove frames below the threshold, giving $V'$.
- Aggregate the frames in $V'$ by frame index and find the island with the highest score, denoted $V'_{best}$.
  - PS: in the code, the temporal consistency check does not seem to actually play a role.
- Take the highest-scoring frame in the island as the candidate frame and perform geometric verification. If it passes, the loop-closure frame is considered found.
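The island aggregation in the steps above can be sketched as follows (`best_island` and the `gap` parameter are illustrative assumptions, not the DLoopDetector API):

```python
# Group candidate frames whose ids differ by at most `gap` into
# islands, score each island by the sum of its frame scores, and
# return the best island plus its best-scoring frame.
def best_island(candidates, gap=3):
    # candidates: list of (frame_id, score), any order
    candidates = sorted(candidates)
    islands, current = [], [candidates[0]]
    for frame in candidates[1:]:
        if frame[0] - current[-1][0] <= gap:
            current.append(frame)
        else:
            islands.append(current)
            current = [frame]
    islands.append(current)
    best = max(islands, key=lambda isl: sum(s for _, s in isl))
    match = max(best, key=lambda f: f[1])   # best-scoring frame in island
    return best, match

island, match = best_island([(10, 0.4), (11, 0.5), (12, 0.3), (40, 0.6)])
```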
8. Code and Practical Use
The loop-closure detector code for DBoW2 lives in **DLoopDetector**. To actually run it you first need to resolve the dependency issues, and there may be some bugs to fix. One unbearable point is that the default dictionary format of **DBoW2** is yml: reading the dictionary file provided with **DLoopDetector** took several hours without finishing. In **ORB_SLAM**, the author instead saves the dictionary in txt format, which is very fast to read, costing only seconds.
The author of this article has packaged **DBoW2** and **DLoopDetector**, fixing the dependency issues and some bugs, so it can be used directly: see the Code link.
Improvements:
- Read the dictionary from txt
- Fixed some bugs in DLoopDetector
- Integrated DBoW2, DLib, and DLoopDetector into one project to avoid dependency and compilation issues
- The whole project only depends on OpenCV
- Added demo_orb.cpp alongside the demo_brief.cpp provided by DLoopDetector, with a dictionary provided so the test can be run directly
Running result: