A Detailed Explanation of the OpenCV SIFT Algorithm

Introduction

The SIFT algorithm solves the image-matching problem: it extracts features that are robust to changes in image scale and rotation, so that two images of the same scene can be matched. The intuition is straightforward: when the human eye judges whether two pictures match, it focuses on distinctive regions (feature points). If we can automate this extraction of feature point regions and then describe each region, matching follows naturally. The problem then breaks down into the following sub-problems:

1. What kind of points should be selected as feature points? The human eye is more sensitive to high-frequency regions of an image, so we should look for sharply changing edges or corners; SIFT detects corners. (The intuition: we extract isolated points for matching, while an edge consists of many similar points and is therefore ambiguous.)
2. How do we make the selected feature points scale-invariant? Use a Gaussian pyramid to obtain variants of the image at different scales, extract feature points from each variant, and scale them back to the original size, giving feature points at every scale of the image.
3. How do we make the selected feature points rotation-invariant? Rotate each feature point region to its main orientation before describing it.
4. How do we describe a feature point region? Use a histogram of gradient magnitudes over the orientations within the region, similar to the HOG descriptor.

The SIFT algorithm answers each of these questions with a very interesting trick; they are introduced one by one below.
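Before diving into the internals, here is a minimal end-to-end sketch of the pipeline using OpenCV's built-in SIFT (available as `cv2.SIFT_create` in OpenCV >= 4.4); the file names and the 0.75 ratio-test threshold are illustrative placeholders, not values from this article:

```python
import cv2

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                       # OpenCV >= 4.4
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors with a nearest-neighbour ratio test.
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches.jpg", vis)
```

The rest of the article unpacks what happens inside `detectAndCompute`.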

1. Gaussian Pyramid

The purpose of the Gaussian pyramid was stated in the introduction: to make feature extraction robust to image scaling and scale changes. There are two intuitions behind it: (1) viewing an object from farther away can be simulated by down-sampling (Pyramid -> octaves); (2) nearby objects look sharp while distant ones look blurry, which can be simulated by Gaussian smoothing (Gaussian -> layers). For the detailed derivation, see my other blog post: OpenCV study notes (6) Image pyramid.
In SIFT, the number of octaves and the number of layers per octave are set as follows:

Number of octaves: $O = [\log_2 \min(M, N)] - 3$

Number of layers per octave: $S = n + 3$

The octave count is an empirical value from the original SIFT paper; in theory any $O \leq [\log_2 \min(M, N)]$ would do. The layer count, however, has a theoretical basis. Here $n$ is the number of layers per octave from which we want to extract feature points. After building the Gaussian pyramid, adjacent layers are subtracted to obtain the Difference-of-Gaussian pyramid (DoG), so the Gaussian pyramid needs one more layer than the DoG pyramid; and since extrema are detected across scale, comparing each layer with the layers directly above and below, the DoG pyramid needs 2 more layers than the feature layers. Hence $S = n + 3$. A small sketch of this bookkeeping follows.
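A minimal sketch of the octave/layer arithmetic, with a hypothetical helper name `pyramid_shape` and Lowe's usual $n = 3$ assumed:

```python
import math

def pyramid_shape(M, N, n=3):
    O = int(math.log2(min(M, N))) - 3  # number of octaves (empirical "-3")
    S = n + 3                          # Gaussian layers per octave
    return O, S

print(pyramid_shape(512, 384))  # -> (5, 6): floor(log2(384)) = 8, 8 - 3 = 5
```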

Note: a few points about the Gaussian pyramid in SIFT deserve elaboration.

The "scale" in SIFT differs from the usual notion of image size: it is a continuously varying parameter, and the $\sigma$ of the Gaussian pyramid is exactly this scale-space parameter. More specifically, the Gaussian kernel is the only linear kernel that can generate a continuous multi-scale representation, which is why the Gaussian pyramid is used. This touches on scale-space theory, which is worth reading up on separately.
The Image Pyramid post mentioned two basic parameters: the inter-layer scale factor $k$ and the initial scale $\sigma_0$. In the original algorithm the scale of layer $r$ in octave $o$ is $\sigma(o, r) = 2^{o + \frac{r}{n}} \sigma_0$, with $r = 0, 1, \ldots, n+2$ and $k = 2^{\frac{1}{n}}$. Because the first layer of each octave is obtained by down-sampling the third-from-last layer of the previous octave, the starting scales of the octaves are guaranteed to be $\sigma_0, 2\sigma_0, 4\sigma_0, \ldots$ In addition, $\sigma_0 = 1.6$; but since the camera has already applied a scale transformation of roughly $\sigma' = 0.5$ when the picture was taken, the blur actually applied to the input image is $\sqrt{1.6^2 - \sigma'^2} = \sqrt{1.6^2 - 0.5^2} \approx 1.52$.

The origin of this formula: Gaussian filtering is the convolution of a Gaussian kernel with the image, $f(x) \otimes G'(x)$, and cascading two Gaussian filters is equivalent to $f(x) \otimes (G'(x) \otimes G(x))$. The convolution of two Gaussians is again a Gaussian, whose variance is the sum of the two variances; this can be proved via time-domain convolution or the Fourier transform, see the reference at the end.
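As an illustration of how this cascade property is exploited in practice, here is a sketch that builds the Gaussian layers of one octave with incremental blurs, instead of blurring the base image with each layer's full $\sigma$. `build_octave` is a hypothetical helper; the input is assumed to be a grayscale float image:

```python
import cv2
import numpy as np

def build_octave(base, n=3, sigma0=1.6):
    k = 2.0 ** (1.0 / n)
    layers = [base]
    for r in range(1, n + 3):
        sigma_prev = sigma0 * k ** (r - 1)
        sigma_curr = sigma0 * k ** r
        # cascade property: blur(s1) then blur(s2) == blur(sqrt(s1^2 + s2^2))
        sigma_inc = np.sqrt(sigma_curr ** 2 - sigma_prev ** 2)
        layers.append(cv2.GaussianBlur(layers[-1], (0, 0), sigma_inc))
    return layers

# The very first blur of the whole pyramid compensates for the camera's
# assumed sigma' = 0.5: sqrt(1.6^2 - 0.5^2) ~= 1.52, applied once to the input.
```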

2. Difference-of-Gaussian (DoG) Pyramid

The Gaussian pyramid gives us images at different scales. The next question is how to find the high-frequency regions. A simple idea, borrowed from edge detection, is to slide a differential filter (Laplacian, Sobel, etc.) over the image and look for regions where the gray value changes sharply. Previous studies showed that the maxima and minima of the scale-normalized Laplacian of Gaussian yield the most stable image features compared with other feature-extraction functions, so we would like to apply the scale-normalized Laplacian of Gaussian to the multi-scale images. Doing so directly would be very expensive, but the operator is related to the DoG function as follows:

$$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 G$$

The proof is as follows.
Neglecting the Gaussian normalization coefficient:

$$G(x, y, \sigma) = \frac{1}{\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

$$\frac{\partial G}{\partial x} = -\frac{x}{\sigma^4} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right), \qquad \frac{\partial^2 G}{\partial x^2} = \frac{x^2 - \sigma^2}{\sigma^6} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

$$\nabla^2 G(x, y) = \frac{\partial^2 G}{\partial x^2} + \frac{\partial^2 G}{\partial y^2} = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^6} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

$$\frac{\partial G}{\partial \sigma} = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^5} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \;\Rightarrow\; \sigma \nabla^2 G = \frac{\partial G}{\partial \sigma}$$

For the difference-of-Gaussian pyramid:

$$\frac{G(x, y, k\sigma) - G(x, y, \sigma)}{(k - 1)\sigma} \approx \frac{\partial G}{\partial \sigma} \;\Rightarrow\; \mathrm{DoG} \approx (k - 1)\,\sigma^2 \nabla^2 G$$

So we no longer need to run a convolution over the Gaussian pyramid: it suffices to take the difference between adjacent layers along the scale $\sigma$, and the resulting DoG pyramid already performs the feature-extraction filtering.
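In code, the DoG pyramid is then nothing more than pairwise differences of adjacent Gaussian layers; a sketch, reusing the `build_octave` output from the earlier snippet:

```python
import numpy as np

def build_dog(layers):
    # One DoG layer per adjacent pair of Gaussian layers (hence S - 1 of them).
    return [layers[i + 1].astype(np.float32) - layers[i].astype(np.float32)
            for i in range(len(layers) - 1)]
```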

3. Feature point processing

Having obtained the DoG pyramid, we have our candidate feature responses in theory, but they still need refinement, much like the non-maximum suppression used in Canny edge detection, to remove the weaker candidates.
1. Thresholding

Simple thresholding removes points whose response is weak; such points are likely caused by noise, so this step can be seen as an extra layer of denoising on top of the Laplacian-of-Gaussian filtering.
$$\mathrm{val} = \begin{cases} \mathrm{val} & |\mathrm{val}| > 0.5\,\frac{T}{n} \\ 0 & \text{otherwise} \end{cases}$$

Here $T$ is an empirical value of 0.04 and $n$ is the number of feature-extraction layers mentioned earlier. A one-line sketch follows.
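A minimal sketch of this thresholding on a float DoG layer; `threshold_dog` is a hypothetical helper, with $T$ and $n$ as defined above:

```python
import numpy as np

def threshold_dog(dog, T=0.04, n=3):
    out = dog.copy()
    out[np.abs(out) <= 0.5 * T / n] = 0.0  # zero out weak (likely noisy) responses
    return out
```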
2. Non-maximum suppression

The idea of non-maximum suppression is the same as in other algorithms: a candidate must be an extremum within its neighborhood. The difference is that other algorithms only require an extremum in the two-dimensional image plane, while here the candidate must also be an extremum along the scale axis $\sigma$, i.e. within a $3 \times 3 \times 3$ cube spanning the layers above and below. This is also why the DoG pyramid carries 2 more layers than the feature layers.
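A sketch of the $3 \times 3 \times 3$ extremum test; `dog` is assumed to be a list of same-sized float arrays (one octave's DoG layers), and `s, y, x` are assumed to lie at least one step away from every border:

```python
import numpy as np

def is_extremum(dog, s, y, x):
    # 26 neighbours + centre across scale (s), row (y), column (x)
    cube = np.stack([dog[s - 1][y - 1:y + 2, x - 1:x + 2],
                     dog[s    ][y - 1:y + 2, x - 1:x + 2],
                     dog[s + 1][y - 1:y + 2, x - 1:x + 2]])
    center = dog[s][y, x]
    # keep the point only if no neighbour exceeds it (maximum)
    # or undercuts it (minimum)
    return center == cube.max() or center == cube.min()
```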
3. Second-order Taylor correction

Since the image only takes discrete values along $x, y, \sigma$, the feature points found by the first two steps are not accurate enough. We therefore fit a second-order Taylor expansion around each candidate so that the corrected feature point can land at a sub-pixel (and sub-scale) position.

$$f(X) = f(X_0) + \frac{\partial f^T}{\partial X}\hat{X} + \frac{1}{2}\hat{X}^T \frac{\partial^2 f}{\partial X^2}\hat{X}$$
This formula approximates the function $f(X)$ near the feature point $X_0$, where $X = (x, y, \sigma)^T$ and $\hat{X} = X - X_0$. Setting the first derivative of this expression to zero gives the offset between the true extremum and $X_0$, correcting the discrete feature point to sub-pixel (sub-scale) accuracy:
$$f'(X) = \frac{\partial f^T}{\partial X} + \frac{\partial^2 f}{\partial X^2}\hat{X} = 0$$

$$\hat{X}_{ex} = -\left(\frac{\partial^2 f}{\partial X^2}\right)^{-1} \frac{\partial f}{\partial X}$$

Substituting back into the expansion gives the response value at the corrected feature point:

$$f(X') = f(X_0) + \frac{1}{2}\frac{\partial f^T}{\partial X}\hat{X}_{ex}$$

Note that this is an iterative process: the second-order Taylor expansion around the current feature point yields a new feature point, and the procedure repeats until a termination condition is met, e.g. the offset $\hat{X}$ becomes small enough. Also, if the solution moves too far from the discrete extremum it must be discarded, since the second-order Taylor fit is only valid in a small neighborhood.

Note: an interesting point is that the purpose of iterating here differs from the iteration in gradient descent. In gradient descent we follow the gradient at each point because we cannot solve for the zeros of the first derivative of a complex nonlinear network in closed form, and even if we could, we might land in a local extremum. Here we solve for the first-order zero directly; we iterate only because the function itself is an approximation (the Taylor expansion stops at the second-order term), so repeated fitting is needed to approach the true extremum. The two iterations serve different purposes.
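Below is a finite-difference sketch of one refinement step, following the formulas above. `refine_step` is a hypothetical helper; a full implementation would iterate it (typically up to about 5 times) and discard the point if the offset components stay above 0.5:

```python
import numpy as np

def refine_step(dog, s, y, x):
    D = lambda ds, dy, dx: dog[s + ds][y + dy, x + dx]
    # first derivatives w.r.t. (x, y, sigma), central differences
    grad = 0.5 * np.array([D(0, 0, 1) - D(0, 0, -1),
                           D(0, 1, 0) - D(0, -1, 0),
                           D(1, 0, 0) - D(-1, 0, 0)])
    # second derivatives -> 3x3 Hessian in (x, y, sigma)
    dxx = D(0, 0, 1) + D(0, 0, -1) - 2 * D(0, 0, 0)
    dyy = D(0, 1, 0) + D(0, -1, 0) - 2 * D(0, 0, 0)
    dss = D(1, 0, 0) + D(-1, 0, 0) - 2 * D(0, 0, 0)
    dxy = 0.25 * (D(0, 1, 1) - D(0, 1, -1) - D(0, -1, 1) + D(0, -1, -1))
    dxs = 0.25 * (D(1, 0, 1) - D(1, 0, -1) - D(-1, 0, 1) + D(-1, 0, -1))
    dys = 0.25 * (D(1, 1, 0) - D(1, -1, 0) - D(-1, 1, 0) + D(-1, -1, 0))
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])
    offset = -np.linalg.solve(H, grad)            # X_ex = -H^{-1} * grad
    value = D(0, 0, 0) + 0.5 * grad.dot(offset)   # corrected DoG response
    return offset, value
```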
4. Low contrast removal

The purpose is similar to the earlier thresholding step: remove feature points whose response, after correction, is still too weak. The requirement is:
$$|f(X')| \geq \frac{T}{n}$$

5. Edge response removal

As mentioned in the introduction, the feature points we want are corners rather than edges. The measures so far only guarantee points with sharp gray-value changes, and edge points satisfy that too, so we remove edge points as follows.

Compute the Hessian matrix:

$$H(x, y) = \begin{bmatrix} D_{xx}(x, y) & D_{xy}(x, y) \\ D_{yx}(x, y) & D_{yy}(x, y) \end{bmatrix}$$

If the determinant $Det(H) < 0$, discard the feature point.

If the determinant and trace do not satisfy $\frac{Tr(H)^2}{Det(H)} < \frac{(\gamma_0 + 1)^2}{\gamma_0}$, discard the feature point. Here $\gamma_0$ is an empirical value with a concrete meaning (the maximum allowed ratio between the two principal curvatures), usually set to 10.

Now for the reasoning behind these steps. The difference between a corner and an edge is that an edge appears as a line in the image: the frequency perpendicular to the line is high, while the frequency along the line is low. A corner, in contrast, has strong high-frequency components in multiple (two or more) directions. The Hessian matrix is composed of the second-order partial derivatives of the function and reflects how its curvature varies. As a quadratic form, it has the following properties:

1. If the quadratic-form matrix $H$ has eigenvalues $\alpha, \beta$, then $Det(H) = \alpha\beta$ and $Tr(H) = \alpha + \beta$;
2. When the eigenvalues of a real quadratic-form matrix have opposite signs, the matrix is indefinite; when the Hessian at a critical point is indefinite, that point is not an extremum;
3. The eigenvalues of the Hessian measure how fast the function changes along the corresponding eigenvector directions.

From properties 1 and 2: when $Det(H) < 0$ the eigenvalues have opposite signs, so the candidate is not an extremum and is discarded. From properties 1 and 3: $\frac{Tr(H)^2}{Det(H)} = \frac{(\alpha + \beta)^2}{\alpha\beta}$. Writing the ratio of the two eigenvalues as $\gamma = \alpha / \beta$, this equals $\frac{(\gamma + 1)^2}{\gamma}$, which increases monotonically for $\gamma > 1$. So $\frac{Tr(H)^2}{Det(H)}$ measures how lopsided the two eigenvalues are: when it is too large, the function changes much faster in one direction than in the other at that point, which is edge-like, and the point is discarded. A sketch of this test follows.
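A sketch of the edge test on a single DoG layer, using central finite differences for the second derivatives and $\gamma_0 = 10$ as above; `passes_edge_test` is a hypothetical helper, and `y, x` are assumed to be interior pixels:

```python
def passes_edge_test(dog_layer, y, x, gamma0=10.0):
    d = dog_layer
    dxx = d[y, x + 1] + d[y, x - 1] - 2 * d[y, x]
    dyy = d[y + 1, x] + d[y - 1, x] - 2 * d[y, x]
    dxy = 0.25 * (d[y + 1, x + 1] - d[y + 1, x - 1]
                  - d[y - 1, x + 1] + d[y - 1, x - 1])
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:            # opposite-sign curvatures: not an extremum, discard
        return False
    # corner-like iff the curvature ratio is bounded by gamma0
    return tr * tr / det < (gamma0 + 1) ** 2 / gamma0
```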

4. Feature point descriptor

Through the previous steps we have obtained stable feature points across scales; next we need to describe them.

1. Determining the main orientation of the feature point region

As mentioned in the introduction, to make the feature points rotation-invariant we rotate each feature point region to a canonical direction, the region's main orientation, so we must determine that orientation first.
The calculation: on the Gaussian-pyramid layer whose scale $\sigma_{oct}$ is closest to the feature point's scale $\sigma^*$, collect the gradient magnitude and orientation of every pixel within a radius of $3 \times 1.5\,\sigma_{oct}$ around the feature point, weighting each magnitude by distance with a Gaussian kernel of $\sigma = 1.5\,\sigma_{oct}$. Having obtained the set of pairs $\{amp, ang\}$, divide the $360°$ of direction into bins and accumulate each $amp$ into the bin containing its $ang$; if $ang$ lies between two bins, split $amp$ between them according to distance (this is similar to the HOG algorithm, which you can look up). The resulting magnitude-orientation histogram over the region gives the main orientation as the direction with the largest accumulated magnitude.
Note: the parameters here, such as the radius of the statistics window and the variance of the Gaussian weighting, are set differently in different implementations, but the underlying idea is the same. In addition, any other direction whose magnitude reaches 80% of the main direction's magnitude is kept as an auxiliary direction, yielding an additional feature point with the same position and scale but a different orientation in the subsequent matching. A sketch of the histogram computation follows.
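A sketch of the orientation histogram with the usual 36 bins of 10°; for brevity it drops the splitting of a magnitude between two adjacent bins described above. `orientation_histogram` is a hypothetical helper, and `img` is assumed to be the Gaussian layer closest to the keypoint's scale:

```python
import numpy as np

def orientation_histogram(img, cy, cx, sigma_oct, bins=36):
    radius = int(round(3 * 1.5 * sigma_oct))
    hist = np.zeros(bins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if not (0 < y < img.shape[0] - 1 and 0 < x < img.shape[1] - 1):
                continue
            gx = img[y, x + 1] - img[y, x - 1]
            gy = img[y + 1, x] - img[y - 1, x]
            amp = np.hypot(gx, gy)
            ang = np.rad2deg(np.arctan2(gy, gx)) % 360.0
            # distance weighting with a 1.5 * sigma_oct Gaussian
            w = np.exp(-(dx * dx + dy * dy) / (2 * (1.5 * sigma_oct) ** 2))
            hist[int(ang * bins / 360.0) % bins] += w * amp
    main_dir = hist.argmax() * 360.0 / bins  # peak bin -> main orientation
    return hist, main_dir
```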
2. Feature point region descriptor

With the main orientation of the feature point region in hand, we can compute a rotation-invariant descriptor. The procedure is similar to the previous step, statistics of gradient magnitude and orientation over a region at the feature point's scale layer, with a few differences in implementation:

1. First, delimit the feature region on the scale layer where the feature point lies, with radius
$$r = \frac{3\sqrt{2}\,\sigma_{oct}\,(d + 1)}{2}$$
where $d$ is the number of sub-blocks per dimension, usually 4;
2. Divide the region into $d \times d$ sub-blocks, each containing multiple pixels;
3. Rotate the region to its main orientation (computed in the previous step);
4. Build a magnitude-orientation histogram with 8 orientation bins inside each sub-block. The region can then be represented by an $8 \times d \times d$-dimensional vector (128-D for $d = 4$), completing the feature point descriptor.

Note: all of the above is computed on the corresponding Gaussian-pyramid scale layer, not on the original image or the DoG. Pixel values lost during rotation are recovered by interpolation, which you can look up separately. To visualize SIFT features on the original image, the stable feature point coordinates must be transformed back to the original image size, simply by multiplying by the down-sampling factor. A simplified sketch of the descriptor construction follows.
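To make the layout concrete, here is a much-simplified sketch of the $d \times d \times 8$ descriptor accumulation; trilinear interpolation and the exact sampling radius are simplified, and `describe` is a hypothetical helper:

```python
import numpy as np

def describe(img, cy, cx, sigma_oct, main_deg, d=4, nbins=8):
    hist = np.zeros((d, d, nbins))
    radius = int(round(3 * np.sqrt(2) * sigma_oct * (d + 1) / 2))
    cos_t = np.cos(np.deg2rad(-main_deg))
    sin_t = np.sin(np.deg2rad(-main_deg))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # rotate the sample offset into the keypoint's own frame,
            # expressed in sub-block units
            rx = (cos_t * dx - sin_t * dy) / (3 * sigma_oct)
            ry = (sin_t * dx + cos_t * dy) / (3 * sigma_oct)
            bx, by = rx + d / 2 - 0.5, ry + d / 2 - 0.5
            if not (0 <= bx < d and 0 <= by < d):
                continue
            y, x = cy + dy, cx + dx
            if not (0 < y < img.shape[0] - 1 and 0 < x < img.shape[1] - 1):
                continue
            gx = img[y, x + 1] - img[y, x - 1]
            gy = img[y + 1, x] - img[y - 1, x]
            amp = np.hypot(gx, gy)
            # orientation measured relative to the main direction
            ang = (np.rad2deg(np.arctan2(gy, gx)) - main_deg) % 360.0
            hist[int(by), int(bx), int(ang * nbins / 360.0) % nbins] += amp
    vec = hist.ravel()                         # 8 * d * d dims (128 for d = 4)
    return vec / (np.linalg.norm(vec) + 1e-7)  # normalize to unit length
```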

Summary

This concludes the explanation of the SIFT algorithm; the matching stage can apply other nearest-neighbor or clustering algorithms to the descriptors described above. Overall the algorithm is fairly involved. This article adds the mathematical derivations that other blogs leave out, making the whole line of thought more coherent; for more detailed proofs, such as the finite-difference method, see the reference below.
Reference: https://blog.csdn.net/Dr_maker/article/details/121442210

Original post: blog.csdn.net/zyq880625/article/details/132192814