OpenCV Image Feature Extraction, Part 4: The SIFT Feature Detection Algorithm

1. Overview of SIFT feature detection 

The full name of SIFT is Scale-Invariant Feature Transform, proposed by the Canadian professor David G. Lowe. SIFT features are invariant to rotation, scale changes, and brightness changes, making them very stable local features.

1.1 Features of SIFT algorithm

  1. The local image features are invariant to rotation, scaling, and brightness changes, and remain reasonably stable under viewpoint changes, affine transformations, and noise.
  2. Highly distinctive and information-rich, making it suitable for fast and accurate matching against massive feature databases.
  3. Plentiful: even a few objects can generate a large number of SIFT features.
  4. Fast: an optimized SIFT matching algorithm can even run in real time.
  5. Extensible: it can easily be combined with other feature vectors.

1.2 Steps of SIFT feature detection

  1. Scale-space extrema detection: search over all scales and image locations, using a difference-of-Gaussian (DoG) function to identify potential interest points that are invariant to scale and orientation.
  2. Keypoint localization: at each candidate location, fit a fine model to determine the position and scale, and select keypoints according to their stability.
  3. Orientation assignment: assign one or more orientations to each keypoint based on the local image gradient directions; all subsequent operations are performed relative to the keypoint's orientation, scale, and position, which provides invariance to these factors.
  4. Keypoint description: measure the local image gradients at the selected scale in a neighborhood around each keypoint, and transform them into a representation that tolerates significant local shape deformation and illumination change.

================================================================

2. Constructing the scale space with a Gaussian kernel

The degree of blur in an image is used to simulate how an object forms an image on the retina: the closer the object, the larger it appears and the blurrier the image. This is the Gaussian scale space; blurring the image with different parameters (while keeping the resolution unchanged) is another representation of the scale space.
We know that convolving an image with a Gaussian function blurs it, and different "Gaussian kernels" produce different degrees of blur. The Gaussian scale space of an image is obtained by convolving it with Gaussians of different scales:

                                       L\left ( x,y,\sigma \right )=G\left ( x,y,\sigma \right )\ast I\left ( x,y \right )

where G\left ( x,y,\sigma \right ) is the Gaussian kernel function:

                                        G\left ( x,y,\sigma \right )=\frac{1}{2\pi \sigma ^{2}}e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}}

\sigma is called the scale-space factor; it is the standard deviation of the Gaussian distribution and reflects the degree of blurring of the image. The larger its value, the blurrier the image and the larger the corresponding scale. L\left ( x,y,\sigma \right ) denotes the Gaussian scale space of the image.
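
The relation above can be reproduced directly with OpenCV's Gaussian filtering. The following is only a minimal illustrative sketch (not OpenCV's internal SIFT code); the function name buildGaussianOctave is made up, and the values sigma0 = 1.6 and S = 3 are the ones used later in this article:

#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>

// Blur one image with a sequence of sigma values to form one octave of the
// Gaussian scale space L(x, y, sigma) = G(x, y, sigma) * I(x, y).
std::vector<cv::Mat> buildGaussianOctave(const cv::Mat& img, double sigma0 = 1.6, int S = 3) {
    std::vector<cv::Mat> octave;
    double k = std::pow(2.0, 1.0 / S);               // scale ratio between adjacent layers
    for (int s = 0; s < S + 3; ++s) {                // S + 3 layers per octave (see Section 3)
        double sigma = sigma0 * std::pow(k, s);      // sigma, k*sigma, k^2*sigma, ...
        cv::Mat L;
        cv::GaussianBlur(img, L, cv::Size(0, 0), sigma, sigma);  // kernel size derived from sigma
        octave.push_back(L);
    }
    return octave;
}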


The purpose of constructing the scale space is to detect feature points that exist at different scales. A good operator for detecting such feature points is the Laplacian of Gaussian (LoG), \nabla ^{2}G, where

                                                \nabla ^{2}=\frac{\partial ^{2}}{\partial x^{2}}+\frac{\partial^{2} }{\partial y^{2}}

Let k be the ratio between the scales of two adjacent Gaussian scale spaces. The difference of Gaussians (DoG) is then defined as:

                           D\left ( x,y,\sigma \right )=\left [ G\left ( x,y,k\sigma \right ) -G\left ( x,y,\sigma \right )\right ]\ast I\left ( x,y \right )=L\left ( x,y,k\sigma \right )-L\left ( x,y,\sigma \right )

The DoG response image is obtained by subtracting two adjacent images of the Gaussian scale space. To obtain DoG images, the Gaussian scale space must be constructed first. It is built by adding Gaussian filtering on top of the downsampled image pyramid: each level of the image pyramid is blurred with several different parameters σ, so that every pyramid level contains multiple Gaussian-blurred images. When downsampling, the first image of each new group is obtained by downsampling the image of scale 2σ from the group below it.
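
As a rough sketch of this subtraction (illustrative only, reusing the buildGaussianOctave helper assumed above), each DoG layer is just the difference of two adjacent Gaussian layers:

// Subtract adjacent Gaussian layers of one octave to obtain the DoG layers.
std::vector<cv::Mat> buildDoGOctave(const std::vector<cv::Mat>& octave) {
    std::vector<cv::Mat> dog;
    for (size_t i = 1; i < octave.size(); ++i) {
        cv::Mat d;
        cv::subtract(octave[i], octave[i - 1], d, cv::noArray(), CV_32F);  // L(k*sigma) - L(sigma)
        dog.push_back(d);
    }
    return dog;
}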

The Gaussian pyramid has several groups (octaves), and each group has several layers. The layers within a group have different scales (i.e. they use different Gaussian parameters σ), and the scales of two adjacent layers differ by a factor k. If each group has S layers, then k=2^{\frac{1}{S}}. The bottom image of each new group is obtained by downsampling, with a factor of 2, the image of scale 2σ in the group below it (the Gaussian pyramid is built from the bottom layer up). Once the Gaussian pyramid has been constructed, the DoG pyramid is obtained by subtracting adjacent layers of the Gaussian pyramid.
The number of groups of the Gaussian pyramid is generally

                                              o=\left [ log_{2} min\left ( m,n \right )\right ]-a

Here o is the number of groups (octaves) of the Gaussian pyramid, and m and n are the numbers of rows and columns of the image. The subtracted coefficient a can be any value between 0 and log_{2}\min\left ( m,n \right ), and depends on the size required for the top image of the pyramid.
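
As a quick worked example (the numbers are chosen only for illustration): for a 512×512 image with a=3, the formula gives o=\left [ log_{2}512 \right ]-3=9-3=6 groups; a larger a stops the pyramid earlier and therefore leaves a larger top-level image.
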
The Gaussian blur parameter \sigma (the scale) can be obtained from the following relation:

                                                 \sigma \left ( o,s \right )= \sigma _{0}\cdot 2^{o+\frac{s}{S}}

Here o is the group index, s is the layer index within the group, \sigma _{0} is the initial scale, and S is the number of layers per group. In the implementation of the algorithm, \sigma _{0}=1.6, o_{min}=-1 (i.e. the length and width of the original image are first doubled), and S=3. From this, the scale relationship between adjacent layers in the same group is:

                                                    \sigma _{s+1}=k\cdot\sigma _{s}=2^{\frac{1}{S}}\cdot \sigma _{s}

Scale relationship between adjacent groups:

                                                    \sigma _{o+1}=2\sigma _{o}
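
A small helper (illustrative only; the function name is invented here) makes these relations concrete by evaluating \sigma \left ( o,s \right )= \sigma _{0}\cdot 2^{o+\frac{s}{S}} directly:

#include <cmath>

// Evaluate sigma(o, s) = sigma0 * 2^(o + s / S) with sigma0 = 1.6 and S = 3.
double scaleOf(int o, int s, double sigma0 = 1.6, int S = 3) {
    return sigma0 * std::pow(2.0, o + static_cast<double>(s) / S);
}
// e.g. scaleOf(0, 1) == 1.6 * 2^(1/3) and scaleOf(1, 0) == 2 * 1.6, matching
// sigma_{s+1} = k * sigma_s within a group and sigma_{o+1} = 2 * sigma_o across groups.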

=================================================================

3. Extrema detection in the DoG space

To find the extrema of the scale space, each pixel is compared with all of its neighbors in both the image domain (the same scale) and the scale domain (the adjacent scales); if it is greater than (or less than) all of these neighbors, it is an extremum point. As shown in the figure, the detection point in the middle is compared with the 8 pixels in the 3×3 neighborhood of its own image and the 18 pixels in the 3×3 regions of the layers directly above and below, for a total of 26 pixels.


From the description above, the first and last layers of each group cannot be compared to obtain extrema. To keep the scale transformation continuous, three extra images are generated by Gaussian blurring at the top of each group, so that each group of the Gaussian pyramid has S+3 layers of images and each group of the DoG pyramid has S+2 layers of images.

Let S=3, i.e. extrema are to be detected on 3 layers per group; then k=2^{\frac{1}{S}}=2^{\frac{1}{3}}. Without the extra images, each group of the Gaussian pyramid would have only 3 layers and each group of the DoG pyramid only 2 layers. The first group of the DoG pyramid would then have two layers with scales σ and kσ, and the second group two layers with scales 2σ and 2kσ. With only two layers, extrema cannot be obtained by comparison (an extremum exists only when there are values on both sides).

Since extrema cannot be obtained from so few layers, the images of each group must be blurred further, so that the DoG scales in a group become \sigma ,k\sigma ,k^{2}\sigma ,k^{3}\sigma ,k^{4}\sigma and the middle three items, k\sigma ,k^{2}\sigma ,k^{3}\sigma, can be used for extrema detection. The corresponding middle three scales of the next group, obtained by downsampling the previous group, are 2k\sigma ,2k^{2}\sigma ,2k^{3}\sigma; their first item, 2k\sigma =2^{\frac{4}{3}}\sigma, is exactly continuous with the last item of the previous group, k^{3}\sigma =2^{\frac{3}{3}}\sigma =2\sigma.
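
The 26-neighbour comparison itself is easy to express in code. The sketch below is only illustrative (it assumes three adjacent CV_32F DoG layers of the same size and a pixel at least one pixel away from the border):

#include <opencv2/opencv.hpp>

// Return true if the pixel (r, c) of the middle DoG layer is strictly larger (or
// strictly smaller) than all 26 neighbours in its own layer and the two adjacent layers.
bool isLocalExtremum(const cv::Mat& below, const cv::Mat& curr, const cv::Mat& above,
                     int r, int c) {
    const float v = curr.at<float>(r, c);
    const cv::Mat* layers[3] = { &below, &curr, &above };
    bool isMax = true, isMin = true;
    for (int l = 0; l < 3; ++l) {
        for (int dr = -1; dr <= 1; ++dr) {
            for (int dc = -1; dc <= 1; ++dc) {
                if (l == 1 && dr == 0 && dc == 0) continue;   // skip the pixel itself
                float n = layers[l]->at<float>(r + dr, c + dc);
                if (v <= n) isMax = false;
                if (v >= n) isMin = false;
            }
        }
    }
    return isMax || isMin;
}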

=====================================================

4. Eliminating unstable extremum points (feature points)

The local extrema of the DoG detected by comparison are found in a discrete space. Because the discrete space is a sampling of the continuous space, an extremum found in the discrete space is not necessarily a true extremum, so points that do not meet the conditions must be eliminated. Curve fitting of the scale-space DoG function can be used to locate the true extrema. The essence of this step is to remove points where the local curvature of the DoG is strongly asymmetric.

There are two main types of points that do not meet the requirements to be eliminated:

  • Feature points with low contrast
  • Unstable Edge Response Points

 1. Eliminate low-contrast feature points

For a candidate feature point x, define its offset as \widetilde{x} and its contrast as H(x), with absolute value \left | H\left ( x \right ) \right |. Applying a Taylor expansion to H(x):

                                          H\left ( x \right )=H+\frac{\partial H^{T}}{\partial x}x+\frac{1}{2}x^{T}\frac{\partial ^{2}H}{\partial x^{2}}x

Since the extremum of H(x) is sought, take the derivative of the above expression with respect to x and set it to zero, which gives the offset

                                          \widetilde{x}=-\left ( \frac{\partial ^{2}H}{\partial x^{2}} \right )^{-1}\cdot \frac{\partial H}{\partial x}

 Then substitute the obtained \widetilde{x} value into the Taylor expansion of H(x):

                                          H\left ( \widetilde{x} \right )= H+\frac{1}{2}\frac{\partial H^{T}}{\partial x}\widetilde{x}

 Set a contrast threshold T. If \left | H\left ( \widetilde{x} \right ) \right |\geq T, the feature point is kept; otherwise it is eliminated.
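
A minimal sketch of this test, assuming finite-difference estimates of the gradient and Hessian of the DoG at the candidate point are already available (the names and the threshold value are only illustrative; Lowe suggests roughly 0.03 for images normalised to [0, 1]):

#include <opencv2/opencv.hpp>
#include <cmath>

// Refine the extremum with the Taylor expansion and reject low-contrast candidates.
// D   : DoG value at the candidate point
// dD  : 3x1 gradient (dD/dx, dD/dy, dD/dsigma)
// d2D : 3x3 Hessian of the DoG with respect to (x, y, sigma)
bool refineAndTestContrast(float D, const cv::Matx31f& dD, const cv::Matx33f& d2D,
                           float contrastThreshold = 0.03f) {
    cv::Matx31f offset = d2D.solve(dD, cv::DECOMP_LU) * -1.0;   // x~ = -(d2D)^(-1) * dD
    float Dhat = D + 0.5f * (dD.t() * offset)(0, 0);            // H(x~) = H + 0.5 * (dH/dx)^T * x~
    return std::fabs(Dhat) >= contrastThreshold;                // keep only strong extrema
}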

2. Eliminate unstable edge response points 

Across an edge (in the direction of the edge gradient) the principal curvature is relatively large, while along the edge it is small. The principal curvatures of the DoG function at a candidate feature point are proportional to the eigenvalues of the 2×2 Hessian matrix H:

                                                     H=\begin{bmatrix} d_{xx} & d_{xy}\\ d_{xy} & d_{yy}\\ \end{bmatrix}

Here d_{xx},d_{xy},d_{yy} are obtained by taking differences of the corresponding positions in the candidate point's neighborhood. To avoid computing the eigenvalues explicitly, their ratio can be used instead. Let \alpha =\lambda _{max} be the largest eigenvalue of H and \beta =\lambda _{min} the smallest. Then:

                                                 Tr\left ( H \right )=d_{xx}+d_{yy}=\alpha +\beta

                                                 Det\left ( H \right )=d_{xx}d_{yy}-d_{xy}^{2}=\alpha \cdot \beta

Here Tr\left ( H \right ) is the trace of the matrix H and Det\left ( H \right ) is its determinant. Let \gamma =\frac{\alpha }{\beta } denote the ratio of the largest to the smallest eigenvalue; then:

                                      \frac{Tr\left ( H \right )^{2}}{Det\left ( H \right )}=\frac{\left ( \alpha +\beta \right )^{2}}{\alpha \cdot \beta }=\frac{ \left ( \gamma \beta +\beta \right )^{2}}{\gamma \beta ^{2}}=\frac{\left ( \gamma +1 \right )^{2}}{\gamma }

The result of the above expression depends only on the ratio of the two eigenvalues, not on their actual magnitudes. It is smallest when the two eigenvalues are equal and increases as γ increases. Therefore, to check whether the ratio of principal curvatures is below a threshold T_{\gamma }, it suffices to test:

                                                        \frac{Tr\left ( H \right )^{2}}{Det\left ( H \right )}> \frac{\left ( T_{\gamma } +1 \right )^{2}}{T_{\gamma }}

If the above formula is true, then remove the feature point, otherwise keep it.
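
In code this test is just a few lines. The sketch below is illustrative; Lowe's paper uses a threshold of 10 for the eigenvalue ratio, and a non-positive determinant (curvatures of opposite sign) is also rejected:

// Reject edge-like candidates using the trace/determinant test on the 2x2 Hessian.
bool passesEdgeTest(float dxx, float dyy, float dxy, float T = 10.0f) {
    float tr  = dxx + dyy;                   // Tr(H)  = alpha + beta
    float det = dxx * dyy - dxy * dxy;       // Det(H) = alpha * beta
    if (det <= 0.0f) return false;           // principal curvatures have opposite signs
    return tr * tr / det < (T + 1.0f) * (T + 1.0f) / T;   // keep only if the ratio is small enough
}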

=========================================================================

5. Find the main direction of the feature points

After the steps above, the feature points existing at different scales have been found. To achieve rotation invariance, each feature point must be assigned an orientation. The orientation parameter is determined from the gradient distribution of the pixels in the neighborhood of the feature point, and the stable orientation of the local structure around the keypoint is then obtained from the image's gradient histogram. When a feature point is found, its scale \sigma is known, and so is the scale image in which it lies.

On that scale image, compute the gradient direction and magnitude within a circular region centered on the feature point with radius 3×1.5\sigma. To suppress noise, after the gradient magnitude and direction of all points in the circular region have been computed, the gradients are Gaussian-weighted with the feature point as the center and 1.5σ as the standard deviation of the Gaussian (the closer a point is to the center, the larger its weight):

                                     m^{'}\left ( x,y \right )= m\left ( x,y \right )\cdot W\left ( x,y \right )

                                     W\left ( x,y \right )=\frac{w\left ( x,y \right )}{\sum w\left ( x,y \right )}

                                      w\left ( x,y \right )=e^{-\frac{d^{2}}{2\cdot \left ( 1.5\sigma \right )^{2}}}

The gradient magnitude m\left ( x,y \right ) and direction \theta \left ( x,y \right ) at each point of L\left ( x,y \right ) are obtained from the following formulas:

                                      m\left ( x,y \right )=\sqrt{\left [ L\left ( x+1,y \right )-L\left ( x-1,y \right ) \right ]^{2}+\left [ L\left ( x,y+1 \right )-L\left ( x,y-1 \right ) \right ]^{2}}

                                      \theta \left ( x,y \right )=\arctan\frac{L\left ( x,y+1 \right )-L\left ( x,y-1 \right )}{L\left ( x+1,y \right )-L\left ( x-1,y \right )}

After the gradient directions are computed, a histogram is used to accumulate the gradient directions and magnitudes of the pixels in the neighborhood of the feature point. The horizontal axis of the gradient-orientation histogram is the angle of the gradient direction (gradient directions range from 0 to 360 degrees; the histogram has one bin every 36 degrees for a total of 10 bins, or one bin every 45 degrees for a total of 8 bins), and the vertical axis is the accumulated gradient magnitude for each direction.

To obtain a more accurate orientation, the discrete gradient histogram is usually interpolated; specifically, the keypoint orientation can be refined by parabolic interpolation over the three bins closest to the main peak. If the histogram contains a bin whose value reaches 80% of the main peak, that direction is treated as an auxiliary orientation of the feature point, so one feature point can be assigned several orientations.

Once the main orientation has been obtained, each feature point carries three pieces of information, \left ( x,y,\sigma ,\theta \right ): position, scale, and orientation. This determines a SIFT feature region, which is drawn with three elements: the center marks the position of the feature point, the radius marks its scale, and the arrow marks its main orientation. A keypoint with multiple orientations is copied into multiple keypoints, and the orientation values are assigned to the copies; one feature point thus generates several feature points with the same coordinates and scale but different orientations.
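
The histogram construction described above can be sketched as follows (illustrative only, assuming a CV_32F scale image L and using 8 bins of 45 degrees as in the simplified description; OpenCV's own implementation differs in detail):

#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>

// Accumulate a Gaussian-weighted gradient-orientation histogram around a keypoint.
std::vector<float> orientationHistogram(const cv::Mat& L, cv::Point kp, float sigma,
                                        int bins = 8) {
    std::vector<float> hist(bins, 0.0f);
    int radius = cvRound(3 * 1.5f * sigma);                          // radius 3 x 1.5 x sigma
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int x = kp.x + dx, y = kp.y + dy;
            if (x < 1 || y < 1 || x >= L.cols - 1 || y >= L.rows - 1) continue;
            if (dx * dx + dy * dy > radius * radius) continue;       // keep only the circular region
            float gx = L.at<float>(y, x + 1) - L.at<float>(y, x - 1);
            float gy = L.at<float>(y + 1, x) - L.at<float>(y - 1, x);
            float mag = std::sqrt(gx * gx + gy * gy);                // gradient magnitude
            float ang = cv::fastAtan2(gy, gx);                       // gradient direction in degrees
            float w = std::exp(-(dx * dx + dy * dy) / (2.0f * (1.5f * sigma) * (1.5f * sigma)));
            int bin = cvFloor(ang * bins / 360.0f) % bins;
            hist[bin] += w * mag;                                    // Gaussian-weighted vote
        }
    }
    // The highest bin gives the main orientation; bins reaching 80% of the peak
    // give auxiliary orientations.
    return hist;
}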

========================================================================= 

6. Generating the feature descriptor

Through the steps above, the position, scale, and orientation of each SIFT feature point have been found. Next, a set of vectors is needed to describe the keypoints, i.e. a feature point descriptor. The descriptor includes not only the feature point itself but also the neighboring pixels that contribute to it. The descriptor should be highly distinctive so as to ensure a good matching rate.
Generating a feature descriptor takes roughly three steps:

  1. Rotate the neighborhood to the main orientation to ensure rotation invariance.
  2. Generate the descriptor, finally forming a 128-dimensional feature vector.
  3. Normalize the feature vector to unit length to further remove the influence of illumination.

To guarantee rotation invariance of the feature vector, the coordinate axes in the neighborhood of the feature point are rotated by θ (the main orientation of the feature point), i.e. the coordinate axes are rotated to the main orientation. The new coordinates of the neighborhood pixels after rotation are:

                                                \begin{bmatrix} x^{'}\\ y^{'}\\ \end{bmatrix}=\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \\ \end{bmatrix}\begin{bmatrix} x\\ y\\ \end{bmatrix}

After rotation, an 8×8 window is taken around the keypoint. As shown in the figure below, the center of the left figure is the position of the current keypoint, and each small cell represents a pixel of the scale image in the keypoint's neighborhood. The gradient magnitude and direction of each pixel are computed: the arrow direction represents the gradient direction and its length the gradient magnitude, and the gradients are then weighted with a Gaussian window. Finally, a gradient histogram over 8 directions is accumulated on each 4×4 block, and the accumulated value of each direction forms a seed point, as shown in the right figure. In this illustration each feature point is composed of 2×2 = 4 seed points, each carrying vector information in 8 directions. Combining directional information over the neighborhood in this way strengthens the algorithm's robustness to noise and also provides reasonable fault tolerance for matches with localization errors.

Unlike the main-orientation histogram, the gradient histogram of each seed region divides 0–360 degrees into 8 direction bins of 45 degrees each, i.e. each seed point carries gradient strength information for 8 directions. In total, 4×4 = 16 seed points are used to describe each keypoint, so each keypoint generates a 4×4×8 = 128-dimensional SIFT feature vector.
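
Rather than re-implementing the 4×4×8 accumulation by hand, the 128-dimensional descriptors can be obtained directly from OpenCV (version 4.4 or later, where SIFT lives in the main module). A minimal sketch, assuming gray is an already-loaded grayscale cv::Mat:

cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
std::vector<cv::KeyPoint> kps;
cv::Mat descriptors;
// Detect keypoints and compute their descriptors in one call (no mask).
sift->detectAndCompute(gray, cv::noArray(), kps, descriptors);
// Each row of `descriptors` is one keypoint's descriptor; descriptors.cols == 128.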

=========================================================================

Code:

#include <opencv2/opencv.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <iostream>

using namespace cv;
using namespace std;

int main(int argc, char** argv) {
    // Load the input image as grayscale (SIFT works on single-channel images)
    Mat src = imread("F:/photo/i.jpg", IMREAD_GRAYSCALE);
    if (src.empty()) {
        printf("could not load image...\n");
        return -1;
    }
    namedWindow("input image", WINDOW_AUTOSIZE);
    imshow("input image", src);

    // Create the SIFT detector, keeping at most the 400 strongest keypoints
    // (requires OpenCV >= 4.4, where SIFT is in the main module)
    int numFeatures = 400;
    Ptr<SIFT> detector = SIFT::create(numFeatures);

    // Detect keypoints (no mask)
    vector<KeyPoint> keypoints;
    detector->detect(src, keypoints, Mat());
    printf("Total KeyPoints : %d\n", (int)keypoints.size());

    // Draw the detected keypoints on the image
    Mat keypoint_img;
    drawKeypoints(src, keypoints, keypoint_img, Scalar::all(-1), DrawMatchesFlags::DEFAULT);
    namedWindow("SIFT KeyPoints", WINDOW_AUTOSIZE);
    imshow("SIFT KeyPoints", keypoint_img);

    waitKey(0);
    return 0;
}

Result:
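
As a possible follow-up (not part of the original post), the descriptors produced above can be matched between two images with a brute-force matcher plus Lowe's ratio test; the image paths below are placeholders:

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img1 = cv::imread("scene1.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("scene2.jpg", cv::IMREAD_GRAYSCALE);
    if (img1.empty() || img2.empty()) return -1;

    // Detect keypoints and compute 128-dimensional SIFT descriptors for both images
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    sift->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    cv::BFMatcher matcher(cv::NORM_L2);                 // L2 norm suits float SIFT descriptors
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);             // two nearest neighbours per descriptor

    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)  // ratio test
            good.push_back(m[0]);

    cv::Mat vis;
    cv::drawMatches(img1, kp1, img2, kp2, good, vis);
    cv::imshow("SIFT matches", vis);
    cv::waitKey(0);
    return 0;
}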


Source: blog.csdn.net/weixin_44651073/article/details/127978307