[3D Reconstruction] Feature Detection and Matching

Series Article Directory

This series, started on December 25, 2022, records my study notes from a 3D reconstruction project. It is divided into the following parts:

1. The conversion relationship between camera imaging and coordinate system

2. Camera calibration: Zhang Zhengyou's calibration method

3. Feature detection and matching

4. Structure from Motion (SfM)



Foreword:

        After calibrating the camera parameters, a UAV is used to collect data, and the UAV image sequence is used for 3D reconstruction. The first step is feature detection and matching between the images in the sequence. This article records my study notes on feature detection and matching.


1. Preliminary summary of feature points

        Using representative regions of an image (corners, edges) to match between images is a common approach. Many computer vision pipelines, such as SfM and visual SLAM, extract corner points as features for image matching.

        However, pure corner points sometimes cannot capture all the information of the corresponding points in the image, and corners may change as the camera moves. What we need are feature points that do not change with camera motion, rotation, or illumination.

        A feature point of an image consists of two parts: a key point (Keypoint) and a descriptor (Descriptor). The key point is the position of the feature in the image, sometimes together with orientation and scale information; the descriptor is usually a vector that describes, in an artificially designed way, the information of the pixels around the key point. Descriptors are designed so that features with similar appearance have similar descriptors. Therefore, during matching, as long as the descriptors of two feature points are close in the vector space, they can be considered the same feature point.

1.1 Descriptor

        The descriptor of a feature is usually a vector that describes the information of the key point and its surrounding pixels. It has the following characteristics:

  1. Invariance: the feature does not change when the image is scaled (zoomed in or out) or rotated.
  2. Robustness: insensitive to noise, illumination changes, or other small deformations.
  3. Distinguishability: each feature descriptor is unique, with minimal similarity to the others.

Contradiction: distinguishability and invariance are in conflict. A descriptor with many invariances is weaker at distinguishing local image content, while a descriptor that distinguishes local image content easily tends to be less robust. Therefore, when designing a feature descriptor, these three characteristics must be considered together and a balance must be found among them.

The invariance of feature descriptors is mainly reflected in two aspects:

1. Scale Invariant

        The same feature remains unchanged at different scales of the same image. In order to maintain the invariance of the scale, when calculating the descriptor of the feature point, the image is usually transformed into a uniform scale space, and the scale factor is added. Without this feature, the same feature point cannot be well matched between enlarged or reduced images.

2. Rotation Invariant
        The same feature remains unchanged after the imaging angle is rotated. Similar to scale invariance, to maintain rotation invariance, the orientation information of the key point is added when computing the feature descriptor.

        The following is the calculation method of commonly used descriptors: (this article mainly introduces the calculation method of SIFT )

 

1.2 Scale space and DOG

1.2.1 Scale space and Gaussian convolution

        Scale and scale space have been mentioned many times above, and their meanings are explained here.

        First review the two-dimensional Gaussian function expression:

        G(x, y, \sigma ) = \frac{1}{2\pi \sigma ^{2}}e^{-(x^{2}+y^{2})/(2\sigma ^{2})}

         When we convolve an image with Gaussian functions of different σ values, multiple Gaussian-blurred images are generated; these σ values are the different scales.

        It can be understood as the degree to which the computer "observes" the picture: when we look at a picture, the farther we are from it, the blurrier it appears. We can use Gaussian convolution kernels of different widths to low-pass filter the image, reducing detail and blurring it, which is similar to observing the image with the human eye from different distances. The width of the Gaussian filter (which determines the degree of smoothing) is characterized by the parameter σ.

        Therefore, the scale is the σ value in the two-dimensional Gaussian function, and all images at different scales constitute the scale space of a single original image. "Image scale space expression" is the description of the image at all scales.

        Scales exist naturally and objectively, not created subjectively. Gaussian convolution is just one form of representing scale space.

        Scale space expression:

        L(x, y, \sigma ) = G(x, y, \sigma ) * I(x, y)

        where I(x, y) is the original image and * denotes convolution.

         The Gaussian convolution kernel is shown in the figure:

        The pixel at the center of the convolution kernel has the largest weight, and pixels farther from the center have smaller weights.

        The Gaussian convolution mentioned above blurs the image much like the human eye does when far from the image; but people can recognize an object from its key outline, so the computer can likewise use this key information to identify the image.
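        As an illustration, the following is a minimal sketch (the image path is a placeholder) that produces several images of the same resolution at different scales with OpenCV; a larger σ corresponds to a coarser scale:

import cv2

img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

# one image per scale: larger sigma = more blur = less detail
scales = [0.8, 1.6, 3.2, 6.4]
scale_space = [cv2.GaussianBlur(img, (0, 0), sigmaX=s) for s in scales]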

 1.2.2 DOG pyramid

        Before introducing the DOG pyramid, let's first understand the pyramid:

         When an image is downsampled, the number of pixels it contains and its size are reduced accordingly. Stacking the results of repeated downsampling forms a pyramid. Note that the pyramid in the figure above is obtained by downsampling the fine image at the bottom.

        The difference between the pyramid and the scale space expression here is that:

        1. The resolution of each layer of the image pyramid is reduced, while the resolution of the scale space is unchanged.

        2. The processing speed of the image pyramid is accelerated as the resolution decreases.

        In order to make the scale continuous, a compromise is adopted: combining the image pyramid with the scale-space representation gives the LOG (Laplacian of Gaussian) pyramid.

        First, the image is downsampled to obtain an image pyramid at different resolutions, and then Gaussian convolution is applied to each pyramid level. The result is that, on top of the pyramid, each level contains several images of the same resolution but different scales. The multiple images at one pyramid level are collectively called an octave (group); each pyramid level has exactly one octave, so the number of octaves equals the number of pyramid levels, and each octave contains multiple interval images.

        Finally, the DOG (Difference of Gaussians) pyramid is generated from the LOG pyramid.

        Here k is the constant factor between two adjacent scales (k = 2^{1/S}). The image pyramid is divided into O octaves (groups); each octave is divided into layers with S intervals, so each octave contains S + 3 Gaussian images (one extra layer is added above and one below, because extrema are searched only on the middle layers). The first image of the next octave is obtained by downsampling the layer with index S of the previous octave (the third image from the end, if layers are indexed from 0), which reduces the amount of convolution. Then the adjacent images within each octave of the LOG pyramid are subtracted from each other (left in the figure below); the pyramid rebuilt from all the difference images is the DOG pyramid (right in the figure below).

        In the above scale space, the relationship between \sigma, the octave index o and the intra-octave layer index s is:

        \sigma (o, s) = \sigma _{0}\cdot 2^{o + s/S},\quad o \in [0, O-1],\ s \in [0, S+2]

         where \sigma _{0} is the base scale, o is the index of the octave (group), and s is the index of the layer within the octave. The scale coordinate of a key point is computed with this formula from the octave and the layer within the octave where the key point lies.

        When building the pyramid, the input image is pre-blurred as layer 0 of octave 0, which is equivalent to discarding the highest spatial sampling rate. Therefore, the usual practice is to first double the size of the image to generate octave -1. We assume the initial input image has already been Gaussian-blurred with \sigma _{-1} = 0.5 to combat aliasing; if the input image is doubled in size with bilinear interpolation, this is equivalent to \sigma _{-1} = 1.

        Note:

        When constructing the Gaussian pyramid, the scale coordinate of each layer within an octave is calculated according to the following formula:

        \sigma (s) = \sigma _{0}\sqrt{2^{2s/S} - 2^{2(s-1)/S}}

        The intra-octave scale coordinate \sigma (s) of the same layer is identical across octaves. The next layer within an octave is obtained by Gaussian-blurring the previous layer with \sigma (s); the formula above is used to generate the Gaussian images of different scales within an octave one after another. When the scale of a particular layer within an octave is needed, it is computed directly with the following formula:

         \sigma _{oct}(s) = \sigma _{0}\cdot 2^{s/S},\quad s \in [0, S+2]

         This intra-octave scale determines the size of the sampling window during orientation assignment and descriptor computation. Finally, the complete Gaussian and DOG pyramids are obtained.
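        The construction above can be sketched as follows (a minimal sketch, not the full SIFT implementation: the function name and defaults are mine, and the initial pre-blur/upsampling of the input image is omitted; S = 3 and \sigma _{0} = 1.6 follow Lowe's paper):

import cv2
import numpy as np

def build_dog_pyramid(img, n_octaves=4, S=3, sigma0=1.6):
    """Return (gaussian_pyramid, dog_pyramid) as lists of per-octave image lists."""
    k = 2 ** (1.0 / S)
    gauss_pyr, dog_pyr = [], []
    base = img.astype(np.float32)
    for o in range(n_octaves):
        octave = [cv2.GaussianBlur(base, (0, 0), sigmaX=sigma0)]
        for s in range(1, S + 3):                      # S + 3 images per octave
            # incremental blur so that layer s has total scale sigma0 * k**s
            sig_prev, sig_total = sigma0 * k ** (s - 1), sigma0 * k ** s
            sig_inc = np.sqrt(sig_total ** 2 - sig_prev ** 2)
            octave.append(cv2.GaussianBlur(octave[-1], (0, 0), sigmaX=sig_inc))
        gauss_pyr.append(octave)
        # adjacent differences form one octave of the DOG pyramid (S + 2 images)
        dog_pyr.append([octave[s + 1] - octave[s] for s in range(S + 2)])
        # the next octave starts from layer S (total scale 2 * sigma0), downsampled by 2
        base = cv2.resize(octave[S], (base.shape[1] // 2, base.shape[0] // 2),
                          interpolation=cv2.INTER_NEAREST)
    return gauss_pyr, dog_pyr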

2. Feature detection 

2.1 Calculation of key point position, scale and direction 

2.1.1 DOG local extremum points

        Keypoints are composed of local extrema points in the DOG space.

        In the image, each pixel is compared with its surrounding pixels; when its value is greater than (or less than) all of its neighbors, it is an extremum. In the DOG pyramid, the neighborhood includes not only the surrounding points in the current difference image but also the corresponding points in the difference images of the adjacent scales in the same octave.

        The middle detection point is compared with its 8 adjacent points of the same scale and 9×2 points corresponding to the upper and lower adjacent scales, a total of 26 points, to ensure that extreme points are detected in both the scale space and the two-dimensional image space.

        The above procedure finds extrema in the discrete space. The position and scale of the key points must then be determined precisely by fitting a three-dimensional quadratic function, and low-contrast points and unstable edge responses must be removed (the DOG operator produces a strong edge response), to enhance matching stability and noise immunity.
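        A minimal sketch of the 26-neighbour test (assuming dog is one octave of the DOG pyramid stacked as a 3-D numpy array indexed (layer, row, col), with pixel values scaled to [0, 1]; the contrast pre-filter uses the 0.03 value from Lowe's paper):

import numpy as np

def is_extremum(dog, s, y, x, contrast_thr=0.03):
    """True if dog[s, y, x] is no smaller (or no larger) than all 26 neighbours."""
    val = dog[s, y, x]
    if abs(val) < contrast_thr:                        # cheap low-contrast pre-filter
        return False
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]  # 3 x 3 x 3 neighbourhood
    return val == cube.max() or val == cube.min()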

2.1.2 Precise positioning of key points

        The Taylor expansion (fitting function) of the DOG function D in scale space is:

        D(X) = D + \frac{\partial D^{T}}{\partial X}X + \frac{1}{2}X^{T}\frac{\partial ^{2}D}{\partial X^{2}}X

         where X = (x, y ,\sigma )^{T}. Taking the derivative with respect to X and setting it to zero gives the offset of the extremum:

        \hat{X} = -\left (\frac{\partial ^{2}D}{\partial X^{2}} \right )^{-1}\frac{\partial D}{\partial X}

         This offset is relative to the interpolation center. When the offset in any dimension is greater than 0.5 (that is, in x, y or \sigma), the interpolation center has already shifted to a neighboring point, so the position of the current key point must be changed and the interpolation repeated at the new position until convergence. If the iteration limit is exceeded or the point falls outside the image, the point is deleted.

        Substituting the extremum \hat{X} back into the equation gives the response at the extremum:

        D(\hat{X}) = D + \frac{1}{2}\frac{\partial D^{T}}{\partial X}\hat{X}

        Points whose |D(\hat{X})| is below a small threshold (0.03 in Lowe's paper) are discarded as low-contrast points.
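        A minimal numpy sketch of one refinement step (again assuming dog is a (layer, row, col) array; derivatives are central finite differences, and the returned value is D(\hat{X}) for the low-contrast check):

import numpy as np

def refine_offset(dog, s, y, x):
    """One interpolation step: returns (offset_hat_X, value_at_extremum)."""
    # first derivatives of D with respect to (x, y, sigma), central differences
    dD = 0.5 * np.array([dog[s, y, x + 1] - dog[s, y, x - 1],
                         dog[s, y + 1, x] - dog[s, y - 1, x],
                         dog[s + 1, y, x] - dog[s - 1, y, x]])
    # second derivatives (3 x 3 Hessian of D)
    v = dog[s, y, x]
    dxx = dog[s, y, x + 1] + dog[s, y, x - 1] - 2 * v
    dyy = dog[s, y + 1, x] + dog[s, y - 1, x] - 2 * v
    dss = dog[s + 1, y, x] + dog[s - 1, y, x] - 2 * v
    dxy = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                  - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    dxs = 0.25 * (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
                  - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1])
    dys = 0.25 * (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
                  - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x])
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])
    offset = -np.linalg.solve(H, dD)       # \hat{X}: shift in (x, y, sigma)
    value = v + 0.5 * dD.dot(offset)       # D(\hat{X}); drop the point if |value| is small
    return offset, value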

2.1.3 Removal of edge responses

        From Harris corner detection we know that shifting a local window around a corner in any direction should cause a drastic change in pixel values, whereas for a point on an edge the pixel values in the window barely change when moving along the edge direction. A poorly localized extremum of the difference-of-Gaussians operator has a large principal curvature across the edge and a small one in the perpendicular direction. Since the DOG operator produces a strong edge response, these unstable edge response points must be removed.

        First, the 2×2 Hessian matrix is obtained at the feature point:

        H = \begin{pmatrix} D_{xx} & D_{xy}\\ D_{xy} & D_{yy} \end{pmatrix}

        The eigenvalues of H are proportional to the principal curvatures of D. Computing the eigenvalues explicitly can be avoided because we only care about their ratio. Let \alpha be the larger eigenvalue and \beta the smaller one; then:

        Tr(H) = D_{xx} + D_{yy} = \alpha + \beta ,\qquad Det(H) = D_{xx}D_{yy} - D_{xy}^{2} = \alpha \beta

        Tr(H) is the sum of the diagonal elements of H, and Det(H) is the determinant of H. Let \alpha = r\beta, where r is the ratio of the larger eigenvalue to the smaller one; then:

        \frac{Tr(H)^{2}}{Det(H)} = \frac{(\alpha + \beta )^{2}}{\alpha \beta } = \frac{(r+1)^{2}}{r}

         In this way we obtain an expression that depends only on the ratio r of the two eigenvalues and not on their specific values. When the two eigenvalues are equal (r = 1) the expression reaches its minimum, and it grows as r grows. A larger r means the gradient is large in one direction and small in the other, which is exactly the edge case described above (a large principal curvature across the edge and a small one along it). We therefore set a threshold on r, so that we only need to check:

         \frac{Tr(H)^{2}}{Det(H)} < \frac{(r+1)^{2}}{r}

         If this condition is satisfied (Lowe uses r = 10), the key point is kept; otherwise it is discarded as an edge response.
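        A minimal sketch of this check (dxx, dxy, dyy are the second differences of the DOG image at the key point; r = 10 follows Lowe's paper):

def passes_edge_test(dxx, dxy, dyy, r=10.0):
    """Keep the key point only if the principal-curvature ratio is below r."""
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:                  # curvatures of opposite sign: reject
        return False
    return tr * tr / det < (r + 1) ** 2 / r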

2.1.4 Scale calculation of key points

        It has been introduced in 1.2.2:

 The scale coordinate of each layer within an octave is calculated according to the following formula:

\sigma (s) = \sigma _{0}\sqrt{2^{2s/S} - 2^{2(s-1)/S}}

This formula is used to generate the Gaussian images of different scales within an octave one after another; when the scale of a particular layer within an octave is needed, it is computed directly as:

\sigma _{oct}(s) = \sigma _{0}\cdot 2^{s/S}

2.1.5 Direction matching of key points

         In order to achieve rotation invariance of the descriptor, the local properties of the image are used to assign an orientation to each key point; the gradient orientations of the image are used to obtain a stable orientation of the local structure. For each detected key point we know its position and scale in the DOG pyramid. Using finite differences on the Gaussian image corresponding to that scale (see 4.3 in reference 6 for details), the gradient magnitude and angle of the pixels in a neighborhood window centered on the feature point are computed:

        m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^{2} + (L(x, y+1) - L(x, y-1))^{2}}

        \theta (x, y) = \arctan \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}

        The gradient magnitude m(x, y) is accumulated with a Gaussian weight of \sigma = 1.5\times \sigma _{oct}, and, following the 3σ rule of scale sampling, the radius of the neighborhood window is 3\times 1.5\times \sigma _{oct}(s).

         After the gradient computation on the Gaussian image in the key point's neighborhood is complete, a histogram is used to count the gradient directions and magnitudes of the pixels in the neighborhood. The horizontal axis of the gradient orientation histogram is the gradient direction angle, and the vertical axis is the accumulated gradient magnitude for that angle. The histogram divides the 0-360 degree range into 36 bins, one every 10 degrees. The peak of the histogram represents the dominant gradient direction in the neighborhood of the key point, i.e. the main orientation of the key point:

         In order to enhance the robustness of matching, directions whose peak value is greater than 80% of the main peak are also kept as auxiliary orientations of the key point. Lowe's paper points out that only about 15% of key points are assigned multiple orientations, but these points are critical to the stability of matching.
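        A minimal sketch of the 36-bin orientation histogram (mag and ang_deg are the gradient magnitude and angle arrays of the neighborhood window, gauss_w the Gaussian weights with \sigma = 1.5\times \sigma _{oct}; the histogram smoothing and parabolic peak interpolation done by full implementations are omitted):

import numpy as np

def keypoint_orientations(mag, ang_deg, gauss_w, n_bins=36):
    """Return the main and auxiliary orientations (>= 80% of the peak), in degrees."""
    bins = (ang_deg % 360.0 / 360.0 * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), (mag * gauss_w).ravel())
    peak = hist.max()
    return [b * 360.0 / n_bins for b in range(n_bins) if hist[b] >= 0.8 * peak]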

2.2 Calculation of feature descriptors (vectors)

        Through the above steps, each key point has three pieces of information: position, scale and orientation. The next step is to create a descriptor for each key point: a set of vectors describing the key point so that it does not change under various variations, such as illumination changes and viewpoint changes. The feature descriptor is related to the scale of the feature point, so the gradient computation should be performed on the Gaussian image corresponding to the feature point.

        The SIFT descriptor is a representation of the statistics of the Gaussian-image gradients in the key point's neighborhood: the surrounding pixels are divided into blocks, a gradient histogram is computed within each block, and the result is a unique, abstract vector describing the image information of this region.

        Lowe explains in the paper that the descriptor is computed from gradient values in 8 directions within a 4×4 window in the key point's scale space, giving a 4×4×8 = 128-dimensional vector in total.

        The main expression steps are as follows:

2.2.1 Determine the image area required by the descriptor (preparation)

       Divide the neighborhood around the key point into d×d (4×4) sub-regions. Each sub-region serves as a seed point, and each seed point has 8 directions, i.e. each sub-region has 8 directions; the weights of these 8 directions are then computed from the sampling points.

        The size of each sub-region is the same as in orientation assignment: each sub-region spans 3\sigma _{oct} sub-pixels. In principle a smaller square would be enough to enclose the samples, but 3\sigma _{oct} (with 3\sigma _{oct}\leqslant 6\sigma _{0}) is not large and more sampling points are better than fewer, so to simplify the calculation the side length is taken as 3\sigma _{oct}; each sub-region is thus assigned a square area with side length 3\sigma _{oct}.

        Considering that bilinear interpolation will be needed later, the side length of the required image window is 3\sigma _{oct}\times (d+1). Taking the rotation into account as well (so that the coordinate axes can be rotated to the key point orientation in the next step, introduced below), the radius of the image region actually needed is (the result is rounded):

        radius = \frac{3\sigma _{oct}\times \sqrt{2}\times (d+1)}{2}

 

2.2.2 Rotate the coordinate axes to the direction of the key point (rotate first to obtain (x', y'))

        To ensure rotation invariance, the coordinate axis is rotated to the direction of the key point. The left picture is before rotation, the red arrow in the picture marks the direction of the current key point, and the right picture is after rotation.

        The new coordinates (x', y') of a sampling point in the rotated neighborhood are:

        \begin{pmatrix} x'\\ y' \end{pmatrix} = \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix}\begin{pmatrix} x\\ y \end{pmatrix}

2.2.3 Assign the sampling points in the key point's neighborhood to the corresponding sub-regions, assign the gradient values of each sub-region to 8 directions, and compute the weights (then map (x', y') into each small sub-region to obtain its coordinates (x'', y'') in that sub-region)

        The rotated sampling points within a circle of radius radius are assigned to the d×d (4×4) sub-regions; the gradients and directions of the sampling points affecting each sub-region are computed and assigned to 8 directions.

        The rotated sampling point (x', y') falls in the sub-region with subscript (x'', y''):

        \begin{pmatrix} x''\\ y'' \end{pmatrix} = \frac{1}{3\sigma _{oct}}\begin{pmatrix} x'\\ y' \end{pmatrix} + \frac{d}{2}

        where 3\sigma _{oct} is the side length of each sub-region obtained in the preparation step, and d is the parameter used for the later bilinear sampling. Lowe suggests weighting the gradient magnitudes of the pixels in the sub-regions with a Gaussian of \sigma = 0.5d, that is

Where a, b are the position coordinates of key points in the Gaussian pyramid image.

2.2.4 Bilinear interpolation calculates the contribution of each sampling point to the eight directions of the seed point

         Each sample now has sub-region coordinates (x'', y''). For example, the red dot in the figure above is a point assigned to the upper-right sub-region; its contribution to each seed point (each sub-region acts as a seed point) is computed by linear interpolation.

         The red dot in the figure falls between row 0 and row 1 and contributes to both rows: its contribution factor to the seed point at row 0, column 3 is dr, and to the one at row 1, column 3 it is 1 - dr. Similarly, its contribution factors to the two adjacent columns are dc and 1 - dc, and to the two adjacent directions do and 1 - do. The final weight accumulated in each direction is then:

        weight = w\cdot dr^{k}(1-dr)^{1-k}\cdot dc^{m}(1-dc)^{1-m}\cdot do^{n}(1-do)^{1-n}

         where k, m, n are 0 or 1. For the example here, the point is in row 0, column 3; assuming its direction bin is 6, the corresponding weight is added to hist[0][3][6].
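        A minimal sketch of this accumulation (hist is the d×d×8 descriptor array; r, c, o are the fractional row, column and orientation-bin coordinates of one sample and w its Gaussian-weighted gradient magnitude; here the fractional parts dr, dc, do are measured from the lower cell, so the nearer cell receives the larger share):

import numpy as np

def accumulate(hist, r, c, o, w, d=4, n_bins=8):
    """Distribute one sample's weight w over the 8 neighbouring (row, col, bin) cells."""
    r0, c0, o0 = int(np.floor(r)), int(np.floor(c)), int(np.floor(o))
    dr, dc, do = r - r0, c - c0, o - o0
    for k in (0, 1):
        for m in (0, 1):
            for n in (0, 1):
                rr, cc = r0 + k, c0 + m
                if 0 <= rr < d and 0 <= cc < d:
                    contrib = (w * (dr if k else 1 - dr)
                                 * (dc if m else 1 - dc)
                                 * (do if n else 1 - do))
                    hist[rr, cc, (o0 + n) % n_bins] += contrib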

2.2.5 Remove the influence of light and normalize

        Through the above steps we obtain the gradient values of 8 directions in the 4×4 window of the key point's scale space, i.e. a 4×4×8 = 128-dimensional vector H. Next, the vector is normalized:

         l_{i} = \frac{h_{i}}{\sqrt{\sum_{j=1}^{128}h_{j}^{2}}},\quad i = 1, 2, \dots , 128

         where H = (h_1, h_2, ..., h_{128}) is the 128-dimensional vector before normalization, and L = (l_1, l_2, ..., l_{128}) is the result after normalization.

2.2.6 Descriptor vector threshold

        Non-linear illumination and camera saturation changes can make the gradient values in some directions too large, so a threshold (generally 0.2 after vector normalization) is used to truncate large gradient values. The vector is then normalized once more, which improves the distinctiveness of the features.
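        A minimal numpy sketch combining steps 2.2.5 and 2.2.6 (h is the raw 128-dimensional vector):

import numpy as np

def finalize_descriptor(h, thr=0.2, eps=1e-7):
    """Normalize, truncate values above thr, then normalize again."""
    h = h / (np.linalg.norm(h) + eps)   # removes linear illumination changes
    h = np.minimum(h, thr)              # suppresses overly large gradients
    return h / (np.linalg.norm(h) + eps)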

2.2.7 Sorting feature description vectors according to the scale of feature points

        Sort the feature description vectors according to the scale of feature points.

2.2.8 Summary

        To summarize, the generation of the SIFT feature description vector proceeds as follows: after determining the region used for the descriptor, first rotate the coordinates to obtain (x', y'), then map (x', y') into each small sub-region to obtain its coordinates (x'', y'') there, use trilinear interpolation to compute the weight (contribution) of each sampling point to the eight directions of each sub-region (seed point), normalize and apply the threshold, and finally sort the feature description vectors according to the scale of the feature points.

3. Code implementation

        Because SIFT was patented, cv2.xfeatures2d.SIFT_create() (from the opencv-contrib modules) is used to create the SIFT detector in older OpenCV versions. In later 3.x builds SIFT/SURF were disabled for patent reasons, so this call raises an error there; after the patent expired, OpenCV 4.4.0 and newer provide SIFT directly in the main module as cv2.SIFT_create(). A minimal usage example (the image path is a placeholder):

import cv2
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)  # "image.jpg" is a placeholder path
sift = cv2.xfeatures2d.SIFT_create()  # use cv2.SIFT_create() on OpenCV >= 4.4
key_points, desc = sift.detectAndCompute(img, None)  # keypoints and 128-D descriptors

        The function returns keypoints and feature vectors.
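        For example, continuing the snippet above, desc is an N×128 array with one row per key point:

print(len(key_points), desc.shape)   # N key points -> desc.shape == (N, 128)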

        Key points include information on:

pt: coordinates of key points

angle: the orientation of the key point. According to Lowe's paper, to keep the direction invariant, the SIFT algorithm obtains it from gradient computations in the neighborhood around the key point; -1 is the initial value.

size: the diameter of the key point's neighborhood.

class_id: can be used to group feature points when classifying images; it is -1 if not set and must be set manually.

octave: the pyramid octave (layer) from which the key point was extracted.

response: the response strength, i.e. how strong the point is as a corner.
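        The descriptors returned above can then be matched between two images. Below is a minimal sketch (the image paths are placeholders) using OpenCV's brute-force matcher with Lowe's ratio test, which keeps a match only when the nearest descriptor is clearly closer than the second nearest:

import cv2

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                 # cv2.xfeatures2d.SIFT_create() on older versions
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)          # L2 distance suits SIFT descriptors
pairs = bf.knnMatch(desc1, desc2, k=2)   # two nearest neighbours per descriptor
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]  # Lowe's ratio test

vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches.jpg", vis)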

4. Summary

        This article introduces how to extract the features of the image. After the features are extracted, when matching , as long as the descriptors of the two feature points are close in the vector space, they can be considered as the same feature point.

References:

1. http://t.csdn.cn/OSm5h

2. http://t.csdn.cn/JyupC

3. http://t.csdn.cn/P12m0

4. Visual odometry for SLAM entry (1): Matching of feature points - Brook_icv - Blog Garden (cnblogs.com)

5. http://t.csdn.cn/5TLRn

6. http://t.csdn.cn/KWgLE
