Reprinted Article: A Detailed Interpretation of SVO

Note

This article is reproduced from "Detailed Interpretation of SVO" by the blogger "Extreme Chocolate" on Cnblogs. The original link is:
https://www.cnblogs.com/ilekoaiq/p/8659631.html

Foreword

Continued from the previous article, "Detailed Interpretation of Depth Filter".

SVO (Semi-Direct Monocular Visual Odometry) is a visual odometry method published in 2014 by Professor Scaramuzza's laboratory at the University of Zurich. As its name suggests, it is a semi-direct method, combining the feature-point method and the direct method. The algorithm has been open-sourced on GitHub ( https://github.com/uzh-rpg/rpg_svo ). He Yijia improved on the open-source version to produce SVO_Edgelet ( https://github.com/HeYijia/svo_edgelet ). Compared with the original version, SVO_Edgelet adds several features, such as initialization that combines the essential matrix and the homography matrix, and edge feature points in tracking, which greatly improve SVO's robustness.

Although SVO has a paper [1], the paper only covers the core algorithm theory; it is enough for understanding the overall idea. The specific implementation methods and tricks, however, are hidden in the source code. To master it thoroughly and apply it flexibly in practice, you still have to read the source code.

So I read through the more than 20,000 lines of source code and tried to deduce all the specific implementations, algorithms, formulas, tricks, and the author's intentions from the code.

In front of the source code, there are no secrets.

After working through all 20,000-plus lines of code, I reconstructed the concrete implementation methods in the code, studied their advantages, disadvantages, and applicable conditions, discussed possible improvements, and summarized everything in this article to share with everyone.

The program corresponding to this article is SVO_Edgelet.

Target readers of this article: SLAM algorithm engineers who have a certain understanding of SVO.

Flow Chart

[Figure: SVO flow chart]

1. Tracking

In essence, the tracking part of SVO does the same two things as ORBSLAM's TrackWithMotionModel and TrackLocalMap, except that matching is done by direct intensity (gray value) comparison instead of feature descriptors.

Unlike ORBSLAM, however, after optimizing the camera pose, SVO can optionally optimize the map points, and then optimize the map points and the camera pose together.

1.1 Initialization

When an image first comes in, a 5-level image pyramid is built from it, with a scale factor of 2 between levels.
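As a minimal sketch of this step (SVO builds the pyramid with its own half-sampling routine; cv::resize and the function name here are just for illustration):

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Build a 5-level pyramid; each level is half the resolution of the previous.
std::vector<cv::Mat> buildPyramid(const cv::Mat& img_level_0, int n_levels = 5)
{
  std::vector<cv::Mat> pyr(n_levels);
  pyr[0] = img_level_0;  // level 0 is the original image
  for (int i = 1; i < n_levels; ++i)
    cv::resize(pyr[i - 1], pyr[i], cv::Size(), 0.5, 0.5, cv::INTER_LINEAR);
  return pyr;
}
```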

The first image is then processed in processFirstFrame(): FAST corners and edge features are detected, and if more than 50 feature points are found in the image, the image is taken as the first keyframe.

The images after the first one are then processed in processSecondFrame() for triangulation-based initialization against the first frame. Starting from the first image, the feature points are tracked continuously by optical flow; the tracked pixels are undistorted and converted to depth-normalized points in the camera coordinate system, then normalized to unit length, i.e., mapped onto the unit sphere.

A frame qualifies for initialization if the number of matched points exceeds a threshold and the median of the disparities exceeds a threshold. If the variance of the disparities is large, the essential matrix E is estimated; if it is small, the homography H is estimated. If enough inliers survive the H or E estimation, the frame is considered suitable for triangulation. The pose and map points recovered from H or E are then rescaled so that the median depth becomes 1.
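A hedged sketch of this model selection, using OpenCV's estimators in place of SVO_Edgelet's own implementation (the variance and inlier thresholds are illustrative assumptions):

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Choose E or H depending on the spread of the disparities, then recover pose.
bool tryInitialize(const std::vector<cv::Point2f>& pts_ref,
                   const std::vector<cv::Point2f>& pts_cur,
                   const cv::Mat& K, double disparity_variance,
                   cv::Mat& R, cv::Mat& t)
{
  cv::Mat inlier_mask;
  if (disparity_variance > 50.0) {  // illustrative threshold
    // Large spread: general motion, estimate the essential matrix.
    cv::Mat E = cv::findEssentialMat(pts_ref, pts_cur, K, cv::RANSAC,
                                     0.999, 1.0, inlier_mask);
    return cv::recoverPose(E, pts_ref, pts_cur, K, R, t, inlier_mask) > 40;
  }
  // Small spread: near-planar scene or small parallax, use a homography;
  // the pose would then come from cv::decomposeHomographyMat.
  cv::findHomography(pts_ref, pts_cur, cv::RANSAC, 3.0, inlier_mask);
  return cv::countNonZero(inlier_mask) > 40;
}
```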

This frame is then sent, as a keyframe, to the depth filter (i.e., to the depth filter's updateSeedsLoop() thread). The depth filter searches along the epipolar line for a match for each seed point and updates the seed; a converged seed becomes a new candidate map point. If the frame is a keyframe, new seed points are also initialized: in every 25x25 grid cell on every pyramid level of this image, the FAST corner with the highest score is taken, and on the level-0 image Canny edge points are also extracted.

After that, normal tracking proceeds in processFrame().

1.2 Pose estimation based on sparse point brightness

Use the pose of the previous frame as the initial pose of the current frame.

Use the previous frame as a reference frame.

First a matrix ref_patch_cache_ with n rows and 16 columns is created, where n is the number of feature points on the reference frame and 16 is the number of pixels in a patch (i.e., each patch is 4x4 pixels).

Then a matrix jacobian_cache_ with 6 rows and n*16 columns is created. It holds the Jacobian of each patch pixel's photometric error with respect to the camera pose.
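Sketched as Eigen types (the layout matches the description above; the actual SVO members differ slightly):

```cpp
#include <Eigen/Core>

const int kPatchArea = 4 * 4;  // 16 pixels per 4x4 patch

// One row per reference feature, one column per patch pixel.
Eigen::MatrixXf ref_patch_cache_;                          // n x 16
// One 6-vector column per patch pixel of every feature.
Eigen::Matrix<double, 6, Eigen::Dynamic> jacobian_cache_;  // 6 x (n*16)

void allocateCaches(int n_features)
{
  ref_patch_cache_.resize(n_features, kPatchArea);
  jacobian_cache_.resize(Eigen::NoChange, n_features * kPatchArea);
}
```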

What is optimized is the pose of the reference frame relative to the current frame.

All the patches on the reference frame, together with their map points, are projected onto the pyramid images of the current frame. On the current frame's pyramid, optimization proceeds from the highest level down, level by level, each level inheriting the result of the previous one. If an update does not reduce the error compared to the previous iteration, the previously optimized pose is kept. Each level is iterated up to 30 times.
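The coarse-to-fine scheme, as a skeleton (Pose and optimizeLevel() are placeholders standing in for SVO's SE3 type and its per-level Gauss-Newton routine):

```cpp
struct Pose { /* SE3, e.g. Sophus::SE3d in the real code */ };

// Per-level Gauss-Newton on the photometric error (placeholder declaration).
void optimizeLevel(int level, Pose& T_cur_from_ref, int n_iter);

void sparseImageAlign(Pose& T_cur_from_ref)
{
  const int max_level = 4, min_level = 0;  // 5-level pyramid, as above
  for (int level = max_level; level >= min_level; --level) {
    // Each level starts from the pose estimated at the coarser level; inside,
    // up to 30 iterations run, and an update that does not reduce the error
    // is rolled back to the previous pose.
    optimizeLevel(level, T_cur_from_ref, /*n_iter=*/30);
  }
}
```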

The residual to be optimized is the photometric residual between the patch of a feature point on the reference frame I_{k-1} and the patch at its projected position on the current frame I_k. The projected position is obtained by extending the feature point of the reference frame I_{k-1} along its ray into 3D space, to the same depth as the corresponding map point, and then projecting that space point into the current frame I_k. This is one of SVO's innovations: the ray is extended directly from the feature point on the image rather than taken from the map point (there is a reprojection error between the map point and the feature point), which guarantees the accuracy of the patch to be projected. The extended space point still lies on the line through the feature point and the optical center; the pinhole model makes this very clean.

Another innovation of SVO (the inverse compositional formulation, analogous to inverse optical flow) is that, keeping the pixel values at the projected positions on the current frame fixed, it optimizes and adjusts the positions of the pixels warped from the reference frame, so as to minimize the residual between the two sets of pixel values. In this way, the Jacobian of the patch pixel values with respect to pixel position can be computed in advance and kept fixed. (The conventional approach is the opposite: keep the reference frame's pixel values fixed and optimize the position of the projected point on the current frame.)

The residual is expressed as:

$$E(T_{k,k-1}) = \sum_i \sum_{\Delta u \in [-l,\,l]^2} \big( I_k(\pi(T_{k,k-1} \cdot p_i) + \Delta u) - I_{k-1}(u_i + \Delta u) \big)^2$$

where $l$ is half the patch size, $u_i$ is the feature position on the reference frame $I_{k-1}$, and $p_i$ is the space point obtained by extending $u_i$ along its ray to the depth of the corresponding map point.

[Equation: chain-rule expansion of the Jacobian, shown as an image in the original post]

The Jacobian $J$ follows by cascading derivatives (image gradient, projection derivative, and the derivative of the transformed point with respect to the pose). In this forward formulation, however, $J$ changes after every iteration, a problem that optimization commonly runs into.

So an approximation is adopted. First, the space point P is considered fixed, and only the pose of the reference frame k-1 is perturbed; this perturbation does not affect the position of the projected point on the current frame, only the content of the reference patch. Then, at its new pose, the reference frame regenerates new space points, and the iteration continues. Although the optimization is only approximate, the direction of every iteration is correct and the step sizes are similar, so the optimization succeeds in the end.

[Equations: the rest of the derivation, shown as images in the original post]
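The heart of the precomputation is the fixed per-pixel Jacobian. A sketch of the standard pinhole derivation it relies on (sign conventions depend on how the residual and the pose perturbation are defined; SVO caches the product of this with the image gradient for every patch pixel):

```cpp
#include <Eigen/Core>

// 2x6 Jacobian of the unit-plane projection of a camera-frame point
// p = (x, y, z) with respect to a left-multiplied SE(3) twist [v, w].
Eigen::Matrix<double, 2, 6> projectionJacobian(const Eigen::Vector3d& p)
{
  const double x = p.x(), y = p.y();
  const double zi = 1.0 / p.z(), zi2 = zi * zi;
  Eigen::Matrix<double, 2, 6> J;
  J << zi, 0.0, -x * zi2, -x * y * zi2, 1.0 + x * x * zi2, -y * zi,
       0.0, zi, -y * zi2, -(1.0 + y * y * zi2), x * y * zi2, x * zi;
  return J;
}

// For one patch pixel with image gradient (dx, dy) on the reference frame,
// the cached 1x6 row of jacobian_cache_ is (up to sign and focal scaling):
//   J_pix = [dx, dy] * focal_length * projectionJacobian(p_ref);
```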

1.3 Patch-based feature point matching

Because the current frame already has the pose estimated in 1.2, co-visible keyframes can be found. For each keyframe in the keyframe list, five points scattered over its image are projected onto the current frame; if any of them projects successfully, the keyframe is considered co-visible. All co-visible keyframes are then sorted by their distance to the current frame. Then, from the nearest keyframe to the farthest, the map points corresponding to their feature points are projected onto the current frame in turn, each map point being projected only once. If an 8x8 patch can be taken at a map point's projected position on the current frame, the map point is stored in the grid cell at that position.

Then all the candidate map points are projected onto the current frame as well. If an 8x8 patch can be obtained at the projected position, the candidate is stored in the grid cell at that position. If a candidate point fails to project for 10 frames, it is deleted.

Then, within each grid cell, the map points are sorted by quality (TYPE_GOOD > TYPE_UNKNOWN > TYPE_CANDIDATE > TYPE_DELETED); points of TYPE_DELETED are removed from the cell.

Each map point in a cell is traversed, and all keyframes that observed it are collected. For each such keyframe, the angle between the ray from its optical center to the map point and the ray from the map point to the current frame's optical center is computed; the keyframe with the smallest angle, together with its corresponding feature point, is selected as the reference. (Note that this angle criterion only suits scenarios where the viewing direction is always downward, as on a drone. It should be changed to ORBSLAM's approach: transform the viewing direction into the corresponding camera coordinate system first, and then filter.) The chosen feature point must allow a 10x10 patch to be taken on its own pyramid level.

Then the affine matrix is computed. First obtain the length of the ray from the reference frame's optical center to the map point. Then, for the corresponding feature point on its own pyramid level, take the pixel 5 positions to the right and the pixel 5 positions below, map them back to level 0, back-project them onto the unit sphere, and extend them into 3D space until their length equals that of the map point's ray. Then project these 3 points into the (distorted) image of the current frame. The affine matrix A_cur_ref is computed from their positions relative to the central projection point: A_cur_ref.col(0) = (px_du - px_cur)/halfpatch_size; A_cur_ref.col(1) = (px_dv - px_cur)/halfpatch_size;. The affine matrix A maps patches on the reference frame, at their own pyramid level, to level 0 of the current frame. (Expressing the scale change as a matrix like this is very elegant.)
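A sketch of this computation, paraphrasing SVO's getWarpMatrixAffine (CameraModel is a placeholder for the camera class; cam2world back-projects a pixel to a unit-norm bearing, world2cam projects including distortion):

```cpp
#include <Eigen/Core>
#include <Eigen/Geometry>

struct CameraModel {
  Eigen::Vector3d cam2world(const Eigen::Vector2d& px) const;  // unit bearing
  Eigen::Vector2d world2cam(const Eigen::Vector3d& p) const;   // incl. distortion
};

Eigen::Matrix2d computeAffineWarp(const CameraModel& cam_ref,
                                  const CameraModel& cam_cur,
                                  const Eigen::Vector2d& px_ref,  // level-0 coords
                                  int level_ref,
                                  double dist_ref,  // |map point - ref optical center|
                                  const Eigen::Isometry3d& T_cur_ref)
{
  const int halfpatch_size = 5;
  const double scale = double(1 << level_ref);  // level_ref pixel -> level 0
  // The feature, and the pixels 5 to the right / 5 below on its own level,
  // each extended along its ray to the map point's distance.
  const Eigen::Vector3d xyz_ref = cam_ref.cam2world(px_ref) * dist_ref;
  const Eigen::Vector3d xyz_du =
      cam_ref.cam2world(px_ref + Eigen::Vector2d(halfpatch_size, 0) * scale) * dist_ref;
  const Eigen::Vector3d xyz_dv =
      cam_ref.cam2world(px_ref + Eigen::Vector2d(0, halfpatch_size) * scale) * dist_ref;
  // Project the 3 points into the (distorted) current image.
  const Eigen::Vector2d px_cur = cam_cur.world2cam(T_cur_ref * xyz_ref);
  const Eigen::Vector2d px_du  = cam_cur.world2cam(T_cur_ref * xyz_du);
  const Eigen::Vector2d px_dv  = cam_cur.world2cam(T_cur_ref * xyz_dv);
  Eigen::Matrix2d A_cur_ref;
  A_cur_ref.col(0) = (px_du - px_cur) / halfpatch_size;
  A_cur_ref.col(1) = (px_dv - px_cur) / halfpatch_size;
  return A_cur_ref;
}
```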

Then the target search level on the current frame is computed from the determinant of the affine matrix A_cur_ref, which is exactly the area magnification. While the magnification exceeds 3, go up one level, which reduces the magnification to 1/4 of its previous value, until it is no longer greater than 3 or the top level is reached. This gives the search level.
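This mirrors SVO's getBestSearchLevel:

```cpp
#include <Eigen/Core>

// Go up one pyramid level (area shrinks 4x) while the area magnification,
// det(A_cur_ref), is still above 3.
int getBestSearchLevel(const Eigen::Matrix2d& A_cur_ref, int max_level)
{
  int search_level = 0;
  double D = A_cur_ref.determinant();
  while (D > 3.0 && search_level < max_level) {
    ++search_level;
    D *= 0.25;
  }
  return search_level;
}
```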

Then the inverse A_ref_cur of the affine matrix is computed. Taking the projected point as the center of a 10x10 patch at the search level, the position of each patch pixel (relative to the center) is mapped through A_ref_cur to a pixel position (relative to the feature point) on the corresponding level of the reference frame, where the intensity is obtained by interpolation. In other words, pixels near the feature point on the reference frame are warped to the neighborhood of the projected point at the search level of the current frame, and these warped pixels exactly form a 10x10 patch.

Then an 8x8 patch is taken from the warped 10x10 patch as the reference patch. The position of this patch is optimized so that it best matches the patch at the target location. The residual is expressed as:
[Equation: feature-alignment residual, shown as an image in the original post]
Here, SVO has two innovations.
[Equations: the two innovations, shown as images in the original post]
If a map point of type TYPE_UNKNOWN has failed to match more than 15 times, it is changed to TYPE_DELETED; if a point of type TYPE_CANDIDATE has failed to match more than 30 times, it is likewise changed to TYPE_DELETED.

If the matching succeeds, a new feature point (with coordinates and pyramid level) is created on the current image, pointing to that map point. If the feature point on the reference frame is an edge point, the new feature point is also marked as an edge point, and its gradient is obtained by applying the affine warp to the reference gradient and normalizing it.

Within each grid cell, as soon as one map point is matched, the loop over that cell is exited. Once 180 cells have been matched successfully, the loop over all cells is exited directly. After the loop ends, if fewer than 30 cells were matched successfully, tracking of the current frame is considered to have failed.

1.4 Further optimize the pose

[Equations: the pose refinement step, shown as images in the original post]
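A minimal sketch of what this step amounts to: SVO's pose_optimizer refines the camera pose alone, by Gauss-Newton on the unit-plane reprojection errors of the matched features (the robust weighting and termination checks of the real code are omitted here):

```cpp
#include <Eigen/Dense>
#include <sophus/se3.hpp>
#include <vector>

struct Observation {
  Eigen::Vector3d p_world;  // matched map point
  Eigen::Vector2d uv;       // measurement on the unit plane (x/z, y/z)
};

void optimizePose(Sophus::SE3d& T_cam_world,
                  const std::vector<Observation>& obs, int n_iter = 10)
{
  for (int it = 0; it < n_iter; ++it) {
    Eigen::Matrix<double, 6, 6> H = Eigen::Matrix<double, 6, 6>::Zero();
    Eigen::Matrix<double, 6, 1> b = Eigen::Matrix<double, 6, 1>::Zero();
    for (const auto& o : obs) {
      const Eigen::Vector3d p = T_cam_world * o.p_world;
      const Eigen::Vector2d e = o.uv - p.head<2>() / p.z();  // residual
      const double x = p.x(), y = p.y(), zi = 1.0 / p.z(), zi2 = zi * zi;
      Eigen::Matrix<double, 2, 6> J;  // d(projection)/d(left twist)
      J << zi, 0, -x * zi2, -x * y * zi2, 1 + x * x * zi2, -y * zi,
           0, zi, -y * zi2, -(1 + y * y * zi2), x * y * zi2, x * zi;
      H += J.transpose() * J;
      b += J.transpose() * e;
    }
    const Eigen::Matrix<double, 6, 1> dx = H.ldlt().solve(b);
    T_cam_world = Sophus::SE3d::exp(dx) * T_cam_world;  // left update
  }
}
```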

1.5 Optimize map points

[Equations: the map-point refinement step, shown as images in the original post]
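Likewise, a minimal sketch of the structure-only step: each map point is refined by Gauss-Newton on its unit-plane reprojection errors while the observing keyframe poses stay fixed (simplified from SVO's structure optimization; robust weighting omitted):

```cpp
#include <Eigen/Dense>
#include <sophus/se3.hpp>
#include <vector>

struct PointObs {
  Sophus::SE3d T_cam_world;  // fixed keyframe pose
  Eigen::Vector2d uv;        // measurement on the unit plane
};

void optimizePoint(Eigen::Vector3d& p_world,
                   const std::vector<PointObs>& obs, int n_iter = 5)
{
  for (int it = 0; it < n_iter; ++it) {
    Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
    Eigen::Vector3d b = Eigen::Vector3d::Zero();
    for (const auto& o : obs) {
      const Eigen::Vector3d p = o.T_cam_world * p_world;
      const Eigen::Vector2d e = o.uv - p.head<2>() / p.z();
      const double zi = 1.0 / p.z(), zi2 = zi * zi;
      Eigen::Matrix<double, 2, 3> J_proj;  // d(projection)/d(camera point)
      J_proj << zi, 0, -p.x() * zi2,
                0, zi, -p.y() * zi2;
      // Chain rule: camera point = R * world point + t.
      const Eigen::Matrix<double, 2, 3> J =
          J_proj * o.T_cam_world.rotationMatrix();
      H += J.transpose() * J;
      b += J.transpose() * e;
    }
    p_world += H.ldlt().solve(b);  // Gauss-Newton update of the point
  }
}
```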

1.6 BA

There is an option in SVO to enable the BA function using g2o.

If this function is enabled, then after the first two images are initialized, BA is run on the two frames and the initialized map points, using the g2o templates.

In addition, after the map points are optimized in 1.5, all keyframes and map points in the window, or all global keyframes and map points, are optimized, again using the g2o templates.

1.7 Inspiration for Distorted Image Processing

SVO tracks entirely on the distorted fisheye image; the image is never undistorted, so as to preserve the original image information as much as possible.

And because of the inverse patch-Jacobian method of 1.2, besides speeding up the computation, the Jacobian of the distortion parameters is avoided: if the forward formulation on the image were used, the distortion parameters would also have to be taken into account when computing the Jacobian.

In 1.3, patch matching is performed with the distorted patches, which preserves accuracy. To avoid computing the Jacobian of the distortion parameters, once matching is complete, the positions of the projected points and the matched points are converted from the distorted image to the unit plane, and from then on the reprojection error is computed on the unit plane rather than on the distorted image.

2. Create map points

Feature extraction is placed in the mapping thread. This differs from ORBSLAM: during tracking, SVO does not need to extract features and then match them; it matches directly by the intensity difference of patches.

VINS could also borrow this idea: put feature extraction into the mapping thread and match features between consecutive frames with optical flow. Optical flow, however, requires that consecutive frames not differ too much.

In SVO, the back end extracts feature points only on keyframes, using FAST with a pyramid. To find matches on a new keyframe for the feature points of the previous keyframes, an epipolar search is used to find the point with the smallest intensity difference. Finally, the depth filter (depthfilter) is used to converge this map point accurately.

30 map points are selected; if the median disparity of these 30 map points between the current frame and the nearest keyframe is greater than 40, or the distance to the keyframes in the window is greater than a threshold, a new keyframe is considered necessary. The current frame is then set as a keyframe and processed as follows.
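The criterion, as a small sketch (the 40 px disparity threshold comes from the text above; the distance threshold is scene-dependent):

```cpp
// A new keyframe is needed when the median disparity of the sampled map
// points exceeds 40 px, or the camera is far from every keyframe in the window.
bool needNewKeyframe(double median_disparity_px,
                     double dist_to_closest_kf, double dist_thresh)
{
  return median_disparity_px > 40.0 || dist_to_closest_kf > dist_thresh;
}
```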

2.1 Initialize the seed

When a keyframe arrives, it is processed as follows. On the current image, a grid of 25x25-pixel cells is laid out.

First, the feature points that already exist on the current frame occupy their grid cells.

On the 5-level pyramid of the current frame, FAST corners are extracted on every level and first filtered by non-maximum suppression within a 3x3 neighborhood. For each remaining point, the Shi-Tomasi score is computed (similar to the Harris corner score). All points are then mapped into the level-0 grid, and each cell keeps only the point whose score is the largest and above a threshold.

Edge points are searched for only on level 0. The same grid is used: within each cell, Canny edges are detected, the gradient magnitude is computed for the points on the Canny edges, and the point with the largest gradient magnitude is kept as the edge point. The gradient is two-dimensional, i.e., the horizontal and vertical gradients at the point. The program uses cv::Scharr combined with cv::magnitude to quickly compute the gradients of all points in the horizontal and vertical directions.
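The gradient computation, as it can be done with OpenCV (matching the cv::Scharr + cv::magnitude combination mentioned above):

```cpp
#include <opencv2/imgproc.hpp>

// Per-pixel gradient magnitude: Scharr derivatives in x and y, then
// magnitude = sqrt(dx^2 + dy^2) for every pixel.
void gradientMagnitude(const cv::Mat& gray, cv::Mat& dx, cv::Mat& dy, cv::Mat& mag)
{
  cv::Scharr(gray, dx, CV_32F, 1, 0);  // horizontal derivative
  cv::Scharr(gray, dy, CV_32F, 0, 1);  // vertical derivative
  cv::magnitude(dx, dy, mag);
}
```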

Then all the new feature points are initialized as seed points. The inverse depth is represented by a Gaussian distribution: the mean is the reciprocal of the average scene depth; the inverse-depth range is the reciprocal of the current frame's minimum depth, i.e., 1.0/depth_min; and the standard deviation of the Gaussian is one sixth of that range, i.e., (1/6)*1.0/depth_min.
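This matches the Seed constructor in SVO's depth filter; as a sketch (a and b are the Beta-distribution counters used by the filter, both starting at 10 in the code):

```cpp
// Inverse-depth seed, one per new feature (mirrors svo::Seed).
struct Seed {
  double mu;       // mean of the inverse-depth Gaussian: 1 / depth_mean
  double z_range;  // inverse-depth range: 1 / depth_min
  double sigma2;   // variance; std dev is one sixth of the range
  double a, b;     // Beta-distribution inlier/outlier counters
};

Seed initSeed(double depth_mean, double depth_min)
{
  Seed s;
  s.mu = 1.0 / depth_mean;
  s.z_range = 1.0 / depth_min;
  s.sigma2 = (s.z_range / 6.0) * (s.z_range / 6.0);
  s.a = s.b = 10.0;
  return s;
}
```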

2.2 Update seed, depth filter

Whenever a new frame arrives, keyframe or ordinary frame, it is used to update the seed points, including those initialized on earlier keyframes. For each seed point, the search range of the inverse depth is its mean plus or minus one standard deviation; these parameters belong to the frame in which the seed was initialized.

Then the shortest and longest depths on the depth ray are mapped onto the unit-depth plane of the current frame, which in effect gives the epipolar line segment on the unit plane. The depth corresponding to the mean of the inverse depth is also mapped into the current frame and, with the same method as in 1.3, the patch affine matrix and the optimal search level are obtained.

(For edge points: if the angle between the gradient direction, after being warped by the affine matrix, and the epipolar direction is greater than 45 degrees, then searching along the epipolar line would hardly change the patch pixels, so no search is performed and false is returned directly.)

The epipolar segment is projected onto the chosen level; if the pixel distance between its two endpoints is less than 2 pixels, the position is optimized directly, using the patch-matching method of 1.3 on the warped patch. Once the best matching position is found, triangulation is performed; for the method, refer to the triangulation in "Fourteen Lectures on Visual SLAM", computed with matrix blocks.

[Equation: triangulation, shown as an image in the original post]
If the pixel distance between the two endpoints is greater than 2 pixels, a search is performed along the epipolar line. First the total number of steps is determined: the distance between the two endpoints divided by 0.7 gives n_steps. The epipolar segment on the unit-depth plane is then divided into n_steps pieces; walking from one endpoint to the other, each step's position is projected (including distortion) onto the image of the chosen level and, after the coordinates are rounded, a patch is taken. (This could be improved: the coordinates should be interpolated instead of rounded.) Then the similarity between the reference patch and the patch at the projected position is computed. The similarity formula is as follows; it has the effect of eliminating the mean.
$$\mathrm{ZMSSD} = \sum_i \big( (A_i - \bar{A}) - (B_i - \bar{B}) \big)^2$$

where $A$ and $B$ are the reference patch and the candidate patch, and $\bar{A}$, $\bar{B}$ are their mean intensities.
[Equations: the remainder of the seed update, shown as images in the original post]
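A sketch of the search loop and the zero-mean score above (patch extraction and projection are placeholders, not SVO's actual API):

```cpp
#include <Eigen/Core>
#include <cstdint>
#include <limits>

// Zero-mean SSD between two 8x8 patches: subtracting each patch's mean
// removes brightness offsets, the "eliminate the mean" effect above.
double zmssd(const uint8_t* A, const uint8_t* B, int n = 64)
{
  double mean_A = 0, mean_B = 0;
  for (int i = 0; i < n; ++i) { mean_A += A[i]; mean_B += B[i]; }
  mean_A /= n; mean_B /= n;
  double score = 0;
  for (int i = 0; i < n; ++i) {
    const double d = (A[i] - mean_A) - (B[i] - mean_B);
    score += d * d;
  }
  return score;
}

// Placeholders: warped reference patch, and the patch at a projected position.
const uint8_t* referencePatch();
const uint8_t* patchAtUnitPlanePoint(const Eigen::Vector2d& uv);

// Walk the unit-plane epipolar segment [uv_A, uv_B] in ~0.7 px image steps.
Eigen::Vector2d searchEpipolarLine(const Eigen::Vector2d& uv_A,
                                   const Eigen::Vector2d& uv_B,
                                   double epi_length_px)
{
  const int n_steps = static_cast<int>(epi_length_px / 0.7);
  if (n_steps < 1) return uv_A;
  const Eigen::Vector2d step = (uv_B - uv_A) / n_steps;
  Eigen::Vector2d uv = uv_A, best_uv = uv_A;
  double best = std::numeric_limits<double>::max();
  for (int i = 0; i <= n_steps; ++i, uv += step) {
    const double s = zmssd(referencePatch(), patchAtUnitPlanePoint(uv));
    if (s < best) { best = s; best_uv = uv; }
  }
  return best_uv;  // the best match is then refined and triangulated
}
```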

3. Relocation

Relocalization in SVO is implemented very simply: after tracking is lost, the pose of the current frame is still assumed to equal that of the previous frame, the map points are projected onto this pose, and the optimization of Part 1 is run. If it succeeds, the system is relocalized; if not, it moves on to the next frame. Consequently, after tracking is lost, the only way to relocalize is to move the camera back to the pose where tracking was lost.

Implemented this way, relocalization is very simple, but its effect is very poor. This is a place that could be improved.

4. Summary

SVO localizes well, with very little jitter. Especially in environments with repetitive texture, it performs better than the feature-based ORBSLAM2.

In the future, more robust relocalization, loop closure, and a global map could be added, to serve more practical application scenarios such as indoor robots, drones, and autonomous vehicles.

Origin blog.csdn.net/guanjing_dream/article/details/129346200