Introduction to the VSLAM processing framework

I often hear that VSLAM technology is used in robot vacuum (sweeper) products. So what exactly is VSLAM? Let’s take a look.

What is VSLAM?

VSLAM stands for Visual Simultaneous Localization and Mapping, and refers to solving the localization and mapping problems with cameras. When a camera is used as the sensor, the task is to infer the camera's motion and the surrounding environment from a continuous sequence of images (i.e., a video).

vslam

In the picture above, the upper right corner shows the image data collected by the camera while walking; the many small yellow boxes on the frames are tracked features. On the left, the trajectory traced by the green triangles matches the path recorded in the upper-right video. This is a VSLAM system: the localization and mapping results are clearly visible on the left side.

Now that we have a rough idea of what VSLAM does, what does its framework look like?

The technical framework of VSLAM mainly consists of five parts: sensor data preprocessing, the front end, the back end, loop closure detection, and mapping.

vslam framework

The figure above shows the data flow: the image data captured by the sensor is passed to the front end, then to the back end, and finally the map is built; loop closure detection runs alongside this process. What exactly does each module do? They are introduced one by one below. The content of this article is based on Dr. Gao Xiang's books. If you need a more systematic and comprehensive introductory course, I recommend the visual SLAM theory and practice course at Deep Blue Academy taught by Dr. Gao Xiang: Deep Blue Academy visual SLAM theory and practice.

VSLAM technology framework

The image data is captured by the sensor. It is not sent directly to the front end for processing; there is a preprocessing step first. What does preprocessing do?

Sensor data preprocessing

The sensors in VSLAM include cameras, inertial measurement units (IMUs), and so on. Preprocessing mainly consists of filtering the images (median filtering, Gaussian filtering, etc.) and distortion correction.

First of all, what kind of camera should we choose? Take a look at the image below.

camera comparison

  • Monocular: a single sensor with low BOM cost and no limit on sensing range, but it suffers from scale uncertainty: the distance between the object and the camera cannot be determined from a single image. In addition, when VSLAM is initialized, the first inter-frame motion cannot be pure rotation; it must contain some translation.
  • Binocular (stereo): two sensors make it possible to compute the distance of an object from the camera (by triangulation), and the sensing range is not limited. However, computing depth from stereo is computationally expensive, and the baseline (the distance between the two sensors) must be known and calibrated.
  • RGB-D: besides image data, it actively measures the depth of each pixel, which is convenient for 3D point cloud reconstruction. However, the active measurement range is relatively small, the measurement is easily disturbed by sunlight, and it cannot measure transparent materials such as glass.

After the image data is obtained, the image is filtered, usually with a median or Gaussian filter. Filtering mainly removes noise from the image and reduces interference. A before-and-after comparison is shown below:

Comparison before and after filtering
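For illustration, here is a minimal filtering sketch using OpenCV; the file name is a placeholder and the kernel sizes are just example values:

```python
import cv2

# Load a grayscale frame (the path is a placeholder for illustration).
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Median filter: good at removing salt-and-pepper noise; kernel size must be odd.
median = cv2.medianBlur(img, 5)

# Gaussian filter: smooths high-frequency noise; passing sigma=0 lets OpenCV
# derive it from the kernel size.
gaussian = cv2.GaussianBlur(img, (5, 5), 0)

cv2.imwrite("frame_median.png", median)
cv2.imwrite("frame_gaussian.png", gaussian)
```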

After filtering comes distortion correction. The camera lens is a convex lens, and because of the refraction of the lens, straight lines in the real environment become curves in the image. The main types of distortion are pincushion distortion and barrel distortion, as shown below:

Distortion

Therefore, so that subsequent modules can extract image information more accurately, the image needs to be undistorted.
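A minimal undistortion sketch with OpenCV might look like the following; the intrinsic matrix and distortion coefficients here are placeholder values that would normally come from camera calibration (e.g., cv2.calibrateCamera with a checkerboard):

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients (k1, k2, p1, p2, k3).
K = np.array([[458.0, 0.0, 367.0],
              [0.0, 457.0, 248.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.28, 0.07, 0.0, 0.0, 0.0])

img = cv2.imread("frame.png")          # placeholder path
undistorted = cv2.undistort(img, K, dist)
cv2.imwrite("frame_undistorted.png", undistorted)
```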

After preprocessing, the camera image data will come to the front-end module. So, what does the front-end do?

Front end (visual odometry)

The front end, also known as visual odometry (VO), studies how to quantitatively estimate the camera motion between adjacent frames. By chaining the inter-frame motions together, the trajectory of the camera carrier (such as a robot) is obtained, which solves the localization problem. Then, based on the estimated camera pose at each moment, the 3D position of the points observed in each pixel is computed, which gives a map.
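To make the "chaining" of inter-frame motions concrete, here is a small sketch (assuming NumPy, with made-up relative motions) that multiplies relative transforms together to form a trajectory:

```python
import numpy as np

def to_T(R, t):
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical inter-frame motions produced by the front end: each entry is the
# pose of the current frame expressed in the previous frame's coordinates.
relative_motions = [
    to_T(np.eye(3), np.array([0.10, 0.00, 0.0])),
    to_T(np.eye(3), np.array([0.10, 0.02, 0.0])),
    to_T(np.eye(3), np.array([0.09, 0.03, 0.0])),
]

# Chaining: T_world_k = T_world_{k-1} @ T_{k-1,k}
T_world = np.eye(4)
trajectory = [T_world[:3, 3].copy()]
for T_rel in relative_motions:
    T_world = T_world @ T_rel
    trajectory.append(T_world[:3, 3].copy())

print(np.array(trajectory))  # the camera positions that form the trajectory
```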

Before introducing the specific operations of the front end, let's first take a look at what the image data processed by the front end looks like.

ORB-SLAM2

The picture above shows image data processed by the open-source VSLAM algorithm ORB-SLAM2. The small green squares in the left window are the extracted ORB feature points; in the right window, the green line is the camera's trajectory, the blue squares are the camera poses during the motion (i.e., keyframes), and the black and red dots form the sparse map of the environment (black for historical landmarks, red for current landmarks).

What are the feature points mentioned above, and how is the inter-frame motion estimated from adjacent frames?

First of all, to estimate the motion between adjacent frames, you need reference objects in the images. These reference objects can simply be understood as feature points. Therefore, the first step in estimating inter-frame motion is to extract feature points from each frame.

Feature points

How are feature points determined? How do you consider a point to be a feature point?

FAST is a type of corner. The FAST algorithm defines a feature point as follows: if a pixel's brightness differs from that of enough pixels in its neighborhood, then this pixel is a corner (feature point).

The detailed calculation steps of this algorithm are as follows:

  1. Select a pixel p in the image;
  2. Take a circle of radius 3 centered on p; there are 16 pixels on this circle;
  3. Check whether there are N consecutive pixels on the circle whose brightness differs from that of the center pixel p by more than a threshold (either brighter or darker);
  4. If such N consecutive pixels exist, the center p is a feature point.

In practice, for efficiency, it is not necessary to test all 16 pixels first: the brightness of the 1st, 5th, 9th, and 13th pixels on the circle is checked directly, and only if at least 3 of these 4 pixels exceed the threshold is the point considered a candidate feature point.
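As a small illustration, OpenCV ships a FAST detector; the sketch below (with a placeholder image path and an example threshold) detects and draws the corners:

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# 'threshold' is the brightness difference against the center pixel;
# non-maximum suppression keeps only the locally strongest corners.
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(img, None)
print("FAST corners detected:", len(keypoints))

vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("fast_corners.png", vis)
```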

After the feature points are detected, a descriptor is computed for each of them. The descriptor encodes brightness comparisons between random pairs of pixels in the neighborhood of the feature point. Because there are many feature points, it is not known which ones correspond to each other across frames; descriptors are needed to express what surrounds each feature point so that they can be distinguished and matched.

After obtaining the feature points, feature matching between images will be performed. The following is a set of images to demonstrate this process.

After extracting feature points

Above are two similar images; the content of the right image is shifted to the left compared with the left image. First, feature points are extracted from the two frames. If the extracted feature points are matched directly, the result looks like the following.

After feature point matching

As you can see above, there are many wrong matches, which would affect the final calculation, so the wrong matches need to be filtered out. How? By computing the Hamming distance between the feature descriptors and removing matches whose distance is relatively large. After removal, the result looks like this:

Matches after filtering
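A minimal sketch of this extract-match-filter pipeline with OpenCV's ORB features and a brute-force Hamming matcher might look like this (file names are placeholders, and the distance rule is just one common heuristic):

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# ORB = FAST keypoints + rotation-aware BRIEF-style binary descriptors.
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (suitable for binary descriptors).
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)

# Common heuristic: discard matches whose Hamming distance is much larger
# than the best one (the exact rule varies between implementations).
min_dist = min(m.distance for m in matches)
good = [m for m in matches if m.distance <= max(2 * min_dist, 30.0)]
print(f"{len(matches)} raw matches, {len(good)} kept after filtering")

vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches_filtered.png", vis)
```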

As the filtered result shows, the remaining matches are basically correct. With the matched feature points of adjacent frames, the next step is to estimate the camera motion between the frames, i.e., to compute the rotation matrix R and the translation vector t. For a monocular camera, the inter-frame relationship (R, t) is obtained from epipolar geometry.

epipolar geometric constraints

Take the figure above as an example: O1 and O2 are the camera centers of the two frames, and P is a point in 3D space. The image I1 captured at O1 contains a feature point p1, and the image I2 captured at O2 contains the corresponding feature point p2. If the match is correct, the two points are projections of the same spatial point onto the two image planes. The three points O1, O2, and P determine a plane called the epipolar plane; e1 and e2 are the epipoles, and O1O2 is the baseline.

So how to calculate this inter-frame relationship? Let’s analyze the epipolar geometric constraints from an algebraic perspective.

Algebraic Analysis Epipolar Geometry
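For reference, the standard algebraic form of the epipolar constraint (which the figure above derives) can be written as follows; here x1, x2 denote the normalized camera coordinates of p1 and p2, K is the camera intrinsic matrix, and t^ is the skew-symmetric matrix of t. The notation follows the common convention and may differ slightly from the original figure:

```latex
% Epipolar constraint in normalized camera coordinates x_1, x_2:
\[
  x_2^{\top} E \, x_1 = 0, \qquad E = t^{\wedge} R \quad \text{(essential matrix)}
\]
% The same constraint in pixel coordinates p_1, p_2:
\[
  p_2^{\top} F \, p_1 = 0, \qquad F = K^{-\top} E \, K^{-1} \quad \text{(fundamental matrix)}
\]
```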

From the above we see how to obtain the rotation matrix and translation vector from two frames, but 2D-2D camera pose estimation has several problems:

  1. Scale uncertainty: the estimated translation vector t has no physical unit; the camera's movement is only known up to scale. Therefore the translation between the first two frames is normalized (e.g., to unit length), and all subsequent motion is expressed relative to this scale.
  2. Pure-rotation problem at initialization: monocular initialization must contain some translation; if t approaches 0, the estimate of R becomes unreliable;
  3. More than 8 point pairs: random sample consensus (RANSAC) is used for the estimation to be robust against erroneous matches (see the sketch after this list).
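Putting these pieces together, a minimal sketch of monocular inter-frame motion estimation with OpenCV (using RANSAC inside the essential-matrix estimation, as mentioned in point 3 above) could look like this; pts1, pts2, and K are assumed inputs:

```python
import cv2
import numpy as np

def estimate_motion(pts1, pts2, K):
    """Estimate (R, t) between two frames from matched pixel coordinates.

    pts1, pts2: Nx2 float arrays of matched pixels (e.g., from the filtered
    ORB matches above); K: 3x3 camera intrinsic matrix (assumed calibrated).
    """
    # Essential matrix with RANSAC to reject wrong matches (outliers).
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K,
                                          method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # Decompose E and pick the physically valid (R, t) via the cheirality check.
    # Note: for a monocular camera t is only known up to scale.
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t, mask
```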

Through the above steps, we obtain the relative motion between frames and chain them together to form a trajectory. What happens if there is only front-end processing? Why do we need a back end? Let’s look at the back end next.

Back end

Visual odometry only computes the motion between adjacent frames and performs local estimation, which inevitably causes accumulated drift. Each estimate of the motion between two images carries some error; as frame after frame is processed, the earlier errors accumulate and the trajectory drifts more and more. The figure below illustrates this drift well.

Trajectory drift without back-end processing

This trajectory drift can be mitigated through two modules, back-end optimization and loop closure detection, which prevent errors from accumulating.

The back end mainly optimizes the results of the front end to obtain the optimal pose estimate, i.e., it asks under which states the currently observed data is most likely to have been produced. So what exactly does it do? The picture below illustrates it well.

linear algebra

Solving for the rotation matrices and translation vectors between frames is analogous to the curve fitting shown in the figure above: given a series of points (X, Y), we solve for the function parameters that best fit them, minimizing the error between the function curve and each point.

For the robot, X and Y enter the observation equation (X: the pose, i.e., position and attitude, of the current robot; Y: the coordinates of the observed landmarks, i.e., the feature points in the image). The front end has in fact already provided estimates of these quantities, but there is some error between the predicted observations and the actual observations, and the back end needs to minimize this error. How?

That is, the distance between the observed values and the values predicted by the observation equation should be as small as possible. The principle of least squares is commonly used: determine the parameters by minimizing the sum of squared residuals. The following takes the classic Gauss-Newton method as an example to introduce back-end optimization.

The basic idea of the Gauss-Newton method is to approximate the nonlinear model with a first-order Taylor expansion and then, through repeated iterations, correct the parameters so that they keep approaching the optimum of the nonlinear model, finally minimizing the sum of squared residuals of the original model.

Gauss-Newton method

In the above, β can be understood as the R and t obtained by the VSLAM solution, and J is the Jacobian of the residuals with respect to β. When the update Δβ is small enough, the optimization stops; otherwise, the updated estimate is substituted back and the optimization continues.
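To make the iteration concrete, here is a small self-contained Gauss-Newton sketch in NumPy that fits the parameters of a made-up exponential model rather than camera poses; β = (a, b), J is the Jacobian of the residuals, and the loop stops when Δβ is small enough, mirroring the description above:

```python
import numpy as np

# Synthetic data for a model y = exp(a*x + b); (a, b) plays the role of beta.
rng = np.random.default_rng(0)
a_true, b_true = 1.0, 2.0
x = np.linspace(0, 1, 100)
y = np.exp(a_true * x + b_true) + rng.normal(0, 0.05, x.size)

a, b = 0.5, 1.5  # initial guess for beta
for it in range(20):
    pred = np.exp(a * x + b)
    r = y - pred                        # residuals
    # Jacobian of the residuals w.r.t. (a, b): dr/da = -x*pred, dr/db = -pred
    J = np.column_stack([-x * pred, -pred])
    # Gauss-Newton normal equations: (J^T J) * delta = -J^T r
    delta = np.linalg.solve(J.T @ J, -J.T @ r)
    a, b = a + delta[0], b + delta[1]
    if np.linalg.norm(delta) < 1e-6:    # stop when the update is small enough
        break

print(f"estimated a={a:.3f}, b={b:.3f} (true values {a_true}, {b_true})")
```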

In actual engineering, back-end least-squares optimization is mostly performed with g2o, a graph optimization library. (Because SLAM involves many error terms with complex relationships between them, and the system is nonlinear and non-Gaussian, graph optimization is the approach adopted by most developers.) Graph optimization expresses the optimization problem as a graph: vertices represent the optimization variables and edges represent the error terms (constraints), so that a graph corresponding to the nonlinear least-squares problem is constructed and optimized. Within g2o you can choose the Gauss-Newton method, the Levenberg-Marquardt method, etc., to solve the nonlinear least-squares problem.

Loop closure detection

The main purpose of loop closure detection is to allow the robot to recognize places it has visited before, thereby correcting the position drift that accumulates over time.

Loop closure detection decides whether a loop has occurred based on the similarity between two images. In current engineering practice, the bag-of-words model is used to solve this problem. Bag-of-Words (BoW) describes an image by "which visual features appear in the image". The bag stores words, and many words form a dictionary; the words in the dictionary are stored in a k-ary tree structure.

To perform loop closure detection, the following operations are required:

  1. Create a dictionary and organize the feature descriptors of the images into a k-ary tree;
  2. While creating the dictionary, compute a weight for each word;
  3. To compare two images, determine which words the feature points in each image correspond to and compute the similarity between the frames (a small sketch follows this list).
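The following is a simplified sketch of the bag-of-words idea using OpenCV and NumPy: it clusters ORB descriptors into a small flat vocabulary with k-means (real systems such as DBoW2 use a k-ary tree and weighted words) and compares two frames by the cosine similarity of their word histograms. The file names and vocabulary size are placeholders:

```python
import cv2
import numpy as np

def orb_descriptors(path):
    """Extract ORB descriptors from an image (path is a placeholder)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=500)
    _, des = orb.detectAndCompute(img, None)
    return des.astype(np.float32)  # k-means below needs float data

# 1. Build a small flat vocabulary by clustering descriptors from training images.
train = np.vstack([orb_descriptors(p) for p in ["f1.png", "f2.png", "f3.png"]])
k = 100  # vocabulary size; real dictionaries are much larger
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, vocab = cv2.kmeans(train, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)

def bow_vector(des, vocab):
    """Histogram of nearest vocabulary words, L2-normalized."""
    d2 = ((des[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-12)

# 2. Compare two frames: cosine similarity of their BoW vectors.
v1 = bow_vector(orb_descriptors("query.png"), vocab)
v2 = bow_vector(orb_descriptors("candidate.png"), vocab)
print("similarity:", float(v1 @ v2))  # a high value suggests a loop closure candidate
```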

Loop closure detection also needs to pay attention to the following issues when using the dictionary:

  1. Use a sufficiently large dictionary, otherwise the similarity scores of different images will not differ enough;
  2. Use keyframes: because adjacent frames are highly similar, the frames used for loop closure detection should be sparse.

Mapping

Mapping stitches together the robot's observations of its surroundings during the motion to obtain a complete map. The main uses of a map are localization, navigation, obstacle avoidance, reconstruction, interaction, and so on.

Map classification

The picture above shows which scenarios each kind of map suits. Sparse landmark maps can only be used for localization, while dense maps can also be used for navigation, obstacle avoidance, and reconstruction. Semantic maps are used for interaction, for example in VR.

So a robot vacuum needs at least a semi-dense map; otherwise it cannot perform navigation and path planning. How does a monocular camera reconstruct a map?

To reconstruct a map with a monocular camera, you need to know the depth of each pixel. Across multiple frames, once the camera motion is known, the pixel depth can be triangulated. Epipolar search and block matching are needed to find the pixels in different frames that correspond to the same spatial point (unlike feature points, whose correspondences across frames are known directly from matching, here every pixel must be handled, so the search can only be done along the epipolar line).

Epipolar search

As shown above, knowing point p1 in the first image, after the motion, which point on the epipolar line in the second image corresponds to p1?

If p1 were a feature point, the position of p2 could be found by feature matching, but here there is no descriptor, so p2 can only be searched for along the epipolar line.

You could walk along the epipolar line in the second image and compare each pixel with p1 one by one, but single-pixel comparisons produce many false candidates. In mapping, correspondences are therefore found by block matching: a small patch is taken around each candidate pixel, and the similarity between the two patches is measured with the sum of squared differences (SSD) or the normalized cross-correlation (NCC). Once the corresponding pixel in the other frame is found, the pixel's depth is obtained by triangulation.
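A minimal sketch of the block-matching step might look like the following; it computes NCC between patches and scans a list of candidate pixels that are assumed to have already been sampled along the epipolar line (border handling is mostly omitted for brevity):

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally-sized image patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(a @ b) / denom  # close to 1.0 means very similar patches

def search_along_epipolar(ref_img, cur_img, ref_px, candidates, half=4):
    """Keep the candidate pixel whose surrounding patch best matches ref_px."""
    x0, y0 = ref_px
    ref_patch = ref_img[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1]
    best_score, best_px = -1.0, None
    for (x, y) in candidates:            # candidate pixels on the epipolar line
        cur_patch = cur_img[y - half:y + half + 1, x - half:x + half + 1]
        if cur_patch.shape != ref_patch.shape:
            continue                      # skip candidates too close to the border
        score = ncc(ref_patch, cur_patch)
        if score > best_score:
            best_score, best_px = score, (x, y)
    return best_px, best_score

# Once a correspondence is found, its depth follows from triangulation, e.g. with
# cv2.triangulatePoints(P1, P2, pts1, pts2), where P1 = K[I|0] and P2 = K[R|t]
# are the 3x4 projection matrices of the two frames.
```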

After obtaining the rotation matrix and translation vector between frames and the depth of each pixel, a point cloud can be constructed and then stitched into a 3D point cloud map with the PCL point cloud library.

point cloud
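As a small illustration of this last step, the sketch below back-projects a per-pixel depth map into world-frame 3D points using the pinhole model; the intrinsics, depth values, and pose are placeholders (stitching and visualization with PCL or another library would follow):

```python
import numpy as np

def depth_to_world_points(depth, K, R, t):
    """Back-project a per-pixel depth map into 3D world points.

    depth: HxW array of depths (0 where unknown); K: 3x3 intrinsics;
    (R, t): pose of the camera expressed in the world frame.
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = z > 0
    # Pinhole back-projection: from pixel (u, v) and depth z to camera coordinates.
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    pts_cam = np.stack([x[valid], y[valid], z[valid]], axis=1)  # Nx3
    # Transform into the world frame: p_world = R @ p_cam + t for each point.
    return pts_cam @ R.T + t

# Example with placeholder values: a flat depth map and an identity pose.
K = np.array([[520.0, 0, 320.0], [0, 520.0, 240.0], [0, 0, 1.0]])
depth = np.full((480, 640), 2.0)           # pretend every pixel is 2 m away
points = depth_to_world_points(depth, K, np.eye(3), np.zeros(3))
print(points.shape)                        # (307200, 3) world-frame points
```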

Summary

Above we briefly introduced the framework of VSLAM. Many parts are only sketched (PS: my own knowledge is still limited), and some topics are not covered at all: for example, the optical flow method used in visual odometry, the pinhole camera model, the transformation between world coordinates and image coordinates, Lie algebra, and so on. These require additional study. VSLAM covers a lot of ground, and I will continue to study it in the future.

Reference

This article mainly refers to Dr. Gao Xiang's "14 Lectures on Visual SLAM" and many excellent articles on the Internet. Thank you.
