Classic literature reading: VIP-SLAM (tightly-coupled RGB-D visual-inertial planar SLAM)

0. Introduction

Many existing visual SLAM algorithms still build their maps from extracted point features, and RGB-D SLAM systems in particular remain mostly sparse-point-based, which means a large number of map points must be maintained to model the environment. This large number of map points brings high computational complexity, making such systems difficult to deploy on mobile devices. Planes, on the other hand, are common structures in man-made environments, especially indoors, and a large scene can usually be represented with a small number of planes. The paper "VIP-SLAM: An Efficient Tightly-Coupled RGB-D Visual Inertial Planar SLAM" tackles this problem by reducing the high complexity of sparse-point-based SLAM. We build a lightweight back-end map consisting of a few planes and map points to achieve efficient bundle adjustment (BA) with equal or better accuracy. During optimization, the parameters of multiple planar points are eliminated using homography constraints, which reduces the complexity of BA. The parameters and measurements in the homography and point-to-plane constraints are separated, and the measurement part is compressed to further speed up BA. Planar information is also integrated into the whole system for robust plane feature extraction, data association, and globally consistent plane reconstruction.

1. Main contributions

Prior to this paper, many works on IMU fusion had shown that an IMU helps improve the robustness of the system, and that structured planes help improve both its accuracy and robustness. Additionally, planes can model environments with fewer parameters than line and point features. Based on this, we make full use of the characteristics of multiple sensors and propose a highly robust and accurate system that fuses IMU, RGB, depth and plane information. The paper makes three contributions:

  1. We propose the first complete tightly-coupled multi-sensor fusion SLAM system to fuse RGB, depth, IMU and structured plane information. All of this information is integrated into a unified nonlinear optimization framework that jointly optimizes keyframe poses, IMU states, points and planes;

  2. We introduce plane information to reduce the number of map points and speed up BA optimization. We use homography constraints to eliminate the states of planar points during optimization, and simultaneously compress multiple constraints into one. These measures reduce the optimization time; Figures 4(a) and 4(b) show this process;

  3. We integrate plane information into the entire SLAM system to achieve high-precision tracking. We use purely geometric single-frame point-to-plane constraints to improve the accuracy and stability of plane estimation in texture-poor scenes. Furthermore, we convert the reprojection of planar points into homography constraints that relate multiple frames to a plane, further correcting drift (see the sketch after this list).
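
To make the homography idea in contributions 2 and 3 concrete, here is a minimal sketch of the classical plane-induced homography that such a constraint builds on. This is illustrative Python, not the authors' code; the function names and the exact residual form are assumptions, and the paper's parameter/measurement separation and constraint compression are not shown.

```python
import numpy as np

def plane_induced_homography(R_21, t_21, n_1, d_1, K):
    """Homography induced by the plane n_1 . X = d_1 (expressed in camera 1)
    between camera 1 and camera 2 with relative pose (R_21, t_21):
    for any point X on the plane, x2 ~ K (R_21 + t_21 n_1^T / d_1) K^-1 x1."""
    H = K @ (R_21 + np.outer(t_21, n_1) / d_1) @ np.linalg.inv(K)
    return H / H[2, 2]  # fix the projective scale

def homography_residual(H, px1, px2):
    """Transfer pixel px1 through H and compare with the observation px2.
    Used in place of a point reprojection term, so the 3D position of the
    planar point drops out of the state vector."""
    p = H @ np.array([px1[0], px1[1], 1.0])
    return np.array([p[0] / p[2] - px2[0], p[1] / p[2] - px2[1]])
```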

2. System overview

Figure 2 shows an overview of the proposed system. Our system takes RGB, depth and IMU as input and has three main components including front-end, planar module and back-end.
The front-end module is a sliding-window-based VIO system that estimates the 6-DoF pose in real time. Our VIO is similar to [7], except that we do not consider SLAM features and add an additional depth measurement.
The plane module receives front-end and back-end data as input. The high-frequency front-end information is only used for plane extension, while the low-frequency but high-precision back-end information is used for new plane detection, plane extension, point-to-plane association and plane-to-plane merging. Since most planes in indoor environments are either horizontal or vertical, we only consider the detection of horizontal and vertical planes; the fusion optimization, however, applies to planes in general.
The back-end module uses Local Bundle Adjustment (LBA) or Global Bundle Adjustment (GBA) to jointly optimize planes, points, IMU states and keyframe poses, and correct the drift of front-end poses.
Figure 2. Overview of the proposed system.

3. Notation

We first define the notation used in this paper. $(\cdot)^W$ denotes the world coordinate system, $(\cdot)^C$ the camera coordinate system, and $(\cdot)^I$ the IMU coordinate system. We use $T \in SE(3)$ to represent a pose, composed of a rotation $R \in SO(3)$ and a translation $t \in \mathbb{R}^3$. We use $\pi = [n; d]^T$ to represent a plane, where $n$ is the plane normal and $d$ is the distance from the origin to the plane. We adopt the closest-point (CP) vector [28] to parameterize the plane $\pi$, i.e., $\eta = nd$. $X$ is the state to be estimated, including poses, velocities, IMU biases, and point and plane landmarks.
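
As a quick illustration of the CP parameterization (a minimal sketch; the helper names are ours, and the unit-normal convention is assumed):

```python
import numpy as np

def plane_to_cp(n, d):
    """pi = [n; d] -> CP vector eta = n * d, the point on the plane
    closest to the origin."""
    n = n / np.linalg.norm(n)  # enforce a unit normal
    return n * d

def cp_to_plane(eta):
    """CP vector -> (unit normal, distance to origin)."""
    d = np.linalg.norm(eta)
    return eta / d, d
```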

4. Front end

When a new image is received, we detect ORB feature points [1] and compute the corresponding descriptors. First, we track features from the previous image to the current image using KLT. Then, we project the features that have 3D information onto the current image and use the Hamming distance to find the nearest ORB feature points. The remaining features are matched by finding the best observation, using the KLT tracking results as initial values. Finally, we remove outliers with a RANSAC-based fundamental matrix check.
Motion estimation is based on a sliding-window VIO that tightly fuses RGB, depth and IMU. Our VIO is similar to [7], which uses an inverse square-root filter to fuse all measurements. The main differences are that we do not consider SLAM features and that we add depth information to the visual measurements.
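
A rough sketch of the tracking-and-rejection pipeline above, using standard OpenCV calls rather than the authors' implementation (the descriptor-based re-association of 3D features is omitted, and parameters such as window size are assumptions):

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=500)

def track_frame(prev_gray, cur_gray, prev_pts):
    # Detect ORB features and descriptors on the current image.
    keypoints, descriptors = orb.detectAndCompute(cur_gray, None)

    # KLT optical flow from the previous image to the current one.
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    prev_ok, cur_ok = prev_pts[ok], cur_pts[ok]

    # RANSAC on the fundamental matrix to remove outlier matches.
    if len(cur_ok) >= 8:
        F, mask = cv2.findFundamentalMat(
            prev_ok, cur_ok, cv2.FM_RANSAC, 1.0, 0.99)
        if mask is not None:
            inlier = mask.ravel() == 1
            prev_ok, cur_ok = prev_ok[inlier], cur_ok[inlier]
    return prev_ok, cur_ok, keypoints, descriptors
```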

5. Plane recognition and association

5.1 Plane detection and merging

Plane detection. The plane module detects planes using only back-end data. After a plane is detected, the plane module uses both front-end and back-end data to extend the plane and associate it with map landmarks. Similar to [17], we use Delaunay triangulation to create 3D meshes and use histograms to detect planes. We detect only vertical and horizontal planes, by checking whether a mesh normal vector is perpendicular or parallel to gravity. Our plane module employs some additional steps to improve plane accuracy. When a plane is detected from the histogram, we refine its parameters using the 3D plane points belonging to the histogram bin, instead of directly using the bin value of the histogram. For a horizontal plane, we simply set $n = [0, 0, 1]^T$ and set the plane distance to the mean z-coordinate of the plane points. For a vertical plane, we set $n = [n_x, n_y, 0]^T$ and refine the plane parameters using the following equation:
$$
\begin{bmatrix} \bar{P}^{W\top}_{f_1} & -1 \\ \vdots & \vdots \\ \bar{P}^{W\top}_{f_n} & -1 \end{bmatrix}
\begin{bmatrix} n_x \\ n_y \\ d \end{bmatrix} = \mathbf{0} \qquad (1)
$$
where $n$ is the number of plane points, $P^W_{f_k}$ is the position of the $k$-th landmark in the world coordinate system, and $\bar{P}^W_{f_k} = [P^W_{f_k}(0), P^W_{f_k}(1)]^T$. We solve Equation (1) with a QR decomposition. Once the plane parameters are estimated, the plane can be associated with 3D meshes by angle and distance; our angle and distance thresholds are around 10 degrees and 5 cm. Figure 3 shows this process.
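
A minimal numerical sketch of this refinement step, under the assumption that the plane convention is $n \cdot P = d$; the paper solves the system by QR, while this sketch uses SVD as an equivalent homogeneous least-squares solver:

```python
import numpy as np

def refine_vertical_plane(points_w):
    """Fit a vertical plane n = [nx, ny, 0]^T with n . P = d to 3D points.
    Each plane point contributes one row [x_k, y_k, -1] of Equation (1)."""
    A = np.column_stack([points_w[:, 0], points_w[:, 1],
                         -np.ones(len(points_w))])
    # Null-space direction of A in the least-squares sense.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    nx, ny, d = Vt[-1]
    scale = np.hypot(nx, ny)            # normalize to a unit normal
    n = np.array([nx, ny, 0.0]) / scale
    d = d / scale
    if d < 0:                           # keep d as a positive distance
        n, d = -n, -d
    return n, d
```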

Figure 3. Plane detection: (a) 2D grid. (b) 3D grid, where the red and blue regions denote the horizontal and vertical grids, respectively. (c) Vertical planes are yellow and horizontal planes are gray.
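
Stepping back to the detection step itself, here is a toy sketch of the histogram idea described above, shown for horizontal planes only (vertical planes would additionally bin a normal direction); the gravity axis, bin size and thresholds are assumptions:

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, 1.0])  # assume the world z-axis is gravity-aligned

def detect_horizontal_plane(normals, centers, angle_thresh_deg=10.0,
                            bin_size=0.05):
    """Classify mesh facets by normal direction, histogram the heights of
    the horizontal ones, and return the dominant bin as a plane candidate,
    refined with its member points rather than the raw bin value."""
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    horizontal = np.abs(normals @ GRAVITY) > cos_thresh
    heights = centers[horizontal][:, 2]
    if heights.size == 0:
        return None
    bins = max(1, int(round(np.ptp(heights) / bin_size)) + 1)
    hist, edges = np.histogram(heights, bins=bins)
    k = np.argmax(hist)
    members = heights[(heights >= edges[k]) & (heights <= edges[k + 1])]
    return GRAVITY.copy(), members.mean()  # n = [0,0,1]^T, d = mean height
```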

5.2 Point and Plane Association

We use the 3D mesh to associate more map points with a plane. Once a 3D mesh in the depth map is associated with a plane, we find its 2D mesh. If a map point's 2D coordinates lie within the 2D mesh and the point's distance to the plane is less than 10 cm, the map point is added to the plane's candidate set. If a point in the candidate set is observed in more than 3 keyframes, we check its geometric consistency: we compute the reprojection error of the point, then force the point onto the plane and compute the reprojection error again. A point is considered a planar point if the two reprojection errors are similar and the maximum reprojection error is below a certain threshold. If a point fails the geometric consistency check multiple times, it is removed from the candidate set.
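
A small sketch of this consistency test (illustrative only: the observation structure, thresholds and pixel units are assumptions, and the paper's exact criterion may differ):

```python
import numpy as np

def project(K, R, t, P_w):
    """Pinhole projection of a world point into pixels (world-to-camera R, t)."""
    p = K @ (R @ P_w + t)
    return p[:2] / p[2]

def project_onto_plane(P_w, n, d):
    """Closest point on the plane n . X = d to P_w."""
    return P_w - (n @ P_w - d) * n

def is_planar_point(P_w, n, d, observations, max_err=2.0, tol=1.0):
    """observations: list of (K, R, t, pixel) for each observing keyframe.
    Accept the point as planar if its free and plane-constrained
    reprojection errors are both small and close to each other."""
    P_on_plane = project_onto_plane(P_w, n, d)
    e_free = max(np.linalg.norm(project(K, R, t, P_w) - px)
                 for K, R, t, px in observations)
    e_plane = max(np.linalg.norm(project(K, R, t, P_on_plane) - px)
                  for K, R, t, px in observations)
    return abs(e_free - e_plane) < tol and max(e_free, e_plane) < max_err
```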

Figure 4. Our optimization problem. (a) and (b) are the factor graphs before and after compressing the homography constraints. (c) is our global BA problem.

6. Backend optimization

6.1 Measurements

IMU and point feature measurements: IMU data between two consecutive keyframes are processed with the pre-integration method [8], [9]. We define a cost term based on the pre-integrated IMU data, the same as in [9]. Since depth maps are available, we integrate depth information into the visual point feature measurements.
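
For intuition, a minimal pre-integration sketch in the spirit of [8], [9] (deltas only; the covariance propagation and bias Jacobians that the actual cost term needs are omitted, and all names are ours):

```python
import numpy as np

def exp_so3(phi):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def preintegrate(imu_samples, bg, ba):
    """Accumulate relative rotation dR, velocity dv and position dp between
    two keyframes in the first keyframe's IMU frame. Gravity is applied
    later when the residual is formed, so it does not appear here.
    imu_samples: iterable of (gyro, accel, dt); bg, ba: current bias guesses."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a, dt in imu_samples:
        a_c = a - ba
        dp += dv * dt + 0.5 * (dR @ a_c) * dt**2
        dv += (dR @ a_c) * dt
        dR = dR @ exp_so3((w - bg) * dt)
    return dR, dv, dp
```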

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/128703658