A Review of Visual-LiDAR Fusion based Simultaneous Localization and Mapping



Abstract

Autonomous navigation requires both accurate and reliable mapping and localization solutions. In this context, simultaneous localization and mapping (SLAM) is a very suitable solution. SLAM is used in many application scenarios, including mobile robots, self-driving cars, unmanned aerial vehicles, and autonomous underwater vehicles. In these domains, both visual and visual-inertial SLAM are well studied, and improvements are regularly proposed in the literature. However, LiDAR-SLAM techniques appear to be relatively the same as they were ten or twenty years ago. Furthermore, few research efforts have focused on vision-LiDAR approaches, even though such a fusion would have many advantages. Indeed, fusion solutions improve the performance of SLAM, especially in scenes with intense motion, poor lighting, or a lack of visual features. This study provides a comprehensive survey of visual-LiDAR SLAM. After summarizing the basic idea of SLAM and its implementation, we review the current state of SLAM research, focusing on solutions using vision, LiDAR, and the fusion of both modalities.

Keywords: SLAM; mapping; localization; fusion; LiDAR; camera; vision


1. Introduction

Autonomous navigation of mobile robots has been a very active research area over the past few decades. The requirements for autonomous navigation are, first, a good and accurate localization of the robot itself and, second, a good prior knowledge or perception of its environment. Today, the main positioning system in use is the GNSS solution, which provides high-precision absolute positioning on Earth. However, such systems are not always available or accurate, depending on the environment (tunnels, caves, urban canyons, etc.), and can introduce significant positioning errors, which is unacceptable for safe autonomous navigation. Furthermore, mobile robots need to be able to navigate in dynamic environments with potential obstacles, and they do not always have prior information about their environment (planetary exploration, search and rescue, etc.). The only way for a robot to be able to navigate is to have some form of representation of the environment. Generating 3D maps online seems to be the starting point for fully autonomous navigation in the 3D world. Such a map can consist of simple geometric features or of more complex semantic objects. With a consistent map, robots are able to detect free space, obstacles, and easily detectable landmarks for precise and safe navigation. In this way, the robot can explore and map an unknown environment on its own and interact with it safely. Applications for such mobile robots are vast: space exploration, autonomous vehicles, seabed analysis, mining applications, search and rescue, structural inspections, and many more.
This navigation method is called simultaneous localization and mapping (SLAM). SLAM is the process by which a robotic system uses different kinds of sensors to construct a map of its environment while simultaneously estimating its position in the environment. Such a map could be used by human operators to visualize the environment and set the robot's path, or even by the robot itself to plan its own mission autonomously. This is the case for autonomous navigation, where the robot must plan its own path and make the right decisions without human intervention. Such a robot can maintain its own stability and plan its own movements, even when some unexpected events occur.
Currently, one of the most studied environments for SLAM applications is the domain of autonomous vehicles, as it requires localization and mapping to navigate the environment. More recently, the industry has proposed what it calls "autonomous" cars, but these cars are only a first step toward self-navigation and must be considered "semi-autonomous", because they are only guaranteed to drive safely and autonomously under very specific circumstances. At the time of writing, Tesla Autopilot offers autonomous driving only on highway sections or in simple scenarios [1] and still requires the driver's full attention. Similarly, the Google car is designed to drive itself only "on the open road", in good weather conditions, and with a pre-existing accurate 3D map [2].

Most self-driving cars use LiDAR and/or stereo cameras to perceive the environment they are navigating. Such systems are often combined with a Differential Global Positioning System (DGPS) or a Satellite-Based Augmentation System (SBAS) and an Inertial Measurement Unit (IMU) to enhance the robustness of the positioning solution [3]. With such sensors and a good GNSS fix, the positioning accuracy can be within a few centimeters. However, in cases where GNSS is not trustworthy, other positioning solutions must be investigated. Most state-of-the-art techniques attempt to address this localization problem with exteroceptive sensors such as mmWave radar, LiDAR, and monocular/stereo cameras. Mixing such exteroceptive sensors with classical proprioceptive sensors (IMU, odometry) allows reducing or eliminating the drift due to the accumulated errors of these relative positioning methods. Interestingly, the most commonly used modalities (camera and LiDAR) are two very different sensors, each with its strengths and weaknesses. For example, laser scanners are important for obstacle detection and tracking but are sensitive to rain, while cameras are often used to obtain a semantic interpretation of a scene but cannot work in poor lighting conditions. Since they appear to be complementary, their fusion balances each major weakness. Moreover, the most challenging problem in SLAM is error accumulation, which can reach arbitrarily high values [4]. The joint use of vision and LiDAR sensors can reduce local uncertainties, thereby limiting global drift.

The purpose of this paper is to provide an overview of existing SLAM methods with a focus on new hybrid LiDAR-camera solutions. To make this paper accessible to new researchers, we first give, in Section 2, a brief overview of the theory behind the SLAM process. Then, because current state-of-the-art LiDAR-camera solutions are mostly simple combinations of visual SLAM and LiDAR SLAM, we think it is important to have an overview of each SLAM modality. Section 3 focuses on the different types of visual SLAM (V-SLAM) methods, covering monocular and stereo cameras as well as modern RGB-D and event cameras. Then, Section 4 gives an overview of LiDAR-based SLAM. Finally, in Section 5, we discuss the state of the art on hybrid camera-LiDAR SLAM to see what has been covered, and in Section 6, what remains to be done.

2. Simultaneous Localization and Mapping

2.1 Principle

SLAM is a technique that simultaneously estimates sensor motion and reconstructs the geometry of the visited area. This technology was originally developed in 1985 for the autonomous control of robots [5]. SLAM has been widely studied and applied to various sensors and multiple robotic platforms. Since 1985, SLAM technology has been in constant development. Today, we are starting to see some mature algorithms, but these methods are still very dependent on the platform, the environment and the parameters that have to be tuned.

The basic idea is that correlations between landmark observations exist and are needed to improve the solution [6]. Combined with data association, SLAM solutions can perform loop closure to reduce the uncertainty of each pose in the map, since every landmark in the loop becomes correlated [4].

SLAM is an estimation problem. We wish to estimate a variable X that includes the trajectory or poses of the robot and a variable M that represents the location of landmarks in the environment. Given a set of measurements Z = {z1, ..., zm}, the measurement or observation model h(·) expresses each measurement as a function of X and M, such that:

$$z_k = h_k(X_k, M_k) + e_k$$
where Xk and Mk are subsets of X and M, and ek is a random measurement noise. SLAM solves the maximum a posteriori (MAP) problem:

$$(X^*, M^*) = \arg\max_{X,M}\; p(X, M \mid Z) = \arg\max_{X,M}\; p(Z \mid X, M)\, p(X, M)$$
where p(Z|X, M) is the likelihood of the measurements Z given X and M, and p(X, M) is the prior on X and M.
Assuming that the observations zk are independent, the MAP problem becomes:

$$(X^*, M^*) = \arg\max_{X,M}\; p(X, M) \prod_{k=1}^{m} p(z_k \mid X_k, M_k)$$
This SLAM estimation problem was initially solved using the Extended Kalman Filter (EKF) [7]. It reduces the uncertainty and gives an estimate at each step of the algorithm. Using a probabilistic model, the EKF guarantees the convergence and the consistency of the map. However, it is very sensitive to data-association errors, and the permanent update of all landmarks and their covariance matrices requires a significant computational effort. Figure 1 presents the block diagram of the EKF-SLAM process. State-of-the-art methods now address this estimation problem with optimization techniques such as bundle adjustment [8] or deep neural network approaches [9,10].
Figure 1. EKF-SLAM flow chart. When exteroceptive data arrive at time t, the state of the robot at this time is predicted using Equation (5), and the detected features are matched with those in the map. The matches allow the state and the map to be updated using Equation (6). If a detection is not in the map, it is initialized if possible and added to the map. This process is performed recursively (see Equation (7)).
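To make the EKF-SLAM recursion of Figure 1 concrete, the following is a minimal, purely illustrative Python sketch of the predict/update loop for a 2D robot observing point landmarks with a range-bearing sensor. The state layout, noise matrices, and function names are assumptions made for this example, not part of any specific implementation discussed above.

```python
import numpy as np

# Minimal 2D EKF-SLAM sketch (illustrative only): state = [x, y, theta, lx1, ly1, ...]
# A range-bearing observation model is assumed; all names are hypothetical.

def motion_model(x, u, dt):
    """Unicycle prediction f(x, u) for the robot pose, with u = (v, w)."""
    v, w = u
    theta = x[2]
    return x + np.array([v * np.cos(theta) * dt, v * np.sin(theta) * dt, w * dt])

def predict(mu, P, u, dt, Q):
    """EKF prediction step (Equation (5)): propagate the robot pose, landmarks stay fixed."""
    F = np.eye(len(mu))
    theta, v = mu[2], u[0]
    F[0, 2] = -v * np.sin(theta) * dt      # Jacobian of the motion model w.r.t. theta
    F[1, 2] = v * np.cos(theta) * dt
    mu = mu.copy()
    mu[:3] = motion_model(mu[:3], u, dt)
    P = F @ P @ F.T
    P[:3, :3] += Q                         # process noise only affects the robot pose
    return mu, P

def update(mu, P, z, lm_idx, R):
    """EKF update step (Equation (6)) for one matched range-bearing measurement z = (r, phi)."""
    i = 3 + 2 * lm_idx                     # index of the matched landmark already in the state
    dx, dy = mu[i] - mu[0], mu[i + 1] - mu[1]
    q = dx**2 + dy**2
    z_hat = np.array([np.sqrt(q), np.arctan2(dy, dx) - mu[2]])     # h(x, m)
    H = np.zeros((2, len(mu)))                                     # Jacobian of h
    H[:, :3] = np.array([[-dx / np.sqrt(q), -dy / np.sqrt(q), 0.0],
                         [dy / q,           -dx / q,          -1.0]])
    H[:, i:i + 2] = np.array([[dx / np.sqrt(q), dy / np.sqrt(q)],
                              [-dy / q,          dx / q]])
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                                 # Kalman gain
    innov = z - z_hat
    innov[1] = (innov[1] + np.pi) % (2 * np.pi) - np.pi            # wrap the bearing error
    mu = mu + K @ innov
    P = (np.eye(len(mu)) - K @ H) @ P
    return mu, P
```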

2.2 SLAM Framework Based on Probabilistic Estimation

As we have seen before, SLAM is a recursive estimation process. This process is often viewed in a probabilistic way, in which case the classical prediction and update steps have to be performed.
Considering a robot moving in an unknown environment, we define:
xk: the state vector describing the robot at time k
xk|k-1: the estimated state vector at time k given the previous state
uk: the control vector applied to move the vehicle to state xk (if provided)
mi: the vector describing the i-th landmark
zk,i: the observation of the i-th landmark at time k
X0:k: the set of vehicle poses from time 0 to k
U0:k: the set of control inputs from time 0 to k
Z0:k: the set of observations from time 0 to k
M: the set of landmarks (the map)
Mk|k-1: the estimated map at time k given the map at time k-1.
When we consider the probabilistic form of SLAM, at each time k, we want to compute the probability distribution function:
$$P(x_k, M_k \mid Z_{0:k}, U_{0:k}, x_0)$$
In order to proceed, we have to use a recursive solution that uses, as a prior, the updated estimate from the previous step, P(xk-1|k-1, Mk-1|k-1 | Z0:k-1, U0:k-1).
To do this, we first need to define a motion model that provides a state prediction given a control input, P(xk | xk-1, uk). The prediction step is then written as:

$$P(x_{k|k-1}, M_{k|k-1} \mid Z_{0:k-1}, U_{0:k}, x_0) = \int P(x_k \mid x_{k-1}, u_k)\, P(x_{k-1|k-1}, M_{k-1|k-1} \mid Z_{0:k-1}, U_{0:k-1}, x_0)\, dx_{k-1} \quad (5)$$

Similarly, we must also define a perception or observation model P(zk,i | xk, M), which relates the sensor data about a detection i at time k to the estimated state. The update step is then written as:

$$P(x_{k|k}, M_{k|k} \mid Z_{0:k}, U_{0:k}, x_0) \propto P(z_k \mid x_{k|k-1}, M_{k|k-1})\, P(x_{k|k-1}, M_{k|k-1} \mid Z_{0:k-1}, U_{0:k}, x_0) \quad (6)$$

Figure 2 shows the SLAM pipeline and represents all the variables used in this section.
Figure 2. SLAM process description and notation. Blue represents the estimated trajectory and map; white represents the true trajectory.
Having defined the prediction and update steps, SLAM can be expressed as the iterative estimation of the following equation, which is a combination of Equations (5) and (6):

$$P(x_k, M_k \mid Z_{0:k}, U_{0:k}, x_0) \propto P(z_k \mid x_k, M_k) \int P(x_k \mid x_{k-1}, u_k)\, P(x_{k-1}, M_{k-1} \mid Z_{0:k-1}, U_{0:k-1}, x_0)\, dx_{k-1} \quad (7)$$

Equation (7) gives a recursive Bayesian formulation of the SLAM problem. A solution to the SLAM problem must properly define both the motion model and the perception model so that this recursion can be computed efficiently. Current state-of-the-art methods typically use an inertial measurement unit for the prediction step or make assumptions about the vehicle's motion (constant velocity, constant acceleration, etc.). Concerning the observation model, in the case of visual SLAM it is usually based on inverse-depth or classical perspective models. For LiDAR, RGB-D, or RADAR methods, the observation model is much simpler, because the observation is a direct 3D measurement of the 3D world.
For a camera, the observation of a 3D point P3D as a 2D pixel P2D can be written as:

$$P_{2D} = \Pi(K\, T\, P_{3D})$$

where T = [R, t] is the rigid transformation corresponding to the 6D pose of the sensor, K is the intrinsic matrix of the sensor, and Π(·) is the perspective projection function. This function has to be inverted to match the classical observation model (P3D = g(P2D, K, T)). However, such an inversion is not trivial, so the estimation is usually delayed until additional observations of P2D are available; the map point P3D is then obtained by triangulation.
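As a small illustration of this perspective observation model and of the delayed (triangulation-based) initialization, the sketch below projects a 3D point into two camera views and recovers it with OpenCV's cv2.triangulatePoints. The intrinsic matrix, poses, and point values are arbitrary example values.

```python
import numpy as np
import cv2

def project(K, R, t, P3d):
    """Perspective observation model: P2D = Pi(K [R|t] P3D)."""
    p = K @ (R @ P3d + t)       # point in the camera frame, then through the intrinsics
    return p[:2] / p[2]         # perspective division Pi()

# Illustrative intrinsics and two camera poses (values are arbitrary).
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R1, t1 = np.eye(3), np.zeros(3)
R2, t2 = np.eye(3), np.array([-0.2, 0., 0.])   # second view: camera shifted 20 cm sideways

P_world = np.array([0.5, 0.1, 4.0])
p1 = project(K, R1, t1, P_world)
p2 = project(K, R2, t2, P_world)

# Delayed initialization: once the same feature is observed twice, triangulate it.
Pr1 = K @ np.hstack([R1, t1.reshape(3, 1)])
Pr2 = K @ np.hstack([R2, t2.reshape(3, 1)])
X_h = cv2.triangulatePoints(Pr1, Pr2, p1.reshape(2, 1), p2.reshape(2, 1))
X = (X_h[:3] / X_h[3]).ravel()                 # back to Euclidean coordinates
print(X)                                       # approximately equal to P_world
```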

2.3 Graph-Based SLAM Framework

In the case of LiDAR, the observation model is simpler, because the observations are directly linked to the states via a classical rigid transformation, e.g.,

$$z_{k,i} = R_k\, m_i + t_k$$

where [Rk, tk] is the rigid transformation from the world frame to the sensor frame at time k. Even though probabilistic frameworks can be used with LiDAR data, such SLAM methods are more often solved with graph-based approaches, which only optimize over sets of relative transformations [11].

The graph-based formulation of SLAM was proposed in [12]. It formulates a simpler estimation problem by abstracting the raw sensor measurements. The raw measurements are replaced by edges in a graph, which can be thought of as "virtual measurements". In practice, such an edge is labeled with a probability distribution over the relative positions of the two poses, conditioned on their mutual measurements. As shown in Figure 3, the process consists of two main modules: graph construction (front-end) and graph optimization (back-end). Given the constraints, the optimization computes the most likely configuration of poses: this is the SLAM back-end. The SLAM front-end, in contrast, estimates the constraints from the sensor data.
A tutorial on graph-based SLAM can be found in [13].
Figure 3. Graph-based SLAM process: graph construction (front-end) and graph optimization (back-end).
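The following is a toy illustration of the back-end described above: a tiny 2D pose graph whose edges are relative-transformation constraints forming a loop, optimized with a Levenberg-Marquardt solver. It is only a sketch of the idea; real back-ends use dedicated libraries and carry an information matrix on each edge.

```python
import numpy as np
from scipy.optimize import least_squares

# Tiny 2D pose-graph back-end sketch. Each node is a pose (x, y, theta); each edge
# stores a "virtual measurement": the relative transform between two nodes.
edges = [
    (0, 1, np.array([1.0, 0.0, np.pi / 2])),   # odometry constraints
    (1, 2, np.array([1.0, 0.0, np.pi / 2])),
    (2, 3, np.array([1.0, 0.0, np.pi / 2])),
    (3, 0, np.array([1.0, 0.0, np.pi / 2])),   # loop closure back to the start
]

def relative(xi, xj):
    """Relative transform of pose j expressed in the frame of pose i."""
    dx, dy = xj[0] - xi[0], xj[1] - xi[1]
    c, s = np.cos(xi[2]), np.sin(xi[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, xj[2] - xi[2]])

def residuals(flat):
    poses = flat.reshape(-1, 3)
    res = []
    for i, j, z in edges:
        e = relative(poses[i], poses[j]) - z
        e[2] = (e[2] + np.pi) % (2 * np.pi) - np.pi   # wrap the angular residual
        res.append(e)
    res.append(poses[0])                              # gauge constraint: fix the first pose
    return np.concatenate(res)

x0 = np.zeros(4 * 3)                                  # poor initial guess for 4 poses
sol = least_squares(residuals, x0, method="lm")       # Levenberg-Marquardt, as mentioned above
print(sol.x.reshape(-1, 3))                           # approximately a unit-square loop
```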

3. Visual SLAM

After this overview of SLAM theory, the purpose of this section is to provide a brief overview of the existing methods for performing visual SLAM. Visual SLAM is one of the most active research areas in robotics. Vision sensors have been a major research direction for SLAM solutions because they are cheap, collect a large amount of information, and provide a large measurement range. The principle of visual SLAM is easy to understand: the purpose of such systems is to sequentially estimate the camera motion from the perceived motion of pixels in an image sequence. This can be done in different ways. The first approach is to detect and track some important points in the image; this is what we call feature-based visual SLAM. The other is to use the entire image without extracting features; this approach is called direct SLAM. Of course, there also exist other SLAM solutions using different cameras, such as RGB-D or time-of-flight (ToF) cameras (which provide not only images but also depth) and event cameras (which only detect changes in the image).
In this section, we propose to decompose visual-SLAM into these different categories for clarity.

3.1 Feature-Based SLAM

Feature-based SLAM can again be decomposed into two sub-families: filter-based methods and bundle adjustment (BA)-based methods. Reference [14] presents a comparison between such methods. Davison et al. proposed the first monocular method, MonoSLAM, in 2003 [15,16]. The estimation of features and poses can easily be done using an Extended Kalman Filter. This filter-based technique has been shown to be limited in large environments because too many features must be kept in the state. To reduce this problem, PTAM was proposed in 2007 [17]. It separates pose and map estimation into separate threads and advocates the use of BA. Of course, many extensions have also been proposed [18,19]. To improve feature-based SLAM using BA, loop closure was added to detect whether a keyframe has already been seen [20,21]. At the time of writing, the most commonly used algorithm for SLAM is ORB-SLAM [22,23]. This algorithm contains most of the "tricks" that improve the performance of SLAM and can handle monocular, stereo, and RGB-D camera configurations using different algorithms. The problem with this algorithm is that a large number of input parameters need to be tuned in order for SLAM to work in a given environment. But if we know enough about the environment to tune everything, do we really need to perform SLAM? The work presented in [24] attempted to reduce the number of such parameters and make visual SLAM platform- and environment-independent, but did not reach the performance of ORB-SLAM.
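As a minimal illustration of a feature-based front-end (in the spirit of, but not equivalent to, ORB-SLAM), the sketch below detects ORB features in two frames, matches them, and recovers the relative camera motion from the essential matrix with OpenCV. The image file names and the intrinsic matrix K are placeholders; note that, as in any monocular setup, the recovered translation is only known up to scale.

```python
import numpy as np
import cv2

# Feature-based front-end sketch: ORB detection + matching + relative pose recovery.
# "frame1.png", "frame2.png" and the intrinsic matrix K are illustrative placeholders.
K = np.array([[718.8, 0., 607.2], [0., 718.8, 185.2], [0., 0., 1.]])

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)   # keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Epipolar geometry: essential matrix with RANSAC, then decomposition into R, t.
E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("relative rotation:\n", R)
print("translation direction (scale unknown):", t.ravel())
```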

3.2 Direct SLAM

In contrast to feature-based methods, direct methods use the images directly, without any feature detector or descriptor. Such feature-less methods typically register two consecutive images based on photometric consistency (for feature-based methods, the registration is based on the geometric positions of the feature points). In this category, the best-known methods are DTAM [25], LSD-SLAM [26], SVO [27] and DSO [28,29]. Finally, with the development of deep learning, several SLAM applications have emerged that mimic the previously proposed methods [9,30]. Such research efforts generate semi-dense maps representing the environment, but direct SLAM methods are time-consuming and often require GPU-based processing.
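To illustrate what "photometric consistency" means in practice, the following sketch evaluates a simple photometric error between a reference image (with known per-pixel depth) and a second image, given a candidate relative pose. Direct methods minimize this kind of residual over the pose (and the depths); this toy version is illustrative only and not taken from DTAM, LSD-SLAM, or DSO.

```python
import numpy as np
import cv2

def photometric_error(img_ref, img_cur, depth_ref, K, R, t, stride=4):
    """Mean squared intensity difference between img_ref and img_cur warped by (R, t).

    Direct methods minimize this kind of residual over the pose (and depths)
    instead of matching feature points. depth_ref gives a depth per reference pixel.
    """
    h, w = img_ref.shape
    K_inv = np.linalg.inv(K)
    err, count = 0.0, 0
    for v in range(0, h, stride):
        for u in range(0, w, stride):
            z = depth_ref[v, u]
            if z <= 0:
                continue
            P = z * (K_inv @ np.array([u, v, 1.0]))     # back-project the reference pixel
            p = K @ (R @ P + t)                         # project into the current frame
            if p[2] <= 0:
                continue
            u2, v2 = p[0] / p[2], p[1] / p[2]
            if 0 <= u2 < w - 1 and 0 <= v2 < h - 1:
                i2 = cv2.getRectSubPix(img_cur, (1, 1), (u2, v2))[0, 0]  # bilinear sample
                err += (float(img_ref[v, u]) - float(i2)) ** 2
                count += 1
    return err / max(count, 1)
```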

3.3 RGB-D SLAM

RGB-D camera sensors [31, 32] based on structured light have recently become cheap and small. Such a camera can provide 3D information in real time, but is most likely to be used for indoor navigation because its range is less than 4 to 5 meters and the technology is very sensitive to sunlight. See [33–38] for RGB-D vSLAM methods.

3.4 Event Camera SLAM

Finally, an event camera is a bioinspired imaging sensor that can provide an "infinite" frame rate by detecting visual "events" (i.e. changes in images). Such sensors have recently been used in V-SLAM [39–41]. However, the technology is not mature enough to draw conclusions about the performance of SLAM applications.

3.5 Visual SLAM Summary

As can be seen, the field of V-SLAM research is very rich, and we have only reviewed the main methods; for a more complete review of visual SLAM, see [42,43]. Even though V-SLAM provides good results, all these V-SLAM solutions are error-prone because they are sensitive to light changes or low-texture environments. Furthermore, RGB-D-based methods are very sensitive to sunlight, since they are based on infrared light; therefore, they only perform well in indoor scenarios. The other vision methods do not perform well in texture-less or harsh environments, where pixel displacement cannot be accurately estimated. Finally, image analysis still has a high computational complexity. See Table 1 for a summary.
Table 1. Summary of visual SLAM approaches.
These shortcomings have motivated researchers to create optimized robust algorithms that can handle data errors and reduce execution time. For all these reasons, other sensors have also been investigated for SLAM processes. Right now, the first self-driving car prototypes rely primarily on other sensors: RADAR or LiDAR.

4. LiDAR SLAM

What every mobile robot designed to perform SLAM has in common is the use of exteroceptive sensors. Even though RADAR-based SLAM has been proven effective [44–46], we choose to focus this paper on laser scanning devices. One reason for this choice is that RADAR is not yet accurate enough to provide a good 3D map of the vehicle's surroundings, so fusing it with vision sensors is very difficult. Concerning LiDAR, 3D mapping using laser scanners remains a popular technique due to its simplicity and accuracy. Indeed, applying LiDAR to the SLAM problem enables low-drift motion estimation with acceptable computational complexity [47].
Laser scanning methods appear to be the cornerstone of 2D and 3D mapping research. LiDAR can provide point clouds which can be easily implemented in SLAM. Stop-and-scan [48] was one of the first attempts to reach a proper SLAM solution using LiDAR. It avoids motion distortion, but is not a reliable solution for navigation. Fusion with an IMU can correct motion distortions using an error model that takes velocity information as input [49]. While IMUs are often used to undistort data, they are also often used to predict motion. Work in [50] shows that this approach may lead to over-convergence and proposes a LiDAR odometry-based distortion analysis.

4.1 Scan matching and graph optimization

Scan matching is the fundamental process for creating 3D maps using LiDAR, as it provides precise information about the motion. A common method for registering 3D point clouds is the Iterative Closest Point (ICP) algorithm [51]; see Figure 4 for the principle. Its main disadvantages are the expensive search for point correspondences and the high sensitivity to the initial guess of the minimization. To address the first problem, a kd-tree structure [47] can be introduced to speed up the closest-point search. The work in [52] shows that the robustness of ICP can be enhanced by using a probabilistic framework that takes the planar structure of the scans into account; this is Generalized ICP. Another approach is Polar Scan Matching (PSM) [53], which exploits the polar coordinates delivered by the laser scanner to estimate the matching between points.
Figure 4. Principle of the Iterative Closest Point (ICP) algorithm.
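The sketch below shows a compact point-to-point ICP in Python, using a kd-tree for the closest-point search (as mentioned above) and an SVD-based rigid alignment at each iteration. The parameter values and the stopping criterion are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(A, B):
    """Least-squares rigid transform (R, t) mapping points A onto points B (Kabsch/SVD)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cb - R @ ca

def icp(source, target, iters=30, tol=1e-6):
    """Point-to-point ICP: iteratively match closest points and re-estimate the pose."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(target)              # kd-tree speeds up the closest-point search
    src = source.copy()
    prev_err = np.inf
    for _ in range(iters):
        dist, idx = tree.query(src)     # nearest-neighbour correspondences
        R_step, t_step = best_rigid_transform(src, target[idx])
        src = src @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
        err = dist.mean()
        if abs(prev_err - err) < tol:   # stop when the alignment no longer improves
            break
        prev_err = err
    return R, t
```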
To reduce local errors, graph-based methods [54] can be used together with LiDAR. The history of robot poses is represented by a graph: each node represents a sensor measurement, and edges represent constraints generated by observations (from ICP results). All methods that rely on pose graphs can be solved using various optimization methods such as the Levenberg–Marquardt optimizer.
Taking aircraft navigation as an example, [55] proposed the use of 2D LiDAR combined with GNSS and IMU. Let us note that scan matching can be done for both 2D and 3D LiDAR. Considering 2D LiDAR applications, filtering-based methods are also proposed under the "flat" world assumption.

4.1.1 Occupancy grid maps and particle filters

Another effective way to solve the SLAM problem is to use Rao-Blackwellized particle filters, such as Gmapping [56]. It greatly reduces local errors and provides interesting results in planar environments. Each particle represents a possible robot pose and map. However, the large number of particles required to map the environment correctly results in a non-negligible computation time. The research in [57] shows that particle filters applied to 2D SLAM are able to compute highly accurate proposal distributions based on likelihood models. The result is an accurate occupancy grid map using an order of magnitude fewer particles than conventional methods. Of course, adapting this technique to 3D is very difficult due to the size of the occupancy grid.
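As an illustration of the occupancy-grid representation used by these 2D approaches, the following sketch integrates a single 2D scan into a log-odds grid by ray-tracing each beam. The particle-filter machinery (sampling, weighting, resampling) is deliberately omitted, and the resolution and log-odds increments are arbitrary example values.

```python
import numpy as np

# Minimal log-odds occupancy-grid update for one 2D scan (particle weighting omitted).
RES = 0.05                        # cell size [m]
L_FREE, L_OCC = -0.4, 0.85        # illustrative log-odds increments

def world_to_cell(p, origin):
    return tuple(np.floor((p - origin) / RES).astype(int))

def bresenham(c0, c1):
    """Integer cells crossed between c0 and c1 (endpoint excluded)."""
    (x0, y0), (x1, y1) = c0, c1
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 > x0 else -1
    sy = 1 if y1 > y0 else -1
    err, cells = dx - dy, []
    while (x0, y0) != (x1, y1):
        cells.append((x0, y0))
        e2 = 2 * err
        if e2 > -dy:
            err, x0 = err - dy, x0 + sx
        if e2 < dx:
            err, y0 = err + dx, y0 + sy
    return cells

def update_grid(logodds, pose, ranges, angles, origin):
    """Integrate one scan taken from pose = (x, y, theta) into the log-odds grid."""
    x, y, th = pose
    robot_cell = world_to_cell(np.array([x, y]), origin)
    for r, a in zip(ranges, angles):
        hit = np.array([x + r * np.cos(th + a), y + r * np.sin(th + a)])
        hit_cell = world_to_cell(hit, origin)
        for c in bresenham(robot_cell, hit_cell):
            logodds[c] += L_FREE      # cells along the beam are likely free
        logodds[hit_cell] += L_OCC    # the end point is likely occupied
    return logodds

grid = np.zeros((400, 400))           # 20 m x 20 m map at 5 cm resolution
grid = update_grid(grid, (10.0, 10.0, 0.0), [2.0, 3.5], [0.0, np.pi / 4],
                   origin=np.array([0.0, 0.0]))
```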

4.1.2 Loop closure refinement step

The previous solutions provide localization information and build a map of the environment in an odometry fashion. To fully address the SLAM problem, a loop closure step has to be added to the LiDAR odometry. To improve the consistency of the global map, loop closures can be performed when the robot recognizes that it is in a previously visited location. This can be done by feature-based methods such as [58]. For laser scans, geometric descriptors are used, such as lines, planes, or spheres. These descriptors are used to perform matching between scans in order to detect eventual loops. Since a scan matcher run between every pair of scans can hardly operate in real time, submaps representing the environment over multiple scans are used in [59]. All completed submaps are automatically inserted into a scan-matching loop, which performs loop detection in a sliding window around the current robot pose. Magnusson et al. [60] proposed a loop-detection process using a Normal Distribution Transform (NDT) representation of the 3D clouds. It is based on histograms of features describing the orientation and smoothness of the surfaces.
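As a loose illustration of histogram-based loop detection in the spirit of [60], the sketch below describes each scan by a histogram of surface-normal elevation angles and compares it to past keyframes with a cosine similarity. The normal-estimation step (assumed to be done elsewhere), the bin count, and the threshold are assumptions made for this example.

```python
import numpy as np

def normal_histogram(normals, n_bins=16):
    """Describe a scan by a histogram of its surface-normal elevation angles."""
    elev = np.arcsin(np.clip(normals[:, 2], -1.0, 1.0))   # angle w.r.t. the horizontal plane
    hist, _ = np.histogram(elev, bins=n_bins, range=(-np.pi / 2, np.pi / 2))
    return hist / max(hist.sum(), 1)

def cosine_similarity(h1, h2):
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))

def detect_loop(current_desc, keyframe_descs, threshold=0.95):
    """Return the index of the most similar past keyframe, or None if no loop is found."""
    if not keyframe_descs:
        return None
    sims = [cosine_similarity(current_desc, d) for d in keyframe_descs]
    best = int(np.argmax(sims))
    return best if sims[best] > threshold else None
```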
The work in [55] demonstrates that the overall drift of LiDAR-SLAM can be effectively corrected by performing loop closure. In their case, the Kalman filter was simply augmented with a place-recognition module capable of detecting loops. Table 2 summarizes LiDAR-based SLAM.
Table 2. Summary of LiDAR-based SLAM approaches.

5. LiDAR-Vision Fusion

As mentioned earlier, SLAM can be performed with vision sensors or with LiDAR. The advantage of vision sensors is that they are well studied at the moment. Even though V-SLAM provides accurate results, some shortcomings remain, such as scale-factor drift in the monocular case, poor depth estimation (delayed depth initialization) or short range for stereo vision, sparsity of the reconstructed map (for feature-based methods), and the difficulty of using RGB-D outdoors. Concerning 3D LiDAR-based SLAM, the techniques used rely on scan matching and pose graphs. Some solutions focus on landmark detection and extraction, but the obtained point clouds are usually not dense enough for this to run efficiently. Nonetheless, the main advantage of LiDAR is its very good accuracy in ranging and thus in mapping. Today, it is clear that a fusion of both approaches would go a long way for modern SLAM applications. Of course, using both modalities requires a first, difficult calibration step. This section presents the state of the art on the available calibration tools and on LiDAR-camera fusion methods.

5.1 Mandatory calibration steps

In order to perform SLAM through LiDAR camera fusion with optimal performance, an accurate calibration between the two sensors must be ensured. As shown in Figure 5, extrinsic calibration is required to determine the relative transformation between the camera and LiDAR.
One of the first toolboxes to propose an interactive solution for calibrating a camera to a LiDAR is [61]. It consists of manually marking the corresponding points in the LiDAR scan and in the camera image. The work in [62] details the use of a checkerboard for automatic camera-laser calibration: it performs line extraction in order to infer the appropriate rigid transformation between the two sensors. However, these offline calibration techniques cannot guarantee an always-optimal extrinsic calibration, because the extrinsic parameters can change from day to day and such methods require very specific conditions to work.
Figure 5. Principle of extrinsic calibration. The goal is to find the rigid transformation M_CL between the LiDAR and the camera. Currently, this is usually done manually, using a calibration target such as a 2D or 3D checkerboard pattern and detecting this pattern in each modality (M_BL and M_BC).
As deep convolutional neural networks (CNNs) have recently become popular in robotics applications, the work in [63] proposes a CNN-based calibration. The CNN takes the LiDAR and camera disparities as input and returns the calibration parameters. This provides a fast online calibration solution suitable for real-time applications.
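The following sketch illustrates the target-based extrinsic calibration principle of Figure 5: given the 3D corners of a calibration target, their 2D detections in the image, and the target pose measured in the LiDAR frame, the camera pose of the target is recovered with cv2.solvePnP, and the LiDAR-to-camera transform M_CL follows by composition. All numeric values and the detection steps are placeholders (here the image detections are simulated by projecting the target through an assumed pose).

```python
import numpy as np
import cv2

def to_homogeneous(R, t):
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, np.asarray(t).ravel()
    return M

# --- Inputs (all illustrative / assumed to come from target detection) ---
K = np.array([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])
dist = np.zeros(5)
board_pts = np.array([[0., 0., 0.], [0.5, 0., 0.], [0.5, 0.5, 0.], [0., 0.5, 0.]])  # target corners, board frame [m]
M_BL = to_homogeneous(np.eye(3), np.array([2.0, 0.1, -0.3]))   # target pose seen by the LiDAR (M_BL)

# Simulated image detections: project the target through an assumed board-to-camera pose.
rvec_true = np.array([0.1, -0.2, 0.05])
tvec_true = np.array([0.1, 0.0, 2.5])
img_pts, _ = cv2.projectPoints(board_pts, rvec_true, tvec_true, K, dist)
img_pts = img_pts.reshape(-1, 2)

# Camera pose of the target (board -> camera, M_BC), via Perspective-n-Point.
ok, rvec, tvec = cv2.solvePnP(board_pts, img_pts, K, dist)
R_BC, _ = cv2.Rodrigues(rvec)
M_BC = to_homogeneous(R_BC, tvec)

# Extrinsic calibration: LiDAR -> camera transform (M_CL in Figure 5).
M_CL = M_BC @ np.linalg.inv(M_BL)
print(M_CL)
```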

5.2 Visual-LiDAR SLAM

5.2.1 SLAM based on EKF fusion

In the context of visual-LiDAR SLAM, it has been shown that the classical Extended Kalman Filter (EKF) SLAM formulation can be modified to integrate such sensor fusion. The work in [64] proposes a new EKF representation using data association, which improves the accuracy of SLAM. The work in [65] also presents an EKF SLAM combining an RGB-D camera with a LiDAR. The main purpose of this work is to address visual-tracking failures: when tracking fails, the point cloud data from the RGB-D camera is registered using the LiDAR pose to build the 3D map. Such an approach does not actually provide fusion, but rather a switching mechanism between the two modalities. The work in [66] integrates different state-of-the-art SLAM algorithms using an EKF based on vision and inertial measurements, in the context of low-cost hardware for micro aerial vehicles. A 2D LiDAR is integrated in the SLAM system to generate 2.5D maps and improve the robot pose estimation. The methods proposed so far are thus still loosely coupled methods that do not rely on feature detection in a common measurement space. More tightly coupled LiDAR-vision sensor fusion algorithms are still lacking in the literature.

5.2.2 Improved Visual SLAM

From another perspective, the excellent performance achieved by visual SLAM algorithms motivates the use of sensor fusion techniques on top of these frameworks. In [67], the LiDAR is used for depth extraction: after projecting the point cloud onto the image frame, motion estimation and mapping are performed using visual keyframe-based bundle adjustment. The work in [68] proposes a direct visual SLAM that uses the sparse depth point cloud of a LiDAR (Figure 6). However, since the camera resolution is much higher than the LiDAR resolution, many pixels have no depth information. The work presented in [69] provides a solution to this resolution-matching problem: after computing the geometric transformation between the two sensors, a Gaussian process regression is performed to interpolate the missing depth values. Therefore, using only the LiDAR, the features detected in the image can be directly initialized in depth, in the same way as with RGB-D sensors.

Figure 6. Example of a LiDAR scan reprojected onto the image. Note that small calibration errors lead to severe errors in the estimated depth (see box 2): here, the obtained depth is wrong with respect to the stereo vision.
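To illustrate the depth-association idea used in [67–69], the sketch below projects a LiDAR point cloud into the image using the calibrated intrinsics and extrinsics (as in Figure 6) and assigns to each visual feature the depth of the nearest projected LiDAR point. The interpolation of missing depths (e.g., the Gaussian process regression of [69]) is not shown, and the pixel-distance threshold is an arbitrary example value.

```python
import numpy as np
from scipy.spatial import cKDTree

def lidar_depth_for_features(features_uv, cloud_lidar, K, T_cl, max_pix_dist=3.0):
    """Assign a depth to each 2D feature from the projected LiDAR cloud.

    features_uv : (N, 2) pixel coordinates of the visual features
    cloud_lidar : (M, 3) LiDAR points in the LiDAR frame
    T_cl        : (4, 4) extrinsic transform, LiDAR frame -> camera frame
    Returns an (N,) array of depths; np.nan where no LiDAR point projects nearby.
    """
    # Transform the cloud into the camera frame and keep points in front of the camera.
    pts_cam = (T_cl[:3, :3] @ cloud_lidar.T).T + T_cl[:3, 3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Project into the image plane.
    proj = (K @ pts_cam.T).T
    uv = proj[:, :2] / proj[:, 2:3]

    # For each feature, take the depth of the closest projected LiDAR point.
    tree = cKDTree(uv)
    dist, idx = tree.query(features_uv)
    depths = pts_cam[idx, 2].astype(float)
    depths[dist > max_pix_dist] = np.nan   # too far away: the feature keeps no depth
    return depths
```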

Zhang et al. [70] proposed a monocular SLAM associated with a 1D laser range finder. Since monocular SLAM often suffers from scale drift, this solution provides effective drift correction at a very low hardware cost. Scherer et al. [71] mapped the course and vegetation of a river using a flying robot and a hybrid framework: state estimation is performed by visual odometry combined with inertial measurements, while the LiDAR is used to sense obstacles and map the river borders. However, point clouds may contain occluded points, which reduce the accuracy of the estimation. The work in [72] addresses this issue by proposing a direct SLAM approach with occluded-point and coplanarity detectors. In these last papers, the poses estimated by visual SLAM are used to register the LiDAR point clouds during the mapping phase.

5.2.3 Improved LiDAR SLAM

In many visual-LiDAR SLAM systems, the LiDAR is used for motion estimation via scan matching, while the camera performs feature detection. Liang et al. [73] enhanced the performance of scan-matching LiDAR SLAM using a visual loop-detection scheme based on ORB features. In [74], a 3D laser-based SLAM is associated with a visual approach to perform loop detection through a keyframe-based technique using a bag of visual words. Additionally, the iterative closest point (ICP) step can be improved using LiDAR-camera fusion: the work in [75] uses the visual information to make an initial guess of the pose transformation, which is then used to seed the ICP estimation.

5.2.4 Combined LiDAR-Visual SLAM

Other works try to combine the LiDAR and visual SLAM results. The work in [76] proposes to use vision and LiDAR measurements by running a SLAM for each modality in parallel and coupling the data: the residuals from both modalities are used during the optimization phase. Zhang et al. [77] combined their previous works to design V-LOAM. This visual-LiDAR odometry performs high-frequency visual odometry and low-frequency LiDAR odometry to refine the motion estimation and correct the drift.
Perhaps the tightest fusion currently available is proposed in [78], where graph optimization is performed using a specific cost function that takes both laser and feature-point constraints into account. Here, both laser data and image data are used to estimate the robot pose. A 2.5D map is also built to speed up loop detection.

[78] Jiang, G.; Lei, Y.; Jin, S.; Tian, C.; Ma, X.; Ou, Y. A Simultaneous Localization and Mapping (SLAM) Framework for 2.5D Map Building Based on Low-Cost LiDAR and Vision Fusion. Appl. Sci. 2019, 9, 2105.

5.3 Summary

In summary, these examples mainly use sensor fusion to provide more information to LiDAR-only or vision-only SLAM frameworks. Among all the approaches for achieving such a SLAM (see Figure 7), hybrid frameworks have been the least studied. Creating a general SLAM framework using both visual information and laser ranging appears to be a real challenge. More tightly coupled LiDAR-vision sensor fusion algorithms have not been thoroughly studied in the literature and should be investigated.
Figure 7. Overview of the possible approaches to visual-LiDAR SLAM.

6. Discussion and Future Research Directions

After reviewing the literature, it appears that a fully tightly fused visual-LiDAR approach taking advantage of both sensor modalities does not yet exist. We argue that fusing LiDAR features with visual features in a tightly hybrid fashion would benefit the SLAM community, as it would make it feasible to solve a multi-modal, hybrid multi-constraint MAP problem. Such a solution would make SLAM more robust to environmental conditions such as lighting or weather. V-SLAM is known not to work under harsh lighting conditions or in texture-less environments, whereas LiDAR SLAM can. Conversely, LiDAR-SLAM does not perform well under rainy conditions (false impacts are detected) or in areas without geometric saliency (empty areas, long corridors), where camera-based SLAM can work fine.
We propose to investigate hybrid approaches using a set of different landmarks extracted from the different modalities (e.g., L = {L_vision, L_LiDAR}) in a multi-constraint MAP approach (see Figure 8).

Figure 8. Proposed hybrid multi-constraint MAP framework.
The proposed framework follows the classical SLAM architecture (as we proposed in [24]) and mainly includes three steps: (1) a data-processing step, with feature detection and tracking for both modalities; (2) an estimation step, which first estimates the displacement of the vehicle from the tracked features (this can be done by ICP, epipolar geometry, proprioceptive sensors, or their fusion, e.g., with a Kalman filter or a multi-criteria optimization) and then tries to match the detected features with the landmarks in the map; once the matching is done, the pose can be refined (filtered/optimized) and, finally, new landmarks can be estimated; (3) a final step dealing with the global map: does the current data define a keyframe (does it bring enough new information), and, based on the detection of loop closures, should a local or global optimization of the trajectory be performed? For this framework to work properly, major efforts must be made on (1) LiDAR scan analysis for feature detection and (2) the camera-LiDAR calibration process. Once accurate solutions to these two problems are found, a tight coupling between LiDAR and vision will be possible at the data level, resulting in more accurate state estimation.
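To give a flavor of what such a tightly coupled, multi-constraint estimation could look like, the sketch below builds a single least-squares problem for one 2D pose in which visual constraints (bearings to known landmarks, standing in for reprojection errors) and LiDAR constraints (scan points matched to map points) are stacked in the same residual vector. This is purely an illustration of the proposed idea, not an implementation of [24] or of any published system; all names and values are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def rot2d(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def joint_residuals(pose, vis_landmarks, vis_obs, lidar_pts, lidar_map, w_vis=1.0, w_lidar=1.0):
    """Stack visual and LiDAR residuals for one 2D pose (x, y, theta).

    vis_landmarks : (N, 2) known landmark positions in the world (visual features)
    vis_obs       : (N,)   bearing observations of those landmarks [rad]
    lidar_pts     : (M, 2) LiDAR points in the sensor frame
    lidar_map     : (M, 2) their associated map points in the world frame
    """
    x, y, th = pose
    R = rot2d(th)

    # Visual constraint: predicted vs observed bearing (a 2D stand-in for reprojection error).
    d = vis_landmarks - np.array([x, y])
    pred_bearing = np.arctan2(d[:, 1], d[:, 0]) - th
    r_vis = np.arctan2(np.sin(pred_bearing - vis_obs), np.cos(pred_bearing - vis_obs))

    # LiDAR constraint: scan points transformed into the world vs their matched map points.
    r_lidar = (lidar_pts @ R.T + np.array([x, y]) - lidar_map).ravel()

    return np.concatenate([w_vis * r_vis, w_lidar * r_lidar])

# Toy usage with made-up data: both modalities constrain the same pose estimate.
vis_lm = np.array([[5.0, 0.0], [0.0, 5.0]])
vis_obs = np.array([0.0, np.pi / 2])          # bearings as seen from the true pose (0, 0, 0)
lidar_pts = np.array([[2.0, 1.0], [3.0, -1.0]])
lidar_map = lidar_pts.copy()                  # map points coincide for the true pose
sol = least_squares(joint_residuals, x0=np.array([0.3, -0.2, 0.1]),
                    args=(vis_lm, vis_obs, lidar_pts, lidar_map))
print(sol.x)                                  # approximately (0, 0, 0)
```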

7. Conclusion

Researchers have conducted various studies to find the best implementation of SLAM. As it turns out, an autonomous robot can simultaneously estimate its own pose and a map of its surroundings, and SLAM remains a promising and exciting research topic in robotics. In theory, it is a complete solution for autonomous navigation. In practice, however, many problems can occur. Even though SLAM seems a very promising solution, can we predict to what extent its development can lead to truly autonomous navigation? It is therefore necessary to deepen our understanding of SLAM and of its contribution to artificially intelligent mobile robots.
At this point, some powerful and effective solutions exist that use vision sensors mixed with IMUs. Today, this approach is mainly used in industrial applications based on virtual or augmented reality. RGB-D cameras are a hot topic, but such sensors do not perform well in outdoor environments (where the ambient light strongly interferes with the detection). Vision methods are prone to drift and are very sensitive to the absence of salient features in the environment. To overcome the lack of features in monotonous indoor environments, geometric features such as lines, line segments, or edges have been studied. The main problems with such landmarks are (1) the lack of accurate descriptors for the matching stage and (2) a difficult initialization stage, with few detected corresponding 3D objects. As a result, the sparse 3D representation of the environment is less accurate due to feature mismatches or initialization errors. Finally, hybrid maps with different landmark representations can be generated, and a generalized multi-constraint MAP problem is then solved using these different objects and observations.

On the other hand, LiDAR-based SLAM also exists and provides good solutions. LiDAR methods provide very accurate 3D information about the environment, but are often time-consuming and still rely on fairly simple and not very robust scan-matching methods. Currently, few works address the analysis of 3D scans by extracting 3D landmarks. None of the SLAM methods using 3D LiDAR handle landmarks in a way similar to vision-based frameworks; the reason is the processing time required to analyze and extract LiDAR landmarks. Currently, planes are essentially the only features used in LiDAR-SLAM methods. However, planar surfaces are not very useful in natural outdoor environments, which are inherently not well structured. LiDAR-based SLAM is mainly based on scan-matching methods such as ICP, an algorithm that has remained virtually unchanged since its invention 30 years ago.
There have been some experiments coupling LiDAR and vision sensors, but all of them operate at a very loose level of fusion. The fusion is mainly done using the results of the two odometry steps, which means that the LiDAR detections and the visual detections cannot help each other; the decision is only made in a last step that fuses the relative displacement estimates. Other methods only use the LiDAR depth measurements to directly initialize visual features. Once again, the capabilities of the LiDAR are largely underused.
In future work, we will investigate tightly hybrid implementations of SLAM using sensor fusion. By fusing camera frames with LiDAR point clouds, we expect to build a robust, low-drift SLAM framework. Moreover, with LiDAR prices falling year after year, we expect such a solution to become less costly over time.

