20,000 words | Visual SLAM research review and future trend discussion

Original title: Visual SLAM: What are the Current Trends and What to Expect?

Translated title: Visual SLAM Research Review and Future Trend Discussion

Translated by: Dong Yawei


Abstract: In recent years, vision-based sensors have demonstrated remarkable performance, accuracy, and efficiency in simultaneous localization and mapping (SLAM) systems. Here, visual simultaneous localization and mapping (VSLAM) methods refer to SLAM methods that use cameras for pose estimation and map generation.

Many studies have shown that, despite its lower hardware cost, VSLAM can outperform traditional methods that rely only on specific sensors. VSLAM methods employ different camera types (e.g., monocular, stereo, and RGB-D), are tested on various datasets (e.g., KITTI, TUM RGB-D, and EuRoC) and in different environments (e.g., indoors and outdoors), and use various algorithms and techniques to better understand the environment.

These developments have drawn wide attention from researchers and produced many VSLAM methods. Accordingly, the main purpose of this paper is to present the latest advances in VSLAM systems and to discuss the existing challenges and trends. We conducted an in-depth literature survey of 45 influential papers published in the field of VSLAM and classified them according to different characteristics, including novelty of the method, novelty of the application domain, algorithm optimization, and semantic level. We also identify current trends and future directions, which may help researchers in their studies.

01  Introduction

Simultaneous Localization and Mapping (SLAM) refers to the process of constructing a map of an unknown environment while simultaneously estimating the location of an agent [1]. Here, the agent can be a household robot [2], an autonomous vehicle [3], a planetary rover [4], or even an unmanned aerial vehicle (UAV) [5], [6] or an unmanned ground vehicle (UGV) [7]. SLAM has a wide range of applications in environments where maps are not available or where the robot's position is unknown. In recent years, with the growing adoption of robotic technology, SLAM has gained great attention in both industry and research [8], [9].

SLAM systems can collect data from the environment using a variety of sensors, such as laser-based, acoustic, and visual sensors [10]. There is a variety of vision-based sensors, including monocular, stereo, event-based, omnidirectional, and RGB-depth (RGB-D) cameras. Robots with vision sensors use the visual data provided by the cameras to estimate the position and orientation of the robot relative to its surroundings [11]. SLAM performed with visual sensors is called visual SLAM (VSLAM).

Using visual data in SLAM has the advantages of cheaper hardware, more intuitive object detection and tracking, and the ability to provide rich visual and semantic information [12]. The captured images (or video frames) can also be used in vision-based applications, including semantic segmentation and object detection. These characteristics make VSLAM a popular direction in robotics and have prompted extensive research and surveys by robotics and computer vision (CV) experts over the past few decades. Consequently, VSLAM is already used in various applications that require reconstructing 3D models of the environment, such as autonomous driving, augmented reality (AR), and service robotics [13].

To cope with the high computational cost, the general architecture introduced in [14] splits the SLAM method into two parallel threads, namely tracking and mapping. A taxonomy of the algorithms used in VSLAM therefore reflects how researchers employ different methods and strategies in each thread. According to the type of data used by the system, SLAM methods can be divided into two categories: direct methods and indirect (feature-based) methods [15].

Indirect methods extract feature points (i.e., keypoints) from textured regions and track them by matching descriptors across consecutive frames. Despite the computational cost of the feature extraction and matching stages, these methods are accurate and robust to frame-to-frame changes in light intensity. Direct methods, on the other hand, estimate camera motion directly from pixel-level data and optimize by minimizing photometric error. Relying on photometric consistency, these methods exploit all camera pixels and track their displacement across successive frames under constraints such as brightness and color. These properties enable direct methods to model more of the image content than indirect methods and to achieve higher-precision 3D reconstruction. However, although direct methods work better in poorly textured environments and require no extra computation for feature extraction, they often face large-scale optimization problems [16]. The pros and cons of each approach encourage researchers to develop hybrid solutions that combine the two, typically with indirect and direct stages in which one initializes and corrects the other.
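To make the contrast concrete, the two families can be summarized by the cost functions they minimize. The formulation below is a generic sketch with illustrative symbols, not the exact cost of any particular system. Indirect methods minimize the reprojection error of matched keypoints:

$$T^{*} = \arg\min_{T} \sum_{i} \left\| \mathbf{u}_i - \pi\left(T\,\mathbf{X}_i\right) \right\|^{2},$$

where $\mathbf{u}_i$ are the detected keypoints, $\mathbf{X}_i$ the corresponding 3D landmarks, $\pi$ the camera projection function, and $T$ the camera pose. Direct methods instead minimize the photometric error over a set of pixels $\Omega$:

$$T^{*} = \arg\min_{T} \sum_{\mathbf{p} \in \Omega} \left\| I_{k}(\mathbf{p}) - I_{k+1}\left(\pi\left(T\,\pi^{-1}(\mathbf{p}, d_{\mathbf{p}})\right)\right) \right\|^{2},$$

where $I_{k}$ and $I_{k+1}$ are consecutive image intensities and $d_{\mathbf{p}}$ is the depth of pixel $\mathbf{p}$.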

Figure 1 The standard visual SLAM pipeline. Depending on whether a direct or indirect method is used, the functionality of some of these modules may be changed or omitted

Moreover, since VSLAM mainly comprises a visual odometry (VO) front-end (for locally estimating the camera trajectory) and a SLAM back-end (for optimizing the created map), the diversity of modules used in each part leads to differences between implementations. VO provides an initial estimate of the robot pose based on local consistency, which is sent to the back-end for optimization. The main difference between VSLAM and VO is therefore whether the global consistency of the map and the estimated trajectory is considered. Some state-of-the-art VSLAM applications also include two additional modules, loop closure detection and mapping [15], which are responsible for detecting previously visited locations to enable more accurate tracking and mapping based on the camera pose.

Figure 1 shows the overall architecture of the standard VSLAM approach. The system's input can also be fused with other sensor data, such as inertial measurement unit (IMU) and lidar measurements, to provide more than purely visual information. In addition, depending on whether a direct or an indirect method is used in the VSLAM pipeline, the function of the visual feature processing module may be changed or omitted; for example, the feature processing stage is only needed by indirect methods. Another distinguishing factor is the use of specific modules, such as loop closure detection and bundle adjustment, to improve performance.
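For illustration only, the pipeline in Figure 1 can be sketched as a minimal skeleton. The class and method names below are hypothetical and do not correspond to any particular framework's API; a real system plugs concrete algorithms into each placeholder stage.

```python
import numpy as np

class VSLAMPipeline:
    """Hypothetical skeleton of the standard pipeline in Figure 1 (names are illustrative)."""

    def __init__(self, use_indirect=True):
        self.use_indirect = use_indirect  # only indirect methods need feature processing
        self.keyframes = []               # frames kept for the back-end
        self.map_points = []              # 3D landmarks of the global map
        self.pose = np.eye(4)             # current camera pose (4x4 homogeneous matrix)

    # --- placeholder stages; a real system substitutes actual algorithms ---
    def extract_features(self, frame):   return []          # e.g., ORB keypoints
    def track(self, frame, features):    return self.pose   # front-end VO estimate
    def is_keyframe(self, frame, pose):  return not self.keyframes
    def triangulate(self, frame, pose):  return []          # new 3D map points
    def detect_loop(self, frame):        return False       # place recognition
    def optimize_globally(self):         pass               # e.g., pose-graph / BA

    def process_frame(self, frame):
        # 1) Visual feature processing (used only by indirect methods)
        features = self.extract_features(frame) if self.use_indirect else None
        # 2) Front-end / visual odometry: local pose estimate
        self.pose = self.track(frame, features)
        # 3) Keyframe selection and local mapping
        if self.is_keyframe(frame, self.pose):
            self.keyframes.append((frame, self.pose))
            self.map_points.extend(self.triangulate(frame, self.pose))
        # 4) Back-end: loop closure detection and global optimization
        if self.detect_loop(frame):
            self.optimize_globally()
        return self.pose
```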

This paper summarizes 45 VSLAM papers and classifies them into different categories according to different aspects. We hope that our work will serve as a reference for robotics researchers working to optimize VSLAM techniques.

The rest of this article is structured as follows:

Section II reviews the evolution of VSLAM algorithms.

Section III introduces and discusses other surveys in the field of VSLAM.

Section IV briefly introduces each module of VSLAM.

Section V discusses the classification of VSLAM based on different application goals.

Section VI discusses unresolved issues and potential research trends in this area.

02   Evolution of Visual SLAM

VSLAM systems have matured over the past few years, and several frameworks have played an important role in this development. To give a clear overview, Figure 2 shows the widely used VSLAM methods that have influenced the development of the SLAM community and serve as standard references for other frameworks.

Figure 2 Highly influential visual SLAM methods

The first attempt in the literature to implement a real-time monocular VSLAM system was developed by Davison et al. in 2007, who introduced a framework called Mono-SLAM [17]. Their indirect framework estimates real-world camera motion and 3D scene structure using the Extended Kalman Filter (EKF) algorithm [18]. Despite lacking global optimization and loop closure detection modules, Mono-SLAM played a major role in establishing the VSLAM domain. However, maps reconstructed in this way include only landmarks and provide no further details about the area.
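For reference, the core of an EKF-based estimator is the predict/update cycle sketched below. This is a generic, textbook sketch rather than Mono-SLAM's specific state parameterization (which also includes camera velocity and landmark positions); all function names are placeholders.

```python
import numpy as np

def ekf_step(x, P, u, z, f, F, h, H, Q, R):
    """One generic EKF iteration: state x, covariance P, control u, measurement z.
    f/h are the motion and measurement models, F/H their Jacobians, Q/R the noise covariances."""
    # Predict: propagate the state and covariance through the motion model
    x_pred = f(x, u)
    F_k = F(x, u)
    P_pred = F_k @ P @ F_k.T + Q

    # Update: correct the prediction using the (visual) measurement
    H_k = H(x_pred)
    y = z - h(x_pred)                        # innovation
    S = H_k @ P_pred @ H_k.T + R             # innovation covariance
    K = P_pred @ H_k.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H_k) @ P_pred
    return x_new, P_new
```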

Klein et al. [14] proposed Parallel Tracking and Mapping (PTAM) in the same year, dividing the entire VSLAM system into two main threads: tracking and mapping. This multithreading standard was adopted by many subsequent works discussed in this paper. The main idea of their approach is to reduce computational cost and apply parallel processing to achieve real-time performance. While the tracking thread estimates camera motion in real time, the mapping thread predicts the 3D locations of feature points. PTAM was also the first method to jointly optimize camera poses and the 3D map using bundle adjustment (BA). It uses the FAST [19] corner detector for keypoint matching and tracking. Although this algorithm performs better than Mono-SLAM, its design is complex and requires manual input from the user during the initialization stage.
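As a small reference point, FAST corners of the kind PTAM tracks can be detected with OpenCV; the file names below are placeholders.

```python
import cv2

# Detect FAST corners in a grayscale frame (file name is a placeholder).
frame = cv2.imread("frame_000001.png", cv2.IMREAD_GRAYSCALE)

fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(frame, None)
print(f"Detected {len(keypoints)} FAST corners")

# Corners like these are what the tracking thread matches across frames, while the
# mapping thread triangulates their 3D positions and refines them with bundle adjustment.
vis = cv2.drawKeypoints(frame, keypoints, None, color=(0, 255, 0))
cv2.imwrite("fast_corners.png", vis)
```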

In 2011, Newcombe et al. introduced a direct method that estimates depth values and motion parameters to construct maps, namely Dense Tracking and Mapping (DTAM). DTAM is a real-time framework equipped with dense mapping and tracking modules that determines the camera pose by aligning the entire frame with a given depth map. To construct the environment map, these stages estimate the depth and motion parameters of the scene separately. While DTAM can provide a detailed representation of maps, its real-time execution requires high computational cost.

As another indirect approach in the field of 3D mapping and pixel-based optimization, Endres et al. proposed an RGB-D camera-based approach in 2013. Their approach runs in real time and targets low-cost embedded systems and small robots, but fails to produce accurate results in featureless or otherwise challenging scenarios. In the same year, Salas-Moreno et al. [22] made the first attempt to exploit semantic information in a real-time SLAM framework, named SLAM++. Their system takes RGB-D sensor output and performs 3D camera pose estimation and tracking to form a pose graph. Nodes in the pose graph represent pose estimates and are connected by edges that represent the relative poses between nodes together with measurement uncertainty [23]. The predicted pose is then refined by incorporating the relative 3D poses obtained from semantic objects in the scene.

As the basic framework of VSLAM matured, researchers focused on improving the performance and accuracy of these systems. In this regard, Forster et al. proposed a hybrid VO approach in 2014 as part of the VSLAM architecture, called semi-direct visual odometry (SVO) [24]. Their approach can combine feature-based and direct methods for sensor-based motion estimation and mapping tasks. SVO can work with monocular and stereo cameras and is equipped with a pose refinement module that minimizes reprojection errors. However, the main disadvantage of SVO is that it adopts short-term data association and cannot perform loop closure detection and global optimization.

LSD-SLAM [25] is another influential VSLAM method introduced by Engel et al. in 2014, which includes tracking, depth map estimation, and map optimization. The method can reconstruct large-scale maps using its pose graph estimation module, with global optimization and loop closure detection. The weakness of LSD-SLAM is its challenging initialization phase, which requires all points to lie in a plane, and the fact that it is a computationally intensive method.

Mur Artal et al. proposed two accurate indirect VSLAM methods that have attracted the attention of many researchers so far: ORB-SLAM [26] and ORB-SLAM 2.0 [27]. These methods can accomplish localization and mapping in well-textured sequences, and perform high-performance pose detection using Oriented FAST and Rotated BRIEF (ORB) features. The first version of ORB-SLAM was able to use keyframes collected from camera positions to compute camera position and environment structure. The second version is an extension of ORB-SLAM with three parallel threads, including tracking for finding feature correspondences, local mapping for map management operations, and loop closure for detecting new loops and correcting drift errors. Although ORB-SLAM 2.0 can be used with monocular and stereo camera setups, it cannot be directly used for autonomous navigation due to the unknown scale of the reconstructed map data. Another downside to this method is that it won't work in areas without texture or in environments with repeating patterns. The latest version of this framework, named ORB-SLAM 3.0, was proposed in 2021 [28]. It works with various camera types, such as monocular, RGB-D, and stereo vision, and provides improved pose estimation output.

In recent years, with the remarkable impact of deep learning in various fields, methods based on deep neural networks can solve many problems by providing higher recognition and matching rates. Similarly, replacing hand-crafted features with learned features in VSLAM is one of the solutions proposed by many recent deep learning-based methods.

In this regard, Tateno et al. proposed a convolutional neural network (CNN) based approach that processes input frames for camera pose estimation and uses keyframes for depth estimation, named CNN-SLAM [29]. Segmenting camera frames into smaller parts to better understand the environment is one of the ideas in CNN-SLAM to provide parallel processing and real-time performance.

As a different approach, Engel et al. also introduced a new trend in direct VSLAM algorithms called Direct Sparse Odometry (DSO) [30], which combines the direct method with sparse reconstruction to extract the points of highest intensity. It takes image formation parameters into account and tracks a sparse set of pixels rather than extracted features. It should be noted that DSO can only achieve full accuracy with a photometrically calibrated camera and cannot obtain high-precision results with conventional cameras.

In summary, during the evolution of VSLAM systems, recent approaches have focused on the parallelism of multiple specialized modules. These modules form a common technology and framework compatible with a wide variety of sensors and environments. The above properties enable them to execute in real-time and be more flexible in terms of performance improvement.

03   Related Surveys

There are various review papers in the field of VSLAM that provide a comprehensive analysis of different existing methods. Each paper reviews the main advantages and disadvantages of adopting a VSLAM approach.

Macario Barros et al. [31] categorize visual SLAM schemes into three distinct categories: vision-only (monocular), visual-inertial (stereo), and RGB-D. They also proposed various criteria to simplify the analysis of VSLAM. However, they do not include other vision sensors, such as event-based sensors, which we discuss in Section 4.1.

Chen et al. [32] collated a large amount of traditional and semantic VSLAM literature. They divided the SLAM development era into classic, algorithm analysis, and robust perception stages, and introduced the hot issues at that time. They also summarize classical frameworks with direct/indirect approaches and investigate the impact of deep learning algorithms in semantic segmentation. Although their work provides a comprehensive exposition of high-order solutions in this field, the taxonomy of methods is limited to the types of features used in feature-based VSLAM.

Jia et al. [33] surveyed a large number of papers and made a simple comparison between methods based on graph optimization and methods using deep learning. However, despite appropriate comparisons, their conclusions cannot be properly generalized due to the limited number of papers surveyed.

In another work, Abaspur Kazerouni et al. [34] covered various VSLAM methods, the sensing devices, datasets, and modules they exploit, and simulated several indirect methods for comparison and analysis. However, they only address feature-based algorithms such as HOG, Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and deep learning-based solutions. Bavle et al. [35] analyze aspects of pose perception in various SLAM and VSLAM applications and discuss their shortcomings. They conclude that exploiting the semantic scene features that current approaches lack could improve the results of ongoing research.

Other surveys have studied state-of-the-art VSLAM methods for specific topics or trends. For example, Duan et al. [15] studied the progress of deep learning in visual SLAM systems for transportation robots. In that paper, the authors summarize the advantages and disadvantages of using various deep learning-based methods in VO and loop closure detection tasks. A significant advantage of deep learning methods in VSLAM is more accurate feature extraction for pose estimation and improved overall performance.

In another work in the same field, Arshad and Kim [36] focused on the impact of deep learning algorithms in loop closure detection using visual data. They reviewed various VSLAM papers and analyzed the long-term autonomy of the robot under different conditions.

Singandhupe and La [37] summarized the impact of VO and VSLAM on unmanned vehicles. They collated the methods evaluated on the KITTI dataset, allowing them to briefly describe the strengths and weaknesses of each system.

In a similar article, Cheng et al. [32] reviewed VSLAM-based autonomous driving systems and proposed future development trends for such systems.

Several other researchers have investigated the ability of VSLAM to work under real-world conditions. For example, Saputra et al. [38] discuss the reconstruction, splitting, tracking, and parallel execution of threads for variations of VSLAM techniques operating in dynamic and harsh environments.

This review differs from previous ones in providing a comprehensive analysis of VSLAM methods across different aspects. Compared with other VSLAM surveys, the main contributions of this paper are:

  • Categorize various recent publications in VSLAM according to the main contributions, criteria, and goals of researchers proposing new solutions

  • Analyze current trends in VSLAM by delving into different approaches in different aspects

  • Introduce potential problems of VSLAM

04   Modules of Visual SLAM

Combining various visual SLAM methods, we divide the requirements of different stages into the following modules:

4.1 Sensors and data acquisition

Early implementations of VSLAM, such as the algorithm introduced by Davison et al. [17], were equipped with a monocular camera for trajectory recovery. Monocular cameras are also the most common vision sensors used for various tasks such as object detection and tracking [39]. Stereo cameras, on the other hand, contain two or more image sensors, enabling them to perceive depth in the captured images and thereby achieve better performance in VSLAM applications. Such camera configurations are worthwhile when higher-precision perception is required. RGB-D cameras are another variant of vision sensors used in VSLAM that provide both depth and color information of a scene. Given proper lighting and motion speed, the aforementioned vision sensors can provide rich information about the environment, but they often struggle under poor lighting conditions or in scenes with a large dynamic range.

In recent years, event cameras have also been used in various VSLAM applications. These low-latency, bio-inspired vision sensors report pixel-level brightness changes when motion is detected, rather than standard intensity frames, enabling high-dynamic-range output without motion blur [40]. Compared with standard cameras, event-based sensors can provide accurate visual information during high-speed motion and in highly dynamic scenes, but cannot provide sufficient information when the motion rate is low. Although event cameras can outperform standard vision sensors under harsh lighting and dynamic-range conditions, they deliver asynchronous information about the environment, which traditional vision algorithms cannot directly process [41]. Furthermore, using spatio-temporal windows of events together with data obtained from other sensors can provide rich pose estimation and tracking information.

In addition, some methods use multi-camera configurations to mitigate common problems of working in real environments and to improve localization accuracy. Utilizing multiple vision sensors can help resolve complex issues such as occlusion, camouflage, sensor failure, or sparse trackable textures, provided the cameras have overlapping fields of view. Although multi-camera configurations can solve some data acquisition problems, camera-only VSLAM may still face various problems, such as motion blur caused by fast-moving objects, feature mismatch in low or high light, and missed dynamic objects in rapidly changing scenes. Therefore, some VSLAM applications are equipped with additional sensors alongside the camera. Fusing events and standard frames [42] or integrating other sensors such as LiDAR [43] and IMUs into VSLAM are some existing solutions.

4.2 Application scenarios

A strong assumption in many traditional VSLAM approaches is that the robot works in a relatively static world with no unexpected changes. Therefore, although many systems can be applied successfully in specific environments, unexpected changes in the environment (e.g., the presence of moving objects) can complicate the system and degrade state estimation quality to a large extent. Systems working in dynamic environments typically use algorithms such as optical flow or Random Sample Consensus (RANSAC) [44] to detect motion in the scene, classify moving objects as outliers, and skip them. Such systems exploit geometric information, semantic information, or a combination of both to improve localization [45].
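A minimal version of this idea is sketched below, assuming two consecutive grayscale frames are available (file names are placeholders): features are tracked with Lucas-Kanade optical flow, and correspondences inconsistent with the dominant epipolar geometry are rejected by RANSAC as likely dynamic points.

```python
import cv2
import numpy as np

# Placeholder frames from a hypothetical sequence.
prev = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

# 1) Detect good points to track in the previous frame.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)

# 2) Track them into the current frame with pyramidal Lucas-Kanade optical flow.
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)
good0 = p0[status.ravel() == 1].reshape(-1, 2)
good1 = p1[status.ravel() == 1].reshape(-1, 2)

# 3) Fit the fundamental matrix with RANSAC; correspondences that violate it
#    (e.g., points on independently moving objects) are flagged as outliers
#    and can be skipped during pose estimation.
F, inlier_mask = cv2.findFundamentalMat(good0, good1, cv2.FM_RANSAC, 1.0, 0.99)
static0 = good0[inlier_mask.ravel() == 1]
static1 = good1[inlier_mask.ravel() == 1]
print(f"{len(static0)} static correspondences kept out of {len(good0)}")
```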

Furthermore, as a general classification, environments can be divided into indoor and outdoor categories. Outdoor environments can be urban areas with structural landmarks and large-scale motion changes (such as buildings and road textures), or off-road areas with few stable cues (such as moving clouds and vegetation, or sand textures), which increase the risk of failure in localization and loop closure detection. Indoor environments, on the other hand, contain scenes with completely different global spatial properties, such as corridors, walls, and rooms. While a VSLAM system may work well in one of these settings, it may not exhibit the same performance in another.

4.3 Visual Feature Processing

As mentioned in Section I, detecting visual features and exploiting feature descriptor information for pose estimation is an indispensable stage of indirect VSLAM methods. These methods use various feature extraction algorithms to better understand the environment and to track feature points across consecutive frames. Many algorithms are used in the feature extraction stage, including SIFT [46], SURF [47], FAST [19], BRIEF [48], and ORB [49]. Among them, compared with SIFT and SURF [50], ORB features have the advantage of fast extraction and matching without losing much accuracy.
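As a small illustration of this feature-processing stage (file names are placeholders), ORB keypoints can be extracted and matched between two frames with OpenCV:

```python
import cv2

# Placeholder file names for two consecutive frames.
img1 = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors with Hamming distance (appropriate for binary descriptors)
# and sort by distance; these correspondences feed pose estimation.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} ORB correspondences between the two frames")
```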

The problem with some of the above methods is that they do not adapt well to various complex and unforeseen situations. Therefore, many researchers use CNNs to extract deep image features at different stages, including VO, pose estimation, and loop closure detection. Depending on their design, these techniques can be supervised or unsupervised frameworks.

4.4 Evaluation

While some VSLAM methods, especially those capable of working in dynamic and challenging environments, have been tested on robots under real-world conditions, many research works have used publicly available datasets to demonstrate their applicability.

The RAWSEEDS dataset of Bonarini et al. [51] is a well-known multi-sensor benchmark that contains indoor, outdoor, and mixed robot trajectories with ground-truth data. It is one of the first publicly available benchmarks for robotics and SLAM purposes.

SceneNet RGB-D by McCormac et al. [52] is another popular dataset for scene understanding problems such as semantic segmentation and object detection, containing 5 million large-scale rendered RGB-D images. The dataset also contains pixel-accurate ground-truth labels and accurate camera pose and depth data, which make it a powerful tool for VSLAM applications.

Many recent works in the field of VSLAM and VO have tested their methods on the TUM RGB-D dataset [53]. This dataset and benchmark contain color and depth images captured by a Microsoft Kinect sensor together with the corresponding ground-truth sensor trajectories.
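For orientation, each TUM RGB-D sequence ships plain-text index files (rgb.txt, depth.txt, groundtruth.txt) whose rows start with a timestamp. The sketch below is a simplified stand-in for the dataset's own association tooling; the sequence path is only an example.

```python
def load_index(path):
    """Parse a TUM RGB-D index file: lines of '<timestamp> <values...>', '#' marks comments."""
    entries = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            parts = line.split()
            entries.append((float(parts[0]), parts[1:]))
    return entries

def associate(rgb, depth, max_dt=0.02):
    """Pair each RGB frame with the depth frame closest in time (within max_dt seconds)."""
    pairs = []
    for t_rgb, rgb_vals in rgb:
        t_depth, depth_vals = min(depth, key=lambda e: abs(e[0] - t_rgb))
        if abs(t_depth - t_rgb) <= max_dt:
            pairs.append((t_rgb, rgb_vals[0], depth_vals[0]))
    return pairs

# Example usage on an extracted sequence (path is illustrative):
# rgb_index = load_index("rgbd_dataset_freiburg1_xyz/rgb.txt")
# depth_index = load_index("rgbd_dataset_freiburg1_xyz/depth.txt")
# frames = associate(rgb_index, depth_index)
```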

Alternatively, NTU VIRAL by Nguyen et al. [54] is a dataset collected by a drone equipped with a 3D lidar, cameras, an IMU, and multiple ultra-wideband (UWB) sensors. The dataset contains indoor and outdoor instances and is designed to evaluate autonomous navigation and aerial operation performance.

Furthermore, EuRoC MAV [55] by Burri et al. is another popular dataset that contains images captured by a stereo camera along with synchronized IMU measurements and motion ground truth data. According to the environmental conditions, the data collected in EuRoC MAV are divided into three categories: easy, medium and difficult.

OpenLORIS-Scene [56] by Shi et al. is another publicly available dataset for VSLAM work, containing a large amount of data collected by wheeled robots equipped with various sensors. It provides suitable data for monocular and RGB-D algorithms, as well as odometry data from wheel encoders.

As a more general dataset used in VSLAM, KITTI [57] contains data captured by two high-resolution RGB and grayscale cameras mounted on a moving vehicle. KITTI uses GPS and laser sensors to provide accurate ground-truth information, making it a very popular dataset in mobile robotics and autonomous driving.

TartanAir [58] is another standard dataset for evaluating SLAM algorithms in complex scenes.

In addition, the Imperial College London and National University of Ireland Maynooth (ICL-NUIM) dataset [59] is another VO dataset containing handheld RGB-D camera sequences, which has been used as a benchmark for many SLAM systems.

Unlike the previous datasets, some others contain data acquired with specialized rather than regular cameras. For example, the Event Camera Dataset introduced by Mueggler et al. [60] uses samples collected from event-based cameras for high-speed robot evaluation. Its sequences contain inertial measurements, intensity images, and ground-truth poses from a motion capture system, making it a suitable benchmark for VSLAM systems equipped with event cameras.

The above datasets are used in various VSLAM methods depending on the sensor setup, application and target environment. These datasets mainly contain camera calibration parameters as well as ground truth data. Table 1 and Fig. 3 show the summarized characteristics of the datasets and some examples of each dataset, respectively.

Table 1 Commonly used VSLAM datasets; GT in the table refers to the availability of ground truth
Fig. 3 Examples of some mainstream visual SLAM datasets used for evaluation in various papers. The characteristics of these datasets are listed in Table 1.

4.5 Semantic layer

Robots need semantic information to understand the surrounding scene and make more favorable decisions. In many recent VSLAM works, adding semantic information to geometry-based data is better than purely geometry-based approaches, enabling it to provide more information about the surrounding environment [61]. In this regard, pre-trained object recognition modules can add semantic information to VSLAM models [62]. One of the latest approaches is to use CNNs in VSLAM applications. In general, semantic VSLAM methods consist of the following four main components [43]:

Tracking: It uses 2D feature points extracted from consecutive video frames to estimate the camera pose and build a 3D map point cloud. The calculation of the camera pose and the construction of the 3D map point cloud establish the reference data for the localization and mapping process, respectively.

Local mapping: By processing two consecutive video frames, new 3D map points are created, which are used together with the BA module to optimize the camera pose.

Loop closure detection: It adjusts the camera pose and optimizes the built map by comparing keyframes with extracted visual features and evaluating the similarity between them.

Non-Rigid Context Culling (NRCC): The main purpose of NRCC is to filter out transient objects from video frames to reduce their adverse effects on the localization and mapping stages. It mainly consists of a masking/segmentation process that separates unstable instances, such as people, from the frames. Since NRCC reduces the number of feature points to be processed, it simplifies computation and yields more robust performance.
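The NRCC idea can be roughly sketched as follows. The segmentation model and its label ids are assumptions here; any per-pixel segmentation network (e.g., SegNet or Mask R-CNN) could supply the mask used to suppress dynamic regions before feature extraction.

```python
import cv2
import numpy as np

DYNAMIC_LABELS = {15}  # assumed class id (e.g., 'person') of some segmentation model

def static_keypoints(frame_gray, semantic_labels):
    """Extract ORB keypoints only from regions not labeled as dynamic objects.
    `semantic_labels` is a per-pixel class-id map produced by a segmentation network."""
    mask = np.full_like(frame_gray, 255, dtype=np.uint8)
    for label in DYNAMIC_LABELS:
        mask[semantic_labels == label] = 0       # suppress people / other movers

    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(frame_gray, mask)
    return keypoints, descriptors
```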

Therefore, exploiting a semantic layer in VSLAM methods can reduce the uncertainty of pose estimation and map construction. However, correctly using the extracted semantic information without greatly increasing the computational cost remains a challenge.

05   Classification of VSLAM methods based on application goals

In order to identify VSLAM methods that achieve excellent results and have stable architectures, we collected and screened highly cited publications from recent years at top venues, using Google Scholar and the well-known computer science bibliographic databases Scopus and DBLP. We also studied the papers cited in those publications and selected the ones most relevant to the field of VSLAM. After studying the papers, we categorized them according to the specific problems they mainly address, as follows:

5.1 Goal 1: Multi-Sensor Processing

This category covers VSLAM methods that use various sensors to better understand the environment. While some technologies simply use cameras as sensors, others combine various sensors to improve the accuracy of algorithms.

1) Using multiple cameras:

Since it is difficult to reconstruct the 3D trajectory of a moving object with one camera, some researchers suggest using multiple cameras. For example, CoSLAM is a VSLAM system introduced by Zou and Tan [63] that uses individual cameras deployed on different platforms to reconstruct robust maps. Their system integrates multiple cameras moving independently in a dynamic environment and reconstructs a map from their overlapping fields of view. This makes it easier to reconstruct dynamic point clouds in 3D by integrating intra-camera and inter-camera pose estimation and mapping. CoSLAM uses the Kanade-Lucas-Tomasi (KLT) algorithm to track visual features and operates in indoor/outdoor static and dynamic environments where relative positions and orientations may change over time. The main disadvantage of this approach is that complex hardware is required to process the data output by a large number of cameras, and the computational cost increases as more cameras are added.

For challenging in-the-wild scenarios, Yang et al. [64] developed a multi-camera collaborative panoramic VSLAM approach. Their approach requires each camera to be independent to improve the performance of VSLAM systems in difficult conditions, such as occlusions and sparsely textured environments. To determine the matching range, they extract ORB features from the overlapping fields of view of the cameras. In addition, they adopted CNN-based deep learning techniques to identify similar features for loop closure detection. In the experiments, the authors used a dataset generated by a panoramic camera and an integrated navigation system.

MultiCol-SLAM is another open-source VSLAM framework, by Urban and Hinz, that uses a multi-camera configuration [65]. Using their previously created model, MultiCol, they augment ORB-SLAM with a keyframe-based process that supports multiple fisheye cameras. They added a multi-keyframe (MKF) processing module to ORB-SLAM, which groups the captured images into multi-keyframes. The authors also propose the idea of multi-camera loop closing, where loops are detected from MKFs. Although their method runs in real time, it requires substantial computing power since several threads have to run simultaneously.

2) Employing multiple sensors:

Some other approaches recommend fusing multiple sensors, combining vision-based and inertial sensor outputs for better performance. In this regard, Zhu et al. [66] proposed a low-cost, indirect, lidar-assisted VSLAM named CamVox and demonstrated its reliable performance and accuracy. Their approach builds on ORB-SLAM 2.0, combining a Livox lidar, as an advanced depth sensor, with the camera's RGB output. The authors used an IMU to synchronize and correct for non-repetitive scan positions. Their contribution is a method for automatic lidar-camera calibration that operates in uncontrolled environments. Real-world tests on robotic platforms have shown that CamVox processes its environment in real time.

The authors of [67] proposed a multimodal system called VIRAL (Visual-Inertial-Ranging-Lidar) SLAM, which couples a camera, lidar, IMU, and UWB. They also proposed a visual feature-map matching and marginalization scheme based on local maps constructed from lidar point clouds. Visual features are extracted and tracked using the BRIEF algorithm. The framework also contains synchronization schemes and triggers for the sensors used. They tested their method in a simulated environment and on their own dataset, NTU VIRAL [54], which contains data captured by cameras, lidar, IMU, and UWB sensors. However, their method is computationally expensive due to the handling of synchronization, multithreading, and sensor conflicts.

Vidal et al. [42] propose to integrate event cameras, standard camera frames, and an IMU in a parallel configuration for reliable pose estimation in high-speed settings. Their Ultimate SLAM system builds on event cameras and the keyframe-based nonlinear optimization thread introduced in [68]. They use the FAST corner detector and the Lucas-Kanade tracking algorithm for feature detection and tracking, respectively. Ultimate SLAM avoids the motion blur issues that come with high-speed motion and operates in dynamic environments with varying lighting conditions. The efficiency of this technique on the Event Camera Dataset is evident when compared with configurations using only event cameras or only regular cameras. The authors also tested Ultimate SLAM on an autonomous quadrotor drone equipped with an event camera to demonstrate how their system handles flight conditions that conventional VO platforms cannot. The main problem Ultimate SLAM faces is the synchronization of the event and standard frame outputs.

Nguyen et al. [69] proposed a VSLAM approach that tightly couples a monocular camera with UWB ranging sensors. They create maps using a combination of feature-based (visual) and featureless (UWB) landmarks, which remains effective even when UWB suffers from multi-path effects in crowded environments. They built an indirect method on top of ORB-SLAM and used ORB features for pose estimation. They tested their system on a dataset collected hand-held to simulate data collection by an aerial robot. Synchronizing the camera and UWB sensor is a major difficulty in this setting, which they overcame by associating a new camera pose and timestamp with each new image.

5.2 Goal 2: Pose Estimation

This type of method focuses on how to optimize the pose estimation of VSLAM using various algorithms.

1) Using line/point data:

In this regard, Zhou et al. [70] propose to use building structure lines as useful features to determine the camera pose. Structural lines are associated with dominant directions and encode global orientation information, improving the predicted trajectory. The resulting StructSLAM is a six-degree-of-freedom (DoF) VSLAM technique that can operate in low-feature and feature-free conditions. It uses an EKF to estimate the state variables based on the current orientations present in the scene. For evaluation, the indoor scene dataset from RAWSEEDS 2009 and a set of generated image sequences are used.

Point and Line SLAM (PL-SLAM) is an ORB-SLAM-based VSLAM system proposed by Pumarola et al. [71], which is optimized for non-dynamic and low-texture scenes. The system fuses line and point features simultaneously to improve pose estimation and to help the system operate with fewer feature points. The authors tested PL-SLAM on a generated dataset and on TUM RGB-D. The drawbacks of the approach are its computational cost and the fact that other geometric elements, such as planes, could be used to achieve higher accuracy.

Gomez-Ojeda et al. [72] introduced another PL-SLAM (distinct from the framework of the same name by Pumarola et al. [71]), an indirect VSLAM technique that uses points and lines from stereo cameras to reconstruct maps, even in scenes where point features alone are unreliable. In their method, the point and line segments used in all VSLAM modules are merged with the visual information obtained from consecutive frames. Points and lines are retrieved and tracked in subsequent stereo frames using the ORB and Line Segment Detector (LSD) algorithms. The authors tested PL-SLAM on the EuRoC and KITTI datasets, where it can outperform the stereo version of ORB-SLAM 2.0. One of the main disadvantages of PL-SLAM is the computation time required by the feature tracking module, and the fact that almost all structure lines have to be covered in order to extract more information about the environment.

Lim et al. [73] introduced a degeneracy-avoidance technique for monocular point-line-based VSLAM. Another contribution of their method is a robust optical-flow-based line tracking module that extracts line features, filters out short lines in each frame, and matches previously identified line features. To demonstrate the effectiveness of their technique and its superiority over established point-based methods, they tested their system on the EuRoC MAV dataset. Despite its good results, this system lacks an adaptive way of identifying the correct optimization parameters.

2) Using other features:

A framework for stereo cameras is proposed in [74]: Dual Quaternion Visual SLAM (DQV-SLAM), which uses a Bayesian framework for 6-DoF pose estimation. To avoid linearizing the nonlinear group of spatial transformations, their method uses progressive Bayesian updating. For the map point clouds and optical flow, DQV-SLAM uses ORB features to achieve reliable data association in dynamic environments. The method achieved reliable estimation results on the KITTI and EuRoC datasets. However, it lacks a probabilistic interpretation for the stochastic modeling of poses, and its filtering based on sampling approximations is computationally demanding.

The authors of [75] developed a technique to reconstruct large-scale indoor environment maps using artificial square planar markers. Their real-time SPM-SLAM system can use the markers to resolve ambiguity in pose estimation if at least two markers are observable in each video frame. They created a dataset containing video sequences of markers placed in two rooms linked by a door. Although SPM-SLAM is valuable, it is only effective when multiple planar markers are scattered around the region and at least two markers are available for establishing marker connections. Furthermore, the ability of their framework to handle dynamic changes in the scene has not been evaluated.

3) Deep learning-based methods

Bruno and Colombini [76] proposed LIFT-SLAM, which combines deep learning-based feature descriptors with a traditional geometry-based system. They extended the pipeline of ORB-SLAM and used a CNN to extract features from images, using the learned features to provide denser and more accurate matches. For detection, description, and orientation estimation, LIFT-SLAM fine-tunes the LIFT deep neural network. Studies using indoor and outdoor instances of the KITTI and EuRoC MAV datasets show that LIFT-SLAM outperforms traditional feature-based and deep learning-based VSLAM schemes in terms of accuracy. However, the disadvantages of this method are its computationally intensive threading and unoptimized CNN design, which limit it to near-real-time performance.

Naveed et al. [77] proposed a deep learning-based VSLAM scheme whose modules remain reliable and consistent even on extremely complex problems. Their method outperforms several VSLAM systems and uses a deep reinforcement learning network trained in realistic simulators. Furthermore, they provide a baseline for active VSLAM evaluation, and the approach generalizes properly to real indoor and outdoor environments. The network path planner provides ideal path data, which is received by its underlying ORB-SLAM system. For evaluation, they produced a dataset containing real-world navigation problems in challenging and texture-free environments.

RWT-SLAM is a VSLAM framework based on deep feature matching, proposed by the authors of [78] for weakly textured scenes. Their approach is based on ORB-SLAM and uses feature masks from a modified LoFTR [79] algorithm for local image feature matching. The coarse-level and fine-level descriptors of the scene are extracted using a CNN architecture and the LoFTR algorithm, respectively. RWT-SLAM was tested on the TUM RGB-D and OpenLORIS-Scene datasets as well as on real-world data collected by the authors. However, despite the robust feature matching results and performance, their system is still computationally intensive.

5.3 Goal 3: Real-World Feasibility

The main goal of such methods is to be usable in various environments and to work in multiple scenarios. We note that the methods discussed below tightly integrate the semantic information of the environment and present end-to-end VSLAM systems.

1) Dynamic environment

In this regard, Yu et al. [61] introduced a VSLAM system named DS-SLAM, which can be used in dynamic environments and provides semantic information for map construction. The system is based on ORB-SLAM 2.0 and includes five threads: tracking, semantic segmentation, local mapping, loop closing, and dense semantic map construction. To exclude dynamic objects and improve localization accuracy before pose estimation, DS-SLAM combines an optical flow algorithm [80] with the real-time semantic segmentation network SegNet. DS-SLAM has been tested in real environments with RGB-D cameras and on the TUM RGB-D dataset. However, despite its high localization accuracy, it is still limited by its semantic segmentation quality and its heavy computation.

Semantic Optical Flow SLAM (SOF-SLAM) is an indirect VSLAM system built on the RGB-D mode of ORB-SLAM 2.0, another method for highly dynamic environments proposed by Cui and Ma [45]. Their method uses a semantic optical flow dynamic feature detection module that extracts and discards dynamic features hidden in the semantic and geometric information provided by ORB feature extraction. To provide accurate camera pose and environment information, SOF-SLAM uses the pixel-level semantic segmentation module of SegNet. In highly dynamic situations, experimental results on the TUM RGB-D dataset and in real environments show that SOF-SLAM outperforms ORB-SLAM 2.0. However, the weaknesses of SOF-SLAM are its ineffective handling of non-static feature recognition and its reliance on only two consecutive frames.

Cheng et al. [81] proposed a VSLAM system for dynamic environments that uses optical flow to separate and eliminate dynamic feature points. It builds on the structure of ORB-SLAM and provides it with static feature points, generated from typical monocular camera output, for accurate pose estimation. When features are missing, the system classifies optical flow values and uses them for feature recognition. According to the experimental results on the TUM RGB-D dataset, the system works well in dynamic indoor environments.

Yang et al. [82] published another VSLAM scheme that uses semantic segmentation network data, motion consistency detection techniques, and geometric constraints to reconstruct environment maps. Their method, based on the RGB-D variant of ORB-SLAM 2.0, performs well in dynamic indoor environments. An improved ORB feature extraction technique keeps only the stable features in the scene and ignores the dynamic ones. The feature and semantic data are then combined to create a static semantic map. Evaluation results on the Oxford and TUM RGB-D datasets demonstrate the effectiveness of their method in improving localization accuracy and creating data-rich semantic maps. However, their system may have problems in hallways or places with little information.

2) Solutions based on deep learning

In another work, called DXSLAM, by Li et al. [83], deep learning is used to find keypoints similar to SuperPoint features and to generate generic descriptors and keypoints for images. They trained a stronger CNN, HF-Net, to extract local and global information from each frame and generate frame-level and keypoint-level descriptions. They also use an offline bag-of-words (BoW) method to train a visual vocabulary of local features for accurate loop closure detection. DXSLAM can run in real time on CPUs without using a graphics processing unit (GPU). Although not particularly emphasized, it has a strong ability to resist dynamic changes in dynamic environments. DXSLAM has been tested on the TUM RGB-D and OpenLORIS-Scene datasets as well as on indoor and outdoor images, and obtains more accurate results than ORB-SLAM 2.0 and DS-SLAM. However, the main disadvantages of this approach are its complicated feature extraction architecture and the difficulty of merging deep features with older SLAM frameworks.

Li et al. [84] developed a real-time VSLAM technique that extracts feature points using deep learning in complex situations. The method is a self-supervised multi-task CNN for feature extraction that runs on a GPU and supports the creation of 3D dense maps. The CNN outputs binary descriptor strings with a fixed length of 256, allowing them to be used in place of more traditional feature descriptors such as ORB. It includes three threads for accurate and timely performance in dynamic scenes: tracking, local mapping, and loop closure detection. The scheme uses ORB-SLAM 2.0 with monocular and RGB-D cameras as its baseline. The authors tested it on the TUM dataset and on two datasets they collected themselves (corridor and office datasets captured with Kinect cameras).

Steenbeek and Nex [85] introduced a real-time VSLAM technique that uses a CNN for accurate scene analysis and map reconstruction. Their solution utilizes a drone's monocular camera stream during flight and employs a depth estimation neural network for stable performance. The method is based on ORB-SLAM 2.0 and utilizes visual information collected from indoor environments. Additionally, the CNN is trained on more than 48,000 indoor samples and uses pose, spatial depth, and RGB inputs to estimate scale and depth. Evaluating the system on the TUM RGB-D dataset and in real-world drone tests demonstrates improved pose estimation accuracy. However, the system struggles in texture-less scenes and requires both CPU and GPU resources for real-time performance.

3) Using Artificial Landmarks

Muñoz-Salinas and Medina-Carnicer developed a technique called UcoSLAM that outperforms traditional VSLAM systems by combining natural and artificial landmarks and using fiducial markers to automatically compute the scale of the surrounding environment. The main purpose of UcoSLAM is to overcome the instability, repeatability issues, and poor tracking quality of natural landmarks. It can run in environments without fiducial markers, as it supports keypoints-only, markers-only, and mixed modes. UcoSLAM includes a tracking mode to find map correspondences, optimize the reprojection error, and relocalize when tracking fails. Additionally, it has a marker-based loop closure detection system, and keypoints can be described with any descriptor, including ORB and FAST. Although UcoSLAM has many advantages, the system executes many threads, which makes it a time-consuming method.
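For context, square fiducial markers of the kind UcoSLAM builds on can be detected with OpenCV's ArUco module. The snippet below is a generic detection sketch using the OpenCV 4.7+ API (older versions expose cv2.aruco.detectMarkers directly); the dictionary choice and file name are assumptions, not UcoSLAM's own code.

```python
import cv2

# Generic ArUco marker detection (illustrative; not UcoSLAM's implementation).
frame = cv2.imread("frame_with_markers.png", cv2.IMREAD_GRAYSCALE)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
corners, ids, _ = detector.detectMarkers(frame)

if ids is not None:
    print("Detected marker ids:", ids.ravel().tolist())
    # Because the physical marker size is known, observing markers lets a SLAM
    # system recover metric scale and disambiguate the camera pose.
```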

4) Wide-range of Setups

Another VSLAM strategy for dynamic indoor and outdoor environments is DMS-SLAM [87], which supports monocular, stereo, and RGB-D vision sensors. The system employs a sliding window and the grid-based motion statistics (GMS) [88] feature matching method to find static feature locations. Based on ORB-SLAM 2.0, DMS-SLAM tracks the static features identified by the ORB algorithm. The authors tested the proposed method on the TUM RGB-D and KITTI datasets, and the results were better than those of well-established VSLAM algorithms. Furthermore, DMS-SLAM runs faster than the original ORB-SLAM 2.0 because feature points on dynamic objects are removed in the tracking step. Despite these advantages, the scheme has difficulties in poorly textured, fast-motion, and highly dynamic environments.

5.4 Goal 4: Resource Constraint

Some VSLAM methods are built for devices with limited computational resources compared to devices with ideal conditions. This is the case, for example, for VSLAM designed for mobile devices and robots with embedded systems.

1) Devices with limited computing power:

EdgeSLAM is a real-time, edge-assisted semantic VSLAM system for mobile and resource-constrained devices proposed by Xu et al. [89]. It employs a series of fine-grained modules distributed between edge servers and the associated mobile devices without complex threading. EdgeSLAM also includes a semantic segmentation module based on Mask R-CNN to improve object segmentation and tracking. The authors put their method into practice on commercially available mobile devices, such as cell phones and development boards, connected to an edge server. By reusing the results of object segmentation, they adapt the system parameters to different network bandwidth and latency conditions to avoid repeated processing. EdgeSLAM has been evaluated on TUM RGB-D, the monocular instances of KITTI, and datasets created for their experimental setup.

For stereo cameras, Schlegel, Colosi, and Grisetti [90] proposed a lightweight feature-based VSLAM framework named ProSLAM, with results comparable to well-received frameworks. Their approach consists of four modules: a triangulation module, which creates a 3D point cloud and the associated feature descriptors; an incremental motion estimation module, which processes two frames to determine the current position; a map management module, which creates local maps; and a relocalization module, which updates the global map based on the similarity of local maps. ProSLAM retrieves the 3D poses of points using a single thread and leverages a small number of well-known libraries to create a simple system. Experiments on the KITTI and EuRoC datasets show that the method achieves good results. However, it is weak in rotation estimation and does not contain any BA module.

Bavle et al. [91] proposed VPS-SLAM, a lightweight graph-based VSLAM framework for aerial robots. Their real-time system integrates geometric data, several object detection techniques, and VO/VIO to perform pose estimation and build a semantic map of the environment. VPS-SLAM uses low-level features, IMU measurements, and high-level planar information to reconstruct sparse semantic maps and estimate the robot state. The system utilizes a lightweight version of You Only Look Once v2.0 (YOLOv2) [92], trained on the COCO dataset [93], for object detection because of its real-time performance and computational efficiency. They used a hand-held camera and an aerial robot equipped with an RGB-D camera for testing. Indoor examples from the TUM RGB-D dataset were used to evaluate the method, and it provided results comparable to well-known VSLAM methods. However, their VSLAM system can only use a small number of object classes (such as chairs, books, and laptops) to build a semantic map of the surrounding area.

Tseng et al. [94] proposed another real-time indoor VSLAM method that meets low resource requirements. The authors also propose a technique for estimating the number of frames and visual elements required for plausible localization accuracy. Their scheme is based on the OpenVSLAM [95] framework and applies it to urgent real-world situations, such as reaching specific objects. The system acquires feature maps of the scene by applying the Efficient Perspective-n-Point (EPnP) and RANSAC algorithms for accurate pose estimation. According to indoor test results, their system can obtain accurate results under poor lighting conditions.

2) Computation Offloading

Ben Ali et al. [96] proposed using edge computing to migrate resource-intensive operations to the cloud and reduce the computational burden on the robot. In their indirect framework, Edge-SLAM, they modified the architecture of ORB-SLAM 2.0 so that the tracking module runs on the robot while the rest is migrated to the edge device. By splitting the VSLAM pipeline between the robot and the edge device, the system maintains both local and global maps and continues to function correctly with fewer resources, without sacrificing accuracy. They performed evaluations using the TUM RGB-D dataset and two indoor datasets collected with different mobile devices equipped with RGB-D cameras. However, one disadvantage of their approach is the increased architectural complexity due to the decoupling of the various SLAM modules. Another issue is that their system works well over short periods but degrades when Edge-SLAM is used in long-term scenarios (e.g., over multiple days).

5.5 Goal 5: Versatility

VSLAM works in this category focus on straightforward use, adaptation, and extension.

Sumikura et al. [95] proposed OpenVSLAM, an adaptable open-source VSLAM framework aimed at rapid development that can also be called from third-party programs. Their feature-based approach is compatible with multiple camera types, including monocular, stereo, and RGB-D, and the reconstructed maps can be stored and reused later. Thanks to its powerful ORB feature extraction module, OpenVSLAM outperforms ORB-SLAM and ORB-SLAM 2.0 in tracking accuracy and efficiency. However, the open-source release of the system has been discontinued due to concerns that code similarities infringe on ORB-SLAM 2.0.

To bridge the gap between real-time performance, accuracy, and resilience, Ferrera et al. [97] developed OV²SLAM, which works with monocular and stereo cameras. It reduces computation by restricting feature extraction to keyframes and tracking these features in subsequent frames by minimizing photometric error. In this sense, it is a hybrid scheme that combines the advantages of the direct and indirect VSLAM approaches. Indoor and outdoor experiments on well-known benchmark datasets, including EuRoC, KITTI, and TartanAir, demonstrate that it outperforms several mainstream schemes in performance and accuracy.

Teed and Deng proposed another method, named DROID-SLAM, a deep learning-based visual SLAM for monocular, stereo, and RGB-D cameras [98]. It achieves higher accuracy and robustness than well-known monocular and stereo tracking methods. The scheme runs in real time with a back-end thread (for BA) and a front-end thread (for keyframe collection and graph optimization). DROID-SLAM is trained on monocular camera instances only, so there is no need to retrain it for stereo and RGB-D inputs. Like indirect methods, it minimizes reprojection errors, while not requiring any preprocessing for feature detection and matching. A feature extraction network consisting of downsampling layers and residual blocks processes each input image to create dense features. DROID-SLAM has been tested on well-known datasets including TartanAir, EuRoC, and TUM RGB-D and achieves acceptable results.

Bonetto et al. [99] proposed iRotate, an active VSLAM technique for omnidirectional robots equipped with RGB-D cameras. Their method also includes a module for detecting obstacles within the camera's field of view. The main purpose of iRotate is to reduce the distance a robot must travel to map its environment by jointly surveying unexplored locations and previously visited ones. The method uses a graph-based VSLAM framework as its back end. In comparisons on simulated and real three-wheeled omnidirectional robots, the authors achieve results on par with mainstream VSLAM methods. The main disadvantage of this approach is that the robot may have to stop and restart for partial path replanning.

5.6 Goal Six: Visual Odometry

Such methods aim to obtain the highest possible accuracy in determining the pose of the robot.

1) Deep neural network

A dynamic SLAM framework was proposed in [100] that utilizes deep learning for accurate pose estimation and proper environment understanding. As part of the semantic-level module for optimizing VO, the authors use a CNN to recognize moving objects in the environment, which helps reduce the pose estimation error caused by incorrect feature matching. Furthermore, Dynamic-SLAM uses a selective tracking module to ignore dynamic regions of the scene and a missed-detection compensation algorithm that assumes velocity consistency across adjacent frames. Although the results are good, the system requires a large computational budget and, due to the limited number of defined semantic classes, faces the risk of misclassifying dynamic/static objects.
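The general idea of discarding features that land on detected moving objects can be sketched as below; the class list and the bounding-box representation of detections are assumptions for illustration, not the CNN used in [100].

```python
# Sketch: drop keypoints that fall inside the bounding box of a detected
# dynamic object before they are used for pose estimation.
import numpy as np

DYNAMIC_CLASSES = {"person", "car", "bicycle", "dog"}   # illustrative label set

def filter_dynamic_keypoints(keypoints, detections):
    """Keep only keypoints that do not lie inside a dynamic-object box.

    keypoints:  iterable of (u, v) pixel coordinates
    detections: iterable of (class_name, (x1, y1, x2, y2)) pairs from any detector
    """
    kept = []
    for (u, v) in keypoints:
        on_dynamic = any(cls in DYNAMIC_CLASSES and x1 <= u <= x2 and y1 <= v <= y2
                         for cls, (x1, y1, x2, y2) in detections)
        if not on_dynamic:
            kept.append((u, v))
    return np.array(kept)
```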

Bloesch et al. [101] proposed CodeSLAM, a direct technique that provides a compact yet dense representation of scene geometry. Their VSLAM system is an enhanced version of PTAM [14] and relies only on a monocular camera. They encode the intensity image into convolutional features and feed them to a deep autoencoder, with the CNN trained on intensity images from the SceneNet RGB-D dataset. CodeSLAM has been tested on indoor instances of the EuRoC dataset, and the results are promising in terms of accuracy and performance.

Wang et al. proposed DeepVO, an end-to-end VO framework for the monocular setting based on a deep recurrent convolutional neural network (RCNN) architecture. Their method uses deep learning to automatically learn suitable features, model sequential dynamics and relations, and infer poses directly from color frames. The DeepVO architecture consists of a CNN called FlowNet (which computes optical flow over successive frames) and two Long Short-Term Memory (LSTM) layers (which estimate temporal changes based on the features provided by the CNN). By combining a CNN with a recurrent neural network (RNN), the framework extracts visual features and performs sequential modeling at the same time. DeepVO can combine geometric information with learned knowledge models to enhance VO. However, it is not intended to replace traditional geometry-based VO methods.
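A toy PyTorch sketch of this CNN-plus-LSTM layout is given below: stacked consecutive frames go through a small convolutional encoder standing in for FlowNet, and two LSTM layers regress a 6-DoF relative pose per step; all layer sizes are placeholders rather than the paper's configuration.

```python
# Sketch of a recurrent convolutional VO network: a CNN encodes stacked frame
# pairs, two LSTM layers model the sequence, and a linear head outputs pose.
import torch
import torch.nn as nn

class DeepVOSketch(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(           # stands in for the FlowNet encoder
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)))
        self.lstm = nn.LSTM(input_size=32 * 4 * 4, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.pose = nn.Linear(hidden, 6)        # translation (3) + rotation (3)

    def forward(self, frame_pairs):
        # frame_pairs: (batch, seq, 6, H, W) -- two RGB frames stacked per step
        b, s, c, h, w = frame_pairs.shape
        feats = self.encoder(frame_pairs.view(b * s, c, h, w)).view(b, s, -1)
        out, _ = self.lstm(feats)
        return self.pose(out)                   # (batch, seq, 6) relative poses

poses = DeepVOSketch()(torch.randn(2, 5, 6, 96, 128))
```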

Parisotto et al. [103] proposed an end-to-end system similar to DeepVO, but with a Neural Graph Optimization (NGO) step instead of LSTMs. Their approach performs loop closure detection and drift correction by relating pose estimates from different time steps. NGO uses two attention-based optimization steps to jointly refine the aggregated local estimates produced by the convolutional layers of the local pose estimation module and to provide a global pose estimate. They evaluated their technique on 2D and 3D mazes and exceeded DeepVO in performance and accuracy. However, the method must be connected to a SLAM framework to obtain a relocalization signal.

In another work, Czarnowski et al. [104] introduced DeepFactors, a VSLAM framework mainly aimed at dense reconstruction of environment maps from a monocular camera. To make map reconstruction more stable, their real-time solution combines learned and model-based methods within a probabilistic framework for joint optimization of pose and depth. The authors modified the CodeSLAM framework and added missing components such as local/global loop closure detection. After training on about 1.4 million ScanNet [105] images, the system is evaluated on the ICL-NUIM and TUM RGB-D datasets. DeepFactors builds on the idea of CodeSLAM and focuses on optimizing the compact depth code within a traditional SLAM pipeline. However, due to the computational cost of its modules, this approach requires a GPU to guarantee real-time performance.

2) Deep inter-frame processing

In another work, the authors of [106] developed a real-time dense SLAM method for RGB-D cameras that detects camera motion by minimizing photometric and geometric errors between two images, improving on their earlier methods. Their keyframe-based solution augments Pose SLAM (which retains only non-redundant poses to generate dense maps), adds dense visual odometry features, and efficiently exploits the information in camera frames for stable camera motion estimation. The authors also employ an entropy-based technique to compute the similarity of keyframes for loop closure detection and drift avoidance. However, their approach still needs improvement in loop closure detection and keyframe selection quality.
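A simplified version of such an entropy-based similarity check is sketched below: the differential entropy of the estimated relative-pose distribution is compared against that of a reference frame. The 6x6 pose covariances are assumed inputs, and sign handling for very small covariances is ignored.

```python
# Sketch of an entropy-ratio criterion for keyframe similarity: compare the
# differential entropy of the current relative-pose estimate with a reference.
import numpy as np

def pose_entropy(sigma):
    """Differential entropy of a Gaussian 6-DoF pose estimate with covariance sigma."""
    n = sigma.shape[0]
    return 0.5 * (n * (1.0 + np.log(2.0 * np.pi)) + np.log(np.linalg.det(sigma)))

def is_similar_to_keyframe(sigma_current, sigma_reference, threshold=0.9):
    """Current frame still 'belongs' to the keyframe if the entropy ratio is high."""
    alpha = pose_entropy(sigma_current) / pose_entropy(sigma_reference)
    return alpha > threshold
```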

In another work, presented by Li et al., a feature-based VSLAM method called DP-SLAM achieves real-time dynamic object removal. The method uses a Bayesian propagation model based on the likelihood that keypoints originate from moving objects. DP-SLAM combines a moving-probability propagation algorithm with iterative probability updates to cope with changes in geometric constraints and semantic data. It is integrated with ORB-SLAM 2.0 and tested on the TUM RGB-D dataset. Despite accurate results, the system only works with sparse VSLAM and incurs a high computational cost due to the iterative probability update module.
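A minimal sketch of such an iterative per-keypoint update is shown below, fusing a prior moving probability with a new likelihood (e.g., from a semantic mask or a geometric check) via Bayes' rule; the numbers and the fusion rule are illustrative and not DP-SLAM's exact model.

```python
# Sketch: iteratively update the probability that a keypoint belongs to a
# moving object by fusing the prior with each new piece of evidence.
def update_moving_probability(prior, likelihood):
    """Posterior probability that the keypoint is dynamic, via Bayes' rule."""
    return prior * likelihood / (prior * likelihood + (1.0 - prior) * (1.0 - likelihood))

p = 0.5                                   # uninformative prior for a new keypoint
for evidence in (0.8, 0.7, 0.9):          # e.g. repeated detections on a person mask
    p = update_moving_probability(p, evidence)
print(f"moving probability after fusion: {p:.3f}")
```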

The indoor navigation system Pair-Navi, proposed by Dong et al., reuses paths previously traversed by one agent so that other agents can follow them later. The earlier mobile robot, called the leader, captures tracking information such as turns and scene-specific details and provides it to the next mobile robot (the follower) heading to the same destination. While the follower uses a relocalization module to determine its position with respect to the reference trajectory, the leader combines visual odometry and trajectory creation modules. To identify and remove dynamic objects from the visual feature set, the system employs a mask region-based CNN (Mask R-CNN). They tested Pair-Navi on a dataset collected from several smartphones.

3) Various feature processing

Another approach in this category is a text-based VSLAM system called TextSLAM, proposed by Li et al. It incorporates text objects retrieved from the scene with the FAST corner detection technique into the SLAM pipeline. Text objects carry rich textures, patterns, and semantics, which the method exploits to build high-quality 3D text maps. TextSLAM uses text objects as stable visual fiducial markers, parameterizes them after the first frame in which they are found, and then projects the 3D text objects onto the target image for re-localization. They also propose a new three-variable parameterization technique for instantly initializing text features. Using a monocular camera and a dataset created by the authors, experiments were performed in both indoor and outdoor environments, and the results were very accurate. Operating in text-free environments, interpreting short strings of letters, and needing to store large text dictionaries are the three fundamental challenges of TextSLAM.

Xu et al. [43] proposed an indirect VSLAM system based on an improved ORB-SLAM, which uses an occupancy grid mapping (OGM) method and a new 2D mapping module to achieve high-precision localization and user interaction. Their system uses OGM to reconstruct a map of the environment, representing the presence of obstacles as equally spaced grid cells with variable occupancy values, which allows continuous real-time navigation while planning a route. Experiments on the datasets they generated show that their approach works in GPS-denied conditions. However, their technique struggles in dynamic and complex environments and has difficulty matching features correctly in corridors and other featureless settings.
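For reference, a standard log-odds occupancy-grid update, the usual way OGM cells are maintained, is sketched below; the grid resolution and inverse-sensor-model increments are illustrative values, not those of Xu et al.

```python
# Sketch of a log-odds occupancy grid: each observation nudges the cell's
# log-odds up (occupied) or down (free), and probabilities are recovered
# with a sigmoid.
import numpy as np

class OccupancyGrid:
    def __init__(self, width, height, resolution=0.05):
        self.log_odds = np.zeros((height, width))   # 0 log-odds == p(occupied) = 0.5
        self.resolution = resolution                # metres per cell
        self.l_occ, self.l_free = 0.85, -0.4        # inverse sensor model increments

    def update_cell(self, x_m, y_m, occupied):
        i, j = int(y_m / self.resolution), int(x_m / self.resolution)
        self.log_odds[i, j] += self.l_occ if occupied else self.l_free

    def probability(self):
        return 1.0 - 1.0 / (1.0 + np.exp(self.log_odds))

grid = OccupancyGrid(200, 200)
grid.update_cell(2.0, 1.5, occupied=True)    # e.g. an obstacle observed at (2.0 m, 1.5 m)
```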

Ma et al. proposed CPA-SLAM, a direct VSLAM method for RGB-D cameras that exploits planes for tracking and graph optimization. Its technique seamlessly integrates frame-to-keyframe and frame-to-plane alignment. They also introduce an image alignment algorithm to track alignment against the camera's reference keyframe and the plane model. CPA-SLAM selects the keyframe with the shortest temporal and spatial distance for tracking. The real-time tracking performance of the system is tested with and without the planar setup and analyzed on the TUM RGB-D and ICL-NUIM datasets as well as in indoor and outdoor scenes. However, it only supports a limited class of geometric primitives, namely planes.

06   Research Trends

6.1 Statistics

Regarding the classification of the reviewed papers discussed above, we visualize the processed data in Fig. 4 to discover the current trends in VSLAM. In subfigure "a", we can see that most of the proposed VSLAM systems are stand-alone applications that implement the entire localization and mapping process from scratch using vision sensors. ORB-SLAM 2.0 and ORB-SLAM are the most common base platforms for building new frameworks, while a minority of methods build on other VSLAM systems, such as PTAM and PoseSLAM. Furthermore, in terms of VSLAM goals, subfigure "b" shows that the most common objective is improving the visual odometry module; thus, most recent VSLAM works try to solve the problems current algorithms face in determining the position and orientation of the robot. Pose estimation and real-world viability are further common goals of new VSLAM papers. Regarding the datasets used for evaluation in the surveyed papers, subfigure "c" illustrates that most of the work was tested on the TUM RGB-D dataset, which has served as the main baseline or as one of several baselines in the reviewed papers. Furthermore, many researchers tend to conduct experiments on datasets they generate themselves; we can assume that the main motivation for generating a dataset is to demonstrate how the VSLAM method works in real scenarios and whether it can be used as an end-to-end application. EuRoC MAV and KITTI are, respectively, the next most popular evaluation datasets in VSLAM work. Another interesting piece of information, extracted from subfigure "d", concerns the use of semantic data in VSLAM systems. We can see that most of the surveyed papers do not include semantic data when dealing with the environment. We hypothesize that the reasons for not using semantic data are:

  • In many cases, training a model to recognize objects and use it for semantic segmentation is computationally expensive, which can increase processing time.

  • Most geometry-based VSLAM schemes are designed as plug-and-play devices, so they can use as little camera data as possible for localization and mapping.

  • Incorrect information extracted from the scene can also add more noise to the process.

When considering the environment, we can see in subfigure "e" that more than half of the methods can also work in dynamic environments with challenging conditions, while the rest of the systems only focus on environments without dynamic changes. Also, in subfigure "f", most of the methods are applicable to "indoor environment" or "indoor and outdoor environment", while the rest of the papers are only tested in outdoor conditions. It should be mentioned that methods that only work in specific cases may not yield the same accuracy if used in other scenarios. This is one of the main reasons why some methods only focus on specific cases.

Fig. 4 Current research trends in VSLAM: a) the base SLAM systems used to implement new methods; b) the main objectives of the methods; c) the various datasets on which the proposed methods were tested; d) the use of semantic data in the proposed methods; e) the presence of dynamic objects in the environment; f) the various environments in which the schemes were tested.

6.2 Trend Analysis

The current survey reviews recent visual SLAM methods that have attracted wide attention and illustrates their main contributions to the field. Although a wide range of stable solutions and improvements to the various modules of VSLAM systems have appeared in the past few years, there are still many high-potential areas and unsolved problems, and research in these areas will lead to more stable methods in the future development of SLAM. Given the large number of visual SLAM methods, we discuss here the current trending areas and introduce the following open research directions:

Deep Learning: Deep neural networks have shown exciting results in various applications, including VSLAM [15], making them an important trend in several research fields. Due to their learning ability, these architectures have shown considerable potential as capable feature extractors for VO and loop closure detection problems. CNNs can help VSLAM with accurate object detection and semantic segmentation, and can outperform traditional hand-crafted feature extraction and matching algorithms. It must be mentioned that since deep learning-based methods are trained on datasets with large amounts of diverse data but limited object classes, there is always a risk of misclassifying dynamic points, leading to mis-segmentation and, consequently, lower segmentation accuracy and pose estimation errors.

Balance of Information Retrieval and Computational Cost: In general, the processing cost and the amount of information in a scene should always be in balance. From this perspective, dense maps allow VSLAM applications to record high-dimensional complete scene information, but real-time execution would be computationally intensive. On the other hand, despite being less computationally expensive, sparse representations will not be able to capture all needed information. It should also be noted that real-time performance is directly related to the frame rate of the camera, and frame loss at peak processing time can negatively impact the performance of a VSLAM system independent of algorithm performance. In addition, VSLAM usually utilizes tightly coupled modules, and modifying one module may adversely affect other modules, which makes the balancing task more challenging.

Semantic Segmentation: Providing semantic information while creating a map of the environment can bring very useful information to robots. Recognizing objects (e.g., doors, windows, people, etc.) in the camera's field of view is a hot topic in current and future VSLAM work, as semantic information can be used in pose estimation, trajectory planning, and loop closure detection modules. With the widespread use of object detection and tracking algorithms, Semantic VSLAM will undoubtedly be one of the future solutions in this field.

Loop closure detection: Every SLAM system faces a key problem: drift and loss of feature tracks caused by accumulated localization errors. Drift detection and loop closure detection require recognizing previously visited locations, which introduces high computational delay and cost for VSLAM [89]. The main reason is that the complexity of loop closure detection grows with the size of the reconstructed map. Furthermore, combining map data collected from different locations and refining the estimated poses is a very complex task. Therefore, optimizing and balancing the loop closure detection module has great potential. One common approach for loop closure detection is to optimize image retrieval by training a visual dictionary based on local features and then aggregating them.
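As a hedged illustration of that last idea, the sketch below clusters local descriptors into a small visual vocabulary with k-means and ranks loop-closure candidates by the similarity of normalized word histograms; scikit-learn's KMeans stands in for a proper vocabulary tree such as DBoW2, and descriptors are assumed to be float arrays.

```python
# Sketch of bag-of-visual-words retrieval for loop closure: cluster local
# descriptors into a vocabulary, turn each image into a word histogram, and
# rank candidate keyframes by histogram similarity.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, num_words=500):
    """Cluster descriptors from many training images into visual words."""
    return KMeans(n_clusters=num_words, n_init=10).fit(all_descriptors)

def bow_vector(descriptors, vocab):
    """Normalized histogram of visual-word occurrences for one image."""
    words = vocab.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    hist = hist.astype(float)
    return hist / (np.linalg.norm(hist) + 1e-9)

def loop_candidates(query_vec, keyframe_vecs, min_score=0.7):
    """Indices of keyframes whose BoW vector is similar enough to the query."""
    scores = [float(query_vec @ kf) for kf in keyframe_vecs]
    return [i for i, s in enumerate(scores) if s >= min_score]
```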

Special scene issues: Working in texture-less environments with few distinct feature points often leads to drift errors in the estimated position and orientation of the robot. As one of the main challenges of VSLAM, this error can cause system failure. Therefore, incorporating complementary scene understanding methods, such as object detection or line features, into feature-based approaches will be a hot topic.

07   Conclusion

This paper presents a family of SLAM efforts in which visual data collected from cameras plays an important role. We categorize recent VSLAM systems according to various properties, such as experimental setting, domain of innovation, object detection and tracking algorithm, semantic level, and performance. We also review the key contributions of related work, as well as existing pitfalls and challenges, from the perspective of the authors' own assessments, optimizations planned for future versions, and issues addressed by other related methods. Another contribution of this paper is the discussion of current trends in VSLAM systems and the open problems that researchers may investigate further.
