Dynamic SLAM paper (2) — DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes

Table of contents

0 Introduction

1 Related work

2 System description

3 Experimental results

4 Conclusions


Abstract − The assumption of scene rigidity is typical in SLAM algorithms. Such a strong assumption limits the use of most visual SLAM systems in populated real-world environments, which are the target of several relevant applications such as service robotics or autonomous vehicles. This paper introduces DynaSLAM, a visual SLAM system built on ORB-SLAM2 [1] that adds the capabilities of dynamic object detection and background inpainting. DynaSLAM is robust in dynamic scenarios for monocular, stereo and RGB-D configurations. We detect moving objects by multi-view geometry, deep learning, or a combination of both. Having a static map of the scene allows inpainting the frame background that has been occluded by such dynamic objects. We evaluate our system on publicly available monocular, stereo and RGB-D datasets, and study the impact of several accuracy/speed trade-offs to assess the limits of the proposed method. DynaSLAM outperforms standard visual SLAM baselines in accuracy in highly dynamic scenarios. It also estimates a map of the static parts of the scene, which is a requirement for long-term applications in real-world environments.

0 Introduction

        SLAM is a prerequisite for many robotic applications, such as collision-free navigation. SLAM techniques jointly estimate a map of an unknown environment and the robot's pose in that map from the data stream from its onboard sensors. The map allows the robot to continuously localize in the same environment without accumulating drift. This differs from odometry approaches that integrate incremental motion estimation within a local window, which cannot correct for drift when revisiting locations.

        Cameras are the main sensor of visual SLAM, which has received considerable research attention in recent years. Monocular solutions have practical advantages in size, power consumption and cost, but also challenges such as the unobservability of scale and a more delicate state initialization. These issues can be addressed by using stereo or RGB-D cameras, which greatly improve the robustness of visual SLAM systems.

        The research community has studied SLAM from different angles. However, the vast majority of methods and datasets assume that the environment is static. Therefore, they can only handle a small fraction of dynamic content by classifying it as an outlier to a static model. Although the static assumption holds true for some robotics applications, it limits the applicability of visual SLAM to many relevant cases, such as intelligent autonomous systems operating for long periods of time in densely populated real-world environments.

        Visual SLAM can be divided into feature-based methods [2], [3] and direct methods [4], [5], [6]. The former rely on matching salient points and can only estimate sparse reconstructions, while the latter can in principle estimate fully dense reconstructions by directly minimizing the photometric error, often combined with a TV regularization. Some direct methods focus on the regions of highest gradient and estimate semi-dense maps [7], [8].

        None of the above methods take into account the very common problem of dynamic objects in the scene, such as pedestrians, bicycles or cars. Detecting and handling dynamic objects in visual SLAM poses several challenges for both mapping and tracking, including:

1) How to detect these dynamic objects in the image, in order to:

  • prevent the tracking algorithm from using matches that belong to dynamic objects;
  • prevent the mapping algorithm from including moving objects in the 3D map.

2) How to complete the partial 3D map that is temporarily occluded by dynamic objects.

        Many applications could benefit from advances in this area, including augmented reality, autonomous vehicles, and medical imaging, among others; all of them could safely reuse maps generated in previous runs. Detecting and handling dynamic objects is a prerequisite for estimating a stable map, which is very useful for long-term applications. If dynamic content is not detected, it becomes part of the 3D map, complicating its usability for tracking or relocalization purposes.

      Thesis work: We propose an online algorithm to handle dynamic objects in RGB-D, stereo, and monocular SLAM. We add a front end to the existing ORB-SLAM2 system [1] to allow for more accurate tracking and reusable scene maps. In the monocular and stereo cases, our proposal is to use a convolutional neural network to perform pixel-wise segmentation of a priori dynamic objects (such as people and vehicles) in the frame, so that the SLAM algorithm does not extract features on them. In the RGB-D case, we propose to combine a multi-view geometry model and a deep learning-based algorithm to detect dynamic objects and, after removing them from the images, inpaint the occluded background with the correct information of the scene (Fig. 1).

1 Related work

        Dynamic objects are treated as noisy data in most SLAM systems and are therefore neither included in the map nor used for camera tracking. The most typical outlier rejection schemes are RANSAC (e.g., as used in ORB-SLAM [3], [1]) and robust cost functions (e.g., as used in PTAM [2]).

        There are several SLAM systems that specialize in handling dynamic scene content. Among feature-based SLAM methods, some of the most relevant ones are:

  • Tan et al. [9] detect changes occurring in a scene by projecting map features into the current frame for appearance and structure verification.
  • Wangsiripitak and Murray [10] track known 3D dynamic objects in the scene.
  • Riazuelo et al. [11] deal with human activities by detecting and tracking people.
  • The work of Li and Lee [12] uses depth edge points with associated weights indicating their probability of belonging to dynamic objects.

        Direct methods are usually more sensitive to dynamic objects in the scene. The most relevant methods designed specifically for dynamic scenarios include:

  • Alcantarilla et al. [13] detect moving objects by using a scene flow representation of stereo cameras.
  • Wang and Huang [14] used RGB optical flow to segment dynamic objects in the scene.
  • Kim et al. [15] propose to capture static parts of a scene by computing the difference of successive depth images projected on the same plane.
  • Sun et al. [16] compute the intensity difference between consecutive RGB images; pixel classification is then done by segmenting the quantized depth image.

        All these methods, feature-based and direct alike, build the map of the static scene parts only from the information contained in the sequence [1], [3], [9], [12], [13], [14], [15], [16], [17], and cannot estimate long-term models when a priori dynamic objects remain stationary, such as parked cars or seated people. On the other hand, Wangsiripitak and Murray [10] and Riazuelo et al. [11] can detect these a priori dynamic objects, but cannot detect changes produced by static objects, such as a chair pushed by a person or a ball thrown by someone. That is, the former methods detect moving objects, while the latter detect movable objects. We propose DynaSLAM, which combines multi-view geometry and deep learning to address both situations. Similarly, Ambrus et al. [18] segment dynamic objects by combining a dynamic classifier with multi-view geometry.

2 System description

        First, the RGB channels are passed through a convolutional neural network (CNN) for pixel-level segmentation of all prior dynamic content, such as people or vehicles.

        In the RGB-D case, we leverage multi-view geometry to improve the segmentation of dynamic content in two ways. First, we refine the segmentation of dynamic objects previously obtained by the CNN. Second, we label as dynamic new object instances that are static most of the time (i.e., we detect moving objects that were missed by the CNN stage).

        To achieve this, it is necessary to know the camera pose, for which we implement a low-cost tracking module that localizes the camera within the already created scene map. These segmented frames are the ones used to obtain the camera trajectory and the scene map. Note that if the moving objects in the scene do not belong to the classes known by the CNN, the multi-view geometry stage will still detect the dynamic content, although accuracy may be reduced.

        Once the dynamic object detection and the camera localization are complete, our goal is to reconstruct the occluded background of the current frame with static information from previous views. These synthetic frames are relevant for applications such as augmented reality, virtual reality, and place recognition in life-long maps.

        In the monocular and stereo cases, images are segmented by the CNN so that keypoints belonging to a priori dynamic objects are neither tracked nor mapped. All the different stages are described in detail in the following subsections (A to E).

A. Segmentation of potentially dynamic objects using CNN

        To detect dynamic objects, we propose to use a CNN that performs pixel-wise semantic segmentation of the images. In our experiments, we use Mask R-CNN [19], the state of the art for object instance segmentation. Mask R-CNN provides both pixel-wise semantic segmentation and instance labels. In this work we use the pixel-wise semantic segmentation information, but the instance labels may be useful in future work for tracking different moving objects. We use Matterport's TensorFlow implementation. The input to Mask R-CNN is a raw RGB image, and the idea is to segment the potentially dynamic or movable classes (person, bicycle, car, cat, dog, etc.). We consider that most dynamic objects likely to appear in real environments are included in this list. If other classes were needed, the network, trained on MS COCO [20], could be fine-tuned with new training data.

        The input to the network is an RGB image of size m×n×3, and the output is a matrix of size m×n×l, where l is the number of objects detected in the image. Each output channel i ∈ {1, …, l} contains a binary mask for one object. By merging all channels into one, we obtain the segmentation of all dynamic objects present in the scene.
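
        A minimal sketch of this mask-merging step, assuming the network output has already been converted to an m×n×l stack of binary instance masks; the function name and the use of NumPy are illustrative, not part of DynaSLAM's code:

    import numpy as np

    def merge_dynamic_masks(instance_masks):
        # instance_masks: (m, n, l) array, one binary channel per detected
        # a priori dynamic object; returns a single (m, n) binary mask.
        if instance_masks.ndim != 3 or instance_masks.shape[2] == 0:
            # no dynamic objects detected in this frame
            return np.zeros(instance_masks.shape[:2], dtype=np.uint8)
        return (instance_masks.max(axis=2) > 0).astype(np.uint8)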

B. Low cost tracking

        The feature points of the segmentation boundary are not considered in the tracking process

        After segmenting out the potentially dynamic content, the static part of the image is used to track the camera pose. Since segment contours usually become high-gradient regions where salient point features tend to appear, we do not consider features in these contour regions. The tracking implemented at this stage of the algorithm is a simplified, computationally lighter version of the one in ORB-SLAM2 [1]: it projects map features into the image frame, searches for correspondences within the static regions of the image, and minimizes the reprojection error to optimize the camera pose.
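
        The sketch below illustrates the idea of this low-cost tracking under simplifying assumptions (pinhole model with known intrinsics, a dense dynamic mask, SciPy's robust least squares instead of ORB-SLAM2's own optimizer); all names and the Huber weighting are illustrative:

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def project(points_w, rvec, tvec, K):
        # Project 3D map points (world frame) into the image for pose (rvec, tvec).
        R = Rotation.from_rotvec(rvec).as_matrix()
        pc = (R @ points_w.T).T + tvec            # points in the camera frame
        uv = (K @ pc.T).T
        return uv[:, :2] / uv[:, 2:3]

    def track_low_cost(points_w, observations, dynamic_mask, rvec0, tvec0, K):
        # Keep only matches whose observed pixel (u, v) lies in a static region.
        px = observations.round().astype(int)
        static = dynamic_mask[px[:, 1], px[:, 0]] == 0
        pts, uv_obs = points_w[static], observations[static]

        def residual(x):                           # reprojection error
            return (project(pts, x[:3], x[3:], K) - uv_obs).ravel()

        sol = least_squares(residual, np.hstack([rvec0, tvec0]), loss="huber")
        return sol.x[:3], sol.x[3:]                # optimized rotation vector and translation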

C. Dynamic Content Segmentation Using Mask R-CNN and Multi-View Geometry

        By using Mask R-CNN, most dynamic objects can be segmented and excluded from tracking and mapping. However, some objects cannot be detected by this approach because they are not a priori dynamic but are movable. Examples include a book someone carries, a chair someone moves, or even furniture changes in long-term mapping. This section details how we handle these situations.

        For each input frame, we select the previous keyframes that have the highest overlap with the new frame. This is done by considering the distance and rotation between the new frame and each keyframe, similar to the method of Tan et al. [9]. In our experiments, the number of overlapping keyframes is set to 5 as a trade-off between computational cost and dynamic object detection accuracy.
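
        A possible implementation of this keyframe selection is sketched below; the particular weighting of translation against rotation is an assumption, not the exact criterion used in the paper:

    import numpy as np

    def select_overlapping_keyframes(keyframes, T_new, n=5):
        # keyframes: list of dicts holding a 4x4 camera-to-world pose under 'T_wc'.
        # T_new: 4x4 pose of the new frame. Returns the n keyframes with the
        # smallest combined translation/rotation distance to the new frame.
        def score(T_kf):
            T_rel = np.linalg.inv(T_kf) @ T_new
            trans = np.linalg.norm(T_rel[:3, 3])                                 # metres
            rot = np.arccos(np.clip((np.trace(T_rel[:3, :3]) - 1) / 2, -1, 1))   # radians
            return trans + 0.5 * rot          # hypothetical weighting of the two terms
        return sorted(keyframes, key=lambda kf: score(kf["T_wc"]))[:n]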

        We then compute the projection of each keypoint x from the previous keyframes into the current frame, obtaining the keypoints x', as well as their projected depths z_proj, computed from the camera motion. Note that the keypoints x come from the feature extraction algorithm used in ORB-SLAM2.

        Dynamic keypoint test: points with parallax angle α > 30° are discarded; a keypoint is labeled dynamic when the depth difference z_proj − z' exceeds the threshold τ_z.

        For each keypoint x', whose corresponding 3D point is X, we compute the angle between the back-projections of x and x', i.e., their parallax angle α. If this angle is greater than 30 degrees, the point might be occluded and is ignored from then on. We observed that, in the TUM dataset, static objects with a parallax angle greater than 30 degrees were sometimes considered dynamic due to the viewpoint difference. We take the depth z' of the remaining keypoints in the current frame (obtained directly from the depth measurement), account for the reprojection error, and compare it with z_proj. If the difference Δz = z_proj − z' exceeds a threshold τ_z, the keypoint x' is considered to belong to a dynamic object. This idea is shown in Figure 3. To set the threshold τ_z, we manually labeled 30 images with dynamic objects in the TUM dataset and evaluated the precision and recall of our method for different values of τ_z. By maximizing the expression 0.7 × Precision + 0.3 × Recall, we conclude that τ_z = 0.4 m is a reasonable choice.
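
        The test for a single keypoint can be sketched as follows, with the 30° parallax limit and τ_z = 0.4 m taken from the values above; the function name and signature are illustrative:

    import numpy as np

    def classify_keypoint(X, C_kf, C_cur, z_proj, z_meas, alpha_max_deg=30.0, tau_z=0.4):
        # X: 3D point (world frame) behind the keypoint; C_kf / C_cur: optical centres
        # of the keyframe and current frame; z_proj: depth of x' predicted from the
        # camera motion; z_meas: depth z' measured by the sensor at x'.
        v1, v2 = X - C_kf, X - C_cur
        cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        alpha = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
        if alpha > alpha_max_deg:
            return None                      # likely occlusion, ignore this keypoint
        return (z_proj - z_meas) > tau_z     # measured depth much smaller: a dynamic object in front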

        Within the objects labeled as dynamic, some keypoints lying on the object boundaries might be wrongly labeled. To avoid this problem, we exploit the information provided by the depth image: if a keypoint is set as dynamic but the depth image patch around it has high variance, we change its label to static.

        So far, we know which keypoints belong to dynamic objects and which do not. To classify all the pixels belonging to dynamic objects, we grow the region around the dynamic keypoints in the depth image. Figure 4a shows an example of an RGB frame and its corresponding dynamic mask.
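
        A simple region-growing pass over the depth image, starting from the dynamic keypoints, could look like the sketch below; the 4-connectivity and the depth tolerance are assumptions:

    import numpy as np
    from collections import deque

    def grow_dynamic_region(depth, seeds, depth_tol=0.05):
        # depth: (m, n) depth image in metres; seeds: list of (row, col) dynamic keypoints.
        # Neighbouring pixels are added while the depth changes smoothly,
        # i.e., while they plausibly stay on the same object surface.
        mask = np.zeros(depth.shape, dtype=np.uint8)
        queue = deque(seeds)
        for r, c in seeds:
            mask[r, c] = 1
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < depth.shape[0] and 0 <= cc < depth.shape[1] and not mask[rr, cc]:
                    if abs(depth[rr, cc] - depth[r, c]) < depth_tol:
                        mask[rr, cc] = 1
                        queue.append((rr, cc))
        return mask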

        The results of the CNN (Fig. 4b) can be combined with those of the geometric method for full dynamic object detection (Fig. 4c). Both approaches have advantages and limitations, so they can be used in a complementary way. For the geometric method, the main problem is that initialization is not trivial due to its multi-view nature. Deep learning methods work well from a single view and are free from such initialization problems, but their main limitation is that objects considered a priori static may move, which such methods cannot recognize. The multi-view consistency test can be used to address this situation.

        These two ways of addressing the moving-object detection problem are shown in Figure 4. In Figure 4a, we can see that the person in the back is a potentially dynamic object but is not detected. There are two reasons for this: first, RGB-D cameras have trouble measuring the depth of distant objects; second, reliable feature points lie in well-defined and nearby image regions. Nevertheless, this person is detected by the deep learning method (Fig. 4b). Besides, in Figure 4a we can see that not only the person in front is detected, but also the book in his hands and the chair he is sitting on. On the other hand, in Fig. 4b only the two persons are detected as dynamic objects, and their segmentation is less accurate. If only the deep learning method were used, a floating book would remain in the image and would wrongly become part of the 3D map.

        Given the advantages and disadvantages of these two methods, we believe that they complement each other, and their combined use is an effective way to achieve accurate tracking and mapping. To this end, if an object is detected by both methods, the segmentation mask takes the result of the geometric method; if an object is detected only by the deep learning method, the segmentation mask also contains this information. The final segmented image for the example of the previous paragraphs can be seen in Figure 4c. The segmented dynamic parts are removed from the current frame and from the map.
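
        One possible reading of this fusion rule is sketched below, assuming binary masks and treating the overlap threshold as a free parameter that is not given in the paper:

    import numpy as np

    def combine_masks(geom_mask, cnn_instance_masks, overlap_thresh=0.2):
        # geom_mask: (m, n) binary mask from the multi-view geometry stage.
        # cnn_instance_masks: (m, n, l) binary masks, one per CNN-detected object.
        # Objects already covered by the geometric mask keep that segmentation;
        # objects seen only by the CNN are added unchanged.
        combined = (geom_mask > 0).astype(np.uint8)
        for i in range(cnn_instance_masks.shape[2]):
            inst = cnn_instance_masks[..., i] > 0
            overlap = np.logical_and(inst, geom_mask > 0).sum() / max(inst.sum(), 1)
            if overlap < overlap_thresh:       # missed by geometry: fall back to the CNN mask
                combined[inst] = 1
        return combined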

D. Tracking and Mapping

        The inputs to the system at this stage are the RGB image, the depth image, and their segmentation mask. We extract ORB features in the image segments classified as static. Since the contours of the segments are high-gradient regions, keypoints falling on these contour areas have to be removed.

E. Background restoration

        For each removed dynamic object, we aim to inpaint the occluded background with static information from previous views in order to synthesize a realistic image without moving content. We believe such synthetic frames, which incorporate the static structure of the environment, are useful for virtual and augmented reality applications as well as for relocalization and camera tracking after map creation.

        Background inpainting method: project the RGB and depth information of the previous keyframes (the last 20 in our experiments) onto the dynamic regions of the current frame to fill the gaps.

        Since we know the poses of the previous and current frames, we project the RGB and depth channels of all previous keyframes (the last 20 in our experiments) onto the dynamic segments of the current frame. Some gaps have no correspondence and are left blank: these areas cannot be inpainted either because their corresponding parts of the scene have not appeared in any keyframe so far or, if they have, because there is no valid depth information for them. Such areas cannot be reconstructed geometrically and would require a more elaborate inpainting technique. Figure 5 shows composite images for three input frames selected from different sequences of the TUM benchmark. Note that the dynamic content has been successfully segmented and removed, and most of the segmented regions have been correctly inpainted with information from the static background.
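
        A rough sketch of this projection-based inpainting under a pinhole model; the data layout (keyframe dicts with 'rgb', 'depth' and a 4x4 camera-to-world pose 'T_wc') is assumed for illustration only:

    import numpy as np

    def inpaint_background(cur_rgb, cur_depth, dyn_mask, keyframes, K, T_cur_wc):
        # Warp the RGB-D data of previous keyframes into the current view and copy it
        # onto the dynamic (occluded) pixels; pixels no keyframe explains stay untouched.
        out_rgb, out_depth = cur_rgb.copy(), cur_depth.copy()
        filled = np.zeros(dyn_mask.shape, dtype=bool)
        h, w = cur_depth.shape
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        for kf in keyframes:
            v, u = np.nonzero(kf["depth"] > 0)                    # valid keyframe pixels
            z = kf["depth"][v, u]
            pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)])
            pc = np.linalg.inv(T_cur_wc) @ kf["T_wc"] @ pts       # keyframe -> current camera
            z_ok = pc[2] > 0                                      # keep points in front of the camera
            pc, v, u = pc[:, z_ok], v[z_ok], u[z_ok]
            uc = np.round(fx * pc[0] / pc[2] + cx).astype(int)
            vc = np.round(fy * pc[1] / pc[2] + cy).astype(int)
            ok = (uc >= 0) & (uc < w) & (vc >= 0) & (vc < h)
            uc, vc, zc, sv, su = uc[ok], vc[ok], pc[2][ok], v[ok], u[ok]
            use = (dyn_mask[vc, uc] > 0) & ~filled[vc, uc]        # only still-empty dynamic pixels
            out_rgb[vc[use], uc[use]] = kf["rgb"][sv[use], su[use]]
            out_depth[vc[use], uc[use]] = zc[use]
            filled[vc[use], uc[use]] = True
        return out_rgb, out_depth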

        Another application of these synthetic frames is: if the dynamic regions of the frame are inpainted with static content, the system can operate as a SLAM system using the inpainted image under static assumptions.

3 Experimental results

        We have evaluated our system on the public TUM RGB-D and KITTI datasets and compared it with other state-of-the-art SLAM systems for dynamic environments, using the results published in the original papers where possible. Furthermore, we compare our system with the original ORB-SLAM2 to quantify the improvement of our method in dynamic scenes. In this case, results for some sequences had not been published, so we performed the evaluation ourselves. Mur-Artal and Tardós [1] suggested running each sequence five times and reporting the median result to account for the non-deterministic nature of the system. Since dynamic objects tend to increase this non-deterministic effect, we ran each sequence ten times.

A. TUM Dataset

        The TUM RGB-D dataset [22] consists of 39 sequences recorded at full frame rate (30 Hz) in different indoor scenes with a Microsoft Kinect sensor. Both RGB and depth images are available, together with ground-truth trajectories recorded by a high-accuracy motion-capture system. In the sequences named sitting (s), two people sit at a table, talk and gesticulate, i.e., there is low motion. In the sequences named walking (w), two people walk through the background and foreground and then sit down at a table; these sequences are highly dynamic and very challenging for standard SLAM systems. For both types of sequences, sitting (s) and walking (w), there are four kinds of camera motion: (1) halfsphere (half): the camera moves along a hemispherical trajectory of 1 m diameter; (2) xyz: the camera moves along the x, y and z axes; (3) rpy: the camera rotates around the roll, pitch and yaw axes; (4) static: the camera is held still manually.

        We use the absolute trajectory RMSE proposed by Sturm et al. [22] as the error metric for the experiments. Table I shows the results of different variants of our system on six sequences of this dataset. First, DynaSLAM (N) uses only Mask R-CNN to segment a priori dynamic objects. Second, in DynaSLAM (G), dynamic objects are detected only by the multi-view geometry method based on depth changes. Third, in DynaSLAM (N+G), dynamic objects are detected by combining the geometric and deep learning methods. Finally, we considered it interesting to analyze the system shown in Figure 6. In this variant (N+G+BI), a background inpainting (BI) stage is performed before tracking and mapping. The motivation for this experiment is that, if the dynamic regions are inpainted with static content, the system could operate on the inpainted images as a standard SLAM system under the static-world assumption. In this variant, the ORB feature extractor works on both the real and the reconstructed regions of the frame, finding matches against keypoints of previously processed keyframes.
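
        For reference, the absolute trajectory RMSE can be computed as in the sketch below, which aligns the estimated trajectory to the ground truth with a closed-form rigid-body fit before taking the RMSE; the TUM benchmark's own evaluation script additionally handles timestamp association:

    import numpy as np

    def ate_rmse(est, gt):
        # est, gt: (N, 3) arrays of associated estimated / ground-truth positions.
        mu_e, mu_g = est.mean(0), gt.mean(0)
        E, G = est - mu_e, gt - mu_g
        U, _, Vt = np.linalg.svd(E.T @ G)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
        R = Vt.T @ D @ U.T                                       # rotation aligning est to gt
        t = mu_g - R @ mu_e
        aligned = (R @ est.T).T + t
        return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))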

        Adding the background inpainting (BI) stage before camera localization usually results in lower tracking accuracy. The reason is that the background reconstruction is tightly coupled to the camera pose: in sequences with strongly rotational motion (rpy, half), the estimated camera pose has a larger error, leading to an inaccurate background reconstruction. The background inpainting stage (BI) should therefore be performed after tracking is completed. The main value of the background reconstruction lies in synthesizing static images for applications such as virtual reality or cinematography (Fig. 5). From now on, all DynaSLAM results refer to the best-performing variant (N+G).

Fig. 6: Block diagram of RGB-D DynaSLAM (N+G+BI).
 

        Table II shows the comparison with RGB-D ORB-SLAM2 on the same sequences. In highly dynamic scenes (walking), our method outperforms ORB-SLAM2, achieving errors similar to those of the original RGB-D ORB-SLAM2 system in static scenes. In low dynamic scenes (sitting), the tracking results are slightly worse because the remaining tracked keypoints are farther away once those on dynamic objects are discarded. Nevertheless, DynaSLAM's map does not contain the dynamic objects that appear in the sequences. Figure 7 compares the trajectories estimated by DynaSLAM and ORB-SLAM2 against the ground truth.

        Table III shows the comparison of our system with several RGB-D SLAM systems designed for dynamic environments. To evaluate the performance of our method and state-of-the-art motion detection methods (independent of the SLAM system used), we also show the corresponding improvement values, compared to the original SLAM system used in each case. DynaSLAM significantly outperforms other systems in all sequences (high and low dynamic). The error is usually around 1-2 cm, which is similar to the state-of-the-art in static scenes. Our motion detection method also outperforms other methods.

        ORB-SLAM, the monocular version of ORB-SLAM2, is generally more accurate than the RGB-D version in dynamic scenes because of their different initialization algorithms. RGB-D ORB-SLAM2 initializes and starts tracking from the first frame, so dynamic objects may introduce errors. ORB-SLAM delays initialization until there is enough parallax and consensus with the static-scene assumption. As a result, it does not track the camera throughout the whole sequence and sometimes misses a large part of it, or even fails to initialize.

        Table IV shows the tracking results and the percentage of the trajectory that is tracked for ORB-SLAM and DynaSLAM (monocular) on the TUM dataset. DynaSLAM always initializes earlier than ORB-SLAM. In fact, in highly dynamic sequences, ORB-SLAM only initializes once the moving objects have disappeared from the scene. In summary, although DynaSLAM is slightly less accurate, it manages to bootstrap the system in the presence of dynamic content and to generate a map free of such content (see Figure 1) that can be reused in long-term applications. The reason for DynaSLAM's slightly lower accuracy is that its estimated trajectories are longer, so there is more room for error to accumulate.

 B. KITTI dataset

        The KITTI dataset [23] contains stereo sequences recorded from vehicles in urban and highway environments. Table V shows our results compared with stereo ORB-SLAM2 on eleven training sequences. We use two different metrics, the absolute trajectory RMSE proposed in [22] and the average relative translation and rotation error proposed in [23].

        Table VI shows the results of the monocular variants of ORB-SLAM and DynaSLAM on the same sequences. Note that the results are similar to those of the stereo case, but the monocular case is more sensitive to dynamic objects, hence the added value of DynaSLAM. In some sequences, tracking accuracy improves by not using features belonging to a priori dynamic objects (e.g., vehicles, bicycles). For example, in the KITTI 01 and KITTI 04 sequences all the vehicles that appear are moving. In sequences where most of the recorded cars are parked and therefore static, the absolute trajectory RMSE is usually slightly larger because the keypoints used for tracking are farther away and usually belong to low-texture regions (e.g., KITTI 00, KITTI 02, KITTI 06). However, the loop closure and relocalization algorithms work more robustly because the generated map contains only structural objects, i.e., the map can be reused in long-term applications.

         As future work, it would be interesting to distinguish between movable and actually moving objects using only RGB information. If a (movable) car is detected by the CNN but is not currently moving, its keypoints could be used for local tracking but should not be included in the map.

C. Timing Analysis

        To complete the evaluation of the proposed method, Table VII shows the average computation time of its different stages. Note that DynaSLAM is not optimized for real-time operation; however, its ability to create life-long maps is also relevant for offline operation. Mur-Artal and Tardós report real-time performance for ORB-SLAM2 [1]. He et al. [19] report that Mask R-CNN runs at 195 ms per image on an Nvidia Tesla M40 GPU. Adding the multi-view geometry stage introduces a further slowdown, mainly due to the region growing algorithm. Background inpainting also adds latency, which is another reason to perform it after the tracking and mapping stages, as shown in Figure 2.

4 Conclusions

        We propose a visual SLAM system built on ORB-SLAM that adds a motion segmentation approach, making it robust in dynamic environments with monocular, stereo and RGB-D cameras. Our system accurately tracks the camera and creates a static and reusable map of the scene. In the RGB-D case, DynaSLAM can synthesize RGB frames without dynamic content and with the occluded background inpainted, as well as their corresponding synthesized depth frames, which is very useful for virtual reality applications. We also include a video demonstrating DynaSLAM's potential. Compared with the current state of the art, DynaSLAM achieves the highest accuracy in most cases. On the TUM dynamic objects dataset, DynaSLAM is currently the best RGB-D SLAM solution. In the monocular case, our accuracy is similar to that of ORB-SLAM, but we obtain a static map of the scene and can initialize it earlier. On the KITTI dataset, the accuracy of DynaSLAM is slightly lower than that of monocular and stereo ORB-SLAM, except when dynamic objects occupy a significant part of the scene. However, our estimated map contains only structural objects and can therefore be reused in long-term applications. Future extensions of this work may include real-time performance, an RGB-based motion detector, or a more realistic synthesis of the RGB frames using finer inpainting techniques such as GANs, as in Pathak et al. [24].
