DROID-SLAM article reading


Paper information

Title : DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras
Authors : Zachary Teed, Jia Deng
Code : https://github.com/princeton-vl/DROID-SLAM
Year : 2021

Abstract

We introduce DROID-SLAM, a new deep learning-based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is very accurate, a large improvement over previous work, and robust, with a much lower incidence of catastrophic failures. Although trained with monocular video, it can leverage stereo or RGB-D video to improve performance at test time.

Introduction

Shortcomings of traditional SLAM systems: failures come in many forms, such as lost feature tracks, divergence in the optimization algorithm, and accumulation of drift.

Methods combined with deep learning: they can solve some of these problems, but their accuracy still falls far short of classical methods.

DROID-SLAM Features :
High Accuracy: TartanAir SLAM competition, ETH-3D RGB-D, EuRoC
High Robustness: Fewer catastrophic failures
Strong Generalization: All results across 4 datasets and 3 modes are achieved with a single model

DROID-SLAM core : "Differentiable Recurrent Optimization-Inspired Design" (DROID)
based on RAFT, with two innovations:

  1. we iteratively update camera poses and depth, and the updates can be applied to an arbitrary number of frames
  2. each update of camera poses and depth maps in DROID-SLAM is produced by a differentiable Dense Bundle Adjustment (DBA) layer

Differences from DeepV2D and BA-Net (I will take a closer look after reading those two papers):
The design of DROID-SLAM is novel. The closest existing deep architectures are DeepV2D [48] and BA-Net [47], both of which focus on depth estimation and report limited SLAM results. DeepV2D alternates between updating depth and updating camera pose rather than performing bundle adjustment. BA-Net has a bundle adjustment layer, but the two layers are quite different: theirs is not "dense", in that it optimizes a small number of coefficients used to linearly combine a depth basis (a set of pre-predicted depth maps), whereas we directly optimize per-pixel depth without being constrained by a depth basis. Furthermore, BA-Net optimizes photometric reprojection error (in feature space), while we leverage state-of-the-art flow estimation and optimize geometric error.

Related Work

VSLAM

Divided into direct and indirect methods:

Indirect methods: detect and match features, then minimize reprojection error to optimize camera poses and the 3D point cloud.

Direct methods: model the image formation process and define an objective function on photometric error. The advantage is that more image information can be modeled, such as lines and intensity variations that indirect methods do not use. However, photometric error generally leads to a harder optimization problem, and direct methods are less robust to geometric distortions such as rolling-shutter artifacts, so they require more sophisticated optimization techniques, such as coarse-to-fine image pyramids, to avoid local minima.

Our method does not fit neatly into either category. Like direct methods, we do not require a preprocessing step to detect and match features between images; instead we use the full image, which lets us exploit a wider range of information than indirect methods that typically use only corners and edges. However, like indirect methods, we minimize reprojection error, which is an easier optimization problem and avoids the need for more complex machinery such as image pyramids. In this sense, our method draws on the advantages of both: the smoother objective function of indirect methods and the greater modeling power of direct methods.

Deep Learning

Deep learning has recently been applied to SLAM problems. Much work has focused on training systems for specific subproblems, such as feature detection, feature matching and outlier rejection, and localization.

Other works focus on training SLAM systems end to end. These methods are not full SLAM systems but focus on small-scale reconstructions of two to a dozen frames. They lack many core features of modern SLAM systems, such as loop closure and global bundle adjustment, which limits their ability to perform large-scale reconstruction, as demonstrated in our experiments. ∇SLAM implements several existing SLAM algorithms as differentiable computation graphs, allowing reconstruction errors to be backpropagated to the sensor measurements. Although this approach is differentiable, it has no trainable parameters, so its performance is limited by the accuracy of the classical algorithms it emulates.

DeepFactors [9] is the most complete deep SLAM system, built on the earlier CodeSLAM [1]. It performs joint optimization of pose and depth variables and is capable of short-range and long-range loop closure. Similar to BA-Net [47], DeepFactors optimizes the coefficients of a learned depth basis during inference. In contrast, we do not rely on a learned basis but directly optimize per-pixel depth.

Approach

Representation: the system maintains a camera pose and an inverse depth map for each keyframe, together with a frame graph whose edges connect pairs of co-visible images; the poses and depths are the variables updated by the operator described below.

Feature Extraction and Correlation

The feature extraction and correlation part follows RAFT exactly: a feature network plus a correlation volume.

Feature extraction network : the feature network consists of 6 residual blocks and 3 downsampling layers, producing feature maps at 1/8 of the input resolution; a separate context network extracts image context that is fed into the update operator described below.

Correlation Pyramid : for each edge $(i, j)$ in the frame graph, a 4D correlation volume is computed by taking dot products between all pairs of feature vectors:

$$C_{ij}^{uvhw} = \left\langle g_\theta(I_i)_{uv},\, g_\theta(I_j)_{hw} \right\rangle$$

The last two dimensions are average-pooled to build a pyramid of correlation volumes at multiple resolutions.

Correlation Lookup : given pixel coordinates and a search radius, the lookup operator retrieves correlation values from each level of the correlation pyramid within the radius and concatenates them into a feature vector.
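To make the lookup concrete, here is a minimal PyTorch sketch of a single-level correlation volume and windowed lookup; the helper names, tensor shapes, and sampling details are illustrative assumptions, not the released DROID-SLAM code.

```python
# Sketch of a RAFT-style correlation volume and a single-level lookup.
import torch
import torch.nn.functional as F

def correlation_volume(feat_i, feat_j):
    """feat_i, feat_j: [B, C, H, W] feature maps at 1/8 resolution.
    Returns a 4D correlation volume [B, H, W, H, W] of dot products."""
    B, C, H, W = feat_i.shape
    fi = feat_i.view(B, C, H * W)
    fj = feat_j.view(B, C, H * W)
    corr = torch.einsum('bci,bcj->bij', fi, fj)        # [B, HW, HW]
    return corr.view(B, H, W, H, W)

def lookup(corr, coords, radius=3):
    """Sample correlation in a (2r+1)x(2r+1) window around `coords`.
    corr:   [B, H, W, H, W] correlation volume
    coords: [B, H, W, 2] correspondence field p_ij (x, y) in image j
    Returns [B, (2r+1)**2, H, W] correlation features."""
    B, H, W = coords.shape[:3]
    corr = corr.reshape(B * H * W, 1, H, W)            # one 2D slice per source pixel
    # build the local sampling window around each correspondence
    dy, dx = torch.meshgrid(torch.arange(-radius, radius + 1),
                            torch.arange(-radius, radius + 1), indexing='ij')
    delta = torch.stack([dx, dy], dim=-1).float()      # (x, y) offsets
    centers = coords.reshape(B * H * W, 1, 1, 2)
    win = centers + delta.view(1, 2 * radius + 1, 2 * radius + 1, 2)
    # normalize to [-1, 1] for grid_sample
    win_x = 2 * win[..., 0] / (W - 1) - 1
    win_y = 2 * win[..., 1] / (H - 1) - 1
    grid = torch.stack([win_x, win_y], dim=-1)
    out = F.grid_sample(corr, grid, align_corners=True)   # [BHW, 1, 2r+1, 2r+1]
    return out.view(B, H, W, -1).permute(0, 3, 1, 2)
```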

Update Operator

The core module of DROID-SLAM
$C_{ij}$ is the correlation tensor between image $I_i$ and image $I_j$, and $h_{ij}$ is the hidden state, which is updated in every round. Each round also outputs a pose increment $\Delta \xi^{(k)}$ and a depth increment $\Delta d^{(k)}$, which update the pose and depth for the next iteration:

$$G^{(k+1)} = \mathrm{Exp}\!\left(\Delta \xi^{(k)}\right) \circ G^{(k)}, \qquad d^{(k+1)} = \Delta d^{(k)} + d^{(k)}$$

Correspondence: before each update, using the current pose and depth, every pixel in image $I_i$ (with pixel-coordinate grid $p_i \in \mathbb{R}^{H\times W\times 2}$) is reprojected into image $I_j$ to obtain the corresponding grid $p_{ij}$ (this is essentially the formula for the induced optical flow):

$$p_{ij} = \Pi_c\!\left(G_{ij} \circ \Pi_c^{-1}(p_i, d_i)\right), \qquad G_{ij} = G_j \circ G_i^{-1}$$
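As an illustration of the correspondence formula, the following sketch back-projects the pixels of frame $i$ using their inverse depths and reprojects them into frame $j$ with a pinhole model; the function names and intrinsics layout are assumptions, not the authors' implementation.

```python
# Sketch of p_ij = Pi(G_ij o Pi^{-1}(p_i, d_i)) for a pinhole camera.
import torch

def project(points, fx, fy, cx, cy):
    """Pinhole projection of 3D points [..., 3] to pixel coords [..., 2]."""
    X, Y, Z = points.unbind(-1)
    Z = Z.clamp(min=1e-6)
    return torch.stack([fx * X / Z + cx, fy * Y / Z + cy], dim=-1)

def correspondence_field(inv_depth, R_ij, t_ij, fx, fy, cx, cy):
    """inv_depth: [H, W] inverse depth of frame i.
    R_ij, t_ij: relative rotation [3, 3] and translation [3] from frame i to j.
    Returns p_ij: [H, W, 2] pixel coordinates of frame-i pixels seen in frame j."""
    H, W = inv_depth.shape
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    # back-project: Pi^{-1}(p_i, d_i), with depth = 1 / inverse depth
    depth = 1.0 / inv_depth.clamp(min=1e-6)
    X = (x - cx) / fx * depth
    Y = (y - cy) / fy * depth
    pts_i = torch.stack([X, Y, depth], dim=-1)          # [H, W, 3]
    # apply the relative transform G_ij and reproject into frame j
    pts_j = pts_i @ R_ij.T + t_ij
    return project(pts_j, fx, fy, cx, cy)
```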

Input : using the correspondence field computed in the previous step, correlation features are looked up from the correlation volume $C_{ij}$; at the same time, the optical flow field between the two images is computed from $p_i$ and $p_{ij}$. $C_{ij}$ characterizes the degree of visual consistency between the two images, and the goal of the update module is to compute relative poses that align them so as to maximize this consistency. However, since visual similarity alone can be ambiguous, the optical flow field is also used as input to improve the robustness of pose estimation.

Update : as in the RAFT network, the correlation features and flow features are passed through two convolutional layers and fed, together with the context features, into a GRU module. Inside the GRU, the hidden state $h_{ij}$ is average-pooled to extract global context, which helps improve the robustness of optical flow estimation under large motions.

The GRU also updates the hidden state, giving $h^{(k+1)}$. From this hidden state we predict a flow-field revision $r_{ij} \in \mathbb{R}^{H\times W\times 2}$ and a corresponding confidence map $w_{ij} \in \mathbb{R}^{H\times W\times 2}$, giving the revised correspondence $p_{ij}^{*} = r_{ij} + p_{ij}$.

$h^{(k+1)}$ is also used to predict the pixel-wise damping factor $\lambda$ and the 8x8 upsampling mask used to upsample the low-resolution depth to full resolution.
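A rough sketch of this update step is below: a ConvGRU consumes the stacked correlation/flow/context features, and two small heads predict the flow revision $r_{ij}$ and confidence $w_{ij}$. The layer sizes and head structure are assumptions; the released code differs in detail.

```python
# Sketch of a ConvGRU update plus flow-revision and confidence heads.
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden=128, inp=128):
        super().__init__()
        self.convz = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convr = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convq = nn.Conv2d(hidden + inp, hidden, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))             # update gate
        r = torch.sigmoid(self.convr(hx))             # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q                    # new hidden state h^(k+1)

class UpdateHead(nn.Module):
    """Predicts the flow revision r_ij and confidence w_ij from h^(k+1)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.flow = nn.Sequential(nn.Conv2d(hidden, 128, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(128, 2, 3, padding=1))
        self.weight = nn.Sequential(nn.Conv2d(hidden, 128, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(128, 2, 3, padding=1), nn.Sigmoid())

    def forward(self, h):
        return self.flow(h), self.weight(h)           # r_ij, w_ij in [0, 1]
```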

Dense Bundle Adjustment Layer :
The DBA layer converts the dense flow revisions output by the update module into updates of the camera poses and the dense per-pixel depths. It does so by minimizing, over poses and depths, the weighted reprojection error between the revised correspondences $p_{ij}^{*}$ and the correspondences induced by the current poses and depths:

$$\mathbf{E}(\mathbf{G}', \mathbf{d}') = \sum_{(i,j)\in\mathcal{E}} \left\| p_{ij}^{*} - \Pi_c\!\left(G'_{ij} \circ \Pi_c^{-1}(p_i, d'_i)\right) \right\|^{2}_{\Sigma_{ij}}, \qquad \Sigma_{ij} = \operatorname{diag}\left(w_{ij}\right)$$

The Gauss-Newton method is used to solve for the pose and depth increments; exploiting the block structure of the system, the Schur complement is applied to first solve for the pose increment and then back-substitute for the depth increment:

$$\begin{bmatrix} \mathbf{B} & \mathbf{E} \\ \mathbf{E}^{\top} & \mathbf{C} \end{bmatrix} \begin{bmatrix} \Delta\boldsymbol{\xi} \\ \Delta \mathbf{d} \end{bmatrix} = \begin{bmatrix} \mathbf{v} \\ \mathbf{w} \end{bmatrix}, \qquad \Delta\boldsymbol{\xi} = \left(\mathbf{B} - \mathbf{E}\mathbf{C}^{-1}\mathbf{E}^{\top}\right)^{-1}\!\left(\mathbf{v} - \mathbf{E}\mathbf{C}^{-1}\mathbf{w}\right), \qquad \Delta \mathbf{d} = \mathbf{C}^{-1}\!\left(\mathbf{w} - \mathbf{E}^{\top}\Delta\boldsymbol{\xi}\right)$$

Since each pixel's depth only appears in its own residuals, $\mathbf{C}$ is diagonal and cheap to invert. The whole layer is differentiable, so gradients flow back through the pose and depth updates.
The forward pass and backpropagation of the DBA layer are implemented with LieTorch, a PyTorch extension for Lie groups and Lie algebras.
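To illustrate the Schur-complement solve inside one Gauss-Newton step, here is a small sketch that assumes the Hessian blocks $\mathbf{B}, \mathbf{E}, \mathbf{C}$ (with $\mathbf{C}$ diagonal) and gradient terms $\mathbf{v}, \mathbf{w}$ have already been assembled from the reprojection residuals; it is illustrative only, not the DBA layer implementation.

```python
# Sketch of the Schur-complement solve for pose and depth increments.
import torch

def schur_solve(B, E, C_diag, v, w, lam=1e-4):
    """B: [P, P] pose block, E: [P, D] pose-depth block,
    C_diag: [D] diagonal depth block, v: [P], w: [D] gradient terms,
    lam: damping (predicted per-pixel in DROID-SLAM, a scalar here)."""
    C_inv = 1.0 / (C_diag + lam)                   # diagonal inverse, O(D)
    EC_inv = E * C_inv                             # E @ diag(C_inv), [P, D]
    S = B - EC_inv @ E.T                           # reduced pose system
    dxi = torch.linalg.solve(S, v - EC_inv @ w)    # pose increment
    dd = C_inv * (w - E.T @ dxi)                   # back-substitute depths
    return dxi, dd

# Toy usage with random blocks (not from a real BA problem):
P, D = 6, 100
J = torch.randn(D, P)
B = J.T @ J + torch.eye(P)
E = torch.randn(P, D)
C_diag = torch.rand(D) + 1.0
v, w = torch.randn(P), torch.randn(D)
dxi, dd = schur_solve(B, E, C_diag, v, w)
```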

training process

Monocular scale problem: to resolve the scale and gauge ambiguity of monocular SLAM, the poses of the first two frames are fixed to their ground-truth values during training.

Sampling video/image sequences: to improve generalization, the optical-flow distance between every pair of images in a sequence is computed, giving an $N_i \times N_i$ flow distance matrix; training clips are formed by sampling image sequences from this matrix and are then fed to the network.
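A toy sketch of such flow-distance-based sampling is given below; the thresholds and the greedy strategy are assumptions for illustration, not the exact procedure used in the paper.

```python
# Toy sampler: pick a training clip from an N x N flow-distance matrix.
import numpy as np

def sample_sequence(flow_dist, length=7, lo=8.0, hi=96.0, rng=None):
    """flow_dist: [N, N] mean optical-flow magnitude between frame pairs.
    Greedily picks a next frame whose flow distance to the current frame
    lies in [lo, hi] pixels, producing a `length`-frame training clip."""
    rng = rng or np.random.default_rng()
    N = flow_dist.shape[0]
    seq = [int(rng.integers(N - length))]
    while len(seq) < length:
        cur = seq[-1]
        cand = np.where((flow_dist[cur] > lo) & (flow_dist[cur] < hi))[0]
        cand = cand[cand > cur]
        if len(cand) == 0:
            cand = np.array([min(cur + 1, N - 1)])  # fall back to the next frame
        seq.append(int(rng.choice(cand)))
    return seq
```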

Supervision and loss: supervision comes from ground-truth poses and ground-truth optical flow. The flow loss is the distance between the flow fields induced by the predicted poses/depths and the ground-truth flow fields; the pose loss compares the predicted and ground-truth poses:

$$\mathcal{L}_{pose} = \sum_{i} \left\| \operatorname{Log}_{SE(3)}\!\left(\mathbf{T}_i^{-1} \cdot \mathbf{G}_i\right) \right\|_{2}$$

where $\mathbf{T}_i$ is the ground-truth pose and $\mathbf{G}_i$ the predicted pose.
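For intuition, here is a simplified sketch of the two losses; the rotation-angle-plus-translation pose error stands in for the SE(3) logarithm used in the paper, and the weighting between the terms is omitted.

```python
# Simplified flow and pose losses (illustrative, not the paper's exact form).
import torch

def flow_loss(pred_flow, gt_flow, valid):
    """pred_flow, gt_flow: [B, H, W, 2]; valid: [B, H, W] mask of pixels
    with ground-truth flow."""
    err = (pred_flow - gt_flow).norm(dim=-1)
    return (err * valid).sum() / valid.sum().clamp(min=1)

def pose_loss(R_pred, t_pred, R_gt, t_gt):
    """R_*: [B, 3, 3] rotations, t_*: [B, 3] translations."""
    dR = R_gt.transpose(1, 2) @ R_pred                        # relative rotation
    cos = ((dR.diagonal(dim1=1, dim2=2).sum(-1) - 1) / 2).clamp(-1, 1)
    rot_err = torch.acos(cos)                                 # geodesic angle
    trans_err = (t_pred - t_gt).norm(dim=-1)
    return (rot_err + trans_err).mean()
```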

SLAM system

Like previous SLAM systems, the system implemented in this paper contains two threads: a front end and a back end. The front-end thread extracts features, selects keyframes, and performs local optimization; the back-end thread performs global optimization over all keyframes.

Initialization : the first 12 frames of the video are used for initialization; a frame is only kept if the optical flow to the previous kept frame is greater than 16 px. Once 12 frames have been collected, a frame graph is built and the update operator is run for 10 iterations.

Visual front-end : the front end selects and maintains the keyframe set. When a new image arrives, it extracts features, computes the optical flow field, selects the 3 keyframes with the highest co-visibility according to the flow field, and then iteratively updates the pose and depth of the current keyframe based on this co-visibility. The front end is also responsible for adding and removing edges of the frame graph.

Back-end optimization : the back end performs BA over all keyframes. Before each update iteration, the frame graph is rebuilt over all keyframes based on the flow-distance matrix between them, and the update operator is then run on this graph. LieTorch is used to perform the BA updates on the SE(3) manifold. The back end only runs BA on keyframes; for the remaining frames of the video, only the camera pose is optimized.
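The following sketch shows one plausible way to rebuild a frame graph from a flow-distance matrix (temporal neighbors plus low-flow-distance pairs); the thresholds are assumptions, not the values used in DROID-SLAM.

```python
# Sketch: rebuild keyframe graph edges from a flow-distance matrix.
import numpy as np

def build_frame_graph(flow_dist, max_dist=24.0, num_neighbors=2):
    """flow_dist: [N, N] mean flow magnitude between keyframe pairs.
    Returns a sorted list of directed edges (i, j) for the update operator."""
    N = flow_dist.shape[0]
    edges = set()
    for i in range(N):
        # always connect temporally adjacent keyframes
        for j in range(max(0, i - num_neighbors), min(N, i + num_neighbors + 1)):
            if i != j:
                edges.add((i, j))
        # add co-visibility edges for pairs with small flow distance
        for j in range(N):
            if i != j and flow_dist[i, j] < max_dist:
                edges.add((i, j))
    return sorted(edges)
```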

Stereo and RGB-D:

To apply the SLAM system to stereo and RGB-D settings, the DBA objective (Equation 4 in the paper) is modified slightly. In the RGB-D case, a residual term is added: the sum of squared distances between the estimated depth map and the measured depth map. In the stereo case, the reprojection error is computed with respect to both the left and right cameras, whose relative pose is fixed by the rig.
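Sketching the RGB-D case, the DBA objective gains a depth residual of roughly the following form (the weighting of the extra term is omitted here and is an assumption):

$$\mathbf{E}_{RGBD}(\mathbf{G}', \mathbf{d}') = \mathbf{E}(\mathbf{G}', \mathbf{d}') + \sum_{i} \left\| d'_i - \bar{d}_i \right\|^{2}$$

where $\bar{d}_i$ is the sensor depth measurement for frame $i$ and $\mathbf{E}$ is the reprojection objective defined above.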

Experimental results

The method was evaluated extensively on multiple datasets and compared with other deep learning methods and classical SLAM algorithms.

The experiments focus on comparing the absolute trajectory error (ATE) of the camera trajectories.

The network was trained for 250k steps on the synthetic TartanAir dataset at a resolution of 384x512; training took 1 week on 4 RTX 3090 GPUs.

Extensive experiments were also done on the EuRoC and TUM-RGBD datasets, showing that the network generalizes well to stereo and RGB-D input while achieving high accuracy and robustness.

My thoughts

Judging from the results, the improvement is large, but I'm afraid this kind of method has quite high demands on compute and GPUs~

The code


Origin blog.csdn.net/qin_liang/article/details/131890782