Classic literature reading--NICER-SLAM (RGB neural implicit dense SLAM)

0. Introduction

Neural implicit representations have become a popular scene representation in SLAM in recent years, especially in dense visual SLAM. However, previous work in this direction either relied on RGB-D sensors or required a separate monocular SLAM approach for camera tracking, and failed to produce high-accuracy, dense 3D scene reconstructions. In this paper, we propose NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes camera poses and a hierarchical neural implicit map representation, which also allows high-quality novel view synthesis.

To facilitate the map optimization process, we integrate additional supervision signals, including easily obtainable monocular geometric cues and optical flow, and introduce a simple warping loss to further enforce geometric consistency. Furthermore, to improve performance in complex indoor scenes, we propose a locally adaptive transformation from the signed distance function (SDF) to density in the volume rendering equation. On synthetic and real datasets, we demonstrate strong performance in dense mapping, tracking, and novel view synthesis, even competitive with recent RGB-D SLAM systems. The code for this work is not open source yet, so stay tuned.


1. Article contribution

The contributions of this paper are as follows:

1. We propose NICER-SLAM, one of the first dense RGB-only SLAM systems, which enables end-to-end optimization of tracking and mapping and also allows high-quality novel view synthesis.

2. We introduce hierarchical neural implicit encodings for the SDF, various geometric and motion regularizations, and a locally adaptive SDF-to-density transformation.

3. We demonstrate strong mapping, tracking, and novel view synthesis performance on both synthetic and real datasets, even competing with recent RGB-D SLAM methods.

2. System overview

We provide an overview of the NICER-SLAM pipeline in Figure 2. Given only an RGB video as input, we simultaneously estimate accurate 3D scene geometry and color together with the camera poses via end-to-end optimization. Our method takes only the RGB stream as input and outputs camera poses as well as a learned hierarchical scene representation for geometry and color. To achieve end-to-end joint mapping and tracking, we render predicted color, depth, and normals and optimize them against the input RGB images and monocular cues. In addition, we further enforce geometric consistency through an RGB warping loss and an optical flow loss. We use hierarchical neural implicit representations to encode scene geometry and appearance (Section 3). With NeRF-like differentiable volume rendering, we can render per-pixel color, depth, and normal values (Section 4), which are used for end-to-end joint optimization of camera poses, scene geometry, and color (Section 5). Finally, we discuss some design choices of the system (Section 6).

[Figure 2: system overview of NICER-SLAM]

3. Hierarchical neural implicit representation (similar to NICE-SLAM)

We first introduce our optimizable hierarchical scene representation approach, which combines multi-level grid features with an MLP decoder for SDF and color prediction.
Coarse-level geometry representation: The goal of the coarse-level geometry representation is to efficiently model coarse scene geometry (objects without fine geometric details) and the scene layout (e.g. walls, floors), even with only partial observations. To this end, we represent the normalized scene with a dense voxel grid of resolution $32 \times 32 \times 32$ and store a 32-dimensional feature in each voxel. For any point $x \in \mathbb{R}^3$ in space, we use a small MLP $f^{coarse}$ with a 64-dimensional hidden layer to obtain its base SDF value $s^{coarse} \in \mathbb{R}$ and a geometric feature $z^{coarse} \in \mathbb{R}^{32}$, as shown in the following formula:
[Equation (1): $(s^{coarse}, z^{coarse}) = f^{coarse}(\gamma(x), \Phi^{coarse}(x))$]
where $\gamma$ denotes a fixed positional encoding [29, 54] that maps coordinates into a higher-dimensional space. Following [71, 69, 68], we set the level of the positional encoding to 6. $\Phi^{coarse}(x)$ denotes the trilinear interpolation of the feature grid $\Phi^{coarse}$ at point $x$.
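To make the coarse level concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code, which is not yet released): a dense $32^3$ feature grid queried by trilinear interpolation via `grid_sample`, concatenated with the positional encoding $\gamma(x)$ and decoded by a small MLP into $s^{coarse}$ and $z^{coarse}$. The class and helper names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def positional_encoding(x, levels=6):
    # gamma(x): map coordinates to [x, sin(2^k pi x), cos(2^k pi x)] for k = 0..levels-1
    feats = [x]
    for k in range(levels):
        feats += [torch.sin((2 ** k) * torch.pi * x), torch.cos((2 ** k) * torch.pi * x)]
    return torch.cat(feats, dim=-1)

class CoarseLevel(nn.Module):
    def __init__(self, grid_res=32, feat_dim=32, hidden=64, pe_levels=6):
        super().__init__()
        # dense feature grid Phi^coarse, stored as (1, C, D, H, W) for grid_sample
        self.grid = nn.Parameter(0.01 * torch.randn(1, feat_dim, grid_res, grid_res, grid_res))
        self.pe_levels = pe_levels
        in_dim = 3 * (2 * pe_levels + 1) + feat_dim
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1 + 32))  # base SDF + 32-d feature

    def forward(self, x):
        # x: (N, 3) points, normalized to [-1, 1] for grid_sample
        g = x.view(1, -1, 1, 1, 3)
        feat = F.grid_sample(self.grid, g, mode='bilinear', align_corners=True)
        feat = feat.view(self.grid.shape[1], -1).t()         # (N, feat_dim) trilinear features
        out = self.mlp(torch.cat([positional_encoding(x, self.pe_levels), feat], dim=-1))
        return out[:, :1], out[:, 1:]                        # s^coarse, z^coarse
```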

Fine-level geometric representation: While coarse geometry can be obtained with our coarse-level shape representation, it is important to also capture high-frequency geometric details in the scene. To achieve this, we use a multi-resolution feature grid with an MLP decoder [5, 31, 76, 53] to model the high-frequency details as residual SDF values. Specifically, we use multi-resolution dense feature grids $\{\Phi^{fine}_l\}^L_1$ with resolutions $R_l$. The resolutions are sampled in a geometric progression [31] so as to incorporate features of different frequencies:
[Equation (2): $R_l = \lfloor R_{min} \cdot b^{\,l} \rfloor$ with growth factor $b = \exp\!\left(\frac{\ln R_{max} - \ln R_{min}}{L-1}\right)$]
where $R_{min}$ and $R_{max}$ correspond to the lowest and highest resolutions, respectively. Here we set $R_{min} = 32$ and $R_{max} = 128$, with a total of $L = 8$ levels. The feature dimension of each level is 4.
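As a quick sanity check of this geometric progression, the following small Python snippet (an illustration, not part of the paper; the helper name is made up) computes the per-level resolutions implied by $R_{min}=32$, $R_{max}=128$, and $L=8$.

```python
import numpy as np

def level_resolutions(r_min=32, r_max=128, num_levels=8):
    # growth factor b chosen so that r_min * b^(L-1) equals r_max
    b = np.exp((np.log(r_max) - np.log(r_min)) / (num_levels - 1))
    return [int(np.floor(r_min * b ** l)) for l in range(num_levels)]

print(level_resolutions())  # 8 resolutions growing geometrically from 32 up to 128
```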
Now, to model the residual SDF value at a point $x$, we extract the trilinearly interpolated features at each level, concatenate them, and feed them into $f^{fine}$, an MLP with 3 hidden layers of size 64:
[Equation (3): $(\Delta s, z^{fine}) = f^{fine}(\gamma(x), \{\Phi^{fine}_l(x)\}_l)$]
where $z^{fine} \in \mathbb{R}^{32}$ is the fine-level geometric feature of $x$. The final predicted SDF value $\hat{s}$ at $x$ is simply the sum of the base SDF value $s^{coarse}$ from the coarse level and the fine-level residual SDF $\Delta s$:
[Equation (4): $\hat{s} = s^{coarse} + \Delta s$]
Color representation: In addition to the 3D geometry, we also predict color values so that our mapping and camera tracking can be optimized with a color loss as well. Moreover, as another application, this allows us to render images from novel viewpoints. Inspired by [31], we use another multi-resolution feature grid $\{\Phi^{color}_l\}^L_1$ and a decoder $f^{color}$ parameterized by a 2-layer MLP of size 64 to encode color. The number of feature-grid levels is now $L = 16$, and the feature dimension of each level is 2. The minimum and maximum resolutions are now $R_{min} = 16$ and $R_{max} = 2048$, respectively. We predict the color value of each point as:
[Equation (5): the color decoder $f^{color}$ takes $\gamma(x)$, the interpolated color features $\{\Phi^{color}_l(x)\}_l$, the normal $\hat{n}$, the encoded viewing direction $\gamma(v)$, and the geometric features, and outputs the color $\hat{c}$]
where $\hat{n}$ is the normal at point $x$ computed from the $\hat{s}$ of equation (4), and $\gamma(v)$ is the positional encoding of the viewing direction $v$ with level 4, following [68, 71].
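Since the normal $\hat{n}$ is derived from the predicted SDF, one common way to obtain it is as the normalized gradient of $\hat{s}$ with respect to $x$ via automatic differentiation. The sketch below is my own assumption about the implementation; `sdf_fn` is a placeholder for the combined coarse + fine SDF prediction.

```python
import torch
import torch.nn.functional as F

def sdf_normal(sdf_fn, x):
    # sdf_fn: callable mapping (N, 3) points to (N, 1) SDF values (coarse + fine residual)
    x = x.detach().requires_grad_(True)           # treat the query points as leaves
    s = sdf_fn(x)
    grad, = torch.autograd.grad(s.sum(), x, create_graph=True)
    return F.normalize(grad, dim=-1)              # unit-length normals n_hat
```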

4. Volume rendering (an important part)

Following recent work on implicit 3D reconstruction [38, 68, 71, 59] and dense visual SLAM [51, 76], we optimize the scene representation from Section 3 using differentiable volume rendering. Specifically, to render a pixel, we cast a ray $r$ from the camera center $o$ along its normalized viewing direction $v$ through that pixel. We then sample $N$ points along the ray, written as $x_i = o + t_i v$, with predicted SDF values $\hat{s}_i$ and color values $\hat{c}_i$. For volume rendering, we follow [68] and convert the SDF $\hat{s}_i$ into a density value $\sigma_i$:
[Equation (6): VolSDF-style transformation $\sigma_\beta(\hat{s}_i)$ from the SDF to density, with $\beta$ controlling the sharpness of the transition]
where $\beta \in \mathbb{R}$ is the parameter controlling the transformation from SDF to volume density. As in [29], the color $\hat{C}$ of the current ray $r$ is calculated as:
[Equation (7): $\hat{C}(r) = \sum_{i=1}^{N} T_i\, \alpha_i\, \hat{c}_i$, with $T_i = \prod_{j=1}^{i-1}(1 - \alpha_j)$ and $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$]
where $T_i$ and $\alpha_i$ are the transmittance and the alpha value of sample point $i$ along the ray $r$, and $\delta_i$ is the distance between adjacent sample points. Similarly, we can calculate the depth $\hat{D}$ and normal $\hat{N}$ of the surface intersected by the ray $r$ as follows:
[Equations (8) and (9): $\hat{D}(r) = \sum_{i=1}^{N} T_i\, \alpha_i\, t_i$ and $\hat{N}(r) = \sum_{i=1}^{N} T_i\, \alpha_i\, \hat{n}_i$]
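A compact sketch of this rendering step is shown below (my own illustration, not the authors' code, under the assumption that Eq. (6) follows the VolSDF-style Laplace-CDF transformation): SDF values along a ray are turned into densities, then into alpha/transmittance weights that integrate color, depth, and normals.

```python
import torch

def sdf_to_density(s, beta):
    # assumed VolSDF-style transformation: scaled Laplace CDF of -s (high density inside,
    # low outside), with beta controlling how sharply density changes across the surface
    psi = 0.5 * torch.exp(-s.abs() / beta)
    return torch.where(s >= 0, psi, 1.0 - psi) / beta

def render_ray(s, c, n, t, beta):
    # s: (N,) SDF values, c: (N, 3) colors, n: (N, 3) normals, t: (N,) sample depths
    sigma = sdf_to_density(s, beta)
    delta = torch.cat([t[1:] - t[:-1], torch.full_like(t[:1], 1e10)])   # spacing delta_i
    alpha = 1.0 - torch.exp(-sigma * delta)                             # alpha_i
    T = torch.cumprod(torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0)
    w = T * alpha                                                       # per-sample weights
    C_hat = (w[:, None] * c).sum(dim=0)    # rendered color,  Eq. (7)
    D_hat = (w * t).sum(dim=0)             # rendered depth,  Eq. (8)
    N_hat = (w[:, None] * n).sum(dim=0)    # rendered normal, Eq. (9)
    return C_hat, D_hat, N_hat
```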
Locally adaptive transformation: This paper further proposes a locally adaptive version of this transformation. The $\beta$ parameter in Equation (6) models the smoothness near object surfaces. As the network becomes more certain about the surfaces, the value of $\beta$ gradually decreases, so this optimization scheme enables faster and sharper reconstruction. In VolSDF [68], $\beta$ is modeled as a single global parameter. This essentially assumes the same degree of optimization in different scene regions, which is sufficient for small scenes. However, in complex indoor scenes, a globally optimized $\beta$ is suboptimal (see Section 4.2 of the paper for ablation studies). Therefore, we make the $\beta$ values local, so that the SDF-to-density transformation in Equation (6) is also locally adaptive. Specifically, we maintain a voxel counter over the whole scene and count the number of point samples falling in each voxel during mapping. We empirically choose a voxel grid size of $64^3$ (see Section 4.2 of the paper for ablation studies). Next, we heuristically design a transformation from the count $T_p$ to the $\beta$ value:
[Equation: heuristic exponential mapping from the local sample count $T_p$ to $\beta$]
The transformation was obtained by plotting how $\beta$ decreases with the voxel count under the global $\beta$ setting and fitting a curve to it. We empirically found that an exponential curve gives the best fit.
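The following sketch shows one possible way to implement such a locally adaptive $\beta$ (entirely my own illustration: the class name, the exponential form, and all constants are placeholders, not the curve fitted in the paper): a $64^3$ counter is incremented by the sampled points, and each query point looks up its voxel count $T_p$ to get a local $\beta$ that decays exponentially with $T_p$.

```python
import torch

class LocalBetaGrid:
    def __init__(self, res=64, beta_max=0.1, beta_min=1e-3, rate=5e-5):
        # per-voxel counter of how many point samples have fallen in each voxel
        self.counter = torch.zeros(res, res, res)
        self.beta_max, self.beta_min, self.rate = beta_max, beta_min, rate

    def _voxel_index(self, pts):
        # pts: (N, 3) points normalized to [0, 1)
        res = self.counter.shape[0]
        return (pts.clamp(0, 1 - 1e-6) * res).long()

    def update(self, pts):
        # increment the counter of every voxel hit by the sampled points
        idx = self._voxel_index(pts)
        self.counter.index_put_((idx[:, 0], idx[:, 1], idx[:, 2]),
                                torch.ones(len(pts)), accumulate=True)

    def beta(self, pts):
        # local beta decays exponentially with the voxel count T_p (placeholder constants)
        idx = self._voxel_index(pts)
        T_p = self.counter[idx[:, 0], idx[:, 1], idx[:, 2]]
        return self.beta_min + (self.beta_max - self.beta_min) * torch.exp(-self.rate * T_p)
```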

5. End-to-end joint mapping and tracking (this part is also important)

From RGB temporal input only, it is difficult to simultaneously optimize 3D scene geometry and color as well as camera pose due to the high degree of ambiguity, especially for large complex scenes with many textureless and sparsely covered regions. Therefore, to achieve end-to-end joint mapping and tracking under our neural scene representation, we propose the following loss functions, including geometric and prior constraints, single-view and multi-view constraints, and global and local constraints.

RGB rendering loss: Equation (7) connects the 3D neural scene representation with the 2D observations, so we can use a simple RGB reconstruction loss to optimize the scene representation:
[Equation: RGB rendering loss $\mathcal{L}_{rgb}$, the average difference between the rendered colors $\hat{C}(r)$ and the input colors $C(r)$ over the sampled rays $R$]
where $R$ denotes the set of randomly sampled pixels/rays in each iteration and $C$ is the input pixel color.
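A minimal sketch of this loss (assuming an L1 distance; the exact norm used in the paper may differ):

```python
import torch

def rgb_rendering_loss(C_hat, C):
    # C_hat, C: (|R|, 3) rendered and input pixel colors for the sampled rays
    return (C_hat - C).abs().mean()
```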

RGB warping loss: To further enforce geometric consistency from color input alone, we also add a simple per-pixel warping loss. For a pixel in frame $m$, denoted $r_m$, we first render its depth value using equation (8) and back-project it into 3D space, and then project it into a nearby keyframe $n$ using the intrinsic and extrinsic parameters of frame $n$. The projected pixel in keyframe $n$ is denoted $r_{m \to n}$. The warping loss is then defined as:

[Equation: RGB warping loss $\mathcal{L}_{warp}$, the color difference between pixel $r_m$ and its projections $r_{m \to n}$ into the keyframes $n \in K_m$]
where $K_m$ denotes the keyframe list of the current frame $m$, excluding frame $m$ itself. Pixels projected outside the image boundary of frame $n$ are masked out. Note that, unlike [11], we observe that simply performing per-pixel warping is more efficient than patch warping for randomly sampled pixels, with no drop in performance.
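The reprojection at the heart of this loss can be sketched as follows (my own illustration, assuming a pinhole intrinsic matrix `K` and 4x4 camera-to-world poses `c2w_m`, `c2w_n`; these names are placeholders): unproject $r_m$ with its rendered depth, transform it into keyframe $n$, and project it to $r_{m \to n}$.

```python
import torch

def warp_pixels(uv_m, depth_m, K, c2w_m, c2w_n):
    # uv_m: (N, 2) pixel coordinates r_m in frame m, depth_m: (N,) rendered depths D_hat
    ones = torch.ones_like(depth_m)
    pix_h = torch.cat([uv_m, ones[:, None]], dim=-1)                   # homogeneous pixels
    cam_m = (torch.linalg.inv(K) @ pix_h.t()).t() * depth_m[:, None]   # back-project into frame m
    cam_m_h = torch.cat([cam_m, ones[:, None]], dim=-1)
    world = (c2w_m @ cam_m_h.t()).t()                                  # camera m -> world
    cam_n = (torch.linalg.inv(c2w_n) @ world.t()).t()[:, :3]           # world -> camera n
    pix_n = (K @ cam_n.t()).t()
    return pix_n[:, :2] / pix_n[:, 2:3]                                # projected pixels r_{m->n}
```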

Optical flow loss: Both the RGB rendering and warping losses are point-based terms that are susceptible to local minima. Therefore, we add a loss based on optical flow estimation, which respects a region-smoothness prior and helps resolve ambiguities. Suppose the sampled pixel in frame $m$ is $r_m$ and the corresponding projected pixel in frame $n$ is $r_n$; then the optical flow loss can be added as follows:
[Equation: optical flow loss $\mathcal{L}_{flow}$, penalizing the difference between the flow induced by the correspondence $r_m \to r_n$ and the GMFlow estimate $GM(r_{m \to n})$]
where $GM(r_{m \to n})$ denotes the optical flow estimated by GMFlow [66].
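A hedged sketch of this term (my own assumption about the exact form, here an L1 difference between the induced flow and the GMFlow prediction at the sampled pixels):

```python
import torch

def optical_flow_loss(r_m, r_n, gmflow):
    # r_m, r_n: (N, 2) sampled and projected pixel coordinates; gmflow: (N, 2) GMFlow
    # estimate sampled at r_m; compare it against the flow induced by the projection
    return ((r_n - r_m) - gmflow).abs().mean()
```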

Monocular depth loss: Given RGB input, geometric cues (such as depth or normals) can be obtained from off-the-shelf monocular estimators [12]. Inspired by [71], we also include this information in the optimization to guide the neural implicit surface reconstruction. More specifically, to enforce consistency between our rendered expected depth $\hat{D}$ and the monocular depth $\bar{D}$, we use the following loss [43, 71]:
[Equation: monocular depth loss, enforcing consistency between the rendered depth $\hat{D}$ and the monocular depth $\bar{D}$, following [43, 71]]
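Because monocular depth predictions are only defined up to scale and shift, losses of this kind (e.g. in [43, 71]) typically align $\hat{D}$ and $\bar{D}$ with a per-batch scale $w$ and shift $q$ before comparing them. The sketch below is my own illustration of that idea via least squares; the exact formulation in the paper may differ.

```python
import torch

def mono_depth_loss(D_hat, D_bar):
    # D_hat, D_bar: (N,) rendered and monocular depths for the sampled pixels
    A = torch.stack([D_hat, torch.ones_like(D_hat)], dim=-1)     # columns [D_hat, 1]
    wq = torch.linalg.lstsq(A, D_bar[:, None]).solution          # least-squares scale w, shift q
    aligned = (A @ wq).squeeze(-1)                               # w * D_hat + q
    return ((aligned - D_bar) ** 2).mean()
```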

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/129225286