Classic literature reading - NICE-SLAM (Neural Implicit Scalable Encoding for SLAM)

0. Introduction

For deep learning, NeRF must be one of the hottest topics of the past two years. **NeRF (Neural Radiance Fields)** received a Best Paper Honorable Mention at ECCV 2020 and pushed implicit representation to a new height: using only posed 2D images as supervision, it can represent complex 3D scenes. NeRF's rapid development has also fed into many technical directions, such as novel view synthesis and 3D reconstruction, with very good results. Previously, we covered the technical principles of NeRF in "Classic Literature Reading - NeRF-SLAM (Monocular Dense Reconstruction)". Next, taking NICE-SLAM as the core, we introduce the shortcomings and development of existing NeRF methods.

1. Shortcomings and development

NeRF's biggest contribution is the effective combination of neural fields with volume rendering from graphics. It learns a 3D scene implicitly by passing a 5D input (3D spatial coordinates plus a 2D viewing direction) through an MLP. First, it is worth examining NeRF's problems. The main ones are: slow speed, applicability only to static scenes, poor generalization, and the need for a large number of viewpoints. We briefly survey several papers below to see the improvement direction each takes. In the next section we will focus on how NeRF is applied to SLAM, analyzed in detail through NICE-SLAM.

1.1 Neural Sparse Voxel Fields

This work introduces Neural Sparse Voxel Fields (NSVF), a new neural scene representation for fast, high-quality free-viewpoint rendering. NSVF defines a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties in each cell. The underlying voxel structure is learned incrementally via differentiable ray-marching from only a set of posed RGB images. Thanks to the sparse voxel octree structure, rendering of novel views can be accelerated by skipping voxels that contain no relevant scene content.

1.2 Mip-NeRF

Mip-NeRF renders each pixel with cone-shaped regions called conical frustums instead of rays, which reduces aliasing, reveals fine details in images, and lowers error rates by 17-60% relative to NeRF. The model is also 7% faster than NeRF.

1.3 NeRF-SR

NeRF-SR is a solution for high-resolution (HR) novel view synthesis from mostly low-resolution (LR) inputs. The method builds on Neural Radiance Fields (NeRF), which predicts per-point density and color with a multilayer perceptron. While NeRF can generate images at arbitrary scales, its quality degrades at resolutions beyond the observed images. NeRF-SR further improves performance with super-sampling and a refinement network that exploits estimated depth to hallucinate details from related patches on an HR reference image.

1.4 KiloNeRF

KiloNeRF tackles NeRF's slow rendering, which stems mainly from having to query a deep MLP millions of times. KiloNeRF splits the workload across thousands of small MLPs instead of one large, repeatedly queried MLP. Each small MLP represents a portion of the scene, enabling rendering speed-ups of three orders of magnitude with low storage requirements and comparable visual quality.
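As a toy illustration of this divide-and-conquer idea (all names and sizes below are illustrative, not KiloNeRF's actual code), the sketch partitions the unit cube into a uniform grid and dispatches each density query to the tiny MLP that owns the queried cell:

```python
import numpy as np

GRID = 4                      # 4x4x4 = 64 tiny networks (thousands in the paper)
rng = np.random.default_rng(0)

# One tiny 2-layer MLP per cell: 3-D point -> 1 density value.
tiny_mlps = [
    {"W1": rng.normal(size=(3, 8)), "b1": np.zeros(8),
     "W2": rng.normal(size=(8, 1)), "b2": np.zeros(1)}
    for _ in range(GRID ** 3)
]

def cell_index(p):
    """Map a point in the unit cube to the flat index of its grid cell."""
    ijk = np.clip((p * GRID).astype(int), 0, GRID - 1)
    return int(ijk[0] * GRID * GRID + ijk[1] * GRID + ijk[2])

def query_density(p):
    """Dispatch the query to the tiny MLP responsible for p's cell."""
    net = tiny_mlps[cell_index(p)]
    h = np.maximum(p @ net["W1"] + net["b1"], 0.0)   # ReLU hidden layer
    return float((h @ net["W2"] + net["b2"])[0])

sigma = query_density(np.array([0.3, 0.7, 0.1]))
```

Because each network only ever sees points from its own cell, the networks can be far smaller than NeRF's single MLP while covering the whole scene together.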

1.5 Plenoxels

Plenoxels replaces NeRF's central MLP with a sparse 3D voxel grid. Each query point is interpolated from its surrounding voxels, so new 2D views can be rendered without running a neural network at all, which greatly reduces complexity and computational requirements. Plenoxels achieves visual quality similar to NeRF while being two orders of magnitude faster.

1.6 RegNeRF

While NeRF can produce photorealistic renderings of unseen viewpoints when many input views are available, its performance degrades significantly when this number is reduced. The authors observe that most artifacts in the sparse-input setting are caused by errors in the estimated scene geometry and by divergent behavior at the start of training. They address this by regularizing the geometry and appearance of patches rendered from unobserved viewpoints and by annealing the ray-sampling space during training, and additionally use a normalizing-flow model to regularize the colors of unobserved viewpoints. The resulting models not only outperform other methods that optimize a single scene, but in many cases also outperform conditional models extensively pretrained on large multi-view datasets.

1.7 Neural Deformable Voxel Grid for Fast Optimization of Dynamic View Synthesis

Neural Radiance Fields (NeRF) have revolutionized novel view synthesis (NVS) with their superior performance. However, NeRF and its variants often require a lengthy per-scene training procedure in which a multi-layer perceptron (MLP) is fitted to the captured images. To address this challenge, voxel-grid representations have been proposed to significantly speed up training, but these existing methods can only handle static scenes. How to develop an efficient and accurate dynamic view synthesis method remains an open problem, and extending the static-scene approach to dynamic scenes is not trivial, since scene geometry and appearance change over time. In this paper, building on recent advances in voxel-grid optimization, the authors propose a fast deformable radiance field method for dynamic scenes. The approach consists of two modules. The first employs a deformation grid to store 3D dynamic features, together with a lightweight deformation MLP that uses interpolated features to map 3D points in observation space to canonical space. The second contains density and color grids that model the geometry and appearance of the scene; occlusion is also modeled explicitly to further improve rendering quality. Experimental results show that the method achieves performance comparable to D-NeRF with only 20 minutes of training, more than 70x faster than D-NeRF, which clearly demonstrates its efficiency.

2. Specific contributions of NICE-SLAM

NICE-SLAM is a dense SLAM system that incorporates multi-level local information by introducing a hierarchical scene representation. Optimizing this representation with pre-trained geometric priors enables detailed reconstruction in large indoor scenes. Compared with recent neural implicit SLAM systems, the method is more scalable, efficient and robust.

  1. We propose NICE-SLAM, a dense RGB-D SLAM system that is real-time, scalable, predictive, and robust to various challenging scenarios.

  2. At the heart of NICE-SLAM is a hierarchical, grid-based neural implicit encoding. In contrast to global neural scene encoding, this representation allows local updates, a prerequisite for large-scale methods.

  3. We perform extensive evaluations on various datasets, demonstrating competitive performance in mapping and tracking.

3. Overall framework

The article gives an overview of the approach in Figure 2. Four feature grids and their corresponding decoders represent scene geometry and appearance (Section 3.1). For each pixel, a viewing ray is cast using the estimated camera pose; by sampling points along the ray and querying the networks, the depth and color values for that ray can be rendered (Section 3.2). By minimizing the re-rendering losses on depth and color, camera pose and scene geometry are optimized in an alternating fashion (Section 3.3), together with keyframe selection (Section 3.4).

[Figure 2: overview of the NICE-SLAM system]
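The alternating optimization described above can be sketched with a deliberately tiny toy problem (the scalar "pose", "feature" and `render` function are stand-ins, not the actual system): tracking takes gradient steps on the camera pose with the map fixed, and mapping takes gradient steps on the scene feature with the pose fixed, both minimizing the same re-rendering loss:

```python
observed_depth = 2.0
pose, feature = 0.5, 0.1          # toy camera pose and scene feature

def render(pose, feature):
    """Stand-in for volume rendering: depth depends on pose and map."""
    return pose + 3.0 * feature

def loss(pose, feature):
    return (render(pose, feature) - observed_depth) ** 2

lr = 0.05
for step in range(200):
    # Tracking: gradient step on the pose only (d loss / d pose = 2*residual).
    residual = render(pose, feature) - observed_depth
    pose -= lr * 2.0 * residual
    # Mapping: gradient step on the scene feature only
    # (d loss / d feature = 2*residual*3).
    residual = render(pose, feature) - observed_depth
    feature -= lr * 2.0 * residual * 3.0
```

After the loop, the re-rendered depth matches the observation, mirroring how tracking and mapping jointly drive the re-rendering loss toward zero.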

4. Hierarchical scene representation

We now introduce our hierarchical scene representation, which combines multi-level grid features with pre-trained decoders to predict occupancy. Geometry is encoded into three feature grids $\phi^l_\theta$ and their corresponding MLP decoders $f^l$, where $l \in \{0, 1, 2\}$ refers to coarse, mid and fine scene details. In addition, a separate feature grid $\psi_\omega$ and decoder $g_\omega$ model the appearance of the scene. Here $\theta$ and $\omega$ denote the optimizable parameters of geometry and color, i.e. the features in the grids and the weights in the color decoder.
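As a rough sketch of what this representation stores (the room extent is a hypothetical assumption; the voxel sizes and 32-dim features follow the implementation details quoted later in this section, and all variable names are illustrative):

```python
import numpy as np

scene_size = np.array([8.0, 8.0, 4.0])   # assumed scene extent in metres
FEATURE_DIM = 32

def make_grid(voxel_len):
    """Allocate one learnable feature vector per voxel corner."""
    shape = np.ceil(scene_size / voxel_len).astype(int) + 1
    return np.zeros((*shape, FEATURE_DIM))

grids = {
    "coarse": make_grid(2.0),    # phi^0: predicts roughly-observed regions
    "mid":    make_grid(0.32),   # phi^1: base geometry
    "fine":   make_grid(0.16),   # phi^2: residual high-frequency detail
    "color":  make_grid(0.16),   # psi:   appearance features
}
```

Because each level is just a dense array of features, individual voxels can be updated locally without touching the rest of the scene, which is the property the paper's contribution list highlights.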

4.1 Mid- and fine-level scene geometry representation

The observed scene geometry is represented by mid- and fine-level feature grids. During reconstruction, we use these two grids in a coarse-to-fine fashion: the geometry is first reconstructed by optimizing the mid-level feature grid and then refined with the fine-level grid. In the implementation we use voxel grids with side lengths of 32 cm and 16 cm, except on TUM RGB-D [46], where we use 16 cm and 8 cm. For the mid-level grid, features are decoded directly into occupancy values with the associated MLP $f^1$. For any point $p \in \mathbb{R}^3$, we get the occupancy value
$$o^1_p = f^1\big(p, \phi^1_\theta(p)\big)$$
In the above formula, $\phi^1_\theta(p)$ denotes the feature grid tri-linearly interpolated at point $p$. The relatively low resolution allows us to efficiently optimize the grid features to fit observations. To capture small high-frequency details in the scene geometry, we incorporate the fine-level features in a residual manner: the fine-level decoder takes the corresponding mid- and fine-level features as input and outputs an offset from the mid-level occupancy,
$$\Delta o^1_p = f^2\big(p, \phi^1_\theta(p), \phi^2_\theta(p)\big)$$
The final occupancy of a point is
$$o_p = o^1_p + \Delta o^1_p$$
Note that we fix the pre-trained decoders $f^1$ and $f^2$ and optimize only the feature grids $\phi^1_\theta$ and $\phi^2_\theta$ throughout the optimization. We show that this helps stabilize the optimization and learn consistent geometry.
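Putting the pieces above together, a minimal sketch of the mid/fine occupancy query might look as follows (toy grid sizes, feature dimension and random linear "decoders" stand in for the pre-trained MLPs $f^1$ and $f^2$; only the structure of the computation follows the text):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 4                                        # toy feature dimension
phi1 = rng.normal(size=(3, 3, 3, DIM))         # mid-level grid  (phi^1)
phi2 = rng.normal(size=(5, 5, 5, DIM))         # fine-level grid (phi^2)

def trilerp(grid, p):
    """Tri-linear interpolation of a corner-feature grid at p in [0, 1]^3."""
    n = np.array(grid.shape[:3]) - 1
    x = p * n
    i0 = np.clip(np.floor(x).astype(int), 0, n - 1)
    t = x - i0
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out

W1 = rng.normal(size=DIM)          # stand-in for fixed decoder f^1
W2 = rng.normal(size=2 * DIM)      # stand-in for fixed decoder f^2

def occupancy(p):
    f_mid, f_fine = trilerp(phi1, p), trilerp(phi2, p)
    o_mid = W1 @ f_mid                               # o^1_p = f^1(p, phi^1(p))
    delta = W2 @ np.concatenate([f_mid, f_fine])     # residual from f^2
    return o_mid + delta                             # o_p = o^1_p + delta

o = occupancy(np.array([0.4, 0.5, 0.6]))
```

In the real system only `phi1` and `phi2` would receive gradients; the decoder weights stay frozen, as the note above explains.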

4.2 Coarse-grained hierarchy

At the coarse level, a feature grid captures the high-level geometry of the scene (e.g. walls, floors), and it is optimized independently of the mid and fine levels. The purpose of the coarse grid is to predict approximate occupancy values outside the observed geometry (which is encoded at the mid/fine levels), even when each coarse voxel has only been partially observed. We therefore use a very low resolution in our implementation, with a side length of 2 m. Analogously to the mid-level grid, features are interpolated and decoded directly into occupancy values with the MLP $f^0$, i.e.,
$$o^0_p = f^0\big(p, \phi^0_\theta(p)\big)$$
During tracking, the coarse-level occupancy values are only used to predict previously unobserved parts of the scene. This predicted geometry allows us to track even when large parts of the current image have not been seen before.

4.3 Pretrained Feature Decoder

Three different fixed MLPs decode grid features into occupancy values in our framework. The coarse- and mid-level decoders are pre-trained as part of ConvONet [38], which consists of a CNN encoder and an MLP decoder. We train the encoder/decoder with a binary cross-entropy loss between predictions and ground truth, as in [38]. After training, we use only the decoder MLP, since we directly optimize the features to fit the observations; this way, the pre-trained decoder can exploit the resolution-specific priors learned from the training set. The same strategy is used to pre-train the fine-level decoder $f^2$, except that before input to the decoder we concatenate the mid-level feature $\phi^1_\theta(p)$ with the fine-level feature $\phi^2_\theta(p)$.
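A minimal sketch of the pre-training objective (toy occupancy values only; the real pipeline trains the full ConvONet encoder/decoder on 3D data):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted occupancy probabilities
    and ground-truth occupancy; eps avoids log(0)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1 - target) * np.log(1 - pred))))

gt_occ   = np.array([1.0, 0.0, 1.0, 0.0])    # ground-truth occupancy labels
pred_occ = np.array([0.9, 0.1, 0.8, 0.2])    # decoder output (after sigmoid)

loss = bce(pred_occ, gt_occ)
```

Minimizing this loss during pre-training is what bakes the geometric prior into the decoder weights that are later kept frozen.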

4.4 Color representation

We mainly focus on scene geometry, but we also encode color information, which allows us to render RGB images and provides an additional signal for tracking. To encode the colors in the scene, we apply another feature grid $\psi_\omega$ and decoder $g_\omega$:
$$c_p = g_\omega\big(p, \psi_\omega(p)\big)$$
where $\omega$ denotes the learnable parameters during optimization. Unlike the geometry, which has strong prior knowledge, we empirically find that jointly optimizing the color features $\psi_\omega$ and the decoder $g_\omega$ improves tracking performance. Note that, as in iMAP [47], this can cause forgetting problems, and the colors are only locally consistent. If we want to visualize the colors of the entire scene, they can be optimized globally as a post-processing step.

4.5 Network Design

All MLP decoders use a hidden feature dimension of 32 and 5 fully-connected blocks. Except for the coarse-level geometric representation, we apply a learnable Gaussian positional encoding [47, 50] to $p$ before it enters the MLP decoders, which allows high-frequency details of geometry and appearance to be discovered.
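A sketch of such a Gaussian positional encoding (the frequency count and scale below are illustrative; in NICE-SLAM the projection matrix is itself a learnable parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FREQS = 16
B = rng.normal(scale=10.0, size=(NUM_FREQS, 3))   # learnable in the real system

def gaussian_pe(p):
    """gamma(p) = [sin(2*pi*B p), cos(2*pi*B p)] for a 3-D point p."""
    proj = 2.0 * np.pi * (B @ p)
    return np.concatenate([np.sin(proj), np.cos(proj)])

enc = gaussian_pe(np.array([0.2, 0.5, 0.8]))      # 2 * NUM_FREQS values
```

Mapping the raw coordinate through many sinusoids of random frequencies lets a small MLP fit high-frequency variation that it could not represent from $p$ alone.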

5. Depth and Color Rendering

Inspired by the recent success of volume rendering in NeRF, we propose to render depth and color with a differentiable rendering process that integrates the predicted occupancies and colors from the scene representation of Section 3.1.
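Under an occupancy-based formulation, per-ray rendering can be sketched as follows (the sample values are made up for illustration): each sample's weight is its occupancy times the probability that the ray survived all earlier samples, and depth and color are the weighted sums.

```python
import numpy as np

def render_ray(occ, depth, color):
    """Render one ray from per-sample occupancies, depths and colors."""
    # Probability the ray reaches sample i: prod_{j<i} (1 - o_j).
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - occ[:-1]]))
    w = occ * transmittance            # w_i = o_i * prod_{j<i} (1 - o_j)
    return float(w @ depth), w @ color

occ   = np.array([0.0, 0.1, 0.8, 0.9])          # per-sample occupancies
depth = np.array([0.5, 1.0, 1.5, 2.0])          # sample depths along the ray
color = np.array([[0.2, 0.2, 0.2],
                  [0.9, 0.1, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.7, 0.2, 0.1]])             # per-sample RGB

d_hat, c_hat = render_ray(occ, depth, color)
```

Because every step is differentiable, the rendered `d_hat` and `c_hat` can be compared against the sensor's depth and color to backpropagate into both the feature grids and the camera pose.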

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/128708926