SLAM methods based on neural radiance fields (NeRF)

Since the emergence of NeRF [1] in 2020, neural radiance field (Neural Radiance Fields) methods have sprung up like mushrooms after rain. NeRF was originally designed for image rendering: given a camera viewpoint, render the image seen from that viewpoint. NeRF assumes the camera poses are already known, but in most robotics applications the camera pose is unknown. Consequently, more and more work has applied NeRF techniques to estimating camera poses and modeling the environment at the same time, i.e., NeRF-based SLAM (Simultaneous Localization and Mapping).

Integrating deep learning with traditional geometry is the development trend of SLAM. In the past we have seen individual SLAM modules replaced by neural networks, such as feature extraction (SuperPoint), feature matching (SuperGlue), loop closure (NetVLAD), and depth estimation (monocular depth networks). Compared with these single-module replacements, NeRF-based methods are a brand-new framework that can replace traditional SLAM end-to-end, both in design methodology and in implementation architecture.

Compared with traditional SLAM, the advantages of NeRF-based methods are:

  • There is no feature extraction; the method operates directly on raw pixel values. The error propagates back to the pixels themselves, information is transmitted more directly, and the optimization is what-you-see-is-what-you-get (a minimal sketch of this photometric loss follows the list).

  • Both implicit and explicit map representations are differentiable, so fully dense optimization of the map is possible (traditional SLAM is essentially unable to optimize dense maps, and usually can only optimize a limited number of feature points or overwrite-and-update the map).
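
As a minimal sketch of the photometric-loss point above: the objective is just the difference between rendered and observed pixel values, so gradients reach every map parameter that influenced those pixels. The `render_pixels` helper here is a hypothetical stand-in for a differentiable renderer, not any specific library API.

```python
def photometric_loss(render_pixels, scene_params, rays, pixels_gt):
    """render_pixels is an assumed differentiable renderer (hypothetical)."""
    pred = render_pixels(scene_params, rays)   # rendered colors for sampled rays
    return (pred - pixels_gt).square().mean()  # error defined on raw pixels
```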

It can be seen that NeRF-based methods have a very high ceiling and allow very fine-grained optimization of the map. However, the shortcomings are also obvious:

  • The computational overhead is large, optimization takes a long time, and real-time operation is difficult.

However, the lack of real-time performance is only a temporary problem; plenty of future work will address the real-time problem of NeRF-based SLAM.

Timeline of several NeRF-based SLAM works:

1. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

First, let's review the classic NeRF. NeRF takes a set of images whose poses are known. Points are sampled along pixel rays, dozens per ray, each forming a 5D input (x, y, z, theta, phi) fed into an MLP network F_theta. The network predicts the RGB and density (sigma) of each sample point. Volume rendering then integrates the samples along the ray to obtain the pixel's RGB value; the loss against the ground-truth pixel is computed, and the network F_theta is trained by gradient backpropagation. The optimized variables of this method are the MLP network parameters F_theta, i.e., the scene representation is implicit in the network. The camera poses are not optimized or adjusted.
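
A minimal PyTorch sketch of the volume-rendering step described above, assuming an `mlp` that maps sample positions and view directions to (rgb, sigma); the sampling bounds and counts are illustrative:

```python
import torch

def render_ray(mlp, ray_o, ray_d, n_samples=64, near=2.0, far=6.0):
    """Composite a pixel color from MLP-predicted (rgb, sigma) along one ray."""
    t = torch.linspace(near, far, n_samples)        # sample depths along the ray
    pts = ray_o + t[:, None] * ray_d                # (n_samples, 3) sample points
    dirs = ray_d.expand(n_samples, 3)               # same view direction per sample
    rgb, sigma = mlp(pts, dirs)                     # assumed: (n,3) colors, (n,) densities
    delta = torch.cat([t[1:] - t[:-1], t.new_ones(1)])  # inter-sample distances
    alpha = 1.0 - torch.exp(-sigma * delta)         # per-sample opacity
    # transmittance: probability the ray reaches each sample unoccluded
    trans = torch.cumprod(torch.cat([t.new_ones(1), 1.0 - alpha + 1e-10]), 0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)      # integrated pixel RGB
```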

2. iNeRF: Inverting Neural Radiance Fields for Pose Estimation

iNeRF is the first work to propose using a NeRF model for pose estimation. iNeRF relies on a NeRF model F_theta that has been built in advance. Therefore, iNeRF is not SLAM per se, but relocalization within an existing model. The difference from NeRF: NeRF fixes the poses and optimizes the model, propagating the loss back to F_theta; iNeRF fixes the model and optimizes the pose, propagating the loss back to the pose.
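
A sketch of iNeRF's inversion, under the assumptions that the pose is parameterized as a single learnable tensor and that a differentiable `render_image(nerf, pose)` helper exists (both are illustrative, not the paper's code):

```python
import torch

def estimate_pose(nerf, render_image, target_image, init_pose, steps=300, lr=1e-2):
    for p in nerf.parameters():
        p.requires_grad_(False)                    # F_theta stays fixed
    pose = init_pose.clone().requires_grad_(True)  # the pose is the only variable
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        loss = (render_image(nerf, pose) - target_image).square().mean()
        opt.zero_grad()
        loss.backward()                            # loss flows back to the pose
        opt.step()
    return pose.detach()
```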

3. BARF: Bundle-Adjusting Neural Radiance Fields

BARF optimizes the network model and the camera poses at the same time, implementing bundle adjustment with a neural rendering network. Strictly speaking, this method solves an SfM (structure from motion) problem. It relies on rough initial camera poses, which can be obtained through tools such as COLMAP; the model and camera poses are then refined through network iterations. If temporal ordering and inter-frame tracking were introduced, this would make a nice SLAM system.
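
A sketch of the joint refinement BARF performs, again assuming a hypothetical differentiable `render_image`; the point is that both parameter groups receive gradients from the same photometric loss (BARF additionally anneals the positional encoding, which is omitted here):

```python
import torch

def bundle_adjust(nerf, render_image, images, init_poses, steps=1000):
    poses = init_poses.clone().requires_grad_(True)   # rough poses, e.g. from COLMAP
    opt = torch.optim.Adam([
        {"params": nerf.parameters(), "lr": 1e-3},    # scene model
        {"params": [poses], "lr": 1e-4},              # all camera poses
    ])
    for _ in range(steps):
        i = int(torch.randint(len(images), (1,)))     # random training view
        loss = (render_image(nerf, poses[i]) - images[i]).square().mean()
        opt.zero_grad(); loss.backward(); opt.step()  # model and poses refined jointly
    return nerf, poses.detach()
```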

4. iMAP: Implicit Mapping and Positioning in Real-Time

iMAP is the first true NeRF-based SLAM work. iMAP takes RGB-D images and runs two threads: Tracking and Mapping. The Tracking thread uses the current model F_theta to optimize the current camera pose; the system then decides whether the frame is a keyframe, and if it is, the keyframe's pose is optimized jointly with the model F_theta. The framework of iMAP is similar to traditional SLAM, but the core tracking and joint optimization are carried out by neural-network optimization. Unfortunately, iMAP is not open source, but the good news is that the follow-up work NICE-SLAM open-sourced its implementation of iMAP.
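
A single-threaded sketch of the tracking/mapping split (iMAP runs these in parallel threads); `frame_loss` stands in for iMAP's photometric and depth losses and is a hypothetical helper:

```python
import torch

def track(model, frame_loss, frame, pose):
    """Tracking: model fixed, only the current pose is optimized."""
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=1e-3)
    for _ in range(20):
        loss = frame_loss(model, frame, pose)
        opt.zero_grad(); loss.backward(); opt.step()
    return pose.detach()

def map_update(model, frame_loss, keyframes, kf_poses):
    """Mapping: the model and all keyframe poses are optimized jointly."""
    kf_poses = kf_poses.clone().requires_grad_(True)
    opt = torch.optim.Adam([{"params": model.parameters()},
                            {"params": [kf_poses]}], lr=1e-3)
    for _ in range(10):
        loss = sum(frame_loss(model, kf, kf_poses[i])
                   for i, kf in enumerate(keyframes))
        opt.zero_grad(); loss.backward(); opt.step()
    return kf_poses.detach()
```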

5. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM

NICE-SLAM builds on iMAP. The authors not only open-sourced their own work but also open-sourced an implementation of iMAP. The main change is the use of feature grids plus an MLP, an explicit+implicit hybrid representation of the environment: environmental information is stored in voxel feature grids, and the MLP acts as a decoder that decodes the information in the feature grids into occupancy and RGB. The authors also adopt a coarse-to-fine idea, dividing the feature grids into coarse, mid, and fine levels for more detailed representation. This method is 2-3 times faster than iMAP. Although it achieves a certain degree of real-time performance, it is still far from real time in practical use. This is the best and most complete NeRF-based SLAM work seen so far.
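
A sketch of the explicit+implicit idea: features live in a learnable voxel grid and a tiny MLP decodes them into occupancy and color. The grid resolution, feature size, and decoder shape are illustrative assumptions, and only a single level is shown (NICE-SLAM uses a coarse/mid/fine hierarchy):

```python
import torch
import torch.nn.functional as F

class FeatureGridField(torch.nn.Module):
    def __init__(self, feat_dim=32, res=64):
        super().__init__()
        # explicit part: learnable 3D grid of feature vectors
        self.grid = torch.nn.Parameter(torch.zeros(1, feat_dim, res, res, res))
        # implicit part: small MLP decoding a feature into (occupancy, rgb)
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 4))

    def forward(self, pts):                        # pts in [-1, 1]^3, shape (N, 3)
        coords = pts.view(1, -1, 1, 1, 3)          # layout grid_sample expects
        feat = F.grid_sample(self.grid, coords, align_corners=True)
        feat = feat.view(self.grid.shape[1], -1).t()        # (N, feat_dim)
        out = self.decoder(feat)                            # trilinear feat -> values
        return out[:, :1], torch.sigmoid(out[:, 1:])        # occupancy, rgb
```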

6. PlenOctrees for Real-time Rendering of Neural Radiance Fields

PlenOctrees is a method for accelerating rendering. The idea is to first train the implicit MLP representation, then query the network over all points in space and all viewing directions, and cache the results. At render time no online network inference is needed; a table lookup suffices, which greatly speeds up rendering. However, since the network input has five degrees of freedom (x, y, z, theta, phi), exhaustive enumeration explodes. The authors therefore restructured the network to decouple the viewing direction (theta, phi) from the network input: the network takes only (x, y, z) and outputs density and spherical harmonic coefficients. Color is obtained by evaluating the spherical harmonic basis at the viewing direction and weighting it with the coefficients. In this way the degrees of freedom of the network input drop from 5 to 3, and exhaustive caching becomes feasible.
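
A sketch of the decoupling, using degree-1 real spherical harmonics for brevity (PlenOctrees uses higher degrees); the constants are the standard real SH basis factors:

```python
import torch

def sh_color(sh_coeffs, view_dir):
    """sh_coeffs: (3, 4) coefficients per RGB channel; view_dir: unit (3,) vector."""
    x, y, z = view_dir
    basis = torch.stack([
        torch.tensor(0.28209479),   # l=0: constant term
        0.48860251 * y,             # l=1 terms are linear in the direction
        0.48860251 * z,
        0.48860251 * x,
    ])
    # color = sigmoid( sum_k coeff_k * basis_k(direction) ), per channel
    return torch.sigmoid(sh_coeffs @ basis)
```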

7. SNeRG: Baking Neural Radiance Fields for Real-Time View Synthesis

SNeRG is similar to PlenOctrees; both are methods for accelerating rendering. After the MLP is trained, the view-independent information is stored in a 3D voxel grid. The authors split color into a diffuse (view-independent) color and a specular color. The network takes a 3D coordinate and outputs voxel density, the diffuse color, and a specular-color feature vector. The specular feature vector is decoded into a specular color by a small network given the viewing direction, and added into the final color. As in PlenOctrees, the backbone MLP is decoupled from the viewing direction: PlenOctrees recovers view-dependent color through spherical harmonics, while SNeRG recovers the specular color through a small network and superimposes it on the diffuse color.
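
A sketch of the deferred specular step, with illustrative module sizes: the diffuse color and specular feature would come from the baked grid, and only this small decoder plus the view direction run per pixel:

```python
import torch

class SpecularDecoder(torch.nn.Module):
    def __init__(self, feat_dim=4):
        super().__init__()
        self.net = torch.nn.Sequential(            # tiny: evaluated once per pixel
            torch.nn.Linear(feat_dim + 3, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 3))

    def forward(self, diffuse_rgb, spec_feat, view_dir):
        # the diffuse part is view-independent, read straight from the baked grid;
        # only the specular residual needs the network and the view direction
        spec = torch.sigmoid(self.net(torch.cat([spec_feat, view_dir], dim=-1)))
        return diffuse_rgb + spec
```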

8. DVGO: Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction

DVGO proposes a method for accelerating network training. The authors observe that implicit representations such as MLPs train slowly but give good results, while explicit representations such as voxel grids train fast but give poor results. DVGO therefore proposes a hybrid voxel-grid representation: for density, the voxel grid is used directly, and interpolation yields the density at any position; for color, multi-dimensional feature vectors are stored in the voxel grid, interpolated first, and then decoded into RGB values by an MLP. This reduces the number of MLP queries during training, and since the MLP only decodes colors it can be very lightweight, so training speed improves dramatically.
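
A sketch of the hybrid: density is read from a grid by interpolation alone (no network), while color features are interpolated and then decoded by a very shallow MLP. Resolutions and layer sizes are illustrative:

```python
import torch
import torch.nn.functional as F

class HybridVoxelField(torch.nn.Module):
    def __init__(self, res=128, feat_dim=12):
        super().__init__()
        self.density = torch.nn.Parameter(torch.zeros(1, 1, res, res, res))
        self.color_feat = torch.nn.Parameter(torch.zeros(1, feat_dim, res, res, res))
        self.rgb_mlp = torch.nn.Sequential(          # lightweight: decodes color only
            torch.nn.Linear(feat_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 3))

    def forward(self, pts):                          # pts in [-1, 1]^3, shape (N, 3)
        g = pts.view(1, -1, 1, 1, 3)
        # density: pure interpolation, no network involved
        sigma = F.softplus(F.grid_sample(self.density, g, align_corners=True).view(-1))
        feat = F.grid_sample(self.color_feat, g, align_corners=True)
        feat = feat.view(self.color_feat.shape[1], -1).t()
        return torch.sigmoid(self.rgb_mlp(feat)), sigma   # (N, 3) rgb, (N,) density
```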

9. Plenoxels: Radiance Fields without Neural Networks

Plenoxels is the follow-up to PlenOctrees. The authors replace the MLP with an explicit grid that stores a scalar density and spherical harmonic coefficients. When a ray passes through the grid, the density and spherical harmonic coefficients at the sample points along the ray are obtained by trilinear interpolation. The whole pipeline thus sheds its dependence on neural networks and becomes a fully explicit representation. With the MLP removed, training speed increases greatly. The authors emphasize that the key to neural radiance fields lies not in the neural network but in the differentiable rendering process.
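
A sketch of the fully explicit lookup: one trilinear interpolation, no network anywhere; the grid layout is an illustrative assumption:

```python
import torch

def trilinear(grid, p):
    """grid: (R, R, R, C) stored values; p: continuous index in [0, R-1)^3."""
    p0 = p.floor().long()                 # lower corner of the enclosing cell
    f = p - p0.float()                    # fractional offset inside the cell
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))   # weight of this corner
                out = out + w * grid[p0[0] + dx, p0[1] + dy, p0[2] + dz]
    return out                            # (C,) e.g. density + SH coefficients
```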

10. NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields

The first 3D scene reconstruction algorithm that combines dense monocular SLAM with hierarchical volumetric neural radiance fields, achieving accurate radiance field construction from image sequences in real time without requiring pose or depth input.

The core idea is to use a dense monocular SLAM pipeline to estimate camera poses and dense depth maps together with their uncertainties, and to use this information as the supervision signal for training the NeRF scene representation.
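
A sketch of that supervision, with the weighting simplified for illustration (NeRF-SLAM's exact formulation differs): depth residuals are divided by the SLAM depth variance, so unreliable depths contribute less:

```python
def mapping_loss(rgb_pred, depth_pred, rgb_gt, depth_slam, depth_var, lambda_d=1.0):
    color_loss = (rgb_pred - rgb_gt).square().mean()
    # uncertainty-weighted depth term: high-variance pixels are down-weighted
    depth_loss = ((depth_pred - depth_slam).square() / (depth_var + 1e-8)).mean()
    return color_loss + lambda_d * depth_loss
```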

References

Summary of 9 SLAM methods based on neural radiance fields (NeRF)

[NeRF paper-reading series] NeRF-SLAM: a real-time dense monocular SLAM system based on a neural radiance field representation - Zhihu
