Neural Radiance Fields (NeRF), Real-Time NeRF Baking, Signed Distance Fields (SDF), Occupancy Networks, and NeRF for Autonomous Driving

1 Principle of NeRF

NeRF (Neural Radiance Fields) received a Best Paper Honorable Mention at ECCV 2020. It pushed implicit representation to a new level: using only posed 2D images from different viewpoints as supervision, it can render complex 3D scenes. It made waves across the field. Since then, NeRF has developed rapidly and been applied to directions such as novel view synthesis and 3D reconstruction with very good results, and its influence has been enormous.
NeRF is a deep learning model for implicit 3D scene modeling; the model itself is a fully connected neural network (an MLP, multilayer perceptron). The task NeRF addresses is novel view synthesis, defined as follows: given a series of captures of a scene from known viewpoints (the captured images plus the intrinsic and extrinsic camera parameters of each image), synthesize the image seen from a new viewpoint using only the images and their camera parameters, without an intermediate 3D reconstruction step. The input is a sparse set of posed multi-view images; training yields a neural radiance field model, from which sharp images can be rendered from any viewing angle.

It can also be summarized briefly as using an MLP to implicitly learn a 3D scene. Under the NeRF representation, 3D space is represented as a learnable, continuous radiance field that learns from the input images + poses and outputs color + density.
Positional encoding: standard MLPs are poor at learning high-frequency signals, yet color and texture are high-frequency. If the MLP is trained on raw coordinates, the learned textures come out quite blurry. Positional encoding is therefore introduced so that the MLP can learn high- and low-frequency information at the same time, which improves sharpness. Note that the positional encoding (PE) here serves a different purpose from the PE in the Transformer: it is applied to continuous coordinates rather than to token positions.
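A minimal sketch of this frequency encoding, assuming the form used in the NeRF paper, where each coordinate p is mapped to sin(2^k π p) and cos(2^k π p) for k = 0 ... L-1. The function name `positional_encoding` and the array layout are illustrative choices, not the authors' code; L = 10 is the value the paper reports for positions.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Encode each coordinate with sines and cosines at num_freqs octaves."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi           # 2^k * pi, shape (L,)
    angles = x[..., None] * freqs                           # shape (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, L, 2)
    return enc.reshape(*x.shape[:-1], -1)                   # flatten to (..., D * L * 2)

points = np.array([[0.1, -0.4, 0.7]])        # one 3D point
encoded = positional_encoding(points, num_freqs=10)
print(encoded.shape)                         # (1, 60)
```

Each 3D point becomes a 60-dimensional vector (3 coordinates × 10 frequencies × 2 functions), which gives the MLP direct access to high-frequency variation in its input.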

A very important concept in this process is volume rendering with radiance fields: rays are traced through the scene and integrated; the MLP maps the samples along each ray to color and density values, so the representation of the 3D scene is stored in the MLP weights; and after training on many images with known poses, the model can generate an image.

3D rendering physics equation:

$$L_o(x,d) = L_e(x,d) + \int_\Omega f_r(x,d,\omega_i)\,L_i(x,\omega_i)\cos\theta\,d\omega_i$$

where $x$ is the 3D point under consideration and $d$ is the direction of the outgoing light. The formula has two parts:

The first part, $L_e(x,d)$, is the radiance that $x$ itself emits in direction $d$ when $x$ is a light source.
The second part is the radiance that arrives at $x$ from other directions and is scattered toward $d$. Here $f_r(x,d,\omega_i)$ is the scattering function, $L_i(x,\omega_i)$ is the radiance received from direction $\omega_i$, and $\theta$ is the angle between $\omega_i$ and the surface normal at $x$.

Why mention this? Because the color the human eye perceives comes from radiance, the quantity a neural radiance field models. The eye receives light; light is electromagnetic radiation, that is, an oscillating electromagnetic field with a wavelength and a frequency, and its color is determined by the frequency. Recall from junior high school physics that most of the spectrum is invisible: the narrow band visible to the human eye is called the visible spectrum, and the frequencies in it are what we perceive as color.
Therefore, modeling radiance can be treated as modeling the corresponding color. NeRF is an MLP that approximately solves the rendering equation above: by modeling radiance, it models the color of a 3D scene. This is how NeRF works. Under the NeRF representation, the 3D scene is represented as a learnable, continuous radiance field.

NeRF input and output:
Given a set of continuously captured images + poses, NeRF takes the 3D coordinates (x, y, z) of a point on a ray together with the viewing direction as input, and outputs the density (geometry) + color at that point. There are five input variables in total, hence the name "5D radiance field". Specifically, the input is the spatial point (x, y, z) and the viewing direction ($d_x$, $d_y$, $d_z$); because the direction is a unit vector, only two of its components are independent and the third follows from the unit-norm constraint, not from a cross product. The network outputs the density of the point (which can be read as the probability that a ray terminates there) and the corresponding color (RGB value). The predicted colors are rendered into pixels, the loss is computed against the input image for the current pose, and optimization makes the model gradually converge.
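To make the 5D input concrete, here is a toy sketch of the mapping (position, direction) → (density, color). Everything in it is an assumption for illustration: the weights are random and untrained, the network has a single hidden layer, and the names `direction_from_angles` and `toy_radiance_field` are hypothetical. It only mirrors two structural points from the NeRF paper: the direction has two degrees of freedom, and density depends on position alone while color also depends on the viewing direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def direction_from_angles(theta, phi):
    """Unit viewing direction from two angles (spherical coordinates)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# Hypothetical toy weights (random, untrained); a real NeRF MLP is much deeper.
W_pos = rng.normal(size=(3, 32))       # position -> hidden features
w_sigma = rng.normal(size=32)          # hidden features -> density
W_rgb = rng.normal(size=(32 + 3, 3))   # hidden features + direction -> color

def toy_radiance_field(xyz, view_dir):
    """Return (density, rgb) for one 3D point seen from one direction."""
    hidden = np.maximum(0.0, xyz @ W_pos)                 # ReLU hidden layer
    sigma = np.maximum(0.0, hidden @ w_sigma)             # density >= 0, position only
    logits = np.concatenate([hidden, view_dir]) @ W_rgb   # color also sees the direction
    rgb = 1.0 / (1.0 + np.exp(-logits))                   # sigmoid keeps colors in [0, 1]
    return sigma, rgb

point = np.array([0.1, 0.2, 0.3])
view_dir = direction_from_angles(theta=0.3, phi=1.2)
sigma, rgb = toy_radiance_field(point, view_dir)
```

Changing `theta` and `phi` changes `rgb` but leaves `sigma` untouched, which is what lets the same geometry show view-dependent effects such as highlights.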

[Figure: density $\sigma$, opacity, and light transmittance $T$]

How the NeRF model is rendered:
NeRF uses classic volume rendering theory to turn color and density (the NeRF outputs) into pixel colors. The formula is given below (in practice its discretized form is used). It looks complicated, but it involves only three quantities: the accumulated transmittance $T(x)$, the volume density $\sigma(x)$, and the color $c(x)$. For a ray $r(t) = o + td$ between the near and far bounds $t_n$ and $t_f$:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$$

(1) The volume density $\sigma(x)$ reflects the density of particles at a given 3D coordinate along the ray.
(2) The color $c(x)$ is the color reflected by the particle at that 3D coordinate when viewed along the ray direction.
(3) The accumulated transmittance $T(x)$ integrates the volume density along the path of the ray. It decreases as the ray travels deeper: the transparency keeps declining, that is, the probability that the ray has not yet hit any particle keeps decreasing.
Based on this, the physical meaning of the volume rendering equation is easy to picture: because each sample is weighted by the transmittance in front of it, the equation handles occlusion, and it likewise handles the unbounded problem.

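Discretization:
With $N$ samples along the ray at depths $t_i$ and spacing $\delta_i = t_{i+1} - t_i$, the integral is approximated by

$$\hat{C}(r) = \sum_{i=1}^{N} T_i\,\bigl(1 - \exp(-\sigma_i\delta_i)\bigr)\,c_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j\delta_j\right)$$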
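The discretized rendering step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard quadrature from the NeRF paper in which each sample gets opacity 1 - exp(-σ·δ) and is weighted by the transmittance accumulated in front of it; the function name `composite_ray` and the toy scene are made up for the example.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Discrete volume rendering of one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB values, t_vals: (N,) sample depths.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    t_vals = np.asarray(t_vals, dtype=float)
    # Distance between adjacent samples; the last interval is treated as very long.
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-sigmas * deltas)               # opacity of each segment
    # T_i: probability that the ray reaches sample i without being stopped earlier.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return weights @ colors, weights

# Toy ray: two empty samples, a dense red sample, then dense blue samples behind it.
t_vals = np.linspace(2.0, 6.0, 5)
sigmas = [0.0, 0.0, 50.0, 50.0, 50.0]
colors = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
rgb, weights = composite_ray(sigmas, colors, t_vals)
# rgb is approximately [1, 0, 0]: the red surface hides the blue one behind it.
```

The `weights` array is also what hierarchical sampling reuses: it tells which depth intervals actually contribute to the pixel.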
Hierarchical volume sampling:
Some background first. Evaluating the volume rendering integral directly requires choosing where to sample along each ray. Sampling the whole ray densely costs too much computation, while a fixed budget of uniformly spaced samples leaves the points in each interval sparse, so plain uniform sampling is inefficient. Choosing appropriate near and far bounds also matters: if the sampled interval is too short, there are too few sample points where they are needed, which hurts training. From the volume rendering equation, a sensible strategy is to avoid spending samples on empty space and on occluded regions, because these contribute little to the final color.

So how can sampling be made efficient? NeRF trains two networks at the same time, called the coarse and fine networks. The points fed to the coarse network are obtained by sampling uniformly along the ray. The volume densities predicted by the coarse network give an estimate of how the ray's contribution is distributed along its length; a second round of importance sampling is then drawn from this estimated distribution, and all sample points are fed to the fine network for prediction.
Inverse transform sampling draws uniform samples on the range of the CDF of a distribution p and maps them back through the inverse CDF; the result follows the same distribution as samples drawn from p directly. So when sampling from the distribution directly is difficult, inverse transform sampling simplifies the problem.
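A minimal sketch of the second, importance-sampling round, assuming the coarse pass has already produced one weight per depth bin. The function name `sample_pdf`, the small constant that keeps the PDF non-zero, and the toy weights are illustrative assumptions; the logic is plain inverse transform sampling on a piecewise-constant PDF.

```python
import numpy as np

def sample_pdf(bin_edges, weights, num_samples, rng):
    """Draw depths from a piecewise-constant PDF defined by per-bin weights."""
    bin_edges = np.asarray(bin_edges, dtype=float)        # (N + 1,)
    weights = np.asarray(weights, dtype=float) + 1e-5     # (N,), avoid an all-zero PDF
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])         # (N + 1,), from 0 to 1
    u = rng.uniform(size=num_samples)                     # uniform samples on the CDF range
    idx = np.searchsorted(cdf, u, side="right") - 1       # bin that contains each u
    idx = np.clip(idx, 0, len(pdf) - 1)
    # Invert the CDF by linear interpolation inside the selected bin.
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
bin_edges = np.linspace(2.0, 6.0, 9)                      # 8 depth bins along the ray
coarse_weights = [0, 0, 0, 0, 1, 0, 0, 0]                 # coarse pass found a surface in [4.0, 4.5]
fine_depths = sample_pdf(bin_edges, coarse_weights, 1000, rng)
```

Almost all of the 1000 fine samples land in the bin between depths 4.0 and 4.5, so the fine network spends its evaluations near the surface instead of in empty or occluded space.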

NeRF's main disadvantage: it is slow!

2 NeRF acceleration

NeRF training is very time-consuming, so how to speed it up is a question worth exploring.

Research progress on accelerating NeRF (NeRF baking):

Plenoxels

Plenoxels is a voxel-based NeRF. Its finding is that the secret of NeRF's success is the volume rendering equation, which has little to do with the neural network that takes most of the time. So instead of representing the 3D scene with an MLP, the scene can be represented with voxels (as in Plenoxels) to speed up training. No neural network is needed: the same quality is reached with gradient descent and regularization alone, about 100 times faster.
An MLP is an implicit representation and a voxel grid is an explicit one; a voxel grid is therefore a tensor, and a tensor can be decomposed.
Plenoxels first reconstructs a sparse voxel grid that stores an opacity and spherical harmonic (SH) coefficients at each occupied voxel. The color information is stored in the SH coefficients: each color channel takes 9 coefficients and there are three channels, so each voxel needs 27 SH coefficients for its color.
The color and opacity at each point along a camera ray are computed by trilinear interpolation of the 8 nearest voxels.
The resulting colors and opacities are then rendered with volume rendering, just as in NeRF.
Plenoxels optimizes the voxel opacities and SH coefficients by minimizing the mean squared error (MSE) of the rendered pixels, and uses total variation (TV) regularization to help remove noise.
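The trilinear interpolation step can be sketched as follows. This is an illustration only, assuming a dense NumPy grid with 28 channels per voxel (1 opacity + 27 SH coefficients, as described above) instead of the sparse data structure a real implementation would use; `trilinear_interpolate` is a hypothetical helper name.

```python
import numpy as np

def trilinear_interpolate(grid, point):
    """Interpolate a (X, Y, Z, C) grid at a continuous point given in voxel units."""
    p = np.asarray(point, dtype=float)
    max_idx = np.array(grid.shape[:3]) - 1
    p0 = np.clip(np.floor(p).astype(int), 0, max_idx - 1)   # lower corner of the cell
    f = p - p0                                               # fractional offset in the cell
    result = np.zeros(grid.shape[3])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of a corner: product of per-axis distances to the opposite side.
                w = ((f[0] if dx else 1.0 - f[0])
                     * (f[1] if dy else 1.0 - f[1])
                     * (f[2] if dz else 1.0 - f[2]))
                result += w * grid[p0[0] + dx, p0[1] + dy, p0[2] + dz]
    return result

rng = np.random.default_rng(0)
voxels = rng.normal(size=(4, 4, 4, 28))      # toy grid: 1 opacity + 27 SH coefficients
features = trilinear_interpolate(voxels, [1.5, 0.25, 2.75])
opacity, sh_coeffs = features[0], features[1:].reshape(3, 9)
```

Because the interpolation weights are smooth functions of the sample position and linear in the stored values, gradients of the rendering loss flow straight back into the 8 surrounding voxels.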

KiloNeRF

KiloNeRF decomposes the scene spatially and uses thousands of tiny MLPs instead of a single large MLP, which leads to significant speedups.
Each individual MLP only needs to represent a part of the scene, so smaller and faster-evaluating MLPs can be used. By combining this divide-and-conquer strategy with further optimizations, the rendering speed is increased by two orders of magnitude compared to the original NeRF model without incurring high storage costs. With teacher-student distillation during training, this speedup can be achieved without sacrificing visual quality.
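The divide-and-conquer idea can be sketched as a regular grid of independent tiny networks, with each query point routed to the network of the cell it falls in. The grid resolution, layer sizes, random weights, and function names below are assumptions for illustration; the real method adds distillation from a teacher NeRF and further rendering optimizations that are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = 4       # toy resolution: 4**3 = 64 tiny networks (hypothetical value)
HIDDEN = 8     # hidden width of each tiny network (hypothetical value)

# One independent set of weights per grid cell: xyz -> hidden -> (sigma, r, g, b).
W1 = rng.normal(size=(GRID**3, 3, HIDDEN))
W2 = rng.normal(size=(GRID**3, HIDDEN, 4))

def cell_index(points):
    """Flat index of the grid cell that contains each point of the unit cube."""
    ijk = np.clip((points * GRID).astype(int), 0, GRID - 1)
    return ijk[:, 0] * GRID**2 + ijk[:, 1] * GRID + ijk[:, 2]

def query(points):
    """Evaluate every point with the tiny network that owns its cell."""
    idx = cell_index(points)
    hidden = np.maximum(0.0, np.einsum("nd,ndh->nh", points, W1[idx]))
    return np.einsum("nh,nho->no", hidden, W2[idx])

points = rng.uniform(size=(5, 3))
outputs = query(points)          # shape (5, 4): raw density and color per point
```

Each point touches only one small weight matrix pair, so the cost per sample is a fraction of evaluating one large MLP.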

Instant NGP

A multi-resolution hash encoding, a structure with learnable parameters, replaces the trigonometric frequency encoding used in NeRF, so the model achieves equal or better results with a smaller MLP. The smaller model, the efficient parallelization of the multi-resolution encoding, and a native CUDA implementation compress NeRF training time from hours to minutes or even seconds.
[Figure: overview of the Instant NGP encoding pipeline]
The NGP pipeline in brief:
Step 1: convert the coordinate (the real-valued xyz) into indices in the hash table. Blue and pink in the figure mark computations at different levels, whose grids have different resolutions (the pink grid cells are smaller, so the pink level has a higher resolution than the blue one).
Step 2: at each level, look up the values of the eight grid corners around the query point in that level's hash table and interpolate them trilinearly.
Step 3: concatenate the results of all levels; this completes the encoding.
Step 4: feed the encoding to the neural network.
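The first three steps can be sketched in NumPy as below. This is a simplified illustration, not the CUDA implementation: it assumes the spatial hash from the Instant NGP paper (XOR of each integer coordinate multiplied by a fixed prime, modulo the table size), hashes at every level, and uses made-up resolutions and table sizes. The names `hash_corners` and `encode` are hypothetical.

```python
import numpy as np

# Per-axis multipliers of the spatial hash (values given in the Instant NGP paper).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)
OFFSETS = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])

def hash_corners(corners, table_size):
    """Step 1: map integer grid corners (..., 3) to hash table indices."""
    c = corners.astype(np.uint64) * PRIMES
    hashed = (c[..., 0] ^ c[..., 1] ^ c[..., 2]) % np.uint64(table_size)
    return hashed.astype(np.int64)

def encode(point, tables, resolutions):
    """Encode a point in [0, 1]^3 with one hash table of features per level."""
    feats = []
    for table, res in zip(tables, resolutions):
        scaled = np.asarray(point, dtype=float) * res
        base = np.floor(scaled).astype(np.int64)
        frac = scaled - base
        idx = hash_corners(base + OFFSETS, len(table))                 # 8 corner indices
        w = np.prod(np.where(OFFSETS == 1, frac, 1.0 - frac), axis=1)  # trilinear weights
        feats.append((w[:, None] * table[idx]).sum(axis=0))            # step 2: interpolate
    return np.concatenate(feats)                                       # step 3: concatenate

rng = np.random.default_rng(0)
resolutions = [16, 32, 64, 128]                                   # toy levels
tables = [rng.uniform(-1e-4, 1e-4, size=(2**14, 2)) for _ in resolutions]
features = encode([0.3, 0.6, 0.9], tables, resolutions)
print(features.shape)                                             # (8,)
```

In training, the table entries are the learnable parameters: the gradient of the loss with respect to `features` is distributed to the eight looked-up rows of each level according to the same trilinear weights.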

Historical background on input encoding:
Encoding the input data is a very common topic that appears in many fields, for example:

  • In machine learning, low-dimensional inputs are often mapped to a higher dimension so that complex data structures become linearly separable, as with one-hot encoding and the kernel trick.
  • In ViT, the input encoding is also indispensable. Its main role there is to tell the model where in the image the data currently being processed is located, which is essential for the attention mechanism.
  • In the original NeRF paper, the encoding is very similar to the one used in ViT: both are frequency encodings built from trigonometric functions. Here, however, the purpose is not to signal the sample position but to inject high-frequency information into the input so that the model learns the details of the samples better. In NGP, the encoding parameters are trained along with the network weights.

Hash collisions: two points may be indexed to the same table entry and conflict, but training sorts the conflict out. Although there are not enough keys for every grid vertex, most of space is empty, so the points that matter more dominate the gradient of a shared entry.

TensoRF

Compared with coordinate-based NeRF methods, TensoRF represents the radiance field as a voxel grid of features. Many earlier methods used voxel grids, but they need a lot of GPU memory to store the voxels, and the memory grows cubically with scene resolution; some methods also need to precompute the outputs of an MLP and distill them, which makes training excessively long.
TensoRF observes that a feature grid can be viewed as a 4D tensor: the first three dimensions are the spatial coordinates XYZ and the fourth is the feature channel. This makes it possible to apply classical tensor decomposition algorithms to radiance field modeling. Tensor decomposition reduces the dimensionality of high-dimensional data and compresses it, which reduces the memory needed for the model.

2D tensor decomposition: singular value decomposition (SVD)
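As a small illustration of the 2D case, the sketch below builds a low-rank matrix, factors it with `np.linalg.svd`, and compares the storage of the dense and factored forms. The matrix size and rank are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test data: a 64 x 64 matrix that has rank 4 by construction.
A = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
rank = 4
A_approx = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # keep the 4 largest singular values

dense_params = A.size                             # 64 * 64 = 4096 numbers
factored_params = rank * (64 + 64 + 1)            # 4 * 129 = 516 numbers
print(np.allclose(A, A_approx))                   # True
```

The factored form stores 516 numbers instead of 4096 and still reproduces the matrix; the 3D decompositions that follow extend the same idea to tensors.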
3D tensor decomposition:
CP decomposition (CANDECOMP/PARAFAC)

VM decomposition (vector-matrix)
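Both 3D decompositions can be written compactly with `np.einsum`. The sketch below is an assumption-laden illustration: the grid size N, the number of components R, and the random factors are arbitrary, and it only shows how a dense tensor is rebuilt from the factors. CP sums outer products of three vectors; VM, following the TensoRF paper, pairs a vector along one axis with a matrix over the other two axes, once per axis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 32, 4                     # toy grid size and number of components

# CP: T[i, j, k] = sum_r vx[r, i] * vy[r, j] * vz[r, k]
vx, vy, vz = (rng.normal(size=(R, N)) for _ in range(3))
T_cp = np.einsum("ri,rj,rk->ijk", vx, vy, vz)

# VM: each component pairs a vector along one axis with a matrix over the other two.
m_yz, m_xz, m_xy = (rng.normal(size=(R, N, N)) for _ in range(3))
T_vm = (np.einsum("ri,rjk->ijk", vx, m_yz)
        + np.einsum("rj,rik->ijk", vy, m_xz)
        + np.einsum("rk,rij->ijk", vz, m_xy))

dense_params = N**3                     # 32768 numbers for the full grid
cp_params = 3 * R * N                   # 384 numbers
vm_params = 3 * R * (N + N * N)         # 12672 numbers
```

CP is the most compact, while VM spends more parameters per component in exchange for more expressive components; both stay well below the cost of the dense grid.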

3 SDF + NeRF

For both 2D and 3D assets there are two kinds of storage: implicit and explicit. For example, a 3D model can be stored directly as a mesh, or represented with an SDF, a point cloud, or a neural network (neural rendering).

Mesh extraction: using a NeRF (neural radiance field) directly does not work well; a signed distance function (SDF) is a better choice.

The essence of an SDF is to store, for every point, the nearest distance to the shape's surface, which implicitly draws a surface for the model: points outside the surface have values > 0 and points inside have values < 0. SDFs (signed distance fields) are used in both 3D and 2D. In 3D, ray tracing is too expensive, so an SDF is often used as an implicit representation of objects and combined with ray marching to get an effect close to ray tracing; DeepSDF and similar methods also use it for implicit model representation. In 2D, SDFs are often used to represent fonts, and the shadow map for face rendering in Genshin Impact is also generated from an SDF.
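A minimal sketch of an analytic SDF and of ray marching against it (the sphere tracing variant, where each step advances by the SDF value, which is always a safe distance). The sphere, the step limits, and the function names are assumptions chosen for the example.

```python
import numpy as np

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: > 0 outside, < 0 inside, 0 on the surface."""
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(center)) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4, t_max=100.0):
    """March along the ray, stepping by the SDF value until the surface is reached."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        distance = sdf(origin + t * direction)
        if distance < eps:
            return t              # hit: depth along the ray
        t += distance             # the nearest surface is at least this far away
        if t > t_max:
            break
    return None                   # miss

hit = sphere_trace([0.0, 0.0, -3.0], [0.0, 0.0, 1.0], sphere_sdf)    # 2.0
miss = sphere_trace([0.0, 0.0, -3.0], [0.0, 1.0, 0.0], sphere_sdf)   # None
```

The ray aimed at the sphere needs only two SDF evaluations to find the surface at depth 2.0, which is why SDFs pair well with ray marching.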

DeepSDF: a learned, shape-conditioned classifier whose decision boundary is the shape surface itself (the zero level set of the SDF). The core idea is to sample points directly and regress their SDF values with an MLP, which is simple and brute-force.

Extracting a mesh from density gives poor results, so the idea is to introduce SDF into NeRF. The setting is clear: the input is a set of 2D RGB images and the output is an SDF as the 3D scene representation. Besides SDF, occupancy can also serve as the 3D scene representation: recent neural implicit methods for 3D reconstruction mainly use SDF or occupancy to represent the surface, as introduced in the next chapter.

How is SDF introduced?
First, recall how NeRF works: a set of rays passes through the MLP, which outputs a density field and a color field; a 2D image is rendered, the loss against the real image is computed, and the network is trained by backpropagation.
Because extracting a mesh from density works poorly, density is replaced with SDF. But how is SDF converted to density?
NeuS: the first step is to construct a rendering method from the 3D model to the image (the role played by rasterization in traditional graphics). The second step is to train the SDF network through volume rendering.
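One way to see the conversion is through the discrete opacity that the NeuS paper derives from consecutive SDF samples along a ray: α_i = max((Φ_s(f_i) - Φ_s(f_{i+1})) / Φ_s(f_i), 0), where f_i is the SDF at sample i and Φ_s is a sigmoid with sharpness s. The sketch below assumes that formula; s is fixed here although it is learned in the paper, and the function names and sample values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neus_alphas(sdf_vals, s):
    """Opacity of each interval between consecutive SDF samples along a ray."""
    cdf = sigmoid(s * np.asarray(sdf_vals, dtype=float))
    return np.clip((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-10), 0.0, 1.0)

def render_weights(alphas):
    """Same compositing as NeRF: transmittance in front times opacity."""
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return transmittance * alphas

# Toy ray that enters a surface: the SDF goes from +0.45 (outside) to -0.45 (inside).
sdf_vals = np.linspace(0.45, -0.45, 10)
alphas = neus_alphas(sdf_vals, s=50.0)
weights = render_weights(alphas)
peak = int(np.argmax(weights))       # interval where the SDF changes sign
```

The weights concentrate on the interval where the SDF crosses zero, so the rendered color is tied to the surface itself; this is what makes the mesh extracted from the zero level set cleaner than one extracted from a density threshold.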

[Figure: comparison of the NeRF and NeuS discretizations]

4 Occupancy + NeRF

Looking at the hardware solutions of autonomous driving companies in China and abroad, there are two major routes: (1) Tesla's pure vision solution, Occupancy + NeRF, and (2) other companies' multi-modal (radar, camera) fusion solutions. This article does not discuss the pros and cons of the two routes and only analyzes Tesla's Occupancy Network. The Occupancy Network detects general (arbitrary) obstacles with a pure vision solution, which is a major technical breakthrough and well worth studying.

BEV vs. volume occupancy:
The main difference is that the former is a 2D representation while the latter is a 3D representation. The second difference is the fixed rectangle. When designing a perception system, detection is often tied to a fixed output size, and a rectangle cannot represent unusually shaped vehicles or obstacles. If the system sees a truck, a 7x3 rectangle is placed on the feature map; if it sees a pedestrian, a 1x1 rectangle is used. The problem is that overhanging obstacles cannot be predicted this way: if a car carries a ladder on top, or a truck has a side trailer or an arm, a fixed rectangle may fail to cover the object. An Occupancy Network, by contrast, can represent these cases in detail.
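The difference between the two representations can be illustrated with a toy voxel grid. Everything here is made up for the example (grid size, the vehicle, the overhanging part); the point is only that collapsing the height axis into a BEV footprint loses the information that the space under an overhang is free.

```python
import numpy as np

# Hypothetical scene: voxel grid indexed (x, y, z) with z pointing up; True = occupied.
occ = np.zeros((8, 8, 4), dtype=bool)
occ[2:5, 3, 0:2] = True        # vehicle body near the ground
occ[5:7, 3, 2] = True          # overhanging part (for example a ladder) with free space below

bev = occ.any(axis=2)          # collapse the height axis: 2D BEV footprint

# BEV marks cell (5, 3) as occupied but cannot tell that the ground level there is free.
print(bev[5, 3], occ[5, 3, 0], occ[5, 3, 2])   # True False True
```

With the 3D grid, a planner can distinguish "blocked at 2 m height" from "blocked at ground level"; with the 2D footprint alone it cannot.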

Overall, Tesla's Occupancy Network appears to combine the ideas of occupancy networks and NeRF from 3D reconstruction and to extend them further. It most likely borrows from voxel-based NeRF: the occupancy output is used as the input to NeRF, and the NeRF loss propagates gradients back to the occupancy network for supervised training.

5 NeRF Application Introduction

Common applications

NeRF was first applied to novel view synthesis, and thanks to its strong ability to represent 3D information implicitly it developed rapidly toward 3D reconstruction. The main application directions of NeRF are introduced next.

Traditional methods in autonomous driving: the pipeline is complex, the results are poor, and the scene is sparse.
NeRF methods: the pipeline is simple, the results are good, and the scene is dense.

Metaverse/games:
Novel view synthesis
360-degree reconstruction
Large-scene reconstruction
Human body reconstruction
3D style transfer
Mirror-reflection scene reconstruction
...

Practical Challenges

Pose offset, lighting differences, dynamic objects, and resource consumption in large scenes.
(1) Camera pose: P, T, R, X, Y, Z
How can an accurate camera pose be computed? With a camera calibration board, SLAM, SfM, or NeRF itself.
Pose optimization: BARF
(2) Lighting differences
(3) Dynamic objects
(4) Resource consumption in large scenes such as streets

6 NeRF for autonomous driving

NeRF and autonomous driving intersect mainly in two areas: scene reconstruction and depth estimation.
[1] https://www.youtube.com/watch?v=KC8e0oTFUcw
[2] Plenoxels: Radiance Fields without Neural Networks, CVPR 2022, arXiv:2112.05131
[3] Grid-Centric Traffic Scenario Perception for Autonomous Driving: A Comprehensive Review, 2023, arXiv:2303.01212
[4] CLONeR: Camera-Lidar Fusion for Occupancy Grid-aided Neural Representations, 2022, arXiv:2209.01194
[5] Urban Radiance Fields, CVPR 2022, arXiv:2111.14643
[6] S-NeRF: Neural Radiance Fields for Street Views, ICLR 2023, arXiv:2303.00749
[7] Block-NeRF: Scalable Large Scene Neural View Synthesis, CVPR 2022, arXiv:2202.05263
[8] Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields, ICLR 2023, https://openreview.net/pdf?id=PQ2zoIZqvm
[9] SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields, 2023, arXiv:2212.02501
[10] Behind the Scenes: Density Fields for Single View Reconstruction, 2023, arXiv:2301.07668


Origin blog.csdn.net/weixin_54338498/article/details/130176241