NeRF and 3D Reconstruction Column (2): Interpreting the NeRF Paper and the Physical Model of Volume Rendering

Preface

In the previous chapter, we briefly introduced the background of 3D reconstruction, the difficulties of applying NeRF to 3D reconstruction, and the related datasets and evaluation metrics. This chapter interprets the original NeRF paper, part of its source code, and the physical model of volume rendering in detail, to help readers gain a better understanding of NeRF; in the next chapter, we will combine colmap to walk through part of the nerf_pl source code and explain the use of CUDA operators.


Introduction to the original NeRF paper

The original paper, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, was published at ECCV 2020.

One of the authors, Ben Mildenhall, graduated from UC Berkeley in 2020 and, as of April 2023, works at Google; he has since published Mip-NeRF, Block-NeRF, Mip-NeRF 360, DreamFusion, and many other landmark works.

In the previous chapter, we defined NeRF as a differentiable rendering method in order to emphasize its difference from geometric reconstruction algorithms; here we narrow the definition and call it a neural rendering method, which is different from traditional differentiable rendering.

Neural rendering uses a deep neural network to approximate scene parameters (such as the volume density in NeRF, or the feature vector output by the first MLP network in NeRF, which we call the neural representation) and learns the optical properties of the scene by training the network; it is a data-driven rendering method.

In order to generate controllable images of real scenes, traditional rendering first estimates complex physical parameters (such as camera intrinsics, lighting, albedo, etc.) from existing observations such as images and videos, i.e., inverse rendering, and then simulates the propagation of light to produce a photorealistic image.

This concept was first proposed in Neural Scene Representation and Rendering in 2018. Readers who want to learn more can refer to Advances in Neural Rendering.

The good news is that, compared with learning traditional graphics rendering or traditional 3D reconstruction, learning NeRF does not require much background knowledge, so without further ado, let's start the main content of this chapter.

1. Introduction

In this section, Mildenhall et al. outline the pipeline of the whole work; specifically:

Given a set of posed images $I = \{I_1, \dots, I_n\}$ with corresponding poses $P = \{(R_1|t_1), \dots, (R_n|t_n)\}$ and camera intrinsics $K = \{K_1, \dots, K_n\}$: sample a point $(u, v)$ on the pixel plane, form the ray through the camera optical center $o$ and $(u, v)$, and sample several 3D points $(x, y, z)$ along this ray; the viewing direction of all 3D points on the ray can then be written as $(d_x(u,v), d_y(u,v), d_z(u,v))$.

a. Use positional encoding to map the low-frequency 3D coordinates into a high-frequency space to capture more detail: $\gamma_p := (x, y, z) \rightarrow \gamma_p(x, y, z)$

Build an implicit mapping with an MLP network: $f_{geo} := \gamma_p(x, y, z) \rightarrow (\sigma, f)$, where $\sigma$ is the volume density, which controls how much radiance a ray accumulates at $(x, y, z)$, and $f$ is the geometric feature vector of the point, which implicitly encodes the geometric properties of $(x, y, z)$;

b. Use positional encoding to map the low-frequency direction vector into a high-frequency space to capture more detail: $\gamma_d := (d_x(u,v), d_y(u,v), d_z(u,v)) \rightarrow \gamma_d(d_x(u,v), d_y(u,v), d_z(u,v))$

Use another MLP network to output the RGB color of the point: $f_{color} := (\gamma_d(d_x(u,v), d_y(u,v), d_z(u,v)), f) \rightarrow c_{x,y,z}$

c. Using the traditional volume rendering method, accumulate the colors and densities of these 3D points onto the image plane;

*: Representing the observation direction with $(\theta, \phi)$ and with the __unit vector__ $(d_x, d_y, d_z)$ are equivalent: the former uses latitude and longitude, the latter a unit vector. To avoid ambiguity, we will use unit vectors for directions from here on, so the 5D input mentioned in the paper is described as a 6D vector in this tutorial.

*: The input to the entire NeRF model is a 6D vector $(x_i, y_i, z_i, d_{x_i}, d_{y_i}, d_{z_i})$, and the output is the color and volume density $(c_i, \sigma_i)$ of the point $(x_i, y_i, z_i)$; the whole model can be regarded as a simple regression model. The overall process is shown in the figure below:

[Figure: NeRF pipeline]
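To make the 6D-in / 4D-out interface above concrete, here is a minimal, hypothetical sketch (the function name, the dummy model, and the activation choices are illustrative assumptions, not the author's code) of querying a radiance field at a batch of sample points:

```python
import torch

def query_radiance_field(model, xyz, view_dirs):
    """xyz: (N, 3) positions (x, y, z); view_dirs: (N, 3) unit directions (d_x, d_y, d_z)."""
    out = model(torch.cat([xyz, view_dirs], dim=-1))   # (N, 6) in -> (N, 4) out
    rgb = torch.sigmoid(out[:, :3])                    # colors constrained to [0, 1]
    sigma = torch.relu(out[:, 3:])                     # non-negative volume density
    return rgb, sigma

# Stand-in model, only to show the shapes; the real two-MLP structure is in Section 3.
dummy = torch.nn.Linear(6, 4)
rgb, sigma = query_radiance_field(dummy, torch.rand(1024, 3), torch.rand(1024, 3))
print(rgb.shape, sigma.shape)   # torch.Size([1024, 3]) torch.Size([1024, 1])
```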

2. Related work

2.1 Neural 3D shape representation

This part mainly reviews related methods of neural representation:

The main purpose of neural representation is to convert the geometric and optical features of a scene into low-dimensional, learnable feature representations that are convenient for a neural network to learn and reason about. However, what data to use for supervision is a problem: DeepSDF, Occupancy Networks, Local Deep Implicit Functions for 3D Shape, and Local Implicit Grid Representations for 3D Scenes all require 3D data for supervision, and such ground truth is difficult to obtain in real scenes.

A series of follow-up works such as Differentiable Volumetric Rendering and Scene Representation Networks require only an RGB image loss for supervision.

The above methods can in principle represent complex and high-resolution geometry, but in practice they only express simple shapes with low geometric complexity, resulting in over-smoothed rendered images. By positionally encoding the 6D input, NeRF enables the network to learn more detailed geometry and texture, producing finer images during rendering.

2.2 View Synthesis and Image-Based Rendering

Given dense view sampling, realistic novel views can be reconstructed by simple interpolation of light-field samples. For novel view synthesis with sparse view sampling, a popular approach is to use a mesh-based scene representation, render images with differentiable rasterization, and optimize the mesh vertices and materials by gradient descent. However, this type of optimization is usually very difficult and is not on the same path as the volume rendering techniques we are concerned with, so we skip it here; readers who want to learn about this part of differentiable rendering can refer to Differentiable Rendering: A Survey.

Another popular family uses voxels to express the scene. Voxel methods can realistically represent complex shapes and materials, are well suited to gradient-based optimization, and tend to produce fewer visually disturbing artifacts than mesh-based methods. Early voxel methods used the observed images to directly color the voxel grid;

Some subsequent works train deep networks on large datasets of multiple scenes, that is, they use neural networks to learn generalizable material information. These networks predict a sampled voxel grid from a set of input images and then generate novel views at render time using alpha compositing, or compositing learned along rays; other works optimize a combination of a convolutional network and a sampled voxel grid for a specific scene, so that the CNN can compensate for discretization artifacts of the low-resolution voxel grid.

While these voxel methods achieve impressive results on novel view synthesis (NVS), their ability to scale to higher-resolution images is limited by the finer sampling of 3D space required for rendering. NeRF solves this resolution problem by encoding the scene within the parameters of an MLP: the input of NeRF is a continuous space, and any 3D point has a corresponding volume density and color value, unaffected by any grid resolution.

3. Method

3.1 Neural radiance field representation

Given a set of posed images $I = \{I_1, \dots, I_n\}$ with corresponding poses $P = \{(R_1|t_1), \dots, (R_n|t_n)\}$ and camera intrinsics $K = \{K_1, \dots, K_n\}$: sample a point $(u, v)$ on the pixel plane, form the ray through the camera optical center $o$ and $(u, v)$, and sample several 3D points $(x, y, z)$ along this ray; the viewing direction of all 3D points on the ray can then be written as $(d_x(u,v), d_y(u,v), d_z(u,v))$.
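To make the ray construction concrete, here is a minimal sketch assuming an OpenCV-style pinhole intrinsic matrix $K$ and a camera-to-world pose $[R|t]$; axis conventions differ between datasets (Blender-style scenes, for example, flip the y and z axes), so treat this as the general idea rather than the exact convention of any particular codebase:

```python
import torch

def get_ray(u, v, K, c2w):
    """Ray through pixel (u, v).

    K   : (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    c2w : (3, 4) camera-to-world matrix [R | t]
    Returns the ray origin o (camera center) and a unit viewing direction d.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Direction in camera coordinates (x right, y down, z forward).
    d_cam = torch.tensor([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Rotate into world coordinates; the ray origin is the camera center t.
    d = c2w[:3, :3] @ d_cam
    d = d / d.norm()                      # unit direction (d_x, d_y, d_z)
    o = c2w[:3, 3]
    return o, d

# Points along the ray are then r(t) = o + t * d for sampled depths t.
```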

NeRF uses an 8-layer, 256-wide MLP that takes the spatial coordinates of a 3D point as input and predicts the volume density and geometric feature vector of that point:
$$f_{geo} := (x, y, z) \rightarrow (\sigma, f)$$
where $\sigma$ is the volume density of the point, whose physical meaning is the probability that the ray hits a particle at $(x, y, z)$, and $f$ is the geometric feature vector of the point.

A second single-layer, 256-wide MLP then takes the direction vector and the geometric feature vector as input and predicts the RGB color of the point:
$$f_{color} := (d_x(u,v), d_y(u,v), d_z(u,v), f) \rightarrow c_{x,y,z}$$
*: $f_{geo}$ adds a skip connection at the fourth layer, i.e., the input is concatenated with the fourth layer's activations and then fed into the fifth MLP layer; see the network structure figure in the next section.

*: $f_{geo}$ is a function of the 3D position only, which assumes that the __density field__ is isotropic (independent of viewing direction); $f_{color}$, on the other hand, is a function of both the geometric feature vector and the direction, because even at the same 3D point, different viewing angles can produce different colors in RGB space due to material and other effects. An obvious example is a specular surface, where different colors are observed from different directions. In fact, the paper also illustrates this:

[Figure: interpolation of direction information by the color network]

Besides constructing a density field, another reason NeRF can synthesize novel views is that it interpolates over the viewing direction $d$. For example, in the figure above, the colors of the water surface observed from the two viewpoints are different; if a continuous sequence of frames gradually transitioned from view 1 to view 2, the observed color of that point on the water surface would also transition gradually from the color in view 1 to the color in view 2. From this point of view, unless the lighting environment is particularly complex and nonlinear, $f_{color}$ should not be made too deep.

3.2 Positional Encoding (PE)

On the Spectral Bias of Neural Networks (Section 3) shows that deep neural networks preferentially learn low-frequency information: when a 6-layer, 256-wide MLP is used to fit a set of Fourier series, the low-frequency terms are always regressed first. NeRF therefore first uses PE to map the low-frequency xyz coordinates into a vector containing high-frequency components and then feeds that vector into the network:
$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \dots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$
*: $\gamma$ is applied element-wise to $(x, y, z, d_x, d_y, d_z)$, with $L = 10$ when encoding positions and $L = 4$ when encoding directions. The original xyz vector thus becomes a 60-dimensional vector and $(d_x, d_y, d_z)$ becomes a 24-dimensional vector, as shown in the following figure:

[Figure: NeRF network structure]
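The structure in the figure can be summarized in code. Below is a minimal PyTorch sketch of the positional encoding and the two MLPs (the 8-layer trunk with the skip connection at the fourth layer, plus the color branch), written from the description in this section rather than copied from any official repository; details vary between implementations, e.g., some also concatenate the raw input to the encoding (giving 63/27 dimensions instead of 60/24), and the width of the color branch differs between codebases:

```python
import torch
import torch.nn as nn

def positional_encoding(p, L):
    """gamma(p): element-wise (sin(2^k * pi * p), cos(2^k * pi * p)) for k = 0..L-1."""
    out = []
    for k in range(L):
        out.append(torch.sin(2.0 ** k * torch.pi * p))
        out.append(torch.cos(2.0 ** k * torch.pi * p))
    return torch.cat(out, dim=-1)            # (..., 2 * L * p.shape[-1])

class TinyNeRF(nn.Module):
    def __init__(self, L_pos=10, L_dir=4, width=256):
        super().__init__()
        in_pos, in_dir = 3 * 2 * L_pos, 3 * 2 * L_dir   # 60 and 24
        self.L_pos, self.L_dir = L_pos, L_dir
        # f_geo: 8 layers; the encoded position is re-injected before the fifth layer.
        self.layers1 = nn.ModuleList(
            [nn.Linear(in_pos, width)] + [nn.Linear(width, width) for _ in range(3)])
        self.layers2 = nn.ModuleList(
            [nn.Linear(width + in_pos, width)] + [nn.Linear(width, width) for _ in range(3)])
        self.sigma_head = nn.Linear(width, 1)
        self.feat_head = nn.Linear(width, width)
        # f_color: one hidden layer taking (geometric feature, encoded direction).
        self.color_head = nn.Sequential(
            nn.Linear(width + in_dir, width // 2), nn.ReLU(), nn.Linear(width // 2, 3))

    def forward(self, xyz, dirs):
        x = positional_encoding(xyz, self.L_pos)
        d = positional_encoding(dirs, self.L_dir)
        h = x
        for layer in self.layers1:
            h = torch.relu(layer(h))
        h = torch.cat([x, h], dim=-1)                    # skip connection
        for layer in self.layers2:
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma_head(h))           # non-negative volume density
        feat = self.feat_head(h)                         # geometric feature f
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, d], dim=-1)))
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.rand(4096, 3), torch.rand(4096, 3))   # shape check only
```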

3.3 Volume rendering applied to the neural radiance field

In the traditional volume rendering model, the scene is assumed to contain many particles, and a ray hits these particles as it travels, losing or gaining radiance along the way. To obtain the color carried by the light from the object surface to the pixel plane, we simply accumulate the radiance lost/gained at all of the particles the ray hits:
$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \quad \text{where}\; T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right) \tag{1}$$
where $r(t) = o + t d$ is the ray starting from the camera optical center $o$ and traveling along the direction $d$; $T(t)$ is the accumulated transmittance from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any particle.

In actual rendering, we approximate this integral by numerical quadrature:
$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i, \quad \text{where}\; T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \tag{2}$$
where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent sample points. We can use $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ in place of the alpha value (opacity) in traditional rendering: alpha equal to 1 means completely opaque, alpha equal to 0 means completely transparent. In a sense this value is equivalent to the volume density.
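A direct implementation of formula $(2)$ is short. The following sketch (tensor shapes and names are assumptions for illustration, not tied to a particular repository) composites per-sample colors and densities along each ray into a pixel color and also returns the per-sample weights, which will be reused for hierarchical sampling in Section 3.4:

```python
import torch

def composite(rgb, sigma, t_vals):
    """Numerical quadrature of Eq. (2) along a batch of rays.

    rgb:    (n_rays, n_samples, 3) per-sample colors c_i
    sigma:  (n_rays, n_samples)    per-sample densities sigma_i
    t_vals: (n_rays, n_samples)    sample depths t_i (ascending)
    """
    # delta_i = t_{i+1} - t_i; the last interval is conventionally set very large.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    alpha = 1.0 - torch.exp(-sigma * deltas)              # opacity alpha_i
    # T_i = prod_{j<i} (1 - alpha_j): exclusive cumulative transmittance.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alpha                                # w_i = T_i * alpha_i
    color = (weights[..., None] * rgb).sum(dim=1)          # hat{C}(r)
    return color, weights
```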

*: The volume rendering formula here is slightly different from the form we will derive in the traditional volume rendering section below; we discuss this after explaining the traditional method.

3.4 Stratified sampling technique

Considering that, when sampling 3D points for volume rendering, uniformly sampling every ray at each training step would place most points in free space, which is inefficient for training the MLP, NeRF adopts a hierarchical sampling strategy. Reviewing formula $(2)$, the numerical quadrature is in fact a weighted sum over the sample points on the ray, with weights $\omega_i = T_i\left(1 - \exp(-\sigma_i \delta_i)\right)$.

In the first pass, we uniformly sample several points on each ray and normalize the weights into a PDF over the samples: $\hat{\omega}_i = \omega_i / \sum_{j=1}^{N_c} \omega_j$. Points with high weights have high volume density and should be selected with higher probability in the second sampling pass, which lets us quickly skip sample points in free space.

At the same time, NeRF uses two different networks for the two sampling passes: the first forward pass uses a coarse network to compute the PDF and obtain new sample points, which are then fed into a fine network for rendering. In my view, the purpose of this design is to learn the distribution of the density field more flexibly.

*: Although the original NeRF uses two networks to separate the coarse and fine sampling stages, much follow-up work uses only a single network and simply repeats the sampling several times. A sketch of the resampling step is shown below.
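Below is a simplified sketch of that inverse-CDF resampling step, written in the spirit of the `sample_pdf` helper found in common NeRF codebases; deterministic/stratified sampling of $u$ and some edge handling are omitted, and all names are illustrative.

```python
import torch

def sample_from_weights(bins, weights, n_fine):
    """Draw n_fine new depths per ray from the piecewise-constant PDF.

    bins:    (n_rays, n_coarse + 1) bin edges along each ray
    weights: (n_rays, n_coarse)     coarse weights w_i from Eq. (2)
    """
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)     # hat{w}_i
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[:, :1]), cdf], dim=-1)   # (n_rays, n_coarse + 1)

    u = torch.rand(cdf.shape[0], n_fine)                           # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    cdf_lo = torch.gather(cdf, 1, idx - 1)
    cdf_hi = torch.gather(cdf, 1, idx)
    bin_lo = torch.gather(bins, 1, idx - 1)
    bin_hi = torch.gather(bins, 1, idx)

    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-8)                 # position inside the bin
    return bin_lo + frac * (bin_hi - bin_lo)                       # new depth samples
```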

This concludes the introduction to the principle of NeRF; since this article focuses on the connection between NeRF and traditional volume rendering, readers interested in NeRF's experiments and analysis can refer to the paper.

Physical model for volume rendering

1. Introduction to Traditional Volume Rendering

References: Volume Rendering Digest (for NeRF), Physically Based Rendering, and Octane Render.

Volume rendering has a relatively long history in graphics. In volume rendering, the scene is assumed to consist of many particles, and a ray $p = o + t\omega$, starting at $t = 0$ and traveling along the direction $\omega$, undergoes the following phenomena when it encounters these particles:

a. Absorption: part of the energy of the light $L$ is absorbed by the particle

b. Out-scattering: part of the light $L$ hits the particle and is scattered into other directions

c. Emission (self-illumination): the light $L$ picks up part of the light energy emitted by the particle

d. In-scattering: light from other directions hits the particle and is scattered exactly into the direction of $L$

Absorption and out-scattering reduce the light intensity, while emission and in-scattering increase it. We can express these processes as linear differential equations:

a. For absorption, $dL(p(t)) = -\sigma_a(p(t))\,L(p(t))\,dt$, meaning that the light intensity at $p(t)$ is attenuated linearly due to absorption;

b. For out-scattering, $dL(p(t)) = -\sigma_s(p(t))\,L(p(t))\,dt$, meaning that the light intensity at $p(t)$ is attenuated linearly due to out-scattering;

c. For emission, the intensity of the light $L$ increases with the path length $dt$, and the increment depends only on the emitting particles: $dL(p(t)) = L_e(p(t))\,dt$

d. For in-scattering, since light from many other directions hits the particle and contributes a component in the direction of $L$, we define the phase function $P(\omega_i \rightarrow \omega)$ as the proportion (probability density) of light incident from direction $\omega_i$ that is scattered into the direction $\omega$ of $L$, with corresponding intensity $L_i$; the intensity increment from in-scattering is then $dL(p(t)) = \left(\sigma_s(p(t)) \int_\Omega P(\omega_i \rightarrow \omega)\, L_i(p(t))\, d\omega_i\right) dt$, where $\Omega$ is the sphere of directions.

$\sigma_s(p(t))$ and $\sigma_a(p(t))$ are the proportionality coefficients of the intensity attenuation caused by out-scattering and absorption at the point $p(t)$. Since __only these two phenomena__ reduce the light intensity, we can combine them into a total attenuation coefficient $\sigma_t(p(t)) = \sigma_a(p(t)) + \sigma_s(p(t))$;

In this way we can solve the differential equation $dL(p(t)) = -\sigma_t(p(t))\,L(p(t))\,dt$ to obtain:
$$L(p(t)) = L_0\, e^{-\int_0^t \sigma_t(p(u))\,du} = L_0\, T_r(p(0) \rightarrow p(t)) \tag{3}$$
This is the famous Beer's law: when light propagates in a homogeneous medium, its intensity decays exponentially; here the extinction coefficient $\sigma_t$ can be compared with NeRF's volume density $\sigma$.

Since both $L_e(p(t))$ and $\sigma_s(p(t)) \int_\Omega P(\omega_i \rightarrow \omega)\, L_i(p(t))\, d\omega_i$ are independent of the original light intensity, we denote their sum by $S(p(t))$:
$$S(p(t)) = L_e(p(t)) + \sigma_s(p(t)) \int_\Omega P(\omega_i \rightarrow \omega)\, L_i(p(t))\, d\omega_i$$
Next we combine the above four phenomena into one formula:
$$\frac{dL(p(t))}{dt} = -\sigma_t(p(t))\,L(p(t)) + S(p(t)) \tag{4}$$

Formula $(4)$ is a first-order linear non-homogeneous differential equation, so we can write down the solution for $L$ directly from the general solution formula of this type of equation:
$$L(p(t)) = e^{-\int_0^t \sigma(p(x))\,dx} \left( \int_0^t S(p(x))\, e^{\int_0^x \sigma(p(u))\,du}\, dx + C \right)$$
Substituting the initial value $L(t=0) = L_0$ gives $C = L_0$.

That is,

$$\begin{aligned} L(t) &= e^{-\int_0^t \sigma(x)\,dx} \left( \int_0^t S(x)\, e^{\int_0^x \sigma(u)\,du}\, dx + L_0 \right)\\ &= \int_0^t S(x)\, e^{\int_0^x \sigma(u)\,du - \int_0^t \sigma(u)\,du}\, dx + L_0\, e^{-\int_0^t \sigma(x)\,dx}\\ &= \int_0^t S(x)\, e^{-\int_x^t \sigma(u)\,du}\, dx + L_0\, e^{-\int_0^t \sigma(x)\,dx} \end{aligned} \tag{5}$$

which is the intensity of light that starts with $L(t=0) = L_0$, hits some particles along the way, and arrives at $t$.
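As a quick numerical sanity check of formula $(5)$ (a throwaway experiment, not part of NeRF), the sketch below integrates the ODE $(4)$ with an Euler scheme for constant $\sigma$ and $S$, in which case $(5)$ reduces to $L(t) = \frac{S}{\sigma}\left(1 - e^{-\sigma t}\right) + L_0 e^{-\sigma t}$, and compares the two:

```python
import math

sigma, S, L0, t_end, n = 0.7, 2.0, 5.0, 3.0, 100000
dt = t_end / n

# Euler integration of dL/dt = -sigma * L + S  (Eq. 4 with constant coefficients).
L = L0
for _ in range(n):
    L += (-sigma * L + S) * dt

# Closed form from Eq. (5) with constant sigma and S.
L_closed = S / sigma * (1.0 - math.exp(-sigma * t_end)) + L0 * math.exp(-sigma * t_end)

print(L, L_closed)   # the two values agree up to the discretization error
```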

In traditional volume rendering / differentiable rendering, we usually model the relevant parameters of the above phenomena explicitly (such as albedo and reflectance), which, under different assumptions, leads to BSDF fields, BRDF fields, and so on. As a neural rendering method, the biggest difference of NeRF is that it does not rely on such an optical model: it reconstructs a __neural radiance field__ with an MLP and expresses it implicitly with the network. At the same time, because the density parameter of the radiance field has a different physical meaning from the parameters of traditional volume rendering, its rendering formula is also slightly different.

2. The connection between the NeRF neural radiance field and the physical model of volume rendering

Did NeRF's combination of implicit representation and volume rendering come out of nowhere? Not really: NeurIPS 2019, for example, already had an oral work that combined an MLP-based implicit scene representation with volume rendering, Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations;

We collectively call methods that encode rendering parameters with neural networks neural rendering. However, earlier works of this type never reached NeRF's stunning, photo-realistic rendering quality. The author believes NeRF's contributions are mainly threefold:

a. Using an MLP network to learn the volume density, whose probabilistic nature is convenient for the subsequent hierarchical sampling;

b. Introducing Positional Encoding to map low-frequency spatial coordinates into a high-frequency space;

c. Improving the volume rendering formula.

NeRF's rendering formula $(1)$ differs slightly from the volume rendering formula $(5)$ we derived, because the volume density $\sigma$ is defined as the probability that a ray hits a particle while traveling an infinitesimal distance. Recall that the transmittance $T(t)$ is defined as the probability of traveling from $t_n$ to $t$ without hitting any particle; then the probability of still not having hit any particle after a further $dt$ is:
$$T(t + dt) = T(t)\,(1 - dt\,\sigma(t)) \;\Rightarrow\; \frac{T(t+dt) - T(t)}{dt} = T'(t) = -T(t)\,\sigma(t)$$
We can again give the general solution of this differential equation:
$$T(a \rightarrow b) = \frac{T(b)}{T(a)} = \exp\!\left(-\int_a^b \sigma(t)\,dt\right)$$
At the same time, we can define $O(0 \rightarrow t) = 1 - T(0 \rightarrow t)$ as the probability that the ray has hit a particle at some point before $t$; this is a cumulative distribution function whose probability density is exactly $T(t)\,\sigma(t)$, i.e., the probability density that the ray stops exactly at $t$.

In this way, NeRF's volume rendering formula amounts to computing the expected color over all points on the ray.
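To see this explicitly, assume, as in the quadrature of formula $(2)$, that $\sigma$ is constant within each interval $[t_i, t_{i+1})$; then the probability that the ray terminates inside the $i$-th interval is exactly the weight of formula $(2)$:
$$P\left(\text{terminate in } [t_i, t_{i+1})\right) = T(t_i) - T(t_{i+1}) = T_i - T_i\, e^{-\sigma_i \delta_i} = T_i\left(1 - e^{-\sigma_i \delta_i}\right) = \omega_i$$
so $\hat{C}(r) = \sum_i \omega_i c_i$ is (up to the small probability that the ray never terminates) the expected color at the point where the ray stops.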

Looking back at formula $(5)$, both formulas share the property that, in a homogeneous medium, the light intensity / transmittance decays exponentially; this is how NeRF adapts volume rendering to its own volume density field. In essence, it constructs a probability field, and the color in formula $(1)$ can also be replaced with other physical quantities, such as the normal vector and depth in MonoSDF.
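For example, reusing the weights returned by the compositing sketch in Section 3.3, an expected depth along each ray can be read off in the same way (a hypothetical helper, following the idea used in works such as MonoSDF):

```python
import torch

def expected_depth(weights, t_vals):
    """Depth as the expectation of t under the per-ray termination weights.

    weights: (n_rays, n_samples) w_i = T_i * (1 - exp(-sigma_i * delta_i))
    t_vals:  (n_rays, n_samples) sample depths t_i
    """
    return (weights * t_vals).sum(dim=-1)   # sum_i w_i * t_i
```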

Notice

In this chapter, we interpreted the original NeRF paper and the traditional volume rendering method, and discussed the connection between them. In the next chapter, we will combine colmap to explain part of the source code and the use of CUDA operators.

Since there are many NeRF-related frameworks, here are a few suitable for beginners:

Original NeRF (implemented in TensorFlow, which no longer matches the mainstream NeRF development environment)

nerf_pl (implemented by AI葵 (kwea123) with PyTorch Lightning; there are related YouTube tutorials, and the later ngp_pl is also a good hands-on codebase, but PyTorch Lightning is not very beginner-friendly and may lengthen the learning cycle)

nerf_pytorch (a PyTorch implementation from MIT; the code is not deeply nested and is easier to read)

Considering that most of the follow-up content will involve PyTorch Lightning, we choose nerf_pl for the source-code interpretation.
