Getting to Know the Neural Radiance Field (NeRF)

NeRF is short for Neural Radiance Fields. The authors are from UC Berkeley, Google, and UC San Diego.

Title:NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Paper:https://arxiv.org/pdf/2003.08934.pdf

Code:https://github.com/bmild/nerf

The motivation for writing this article is twofold: on the one hand, NeRF is important, representing a promising direction for combining computer vision and graphics; on the other hand, NeRF has a certain barrier to understanding for readers with a computer vision background. This article tries to introduce NeRF with the minimum of supplementary background knowledge and the lowest possible cost of understanding.

Overall introduction

The goal of NeRF is to synthesize images of the same scene from different viewpoints. The method is conceptually simple: given several pictures of a scene, reconstruct a 3D representation of that scene; then, at inference time, feed in a new viewpoint and synthesize (render) the image from that viewpoint.

There are many possible forms of "3D representation". NeRF uses a radiance field, and then uses volume rendering to turn the radiance field into an image given a camera viewpoint. The reason for choosing radiance field + volume rendering is simple: the whole process is differentiable. The process is also quite intuitive: it can be understood as flattening the space along one direction, where the colors in space are weighted and summed to obtain the color on the image plane.

radiance field

The so-called radiance field can be thought of as a function: for a static scene, at every spatial position (x, y, z) we can query the density \sigma at that point, and for every viewing direction (\theta, \phi) we can query the color c = (R, G, B) emitted from that point along that direction, i.e. F:(x, y, z, \theta, \phi) \rightarrow (R, G, B, \sigma). The density is used to compute a weight, and the weighted sum of the colors of the points along a ray gives the pixel color. In fact, to represent a radiance field we would only need to maintain a lookup table: given (x, y, z, \theta, \phi), look up the RGB value and density directly, then apply volume rendering. NeRF proposes a better way: use a neural network to model this mapping.
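To make this mapping concrete, here is a minimal sketch in PyTorch (not the authors' implementation; the class name and layer sizes are made up for illustration, and the real NeRF network is described in the Architecture section below) of a neural network playing the role of the lookup table:

```python
import torch
import torch.nn as nn

# A toy stand-in for F: (x, y, z, theta, phi) -> (R, G, B, sigma).
class TinyRadianceField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Input: 3 position coordinates + 2 viewing angles; output: RGB + density.
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz, view_dir):
        out = self.net(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])     # colors in [0, 1]
        sigma = torch.relu(out[..., 3:4])     # density is non-negative
        return rgb, sigma

# Query the field at one point from one direction.
rgb, sigma = TinyRadianceField()(torch.rand(1, 3), torch.rand(1, 2))
```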

volume rendering

Volume rendering, intuitively speaking, works as follows: we know the camera's focal point; the line connecting the focal point and a pixel defines a ray, and we can accumulate the colors of all the points along this ray to obtain the color of that pixel.

In theory, by integrating the density (which depends only on spatial position) and the color (which depends on both spatial position and viewing direction) of every point the ray passes through, we obtain the color of that pixel. Once the color of every pixel has been computed, the image from this viewpoint is rendered. As shown below:

Starting from the camera's focal point, shoot a ray through a pixel, obtain the attributes of each point along the ray in space, and integrate them to obtain the color of that pixel

To carry out the above process, we might need to maintain a huge tensor to represent the radiance field and look up RGB and density values in it. The problem is that such a table grows with the size of the space and can only represent the field discretely. What NeRF does is model the radiance field with a neural network, so that no matter how large the space is, the amount of storage needed to represent the radiance field does not change, and the representation is continuous:  F_{\Theta}:(x, y, z, \theta, \phi) \rightarrow(R, G, B, \sigma).

Using a neural network to represent the radiance field instead of a lookup table

overall process

Because the neural network is differentiable and the chosen volume rendering method is differentiable, the image obtained by volume rendering and the original image can be compared with an MSE loss, and the whole pipeline can be optimized end to end, very elegantly, with gradient backpropagation. The entire training pipeline is shown in the figure below:
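As a schematic illustration of this end-to-end optimization, here is a minimal PyTorch sketch with stand-in modules and random data (it is not the actual NeRF pipeline; a single linear layer stands in for the radiance-field MLP and the renderer):

```python
import torch

# Stand-in model and data; the point is only that rendered colors are
# compared to ground-truth pixels with MSE and gradients flow all the way back.
model = torch.nn.Linear(5, 4)                    # stand-in for the radiance-field MLP
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

inputs = torch.rand(1024, 5)                     # stand-in for sampled (x, y, z, theta, phi)
target_rgb = torch.rand(1024, 3)                 # ground-truth pixel colors

optimizer.zero_grad()
pred_rgb = torch.sigmoid(model(inputs)[:, :3])   # stand-in for volume-rendered colors
loss = torch.mean((pred_rgb - target_rgb) ** 2)  # MSE photometric loss
loss.backward()                                  # end-to-end backpropagation
optimizer.step()
```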

At this point the reader should have a general understanding of the principle of NeRF; the following sections cover the specific details.

Volume Rendering with Radiance Fields

We now roughly understand how volume rendering works. But how exactly is the color in space integrated along the ray? If we think of the ray as a beam of light, we can intuitively state two conditions that this integral should satisfy:

  1. The higher the density at a point, the more the light is attenuated after passing through it, i.e. density and transmittance are inversely related
  2. The higher the density at a point, the larger the weight of that point's color in the pixel along this ray

Therefore, write the ray from the focal point through a pixel as \mathbf{r}(t)=\mathbf{o}+t \mathbf{d}, where \mathbf{o} is the origin, \mathbf{d} is the direction, and t parameterizes the point along the ray, running from the near bound t_n to the far bound t_f. Integrating along this ray gives the pixel color:

C(\mathbf{r})=\int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, d t, \quad \text{where } T(t)=\exp \left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, d s\right)

Observe that the integrand in this formula is the product of T(t), the density \sigma(\mathbf{r}(t)), and the color \mathbf{c}(\mathbf{r}(t), \mathbf{d}), where T(t) is the accumulated transmittance, i.e. "how much light is left" by the time the ray reaches this point.

Therefore, within the color of this pixel, the weight of this point's color is T(t)\, \sigma(\mathbf{r}(t)): how much light is left when the ray reaches this point, multiplied by the density at this point.

Next, look at the expression for T(t): it is the integral of the density along the ray, negated and exponentiated. Intuitively this is easy to understand: the higher the density of the points passed in front, the less light remains for the points behind. The rigorous derivation is not repeated here; interested readers can refer to "Ray tracing volume densities".

In the actual rendering process we can only approximate the integral: divide the ray into N equal intervals, randomly sample one point within each interval, and compute a weighted sum of the colors of the sampled points:

\hat{C}(\mathbf{r})=\sum_{i=1}^{N} T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \quad \text{where } T_i=\exp \left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)

Here \delta_i=t_{i+1}-t_i is the distance between adjacent samples. It is worth noting that the weight in the original integral is T(t)\, \sigma(\mathbf{r}(t)), while the weight in the sum here is T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right), and 1-\exp \left(-\sigma_i \delta_i\right) increases with the density \sigma_i. For the detailed derivation, see "Optical models for direct volume rendering".
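The discrete sum above fits in a few lines of code. Below is a sketch (not the authors' code; the function name is made up for illustration) that computes the weights T_i(1-\exp(-\sigma_i \delta_i)) for one ray and composites the sampled colors into a pixel color:

```python
import torch

def composite(sigma, rgb, delta):
    # sigma: (N,) densities, rgb: (N, 3) colors, delta: (N,) sample spacings
    alpha = 1.0 - torch.exp(-sigma * delta)               # 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance remaining before sample i
    trans = torch.exp(-torch.cumsum(
        torch.cat([torch.zeros(1), sigma * delta])[:-1], dim=0))
    weights = trans * alpha                               # w_i
    pixel_rgb = (weights[:, None] * rgb).sum(dim=0)       # weighted sum of sample colors
    return pixel_rgb, weights

# Example: 64 samples along a single ray.
pixel, w = composite(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.03))
```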

Two optimizations of the neural radiance field

Positional encoding

Similar to the approach in Transformers, the coordinates and viewing directions are mapped into a higher-dimensional space before being fed to the network, which solves the problem of blurry rendered images:

\gamma(p)=\left(\sin \left(2^0 \pi p\right), \cos \left(2^0 \pi p\right), \ldots, \sin \left(2^{L-1} \pi p\right), \cos \left(2^{L-1} \pi p\right)\right)

This is applied to each component of the input, with L=10 for the coordinates and L=4 for the viewing direction, which yields the 60- and 24-dimensional inputs used in the architecture below.
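A sketch of this sin/cos frequency expansion (the function name and shapes are mine, for illustration) might look like this:

```python
import torch

# Expand each coordinate p into (sin(2^0*pi*p), cos(2^0*pi*p), ...,
# sin(2^(L-1)*pi*p), cos(2^(L-1)*pi*p)). With L = 10, 3 coordinates -> 60
# numbers; with L = 4, a 3-component direction -> 24 numbers.
def positional_encoding(p, num_freqs):
    # p: (..., D) coordinates; returns (..., D * 2 * num_freqs)
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32) * torch.pi  # 2^k * pi
    angles = p[..., None] * freqs                                           # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)         # (..., D, 2L)
    return enc.flatten(start_dim=-2)                                        # (..., D * 2L)

xyz_enc = positional_encoding(torch.rand(1024, 3), num_freqs=10)  # -> (1024, 60)
dir_enc = positional_encoding(torch.rand(1024, 3), num_freqs=4)   # -> (1024, 24)
```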

Hierarchical volume sampling

Because the density distribution in space is uneven, uniform random sampling along the ray is inefficient for rendering: a ray may travel a long way before passing through any high-density points. From the analysis above, the entire rendering process is nothing more than a weighted sum of the colors of the sample points along the ray, with weight w_i=T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right).

We can use the color weight w_i in the rendering formula as the probability of sampling within the corresponding interval. NeRF trains two radiance field networks, one coarse and one fine. The coarse network is rendered and trained with a relatively small number N_c of uniformly sampled points, and its outputs are used to estimate the sampling probabilities: the weights w_i are normalized and treated as a probability distribution:

\hat{w}_i=\frac{w_i}{\sum_{j=1}^{N_c} w_j}

Then N_f additional points are sampled from the probability distribution \hat{w}_i, and all N_c+N_f points are used to render with, and train, the fine network.
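A sketch of this second sampling stage (an inverse-transform sampler over the normalized coarse weights; the function and variable names are made up for illustration) might look like this:

```python
import torch

# Normalize the coarse weights w_i into a PDF over the N_c intervals, then
# draw n_fine extra sample positions by inverse-transform sampling.
def sample_fine(bins, weights, n_fine):
    # bins: (N_c + 1,) interval edges along the ray; weights: (N_c,) coarse w_i
    pdf = weights / (weights.sum() + 1e-8)                 # hat{w}_i
    cdf = torch.cumsum(pdf, dim=0)
    u = torch.rand(n_fine)                                 # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u).clamp(max=len(weights) - 1)
    # Place each new sample uniformly inside its chosen interval.
    lo, hi = bins[idx], bins[idx + 1]
    return lo + (hi - lo) * torch.rand(n_fine)

bins = torch.linspace(2.0, 6.0, 65)                        # e.g. N_c = 64 coarse intervals
t_fine = sample_fine(bins, torch.rand(64), n_fine=128)     # 128 extra sample positions
```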

Architecture

As shown in the figure below, the coordinates \mathbf{x}=(x, y, z) are mapped to a 60-dimensional vector and fed to the fully connected network, which outputs the density \sigma; the 24-dimensional vector obtained by mapping the viewing direction \mathbf{d}=(\theta, \phi) is concatenated with the features output by the previous layer, and the RGB value is obtained through two more MLP layers. It is worth noting that, in order to reinforce the coordinate information, the encoded coordinates are fed into the network again partway through. The network design reflects the fact that density is independent of the viewing direction, while color depends on the viewing direction.
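To connect this description to code, here is a sketch of such an architecture in PyTorch (it roughly follows the layer sizes described in the paper but is not the authors' implementation):

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        # First block of fully connected layers takes the encoded position.
        self.block1 = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        # Second block re-injects the encoded position (skip connection).
        self.block2 = nn.Sequential(
            nn.Linear(width + pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)          # density: position only
        self.feature = nn.Linear(width, width)
        # Color additionally depends on the encoded viewing direction.
        self.rgb_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())

    def forward(self, pos_enc, dir_enc):
        h = self.block1(pos_enc)
        h = self.block2(torch.cat([pos_enc, h], dim=-1))   # re-inject coordinates
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([self.feature(h), dir_enc], dim=-1))
        return rgb, sigma

rgb, sigma = NeRFMLP()(torch.rand(1024, 60), torch.rand(1024, 24))
```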

 Reprinted from: It's 2022, I don't allow you to not understand NeRF - Zhihu (zhihu.com)

Origin blog.csdn.net/weixin_42620109/article/details/128814947