
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields

0. Abstract

Problem introduction:

When training or testing images observe scene content at different resolutions, NeRF can produce blurred or aliased renderings.

Solutions:

Mip-NeRF renders anti-aliased conical frustums instead of rays.

Effect (Experiments):

  • Is 7% faster than NeRF and half the size
  • Reduces average error rates by 17% on the dataset presented with NeRF and by 60% on a challenging multiscale variant of that dataset that we present.
  • Is able to match the accuracy of a brute-force supersampled NeRF on our multiscale dataset while being 22× faster. (Supersampling is the traditional remedy one might first reach for to fix blurred rendering caused by mismatched resolutions.)

5. Conclusion

The process: Mip-NeRF casts cones, encodes the positions and sizes of conical frustums, and trains a single neural network that models the scene at multiple scales.

1. Introduction

Problem introduction:

When the training images observe scene content at multiple resolutions, renderings from the recovered NeRF appear excessively blurred in close-up views and contain aliasing artifacts in distant views.

Because the inputs span multiple scales, rendering artifacts arise.

Traditional method:

A straightforward solution is to adopt the strategy used in offline raytracing: supersampling each pixel by marching multiple rays through its footprint.
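As a concrete picture of this brute-force baseline, here is a minimal Python sketch (our illustration, not the paper's code; `render_ray` is a hypothetical stand-in for a full NeRF render of one ray):

```python
import numpy as np

def render_pixel_supersampled(render_ray, pixel_center, pixel_width, n=4, rng=None):
    """Brute-force anti-aliasing: average n*n rays jittered across the
    pixel footprint. Cost grows as n**2, which is why this strategy is
    impractically expensive for NeRF."""
    rng = rng or np.random.default_rng(0)
    colors = []
    for i in range(n):
        for j in range(n):
            # Stratified jitter: one ray per sub-pixel cell.
            offset = (np.array([i, j]) + rng.random(2)) / n - 0.5
            colors.append(render_ray(pixel_center + offset * pixel_width))
    return np.mean(colors, axis=0)
```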

This paper:

Takes inspiration from the mipmap approach: a mipmap represents a signal (typically an image or a texture map) at a set of different discrete downsampling scales, and selects the appropriate scale for a ray based on the projection of the pixel footprint onto the geometry intersected by that ray.

Both the traditional method and the mip method come from graphics [1]. "Mip" comes from the Latin *multum in parvo*: many things in a small space. In computer graphics, mipmapping is a texture rendering technique that speeds up rendering and reduces aliasing. Simply put, mipmapping shrinks the main image into a series of successively smaller images and stores these lower-resolution copies. This strategy is called pre-filtering: the computational burden of anti-aliasing is concentrated in preprocessing, so no matter how many times a texture is rendered later, it only needs to be preprocessed once.
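To make pre-filtering concrete, here is a minimal sketch (our own illustration, assuming a square image whose side is a power of two) that builds a mip pyramid by repeated 2x2 averaging:

```python
import numpy as np

def build_mipmap(image):
    """Build a mip pyramid by repeated 2x2 box-filter downsampling.
    `image` is an (H, W, C) array with H == W a power of two."""
    levels = [image.astype(np.float32)]
    while levels[-1].shape[0] > 1:
        a = levels[-1]
        # Average each 2x2 block -> half resolution; this is the pre-filter.
        a = (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2]) / 4.0
        levels.append(a)
    return levels  # levels[k] has side length H / 2**k

pyramid = build_mipmap(np.random.rand(256, 256, 3))
print([lvl.shape for lvl in pyramid])  # (256, 256, 3), (128, 128, 3), ..., (1, 1, 3)
```

At render time, a level (or an interpolation between two levels) is chosen per query based on the projected pixel footprint, so the anti-aliasing cost is paid once, up front.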

3. Method

For Mip-NeRF, the main innovations fall into three aspects:

  • The rendering process of Mip-NeRF is based on anti-aliased conical frustums;
  • Mip-NeRF proposes a new positional encoding, IPE (Integrated Positional Encoding);
  • Mip-NeRF merges the coarse and fine MLPs into a single multiscale MLP.

3.1. Cone Tracing and Positional Encoding

3.1.1 Cone Tracing

[Figure: a cone cast from the camera center through a pixel's footprint on the image plane]

From the camera center $\mathbf{o}$, a cone is cast along direction $\mathbf{d}$ through the pixel of interest. The cone's size is set by the pixel size; its radius on the image plane is defined as $\dot{r} = \frac{2}{\sqrt{12}}\,w$, where $w$ is the width of the pixel.

We can use the following formula to test whether a query point $\mathbf{x}$ lies inside the conical frustum (given the interval $[t_0, t_1]$ and the radius $\dot{r}$):

$$\mathrm{F}(\mathbf{x},\mathbf{o},\mathbf{d},\dot{r},t_0,t_1) = \mathbb{1}\left\{\left(t_0 < \frac{\mathbf{d}^\mathrm{T}(\mathbf{x}-\mathbf{o})}{\|\mathbf{d}\|_2^2} < t_1\right) \wedge \left(\frac{\mathbf{d}^\mathrm{T}(\mathbf{x}-\mathbf{o})}{\|\mathbf{d}\|_2\,\|\mathbf{x}-\mathbf{o}\|_2} > \frac{1}{\sqrt{1+\left(\dot{r}/\|\mathbf{d}\|_2\right)^2}}\right)\right\}$$

This formula imposes two restrictions: the first is along the $\mathbf{d}$ direction, i.e., the projection of $\mathbf{x}-\mathbf{o}$ onto $\mathbf{d}$ must fall within $[t_0, t_1]$; the second is angular, i.e., $\mathbf{x}$ must lie within the cone's opening angle.
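This test transcribes directly into NumPy (a sketch for intuition; Mip-NeRF itself never evaluates it point by point, since the frustum is featurized as a whole):

```python
import numpy as np

def inside_cone(x, o, d, r_dot, t0, t1):
    """Indicator F(x, o, d, r_dot, t0, t1): does point x lie inside the
    conical frustum cast from o along d, with image-plane radius r_dot,
    between depths t0 and t1?"""
    depth = np.dot(d, x - o) / np.dot(d, d)  # projection of x - o onto d
    cos_angle = np.dot(d, x - o) / (np.linalg.norm(d) * np.linalg.norm(x - o))
    return (t0 < depth < t1) and \
           cos_angle > 1.0 / np.sqrt(1.0 + (r_dot / np.linalg.norm(d)) ** 2)
```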

Next, consider how to encode this conical frustum. The simplest featurization the authors found is the expected positional encoding over all points inside the frustum.

Because this expectation has no closed form over the frustum, a three-dimensional Gaussian distribution is first fitted to the frustum, so the frustum can be characterized by just a mean and a covariance matrix.

This Gaussian is completely described by three values: $\mu_t$ (the mean distance along the ray), $\sigma_t^2$ (the variance along the ray direction), and $\sigma_r^2$ (the variance perpendicular to the ray):

$$\mu_t = t_\mu + \frac{2 t_\mu t_\delta^2}{3 t_\mu^2 + t_\delta^2}, \quad \sigma_t^2 = \frac{t_\delta^2}{3} - \frac{4 t_\delta^4 \left(12 t_\mu^2 - t_\delta^2\right)}{15 \left(3 t_\mu^2 + t_\delta^2\right)^2}, \quad \sigma_r^2 = \dot{r}^2 \left(\frac{t_\mu^2}{4} + \frac{5 t_\delta^2}{12} - \frac{4 t_\delta^4}{15 \left(3 t_\mu^2 + t_\delta^2\right)}\right)$$

where $t_\mu = (t_0 + t_1)/2$ and $t_\delta = (t_1 - t_0)/2$.

In this way, given the interval $[t_0, t_1]$ and the radius $\dot{r}$, we can compute $\mu_t$, $\sigma_t^2$, and $\sigma_r^2$.

This Gaussian can then be transformed from the frustum's coordinate frame into world coordinates, giving the final multivariate Gaussian:

$$\boldsymbol{\mu} = \mathbf{o} + \mu_t \mathbf{d}, \quad \boldsymbol{\Sigma} = \sigma_t^2 \left(\mathbf{d}\mathbf{d}^\mathrm{T}\right) + \sigma_r^2 \left(\mathbf{I} - \frac{\mathbf{d}\mathbf{d}^\mathrm{T}}{\|\mathbf{d}\|_2^2}\right)$$
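Putting the two steps together, a minimal NumPy sketch of the frustum-to-Gaussian conversion (a direct transcription of the formulas above, not the authors' released code):

```python
import numpy as np

def conical_frustum_to_gaussian(o, d, r_dot, t0, t1):
    """Approximate the conical frustum between depths t0 and t1 with a
    3D Gaussian (mu, Sigma) in world coordinates."""
    t_mu, t_delta = (t0 + t1) / 2.0, (t1 - t0) / 2.0
    denom = 3.0 * t_mu**2 + t_delta**2
    mu_t = t_mu + 2.0 * t_mu * t_delta**2 / denom
    sigma_t2 = t_delta**2 / 3.0 \
        - 4.0 * t_delta**4 * (12.0 * t_mu**2 - t_delta**2) / (15.0 * denom**2)
    sigma_r2 = r_dot**2 * (t_mu**2 / 4.0 + 5.0 * t_delta**2 / 12.0
                           - 4.0 * t_delta**4 / (15.0 * denom))
    mu = o + mu_t * d                       # mean, lifted to world coordinates
    dd = np.outer(d, d)
    Sigma = sigma_t2 * dd + sigma_r2 * (np.eye(3) - dd / np.dot(d, d))
    return mu, Sigma
```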

3.1.2 Positional Encoding

Recall NeRF's positional encoding:

$$\gamma(\mathbf{x}) = \left[\sin(\mathbf{x}), \cos(\mathbf{x}), \ldots, \sin\left(2^{L-1}\mathbf{x}\right), \cos\left(2^{L-1}\mathbf{x}\right)\right]^\mathrm{T}$$
This encoding can be written equivalently in matrix form:

$$\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 & 2 & 0 & 0 & \cdots & 2^{L-1} & 0 & 0 \\ 0 & 1 & 0 & 0 & 2 & 0 & \cdots & 0 & 2^{L-1} & 0 \\ 0 & 0 & 1 & 0 & 0 & 2 & \cdots & 0 & 0 & 2^{L-1} \end{bmatrix}^\mathrm{T}, \quad \gamma(\mathbf{x}) = \begin{bmatrix} \sin(\mathbf{P}\mathbf{x}) \\ \cos(\mathbf{P}\mathbf{x}) \end{bmatrix}$$
Applying this linear map to the Gaussian above, its mean and covariance become:

$$\boldsymbol{\mu}_\gamma = \mathbf{P}\boldsymbol{\mu}, \quad \boldsymbol{\Sigma}_\gamma = \mathbf{P}\boldsymbol{\Sigma}\mathbf{P}^\mathrm{T}$$
Using the closed-form expectations of $\sin$ and $\cos$ over a Gaussian:

$$\mathrm{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}[\sin(x)] = \sin(\mu)\exp\!\left(-\tfrac{1}{2}\sigma^2\right), \quad \mathrm{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}[\cos(x)] = \cos(\mu)\exp\!\left(-\tfrac{1}{2}\sigma^2\right)$$
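These identities are easy to sanity-check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.3, 0.7
samples = rng.normal(mu, sigma, size=1_000_000)
print(np.sin(samples).mean())                # Monte Carlo estimate
print(np.sin(mu) * np.exp(-0.5 * sigma**2))  # closed form, ~same value
```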
the integrated positional encoding (IPE) is then defined as the expected positional encoding over this Gaussian:

$$\gamma(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathrm{E}_{\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_\gamma, \boldsymbol{\Sigma}_\gamma)}[\gamma(\mathbf{x})] = \begin{bmatrix} \sin(\boldsymbol{\mu}_\gamma) \circ \exp\!\left(-\tfrac{1}{2}\,\mathrm{diag}(\boldsymbol{\Sigma}_\gamma)\right) \\ \cos(\boldsymbol{\mu}_\gamma) \circ \exp\!\left(-\tfrac{1}{2}\,\mathrm{diag}(\boldsymbol{\Sigma}_\gamma)\right) \end{bmatrix}$$

Only the diagonal of $\boldsymbol{\Sigma}_\gamma$ is needed, so the computation can be simplified:

$$\mathrm{diag}(\boldsymbol{\Sigma}_\gamma) = \left[\mathrm{diag}(\boldsymbol{\Sigma}),\, 4\,\mathrm{diag}(\boldsymbol{\Sigma}),\, \ldots,\, 4^{L-1}\,\mathrm{diag}(\boldsymbol{\Sigma})\right]^\mathrm{T}$$

$$\mathrm{diag}(\boldsymbol{\Sigma}) = \sigma_t^2\,(\mathbf{d} \circ \mathbf{d}) + \sigma_r^2 \left(\mathbf{1} - \frac{\mathbf{d} \circ \mathbf{d}}{\|\mathbf{d}\|_2^2}\right)$$
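Combining the pieces, a minimal sketch of the IPE feature computation (our own illustration using the diagonal shortcut above, not the released code):

```python
import numpy as np

def integrated_pos_enc(mu, diag_sigma, L):
    """IPE features for a Gaussian with mean `mu` (3,) and covariance
    diagonal `diag_sigma` (3,), using L frequency bands."""
    scales = 2.0 ** np.arange(L)                          # 1, 2, ..., 2^(L-1)
    mu_g = (scales[:, None] * mu).ravel()                 # P @ mu
    var_g = (scales[:, None] ** 2 * diag_sigma).ravel()   # diag(P Sigma P^T)
    damp = np.exp(-0.5 * var_g)                           # high-freq attenuation
    return np.concatenate([np.sin(mu_g) * damp, np.cos(mu_g) * damp])
```

Note how a large variance in `diag_sigma` drives `damp` toward zero at high frequencies: that is the built-in pre-filter.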

3.2 Architecture

Mip-NeRF differs from NeRF only in cone tracing and IPE encoding. Let's compare the two methods on these aspects:

[Figure: NeRF's point sampling with PE vs. Mip-NeRF's conical-frustum sampling with IPE]

3.2.1 Why Multi-resolution

In NeRF, samples are points along a ray, and each point is queried to render. In Mip-NeRF, a cone is cast instead, and each pair of consecutive sample depths bounds a conical frustum; the ellipses in the figure above show the 3D Gaussians fitted to these frustums. This encoding and sampling scheme is what gives Mip-NeRF its multiscale ability: the scale is explicitly encoded into the input features, and the network learns from it. By featurizing a volume rather than a point, Mip-NeRF achieves a better anti-aliasing effect than NeRF (see the sketch below).
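The sampling pipeline is then simply: sample depths, pair consecutive depths into frustums, and featurize each frustum. A hypothetical sketch, reusing `conical_frustum_to_gaussian` from above:

```python
import numpy as np

t_vals = np.linspace(2.0, 6.0, 65)               # 65 depths -> 64 frustums
o = np.zeros(3)                                  # camera center
d = np.array([0.0, 0.0, 1.0])                    # ray direction
r_dot = 0.01                                     # pixel radius (hypothetical)
gaussians = [conical_frustum_to_gaussian(o, d, r_dot, t0, t1)
             for t0, t1 in zip(t_vals[:-1], t_vals[1:])]
```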


3.2.2 High-frequency information

The effect of the hyperparameter $L$ (the number of positional-encoding frequencies) reflects Mip-NeRF's handling of high-frequency information.

[Figure: rendering quality vs. positional-encoding degree L for NeRF and Mip-NeRF]

The figure shows that as $L$ increases, more high-frequency components enter the encoding and NeRF's quality drops sharply, while Mip-NeRF remains stable. The reason is visible in the IPE formula: as the sampled volume grows, $\mathrm{diag}(\boldsymbol{\Sigma}_\gamma)$ grows with it, and the factor $\exp\!\left(-\tfrac{1}{2}\,\mathrm{diag}(\boldsymbol{\Sigma}_\gamma)\right)$ smooths the high-frequency components away.
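A quick numeric illustration of this (the variance values are hypothetical, chosen only to show the trend):

```python
import numpy as np

# The damping factor exp(-0.5 * 4^k * sigma^2) per frequency band k:
for sigma2, label in [(1e-4, "near / narrow frustum"), (1e-1, "far / wide frustum")]:
    damp = np.exp(-0.5 * 4.0 ** np.arange(8) * sigma2)
    print(label, np.round(damp, 3))
# A narrow frustum keeps all bands close to 1; a wide frustum drives the
# high-frequency bands toward zero, acting as an automatic low-pass filter.
```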

This scale-dependent cone behavior lets Mip-NeRF automatically filter out high-frequency information when rendering distant views, alleviating aliasing (i.e., suppressing high-frequency components of the scene), while restoring high-frequency detail for close-up views.

3.2.3 Different sampling

In NeRF, coarse and fine sample sets had to be fed to two separate MLPs trained under a combined loss. With Mip-NeRF's ability to handle multiple scales, a single MLP suffices: the same network adaptively covers both stages.
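A hedged sketch of what the single-MLP objective can look like (`mlp_render` is a hypothetical stand-in for a full volumetric rendering pass; the paper down-weights the coarse term with a factor λ = 0.1):

```python
import numpy as np

def mipnerf_loss(mlp_render, rays, gt_colors, lam=0.1):
    """One multiscale MLP renders BOTH sample sets; the coarse loss is
    down-weighted so training emphasizes the fine rendering."""
    c_coarse = mlp_render(rays, "coarse")   # same network, coarse samples
    c_fine = mlp_render(rays, "fine")       # same network, fine samples
    return (lam * np.sum((gt_colors - c_coarse) ** 2)
            + np.sum((gt_colors - c_fine) ** 2))
```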

Overall, Mip-NeRF exploits the volumetric representation afforded by cone encoding to learn across scales and achieve a better anti-aliasing effect than NeRF.

Reference

[1] Computer Graphics Seven: Texture Mapping and Mipmap Technology - Zhihu (zhihu.com)

[2] Mip-NeRF paper notes - Zhihu (zhihu.com)

[3] [Review] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields (velog.io)

[4] Mip-NeRF: Anti-Aliasing Multiscale Neural Radiance Fields, ICCV 2021 - tzc_fly's blog, CSDN

[5] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields - IEEE Conference Publication, IEEE Xplore
