Zip-NeRF

In 2020, researchers from the University of California, Berkeley and Google open-sourced an influential line of research on converting 2D images into 3D models: NeRF. Given a handful of static images, it can synthesize photorealistic renderings of a scene from novel viewpoints, and the results are striking.


Three years later, the same team produced something even more impressive: in a work called "Zip-NeRF", they reconstructed an entire home, producing fly-throughs that look like drone aerial footage.

The authors explain that Zip-NeRF combines scale-aware anti-aliased NeRF with fast grid-based NeRF training to solve the aliasing problems that arise when training neural radiance fields. Compared with previous techniques, Zip-NeRF reduces error rates by 8%-76% and trains 22x faster.

This technology could find applications in VR, such as visiting museums or touring homes online.

  • Paper address: https://arxiv.org/pdf/2304.06706.pdf

  • Project address: https://jonbarron.info/zipnerf/

Paper overview

In a Neural Radiance Field (NeRF), a neural network is trained to model a volumetric representation of a 3D scene, so that novel views of the scene can be rendered by casting rays through the volume. NeRF has proven to be an effective tool for tasks such as view synthesis, generative media, robotics, and computational photography.

Both mip-NeRF 360 and instant-NGP (iNGP) follow the basic form of NeRF: a pixel is rendered by casting a 3D ray, featurizing positions at distances t along that ray, feeding those features to a neural network, and compositing the network's outputs into a color. Training proceeds by repeatedly casting rays corresponding to pixels in the training images and minimizing (via gradient descent) the error between each rendered color and the observed color.
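
To make that concrete, here is a minimal JAX sketch of the volume-rendering compositing step shared by NeRF, mip-NeRF 360, and iNGP (function and variable names are illustrative, not from the paper's code):

```python
import jax.numpy as jnp

def composite_ray(densities, colors, t):
    """Composite one ray's samples into a color via the standard NeRF
    volume-rendering quadrature.

    densities: (n,) non-negative volume densities, one per interval.
    colors:    (n, 3) RGB values emitted in each interval.
    t:         (n + 1,) interval endpoints along the ray.
    """
    delta = t[1:] - t[:-1]                          # interval lengths
    tau = densities * delta                         # optical depth per interval
    alpha = 1.0 - jnp.exp(-tau)                     # per-interval opacity
    # Transmittance: probability the ray reaches each interval unoccluded.
    trans = jnp.exp(-jnp.concatenate([jnp.zeros(1), jnp.cumsum(tau)[:-1]]))
    weights = alpha * trans                         # sums to at most 1
    return (weights[:, None] * colors).sum(axis=0), weights
```

The per-interval weights returned here are the same histogram weights that reappear later in the proposal-supervision discussion.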

Mip-NeRF 360 and instant-NGP differ significantly in how coordinates along a ray are parameterized. In mip-NeRF 360, a ray is subdivided into a set of intervals [t_i, t_{i+1}], each representing a conical frustum whose shape is approximated by a multivariate Gaussian; the expected positional encoding of that Gaussian is then used as input to a large MLP [3]. In contrast, instant-NGP trilinearly interpolates features for individual positions from a hierarchy of 3D grids at different resolutions, and feeds the resulting feature vector to a small MLP. The model proposed by the authors combines the overall framework of mip-NeRF 360 with the featurization of instant-NGP, but naively combining the two introduces two forms of aliasing:

1. Instant-NGP's feature-grid approach is not compatible with mip-NeRF 360's scale-aware integrated positional encoding, so the features iNGP produces are aliased with respect to spatial scale, which yields aliased renderings. The researchers address this by introducing a multisampling-like approach that pre-filters the instant-NGP features.

2. Using instant-NGP dramatically speeds up training, but this exposes a problem with mip-NeRF 360's online-distillation approach, leading to highly visible "z-aliasing" (aliasing along rays), in which scene content erratically disappears and reappears as the camera moves. The researchers address this with a new loss function that pre-filters along each ray during online distillation.

Method overview

1. Spatial Anti-Aliasing:

The features used by mip-NeRF approximate the integral of positional encoding over a sub-volume, which in mip-NeRF is a conical frustum along the ray's cone. The resulting Fourier features have small amplitudes whenever the period of a sinusoid is smaller than the standard deviation of the Gaussian, so the features represent the sub-volume's spatial position only at wavelengths larger than its size. Because these features encode both position and scale, an MLP that consumes them can learn a multi-scale representation of the 3D scene that renders anti-aliased images. Grid-based representations like iNGP instead query trilinearly interpolated features at individual points rather than over sub-volumes, which leaves the trained model unable to reason about scale, and hence prone to aliasing.
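
A minimal sketch of that scale-aware encoding, reduced to one coordinate dimension for clarity (the diagonal form of mip-NeRF's integrated positional encoding; names are illustrative):

```python
import jax.numpy as jnp

def integrated_pos_enc(mean, var, num_freqs=10):
    """Diagonal form of mip-NeRF's integrated positional encoding for a
    single coordinate. Each sinusoid of frequency 2^l is attenuated by
    exp(-0.5 * (2^l)^2 * var), so frequencies whose period is small
    relative to the Gaussian's standard deviation fade toward zero."""
    freqs = 2.0 ** jnp.arange(num_freqs)        # frequencies 1, 2, 4, ...
    scaled = mean * freqs
    decay = jnp.exp(-0.5 * freqs ** 2 * var)    # scale-aware attenuation
    return jnp.concatenate([decay * jnp.sin(scaled), decay * jnp.cos(scaled)])
```

The `decay` factor is exactly what plain trilinear interpolation into an iNGP grid lacks: a point query carries no notion of the sub-volume's size.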

To solve this problem, the researchers turn each conical frustum into a set of isotropic Gaussians using multisampling and feature down-weighting: the anisotropic frustum is first converted into a set of points approximating its shape, and each point is then assigned an isotropic Gaussian scale. This isotropy assumption makes it possible to approximate the true integral of the feature grid over the sub-volume by exploiting the fact that the values in the grid have zero mean. Averaging these down-weighted features yields scale-aware, pre-filtered features from the iNGP grid. See the figure below for a visualization.

Anti-aliasing has been studied extensively in the graphics literature. Mip maps (from which mip-NeRF takes its name) precompute a data structure that allows fast anti-aliased lookups, but it is unclear how to apply that approach to iNGP's underlying hash data structure. Supersampling anti-aliases by simply casting more samples, achieving a mip-map-like effect at much greater cost, with many samples wasted. Multisampling techniques instead construct a small set of samples and pool information from them into an aggregate representation that is fed to an expensive rendering process, a strategy similar to the authors'. Another related method is the elliptical weighted average, which approximates an elliptical kernel with isotropic samples placed along the ellipse's major axis.

Given an interval [t_i, t_{i+1}) along the ray, the researchers want to construct a set of multisamples that approximates the shape of its conical frustum. As in graphics applications of multisampling under a limited sample budget, they hand-design a pattern for their use case: n points distributed along a spiral that loops around the ray's axis m times, spaced linearly in t (a sketch of the construction follows the next paragraph).


These 3D coordinates are then rotated into world coordinates by multiplying by an orthonormal basis whose third vector is the ray direction and whose first two vectors form an arbitrary frame perpendicular to the view direction, and finally translated by the ray origin. When n ≥ 3 and n and m are coprime, the sample mean and covariance of each set of multisamples is guaranteed to exactly match the mean and covariance of the conical frustum's Gaussian, similar to the Gaussians used in mip-NeRF.
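
An illustrative JAX sketch of that construction; the linear spacing in t, the radius scaling, and the default constants here are assumptions for illustration, not the paper's exact pattern:

```python
import jax.numpy as jnp

def spiral_multisamples(origin, direction, t0, t1, radius_dot, n=6, m=5):
    """Illustrative spiral multisample pattern for one conical frustum.

    origin, direction: (3,) ray origin and unit ray direction.
    t0, t1:            frustum interval endpoints along the ray.
    radius_dot:        cone radius per unit distance along the ray.
    n, m:              n points looping m times around the ray axis
                       (n >= 3 with n, m coprime, per the paper).
    """
    j = jnp.arange(n)
    t_j = t0 + (t1 - t0) * (j + 0.5) / n      # linearly spaced in t (assumed)
    theta = 2.0 * jnp.pi * m * j / n          # m loops around the axis
    r_j = radius_dot * t_j                    # cone radius at t_j (assumed scaling)
    local = jnp.stack([r_j * jnp.cos(theta),  # points in the ray's frame,
                       r_j * jnp.sin(theta),  # with the z-axis = ray direction
                       t_j], axis=-1)
    # Any orthonormal frame whose third vector is the ray direction.
    up = jnp.where(jnp.abs(direction[2]) < 0.9,
                   jnp.array([0.0, 0.0, 1.0]), jnp.array([1.0, 0.0, 0.0]))
    x_axis = jnp.cross(up, direction)
    x_axis = x_axis / jnp.linalg.norm(x_axis)
    y_axis = jnp.cross(direction, x_axis)
    basis = jnp.stack([x_axis, y_axis, direction], axis=-1)   # (3, 3)
    return origin + local @ basis.T           # rotate into world, then translate
```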

The researchers use these n multisamples {x_j} as the means of isotropic Gaussians, one per sample, with standard deviation σ_j. They set σ_j to the frustum radius at t_j times a hyperparameter (0.35 in experiments). Because the iNGP grid requires input coordinates to lie in a bounded domain, they apply mip-NeRF 360's contraction function. And because these Gaussians are isotropic, the contraction can be performed with a simplified and efficient version of the Kalman-filter-style linearization used by mip-NeRF 360, detailed in the paper's supplement.
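
A sketch of that step: the contraction below is mip-NeRF 360's published formula, while rescaling σ by det(J)^(1/3) (the geometric mean of the Jacobian's eigenvalues) is an assumed simplification of the linearized update, valid only because the Gaussians are isotropic; consult the paper's supplement for the exact form:

```python
import jax
import jax.numpy as jnp

def contract(x):
    """mip-NeRF 360's scene contraction: identity inside the unit ball,
    everything outside squashed into a ball of radius 2."""
    norm = jnp.maximum(jnp.linalg.norm(x), 1e-9)
    return jnp.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * x / norm)

def contract_isotropic_gaussian(mean, sigma):
    """Push an isotropic Gaussian through the contraction, rescaling its
    std by the geometric mean of the local Jacobian's eigenvalues
    (an assumed stand-in for the paper's linearized update)."""
    jac = jax.jacfwd(contract)(mean)                    # 3x3 Jacobian at the mean
    scale = jnp.abs(jnp.linalg.det(jac)) ** (1.0 / 3.0)
    return contract(mean), sigma * scale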

To de-alias each individual multisample, the features at each scale are re-weighted in a novel way, according to how each sample's isotropic Gaussian compares to each grid cell: if the Gaussian is much larger than the cell being interpolated, the interpolated feature is likely unreliable and should be down-weighted. Mip-NeRF's IPE features have a similar interpretation.

In iNGP, a coordinate x is featurized at each grid level V_ℓ by scaling x by the grid's linear size n_ℓ and trilinearly interpolating into V_ℓ to obtain a c-length vector. Instead, the researchers interpolate the set of multisampled isotropic Gaussians with means {x_j} and standard deviations {σ_j}. By reasoning about Gaussian CDFs, the fraction of each Gaussian's PDF that lies inside the [−1/2n, 1/2n]^3 cell of V_ℓ being interpolated can be computed, giving a scale-dependent down-weighting factor ω_{j,ℓ}. The researchers apply weight decay on {V_ℓ} to encourage the grid values to be normally distributed with zero mean. This zero-mean assumption lets them approximate the expected grid feature for each multisample Gaussian as ω_{j,ℓ}·f_{j,ℓ} + (1 − ω_{j,ℓ})·0 = ω_{j,ℓ}·f_{j,ℓ}. The feature corresponding to the whole conical frustum can then be approximated by averaging the down-weighted interpolated features over the multisamples: f_ℓ = mean_j(ω_{j,ℓ}·f_{j,ℓ}).
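
A hedged sketch of that down-weighting and averaging; the erf expression is an assumed closed form for the CDF-based mass fraction the text describes (it goes to 1 for Gaussians much smaller than a cell and to 0 for Gaussians much larger), not necessarily the paper's exact formula:

```python
import jax.numpy as jnp
from jax.scipy.special import erf

def downweight(sigma, grid_size):
    """Scale-dependent factor omega for an isotropic Gaussian of std sigma
    against a grid whose cells have side 1 / grid_size (assumed erf form)."""
    return erf(1.0 / jnp.sqrt(8.0 * (sigma * grid_size) ** 2))

def prefiltered_feature(features, sigmas, grid_size):
    """Combine per-multisample interpolated features f_{j,l} into one
    scale-aware feature: omega * f + (1 - omega) * 0, averaged over j,
    relying on the grid values having zero mean.

    features: (n_samples, c) trilinearly interpolated features.
    sigmas:   (n_samples,) per-multisample Gaussian stds.
    """
    omega = downweight(sigmas, grid_size)        # (n_samples,)
    return (omega[:, None] * features).mean(axis=0)
```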

2. Z-Aliasing and Proposal Supervision:

While the multisampling and down-weighting methods described above are effective at reducing spatial aliasing, there is an additional source of aliasing along each ray: z-aliasing. It stems from the proposal MLP that mip-NeRF 360 uses to learn a coarse upper bound on scene geometry. During training and rendering, this proposal MLP is repeatedly evaluated along the ray to produce histograms from which the next round of samples is drawn; only the final set of samples is rendered by the NeRF MLP. Mip-NeRF 360 showed that this scheme significantly improves speed and rendering quality over earlier strategies that trained a single mip-NeRF or multiple NeRFs, all supervised by image-reconstruction loss. The researchers found, however, that the proposal MLP in mip-NeRF 360 tends to learn a non-smooth mapping from input coordinates to output volume density. The result is an artifact in which scene content skips along the ray, as shown in the image above. Although this artifact is subtle in mip-NeRF 360, it becomes common and visually prominent when the proposal MLP is replaced with an iNGP backend (which is what gives the new model its fast optimization), especially when the camera translates along its z-axis.
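
For context, online distillation repeatedly resamples ray intervals from the proposal histogram. A minimal inverse-CDF resampler looks roughly like this (standard hierarchical sampling; an illustrative stand-in, not the paper's implementation):

```python
import jax.numpy as jnp

def resample_from_proposal(t_edges, weights, u):
    """Draw new sample locations along a ray by inverting the CDF of a
    proposal histogram.

    t_edges: (n + 1,) histogram endpoints along the ray.
    weights: (n,) non-negative proposal weights.
    u:       (k,) sorted uniform samples in [0, 1).
    """
    pdf = weights / jnp.maximum(weights.sum(), 1e-10)
    cdf = jnp.concatenate([jnp.zeros(1), jnp.cumsum(pdf)])
    idx = jnp.clip(jnp.searchsorted(cdf, u, side='right') - 1,
                   0, weights.shape[0] - 1)
    # Linearly interpolate within each selected bin.
    frac = (u - cdf[idx]) / jnp.maximum(cdf[idx + 1] - cdf[idx], 1e-10)
    return t_edges[idx] + frac * (t_edges[idx + 1] - t_edges[idx])
```

If the proposal weights change abruptly as the camera moves, the samples drawn this way jump too, which is the z-aliasing failure mode described above.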

In the figure below, the researchers visualize proposal supervision for a toy training instance in which a narrow NeRF histogram (blue) translates along a ray relative to a coarse proposal histogram (orange). (a) The loss used by mip-NeRF 360 is piecewise constant, while (b) the new model's loss is smooth, because the NeRF histogram is blurred into a piecewise-linear spline (green). This pre-filtered loss allows an anti-aliased proposal distribution to be learned.

Anti-Aliased Interlevel Loss:

The proposal-supervision method inherited from mip-NeRF 360 requires a loss function that takes as input a step function produced by the NeRF, (s, w), and a similar step function produced by the proposal model, (ŝ, ŵ). Both are histograms, where s and ŝ are vectors of interval endpoints and w and ŵ are weight vectors summing to ≤ 1; w_i reflects how much visible scene content falls in interval i of the step function. Each s_i is a normalized version of the true metric distance t_i according to some normalization function g(·), which is discussed below. Note that s and ŝ are not aligned: the endpoints of the two histograms differ.

Normalizing Metric Distance:

Many NeRF methods require a function that maps metric distance t ∈ [0, ∞) to normalized distance s ∈ [0, 1]. Left: the power transformation P(x, λ) interpolates between common curves, such as linear, logarithmic, and inverse, by varying λ, while remaining linear near the origin. Right: constructing a curve that transitions from linear to inverse distance, which supports scene content close to the camera.
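
One curve family consistent with that description is sketched below: linear near the origin, log-like as λ → 0, and bounded inverse-like at λ = −1. The exact definition used by Zip-NeRF should be checked against the paper; this is an assumption:

```python
import jax.numpy as jnp

def power_transform(x, lam):
    """A power-style curve family P(x, lam) with unit slope at the origin.
    lam = 1 gives linear, lam -> 0 gives log(1 + x), and lam = -1 gives a
    bounded inverse-like curve 2x / (x + 2). Assumed form, not verified
    against the paper."""
    if lam == 1.0:
        return x
    if lam == 0.0:
        return jnp.log1p(x)
    a = abs(lam - 1.0)
    return (a / lam) * ((x / a + 1.0) ** lam - 1.0)
```

For example, `power_transform(x, -1)` stays close to x for small x but saturates at 2 as x grows, which is the "linear near the camera, inverse far away" behavior the caption describes.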

Experimental results

The model is implemented in JAX on top of the mip-NeRF 360 codebase, with iNGP's voxel grid and hash-table structure reimplemented to replace the large MLP used by mip-NeRF 360. Aside from the anti-aliasing changes introduced above and a few additional modifications, the overall architecture is the same as mip-NeRF 360.

Performance on the multiscale version of the 360 Dataset, training and evaluating on multiscale images. Red, orange, and yellow highlights indicate the first, second, and third best-performing technique for each metric. The proposed model significantly outperforms both baselines, especially the iNGP-based baseline at coarse scales, where the new model reduces error by 54%-76%. Rows A-M are ablations of the model; see the appendix of the paper for details.

Although the 360 dataset contains plenty of challenging scene content, it cannot measure rendering quality as a function of scale, because its images are captured by orbiting the camera around a central object at a roughly constant distance; a model trained on it never has to handle the central object at different image resolutions or distances. The researchers therefore adopt a more challenging evaluation protocol, similar to the multiscale Blender dataset used with mip-NeRF: each image is turned into a set of four images downsampled by factors of [1, 2, 4, 8], which simulates additional training/test views whose cameras have zoomed away from the scene's center. During training, each ray's data term is multiplied by its scale factor, and at test time each scale is evaluated separately. This makes reconstruction much harder, since the model must generalize across scales, and it induces significant aliasing artifacts, especially at coarse scales.
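
A minimal sketch of that setup; the downsampling factors come from the text, while the area-averaging and the exact loss weighting are assumptions:

```python
import jax.numpy as jnp

def make_multiscale(image):
    """Build [1, 2, 4, 8]-downsampled copies of a training image by 2x2
    area averaging (illustrative stand-in for the paper's pipeline;
    assumes height and width are divisible by 8)."""
    copies = [image]
    for _ in range(3):
        h, w, c = copies[-1].shape
        copies.append(copies[-1].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3)))
    return copies

def scaled_data_term(rendered, target, scale):
    """Per-ray data term multiplied by the ray's scale factor (1, 2, 4,
    or 8) so coarse-scale rays are not underweighted; the exact weighting
    used in the paper is an assumption here."""
    return scale * jnp.sum((rendered - target) ** 2)
```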

In Table 1, the newly proposed model is evaluated against iNGP, mip-NeRF 360, a mip-NeRF 360 + iNGP baseline, and many ablations. Although mip-NeRF 360 performs reasonably (since it can train on multiple scales), the new model reduces error by 8.5% at the finest scale and 17% at the coarsest scale, while training 22x faster. The mip-NeRF 360 + iNGP baseline, which has no mechanism for anti-aliasing or reasoning about scale, performs poorly: the new model's RMSE is 18% lower at the finest scale and 54% lower at the coarsest, and its DSSIM and LPIPS are both 76% lower at the coarsest scale. This improvement is visible in the figure below. The mip-NeRF 360 + iNGP baseline generally outperforms iNGP itself (except at the coarsest scales), as expected.

Summary

The researchers propose Zip-NeRF, a model that combines the advantages of scale-aware anti-aliased NeRF and fast grid-based NeRF training. By leveraging ideas from multisampling and pre-filtering, the model achieves error rates 8%-76% lower than previous techniques while training 22x faster than mip-NeRF 360 (the prior state of the art on comparable benchmarks). The researchers hope that the tools and analyses presented here for both kinds of aliasing (spatial aliasing in the grid-based mapping from spatial coordinates to color and density, and z-aliasing in the loss function used during online distillation along each ray) will further improve the quality, speed, and sample efficiency of NeRF-style inverse rendering.
