[NeRF Summary] NeRF-based 3D Vision Annual Progress Report

Tsinghua University: Liu Yebin

Original Link: [NeRF Summary] NeRF-based 3D Vision Annual Progress Report – Liu Yebin, Tsinghua University (by Small Sample Vision and Intelligence)


01 Background introduction

NeRF

NeRF: a novel view synthesis method based on differentiable volume rendering and a neural-field 3D representation.

Fig 1. Free-viewpoint rendering of different scenes

Two core elements of NeRF:

  • Implicit neural field: a coordinate-based fully connected network that represents the color field and the volume density field
  • Volume rendering formula: renders the color field and volume density field into an image

Fig 2. NeRF core process. See [1] for more detailed process
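The two core elements above combine in the discrete volume rendering quadrature. A minimal numerical sketch with toy values (not the paper's implementation):

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Discrete quadrature of the NeRF volume rendering integral.

    sigmas: (N,) volume densities at the samples along one ray
    colors: (N, 3) RGB values at those samples
    deltas: (N,) distances between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)        # per-segment opacity
    trans = np.cumprod(1.0 - alphas + 1e-10)       # accumulated transmittance
    trans = np.concatenate([[1.0], trans[:-1]])    # shift so the first sample sees T = 1
    weights = trans * alphas                       # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)

# A ray hitting an opaque red sample early should render nearly pure red:
sigmas = np.array([0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
deltas = np.array([0.5, 0.5, 0.5])
rgb = volume_render(sigmas, colors, deltas)
```

Because every operation here is differentiable, gradients of a photometric loss on `rgb` flow back to whatever network predicted `sigmas` and `colors`.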

NeRF and 3D vision

NeRF's core optimization mechanism: end-to-end differentiable rendering (a compact and efficient representation of 3D visual information)

More fundamentally, it establishes the connection between two-dimensional images and the three-dimensional world.

Fig 3. NeRF models the visual imaging mechanism and is thus closer to the essence of the visual world

3D Representation and Differentiable Rendering

Fig 4. Comparison of NeRF and traditional 3D representation methods.

Application value

Application scenarios include:

  • 3D content generation and editing
  • Robot vision, localization, and navigation
  • 3D reconstruction and rendering
  • Realistically driven digital humans
  • City-scale street-view maps
  • Physical simulation
  • ……

Since its proposal in 2020, NeRF has become one of the fundamental research paradigms in 3D vision, driving progress in tasks such as 3D reconstruction, rendering, localization, generation, and understanding.

02 Research progress based on NeRF

Efficiency optimization

Research motivation: naive NeRF has long training and rendering times. The computational bottleneck: complexity = per-sample network query time x number of sample points.

01 Use sparse geometric expression

Solution: use sparse geometric representations (sparse voxels, octrees, surfaces, etc.) to exclude sampling regions that contribute nothing to the integral, reducing the number of samples.

  • NSVF, SNeRG, Plenoxels, PlenOctrees: delete voxels in empty regions and refine voxels near object surfaces, yielding sparse-voxel or octree representations.

  • MobileNeRF: bakes NeRF into a sparse triangle-mesh surface geometry and uses rasterization to render in real time on mobile devices.
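The skipping idea can be illustrated with a hypothetical binary occupancy grid (all names and resolutions here are illustrative, not taken from the papers above): samples falling in empty voxels are discarded before the costly MLP query, shrinking the "number of sample points" factor in the cost.

```python
import numpy as np

def filter_samples(points, occupancy, res):
    """points: (N, 3) in [0, 1); occupancy: (res, res, res) bool grid.
    Returns the points inside occupied voxels and the keep-mask."""
    idx = np.clip((points * res).astype(int), 0, res - 1)
    keep = occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]
    return points[keep], keep

res = 4
occ = np.zeros((res, res, res), dtype=bool)
occ[2, 2, 2] = True                      # only one voxel is "occupied"
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
kept, mask = filter_samples(pts, occ, res)
# on average only 1/res^3 of the samples survive and need an MLP query
```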

02 Voxelization

Solution idea: store high-dimensional features or lightweight networks in voxel space to achieve low-complexity queries.

  • KiloNeRF: voxelizes space with a lightweight network per voxel, significantly reducing computation and accelerating rendering by roughly a thousand times.

  • DVGO: directly optimizes a voxel-based NeRF density field and color feature field, with training strategies such as low-density initialization of the voxel grid and activation after interpolation, achieving minute-level training convergence.
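Grid-based methods of this kind replace a deep MLP query with trilinear interpolation of stored features; a self-contained sketch of that lookup:

```python
import numpy as np

def trilerp(grid, p):
    """Trilinear interpolation of a feature grid at a continuous point.

    grid: (R, R, R, C) voxel features; p: (3,) coordinates in [0, R-1].
    """
    i = np.minimum(np.floor(p).astype(int), grid.shape[0] - 2)  # keep i+1 valid
    f = p - i                                                    # fractional part
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                out = out + w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return out

# Sanity check: trilinear interpolation reproduces a linear field exactly.
R = 4
grid = np.zeros((R, R, R, 1))
for x in range(R):
    for y in range(R):
        for z in range(R):
            grid[x, y, z, 0] = x + y + z
val = trilerp(grid, np.array([1.5, 0.25, 0.75]))   # 1.5 + 0.25 + 0.75 = 2.5
```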

03 Voxel compression (hash table)

Solution idea: use hashing to compress the storage of high-resolution voxel grids.

  • InstantNGP: builds multi-scale voxel grids that store high-dimensional features and compresses the high-resolution grids with hashing, achieving high resolution and fast rendering at low complexity.
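A rough sketch of the multi-resolution hash lookup, using the spatial-hash primes reported in the Instant-NGP paper; corner interpolation is omitted and all table sizes are illustrative:

```python
import numpy as np

def hash_corner(corner, T):
    """XOR-of-primes spatial hash over integer grid coordinates, modulo the
    table size T (primes as reported in the Instant-NGP paper)."""
    h = 0
    for c, prime in zip(corner, (1, 2654435761, 805459861)):
        h ^= int(c) * prime
    return h % T

def hash_grid_feature(p, res, table):
    """Feature of the voxel's lower corner at one resolution level
    (trilinear blending over all 8 corners omitted for brevity)."""
    corner = np.floor(np.asarray(p) * res).astype(int)
    return table[hash_corner(corner, table.shape[0])]

# Multi-scale: concatenate lookups from several resolutions into one feature.
levels = (16, 32, 64)              # illustrative resolutions
T, C = 2 ** 14, 2                  # hash table size and feature width
rng = np.random.default_rng(0)
tables = [rng.normal(size=(T, C)) for _ in levels]
p = (0.3, 0.7, 0.1)
feat = np.concatenate([hash_grid_feature(p, r, t) for r, t in zip(levels, tables)])
# a dense 64^3 grid would need 262,144 cells; each hash table has only 16,384
```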

04 Voxel decomposition

Solution idea: decompose the voxel grid into low-dimensional planar grid representations, reducing storage from cubic to quadratic.

  • EG3D: defines the voxel feature at a 3D coordinate as the sum of features from three orthogonal projection planes.
  • TensoRF: decomposes the voxel grid into a sum of low-rank tensors in vector-plane (vector-matrix) product form.
  • MERF: low-resolution voxels + high-resolution planar projections
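The storage saving of plane decomposition can be sketched with a hypothetical EG3D-style triplane lookup (nearest-cell sampling instead of bilinear, for brevity; sizes are illustrative):

```python
import numpy as np

def triplane_feature(planes, p, res):
    """planes: dict with keys 'xy', 'xz', 'yz', each (res, res, C);
    p: (3,) point in [0, 1). Nearest-cell sampling (bilinear omitted)."""
    x, y, z = np.minimum((p * res).astype(int), res - 1)
    return planes['xy'][x, y] + planes['xz'][x, z] + planes['yz'][y, z]

res, C = 64, 8
rng = np.random.default_rng(0)
planes = {k: rng.normal(size=(res, res, C)) for k in ('xy', 'xz', 'yz')}
f = triplane_feature(planes, np.array([0.2, 0.5, 0.9]), res)
# memory: 3 * res^2 * C floats versus res^3 * C for a dense voxel grid
```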

05 Voxel decomposition (4D promotion)

Solution idea: follow the 3D -> 2D decomposition idea and decompose 4D -> 2D.

  • Tensor4D: 4D grid -> three 3D grids -> 3x3 = 9 2D grids
  • HumanRF: 4D grid -> tensor product of four 3D grids with 1D grids, where the 3D grids use hash compression.
  • HexPlane, K-Planes: 4D grid -> pair up the (x, y, z, t) coordinates to obtain six 2D grids.
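A hypothetical K-Planes-style lookup over the six coordinate-pair planes (multiplicative feature fusion, nearest-cell sampling; all sizes are illustrative):

```python
import numpy as np
from itertools import combinations

def hexplane_feature(planes, q, res):
    """planes: dict keyed by coordinate pairs (a, b), each (res, res, C);
    q: (4,) point (x, y, z, t) with components in [0, 1)."""
    idx = np.minimum((q * res).astype(int), res - 1)
    feat = 1.0
    for a, b in combinations(range(4), 2):            # six pairs: xy, xz, xt, yz, yt, zt
        feat = feat * planes[(a, b)][idx[a], idx[b]]  # multiplicative fusion
    return feat

res, C = 32, 4
rng = np.random.default_rng(0)
planes = {pair: rng.normal(size=(res, res, C)) for pair in combinations(range(4), 2)}
q = np.array([0.1, 0.4, 0.8, 0.5])                    # (x, y, z, t)
f = hexplane_feature(planes, q, res)
# storage: 6 * res^2 * C entries versus res^4 * C for a dense 4D grid
```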

Dynamic modeling

Research motivation: extend NeRF to represent non-static content and perform novel view synthesis on dynamic scenes.
Early work: D-NeRF, Nerfies, HyperNeRF.
Solution: model the dynamic scene as a canonical space plus a deformation field; the deformation field maps the appearance observed in different frames back to the canonical space, decoupling appearance from motion.
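The canonical-space idea can be sketched with toy stand-in functions (neither "network" below is a trained model; the translation and the density blob are invented for illustration):

```python
import numpy as np

def deformation_field(x, t):
    """Toy stand-in deformation: undoes a translation that grows with time,
    mapping an observed point back to the canonical frame."""
    return x - np.array([0.1, 0.0, 0.0]) * t

def canonical_radiance(x):
    """Toy stand-in for the static canonical NeRF: a density blob centered
    at the canonical-space point (0.5, 0.5, 0.5)."""
    return float(np.exp(-np.sum((x - 0.5) ** 2) / 0.01))

# The blob appears shifted by +0.1 per unit time in observation space, yet a
# single canonical field explains every frame:
d0 = canonical_radiance(deformation_field(np.array([0.5, 0.5, 0.5]), t=0.0))
d1 = canonical_radiance(deformation_field(np.array([0.6, 0.5, 0.5]), t=1.0))
# d0 == d1: appearance (the blob) is decoupled from motion (the translation)
```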

Existing limitations and improvement directions:

01 Dynamic foreground perception

Research motivation: for real scenes with large motion and deformation captured by a monocular camera, existing deformation-field-based dynamic representations cannot accurately decouple object motion, making it difficult to recover high-quality dynamic texture.

Solution: enhance NeRF's perception of the dynamic foreground by changing the representation or introducing additional information.

  • FSDNeRF: constructs an implicit velocity field, introduces monocular inter-frame optical flow predictions, and applies temporal regularization to the velocity field.
  • NeRFPlayer: designs a temporally dependent dynamic residual NeRF to reduce the coupling between motion information and dynamic texture.
  • RoDynRF: introduces a dynamic NeRF to explicitly model foreground segmentation, and enhances synthesized appearance quality through joint camera pose optimization.

02 Voxelization

Solution idea: use voxels to store high-dimensional features or lightweight networks, achieving minute-level dynamic modeling and high-definition real-time rendering.

  • TiNeuVox: converts the canonical-space NeRF into an explicit voxel-based representation and uses a multi-scale feature sampling strategy to keep the voxels globally aware during optimization.
  • Fourier PlenOctrees: models per-frame voxel radiance parameters compactly via the discrete Fourier transform.
  • Dynamic MLP Maps: represents the 3D scene as a combination of voxel-level local lightweight networks, and uses a 2D convolutional hypernetwork to efficiently generate the per-frame MLP parameters.

Human body reconstruction and avatar generation

Research motivation: dynamic NeRF modeling methods are difficult to apply to scenarios with large-range human motion.
Early work: use the parametric human body model SMPL as a prior, establish large-scale skeletal motion correspondences between frames, and optimize a non-rigid deformation field together with a NeRF in the canonical pose.

01 Dynamic human avatar

Recent route: higher-quality drivable digital humans, with a focus on modeling dynamic clothing details.

  • Remelli et al.: introduce additional image-driven signals to provide richer appearance information.
  • AvatarReX: proposes local neural radiance fields together with local feature patches to encode fine-grained clothing details.
  • PoseVocab: proposes a pose-encoding vocabulary to encode high-frequency appearance changes of the human body under different poses.

02 Interaction between people, objects and scenes

Solutions:

  • Instant-NVR: combines non-rigid tracking with Instant-NGP to achieve online NeRF reconstruction of humans and objects.
  • HOSNeRF: introduces latent state codes to represent the different interaction states of humans, objects, and scenes.
  • Hou et al.: introduce latent codes for humans and objects to decouple human-object contact relations and synthesize human-object interactions under novel poses.

03 Digital Human Generation

Solution: SMPL + NeRF + (GAN/Diffusion)

  • AvatarCLIP: uses CLIP as a prior to generate static digital humans and motion sequences, respectively.
  • EVA3D: proposes a compositional human NeRF and learns a 3D human GAN in a canonical space.
  • DreamAvatar: with Stable Diffusion as a prior, constrains NeRF-rendered images to satisfy the semantic text input.

Face reconstruction and avatar generation

01 Sparse viewpoint reconstruction

Research motivation: in sparse-viewpoint face reconstruction, NeRF easily overfits each viewpoint, producing artifacts in novel view synthesis.

Solution: introduce priors such as large-scale face data, facial keypoints, and face templates to improve NeRF reconstruction quality.

  • LP3D (static + real-time): trains on face data generated by EG3D; from a single input image it infers a triplane-represented NeRF.
  • HAvatar (dynamic avatar): uses a 3DMM-projected triplane neural radiance field as a constraint to achieve high-quality dynamic head avatars.
  • NeRSemble (dynamic): introduces 3DMM expression parameters to construct an expression-conditioned semantic-space deformation field that fits complex expression dynamics.

02 Face avatar generation

Research motivation: dynamic NeRF reconstruction methods cannot drive the reconstructed face model's expressions and lip motion from audio or video.
Recent research: introduce pre-trained models to extend to single-image reconstruction; develop better NeRF representations and expression encodings.

  • Huang et al.: learn implicit expression parameters from speech; these are more expressive than traditional 3DMM expression coefficients.
  • OTAvatar: requires no video for training; from a single input frame, the NeRF model can be driven via a pre-trained EG3D generator.

Generalizable NeRF reconstruction

01 Reconstruction based on diffusion model

Research motivation: naive NeRF requires dense capture and independent training for each object or scene, so the goal is to infer a NeRF directly from sparse-viewpoint images.

Early work: learn NeRFs conditioned on spatially aligned image features from large-scale data.

Recent route: single-image NeRF reconstruction based on diffusion models

  • ReRDi: obtains semantic information about the input image from a pre-trained latent diffusion model and constrains novel-view renderings to conform to that semantic information.
  • GeNVS: proposes a 3D-aware diffusion model whose denoising process is conditioned on feature maps obtained by volume rendering.
  • Make-It-3D: proposes a two-stage optimization from NeRF to point cloud to further improve texture quality.

3D generation

Research motivation: leverage large-scale 2D image priors to obtain generative prior models of objects, supporting sparse-viewpoint reconstruction and various editing tasks.
Recent route: category-level object 3D generation -> GAN; general object 3D generation -> Diffusion

01 3D GAN category object generation

Research motivation: NeRF's rendering is differentiable, so network parameters can be optimized from 2D image supervision; NeRF is therefore combined with GANs to build generative neural radiance fields and learn 3D content generation.

Solution: train the radiance field's MLP network with a GAN adversarial training strategy to learn a generative neural radiance field from 2D images, controlling its geometry and texture through latent codes generated from random noise.

02 3D GAN category object generation (three-plane improvement)

Research motivation: 3D GANs are limited by the MLP's memory consumption and expressive power, so the generated results have low resolution.

Solution and innovation: a triplane 3D representation is proposed: the high-frequency signals of the neural radiance field are stored on three planes, lightening the MLP network. Without losing expressive power, this greatly reduces memory consumption and improves rendering speed. An efficient 2D StyleGAN generates triplanes with high-frequency detail to improve generation quality, and 2D super-resolution raises the rendering resolution.

03 3D GAN category object generation (super resolution)

Research motivation: 2D super-resolution networks entangle viewpoint information with image features, breaking 3D consistency.

Solution and innovation: replace 2D super-resolution with 3D super-resolution

  • GRAM-HD: defines a set of implicit surface manifolds in the neural radiance field and performs super-resolution on those surface manifolds.
  • Mimic3D: lets the generator's 3D rendering branch synthesize images that mimic those of its 2D super-resolution branch, enabling 3D GANs to generate high-quality images while maintaining strict 3D consistency.

04 General object 3D generation (2D lifting)

Research motivation: large 2D generative models have strong text-to-image ability, while NeRF can represent continuous, complex 3D objects and its rendering is differentiable, so the radiance field's network parameters can be optimized in reverse through 2D supervision, achieving 3D generation of general objects or scenes.

Solution: use a pre-trained 2D generative model as a prior and apply the score distillation sampling (SDS) loss: minimize the KL divergence between the distribution of NeRF's differentiably rendered images and the image distribution of the diffusion model, optimizing the NeRF parameters to achieve text-to-3D generation. Representative work: DreamFusion, Magic3D, Fantasia3D.
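A minimal sketch of the SDS update, with a toy "diffusion prior" standing in for the pre-trained model (the target image, noise schedule, and step sizes are all invented for illustration). The essential point: the noise-prediction residual is pushed back through the differentiable renderer only, never through the diffusion network itself.

```python
import numpy as np

TARGET = np.full(16, 0.8)          # toy "text-conditioned mode" of the prior

def render(theta):
    """Toy differentiable 'renderer': the image is theta itself, so the
    renderer Jacobian is the identity."""
    return theta

def denoiser(x_t, t):
    """Toy stand-in for a pre-trained model's noise prediction: it pulls
    noised samples toward TARGET."""
    return x_t - TARGET

rng = np.random.default_rng(0)
theta = np.zeros(16)               # "NeRF parameters"
for step in range(200):
    t = rng.uniform(0.1, 0.9)
    eps = rng.normal(size=16)
    x_t = render(theta) + t * eps  # noised rendering (toy noise schedule)
    grad = denoiser(x_t, t) - eps  # SDS gradient, identity renderer Jacobian
    theta -= 0.05 * grad
# theta drifts toward the prior's preferred image (close to 0.8 everywhere)
```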

Research motivation: the optimization goal of score distillation sampling (SDS) is to make a single NeRF's renderings maximize the likelihood under the pre-trained model's text-conditioned image distribution, so the NeRF is pushed toward a single mode of that distribution. Drawbacks: the generated 3D models are oversaturated, oversmoothed, and lack diversity.

Solution: the pre-trained model's text-conditioned image distribution corresponds to a set (of one or more) of NeRF distributions, so variational inference is performed over the NeRF parameters from a probabilistic perspective. Variational score distillation (VSD) changes the optimization target from a single NeRF to a distribution over NeRFs; it models this distribution with particles and iteratively optimizes them so that the distribution of rendered images approaches the pre-trained model's distribution, yielding generated 3D models with higher diversity and detail quality.

Category object 3D generation (native 3D)

Research motivation: optimizing NeRF with a diffusion model (2D lifting) is time-consuming (hours); the MLP in a neural radiance field has no explicit structure and cannot be diffused directly; and the memory and computational overhead of a full 3D diffusion model is nearly unbearable.

Solution: build a 3D-aware diffusion model: represent the neural radiance field as an explicit triplane structure (Rodin, NFD, SSDNeRF) or voxel grid (DiffRF), and learn the denoising process of the neural radiance field so that one can be generated directly from noise without per-scene optimization. Currently only category-level object generation is supported.

3D editing

01 Editing of objects/scenes NeRF

Research motivation: traditional neural radiance fields fit or generate scenes or objects that cannot subsequently be edited.

Solution: use separate networks and latent vectors to decouple shape and appearance; users edit on 2D rendered images, and the edits are applied through back-propagation optimization of the network and latent vectors or through feed-forward editing.

Early work: EditNeRF, NeRF-Editing, NeuMesh, ARF

02 NeRF editing based on GAN

Research motivation: 3D GANs such as PiGAN and GRAF generate rich 3D faces, but these cannot be edited at fine granularity.

Solution: map external signals into the neural radiance field and edit its features.

  • IDE3D: proposes a generative neural semantic field that decouples geometry and appearance, aligning 3D semantics with geometry via an additional semantic-mask output from the geometry branch. The editing principle: edits on the 2D semantic map are mapped back into the semantic field, thereby editing the 3D semantics and the geometry aligned with them.
  • Next3D: proposes a dynamic triplane representation based on neural texture maps; the driving expression signal is rasterized through the neural texture, deforming the plane features, after which the image with the corresponding expression is rendered.

Representative work: IDE3D, NeRFaceEditing, AnifaceGAN, Next3D

03 NeRF editing based on Diffusion

Research motivation: building on text-to-image diffusion models, use text to edit NeRF for more intuitive and interactive 3D or 4D editing.

Solution: use the diffusion model to iteratively edit the training set while simultaneously optimizing the neural radiance field's parameters, so that the NeRF renderings converge toward the edited images generated from the given text.

Representative work: InstructNeRF2NeRF, Instruct3D-to-3D

4D Generation and Editing

Dynamic NeRF generation and editing based on Diffusion

Research motivation: existing diffusion models can only edit and generate 2D images; with the help of dynamic NeRF they can be lifted from 2D to 4D, achieving high-quality, consistent 4D editing and generation.

  • Control4D: combines Tensor4D with a GAN to build a 4D GAN, which learns the distribution of images generated by the diffusion model across times and viewpoints, avoiding direct per-image supervision and achieving high-quality editing and generation. The supervision signal from the 4D GAN discriminator is smoother than that of the diffusion model, giving better spatiotemporal consistency in 4D scene editing and faster convergence.

Light and shadow editing

Research motivation: extend NeRF to represent material information, enabling relighting and shadow editing.

Earlier work: NeRFactor, InvRender, PhySG.

Solution: decompose NeRF's color representation into "normal + BRDF + lighting" and recombine them through rendering, enabling relighting and material editing.

Representation enhancement

Research motivation: implicit surface fields have advantages for representing geometry, but they are difficult to train with NeRF's ray-marching rendering: if the implicit surface function is converted into a density function naively, the surface position estimated by the ray integral is slightly biased relative to the true surface.

Early work: VolSDF, NeuS, DoubleField, UNISURF

Solution ideas: 1) redistribute the integration weights over the ray's sample points so that the final integral falls on the surface; 2) resample the ray's sample points so that they concentrate near the surface.
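Idea 1) can be illustrated with a VolSDF-style conversion from signed distance to density, using a simple analytic sphere SDF (the sphere, the ray, and the sharpness parameters are illustrative, not from any specific paper's settings):

```python
import numpy as np

def sdf_sphere(x, r=0.5):
    """Signed distance to a sphere of radius r at the origin."""
    return np.linalg.norm(x, axis=-1) - r

def density_from_sdf(d, alpha=100.0, beta=0.01):
    """VolSDF-style Laplace-CDF density: high inside the surface, decaying
    sharply outside, so rendering weights peak at the zero level set."""
    s = -d
    psi = np.where(s <= 0, 0.5 * np.exp(s / beta), 1.0 - 0.5 * np.exp(-s / beta))
    return alpha * psi

# March a ray from (-1, 0, 0) along +x; the surface is hit at depth 0.5.
ts = np.linspace(0.0, 2.0, 400)
pts = np.stack([-1.0 + ts, np.zeros_like(ts), np.zeros_like(ts)], axis=-1)
sig = density_from_sdf(sdf_sphere(pts))
dt = ts[1] - ts[0]
alphas = 1.0 - np.exp(-sig * dt)
T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
w = T * alphas
depth = (w * ts).sum() / w.sum()   # expected ray termination depth
# depth is close to 0.5: the rendering weight mass concentrates on the surface
```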

Scene modeling

Research motivation: extend NeRF to represent large-scale scene content, enabling accurate reconstruction and novel view synthesis for unstructured image collections with large spatial extent and complex geometry and texture.

Early work: NeRF++, Mip-NeRF, Mip-NeRF 360

Solution idea: introduce a nonlinear parameterization of the full space to solve NeRF modeling of unbounded 3D scenes, and introduce integrated positional encoding that accounts for the Gaussian footprint of each sample to solve blurring and aliasing in multi-scale reconstruction.
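The nonlinear full-space parameterization can be illustrated with the contraction function used by Mip-NeRF 360, which maps all of R^3 into a ball of radius 2:

```python
import numpy as np

def contract(x):
    """Mip-NeRF 360 contraction: identity inside the unit ball; the
    unbounded far field is squeezed into the shell between radius 1 and 2."""
    n = np.linalg.norm(x)
    if n <= 1.0:
        return x
    return (2.0 - 1.0 / n) * (x / n)

near = contract(np.array([0.3, 0.0, 0.0]))    # unchanged
far = contract(np.array([1000.0, 0.0, 0.0]))  # mapped just inside radius 2
# every point, however distant, lands inside radius 2, so a bounded grid or
# bounded positional encoding can cover the whole unbounded scene
```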

03 Annual development trend of NeRF

Trend 1: High Quality Dynamic Modeling

Although pre-2022 NeRF methods perform superbly on static scenes, there was still room for improvement in modeling complex dynamic scenes. This year saw a great deal of work in this direction, covering not only 4D modeling of general dynamic scenes but also improved modeling of human faces and bodies; some of it achieves stunning results while maintaining real-time performance.

Trend 2: Combination with large models

The practical deployment of large models is now unstoppable. Much of this year's work combines generative large models with NeRF to enable generative creation. Combined with a large model, NeRF is no longer limited to reconstructing real objects or scenes; it gains the creativity to "create something from nothing".

Trend 3: Richer information embedding

NeRF work before 2022 focused mainly on novel view rendering, so only geometry and texture were modeled. This year, researchers introduced more information into NeRF, including rich material properties and higher-level semantics. The introduction of semantic information further broadens NeRF's potential application scenarios.

Trend 4: Apply to other fields

In previous years, NeRF attracted attention only within 3D vision. This year, NeRF has "broken out of its circle" and has also been applied in robotics, autonomous driving, and medicine. Its novel view generation capability can effectively assist data generation and scene understanding in these fields.

04 Prospects for NeRF research

Origin blog.csdn.net/NGUever15/article/details/131312640