A review of the progress of neural rendering in 2022


Source: https://zhuanlan.zhihu.com/p/567654308

The EuroGraphics 2022 review paper "Advances in Neural Rendering" (March 2022), with authors from MPI, Google Research, ETH, MIT, Reality Labs Research, the Technical University of Munich, and Stanford University.

Synthesizing photorealistic images and videos is at the heart of computer graphics and has been the focus of research for decades. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take as input a specifically defined representation of geometry and material properties. Collectively, these inputs define the actual scene and the content to be rendered, and are known as the scene representation (a scene consists of one or more objects). Example scene representations are triangle meshes (e.g., created by artists), point clouds (e.g., from depth sensors), volumetric grids (e.g., from CT scans), or implicit surface functions (e.g., truncated signed distance fields), with accompanying textures. Reconstructing such a scene representation from observations using a differentiable rendering loss is known as inverse graphics or inverse rendering.

Neural rendering is closely related: it combines ideas from classical computer graphics and machine learning to create algorithms that synthesize images from real-world observations, and it is a step toward the goal of synthesizing photorealistic image and video content. Recent years have seen tremendous progress in this field, with many different approaches to injecting learnable components into the rendering pipeline.

This latest report on progress in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations (often called neural scene representations). A key advantage of these approaches is that they are 3D-consistent by design, enabling applications such as novel view synthesis of captured scenes. In addition to methods for handling static scenes, neural scene representations for modeling non-rigidly deforming objects, as well as scene editing and compositing, are presented. While most of these methods are scene-specific, techniques that generalize across object classes are also discussed and can be used for generative tasks. Beyond reviewing these state-of-the-art methods, the report outlines the basic concepts and definitions used, and finally discusses open challenges and social impact.


While traditional computer graphics allows the generation of high-quality, controllable images of a scene, all physical parameters of the scene (e.g., camera parameters, lighting, and object materials) need to be provided as input. Generating controllable images of real scenes therefore requires estimating these physical properties from existing observations such as images and videos, i.e., inverse rendering, which is very challenging, especially when the goal is photorealistic synthesis.

In contrast, neural rendering is a rapidly emerging field that allows compact representations of scenes, and the rendering itself, to be learned from existing observations by neural networks. The main idea of neural rendering is to combine insights from classical (physics-based) computer graphics with recent advances in deep learning. Like classical computer graphics, the goal of neural rendering is to generate photorealistic images in a controlled manner, e.g., for novel view synthesis, relighting, scene deformation, and compositing.

A good example of this is recent neural rendering techniques that separate the modeling and rendering processes by learning only a 3D scene representation and relying on rendering functions from computer graphics for supervision. For example, **Neural Radiance Fields (NeRF)** uses a multi-layer perceptron (MLP) to approximate the radiance and density field of a 3D scene. This learned volumetric representation can be rendered from any virtual camera using analytic differentiable rendering (i.e., volume integration). For training, the scene is assumed to be observed from multiple camera viewpoints. From these training viewpoints the estimated 3D scene is rendered, and the network is trained by minimizing the difference between the rendered and observed images. Once trained, the 3D scene approximated by the neural network can be rendered from novel viewpoints, enabling controllable synthesis. In contrast to approaches that use neural networks to learn the rendering function itself, NeRF uses knowledge from computer graphics more explicitly and generalizes better to new views thanks to the (physical) inductive bias of an intermediate 3D representation of scene density and radiance. As a result, NeRF learns physically meaningful color and density values in 3D space, which physically inspired ray casting and volume integration render consistently into new views.
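
For reference, the volume rendering NeRF uses along a camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$ can be written as the standard emission-absorption integral and its quadrature approximation (the formulation from the original NeRF paper):

$$
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad T(t) = \exp\!\Big(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\Big),
$$

$$
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\big(1 - e^{-\sigma_i \delta_i}\big)\,\mathbf{c}_i,
\qquad T_i = \exp\!\Big(-\sum_{j<i}\sigma_j \delta_j\Big),
$$

where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted by the MLP at sample $i$, and $\delta_i$ is the distance between adjacent samples along the ray.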

The quality of the results, together with the simplicity of the method, has led to an "explosion" of work in the field. Numerous advances improve applicability, enable controllability, capture dynamically changing scenes, and reduce training and inference times. Since neural rendering is a very fast-growing field with significant progress along many different dimensions, recent methods and their application domains are categorized to give a brief overview of the developments.

In this report, we focus on advanced neural rendering methods that combine classical rendering with learnable 3D representations (see Figure).

The underlying neural 3D representations are 3D-consistent by design and enable control over different scene parameters. The report gives a comprehensive overview of the different scene representations and details the components borrowed from classical rendering pipelines as well as from machine learning. Particular attention is paid to methods for rendering with neural radiance fields and volumes. Neural rendering methods that reason primarily in 2D screen space are not covered here, nor are neural supersampling and denoising methods for ray-traced images.


For decades, the computer graphics community has explored a variety of representations, including point clouds, implicit and parametric surfaces, meshes, and volumes (see figure).

While these representations are well defined in computer graphics, there is often confusion in the current neural rendering literature, especially regarding implicit versus explicit surface and volume representations. In general, volume representations can represent surfaces, but not vice versa. Volume representations store volumetric properties such as density, opacity, or occupancy, and they can also store multi-dimensional features such as color or radiance. Unlike volume representations, surface representations store properties of the object's surface; they cannot be used to model volumetric matter such as smoke (except as a rough approximation). Both surface and volume representations have continuous and discrete counterparts (see figure above). Continuous representations are particularly interesting for neural rendering methods because they can provide analytic gradients.

There are two common ways to render a 3D scene onto a 2D image plane: ray casting and rasterization, see the figure below. To compute a rendered image, a camera must be defined in the scene. Most methods use a pinhole camera model, in which all camera rays pass through a single point in space (the focal point). For a given camera, a ray can be cast from the camera origin into the scene for each pixel to compute the rendered image.
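
As a minimal illustration of pinhole ray casting (not code from any specific method), the sketch below generates one ray per pixel; the focal length and camera-to-world matrix are placeholder values.

```python
import numpy as np

def generate_pinhole_rays(height, width, focal, c2w):
    """Return per-pixel ray origins and directions for an ideal pinhole camera.

    height, width : image size in pixels
    focal         : focal length in pixels
    c2w           : 4x4 camera-to-world matrix (camera looks along -z, y up)
    """
    # Pixel grid, offset by 0.5 to sample pixel centers.
    i, j = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5, indexing="xy")
    # Ray directions in camera space.
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i)], axis=-1)
    # Rotate directions into world space; all rays share the camera origin (the focal point).
    rays_d = dirs @ c2w[:3, :3].T
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d

# Example: a camera at the origin looking down the -z axis.
rays_o, rays_d = generate_pinhole_rays(4, 4, focal=2.0, c2w=np.eye(4))
```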

To correctly model a real camera, image formation also needs to take the lens into account. Leaving aside effects such as depth of field or motion blur that would have to be modeled during image formation, distortion effects are added to the projection function. Unfortunately, no single simple model captures all the different lens effects. Calibration packages such as OpenCV typically implement models with up to 12 distortion parameters, modeled by polynomials of degree 5, which are therefore not trivially invertible (inversion is needed for ray casting, as opposed to point projection). More recent camera calibration methods use more parameters, achieve higher accuracy, and are invertible and differentiable.
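
As a small illustration of such a distortion model, the snippet below projects 3D points through OpenCV's standard polynomial model; the intrinsics and distortion coefficients are made-up example values, not calibration results.

```python
import numpy as np
import cv2

# Hypothetical intrinsics and a 5-parameter distortion vector (k1, k2, p1, p2, k3);
# OpenCV also supports extended models with more coefficients.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.2, 0.05, 0.001, 0.001, 0.0])

points_3d = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 3.0]], dtype=np.float64)
rvec = np.zeros(3)  # camera at the origin, no rotation
tvec = np.zeros(3)

# Forward projection including lens distortion.
image_points, _ = cv2.projectPoints(points_3d, rvec, tvec, K, dist)

# Inverting the distortion (needed for ray casting) has no closed form;
# cv2.undistortPoints solves it iteratively.
undistorted = cv2.undistortPoints(image_points, K, dist)
```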

Rasterization mainly operates on meshes, which are described by a set of vertices v and faces f, the latter connecting three or four vertices to define a surface. A basic insight is that the geometric operations in 3D only need to deal with the vertices: for example, every point is transformed from the world to the camera coordinate system with the same extrinsic matrix E. After this transformation, points outside the viewing frustum or faces with wrong normal orientation can be culled, reducing the number of primitives processed in the next step. The projected point positions in image coordinates are then easily obtained with the intrinsic matrix K. The face information is used to interpolate the depth of surface primitives, and the topmost surface is kept via a z-buffer. However, some effects (for example, lighting, shadows, and reflections) are difficult to capture this way; these limitations can be alleviated by "soft" rasterization techniques.
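
A minimal sketch of the vertex-processing step described above, with hypothetical extrinsics E and intrinsics K; a full rasterizer would add frustum culling, primitive assembly, and a z-buffer on top of this projection.

```python
import numpy as np

def project_vertices(vertices, E, K):
    """Project 3D mesh vertices to pixel coordinates.

    vertices : (N, 3) world-space vertex positions
    E        : (4, 4) extrinsic matrix (world -> camera)
    K        : (3, 3) intrinsic matrix
    Returns pixel coordinates (N, 2) and camera-space depth (N,).
    """
    # Homogeneous coordinates; every vertex is transformed with the same extrinsics.
    v_h = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    v_cam = (E @ v_h.T).T[:, :3]
    depth = v_cam[:, 2]
    # Perspective projection with the intrinsic matrix, then divide by depth.
    v_img = (K @ v_cam.T).T
    pixels = v_img[:, :2] / v_img[:, 2:3]
    return pixels, depth

# Toy example: one triangle in front of a camera at the origin.
tri = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0], [0.0, 1.0, 3.0]])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pixels, depth = project_vertices(tri, np.eye(4), K)
```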


Various approaches to neural rendering and neural scene representation are discussed below by application: novel view synthesis for static scenes, generalization across objects and scenes, view synthesis for non-static scenes, scene editing and compositing, relighting and material editing, etc.

1 New View Synthesis

New view synthesis is the task of rendering a given scene from new camera positions, given a set of images and their camera poses as input.

View synthesis methods are evaluated against several important criteria. Obviously, the output images should be as realistic as possible, but that is not all: perhaps even more important is multi-view 3D consistency. Rendered video sequences must appear to depict consistent 3D content, without flickering or warping as the camera moves through the scene. As the field of neural rendering has matured, most approaches have moved toward generating a persistent 3D representation whose output can be used to render new 2D views. This automatically provides a level of multi-view consistency that was difficult to achieve in the past, when methods relied heavily on black-box 2D convolutional networks as image generators or renderers.

To address the resolution and memory limitations of voxel grids, Scene Representation Networks (SRN) combine a sphere-tracing-based neural renderer with a multi-layer perceptron (MLP) as the scene representation, focusing on generalization across scenes and enabling few-shot reconstruction. Differentiable Volumetric Rendering (DVR) similarly leverages a surface rendering approach, but demonstrates that overfitting to a single scene enables reconstruction of more complex appearance and geometry.

Neural Radiance Fields (NeRF) marked a breakthrough in applying MLP-based scene representations to single-scene, photorealistic novel view synthesis, see figure below.

Unlike surface-based methods, NeRF directly applies a volume rendering model to synthesize images from an MLP that maps input positions and viewing directions to output volume densities and colors. A separate set of MLP weights is optimized to represent each new input scene, based on a pixel-wise rendering loss against the input images.

MLP-based scene representations achieve higher resolution than discrete 3D volumes because the MLP acts as an efficient differentiable compression of the scene during optimization. For example, a NeRF representation capable of rendering 800×800 output images requires only about 5 MB of network weights; in comparison, an 800³ RGBA voxel grid would consume close to 2 GB of storage.
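
The ~2 GB figure follows directly from the grid size (assuming one byte per RGBA channel):

$$
800^3\ \text{voxels} \times 4\ \text{bytes/voxel} \approx 2.05 \times 10^{9}\ \text{bytes} \approx 2\ \text{GB}.
$$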

This ability can be attributed to the positional encoding NeRF applies to the input spatial coordinates before passing them through the MLP. Compared with previous work using neural networks to represent implicit surfaces or volumes, NeRF's MLP can represent much higher-frequency signals without increasing its capacity (in terms of the number of network weights).
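
A minimal sketch of such a frequency-based positional encoding (the exact scaling and ordering used in NeRF differs slightly); the number of frequency bands is a hyperparameter and the value below is only an example.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map coordinates x (..., D) to sin/cos features at exponentially growing frequencies.

    With num_freqs bands, a D-dimensional input becomes 2 * num_freqs * D features,
    letting a small MLP fit much higher-frequency signals.
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi          # (num_freqs,)
    scaled = x[..., None, :] * freqs[:, None]            # (..., num_freqs, D)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)

# Example: encode a single 3D point.
encoded = positional_encoding(np.array([[0.1, -0.4, 0.7]]))
print(encoded.shape)  # (1, 60) for 10 frequency bands
```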

The main disadvantage of switching from discrete 3D grids to MLP-based representations is rendering speed. Computing the color and density of a single point in space requires evaluating an entire neural network (hundreds of thousands of floating-point operations) rather than directly querying a simple data structure. With NeRF implemented in a standard deep learning framework, rendering a single high-resolution image takes tens of seconds on a typical desktop GPU.

Several methods accelerate volume rendering of MLP-based representations, such as Neural Sparse Voxel Fields and KiloNeRF. Others cache various quantities learned by the NeRF MLP on sparse 3D grids, allowing real-time rendering after training is complete, e.g., SNeRG, FastNeRF, PlenOctrees, and NeX-MPI. Another way to speed up rendering is to train the MLP representation itself to efficiently precompute some or all of the volume integral along rays, as in AutoInt and Light Field Networks.

Many newer methods use classical data structures such as grids, sparse grids, trees, and hashes to speed up rendering and achieve much faster training. Instant Neural Graphics Primitives leverages a multi-resolution hash encoding, rather than an explicit dense grid structure, enabling NeRF training in seconds.
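
To make the idea concrete, here is a heavily simplified sketch of a multi-resolution hash-grid encoding (per-level spatial hash plus trilinear interpolation of learned features). The level count, table size, feature width, and growth factor are illustrative; a real implementation such as Instant NGP fuses this with a tiny MLP and heavy low-level optimization.

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)  # primes from the Instant NGP paper

def hash_grid_features(x, tables, base_res=16, growth=1.5):
    """Look up multi-resolution hash-grid features for points x in [0, 1]^3.

    x      : (N, 3) query points
    tables : list of (table_size, F) learned feature tables, one per level
    Returns concatenated trilinearly interpolated features, shape (N, L * F).
    """
    outputs = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        pos = x * res
        p0 = np.floor(pos).astype(np.uint64)            # lower grid corner
        w = pos - p0                                     # trilinear weights (N, 3)
        feat = 0.0
        for corner in range(8):                          # 8 corners of the surrounding cell
            offset = np.array([(corner >> d) & 1 for d in range(3)], dtype=np.uint64)
            corner_idx = p0 + offset
            # Spatial hash: coordinates times large primes, XOR-ed, modulo the table size.
            h = corner_idx * PRIMES
            h = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % np.uint64(len(table))
            # Trilinear weight of this corner.
            cw = np.prod(np.where(offset == 1, w, 1.0 - w), axis=1, keepdims=True)
            feat = feat + cw * table[h]
        outputs.append(feat)
    return np.concatenate(outputs, axis=1)

# Example with random (untrained) tables: 4 levels, 2^14 entries, 2 features each.
rng = np.random.default_rng(0)
tables = [rng.normal(size=(2 ** 14, 2)).astype(np.float32) for _ in range(4)]
feats = hash_grid_features(rng.uniform(size=(5, 3)), tables)
print(feats.shape)  # (5, 8)
```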

Other improvements include additional supervision (e.g., depth values), optimized camera poses, hybrid surface/volume representations, robustness and quality improvements (NeRF++, MipNeRF), combinations of NeRF with standard computational imaging methods (Deblur-NeRF, NeRF in the Dark, HDR-NeRF, NeRF-SR, etc.), large-scale scenes, and NeRF from text (Dream NeRF and CLIP-NeRF), etc.


2 Generalization across objects and scenes

A large body of work addresses generalization across multiple scenes and object classes using voxel-based, mesh-based, or non-3D-structured neural scene representations. Here we mainly discuss recent progress on generalization with MLP-based scene representations. Methods that overfit a single MLP to a single scene require a large number of image observations, whereas the core goal of generalization is novel view synthesis given few, or possibly only a single, input view. The methods surveyed are categorized as follows: whether they exploit local or global conditioning, whether they can act as unconditional generative models, which 3D representation they use (volume, SDF, or occupancy), what training data is required, and how inference is performed (via an encoder-decoder, an auto-decoder framework, gradient-based meta-learning, etc.).

There are two key ways to generalize across scenes. One class of work follows an approach similar to image-based rendering (IBR), where multiple input views are warped and blended to synthesize new viewpoints. In the context of MLP-based scene representations, this is usually achieved through local conditioning, where the coordinate input of the scene representation MLP is concatenated with a locally varying feature vector stored in a discrete data structure such as a voxel grid.

PIFu uses an image encoder to compute features of the input image and conditions a 3D MLP on these features by projecting the 3D coordinates onto the image plane. However, PIFu does not have a differentiable renderer and thus requires ground-truth 3D supervision. PixelNeRF and Pixel-Aligned Avatars exploit this approach within a volume rendering framework, where the features are aggregated over multiple views and the MLP produces color and density fields that are rendered in NeRF fashion. When trained on multiple scenes, a scene prior is learned, enabling high-fidelity reconstruction from only a few views.
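
A rough sketch of this kind of pixel-aligned local conditioning, assuming a hypothetical CNN feature map and camera matrices (names and shapes are placeholders, not the exact pixelNeRF architecture): each 3D query point is projected into the input view, a feature is bilinearly sampled there, and the result is concatenated with the point before being fed to the NeRF MLP.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(points_world, feat_map, K, w2c):
    """Sample image-aligned features for 3D points (pixelNeRF-style local conditioning).

    points_world : (N, 3) 3D query points in world space
    feat_map     : (1, C, H, W) CNN feature map of the input view
    K            : (3, 3) camera intrinsics
    w2c          : (4, 4) world-to-camera matrix
    Returns (N, C + 3) features: sampled image feature concatenated with the point.
    """
    _, C, H, W = feat_map.shape
    # Transform points into the camera frame and project with the intrinsics.
    ones = torch.ones(points_world.shape[0], 1)
    p_cam = (w2c @ torch.cat([points_world, ones], dim=1).T).T[:, :3]
    p_img = (K @ p_cam.T).T
    uv = p_img[:, :2] / p_img[:, 2:3].clamp(min=1e-6)    # pixel coordinates
    # Normalize to [-1, 1] for grid_sample (x along width, y along height).
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2.0 - 1.0
    grid = grid.view(1, -1, 1, 2)                        # (1, N, 1, 2)
    sampled = F.grid_sample(feat_map, grid, mode="bilinear", align_corners=True)
    sampled = sampled.view(C, -1).T                      # (N, C)
    # Concatenate the (typically positionally encoded) point coordinates.
    return torch.cat([sampled, points_world], dim=1)

# Toy usage with random tensors standing in for a real CNN encoder and camera.
feat_map = torch.randn(1, 64, 32, 32)
points = torch.rand(8, 3) + torch.tensor([0.0, 0.0, 2.0])
K = torch.tensor([[30.0, 0.0, 16.0], [0.0, 30.0, 16.0], [0.0, 0.0, 1.0]])
cond = pixel_aligned_features(points, feat_map, K, torch.eye(4))
print(cond.shape)  # (8, 67)
```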

PixelNeRF can also be trained on specific object classes, enabling 3D reconstruction of object instances from one or more posed images. GRF uses a similar framework with an additional attention module to account for the visibility of 3D points in the differently sampled input images. Stereo Radiance Fields likewise extracts features from multiple context views, but leverages learned correspondence matching between pairs of context image features to aggregate across context images instead of simple averaging. Finally, IBRNet and NerFormer introduce a transformer network over the samples along each ray to reason about visibility. LOLNeRF learns a generalized NeRF model for portrait images with only monocular supervision; the generator network is trained jointly, conditioned on instance-specific latent vectors. GeoNeRF builds a set of cascaded cost volumes and uses transformers to infer geometry and appearance.

An alternative to these image-based methods aims to learn a holistic, global representation of the scene rather than relying on images or other discrete spatial data structures. Given a set of observations, such methods describe the entire scene by inferring a set of weights for the scene representation MLP. Some works do this by encoding the scene in a single low-dimensional latent code, which then conditions the scene representation MLP.

Scene Representation Networks (SRN) map the low-dimensional latent code to the parameters of the MLP scene representation through a hypernetwork, and render the resulting 3D MLP via ray marching. To reconstruct an instance given a posed view, SRN optimizes for the latent code whose rendering matches the input view. Differentiable Volumetric Rendering similarly uses surface rendering, computes its gradients analytically, and performs inference via a CNN encoder. Light Field Networks leverage a low-dimensional latent code to directly parameterize the 4D light field of a 3D scene, enabling rendering with a single network evaluation per ray.

NeRF-VAE embeds NeRF into a variational auto-encoder (VAE), similarly representing the entire scene with a single latent code, but learns a generative model that enables sampling. ShaRF employs a generative model of voxelized object shapes within a class, on which a higher-resolution neural radiance field is then conditioned; volume rendering then yields higher fidelity in novel view synthesis.

FiG-NeRF models an object category as a template shape, conditioned on a latent code, that undergoes a deformation conditioned on the same latent variable. This enables the network to explain certain shape changes as more intuitive deformations. FiG-NeRF focuses on recovering object categories from real object scans and also proposes to segment objects from their backgrounds with a learned background model. An alternative to representing the scene as a low-dimensional latent code is to directly optimize the weights of the MLP scene representation in only a few optimization steps via gradient-based meta-learning. This can be used to quickly reconstruct a neural radiance field from a small number of images: starting from the meta-learned initialization, training converges faster and requires fewer views than standard neural radiance field training.

Portrait-NeRF proposes a meta-learning approach to recover a NeRF from a single frontal image of a person. To account for pose differences between subjects, the 3D portrait is modeled in a pose-agnostic canonical frame of reference, warping each subject using 3D keypoints. The NeRF of the scene is quickly recovered using gradient-based meta-learning and local conditioning on image features.

Instead of inferring a low-dimensional latent code for a 3D scene from a set of observations, a similar approach can be used to learn unconditional generative models. Here, a 3D scene representation equipped with a neural renderer is embedded into a generative adversarial network (GAN): rather than inferring the latent code from observations, a distribution over latent codes is defined. In the forward pass, a latent variable is sampled from this distribution, the MLP scene representation is conditioned on it, and an image is rendered by the neural renderer; this image is then used in an adversarial loss. Given only 2D images, such models can learn a 3D generative model of scene shape and appearance. While earlier frameworks parameterized the 3D scene representation with voxel grids, GRAF was the first to exploit a conditional NeRF, achieving significant improvements in photorealism. Pi-GAN further improves the architecture with a FiLM ("FiLM: Visual reasoning with a general conditioning layer") conditioning scheme on a SIREN ("Implicit neural representations with periodic activation functions") backbone.

Several recent approaches explore different directions for improving the quality and efficiency of these generative models. The computational cost and the quality of the reconstructed geometry can be improved with surface representations. In addition to synthesizing multi-view images for the discriminator, ShadeGAN uses an explicit shading step to also render the output under different lighting conditions, yielding higher-quality geometry reconstruction. Hybrid techniques have also been explored, where image-space CNNs refine the output of the 3D generator; such image-space networks can be trained at higher resolution and produce higher-fidelity output. Some approaches decompose the generative model into separate geometry and texture spaces, with some methods learning texture in image space while others learn geometry and texture jointly in 3D.

While these methods require neither more than one observation per 3D scene nor ground-truth camera poses, they still require knowledge of the camera pose distribution (for portrait images, the pose distribution must yield plausible portrait angles). CAMPARI addresses this constraint by jointly learning the camera pose distribution and the generative model. GIRAFFE parameterizes the scene as a composition of multiple foreground (object) NeRFs and a single background NeRF to learn a generative model of scenes composed of multiple objects. A latent code is sampled individually for each NeRF, and the volume renderer composites them into a plausible 2D image.

3 Extensions to dynamic scenes

The original neural radiance field represents static scenes and objects; several methods additionally handle dynamically changing content. These methods can be categorized as time-varying representations, which allow novel view synthesis of dynamically changing scenes as unmodified playback (e.g., producing bullet-time effects), or as techniques that control the deformation state, which allow the content to be both synthesized from new viewpoints and edited. A deforming neural radiance field can be implemented implicitly or explicitly, as shown in the figure: on the left, the implicit variant modulates the radiance field with the deformation (time t); on the right, the explicit variant uses a separate deformation MLP to warp space, regressing offsets (blue arrows) from the deformed space (black) to the static canonical space (yellow). This deformation bends straight rays into the canonical radiance field.

  • time-varying representation

Time-varying NeRFs allow playback of a video from novel viewpoints. Because they forgo control over the deformation, these methods do not depend on a specific motion model and can thus handle general objects and scenes.

Several works propose extensions of NeRF to non-rigid scenes. Methods that model deformations implicitly are discussed first. While the original NeRF is static, taking only points in 3D space as input, it can be extended to be time-varying in a simple way: the volumetric representation can additionally depend on a vector representing the deformation state. In current methods, this conditioning uses a temporal input (possibly positionally encoded) or an auto-decoded latent code per time step.

Handling non-rigid scenes without prior knowledge of object type or 3D shape is an ill-posed problem, and such methods therefore employ various geometric regularizers as well as conditioning on additional data modalities. To encourage temporal consistency of reflectance and opacity, several approaches learn a temporal scene flow field between adjacent time steps. Since this is limited to small temporal neighborhoods, distortion-free synthesis of new views is mainly demonstrated for camera trajectories close to the input trajectory in space and time.

The scene flow field can be trained with a reconstruction loss that warps the scene from neighboring time steps to the current one, by encouraging consistency between estimated optical flow and the 2D projection of the scene flow, or with back-projected 3D keypoint tracks. The scene flow is usually further constrained by additional regularization losses, such as encouraging spatial or temporal smoothness or forward-backward cycle consistency. Unlike the other methods mentioned, **Neural Radiance Flow (NeRFlow)** models deformations as infinitesimal displacements, which requires integration with Neural ODEs to obtain offset estimates.

Furthermore, some methods use estimated depth maps to supervise the geometry. A limitation of this regularization is that the reconstruction accuracy depends on the accuracy of the monocular depth estimator, so its artifacts can become visible in new views.

Finally, the static background is often handled separately, which provides multi-view cues for the temporally monocular input. To this end, some methods estimate a second, static volume that is not conditioned on the deformation, or introduce a soft regularization loss that constrains static scene content.

NeRFlow can also be used for denoising and super-resolving views of a pre-trained scene. Limitations of NeRFlow include difficulty maintaining a static background, handling complex scenes (non-segmented rigid deformations and motion), and rendering new views from camera trajectories that differ substantially from the input trajectory.

The methods discussed so far model deformations implicitly, with deformation-dependent scene representations, which makes controlling the deformation cumbersome. Other works decouple deformation from geometry and appearance: decomposing the deformation into an independent function on top of a static canonical scene is a crucial step toward controllability. The deformation is implemented by casting straight rays into the deformed space and warping them into the canonical scene, usually by regressing per-point offsets along the straight rays with a coordinate-based MLP. This can be thought of as space warping or scene flow.
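
A schematic sketch of this explicit space-warping idea, with hypothetical tiny MLPs standing in for the deformation network and the canonical radiance field (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Stand-in for both the deformation network and the canonical radiance field."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

deform_mlp = TinyMLP(in_dim=3 + 1, out_dim=3)      # (point, time) -> offset into canonical space
canonical_nerf = TinyMLP(in_dim=3, out_dim=4)      # canonical point -> (RGB, density)

def query_deformed_scene(points, t):
    """Warp sample points from the deformed (observed) space into the canonical space, then query it.

    points : (N, 3) sample points along a straight ray in the deformed space
    t      : scalar time / deformation state
    """
    t_col = torch.full((points.shape[0], 1), float(t))
    offsets = deform_mlp(torch.cat([points, t_col], dim=1))   # space warp / scene flow
    canonical_points = points + offsets                        # "bend" the straight ray
    rgb_sigma = canonical_nerf(canonical_points)               # shared static geometry and appearance
    return rgb_sigma

# Sample points along one straight camera ray at time t = 0.5.
origin, direction = torch.zeros(3), torch.tensor([0.0, 0.0, 1.0])
ts = torch.linspace(2.0, 4.0, steps=16).unsqueeze(-1)
ray_points = origin + ts * direction
out = query_deformed_scene(ray_points, t=0.5)
print(out.shape)  # (16, 4)
```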

In contrast to implicit modeling, these methods share geometric and appearance information over time through the static canonical scene and thus provide hard correspondences that do not drift. Due to this hard constraint, however, methods with explicit deformations cannot handle topological changes, and they demonstrate results only for scenes with significantly smaller motion than implicit methods.

D-NeRF uses a ray-bending MLP without regularization to model the deformations of one or several synthetic objects segmented from the background and observed through a virtual camera. It assumes a predefined set of multi-view images but selects only a single view per time step for supervision during training; D-NeRF can therefore be regarded as an intermediate step between multi-view supervised techniques and truly monocular methods.

Several works demonstrate results on real scenes observed with a moving monocular camera. The core application of Deformable NeRF is the construction of "nerfies", i.e., free-viewpoint selfies. Deformable NeRF modulates deformation and appearance with auto-decoded latent codes for each input view. The bent rays are regularized with an as-rigid-as-possible term (also known as an elastic energy term) that penalizes deviations from piecewise-rigid scene configurations.

As a result, Deformable NeRF works well for articulated scenes (e.g., a hand holding a tennis racket) and scenes involving human heads (where the head moves relative to the torso). At the same time, small non-rigid deformations (like smiles) are still handled well because the regularizer is soft. Another important contribution of this work is a coarse-to-fine scheme that lets low-frequency components be learned first, avoiding local minima caused by overfitting to high-frequency details.

HyperNeRF is an extension of Deformable NeRF that uses a canonical hyperspace instead of a single canonical frame. This allows handling scenes with topological changes, such as a mouth opening and closing. In HyperNeRF, Deformable NeRF's bending network (an MLP) is augmented by an ambient slicing-surface network (also an MLP), which indirectly adjusts the deformable canonical scene by choosing a canonical subspace for each input RGB view. It is therefore a hybrid model combining explicit and implicit deformation modeling, sacrificing hard correspondences in order to handle topological changes.

Non-Rigid NeRF (NR-NeRF) models time-varying scene appearance with a canonical scene volume, a scene rigidity network (MLP), and per-frame ray-bending operators (MLP). NR-NeRF shows that no additional supervision cues such as depth maps or scene flow are required to handle scenes with small non-rigid deformations and motions. Furthermore, the estimated deformations are regularized with a divergence operator that imposes a volume-preservation constraint, stabilizing regions that are occluded in the supervising monocular input views. In this respect it has similar properties to Nerfies' elastic regularizer, which penalizes deviations from piecewise-rigid deformations. This regularization allows the camera trajectory of new views to deviate significantly from the input camera trajectory. While controllability is still severely limited, NR-NeRF demonstrates several simple edits of the learned deformation field, such as motion magnification or removal of dynamic scene content.

Other methods are not limited to the case of monocular RGB input video, but consider the presence of other inputs.

The Time-of-Flight Radiance Fields (TöRF) method replaces data-driven priors for reconstructing dynamic content with depth maps from a depth sensor. Unlike the vast majority of computer vision work, TöRF uses raw ToF sensor measurements (so-called phasors), which brings advantages when dealing with weakly reflective regions and other limitations of modern depth sensors (e.g., limited operating range). Integrating measured scene depth into NeRF learning reduces the required number of input views and results in sharp, detailed models. The depth cues provide higher accuracy than NSFF and space-time neural radiance fields.

Neural 3D Video Synthesis uses a multi-view RGB setup and models deformations implicitly. The method first trains on keyframes and exploits temporal smoothness. It assumes static cameras and mostly static scene content, and samples rays in a biased way for training. Even for small dynamic content, the results are crisp.

  • control deformation state

To control the deformation of a neural radiance field, these methods use a class-specific motion model as the underlying representation of the deformation state (e.g., a morphable model of the human face or a skeleton-based deformation graph of the human body).

NeRFace was the first method to implicitly control a neural radiance field via a morphable model. It uses a face tracker to reconstruct facial blendshape parameters and camera poses for the training views (a monocular video). The MLP is trained on these views with the blendshape parameters and a learnable per-frame latent code as conditioning. Furthermore, a known static background is assumed, so the radiance field only stores information about the face. The latent codes compensate for information missed by the tracker (e.g., the person's shoulders) as well as for tracking errors. After training, the radiance field can be controlled via the blendshape parameters, enabling reenactment and expression editing.

Audio-driven neural radiance fields (AD-NeRF), inspired by NeRFace, replace the expression coefficients with audio features extracted by DeepSpeech, which are mapped to a feature that conditions the radiance field MLP. While expressions are controlled implicitly via the audio signal, explicit control over the rigid head pose is provided. To synthesize a portrait view of a person, two separate radiance fields are used, one for the head and one for the torso.

"IM Avatar" extends NeRFace based on the skin field, which is used to deform a canonical NeRF volume given new expression and pose parameters.

In addition to these subject-specific training methods, Head-NeRF and MoFaNeRF propose generalized models for representing faces under different views, expressions, and illuminations. Similar to NeRFace, they condition the NeRF MLP on additional control parameters such as shape, expression, albedo, and illumination. Both methods require a refinement network (a 2D network) to improve the coarse results of volume-rendering the conditional NeRF MLP.

While the above methods show promising results for portrait scenes, they are not suitable for strongly non-rigid deformations, especially articulated human motion captured from a single view; for this, a human skeleton embedding needs to be exploited explicitly. **Neural Articulated Radiance Field (NARF)** is trained on pose-annotated images. The articulated object is decomposed into several rigid parts, each with its own local coordinate system, with global shape variations on top. A trained NARF can render new views with manipulated poses, estimate depth maps, and perform body-part segmentation.

In contrast to NARF, A-NeRF learns an actor-specific neural body model from monocular footage in a self-supervised manner. The method combines dynamic NeRF volumes with the explicit controllability of an articulated human skeleton embedding, and reconstructs pose and radiance field in an analysis-by-synthesis fashion. Once trained, the radiance field can be used for novel view synthesis as well as motion retargeting.

While A-NeRF is trained on monocular video, **Animatable Neural Radiance Fields (ANRF)** is a skeleton-driven method for reconstructing human models from multi-view videos. Its core component is a new motion representation, a neural blend weight field, which is combined with a 3D human skeleton to generate a deformation field. Similar to several general-purpose non-rigid NeRFs, ANRF maintains a canonical space and estimates bidirectional correspondences between the multi-view inputs and canonical frames.

The reconstructed animatable human model can be used for rendering from arbitrary viewpoints and for re-rendering in new poses. A human mesh can also be extracted from ANRF by running marching cubes on the volume densities of discretized canonical-space points. The method achieves high visual accuracy for the learned human model; future work could improve the handling of complex non-rigid deformations of the observed surface (such as those caused by loose clothing).

The Neural Body method enables novel view synthesis of human performances from sparse multi-view video (e.g., four simultaneous views). The method is conditioned on a parametric human shape model, SMPL, as a shape-aware prior. It assumes that the neural representations recovered at different frames share the same set of latent codes, anchored to a deformable mesh. Common baselines such as a rigid NeRF (applied per time step) or Neural Volumes assume a denser set of input images and therefore cannot compete with Neural Body when rendering new views of a moving human from only a few simultaneous input images. The method also compares favorably to human mesh reconstruction techniques such as PIFuHD, which rely heavily on 3D training data, when it comes to reconstructing fine appearance details (e.g., rare or unique clothing).

Similar to Neural Body, Neural Actor (NA) and HVTR use the SMPL model to represent the deformation state. They exploit this proxy to explicitly unwarp the surrounding 3D space into a canonical pose, in which the NeRF is embedded. To improve the recovery of high-fidelity geometric and appearance details, they use an additional 2D texture map defined on the SMPL surface as extra conditioning for the NeRF MLP.

H-NeRF is another technique for temporal 3D reconstruction conditioned on a human body model. Like Neural Body, it requires sparse video sets from synchronized and calibrated cameras. In contrast, H-NeRF uses a structured implicit human body model based on signed distance fields, leading to cleaner renderings and more complete geometry. Similar to H-NeRF, DD-NeRF builds on signed distance fields to render the whole human body: given multi-view input images and a reconstructed SMPL volume, it regresses SDF and radiance values and renders them with volume accumulation.

Human-NeRF is also based on multi-view input but learns a generalizable neural radiance field for arbitrary-viewpoint rendering, which can be fine-tuned for a specific actor. Another work, also called HumanNeRF, drives a motion field with a skeleton and refines it with a generic non-rigid motion field, showing how to train an actor-specific neural radiance field from monocular input data.

Mixture of Volumetric Primitives (MVP) targets real-time rendering of dynamic, animatable virtual human models. The main idea is to model a scene or object with a set of volumetric primitives that can dynamically change position and content. Like part-based models, these primitives model the components of the scene. Each primitive is a voxel grid generated by a decoder network from a latent code. This latent code defines the configuration of the scene (e.g., the facial expression in the case of a human face), from which the decoder network generates the primitives' positions and voxel values (including RGB color and opacity).

For rendering, a ray marching procedure accumulates color and opacity values along the corresponding ray for each pixel. As with other dynamic NeRF methods, multi-view videos are used as training data. The method creates very high-quality real-time renderings that look realistic even for challenging materials such as hair and clothing. E-NeRF demonstrates an efficient NeRF rendering solution based on a depth-guided sampling technique, showing real-time rendering of moving humans and static objects from multi-view image input.

4 Composition and editing

The methods discussed so far reconstruct volumetric representations of static or dynamic scenes and can render new views of them from several input images, but the observed scene is kept unchanged, apart from relatively simple modifications (such as foreground removal). Several recent methods additionally allow editing of the reconstructed 3D scene, i.e., rearranging and affinely transforming objects and changing their structure and appearance.

Conditional NeRF can change the color and shape of rigid objects observed in 2D images through manual user edits (for example, some object parts can be removed). The approach starts from a single NeRF trained on multiple object instances of the same class. During editing, the network parameters are adjusted to match the shape and color of the newly observed instance. One contribution of this work is identifying the subset of tunable parameters that successfully propagates user edits into new views, avoiding costly modification of the entire network. CodeNeRF represents shape and texture variation within an object class. Similar to pixelNeRF, CodeNeRF can synthesize new views of unseen objects. It learns two separate embeddings for shape and texture. At test time, it estimates camera pose, object 3D shape, and texture from a single image, all of which can be continuously modified by changing the latent codes. CodeNeRF achieves performance comparable to previous single-image 3D reconstruction methods without assuming known camera poses.

**Neural Scene Graphs (NSG)** is a method for synthesizing new views from monocular videos (ego-vehicle views) recorded while driving. The technique decomposes a dynamic scene with multiple independently moving rigid objects into a learned scene graph that encodes the individual object transformations and radiance; each object and the background are thus encoded by separate neural networks. In addition, the sampling of the static node is limited to slices parallel to the image plane for efficiency, i.e., a 2.5D representation. NSG requires annotated tracking data for each rigidly moving object of interest over the input frames, and each object class (e.g., car or bus) shares a single prior. The neural scene graph can then be used to render new views of the same (observed) scene or of an edited scene (e.g., with rearranged objects). Applications of NSG include background-foreground decomposition, enriching training datasets for automotive perception, and improving object detection and scene understanding.

Another hierarchical representation, the spatio-temporally consistent NeRF (ST-NeRF), relies on bounding boxes of all independently moving and articulated objects, producing multiple layers and disentangling their position, deformation, and appearance. The input to ST-NeRF is a set of 16 simultaneous videos from cameras placed at regular intervals along a semicircle, plus human/background segmentation masks. As the name suggests, spatio-temporal consistency constraints are reflected in its architecture, namely a space-time deformation module and a NeRF module serving as the canonical space. ST-NeRF also accepts timestamps to account for the evolution of appearance over time. When rendering a new view, rays are cast through multiple scene layers and densities and colors are accumulated. ST-NeRF can be used for neural scene editing such as rescaling, moving, duplicating, or removing performers, as well as retiming.

5 Relighting and material editing

The applications above are based on a simplified absorption-emission volume rendering model, in which the scene is modeled as a volume of particles that block and emit light. While this model is good enough to render images of the scene from new viewpoints, it cannot render images of the scene under different lighting conditions. Enabling relighting requires a scene representation that simulates the transport of light through the volume, including scattering by particles with various material properties.

Neural Reflectance Fields was the first to propose extending NeRF for relighting. Instead of representing the scene as a volume density field and a view-dependent radiance field, it represents the scene as fields of volume density, surface normals, and bidirectional reflectance distribution functions (BRDFs). This allows rendering the scene under arbitrary lighting conditions by using the predicted surface normal and BRDF at each 3D location to evaluate how much incoming light the particles at that location reflect back toward the camera. However, for neural volume rendering models, evaluating the visibility of each light source from each point along a camera ray is computationally intensive: even with only direct lighting, the MLP must be evaluated at densely sampled locations between every point along the camera ray and every light source to compute the incoming lighting at that point. Neural Reflectance Fields sidesteps this problem by training only on images lit by a single point light co-located with the camera, so the MLP only needs to be evaluated along the camera rays.
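
For context, the quantity such reflectance-based methods evaluate at each shading point is the standard reflection equation, with $f_r$ the predicted BRDF and $\mathbf{n}$ the predicted surface normal:

$$
L_o(\mathbf{x}, \boldsymbol{\omega}_o) = \int_{\Omega} f_r(\mathbf{x}, \boldsymbol{\omega}_i, \boldsymbol{\omega}_o)\, L_i(\mathbf{x}, \boldsymbol{\omega}_i)\, (\mathbf{n}\cdot\boldsymbol{\omega}_i)\, d\boldsymbol{\omega}_i .
$$

Evaluating the incoming radiance $L_i$ requires knowing how much of each light source is occluded, which is exactly the visibility problem discussed next.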

Other recent work that recovers relightable models simply ignores self-occlusion and assumes that all light sources in the upper hemisphere above any surface point are fully visible, avoiding the difficulty of computing light source visibility. Two methods, PhySG and NeRD, assume full light source visibility and additionally represent the environment lighting and scene BRDFs as mixtures of spherical Gaussians to further accelerate rendering, so that the hemispherical integral of the incident lighting multiplied by the BRDF can be computed in closed form. Assuming full light visibility works well for mostly convex objects, but this strategy cannot simulate effects where scene geometry occludes light sources, such as cast shadows.

Neural Reflectance and Visibility Fields (NeRV) trains an MLP to approximate the light source visibility for any input 3D position and 2D incident light direction. Instead of querying an MLP at densely sampled points along each light ray, the visibility MLP only needs to be queried once per incident light direction. This enables recovering relightable models of scenes from images with significant shadowing and self-occlusion effects.

Unlike the previously discussed methods, NeRFactor starts from a pre-trained NeRF model. NeRFactor then simplifies the volumetric geometry of the pre-trained NeRF into a surface model, optimizes MLP representations of light source visibility and surface normals at any point on the surface, and finally optimizes the environment lighting and a BRDF representation at every surface point to recover a relightable model. This results in a more efficient relightable model at render time, since the volumetric geometry has been reduced to a single surface and the light source visibility at arbitrary points can be computed with a single MLP query.

The NeROIC technique also uses a multi-stage pipeline to recover relightable NeRF-like models from images of an object captured in multiple unconstrained lighting environments. The first stage recovers geometry while accounting for appearance changes due to lighting with latent appearance embeddings, the second stage extracts normal vectors from the recovered geometry, and the third stage estimates BRDF properties and a spherical-harmonics representation of the lighting.

Unlike the relightable representations above, which focus on recovering objects, NeRF-OSR recovers NeRF-style relightable models of large buildings and historic sites. NeRF-OSR assumes a Lambertian model and decomposes the scene into diffuse albedo, surface normals, a spherical-harmonics representation of the lighting, and shadows, which are combined to relight the scene under new environment illumination.
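
Schematically (not the paper's exact notation), such a Lambertian decomposition renders a surface point $\mathbf{x}$ as

$$
\mathbf{c}(\mathbf{x}) \approx \mathbf{a}(\mathbf{x}) \odot s(\mathbf{x}) \sum_{k} \mathbf{l}_k\, B_k\big(\mathbf{n}(\mathbf{x})\big),
$$

where $\mathbf{a}$ is the diffuse albedo, $\mathbf{n}$ the surface normal, $B_k$ the spherical-harmonics basis functions, $\mathbf{l}_k$ the lighting coefficients, and $s$ a scalar shadow term.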

The relightable models above represent the scene's materials as a continuous 3D field of BRDFs. This already enables basic material editing, since the recovered BRDFs can be modified before rendering. NeuTex introduces a surface parameterization network that learns a mapping from 3D coordinates in the volume to 2D texture coordinates, enabling more intuitive material editing: after recovering a NeuTex model of a scene, the 2D texture can easily be edited or replaced.

Ref-NeRF focuses on improving NeRF's ability to represent and render specular surfaces. While Ref-NeRF cannot be used for relighting, since it does not separate incident lighting from reflectance properties, it structures the outgoing radiance into physically meaningful components (diffuse and specular colors, normal vectors, and roughness), enabling intuitive material editing.

6 Light fields

Volume rendering, sphere tracing, and other 3D rendering forward models can produce photorealistic results. However, for a given ray they all require sampling the underlying 3D scene representation at the 3D coordinate where the ray first intersects the scene geometry. Since this intersection point is not known a priori, a ray-marching algorithm must first discover this surface point. Ultimately, this creates a time and memory complexity proportional to the geometric complexity of the scene: more and more points must be sampled to render increasingly complex scenes, in practice hundreds or even thousands of points per ray. Additionally, accurately rendering reflections and higher-order lighting effects requires multi-bounce ray tracing, so multiple rays must be traced per pixel instead of just one, creating a high computational burden. While in the single-scene (overfitting) case this can be mitigated with clever data structures, hashing, and expert low-level engineering, in the case of reconstructing a 3D scene from only a few observations or even a single image, such data structures hinder the application of learned reconstruction algorithms, such as using convolutional neural networks to infer the parameters of a 3D scene from a single image.

7 Engineering framework

Using neural rendering models poses significant engineering challenges for practitioners: large amounts of image and video data must be processed in a highly non-sequential manner, and models often need to differentiate through large and complex computational graphs. Developing efficient operators often requires low-level languages, which also makes automatic differentiation harder to use. Recent tooling advances help across the entire software stack associated with neural rendering, including storage, hyperparameter search, differentiable rendering, and ray casting.

Open Questions and Challenges

  • Seamless integration
  • Scaling up
  • Generalization
  • Multimodal learning
  • Quality

Social impact

The areas most affected by these new neural representations are computer vision, computer graphics, and augmented and virtual reality, which can benefit from the enhanced photorealism of rendered environments. Indeed, state-of-the-art volumetric models rely on easy-to-understand and elegant principles, lowering the barrier to entry for photogrammetry and 3D reconstruction research. This effect is further amplified by the ease of use of these methods and by publicly available code bases and datasets.

Since neural rendering is still immature and not yet widely understood, end-user tools like Blender that expose these new approaches do not yet exist. However, a broader understanding of the technology will inevitably influence the products and applications being developed. The workload of game content creation and movie special effects is expected to decrease: the ability to render photorealistic new views of a scene from only a few input images is a significant advantage over the current state of the art, and it could reshape the established content-design pipeline of the visual effects (VFX) industry.

Conclusion

The field of neural rendering has developed rapidly over the past few years and continues to grow rapidly. Its applications range from arbitrary viewpoint video of rigid and non-rigid scenes to shape and material editing, relighting and human avatar generation.

We believe that neural rendering is still an emerging field with many open challenges that can be addressed.

