[NeurIPS 2019] "Outstanding New Direction" Honorable Mention Award Paper Interpretation: A Scene Representation Network Model SRNs

Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Authors:

Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein (Stanford University)

Click here for the paper

Click here for the code

 

1. Summary

Using generative models for unsupervised learning to discover rich representations of 3D scenes has great potential. However, existing representation methods do not explicitly perform geometric reasoning and therefore ignore the underlying 3D structure of the scene. While geometric deep learning has explored 3D-structure-aware representations of scene geometry, these models usually require explicit 3D supervision.

To this end, this paper proposes Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. SRNs represent a scene as a continuous function that maps world coordinates to a feature representation of local scene properties. By formulating image formation as a differentiable ray-marching algorithm, SRNs can be trained end to end from only 2D images and their camera poses, without access to depth or shape information. The approach naturally generalizes across scenes, learning powerful priors over geometry and appearance in the process. The paper evaluates the potential of SRNs on novel view synthesis, few-shot reconstruction, joint shape and appearance interpolation, and unsupervised discovery of a non-rigid face model.

 

2. Introduction

This paper proposes Scene Representation Networks (SRNs). The key idea is to represent the scene implicitly as a continuous, differentiable function that maps 3D world coordinates to a feature-based representation of local scene properties. This allows SRNs to interact naturally with established multi-view and projective geometry techniques while remaining memory-efficient at high spatial resolution. SRNs are trained end to end from only a set of 2D images of a scene. They require no 2D convolutions to produce high-quality images; because they operate on individual pixels, images of arbitrary resolution can be generated. This also generalizes naturally to camera transformations and intrinsic parameters never seen during training: for example, SRNs trained on objects observed only at a fixed distance can still render faithful close-ups of those objects.

The contributions of this paper can be summarized as follows:

(1) A continuous, 3D-structure-aware neural scene representation and rendering model, SRNs, which efficiently encapsulates both the geometry and the appearance of a scene;

(2) An end-to-end training method for SRNs that requires only 2D images, without any explicit supervision in 3D;

(3) A demonstration that SRNs significantly outperform recent baselines from the literature on novel view synthesis, shape and appearance interpolation, few-shot reconstruction, and unsupervised discovery of a non-rigid face model.

 

3. Introduction to SRNs


Figure 1

1. Representing the scene as a function

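In the paper, a scene is represented by a function Φ that maps each 3D world coordinate x ∈ R³ to a feature vector v = Φ(x) describing the local scene properties (geometry and appearance), and Φ is parameterized as a multilayer perceptron. Below is a minimal PyTorch-style sketch of this idea; the layer widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class SceneFunction(nn.Module):
    """Phi: maps a 3D world coordinate to a feature vector of local scene properties."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, xyz):        # xyz: (..., 3) world coordinates
        return self.net(xyz)       # (..., feature_dim) feature vectors
```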

2. Neural rendering


 • Differentiable Ray Marching algorithm

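The differentiable ray marcher renders an image pixel by pixel: for each camera ray it repeatedly queries Φ at the current 3D point, and an LSTM predicts the next step length so that the ray eventually stops at the object surface; the feature at the final point is then handed to the pixel generator. Below is a rough sketch under that description; the step count, hidden size, and initial depth are assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class RayMarcher(nn.Module):
    """LSTM-based differentiable ray marching: predicts step lengths along each camera ray."""
    def __init__(self, feature_dim=256, hidden_dim=16, num_steps=10):
        super().__init__()
        self.lstm = nn.LSTMCell(feature_dim, hidden_dim)
        self.to_step = nn.Linear(hidden_dim, 1)
        self.num_steps = num_steps

    def forward(self, phi, ray_origins, ray_dirs):
        # phi: the scene function; ray_origins, ray_dirs: (num_rays, 3)
        depth = torch.full((ray_origins.shape[0], 1), 0.05, device=ray_origins.device)
        state = None
        for _ in range(self.num_steps):
            points = ray_origins + depth * ray_dirs       # current sample point on each ray
            features = phi(points)                        # query the scene representation
            h, c = self.lstm(features, state)             # state=None -> zero-initialized hidden state
            state = (h, c)
            depth = depth + torch.relu(self.to_step(h))   # take a non-negative step forward
        final_features = phi(ray_origins + depth * ray_dirs)
        return final_features, depth                      # per-ray final feature and depth
```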

 • Pixel generator framework

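The pixel generator is a small per-pixel MLP (equivalently, a stack of 1×1 convolutions) that maps the feature at each ray's final intersection point to an RGB color; because it never looks at neighboring pixels, images of arbitrary resolution can be rendered. A sketch with assumed layer sizes:

```python
import torch.nn as nn

class PixelGenerator(nn.Module):
    """Per-pixel MLP: feature at a ray's final point -> RGB color."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3),            # RGB output for one pixel
        )

    def forward(self, features):          # features: (num_rays, feature_dim)
        return self.net(features)         # (num_rays, 3) colors
```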

3. Generalization across scenes

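To generalize across scenes, each scene j is assigned a latent code z_j, and a hypernetwork Ψ maps z_j to the parameters φ_j of that scene's function Φ; the hypernetwork, ray marcher, and pixel generator are shared by all scenes. The toy sketch below generates only a single linear layer of Φ (the paper generates all of Φ's parameters), and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Maps a per-scene latent code z_j to the weights of (one layer of) that scene's Phi."""
    def __init__(self, latent_dim=256, in_dim=3, out_dim=256):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.to_params = nn.Linear(latent_dim, in_dim * out_dim + out_dim)

    def forward(self, z):                                   # z: (latent_dim,)
        params = self.to_params(z)
        w = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        # Return a scene-specific mapping x -> relu(W x + b), i.e. one layer of Phi_j
        return lambda x: torch.relu(x @ w.t() + b)
```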

4. Joint optimization

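During training, the latent codes {z_j}, the hypernetwork, the ray marcher, and the pixel generator are optimized jointly by minimizing a reconstruction loss over all posed 2D images, plus regularizers: a depth term that keeps the marched depths in front of the camera and a Gaussian prior on the latent codes. A hedged sketch of such an objective; the weighting factors below are illustrative assumptions, not the paper's values.

```python
import torch

def srn_loss(rendered_rgb, target_rgb, final_depth, z,
             lambda_depth=1e-3, lambda_latent=1.0):
    """Per-image objective: L2 image reconstruction plus depth and latent regularizers."""
    l_img = ((rendered_rgb - target_rgb) ** 2).mean()     # reconstruct the observed 2D image
    l_depth = torch.relu(-final_depth).pow(2).mean()      # penalize rays ending behind the camera
    l_latent = (z ** 2).mean()                            # Gaussian prior on the scene latent code
    return l_img + lambda_depth * l_depth + lambda_latent * l_latent
```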

 

4. Experimental results

SRNs are trained on several object categories and evaluated on novel view synthesis and few-shot reconstruction; the unsupervised discovery of a non-rigid face model is also demonstrated. The paper's supplementary material compares SRNs with DeepVoxels on single-scene novel view synthesis and details hyperparameters, architecture, and other implementation choices, which are not repeated here.

1. Shepard-Metzler objects:

Seven-element Shepard-Metzler objects are used, with dGQN as the baseline and novel-view reconstruction accuracy as the metric. On the training set, SRNs produce nearly pixel-perfect results, with a PSNR (peak signal-to-noise ratio) of 30.41 dB. On this limited dataset, dGQN fails to learn the objects' shape and multi-view geometry, reaching a PSNR of only 20.85 dB.

In the two-shot setting, SRNs can reconstruct any part of the object that has been observed, reaching 24.36 dB, whereas dGQN reaches only 18.56 dB. In the one-shot setting, SRNs reconstruct an object consistent with the single observed view. As expected, because the current model is non-probabilistic, both dGQN and SRNs reconstruct an object that is effectively an average over the many possible objects that could have produced the observation; their final PSNRs are 17.51 dB and 18.11 dB, respectively.

2. ShapeNet v2:

Only the "chair" and "car" categories of ShapeNet v2 are used. Novel view synthesis is evaluated on the training set and on a held-out test set. Quantitative and qualitative comparisons between SRNs and the baseline models are shown in Table 1 and Figure 2, respectively.


Table 1


Figure 2: Qualitative comparison with the baseline models

The experimental results show that SRNs significantly outperform the other models. SRNs' renderings are consistent across views; the only failures occur on objects with unusual, fine geometric detail, such as windshields. The views generated by the baseline models, by contrast, are not multi-view consistent. In the two-shot setting, most of the object has been observed, and SRNs robustly reconstruct both its appearance and its geometry. In the single-shot setting, SRNs complete the unseen parts of the object in a plausible way, indicating that the learned priors genuinely capture the underlying distribution.

3. Parameterized non-rigid deformation:

When the latent parameters of a scene are known, the model can be conditioned on them directly instead of solving for latent variables $\mathbf{z}_j$. 1,000 faces are randomly sampled from the Basel face model, and 50 renderings are generated for each. Every face is defined by a 224-dimensional parameter vector, of which the first 160 dimensions define identity and the last 64 control facial expression. SRNs reconstruct both the geometry and the appearance of the faces. After training, facial expressions can be driven by varying the 64 expression parameters while keeping the identity fixed, even for identity-expression combinations never observed during training. Figure 3 shows qualitative results of this non-rigid deformation:


Figure 3: Non-rigid deformation of faces. Note that the mouth movement is also visible in the normal maps
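Concretely, the conditioning amounts to feeding the known 224-dimensional Basel parameter vector to the hypernetwork in place of a free latent code $\mathbf{z}_j$. A small illustrative snippet, reusing the HyperNetwork sketch from Section 3; all names and values here are hypothetical placeholders, not the paper's implementation.

```python
import torch

# Known face parameters take the place of a learned latent code z_j.
identity = torch.randn(160)                      # identity coefficients (random placeholder values)
expression = torch.randn(64)                     # expression coefficients (random placeholder values)
face_params = torch.cat([identity, expression])  # 224-dim Basel parameter vector

hypernet = HyperNetwork(latent_dim=224)          # HyperNetwork as sketched in Section 3
phi_layer = hypernet(face_params)                # face-specific (partial) scene function
features = phi_layer(torch.randn(8, 3))          # query at 8 random 3D points -> (8, 256) features
```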

4. Geometry reconstruction:

SRNs recover geometry in a fully unsupervised manner: the geometry emerges solely because it helps explain the 2D observations in 3D. Figure 4 shows geometry reconstruction results:


Figure 4: Normal maps of selected objects. The geometry is learned fully unsupervised, arising purely from the image views and multi-view geometry constraints

5. Latent-space interpolation:

The latent space learned by the model allows interpolation between object instances; Figure 5 gives an example:


Figure 5: Latent-code interpolation between cars and chairs from the ShapeNet dataset while the camera simultaneously rotates around the model; features transition smoothly from one object to the next

6. Camera pose extrapolation:

Thanks to explicit 3D reasoning and per-pixel rendering, SRNs naturally generalize to 3D transformations such as moving the camera in for a close-up or rotating it around the object. See the pose-extrapolation video in the supplementary material.

 

5. Discussion

This paper proposes SRNs, a 3D-structure-aware neural scene representation. The model represents a scene as a continuous, differentiable function that maps 3D coordinates to a feature-based scene representation, which a differentiable ray marcher then renders into 2D images; the whole pipeline is trained end to end. SRNs require no shape supervision and can be trained purely from a set of posed 2D images. They are evaluated on novel view synthesis, shape and appearance interpolation, and few-shot reconstruction.

Future work:

(1) Explore SRNs within a probabilistic framework;

(2) Extend the model to view-dependent effects, lighting, transparency, and participating media;

(3) Extend to other image formation models, such as computed tomography (CT) or magnetic resonance imaging (MRI);

(4) Jointly estimate camera poses instead of assuming known camera parameters;

(5) Beyond vision and graphics, SRNs have exciting applications; future work may explore them for robotic manipulation or as world models for autonomous agents;

(6) Generalize to complex, cluttered 3D environments;

(7) Combine with meta-learning to improve cross-scene generalization.

 

Interpretation | Liu Jiepeng

Typography | Academic Spinach

Proofreading | Xiaoman, Yizhi

Responsible Editor | Young Academics, Excellent Academics

 

