[Single-view reconstruction] RealFusion: 360° Reconstruction of Any Object from a Single Image

Project homepage: https://lukemelas.github.io/realfusion
Article: RealFusion: 360° Reconstruction of Any Object from a Single Image



Summary

We take a conditional image generator based on a diffusion model and engineer a prompt that encourages it to "imagine" novel views of the object, so that a complete 360° model can be reconstructed from a single image. Using the recent DreamFusion approach, we fuse the given input view, the conditional prior, and other regularizers into a final, consistent reconstruction. The reconstruction faithfully matches the input view and provides plausible predictions of the object's appearance and 3D shape, including the side of the object that is not visible.


I. Introduction

The challenge is that a single image does not contain enough information for a 3D reconstruction. It can still be done by leveraging our vast knowledge of the natural world and of the objects it contains to make up for the information missing from the image.

To solve the problem, visual geometry must be combined with a powerful statistical model of the 3D world. Recently, 2D image generators such as DALL-E [36], Imagen [42], and Stable Diffusion [40] have shown that diffusion can solve highly ambiguous generation tasks, producing plausible 2D images from text descriptions, semantic maps, partially complete images, or even random noise. Clearly, these models embody high-quality priors. However, while billions of 2D images are available for training them, the same cannot be said for 3D data.

An alternative to training a 3D diffusion model from scratch is to extract the 3D information from an existing 2D model. A recent example is DreamFusion [33], which generates high-quality 3D models from textual descriptions alone.

In this paper, we represent the 3D geometry and appearance of the object with a neural radiance field. Training reconstructs the given input image by minimizing the usual rendering loss. At the same time, we randomly sample other views of the object and constrain them with the diffusion prior, using a technique similar to DreamFusion. This requires conditioning the diffusion model appropriately: the idea is to configure the prior to "imagine", or sample, images that could plausibly be other views of the object. Specifically, the diffusion prompt is created by randomly augmenting the given image. Only in this way can the diffusion model provide a sufficiently strong constraint.

In addition to setting the prompt correctly, we add several regularizers: shading the underlying geometry and randomly dropping the texture (also similar to DreamFusion), smoothing the surface normals, and fitting the model in a coarse-to-fine fashion, first capturing the object's overall structure and only then the fine-grained details. For efficiency, the model is based on InstantNGP [29], achieving reconstruction within hours.

We do not train a fully fledged 2D-to-3D model, nor are we restricted to specific object categories; instead, we use a pre-trained 2D generator as a prior, on a per-image basis. Quantitatively and qualitatively, the method surpasses previous single-image reconstructors, including Shelf-Supervised Mesh Prediction [58], whose supervision is specifically tailored to 3D reconstruction.

Contributions:

(1) RealFusion, which requires no assumptions about the type of object imaged and no 3D supervision of any kind; it leverages an existing 2D diffusion image generator via a new single-image variant of textual inversion

(2) New regularizers, with an efficient implementation on top of InstantNGP

2. Related work

1. Image-based appearance and geometry reconstruction

The photometric and geometric reconstruction problem has been greatly revitalized by the introduction of Neural Radiance Fields (NeRF), in which a coordinate MLP provides a compact yet expressive representation of the 3D field that can be modeled efficiently. NeuS uses signed distance functions (SDFs) to recover cleaner geometry. These methods assume hundreds of views of each scene for reconstruction; in contrast, we use a diffusion model to "imagine" the missing views.

2. Few-view reconstruction

Closely related to our work is NeRF-on-a-Diet [17], which reduces the number of images required to learn a NeRF by generating random views and measuring their "semantic compatibility" with the available views via CLIP embeddings [35]; however, it still requires multiple input views.

3. Single view reconstruction

Recovering a complete radiance field from a single image typically requires multi-view data for training, as well as a model that is specific to a particular object class. 3D-R2N2 [5], Pix2Vox [55], and LegoFormer [57] learn to reconstruct volumetric representations of simple objects, mostly on synthetic data such as ShapeNet. More recently, CodeNeRF [19] predicts a complete radiance field, including the photometry of the reconstructed object.

4. Extracting 3D models from 2D generators

CLIP-Mesh and Dream Fields achieve text-to-3D generation by using CLIP embeddings. Our model builds on DreamFusion, which uses a diffusion model instead. [Novel view synthesis with diffusion models] proposes to directly generate multiple 2D views of an object and then reconstruct it in 3D with a NeRF-like model; however, that method requires multi-view data for training, is only tested on synthetic data, and must explicitly sample multiple views for reconstruction.

5. Diffusion models

Denoising diffusion probabilistic models are a class of generative models based on iteratively reversing a Markovian noising process. In vision, early work formulated the problem as learning a variational lower bound [14], or framed it as optimizing a discretization of score-based generative models [45, 46] or of continuous stochastic processes [47]. Recent improvements include faster and deterministic sampling [14, 25, 52], class-conditional models [7, 46], text-conditional models [32], and modeling in latent space [41].

3. Method

3.1. Radiance fields and DreamFusion (preliminaries)

  1. Radiance fields
    A radiance field (RF) is a pair of functions (σ(x), c(x)) mapping a 3D point x ∈ R^3 to an opacity value σ(x) ∈ R_+ and a color value c(x) ∈ R^3. The color I(u) ∈ R^3 of a pixel u is obtained by ray marching (see the blog [NeRF Principle] for details):

$$ I(u) = R(u;\,\sigma,c) = \sum_{i}\big(T_i - T_{i+1}\big)\,c(x_i), \qquad T_i = \exp\!\Big(-\sum_{j<i}\sigma(x_j)\,\lVert x_{j+1}-x_j\rVert\Big) \tag{1} $$

Fitting the RF to a given view I_0 captured from a camera π_0 uses the reconstruction loss:

$$ \mathcal{L}_{\mathrm{rec}}(\sigma,c) = \lVert I_0 - R(\cdot;\,\sigma,c,\pi_0) \rVert^2 \tag{2} $$
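To make Equation (1) concrete, here is a minimal sketch (in PyTorch, with hypothetical `sigma_fn`/`color_fn` callables standing in for the RF) of emission-absorption rendering along one ray; it also returns the accumulated opacity, which the mask loss in Section 3.2 reuses. It is an illustration of the standard quadrature, not the authors' implementation.

```python
import torch

def render_ray(sigma_fn, color_fn, origin, direction, near=0.1, far=4.0, n_samples=64):
    """Approximate Eq. (1): accumulate color along one ray by emission-absorption."""
    t = torch.linspace(near, far, n_samples)              # depths along the ray
    points = origin + t[:, None] * direction              # sample points x_i, shape (n_samples, 3)
    sigma = sigma_fn(points)                               # opacities sigma(x_i), shape (n_samples,)
    color = color_fn(points)                               # colors c(x_i), shape (n_samples, 3)

    delta = t[1:] - t[:-1]                                 # spacings ||x_{i+1} - x_i||
    alpha = 1.0 - torch.exp(-sigma[:-1] * delta)           # per-interval opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0
    )[:-1]                                                 # transmittance T_i
    weights = trans * alpha                                # T_i - T_{i+1} in Eq. (1)
    rgb = (weights[:, None] * color[:-1]).sum(dim=0)       # rendered pixel color I(u)
    opacity = weights.sum()                                # accumulated opacity (used by the mask loss)
    return rgb, opacity
```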

  2. Diffusion model

See the blog [DDPM Probabilistic Diffusion Model] for background; the training loss is the standard noise-prediction objective:

$$ \mathcal{L}_{\mathrm{diff}}(\Phi;\,D) = \mathbb{E}_{x,t,\epsilon}\Big[\, w(t)\,\lVert \Phi(\alpha_t x + \sigma_t \epsilon;\, t) - \epsilon \rVert^2 \,\Big] \tag{3} $$

where x is an image from the training set D, t is a random noise level, ε is Gaussian noise, and (α_t, σ_t) define the noise schedule.
This model is easily extended to draw samples from a distribution p(x|e) conditioned on a prompt e. Conditioning on the prompt is obtained by adding e as an additional input to the network Φ, and the strength of the conditioning can be controlled by classifier-free guidance [7].
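For concreteness, a minimal sketch of the noise-prediction objective in Equation (3), assuming a denoiser `model(noisy, t, e)` and precomputed noise schedules `alphas`/`sigmas`, with the weighting w(t) taken as 1:

```python
import torch

def diffusion_loss(model, images, text_embedding, alphas, sigmas):
    """Eq. (3) with w(t) = 1: predict the noise added at a random timestep."""
    b = images.shape[0]
    t = torch.randint(0, len(alphas), (b,))                # random timestep per image
    eps = torch.randn_like(images)                         # Gaussian noise
    a = alphas[t].view(b, 1, 1, 1)
    s = sigmas[t].view(b, 1, 1, 1)
    noisy = a * images + s * eps                           # forward (noising) process
    eps_pred = model(noisy, t, text_embedding)             # denoiser prediction, conditioned on the prompt
    return ((eps_pred - eps) ** 2).mean()
```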

  3. DreamFusion and SDS (Score Distillation Sampling)

Given a 2D diffusion model p(I|e) and a prompt e, DreamFusion extracts a 3D rendition of the corresponding concept in the form of an RF (σ, c). It randomly samples a camera parameter π, renders the corresponding view I_π with the RF, assesses how plausible that view is under the model p(I_π|e), and updates the RF to make the rendered view more plausible.

DreamFusion uses a denoising network as a frozen evaluator and takes gradient steps:

$$ \nabla_{(\sigma,c)}\,\mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon,\pi}\Big[\, w(t)\,\big(\Phi(\alpha_t I + \sigma_t \epsilon;\, t, e) - \epsilon\big)\,\frac{\partial I}{\partial(\sigma,c)} \,\Big] \tag{4} $$
where I = R(·; σ, c, π) is the image rendered from a given viewpoint π under prompt e. This process is called Score Distillation Sampling (SDS). Equation (4) differs from simply optimizing the standard diffusion-model objective in that it omits the Jacobian term of Φ. In practice, removing this term both improves generation quality and reduces computational and memory requirements.

This last aspect of DreamFusion is essential for understanding our contribution in the next section: to obtain good 3D shapes, DreamFusion must use a very large classifier-free guidance weight [7] (around 100), much larger than the values used for ordinary image sampling. As a result, the diversity of its outputs is limited; it tends to generate only the most likely object for a given prompt, which is incompatible with our goal of reconstructing any given object.
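The sketch below shows how one SDS update (Equation (4)) is typically realized in code: the noise residual is injected as a gradient on the rendered image while the denoiser's Jacobian is skipped by evaluating it under `no_grad`. The `model` call, the schedules, and the guidance weight of 100 (matching the discussion above) are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def sds_step(model, rendered, text_emb, uncond_emb, alphas, sigmas, guidance=100.0):
    """One Score Distillation Sampling step on a rendered view `rendered` (requires grad)."""
    t = torch.randint(0, len(alphas), (1,))
    eps = torch.randn_like(rendered)
    noisy = alphas[t] * rendered + sigmas[t] * eps

    with torch.no_grad():                                  # skip the Jacobian of the denoiser
        eps_cond = model(noisy, t, text_emb)
        eps_uncond = model(noisy, t, uncond_emb)
        eps_hat = eps_uncond + guidance * (eps_cond - eps_uncond)  # classifier-free guidance

    grad = eps_hat - eps                                   # per-pixel SDS gradient
    # Route this gradient to the radiance-field parameters through the rendered image.
    (grad * rendered).sum().backward()
```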

3.2. RealFusion (core method)

Goal: use the prior captured in the diffusion model Φ to reconstruct a 3D model of the object in a single image I_0. This is achieved by optimizing a radiance field with two objectives simultaneously:

  1. A reconstruction objective, Equation (2), on the fixed input view
  2. An SDS-based prior objective, Equation (4), on novel views randomly sampled at each iteration

1. Single-image textual inversion as a substitute for alternative views

We use single-image textual inversion instead of multiple views. Since new views cannot be sampled directly from the distribution p(I|I_0), a text prompt e(I_0) is synthesized specifically for the input image I_0, so that the conditional distribution p(I|e(I_0)) approximates the multi-view distribution p(I|I_0).

This is achieved by treating random augmentations g(I_0), g ∈ G, of the input image as pseudo alternative views. We use these augmentations as a mini-dataset D' = {g(I_0)}, g ∈ G, and minimize the diffusion loss of Equation (3), L_diff(Φ(·; e(I_0))), with respect to the embedding of the new token «e», while keeping the text encoder and the model parameters Φ frozen.

In practice, the prompt is automatically derived from the template "an image of a «e»", where «e» is a new token introduced into the vocabulary of the diffusion model's text encoder; its embedding is the only quantity optimized. This procedure mirrors and generalizes the recently proposed textual inversion method [10: An Image is Worth One Word].
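A minimal sketch of the single-image textual inversion loop just described: only the embedding of the new token «e» is optimized, using the diffusion loss of Equation (3) on random augmentations of I_0, while the diffusion model stays frozen. Here `encode_prompt` is an assumed helper that splices the token embedding into the template "an image of a <e>", `diffusion_loss` is the sketch from Section 3.1, and the embedding dimension 768 is an assumption.

```python
import random
import torch

def invert_token(model, image_0, augmentations, alphas, sigmas, steps=3000, lr=5e-3):
    """Optimize the embedding of the new token <e> on pseudo views g(I_0)."""
    token_embedding = torch.randn(768, requires_grad=True)   # embedding of <e> (dimension assumed)
    optimizer = torch.optim.Adam([token_embedding], lr=lr)

    for _ in range(steps):
        g = random.choice(augmentations)                      # random augmentation g in G
        pseudo_view = g(image_0)                              # pseudo alternative view, shape (1, 3, H, W)
        prompt_emb = encode_prompt("an image of a <e>", token_embedding)  # assumed helper
        loss = diffusion_loss(model, pseudo_view, prompt_emb, alphas, sigmas)  # Eq. (3)

        optimizer.zero_grad()
        loss.backward()                                       # gradients reach only the token embedding
        optimizer.step()
    return token_embedding
```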


The text embedding «e» captures knowledge of the original image: if a generic prompt such as "an image of a fish" is used to reconstruct a fish image with the losses of Equations (3) and (4), the result often resembles the input fish from the front but a different, more generic fish from the back. In contrast, with the prompt "an image of a «e»", the reconstruction resembles the input fish from all angles. Figure 7 shows a concrete example.


2. Training from coarse to fine

RF model: InstantNGP is a grid-based model that stores features at the vertices of a set of multi-resolution feature grids {G_1, …, G_L}, and offers high computational efficiency and fast training. However, the optimization occasionally produces small irregularities on the object surface. We find that coarse-to-fine training helps alleviate these problems: during the first half of training we only optimize the low-resolution feature grids {G_1, …, G_{L/2}}, and during the second half we optimize all feature grids {G_1, …, G_L}. With this strategy we obtain both efficient training and high-quality results.
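One possible way to implement the coarse-to-fine schedule, assuming the InstantNGP feature grids are exposed as a list `grids` of parameter tensors (a sketch, not the paper's implementation):

```python
def set_trainable_levels(grids, step, total_steps):
    """Coarse-to-fine: only the low-resolution grids receive gradients in the first half of training."""
    num_levels = len(grids)                                   # L resolution levels, coarse to fine
    active = num_levels // 2 if step < total_steps // 2 else num_levels
    for i, grid in enumerate(grids):
        grid.requires_grad_(i < active)                       # freeze the fine levels early on
```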

3. Normal vector regularization

We observe that our RF model occasionally produces noisy surfaces with low-level artifacts. We therefore introduce a new regularization term that encourages the geometry of the RF to have smooth normals. Notably, this regularization is performed in 2D rather than in 3D.

At each iteration, in addition to the RGB and opacity values, we also compute the normal at each point along the ray and aggregate these via the same ray-marching equation, obtaining a normal map N ∈ R^{H×W×3}. The regularizer penalizes the deviation of N from a smoothed copy of itself:
$$ \mathcal{L}_{\mathrm{normals}} = \lVert N - \mathrm{stopgrad}\big(\mathrm{blur}(N, k)\big) \rVert^2 \tag{5} $$

where stopgrad is a stop-gradient operation and blur(·, k) is a Gaussian blur with kernel size k (we use k = 9). Although it is more common to regularize normals in 3D, operating in 2D reduces the variance of the regularization term and produces better results.
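A minimal sketch of Equation (5), assuming the normal map N has already been rendered as an H×W×3 tensor; the blur is torchvision's Gaussian blur with kernel size k = 9, and the stop-gradient is the `.detach()` on the blurred copy.

```python
import torchvision.transforms.functional as TF

def normal_smoothness_loss(normals, kernel_size=9):
    """Eq. (5): penalize deviation of the rendered normal map from a blurred (stop-grad) copy."""
    n = normals.permute(2, 0, 1).unsqueeze(0)                 # (1, 3, H, W) layout for torchvision
    blurred = TF.gaussian_blur(n, kernel_size=kernel_size)    # blur(N, k)
    return ((n - blurred.detach()) ** 2).mean()               # stopgrad on the blurred target
```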

4. Mask loss

In addition to the input image, the model uses a mask of the object one wishes to reconstruct; in practice, we obtain it with an off-the-shelf image matting model. The mask M is incorporated by adding a simple L2 loss on its difference from the opacity R_σ(π_0) ∈ R^{H×W} rendered from the fixed reference viewpoint:

$$ \mathcal{L}_{\mathrm{mask}} = \lVert M - R_\sigma(\cdot;\,\sigma,\pi_0) \rVert^2 \tag{6} $$

The final optimization objective contains four terms; the first two form the prior objective (evaluated on a randomly sampled view) and the last two form the reconstruction objective (evaluated on the fixed reference view):

$$ \mathcal{L} = \mathcal{L}_{\mathrm{SDS}} + \lambda_{\mathrm{normal}}\,\mathcal{L}_{\mathrm{normals}} + \lambda_{\mathrm{image}}\,\mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{mask}}\,\mathcal{L}_{\mathrm{mask}} \tag{7} $$
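As a sketch of how the reconstruction half of Equation (7) might look in code, with the weights reported in Section 4 and with `rgb_ref`/`opacity_ref` assumed to come from rendering the fixed reference view π_0 (e.g. with the ray-marching sketch above); the prior terms L_SDS and λ_normal·L_normals are added from a randomly sampled view at each iteration, as in the SDS sketch.

```python
def reconstruction_losses(rgb_ref, opacity_ref, image_0, mask_0,
                          lambda_image=5.0, lambda_mask=0.5):
    """Reconstruction part of Eq. (7): image and mask L2 terms on the fixed reference view."""
    loss_image = ((rgb_ref - image_0) ** 2).mean()            # Eq. (2)
    loss_mask = ((opacity_ref - mask_0) ** 2).mean()          # Eq. (6)
    return lambda_image * loss_image + lambda_mask * loss_mask
```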


4. Experiments

1. Settings

The diffusion model prior is trained on text-image pairs from the LAION [43] dataset. The InstantNGP model uses 16 resolution levels, feature dimension 2, and a maximum resolution of 2048, and is trained in the coarse-to-fine manner described above.

The reconstruction camera is placed on a sphere of radius 1.8, looking at the origin from 15 degrees above the horizontal plane. At each optimization step the reference view is rendered and the reconstruction losses L_rec and L_mask are computed, with λ_image = 5.0, λ_mask = 0.5, and λ_normal = 0.5.
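As an illustration of this camera setup, a small sketch (assuming a simple spherical-coordinate convention, not necessarily the one used in the paper) of computing a camera position at radius 1.8 and 15° elevation; a look-at matrix toward the origin would then complete the pose.

```python
import math

def camera_position(radius=1.8, elevation_deg=15.0, azimuth_deg=0.0):
    """Place a camera on a sphere around the origin at a fixed elevation above the plane."""
    elev = math.radians(elevation_deg)
    azim = math.radians(azimuth_deg)
    x = radius * math.cos(elev) * math.cos(azim)
    y = radius * math.cos(elev) * math.sin(azim)
    z = radius * math.sin(elev)                               # height above the plane
    return (x, y, z)
```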

2. Quantitative results

Baseline: Shelf-Supervised Mesh Prediction [58], a method for reconstructing arbitrary 3D objects that provides 50 pre-trained models for 50 different categories of OpenImages [23]. We evaluate on 7 categories of the CO3D dataset [39] that have corresponding OpenImages categories. Three images per category are randomly selected, and both RealFusion and Shelf-Supervised are run on them to obtain reconstructions.

Figure 5 evaluates the quality of the reconstructed 3D shapes. Shelf-Supervised predicts a mesh directly; for our method, we extract a mesh from the predicted radiance field using marching cubes. CO3D provides sparse point clouds of the objects, reconstructed via multi-view geometry. For evaluation, we sample points from the reconstructed mesh and optimally align them to the ground-truth point cloud using ICP (Iterative Closest Point), also estimating a scaling factor. An F-score with threshold 0.05 then measures the agreement between the predicted and ground-truth point clouds. Since the renderings should also be visually close to the real views, we additionally report CLIP embedding similarity. The results are shown in Table 1:

[Table 1: F-score and CLIP similarity for RealFusion and Shelf-Supervised on the CO3D categories]
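For reference, a minimal sketch of the F-score at threshold 0.05 between predicted and ground-truth point clouds (ICP alignment and the scale estimate are assumed to have been applied beforehand); nearest-neighbor distances are computed brute-force with NumPy.

```python
import numpy as np

def f_score(pred, gt, threshold=0.05):
    """Harmonic mean of precision and recall at a distance threshold between two point clouds."""
    d_pred_to_gt = np.min(np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1), axis=1)
    d_gt_to_pred = np.min(np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1), axis=1)
    precision = (d_pred_to_gt < threshold).mean()             # predicted points close to ground truth
    recall = (d_gt_to_pred < threshold).mean()                # ground-truth points that are covered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```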

3. Qualitative results

Figure 4 shows qualitative multi-view results: 3D reconstructions obtained from single images. Figure 6 explores the ability to sample the space of possible solutions by repeating the reconstruction several times from the same input image: the reconstructions barely differ on the front of the object, but vary considerably on the back.

[Figure 4 and Figure 6: qualitative multi-view reconstructions and variability across repeated runs]


Original post: https://blog.csdn.net/qq_45752541/article/details/132581324