[3D Generation] Make-It-3D: Diffusion + NeRF generates high-fidelity 3D objects from a single image (SJTU & Microsoft)


Title: Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior
Paper: https://arxiv.org/pdf/2303.14184.pdf
Code: https://make-it-3d.github.io/



Foreword

In this paper, the researchers aim to create high-fidelity 3D content from a single real or artificially generated image. This opens new avenues for artistic expression and creativity, such as bringing 3D effects to fantasy images created by cutting-edge 2D generative models like Stable Diffusion. By providing a more accessible and automated way to create visually stunning 3D content, the researchers hope to attract a wider audience to effortless 3D modeling.

This article explores the problem of creating high-fidelity 3D content from only a single image. This is inherently a challenging task that requires estimating the underlying 3D geometry while hallucinating unseen textures. To address it, the paper uses the prior knowledge of a pretrained 2D diffusion model as supervision for 3D generation. Make-It-3D adopts a two-stage optimization pipeline: the first stage optimizes a neural radiance field by combining constraints from the reference image at the reference view with the diffusion prior at novel views; the second stage converts the coarse model into a textured point cloud and further improves realism by exploiting the high-quality texture of the reference image together with the diffusion prior. Extensive experiments show that the method significantly outperforms previous approaches, achieving faithful reconstruction and impressive visual quality. The paper is the first attempt to create high-quality 3D content from a single image for general objects, which can be used in various applications such as text-to-3D creation and texture editing.



1. Method

The paper leverages the priors of a text-to-image generation model and an image-text contrastive model (CLIP) to recover high-fidelity texture and geometry through two-stage learning (a Coarse Stage and a Refine Stage). The proposed two-stage 3D learning framework is shown in the figure below.

(Figure: overview of the two-stage Make-It-3D framework.)

1. Coarse Stage: Single-view 3D Reconstruction

As the first stage, the paper reconstructs a coarse NeRF from the single reference image x, constraining novel viewpoints with diffusion priors. The optimization aims to simultaneously satisfy the following requirements:

1. The optimized 3D representation should closely match the input observation x when rendered from the reference view.
2. Renderings from novel views should show semantics consistent with the input and be as plausible as possible.
3. The resulting 3D model should exhibit visually appealing geometry.

Given this, we randomly sample camera poses around the reference view and impose the following constraints on the images g_θ rendered from the reference view and from unseen views:

1. Reference view per-pixel loss

Since the optimized 3D representation should closely match the input observation x at the reference view, we penalize pixel-level differences between the NeRF-rendered image and the input image:
$$\mathcal{L}_{\text{ref}} = \big\lVert\, m \odot \big(g_\theta(\beta_{\text{ref}}) - x\big) \big\rVert^2$$
where m is a foreground matting mask used to restrict the loss to the object region.
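As an illustration, a masked per-pixel loss of this form could look like the following PyTorch sketch; the function signature and tensor layout are assumptions made for the example, not the authors' released code.

```python
import torch

def reference_pixel_loss(rendered: torch.Tensor,
                         reference: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """Masked per-pixel loss between the NeRF rendering and the reference image.

    rendered, reference: (H, W, 3) RGB images; mask: (H, W, 1) foreground matte m.
    """
    # Only penalize differences inside the foreground region given by m.
    return ((rendered - reference) * mask).pow(2).mean()
```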

2. Diffusion prior

Renderings from novel views should exhibit semantics consistent with the input. To this end, the paper uses an image captioning model to generate a detailed textual description y of the reference image. With the text prompt y, the score distillation loss L_SDS can be computed in the latent space of Stable Diffusion (using the text-conditioned diffusion model as a 3D-aware prior), measuring how well the rendered image matches the given text prompt:

$$\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\epsilon_\phi(z_t;\, y,\, t) - \epsilon\big)\, \frac{\partial z}{\partial \theta} \right]$$

where z is the Stable Diffusion latent of the rendered view, ε is the added noise, and ε_φ is the noise predicted by the diffusion model.

While L_SDS can generate 3D models that are faithful to the text prompt, they are not perfectly aligned with the reference image (see the baseline in the figure above), because a text prompt cannot capture every detail of the object. Therefore, the paper adds an additional diffusion CLIP loss, denoted L_CLIP-D, which further forces the generated model to match the reference image:

$$\mathcal{L}_{\text{CLIP-D}} = -\,\mathcal{E}_{\text{CLIP}}(x) \cdot \mathcal{E}_{\text{CLIP}}\big(\hat{x}_0(g_\theta(\beta);\, t,\, y)\big)$$

where E_CLIP is the CLIP image encoder and x̂_0 denotes the denoised estimate of the rendered view.

Specifically, the paper does not optimize L_CLIP-D and L_SDS at the same time: it uses L_CLIP-D for small timesteps and switches to L_SDS for large timesteps. By combining L_SDS and L_CLIP-D, the diffusion prior ensures that the generated 3D model is visually appealing and plausible while also conforming to the given image (see the figure above).
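The timestep-dependent switch between the two objectives can be sketched as below. All model calls (predict_noise, predict_x0, clip_encode, add_noise) are placeholders standing in for the Stable Diffusion UNet and the CLIP image encoder, and the switching threshold t_switch is an assumed value, not taken from the paper.

```python
import torch

def diffusion_prior_loss(z0, t, noise, text_emb, reference_clip_emb,
                         predict_noise, predict_x0, clip_encode, add_noise,
                         t_switch=400, w_t=1.0):
    """Hedged sketch: SDS for large timesteps, diffusion CLIP loss for small ones.

    z0: Stable Diffusion latent of the rendered novel view (requires grad).
    """
    zt = add_noise(z0, noise, t)  # forward diffusion q(z_t | z_0)

    if t >= t_switch:
        # Score distillation: the gradient w.r.t. z0 is w(t) * (eps_pred - eps).
        eps_pred = predict_noise(zt, t, text_emb)
        grad = w_t * (eps_pred - noise)
        # Surrogate loss whose gradient w.r.t. z0 equals `grad`.
        return (grad.detach() * z0).sum()
    else:
        # Diffusion CLIP loss: compare a one-step denoised estimate of the
        # rendering with the reference image in CLIP embedding space.
        x0_hat = predict_x0(zt, t, text_emb)
        sim = torch.cosine_similarity(clip_encode(x0_hat),
                                      reference_clip_emb, dim=-1)
        return -sim.mean()
```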

3. Depth prior

In addition, the model still suffers from shape ambiguity, leading to issues such as concave faces, over-flat geometry, or depth ambiguity (see Figure 3). To address this, the paper uses an off-the-shelf monocular depth estimation model to estimate the depth d of the input image. To account for inaccuracies and scale mismatch in d, the paper regularizes the negative Pearson correlation between the depth d(β_ref) rendered by NeRF at the reference viewpoint and the monocular depth d:

$$\mathcal{L}_{\text{depth}} = -\,\frac{\operatorname{Cov}\big(d(\beta_{\text{ref}}),\, d\big)}{\operatorname{Std}\big(d(\beta_{\text{ref}})\big)\,\operatorname{Std}(d)}$$
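A possible implementation of this regularizer is sketched below; restricting the correlation to the foreground via the matting mask is an assumption of the sketch rather than something stated explicitly above.

```python
import torch

def depth_prior_loss(nerf_depth: torch.Tensor,
                     mono_depth: torch.Tensor,
                     mask: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """Negative Pearson correlation between the NeRF depth at the reference
    view and the monocular depth estimate (invariant to scale and shift)."""
    d1 = nerf_depth[mask > 0.5]
    d2 = mono_depth[mask > 0.5]
    d1 = d1 - d1.mean()
    d2 = d2 - d2.mean()
    pearson = (d1 * d2).sum() / (d1.norm() * d2.norm() + eps)
    return -pearson
```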

4. Overall training

The final total loss is a combination of L_depth, L_ref, L_CLIP-D, and L_SDS. To stabilize optimization, the paper adopts a progressive training strategy: training starts from a narrow range of views near the reference view and the range is gradually expanded. With progressive training, the method achieves 360° object reconstruction, as shown in the figure below.

(Figure: progressive training enables 360° object reconstruction.)
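A rough sketch of the progressive view sampling and the combined coarse-stage objective is given below; the widening schedule, angle ranges, and loss weights are illustrative assumptions, not values from the paper.

```python
import random

def sample_relative_pose(step: int, total_steps: int,
                         max_azimuth: float = 180.0,
                         max_elevation: float = 30.0):
    """Sample a camera pose offset whose range widens as training progresses."""
    frac = min(1.0, 0.2 + 0.8 * step / total_steps)  # start narrow, end at full range
    azimuth = random.uniform(-max_azimuth * frac, max_azimuth * frac)
    elevation = random.uniform(-max_elevation * frac, max_elevation * frac)
    return azimuth, elevation  # degrees, relative to the reference view

def coarse_stage_loss(l_ref, l_sds, l_clip_d, l_depth,
                      w_ref=1.0, w_sds=1.0, w_clip=1.0, w_depth=1.0):
    """Weighted sum of the four coarse-stage terms (weights are assumptions)."""
    return w_ref * l_ref + w_sds * l_sds + w_clip * l_clip_d + w_depth * l_depth
```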

2. Refine Stage: Neural Texture Enhancement

The coarse stage yields a 3D model with reasonable geometry, but it often exhibits rough textures that degrade the overall quality. Further refinement is therefore required to obtain a high-fidelity 3D model.

The main idea is to prioritize texture enhancement while preserving the geometry of the coarse model. We exploit the regions that are observable in both the novel views and the reference view to map the high-quality texture of the reference image onto the 3D representation. The paper then focuses on enhancing the texture of regions that are occluded in the reference view. To make this process tractable, the paper exports the NeRF to an explicit representation, a point cloud (although NeRF, as the coarse representation, handles topology changes smoothly, it is challenging to project the reference image onto it directly). Compared with the noisy mesh exported by Marching Cubes, a point cloud provides a cleaner and more direct projection target.

1. Textured point cloud construction

A simple attempt to construct the point cloud is to render multi-view RGBD images from NeRF and lift them to textured points in 3D space. However, we found that this naive approach leads to noisy point clouds due to conflicts between different views: a 3D point may receive different RGB colors from NeRF renderings under different views [56]. Therefore, we propose an iterative strategy to build a clean point cloud from multi-view observations. As shown in the figure below, we first construct a point cloud from the reference view β_ref using NeRF's rendered depth D(β_ref) and alpha mask M(β_ref):

(Figure 5: iterative construction of the textured point cloud across views.)

$$V(\beta_{\text{ref}}) = \mathcal{P}\big(D(\beta_{\text{ref}}),\, M(\beta_{\text{ref}});\, R_{\text{ref}},\, K\big)$$

where R_ref and K are the camera's extrinsic and intrinsic matrices, and P denotes the depth-to-point projection. These points are visible in the reference view, so they can be colored with the ground-truth texture. When projecting the remaining views β_i, we must avoid introducing points that overlap existing points but carry conflicting colors. To do this, we:

  1. Project the existing points V(β_ref) into a new view β_i to generate a pseudo-mask M_ref(β_i) for view β_i.
  2. Using this mask as a guide, lift only the points V(β_i) that have not yet been observed, as shown in Figure 5.
  3. Initialize these newly added points with the coarse texture of the NeRF rendering G(β_i) of viewpoint β_i produced in the first stage. Concretely, the difference of the two masks, M(β_i) − M_ref(β_i), gives the mask of regions invisible from the original input; these pixels are lifted to 3D points and appended to the visible points (a sketch of the whole procedure follows below).
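A minimal sketch of this iterative construction, assuming placeholder helpers for NeRF rendering, depth-to-point unprojection, and point-to-mask projection (none of these names come from the authors' code):

```python
import torch

def build_textured_point_cloud(views, render_nerf, unproject, project_mask,
                               reference_rgb):
    """Iteratively lift only not-yet-covered pixels into the point cloud.

    render_nerf(view) -> (rgb, depth, alpha); `unproject` lifts masked pixels to
    colored 3D points; `project_mask` splats existing points into a view and
    returns a boolean pseudo-mask of covered pixels. All three are placeholders.
    """
    # Reference view: lift everything inside the alpha mask, textured with the
    # ground-truth reference colors.
    _, depth0, alpha0 = render_nerf(views[0])
    points, colors = unproject(depth0, alpha0 > 0.5, reference_rgb, views[0])

    for view in views[1:]:
        rgb_i, depth_i, alpha_i = render_nerf(view)
        covered = project_mask(points, view, depth_i.shape)  # pseudo-mask M_ref(beta_i)
        new_mask = (alpha_i > 0.5) & ~covered                # M(beta_i) minus M_ref(beta_i)
        # Newly visible points take their (coarse) color from the NeRF rendering.
        new_points, new_colors = unproject(depth_i, new_mask, rgb_i, view)
        points = torch.cat([points, new_points], dim=0)
        colors = torch.cat([colors, new_colors], dim=0)
    return points, colors
```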

2. Deferred point cloud rendering

So far, we have built a set of textured point clouds V = {V(β_ref), V(β_1), ..., V(β_N)}. Although V(β_ref) already carries high-fidelity texture projected from the reference image, the points occluded in the reference view still suffer from the over-smoothed texture of the coarse NeRF, as shown in the figure below. To enhance the texture, we optimize the texture of these remaining points and constrain novel-view renderings with the diffusion prior. Specifically, we optimize a 19-dimensional descriptor F for each point, whose first three dimensions are initialized with the initial RGB colors.

(Figure: points occluded in the reference view show over-smoothed textures from the coarse NeRF before refinement.)

To avoid noisy colors and bleeding artifacts [2], we adopt a multi-scale deferred rendering scheme: given a novel view β, we rasterize the point cloud V at K scales to obtain K feature maps I_i of different sizes [W/2^i, H/2^i], where i ∈ [0, K). These feature maps are then concatenated and rendered into an image I by a jointly optimized U-Net renderer R_θ:

$$I = \mathcal{R}_\theta\big(\,[\,I_0,\, I_1,\, \dots,\, I_{K-1}\,]\,\big), \qquad I_i = \mathcal{S}\big(V,\, \beta;\; [\,W/2^i,\, H/2^i\,]\big)$$
where S is a differentiable point rasterizer. The objective of the texture enhancement stage is similar to that of the coarse geometry stage described above, but we additionally include a regularization term that penalizes large deviations between the optimized and initial textures.
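The multi-scale deferred rendering step might be sketched as follows; upsampling each scale back to full resolution before concatenation is an assumption of the sketch, and rasterize and unet stand in for the differentiable point rasterizer S and the U-Net renderer R_θ.

```python
import torch
import torch.nn.functional as F

def deferred_render(positions, descriptors, camera, rasterize, unet,
                    height=800, width=800, num_scales=3):
    """Rasterize point descriptors at several scales, upsample, concatenate,
    and decode with a jointly trained U-Net renderer."""
    feature_maps = []
    for i in range(num_scales):
        h, w = height // (2 ** i), width // (2 ** i)
        fmap = rasterize(positions, descriptors, camera, (h, w))  # (C, h, w)
        # Bring every scale back to full resolution before concatenation.
        fmap = F.interpolate(fmap.unsqueeze(0), size=(height, width),
                             mode="bilinear", align_corners=False)
        feature_maps.append(fmap)
    stacked = torch.cat(feature_maps, dim=1)  # (1, C * num_scales, H, W)
    return unet(stacked)                      # (1, 3, H, W) rendered image
```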

2. Experiment

1. Implementation details

NeRF rendering. We use Instant-NGP's multi-resolution hash encoding for the NeRF representation in the coarse stage, which keeps the computational cost low. Similar to Instant-NGP, we maintain an occupancy grid to enable efficient ray sampling by skipping empty space. Additionally, we apply shading augmentations such as Lambertian and normal shading to the rendered images.

Point cloud rendering. For deferred rendering, we use a 2D U-Net architecture with gated convolutions. The point descriptor is 19-dimensional; the first 3 dimensions are initialized with RGB colors and the remaining dimensions are initialized randomly. We also set a learnable descriptor for the background.

Camera settings. We sample novel views with 75% probability and the pre-defined reference view with 25% probability. We also randomly enlarge the FOV when rendering with NeRF.

Score distillation sampling. We randomly sample t between 200 and 600 and set w(t) to a uniform weighting over timesteps. We also use classifier-free guidance with guidance weight ω:

$$\hat{\phi}(z_t;\, y,\, t) = (1+\omega)\,\phi(z_t;\, y,\, t) - \omega\,\phi(z_t;\, t)$$

Since our method aims to align the created 3D model with the input image, we use a guidance weight of ω = 10.
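This guidance rule is straightforward to express in code; denoiser here is a stand-in for the Stable Diffusion noise predictor and null_emb for the empty-prompt embedding.

```python
def guided_noise_prediction(denoiser, z_t, t, text_emb, null_emb, omega=10.0):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_cond = denoiser(z_t, t, text_emb)    # prediction conditioned on the text prompt y
    eps_uncond = denoiser(z_t, t, null_emb)  # unconditional prediction
    return (1.0 + omega) * eps_cond - omega * eps_uncond
```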

Training details. We use Adam with a learning rate of 0.001 in both stages. The coarse stage is trained for 5,000 iterations at a rendering resolution of 100×100; the refinement stage is trained for another 5,000 iterations at a rendering resolution of 800×800. On a 32 GB Tesla V100 GPU, training takes about 2 hours.
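For reference, these settings can be collected into a small configuration sketch (the dictionary layout and key names are assumptions; the values come from the text above).

```python
# Hyperparameter summary as stated above; structure is illustrative only.
train_config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "coarse_stage": {"iterations": 5000, "render_resolution": (100, 100)},
    "refine_stage": {"iterations": 5000, "render_resolution": (800, 800)},
}
```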

Test benchmark. We build a test benchmark of 400 images, including real images and images generated by Stable Diffusion. Each image in the benchmark is accompanied by a foreground alpha mask, an estimated depth map, and a text prompt. Text prompts for real images are obtained from an image captioning model.

2. Comparison with SOTA methods

Baselines. We compare our method with five representative baselines: 1) DietNeRF, a few-shot NeRF method, which we train with three input views; 2) SinNeRF, a single-view NeRF method; 3) DreamFusion, which is originally driven by text prompts, so for a fair comparison we additionally condition it on the reference view, denoted DreamFusion+; 4) Point-E, an image-conditioned point cloud generation model; 5) 3D-Photo, a depth-based image warping and inpainting method.

Qualitative comparison. We first compare our method with the 3D generation baselines, where DreamFusion+ uses 2D diffusion as a 3D prior and Point-E is a 3D diffusion model. As shown in Figure 7, their generated models are not aligned with the reference images and suffer from over-smoothed textures. In contrast, our method produces high-fidelity 3D models with fine geometry and realistic textures. Figure 8 shows additional comparisons on novel view synthesis. Due to the lack of multi-view supervision, SinNeRF and DietNeRF struggle to reconstruct complex objects. 3D-Photo fails to recover the underlying geometry and produces visible artifacts under large viewpoint changes. In contrast, our method achieves faithful geometry and visually pleasing textures at novel viewpoints.
(Figure 7: qualitative comparison with 3D generation baselines.)

(Figure 8: qualitative comparison on novel view synthesis.)


Summary

We introduce Make-It-3D, a novel two-stage method for creating high-fidelity 3D content from a single image. By using diffusion priors as 3D-aware supervision, together with the diffusion CLIP loss and textured point cloud enhancement, the generated 3D models exhibit faithful geometry and realistic textures. Make-It-3D applies to general objects and enables versatile, attractive applications. We believe our approach is a big step toward extending the success of 2D content creation to 3D, providing users with an entirely new 3D creation experience.


Origin blog.csdn.net/qq_45752541/article/details/131470999