Is a two-pronged 2D-and-3D approach the winning strategy? KAUST, Snap, and VGG propose Magic123, a framework for 3D reconstruction from a single image

In the recent AIGC community, 3D visual generation has been receiving more and more attention. Deep rendering networks based on Neural Radiance Fields (NeRF) have demonstrated impressive 3D results. However, NeRF requires a large number of multi-view images as supervision, so 3D reconstruction from a single 2D image remains extremely challenging. This article introduces Magic123 (One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors), a work from KAUST, the University of Oxford's VGG group, and Snap. Magic123 is a two-stage, coarse-to-fine 3D generation framework that uses both 2D and 3D visual priors to perform 3D reconstruction from a single image. The figure below compares Magic123 with other baseline methods.

The authors show four objects: a teddy bear, a dragon statue, a horse, and a colorful teapot. The 3D reconstructions from Magic123 are relatively complete and closely match the 3D shape and texture of the corresponding real objects. The two most recent comparison methods, NeuralLift-360 [1] and RealFusion [2] (both published at CVPR 2023), show clear defects in their control of the objects' 3D shape and texture details; NeuralLift-360 even produces the blunder of a horse with two heads. Magic123 achieves better results mainly for two reasons:

  1. The authors use both 2D and 3D priors, driving the model to strike a balance between reconstruction imagination and 3D consistency while achieving better generalization;
  2. Training proceeds in two stages: in the first stage the authors optimize a NeRF network to produce a rough geometry, and in the second stage it is progressively refined into a texture-rich, high-resolution 3D mesh.

Paper Link:

https://arxiv.org/abs/2306.17843
Project Address:
https://guochengqian.github.io/project/magic123
Code Repository:
https://github.com/guochengqian/Magic123

1. Introduction

Although humans usually observe the world in 2D, the human brain has a very powerful capacity for three-dimensional imagination and reasoning. How to simulate this 3D reasoning ability is a hot topic in 3D vision research. A 3D synthesis model needs to preserve the geometry and texture details of the original object as faithfully as possible while generating the 3D object. However, current methods for 3D reconstruction from only a single image still face a performance bottleneck. The authors attribute this mainly to two causes: (1) existing methods usually rely on large-scale labeled 3D datasets, which limits the model's generalization to unseen domains; (2) existing methods struggle to trade off the level of detail in the generated 3D objects against the computational resources needed to process 3D data. As shown in the figure below, the authors use a teddy bear, a donut, and a dragon statue as three different 3D reconstruction scenarios. Since teddy bears are relatively common, the model can restore them well by learning 3D priors alone. For the dragon statue in the lower left corner, 3D datasets with limited annotations are no longer sufficient: although the generated geometry is 3D-consistent, it lacks detail.

Compared with 3D generative models, 2D image generative models have developed noticeably faster and are more mature. Existing 2D generative models are trained on massive text-annotated image collections and therefore cover a wide range of image semantics. Using 2D models as priors for generating 3D content has thus become very popular, as in DreamFusion [6]. However, the authors found that relying entirely on 2D priors produces serious 3D inconsistencies, such as the Janus problem (generating multiple faces) and objects whose size and material differ across viewpoints. Magic123 therefore proposes to use a 3D prior and a 2D prior at the same time, with a trade-off parameter between them, so that the quality of the generated 3D model can be adjusted dynamically. In addition, the authors found that a traditional NeRF occupies a large amount of GPU memory, which limits the resolution of rendered images and hurts the detail of the 3D generation. The second stage of Magic123 therefore adopts a memory-efficient hybrid 3D representation, which raises the final generation resolution to 1K while refining the geometry, texture, and details of the generated objects.

2. Method

Magic123 jointly exploits 2D and 3D image-generation diffusion priors and performs single-image 3D reconstruction in a two-stage, coarse-to-fine manner. The overall framework of Magic123 is shown in the figure below.

2.1 Coarse stage

The left half of the figure above shows the coarse generation stage of Magic123. In this stage, the model focuses on optimizing the basic geometric structure of the object, and generation relies mainly on NeRF. Magic123 first deploys a pre-trained segmentation model, the Dense Prediction Transformer [3], to extract the foreground object from the given single image. In the coarse stage the authors also jointly consider the image reconstruction supervision required for NeRF synthesis, guidance from novel-view images, depth priors for the generated 3D object, and NeRF's own tendency to synthesize artifacts, and they design a corresponding loss term for each of these factors to jointly optimize the overall model:
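The combined objective appears as an image in the original post. As a hedged sketch, the coarse-stage loss can be written as a weighted sum of the four factors listed above (the symbols and weights here are illustrative and may not match the paper's exact notation):

$$
\mathcal{L}_{\text{coarse}} \;=\; \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}} \;+\; \lambda_{g}\,\mathcal{L}_{g} \;+\; \lambda_{d}\,\mathcal{L}_{d} \;+\; \lambda_{n}\,\mathcal{L}_{n},
$$

where $\mathcal{L}_{\text{rec}}$ is the reconstruction loss at the reference view, $\mathcal{L}_{g}$ is the novel-view guidance loss from the diffusion priors, $\mathcal{L}_{d}$ is the depth prior loss, $\mathcal{L}_{n}$ is a regularization term that suppresses NeRF's synthesis artifacts, and the $\lambda$'s are weighting coefficients.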

2.2 Fine stage

Because NeRF has a huge computational overhead and easily introduces artifact noise, the coarse stage can only produce a low-resolution, semi-finished 3D model. The fine stage of Magic123 adopts a hybrid SDF-mesh representation, DMTet [4], which greatly improves memory efficiency compared with NeRF synthesis. The authors note that Instant-NGP, a previously proposed, more resource-efficient NeRF alternative, can only reach 128×128 resolution on a 16 GB GPU, whereas with DMTet the Magic123 framework can readily synthesize high-fidelity 3D models at 1K resolution.

2.3 Trade-off between 2D prior and 3D prior

The 2D image prior used in Magic123 comes from the score distillation sampling (SDS) loss of Stable Diffusion. SDS acts on the image diffusion process: it first encodes the rendered view into the latent space and adds a certain amount of noise, and then predicts the denoised view conditioned on the input text prompt. SDS thus builds a bridge between the rendered view content and the text prompt. The SDS loss is defined as follows:
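The formula itself is shown as an image in the original post. The standard SDS gradient, following DreamFusion and adapted to Stable Diffusion's latent space (so the exact notation may differ from the paper), is:

$$
\nabla_{\theta}\,\mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(z_t;\, e,\, t) - \epsilon\big)\,\frac{\partial z}{\partial I}\,\frac{\partial I}{\partial \theta} \right],
$$

where $I$ is the view rendered with parameters $\theta$, $z_t$ is its noisy latent at timestep $t$, $e$ is the text embedding, $\epsilon$ is the injected noise, $\hat{\epsilon}_{\phi}$ is the noise predicted by the frozen diffusion model, and $w(t)$ is a timestep-dependent weight.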

The authors further found that in image synthesis the 2D prior and the 3D prior are actually complementary. The 2D prior has strong imagination and lets the model explore the geometric space, but it leads to incomplete geometry in the generated 3D model; the 3D prior can make up for this defect, but it generalizes poorly and lacks geometric detail. The authors therefore propose a prior loss that trades off the two:
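This combined guidance loss is also shown as an image in the original post; in a hedged form, it weights the two prior terms as

$$
\mathcal{L}_{g} \;=\; \lambda_{2D}\,\mathcal{L}_{\text{SDS}}^{2D} \;+\; \lambda_{3D}\,\mathcal{L}_{\text{SDS}}^{3D},
$$

where $\mathcal{L}_{\text{SDS}}^{2D}$ is the SDS loss from the 2D prior (Stable Diffusion), $\mathcal{L}_{\text{SDS}}^{3D}$ is the analogous loss from the 3D-aware prior (Zero-1-to-3 [5]), and the ratio $\lambda_{2D}/\lambda_{3D}$ acts as the trade-off parameter: a larger 2D weight favors imagination and detail, while a larger 3D weight favors 3D consistency.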

3. Experimental results

The experiments are carried out on two datasets, NeRF4 and RealFusion15, with PSNR, LPIPS, and CLIP similarity as evaluation metrics. The first two measure the reconstructed image quality and the perceptual similarity of the generated results, while the third uses the appearance similarity computed by the CLIP model to measure the 3D consistency of the generated content. The authors compare against six methods, including Zero-1-to-3 [5], NeuralLift-360, and RealFusion. The following table shows the comparison of 3D synthesis performance.
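As a concrete illustration of the third metric, here is a minimal sketch of how CLIP similarity between a rendered novel view and the reference image could be computed, using the open-source OpenAI CLIP package; the authors' actual evaluation code may differ.

```python
# Hedged sketch: CLIP appearance similarity between a rendered view and the reference image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained CLIP image encoder

def clip_similarity(rendered_path: str, reference_path: str) -> float:
    # Preprocess both images and batch them together.
    imgs = torch.stack([
        preprocess(Image.open(rendered_path).convert("RGB")),
        preprocess(Image.open(reference_path).convert("RGB")),
    ]).to(device)
    with torch.no_grad():
        feats = model.encode_image(imgs)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize the embeddings
    return (feats[0] @ feats[1]).item()                   # cosine similarity in [-1, 1]
```

In practice the metric would be averaged over many rendered novel views of the same object; a higher score indicates that the generated 3D content stays visually consistent with the reference image across viewpoints.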

4. Summary

This paper proposes Magic123, a coarse-to-fine, two-stage 3D synthesis framework that can generate high-quality 3D models with rich texture detail from only a single image taken from an arbitrary viewpoint. Magic123 overcomes the limitations of existing 3D synthesis frameworks by trading off 2D and 3D diffusion priors inside the model. The proposed 2D/3D trade-off parameter lets the network explore a dynamic balance between 2D geometric imagination and 3D shape constraints, so that the model can account for both the diversity of objects and their specific 3D textures and details during 3D synthesis.

References

[1] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. NeuralLift-360: Lifting an in-the-wild 2D photo to a 3D object with 360° views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[2] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. RealFusion: 360° reconstruction of any object from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[3] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159-12168, 2021.
[4] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep Marching Tetrahedra: a hybrid representation for high-resolution 3D shape synthesis. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 6087-6101, 2021.
[5] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. arXiv preprint arXiv:2303.11328, 2023.
[6] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In International Conference on Learning Representations (ICLR), 2022.

Author: seven_



Original post: blog.csdn.net/hanseywho/article/details/131578655