Magic3D: High-Resolution Text-to-3D Content Creation

Fig 1. Results and applications of Magic3D. Top: high-resolution text-to-3D generation. Magic3D can generate high-quality, high-resolution 3D models from text prompts. Bottom: prompt-based high-resolution editing. By fine-tuning the diffusion prior, Magic3D can edit 3D models according to different prompts. Taking a low-resolution 3D model as input (left), Magic3D modifies different parts of the 3D model based on different input text prompts. Together with various creative controls over the resulting 3D models, Magic3D is a handy tool for augmenting 3D content creation.

Paper: https://readpaper.com/pdf-annotate/note?pdfId=4738271534435532801&noteId=1848084184935912192
Project: https://research.nvidia.com/labs/dir/magic3d/


01 What are the shortcomings of existing work?

DreamFusion suffers from two inherent limitations: (a) optimizing NeRF is extremely slow; (b) NeRF is supervised only with low-resolution images, which leads to long processing times and low-quality 3D models.

02 What problem does the article solve?

We address these two limitations of DreamFusion with a two-stage optimization framework, which both speeds up optimization and improves the quality of the resulting 3D models.

03 What is the key solution?

  • A coarse model is first obtained with a low-resolution diffusion prior, accelerated by a sparse 3D hash grid structure.
  • A textured 3D mesh model is then further optimized, using the coarse representation as initialization and interacting with a high-resolution latent diffusion model through an efficient differentiable renderer.

04 What kind of effect has been achieved?

Our method, called Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2× faster than DreamFusion (reported to take 1.5 hours on average), while also achieving higher resolution.
User studies show that 61.7% of raters prefer our approach over DreamFusion. Combined with image-conditioned generation capabilities, we give users new ways to control 3D synthesis, opening up new avenues for various creative applications.

05 What is the main contribution?

  • We present Magic3D, a framework for high-quality 3D content synthesis from text prompts, improving several major design choices in DreamFusion. It consists of a coarse-to-fine strategy that leverages both low- and high-resolution diffusion priors to learn the 3D representation of the target content. Magic3D synthesizes 3D content at 8× higher resolution than DreamFusion while being 2× faster, and the 3D content synthesized by our method is clearly preferred by users (61.7%).
  • We extend various image editing techniques developed for text-to-image models to 3D object editing and demonstrate their applications in the proposed framework.

06 What is the related work?

  • Text-to-image generation
  • 3D generative models
  • Text-to-3D generation

07 How is the method implemented?

Fig 2. Overview of Magic3D. We generate high-resolution 3D content from an input text prompt in a coarse-to-fine fashion. In the first stage, we exploit a low-resolution diffusion prior and optimize a neural field representation (color, density, and normal fields) to obtain a coarse model. We then differentiably extract a textured 3D mesh from the density and color fields of the coarse model and fine-tune it with a high-resolution latent diffusion model. After optimization, our model generates high-quality 3D meshes with detailed textures.

Background: DreamFusion

DreamFusion introduces Score Distillation Sampling (SDS), which computes the gradient

$$\nabla_\theta \mathcal{L}_{\text{SDS}}(\phi, x = g(\theta)) = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right],$$

where $g$ is a differentiable renderer with scene parameters $\theta$, $x = g(\theta)$ is the rendered image, $\epsilon_\phi(x_t; y, t)$ is the noise predicted by the diffusion model $\phi$ from the noised image $x_t$, the text embedding $y$, and the noise level $t$, $\epsilon$ is the injected Gaussian noise, and $w(t)$ is a weighting function.
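
As a concrete illustration, the SDS update can be sketched in a few lines of PyTorch. This is a minimal sketch, not the authors' implementation; `renderer`, `diffusion_model`, and `alphas_cumprod` are hypothetical stand-ins for a differentiable renderer, a text-conditioned diffusion prior, and its noise schedule.

```python
import torch

def sds_step(renderer, diffusion_model, text_emb, scene_params, alphas_cumprod):
    """One Score Distillation Sampling step (illustrative sketch)."""
    x = renderer(scene_params)                    # differentiable render, (B, 3, H, W)
    t = torch.randint(20, 980, (1,))              # random noise level
    alpha_bar = alphas_cumprod[t]                 # cumulative schedule value at t
    eps = torch.randn_like(x)
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * eps   # forward diffusion

    with torch.no_grad():                         # SDS skips the U-Net Jacobian
        eps_pred = diffusion_model(x_t, t, text_emb)

    grad = (1 - alpha_bar) * (eps_pred - eps)     # w(t) * (eps_pred - eps)
    # Inject grad as dL/dx: backprop of (x * grad).sum() routes it to scene_params.
    (x * grad.detach()).sum().backward()
```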

High-Resolution 3D Generation

Magic3D is a two-stage coarse-to-fine framework that enables high-resolution text-to-3D synthesis using an efficient scene model (Fig. 2).

1) Coarse-to-fine Diffusion Priors

Magic3D uses two different diffusion priors in a coarse-to-fine fashion to generate high-resolution geometry and textures. In the first stage, we use the base diffusion model described in eDiff-I [2], which is similar to the base diffusion model of Imagen [38] used in DreamFusion. In the second stage, we use the Latent Diffusion Model (LDM) [36], which allows gradients to be back-propagated into high-resolution 512 × 512 rendered images.

Although high-resolution images are rendered, the computation of the LDM remains manageable because the diffusion prior acts on the latent $z_t$ at a resolution of 64 × 64:

$$\nabla_\theta \mathcal{L}_{\text{SDS}}(\phi, x = g(\theta)) = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\epsilon_\phi(z_t; y, t) - \epsilon\big)\,\frac{\partial z}{\partial x}\,\frac{\partial x}{\partial \theta} \right],$$

where $z$ is the latent obtained by passing the rendered image $x$ through the LDM encoder, so the gradient is additionally back-propagated through the encoder via $\partial z / \partial x$.
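
The second stage differs from the sketch above only in that the render is first pushed through the LDM encoder while keeping its gradient path alive. Again a sketch under the same assumptions, with `vae_encoder` and `latent_diffusion` as hypothetical stand-ins:

```python
import torch

def latent_sds_step(renderer, vae_encoder, latent_diffusion, text_emb,
                    scene_params, alphas_cumprod):
    """Second-stage SDS with a latent diffusion prior (illustrative sketch)."""
    x = renderer(scene_params)                   # high-res render, e.g. (B, 3, 512, 512)
    z = vae_encoder(x)                           # keep gradients: dz/dx is needed
    t = torch.randint(20, 980, (1,))
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps

    with torch.no_grad():
        eps_pred = latent_diffusion(z_t, t, text_emb)   # U-Net runs on 64x64 latents

    grad = (1 - alpha_bar) * (eps_pred - eps)
    # Backprop through encoder and renderer: dz/dx * dx/dtheta.
    (z * grad.detach()).sum().backward()
```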

2) Scene Models

Neural fields as coarse scene models.
The initial coarse stage of optimization must discover geometry and texture from scratch. This can be challenging, as we need to accommodate complex topological changes in the 3D geometry and depth ambiguities in the 2D supervision signal.

Since volume rendering requires dense samples along each ray to accurately represent high-frequency geometry and shading, the cost of evaluating a large neural network at every sample point quickly adds up. For this reason, we choose the hash grid encoding from Instant NGP [27], which allows us to represent high-frequency details at a much lower computational cost; a sketch of the idea follows.
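
The sketch below illustrates the multi-resolution hash encoding idea; the level count, table size, and hashing primes follow Instant NGP, but the nearest-corner lookup (instead of trilinear interpolation) is a simplification for brevity.

```python
import torch

class HashGridEncoding(torch.nn.Module):
    """Toy multi-resolution hash grid encoding (Instant NGP style)."""

    def __init__(self, n_levels=16, features_per_level=2, log2_table_size=19,
                 base_res=16, max_res=2048):
        super().__init__()
        self.table_size = 2 ** log2_table_size
        growth = (max_res / base_res) ** (1.0 / (n_levels - 1))
        self.resolutions = [int(base_res * growth ** i) for i in range(n_levels)]
        # One small learnable feature table per resolution level.
        self.tables = torch.nn.Parameter(
            torch.empty(n_levels, self.table_size, features_per_level).uniform_(-1e-4, 1e-4))

    def forward(self, xyz):                      # xyz in [0, 1]^3, shape (N, 3)
        feats = []
        for lvl, res in enumerate(self.resolutions):
            idx = (xyz * res).long()             # integer grid coordinates at this level
            # Spatial hash: XOR of coordinates scaled by large primes.
            h = (idx[:, 0] ^ (idx[:, 1] * 2654435761) ^ (idx[:, 2] * 805459861)) % self.table_size
            feats.append(self.tables[lvl][h])    # (N, features_per_level)
        return torch.cat(feats, dim=-1)          # (N, n_levels * features_per_level)
```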

We also maintain a spatial data structure that encodes scene occupancy and enables empty-space skipping [20, 45].

Specifically, we use a density-based voxel pruning method from Instant NGP [27], and an octree-based ray sampling and rendering algorithm [46]. With these design choices, we greatly speed up the optimization of coarse scene models while maintaining quality.

Textured meshes as fine scene models.
In the fine stage of optimization, we use textured 3D meshes as the scene representation. Compared to volume rendering of neural fields, rendering textured meshes with a differentiable rasterizer can be performed efficiently at very high resolutions, making meshes a suitable choice for our high-resolution optimization stage. Using the neural field from the coarse stage as the initialization of the mesh geometry, we also sidestep the difficulty of learning large topological changes with a mesh.

We represent a 3D shape with a deformable tetrahedral grid $(V_T, T)$, where $V_T$ is the set of vertices in the grid $T$. Each vertex $v_i \in V_T$ carries a signed distance field (SDF) value $s_i \in \mathbb{R}$ and a deformation $\Delta v_i \in \mathbb{R}^3$ of the vertex from its initial canonical coordinate. We then extract the surface mesh from the SDF with the differentiable marching tetrahedra algorithm [41]. For textures, we use the neural color field as a volumetric texture representation.
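
One plausible way to parameterize such a scene model is sketched below; the tetrahedral connectivity and the marching-tetrahedra surface extraction itself are omitted, and the deformation bound is an assumption rather than the paper's exact choice.

```python
import torch

class DeformableTetGrid(torch.nn.Module):
    """Per-vertex SDF values s_i and deformations Δv_i over a fixed tetrahedral
    grid (illustrative sketch of a DMTet-style scene model)."""

    def __init__(self, canonical_verts, tets, grid_res=128):
        super().__init__()
        self.register_buffer("verts0", canonical_verts)    # (V, 3) canonical coords
        self.register_buffer("tets", tets)                 # (T, 4) vertex indices
        self.sdf = torch.nn.Parameter(torch.zeros(len(canonical_verts)))        # s_i
        self.deform = torch.nn.Parameter(torch.zeros(len(canonical_verts), 3))  # Δv_i
        self.max_deform = 1.0 / (2 * grid_res)             # keep tetrahedra from flipping

    def vertices(self):
        # Deformed vertex positions v_i = v_i^0 + Δv_i, with bounded deformation.
        return self.verts0 + torch.tanh(self.deform) * self.max_deform

    # A marching-tetrahedra routine would consume (self.vertices(), self.sdf,
    # self.tets) and return a triangle mesh differentiable w.r.t. sdf and deform.
```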

3) Coarse-to-fine Optimization

We describe our coarse-to-fine optimization procedure, which first operates on coarse neural field representations and then on high-resolution textured meshes.

Neural field optimization.
Instead of estimating normals from finite differences of the density field, we use an MLP to predict them. Note that this does not violate any geometric property, since volume rendering is used rather than surface rendering; the normal at a continuous position therefore need not match a surface orientation exactly. Avoiding finite differences in this way significantly reduces the computational cost of optimizing the coarse model.
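
One way to realize this is a single field network whose head outputs density, albedo, and a predicted normal together; the layer sizes and activations below are illustrative assumptions, not the paper's exact configuration.

```python
import torch

class CoarseNeuralField(torch.nn.Module):
    """Neural field predicting density, albedo, and normals from positional
    features (illustrative; sizes and activations are assumptions)."""

    def __init__(self, encoding, feat_dim, hidden=64):
        super().__init__()
        self.encoding = encoding                      # e.g. a hash grid encoding
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1 + 3 + 3))       # density, RGB, normal

    def forward(self, xyz):
        out = self.mlp(self.encoding(xyz))
        density = torch.nn.functional.softplus(out[:, :1])
        albedo = torch.sigmoid(out[:, 1:4])
        # Normals come from the network head, not from finite differences.
        normal = torch.nn.functional.normalize(out[:, 4:7], dim=-1)
        return density, albedo, normal
```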

Similar to DreamFusion, we also model the background with an environment map MLP that predicts an RGB color as a function of the ray direction.

We use a tiny MLP (hidden dimension 16) for the environment map and reduce its learning rate by a factor of 10, so that optimization focuses more on the neural field geometry.
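
In PyTorch, this kind of per-module learning-rate split is naturally expressed with optimizer parameter groups. A minimal sketch, where `scene_field`, `env_map_mlp`, and `base_lr` are assumed stand-ins:

```python
import torch

base_lr = 1e-2  # assumed base learning rate, not taken from the paper
optimizer = torch.optim.Adam([
    {"params": scene_field.parameters(), "lr": base_lr},
    {"params": env_map_mlp.parameters(), "lr": base_lr / 10},  # 10x lower for background
])
```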

Mesh optimization. To optimize the mesh from the neural field initialization, we convert the (coarse) density field into an SDF by subtracting a non-zero constant, yielding the initial values $s_i$.

To improve the smoothness of the surface, we further regularize the angular difference between adjacent faces of the mesh. This lets us obtain good geometry even under high-variance supervision signals such as the SDS gradient.
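
Both steps are simple to write down; a sketch, where the subtraction constant `tau` and the mesh adjacency format are assumptions:

```python
import torch

def init_sdf_from_density(density_at_verts, tau=25.0):
    """SDF initialization: subtract a non-zero constant from the coarse density
    so the chosen level set becomes the initial surface (tau is an assumption)."""
    return density_at_verts - tau

def dihedral_smoothness(face_normals, adjacency):
    """Regularize the angle between normals of adjacent faces.
    face_normals: (F, 3) unit normals; adjacency: (E, 2) face pairs sharing an edge."""
    n0 = face_normals[adjacency[:, 0]]
    n1 = face_normals[adjacency[:, 1]]
    cos = (n0 * n1).sum(-1).clamp(-1 + 1e-6, 1 - 1e-6)
    return torch.acos(cos).pow(2).mean()   # small dihedral angles -> smoother surface
```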

08 What are the experimental results and comparative effects?

Speed evaluation

Unless otherwise stated, the coarse stage is trained for 5000 iterations with 1024 samples along each ray (subsequently filtered by the sparse octree), with a batch size of 32 and a total runtime of about 15 minutes (over 8 iterations/second, varying with scene sparsity). The fine stage is trained for 3000 iterations with a batch size of 32 and a total runtime of about 25 minutes (2 iterations/second). The two stages together take 40 minutes. All runtimes are measured on 8 NVIDIA A100 GPUs.

Qualitative comparisons.

Fig 3. Qualitative comparison with DreamFusion [33]. We use the same text prompts as DreamFusion. For each 3D model, we render it from two views, each also shown without texture and with the background removed to focus on the actual 3D shape. For the DreamFusion results, we obtain frames from the videos published on the official webpage. Compared to DreamFusion, our Magic3D generates higher-quality 3D shapes in both geometry and texture. *A DSLR photo of… †A zoomed out DSLR photo of…

User studies.

Table 1. User preference studies. We conducted user studies to measure preference for 3D models generated from the 397 prompts published by DreamFusion. Overall, more raters (61.7%) prefer the 3D models generated by Magic3D over those from DreamFusion. Within Magic3D, most raters (87.7%) prefer the fine models over the coarse models, showing the effectiveness of our coarse-to-fine approach.

Personalized text-to-3D.

We are able to modify the 3D model while preserving the subject of a given input image.

Fig 6. Personalization with Magic3D and DreamBooth. Given instance-specific input images, we use DreamBooth to fine-tune the diffusion models and optimize the 3D model with the given prompts. The identity of the subject is well preserved in the resulting 3D models.

Prompt-based editing through fine-tuning.

We modify the base prompt, fine-tune the NeRF model at high resolution, and then optimize the mesh. The scene model can thus be adjusted through the prompt: for example, changing a "little bunny" to a "stained glass bunny" or a "metal bunny" yields similar geometry but different textures.

Fig 7. Prompt-based editing with Magic3D. Given a coarse model (first column) generated from a base prompt, we replace the underlined text with new text and fine-tune NeRF with the LDM to obtain a high-resolution NeRF model. We further fine-tune the high-resolution mesh with the NeRF model. This prompt-based editing approach gives artists greater control over the 3D generated output.

09 What do ablation studies tell us?

Can single-stage optimization work with the LDM prior?

Fig 4. Single-stage (top) vs. coarse-to-fine optimization (bottom). Both use NeRF as the scene model. During optimization, the left two columns use a 64×64 rendering resolution, while the right two columns use 256×256. Compared with our coarse-to-fine method, the single-stage approach can generate details but produces poorer shapes.

Can we use NeRF for the fine model?

Yes. Although optimizing NeRF from scratch does not work well, we can follow the coarse-to-fine framework and replace the second-stage scene model with NeRF.

Coarse models vs. fine models.

We see significant quality improvements on both NeRF and mesh models, suggesting that our coarse-to-fine approach is suitable for general scene models.

Fig 5. Ablation on the fine-tuning stage. For each text prompt, we compare coarse and fine models with mesh and NeRF representations. Mesh fine-tuning significantly improves the visual quality of the generated 3D assets, providing more realistic detail on the 3D shapes.

10 Conclusion

We present Magic3D, a fast, high-quality text-to-3D generation framework. Our coarse-to-fine approach benefits from both efficient scene models and high-resolution diffusion priors. In particular, 3D mesh models scale well with image resolution and enjoy higher-resolution supervision from the latent diffusion model without sacrificing speed. It takes 40 minutes to go from a text prompt to a high-quality 3D mesh model ready to be used in a graphics engine. Through extensive user studies and qualitative comparisons, we found that Magic3D is preferred by raters (61.7%) over DreamFusion while being 2× faster. Finally, we propose a set of tools for better control over the style and content of 3D generation. With Magic3D, we hope to democratize 3D synthesis and open up everyone's creativity in 3D content creation.
