Magic3D: High-Resolution Text-to-3D Content Creation
Paper: https://readpaper.com/pdf-annotate/note?pdfId=4738271534435532801&noteId=1848084184935912192
Project:https://research.nvidia.com/labs/dir/magic3d/
Original link: Magic3D: High-resolution text to 3d content creation (by small sample vision and intelligence frontier)
Article directory
- Magic3D: High-Resolution Text-to-3D Content Creation
- 01 Insufficiency of existing work?
- 02 What problem does the article solve?
- 03 What is the key solution?
- 04 What kind of effect has been achieved?
- 05 What is the main contribution?
- 06 What is the related work?
- 07 How is the method implemented?
- 08 What are the experimental results and comparative effects?
- 09 What do ablation studies tell us?
- 10 Conclusion
01 Insufficiency of existing work?
DreamFusion suffers from two inherent limitations: (a) NeRF optimization is extremely slow; (b) low-resolution image space supervision for NeRF results in long processing times and low quality 3D models.
02 What problem does the article solve?
We address both limitations of DreamFusion with a two-stage optimization framework, which both speeds up optimization and improves the quality of the resulting 3D model.
03 What is the key solution?
- A coarse model is first obtained using a low-resolution diffusion prior and accelerated using a sparse 3D hash grid structure.
- The textured 3D mesh model is further optimized using a coarse representation as initialization, and interacts with a high-resolution latent diffusion model using an efficient differentiable renderer.
04 What kind of effect has been achieved?
Our method, called Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2 times faster than DreamFusion (reported to take 1.5 hours on average), while also achieving higher resolution.
User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with image-conditioned generation capabilities, we give users new ways to control 3D synthesis, opening up new avenues for a variety of creative applications.
05 What is the main contribution?
- We present Magic3D, a framework for high-quality 3D content synthesis from text prompts, which improves several major design choices in DreamFusion. It consists of a coarse-to-fine strategy that leverages low- and high-resolution diffusion priors to learn the 3D representation of the target content. Magic3D synthesizes 3D content at 8× higher resolution than DreamFusion while being 2× faster, and its outputs are clearly preferred by users (61.7%).
- We extend various image-editing techniques developed for text-to-image models to 3D object editing and demonstrate their application in the proposed framework.
06 What is the related work?
- Text-to-image generation.
- 3D generative models
- Text-to-3D generation
07 How is the method implemented?
Background: DreamFusion
DreamFusion introduces Score Distillation Sampling (SDS), which computes the gradient

$\nabla_\theta \mathcal{L}_{SDS}(\phi, x = g(\theta)) = \mathbb{E}_{t,\epsilon}\big[ w(t)\,(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon)\,\tfrac{\partial x}{\partial \theta} \big]$

where $g$ is a differentiable renderer with scene parameters $\theta$, $x = g(\theta)$ is a rendered image, $x_t$ is its noised version at timestep $t$, $y$ is the text prompt, $\hat{\epsilon}_\phi$ is the noise predicted by the frozen diffusion model, and $w(t)$ is a weighting function.
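As a toy illustration of how an SDS update is computed, the sketch below uses a hypothetical linear "denoiser" in place of the actual pretrained diffusion model; only the structure of the gradient (a weighted noise residual, with the U-Net Jacobian deliberately skipped) matches the method.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, t):
    # Stand-in for the frozen diffusion model's noise predictor
    # eps_phi(x_t; y, t); a real model also conditions on the text
    # prompt y. This linear map is a hypothetical placeholder.
    return 0.9 * x_t

def sds_gradient(x, t, alpha_bar, w):
    """SDS gradient for one sampled timestep:
    w(t) * (eps_phi(x_t; t) - eps), which is then pushed through the
    renderer Jacobian dx/dtheta (omitted: x is treated as the leaf).
    As in SDS, the U-Net Jacobian is skipped."""
    eps = rng.standard_normal(x.shape)
    # Forward diffusion q(x_t | x) with noise schedule alpha_bar(t).
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    return w * (toy_denoiser(x_t, t) - eps)

# One gradient step on a flattened 64 x 64 "rendering".
x = rng.standard_normal(64 * 64)
g = sds_gradient(x, t=500, alpha_bar=0.5, w=1.0)
x_new = x - 0.01 * g  # gradient descent on the rendered image
```

In the actual pipeline this gradient would be backpropagated through the renderer into the scene parameters rather than applied to the image directly.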
High-Resolution 3D Generation
Magic3D is a two-stage coarse-to-fine framework that enables high-resolution text-to-3D synthesis using an efficient scene model (Fig. 2).
1)Coarse-to-fine Diffusion Priors
Magic3D uses two different diffusion priors in a coarse-to-fine fashion to generate high-resolution geometry and textures. In the first stage, we use the base diffusion model described in eDiff-I [2], which is similar to the base diffusion model of Imagen [38] used in DreamFusion. In the second stage, we use the Latent Diffusion Model (LDM) [36], which allows gradients to be back-propagated into high-resolution 512 × 512 rendered images;
Although high-resolution images are rendered, the computation of the LDM remains manageable because the diffusion prior acts on the latent $z_t$ at a resolution of 64 × 64.
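To make the resolution bookkeeping concrete, here is a toy NumPy sketch in which an 8× average-pooling stand-in plays the role of the LDM encoder: the diffusion prior only ever touches the 64 × 64 latent, while the gradient flows back to the 512 × 512 render through the encoder's vector-Jacobian product (the real encoder is a trained VAE; pooling is purely an assumption for illustration).

```python
import numpy as np

def encode(img):
    """Toy stand-in for the LDM encoder: 8x average pooling maps a
    512x512 render to a 64x64 latent. The real encoder is a trained
    VAE, but the resolution bookkeeping is the same."""
    return img.reshape(64, 8, 64, 8).mean(axis=(1, 3))

def encoder_vjp(grad_latent):
    """Backpropagate a gradient on the 64x64 latent to the 512x512
    image. For average pooling this spreads each latent gradient
    uniformly over its 8x8 patch, scaled by 1/64."""
    return np.repeat(np.repeat(grad_latent, 8, axis=0), 8, axis=1) / 64.0

img = np.random.default_rng(0).standard_normal((512, 512))
z = encode(img)               # the diffusion prior sees only this 64x64 latent
grad_z = np.ones_like(z)      # pretend this is the SDS gradient on the latent
grad_img = encoder_vjp(grad_z)  # chain rule back to the high-res render
```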
2)Scene Models
Neural fields as coarse scene models.
The initial coarse stage of optimization requires finding geometry and textures from scratch. This can be challenging, as we need to accommodate complex topological changes of the 3D geometry and depth ambiguity of the 2D supervision signal.
Since volume rendering requires dense samples along rays to accurately represent high-frequency geometry and shading, the cost of having to evaluate large neural networks at each sample point quickly adds up. For this reason, we choose to use hashed grid coding from Instant NGP [27], which allows us to represent high-frequency details at a lower computational cost.
We also maintain a spatial data structure that encodes scene occupancy and utilizes empty space jumps [20, 45].
Specifically, we use a density-based voxel pruning method from Instant NGP [27], and an octree-based ray sampling and rendering algorithm [46]. With these design choices, we greatly speed up the optimization of coarse scene models while maintaining quality.
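To illustrate the hash grid encoding mentioned above, here is a minimal NumPy sketch of an Instant NGP-style multiresolution hash encoding: each level trilinearly interpolates features from a hashed table, and the concatenated per-level features would feed a small MLP. The real implementation runs fused CUDA kernels and learns the tables jointly with the MLP; the table size, level count, and feature width below are illustrative, not the paper's settings.

```python
import numpy as np

# Spatial hash primes from Instant NGP.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_coords(ijk, table_size):
    """Spatial hash of integer voxel corners (XOR of coordinate-prime
    products, modulo the table size)."""
    h = np.zeros(ijk.shape[:-1], dtype=np.uint64)
    for d in range(3):
        h ^= ijk[..., d].astype(np.uint64) * PRIMES[d]
    return h % table_size

def hash_encode(x, tables, base_res=16, growth=2.0):
    """Multi-resolution hash encoding of points x in [0,1]^3."""
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)   # grid resolution at this level
        p = x * res
        p0 = np.floor(p).astype(np.int64)
        w = p - p0
        acc = np.zeros((x.shape[0], table.shape[1]))
        for corner in range(8):                 # trilinear interpolation over 8 corners
            offs = np.array([(corner >> d) & 1 for d in range(3)])
            cw = np.prod(np.where(offs, w, 1 - w), axis=1, keepdims=True)
            idx = hash_coords(p0 + offs, np.uint64(len(table)))
            acc += cw * table[idx]
        feats.append(acc)
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
tables = [rng.standard_normal((2**14, 2)) * 1e-4 for _ in range(4)]  # 4 levels, 2 features each
enc = hash_encode(rng.random((5, 3)), tables)   # (5, 8) feature vectors
```

Because table lookups and trilinear weights are cheap, high-frequency detail is stored in the tables rather than in a large MLP, which is what makes dense ray sampling affordable.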
Textured meshes as fine scene models.
In the fine stage of optimization, we use textured 3D meshes as scene representations. Compared to volume rendering in the neural domain, rendering textured meshes with differentiable rasterizers can be performed efficiently at very high resolutions, making meshes a suitable choice for our high-resolution optimization stage. Using the neural field from the coarse stage as the initialization of the mesh geometry, we can also avoid the difficulty of learning a large number of topological changes in the mesh.
We use a deformable tetrahedral grid $(V_T, T)$ to represent the 3D shape, where $V_T$ are the vertices in the grid $T$. Each vertex $v_i \in V_T$ contains a signed distance field (SDF) value $s_i \in \mathbb{R}$ and a deformation $\Delta v_i \in \mathbb{R}^3$ of the vertex from its initial canonical coordinate. We then extract the surface mesh from the SDF using the differentiable marching tetrahedra algorithm [41]. For textures, we use a neural color field as the volumetric texture representation.
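As a rough illustration of the extraction step, the sketch below shows two core operations of a marching-tetrahedra-style pass: selecting tetrahedra whose vertices have mixed SDF signs, and placing surface vertices at linearly interpolated zero crossings. The actual algorithm also emits triangle connectivity per sign configuration, which is omitted here.

```python
import numpy as np

def crossing_point(v0, v1, s0, s1):
    """Zero crossing of the SDF along an edge, by linear interpolation.
    In DMTet-style pipelines the SDF values s_i and vertex offsets are
    optimized, so this interpolation stays differentiable."""
    t = s0 / (s0 - s1)
    return v0 + t * (v1 - v0)

def surface_tets(tets, sdf):
    """Indices of tetrahedra that straddle the zero level set (mixed
    vertex signs) -- the only tets that emit surface triangles."""
    signs = sdf[tets] > 0
    n_pos = signs.sum(axis=1)
    return np.where((n_pos > 0) & (n_pos < 4))[0]

verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [1, 1, 1]], dtype=float)
tets = np.array([[0, 1, 2, 3], [1, 2, 3, 4]])
sdf = np.array([-0.5, -0.5, -0.5, -0.5, 1.0])    # only vertex 4 is outside
active = surface_tets(tets, sdf)                  # only the second tet is crossed
p = crossing_point(verts[3], verts[4], sdf[3], sdf[4])  # surface point on edge 3-4
```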
3)Coarse-to-fine Optimization
We describe our coarse-to-fine optimization procedure, which first operates on coarse neural field representations and then on high-resolution textured meshes.
Neural field optimization.
Instead of estimating normals from finite differences of the density, we use an MLP to predict them. Note that this does not strictly respect the geometry: since volume rendering is used rather than surface rendering, the normals at continuously sampled positions need not correspond exactly to surface orientations. Avoiding finite differences in this way significantly reduces the computational cost of optimizing the coarse model.
Similar to DreamFusion, we also model the background using an environment map MLP that predicts RGB color as a function of light direction.
We use a tiny MLP (hidden dimension size 16) for the environment map and reduce the learning rate by a factor of 10 to allow the model to focus more on the neural field geometry.
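A minimal sketch of this design choice, with NumPy stand-ins for the actual networks: a width-16 MLP maps ray directions to background RGB, and its learning rate is set 10× lower than a base rate (the value `lr_scene` below is illustrative, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny environment-map MLP (hidden width 16): ray direction -> RGB.
W1, b1 = rng.standard_normal((3, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, 3)) * 0.1, np.zeros(3)

def env_color(dirs):
    """Background color for a batch of unit ray directions."""
    h = np.maximum(dirs @ W1 + b1, 0.0)            # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid -> RGB in (0, 1)

# Per-group learning rates: the environment map is trained 10x slower
# so optimization focuses on the foreground neural field geometry.
lr_scene = 1e-2           # hypothetical base learning rate
lr_env = lr_scene / 10.0

dirs = rng.standard_normal((4, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
rgb = env_color(dirs)     # (4, 3) background colors
```

In a framework like PyTorch this would be expressed as two optimizer parameter groups with different learning rates.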
Mesh optimization.
To optimize the mesh from the neural field initialization, we convert the (coarse) density field to an SDF by subtracting a non-zero constant, yielding the initial $s_i$.
To improve the smoothness of the surface, we further regularize the angle difference between adjacent faces on the mesh. This allows us to obtain good geometry even with supervisory signals with high variance such as SDS gradients.
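The smoothness regularizer above can be sketched as follows: for each pair of faces sharing an edge, penalize $1 - \cos$ of the angle between their unit normals, which is zero for a flat surface and grows with the dihedral angle (a common form of this regularizer; the exact weighting used in the paper is not reproduced here).

```python
import numpy as np

def face_normals(verts, faces):
    """Unit normals of triangle faces."""
    e1 = verts[faces[:, 1]] - verts[faces[:, 0]]
    e2 = verts[faces[:, 2]] - verts[faces[:, 0]]
    n = np.cross(e1, e2)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def smoothness_penalty(verts, faces, edge_pairs):
    """Sum over adjacent face pairs of (1 - cos(angle between their
    normals)): 0 for coplanar faces, larger for sharp creases."""
    n = face_normals(verts, faces)
    cos = np.sum(n[edge_pairs[:, 0]] * n[edge_pairs[:, 1]], axis=1)
    return np.sum(1.0 - cos)

# Two coplanar triangles sharing edge 1-2: the penalty is zero.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], float)
faces = np.array([[0, 1, 2], [1, 3, 2]])
pairs = np.array([[0, 1]])
flat = smoothness_penalty(verts, faces, pairs)
```

Lifting the fourth vertex out of the plane makes the penalty positive, which is how the regularizer resists the high-variance SDS gradients that would otherwise roughen the surface.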
08 What are the experimental results and comparative effects?
Speed evaluation
Unless otherwise stated, the coarse stage is trained for 5000 iterations with 1024 samples along each ray (subsequently filtered by a sparse octree), with a batch size of 32 and a total runtime of about 15 minutes (over 8 iterations/sec, varying with scene sparsity). The fine stage is trained for 3000 iterations with a batch size of 32 and a total runtime of about 25 minutes (2 iterations/sec). The two stages together take 40 minutes. All runtimes are measured on 8 NVIDIA A100 GPUs.
Qualitative comparisons.
User studies.
Personalized text-to-3D.
We can successfully modify the 3D model while preserving the subject of a given input image.
Prompt-based editing through fine-tuning.
We modify the base prompt, fine-tune the NeRF model at high resolution, and then optimize the mesh. We find that we can adjust the scene model through the prompt: for example, changing a "little bunny" to a "stained glass bunny" or a "metal bunny" yields similar geometry with different textures.
09 What do ablation studies tell us?
Can single-stage optimization work with LDM prior?
Can we use NeRF for the fine model?
Yes, although optimizing NeRF from scratch does not work well, we can follow the coarse-to-fine framework but replace the second-stage scene model with NeRF.
Coarse models vs. fine models.
We see significant quality improvements on both NeRF and mesh models, suggesting that our coarse-to-fine approach is suitable for general scene models.
10 Conclusion
We present Magic3D, a fast and high-quality text-to-3D generation framework, which benefits from efficient scene models and high-resolution diffusion priors in a coarse-to-fine approach. In particular, 3D mesh models scale well with image resolution and enjoy higher-resolution supervision from latent diffusion models without sacrificing speed. It takes 40 minutes to go from a text prompt to a high-quality 3D mesh model ready for use in a graphics engine. Through extensive user studies and qualitative comparisons, we find that Magic3D is preferred by raters (61.7%) over DreamFusion while being 2× faster. Finally, we propose a set of tools for better controlling the style and content of 3D generation. We hope Magic3D democratizes 3D synthesis and opens up everyone's creativity in 3D content creation.