3D character generative AI: principle and implementation

Since the publication of the seminal paper Denoising Diffusion Probabilistic Models, image generators of this kind have continued to improve, producing images whose quality beats GANs and is increasingly hard to distinguish from real photographs.

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis and the subsequent release of Instant Neural Graphics Primitives with a Multiresolution Hash Encoding give us a way to convert a sparse set of images of an object, captured from multiple views, into a high-quality 3D rendering of that object.

However, although the radiance fields obtained by training a NeRF model are promising (whether using the original implementation or the InstantNGP backbone for fast training), extracting a usable mesh from them is very resource-intensive, produces noisy results, and destroys all lighting and material data. This is because NeRF and its derivatives are built for novel view synthesis: they parameterize only the RGB color and density of points in the 3D scene given certain camera poses.
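To make this concrete, here is a minimal, purely illustrative sketch of the mapping a NeRF-style MLP learns (positional encoding and the volume-rendering step are omitted, and all names are our own):

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative only: a NeRF-style MLP maps a 3D point and a view
    direction to an RGB color and a volume density (sigma)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)          # density
        self.rgb_head = nn.Sequential(                  # view-dependent color
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        feat = self.backbone(xyz)
        sigma = torch.relu(self.sigma_head(feat))
        rgb = self.rgb_head(torch.cat([feat, view_dir], dim=-1))
        return rgb, sigma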

While representing the scene as a neural volume does have the advantage of essentially "baking in" the lighting data, it performs no explicit calculation to even approximate the BRDF (bidirectional reflectance distribution function) at the 3D surface. In practice this means there is ambiguity between the lighting conditions and the surface properties, because NeRF ignores this very important part of traditional PBR (physically based rendering).


Fortunately, the problem of not being able to extract meshes and materials from NeRF and NeRF-derived models has been addressed. Extracting Triangular 3D Models, Materials, and Lighting From Images (nvdiffrec) is one such work: the technique first trains a neural volume as in NeRF, then DMTet reconstructs the 3D surface and nvdiffrast applies differentiable rendering to the model. Since both DMTet and nvdiffrast are differentiable stages, the two models are optimized jointly with gradients. This produces high-quality 3D meshes and PBR materials for relightable 3D objects that can be used without modification.

We use the work of ControlVideo: Training-free Controllable Text-to-Video Generation, GroundingDINO, SegmentAnything, and nvdiffrec to try to surpass NeRF + DDIM models such as stable-dreamfusion at generating asset-quality 3D characters.

1. Problem statement

Let's first explore the requirements for generating 3D characters.

3D models are used to depict real-world and conceptual visuals in art, entertainment, simulations, and graphics, and are integral to many industries, including virtual reality, video games, 3D printing, marketing, television and film, scientific and medical imaging and computer-aided design — excerpted from  TechTarget

Yes, the use of 3D models and 3D characters is relevant to many areas of the digital world. Let's take video games as the main focus:

Number of global video game players from 2015 to 2024

  • Approximately 3 billion people play video games worldwide. Data source: Marketers
  • 83% of video game sales occur in the digital world. Data source: Global X ETF
  • Consumer spending on video games in the United States in 2021 was $60.4 billion. Data source: PR Newswire, Newzoo
  • About 85% of gaming revenue comes from free-to-play games. Data source: WePC, TweakTown
  • In the first quarter of 2021, there were approximately 14.1 billion mobile game downloads. Data source: Statista
  • By 2025, the PC gaming industry alone will have amassed $46.7 billion. Data source: Statista
  • 38% of global gamers are between the ages of 18 and 34. Data source: Statista

Obviously, an important part of most video games are the game characters.

With this in mind,

In terms of type, the 3D segment accounted for 84.19% of the game engine market share in 2019 and is expected to experience significant growth by 2027. The 3D type is widely used in games for character modeling, scene modeling, 3D engines, and particle systems. — Excerpted from BusinessWire

Character designers and 3D artists alike are in high demand.

3D character workflow

Also consider the 3D character creation workflow/pipeline:

Ideation -> Concept Modeling -> Sculpting -> Retopology -> UV Unwrap -> Baking -> Texturing -> Rigging and Skinning -> Animation -> Rendering

Each step requires expert hard work to complete. If we could somehow automate this process, or at least automate parts of it, it would save a lot of development resources and open up 3D character creation to more people.

2. Metrics and baselines

We've seen that traditional 3D character creation has multiple stages. Although it ultimately produces very high-quality assets for use in movies, video games, marketing, VR, and more, those assets take anywhere from several working days to a few weeks to complete.

What about existing text-to-3D solutions?

2.1 Google DreamFusion

First, consider DreamFusion, a project released by Google Research in 2022 that generates 3D models from text using a 2D diffusion model. It uses Imagen as a prior to optimize a NeRF MLP. However, since DreamFusion is closed source, we were unable to test it. Reimplementation is also difficult because Imagen is a pixel-space diffusion model that requires significant computational resources to run.

What do the results look like? Let's use the following prompt and build parameters based on the DreamFusion example:

--text "masterpiece, best quality, 1girl, slight smile, white hoodie, blue jeans, blue sneakers, short blue hair, aqua eyes, bangs" \
--negative "worst quality, low quality, logo, text, watermark, username" \
--hf_key rossiyareich/abyssorangemix3-popupparade-fp16 \
--iters 5000

We get the following results:

2.2 OpenAI SHAP-E

Shap-E, released by OpenAI in 2023, uses an encoder-decoder architecture in which the encoder is trained to encode 3D point clouds and spatial coordinates into the parameters of the decoder's implicit function. The decoder is then an implicit NeRF-style MLP trained to also output a signed distance field.
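For reference, this is roughly how Shap-E can be queried; a minimal sketch assuming a recent diffusers release that ships ShapEPipeline (the prompt and sampling parameters are illustrative, not the exact ones we used):

import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif

pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16).to("cuda")

# Sample an implicit representation from the prompt and render a turntable GIF
frames = pipe(
    "a girl",               # illustrative prompt
    guidance_scale=15.0,
    num_inference_steps=64,
    frame_size=256,
).images[0]
export_to_gif(frames, "shap_e_girl.gif")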

What was the result?

Statue of a girl, generated by SHAP-E

A girl, generated by SHAP-E

2.3 Open source projects

Public text-to-3D models are either too experimental or do not produce good results. For example, the open-source stable-dreamfusion:

3. Data collection and cleaning

Our character generation pipeline can be divided into two steps: text-to-video and video-to-3D. Because we are trying to synthesize data for a NeRF model, our best option is a diffusion probabilistic model; Stable Diffusion (as implemented in Hugging Face diffusers) is one such model.

We also leverage ControlNet for image conditioning. The ControlNet conditioning images are therefore rendered from a base mesh.

OpenPose conditioning images

SoftEdge (HED) conditioning images

We rendered 100 frames of the same character in Blender, each from a different camera view, with the character in the same A-pose.
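A minimal sketch of the kind of Blender script used for this step; the object name, orbit radius, and output paths are assumptions, not our exact setup:

import math
import bpy

scene = bpy.context.scene
camera = scene.camera
target = bpy.data.objects["Character"]  # hypothetical name of the A-posed base mesh

# Keep the camera aimed at the character from every orbit position
track = camera.constraints.new(type='TRACK_TO')
track.target = target
track.track_axis = 'TRACK_NEGATIVE_Z'
track.up_axis = 'UP_Y'

num_frames = 100
radius, height = 4.0, 1.5  # assumed orbit radius and camera height
for i in range(num_frames):
    angle = 2.0 * math.pi * i / num_frames
    camera.location = (radius * math.cos(angle), radius * math.sin(angle), height)
    scene.render.filepath = f"//renders/frame_{i:03d}.png"
    bpy.ops.render.render(write_still=True)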

Then, to synthesize the dataset required for training nvdiffrec, we exploit a novel consistent-generation technique, which also happens to be our main contribution.

To understand how this technique works, we should first take a step back and look at how pixel-space diffusion models work; you can check out this video.

The latent diffusion model applies a diffusion process in a latent space; the image is first encoded into the latent space by a VAE encoder, and the output of the diffusion process is decoded by a VAE decoder. However, when running text-to-image inference, only the VAE decoder is used.
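To make the encode/decode split concrete, here is a minimal sketch using the AutoencoderKL class from diffusers (the model id is just an example of an SD 1.x checkpoint, and the dummy image stands in for a real input):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # example SD 1.x VAE
).to("cuda")

image = torch.rand(1, 3, 512, 512, device="cuda") * 2 - 1  # dummy image in [-1, 1]

with torch.no_grad():
    # Encode: 512x512x3 image -> 64x64x4 latent (needed for training / img2img)
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # ... the diffusion process itself runs entirely on `latents` ...
    # Decode: latent -> image (the only half of the VAE used for text-to-image)
    decoded = vae.decode(latents / vae.config.scaling_factor).sample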

To better control the output of the diffusion model, we draw inspiration from ControlVideo: Training-free Controllable Text-to-Video Generation and use similar techniques:

The original paper inflates the Conv2D and self-attention layers of the Stable Diffusion UNet noise-prediction model along the time axis so that each frame can be conditioned on other frames, and we do the same. Our deviation from the original implementation is that we feed in only a fixed number of latent codes (up to 3 in our case) to alleviate memory constraints.
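A sketch of what this temporal inflation looks like for the Conv2D layers, in the spirit of the pseudo-3D convolutions used by ControlVideo-style implementations (this is not the exact class from our code):

import torch.nn as nn
from einops import rearrange

class InflatedConv2d(nn.Conv2d):
    """Apply an ordinary 2D convolution frame by frame to a video tensor of
    shape (batch, channels, frames, height, width). Weights are shared across
    time, so pretrained image-model weights load unchanged."""

    def forward(self, x):
        frames = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")
        x = super().forward(x)
        return rearrange(x, "(b f) c h w -> b c f h w", f=frames)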

Only the first frame and the previous frame are used in our cross-frame attention mechanism. We find that our results are generally on par with the original sparse-causal attention implementation, but still struggle to resolve the finer details in the image; this is a foreseeable limitation of the DPM that remains to be solved.
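A standalone sketch of this attention pattern: each frame's queries attend only to the key/value tokens of the first frame and of the immediately preceding frame (assumes PyTorch 2.x; the tensor layout is illustrative and not tied to the diffusers attention processor API):

import torch
import torch.nn.functional as F

def sparse_causal_attention(q, k, v):
    """q, k, v: (batch, frames, tokens, dim). Frame f attends to the
    key/value tokens of frame 0 and frame f - 1 only."""
    out = torch.zeros_like(q)
    for f in range(q.shape[1]):
        ref = [0] if f == 0 else [0, f - 1]
        k_ref = torch.cat([k[:, r] for r in ref], dim=1)
        v_ref = torch.cat([v[:, r] for r in ref], dim=1)
        out[:, f] = F.scaled_dot_product_attention(q[:, f], k_ref, v_ref)
    return out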

4. Exploratory data analysis

The baseline ControlNet used is:

  • lllyasviel/control_v11p_sd15_openpose
  • lllyasviel/control_v11e_sd15_ip2p
  • lllyasviel/control_v11p_sd15_softedge

We found this combination to be the best for consistency (although some improvements could still be made)

We also merged AOM3 (AbyssOrangeMix3) with Pop Up Parade at a ratio of 0.5 as our base model and used the NAI-derived VAE weights (anything-v4.0-vae) as our VAE. It is important that the VAE does not produce NaNs, otherwise the entire generation run is wasted.
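Roughly how these components are assembled with diffusers; the ControlNet ids are the ones listed above, while the base-model and VAE paths are placeholders for our own merge rather than published repositories:

import torch
from diffusers import (
    AutoencoderKL,
    ControlNetModel,
    StableDiffusionControlNetPipeline,
)

controlnets = [
    ControlNetModel.from_pretrained(cid, torch_dtype=torch.float16)
    for cid in (
        "lllyasviel/control_v11p_sd15_openpose",
        "lllyasviel/control_v11e_sd15_ip2p",
        "lllyasviel/control_v11p_sd15_softedge",
    )
]
vae = AutoencoderKL.from_pretrained(
    "path/to/anything-v4.0-vae", torch_dtype=torch.float16  # placeholder path
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/aom3-popupparade-merge",  # placeholder: our 0.5/0.5 AOM3 x Pop Up Parade merge
    controlnet=controlnets,            # multiple ControlNets are passed as a list
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")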

5. Modeling, verification and error analysis

Let's take a deeper look at the modifications and deviations in our implementation of ControlVideo.

First, we implement it with DPMSolverMultistepScheduler at order=2. This reduces generation time by 60%, since we only need 20 sampling steps (the original implementation assumes 50 DDIM sampling steps).
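The scheduler swap itself is a one-liner in diffusers; a sketch, assuming a pipeline object like the one above:

from diffusers import DPMSolverMultistepScheduler

# Second-order multistep solver: ~20 sampling steps instead of 50 DDIM steps
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, solver_order=2
)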

Second, we removed the RIFE (Real-Time Intermediate Flow Estimation for Video Frame Interpolation) model. While it slightly improves the flickering problem, it does more harm than good by making the image less sharp and less saturated.

Finally, we modified the denoising loop so that it only focuses on the first latent code and the previous frame's latent code:

for i, t in enumerate(timesteps):
    torch.cuda.empty_cache()

    # Expand latents for CFG
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = self.scheduler.scale_model_input(
        latent_model_input, t
    )
    noise_pred = torch.zeros_like(latents)
    pred_original_sample = torch.zeros_like(latents)

    for frame_n in range(video_length):
        torch.cuda.empty_cache()

        # Sparse-causal frame selection: the first frame attends only to itself;
        # every later frame attends to frame 0 and to the previous frame.
        # focus_rel is the index of the current frame within `frames`.
        if frame_n == 0:
            frames = [0]
            focus_rel = 0
        elif frame_n == 1:
            frames = [0, 1]
            focus_rel = 1
        else:
            frames = [frame_n - 1, frame_n, 0]
            focus_rel = 1

        # Inference on ControlNet
        (
            down_block_res_samples,
            mid_block_res_sample,
        ) = self.controlnet(
            latent_model_input[:, :, frames],
            t,
            encoder_hidden_states=frame_wembeds[frame_n],
            controlnet_cond=[
                cnet_frames[:, :, frames]
                for cnet_frames in controlnet_frames
            ],
            conditioning_scale=controlnet_scales,
            return_dict=False,
        )
        block_res_samples = [
            *down_block_res_samples,
            mid_block_res_sample,
        ]
        block_res_samples = [
            b * s
            for b, s in zip(block_res_samples, controlnet_block_scales)
        ]
        down_block_res_samples = block_res_samples[:-1]
        mid_block_res_sample = block_res_samples[-1]

        # Inference on UNet
        pred_noise_pred = self.unet(
            latent_model_input[:, :, frames],
            t,
            encoder_hidden_states=frame_wembeds[frame_n],
            cross_attention_kwargs=cross_attention_kwargs,
            down_block_additional_residuals=down_block_res_samples,
            mid_block_additional_residual=mid_block_res_sample,
            inter_frame=False,
        ).sample

        # Perform CFG
        noise_pred_uncond, noise_pred_text = pred_noise_pred[
            :, :, focus_rel
        ].chunk(2)
        noise_pred[:, :, frame_n] = noise_pred_uncond + guidance_scale * (
            noise_pred_text - noise_pred_uncond
        )

        # Compute the previous noisy sample x_t -> x_t-1
        step_dict = self.scheduler.step(
            noise_pred[:, :, frame_n],
            t,
            latents[:, :, frame_n],
            frame_n,
            **extra_step_kwargs,
        )
        latents[:, :, frame_n] = step_dict.prev_sample
        pred_original_sample[:, :, frame_n] = step_dict.pred_original_sample

We then train nvdiffrec using the following parameters:

{
    "ref_mesh": "data/ngp",
    "random_textures": true,
    "iter": 5000,
    "save_interval": 100,
    "texture_res": [ 2048, 2048 ],
    "train_res": [1024, 768],
    "batch": 2,
    "learning_rate": [0.03, 0.01],
    "ks_min" : [0, 0.08, 0.0],
    "dmtet_grid" : 128,
    "mesh_scale" : 2.1,
    "laplace_scale" : 3000,
    "display": [{"latlong" : true}, {"bsdf" : "kd"}, {"bsdf" : "ks"}, {"bsdf" : "normal"}],
    "background" : "white",
    "out_dir": "output"
}
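With the config saved (here as configs/character.json, a name we chose for illustration), nvdiffrec is launched through its usual entry point; a sketch assuming the file layout of the nvdiffrec repository:

python train.py --config configs/character.json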

The following results are obtained:

From left to right: combination, ground truth, environment map, albedo, depth, normal

After 5000 iterations we get the following results:

  • Mean squared error: 0.00283534
  • Peak signal-to-noise ratio: 25.590
  • 5504 vertices
  • 9563 texture coordinates
  • 5504 normals
  • 11040 faces

As expected, novel view synthesis using InstantNGP produces better results at the expected viewing angles; however, the results tend to be inconsistent when viewed from extreme angles.

R-Precision scores calculated using Sentence-Transformers/clip-ViT-B-32  confirm our findings:

Prompt: "a 3d model of a girl"
0.34226945 — LDM R-Precision
0.338451 — InstantNGP R-Precision
0.3204362 — nvdiffrec R-Precision
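A sketch of how such CLIP scores can be computed with sentence-transformers (the image path and prompt are illustrative; a full R-Precision evaluation would additionally rank the prompt against a set of distractor captions):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")

# Embed a rendered view and the prompt into the shared CLIP space
image_emb = model.encode(Image.open("renders/frame_000.png"))  # illustrative path
text_emb = model.encode("a 3d model of a girl")

print(util.cos_sim(image_emb, text_emb))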

6. Deployment

We provide a Colab notebook running 3 ControlNet modules, resulting in a peak VRAM usage of 14 GiB. Generation takes a total of 2.5 hours.


Original link: 3D character generation AI implementation - BimAnt
