OpenAI's popular new work Shap-E: text-to-3D generation gets an upgrade, with modeling completed in seconds!

Reprinted from: Heart of the Machine

This time, compared with Point-E, OpenAI's explicit point-cloud generative model, the new conditional 3D generative model Shap-E models a higher-dimensional, multi-representation output space, converges faster, and achieves comparable or better sample quality.

Large generative AI models are a major focus of OpenAI's efforts. It has already released the text-to-image models DALL-E and DALL-E 2, as well as Point-E, a text-to-3D model launched earlier this year.

Recently, the OpenAI research team upgraded its 3D generation work and released Shap-E, a conditional generative model for synthesizing 3D assets. The model weights, inference code, and samples are open source.


  • Paper address: https://arxiv.org/abs/2305.02463

  • Project address: https://github.com/openai/shap-e
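Since the weights and inference code are public, text-conditional sampling can be tried locally. The sketch below follows the example notebook shipped with the repository above; the module paths and sampler arguments reflect the repo at release time and may change in later versions, so treat it as an illustration rather than a definitive API reference.

```python
# Minimal text-to-3D sampling sketch, adapted from the example notebook in
# https://github.com/openai/shap-e; paths and arguments may differ by version.
import torch

from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import create_pan_cameras, decode_latent_images

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

xm = load_model("transmitter", device=device)        # encoder/decoder ("transmitter")
model = load_model("text300M", device=device)        # text-conditional latent diffusion model
diffusion = diffusion_from_config(load_config("diffusion"))

# Sample a latent conditioned on a text prompt.
latents = sample_latents(
    batch_size=1,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=["a chair that looks like an avocado"]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

# Decode the latent into a turntable of rendered views (NeRF rendering mode).
cameras = create_pan_cameras(64, device)
images = decode_latent_images(xm, latents[0], cameras, rendering_mode="nerf")
```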

Let's first look at some generated results. As with text-to-image generation, the 3D objects produced by Shap-E lean toward a whimsical, anything-goes style. For example, an airplane that looks like a banana:

[GIF: an airplane that looks like a banana]

A chair that looks like a tree:

[GIF: a chair that looks like a tree]

There are also classic examples, like the avocado chair:

[GIF: an avocado chair]

Of course, it is also possible to generate 3D models of some common objects, such as a bowl of vegetables:

[GIF: a bowl of vegetables]

Donut:

[GIF: a donut]

Shap-E, proposed in this paper, is a latent diffusion model over a space of 3D implicit functions that can be rendered as NeRFs and textured meshes. Given the same dataset, model architecture, and training compute, Shap-E outperforms a comparable explicit generative model. The researchers found that the pure text-conditional model can generate diverse and interesting objects, highlighting the potential of generating implicit representations.


Unlike 3D generative models that produce a single output representation, Shap-E directly generates the parameters of implicit functions. Training proceeds in two stages: first, an encoder is trained to deterministically map 3D assets to the parameters of an implicit function; second, a conditional diffusion model is trained on the encoder's outputs. When trained on a large dataset of paired 3D and text data, the model can generate complex and diverse 3D assets in seconds. Compared with Point-E, an explicit generative model over point clouds, Shap-E models a higher-dimensional, multi-representation output space, converges faster, and achieves comparable or better sample quality.

Research Background

This paper focuses on two types of implicit neural representations (INRs) for 3D:

  • NeRF, an INR that represents a 3D scene as a function mapping coordinates and viewing directions to densities and RGB colors (see the sketch after this list);

  • DMTet and its extension GET3D, INRs that represent a textured 3D mesh as a function mapping coordinates to colors, signed distances, and vertex offsets. This INR allows a 3D triangle mesh to be constructed in a differentiable manner, which is then rendered with a differentiable rasterization library.
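To make the NeRF bullet concrete, here is a minimal, self-contained sketch of what such an INR looks like as code: a small MLP mapping a 3D coordinate and viewing direction to a density and an RGB color. The layer sizes and the missing positional encoding are simplifications for illustration; this is not Shap-E's actual decoder.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Toy NeRF-style implicit function: (coords, view dirs) -> (density, rgb).

    Illustrative only; widths and the lack of positional encoding are
    simplifications, not Shap-E's actual architecture.
    """
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)       # density >= 0 after softplus
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden), nn.ReLU(),  # color depends on view direction
            nn.Linear(hidden, 3), nn.Sigmoid(),        # rgb in [0, 1]
        )

    def forward(self, coords: torch.Tensor, view_dirs: torch.Tensor):
        h = self.trunk(coords)
        density = torch.nn.functional.softplus(self.density_head(h))
        rgb = self.color_head(torch.cat([h, view_dirs], dim=-1))
        return density, rgb

# Query the field at a batch of points along camera rays.
field = TinyNeRF()
densities, colors = field(torch.rand(1024, 3), torch.rand(1024, 3))
```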

While INRs are flexible and expressive, obtaining an INR for each sample in a dataset is expensive. Furthermore, each INR may have many numerical parameters, which can make training downstream generative models difficult. One way to address this is to use an autoencoder with an implicit decoder, yielding smaller latent representations that can be modeled directly with existing generative techniques. An alternative is to use meta-learning to create an INR dataset that shares most of its parameters, and then train a diffusion model or normalizing flow on the free parameters of these INRs. It has also been suggested that gradient-based meta-learning may not be necessary at all: a Transformer-based encoder can instead be trained directly to produce NeRF parameters conditioned on multiple views of a 3D object.

The researchers combine and extend these approaches to arrive at Shap-E, a conditional generative model for diverse, complex 3D implicit representations. First, a Transformer-based encoder is trained to produce INR parameters for 3D assets; then, a diffusion model is trained on the encoder's outputs. Unlike previous approaches, the generated INRs represent both NeRFs and meshes, allowing them to be rendered in multiple ways or imported into downstream 3D applications.

When trained on a dataset of millions of 3D assets, the model can generate diverse, recognizable samples conditioned on text prompts. Compared with the recently proposed explicit 3D generative model Point-E, Shap-E converges faster and achieves comparable or better results given the same model architecture, dataset, and conditioning mechanism.

Method overview

The researchers first train an encoder to produce implicit representations, and then train a diffusion model on the latent representations the encoder produces. The procedure has two main steps:

1. Train an encoder that, given a dense explicit representation of a known 3D asset, produces the parameters of an implicit function. The encoder produces a latent representation of the 3D asset, which is then linearly projected to obtain the weights of a multi-layer perceptron (MLP); a sketch of this projection appears after this list.

2. Apply the encoder to the dataset and train a diffusion prior on the resulting latent dataset, conditioned on images or text descriptions.
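Below is a hedged, self-contained sketch of the projection mechanism mentioned in step 1: each row of the latent is linearly projected into a row of the implicit MLP's weight matrices, and the resulting network can then be queried at 3D coordinates. All sizes and the crude input embedding are illustrative placeholders, not the released Shap-E configuration.

```python
# Hedged sketch of step 1's key mechanism: latent rows are linearly projected
# into rows of the implicit MLP's weight matrices. Sizes are placeholders.
import torch
import torch.nn as nn

d_latent = 1024            # width of each latent vector
mlp_width = 256            # width of the generated implicit MLP (placeholder)
n_rows = 2 * mlp_width     # rows needed for a toy 2-layer MLP (placeholder)

latent = torch.randn(n_rows, d_latent)       # in Shap-E this comes from the 3D encoder

proj = nn.Linear(d_latent, mlp_width)        # shared linear projection to weight rows
rows = proj(latent)                          # [n_rows, mlp_width]
w1, w2 = rows[:mlp_width], rows[mlp_width:]  # split rows into two weight matrices

def generated_mlp(x: torch.Tensor) -> torch.Tensor:
    """Evaluate the implicit function whose weights were generated from the latent."""
    x = nn.functional.pad(x, (0, mlp_width - x.shape[-1]))  # crude stand-in for input embedding
    return torch.relu(x @ w1.T) @ w2.T

out = generated_mlp(torch.rand(4096, 3))     # query the field at 4096 coordinates
print(out.shape)                             # torch.Size([4096, 256])
```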

The researchers train all models on a large dataset of 3D assets with corresponding renderings, point clouds, and text captions.

3D encoder

The encoder architecture is shown in Figure 2 below.

[Figure 2: 3D encoder architecture]

Latent diffusion

The generative model adopts the transformer-based Point-E diffusion architecture, but replaces point clouds with sequences of latent vectors. Each latent has shape 1024×1024 and is fed into the transformer as a sequence of 1024 tokens, where each token corresponds to a different row of the MLP weight matrix. As a result, the model is roughly computationally equivalent to the base Point-E model (i.e., it has the same context length and width); on top of this, the input and output channels are widened to generate samples in a higher-dimensional space.
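To illustrate the shape bookkeeping, the sketch below treats a 1024×1024 latent as a sequence of 1024 tokens of width 1024 and runs it through widened input/output projections around a generic transformer backbone. The backbone and projection names are stand-ins, not the released diffusion architecture.

```python
import torch
import torch.nn as nn

d_latent = 1024   # token width (one row of the MLP weight matrix)
n_tokens = 1024   # sequence length, matching the base Point-E context length
d_model = 1024    # transformer width (illustrative)

latent = torch.randn(1, n_tokens, d_latent)          # [batch, 1024, 1024] latent from the encoder

# Widened input/output projections around an otherwise generic transformer backbone.
in_proj = nn.Linear(d_latent, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True), num_layers=4
)  # stand-in for the diffusion transformer
out_proj = nn.Linear(d_model, d_latent)

tokens = in_proj(latent)                             # [1, 1024, d_model]
denoised_prediction = out_proj(backbone(tokens))     # [1, 1024, 1024], same shape as the latent
```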

Experimental results

Encoder evaluation

The researchers track two rendering-based metrics throughout encoder training. The first is the peak signal-to-noise ratio (PSNR) between reconstructed and ground-truth renders. Second, to measure how well the encoder captures the semantically relevant details of 3D assets, CLIP R-Precision is evaluated on reconstructed NeRF and STF renders of meshes produced by the largest Point-E model.
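PSNR between a reconstructed render and its ground-truth render reduces to a few lines. The helper below assumes images are float tensors scaled to [0, 1].

```python
import torch

def psnr(reconstruction: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = torch.mean((reconstruction - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example: compare a reconstructed 128x128 RGB render against the ground-truth render.
recon = torch.rand(3, 128, 128)
truth = torch.rand(3, 128, 128)
print(f"PSNR: {psnr(recon, truth):.2f} dB")
```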

Table 1 below tracks these two metrics at different points in training. Distillation hurts NeRF reconstruction quality, while fine-tuning not only recovers but slightly improves NeRF quality and greatly improves STF rendering quality.

[Table 1]

Comparison with Point-E

The proposed latent diffusion model shares its architecture, training dataset, and conditioning modes with Point-E, which makes the comparison useful for isolating the effect of generating implicit rather than explicit representations. Figure 4 below compares the methods on sample-based evaluation metrics.

[Figure 4]

Qualitative samples are shown in Figure 5 below; the models often produce samples of varying quality for the same text prompt. The text-conditional Shap-E begins to degrade on evaluation metrics before the end of training.

[Figure 5]

The researchers found that Shap-E and Point-E tend to share similar failure cases, as shown in Figure 6(a) below. This suggests that the training data, model architecture, and conditioning images have a greater impact on the generated samples than the choice of representation space.

Still, there are some qualitative differences between the two image-conditional models. For example, in the first row of Figure 6(b) below, Point-E ignores the small gaps in the bench, while Shap-E attempts to model them. The researchers hypothesize that this particular discrepancy arises because point clouds do not represent thin features or gaps well. Relatedly, Table 1 shows that the 3D encoder slightly reduces CLIP R-Precision when applied to Point-E samples.

[Figure 6]

Comparison with other methods

In Table 2 below, the researchers compare Shap-E with a wider range of 3D generation techniques on the CLIP R-Precision metric.

[Table 2]

Limitations and Prospects

While Shap-E can understand many single-object prompts with simple attributes, its ability to compose concepts is limited. As shown in Figure 7 below, the model has difficulty binding multiple attributes to different objects and fails to produce the correct number of objects when more than two are requested. These failures may stem from insufficient paired training data and might be addressed by collecting or generating larger annotated 3D datasets.

[Figure 7]

Additionally, Shap-E produces recognizable 3D assets, but these often look grainy or lack fine detail. Figure 3 below shows that the encoder sometimes loses detailed textures (such as the stripes on a cactus), suggesting that an improved encoder could recover some of the lost generation quality.

[Figure 3]

Please refer to the original paper for more technical and experimental details.
