Reprinted from: Heart of the Machine
Compared with Point·E, OpenAI's earlier explicit point-cloud generative model, the new conditional 3D generative model Shap·E models a higher-dimensional, multi-representation output space, converges faster, and achieves comparable or better sample quality.
Large generative AI models are a focus of OpenAI's efforts. So far it has released the text-to-image models DALL-E and DALL-E 2, as well as Point·E, a text-to-3D model released earlier this year.
Recently, the OpenAI research team upgraded its 3D generative modeling work and released Shap·E, a conditional generative model for synthesizing 3D assets. The model weights, inference code, and samples are open source.
Paper address: https://arxiv.org/abs/2305.02463
Project address: https://github.com/openai/shap-e
Let's first look at the generated results. As with text-to-image generation, the 3D objects Shap·E produces can have a whimsical, unconstrained style. For example, an airplane that looks like a banana:
A chair that looks like a tree:
There are also classic examples, like the avocado chair:
Of course, it is also possible to generate 3D models of some common objects, such as a bowl of vegetables:
Donut:
Shap·E, proposed in this paper, is a latent diffusion model over the space of 3D implicit functions, whose outputs can be rendered both as NeRFs and as textured meshes. Given the same dataset, model architecture, and training compute, Shap·E outperforms comparable explicit generative models. The researchers found that pure text-conditional models can generate diverse and interesting objects, highlighting the potential of generating implicit representations.
Unlike prior work on 3D generative models that produces a single output representation, Shap·E directly generates the parameters of implicit functions. Training Shap·E proceeds in two stages: first, an encoder is trained that deterministically maps 3D assets into the parameters of an implicit function; second, a conditional diffusion model is trained on the outputs of the encoder. When trained on a large dataset of paired 3D and text data, the model can generate complex and diverse 3D assets in seconds. Compared with the explicit point-cloud generative model Point·E, Shap·E models a higher-dimensional, multi-representation output space, converges faster, and achieves comparable or better sample quality.
Research Background
This paper focuses on two types of implicit neural representations (INRs) for 3D:
NeRF, an INR that represents a 3D scene as a function mapping coordinates and viewing directions to densities and RGB colors;
DMTet and its extension GET3D, which represent a textured 3D mesh as a function mapping coordinates to colors, signed distances, and vertex offsets. This INR enables constructing a 3D triangle mesh in a differentiable manner, which is then rendered with a differentiable rasterization library.
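To make the first representation concrete, a NeRF-style INR is just a small MLP that maps a spatial coordinate and a view direction to a density and a color. Below is a minimal NumPy sketch with randomly initialized, toy-sized weights — purely illustrative, not the architecture or sizes used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy NeRF-style implicit function: (x, y, z) + view direction -> (density, RGB).
# Weights are random here; in Shap-E they are produced by the trained encoder.
W1 = rng.normal(size=(6, 64))   # input: 3 coordinates + 3 view-direction components
b1 = np.zeros(64)
W2 = rng.normal(size=(64, 4))   # output: 1 density + 3 RGB channels
b2 = np.zeros(4)

def nerf_mlp(coords, view_dirs):
    h = np.tanh(np.concatenate([coords, view_dirs], axis=-1) @ W1 + b1)
    out = h @ W2 + b2
    density = np.exp(out[..., :1])          # density must be non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[..., 1:]))  # colors squashed into [0, 1]
    return density, rgb

density, rgb = nerf_mlp(rng.normal(size=(5, 3)), rng.normal(size=(5, 3)))
print(density.shape, rgb.shape)  # (5, 1) (5, 3)
```

Rendering then integrates these densities and colors along camera rays; the key point for Shap·E is that the whole scene is encoded in the MLP's weight matrices.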
While INRs are flexible and expressive, obtaining an INR for each sample in a dataset is expensive. Furthermore, each INR may have many numerical parameters, which can pose difficulties when training downstream generative models. One way to address these problems is to use an autoencoder with an implicit decoder, obtaining smaller latent representations that can be modeled directly with existing generative techniques. An alternative is to use meta-learning to create an INR dataset that shares most of its parameters, and then train a diffusion model or normalizing flow on the free parameters of these INRs. It has also been suggested that gradient-based meta-learning may not be necessary; instead, a Transformer encoder can be trained directly to produce NeRF parameters conditioned on multiple views of a 3D object.
The researchers combined and extended the above methods to obtain Shap·E, a conditional generative model for diverse, complex 3D implicit representations. They first train a Transformer-based encoder to produce INR parameters for 3D assets, and then train a diffusion model on the encoder's outputs. Unlike previous approaches, the generated INRs represent both NeRFs and meshes, allowing them to be rendered in multiple ways or imported into downstream 3D applications.
When trained on a dataset of millions of 3D assets, the model can generate diverse, recognizable samples conditioned on text prompts. Compared with the recently proposed explicit 3D generative model Point·E, Shap·E converges faster and achieves comparable or better results given the same model architecture, dataset, and conditioning mechanism.
Method overview
The researchers first train the encoder to generate an implicit representation, and then train the diffusion model on the latent representation generated by the encoder, which is mainly divided into the following two steps:
1. Train an encoder to produce the parameters of an implicit function given a dense explicit representation of a known 3D asset. The encoder produces the latent representation of the 3D asset and then linearly projects it to obtain the weights of the multi-layer perceptron (MLP);
2. Apply the encoder to the dataset, then train the diffusion prior on the latent dataset. The model is conditioned on image or text descriptions.
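The key mechanical detail of step 1 is that the encoder does not emit MLP weights directly: it outputs a latent sequence, and a learned linear projection turns each latent token into a row of an implicit-function weight matrix. A schematic NumPy sketch with made-up toy dimensions (the actual Shap·E latent is a 1024×1024 sequence, and the projection is learned, not random):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; Shap-E's actual latent is a 1024 x 1024 sequence.
n_tokens, d_latent = 8, 16
d_hidden = 8  # width of the implicit-function MLP layer being parameterized

# Step 1 output: the encoder maps a 3D asset to a latent sequence
# (random here, standing in for a trained encoder's output).
latents = rng.normal(size=(n_tokens, d_latent))

# Learned linear projection: each latent token yields one row of the
# implicit-function MLP's weight matrix.
proj = 0.1 * rng.normal(size=(d_latent, d_hidden))
mlp_weights = latents @ proj  # (n_tokens, d_hidden)

# The generated matrix then acts as a layer of the implicit function:
x = rng.normal(size=(5, n_tokens))    # toy inputs with matching fan-in
h = np.maximum(x @ mlp_weights, 0.0)  # ReLU layer using generated weights
print(mlp_weights.shape, h.shape)     # (8, 8) (5, 8)
```

Because the latent sequence is much smaller and better-behaved than raw MLP weights, it is this sequence — not the weights themselves — that the diffusion prior in step 2 models.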
The researchers train all models on a large dataset of 3D assets with corresponding renderings, point clouds, and text captions.
3D encoder
The encoder architecture is shown in Figure 2 below.
Latent diffusion
The generative model adopts the transformer-based diffusion architecture of Point·E, but replaces the point cloud with a sequence of latent vectors. The latents, of shape 1024×1024, are fed into the transformer as a sequence of 1024 tokens, where each token corresponds to a different row of the MLP weight matrix. The model is therefore roughly computationally equivalent to the base Point·E model (i.e., it has the same context length and width), with additional input and output channels for generating samples in this higher-dimensional space.
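Diffusion over these latent tokens follows the standard DDPM recipe: corrupt the latent sequence with a closed-form forward noising process and train the transformer to denoise it. A minimal sketch of the forward step with toy sizes (conditioning and the denoising transformer are omitted; the noise schedule below is the common linear one, an assumption, not the paper's exact schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent sequence standing in for Shap-E's 1024 x 1024 latent.
x0 = rng.normal(size=(16, 32))  # (tokens, channels)

# Standard DDPM forward process:
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # common linear schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)      # monotonically decreasing in t

t = 500
eps = rng.normal(size=x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# The denoiser (a transformer in Shap-E) is trained to predict eps from (xt, t),
# optionally conditioned on an image or text embedding.
print(xt.shape)  # (16, 32)
```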
Experimental results
Encoder evaluation
The researchers track two rendering-based metrics throughout encoder training. First, they evaluate the peak signal-to-noise ratio (PSNR) between reconstructed renderings and ground-truth renderings. Second, to measure the encoder's ability to capture semantically relevant details of 3D assets, they encode meshes produced by the largest Point·E model and re-evaluate the CLIP R-Precision of the reconstructed NeRF and STF renderings.
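PSNR here is the usual image-reconstruction metric: for images scaled to [0, 1], PSNR = 10·log10(1/MSE) in decibels, so lower reconstruction error gives a higher score. A minimal implementation:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform offset of 0.1 gives MSE = 0.01, i.e. PSNR = 20 dB exactly.
target = np.full((64, 64, 3), 0.5)
pred = np.full((64, 64, 3), 0.6)
print(round(psnr(pred, target), 2))  # 20.0
```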
Table 1 below tracks these two metrics across the stages of training. Distillation hurts NeRF reconstruction quality, while fine-tuning not only restores but slightly improves NeRF quality, and greatly improves STF rendering quality.
Comparison with Point·E
The latent diffusion model proposed by the researchers shares its architecture, training dataset, and conditioning mechanism with Point·E, which makes the comparison useful for isolating the effect of generating implicit rather than explicit neural representations. Figure 4 below compares the methods on sample-based evaluation metrics.
Qualitative samples are shown in Figure 5 below; the models often produce samples of differing quality for the same text prompt. The text-conditional Shap·E begins to degrade on the evaluation metrics before training is complete.
The researchers found that Shap・E and Point・E tend to share similar failure cases, as shown in Figure 6 (a) below. This suggests that the training data, model architecture, and conditioned images have a greater impact on the generated samples than the chosen representation space.
There are still some qualitative differences between the two image-conditional models: for example, in the first row of Figure 6(b) below, Point·E ignores the small gaps in the bench while Shap·E attempts to model them. The paper hypothesizes that this particular discrepancy arises because point clouds do not represent thin features or gaps well. As also observed in Table 1, the 3D encoder slightly reduces CLIP R-Precision when applied to Point·E samples.
Comparison with other methods
In Table 2 below, the researchers compare Shap·E with a broader range of 3D generation techniques on the CLIP R-Precision metric.
Limitations and Prospects
While Shap·E can understand many single-object prompts with simple attributes, its ability to compose concepts is limited. As seen in Figure 7 below, the model has difficulty binding multiple attributes to different objects, and cannot reliably generate the correct number of objects when more than two are requested. This may be a result of insufficient paired training data, and might be addressed by collecting or generating larger annotated 3D datasets.
Additionally, Shap·E produces recognizable 3D assets, but these often look grainy or lack fine detail. Figure 3 below shows that the encoder sometimes loses detailed textures (such as the stripes on a cactus), suggesting that an improved encoder could recover some of the lost generation quality.
Please refer to the original paper for more technical and experimental details.