ICCV 2023 | SJTU & Microsoft open-source Make-It-3D: 2D-to-3D generation with more than 1k GitHub stars!

Click the card below to follow the "CVer" official account

Premium AI/CV content, delivered to you first

Click to enter —> [Computer Vision and Paper Submission] Exchange Group

Reprinted from: Heart of the Machine

Given a few pictures like these, can you guess what the objects look like in the three-dimensional world?

[Images: example input photos]

With rich visual prior knowledge, we can easily infer an object's 3D geometry and its appearance from different viewing angles given just one photo. This ability stems from our deep understanding of the visual world. Today, much like humans, excellent image generation models such as Stable Diffusion and Midjourney also possess rich visual priors and can produce high-quality images. Based on this observation, the researchers hypothesize that a high-quality pre-trained image generation model has a human-like ability to infer 3D content from a single real or AI-generated image.

The task is very challenging: it requires estimating the underlying 3D geometry while simultaneously generating unseen textures. Building on the above hypothesis, researchers from Shanghai Jiao Tong University, HKUST, and Microsoft Research proposed Make-It-3D, a method that creates a high-fidelity 3D object from a single image by using a 2D diffusion model as a 3D-aware prior. The framework does not require multi-view images for training and can be applied to any input image. The paper has been accepted to ICCV 2023.



  • Paper link: https://arxiv.org/pdf/2303.14184.pdf

  • Project link: https://make-it-3d.github.io/

  • Github link: https://github.com/junshutang/Make-It-3D

As soon as the paper was released, it sparked heated discussion on Twitter, and the subsequently open-sourced code has accumulated more than 1.1k stars on GitHub.


So what are the technical details behind the method?

When optimizing the 3D representation, the method is built on two core objectives:

1. The rendering at the reference viewpoint should be highly consistent with the input image;

2. Renderings at novel viewpoints should share the same semantics as the input image. For this semantic constraint, the researchers use the BLIP-2 model to generate a text caption for the input image (a minimal captioning sketch follows this list).
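For concreteness, here is a minimal captioning sketch using a public BLIP-2 checkpoint from Hugging Face Transformers; the checkpoint name and generation settings are illustrative and may differ from what the official Make-It-3D code uses.

```python
# Minimal BLIP-2 captioning sketch (Hugging Face Transformers).
# The checkpoint and generation settings are illustrative; the official
# Make-It-3D code may load and call BLIP-2 differently.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("reference.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)  # used as the text prompt for the diffusion prior
```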

Based on these objectives, in the first stage the method randomly samples camera poses around the reference viewpoint. It imposes a pixel-level constraint between the rendered image and the reference image at the reference view, and uses the prior of a pre-trained diffusion model to measure image-text similarity at novel views.
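The sketch below illustrates how these two constraints could be combined in a single first-stage optimization step. It is PyTorch-style pseudocode under stated assumptions: `render_view`, `sample_pose_near`, and `sds_loss` are hypothetical placeholders standing in for NeRF rendering, camera sampling, and score-distillation-style diffusion guidance, not functions from the released repository.

```python
import torch
import torch.nn.functional as F

def stage_one_step(nerf, diffusion_prior, ref_image, ref_pose, text_prompt, optimizer):
    # Hypothetical helpers: render_view, sample_pose_near, and sds_loss are
    # placeholders for illustration, not the official Make-It-3D API.
    optimizer.zero_grad()

    # (1) Reference view: pixel-level reconstruction constraint.
    ref_render = render_view(nerf, ref_pose)              # (3, H, W)
    loss_ref = F.mse_loss(ref_render, ref_image)

    # (2) Novel view near the reference pose: diffusion-prior (SDS-style) guidance
    #     keeps the rendering semantically consistent with the text caption.
    novel_pose = sample_pose_near(ref_pose)
    novel_render = render_view(nerf, novel_pose)
    loss_prior = sds_loss(diffusion_prior, novel_render, text_prompt)

    loss = loss_ref + 0.1 * loss_prior                    # the weighting is illustrative
    loss.backward()
    optimizer.step()
    return loss.item()
```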


However, text alone can hardly capture all the information in an image, which makes it difficult to fully align the generated 3D model with the reference image. To strengthen the correspondence between the generated geometry and the image, the paper additionally constrains the image-level similarity between the denoised image in the diffusion process and the reference image, i.e., it constrains the distance between their CLIP encodings. This further improves how closely the generated model matches the image.
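One way to realize this image-level constraint is to penalize the cosine distance between CLIP embeddings of the two images. The sketch below uses the public CLIP model from Hugging Face Transformers and is only an approximation of the loss described in the paper; in the actual optimization, the denoised render would need to pass through CLIP differentiably rather than via the preprocessing pipeline.

```python
from transformers import CLIPModel, CLIPProcessor

# Illustrative CLIP image-to-image distance; the paper applies such a constraint
# between the denoised image in the diffusion process and the reference image.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_distance(image_a, image_b):
    """image_a, image_b: PIL images (or arrays accepted by CLIPProcessor)."""
    inputs = clip_processor(images=[image_a, image_b], return_tensors="pt")
    feats = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)
    # Cosine distance: 0 when the two images have identical CLIP embeddings.
    return 1.0 - (feats[0] * feats[1]).sum()
```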


In addition, the paper uses monocular depth estimated from the single input image to avoid geometric ambiguities such as concave surfaces.
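A common way to turn a monocular depth estimate into a shape regularizer is to encourage the rendered depth to be correlated with the predicted depth, using correlation rather than absolute error because monocular depth is only defined up to an unknown scale and shift. The sketch below shows one such formulation; it is an assumption-based illustration, not code taken from the official repository.

```python
import torch

def depth_correlation_loss(rendered_depth, mono_depth, mask=None):
    """Negative Pearson correlation between the rendered depth and a monocular
    depth estimate (both (H, W)); `mask` optionally restricts the loss to
    foreground pixels. Illustrative formulation, not the official implementation."""
    if mask is not None:
        rendered_depth = rendered_depth[mask]
        mono_depth = mono_depth[mask]
    d1 = rendered_depth.flatten().float()
    d2 = mono_depth.flatten().float()
    d1 = d1 - d1.mean()
    d2 = d2 - d2.mean()
    corr = (d1 * d2).sum() / (d1.norm() * d2.norm() + 1e-8)
    return 1.0 - corr  # minimized when the two depth maps are perfectly correlated
```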

However, the researchers found that the texture implicit field optimized in this way struggles to fully reconstruct fine texture details of the image, such as the fur pattern on the bear's surface and local color details, which are not reflected in the first-stage results. The method therefore introduces a second optimization stage that focuses on texture refinement.


In the second stage, the method maps the high-quality texture of the reference image into 3D space according to the geometry obtained in the first stage, and then focuses on enhancing the texture of regions that are occluded in the reference view. To make this easier, the method exports the first-stage implicit representation to an explicit one: a point cloud. Compared with the noisy mesh exported by Marching Cubes, the point cloud provides cleaner geometric features and makes it easier to separate occluded from non-occluded regions.
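To make the visible/occluded split concrete, here is a hedged sketch that projects the exported points into the reference camera and marks a point as visible when its projected depth agrees with the reference depth map, copying the reference pixel color onto visible points. The camera conventions and the tolerance-based visibility test are assumptions for illustration; the actual Make-It-3D implementation may differ.

```python
import torch

def texture_from_reference(points, init_colors, ref_image, ref_depth, K, w2c, tol=0.01):
    """Assign reference-view colors to points visible from the reference camera.

    points:      (N, 3) world-space point cloud exported from the first stage
    init_colors: (N, 3) initial per-point colors (e.g. queried from the color field)
    ref_image:   (3, H, W) reference image; ref_depth: (H, W) depth at the reference view
    K: (3, 3) intrinsics; w2c: (4, 4) world-to-camera matrix (conventions are illustrative).
    Returns (colors, occluded): updated colors and a boolean mask of occluded points.
    """
    N = points.shape[0]
    homog = torch.cat([points, torch.ones(N, 1)], dim=1)   # (N, 4)
    cam = (w2c @ homog.T).T[:, :3]                          # (N, 3) camera-space points
    z = cam[:, 2].clamp(min=1e-6)
    uv = (K @ cam.T).T                                      # (N, 3) homogeneous pixels
    u = (uv[:, 0] / z).round().long()
    v = (uv[:, 1] / z).round().long()

    H, W = ref_depth.shape
    in_image = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    visible = torch.zeros(N, dtype=torch.bool)
    idx = in_image.nonzero(as_tuple=True)[0]
    # A point is visible if its depth agrees with the reference depth map.
    visible[idx] = (z[idx] - ref_depth[v[idx], u[idx]]).abs() < tol * z[idx]

    colors = init_colors.clone()
    colors[visible] = ref_image[:, v[visible], u[visible]].T  # copy reference pixels
    occluded = ~visible                                       # to be refined in stage two
    return colors, occluded
```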

The method then focuses on optimizing the texture of the occluded regions. The point cloud is rendered with a UNet-based deferred renderer, and the prior of the pre-trained diffusion model is again used to refine the fine texture of the occluded areas.
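Below is a rough sketch of what one second-stage refinement step might look like under these ideas: rasterize per-point features into an image, decode them with a UNet-style deferred renderer, anchor the visible pixels to the colors projected from the reference image, and let the diffusion prior refine the occluded pixels. `rasterize_points` and `sds_loss` are hypothetical placeholders, and the loss composition is an assumption rather than the official training loop.

```python
import torch.nn.functional as F

def stage_two_step(point_cloud, unet_renderer, diffusion_prior, text_prompt,
                   pose, optimizer):
    # Hypothetical helpers: rasterize_points and sds_loss are placeholders.
    # `point_cloud` is assumed to carry per-point features, reference colors,
    # and the occlusion flags computed in the projection step above.
    optimizer.zero_grad()

    # Deferred rendering: rasterize per-point features, then decode with a UNet.
    feat_map, occ_map, ref_rgb = rasterize_points(point_cloud, pose)  # (C,H,W),(H,W),(3,H,W)
    rgb = unet_renderer(feat_map.unsqueeze(0)).squeeze(0)             # (3, H, W)

    vis = ~occ_map
    # Visible pixels stay anchored to the reference colors; occluded pixels are
    # refined by the diffusion prior conditioned on the BLIP-2 caption.
    loss_keep = F.l1_loss(rgb[:, vis], ref_rgb[:, vis])
    loss_occ = sds_loss(diffusion_prior, rgb, text_prompt)

    loss = loss_keep + loss_occ
    loss.backward()
    optimizer.step()
    return loss.item()
```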


From left to right: the reference image, the normal map and textured rendering after the first-stage optimization, and the rendering after the second-stage texture refinement.

The approach also supports a variety of interesting applications, including free editing and stylization of 3D textures, as well as text-driven generation of complex and diverse 3D content.


Conclusion

As the first method to lift 2D images into 3D while maintaining rendering quality and realism comparable to the reference image, Make-It-3D aims to create 3D content with the same visual quality as 2D images. The researchers hope that Make-It-3D draws more attention, in both academia and industry, to 2D-to-3D generation and accelerates the development of 3D content creation. For more experimental details and additional results, please refer to the paper and the project homepage.

So, let's Make-It-3D with ease!


