Reprinted from: Heart of the Machine
Given a few pictures, can you guess what they look like in the three-dimensional world?
Thanks to rich visual prior knowledge, humans can easily infer an object's 3D geometry and its appearance under different viewing angles from just one photo. This ability stems from our deep understanding of the visual world. Now, just like humans, some excellent image generation models, such as Stable Diffusion and Midjourney, also possess rich visual priors and produce high-quality images. Based on this observation, the researchers hypothesize that a high-quality pretrained image generation model can, like a human, infer 3D content from a single real or AI-generated image.
This task is very challenging: it requires both estimating the underlying 3D geometry and hallucinating textures that are unseen in the input. Based on the above hypothesis, researchers from Shanghai Jiao Tong University, HKUST, and Microsoft Research proposed Make-It-3D, a method that creates a high-fidelity 3D object from a single image by using a 2D diffusion model as a 3D-aware prior. The framework does not require multi-view images for training and can be applied to any input image. The paper has been accepted to ICCV 2023.
Paper link: https://arxiv.org/pdf/2303.14184.pdf
Project link: https://make-it-3d.github.io/
Github link: https://github.com/junshutang/Make-It-3D
As soon as the paper was released, it sparked heated discussion on Twitter, and the subsequently open-sourced code has accumulated more than 1.1k stars on GitHub.
So what are the technical details behind the method?
When optimizing the 3D representation, the method is driven by two core objectives:
1. The rendering result under the reference viewing angle should be highly consistent with the input image;
2. The rendering result at novel viewpoints should show the same semantics as the input. For the latter, the researchers use the BLIP2 model to generate a text caption for the input image.
Based on these objectives, in the first stage the method randomly samples camera poses around the reference viewpoint. It imposes pixel-level constraints between the rendered image and the reference image at the reference view, and uses the prior of a pretrained diffusion model to measure image-text similarity at novel views.
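The two parts of the first-stage objective can be sketched as a pixel-level reconstruction loss at the reference view plus a score-distillation-style gradient from the diffusion prior at novel views. The sketch below is a minimal numpy illustration, not the authors' implementation: the random arrays stand in for rendered images and for the diffusion model's predicted/injected noise, which in the real method come from a renderer and a text-conditioned diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def pixel_loss(rendered_ref, reference):
    """Pixel-level L2 constraint at the reference viewpoint."""
    return float(np.mean((rendered_ref - reference) ** 2))

def sds_gradient(noise_pred, noise, weight):
    """Score-distillation-style gradient, w(t) * (eps_pred - eps).
    noise_pred would come from a text-conditioned diffusion model
    evaluated on a noised novel-view rendering."""
    return weight * (noise_pred - noise)

# Toy stand-ins for the rendered images and diffusion outputs.
reference = rng.random((64, 64, 3))
rendered_ref = reference + 0.01 * rng.standard_normal(reference.shape)
noise = rng.standard_normal((64, 64, 3))
noise_pred = noise + 0.1 * rng.standard_normal(noise.shape)

l_ref = pixel_loss(rendered_ref, reference)     # small: views nearly match
grad = sds_gradient(noise_pred, noise, weight=0.5)
```

In practice the gradient is backpropagated through the novel-view rendering into the 3D representation, while the pixel loss only touches the reference view.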
However, text alone can hardly describe all the information in an image, which makes it difficult to fully align the generated 3D model with the reference image. Therefore, to strengthen the correlation between the generated geometry and the image, the paper additionally constrains the similarity between the denoised image and the reference image during the diffusion process, i.e., the distance between their CLIP encodings. This constraint further improves how closely the generated model matches the input.
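The CLIP constraint amounts to a cosine distance between two image embeddings. A minimal sketch, assuming embeddings are already available (the CLIP image encoder itself is not shown; random vectors stand in for its outputs):

```python
import numpy as np

def clip_distance(emb_a, emb_b):
    """Cosine distance between two image embeddings,
    e.g. from a CLIP image encoder (assumed, not shown)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return 1.0 - float(np.dot(a, b))

rng = np.random.default_rng(1)
ref_emb = rng.standard_normal(512)                        # reference image embedding
denoised_emb = ref_emb + 0.05 * rng.standard_normal(512)  # denoised image embedding

loss = clip_distance(denoised_emb, ref_emb)  # minimized during optimization
```

Minimizing this distance pulls the appearance of the denoised novel-view images toward the reference image, beyond what the text caption alone can enforce.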
In addition, the paper uses monocular depth estimated from the single input image to avoid geometric ambiguities such as concave surfaces.
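Since monocular depth predictions have unknown scale and shift, a common way to use them (the exact form in the paper may differ) is a correlation-based constraint between the rendered depth and the estimated depth, which is invariant to affine rescaling:

```python
import numpy as np

def depth_corr_loss(rendered_depth, estimated_depth):
    """Negative Pearson correlation between rendered depth and a
    monocular depth estimate. Correlation tolerates the unknown
    scale/shift of monocular predictions."""
    r = rendered_depth.ravel() - rendered_depth.mean()
    e = estimated_depth.ravel() - estimated_depth.mean()
    return -float(np.dot(r, e) / (np.linalg.norm(r) * np.linalg.norm(e)))

rng = np.random.default_rng(2)
mono_depth = rng.random((32, 32))
# A rendered depth that agrees up to scale and shift scores near -1.
rendered = 2.0 * mono_depth + 0.3
loss = depth_corr_loss(rendered, mono_depth)
```

A depth that matches the estimate only up to an affine transform is penalized not at all, which is exactly the desired invariance.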
However, the researchers found that an optimized implicit texture field still struggles to reproduce the fine texture details of the image, such as the fur pattern on the bear's surface and local color information, which are not captured in the first-stage results. The method therefore adds a second optimization stage that focuses on texture refinement.
In the second stage, the method maps the high-quality texture of the reference image into 3D space according to the geometric model obtained in the first stage, then focuses on enhancing the texture of regions occluded in the reference view. To implement this, the method exports the first-stage implicit representation to an explicit one: a point cloud. Compared with the noisy mesh extracted by Marching Cubes, the point cloud provides cleaner geometric features and makes it easier to separate occluded from visible regions.
Subsequently, the method focuses on optimizing the texture of the occluded regions. The point cloud is rendered with a deferred renderer built on a UNet structure, and the prior of the pretrained diffusion model is again used to refine the fine-grained texture of the occluded areas.
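The "map the reference texture into 3D" step boils down to back-projecting each reference-view pixel through its depth into a colored 3D point. A minimal pinhole-camera sketch (toy intrinsics and a constant depth map, purely illustrative):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift each reference-view pixel to a 3D point using its depth
    and a pinhole camera model; in the full pipeline the reference
    image's colors would be attached to these points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 2.0)  # toy 4x4 depth map, all at depth 2
pts = backproject(depth, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
```

Points visible from the reference view keep their observed colors; points that fall in occluded regions are the ones whose texture the second stage fills in with the diffusion prior.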
From left to right: the reference image, the normal map and texture rendering after first-stage optimization, and the rendering after second-stage texture refinement.
The approach also supports a variety of interesting applications, including free editing and stylization of 3D textures, as well as text-driven generation of complex and diverse 3D content.
Epilogue
As the first method to lift 2D images into 3D while maintaining rendering quality and realism on par with the reference image, Make-It-3D aims to create 3D content with the same visual quality as 2D images. The researchers hope that this work will draw more attention in academia and industry to 2D-to-3D solutions and accelerate the development of 3D content creation. For more experimental details and results, please refer to the paper and the project homepage.
Now, let's Make-It-3D with ease!