[Multi-view reconstruction] From Zero-123 to One-2-3-45: Multi-view generation



Zero-1-to-3: Zero-shot One Image to 3D Object
Paper: https://arxiv.org/pdf/2303.11328.pdf

Summary


Zero-1-to-3: Given only a single RGB image, the method performs novel view synthesis in this underconstrained setting by exploiting the geometric priors about natural images learned by a large-scale diffusion model. The conditional diffusion model is trained on a synthetic dataset to learn control over the relative camera viewpoint, which allows new images of the same object to be generated under a specified camera transformation. Even though it is trained on a synthetic dataset, the model retains strong zero-shot generalization to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. The view-conditioned diffusion approach can further be used for 3D reconstruction from a single image.

One-2-3-45 builds on Zero-123 and further lifts its inconsistent multi-view images into 3D space (as a mesh or an SDF).

1. Introduction

Humans can generally imagine an object's three-dimensional shape and appearance from just a single camera view. This ability can be partly explained by geometric priors, but it relies mostly on prior knowledge accumulated over a lifetime of visual exploration, which even lets us predict the 3D shape of objects that do not (or could not) exist in the physical world.

In contrast, most existing approaches to single-image 3D reconstruction operate in a closed-world setting, since they rely on expensive 3D annotations (such as CAD models) or category-specific priors [37, 21, 36, 67, 68, 66, 25, 24].

In this paper, we show that a large-scale diffusion model such as Stable Diffusion [44] can learn a control mechanism over the camera viewpoint, enabling zero-shot novel view synthesis and 3D shape reconstruction. Trained on internet-scale data (over 5 billion images), modern diffusion models are state-of-the-art representations of the natural image distribution, whose support covers a vast number of objects seen from many viewpoints. Although they are trained on 2D monocular images without any camera correspondences, we can fine-tune the model to learn control over relative camera rotation and translation during generation. These controls allow us to take an arbitrary image and decode it from a different camera viewpoint of our choosing. Figure 1 shows several examples of our results.

This paper demonstrates that large diffusion models already learn rich 3D priors about the visual world, even though they are trained only on 2D images. Section 2 briefly reviews related work. Section 3 describes our approach to learning camera-extrinsic control by fine-tuning a large diffusion model. Section 4 presents quantitative and qualitative experiments evaluating zero-shot novel view synthesis and the 3D reconstruction of geometry and appearance from a single image.

2. Related work

Omitted.

3. Zero-1-to-3

Given a single RGB image x ∈ R^{H×W×3} of an object, the goal is to synthesize an image of the object from a different camera viewpoint. Let R ∈ R^{3×3} and T ∈ R^3 be the relative camera rotation and translation of the desired viewpoint. We aim to learn a model f that synthesizes a new image under this camera transformation:

x̂_{R,T} = f(x, R, T)

We want the estimate x̂_{R,T} to be perceptually similar to the true but unobserved novel view x_{R,T}.

Due to the scale of their training data [47], pre-trained diffusion models are today's state-of-the-art representations of the natural image distribution. However, two challenges must be overcome to create f. First, although large-scale generative models are trained on objects seen from many viewpoints, their representations do not explicitly encode the correspondence between viewpoints. Second, generative models inherit the viewpoint biases of images on the internet. As shown in Figure 2 below, Stable Diffusion tends to generate images of chairs facing forward in canonical poses. These two issues greatly hinder the extraction of 3D knowledge from large-scale diffusion models.

[Figure 2: Stable Diffusion tends to generate forward-facing chairs in canonical poses.]

3.1 Learning to Control the Camera Viewpoint

Since diffusion models are trained on internet-scale data, the support of their learned distribution likely covers most viewpoints of most objects, but these viewpoints cannot be controlled in the pre-trained model. Once we can teach the model a mechanism to control the extrinsics of the camera that captures the photo, we unlock the ability to perform novel view synthesis.

To this end, given a dataset {(x, x_{R,T}, R, T)} of paired images and their relative camera extrinsics, as shown in Figure 3, we fine-tune the pre-trained diffusion model to learn control over the camera parameters without destroying the rest of its representation. We use a latent diffusion architecture with an encoder E, a denoising U-Net ε_θ, and a decoder D. At diffusion time step t ~ [1, 1000], let c(x, R, T) be the embedding of the input view and the relative camera extrinsics. We then fine-tune the model by solving the following objective:

min_θ E_{z~E(x), t, ε~N(0,1)} || ε − ε_θ(z_t, t, c(x, R, T)) ||²₂

After the model ε_θ is trained, the inference model f generates an image by iteratively denoising a Gaussian-noise image conditioned on c(x, R, T).
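To make the objective concrete, here is a minimal sketch of one fine-tuning step in PyTorch. The modules `encoder`, `unet`, and `embed_condition`, and the noise schedule `alphas_cumprod`, are hypothetical stand-ins for the latent diffusion components described above, not the released Zero-1-to-3 code.

```python
import torch
import torch.nn.functional as F

def finetune_step(encoder, unet, embed_condition, alphas_cumprod, x_in, x_target, R, T):
    """One view-conditioned fine-tuning step (minimal sketch, hypothetical modules).

    encoder / unet / embed_condition stand in for the latent diffusion encoder,
    the denoising U-Net eps_theta, and the embedding c(x, R, T) of the input view
    and relative camera extrinsics; alphas_cumprod is the cumulative noise
    schedule (a length-1000 tensor).
    """
    z = encoder(x_target)                                   # latent of the target view x_{R,T}
    b = z.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps     # forward diffusion to step t
    c = embed_condition(x_in, R, T)                         # condition on input view + (R, T)
    eps_pred = unet(z_t, t, c)                              # predict the added noise
    return F.mse_loss(eps_pred, eps)                        # ||eps - eps_theta(z_t, t, c)||^2
```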

The intuition is that fine-tuning teaches a generic viewpoint-control mechanism that extrapolates beyond the objects seen in the fine-tuning dataset; the resulting model can therefore synthesize new views for object classes that lack 3D assets and never appear in the fine-tuning set.

3.2 View-Conditioned Diffusion

3D reconstruction from a single image requires both low-level perception (depth, shading, texture, etc.) and high-level understanding (category, function, structure, etc.). We therefore adopt a hybrid conditioning mechanism with two streams:

In the first stream, the CLIP embedding of the input image is concatenated with (R, T) to form a "posed CLIP" embedding c(x, R, T). It conditions the denoising U-Net via cross-attention and provides high-level semantic information about the input image.

In the second stream, the input image is channel-concatenated with the image being denoised, which helps the model preserve the identity and details of the object being synthesized.
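As a rough illustration of the two streams, the following sketch builds the posed CLIP embedding and the channel-concatenated U-Net input. The module names (`clip_image_encoder`, `encoder`) and the flattened (R, T) pose representation are assumptions for illustration, not Zero-1-to-3's actual interfaces.

```python
import torch

def build_conditioning(clip_image_encoder, encoder, x_in, z_noisy, R, T):
    """Hybrid conditioning sketch (hypothetical modules and shapes).

    Stream 1: CLIP image embedding of the input view concatenated with the
              relative pose -> "posed CLIP" embedding, used as the
              cross-attention context of the denoising U-Net.
    Stream 2: latent of the input view channel-concatenated with the noisy
              latent being denoised -> U-Net input.
    """
    clip_emb = clip_image_encoder(x_in)                 # (B, D) image embedding
    pose = torch.cat([R.flatten(1), T], dim=1)          # (B, 12) flattened (R, T)
    posed_clip = torch.cat([clip_emb, pose], dim=1)     # "posed CLIP" embedding c(x, R, T)

    z_in = encoder(x_in)                                # latent of the input view
    unet_input = torch.cat([z_noisy, z_in], dim=1)      # channel concatenation
    return unet_input, posed_clip
```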

To enable classifier-free guidance [19], we follow a mechanism similar to the one proposed in [3: InstructPix2Pix]: during training, the input image and the posed CLIP embedding are randomly set to a zero vector, and during inference the conditional information is scaled.
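A minimal sketch of the corresponding classifier-free guidance step at inference time, assuming the unconditional branch zeroes both the concatenated input-view latent and the posed CLIP embedding as described above; the guidance scale is illustrative.

```python
import torch

def cfg_denoise(unet, z_noisy, t, z_in, posed_clip, guidance_scale=3.0):
    """Classifier-free guidance (sketch). The unconditional branch replaces both
    conditioning signals with zeros, matching the training-time dropout above."""
    cond_in = torch.cat([z_noisy, z_in], dim=1)
    uncond_in = torch.cat([z_noisy, torch.zeros_like(z_in)], dim=1)

    eps_cond = unet(cond_in, t, posed_clip)
    eps_uncond = unet(uncond_in, t, torch.zeros_like(posed_clip))

    # Extrapolate away from the unconditional prediction toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```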

3.3 3D reconstruction

In many applications, synthesizing a novel view of an object is not enough; a full 3D reconstruction that captures both the appearance and the geometry of the object is desirable. We adopt the recently open-sourced Score Jacobian Chaining (SJC) framework [53] to optimize a 3D representation with the prior of a text-to-image diffusion model. However, due to the probabilistic nature of diffusion models, the gradient updates are highly stochastic. A key technique used in SJC, inspired by DreamFusion [38], is to set the classifier-free guidance value significantly higher than usual. This reduces the diversity of each sample but improves the fidelity of the reconstruction.

As shown in Figure 4, and similarly to SJC, we randomly sample viewpoints and perform volume rendering. We then perturb the resulting image with Gaussian noise ε ~ N(0, 1) and denoise it with the U-Net ε_θ, conditioned on the input image x, the posed CLIP embedding c(x, R, T), and the time step t, to approximate the noise-free input x_π:

∇L_SJC = ∇_{I_π} log p_{√2 ε}(x_π)
where ∇L_SJC is the PAAS score introduced by [53]. In addition, we optimize the input view with an MSE loss. To further regularize the NeRF representation, we also apply a depth smoothness loss to each sampled viewpoint and a near-view consistency loss to regularize the appearance change between nearby views.
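To illustrate how these terms might be combined, here is a sketch with an explicit depth-smoothness term; the weights and the exact form of each term are illustrative assumptions, and the SJC/PAAS and near-view consistency terms are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def depth_smoothness_loss(depth):
    """Penalize differences between neighboring values of a rendered depth map
    of shape (B, H, W) -- one common form of a depth smoothness regularizer."""
    dx = (depth[:, :, 1:] - depth[:, :, :-1]).abs().mean()
    dy = (depth[:, 1:, :] - depth[:, :-1, :]).abs().mean()
    return dx + dy

def total_loss(loss_sjc, rendered_input, x_input, depth, loss_near_view,
               w_rgb=1.0, w_smooth=0.1, w_consist=0.1):
    """Combine the terms described above (illustrative weights).

    loss_sjc:       PAAS/SJC score term on a randomly sampled viewpoint
    rendered_input: volume rendering at the input-view pose
    loss_near_view: consistency term between nearby sampled viewpoints
    """
    loss_rgb = F.mse_loss(rendered_input, x_input)       # MSE on the input view
    return (loss_sjc + w_rgb * loss_rgb
            + w_smooth * depth_smoothness_loss(depth)
            + w_consist * loss_near_view)
```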

3.4 Datasets

We fine-tune on the recently released Objaverse [8] dataset, a large-scale open-source dataset containing 800K+ 3D models created by 100K+ artists. Although it does not have explicit class labels like ShapeNet [4], it contains a large number of high-quality 3D models with rich geometry, many with fine-grained details and material properties. For each object in the dataset, we randomly sample 12 camera extrinsic matrices pointing at the center of the object and render 12 views with a ray-tracing engine. At training time, two views are sampled for each object to form an image pair (x, x_{R,T}); the corresponding relative viewpoint transformation (R, T), which maps between the two viewpoints, can easily be derived from the two extrinsic matrices.
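For illustration, given the two rendered views' world-to-camera extrinsic matrices, the relative transformation (R, T) can be derived as below; this is a generic NumPy sketch under that convention, not the paper's rendering script.

```python
import numpy as np

def relative_pose(ext1, ext2):
    """Relative camera transformation between two views (sketch).

    ext1, ext2: 4x4 world-to-camera extrinsic matrices of the two rendered views.
    Returns (R, T) such that a point X_c1 in camera-1 coordinates maps to
    camera-2 coordinates via X_c2 = R @ X_c1 + T.
    """
    rel = ext2 @ np.linalg.inv(ext1)    # camera 1 -> world -> camera 2
    return rel[:3, :3], rel[:3, 3]
```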


4. One-2-3-45

Given an image, we first use the view-conditioned 2D diffusion model Zero123 to generate the corresponding multi-view images, and then lift them to 3D space. Because the multi-view predictions are too inconsistent for traditional reconstruction methods, we build a 3D reconstruction module on top of an SDF-based generalizable neural surface reconstruction method and propose several key training strategies that enable 360-degree mesh reconstruction, producing better geometry and more 3D-consistent results. In addition, the method can be combined with off-the-shelf text-to-image diffusion models to support the text-to-3D task.

This work is a general solution for turning an image of any kind into a high-quality 3D textured mesh. Compared with 3D data, 2D images are far more readily available and scalable. Recent 2D generative models (such as DALL-E, Imagen, and Stable Diffusion) and vision-language models (such as CLIP) have made remarkable progress by pre-training on internet-scale image datasets. An emerging line of work, including DreamFields, DreamFusion, and Magic3D, typically uses differentiable rendering and guidance from CLIP or 2D diffusion models to perform per-shape optimization, with neural fields being the most commonly used representation. The drawbacks of this paradigm are: 1) it is time-consuming; 2) it is memory-intensive; 3) the results are often 3D-inconsistent; 4) the textures are often poor.

4.1 Zero123: View-Conditioned 2D Diffusion

Recent work [23: LoRA, 91: Adding Conditional Control to Text-to-Image Diffusion Models] shows that fine-tuning a pre-trained 2D diffusion model allows us to add various conditional controls and generate images according to specific conditions. Conditions such as Canny edges, user scribbles, depth maps, and normal maps have been shown to be effective [91].

Zero123 adds viewpoint-conditioned control to 2D diffusion: it fine-tunes on image pairs and their relative camera transformations, synthesized from a large-scale 3D dataset [11]. It assumes the object is centered at the origin of the coordinate system and uses a spherical camera model, i.e., the camera is placed on the surface of a sphere and always looks at the origin. For two camera poses (θ_1, ϕ_1, r_1) and (θ_2, ϕ_2, r_2), where θ, ϕ, and r denote the polar angle, azimuth, and radius respectively, the model is trained so that f(x_1, θ_2 − θ_1, ϕ_2 − ϕ_1, r_2 − r_1) is perceptually similar to x_2, where x_1 and x_2 are two images of an object captured from different views.
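To make the spherical camera model concrete, the sketch below converts (θ, ϕ, r) into a camera position on the sphere and a world-to-camera look-at rotation toward the origin. The axis and sign conventions are assumptions for illustration and may differ from Zero123's renderer.

```python
import numpy as np

def spherical_camera(theta, phi, r):
    """Camera on a sphere of radius r, looking at the origin (sketch).

    theta: polar angle from the +z axis (radians), phi: azimuth (radians).
    Returns (R, t), a world-to-camera rotation and translation, using an
    illustrative convention (camera looks along its -z axis, world z is up).
    Assumes theta is not exactly 0 or pi.
    """
    cam = r * np.array([np.sin(theta) * np.cos(phi),
                        np.sin(theta) * np.sin(phi),
                        np.cos(theta)])                  # camera center on the sphere
    forward = -cam / np.linalg.norm(cam)                 # points toward the origin
    up = np.array([0.0, 0.0, 1.0])
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward])             # rows: camera x, y, z axes in world
    t = -R @ cam                                          # world-to-camera translation
    return R, t
```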

4.2 NeRF Optimization: Lifting Multi-View Predictions to 3D

Given an image, we first use Zero123 to generate 32 multi-view images (with camera poses uniformly sampled on the sphere); we then optimize a density field with a NeRF-based method (TensoRF [48]) and an SDF field with an SDF-based method (NeuS [74]). The results are shown in Figure 3: because Zero123's predictions are inconsistent, the reconstructions exhibit severe distortions and floaters. In Figure 4, we compare Zero123's predictions against the ground truth; the overall PSNR is not high, especially when the relative pose from the input is large or the target pose is at an unusual position (e.g., viewed from the bottom or the top).
[Figure 3: naively optimizing NeRF (TensoRF) or SDF (NeuS) fields on Zero123's predictions yields distortions and floaters.]

[Figure 4: PSNR, mask IoU, and CLIP similarity between Zero123's predictions and ground-truth renderings.]

However, the mask IoU (greater than 0.95 in most cases) and the CLIP similarity are relatively high. This shows that Zero123 tends to produce predictions that are perceptually similar to the ground truth and share similar contours and boundaries, but whose pixel-level appearance may not match exactly. Such inconsistency is already fatal for traditional optimization-based methods.

4.3 Neural Surface Reconstruction from Imperfect Multi-View Predictions

Instead of using an optimization-based method, our reconstruction module is built on a generalizable SDF reconstruction method, SparseNeuS (a variant of MVSNeRF), which combines multi-view stereo, neural scene representations, and volume rendering. As shown in Figure 2, the reconstruction module takes multiple posed source images as input and generates a textured mesh in a single feed-forward pass.

[Figure 2: the reconstruction module takes posed multi-view images as input and produces a textured mesh in a single feed-forward pass.]

As shown in Figure 2, the reconstruction module takes m posed source images as input and first extracts m 2D feature maps with a 2D feature network. Next, it builds a 3D cost volume, whose entries are computed by projecting each 3D voxel onto the m 2D feature planes and taking the variance of the features at the m projected 2D locations. The cost volume is then processed by a sparse 3D CNN to obtain a geometry volume that encodes the underlying geometry of the input shape. To predict the SDF at an arbitrary 3D point, an MLP takes as input the 3D coordinate and its feature interpolated from the geometry encoding volume. To predict the color of a 3D point, a second MLP takes as input the 2D features at the projected locations, the feature interpolated from the geometry volume, and the viewing direction of the query ray relative to the viewing directions of the source images; it predicts a blending weight for each source view, and the color of the 3D point is computed as the weighted sum of its projected colors. Finally, an SDF-based rendering technique [74: NeuS] is applied on top of the two MLPs for RGB and depth rendering.
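The sketch below illustrates the variance-based cost volume computation over m source views, with assumed tensor shapes and a pinhole projection; it shows the idea rather than SparseNeuS's actual implementation.

```python
import torch
import torch.nn.functional as F

def build_cost_volume(feat_maps, voxel_xyz, Ks, Rs, ts):
    """Variance cost volume over m source views (sketch, assumed conventions).

    feat_maps: (m, C, H, W) 2D feature maps from the feature network
    voxel_xyz: (N, 3) 3D voxel centers
    Ks, Rs, ts: (m, 3, 3) intrinsics, (m, 3, 3) rotations, (m, 3) translations
                (world-to-camera). Returns (N, C): per-voxel feature variance.
    """
    m, C, H, W = feat_maps.shape
    samples = []
    for i in range(m):
        cam = (Rs[i] @ voxel_xyz.T).T + ts[i]                 # (N, 3) camera coordinates
        uv = (Ks[i] @ cam.T).T                                # pinhole projection
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)           # (N, 2) pixel coordinates
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,       # normalize to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
        feat = F.grid_sample(feat_maps[i:i + 1], grid, align_corners=True)  # (1, C, 1, N)
        samples.append(feat.view(C, -1).T)                    # (N, C) sampled features
    samples = torch.stack(samples)                            # (m, N, C)
    return samples.var(dim=0)                                 # variance across the m views
```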

* Two-stage source view selection and ground-truth/prediction hybrid training

SparseNeuS only demonstrates front-facing view reconstruction; we instead reconstruct a 360° mesh in a single feed-forward pass by selecting source views in a specific way and adding depth supervision during training. Specifically, the reconstruction module is trained on a 3D object dataset while Zero123 is kept frozen. We follow Zero123 to normalize the training shapes and use a spherical camera model. For each shape, we first render n ground-truth RGB and depth images from n camera poses placed uniformly on the sphere. For each of the n views, we then use Zero123 to predict four nearby views.

During training, we feed all 4×n predictions, together with their ground-truth poses, into the reconstruction module, and randomly pick one of the n ground-truth RGB views as the target view. We call this view selection strategy two-stage source view selection, and we supervise training with the ground-truth RGB and depth values. In this way, the module learns to handle inconsistent predictions from Zero123 and to reconstruct a consistent 360° mesh. We argue that this two-stage strategy is critical: uniformly selecting n×4 source views on the sphere would lead to larger distances between camera poses, whereas cost-volume-based methods [40, 28, 6] typically rely on very close source views to find local correspondences. Furthermore, as shown in Figure 4, Zero123 provides very accurate and consistent predictions when the relative pose is small (e.g., 10 degrees apart), so its predictions can be used to find local correspondences and infer the geometry.
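A rough training-time sketch of the two-stage source view selection; `render_gt` and `predict_nearby` are hypothetical helpers standing in for the ground-truth renderer and a Zero123 wrapper, and the four nearby views per pose follow the description above.

```python
import random

def select_source_views(shape_id, n, render_gt, predict_nearby):
    """Two-stage source view selection during training (sketch).

    Stage 1: n camera poses placed uniformly on the sphere, rendered with
             ground-truth RGB-D (`render_gt`, hypothetical).
    Stage 2: for each of the n poses, four nearby views predicted by Zero123
             (`predict_nearby`, hypothetical), giving 4 x n source views.
    One of the n ground-truth views is picked as the supervised target.
    """
    gt_views = [render_gt(shape_id, k) for k in range(n)]         # n GT RGB-D renderings
    source_views = []
    for view in gt_views:
        source_views.extend(predict_nearby(view, num=4))          # 4 x n predictions
    target = random.choice(gt_views)                              # target view for the loss
    return source_views, target
```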

During training, we use the n ground-truth renderings of the first stage so that the depth loss provides better supervision. At inference time, however, we can replace the n ground-truth renderings with Zero123 predictions, as shown in Figure 2, and no depth input is required. We show in the experiments that this ground-truth/prediction hybrid training strategy is also important. To export the textured mesh, we use marching cubes [41] to extract a mesh from the predicted SDF field and query the colors of the mesh vertices as described in NeuS. Although the reconstruction module is trained on a 3D dataset, we find that it relies mainly on local correspondences and generalizes well to unseen shapes.
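For the export step, here is a minimal sketch using scikit-image's marching cubes on a regularly sampled SDF grid; the `query_sdf` callable, grid resolution, and bounds are placeholders, and vertex colors would then be queried from the color MLP as in NeuS.

```python
import numpy as np
from skimage import measure

def extract_mesh(query_sdf, resolution=128, bound=1.0):
    """Extract a triangle mesh from an SDF field with marching cubes (sketch).

    query_sdf: hypothetical callable mapping (N, 3) points to (N,) SDF values.
    Returns vertices (in world coordinates) and triangle faces.
    """
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)   # (R, R, R, 3)
    sdf = query_sdf(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)        # zero level set
    verts = verts / (resolution - 1) * 2 * bound - bound               # grid -> world coords
    return verts, faces
```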

4.4 Camera Pose Estimation

Our reconstruction module requires the camera poses of the 4×n source-view images. Note that we use Zero123 to synthesize these images, and Zero123 parameterizes cameras in a canonical spherical coordinate system (θ, ϕ, r). While we can arbitrarily shift the azimuth ϕ and radius r of all source-view images simultaneously, which only produces a correspondingly rotated and rescaled reconstruction, this parameterization requires knowing the absolute elevation angle θ of one camera in order to determine the relative poses. Specifically, the relative pose between cameras (θ_0, ϕ_0, r_0) and (θ_0 + ∆θ, ϕ_0 + ∆ϕ, r_0) differs for different θ_0, even when ∆θ and ∆ϕ are the same. Therefore, shifting the elevation angle of all source images by the same amount (e.g., 30 degrees up or down) distorts the reconstructed shape (see Figure 10 for an example).

Therefore, we propose an elevation estimation module to infer the elevation angle of the input image. First, we use Zero123 to predict four nearby views of the input image. We then enumerate all possible elevation angles in a coarse-to-fine manner. For each candidate elevation angle, we compute the camera poses corresponding to the four images and compute the reprojection error for this set of camera poses, which measures the agreement between the images and the camera poses. The elevation angle with the smallest reprojection error is then combined with the pose of the input view and the relative poses to generate the camera poses of all 4×n source views (see the supplementary material for the definition of the reprojection error).
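The coarse-to-fine search could look like the sketch below, where `reprojection_error(elevation_deg, images)` is a hypothetical scorer that builds the four candidate camera poses for a given elevation and measures image/pose agreement; the step sizes are illustrative.

```python
import numpy as np

def estimate_elevation(images, reprojection_error,
                       low=-90.0, high=90.0, coarse_step=10.0, fine_step=1.0):
    """Coarse-to-fine elevation search (sketch).

    images: the four nearby views predicted by Zero123 for the input image.
    reprojection_error: hypothetical callable (elevation_deg, images) -> float.
    """
    # Coarse pass over the full elevation range.
    coarse = np.arange(low, high + 1e-6, coarse_step)
    best = min(coarse, key=lambda e: reprojection_error(e, images))

    # Fine pass around the best coarse candidate.
    fine = np.arange(best - coarse_step, best + coarse_step + 1e-6, fine_step)
    return min(fine, key=lambda e: reprojection_error(e, images))
```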




Origin blog.csdn.net/qq_45752541/article/details/132100570