[3D Generation] Sparse Reconstruction and Image-to-3D Methods (Summary)


A summary of 3D generation algorithms from the past five years; continuously updated.




1. LRM: Large Reconstruction Model for Single-Image-to-3D (2023)

Title : LRM: LARGE RECONSTRUCTION MODEL FOR SINGLE IMAGE TO 3D
Paper : https://arxiv.org/pdf/2311.04400.pdf
Source : Adobe Research / Australian National University
Project : https://yiconghong.me/LRM/

Summary

LRM is a large-scale reconstruction model that predicts a 3D model of an object from a single input image in about 5 seconds. In contrast to earlier category-specific methods trained on small datasets such as ShapeNet, LRM adopts a highly scalable Transformer-based architecture (with DINO as the image encoder) with 500 million learnable parameters. It learns a 3D representation of an object from a single image in a data-driven manner (trained on approximately 1 million 3D shapes across categories plus video data), directly regressing a NeRF in the form of a triplane representation from the input image.
  Specifically, LRM learns an image-to-triplane Transformer decoder: it projects the 2D image features onto the triplane through cross-attention and models the spatial relationships among the triplane tokens through self-attention.

1 Introduction

  Three ingredients behind the great success of NLP models such as GPT:

(1) Highly scalable and efficient neural networks such as the Transformer.
(2) Huge datasets for learning universal priors.
(3) Self-supervised training objectives that encourage the model to discover the underlying data structure while remaining highly scalable.

  GINA-3D applies a ViT encoder and cross-attention (instead of LRM's Transformer decoder) to convert an image into a triplane representation, but its model and training scale are small and it focuses on category-specific generation. The recent data-driven method MCC uses CO3D-v2 data to train a generalizable Transformer-based decoder that predicts occupancy and color from the input image and its unprojected point cloud. While MCC can handle real and generated images and scenes, its results are often over-smoothed and lose detail.

2. Method

  The LRM architecture is shown in Figure 1: it contains an image encoder (pre-trained DINO) that encodes the input image into patch-wise feature tokens, followed by an image-to-triplane decoder that projects the image features onto the triplane tokens through cross-attention. The output triplane tokens are upsampled and reshaped into the final triplane representation, which is used to query 3D point features. Finally, the 3D point features are passed to a multi-layer perceptron to predict RGB and density for volume rendering.

Insert image description here

  1. Encoder

  The image patch features produced by the pre-trained DINO encoder are denoted {h_i}_{i=1}^{n}, with h_i ∈ R^{d_E}, where i is the image patch index, n is the total number of patches, and d_E is the encoder dimension. DINO is trained by self-distillation and learns interpretable attention over the structure and texture of salient content in images, which LRM can exploit to reconstruct the geometry and color of 3D space.

  2. Decoder

  Transformer decoder: conditioned on the image and camera features, it transforms learnable spatial positional embeddings into a triplane representation. The decoder can be thought of as a prior network trained on large-scale data that provides the geometric and appearance information needed to compensate for the ambiguity of single-image reconstruction.

  Camera features. We flatten the 4×4 camera extrinsics E (the camera-to-world transformation) and concatenate it with the camera focal lengths foc and principal point pp to construct the camera feature c = [E_{1×16}, foc_x, foc_y, pp_x, pp_y], c ∈ R^20. In addition, we normalize the camera extrinsics E by a similarity transformation so that all input cameras are aligned to the same axis (the viewing direction aligned with the z-axis). Note that LRM does not rely on annotated object poses; ground-truth poses are only used during training. Normalizing the camera parameters greatly reduces the optimization space of the triplane features and facilitates convergence. To embed the camera features, we further apply a multilayer perceptron (MLP) that maps them to a high-dimensional embedding c̃. The camera intrinsics (focal length and principal point) are normalized by the image height and width before being fed to the MLP.
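To make this concrete, here is a minimal sketch of assembling the 20-dimensional camera feature and its MLP embedding; the activation, layer count, and the embedding width `d_cam` are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class CameraEmbedder(nn.Module):
    """Map c = [flatten(E), foc_x, foc_y, pp_x, pp_y] in R^20 to a high-dimensional embedding."""
    def __init__(self, d_cam: int = 1024):           # d_cam is an assumed width
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(20, d_cam), nn.SiLU(), nn.Linear(d_cam, d_cam))

    def forward(self, extrinsics, fx, fy, px, py, img_h, img_w):
        # extrinsics: (B, 4, 4) normalized camera-to-world matrices
        # intrinsics are normalized by image height/width before the MLP
        E = extrinsics.flatten(1)                                      # (B, 16)
        intr = torch.stack([fx / img_w, fy / img_h,
                            px / img_w, py / img_h], dim=-1)           # (B, 4)
        c = torch.cat([E, intr], dim=-1)                               # (B, 20)
        return self.mlp(c)                                             # (B, d_cam), i.e. c_tilde

# usage
emb = CameraEmbedder()
c_tilde = emb(torch.eye(4).unsqueeze(0),
              torch.tensor([500.]), torch.tensor([500.]),
              torch.tensor([256.]), torch.tensor([256.]), 512, 512)
```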

  Triplane features. A triplane T contains three axis-aligned feature planes T_XY, T_YZ and T_XZ. For a 3D point inside the NeRF object bounding box [-1,1]^3, the point is projected onto each plane, the corresponding point features are queried from (T_XY, T_YZ, T_XZ) via bilinear interpolation, and MLP_NeRF then decodes them into color and density.
  In the forward pass, conditioned on the camera features c̃ and the image features {h_i}, each layer of the image-to-triplane Transformer decoder updates the initial positional embeddings f_init toward the final triplane features. Two different conditioning operations are applied because the camera controls the orientation and distortion of the overall shape, while the image features carry the fine-grained geometric and color information that needs to be embedded into the triplane.
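A minimal sketch of the triplane point-feature query and the MLP_NeRF decoding described above; the channel counts, the concatenation-based aggregation of the three plane features, and the MLP sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_triplane(planes, xyz):
    """planes: dict of 'xy'/'xz'/'yz' feature maps, each (1, C, R, R);
    xyz: (N, 3) points in [-1, 1]^3. Returns (N, 3*C) features (concatenation assumed)."""
    feats = []
    for name, idx in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        coords = xyz[:, idx].view(1, -1, 1, 2)                 # (1, N, 1, 2); axis convention illustrative
        sampled = F.grid_sample(planes[name], coords,
                                mode="bilinear", align_corners=False)   # (1, C, N, 1)
        feats.append(sampled[0, :, :, 0].t())                  # (N, C)
    return torch.cat(feats, dim=-1)

class MLPNeRF(nn.Module):
    """Small MLP mapping point features to RGB + density; depth and width are assumptions."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))          # 3 color channels + 1 density

    def forward(self, f):
        out = self.net(f)
        return torch.sigmoid(out[:, :3]), F.relu(out[:, 3:])   # rgb, sigma

C, R = 32, 64
planes = {k: torch.randn(1, C, R, R) for k in ("xy", "xz", "yz")}
pts = torch.rand(1024, 3) * 2 - 1                               # points inside [-1, 1]^3
rgb, sigma = MLPNeRF(3 * C)(query_triplane(planes, pts))
```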

  Modulation by camera features. This modulation is inspired by DiT, which uses adaptive layer norm (adaLN) to modulate image latents by the denoising time step and class label. Let {f_j} be a token sequence in the Transformer; the modulation function conditioned on the camera feature c̃ is defined as:

Insert image description here
  Transformer layers: each layer contains a cross-attention sub-layer, a self-attention sub-layer and an MLP sub-layer, and the input tokens of every sub-layer are modulated by the camera features. Let the feature sequence f^in be the input of a Transformer layer (it can be regarded as the triplane hidden features, since they correspond to the final triplane features T). The process is as follows:

Insert image description here
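Putting the pieces together, here is a minimal sketch of the camera-modulated layer norm (in the DiT adaLN style) and one decoder layer; the ordering of the sub-layers follows the description above, while the widths, head count, and the exact form of the modulation are assumptions.

```python
import torch
import torch.nn as nn

class ModLN(nn.Module):
    """Adaptive LayerNorm: tokens are scaled/shifted by values regressed from the camera embedding."""
    def __init__(self, d_token, d_cam):
        super().__init__()
        self.norm = nn.LayerNorm(d_token, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cam, 2 * d_token)

    def forward(self, tokens, c_tilde):
        # tokens: (B, L, d_token), c_tilde: (B, d_cam)
        gamma, beta = self.to_scale_shift(c_tilde).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

class TriplaneDecoderLayer(nn.Module):
    """Cross-attention to image tokens, self-attention over triplane tokens, then an MLP,
    each sub-layer preceded by camera-feature modulation and wrapped in a residual connection."""
    def __init__(self, d_token=768, d_cam=1024, n_heads=12):
        super().__init__()
        self.mod1, self.mod2, self.mod3 = (ModLN(d_token, d_cam) for _ in range(3))
        self.cross_attn = nn.MultiheadAttention(d_token, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_token, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_token, 4 * d_token), nn.GELU(),
                                 nn.Linear(4 * d_token, d_token))

    def forward(self, f, image_tokens, c_tilde):
        # f: triplane hidden tokens (B, L, d); image_tokens: DINO features {h_i}, projected to width d
        h = self.mod1(f, c_tilde)
        f = f + self.cross_attn(h, image_tokens, image_tokens, need_weights=False)[0]
        h = self.mod2(f, c_tilde)
        f = f + self.self_attn(h, h, h, need_weights=False)[0]
        return f + self.mlp(self.mod3(f, c_tilde))

layer = TriplaneDecoderLayer()
f = layer(torch.randn(2, 3 * 32 * 32, 768),   # triplane tokens (3 planes of 32x32; sizes assumed)
          torch.randn(2, 1024, 768),          # image patch tokens
          torch.randn(2, 1024))               # camera embedding c_tilde
```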
3. Triplanar NeRF

Predict RGB and density σ (representing the 4 dimensions of the MLP NeRF output)   from the queried point features in the triplanar representation T. MLP NeRF contains multiple linear layers, with ReLU activation.

  4. Training objectives

  LRM generates a 3D shape from a single view and uses additional side views to guide reconstruction during training: for each shape, V−1 side views are randomly selected as supervision. The loss is computed between the V rendered views x̂ and the ground-truth views (the input view plus the side views):
Insert image description here
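A plausible form of this objective, consistent with the description above (an image reconstruction loss averaged over the V supervised views); the perceptual LPIPS term and its weight λ are written here as an assumption about the full loss.

```latex
\mathcal{L}_{\text{recon}}
  = \frac{1}{V}\sum_{v=1}^{V}
    \Big( \big\lVert \hat{x}_v - x_v^{GT} \big\rVert_2^2
        + \lambda\, \mathcal{L}_{\text{LPIPS}}\big(\hat{x}_v,\, x_v^{GT}\big) \Big)
```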

3. Experiment

  The datasets are Objaverse and MVImgNet, which provide synthetic 3D data and real-world videos respectively, for learning a generalizable cross-shape 3D prior. Each 3D object is normalized to [-1, 1]^3 in world space, and 32 random views are rendered at a resolution of 1024 × 1024, with camera poses sampled at a radius in [1.5, 3.0] and a height in [-0.75, 1.60]. A total of 730,648 3D assets and 220,219 videos were preprocessed for training.


2. SSDNeRF: Single-Stage Diffusion NeRF for 3D Generation and Reconstruction (ICCV 2023)

Title : Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction
Paper : https://arxiv.org/pdf/2304.06714.pdf
Task : Unconditional 3D generation (e.g., generating diverse cars from noise) and single-view 3D reconstruction
Authors : Hansheng Chen, Jiatao Gu, Anpei Chen, et al. (Tongji University, Apple, University of California)
Code : https://github.com/Lakonik/SSDNeRF

  SSDNeRF is a framework that couples an expressive triplane NeRF auto-decoder with a triplane latent diffusion model. Figure 3 provides an overview of the model.

Insert image description here

1 Single-stage diffusion NeRF training

   The training objective can be derived in a way similar to VAEs. Using the NeRF decoder p_ψ({y_j} | x, {r_j}) and the diffusion latent prior p_ϕ(x), the goal is to minimize a variational upper bound on the negative log-likelihood (NLL) of the observed data {y_ij^gt, r_ij^gt}. Ignoring the uncertainty (variance) in the latent code yields a simplified training loss:

Insert image description here
Here the scene codes {x_i}, the prior parameters ϕ and the decoder parameters ψ are jointly optimized in a single training stage. This loss consists of the rendering loss L_rend and a diffusion prior term in NLL form. Following [Maximum likelihood training of score-based diffusion models; Score-based generative modeling in latent space], we replace the diffusion NLL with an approximate upper bound L_diff (also known as score distillation). Adding empirical weighting factors, the final training objective is:

Insert image description here

  In single-stage training, the scene codes {x_i} are constrained by the loss above, which allows the learned prior to complete unseen parts; this is particularly beneficial for training on sparse-view data (where the expressive triplane codes are otherwise severely under-constrained).

Balancing rendering and prior weights

  The rendering-to-prior weight ratio λ_rend/λ_diff is key to single-stage training. To ensure generalization, an empirical weighting mechanism is designed in which the diffusion loss is normalized by an exponential moving average (EMA) of the Frobenius norm of the scene codes: λ_diff := c_diff / EMA(||x_i||_F^2), where c_diff is a fixed scale, and λ_rend := c_rend (1 − e^{−0.1 N_v}) / N_v. The rendering weight is determined by the number of visible views N_v: this N_v-based weighting calibrates the decoder p_ψ and prevents the rendering loss from scaling linearly with the number of rays.
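A minimal sketch of these empirical weights and the combined single-stage objective; the EMA decay and the constants c_rend and c_diff are placeholders, not values from the paper.

```python
import math
import torch

class LossWeights:
    """Empirical weighting lambda_rend / lambda_diff for single-stage SSDNeRF-style training."""
    def __init__(self, c_rend=1.0, c_diff=1.0, ema_decay=0.99):   # constants are placeholders
        self.c_rend, self.c_diff, self.ema_decay = c_rend, c_diff, ema_decay
        self.ema_norm = None                                      # EMA of ||x_i||_F^2

    def update(self, scene_code: torch.Tensor):
        sq_norm = scene_code.detach().pow(2).sum().item()         # squared Frobenius norm
        self.ema_norm = (sq_norm if self.ema_norm is None
                         else self.ema_decay * self.ema_norm + (1 - self.ema_decay) * sq_norm)

    def lambdas(self, num_views: int):
        lam_diff = self.c_diff / max(self.ema_norm, 1e-8)
        lam_rend = self.c_rend * (1 - math.exp(-0.1 * num_views)) / num_views
        return lam_rend, lam_diff

# usage inside a training step (L_rend and L_diff are computed elsewhere)
w = LossWeights()
x_i = torch.randn(3, 6, 128, 128)            # a triplane scene code (shape is illustrative)
w.update(x_i)
lam_rend, lam_diff = w.lambdas(num_views=4)
L_rend, L_diff = torch.tensor(0.5), torch.tensor(1.2)
loss = lam_rend * L_rend + lam_diff * L_diff
```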

Comparison with two-stage generative neural fields

  Previous two-stage methods [GAUDI; DiffRF; 3D neural field generation using triplane diffusion] omit the prior term λ_diff L_diff in the first training stage. This can be seen as setting the rendering-to-prior weight ratio λ_rend/λ_diff to infinity, resulting in biased and noisy scene codes x_i. The paper [3D neural field generation using triplane diffusion] partially alleviates this problem by imposing a total variation (TV) regularization on the triplane scene codes to enforce a smooth prior, similar to how LDM constrains its latent space (middle column of Figure 2). Control3Diff proposes to learn a conditional diffusion model on data generated by a 3D GAN pre-trained on single-view images. In contrast, our single-stage training directly incorporates the diffusion prior to promote end-to-end consistency.


2 Image-guided sampling and fine-tuning

  To achieve generalizable and fast NeRF reconstruction covering single-view to dense multi-view settings, we propose image-guided sampling followed by fine-tuning of the sampled code, taking both the diffusion prior and the rendering likelihood into account. Following the reconstruction-guided sampling of [Video diffusion models], an approximate rendering gradient g with respect to a noisy code x^(t) is computed:
Insert image description here
where the weighting factor is based on the signal-to-noise ratio (SNR) α(t)/σ(t), with the hyperparameter ω set to 0.5 or 0.25. The guidance gradient g is combined with the unconditional score prediction, expressed as a correction to the denoised output x̂:

Insert image description here
The guidance scale is λ_gd. We employ a predictor-corrector sampler [52] to solve for x^(0), alternating a DDIM step with multiple Langevin correction steps.

  We observe that reconstruction guidance cannot strictly enforce the rendering constraints needed for faithful reconstruction. To address this, we reuse the loss in Equation (4) and fine-tune the sampled scene code x while freezing the diffusion and decoder parameters:
Insert image description here
where λ'_diff is the prior weight at test time, which should be lower than the training value λ_diff (because the prior learned on the training set is less reliable when transferred to a different test set). Adam is used to optimize the code x during fine-tuning.
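A minimal sketch of this test-time fine-tuning loop; the loss callables, learning rate, step count, and weights are placeholder assumptions.

```python
import torch

def finetune_scene_code(x, render_loss_fn, diff_loss_fn,
                        lam_rend=1.0, lam_diff_test=0.1, steps=1000, lr=1e-2):
    """Fine-tune a sampled scene code x with the diffusion and decoder parameters frozen.
    render_loss_fn(x): rendering loss against the observed views (weighted by lam_rend);
    diff_loss_fn(x):   diffusion prior loss under the frozen prior (weighted by lam_diff_test)."""
    x = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = lam_rend * render_loss_fn(x) + lam_diff_test * diff_loss_fn(x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()

# dummy losses just to exercise the interface
x0 = torch.randn(1, 18, 128, 128)             # sampled triplane code (shape illustrative)
x_ft = finetune_scene_code(x0, lambda x: x.pow(2).mean(), lambda x: (x - 1).abs().mean(), steps=10)
```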

Comparison with previous NeRF fine-tuning methods
  Although fine-tuning with rendering losses is common in view-conditioned NeRF regression methods [8, 61], our fine-tuning differs in that it applies a diffusion prior loss on the 3D scene code, which significantly improves generalization to novel views, as shown in Section 5.3.

3 Some details

  1. Prior gradient caching

  Triplane NeRF reconstruction requires at least hundreds of optimization iterations for each scene code x_i. In the single-stage loss of Equation (4), the diffusion loss L_diff takes longer to evaluate than the native NeRF rendering loss L_rend, which reduces overall efficiency. To speed up training and fine-tuning, we introduce a technique called prior gradient caching: the cached backpropagated gradient ∇_x(λ_diff L_diff) is reused across multiple Adam steps, while the rendering gradient ∇_x(λ_rend L_rend) is refreshed at every step. This allows the diffusion gradient to be evaluated less often than the rendering gradient. The procedure is sketched below.
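A minimal sketch, assuming the diffusion gradient is recomputed only every K Adam steps and reused in between; K, the learning rate, and the loss callables are placeholder assumptions.

```python
import torch

def train_with_prior_gradient_cache(x, render_loss_fn, diff_loss_fn,
                                    lam_rend, lam_diff, steps=1000, cache_every=4, lr=1e-2):
    """Reuse the cached gradient of lam_diff * L_diff for several Adam steps, while
    refreshing the rendering gradient of lam_rend * L_rend at every step."""
    x = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    cached_diff_grad = torch.zeros_like(x)
    for step in range(steps):
        opt.zero_grad()
        if step % cache_every == 0:                    # refresh the (expensive) diffusion gradient
            diff_loss = lam_diff * diff_loss_fn(x)
            cached_diff_grad = torch.autograd.grad(diff_loss, x)[0].detach()
        rend_loss = lam_rend * render_loss_fn(x)       # cheap rendering gradient, every step
        rend_loss.backward()
        x.grad = x.grad + cached_diff_grad             # combine with the cached prior gradient
        opt.step()
    return x.detach()

# usage with dummy losses
x = torch.randn(1, 18, 128, 128)
x = train_with_prior_gradient_cache(x, lambda x: x.pow(2).mean(), lambda x: (x - 1).abs().mean(),
                                    lam_rend=1.0, lam_diff=0.1, steps=20)
```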

  2. Parameterization and weighting for denoising

  The denoising model x̂_ϕ(x^(t), t) is implemented as a U-Net following DDPM (122M parameters in total). Its input and output are the noisy and denoised triplane features, respectively (the channels of the three planes are stacked together). For the prediction target we adopt the v-parameterization v̂_ϕ(x^(t), t) [43: Progressive distillation for fast sampling of diffusion models], so that x̂ = α(t) x^(t) − σ(t) v̂. Regarding the weighting function w(t): LSGM [54] adopts two different mechanisms to optimize the latents x_i and the diffusion weights ϕ; in contrast, we observe that the SNR-based weighting w(t) derived from α(t)/σ(t), as used in Equation (5), performs well.
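A small sketch of the v-parameterization bookkeeping, assuming a variance-preserving schedule with α(t)² + σ(t)² = 1 (the toy schedule itself is only illustrative).

```python
import torch

def alpha_sigma(t, T=1000):
    """Toy cosine-style schedule with alpha(t)^2 + sigma(t)^2 = 1 (illustrative only)."""
    s = t.float() / T
    return torch.cos(0.5 * torch.pi * s), torch.sin(0.5 * torch.pi * s)

def v_target(x0, eps, alpha, sigma):
    return alpha * eps - sigma * x0            # v = alpha * eps - sigma * x0

def x0_from_v(x_t, v_hat, alpha, sigma):
    return alpha * x_t - sigma * v_hat         # x_hat = alpha(t) * x(t) - sigma(t) * v_hat

x0 = torch.randn(1, 18, 128, 128)              # stacked triplane channels (shape illustrative)
eps = torch.randn_like(x0)
alpha, sigma = alpha_sigma(torch.tensor(250))
x_t = alpha * x0 + sigma * eps
v = v_target(x0, eps, alpha, sigma)
assert torch.allclose(x0_from_v(x_t, v, alpha, sigma), x0, atol=1e-5)
```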



3. ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining (CVPR 2024)

Title : ZeroRF: Fast Sparse View 360◦ Reconstruction with Zero Pretraining
Task : Sparse Reconstruction; Extension: Image to 3D, Text to 3D
Author : Ruoxi Shi* Xinyue Wei* Cheng Wang Hao Su, from UC San Diego
Code : https://github.com/eliphatfs/zerorf


  The ZeroRF pipeline is shown in Figure 3: a deep generator network takes frozen standard Gaussian noise samples as input and generates planes and vectors in the TensoRF-VM manner, forming a factorized tensorial feature volume. The feature volume is then sampled along the rendering rays and decoded by a multilayer perceptron (MLP), following the standard volume rendering procedure with an MSE loss.

  The main idea of ZeroRF is to use an untrained deep generative network as the parameterization of a spatial feature grid. The network can learn patterns at different scales from sparse observations and naturally generalizes to unseen views, without the further upsampling tricks or explicit regularization (which typically requires extensive manual tuning) used in previous sparse-view reconstruction work.
The important design choices are: the spatial factorization (the representation of the feature volume), the structure of the representation generator, and the structure of the feature decoder. A sketch of the core parameterization is given below the figure.

Insert image description here
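A minimal sketch of the core ZeroRF idea: one 2D feature plane parameterized as the output of a small untrained convolutional generator fed with frozen Gaussian noise. The generator here is a generic upsampling CNN rather than the modified Stable Diffusion decoder chosen in the paper, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class NoisePlaneGenerator(nn.Module):
    """An untrained deep generator maps a frozen Gaussian noise tensor to one 2D feature plane.
    Only the generator weights are optimized (with the rendering loss); the noise stays fixed."""
    def __init__(self, noise_ch=8, noise_res=8, hidden_ch=32, out_ch=16):
        super().__init__()
        self.register_buffer("noise", torch.randn(1, noise_ch, noise_res, noise_res))  # frozen input
        blocks, ch = [], noise_ch
        for _ in range(3):                                    # 8 -> 16 -> 32 -> 64 spatial resolution
            blocks += [nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(ch, hidden_ch, 3, padding=1), nn.SiLU()]
            ch = hidden_ch
        blocks.append(nn.Conv2d(ch, out_ch, 3, padding=1))
        self.net = nn.Sequential(*blocks)

    def forward(self):
        return self.net(self.noise)                           # (1, out_ch, 64, 64)

# Three such plane generators (plus 1D vector generators, for TensoRF-VM) would together
# produce the factorized feature volume used in standard volume rendering with an MSE loss.
plane_xy = NoisePlaneGenerator()()
print(plane_xy.shape)                                         # torch.Size([1, 16, 64, 64])
```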

1 Decomposition of the feature volume

  The principle of using a deep generative network for parameterization applies to any grid-based representation. The most straightforward option is to parameterize a full feature volume directly; however, for high rendering quality the feature volume becomes particularly large, memory-hungry, and computationally inefficient. TensoRF uses tensor decomposition to exploit the low-rank nature of feature volumes. The triplane representation used in [17] can be regarded as a special case of TensoRF-VM in which the vectors are constant. DiF decomposes the feature volume into multiple smaller volumes encoding different frequencies. Instant-NGP [39] adopts a multi-resolution hash map, since the information in the features is sparse in nature.

  Among these decompositions, hashing breaks the spatial correlation between adjacent cells, so a deep prior cannot be applied to it; deep generative networks can, however, parameterize TensoRF, triplane, and DiF representations.
We build generator architectures for producing 1D vectors, 2D matrices, and 3D volumes, and experiment with all three decompositions. Because they share similar working principles, all of them outperform previous techniques; TensoRF-VM performs best and is the final choice of factorization.

2 Generator architecture

  The quality of the deep parameterization depends heavily on the architecture. Most generators to date are Conv- and Attention-based, including the deep decoder (DD), the Stable Diffusion (SD) variational autoencoder (VAE), the decoder in Kandinsky, and the ViT-based SimMIM decoder. ZeroRF converts their 2D convolution, pooling and upsampling layers into 1D and 3D counterparts to obtain the corresponding 1D and 3D generators required by the different decompositions.

  These generators are originally quite large, since they are designed to fit very large datasets and produce high-quality content. This leads to unnecessarily long run times and slow convergence when fitting an individual NeRF scene. Fortunately, we find that the converged quality of ZeroRF remains unchanged when the width and depth of the model are reduced. We therefore keep the block composition but shrink the sizes of these architectures to improve training speed. Note that at inference we only need to store the radiance field representation, not the generator, so during rendering ZeroRF has zero overhead compared to its underlying factorization.

  We find that the SD VAE and its decoder part, as well as the Kandinsky decoder, are equally effective for novel view synthesis, followed by the deep decoder, while the SimMIM architecture proves ineffective as a deep prior for radiance fields. The SD/Kandinsky decoders are mostly plain convolutional structures; Kandinsky additionally uses self-attention in its first two blocks. We choose the (modified) SD decoder as the final generator architecture because it is the least computationally intensive.

3 Decoder architecture

  Our decoder architecture follows SSDNeRF: features are read from the feature grid with linear interpolation (bilinear or trilinear) and projected by a first linear layer, producing a base feature code shared between density and appearance decoding. We find that the shared feature code helps reduce floaters by tightly coupling geometry and appearance. A SiLU activation is applied and another linear layer predicts density. For color prediction, the view direction is encoded with spherical harmonics (SH) and added to the base features via a linear projection to introduce view dependence; we then apply SiLU and predict the RGB values with another linear layer, analogous to the density prediction, expressed as follows:

Insert image description here

F_x is the feature field, σ(·) is the sigmoid function, and Θ_• are linear layers. Unlike the decoders used in TensoRF and DiF, this decoder does not consume any positional encoding, which could otherwise leak positional information (bypassing the deep prior) and destroy or degrade ZeroRF's performance.
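A minimal sketch of a decoder in this style; the feature and hidden widths are assumptions, and the spherical-harmonics basis is truncated to degree 1 for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sh_basis_l1(dirs):
    """Real spherical-harmonics basis up to degree 1 for unit view directions (N, 3)."""
    x, y, z = dirs.unbind(-1)
    c0, c1 = 0.28209479, 0.48860251
    return torch.stack([torch.full_like(x, c0), c1 * y, c1 * z, c1 * x], dim=-1)   # (N, 4)

class ZeroRFStyleDecoder(nn.Module):
    """Shared base feature code -> density branch, plus an SH-view-conditioned color branch."""
    def __init__(self, feat_dim=16, hidden=64):
        super().__init__()
        self.base = nn.Linear(feat_dim, hidden)        # shared base feature code
        self.to_sigma = nn.Linear(hidden, 1)           # density head (activation applied later)
        self.sh_proj = nn.Linear(4, hidden)            # inject view dependence
        self.to_rgb = nn.Linear(hidden, 3)             # color head

    def forward(self, feats, view_dirs):
        base = self.base(feats)                                    # (N, hidden)
        sigma = self.to_sigma(F.silu(base))                        # (N, 1)
        h = F.silu(base + self.sh_proj(sh_basis_l1(view_dirs)))    # add view dependence
        rgb = torch.sigmoid(self.to_rgb(h))                        # (N, 3)
        return sigma, rgb

dec = ZeroRFStyleDecoder()
sigma, rgb = dec(torch.randn(1024, 16), F.normalize(torch.randn(1024, 3), dim=-1))
```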


4. DiffRF: Rendering-Guided 3D Radiance Field Diffusion (CVPR 2023)

  DiffRF is the first method to directly generate volumetric radiance fields, operating on an explicit voxel-grid representation with a 3D denoising model. Since a radiance field fitted from a set of posed images may be ambiguous and contain artifacts, ground-truth radiance field samples are hard to obtain. We address this challenge by pairing the denoising formulation with a rendering loss, allowing the model to learn a biased prior that favors good image quality rather than replicating fitting errors (such as floaters). Compared with 2D diffusion models, our model learns a multi-view consistent prior, enabling free-view synthesis and accurate shape generation. Compared with 3D GANs, DiffRF naturally supports conditional generation, such as masked completion or single-view 3D synthesis.

Two recommended papers on incorporating scene semantics:
"Panoptic neural fields: A semantic object-aware neural scene representation" CVF, 2022
"Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation" 3DV, 2022

  Learning prior neural field representations across multiple object categories or across dataset scenes is difficult, even though it would support applications such as single-image 3D object generation [7, 41, 51, 69, 75] and unconstrained scene exploration [16]. Existing methods explore decomposing objects into shape and appearance components, or decomposing the radiance field into several small local conditional radiance fields to improve scene generation quality; however, there is still a significant gap in the photographic fidelity and geometric accuracy of their results.

   Denoising diffusion models are difficult to apply directly to 3D volumetric radiance fields, because the diffusion formulation requires a one-to-one mapping between noise vectors and corresponding ground-truth data samples. Ground-truth radiance fields are hard to obtain, and even an expensive per-sample NeRF optimization can yield incomplete radiance field reconstructions.

  DiffRF is a 3D denoising model operating directly on an explicit voxel-grid representation (Figure 1, left), which produces noisy high-frequency estimates. To handle the ambiguous and incomplete radiance field of each training sample, we propose to bias the noise-prediction formulation of denoising diffusion probabilistic models (DDPMs) toward higher image quality through an additional volume rendering loss on the estimate. This lets our method learn radiance field priors that are less prone to fitting artifacts or noise accumulation during sampling. The learned diffusion prior can be applied in an unconditional setting, where 3D objects are synthesized in a multi-view consistent manner, generating high-accuracy 3D shapes and allowing free-view synthesis.

Insert image description here

  

1. 3D radiance field

  Our approach is a generative model of 3D objects built on recent state-of-the-art diffusion probabilistic models [24]. Radiance fields can be implemented with neural networks [39] or with explicit voxel grids [28, 61]; we choose the latter because it offers good rendering quality together with faster training and inference. The explicit grid can be queried at continuous positions via trilinear interpolation of the voxel vertices.
  In the explicit representation, the radiance field becomes a four-dimensional tensor, where the first three dimensions index a grid spanning the volume and the last dimension indexes the density and color channels.

2. Generating radiance fields

  1. Generation process

The generation (also called denoising) process of the radiance field is governed by a discrete-time Markov chain defined on the state space F of all possible pre-activation radiance fields, each represented as a flat four-dimensional tensor of fixed size. The chain has finitely many time steps {0, ..., T}. The denoising process first samples a state f_T from a standard multivariate normal distribution p(f_T) := N(f_T | 0, I) and then uses the reverse transition probability p_θ(f_{t−1} | f_t) to generate the state f_{t−1}.

Insert image description here

The generation process is iterated until the final state f_0 represents the radiance field of the 3D object generated by our method. The mean of the Gaussian distribution in (3) can be modeled directly with a neural network; the formula can be re-parameterized as:

Insert image description here
where ε_θ(f_t, t) is the noise predicted by the neural network in order to obtain f_{t−1}, while a_t and b_t are predefined coefficients. The covariance Σ_t is also set to a predefined value, although it could be made data dependent.
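A minimal sketch of one reverse step in this parameterization, with a_t, b_t and Σ_t filled in with the standard DDPM choices; the β schedule follows the values quoted later in this section, and the noise predictor is a dummy stand-in.

```python
import torch

T = 1000
betas = torch.linspace(0.0015, 0.05, T)           # linear schedule as described in the text
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def reverse_step(eps_model, f_t, t):
    """One p_theta(f_{t-1} | f_t) step: f_{t-1} = a_t * f_t + b_t * eps_theta(f_t, t) + noise."""
    a_t = 1.0 / alphas[t].sqrt()
    b_t = -betas[t] / (alphas[t].sqrt() * (1.0 - alphas_bar[t]).sqrt())
    mean = a_t * f_t + b_t * eps_model(f_t, t)
    if t == 0:
        return mean
    sigma_t = betas[t].sqrt()                      # a common predefined covariance choice
    return mean + sigma_t * torch.randn_like(f_t)

# usage with a dummy noise predictor on a small radiance-field tensor (4 channels: RGB + density)
eps_model = lambda f, t: torch.zeros_like(f)
f = torch.randn(1, 4, 32, 32, 32)
for t in reversed(range(T)):
    f = reverse_step(eps_model, f, t)
```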

  2. Diffusion process

  The generation process iteratively denoises a completely random radiance field; the diffusion process does the opposite, iteratively corrupting samples from the distribution of 3D objects we want to model (which is what makes training of the generation process possible). The diffusion process is governed by a discrete-time Markov chain with the same state space and time horizon as the generation process, but with a Gaussian transition probability given by

Insert image description here

  Here α_t := 1 − β_t and 0 ≤ β_t ≤ 1 are predefined coefficients that control the injected noise variance. The procedure first draws f_0 from the distribution q(f_0) of radiance fields of the 3D objects we want to model, then iteratively samples f_t given f_{t−1}, producing a scaled and noise-corrupted version of the latter, and stops once f_T is close to completely random. Using the properties of Gaussian distributions, the distribution of f_t conditioned on f_0 can be written directly as a Gaussian:

Insert image description here
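The marginal referred to here takes the standard DDPM form (assuming the usual definition of the cumulative product ᾱ_t):

```latex
q(f_t \mid f_0) = \mathcal{N}\!\left(f_t \;\middle|\; \sqrt{\bar{\alpha}_t}\, f_0,\; (1-\bar{\alpha}_t)\, I\right),
\qquad \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s .
```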

  

3. Training objectives

  The training objective combines two complementary losses: i) a loss L_RF that penalizes generating radiance fields that do not conform to the data distribution, and ii) an RGB loss L_RGB that aims to improve the rendering quality of the generated radiance fields. The final loss for each data point f_0 is a weighted combination, with the step t sampled from a uniform distribution κ(t):

Insert image description here

This weighted combination constitutes the overall training loss.

  1. Radiance field generation loss

  Following DDPM, we derive the training objective from a variational upper bound on the negative log-likelihood (NLL). This bound requires specifying a surrogate distribution, which we take to be the distribution q governing the diffusion process, thereby establishing the fundamental link between the diffusion and generation processes. By Jensen's inequality, the NLL of a data point f_0 ∈ F can be upper-bounded using q as follows:
Insert image description here
The resulting L_RF term is given below, up to a constant independent of θ:

Insert image description here

Insert image description here

  

  2. Radiance field rendering loss

  We supplement the previous loss with an additional RGB loss L_RGB(f_0 | θ) to improve the rendering quality of the radiance field. The Euclidean metric on the representation, implicitly used in the previous loss to evaluate the quality of the generated radiance field, does not necessarily ensure the absence of artifacts at rendering time.

Insert image description here

This is the Euclidean distance loss between the image rendered from the radiance field f at viewpoint v and the ground-truth image I_v.
The core idea is to compare the rendering of a radiance field f_0, sampled from the data distribution, corrupted by t diffusion steps and then fully denoised, against the original GT images I_v that were used to obtain f_0. This would mean first sampling f_t from q(f_t | f_0) and then sampling back to f_0 from p_θ(f_0 | f_t). We resort to a simpler approximation: since by the definition of L_RF^t the loss depends on the noise ε and the prediction ε_θ(f_t, t), an approximate denoised field f̃_0^t(ε, θ) := f_0 + (√(1−ᾱ_t)/√ᾱ_t)·(ε − ε_θ(f_t, t)) can be formed, from which the rendering loss is defined as:
Insert image description here

Here the expectation is taken over viewpoints v and the prior distribution of the noise ε. Since the approximation is only reasonable when the step t is close to zero, we introduce a weight w_t that decays as the step value increases (e.g., w_t := ᾱ_t^2). Evidence in the experimental section shows that, despite being an approximation, the proposed loss helps to significantly improve the results.

  3. Details

   ε_θ is implemented as a 3D U-Net, obtained by replacing the 2D convolution and attention layers of [Diffusion models beat GANs on image synthesis] with 3D operators. Time steps t = 1, ..., T = 1000 are sampled uniformly, and the diffusion variance increases linearly from β_1 = 0.0015 to β_T = 0.05; w_t := ᾱ_t^2.

4. Experiments

  Datasets: PhotoShape Chairs (15,576 chairs rendered from 200 provided views) and the Amazon Berkeley Objects (ABO) Tables dataset (91 renderings per object under 2-3 different settings, provided purely as images); radiance fields are generated from the multi-view renderings at a resolution of 32^3 using a voxel-based method.

  Metrics: FID and KID are used to evaluate image quality. For geometric quality, the Chamfer distance (CD) is used to compute the coverage score (COV) and the minimum matching distance (MMD); MMD evaluates the quality of generated samples (all at a resolution of 128×128).

1. Unconditional generation

Insert image description here
Insert image description here
Tables 1 and 2 show ablation results on the effect of 2D supervision on radiance synthesis: removing the 2D supervision ("DiffRF w/o 2D", third row of the table) has a significant impact on FID. This indicates that biasing the DDPM noise-prediction formulation through the volume rendering loss leads to higher image quality.

  2. Conditional generation

  GANs need to be retrained to be conditioned on a specific task, whereas diffusion models can be conditioned efficiently at test time. We exploit this property for the masked completion task on radiance fields.

Masked completion

  Shape completion and image inpainting are well-studied tasks that aim to fill missing regions of a geometric representation or of an image, respectively. The masked completion task combines both: given a radiance field and a 3D mask, synthesize a reconstruction that is consistent with the unmasked region. Inspired by RePaint [34], we perform conditional completion of an input f^in by progressively guiding the unconditional sampling process with the known region:

Insert image description here
where m is the binary mask applied to the input (light blue in Figure 6) and ⊙ denotes element-wise multiplication on the voxel grid.
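A minimal sketch of one RePaint-style guided step under these definitions: the known region is re-noised from the input field f^in with the forward process, the unknown region comes from the model's reverse step, and the two are blended with the mask m; the schedule and the reverse-step callable are placeholders.

```python
import torch

def masked_completion_step(f_t, f_in, mask, t, alphas_bar, reverse_step_fn):
    """One guided step: f_{t-1} = m * known + (1 - m) * unknown, where the known part is the
    input field re-noised by the forward process and the unknown part is the model's reverse step."""
    level = max(t - 1, 0)                                    # noise level of step t-1
    noise = torch.randn_like(f_in)
    known = alphas_bar[level].sqrt() * f_in + (1 - alphas_bar[level]).sqrt() * noise
    unknown = reverse_step_fn(f_t, t)                        # p_theta(f_{t-1} | f_t)
    return mask * known + (1.0 - mask) * unknown

# usage with toy tensors and a dummy reverse step
T = 1000
alphas_bar = torch.cumprod(1 - torch.linspace(0.0015, 0.05, T), dim=0)
f_in = torch.randn(1, 4, 32, 32, 32)
mask = (torch.rand_like(f_in) > 0.5).float()
f_prev = masked_completion_step(torch.randn_like(f_in), f_in, mask, 500, alphas_bar, lambda f, t: f)
```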

Insert image description here
Single-image reconstruction

Insert image description here

5. EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks (CVPR 2022)

Title : EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks
Paper : https://nvlabs.github.io/eg3d/media/eg3d.pdf
Project : EG3D: Efficient Geometry-aware 3D GANs
Code : GitHub - NVlabs/eg3d
Task : Unsupervised generation of high-quality, multi-view-consistent 3D shapes from single-view 2D images

  Existing 3D GANs suffer from: 1) high computational cost; 2) a lack of 3D consistency.

  Existing 2D GANs cannot explicitly model the underlying 3D scene. Recent 3D GAN work has begun to address: 1) multi-view-consistent image generation; 2) extracting 3D shapes without multi-view images or geometric supervision. However, the image quality and resolution of 3D GANs still lag far behind 2D GANs. Another problem is that current 3D GAN and neural rendering methods are computationally expensive.

  A 3D GAN usually consists of two parts: 1) a 3D structural inductive bias in the generator network; 2) a neural rendering engine that provides view-consistent results. The inductive bias can be modeled as an explicit voxel grid or an implicit neural representation, but due to computational cost neither is well suited to training high-resolution 3D GANs. A common workaround is super-resolution, which in turn sacrifices view consistency and 3D shape quality.

  This paper proposes EG3D, which has the following advantages:

1. A tri-plane 3D GAN framework, an expressive hybrid explicit-implicit neural network architecture that speeds up computation and reduces memory overhead;
2. Decoupled feature generation and neural rendering: 2D GAN backbones such as StyleGAN2 are reused, and pose-based conditioning is introduced into the generator to decouple pose-correlated attributes such as facial expression;
3. A 3D GAN training strategy, dual discrimination, that preserves multi-view consistency;
4. State-of-the-art 3D-aware synthesis on FFHQ and AFHQ Cats.

1. Common neural representations of 3D scenes

  

1. Explicit representations (Figure b), e.g., discrete voxel grids: fast to evaluate, but require a large memory footprint.

2. Implicit representations (Figure a), e.g., coordinate-based neural fields: memory-efficient, but slow to evaluate.

3. Local implicit and hybrid explicit-implicit representations combine the advantages of both.
Inspired by this, the paper designs a hybrid explicit-implicit 3D-aware network (Figure c): a tri-plane representation explicitly stores axis-aligned features, which are implicitly decoded by a feature decoder and rendered via volume rendering.

2. Comparison of generative 3D-aware image synthesis approaches

  1. Mesh-based methods; voxel-based GANs, which have large memory overhead and usually require super-resolution, which in turn leads to view inconsistency; block-based sparse volume representations, which generalize poorly; and fully implicit representation networks, which are slow at test time.
  2. StyleGAN2-based 2.5D GANs: generate images together with depth maps.
  3. 3D GANs such as StyleNeRF and CIPS-3D: the difference from EG3D is that they perform poorly on 3D shapes.

3. Tri-plane hybrid 3D representation

  Three mutually perpendicular, axis-aligned feature planes are established, each of dimension N × N × C (N is the plane resolution, C the feature dimension). For any 3D position, three feature vectors (F_xy, F_xz, F_yz) are retrieved via bilinear interpolation, and the final feature F is obtained by aggregating the three feature vectors.
  A lightweight MLP decoder maps the feature F to color and density, which are then rendered into RGB images via neural volume rendering. The figure below shows that the tri-plane representation has stronger expressive power while requiring less memory and computation.

Insert image description here

4. 3D GAN framework

Insert image description here

  1. Two training methods:

Method 1: random initialization, using a non-saturating GAN loss with R1 regularization, with the training procedure following StyleGAN2;
Method 2: a two-stage training strategy, first training with 64 × 64 neural rendering and then fine-tuning at 128 × 128.

  Experiments show that regularization helps reduce distortion of 3D shapes.

  2. CNN generator backbone and rendering

  Decoder: an MLP whose layers each contain 64 neurons with softplus activations. Its input is the tri-plane feature sampled at continuous coordinates, and its output is a scalar density and a 32-dimensional feature.
  Volume rendering: renders features rather than RGB images, because the features carry more information that can be exploited by the super-resolution module.

  3. Dual discrimination

  The discriminator input has 6 channels. The first three channels of the feature image I_F are interpreted as a low-resolution RGB image I_RGB. Dual discrimination first requires I_RGB to remain consistent with the super-resolved image Î_RGB (via bilinear upsampling); the super-resolved image and the upsampled image are then concatenated and fed to the discriminator. For real images, the real image is concatenated with a blurred copy of itself and fed to the discriminator.
  The camera intrinsics and extrinsics are fed to the discriminator as conditioning labels.
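A minimal sketch of assembling the 6-channel discriminator inputs for the fake and real branches; average pooling is used as a stand-in for the paper's blur filter, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def dual_discrimination_inputs(feat_img, sr_img, real_img):
    """feat_img: raw neural-rendering output (B, 32, 64, 64), first 3 channels = low-res RGB;
    sr_img: super-resolved RGB (B, 3, 256, 256); real_img: dataset RGB (B, 3, 256, 256)."""
    i_rgb = feat_img[:, :3]                                              # I_RGB
    i_rgb_up = F.interpolate(i_rgb, size=sr_img.shape[-2:],
                             mode="bilinear", align_corners=False)       # upsample to SR size
    fake_in = torch.cat([sr_img, i_rgb_up], dim=1)                       # (B, 6, 256, 256)

    # real branch: pair the real image with a blurred copy of itself
    blurred = F.interpolate(F.avg_pool2d(real_img, 4), size=real_img.shape[-2:],
                            mode="bilinear", align_corners=False)        # stand-in blur
    real_in = torch.cat([real_img, blurred], dim=1)                      # (B, 6, 256, 256)
    return fake_in, real_in   # camera intrinsics/extrinsics are passed separately as a condition

fake_in, real_in = dual_discrimination_inputs(torch.randn(2, 32, 64, 64),
                                              torch.rand(2, 3, 256, 256),
                                              torch.rand(2, 3, 256, 256))
```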

  4. Modeling pose-correlated attributes

  Most real-world datasets contain biases. For example, in FFHQ facial expression is correlated with camera pose: when the camera faces a person directly, the person tends to be smiling. This paper proposes generator pose conditioning to decouple pose from other attributes in the training images.
  To make the model robust to the input pose, during training the pose in the camera parameter matrix P is replaced with a random pose with 50% probability.
  Ablation experiments show that conditioning on the pose during training is important; future work may consider removing it.


5. Experimental comparison and application

Datasets: FFHQ (real human faces) and AFHQv2 Cats (real cat faces). Comparison with other methods:
Insert image description here


Insert image description here


Applications:
Insert image description here

  

  


Summary

This post has summarized several recent sparse-reconstruction and image-to-3D generation methods: LRM, SSDNeRF, ZeroRF, DiffRF, and EG3D. It will be updated as new work appears.







  

  








Origin blog.csdn.net/qq_45752541/article/details/135124311