CVPR 2023 | E3DGE instantly turns a 2D image into 3D, from Nanyang Technological University, SenseTime, et al.


Foreword: At CVPR 2023, researchers from S-Lab, the Nanyang Technological University-SenseTime Joint Laboratory, proposed a fast encoder-based 3D GAN inversion method. Existing 3D GAN inversion methods cannot simultaneously deliver reconstruction speed, reconstruction quality, and editing quality. To address this, the authors propose a self-supervised 3D GAN inversion training framework, and achieve high-fidelity, editable 3D reconstruction by building a global-local multi-scale architecture and a 2D-3D hybrid alignment module. The method is compatible with SoTA 3D GANs including StyleSDF and EG3D, and achieves excellent results on multiple benchmarks.

Source: I Love Computer Vision. Reposted for sharing only; it will be removed upon request in case of infringement.


Research Background


Over the past year or two, methods based on 2D StyleGAN inversion have made significant progress on image semantic editing by projecting real images into the GAN latent space. Recently, a series of studies [6,7] has explored 3D generative models built on StyleGAN-style architectures. However, a corresponding general-purpose 3D GAN inversion framework is still missing, which greatly limits reconstruction and editing applications built on 3D GAN models.

Due to the ambiguity of single-view 3D reconstruction and the lack of paired 2D-3D data, 2D GAN inversion frameworks cannot be applied to 3D GANs directly. Moreover, a single low-dimensional latent code has limited expressiveness, so existing inversion methods struggle to reconstruct high-quality 3D geometry and texture. Finally, how to further support high-quality 3D editing remains an open problem.

To address these problems, we propose an effective self-supervised method to constrain the learned latent space, and design a global-local multi-scale model to accurately reconstruct geometric and texture details. The method achieves strong performance on both 2D and 3D benchmark datasets.

Input image: [figure]

Reconstruction results: [figure]

Editing results (+Smiling): [figure]

Stylization results: [figure]

Method


We believe an effective 3D GAN inversion framework should have the following properties:

1. Given a single-view image as input, it can reconstruct plausible 3D geometry;

2. It preserves high-fidelity texture details;

3. It supports 3D-aware semantic editing.

Based on these criteria, we propose the E3DGE framework and decompose the problem into three sub-problems, addressed in turn below.

In the first step, drawing on Sim2Real [1], we treat the pre-trained 3D GAN as a collection of massive pseudo 2D-3D data pairs. Since every Gaussian noise sample z yields a 3D shape together with a corresponding 2D rendering from some viewpoint, we can generate training data online for each batch. And because the 3D geometric ground truth for each 2D image is available, we add 3D reconstruction constraints on top of the 2D supervision signal. This lets the encoder learn a 3D-aware latent space and avoids the geometric collapse caused by purely 2D supervision.

[Figure: self-supervised training pipeline using pseudo 2D-3D pairs sampled from the pre-trained 3D GAN]
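To make this concrete, here is a minimal sketch of one such training step. The generator interface (`G.synthesize`, `G.render`, `G.query_sdf`) and the `sample_cameras` helper are hypothetical stand-ins for a StyleSDF-like model, not the actual E3DGE API:

```python
import torch
import torch.nn.functional as F

def sample_cameras(n, device):
    """Sample a random azimuth/elevation per image (hypothetical convention)."""
    azim = torch.rand(n, device=device) * 0.6 - 0.3   # roughly +/- 17 degrees
    elev = torch.rand(n, device=device) * 0.3 - 0.15
    return torch.stack([azim, elev], dim=1)

def train_step(G, E, opt, batch_size=4, device="cuda"):
    # The frozen 3D GAN G supplies pseudo ground truth; only encoder E trains.
    z = torch.randn(batch_size, 512, device=device)
    cam = sample_cameras(batch_size, device)

    with torch.no_grad():
        # Hypothetical generator call: returns a rendering, SDF values at
        # sampled 3D points, and the points themselves.
        img_gt, sdf_gt, pts = G.synthesize(z, cam)

    w = E(img_gt)                       # invert the image into latent space
    img_rec = G.render(w, cam)          # re-render from the predicted code
    sdf_rec = G.query_sdf(w, pts)       # query geometry at the same points

    # 2D photometric loss plus an explicit 3D geometry loss; the latter
    # prevents geometric collapse under purely 2D supervision.
    loss = F.l1_loss(img_rec, img_gt) + F.l1_loss(sdf_rec, sdf_gt)

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```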

In the second step: related research [2] shows that the single low-dimensional latent space used in conventional GAN inversion cannot model high-frequency details such as texture, degrading visual quality. Compared with 2D inversion, 3D inversion requires a larger modeling space and stronger representational capacity, so the high-fidelity texture modeling problem becomes even more severe. Inspired by a recent few-shot 3D reconstruction method [3], on top of the shared global latent code we introduce local latent codes to increase the model's expressiveness and recover the local details lost in the first-stage reconstruction. The value of each local code is determined by the feature at the 2D location onto which the queried 3D coordinate projects in a residual map.

As shown in the figure below, we compute the residual map of local details missed by the first-stage reconstruction and feed it into a 2D Hourglass network [4] to extract features encoding the missing information. These features, combined with a positional encoding, serve as supplementary features and are fused with the global features. The fused representation is expressive enough to reconstruct the input accurately and render it from arbitrary viewpoints.

[Figure: global-local feature fusion; an Hourglass network extracts local features from the first-stage residual map]
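The projection-and-lookup step works like the pixel-aligned features of [3]. Below is a sketch under assumed conventions (camera-space query points, pinhole intrinsics in pixel units); the real code's coordinate handling may differ:

```python
import torch
import torch.nn.functional as F

def local_latent(points, residual_feats, K):
    """Pixel-aligned local codes in the spirit of [3]; conventions assumed.
    points:         (B, N, 3) query points in camera space
    residual_feats: (B, C, H, W) Hourglass features of the residual map
    K:              (B, 3, 3) pinhole intrinsics (pixel units)
    """
    # Project each 3D point onto the image plane.
    uvw = torch.einsum("bij,bnj->bni", K, points)
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)   # perspective divide

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    H, W = residual_feats.shape[-2:]
    grid = torch.stack([uv[..., 0] / W, uv[..., 1] / H], dim=-1) * 2 - 1

    # Bilinearly sample one local feature vector per 3D point.
    feats = F.grid_sample(residual_feats, grid.unsqueeze(2),
                          align_corners=False)           # (B, C, N, 1)
    return feats.squeeze(-1).permute(0, 2, 1)            # (B, N, C)
```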

With the above design, our method achieves high-fidelity 2D-3D reconstruction and novel-view synthesis, but it still cannot support editing from arbitrary viewpoints.

Our analysis shows that reconstruction fidelity at the input view and editing quality at novel views trade off against each other. First, at test time, when the input image is edited or the rendering viewpoint differs from the input viewpoint, the residual map obtained in the previous stage no longer matches the target and leads to erroneous output. Second, if the model is only supervised to reconstruct its own input, it is more likely to learn regressive rather than generative features.

To solve the first problem, we propose a 2D-3D hybrid alignment scheme to derive aligned features. Specifically, since edited results rendered from novel viewpoints no longer match the residual map, we use a 2D alignment module so that the final fused features still yield high-quality novel-view editing results.

[Figure: 2D-3D hybrid alignment module]
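The paper's alignment module is more involved; the toy block below only illustrates the general idea described above (conditioning the input-view residual features on the current, possibly edited, rendering). The architecture and channel sizes are our own assumptions:

```python
import torch
import torch.nn as nn

class AlignModule(nn.Module):
    """Toy 2D alignment block illustrating the idea only; the channel sizes
    and architecture are assumptions, not E3DGE's actual module."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + 3, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, residual_feats, coarse_render):
        # Condition the input-view residual features on the current (edited
        # and/or novel-view) RGB rendering to produce aligned features.
        x = torch.cat([residual_feats, coarse_render], dim=1)
        return self.net(x)
```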

To solve the second problem and encourage the model to learn generative features, during GAN data generation we sample two viewpoints for the same Gaussian noise z and render two target images. We then swap the reconstruction targets and train the model to reconstruct each view from the other, as sketched below. This training strategy not only encourages generative features but also makes training and test-time behavior consistent, helping ensure high-quality novel-view synthesis when editing a scene.
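The view-swap supervision can be sketched as follows, reusing the hypothetical `sample_cameras` helper and generator API from the first snippet:

```python
import torch
import torch.nn.functional as F

def swap_view_step(G, E, opt, batch_size=4, device="cuda"):
    # Cross-view supervision: invert view A but reconstruct view B (and vice
    # versa), forcing the encoder to learn generative, view-consistent codes.
    z = torch.randn(batch_size, 512, device=device)
    cam_a = sample_cameras(batch_size, device)   # helper from the first sketch
    cam_b = sample_cameras(batch_size, device)

    with torch.no_grad():
        img_a = G.render_from_z(z, cam_a)        # hypothetical generator API
        img_b = G.render_from_z(z, cam_b)

    # Encode each view, then render at the *other* camera and supervise
    # against that camera's ground-truth rendering.
    loss = F.l1_loss(G.render(E(img_a), cam_b), img_b) \
         + F.l1_loss(G.render(E(img_b), cam_a), img_a)

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```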


Training


Since training uses the 2D-3D data pairs generated by the pre-trained 3D GAN, we employ both 2D and 3D reconstruction losses:

[Equation: combined 2D photometric and 3D geometry reconstruction losses]

For the 3D loss, we found that simultaneously constraining a point set on the object surface and a point set sampled uniformly in space yields a better 3D constraint.

[Figure: 3D supervision applied to surface points and uniformly sampled points]
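A sketch of that two-point-set constraint for an SDF-based generator follows; the sampling scheme, equal weighting, and generator calls are assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def geometry_loss(G, w, z_gt, n_pts=2048, device="cuda"):
    # Two-point-set 3D constraint for an SDF-based generator (sampling scheme
    # and equal weighting are assumptions, not the paper's exact recipe).
    with torch.no_grad():
        surf = G.sample_surface_points(z_gt, n_pts)            # hypothetical
        unif = torch.rand(1, n_pts, 3, device=device) * 2 - 1  # cube [-1, 1]^3
        sdf_surf_gt = G.query_sdf_from_z(z_gt, surf)           # ~0 on surface
        sdf_unif_gt = G.query_sdf_from_z(z_gt, unif)

    # Match predicted SDF values on both point sets.
    loss_surf = F.l1_loss(G.query_sdf(w, surf), sdf_surf_gt)
    loss_unif = F.l1_loss(G.query_sdf(w, unif), sdf_unif_gt)
    return loss_surf + loss_unif
```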


Experiments


Because of its good geometric properties and high-fidelity image generation, we chose StyleSDF [7] as the pre-trained base model for GAN inversion in this work.

We train on the FFHQ dataset and test on both 2D and 3D benchmarks. For 2D reconstruction, we evaluate input-view reconstruction on the CelebA-HQ dataset, where our method outperforms the baselines:

[Figure: qualitative input-view reconstruction comparison on CelebA-HQ]

Quantitatively, our method achieves the best performance across a variety of metrics, and its inference speed is significantly better than that of optimization-based methods:

[Table: quantitative reconstruction metrics and inference speed]

For 3D reconstruction, we evaluate on the NoW [5] face reconstruction benchmark, which verifies the effectiveness of the 3D supervision in our method. Median and Mean denote statistics of the distance between the reconstructed 3D face and the ground-truth mesh surface.

[Table: NoW benchmark results (Median/Mean surface distance)]

Our method also performs well on stylized 3D GANs:

[Figure: inversion and editing results on stylized 3D GANs]

About the Author


Lan Yushi is a PhD student at S-Lab, Nanyang Technological University. He received his bachelor's degree from Beijing University of Posts and Telecommunications. His current research interests are 3D generative models and neural-rendering-based 3D reconstruction and editing.


Links


Paper link

https://arxiv.org/abs/2212.07409

Paper code

https://github.com/NIRVANALAN/E3DGE

Project homepage

https://nirvanalan.github.io/projects/E3DGE/index.html
