Foreword

At CVPR 2023, researchers from S-Lab, the Nanyang Technological University-SenseTime Joint Laboratory, proposed a fast, encoder-based 3D GAN inversion method. Existing 3D GAN inversion methods cannot simultaneously achieve fast reconstruction, high reconstruction quality, and high editing quality. To address this, the authors propose a self-supervised 3D GAN inversion training framework, and achieve high-fidelity, editable 3D reconstruction by building a global-local multi-scale architecture and a 2D-3D hybrid alignment model. The method is compatible with SoTA 3D GAN models such as StyleSDF and EG3D, and achieves excellent results on multiple benchmarks.
Research Background
Over the past year or two, methods based on 2D StyleGAN inversion have made significant progress on image semantic editing by projecting real images into the GAN latent space. Recently, a series of studies [6,7] have explored 3D generative models built on the StyleGAN architecture. However, a corresponding general-purpose 3D GAN inversion framework is still missing, which greatly limits reconstruction and editing applications built on 3D GAN models.
Due to the ambiguity of 3D reconstruction and the lack of 2D-3D paired data, the 2D GAN inversion framework cannot be directly applied to 3D GAN inversion. Moreover, because a single latent code has limited representational capacity, existing inversion methods struggle to reconstruct high-quality 3D geometry and texture. How to further support high-quality 3D editing also remains an open problem.
To address these problems, we propose an effective self-supervised method to constrain the learned latent space, and design a global-local multi-scale model to accurately reconstruct geometric and texture details. The method achieves strong performance on both 2D and 3D benchmarks.
Input image:
Reconstruction results:
Editing results (+Smiling):
Stylization results:
Method
We believe that an effective 3D GAN inversion framework should have the following characteristics:
1. Given a single-view image as input, it can reconstruct plausible 3D geometry
2. It preserves high-fidelity texture details
3. It supports 3D-aware semantic editing
Based on these criteria, we propose the E3DGE framework and decompose the task into three sub-problems, addressed in turn.
In the first step, inspired by Sim2Real [1], we treat the pre-trained 3D GAN as a collection of massive 2D-3D data pairs. Since each Gaussian noise z yields a 3D geometry and the corresponding 2D image rendered from a given viewpoint, we can generate training data online for every batch. Because the 3D geometric ground truth for each 2D image is available, we add 3D reconstruction constraints on top of the 2D supervision signal. This lets us learn a 3D-aware latent space and avoids the geometric collapse caused by purely 2D supervision. A minimal sketch of this online data generation is shown below.
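The following is a minimal PyTorch-style sketch of the online 2D-3D paired data generation described above. The generator interface `G.synthesize` and the camera sampling ranges are illustrative assumptions, not the paper's actual API.

```python
import torch

def sample_random_camera(batch_size, device):
    # Illustrative: random yaw/pitch within a narrow range, as is typical
    # for face datasets such as FFHQ. The ranges here are assumptions.
    yaw = (torch.rand(batch_size, device=device) - 0.5) * 0.6
    pitch = (torch.rand(batch_size, device=device) - 0.5) * 0.3
    return torch.stack([yaw, pitch], dim=-1)

def sample_training_batch(G, batch_size, z_dim, device="cuda"):
    """Generate one batch of 2D-3D paired data from a frozen 3D GAN.

    `G.synthesize(z, cam)` is a hypothetical interface returning a rendered
    image and the underlying geometry (e.g. SDF samples) for the same z.
    """
    with torch.no_grad():
        z = torch.randn(batch_size, z_dim, device=device)  # Gaussian latents
        cam = sample_random_camera(batch_size, device)     # random viewpoints
        img, geometry = G.synthesize(z, cam)
    return z, img, geometry, cam
```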
In the second step, prior work [2] shows that the single low-dimensional latent code used in traditional GAN inversion cannot model high-frequency details such as texture, degrading visual quality. Compared with 2D inversion, 3D inversion requires a larger modeling space and places higher demands on representational capacity, so high-fidelity texture modeling becomes even harder. Inspired by a recent few-shot 3D reconstruction method [3], we introduce local latent codes on top of the shared global latent code to improve the model's expressiveness and recover the local details lost in the first-stage reconstruction. The value of each local latent code is determined by the feature at the position where the corresponding 3D coordinate projects onto a 2D residual map.
As shown in the figure below, we compute the residual map of local details lost in the first-stage reconstruction, feed it into a 2D Hourglass [4] network to extract features encoding the missing information, combine these features with positional encodings as supplementary features, and fuse them with the global features. The fused features are expressive enough to accurately reconstruct and render from any viewpoint. A sketch of the per-point local feature lookup follows.
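Below is an illustrative sketch of how per-point local latent codes can be sampled from the 2D residual feature map by projecting 3D query points into the input view. `project_to_image` is a hypothetical helper; the paper's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def query_local_latents(points, cam, residual_feats, project_to_image):
    """Sample per-point local latent codes from a 2D residual feature map.

    points:           (B, N, 3) 3D query coordinates
    cam:              camera parameters of the input view
    residual_feats:   (B, C, H, W) Hourglass features of the residual map
    project_to_image: hypothetical helper mapping 3D points to normalized
                      image coordinates in [-1, 1]
    """
    uv = project_to_image(points, cam)            # (B, N, 2) in [-1, 1]
    grid = uv.unsqueeze(2)                        # (B, N, 1, 2)
    local = F.grid_sample(residual_feats, grid,
                          align_corners=False)    # (B, C, N, 1)
    return local.squeeze(-1).permute(0, 2, 1)     # (B, N, C)
```

The sampled local codes are then fused with the shared global latent code (for example by concatenation) before decoding color and geometry at each point.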
With the above design, our method achieves high-fidelity 2D-3D reconstruction and novel-view synthesis, but it still cannot support editing from arbitrary viewpoints.
Our analysis reveals a trade-off between reconstruction quality at the input view and editing quality at novel views. First, at test time, when the input image is edited or the test view does not match the input view, the residual map obtained in the previous stage no longer matches the target and leads to erroneous output. Second, if we only supervise the model to reconstruct its own input, it is more likely to learn regressive rather than generative features.
To address the first problem, we propose a 2D-3D hybrid alignment scheme to derive aligned features. Specifically, since edited results at novel views no longer match the residual map, we use a 2D alignment module so that the final fused features can produce high-quality novel-view editing results. A generic sketch of such a module is shown below.
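The following is a deliberately generic sketch of a 2D alignment module: it consumes the input-view residual features together with a coarse rendering of the edited or novel view, and predicts view-aligned features. The architecture is an assumption for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class AlignModule2D(nn.Module):
    """Hypothetical 2D alignment module (illustrative only)."""

    def __init__(self, feat_ch, img_ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + img_ch, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, feat_ch, 3, padding=1),
        )

    def forward(self, residual_feats, coarse_target_view):
        # Concatenate input-view features with the coarse target-view
        # rendering and predict features aligned to the target view.
        x = torch.cat([residual_feats, coarse_target_view], dim=1)
        return self.net(x)
```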
To address the second problem and encourage the model to learn generative features, during GAN data generation we render two target images from two randomly sampled viewpoints for the same Gaussian noise z. We then swap the reconstruction targets and train the model to reconstruct the other view. This training strategy not only encourages generative features but also makes training and testing behavior consistent, helping ensure high-quality novel-view synthesis during editing; see the sketch below.
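A minimal sketch of the view-swap strategy, reusing `sample_random_camera` from the earlier sketch; `G.render` and `E.reconstruct` are hypothetical interfaces.

```python
import torch
import torch.nn.functional as F

def view_swap_loss(G, E, z, device="cuda"):
    """Supervise reconstruction of a *different* view than the input.

    The encoder sees view A but is trained to reconstruct view B, which
    discourages regressive (copy-the-input) features and matches the
    test-time setting of novel-view editing.
    """
    cam_a = sample_random_camera(z.shape[0], device)
    cam_b = sample_random_camera(z.shape[0], device)
    with torch.no_grad():
        img_a = G.render(z, cam_a)   # input view
        img_b = G.render(z, cam_b)   # swapped target view
    pred_b = E.reconstruct(img_a, target_cam=cam_b)
    return F.l1_loss(pred_b, img_b)
```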
Training
Since we train on 2D-3D data pairs generated by the pre-trained 3D GAN, we supervise with both 2D and 3D reconstruction loss functions:
For the 3D loss, we find that jointly constraining a point set sampled on the object surface and a point set sampled uniformly in space yields a better 3D constraint.
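An illustrative composition of the overall objective, with hypothetical loss terms and weights (the paper's exact formulation may differ):

```python
import torch.nn.functional as F

def total_loss(pred_img, gt_img,
               pred_sdf_surf, gt_sdf_surf,
               pred_sdf_unif, gt_sdf_unif,
               w_2d=1.0, w_surf=1.0, w_unif=1.0):
    """Combined 2D + 3D reconstruction loss (illustrative sketch).

    The 3D term constrains geometry (e.g. SDF values) both at points on
    the object surface and at points sampled uniformly in space.
    """
    loss_2d = F.l1_loss(pred_img, gt_img)              # 2D photometric term
    loss_surf = F.l1_loss(pred_sdf_surf, gt_sdf_surf)  # surface point set
    loss_unif = F.l1_loss(pred_sdf_unif, gt_sdf_unif)  # uniform samples
    return w_2d * loss_2d + w_surf * loss_surf + w_unif * loss_unif
```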
Experiments
Owing to its good geometric properties and high-fidelity image generation, we choose StyleSDF [7] as the pre-trained base model for GAN inversion in this work.
We train on the FFHQ dataset and evaluate on both 2D and 3D benchmarks. For 2D reconstruction, input-view reconstruction is evaluated on the CelebA-HQ dataset, where our method outperforms the baselines:
Quantitatively, our method achieves the best performance across a variety of metrics, and its inference speed is significantly faster than optimization-based methods:
For 3D reconstruction, we evaluate on the NoW [5] face reconstruction benchmark, which verifies the effectiveness of the 3D supervision in our method. Median and Mean denote statistics of the offset distance between the reconstructed 3D face and the ground-truth mesh surface.
Our method also performs well on stylized 3D GANs:
About the Author
Lan Yushi is a PhD student at S-Lab, Nanyang Technological University. He received his bachelor's degree from Beijing University of Posts and Telecommunications. His research interests include 3D generative models and neural-rendering-based 3D reconstruction and editing.
Links
Paper link
https://arxiv.org/abs/2212.07409
Paper code
https://github.com/NIRVANALAN/E3DGE
Project homepage
https://nirvanalan.github.io/projects/E3DGE/index.html