[Computer Vision | Face Modeling] PanoHead: 3D full-head synthesis with 360-degree geometric awareness

This series of blog posts consists of deep learning / computer vision paper notes. Please credit the source when reprinting.

Title: PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°

Link: [2303.13071] PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360° (arxiv.org)

Abstract

Synthesis and reconstruction of 3D human heads has recently attracted increasing attention in computer vision and computer graphics. Existing state-of-the-art 3D generative adversarial network (GAN) models for 3D head synthesis are either limited to near-frontal views or struggle to maintain 3D consistency across large view angles. We present PanoHead, the first 3D-aware generative model that synthesizes high-quality, view-consistent full-head images in 360° with diverse appearance and detailed geometry, trained only on unstructured in-the-wild images. At its core, we improve the representation power of recent 3D GANs and bridge the data-alignment gap that arises when training on in-the-wild images with widely distributed viewpoints. Specifically, we propose a novel two-stage self-adaptive image alignment for robust 3D GAN training. We further introduce a tri-grid neural volume representation that effectively resolves the entanglement of frontal-face and back-head features inherent in the widely adopted tri-plane formulation. Our method injects prior knowledge from 2D image segmentation into the adversarial learning of 3D neural scene structures, enabling compositable head synthesis over diverse backgrounds. Thanks to these designs, our method significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even with long curly or afro hairstyles, renderable from arbitrary poses. Furthermore, we show that our system can reconstruct a full 3D head from a single input image for personalized, realistic 3D avatars.

1 Introduction

Realistic portrait image synthesis has been a continuing focus in computer vision and graphics, with a wide range of downstream applications such as digital avatars, telepresence, and immersive gaming. Recent advances in generative adversarial networks (GANs) demonstrate image synthesis quality that is nearly indistinguishable from real photos. However, contemporary generative methods operate only with 2D convolutional networks and do not model the underlying 3D scene, so they cannot strictly enforce 3D consistency when synthesizing an avatar in different poses.

To generate 3D heads with diverse shapes and appearances, traditional methods rely on parametric textured mesh models learned from large-scale 3D scan collections. However, the rendered images lack detail and are limited in perceptual quality and expressiveness. With the advent of differentiable rendering and neural implicit representations, conditional generative models have evolved to produce more realistic 3D-aware face images. However, these methods usually require multi-view images or 3D scan supervision, which are difficult to obtain and have a limited appearance distribution when captured in controlled environments.

Recently, 3D-aware generative models have progressed rapidly, driven by the integration of implicit neural representations for 3D scene modeling with generative adversarial networks (GANs) for image synthesis. Among them, the seminal 3D GAN, EG3D, demonstrates impressive view-consistent image synthesis quality, trained only on in-the-wild single-view image collections. However, these 3D GANs are still limited to synthesizing near-frontal views.

This paper proposes PanoHead, a novel 3D-aware GAN for high-quality full 3D head synthesis, trained on unstructured in-the-wild images and view-consistent across 360°. Our model synthesizes a consistent 3D head viewable from all angles, which is desirable in many immersive interaction scenarios such as digital avatars and telepresence. To the best of our knowledge, ours is the first 3D GAN capable of complete 360° 3D head synthesis.

Extending 3D GAN frameworks such as EG3D to full 3D head synthesis faces several important technical challenges. First, many 3D GANs cannot separate foreground from background, resulting in 2.5D head geometry. We introduce a foreground-aware tri-discriminator that jointly learns to decompose the foreground head in 3D space by distilling prior knowledge from 2D image segmentation.

Second, although current hybrid 3D scene representations such as the tri-plane are compact and efficient, they suffer from strong projection ambiguity under 360° camera poses, producing a "mirrored face" on the back of the head. To address this, we propose a novel tri-grid volumetric representation that disentangles frontal-face and back-head features while largely retaining the efficiency of the tri-plane representation.

Finally, obtaining well-estimated camera parameters for in-the-wild back-head images is extremely difficult for 3D GAN training. Furthermore, there is an image-alignment gap between these images and frontal images with detectable facial landmarks, which leads to noisy appearance and unappealing head geometry. We therefore propose a novel two-stage alignment scheme that consistently aligns images from arbitrary viewing angles, significantly reducing the learning difficulty for 3D GANs. In particular, we propose a camera self-adaptation module that dynamically adjusts the rendering camera position to accommodate alignment drift in back-head images.

Our framework significantly enhances the ability of 3D GANs to adapt to full-head images in the wild, as shown in Figure 1. The resulting 3D GAN is not only capable of generating high-fidelity 360-degree RGB images and geometric structures, but also outperforms state-of-the-art methods in terms of quantitative metrics. With our model, we demonstrate compelling 3D full-head reconstruction from monocular view images, enabling easily accessible 3D portrait creation.

Figure 1. Our PanoHead enables consistent, photorealistic full-head image synthesis in 360° with high-fidelity geometry, enabling realistic 3D portrait creation from a single-view image.

In summary, our main contributions are as follows:

  • The first 3D GAN framework capable of consistent and high-fidelity full-head image synthesis at 360 degrees with detailed geometry.
  • A novel tri-grid formulation that balances efficiency and expressiveness in representing 3D 360° head scenes.
  • A foreground-aware tri-discriminator that separates 3D foreground head modeling from 2D background synthesis.
  • A novel two-stage image alignment scheme that adaptively accommodates imperfect camera poses and image cropping, enabling training of 3D GANs from in-the-wild images with a wide range of camera pose distributions.

2. Related work

3D head representation and rendering. To represent 3D heads with diverse shapes and appearances, a line of work is devoted to parametric textured mesh representations, such as 3D Morphable Models (3DMMs) for faces [2-4, 33] and the FLAME head model learned from 3D scans [25]. However, these parametric representations do not capture photorealistic appearance or geometry beyond the frontal face or skull. Recently, neural implicit functions [47] have emerged as powerful continuous and differentiable 3D scene representations. Among them, Neural Radiance Fields (NeRF) [1, 28] are widely used for digital head modeling owing to their strength in modeling complex scene details and synthesizing 3D-consistent multi-view images [10, 15, 17, 32, 34, 43]. Rather than optimizing a subject-specific neural radiance field from multi-view images or temporal videos, our approach builds a generative NeRF from unstructured 2D monocular images.

Recently, implicit-explicit hybrid 3D representations have been explored for better efficiency [5, 9, 27]. Among them, the tri-plane formulation proposed in EG3D provides an efficient 3D scene representation with high-quality view-consistent image synthesis. The tri-plane representation scales efficiently with resolution, capturing more detail at the same capacity. Our tri-grid representation lifts the tri-plane into a more expressive space to better embed features for unconditional 3D head synthesis.

Single-view or few-view supervised 3D GANs. Given the remarkable progress of GANs in 2D image generation, much research has attempted to extend them to 3D-aware generation. These GANs aim to learn generalizable 3D representations from 2D image collections. For face synthesis, Szabo et al. [42] first proposed using vertex position maps as the 3D representation to generate textured mesh outputs. Shi et al. [39] proposed a self-supervised framework that converts 2D StyleGANs [21] into 3D generative models, though its generalizability is limited by the underlying 2D StyleGAN. GRAF [37] and pi-GAN [6] were the first to integrate NeRF into 3D GANs, but their performance is constrained by the heavy computational cost of full NeRF forward and backward passes. Many recent studies [5, 8, 11, 13, 29-31, 38, 40, 48, 49] attempt to improve the efficiency and quality of these NeRF-based GANs. Specifically, our work builds on EG3D [5], which introduces a tri-plane representation that leverages a 2D GAN backbone to generate efficient 3D representations and has been shown to outperform other 3D representations [38]. In parallel, another line of research [30, 41, 46, 50] focuses on controllable 3D GANs that can manipulate the generated 3D face or body.

3. Methodology

3.1 PanoHead Overview

To synthesize realistic and view-consistent full-head images in 360°, we build PanoHead on the state-of-the-art 3D-aware GAN EG3D [5] for its efficiency and synthesis quality. Specifically, EG3D uses StyleGAN2 [22] as a backbone to output a tri-plane representation, which encodes a 3D scene with three 2D feature planes. Given a desired camera pose ccam, an MLP decodes the tri-plane features, volume rendering produces a feature image, and a super-resolution module then synthesizes a higher-resolution RGB image I+. Both the low- and high-resolution images are jointly optimized with a dual discriminator D.
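For orientation, here is a structural sketch (PyTorch-style) of the EG3D generator pipeline just described. The submodule interfaces, names, and return values are assumptions for illustration only; the real EG3D/PanoHead implementation is far more involved.

```python
import torch
import torch.nn as nn

class EG3DStyleGenerator(nn.Module):
    """Structural sketch of an EG3D-style 3D-aware generator; submodules are injected."""

    def __init__(self, mapping, backbone, decoder, renderer, super_res):
        super().__init__()
        self.mapping = mapping      # (z, c_cond) -> intermediate latent w
        self.backbone = backbone    # StyleGAN2-style synthesis: w -> feature planes
        self.decoder = decoder      # tiny MLP: interpolated plane features -> (density, feature)
        self.renderer = renderer    # volume-renders the scene under camera pose c_cam
        self.super_res = super_res  # low-res feature image -> high-res RGB image I+

    def forward(self, z, c_cond, c_cam):
        w = self.mapping(z, c_cond)                           # camera-conditioned mapping
        planes = self.backbone(w)                             # tri-plane (or tri-grid) features
        img_lr = self.renderer(planes, self.decoder, c_cam)   # low-resolution rendered image
        img_hr = self.super_res(img_lr, w)                    # super-resolved RGB image I+
        return img_lr, img_hr                                 # both go to the dual discriminator
```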

Although EG3D succeeds in generating frontal faces, we found that adapting it to 360° in-the-wild full-head images is considerably more challenging for the following reasons: 1) foreground-background entanglement hinders large-pose rendering; 2) the strong inductive bias of the tri-plane representation causes a mirrored face to appear on the back of the head; 3) back-head images come with noisy camera labels and inconsistent cropping. To address these issues, we introduce a background generator and a tri-discriminator to decouple foreground and background (Section 3.2), an efficient and more expressive tri-grid representation that remains compatible with the StyleGAN backbone (Section 3.3), and a two-stage image alignment scheme with an adaptive module that dynamically adjusts the rendering camera during training (Section 3.4). The overall pipeline of our model is shown in Figure 2.

Figure 2. Our framework consists of three main components: a foreground-aware generator G, a tri-discriminator D, and a neural renderer R. First, a mapping network maps the latent code z and the conditional camera pose ccon to an intermediate latent code w. The generator G then uses w to produce the features f of the 3D tri-grid representation. Given the rendering camera pose ccam, the neural renderer R produces a low-resolution image and foreground mask, and the super-resolution module synthesizes the image I+. Finally, the foreground-aware tri-discriminator D evaluates the triplet of the upsampled foreground mask Im+, the bilinearly upsampled image I, and the super-resolved image I+ against real images. The data processing flow is shown on the right: real images are cropped with adjusted YOLO bounding boxes, which often differ in scale and position due to the lack of accurate facial landmarks. With the camera self-adaptation scheme, the rendering camera pose ccam self-corrects so that generated images have consistent scale and position.

3.2 Foreground-aware tri-discrimination

One typical challenge for modern 3D-aware GANs such as EG3D [5] is the entanglement of foreground and background in synthesized images. Despite highly detailed geometric reconstruction, training a 3D GAN directly on an in-the-wild RGB image set such as FFHQ [21] yields a 2.5D face, as shown in Figure 3(a). Image supervision from the sides and back of the head helps establish a complete head geometry with a plausible back-head shape, but it does not solve the problem, because the tri-plane representation itself is not designed to represent a separated foreground and background.

To decouple the foreground from the background, we first introduce an additional StyleGAN2 network [22] to generate a 2D background at the same resolution as the raw feature image I^r. During volume rendering, the foreground mask I^m is obtained by:
I^r(r) = \int_{0}^{\infty} w(t)\, f(r(t))\, dt, \quad I^m(r) = \int_{0}^{\infty} w(t)\, dt, \tag{1}

w(t) = \exp\!\left(-\int_{0}^{t} \sigma(r(s))\, ds\right) \sigma(r(t)), \tag{2}

Here, r(t) denotes the ray cast from the rendering camera center. The foreground mask is then used to composite a new low-resolution image I^{gen}:
I^{gen} = (1 - I^m)\, I^{bg} + I^r, \tag{3}
The composited low-resolution image is then fed into the super-resolution module. Note that the computational cost of the background generator is negligible, since its output resolution is much lower than that of the tri-plane generator and the super-resolution module.
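As a rough illustration of Eqs. (1)-(3), the sketch below accumulates the discretized volume-rendering weights w(t) along each ray to produce both the feature image I^r and the foreground mask I^m, then composites with a generated background. The function name, tensor layouts, and the per-ray discretization are assumptions for clarity, not the paper's implementation.

```python
import torch

def render_with_mask(sigma, feat, bg, deltas):
    """
    sigma:  (B, R, S)    densities at S samples along each of R rays
    feat:   (B, R, S, C) decoded features at the samples
    bg:     (B, R, C)    generated 2D background features for the same pixels/rays
    deltas: (B, R, S)    distances between consecutive samples
    Returns the composited image I_gen and the foreground mask I_m (Eqs. 1-3).
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                  # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = trans * alpha                                   # discretized w(t), Eq. (2)
    i_r = (weights.unsqueeze(-1) * feat).sum(dim=-2)          # feature image I^r, Eq. (1)
    i_m = weights.sum(dim=-1, keepdim=True)                   # foreground mask I^m, Eq. (1)
    i_gen = (1.0 - i_m) * bg + i_r                            # composite, Eq. (3)
    return i_gen, i_m
```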

Merely adding a background generator does not fully separate the background from the foreground, because the generator tends to composite foreground content into the background. We therefore propose a novel foreground-aware tri-discriminator that supervises the rendered foreground mask alongside the RGB images. Specifically, the tri-discriminator takes a 7-channel input, consisting of the bilinearly upsampled RGB image I, the super-resolved RGB image I+, and a single-channel upsampled foreground mask Im+. The additional mask channel allows 2D segmentation prior knowledge to be backpropagated into the density distribution of the neural radiance field. Our method reduces the difficulty of learning 3D full-head geometry from unstructured 2D images, achieving realistic geometry that can be composed with various backgrounds (Figure 3(b)) and appearances (Figure 3(c)). We note that, unlike ENARF-GAN [30], which composites foreground and background images with generated masks supervised by a single RGB-image discriminator, our tri-discriminator better ensures view-consistent high-resolution output.
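The 7-channel input to the tri-discriminator can be assembled roughly as below: the low-resolution render and mask are bilinearly upsampled to the super-resolution output size and concatenated along the channel dimension. This is only a sketch of the input assembly; the discriminator architecture itself (a StyleGAN2-style network in the paper) is not shown.

```python
import torch
import torch.nn.functional as F

def tri_discriminator_input(img_lr, mask_lr, img_sr):
    """
    img_lr:  (B, 3, h, w)  low-resolution RGB render I
    mask_lr: (B, 1, h, w)  low-resolution foreground mask I^m
    img_sr:  (B, 3, H, W)  super-resolved RGB image I^+
    Returns a (B, 7, H, W) tensor: [upsampled I, I^+, upsampled I^{m+}].
    """
    size = img_sr.shape[-2:]
    img_up = F.interpolate(img_lr, size=size, mode='bilinear', align_corners=False)
    mask_up = F.interpolate(mask_lr, size=size, mode='bilinear', align_corners=False)
    return torch.cat([img_up, img_sr, mask_up], dim=1)
```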

Figure 3. Geometry and RGB images generated with dual discrimination (a) and with foreground-aware tri-discrimination (b, c). EG3D (a) cannot fully decouple the background. PanoHead's tri-discrimination yields background-free geometry (b) and full-head image synthesis with switchable backgrounds (c).

3.3 Feature disentanglement with the tri-grid

The tri-plane representation proposed in EG3D [5] provides an efficient representation for 3D generation. The neural radiance density and appearance of a volume point are obtained by projecting its 3D coordinates onto three axis-aligned orthogonal planes and using a tiny MLP to decode the sum of the three bilinearly interpolated features. However, when synthesizing a 360° full head, we observe that the expressiveness of the tri-plane is limited and that a mirrored-face problem arises; the problem is even more severe when the camera distribution of the training images is unbalanced. Its root cause is the inductive bias of tri-plane projection, where a single point on a 2D plane must represent the features of different 3D points. For example, a point on the face and a point on the back of the head both project onto the same point of the XY plane P^{XY} (the plane perpendicular to the Z axis), as shown in Figure 4(a). Although in theory the other two planes should provide complementary information to resolve this projection ambiguity, we find this is not the case when there is little visual supervision from behind or when the back-head structure is hard to learn. In such cases, the tri-plane readily borrows features from the front to synthesize the back, which we call the mirrored-face problem (Figure 5(a)).

Figure 4. Comparison of the tri-plane (a) and tri-grid (b) architectures along the Z axis. With the tri-plane, the projections of two different points share the same feature on the plane P^{XY}, introducing representation ambiguity. With the tri-grid, the features of those two points are obtained from different planes via trilinear interpolation, yielding distinct features.

To reduce the inductive bias of the tri-plane, we lift its formulation to a higher dimension by adding an extra depth axis. We call this enriched version the tri-grid. Unlike the tri-plane, each of our tri-grids has shape D × H × W × C, where H and W are the spatial resolutions, C is the number of channels, and D is the depth. For example, to represent spatial features on the XY plane, the tri-grid has D axis-aligned feature planes P_i^{XY}, i = 1, …, D, evenly distributed along the Z axis. We query any 3D point by projecting its coordinates onto each of the tri-grids and retrieving the corresponding feature vector via trilinear interpolation. Consequently, for two points that share the same projected coordinates but differ in depth, the corresponding features are likely interpolated from non-shared planes (Figure 4(b)). Our formulation disentangles the feature representations of the front face and the back head, largely mitigating the mirrored-face problem (Figure 5).

Figure 5. Image synthesis with the tri-plane and with the tri-grid (D = 3). Due to projection ambiguity, the tri-plane representation (a) produces a slightly better frontal image but exhibits a "mirrored face" on the back of the head, whereas our tri-grid representation (b) synthesizes high-quality back-head appearance and geometry.

Similar to the tri-plane in EG3D [5], we synthesize the tri-grid with a StyleGAN2 generator [21] as 3 × D feature planes; that is, we increase the number of output channels of the original EG3D backbone by a factor of D. The tri-plane can thus be viewed as the special case of our tri-grid representation with D = 1. The depth D of the tri-grid is tunable: a larger D provides more representational power but adds computational overhead. Empirically, we find that a small D (e.g., D = 3) is sufficient for feature disentanglement while maintaining the efficiency of the 3D scene representation.
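A minimal sketch of the tri-grid query: each axis-aligned grid is treated as a small 3D volume (depth D along its orthogonal axis), features are fetched with trilinear interpolation via `F.grid_sample`, and the three results are summed as in the tri-plane case. The tensor layouts and axis conventions below are assumptions for illustration and may differ from the authors' code.

```python
import torch
import torch.nn.functional as F

def sample_tri_grid(grids, pts):
    """
    grids: list of 3 tensors, each (B, C, D, H, W) -- the XY, XZ, and YZ tri-grids
    pts:   (B, N, 3) query points with coordinates normalized to [-1, 1]
    Returns (B, N, C) aggregated features (sum over the three grids).
    """
    x, y, z = pts.unbind(-1)
    # Assumed layouts: the "depth" axis of each grid runs along the orthogonal world axis.
    coords = [torch.stack([x, y, z], dim=-1),   # XY grid, depth along Z
              torch.stack([x, z, y], dim=-1),   # XZ grid, depth along Y
              torch.stack([y, z, x], dim=-1)]   # YZ grid, depth along X
    feats = 0.0
    for grid, c in zip(grids, coords):
        # grid_sample on 5D input performs trilinear interpolation; we use a
        # (B, 1, 1, N, 3) sampling-grid layout and flatten the result afterwards.
        g = F.grid_sample(grid, c.reshape(c.shape[0], 1, 1, -1, 3),
                          mode='bilinear', align_corners=False)   # (B, C, 1, 1, N)
        feats = feats + g.reshape(g.shape[0], g.shape[1], -1).permute(0, 2, 1)
    return feats
```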

3.4 Adaptive camera alignment

To adversarially train our 360° full head, we need in-the-wild image examples from a much wider camera distribution than the mostly frontal distributions such as FFHQ [21]. Although our 3D-aware GAN is trained only on widely accessible 2D images, accurately aligning the visual observations across images labeled with well-estimated camera parameters is key to training quality. While there is a well-established practice for cropping and aligning frontal face images based on facial landmarks, preprocessing large-pose images for GAN training has not been studied. Since images taken from the side and back lack robust facial landmark detection, camera estimation and image cropping are no longer straightforward.

To address these challenges, we propose a novel two-stage processing scheme. In the first stage, for images with detectable facial landmarks, we keep the standard processing: faces are scaled to similar sizes and aligned at the head center using the state-of-the-art facial pose estimator 3DDFA [14]. For the remaining large-pose images, we use the head pose estimator WHENet [52] to provide a rough camera pose estimate, and the human detector YOLO [18] to provide a bounding box centered on the detected head. To crop images with consistent head scale and center, we apply both YOLO and 3DDFA to a batch of frontal images, from which YOLO's head center is scaled and translated by a constant offset. This lets us preprocess all head images into a largely consistent alignment labeled with camera parameters, as sketched below.
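The constant scale and offset that map YOLO head boxes to the landmark-based crop convention can be estimated once from a batch of frontal images where both detectors succeed. The sketch below assumes the detected head boxes and the landmark-based crop centers/sizes are already available as NumPy arrays; the detector calls themselves, the function names, and the use of a median as the robust estimator are all assumptions, not the paper's exact recipe.

```python
import numpy as np

def estimate_box_to_crop_transform(yolo_boxes, crop_centers, crop_sizes):
    """
    yolo_boxes:   (N, 4) head boxes (x1, y1, x2, y2) from YOLO on frontal images
    crop_centers: (N, 2) crop centers from the landmark-based (3DDFA) alignment
    crop_sizes:   (N,)   side lengths of those landmark-based crops
    Returns a constant scale and a (2,) offset (in box-size units) to apply everywhere,
    including side/back views where facial landmarks are unavailable.
    """
    box_centers = 0.5 * (yolo_boxes[:, :2] + yolo_boxes[:, 2:])
    box_sizes = (yolo_boxes[:, 2:] - yolo_boxes[:, :2]).mean(axis=1)
    scale = np.median(crop_sizes / box_sizes)
    offset = np.median((crop_centers - box_centers) / box_sizes[:, None], axis=0)
    return scale, offset

def apply_to_box(box, scale, offset):
    """Map a YOLO head box on a profile/back image to a crop center and crop size."""
    center = 0.5 * (box[:2] + box[2:])
    size = (box[2:] - box[:2]).mean()
    return center + offset * size, size * scale
```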

Due to the variety of hairstyles, the alignment of back-head images remains inconsistent, making it significantly harder for our network to infer the complete head geometry and appearance (see Figure 6(a)). We therefore propose an adaptive camera alignment scheme that fine-tunes the transformation of the volume-rendering frustum for each training image. Specifically, our 3D-aware GAN associates each image with a latent code z that embeds the 3D scene information, including geometry and appearance, to be synthesized under the view ccam. Since ccam may be misaligned with the image content of a training image, it is difficult for the 3D GAN to find a plausible complete head geometry. We therefore jointly learn a residual camera transformation, mapping (z, ccam) to Δccam, via adversarial training; the magnitude of Δccam is normalized by its L2 norm. In essence, the network dynamically adapts the image alignment by refining the correspondence between different visual observations. Note that this is only possible because of the nature of 3D-aware GANs, which can synthesize view-consistent images under a variety of cameras. Our two-stage alignment enables 360° view-consistent head synthesis with realistic shape and appearance, learned from highly diverse head images with a wide distribution of camera poses, styles, and structures.
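A hedged sketch of the camera self-adaptation idea: a small MLP predicts a residual camera transformation Δccam from (z, ccam), its magnitude is bounded via L2 normalization, and the refined pose is used for volume rendering. The layer sizes, the 25-dimensional camera parameterization (16 extrinsics + 9 intrinsics, following EG3D's convention), and the norm bound are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class CameraAdapter(nn.Module):
    """Predicts a bounded residual camera transformation Delta c_cam from (z, c_cam)."""

    def __init__(self, z_dim=512, cam_dim=25, max_norm=0.1):
        super().__init__()
        self.max_norm = max_norm
        self.mlp = nn.Sequential(
            nn.Linear(z_dim + cam_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, cam_dim))

    def forward(self, z, c_cam):
        delta = self.mlp(torch.cat([z, c_cam], dim=1))
        # Rescale the residual so its L2 norm never exceeds max_norm,
        # keeping the adapted pose close to the initial estimate.
        norm = delta.norm(dim=1, keepdim=True)
        delta = delta * torch.clamp(self.max_norm / (norm + 1e-8), max=1.0)
        return c_cam + delta
```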

Figure 6. Images synthesized without (a) and with (b) the camera self-adaptation scheme. Without it, the model trains on misaligned back-head images, resulting in artifacts on the back of the head.

4. Experiment

4.1 Datasets and baselines

We train and evaluate our framework on a combination of the pose-balanced FFHQ [21], the K-hairstyle dataset [24], and an in-house collection of large-pose head images. FFHQ contains 70,000 diverse high-resolution face images but mainly covers absolute yaw angles from 0° to 60°, taking the frontal camera pose as 0°. We augment FFHQ with 4,000 back-head images from the K-hairstyle dataset and 15,000 in-house large-pose images of varying styles, with yaw angles ranging from 60° to 180°. For brevity, we call this combined dataset FFHQ-F. For details of the dataset analysis and network training, please refer to the supplementary material.

We compare with state-of-the-art 3D-aware GANs, including GRAF [37], EG3D [5], StyleSDF [31] and GIRAFFEHD [48]. All baselines are retrained from the same FFHQ-F dataset. We measure the quality of the generated multi-view images and geometry both quantitatively and qualitatively.

4.2 Qualitative comparison

360° image synthesis. Figure 7 visually compares our image quality with the baselines. All models are trained on FFHQ-F, and images are synthesized from five different viewpoints with yaw angles from 0° to 180°. GRAF [37] fails to synthesize compelling avatars, and its backgrounds are entangled with the foreground heads. StyleSDF [31] and GIRAFFEHD [48] synthesize realistic frontal faces but degrade in perceptual quality when rendered from larger camera angles. Since these methods do not explicitly rely on camera labels, we suspect they struggle to infer the 3D scene structure directly from images with a 360° camera distribution. EG3D [5] synthesizes high-quality, view-consistent frontal avatars, but once the view rotates to the side or back, mirrored-face artifacts become clearly visible due to tri-plane projection ambiguity and foreground-background entanglement. The method of [43] builds a personalized full-head NeRF at extra cost but requires multi-view supervision; although it produces high-quality images from all views, it is not a generative model. In strong contrast, our model generates superior photorealistic avatars for all camera poses while maintaining multi-view consistency, delivering diverse appearances with detailed realism, from a bald head with glasses to long curly hair. For a more comprehensive view of our multi-view full-head synthesis, please refer to the supplementary video.

Figure 7. Qualitative comparison of GRAF [37], GIRAFFEHD [48], StyleSDF [31], EG3D [5], and multi-view-supervised NeRF [43] (top to bottom on the left) with our PanoHead (right). Except for [43], all models are trained on FFHQ-F. We render results at yaw angles of 0°, 45°, 90°, 135°, and 180°. Lacking supervised camera poses, GRAF, GIRAFFEHD, and StyleSDF fail to model the correct camera distribution in latent space and therefore cannot rotate to the back. EG3D can rotate to the back but suffers from "mirrored face" artifacts and an entangled background. The multi-view-supervised NeRF is comparable to our model, but it requires multi-view data of a single person and is not a generative model.

Geometry generation. Figure 8 compares the visual quality of the underlying 3D geometry, extracted with the Marching Cubes algorithm [26]. While StyleSDF [31] generates a plausible frontal appearance, the complete head geometry is messy and broken. EG3D renders detailed frontal and hair geometry, but either the background is entangled (Figure 3(a)) or the back of the head is hollow (Figure 8). In contrast, our model consistently generates high-fidelity, background-free 3D head geometry, even with varying hairstyles.

Figure 8. PanoHead achieves high-quality complete head geometry, while StyleSDF [31] and EG3D [5] generate noisy or hollow 3D heads.

4.3 Quantitative results

To quantify the visual quality, fidelity, and diversity of generated images, we compute the Frechet Inception Distance (FID) [16] between 50K real images and generated samples. We use an identity similarity score (ID) to measure multi-view consistency, computed as the mean AdaFace [23] cosine similarity between pairs of synthesized face images rendered from different camera angles; note that this metric applies only to images in which a face is detected. We use the mean squared error (MSE) between generated segmentation masks and masks obtained with a DeepLabV3 ResNet101 network [7] to assess segmentation accuracy. Table 1 compares these metrics for all baselines and our method. Our model consistently outperforms the other baselines across all viewpoints. See the supplementary material for metric definitions and implementation details.
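For reference, the identity-similarity (ID) metric can be computed roughly as the mean pairwise cosine similarity between face-recognition embeddings of the same synthesized head rendered under different camera poses. The sketch assumes the embeddings (e.g., from AdaFace) are already extracted, and that views without a detected face have been excluded beforehand, as noted above.

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def id_similarity(embeddings):
    """
    embeddings: (V, E) face-recognition embeddings of one identity rendered from V views
                (views where face detection failed are excluded beforehand).
    Returns the mean pairwise cosine similarity across the V views.
    """
    sims = [F.cosine_similarity(embeddings[i], embeddings[j], dim=0)
            for i, j in combinations(range(embeddings.shape[0]), 2)]
    return torch.stack(sims).mean()
```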

Table 1. Comparison of metrics across all baselines. For segmentation MSE, only GIRAFFEHD and PanoHead decouple background and foreground. For the ID score, GRAF’s low-quality images cause face detection to fail.

To evaluate image quality under different views, we report FID and Inception Score (IS) [36] on images synthesized with back poses only (|yaw| ≥ 90°), front poses only (|yaw| < 90°), and all camera poses. FID measures the similarity and diversity of the real and generated image distributions, while IS focuses more on the quality of the images themselves. Our GAN follows the main architecture of EG3D, where the tri-grid generator is conditioned on the camera pose. We observe that this design biases generated image quality toward the conditioning camera pose: our generator synthesizes poorer back views when conditioned on frontal views, and vice versa. However, when computing FID-all, the conditioning camera is always the same as the rendered view, so the generator can achieve an excellent FID-all score even if the generated heads degrade in unseen views. Consequently, the original FID metrics (FID-all and FID-front) do not fully reflect the overall quality of 360° full-head generation. To mitigate this, we propose FID-back, where we condition on a frontal view but synthesize back images. It yields a higher FID value but better reflects the quality of 360° image synthesis.

We conduct an ablation study of our approach, quantitatively assessing the effectiveness of each individual component (Table 2). As shown in the second column, adding foreground-aware tri-discrimination significantly improves quality in all cases compared to the original EG3D, showing that segmentation prior knowledge largely alleviates the difficulty of learning 3D heads from an in-the-wild image collection. Given the strong supervision from a large number of well-aligned frontal images, frontal-face synthesis quality is comparable across all variants; for the back of the head, however, decoupling foreground and background greatly improves synthesis quality. Switching from the tri-plane to the tri-grid representation further improves image quality. With the combined effect of tri-discrimination, the tri-grid, and the camera self-adaptation scheme, PanoHead achieves the lowest FID-back and the highest IS for back-head generation. As the runtime analysis column shows, our novel components introduce only slight computational overhead while greatly improving image synthesis quality. Note that frontal image quality remains better than back-view quality, owing to the variety of hairstyles and the unstructured back-head appearance, which contribute substantially to the learning difficulty.

Table 2. Ablation study of the different components. "+seg." denotes foreground-aware tri-discrimination; "+self-adapt." denotes the camera self-adaptation scheme. All models are trained on FFHQ-F.

4.4 Single-view GAN inversion

Figure 9 demonstrates how PanoHead's latent space enables single-view full-head reconstruction. To achieve this, we first optimize the latent code z corresponding to the target image using a pixel-wise L2 loss and an image-level LPIPS loss [51]. To further improve reconstruction quality, we use pivotal tuning inversion (PTI) [35] to adjust the generator parameters with the optimized latent code z held fixed. From a single-view target image, PanoHead not only reconstructs a photorealistic image and high-fidelity geometry, it also synthesizes novel views across 360°, including large-angle poses and the back of the head.
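A hedged sketch of this two-step inversion: first optimize the latent code with a pixel L2 plus LPIPS objective, then (as in PTI) fine-tune the generator weights around that fixed code. The generator signature, optimizer settings, and step counts are assumptions; the actual PanoHead inversion pipeline may differ.

```python
import torch
import torch.nn.functional as F
import lpips   # pip install lpips (perceptual similarity of Zhang et al.)

def invert_single_view(G, target, cam, steps=500, device='cuda'):
    """
    G:      generator with an assumed signature G(z, cam) -> image in [-1, 1]
    target: (1, 3, H, W) target image in [-1, 1]
    cam:    estimated camera pose for the target view
    Returns the optimized latent code and the fine-tuned generator.
    """
    percep = lpips.LPIPS(net='vgg').to(device)
    z = torch.randn(1, 512, device=device, requires_grad=True)
    opt_z = torch.optim.Adam([z], lr=1e-2)
    for _ in range(steps):                       # stage 1: optimize the latent code
        img = G(z, cam)
        loss = F.mse_loss(img, target) + percep(img, target).mean()
        opt_z.zero_grad(); loss.backward(); opt_z.step()

    z = z.detach()                               # stage 2: PTI-style generator fine-tuning
    opt_g = torch.optim.Adam(G.parameters(), lr=3e-4)
    for _ in range(steps):
        img = G(z, cam)
        loss = F.mse_loss(img, target) + percep(img, target).mean()
        opt_g.zero_grad(); loss.backward(); opt_g.step()
    return z, G
```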

Figure 9. Single-view reconstruction from different camera angles. The first column shows the target image, the second column shows the projected RGB image and reconstructed 3D shape obtained by GAN inversion, and the last two columns show renderings from arbitrary camera angles.

5. Discussion

Limitations and future work. Although PanoHead exhibits excellent image and shape quality across 360° views, minor imperfections remain, for example around the teeth. As with the original EG3D, flickering texture artifacts can be noticed in our model; switching to a StyleGAN3 [20] backbone may help preserve high-frequency details. In practice, we also observe that flickering texture artifacts become more pronounced at higher swapping probabilities of the conditioning camera pose. We set this probability to 70% rather than the 50% used in EG3D, having found empirically that this improves 360° rendering quality at a small cost in texture flicker. Another observation is the lack of finer high-frequency geometric details, such as hair tips. In future work we will quantitatively assess geometry quality, for example using depth maps.

Finally, while PanoHead generates images that are diverse in gender, ethnicity, and appearance, training on a combination of only a few datasets still leaves it subject to data bias. Despite our data collection efforts, a large-scale annotated full-head training image dataset remains one of the most critical needs for advancing full-head synthesis research, and we expect such a dataset to address some of the aforementioned limitations.

Ethical considerations. PanoHead is not specifically designed for any malicious use, but we are aware that single-view portrait reconstructions can be manipulated, which could pose a threat to society. We discourage use of this method in any manner that violates the rights of others.

6 Conclusion

We propose PanoHead, the first 3D GAN framework capable of synthesizing view-consistent full-head images using only single-view images. Through our novel designs for foreground-aware tri-discrimination, the tri-grid 3D scene representation, and adaptive image alignment, PanoHead achieves truly multi-view-consistent full-head image synthesis in 360° and shows compelling qualitative and quantitative results compared with state-of-the-art 3D GANs. Furthermore, we demonstrate 360° photorealistic reconstruction with highly detailed geometry from real single-view portraits. We believe the proposed method opens an interesting direction for 3D portrait creation, with implications for many potential downstream tasks.

References

(……)


Source: blog.csdn.net/I_am_Tony_Stark/article/details/133377614