Interpreting the Latent Space of GANs for Semantic Face Editing

Official account: EDPJ

Table of Contents

0. Summary

1. Introduction

1.1 Related Work

2. Framework of InterFaceGAN

2.1 Semantics in Latent Space

2.2 Operations in Latent Space

3. Experiments

3.1 Latent Space Separation

3.2 Latent Space Manipulation

3.3 Conditional Operations

3.4 Results on StyleGAN

3.5 Real Image Manipulation

4. Conclusion

Some Thoughts

Reference


0. Summary

Although Generative Adversarial Networks (GANs) are state-of-the-art in high-fidelity image synthesis, how GANs map latent codes sampled from a random distribution to realistic images is not yet fully understood. Previous work typically treats the latent space learned by a GAN as a distributed representation and has observed that it supports vector arithmetic. In this paper, a framework called InterFaceGAN is proposed to perform semantic face editing by interpreting the latent semantics learned by GANs. Within this framework, we study in detail how different semantics are encoded in the latent space. We find that the latent code of a well-trained generative model yields a disentangled representation after linear transformations. We examine the entanglement between various semantics and manage to decouple some entangled semantics via subspace projection, allowing more precise control over facial attributes. Besides manipulating gender, age, expression, and the presence of glasses, we can even change the face pose and fix artifacts that GAN models accidentally generate. When combined with GAN inversion or encoder-involved models, the proposed method can also handle real images. These findings suggest that learning to synthesize face images spontaneously yields a disentangled and controllable representation of facial attributes.

1. Introduction

The principle of a GAN is to learn, through adversarial training, a mapping from a latent distribution to real data. After learning this non-linear mapping, a GAN can generate realistic images from randomly sampled latent codes. However, it remains unclear how semantics are generated and organized in the latent space. Taking face synthesis as an example, when a latent code is sampled to generate an image, how does the code determine the various semantic attributes of the output face (e.g., gender and age), and how are these attributes entangled with each other?

Existing work generally focuses on improving the synthesis quality of GANs; few studies have investigated what GANs actually learn in the latent space. Radford et al. were the first to observe vector arithmetic properties in the latent space. A recent work further shows that some units in the intermediate layers of a GAN generator specialize in synthesizing certain visual concepts, such as sofas and TVs when generating living rooms. Even so, there is not enough understanding of how GANs connect the latent space with the semantic space of images, and of how latent codes can be used for image editing.

In this paper, we propose InterFaceGAN, short for Interpreting Face GANs, a framework to identify the semantics encoded in the latent space of well-trained face synthesis models and then exploit them for semantic face editing. Beyond vector arithmetic, the framework provides both theoretical analysis and experimental results to verify that the binary semantics emerging in the latent space align with linear subspaces. We further investigate the entanglement between different semantics and show that we can decouple some entangled attributes (e.g., older people are more likely to wear glasses than younger people) through linear subspace projection. These decoupled semantics enable precise control of facial attributes with any given GAN model, without retraining.

The contributions of this paper are as follows:

  • We propose InterFaceGAN to explore how single or multiple semantics are encoded in the latent space of GANs such as PGGAN and StyleGAN, and observe that GANs spontaneously learn various latent subspaces corresponding to specific attributes. A linear transformation can disentangle the representations of these attributes.
  • We show that InterFaceGAN enables semantic face editing with any fixed, pretrained GAN model (no retraining needed). Some of the results are shown above. In addition to gender, age, expression and the presence of glasses, we can also noticeably change the face pose and correct some artifacts produced by the GAN.
  • We extend InterFaceGAN to real image editing using GAN inversion and encoder-involved models. We successfully manipulate the attributes of real faces by simply varying the latent code, even with GANs not specifically designed for the editing task.

1.1 Related Work

GANs. A GAN usually takes a sampled latent code as input for image synthesis. To make GANs applicable to real image processing, existing methods either invert the mapping from the latent space to the image space or learn an additional encoder together with GAN training.

Despite great success, little research has been done on how GANs learn to associate input latent space with semantics in the real visual world.

Study on the latent space of GANs. The latent space of GANs is often treated as a Riemannian manifold. Prior work has focused on how to smoothly vary the output image from one synthesis to another through interpolation in the latent space, regardless of whether the image is semantically controllable.

GLO optimizes both the generator and the latent code to learn a better latent space. However, studies on how a well-trained GAN encodes different semantics within the latent space are still missing.

Several works have observed the vector arithmetic property. Going a step further, this work conducts a detailed analysis of the semantics in the latent space from two aspects: the properties of a single semantic and the disentanglement of multiple semantics.

Some concurrent work also explores the latent semantics learned by GANs:

  • Jahanian et al. study the controllability of GANs in terms of camera motion and image tone.
  • Goetschalckx et al. improve the memorability of the output images.
  • Yang et al. explore hierarchical semantics in deep generative representations for scene synthesis.

Semantic face editing with GANs. Semantic face editing aims to manipulate the facial attributes of a given image. Unlike unconditional GANs, which generate images arbitrarily, semantic editing expects the model to change only the target attribute while preserving the other attributes of the input image.

To achieve this goal, current methods carefully design loss functions, introduce additional attribute labels or features, or devise special architectures to train new models. However, the synthesis resolution and quality of these models lag far behind those of native GANs such as PGGAN and StyleGAN.

Different from previous learning-based methods, this work explores the interpretable semantics within the latent space of a fixed GAN model, and transforms an unconstrained GAN into a controllable GAN by changing the latent code.

2. Framework of InterFaceGAN

2.1 Semantics in Latent Space

Given a well-trained GAN model, the generator can be formulated as a deterministic function $g: \mathcal{Z} \to \mathcal{X}$. Here $\mathcal{Z} \subseteq \mathbb{R}^d$ denotes the $d$-dimensional latent space, for which a Gaussian distribution $\mathcal{N}(0, I_d)$ is commonly used. $\mathcal{X}$ denotes the image space, where each sample $x$ carries certain semantic information, such as the gender and age of the face it contains. Suppose further that there is a semantic scoring function $f_S: \mathcal{X} \to \mathcal{S}$, where $\mathcal{S} \subseteq \mathbb{R}^m$ represents the semantic space with $m$ semantics. We can bridge the latent space $\mathcal{Z}$ and the semantic space $\mathcal{S}$ with $s = f_S(g(z))$, where $s$ and $z$ denote the semantic scores and the sampled latent code, respectively.
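
To make the notation concrete, here is a minimal numpy sketch of this setup; the generator and scorer below are toy stand-ins (fixed random linear maps), not the actual trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 512, 5                            # latent dim and number of semantics (illustrative)

# Toy stand-ins: a real g would be a GAN generator and a real f_S an
# attribute predictor; both are replaced by fixed random linear maps here.
W_g = rng.standard_normal((1024, d))     # hypothetical "generator" weights
W_s = rng.standard_normal((m, 1024))     # hypothetical "scorer" weights

def g(z):                                # g : Z -> X
    return np.tanh(W_g @ z)

def f_S(x):                              # f_S : X -> S
    return W_s @ x

z = rng.standard_normal(d)               # z ~ N(0, I_d)
s = f_S(g(z))                            # s in R^m: semantic scores of the synthesis
print(s.shape)                           # (5,)
```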

Single semantic. It is widely observed that when two latent codes $z_1$ and $z_2$ are linearly interpolated, the appearance of the corresponding synthesis changes continuously. This implies that the semantics contained in the image also change gradually. By Property 1 (shown above), linear interpolation between $z_1$ and $z_2$ forms a direction in $\mathcal{Z}$, which in turn defines a hyperplane. We therefore assume that for any binary semantic (e.g., male vs. female), there exists a hyperplane in the latent space serving as the separation boundary. The semantic stays the same when the latent code moves within the same side of the hyperplane, and flips when the code crosses the boundary.

Given a hyperplane with a unit normal vector $n \in \mathbb{R}^d$, we define the "distance" from a sample $z$ to this hyperplane as

$$d(n, z) = n^T z. \qquad (1)$$

This is not a distance in the strict sense, since it can be negative. When $z$ lies near the boundary and moves toward and then across the hyperplane, both the "distance" and the semantic score vary accordingly; it is exactly when the "distance" changes its sign that the semantic attribute flips. We therefore expect the two to be linearly related:

$$f(g(z)) = \lambda \, d(n, z), \qquad (2)$$

where $f(\cdot)$ is the scoring function for this particular semantic, and $\lambda > 0$ is a scalar measuring how fast the semantic varies along with the change of distance.

By Property 2 (shown above), random samples drawn from $\mathcal{N}(0, I_d)$ are likely to lie close enough to a given hyperplane. Therefore, the corresponding semantic can be modeled by the linear subspace defined by the normal vector $n$.

Multiple semantics. When there are $m$ different semantics, we have

$$s \equiv f_S(g(z)) = \Lambda N^T z, \qquad (3)$$

where $s = [s_1, \dots, s_m]^T$ denotes the semantic scores, $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$ is a diagonal matrix containing the linear coefficients, and $N = [n_1, \dots, n_m]$ stacks the separation boundaries. Knowing the distribution of the random sample $z$, i.e., $\mathcal{N}(0, I_d)$, we can easily compute the mean and covariance matrix of the semantic scores $s$:

$$\mu_s = \mathbb{E}(\Lambda N^T z) = \Lambda N^T \mathbb{E}(z) = 0, \qquad (4)$$

$$\Sigma_s = \mathbb{E}(\Lambda N^T z z^T N \Lambda) = \Lambda N^T N \Lambda. \qquad (5)$$

Hence $s \sim \mathcal{N}(0, \Sigma_s)$, a multivariate normal distribution. The entries of $s$ are disentangled from one another if and only if $\Sigma_s$ is a diagonal matrix, which requires the normal vectors $\{n_1, \dots, n_m\}$ to be mutually orthogonal. If this condition does not hold, some semantics are correlated with each other, and $n_i^T n_j$ can be used to measure the entanglement between the $i$-th and $j$-th semantics.
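
This derivation is easy to verify numerically. A small numpy sketch with illustrative sizes (all names are ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 3
N = rng.standard_normal((d, m))
N /= np.linalg.norm(N, axis=0)                # columns n_1..n_m are unit normals
Lam = np.diag(rng.uniform(0.5, 2.0, size=m))  # Lambda = diag(lambda_1..lambda_m), lambda_i > 0

Z = rng.standard_normal((100_000, d))         # rows: z ~ N(0, I_d)
S = Z @ N @ Lam                               # row-wise s^T = z^T N Lambda, i.e. s = Lambda N^T z

print(np.allclose(np.cov(S, rowvar=False),    # empirical covariance of s
                  Lam @ N.T @ N @ Lam,        # Sigma_s = Lambda N^T N Lambda (Eq. 5)
                  atol=0.1))                  # -> True up to sampling noise

# Off-diagonal entries of N^T N measure the entanglement n_i^T n_j between
# the i-th and j-th semantics; Sigma_s is diagonal iff they all vanish.
print(np.round(N.T @ N, 3))
```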

2.2 Operations in Latent Space

In this part, we describe how to use the semantics established in latent space for image editing.

Single attribute manipulation. According to Equation (2), to manipulate an attribute of the synthesized image, we can edit the original latent code $z$ as $z_{edit} = z + \alpha n$. Since the score after editing becomes $f(g(z_{edit})) = f(g(z)) + \lambda \alpha$, choosing $\alpha > 0$ makes the synthesis more positive on that semantic, and $\alpha < 0$ more negative.
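
A minimal sketch of this edit, assuming a boundary normal n has already been found (the generator in the usage comment is a hypothetical handle):

```python
import numpy as np

def edit_latent(z, n, alpha):
    """Single-attribute edit z_edit = z + alpha * n.

    By Eq. (2), the score changes to f(g(z_edit)) = f(g(z)) + lambda * alpha,
    so alpha > 0 pushes the semantic toward the positive side and alpha < 0
    toward the negative side.
    """
    n = n / np.linalg.norm(n)      # ensure the boundary normal is a unit vector
    return z + alpha * n

# Usage with a hypothetical generator and an "age" boundary found by the SVM:
# x_older = generator(edit_latent(z, n_age, alpha=3.0))
```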

Conditional manipulation. When there are multiple attributes, editing one may affect another, since some semantics are coupled with each other. For more precise control, we perform conditional manipulation by forcing $N^T N$ in Equation (5) to be a diagonal matrix. In particular, we use projection to orthogonalize the different directions.

As shown above, given two hyperplanes with normal vectors $n_1$ and $n_2$, we find the projected direction $n_1 - (n_1^T n_2)\,n_2$. Moving a sample along this new direction changes "attribute 1" without affecting "attribute 2"; we call this conditional manipulation. If more than one attribute is to be held fixed, we simply subtract from the primal direction its projection onto the plane constructed by all the conditioned directions, as in the sketch below.
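
A sketch of this projection, generalized to any number of conditioned directions via an orthonormal basis (the QR step is our implementation choice, not specified by the paper):

```python
import numpy as np

def condition_direction(n_primal, conditions):
    """Conditional manipulation direction (a sketch).

    Subtracts from the primal direction its projection onto the subspace
    spanned by the conditioned normals; for a single condition this reduces
    to n_1 - (n_1^T n_2) n_2. Moving along the result changes the primal
    attribute while (to first order) leaving the conditioned ones fixed.
    """
    n = n_primal / np.linalg.norm(n_primal)
    Q, _ = np.linalg.qr(np.stack(conditions, axis=1))  # orthonormal basis of conditions
    n = n - Q @ (Q.T @ n)                              # remove the projected component
    return n / np.linalg.norm(n)

# e.g. edit "glasses" while keeping "age" and "gender" untouched:
# n_glasses_only = condition_direction(n_glasses, [n_age, n_gender])
```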

Real image manipulation. Since our method performs semantic editing in the latent space of a fixed GAN model, we need to map a real image to a latent code before the manipulation. To this end, existing methods either directly optimize the latent code to minimize the reconstruction loss, or learn an extra encoder that maps the target image back to the latent space. There are also models that involve an encoder in the GAN training process, which we can use directly for inference.

3. Experiments

In this section, we evaluate InterFaceGAN using the state-of-the-art GAN models PGGAN and StyleGAN.

  • Experiments in Sections 3.1, 3.2 and 3.3 are performed on PGGAN to interpret the latent space of a traditional generator.
  • The experiments in Section 3.4 are conducted on StyleGAN to investigate style-based generators and compare the differences between two groups of latent representations in StyleGAN.
  • We apply our method to real images in Section 3.5 to see how the semantics implicitly learned by GANs can be applied to real face editing.

3.1 Latent Space Separation

As described in Section 2.1, our framework is based on the assumption that for any binary attribute there exists a hyperplane in the latent space such that all samples on the same side share the same attribute. Therefore, we first evaluate the validity of this assumption to make the rest of the analysis sound.

We train five independent linear SVMs on pose, smile, age, gender, and glasses, and then evaluate them on the validation set (6K samples with high-confidence attribute scores) as well as the entire set (480K random samples).
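
A sketch of this boundary search; `latents` and `scores` are assumed to come from sampling the GAN and running an off-the-shelf attribute predictor on the syntheses (the names and the confident-sample selection heuristic are ours):

```python
import numpy as np
from sklearn import svm

def find_boundary(latents, scores, n_train=6_000):
    """Fit a linear SVM on the most confident samples and return its unit normal."""
    order = np.argsort(scores)
    pos, neg = order[-n_train // 2:], order[:n_train // 2]   # confident positives/negatives
    X = np.concatenate([latents[pos], latents[neg]])
    y = np.concatenate([np.ones(n_train // 2), np.zeros(n_train // 2)])
    clf = svm.LinearSVC(C=1.0).fit(X, y)
    n = clf.coef_.reshape(-1)
    return n / np.linalg.norm(n)       # unit normal of the separation hyperplane
```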

The results are shown in the figure above. All linear boundaries achieve over 95% accuracy on the validation set and over 75% on the entire set, suggesting that for a binary attribute there indeed exists a linear hyperplane in the latent space that separates the samples into two groups.

Some samples are visualized by sorting them by their distance to the decision boundary (as shown in the figure above). Note that the extreme cases (the first and last rows in the figure) are unlikely to be sampled directly; they are constructed by moving a latent code "infinitely" far along the normal direction. As the figure shows, positive and negative samples are clearly distinguishable on the corresponding attributes.

3.2 Latent Space Manipulation 

In this section, we verify that the semantics identified by InterFaceGAN can actually be manipulated.

Manipulating a single attribute. The figure above shows the manipulation results on five different attributes, demonstrating that our method performs well in both the positive and negative directions for all attributes, especially pose. Even though we search for the boundary by solving a binary classification problem, moving the latent code yields continuous changes. Furthermore, although the training set lacks sufficient data for extreme poses, the GAN is able to imagine what a profile face should look like. The same happens with the glasses attribute: we can manipulate many faces to wear glasses despite the scarcity of such data in the training set. These two observations provide strong evidence that GANs do not generate images at random but learn interpretable semantics in the latent space.

Distance effect in the semantic subspace. When manipulating the latent code, we observe an interesting distance effect: if a sample is moved too far from the boundary, its appearance changes drastically, often ending up as the extreme cases shown in Figure 3.

Figure 5 illustrates this phenomenon using gender editing as an example. Manipulation near the boundary works well; however, when the sample goes beyond a certain region (the threshold is set to 5), the edited result no longer looks like the original face.

This effect does not affect our understanding of the disentangled semantics in the latent space, because such extreme samples are unlikely to be drawn directly from the standard normal distribution, as indicated by Property 2 in Section 2.1. Instead, they are constructed manually by continuously moving a normally sampled latent code along a certain direction. Within the meaningful region, the latent semantics of GANs can be interpreted well.

Artifact correction. We further apply our method to fix the artifacts that sometimes appear in the synthesized outputs. We manually label 4K bad synthesis results and then train a linear SVM to find the separation hyperplane. Surprisingly, the GAN also encodes this quality information in the latent space. Based on this finding, we can correct some of the mistakes the GAN makes during generation, as shown in the figure above.

3.3 Conditional Operations

In this section, we study the separation of different properties and evaluate conditional operations.

Attribute correlation. Rather than introducing perceptual path length and linear separability to measure the disentanglement of the latent space, we focus on the relation between different hidden semantics and study how they are coupled with each other. Two different metrics are used to measure the correlation between two attributes:

  • Cosine similarity between the two boundary normals $n_1$ and $n_2$.
  • Treating each attribute score as a random variable and using the attribute distribution observed over all 500K synthesized samples to compute the correlation coefficient $\rho_{A_1 A_2} = \frac{\mathrm{Cov}(A_1, A_2)}{\sigma_{A_1} \sigma_{A_2}}$, where $A_1$ and $A_2$ denote the random variables corresponding to the two attributes.
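
Both metrics are straightforward to compute; a minimal numpy sketch:

```python
import numpy as np

def cosine_similarity(n1, n2):
    """Metric 1: cosine similarity between two boundary normals."""
    return n1 @ n2 / (np.linalg.norm(n1) * np.linalg.norm(n2))

def attribute_correlation(a1, a2):
    """Metric 2: Pearson correlation rho = Cov(A1, A2) / (sigma_A1 * sigma_A2),
    where a1 and a2 are attribute scores over the synthesized samples."""
    return np.corrcoef(a1, a2)[0, 1]
```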

The results are shown in the figure above. The attributes behave similarly under the two metrics, indicating that InterFaceGAN accurately identifies the semantics hidden in the latent space. We also find that pose and smile are almost orthogonal to the other attributes, whereas gender, age, and glasses are highly correlated with each other. This observation partly reflects the attribute correlations in the training dataset (CelebA-HQ), where elderly men are more likely to wear glasses. The GAN also captures this characteristic when learning to reproduce real-world observations.

Conditional manipulation. To disentangle different semantics for independent facial attribute editing, we apply the conditional manipulation proposed in Section 2.2.

We manipulate one attribute with another as a condition; the results are shown in the two figures above. Subtracting from the glasses direction its projections onto the age and gender directions yields a new direction; when samples move along this new direction, the age and gender attributes no longer change.

3.4 Results on StyleGAN

Different from conventional GANs, StyleGAN proposes a style-based generator. Basically, StyleGAN learns to map the latent code z from the latent space Z to another space W through a mapping network before feeding it into the generator. Because W is not restricted to any particular distribution, it can better model the underlying characteristics of the real data and exhibits much stronger attribute disentanglement than Z.
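
For reference, the mapping network is just an MLP; a sketch following StyleGAN's published design (8 fully-connected layers with leaky ReLU):

```python
import torch.nn as nn

# Sketch of StyleGAN's mapping network: an 8-layer MLP that maps z in Z to w in W.
# Both spaces are 512-dimensional; the disentanglement comes from the learned,
# non-linear mapping rather than from a change in dimensionality.
layers = []
for _ in range(8):
    layers += [nn.Linear(512, 512), nn.LeakyReLU(0.2)]
mapping = nn.Sequential(*layers)
# w = mapping(z); the synthesis network then consumes w (via AdaIN) at every layer.
```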

We perform an analysis on the Z space and the W space of StyleGAN similar to that on PGGAN, and find that W space indeed learns a more disentangled representation. This disentanglement gives W space a clear advantage over Z space in attribute editing.

As shown above, age and glasses are correlated in the StyleGAN model. W space (first row) performs better than Z space (second row), especially for long-distance manipulation far from the hyperplane. Nevertheless, we can apply the conditional manipulation trick described in Section 2.2 to decouple these two attributes in Z space (third row), which yields better results.

However, this trick cannot be applied to W space. We find that W space sometimes captures the attribute correlations of the training data and encodes them as a coupled "style". In the figure above, "age" and "glasses" are meant to be two distinct semantics, but StyleGAN actually learns an age direction that already includes wearing glasses, and this direction is almost orthogonal to the glasses direction itself. As a result, subtracting a near-zero projection barely affects the final result.

3.5 Real Image Manipulation

In this part, we apply InterFaceGAN to real faces to verify whether the semantic attributes learned by the GAN transfer to real faces. Recall that InterFaceGAN performs semantic face editing by moving the latent code along a certain direction; therefore, we first need to convert a given real image to a latent code. This turns out to be a non-trivial task, because GANs do not fully capture all the modes and the diversity of the true distribution.

To invert a pretrained GAN model, there are two typical approaches.

  • One is the optimization-based approach, which directly optimizes the latent code with the generator fixed so as to minimize the pixel-wise reconstruction error (a minimal sketch follows this list).
  • The other is encoder-based, where an additional encoder network is trained to learn the inverse mapping.
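
A minimal PyTorch sketch of the optimization-based approach, assuming `generator` maps a (1, d) code to an image tensor shaped like `target`; real implementations usually add a perceptual loss on top of the pixel loss:

```python
import torch
import torch.nn.functional as F

def invert(generator, target, d=512, steps=1_000, lr=1e-2):
    """Optimization-based inversion (a sketch): with the generator fixed,
    directly optimize a latent code to minimize pixel reconstruction error."""
    z = torch.randn(1, d, requires_grad=True)   # start from a random code
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(z), target)  # pixel-wise reconstruction error
        loss.backward()
        opt.step()
    return z.detach()
```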

We test both approaches as baselines on PGGAN and StyleGAN.

The results are shown in the figure above. Both the optimization-based (first row) and encoder-based (second row) methods perform poorly when inverting PGGAN. This can be attributed to the strong discrepancy between the training and testing data distributions; for example, the model tends to output faces of Westerners even when the input is an Easterner (see the example on the right in the figure). However, even when the inverted image differs from the input, it can still be semantically edited with InterFaceGAN. Compared with PGGAN, the results on StyleGAN (third row) are much better. Here we treat the layer-wise styles (i.e., the w codes for all layers) as the optimization target. When editing an instance, we push all the style codes along the same direction. As shown in the figure, we successfully change the attributes of real face images by exploiting the semantics interpreted from the latent space, without retraining StyleGAN.
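A sketch of this layer-wise editing, assuming `w_plus` holds the per-layer styles recovered by inversion:

```python
import torch

def edit_layerwise(w_plus, n, alpha):
    """Shift every layer's style code along the same boundary normal.

    w_plus: (num_layers, 512) layer-wise styles from inversion;
    n:      (512,) unit normal of a boundary found in W space;
    alpha:  edit strength (its sign picks the direction).
    """
    return w_plus + alpha * n.unsqueeze(0)   # broadcast the same shift to all layers
```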

We also test InterFaceGAN on an encoder-involved generative model, which trains the encoder together with the generator and the discriminator. After the model converges, the encoder can be used directly for inference to map a given image into the latent space.

We apply our method to interpret the latent space of the recent encoder-involved model LIA. The manipulation results are shown in the figure above. We successfully edit the input faces with respect to various attributes, such as age and face pose. This suggests that the latent code in encoder-involved generative models also supports semantic manipulation. Moreover, the encoder trained together with the generator provides better reconstruction and manipulation results than the encoder in Figure 10(b), which is learned separately after the GAN is trained.
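
Putting the pieces together, real-image editing with an encoder-involved model reduces to a few lines (a sketch; `encoder`, `generator`, and the boundary normal are assumed handles):

```python
import numpy as np

def edit_real_image(image, encoder, generator, n, alpha):
    """Real-image editing with an encoder-involved model (a sketch): the encoder
    maps the face to a latent code, the code is shifted along the boundary
    normal, and the generator re-synthesizes the edited face."""
    z = encoder(image)                         # inference-time inversion
    z_edit = z + alpha * n / np.linalg.norm(n) # single-attribute shift (Sec. 2.2)
    return generator(z_edit)
```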

4. Conclusion

We propose InterFaceGAN to interpret the semantics encoded in the latent space of GANs. By exploiting the interpreted semantics together with the conditional manipulation technique, facial attributes can be precisely controlled with any fixed GAN model, turning even unconditional GANs into controllable ones. Extensive experiments show that InterFaceGAN can also be applied to real image editing.

Some Thoughts

As shown in Section 3.2, "moving the latent code produces continuous changes; moreover, although there is not enough extreme-pose data in the training set, the GAN can imagine what a profile face should look like." Could this property be used to generate videos, or to perform data augmentation?

As shown in Section 3.3, because attributes are correlated, a vector along one attribute direction (such as glasses) has components along other attribute directions (such as age and gender). After subtracting these components via projection, we obtain a new vector along which movement has no effect on the other attributes.

As shown in Section 3.4:

  • W space is obtained by mapping the latent space Z through a learned network (in StyleGAN both spaces are 512-dimensional; the mapping is an MLP). This reminds me of normalizing flows, through which a simple distribution can be transformed into a more complex one, or a complex distribution into a simpler one.
  • W space has better attribute decoupling (disentanglement) properties than the latent space Z. An intuitive understanding is that the learned mapping warps the space so that the attributes become separable: new directions emerge that play the role of the directions obtained through the "subtraction" (projection) operation. Although such a direction may not be directly describable in language, it is indeed a fully independent direction (attribute).

Reference

Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020). Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9243-9252).
