Image2StyleGAN: How to Embed Images into the StyleGAN Latent Space?

Table of contents

0. Summary

1. Introduction

2. Related work

3. What images can be embedded in the latent space of StyleGAN?

3.1 Embedding results for different image categories

3.2 How robust is the embedding of face images?

3.3 What kind of latent space to choose?

4. The meaning of Embedding

4.1 Morphing

4.2 Style transfer 

4.3 Expression Transfer and Face Reconstruction

5. Embedding Algorithms

5.1 Initialization

5.2 Loss Function

5.3 Other parameters 

6. Conclusion

Reference

Summary & some thoughts


0. Summary

We propose an efficient algorithm to embed a given image into the latent space of StyleGAN. This embedding enables semantic image editing of photographs. Taking a StyleGAN trained on the FFHQ dataset as an example, we show results for image morphing, style transfer, and expression transfer. Studying the embedding algorithm helps us understand the structure of the StyleGAN latent space. We propose a set of experiments to test what categories of images can be embedded, how they are embedded, what latent space is suitable for embedding, and whether the embeddings are semantically meaningful.

To avoid image-policy violations, some face images in this post have been pixelated. See the paper for the original images.

1. Introduction

Our aim is to modify a given image rather than a randomly generated GAN image.

The generalization ability of the pretrained StyleGAN is significantly enhanced when using the extended latent space W+ (see Section 3.3). As a result, our embedding algorithm can embed not only face images but also non-face images from different categories. We continue the investigation by analyzing the quality of the embeddings, i.e., whether they are semantically meaningful. To do this, we apply three basic operations to vectors in the latent space: linear interpolation, crossover, and vector addition/subtraction. These operations correspond to three kinds of editing: morphing, style transfer, and expression transfer. Finally, we gain a better understanding of the structure of the latent space and of why non-face images (such as cars) can be embedded.

Main contributions:

  • An efficient embedding algorithm that maps a given image to the extended latent space W+ of pretrained StyleGAN.
  • We investigate multiple questions to gain insight into the structure of the StyleGAN latent space, e.g.: What types of images can be embedded? What types of faces can be embedded? Which latent space can be used for the embedding?
  • We use three basic operations on latent vectors to study the quality of embeddings, which yields a better understanding of the latent space and of how images of different categories are embedded. As a by-product, we obtain excellent results on several face image editing applications, including morphing, style transfer, and expression transfer.

2. Related work

High-quality GANs. Recently, Karras et al. collected FFHQ, a more diverse and higher-quality face dataset, and proposed a new generator architecture inspired by ideas from neural style transfer, further improving GAN performance on face generation. However, the lack of interpretability of neural networks leaves little control over how an image is modified. In this paper, the interpretability problem is addressed by embedding images back into the GAN latent space, which opens up various applications.

Latent Space Embedding. In general, there are two existing ways to map an image into a latent space:

  • Learn an encoder that maps a given image into the latent space (e.g., a variational autoencoder);
  • Pick a random initial latent code and optimize it using gradient descent.

The first method produces embeddings quickly, with a single pass through the encoder. However, it often has problems generalizing beyond the training dataset. In this article, the second method is adopted as the more general and stable solution.

Perceptual Loss. Gatys et al. observed that the learned filters of the VGG image classification model are excellent general-purpose feature extractors, and proposed to use the covariance statistics of the extracted features to measure high-level similarity between images perceptually, formalized as the perceptual loss.

3. What images can be embedded in the latent space of StyleGAN?

3.1 Embedding results for different image categories

For testing, we collected a small dataset of 25 diverse images covering 5 categories (faces, cats, dogs, cars, and paintings). Face images are preprocessed as in StyleGAN, which includes aligning them to a canonical face position.

We study the embedding of further image classes to better understand the structure and properties of the latent space. We chose cats, dogs, and painted faces because they share the overall structure of a human face but differ greatly in style. Cars were chosen because they have no structural similarity to faces.

The figure above shows the embedding results, with one example for each image class in the collected test dataset. The embedded faces are of very high perceptual quality and faithfully reproduce the input; however, they are slightly smoothed and lack small details.

Interestingly, we find that although the StyleGAN generator is trained on a face dataset, the embedding algorithm reaches far beyond faces. As the figure above shows, we obtain reasonable and relatively high-quality embeddings of cats, dogs, and even paintings and cars, although they are slightly worse than faces (in resolution, detail, etc.).

Another interesting question is how the quality of the pretrained latent space affects the embedding. For these tests we also used StyleGANs trained on cars, cats, etc. However, the quality of those results was significantly lower.

3.2 How robust is the embedding of face images?

Affine Transformation. As shown in Figure 2 and Table 1, the performance of StyleGAN embedding is very sensitive to affine transformations (translation, resizing, and rotation). Translation performs worst, failing to produce a valid face embedding at all. Resizing and rotation do yield valid faces, but the results are blurry and lose many details, still clearly worse than normal embeddings. From these observations we argue that the generalization ability of GANs is sensitive to affine transformations, implying that the learned representations still depend on scale and position to some extent.

Embedding defective images. As shown in Figure 3, StyleGAN embedding copes well with image defects. The embeddings of different facial features are independent of each other; for example, masking out the nose has no noticeable effect on the embedding of the eyes and mouth. On the one hand, this is good for image editing in general. On the other hand, it shows that the latent space does not force the embedded image to be a complete face, i.e., it does not fill in the missing information.

3.3 What kind of latent space to choose?

There are multiple latent spaces in StyleGAN that could be used for an embedding. Two obvious candidates are the initial latent space Z and the intermediate latent space W: a 512-dimensional vector z ∈ Z is passed through a fully connected mapping network to obtain a 512-dimensional vector w ∈ W. Embedding directly into W or Z is not easy. We therefore propose embedding into an extended latent space W+, the concatenation of 18 different 512-dimensional w vectors, one for each layer of the StyleGAN architecture that receives input via AdaIN.
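Below is a minimal PyTorch sketch of the relationship between the three spaces. The two-layer `mapping` network here is a hypothetical stand-in for StyleGAN's eight-layer mapping network, used only to show the shapes involved.

```python
import torch

# Hypothetical stand-in for StyleGAN's 8-layer MLP mapping network f: Z -> W.
mapping = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(512, 512),
)

z = torch.randn(512)                    # z in Z, the initial latent space
w = mapping(z)                          # w in W, the intermediate latent space

# W: the SAME 512-d vector w is fed to all 18 AdaIN inputs.
w_tiled = w.unsqueeze(0).repeat(18, 1)  # shape (18, 512), only 512 free values

# W+: each of the 18 AdaIN inputs gets its OWN independent 512-d vector,
# so a W+ embedding optimizes 18 x 512 free variables.
w_plus = torch.randn(18, 512)
print(w_tiled.shape, w_plus.shape)      # torch.Size([18, 512]) for both
```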

As shown in Fig. 5(c)(d), embedding directly into W does not give good results. Another interesting question is how important the learned network weights are for the result. We answer this in Fig. 5(b)(e) by embedding into a network that is simply initialized with random weights.

4. The meaning of Embedding

We propose three tests to evaluate whether an embedding is semantically meaningful. These tests can be performed with simple operations on the latent vectors w_i, and correspond to three semantic image edits: morphing, style transfer, and expression transfer. We consider a test successful if the resulting operation produces high-quality images.

4.1 Morphing

Given two embedded images with latent codes w_1 and w_2, a morph is computed by linear interpolation, w = λw_1 + (1 − λ)w_2 with λ ∈ (0, 1), and the intermediate images are generated from w.
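As a sketch (assuming W+ codes of shape (18, 512) and a pretrained synthesis network `G`, neither of which is provided here), the interpolation is one line per frame:

```python
import torch

def morph(w1: torch.Tensor, w2: torch.Tensor, steps: int = 8) -> list:
    """Linearly interpolate between two W+ codes of shape (18, 512).
    Endpoints are included here for convenience; the paper's lambda is in (0, 1)."""
    return [lam * w1 + (1.0 - lam) * w2 for lam in torch.linspace(0.0, 1.0, steps)]

# Usage with a pretrained synthesis network G (not defined in this sketch):
# frames = [G(w) for w in morph(w_a, w_b)]
```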

As shown in Figure 4, our method generates high-quality morphs between face images (rows 1, 2, 3), but fails for all non-face images.

Interestingly, face outlines appear in the intermediate images of morphs between classes, which indicates that the latent-space structure of this StyleGAN is dedicated to faces. We therefore speculate that non-face images are embedded in such a way that the initial layers create a face-like structure, while later layers paint over this structure until it is no longer recognizable.

While a large-scale study of morphing is beyond the scope of this paper, we believe the face morphing results are very good, possibly better than the current state of the art. We leave this investigation to future work.

4.2 Style transfer 

Given two latent codes w_1 and w_2, style transfer is performed by a crossover operation. We show style transfer results between embedded images and other face images (Fig. 6), and between embedded images of different classes (Fig. 8).

More specifically, in Fig. 8 we keep the latent codes of the first 9 layers of the embedded content image (corresponding to spatial resolutions 4² through 64²) and overwrite the latent codes of the last 9 layers with the corresponding latent codes of the target style image (corresponding to spatial resolutions 64² through 1024²).
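A sketch of this crossover, assuming both latent codes are W+ tensors of shape (18, 512) with layer 0 being the coarsest (4×4) resolution:

```python
import torch

def style_crossover(w_content: torch.Tensor, w_style: torch.Tensor,
                    split: int = 9) -> torch.Tensor:
    """Keep the first `split` layer codes (coarse layers, 4^2 .. 64^2) from the
    embedded content image, and take the remaining layer codes (64^2 .. 1024^2)
    from the embedded style image. Inputs and output have shape (18, 512)."""
    return torch.cat([w_content[:split], w_style[split:]], dim=0)
```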

Our method transfers low-level features such as colors and textures, but fails to faithfully preserve the content structure of non-face images (Fig. 8, second column), especially paintings. This suggests that the generalization and expressive power of StyleGAN reside mainly in the style layers corresponding to the higher spatial resolutions.

4.3 Expression Transfer and Face Reconstruction

Given three input vectors w_1, w_2, and w_3, the transferred expression is w = w_1 + λ(w_3 − w_2), where w_1 is the latent code of the target image, w_2 corresponds to a neutral expression of the source image, and w_3 corresponds to a more pronounced expression. For example, w_3 can correspond to a smiling face and w_2 to an expressionless face of the same person.

To remove noise (e.g., background noise), we heuristically set a lower threshold on the L2 norm of the individual latent-code channels; channels whose norm falls below the threshold are set to the zero vector. For the experiments above, the threshold value is 1. We normalize the resulting difference vector so that λ controls the strength of the expression along that direction. Results are shown in the figure.
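A sketch of the whole operation is below. It treats each of the 18 layer codes as one "channel" when thresholding, which is one plausible reading of the description above; `transfer_expression` and its arguments are illustrative names, not the paper's implementation.

```python
import torch

def transfer_expression(w1: torch.Tensor, w2: torch.Tensor, w3: torch.Tensor,
                        lam: float = 1.0, tau: float = 1.0) -> torch.Tensor:
    """w1: target image code; w2: neutral-expression code; w3: expressive code.
    All are W+ codes of shape (18, 512)."""
    direction = w3 - w2
    # Heuristic denoising: zero out layer codes whose L2 norm is below tau
    # (the threshold value used above is 1).
    norms = direction.norm(dim=1, keepdim=True)               # shape (18, 1)
    direction = torch.where(norms < tau,
                            torch.zeros_like(direction), direction)
    # Normalize so that lam alone controls the strength of the expression.
    direction = direction / direction.norm().clamp(min=1e-8)
    return w1 + lam * direction
```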

5. Embedding Algorithms

Our method follows a straightforward optimization framework to embed a given image onto the manifold of a pretrained generator. Starting from a suitable initialization w, we search for a code w* that minimizes a loss function measuring the similarity between the given image and the image generated from w*.

The algorithm is shown in the figure above. Not all design choices yield good results.

5.1 Initialization

We investigate two design choices for the initialization. The first is random initialization, where each variable is sampled independently from the uniform distribution U[−1, 1]. The second is motivated by the observation that the distance to the average latent code $\bar{w}$ can be used to identify low-quality faces. We therefore propose to use $\bar{w}$ as the initialization, expecting the optimization to converge to a vector w* that is closer to $\bar{w}$.
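A common way to estimate $\bar{w}$ is to average mapped random samples; below is a sketch under the assumption that `mapping` is the pretrained Z → W network.

```python
import torch

@torch.no_grad()
def mean_latent(mapping, n_samples: int = 10_000, dim: int = 512) -> torch.Tensor:
    """Monte-Carlo estimate of the average latent code w_bar: push random
    z samples through the pretrained mapping network and average in W."""
    z = torch.randn(n_samples, dim)
    return mapping(z).mean(dim=0)              # shape (512,)

# Tile w_bar to (18, 512) to use it as the W+ initialization:
# w_init = mean_latent(mapping).unsqueeze(0).repeat(18, 1)
```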

To evaluate the two design choices, we compare the optimized loss value and the distance $\|w^* - \bar{w}\|$ between the optimized latent code w* and the average latent code $\bar{w}$.

As shown in Table 2, initializing with w = $\bar{w}$ for face image embedding not only brings the optimized w* closer to $\bar{w}$, but also achieves a lower loss. For other image classes (e.g., dogs), however, random initialization proves to be the better choice. Intuitively, this suggests that the distribution has a single cluster for faces, while other instances (e.g., dogs, cats) are scattered points around that cluster with no obvious pattern. Qualitative results are shown in Fig. 5(f)(g).

5.2 Loss Function

To measure the similarity between the input image and the generated image during optimization, we employ a loss function that is a weighted combination of the VGG-16 perceptual loss and the pixel-wise MSE loss:

$$w^* = \min_w \; L_{percept}(G(w), I) + \frac{\lambda_{mse}}{N} \left\| G(w) - I \right\|_2^2 \qquad (1)$$

where $I \in \mathbb{R}^{n \times n \times 3}$ is the input image, G(·) is the pretrained generator, N = n × n × 3 is the number of scalars in the image, and w is the latent code being optimized; λ_mse = 1 was chosen empirically and performs well. For the perceptual loss term L_percept(·) in Equation 1, we use:

$$L_{percept}(I_1, I_2) = \sum_{j=1}^{4} \frac{\lambda_j}{N_j} \left\| F_j(I_1) - F_j(I_2) \right\|_2^2 \qquad (2)$$

where $I_1, I_2 \in \mathbb{R}^{n \times n \times 3}$ are the input images, F_j denotes the feature output of VGG-16 layer conv1_1, conv1_2, conv3_2, or conv4_2 for j = 1, …, 4, N_j is the number of scalars in the j-th layer output, and λ_j = 1 was chosen empirically and performs well.

We combine the two terms because the pixel-wise MSE loss alone cannot find a high-quality embedding; the perceptual loss acts as a regularizer that guides the optimization into the right region of the latent space.
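A sketch of this combined loss in PyTorch is below. The torchvision layer indices 0, 2, 12, and 19 correspond to conv1_1, conv1_2, conv3_2, and conv4_2; taking the raw conv outputs (rather than post-ReLU activations) and omitting VGG input normalization are simplifying assumptions of this sketch.

```python
import torch
import torchvision

class EmbeddingLoss(torch.nn.Module):
    """Pixel-wise MSE plus multi-layer VGG-16 perceptual loss (Eqs. 1 and 2).
    torch.nn.functional.mse_loss averages over all elements, which matches
    the 1/N and 1/N_j normalizations in the equations."""

    def __init__(self, lambda_mse: float = 1.0):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        self.vgg = vgg.features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids = {0, 2, 12, 19}   # conv1_1, conv1_2, conv3_2, conv4_2
        self.lambda_mse = lambda_mse

    def _features(self, x: torch.Tensor) -> list:
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss = self.lambda_mse * torch.nn.functional.mse_loss(generated, target)
        for fg, ft in zip(self._features(generated), self._features(target)):
            loss = loss + torch.nn.functional.mse_loss(fg, ft)   # lambda_j = 1
        return loss
```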

We conduct an ablation study to justify our choice of loss function in Equation 1.

As shown in Figure 9, the pixel-wise MSE loss alone (column 2) embeds the overall colors well but fails to capture the features of non-face images. It also has a smoothing effect that loses details even for human faces. Interestingly, because the pixel-wise MSE loss operates in pixel space and ignores differences in feature space, its embeddings of non-face images (such as cars and paintings) tend toward the average face of the pretrained StyleGAN. This problem is addressed by the perceptual loss (columns 3, 5), which measures image similarity in feature space. Since our embedding task requires the embedded image to be close to the input at all scales, matching features at multiple layers of the VGG-16 network (column 5) works better than using a single layer (column 3). This further motivates combining the pixel-wise MSE loss with the perceptual loss (columns 4, 6), since the pixel-wise MSE loss can be regarded as the lowest-level perceptual loss, computed at the pixel level. Column 6 of Fig. 9 shows our final choice (pixel-wise MSE + multi-layer perceptual loss), which performs best.

5.3 Other parameters 

We use the Adam optimizer in all experiments, with learning rate η = 0.01, β₁ = 0.9, β₂ = 0.999, and ε = 1e−8. We optimize for 5000 gradient descent steps, which takes less than 7 minutes per image on a 32 GB NVIDIA V100 GPU.
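Putting the pieces together, here is a sketch of the optimization loop (using `EmbeddingLoss` from above; `generator` is assumed to be a fixed, pretrained synthesis network mapping a (18, 512) W+ code to an image):

```python
import torch

def embed_image(target: torch.Tensor, generator, loss_fn,
                w_init: torch.Tensor, steps: int = 5000) -> torch.Tensor:
    """Gradient descent on a W+ code with the Adam settings stated above."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=0.01, betas=(0.9, 0.999), eps=1e-8)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(generator(w), target)
        loss.backward()
        opt.step()
    return w.detach()
```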

To justify the choice of 5000 optimization steps, we study how the loss varies with the number of iterations. As shown in Figure 10, the loss for face images decreases fastest, converging after about 1000 steps; cat, dog, and car images converge more slowly, after about 3000 steps; and paintings are slowest, converging after about 5000 steps. We therefore run 5000 optimization steps in all experiments.

Iterative embedding. We test the robustness of iterative embedding: we repeatedly take the embedding result as the new input image and embed it again, seven times in total. As shown in Figure 11, although the input image is guaranteed to lie in the model distribution after the first embedding, performance degrades slowly (more details are lost) as the number of iterations increases. A likely reason is that the optimization converges slowly near a local optimum. For embeddings other than faces, the random initialization of the latent code may also contribute to the degradation. Taken together, these observations suggest that our embedding method easily achieves reasonably "good" embeddings across the model distribution, but struggles to achieve "perfect" ones.
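A sketch of this test, reusing `embed_image` from the loop above (with `generator`, `loss_fn`, and `w_init` assumed in scope):

```python
import torch

def iterative_embedding(target: torch.Tensor, generator, loss_fn,
                        w_init: torch.Tensor, rounds: int = 7) -> torch.Tensor:
    """Repeatedly re-embed: each embedding result becomes the next input image."""
    image = target
    for _ in range(rounds):
        w_star = embed_image(image, generator, loss_fn, w_init)
        image = generator(w_star).detach()
    return image
```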

6. Conclusion

We propose an efficient algorithm to embed images into the latent space of StyleGAN. The algorithm supports semantic image editing such as image morphing, style transfer, and expression transfer. We also use it to investigate several aspects of the StyleGAN latent space, presenting experiments that analyze what types of images can be embedded, how they are embedded, and what the embeddings mean. We conclude that images of any type can be embedded, and that embedding works best in the extended latent space W+. However, only face embeddings are semantically meaningful (i.e., support semantic image editing).

Our framework still has limitations. First, the pretrained StyleGAN can produce erroneous images (generations that deviate from what is expected). Second, optimization takes minutes; an algorithm that runs within a second would be more attractive for interactive editing.

In future work, we hope to extend our framework to handle videos as well as still images. Additionally, we would like to explore embeddings in GANs trained on 3D data such as point clouds or meshes. 

Reference

Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space?. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4432-4441).

Summary & some thoughts

(Section 3.1) A StyleGAN pretrained on the face domain can embed not only faces but also images from other domains (such as cats, dogs, and cars), which lays groundwork for future few-shot or even zero-shot learning.

(Section 3.3) To obtain better performance than with the initial latent space Z or the intermediate latent space W, use the extended latent space W+: if StyleGAN has L layers in total, concatenate L different latent codes w and feed one to each layer of StyleGAN. The number of layers is determined by the output resolution R: L = 2·log₂(R) − 2, so the maximum resolution of 1024×1024 corresponds to an 18-layer structure.

(Section 4) By performing interpolation (weighted sum), crossover (grafting), and addition/subtraction operations on the latent codes of images, one realizes morphing between two images, style transfer, and expression transfer, respectively.

(Section 5.2) To measure the similarity between the input image and the generated image during optimization, the loss function used in this paper is a weighted combination of the perceptual loss (which measures high-level similarity between images using features extracted from convolutional layers) and the pixel-wise MSE loss. The reason is that MSE alone cannot produce a high-quality embedding, so the perceptual loss is needed as a regularization term that guides the optimization toward the correct region of the latent space.
