A summary of several recent papers on GANs and their latent spaces

Official account: EDPJ

Interpreting the latent space: Interpreting the Latent Space of GANs for Semantic Face Editing
W+ latent space: Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?
GAN inversion: GAN Inversion: A Survey
CLIP: Learning Transferable Visual Models From Natural Language Supervision
CLIP & StyleGAN: StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Table of contents

1. (2020) Interpreting the Latent Space of GANs for Semantic Face Editing

2. (2019) Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?

3. (2022) GAN Inversion: A Survey

4. (2021) Learning Transferable Visual Models From Natural Language Supervision

5. (2022) StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators


1. (2020) Interpreting the Latent Space of GANs for Semantic Face Editing

Interpreting GAN's Latent Space for Semantic Face Editing - EDPJ's Blog - CSDN Blog

The main ideas of this paper are:

  • For each image attribute (such as age, gender, glasses, head pose, etc.), there is a hyperplane in the latent space that separates the two sides of that attribute.
  • The normal vector of each hyperplane is the direction corresponding to that attribute.
  • Moving the latent code along the normal direction changes the corresponding attribute while leaving the other attributes largely unaffected (see the sketch below).
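
As a concrete illustration of these bullets, here is a minimal sketch (not the paper's released code): fit a linear classifier on latent codes labeled with one attribute, take its normal as the editing direction, and shift a latent code along it. The latent codes and attribute labels below are stand-ins.

```python
# Minimal sketch: find an attribute hyperplane in latent space with a linear
# classifier, then edit by moving along its normal.
import numpy as np
from sklearn.svm import LinearSVC

d = 512                                   # StyleGAN latent dimensionality
rng = np.random.default_rng(0)
Z = rng.normal(size=(10_000, d))          # stand-in: latent codes of sampled images
y = (Z[:, 0] > 0).astype(int)             # stand-in: attribute labels (e.g. "glasses")

svm = LinearSVC(C=1.0).fit(Z, y)          # linear decision boundary = separating hyperplane
n = svm.coef_[0] / np.linalg.norm(svm.coef_[0])   # unit normal = attribute direction

z = rng.normal(size=d)                    # latent code of the image to edit
z_edit = z + 3.0 * n                      # move along the normal to strengthen the attribute
# feeding z_edit to the pretrained generator should change mainly this attribute
```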

1.1 As the paper notes, moving the latent code produces continuous changes, and even though the training set lacks enough extreme-pose data, the GAN can still "imagine" what a profile face should look like. Could this be used to generate videos, or for data augmentation?

1.2 Because attributes are correlated, the direction for one attribute (such as glasses) has components along other attribute directions (such as age and gender). Subtracting those components by projection yields a new direction; moving along it has little effect on the other attributes (sketched below).
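
A minimal sketch of this projection step, assuming unit attribute normals are available (for example, obtained as in the previous sketch); the attribute names and stand-in vectors are illustrative.

```python
# Sketch of the projection ("conditional manipulation") idea.
import numpy as np

def remove_components(direction, others):
    """Subtract the projections of `direction` onto each vector in `others`."""
    v = direction / np.linalg.norm(direction)
    for u in others:
        u = u / np.linalg.norm(u)
        v = v - np.dot(v, u) * u          # drop the component along u
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
n_glasses, n_age, n_gender = rng.normal(size=(3, 512))   # stand-ins for learned normals
edit_dir = remove_components(n_glasses, [n_age, n_gender])
# moving a latent code along edit_dir changes "glasses" with little effect on age/gender
```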

1.3 The W space is obtained from the latent space Z through a learned mapping network (in StyleGAN, an MLP). In StyleGAN, W has the same dimensionality as Z, but its distribution is learned rather than fixed (this reminds me of normalizing flows: through a mapping, a simple distribution can be transformed into a more complex one, or a complex distribution into a simpler one).

W space has better attribute decoupling (disentanglement) properties than the latent space Z. An intuitive reading: the mapping network warps the space so that the various attributes become more separable; directions appear in W that behave like the result of the "subtraction" (projection) operation above. Even though such a direction may not be easy to describe in language, it acts as a largely independent direction (attribute). A toy sketch of the mapping network follows.
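
For intuition, a toy PyTorch sketch of the Z-to-W mapping network idea; the depth, width, and normalization here are simplified stand-ins rather than the official StyleGAN architecture.

```python
# Toy sketch of the Z -> W mapping network (an MLP, as in StyleGAN).
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        z = z / z.norm(dim=1, keepdim=True)   # simple normalization of z before mapping
        return self.net(z)                    # w: same dimensionality, learned distribution

w = MappingNetwork()(torch.randn(4, 512))     # 4 latent codes mapped from Z to W
```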

2. (2019) Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?

Image2StyleGAN: How to embed images into StyleGAN's hidden space (latent space)_EDPJ's Blog-CSDN Blog

2.1 To obtain better reconstructions than with the initial latent space Z or the intermediate latent space W, the paper uses an extended latent space W+: if StyleGAN has L style layers, L different latent codes w are concatenated and fed to the corresponding layers. The number of layers is determined by the output resolution, L = 2·log2(R) − 2, so the maximum resolution of 1024×1024 corresponds to an 18-layer structure (a quick check follows).
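
A quick sanity check of the layer-count relation; the resolutions listed are the standard StyleGAN output sizes.

```python
# Check of L = 2*log2(R) - 2 for common StyleGAN resolutions.
import math

for R in (32, 64, 128, 256, 512, 1024):
    L = 2 * int(math.log2(R)) - 2
    print(f"resolution {R:4d} -> {L} style layers")   # 1024 -> 18, so a W+ code has shape (18, 512)
```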

2.2 The paper has an interesting finding: a StyleGAN pretrained on faces can embed not only faces but also images from other domains (e.g., cats, dogs, cars). This laid groundwork for later few-shot and even zero-shot approaches.

2.3 By applying operations to the latent codes of images, such as interpolation (weighted sum), crossover (grafting), and addition/subtraction, the paper realizes image morphing, style transfer, and expression transfer, respectively (sketched below).
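
A minimal sketch of these latent-code operations, assuming two codes already embedded in the (18, 512)-shaped W+ space; which layers are swapped and how differences are scaled are illustrative choices, not the paper's exact recipe.

```python
# Sketch of latent-code operations in W+ space with stand-in codes.
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(2, 18, 512))     # stand-ins for two embedded latent codes

# morphing: weighted sum of the two codes
lam = 0.5
w_morph = lam * w1 + (1 - lam) * w2

# style transfer ("crossover"/grafting): coarse layers from w1, fine layers from w2
w_style = w1.copy()
w_style[9:] = w2[9:]                       # illustrative split between coarse and fine layers

# expression transfer: add a scaled difference vector (e.g. smiling minus neutral)
w_expr = w1 + 0.8 * (w2 - w1)              # w2 - w1 reused here as a stand-in difference
```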

2.4 During the optimization, to measure the similarity between the input image and the generated image, the paper uses a weighted combination of a perceptual loss (which compares high-level features, i.e., convolutional feature maps or embeddings, extracted by a pretrained network) and a pixel-wise MSE loss. The reason is that MSE alone does not yield a high-quality embedding, so the perceptual loss is needed as a regularization term that guides the optimization toward the right region of the latent space (a sketch follows).
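
A hedged PyTorch sketch of such a combined objective, using a pretrained VGG16 from a recent torchvision for the perceptual term; the chosen layers and weighting are illustrative, not necessarily the paper's exact settings.

```python
# Sketch: pixel-wise MSE plus a perceptual term on pretrained VGG16 features.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)                     # the feature extractor stays frozen

def features(x, layers=(3, 8, 15, 22)):         # a few ReLU activations of VGG16
    feats, h = [], x
    for i, m in enumerate(vgg):
        h = m(h)
        if i in layers:
            feats.append(h)
        if i == max(layers):
            break
    return feats

def embedding_loss(generated, target, lam=1e-5):
    percept = sum(F.mse_loss(f, g) for f, g in zip(features(generated), features(target)))
    pixel = F.mse_loss(generated, target)
    return percept + lam * pixel                # illustrative weighting of the two terms

# usage: optimize a latent code w (requires_grad=True) so that
# embedding_loss(G(w), target_image) decreases, with G a pretrained generator.
```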

3. (2022) GAN Inversion: A Survey

GAN inverse mapping (Inversion): a review - Programmer Sought

3.1 The core idea of GAN inversion: a convenient way to manipulate an image is to map the image (together with editing guidance based on text, image, video, audio, or other modalities) back into the latent space, obtaining the corresponding latent code (embedding, latent representation). Editing the latent code then indirectly edits the image.

3.2 There are three families of inversion methods:

  • Learning-based: train an encoder that maps an image to a latent code.
  • Optimization-based: optimize an objective directly to find the latent code that best reconstructs the image.
  • Hybrid: train an encoder to predict an approximate code and use it as the initialization for optimization (sketched below).
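
A toy sketch of the hybrid strategy, where the generator G, the encoder E, and the reconstruction loss are placeholders rather than a specific library's API.

```python
# Hybrid inversion sketch: encoder initialization followed by direct optimization.
import torch

def hybrid_invert(G, E, target, recon_loss, steps=500, lr=0.01):
    w = E(target).detach().clone()          # learning-based step: encoder gives an initial code
    w.requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):                  # optimization-based step: refine the code directly
        opt.zero_grad()
        loss = recon_loss(G(w), target)
        loss.backward()
        opt.step()
    return w                                # latent code whose reconstruction matches the target
```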

3.3 There are seven kinds of latent spaces:

  • Z space: the space randomly sampled from a simple distribution.
  • W space: obtained by mapping Z through a non-linear mapping network (a multi-layer perceptron). Compared with Z, the limitations of the fixed simple distribution are alleviated.
  • W+ space: one latent code w per generator layer, concatenated and fed to each layer's AdaIN. Compared with W, attributes are more strongly decoupled.
  • S space: obtained from W by applying each generator layer's own affine transformation. Compared with W, attributes are more strongly decoupled.
  • P space: most of the density of a high-dimensional Gaussian lies near the surface of a hypersphere; assuming the joint distribution of the latent codes is approximately multivariate Gaussian, images can be embedded on that hypersphere surface in Z space, giving the P space.
  • P_N space: obtained from P by a PCA whitening operation, which removes dependencies and redundancy.
  • P_N+ space: obtained by further extending the P_N space (analogously to how W+ extends W).

3.4 This survey also covers the material of the previous two papers.

4. (2021) Learning Transferable Visual Models From Natural Language Supervision

Learning Transferable Vision Models Based on Natural Language Supervision - Programmer Sought

4.1 Basic Principles

Contrastive Language-Image Pretraining (CLIP) jointly trains an image encoder and a text encoder to predict the correct pairings within a batch of (image, text) training examples. At test time, the learned text encoder embeds the names or descriptions of the target dataset's categories, and zero-shot classification is performed by matching these text embeddings against the embedding of the image to be classified (sketched below).
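
A zero-shot classification sketch assuming OpenAI's `clip` package is installed and a local image `photo.jpg` exists; the class prompts are illustrative.

```python
# Zero-shot classification with CLIP: match an image embedding against text embeddings.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)     # image embedding
    text_features = model.encode_text(text)        # one embedding per class prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(prompts, probs[0].tolist())))       # highest probability = predicted class
```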

4.2 Performance

The experiments show that zero-shot CLIP slightly outperforms few-shot linear classifiers (logistic regression) unless the number of shots exceeds a certain threshold (e.g., 16). This is likely due to the difference between the two settings: CLIP's zero-shot classifier is specified through natural language, which can directly describe the relevant attributes, whereas supervised learning must infer them indirectly from the training examples. The drawback of such context-free, example-based learning is that many different hypotheses remain consistent with the data, especially in the one-shot case; a single image usually contains many different visual concepts, making it hard to tell which one is the key to the prediction.

4.3 Limitations

Models driven by natural-language prompts cannot handle attributes that are hard to describe in words, so some example images may be needed as references. As mentioned above, switching from zero-shot to few-shot can actually degrade performance. Performance can be recovered by increasing the number of examples (beyond the threshold), but for some classes examples are very scarce.

5. (2022) StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

StyleGAN-NADA: CLIP-Guided Non-Adversarial Domain Adaptation (Domain Adaptation) Image Generator_EDPJ's Blog-CSDN Blog

5.1 Non-Adversarial Domain Adaptation

5.2 Contrastive Language-Image Pretraining (CLIP) models

  • With a contrastive learning objective, both images and text are mapped into a joint multimodal embedding space (a latent space).
  • The representations of images and text in this space are called latent codes (embeddings).
  • Images can be generated or modified by optimizing a latent code (sketched after this list).
  • Image attributes such as age and gender are linearly separable in the latent space, i.e., each has a separating hyperplane. By moving along the normal directions of different attributes' hyperplanes, one or several attributes can be edited without interfering with each other.
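
A sketch of CLIP-guided latent optimization in this spirit (a simple global similarity loss, not StyleGAN-NADA's exact directional loss); the generator G and the latent code are placeholders, and CLIP's input normalization is omitted for brevity.

```python
# Sketch: move a latent code so the generated image matches a text prompt in CLIP space.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()                                   # keep everything in fp32

with torch.no_grad():
    tokens = clip.tokenize(["a smiling face"]).to(device)         # illustrative target prompt
    text_emb = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)

def clip_loss(images):
    images = F.interpolate(images, size=224, mode="bilinear")     # CLIP's input resolution
    img_emb = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    return 1.0 - (img_emb * text_emb).sum(dim=-1).mean()          # 1 - cosine similarity

# usage with placeholders: w = latent code (requires_grad=True), G = pretrained generator
# opt = torch.optim.Adam([w], lr=0.01)
# for _ in range(200):
#     opt.zero_grad(); clip_loss(G(w)).backward(); opt.step()
```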

5.3 Adaptive layer selection, or adaptive trainable weight selection

  • Principle: during training, different layers of a deep network extract different features; conversely, during generation, different layers generate different aspects of the image.
  • The essence of domain transfer: based on the difference between the source and target domains, keep some key information of the source-domain image and change the rest, so that the generated image satisfies the target domain while retaining the key content of the source domain.
  • Freeze the layers most relevant to generating that key information and train the remaining layers (see Section 4.2 of the paper on layer freezing; a toy sketch follows).
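
A toy sketch in the spirit of this adaptive layer selection, with G, sample_wplus, and clip_loss as placeholders for a StyleGAN-like setup: briefly optimize per-layer W+ codes under the CLIP loss, rank layers by how much their codes moved, and keep only the top-k generator layers trainable.

```python
# Sketch of adaptive layer selection via a short warm-up optimization of W+ codes.
import torch

def select_layers_to_train(G, sample_wplus, clip_loss, k=3, warmup_steps=10, lr=0.01):
    w_plus = sample_wplus().detach().clone()        # shape (batch, n_layers, 512)
    w_orig = w_plus.clone()
    w_plus.requires_grad_(True)
    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(warmup_steps):                   # optimize the codes only, generator untouched
        opt.zero_grad()
        clip_loss(G(w_plus)).backward()
        opt.step()
    movement = (w_plus - w_orig).abs().mean(dim=(0, 2))   # per-layer change magnitude
    return movement.topk(k).indices.tolist()              # layers to leave trainable

# afterwards: freeze all generator layers except the selected ones and fine-tune with the CLIP loss
```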

5.4 Segmentation masks (Appendix D)

  • Similar in spirit to layer selection, except that the mask operates directly on the image.
  • Keep the specified region of the image (the region that should not change), mask out the remaining region, and send it to CLIP so that only the masked region is converted (a minimal sketch follows).
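
One simple way to realize the masking idea, sketched with stand-in arrays; the real pipeline feeds the masked region to CLIP for guidance, while here only the final compositing of preserved and converted regions is shown.

```python
# Sketch: preserve one region of the image and take the edited result elsewhere.
import numpy as np

def composite(original, edited, mask):
    """Blend images of shape (H, W, 3); mask is (H, W, 1) with 1 = preserve original."""
    return mask * original + (1.0 - mask) * edited

H, W = 256, 256
original = np.random.rand(H, W, 3)            # stand-in source image
edited = np.random.rand(H, W, 3)              # stand-in CLIP-guided edit of the full image
mask = np.zeros((H, W, 1)); mask[:128] = 1.0  # illustrative mask: preserve the top half
result = composite(original, edited, mask)
```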

5.5 Like the papers above, this work also considers few-shot and zero-shot settings. Probably because of differences in the framework, moving from zero-shot to few-shot (from text-guided to image-guided) does not show a performance drop here, and the few-shot, image-guided setting can handle attributes that are hard to describe in words. But another problem appears: since CLIP's embedding space is semantic in nature, image-based guidance cannot guarantee an accurate transfer of that exact style.

Origin blog.csdn.net/qq_44681809/article/details/129753762