InfoMax-GAN: Improved Adversarial Image Generation via Information Maximization and Contrastive Learning


Table of contents

0. Summary

1. Introduction

2. Background

3. InfoMax-GAN

3.1 Framework

3.2 Contrastive Loss

3.3 Mitigation of Catastrophic Forgetting

3.4 Alleviating Mode Collapse

4. Experiment

4.1 Experiment Settings

4.2 Generation Performance Evaluation

4.3 Training Stability

4.4 Low Computational Cost

4.5 Ablation Studies

5. Supplement: spectral normalization

5.1 Theoretical Basis

5.2 Algorithm

6. References



0. Summary

Although Generative Adversarial Networks (GANs) underlie many generative models, they still face fundamental problems. In this paper, the authors propose a principled framework to simultaneously alleviate two of them: catastrophic forgetting in the discriminator and mode collapse in the generator. The approach adds contrastive learning and mutual information maximization to the GAN, and the reasons for the performance improvement are examined through extensive analyses. Compared with recent work, this method greatly stabilizes GAN training and improves the quality of generated images. In particular, on image domains such as faces, it outperforms the state-of-the-art SSGAN. The approach is practical and easy to implement: it involves only one auxiliary objective, has low computational cost, and performs well across a wide range of training settings and datasets without any hyperparameter tuning.

1. Introduction

GAN is a generative model known for its sampling efficiency in generating high-fidelity data. A GAN consists of two modules, a discriminator D and a generator G, trained on the minimax objective

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

where V is the objective value, p_z is the prior noise distribution, p_r(x) is the true data distribution, and G(z) is the data generated from sampled random noise z.

Training the generator and discriminator with their respective loss functions is equivalent to minimizing the JS divergence (Jensen-Shannon divergence) between the true data distribution and the generated data distribution. However, training GANs is notoriously difficult. First, the underlying theory assumes the discriminator is trained to optimality, which in practice can lead to saturating gradients. Even then, there is no guarantee that the optimization converges, because the discriminator and generator are optimized independently and simultaneously in a high-dimensional parameter space. Finally, GANs suffer from mode collapse: the generated distribution covers only part of the modes of the real distribution, so the diversity of generated samples is limited. Many recent studies have therefore tried to address these problems.
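As a concrete illustration of the objective above, here is a minimal PyTorch-style sketch of the two standard (non-saturating) GAN losses; disc and gen are placeholders for any discriminator returning logits and any generator returning images, not the paper's actual models.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, gen, real_images, z):
    """Standard GAN discriminator loss: -E[log D(x)] - E[log(1 - D(G(z)))]."""
    fake_images = gen(z).detach()  # stop gradients from flowing into the generator
    real_logits = disc(real_images)
    fake_logits = disc(fake_images)
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(disc, gen, z):
    """Non-saturating generator loss: -E[log D(G(z))]."""
    fake_logits = disc(gen(z))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```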

The main reason for the instability of GAN training is the non-stationary training environment: as the generator learns, the model distribution that the discriminator faces keeps changing. Because the discriminator is a neural network, it is prone to forgetting: as its parameters are updated during training, it focuses only on the current task and forgets previous ones, which further destabilizes training. The recent Self-Supervised GAN (SSGAN) proposes a method that alleviates forgetting in the discriminator and thereby improves training stability. However, SSGAN cannot resolve mode collapse, and it fails on certain image domains (e.g. human faces). Furthermore, while SSGAN alleviates forgetting in the discriminator, it actually promotes mode collapse in the generator.

To address these issues, the authors propose a method that alleviates forgetting and mode collapse simultaneously. On the discriminator side, long-term representation learning is improved by maximizing mutual information, which reduces forgetting in the non-stationary training environment. On the generator side, contrastive learning forces the generator to produce diverse images (so that positive and negative examples remain distinguishable), which mitigates mode collapse.

2. Background

The InfoMax objective maximizes the mutual information between the input and its encoded representation:

\max_{E \in \mathcal{E}} I(X; E(X))

where X is the input, E is an encoder used to extract the most important features of X, and \mathcal{E} is the class of functions the encoder is drawn from. Instead of optimizing this directly, one can maximize I(C_\psi(X); E_\psi(X)), where C_\psi and E_\psi are encoders sharing the same architecture. Since C_\psi(X) and E_\psi(X) are functions of X, maximizing I(C_\psi(X); E_\psi(X)) maximizes a lower bound of the InfoMax objective.

Maximizing I(C_\psi(X); E_\psi(X)) has the following advantages:

  • Using different encoders can obtain different perspectives and modalities of the data, thereby improving the flexibility of the model.
  • Compared with the original data, the encoded features lie in a lower-dimensional latent space, which reduces the computational cost.

Recent unsupervised representation learning uses contrastive methods to maximize the mutual information between local and global features. However, directly maximizing mutual information is usually infeasible, so it is typically replaced by maximizing the InfoNCE lower bound: a critic is trained so that, for each positive pair, the contrastive loss against a set of negative pairs is minimized. These positive and negative pairs are produced by matching features, by data augmentation, or by a combination of the two. The method in this paper also maximizes the InfoNCE lower bound, and is most similar to Deep InfoMax (which maximizes mutual information between local and global features).
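For reference, this is the standard InfoNCE result that justifies the substitution (a well-known bound from the contrastive-learning literature, not a derivation specific to this paper): with a critic g scoring pairs and a candidate set \mathcal{X} of N samples containing exactly one positive,

```latex
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\!\left[\log
      \frac{\exp g(x, y)}{\sum_{x' \in \mathcal{X}} \exp g(x', y)}\right],
\qquad
I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}} .
% Minimizing the contrastive loss therefore maximizes a lower bound on the
% mutual information, and the bound tightens as the number of negatives grows.
```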

3. InfoMax-GAN

3.1 Framework

The figure below is the framework of InfoMax-GAN.

First, the lower bound of I(C_\psi(X); E_\psi(X)) is maximized. E_\psi denotes the discriminator layers that produce the global feature, and C_\psi denotes the layers that produce the local features. C_\psi = C_{\psi,1} \circ \ldots \circ C_{\psi,n} consists of n intermediate discriminator layers, and f_\psi is the layer that converts the local features into the global feature, which is finally used to compute the GAN objective L_{GAN}. The local and global features are, respectively, the penultimate and final output features of the discriminator's encoder.

In the next step, the local features C_\psi(x) and the global feature E_\psi(x) are sent to the critic networks \Phi_\theta and \Phi_\omega, which project them into an RKHS (Reproducing Kernel Hilbert Space) where the similarity between local and global features is measured. The projected features are then formed into positive and negative pairs through contrastive pairing. Given an image x, a positive pair is obtained by matching the projected global feature vector \Phi_\omega(E_\psi(x)) with one of the projected local feature vectors \Phi_\theta(C_\psi^{(i)}(x)), where i \in A = \{0, 1, \ldots, M^2 - 1\} indexes the M \times M spatial locations of the local feature map. A positive pair can therefore be written as (\Phi_\theta(C_\psi^{(i)}(x)), \Phi_\omega(E_\psi(x))). For each positive pair, the negatives are taken from other images x' in the same mini-batch, written as (\Phi_\theta(C_\psi^{(i)}(x')), \Phi_\omega(E_\psi(x))). Only the first term differs, so that the global feature is trained to have high mutual information with the local features of its own image, but not with the local features of other images.

3.2 Contrastive Loss

For the N images in a mini-batch, each of the N \times M^2 positive pairs has to be identified among the batch's candidates, giving the contrastive (InfoNCE) loss

L_{NCE}(x) = -\,\mathbb{E}_{i \in A}\!\left[\log \frac{\exp\big(g_{\theta,\omega}(C_\psi^{(i)}(x), E_\psi(x))\big)}{\sum_{x' \in \mathcal{X}} \sum_{j \in A} \exp\big(g_{\theta,\omega}(C_\psi^{(j)}(x'), E_\psi(x))\big)}\right]

where g_{\theta,\omega} : \mathbb{R}^{1 \times 1 \times K} \times \mathbb{R}^{1 \times 1 \times K} \to \mathbb{R} is a critic that maps a pair of K-dimensional local/global features to a scalar. It is defined as

g_{\theta,\omega}(C_\psi^{(i)}(x), E_\psi(x)) = \Phi_\theta(C_\psi^{(i)}(x))^\top \, \Phi_\omega(E_\psi(x))

where \Phi_\theta : \mathbb{R}^{M \times M \times K} \to \mathbb{R}^{M \times M \times R} and \Phi_\omega : \mathbb{R}^{1 \times 1 \times K} \to \mathbb{R}^{1 \times 1 \times R} are the critic networks that project the local and global features into a higher-dimensional RKHS. In practice, \Phi_\theta and \Phi_\omega are shallow networks with only one hidden layer, but with spectrally normalized weights (see the spectral normalization references in Section 6). These shallow networks only project the feature dimension of their inputs (K \to R) while preserving the original spatial sizes (M \times M and 1 \times 1).
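To make the construction concrete, here is a minimal PyTorch-style sketch of one-hidden-layer, spectrally normalized critics and the contrastive loss over local/global features. The tensor shapes, hidden width, and the exact arrangement of the softmax are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Critic(nn.Module):
    """One-hidden-layer MLP with spectrally normalized weights; projects the
    channel dimension K -> R while leaving spatial positions untouched."""
    def __init__(self, k, r, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(k, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, r)),
        )

    def forward(self, feat):                          # feat: (N, K, H, W)
        n, k, h, w = feat.shape
        flat = feat.permute(0, 2, 3, 1).reshape(-1, k)
        return self.net(flat).reshape(n, h * w, -1)   # (N, H*W, R)

def infonce_loss(local_feat, global_feat, phi_theta, phi_omega):
    """Each of the N*M^2 projected local features forms a positive pair with the
    projected global feature of its own image; all N*M^2 local features in the
    batch serve as candidates in the softmax."""
    n = local_feat.size(0)
    local_proj = phi_theta(local_feat)                # (N, M^2, R)
    global_proj = phi_omega(global_feat).squeeze(1)   # (N, R)
    m2 = local_proj.size(1)
    all_locals = local_proj.reshape(n * m2, -1)       # (N*M^2, R)
    # similarity of every image's global feature with every local feature
    scores = global_proj @ all_locals.t()             # (N, N*M^2)
    log_denom = torch.logsumexp(scores, dim=1, keepdim=True)
    # positives: image i paired with its own M^2 local features (diagonal blocks)
    idx = torch.arange(n, device=scores.device)
    pos = scores.reshape(n, n, m2)[idx, idx]          # (N, M^2)
    return -(pos - log_denom).mean()

# Usage with hypothetical feature shapes: K = 128 channels, 4 x 4 local map, R = 1024.
phi_theta, phi_omega = Critic(128, 1024), Critic(128, 1024)
local_feat = torch.randn(8, 128, 4, 4)    # stands in for C_psi(x)
global_feat = torch.randn(8, 128, 1, 1)   # stands in for E_psi(x)
loss = infonce_loss(local_feat, global_feat, phi_theta, phi_omega)
```

Here each row of scores compares one image's projected global feature with all N·M² projected local features in the batch, which matches the counting used later in Section 3.4.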

To stabilize training, the discriminator and generator are restricted to learn only from the contrastive loss of fake image features, not from that of real image features. The discriminator and generator losses are

L_D = L_{GAN}(D, \hat{G}) + \beta \, L_{NCE}(X_g)
L_G = L_{GAN}(\hat{D}, G) + \alpha \, L_{NCE}(X_g)

where \alpha, \beta are hyperparameters; \hat{D}, \hat{G} denote the discriminator and generator held fixed; X_r, X_g denote the sets of real and generated images respectively; and L_{GAN} is the GAN objective from Section 1.

In practice, for simplicity, \alpha = \beta = 0.2 is used in all experiments. The ablation study shows that InfoMax-GAN works well over a wide range of \alpha, \beta values.
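Putting the pieces together, below is a hedged sketch of how the two objectives might be assembled in a training step. Everything passed in (gan_loss_d, gan_loss_g, a contrastive_loss such as the infonce_loss above, and the feature-extraction helper) is an assumed callable, not an API from the paper's code.

```python
def infomax_gan_losses(disc, gen, real_images, z,
                       gan_loss_d, gan_loss_g, contrastive_loss, extract_features,
                       alpha=0.2, beta=0.2):
    """Combine the GAN losses with the contrastive term, computed on fake-image
    features only: beta weights the discriminator's term, alpha the generator's."""
    fake_images = gen(z)

    # Discriminator objective: usual GAN loss (generator fixed via detach)
    # plus the beta-weighted contrastive loss on fake features.
    local_f, global_f = extract_features(disc, fake_images.detach())
    loss_d = (gan_loss_d(disc, real_images, fake_images.detach())
              + beta * contrastive_loss(local_f, global_f))

    # Generator objective: usual GAN loss plus the alpha-weighted contrastive
    # loss on fake features (gradients flow back into the generator).
    local_f, global_f = extract_features(disc, fake_images)
    loss_g = (gan_loss_g(disc, fake_images)
              + alpha * contrastive_loss(local_f, global_f))
    return loss_d, loss_g
```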

3.3 Mitigation of Catastrophic Forgetting

The author trained a classifier on a one-vs-all CIFAR-10 classification task in which the class distribution changes every 1K iterations and cycles through all classes every 10K iterations. The test results are shown in the figure above:

  • Without InfoMax, the classifier overfits to a certain class distribution, so when the class distribution changes, the accuracy is low.
  • With InfoMax, the classifier still remembers all previously seen classes when the class distribution changes.

3.4 Alleviating Mode Collapse

The author trained a discriminator on the contrastive task using the CIFAR-10 training data, and simulated three kinds of generators using the CIFAR-10 test data.

As shown in the figure above, the perfect generator without mode collapse can handle contrastive tasks very well.

A generator with full mode collapse can only produce one type of image, so its accuracy on the contrastive task, p(C_\psi^{(i)}(x), E_\psi(x) \mid X), is 0. For any N images, a total of N \times M^2 samples must be classified in the contrastive task, and every positive pair has N \times M^2 - 1 negatives. However, if all N images are identical due to full mode collapse, then N - 1 of those negatives are identical to every positive, making the contrastive task almost impossible to solve. Therefore, to perform well on the contrastive task, the generator must produce more diverse images, which reduces mode collapse.
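As a concrete illustration of this counting argument (the batch and feature-map sizes below are hypothetical, chosen only for the example):

```latex
% Suppose a mini-batch of N = 64 images with an M \times M = 4 \times 4 grid of
% local features per image. The contrastive task then has
N \times M^2 = 64 \times 16 = 1024 \ \text{candidates per global feature},
\qquad
N \times M^2 - 1 = 1023 \ \text{negatives per positive pair}.
% Under full mode collapse all 64 images are identical, so N - 1 = 63 of those
% negatives are indistinguishable from the positive and the task cannot be solved.
```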

As shown in the figure above, generating only some categories of images (partial mode collapse) likewise leads to a drop in performance.

4. Experiment

4.1 Experiment Settings

  • Training: All models are trained with the same residual network backbone.
  • Evaluation: The quality of the generated images is evaluated with three metrics: Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). Broadly, FID and KID assess the diversity of the generated images (smaller is better), while IS assesses their quality (larger is better). All reported values are the mean and standard deviation over three runs.

4.2 Generation Performance Evaluation

The results show that InfoMax-GAN performs strongly. In the FID column, InfoMax-GAN's gain over SSGAN is larger on CIFAR-100 than on CIFAR-10. The authors attribute this to SSGAN's tendency to generate easily rotated images, which sacrifices diversity; with more classes, this loss of diversity is penalized more heavily by FID.

4.3 Training Stability

This section tests training stability by evaluating how the model's performance changes when the hyperparameters vary over a wide range. The hyperparameters include the Adam parameters (\beta_1, \beta_2) and the number of discriminator updates per generator update, n_{dis}. All of these settings are taken from previous GAN works with strong performance.

The FID results above show that InfoMax-GAN performs better even in settings where GAN training does not converge well. Across the different hyperparameter settings, the FID of InfoMax-GAN stays consistently lower than that of SNGAN, which shows that InfoMax-GAN is robust to hyperparameters and can achieve good performance without any tuning. This is useful in practice: when training a new GAN or working with a new dataset, poorly chosen hyperparameters can make training very unstable, and robustness to hyperparameters mitigates this problem.

As shown in the figure, the red and blue curves indicate how the FID of SNGAN and InfoMax-GAN, respectively, changes as the number of iterations increases. InfoMax-GAN converges faster and keeps improving throughout training, which stabilizes GAN training. This is attributed to the additional constraint that the global feature must have high mutual information with all of its local features. This constrains the space of the generated data distribution, so the generated data changes less between iterations, which in turn stabilizes the GAN training environment. This is a practical benefit when training GANs under a fixed computational budget, since significant performance gains are obtained early in training.

4.4 Low Computational Cost

As shown, the additional training time of InfoMax-GAN is small. This is because, in practice, only two shallow (one hidden layer) MLP networks are needed to compute the contrastive loss. Also, the time required at a small n_{dis} is much less than at a large n_{dis}, because a large n_{dis} is the main bottleneck in training time.

The data in the figure above is meant to show that InfoMax-GAN's training has low computational cost and requires little time. But when I first saw this figure, I thought the SNGAN and InfoMax-GAN labels had been swapped, because InfoMax-GAN clearly takes longer than SNGAN. After re-reading the descriptions in this section several times, I guessed that InfoMax-GAN is not really meant to be compared against SNGAN here. But if it is not compared with SNGAN, then there is no other GAN's training time in the figure and no reference baseline in the text, so it is hard to appreciate how low InfoMax-GAN's computational cost is.

I then found the answer in the related references. The paper "Spectral Normalization for Generative Adversarial Networks" states that the relative computational cost of the power iteration is negligible compared to the cost of forward and backward propagation on CIFAR-10. In other words, the extra computation of SNGAN is negligible relative to forward and backward propagation on CIFAR-10. Therefore, comparing InfoMax-GAN against SNGAN can indeed demonstrate InfoMax-GAN's low additional computational cost (even though this cannot be seen from the figure alone).

4.5 Ablation Studies

An ablation study removes some of the proposed improvements from the final model (for example, removing several network layers) to verify that each improvement is actually necessary.
(In practice, running an ablation study sometimes reveals that removing an "improvement" actually gives better results.)

RKHS Dimension (R): As shown in the figure above, for different R values, the FID of InfoMax-GAN remains at a low level stably, reflecting the robustness. This is because the critics of InfoMax-GAN are MLP networks with only a single hidden layer, which is enough to achieve a good representation.

As shown in the figure above, for different hyperparameters \alpha, \beta:

  • InfoMax-GAN keeps a low FID across the different settings and achieves the best performance at \alpha = \beta = 0.2.
  • Figure (b) shows that the discriminator's contrastive term is important for improving GAN performance: holding \alpha = 0.2, as \beta increases from 0.0 to 0.2 the FID gradually decreases.
  • When \alpha = 0, \beta = 0.2, InfoMax-GAN already achieves good performance just by adding the contrastive term to the discriminator objective as a regularizer. This is because maximizing mutual information reduces catastrophic forgetting, thereby stabilizing GAN training.
  • Increasing \alpha further improves performance, because the generator's regularization term helps reduce mode collapse and thus improves FID.

5. Supplement: spectral normalization

5.1 Theoretical Basis

Lipschitz continuity is a property that limits how rapidly a function can change. Taking a one-dimensional function as an example, if the function is Lipschitz continuous, then for every point on its graph we can place a (double) cone centered at that point such that the graph lies entirely outside the cone, as shown below.

If a one-dimensional function is differentiable, then its Lipschitz constant K is the maximum absolute value of its derivative. Lipschitz continuity requires K to be bounded, which limits the gradients of the discriminator and thereby alleviates exploding gradients during training.

For a general differentiable function, the Lipschitz constant is the supremum of the spectral norm (largest singular value) of its gradient; for a linear map W, it is simply the largest singular value, i.e. the spectral norm, \sigma(W).

The premise of spectral normalization is that for any multi-layer discriminator (a composition of linear maps and nonlinear activations), the Lipschitz constant, or at least an upper bound on it, can be computed.
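Concretely, for a discriminator f composed of linear layers W_l and 1-Lipschitz activations such as ReLU, the standard composition bound gives the upper bound mentioned above (this is the argument from the spectral normalization paper, stated here in compressed form):

```latex
\|f\|_{\mathrm{Lip}}
  \;\le\; \prod_{l=1}^{L} \|W_l\|_{\mathrm{Lip}}
  \;=\;   \prod_{l=1}^{L} \sigma(W_l),
\qquad
\text{so replacing each } W_l \text{ with } \bar{W}_l = W_l / \sigma(W_l)
\text{ enforces } \|f\|_{\mathrm{Lip}} \le 1 .
```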

5.2 Algorithm

Randomly initialize two vectors u, v, and normalize the network weight as W \leftarrow W / \sigma(W), where \sigma(W) is the largest singular value of W. \sigma(W) is estimated by power iteration, which proceeds as follows:

\begin{aligned}
v_{t+1} &= W^\top u_t \,/\, \lVert W^\top u_t \rVert_2 \\
u_{t+1} &= W v_{t+1} \,/\, \lVert W v_{t+1} \rVert_2 \\
\sigma(W) &\approx u_{t+1}^\top W v_{t+1}
\end{aligned}

After enough iterations, u and v converge to the left and right singular vectors of W associated with its largest singular value.

Compared with the cost of gradient-descent training itself (forward and backward propagation), the computation added by this algorithm is negligible.
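A minimal PyTorch sketch of this power iteration, assuming a plain 2-D weight matrix (convolution kernels would first be reshaped to 2-D); the function and variable names are illustrative:

```python
import torch

def power_iteration_sigma(w, u=None, n_iters=1, eps=1e-12):
    """Estimate the largest singular value of a 2-D weight matrix `w`.
    In training, `u` is kept as a persistent buffer between steps, so a
    single iteration per update is usually enough."""
    out_dim, _ = w.shape
    if u is None:
        u = torch.randn(out_dim)
    for _ in range(n_iters):
        v = w.t() @ u
        v = v / (v.norm() + eps)   # right singular vector estimate
        u = w @ v
        u = u / (u.norm() + eps)   # left singular vector estimate
    sigma = torch.dot(u, w @ v)    # approximate largest singular value
    return sigma, u

# Usage: normalize a weight so that its spectral norm is approximately 1.
w = torch.randn(128, 64)
sigma, u = power_iteration_sigma(w, n_iters=5)
w_sn = w / sigma
```

PyTorch's built-in torch.nn.utils.spectral_norm wraps essentially this procedure around a layer's weight.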

6. References

  1. Lee K S, Tran N T, Cheung N M. InfoMax-GAN: Improved adversarial image generation via information maximization and contrastive learning. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021: 3942-3952. Available at: WACV 2021 Open Access Repository (thecvf.com)
  2. Spectral Normalization for GAN (bzdww)
  3. Spectral Normalization Explained
  4. Miyato T, Kataoka T, Koyama M, et al. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018. Available at: [1802.05957] Spectral Normalization for Generative Adversarial Networks (arxiv.org)
  5. What is an ablation study? (CSDN blog)
