Improved adversarial image generation via information maximization and contrastive learning
Official account: EDPJ
InfoMax-GAN: Improved Adversarial Image Generation via Information Maximization and Contrastive Learning
0. Summary
Although Generative Adversarial Networks (GANs) underlie many generative models, they still suffer from several problems. In this paper, the authors propose a principled framework that simultaneously alleviates two fundamental problems of GANs: catastrophic forgetting in the discriminator and mode collapse in the generator. The approach adds contrastive learning and mutual information maximization to the GAN, and extensive analyses are carried out to understand where the performance gains come from. Compared with recent work, the method substantially stabilizes GAN training and improves the quality of the generated images. In particular, on image domains such as faces, it outperforms the state-of-the-art SSGAN. The approach is practical and easy to implement: it involves only one auxiliary objective, has a low computational cost, and performs well across a wide range of training settings and datasets without any hyperparameter tuning.
1 Introduction
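A GAN is a generative model known for its sampling efficiency in generating high-fidelity data. It consists of two modules, a discriminator D and a generator G, which are trained against each other on the following objective:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
where V is the value function, $p_z$ is the prior noise distribution, $p_r$ is the true data distribution, and $G(z)$ is the data generated from sampled random noise z.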
Training the generator and discriminator with their respective loss functions is equivalent to minimizing the Jensen-Shannon (JS) divergence between the true data distribution and the generated data distribution. However, training GANs is notoriously difficult. First, the theory assumes that the discriminator is trained to optimality, which in practice can lead to saturating gradients. Second, even then, convergence is not guaranteed, because the discriminator and generator are optimized independently and simultaneously in a high-dimensional space. Finally, GANs suffer from mode collapse: the generated distribution fits only some of the modes of the real distribution, which limits the diversity of the generated samples. Many studies in recent years have therefore tried to solve these problems.
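To make the JS divergence and its link to mode collapse concrete, here is a small sketch on toy discrete distributions. The four-mode distributions `real`, `partial` and `collapsed` are illustrative assumptions, not data from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy "true" distribution with four equally likely modes (hypothetical).
real = [0.25, 0.25, 0.25, 0.25]
# Hypothetical generators that cover only two modes, or a single mode.
partial = [0.5, 0.5, 0.0, 0.0]
collapsed = [1.0, 0.0, 0.0, 0.0]

print(js_divergence(real, real))
print(js_divergence(real, partial))
print(js_divergence(real, collapsed))
```

The fewer modes the generator covers, the larger the divergence from the real distribution, and the value never exceeds log 2.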
The main reason GAN training is unstable is its non-stationary training environment: as the generator learns, the model distribution that the discriminator faces keeps changing. Because the discriminator is a neural network, it is prone to forgetting: as its parameters are updated during training, it attends only to the current task and forgets earlier ones, which also destabilizes training. The recent Self-supervised GAN (SSGAN) proposes a method that alleviates forgetting in the discriminator and thereby improves training stability. However, this method does not resolve mode collapse, and it fails on some image domains (e.g. human faces). Furthermore, while SSGAN alleviates forgetting in the discriminator, it promotes mode collapse in the generator.
To address these issues, the authors propose a method that alleviates forgetting and mode collapse simultaneously. On the discriminator side, maximizing mutual information improves long-term representation learning, which reduces forgetting in the non-stationary training environment. On the generator side, contrastive learning forces the generator to produce diverse images (so that positive and negative examples can be told apart), which mitigates mode collapse.
2. Background
The InfoMax objective maximizes the mutual information between an input and its encoding:
$$\max_{E \in \mathcal{E}} I(X; E(X))$$
Here X is the input, E is an encoder that extracts the most important features of X, and $\mathcal{E}$ is a class of functions. In practice, this is done by maximizing $I(C_\psi(X); E_\psi(X))$, where $C_\psi$ and $E_\psi$ are encoders from the same architecture. Maximizing $I(C_\psi(X); E_\psi(X))$ is equivalent to maximizing a lower bound of the InfoMax objective:
$$I(C_\psi(X); E_\psi(X)) \le I(X; (C_\psi(X), E_\psi(X)))$$
Maximizing the mutual information between two encodings has the following advantages:
- Different encoders can capture different views and modalities of the data, which makes the model more flexible.
- Compared with the original data, the encoded data lies in a lower-dimensional latent space, which reduces the computational burden.
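The lower-bound relationship between two encodings and the full InfoMax objective can be checked on a toy example. In the sketch below, the input X is uniform on eight values and the two "encoders" `C` and `E` are hypothetical integer-division maps chosen purely for illustration. It computes the mutual information between the two encodings and compares it with the mutual information between X and the pair of encodings.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(A; B) in bits, estimated from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )

xs = list(range(8))          # X is uniform on {0, ..., 7}

def C(x):
    """Toy 'local' encoder that keeps two bits of x (hypothetical)."""
    return x // 2

def E(x):
    """Toy 'global' encoder that keeps one bit of x (hypothetical)."""
    return x // 4

# Mutual information between the two encodings.
mi_views = mutual_information([(C(x), E(x)) for x in xs])
# Mutual information between the input and the pair of encodings.
mi_joint = mutual_information([(x, (C(x), E(x))) for x in xs])

print(mi_views, mi_joint)
```

Here the mutual information between the encodings is 1 bit, while the pair of encodings shares 2 bits with the input, so the first quantity is indeed a lower bound on the second.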
Recent unsupervised representation learning methods use a contrastive approach to maximize the mutual information between local and global features. Maximizing mutual information directly is usually infeasible, so it is typically replaced by maximizing the InfoNCE lower bound: given a critic, the contrastive loss is minimized when the positive example is correctly picked out from a set of negative examples. The positive and negative examples are created by pairing features, by data augmentation, or by a combination of both. The method in this paper also maximizes the InfoNCE lower bound and is most similar to Deep InfoMax (which maximizes mutual information between local and global features).
3. InfoMax-GAN
3.1 Framework
The figure below is the framework of InfoMax-GAN.
The first step is to maximize the lower bound $I(C_\psi(X); E_\psi(X))$. Here $E_\psi$ denotes the discriminator layers that produce the global features, and $C_\psi$ denotes the layers that produce the local features. $C_\psi$ consists of n intermediate discriminator layers; a subsequent layer converts the local features into the global features, which are finally used to compute the GAN objective. The local and global features are the penultimate and final output features of the discriminator's encoder, respectively.
In the next step, the local and global features are passed to the critic networks and projected into a Reproducing Kernel Hilbert Space (RKHS), where the similarity between local and global features is measured. Positive and negative examples are then formed from the projected features by contrastive pairing. Given an image x, a positive example is obtained by pairing the projected global feature vector $E_\psi(x)$ with one of the projected local feature vectors $C_\psi^{(i)}(x)$, where i is the index of the local feature. A positive example can therefore be written as $(C_\psi^{(i)}(x), E_\psi(x))$. For each positive example, the negative examples come from a different image $x'$ in the same mini-batch and are written as $(C_\psi^{(i)}(x'), E_\psi(x))$. Only the first term differs, so that mutual information is maximized between the global features and the local features of the same image, but not the local features of other images.
3.2 Contrastive Loss
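For the N images in a mini-batch, each positive example $(C_\psi^{(i)}(x), E_\psi(x))$ has to be classified correctly against the negative examples, which gives the contrastive loss:
$$L_{nce}(X) = -\mathbb{E}_{x \in X}\,\mathbb{E}_{i \in \mathcal{A}}\left[\log \frac{\exp\big(g_{\theta,\omega}(C_\psi^{(i)}(x), E_\psi(x))\big)}{\sum_{(x', j) \in X \times \mathcal{A}} \exp\big(g_{\theta,\omega}(C_\psi^{(j)}(x'), E_\psi(x))\big)}\right]$$
where X is the set of images in the mini-batch and $\mathcal{A}$ is the set of local feature indices. Here $g_{\theta,\omega}$ is the critic, which maps a pair of K-dimensional local and global features to a scalar score. It is defined as:
$$g_{\theta,\omega}(C_\psi^{(i)}(x), E_\psi(x)) = \phi_\theta(C_\psi^{(i)}(x))^\top \phi_\omega(E_\psi(x))$$
where $\phi_\theta$ and $\phi_\omega$ are the critic networks, which project the local and global features into the high-dimensional RKHS. In practice, they are shallow networks with only one hidden layer and spectrally normalized weights (reference 1, reference 2). These shallow networks only project the feature dimension of the input features and preserve the original spatial size.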
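The contrastive loss over local and global features can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are taken to be already projected by the critic networks, the critic score is a plain inner product, and the candidates for each positive pair are the local features at all positions of all images in the mini-batch. The function name `infonce_loss` and the toy feature tensors are hypothetical.

```python
import numpy as np

def infonce_loss(local_feats, global_feats):
    """InfoNCE-style contrastive loss for one mini-batch.

    local_feats:  (N, P, R) projected local features, P spatial positions per image
    global_feats: (N, R)    projected global features
    """
    n, p, _ = local_feats.shape
    # scores[a, b, i] = <global feature of image a, local feature i of image b>
    scores = np.einsum("ar,bir->abi", global_feats, local_feats)
    flat = scores.reshape(n, n * p)
    # Numerically stable log-sum-exp over all N * P candidates.
    mx = flat.max(axis=1, keepdims=True)
    log_z = mx[:, 0] + np.log(np.exp(flat - mx).sum(axis=1))
    # Positive pairs: local features of the same image as the global feature.
    positives = scores[np.arange(n), np.arange(n), :]
    return float(np.mean(log_z[:, None] - positives))

n, p, r = 4, 2, 4
basis = 3.0 * np.eye(n, r)
# Diverse batch: every image has its own feature direction.
diverse_local = np.repeat(basis[:, None, :], p, axis=1)
loss_diverse = infonce_loss(diverse_local, basis)
# Collapsed batch: all images share the features of the first image.
collapsed_local = np.repeat(diverse_local[:1], n, axis=0)
collapsed_global = np.repeat(basis[:1], n, axis=0)
loss_collapsed = infonce_loss(collapsed_local, collapsed_global)

print(loss_diverse, loss_collapsed)
```

When every image in the batch is identical, all candidates score the same and the loss cannot drop below log(N * P), whereas a diverse batch reaches a much lower loss.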
To stabilize training, the discriminator and generator learn only from the contrastive loss on fake image features, not from the contrastive loss on real image features. The generator and discriminator losses are:
$$L_G = L_{gan}(\hat{D}, G) + \alpha L_{nce}(X_g)$$
$$L_D = L_{gan}(D, \hat{G}) + \beta L_{nce}(X_g)$$
where $\alpha$ and $\beta$ are hyperparameters; $\hat{D}$ and $\hat{G}$ denote the fixed discriminator and generator, respectively; $X_g$ denotes the set of generated images (as opposed to the set of real images $X_r$); $L_{nce}$ is the contrastive loss; and $L_{gan}$ is the GAN loss.
In practice, for simplicity, the same values of $\alpha$ and $\beta$ are used in all experiments. The ablation study shows that InfoMax-GAN works well over a wide range of values.
3.3 Mitigation of Catastrophic Forgetting
The authors train a classifier on a one-vs-all CIFAR-10 classification task in which the class distribution changes every 1K iterations and repeats with a period of 10K iterations. The test results are shown in the figure above:
- Without InfoMax, the classifier overfits to the current class distribution, so its accuracy drops whenever the class distribution changes.
- With InfoMax, the classifier still remembers all previous classes when the class distribution changes.
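The non-stationary task schedule in this experiment can be sketched as a function of the training iteration. The sketch assumes the ten CIFAR-10 classes are visited in index order, which the description above does not specify; the function names are hypothetical.

```python
def positive_class(iteration, num_classes=10, iters_per_task=1000):
    """Class treated as 'positive' in the one-vs-all task at a given iteration.

    Assumes classes are visited in index order and the schedule repeats
    every num_classes * iters_per_task iterations.
    """
    return (iteration // iters_per_task) % num_classes

def binary_label(class_id, iteration):
    """One-vs-all label of a sample with the given class at the given iteration."""
    return int(class_id == positive_class(iteration))

# The task changes every 1K iterations and the cycle restarts after 10K.
for it in (0, 999, 1000, 9999, 10000):
    print(it, positive_class(it))
```

A classifier that forgets has to relearn each task every time it comes back in the cycle; one that retains earlier representations recovers its accuracy immediately.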
3.4 Alleviate Mode Collapse
The authors use the CIFAR-10 training data to train a discriminator on the contrastive task, and simulate three kinds of generators with the CIFAR-10 test data.
As shown in the figure above, for the perfect generator, which has no mode collapse, the contrastive task is solved very well.
The generator with full mode collapse can only produce one type of image, so the accuracy on the contrastive task is 0. In the contrastive task, every positive example in a mini-batch of N images must be distinguished from the negative examples drawn from the other images. If all N images are identical because of full mode collapse, every positive example has negative examples that are identical to it, which makes the contrastive task almost impossible to solve. To solve the contrastive task, the generator therefore has to produce more diverse images, which reduces mode collapse.
As shown in the figure above, a generator that misses any category of images (partial mode collapse) also suffers a drop in performance.
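The effect of mode collapse on the contrastive task can be imitated with a toy simulation. This is not the paper's CIFAR-10 experiment: each "image" is a random unit vector that stands in for both its local and its global feature, and a generator with fewer modes than the batch size simply repeats images. All names and sizes below are assumptions.

```python
import numpy as np

def contrastive_accuracy(images, margin=1e-6):
    """Fraction of images whose own score beats every other image's score.

    images: (N, R) array of unit vectors. Ties between identical images
    count as failures, which is what the margin enforces.
    """
    scores = images @ images.T              # scores[a, b] = <image a, image b>
    positives = np.diag(scores).copy()
    np.fill_diagonal(scores, -np.inf)       # keep only the negatives
    return float(np.mean(positives > scores.max(axis=1) + margin))

def simulate_batch(num_modes, batch_size, rng, dim=32):
    """Toy generator that cycles through `num_modes` distinct unit vectors."""
    modes = rng.normal(size=(num_modes, dim))
    modes /= np.linalg.norm(modes, axis=1, keepdims=True)
    return modes[np.arange(batch_size) % num_modes]

rng = np.random.default_rng(0)
acc_perfect = contrastive_accuracy(simulate_batch(64, 64, rng))    # no repeats
acc_partial = contrastive_accuracy(simulate_batch(48, 64, rng))    # some repeats
acc_collapsed = contrastive_accuracy(simulate_batch(1, 64, rng))   # one image only

print(acc_perfect, acc_partial, acc_collapsed)
```

The more often the generator repeats itself within a batch, the more positive examples become indistinguishable from negatives, so the accuracy falls from perfect towards zero.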
4. Experiment
4.1 Experiment Settings
- Training: All models are trained with the same residual network backbone.
- Evaluation: The quality of the generated images is evaluated with three different metrics: Fréchet Inception Distance (FID), Kernel Inception Distance (KID) and Inception Score (IS). In general, FID and KID are used to evaluate the diversity of the generated images (smaller is better), and IS is used to evaluate their quality (larger is better). Every reported value is the mean and standard deviation over three runs.
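For reference, FID is the Fréchet distance between two Gaussians fitted to image features. The sketch below computes that distance for arbitrary feature arrays. A real FID evaluation extracts the features with an Inception network, which is omitted here, and the random features used in the example are placeholders.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (num_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in exact arithmetic.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))       # placeholder features, not Inception outputs
same = frechet_distance(feats, feats)
shifted = frechet_distance(feats, feats + 3.0)

print(same, shifted)
```

Identical feature sets give a distance of zero, and shifting every feature by 3 in four dimensions adds the squared mean difference 4 * 3^2 = 36 while leaving the covariance term unchanged.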
4.2 Generating Performance Evaluation
The results show that InfoMax-GAN performs very well. In the FID column, the gain of InfoMax-GAN over SSGAN is larger on CIFAR-100 than on CIFAR-10. The authors attribute this to SSGAN's tendency to generate images whose rotation is easy to predict, which sacrifices diversity. This loss of diversity is penalized more heavily on a dataset with more categories.
4.3 Training Stability
The paper tests training stability by measuring how model performance changes as the hyperparameters vary over a wide range. The hyperparameters include the Adam parameters $(\beta_1, \beta_2)$, the number of discriminator updates per generator update $n_{dis}$, and so on. All of these settings are taken from previous GAN works with strong performance.
The FID results above show that InfoMax-GAN performs better even under settings in which GAN training does not complete successfully. Across the different settings, the FID of InfoMax-GAN stays consistently lower than that of SNGAN, which shows that InfoMax-GAN is robust to hyperparameters and achieves good performance without any hyperparameter tuning. This is useful in practice: when training a new GAN or using a new dataset, training can be very unstable if the hyperparameters are not tuned well, and robustness to hyperparameters addresses this problem.
As shown in the figure, the red and blue curves show how the FID of SNGAN and InfoMax-GAN, respectively, changes as the number of iterations increases. InfoMax-GAN converges faster and keeps improving throughout training, which indicates more stable GAN training. This is attributed to an additional constraint: the global feature must have high mutual information with all of its local features. The space of generated data distributions is thereby constrained and the generated data changes less, which stabilizes the GAN training environment. This is a practical benefit when training GANs under a fixed computational budget, since significant performance gains are obtained early in training.
4.4 Low Computational Amount
As shown, the training time of InfoMax-GAN is short. This is because, in practice, only two shallow MLP networks (one hidden layer each) are needed to compute the contrastive loss. In addition, training with a small $n_{dis}$ takes much less time than training with a large one, because a large $n_{dis}$ is a major bottleneck in training time.
The figure above is meant to show that training InfoMax-GAN requires little computation and little time. When I first saw it, however, I thought the labels for SNGAN and InfoMax-GAN had been swapped, because InfoMax-GAN clearly takes longer than SNGAN. After rereading the relevant text in this section several times, I guessed that InfoMax-GAN was not meant to be compared with SNGAN. But without SNGAN there is no other GAN training time in the figure and no baseline in the text, which makes it hard to appreciate how low the computational cost of InfoMax-GAN is.
I then found the answer in one of the references. The paper "Spectral Normalization for Generative Adversarial Networks" states that "the relative computational cost of the power iteration (18) is negligible when compared to the cost of forward and backward propagation on CIFAR-10". In other words, the extra computation that spectral normalization adds in SNGAN is negligible compared with forward and backward propagation on CIFAR-10. The comparison between InfoMax-GAN and SNGAN can therefore show the low computational cost of InfoMax-GAN, although this cannot be seen from the figure alone.
4.5 Ablation Studies
An ablation study removes some of the proposed improvements (for example, a few network layers) from the final model in the paper in order to verify that each improvement is actually necessary.
(In practice, when running an ablation study, one often finds that the results are better after the improvement is removed.)
RKHS Dimension (R): As shown in the figure above, the FID of InfoMax-GAN stays consistently low for different values of R, which reflects its robustness. This is because the critics of InfoMax-GAN are MLP networks with only a single hidden layer, which is enough to obtain a good representation.
As shown in the figure above, for different values of the generator-side weight $\alpha$ and the discriminator-side weight $\beta$ of the contrastive loss:
- InfoMax-GAN has a lower FID value and achieves the best performance at .
- Figure (b) shows that the discriminator-side term is very important for improving GAN performance: with the generator-side weight $\alpha$ held fixed, the FID gradually decreases as the discriminator-side weight $\beta$ increases from 0.0 to 0.2.
- Without the generator-side term, InfoMax-GAN already achieves good performance using the discriminator-side term alone as a regularizer. This is because maximizing mutual information reduces catastrophic forgetting, which stabilizes GAN training.
- Increasing the generator-side weight improves performance further. This is because the generator's regularization term helps to reduce mode collapse and thus improves the FID.
5. Supplement: spectral normalization
5.1 Theoretical Basis
Lipschitz continuity is a property that describes how "well behaved" a function is. Taking a one-dimensional function as an example: if the function is Lipschitz continuous, there is a double cone such that, when its vertex is placed at any point on the graph of the function, the rest of the graph lies entirely outside the cone. As shown below,
If a one-dimensional function is differentiable, its Lipschitz constant K is the maximum absolute value of its derivative. Lipschitz continuity requires K to be bounded, which limits the gradient of the discriminator and thereby addresses the problem of exploding gradients during gradient descent.
For a general differentiable function, the Lipschitz constant (with respect to the Euclidean norm) is the supremum over all inputs of the spectral norm, i.e. the largest singular value, of its Jacobian. For a linear map $x \mapsto Wx$, this is simply the largest singular value $\sigma(W)$ of W.
The premise of spectral normalization is that for any multi-layer discriminator (a composition of linear maps and nonlinear activations), the Lipschitz constant or an upper bound on it can be found: the Lipschitz constant of a composition is at most the product of the Lipschitz constants of its parts, so with 1-Lipschitz activations such as ReLU it is bounded by the product of the spectral norms of the layer weights.
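The product-of-spectral-norms bound can be checked numerically on a small network. The two-layer ReLU network below, with random weights `W1` and `W2`, is a hypothetical example; the sketch compares the bound with the largest slope observed between random pairs of inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))    # hypothetical first-layer weights
W2 = rng.normal(size=(1, 16))    # hypothetical output-layer weights

def net(x):
    """Two-layer network with a ReLU activation, which is 1-Lipschitz."""
    return W2 @ np.maximum(W1 @ x, 0.0)

# np.linalg.norm(W, 2) returns the largest singular value of a 2-D array.
bound = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

# Largest slope |f(x) - f(y)| / |x - y| seen over random input pairs.
ratios = []
for _ in range(1000):
    x, y = rng.normal(size=8), rng.normal(size=8)
    ratios.append(np.linalg.norm(net(x) - net(y)) / np.linalg.norm(x - y))
max_ratio = max(ratios)

print(max_ratio, bound)
```

The observed slopes never exceed the bound. Dividing each weight matrix by its spectral norm would bring the bound down to 1, which is the idea behind spectral normalization.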
5.2 Algorithm
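Randomly initialize two vectors u and v, and replace the network weight W with $W_{SN} = W / \sigma(W)$, where $\sigma(W)$ is the largest singular value of W. It is estimated by power iteration, which repeats the following updates:
$$v \leftarrow \frac{W^\top u}{\|W^\top u\|_2}, \qquad u \leftarrow \frac{W v}{\|W v\|_2}$$
After enough iterations, u and v converge to the left and right singular vectors of W associated with its largest singular value, and $\sigma(W) \approx u^\top W v$.
Compared with the forward and backward passes of gradient descent, this algorithm requires little computation.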
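A minimal NumPy sketch of power iteration for spectral normalization is given below. The test matrix is built with known singular values so that the estimate can be checked; the function name and the number of iterations are assumptions made for illustration.

```python
import numpy as np

def spectral_norm_power_iteration(W, num_iters=50, seed=0):
    """Estimate the largest singular value of W together with its singular vectors."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    for _ in range(num_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = float(u @ W @ v)     # approximates the largest singular value
    return sigma, u, v

# Build a 6 x 4 matrix whose singular values are 3, 2, 1 and 0.5 by construction.
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.normal(size=(6, 4)))
Q2, _ = np.linalg.qr(rng.normal(size=(4, 4)))
W = Q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q2.T

sigma, u, v = spectral_norm_power_iteration(W)
W_sn = W / sigma                 # spectrally normalized weight

print(sigma, np.linalg.norm(W_sn, 2))
```

Each iteration costs only two matrix-vector products. In the spectral normalization paper, u is kept between training steps and a single iteration is run per update, which is why the overhead is small compared with forward and backward propagation.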
6. References
- Lee K S, Tran N T, Cheung N M. InfoMax-GAN: Improved adversarial image generation via information maximization and contrastive learning[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 3942-3952. Download: WACV 2021 Open Access Repository (thecvf.com)
- Spectral Normalization for GAN (bzdww)
- Spectral Normalization Explained
- Miyato T, Kataoka T, Koyama M, et al. Spectral normalization for generative adversarial networks[J]. arXiv preprint arXiv:1802.05957, 2018. Download: [1802.05957] Spectral Normalization for Generative Adversarial Networks (arxiv.org)
- What is an ablation study? (The Blog of the Silent Gods, CSDN Blog)