[GAN in Computer Vision] How to stabilize GAN training (3)

1. Description

    In the previous post, we reached the point of understanding unpaired image-to-image translation. Still, there are some very important concepts you must understand before implementing your own super cool deep GAN model. This article introduces several newer members of the GAN family.

2. About: Wasserstein distance, boundary equilibrium, and progressively growing GANs

        GANs dominate deep learning tasks such as image generation and image translation.

        In this part, we'll look at some groundwork. We will see the most common GAN distance functions and why they work. We then view the training of GANs as trying to find the equilibrium of a two-player game. Finally, we'll see a revolutionary incremental training approach that achieved realistic megapixel image resolution for the first time.

        The research we will explore primarily addresses mode collapse and training instabilities. Someone who has never trained a GAN could easily argue that we keep coming back to these same two axes. In real life, however, training large-scale GANs on new problems can be a nightmare.

Even if you read and implement the state-of-the-art methods, you will almost certainly struggle to train a new GAN on a new problem. In fact, succeeding is like winning the lottery.

When you move away from the most common datasets (CIFAR, MNIST, CelebA), you run into chaos.

Typically, you study the learning curves as a form of debugging, trying to guess which hyperparameters might work better. But GAN training is so unstable that this process is often a waste of time. The Wasserstein GAN paper [1] is one of the first works to provide an extensive theoretical justification for the GAN training scheme. Interestingly, the authors found patterns across all the existing distances between distributions.

2.1 Core Concept

        The core idea is to effectively measure how close the model distribution is to the real distribution, because the choice of distance directly affects the convergence of the model. As we now know, GANs can represent distributions supported on low-dimensional manifolds (generated from the noise z).

Intuitively, the weaker this distance is, the easier it is to define a continuous mapping from the parameter space (θ-space) to the probability space, since sequences of distributions converge more easily under it.

        We have a reason to require this continuous mapping: when the mapping is continuous, the resulting loss is continuous in the parameters as well, so small changes in θ produce small changes in the generated distribution and its samples.

        For this reason, this work introduces a new GAN formulation called Wasserstein GAN (WGAN). Its loss is an approximation of the Earth Mover (EM) distance, which is shown theoretically to allow gradual optimization of GAN training. Surprisingly, no careful balancing of D and G is required during training, nor is a specific network architecture design. In this way, the mode collapse inherent in GANs is reduced.

2.2 Understanding the Wasserstein distance

        Before we delve into the proposed loss, let's look at some math. As described in the wiki, the supremum (sup) of a subset S of a partially ordered set T is the least element of T that is greater than or equal to every element of S. Therefore, the supremum is also called the least upper bound. Personally, I think of it as the maximum over the subset S of all possible combinations that can be found in T.

        Now, let's bring this concept into GAN terminology. T is the set of all possible function approximations f that we can obtain from G and D. S is the subset of those functions that we constrain in order to make training better (a kind of regularization). The ordering comes naturally from the computed loss function. Based on the above, we can finally look at the Wasserstein loss function, which measures the distance between two distributions P_r and P_θ.

Taken from theaisummer.com https://theaisummer.com/gan-computer-vision-incremental-training/
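
The figure itself did not survive extraction. As a hedged reconstruction, the loss it refers to is presumably the Kantorovich-Rubinstein dual form given in the WGAN paper [1], where the supremum runs over all 1-Lipschitz functions f (in practice the critic D plays the role of f):

```latex
W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_r}\big[f(x)\big] - \mathbb{E}_{x \sim P_\theta}\big[f(x)\big]
```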

        A strict mathematical constraint, requiring the functions to be K-Lipschitz, is used to obtain the subset S. Since the math has been extensively proven, you don't need to know more about it here. But how do we introduce this constraint in practice?

One way to address this problem is to roughly approximate the constraint by training a neural network whose weights lie in a compact space. The easiest way to achieve this is to clamp the weights to a fixed range.

        That's it, weight clipping works the way we want it to! After each gradient update, we clip every weight w to the range [−0.01, 0.01]. In this way, we approximately enforce the Lipschitz constraint. Simple, but I can assure you it works!
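
As a minimal, framework-free sketch of the clipping step described above (the helper name `clip_weights` and the sample weight values are made up for illustration; only the range [−0.01, 0.01] comes from the text):

```python
def clip_weights(weights, c=0.01):
    """Clamp every weight into [-c, c]; c = 0.01 is the range quoted in the text."""
    return [max(-c, min(c, w)) for w in weights]


# Hypothetical critic weights right after a gradient update.
critic_weights = [0.5, -0.003, 0.02, -0.7]
critic_weights = clip_weights(critic_weights)
print(critic_weights)  # [0.01, -0.003, 0.01, -0.01]
```

In a PyTorch training loop the same idea is usually written as an in-place `clamp_` over each critic parameter, applied right after the optimizer step.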

        In fact, with this distance as the loss function, which is continuous and differentiable, we can now train D with the proposed criterion until optimality, whereas other distances saturate. Saturation means that the discriminator's loss reaches zero and the generated samples are meaningful only in some cases. So now the saturation (which naturally leads to mode collapse) is alleviated, and we can train with a more linear-style gradient throughout training. Let's look at an example to clarify this:

Image from WGAN paper [https://arxiv.org/abs/1701.07875]
The WGAN criterion provides clean gradients in all parts of the space

        To see all the previous math in practice, we provide a WGAN coding scheme in PyTorch. You can directly modify your project to include this loss criterion. Such ideas are usually best understood in real code. It's worth mentioning that approximating the supremum over the subset means we have to evaluate many pairs. That's why you'll see the generator trained only once every few iterations, while the discriminator is updated at every iteration. This gives us the set over which the supremum is defined. Note that in order to get close to the supremum, we could also take many steps for G before updating D.
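
The update schedule can be illustrated with a deliberately tiny toy problem. This is only a sketch under loud assumptions, not the PyTorch code the article refers to: the critic is a single scalar weight (f(x) = w·x), the generator only shifts noise (G(z) = z + θ), plain gradient steps replace RMSProp, and the learning rates, batch size, and target mean of 4 are all made up. Only the clipping range 0.01 and the idea of several critic updates per generator update (5 is the default in the WGAN paper) are taken from the source material.

```python
import random


def mean(values):
    return sum(values) / len(values)


def train_toy_wgan(steps=1500, n_critic=5, clip=0.01,
                   lr_critic=0.01, lr_gen=1.0, batch=32, seed=0):
    rng = random.Random(seed)
    w = 0.0       # critic parameter: f(x) = w * x
    theta = 0.0   # generator parameter: G(z) = z + theta
    for _ in range(steps):
        # Several critic updates per generator update.
        for _ in range(n_critic):
            real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
            # Ascend E[f(real)] - E[f(fake)] = w * (mean(real) - mean(fake)).
            w += lr_critic * (mean(real) - mean(fake))
            # Weight clipping keeps the critic (roughly) Lipschitz.
            w = max(-clip, min(clip, w))
        # Generator descends -E[f(G(z))]; its gradient w.r.t. theta is -w.
        theta += lr_gen * w
    return theta, w


theta, w = train_toy_wgan()
print(f"learned shift theta = {theta:.2f} (the real data mean is 4.0)")
```

Because the clipped critic can never push harder than ±0.01 per step, the generator moves steadily toward the real mean and then hovers around it, which is the "clean gradient" behaviour in miniature.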

        In later work [5], it was shown that even though the idea is sound, weight clipping is a poor way to enforce the required constraint. Another way to force a function to be K-Lipschitz is the gradient penalty.

The key idea is the same: keep the weights in a compact space. However, they do this by constraining the gradient norm of the critic's output with respect to its input.
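
For reference, the critic objective from the gradient-penalty paper [5] can be written as follows. The notation is ours: P_g is the generator distribution, x̂ is sampled on straight lines between real and generated samples, and λ is the penalty coefficient (the paper reports λ = 10).

```latex
L = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\Big],
\qquad \hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```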

        We won't cover that paper here, but for consistency and ease of experimentation, we provide the code as an improved alternative to the vanilla WGAN.

2.3 Results and discussion

        Following our brief description, we can now jump into some results. It's great to see how the GAN learns during training, as shown in the image below:

Image from WGAN paper [https://arxiv.org/abs/1701.07875]

The Wasserstein loss criterion with a DCGAN generator. As you can see, the loss decreases quickly and steadily while the sample quality improves. This work is considered fundamental to the theoretical side of GANs, and it can be summarized as:

        TL;DR

  • The Wasserstein criterion allows us to train D until optimality. When the criterion reaches its optimal value, it simply provides a loss to the generator, which we can then train like any other neural network.
  • We no longer need to carefully balance the capacities of G and D.
  • The Wasserstein loss yields higher-quality gradients for training G.
  • WGANs are observed to be more robust than vanilla GANs to the choice of generator architecture and to hyperparameter tuning.

        Indeed, the stability of the optimization process improved. However, nothing comes at zero cost. WGAN training becomes unstable with momentum-based optimizers such as Adam, as well as with high learning rates. This is reasonable, since the criterion loss is highly non-stationary, so momentum-based optimizers seem to perform worse. That's why the authors use RMSProp, which is known to perform well on non-stationary problems.

        Finally, an intuitive way to understand the paper is by analogy with the history of activation functions inside a layer. Specifically, sigmoid and tanh activations, whose gradients vanish when they saturate, were largely replaced by ReLU, which provides better gradients across the entire range of values.

3. BEGAN (Boundary Equilibrium Generative Adversarial Networks, 2017)

        We often see the discriminator improve too quickly at the beginning of training. Nonetheless, balancing the convergence of the discriminator and the generator remains an open challenge.

        This is the first work that can control the trade-off between image diversity and visual quality. High-resolution images are obtained with a simple model architecture and a standard training scheme.

        To achieve this, the authors introduce a trick to balance the training of the generator and the discriminator. The core idea of BEGAN is this newly enforced equilibrium, combined with the Wasserstein distance described above. For this, they train an autoencoder-based discriminator. Interestingly, since D is now an autoencoder, it produces images as output rather than scalars. Let's keep this in mind before we go any further!

        As the authors found, it is more effective to match the distribution of the errors than to match the distribution of the samples directly. A key point is that this work aims to optimize the Wasserstein distance between autoencoder loss distributions, not between sample distributions. An advantage of BEGAN is that it does not explicitly require the discriminator to be K-Lipschitz constrained. Autoencoders are usually trained with an L1 or L2 norm.

3.1 Formulating the two-player game equilibrium

        To express the problem in game-theoretic terms, an equilibrium term that balances the discriminator and the generator is added. Suppose we could ideally generate indistinguishable samples. Then the error distributions of real and generated samples should be the same, including their expected error, which is what we measure after processing each batch. Perfectly balanced training would result in equal expected values of L(x) and L(G(z)). However, this is never the case! Therefore, BEGAN quantifies the balance ratio, defined as:

Image courtesy of the author, originally written in Latex
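
The equation image is missing from this copy. Based on the BEGAN paper [2], the balance ratio it showed is presumably the ratio of the two expected autoencoder losses:

```latex
\gamma = \frac{\mathbb{E}\big[L(G(z))\big]}{\mathbb{E}\big[L(x)\big]}
```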

        This quantity is modeled as a hyperparameter of the network. Thus, the new training scheme involves two competing goals: a) auto-encoding real images and b) discriminating real images from generated ones. The γ term lets us balance these two goals. Lower values of γ lead to lower image diversity, because the discriminator focuses more heavily on auto-encoding real images. But how do we control this hyperparameter when the expected losses keep changing?

3.2 Boundary Equilibrium GAN (BEGAN)

        The answer is simple: we just need to introduce another variable k_t that lies in the range [0, 1]. This variable is designed to control how much emphasis is put on L(G(z)) during training.

Image courtesy of the author, originally written in Latex
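
Since the equation image is missing here as well, the following is a reconstruction of the BEGAN objective as stated in the paper [2]. The subscripts z_D and z_G (noise samples used for the discriminator and generator updates) follow the paper's notation rather than this article's.

```latex
\begin{aligned}
L_D &= L(x) - k_t \, L\big(G(z_D)\big) \\
L_G &= L\big(G(z_G)\big) \\
k_{t+1} &= k_t + \lambda_k \big(\gamma \, L(x) - L(G(z_G))\big)
\end{aligned}
```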

 

The variable is initialized as k_0 = 0, and λ_k is defined as the proportional gain for k (0.001 is used in this study). This can be seen as a form of closed-loop feedback control, in which k_t is adjusted at each step to maintain the desired balance for the chosen hyperparameter γ.
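
The feedback loop can be sketched in a few lines of plain Python. The function names and the sample loss values are hypothetical, and γ = 0.5 is just an illustrative choice; only k_0 = 0, the gain 0.001, and the [0, 1] range come from the text.

```python
def update_k(k_t, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    """One proportional-control step for k, kept inside [0, 1]."""
    k_next = k_t + lambda_k * (gamma * loss_real - loss_fake)
    return min(1.0, max(0.0, k_next))


def discriminator_loss(loss_real, loss_fake, k_t):
    """BEGAN discriminator objective: L(x) - k_t * L(G(z))."""
    return loss_real - k_t * loss_fake


k = 0.0  # k_0 = 0, as in the text
# Hypothetical per-batch reconstruction losses (L(x), L(G(z))).
for loss_real, loss_fake in [(0.20, 0.05), (0.18, 0.06), (0.15, 0.09)]:
    d_loss = discriminator_loss(loss_real, loss_fake, k)
    k = update_k(k, loss_real, loss_fake)
    print(f"d_loss={d_loss:.4f}  k={k:.6f}")
```

When generated samples are reconstructed too easily (L(G(z)) below γ·L(x)), k grows and the discriminator puts more weight on telling fakes apart; in the opposite case k shrinks.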

Note that G tends to generate data that is easy for D to reconstruct during the early stages of training. At the same time, the real data distribution has not yet been learned accurately. Basically, L(x) > L(G(z)). In contrast to many GANs, BEGAN requires no pre-training and can be optimized with Adam. Finally, a global measure of convergence is derived from the concept of equilibrium.

        Essentially, the convergence process can be formulated as finding a) the closest reconstruction L(x) and b) the lowest absolute value of the control term |γ L(x) − L(G(z))|. Adding these two terms lets us recognize when the network has converged.
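
A small sketch of this convergence measure, assuming scalar per-batch losses (the function name, the sample history, and γ = 0.5 are illustrative assumptions):

```python
def convergence_measure(loss_real, loss_fake, gamma=0.5):
    """Sum of the reconstruction term L(x) and the balance term |gamma*L(x) - L(G(z))|."""
    return loss_real + abs(gamma * loss_real - loss_fake)


# Hypothetical (L(x), L(G(z))) pairs logged during training.
history = [(0.40, 0.05), (0.25, 0.10), (0.12, 0.06)]
for step, (loss_real, loss_fake) in enumerate(history):
    print(step, round(convergence_measure(loss_real, loss_fake), 4))
```

A falling value indicates both better reconstructions and a better-kept balance, so it can be tracked like an ordinary loss curve.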

3.3 Model Architecture

        The model architecture is very simple. A major difference is the use of exponential linear units (ELUs) instead of ReLUs. The authors use an autoencoder with a deep encoder and a deep decoder. The hyperparameter choices are intended to avoid the typical GAN training tricks.

Image from the BEGAN paper [https://arxiv.org/abs/1703.10717]. Model architecture

A U-shaped architecture is used, without skip connections. Downsampling is implemented as a sub-sampling convolution with a kernel size of 3 and a stride of 2. Upsampling, on the other hand, is done by nearest-neighbor interpolation. Between the encoder and the decoder, the tensor of processed data is mapped through fully connected layers that are not followed by any nonlinearity.
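
To make the upsampling step concrete, here is a minimal nearest-neighbor upsampler for a single-channel image stored as a list of rows. It is only an illustration of the interpolation mode named above (the helper name is ours), not the layer used in the BEGAN code.

```python
def upsample_nearest(image, factor=2):
    """Repeat every pixel `factor` times along both axes."""
    out = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Emit independent copies so rows can be edited separately later.
        out.extend(list(wide_row) for _ in range(factor))
    return out


small = [[1, 2],
         [3, 4]]
for row in upsample_nearest(small):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```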

3.4 Results and discussion

Some of the presented visual results can be seen in the 128x128 interpolated images below:

Image source: BEGAN [https://arxiv.org/abs/1703.10717]. Interpolated 128x128 images generated by BEGAN

Notably, diversity is observed to increase with γ, but so do artifacts (noise). The interpolations show good continuity. In the first row, the hair transitions and the hairstyle changes. It is also worth noting that some features vanish in the left image (the cigarette). The second and last rows show a simple rotation. While the rotation is smooth, we can see that the profile pictures are not captured perfectly.

Finally, using the BEGAN equilibrium method, the network converges to diverse and visually pleasing images. This remains true at 128x128 resolution with minor modifications. Training is stable, fast, and robust to small parameter changes.

But let's see what happens at really high resolutions!

4. Progressive GAN (Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017)

        The methods we have described so far produce sharp images. However, they generate images only at relatively small resolutions and with limited variation. One reason the resolution was kept low is training instability. If you have deployed your own GAN models, you probably know that large resolutions require smaller mini-batches because of memory constraints. Training time grows as well, which means you need several days to train a GAN.

4.1 Incremental Growth Architecture

        To address these issues, the authors progressively grow the generator and the discriminator, starting from low-resolution images and moving to high-resolution ones.

The intuition is that, as training progresses, the newly added layers aim to capture the higher-frequency details that correspond to high-resolution images.

But what makes this approach so great?

The answer is simple: the model does not have to learn all scales simultaneously. It first discovers the large-scale (global) structure and then the local fine-grained details. The incremental nature of the training pushes it in this direction. It is important to note that all layers remain trainable throughout the training process and that the network architecture is symmetric (mirrored). A diagram of the described architecture is shown below:

Image from the Progressive Growing of GANs paper [https://arxiv.org/abs/1710.10196]

However, mode collapse still exists because of unhealthy competition, which escalates the magnitude of the error signals in both G and D.

4.2 Introducing smooth layers between transitions

The key innovation of this work is the smooth fade-in of newly added layers, which stabilizes training. But what happens after each transition?

Image from the Progressive Growing of GANs paper, link: https://arxiv.org/abs/1710.10196

        What really happens is that the image resolution is doubled, so a new layer is added to both G and D. This is where the magic happens. During the transition, the layers that operate at the higher resolution are treated as a residual block with a skip connection, whose weight (α) increases linearly from 0 to 1. A value of one means that the skip connection has been discarded.
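
The fade-in can be sketched as a simple linear blend. This is an illustrative, framework-free version in which images are lists of rows of floats; the names `alpha_schedule` and `fade_in` and the step counts are assumptions, and the two inputs stand for the upsampled output of the old low-resolution path and the output of the newly added high-resolution layer.

```python
def alpha_schedule(step, transition_steps):
    """Linearly ramp alpha from 0 to 1 over the transition, then hold at 1."""
    return min(1.0, step / transition_steps)


def fade_in(low_res_upsampled, high_res, alpha):
    """Blend the skip path (old layers) with the new layer's output."""
    return [
        [(1.0 - alpha) * lo + alpha * hi for lo, hi in zip(lo_row, hi_row)]
        for lo_row, hi_row in zip(low_res_upsampled, high_res)
    ]


old_path = [[0.0, 2.0]]   # hypothetical upsampled low-resolution output
new_path = [[4.0, 6.0]]   # hypothetical output of the new layer
for step in (0, 50, 100):
    alpha = alpha_schedule(step, transition_steps=100)
    print(alpha, fade_in(old_path, new_path, alpha))
# 0.0 [[0.0, 2.0]]
# 0.5 [[2.0, 4.0]]
# 1.0 [[4.0, 6.0]]
```

At α = 0 the network behaves exactly like the previous, lower-resolution stage, and at α = 1 only the new layers contribute.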

        The depicted toRGB blocks represent layers that project and reshape the one-dimensional feature vectors into RGB colors. They can be viewed as connecting layers that always bring the image into the right shape. Meanwhile, fromRGB does the opposite, and both use 1 × 1 convolutions. The real images are downscaled accordingly to match the current resolution.

Interestingly, during a transition, the authors interpolate between the two resolutions of the real images, similarly to how the generator output combines the two resolutions. Also, with progressive GANs, most iterations are performed at lower resolutions, resulting in a 2x to 6x training speedup. As such, this is the first work to reach megapixel resolution, i.e. 1024x1024.

Unlike downstream tasks that suffer from covariate shift, GANs exhibit escalating error signal magnitudes and competition issues. To address them, the authors use normal-distribution initialization and per-layer weight normalization by a scalar that is applied dynamically at runtime. This is thought to make the model learn scale invariance. To further constrain signal magnitudes, they also normalize the feature vector in each pixel to unit length in the generator. This prevents feature map magnitudes from escalating without significantly degrading the results. The accompanying video may help in understanding the design choices. The official code is published in TensorFlow.
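
The pixel-wise normalization can be sketched for a single pixel as follows. This is an illustration rather than the official implementation: the function name is ours, the input is the list of channel activations at one pixel, and the formula (division by the root mean square over channels, with a small ε of 1e-8) follows our reading of the Progressive GAN paper [3].

```python
import math


def pixel_norm(features, eps=1e-8):
    """Normalize one pixel's feature vector by its root mean square over channels."""
    mean_square = sum(a * a for a in features) / len(features)
    return [a / math.sqrt(mean_square + eps) for a in features]


pixel = [3.0, 4.0]            # hypothetical activations of two channels
normalized = pixel_norm(pixel)
print(normalized)

# Scaling the input by 10 leaves the normalized vector (almost) unchanged,
# which is how the layer keeps activation magnitudes from escalating.
print(pixel_norm([30.0, 40.0]))
```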

4.3 Results and discussion

        The results can be summarized as follows:

        1) The gradual increase in network capacity explains the improved convergence. Intuitively, the existing layers have already learned the coarser scales, so after a transition the newly introduced layers only have to refine the representation with increasingly fine-scale effects.

        2) The speedup from progressive growing increases with the output resolution. For the first time, it is possible to produce sharp 1024x1024 images.

        3) Although implementing such an architecture is difficult, and many training details are missing (e.g. when to make a transition and why), it is still an incredible piece of work that I personally enjoy.

Image from the Progressive Growing of GANs paper, megapixel resolution, link: https://arxiv.org/abs/1710.10196

4.4 Conclusion

In this article, we encountered some of the most advanced training concepts, which are still in use today. We focused on these important training aspects so that we can move on to more advanced applications. If you want to look at GANs from a more game-theoretic perspective, we highly recommend watching Daskalakis' talk (see also [4]). Finally, for us math geeks, there is an excellent article covering the transition to WGANs in more detail.

Altogether, we have found several ways to handle mode collapse, large datasets, and megapixel resolutions with incremental training. For the entire article series, feel free to visit AI Summer.

5. References

[1] Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

[2] Berthelot, D., Schumm, T., & Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.

[3] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

[4] Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2017). Training GANs with optimism. arXiv preprint arXiv:1711.00141.

[5] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (pp. 5767-5777).

Origin blog.csdn.net/gongdiwudu/article/details/131990442