[Computer Vision | Generative Adversarial] Gradually Growing Generative Adversarial Networks (GANs) to Improve Quality, Stability, and Variation

These blog posts are notes on deep learning / computer vision papers; please credit the source when reprinting.

Title: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Link: [1710.10196] Progressive Growing of GANs for Improved Quality, Stability, and Variation (arxiv.org)

Summary

We describe a new method for training generative adversarial networks (GANs). The key idea is to grow the generator and discriminator incrementally: starting at low resolution, we add new layers that model finer and finer details as training progresses. This not only speeds up training but also greatly stabilizes it, allowing us to generate images of unprecedented quality, e.g., 1024 × 1024 CELEBA images. We also propose a simple method to increase the variation of generated images, achieving a record-breaking Inception score of 8.80 on the unsupervised CIFAR10 dataset.

Furthermore, we describe some implementation details that are important to prevent unhealthy competition between the generator and the discriminator. Finally, we propose a new metric for evaluating GAN results, involving both image quality and variation. As an additional contribution, we constructed a higher quality version of the CELEBA dataset.

1 Introduction

Generative methods produce new samples from high-dimensional data distributions such as images, and are widely used in speech synthesis (van den Oord et al., 2016a), image-to-image translation (Zhu et al., 2017; Liu et al., 2017; Wang et al., 2017), image inpainting (Iizuka et al., 2017), and other fields. Currently, the most prominent approaches are autoregressive models (van den Oord et al., 2016b;c), variational autoencoders (VAEs) (Kingma and Welling, 2014), and generative adversarial networks (GANs) (Goodfellow et al., 2014). Each has significant strengths and weaknesses. Autoregressive models such as PixelCNN produce sharp images but are slow to evaluate, and since they model conditional distributions over pixels directly, they have no latent representation, which may limit their applicability. VAEs are easy to train but tend to produce blurry results due to model limitations, although recent work is improving this (Kingma et al., 2016). GANs can generate sharp images, albeit at relatively small resolutions and with somewhat limited variation, and training remains unstable despite recent progress (Salimans et al., 2016; Gulrajani et al., 2017; Berthelot et al., 2017; Kodali et al., 2017). Hybrid methods combine various strengths of these three approaches but so far lag behind GANs in image quality (Makhzani and Frey, 2017; Ulyanov et al., 2017; Dumoulin et al., 2016).

Typically, GANs consist of two networks: a generator and a discriminator (also known as a critic). The generator generates samples, such as images, from latent encodings whose distribution should ideally be indistinguishable from the training distribution. Since it is usually not possible to design a function to tell whether this is the case, a discriminator network is trained for evaluation. Since the networks are differentiable, we can also obtain gradients to guide the two networks to tune in the right direction. Typically, the generator is the main focus and the discriminator is an adaptive loss function that is discarded after the generator is trained.

There are several potential problems with this formulation. When we measure the distance between the training distribution and the generated distribution, the gradients can point in more or less random directions if the distributions do not have substantial overlap, i.e., are too easy to tell apart (Arjovsky and Bottou, 2017). Originally, Jensen-Shannon divergence was used as a distance metric (Goodfellow et al., 2014); that formulation has since been improved (Hjelm et al., 2017), and a number of more stable alternatives have been proposed, including least squares (Mao et al., 2016b), absolute deviation with margin (Zhao et al., 2017), and Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017). Our contributions are largely orthogonal to this ongoing discussion; we primarily use a modified Wasserstein loss, but also experiment with a least-squares loss.

Since higher resolution makes it easy to distinguish generated images from training images (Odena et al., 2017), the gradient problem is severely amplified and generating high-resolution images becomes difficult. Due to memory constraints, large resolutions also require the use of smaller mini-batches, further affecting the stability of training. Our key insight is that we can incrementally grow the generator and discriminator, starting with easy low-resolution images, and adding new layers that introduce higher-resolution details as training progresses. This greatly speeds up training and improves stability for high-resolution images, which we discuss in Section 2.

The GAN formulation does not explicitly require the generative model to represent the entire training data distribution. The traditional view that there is a trade-off between image quality and variation has recently been challenged (Odena et al., 2017). Considerable attention is currently being paid to preserving variation, and various methods have been proposed for measuring it, including the Inception score (Salimans et al., 2016), multi-scale structural similarity (MS-SSIM) (Odena et al., 2017; Wang et al., 2003), the birthday paradox (Arora and Zhang, 2017), and the number of discrete modes discovered by explicit tests (Metz et al., 2016). We describe our method for encouraging variation in Section 3, and propose a new metric for assessing quality and variation in Section 5.

Section 4.1 discusses a subtle modification to the network initialization that leads to a more balanced learning speed across different layers. Furthermore, we observe that the mode collapses that traditionally plague GANs tend to happen very quickly, over the course of only a few dozen minibatches. Typically, they start when the discriminator overshoots, leading to exaggerated gradients, followed by an unhealthy competition in which the signal magnitudes escalate in both networks. We propose a mechanism to stop the generator from participating in such escalation, overcoming this problem (Section 4.2).

We evaluate our contributions using the CELEBA, LSUN, and CIFAR10 datasets. We improve the best published Inception score on CIFAR10. Since the datasets commonly used to benchmark generative methods are limited to fairly low resolutions, we also created a higher-quality version of the CELEBA dataset that allows experimentation with output resolutions up to 1024 × 1024 pixels. The dataset and our full implementation are available at https://github.com/tkarras/progressive_growing_of_gans, trained networks and resulting images can be found at https://drive.google.com/open?id=0B4qLcYyJmiz0NHFULTdYc05lX0U, and an explanatory video illustrating the dataset, additional results, and latent space interpolations is at https://youtu.be/G06dEcZ-QTg.

2 Progressive growing of GANs

Our main contribution is a method for training generative adversarial networks (GANs), where we start with low-resolution images and gradually increase the resolution by adding layers to the network, as shown in Figure 1. This progressive approach allows training to first discover the large-scale structure of the image distribution, and then shift attention to finer and finer scale details, rather than having to learn all scales simultaneously.

Figure 1: Our training starts with the low-resolution state of the generator (G) and discriminator (D), which are 4×4 pixels. As training progresses, we progressively add layers to G and D, thereby increasing the spatial resolution of the generated images. All existing layers remain trainable throughout the process. Here N × N means that the convolutional layer operates on N × N spatial resolution. This enables stable image synthesis at high resolutions and greatly speeds up training. On the right, we show six example images at 1024 × 1024 resolution generated using progressive growing.

We use generator and discriminator networks that are mirror images of each other and always grow in synchrony. All existing layers in both networks remain trainable throughout the training process. When new layers are added to the networks, we fade them in smoothly, as illustrated in Figure 2. This avoids sudden shocks to the already well-trained, smaller-resolution layers. Appendix A describes the structure of the generator and discriminator in detail, along with other training parameters.

Figure 2: When increasing the resolution of the generator (G) and discriminator (D), we fade in the new layers smoothly. This example shows the transition from 16 × 16 images (a) to 32 × 32 images (c). During the transition (b), we treat the layers that operate on the higher resolution as a residual block whose weight α increases linearly from 0 to 1. Here 2× and 0.5× refer to doubling and halving the image resolution using nearest-neighbor filtering and average pooling, respectively. toRGB represents a layer that projects feature vectors to RGB colors and fromRGB does the reverse; both use 1 × 1 convolutions. When training the discriminator, we feed in real images that are downscaled to match the current resolution of the network. During a resolution transition, we interpolate between two resolutions of the real images, similarly to how the generator output combines two resolutions.
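To make the fade-in described in Figure 2 concrete, here is a minimal PyTorch-style sketch (not the authors' released code) of how the generator output could blend the upsampled lower-resolution image with the output of the newly added block. The module and argument names are placeholders chosen for illustration; only the α-weighted residual blending is taken from the paper.

```python
import torch
import torch.nn.functional as F

def faded_generator_output(features, new_block, to_rgb_new, to_rgb_old, alpha):
    """Blend generator outputs during a resolution transition (Figure 2b).

    features:       feature maps produced by the already-trained layers.
    new_block:      the freshly added higher-resolution layers (hypothetical module).
    to_rgb_new/old: 1x1 convolutions projecting features to RGB.
    alpha:          fade-in weight, ramped linearly from 0 to 1 during the transition.
    """
    # Old path: project to RGB at the low resolution, then 2x nearest-neighbor upsample.
    old_rgb = F.interpolate(to_rgb_old(features), scale_factor=2, mode="nearest")
    # New path: run the new block at the higher resolution, then project to RGB.
    new_rgb = to_rgb_new(new_block(F.interpolate(features, scale_factor=2, mode="nearest")))
    # Treat the new layers as a residual branch weighted by alpha.
    return (1.0 - alpha) * old_rgb + alpha * new_rgb
```

The discriminator side mirrors this: the downscaled real/fake image is fed through both the old and new fromRGB paths and blended with the same α.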

We observe several benefits from progressive training. Early on, the generation of smaller images is substantially more stable because there is less class information and fewer modes (Odena et al., 2017). By increasing the resolution little by little, we are continuously asking a much simpler question than the end goal of discovering a mapping from latent vectors to, e.g., 1024^2 images. This approach is conceptually similar to the recent work of Chen and Koltun (2017). In practice, it stabilizes training sufficiently for us to reliably synthesize megapixel-scale images using the WGAN-GP loss (Gulrajani et al., 2017) and even the LSGAN loss (Mao et al., 2016b). Another benefit is reduced training time. With progressively growing GANs, most of the iterations are done at lower resolutions, and comparable result quality is often obtained 2–6 times faster, depending on the final output resolution.

The idea of progressively growing GANs is related to the work of Wang et al. (2017), who use multiple discriminators operating on different spatial resolutions. That work in turn is motivated by Durugkar et al. (2016), who use one generator and multiple discriminators concurrently, and Ghosh et al. (2017), who do the opposite with multiple generators and one discriminator. Hierarchical GANs (Denton et al., 2015; Huang et al., 2016; Zhang et al., 2017) define a generator and discriminator for each level of an image pyramid. These methods build on the same observation as our work, namely that the complex mapping from latent vectors to high-resolution images is easier to learn in steps, but the crucial difference is that we have only a single GAN instead of a hierarchy of them. In contrast to early work on adaptively growing networks, e.g., growing neural gas (Fritzke, 1995) and neuroevolution of augmenting topologies (Stanley and Miikkulainen, 2002), which grow networks greedily, we simply defer the introduction of pre-configured layers. In that sense our approach resembles layer-wise training of autoencoders (Bengio et al., 2007).

3 Increasing variation using minibatch standard deviation

GANs tend to capture only a subset of the variation found in the training data, and Salimans et al. (2016) suggest "minibatch discrimination" as a solution. They compute feature statistics not only from individual images but also across the minibatch, thus encouraging the minibatches of generated and training images to show similar statistics. This is implemented by adding a minibatch layer toward the end of the discriminator, where the layer learns a large tensor that projects the input activations to an array of statistics. A separate set of statistics is produced for each example in the minibatch and concatenated to the layer's output, so that the discriminator can use the statistics internally. We simplify this approach drastically while also improving the variation.

Our simplified solution has neither learnable parameters nor new hyperparameters. We first compute the standard deviation of each feature at each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value. We replicate the value and concatenate it to all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it toward the end (see Appendix A.1 for details). We experimented with a richer set of statistics but were not able to improve the variation further. In parallel work, Lin et al. (2017) provide theoretical insights about the benefits of showing multiple images to the discriminator.
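The description above maps directly to a few tensor operations. A minimal PyTorch sketch of the minibatch standard deviation layer, written here for illustration (the released implementation may differ in details such as grouping):

```python
import torch

def minibatch_stddev_layer(x, eps=1e-8):
    """Append one constant feature map holding the average per-feature,
    per-location standard deviation over the minibatch (Section 3).

    x: discriminator activations of shape (N, C, H, W).
    """
    n, c, h, w = x.shape
    # Standard deviation of each feature at each spatial location over the batch.
    std = torch.sqrt(x.var(dim=0, unbiased=False) + eps)      # shape (C, H, W)
    # Average over all features and spatial locations -> a single scalar.
    mean_std = std.mean()
    # Replicate the scalar into one extra (constant) feature map and concatenate.
    extra = mean_std.view(1, 1, 1, 1).expand(n, 1, h, w)
    return torch.cat([x, extra], dim=1)                        # shape (N, C+1, H, W)
```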

Other approaches to the variation problem include unrolling the discriminator (Metz et al., 2016) to regularize its updates, and adding a "repelling regularizer" (Zhao et al., 2017) to the generator that tries to encourage it to orthogonalize the feature vectors within a minibatch. The multiple generators of Ghosh et al. (2017) also serve a similar goal. We acknowledge that these solutions may increase variation even more than our approach, or possibly be orthogonal to it, but leave a detailed comparison to later work.

4 Normalization in Generator and Discriminator

GANs are prone to escalation of signal magnitudes as a result of unhealthy competition between the two networks. Most if not all earlier solutions discourage this by using a variant of batch normalization (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016) in the generator, and often also in the discriminator. These normalization methods were originally introduced to eliminate covariate shift. However, we have not observed that to be an issue in GANs, and thus believe that the actual need in GANs is to constrain signal magnitudes and competition. We use a different approach that consists of two ingredients, neither of which includes learnable parameters.

4.1 Equalized learning rate

We deviate from the current trend of careful weight initialization and instead use a trivial $N(0, 1)$ initialization, then explicitly scale the weights at runtime. To be precise, we set $\hat{w}_i = w_i / c$, where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer (He et al., 2015). The benefit of doing this dynamically instead of during initialization is somewhat subtle and relates to the scale-invariance of commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, making the update independent of the scale of the parameter. As a result, parameters with a larger dynamic range than others will take longer to adjust. This is a scenario modern initializers cause, making it possible for a learning rate to be both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).
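A minimal sketch of runtime weight scaling as a convolution layer, assuming the He constant $c = \sqrt{2/\text{fan\_in}}$; this is an illustrative reimplementation, not the authors' code, and the class name is my own.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EqualizedConv2d(nn.Module):
    """Conv layer with runtime weight scaling (Section 4.1).

    Weights are drawn from N(0, 1) and multiplied at every forward pass by the
    per-layer He constant, so all layers share the same dynamic range (and thus
    the same effective learning speed) under Adam/RMSProp updates.
    """
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        fan_in = in_ch * kernel_size * kernel_size
        self.scale = math.sqrt(2.0 / fan_in)   # He initializer constant c
        self.padding = padding

    def forward(self, x):
        # Scale the weights at runtime instead of at initialization.
        return F.conv2d(x, self.weight * self.scale, self.bias, padding=self.padding)
```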

4.2 Normalization of pixel feature vectors in the generator

To prevent the situation where the magnitudes in the generator and discriminator get out of control due to competition, we normalize the feature vector of each pixel in the generator to unit length after each convolutional layer. We use a variant of "local response normalization" (Krizhevsky et al., 2012), configured as

$$b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a_{x,y}^{j}\right)^{2} + \varepsilon}}$$

where $\varepsilon = 10^{-8}$, $N$ is the number of feature maps, and $a_{x,y}$ and $b_{x,y}$ are the original and normalized feature vector at pixel $(x, y)$, respectively. We find it surprising that this heavy-handed constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it prevents the escalation of signal magnitudes very effectively when needed.
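In code, the formula above amounts to dividing each pixel's feature vector by the root-mean-square of its components. A short sketch (illustrative, not the released implementation):

```python
import torch

def pixel_norm(a, eps=1e-8):
    """Pixel-wise feature vector normalization (Section 4.2).

    a: feature maps of shape (N, C, H, W); C plays the role of N in the formula
    (the number of feature maps). Each pixel's C-dimensional vector is scaled
    to approximately unit length.
    """
    return a / torch.sqrt((a * a).mean(dim=1, keepdim=True) + eps)
```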

5 Multiscale Statistical Similarity for Evaluating GAN Results

In order to compare the results of one GAN to another, one needs to investigate a large number of images, which can be tedious, difficult, and subjective. It is therefore desirable to rely on automated methods that compute some indicative metric from large image collections. We noticed that existing methods such as multi-scale structural similarity (MS-SSIM) (Odena et al., 2017) find large-scale mode collapses reliably but fail to react to smaller effects such as loss of variation in colors or textures, and they also do not directly assess image quality in terms of similarity to the training set.

We build on the intuition that a successful generator will produce samples whose local image structure is similar to the training set over all scales. We propose to study this by considering the multi-scale statistical similarity between distributions of local image patches drawn from Laplacian pyramid (Burt & Adelson, 1987) representations of generated and target images, starting at a low-pass resolution of 16 × 16 pixels. As per standard practice, the pyramid progressively doubles until the full resolution is reached, each successive level encoding the difference to an up-sampled version of the previous level.

A single Laplacian pyramid level corresponds to a specific spatial frequency band. We randomly sample 16384 images and extract 128 descriptors from each level in the Laplacian pyramid, giving us $2^{21}$ (2.1M) descriptors per level. Each descriptor is a 7 × 7 pixel neighborhood with 3 color channels, denoted by $x \in \mathbb{R}^{7 \times 7 \times 3} = \mathbb{R}^{147}$. We denote the patches from level $l$ of the training set and generated set by $\{x^l_i\}_{i=1}^{2^{21}}$ and $\{y^l_i\}_{i=1}^{2^{21}}$, respectively. We first normalize $\{x^l_i\}$ and $\{y^l_i\}$ with respect to the mean and standard deviation of each color channel, and then estimate the statistical similarity by computing their sliced Wasserstein distance $\mathrm{SWD}(\{x^l_i\}, \{y^l_i\})$, an efficiently computable randomized approximation of the earth mover's distance, using 512 projections (Rabin et al., 2011).
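The sliced Wasserstein distance itself reduces to sorting 1-D projections. The following NumPy sketch shows the core of such an estimate under the assumption that both patch sets have the same size and have already been extracted from the pyramid and channel-normalized; it is an illustration, not the authors' evaluation code.

```python
import numpy as np

def sliced_wasserstein(patches_a, patches_b, n_projections=512, rng=None):
    """Randomized approximation of the Wasserstein distance between two equally
    sized sets of patch descriptors, each of shape (n, 147) (Section 5)."""
    rng = np.random.default_rng() if rng is None else rng
    dim = patches_a.shape[1]
    total = 0.0
    for _ in range(n_projections):
        # Project both sets onto a random unit direction.
        direction = rng.standard_normal(dim)
        direction /= np.linalg.norm(direction)
        proj_a = np.sort(patches_a @ direction)
        proj_b = np.sort(patches_b @ direction)
        # The 1-D Wasserstein distance is the mean absolute difference of the
        # sorted projections.
        total += np.abs(proj_a - proj_b).mean()
    return total / n_projections
```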

Intuitively, a small Wasserstein distance indicates that the distributions of the patches are similar, meaning that the training images and generated samples appear similar in both appearance and variation at this spatial resolution. In particular, the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicates similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as the sharpness of edges and noise.

6 Experiments

This section discusses a series of experiments we performed to assess the quality of our results. See Appendix A for a detailed description of the network structure and training configuration. We also invite readers to check out the accompanying video (https://youtu.be/G06dEcZ-QTg) for more resulting images and latent space interpolation.

In this section, we distinguish between network structure (e.g., convolutional layers, resizing), training configuration (various normalization layers, mini-batch-related operations), and training losses (WGAN-GP, LSGAN).

6.1 Assessing the significance of each contribution by statistical similarity

First, we use the sliced Wasserstein distance (SWD) and multi-scale structural similarity (MS-SSIM) (Odena et al., 2017) to evaluate the importance of our individual contributions, and also perceptually validate the metrics themselves. We build on top of a previous state-of-the-art loss function (WGAN-GP) and training configuration (Gulrajani et al., 2017), and test on the CELEBA (Liu et al., 2015) and LSUN BEDROOM (Yu et al., 2015) datasets at 128 × 128 resolution. CELEBA is particularly well suited for such a comparison because the training images contain noticeable artifacts (aliasing, compression, blur, etc.) that are difficult for the generator to reproduce faithfully. In this test we amplify the differences between training configurations by choosing a relatively low-capacity network architecture (Appendix A.2) and terminating training once the discriminator has been shown a total of 10 million real images. As such, the results are not fully converged.

Table 1 lists the numerical values for SWD and MS-SSIM in several training configurations, where our individual contributions are cumulatively enabled on top of the baseline (Gulrajani et al., 2017). The MS-SSIM numbers are averaged over 10000 pairs of generated images, and SWD is calculated as described in Section 5. Generated CELEBA images from these configurations are shown in Figure 3. Due to space constraints, the figure shows only a small number of examples for each row of the table, but a significantly broader set is available in Appendix H. Intuitively, a good evaluation metric should reward plausible images that exhibit plenty of variation in colors, textures, and viewpoints. However, MS-SSIM does not capture this: we can see immediately that configuration (h) generates significantly better images than configuration (a), yet MS-SSIM remains roughly unchanged because it measures only the variation between outputs, not similarity to the training set. SWD, on the other hand, does indicate a clear improvement.

Table 1: Sliced Wasserstein distance (SWD) between generated and training images (Section 5) and multi-scale structural similarity (MS-SSIM) among the generated images for several training setups at 128 × 128 resolution. For SWD, each column represents one level of the Laplacian pyramid, and the last one gives the average of the four distances.

Figure 3: (a)–(g) CELEBA examples corresponding to the rows in Table 1, intentionally trained without convergence. (h) Our converged result. Note that some images show aliasing and some are not sharp; this is a flaw of the dataset, which the model learns to reproduce faithfully.

The first training configuration (a) corresponds to Gulrajani et al. (2017), featuring batch normalization in the generator, layer normalization in the discriminator, and a minibatch size of 64. (b) enables progressive growing of the networks, which results in sharper and more believable output images. SWD correctly finds the distribution of generated images to be more similar to the training set. Our primary goal is to enable high-resolution output, which requires reducing the minibatch size so that it fits within the available memory budget. We illustrate the resulting challenges in (c), where we decrease the minibatch size from 64 to 16. The generated images are unnatural, which is clearly visible in both metrics. In (d), we stabilize the training process by adjusting the hyperparameters and removing batch normalization and layer normalization (Appendix A.2). As an intermediate test (e*), we enable minibatch discrimination (Salimans et al., 2016), which somewhat surprisingly fails to improve any of the metrics, including MS-SSIM, which measures output variation. In contrast, our minibatch standard deviation (e) improves the average SWD score and image quality. We then enable our remaining contributions in (f) and (g), leading to an overall improvement in SWD and subjective visual quality. Finally, in (h) we use a full-capacity network and longer training; we feel the quality of the generated images is at least comparable to the best published results so far.

6.2 Convergence and training speed

Figure 4 illustrates the effect of progressive growing on training in terms of the SWD metric and raw image throughput. The first two plots correspond to the training configuration of Gulrajani et al. (2017) without and with progressive growing. We observe that the progressive variant offers two main benefits: it converges to a considerably better optimum and reduces the total training time by about a factor of two. The improved convergence is explained by an implicit form of curriculum learning imposed by the gradually increasing network capacity. Without progressive growing, all layers of the generator and discriminator are tasked with simultaneously finding succinct intermediate representations for both the large-scale variation and the small-scale detail. With progressive growing, however, the existing low-resolution layers are likely to have converged already early on, so the networks are only tasked with refining the representations by increasingly smaller-scale effects as new layers are introduced. Indeed, we see in Figure 4(b) that the largest-scale statistical similarity curve (16) reaches its optimal value very quickly and remains consistent for the rest of the training. The smaller-scale curves (32, 64, 128) level off one by one as the resolution is increased, but the convergence of each curve is equally consistent. With non-progressive training in Figure 4(a), each scale of the SWD metric converges roughly in unison, as could be expected.

Figure 4: Effect of progressive growing on training speed and convergence. The timings were measured on a single-GPU setup using an NVIDIA Tesla P100. (a) Statistical similarity with respect to wall-clock time for the method of Gulrajani et al. (2017) using the CELEBA dataset at 128 × 128 resolution. Each graph represents the sliced Wasserstein distance on one level of the Laplacian pyramid, and the vertical line indicates the point at which we stop training in Table 1. (b) The same graph with progressive growing enabled. The dashed vertical lines indicate points where we double the resolution of G and D. (c) Effect of progressive growing on raw training speed at 1024 × 1024 resolution.

The speedup from progressive growing increases as the output resolution grows. Figure 4(c) shows training progress, measured in the number of real images shown to the discriminator, as a function of training time when training all the way to 1024 × 1024 resolution. We see that the progressive variant gains a significant head start because the networks are shallow and quick to evaluate at the beginning. Once the full resolution is reached, the image throughput is equal between the two methods. The plot shows that the progressive variant reaches about 6.4 million images in 96 hours, whereas it can be extrapolated that the non-progressive variant would take about 520 hours to reach the same point. In this case, progressive growing offers roughly a 5.4× speedup.

6.3 Generating high-resolution images using the CELEBA-HQ dataset

To meaningfully demonstrate our results at high output resolutions, we need a sufficiently varied high-quality dataset. However, virtually all publicly available datasets previously used in the GAN literature are limited to relatively low resolutions ranging from 32 × 32 to 480 × 480. To this end, we created a high-quality version of the CELEBA dataset consisting of 30,000 images at 1024 × 1024 resolution. We refer to Appendix C for further details about the generation of this dataset. Our contributions allow us to deal with high output resolutions in a robust and efficient fashion.

Figure 5 shows selected 1024 × 1024 images produced by our network. While megapixel GAN results have been shown before in another dataset (Marchesi, 2017), our results are vastly more varied and of higher perceptual quality. Please refer to Appendix F for a larger set of result images as well as the nearest neighbors found from the training data. The accompanying video shows latent space interpolations and visualizes the progressive training. The interpolation works so that we first randomize a latent code for each frame (512 components sampled individually from N(0, 1)), then blur the latents across time with a Gaussian (σ = 45 frames @ 60 Hz), and finally normalize each vector to lie on the hypersphere.
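A short NumPy/SciPy sketch of that interpolation scheme, written for illustration; the scaling of the hypersphere radius to sqrt(512) is my assumption (matching the pixel-norm convention), and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_latent_path(n_frames, dim=512, sigma_frames=45, seed=0):
    """Sample one N(0, 1) latent per video frame, blur the sequence over time
    with a Gaussian (sigma = 45 frames at 60 Hz), and re-normalize each vector
    onto the hypersphere."""
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((n_frames, dim))
    # Temporal Gaussian smoothing, applied independently to each component.
    latents = gaussian_filter1d(latents, sigma=sigma_frames, axis=0, mode="wrap")
    # Re-normalize every frame's vector onto the hypersphere (radius sqrt(dim) assumed).
    latents *= np.sqrt(dim) / np.linalg.norm(latents, axis=1, keepdims=True)
    return latents
```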

Figure 5: 1024 × 1024 images generated using the CELEBA-HQ dataset. See Appendix F for a larger result set, and latent space interpolation in the accompanying video.

We trained the network for 4 days using 8 Tesla V100 GPUs, after which we no longer observed qualitative differences in the results of successive training iterations. Our implementation uses an adaptive mini-batch size according to the current output resolution in order to make optimal use of the available memory budget.

To demonstrate that our contributions are largely orthogonal to the choice of loss function, we also trained the same network using the LSGAN loss instead of the WGAN-GP loss. Figure 1 shows six examples of 1024 × 1024 images produced using LSGAN. Further details of this setup are given in Appendix B.

6.4 Results of LSUN

Figure 6 shows a visual comparison between our solution and earlier results in LSUN BEDROOM. Figure 7 gives selected examples from seven very different LSUN categories at 256 × 256. A larger, non-curated set of results from all 30 LSUN categories is available in Appendix G, and the video demonstrates interpolations. We are not aware of earlier results in most of these categories, and while some categories work better than others, we feel that the overall quality is high.

Figure 6: Comparison of visual quality in LSUN BEDROOM; image from cited article.

Figure 7: A selection of 256 × 256 images generated from different LSUN categories.

6.5 Inception score of CIFAR10

The best published Inception scores we are aware of for CIFAR10 (10 categories of 32 × 32 RGB images) are 7.90 without label conditioning and 8.87 with label conditioning (Grinblat et al., 2017). The large difference between the two numbers is primarily caused by "ghosts" (i.e., transitions between classes) that necessarily appear in the unsupervised setting, while label conditioning can remove many such transitions.

When all of our contributions are enabled, we achieve a score of 8.80 in the unsupervised setting. Appendix D shows a representative set of generated images along with a more comprehensive list of results from earlier methods. The network and training setup were the same as for CELEBA, with the resolution limited to 32 × 32, of course. The only customization was to the WGAN-GP regularization term $\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - \gamma)^2 / \gamma^2\big]$. Gulrajani et al. (2017) used $\gamma = 1.0$, which corresponds to 1-Lipschitz, but we noticed that it is in fact better to prefer fast transitions ($\gamma = 750$) to minimize the ghosts. We have not tried this trick with other datasets.

7 Discussion

While the quality of our results is generally high compared to previous work on GANs, and training at high resolutions is stable, there is still a long way to go before achieving true fidelity. Semantic plausibility and understanding dataset-specific constraints, such as certain objects being straight rather than curved, still leaves a lot of room for improvement. There is also room for improvement in the microstructure of the image. Nonetheless, we believe convincing realism may now be within reach, especially in the CELEBA-HQ dataset.

8 Acknowledgments

We would like to thank Mikael Honkavaara, Tero Kuosmanen, and Timi Hietanen for the compute infrastructure, Dmitry Korobchenko and Richard Calderwood for their efforts related to the CELEBA-HQ dataset, and Oskar Elek, Jacob Munkberg, and Jon Hasselgren for useful comments.

REFERENCES

  1. Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.

  2. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.

  3. Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? an empirical study. CoRR, abs/1706.08224, 2017.

  4. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.

  5. Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In NIPS, pp. 153–160. 2007.

  6. David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017.

  7. Peter J. Burt and Edward H. Adelson. Readings in computer vision: Issues, problems, principles, and paradigms. Chapter: The Laplacian Pyramid As a Compact Image Code, pp. 671–679. 1987.

  8. Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. CoRR, abs/1707.09405, 2017.

  9. Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard H. Hovy, and Aaron C. Courville. Calibrating energy-based generative adversarial networks. In ICLR, 2017.

  10. Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. CoRR, abs/1506.05751, 2015.

  11. Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. CoRR, abs/1606.00704, 2016.

  12. Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. CoRR, abs/1611.01673, 2016.

  13. Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7, pp. 625–632. 1995.

  14. Arnab Ghosh, Viveka Kulharia, Vinay P. Namboodiri, Philip HS Torr, and Puneet Kumar Dokania. Multi-agent diverse generative adversarial networks. CoRR, abs/1704.02906, 2017.

  15. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In NIPS, 2014.

  16. Guillermo L. Grinblat, Lucas C. Uzal, and Pablo M. Granitto. Class-splitting generative adversarial networks. CoRR, abs/1709.07359, 2017.

  17. Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.

  18. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

  19. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pp. 6626–6637. 2017.

  20. R Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-Seeking Generative Adversarial Networks. CoRR, abs/1702.08431, 2017.

  21. Xun Huang, Yixuan Li, Omid Poursaeed, John E. Hopcroft, and Serge J. Belongie. Stacked generative adversarial networks. CoRR, abs/1612.04357, 2016.

  22. Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017.

  23. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

  24. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

  25. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

  26. Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NIPS, volume 29, pp. 4743–4751. 2016.

  27. Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. CoRR, abs/1705.07215, 2017.

  28. Dmitry Korobchenko and Marco Foco. Single image super-resolution using deep learning, 2017. URL: https://gwmt.nvidia.com/super-res/about. Machines Can See summit.

  29. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. 2012.

  30. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.

  31. Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. CoRR, abs/1712.04086, 2017.

  32. Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. CoRR, abs/1703.00848, 2017.

  33. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.

  34. Alireza Makhzani and Brendan J. Frey. PixelGAN autoencoders. CoRR, abs/1706.00531, 2017.

  35. Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using convolutional autoencoders with symmetric skip connections. CoRR, abs/1606.08921, 2016a.

  36. Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, and Zhen Wang. Least squares generative adversarial networks. CoRR, abs/1611.04076, 2016b.

  37. Marco Marchesi. Megapixel size image creation using generative adversarial networks. CoRR, abs/1706.00082, 2017.

  38. Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016.

  39. Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.

  40. Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision (SSVM), pp. 435–446, 2011.

  41. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

  42. Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868, 2016.

  43. Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

  44. Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

  45. Tijmen Tieleman and Geoffrey E. Hinton. Lecture 6.5 - RMSProp. Technical report, COURSERA: Neural Networks for Machine Learning, 2012.

  46. Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Adversarial generator-encoder networks. CoRR, abs/1704.02304, 2017.

  47. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016a.

  48. Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, pp. 1747–1756, 2016b.

  49. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016c.

  50. Twan van Laarhoven. L2 regularization versus batch and weight normalization. CoRR, abs/1706.05350, 2017.

  51. Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. CoRR, abs/1711.11585, 2017.

  52. Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multi-scale structural similarity for image quality assessment. In Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, pp. 1398–1402, 2003.

  53. David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.

  54. Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In ICLR, 2017.

  55. Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015.

  56. Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.

  57. Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial network. In ICLR, 2017.

  58. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.

A Network structure and training configuration

A.1 1024 × 1024 network for CELEBA-HQ

Table 2 shows the network architectures of the full-resolution generator and discriminator that we use with the CELEBA-HQ dataset. Both networks consist mainly of replicated 3-layer blocks that we introduce one by one during the course of training. The last 1 × 1 convolutional layer of the generator corresponds to the toRGB block in Figure 2, and the first 1 × 1 convolutional layer of the discriminator similarly corresponds to fromRGB. We start with 4 × 4 resolution and train the networks until we have shown the discriminator 800k real images in total. We then alternate between two phases: fade in the first 3-layer block during the next 800k images, stabilize the networks for 800k images, fade in the next 3-layer block during 800k images, and so on.
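The alternating fade-in/stabilize schedule can be summarized as a short generator of training phases. This is a minimal sketch of the schedule described above (not the released training script); the function name and tuple format are my own.

```python
def progressive_schedule(max_resolution=1024, images_per_phase=800_000):
    """Yield (resolution, phase, num_images) tuples for the progressive schedule:
    800k images of initial 4x4 training, then for every doubled resolution an
    800k-image fade-in phase followed by an 800k-image stabilization phase."""
    yield 4, "stabilize", images_per_phase            # initial 4x4 training
    res = 8
    while res <= max_resolution:
        yield res, "fade-in", images_per_phase        # new 3-layer block blends in (alpha: 0 -> 1)
        yield res, "stabilize", images_per_phase      # train with the block at full weight
        res *= 2

# Example: list(progressive_schedule(16)) ->
# [(4, 'stabilize', 800000), (8, 'fade-in', 800000), (8, 'stabilize', 800000),
#  (16, 'fade-in', 800000), (16, 'stabilize', 800000)]
```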

Table 2: The generator and discriminator we use in CELEBA-HQ for generating 1024×1024 images.

The latent vectors correspond to random points on a 512-dimensional hypersphere, and we represent training and generated images in the [−1, 1] range. We use leaky ReLU with leakiness 0.2 in all layers of both networks, except for the last layer, which uses linear activation. We do not employ batch normalization, layer normalization, or weight normalization in either network, but we perform pixel-wise normalization of the feature vectors after each 3 × 3 convolutional layer in the generator, as described in Section 4.2.

All bias parameters are initialized to zero, and all weights are initialized according to a normal distribution with unit variance. However, we scale the weights with layer-specific constants at runtime, as described in Section 4.1. We inject the across-minibatch standard deviation, described in Section 3, as an additional feature map at 4 × 4 resolution toward the end of the discriminator. The upsampling and downsampling operations in Table 2 correspond to 2 × 2 element replication and average pooling, respectively.

We train the networks using the Adam optimizer (Kingma & Ba, 2015) with $\alpha = 0.001$, $\beta_1 = 0$, $\beta_2 = 0.99$, and $\epsilon = 10^{-8}$. We do not use any learning rate decay or rampdown. For visualizing the generator output during training, we use an exponential moving average of the generator weights with a decay of 0.999. We use a minibatch size of 16 for resolutions from 4 × 4 to 128 × 128, and then gradually decrease it according to 256 × 256 → 14, 512 × 512 → 6, and 1024 × 1024 → 3 to avoid exceeding the available memory budget.

We use the WGAN-GP loss, but unlike Gulrajani et al. (2017), we alternate between optimizing the generator and discriminator on a per-minibatch basis, i.e., we set $n_{\text{critic}} = 1$. Additionally, we introduce a fourth term into the discriminator loss with an extremely small weight to keep the discriminator output from drifting too far away from zero. To be precise, we set $L' = L + \epsilon_{\text{drift}} \mathbb{E}_{x \sim P_r}[D(x)^2]$, where $\epsilon_{\text{drift}} = 0.001$.
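A sketch of how the drift term slots into a standard WGAN-GP discriminator loss is shown below; this is an illustrative PyTorch implementation (not the authors' code), and the gradient-penalty weight of 10 is the usual WGAN-GP default rather than something stated in this appendix.

```python
import torch

def discriminator_loss(D, real, fake, eps_drift=0.001, gp_weight=10.0):
    """WGAN-GP discriminator loss with the small drift term
    L' = L + eps_drift * E[D(x)^2] described above."""
    d_real, d_fake = D(real), D(fake)
    # Wasserstein term.
    loss = d_fake.mean() - d_real.mean()
    # Gradient penalty on random interpolates between real and fake samples.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(mixed).sum(), mixed, create_graph=True)[0]
    loss = loss + gp_weight * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    # Fourth term: keep the discriminator output from drifting away from zero.
    loss = loss + eps_drift * (d_real ** 2).mean()
    return loss
```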

A.2 Other networks

For spatial resolutions below 1024 × 1024, we leave out an appropriate number of the replicated 3-layer blocks in both networks.

Furthermore, Section 6.1 uses a slightly lower-capacity version, where the number of feature maps in the Conv 3 × 3 layers is reduced at 16 × 16 resolution and by a further factor of 4 at the subsequent resolutions. This leaves 32 feature maps in the last Conv 3 × 3 layers. In Table 1 and Figure 4, we train each resolution for a total of 600k images instead of 800k, and also fade in new layers over the duration of 600k images.

For the "Gulrajani et al. (2017)" case in Table 1, we try to follow their training configuration. Specifically, including α = 0.0001 \alpha = 0.0001a=0.0001β 2 = 0.9 \beta_2 = 0.9b2=0.9 n critic = 5 n_{\text{critic}} = 5 ncritic=5 ϵ drift = 0 \epsilon_{\text{drift}} = 0 ϵdrift=0 , and a mini-batch size of 64. Progressive resolution, mini-batch standard deviation, and runtime weight scaling are disabled, and the He initializer (He et al., 2015) is used for all weights. Furthermore, in the generator, we replaced LReLU with ReLU, changed the linear activation of the last layer to tanh, and changed the pixel-wise normalization to batch normalization. In the discriminator, for all Conv3 × 3 3 \times 33×3 and Conv4 × 4 4 \times 44×Layer 4 adds layer normalization. The latent vector consists of 128 components independently sampled from a normal distribution.

B 1024 × 1024 Least-Squares Generative Adversarial Network (Least-Squares GAN, LSGAN)

We found that LSGAN is generally a less stable loss than WGAN-GP, and it also has a tendency to lose some of the variation toward the end of long runs. Thus, we prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN. For example, the 1024 × 1024 images in Figure 1 are LSGAN-based.

In addition to the techniques described in Sections 2–4, we need one additional hack with LSGAN that prevents the training from spiraling out of control when the dataset is too easy for the discriminator and the discriminator gradients are at risk of becoming meaningless as a result. We adaptively increase the magnitude of multiplicative Gaussian noise in the discriminator based on the output of the discriminator. The noise is applied to the input of each Conv 3 × 3 and Conv 4 × 4 layer. There is a long history of adding noise to the discriminator, and it is generally detrimental to image quality (Arjovsky et al., 2017); ideally it would never be needed, and according to our tests this holds for WGAN-GP (Gulrajani et al., 2017). The magnitude of the noise is determined as $0.2 \cdot \max(0, \hat{d}_t - 0.5)^2$, where $\hat{d}_t = 0.1\,d + 0.9\,\hat{d}_{t-1}$ is an exponential moving average of the discriminator output $d$. The motivation behind this hack is that LSGAN becomes very unstable when $d$ approaches (or exceeds) 1.0.
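A compact sketch of that adaptive-noise hack, written for illustration under the assumption that `d_out` is the mean discriminator output of the current minibatch; the class and method names are my own.

```python
import torch

class AdaptiveNoise:
    """LSGAN stabilization hack: multiplicative Gaussian noise whose strength
    0.2 * max(0, d_hat - 0.5)^2 grows as the exponential moving average d_hat
    of the discriminator output approaches 1.0."""
    def __init__(self):
        self.d_hat = 0.0

    def update(self, d_out):
        # d_out: mean discriminator output on the current minibatch.
        self.d_hat = 0.1 * float(d_out) + 0.9 * self.d_hat

    def apply(self, x):
        # Applied to the input of each Conv 3x3 and Conv 4x4 layer.
        strength = 0.2 * max(0.0, self.d_hat - 0.5) ** 2
        return x * (1.0 + strength * torch.randn_like(x))
```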

C CELEBA-HQ dataset

In this section we describe the process we used to create the high-quality version of the CELEBA dataset, consisting of 30,000 images at 1024 × 1024 resolution. As a starting point, we took the collection of in-the-wild images included as part of the original CELEBA dataset. These images vary greatly in resolution and visual quality, ranging all the way from 43 × 55 to 6732 × 8984 pixels. Some of them show crowds of several people, whereas others focus on the face of a single person, often only a part of the face. Thus, we found it necessary to apply several image processing steps to ensure consistent quality and to center the images on the face region.

Our processing pipeline is illustrated in Figure 8. To improve the overall image quality, we preprocess each JPEG image using two pre-trained neural networks: a convolutional autoencoder trained to remove JPEG artifacts in natural images, similar in structure to Mao et al. (2016a), and an adversarially trained 4x super-resolution network (Korobchenko & Foco, 2017) similar to Ledig et al. (2016). To handle cases where the facial region extends outside the image, we employ padding and filtering to extend the dimensions of the image, as illustrated in Figure 8(c-d).

Figure 8: Creation of the CELEBA-HQ dataset. We obtain JPEG images (a) from the CelebA in-the-wild dataset. We improve visual quality (b, middle) by removing JPEG image artifacts (b, top) and 4x super-resolution (b, bottom). We then extend the image by mirror-filling (c) and Gaussian filtering (d) to produce a visually pleasing depth-of-field effect. Finally, we use the facial landmark locations to select an appropriate crop region (e) and perform high-quality resampling to obtain the final 1024 × 1024 resolution image (f).

Then, based on the facial landmark annotations contained in the original CELEBA dataset, we choose a face cropping rectangle as follows:

$$
\begin{aligned}
x' &= e_1 - e_0 \\
y' &= \tfrac{1}{2}(e_0 + e_1) - \tfrac{1}{2}(m_0 + m_1) \\
c  &= \tfrac{1}{2}(e_0 + e_1) - 0.1 \cdot y' \\
s  &= \max(4.0 \cdot |x'|,\; 3.6 \cdot |y'|) \\
x  &= \mathrm{Normalize}\left(x' - \mathrm{Rotate90}(y')\right) \\
y  &= \mathrm{Rotate90}(x)
\end{aligned}
$$

where $e_0$, $e_1$, $m_0$, and $m_1$ represent the 2D pixel locations of the two eye landmarks and two mouth landmarks, respectively, $c$ and $s$ indicate the center and size of the desired crop rectangle, and $x$ and $y$ indicate its orientation. We constructed the above formulas empirically to ensure that the crop rectangle stays consistent in cases where the face is viewed from different angles. Once we have calculated the crop rectangle, we transform the rectangle to 4096 × 4096 pixels using bilinear filtering, and then scale it to 1024 × 1024 resolution using a box filter.
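For concreteness, a NumPy sketch of the rectangle computation from the landmark positions; this follows the formulas above but is an illustrative reimplementation, not the released dataset tool, and the function name is my own.

```python
import numpy as np

def celeba_hq_crop(e0, e1, m0, m1):
    """Compute the oriented crop rectangle from the eye landmarks e0, e1 and
    mouth landmarks m0, m1 (2D pixel positions), following the formulas above."""
    e0, e1, m0, m1 = map(np.asarray, (e0, e1, m0, m1))
    rotate90 = lambda v: np.array([-v[1], v[0]])                    # 90-degree rotation
    x_p = e1 - e0                                                   # x'
    y_p = 0.5 * (e0 + e1) - 0.5 * (m0 + m1)                         # y'
    c = 0.5 * (e0 + e1) - 0.1 * y_p                                 # rectangle center
    s = max(4.0 * np.linalg.norm(x_p), 3.6 * np.linalg.norm(y_p))   # rectangle size
    x = x_p - rotate90(y_p)
    x = x / np.linalg.norm(x)                                       # Normalize(...)
    y = rotate90(x)                                                 # second orientation axis
    return c, s, x, y
```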

We perform the above processing for all 202599 images in the dataset, analyze the resulting 1024 × 1024 images further to estimate the final image quality, sort the images accordingly, and discard all but the best 30,000 images. We use a frequency-based quality metric that favors images whose power spectrum contains a broad range of frequencies and is approximately radially symmetric. This penalizes blurry images as well as images that have conspicuous directional features due to, e.g., visible halftoning patterns. We selected the cutoff point of 30,000 images as a practical sweet spot, as it appeared to yield the best results.

D CIFAR10 results

Figure 9 shows non-curated images generated in the unsupervised setting, and Table 3 compares against prior art in terms of Inception scores. We report our scores in two different ways: 1) the highest score observed during training runs (here ± refers to the standard deviation returned by the Inception score calculator), and 2) the mean and standard deviation computed from the highest scores seen during training, starting from ten random initializations. Arguably the latter methodology is much more meaningful, as one can be lucky with individual runs (as we were). We did not use any kind of augmentation with this dataset.

Table 3: CIFAR10 Inception scores, the higher the score, the better.

Figure 9: CIFAR10 images generated using a network trained without label conditioning (unsupervised), reaching a record-breaking 8.80 Inception score.

E MNIST-1K Discrete Mode Test with Crippled Discriminator

Metz et al. (2016) describe a setup where a generator synthesizes MNIST digits simultaneously to 3 color channels, the digits are classified using a pre-trained classifier (0.4% error rate in our case), and concatenated to form a number in [0, 999]. They generate a total of 25,600 images and count how many of the discrete modes are covered. They also compute the KL divergence as KL(histogram || uniform). Modern GAN implementations can trivially cover all modes at very low divergence (0.05 in our case), and thus Metz et al. specify a fairly low-capacity generator and two severely crippled discriminators ("K/2" with about 2000 parameters and "K/4" with only about 500) to tease out differences between the training methodologies. Both of these networks use batch normalization.
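The evaluation itself is simple bookkeeping. A sketch of the mode-coverage and KL computation under the assumption that the three color channels map to the hundreds, tens, and ones digits (my assumption for illustration):

```python
import numpy as np

def mode_coverage(digit_predictions):
    """digit_predictions: int array of shape (25600, 3), the classifier's digit
    for each of the three color channels of every generated sample. Returns the
    number of covered modes in [0, 999] and KL(histogram || uniform)."""
    numbers = (digit_predictions[:, 0] * 100
               + digit_predictions[:, 1] * 10
               + digit_predictions[:, 2])
    counts = np.bincount(numbers, minlength=1000).astype(np.float64)
    covered = int((counts > 0).sum())
    p = counts / counts.sum()
    nonzero = p > 0
    # KL divergence to the uniform distribution over 1000 modes (0 log 0 := 0).
    kl = float(np.sum(p[nonzero] * np.log(p[nonzero] * 1000.0)))
    return covered, kl
```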

As shown in Table 4, the WGAN-GP loss already covers many more modes using the network architectures specified by Metz et al. The KL divergence, which is arguably a more accurate metric than the raw count, shows an even clearer improvement.

Table 4: Results of the MNIST discrete pattern test using two tiny discriminators (K/4, K/2) defined by Metz et al. (2016). The number of modes covered (#) and the KL divergence from the uniform distribution are given as mean ± standard deviation over 8 random initializations. For number of modes, higher is better, for KL divergence, lower is better.

Replacing batch normalization with our normalization (equalized learning rate, pixel-wise normalization) improves the result considerably, while also removing learnable parameters from the discriminator. Adding a minibatch standard deviation layer further improves the scores, while restoring the discriminator capacity to within 0.5% of the original. Progressive growing does not help much with these tiny images, but it does not hurt either.

F Additional CELEBA-HQ results

Figure 10 shows the nearest neighbors for our generated images. Figure 11 gives additional generation examples from CELEBA-HQ. We enabled mirror augmentation in all tests using CELEBA and CELEBA-HQ. In addition to the sliced Wasserstein distance (SWD), we also quote the recently introduced Fréchet Inception Distance (FID) (Heusel et al., 2017), computed from 50K images.

Figure 10: Top: Our CELEBA-HQ results. Next five rows: nearest neighbors found from the training data, based on feature-space distance. We used activations from five VGG layers, as suggested by Chen & Koltun (2017). Only the crop highlighted in the bottom-right image was used for comparison in order to exclude image background and focus the search on matching facial features.

Figure 11: Additional 1024 × 1024 images generated using the CELEBA-HQ dataset. The sliced Wasserstein distance (SWD) × 10^3 is 7.48, 7.24, 6.08, 3.51, 3.55, 3.02, 7.22 for levels 1024, ..., 16, respectively, with an average of 5.44. The Fréchet Inception Distance (FID) computed from 50K images is 7.30. See the video for latent space interpolations.

G LSUN Results

Figures 12-17 show representative images generated for all 30 LSUN categories. A separate network was trained for each category, using the same parameters. All classes were trained on 100K images, except BEDROOM and DOG, which used all available data. Since 100K images is very limited training data for most classes, we enabled mirror augmentation in these tests (but not for BEDROOM or DOG).

Figure 12: 256 × 256 example images generated from LSUN categories. The sliced Wasserstein distance (SWD) × 10^3 is given for levels 256, 128, 64, 32, and 16, with the average shown in bold. We also quote the Fréchet Inception Distance (FID) computed from 50K images.

Figure 13: 256 × 256 example images generated from LSUN categories. The sliced Wasserstein distance (SWD) × 10^3 is given for levels 256, 128, 64, 32, and 16, with the average shown in bold. We also quote the Fréchet Inception Distance (FID) computed from 50K images.

Figure 14: 256 × 256 example images generated from LSUN categories. The sliced Wasserstein distance (SWD) × 10^3 is given for levels 256, 128, 64, 32, and 16, with the average shown in bold. We also quote the Fréchet Inception Distance (FID) computed from 50K images.

Figure 15: 256 × 256 example images generated from LSUN categories. The sliced Wasserstein distance (SWD) × 10^3 is given for levels 256, 128, 64, 32, and 16, with the average shown in bold. We also quote the Fréchet Inception Distance (FID) computed from 50K images.

Figure 16: 256 × 256 example images generated from LSUN categories. The sliced Wasserstein distance (SWD) × 10^3 is given for levels 256, 128, 64, 32, and 16, with the average shown in bold. We also quote the Fréchet Inception Distance (FID) computed from 50K images.

Figure 17: 256 × 256 example images generated from LSUN categories. The sliced Wasserstein distance (SWD) × 10^3 is given for levels 256, 128, 64, 32, and 16, with the average shown in bold. We also quote the Fréchet Inception Distance (FID) computed from 50K images.

H Additional image for Table 1

Figure 18 shows a larger set of images for the non-convergent settings in Table 1. The training time is intentionally limited to make the differences between the various methods more visible.

Origin blog.csdn.net/I_am_Tony_Stark/article/details/132365957