[Computer Vision | Generative Adversarial Networks] Large-Scale GAN Training for High-Fidelity Natural Image Synthesis (BigGAN)

This post is part of a series of notes on deep learning / computer vision papers. Please credit the source when reposting.

Title: Large Scale GAN Training for High Fidelity Natural Image Synthesis

Link: [1809.11096] Large Scale GAN Training for High Fidelity Natural Image Synthesis (arxiv.org)

Abstract

Despite recent advances in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train generative adversarial networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the generator's input. Our modifications lead to models that set a new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and a Fréchet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.65.

Figure 1: Class-conditional samples generated by our model.

1 Introduction

The field of generative image modeling has made tremendous progress in recent years, and generative adversarial networks (GANs, Goodfellow et al., 2014) are at the forefront of efforts to learn to generate high-fidelity, diverse images directly from data. GAN training is dynamic and sensitive to nearly every aspect of its setup (from optimization parameters to model architecture), but a large body of research has yielded empirical and theoretical insights that make stable training possible in a variety of settings. Despite these advances, the current state of the art on conditional ImageNet modeling (Zhang et al., 2018) reaches an Inception Score (Salimans et al., 2016) of 52.5, compared to 233 for real data.

In this study, we aim to close the fidelity and diversity gap between GAN-generated images and real images in the ImageNet dataset. We contribute to this goal in the following three ways:

  • We demonstrate that GANs benefit dramatically from scale, training models with two to four times as many parameters and eight times larger batch sizes than previous work. We introduce two simple, general architectural changes that improve scalability, and modify a regularization scheme to improve conditioning, yielding a significant performance boost.
  • Thanks to our modification, our model works well with the "truncation trick", a simple sampling technique that allows explicit, fine-grained control over the trade-off between sample diversity and fidelity.
  • We discover instabilities specific to large-scale GANs and empirically characterize them. Drawing on insights from this analysis, we show that a combination of novel and state-of-the-art techniques can reduce these instabilities, but achieve full training stability only at a significant cost in performance.

Our modifications substantially improve class-conditional GANs. On ImageNet trained at 128×128 resolution, our models (BigGANs) improve the state-of-the-art Inception Score (IS) and Fréchet Inception Distance (FID) from 52.52 and 18.65 to 166.5 and 7.4, respectively. We also successfully train BigGANs on ImageNet at 256×256 and 512×512 resolutions, achieving an IS and FID of 232.5 and 8.1 at 256×256 and 241.5 and 11.5 at 512×512, respectively. Finally, we train our models on a much larger dataset, JFT-300M, and show that our design choices transfer well from ImageNet. Code and weights for our pretrained generators are publicly available¹.

2. Background

Generative Adversarial Networks (GANs) involve generator (G) and discriminator (D) networks whose respective purposes are to map random noise to samples and to distinguish real samples from generated ones. Formally, the GAN objective, in its original form (Goodfellow et al., 2014), involves finding a Nash equilibrium of the following two-player min-max problem:

$$\min_G \max_D \; \mathbb{E}_{x \sim q_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \tag{1}$$

where $z \in \mathbb{R}^{d_z}$ is a latent variable drawn from a distribution $p(z)$ such as $N(0, I)$ or $U[-1, 1]$.
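For concreteness, here is a minimal PyTorch sketch of how the two losses in Eq. (1) are typically computed; `G`, `D`, and `real` are placeholders for a generator, a discriminator returning logits, and a batch of real images, and the commonly used non-saturating generator loss is written out rather than the literal minimax form:

```python
import torch
import torch.nn.functional as F

def gan_losses(G, D, real, z_dim=128):
    # z ~ p(z), here the standard choice N(0, I)
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z)

    d_real = D(real)           # logits for real samples
    d_fake = D(fake.detach())  # logits for generated samples (no gradient into G)

    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))] from Eq. (1),
    # i.e. it minimizes the sum of the two cross-entropy terms below.
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Eq. (1) has G minimize E[log(1 - D(G(z)))]; in practice the non-saturating
    # form -E[log D(G(z))] is usually optimized instead, as written here.
    d_fake_for_g = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(d_fake_for_g,
                                                torch.ones_like(d_fake_for_g))
    return d_loss, g_loss
```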

When applied to images, G and D are usually convolutional neural networks (Radford et al., 2016). Without auxiliary stabilization techniques, this training procedure is notoriously brittle, requiring finely tuned hyperparameters and architectural choices to work at all.

As a result, much recent research has focused on modifications to the vanilla GAN procedure to achieve stability, drawing on a growing body of empirical and theoretical insights (Nowozin et al., 2016; Sønderby et al., 2017; Fedus et al., 2018). One line of work changes the objective function (Arjovsky et al., 2017; Mao et al., 2016; Lim and Ye, 2017; Bellemare et al., 2017; Salimans et al., 2018) to encourage convergence. Another line constrains D through gradient penalties (Gulrajani et al., 2017; Kodali et al., 2017; Mescheder et al., 2018) or normalization (Miyato et al., 2018), both to counteract the use of unbounded loss functions and to ensure that D provides gradients to G everywhere.

Of particular relevance to our work is spectral normalization (Miyato et al., 2018), which enforces Lipschitz continuity on D by normalizing its parameters with running estimates of their first singular values, inducing backwards dynamics that adaptively regularize the top singular direction. Relatedly, Odena et al. (2018) analyze the condition number of the Jacobian of G and find that performance depends on G's conditioning. Zhang et al. (2018) find that applying spectral normalization in G improves stability, allowing fewer D steps per iteration. We extend these analyses to gain further insight into the pathologies of GAN training.

Other studies focus on the choice of architecture, such as SA-GAN (Zhang et al., 2018), which adds self-attention blocks from (Wang et al., 2018) to improve the ability of G and D to model the global structure. ProGAN (Karras et al., 2018) trains high-resolution GANs in the single-class setting by training a single model on a sequence of progressively increasing resolutions.

In conditional GANs (Mirza and Osindero, 2014), class information can be fed into the model in a variety of ways. In (Odena et al., 2017), class information is provided to G by concatenating a 1-hot class vector to the noise vector, and the objective is modified to encourage conditional samples to maximize the corresponding class probability predicted by an auxiliary classifier. de Vries et al. (2017) and Dumoulin et al. (2017) modify the way class conditioning is passed to G by supplying it with class-conditional gains and biases in BatchNorm (Ioffe and Szegedy, 2015) layers. In Miyato and Koyama (2018), D is conditioned by using the cosine similarity between its features and a set of learned class embeddings as additional evidence for distinguishing real from generated samples, effectively encouraging the generation of samples whose features match a learned class prototype.

Objectively evaluating implicit generative models is difficult (Theis et al., 2015). Many studies have proposed heuristics for measuring model sample quality without computable likelihoods (Salimans et al., 2016; Heusel et al., 2017; Bińkowski et al., 2018; Wu et al., 2017). Among them, the Inception score (IS, Salimans et al., 2016) and the Fréchet Inception distance (FID, Heusel et al., 2017) have become popular despite their obvious flaws (Barratt and Sharma, 2018). We use them as an approximate measure of sample quality and for comparison with prior work.

3. Scaling Up GANs

In this section, we explore ways to scale up GAN training to larger models and larger batches for better performance. As a baseline, we use the SA-GAN architecture of Zhang et al. (2018), which uses the hinge-loss GAN objective (Lim and Ye, 2017; Tran et al., 2017). We provide class information to G with class-conditional BatchNorm (Dumoulin et al., 2017; de Vries et al., 2017) and to D with projection (Miyato and Koyama, 2018). The optimization settings follow Zhang et al. (2018) (in particular, spectral normalization is applied to G), but we halve the learning rate and take two D steps per G step. For evaluation, we use a moving average of G's weights following Karras et al. (2018), Mescheder et al. (2018) and Yazıcı et al. (2018), with a decay of 0.9999. We use orthogonal initialization (Saxe et al., 2014), whereas previous works used N(0, 0.02I) (Radford et al., 2016) or Xavier initialization (Glorot and Bengio, 2010). Each model is trained on 128 to 512 cores of a Google TPUv3 Pod (Google, 2018), and BatchNorm statistics in G are computed across all devices rather than per device. We find that progressive growing (Karras et al., 2018) is unnecessary even for our 512×512 models. Additional details are in Appendix C.
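As a rough illustration of two of these ingredients, orthogonal initialization and the moving average of G's weights, here is a short PyTorch-style sketch; the tiny stand-in generator and the helper names are ours, not the authors' code:

```python
import copy
import torch
import torch.nn as nn

def orthogonal_init(module):
    """Apply orthogonal initialization (Saxe et al., 2014) to conv/linear weights."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Stand-in for the real generator architecture of Appendix B.
G = nn.Sequential(nn.Linear(120, 4 * 4 * 16), nn.ReLU())
G.apply(orthogonal_init)

# Exponential moving average of G's weights, used only for sampling/evaluation.
G_ema = copy.deepcopy(G)

@torch.no_grad()
def update_ema(G, G_ema, decay=0.9999):
    for p, p_ema in zip(G.parameters(), G_ema.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```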

We begin by increasing the batch size of the baseline model and immediately find tremendous benefits in doing so. Rows 1-4 of Table 1 show that simply increasing the batch size by a factor of 8 improves the state-of-the-art Inception Score by 46%. We conjecture that this is because each batch covers more modes, providing better gradients for both networks. A notable side effect of this scaling is that our models reach better final performance in fewer iterations, but become unstable and undergo complete training collapse. We discuss the causes and ramifications of this in Section 4. For these experiments, we report scores from checkpoints saved just before collapse.

Table 1: FID (Fréchet Inception Distance, lower is better) and IS (Inception Score, higher is better) ablation results for our proposed modifications. Batch is the batch size, Param is the total number of parameters, Ch. is the channel multiplier representing the number of units in each layer, Shared indicates use of shared embeddings, Skip-z indicates skip connections from the latent variable to multiple layers, Ortho. indicates orthogonal regularization, and Itr indicates whether the setting is stable to 1,000,000 iterations or collapses at the given iteration. Other than rows 1-4, results are computed across 8 random initializations.

We then increased the width (number of channels) in each layer by 50%, roughly doubling the number of parameters in both models. This leads to a further 21% improvement in the Inception score, which we attribute to the increased capacity of the model relative to the complexity of the dataset. Doubling the depth initially brought no improvement - we addressed this later in the BigGAN-deep model, which used a different residual block structure.

We note that the class embeddings c used for the conditional BatchNorm layers in G contain a large number of weights. Rather than using a separate layer for each embedding (Miyato et al., 2018; Zhang et al., 2018), we opt for a shared embedding that is linearly projected to each layer's gains and biases (Perez et al., 2018). This reduces computation and memory costs and improves training speed (the number of iterations required to reach a given performance) by 37%. Next, we add direct skip connections (skip-z) from the noise vector z to multiple layers of G, rather than just the initial layer. The intuition behind this design is to allow G to use the latent space to directly influence features at different resolutions and levels of hierarchy. In BigGAN, this is done by splitting z into one chunk per resolution and concatenating each chunk to the conditional vector c, which is projected to the BatchNorm gains and biases. In BigGAN-deep, we use an even simpler design: the entire z is concatenated with the conditional vector without splitting it into chunks. Previous works (Goodfellow et al., 2014; Denton et al., 2015) have considered variants of this concept; our implementation is a minor modification of this design. Skip-z provides a modest performance improvement of around 4%, and improves training speed by a further 18%.

3.1 Trading Off Fidelity and Diversity: The Truncation Trick

Figure 2: (a) Effect of increasing truncation. From left to right, the threshold is set to 2, 1, 0.5, 0.04. (b) Saturation artifacts when truncation is applied to the poorly conditioned model.

Unlike models that need to backpropagate through their latent variables, GANs can use an arbitrary prior p(z), yet the vast majority of previous works have chosen to draw z from either N(0, I) or U[-1, 1]. We question the optimality of this choice and explore alternatives in Appendix E. Remarkably, our best results come from using a different latent distribution at sampling time than at training time. Taking a model trained with z ∼ N(0, I) and sampling z from a truncated normal (where values whose magnitude falls outside a chosen range are resampled until they fall inside it) immediately improves IS and FID. We call this the truncation trick: truncating a z vector by resampling values whose magnitude exceeds a chosen threshold improves individual sample quality at the cost of a reduction in overall sample variety. Figure 2(a) demonstrates this: as the threshold is reduced and elements of z are truncated toward zero (the mode of the latent distribution), individual samples approach the mode of G's output distribution. Related observations appear in (Marchesi, 2016; Pieters and Wiering, 2014).
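A minimal sketch of the truncation trick as described above, assuming a PyTorch workflow; the function name and default threshold are illustrative:

```python
import torch

def truncated_noise(batch_size, dim, threshold=0.5):
    """Truncation trick: draw z ~ N(0, I) and resample any entry whose
    magnitude exceeds `threshold` until all entries fall inside the range."""
    z = torch.randn(batch_size, dim)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))
```

Lower thresholds push samples toward the mode of G's output distribution, trading variety for fidelity, which is exactly the trade-off traced out in Figure 17.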

This technique allows fine-grained, post-hoc selection of the trade-off between sample quality and variety for a given G. Notably, we can compute FID and IS for a range of thresholds, obtaining a variety-fidelity curve reminiscent of the precision-recall curve (Figure 17). Since IS does not penalize lack of variety in class-conditional models, reducing the truncation threshold leads to a direct increase in IS (analogous to precision). FID penalizes lack of variety (analogous to recall) but also rewards precision, so we initially see a moderate improvement in FID, but as truncation approaches zero and variety diminishes, FID sharply drops. The distribution shift caused by sampling with latents different from those seen in training is problematic for many models. Some of our larger models are not amenable to truncation and produce saturation artifacts (Figure 2(b)) when fed truncated noise. To counteract this, we seek to enforce amenability to truncation by conditioning G to be smooth, so that the full space of z maps to good output samples. For this, we turn to orthogonal regularization (Brock et al., 2017), which directly enforces the orthogonality condition:
$$R_\beta(W) = \beta \left\| W^\top W - I \right\|_F^2 \tag{2}$$

where W is a weight matrix and β is a hyperparameter. This regularization is known to often be too limiting (Miyato et al., 2018), so we explore several variants designed to relax the constraint while still imparting the desired smoothness to our models. The version we find to work best removes the diagonal terms from the regularization and aims to minimize the pairwise cosine similarity between filters, but does not constrain their norm:

$$R_\beta(W) = \beta \left\| W^\top W \odot (\mathbf{1} - I) \right\|_F^2 \tag{3}$$

where **1** denotes a matrix with all elements set to 1. We sweep β values and select 10⁻⁴, finding this small added penalty sufficient to improve the likelihood that our models will be amenable to truncation. Across the runs in Table 1, we observe that without orthogonal regularization only 16% of models are amenable to truncation, compared to 60% when trained with orthogonal regularization.
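The following sketch shows one way the relaxed penalty of Eq. (3) could be computed over a model's weights in PyTorch; which layers to include and how filters are flattened are our assumptions, not a prescription from the paper:

```python
import torch

def ortho_reg(model, beta=1e-4):
    """Relaxed orthogonal regularization of Eq. (3): penalize the off-diagonal
    entries of the filter Gram matrix (pairwise filter inner products) without
    constraining filter norms. Here filters are flattened so that rows of W are
    filters, so the Gram matrix is W W^T (equivalent up to transposition)."""
    loss = 0.0
    for name, p in model.named_parameters():
        if p.dim() < 2 or 'bias' in name:
            continue
        w = p.view(p.size(0), -1)                  # flatten each filter to a row
        gram = w @ w.t()                           # pairwise filter inner products
        off_diag = gram * (1.0 - torch.eye(gram.size(0), device=gram.device))
        loss = loss + beta * off_diag.pow(2).sum() # squared Frobenius norm
    return loss
```

This term would simply be added to the generator loss during training; the β = 1e-4 default mirrors the value selected in the sweep above.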

3.2 Summary

We find that current GAN techniques are sufficient to enable scaling to large models and distributed, large-batch training. We find that we can dramatically improve on the state of the art and train models up to 512×512 resolution without needing explicit multiscale methods such as those of Karras et al. (2018). Despite these improvements, our models still undergo training collapse, necessitating early stopping in practice. In the next two sections, we investigate why settings that were stable in previous works become unstable when applied at large scale.

4. Analysis

4.1 Measuring Instability: The Generator

Figure 3: Typical plots of the first singular value σ0 in the layers of G (a) and D (b), before applying spectral normalization. Most layers of G have well-behaved spectra, but without constraints a small subset grow throughout training and explode at collapse. D's spectra are noisier but otherwise better behaved. Colors from red to violet indicate increasing depth.

Many previous works have investigated GAN stability from a variety of analytical angles and on toy problems, but the instabilities we observe occur in settings that are stable at small scale, necessitating direct analysis at large scale. We monitor a range of weight, gradient, and loss statistics during training, in search of a metric that might presage the onset of training collapse, similar to (Odena et al., 2018). We found the top three singular values σ0, σ1, σ2 of each weight matrix to be the most informative. They can be efficiently computed using the Arnoldi iteration method (Golub & Van der Vorst, 2000), which extends the power iteration method used in Miyato et al. (2018) to estimation of additional singular vectors and values. A clear pattern emerges, as shown in Figure 3(a) and Appendix F: most layers of G have well-behaved spectral norms, but some layers (typically the first layer of G, which is over-complete and not convolutional) have spectral norms that grow throughout training and explode at collapse.

To ascertain whether this pathology is a cause of collapse or merely a symptom, we study the effect of imposing additional conditioning on G to explicitly counteract spectral explosion. First, we directly regularize the top singular value σ0 of each weight, either toward a fixed value σ_reg or toward some ratio r of the second singular value, r·sg(σ1) (where sg is the stop-gradient operation, which prevents the regularization from increasing σ1). Alternatively, we employ a partial singular value decomposition to clamp σ0 instead. Given a weight W, its first singular vectors u0 and v0, and σ_clamp, the value to which σ0 will be clamped, our weights become:

$$W = W - \max(0, \sigma_0 - \sigma_{\text{clamp}}) \, v_0 u_0^\top \tag{4}$$

where σ_clamp is set to either σ_reg or r·sg(σ1). We observe that, both with and without spectral normalization, these techniques prevent the gradual increase and explosion of σ0 or σ0/σ1, but even though in some cases they mildly improve performance, no combination prevents training collapse. This evidence suggests that while conditioning G might improve stability, it is insufficient to ensure stability. We accordingly turn our attention to D.
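A sketch of the clamping option of Eq. (4), assuming PyTorch; for clarity it uses a full SVD, whereas the paper describes using a partial decomposition for efficiency, and the helper name is ours:

```python
import torch

@torch.no_grad()
def clamp_top_singular_value(param, sigma_clamp):
    """Clamp the first singular value of a weight to sigma_clamp via the
    rank-1 update of Eq. (4), applied in place to the parameter's data."""
    W = param.detach().view(param.size(0), -1)   # 2-D view sharing storage
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sigma0 = S[0].item()
    if sigma0 > sigma_clamp:
        u0, v0 = U[:, 0], Vh[0, :]
        # Subtract the excess along the top singular direction.
        W.sub_((sigma0 - sigma_clamp) * torch.outer(u0, v0))
```

In a training loop this would be called on selected weights of G after each optimizer step, with sigma_clamp set to σ_reg or to r·σ1 as described above.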

4.2 Measuring Instability: The Discriminator

As with G, we analyze the spectra of D's weights to gain insight into its behavior, and then seek to stabilize training by imposing additional constraints. Figure 3(b) displays a typical plot of σ0 for D (with further plots in Appendix F). Unlike G, the spectra are noisy, σ0/σ1 is well-behaved, and the singular values grow throughout training but only jump at collapse rather than exploding. The spikes in D's spectra might suggest that it periodically receives very large gradients, but we observe that the Frobenius norms are smooth (Appendix F), suggesting that this effect is primarily concentrated on the top few singular directions. We posit that this noise is a result of optimization through the adversarial training process, where G periodically produces batches that strongly perturb D. If this spectral noise is causally related to instability, a natural counter is to employ gradient penalties, which explicitly regularize changes in D's Jacobian. We explore the R1 zero-centered gradient penalty:

$$R_1 := \frac{\gamma}{2} \, \mathbb{E}_{p_D(x)} \left[ \left\| \nabla D(x) \right\|_F^2 \right] \tag{5}$$

With the default suggested strength of γ = 10, training becomes stable and improves the smoothness and boundedness of the spectra in both G and D, but performance severely degrades, with a 45% reduction in IS. Reducing the penalty partially alleviates this degradation, but results in increasingly ill-behaved spectra; even with the penalty strength reduced to 1 (the lowest strength at which sudden collapse does not occur), the IS is reduced by 20%. Repeating this experiment with various strengths of orthogonal regularization, DropOut (Srivastava et al., 2014), and L2 (see Appendix I for details) reveals similar behaviors for these regularization strategies: with a high enough penalty on D, training stability can be achieved, but at a substantial cost to performance.
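A minimal sketch of the R1 zero-centered gradient penalty of Eq. (5) in PyTorch, evaluated on real samples only; `D` and `real` are placeholders:

```python
import torch

def r1_penalty(D, real, gamma=10.0):
    """R1 penalty of Eq. (5): (gamma / 2) * E[ ||grad_x D(x)||^2 ] on real data."""
    real = real.detach().requires_grad_(True)
    d_out = D(real)
    grads, = torch.autograd.grad(outputs=d_out.sum(), inputs=real,
                                 create_graph=True)
    return 0.5 * gamma * grads.pow(2).flatten(1).sum(1).mean()
```

The penalty is simply added to D's loss each discriminator step; as discussed above, γ = 10 stabilizes training but costs roughly 45% of IS in this setting.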

We also observe that D's loss approaches zero during training, but undergoes a sharp upward jump at collapse (Appendix F). One possible explanation is that D is overfitting to the training set, memorizing training examples rather than learning a meaningful boundary between real and generated images. As a simple test for D's memorization (related to Gulrajani et al. (2017)), we evaluate uncollapsed discriminators on the ImageNet training and validation sets and measure what percentage of samples are classified as real or generated. While training accuracy is consistently above 98%, validation accuracy falls in the range of 50-55%, no better than random guessing (regardless of regularization strategy). This confirms that D is indeed memorizing the training set; we deem this in line with D's role, which is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G. Additional experiments and discussion are provided in Appendix G.

4.3 Summary

We find that stability does not come solely from G or D, but from their interaction through the adversarial training process. While the symptoms of their poor conditioning can be used to track and identify instability, ensuring reasonable conditioning proves necessary for training but insufficient to prevent eventual training collapse. It is possible to enforce stability by strongly constraining D, but doing so incurs a dramatic cost in performance. With current techniques, better final performance can be achieved by relaxing this conditioning and allowing collapse to occur at the later stages of training, by which time the model is sufficiently trained to achieve good results.

5 Experiments

5.1 Evaluation on ImageNet

Table 2: Evaluation of the models at different resolutions. We report the score without truncation (column 3), the score for the best FID (column 4), the score for IS on the validation data (column 5), and the score for maximum IS (column 6). The standard deviation is calculated based on at least three random initializations.

We evaluate our models on ImageNet ILSVRC 2012 (Russakovsky et al., 2015) at 128×128, 256×256, and 512×512 resolutions, using the settings from Table 1, row 8. Samples generated by our models are shown in Figure 4, with additional samples in Appendix A and online². We report IS and FID in Table 2. Since our models are able to trade off sample quality and variety, it is unclear how best to compare against prior art; we therefore report values for three settings, with full curves in Appendix D. First, we report the FID/IS values at the truncation setting that achieves the best FID. Second, we report the FID at the truncation setting for which our model's IS matches that attained on the real validation data, reasoning that this is a passable measure of maximum sample variety achieved while still reaching a good level of "objectness." Third, we report the FID at the maximum IS achieved by each model, to demonstrate how much variety must be traded off to maximize quality. In all three cases, our models outperform the previous state-of-the-art IS and FID scores achieved by Miyato et al. (2018) and Zhang et al. (2018).

Figure 4: Samples generated by our BigGAN model with a truncation threshold of 0.5 (a-c), and examples of class leakage in a partially trained model (d).

In addition to the BigGAN model introduced in the first edition of the paper and used in most experiments (unless stated otherwise), we also show a 4x deeper model (BigGAN-deep) that uses a different residual block configuration. As can be seen from Table 2, BigGAN-deep significantly outperforms BigGAN on all resolutions and metrics. This confirms that our findings apply to other architectures and that increasing depth improves sample quality. A detailed description of the BigGAN and BigGAN-deep architectures is in Appendix B.

Our observation that D overfits the training set, coupled with the sample quality of our model, raises the obvious question of whether G simply memorizes training points. To test this, we perform class-wise nearest-neighbor analysis in pixel space and in the feature space of a pre-trained classifier network (Appendix A). In addition, Figures 8 and 9 show interpolations between samples and interpolations between classes (holding z constant). Our model convincingly interpolates between disparate samples, and the nearest neighbors of its samples are visually distinct, suggesting that it does not simply memorize training data.

We notice that some failure modes of our partially trained models differ from those previously observed. Most previous failures involve local artifacts (Odena et al., 2016), images consisting of texture blobs instead of objects (Salimans et al., 2016), or classical mode collapse. We observe class leakage, where images from one class contain attributes of another, as shown in Figure 4(d). We also find that many classes on ImageNet are harder for our model: it is more successful at generating dogs (which make up a large portion of the dataset and are mainly distinguished by their texture) than at generating crowds (which comprise a small portion of the dataset and have more large-scale structure). Further discussion is in Appendix A.

5.2 Additional evaluation on JFT-300M

To confirm that our design choices are effective for larger, more complex, and more diverse datasets, we also present results of our system on a subset of JFT-300M (Sun et al., 2017). The full JFT-300M dataset contains 300M real-world images labeled with 18K categories. Since the category distribution is heavily long-tailed, we subsample the dataset to keep only images with the 8.5K most common labels. The resulting dataset contains 292M images, two orders of magnitude larger than ImageNet. For images with multiple labels, we sample a single label randomly and independently whenever an image is sampled. To compute IS and FID for GANs trained on this dataset, we use an Inception v2 classifier (Szegedy et al., 2016) trained on this dataset. Quantitative results are presented in Table 3. All models are trained with a batch size of 2048. We compare an ablated version of our model, comparable to SA-GAN (Zhang et al., 2018) but with a larger batch size, against a "full" BigGAN model that makes use of all of the techniques applied to obtain our best results on ImageNet (shared embedding, skip-z, and orthogonal regularization). Our results show that these techniques substantially improve performance on this much larger dataset, even at the same model capacity (64 base channels). We further show that for a dataset of this scale, we see significant additional improvements from expanding the capacity of our models to 128 base channels, while for ImageNet this additional capacity was not beneficial.

Table 3: BigGAN results on JFT-300M at 256×256 resolution. The FID and IS columns report the scores given by the JFT-300M-trained Inception v2 classifier with noise distributed as z ∼ N(0, I) (non-truncated). The (min FID)/IS and FID/(max IS) columns report the best FID and IS scores obtained from a sweep across truncated noise distributions ranging from σ = 0 to σ = 2. Images from the JFT-300M validation set have an IS of 50.88 and FID of 1.94.

In Figure 19 (Appendix D), we present truncation plots for models trained on this dataset. Unlike ImageNet, where truncation limits of σ ≈ 0 tend to produce the highest fidelity scores, IS for our JFT-300M models is typically maximized when the truncation value σ ranges from 0.5 to 1. We suspect that this is at least partially due to the intra-class variability of JFT-300M labels, as well as the relative complexity of the image distribution, which includes images with multiple objects at a variety of scales. Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization (Section 4), the models trained on JFT-300M remain stable. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues.

The improvements over the baseline GAN model that we achieve on this dataset without changes to the underlying model, training, or regularization techniques (beyond expanded capacity) demonstrate that our findings extend from ImageNet to datasets whose scale and complexity are thus far unprecedented for image generation models.

6 Conclusion

We have demonstrated that GANs trained to model natural images of multiple categories benefit dramatically from scaling up, both in terms of fidelity and variety of the generated samples. As a result, our models set a new level of performance among ImageNet GAN models, improving on the state of the art by a large margin. We have also analyzed the training behavior of large-scale GANs, characterized their stability in terms of the singular values of their weights, and discussed the interplay between stability and performance.

Acknowledgments

We thank Kai Arulkumaran, Matthias Bauer, Peter Buchlovsky, Jeffrey Defauw, Sander Dieleman, Ian Goodfellow, Ariel Gordon, Karol Gregor, Dominik Grewe, Chris Jones, Jacob Menick, Augustus Odena, Suman Ravuri, Ali Razavi, Mihaela Rosca, and Jeff Stanway.

References

(……)

Appendix A: Extra Samples, Interpolation, and Nearest Neighbors from ImageNet Models

Figure 5: Samples generated by our BigGAN model at 256×256 resolution.

Figure 6: Samples generated by our BigGAN model at 512×512 resolution.

Figure 7: Comparing easy classes (a) with difficult classes (b) at 512×512 resolution. Classes such as dogs, which are ubiquitous in the dataset and are mainly distinguished by texture, are far easier to model than classes involving unaligned human faces or crowds. Such classes are more dynamic and structured, often with details to which human observers are more sensitive. Generating at high resolution exacerbates the difficulty of modeling global structure, even when non-local blocks are used.

Figure 8: Interpolation between z, c pairs.

Figure 9: Interpolation between c while holding z constant. Pose semantics are frequently maintained between endpoints (particularly in the last row). The second row demonstrates that grayscale is encoded in the joint z, c space rather than in z.

Figure 10: Nearest neighbors in the feature space of VGG-16-fc7 (Simonyan & Zisserman, 2015). The resulting image is in the upper left corner.

Figure 11: Nearest neighbors in the feature space of ResNet-50-avgpool (He et al., 2016). The resulting image is in the upper left corner.

Figure 12: Nearest neighbors in pixel space. The resulting image is in the upper left corner.

Figure 13: Nearest neighbors in the feature space of VGG-16-fc7 (Simonyan & Zisserman, 2015). The resulting image is in the upper left corner.

Figure 14: Nearest neighbors in the feature space of ResNet-50-avgpool (He et al., 2016). The resulting image is in the upper left corner.

Appendix B: Architecture Details

In the BigGAN model (Fig. 15), we use the ResNet (He et al., 2016) GAN architecture of (Zhang et al., 2018), which is identical to that of (Miyato et al., 2018) except that the channel pattern in D is modified so that the number of filters in the first convolutional layer of each block equals the number of output filters (rather than the number of input filters, as in Miyato et al. (2018) and Gulrajani et al. (2017)). We use a single shared class embedding in G, and skip connections for the latent vector z (skip-z). In particular, we employ a hierarchical latent space: the latent vector z is split along its channel dimension into chunks of equal size (20 dimensions in our case), and each chunk is concatenated to the shared class embedding and passed to a corresponding residual block as a conditioning vector. The conditioning of each block is linearly projected to produce per-sample gains and biases for the block's BatchNorm layers. The bias projections are zero-centered, while the gain projections are centered at 1. Since the number of residual blocks depends on the image resolution, the full dimensionality of z is 120 for 128×128 images, 140 for 256×256 images, and 160 for 512×512 images.
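To make this conditioning pathway concrete, here is a hedged PyTorch sketch of a conditional BatchNorm layer driven by one z chunk concatenated with a shared class embedding, with the gain projection centered at 1 and the bias projection at 0; all module names and dimensions are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """BatchNorm whose gain and bias are linear projections of a conditioning
    vector (one z chunk concatenated with the shared class embedding)."""
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gain = nn.Linear(cond_dim, num_features)   # projection centered at 1
        self.bias = nn.Linear(cond_dim, num_features)   # projection centered at 0

    def forward(self, x, cond):
        out = self.bn(x)
        g = 1.0 + self.gain(cond).unsqueeze(-1).unsqueeze(-1)
        b = self.bias(cond).unsqueeze(-1).unsqueeze(-1)
        return g * out + b

# Hierarchical latent: split z into per-block chunks and concatenate each with
# the shared class embedding (sizes below follow the 128x128 description).
z_dim, chunk, embed_dim = 120, 20, 128
class_embed = nn.Embedding(1000, embed_dim)

z = torch.randn(8, z_dim)
y = torch.randint(0, 1000, (8,))
chunks = z.split(chunk, dim=1)                              # one 20-dim chunk per block
conds = [torch.cat([c, class_embed(y)], dim=1) for c in chunks]

# Each conds[i] conditions the BatchNorm layers of residual block i.
cbn = ConditionalBatchNorm2d(num_features=256, cond_dim=chunk + embed_dim)
h = torch.randn(8, 256, 32, 32)
out = cbn(h, conds[0])
```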

The BigGAN-deep model (Fig. 16) differs from BigGAN in several ways. It uses a simpler variant of skip-z conditioning: instead of first splitting z into chunks, the entire z is concatenated with the class embedding, and the resulting vector is passed to each residual block through skip connections. BigGAN-deep is based on residual blocks with bottlenecks (He et al., 2016), which incorporate two additional 1×1 convolutions: the first reduces the number of channels by a factor of 4 before the more expensive 3×3 convolutions, and the second produces the required number of output channels. While BigGAN relies on 1×1 convolutions in its skip connections, in BigGAN-deep we use a different strategy aimed at preserving identity throughout the skip connections when the number of channels needs to change. In G, where the number of channels needs to be reduced, we simply retain the first group of channels and drop the rest to produce the required number of channels. In D, where the number of channels should be increased, we pass the input channels unperturbed and concatenate them with the remaining channels produced by a 1×1 convolution. As far as the network configuration is concerned, the discriminator is an exact reflection of the generator. There are two blocks at each resolution (BigGAN uses one), which makes BigGAN-deep four times deeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly fewer parameters thanks to the bottleneck structure of their residual blocks. For example, the 128×128 BigGAN-deep G and D have 50.4M and 34.6M parameters, respectively, while the corresponding original BigGAN models have 70.4M and 88.0M parameters. All BigGAN-deep models use attention at 64×64 resolution, a channel width multiplier ch = 128, and z ∈ ℝ¹²⁸.

Figure 15: (a) Typical architectural layout of G for BigGAN; details are in the tables that follow. (b) Residual block (ResBlock up) in G of BigGAN. (c) Residual block (ResBlock down) in D of BigGAN.

Figure 16: (a) Typical architectural layout of G for BigGAN-deep; details are in the tables that follow. (b) Residual block (ResBlock up) in G of BigGAN-deep. (c) Residual block (ResBlock down) in D of BigGAN-deep. A ResBlock in BigGAN-deep without "up" or "down" omits the upsampling/downsampling layers and uses an identity skip connection.

Table 4: BigGAN architecture for 128×128 pixel images. ch represents the channel width multiplier for each network from Table 1.

Table 5: BigGAN architecture for 256×256 pixel images. Compared to the 128×128 architecture, an extra ResBlock is added in each network at 16×16 resolution, and non-local blocks in G are moved to 128×128 resolution. Due to memory constraints, we cannot move non-local blocks in D.

Table 6: BigGAN architecture for 512×512 pixel images. Compared to the 256×256 architecture, an additional ResBlock is added at 512×512 resolution. Due to memory constraints, we had to move the non-local blocks in both networks back to 64×64 resolution, as in the 128×128 pixel setting.

Table 7: BigGAN-deep architecture for 128×128 pixel images.

Table 8: BigGAN-deep architecture for 256×256 pixel images.

Table 9: BigGAN-deep architecture for 512×512 pixel images.

Appendix C: Experimental Details

Our basic setup follows SA-GAN (Zhang et al., 2018) and is implemented in TensorFlow (Abadi et al., 2016). We employ the architectures detailed in Appendix B, inserting non-local blocks at a single stage in each network. Both G and D networks are initialized with orthogonal initialization (Saxe et al., 2014). We use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0 and β2 = 0.999, and a constant learning rate. For BigGAN models at all resolutions, we use a learning rate of 2·10⁻⁴ in D and 5·10⁻⁵ in G. For BigGAN-deep, we use 2·10⁻⁴ in D and 5·10⁻⁵ in G for the 128×128 models, and 2.5·10⁻⁵ in both D and G for the 256×256 and 512×512 models. We experimented with the number of D steps per G step (varying it from 1 to 6) and found that two D steps per G step gave the best results.

We use an exponential moving average of the weights of G when sampling, with a decay rate of 0.9999. We use cross-replica BatchNorm (Ioffe & Szegedy, 2015) in G, where batch statistics are aggregated across all devices rather than on a single device as in standard implementations. Spectral normalization (Miyato et al., 2018) is used in both G and D, following the approach of SA-GAN (Zhang et al., 2018). We train on Google TPU v3 Pods, with the number of cores proportional to the resolution: 128 for 128×128, 256 for 256×256, and 512 for 512×512. Training takes between 24 and 48 hours for most models. We increase ε in BatchNorm and spectral normalization from the default 10⁻⁸ to 10⁻⁴ to mitigate low-precision numerical issues. We preprocess data by cropping along the long edge and rescaling to the target resolution with area resampling.
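The optimizer settings above can be summarized in a short configuration sketch; the stand-in modules below are placeholders for the actual architectures of Appendix B:

```python
import torch
import torch.nn as nn

# Stand-in modules; the real G and D are the 128x128 architectures of Appendix B.
G = nn.Linear(120, 3 * 128 * 128)
D = nn.Linear(3 * 128 * 128, 1)

# Adam settings described above for the 128x128 BigGAN models.
opt_G = torch.optim.Adam(G.parameters(), lr=5e-5, betas=(0.0, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.999))

D_STEPS_PER_G_STEP = 2  # two D steps per G step worked best in the sweep above
```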

C.1 BatchNorm statistics and sampling

By default, networks that use batch normalization employ running averages of activation moments at test time. Prior work (Radford et al., 2016) has instead used batch statistics when sampling images. While this is not technically an invalid way to sample, it means that results depend on the test batch size (and on how many devices it is split across), and further complicates reproducibility. We found this detail to be extremely important: changes to the test batch size produce drastic changes in performance. This is further exacerbated when sampling with an exponential moving average of G's weights, since the BatchNorm running averages are computed with the non-averaged weights and are poor estimates of the activation statistics of the averaged weights.
To counteract both issues, we employ "standing statistics" at sampling time: we run G through multiple forward passes (typically 100), each with a different batch of random noise, and compute the activation statistics from means and variances aggregated across all forward passes. Analogous to using running statistics, this makes G's outputs invariant to the batch size and the number of devices, even if only a single sample is produced.
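A sketch of how standing statistics could be accumulated in PyTorch, assuming a generator that takes (z, y); the use of cumulative-average BatchNorm (momentum=None) is our simplification of the aggregation described above:

```python
import torch
from torch.nn.modules.batchnorm import _BatchNorm

@torch.no_grad()
def accumulate_standing_stats(G, z_dim, num_passes=100, batch_size=64, n_classes=1000):
    """Run G forward multiple times with fresh noise so its BatchNorm layers
    accumulate activation statistics, then freeze them for sampling."""
    for m in G.modules():
        if isinstance(m, _BatchNorm):
            m.reset_running_stats()
            m.momentum = None          # cumulative moving average over all passes
    G.train()                          # BatchNorm updates running stats in train mode
    for _ in range(num_passes):
        z = torch.randn(batch_size, z_dim)
        y = torch.randint(0, n_classes, (batch_size,))
        G(z, y)
    G.eval()                           # sampling then uses the accumulated statistics
```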

C.2 CIFAR-10

We run our network on CIFAR-10 (Krizhevsky & Hinton, 2009), using the settings in Table 1, row 8, and achieve an IS of 9.22 and a FID of 14.73 without truncation.

C.3 Inception score of IMAGENET images

We computed IS for the training and validation sets of ImageNet. At 128×128 resolution, the training data has an IS of 233 and the validation data has an IS of 166. At 256×256 resolution, the training data has an IS of 377 and the validation data has an IS of 234. At 512×512 resolution, the training data has an IS of 348 and the validation data has an IS of 241. The difference between training and validation scores is because the Inception classifier has been trained on the training data, resulting in high confidence outputs being prioritized in the Inception scores.

Appendix D: Additional Charts

Figure 17: Comparison of IS and FID at 128×128 resolution. Scores are based on the average of three random seeds.

Figure 18: Comparison of IS and FID at 256 pixels and 512 pixels. The 256-pixel score is based on the average of three random seeds.

Figure 19: Comparison of IS and FID for 256×256 JFT-300M models. We show truncation values from σ = 0 to σ = 2 (top) and from σ = 0.5 to σ = 1.5 (bottom). Each curve corresponds to a row in Table 3. The curve labeled "baseline" corresponds to the first row (without orthogonal regularization and the other techniques); the remaining curves correspond to rows 2-4, the same architecture at different capacities (Ch).

Appendix E: Selecting Latent Spaces

In this appendix we explore the impact of choosing different latent distributions. Although our main results use the truncation trick, we also considered a number of alternative latent distributions. Below we list the designs we explored, the intuition behind each, and its performance when substituted for z ∼ N(0, I) in the SA-GAN baseline. Because the truncation trick proved superior, we did not run a full ablation and kept z ∼ N(0, I) for our main results to take full advantage of truncation. Without the truncation trick, the two best latent distributions we found are Bernoulli {0, 1} and the censored normal max(N(0, I), 0); both improve training speed and slightly improve final performance, but are less amenable to truncation. (A brief sampling sketch for several of these distributions follows the list below.)

Latent distributions

  • N(0, I): Standard latent distribution, used in the main experiments.
  • U[-1, 1]: Another common choice, behaves similarly to N(0, I).
  • Bernoulli {0, 1}: A discrete latent distribution, reflecting the intuition that the underlying factors of variation in natural images are not continuous but discrete (one feature is present, another is not). This latent distribution outperforms N(0, I) by 8% in IS and requires 60% fewer iterations.
  • Censored normal max(N(0, I), 0): This latent distribution is designed to introduce sparsity into the latent space (reflecting the intuition that certain latent features are sometimes present and sometimes absent), while also allowing those features to vary continuously with different degrees of activation. This latent distribution outperforms N(0, I) by 15-20% in IS and tends to require fewer iterations.
  • Bernoulli {-1, 1}: This latent distribution is designed to be discrete but not sparse (since the network can learn to activate in response to negative inputs). It performs nearly identically to N(0, I).
  • Independent categorical distribution over {-1, 0, 1} with equal probability for each value: this distribution is chosen to be discrete and sparse, but also to allow latent features to take both positive and negative values. It performs nearly identically to N(0, I).
  • N(0, I) multiplied by Bernoulli {0, 1}: this distribution is chosen to have continuous latent factors that are also sparse (with a peak at zero), similar to the censored normal but not constrained to positive values. It performs nearly identically to N(0, I).
  • Concatenation of N(0, I) and Bernoulli {0, 1}, each making up half of the latent dimensions: inspired by Chen et al. (2016), this is designed so that some factors of variation are discrete while others are continuous. It outperforms N(0, I) by around 5%.
  • Variance annealing: sample from N(0, σI), where σ is allowed to vary over training. We compared a variety of piecewise schedules and found that starting with σ = 2 and annealing toward σ = 1 improved performance somewhat. The space of possible schedules is large, and we did not explore it in depth; we suspect that a more principled or better-tuned schedule could affect performance more strongly.
  • Per-sample variable variance: N(0, σᵢI), where σᵢ is sampled independently from U[σₗ, σₕ] for each sample i in a batch, with (σₗ, σₕ) hyperparameters. This distribution was chosen to improve amenability to the truncation trick by feeding the network noise samples with non-constant variance. It did not appear to affect performance, and we did not explore it in depth, though one could also consider annealing (σₗ, σₕ), analogous to variance annealing.
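As referenced above, here is a brief sketch of samplers for several of these latent distributions; the function and option names are illustrative:

```python
import torch

def sample_latents(batch_size, dim, kind="normal"):
    """Samplers for a few of the latent distributions listed above."""
    if kind == "normal":                       # N(0, I)
        return torch.randn(batch_size, dim)
    if kind == "uniform":                      # U[-1, 1]
        return torch.rand(batch_size, dim) * 2.0 - 1.0
    if kind == "bernoulli":                    # Bernoulli {0, 1}
        return torch.bernoulli(torch.full((batch_size, dim), 0.5))
    if kind == "censored_normal":              # max(N(0, I), 0)
        return torch.randn(batch_size, dim).clamp_min(0.0)
    if kind == "bernoulli_pm1":                # Bernoulli {-1, 1}
        return torch.bernoulli(torch.full((batch_size, dim), 0.5)) * 2.0 - 1.0
    raise ValueError(f"unknown latent distribution: {kind}")
```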

APPENDIX F: MONITORED TRAINING STATISTICS

Figure 20: Training statistics for a typical model, without special modifications. The crash happened after 200000 iterations.

Figure 21: Training statistics for G where σ0 is normalized to 1 in G. The crash happened after 125000 iterations.

Figure 22: Training statistics for D where σ0 in G is normalized to 1. The crash happened after 125000 iterations.

Figure 23: Training statistics for G with an R1 gradient penalty of strength 10 applied to D. The model didn't crash, but the max IS was only 55.

Figure 24: Training statistics for D with an R1 gradient penalty of strength 10 applied to D. The model didn't crash, but the max IS was only 55.

Figure 25: Training statistics for G with dropout applied on the last feature layer of D (retention probability 0.8). The model didn't crash, but the max IS was only 70.

Figure 26: Training statistics for D, where Dropout is applied on the last feature layer of D (retention probability 0.8). The model didn't crash, but the max IS was only 70.

Figure 27: Additional training statistics for a typical model, without special modifications. Crash occurs after 200000 iterations.

Figure 28: Additional training statistics using R1 gradient penalty (D with penalty strength of 10). The model doesn't crash, but the max IS is only 55.

Appendix G: Additional Discussion: Stability and Crashes

In this section, we present and discuss further investigations of the stability of our model, extending the discussion in Section 4.

G.1 Intervention before a breakdown

The symptoms of collapse are sudden and drastic, with sample quality dropping from its peak to its lowest value within a few hundred iterations. We can detect this collapse by monitoring whether the singular values of G explode, but although the (unnormalized) singular values grow throughout training, there is no consistent threshold at which collapse occurs. This raises the question of whether it is possible to take a model checkpoint several thousand iterations before collapse and continue training with some hyperparameters modified (e.g., the learning rate).

We conducted a range of intervention experiments in which we took checkpoints of a collapsed model 10,000 or 20,000 iterations before collapse, changed some aspect of the training setup, and then observed whether collapse occurred, when it occurred relative to the original collapse, and the final performance attained at collapse.

We found that increasing the learning rate (relative to its initial value) in either G or D leads to immediate collapse. This was true even when doubling the learning rates from 2·10⁻⁴ in D and 5·10⁻⁵ in G to 4·10⁻⁴ in D and 1·10⁻⁴ in G, a setting that is not normally unstable when used as the initial learning rates. We also tried changing the momentum terms (Adam's β1 and β2), or resetting the momentum vector to zero, but this tended either to have no effect or, when increasing the momentum, to cause immediate collapse.

We found that reducing the learning rate in G while keeping D's learning rate unchanged can delay collapse (in some cases by over one hundred thousand iterations), but also cripples training: once G's learning rate is reduced, performance either stays constant or slowly decays. Conversely, reducing D's learning rate while keeping G's unchanged leads to immediate collapse.

We hypothesize that this is because D must remain optimal with respect to G, both for stability and to provide useful gradient information (as pointed out by Miyato et al., 2018; Gulrajani et al., 2017; Zhang et al., 2018). The consequence of allowing G to win the game is a complete breakdown of the training process, regardless of G's conditioning or optimization settings.

However, D being well-conditioned does not by itself ensure stability, even when D is trained with a larger learning rate or more steps than G. This suggests that, in practice, an optimal D is necessary but insufficient for training stability, or that some aspect of the system prevents D from being trained towards optimality. With the latter possibility in mind, we take a closer look at the noise in D's spectra in the next section.

G.2 Peaks in the Discriminator Spectrum

Figure 29: A close-up of the spectrum of D at the noise spike.

If some element of D has bad dynamics during training, D's spectral behavior may reveal what that element is. Unlike G's top three singular values, D's top three singular values have a large noise component, tend to grow throughout training but show only a small response at collapse, and the ratio of the first two singular values tends to be centered around one, suggesting that D's spectrum has a slow decay. Closer inspection (Fig. 29) shows that the noise spikes resemble an impulse response: at each spike, the spectrum jumps upward and then slowly decreases with some oscillation.

One possible explanation is that this behavior is a consequence of D memorizing the training data, as suggested by the experiments in Section 4.2. As it approaches perfect memorization, it receives less and less signal from real data, because both the original GAN loss and the hinge loss provide zero gradient when D outputs a confident and correct prediction for a given example. If the gradient signal from real data attenuates to zero, D can end up becoming negatively biased by only receiving gradients that encourage its outputs to be negative. If this bias exceeds a certain threshold, D will eventually misclassify a large number of real examples and receive a large gradient encouraging positive outputs, resulting in the observed impulse response.

This argument suggests several fixes. First, one can consider an unbounded loss (such as the Wasserstein loss (Arjovsky et al., 2017)), which would not suffer from this gradient attenuation. We found that our models could not be stably trained for more than a few thousand iterations with this loss, even with gradient penalties and brief re-tuning of the optimizer hyperparameters. Instead, we explored changing the margin of the hinge loss as a partial compromise: for a given model and minibatch of data, increasing the margin results in more examples falling within the margin and thus contributing to the loss³. While training with a smaller margin (reduced by half) measurably degrades performance, training with increased margins (up to a factor of 3) neither prevents collapse nor reduces the noise in D's spectra. Increasing the margin beyond 3 results in unstable training similar to using the Wasserstein loss. Finally, the memorization argument might suggest that using a smaller D, or using dropout in D, would improve training by reducing its capacity to memorize, but in practice this degrades training.
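For reference, a minimal sketch of the hinge loss for D with an adjustable margin, as discussed above (margin = 1 is the standard setting used in the baseline):

```python
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake, margin=1.0):
    """Hinge loss for D with an adjustable margin; d_real and d_fake are
    the discriminator's outputs on real and generated samples."""
    return F.relu(margin - d_real).mean() + F.relu(margin + d_fake).mean()

def g_hinge_loss(d_fake):
    # The generator's hinge objective does not depend on the margin.
    return -d_fake.mean()
```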

Appendix H: Attempts for Failing to Have a Positive Impact

We explored a range of novel and existing techniques that ended up degrading performance or otherwise having no notable effect in our setting. We report them here; although the evaluations in this section are not as thorough as those of our main architectural choices, our intention is to save time for future work and to give readers a more complete picture of what we attempted in order to improve performance or stability. These results must be understood as specific to the particular setup we used rather than as general conclusions: a technique that did not help here may still work when applied in a different way or to a different problem, so negative results should be treated with caution and not over-generalized.

Here are techniques we tried in our research but didn't work as expected:

  • We found that doubling the depth of the model (inserting an extra residual block after each upsampling or downsampling block) degrades performance.

  • We have tried sharing category embeddings between G and D (and not only within G). This can be achieved by replacing D's category embeddings with projections from G's embeddings, similar to what is done in G's BatchNorm layer. In initial experiments, this seemed to help speed up training, but we found that this method scales poorly and is very sensitive to the choice of optimizing hyperparameters, especially the number of D steps.

  • We tried replacing BatchNorm in G with WeightNorm, but the result crashed the training. We also tried removing BatchNorm and only using Spectral Normalization, but that also crashed the training.

  • We have tried adding BatchNorm (both category-conditioned and non-category-conditioned) to D and performing Spectral Normalization, but this also crashes the training.

  • We tried placing attention blocks at different locations in G and D (and even inserting multiple attention blocks at different resolutions), but found no notable benefit at 128×128 and a significant increase in computational and memory cost. When moving to 256×256, we found that moving the attention block up one stage was helpful, in line with our expectations given the increased resolution.

  • We tried using filter sizes of 5 or 7 instead of 3 in either G or D. A filter size of 5 in G provided a marginal improvement over the baseline, but at an unacceptable computational cost. All other settings degraded performance.

  • We tried varying the dilation factor of the convolutional filters at 128×128, but found that even a small dilation factor degrades performance.

  • We tried using bilinear upsampling in G instead of nearest neighbor upsampling, but this also degrades performance.

  • In some models, we observed class-conditional mode collapse, where the model generated only one or two samples for a subset of classes while still being able to generate samples for all other classes. We noticed that the embeddings of the collapsed classes had become very large relative to the other embeddings, and tried applying weight decay to the shared embedding to mitigate this. We found that even a small amount of weight decay (10⁻⁶) degraded performance, and that only much smaller values (10⁻⁸) left performance intact, but these values were too small to prevent the class vectors from exploding. Higher-resolution models appear to be more resilient to this problem, and none of our final models appear to suffer from this type of collapse.

  • We tried using an MLP instead of the linear projection from G's category embeddings to its BatchNorm gains and biases, but found no benefit. We also tried Spectrally Normalizing these MLPs and providing bias at their output, but saw no benefit.

  • We tried gradient norm clipping (a global version is often used for clipping in recurrent networks and a local version on a per-parameter basis), but found that this does not alleviate the instability.

Appendix I: Hyperparameters

In this work, we perform tuning of various hyperparameters:

  • We swept the Cartesian product of the learning rates for each network over the range [10⁻⁵, 5×10⁻⁵, 10⁻⁴, 2×10⁻⁴, 4×10⁻⁴, 8×10⁻⁴, 10⁻³]. We initially found that the SA-GAN settings (a learning rate of 10⁻⁴ for G and 4×10⁻⁴ for D) worked best at smaller batch sizes. We did not repeat this sweep at larger batch sizes, but did experiment with halving and doubling the learning rates, and settled on the halved setting.
  • We swept the strength of the R1 gradient penalty over [10⁻³, 10⁻², 10⁻¹, 0.5, 1, 2, 3, 5, 10]. The strength of the penalty correlates negatively with performance, but settings above 0.5 provide training stability.
  • We swept the keep probability for DropOut on D's final layer over [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]. DropOut has a stabilizing effect similar to R1, but also degrades performance.
  • We swept D's Adam β1 parameter over [0.1, 0.2, 0.3, 0.4, 0.5] and found it to have a light regularization effect similar to DropOut, but without significantly improving results. Higher β1 values in either network crashed training.
  • We swept the strength of the modified orthogonal regularization penalty in G over [10⁻⁵, 5×10⁻⁵, 10⁻⁴, 5×10⁻⁴, 10⁻³, 10⁻²] and selected 10⁻⁴.

  1. https://tfhub.dev/s?q=biggan ↩︎

  2. https://drive.google.com/drive/folders/1lWC6XEPD0LT5KUnPXeve_kWeY-FxH002 ↩︎

  3. Unconstrained models can easily learn different output scales to fit this margin, but constraining our model using spectral normalization makes the choice of a specific margin meaningful. ↩︎


Source: blog.csdn.net/I_am_Tony_Stark/article/details/132479340