[StyleGAN2 Paper Intensive Reading CVPR_2020] Analyzing and Improving the Image Quality of StyleGAN

I. Introduction

[Paper] > Official TensorFlow version [Code] > PyTorch version [Code] > [Project]
This blog is an intensive reading of the StyleGAN2 paper. I hope it helps everyone fully understand the StyleGAN2 generator.
Pipeline:

  1. First, the AdaIN operation produces water-droplet artifacts; the solution is to redesign the normalization step (see Figure 2 for details).
  2. The modifications include weight demodulation, lazy regularization, path length regularization, no progressive growing, a new G & D architecture, and larger networks (StyleGAN2).

Abstract

Background: Style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling.
Method: We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them.
In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images.
In addition to improving image quality, this path length regularizer yields the added benefit that the generator becomes easier to invert.
This makes it possible to reliably attribute generated images to specific networks.
We further visualize how well the generator utilizes its output resolution and identify capacity issues, motivating us to train larger models for additional quality improvements.
Summary: Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics and perceived image quality.

1. Introduction

The resolution and quality of images generated by generative methods, especially generative adversarial networks (GAN) [13] are rapidly improving [20, 26, 4].
The current state-of-the-art method for high-resolution image synthesis is StyleGAN [21], which has been proven to work reliably on various datasets.
Our work focuses on fixing its characteristic artifacts and further improving the result quality.

The distinguishing feature of StyleGAN {Karras2018} is its unconventional generator architecture.
Its mapping network $f$ does not simply feed the input latent code $\mathrm{z} \in \mathcal{Z}$ into the beginning of the network; instead, it first transforms it into an intermediate latent code $\mathrm{w} \in \mathcal{W}$. Affine transformations then produce styles that control the layers of the synthesis network $g$ through adaptive instance normalization (AdaIN) {Huang2017, Dumoulin2016, Ghiasi2017, Dumoulin2018}.
Furthermore, stochastic variation is facilitated by providing additional random noise maps to the synthesis network.
It has been shown {Karras2018, Shen2019} that this design allows the intermediate latent space $\mathcal{W}$ to be much less entangled than the input latent space $\mathcal{Z}$.
In this paper, we focus all analysis solely on $\mathcal{W}$, as it is the relevant latent space from the synthesis network's point of view.

Many observers have noticed characteristic artifacts in images generated by StyleGAN {Bergstrom2019}.
We identify two causes of these artifacts and describe changes to architectures and training methods that remove them.
First, we investigate the origin of the common blob-like artifacts and find that the generator creates them to circumvent a design flaw in its architecture.
In Section 2, we redesign the normalization used in the generator so that the artifacts disappear.
Second , we analyze artifacts associated with progressive growing {Karras2017}, which is very successful in stabilizing high-resolution GAN training.
We propose an alternative design that achieves the same goal - training starts by focusing on low-resolution images and then progressively shifts focus to higher and higher resolutions - without changing the network topology during training.
This new design also allowed us to infer the effective resolution of the generated images, which turned out to be lower than expected, motivating the capacity increase (Section 4).

Quantitative analysis of image quality generated using generative methods remains a challenging topic.
Fréchet Inception Distance (FID) {Heusel2017} measures differences in the density of two distributions in the high-dimensional feature space of an InceptionV3 classifier {simonyan2014}.
Precision and recall (P&R) {Sajjadi2018, Tuomas2019} provide additional visibility by explicitly quantifying the percentage of generated images that are similar to training data and the percentage of training data that can be generated, respectively.
We use these metrics to quantify improvements.

Both FID and P&R are based on classifier networks that have recently been shown to focus on textures rather than shapes {Geirhos2018}; consequently, these metrics do not accurately capture all aspects of image quality.
We observe that the perceptual path length (PPL) metric {Karras2018}, originally introduced as a way to estimate the quality of latent space interpolation, is related to shape consistency and stability.
On this basis, we regularize the synthesis network to favor smooth mapping (Section 3) and achieve a clear improvement in quality.
To offset its computational cost, we also propose to perform all regularization less frequently, observing that this can be done without compromising effectiveness.

Finally, we find that projection of images into the latent space $\mathcal{W}$ works significantly better with the new, path-length regularized StyleGAN2 generator than with the original StyleGAN.
This makes it easier to attribute generated images to their sources (Section 5).

Our implementation and trained model are available at https://github.com/NVlabs/stylegan2 .

2. Removing normalization artifacts

We first observe that most of the images generated by StyleGAN exhibit typical blob-like artifacts resembling water droplets.
As shown in Figure 1, even when the droplet may not be obvious in the final image, it is present in the intermediate feature maps of the generator.¹
The anomaly starts to appear around 64×64 resolution, is present in all feature maps, and becomes progressively stronger at higher resolutions.
The existence of such a consistent artifact is puzzling, as the discriminator should be able to detect it.

We localize the problem to the AdaIN operation, which normalizes the mean and variance of each feature map separately, potentially destroying any information found in the relative sizes of these features.
We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal elsewhere as it likes.
Our hypothesis is supported by the finding that water droplet artifacts disappear completely when the normalization step is removed from the generator.

2.1. Generator architecture revisited


Figure 2. We redesign the architecture of the StyleGAN synthesis network.
(a) The original StyleGAN, where $\boxed{A}$ denotes a learned affine transformation from $\mathcal{W}$ that produces a style, and $\boxed{B}$ is a noise broadcast operation.
(b) The same diagram with full details. Here we decompose AdaIN into explicit normalization followed by modulation, both operating on the mean and standard deviation of each feature map. We also annotate the learned weights ($w$), biases ($b$), and constant input ($c$), and redraw the gray boxes so that each box covers the part of the network where one style is active. The activation function (leaky ReLU) is always applied right after adding the bias.
(c) We make several changes to the original architecture that are justified in the main text: we remove some redundant operations at the beginning, move the addition of $b$ and $\boxed{B}$ outside the active area of a style, and adjust only the standard deviation of each feature map.
(d) The revised architecture enables us to replace instance normalization with a "demodulation" operation, which we apply to the weights associated with each convolution layer.

We will first revise several details of the StyleGAN generator to better facilitate our redesigned normalization.
In terms of quality measures, these changes themselves have neutral or small positive effects.

Figure 2a shows the original StyleGAN synthesis network $g$ [21]. In Figure 2b we expand the diagram to full detail by showing the weights and biases, and by decomposing the AdaIN operation into its two constituent parts: normalization and modulation.
This allows us to redraw the conceptual gray boxes so that each gray box represents a part of the network where a certain style is active (i.e. a "style block").
Interestingly, the original StyleGAN applies bias and noise within the style block, causing their relative impact to be inversely proportional to the magnitude of the current style.
We observe that more predictable results can be obtained by moving these operations outside of the style block, where they operate on normalized data.
Furthermore, we note that after this change, it is sufficient for normalization and modulation to act on the standard deviation only (i.e., the mean is not required).
Application of bias, noise, and normalization to constant inputs can also be safely removed without apparent pitfalls.
This variant is shown in Figure 2c, and it serves as the starting point for our redesigned normalization.
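To make the revised ordering concrete, here is a minimal PyTorch-style sketch of the Figure 2c style block (tensor shapes, parameter names, and the `affine` module are illustrative assumptions, not the official code): modulation and normalization act on the standard deviation only, and bias and noise are applied outside the style block, on normalized data.

```python
import torch
import torch.nn.functional as F

def style_block_fig2c(x, w_latent, conv_weight, affine, bias, noise_strength):
    # x: [N, C_in, H, W] feature maps; w_latent: [N, w_dim] intermediate latent.
    s = affine(w_latent)                           # per-channel scales from A, shape [N, C_in]
    x = x * s[:, :, None, None]                    # "Mod std": scale each input feature map
    x = F.conv2d(x, conv_weight, padding=1)        # 3x3 convolution with weights w
    std = x.std(dim=(2, 3), keepdim=True) + 1e-8   # "Norm std": per-feature-map std only
    x = x / std
    # Bias b and noise B are added outside the style block, on normalized data.
    noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
    x = x + noise_strength * noise + bias[None, :, None, None]
    return F.leaky_relu(x, 0.2)                    # activation right after the bias
```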

2.2. Instance normalization revisited

One of the main strengths of StyleGAN is the ability to control the generated images through style mixing, that is, feeding different latent w to different layers at inference time.
In practice, style modulation can amplify certain feature maps by an order of magnitude or more.
For style blending to work, we must explicitly counteract this amplification on a per-sample basis - otherwise subsequent layers won't be able to operate on the data in a meaningful way.

If we are willing to sacrifice scale-specific control (see video), we can simply remove normalization, thereby removing artifacts and slightly improving FID [22].
We will now propose a better alternative that removes artifacts while retaining full controllability.
The main idea is to base the normalization on the expected statistics of the incoming feature maps, but without explicit enforcement.

Recall that the style block in Figure 2c consists of modulation, convolution, and normalization.
Let's start by considering the effect of the modulated convolution.
Modulation scales each input feature map of the convolution according to the incoming style, and can also be achieved by scaling the convolution weights:
$$
w'_{ijk} = s_i \cdot w_{ijk}, \tag{1}
$$

where $w$ and $w'$ are the original and modulated weights, respectively, $s_i$ is the scale corresponding to the $i$-th input feature map, and $j$ and $k$ enumerate the output feature maps and the spatial footprint of the convolution, respectively.

Now, the purpose of instance normalization is to essentially remove the influence of s from the statistics of the convolution output feature map.
We believe that this goal can be achieved more directly.
Assume that the input activations are independent and identically distributed (i.i.d.) random variables with unit standard deviation.
After the modulation and convolution, the output activations have standard deviation

$$
\sigma_j = \sqrt{\sum_{i,k} {w'_{ijk}}^2}, \tag{2}
$$

i.e., the outputs are scaled by the $L_2$ norm of the corresponding weights. The subsequent normalization aims to restore the outputs back to unit standard deviation.
Based on Equation 2, this is achieved if we scale ("demodulate") each output feature map $j$ by $1/\sigma_j$.
Alternatively, we can again bake this into the convolution weights:

$$
w''_{ijk} = w'_{ijk} \Big/ \sqrt{\sum_{i,k} {w'_{ijk}}^2 + \epsilon}, \tag{3}
$$

where $\epsilon$ is a small constant that avoids numerical issues.
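Putting Eq. 1 and Eq. 3 together, a minimal sketch of the resulting modulated convolution might look as follows (PyTorch-style pseudocode with assumed shapes; the grouped-convolution trick shown here is one way to apply per-sample weights efficiently, cf. Appendix B):

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, styles, demodulate=True, eps=1e-8):
    # x:      [N, C_in, H, W] input feature maps
    # weight: [C_out, C_in, kH, kW] convolution weights w_ijk
    # styles: [N, C_in] per-sample scales s_i produced by the affine layer A
    N, C_in, H, W = x.shape
    C_out = weight.shape[0]

    # Eq. 1: modulation, baked into the weights, per sample.
    w = weight[None] * styles[:, None, :, None, None]          # [N, C_out, C_in, kH, kW]

    if demodulate:
        # Eq. 3: demodulation, assuming i.i.d. unit-variance inputs.
        sigma = torch.sqrt((w ** 2).sum(dim=(2, 3, 4)) + eps)  # [N, C_out]
        w = w / sigma[:, :, None, None, None]

    # Grouped convolution: fold the batch into the group dimension so that each
    # sample is convolved with its own modulated weights.
    x = x.reshape(1, N * C_in, H, W)
    w = w.reshape(N * C_out, C_in, *weight.shape[2:])
    out = F.conv2d(x, w, padding=weight.shape[-1] // 2, groups=N)
    return out.reshape(N, C_out, H, W)
```

Note that the demodulation relies only on the statistical assumption of unit-variance inputs, not on the actual contents of the feature maps, which is exactly the "weaker" normalization discussed next.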

We have now baked the entire style block into a single convolutional layer whose weights are adjusted according to s using Eq. 1 and Eq. 3 (Fig. 2d).
Compared to instance normalization, our demodulation technique is weaker because it is based on statistical assumptions about the signal instead of the actual contents of the feature maps.
Similar statistical analysis has been used extensively in modern network initializers [12, 16], but we are not aware of it being previously used as a replacement for data-dependent normalization.
Our demodulation is also related to weight normalization [32], which performs the same calculation as part of reparameterizing the weight tensor.
Previous work has found that weight normalization is beneficial in the context of GAN training [38].

Our new design removes characteristic artifacts ( Figure 3 ), while retaining full controllability, as shown in the accompanying video.
FID is largely unaffected ( Table 1, rows A, B ), but there is a significant shift from precision to recall.
We argue that this is generally desirable because recall can be converted to precision by truncation, while the converse is not true [22].
In practice, our design can be implemented efficiently using grouped convolutions, as detailed in Appendix B.
To avoid having to account for the activation functions in Equation 3, we scale the activation functions so that they preserve the expected signal variance.

3. Image quality and generator smoothness

While GAN metrics like FID or Precision and Recall (P&R) successfully capture many aspects of the generator, they still have certain blind spots in terms of image quality.
For example, see Figures 3 and 4 in the supplement for a comparison of generators that have the same FID and P&R scores but markedly different overall quality.²


We observe a correlation between perceived image quality and perceptual path length (PPL) [21], a metric originally introduced to quantify the smoothness of the mapping from latent space to the output image by measuring average LPIPS distances [44] between generated images under small perturbations in latent space.
Referring again to Figures 3 and 4 in the Supplement, a smaller PPL (smoother generator map) appears to correlate with higher overall image quality, while other measures are blind to this change.
Figure 4 examines this correlation more closely by showing per-image PPL scores on LSUN CAT, sampled in the latent space around $\mathrm{w} \sim f(\mathrm{z})$.
A low score is indeed an indication of a high-quality image, and vice versa.
Figure 5a shows the corresponding histogram and reveals the long tail of the distribution.
The overall PPL of the model is simply the expected value of the PPL score for each image.
We always compute PPL for full images, unlike Karras et al. [21], who use a smaller central crop.
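For illustration, a rough sketch of estimating a PPL score is shown below. It uses the third-party `lpips` package and follows the general recipe of [21] rather than the exact official implementation; the generator/mapping interfaces and the perturbation scheme are simplified assumptions.

```python
import torch
import lpips  # pip install lpips; third-party package, assumed available

def ppl_estimate(G_synthesis, mapping, z, epsilon=1e-4, device='cuda'):
    """Rough PPL estimate: LPIPS distance between images generated at nearby
    points in W, scaled by 1/epsilon^2 (cf. [21]); full images, no central crop."""
    lpips_fn = lpips.LPIPS(net='vgg').to(device)
    w1 = mapping(z)                              # w ~ f(z)
    w2 = mapping(torch.randn_like(z))            # second endpoint
    t = torch.rand([], device=device)            # random interpolation position
    w_a = torch.lerp(w1, w2, t)                  # lerp in W (not slerp, as in [21])
    w_b = torch.lerp(w1, w2, t + epsilon)
    img_a = G_synthesis(w_a)
    img_b = G_synthesis(w_b)
    d = lpips_fn(img_a, img_b)                   # perceptual distance per sample
    return (d / (epsilon ** 2)).mean()
```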

It's not immediately obvious why low PPL correlates with image quality.
We hypothesize that since the discriminator penalizes broken images during training, the most straightforward way for the generator to improve is to effectively stretch the regions of the latent space that produce good images.
This will result in low-quality images being compressed into rapidly changing small latent space regions.
While this improves the average output quality in the short term, the accumulated distortions impair the training dynamics and thus the final image quality.

Clearly, we cannot simply encourage the minimum PPL, as this would steer the generator towards a degenerate solution with zero recall.
Instead, we describe a new regularizer whose goal is to achieve smoother generator maps without this shortcoming.
Since the resulting regularization term is somewhat expensive to compute, we first describe a general optimization applicable to any regularization technique.

3.1. Lazy regularization

Typically, the main loss function (e.g., logistic loss [13]) and the regularization terms (e.g., $R_1$ [25]) are written as a single expression and thus optimized simultaneously.
We observe that the regularization terms can be computed less frequently than the main loss function, which greatly reduces their computational cost and overall memory usage.
Row C of Table 1 shows that no harm is caused when $R_1$ regularization is performed only once every 16 mini-batches, and we apply the same strategy to our new regularizer as well. Appendix B gives implementation details.
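As an illustration, a minimal sketch of lazy $R_1$ regularization for the discriminator could look like this (a simplified training step with an assumed loss formulation, not the official training loop):

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, real_images, fake_images, d_optimizer,
                       step, r1_gamma=10.0, reg_interval=16):
    d_optimizer.zero_grad()
    # Non-saturating logistic loss for the discriminator.
    loss = F.softplus(D(fake_images.detach())).mean() + F.softplus(-D(real_images)).mean()

    if step % reg_interval == 0:
        # Lazy regularization: evaluate the R1 penalty only every reg_interval
        # mini-batches, scaling it up to keep the overall strength comparable.
        real_images = real_images.detach().requires_grad_(True)
        real_scores = D(real_images)
        grad, = torch.autograd.grad(real_scores.sum(), real_images, create_graph=True)
        r1 = grad.square().sum(dim=[1, 2, 3]).mean()
        loss = loss + (r1_gamma / 2) * r1 * reg_interval

    loss.backward()
    d_optimizer.step()
```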

3.2. Path length regularization

We would like to encourage that a fixed-size step in $\mathcal{W}$ results in a non-zero, fixed-magnitude change in the image.
We can measure the deviation from this ideal empirically by stepping in random directions in image space and observing the corresponding gradients with respect to $\mathrm{w}$.
These gradients should have close to equal length regardless of $\mathrm{w}$ or the image-space direction, indicating that the mapping from the latent space to image space is well-conditioned {Odena2018}.

At a single $\mathrm{w} \in \mathcal{W}$, the local metric scaling properties of the generator mapping $g(\mathrm{w}): \mathcal{W} \mapsto \mathcal{Y}$ are captured by the Jacobian matrix $\mathbf{J}_\mathrm{w} = \partial g(\mathrm{w}) / \partial \mathrm{w}$.
Motivated by the desire to preserve the expected lengths of vectors regardless of direction, we formulate the regularizer as

$$
\mathbb{E}_{\mathrm{w}, \mathrm{y} \sim \mathcal{N}(0, \mathbf{I})} \left( \left\lVert \mathbf{J}_\mathrm{w}^T \mathrm{y} \right\rVert_2 - a \right)^2, \tag{4}
$$

where $\mathrm{y}$ are random images with normally distributed pixel intensities, and $\mathrm{w} \sim f(\mathrm{z})$, where $\mathrm{z}$ are normally distributed. We show in Appendix C that, in high dimensions, this prior is minimized when $\mathbf{J}_\mathrm{w}$ is orthogonal (up to a global scale) at any $\mathrm{w}$.
Orthogonal matrices preserve length and do not introduce squashing along any dimension.

To avoid explicit computation of the Jacobian matrix, we use the identity $\mathbf{J}_\mathrm{w}^T \mathrm{y} = \nabla_\mathrm{w} (g(\mathrm{w}) \cdot \mathrm{y})$, which can be computed efficiently using standard backpropagation {Dauphin2015}.
The constant $a$ is set dynamically during optimization as the long-running exponential moving average of the lengths $\lVert \mathbf{J}_\mathrm{w}^T \mathrm{y} \rVert_2$, allowing the optimization to find a suitable global scale by itself.
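A minimal sketch of this regularizer is given below (assumed tensor shapes; the `pl_mean` buffer holds the exponential moving average that plays the role of $a$, and the $1/\sqrt{HW}$ scaling of $\mathrm{y}$ is one common choice to keep its magnitude resolution-independent):

```python
import torch

def path_length_penalty(fake_images, w, pl_mean, decay=0.01):
    # fake_images: [N, C, H, W] generated from latents w: [N, num_ws, w_dim]
    N, C, H, W = fake_images.shape
    # Random image y with normally distributed pixel intensities.
    y = torch.randn_like(fake_images) / (H * W) ** 0.5
    # J_w^T y = grad_w (g(w) . y), computed with one backpropagation pass.
    grad, = torch.autograd.grad((fake_images * y).sum(), w, create_graph=True)
    path_lengths = grad.square().sum(dim=2).mean(dim=1).sqrt()   # ||J_w^T y||_2 per sample
    # a is tracked as a long-running exponential moving average of the lengths.
    pl_mean = pl_mean.lerp(path_lengths.mean().detach(), decay)
    penalty = (path_lengths - pl_mean).square().mean()           # Eq. 4
    return penalty, pl_mean
```

In a training loop this penalty would be added to the generator loss, applied lazily as described in Section 3.1.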

Our regularizer is closely related to the Jacobian clamping regularizer proposed by Odena et al. {Odena2018}.
The practical differences are that we compute the products $\mathbf{J}_\mathrm{w}^T \mathrm{y}$ analytically, whereas they estimate $\mathbf{J}_\mathrm{w} \boldsymbol{\delta}$ with $\mathcal{Z} \ni \boldsymbol{\delta} \sim \mathcal{N}(0, \mathbf{I})$ using finite differences.
It should be noted that spectral normalization {Miyato2018B} of the generator {Zhang2018sagan} only constrains the largest singular value, placing no constraints on the others, and hence does not necessarily lead to better conditioning.
We find that enabling spectral normalization in addition to our contributions - or instead of them - invariably compromises FID, as detailed in Appendix E.

In practice, we notice that path length regularization leads to more reliable and consistently behaving models, making architecture exploration easier.
We also observe that smoother generators are easier to invert (Section 5).
Figure 5b shows that path length regularization significantly tightens the distribution of per-image PPL scores without pushing modes to zero.
However, row D of Table 1 points out the trade-off between FID and PPL in datasets less structured than FFHQ.

4. Progressive growing revisited


Progressive growing [20] has been very successful in stabilizing high-resolution image synthesis, but it introduces its own characteristic artifacts.
The key issue is that the progressively grown generator appears to have a strong location preference for details; the accompanying video shows that when features such as teeth or eyes should move smoothly across the image, they may instead remain stuck in place before jumping to the next preferred location.
Figure 6 shows a related artifact.
We believe the problem is that in progressive growing, each resolution momentarily serves as the output resolution, forcing it to generate maximal frequency details, which then causes the trained network to have excessively high frequencies in the intermediate layers, compromising shift invariance [43].
Appendix A shows an example. These issues prompt us to search for an alternative formulation that would retain the benefits of progressive growing without the drawbacks.

4.1. Alternative network architectures

While StyleGAN uses a simple feed-forward design in the generator (synthesis network) and discriminator, there is a vast body of work devoted to the study of better network architectures.
Skip connections [29, 19], residual networks [15, 14, 26] and hierarchical methods [6, 41, 42] have also been shown to be very successful in generative methods.
Therefore, we decided to re-evaluate the network design of StyleGAN and search for an architecture that does not require progressive growth to produce high-quality images.

Figure 7a shows MSG-GAN [19], which uses multiple skip connections to connect the matching resolution of the generator and the discriminator.
The MSG-GAN generator is modified to output mipmaps [37] instead of images, and a similar representation is also computed for each actual image.
In Figure 7b , we simplify this design by upsampling and summing the contributions of RGB outputs corresponding to different resolutions.
In the discriminator, we similarly feed the downsampled image to each resolution block of the discriminator.
We use bilinear filtering in all upsampling and downsampling operations.
In Figure 7c, we further modify the design to use residual connections.³
This design is similar to LAPGAN [6] without the per-resolution discriminator used by Denton et al.
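The following sketch illustrates the idea of the skip generator in Figure 7b (module names and interfaces are illustrative assumptions, not the official code): each resolution block emits an RGB contribution via a tRGB layer, and the running RGB image is bilinearly upsampled and summed with it.

```python
import torch
import torch.nn.functional as F

class SkipGenerator(torch.nn.Module):
    def __init__(self, blocks, trgb_layers):
        super().__init__()
        # blocks[i]: synthesis block producing features at resolution i;
        # trgb_layers[i]: tRGB layer mapping those features to an RGB image.
        self.blocks = torch.nn.ModuleList(blocks)
        self.trgb_layers = torch.nn.ModuleList(trgb_layers)

    def forward(self, x, w):
        rgb = None
        for block, trgb in zip(self.blocks, self.trgb_layers):
            x = block(x, w)                      # features at the current resolution
            contribution = trgb(x, w)            # RGB contribution of this resolution
            if rgb is None:
                rgb = contribution
            else:
                # Upsample the lower-resolution image bilinearly and sum.
                rgb = F.interpolate(rgb, scale_factor=2, mode='bilinear',
                                    align_corners=False)
                rgb = rgb + contribution
        return rgb
```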

Table 2 compares three generator and three discriminator architectures: the original feed-forward networks used in StyleGAN, skip connections, and residual networks, all trained without progressive growing.
FID and PPL are provided for each of the 9 combinations.
We can see two broad trends: skip connections in the generator greatly improve PPL in all configurations, and residual discriminator networks clearly benefit FID.
The latter is perhaps not surprising, since the discriminator's structure resembles that of classifiers, where residual architectures are known to be helpful.
However, residual architectures are detrimental in generators—the only exception is FID in LSUN cars, when both networks are residual.

For the remainder of this paper, we use a skip generator and a residual discriminator, without progressive growing. This corresponds to configuration E in Table 1, and it significantly improves FID and PPL.

4.2. Resolution usage

The key aspect of progressive growing that we would like to preserve is that the generator initially focuses on low-resolution features and then slowly shifts its attention to finer details.
The architecture in Figure 7 makes it possible for the generator to first output low-resolution images that are not significantly affected by the high-resolution layers, and later shift its focus to the high-resolution layers as training proceeds.
Since this is not enforced in any way, the generator will only do so when it is beneficial.
To analyze the behavior in practice, we need to quantify how much the generator depends on a particular resolution during training.
Since the skip generator (Figure 7b) forms the image by explicitly summing RGB values from multiple resolutions, we can estimate the relative importance of the corresponding layers by measuring how much they contribute to the final image.
In Figure 8a, we plot the standard deviation of the pixel values produced by each tRGB layer as a function of training time.
We compute the standard deviation of 1024 random samples of w and normalize it so that it sums to 100%.
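A rough sketch of this measurement is given below; the `return_rgb_contributions` flag is a hypothetical interface for retrieving the per-resolution RGB summands and is not part of the official code.

```python
import torch

@torch.no_grad()
def trgb_contributions(generator, mapping, num_samples=1024, z_dim=512, device='cuda'):
    """Estimate each tRGB layer's relative contribution to the final image by the
    standard deviation of the pixel values it produces, over random latents."""
    stds = None
    for _ in range(num_samples):
        z = torch.randn(1, z_dim, device=device)
        w = mapping(z)
        # Assumes the skip generator can also return the per-resolution RGB summands.
        _, rgb_contributions = generator(w, return_rgb_contributions=True)
        sample_stds = torch.stack([rgb.std() for rgb in rgb_contributions])
        stds = sample_stds if stds is None else stds + sample_stds
    stds = stds / num_samples
    return 100.0 * stds / stds.sum()      # normalize so the contributions sum to 100%
```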

At the beginning of training, we can see that the new skip generator behaves similarly to progressive growing - now achieved without changing the network topology.
Therefore, it is reasonable to expect the highest resolution to dominate towards the end of training.
However, the figure shows that this does not happen in practice, suggesting that the generator may not be able to "fully utilize" the target resolution.
To verify this, we inspected the generated images manually and noticed that they often lack some of the pixel-level detail that is present in the training data - the images could be described as sharpened versions of 512² images rather than true 1024² images.

This leads us to hypothesize that there is a capacity problem in our networks, which we test by doubling the number of feature maps in the highest-resolution layers of both networks.⁴
This makes the behavior more as expected: Figure 8b shows a significant increase in the contribution of the highest resolution layer, and row F of Table 1 shows a significant improvement in FID and recall.
The last row shows that the baseline StyleGAN also benefits from the extra capacity, but its quality is still much lower than StyleGAN2.


Table 3 compares StyleGAN and StyleGAN2 in four LSUN categories, again showing clear improvements in FID and significant progress in PPL. Further increases in scale may have additional benefits.

5. Projection of images to latent space

Inverting the synthesis network $g$ is an interesting problem with many applications.
Manipulating a given image in the latent feature space first requires finding a matching latent code w for it.
Previous studies [1, 9] have shown that results are improved if an individual w is chosen for each layer of the generator, rather than finding a common latent code w.
The same approach was used in an early encoder implementation [27].
While extending the latent space in this fashion finds a closer match to a given image, it also enables projecting arbitrary images that should have no latent representation.
Instead, we focus on finding latent codes in the original, unexpanded latent space, as these codes correspond to images that the generator might produce.

Our projection method differs from previous methods in two ways.
First, in order to explore the latent space more comprehensively, we add ramped-down noise to the latent code during optimization.
Second , we also optimize the random noise inputs to the StyleGAN generator, regularizing them to ensure that they do not end up carrying coherent signals.
The regularization is based on enforcing the autocorrelation coefficients of the noise maps to match those of unit Gaussian noise over multiple scales.
Details of our projection method can be found in Appendix D.
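For intuition, a highly simplified sketch of such a projector is given below. It is not the official projector of Appendix D; the noise-ramp schedule, the `noise_inputs()` accessor, the regularization weight, and the image loss are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def project(generator, target, w_avg, image_loss_fn, num_steps=1000,
            initial_noise=0.05, lr=0.1):
    w = w_avg.clone().requires_grad_(True)                      # start from the mean latent
    noise_maps = [n.detach().clone().requires_grad_(True)      # hypothetical accessor
                  for n in generator.noise_inputs()]
    opt = torch.optim.Adam([w] + noise_maps, lr=lr)

    for step in range(num_steps):
        t = step / num_steps
        w_noise_scale = initial_noise * max(0.0, 1.0 - 2.0 * t) ** 2  # ramped-down noise
        w_in = w + torch.randn_like(w) * w_noise_scale
        synth = generator(w_in, noise_maps)
        loss = image_loss_fn(synth, target)                     # e.g. an LPIPS distance

        # Noise regularization: penalize spatial autocorrelation at multiple scales,
        # so the noise maps cannot carry a coherent signal.
        reg = 0.0
        for n in noise_maps:
            v = n
            while v.shape[-1] > 8:
                reg = reg + (v * torch.roll(v, 1, dims=-1)).mean() ** 2 \
                          + (v * torch.roll(v, 1, dims=-2)).mean() ** 2
                v = F.avg_pool2d(v, 2)
        loss = loss + 1e5 * reg

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                                   # keep noise maps ~unit Gaussian
            for n in noise_maps:
                n.sub_(n.mean())
                n.div_(n.std() + 1e-8)
    return w.detach(), [n.detach() for n in noise_maps]
```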


5.1. Attribution of generated images

Detecting manipulated or generated images is a non-trivial task.
Currently, classifier-based methods can detect generated images very reliably, regardless of their exact origin {Li2018, Yu2018, Wang2019, Zhang2019gan artifacts, Wang2019b}.
However, given the rapid progress in generative methods, this may not last.
In addition to general detection of fake images, we can also consider a more limited form of the problem: being able to attribute fake images to their specific origin {Albright2019}.
For StyleGAN, this amounts to checking whether there exists a $\mathrm{w} \in \mathcal{W}$ that re-synthesizes the image in question.

We measure how well the projection succeeds by computing the LPIPS {Zhang2018metric} distance between the original image and the re-synthesized one, $D_\mathrm{LPIPS}[\boldsymbol{x}, g(\tilde{g}^{-1}(\boldsymbol{x}))]$, where $\boldsymbol{x}$ is the image being analyzed and $\tilde{g}^{-1}$ denotes the approximate projection operation.
Figure 10 shows histograms of these distances for the LSUN Car and FFHQ datasets using the original StyleGAN and StyleGAN2, and Figure 9 shows example projections.
Images generated using StyleGAN2 can be projected into $\mathcal{W}$ so well that they can be almost unambiguously attributed to the generating network.
However, with the original StyleGAN, even though it should technically be possible to find a matching latent code, it appears that the mapping from $\mathcal{W}$ to images is too complex for this to succeed reliably in practice.
We find it encouraging that StyleGAN2 makes source attribution easier even though the image quality has improved significantly.
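Concretely, the attribution check can be sketched as follows (a simplified illustration using the third-party `lpips` package; `project` stands for a projection routine like the one sketched earlier and is an assumed interface):

```python
import lpips  # pip install lpips; third-party package, assumed available

def attribution_distance(generator, project, x, device='cuda'):
    """Project x into W, re-synthesize it, and return D_LPIPS[x, g(g~^{-1}(x))].
    A small distance suggests x can be attributed to this generator."""
    lpips_fn = lpips.LPIPS(net='vgg').to(device)
    w, noise_maps = project(generator, x)       # approximate projection g~^{-1}(x)
    x_resynth = generator(w, noise_maps)        # g(g~^{-1}(x))
    return lpips_fn(x, x_resynth).item()
```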

6. Conclusions and future work

We have identified and fixed some image quality issues in StyleGAN, further improved the quality, and greatly improved the state of the art on several datasets.
In some cases, these improvements can be seen more clearly in motion, as shown in the attached video.
Appendix A contains further examples of the results that can be obtained using our method.
Despite the improved quality, StyleGAN2 makes it easier to attribute generated images to their sources.

Training performance has also improved. At 1024² resolution, the original StyleGAN (configuration A in Table 1) trains at 37 images per second on an NVIDIA DGX-1 with 8 Tesla V100 GPUs, while our configuration E trains 40% faster at 61 img/s.
Most of the speedup comes from the simplified data flow due to weight demodulation, lazy regularization, and code optimizations.
StyleGAN2 (configuration F, larger network) trains at 31 img/s and is thus only slightly more expensive than the original StyleGAN.
Its total training time is 9 days for FFHQ and 13 days for LSUN Car.

The entire project, including all exploration, consumed 132 MWh of which 0.68 MWh was used to train the final FFHQ model. In total, we used about 51 single-GPU-years of computation (Volta-class GPUs). See Appendix F for a more detailed discussion.

In the future, research into further improvements in path length regularization may bear fruit, e.g. replacing pixel-space L2 distances with data-driven feature-space metrics.
Considering the practical deployment of GANs, we believe it is important to find new ways to reduce the training data requirements.
This is especially important in applications where obtaining tens of thousands of training samples is not feasible, and the dataset contains a large amount of intrinsic variation.


  1. In rare cases (perhaps 0.1% of images), the droplet is missing, resulting in severely corrupted images. See Appendix A for details. ↩︎

  2. We believe that the key to this apparent inconsistency lies in the choice of feature space rather than the basis of FID or P&R. It was recently found that classifiers trained using ImageNet [30] tend to be based more on texture than shape [10], while humans strongly focus on shape [23]. This is relevant in our context because FID and P&R use high-level features from InceptionV3 [34] and VGG-16 [34] respectively, which are trained in this way and thus expected to be biased towards texture detection. Consequently, images with e.g. strong cat textures may look more similar to each other than human observers would agree, thus partially compromising density-based metrics (FID) and manifold coverage metrics (P&R). ↩︎

  3. In the residual network architecture, the addition of the two paths leads to a doubling of the signal variance, which we cancel by multiplying by $1/\sqrt{2}$. This is crucial for our networks, whereas in classification ResNets [15] this issue is typically hidden by batch normalization. ↩︎

  4. We double the number of feature maps in resolutions 64²–1024² while keeping other parts of the networks unchanged. This increases the total number of trainable parameters in the generator by 22% (25M → 30M) and in the discriminator by 21% (24M → 29M). ↩︎
