[StyleGAN Paper Intensive Reading CVPR_2019] A Style-Based Generator Architecture for Generative Adversarial Networks

I. Introduction

[Paper] > PyTorch version [Code] > Official TensorFlow version [Code] > [supplement intensive reading]
This post is an intensive reading of the original StyleGAN paper, intended to help readers fully understand StyleGAN.
Pipeline:

  1. First of all, StyleGAN is inspired by the style transfer literature.
  2. The improvements build on Progressive GAN (A) + tuning (incl. bilinear up/downsampling) (B) + adding a mapping network and styles (C) + removing the traditional input (D) + adding noise inputs (E) + mixing regularization (F).
  3. Style mixing is described in Section 3.1; adding noise to introduce stochastic variation is described in Section 3.2.
  4. Two automated metrics are proposed, perceptual path length and linear separability, to quantify the degree of disentanglement of the generator's latent space.
  5. The paper also provides an open-source face dataset, FFHQ (Flickr-Faces-HQ): https://github.com/nvlabs/ffhq-dataset

Abstract

Overview: Drawing on the style transfer literature, we propose an alternative generator architecture for generative adversarial networks.
Key properties: The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) from stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis.
The new generator improves the state of the art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation.
Evaluation: To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture.
FFHQ dataset: Finally, we introduce a new, highly varied and high-quality dataset of human faces.

1. Introduction

In recent years, generative methods, especially generative adversarial networks (GAN) [21], have produced images of rapidly improving resolution and quality [28, 41, 4]. However, the generators still operate as black boxes, and despite recent efforts [2], the understanding of various aspects of the image synthesis process, such as the origin of stochastic features, is still lacking. The properties of the latent space are also poorly understood, and the commonly demonstrated latent space interpolations [12, 48, 34] provide no quantitative way to compare different generators against each other.

Motivated by the style transfer literature [26], we redesign the generator architecture to control the image synthesis process in a new way. Our generator starts from a learned constant input and adjusts the "style" of the image at each convolutional layer according to the latent code, thus directly controlling the strength of image features at different scales. Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations. We do not modify the discriminator or loss function in any way, so our work is orthogonal to the ongoing discussion on GAN loss functions, regularization, and hyperparameters [23, 41, 4, 37, 40, 33].

Our generator embeds the input latent code into an intermediate latent space, which has a profound effect on how the factors of variation are represented in the network. The input latent space must follow the probability density of the training data, and we argue that this leads to some degree of unavoidable entanglement. Our intermediate latent space is free from that restriction and is therefore allowed to be disentangled. Since previous methods for estimating the degree of latent space disentanglement are not directly applicable in our case, we propose two new automated metrics, perceptual path length and linear separability, for quantifying these aspects of the generator. Using these metrics, we show that compared to a traditional generator architecture, our generator admits a more linear, less entangled representation of different factors of variation.

Finally, we present a new dataset of human faces (Flickr-Faces-HQ, FFHQ) that offers higher quality and wider variation than existing high-resolution datasets (Appendix A). We have made this dataset publicly available, along with our source code and pre-trained networks; the accompanying video can be found under the same links (see the official TensorFlow [Code] above).

2. Style-based generator


Figure 1. While a traditional generator {Karras2017} feeds the latent code only through the input layer, we first map the input to an intermediate latent space $\mathcal{W}$, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here "A" stands for a learned affine transformation and "B" applies learned per-channel scaling factors to the noise input. The mapping network $f$ consists of 8 layers and the synthesis network $g$ consists of 18 layers, two for each resolution ($4^2$ to $1024^2$). The output of the last layer is converted to RGB using a separate $1\times1$ convolution, similar to Karras et al. {Karras2017}. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in a traditional generator.

Traditionally, the latent code is provided to the generator through an input layer, i.e., the first layer of a feedforward network (Figure 1a). We depart from this design by omitting the input layer altogether and starting from a learned constant instead (Figure 1b, right). Given a latent code $\mathrm{z}$ in the input latent space $\mathcal{Z}$, a non-linear mapping network $f: \mathcal{Z} \to \mathcal{W}$ first produces $\mathrm{w} \in \mathcal{W}$ (Figure 1b, left). For simplicity, we set the dimensionality of both spaces to 512, and the mapping $f$ is implemented using an 8-layer MLP, a decision we analyze in Section 4.1.
Learned affine transformations then specialize $\mathrm{w}$ to styles $y = (y_s, y_b)$ that control adaptive instance normalization (AdaIN) {Huang2017, Dumoulin2016, Ghiasi2017, Dumoulin2018} operations after each convolution layer of the synthesis network $g$. The AdaIN operation is defined as

$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}\,,$$

where each feature map $x_i$ is normalized separately, and then scaled and biased using the corresponding scalar components from style $y$. Thus the dimensionality of $y$ is twice the number of feature maps on that layer.
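To make the mapping network and the AdaIN operation concrete, here is a minimal PyTorch sketch (my own simplification rather than the official implementation; it omits details such as the equalized learning rate and the normalization applied to $\mathrm{z}$ before the mapping):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8-layer MLP f: Z -> W (both 512-dimensional in the paper)."""
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # w in W

class AdaIN(nn.Module):
    """Adaptive instance normalization controlled by a style vector w.

    The learned affine transformation "A" maps w to y = (y_s, y_b): one scale
    and one bias per feature map, so dim(y) is twice the number of channels.
    """
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)       # per-channel mean/std normalization
        self.affine = nn.Linear(w_dim, channels * 2)  # "A": w -> (y_s, y_b)

    def forward(self, x, w):
        y_s, y_b = self.affine(w).chunk(2, dim=1)     # (N, C) scales and biases
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(x) + y_b               # AdaIN(x, y)
```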

Comparing our approach to style transfer, we compute the spatially invariant style $y$ from the vector $\mathrm{w}$ instead of from an example image.
We choose to reuse the word "style" for $y$ because similar network architectures are already used for feedforward style transfer {Huang2017}, unsupervised image-to-image translation {Huang2018}, and domain mixtures {Hao2018}. Compared to more general feature transforms {Li2017C, Siarohin2018}, AdaIN is particularly well suited for our purposes due to its efficiency and compact representation.

Finally, we provide our generator with a direct means to generate stochastic detail by introducing explicit noise inputs.
These are single-channel images consisting of uncorrelated Gaussian noise, and we feed a dedicated noise image to each layer of the synthesis network.
The noise image is broadcast to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution, as illustrated in Figure 1b. The implications of adding the noise inputs are discussed in Sections 3.2 and 3.3.
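A minimal sketch of this noise injection (again a simplification under my own naming; in particular, initializing the per-channel scaling factors "B" to zero is an illustrative choice):

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds a single-channel Gaussian noise image to a conv output,
    scaled per feature map by learned factors ("B" in Figure 1b)."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))  # learned per-channel scaling

    def forward(self, x, noise=None):
        n, _, h, w = x.shape
        if noise is None:
            # one single-channel noise image per sample, broadcast to all feature maps
            noise = torch.randn(n, 1, h, w, device=x.device)
        return x + self.scale * noise
```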

2.1. Quality of generated images

Before studying the properties of our generator, we demonstrate experimentally that the redesign does not compromise image quality but, in fact, improves it considerably. Table 1 gives Fréchet inception distances (FID) {Heusel2017} for various generator architectures on CelebA-HQ {Karras2017} and our new FFHQ dataset (Appendix A).
Results on other datasets are given in the Supplementary.
Our baseline configuration (A) is the progressive GAN setup of Karras et al. {Karras2017}, from which we inherit the network and all hyperparameters unless otherwise stated. We first switch to an improved baseline (B) by using bilinear up/downsampling operations {zhang2019}, longer training, and tuned hyperparameters.
A detailed description of the training settings and hyperparameters is included in the supplement.
We then further improve this new baseline by adding the mapping network and AdaIN operations (C), and make a surprising observation that the network no longer benefits from feeding the latent code into the first convolution layer. We therefore simplify the architecture by removing the traditional input layer and starting the image synthesis from a learned $4\times4\times512$ constant tensor (D).
We find it quite remarkable that the synthesis network is able to produce meaningful results even though it receives input only through the styles that control the AdaIN operations.
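Concretely, removing the traditional input layer means the synthesis network starts from a learned tensor rather than from $\mathrm{z}$. A hedged sketch of that starting point (configuration D), with illustrative names:

```python
import torch
import torch.nn as nn

class ConstantInput(nn.Module):
    """Learned 4x4x512 constant that replaces the traditional latent input."""
    def __init__(self, channels=512, size=4):
        super().__init__()
        self.const = nn.Parameter(torch.randn(1, channels, size, size))

    def forward(self, batch_size):
        # The same learned constant is used for every image; all variation
        # enters later through the styles (AdaIN) and the noise inputs.
        return self.const.expand(batch_size, -1, -1, -1)
```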

Finally, we introduce the noise inputs (E) that improve the results further, as well as novel mixing regularization (F) that decorrelates neighboring styles and enables more fine-grained control over the generated imagery (Section 3.1).

We evaluate our methods using two different loss functions: for CelebA-HQ we rely on WGAN-GP [23], while FFHQ uses WGAN-GP for configuration A and the non-saturating loss [21] with R1 regularization [40, 47, 13] for configurations B-F. We found these choices to give the best results. Our contributions do not modify the loss function.
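For reference, a short sketch of the non-saturating loss with R1 regularization used for configurations B-F (a generic formulation under my own naming and a placeholder γ; the actual hyperparameters are listed in the paper's supplement):

```python
import torch
import torch.nn.functional as F

def g_nonsaturating_loss(fake_logits):
    # Generator maximizes log D(G(z)) instead of minimizing log(1 - D(G(z))).
    return F.softplus(-fake_logits).mean()

def d_logistic_loss(real_logits, fake_logits):
    return F.softplus(fake_logits).mean() + F.softplus(-real_logits).mean()

def r1_penalty(real_images, real_logits, gamma=10.0):
    # R1: penalize the gradient of D with respect to real images.
    # real_images must have requires_grad=True before the discriminator forward pass.
    grad, = torch.autograd.grad(outputs=real_logits.sum(),
                                inputs=real_images,
                                create_graph=True)
    return (gamma / 2) * grad.pow(2).reshape(grad.shape[0], -1).sum(1).mean()
```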

We observe that the style-based generator (E) significantly improves FIDs by nearly 20% over the traditional generator (B), confirming large-scale ImageNet measurements in parallel work {Chen2018self, Brock2018}.
Figure 2 shows an uncurated set of novel images generated from the FFHQ dataset using our generator.
As confirmed by FIDs, the average quality is high, and even accessories such as glasses and hats can be synthesized successfully.
For this figure, we avoided sampling from the extreme regions of $\mathcal{W}$ using the so-called truncation trick {Marchesi2017, Brock2018, Kingma2018}; Appendix B details how the trick can be performed in $\mathcal{W}$ instead of $\mathcal{Z}$.
Note that our generator only allows truncation to be selectively applied to low resolutions, so high resolution details are not affected.

All FIDs in this paper are computed without the truncation trick, and we only use it for illustrative purposes in Figure 2 and the video. All images are generated at $1024^2$ resolution.
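Since FID is the quality measure quoted throughout this section, here is a minimal sketch of the distance itself, computed from the means and covariances of Inception features (a standard formula rather than the paper's evaluation code; extracting the Inception embeddings of real and generated images is assumed to happen elsewhere):

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to Inception features.

    mu*, sigma*: mean vectors and covariance matrices of real / generated features.
    """
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):   # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```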

2.2. Prior art

Most work on GAN architectures has focused on improving the discriminator, e.g., by using multiple discriminators {Durugkar2016, Mordido2018, Doan2018}, multi-resolution discrimination {Wang2017, Sharma2018}, or self-attention {Zhang2018sagan}.
Work on the generator side has mostly concentrated on the exact distribution of the input latent space {Brock2018} or on shaping the input latent space via Gaussian mixture models {BenYosef2018}, clustering {Mukherjee2018}, or encouraging convexity {Sainburg2018}.

Recent conditional generators feed the class identifier to a large number of layers in the generator {Miyato2018} through a separate embedding network, while the latent is still fed through the input layer.
Some authors consider feeding parts of the latent code to multiple generator layers {Denton2015, Brock2018}.
In parallel work, Chen et al. {Chen2018self} "self-modulate" the generator using AdaIN, similarly to our work, but they do not consider an intermediate latent space or noise inputs.

3. Properties of the style-based generator

Our generator architecture makes it possible to control the image synthesis via scale-specific modifications to the styles.
We can view the mapping network and the affine transformations as a way to draw samples for each style from a learned distribution, and the synthesis network as a way to generate a novel image based on a collection of styles.
The effect of each style is localized in the network, i.e., modifying a specific subset of styles can be expected to affect only certain aspects of the image.

To understand the reason for this localization, let us consider how AdaIN operates (Equation 1) by first normalizing each channel to zero mean and unit variance, and only then applying scale and bias according to style.
The new per-channel statistics, as specified by style, modify the relative importance of features for subsequent convolution operations, but they are not dependent on the original statistics due to normalization.
Therefore, each style only controls one convolution before being overwritten by the next AdaIN operation.
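Putting the pieces together, a sketch of one synthesis layer shows why a style only "lives" until the next AdaIN: the instance normalization discards the statistics imposed by the previous style before the new scale and bias are applied. This reuses the hypothetical `AdaIN` and `NoiseInjection` modules sketched earlier, and the exact ordering of operations is my reading of Figure 1b rather than the official code:

```python
import torch.nn as nn

class StyledConvLayer(nn.Module):
    """One of the two per-resolution layers: conv -> noise -> activation -> AdaIN."""
    def __init__(self, in_ch, out_ch, w_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.noise = NoiseInjection(out_ch)   # from the earlier sketch
        self.act = nn.LeakyReLU(0.2)
        self.adain = AdaIN(out_ch, w_dim)     # from the earlier sketch

    def forward(self, x, w):
        x = self.conv(x)
        x = self.act(self.noise(x))           # per-pixel stochastic detail, then nonlinearity
        # AdaIN first normalizes away the per-channel statistics set by the previous
        # style, then applies the new scale and bias derived from w.
        return self.adain(x, w)
```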

3.1. Style mixing

To further encourage the styles to localize, we employ mixing regularization, where a given percentage of images are generated using two random latent codes instead of one during training.
When generating such an image, we simply switch from one latent code to another at randomly selected points in the synthesis network—an operation we call style mixing.
Specifically, we run two latent codes $\mathrm{z}_1, \mathrm{z}_2$ through the mapping network, and have the corresponding $\mathrm{w}_1, \mathrm{w}_2$ control the styles so that $\mathrm{w}_1$ applies before the crossover point and $\mathrm{w}_2$ after it.
This regularization technique prevents the network from assuming that adjacent styles are correlated.
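A small sketch of how the crossover could be implemented during training (the 18-layer count follows the $1024^2$ architecture from Section 2; the per-layer list of styles and the `synthesis` interface that consumes it are hypothetical):

```python
import torch

def mixed_styles(mapping, z1, z2, num_layers=18):
    """Return one style vector per synthesis layer, switching from w1 to w2
    at a randomly chosen crossover point (style mixing)."""
    w1, w2 = mapping(z1), mapping(z2)
    crossover = torch.randint(1, num_layers, (1,)).item()
    # layers [0, crossover) use w1, layers [crossover, num_layers) use w2
    return [w1 if i < crossover else w2 for i in range(num_layers)]

# usage sketch: images = synthesis(mixed_styles(mapping, z1, z2))
```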

Figure 3. Two sets of images (source A and source B) were generated from their respective latent codes; the rest of the images were generated by copying a specified subset of styles from source B and taking the rest from source A. Copying the styles corresponding to coarse spatial resolutions ($4^2$ to $8^2$) brings high-level aspects such as pose, general hair style, face shape, and eyeglasses from source B, while all colors (eyes, hair, lighting) and finer facial features resemble A.
If we instead copy the styles of the middle resolutions ($16^2$ to $32^2$) from B, we inherit smaller-scale facial features, hair style, and eyes open/closed from B, while the pose, general face shape, and eyeglasses from A are preserved.
Finally, copying the fine styles ($64^2$ to $1024^2$) from B brings mainly the color scheme and microstructure.

Table 2 shows how enabling mixing regularization during training improves localization considerably, indicated by improved FIDs in scenarios where multiple latents are mixed at test time.
Figure 3 shows examples of images synthesized by mixing two latent codes with different ratios.
We can see that each style subset controls meaningful high-level properties of the image.

3.2. Stochastic variation

There are many aspects in human portraits that can be regarded as stochastic, such as the exact placement of hairs, stubble, freckles, or skin pores. Any of these can be randomized without affecting our perception of the image, as long as they follow the correct distribution.

Let's consider how traditional generators achieve random variation. Given that the only input to the network is through the input layer, the network needs to invent a way to generate spatially varying pseudorandom numbers from earlier activations when needed.
This consumes network capacity, and it is difficult to hide the periodicity of the generated signal—and not always successfully, as evidenced by the common repeating patterns in generated images.
Our architecture avoids these problems entirely by adding pixel-wise noise after each convolution.

Figure 4. Examples of stochastic variation. (a) Two generated images. (b) Zoom-in with different realizations of input noise. While the overall appearance is almost identical, individual hairs are placed very differently. (c) Standard deviation of each pixel over 100 different realizations, highlighting which parts of the images are affected by the noise. The main areas are the hair, silhouettes, and parts of the background, but there is also interesting stochastic variation in the eye reflections. Global aspects such as identity and pose are unaffected by stochastic variation.


Figure 5. Effect of noise inputs at different layers of our generator. (a) Noise is applied to all layers. (b) No noise. (c) Noise in fine layers only ($64^2$ to $1024^2$). (d) Noise in coarse layers only ($4^2$ to $32^2$). We can see that the artificial omission of noise leads to a featureless "painterly" look. Coarse noise causes large-scale curling of hair and the appearance of larger background features, while the fine noise brings out the finer curls of hair, finer background detail, and skin pores.

Figure 4 shows stochastic realizations of the same underlying image, produced using our generator with different noise realizations.
We can see that the noise affects only the stochastic aspects, leaving the overall composition and high-level aspects such as identity intact.
Figure 5 further illustrates the effect of applying random variation to different subsets of layers.
Since these effects are best seen in animation, please refer to the accompanying video demonstrating how varying a layer's noise input results in random changes at a matching scale.

We find an interesting phenomenon: the effect of noise is tightly localized in the network.
We assume that, at any point in the generator, there is pressure to introduce new content as quickly as possible, and that the easiest way for our network to create random variation is to rely on the noise provided.
Each layer has a new set of noise, so there is no incentive to generate random effects from early activations, leading to local effects.

3.3. Separation of global effects from stochasticity

The previous sections and the accompanying video show that while changing style has global effects (changing poses, identities, etc.), noise only affects insignificant random changes (differently combed hair, beard, etc.).
This observation is consistent with the style transfer literature, where it has been established that spatially invariant statistics (Gram matrix, channel-wise means, variances, etc.) reliably encode the style of an image [19, 36], while spatially varying features encode a specific instance.

In our style-based generator, the style affects the entire image because complete feature maps are scaled and biased with the same values.
Thus, global effects such as pose, lighting, or background style can be controlled consistently.
At the same time, noise is added to each pixel independently, so it is very suitable for controlling random changes.
If the network tries to control, for example, pose using noise, this will lead to spatially inconsistent decisions, which will then be penalized by the discriminator.
Thus, in the absence of explicit guidance, the network learns to use global and local channels appropriately.

4. Disentanglement studies


There are various definitions for disentanglement [50, 46, 1, 6, 18], but a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation.
However, the sampling probability of each combination of factors in $\mathcal{Z}$ needs to match the corresponding density in the training data.
As illustrated in Figure 6, this precludes the factors from being fully disentangled with typical datasets and input latent distributions.¹

A major benefit of our generator architecture is that the intermediate latent space $\mathcal{W}$ does not have to support sampling according to any fixed distribution; its sampling density is induced by the learned piecewise continuous mapping $f(\mathrm{z})$.
This mapping can be adapted to "unwarp" $\mathcal{W}$ so that the factors of variation become more linear.
We posit that there is pressure for the generator to do so, since it should be easier to generate realistic images based on a disentangled representation than on an entangled one.
As such, we expect training to yield a less entangled $\mathcal{W}$ in an unsupervised setting, i.e., when the factors of variation are not known in advance {Desjardins2012, Kingma2014VAE, Rezende2014, Chen2016, Higgins2017, Kim2018, Chen2018}.

Unfortunately, recently proposed metrics for quantifying disentanglement {Higgins2017, Kim2018, Chen2018, Eastwood2018} require an encoder network that maps input images to latent codes.
These metrics are not suitable for our purposes, since our baseline GAN lacks such an encoder.
While it is possible to add additional networks for this purpose {Chen2016, Donahue2016, Dumoulin2017}, we want to avoid putting effort into components that are not part of the actual solution.
To this end, we describe two new ways of quantifying disentanglement, neither of which requires an encoder or known factors of variation, and both of which are therefore computable for any image dataset and generator.

4.1. Perceptual path length

As pointed out by Laine [34], interpolation of latent space vectors can produce surprisingly non-linear changes in images.
For example, features that are not at any endpoint may appear in the middle of a linear interpolation path.
This is a sign that the latent space is entangled and the variable factors are not properly separated.
To quantify this effect, we can measure how drastically the image undergoes a change when we perform interpolation in the latent space.
Intuitively, a less curved latent space should have perceptually smoother transitions than a highly curved latent space.

As the basis of our metric, we use the perceptually-based pairwise image distance {Zhang2018metric}, computed as a weighted difference between two VGG16 {simonyan2014} embeddings, where the weights are fitted so that the metric agrees with human perceptual similarity judgments.
If we subdivide the latent space interpolation path into linear segments, we can define the total perceptual length of that segmented path as the sum of the perceptual differences over each segment, as reported by the image distance metric.
The natural definition for the perceptual path length would be the limit of this sum under infinitely fine subdivision, but in practice we approximate it using a small subdivision $\epsilon = 10^{-4}$.
The average perceptual path length in the latent space $\mathcal{Z}$, over all possible endpoints, is therefore

$$l_{\mathcal{Z}} = \mathbb{E}\left[\frac{1}{\epsilon^2}\, d\Big(G\big(\mathrm{slerp}(\mathrm{z}_1,\mathrm{z}_2;\,t)\big),\; G\big(\mathrm{slerp}(\mathrm{z}_1,\mathrm{z}_2;\,t+\epsilon)\big)\Big)\right],$$

where $\mathrm{z}_1,\mathrm{z}_2 \sim P(\mathrm{z})$, $t \sim U(0,1)$, $G$ is the generator (i.e., $g \circ f$ for style-based networks), and $d(\cdot,\cdot)$ evaluates the perceptual distance between the resulting images. Here slerp denotes spherical interpolation {shoemake85}, which is the most appropriate way of interpolating in our normalized input latent space {white16}.
To focus on facial features rather than the background, we crop the generated images to contain only faces before evaluating the pairwise image metric.
Because the metric $d$ is quadratic {Zhang2018metric}, we divide by $\epsilon^2$.
We calculate the expected value by taking 100,000 samples.

Computing the average perceptual path length in $\mathcal{W}$ is carried out in a similar fashion:

$$l_{\mathcal{W}} = \mathbb{E}\left[\frac{1}{\epsilon^2}\, d\Big(g\big(\mathrm{lerp}(f(\mathrm{z}_1),f(\mathrm{z}_2);\,t)\big),\; g\big(\mathrm{lerp}(f(\mathrm{z}_1),f(\mathrm{z}_2);\,t+\epsilon)\big)\Big)\right],$$

where the only difference is that the interpolation happens in $\mathcal{W}$ space. Because vectors in $\mathcal{W}$ are not normalized in any way, we use linear interpolation (lerp).
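A sketch of how both estimates could be computed, under stated assumptions: `d` is a perceptual distance callable (e.g., an LPIPS/VGG16-based metric), `G` is the full generator for the $\mathcal{Z}$ case, and the face-cropping step is omitted:

```python
import torch

def slerp(a, b, t):
    """Spherical interpolation between batches of latent vectors."""
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    omega = torch.acos((a_n * b_n).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

def lerp(a, b, t):
    return a + (b - a) * t

@torch.no_grad()
def ppl_z(G, d, num_samples=100_000, dim=512, eps=1e-4, batch=16):
    """Monte Carlo estimate of l_Z with spherical interpolation in Z."""
    total, n = 0.0, 0
    while n < num_samples:
        z1, z2 = torch.randn(batch, dim), torch.randn(batch, dim)
        t = torch.rand(batch, 1)
        img0 = G(slerp(z1, z2, t))
        img1 = G(slerp(z1, z2, t + eps))
        total += (d(img0, img1) / eps**2).sum().item()
        n += batch
    return total / n

# l_W is estimated analogously, interpolating f(z1) and f(z2) with lerp
# and feeding the results to the synthesis network g.
```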

Table 3 shows that this full-path length is substantially shorter for the style-based generator with noise inputs, indicating that $\mathcal{W}$ is perceptually more linear than $\mathcal{Z}$.
Yet, this measurement is in fact slightly biased in favor of the input latent space $\mathcal{Z}$.
If $\mathcal{W}$ is indeed a disentangled and "flattened" mapping of $\mathcal{Z}$, it may contain regions that are not on the input manifold, and thus badly reconstructed by the generator, even between points that are mapped from the input manifold, whereas the input latent space $\mathcal{Z}$ has no such regions by definition.
It is therefore to be expected that if we restrict our measure to path endpoints, i.e., $t \in \{0, 1\}$, we should obtain a smaller $l_{\mathcal{W}}$ while $l_{\mathcal{Z}}$ is not affected.
This is indeed what we observe in Table 3.
Table 4 shows how the mapping network affects path lengths.
We see that both traditional and style-based generators benefit from having a mapping network, and that additional depth generally improves the perceptual path length as well as FIDs.
Interestingly, while $l_{\mathcal{W}}$ improves in the traditional generator, $l_{\mathcal{Z}}$ becomes considerably worse, illustrating our claim that the input latent space can indeed be arbitrarily entangled in GANs.

4.2. Linear separability

If a latent space is sufficiently disentangled, it should be possible to find direction vectors that consistently correspond to individual factors of variation.
We propose another metric to quantify this effect by measuring how well latent-space points can be separated into two distinct sets by a linear hyperplane, so that each set corresponds to a specific binary attribute of the image.

To label the generated images, we train auxiliary classification networks for a number of binary attributes, e.g., to distinguish male and female faces.
In our tests, the classifiers have the same architecture as the discriminator we use (i.e., the same as in {Karras2017}) and are trained on the CelebA-HQ dataset, which retains the 40 attributes of the original CelebA dataset.
To measure the separability of one attribute, we generate 200,000 images with $\mathrm{z} \sim P(\mathrm{z})$ and classify them using the auxiliary classification network.
We then sort the samples according to classifier confidence and remove the least confident half, yielding 100,000 labeled latent-space vectors.

For each attribute, we fit a linear SVM to predict the label based on the latent-space point ($\mathrm{z}$ for traditional and $\mathrm{w}$ for style-based generators) and classify the points by this plane.
We then compute the conditional entropy $\mathrm{H}(Y|X)$, where $X$ is the class predicted by the SVM and $Y$ is the class determined by the pre-trained classifier.
This tells us how much additional information is required to determine the true class of a sample, given that we know on which side of the hyperplane it lies.
A low value suggests consistent latent space directions for the corresponding factor(s) of variation.

We compute the final separability score as $\exp\big(\sum_i \mathrm{H}(Y_i|X_i)\big)$, where $i$ enumerates the 40 attributes.
Similar to the inception score {Salimans2016B}, the exponentiation brings the values from the logarithmic domain to the linear domain so that they are easier to compare.
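A sketch of the separability computation as I read it (sklearn's `LinearSVC` stands in for the linear SVM; `latents`/`labels` are the 100,000 kept latent vectors per attribute and their classifier-assigned binary labels; entropies are measured in nats so that the final `exp` is the natural inverse):

```python
import numpy as np
from sklearn.svm import LinearSVC

def conditional_entropy(y_true, y_pred):
    """H(Y|X) in nats, where X is the SVM prediction and Y the classifier label."""
    h = 0.0
    for x in np.unique(y_pred):
        mask = (y_pred == x)
        p_x = mask.mean()
        p_y = np.bincount(y_true[mask].astype(int), minlength=2) / mask.sum()
        p_y = p_y[p_y > 0]
        h -= p_x * (p_y * np.log(p_y)).sum()
    return h

def separability_score(latents_per_attr, labels_per_attr):
    """exp(sum_i H(Y_i | X_i)) over the binary attributes."""
    total = 0.0
    for latents, labels in zip(latents_per_attr, labels_per_attr):
        svm = LinearSVC().fit(latents, labels)
        total += conditional_entropy(labels, svm.predict(latents))
    return float(np.exp(total))
```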

Tables 3 and 4 show that $\mathcal{W}$ is consistently better separable than $\mathcal{Z}$, suggesting a less entangled representation.
Furthermore, increasing the depth of the mapping network improves both image quality and separability in $\mathcal{W}$, which is in line with the hypothesis that the synthesis network inherently favors a disentangled input representation.
Interestingly, adding a mapping network in front of a traditional generator results in severe loss of separability in $\mathcal{Z}$ but improves the situation in the intermediate latent space $\mathcal{W}$, and the FID improves as well.
This shows that even the traditional generator architecture performs better when we introduce an intermediate latent space that does not have to follow the distribution of the training data.

5. Conclusion

Based on both our results and parallel work by Chen et al. [5], it is becoming clear that the traditional GAN generator architecture is in every way inferior to a style-based design.
This is true in terms of established quality metrics, and we further believe that our investigations into the separation of high-level attributes and stochastic effects, as well as the linearity of the intermediate latent space, will prove fruitful in improving the understanding and controllability of GAN synthesis.

We note that our average path length metric can easily be used as a regularizer in training, and perhaps some variant of the linear separability metric can also be used as a regularizer.
Overall, we expect that methods that directly shape the intermediate latent space during training will provide interesting avenues for future work.

A. The FFHQ dataset


We have collected a new dataset of human faces, Flickr-Faces-HQ (FFHQ), consisting of 70,000 high-quality images at $1024^2$ resolution (Figure 7).
The dataset includes considerably more variation than CelebA-HQ {Karras2017} in terms of age, ethnicity, and image background, and also has much better coverage of accessories such as eyeglasses, sunglasses, hats, etc.
The images were crawled from Flickr (thus inheriting all the biases of that website) and automatically aligned {Kazemi2014} and cropped.
Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Mechanical Turk allowed us to remove the occasional statue, painting, or photo of a photo.
We have made this dataset publicly available at https://github.com/nvlabs/ffhq-dataset

B. Truncation trick in $\mathcal{W}$


If we consider the distribution of training data, it is clear that areas of low density are poorly represented and thus likely to be hard for the generator to learn.
This is an important open problem in all generative modeling techniques.
However, it is known that drawing latent vectors from a truncated {Marchesi2017, Brock2018} or otherwise shrunk {Kingma2018} sampling space tends to improve average image quality, although some amount of variation is lost.

We can follow a similar strategy.
First, we compute the center of mass of $\mathcal{W}$ as $\bar{\mathrm{w}} = \mathbb{E}_{\mathrm{z}\sim P(\mathrm{z})}[f(\mathrm{z})]$.
In the case of FFHQ, this point represents a sort of average face (Figure 8, $\psi = 0$).
We can then scale the deviation of a given $\mathrm{w}$ from the center as $\mathrm{w}' = \bar{\mathrm{w}} + \psi(\mathrm{w} - \bar{\mathrm{w}})$, where $\psi < 1$.
While Brock et al. {Brock2018} observe that only a subset of networks is amenable to such truncation even when orthogonal regularization is used, truncation in $\mathcal{W}$ space seems to work reliably even without changes to the loss function.
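A minimal sketch of the truncation trick in $\mathcal{W}$ (the Monte Carlo estimate of $\bar{\mathrm{w}}$ and the default $\psi$ are illustrative choices; applying truncation only at low resolutions, as mentioned in Section 2.1, would additionally restrict which layers receive the truncated styles):

```python
import torch

@torch.no_grad()
def truncate(mapping, w, psi=0.7, num_samples=10_000, dim=512):
    """w' = w_bar + psi * (w - w_bar), with w_bar estimated by sampling f(z)."""
    z = torch.randn(num_samples, dim)
    w_bar = mapping(z).mean(dim=0, keepdim=True)   # center of mass of W
    return w_bar + psi * (w - w_bar)
```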


  1. The few artificial datasets designed for disentanglement studies (e.g., [39, 18]) tabulate all combinations of predetermined factors of variation with uniform frequency, thus hiding the problem. ↩︎
