[Computer Vision | Generative Adversarial Networks] Improved techniques for training generative adversarial networks (GANs)

This series of blog posts contains notes on deep learning / computer vision papers. Please credit the source when reprinting.

Title: Improved Techniques for Training GANs

Link: [1606.03498v1] Improved Techniques for Training GANs (arxiv.org)

Summary

This paper presents a variety of new architectural features and training procedures that are applied to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10, and SVHN. The generated images are of high quality, as confirmed by a visual Turing test: our model generates MNIST samples that are indistinguishable from real data, and CIFAR-10 samples with a human error rate of 21.3%. We also present ImageNet samples at unprecedented resolution, and show that our methods enable the model to learn recognizable features of ImageNet classes.

1 Introduction

Generative Adversarial Networks (GANs) are a game-theoretic approach to learning generative models [1]. The goal of GANs is to train a generator network $G(z; \pmb{\theta}^{(G)})$ that transforms a noise vector $z$ into a sample $x = G(z; \pmb{\theta}^{(G)})$, so as to produce samples from the data distribution $p_{\text{data}}(x)$. The training signal for the generator $G$ is provided by a discriminator network $D(x)$, which is trained to distinguish samples from the generator distribution $p_{\text{model}}(x)$ from real data. The generator network $G$ is in turn trained to fool the discriminator into accepting its outputs as real.

Recent applications of GANs show that they can generate high-quality samples [2, 3]. However, training GANs requires finding a Nash equilibrium of a non-convex game with continuous, high-dimensional parameters. GANs are usually trained using gradient descent techniques, which aim to find low values of a cost function rather than to find the Nash equilibrium of a game. These algorithms may fail to converge when used to seek a Nash equilibrium [4].

In this paper, we introduce several techniques intended to encourage convergence of the GANs game. These techniques are motivated by a heuristic understanding of the non-convergence problem, and they lead to improved semi-supervised learning performance and improved sample generation. We hope that some of them may form the basis for future work providing formal guarantees of convergence.

All code and hyperparameters can be found at the following link: https://github.com/openai/improved_gan

2 Related work

Several recent papers focus on improving the stability of training and the quality of generated GAN samples [2, 3, 5, 6]. We build on some of these techniques in this article. For example, we use some of the "DCGAN" architectural innovations proposed by Radford et al. [3], as described below.

One of the techniques we propose, feature matching (discussed in Section 3.1), is similar in spirit to approaches that use maximum mean discrepancy [7, 8, 9] to train generator networks [10, 11]. Another of our proposed techniques, minibatch features, is based in part on ideas used for batch normalization [12], while our proposed virtual batch normalization is a direct extension of batch normalization.

One of the main goals of this work is to improve the performance of GANs in semi-supervised learning (improving the performance of supervised tasks, in this case classification tasks, by learning on additional unlabeled examples). Like many deep generative models, GANs have previously been applied to semi-supervised learning [13, 14], and our work can be seen as a continuation and improvement of this effort.

3 Towards convergent GAN training

Training generative adversarial networks (GANs) involves finding a Nash equilibrium of a two-player non-cooperative game. Each player wishes to minimize its own cost function: $J^{(D)}(\pmb{\theta}^{(D)}, \pmb{\theta}^{(G)})$ for the discriminator and $J^{(G)}(\pmb{\theta}^{(D)}, \pmb{\theta}^{(G)})$ for the generator. A Nash equilibrium is a point $(\pmb{\theta}^{(D)}, \pmb{\theta}^{(G)})$ such that $J^{(D)}$ is at a minimum with respect to $\pmb{\theta}^{(D)}$ and $J^{(G)}$ is at a minimum with respect to $\pmb{\theta}^{(G)}$. Unfortunately, finding a Nash equilibrium is a very difficult problem. Algorithms exist for specialized cases, but we are not aware of any that apply to the GAN game, where the cost functions are non-convex, the parameters are continuous, and the parameter space is extremely high-dimensional.

The idea that a Nash equilibrium occurs when each player's cost is minimal with respect to its own parameters seems to intuitively motivate the use of traditional gradient-based minimization techniques to minimize each player's cost simultaneously. However, a modification to $\pmb{\theta}^{(D)}$ that reduces $J^{(D)}$ can increase $J^{(G)}$, and a modification to $\pmb{\theta}^{(G)}$ that reduces $J^{(G)}$ can increase $J^{(D)}$. Gradient descent therefore fails to converge for many games. For example, when one player minimizes $xy$ with respect to $x$ while the other player minimizes $-xy$ with respect to $y$, gradient descent enters a stable orbit rather than converging to the desired equilibrium $x = y = 0$ [15]. Previous approaches to GAN training have therefore applied gradient descent on each player's cost simultaneously, despite the lack of any guarantee that this procedure will converge. We introduce the following heuristic techniques to encourage convergence:
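To make the non-convergence concrete, the following NumPy sketch (our own illustration, not from the paper or its code) runs simultaneous gradient descent on this toy game. In continuous time the dynamics orbit the equilibrium; with a finite step size the iterates slowly spiral outward, so the players never reach $x = y = 0$:

```python
import numpy as np

# Toy game from Section 3: player 1 minimizes V(x, y) = x*y over x,
# player 2 minimizes -V(x, y) = -x*y over y, by simultaneous gradient descent.
x, y, lr = 1.0, 1.0, 0.1
for step in range(201):
    gx, gy = y, -x                     # each player's own gradient
    x, y = x - lr * gx, y - lr * gy    # simultaneous update
    if step % 50 == 0:
        # The radius grows by a factor sqrt(1 + lr**2) per step: an outward
        # spiral that never approaches the equilibrium (0, 0).
        print(step, round(np.hypot(x, y), 3))
```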

3.1 Feature matching

Feature matching addresses the instability of GANs by specifying a new objective for the generator that prevents it from overtraining on the current discriminator. Instead of directly maximizing the output of the discriminator, the new objective requires the generator to generate data that matches the statistics of the real data, where we use the discriminator only to specify the statistics that we think are worth matching. Specifically, we train the generator to match the expected values of the features on an intermediate layer of the discriminator. This is a natural choice of statistics for the generator to match, since by training the discriminator we ask it to find those features that are most discriminative of real data versus data generated by the current model.

Letting $\pmb{f}(x)$ denote activations on an intermediate layer of the discriminator, our new objective for the generator is defined as: $\|\mathbb{E}_{x \sim p_{\text{data}}} \pmb{f}(x) - \mathbb{E}_{z \sim p_z(z)} \pmb{f}(G(z))\|_2^2$. The discriminator, and hence $\pmb{f}(x)$, are trained in the usual way. As with regular GAN training, this objective has a fixed point where $G$ exactly matches the distribution of the training data. We have no guarantee of reaching this fixed point in practice, but our empirical results indicate that feature matching is indeed effective in situations where regular GANs become unstable.
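As a concrete illustration, a minimal PyTorch-style sketch of this objective could look as follows (our own sketch, not the authors' released code; `f_real` and `f_fake` stand for whatever intermediate-layer activations the discriminator exposes):

```python
import torch

def feature_matching_loss(f_real: torch.Tensor, f_fake: torch.Tensor) -> torch.Tensor:
    # f_real: discriminator features f(x) for a batch of real data, shape (n, A)
    # f_fake: discriminator features f(G(z)) for a batch of generated data
    # The batch means act as empirical estimates of the two expectations.
    return (f_real.mean(dim=0) - f_fake.mean(dim=0)).pow(2).sum()
```

The generator is trained to minimize this loss in place of the usual adversarial objective, while the discriminator keeps its standard training procedure.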

3.2 Minibatch discrimination

One of the main failure modes of GANs is for the generator to collapse to a parameter setting where it always emits the same point. When collapse to a single mode is imminent, the gradient of the discriminator may point in similar directions for many similar points. Because the discriminator processes each example independently, there is no coordination between its gradients, and thus no mechanism to tell the outputs of the generator to become more dissimilar to each other. Instead, all outputs race toward a single point that the discriminator currently believes is highly realistic. After collapse has occurred, the discriminator learns that this single point comes from the generator, but gradient descent is unable to separate the identical outputs. The gradients of the discriminator then push the single point produced by the generator around space forever, and the algorithm cannot converge to a distribution with the correct amount of entropy. An obvious strategy to avoid this type of failure is to allow the discriminator to look at multiple data examples in combination, and to perform what we call minibatch discrimination.

The concept of minibatch discrimination is quite general: any discriminator model that looks at multiple examples in combination, rather than in isolation, could potentially help avoid collapse of the generator. In fact, the successful application of batch normalization in the discriminator by Radford et al. [3] is well explained from this perspective. So far, however, we have restricted our experiments to models that explicitly aim to identify generator samples that are particularly close together. One successful specification for modeling the closeness between examples in a minibatch is as follows: Let $f(x_i) \in \mathbb{R}^A$ denote a vector of features for input $x_i$, produced by some intermediate layer in the discriminator. We then multiply the vector $f(x_i)$ by a tensor $T \in \mathbb{R}^{A \times B \times C}$, which results in a matrix $M_i \in \mathbb{R}^{B \times C}$. We then compute the $L_1$-distance between the rows of the resulting matrix $M_i$ across samples $i \in \{1, 2, \ldots, n\}$ and apply a negative exponential (Figure 1): $c_b(x_i, x_j) = \exp(-\|M_{i,b} - M_{j,b}\|_{L_1}) \in \mathbb{R}$. The output $o(x_i)$ of the minibatch layer for a sample $x_i$ is then defined as the sum of the $c_b(x_i, x_j)$'s to all other samples:

$$\begin{aligned} o(x_i)_b &= \sum_{j=1}^{n} c_b(x_i, x_j) \in \mathbb{R} \\ o(x_i) &= \left[ o(x_i)_1, o(x_i)_2, \ldots, o(x_i)_B \right] \in \mathbb{R}^B \\ o(X) &\in \mathbb{R}^{n \times B} \end{aligned}$$

Figure 1: Illustration of how minibatch discrimination works. Features $f(x_i)$ from sample $x_i$ are multiplied through the tensor $T$, and cross-sample distances are computed.

Next, we concatenate the output $o(x_i)$ of the minibatch layer with the intermediate features $f(x_i)$ that were its input, and feed the result into the next layer of the discriminator. We compute these minibatch features separately for samples from the generator and from the training data. As before, the discriminator is still required to output a single number for each example indicating how likely it is to come from the training data: the task of the discriminator is effectively still to classify single examples as real or generated, but it is now able to use the other examples in the minibatch as side information. Minibatch discrimination allows us to generate visually appealing samples very quickly, and in this regard it is superior to feature matching (Section 6). Interestingly, however, feature matching was found to work much better if the goal is to obtain a strong classifier using the semi-supervised learning approach described in Section 5.
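A compact PyTorch-style sketch of such a layer might look like this (our own reading of the equations above, not the authors' implementation; the initialization of $T$ is arbitrary here):

```python
import torch
import torch.nn as nn

class MinibatchDiscrimination(nn.Module):
    """Minibatch discrimination layer (Section 3.2), as a sketch."""
    def __init__(self, A: int, B: int, C: int):
        super().__init__()
        self.T = nn.Parameter(0.1 * torch.randn(A, B, C))  # T in R^{A x B x C}

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (n, A) intermediate discriminator features f(x_i)
        M = torch.einsum('na,abc->nbc', f, self.T)    # M_i in R^{B x C}
        diff = M.unsqueeze(0) - M.unsqueeze(1)        # pairwise row differences, (n, n, B, C)
        c = torch.exp(-diff.abs().sum(dim=-1))        # c_b(x_i, x_j), shape (n, n, B)
        o = c.sum(dim=1)                              # o(x_i)_b; the j = i term only adds exp(0) = 1
        return torch.cat([f, o], dim=1)               # concatenate with the input features
```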

3.3 Historical averaging

When applying this technique, we modify each player's cost to include a term $\|\pmb{\theta} - \frac{1}{t} \sum_{i=1}^{t} \pmb{\theta}[i]\|^2$, where $\pmb{\theta}[i]$ is the value of the parameters at past time $i$. The historical average of the parameters can be updated in an online fashion, so this learning rule scales well to long time series. This approach is loosely inspired by the fictitious play [16] algorithm, which can find equilibria in other kinds of games. We found that our approach was able to find equilibria of low-dimensional, continuous, non-convex games, such as the game with one player controlling $x$, the other controlling $y$, and value function $(f(x) - 1)(y - 1)$, where $f(x) = x$ for $x < 0$ and $f(x) = x^2$ otherwise. For these same toy games, gradient descent fails by going into extended orbits that do not approach the equilibrium point.
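A minimal sketch of the online average and the resulting penalty (our own illustration; the paper does not specify a coefficient, so scaling the penalty is left to the caller, and we assume one `penalty()` call per training step):

```python
import torch

class HistoricalAveraging:
    """Tracks a running average of parameters and returns the penalty
    ||theta - (1/t) * sum_i theta[i]||^2 from Section 3.3 (sketch)."""
    def __init__(self, params):
        self.params = list(params)
        self.avgs = [p.detach().clone() for p in self.params]
        self.t = 1

    def penalty(self) -> torch.Tensor:
        self.t += 1
        loss = torch.zeros(())
        for p, avg in zip(self.params, self.avgs):
            avg += (p.detach() - avg) / self.t   # online update of the running mean
            loss = loss + (p - avg).pow(2).sum()
        return loss  # add (suitably scaled) to the player's cost before backprop
```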

3.4 One-sided label smoothing

Label smoothing, a technique from the 1980s recently independently rediscovered by Szegedy et al. [17], replaces the 0 and 1 targets of a classifier with smoothed values such as .9 and .1, and was recently shown to reduce the vulnerability of neural networks to adversarial examples [18]. Replacing positive classification targets with $\alpha$ and negative targets with $\beta$, the optimal discriminator becomes $D(x) = \frac{\alpha p_{\text{data}}(x) + \beta p_{\text{model}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}$. The presence of $p_{\text{model}}$ in the numerator is problematic because, in regions where $p_{\text{data}}$ is approximately zero and $p_{\text{model}}$ is large, erroneous samples from $p_{\text{model}}$ have no incentive to move closer to the data. We therefore smooth only the positive labels to $\alpha$, leaving negative labels set to 0.
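In code, one-sided smoothing is a one-line change to the discriminator's targets. A sketch, assuming a logit-output discriminator and $\alpha = 0.9$ purely as an illustrative value (the text above does not fix $\alpha$):

```python
import torch
import torch.nn.functional as F

def d_loss_one_sided_smoothing(d_real_logits, d_fake_logits, alpha: float = 0.9):
    # Positive (real) targets are smoothed to alpha; negative (fake) targets stay 0.
    real_t = torch.full_like(d_real_logits, alpha)
    fake_t = torch.zeros_like(d_fake_logits)
    return (F.binary_cross_entropy_with_logits(d_real_logits, real_t)
            + F.binary_cross_entropy_with_logits(d_fake_logits, fake_t))
```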

3.5 Virtual batch normalization

Batch normalization greatly improves the optimization of neural networks, and has been shown to be highly effective for DCGANs [3]. However, it causes the output of a neural network for an input example $x$ to be highly dependent on the other inputs $x'$ in the same minibatch. To avoid this problem we introduce virtual batch normalization (VBN), in which each example $x$ is normalized based on the statistics collected on a reference batch of examples that are chosen once and fixed at the start of training, together with $x$ itself. The reference batch is normalized using only its own statistics. VBN is computationally expensive because it requires running forward propagation on two minibatches of data, so we use it only in the generator network.
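A literal (and deliberately simple) sketch of this idea, assuming per-feature statistics and learned scale/shift parameters; the authors' actual implementation may combine the reference and example statistics more efficiently than this per-example loop:

```python
import torch

def virtual_batch_norm(x, ref_batch, gamma, beta, eps: float = 1e-5):
    # For each example x_i, form the augmented batch (ref_batch + x_i) and
    # normalize x_i with that batch's statistics, so the result never depends
    # on the other examples in x's own minibatch.
    out = []
    for xi in x:
        aug = torch.cat([ref_batch, xi.unsqueeze(0)], dim=0)
        mean = aug.mean(dim=0)
        var = aug.var(dim=0, unbiased=False)
        out.append((xi - mean) / torch.sqrt(var + eps))
    return gamma * torch.stack(out) + beta
```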

4 Image Quality Evaluation

Generative adversarial networks lack an objective function, which makes it difficult to compare the performance of different models. One intuitive metric of performance can be obtained by having human annotators judge the visual quality of samples [2]. We automate this process using Amazon Mechanical Turk (MTurk) with the web interface at http://infinite-chamber-35121.herokuapp.com/cifar-minibatch/ (shown in Figure 2), which we use to ask annotators to distinguish between generated data and real data. The resulting quality assessments of our models are described in Section 6.

A downside of using human annotators is that the metric varies depending on the setup of the task and the motivation of the annotators. We also find that results change drastically when we give annotators feedback about their mistakes: by learning from such feedback, annotators are better able to point out the flaws in generated images, giving a more pessimistic quality assessment. The left column of Figure 2 presents a screen from the annotation process, while the right column shows how we inform annotators about their mistakes.

Figure 2: The web interface provided to annotators. Annotators are asked to distinguish computer-generated images from real images.

As an alternative to human annotators, we propose an automatic method to evaluate samples, which we find correlates well with human evaluation: we apply the Inception model¹ [19] to every generated image to get the conditional label distribution $p(y|x)$. Images that contain meaningful objects should have a conditional label distribution $p(y|x)$ with low entropy. Moreover, we expect the model to generate varied images, so the marginal $\int p(y|x = G(z))\,dz$ should have high entropy. Combining these two requirements, the metric we propose is $\exp(\mathbb{E}_x \mathrm{KL}(p(y|x)\,\|\,p(y)))$, where we exponentiate results so the values are easier to compare. Our Inception score is closely related to the objective used for training generative models in CatGAN [14]: although we had less success using such an objective at training time, we find it is a good metric for evaluation that correlates very well with human judgment. We find that it is important to evaluate the metric on a large enough number of samples (i.e., 50k), as part of the metric measures diversity.
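Given precomputed Inception softmax outputs for the generated images, the score itself is a few lines of NumPy (a sketch of the formula above; `probs` is assumed to hold one $p(y|x)$ row per image):

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    # probs: (N, num_classes) softmax outputs p(y|x) of the Inception model.
    p_y = probs.mean(axis=0)                                # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # KL(p(y|x) || p(y)) terms
    return float(np.exp(kl.sum(axis=1).mean()))             # exp(E_x KL(...))
```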

5 Semi-supervised learning

Consider a standard classifier for classifying a data point $x$ into one of $K$ possible classes. Such a model takes $x$ as input and outputs a $K$-dimensional vector of logits $\{l_1, \ldots, l_K\}$, which can be turned into class probabilities by applying the softmax: $p_{\text{model}}(y = j|x) = \frac{\exp(l_j)}{\sum_{k=1}^{K} \exp(l_k)}$. In supervised learning, such a model is trained by minimizing the cross-entropy between the observed labels and the model's predictive distribution $p_{\text{model}}(y|x)$.

We can do semi-supervised learning with any standard classifier by simply adding samples from the GAN generator $G$ to our dataset, labeling them with a new "generated" class $y = K + 1$, and correspondingly increasing the dimension of our classifier output from $K$ to $K + 1$. We may then use $p_{\text{model}}(y = K + 1|x)$ to supply the probability that $x$ is fake, corresponding to $1 - D(x)$ in the original GAN framework. We can now also learn from unlabeled data, as long as we know that it corresponds to one of the $K$ classes of real data, by maximizing $\log p_{\text{model}}(y \in \{1, \ldots, K\}|x)$.

Assuming half of our dataset consists of real data and half of generated data (this proportion is arbitrary), our loss function for training the classifier becomes:

$$\begin{aligned} L &= -\mathbb{E}_{x,y \sim p_{\text{data}}(x,y)} [\log p_{\text{model}}(y|x)] - \mathbb{E}_{x \sim G} [\log p_{\text{model}}(y = K + 1|x)] \\ &= L_{\text{supervised}} + L_{\text{unsupervised}}, \end{aligned}$$

where

$$\begin{aligned} L_{\text{supervised}} &= -\mathbb{E}_{x,y \sim p_{\text{data}}(x,y)} \log p_{\text{model}}(y|x, y < K + 1) \\ L_{\text{unsupervised}} &= -\{\mathbb{E}_{x \sim p_{\text{data}}(x)} \log[1 - p_{\text{model}}(y = K + 1|x)] + \mathbb{E}_{x \sim G} \log[p_{\text{model}}(y = K + 1|x)]\} \end{aligned}$$
Here we have decomposed the total cross-entropy loss into the standard supervised loss $L_{\text{supervised}}$ (the negative log probability of the label, given that the data is real) and an unsupervised loss $L_{\text{unsupervised}}$, which is in fact the standard GAN game value, as becomes evident when we substitute $D(x) = 1 - p_{\text{model}}(y = K + 1|x)$ into the expression:

$$L_{\text{unsupervised}} = -\{\mathbb{E}_{x \sim p_{\text{data}}(x)} \log D(x) + \mathbb{E}_{z \sim \text{noise}} \log(1 - D(G(z)))\}$$

The optimal solution for minimizing both $L_{\text{supervised}}$ and $L_{\text{unsupervised}}$ is to have $\exp[l_j(x)] = c(x)p(y = j, x)$ for all $j < K + 1$ and $\exp[l_{K+1}(x)] = c(x)p_G(x)$, for some undetermined scaling function $c(x)$. The unsupervised loss is thus consistent with the supervised loss in the sense of Sutskever et al. [13], and we can hope to better estimate this optimal solution from the data by minimizing these two loss functions jointly. In practice, $L_{\text{unsupervised}}$ will only help if it is not trivial for our classifier to minimize, and thus we need to train $G$ to approximate the data distribution. One way to do this is by training $G$ to minimize the GAN game value, using the discriminator $D$ defined by our classifier. This approach introduces an interaction between $G$ and our classifier that we do not fully understand, but in practice we find that optimizing $G$ with feature matching works very well for semi-supervised learning, while training $G$ with GAN using minibatch discrimination does not work at all. Here we present our empirical results using this approach; developing a full theoretical understanding of the interaction between $D$ and $G$ is left for future work.

Finally, note that our classifier with $K + 1$ outputs is over-parameterized: subtracting a general function $f(x)$ from each output logit, i.e. $l_j(x) \leftarrow l_j(x) - f(x)$ for all $j$, does not change the output of the softmax. This means we may equivalently fix $l_{K+1}(x) = 0$ for all $x$, in which case $L_{\text{supervised}}$ becomes the standard supervised loss function of our original classifier with $K$ classes, and our discriminator $D$ is given by $D(x) = \frac{Z(x)}{Z(x)+1}$, where $Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$.
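Fixing $l_{K+1}(x) = 0$ makes both losses straightforward to implement from standard $K$-class logits. A hedged sketch (our own, using the log-sum-exp identities $\log D = \log Z - \mathrm{softplus}(\log Z)$ and $\log(1 - D) = -\mathrm{softplus}(\log Z)$, which follow from $D = Z/(Z+1)$):

```python
import torch
import torch.nn.functional as F

def ssl_gan_losses(logits_lab, labels, logits_unl, logits_fake):
    # logits_*: (n, K) real-class logits from the shared classifier/discriminator,
    # with the (K+1)-th "generated" logit implicitly fixed to 0.
    l_supervised = F.cross_entropy(logits_lab, labels)

    log_z_unl = torch.logsumexp(logits_unl, dim=1)    # log Z(x), unlabeled real data
    log_z_fake = torch.logsumexp(logits_fake, dim=1)  # log Z(x), generated data
    log_d_unl = log_z_unl - F.softplus(log_z_unl)     # log D(x) = log[Z/(Z+1)]
    log_not_d_fake = -F.softplus(log_z_fake)          # log(1 - D(x)) = -log(Z+1)
    l_unsupervised = -(log_d_unl.mean() + log_not_d_fake.mean())
    return l_supervised, l_unsupervised
```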

5.1 The Importance of Labels to Image Quality

In addition to achieving state-of-the-art results in semi-supervised learning, the approach described above has the surprising effect of improving the quality of generated images as judged by human annotators. The reason appears to be that the human visual system is strongly attuned to image statistics that help infer which class of object an image represents, while it is presumably less sensitive to local statistics that are less important for interpreting the image. This is supported by the high correlation between the quality reported by human annotators and the Inception score we developed in Section 4, which is explicitly constructed to measure the "objectness" of a generated image. By having the discriminator $D$ classify the object shown in the image, we bias it toward developing an internal representation that puts emphasis on the same features humans emphasize. This effect can be understood as a method for transfer learning, and could potentially be applied much more broadly. We leave further exploration of this possibility for future work.

6 Experiments

We conduct semi-supervised experiments on MNIST, CIFAR-10, and SVHN datasets, and sample generation experiments on MNIST, CIFAR-10, SVHN, and ImageNet datasets. We provide code to reproduce most of the experiments.

6.1 MNIST

The MNIST dataset contains 60,000 labeled images of handwritten digits. We perform semi-supervised training with a small, randomly picked fraction of these, considering setups with 20, 50, 100, and 200 labeled examples. Results are averaged over 10 random subsets of labeled data, each chosen to have a balanced number of examples from each class. The remaining training images are provided without labels. Our networks each have 5 hidden layers. We use weight normalization [20] and add Gaussian noise to the output of each layer of the discriminator. Table 1 summarizes our results.

Table 1: Number of misclassified test examples in the semi-supervised setting on MNIST with permutation invariance. Results are averaged over 10 seeds.

Samples generated by our model during semi-supervised learning with feature matching (Section 3.1) do not look visually appealing (Figure 3, left). By using minibatch discrimination instead (Section 3.2), we can improve their visual quality. On MTurk, annotators were able to distinguish samples only 52.4% of the time (2000 votes total), where 50% would be obtained by random guessing. Similarly, researchers at our institution were unable to find any artifacts they could use to distinguish samples. However, semi-supervised learning with minibatch discrimination does not produce as good a classifier as feature matching.

Figure 3: (Left) Samples generated by the model during semi-supervised training with feature matching. These samples can be clearly distinguished from images coming from the MNIST dataset. (Right) Samples generated with minibatch discrimination. These samples are completely indistinguishable from the dataset images.

6.2 CIFAR-10

CIFAR-10 is a small, well-studied dataset of 32 × 32 natural images. We use this dataset to study semi-supervised learning, as well as to examine the visual quality of samples that can be achieved. For the discriminator in our GAN we use a 9-layer deep convolutional network with dropout and weight normalization. The generator is a 4-layer deep CNN with batch normalization. Table 2 summarizes our results on the semi-supervised learning task.

Table 2: Test errors on semi-supervised CIFAR-10. Results are averaged over 10 splits of the data.

When presented with 50% real and 50% fake data generated by our best CIFAR-10 model, MTurk users correctly categorized 78.7% of the images. However, MTurk users may not have been sufficiently familiar with CIFAR-10 images or sufficiently motivated; we ourselves were able to categorize images with >95% accuracy. We validate the Inception score described above by observing that MTurk accuracy drops to 71.4% when the data is filtered to use only the top 1% of samples according to the Inception score. We performed a series of ablation experiments to demonstrate that our proposed techniques improve the Inception score, with results summarized in Table 3. We also present images for these ablation experiments; in our opinion, the Inception score correlates well with our subjective judgment of image quality. Samples from the dataset achieve the highest value. All models that are even partially collapsed have relatively low scores. We caution that the Inception score should be used as a rough guide to evaluate models that were trained via some independent criterion; directly optimizing the Inception score will lead to the generation of adversarial examples [25].

Figure 4: Samples generated during training on semi-supervised CIFAR-10 using feature matching (Section 3.1, left) and mini-batch discrimination (Section 3.2, right).

6.3 SVHN

For the SVHN dataset, we used the same architecture and experimental setup as for CIFAR-10.

Figure 5: (Left) Error rate on SVHN. (Right) Samples from the SVHN generator.

Table 3: Table of Inception scores for samples generated by different models, computed on 50,000 images. The score correlates highly with human judgment, and the best score is achieved for natural images. Models that generate collapsed samples have relatively low scores. This metric allows us to avoid relying on human evaluations. "Our method" includes all techniques described in this work, except for feature matching and historical averaging. The remaining experiments are ablation experiments showing that our techniques are effective. "-VBN+BN" replaces the VBN in the generator with BN, as in DCGANs; this causes a small decrease in sample quality on CIFAR. VBN is more important for ImageNet. "-L+HA" removes the labels from the training process and adds historical averaging to compensate; HA makes it possible to still generate some recognizable objects. Without HA, sample quality is considerably reduced (see "-L"). "-LS" removes label smoothing and incurs a noticeable drop in performance relative to "our method". "-MBF" removes the minibatch features and incurs a very large drop in performance, greater even than the drop resulting from removing the labels; adding HA cannot prevent this problem.

6.4 ImageNet

We tested our techniques on a dataset of unprecedented scale: 128 × 128 images from the ILSVRC2012 dataset with 1,000 categories. To our knowledge, no previous publication has applied a generative model to a dataset with both this high a resolution and this large a number of object classes. The large number of object classes is particularly challenging for GANs due to their tendency to underestimate the entropy in the distribution. We extensively modified a publicly available implementation of DCGANs² using TensorFlow [26], with a multi-GPU implementation to achieve high performance. DCGANs without modification learn some basic image statistics and generate contiguous shapes with somewhat natural color and texture, but do not learn any objects. Using the techniques described in this paper, GANs learn to generate objects that resemble animals, but with incorrect anatomy. Results are shown in Figure 6.

Figure 6: Samples generated from the ImageNet dataset. (Left) Samples generated by DCGAN. (Right) Samples generated using the technique proposed in this paper. The new technique allowed GANs to learn recognizable features of animals, such as fur, eyes, and noses, but these features failed to combine correctly to form animals with realistic anatomy.

7 Conclusion

Generative adversarial networks are a promising class of generative models, but their unstable training and lack of proper evaluation metrics have been limiting factors so far. This study proposes a partial solution to these two problems. We propose several techniques to stabilize training, allowing us to train previously untrainable models. Furthermore, our proposed evaluation metric (Inception score) provides us with a basis for comparing the quality of these models. We apply our technique to semi-supervised learning problems, achieving state-of-the-art results on several different datasets in computer vision. The contribution of this study has practical implications; we hope that a more rigorous theoretical understanding can be developed in future studies.

References

  1. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In NIPS, 2014.
  2. Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. arXiv preprint arXiv:1506.05751, 2015.
  3. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  4. Ian J Goodfellow. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515, 2014.
  5. Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
  6. Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. arXiv preprint arXiv:1603.07442, 2016.
  7. Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic learning theory, pages 63–77. Springer, 2005.
  8. Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In NIPS, volume 20, pages 489–496, 2007.
  9. Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Algorithmic learning theory, pages 13–31. Springer, 2007.
  10. Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. CoRR, abs/1502.02761, 2015.
  11. Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
  12. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  13. Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, et al. Towards principled unsupervised learning. arXiv preprint arXiv:1511.06440, 2015.
  14. Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
  15. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. 2016. MIT Press.
  16. George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374–376, 1951.
  17. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints, December 2015.
  18. David Warde-Farley and Ian Goodfellow. Adversarial perturbations of deep neural networks. In Tamir Hazan, George Papandreou, and Daniel Tarlow, editors, Perturbations, Optimization, and Statistics, chapter 11. 2016. Book in preparation for MIT Press.
  19. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
  20. Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.
  21. Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Neural Information Processing Systems, 2014.
  22. Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677, 2015.
  23. Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
  24. Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.
  25. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  26. Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

  1. We use the pretrained Inception model, available at http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz. Code to compute the Inception score with this model will be made available by the time of publication.

  2. https://github.com/carpedm20/DCGAN-tensorflow
