[Computer Vision | Generative Adversarial] Generative Adversarial Network (GAN)

This series of blog posts contains notes on deep learning / computer vision papers; please credit the source when reprinting.

Title: Generative Adversarial Nets

Link: Generative Adversarial Nets (nips.cc)

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we train two models simultaneously:

  • A generative model G that captures the data distribution

  • A discriminative model D that estimates the probability that a sample came from the training data rather than from G.

The training procedure for G is to maximize the probability of D making a mistake.

This framework corresponds to a minimax two-player game.

In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. No Markov chains or unrolled approximate inference networks are needed during either training or sample generation.

Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1 Introduction

The role of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the types of data encountered in AI applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora.

The most notable successes of deep learning so far have involved discriminative models, typically models that map high-dimensional, rich sensory inputs to class labels [14, 20]. These compelling successes are largely based on backpropagation and dropout algorithms, using piecewise linear units [17, 8, 9], which have particularly good gradient behavior.

Deep generative models have had less impact, due to the many hard-to-approximate probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹

In the proposed adversarial nets framework, the generative model is pitted against an adversary:

  • A discriminative model learns to determine whether a sample came from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection,

  • while the discriminative model is analogous to the police, trying to detect the counterfeit currency.

Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

This framework can yield specific training algorithms for many kinds of models and optimization algorithms.

In this paper, we explore the special case in which the generative model generates samples by passing random noise through a multilayer perceptron (MLP), and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets.

In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17], and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.

2 Related work

Until now, most work on deep generative models has focused on models that provide a parametric specification of a probability distribution function, which can then be trained by maximizing the log-likelihood.

  • In this family of models, perhaps the most successful is the deep Boltzmann machine [25].
  • Such models generally have intractable likelihood functions and therefore require numerous approximations to the likelihood gradient.

These difficulties motivated the development of "generative machine" models:

  • models that do not explicitly represent the likelihood, yet are able to generate samples from the desired distribution.

  • Generative stochastic networks [4] are an example of a generative machine that can be trained with exact backpropagation rather than the numerous approximations required for Boltzmann machines.

This work extends the idea of generative machines by eliminating the Markov chains used in generative stochastic networks.

Our work backpropagates derivatives through generative processes by exploiting the observation that

$$\lim_{\sigma \rightarrow 0} \nabla_{\pmb{x}} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 \pmb{I})} f(\pmb{x} + \epsilon) = \nabla_{\pmb{x}} f(\pmb{x})$$

Translator's note: the formula says that the gradient of the expectation of $f$ under vanishing Gaussian noise equals the gradient of $f$ itself, which is why the author can use backpropagation of the error to train a GAN.
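A quick numerical sanity check of this identity (my own sketch, using an arbitrary toy function $f$, not anything from the paper): as $\sigma$ shrinks, the gradient of the noise-smoothed expectation approaches the true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x) + 0.5 * x**2

def grad_f(x):          # analytic gradient of f
    return np.cos(x) + x

x, h = 1.3, 1e-4
for sigma in (1.0, 0.1, 0.01):
    eps = rng.normal(0.0, sigma, size=1_000_000)
    # central-difference gradient of E[f(x + eps)] w.r.t. x,
    # using common random numbers for both evaluations
    g = (f(x + h + eps).mean() - f(x - h + eps).mean()) / (2 * h)
    print(f"sigma={sigma}: smoothed grad = {g:.4f}, true grad = {grad_f(x):.4f}")
```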

At the time, we did not know that Kingma and Welling [18] and Rezende et al. [23] had developed more general stochastic backpropagation rules, allowing one to backpropagate through Gaussian distributions with finite variance, and to backpropagate to the covariance parameter as well as the mean.

  • These backpropagation rules allow us to learn the conditional variance of the generator, which we treat as a hyperparameter in this paper.

Kingma and Welling [18] and Rezende et al. [23] use stochastic backpropagation to train variational autoencoders (VAEs).

  • Unlike GANs, VAEs pair a differentiable generator network with a second neural network.

  • Unlike GANs, the second network in VAE is a recognition model that performs approximate inference.

  • GANs require differentiation through the visible units and thus cannot model discrete data, while VAEs require differentiation through the hidden units and thus cannot have discrete latent variables.

Other VAE-like methods exist [12, 22], but are less relevant to our method.

Previous work has also adopted discriminative criteria to train generative models [29, 13]. These approaches are intractable for deep generative models because they involve ratios of probabilities that cannot be approximated using the variational approximations that lower-bound the probability.

Noise-contrastive estimation (NCE) [13] involves training a generative model by learning weights that make the model useful for discriminating data from a fixed noise distribution.

  • Using a previously trained model as the noise distribution allows training a sequence of models of increasing quality. This can be seen as an informal competition mechanism, similar in spirit to the formal competition used in the adversarial networks game.

  • A key limitation of NCE is that its "discriminator" is defined by the ratio of the probability densities of the noise distribution and the model distribution, and thus requires the ability to evaluate and backpropagate through both densities.

Some previous work has used the general concept of having two neural networks compete. The most relevant work is predictability minimization (PM) [26]. In predictability minimization, each hidden unit of a neural network is trained to be different from the output of a second network, which predicts the value of that hidden unit given the values of all the other hidden units.

Translator's note: predictability minimization is a neural-network training method that aims to make each hidden unit, given the values of the other hidden units, differ from the output of another network. Concretely, the second network predicts the value of a particular hidden unit in the network; by training that hidden unit to differ from the predicted value, predictability minimization tries to ensure that the network's hidden representation is statistically independent across units while performing some task, which helps improve the network's expressive power and generalization.

This work differs from predictability minimization in three important ways:

  1. In this work, **the competition between the networks is the sole training criterion,** and is sufficient on its own to train the network. Predictability minimization is only a regularizer that encourages the hidden units of a neural network to be statistically independent while they accomplish some other task; it is not the primary training criterion.

  2. The nature of the competition is different. In predictability minimization, two networks' outputs are compared, with one network trying to make the outputs similar and the other trying to make them different, and the output in question is a single scalar. In a GAN, one network produces a rich, high-dimensional vector that is used as the input to another network, and attempts to choose an input that the other network does not know how to process.

  3. The specification of the learning process is different. Predictability minimization is described as an optimization problem with an objective function to be minimized, and learning approaches the minimum of that objective. **A GAN is based on a minimax game rather than an optimization problem, and has a value function that one agent seeks to maximize and the other seeks to minimize.** The game terminates at a saddle point that is a minimum with respect to one player's strategy and a maximum with respect to the other player's strategy.

GANs are sometimes confused with the related concept of "adversarial examples" [28].

  • Adversarial examples are examples found by using gradient-based optimization methods directly on the input of a classification network, with the goal of finding examples that are similar to the data but misclassified.

  • This differs from our work because adversarial examples are not a mechanism for training a generative model. Instead, adversarial examples are primarily an analysis tool for showing that neural networks behave in surprising ways, confidently assigning two images to different classes even when the difference between them is imperceptible to a human observer.

  • The existence of such adversarial examples does suggest that GAN training could be inefficient, because they show that modern discriminative networks can confidently recognize a class without modeling any of the human-perceptible attributes of that class.

3 Adversarial networks

The adversarial modeling framework applies most directly when the models are all multi-layer perceptrons (MLPs).

  • To learn the generator's distribution $p_g$ over data $\pmb{x}$, we define a prior on input noise variables $p_z(z)$, and then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$.

  • We also define a second multilayer perceptron $D(\pmb{x}; \theta_d)$, which outputs a single scalar. $D(\pmb{x})$ represents the probability that $\pmb{x}$ came from the real data rather than from the generator's distribution $p_g$.

We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$.

We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$.

Translator's note: the above can be described more vividly.

The goal of a GAN is to obtain the most powerful possible discriminator ($D$) and the most powerful possible generator ($G$).

Suppose you need some method to imitate the rendered frames of a game: for example, you want to render a frame $\pmb{x}$ of xxx pixels (xxx dimensions) that follows a certain distribution (such as the frame shown when a character dies). There are two possible approaches:

  1. Disassemble the game program and work out what each line of code contributes to each rendered frame ("character models", "object movement", and so on), in the hope of perfectly modeling how the distribution of frames is generated.
  2. Define some number of variables (i.e., a variable of some number of dimensions), and assume that these dimensions jointly determine the distribution of the final data $\pmb{x}$ through some functional relationship.

The former resembles the "fit a likelihood function" approach mentioned in Related Work, i.e., a "trace it back to the source" method. It is highly interpretable: it can explain nicely how each parameter influences the generation of the final result. But it is hard to carry out, and a suitable likelihood function is hard to find.

The latter is more like the multilayer perceptron (MLP) approach. In theory, an MLP can fit the expression of any function, at the cost of poor interpretability. Although I do not know what mapping actually connects the game code to the frames, I reckon that a variable of some number of dimensions is enough to express the logic hidden behind the content.

It is just that I will not know what effect each of those dimensions ultimately has on the result $\pmb{x}$, or what each parameter specifically means.

To tidy things up: the variables in the paper can be interpreted as follows.


The distribution $p_g$ of the data $\pmb{x}$: this is what we ultimately want to obtain.

  • Generator $G(z; \theta_g)$
    • Input: initialization data $z$ drawn from the random noise $p_z(z)$
    • Parameters: $\theta_g$
    • Output: $\pmb{x}$ (e.g., an xxx-dimensional frame $\pmb{x}$)
    • An excellent generator $G$ generates, as far as possible,
      • $\pmb{x}$ that is closer to the real data
      • $p_g$ that is closer to the true distribution
  • Discriminator $D(\pmb{x}; \theta_d)$
    • Input: $\pmb{x}$
    • Parameters: $\theta_d$
    • Output: a scalar, the probability that $\pmb{x}$ comes from the real data rather than from the generated distribution $p_g$
      • The more likely $\pmb{x}$ is to come from the real data, the closer the output $D(\pmb{x})$ is to $1$
      • The more likely $\pmb{x}$ is to come from the generator, the closer the output $D(\pmb{x})$ is to $0$
    • An excellent discriminator $D$ determines, as far as possible, whether
      • $\pmb{x}$ comes from the generator, or
      • $\pmb{x}$ is a sample from the true distribution

Training a GAN means training $G$ and $D$ at the same time, expecting both to reach this standard of excellence.
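To make these roles concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code; the layer sizes and activations are arbitrary assumptions):

```python
import torch
import torch.nn as nn

noise_dim, data_dim, hidden = 64, 784, 256   # arbitrary sizes for illustration

# G(z; theta_g): maps noise z to a fake sample x
G = nn.Sequential(
    nn.Linear(noise_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, data_dim), nn.Sigmoid(),   # e.g. pixel values in [0, 1]
)

# D(x; theta_d): maps a sample x to a scalar in (0, 1), the probability
# that x came from the real data rather than from the generator
D = nn.Sequential(
    nn.Linear(data_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, 1), nn.Sigmoid(),
)

z = torch.randn(16, noise_dim)   # z ~ p_z(z), here a standard Gaussian prior
x_fake = G(z)                    # x = G(z), distributed according to p_g
print(D(x_fake).shape)           # torch.Size([16, 1]): one probability per sample
```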

To train a better $G$, the author proposes a measure, namely $\log(1 - D(G(z)))$, and requires this expression to be as small as possible. Let us analyze the expression carefully:

  • $z$ is the random initialized input
  • $G(z)$ is the output of the generator, which we hope is closer to a sample from the true distribution
    • i.e., we hope the generator $G$ produces fake data that looks more "real".
  • $D(G(z))$ is the discriminator $D$'s verdict on the generator's output $G(z)$, which we hope is closer to $1$
    • i.e., we hope the discriminator $D$ mistakes $G(z)$ for a sample drawn from the true distribution
  • Only when $D(G(z))$ is closer to $1$ does $\log(1 - D(G(z)))$ approach negative infinity ($-\infty$)
    • This is why the text says to "minimize $\log(1 - D(G(z)))$"

In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G, D)$:

$$\mathop{\min}\limits_{G} \mathop{\max}\limits_{D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

Translator's note: the formula writes $V(D, G)$ rather than $V(G, D)$; this appears to be a slip of the pen by the author.

  • In $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]$, $x$ is sampled from the distribution of real data
    • If the discriminator $D$ were perfect, it would recognize every such $x$ as a sample from the real distribution
    • $D(x)$ should then tend to $1$, so $\log D(x)$ should tend to $0$
    • The expectation should therefore tend to $0$
  • In $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$, $z$ is sampled from the random noise $p_z(z)$
    • If both the generator $G$ and the discriminator $D$ were perfect, $D$ would recognize every $G(z)$ as an output of the generator
    • $D(G(z))$ should then tend to $0$, so $\log(1 - D(G(z)))$ should also tend to $0$
    • The expectation should therefore also tend to $0$
  • $\mathop{\max}\limits_{D}$ expresses the hope that $D$ makes as few mistakes as possible, i.e., the value is maximized with respect to $D$
  • $\mathop{\min}\limits_{G}$ expresses the hope that $G$ causes $D$ to err as much as possible, i.e., the value is minimized with respect to $G$

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution when $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit.

For a less formal but more easily understood explanation of the approach, see Figure 1.

Figure 1: GANs are trained by simultaneously updating the discriminative distribution ($D$, blue dashed line) so that it distinguishes samples from the data generating distribution (black dotted line) $p_x$ from those of the generative distribution $p_g$ ($G$, green solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on the transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_g$. (a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{data}$ and $D$ is a partially accurate classifier. (b) In the inner loop of the algorithm, $D$ is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$. (c) After an update to $G$, the gradient of $D$ guides $G(z)$ to flow to regions that are more likely to be classified as data. (d) After several steps of training, if $G$ and $D$ have enough capacity, they reach a point at which $p_g = p_{data}$. The discriminator can then no longer distinguish the two distributions, i.e., $D(x) = \frac{1}{2}$.

In practice, the game must be implemented using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive and, on finite datasets, can lead to overfitting. Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. As long as $G$ changes slowly enough, $D$ is maintained near its optimal solution. This procedure is formally presented in Algorithm 1.

In practice, Equation 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(z)))$ saturates. Rather than training $G$ to minimize $\log(1 - D(G(z)))$, we can train $G$ to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
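The saturation is easy to see numerically (my own sketch; the logits below are hypothetical values of the discriminator's pre-sigmoid output for a fake sample):

```python
import torch

# a: the discriminator's logit for a fake sample, so D(G(z)) = sigmoid(a).
# A very negative logit means D confidently rejects the sample,
# which is the typical situation early in training.
for a0 in (-6.0, -2.0, 0.0):
    a = torch.tensor(a0, requires_grad=True)
    (g_sat,) = torch.autograd.grad(torch.log(1 - torch.sigmoid(a)), a)
    a = torch.tensor(a0, requires_grad=True)
    (g_ns,) = torch.autograd.grad(-torch.log(torch.sigmoid(a)), a)
    print(f"logit {a0:+.0f}: grad of log(1-D) = {g_sat:+.4f}, "
          f"grad of -log D = {g_ns:+.4f}")
# At logit -6, the saturating loss gives a gradient of about -0.0025
# (vanishing), while the non-saturating loss gives about -0.9975:
# same fixed point, much stronger early gradient.
```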

4 Theoretical results

The generator $G$ implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$. Therefore, given enough capacity and training time, we would like Algorithm 1 to converge to a good estimator of $p_{data}$. The results of this section are done in a non-parametric setting; e.g., we represent a model with infinite capacity by studying convergence in the space of probability density functions.

Algorithm 1: Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, $k$, is a hyperparameter. We used $k = 1$, the least expensive option, in our experiments.


  • for number of training iterations do

    • for $k$ steps do

      • Sample a minibatch of $m$ noise samples $\{z^{(1)}, \ldots, z^{(m)}\}$ from the noise prior $p_z(z)$.
      • Sample a minibatch of $m$ examples $\{x^{(1)}, \ldots, x^{(m)}\}$ from the data generating distribution $p_{data}(x)$.
      • Update the discriminator by ascending its stochastic gradient:

      $$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]$$

    • end for

    • Sample a minibatch of $m$ noise samples $\{z^{(1)}, \ldots, z^{(m)}\}$ from the noise prior $p_z(z)$.

    • Update the generator by descending its stochastic gradient:

    $$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)$$

  • end for

Gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
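Algorithm 1 maps almost line for line onto code. Below is a minimal PyTorch sketch (my own illustration, not the authors' released code; the architectures, learning rate, and the stand-in `sample_data` helper are assumptions):

```python
import torch
import torch.nn as nn

noise_dim, data_dim, hidden = 64, 784, 256
m, k = 128, 1                       # minibatch size; k = 1 as in the paper

G = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, data_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1), nn.Sigmoid())

# the paper's experiments used momentum; the learning rate here is a guess
opt_d = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)
opt_g = torch.optim.SGD(G.parameters(), lr=0.01, momentum=0.9)

def sample_data(m):
    # stand-in for minibatches from p_data(x); replace with a real data loader
    return torch.rand(m, data_dim)

for it in range(1000):                  # "for number of training iterations do"
    for _ in range(k):                  # "for k steps do"
        z = torch.randn(m, noise_dim)   # minibatch from the noise prior
        x = sample_data(m)              # minibatch from p_data(x)
        # ascend (1/m) sum [ log D(x) + log(1 - D(G(z))) ], via its negative
        loss_d = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    z = torch.randn(m, noise_dim)       # fresh noise minibatch
    # descend (1/m) sum log(1 - D(G(z))); in practice the non-saturating
    # -log D(G(z)) objective from Section 3 is often substituted here
    loss_g = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In a real implementation, a small epsilon inside the logs improves numerical stability.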

4.1 Global optimality of $p_g = p_{data}$

We first consider the optimal discriminator $D$ for any given generator $G$.

Proposition 1. For $G$ fixed, the optimal discriminator $D$ is
$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{2}$$

Proof: Given any generator $G$, the training criterion for the discriminator $D$ is to maximize the quantity $V(G, D)$:

$$\begin{aligned} V(G, D) &= \int_x p_{data}(x) \log(D(x))\,dx + \int_z p_z(z) \log(1 - D(g(z)))\,dz \\ &= \int_x p_{data}(x) \log(D(x)) + p_g(x) \log(1 - D(x))\,dx \end{aligned} \tag{3}$$

For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $Supp(p_{data}) \cup Supp(p_g)$, concluding the proof.
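The maximizer $\frac{a}{a+b}$ is easy to verify numerically (my own sketch):

```python
import numpy as np

ys = np.linspace(1e-6, 1 - 1e-6, 100001)
for a, b in [(1.0, 1.0), (3.0, 1.0), (0.2, 0.8)]:
    vals = a * np.log(ys) + b * np.log(1 - ys)
    print(f"a={a}, b={b}: argmax ~ {ys[np.argmax(vals)]:.4f}, "
          f"a/(a+b) = {a / (a + b):.4f}")
# With a = p_data(x) and b = p_g(x), the maximizer is exactly
# D*(x) = p_data(x) / (p_data(x) + p_g(x)) from Equation 2.
```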

Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y = y \mid x)$, where $Y$ indicates whether $x$ comes from $p_{data}$ (with $y = 1$) or from $p_g$ (with $y = 0$). The minimax game in Equation 1 can now be reformulated as:

$$\begin{aligned} C(G) &= \max_{D} V(G, D) \\ &= \mathbb{E}_{x \sim p_{data}}[\log D^*_{G}(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^*_{G}(G(z)))] \\ &= \mathbb{E}_{x \sim p_{data}}[\log D^*_{G}(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_{G}(x))] \\ &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \end{aligned} \tag{4}$$

Theorem 1. The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_g = p_{data}$. At that point, $C(G)$ achieves the value $-\log 4$.

Proof: For $p_g = p_{data}$, $D^*_G(x) = \frac{1}{2}$ (consider Equation 2). Hence, by inspecting Equation 4 at $D^*_G(x) = \frac{1}{2}$, we find $C(G) = \log \frac{1}{2} + \log \frac{1}{2} = -\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_g = p_{data}$, observe that

$$\mathbb{E}_{x \sim p_{data}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4$$

and that by subtracting this expression from $C(G) = V(D^*_G, G)$, we obtain:

$$C(G) = -\log(4) + \text{KL}\left(p_{data} \,\middle\|\, \frac{p_{data} + p_g}{2}\right) + \text{KL}\left(p_g \,\middle\|\, \frac{p_{data} + p_g}{2}\right) \tag{5}$$

where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process:

$$C(G) = -\log(4) + 2 \cdot \text{JSD}(p_{data} \parallel p_g) \tag{6}$$

Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero only when they are equal, we have shown that $C^* = -\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_g = p_{data}$, i.e., the generative model perfectly replicating the data generating process.
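For a toy discrete distribution, a few lines of NumPy (my own sketch, not from the paper) confirm that $C(G) = -\log 4$ exactly when $p_g = p_{data}$ and is strictly larger otherwise:

```python
import numpy as np

def C(p_data, p_g):
    # C(G) = E_{x~p_data}[log D*(x)] + E_{x~p_g}[log(1 - D*(x))], Equation 4,
    # evaluated exactly for discrete distributions
    d_star = p_data / (p_data + p_g)
    return np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

p_data = np.array([0.2, 0.5, 0.3])
print(C(p_data, p_data), -np.log(4))        # both -1.3863: the global minimum
print(C(p_data, np.array([0.4, 0.4, 0.2]))) # strictly greater than -log 4
```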

4.2 Convergence of Algorithm 1

Proposition 2. If $G$ and $D$ have enough capacity, and at each step of Algorithm 1 the discriminator $D$ is allowed to reach its optimum given $G$, and $p_g$ is updated so as to improve the criterion
$$\mathbb{E}_{x \sim p_{data}}[\log D^*_{G}(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_{G}(x))]$$

then $p_g$ converges to $p_{data}$.

Proof: Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$, as done in the above criterion. Note that $U(p_g, D)$ is convex in $p_g$. The subderivatives of the supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. This is equivalent to computing a gradient descent update for $p_g$ at the optimal $D$ given the corresponding $G$. $\sup_D U(p_g, D)$ is convex in $p_g$ with a unique global optimum, as proven in Theorem 1; therefore, with sufficiently small updates of $p_g$, $p_g$ converges to $p_x$, concluding the proof.

In practice, adversarial nets represent a limited family of $p_g$ distributions via the function $G(z; \theta_g)$, and we optimize $\theta_g$ rather than $p_g$ itself. Using a multilayer perceptron to define $G$ introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

5 Experiments

We trained adversarial nets on a range of datasets, including MNIST [21], the Toronto Face Database (TFD) [27], and CIFAR-10 [19]. The generator network used a mixture of rectified linear activations [17, 8] and sigmoid activations, while the discriminator network used maxout [9] activations. Dropout [16] was applied in training the discriminator network. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.

We estimate the probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to the samples generated by $G$ and reporting the log-likelihood under this distribution. The $\sigma$ parameter of the Gaussians was obtained by cross-validation on the validation set. This procedure was introduced in the work of Breuleux et al. [7] and has been used for various generative models for which the exact likelihood is not tractable [24, 3, 4]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high-dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but cannot directly estimate likelihood motivate further research into how to evaluate such models.
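The Parzen window estimate itself is simple: place an isotropic Gaussian at each generated sample and average the densities. Below is a sketch of the idea (mine, not the authors' released evaluation code; `parzen_log_likelihood` is a hypothetical helper, and the log-sum-exp keeps it numerically stable):

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, gen_x, sigma):
    """Mean log-likelihood of test_x under a Gaussian Parzen window
    fit to generated samples gen_x; both arrays have shape [n, d]."""
    n, d = gen_x.shape
    # squared distances between every test point and every generated sample
    sq = ((test_x[:, None, :] - gen_x[None, :, :]) ** 2).sum(-1)
    # log of (1/n) * sum_i N(test | gen_i, sigma^2 I), via log-sum-exp
    log_p = (logsumexp(-sq / (2 * sigma**2), axis=1)
             - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma**2))
    return log_p.mean()

# sigma would be chosen by cross-validation on a validation set, as in the paper
```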

Table 1: Parzen window-based log-likelihood estimates. The numbers reported on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold; σ was cross-validated on each fold, and the mean log-likelihood on each fold was computed. For MNIST, we compare against other models of the real-valued (rather than binary) version of the dataset.

In Figures 2 and 3, we show samples drawn from the generator network after training. While we make no claim that these samples are better than those generated by existing methods, we believe they are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.

Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example of the neighboring sample, to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distribution, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated, because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)

Figure 3: Digits obtained by linearly interpolating between coordinates in z-space of the full model.

6 Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The main disadvantages are that there is no explicit representation of $p_g(x)$, and that $D$ must be kept well synchronized with $G$ during training (in particular, $G$ must not be trained too much without updating $D$, in order to avoid "the Helvetica scenario" in which $G$ collapses too many values of $z$ to the same value of $x$ to have enough diversity to model $p_{data}$), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, that only backpropagation is used to obtain gradients, that no inference is needed during learning, and that a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of GANs with other generative modeling approaches.

Table 2: Challenges in Generative Modeling: A summary of the main operational difficulties encountered by different approaches in deep generative modeling.

The above advantages are primarily computational. Adversarial models may also gain some statistical advantage from the fact that the generator network is not updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry so that the chains can mix between modes.

7 Conclusions and future work

The framework allows many straightforward extensions:

  1. A conditional generative model $p(x \mid c)$ can be obtained by adding $c$ as an input to both $G$ and $D$ (a minimal sketch follows this list).
  2. Learned approximate inference can be performed by training an auxiliary network to predict $z$ given $x$. This is similar to the inference net trained by the wake-sleep algorithm [15], but with the advantage that the inference net may be trained against a fixed generator net after the generator net has finished training.
  3. One can approximately model all conditionals $p(x_S \mid x_{\not S})$, where $S$ is a subset of the indices of $x$, by training a family of conditional models that share parameters. Essentially, adversarial nets can be used to implement a stochastic extension of the deterministic MP-DBM [10].
  4. Semi-supervised learning: features from the discriminator or inference net could improve the performance of classifiers when limited labeled data is available.
  5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating $G$ and $D$, or by determining better distributions from which to sample $z$ during training.
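For extension 1, the conditioning amounts to concatenating $c$ onto the inputs of both networks. A minimal sketch (my own illustration; the sizes and one-hot labels are assumptions):

```python
import torch
import torch.nn as nn

noise_dim, cond_dim, data_dim, hidden = 64, 10, 784, 256   # arbitrary sizes

# c is concatenated onto each network's input, giving p(x | c)
G = nn.Sequential(nn.Linear(noise_dim + cond_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, data_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(data_dim + cond_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1), nn.Sigmoid())

z = torch.randn(16, noise_dim)
c = torch.eye(cond_dim)[torch.randint(0, cond_dim, (16,))]  # e.g. one-hot labels
x_fake = G(torch.cat([z, c], dim=1))        # sample from p(x | c)
p_real = D(torch.cat([x_fake, c], dim=1))   # D also sees the condition c
```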

This paper has demonstrated the feasibility of an adversarial modeling framework, suggesting that these research directions may be useful.

Acknowledgments

We would like to thank Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain, and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [11] and Theano [6, 1], particularly Frédéric Bastien, who rushed a Theano feature specifically in support of this project. Arnaud Bergeron provided much-needed support with LATEX typesetting. We would also like to thank CIFAR and the Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources. Ian Goodfellow was supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.


References

  1. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
  2. Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
  3. Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML’13.
  4. Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML’14.
  5. Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML’14).
  6. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
  7. Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.
  8. Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.
  9. Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In ICML’2013.
  10. Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS’2013.
  11. Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.
  12. Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In ICML’2014.
  13. Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10).
  14. Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
  15. Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
  16. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
  17. Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153. IEEE.
  18. Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
  19. Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  20. Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS’2012.
  21. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  22. Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. Technical report, arXiv preprint arXiv:1402.0030.
  23. Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082.
  24. Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML’12.
  25. Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–455.
  26. Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6), 863–879.
  27. Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.
  28. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014). Intriguing properties of neural networks. ICLR, abs/1312.6199.
  29. Tu, Z. (2007). Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE.

  1. All code and hyperparameters are available at http://www.github.com/goodfeli/adversarial.

Source: blog.csdn.net/I_am_Tony_Stark/article/details/132199157