GAN Collection

Generative Adversarial Nets

NIPS 2014

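Problem

Earlier work runs into many intractable probabilistic computations in maximum likelihood estimation and approximate inference (related strategies)
"and due to difficulty of leveraging the benefits of piecewise linear units in the generative context" (I do not fully understand this part; it is probably related to training and backpropagation)

This paper proposes a new generative model that sidesteps these difficulties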

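Method

Train two networks
- A generative model: passes random noise through a multilayer perceptron to generate fake samples
- A discriminative model: also a multilayer perceptron, trained on both the generated fake samples and real samples
- Both models can be trained with backpropagation and dropout; the generative model needs only forward propagation to produce samples
- Neither approximate inference nor Markov chains are required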
Objective:
$\min_G\max_D V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_{z}(z)}[\log (1-D(G(z)))]\tag{1}$
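- $D(x)$ is the output of D for real data $x$
- $D(G(z))$ is the output of D for generated fake data $G(z)$
- The discriminator wants to maximize the objective, pushing $D(x)$ toward 1 (real) and $D(G(z))$ toward 0 (fake)
- The generator wants to minimize the objective, pushing $D(G(z))$ toward 1 (real)
- Early in training G is poor, and D rejects generated samples with high confidence because they clearly differ from the real data, so $\log (1-D(G(z)))$ saturates
- We therefore train G to maximize $\log D(G(z))$ (that is, to maximize the probability that D makes a mistake) rather than to minimize $\log (1-D(G(z)))$
- This objective leads to the same fixed point of the dynamics of G and D, but provides much stronger gradients early in training (also covered in the CS231n 2017 course)
- The global optimum of this minimax problem is $p_g=p_{data}$, at which point D cannot tell the training data distribution from the generated distribution, i.e. $D(x)=\frac{1}{2}$. Section 4.1 of the paper gives the proof: for a fixed G, differentiate the objective with respect to $D(x)$, set the derivative to zero, and solve to get $D_G^*(x)=\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}$
- Substituting $D_G^*$ into (1) gives: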
$C(G)=\max_D V(G,D)$

$=\mathbb{E}_{x\sim p_{data}(x)}[\log D^*_G(x)]+\mathbb{E}_{z\sim p_{z}}[\log (1-D^*_G(G(z)))]$

$=\mathbb{E}_{x\sim p_{data}(x)}[\log D^*_G(x)]+\mathbb{E}_{x\sim p_{g}}[\log (1-D^*_G(x))]$

$=\mathbb{E}_{x\sim p_{data}(x)}[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}]+\mathbb{E}_{x\sim p_{g}}[\log \frac{p_{g}(x)}{p_{data}(x)+p_g(x)}]$
- $C(G)$ reaches its global minimum if and only if $p_g=p_{data}$, and the minimum value is $-\log 4$
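As a quick numerical check of the optimal discriminator and the value $-\log 4$, the sketch below evaluates $D_G^*$ and $C(G)$ on small discrete distributions. The function names and the toy distributions are illustrative assumptions and do not come from the paper.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*_G(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x.

    Assumes every outcome has nonzero mass under at least one distribution.
    """
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def c_of_g(p_data, p_g):
    """C(G) = E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))]."""
    d_star = optimal_discriminator(p_data, p_g)
    total = 0.0
    for pd, pg, d in zip(p_data, p_g, d_star):
        if pd > 0:
            total += pd * math.log(d)
        if pg > 0:
            total += pg * math.log(1.0 - d)
    return total

# Toy distributions over four outcomes (made up for illustration).
p_data = [0.1, 0.2, 0.3, 0.4]
matched = c_of_g(p_data, p_data)                    # p_g equals p_data
mismatched = c_of_g(p_data, [0.4, 0.3, 0.2, 0.1])   # p_g differs from p_data
```

With matching distributions every $D_G^*(x)$ is 0.5, so `matched` equals $-\log 4$, while any mismatch gives a strictly larger value.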
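- Translated into Python (TensorFlow):
```python
import tensorflow as tf

# d_logits_real / d_logits_fake: discriminator logits for real / generated
# images; smooth: label-smoothing factor. All three are defined elsewhere.

# Discriminator loss on real images (target 1, smoothed to 1 - smooth)
d_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_real, labels=tf.ones_like(d_logits_real) * (1 - smooth)))

# Discriminator loss on generated images (target 0)
d_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))

# Total discriminator loss
d_loss = tf.add(d_loss_real, d_loss_fake)

# Generator loss: push D's output on generated images toward 1
g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.ones_like(d_logits_fake) * (1 - smooth)))
```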
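The generator loss above targets label 1 on fake samples, which corresponds to maximizing $\log D(G(z))$. The sketch below compares the gradient of that loss with the gradient of the original $\log (1-D(G(z)))$ loss, both taken with respect to the discriminator logit. It assumes $D(G(z))=\sigma(a)$ for a logit $a$; the function names and the logit value are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# Early in training D confidently rejects fakes, so the logit is very negative.
logit = -6.0
g_sat = grad_saturating(logit)      # close to 0: almost no learning signal
g_ns = grad_non_saturating(logit)   # close to -1: strong learning signal
```

When D is confident that a sample is fake, the saturating loss yields a gradient near zero, while the non-saturating loss keeps a gradient of magnitude close to one.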
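Pros and cons
- Disadvantages: $p_g(x)$ is represented only implicitly, and D must be kept well synchronized with G during training
- Advantages: no Markov chains are needed, gradients are obtained by backpropagation alone, no inference is required during learning, and a wide variety of functions can be incorporated into the model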

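Takeaways

Knowing how to reformulate the objective function to suit the practical situation is important
Maximum likelihood estimation is equivalent to minimizing the KL divergence between the data distribution and the model distribution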
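The equivalence between maximum likelihood and KL minimization can be seen in one line. The notation $p_\theta$ for the model distribution is an assumption introduced here for illustration:

$KL(p_{data}\,\|\,p_\theta)=\mathbb{E}_{x\sim p_{data}}[\log p_{data}(x)]-\mathbb{E}_{x\sim p_{data}}[\log p_\theta(x)]$

The first term does not depend on $\theta$, so

$\arg\min_\theta KL(p_{data}\,\|\,p_\theta)=\arg\max_\theta \mathbb{E}_{x\sim p_{data}}[\log p_\theta(x)]$

and the right-hand side is the maximum likelihood objective in expectation.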

References

Generative models, maximum likelihood, and KL divergence (生成模型、最大化似然、KL散度)

https://arxiv.org/abs/1406.2661

Conditional Generative Adversarial Nets

arXiv 1411

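Problem

An unconditional generative model offers no control over the modes of the data it generates
This method feeds additional information to the model as a conditioning variable, which can direct the data generation of G
- The conditioning variable can be a class label
- Some part of the data to be inpainted
- Or even data from a different modality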

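Method

Introduce conditional probabilities to control generation
$\min_G\max_D V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x|y)]+\mathbb{E}_{z\sim p_{z}(z)}[\log (1-D(G(z|y)))]$
The noise z and the condition y are fed into the generator together and combined into a joint hidden representation, which is then mapped to data space by nonlinear functions
The data x and the condition y are fed into the discriminator together and combined into a joint hidden representation, from which it estimates the probability that x is real training data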
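One simple way to feed z and y to the generator together is to concatenate the noise vector with a one-hot encoding of the label. This is only a sketch of that idea: the function names, the Gaussian noise, and the dimensions are assumptions, and the paper's own architecture may combine the two inputs differently.

```python
import random

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def generator_input(noise_dim, label, num_classes, rng):
    """Build the joint input [z, y] for a conditional generator."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # noise part
    y = one_hot(label, num_classes)                      # condition part
    return z + y

rng = random.Random(0)
g_in = generator_input(noise_dim=100, label=3, num_classes=10, rng=rng)
```

The discriminator input can be built the same way by concatenating the data vector x with the same one-hot y.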

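Takeaways

It is worth trying to combine knowledge from other related disciplines
Any probabilistic model can be controlled through conditional probabilities

References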

https://arxiv.org/abs/1411.1784

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

NIPS 2016

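Problem

In the original GAN the only input to G is the noise z, so z carries all the information needed to generate a sample
The original GAN places no constraint on how G uses z, so the individual dimensions of z in a trained G do not correspond well to semantic features
- even though much data can be decomposed into semantically meaningful factors

InfoGAN is proposed as a way to discover these latent, meaningful factors; it adds negligible computational cost over a GAN and is easy to train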

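Method

The idea is to split the noise input so as to steer what the GAN learns and make the learned representation more interpretable

Add a new latent variable c that has high mutual information with the generated samples
- The generator can then be written as $G(z,c)$
- c represents some semantic aspect of the data, and z represents the remaining information in sample x that is unrelated to c
- The original G is free to treat c and x as independent, i.e. $P_G(x|c)=P_G(x)$
The paper therefore proposes an information-theoretic regularization term
The mutual information between the latent variable c and the generated samples, $I(c;G(z,c))$, should be high
- $I$ is mutual information: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
- $I(X;Y)$ can be read as the reduction of uncertainty in $X$ when $Y$ is observed
- If $X$ and $Y$ are independent, then $I(X;Y)=0$
- If $X$ and $Y$ are strongly dependent, then $I(X;Y)$ is large
- So $I(c;G(z,c))$ can be increased by making the entropy of the posterior $P_G(c|x)$ small
- A regularization term $I(c;G(z,c))$ is therefore added to the original GAN loss
- $\min_G\max_D V_I(D,G)=V(D,G)-\lambda I(c;G(z,c))$
- Section 5 of the paper gives the proof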
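The two limiting cases of mutual information listed above can be checked numerically on a discrete joint distribution. This is a minimal sketch; the function name and the 2x2 toy tables are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y exactly
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives zero, and the fully dependent table gives $\log 2$, which is the entropy of one fair binary variable.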
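InfoGAN can be viewed as three networks
- a generator $x=G(c,z)$
- a real/fake discriminator $y_1=D_1(x)$
- a network that predicts the code c, $y_2=D_2(x)$ (when c represents class information, its last layer is a softmax layer)
- $D_1$ and $D_2$ share all network parameters except the last layer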

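Takeaways

It is worth trying to combine knowledge from other related disciplines
When designing your own loss function, consider adding some form of regularization term
InfoGAN can be used to discover structure in high-dimensional data

References

InfoGAN study notes (InfoGAN学习笔记)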

https://arxiv.org/abs/1606.03657

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

NIPS'15

Problem

Building a good generative model of natural images has been a fundamental problem within computer vision. However, images are complex and high dimensional, making them hard to model well, despite extensive efforts.
In other words, generative models struggle to produce high-resolution images

Method

This work builds on the NIPS 2014 GAN and proposes the LAPGAN model, which integrates a conditional form of the GAN model into the framework of a Laplacian pyramid.
G learns realistic high-frequency structure
The models at each level are trained independently
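To make the pyramid idea concrete, the sketch below builds and inverts a Laplacian-style pyramid on a 1-D signal. It is a simplification: pair averaging and value repetition stand in for the blur-based down/upsampling used on images, and all names are illustrative. The residuals at each level are the high-frequency structure that a per-level generator would be asked to produce.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each value."""
    return [v for v in signal for _ in range(2)]

def build_pyramid(signal, levels):
    """Return [h_0, ..., h_{levels-1}, low_res]; h_k is the residual at level k."""
    pyramid = []
    current = signal
    for _ in range(levels):
        low = downsample(current)
        residual = [c - u for c, u in zip(current, upsample(low))]
        pyramid.append(residual)
        current = low
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert build_pyramid: upsample the coarse signal and add residuals back."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = [u + r for u, r in zip(upsample(current), residual)]
    return current

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
pyramid = build_pyramid(signal, levels=2)
restored = reconstruct(pyramid)
```

Because each level only has to model a residual conditioned on the coarser signal, the levels can be handled separately, which matches the note that the models at each level are trained independently.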

Takeaways

The hierarchical structure of the Laplacian pyramid

References

https://papers.nips.cc/paper/5773-deep-generative-image-models-using-a-laplacian-pyramid-of-adversarial-networks

Improved Techniques for Training GANs

NIPS’16

Problem

Training GANs requires finding a Nash equilibrium of a non-convex game with continuous, high-dimensional parameters.
GANs are typically trained using gradient descent techniques that are designed to find a low value of a cost function, rather than to find the Nash equilibrium of a game. When used to seek a Nash equilibrium, these algorithms may fail to converge.

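Method

Feature matching
- Give G a new objective:
- $||\mathbb{E}_{x\sim p_{data}}f(x) - \mathbb{E}_{z\sim p_z(z)}f(G(z))||_2^2$
- Instead of directly maximizing the output of D, the new objective has G generate data that matches the statistics of the real data
- D is trained as before
- $f(x)$ denotes the activations of an intermediate layer of D; the new objective minimizes the squared L2 distance between the expected values of $f(x)$ and $f(G(z))$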
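The feature matching objective can be estimated on mini-batches as the squared distance between mean feature vectors. In this sketch the feature vectors are given directly as lists; in practice they would be intermediate activations of D. The function names and the toy numbers are assumptions.

```python
def mean_features(batch):
    """Average a batch of feature vectors element-wise."""
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]

def feature_matching_loss(real_features, fake_features):
    """|| E[f(x)] - E[f(G(z))] ||_2^2 estimated on two mini-batches."""
    mu_real = mean_features(real_features)
    mu_fake = mean_features(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))

# Toy features f(x) for two real samples and f(G(z)) for two fake samples.
real = [[1.0, 2.0], [3.0, 4.0]]
fake = [[0.0, 2.0], [2.0, 2.0]]
loss = feature_matching_loss(real, fake)
```

The loss is zero whenever the two batches have the same mean features, even if the individual samples differ.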
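Minibatch discrimination
- Do not look at samples in isolation; give D information about the relationships between samples in a minibatch
- See Section 3.2 of the paper for the exact formulas
Historical averaging
- Add a regularization term
- $||\theta - \frac{1}{t}\sum_{i=1}^t\theta[i]||^2$
One-sided label smoothing
- Replaces the 0 and 1 targets for a classifier with smoothed values, like .9 or .1
- With positive targets $\alpha$ and negative targets $\beta$, the optimal discriminator becomes $D(x)=\frac{\alpha p_{data}(x) + \beta p_{model}(x)}{p_{data}(x) + p_{model}(x)}$
- One-sided means that only the positive targets are smoothed to $\alpha$, while the negative targets stay at 0
Virtual batch normalization
- A reference batch is drawn from the training set and fixed before training starts; the mean and variance of this batch are then used to normalize the other batches during training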
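The reference-batch idea can be sketched for a single scalar feature as follows. This is a simplified illustration of the description above, not the paper's exact procedure (the paper's version also includes the current example in the statistics, which is omitted here); the function names and values are assumptions.

```python
import math

def batch_stats(batch):
    """Mean and (biased) variance of a batch of scalar activations."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return mean, var

def normalize_with_reference(batch, ref_mean, ref_var, eps=1e-5):
    """Normalize a batch using statistics from the fixed reference batch."""
    return [(v - ref_mean) / math.sqrt(ref_var + eps) for v in batch]

# The reference batch is chosen once and never changes during training.
reference_batch = [1.0, 2.0, 3.0, 4.0]
ref_mean, ref_var = batch_stats(reference_batch)

# Any later batch is normalized with the reference statistics, so the output
# for one example does not depend on the other examples in its own batch.
normalized = normalize_with_reference([2.5, 5.0], ref_mean, ref_var)
```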

References

http://papers.nips.cc/paper/6125-improved-techniques-for-training-gan

Generative Visual Manipulation on the Natural Image Manifold

ECCV'16

Problem

First, the generated images, while good, are still not quite photo-realistic (plus there are practical issues in making them high resolution).
Second, these generative models are set up to produce images by sampling a latent vector-space, typically at random.

Method

Given a real photo, we first project it onto our approximation of the image manifold by finding the closest latent feature vector z of the GAN to the original image.
Then, we present a realtime method for gradually and smoothly updating the latent vector z so that it generates a desired image that both satisfies the user's edits (e.g. a scribble or a warp; more details in Section 5) and stays close to the natural image manifold.
Unfortunately, in this transformation the generative model usually loses some of the important low-level details of the input image. We therefore propose a dense correspondence method that estimates both per-pixel color and shape changes from the edits applied to the generative model.
We then transfer these changes to the original photo using an edge-aware interpolation technique and produce the final manipulated result.

References

https://arxiv.org/abs/1609.03552v2

Generative Adversarial Text to Image Synthesis

ICML’16

Problem

First, learn a text feature representation that captures the important visual details;
and second, use these features to synthesize a compelling image that a human might mistake for real.
- In other words, synthesize realistic-looking images

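Method

Our approach is to train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features encoded by a hybrid character-level convolutional-recurrent neural network.
- That is, a DC-GAN conditioned on encoded text features
- a hybrid character-level convolutional-recurrent neural network
- The image classifier is GoogLeNet, and the text classifier uses an LSTM and a CNN; see [Learning Deep Representations of Fine-Grained Visual Descriptions]
Once the text features are obtained, they are compressed and concatenated with the image features before being fed into the DC-GAN
Building on the DC-GAN model, the paper proposes the following two improvements
1. GAN-CLS
2. GAN-INT
  - When generating images from descriptions, the relatively small number of text descriptions is an important factor limiting synthesis quality, so the paper generates a large number of additional text embeddings by simple interpolation
$\mathbb{E}_{t_1,t_2\sim p_{data}}[\log(1-D(G(z,\beta t_1+(1-\beta)t_2)))]\tag{1}$
Because these interpolated text vectors are synthetic, there may be no real images that correspond to them. If the model performs well, it can fill in many of the gaps between texts.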
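The interpolation inside the GAN-INT objective is a convex combination of two text embeddings. A minimal sketch follows; the embedding values and the choice $\beta=0.5$ are made up for illustration.

```python
def interpolate(t1, t2, beta):
    """Return beta * t1 + (1 - beta) * t2 element-wise."""
    return [beta * a + (1.0 - beta) * b for a, b in zip(t1, t2)]

# Hypothetical 3-dimensional text embeddings of two captions.
t1 = [1.0, 0.0, 2.0]
t2 = [0.0, 4.0, 2.0]
t_mix = interpolate(t1, t2, beta=0.5)  # a new, synthetic text embedding
```

The generator is then conditioned on `t_mix` in place of an embedding of a real caption, which adds training signal without new annotations.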
The noise z is used to capture style information
- $L_{style} = \mathbb{E}_{t,z \sim N(0,1)}\Vert z-S(G(z,\phi(t))) \Vert_2^2\tag{2}$, where $S$ is the style encoder network

References

Learning Deep Representations of Fine-Grained Visual Descriptions

Reading notes on "Generative Adversarial Text to Image Synthesis" (《Generative Adversarial Text to Image Synthesis》阅读笔记)

https://arxiv.org/abs/1605.05396v2

Wasserstein GAN

arXiv:1701

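Problem

Optimizing the original GAN objective is equivalent to minimizing the JS divergence between the real distribution $P_r$ and the generated distribution $P_g$
Because $P_r$ and $P_g$ almost never have non-negligible overlap, the JS divergence is the constant $\log 2$ no matter how far apart they are, so the generator's gradient is (approximately) zero and vanishes.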

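Method

Introduce the Wasserstein distance, also called the Earth-Mover (EM) distance
- $W(P_r, P_g) = \inf_{\gamma \in \Pi (P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma} [||x - y||]$
The advantage of the Wasserstein distance over the KL and JS divergences is that it still reflects how far apart two distributions are even when they do not overlap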
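The contrast between the two measures shows up already for two point masses on a 1-D grid. The sketch below computes the JS divergence and the 1-D Wasserstein-1 distance (via cumulative distribution differences on a unit-spaced grid). The function names and the grid are illustrative assumptions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (nats) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """W1 on a unit-spaced 1-D grid: sum of absolute CDF differences."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """All probability mass on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]

p_r = point_mass(0)
near, far = point_mass(2), point_mass(7)

js_near, js_far = js_divergence(p_r, near), js_divergence(p_r, far)
w_near, w_far = wasserstein_1d(p_r, near), wasserstein_1d(p_r, far)
```

Both JS values equal $\log 2$ regardless of distance, while the Wasserstein distance grows from 2 to 7, so only the latter tells the generator which direction is closer.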

References

The Amazing Wasserstein GAN (令人拍案叫绝的Wasserstein GAN)

A Tour of Wasserstein GAN (带你漫游 Wasserstein GAN)

https://arxiv.org/abs/1701.07875

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

ICCV'17

Problem

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available.
In other words, paired training data is lacking

Method

Train two generators and two discriminators
Two cycles: $x \to G(x) \to F(G(x)) \approx x$ and $y \to F(y) \to G(F(y)) \approx y$
Adversarial Loss:
- $L_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{y\sim p_{data}(y)}[\log{D_Y(y)}]+\mathbb{E}_{x\sim p_{data}(x)}[\log{(1-D_Y(G(x)))}]$
- $L_{GAN}(F,D_X,Y,X)=\mathbb{E}_{x\sim p_{data}(x)}[\log{D_X(x)}]+\mathbb{E}_{y\sim p_{data}(y)}[\log{(1-D_X(F(y)))}]$
Cycle Consistency Loss:
- $L_{cyc}(G,F) = \mathbb{E}_{x \sim p_{data}(x)}[||F(G(x))-x||_1]+\mathbb{E}_{y \sim p_{data}(y)}[||G(F(y))-y||_1]$
Full Objective:
- $L(G,F,D_X,D_Y)=L_{GAN}(G,D_Y,X,Y)+L_{GAN}(F,D_X,Y,X)+\lambda L_{cyc}(G,F)$
The paper notes that training with the log-likelihood loss is not very stable, so it is replaced by a least-squares loss: G is trained to minimize $\mathbb{E}_{x\sim p_{data}(x)}[(D_Y(G(x))-1)^2]$, and $D_Y$ is trained to minimize
- $L_{LSGAN}(D_Y)=\mathbb{E}_{y\sim p_{data}(y)}[(D_Y(y)-1)^2]+\mathbb{E}_{x\sim p_{data}(x)}[D_Y(G(x))^2]$
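The cycle consistency term can be illustrated with toy "translators" on short vectors. This is only a sketch: the shift functions standing in for G and F are hypothetical, and real CycleGAN generators are convolutional networks on images.

```python
def l1_norm(a, b):
    """L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cycle_consistency_loss(G, F, xs, ys):
    """Mini-batch estimate of E_x||F(G(x)) - x||_1 + E_y||G(F(y)) - y||_1."""
    forward = sum(l1_norm(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_norm(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward

# Hypothetical toy mappings: G shifts by +1, F shifts by -1 (exact inverses).
def G(v):
    return [e + 1.0 for e in v]

def F(v):
    return [e - 1.0 for e in v]

def F_bad(v):
    # Not an inverse of G: leaves a residual shift of 0.5 after a round trip.
    return [e - 0.5 for e in v]

xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 2.0], [3.0, 4.0]]
loss_good = cycle_consistency_loss(G, F, xs, ys)
loss_bad = cycle_consistency_loss(G, F_bad, xs, ys)
```

The loss vanishes only when the two mappings undo each other, which is what constrains the otherwise under-determined unpaired translation.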

References

CycleGAN

CycleGAN (Hung-yi Lee, 李宏毅)

http://openaccess.thecvf.com/content_ICCV_2017/papers/Zhu_Unpaired_Image-To-Image_Translation_ICCV_2017_paper.pdf

Person Transfer GAN to Bridge Domain Gap for Person Re-Identification

arXiv:1711

Problem

We also observe that, domain gap commonly exists between datasets, which essentially causes severe performance drop when training and testing on different datasets. This results in that available training data cannot be effectively leveraged for new testing domains. To relieve the expensive costs of annotating new training samples, we propose a Person Transfer Generative Adversarial Network (PTGAN) to bridge the domain gap.

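Method

The authors propose a GAN designed for the ReID problem: PTGAN
Its defining feature is that it transfers the background domain while keeping the person foreground as unchanged as possible
The PTGAN loss consists of two parts
- $L_{PTGAN} = L_{Style}+ \lambda_1L_{ID}$
- $L_{Style}$ is the style loss (or domain loss) of the generated image, measuring whether it looks like the style of the new dataset
- This is the standard CycleGAN objective (adversarial losses plus cycle consistency loss)
- $L_{Style} = L_{GAN}(G,D_B,A,B)+L_{GAN}( \overline{G},D_A,B,A )+\lambda_2L_{Cyc}(G,\overline{G})$
To keep the foreground unchanged during transfer, the paper proposes the $L_{ID}$ loss, which uses a foreground mask extracted with PSPNet
- $L_{ID}= \mathbb{E}_{a \sim p_{data}(a)}[|| (G(a)-a) \odot M(a) ||_2] + \mathbb{E}_{b \sim p_{data}(b)}[|| (\overline{G}(b)-b) \odot M(b) ||_2]$
- Here $M(a)$ and $M(b)$ are the two segmented foreground masks; the ID loss constrains the person foreground to stay as unchanged as possible during transfer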
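A single term of the ID loss can be sketched on a flattened toy image as follows. The pixel values and the binary mask are made up for illustration; in the paper the mask comes from a segmentation network.

```python
import math

def masked_l2(generated, original, mask):
    """|| (G(a) - a) * M(a) ||_2 over flattened pixels (elementwise product)."""
    return math.sqrt(sum(((g - o) * m) ** 2
                         for g, o, m in zip(generated, original, mask)))

original = [0.2, 0.8, 0.5, 0.1]            # toy 4-pixel image a
mask = [0.0, 1.0, 1.0, 0.0]                # 1 marks person foreground
background_only = [0.9, 0.8, 0.5, 0.7]     # transfer changed only background
foreground_changed = [0.2, 0.5, 0.9, 0.1]  # transfer altered the person

loss_bg = masked_l2(background_only, original, mask)
loss_fg = masked_l2(foreground_changed, original, mask)
```

Changes outside the mask cost nothing, so the generator stays free to restyle the background while being penalized for altering the person.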

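Takeaways

Performance on the MSMT17 dataset still has room for improvement (the dataset is not yet public)
A new approach to transferring between datasets

References

PTGAN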

https://arxiv.org/abs/1711.08565

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

ICLR 2016

Problem

GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs.

DCGAN (Deep Convolutional Generative Adversarial Networks) addresses this problem

It also learns many feature representations automatically and generates realistic images

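Method

Combine CNNs with GANs by replacing G and D with two CNNs
- Remove all pooling layers. G uses transposed convolutional layers for upsampling, and D uses strided convolutions in place of pooling
- Use batch normalization (BN) in both D and G
- Remove FC layers so that the network becomes fully convolutional
- G uses ReLU as its activation function, with tanh in the last layer
- D uses LeakyReLU as its activation function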
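To see how transposed convolutions upsample in G and strided convolutions downsample in D, the sketch below tracks spatial sizes through the layers. The kernel size 4, stride 2, and padding 1 are assumptions borrowed from common implementations, not values stated in these notes.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding or dilation)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel, stride, padding):
    """Output size of a regular strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: start from a 4x4 feature map and upsample four times.
sizes = [4]
for _ in range(4):
    sizes.append(conv_transpose_out(sizes[-1], kernel=4, stride=2, padding=1))

# Discriminator: the first strided convolution halves a 64x64 input.
d_first = conv_out(64, kernel=4, stride=2, padding=1)
```

With these settings each transposed convolution doubles the resolution (4, 8, 16, 32, 64) and each strided convolution halves it, so no pooling layer is needed in either network.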

Takeaways

With sufficient data, CNNs are the first choice for image-related problems
Some hyperparameter-tuning tricks matter a lot

References

https://arxiv.org/abs/1511.06434

CapsuleGAN: Generative Adversarial Capsule Network

arXiv:1802

Problem

This motivates the question whether GANs can be designed using CapsNets (instead of CNNs) to improve their performance.

Method

CapsNets are used as discriminators in our framework as opposed to the conventionally used CNNs.

Takeaways

I previously generated MNIST digits with DCGAN; if CapsNets outperform ConvNets, it would be worth experimenting on MNIST-like datasets

References

https://arxiv.org/abs/1802.06167