InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Paper: https://arxiv.org/abs/1606.03657 Code: https://github.com/openai/InfoGAN Tips: a NIPS 2016 paper. (Reading notes)
1.Main idea
The core idea is to use mutual information maximization as part of the objective: InfoGAN maximizes the mutual information between a small subset of the latent variables and the observation.
It separates out a consistent style factor in each dataset (e.g. stroke direction): it disentangles writing styles from digit shapes on the MNIST dataset.
It even discovers visual concepts such as hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.
2.Intro
Unsupervised learning is important and effective; even supervised downstream tasks often benefit from first finding a good latent representation without supervision.
A disentangled representation makes the relationship between individual latent variables and data features easier to learn and interpret.
InfoGAN's key move is to maximize the mutual information between the observation and a subset of the latent variables.
3.Details
The input to a vanilla GAN generator is just random noise $z$, with no constraints imposed on it, so individual dimensions of $z$ may affect the output in a highly entangled way. InfoGAN therefore splits the input into two parts: incompressible noise $z$, and a latent code $c$ that is meant to capture structured, semantic features of the data; the generator thus takes two arguments, $G(z,c)$. In the original GAN nothing stops the generator from satisfying $P_G(x|c)=P_G(x)$, i.e. simply ignoring $c$ and dropping the conditioning altogether.
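A minimal numpy sketch of how the generator input $[z, c]$ might be assembled; the dimensions (62-dim Gaussian noise, one 10-way categorical code) mirror the paper's MNIST setup, and the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_generator_input(batch_size, noise_dim=62, num_categories=10):
    """Assemble the InfoGAN generator input: incompressible noise z plus latent code c."""
    # Unstructured noise z, e.g. standard Gaussian.
    z = rng.standard_normal((batch_size, noise_dim))
    # Categorical latent code c, one-hot encoded.
    labels = rng.integers(0, num_categories, size=batch_size)
    c = np.eye(num_categories)[labels]
    # The generator sees the concatenation [z, c], i.e. G(z, c).
    return np.concatenate([z, c], axis=1), labels

x_in, labels = sample_generator_input(4)
print(x_in.shape)  # (4, 72)
```

The sampled `labels` are kept around because, during training, they serve as the targets that the auxiliary network $Q$ must recover from the generated image.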
The mutual information $I(c;G(z,c))$ should therefore be high. Mutual information is defined as follows, where $H(\cdot)$ denotes entropy:

$$
\begin{aligned}
I(X;Y)&=H(X)-H(X|Y)\\
&=H(Y)-H(Y|X)\\
&=\sum_y \sum_x p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\\
&\text{where } H(X)=-\sum_{x} P(x)\log P(x)=-\int p(x)\log p(x)\,\mathrm{d}x=-\Bbb E_{x \sim X}\left[\log p(x)\right]
\tag{1}
\end{aligned}
$$

From this definition it is clear that $I(X;Y)=0$ when $X$ and $Y$ are independent, while a large $I(X;Y)$ indicates a strong dependence between the two. Given any $x$ drawn from the generator, the goal is for the posterior $P_G(c|x)$ to have small entropy, so that the latent code $c$ can still be recovered when tracing back from $x$ to $c$, i.e. no code information is lost:
$$
\begin{aligned}
\min_G \max_D \;\mathbb{E}_{x \sim P_{\mathrm{data}}}[\log D(x)] +\mathbb{E}_{z \sim \mathrm{noise}}[\log (1-D(G(z)))] - \lambda I(c;G(z,c))
\tag{2}
\end{aligned}
$$

The first two terms are the standard GAN objective; the last term is a regularizer, weighted by $\lambda$, that rewards a strong dependence between the generator output and the latent code.
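As a sanity check on the definition in Eq. (1), the discrete mutual information can be computed directly from a joint probability table; this small numpy helper (illustrative, not from the paper's code) confirms the two limiting cases mentioned above:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), per Eq. (1)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                          # 0 * log 0 = 0 by convention
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

# Independent joint: p(x,y) = p(x) p(y)  ->  I = 0.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
# Perfectly dependent: X = Y  ->  I = H(X) = log 2.
dependent = np.diag([0.5, 0.5])

print(round(mutual_information(independent), 6))  # 0.0
print(round(mutual_information(dependent), 6))    # 0.693147
```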
The posterior $P_G(c|x)$ is hard to obtain directly, so an auxiliary distribution $Q(c|x)$ is introduced to approximate it; since the generator $G(z,c)$ produces $x$, we can derive a variational lower bound:
$$
\begin{aligned}
I(c;G(z,c)) &= H(c)-H(c \mid G(z,c))\\
&=H(c) + \Bbb E_{x\sim G(z,c)}\left[\Bbb E_{c'\sim P(c \mid x)}\left[\log P(c' \mid x)\right]\right] \\
&=H(c) + \Bbb E_{x\sim G(z,c)}\left[ \int p(c' \mid x) \log p(c' \mid x)\,\mathrm{d}c' \right] \\
&=H(c) + \Bbb E_{x\sim G(z,c)}\left[ \int p(c' \mid x)\log \frac{p(c' \mid x)}{q(c' \mid x)}\,\mathrm{d}c' + \int p(c' \mid x)\log q(c' \mid x)\,\mathrm{d}c' \right]\\
&=H(c) + \Bbb E_{x\sim G(z,c)}\left[ D_{KL}\big(P(\cdot \mid x)\, \Vert\, Q(\cdot \mid x)\big) + \Bbb E_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] \\
&\ge H(c) + \Bbb E_{x\sim G(z,c)}\left[\Bbb E_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right]\right] & \text{equality iff $Q=P$}
\tag{3}
\end{aligned}
$$

The derivation constructs an auxiliary distribution $Q$; the $\mathbf{KL}$ divergence measures the gap between the two distributions, and since $D_{KL} \ge 0$, dropping it yields the lower bound, which is tight exactly when $Q$ matches $P$.
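The lower bound in Eq. (3) can be verified numerically on a toy discrete model, where all distributions are small tables and the expectations are exact sums (a sketch with illustrative numbers, not the paper's setup):

```python
import numpy as np

# Discrete toy model: prior p(c) and a "generator" channel p(x|c).
p_c = np.array([0.5, 0.5])
p_x_given_c = np.array([[0.9, 0.1],   # p(x | c=0)
                        [0.2, 0.8]])  # p(x | c=1)

p_cx = p_c[:, None] * p_x_given_c     # joint p(c, x)
p_x = p_cx.sum(axis=0)                # marginal p(x)
p_c_given_x = p_cx / p_x              # true posterior P(c|x)

H_c = -np.sum(p_c * np.log(p_c))      # prior entropy H(c)
I_true = np.sum(p_cx * np.log(p_cx / (p_c[:, None] * p_x)))

def lower_bound(q_c_given_x):
    """L = H(c) + E_x[ E_{c'~P(c|x)}[ log Q(c'|x) ] ], per Eq. (3)."""
    return H_c + np.sum(p_cx * np.log(q_c_given_x))

exact = lower_bound(p_c_given_x)                # Q = P: bound is tight
mismatched = lower_bound(np.full((2, 2), 0.5))  # uninformative Q

assert np.isclose(exact, I_true)  # equality iff Q = P
assert mismatched < I_true        # strict gap = E[ KL(P || Q) ]
```

With $Q = P$ the bound recovers $H(c) - H(c|x) = I$ exactly; the uninformative $Q$ pays the expected KL gap.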
$\mathbf{Lemma\ 5.1}$ in the paper gives a way to rewrite the expectation of a function of random variables: re-sampling through the posterior does not change the overall expectation. The proof in the paper is somewhat terse, so it is re-derived below.
First, the law of total expectation: averaging the conditional expectation over the conditioning variable recovers the unconditional expectation:
$$
\begin{aligned}
E[E[X|Y]] &= E \left[ \sum_{x} x \cdot P(X = x \mid Y) \right] \\
&= \sum_y \left[ \sum_{x} x \cdot P(X = x \mid Y = y) \right] P(Y = y)\\
&= \sum_x x \sum_y P(X = x \mid Y = y) \cdot P(Y = y) \\
&= \sum_x x \sum_y P(X = x,\, Y = y) \\
&= \sum_x x \cdot P(X = x)\\
&= E[X]
\tag{4}
\end{aligned}
$$

Or, in the continuous case:
$$
\begin{aligned}
E_Y[E_X[X|Y]] &= \int_Y E_X\left[X|Y \right]f(y)\,\mathrm{d}y \\
&= \int_Y \left[\int_X x \frac{f(x,y)}{f(y)}\,\mathrm{d}x \right] f(y)\,\mathrm{d}y\\
&= \int_Y \int_X x f(x,y)\,\mathrm{d}x\,\mathrm{d}y & \text{$f(y)$ does not depend on $x$ and cancels}\\
&= \int_X x \int_Y f(x,y)\,\mathrm{d}y\,\mathrm{d}x & \text{swap the order of integration}\\
&= \int_X x \left[ \int_Y f(x,y)\,\mathrm{d}y \right] \mathrm{d}x\\
&= \int_X xf(x)\,\mathrm{d}x \\
&= E_X[X]
\tag{5}
\end{aligned}
$$
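The law of total expectation in Eq. (4) can be checked exactly on a small discrete joint distribution (the numbers below are arbitrary illustrations):

```python
import numpy as np

# Joint distribution over (X, Y): rows index x-values, columns index y-values.
x_vals = np.array([0.0, 1.0, 2.0])
p_xy = np.array([[0.10, 0.15],
                 [0.20, 0.05],
                 [0.25, 0.25]])

p_y = p_xy.sum(axis=0)        # P(Y = y)
p_x_given_y = p_xy / p_y      # P(X = x | Y = y), one column per y

# Inner expectation E[X | Y = y] for each y, then average over P(Y).
e_x_given_y = x_vals @ p_x_given_y
tower = float(e_x_given_y @ p_y)           # E[ E[X|Y] ]
direct = float(x_vals @ p_xy.sum(axis=1))  # E[X]

print(tower, direct)
assert np.isclose(tower, direct)  # Eq. (4): E[E[X|Y]] = E[X]
```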
$\mathbf{Lemma\ 5.1}$ is then an application of the law of total expectation:
$$
\begin{aligned}
\mathbb{E}_{x \sim X,\,y \sim Y|x}[f(x,y)]&=\int_x P(x) \int_y P(y|x)f(x,y)\,\mathrm{d}y\,\mathrm{d}x \\
&=\int_x \int_y P(x,y)f(x,y)\,\mathrm{d}y\,\mathrm{d}x \\
&=\int_y P(y) \int_x P(x|y)f(x,y)\,\mathrm{d}x\,\mathrm{d}y \\
&=\int_y P(y)\left[ \int_{x'} P(x'|y)f(x',y)\,\mathrm{d}x' \right] \mathrm{d}y & \text{rename $x$ to $x'$} \\
&=\int_x P(x) \left[ \int_y P(y|x)\left[ \int_{x'} P(x'|y)f(x',y)\,\mathrm{d}x' \right] \mathrm{d}y \right] \mathrm{d}x\\
&=\mathbb{E}_{x \sim X,\,y \sim Y|x,\, x' \sim X|y}[f(x',y)]
\tag{6}
\end{aligned}
$$
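Lemma 5.1 can also be checked by exact summation over a small discrete joint distribution, writing out both sides of Eq. (6) (illustrative numbers again):

```python
import numpy as np

# Discrete joint P(x, y); the lemma is verified by exact summation.
p_xy = np.array([[0.10, 0.15],
                 [0.20, 0.05],
                 [0.25, 0.25]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]
p_x_given_y = p_xy / p_y[None, :]

f = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])  # arbitrary f(x, y)

# Left side: E_{x~X, y~Y|x}[f(x, y)] = sum_{x,y} p(x,y) f(x,y).
lhs = float(np.sum(p_xy * f))

# Right side: E_{x~X, y~Y|x, x'~X|y}[f(x', y)].
rhs = 0.0
for xi in range(3):
    for yi in range(2):
        for xp in range(3):
            rhs += p_x[xi] * p_y_given_x[xi, yi] * p_x_given_y[xp, yi] * f[xp, yi]

assert np.isclose(lhs, rhs)  # the extra resampling x' ~ X|y changes nothing
```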
Applying the lemma, the lower bound on the mutual information $I$ becomes:

$$
\begin{aligned}
I(c;G(z,c)) &\ge H(c) + \Bbb E_{x\sim G(z,c)} \left[\Bbb E_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] \\
&=H(c) + \Bbb E_{c \sim P(c),\,x\sim P_G(x|c)} \left[\Bbb E_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] \\
&=H(c) +\Bbb E_{c \sim P(c),\,x\sim G(z,c)} \left[\log Q(c \mid x)\right]\\
&=L_I(G,Q)
\tag{7}
\end{aligned}
$$
InfoGAN's final objective therefore optimizes all three of $G$, $D$, and $Q$:
$$
\begin{aligned}
\min_{G,Q} \max_D \;\mathbb{E}_{x \sim P_{\mathrm{data}}}[\log D(x)] +\mathbb{E}_{z \sim \mathrm{noise}}[\log (1-D(G(z)))] - \lambda L_I(G,Q)
\tag{8}
\end{aligned}
$$

In the implementation, $Q(c|x)$ is simply a neural network (sharing most of its layers with the discriminator): the $x$ produced by the generator must both fool the discriminator $D$ and, after passing through the network $Q$, retain maximal mutual information with the code $c$ that was originally fed into the generator.