1. Modeling problem
formulation
a set of distributions
$$\mathcal{P}=\left\{p_{\mathrm{y}}(\cdot ; x) \in \mathcal{P}^{\mathcal{Y}}: x \in \mathcal{X}\right\}$$
approximation: find the single model that is closest to the whole family in the worst case
$$\min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{x \in \mathcal{X}} D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right)$$
solution
Theorem: For any $q \in \mathcal{P}^{\mathcal{Y}}$ there exists a mixture model
$$q_w(\cdot) = \sum_{x \in \mathcal{X}} w(x) p_{\mathrm{y}}(\cdot ; x)$$
satisfying
$$D\left(p_{\mathrm{y}}(\cdot ; x) \| q_{w}(\cdot)\right) \leq D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right) \quad \text{for all } x \in \mathcal{X}$$
Proof: Apply the Pythagorean theorem of information geometry: let $q_w$ be the point of the convex hull of $\mathcal{P}$ closest to $q$ in KL divergence; then $D(p \| q) \ge D(p \| q_w) + D(q_w \| q)$ for every $p$ in the hull, which gives the claim.
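A quick numerical illustration of the theorem (a sketch only: the two-member family, the candidate $q$, and the grid over mixture weights are all made-up values for demonstration):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up family {p_y(.; x)} with two members over a 3-symbol alphabet.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
q = [0.8, 0.1, 0.1]      # an arbitrary candidate model outside the family

# Grid search for a mixture q_w = w*p1 + (1-w)*p2 that is simultaneously
# at least as close (in KL) to every family member as q is.
dominating = []
for k in range(101):
    w = k / 100
    qw = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
    if kl(p1, qw) <= kl(p1, q) and kl(p2, qw) <= kl(p2, q):
        dominating.append(w)

print(len(dominating) > 0)   # the theorem guarantees such a w exists
```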
From the theorem it follows easily that
$$\max _{x \in \mathcal{X}} \min _{q \in \mathcal{P}^{\mathcal{Y}}} D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right)=\max _{x \in \mathcal{X}} 0=0$$
since the inner minimum is attained by choosing $q = p_{\mathrm{y}}(\cdot ; x)$, and that
$$\min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{x \in \mathcal{X}} D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right)=\min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right)$$
because the weighted average is linear in $w$, so its maximum over the simplex is attained at a vertex.
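The vertex argument behind the last equality can be sanity-checked directly (all numbers below are made up for illustration):

```python
import math, random

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up family and candidate model q (illustrative values only).
family = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]]
q = [0.4, 0.3, 0.3]

f = [kl(p, q) for p in family]   # D(p_y(.; x) || q) for each x

# A point mass on argmax_x f(x) attains max_x f(x) exactly...
xstar = max(range(len(f)), key=lambda i: f[i])
vertex_value = sum((1.0 if i == xstar else 0.0) * f[i] for i in range(len(f)))

# ...and no other w in the simplex can exceed it, by linearity in w.
random.seed(0)
exceeded = False
for _ in range(10000):
    r = [random.random() for _ in f]
    s = sum(r)
    avg = sum((ri / s) * fi for ri, fi in zip(r, f))
    exceeded = exceeded or avg > max(f) + 1e-12

print(abs(vertex_value - max(f)) < 1e-12 and not exceeded)
```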
Theorem (Redundancy-Capacity Theorem): The following equality holds, and the optimizing $w, q$ on the two sides coincide:
$$\begin{aligned} R^{+} &\triangleq \min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right) \\ &=\max _{w \in \mathcal{P}^{\mathcal{X}}} \min _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x) D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right) \triangleq R^{-} \end{aligned}$$
Proof:
Use the Equidistance property below to show $R^{+} \le R^{-}$.
By the general minimax inequality (a max-min never exceeds the corresponding min-max), $R^{+} \ge R^{-}$.
Hence $R^{+} = R^{-}$.
Finally, show that equality in both inequalities is attained at the same $(w, q)$.
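The equality can be observed numerically on a tiny model. The sketch below uses a made-up two-member family over a binary alphabet and coarse grid searches for both orders of optimization; it illustrates the theorem, it is not a proof:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up family: two distributions over a binary alphabet.
P = [[0.9, 0.1], [0.2, 0.8]]
N = 2000

# R+ = min over q of max over w; the inner max over w sits at a vertex,
# so it equals max_x D(p_x || q).
r_plus = min(
    max(kl(p, [i / N, 1 - i / N]) for p in P)
    for i in range(1, N)
)

# R- = max over w of min over q; the inner min is attained at the
# mixture q_w (derived in the next section), so plug it in directly.
def avg_kl(w):
    qw = [w * P[0][y] + (1 - w) * P[1][y] for y in range(2)]
    return w * kl(P[0], qw) + (1 - w) * kl(P[1], qw)

r_minus = max(avg_kl(j / N) for j in range(N + 1))

print(abs(r_plus - r_minus) < 1e-2)   # equal up to grid resolution
```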
2. Model capacity
First compute the inner min in $R^{-}$:
$$\begin{aligned} & \min _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x) D\left(p_{\mathrm{y}}(\cdot ; x) \| q(\cdot)\right) \\ =& \min _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x, y} w(x) p_{\mathrm{y}}(y ; x) \log \frac{p_{\mathrm{y}}(y ; x)}{q(y)} \\ =& \text { constant }-\max _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{y} q_{w}(y) \log q(y) \\ =& \text { constant }-\max _{q \in \mathcal{P}^{\mathcal{Y}}} \mathbb{E}_{q_{w}}[\log q(\mathrm{y})] \end{aligned}$$
By Gibbs' inequality the maximizer is
$$q^*(\cdot) = q_{w}(\cdot) \triangleq \sum_{x \in \mathcal{X}} w(x) p_{\mathrm{y}}(\cdot ; x)$$
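Gibbs' inequality, i.e. that $\mathbb{E}_{q_w}[\log q(\mathrm{y})]$ is maximized at $q = q_w$, is easy to spot-check (family, weights, and sampled competitors below are all made up):

```python
import math, random

# Made-up family and weights; q_w is the induced mixture.
P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
w = [0.3, 0.7]
qw = [sum(w[i] * P[i][y] for i in range(len(P))) for y in range(3)]

# E_{q_w}[log q_w(y)], the value Gibbs' inequality says is maximal.
best = sum(qy * math.log(qy) for qy in qw)

# No randomly drawn competitor q does better.
random.seed(0)
beaten = False
for _ in range(5000):
    r = [random.random() + 1e-12 for _ in range(3)]
    s = sum(r)
    q = [ri / s for ri in r]
    beaten = beaten or sum(qy * math.log(q[y]) for y, qy in enumerate(qw)) > best + 1e-12

print(not beaten)
```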
Now consider the outer max in $R^{-}$, which can be recast in Bayesian terms:
$$\begin{aligned} R^{-} &=\max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D\left(p_{\mathrm{y}}(\cdot ; x) \| q_{w}(\cdot)\right) \\ &=\max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x, y} w(x) p_{\mathrm{y}}(y ; x) \log \frac{p_{\mathrm{y}}(y ; x)}{\sum_{x^{\prime}} w\left(x^{\prime}\right) p_{\mathrm{y}}\left(y ; x^{\prime}\right)} \\ &\overset{\text{Bayesian}}{=}\max _{p_{\mathrm{x}}} \sum_{x} p_{\mathrm{x}}(x) D\left(p_{\mathrm{y} \mid \mathrm{x}}(\cdot \mid x) \| p_{\mathrm{y}}(\cdot)\right) \\ &=\max _{p_{\mathrm{x}}} \sum_{x, y} p_{\mathrm{x}}(x) p_{\mathrm{y} \mid \mathrm{x}}(y \mid x) \log \frac{p_{\mathrm{y} \mid \mathrm{x}}(y \mid x)}{p_{\mathrm{y}}(y)} \\ &=\max _{p_{\mathrm{x}}} \sum_{x, y} p_{\mathrm{x}, \mathrm{y}}(x, y) \log \frac{p_{\mathrm{x}, \mathrm{y}}(x, y)}{p_{\mathrm{x}}(x) p_{\mathrm{y}}(y)}=\max _{p_{\mathrm{x}}} I(\mathrm{x} ; \mathrm{y})=C \end{aligned}$$
Definition: For a model $p_{\mathsf{y|x}}$, the capacity is
$$C \triangleq \max _{p_{\mathrm{x}}} I(\mathrm{x} ; \mathrm{y})$$
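The "Bayesian" step is just the observation that, once $w$ is read as a prior $p_{\mathrm{x}}$, the weighted divergence from the mixture equals the mutual information $I(\mathrm{x};\mathrm{y})$. A direct numerical check (with made-up numbers):

```python
import math

# Made-up conditional model p_{y|x} (rows indexed by x) and prior w = p_x.
P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
w = [0.4, 0.6]
py = [sum(w[i] * P[i][y] for i in range(len(P))) for y in range(3)]

# sum_x w(x) D(p_y(.; x) || q_w), with q_w = py the induced marginal.
avg_kl = sum(
    w[i] * sum(P[i][y] * math.log(P[i][y] / py[y]) for y in range(3))
    for i in range(len(P))
)

# I(x; y) written via the joint p_{x,y}(x, y) = w(x) p(y|x).
mi = sum(
    w[i] * P[i][y] * math.log((w[i] * P[i][y]) / (w[i] * py[y]))
    for i in range(len(P)) for y in range(3)
)

print(abs(avg_kl - mi) < 1e-12)
```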
Model capacity: $C$
least informative prior: $p_{\mathrm{x}}^*$
Theorem (Equidistance property): The optimal $p_{\mathrm{x}}^*$ (equivalently $w^*$) achieving $C$ satisfies
$$D\left(p_{y}(\cdot ; x) \| q^{*}(\cdot)\right) \le C \quad \forall x \in \mathcal{X}$$
with equality for every $x$ such that $w^*(x) > 0$.
Proof:
$I(\mathrm{x} ; \mathrm{y})$ is concave in $p_{\mathrm{x}}(a)$ for every $a$.
Form the Lagrangian
$$\mathcal{L}=I(\mathrm{x} ; \mathrm{y}) - \lambda\left(\sum_x p_{\mathrm{x}}(x)-1\right)$$
which is also concave in $p_{\mathrm{x}}(a)$.
The maximizer of $\max_{p_{\mathrm{x}}} I(\mathrm{x} ; \mathrm{y})$ must therefore satisfy
$$\left.\frac{\partial I(x ; y)}{\partial p_{x}(a)}\right|_{p_{x}=p_{x}^{*}}-\lambda=0, \quad \text { for all } a \in \mathcal{X} \text { such that } p_{x}^{*}(a)>0$$
or
$$\left.\frac{\partial I(x ; y)}{\partial p_{x}(a)}\right|_{p_{x}=p_{x}^{*}}-\lambda\le0, \quad \text { for all } a \in \mathcal{X} \text { such that } p_{x}^{*}(a)=0$$
Since
$$\frac{\partial I(x ; y)}{\partial p_{x}(a)} = D\left(p_{y \mid x}(\cdot ; a) \| p_{y}\right)-\log e$$
substituting this into the two optimality conditions gives $D\left(p_{y \mid x}(\cdot ; a) \| p_{y}\right) = \lambda + \log e$ whenever $p_{x}^{*}(a) > 0$, and $\le \lambda + \log e$ otherwise; averaging over $w^*$ shows $C = \lambda + \log e$, which is exactly the statement of the theorem.
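Both the capacity and the equidistance property can be observed numerically. The sketch below runs the standard Blahut-Arimoto alternating updates (reweight $w$ by $\exp$ of the divergences, then rebuild the mixture) on a made-up $2\times 2$ model; the channel values and iteration count are illustrative choices, not part of the original notes:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up conditional model p_{y|x} (rows indexed by x).
P = [[0.9, 0.1], [0.2, 0.8]]
w = [0.5, 0.5]                     # initial prior

# Blahut-Arimoto: alternate q <- mixture q_w, then reweight w by exp(D_x).
for _ in range(20000):
    qw = [sum(w[i] * P[i][y] for i in range(len(P))) for y in range(len(P[0]))]
    c = [math.exp(kl(p, qw)) for p in P]
    z = sum(w[i] * c[i] for i in range(len(P)))
    w = [w[i] * c[i] / z for i in range(len(P))]

qw = [sum(w[i] * P[i][y] for i in range(len(P))) for y in range(len(P[0]))]
d = [kl(p, qw) for p in P]          # D(p_y(.; x) || q*) for each x
C = sum(w[i] * d[i] for i in range(len(P)))

# Equidistance: every supported x sits at the same divergence C from q*.
print(all(abs(di - C) < 1e-6 for di in d))
```

Here both inputs end up supported, so all divergences converge to the common value $C$; if some $w^*(x)$ were zero, its divergence would instead settle strictly below $C$.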
3. Inference with mixture models
4. Maximum entropy distribution
Maximum entropy is equivalent to an I-projection of the uniform distribution onto the corresponding constraint set:
$$D(p \| U)=\sum_{y} p(y) \log p(y)+\log |\mathcal{Y}|=\log |\mathcal{Y}|-H(p) \\ p^{*}=\underset{p \in \mathcal{L}_{\mathrm{t}}}{\arg \max } H(p)=\underset{p \in \mathcal{L}_{\mathrm{t}}}{\arg \min } D(p \| U)$$
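The identity $D(p \| U) = \log|\mathcal{Y}| - H(p)$ is easy to confirm numerically (the distribution below is an arbitrary made-up example):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]          # an arbitrary distribution
U = [1 / len(p)] * len(p)              # uniform over the same alphabet

H = -sum(pi * math.log(pi) for pi in p)                    # entropy (nats)
D = sum(pi * math.log(pi / ui) for pi, ui in zip(p, U))    # D(p || U)

print(abs(D - (math.log(len(p)) - H)) < 1e-12)
```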
For the rest of this series, see: 统计推断(一) Hypothesis Test · 统计推断(二) Estimation Problem · 统计推断(三) Exponential Family · 统计推断(四) Information Geometry · 统计推断(五) EM algorithm · 统计推断(六) Modeling · 统计推断(七) Typical Sequence · 统计推断(八) Model Selection · 统计推断(九) Graphical models · 统计推断(十) Elimination algorithm · 统计推断(十一) Sum-product algorithm