An Auxiliary Model for Bayesian Networks

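A Bayesian network is a directed acyclic graph (DAG). Sampling-based inference proceeds from the root nodes and spreads step by step to the leaf nodes. A single node of a Bayesian network can be drawn as the following directed graph: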
Figure 1: A single node of a Bayesian network
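In Figure 1, $\mathbf Z_j$ is a node of the Bayesian network, i.e. a random variable, and $\mathbf {Pa}_j$ denotes its parents. Let $p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j) = N(\mathbf Z_j \ ; \ h_{\theta}(\mathbf {Pa}_j), \sigma_z^2\mathbf I)$, meaning that, given its parents, $\mathbf Z_j$ follows a normal distribution centered at $h_{\theta}(\mathbf {Pa}_j)$. When we run inference with Monte Carlo (MC) methods, obtaining a sample of $\mathbf Z_j$ requires first obtaining a sample $\mathbf {pa}_j$ of $\mathbf {Pa}_j$ and then sampling from $p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j = \mathbf {pa}_j) = N(\mathbf Z_j \ ; \ h_{\theta}(\mathbf {pa}_j), \sigma_z^2\mathbf I)$. In other words, every parent sample defines a new distribution, and only then can we sample from it.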
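The two-step procedure just described can be sketched in plain Python. This is only an illustration under stated assumptions: the mean map `h_theta`, the standard normal parent distribution in `sample_parent`, and the value of `SIGMA_Z` are all hypothetical choices that do not come from the original text.

```python
import math
import random

SIGMA_Z = 0.5  # assumed value of sigma_z


def h_theta(pa):
    """Hypothetical mean map h_theta: one fixed tanh unit per component."""
    return [math.tanh(0.8 * v + 0.1) for v in pa]


def sample_parent(rng, dim=2):
    """Assumed parent distribution: a standard normal root node."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_z_direct(pa, rng):
    """Draw z_j from N(h_theta(pa_j), sigma_z^2 I).

    The distribution is rebuilt for every parent sample, because its
    mean depends on pa_j.
    """
    return [rng.gauss(mean, SIGMA_Z) for mean in h_theta(pa)]


rng = random.Random(0)
pa_j = sample_parent(rng)          # step 1: sample the parents
z_j = sample_z_direct(pa_j, rng)   # step 2: sample Z_j given pa_j
```

Note that the distribution sampled in step 2 changes with every `pa_j`.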
The paper "Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form" proposes an auxiliary model, sketched in Figure 2:
Figure 2: The auxiliary model
The auxiliary model adds a new random variable $\mathbf E_j$ to the original model, so that the original $p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j)$ becomes the new model $\hat p_{\theta}(\mathbf Z_j, \mathbf E_j \vert \mathbf {Pa}_j)$. We require $p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j)$ to be the marginal distribution of $\hat p_{\theta}(\mathbf Z_j, \mathbf E_j \vert \mathbf {Pa}_j)$. Because $\mathbf E_j$ is independent of $\mathbf {Pa}_j$, the factor $\hat p_{\theta}(\mathbf E_j \vert \mathbf {Pa}_j)$ reduces to $\hat p_{\theta}(\mathbf E_j)$ in the last step:

$\begin{aligned} p_{\theta}(\mathbf Z_j\vert \mathbf {Pa}_j) &= \int_{\mathbf E_j} \hat p_{\theta}(\mathbf Z_j, \mathbf E_j \vert \mathbf {Pa}_j) \ d \mathbf E_j \\ &= \int_{\mathbf E_j} \hat p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j, \mathbf E_j) \, \hat p_{\theta}(\mathbf E_j \vert \mathbf {Pa}_j) \ d \mathbf E_j \\ &= \int_{\mathbf E_j} \hat p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j, \mathbf E_j) \, \hat p_{\theta}(\mathbf E_j) \ d \mathbf E_j \qquad(1) \end{aligned}$
In the equation above, $\hat p_{\theta}(\mathbf Z_j \vert \mathbf {Pa}_j, \mathbf E_j)$ is called the density of a conditionally deterministic variable. It is a Dirac delta function $\delta(\cdot)$: once $\mathbf {Pa}_j$ and $\mathbf E_j$ are fixed, $\mathbf Z_j$ is fully determined and is no longer random. We set:

$\begin{aligned} \mathbf z_j &= g_{\theta}(\mathbf {pa}_j, \mathbf e_j) \qquad \text{so that} \\ \hat p_{\theta}(\mathbf Z_j = \mathbf z_j\vert \mathbf {Pa}_j=\mathbf {pa}_j, \mathbf E_j=\mathbf e_j) &= \delta(\mathbf z_j- g_{\theta}(\mathbf {pa}_j, \mathbf e_j)) \qquad (2) \end{aligned}$
Here $g_{\theta}(\mathbf {pa}_j, \mathbf e_j)$ is a generating function that we are free to choose. Substituting (2) into (1) gives:

$p_{\theta}(\mathbf Z_j = \mathbf z_j \vert \mathbf {Pa}_j = \mathbf {pa}_j) = \int_{\mathbf e_j} \delta(\mathbf z_j-g_{\theta}(\mathbf {pa}_j, \mathbf e_j))\cdot \hat p_{\theta}(\mathbf e_j)\ d \mathbf e_j \qquad (3)$
By the sifting property of $\delta(\cdot)$, only those values of $\mathbf e_j$ that satisfy $\mathbf z_j = g_{\theta}(\mathbf {pa}_j, \mathbf e_j)$ contribute to the integral.
Suppose the conditional distribution depicted in Figure 1 is a normal distribution:

$p_{\theta}(\mathbf Z_j = \mathbf z_j \vert \mathbf {Pa}_j = \mathbf {pa}_j) = N(\mathbf z_j; h_{\theta}(\mathbf {pa}_j), \sigma^2_z \mathbf I) \qquad(4)$
Here $h_{\theta}(\mathbf {pa}_j)$ is a mapping that takes all parents of $\mathbf Z_j$ as input, and $p_{\theta}(\mathbf Z_j = \mathbf z_j \vert \mathbf {Pa}_j = \mathbf {pa}_j)$ is a normal distribution whose mean is the image of that mapping. The mapping can be implemented by a neural network with one or more layers.
We design the generating function $g_{\theta}(\cdot)$ as follows:

$\begin{aligned} \mathbf z_j &= g_{\theta}(\mathbf {pa}_j,\mathbf e_j) = h_{\theta}(\mathbf {pa}_j) + \mathbf e_j \cdot \sigma_z \qquad(5) \\ &\text{where } \mathbf e_j \sim N(\mathbf 0, \mathbf I) \\ \Rightarrow \ \mathbf e_j &= \frac {\mathbf z_j - h_{\theta}(\mathbf {pa}_j)}{\sigma_z} \sim N(\mathbf 0, \mathbf I) \\ \Rightarrow \ \mathbf z_j &\sim N(h_{\theta}(\mathbf {pa}_j), \sigma_z^2 \mathbf I) \qquad(6) \end{aligned}$
Substituting into (3) gives:

$\int_{\mathbf e_j} \delta(\mathbf z_j-g_{\theta}(\mathbf {pa}_j, \mathbf e_j))\cdot \hat p_{\theta}(\mathbf e_j)\ d \mathbf e_j = N(\mathbf z_j;h_{\theta}(\mathbf {pa}_j), \sigma_z^2 \mathbf I) \qquad(7)$
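The step from (3) to (7) can be spelled out with a change of variables. The following is a sketch, assuming that $\mathbf z_j$ and $\mathbf e_j$ are $d$-dimensional (the symbol $d$ is introduced here for illustration and is not part of the original notation), that $\sigma_z > 0$, and that $\hat p_{\theta}(\mathbf e_j) = N(\mathbf e_j; \mathbf 0, \mathbf I)$ as in (5):

$\begin{aligned} \delta\big(\mathbf z_j - h_{\theta}(\mathbf {pa}_j) - \sigma_z \mathbf e_j\big) &= \sigma_z^{-d} \, \delta\!\left(\mathbf e_j - \frac{\mathbf z_j - h_{\theta}(\mathbf {pa}_j)}{\sigma_z}\right) \\ \int \delta\big(\mathbf z_j - g_{\theta}(\mathbf {pa}_j, \mathbf e_j)\big) \, N(\mathbf e_j; \mathbf 0, \mathbf I) \ d \mathbf e_j &= \sigma_z^{-d} \, N\!\left(\frac{\mathbf z_j - h_{\theta}(\mathbf {pa}_j)}{\sigma_z}; \mathbf 0, \mathbf I\right) \\ &= (2\pi)^{-d/2} \sigma_z^{-d} \exp\!\left(-\frac{\lVert \mathbf z_j - h_{\theta}(\mathbf {pa}_j) \rVert^2}{2\sigma_z^2}\right) \\ &= N\big(\mathbf z_j; h_{\theta}(\mathbf {pa}_j), \sigma_z^2 \mathbf I\big) \end{aligned}$

The factor $\sigma_z^{-d}$ produced by rescaling the delta function is exactly the normalization needed to turn the standard normal density into one with covariance $\sigma_z^2 \mathbf I$.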
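This example shows that adding the auxiliary variable changes neither the original relationships among the variables nor the probability distribution. What does change is how we generate samples of $\mathbf Z_j$: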
1. The original model samples directly from the distribution defined in (4), $N(\mathbf z_j; h_{\theta}(\mathbf {pa}_j), \sigma^2_z \mathbf I)$;
2. The auxiliary model samples $\mathbf e_j \sim N(\mathbf 0, \mathbf I)$ and then plugs the sample into the generating function: $\mathbf z_j = g_{\theta}(\mathbf {pa}_j,\mathbf e_j) = h_{\theta}(\mathbf {pa}_j) + \mathbf e_j \cdot \sigma_z$
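The two sampling routes can be compared numerically. The sketch below uses a scalar $\mathbf Z_j$ for brevity. The mean map `h_theta`, the value of `SIGMA_Z`, and the parent value `pa_j` are hypothetical stand-ins chosen only to make the example runnable.

```python
import math
import random
import statistics

SIGMA_Z = 0.5  # assumed value of sigma_z


def h_theta(pa):
    # Hypothetical scalar mean map standing in for h_theta
    return math.tanh(0.8 * pa + 0.1)


def sample_original(pa, rng):
    # Method 1: sample directly from N(h_theta(pa), sigma_z^2), as in (4)
    return rng.gauss(h_theta(pa), SIGMA_Z)


def sample_auxiliary(pa, rng):
    # Method 2: draw e ~ N(0, 1), then apply the generating function (5)
    e = rng.gauss(0.0, 1.0)
    return h_theta(pa) + e * SIGMA_Z


rng = random.Random(0)
pa_j = 0.3
n = 20000
direct = [sample_original(pa_j, rng) for _ in range(n)]
aux = [sample_auxiliary(pa_j, rng) for _ in range(n)]

# Both sets of samples should share the same mean and standard deviation
mean_direct, std_direct = statistics.fmean(direct), statistics.stdev(direct)
mean_aux, std_aux = statistics.fmean(aux), statistics.stdev(aux)
```

Up to Monte Carlo error, both routes should give a sample mean close to `h_theta(pa_j)` and a sample standard deviation close to `SIGMA_Z`. In the second route, the randomness always comes from the same fixed distribution $N(0, 1)$.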
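The comparison shows that the auxiliary model is somewhat simpler to implement, because the noise is always drawn from the same fixed distribution $N(\mathbf 0, \mathbf I)$. In fact, this is how the code is sampled in VAE implementations: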
1. The Encoder encodes the input sample $\mathbf x_j$ (for example, an MNIST image). This corresponds to $h_{\theta}(\mathbf {pa}_j)$, where $\mathbf {pa}_j$ is the Encoder input $\mathbf x_j$, and the resulting code corresponds to $\mathbf z_j$;
2. The Decoder maps a sample of $\mathbf z_j$ to $\hat {\mathbf x}_j$. The corresponding probabilistic graph is shown below:

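Figure 3: Probabilistic graphical model of the VAE
In an actual implementation, an auxiliary variable is added to the original VAE model, which yields a new auxiliary model, as shown below: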

Figure 4: The VAE as implemented
Here $\mathbf Z_j$ is sampled according to equation (5).
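As a rough illustration of this sampling step, the sketch below wires a toy encoder, equation (5), and a toy decoder together in plain Python. Everything concrete in it is an assumption made for the example: the weight matrices `W_ENC` and `W_DEC`, the layer sizes, the activation functions, and the fixed `SIGMA_Z`. It follows the fixed-variance form of (5) used in this article and is not a full VAE implementation.

```python
import math
import random

SIGMA_Z = 0.1  # fixed sigma_z as in equation (5); an assumed value

# Hypothetical fixed weights: 4-d input -> 2-d code -> 4-d reconstruction
W_ENC = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
W_DEC = [[0.7, -0.3], [0.2, 0.5], [-0.6, 0.1], [0.4, 0.4]]


def matvec(w, v):
    # Plain matrix-vector product on nested lists
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def encoder(x):
    # Plays the role of h_theta(pa_j), with pa_j = x_j
    return [math.tanh(a) for a in matvec(W_ENC, x)]


def sample_code(x, rng):
    # Equation (5): z_j = h_theta(x_j) + e_j * sigma_z, with e_j ~ N(0, I)
    e = [rng.gauss(0.0, 1.0) for _ in range(len(W_ENC))]
    z = [m + ei * SIGMA_Z for m, ei in zip(encoder(x), e)]
    return z, e


def decoder(z):
    # Maps a sampled code to a reconstruction x_hat with entries in (0, 1)
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(W_DEC, z)]


rng = random.Random(0)
x_j = [0.0, 1.0, 0.5, 0.25]
z_j, e_j = sample_code(x_j, rng)
x_hat_j = decoder(z_j)
```

Given `e_j`, the code `z_j` is a deterministic function of the encoder output, which is the conditionally deterministic structure described by equation (2).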
References:
1. Diederik P. Kingma, "Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form", June 2013.

In 2014, Diederik P. Kingma published the well-known "Auto-Encoding Variational Bayes", which introduced the VAE model. Reference [1] can be regarded as a precursor of the VAE.
