Sum-Product Networks: A New Deep Architecture

H. Poon, P. Domingos, Sum-Product Networks: A New Deep Architecture, ICCV (2011), Best Paper

Abstract

The key limiting factor in graphical model inference and learning is the complexity of the partition function.

This paper proposes sum-product networks (SPNs): directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges.

If an SPN is complete and consistent, it represents the partition function and all marginals of some graphical model, and gives semantics to its nodes.

The paper proposes learning algorithms for SPNs based on backpropagation and EM.

In the paper's experiments, inference and learning with SPNs are both faster and more accurate than with standard deep networks.

1 Introduction

Graphical models represent distributions compactly as normalized products of factors: $P(X = x) = \frac{1}{Z} \prod_{k} \phi_{k} (x_{\{k\}})$ , where

$x \in \mathcal{X}$ is a $d$ -dimensional vector;
each potential $\phi_{k}$ is a function of a subset $x_{\{k\}}$ of the variables (its scope);
$Z = \sum_{x \in \mathcal{X}} \prod_{k} \phi_{k} (x_{\{k\}})$ is the partition function.
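To make the cost of $Z$ concrete, here is a minimal brute-force sketch for Boolean variables. The representation of potentials as `(scope, table)` pairs and the two-variable model are assumptions made for illustration; they are not from the paper.

```python
from itertools import product

def partition_function(potentials, d):
    """Brute-force Z: sum over all 2^d Boolean states of the product of potentials.

    potentials is a list of (scope, table) pairs (an assumed representation):
    scope is a tuple of variable indices, table maps value tuples to
    non-negative reals.
    """
    z = 0.0
    for x in product([0, 1], repeat=d):
        weight = 1.0
        for scope, table in potentials:
            weight *= table[tuple(x[i] for i in scope)]
        z += weight
    return z

# Hypothetical model: one unary potential on X0, one pairwise potential on (X0, X1).
potentials = [
    ((0,), {(0,): 1.0, (1,): 2.0}),
    ((0, 1), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
]
Z = partition_function(potentials, d=2)
print(Z)
```

The outer loop visits every one of the $2^{d}$ states, which is the exponential cost that motivates the rest of the paper.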

Limitations of graphical models:

some distributions admit no compact representation of this form;
inference takes exponential time in the worst case;
the sample size required for accurate learning is worst-case exponential in scope size;
because learning requires inference as a subroutine, it can take exponential time even with fixed scopes.

Postulating hidden variables $y$ can often greatly increase the compactness of a graphical model: $P(X = x) = \frac{1}{Z} \sum_{y} \prod_{k} \phi_{k} ( (x, y)_{\{k\}} )$

Models with multiple layers of hidden variables allow efficient inference in a much larger class of distributions.

If $\sum_{x \in \mathcal{X}} \prod_{k} \phi_{k} (x_{\{k\}})$ can be reorganized using the distributive law into a computation involving only a polynomial number of sums and products, then the partition function $Z$ can be computed efficiently.
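The simplest case of this reorganization is a model with only unary potentials, where the sum over $2^{d}$ states factors into a product of $d$ two-term sums. The sketch below checks this numerically; the potential values are hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical unary potentials phi_k(x_k) for d = 10 Boolean variables.
d = 10
phi = [{0: 1.0 + 0.1 * k, 1: 2.0 - 0.1 * k} for k in range(d)]

# Naive form: 2^d terms, each a product of d factors.
z_naive = sum(
    prod(phi[k][x[k]] for k in range(d))
    for x in product([0, 1], repeat=d)
)

# After applying the distributive law: d sums of two terms, multiplied together.
z_factored = prod(phi[k][0] + phi[k][1] for k in range(d))

print(z_naive, z_factored)
```

Both expressions compute the same $Z$, but the second uses a number of operations linear in $d$ instead of exponential.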

This paper introduces sum-product networks (SPNs). An SPN can be viewed as a generalized directed acyclic graph of mixture models, with sum nodes corresponding to mixtures over subsets of variables and product nodes corresponding to features or mixture components. SPNs can be learned efficiently by backpropagation or EM.

2 Sum-Product Networks

Consider Boolean variables $X_{i}$ ; the negation of $X_{i}$ is written $\bar{X}_{i}$ .

The indicator function $[\cdot]$ has value 1 when its argument is true and 0 otherwise. In this paper the variable indicators $[X_{i}]$ and $[\bar{X}_{i}]$ are abbreviated as $x_{i}$ and $\bar{x}_{i}$ .

Network polynomial: let $\Phi(x) \geq 0$ be an unnormalized probability distribution. The network polynomial of $\Phi(x)$ is $\sum_{x} \Phi(x) \Pi(x)$ , where $\Pi(x)$ is the product of the indicators that have value 1 in state $x$ .

The network polynomial is a multilinear function of the indicator variables.

Evidence $e$ is a partial instantiation of $X$ . The unnormalized probability of $e$ is the value of the network polynomial when all indicators compatible with $e$ are set to 1 and the remainder are set to 0.
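The definitions of the network polynomial and of evidence can be sketched directly in code. The table `phi` and the helper names below are hypothetical, and the polynomial is evaluated by explicit enumeration, which is exponential in $d$ and only meant to illustrate the definition.

```python
from itertools import product

def network_polynomial(phi, pos, neg):
    """Evaluate sum_x Phi(x) * Pi(x) for given indicator values.

    phi maps each complete state (tuple of 0/1) to an unnormalized probability;
    pos[i] is the value of indicator x_i, neg[i] the value of its negation.
    """
    d = len(pos)
    total = 0.0
    for x in product([0, 1], repeat=d):
        pi = 1.0
        for i, xi in enumerate(x):
            # Pi(x) multiplies the indicator that has value 1 in state x.
            pi *= pos[i] if xi == 1 else neg[i]
        total += phi[x] * pi
    return total

def indicators_for(evidence, d):
    """Set indicators compatible with evidence {index: value} to 1, the rest to 0."""
    pos = [1.0 if evidence.get(i, 1) == 1 else 0.0 for i in range(d)]
    neg = [1.0 if evidence.get(i, 0) == 0 else 0.0 for i in range(d)]
    return pos, neg

# Hypothetical unnormalized distribution over two Boolean variables.
phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}

p_state = network_polynomial(phi, *indicators_for({0: 1, 1: 0}, 2))  # complete state
p_evidence = network_polynomial(phi, *indicators_for({0: 1}, 2))     # evidence X0 = 1
p_all = network_polynomial(phi, *indicators_for({}, 2))              # no evidence
print(p_state, p_evidence, p_all)
```

With no evidence, every indicator is 1 and the polynomial evaluates to the partition function of `phi`.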

Definition 1: A sum-product network (SPN) over variables $x_{1}, \dots, x_{d}$ is a rooted directed acyclic graph whose leaves are the indicators $x_{1}, \dots, x_{d}$ and $\bar{x}_{1}, \dots, \bar{x}_{d}$ and whose internal nodes are sums and products:

Each edge $(i, j)$ emanating from a sum node $i$ has a non-negative weight $w_{ij}$ .
The value of a product node is the product of the values of its children.
The value of a sum node is $\sum_{j \in \text{Ch}(i)} w_{ij} v_{j}$ , where $\text{Ch}(i)$ are the children of $i$ and $v_{j}$ is the value of node $j$ .
The value of an SPN is the value of its root.
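Definition 1 translates almost line by line into a recursive evaluator. This is a minimal sketch, assuming nodes are encoded as `(kind, payload)` tuples; that encoding and the small example network are illustrative choices, not the paper's. The recursion does not cache shared nodes, so it suits only small examples.

```python
from math import prod

def evaluate(node, indicators):
    """Value of an SPN node following Definition 1 (recursive, no caching)."""
    kind, payload = node
    if kind == "leaf":      # payload: name of an indicator
        return indicators[payload]
    if kind == "product":   # payload: list of child nodes
        return prod(evaluate(child, indicators) for child in payload)
    if kind == "sum":       # payload: list of (weight, child) edges
        return sum(w * evaluate(child, indicators) for w, child in payload)
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical SPN: 0.7 * x1 * x2 + 0.3 * not_x1 * not_x2.
root = ("sum", [
    (0.7, ("product", [("leaf", "x1"), ("leaf", "x2")])),
    (0.3, ("product", [("leaf", "not_x1"), ("leaf", "not_x2")])),
])

# Complete state X1 = 1, X2 = 1.
value_state = evaluate(root, {"x1": 1.0, "x2": 1.0, "not_x1": 0.0, "not_x2": 0.0})
# All indicators set to 1.
value_all = evaluate(root, {"x1": 1.0, "x2": 1.0, "not_x1": 1.0, "not_x2": 1.0})
print(value_state, value_all)
```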

Assumption (made without loss of generality): sums and products are arranged in alternating layers, i.e., all children of a sum are products or leaves, and vice versa.

A sum-product network $S$ is written as a function of the indicator variables $x_{1}, \dots, x_{d}$ and $\bar{x}_{1}, \dots, \bar{x}_{d}$ , $S(x_{1}, \dots, x_{d}, \bar{x}_{1}, \dots, \bar{x}_{d})$ :

If the indicators specify a complete state $x$ , i.e., for each variable $X_{i}$ either $x_{i} = 1$ , $\bar{x}_{i} = 0$ or $x_{i} = 0$ , $\bar{x}_{i} = 1$ , the output is written $S(x)$ ;
if the indicators specify evidence $e$ , the output is written $S(e)$ ;
if all indicators are set to 1, the output is written $S(\ast)$ ;
the subnetwork rooted at an arbitrary node $n$ is itself a sum-product network, written $S_{n}(\cdot)$ ;
the values of $S(x)$ for all $x \in \mathcal{X}$ define an unnormalized probability distribution over $\mathcal{X}$ ;
the unnormalized probability of evidence $e$ under this distribution is $\Phi_{S}(e) = \sum_{x \in e} S(x)$ , where the sum is over states consistent with $e$ ;
the partition function of the distribution defined by $S(x)$ is $Z_{S} = \sum_{x \in \mathcal{X}} S(x)$ ;
the scope of $S$ is the set of variables that appear in $S$ ;
a variable $X_{i}$ appears negated in $S$ if $\bar{x}_{i}$ is a leaf of $S$ , and non-negated if $x_{i}$ is a leaf of $S$ .

Example: in Figure 1, the SPN is $S(x_{1}, x_{2}, \bar{x}_{1}, \bar{x}_{2}) = 0.5 (0.6 x_{1} + 0.4 \bar{x}_{1}) (0.3 x_{2} + 0.7 \bar{x}_{2}) + 0.2 (0.6 x_{1} + 0.4 \bar{x}_{1}) (0.2 x_{2} + 0.8 \bar{x}_{2}) + 0.3 (0.9 x_{1} + 0.1 \bar{x}_{1}) (0.2 x_{2} + 0.8 \bar{x}_{2})$ , and its network polynomial is $( 0.5 \times 0.6 \times 0.3 + 0.2 \times 0.6 \times 0.2 + 0.3 \times 0.9 \times 0.2 ) x_{1} x_{2} + \dots$ . For the complete state $x$ : $X_{1} = 1$ , $X_{2} = 0$ , $S(x) = S(1, 0, 0, 1)$ ; for the evidence $e$ : $X_{1} = 1$ , $S(e) = S(1, 1, 0, 1)$ ; and $S(\ast) = S(1, 1, 1, 1)$ .
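The three evaluations in this example can be checked numerically. The sketch below hard-codes the Figure 1 network as a plain function, using the root weights 0.5, 0.2 and 0.3 that appear in the network polynomial; the function and variable names are illustrative.

```python
def S(x1, x2, nx1, nx2):
    """Figure 1 SPN as a function of the indicators (x1, x2, not-x1, not-x2)."""
    a1 = 0.6 * x1 + 0.4 * nx1   # first mixture over X1
    a2 = 0.9 * x1 + 0.1 * nx1   # second mixture over X1
    b1 = 0.3 * x2 + 0.7 * nx2   # first mixture over X2
    b2 = 0.2 * x2 + 0.8 * nx2   # second mixture over X2
    return 0.5 * a1 * b1 + 0.2 * a1 * b2 + 0.3 * a2 * b2

s_state = S(1, 0, 0, 1)      # complete state X1 = 1, X2 = 0
s_evidence = S(1, 1, 0, 1)   # evidence X1 = 1, X2 unobserved
s_all = S(1, 1, 1, 1)        # all indicators set to 1

# Sum of S over the two complete states consistent with X1 = 1.
phi_evidence = S(1, 1, 0, 0) + S(1, 0, 0, 1)
print(s_state, s_evidence, s_all, phi_evidence)
```

Up to floating-point rounding, `s_evidence` and `phi_evidence` agree, so a single evaluation with the evidence indicators gives the same result as summing over the consistent states, and `s_all` shows that this particular network is already normalized.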

Definition 2: A sum-product network $S$ is valid iff $S(e) = \Phi_{S}(e)$ for all evidence $e$ .

Definition 3: A sum-product network is complete iff all children of the same sum node have the same scope.

Definition 4: A sum-product network is consistent iff no variable appears negated in one child of a product node and non-negated in another. In other words, no product node can yield a term containing $x_{i} \bar{x}_{i}$ .

Theorem 1: A sum-product network is valid if it is complete and consistent.
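Both conditions of Theorem 1 are structural and can be checked by walking the graph. The following is a minimal sketch, assuming leaves are encoded as `("leaf", variable, negated)` and internal nodes as `("sum", [(weight, child), ...])` or `("product", [child, ...])`; the encoding and the three test networks are hypothetical.

```python
def children(node):
    """Child nodes of a sum or product node (weights dropped for sums)."""
    if node[0] == "sum":
        return [child for _, child in node[1]]
    return list(node[1])

def literals(node):
    """Set of (variable, negated) pairs for all leaves below node."""
    if node[0] == "leaf":
        return {(node[1], node[2])}
    return set().union(*(literals(c) for c in children(node)))

def scope(node):
    """Set of variables appearing below node."""
    return {var for var, _ in literals(node)}

def is_complete(node):
    """Definition 3: all children of the same sum node have the same scope."""
    if node[0] == "leaf":
        return True
    kids = children(node)
    if node[0] == "sum" and len({frozenset(scope(c)) for c in kids}) > 1:
        return False
    return all(is_complete(c) for c in kids)

def is_consistent(node):
    """Definition 4: no variable is negated in one child of a product
    node and non-negated in another child."""
    if node[0] == "leaf":
        return True
    kids = children(node)
    if node[0] == "product":
        for i, a in enumerate(kids):
            for b in kids[i + 1:]:
                if any((v, not neg) in literals(b) for v, neg in literals(a)):
                    return False
    return all(is_consistent(c) for c in kids)

x1, nx1 = ("leaf", 1, False), ("leaf", 1, True)
x2, nx2 = ("leaf", 2, False), ("leaf", 2, True)

good = ("sum", [(0.5, ("product", [x1, x2])), (0.5, ("product", [nx1, nx2]))])
incomplete = ("sum", [(0.5, x1), (0.5, x2)])   # sum children differ in scope
inconsistent = ("product", [x1, nx1])          # x1 and its negation multiplied

print(is_complete(good), is_consistent(good))
print(is_complete(incomplete), is_consistent(inconsistent))
```

The helpers recompute `literals` repeatedly, which is fine for small graphs; a larger implementation would presumably cache scopes per node.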

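Completeness and consistency are not necessary conditions for validity.

If an SPN $S$ is complete but inconsistent, its expansion includes monomials that are not present in its network polynomial, so $S(e) \geq \Phi_{S}(e)$ . If $S$ is consistent but incomplete, some of its monomials are missing indicators relative to the monomials in its network polynomial, so $S(e) \leq \Phi_{S}(e)$ . Invalid SPNs can therefore still be used for approximate inference.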
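The two inequalities can be seen on tiny hand-built networks. In this sketch both example networks are hypothetical and written as plain functions of the indicator lists `pos` (for $x_{i}$ ) and `neg` (for $\bar{x}_{i}$ ); $\Phi_{S}(e)$ is computed by enumerating the complete states consistent with $e$ .

```python
from itertools import product

def phi_S(S, d, evidence):
    """Sum S(x) over all complete states x consistent with evidence {index: value}."""
    total = 0.0
    for x in product([0, 1], repeat=d):
        if all(x[i] == v for i, v in evidence.items()):
            pos = [float(xi) for xi in x]
            neg = [1.0 - p for p in pos]
            total += S(pos, neg)
    return total

def S_of_evidence(S, d, evidence):
    """Evaluate S once with indicators compatible with the evidence set to 1."""
    pos = [1.0 if evidence.get(i, 1) == 1 else 0.0 for i in range(d)]
    neg = [1.0 if evidence.get(i, 0) == 0 else 0.0 for i in range(d)]
    return S(pos, neg)

def inconsistent(pos, neg):
    """Complete but inconsistent: (0.5 x1 + 0.5 not-x1) * x1."""
    return (0.5 * pos[0] + 0.5 * neg[0]) * pos[0]

def incomplete(pos, neg):
    """Consistent but incomplete: 0.5 x1 + 0.5 x2 (sum children differ in scope)."""
    return 0.5 * pos[0] + 0.5 * pos[1]

no_evidence = {}
over = (S_of_evidence(inconsistent, 1, no_evidence), phi_S(inconsistent, 1, no_evidence))
under = (S_of_evidence(incomplete, 2, no_evidence), phi_S(incomplete, 2, no_evidence))
print(over, under)
```

In the first pair the single evaluation overestimates the true sum, and in the second it underestimates it, matching the direction of the two inequalities.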

Definition 5: An unnormalized probability distribution $\Phi(x)$ is representable by a sum-product network $S$ iff $\Phi(x) = S(x)$ for all states $x$ and $S$ is valid.

In that case $S$ correctly computes all marginals of $\Phi(x)$ , including its partition function, and each query costs one evaluation of $S$ , i.e., time linear in the number of edges of $S$ .

Theorem 2: The partition function of a Markov network $\Phi(x)$ , where $x$ is a $d$ -dimensional vector, can be computed in time polynomial in $d$ if $\Phi(x)$ is representable by a sum-product network with a number of edges polynomial in $d$ .
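The argument behind Theorem 2 is that a valid SPN yields the partition function as $S(\ast)$ , and one bottom-up pass touches each edge once. The sketch below assumes the nodes are stored in a list in topological order (children before parents, root last); that layout and the unnormalized root weights 5, 2, 3 are hypothetical choices for illustration.

```python
def evaluate_dag(nodes, indicators):
    """Single bottom-up pass over an SPN stored in topological order.

    Each entry is ("leaf", name), ("product", [child_index, ...]) or
    ("sum", [(weight, child_index), ...]). Every edge is visited exactly
    once, so the cost is linear in the number of edges.
    """
    values = [0.0] * len(nodes)
    for i, (kind, payload) in enumerate(nodes):
        if kind == "leaf":
            values[i] = indicators[payload]
        elif kind == "product":
            v = 1.0
            for j in payload:
                v *= values[j]
            values[i] = v
        else:  # sum node
            values[i] = sum(w * values[j] for w, j in payload)
    return values[-1]  # the root is the last node

nodes = [
    ("leaf", "x1"), ("leaf", "not_x1"), ("leaf", "x2"), ("leaf", "not_x2"),  # 0-3
    ("sum", [(0.6, 0), (0.4, 1)]),    # 4: mixture over X1
    ("sum", [(0.9, 0), (0.1, 1)]),    # 5: mixture over X1
    ("sum", [(0.3, 2), (0.7, 3)]),    # 6: mixture over X2
    ("sum", [(0.2, 2), (0.8, 3)]),    # 7: mixture over X2
    ("product", [4, 6]),              # 8
    ("product", [4, 7]),              # 9 (shares node 4 with node 8)
    ("product", [5, 7]),              # 10 (shares node 7 with node 9)
    ("sum", [(5.0, 8), (2.0, 9), (3.0, 10)]),  # 11: root, unnormalized weights
]

all_ones = {"x1": 1.0, "not_x1": 1.0, "x2": 1.0, "not_x2": 1.0}
Z = evaluate_dag(nodes, all_ones)  # partition function as S(*)
print(Z)
```

Because this network is complete and consistent, the single pass with all indicators set to 1 matches the sum of $S(x)$ over all four complete states.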