Dynamic Routing Between Capsules

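I recently read up on CapsNet and found it fascinating. It also deepened my admiration for Hinton: his talent, imagination and insight are remarkable. Below is my understanding of the original paper.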

Abstract

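A capsule uses an activity vector to represent an object. The length of the vector represents the probability that the object is present, and its orientation represents the object's instantiation parameters (for example, its pose). Lower-level capsules use transformation matrices to predict the parameters of higher-level capsules, and a higher-level capsule becomes active when multiple predictions agree. Iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to those higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.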

1. Introduction

Human vision ignores irrelevant details and pays more attention to the relative positions within the whole.

2. How the vector inputs and outputs of a capsule are computed

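There are many possible ways to implement the idea of capsules. The aim of the paper is to show that one fairly simple implementation works well and that dynamic routing helps.
We want the length of a capsule's output vector to represent the probability that the object the capsule represents is present. We therefore use a non-linear "squashing" function so that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1. The squashing function is: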

$v_j=\frac{||s_j||^2}{1+||s_j||^2}\frac{s_j}{||s_j||}$

where $v_j$ is the output vector of capsule $j$ and $s_j$ is its total input.
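
To make the behaviour of the squashing function concrete, here is a minimal sketch in plain Python. The function names `squash` and `length` and the small `eps` guard against division by zero are my own additions for illustration, not something the paper specifies.

```python
import math

def squash(s, eps=1e-9):
    """Squash a vector (list of floats): keep its direction, map its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    # ||s||^2 / (1 + ||s||^2) is the new length; dividing by ||s|| normalises the direction.
    # eps is an assumed safeguard so the zero vector does not divide by zero.
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]

def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

short = squash([0.01, 0.0])
long_ = squash([30.0, 40.0])
print(length(short))  # very close to 0
print(length(long_))  # slightly below 1
```

The direction is unchanged: `long_` still points along (3, 4), only its length has been compressed.
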
For all but the first layer of capsules, the total input to a capsule $s_j$ is a weighted sum over all “prediction vectors” $\hat{u}_{j|i}$ from the capsules in the layer below, and each prediction vector is produced by multiplying the output $u_i$ of a capsule in the layer below by a weight matrix $W_{ij}$:

$s_j=\sum_i c_{ij}\hat{u}_{j|i}, \quad \hat{u}_{j|i}=W_{ij}u_i$

where the $c_{ij}$ are coupling coefficients that are determined by the iterative dynamic routing process. They are a softmax over the routing logits $b_{ij}$, so the coefficients between capsule $i$ and all the capsules in the layer above sum to 1:

$c_{ij}=\frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$

Each routing iteration then adds the agreement between the prediction and the current output of capsule $j$ to the logit:

$b_{ij}\leftarrow b_{ij}+\hat{u}_{j|i}\cdot v_j$
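
The three formulas above fit together into one routing loop. The following is a rough sketch in plain Python rather than the authors' TensorFlow code. The function name `route`, the nested-list layout `u_hat[i][j]`, and the default of 3 iterations are assumptions made for illustration.

```python
import math

def squash(s, eps=1e-9):
    """Squashing non-linearity; eps is an assumed guard against division by zero."""
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """u_hat[i][j] is the prediction vector from lower capsule i for higher capsule j.

    Returns the higher-level outputs v and the coupling coefficients c
    that were used in the last iteration. num_iters=3 is an assumed default.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    for _ in range(num_iters):
        c = [softmax(row) for row in b]       # c_ij: softmax over j for each i
        v = []
        for j in range(n_out):
            # s_j = sum_i c_ij * u_hat_{j|i}
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # b_ij <- b_ij + u_hat_{j|i} . v_j
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c

# Three lower capsules, two higher capsules, 2D vectors (toy numbers).
# All three agree on higher capsule 0; their predictions for capsule 1 conflict.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
print(c)  # every lower capsule now sends most of its output to higher capsule 0
```

Because the predictions for capsule 0 agree, its output grows, the dot products in the logit update grow with it, and the coupling shifts further towards capsule 0 on each iteration.
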

3. Margin loss for digit existence

To allow for multiple digits, we use a separate margin loss, $L_k$ for each digit capsule $k$ :

$L_k = T_k \max(0, m^+-||v_k||)^2+\lambda (1-T_k)\max(0, ||v_k||-m^-)^2$

where $T_k = 1$ iff a digit of class $k$ is present, $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules (the paper uses $\lambda = 0.5$).
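
As a quick check of how the margin loss behaves, here is a small sketch in plain Python. The function name `margin_loss` and the set-based `present_classes` argument are illustrative choices. The default `lam=0.5` is the value I recall from the paper, so treat it as an assumption.

```python
def margin_loss(lengths, present_classes, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses L_k.

    lengths[k] is ||v_k||; present_classes is the set of class indices with T_k = 1.
    lam=0.5 is an assumed value for the down-weighting factor lambda.
    """
    total = 0.0
    for k, norm in enumerate(lengths):
        t_k = 1.0 if k in present_classes else 0.0
        # Present class: penalise lengths below m_plus.
        present_term = t_k * max(0.0, m_plus - norm) ** 2
        # Absent class: penalise lengths above m_minus, down-weighted by lam.
        absent_term = lam * (1.0 - t_k) * max(0.0, norm - m_minus) ** 2
        total += present_term + absent_term
    return total

print(margin_loss([0.95, 0.05, 0.05], {0}))  # 0.0, every capsule is inside its margin
print(margin_loss([0.5, 0.05, 0.05], {0}))   # about 0.16, the present capsule is too short
print(margin_loss([0.95, 0.6, 0.05], {0}))   # about 0.125, an absent capsule is too long
```

Using a set for the present classes means the same function covers images containing more than one digit.
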

4. CapsNet architecture

The architecture is shallow with only two convolutional layers and one fully connected layer.
Conv1 has 256 $9\times 9$ convolution kernels with a stride of 1 and ReLU activation. (The features it extracts serve as the input to the primary capsules.)
The second layer (Primary Capsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a $9\times 9$ kernel and a stride of 2). Each primary capsule output sees the outputs of all $256\times 81$ Conv1 units whose receptive fields overlap with the location of the center of the capsule.
In total PrimaryCapsules has $32\times 6\times 6$ capsule outputs (each output is an 8D vector) and each capsule in the $6\times 6$ grid is sharing their weights with each other.
One can see PrimaryCapsules as a convolution layer with the squashing function as its block non-linearity.
The final Layer (DigitCaps) has one 16D capsule per digit class and each of these capsules receives input from all the capsules in the layer below.
We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps).
Since Conv1 output is 1D, there is no orientation in its space to agree on (a one-dimensional feature has no direction to compute).
Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits ( $b_{ij}$ ) are initialized to zero. Therefore, initially a capsule output ( $u_i$ ) is sent to all parent capsules with equal probability ( $c_{ij}$ ).
Our implementation is in TensorFlow and we use the Adam optimizer with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses.
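
The layer sizes quoted in this section can be checked with a little arithmetic. The sketch below assumes a $28\times 28$ MNIST input, convolutions without padding, and 10 digit classes, none of which is stated explicitly in these notes. The helper name `conv_out` is made up for illustration.

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution without padding (assumed 'valid' convolution)."""
    return (size - kernel) // stride + 1

image = 28                        # assumed MNIST input width and height
conv1 = conv_out(image, 9, 1)     # Conv1: 9x9 kernel, stride 1
primary = conv_out(conv1, 9, 2)   # PrimaryCapsules: 9x9 kernel, stride 2
num_primary_caps = 32 * primary * primary

# Assuming one transformation matrix W_ij (8D in, 16D out) per
# (primary capsule, digit capsule) pair and 10 digit classes.
num_routing_weights = num_primary_caps * 10 * 8 * 16

print(conv1, primary, num_primary_caps, num_routing_weights)
# 20 6 1152 1474560
```

So the $6\times 6$ grid and the $32\times 6\times 6$ capsule outputs follow directly from the kernel sizes and strides, and DigitCaps routes over 1152 input capsules.
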

4.1. Reconstruction as a regularization method

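During training, all activity vectors except that of the correct digit capsule are masked out. The remaining vector is then used to reconstruct the input image: the capsule's output vector is fed into a decoder made of 3 fully connected layers, in the hope that it can be decoded back into the original image.
The decoder minimizes the sum of squared differences between its output and the pixel intensities. This reconstruction loss is scaled down (multiplied by a factor of 0.0005) so that it does not dominate during training. Reconstructions from the 16D CapsNet output are robust while keeping only the important details.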
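
A minimal sketch of the masking step and the combined loss, in plain Python. The decoder itself is left out, and the function names `mask_digit_caps`, `reconstruction_loss` and `total_loss` are hypothetical. Only the 0.0005 scale and the sum-of-squares form come from the text above.

```python
def mask_digit_caps(digit_caps, target):
    """Zero every capsule vector except the one of the correct class, then flatten.

    digit_caps is a list of per-class activity vectors; target is the correct class index.
    """
    flat = []
    for k, vec in enumerate(digit_caps):
        flat.extend(vec if k == target else [0.0] * len(vec))
    return flat

def reconstruction_loss(reconstruction, image):
    """Sum of squared differences between decoder output and pixel intensities."""
    return sum((r - x) ** 2 for r, x in zip(reconstruction, image))

def total_loss(margin, reconstruction, image, scale=0.0005):
    """Margin loss plus the scaled-down reconstruction loss."""
    return margin + scale * reconstruction_loss(reconstruction, image)

caps = [[1.0, 2.0], [3.0, 4.0]]       # toy example: 2 classes, 2D capsules
print(mask_digit_caps(caps, 1))        # [0.0, 0.0, 3.0, 4.0]
print(total_loss(0.1, [0.5, 1.0], [0.0, 1.0]))
```

The masked, flattened vector is what would be fed to the three fully connected decoder layers.
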

5. Capsules on MNIST

Performance on MNIST turns out to be at a state-of-the-art level.

5.1. What the individual dimensions of a capsule represent

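Since only the encoding of one digit is passed on and the other digits are zeroed out, the dimensions of a digit capsule should learn to span the space of variations in the way digits of that class are instantiated. These variations include stroke thickness, skew and width. Using the decoder network, we can see what the individual dimensions represent: feed a perturbed version of the capsule's activity vector to the decoder and observe how the perturbation affects the reconstruction. We find that one of the 16 dimensions of the capsule almost always represents the width of the digit. Some dimensions represent combinations of global variations, while others represent variation in a localized part of the digit, for example the length of the ascender of a 6 or the size of its loop.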
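
The perturbation experiment can be sketched as follows. This only generates the perturbed activity vectors; passing each one through the trained decoder is left out. The function name `perturb_dimension` is hypothetical, and the range of -0.25 to 0.25 in steps of 0.05 is what I recall from the paper's figure, so treat those defaults as assumptions.

```python
def perturb_dimension(vector, dim, low=-0.25, high=0.25, step=0.05):
    """Return copies of `vector` in which only `vector[dim]` is shifted.

    One copy is produced per offset from `low` to `high` inclusive.
    The default range and step are assumed values, not taken from these notes.
    """
    n_steps = int(round((high - low) / step))
    variants = []
    for n in range(n_steps + 1):
        tweaked = list(vector)          # copy so the original stays untouched
        tweaked[dim] += low + n * step  # shift a single dimension
        variants.append(tweaked)
    return variants

activity = [0.1] * 16                   # stand-in for a 16D DigitCaps vector
variants = perturb_dimension(activity, dim=3)
print(len(variants))                    # 11 perturbed vectors
```

Decoding each variant and laying the reconstructions side by side shows what that one dimension controls, for example the width of the digit.
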

5.2. Robustness to Affine Transformations

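Affine transformations: spatial transformations such as translation, rotation and scaling.
The experiments show that each DigitCaps capsule learns a more robust representation for its class than a traditional convolutional network does. Because handwritten digits naturally vary in skew, rotation, style and so on, the trained CapsNet is moderately robust to small affine transformations of the training data.
To test the robustness of CapsNet to affine transformations, we trained a CapsNet and a traditional convolutional network (with MaxPooling and DropOut) on a padded and translated MNIST training set, in which each digit is placed at a random position on a $40\times 40$ black background.
We then tested these networks on the affNIST dataset, in which each example is an MNIST digit with a random small affine transformation. Our models were never trained with affine transformations other than translation.
An under-trained CapsNet with early stopping which achieved 99.23% accuracy on the expanded MNIST test set achieved 79% accuracy on the affNIST test set.
A traditional convolutional model with a similar number of parameters which achieved similar accuracy (99.22%) on the expanded MNIST test set only achieved 66% on the affNIST test set.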
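
The padded and translated training set can be pictured with a short sketch. This is plain Python operating on nested lists. The function name `place_on_canvas` and the choice of a uniformly random offset are assumptions for illustration.

```python
import random

def place_on_canvas(digit, canvas_size=40, rng=random):
    """Place a square image (list of rows) at a random offset on a black canvas.

    Uniform sampling of the offset is an assumption; rng can be the random
    module or a random.Random instance for reproducibility.
    """
    size = len(digit)
    top = rng.randint(0, canvas_size - size)    # randint is inclusive on both ends
    left = rng.randint(0, canvas_size - size)
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    for r in range(size):
        for c in range(size):
            canvas[top + r][left + c] = digit[r][c]
    return canvas

digit = [[1.0] * 28 for _ in range(28)]         # stand-in for a 28x28 MNIST digit
canvas = place_on_canvas(digit, rng=random.Random(0))
print(len(canvas), len(canvas[0]))              # 40 40
```

Only the position changes here, which is why the models count as never having seen rotation, scaling or shear during training.
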

6. Segmenting highly overlapping digits

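Dynamic routing can be viewed as a parallel attention mechanism that lets each capsule at one level attend to some active capsules at the level below and ignore the others. This should allow the model to recognize multiple objects in an image even when the objects overlap.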

6.1. MultiMNIST dataset

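Each MultiMNIST image is made by overlaying two MNIST digits on a $36\times 36$ canvas. Each digit can be regarded as bounded by a $20\times 20$ box, and the bounding boxes of the two digits overlap by 80% on average. The training set size is 60M and the test set size is 10M.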
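
One way to picture how such an overlapping pair could be generated is sketched below. Several details are my assumptions rather than facts from these notes: each digit is shifted by up to 4 pixels in each direction, the canvas is the digit size plus twice that shift, and the two digits are merged with a pixelwise maximum. The function names are hypothetical.

```python
import random

def shift_onto_canvas(digit, dy, dx, max_shift=4):
    """Copy a square image onto a canvas padded by max_shift, offset by (dy, dx)."""
    size = len(digit)
    canvas_size = size + 2 * max_shift  # assumed canvas size
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    for r in range(size):
        for c in range(size):
            canvas[max_shift + dy + r][max_shift + dx + c] = digit[r][c]
    return canvas

def overlay(digit_a, digit_b, max_shift=4, rng=random):
    """Shift two digits independently and merge them into one image."""
    shifted = []
    for digit in (digit_a, digit_b):
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        shifted.append(shift_onto_canvas(digit, dy, dx, max_shift))
    a, b = shifted
    # Pixelwise max is an assumed merge rule that keeps intensities in range.
    return [[max(p, q) for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

digit_a = [[1.0] * 28 for _ in range(28)]  # stand-ins for two MNIST digits
digit_b = [[0.5] * 28 for _ in range(28)]
merged = overlay(digit_a, digit_b, rng=random.Random(1))
print(len(merged), len(merged[0]))         # 36 36
```

With shifts this small relative to the digit size, the two digits always overlap heavily, which is what makes the segmentation task hard.
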

6.2. MultiMNIST results

Our 3 layer CapsNet model trained from scratch on MultiMNIST training data achieves higher test classification accuracy than our baseline convolutional model.

7. Other datasets

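The paper also tests performance on CIFAR10 and smallNORB.
One drawback that capsules share with generative models is that they like to account for everything in the image, so they do better when they can model the clutter than when they just use an additional "orphan" category in the dynamic routing.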

8. Discussion and previous work

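For thirty years, the state of the art in speech recognition used hidden Markov models with Gaussian mixtures as output distributions. These models were easy to learn on small computers, but compared with RNNs that use distributed representations they have a fatal representational limitation: they encode information in a one-hot-like way, which is inefficient. To double the amount of information an HMM can remember, the number of hidden nodes must be squared; for a recurrent net, the number of hidden neurons only needs to be doubled.
Capsules aim to solve this kind of representational limitation: they target a weakness of CNNs analogous to the one that separates GMM-HMMs from RNNs, and the paper's experiments also show their strong representational power.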
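
The squaring-versus-doubling argument is easy to verify numerically. The sketch below idealises both models: an HMM's hidden state is treated as a one-of-N code carrying $\log_2 N$ bits, and a recurrent net's hidden layer as N binary units carrying N bits. Both are simplifying assumptions, and the function names are made up for illustration.

```python
import math

def hmm_bits(num_states):
    """A one-of-N hidden state distinguishes N alternatives, i.e. log2(N) bits."""
    return math.log2(num_states)

def binary_units_bits(num_units):
    """N binary units in a distributed code distinguish 2**N patterns, i.e. N bits."""
    return float(num_units)

n = 16
print(hmm_bits(n), hmm_bits(n ** 2))                   # 4.0 8.0
print(binary_units_bits(n), binary_units_bits(2 * n))  # 16.0 32.0
```

Going from 16 to 256 states is what it takes to double the HMM's capacity from 4 to 8 bits, whereas the distributed code doubles its capacity by going from 16 to 32 units.
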

Reflections

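The weakness of ConvNets is that their feature extraction amounts to checking whether a given feature appears in the image, ignoring the relative positions of the feature objects. To me, one innovation of capsules is representing an object with a vector that conveys not only the probability that the object is present but also its position and orientation, so the extracted features represent the original image more powerfully. Another innovation is learning the coupling weights with the dynamic routing algorithm, which feels a bit like the Hebb rule, although the other parts still have to be learned by backpropagation. I have also read a few papers on STDP learning before, and I would like to tie them together when I have time.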