Resnet in Resnet: Generalizing Residual Architectures

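Abstract:

Residual networks (ResNets) have achieved state-of-the-art results on computer vision tasks. We propose Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easy to implement, with no computational overhead. RiR further improves on ResNets (on CIFAR-10, using the same data augmentation as ResNets) and establishes a new state of the art on CIFAR-100.

Summary: the paper proposes ResNet Init and the ResNet in ResNet architecture. The analysis in the paper does not go very deep.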

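1. Introduction

ResNets achieved state-of-the-art results on the ILSVRC 2015 classification task and make it possible to train networks up to 1000 layers deep. Like highway networks, residual networks use identity shortcut connections, which let information flow across layers without attenuation and thereby improve optimization [1]. In residual networks, a shortcut connection links two layers directly, without any transformation. Although ResNets show large empirical gains, current residual architectures have several potential limitations. Identity connections cause features of different levels to accumulate at every layer, even though in a deep network some features learned by earlier layers may no longer provide useful information in later layers.

A hypothesis behind the ResNet architecture is that learning identity weights is difficult; by the same argument, it is difficult to learn the additive inverse of the identity weights needed to remove information from the representation at any given layer. The fixed-size layer structure of the residual block also forces residuals to be learned by shallow subnetworks, despite evidence that deeper networks are more expressive. We introduce a generalized residual architecture that combines residual networks and standard convolutional networks in parallel residual and non-residual streams. We show that generalized residual blocks retain the optimization properties of identity shortcut connections while improving expressivity and making it easier to remove unwanted information. We then present an architecture, ResNet in ResNet (RiR), that incorporates these generalized residual blocks and achieves state-of-the-art results on CIFAR-100.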

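2. Generalizing Residual Network Architectures

The modular unit of the generalized residual network architecture is the generalized residual block, which consists of two parallel streams: a residual stream $\text{r}$ and a transient stream $\text{t}$ . The residual stream uses identity shortcut connections like those in ResNet, while the transient stream is a standard convolutional layer. In addition, two sets of filters perform cross-stream convolutions between the two streams ( $W_{l,\text{r}\rightarrow \text{t}}$ and $W_{l,\text{t} \rightarrow \text{r}}$ ):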

$\text{r}_{l+1}=\sigma(\text{conv}(\text{r}_{l},W_{l,\text{r} \rightarrow \text{r}}) +\text{conv}(\text{t}_{l},W_{l,\text{t} \rightarrow \text{r}}) + \text{shortcut}(\text{r}_{l}))$

$\text{t}_{l+1}=\sigma(\text{conv}(\text{r}_{l},W_{l,\text{r} \rightarrow \text{t}}) +\text{conv}(\text{t}_{l},W_{l,\text{t} \rightarrow \text{t}}))$

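The $\text{r}$ stream preserves the optimization properties of residual units, while the $\text{t}$ stream allows features extracted by earlier layers to be discarded. Figure 1 shows the structure of the generalized residual block.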

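If the weights of the $\text{r}$ stream are zero, the generalized residual block is equivalent to a standard convolutional layer; if the weights of the $\text{t}$ stream are zero, it is equivalent to a standard residual block. By stacking generalized residual blocks (Figure 1b), the network can learn a range of structures between these two extremes (for example, Figure 1c). The generalized residual block therefore increases the network's capacity to process information. It is not limited to CNNs and can be used in other types of networks as well. Replacing each conv layer in the original residual block with a generalized residual block (Figure 1b) yields a new architecture, ResNet in ResNet (RiR, Figure 1d). Figure 2 summarizes the relationship between the CNN, ResNet Init, ResNet, and RiR architectures.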
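The two limiting cases described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code. It makes several assumptions: fully connected layers stand in for the convolutions, $\sigma$ is taken to be ReLU, the nonlinearity is applied to the summed pre-activations (as in the single-op form of Section 6.1), and the function and variable names are hypothetical.

```python
import numpy as np

def relu(x):
    # Assumed choice for the nonlinearity sigma.
    return np.maximum(x, 0.0)

def generalized_residual_block(r, t, W_rr, W_tr, W_rt, W_tt):
    """Fully connected stand-in for one generalized residual block.

    r, t : 1-D activations of the residual and transient streams.
    W_ab : weight matrix mapping stream a to stream b (applied as W @ v).
    """
    r_next = relu(W_rr @ r + W_tr @ t + r)  # identity shortcut on r only
    t_next = relu(W_rt @ r + W_tt @ t)      # transient stream has no shortcut
    return r_next, t_next

rng = np.random.default_rng(0)
n = 4
r, t = rng.normal(size=n), rng.normal(size=n)
W_rr, W_tt = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Z = np.zeros((n, n))

# All weights touching r zeroed: the transient stream is a plain layer.
_, t_plain = generalized_residual_block(r, t, Z, Z, Z, W_tt)
# All weights touching t zeroed: the residual stream is a one-layer residual unit.
r_res, _ = generalized_residual_block(r, t, W_rr, Z, Z, Z)
```

In this sketch `t_plain` equals `relu(W_tt @ t)` and `r_res` equals `relu(W_rr @ r + r)`, which are the standard-layer and residual-unit cases respectively.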

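3. Experiments

Experiments use the CIFAR-10 and CIFAR-100 datasets. With hyperparameters chosen by grid search, the method reaches the state-of-the-art results reported below. The selected hyperparameters are: SGD with momentum 0.9, minibatch size 500, L2 penalty 0.0001, and 82 training epochs. The learning rate is divided by 10 at epochs 42 and 62, and parameters are initialized with MSR initialization. 3x3 convolutional projections are used when increasing dimensions. Both streams have the same number of filters; a hyperparameter search over how filters are allocated between the two streams might yield further gains.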
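The training settings above can be written down as a small configuration plus a step schedule. This is only an illustrative sketch: the text does not state the initial learning rate, so `base_lr` is left as a required argument, and treating the drop as taking effect from epoch 42 and epoch 62 onward (with whatever epoch indexing the caller uses) is an assumption. The names `TRAIN_CONFIG` and `learning_rate` are hypothetical.

```python
# Settings taken from the text; the key names are illustrative.
TRAIN_CONFIG = {
    "momentum": 0.9,
    "batch_size": 500,
    "l2_penalty": 1e-4,
    "epochs": 82,
    "lr_milestones": (42, 62),
    "lr_divisor": 10.0,
}

def learning_rate(epoch, base_lr, config=TRAIN_CONFIG):
    """Divide base_lr by lr_divisor once for every milestone already reached."""
    drops = sum(1 for m in config["lr_milestones"] if epoch >= m)
    return base_lr / (config["lr_divisor"] ** drops)
```

For example, with a placeholder `base_lr` of 0.1, the schedule yields 0.1 before epoch 42, about 0.01 from epoch 42 to 61, and about 0.001 from epoch 62 onward.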

In our experiments, the ResNet Init architecture outperforms the standard CNN, and the RiR architecture outperforms the original ResNet (Table 3). We find that the RiR architecture performs well across a range of numbers of blocks and layers in each block, and that ResNet Init applied to existing architectures, such as ALL-CNN-C (Springenberg et al., 2014), yields an improvement over standard initialization (Tables 4, 5). Because each stream uses only half of the total filters, we also evaluate our architecture on a wider 18-layer network (Tables 1, 2). This RiR architecture is quite effective: it obtains competitive results on CIFAR-10 (using only random crops and horizontal flips for data augmentation) and state-of-the-art results on CIFAR-100.

We also run an ablation on a trained ResNet Init model, visualizing the effect of zeroing the learned connections of each stream one layer at a time (Figure 3). The results show that both streams contribute to accuracy and that the relative use of the residual and transient streams changes at different stages of processing. Figure 4 shows that the RiR architecture is robust to increasing the depth of residual blocks and allows us to train deeper residual networks than the original ResNet.

4. Related Work

Interacting transformation streams in which only one stream contains shortcut connections are also used within the blocks of LSTM and Grid-LSTM networks. However, in contrast to highway networks, which control flow through shortcut connections via input-dependent carry and transform gates, and to the memory and hidden states of LSTM and Grid-LSTM blocks, the flow of information between the residual and transient states of the generalized residual block does not use gates and can thus be implemented with no additional parameters over a standard feedforward network. Another difference between prior architectures and our generalized residual architecture is that, whereas the memory ( $\text{m}$ ) and hidden states ( $\text{h}$ ) of an LSTM or Grid-LSTM block are computed sequentially (i.e. $\text{h}_{l}=\text{o}_{l} \odot \tanh(\text{m}_{l})$ ), the residual and transient streams are computed in parallel and depend only on the learned convolutional filters at each layer, without further constraints on their relation. The SCRN architecture of Mikolov et al. (2014) also uses hidden and context units together within a single layer to learn longer-term information, and these behave similarly to the transient and residual streams. However, SCRN only allows unidirectional flow from context to hidden units, and connections between context units are fixed, in contrast to the bidirectional flow between streams and the learned connections for both transient and residual streams in our generalized residual architecture.

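5. Conclusion

We present a generalized residual architecture that can be implemented with a simple modification to the initialization scheme (ResNet Init). Applying ResNet Init to the original ResNet yields the RiR architecture, which achieves state-of-the-art results. Future work includes further study of RiR and related residual models to better understand the cause of their beneficial effects.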

References:

[1]: Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pp. 2368–2376, 2015.

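6. Appendix

6.1 Implementation of the Generalized Residual Block

The generalized residual block can be implemented by modifying a standard convolutional or fully connected layer. The modified layer (ResNet Init) combines the identity shortcut with the desired linear transformation (convolution or matrix multiplication), and concatenates the residual stream $\text{r}$ and the transient stream $\text{t}$ into a single tensor $\text{x}$ . Because the outputs of the identity shortcut, the same-stream transformation, and the cross-stream transformation are summed to obtain the output of each stream, the operations on $\text{r}$ and $\text{t}$ can be merged into a single linear op on $\text{x}$ (shown below for the fully connected case).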

$\text{x}_{l+1}=\sigma(W'_{l}\text{x}_{l}) \Leftrightarrow \begin{bmatrix} \text{r}_{l+1} \\ \text{t}_{l+1} \end{bmatrix} = \sigma\left(\left(\begin{bmatrix} W_{l,\text{r} \rightarrow \text{r}} & W_{l,\text{t} \rightarrow \text{r}} \\ W_{l,\text{r} \rightarrow \text{t}} & W_{l,\text{t} \rightarrow \text{t}} \end{bmatrix} + \begin{bmatrix} \text{I} & 0 \\ 0 & 0 \end{bmatrix}\right) \begin{bmatrix} \text{r}_{l} \\ \text{t}_{l} \end{bmatrix}\right)$
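As a rough illustration of the fully connected case, the combined weight $W'_{l}$ can be assembled from the four blocks plus an identity in the top-left corner, and then applied as one matrix multiplication. The NumPy sketch below assumes ReLU for $\sigma$ and column-vector activations; `resnet_init_weight` and the other names are hypothetical and not taken from the paper.

```python
import numpy as np

def resnet_init_weight(W_rr, W_tr, W_rt, W_tt):
    """Assemble W' = [[W_rr, W_tr], [W_rt, W_tt]] + [[I, 0], [0, 0]].

    W_ab maps stream a to stream b, so W_tr has shape (n_r, n_t)
    and W_rt has shape (n_t, n_r).
    """
    n_r = W_rr.shape[0]
    W = np.block([[W_rr, W_tr], [W_rt, W_tt]])
    shortcut = np.zeros_like(W)
    shortcut[:n_r, :n_r] = np.eye(n_r)  # identity shortcut on the r stream only
    return W + shortcut

rng = np.random.default_rng(0)
n_r, n_t = 3, 2
W_rr, W_tr = rng.normal(size=(n_r, n_r)), rng.normal(size=(n_r, n_t))
W_rt, W_tt = rng.normal(size=(n_t, n_r)), rng.normal(size=(n_t, n_t))
r, t = rng.normal(size=n_r), rng.normal(size=n_t)

# Single linear op on the concatenated tensor x = [r; t], then ReLU (assumed sigma).
x_next = np.maximum(resnet_init_weight(W_rr, W_tr, W_rt, W_tt) @ np.concatenate([r, t]), 0.0)
r_next, t_next = x_next[:n_r], x_next[n_r:]
```

Here `r_next` matches `relu(W_rr @ r + W_tr @ t + r)` and `t_next` matches `relu(W_rt @ r + W_tt @ t)`, so the merged op reproduces the two-stream computation without extra parameters.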
