Xception paper notes

Xception: Deep Learning with Depthwise Separable Convolutions

Abstract

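An Inception module can be seen as an intermediate step between a regular convolution and a depthwise separable convolution. In this light, a depthwise separable convolution can be understood as an Inception module with the maximum possible number of towers (a tower is one of the parallel paths inside an Inception module). Building on this observation, the author proposes Xception, an architecture inspired by Inception. On ImageNet, Xception performs slightly better than Inception v3; on a larger dataset, it clearly outperforms Inception v3. Because Xception has about the same number of parameters as Inception v3, the gain does not come from increased model capacity but from a more efficient use of the model parameters.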

depthwise separable convolution: equivalent to a depthwise convolution followed by a pointwise (1x1) convolution.
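To make the definition concrete, here is a minimal pure-Python sketch of the operation. It assumes a channels-first input stored as nested lists, stride 1, no padding, and no biases; the function names are my own and are not taken from the paper or from any framework.

```python
def depthwise_conv(x, dw_kernels):
    """Spatial convolution applied to each channel independently.

    x:          [C][H][W] input
    dw_kernels: [C][k][k], one k x k kernel per input channel
    """
    k = len(dw_kernels[0])
    height, width = len(x[0]), len(x[0][0])
    out = []
    for c, plane in enumerate(x):
        out.append([
            [
                sum(plane[i + a][j + b] * dw_kernels[c][a][b]
                    for a in range(k) for b in range(k))
                for j in range(width - k + 1)
            ]
            for i in range(height - k + 1)
        ])
    return out


def pointwise_conv(x, pw_weights):
    """1x1 convolution: mixes channels at every spatial position.

    x:          [C_in][H][W]
    pw_weights: [C_out][C_in]
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w * x[c][i][j] for c, w in enumerate(row_w))
             for j in range(width)]
            for i in range(height)
        ]
        for row_w in pw_weights
    ]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise convolution followed by a pointwise convolution."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Tiny example: 2 input channels of size 3x3, 2x2 depthwise kernels, 2 output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
dw = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
pw = [[1, 1], [1, -1]]
y = depthwise_separable_conv(x, dw, pw)
```

Note that the spatial step never mixes channels, and the 1x1 step never looks at neighbouring pixels.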

1. Introduction

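Inception modules come in many versions; Figure 1 shows the canonical Inception module used in Inception v3. An Inception model is a stack of such modules. This differs from earlier VGG-style networks, which stack plain convolutional layers.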

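Conceptually, Inception modules are similar to convolutional layers (both are convolutional feature extractors), and empirically they appear able to learn richer representations with fewer parameters. How do they work, and how do they differ from regular convolutions? What design principle lies behind Inception?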

1.1. The Inception hypothesis

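A convolutional layer tries to learn filters in a 3D space (two spatial dimensions, width and height, plus a channel dimension). A single convolution kernel is therefore responsible for mapping cross-channel correlations and spatial correlations at the same time.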

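The idea behind the Inception module is to make this process easier and more efficient by factoring it into operations that look at cross-channel correlations and spatial correlations separately. Concretely, a typical Inception module first handles cross-channel correlations with a set of 1x1 convolutions, mapping the input into 3 or 4 separate spaces that are smaller than the original input space; it then maps the correlations within each of these smaller spaces with regular 3x3 or 5x5 convolutions. In effect, the fundamental hypothesis behind Inception is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly.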

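Consider the simplified Inception module in Figure 2. It can be reformulated as the structure in Figure 3: one large 1x1 convolution followed by spatial convolutions that each operate on a non-overlapping segment of the output channels. This view raises some questions: what is the effect of the number of segments? Is it reasonable to make a stronger hypothesis than Inception's? Can the mapping of cross-channel correlations and the mapping of spatial correlations really be separated completely?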

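Inception v3 also replaces 7x7 convolutions with 7x1 and 1x7 convolutions, which decouples the height and width directions. Such spatially separable convolutions have a long history in image processing and have been used in some convolutional network implementations since at least 2012 (possibly earlier).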
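A small sketch of why this factorization saves weights: applying an n x 1 kernel and then a 1 x n kernel is equivalent to a single n x n kernel that is the outer product of the two, so it uses 2n weights instead of n². The sketch assumes a purely linear case (no non-linearity between the two steps, unlike a real network), a single channel, and a 3x3 demo instead of 7x7 to keep the numbers small; all helper names are hypothetical.

```python
def conv2d_valid(img, kernel):
    """Single-channel 2D cross-correlation, stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    height, width = len(img), len(img[0])
    return [
        [
            sum(img[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(width - kw + 1)
        ]
        for i in range(height - kh + 1)
    ]


def outer(col, row):
    """Rank-1 n x n kernel built from a column vector and a row vector."""
    return [[c * r for r in row] for c in col]


def weight_counts(n):
    """(weights in an n x 1 plus a 1 x n kernel, weights in a full n x n kernel)."""
    return 2 * n, n * n


img = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]
col, row = [1, 2, 1], [1, 0, -1]

col_kernel = [[c] for c in col]   # shape 3 x 1
row_kernel = [row]                # shape 1 x 3

two_step = conv2d_valid(conv2d_valid(img, col_kernel), row_kernel)
one_step = conv2d_valid(img, outer(col, row))

print(two_step == one_step)
print(weight_counts(7))
```

The equivalence only covers kernels that can be written as such an outer product; a general 7x7 kernel cannot be factored exactly this way.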

1.2. The continuum between regular convolutions and separable convolutions

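When an Inception module has a very large number of towers, we can first use a 1x1 convolution to map cross-channel correlations and then map the spatial correlations of every output channel separately (Figure 4). The author therefore argues that, as the number of towers grows very large, an Inception module and a depthwise separable convolution become nearly the same thing.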

Depthwise separable convolutions were used in network design as early as 2014 [15]. They have become more popular since TensorFlow added the op in 2016.

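In TensorFlow and Keras, "separable convolution" refers to the depthwise separable convolution (spatial correlations are mapped first, per channel, and cross-channel correlations are mapped afterwards). This is not the same as the spatially separable convolution, which is what "separable convolution" usually means in image processing. Do not confuse the two.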

There are two minor differences between an "extreme" Inception module and a depthwise separable convolution:

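Order of the operations: the depthwise separable convolution in TensorFlow performs the channel-wise spatial convolution first and the 1x1 convolution second. Inception does it the other way round.
Non-linearity after the first operation: in Inception, both operations are followed by a ReLU, whereas depthwise separable convolutions are usually implemented without a non-linearity between the two operations.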

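The first difference is unimportant, because these operations are meant to be used in a stacked setting. The second difference may matter, and the author investigates it experimentally (Figure 10).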

We also note that other intermediate formulations of Inception modules that lie in between regular Inception modules and depthwise separable convolutions are also possible: in effect, there is a discrete spectrum between regular convolutions and depthwise separable convolutions, parametrized by the number of independent channel-space segments used for performing spatial convolutions. A regular convolution (preceded by a 1x1 convolution), at one extreme of this spectrum, corresponds to the single-segment case; a depthwise separable convolution corresponds to the other extreme where there is one segment per channel; Inception modules lie in between, dividing a few hundreds of channels into 3 or 4 segments. The properties of such intermediate modules appear not to have been explored yet.
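The spectrum described in this passage can be illustrated by counting the weights of the spatial convolution as the number of segments grows. This is a rough sketch under simplifying assumptions that are mine, not the paper's: all segments have equal size, every segment uses the same k x k kernel and keeps its channel count, and biases are ignored. Real Inception towers differ in width and kernel size.

```python
def spatial_conv_params(channels, k, segments):
    """Weights of a k x k spatial convolution split into independent segments.

    Each segment maps channels/segments input channels to the same number
    of output channels, so it holds (channels/segments)^2 * k * k weights.
    """
    if channels % segments != 0:
        raise ValueError("channels must be divisible by segments")
    per_segment = channels // segments
    return segments * per_segment * per_segment * k * k


channels, k = 256, 3
for segments in (1, 4, 16, 256):
    # segments=1:        regular convolution
    # segments=4:        Inception-like module
    # segments=channels: depthwise convolution (one segment per channel)
    print(segments, spatial_conv_params(channels, k, segments))
```

Under these assumptions the weight count falls in proportion to the number of segments, and the depthwise end of the spectrum needs only channels × k × k weights for the spatial step.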

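Based on these observations, the author replaces the Inception modules in the Inception architecture with depthwise separable convolutions and evaluates the resulting architecture on two large-scale datasets.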

2. Prior work

VGG
Inception
Depthwise separable convolutions
ResNet

3. The Xception architecture

The author proposes the Xception architecture, which in effect rests on the following hypothesis:

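The mapping of cross-channel correlations and the mapping of spatial correlations can be entirely decoupled. This hypothesis is stronger than the one underlying Inception.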

Origin of the name Xception: "Extreme Inception".

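Figure 5 gives the full description of the Xception architecture. Its feature extraction base consists of 36 convolutional layers, almost all of them depthwise separable convolutions. The 36 layers are organized into 14 modules, and every module except the first and the last has a residual connection around it.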
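As a sanity check on these counts, the module layout can be tallied in a few lines. The breakdown below is my reading of Figure 5 of the paper (entry, middle, and exit flow) and should be treated as an assumption; it counts only the convolutions on the main path, not the 1x1 convolutions on the residual shortcuts. In this reading, the first two layers are regular convolutions and the remaining 34 are depthwise separable.

```python
# Each entry: (flow, number of conv layers in the module, has residual connection).
modules = (
    [("entry", 2, False)]         # two regular 3x3 convolutions, no residual
    + [("entry", 2, True)] * 3    # three modules of two separable convolutions
    + [("middle", 3, True)] * 8   # eight modules of three separable convolutions
    + [("exit", 2, True)]         # one module of two separable convolutions
    + [("exit", 2, False)]        # final two separable convolutions, no residual
)

total_layers = sum(n for _, n, _ in modules)
total_modules = len(modules)
residual_modules = sum(1 for _, _, has_res in modules if has_res)

print(total_layers, total_modules, residual_modules)
```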

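In short, the Xception architecture is a stack of depthwise separable convolution layers with residual connections. This makes the architecture very easy to define (30 to 40 lines of code are enough) and to modify. Keras includes an open-source implementation of Xception: https://keras.io/applications/#xception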

4. Experimental evaluation

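The author compares Xception with Inception v3 because the two are similar in scale and parameter count, so any performance gap cannot be attributed to a difference in model capacity. Experiments are run on the ImageNet dataset and on the JFT dataset.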

4.1. The JFT dataset

JFT is an internal Google dataset for large-scale image classification, first introduced by Hinton et al. in [5], which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes. To evaluate the performance of a model trained on JFT, we use an auxiliary dataset, FastEval14k.

FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average). On this dataset we evaluate performance using Mean Average Precision for top 100 predictions (MAP@100), and we weight the contribution of each class to MAP@100 with a score estimating how common (and therefore important) the class is among social media images. This evaluation procedure is meant to capture performance on frequently occurring labels from social media, which is crucial for production models at Google.

4.2. Optimization configuration

On ImageNet:

Optimizer: SGD
Momentum: 0.9
Initial learning rate: 0.045
Learning rate decay: decay of rate 0.94 every 2 epochs

On JFT:

Optimizer: RMSprop
Momentum: 0.9
Initial learning rate: 0.001
Learning rate decay: decay of rate 0.9 every 3,000,000 samples
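The two learning rate schedules above can be written as one step-decay function. This sketch assumes a staircase schedule (the rate drops in discrete steps rather than continuously), which the notes above do not state explicitly; `step_decay` is a hypothetical helper, not a framework API.

```python
def step_decay(initial_lr, decay_rate, step_size, progress):
    """Learning rate after `progress` units of training.

    `progress` and `step_size` share a unit (epochs or samples); the rate is
    multiplied by `decay_rate` once per completed `step_size`.
    """
    return initial_lr * decay_rate ** (progress // step_size)


def imagenet_lr(epoch):
    # 0.045 initially, multiplied by 0.94 every 2 epochs
    return step_decay(0.045, 0.94, 2, epoch)


def jft_lr(samples_seen):
    # 0.001 initially, multiplied by 0.9 every 3,000,000 samples
    return step_decay(0.001, 0.9, 3_000_000, samples_seen)


for epoch in (0, 1, 2, 10):
    print(epoch, imagenet_lr(epoch))
```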

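Xception and Inception v3 are trained with the same optimization configuration. That configuration was tuned for Inception v3 and was not adjusted for Xception. Because the two networks have different training profiles (Figure 6), it may be suboptimal for Xception, especially on ImageNet, where the optimization configuration had been carefully tuned for Inception v3.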

In addition, all models are evaluated using Polyak averaging [13] at inference time.
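As a reminder of what Polyak averaging does: inference uses an average of the weights visited during training rather than the final weights. The sketch below uses the classical uniform running average over all iterates. That choice is an assumption; the notes do not say whether the experiments use a uniform average or an exponential moving average, and many practical implementations use the latter.

```python
def polyak_average(iterates):
    """Uniform running average of a sequence of weight vectors.

    iterates: iterable of equal-length lists of floats, one per training step.
    Returns the element-wise mean, computed incrementally.
    """
    avg = None
    for t, weights in enumerate(iterates, start=1):
        if avg is None:
            avg = list(weights)
        else:
            # Incremental mean: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t
            avg = [a + (w - a) / t for a, w in zip(avg, weights)]
    return avg


# Three training steps of a toy two-parameter model.
history = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = polyak_average(history)
```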

4.3. Regularization configuration

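Weight decay: Inception v3 uses a weight decay (L2 regularization) rate of $4 \times 10^{-5}$; Xception uses a rate of $1 \times 10^{-5}$. The author did not run an extensive search for the optimal weight decay rate. The same weight decay rates are used on both datasets.
Dropout: for all ImageNet experiments, a dropout layer with rate 0.5 is added before the logistic regression layer. For the JFT experiments, no dropout is used, because the dataset is so large that overfitting is unlikely.
Auxiliary loss tower: the auxiliary loss tower of the Inception v3 architecture is not used in any of the experiments in this paper.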
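For reference, here is how the two weight decay rates enter the training loss when weight decay is implemented as an L2 penalty. The exact form is an assumption: this sketch uses rate × Σw², while some frameworks include an extra factor of 1/2. The weights are toy values chosen only for illustration.

```python
def l2_penalty(weights, rate):
    """L2 regularization term added to the loss: rate * sum of squared weights."""
    return rate * sum(w * w for w in weights)


toy_weights = [1.0, -2.0, 0.5]

inception_v3_penalty = l2_penalty(toy_weights, 4e-5)
xception_penalty = l2_penalty(toy_weights, 1e-5)

# With identical weights, the Inception v3 rate penalizes four times as strongly.
print(inception_v3_penalty / xception_penalty)
```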