Introduction

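Approach
1. For mobile vision and big-data scenarios, computational efficiency and parameter count matter
2. Scale up the network while keeping the added computation small
  1. Four scale-up principles
  2. Concrete techniques:
    1. Factorized convolutions
    2. Regularization
Rationale
1. What makes Inception more flexible:
  1. Generous use of dimensional reduction
  2. The parallel structure of Inception softens the impact of structural changes to a single branch
2. Why factorized convolutions work:
  1. Inception networks are fully convolutional, so each weight corresponds to one multiplication per activation.
  2. Therefore any reduction in computational cost also reduces the number of params. We factorize convolutions to cut the parameter count -> faster training. (Faster only relative to a network of the same size.)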
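The link between weights and multiplications can be checked with a small count. This is a minimal sketch, assuming a stride-1 convolution without bias; the function names and the layer sizes are illustrative and do not come from the paper.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Number of weights in a k_h x k_w convolution (bias ignored)."""
    return k_h * k_w * c_in * c_out


def conv_multiply_adds(k_h, k_w, c_in, c_out, out_h, out_w):
    """Multiply-adds for one forward pass.

    Every weight is used exactly once per output position, so the cost
    is the parameter count times the number of output positions.
    """
    return conv_params(k_h, k_w, c_in, c_out) * out_h * out_w


# Hypothetical layer: 3x3 conv, 64 -> 96 channels, 35x35 output grid.
params = conv_params(3, 3, 64, 96)
madds = conv_multiply_adds(3, 3, 64, 96, 35, 35)
print(params, madds, madds // params)
```

With a fixed output grid the two quantities differ only by a constant factor, which is why a cheaper factorization of the same layer also has fewer parameters.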
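Main Contributions and Strengths
1. Proposes four principles for scaling up CNNs and instantiates them in the Inception architecture (v3), improving accuracy with only a moderate increase in computational cost
2. Single-crop and multi-model, multi-crop results on the ILSVRC 2012 classification dataset surpass the state of the art, at roughly 6x lower computational cost than the best previous model
3. Inception-v3 costs only about 2.5 times as much computation as GoogLeNet, yet its depth grows from 22 to 42 layers, and it is much faster than VGG
4. The model is fairly robust to changes (as long as the four principles are respected)
ILSVRC 2012 Classification Result
1. 25 million params + 5 billion multiply-adds/inference
2. 21.2% top-1 and 5.6% top-5 error for single frame evaluation (single model, single crop)
3. 17.3% top-1 and 3.5% top-5 error for 4 models and multi-crop evaluation.
Weaknesses
1. The principles are still unproven; applying them directly will not necessarily improve model performance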

General Design Principles

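The following principles are speculative; more experiments are needed to confirm that they are correct and actually help model performance.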

Principle 1

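Content: avoid representational bottlenecks (extreme compression), especially early in the network; the representation size should decrease gently
Rationale: a sharp drop in the dimensionality of a representation discards important factors (e.g. correlation structure); dimensionality gives only a rough estimate of information content.
Application: the dimension-reduction scheme in Fig 10; Fig 9 left is a case of violation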

Principle 2

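Content: higher-dimensional representations can be processed locally within the network
e.g.: the reason for Inception's multi-branch split
Rationale: increasing the activations per layer yields more disentangled features, which makes the network train faster.
Application: Fig 7: on the coarsest grid (where the channel dimension is highest), expand the filter banks to produce high-dimensional sparse representations
1. Input: 8*8*1280
2. Output: 8*8*2048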

Principle 3

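Content: spatial aggregation can be done over lower-dimensional embeddings without losing representational power
e.g.: use a 1*1 conv to reduce dimensions before a 3*3 conv
Rationale: adjacent units are highly correlated (a result of stacking filter banks), so dimension reduction loses little information
Application: Fig 5, one 5*5 replaced by two 3*3 (why this principle?)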
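The 1*1-before-3*3 example can be made concrete by counting weights. A minimal sketch, assuming square kernels without bias; the channel counts (256 in, 64 after reduction, 256 out) are hypothetical values chosen for illustration, not figures from the paper.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


# Hypothetical channel counts, for illustration only.
c_in, c_reduced, c_out = 256, 64, 256

# Option A: 3x3 convolution applied directly to the wide input.
direct = conv_weights(3, c_in, c_out)

# Option B: 1x1 reduction to a low-dimensional embedding, then the 3x3.
reduced = conv_weights(1, c_in, c_reduced) + conv_weights(3, c_reduced, c_out)

print(direct, reduced, round(reduced / direct, 3))
```

Under these assumed sizes the reduced path needs a little over a quarter of the weights. Whether accuracy is preserved depends on how correlated the input channels are, which is exactly what Principle 3 conjectures.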

Principle 4

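Content: balance width & depth
Rationale: scaling up improves model quality; balancing width against depth keeps the increase in computation to a minimum
Application: in the Inception-v3 architecture, the choice of filter bank sizes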

Improvements over v2

Factorizing Large Convolution Kernels

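Factorization into smaller convolutions
1. Method: e.g. one 5*5 is replaced by two stacked 3*3, which still preserves translation invariance
2. Parameter accounting:
  1. Fewer params, thanks to weight sharing
  2. For n = alpha*m, each of the two steps of the factorization can scale #filters by sqrt(alpha)
  3. alpha = (C_out*H*W)/(C_in*H*W) for this layer
  4. In GoogLeNet alpha is around 1.5
  5. Because the 5*5 conv is aggregating, alpha is usually slightly larger than 1, since C_out is larger
3. Does this cause a loss of expressiveness? (No)
4. Should the first layer keep a linear activation? (No)
  1. According to Fig 2:
    1. Using ReLU at every stage of the factorization works better, because non-linear activations increase the network's expressive power.
    2. The network is also deeper, so it is more expressive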
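The parameter accounting above can be sketched in a few lines. This assumes stride-1 convolutions and the same channel count in every layer, so only the spatial weights per channel pair are compared; the helper names are illustrative.

```python
import math


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions stacked in sequence."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def weights_per_channel_pair(kernel_sizes):
    """Spatial weights per (input, output) channel pair, summed over the stack."""
    return sum(k * k for k in kernel_sizes)


single_5x5 = weights_per_channel_pair([5])      # 25
two_3x3 = weights_per_channel_pair([3, 3])      # 9 + 9 = 18
saving = 1 - two_3x3 / single_5x5               # about 0.28

# Both options look at the same 5x5 input window.
same_window = stacked_receptive_field([3, 3]) == stacked_receptive_field([5])

# If the layer expands channels by alpha overall (n = alpha * m),
# each of the two steps can expand by sqrt(alpha).
alpha = 1.5
per_step = math.sqrt(alpha)
print(single_5x5, two_3x3, round(saving, 2), same_window, round(per_step, 3))
```

The roughly 28% reduction holds only under the equal-channel assumption; when the two steps each expand by sqrt(alpha), the exact figure changes but the comparison runs the same way.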
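Spatial factorization into asymmetric convolutions
1. Method: 3*3 = 1*3 + 3*1 saves 33% of the parameters, whereas splitting into two 2*2 saves only 11%; the larger n is, the more params are saved (here n = 3)
2. Works better on medium grid sizes than in lower layers (why?) (feature maps of size 12-20)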
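The 33% and 11% figures follow from counting spatial weights. A minimal sketch, assuming equal input and output channel counts so that only kernel area matters; `asymmetric_saving` is a hypothetical helper name.

```python
def asymmetric_saving(n):
    """Fraction of weights saved by replacing n x n with 1 x n followed by n x 1."""
    return 1 - (2 * n) / (n * n)


# 3x3 -> 1x3 + 3x1: 6 weights instead of 9.
saving_asymmetric_3 = asymmetric_saving(3)

# 3x3 -> two 2x2: 8 weights instead of 9.
saving_two_2x2 = 1 - (2 * 2 * 2) / (3 * 3)

print(round(saving_asymmetric_3, 3), round(saving_two_2x2, 3))

# The saving grows with n, since the cost drops from n*n to 2*n.
for n in (3, 5, 7):
    print(n, round(asymmetric_saving(n), 3))
```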

Auxiliary Classifier

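Differences from v1:
1. New finding: it only helps late in training
2. The lower auxiliary branch is useless; it does not help low-level features
3. Different purpose:
  1. v1: help the model converge by giving lower layers a useful descent direction, countering vanishing gradients in deep networks.
  2. v3: acts as a regularizer: performance improves if BN or dropout is used in the auxiliary classifier
Performance gain: adding BN gives a 0.4% absolute improvement in top-1 (applied on the 17*17 layer, batch_size = 32)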

Efficient Grid Size Reduction

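Fig 9:
1. pooling -> inception: cost = 2(d/2)^2 * k^2; violates Principle 1, because pooling creates a representational bottleneck
2. inception -> pooling: cost = 2(d^2) * k^2, which is 4 times as expensive
3. Both produce a representation of size (d/2)^2 * (2k)
Fig 10
1. Motivation: Principle 1
2. conv (3*3, stride 2) in parallel with pooling (3*3, stride 2): cost = (d/2)^2 * k^2, so less computation, while the representation size (d/2)^2 * (2k) is unchanged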
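The three cost expressions can be compared numerically. A minimal sketch that evaluates the formulas exactly as written in these notes, assuming an even grid size d and k input channels; the values d = 32 and k = 64 are hypothetical.

```python
def grid_reduction_costs(d, k):
    """Costs of reducing a d x d x k grid to (d/2) x (d/2) x 2k.

    The formulas are copied from the notes above; d is assumed even.
    """
    half = d // 2
    return {
        "pool_then_inception": 2 * half ** 2 * k ** 2,  # cheap, but bottlenecked
        "inception_then_pool": 2 * d ** 2 * k ** 2,     # no bottleneck, expensive
        "parallel_conv_and_pool": half ** 2 * k ** 2,   # Fig 10 scheme
    }


def output_size(d, k):
    """Number of activations in the reduced representation."""
    return (d // 2) ** 2 * (2 * k)


costs = grid_reduction_costs(32, 64)  # hypothetical sizes
for name, cost in costs.items():
    print(name, cost)
print("output activations:", output_size(32, 64))
```

The factor of 4 between the first two options comes purely from running the convolution on the d x d grid instead of the (d/2) x (d/2) grid.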

Model Regularization via Label Smoothing

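The original softmax is trained with cross-entropy; minimizing the loss is equivalent to maximizing the log-likelihood of the correct label. This causes the following problems:
1. Over-fitting: the model learns to put the full probability on the true label, which is not guaranteed to generalize
2. It pushes the gap between the largest logit and all other logits to grow, so the probabilities of the false labels fall toward 0; the gradient with respect to z_k, a - y (predicted probability minus label), then becomes 0 - 0 = 0, which reduces the model's ability to adapt
3. The model becomes overconfident in its predictions
Label-smoothing regularization (LSR) prevents the largest logit from becoming much larger than all the others
Setup: (same as v1) ILSVRC 2012, K = 1000 classes, u(k) = 1/1000, LSR smoothing coefficient = 0.1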
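A small pure-Python sketch of label smoothing with a uniform prior u(k) = 1/K, where the smoothed target is (1 - epsilon) * one_hot + epsilon * u(k). The function names are illustrative, and the 4-class logits are made-up values; only K = 1000 and epsilon = 0.1 come from the notes above.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Smoothed target: (1 - epsilon) * one_hot + epsilon * uniform."""
    uniform = 1.0 / num_classes
    return [
        (1 - epsilon) * (1.0 if k == true_class else 0.0) + epsilon * uniform
        for k in range(num_classes)
    ]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0)


def logit_gradient(target, probs):
    """Gradient of the loss with respect to each logit: p(k) - q(k)."""
    return [p - q for p, q in zip(probs, target)]


# Toy example with 4 classes and made-up logits.
probs = softmax([6.0, 1.0, 0.5, 0.0])
hard = smooth_labels(0, 4, epsilon=0.0)
soft = smooth_labels(0, 4, epsilon=0.1)
print(cross_entropy(hard, probs), cross_entropy(soft, probs))
print(logit_gradient(soft, probs))

# Setting from the notes: K = 1000, epsilon = 0.1.
target_1000 = smooth_labels(0, 1000, epsilon=0.1)
```

With smoothed targets the gradient on the false-label logits no longer vanishes as their probabilities approach 0, so a very confident prediction is still pulled back toward the smoothed distribution.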

Inception-v3

Architecture

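Grid reduction between Inception blocks uses the scheme in Fig 10
Zero-padding is applied inside the Inception modules
The filter bank sizes are chosen following Principle 4
The actual implementation differs slightly from the paper's description:
1. The H*W reduction of Fig 10 gets one extra branch to make the model wider
2. The first module of each Inception stage differs slightly
  1. e.g. the first Inception in Stage 2 is the split reduction structure of Fig 10
  2. e.g. Stage 1 does not use the 1*1 -> 3*3 branch from Fig 5 but a 1*1 -> 5*5 branch instead
3. Reference: https://github.com/fchollet/deep-learning-models/blob/master/inception_v3.py
42 layers, yet the computational cost is only 2.5 times that of GoogLeNet, and far more efficient than VGG
Fairly robust to changes, as long as the four principles are respected
v3 vs v1

Model architecture: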

Inception-v3 Architecture

Training

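Hyperparameters:
1. momentum = 0.9
2. best model: RMSProp (decay = 0.9, epsilon = 1.0)
3. base_lr = 0.045, lr *= 0.94 every 2 epochs
4. Gradient clipping with threshold 2.0 (keeps the gradient magnitude within 2.0)
The paper does not describe its data augmentation; it presumably reuses the random crop and distortion scheme from v1.
Training: synchronous SGD on multiple GPUs, with each GPU receiving batch/#GPUs samples of the batch.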
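The learning-rate schedule and clipping can be sketched as follows. Two points are assumptions rather than statements from the paper: the decay is applied as a staircase (once per 2 full epochs), and clipping rescales the whole gradient by its L2 norm. Function names are illustrative.

```python
import math


def learning_rate(epoch, base_lr=0.045, decay=0.94, epochs_per_decay=2):
    """Staircase exponential decay: multiply by `decay` every `epochs_per_decay` epochs."""
    return base_lr * decay ** (epoch // epochs_per_decay)


def clip_by_global_norm(grads, threshold=2.0):
    """Rescale the gradient so its L2 norm does not exceed `threshold` (assumed variant)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]


for epoch in (0, 1, 2, 10, 100):
    print(epoch, round(learning_rate(epoch), 6))

# A gradient of norm 5 is scaled down to norm 2; the direction is unchanged.
print(clip_by_global_norm([3.0, 4.0]))
```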

Lower Resolution Input

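Background: in the post-classification stage of object detection (OD), objects tend to be small
Question: does a higher input resolution give better performance?
Controlled variable: computational cost is held constant
Conclusion: when the input is smaller, consider dedicated high-cost low-resolution networks. A smaller model cannot simply be applied to a lower resolution, because the smaller model performs worse while recognizing small objects is already the harder task.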

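Experimental Results and Comparison

Table 3: cumulative improvements over Inception-v2:
1. RMSProp
2. Label Smoothing
3. Factorizing the 7*7 into a sequence of 3*3
4. BN-auxiliary: BN is applied not only to the conv layers but also to the auxiliary head, lowering the error rate by 0.2%.
5. The final model is Inception-v3
Table 3: single model, single crop - state of the art
Table 4: single model, multi crop (12, 144) - not directly comparable (the state of the art uses 10 crops)
Table 5: multi model (4), multi crop (144) - state of the art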

Questions

Two reasons Inception is more flexible:
1. generous use of dimensional reduction
2. parallel structures of the Inception modules which allows for mitigating the impact of structural changes on nearby components.
- The parallel structure inside Inception lets a change in one branch be mitigated by the other branches
Principle 1: why avoid representational bottlenecks -> reducing the dimensionality of a representation discards important factors like correlation structure.
1. What is correlation structure?
2. Why does "dimensionality is only a rough estimate of information content" lead to Principle 1?
Principle 2: is it about the reason for the Inception split?
No; it is about local processing of the 8*8 higher-dimensional representation to extract high-dimensional sparse features
Conv factorization: use computational and memory savings to increase the filter-bank sizes of NN. >> how??? & what is filter bank size? (the number of kernels, i.e. the width of the model)
- filter bank size: the number of kernels in each bank, #filters
The notions of activation & unit
1. n = alpha * m
2. n = activations = H*W*C_out
3. m = units = H*W*C_in
4. H and W are assumed unchanged here
5. Second reading: #units = C_in, #activations = C_out
Sec 3:
1. Why does a reduction in computational cost result in a reduced number of params???
2. Why does each weight correspond to one multiplication per activation???
3. Why "This means that with suitable factorization, we can end up with more disentangled parameters and therefore with faster training."? Fewer params makes sense, but faster training does not, because there are more multiply-adds! So in fact
4. Why does ReLU help more when the spatial variation is stronger? - non-linear activations enhance the model's representational power
Sec 3.2:
1. Why does factorizing n*n into 1*n + n*1 work poorly in lower layers but well on medium grid sizes (12-20)?
2. The concrete implementation of Fig 6: check the code
Sec 6:
1. The figures and the description do not seem to match (those few figs)
2. Where does the 42-layer count come from?
3. Per-layer details: even within 3*Inception, each Inception seems to differ
V1 vs V3
1. Training & Testing crop & Data Augmentation
  1. Testing: see the comparison table
  2. Training augmentation: v1 used random crop and distortion; v3 does not say

TODO

Deeper understanding of the model
Cross-check against the code

References

Paper link: https://arxiv.org/pdf/1512.00567.pdf
Inception GitHub: https://github.com/tensorflow/models/tree/master/research/inception
Inception-v3 architecture GitHub: https://github.com/fchollet/deep-learning-models/blob/master/inception_v3.py

Model	Params	Layers	Kernel sizes	Top-5 error	Top-5 error (single model, single crop)
AlexNet	60m	8	11,5,3	16.4%	
VGG	180m	19	3	7.3%	
GoogLeNet	7m (Multiply-Adds 1.5 billion)	22	5,3,1	6.67% (7 models, 144 crops)	10.07%
Inception v3	25m (Multiply-Adds 5 billion, 2.5 * GoogLeNet)	42	1,3,5,7	3.5% (4 models, 144 crops)	5.6%

Paper Reading: Inception Series, Inception v3
