Contents
Introduction (Sec 1)
- What: the SE block, a novel architectural unit
- Goal: improve the representational power of the model
- How: dynamic channel-wise feature recalibration
- Motivation:
- Conventional approach: convolution fuses channel-wise and spatial information together within local receptive fields
- Prior work boosts representational power by strengthening spatial encoding, e.g. embedded learning mechanisms that capture spatial correlations (as in Inception); analogously, SE recalibrates channel-wise features by explicitly modelling the interdependencies between channels
- Why it drew attention:
- 1st place in the ILSVRC 2017 classification contest (Top-5 err = 2.251%)
- Advantages:
- Generalization: SENet is a stack of SE blocks and performs well across datasets and tasks (classification / object detection)
- Drop-in replacement: SE blocks can be inserted directly into state-of-the-art architectures (residual or non-residual) to improve performance
- Improvement in performance: SENets give significant gains over state-of-the-art deep architectures
- Lightweight: the additional computational overhead of SENet is small
- Simple design: new CNN architectures are usually complex to design (more hyperparameters and layer configurations, e.g. Inception), whereas the SE design is simple
- Greater representational power: the model's expressiveness is stronger
- Easy learning process: using global information to infer non-linear relationships between channels can ease the learning process
- Significance:
- (hopefully) helpful for other tasks that need strong discriminative features
- (hopefully) SE blocks help related fields such as network pruning for compression
Related Work (Sec 2)
- Developing deep architectures
- Line 1: increasing depth
- VGGNet: benefits from depth
- BN: improves gradient propagation by regulating the inputs to each layer, making learning more stable
- ResNet: learns substantially deeper network structures through identity-based skip connections
- Highway networks: use a gating mechanism to regulate shortcut connections
- Line 2: focusing on modular components
- Grouped convolutions can increase cardinality (the size of the set of transformations)
- E.g. multi-branch convolutions allow more flexible compositions of operators within a layer
- Cross-channel correlations
- treated as independent of spatial structure
- combined via 1×1 convolutions
- Purpose of line 2: reduce model size and save computational cost
- Implicit assumption: channel relationships can be formulated as instance-agnostic functions with local receptive fields
- Attention and Gating Mechanisms
- Attention is usually implemented together with a gating function (e.g. sigmoid or softmax) and sequential techniques (e.g. LSTM)
- Wang et al. introduce a powerful trunk-and-mask attention mechanism
- such units are inserted between intermediate stages
- SE is more lightweight by comparison
- SE blocks can be used throughout the whole network
SE Block (Sec 3)
- Principle
- focuses on the relationships between channels in model design
- SE recalibrates channel-wise features by explicitly modelling the interdependencies between channels, selectively emphasising informative features and suppressing less useful ones
- improves the representational power of the model
- Starting point & goal:
- each convolution only sees a local receptive field and cannot exploit contextual information outside that region
- the SE block makes the network more sensitive to informative features and suppresses less useful ones
- Basic structure (a PyTorch sketch follows at the end of this section)
- Input: feature map U (H×W×C)
- Squeeze:
- Input: feature map U (H×W×C)
- Method: global average pooling
- Output: a channel descriptor (1×1×C)
- Excitation: adaptive recalibration
- Role: fully capture channel-wise dependencies
- Requirements:
- flexible: able to learn non-linear interactions between channels
- learns a non-mutually-exclusive relationship (multiple channels may be emphasised at once, unlike a one-hot activation)
- Method: a simple gating mechanism with a sigmoid function (W1 is dimension-reducing, shape [C/r × C]; W2 is dimension-increasing, shape [C × C/r])
- Step 1: FC (W1) followed by ReLU
- Step 2: FC (W2) followed by sigmoid (self-gating mechanism)
- Sub-output: per-channel weights (1×1×C)
- Output: the re-weighted (re-scaled) feature map U
- Characteristics:
- in lower layers, SE excitations are largely class-agnostic - bolstering the quality of the shared lower-level representation
- in higher layers, SE excitations become increasingly class-specific
- in the highest layers, SE produces similar channel weight distributions across classes - these blocks can be removed to save computational cost (see Sec 6.4)
- Example models: SE-Inception & SE-ResNet
- SE-Inception (Fig 2)
- Ftr = the Inception module
- SE is added between Inception modules
- SE-ResNet (Fig 3)
- Ftr is applied on the non-identity branch
- squeeze and excitation act before the summation with the identity branch (the skip connection)
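The squeeze/excitation pipeline above maps naturally to a few lines of PyTorch. The following is a minimal sketch of my own (not the authors' released code); the names `SEBlock` and `se_residual_forward` are illustrative. It also shows the SE-ResNet placement from Fig 3, where SE acts on the residual branch before the addition with the identity branch.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block, following the structure in Sec 3."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global average pooling -> B x C x 1 x 1
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: dimension reduction (C -> C/r)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2: dimension increase (C/r -> C)
            nn.Sigmoid(),                                # self-gating: per-channel weights in (0, 1)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                   # channel descriptor z: B x C
        s = self.excitation(z).view(b, c, 1, 1)          # channel weights s: B x C x 1 x 1
        return u * s                                     # re-scaled feature map

# SE-ResNet placement (Fig 3): SE is applied to the residual (non-identity)
# branch, and the result is then summed with the identity branch.
def se_residual_forward(x: torch.Tensor, residual_branch: nn.Module, se: SEBlock) -> torch.Tensor:
    return x + se(residual_branch(x))
```

For instance, `SEBlock(256)(torch.randn(8, 256, 56, 56))` returns a tensor of the same shape, with each channel rescaled by its learned weight.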
Model & Computational Complexity (Sec 4)
- Model comparison:
- GPU:
- Training time: ResNet-50: 150 ms; SE-ResNet-50: 209 ms (for a single pass forwards and backwards, mini-batch = 256)
- Reason for the slowdown: global pooling and small inner-product operations are not yet well optimised on GPUs
- CPU:
- Inference time: ResNet-50: 164 ms, SE-ResNet-50: 167 ms (224×224 input)
- A small increase in computational cost, but a better-performing model
- (why is training timed on GPU but inference on CPU?)
- Additional parameters introduced (Eq. 5)
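The notes only cite Eq. 5; as far as I recall, it gives the total number of additional parameters introduced by the SE blocks (all of which come from the two FC layers of the excitation step):

$$
\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^{2}
$$

where r is the reduction ratio, S the number of stages, C_s the output channel width of stage s, and N_s the number of repeated blocks in stage s. This also explains the role of r in Sec 6.4: doubling r roughly halves the parameter overhead.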
Implementation (Sec 5)
- Training settings (a minimal optimiser sketch follows this list):
- random 224×224 crops (or 299×299 for Inception-ResNet-v2)
- mini-batch sampling strategy: data balancing
- synchronous SGD (the first version of the paper used asynchronous SGD)
- momentum = 0.9
- mini-batch size = 1024
- base lr = 0.6
- lr /= 10 every 30 epochs
- r = 16 (see Sec 6.4, Table 7)
- Inference:
- single-crop testing - centre-crop evaluation
- first resize so the shorter side is 256 (352), then crop the central 224×224 (299×299 for Inception-ResNet-v2) pixels
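A minimal PyTorch sketch of the schedule above, assuming a plain ResNet-50 as a stand-in model (the SE variants, multi-GPU synchronous SGD, and the data-balanced sampler are not reproduced here; the total epoch count is also not stated in these notes):

```python
import torch
import torchvision

# Stand-in backbone; the paper's SE variants would replace this.
model = torchvision.models.resnet50()

# Hyperparameters from the notes: momentum 0.9, base lr 0.6,
# mini-batch 1024, lr divided by 10 every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Centre-crop evaluation as described: shorter side to 256, then a 224x224 crop.
eval_transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(256),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
])

for epoch in range(90):   # illustrative epoch count only
    # ... one training epoch over a batch-size-1024 loader would run here ...
    scheduler.step()      # decays lr by 10x every 30 epochs
```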
Experiments (Sec 6)
- ImageNet Classification
- About the dataset: ImageNet 2012
| Train images | Val images | Classes | Metric |
| --- | --- | --- | --- |
| 1.28M | 50K | 1,000 | Top-5 error |
- Embedding SE blocks into ResNets of different depths consistently improves performance (both Top-1 and Top-5)
- GFLOPs increase only marginally
- A single-model SE-ResNet-50 improves Top-5 error over ResNet-50 by 0.86% and approaches ResNet-101, which requires roughly twice the GFLOPs
- SE-ResNet-101 even outperforms ResNet-152 (6.07% vs 6.34% Top-5 err) with fewer GFLOPs
- Adding SE blocks to deeper models: diminishing returns, but performance keeps improving
ResNet vs SE-ResNet
| Model | Top-5 err (original) | GFLOPs (original) | Top-5 err (SENet) | GFLOPs (SENet) |
| --- | --- | --- | --- | --- |
| ResNet-50 | 7.48 | 3.86 | 6.62 | 3.87 |
| ResNet-101 | 6.52 | 7.58 | 6.07 | 7.60 |
| ResNet-152 | 6.34 | 11.30 | 5.73 | 11.32 |
ResNeXt vs SE-ResNeXt
| Model | Top-5 err | GFLOPs |
| --- | --- | --- |
| ResNeXt-50 | 5.90% | 4.24 |
| ResNeXt-101 | 5.57% | 7.99 |
| SE-ResNeXt-50 | 5.49% | 4.25 |
- SE-ResNeXt-50 achieves higher accuracy than ResNeXt-101 while saving nearly 50% of the computational overhead
Inception-ResNet-v2 vs SE-Inception-ResNet-v2
| Model | Top-5 err | GFLOPs |
| --- | --- | --- |
| Inception-ResNet-v2 | 5.21 | 11.75 |
| SE-Inception-ResNet-v2 | 4.79 | 11.76 |
- Caveat: the input image size may be larger (shorter side resized to 352 before the 299×299 crop), and the baseline Inception-ResNet-v2 result does not specify its input size, so the reported improvement is less trustworthy
Non-Residual Models
| Model | Top-5 err (original) | GFLOPs (original) | Top-5 err (SENet) | GFLOPs (SENet) |
| --- | --- | --- | --- | --- |
| VGG-16 | 8.81 | 15.47 | 7.70 | 15.48 |
| BN-Inception | 7.89 | 2.03 | 7.14 | 2.04 |
- Note: both the VGG-16 baseline and its SE implementation add batch norm after each conv layer
Representative efficient architectures: MobileNet & ShuffleNet
| Model | Top-5 err (original) | MFLOPs (original) | Top-5 err (SENet) | MFLOPs (SENet) |
| --- | --- | --- | --- | --- |
| MobileNet | 10.1 | 569 | 7.9 | 572 |
| ShuffleNet | 13.6 | 140 | 11.7 | 142 |
- Only a slight increase in computational overhead, but a large drop in Top-5 error
- Conclusions from the model comparisons:
- Conclusion 1: SE blocks can be combined with a wide range of architectures and improve their performance
- Conclusion 2: applicable to both residual and non-residual foundations
- 1st place in the ILSVRC 2017 classification competition
Result:
- 2.251% Top-5 error (1st place)
- larger input resolution gives better results
- Inference:
- single-crop testing - centre-crop evaluation
- first resize so the shorter side is 256 (352), then crop the central 224×224 (299×299) pixels
Dataset: Places365-Challenge (scene classification)
| Train images | Val images | Classes |
| --- | --- | --- |
| 8M | 36,500 | 365 |
- Tests the model's generalization and ability to handle abstraction:
- needs to capture more complex data associations
- needs to be robust to greater appearance variation
- Result
- Using ResNet-152 as the baseline, SE-ResNet-152 surpasses the state of the art
Dataset: COCO (object detection)
| Train images | Val images | Classes |
| --- | --- | --- |
| 80K | 40K | 80 |
- Faster R-CNN + SE-ResNet-50/101
- AP@IoU=0.5 reaches 49.2% with the 101 backbone, where SE is even more beneficial (0.5% improvement); for comparison, SSD512 achieves AP@IoU=0.5 = 46.5%
- Analysis and Interpretation
- Reduction ratio (r)
- network used in the experiment: SE-ResNet-50
- performance analysis:
- larger r means smaller capacity and fewer parameters, but performance does not degrade monotonically (error does not rise monotonically)
- possibly because a larger capacity overfits the channel interdependencies
- r = 16 is chosen as the trade-off between computational cost and Top-5 error (used for all experiments)
- Role of the excitation at different layers (Fig 5: six plots across four stages)
- lower-layer features are typically more general (similar across classes)
- higher-layer features have greater specificity (class-specific)
- Last stage:
- SE_5_2 (block 2 of the last stage) tends towards saturation: most channel activations approach 1, a tiny fraction approach 0
- if all activations were exactly 1, the block would behave like a standard ResNet block
- in the final block SE_5_3, different classes show a similar pattern, differing only in scale
- this suggests SE_5_2 and SE_5_3 are less important and can be removed (at a very small performance cost) to cut most of the extra parameter/computation overhead (a back-of-envelope check follows below)
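A quick back-of-envelope check of that pruning claim, using the parameter-overhead formula from Sec 4 with the standard ResNet-50 stage widths and block counts (my own arithmetic, assuming one SE block per residual block, acting on that block's output width):

```python
# Per-stage SE parameter overhead for SE-ResNet-50 with r = 16,
# computed as (2 / r) * N_s * C_s^2 for each stage.
r = 16
stages = [(256, 3), (512, 4), (1024, 6), (2048, 3)]   # (C_s, N_s) per ResNet-50 stage
per_stage = [2 / r * n * c ** 2 for c, n in stages]
print([int(p) for p in per_stage])   # [24576, 131072, 786432, 1572864]
print(int(sum(per_stage)))           # ~2.5M extra parameters in total
# The last stage alone accounts for ~1.57M of these (over 60%), which is
# why pruning the late-stage SE blocks removes most of the overhead at little cost.
```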
Open Questions
- GFLOPs: higher is not better; the more the model requires, the slower it runs. Why is this the chosen metric?
- Both VGG-16 and the SE implementation add batch norm after each conv layer. Is this a fair comparison?
- Compare the cropping strategies
- Do synchronous vs asynchronous SGD affect GFLOPs?
- How do these classification / object-detection models compare in performance?
- Why does SE in the highest layers produce similar channel weight distributions?
- Easy learning process: using global information to infer non-linear relationships between channels can ease the learning process. >> How?
- Why is training timed on GPU but inference on CPU?
References
Paper link: https://arxiv.org/pdf/1709.01507.pdf