Deep Learning Collection

Deep Learning

Very Deep Convolutional Networks for Large-Scale Image Recognition

ICLR 2015

Problem

Network models are not deep enough

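Method

Replace one 7x7 kernel with a stack of three 3x3 kernels (the stack covers the same 7x7 receptive field)
- More non-linearities: 3 vs. 1
- Fewer parameters: $3 \cdot (3^2 C^2) = 27C^2$ vs. $7^2 C^2 = 49C^2$, for $C$ input and output channels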
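The parameter comparison above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names and the channel count `C = 64` are illustrative assumptions, and biases are ignored.

```python
def conv_params(kernel_size: int, channels: int, layers: int = 1) -> int:
    """Weights in `layers` stacked k x k convs with `channels` in and out (no biases)."""
    return layers * (kernel_size ** 2) * channels ** 2


def stacked_receptive_field(kernel_size: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel_size - 1)


C = 64  # hypothetical channel count, chosen only for illustration
three_3x3 = conv_params(3, C, layers=3)  # 27 * C^2
one_7x7 = conv_params(7, C, layers=1)    # 49 * C^2

print(three_3x3, one_7x7)              # parameter counts of the two designs
print(stacked_receptive_field(3, 3))   # receptive field of the 3x3 stack
```

For any `C` the ratio is 27/49, so the stack of small kernels uses roughly 55% of the weights while covering the same input window.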

Takeaways

Stacking several small kernels is preferable to using one large kernel

References

https://arxiv.org/abs/1409.1556

Network In Network

ICLR 2014

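Problem

Earlier CNNs such as AlexNet have too many parameters
A convolutional layer is linear, so its ability to abstract features is limited
This paper sets out to solve both problems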

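Method

MLP convolutional layer (mlpconv): apply 1x1 convolutions followed by ReLU activations
- High-level CNN features are really combinations of low-level features produced by some operation
- Following this idea, the authors propose performing a more complex operation within each local receptive field
Replace the FC layers with global average pooling
- Reduces overfitting
- Reduces the number of parameters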
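The two ideas above can be sketched in NumPy. This is a simplified illustration, not the paper's implementation: the shapes, random weights, and function names are assumptions, and the spatial convolution that precedes the 1x1 layers in an mlpconv block is left out.

```python
import numpy as np


def conv1x1_relu(x, w, b):
    """1x1 convolution + ReLU. x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)."""
    # A 1x1 convolution is the same linear map applied at every pixel.
    y = np.einsum("oc,chw->ohw", w, x) + b[:, None, None]
    return np.maximum(y, 0.0)


def global_average_pool(x):
    """Average each feature map over space. x: (C, H, W) -> (C,)."""
    return x.mean(axis=(1, 2))


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))                    # 16 feature maps of size 8x8
w1, b1 = rng.standard_normal((32, 16)), np.zeros(32)
w2, b2 = rng.standard_normal((10, 32)), np.zeros(10)   # 10 hypothetical classes

h = conv1x1_relu(x, w1, b1)          # (32, 8, 8)
maps = conv1x1_relu(h, w2, b2)       # (10, 8, 8): one feature map per class
scores = global_average_pool(maps)   # (10,): class scores with no FC weights
print(scores.shape)
```

Global average pooling has no parameters of its own, which is where the savings over FC layers come from.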

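Takeaways

1x1 convolutions are very useful: followed by an activation they act like an MLP layer, turning a linear map into a non-linear one, and they can also reduce the number of channels
Given the similarity, could dropout be added to a CNN in the same way (randomly zeroing some values in the feature maps before the 1x1 convolution)?

References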

https://arxiv.org/abs/1312.4400

Going Deeper with Convolutions

CVPR 2015

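Problem

Increasing the depth and width of a network brings overfitting
Training drives many parameters towards 0, i.e. the network becomes sparse
- Computing infrastructure handles sparse data inefficiently, so using sparse matrices greatly reduces efficiency
- Yet sparsity is useful for deep neural networks, which agrees with the Hebbian principle in biology: "some neurons respond in almost the same way, i.e. they are excited or inhibited together"
This paper aims to design an architecture that exploits sparsity while still using dense computation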

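Method

Before the expensive convolutions, first aggregate the input with 1x1 kernels (information compression), then convolve
- Lowers the computational cost
- Adds non-linearity
Extract features with kernels at several scales (1x1, 3x3, 5x5) plus 3x3 max pooling, then concatenate the results into the feature maps of one layer (same padding)
- Increases the width of the network and also makes it more adaptable to scale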
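To see why the 1x1 reduction lowers the cost, the multiplications of a 5x5 branch can be counted with and without it. This is a rough sketch: the feature-map size and channel counts are illustrative assumptions, and only multiplications are counted.

```python
def conv_mults(h, w, c_in, c_out, k):
    """Multiplications for a k x k, stride-1, same-padded convolution."""
    return h * w * c_in * c_out * k * k


H = W = 28
c_in, c_out, c_reduce = 192, 32, 16  # illustrative sizes, assumed for this sketch

# 5x5 convolution applied directly to all input channels
direct = conv_mults(H, W, c_in, c_out, 5)

# 1x1 reduction to c_reduce channels, then the 5x5 convolution
reduced = conv_mults(H, W, c_in, c_reduce, 1) + conv_mults(H, W, c_reduce, c_out, 5)

print(direct, reduced, round(direct / reduced, 1))
```

With same padding every branch keeps the spatial size, so the branch outputs can be concatenated along the channel axis, and the module's output width is the sum of the branch widths.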

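Takeaways

A 1x1 convolution is one way to turn sparse representations into dense ones
Extracting features with kernels at several scales works better than using a single scale

References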

http://openaccess.thecvf.com/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf

Deep Residual Learning for Image Recognition

CVPR 2016

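Problem

Problem with deep models: vanishing and exploding gradients make it hard to reach an optimum
- Existing remedies:
  - normalized initialization
  - intermediate normalization layers
When the model gets deeper
- Prediction accuracy stops improving
- Training and validation losses end up higher than those of a shallower network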

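Method

Residual Learning
That is, the stacked layers fit the residual F(x) := H(x) - x instead of the desired mapping H(x), and the block outputs F(x) + x
skip connections
That is, the output of one layer skips over several layers and is fed directly into a later layer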
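A residual block can be sketched in a few lines. For brevity this sketch uses two dense layers for F instead of convolutions and omits normalization; the weights and sizes are assumptions for illustration only.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F(x) = w2 @ relu(w1 @ x)."""
    f = w2 @ relu(w1 @ x)   # residual branch F(x)
    return relu(f + x)      # skip connection adds the input back


rng = np.random.default_rng(0)
d = 8
x = relu(rng.standard_normal(d))   # non-negative input, as after an earlier ReLU

# With zero weights the residual branch vanishes and the block is the identity.
zero = np.zeros((d, d))
y_identity = residual_block(x, zero, zero)
print(np.allclose(y_identity, x))
```

The zero-weight case shows the motivation: a deeper model can fall back to the identity by pushing F towards 0, which is easier than learning an identity mapping from scratch.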

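Takeaways

The residual structure effectively resolves the degradation of very deep models and keeps gradients flowing through the skip paths; future deep models should borrow it
The idea of skip connections can be borrowed in many other areas

References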

https://arxiv.org/abs/1512.03385

Squeeze-and-Excitation Networks

arXiv:1709

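Problem

Many existing works have shown the benefit of strengthening spatial encoding for improving a network's representational power
So the authors asked whether performance could be improved from another angle, such as the relationships between feature channels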

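Method

Proposes Squeeze-and-Excitation Networks
Adopts a new "feature recalibration" strategy
Specifically, the importance of each feature channel is learned automatically, and this importance is then used to boost useful features and suppress features that matter little for the current task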
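The recalibration described above can be sketched as squeeze (global average pooling), excitation (a small bottleneck network ending in a sigmoid), and channel-wise scaling. This is a minimal NumPy sketch: the reduction ratio `r`, the random weights, and the omission of biases are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_block(x, w1, w2):
    """x: (C, H, W); w1: (C // r, C); w2: (C, C // r). Returns scaled x and gates."""
    z = x.mean(axis=(1, 2))                     # squeeze: one number per channel
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # excitation: gates in (0, 1)
    return x * s[:, None, None], s              # rescale each channel by its gate


rng = np.random.default_rng(0)
C, r = 16, 4                                    # hypothetical channel count and ratio
x = rng.standard_normal((C, 7, 7))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

y, s = se_block(x, w1, w2)
print(y.shape, s.shape)
```

The extra parameters are only the two small matrices, which is why the block is cheap to add to an existing network.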

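Takeaways

The SE block can be embedded into your own network; it generalizes well and the added computation is negligible

References

CVPR | A Detailed Look at SE-Net, the ImageNet Champion Model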

https://arxiv.org/abs/1709.01507

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

JMLR 15

Problem

With limited training data, however, many of these complicated relationships will be the result of sampling noise, so they will exist in the training set but not in real test data even if it is drawn from the same distribution.
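That is, what a deep neural network learns is affected by noise, which leads to overfitting
The third paragraph of the paper's Introduction is devoted entirely to the problem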

Method

The term “dropout” refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections
Code: d = np.random.rand(*a.shape) < keep_prob  # boolean keep mask for activations a (np is numpy)
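The mask above can be turned into a full forward pass. The sketch below uses the common "inverted dropout" variant, which rescales at training time so nothing changes at test time; this differs from the paper's formulation, which scales the weights at test time instead. The function name and sizes are assumptions.

```python
import numpy as np


def dropout_forward(a, keep_prob, rng, train=True):
    """Inverted dropout: drop units with probability 1 - keep_prob during training."""
    if not train:
        return a                               # test time: use all units unchanged
    mask = rng.random(a.shape) < keep_prob     # True where a unit is kept
    return a * mask / keep_prob                # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = dropout_forward(a, keep_prob=0.8, rng=rng)

print(out.mean())                      # close to 1.0 because of the rescaling
print((out == 0).mean())               # fraction of dropped units, close to 0.2
```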

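Takeaways

Dropout effectively mitigates overfitting in fully connected networks, and it can also be applied widely to graphical models such as Boltzmann machines

References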

http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

ICML 2015

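Problem

Internal Covariate Shift: the change in the distribution of network activations due to the change in network parameters during training
The traditional approach is to whiten the inputs
- That is, linearly transform them to zero mean and unit variance and reduce the redundancy of the inputs
- When whitening is interleaved with gradient steps, updates to some units are cancelled out by the whitening, so the parameters keep growing while the network output and the loss barely change
- Computing the covariance matrix over the whole training set is too expensive
The authors want a normalization that is differentiable and does not need to operate on the entire training set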

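Method

Normalize each feature dimension independently, instead of jointly whitening all input units as before
Use the mean and variance of each mini-batch to estimate the global mean and variance
Introduce two learnable parameters $\gamma^{(k)}$ and $\beta^{(k)}$ that apply a linear (scale and shift) transform to the normalized x, leaving room for a trade-off between faster convergence and damaging the representation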
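The three steps above correspond to a short training-time forward pass. This is a minimal sketch for fully connected features of shape (N, D); the `eps` value is an assumed default, and the running averages used at inference time are omitted.

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm. x: (N, D); gamma, beta: (D,)."""
    mu = x.mean(axis=0)                      # per-dimension mini-batch mean
    var = x.var(axis=0)                      # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each dimension on its own
    return gamma * x_hat + beta              # learnable scale and shift


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 5)) * 3.0 + 7.0   # features with mean 7 and std 3
y = batchnorm_forward(x, gamma=np.ones(5), beta=np.zeros(5))

print(y.mean(axis=0))   # approximately 0 in every dimension
print(y.std(axis=0))    # approximately 1 in every dimension
```

Setting `gamma` to the batch standard deviation and `beta` to the batch mean would undo the normalization, which is the sense in which the two parameters preserve the layer's representational capacity.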

Takeaways

BN speeds up convergence
Use BN when training deep models

References

Batch Normalization study notes

https://arxiv.org/abs/1502.03167

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

CVPR’14

Problem

Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data

Method

Removes the softmax layer and adds two adaptation layers

Takeaways

Fine-tuning can be adapted to the amount of data you have

References

http://openaccess.thecvf.com/content_cvpr_2014/papers/Oquab_Learning_and_Transferring_2014_CVPR_paper.pdf

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

ICCV 2015

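Problem

The ReLU activation does not have zero-mean output; we would like the output mean to be 0, as with tanh
For very deep models, randomly initialized weights struggle to converge. "Xavier" initialization does not work for ReLU and PReLU.

This paper proposes a new activation function to address problem 1, and a new initialization method for ReLU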

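Method

Proposes Parametric ReLU (PReLU) as a replacement for ReLU, which lowers the error rate
- PReLU: $f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$
- When $a_i$ is a small fixed value, this is Leaky ReLU ($a_i = 0.01$)
- $f$ is differentiable with respect to $a_i$, so PReLU can be trained with backpropagation
Proposes a new weight initialization method
- Suited to deep networks with ReLU-family activations
- Derived from a variance analysis; see Section 2.2 of the paper for the full derivation
- The resulting weights follow a zero-mean Gaussian with variance $\frac{2}{n_l}$
- Python code: W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2) # layer initialization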
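Both pieces can be tried out together. In this sketch the slope `a = 0.25` and the layer size 512 are illustrative assumptions, and `fan_in` is taken to play the role of $n_l$.

```python
import numpy as np


def prelu(y, a):
    """PReLU: identity for positive inputs, slope `a` for negative inputs."""
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)


def he_init(fan_in, fan_out, rng):
    """Zero-mean Gaussian weights with variance 2 / fan_in."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)


rng = np.random.default_rng(0)
W = he_init(512, 512, rng)

y = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(y, a=0.25))        # negative inputs are scaled by 0.25
print(prelu(y, a=0.0))         # a = 0 recovers plain ReLU
print(W.var(), 2.0 / 512)      # empirical variance vs. the target 2 / fan_in
```

Multiplying by `np.sqrt(2.0 / fan_in)` is the same as dividing by `np.sqrt(fan_in / 2)` in the one-line version above.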

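Takeaways

Try PReLU when chasing very low error rates
Look for general formulations under which an existing technique becomes a special case (ReLU is a special case of PReLU)
When designing a deep network from scratch, consider this paper's weight initialization

References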

https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

ICLR’16

Problem

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources

Method

Prune the network: keep only the important connections;
Quantize the weights: share weights through weight quantization;
Huffman coding: compress further with Huffman coding;
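The first two stages can be sketched in NumPy. This is a simplified illustration under stated assumptions: pruning is by weight magnitude at an assumed 90% sparsity, weight sharing is a plain 1-D k-means with linearly spaced initial centroids, and the retraining steps and the Huffman coding stage are left out.

```python
import numpy as np


def prune(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    return w * mask, mask


def quantize(w, mask, n_clusters, n_iter=10):
    """Weight sharing: 1-D k-means over the surviving weights."""
    values = w[mask]
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):               # leave empty clusters where they are
                centroids[k] = values[idx == k].mean()
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    w_q = np.zeros_like(w)
    w_q[mask] = centroids[idx]                 # each weight becomes its centroid
    return w_q, centroids


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_pruned, mask = prune(w, sparsity=0.9)
w_shared, codebook = quantize(w_pruned, mask, n_clusters=16)

print(mask.mean())                        # fraction of connections kept
print(np.unique(w_shared[mask]).size)     # number of distinct shared values
```

After sharing, each surviving weight can be stored as a 4-bit index into the 16-entry codebook, and those indices are what a Huffman coder would then compress.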

References

[Deep Neural Network Compression] Deep Compression (ICLR 2016 Best Paper)

https://arxiv.org/abs/1510.00149

Deep Networks with Stochastic Depth

ECCV’16

Problem

Training very deep networks comes with its own set of challenges
- The gradients can vanish
- The forward flow often diminishes
- The training time can be painfully slow

Method

During training:
- For each mini-batch, randomly drop a subset of layers and bypass them with the identity function
- Effect: reduces training time substantially and improves the test error significantly on almost all data sets
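The training rule above can be sketched for a single residual block. This is a minimal sketch: `residual_fn` is a hypothetical stand-in for the block's convolutional branch, and the linearly decaying survival probabilities and the test-time scaling by the survival probability are included as assumptions about the usual setup.

```python
import numpy as np


def survival_probs(num_blocks, p_last=0.5):
    """Linearly decaying survival probability: p_l = 1 - (l / L) * (1 - p_last)."""
    return [1.0 - (l / num_blocks) * (1.0 - p_last) for l in range(1, num_blocks + 1)]


def stochastic_depth_block(x, residual_fn, p_survive, rng, train=True):
    """Residual block that is randomly bypassed during training."""
    if train:
        if rng.random() < p_survive:
            return np.maximum(residual_fn(x) + x, 0.0)   # block is active
        return np.maximum(x, 0.0)                        # block dropped: identity path only
    # Test time: keep every block, weighting the branch by its survival probability.
    return np.maximum(p_survive * residual_fn(x) + x, 0.0)


rng = np.random.default_rng(0)
x = np.array([0.5, 1.0, 2.0])
branch = lambda v: -0.25 * v      # hypothetical residual branch

print(survival_probs(4))
print(stochastic_depth_block(x, branch, p_survive=0.5, rng=rng, train=False))
```

When a block is dropped its branch is never evaluated, which is where the training-time savings come from.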

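Takeaways

Stochastic depth is easy to understand, and it also acts as an ensemble method that combines networks of different depths

References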

https://arxiv.org/pdf/1603.09382.pdf

Snapshot Ensembles: Train 1, get M for free

ICLR’17

Problem

Ensembles of neural networks are known to be much more robust and accurate than individual networks. However, training multiple deep networks for model averaging is computationally expensive
That is, an ensemble is more robust and accurate than a single network, but training several deep networks is too expensive

Method

Our approach leverages the non-convex nature of neural networks and the ability of SGD to converge to and escape from local minima on demand. Instead of training M neural networks independently from scratch, we let SGD converge M times to local minima along its optimization path. Each time the model converges, we save the weights and add the corresponding network to our ensemble. We then restart the optimization with a large learning rate to escape the current local minimum. More specifically, we adopt the cycling procedure suggested by Loshchilov & Hutter (2016), in which the learning rate is abruptly raised and then quickly lowered to follow a cosine function
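That is, start with a large learning rate, then lower it to a small one until a local minimum is reached
Save the model and add it to the ensemble
Repeat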
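The cyclic schedule described above can be sketched as a cosine that restarts at the beginning of each cycle. This is a sketch under assumptions: iterations are 0-indexed, the learning rate is updated every iteration, and the values of `T`, `M`, and `lr0` are chosen only for illustration.

```python
import math


def snapshot_lr(t, total_iters, num_cycles, lr_max):
    """Cosine-annealed learning rate that jumps back to lr_max at each cycle start."""
    cycle_len = math.ceil(total_iters / num_cycles)
    return 0.5 * lr_max * (math.cos(math.pi * (t % cycle_len) / cycle_len) + 1.0)


def snapshot_iterations(total_iters, num_cycles):
    """Last iteration of each cycle, where a snapshot of the weights would be saved."""
    cycle_len = math.ceil(total_iters / num_cycles)
    return [min((m + 1) * cycle_len, total_iters) - 1 for m in range(num_cycles)]


T, M, lr0 = 300, 3, 0.1    # hypothetical budget: 300 iterations, 3 snapshots
lrs = [snapshot_lr(t, T, M, lr0) for t in range(T)]

print(lrs[0], lrs[99], lrs[100])      # high, near zero, high again after the restart
print(snapshot_iterations(T, M))      # iterations at which snapshots are taken
```

Each snapshot is taken when the learning rate is at its lowest, so all M models come out of a single training budget.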

Takeaways

Snapshot Ensembling can be used as a trick for improving accuracy

References

https://arxiv.org/pdf/1704.00109.pdf

Deep Mutual Learning

arXiv:1706

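Problem

This paper studies how to use multiple models to improve image recognition accuracy. The usual approach is to ensemble the outputs of several models, but an ensemble means more computation. Instead, this paper trains several models together so that they learn from each other, improving the generalization of every single model.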

Method

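The core idea of DML is to make the predicted probability distributions of the two classifiers agree, with KL divergence measuring the similarity between the two distributions
Let the two classifiers be $\theta_1$ and $\theta_2$ with output probability distributions $p_1$ and $p_2$; the KL distance from $p_1$ to $p_2$ is defined as
- $D_{KL}(p_2 \| p_1) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^m(x_i) \log \frac{p_2^m(x_i)}{p_1^m(x_i)}$
- That is, $p_2$ is treated as the ground truth when computing the relative entropy of the two distributions. The final loss $L_{\theta_1}$ of network $\theta_1$ is defined as:
- $L_{\theta_1} = L_{C_1} + D_{KL}(p_2 \| p_1)$
- where $L_{C_1}$ is the classification loss of network $\theta_1$ itself, e.g. cross-entropy. Likewise, the final loss of network $\theta_2$ is its own classification loss plus the mutual-learning loss that treats $p_1$ as the ground truth:
- $L_{\theta_2} = L_{C_2} + D_{KL}(p_1 \| p_2)$
The two DML sub-networks are updated asynchronously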
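The two losses can be computed directly from the logits of the two networks. This is a minimal NumPy sketch of the loss values only: the random logits and labels are made up for illustration, and the gradient computation and the alternating parameter updates are omitted.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(p, labels):
    """Classification loss L_C, summed over the samples."""
    return -np.log(p[np.arange(len(labels)), labels]).sum()


def kl_divergence(p_target, p):
    """D_KL(p_target || p), summed over samples and classes."""
    return np.sum(p_target * np.log(p_target / p))


def dml_losses(logits1, logits2, labels):
    p1, p2 = softmax(logits1), softmax(logits2)
    loss1 = cross_entropy(p1, labels) + kl_divergence(p2, p1)   # p2 is the target
    loss2 = cross_entropy(p2, labels) + kl_divergence(p1, p2)   # p1 is the target
    return loss1, loss2


rng = np.random.default_rng(0)
logits1 = rng.standard_normal((4, 3))    # 4 samples, 3 classes (made-up values)
logits2 = rng.standard_normal((4, 3))
labels = np.array([0, 1, 2, 1])

print(dml_losses(logits1, logits2, labels))
```

When the two networks already agree, both KL terms vanish and each loss reduces to that network's own classification loss.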

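Takeaways

DML is a general mutual-learning method in the family of transfer learning and distillation; it improves the generalization of a single network without requiring a pretrained network, and applies to all kinds of classification tasks
Mathematically DML is very simple and it is not hard to implement, yet it works well
AlignedReID adopts this method

References

[Paper Notes] Deep Mutual Learning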

https://arxiv.org/abs/1706.00384