ICLR-2014

全称为「International Conference on Learning Representations」（国际学习表征会议），由位列深度学习三大巨头之二的 Yoshua Bengio 和 Yann LeCun 牵头创办。详细介绍可以参考才办了五年的 ICLR，为何被誉为“深度学习的顶级会议”？| ICLR 2017。

对于用卷积神经网络进行图像目标的分类问题，我们要明白一些相关东西。

一个训练好的卷积神经网络，它的相关的参数是固定的对吧，参数都存在这些地方：第一，卷积核，所以呢，卷积层中的 feature map的个数是固定的，但是它的大小不是固定的哦。第二就是网络中的全连接层，全连接层的神经元个数不能改变了。。。总之，在卷积层中，网络的输入大小可以改变，输入大小的不同的结果仅仅是 feature map 大小的不同。而在全连接层中，我们网络的输入的大小不能改变，必须固定。在卷积神经网络中，网络的前面基本都是卷积层，而网络的后面几层基本都会跟着全连接层的，所以，这就决定的网络在训练过程中网络的输入的大小应该统一固定，在测试中，网络的输入应该与训练时网络的输入大小保持一致。

以上说的为常用的卷积神经网络。不过有一种 fully-convnet 可以不受这个限制，它的做法就是把网络的后面的全连接层看层为卷积层来对待。这样，我们在测试时，就可以 multi-scale 的输入了。

one CNN model（Overfeat），three tasks：
- classification
- localization
- detection
multi-scales and sliding window（feature map）
localization & detection by
- predict object bounding boxes
- accumulation not suppression to increase detection confidence（好处：无需做背景样本的训练、避免耗时复杂的抽样训练，让模型聚焦positive class 提高预测精度）

2 Motivation

面临的问题：

While images from the ImageNet classification dataset are largely chosen to contain a roughly centered object that fills much of the image, objects of interest sometimes vary significantly in size and position within the image.

idea有三个：

CNN with a sliding window fashion, and over multiple scales，但是如果window抓住 a perfectly identifiable portion of the object（比如狗的头部），This leads to decent classification but poor localization and detection.
train the system to not only produce a distribution over categories for each window, but also to produce a prediction of the location and size of the bounding box containing the object relative to the window.
accumulate the evidence for each category at each location and size.

本文首次
provide a clear explanation how ConvNets can be used for localization and detection for ImageNet data.

3 Advantages

ILSVRC 2013 datasets, ranks 4th in classification, 1st in localization and 1st in detection
explaining how ConvNets can be effectively used for detection and localization tasks

4 Methods

4.1 Vision Tasks

classification：for a image, assign a category
localization: you know ahead of time that there is one object, we’re gong to make classification decision and give produce exactly one bounding box
detection: there exists some fixed set of categories, given a image, we want to find all objects in that set of categories, and give bounding box and category label for each object,(we don’t konw ahead of time how many objects to find)

4.2 Classification

4.2.1 Model Design

基于AlexNet 改进
Feature Extractor（layers 1-5） + Multi-Scale Classification（layers 6-output）

fast model

这里写图片描述

1）不使用LRN；

2）不使用over-pooling使用普通pooling；

3）第3,4,5卷基层特征数变大，从AlexNet 的 384→384→256；变为512→1024→1024.

4）fc-6层神经元个数减少，从4096 变为 3072

accuracy model

这里写图片描述

1）不使用LRN；

2）不使用over-pooling，使用普通pooling，更大的pooling间隔 S=2 或 3

3）第一个卷基层的间隔从4变为2（accurate 模型），卷积大小从11*11变为7*7；第二个卷基层filter从5*5升为7*7

4）增加了一个第三层，是的卷积层变为6层；从Alex-net的 384→384→256；变为512→512→1024→1024.

关于stride，需要注意，stride大虽然有利于速度，但会损害准确性。

4.2.2 Training

回顾 Alexnet的训练、测试：

1) 训练阶段：每张训练图片256*256，然后我们随机裁剪出224*224大小的图片，作为CNN的输入进行训练。

2) 测试阶段：输入256*256大小的图片，我们从图片的5个指定的方位(上下左右+中间)进行裁剪出5张224*224大小的图片，然后水平镜像一下再裁剪5张，这样总共有10张；然后我们把这10张裁剪图片分别送入已经训练好的CNN中，分别预测结果，最后用这10个结果的平均作为最后的输出。

overfeat这篇文献的图片分类算法，在训练阶段采用与Alexnet相同的训练方式，然而在测试阶段可是差别很大

OverFeat
1）训练：对于每张原图片为256*256，然后进行随机裁剪为221*221的大小作为CNN输入，进行训练。
2）测试：多尺度+offset-max pooling
这里写图片描述

layer5 pre-pooling → layer5 post-layer，过程如下，以一维为例：
这里写图片描述

输入20像素（layers 1-5 提取出来的 feature map 的长或者宽）

从第一个像素开始（△=0）经过 size为3，stride 为3 的 max pooling 后，结果为 6（图中（c）红色），余下的两个像素（19、20）扔掉
从第二个像素开始（△=1）经过 size为3，stride 为3 的 max pooling 后，结果为 6 （图中（c）绿色），余下的两个像素（1、20）扔掉
从第三个像素开始（△=2）经过 size为3，stride 为3 的 max pooling 后，结果为 6（图中（c）蓝色），余下的两个像素（1、2）扔掉

一个维度 offset max-pooling 后有三个结果，两个维度（picture）就有 3X3 个结果，这就对应 table5 中的（3*3），之后经过 layer 6-8 层输出结果，layer 6-8相当于一个 5*5 的卷积核（图中（d），一维就是5），所以，6个像素经过一个 5 的卷积后，会输出 2，把输出的3个2结合一下就是图中的（e）了。

明白了offset之后，再回过头来看测试的表
这里写图片描述

scale1：（17-2）/ 3 = 5，5 卷积（核大小5，步长1）后为 1
scale2：（20-2）/ 3 = 6，6 卷积（核大小5，步长1）后为 2，（23-2）/ 3 = 7，7 卷积（核大小5，步长1）后为 3

后面类似，就不一一列举了，最终把所有的预测结果平均一下即可

上图中（c）到（d）的过程，是 layer6-layer8的过程，具体如下，我们称之为（FCN）
这里写图片描述
核心就是把全连接层换成了conv层，这样就是很好的适应 multi-scale
原来CNN

fully-convnet

4.2.3 Resuls

这里写图片描述

6-scalse fine steps 表现最好

这里写图片描述

4.3 Localization

4.3.1 Generating Predictions

To generate object bounding box predictions, we simultaneously run the classifier and regressor networks across all locations and scales. Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network.

4.3.2 Regressor Training

　　定位问题的模型也是一个CNN，1-5层作为特征提取层和分类问题完全一样，后面接两个全连接层（4096*1024→输出4），组成regressor network。训练时，前面5层的参数由classification network给定，只需要训练后面的两个全连接层。这个regressor network的输出就是一个bounding box，也就是说，如果将一幅图像或者一个图像块送到这个regressor network中，那么，这个regressor network 输出一个相对于这个图像或者图像块的区域，这个区域中包含感兴趣的物体。
　　
　　对于定位问题，测试时，在每一个尺度上同时运行classification network和regressor network。这样，对于每一个尺度来说，classification network给出了图像块的类别的概率分布，regressornetwork进一步为每一类给出了一个bounding box，这样，对于每一个bounding box，就有一个置信度与之对应。最后，综合这些信息，给出定位结果。
　　
　　我们把用图片分类学习的特征提取层的参数固定下来，然后继续训练后面的回归层的参数，网络包含了4个输出，对应于bounding box的上左上角点和右下角点，然后损失函数采用欧式距离L2损失函数。

如上图所示，每个回归网络，以最后一个卷积层作为输入，回归层也有两个全连接层，隐层单元为4096,1024（为什么作者没有说，估计也是交叉实验验证的），最后的输出层有4个单元，分别是预测bounding box的四个边的坐标。和分类使用offset-pooling一样，回归预测也是用这种方式，来产生不同的预测结果。

上图展示了在单个比例上预测的在各个offset和sliding window下 pooling后，预测的多个bounding box；从图中可以看出本文通过回归预测bounding box的方法可以很好的定位出物体的位置，而且bounding box都趋向于收敛到一个固定的位置，而且还可以定位多个物体和同一个物体的不同姿势。但是感觉offset和sliding window方式，通过融合虽然增加了了准确度，但是感觉好复杂；而且很多的边框都很相似，感觉不需要这么多的预测值。就可以满足超过覆盖50%的测试要求。

4.3.3 Combining Predictions

这里写图片描述

match score： using the sum of the distance between centers of the two bounding boxes and the intersection area of the boxes
box merge：compute the average of the bounding boxes’ coordinates