Paper Reading: An Object Detection Collection

Object Detection

Rich feature hierarchies for accurate object detection and semantic segmentation

CVPR’14

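Problem

The best-performing earlier methods were complex ensemble systems
Precise localization with a sliding window is a challenge

Method

Testing:
- Extract about 2000 region proposals from the input image
- Run the CNN once on each region to extract a fixed-length feature vector
- Classify each feature vector with class-specific linear SVMs
Training:
- Supervised pre-training: pre-train on the ILSVRC 2012 dataset
- Domain-specific fine-tuning: fine-tune the network for the target detection domain
- Object category classifiers: train one linear SVM per class, using the standard hard negative mining method

References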

https://arxiv.org/pdf/1311.2524.pdf

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

ECCV’14

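Problem

Earlier methods require a fixed-size input image, obtained by cropping or warping
This introduces unnecessary distortion

Method

Proposes spatial pyramid pooling (SPP)
An SPP layer is added on top of the last convolutional layer
The feature map produced by the convolutional layers is divided into 1x1, 2x2, 4x4, … grids of bins, and max pooling is applied within each bin, so the output length is fixed regardless of the input size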
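The pooling step can be sketched in NumPy as follows. This is an illustration under assumptions, not the paper's implementation: the function name `spatial_pyramid_pool`, the (C, H, W) layout, the pyramid levels, and the way bin edges are rounded are all choices made here.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over n x n grids and concatenate."""
    _, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges cover the whole map even when H or W is not divisible by n.
        row_edges = np.linspace(0, h, n + 1)
        col_edges = np.linspace(0, w, n + 1)
        for i in range(n):
            r0, r1 = int(np.floor(row_edges[i])), int(np.ceil(row_edges[i + 1]))
            for j in range(n):
                c0, c1 = int(np.floor(col_edges[j])), int(np.ceil(col_edges[j + 1]))
                # One max per channel for this bin.
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

# Two different spatial sizes give the same output length: 256 * (1 + 4 + 16).
for h, w in [(13, 13), (10, 17)]:
    features = np.random.rand(256, h, w)
    print(spatial_pyramid_pool(features).shape)
```

Because the number of bins depends only on the pyramid levels, the fully connected layers that follow always see a vector of the same length.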

Takeaways

The idea behind the Bag-of-Words (BoW) approach
a global average pooling is used to reduce the model size and also reduce overfitting;
a global average pooling is used on the testing stage after all fc layers to improve accuracy
a global max pooling is used for weakly supervised object recognition
The global pooling operation corresponds to the traditional Bag-of-Words method

References

https://arxiv.org/pdf/1406.4729.pdf

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

NIPS’15

Problem

Previous methods: SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck
Region proposal methods typically rely on inexpensive features and economical inference schemes (for example, Selective Search)

Method

Proposes the Region Proposal Network (RPN), built on an anchor-box mechanism
On top of these conv features, we construct RPNs by adding two additional conv layers:
- one that encodes each conv map position into a short (e.g., 256-d) feature vector
- and a second that, at each conv map position, outputs an objectness score and regressed bounds for k region proposals relative to various scales and aspect ratios at that location (k = 9 is a typical value).
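To make the k = 9 anchors concrete, here is a small sketch that enumerates them for one position. The scales (128, 256, 512) and aspect ratios (0.5, 1, 2) are the defaults reported in the Faster R-CNN paper; the function name `make_anchors` and the (x1, y1, x2, y2) box format are assumptions made for this illustration.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(cx=300, cy=200)
print(len(anchors))  # 9
```

The RPN then predicts, for each of these reference boxes, an objectness score and four regression offsets.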
Training: we propose a simple training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed

References

http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf

SSD: Single Shot MultiBox Detector

ECCV’16

Problem

these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.

Method

…that does not resample pixels or features for bounding box hypotheses
The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps
To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio
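The idea of default boxes at several scales and aspect ratios can be sketched as below. The linear scale schedule (s_min = 0.2, s_max = 0.9), the aspect ratios, and the extra square box follow the SSD paper's description; the function names and the choice to pass the next scale explicitly are assumptions of this sketch.

```python
import math

def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Scale (relative to the image side) for each prediction feature map."""
    return [s_min + (s_max - s_min) * k / (num_maps - 1) for k in range(num_maps)]

def default_box_shapes(scale, next_scale, ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """(width, height) pairs of the default boxes at one feature-map location."""
    shapes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in ratios]
    # One extra square box whose scale lies between this map and the next.
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes

scales = default_box_scales(num_maps=6)
shapes = default_box_shapes(scales[0], scales[1])
print(len(shapes))  # 6
```

Every location of every prediction feature map gets this small set of boxes, and a small convolutional filter predicts class scores and offsets for each of them.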

References

https://arxiv.org/pdf/1512.02325.pdf

Feature Pyramid Networks for Object Detection

CVPR’17

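Problem

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive
That is, feature pyramids are common in traditional computer vision algorithms, whereas deep learning methods have largely avoided multi-scale algorithms, because once multiple scales are involved the computation multiplies
The advantage of the pyramid structure is that the features at every level are semantically strong, including the high-resolution lower levels
In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost

Method

The authors argue that each layer of a convolutional network corresponds to features at one scale
FPN (Feature Pyramid Networks) consists of a bottom-up pathway, a top-down pathway, and lateral connections
- Bottom-up pathway
  - The authors define layers whose feature maps have the same size as belonging to the same stage, and use only the feature map of the last layer of each stage for the subsequent operations. This is easy to understand: feature maps within a stage have the same size, and the deepest one has the strongest feature representation
  - Specifically, for ResNets, the authors use the feature activations output by the last residual block of each stage
- Top-down pathway and lateral connections
  - The higher-level feature map is upsampled (nearest-neighbor upsampling) and then merged with the feature map of the previous level through a lateral connection; see Section 3 of the original paper for details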
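One top-down merge step can be sketched in NumPy as below. In the FPN paper the lateral connection is a 1x1 convolution that matches channel counts and the merge is an element-wise addition; the function names, the random weights, and the tensor sizes are assumptions for illustration, and the 3x3 smoothing convolution that follows each merge in the paper is omitted.

```python
import numpy as np

def upsample_nearest_2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral_1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map over channels."""
    # weight: (C_out, C_in), x: (C_in, H, W) -> (C_out, H, W)
    return np.einsum("oc,chw->ohw", weight, x)

def top_down_merge(coarse, fine, weight):
    """Upsample the coarser level and add the projected finer level."""
    return upsample_nearest_2x(coarse) + lateral_1x1(fine, weight)

rng = np.random.default_rng(0)
p5 = rng.standard_normal((256, 8, 8))      # top pyramid level, already 256 channels
c4 = rng.standard_normal((1024, 16, 16))   # bottom-up feature map of the stage below
w4 = rng.standard_normal((256, 1024)) * 0.01
p4 = top_down_merge(p5, c4, w4)
print(p4.shape)  # (256, 16, 16)
```

Repeating this step down the hierarchy yields one merged map per stage, and predictions are made on each of them.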

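Takeaways

The authors build the feature pyramid from the convolutional network's own structure
The algorithm uses both the high resolution of low-level features and the strong semantics of high-level features, and makes predictions by fusing features from these different levels
Predictions are made independently on each fused feature level, which differs from conventional feature fusion

References

FPN (feature pyramid networks) algorithm explained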

https://arxiv.org/abs/1612.03144

R-FCN: Object Detection via Region-based Fully Convolutional Networks

NIPS’16

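Problem

On the one hand, image-level classification favors translation invariance (shifting an object within an image should not change the classification result)
On the other hand, object detection has to localize objects and therefore needs translation variance

Method

To incorporate translation variance into the FCN, we construct position-sensitive score maps that encode position information with respect to a relative spatial position.
On top of the FCN, we append a position-sensitive RoI pooling layer that shepherds information from these score maps, with no weight (convolutional or fully connected) layers following it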
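A rough NumPy sketch of position-sensitive RoI pooling is shown below. It assumes the score-map channels are ordered as (bin row, bin column, class), that the RoI is given in feature-map cells with exclusive right and bottom edges, and that bins are average-pooled and then averaged as the vote; the function name and these layout choices are assumptions for illustration.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """score_maps: (k * k * num_classes, H, W); roi: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = roi
    maps = score_maps.reshape(k, k, num_classes, *score_maps.shape[1:])
    # Edges that split the RoI into a k x k grid of bins.
    xs = np.linspace(x1, x2, k + 1)
    ys = np.linspace(y1, y2, k + 1)
    pooled = np.zeros((k, k, num_classes))
    for i in range(k):        # bin row
        for j in range(k):    # bin column
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            # Bin (i, j) reads only from its own group of score maps.
            pooled[i, j] = maps[i, j, :, r0:r1, c0:c1].mean(axis=(1, 2))
    # Voting: average the k * k bin responses for each class.
    return pooled.mean(axis=(0, 1))

k, num_classes = 3, 2
score_maps = np.random.rand(k * k * num_classes, 12, 12)
print(ps_roi_pool(score_maps, roi=(2, 2, 8, 8), k=k, num_classes=num_classes).shape)  # (2,)
```

Since the pooling and voting have no learnable weights, the per-RoI cost after the shared convolutional layers is very small.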

Takeaways

Spatial position can be determined by voting over the score maps

References

[Translation] Object detection with R-FCN

https://arxiv.org/pdf/1605.06409.pdf

PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

NIPS’16

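Problem

Highly accurate detection algorithms have heavy computational cost
This paper proposes PVANET, a lightweight feature-extraction network architecture for object detection

Method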

The key design principle is “less channels with more layers”
our networks adopted some recent building blocks
Concatenated rectified linear unit (C.ReLU) is applied to the early stage of our CNNs (i.e., first several layers from the network input) to reduce the number of computations by half without losing accuracy.
Inception [3] is applied to the remaining of our feature generation sub-network
We adopted the idea of multi-scale representation like HyperNet [4] that combines several intermediate outputs
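The core of C.ReLU fits in a few lines. The sketch below works on the per-channel activations at a single position and leaves out the learned scale and shift that the PVANET paper applies after the concatenation; the function names are assumptions for illustration.

```python
def relu(x):
    return max(0.0, x)

def crelu(features):
    """Concatenate ReLU(x) and ReLU(-x) over the channel dimension."""
    return [relu(v) for v in features] + [relu(-v) for v in features]

# The convolution only has to produce 2 channels; C.ReLU turns them into 4.
print(crelu([1.0, -2.0]))  # [1.0, 0.0, 0.0, 2.0]
```

The saving comes from the preceding convolution: it computes half as many output channels, and the negated copy supplies the other half.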

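Takeaways

C.ReLU gives roughly a 2x speed-up without losing accuracy
Inception is one of the most cost-effective building blocks for capturing both small and large objects in the input image
The multi-scale representation takes several levels of detail and non-linearity into account at once

References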

PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

https://arxiv.org/pdf/1608.08021v3.pdf

DSSD : Deconvolutional Single Shot Detector

CVPR’17

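Problem

Further improve the accuracy of SSD

Method

Replace SSD's base network, VGG, with ResNet-101
Then add a deconvolution module and a prediction module
Skip connections

Takeaways

A better feature-extraction network and additional context information help improve accuracy

References

Deep learning paper notes: DSSD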

https://arxiv.org/abs/1701.06659

DSOD: Learning Deeply Supervised Object Detectors from Scratch

ICCV’17

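Problem

Why train a detection model from scratch instead of fine-tuning a pre-trained model?

Pre-trained models are usually trained on image classification datasets such as ImageNet and may not transfer well to detection data from a different domain (for example, medical images)
The architecture of a pre-trained model is fixed, so modifying it is cumbersome
The training objective of a pre-trained classification network generally differs from the detection objective, so a pre-trained model is not necessarily the best choice for a detector

Method

In the paper's structure figure, the plain connection on the left represents the feature fusion used in SSD; for a 300x300 input image, six feature maps of different scales are combined
Each dashed rectangle contains a 1x1 convolution and a 3x3 convolution. This is a bottleneck: the 1x1 convolution mainly reduces the number of channels and thereby the cost of the 3x3 convolution
The dense connection on the right represents the feature fusion introduced in this paper, which borrows the idea of DenseNet
In the dense connection, the dashed rectangle on the left closely resembles the dashed rectangle on the right side of the plain connection. The difference is the number of channels: the 3x3 convolution in the dense connection has half as many channels as the corresponding 3x3 convolution in the plain connection. The reason is that in the plain connection the input of each bottleneck is simply the output of the previous bottleneck, whereas in the dense connection the input of each bottleneck is the concatenation of the outputs of all preceding bottlenecks
The rectangle on the right of the dense connection is a down-sampling block, consisting of a 2x2 max pooling (for down-sampling) and a 1x1 convolution (to reduce the number of channels); the authors note that down-sampling before the 1x1 convolution reduces computation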
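The two cost arguments above (the bottleneck, and pooling before the 1x1 convolution) can be checked with a quick multiply-accumulate count. The feature-map size and channel counts below are hypothetical values picked for the arithmetic, not numbers taken from the DSOD paper.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulate count of a stride-1, same-padded convolution."""
    return height * width * c_in * c_out * kernel * kernel

h = w = 38            # hypothetical feature-map size
c_in = c_out = 512    # hypothetical channel counts
c_mid = 128           # hypothetical channel count after the 1x1 reduction

# Bottleneck: 1x1 reduction followed by 3x3, versus a direct 3x3 convolution.
direct = conv_macs(h, w, c_in, c_out, 3)
bottleneck = conv_macs(h, w, c_in, c_mid, 1) + conv_macs(h, w, c_mid, c_out, 3)
print(bottleneck / direct)

# Down-sampling block: 2x2 max pooling first, then the 1x1 convolution.
conv_then_pool = conv_macs(h, w, c_in, c_mid, 1)
pool_then_conv = conv_macs(h // 2, w // 2, c_in, c_mid, 1)
print(conv_then_pool / pool_then_conv)  # 4.0
```

With these numbers the bottleneck needs well under half the operations of the direct 3x3 convolution, and pooling first cuts the 1x1 convolution's cost by a factor of four because it runs on a quarter of the positions.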

References

DSOD: a detection algorithm that needs no pre-trained model

http://openaccess.thecvf.com/content_ICCV_2017/papers/Shen_DSOD_Learning_Deeply_ICCV_2017_paper.pdf

Training Region-based Object Detectors with Online Hard Example Mining

CVPR’16

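Problem

The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune.

Method

The paper proposes online hard example mining (OHEM), an efficient way to train region-based ConvNet detectors that suppresses easy examples and some small groups of examples, making training more efficient
The method builds on the well-known bootstrapping technique (widely used with SVMs) and modifies SGD so that several heuristics and hyperparameters of region-based ConvNet training can be removed, while giving more accurate and stable detection results
- In OHEM, for a given image and its selective-search RoIs, the convolutional feature map is computed as usual. In part (a) of the paper's architecture figure (green), a read-only RoI network runs a forward pass over the feature map and all RoIs, and the hard RoI module then uses the losses of these RoIs to select B examples. In part (b) (red), the selected hard examples go through the RoI network for both the forward and the backward pass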
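The selection done by the hard RoI module boils down to a top-B pick by loss, sketched below. The function name and the loss values are made up for illustration, and the NMS step that the paper uses to drop heavily overlapping RoIs before selection is left out.

```python
def select_hard_rois(losses, batch_size):
    """Return the indices of the batch_size RoIs with the highest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:batch_size]

# Losses from the read-only forward pass over all RoIs (hypothetical values).
roi_losses = [0.1, 2.3, 0.05, 1.7, 0.9]
hard = select_hard_rois(roi_losses, batch_size=2)
print(hard)  # [1, 3]
```

Only the RoIs at these indices would then be passed through the trainable RoI network for the forward and backward pass.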

References

Paper notes: OHEM

http://openaccess.thecvf.com/content_ICCV_2017/papers/Bodla_Soft-NMS–_Improving_ICCV_2017_paper.pdf

Focal Loss for Dense Object Detection

ICCV’17

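Problem

one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far
In other words, one-stage detectors are fast and simple but less accurate than two-stage detectors
The motivation for focal loss is to let a one-stage detector reach the accuracy of a two-stage detector without giving up its speed

Method

Proposes focal loss
- $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
- This loss is a modification of the standard cross-entropy loss
- By down-weighting easy examples, it makes the model focus on hard examples during training
- When an example is misclassified, $p_t$ is small, so the modulating factor $(1 - p_t)^{\gamma}$ is close to 1 and the loss is nearly unaffected; as $p_t \to 1$, the factor approaches 0 and the loss of well-classified examples is down-weighted
- The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. Increasing $\gamma$ strengthens the effect of the modulating factor; in the experiments $\gamma = 2$ works best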
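A minimal per-example sketch of the formula for binary classification is given below. The defaults `gamma=2.0` and `alpha=0.25` are the values the paper reports as working best; the function name and the scalar interface are assumptions for illustration.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

Setting `gamma=0.0` and `alpha=1.0` on a positive example recovers the plain cross-entropy term $-\log(p_t)$, which makes the modification easy to check.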

References

Reading Focal Loss

https://arxiv.org/abs/1708.02002

Soft-NMS – Improving Object Detection With One Line of Code

ICCV’17

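Problem

Suppose a red box and a green box are the current detections, with scores 0.95 and 0.80. Traditional NMS first selects the highest-scoring red box, and the green box is then removed because it overlaps the red box too much
On the other hand, the NMS threshold is hard to choose: set too low, it produces the situation just described (the green box is removed because of its large overlap with the red box); set too high, it tends to increase false positives

Method

Idea: instead of bluntly removing every box whose IoU exceeds the threshold, lower its confidence score

A confidence threshold is then specified, and boxes whose final score exceeds it are kept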
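The procedure can be sketched in plain Python as follows, using the Gaussian decay from the Soft-NMS paper, where each remaining score is multiplied by exp(-IoU^2 / sigma). The function names, the (x1, y1, x2, y2) box format, and the two example boxes are assumptions for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Return (box, score) pairs, decaying scores instead of deleting boxes."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        box, score = candidates.pop(best)
        if score <= score_thresh:
            break  # every remaining score is at most this one
        kept.append((box, score))
        # Lower the score of each remaining box according to its overlap.
        candidates = [(b, s * math.exp(-iou(box, b) ** 2 / sigma))
                      for b, s in candidates]
    return kept

red, green = (0, 0, 100, 100), (20, 0, 120, 100)
kept = soft_nms([red, green], [0.95, 0.80])
print(len(kept))  # 2
```

In this example the two boxes have an IoU of about 0.67, so hard NMS with a 0.5 threshold would delete the green box, while here it survives with a reduced score.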

References

Improving NMS with one line of code

https://arxiv.org/abs/1704.04503

Light-Head R-CNN: In Defense of Two-Stage Object Detector

arXiv:1711

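Problem

Two-stage detectors are accurate but slow
The cost lies mainly in the proposal-based recognition stage, which the authors call the ‘head’

Method

Slim down the head
- Large separable convolution + thin feature map to speed up the algorithm.
- Replace global average pooling with a fully connected (FC) layer to reduce the loss of spatial information and improve accuracy.
Add other tricks, such as PSRoI pooling with RoIAlign, multi-scale training, and OHEM, to further improve accuracy

Takeaways

Analyze a problem before solving it, so that you know where effort will actually pay off
This paper is more engineering-oriented and practical

References

[Computer Vision] Light-Head R-CNN: In Defense of Two-Stage Object Detector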

https://arxiv.org/abs/1711.07264
