Object detection: MR-CNN

Object detection via a multi-region & semantic segmentation-aware CNN model

Object detection with a CNN model aided by multiple regions and semantic segmentation

Abstract

Original	Paraphrase
We propose an object detection system that relies on a multi-region deep convolutional neural network (CNN) that also encodes semantic segmentation-aware features.	The paper proposes an object detection system built on a multi-region deep CNN that also encodes features related to semantic segmentation.
The resulting CNN-based representation aims at capturing a diverse set of discriminative appearance factors and exhibits localization sensitivity that is essential for accurate object localization.	The CNN features capture a range of discriminative appearance factors and are sensitive to localization, which matters for placing objects accurately.
We exploit the above properties of our recognition module by integrating it on an iterative localization mechanism that alternates between scoring a box proposal and refining its location with a deep CNN regression model.	The recognition module is plugged into an iterative localization mechanism that alternates between scoring a box proposal and refining its position with a deep CNN regression model.
Thanks to the efficient use of our modules, we detect objects with very high localization accuracy. On the detection challenges of PASCAL VOC2007 and PASCAL VOC2012 we achieve mAP of 78.2% and 73.9% correspondingly, surpassing any other published work by a significant margin.	With these modules the detector localizes objects very accurately, reaching 78.2% mAP on PASCAL VOC2007 and 73.9% on PASCAL VOC2012.

main contributions

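A multi-region CNN recognition model that enriches the object representation with features extracted from several regions, which improves detection;
Semantic segmentation is brought into object detection: segmentation-aware features are learned in a weakly supervised way, with no extra data required;
The classification CNN is coupled with a regression network that predicts the bounding box.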

multi-region CNN model

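The multi-region network is shown in the figure below. It has two main parts: (1) the activation maps module, which produces the feature maps, and (2) the region adaptation components of the model.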

Activation maps module. This part of the network gets as input the entire image and outputs activation maps (feature maps) by forwarding it through a sequence of convolutional layers. In short, the whole image is passed through a stack of convolutional layers once to produce its feature (activation) maps.

Region adaptation module. Given a region R on the image and the activation maps of the image, this module projects R on the activation maps, crops the activations that lay inside it, pools them with a spatially adaptive (max-)pooling layer, and then forwards them through a multi-layer network. In short, given a region R and the feature maps from step 1, the module maps R onto the feature maps, crops them, applies a spatially adaptive pooling layer, and feeds the result to a fully connected network.
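The projection, crop, and adaptive pooling steps can be sketched in plain Python. This is only an illustration under stated assumptions: a single-channel feature map stored as nested lists, a fixed stride between image and feature-map coordinates, and the hypothetical helper names `project_region` and `adaptive_max_pool`. None of these come from the paper's code.

```python
import math

def project_region(box, stride, map_h, map_w):
    """Map an image-space box (x0, y0, x1, y1) to feature-map cell indices.

    The stride (image pixels per feature-map cell) is an assumed constant.
    """
    x0, y0, x1, y1 = box
    fx0 = min(max(int(math.floor(x0 / stride)), 0), map_w - 1)
    fy0 = min(max(int(math.floor(y0 / stride)), 0), map_h - 1)
    # Keep at least one cell so the crop is never empty.
    fx1 = min(max(int(math.ceil(x1 / stride)), fx0 + 1), map_w)
    fy1 = min(max(int(math.ceil(y1 / stride)), fy0 + 1), map_h)
    return fx0, fy0, fx1, fy1

def adaptive_max_pool(fmap, region, out_size):
    """Max-pool the cropped region into an out_size x out_size grid.

    Bin edges scale with the region size, so regions of any size
    produce an output of the same fixed size.
    """
    fx0, fy0, fx1, fy1 = region
    h, w = fy1 - fy0, fx1 - fx0
    pooled = []
    for i in range(out_size):
        r0 = fy0 + (i * h) // out_size
        r1 = fy0 + int(math.ceil((i + 1) * h / out_size))
        row = []
        for j in range(out_size):
            c0 = fx0 + (j * w) // out_size
            c1 = fx0 + int(math.ceil((j + 1) * w / out_size))
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 4x4 feature map whose cell (r, c) holds the value r * 4 + c.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
region = project_region((16, 16, 48, 48), stride=16, map_h=4, map_w=4)
pooled = adaptive_max_pool(fmap, region, out_size=2)
```

The pooled grid would then be flattened and passed to the fully connected layers; because its size is fixed, the same layers can serve regions of any shape.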

Multi-Region CNN Model
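"Multi-region" means taking several regions of the same image. The aims are:
(i). to enrich the extracted features by looking at the object from several views;
(ii). to make the extracted features sensitive to wrongly or inaccurately localized boxes.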

region components and their role in detection

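Which regions should be used, and how can we make sure the extracted features fully represent the image? Section 2.1 of the paper lists the chosen regions and the reasons for them: ten representative regions in total, shown in the figure below.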
regions used in the paper
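The authors design ten regions in two shapes, rectangles and rectangular rings, and explain why each was chosen.
Role in detection: the paper gives two reasons for using multiple regions, discriminative feature diversification and localization-aware representation.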
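The two region shapes can be illustrated with a small sketch. The notes only state that there are ten regions made of rectangles and rectangular rings; the particular split below (one full box, four halves, two centre crops, three rings), every scale factor, and all function names are assumptions chosen for illustration.

```python
def scale_box(box, factor):
    """Scale a box (x0, y0, x1, y1) about its centre."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    hw, hh = (x1 - x0) * factor / 2.0, (y1 - y0) * factor / 2.0
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def build_regions(box):
    """Return ten regions as (outer_box, inner_box) pairs.

    inner_box is None for a plain rectangle; for a rectangular ring the
    region is the outer box with the inner box cut out.
    All scale factors here are illustrative assumptions.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return {
        "full": (box, None),
        "left_half": ((x0, y0, cx, y1), None),
        "right_half": ((cx, y0, x1, y1), None),
        "top_half": ((x0, y0, x1, cy), None),
        "bottom_half": ((x0, cy, x1, y1), None),
        "centre_small": (scale_box(box, 0.5), None),
        "centre_large": (scale_box(box, 0.8), None),
        "border_inner": (box, scale_box(box, 0.5)),
        "border_outer": (scale_box(box, 1.5), scale_box(box, 0.8)),
        "context": (scale_box(box, 1.8), box),
    }

def inside(box, point):
    """True if the point lies in the half-open box."""
    x, y = point
    return box[0] <= x < box[2] and box[1] <= y < box[3]

def contains(region, point):
    """True if the point lies in the region (inside outer, outside inner)."""
    outer, inner = region
    return inside(outer, point) and not (inner is not None and inside(inner, point))

regions = build_regions((0, 0, 100, 100))
```

A ring such as `context` ignores the object itself and sees only its surroundings, while a half box sees only part of the object. That difference is what lets the combined feature react when a candidate box is shifted or cropped.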

semantic segmentation-aware CNN model

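Experience suggests that segmentation can help detection, so the authors extend the multi-region model with semantic segmentation information.
1. Improved activation maps
A fully convolutional network (FCN) is used instead to predict the likelihood that an object is present. Training is weakly supervised: no segmentation annotations are needed, only the existing ground truth boxes. Even with this weak supervision, the resulting activation maps still reflect semantic information. The final classification layer of the FCN is removed and the remaining network is used as a feature extractor.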
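One way to picture the weak supervision is to turn ground truth boxes into coarse foreground masks that an FCN could be trained against. The sketch below assumes one binary mask per class, integer pixel coordinates, and half-open boxes; the function name `boxes_to_masks` and the data layout are hypothetical, not taken from the paper.

```python
def boxes_to_masks(height, width, annotations):
    """Build one binary mask per class from ground truth boxes.

    annotations: list of (label, (x0, y0, x1, y1)) with integer pixel
    coordinates. Every pixel inside a box of a class is marked as
    foreground (1) for that class; everything else stays background (0).
    """
    masks = {}
    for label, (x0, y0, x1, y1) in annotations:
        mask = masks.setdefault(label, [[0] * width for _ in range(height)])
        # Clip the box to the image so out-of-range boxes do not fail.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return masks

masks = boxes_to_masks(4, 6, [("cat", (1, 1, 3, 3)), ("dog", (4, 0, 9, 2))])
```

Such masks are noisy, since a box also covers background pixels around the object, which is why the supervision counts as weak.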
2. Improved region adaptation

localization

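The localization scheme has three main parts:
CNN region adaptation module for bounding box regression
Unlike R-CNN, which uses a single regression layer, this model uses two hidden fully connected layers followed by a prediction layer that outputs the bounding box. So that nearby objects are not missed, the candidate box is enlarged by a factor of 1.3 before it is fed to the regressor.
Iterative localization: the boxes are refined over several iterations
Bounding box voting: a voting scheme, similar in spirit to NMS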
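The three parts can be sketched together in a few lines. This is a rough illustration, not the authors' implementation: `score_fn` and `regress_fn` are hypothetical stand-ins for the recognition and regression CNNs, and the IoU threshold of 0.5 and the score-weighted averaging used for voting are assumptions.

```python
def enlarge_box(box, factor=1.3):
    """Scale a box (x0, y0, x1, y1) about its centre by the given factor."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    hw, hh = (x1 - x0) * factor / 2.0, (y1 - y0) * factor / 2.0
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def iterative_localization(boxes, score_fn, regress_fn, iterations=2):
    """Alternate between scoring boxes and refining their positions.

    Returns every (box, score) pair seen in any iteration.
    """
    candidates, current = [], list(boxes)
    for t in range(iterations):
        candidates.extend((b, score_fn(b)) for b in current)
        if t < iterations - 1:
            current = [regress_fn(b) for b in current]
    return candidates

def box_voting(kept, candidates, iou_thresh=0.5):
    """Replace each kept box by the score-weighted mean of its overlapping candidates."""
    voted = []
    for box in kept:
        total, acc = 0.0, [0.0, 0.0, 0.0, 0.0]
        for cand, score in candidates:
            if iou(box, cand) >= iou_thresh:
                weight = max(0.0, score)  # negative scores get no vote
                total += weight
                for k in range(4):
                    acc[k] += weight * cand[k]
        voted.append(tuple(v / total for v in acc) if total > 0 else tuple(box))
    return voted

# Toy run: a constant scorer and a regressor that shifts boxes right by 1.
cands = iterative_localization(
    [(0, 0, 10, 10)],
    score_fn=lambda b: 1.0,
    regress_fn=lambda b: (b[0] + 1, b[1], b[2] + 1, b[3]),
)
refined = box_voting([(0, 0, 10, 10)], cands)
```

In a full pipeline, the `kept` boxes would be the survivors of a standard NMS pass over `cands`; voting then adjusts their coordinates rather than discarding further boxes.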

Object detection: DeepBox

DeepBox: Learning Objectness with Convolutional Networks

DeepBox: learning objectness with convolutional networks

Abstract

Original	Paraphrase
Existing object proposal approaches use primarily bottom-up cues to rank proposals, while we believe that “objectness” is in fact a high level construct.	Most existing proposal methods rank proposals with bottom-up cues, whereas the authors argue that objectness is a high-level concept.
We argue for a data-driven, semantic approach for ranking object proposals. Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method.	The authors advocate a data-driven, semantic way of ranking proposals. Their framework, DeepBox, uses a CNN to re-rank proposals produced by a bottom-up method.
We use a novel four-layer CNN architecture that is as good as much larger networks on the task of evaluating objectness while being much faster. We show that DeepBox significantly improves over the bottom-up ranking, achieving the same recall with 500 proposals as achieved by bottom-up methods with 2000.	A new four-layer CNN matches much larger networks at judging objectness while running much faster. With DeepBox, 500 re-ranked proposals reach the recall that the bottom-up method needs 2000 proposals for.
This improvement generalizes to categories the CNN has never seen before and leads to a 4.5-point gain in detection mAP. Our implementation achieves this performance while running at 260 ms per image.	The gain carries over to categories the CNN never saw in training and raises detection mAP by 4.5 points, at 260 ms per image.

main contributions

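A four-layer convolutional network that is both faster and more accurate
Better proposals, obtained by re-ranking the output of a bottom-up method.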

Method

The overall framework is shown in the figure below and has three steps:

Generate a bottom-up proposal pool
Extract features with the CNN
Re-rank the proposals
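The three steps amount to sorting a pool of proposals by a learned score. A minimal sketch follows, where `objectness_fn` is a hypothetical stand-in for the CNN and the default pool size of 2000 is an assumption borrowed from the numbers quoted in the abstract.

```python
def rerank(proposals, objectness_fn, pool_size=2000):
    """Re-order a pool of bottom-up proposals by an objectness score.

    proposals: boxes already ranked by the bottom-up method.
    objectness_fn: callable returning a score for one box (the CNN in the
    paper; any function works for this sketch).
    """
    pool = proposals[:pool_size]
    scored = [(objectness_fn(box), box) for box in pool]
    # sort() is stable, so ties keep their original bottom-up order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [box for _, box in scored]

def box_area(box):
    """Toy scorer used only for the example below."""
    return (box[2] - box[0]) * (box[3] - box[1])

bottom_up = [(0, 0, 2, 2), (0, 0, 10, 10), (5, 5, 8, 8), (1, 1, 3, 3)]
ranked = rerank(bottom_up, box_area)
top2 = ranked[:2]
```

After re-ranking, a detector can keep only the first few hundred boxes, which is where the saving in the abstract (500 instead of 2000 proposals) comes from.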

Network Architecture

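The authors tried many variants of the network. Starting from AlexNet, they changed the number of units in fc6, changed the input size, removed different layers, and so on, and settled on the following final architecture: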
conv(11,96,4)–pool(3,2)–conv(5,256,1)–fc(1024)–fc(2)
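To get a feel for how small this network is, the spatial sizes can be traced layer by layer. The sketch reads conv(k, c, s) as kernel size, output channels, and stride, and pool(k, s) as kernel size and stride. That reading, the absence of padding, and the 120-pixel square RGB input are all assumptions made for illustration.

```python
def out_size(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

def trace_shapes(input_size, layers, in_channels=3):
    """Return the (channels, height, width) shape after each layer.

    layers: tuples of ("conv", kernel, channels, stride) or
    ("pool", kernel, stride).
    """
    size, channels, shapes = input_size, in_channels, []
    for layer in layers:
        if layer[0] == "conv":
            _, kernel, channels, stride = layer
        else:
            _, kernel, stride = layer  # pooling keeps the channel count
        size = out_size(size, kernel, stride)
        shapes.append((channels, size, size))
    return shapes

# Convolutional part of the architecture line above.
DEEPBOX_LAYERS = [("conv", 11, 96, 4), ("pool", 3, 2), ("conv", 5, 256, 1)]

shapes = trace_shapes(120, DEEPBOX_LAYERS)  # 120 is an assumed input size
fc_inputs = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions the last convolution produces a 256 x 9 x 9 volume, so fc(1024) would take 20736 inputs and fc(2) would output the object versus non-object scores.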

Sharing computation for faster reranking

For each input image, convolutional feature extraction is run only once and shared across all proposals.

Object detection series (4): MR-CNN and DeepBox