Understanding Neural Networks (10): Faster R-CNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In the "region proposal + CNN classification" detection framework, the quality of the region proposals directly affects detection accuracy. If we can extract only a few hundred (or fewer) high-quality candidate windows while keeping recall high, detection becomes faster and also more accurate (fewer false positives). The RPN (Region Proposal Network) was designed for exactly this purpose.


1) The core idea of the RPN is to generate region proposals directly with a convolutional neural network; in essence it is still a sliding-window method. The design is clever: the RPN only needs to slide once over the last convolutional layer, because the anchor mechanism and bounding-box regression together yield region proposals at multiple scales and aspect ratios.
2) Faster R-CNN architecture. For candidate-box extraction, the most common method, Selective Search, takes about 2 s per image. The improved EdgeBoxes algorithm brings this down to about 0.2 s, but that is still not fast enough. Candidate boxes do not have to be extracted from the original image; they can also be extracted from a feature map, and a low-resolution feature map means less computation. Based on this idea, Shaoqing Ren et al. at MSRA proposed the RPN (Region Proposal Network), which solves the problem neatly. Let us first look at the network topology.
[Figure: Faster R-CNN network topology]
By adding an extra RPN branch, candidate-box extraction is merged into the deep network; this is the landmark contribution of Faster R-CNN. The RPN extracts candidate-box features with a sliding window: each sliding-window position generates nine candidate windows (different scales and aspect ratios), and the features of these nine windows (anchors) are used for object classification and bounding-box regression, much as in Fast R-CNN. The classification only needs to decide whether a candidate box is foreground or background.
Bounding-box regression then determines the object position more precisely. The basic network structure is as follows:
[Figure: basic network structure of the RPN]
During training, candidate boxes (anchors) are selected as follows:

  • Anchors that cross the image boundary are discarded;
  • Anchors whose overlap (IoU) with a ground-truth box is greater than 0.7 are labeled as foreground; anchors whose overlap is less than 0.3 are labeled as background.
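
The two selection rules above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names `iou` and `label_anchor`, the (x1, y1, x2, y2) box format, and the use of -1 for anchors that take no part in training are all assumptions made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, img_w, img_h, fg_thresh=0.7, bg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (not used for training)."""
    x1, y1, x2, y2 = anchor
    # Rule 1: anchors that cross the image boundary are dropped.
    if x1 < 0 or y1 < 0 or x2 > img_w or y2 > img_h:
        return -1
    # Rule 2: compare the best overlap with any ground-truth box to the thresholds.
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    # Assumption of this sketch: anchors between the two thresholds are ignored.
    return -1
```

For example, with a single ground-truth box (100, 100, 200, 200) in a 1000×600 image, an anchor that coincides with the box is labeled foreground, while a distant anchor such as (400, 400, 500, 500) is labeled background.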

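At each position, two fully connected layers (object classification + bounding-box regression) evaluate every candidate box (anchor), and anchors are discarded according to their probability scores (only about 300 anchors are kept). No candidate windows are extracted explicitly; the network itself performs both the judgment and the refinement.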
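
As a rough illustration of the "keep about 300 anchors" step, the sketch below ranks anchors by foreground score and keeps the highest-scoring ones. The function name `keep_top_anchors` and the list-based inputs are assumptions for this example. The original paper also applies non-maximum suppression before this cut, which is left out here to keep the sketch short.

```python
def keep_top_anchors(anchors, scores, keep=300):
    """Keep the `keep` anchors with the highest foreground scores.

    anchors: list of boxes, scores: list of floats of the same length.
    Returns (kept_anchors, kept_scores), sorted by descending score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[:keep]
    return [anchors[i] for i in kept], [scores[i] for i in kept]
```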
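From the training point of view, near real-time performance is reached by alternating training with shared features. The alternating training scheme is: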

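  • Initialize the weights w from an existing network and train the RPN;
  • Use the RPN to extract region proposals on the training set, train Fast R-CNN on these proposals, and update the weights w;
  • Repeat the two steps above until convergence.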
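
The control flow of this alternating scheme can be sketched as a simple loop. Everything below is hypothetical scaffolding: `train_rpn`, `extract_proposals`, `train_detector` and `has_converged` are placeholder callables supplied by the caller, and `max_rounds` is an added safety limit. The sketch shows only the order of the steps, not how either network is actually trained.

```python
def alternating_training(weights, train_rpn, extract_proposals, train_detector,
                         has_converged, max_rounds=10):
    """Alternate RPN and Fast R-CNN training on shared weights.

    Returns (final_weights, number_of_rounds_run).
    """
    for round_idx in range(max_rounds):
        previous = weights
        # Step 1: train the RPN starting from the current weights.
        weights = train_rpn(weights)
        # Step 2: extract proposals with the RPN, then train the detector on them.
        proposals = extract_proposals(weights)
        weights = train_detector(weights, proposals)
        # Step 3: repeat until the caller-supplied convergence test passes.
        if has_converged(previous, weights):
            return weights, round_idx + 1
    return weights, max_rounds
```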

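Faster R-CNN showed that CNN-based real-time object detection was within reach and opened up further research in this direction. At this point, let us look at how the R-CNN family of networks evolved: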
[Figures: evolution of the R-CNN family of networks]
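3) RPN architecture
The RPN takes an image of any size as input and outputs a set of candidate rectangles, each with an objectness score. The RPN is trained to produce region proposals directly, so no external proposal method is needed.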
[Figure: RPN network structure (ZF model)]
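An anchor is centered on the sliding window and is associated with a scale and an aspect ratio. By default 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) are used, giving k = 9 anchors at each sliding position. Consider the RPN structure diagram (which uses the ZF model): given an input image (say at a resolution of 600×1000), the convolutional layers produce a final feature map of roughly 40×60. A 3×3 convolution kernel (the sliding window) is convolved with this feature map. Since the last convolutional layer has 256 feature maps, each 3×3 region yields a 256-dimensional feature vector, which is fed to a cls layer (box-classification layer) and a reg layer (box-regression layer) for classification and bounding-box regression respectively (similar to Fast R-CNN, except that there are only two classes here: object and background). Each feature region covered by the 3×3 window simultaneously predicts region proposals at 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) in the input image; this mapping mechanism is called the anchor. For the 40×60 feature map there are therefore about 20,000 anchors in total (40×60×9 = 21,600), that is, about 20,000 predicted region proposals.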
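
To make the anchor count concrete, here is a minimal sketch that enumerates the anchors for a 40×60 feature map. The function names, the (x1, y1, x2, y2) box format, the stride of 16 pixels, and the choice to center each anchor in the middle of its 16×16 cell are assumptions of this sketch; real implementations differ in such details.

```python
import math


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors centered at (0, 0).

    Each anchor has area scale**2; ratio is height / width.
    """
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


def all_anchors(feat_h, feat_w, stride=16):
    """Place the base anchors at every position of a feat_h x feat_w feature map."""
    base = base_anchors()
    out = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in input-image coordinates.
            cx = col * stride + stride / 2
            cy = row * stride + stride / 2
            for x1, y1, x2, y2 in base:
                out.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return out


anchors = all_anchors(40, 60)
print(len(anchors))  # 40 * 60 * 9 = 21600
```

Many of these anchors extend past the image border, which is why the boundary rule used during training removes a large share of them.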
Faster R-CNN fuses the previously separate region-proposal and CNN-classification stages into a single end-to-end object detection network, which improves both speed and accuracy. However, Faster R-CNN still does not reach real-time detection: it first obtains region proposals and then classifies each proposal, which is still a considerable amount of computation. Fortunately, later detectors such as YOLO made real-time detection possible.
What are the benefits of this design? A sliding-window strategy is still used, but with two differences. First, the sliding window operates on a convolutional feature map whose dimensions are reduced by a factor of 16×16 relative to the original image (after four 2×2 pooling operations in between). Second, nine anchors are used to cover multiple scales, corresponding to three scales and three aspect ratios, and bounding-box regression is applied afterwards, so even an object that does not fit any of the nine anchor windows can still get a reasonably close region proposal.
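
The way bounding-box regression moves an anchor toward an object can be illustrated with the box parameterization commonly used in the R-CNN family (center offsets scaled by the anchor size, log-scale width and height). The text above does not spell this formula out, so treat the sketch below, including the name `decode_box`, as an assumption-laden illustration rather than the article's own method.

```python
import math


def decode_box(anchor, deltas):
    """Apply regression outputs (tx, ty, tw, th) to an anchor (x1, y1, x2, y2)."""
    xa1, ya1, xa2, ya2 = anchor
    wa, ha = xa2 - xa1, ya2 - ya1
    cxa, cya = xa1 + wa / 2, ya1 + ha / 2
    tx, ty, tw, th = deltas
    # Shift the center in units of the anchor size.
    cx = cxa + tx * wa
    cy = cya + ty * ha
    # Rescale width and height on a log scale.
    w = wa * math.exp(tw)
    h = ha * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

With all-zero deltas the anchor is returned unchanged; a `tx` of 0.5 shifts the box to the right by half the anchor width, which is how a proposal can end up covering an object that no anchor matched exactly.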

4) Summary
[Figure: summary]


Origin blog.csdn.net/u010095372/article/details/91344687