Faster R-CNN Notes

I. Notes on the theory in the paper:

A helpful diagram: https://blog.csdn.net/WoPawn/article/details/52223282

Two modules:

a deep fully convolutional network that proposes regions, acting as an attention mechanism.
the Fast R-CNN detector that uses the proposed regions.

Faster R-CNN

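1. RPN (Region Proposal Networks)

SPPnet and Fast R-CNN reduced the running time of the detection network, but computing region proposals still takes a lot of time. Faster R-CNN solves this problem by introducing the Region Proposal Network (RPN), which replaces the selective search stage.

Input: an image of any size;

Output: a set of rectangular object proposals, each with an objectness score.

Ultimate goal: share computation with a Fast R-CNN detector and implement an end-to-end network.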

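Fast R-CNN structure diagram

For the RPN and Fast R-CNN to share convolutional features, the two networks must use the same convolutional layers. The paper uses the convolutional layers of two networks, ZF and VGG-16, as the shared layers.

As shown in the figure above, to generate region proposals, a small n*n (n=3) window (a convolutional layer) slides over every position of the last shared convolutional layer and maps each window to a 256-dimensional feature. This 256-d feature is fed into two sibling fully connected layers, cls and reg.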

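2. Translation-Invariant Anchors

If an object in an image is translated, its proposal should translate with it, and the same function should be able to predict the proposal at any location. MultiBox does not have this property. Translation invariance also reduces the model size.

At each sliding-window position, k region proposals are predicted (k=9 by default in the experiments), parameterized relative to k reference boxes called anchors. By default there are 3 scales (box areas of 128^2, 256^2, and 512^2 pixels) and 3 aspect ratios (1:1, 1:2, 2:1), each centered at the center of the sliding window (An anchor is centered at the sliding window in question.). For a convolutional feature map of size $W*H$ , there are $W*H*k$ anchors in total (each window position corresponds to one feature-map cell, and each cell has k anchors).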
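The anchor count $W*H*k$ can be checked with a small sketch. This is a minimal pure-Python illustration, not the paper's code: the names `base_anchors` and `all_anchors`, the stride of 16, placing centers in the middle of each feature-map cell, and reading the ratio as height/width are all assumptions made for the example.

```python
import math

def base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return k = len(scales) * len(ratios) anchor shapes as (width, height).

    Each shape has area scale**2; ratio is taken as height / width (assumption).
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes

def all_anchors(feat_w, feat_h, stride=16):
    """Place every base anchor at the center of every feature-map cell.

    Returns a list of (x_center, y_center, w, h) in input-image pixels.
    The stride of 16 is an assumed total downsampling factor.
    """
    shapes = base_anchors()
    anchors = []
    for j in range(feat_h):
        for i in range(feat_w):
            x_ctr = (i + 0.5) * stride
            y_ctr = (j + 0.5) * stride
            for w, h in shapes:
                anchors.append((x_ctr, y_ctr, w, h))
    return anchors

# A 4 x 3 feature map with k = 9 gives 4 * 3 * 9 anchors.
print(len(all_anchors(feat_w=4, feat_h=3)))
```

Because the same nine shapes are reused at every cell, the set of anchors shifts with the window, which is the translation-invariance property described above.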

【Our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios.】

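3. Multi-Scale Anchors as Regression References

Two popular ways for multi-scale predictions:

based on image/feature pyramids, as in DPM and CNN-based methods. The image is resized to several scales, and feature maps (HOG or deep convolutional features) are computed for each scale. This approach is time-consuming.
use sliding windows of multiple scales (and/or aspect ratios) on the feature maps, i.e., a pyramid of filters. The second way is usually adopted jointly with the first.

In this paper: a pyramid of anchors, which is more cost-efficient and relies only on images and feature maps of a single scale.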

the design of multiscale anchors is a key component for sharing features without extra cost for addressing scales.

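4. Loss Function for Learning Region Proposals

To train RPNs, each anchor is assigned a binary class label indicating whether or not it contains an object (object or not only, with no class-specific labels). Labels are assigned to anchors as follows:

positive label:
- the anchor/anchors with the highest IoU overlap with a ground-truth box,
- or, an anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
negative label:

IoU ratio < 0.3 for all ground-truth boxes.

Anchors that are neither positive nor negative do not contribute to the training objective.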
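The labeling rules above can be sketched in a few lines. This is an illustrative pure-Python version under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples with positive area, ignored anchors are marked `-1`, and the names `iou` and `label_anchors` are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row)
        if best > hi:
            labels.append(1)       # IoU > 0.7 with some ground-truth box
        elif best < lo:
            labels.append(0)       # IoU < 0.3 with every ground-truth box
        else:
            labels.append(-1)      # neither: does not contribute
    # The anchor(s) with the highest IoU for each ground-truth box are
    # positive even when that IoU is not above `hi`.
    for g in range(len(gt_boxes)):
        column = [row[g] for row in overlaps]
        top = max(column)
        if top > 0:
            for idx, value in enumerate(column):
                if value == top:
                    labels[idx] = 1
    return labels

gt = [(0, 0, 100, 100)]
print(label_anchors([(0, 0, 100, 100), (50, 0, 150, 100), (200, 200, 300, 300)], gt))
```

The highest-IoU rule guarantees every ground-truth box gets at least one positive anchor, even when no anchor clears the 0.7 threshold.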

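Loss function:

$L(\{P_i\}, \{t_i\}) =\frac{1}{N_{cls}}\sum_iL_{cls}(P_i, P_i^*)+\lambda \frac{1}{N_{reg}}\sum_iP_i^*L_{reg}(t_i, t_i^*)$

tips:

$i$ is the index of an anchor

$P_i$ is the predicted probability of anchor $i$ being an object.

$P_i^*$ is the ground-truth label, 1 (positive anchor) or 0 (negative anchor)

$t_i$ is a vector of the four parameterized coordinates of the predicted bounding box, and $t_i^*$ is that of the ground-truth box associated with a positive anchor

The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$ . 【In the paper's experimental code: $N_{cls}=256$ , $N_{reg}$ ~ 2,400, $\lambda =10$ 】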
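As a rough sketch of how the two terms combine, the snippet below evaluates the loss for a handful of anchors. It assumes $L_{cls}$ is the binary log loss and $L_{reg}$ is the smooth L1 loss from Fast R-CNN, as in the paper. The function names and the small `eps` guard against `log(0)` are illustrative choices.

```python
import math

def smooth_l1(x):
    """Smooth L1 loss on one scalar difference."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(probs, labels, t_pred, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """Evaluate L({P_i}, {t_i}) for sampled anchors.

    probs[i]  : predicted objectness P_i in (0, 1)
    labels[i] : ground-truth label P_i^* (1 or 0)
    t_pred[i] : 4-vector t_i; t_star[i]: 4-vector t_i^*
    """
    eps = 1e-12  # numerical guard, not part of the formula
    cls_sum = 0.0
    reg_sum = 0.0
    for p, p_star, t, ts in zip(probs, labels, t_pred, t_star):
        # Log loss over the two classes (object vs. not object).
        cls_sum += -(p_star * math.log(p + eps)
                     + (1 - p_star) * math.log(1 - p + eps))
        # P_i^* switches the regression term off for negative anchors.
        reg_sum += p_star * sum(smooth_l1(a - b) for a, b in zip(t, ts))
    return cls_sum / n_cls + lam * reg_sum / n_reg

zero = [[0.0, 0.0, 0.0, 0.0]]
print(rpn_loss([0.5], [1], [[2.0, 0.0, 0.0, 0.0]], zero, n_cls=1, n_reg=1))
```

With $N_{cls}=256$, $N_{reg}$ ~ 2,400 and $\lambda=10$, the two normalized terms end up roughly equally weighted.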

for bounding box:

$t_x=(x-x_a)/w_a$ , $t_y=(y-y_a)/h_a$

$t_w=\log(w/w_a)$ , $t_h=\log(h/h_a)$

$t_x^*=(x^*-x_a)/w_a$ , $t_y^*=(y^*-y_a)/h_a$

$t_w^*=\log(w^*/w_a)$ , $t_h^*=\log(h^*/h_a)$

tips: $x$ → predicted box, $x_a$ → anchor box, $x^*$ → ground-truth box (likewise for $y$, $w$, $h$). $x$ and $y$ are the coordinates of the box center; $w$ and $h$ are its width and height.

This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
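The parameterization is easy to verify numerically. The sketch below encodes a box relative to an anchor and decodes it back; boxes are `(x_center, y_center, w, h)` tuples, and the names `encode` and `decode` are illustrative.

```python
import math

def encode(box, anchor):
    """Map a box to (t_x, t_y, t_w, t_h) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert encode(): recover (x, y, w, h) from offsets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)
gt_box = (116.0, 92.0, 256.0, 64.0)
t_star = encode(gt_box, anchor)   # regression target t^*
print(t_star)
print(decode(t_star, anchor))     # recovers gt_box up to rounding
```

Applying `encode` to the ground-truth box gives the target $t^*$; at inference, `decode` turns the predicted $t$ into a proposal.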

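5. Training RPNs

image-centric sampling strategy
each mini-batch arises from a single image that contains many positive and negative example anchors.
256 anchors are randomly sampled from one image to compute the loss function of a mini-batch, with a positive-to-negative ratio of up to 1:1.
Weight initialization: all new layers are drawn from a Gaussian distribution ( $\mu =0$ , $\sigma =0.01$ ); all other layers (e.g., the shared convolutional layers) are initialized from an ImageNet-pretrained model. All layers of the ZF net are fine-tuned.
Learning rate: 0.001 (first 60k mini-batches), then 0.0001 (next 20k mini-batches)
Momentum: 0.9
Weight decay: 0.0005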
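The sampling rule and the step learning-rate schedule can be sketched as follows. Padding with negatives when an image has fewer than 128 positives follows the paper. The function names, the `-1` marker for ignored anchors, and the `seed` parameter are assumptions for the example.

```python
import random

def sample_minibatch(labels, batch_size=256, seed=None):
    """Pick anchor indices for one image-centric mini-batch.

    labels: 1 positive, 0 negative, -1 ignored (one entry per anchor).
    At most half of the batch is positive; if there are fewer positives,
    the remainder is filled with negatives.
    """
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)

def learning_rate(iteration):
    """0.001 for the first 60k mini-batches, 0.0001 afterwards."""
    return 0.001 if iteration < 60000 else 0.0001

pos_idx, neg_idx = sample_minibatch([1] * 10 + [0] * 1000 + [-1] * 50, seed=0)
print(len(pos_idx), len(neg_idx))
```

Capping positives at half the batch keeps the loss from being dominated by the far more numerous negative anchors.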

6. Sharing Features for RPN and Fast R-CNN

sharing convolutional layers between the two networks, rather than learning two separate networks
Three ways of sharing features:
- (1) Alternating training: iterative. First train the RPN, then use its proposals to train Fast R-CNN. The network fine-tuned by Fast R-CNN is then used to initialize the RPN, and the process repeats. All experiments in the paper use this method.
- (2) Approximate joint training: the RPN and Fast R-CNN are merged into one network during training. The gradients with respect to the proposal boxes' coordinates are ignored.
- (3) Non-approximate joint training: the gradients with respect to the box coordinates are taken into account.
4-Step Alternating Training
- step1: train the RPN, initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task.
- step2: train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This network is also initialized by the ImageNet-pre-trained model. At this point, the two networks do not share conv layers.
- step3: use the detector network to initialize RPN training, but fix the shared conv layers and only fine-tune the layers unique to the RPN. Now the two networks share conv layers.
- step4: keep the shared conv layers fixed and fine-tune the unique layers of Fast R-CNN.
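The four steps can be summarized as a small table of what is initialized from where and which layer groups are trainable. This is only a bookkeeping sketch, not a training script. The dictionary keys and the group names `conv`, `rpn_head`, and `detector_head` are hypothetical labels.

```python
# Hypothetical summary of the 4-step alternating schedule.
STEPS = [
    {"step": 1, "network": "RPN", "init": "ImageNet-pre-trained model",
     "trainable": {"conv", "rpn_head"}},
    {"step": 2, "network": "Fast R-CNN", "init": "ImageNet-pre-trained model",
     "trainable": {"conv", "detector_head"}},
    {"step": 3, "network": "RPN", "init": "step-2 detector network",
     "trainable": {"rpn_head"}},
    {"step": 4, "network": "Fast R-CNN", "init": "shared conv layers kept from step 3",
     "trainable": {"detector_head"}},
]

def conv_is_frozen(step):
    """True when the conv layers are fixed during the given step (1-based)."""
    return "conv" not in STEPS[step - 1]["trainable"]

for s in STEPS:
    print(s["step"], s["network"], "| frozen conv:", conv_is_frozen(s["step"]))
```

The conv layers become shared from step 3 on, once they are frozen and both networks sit on the same copy.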

7. Implementation Details

There is a trade-off between multi-scale feature extraction and speed-accuracy.
To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores.
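A minimal greedy NMS sketch is given below. It assumes `(x1, y1, x2, y2)` boxes with positive area and uses an IoU threshold of 0.7, the value the paper reports for proposal NMS. The function name and the quadratic pure-Python loop are illustrative, not how an optimized implementation would do it.

```python
def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression.

    Returns the indices of the kept boxes, highest score first. A box is
    kept only if its IoU with every already-kept box is <= iou_threshold.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # the second box overlaps the first heavily
```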

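8. Network structure (1)

(1) Introduction to VGG:

VGG-16: the name VGG comes from the group that used this network in the ImageNet ILSVRC 2014 competition. It was first published in the paper [Very Deep Convolutional Networks for Large-Scale Image Recognition].

When VGG is used for classification, its input is a 224x224x3 tensor. In classification the input image size is fixed, because the fully connected layers at the end of the network need a fixed-length input. Before the fully connected layers, the output of the last convolutional layer is usually flattened into a one-dimensional tensor.
Because detection uses the output of an intermediate convolutional layer, the input image size is no longer restricted, since only convolutional layers take part in the computation (could padding be used to keep the output size consistent?).
Each convolutional layer extracts more abstract features on top of the previous one. The first layer learns simple edges, and the second looks for patterns of edges that activate more complex shapes in later layers. In the end we get a convolutional feature map that is much smaller than the original image in its spatial dimensions but much deeper. The height and width of the feature map shrink with the pooling between convolutional layers, and the depth grows with the number of filters in the convolutional layers.

Left: anchors. Center: a single anchor from feature-map space expressed in the original image. Right: all anchors expressed in the original image.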

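(2) RPN

The RPN takes the convolutional feature map and generates proposals over the image.

The RPN takes all the reference boxes (anchors) and outputs a set of good proposals for objects. For each anchor it produces: (i) the probability that the anchor is an object, without caring which class the object belongs to; (ii) a bounding-box regression that adjusts the anchor to better fit the object it predicts.
The RPN is implemented in a fully convolutional way, taking the convolutional feature map returned by the base network as input. First there is a convolutional layer with 256 channels and a 3x3 kernel, followed by two parallel convolutional layers with 1x1 kernels, whose number of channels depends on the number of anchors per location.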
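To make "depends on the number of anchors per location" concrete, the sketch below lists the output shapes of the three layers. It assumes the paper's convention of 2k scores (object / not object) and 4k box offsets per location, a 3x3 convolution padded to keep the spatial size, and a channels-first `(C, H, W)` layout. The function name is illustrative.

```python
def rpn_head_shapes(feat_h, feat_w, k=9, mid_channels=256):
    """Output shapes (channels, height, width) of the RPN head layers."""
    return {
        # 3x3 conv with padding 1 keeps the spatial size (assumption).
        "intermediate": (mid_channels, feat_h, feat_w),
        # 1x1 conv: two scores (object / not object) per anchor.
        "cls": (2 * k, feat_h, feat_w),
        # 1x1 conv: four box offsets (t_x, t_y, t_w, t_h) per anchor.
        "reg": (4 * k, feat_h, feat_w),
    }

for name, shape in rpn_head_shapes(38, 50).items():
    print(name, shape)
```

With k=9 the cls branch has 18 channels and the reg branch 36, whatever the feature-map size.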

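The region-based convolutional neural network (R-CNN) is the final step of the Faster R-CNN pipeline. After obtaining the convolutional feature map from the image, using it to get object proposals from the RPN, and extracting features for each proposal (via RoI Pooling), we finally need to use these features for classification. R-CNN tries to mimic the final stages of a classification CNN, where a fully connected layer outputs a score for each object class.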
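As a small illustration of the last stage, the sketch below turns per-class scores for one proposal into probabilities with softmax and picks the most likely class. The class names and score values are made up for the example, and the helper names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, class_names):
    """Return (best class name, its probability) for one proposal."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs[best]

# Hypothetical scores from the final fully connected layer for one proposal.
name, prob = classify([1.0, 3.0, 0.5], ["background", "cat", "dog"])
print(name, round(prob, 3))
```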

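9. Network structure (2): based on http://huchaowei.com/2018/01/16/faster-rcnn%E7%BD%91%E7%BB%9C%E5%89%96%E6%9E%90/

Faster R-CNN = feature extraction + RPN + Fast R-CNN. Here ZF is chosen as the feature-extraction network, followed by the RPN, which generates proposals.

Four parts:

Conv layers: a set of basic conv+relu+pooling layers extracts the feature maps of the image.
Region Proposal Networks (RPN): this layer generates a set of anchors and maps them onto the original image, uses softmax to decide whether each anchor is foreground or background, and then applies bounding box regression to refine the anchors into accurate proposals.
RoI Pooling: this layer collects the input feature maps and proposals, combines them to extract proposal feature maps, and sends them to the subsequent fully connected layers to determine the object class.
Classification: uses the proposal feature maps to compute the class of each proposal, and applies bounding box regression once more to obtain the final, accurate position of the detection box.

As shown in the figure above:

Conv layers: the conv layers part consists of 13 conv layers, 13 relu layers, and 4 pooling layers. So that the feature maps produced by the conv layers can be mapped back to the original image, padding is used during convolution to keep the width and height unchanged. Each pooling operation halves the width and height, and there are four poolings, so an MxN input always becomes (M/16)x(N/16) after the conv layers.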
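The 1/16 factor can be checked with a few lines. The sketch assumes size-preserving convolutions and 2x2 pooling with stride 2. It rounds down at each pooling step, which is an assumption: frameworks differ on rounding when a size is not divisible by 2. The function name is illustrative.

```python
def feature_map_size(m, n, num_pools=4):
    """Spatial size after num_pools 2x2/stride-2 poolings (floor rounding)."""
    for _ in range(num_pools):
        m, n = m // 2, n // 2
    return m, n

print(feature_map_size(608, 800))   # both sides divisible by 16
print(feature_map_size(224, 224))
```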

RPN

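RPN: the network splits into two branches. The upper branch classifies anchors with softmax into foreground and background (detection targets are foreground); the lower branch computes bounding box regression offsets for the anchors to obtain accurate proposals. (One branch classifies and one regresses; the classification is only object vs. no object.) The final Proposal layer combines the foreground anchors and the bounding box regression offsets to produce proposals, while removing proposals that are too small or extend beyond the image boundary.
RoI pooling: this layer collects the proposals, unifies their sizes, and sends them to the subsequent network. It has two inputs: the original proposal boxes (of varying sizes) and the original feature maps.
Classification: this part takes the proposal feature maps and uses fully connected layers and softmax to compute which class each proposal belongs to, outputting the probability vector cls_prob. It also applies bounding box regression again to obtain a position offset bbox_pred for each proposal, which is used to regress a more accurate detection box.
- Classify the proposals with fully connected layers and softmax; this is effectively already recognition.
- Apply bounding box regression to the proposals again to obtain a more accurate rect box.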
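How RoI pooling "unifies the sizes" of proposals can be sketched on a single-channel feature map. This is a simplified pure-Python illustration: the RoI is given in feature-map cells as inclusive `(x1, y1, x2, y2)`, each output bin takes the maximum over its cells, and the floor/ceil bin boundaries are one common choice rather than a definitive specification.

```python
import math

def roi_pool(feature, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a 2-D feature map to a fixed out_h x out_w grid.

    feature: list of rows (H x W), single channel.
    roi: (x1, y1, x2, y2) in feature-map cells, inclusive, inside the map.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for ph in range(out_h):
        # Bin boundaries: floor for the start, ceil for the end, so every
        # bin covers at least one cell (neighboring bins may overlap).
        hs = y1 + math.floor(ph * roi_h / out_h)
        he = y1 + math.ceil((ph + 1) * roi_h / out_h)
        row = []
        for pw in range(out_w):
            ws = x1 + math.floor(pw * roi_w / out_w)
            we = x1 + math.ceil((pw + 1) * roi_w / out_w)
            row.append(max(feature[r][c]
                           for r in range(hs, he) for c in range(ws, we)))
        pooled.append(row)
    return pooled

feature = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_pool(feature, (0, 0, 3, 3), out_h=2, out_w=2))  # 4x4 RoI
print(roi_pool(feature, (0, 0, 2, 1), out_h=2, out_w=2))  # 3x2 RoI
```

Both RoIs produce a 2x2 output despite their different sizes, which is what lets the following fully connected layers take a fixed-length input.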

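Network structure of the classification part

Faster R-CNN training:

ZF network structure diagram

Faster R-CNN is trained on top of an already trained model (such as VGG_CNN_M_1024, VGG, or ZF). Training actually consists of 6 steps: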

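Train the RPN on the already trained model, corresponding to stage1_rpn_train.pt
Use the RPN trained in step 1 to collect proposals, corresponding to rpn_test.pt
Train the Fast R-CNN network for the first time, corresponding to stage1_fast_rcnn_train.pt
Train the RPN for the second time, corresponding to stage2_rpn_train.pt
Use the RPN trained in step 4 to collect proposals again, corresponding to rpn_test.pt
Train Fast R-CNN for the second time, corresponding to stage2_fast_rcnn_train.pt

As you can see, training is an "iterative" process, but with only two rounds. The reason for stopping at two is: "A similar alternating training can be run for more iterations, but we have observed negligible improvements." In other words, more rounds bring no real improvement.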

Training the RPN:

Collecting proposals with the trained RPN

The loss of the whole network is the loss function given at the beginning of these notes.

Fast R-CNN:
