Faster-RCNN Algorithm Intensive Reading

Paper: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

Abstract: The algorithm mainly solves two problems:

1. It proposes a Region Proposal Network (RPN) to quickly generate candidate regions;

2. Through alternating training, it makes the RPN and Fast-RCNN networks share convolutional parameters.

 

1. RPN network structure

The role of the RPN is to take an image as input and output a batch of rectangular candidate regions, similar to the Selective Search step in earlier object detection pipelines. The network is based on a convolutional neural network, but its output is a multi-task model containing a two-class softmax and a bbox regression. The network structure is as follows (using the ZF network as the reference model):


Above the dotted line is the structure up to the last convolutional layer of the ZF network; below the dotted line is the RPN-specific structure. First comes a 3*3 convolution, after which the output is split into two branches through 1*1 convolutions: one branch outputs the probabilities of target and non-target, and the other outputs the four box-related parameters, namely the box center coordinates x and y, box width w, and box height h.

(As for why a 3*3 convolution kernel is used here, I think it corresponds to the size of the receptive field. In the original ZF model, a 3*3 kernel covers a 3/13 fraction of the last feature map, which corresponds to a receptive field of about 180 pixels in a 1000*600 image (the paper itself quotes an effective receptive field of 171 pixels for ZF). For most targets in a 1000*600 image, a receptive field of this size is quite appropriate.)
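A quick numeric check of that 3/13 estimate (my own reading of the arithmetic, not a number from the paper):

```python
# 3/13 of each side of a 1000*600 input; the "about 180" figure quoted above
# falls between the two values.
for side in (1000, 600):
    print(side, "->", round(3 / 13 * side))  # prints: 1000 -> 231, 600 -> 138
```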

Viewed as an operation, convolution is equivalent to a sliding window. If the input image is 1000*600, then after several stride/pooling layers the feature map shrinks by a factor of 16, so the output of the last convolutional layer is about 60*40. The 3*3 convolution is therefore equivalent to sliding a 3*3 window over this map (note the padding). For the left branch, 18 channels are output, each still a 60*40 map, representing for each sliding-window center the probability that a target is or is not present in the corresponding receptive field. The right branch works the same way, outputting 36 channels for the box parameters.
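To make the two-branch head concrete, here is a minimal sketch in PyTorch (the paper's implementation is in Caffe; the class and layer names below are mine, and the 256 input channels match ZF's last conv layer):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3*3 conv, then two 1*1 conv branches."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        # 3*3 sliding window over the backbone's last feature map
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # left branch: target / non-target score per anchor -> 9 * 2 = 18 channels
        self.cls = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)
        # right branch: (tx, ty, tw, th) per anchor -> 9 * 4 = 36 channels
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

# For a 1000*600 input, ZF's last feature map is about 60*40:
scores, deltas = RPNHead()(torch.randn(1, 256, 40, 60))
print(scores.shape, deltas.shape)  # (1, 18, 40, 60), (1, 36, 40, 60)
```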

2. Anchor mechanism

The anchor is the core of the RPN. As just described, we need to decide, for each sliding-window center, whether a target exists in the corresponding receptive field. Because targets vary in size and aspect ratio, windows at multiple scales are needed. The anchor mechanism fixes a base window size and derives windows of different sizes from it by scale multiples and aspect ratios. In the paper, the base window size is 16, with three scale multiples (8, 16, 32) and three aspect ratios (0.5, 1, 2), giving 9 anchor shapes in total, as shown in the figure (taken from http://blog.csdn.net/shenxiaolu1984/article/details/51152614).


Therefore, when sliding over the 60*40 map, the 9 anchors are constructed around each center pixel and mapped back into the original 1000*600 image at a mapping ratio of 16. This gives 60*40*9, roughly 20,000 anchors in total.
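A NumPy sketch of this anchor construction (scales, ratios, and the stride of 16 follow the paper; the area-preserving shape rule and the half-pixel center offset are assumptions matching common implementations, not the authors' exact rounding):

```python
import numpy as np

def generate_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1, 2)):
    """The 9 reference anchors as (x1, y1, x2, y2), centered at the origin."""
    anchors = []
    for s in scales:
        for r in ratios:                # r is interpreted as height/width
            w = base * s / np.sqrt(r)   # preserves the area (base*s)^2
            h = base * s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def all_anchors(feat_w=60, feat_h=40, stride=16):
    """Tile the 9 anchors over every cell of the 60*40 map."""
    xs = (np.arange(feat_w) + 0.5) * stride   # cell centers in image coordinates
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (generate_anchors()[None] + shifts).reshape(-1, 4)

print(all_anchors().shape)  # (21600, 4), i.e. 60*40*9 ~ 20,000 anchors
```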

3. Training

Training the RPN requires defining the ground truth and the loss function. For the left branch, the ground truth is whether an anchor contains a target, encoded as 0/1. How do we decide whether an anchor contains a target? The paper adopts the following rules: 1) an anchor with the highest IoU against any ground-truth region is labeled as containing a target; 2) an anchor whose IoU with some ground-truth region exceeds 0.7 is labeled as containing a target; 3) an anchor whose IoU with every ground-truth region is below 0.3 is labeled as background. IoU (intersection over union) measures the overlap between a predicted box and a ground-truth box: the area of their intersection divided by the area of their union. All remaining anchors do not participate in training.
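For reference, the IoU of two axis-aligned boxes can be computed as below (a minimal sketch; the (x1, y1, x2, y2) box format is my assumption):

```python
def iou(a, b):
    """Intersection over union of boxes a, b given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection rectangle
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```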

The loss function is then defined as:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*)$$
The loss has two parts, corresponding to the two RPN branches: the classification error (target vs. non-target) and the bbox regression error, where $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$ uses the smooth L1 function proposed in Fast-RCNN, which the authors argue makes the learning rate easier to tune than an L2 loss. Note that $L_{reg}$ is multiplied by $p_i^*$, so the bbox regression error is computed only for anchors that contain a target; in other words, if an anchor contains no target, its box output does not matter. Accordingly, the bbox ground truth is defined only for anchors labeled positive, using the annotated coordinates of the matched box. Moreover, the bbox error compares not the four corner coordinates but $t_x, t_y, t_w, t_h$, computed as follows:

$$t_x = (x - x_a)/w_a,\quad t_y = (y - y_a)/h_a,\quad t_w = \log(w/w_a),\quad t_h = \log(h/h_a)$$
$$t_x^* = (x^* - x_a)/w_a,\quad t_y^* = (y^* - y_a)/h_a,\quad t_w^* = \log(w^*/w_a),\quad t_h^* = \log(h^*/h_a)$$

where $x, y, w, h$ are the predicted box's center coordinates, width, and height, the subscript $a$ denotes the anchor, and the asterisk denotes the ground-truth box.
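A small NumPy sketch of this target encoding together with the smooth L1 function from Fast-RCNN (following the formulas above; passing boxes as (center x, center y, w, h) is my convention):

```python
import numpy as np

def encode(anchor, gt):
    """Regression targets (tx, ty, tw, th) from an anchor to its matched box."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def smooth_l1(t, t_star):
    """R(t - t*): quadratic inside (-1, 1), linear outside, summed over 4 terms."""
    d = np.abs(t - t_star)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()

t = encode((50, 50, 160, 160), (58, 42, 200, 140))
print(t, smooth_l1(t, np.zeros(4)))
```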
4. Joint training

The authors adopt a four-step training method:

1) Train the RPN alone, loading the network parameters from a pre-trained model;

2) Train the Fast-RCNN network alone, using the candidate regions output by the RPN from step 1 as the input to the detection network. Concretely, the image passes through several conv-pool layers; each RPN candidate box is projected onto the resulting feature map, and RoI pooling followed by fully connected layers produces two branches, one a softmax over object classes and the other a bbox regression. Up to this point the two networks share no parameters; they have only been trained separately;

3) Train the RPN again, this time fixing the parameters of the layers shared with the detection network and updating only the RPN-specific parameters;

4) Use the RPN's output to fine-tune the Fast-RCNN network again, fixing the parameters of the shared layers and updating only the Fast-RCNN-specific parameters.

At this point training is complete; the network integrates detection and recognition in a single model. The test-phase pipeline is shown below:

There are some implementation details. For example, the roughly 20,000 anchors produced by the RPN are not all passed directly to Fast-RCNN, because many boxes overlap. The paper applies non-maximum suppression with an IoU threshold of 0.7, keeping only the locally highest-scoring boxes whose mutual overlap does not exceed 0.7 (coarse filtering). This leaves about 2,000 anchors, of which the top N boxes (e.g., 300) are handed to Fast-RCNN. Fast-RCNN then outputs a predicted class and box for each of the 300 proposals; non-maximum suppression with a threshold of 0.3 is applied to the class scores (fine filtering), and only results whose score exceeds some threshold (for example, 0.6) are kept.
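A minimal NumPy sketch of the greedy non-maximum suppression used for the coarse filtering (0.7 is the paper's threshold; the helper is illustrative and omits details such as clipping boxes to the image):

```python
import numpy as np

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS over (x1, y1, x2, y2) boxes: keep the best-scoring box,
    drop remaining boxes whose IoU with it exceeds `thresh`, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]        # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]                      # keep the current best box
        keep.append(i)
        rest = order[1:]
        # IoU of the kept box with all remaining boxes
        ix1 = np.maximum(x1[i], x1[rest]); iy1 = np.maximum(y1[i], y1[rest])
        ix2 = np.minimum(x2[i], x2[rest]); iy2 = np.minimum(y2[i], y2[rest])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]       # drop boxes overlapping too much
    return keep
```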
