Faster R-CNN paper notes - FR

Before introducing Faster R-CNN, let's cover some background knowledge that paves the way for it.

1. Deep learning object detection algorithms based on Region Proposals (candidate regions)

A Region Proposal (candidate region) identifies in advance the locations in an image where objects are likely to appear. By exploiting texture, edge, color and other cues in the image, it selects relatively few windows (thousands or even hundreds) while still maintaining a high recall (measured by IoU, Intersection-over-Union).

Figure 1 Definition of IoU
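For concreteness, here is a minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) form (a sketch for illustration only; py-faster-rcnn's own overlap code uses a slightly different +1 pixel convention):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Areas of the two boxes
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ~ 0.143
```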

Region Proposal methods produce higher-quality candidate windows than the traditional sliding-window approach. Commonly used Region Proposal methods include Selective Search (SS) and Edge Boxes (EB).

The steps of a Region Proposal based object detection algorithm are as follows:

Where:

See http://blog.csdn.net/qq_17448289/article/details/52850223 for the CNN method.

Bounding Box Regression: a linear regression algorithm that corrects the Region Proposal, so that the window extracted by the Region Proposal matches the target window (Ground Truth) more closely.

2. The relationship between R-CNN, Fast R-CNN, Faster R-CNN


Figure 2 The relationship between the three

Table 1 Comparison of the three

R-CNN (Region-based Convolutional Neural Networks)

Steps: 1. Selective Search (SS) extracts region proposals (RP); 2. a CNN extracts features; 3. SVM classification; 4. bounding-box (BB) regression.

Shortcomings: 1. The training pipeline is cumbersome (fine-tune the network + train SVMs + train bbox regressors); 2. Both training and testing are slow; 3. Training takes a lot of storage space.

Improvements: 1. mAP raised directly from 34.3% (DPM HSC) to 66%; 2. Introduces the RP + CNN pipeline.

Fast R-CNN (Fast Region-based Convolutional Neural Networks)

Steps: 1. SS extracts RP; 2. a CNN extracts features; 3. Softmax classification; 4. bounding-box regression with a multi-task loss.

Shortcomings: 1. Still uses SS to extract RP (about 2-3 s per image, while feature extraction takes only 0.32 s); 2. Cannot meet real-time applications and does not truly achieve end-to-end training and testing; 3. Uses the GPU, but the region proposal method runs on the CPU.

Improvements: 1. mAP raised from 66.9% to 70%; 2. About 3 s per image.

Faster R-CNN (Faster Region-based Convolutional Neural Networks)

Steps: 1. RPN extracts RP; 2. a CNN extracts features; 3. Softmax classification; 4. bounding-box regression with a multi-task loss.

Shortcomings: 1. Still cannot reach real-time detection; 2. Obtaining region proposals and then classifying each proposal still requires a fairly large amount of computation.

Improvements: 1. Improves both detection accuracy and speed; 2. Achieves a truly end-to-end object detection framework; 3. Generating proposals takes only about 10 ms.

2.1 Introduction to the R-CNN object detection pipeline

  

For details, please refer to http://blog.csdn.net/shenxiaolu1984/article/details/51066975

2.2 Introduction to the Fast R-CNN object detection pipeline

Note: In Fast R-CNN, the Region Proposals are applied on the feature map, so a separate CNN forward pass is no longer needed for each region.

The Fast R-CNN framework is as follows:


Figure 3 Fast R-CNN framework

The Fast R-CNN framework differs from R-CNN in two ways:

① A ROI pooling layer is added after the last convolutional layer;

② A multi-task loss function is used, and bounding-box regression is trained directly inside the CNN. For classification, Fast R-CNN directly uses softmax in place of the SVMs used by R-CNN.

Fast R-CNN is end-to-end.
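As a rough illustration of what the RoI pooling layer does, here is a much-simplified NumPy sketch (illustration only; the real Caffe layer works on batches of RoIs, maps image coordinates to feature-map coordinates via its spatial_scale parameter, and back-propagates gradients):

```python
import numpy as np

def roi_max_pool(feat, roi, out_size=7):
    """Simplified RoI max-pooling sketch.
    feat: (C, H, W) feature map; roi: (x1, y1, x2, y2) in feature-map coordinates."""
    C = feat.shape[0]
    x1, y1, x2, y2 = roi
    out = np.zeros((C, out_size, out_size), dtype=feat.dtype)
    # split the RoI into an out_size*out_size grid and take the max in each cell
    xs = np.linspace(x1, x2 + 1, out_size + 1).astype(int)
    ys = np.linspace(y1, y2 + 1, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = feat[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.reshape(C, -1).max(axis=1)
    return out

pooled = roi_max_pool(np.random.rand(512, 40, 60), roi=(10, 5, 30, 25))
print(pooled.shape)   # (512, 7, 7): a fixed-size feature regardless of RoI size
```

The point of the layer is exactly what the shape check shows: RoIs of arbitrary size are turned into a fixed-size feature so they can be fed into the fully connected layers.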

For details, please refer to http://blog.csdn.net/shenxiaolu1984/article/details/51036677

3. Faster R-CNN object detection

3.1 The idea of Faster R-CNN

Faster R-CNN can simply be seen as a system of "Region Proposal Network (RPN) + Fast R-CNN", in which the Selective Search method of Fast R-CNN is replaced by a region proposal network. The Faster R-CNN paper focuses on solving three problems in this system:
1. How to design the region proposal network;
2. How to train the region proposal network;
3. How to make the region proposal network and the Fast R-CNN network share the feature extraction network.

In the entire Faster R-CNN algorithm, there are three scales:
1. Original image scale: the size of the original input image. It is not subject to any restriction and does not affect performance.

2. Normalized scale: the size of the input to the feature extraction network, set at test time (opts.test_scale=600 in the source code). Anchors are set at this scale; the relative size of this parameter and the anchors determines the range of objects that can be detected.
3. Network input scale: the size of the input to the detection network, set during training (224*224 in the source code).

3.2 Introduction to Faster R-CNN Framework

Figure 4 Faster R-CNN model

The Faster-R-CNN algorithm consists of two major modules:

1. The RPN candidate-box extraction module;

2. Fast R-CNN detection module.

Among them, the RPN is a fully convolutional network used to extract candidate boxes; Fast R-CNN then detects and classifies the objects in the proposals produced by the RPN.

3.3 Introduction to RPN

3.3.1 Background

State-of-the-art object detection networks need to first use a region proposal algorithm to hypothesize object locations. Although networks such as SPP-net and Fast R-CNN have reduced the running time of the detection network itself, computing region proposals remains a bottleneck. Therefore, RBG, Kaiming He and their colleagues moved the Region Proposal step into the CNN as well, proposing the RPN (Region Proposal Network) to extract detection regions. The RPN shares full-image convolutional features with the detection network, making region proposals nearly cost-free.

What R-CNN answers is: "Why not use a CNN for classification?"

What Fast R-CNN answers is: "Why not output the bounding box and the label together?"

What Faster R-CNN answers is: "Why still use Selective Search?"

3.3.2 RPN core idea

The core idea of RPN is to use a CNN to generate Region Proposals directly. The method is essentially a sliding window (sliding once over the last convolutional layer only); thanks to the anchor mechanism and box regression, it can produce Region Proposals at multiple scales and aspect ratios.

The RPN is itself a fully convolutional network (FCN) that can be trained end-to-end for the task of generating detection proposals, predicting object bounds and objectness scores at the same time. Only 2 additional convolutional layers (the fully convolutional cls and reg layers) need to be added on top of the CNN:

① Encode each position of the feature map into a feature vector (256-d for ZF and 512-d for VGG).

② At each position, output an objectness score and regressed bounds for k region proposals, i.e. at each position of the convolutional feature map output objectness scores and regression boundaries for k = 3 scales * 3 aspect ratios = 9 region proposals.

The input to the RPN can be an image of any size (there is still a minimum resolution requirement, e.g. 228*228 for VGG). If VGG16 is used for feature extraction, the overall RPN network can be expressed as VGG16 + RPN.

VGG16: Reference

https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/pascal_voc/VGG16/faster_rcnn_end2end/train.prototxt, from which it can be seen that the part of VGG16 used for feature extraction consists of 13 convolutional layers (conv1_1 --> conv5_3), excluding pool5 and the layers after pool5.

Since our ultimate goal is to share computation with the Fast R-CNN object detection network, it is assumed that the two networks share a series of convolutional layers. In the experiments in the paper, ZF has 5 shareable convolutional layers and VGG has 13 shareable convolutional layers.

The specific process of the RPN is as follows: a small network slides over the feature map produced by the last convolutional layer. Each time, this sliding network is fully connected to an n*n window on the feature map (n=3 in the paper; the effective receptive field on the image is large, 171 pixels for ZF and 228 pixels for VGG). The window is then mapped to a low-dimensional vector (256-d for ZF, 512-d for VGG), and finally this low-dimensional vector is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls). The sliding-window treatment ensures that the reg and cls layers are associated with the entire feature space of conv5_3.

reg layer: for each anchor, predicts the proposal's coordinates (x, y, w, h);

cls layer: determines whether the proposal is foreground (object) or background (non-object).



Figure 5 RPN framework

In Figure 5, note that the center of the 3*3 convolution kernel corresponds to a position (point) on the original image (after re-scaling; the source code re-scales to 600*1000). This point is used as the anchor center, and anchors with multiple scales and aspect ratios are framed in the original image. The anchors therefore live on the original image, not on the conv feature map. For a feature map of size H*W, each position corresponds to 9 anchors, and there is an important parameter feat_stride = 16, which means that moving one step on the feature map corresponds to moving 16 pixels on the original image (look at the strides in the network to see where 16 comes from). Translating the coordinates of these 9 anchors gives their coordinates on the original image. After that, rpn_labels are generated according to the relationship between the ground-truth boxes and these anchors; the specific method, based on the overlap, is described in the paper and is not repeated here. In the generated rpn_labels, positive positions are set to 1, negative positions to 0, and the rest to -1. The bbox targets are generated by the _compute_targets() function, which finds the best-matching ground-truth box for each anchor and then applies the box-coordinate transformation described in the paper. http://blog.csdn.net/zhangwenjie89/article/details/52012880
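As an illustration of the anchor mechanism (a sketch, not the actual generate_anchors.py / anchor_target_layer.py code; the exact rounding in the source differs), anchors for every feature-map position could be enumerated like this, using the paper's 3 scales (128, 256, 512) and 3 ratios (0.5, 1, 2):

```python
import numpy as np

feat_stride = 16                      # one feature-map step = 16 input pixels
scales = np.array([128, 256, 512])    # anchor sizes (square root of the box area)
ratios = np.array([0.5, 1.0, 2.0])    # height/width aspect ratios

def anchors_at(cx, cy):
    """Return the 9 anchors (x1, y1, x2, y2) centred at image point (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)         # keep the area roughly s*s
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

H, W = 40, 60                          # conv5 feature-map size for a 600*1000 input
all_anchors = np.concatenate([
    anchors_at(x * feat_stride, y * feat_stride)
    for y in range(H) for x in range(W)
])                                     # shape: (40*60*9, 4), i.e. ~20k anchors
```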

Figure 6 9 kinds of anchors (note: different positions)


Figure 7 Faster R-CNN convolution flow chart

After a 600*1000 original image passes through the CNN, the last convolutional layer (conv5) produces a feature map of size 40*60, which corresponds to the typical value of 2400 positions mentioned in the paper. If the feature map size is W*H, then W*H*k anchors are needed; in this paper that is 40*60*9 ≈ 20k anchors.

In the RPN, the key things to understand are the concept of anchors, how the loss functions are computed, and the details of how the RPN training data is generated.

3.4 Translation invariance of RPN

One of the challenges in computer vision is translation (and scale) invariance: for example, in face recognition, how can a small face (24*24 resolution) and a large face (1080*720) both be correctly recognized by the same trained weights? If an object is translated in the image, the proposal box should translate with it, and the same function should still be able to predict it.

Traditionally, there are two mainstream solutions:
first, sample the images or feature maps over scales and widths/heights (image/feature pyramids);
second, sample the filters over scales and widths/heights (filter pyramids, which can also be viewed as sliding windows).

Faster R-CNN addresses this differently: it samples scales and aspect ratios at the center of the convolution kernel (the anchors used to generate proposal windows), using 3 scales and 3 ratios to produce 9 kinds of anchors.

3.5 Window classification and position refinement

The classification layer (cls_score) outputs, for each position, the probability that each of the 9 anchors belongs to the foreground or the background.

The window regression layer (bbox_pred) outputs, for each position, the translation and scaling parameters (x, y, w, h) that should be applied to the window corresponding to each of the 9 anchors.

For each position, the classification layer outputs the foreground/background probabilities from the 256-d feature, and the window regression layer outputs 4 translation/scaling parameters from the 256-d feature.

It should be noted that no candidate window is explicitly extracted; the network itself performs the judgment and refinement.
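To make the two heads concrete, here is a hedged sketch of the RPN head written in PyTorch for readability (the original implementation is Caffe prototxt; the class and variable names here are illustrative, not from the source):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3*3 'sliding window' conv followed by two 1*1 convs."""
    def __init__(self, in_channels=512, k=9):       # 512-d for VGG, 256-d for ZF
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls_score = nn.Conv2d(in_channels, 2 * k, 1)  # fg/bg score per anchor
        self.bbox_pred = nn.Conv2d(in_channels, 4 * k, 1)  # box offsets per anchor

    def forward(self, feat):                          # feat: (N, 512, H, W) conv5_3 map
        x = torch.relu(self.conv(feat))
        return self.cls_score(x), self.bbox_pred(x)   # (N, 18, H, W), (N, 36, H, W)
```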

3.6 Learning Region Proposal Loss Function

3.6.1 Rules for assigning class labels

In order to train the RPN, each anchor must be assigned a binary class label {object, not object}. For positive labels, the paper gives the following rules (satisfying either one qualifies an anchor as positive):

① the anchor(s) with the highest IoU overlap with some ground-truth box, or
② an anchor whose IoU overlap with any ground-truth box is higher than 0.7.
Note that a GT bounding box can correspond to multiple anchors, so a GT bounding box can have multiple positive labels.

In fact, the second rule alone can usually find enough positive samples, but in some extreme cases, e.g. when no anchor has an IoU greater than 0.7 with any ground-truth box, the first rule is used to generate positives.

negative label: An anchor with an IoU less than 0.3 with all GT bounding boxes.

Anchors that are neither positive nor negative, as well as anchors that cross the image boundary, are discarded because they contribute nothing to the training objective.
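A minimal sketch of these labelling rules (illustrative only; the real anchor_target_layer additionally removes boundary-crossing anchors first and handles IoU ties and sampling):

```python
import numpy as np

def assign_rpn_labels(overlaps, pos_thresh=0.7, neg_thresh=0.3):
    """`overlaps` is a (num_anchors, num_gt) IoU matrix.
    Returns 1 for positive, 0 for negative, -1 for anchors that are ignored."""
    labels = -np.ones(overlaps.shape[0], dtype=np.int64)
    max_iou_per_anchor = overlaps.max(axis=1)
    # negative: IoU with every GT box is below neg_thresh
    labels[max_iou_per_anchor < neg_thresh] = 0
    # positive rule 2: IoU above pos_thresh with some GT box
    labels[max_iou_per_anchor > pos_thresh] = 1
    # positive rule 1: the anchor with the highest IoU for each GT box
    labels[overlaps.argmax(axis=0)] = 1
    return labels
```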

3.6.2 Multi-task loss (from Fast R-CNN)

Figure 8 multi-task data structure

The Fast R-CNN network has two sibling output layers (the cls_score and bbox_pred layers), both fully connected; this is what is meant by "multi-task".

① cls_score layer: used for classification; it outputs a (K+1)-dimensional array p, giving the probability of belonging to each of the K classes plus the background. It outputs a discrete probability distribution for each RoI (Region of Interest).

Usually, p is computed by a softmax over the K+1 outputs of a fully connected layer.

② bbox_pred layer: used to adjust the position of the candidate region; it outputs the bounding-box regression displacements as a 4*K-dimensional array t, giving the translation and scaling parameters to apply when the RoI belongs to each of the K classes.

k indexes the class; t^k = (t_x^k, t_y^k, t_w^k, t_h^k), where t_x^k, t_y^k specify a translation relative to the object proposal's scale, and t_w^k, t_h^k specify log-space height and width shifts relative to the object proposal.

The loss_cls layer evaluates the classification loss, determined by the probability of the true class u:

L_cls(p, u) = -log p_u
The loss_bbox layer evaluates the localization loss. It compares the predicted translation/scaling parameters t^u for the ground-truth class u with the true translation/scaling parameters v = (v_x, v_y, v_w, v_h):

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i^u − v_i)
where the smooth L1 loss function is:

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| − 0.5 otherwise.
The smooth L1 loss curve is shown in Figure 9 below. The author chose it to make the loss more robust to outliers: compared with the L2 loss, it is less sensitive to outliers, and it bounds the magnitude of the gradient so that training is less likely to diverge.

Figure 9 smoothL1 loss function curve
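A small NumPy sketch of the smooth L1 function defined above (for illustration only; the Caffe layer used in the source code additionally applies per-box weights):

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

diff = np.array([-2.0, -0.5, 0.0, 0.3, 3.0])   # t - v, the regression residuals
print(smooth_l1(diff))                          # [1.5, 0.125, 0.0, 0.045, 2.5]
```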

The final total loss is a weighted sum of the two (if the class is background, the localization loss is not counted):

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

Defining u = 0 as the background class (i.e. the negative label), the Iverson bracket indicator [u ≥ 1] means that background candidate regions (negative samples) do not contribute to the regression loss, and no regression is performed on them. λ balances the classification and regression losses; in the Fast R-CNN paper, λ = 1 in all experiments.

The Iverson bracket indicator function is:

[u ≥ 1] = 1 if u ≥ 1, and 0 otherwise.

In the source code, bbox_loss_weights is used to mark whether each bbox belongs to a particular class.

3.6.3 The Faster R-CNN loss function

Following the multi-task loss definition and minimizing the objective function, the loss for one image in Faster R-CNN is defined as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where:

i is the index of an anchor in the mini-batch; p_i is the predicted probability that anchor i is an object; the ground-truth label p_i* is 1 if the anchor is positive and 0 if it is negative; t_i is the vector of 4 parameterized coordinates of the predicted bounding box, and t_i* is that of the ground-truth box associated with a positive anchor; N_cls and N_reg are normalization terms (the mini-batch size and the number of anchor locations, respectively); λ balances the two terms.
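Putting the pieces together, a per-image RPN loss could be sketched as follows (NumPy, for illustration; λ = 10 and N_reg ≈ 2400 follow the paper's defaults, while the array layout is an assumption, not the actual Caffe implementation):

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(cls_prob, labels, bbox_pred, bbox_targets, lam=10.0, n_reg=2400):
    """cls_prob: (A, 2) softmax bg/fg probabilities; labels: (A,) in {1, 0, -1};
    bbox_pred, bbox_targets: (A, 4) regression predictions and targets."""
    keep = labels >= 0                                        # drop "don't care" anchors
    n_cls = keep.sum()
    # classification term: -log p of the true class, averaged over sampled anchors
    l_cls = -np.log(cls_prob[keep, labels[keep]] + 1e-10).sum() / n_cls
    # regression term: smooth L1, only for positive anchors, normalised by n_reg
    pos = labels == 1
    diff = bbox_pred[pos] - bbox_targets[pos]
    l_reg = smooth_l1(diff).sum() / n_reg
    return l_cls + lam * l_reg
```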
3.6.4 Bounding-box regression in R-CNN

We first introduce the bounding-box regression method used in R-CNN and Fast R-CNN.

1. Why do we need bounding-box regression?

Figure 10 Example

As shown in Figure 10, the green box is the ground truth of the airplane, and the red box is the extracted Region Proposal. Even if the red box is classified as an airplane by the classifier, since the red box is poorly localized (IoU < 0.5), the airplane is effectively not correctly detected in this image. If we could fine-tune the red box so that the adjusted window is closer to the ground truth, localization would be more accurate. Indeed, bounding-box regression is used to fine-tune this window.

2. What is being regressed/fine-tuned?

A window is typically represented by a four-dimensional vector (x, y, w, h), i.e. the coordinates of its center together with its width and height. The goal of bounding-box regression is to find a mapping that transforms the original proposal window P = (P_x, P_y, P_w, P_h) into a regressed window Ĝ = (Ĝ_x, Ĝ_y, Ĝ_w, Ĝ_h) that is closer to the ground truth G = (G_x, G_y, G_w, G_h).
3. Bounding-box regression

So what transformation takes window P in Figure 11 to the regressed window Ĝ? A relatively simple idea is: first translate, then scale:

Ĝ_x = P_w d_x(P) + P_x,   Ĝ_y = P_h d_y(P) + P_y

Ĝ_w = P_w exp(d_w(P)),    Ĝ_h = P_h exp(d_h(P))
Note: only when the Proposal is close to the Ground Truth (so that the problem is approximately linear) can it be used as a training sample for our linear regression model; otherwise the trained regression model will not work (when the Proposal is far from the GT, the problem becomes a complex non-linear one, and modeling it with linear regression is clearly unreasonable). This is also the key to how "G-CNN: an Iterative Grid Based Object Detector" achieves accurate localization through multiple iterations.

Linear regression means: given an input feature vector X, learn a set of parameters W such that the value produced by the regression is very close to the true value Y (Ground Truth), i.e. Y ≈ WX. So what exactly are the input and output in bounding-box regression?

Input: is it just the four values (P_x, P_y, P_w, P_h) of the window? In fact the real input is the CNN feature corresponding to this window, i.e. the pool5 feature (feature vector) in R-CNN. (Note: during training the input also includes the Ground Truth, i.e. the targets t_* = (t_x, t_y, t_w, t_h) defined below.)

Output: the translation and scaling to apply, i.e. d_x(P), d_y(P), d_w(P), d_h(P) (or, equivalently, the targets t_x, t_y, t_w, t_h). Shouldn't our final output be the Ground Truth itself? Yes, but given these four transformations we can obtain it directly. There is still one subtlety: from the four formulas above, applying d_*(P) to P does not yield the true box G, but only a predicted box Ĝ.

Indeed, these four values should be the truly required translations and scalings, computed from the Ground Truth and the Proposal.

In R-CNN these are:

t_x = (G_x − P_x) / P_w,   t_y = (G_y − P_y) / P_h

t_w = log(G_w / P_w),      t_h = log(G_h / P_h)
The objective function can then be written as d_*(P) = w_*^T φ_5(P), where φ_5(P) is the feature vector of the input Proposal, w_* are the parameters to be learned (* denotes x, y, w, h; each transformation has its own objective function), and d_*(P) is the predicted value. We want the predicted value to be as close as possible to the true value t_* = (t_x, t_y, t_w, t_h), giving the loss function:

Loss = Σ_i ( t_*^i − ŵ_*^T φ_5(P^i) )^2

The optimization objective is:

w_* = argmin_{ŵ_*} Σ_i ( t_*^i − ŵ_*^T φ_5(P^i) )^2 + λ ||ŵ_*||^2

w_* can then be obtained by gradient descent or by least squares.

4. Testing phase

From step 3 we have learned the regression parameters w_*. For a test image, we first extract CNN features φ_5(P); the predicted transformation is d_*(P) = w_*^T φ_5(P), and finally the window is regressed using the four formulas below:

Ĝ_x = P_w d_x(P) + P_x,   Ĝ_y = P_h d_y(P) + P_y

Ĝ_w = P_w exp(d_w(P)),    Ĝ_h = P_h exp(d_h(P))
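The forward and inverse transforms above can be written compactly as follows (a sketch assuming boxes in (x, y, w, h) center form; py-faster-rcnn's bbox_transform.py works on (x1, y1, x2, y2) corners instead):

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets t = (t_x, t_y, t_w, t_h) from proposal P to ground truth G."""
    tx = (G[0] - P[0]) / P[2]
    ty = (G[1] - P[1]) / P[3]
    tw = np.log(G[2] / P[2])
    th = np.log(G[3] / P[3])
    return np.array([tx, ty, tw, th])

def apply_deltas(P, d):
    """Inverse transform used at test time: predicted window G_hat from P and d_*(P)."""
    gx = P[2] * d[0] + P[0]
    gy = P[3] * d[1] + P[1]
    gw = P[2] * np.exp(d[2])
    gh = P[3] * np.exp(d[3])
    return np.array([gx, gy, gw, gh])
```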

3.6.5 Bounding-box regression in Faster R-CNN

In Faster R-CNN, the regression targets are parameterized relative to the anchor boxes:

t_x = (x − x_a) / w_a,   t_y = (y − y_a) / h_a,   t_w = log(w / w_a),   t_h = log(h / h_a)

t_x* = (x* − x_a) / w_a,   t_y* = (y* − y_a) / h_a,   t_w* = log(w* / w_a),   t_h* = log(h* / h_a)

where x, y, w, h denote the predicted box's center coordinates and its width and height; x_a, y_a, w_a, h_a are those of the anchor box; and x*, y*, w*, h* are those of the ground-truth box.
※ Note: computing the regression loss requires three pieces of information:

1) the predicted box, i.e. the proposal output by the RPN;

2) the anchor box: the 9 anchors described earlier correspond to 9 anchor boxes of different scales and aspect ratios;

3) the Ground Truth: the labeled box.

3.7 Training RPNs

The RPN is trained end-to-end with back-propagation (BP) and stochastic gradient descent (SGD). Following the "image-centric" sampling strategy of Fast R-CNN, each mini-batch consists of a single image containing many positive and negative samples. We could optimize the loss over all anchors, but this would be biased towards negative samples, since they dominate.

Sampling

Each mini-batch contains 256 anchors randomly sampled from one image (note: not all anchors are used for training), with 128 foreground and 128 background samples, i.e. a positive-to-negative ratio of 1:1. If an image contains fewer than 128 positive samples, more negative samples are used so that 256 proposals are still available for training.
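A sketch of this sampling rule (illustrative only, mirroring but not reproducing the logic of anchor_target_layer.py):

```python
import numpy as np

def sample_rpn_minibatch(labels, batch_size=256, fg_fraction=0.5, rng=np.random):
    """Keep at most 128 positives and fill the rest of the 256-anchor batch with negatives.
    `labels` is the (A,) array with values 1/0/-1 produced when anchors are labelled."""
    labels = labels.copy()
    fg_inds = np.where(labels == 1)[0]
    num_fg = int(batch_size * fg_fraction)
    if len(fg_inds) > num_fg:
        disable = rng.choice(fg_inds, size=len(fg_inds) - num_fg, replace=False)
        labels[disable] = -1                      # drop the extra positives
    bg_inds = np.where(labels == 0)[0]
    num_bg = batch_size - np.sum(labels == 1)     # fill up to 256 with negatives
    if len(bg_inds) > num_bg:
        disable = rng.choice(bg_inds, size=len(bg_inds) - num_bg, replace=False)
        labels[disable] = -1
    return labels
```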

Initialization

The two newly added layers are initialized from a zero-mean Gaussian with standard deviation 0.01; the remaining layers (the shared convolutional layers, common with VGG) are initialized from an ImageNet-pretrained classification model.

Hyperparameter settings (Caffe implementation)

On the PASCAL dataset:

a learning rate of 0.001 for the first 60k mini-batch iterations;

a learning rate of 0.0001 for the next 20k mini-batch iterations;

momentum = 0.9 and weight decay = 0.0005.


3.8 Non-maximum suppression

During training (e.g. with a 600*1000 input image), anchor boxes whose borders exceed the image boundary would not contribute usefully to the training loss, so anchors that cross the boundary are discarded. A 600*1000 image produces roughly a 40*60 feature map after VGG16, giving 40*60*9, i.e. about 20k anchor boxes; after removing the anchor boxes that cross the boundary, about 6k remain. Among so many anchor boxes there is inevitably a lot of overlap, so non-maximum suppression (NMS) is applied to suppress regions with IoU > 0.7, leaving about 2k anchor boxes. (Similarly, at the final detection stage, NMS can be applied to predicted boxes whose score exceeds some threshold P and whose IoU exceeds some threshold T; note that these predicted boxes are not the anchor boxes.) NMS does not hurt the final detection accuracy, but it greatly reduces the number of proposals. After NMS, the top-N ranked proposal regions are used for detection.
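Greedy NMS itself is simple; a minimal NumPy version (for illustration; the source uses a Cython/GPU implementation with a +1 pixel area convention) is:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; boxes are (N, 4) as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box with all other remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only boxes that overlap the best box by no more than the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```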

3.9 Feature sharing between RPN and Fast R-CNN

The Faster R-CNN algorithm consists of two major modules:

1. the RPN proposal extraction module;

2. the Fast R-CNN detection module.

We have described how to train a network for generating region proposals, without considering how the region-based object detection CNN makes use of these proposals. For the detection network we adopt Fast R-CNN; we now describe an algorithm that learns convolutional layers shared between the RPN and Fast R-CNN.

If the RPN and Fast R-CNN are trained independently, their convolutional layers will be modified in different ways, so we need a technique that allows the two networks to share convolutional layers rather than learning two separate networks. Note that this is not as simple as defining a single network containing both the RPN and Fast R-CNN and jointly optimizing it with back-propagation: Fast R-CNN training relies on fixed object proposals, and it is not clear a priori whether learning Fast R-CNN would converge if the proposal mechanism changed at the same time.

After the RPN extracts proposals, the author uses Fast R-CNN to detect and recognize the final targets. The RPN and Fast R-CNN share 13 VGG convolutional layers, so training these two networks completely in isolation would clearly be unwise. The author adopts a 4-step alternating training scheme to share convolutional-layer features:

In the first step, we train an RPN as described above, which is initialized with an ImageNet pre-trained model and fine-tuned end-to-end for the region proposal task;

In the second step, we use the proposals generated by the step-1 RPN to train a separate Fast R-CNN detection network. This detection network is likewise initialized from the ImageNet-pretrained model; at this point the two networks do not yet share convolutional layers;

In the third step, we initialize the RPN training with the detection network, but we fix the shared convolutional layers and only fine-tune the layers unique to the RPN. Now the two networks share the convolutional layers;

In the fourth step, keeping the shared convolutional layers fixed, we fine-tune the fc layers of Fast R-CNN. In this way, the two networks share the same convolutional layers and form a unified network.

Note: In the first iteration, the convolutional-layer parameters of both the RPN and Fast R-CNN are initialized from the ImageNet-pretrained model. From the second iteration on, when training the RPN, its shared convolutional layers are initialized with the parameters of Fast R-CNN's shared convolutional layers, and only the unshared convolutional layers and the other layers are fine-tuned; when training Fast R-CNN, the convolutional-layer parameters shared with the RPN are kept fixed, and only the layers that are not shared are fine-tuned. In this way, feature-sharing training of the two networks' convolutional layers is achieved. For the corresponding network models, see https://github.com/rbgirshick/py-faster-rcnn/tree/master/models/pascal_voc/VGG16/faster_rcnn_alt_opt
