Object detection: a comparison of the R-CNN series

This post introduces the general ideas of the R-CNN series of papers, including the training procedure, the prediction procedure, and the novelty of each paper.

R-CNN

The following is the flowchart of R-CNN:
[Figure: R-CNN flowchart]

Training process

1. Take a labeled input image and use the selective search method to obtain candidate regions (region proposals, about 2000 per image).
2. Apply affine image warping to each region proposal so that all candidate regions have the same size (the convolutional layers do not actually care about the input size, but the fully connected layers require a fixed size, so the input to the CNN is warped to a fixed size), then feed each warped region into the CNN to extract a feature vector (a rough sketch of this warping follows the list).
3. Feed the feature vector of each candidate region into an SVM classifier to obtain the predicted category, and into a bounding-box regressor to obtain the regression information of the predicted box.
4. Compare the obtained predictions with the labels, compute the loss function, back-propagate, and update the model parameters.
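As a rough illustration of steps 1 and 2, the sketch below crops each candidate region out of the image and warps it to a fixed CNN input size. This is a minimal sketch, assuming Pillow is available and that `proposals` comes from an external selective-search implementation (not shown); the 227 × 227 size follows the common R-CNN setup.

```python
from PIL import Image

def warp_proposals(image_path, proposals, size=(227, 227)):
    """Crop each region proposal and warp it to a fixed CNN input size.

    `proposals` is assumed to be a list of (x1, y1, x2, y2) boxes produced
    by an external selective-search step.
    """
    img = Image.open(image_path).convert("RGB")
    warped = []
    for (x1, y1, x2, y2) in proposals:
        crop = img.crop((x1, y1, x2, y2))   # cut out the candidate region
        warped.append(crop.resize(size))    # anisotropic warp to a fixed size
    return warped                           # each element is then fed to the CNN
```

Each warped crop is then passed through the CNN to produce one feature vector per proposal.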

Details

1. The initial CNN model

A model pre-trained with supervision on an image classification dataset is used here; it mainly serves to extract features from the input image for the subsequent classification and regression.

2. Fine-tuning the model

Because the pre-trained model was trained on a classification dataset while the current task is detection, which requires regressing bounding-box coordinates in addition to classifying, the model needs to be fine-tuned on the detection dataset. Fine-tuning consists of two parts, as follows:
(1) CNN fine-tuning: for a new training image, obtain about 2000 candidate regions and use them to fine-tune the model. The predicted category and box coordinates of each candidate region are obtained by forward propagation, but how is the target value determined? The paper's approach is to compute the IoU between each candidate region and the ground-truth box (labeled in advance): if IoU >= 0.5, the region is treated as a positive example (its label is the ground-truth class); otherwise it is treated as a negative (background) example. This set of labeled regions is used to fine-tune the CNN.
(2) Training the SVMs: the SVMs here perform binary classification, following the same approach as above, but the IoU threshold is changed to 0.3 (a hyper-parameter obtained experimentally): regions with IoU greater than 0.3 are labeled 1, and those below 0.3 are labeled 0. Why not classify with softmax? The paper reports that its accuracy is not as high as the SVMs'.
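Below is a minimal NumPy sketch of the IoU-based labeling described above, assuming boxes are given as (x1, y1, x2, y2) arrays; 0.5 is the CNN fine-tuning threshold and 0.3 the SVM one quoted in the text.

```python
import numpy as np

def iou(box, gt):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box[0], gt[0]), max(box[1], gt[1])
    x2, y2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_b - inter)

def assign_labels(proposals, gt_box, pos_thresh=0.5):
    """Label proposals 1 (positive) or 0 (background) by IoU threshold.

    pos_thresh=0.5 matches the CNN fine-tuning rule above; for the SVM
    training described in (2), use 0.3 instead.
    """
    return np.array([1 if iou(p, gt_box) >= pos_thresh else 0 for p in proposals])
```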

Prediction process

1. Use the selective search method to obtain candidate regions;
2. Extract convolutional features for each candidate region;
3. Feed the extracted features to the trained classifier and regressor to obtain predictions;
4. Apply NMS to obtain the final prediction results.
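Step 4 refers to non-maximum suppression; here is a minimal NumPy sketch of greedy NMS (boxes as (x1, y1, x2, y2) rows, one score per box; the 0.5 overlap threshold is just an illustrative default).

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns the indices of kept boxes."""
    order = np.argsort(scores)[::-1]              # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        # IoU between the kept box and every remaining box
        x1 = np.maximum(boxes[i, 0], rest[:, 0])
        y1 = np.maximum(boxes[i, 1], rest[:, 1])
        x2 = np.minimum(boxes[i, 2], rest[:, 2])
        y2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        overlap = inter / (area_i + area_r - inter)
        order = order[1:][overlap < iou_thresh]   # discard heavily overlapping boxes
    return keep
```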

Summary

R-CNN does not innovate in any specific step of the algorithm; every method used has a reference paper. Although none of the methods are new, grouping these fairly basic methods together produced good results.
Disadvantages:
1. Selective search is used to pick candidate regions, which is inefficient, and a large amount of convolution computation is repeated across overlapping regions;
2. Scaling the candidate regions to a fixed size deforms the target, which hurts recognition accuracy.

Fast R-CNN

Fast R-CNN mainly improves on the shortcomings of R-CNN.

Training process

1. Take a labeled input image and use the selective search method to obtain candidate regions;
2. ROI pooling:

  • Run forward convolution on the whole image to obtain a feature map, then perform ROI projection (a mapping operation): the candidate regions from the first step are mapped onto the feature map, which avoids repeating the convolution computation for every candidate region;
  • ROI pooling then converts ROIs of different sizes to the same size. This is a special case of SPP (spatial pyramid pooling): the SPP layer pools at multiple scales, while ROI pooling pools at a single scale. After the fully connected layers, a feature vector is obtained for each ROI;

3. Compute the loss, now using a multi-task loss rather than separately training an SVM classifier and a bounding-box regressor. Classification now uses softmax, and a single fine-tuning stage optimizes classification and regression at the same time (this simplifies the procedure and speeds things up);
4. Back-propagate and update the model parameters.

Details

1. ROI pooling

[Figure: ROI pooling mapping example]
ROI pooling mainly does two things:
mapping the candidate regions from the original image onto the feature map, and converting ROIs of different sizes into fixed-size feature vectors with max pooling.
In the last mapping step shown in the figure, a 7 × 7 matrix has to be produced from a 20 × 20 matrix; 2.86 is rounded down to 2, which means each 1 × 1 output is max-pooled from a 2 × 2 patch, so the 7 × 7 matrix is in fact produced from only a 14 × 14 matrix.
The problem with ROI pooling:
as can be seen from the figure, several mappings are performed from front to back: the first from the original image to the feature map, the second from the feature map to the ROI feature. Each mapping involves a rounding operation, and after multiple roundings the box mapped back to the original image can be off by several pixels, so the boxes become increasingly inaccurate.
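The following NumPy sketch makes the two rounding (quantization) steps explicit: once when projecting the box onto the feature map, and once when splitting the projected ROI into output bins. The stride of 16 is an assumption (typical for a VGG-style backbone), and the sketch assumes the projected ROI is at least 7 × 7.

```python
import numpy as np

def roi_pool(feature_map, box, out_size=7, stride=16):
    """Quantized ROI max pooling over a (H, W) feature map.

    `box` is (x1, y1, x2, y2) in original-image coordinates.
    """
    # 1st rounding: project the box onto the feature map
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    roi = feature_map[y1:y2, x1:x2]
    h, w = roi.shape
    # 2nd rounding: integer bin size, e.g. 20 // 7 = 2, so only 14x14 is used
    bin_h, bin_w = h // out_size, w // out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = roi[i * bin_h:(i + 1) * bin_h, j * bin_w:(j + 1) * bin_w]
            out[i, j] = patch.max()
    return out
```

Each rounding throws away sub-pixel position information, which is exactly the misalignment that RoI Align (discussed in the Mask R-CNN section) removes.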

2. Multi-task loss

Fast R-CNN unifies the category output task and the candidate-box regression task. The multi-task loss function is defined as:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\,L_{loc}(t^u, v)$$

where $p$ is the predicted class distribution, $u$ the true class, $t^u$ the predicted box offsets for class $u$, $v$ the regression targets, $L_{cls}$ is the log loss, and $L_{loc}$ is a smooth L1 loss applied only to non-background ROIs ($u \ge 1$).
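To make the formula concrete, here is a minimal NumPy sketch of the two terms, assuming `p` is the softmax probability vector of one ROI, `u` its true class (0 = background), `t_u` the predicted box offsets for class `u`, and `v` the regression targets; `lam` plays the role of λ.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used by Fast R-CNN for box regression."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc; background ROIs get no box loss."""
    l_cls = -np.log(p[u])                                   # log loss on the true class
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```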

Summary

Fast R-CNN's main improvements:

  • Candidate regions share the convolution, avoiding the repeated computation caused by convolving each candidate region separately, which speeds things up;
  • ROI pooling avoids the deformation problem caused by cropping and warping;
  • The multi-task loss unifies the category output task and the candidate-box regression task.

Faster R-CNN

The main improvement is in candidate-region generation: the RPN and anchor boxes are introduced, which greatly increases speed.
The RPN moves the selection of candidate regions from the image onto the feature map, sliding anchor boxes over the feature map as a sliding window, which is much faster than selective search.

Training process

1. Conv layers: run forward convolution directly on the original image to obtain the feature map;
2. RPN:

  • Slide a window over the feature map and apply the anchor boxes to obtain ROIs. If the feature map size is W×H and there are k = 9 anchor boxes per location, k×W×H anchors are produced (see the anchor-generation sketch after this list);
  • Use ROI pooling to convert the ROIs into feature vectors of the same size;
  • Feed the feature vectors to a classifier and a regressor at the same time. The cls classifier is a two-class softmax layer that outputs two values per anchor, predicting the probability that the ROI contains an object (2k scores in total); the regressor regresses the box coordinates and outputs four values per anchor (4k coordinate values in total).
4. Anchors with IoU > 0.7 are treated as positive examples, those with IoU < 0.3 as negative examples, and the rest are not used for training. The loss function is as follows; it shows that the coordinate regression loss is computed only for anchors that contain an object (L_cls and L_reg are the same as in Fast R-CNN):

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted objectness probability of anchor $i$, $p_i^*$ its ground-truth label (1 for positive anchors, 0 for negative ones), and $t_i$, $t_i^*$ are the predicted and target box coordinates.
5. Back-propagate and update the model parameters.
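A minimal sketch of anchor generation, assuming k = 9 anchors per feature-map location from 3 scales × 3 aspect ratios and a feature stride of 16 (common choices; a particular implementation may use different values).

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (x1, y1, x2, y2) boxes."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # centre in image coordinates
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # same area, different aspect ratio
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# e.g. a 38 x 50 feature map gives 38 * 50 * 9 = 17100 anchors
```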

Testing process

1. Feed an image of arbitrary size into the CNN and extract features through the convolutional layers;
2. Use the RPN to generate high-quality proposal boxes, about 300 per image;
3. Project the proposal boxes onto the last convolutional feature map of the CNN;
4. Use the RoI pooling layer to fix the size of each proposal box;
5. Use the classification layer and the bounding-box regression layer to produce the category prediction and the refined box for each proposal region;
6. Apply NMS to obtain the final predicted boxes.
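A rough sketch of steps 2 and 6 taken together: keep the highest-scoring RPN outputs, then run NMS to reach roughly 300 proposals per image. The counts and the 0.7 threshold are illustrative, and `nms` refers to the greedy NMS sketched in the R-CNN prediction section.

```python
import numpy as np

def select_proposals(boxes, objectness, pre_nms_top_n=2000, post_nms_top_n=300,
                     iou_thresh=0.7):
    """Reduce raw RPN outputs to ~300 proposals per image."""
    order = np.argsort(objectness)[::-1][:pre_nms_top_n]   # best-scoring anchors first
    boxes, objectness = boxes[order], objectness[order]
    keep = nms(boxes, objectness, iou_thresh)               # greedy NMS from the earlier sketch
    return boxes[np.array(keep[:post_nms_top_n])]           # these go on to RoI pooling
```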

Summary

The R-CNN series of papers is mainly about continual acceleration, especially of candidate-box generation. The approach is to keep moving work onto the feature map and to solve as much of the problem as possible with the network itself, which makes the whole pipeline faster.

Mask R-CNN

Mask R-CNN is an instance segmentation method built on top of Faster R-CNN.
Instance segmentation: distinguishes not only categories but also individual objects.
Semantic segmentation: distinguishes only categories, not individual objects.

Changes relative to Faster R-CNN

(1) Multi-branch output
(2) Binary mask
(3) ROI Align

Multi-branch output

In Faster R-CNN the network ends at the class and box outputs; here, the ROI features are additionally passed through a fully convolutional network (FCN) to obtain a mask of the object inside the ROI.
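A minimal PyTorch sketch of what such a mask branch could look like: a small FCN applied to each ROI feature map, with one upsampling step and one m × m mask predicted per class. This is an illustrative layout, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Small FCN mask branch applied to per-ROI features (illustrative)."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        # upsample 14x14 ROI features to 28x28, then predict one mask per class
        self.upsample = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi_features):           # (num_rois, C, 14, 14)
        x = self.convs(roi_features)
        x = torch.relu(self.upsample(x))
        return self.predict(x)                 # (num_rois, num_classes, 28, 28) mask logits
```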

Binary mask

In an FCN, each pixel is classified into multiple classes with softmax, whereas Mask R-CNN performs binary classification on the ROI: for each pixel it only has to decide whether it belongs to the background or to the detected object. Binary classification is easier than multi-class classification, so the classification accuracy improves.
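A minimal NumPy sketch of that per-pixel binary loss: sigmoid plus binary cross-entropy, evaluated only on the mask channel of the ROI's ground-truth class, so classes do not compete the way they would under a softmax.

```python
import numpy as np

def binary_mask_loss(mask_logits, gt_mask, cls):
    """Per-pixel sigmoid + binary cross-entropy on the ground-truth class channel.

    mask_logits: (num_classes, m, m) raw mask outputs for one ROI
    gt_mask:     (m, m) binary ground-truth mask for that ROI
    cls:         index of the ROI's ground-truth class
    """
    p = 1.0 / (1.0 + np.exp(-mask_logits[cls]))   # per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                              # average over the m*m pixels
```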

ROI Align

RoI Align is essentially a more accurate version of RoI pooling.
A small shift on the feature map may correspond to a shift of several pixels in the original image, and the segmentation task demands higher border accuracy than the detection task, so RoI Align can increase segmentation accuracy.
[Figure: RoI Align mapping process — no rounding operation in any step]
As shown above, in the last mapping step, when mapping a 20.78 × 20.78 matrix to a 7 × 7 matrix, a representative 1 × 1 value has to be max-pooled from each 2.97 × 2.97 region. How is this done?
Bilinear interpolation
[Figure: bilinear interpolation over the four sampling points of one bin]
As shown in the figure, assume the number of sampling points is 4. Each 2.97 × 2.97 region is divided into four equal parts, the center point of each part is taken, and the pixel value at that center is computed by bilinear interpolation. This gives the pixel values of the four points (the four red '×' marks), and the maximum of these four values is taken as the value of the small region (i.e., the 2.97 × 2.97 region), which is max pooling.
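A minimal NumPy sketch of the sampling just described: for one bin, take four regularly spaced sample points, compute each value by bilinear interpolation on the feature map, and take their maximum (averaging the samples is an equally common choice).

```python
import numpy as np

def bilinear(feature_map, y, x):
    """Bilinearly interpolate a (H, W) feature map at a real-valued (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feature_map[y0, x0] * (1 - dy) * (1 - dx) +
            feature_map[y0, x1] * (1 - dy) * dx +
            feature_map[y1, x0] * dy * (1 - dx) +
            feature_map[y1, x1] * dy * dx)

def roi_align_bin(feature_map, y_lo, x_lo, bin_h, bin_w):
    """One RoI Align bin: max over 2x2 sample points, no rounding anywhere."""
    samples = []
    for fy in (0.25, 0.75):                        # centres of the four sub-cells
        for fx in (0.25, 0.75):
            samples.append(bilinear(feature_map, y_lo + fy * bin_h, x_lo + fx * bin_w))
    return max(samples)
```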


Original post: blog.csdn.net/qq_41332469/article/details/89286595