Object detection algorithms: the principles of R-CNN and SPPNet

1. R-CNN principle

  R-CNN stands for Region-CNN (Regions with CNN features). It is arguably the first algorithm to apply deep learning to object detection. Fast R-CNN and Faster R-CNN, covered later, are both built on R-CNN.

  Most traditional object detection methods are based on image recognition. In general, they first enumerate exhaustively all the region boxes on the image where an object might appear, extract features from each box and classify it with an image recognition method, and then pass all successfully classified regions through non-maximum suppression (NMS) to produce the output.
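  The non-maximum suppression step can be sketched as follows. This is a minimal greedy version in plain Python, not taken from any particular detector: the box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the overlap threshold of 0.5 are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first.

    A box is kept only if it does not overlap any already-kept,
    higher-scoring box by more than iou_threshold (an assumed value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

  In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap anything and is kept. In a detector, NMS of this kind is usually run separately for each object class.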

  R-CNN follows the same approach as traditional object detection: it also detects objects in four steps, namely extracting boxes, extracting features for each box, classifying the image, and applying non-maximum suppression. The only change is in the feature extraction step, where traditional features (such as SIFT and HOG) are replaced by features extracted with a deep convolutional network. The overall framework of the R-CNN algorithm is shown in the figure below.

   

  Given the original image, R-CNN first uses Selective Search to find regions where objects may be present. Selective Search heuristically searches an image for regions that may contain an object, and it reduces part of the computation compared with exhaustive enumeration. Next, the regions that may contain objects are fed into the CNN for feature extraction. A CNN typically accepts images of a fixed size, but the extracted regions differ in size. R-CNN handles this by scaling every region to a uniform size and then extracting features with the CNN. The extracted features are then classified with an SVM, and the final output is produced by non-maximum suppression.
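  The "scale every region to a uniform size" step can be illustrated with a small sketch. It assumes a single-channel image stored as a list of rows, a box given as `(x1, y1, x2, y2)` with exclusive right and bottom edges, and nearest-neighbour sampling; the function name `crop_and_warp` and the tiny sizes are hypothetical and chosen only to keep the example short. R-CNN itself warps to the input size of its CNN, and a real implementation would normally use an image library with better interpolation.

```python
def crop_and_warp(image, box, out_h, out_w):
    """Crop box from a 2-D image and resize it to out_h x out_w.

    Nearest-neighbour sampling is a simplification chosen for brevity.
    """
    x1, y1, x2, y2 = box
    src_h, src_w = y2 - y1, x2 - x1
    warped = []
    for r in range(out_h):
        src_r = y1 + (r * src_h) // out_h  # source row for output row r
        row = [image[src_r][x1 + (c * src_w) // out_w] for c in range(out_w)]
        warped.append(row)
    return warped


# A 6x8 test image whose pixel value encodes its position: row * 10 + col.
image = [[r * 10 + c for c in range(8)] for r in range(6)]

# Candidate regions of different shapes all end up as 2x2 patches.
warped = crop_and_warp(image, (2, 1, 6, 4), 2, 2)
other = crop_and_warp(image, (0, 0, 8, 6), 2, 2)
print(warped)
print(other)
```

  Whatever the shape of the candidate region, the output has the same size, which is what allows a CNN with a fixed input size to process every region. The cost is that each region is distorted and must be passed through the network separately.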

  R-CNN training can be divided into the following four steps:

  (1) Train the CNN on a dataset. The CNN used in the R-CNN paper is AlexNet, and the dataset is ImageNet.
  (2) Fine-tune the trained CNN on the object detection dataset.
  (3) Use Selective Search to find candidate regions, scale them to a uniform size, extract their features with the CNN, and store the extracted features.
  (4) Train the SVM classifier on the stored features.

  Although the recognition framework of R-CNN differs little from that of traditional methods, R-CNN still performs much better than they do, thanks to the CNN's excellent ability to extract features. On the VOC 2007 dataset, for example, the highest mAP (mean Average Precision) of traditional methods is about 40%, while R-CNN reaches 58.5%.

  The disadvantage of R-CNN is that it is computationally expensive. Selective Search usually yields more than 1,000 candidate regions for a single image, which means the neural network has to be run more than 1,000 times, and this is very time-consuming. In addition, during training all the features have to be saved first and the SVM trained afterwards, which is also time-consuming and cumbersome. Fast R-CNN and Faster R-CNN, covered later, reduce R-CNN's heavy computation to some extent: they are not only much faster, but their recognition accuracy is also higher.

2. SPPNet principle

  Before studying Fast R-CNN, the improved version of R-CNN, it is necessary to learn the principles of SPPNet as background knowledge. SPPNet stands for Spatial Pyramid Pooling Convolutional Networks. The name sounds profound, but the principle is not difficult. Simply put, SPPNet mainly does one thing: it changes the CNN's input from a fixed size to an arbitrary size. In a conventional CNN structure, for example, the input image size is usually fixed (e.g., 224x224 pixels), and the output can be seen as a vector of fixed dimension. SPPNet adds an ROI pooling layer to the conventional CNN structure (ROI Pooling; ROI is short for Region of Interest and refers to a box on the feature map; the SPPNet paper itself calls this layer spatial pyramid pooling), so that the network's input image can be of any size while the output stays the same: it is still a vector of fixed dimension.

  The ROI pooling layer generally follows the convolutional layers. Its input is a convolutional feature map of any size, and its output is a vector of fixed dimension, as shown in the figure below.

   

  To explain why the ROI pooling layer can convert a convolutional feature map of any size into a fixed-length vector, let the output of the convolutional layer have width w, height h, and c channels. Whatever the size of the input image, the number of channels of the convolutional layer does not change, so c is a constant. w and h, on the other hand, vary with the input image size and can be regarded as two variables. Taking the ROI pooling layer in the figure above as an example, it first divides the convolutional feature map into a 4x4 grid, where each cell has width w/4, height h/4, and c channels. When w or h is not divisible by 4, rounding is required.

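  Next, for each channel in each grid cell, the maximum value is taken. In other words, max pooling is applied to the features within each cell (this is the same pooling operation used in convolutional neural networks). The 4x4 grid thus yields a 16c-dimensional feature. Next, the feature map is divided into a 2x2 grid and features are extracted in the same way, giving a feature of length 4c. Then it is divided into a 1x1 grid, giving a feature of length c; this final 1x1 division simply takes the maximum of each channel of the convolutional feature map. Finally, the features are concatenated into a feature of 16c + 4c + 1c = 21c dimensions. Clearly, the length of this output does not depend on w or h, so the ROI pooling layer can convert convolutional features of any width and height into a fixed-length vector.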
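  The 4x4, 2x2 and 1x1 pooling just described can be sketched in plain Python. The feature map is assumed to be a nested list indexed as `feature[channel][row][col]`; the names `bin_edges`, `pyramid_pool` and `make_feature` are hypothetical, and the rounding rule (cell start rounded down, cell end rounded up) is one possible choice for the non-divisible case, not necessarily the one used in the original implementation.

```python
def bin_edges(length, n_bins):
    """Split range(length) into n_bins cells as (start, end) pairs.

    The start is rounded down and the end is rounded up, so every cell
    contains at least one element (an assumed rounding rule).
    """
    return [((i * length) // n_bins,
             ((i + 1) * length + n_bins - 1) // n_bins)
            for i in range(n_bins)]


def pyramid_pool(feature, levels=(4, 2, 1)):
    """Max-pool feature[channel][row][col] over n x n grids and concatenate."""
    c = len(feature)
    h, w = len(feature[0]), len(feature[0][0])
    out = []
    for n in levels:
        for r0, r1 in bin_edges(h, n):
            for c0, c1 in bin_edges(w, n):
                for ch in range(c):
                    out.append(max(feature[ch][r][col]
                                   for r in range(r0, r1)
                                   for col in range(c0, c1)))
    return out


def make_feature(c, h, w):
    """Build a deterministic dummy feature map for the demonstration."""
    return [[[(ch + 1) * (r * w + col) for col in range(w)]
             for r in range(h)] for ch in range(c)]


# Two feature maps of different spatial size but the same channel count c = 3.
pooled_a = pyramid_pool(make_feature(3, 13, 17))
pooled_b = pyramid_pool(make_feature(3, 7, 10))
print(len(pooled_a), len(pooled_b))  # both are 21 * c
```

  Both outputs have length 21c = 63 even though the inputs are 13x17 and 7x10, and the last c values (the 1x1 level) are simply the per-channel maxima of the whole feature map.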

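  How should the ROI pooling layer be applied to object detection? The problem can be viewed this way: the network's input is an image, which passes through several convolutions to form a convolutional feature map, and this feature map actually corresponds in position to the original image, as shown in the figure below.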

   

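  In the figure above, the original image contains a car, which causes the convolutional features to activate at the same position. A candidate box in the original image can therefore be mapped to a box at the same position in the convolutional feature map. Because candidate boxes vary widely in size, the corresponding regions of the feature map also differ in shape. This is not a problem, though: the ROI pooling layer maps regions of different shapes in the feature map to feature vectors of the same length. Combining these steps, regions of different widths and heights in the original image can all be mapped to a fixed-length feature vector, which completes feature extraction for every region.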

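  In R-CNN, the image inside each candidate box must be scaled to a uniform size, and features are then extracted from every scaled image. With the ROI pooling layer, the convolution is computed once over the whole image to obtain the convolutional features of the entire image; then, for each candidate box in the original image, one only needs to find the corresponding box in the feature map and apply the ROI pooling layer to the convolutional features inside it to complete feature extraction.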
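  Finding the corresponding box in the feature map can be sketched as a simple coordinate scaling. The sketch below assumes that the convolutional layers shrink the image by a fixed total stride (16 here, an assumed value that depends on the network) and rounds the box outward to whole feature-map cells; the function name `project_box` is hypothetical, and the original papers use their own rounding rules.

```python
def project_box(box, stride, feat_h, feat_w):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell indices.

    The left and top edges are rounded down, the right and bottom edges are
    rounded up, and the result is clamped so that it covers at least one
    cell inside the feat_h x feat_w feature map.
    """
    x1, y1, x2, y2 = box
    fx1 = min(x1 // stride, feat_w - 1)
    fy1 = min(y1 // stride, feat_h - 1)
    fx2 = min(max((x2 + stride - 1) // stride, fx1 + 1), feat_w)
    fy2 = min(max((y2 + stride - 1) // stride, fy1 + 1), feat_h)
    return fx1, fy1, fx2, fy2


# A 640x480 image with an assumed stride of 16 gives a 40x30 feature map.
stride, feat_h, feat_w = 16, 30, 40
candidates = [(100, 60, 420, 300), (5, 5, 6, 6)]
projected = [project_box(b, stride, feat_h, feat_w) for b in candidates]
print(projected)
```

  Each projected box selects a sub-region of the shared feature map, which can then be pooled to a fixed-length vector. The convolution itself runs once per image, however many candidate boxes there are.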

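  The difference between R-CNN and SPPNet is that R-CNN has to compute the convolution for every region, whereas SPPNet computes it only once, so SPPNet is far more efficient than R-CNN.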

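  What R-CNN and SPPNet have in common is that both follow the same steps of extracting candidate boxes, extracting features, and classifying. After feature extraction, both use an SVM for classification.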

 

 


Origin www.cnblogs.com/xiaoyh/p/11787239.html