[Deep Learning] Two-stage target detection for convolutional neural network applications|R-CNN, SPP-Net, Fast-RCNN, Faster-RCNN

Basic concepts

Selective Search: The main idea is to first divide the image into small regions based on the pixels, then repeatedly merge the two adjacent regions that are most likely to belong together according to the merging rules, until the entire image has been merged into one region.
IoU (Intersection over Union): Measures the localization accuracy of one bounding box against another: the area of the overlap of the two rectangles divided by the area of their union.
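
As a small illustration of the ratio described above, here is a minimal sketch in plain Python. The corner format `(x1, y1, x2, y2)` and the function name `iou` are assumptions made for this example; the article does not fix a box format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7, so 1/7 (about 0.143)
```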

Non-Maximum Suppression (NMS): Suppress elements that are not local maxima, keeping only the local maxima.
Algorithm steps:

  1. Find the bounding box with the highest score among the candidate bounding boxes for the current class and keep it;
  2. Calculate the IoU of each of the other bounding boxes with this bounding box;
  3. Delete all bounding boxes whose IoU is greater than a given threshold;
  4. Repeat steps 1 to 3 on the remaining bounding boxes until no candidates are left.
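
The steps above can be sketched for a single class as follows. This is a simple greedy version in plain Python, not an optimized implementation; the `(x1, y1, x2, y2)` box format, the helper `iou`, and the default threshold of 0.5 are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU 81/119
```

In a multi-class detector this would be run once per class, on that class's boxes and scores.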

mAP (mean Average Precision): Compute the AP (average precision) for each category, then average over the categories.
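
One way to make this concrete is the sketch below. It assumes the detections of a class are already sorted by descending score and marked as hits or misses, and it uses the area under the precision-recall curve with a monotone precision envelope. Benchmarks differ in the exact AP definition (for example 11-point interpolation), so treat this as one possible variant; the function names are made up for the example.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP of one class from detections sorted by descending score.

    is_true_positive[i] is True when the i-th ranked detection matches a
    ground-truth box; num_ground_truth must be greater than zero.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)
    # Precision envelope: running maximum taken from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the curve wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: list of (is_true_positive, num_ground_truth), one per class."""
    return sum(average_precision(h, n) for h, n in per_class) / len(per_class)


print(average_precision([True, False, True], 2))  # 0.5 * 1 + 0.5 * (2/3) = 5/6
```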

1. R-CNN

1. Network structure

The stages are independent of each other:

Region proposal - Selective Search
Feature extraction - CNN
Classification - SVM
Bounding box regression

The algorithm is mainly divided into 4 steps:

  1. Obtain about 2000 candidate regions with the Selective Search method;
    the steps of the Selective Search algorithm are as described above.
  2. Use a CNN to extract features for each candidate region;
    in preprocessing, expand the bounding box of each region by 16 pixels outward and warp it to a 227 × 227 image; then feed it into the pre-trained convolutional neural network to get the features.
  3. Send the fc7 features to the SVM classifier of each class to determine whether the region belongs to that class;
    the 2000 × 4096 feature matrix (2000 proposals with 4096-dimensional features) is multiplied by the 4096 × 20 weight matrix formed by the 20 SVM classifiers, giving a 2000 × 20 matrix whose entries are the scores of each proposal box for each class.
  4. Apply bounding box regression to the conv5 features to fine-tune the predicted boxes;
    after non-maximum suppression (NMS) has screened the remaining bounding boxes, use 20 regressors to regress the boxes of the 20 categories, correcting the candidate boxes to get the final bounding boxes.
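
The matrix product in step 3 can be illustrated with a toy example. The sketch below uses plain Python lists and tiny sizes in place of 2000 × 4096 and 4096 × 20; the function name `svm_scores` and the per-class bias term are assumptions added for illustration, not something the article specifies.

```python
def svm_scores(features, weights, biases):
    """Score P proposals against C classes: (P x D) times (D x C) -> (P x C).

    features: P rows of D values; weights: D rows of C values; biases: C values.
    """
    columns = list(zip(*weights))  # one weight column per class
    return [
        [sum(f * w for f, w in zip(feat, col)) + b for col, b in zip(columns, biases)]
        for feat in features
    ]


# Toy sizes: 2 proposals, 2-dimensional features, 3 classes.
features = [[1.0, 2.0], [3.0, 4.0]]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
biases = [0.0, 0.0, -1.0]
scores = svm_scores(features, weights, biases)
print(scores)  # [[1.0, 2.0, 2.0], [3.0, 4.0, 6.0]]
```

Each row of `scores` then goes through per-class NMS as described in step 4.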

2. Training process

  • Pre-training: Use the ImageNet dataset to pre-train the CNN model to initialize the network parameters.

    • Training data for object detection is scarce, and the available data is far from enough to train a CNN from randomly initialized parameters, so supervised pre-training is used: the network parameters of AlexNet or VGG are used directly.
  • Fine-tuning: Fine-tune the pre-trained network using all regions generated by the SS algorithm.

    • Log loss
    • Fine-tuning replaces the last layer of the network with a softmax layer of N+1 neurons (N object classes + 1 background class). This layer is randomly initialized, while the parameters of the other layers are kept from pre-training. The network is then trained with SGD: the initial learning rate is 0.001, and each batch contains 128 samples, 32 positive and 96 negative.
    • A candidate box selected by the SS algorithm is a positive sample if its IoU with a manually labeled box is greater than 0.5; otherwise it is a negative sample (background class)
    • If no task-specific fine-tuning is performed and the CNN is used only as a feature extractor, the convolutional layers act as a generic, shared feature extraction stage that works for all kinds of images, while the features learned by fc6 and fc7 are task-specific.
  • SVM classification: Train linear SVM classifiers on the fc7 features of the fine-tuned network.

    • Hinge loss
    • Each category (N categories) has its own classifier
    • The IoU threshold is 0.3: a region whose IoU with the ground-truth boxes of the class is below 0.3 is a negative sample, and the ground-truth boxes themselves are the positive samples.
    • Once the fc7 features of the CNN are extracted, train an SVM classifier for each object class to decide whether a region is that object or background.
  • Bounding box regression: Train bounding box regression models on the conv5 features of the fine-tuned network.

    • Squared loss

    • Train one regression model per category (N categories)

    • The IoU threshold is 0.6: only candidate boxes whose IoU with a ground-truth box is greater than 0.6 are used as training samples.

    • The evaluation metric of object detection is the overlap area, and many seemingly accurate detections have a small overlap because the candidate box is not precise enough. A location refinement step is therefore needed: a linear regression model is trained for each object class. The regularization term is λ = 1000. The input is the 4096-dimensional feature of conv5.

3. Testing phase

  1. Use the SS algorithm to extract 2000 region proposals;
  2. Normalize each region to 227 × 227 through preprocessing;
    use the fine-tuned CNN to compute 2 sets of features
  3. fc7 -> SVM -> class scores
    • NMS (IoU >= 0.5) yields a subset of regions without redundancy
  4. conv5 -> bounding box regression -> box offsets
    • Use the box offsets to correct the region subset

4. Problems with R-CNN

  1. Testing is slow. It takes 53 s to test an image on the CPU, and extracting the candidate boxes with the Selective Search algorithm takes 2 s. The candidate boxes in an image overlap heavily, so the feature extraction is highly redundant;
  2. Training is slow and the process is extremely cumbersome. Besides the image classification network, the SVM classifiers and the bounding box regressors must also be trained, and these training stages are independent of each other;
  3. Training needs a lot of space. For SVM and bounding box regression training, features must be extracted from every candidate box in every image and written to disk. For very deep networks, the features extracted from the 5K training images need hundreds of gigabytes of storage.

Q: Why use SVM classification instead of using the softmax multi-class classifier directly?
A: Positive and negative samples are defined differently in SVM training and in CNN training, and as a result the CNN softmax output is less accurate than the SVM. When training the CNN, the labeling of the training data is loose (a bounding box that contains only part of the object is still marked as a positive sample), because the CNN is prone to overfitting; the SVM has strict IoU requirements on its training samples (the bounding box must contain the entire object).


2. SPP-Net

Two innovations are proposed on the basis of R-CNN: shared convolution computation and spatial pyramid pooling.

  1. Shared convolution computation: the features of all regions are extracted from the output of the conv5 layer.
  2. Pyramid pooling: for regions of different sizes, extract features from the conv5 output and map them to a fixed size for the fully connected layer.

1. Network structure

Region proposal - Selective Search
Feature extraction - CNN + SPP
Classification-SVM
Bounding box regression

The algorithm is mainly divided into 5 steps (mostly similar to RCNN):

  1. Obtain about 2000 candidate regions through the Selective Search method;
  2. Use CNN to extract features for each candidate area;
  3. Extract SPP features from the feature map produced by the CNN;
  4. Send the fc7 features to the SVM classifiers of the N classes to determine whether the region belongs to each class;
  5. Perform per-class (N classes) bounding box regression on the conv5 features to fine-tune the predicted boxes

2. Basic knowledge

Shared convolution computation

Feed in the entire image directly, perform one shared convolution computation, and take the features of all regions from the output of the conv5 layer.

Spatial pyramid pooling

In R-CNN, each candidate box needs to be uniformly sized as the input of CNN, which is inefficient and time-consuming. SPP proposes to perform only one convolution calculation on the original image to obtain the convolution features of the entire image, and then find the mapping of each candidate frame on the feature map, and input the mapping as the convolution feature of the candidate frame into the SPP layer, transform into the same scale.
Specific operation:
Replace the pooling layer after conv5 with SPP. The idea of SPP is to divide a feature map of any size into grids at three levels, 1 × 1, 2 × 2 and 4 × 4, which give 1, 4 and 16 blocks respectively. Max pooling is performed on each block, and the pooled features are concatenated into a fixed-dimensional output that meets the needs of the fully connected layer.
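
To see why the output length no longer depends on the input size, here is a minimal sketch of the pooling for one channel, in plain Python. The bin boundaries (floor for the start, ceiling for the end) are one common choice and an assumption here, as is the function name; a real layer would do this for every channel and concatenate the results.

```python
import math


def spp_single_channel(fmap, levels=(1, 2, 4)):
    """Max-pool one 2-D feature map into n x n grids and concatenate the results."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            # Row range of the i-th bin at this pyramid level.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                out.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return out


small = [[r * 7 + c for c in range(7)] for r in range(5)]   # 5 x 7 map
large = [[r * 4 + c for c in range(4)] for r in range(8)]   # 8 x 4 map
print(len(spp_single_channel(small)), len(spp_single_channel(large)))  # 21 21
```

Both maps produce 1 + 4 + 16 = 21 values, so a feature map with C channels always yields a 21 × C vector.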

3. Training process

  • Pre-training: Use the ImageNet dataset to pre-train the CNN model to initialize network parameters.
  • SPP features: Calculate the SPP features of all SS regions.
  • Fine-tuning: Fine-tune the fully connected layers using the SPP features.
  • SVM classification: Use the fine-tuned fc7 features to train an SVM classifier for each class.
  • Bounding box regression: Use spp features for bounding box regression.
    • R-CNN uses conv5 for bounding box regression
    • Only fine-tuning the fully connected layer

4. Test process

The basic structure is similar to that of R-CNN. The per-region preprocessing is removed, and one shared convolution computation is performed on the whole image. The regions extracted by the SS algorithm on the original image are mapped onto the resulting conv5 feature map, and the mapped regions are the input of the SPP layer. The SPP layer performs pyramid pooling on these regions of different sizes to turn them into features of the same size, which then enter the fully connected layers. The subsequent steps are similar to R-CNN.

  • Unlike R-CNN, SPP-Net does not fine-tune the image-level computation; it only fine-tunes the region-level computation.

5. Problems

  1. It inherits the remaining problems of R-CNN: a large number of features must be stored, training is a complex multi-stage process, and training time is still long
  2. New problem: SPP replaces the previous max pooling layer and converts the feature map to 224*224. Because of how SPP works (pooling into bins of 3 sizes), the parameters of all convolutional layers before the SPP layer cannot be fine-tuned, which limits transferability.

3. Fast R-CNN

Three improvements are proposed on the basis of SPP:

  1. End-to-end single-stage training, achieved through a multi-task loss function.
  2. The parameters of all layers can be fine-tuned
  3. No need to store feature files offline

Fast R-CNN introduces two optimizations on top of SPP-Net: the ROI pooling layer (ROI pooling) and the multi-task loss function (multi-task loss).

1. Network structure

Region proposal - Selective Search
Feature extraction - CNN
Second stage:
Classification - softmax
Bounding box regression

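The algorithm is divided into 5 steps:

  1. Obtain about 2000 candidate regions through the Selective Search method;
  2. Use CNN to extract features for each candidate area;
  3. Extract ROI features from the feature map produced by the CNN
  4. Send the fc7 features to an (N+1)-class softmax classifier to determine which class the region belongs to;
  5. Perform N-class bounding box regression on the conv5 features to fine-tune the predicted boxes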

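Similarities and differences with the SPP-Net structure:

  • The feature extraction backbone changes from AlexNet to VGG, which extracts stronger features
  • SPP pooling is replaced by ROI pooling
  • The SVM classification and the regression task are replaced by a multi-task loss function, so the detection task no longer needs to be trained in stages
  • After the ROI feature vector is extracted, it feeds two parallel branches. Softmax replaces the SVM classifier (C+1 classes, including background). A fully connected bounding box regressor replaces the LR regression model; the new regressor outputs the box regression parameters (dx, dy, dw, dh) for each of the (C + 1) classes, (C + 1) * 4 output nodes in total, in groups of 4. The meaning of the regression parameters is the same as in R-CNN.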

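2. Basic knowledge

Region of interest pooling layer (ROI pooling)

ROI pooling is a single-level special case of SPP pooling. It splits the convolutional features of an ROI into an $H \times W$ grid and then applies max pooling to all the features inside each bin.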
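
A minimal sketch of this single-level pooling for one channel, in plain Python. The ROI format `(r0, c0, r1, c1)` in feature-map cells with exclusive end, the bin rounding, and the function name are assumptions for illustration; library implementations also handle the scaling from image to feature-map coordinates.

```python
import math


def roi_max_pool(fmap, roi, out_h=7, out_w=7):
    """Max-pool the ROI of a 2-D feature map into an out_h x out_w grid.

    roi is (r0, c0, r1, c1) in feature-map cells, with r1 and c1 exclusive.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        rs, re = r0 + (i * h) // out_h, r0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            cs, ce = c0 + (j * w) // out_w, c0 + math.ceil((j + 1) * w / out_w)
            # Each bin keeps only its largest activation.
            row.append(max(fmap[r][c] for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
print(roi_max_pool(fmap, (1, 1, 5, 5), out_h=2, out_w=2))  # [[14, 16], [26, 28]]
```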

Multi-task loss

The loss function is:

$$L(p,u,t^u,v)=L_{cls}(p,u)+\lambda [u\ge 1]L_{loc}(t^u,v)$$

The loss function has two parts. The first part is the classifier loss, $L_{cls}(p,u)=-\log p_u$, where $p$ is the predicted class probability distribution of each ROI and $u$ is the ground-truth class.

The second part is the regressor loss, a smooth L1 loss:

$$L_{loc}(t^u,v)=\sum_{i\in\{x,y,w,h\}}\mathrm{smooth}_{L1}(t_i^u-v_i)$$

$$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^2, & |x|<1\\ |x|-0.5, & \text{otherwise}\end{cases}$$

where $v=\{v_x,v_y,v_w,v_h\}$ is the offset target, $t^u=\{t_x^u,t_y^u,t_w^u,t_h^u\}$ is the predicted offset, and $[u\ge 1]$ is the indicator function: when it is 1, the ROI belongs to an object class and there is a regression loss; when it is 0, the ROI is background and there is no regression loss.
The offsets are computed as follows, where $P$ is the candidate box and $G$ is the ground-truth box (center coordinates, width and height):

$$\begin{aligned} t_x&=(G_x-P_x)/P_w\\ t_y&=(G_y-P_y)/P_h\\ t_w&=\log(G_w/P_w)\\ t_h&=\log(G_h/P_h) \end{aligned}$$
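
The formulas above translate almost line by line into code. The sketch below is plain Python for a single ROI; boxes are assumed to be `(center_x, center_y, width, height)` tuples, and all function names, as well as the default `lam=1.0`, are choices made for this example.

```python
import math


def encode_offsets(proposal, gt):
    """Offsets (t_x, t_y, t_w, t_h) from proposal box P to ground-truth box G."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def multitask_loss(probs, u, pred_offsets, target_offsets, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc for one ROI.

    probs: class probabilities (index 0 is background); u: ground-truth class;
    pred_offsets: predicted offsets for class u; target_offsets: the target v.
    """
    cls_loss = -math.log(probs[u])
    if u >= 1:  # background ROIs contribute no regression loss
        loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_offsets, target_offsets))
    else:
        loc_loss = 0.0
    return cls_loss + lam * loc_loss


v = encode_offsets((10.0, 10.0, 4.0, 4.0), (12.0, 10.0, 8.0, 4.0))
print(v)  # t_x = 0.5, t_y = 0.0, t_w = log 2, t_h = 0.0
```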

3. Training & Testing Process

Training process: Feed the whole image into the CNN while running the SS algorithm to extract candidate boxes. Map each candidate box onto the Conv5 feature map to get its feature matrix, apply ROI pooling to normalize it to a fixed size, and pass it through the fully connected layers. The resulting features are fed to the softmax classifier and to the bounding box regressor (each needs one more FC layer to match its output dimension). The multi-task loss is computed and its gradients are back-propagated, which gives end-to-end training of the network.

Fine-tuning starts from the pre-trained model. During Fast R-CNN training, the mini-batches of stochastic gradient descent (SGD) are sampled hierarchically: first sample N images, then sample R/N ROIs from each image.

  • batch_size=128
  • Batch size (128) = number of pictures per batch (2) * number of ROIs per picture (64)
  • A batch comes from two images, with 64 candidate regions taken from each. The ratio of positive to negative samples is 1:3. A positive sample has an IoU greater than 0.5, and a negative sample must have an IoU between 0.1 and 0.5, which serves as a hard example mining strategy.
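
The sampling rule in the bullets above can be sketched per image as follows. The function name, the treatment of the boundary values (0.5 counts as positive, 0.1 as negative), and the behavior when an image has too few positives are assumptions for illustration.

```python
import random


def sample_rois(ious, rois_per_image=64, fg_fraction=0.25, rng=random):
    """Pick ROI indices for one image.

    ious[i] is the maximum IoU of proposal i with any ground-truth box.
    A 1:3 positive-to-negative ratio means at most 25% positives.
    """
    fg = [i for i, v in enumerate(ious) if v >= 0.5]
    bg = [i for i, v in enumerate(ious) if 0.1 <= v < 0.5]
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)


# Hypothetical IoU values: 30 positives, 100 usable negatives, 50 ignored.
ious = [0.6] * 30 + [0.3] * 100 + [0.05] * 50
fg, bg = sample_rois(ious, rng=random.Random(0))
print(len(fg), len(bg))  # 16 48
```

Calling this for each of the 2 images in a batch gives the 128 ROIs described above.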

Test process: Same as the training process, with the NMS algorithm added as post-processing for each category.


4. Faster R-CNN

To remove the time-consuming SS step of Fast R-CNN, Faster R-CNN proposes the RPN (Region Proposal Network). The other parts are the same as Fast R-CNN, that is, Faster R-CNN = RPN + Fast R-CNN: the RPN replaces the offline Selective Search module, which removes the performance bottleneck. At the same time, Faster R-CNN further shares the computation of the convolutional layers between the RPN and the detection network, with the RPN acting as an attention mechanism that tells the detector where to look.


1. Network structure

Faster RCNN = RPN + Fast RCNN, mainly divided into the following steps:
  • Input the image to the CNN network to get the feature map
  • Use the RPN to generate candidate boxes, then project these candidate boxes onto the feature map from the first step to obtain the corresponding feature matrices
  • Then scale each feature matrix to a fixed 7 × 7 feature map through the ROI pooling layer; finally, flatten the feature map and pass it through a series of fully connected layers to get the classification and regression results.

2. RPN

1. Network structure
The structure of the RPN is as follows:

  • The Conv5 feature map of the image (13 × 13 × 256) is fed into the RPN, where it passes through a 3 × 3 × 256 convolution kernel, a 1 × 1 × 256 convolution kernel and a ReLU activation, giving a 256-dimensional feature at each sliding-window position.
  • The resulting features are fed into two parallel branches. The first branch, the cls layer, convolves with 2k kernels of size 1 × 1 × 256 and outputs 2k numbers, the scores for whether a region contains an object or not.
  • The second branch, the reg layer, convolves with 4k kernels of size 1 × 1 × 256 and outputs 4k numbers, the offsets of x, y, w and h.

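2. Anchor
Anchor boxes are reference boxes in the image, corresponding to k in the network structure. Typically k=9, the combinations of 3 scales and 3 aspect ratios.

  • 3 scales: [128, 256, 512]
  • 3 ratios: 1:1, 1:2, 2:1
  • After receiving the feature map, the RPN performs a 3 × 3 convolution. Positions on the feature map correspond to positions in the original image, and the center of each anchor box is the center of the convolution kernel. Each convolution position on the conv5 layer automatically corresponds to 9 anchor boxes, so the fitted bounding box offsets are offsets relative to the anchor boxes.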
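
The 9 anchors at one position can be generated as in the sketch below. It assumes each scale s means an anchor area of s × s pixels, that a ratio is height divided by width, and that the feature map has a stride of 16 pixels relative to the image; none of these conventions is stated in the article, and the function names are made up for the example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return 9 anchors (x1, y1, x2, y2) centred at image position (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while changing the height/width ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def grid_anchors(feat_h, feat_w, stride=16):
    """Anchors for every feature-map cell, centred on the cell in image space."""
    return [
        a
        for y in range(feat_h)
        for x in range(feat_w)
        for a in make_anchors((x + 0.5) * stride, (y + 0.5) * stride)
    ]


print(make_anchors(0, 0)[0])      # (-64.0, -64.0, 64.0, 64.0)
print(len(grid_anchors(13, 13)))  # 13 * 13 * 9 = 1521
```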

3. Loss function

$$L(\{p_i\},\{t_i\})=\frac{1}{N_{cls}}\sum_{i}L_{cls}(p_i,p_i^*)+\lambda \frac{1}{N_{reg}}\sum_{i}p_i^*L_{reg}(t_i,t_i^*)$$

  • $p_i$ is the probability that the i-th anchor is predicted as its true label
  • $p_i^*$ is 1 for a positive sample and 0 for a negative sample (it plays the same role as the Iverson bracket in Fast R-CNN)
  • $t_i$ is the predicted bounding box regression parameters of the i-th anchor box
  • $t_i^*$ is the regression parameters of the GT box corresponding to the i-th anchor box
  • $N_{cls}$ is the number of samples in a mini-batch
  • $N_{reg}$ is the number of anchor box locations

The first part is the classification loss. If a multi-class softmax cross-entropy loss is used, then since there are only two classes, background and foreground, there are 2k values for k anchor boxes. If a binary cross-entropy loss is used, only one probability is computed per anchor box, so there are k values for k anchor boxes.

The second part is the bounding box regression loss, whose form is similar to that of Fast R-CNN.
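
A minimal sketch of this loss in plain Python, using the binary cross-entropy variant. Here `p[i]` is taken to be the predicted foreground probability of anchor i, the regression term uses smooth L1 as in Fast R-CNN, and `lam` defaults to 1.0 only as a placeholder; these choices and the function names are assumptions for illustration.

```python
import math


def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """RPN loss over the sampled anchors of one mini-batch.

    p: predicted foreground probabilities; p_star: labels (1 positive, 0 negative);
    t, t_star: predicted and target offsets, one 4-tuple per anchor.
    """
    cls_sum = 0.0
    reg_sum = 0.0
    for pi, ps, ti, tsi in zip(p, p_star, t, t_star):
        # Binary cross-entropy between prediction and label.
        cls_sum += -(ps * math.log(pi) + (1 - ps) * math.log(1 - pi))
        # p_star switches the regression term off for negative anchors.
        reg_sum += ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
    return cls_sum / n_cls + lam * reg_sum / n_reg


loss = rpn_loss(
    p=[0.5, 0.5],
    p_star=[1, 0],
    t=[(0.0, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0)],
    t_star=[(0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    n_cls=2,
    n_reg=1,
)
print(loss)  # log(2) + 0.125; the negative anchor's large offsets are ignored
```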

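4. Joint training of the RPN loss and the Fast R-CNN loss
The steps are as follows:

  1. Train the RPN
    initialize the convolutional layer parameters from an ImageNet pre-trained classification model;
  2. Train the Fast R-CNN network
    initialize the convolutional layer parameters from an ImageNet pre-trained classification model; region proposals are generated by the RPN from step 1
  3. Tune the RPN
    initialize it with the Fast R-CNN convolutional layer parameters;
    keep the convolutional layers fixed and fine-tune the remaining layers
  4. Tune Fast R-CNN
    keep the convolutional layers fixed and fine-tune the remaining layers; region proposals are generated by the RPN from step 3.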



Origin blog.csdn.net/m0_52427832/article/details/127390742