06 - Algorithm Interpretation: Fast R-CNN (Object Detection)

Main points:

Fast R-CNN is a two-stage detector.

Regression loss reference: https://www.cnblogs.com/wangguchangqing/p/12021638.html

2 Fast R-CNN algorithm

Fast R-CNN is Ross Girshick's follow-up to R-CNN. With the same VGG16 backbone, it trains 9 times faster than R-CNN, runs inference 213 times faster at test time, and raises accuracy (mAP) from 62% to 66% on the Pascal VOC dataset.

The Fast R-CNN pipeline can be divided into 3 steps:

1. Generate 1K~2K candidate regions per image with the Selective Search (SS) method.
2. Feed the whole image into the network to obtain its feature map, then project each candidate box generated by SS onto the feature map to obtain the corresponding feature matrix.
3. Scale each feature matrix to a 7x7 feature map with the RoI pooling layer, then flatten it and pass it through a series of fully connected layers to obtain the predictions.

2.1 Compute features for the entire image once

R-CNN feeds each candidate region into the convolutional neural network one at a time to extract its features, so overlapping regions are convolved repeatedly.

Fast R-CNN feeds the entire image into the network once and then extracts the features of each candidate region from the resulting feature map, so these features do not need to be recomputed.
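To illustrate how a candidate box in image coordinates can be located on the shared feature map, here is a minimal sketch. It assumes a total backbone stride of 16 (a common value for the last convolutional feature map of VGG16) and a simple floor/ceil rounding rule. The function name `project_box_to_feature_map` is hypothetical, and real implementations differ in their rounding details.

```python
import math


def project_box_to_feature_map(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell indices.

    `stride` is the assumed total downsampling factor of the backbone.
    The returned box is half-open: it covers columns fx1..fx2-1 and rows fy1..fy2-1.
    """
    x1, y1, x2, y2 = box
    fx1 = math.floor(x1 / stride)
    fy1 = math.floor(y1 / stride)
    # Round the far edge up, and keep at least one cell in each dimension.
    fx2 = max(math.ceil(x2 / stride), fx1 + 1)
    fy2 = max(math.ceil(y2 / stride), fy1 + 1)
    return fx1, fy1, fx2, fy2


if __name__ == "__main__":
    print(project_box_to_feature_map((32, 48, 200, 120)))
```

Because the projection is plain coordinate arithmetic, every candidate box can reuse the same feature map without running the convolutional layers again.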

2.2 RoI Pooling Layer

The RoI pooling layer (region of interest pooling layer) extracts the features of regions of interest from the convolutional feature map. An RoI (Region of Interest) here is a candidate bounding box in the input image, as produced by the region proposal step (Selective Search in Fast R-CNN).

The role of the RoI pooling layer is to map RoIs of different sizes to outputs of the same size. Specifically, it divides each RoI into a fixed grid of sub-regions (7x7 in the pipeline described above) and max-pools each sub-region, which yields a fixed-size output. RoIs of any size can therefore be processed and mapped to feature maps of the same size, which is what the subsequent classification and regression layers require. As a result, the network does not restrict the size of the input image.
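The pooling step described above can be sketched in plain Python as follows. This is a simplified illustration, not the original implementation: it handles a single channel stored as a list of rows, takes the RoI in feature-map cells, and splits it into bins with a floor/ceil rule, which is one common choice. The function name `roi_max_pool` is hypothetical.

```python
import math


def roi_max_pool(feature, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a single-channel feature map into an out_h x out_w grid.

    feature: list of rows (H x W).
    roi: (x1, y1, x2, y2) in feature-map cells, half-open on the far edge,
         so it must span at least one cell in each dimension.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Floor the bin start and ceil the bin end so that the bins cover
        # the whole RoI and no bin is ever empty (neighbouring bins may overlap).
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    # A 4 x 6 feature map whose value at (y, x) is y * 6 + x.
    feature = [[y * 6 + x for x in range(6)] for y in range(4)]
    print(roi_max_pool(feature, (0, 0, 6, 4), out_h=2, out_w=2))
    print(roi_max_pool(feature, (1, 1, 4, 3), out_h=2, out_w=2))
```

Both calls return a 2x2 grid even though the two RoIs have different sizes, which is the property the fully connected layers rely on.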

2.3 Classifier

The classifier outputs probabilities for N+1 categories (N object classes plus 1 background class), for a total of N+1 nodes.

2.4 Bounding box regressor

The regressor outputs the bounding box regression parameters $(d_x, d_y, d_w, d_h)$ for each of the N+1 categories, for a total of (N+1)x4 nodes.

$\hat{G}_x = P_w d_x(P) + P_x$

$\hat{G}_y = P_h d_y(P) + P_y$

$\hat{G}_w = P_w \exp(d_w(P))$

$\hat{G}_h = P_h \exp(d_h(P))$

$P_x, P_y, P_w, P_h$ are the center x and y coordinates, width, and height of the candidate box.

$\hat{G}_x, \hat{G}_y, \hat{G}_w, \hat{G}_h$ are the center x and y coordinates, width, and height of the final predicted bounding box.
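The four equations above translate directly into code. The sketch below is illustrative only: the function names `decode_box` and `select_class_deltas` are hypothetical, and the layout of the flat (N+1)x4 output (four consecutive values per class) is an assumption rather than something the text specifies.

```python
import math


def select_class_deltas(flat_output, class_index):
    """Pick (dx, dy, dw, dh) for one class from a flat (N+1)*4 output.

    Assumes the values are stored class by class, four per class.
    """
    start = 4 * class_index
    return tuple(flat_output[start:start + 4])


def decode_box(proposal, deltas):
    """Apply regression parameters to a candidate box.

    proposal: (px, py, pw, ph) = center x, center y, width, height.
    deltas:   (dx, dy, dw, dh) predicted for that candidate box.
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px        # shift the center by a fraction of the width
    gy = ph * dy + py        # shift the center by a fraction of the height
    gw = pw * math.exp(dw)   # scale the width
    gh = ph * math.exp(dh)   # scale the height
    return gx, gy, gw, gh


if __name__ == "__main__":
    # Two classes (background + 1 object class), so 8 regression outputs.
    flat = [0.0, 0.0, 0.0, 0.0, 0.1, -0.2, 0.0, math.log(2)]
    deltas = select_class_deltas(flat, class_index=1)
    print(decode_box((100.0, 80.0, 40.0, 20.0), deltas))
```

With all-zero parameters the candidate box is returned unchanged, which is a quick sanity check on the parameterization.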

2.5 Multi-task loss
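
For reference, the multi-task loss that trains the classifier and the bounding box regressor jointly can be written as follows, using the notation of the Fast R-CNN paper: $p$ is the predicted class probability vector, $u$ is the true class (0 for background), $t^u$ holds the predicted regression parameters for class $u$, $v$ holds the regression targets, and $\lambda$ balances the two terms.

$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v)$

$L_{cls}(p, u) = -\log p_u$

$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i)$

$\mathrm{smooth}_{L_1}(x) = 0.5 x^2$ if $|x| < 1$, otherwise $|x| - 0.5$

The bracket $[u \ge 1]$ equals 1 for object classes and 0 for background, so background RoIs contribute only to the classification term.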

2.6 Cross-entropy loss

1. For multi-class classification problems (softmax output; all output probabilities sum to 1)

2. For binary classification problems (sigmoid output; each output node is independent of the others)
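
The two cases can be compared with a small sketch using the standard definitions: $H = -\sum_i o^*_i \log o_i$ for the softmax case, and $H = -\sum_i [o^*_i \log o_i + (1 - o^*_i) \log(1 - o_i)]$ for the sigmoid case, where $o_i$ is the predicted value and $o^*_i$ the label. The function names are hypothetical, and the code favours readability over numerical robustness (very large negative logits would overflow in `sigmoid`).

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one sample with a single true class (one-hot label)."""
    return -math.log(softmax(logits)[target_index])


def sigmoid(z):
    """Squash one raw score into (0, 1), independently of the other nodes."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(logits, targets):
    """Sum of per-node binary cross-entropies; each node has its own 0/1 label."""
    loss = 0.0
    for z, t in zip(logits, targets):
        o = sigmoid(z)
        loss -= t * math.log(o) + (1 - t) * math.log(1 - o)
    return loss


if __name__ == "__main__":
    print(softmax_cross_entropy([0.0, 0.0, 0.0], target_index=1))
    print(binary_cross_entropy([0.0, 0.0], targets=[1, 0]))
```

With three equal logits the softmax probabilities are each 1/3, so the first loss is log 3. In the sigmoid case each node contributes log 2 on its own, whatever the other node outputs.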