Summary of Faster RCNN network data flow

Foreword

While learning Faster RCNN I read many blog posts written by others. They gave me a general understanding of the model, but the data flow inside the network during training was still not clear to me, so I went through this version of the faster rcnn code and summarized the network data flow, in order to master Faster RCNN better.

Data flow during training

In this version of the code, the batch_size is 1 during training. The network architecture in the original paper is as follows:
[Figure: Faster RCNN network architecture from the original paper]

① Network input

The first part is the network input. The network accepts an image of any size, but before the image is fed in, it is scaled and then normalized. While scaling the image, the gt_bbox (ground-truth bounding box) must be scaled by the same factor.
How is the scaling done? See the code below.

from skimage import transform as sktsf

def preprocess(img, min_size=600, max_size=1000):
    # img: input image, in CHW layout
    # min_size: lower bound for the shorter side after scaling
    # max_size: upper bound for the longer side after scaling
    C, H, W = img.shape
    scale1 = min_size / min(H, W)
    scale2 = max_size / max(H, W)
    scale = min(scale1, scale2)
    img = img / 255.
    # resize: height and width are scaled by the same factor,
    # so the aspect ratio is preserved
    img = sktsf.resize(img, (C, H * scale, W * scale), mode='reflect',
                       anti_aliasing=False)
    return img

With this proportional scaling, taking the smaller of the two factors means that either the shorter side of the image is scaled to exactly 600 (with the longer side staying below 1000), or the longer side is scaled to exactly 1000 (with the shorter side staying below 600). In effect, both a maximum and a minimum bound on the scaled size are enforced. Because batch_size is 1, each image can end up with a different scaled size; if batch_size were larger than 1, all images in a batch would have to share the same size. In the discussion below we ignore the batch dimension (since batch is 1).
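A quick worked example against the preprocess function above (the input size here is made up):

import numpy as np

# Hypothetical 3 x 375 x 500 image:
# scale1 = 600 / 375 = 1.6, scale2 = 1000 / 500 = 2.0, scale = min = 1.6,
# so the shorter side reaches 600 first and the longer side ends at 800,
# safely below the 1000 limit.
img = np.random.rand(3, 375, 500)
print(preprocess(img).shape)  # (3, 600, 800)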

② Feature extraction network

The second part is the feature extraction module. The feature extractor here is VGG16 with the last few fully connected layers removed. The only things to note are that the input image is downsampled by a factor of 16 after VGG16 (because the truncated network contains 4 pooling layers), and the channel dimension grows to 512.
If the input image $I^{input}$ has size $\left[3, x, y\right]$, then the extracted feature map $I^{feature}$ has size $\left[512, \frac{x}{16}, \frac{y}{16}\right]$.
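For concreteness, here is a minimal sketch of such a truncated extractor built with torchvision (the slice index assumes torchvision's VGG16 layer ordering; the referenced repo builds its extractor the same way):

import torch
from torchvision.models import vgg16

# features[30] is the fifth max-pool, so slicing it off keeps everything up
# to conv5_3 + ReLU with exactly 4 pooling layers, i.e. a total stride of 16.
extractor = vgg16().features[:30]

x = torch.randn(1, 3, 600, 800)   # [3, x, y] plus the batch dimension
print(extractor(x).shape)          # torch.Size([1, 512, 37, 50])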

③ RPN network

The input of the RPN network is the feature map. It first goes through a 3x3 convolution with 512 output channels, so the output is still $\left[512, \frac{x}{16}, \frac{y}{16}\right]$.
The branch on the right is a 1x1 convolution with 36 output channels (36 because each location has 9 anchors and each anchor has 4 coordinates); its output is $\left[36, \frac{x}{16}, \frac{y}{16}\right]$, which is then reshaped to $\left[\text{total number of anchors}, 4\right]$ (the total number of anchors is $9 \cdot \frac{x}{16} \cdot \frac{y}{16}$), recorded as rpn_loc.
The branch on the left is a 1x1 convolution with 18 output channels (18 because each location has 9 anchors and each anchor is either background or foreground, two possibilities); its output is $\left[18, \frac{x}{16}, \frac{y}{16}\right]$. After a softmax, the final output has size $\left[\text{total number of anchors}, 2\right]$, recorded as rpn_score.
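A minimal sketch of these two branches (class and variable names are mine, not the repo's):

import torch
from torch import nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    # n_anchor = 9 anchors per feature-map location
    def __init__(self, in_ch=512, mid_ch=512, n_anchor=9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, 3, 1, 1)    # 3x3, keeps H x W
        self.loc = nn.Conv2d(mid_ch, n_anchor * 4, 1)    # 36 channels
        self.score = nn.Conv2d(mid_ch, n_anchor * 2, 1)  # 18 channels

    def forward(self, x):
        n, _, hh, ww = x.shape
        h = F.relu(self.conv(x))
        # [N, 36, H, W] -> [N, H*W*9, 4], i.e. [total anchors, 4]
        rpn_loc = self.loc(h).permute(0, 2, 3, 1).reshape(n, -1, 4)
        # [N, 18, H, W] -> [N, H*W*9, 2], softmax over background/foreground
        rpn_score = F.softmax(
            self.score(h).permute(0, 2, 3, 1).reshape(n, -1, 2), dim=-1)
        return rpn_loc, rpn_score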

[Figure: RPN data flow during training, showing the AnchorTargetCreator module]
With the above points clear, we can focus on how the RPN computes its loss, denoted $Loss^{RPN}$. We all know that computing a loss requires both network outputs and label values. The network outputs already exist, so where do the labels come from?
From the figure above we can see an AnchorTargetCreator module. Its inputs are the anchors we generate and gt_bbox; it computes the true offset gt_rpn_loc between each anchor and gt_bbox, as well as gt_rpn_label, which says whether the anchor is responsible for background or foreground. Using gt_rpn_loc and gt_rpn_label as labels, we compute losses against rpn_loc and rpn_score respectively, and the sum of the two losses is $Loss^{RPN}$. We will not go into the exact loss formulas here.
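For reference, a sketch of the offset encoding behind gt_rpn_loc, i.e. the parameterization from the Faster RCNN paper (the name bbox2loc follows the referenced repo; a vectorized version would process all anchors at once):

import numpy as np

# Boxes are given as (y_min, x_min, y_max, x_max).
def bbox2loc(anchor, gt):
    ah, aw = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ay, ax = anchor[0] + 0.5 * ah, anchor[1] + 0.5 * aw
    gh, gw = gt[2] - gt[0], gt[3] - gt[1]
    gy, gx = gt[0] + 0.5 * gh, gt[1] + 0.5 * gw
    # (dy, dx, dh, dw): center offsets scaled by anchor size, log-scaled sizes
    return np.array([(gy - ay) / ah, (gx - ax) / aw,
                     np.log(gh / ah), np.log(gw / aw)])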

In BBuf's write-up, "AnchorTargetCreator selects 256 Anchors from more than 20,000 candidate Anchors for classification and regression." The code does sample 256 anchors, but the returned label array covers all anchors, not just the 256 sampled ones (the non-sampled anchors are simply ignored when the loss is computed).

The meaning of the ProposalCreator module is as follows:
[Figure: what the ProposalCreator module does]
In summary, besides being trained by backpropagation itself, the RPN also outputs about 2000 RoIs through the ProposalCreator module.
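A condensed sketch of what ProposalCreator does at training time, assuming the usual hyper-parameters from the referenced repo (keep the top 12000 boxes before NMS, at most 2000 after, NMS threshold 0.7; the minimum-size filter is omitted here):

import numpy as np
import torch
from torchvision.ops import nms

# `roi` holds the anchors already shifted by rpn_loc, as (y1, x1, y2, x2);
# `score` is the foreground probability from rpn_score.
def proposal_creator(roi, score, img_size,
                     n_pre_nms=12000, n_post_nms=2000, nms_thresh=0.7):
    # clip the boxes to the image boundary
    roi[:, 0::2] = np.clip(roi[:, 0::2], 0, img_size[0])
    roi[:, 1::2] = np.clip(roi[:, 1::2], 0, img_size[1])
    # sort by score and keep the best candidates before NMS
    order = score.argsort()[::-1][:n_pre_nms]
    roi, score = roi[order], score[order]
    # NMS (torchvision wants (x1, y1, x2, y2)), then cap at n_post_nms
    keep = nms(torch.from_numpy(roi[:, [1, 0, 3, 2]]).float(),
               torch.from_numpy(score).float(), nms_thresh)
    return roi[keep.numpy()[:n_post_nms]]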

④ ProposalTargetCreator module

Not all of the 2000 RoIs output by the ProposalCreator module are used. After screening by the ProposalTargetCreator module (filtered by their IoU with gt_bbox), 128 RoIs, positive and negative together, are kept. The module also outputs the gt_label and gt_loc of these 128 RoIs.
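A sketch of the sampling rule, assuming the common thresholds (128 samples, about 25% positives, positive if max IoU >= 0.5, negative if below 0.5):

import numpy as np

# `iou` is an [n_roi, n_gt] IoU matrix between the 2000 RoIs and gt_bbox.
def sample_rois(iou, n_sample=128, pos_ratio=0.25, pos_thresh=0.5,
                neg_lo=0.0, neg_hi=0.5):
    max_iou = iou.max(axis=1)
    pos = np.where(max_iou >= pos_thresh)[0]
    neg = np.where((max_iou >= neg_lo) & (max_iou < neg_hi))[0]
    n_pos = min(int(n_sample * pos_ratio), pos.size)
    n_neg = min(n_sample - n_pos, neg.size)
    pos = np.random.choice(pos, n_pos, replace=False)
    neg = np.random.choice(neg, n_neg, replace=False)
    return np.concatenate([pos, neg])  # indices of the kept RoIs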

⑤ RoI pooling

The RoI pooling here is the same as in Fast RCNN; its input is the feature map plus the 128 RoIs. RoI pooling pools all of these differently sized regions down to the same size (7x7). The output of RoI pooling is fed into the classifier.
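A small sketch of this step using torchvision's roi_pool operator (the RoI coordinates below are made up):

import torch
from torchvision.ops import roi_pool

# RoIs are given as (batch_index, x1, y1, x2, y2) in input-image
# coordinates; spatial_scale = 1/16 maps them onto the feature map.
feat = torch.randn(1, 512, 37, 50)                 # [512, x/16, y/16]
rois = torch.tensor([[0., 16., 16., 160., 160.]])  # one hypothetical RoI
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1. / 16)
print(pooled.shape)                                # torch.Size([1, 512, 7, 7])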

⑥ Classifier

The classifier here is shown in the purple box in the figure below.
[Figure: the classifier head (purple box)]
This fully connected network can reuse the fully connected layers of VGG16, which is what the code does.
The 21 outputs correspond to the 21 classes: for each RoI, the probability of belonging to each class, so the output is $[128, 21]$. The 84 outputs come from $21 \times 4$: one set of box coordinates for every class, so the output is $[128, 84]$. These two outputs are compared against gt_label and gt_loc respectively, and the two losses are added up to give the classifier loss.
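A minimal sketch of this head (the names are mine; the two 4096-wide FC layers mirror VGG16's, as the text says the code reuses them):

import torch
from torch import nn

class RoIHead(nn.Module):
    def __init__(self, n_class=21):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
                                nn.Linear(4096, 4096), nn.ReLU())
        self.score = nn.Linear(4096, n_class)      # -> [128, 21]
        self.loc = nn.Linear(4096, n_class * 4)    # -> [128, 84]

    def forward(self, pooled):                     # pooled: [128, 512, 7, 7]
        h = self.fc(pooled.flatten(1))
        return self.score(h), self.loc(h)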
The suppress step is non-maximum suppression, applied only at inference time; it is not used during training.

Backpropagation

To sum up, the loss of the RPN network and the loss of the classifier are added together, and backpropagation on this total loss updates the parameters.
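Written out with the four components described above (the subscript labels are mine, not from the paper):

$$Loss^{total} = \underbrace{Loss^{RPN}_{cls} + Loss^{RPN}_{loc}}_{\text{RPN loss}} + \underbrace{Loss^{head}_{cls} + Loss^{head}_{loc}}_{\text{classifier loss}}$$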
Finally, here is the Faster RCNN network flow diagram summarized by BBuf.
[Figure: Faster RCNN network flow diagram, summarized by BBuf]
My knowledge here is limited; if anything in this post is wrong, please point it out. Thank you.
Reference links: giantpandacv, simple-faster-rcnn-pytorch

Original post: blog.csdn.net/qq_41596730/article/details/132405764