Faster RCNN study notes

Faster RCNN network structure:

Faster RCNN can be divided into 4 main parts:

1. Conv layers.
The feature extraction backbone. Faster RCNN first uses a set of basic conv+relu+pooling layers to extract feature maps from the image. These feature maps are shared by the subsequent RPN layer and the fully connected layers.
2. Region Proposal Networks.
The RPN network is used to generate proposals (candidate boxes). This layer uses softmax to determine whether the anchors (prior boxes) belong to the foreground or the background, and uses bounding box regression to correct the anchors and obtain accurate proposals.
3. RoI Pooling.
This layer collects the input feature maps and proposals, integrates this information to extract the proposal feature maps, and sends them to the subsequent RCNN fully connected layers to determine the target category.
4. Classification.
Use the proposal feature maps to compute the category of each proposal, and at the same time apply bounding box regression again to obtain the final precise position of the detection box.

1. Conv layers

For an input image of arbitrary size P×Q, first scale it to a fixed size M×N, then feed the M×N image into the network. The Conv layers consist of 13 conv layers + 13 relu layers + 4 pooling layers, where:

  1. All conv layers are: kernel_size=3, pad=1
  2. All pooling layers are: kernel_size=2, stride=2

Throughout the Conv layers, the conv and relu layers do not change the input/output size; only the pooling layers halve the output width and height. An M×N input therefore becomes (M/16)×(N/16) after the Conv layers, so the feature map generated by the Conv layers can be put in correspondence with the original image. Taking VGG16 as an example, suppose the input image has dimensions 3×600×800. Since the downsampling rate of VGG16 is 16, the output feature map has dimensions 512×38×50.
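As a quick check, here is a minimal sketch of this downsampling behaviour in PyTorch. It is not the full 13-conv VGG16, and `ceil_mode=True` is my assumption to reproduce the 600 → 38 rounding above (torchvision's VGG16 pools with floor mode and would give 37):

```python
import torch
import torch.nn as nn

# Every conv uses kernel_size=3, pad=1 (size-preserving); every pool uses
# kernel_size=2, stride=2 (halves H and W). Four pools = overall stride 16.
def block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),  # ceil: 75 -> 38
    )

backbone_sketch = nn.Sequential(block(3, 64), block(64, 128),
                                block(128, 256), block(256, 512))
x = torch.randn(1, 3, 600, 800)
print(backbone_sketch(x).shape)  # torch.Size([1, 512, 38, 50])
```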

2. RPN module

RPN takes the convolutional feature map and generates proposals on the image.

RPN network: generates high-quality Proposals, using strong priors in the form of Anchors.
Input: the feature map and the object labels, i.e. the category and bounding box of every object in the training set.
Output: Proposals, a classification loss and a regression loss. The Proposals serve as candidate regions for the subsequent classification and regression modules, and the two loss terms are used to optimize the network.

1. Anchor generation

RPN generates 9 Anchors for each point on the feature map. These 9 Anchors differ in size and aspect ratio and, mapped back to the original image, basically cover all possible objects. The detection boxes obtained this way are very inaccurate, but don't worry: there are two rounds of bounding box regression later to correct their positions. Given this large number of Anchors, RPN's next job is to filter them and adjust their positions to obtain better Proposals.
There are k anchors at each point of the feature map (default k=9). Each anchor is classified as foreground or background, so the 256-d feature at each point is converted into cls = 2k scores; and each anchor has 4 offsets corresponding to [x, y, w, h], so reg = 4k coordinates.

Using all anchors for training would be too many, so 256 suitable anchors are selected during training (128 positive anchors + 128 negative anchors).

Anchors are defined on the convolutional feature map, but the final anchors are relative to the original image. On the original image the anchor centers are spaced r pixels apart; in VGG, r=16.
(Figure: the Anchor center points on the feature map mapped to the original image.)
(Figure) Left: Anchors of a single point on the feature map; Middle: the Anchors of a single anchor point mapped to the original image; Right: all Anchors mapped onto the original image.
The original image has a total of 38 x 50 x 9 = 17,100 Anchors.
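A hedged sketch of dense anchor generation under the common VGG16 setup (stride r=16, scales {8, 16, 32}, aspect ratios {0.5, 1, 2}); the exact base sizes and the half-cell centre offset are implementation conventions I assume here, not taken from this article:

```python
import numpy as np

def make_anchors(feat_h=38, feat_w=50, stride=16,
                 scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    # 9 base anchors (x1, y1, x2, y2) centred at the origin, in image pixels
    base = []
    for s in scales:
        for r in ratios:
            w = stride * s * np.sqrt(1.0 / r)   # ratio r = h / w, area kept constant
            h = stride * s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                        # (9, 4)

    # one anchor centre per feature-map cell, spaced `stride` pixels apart
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx, sy, sx, sy], axis=-1).reshape(-1, 1, 4)

    return (shifts + base).reshape(-1, 4)        # (38 * 50 * 9, 4)

print(make_anchors().shape)  # (17100, 4)
```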

2. RPN convolutional network

A 1×1 convolution is used to predict, for every Anchor on the feature map, the category of the predicted box (a foreground/background prediction score) and the offsets of the ground-truth box relative to the Anchor.
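A sketch of this head in PyTorch (in the paper a shared 3×3 conv precedes the two 1×1 convs; the 512 input channels follow the VGG16 feature map above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # fg/bg scores -> H x W x 18
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box offsets  -> H x W x 36

    def forward(self, feat):
        t = F.relu(self.conv(feat))
        return self.cls(t), self.reg(t)

cls_logits, reg_deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(cls_logits.shape, reg_deltas.shape)  # (1, 18, 38, 50) (1, 36, 38, 50)
```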

2.1 Softmax predicts the foreground/background score of each Anchor

The predicted category value of each anchor: a 1×1 convolution reduces the feature map to W×H×18. This corresponds exactly to the feature map having 9 anchors at each point, each of which may be foreground or background, so the information is stored in a matrix of size W×H×(9×2). The softmax classifier that follows obtains the foreground anchors: each Anchor receives a foreground and a background probability score. It is a binary classification problem, i.e. targets are expected to appear in the foreground anchors.

The role of the reshape layer is to facilitate softmax classification. The matrix above that stores the bg/fg anchors has shape [1, 2×9, H, W]; since softmax must classify along an fg/bg dimension, the reshape layer first changes it to [1, 2, 9×H, W], i.e. a single dimension of size 2 is "vacated" for softmax classification, and afterwards it is reshaped back to the original layout.
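The same trick in PyTorch, as a sketch with assumed shapes (k=9, H=38, W=50; the channel layout of the 1×1 conv output must of course match this reshape):

```python
import torch
import torch.nn.functional as F

cls_logits = torch.randn(1, 18, 38, 50)   # [1, 2*9, H, W] from the 1x1 conv
t = cls_logits.view(1, 2, 9 * 38, 50)     # "vacate" a size-2 dim for softmax
probs = F.softmax(t, dim=1)               # per-anchor bg/fg probabilities
probs = probs.view(1, 18, 38, 50)         # reshape back to the original layout
```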

The true category value of each anchor

The RPN network judges whether an anchor belongs to the foreground or the background by computing the IoU between the anchors and the labels, and uses the anchors plus softmax to preliminarily extract foreground anchors as candidate regions.

2.2 Predict the offsets of the 4 position coordinates of each Anchor

The true offset values of each anchor: let the anchor's center coordinates be xa and ya and its width and height be wa and ha, and let the label box M have center coordinates x and y and width and height w and h. The position offsets tx and ty are normalized by the width and height, while the width and height offsets tw and th are processed logarithmically:

tx = (x - xa) / wa,  ty = (y - ya) / ha
tw = log(w / wa),    th = log(h / ha)

This further limits the range of the offsets and makes them easier to predict.
The predicted offset values of each anchor: a 1×1 convolution reduces the output feature map to W×H×36. With 9 anchors per point, each anchor has 4 regression values [dx(A), dy(A), dw(A), dh(A)], the predicted offsets of the real object box relative to the Anchor. Once the predicted offsets are obtained, the formulas above can be applied in reverse to the corresponding Anchor to obtain the actual position x', y', w', h' of the predicted box.
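A sketch of this inverse transform (boxes as (cx, cy, w, h) for brevity; real implementations also clip the resulting boxes to the image boundary):

```python
import torch

def decode(anchors, deltas):
    # anchors, deltas: (N, 4) tensors; columns (cx, cy, w, h) and (dx, dy, dw, dh)
    xa, ya, wa, ha = anchors.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    x = dx * wa + xa           # x' = dx * wa + xa
    y = dy * ha + ya           # y' = dy * ha + ya
    w = wa * torch.exp(dw)     # w' = wa * exp(dw)
    h = ha * torch.exp(dh)     # h' = ha * exp(dh)
    return torch.stack([x, y, w, h], dim=1)
```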

RPN network structure summary:
generate anchors -> softmax classifier extracts fg anchors -> bbox regression refines fg anchors -> generate proposals

In fact, an Anchor is a prior reference value for the attributes we want to predict, and it is not limited to a rectangular box. If necessary, we could also add other types of priors, such as polygonal boxes, angles and speeds.

2.3 Calculate RPN loss:

This step only happens during training: match all anchors against the ground-truth labels, decide whether each anchor is a positive or negative sample by computing the IoU between the anchors and the labels to obtain the true classification and offset values, and then compute the loss against the scores and offset values predicted in the previous step.

Since the total number of Anchors is close to 20,000 and most of them are labeled background, computing the loss over all of them would unbalance the positive and negative samples, which is not conducive to network convergence. By default RPN therefore selects 256 Anchors for the loss calculation, of which at most 128 are positive samples; if the number exceeds this limit, a random subset is taken. Of course, 256 and 128 here can be adjusted to the actual situation rather than being fixed.
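A hedged sketch of this sampling rule (labels: 1 = positive, 0 = negative, -1 = ignored):

```python
import numpy as np

def sample_anchors(labels, num=256, pos_fraction=0.5):
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), int(num * pos_fraction))            # cap positives at 128
    pos = np.random.choice(pos, n_pos, replace=False)
    neg = np.random.choice(neg, min(len(neg), num - n_pos), replace=False)
    return pos, neg                                           # indices used for the loss
```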

Loss function design

With the network's predicted values and the true values, the loss can be calculated. The RPN loss function contains a classification part and a regression part:

L({pi}, {ti}) = (1/Ncls) Σi Lcls(pi, pi*) + λ (1/Nreg) Σi pi* Lreg(ti, ti*)

The first part is the classification loss over the 256 selected Anchors, where pi* is the true category of each Anchor and pi is the predicted category. Since the role of RPN is to select Proposals, it does not need to distinguish which kind of foreground an anchor is, so at this stage it is a binary classification and uses the cross-entropy loss.
The second part is the regression loss, where pi* acts as a filter (only positive samples are regressed) and the coefficient λ balances the two parts of the loss. The regression loss uses the smooth L1 function:

smoothL1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise
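The same function as a short sketch in PyTorch:

```python
import torch

def smooth_l1(pred, target):
    # quadratic for small errors, linear for large ones (beta = 1)
    diff = (pred - target).abs()
    return torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()
```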
2.4 NMS and screening Proposals to get RoIs:

Since anchors often overlap, Proposals will also end up overlapping on the same target. To solve the problem of duplicate proposals, we use a simple algorithm called non-maximum suppression (NMS). After applying NMS, we keep the N Proposals with the highest scores; N=2000 in the paper.
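A sketch of this filtering step using torchvision's NMS (the 0.7 IoU threshold is a typical value, not taken from this article; N=2000 follows the text):

```python
import torch
from torchvision.ops import nms

def filter_proposals(boxes, scores, iou_thresh=0.7, top_n=2000):
    # boxes: (M, 4) as (x1, y1, x2, y2); scores: (M,)
    keep = nms(boxes, scores, iou_thresh)   # indices, sorted by decreasing score
    keep = keep[:top_n]                     # keep the N highest-scoring survivors
    return boxes[keep], scores[keep]
```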

During training, about 2000 Proposals are generated in the previous step, many of which are still background boxes; those that actually contain objects remain a minority. Therefore Proposals can be screened once more, in a process similar to the Anchor screening in RPN: build an IoU matrix between the labels and the Proposals, select 256 RoIs according to their overlap with the labels, and assign each RoI a positive or negative sample label. In the testing phase this module is not needed, and the Proposals can be used directly as RoIs (around 300 of them).
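A sketch of the label/Proposal IoU matrix used in this screening (the 0.5 positive threshold is a typical value, my assumption rather than the article's):

```python
import torch
from torchvision.ops import box_iou

gt = torch.tensor([[50., 50., 200., 200.]])   # one ground-truth box (x1, y1, x2, y2)
proposals = torch.rand(2000, 4) * 300          # placeholder proposals
proposals[:, 2:] += proposals[:, :2]           # ensure x2 > x1 and y2 > y1
iou = box_iou(gt, proposals)                   # (1, 2000) label/Proposal IoU matrix
positive = iou.max(dim=0).values >= 0.5        # candidate positive RoIs
```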

3. RoI Pooling

This part links what comes before and after: it accepts the feature map extracted by the convolutional network and the RoIs from the RPN, and feeds its output to the RCNN network. Since the RCNN module uses a fully connected network, the feature dimension must be fixed, yet the feature size corresponding to each RoI differs and cannot be fed into the fully connected network as-is. Therefore RoI Pooling pools the features of each RoI to a fixed dimension so they can be sent to the fully connected network.
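A sketch with torchvision's `roi_pool` (the 7×7 output size is the usual VGG16 setting, assumed here since the article does not state it):

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 38, 50)                 # backbone feature map
rois = torch.tensor([[0., 16., 16., 320., 240.]])  # (batch_idx, x1, y1, x2, y2), image coords
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([1, 512, 7, 7])
```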
(Figure: RoI Pooling)

4. RCNN

(Figure: R-CNN architecture)

R-CNN takes the feature map of each proposal, flattens it, and passes it through two fully connected layers of size 4096 with ReLU activations.
Then it uses two different fully connected layers for two different targets (see the sketch after this list):

  • FC 21 is used for classification, predicting which category each RoI belongs to (20 categories + background)
  • FC 84 is used for position regression (21 classes, each with 4 position parameters)
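A sketch of this head (the 512×7×7 input assumes VGG16 features pooled to 7×7, as above):

```python
import torch
import torch.nn as nn

class RCNNHead(nn.Module):
    def __init__(self, num_classes=21):                    # 20 categories + background
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(4096, num_classes)      # "FC 21"
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # "FC 84"

    def forward(self, pooled):                             # pooled: (N, 512, 7, 7)
        x = self.fc(pooled.flatten(1))
        return self.cls_score(x), self.bbox_pred(x)

scores, deltas = RCNNHead()(torch.randn(8, 512, 7, 7))
print(scores.shape, deltas.shape)  # (8, 21) (8, 84)
```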

The features obtained from RoI Pooling are sent into the fully connected network to predict the classification of each RoI and the offsets that refine the box positions, and the loss is computed. This mainly contains 3 parts:

1. RCNN fully connected network: the fixed-dimension RoI features are fed into the fully connected network, which outputs the predicted scores and the predicted regression offsets of the RCNN part.
2. Calculate the RCNN true values: for the selected RoIs, determine whether each is a positive or negative sample, and compute its offset from the corresponding ground-truth object.
3. RCNN loss: this step only happens during training, using the RCNN predictions and the RoI true values. For the classification problem, the cross-entropy loss is used directly; for the position regression loss, Smooth L1 Loss is used as well, but it is computed only for positive samples, and only over the 4 parameters belonging to that sample's own category.

From the whole pipeline it can be seen that Faster RCNN is a two-stage algorithm, namely RPN and RCNN. Both stages need to compute a loss, but the former's job is to provide better regions of interest for the latter.

Training losses are:

1. RPN classification loss: whether an anchor is foreground (binary classification)
2. RPN position regression loss: fine-tune the anchor positions
3. RoI classification loss: the RoI's category (21-way classification, with one extra class for the background)
4. RoI position regression loss: continue to fine-tune the RoI positions

The four losses are summed into the final loss, which is backpropagated to update the parameters.

