Faster R-CNN principle analysis


Faster R-CNN main steps

1: First, scale the input image to a fixed size M*N, then send the image into the network.

2: Extract feature maps with a backbone network such as VGG or ResNet. In addition, initialize the anchors and find the valid anchors (Step 1).

3: The feature map is passed through the RPN (Region Proposal Network) to obtain the confidence of each valid anchor and the coordinate coefficients of the predicted boxes: at every feature-map location, 2K scores (positive/negative score x K anchors) and 4K coordinates (4 coordinate offset coefficients x K anchors).
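
To make the 2K/4K shapes concrete, here is a minimal sketch of an RPN head in PyTorch (module and layer names are illustrative, not from the post): a 3*3 convolution followed by two 1*1 convolutions that emit 2K scores and 4K box coefficients per feature-map location.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head: per-location scores and box deltas for K anchors."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls_score = nn.Conv2d(512, 2 * k, kernel_size=1)  # 2K scores: positive/negative per anchor
        self.bbox_pred = nn.Conv2d(512, 4 * k, kernel_size=1)  # 4K coords: offset coefficients per anchor

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls_score(h), self.bbox_pred(h)

# A 50*50 VGG feature map yields (1, 18, 50, 50) scores and (1, 36, 50, 50) offsets.
head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 50, 50))
```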

4: Use the proposals computed from the feature map and the predicted box coefficients to obtain fixed-size target feature maps through ROI Pooling; that is, use the predicted boxes to extract the targets from the feature map.

 

ROI Pooling uses the parameters Spatial_Scale, Pooled_w and Pooled_H, as follows:

  • The proposal corresponds to the M*N scale, so first use Spatial_Scale to map it back onto the (M/16)*(N/16) feature map.
  • Then divide the mapped feature-map region into a Pooled_w*Pooled_H grid.
  • Then perform a max pooling operation on each grid cell to produce the output.
  • This ensures that the output size is uniform.
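
A minimal PyTorch sketch of this procedure, using adaptive max pooling for the Pooled_w*Pooled_H grid (function and variable names are illustrative; spatial_scale = 1/16 matches the VGG stride):

```python
import torch
import torch.nn.functional as F

def roi_pool(feature_map, rois, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    """feature_map: (C, H, W); rois: (N, 4) boxes as (x1, y1, x2, y2) at M*N image scale."""
    outputs = []
    for x1, y1, x2, y2 in rois:
        # Map the proposal from image coordinates onto the feature map.
        fx1, fy1 = int(x1 * spatial_scale), int(y1 * spatial_scale)
        fx2, fy2 = int(x2 * spatial_scale) + 1, int(y2 * spatial_scale) + 1
        region = feature_map[:, fy1:fy2, fx1:fx2]
        # Divide the region into a pooled_h*pooled_w grid and max-pool each cell.
        outputs.append(F.adaptive_max_pool2d(region, (pooled_h, pooled_w)))
    return torch.stack(outputs)  # (N, C, pooled_h, pooled_w): uniform output size

feats = torch.randn(512, 50, 50)
rois = torch.tensor([[32.0, 48.0, 320.0, 400.0]])
print(roi_pool(feats, rois).shape)  # torch.Size([1, 512, 7, 7])
```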

5: The fixed-size feature map is sent into fully connected layers and a softmax to predict Box_Pred and Cls_Prob respectively.

Detailed worked example

Step 1: Resize the input image to 800*800.

Now divide the image into a grid. Assume the pixel area of each grid cell is 16*16; then there are 800/16 = 50 cells per side, so the grid is 50*50. Each cell generates K = 9 anchors, so one image generates 50*50*9 = 22,500 anchors.

Because the anchors have different sizes, some of them inevitably extend beyond the image boundary or have negative coordinates. These bounding boxes need to be eliminated.

After elimination, 8,940 anchors remain. Record the coordinate values of these bounding boxes to complete the initialization.
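
A minimal sketch of this initialization; the post only gives K = 9 and the 16-pixel stride, so the scales (128, 256, 512) and ratios (0.5, 1, 2) below are the usual Faster R-CNN defaults, assumed here:

```python
import numpy as np

def generate_anchors(img_size=800, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate K = len(scales) * len(ratios) = 9 anchors at every grid-cell center."""
    centers = np.arange(stride // 2, img_size, stride, dtype=np.float32)  # 50 positions per side
    anchors = []
    for cy in centers:
        for cx in centers:
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(anchors)  # (22500, 4) as (x1, y1, x2, y2)

anchors = generate_anchors()
print(anchors.shape)  # (22500, 4)

# Keep only anchors that lie completely inside the image.
inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
          (anchors[:, 2] <= 800) & (anchors[:, 3] <= 800))
valid_anchors = anchors[inside]  # 8940 valid anchors with these assumed scales/ratios
```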

Then calculate the IoU between all valid anchors and the ground-truth (GT) boxes, and sample positive and negative anchors in a fixed proportion.
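
A common vectorized IoU computation (a sketch, not code from the post):

```python
import numpy as np

def iou(anchors, gt_boxes):
    """anchors: (N, 4), gt_boxes: (M, 4), both (x1, y1, x2, y2); returns an (N, M) IoU matrix."""
    tl = np.maximum(anchors[:, None, :2], gt_boxes[None, :, :2])  # intersection top-left
    br = np.minimum(anchors[:, None, 2:], gt_boxes[None, :, 2:])  # intersection bottom-right
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area_a = np.prod(anchors[:, 2:] - anchors[:, :2], axis=1)
    area_g = np.prod(gt_boxes[:, 2:] - gt_boxes[:, :2], axis=1)
    return inter / (area_a[:, None] + area_g[None, :] - inter)
```

In the standard Faster R-CNN recipe, anchors with IoU >= 0.7 against some GT box (plus the highest-IoU anchor for each GT) are labeled positive, anchors with IoU < 0.3 negative, and a mini-batch of 256 anchors with up to a 1:1 positive:negative ratio is sampled.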

Step 2: Run VGG to obtain the feature map, then send the feature map into the RPN. The grid and anchor setup repeats Step 1: a 50*50 grid with K = 9 anchors per cell, i.e. 22,500 anchors per image.


For these 22,500 anchors, the RPN outputs 2 confidence values per anchor (positive sample = 1, negative sample = 0) and four coordinate offset values per anchor relative to the original image.
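
Continuing the RPNHead sketch from above, the per-location output maps reshape into exactly these per-anchor predictions:

```python
# scores (1, 2K, 50, 50) -> (22500, 2); deltas (1, 4K, 50, 50) -> (22500, 4)
scores = scores.permute(0, 2, 3, 1).reshape(-1, 2)
deltas = deltas.permute(0, 2, 3, 1).reshape(-1, 4)
print(scores.shape, deltas.shape)  # torch.Size([22500, 2]) torch.Size([22500, 4])
```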

Step 3: Calculate the RPN loss (between the labeled valid anchors and the predicted anchors: coordinates & confidence)

Using the target anchor labels from Step 1 and the pred_anchors from Step 2: the classification loss is computed over the sampled positive and negative anchors, while the box-regression loss is computed only over the positive anchors from Step 1 and their corresponding predictions from Step 2.
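
A minimal sketch of that loss, assuming labels of 1 (positive), 0 (negative) and -1 (ignored), following the standard Faster R-CNN recipe:

```python
import torch
import torch.nn.functional as F

def rpn_loss(scores, deltas, labels, target_deltas, lam=1.0):
    """scores: (N, 2), deltas: (N, 4), labels: (N,) long, target_deltas: (N, 4)."""
    sampled = labels >= 0
    # Classification loss over the sampled positive and negative anchors.
    cls_loss = F.cross_entropy(scores[sampled], labels[sampled])
    pos = labels == 1
    # Box regression is only supervised on the positive anchors.
    reg_loss = F.smooth_l1_loss(deltas[pos], target_deltas[pos])
    return cls_loss + lam * reg_loss
```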

Step 4: IoU & NMS

Based on the anchors and the predicted anchor coefficients, compute the proposal boxes (ROIs) with their coordinates and scores. The resulting ROI proposals are then further filtered and pruned with NMS (non-maximum suppression).
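
torchvision ships an NMS operator that can implement this filtering; a minimal sketch with illustrative inputs (the 0.7 IoU threshold is the common Faster R-CNN choice, not stated in the post):

```python
import torch
from torchvision.ops import nms

rois = torch.rand(22500, 4) * 800   # decoded proposal boxes (x1, y1, x2, y2), stand-in values
rois[:, 2:] += rois[:, :2]          # ensure x2 > x1 and y2 > y1
fg_scores = torch.rand(22500)       # foreground confidence per proposal, stand-in values

keep = nms(rois, fg_scores, iou_threshold=0.7)  # drop boxes overlapping a higher-scoring box
rois, fg_scores = rois[keep], fg_scores[keep]
# Typically only the top ~2000 (training) or ~300 (testing) proposals are kept afterwards.
```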

Step 5: ROI Pooling

1: Extract the corresponding predicted target from the feature map according to each ROI.

2: Pass each extracted target through adaptive_max_pool so it is output at a fixed size, which makes subsequent batch processing convenient (see the roi_pool sketch under the main steps above).

Step 6 & Step 7: Calculate the losses

For the predicted boxes, compute the classification loss on the class confidences and the regression loss on the translation and scaling coefficients that map the predicted boxes to the target boxes.
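
A minimal sketch of this second-stage head, assuming VGG's 512-channel 7*7 pooled ROI features (layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Second-stage head: class scores and per-class box deltas from pooled ROI features."""
    def __init__(self, num_classes=21, in_features=512 * 7 * 7):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, 4096), nn.ReLU(),
                                nn.Linear(4096, 4096), nn.ReLU())
        self.cls_prob = nn.Linear(4096, num_classes)      # Cls_Prob (softmax over classes)
        self.box_pred = nn.Linear(4096, 4 * num_classes)  # Box_Pred (deltas per class)

    def forward(self, roi_feats):                         # roi_feats: (N, 512, 7, 7)
        h = self.fc(roi_feats.flatten(1))
        return self.cls_prob(h), self.box_pred(h)

# The loss mirrors the RPN loss in Step 3: cross-entropy on the class scores plus
# smooth L1 on the box deltas of the non-background ROIs.
```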


Origin: blog.csdn.net/weixin_43852823/article/details/127629059