Understanding the principles of Faster R-CNN

Preface

Originally I wanted to learn how to make full use of sample information in object detection (object detection requires not only the category of what is in the picture but also its physical location; in other words, you need to know not only what the object is, but also where it is), and Faster R-CNN stumped me, mainly because I could not answer these questions:

  1. How does the model use location information?
  2. What does the model actually learn? What advantage does the trained model have over the untrained one? (After all, the anchors are generated by exhaustive traversal and are exactly the same in the training and inference phases.)
  3. Is this algorithm just performing a kind of intelligent template matching?
  4. The anchors used by the RPN stay constant throughout training; can it still detect targets at arbitrary positions?

2 Network structure

2.1 The RPN network

Intuitively, an object detection network should split into two parts: find the target, then identify the target. Hence the RPN network. The purpose of the RPN is to find the target (find the regions of interest / candidate target areas). So how does the RPN work? Two sub-questions: (1) why can't it find the target regions before training? (2) how does it know how many regions (how many targets) to look for?

  1. Generate anchors (in plain words: generate many boxes, then check which of those boxes land close to objects)
    First, the RPN generates a large number of boxes. Don't worry too much about them; they simply cover the whole image evenly. (A minimal sketch of this anchor generation follows this list.)
  2. Generate the important RPN parameter rpn_locs (in plain words: fine-tune the boxes)
    Because these boxes may not frame an object exactly (perhaps only half of it is inside), we generate a fine-tuning parameter rpn_locs for each box. Its dimension is: number of anchors * 4 (or "number of boxes" * 4). We use this parameter twice in the training phase: first to fine-tune our boxes (anchors) so that they move closer to the target, and second in the loss function, since this parameter has to be updated. (How it is updated still needs to be analysed; remember to write it up below.)
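Here is a minimal sketch of how such an anchor grid can be generated. The stride, scales and aspect ratios below are the common VGG-16 style defaults and are only illustrative; they are not taken from the blog's code.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (y1, x1, y2, x2)."""
    base = []
    for r in ratios:
        for s in scales:
            h = stride * s * np.sqrt(r)      # box height for this scale/ratio
            w = stride * s / np.sqrt(r)      # box width for this scale/ratio
            base.append([-h / 2, -w / 2, h / 2, w / 2])
    base = np.array(base)                    # 9 base anchors centred at (0, 0)

    # Shift the 9 base anchors to every position of the feature map.
    shift_y = np.arange(feat_h) * stride
    shift_x = np.arange(feat_w) * stride
    sy, sx = np.meshgrid(shift_y, shift_x, indexing="ij")
    shifts = np.stack([sy.ravel(), sx.ravel(),
                       sy.ravel(), sx.ravel()], axis=1)

    return (shifts[:, None, :] + base[None, :, :]).reshape(-1, 4)

anchors = generate_anchors(62, 19)           # the 62*19 feature map mentioned below
print(anchors.shape)                         # (10602, 4) = 62*19*9 boxes
```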

But there seems to be a problem here. In the testing phase, suppose we generate around 20,000 candidate boxes (the boxes are identical for every input) and then use the rpn_locs parameter to fine-tune them one-to-one, obtaining the fine-tuned "box 2". Aren't these 20,000 "box 2"s also fixed? Yet the positions of objects vary continuously and are infinite, so it seems we might never frame them exactly.
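The way out of this worry is that rpn_locs is predicted from the image features, so even though the anchors are a fixed grid, the offsets are continuous real numbers that change with every input, and the decoded boxes can land anywhere. A minimal sketch of the standard (dy, dx, dh, dw) decoding, with my own variable names rather than the blog's code:

```python
import numpy as np

def loc2bbox(anchor, loc):
    """anchor: (N, 4) boxes (y1, x1, y2, x2); loc: (N, 4) offsets (dy, dx, dh, dw)."""
    h = anchor[:, 2] - anchor[:, 0]
    w = anchor[:, 3] - anchor[:, 1]
    cy = anchor[:, 0] + 0.5 * h
    cx = anchor[:, 1] + 0.5 * w

    # The offsets are real-valued, so the decoded centre and size are continuous
    # even though the anchors themselves sit on a fixed grid.
    new_cy = loc[:, 0] * h + cy
    new_cx = loc[:, 1] * w + cx
    new_h = np.exp(loc[:, 2]) * h
    new_w = np.exp(loc[:, 3]) * w

    return np.stack([new_cy - 0.5 * new_h, new_cx - 0.5 * new_w,
                     new_cy + 0.5 * new_h, new_cx + 0.5 * new_w], axis=1)
```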

So how does the RPN network operate concretely?

a. Output the parameter rpn_locs (in plain words: get the fine-tuning parameters of each box; we will use this parameter twice in total)

  • Perform some convolution on the feature map fed into the RPN network. It does not matter exactly which convolution, as long as the output rpn_locs is a feature map of dimension (w, h, 9, 4). This (w, h, 9, 4) is the rpn_locs mentioned above (number of anchors * 4). A minimal sketch of such a head follows.
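A sketch assuming a VGG-16 style backbone with 512-channel features; the layer sizes and class name are illustrative, not the blog author's code.

```python
import torch
import torch.nn as nn

class RPNLocHead(nn.Module):
    """Predicts one (dy, dx, dh, dw) offset per anchor at every feature-map cell."""
    def __init__(self, in_channels=512, mid_channels=512, n_anchor=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)
        self.loc = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0)

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        rpn_locs = self.loc(h)                        # (N, 9*4, H, W)
        n = rpn_locs.shape[0]
        # Flatten to one row of offsets per anchor.
        return rpn_locs.permute(0, 2, 3, 1).reshape(n, -1, 4)

feat = torch.randn(1, 512, 62, 19)                    # the 62*19 feature map mentioned below
print(RPNLocHead()(feat).shape)                        # torch.Size([1, 10602, 4])
```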

b. Output the parameter rpn_scores (the objectness output for each box); we will also use this parameter twice: to produce rpn_fg_scores, and to compute the loss function.

  • The meaning of rpn_scores is: the positive/negative-sample probability of each box (or: the output value for every anchor at every feature-map position). Since my feature map at this point is a 62*19 map, the dimension of rpn_scores is (1, 18, 62, 19); 9 is the number of anchors per position, and multiplying by two means we need both the positive and the negative probability (so the code also applies a softmax over that dimension).
  • From it we then obtain the probability that each box is a positive sample: rpn_fg_scores, whose dimension is the number of anchors (i.e. the number of boxes).
  • So how is rpn_scores computed? In the model, the network output is simply declared to be a tensor of this dimension (1, 18, 62, 19), which is then interpreted as the positive/negative-sample probability of each box; the code never explains why this is allowed. (A sketch of the reshape-and-softmax step follows this list.)
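A minimal sketch of how rpn_fg_scores can be read out of rpn_scores, assuming the channel layout (9 anchors * 2 classes) described above; the exact reshaping order in a given implementation may differ.

```python
import torch

rpn_scores = torch.randn(1, 18, 62, 19)      # 18 = 9 anchors * 2 (background, foreground)
n, _, hh, ww = rpn_scores.shape

# Rearrange so the last dimension holds the (background, foreground) pair per anchor.
scores = rpn_scores.permute(0, 2, 3, 1).reshape(n, hh, ww, 9, 2)
probs = torch.softmax(scores, dim=-1)

# Probability that each of the H*W*9 anchors contains an object.
rpn_fg_scores = probs[..., 1].reshape(n, -1)
print(rpn_fg_scores.shape)                    # torch.Size([1, 10602])
```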

c. The ProposalCreator module (using the fine-tuning parameter rpn_locs, the object probability rpn_fg_scores, and the box positions (fixed values), lock onto the target boxes: roi)

  • Step 1: get the regions of interest. From the box positions and the fine-tuning parameter rpn_locs it is easy to determine the regions of interest roi_1 (in plain words: lock onto the target).
  • Step 2: prevent roi_1 from going out of bounds or becoming too small. (Some rois are deleted at this point.)
  • Step 3: according to the object probability rpn_fg_scores, keep the top-ranked rois (roughly the top 200). A sketch of these three steps follows this list.
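A rough sketch of the three steps above, with the decode step written inline; the minimum size, the top-N value and the omitted NMS step are simplifications of mine, not necessarily what the original pipeline does.

```python
import numpy as np

def create_proposals(anchors, rpn_locs, rpn_fg_scores, img_size,
                     min_size=16, top_n=200):
    # Step 1: decode -- shift and scale each fixed anchor by its predicted offsets.
    h = anchors[:, 2] - anchors[:, 0]
    w = anchors[:, 3] - anchors[:, 1]
    cy = anchors[:, 0] + 0.5 * h + rpn_locs[:, 0] * h
    cx = anchors[:, 1] + 0.5 * w + rpn_locs[:, 1] * w
    nh = np.exp(rpn_locs[:, 2]) * h
    nw = np.exp(rpn_locs[:, 3]) * w
    roi = np.stack([cy - nh / 2, cx - nw / 2, cy + nh / 2, cx + nw / 2], axis=1)

    # Step 2: clip to the image and drop boxes that became too small.
    roi[:, 0::2] = np.clip(roi[:, 0::2], 0, img_size[0])
    roi[:, 1::2] = np.clip(roi[:, 1::2], 0, img_size[1])
    keep = ((roi[:, 2] - roi[:, 0]) >= min_size) & ((roi[:, 3] - roi[:, 1]) >= min_size)
    roi, scores = roi[keep], rpn_fg_scores[keep]

    # Step 3: keep only the boxes most likely to contain an object.
    order = scores.argsort()[::-1][:top_n]
    return roi[order]
```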

At this point the RPN model has done its part. Outputs: rpn_locs (for the loss), rpn_scores (for the loss), and roi (passed on to the next stage to be compared with the labels).

2.2 The proposal_target_creator network (in plain words: using the data samples)

Up to now we have not used any of the data labels! They start being used here. Why not use the label data from the very beginning? Because the label data (the known target boxes, the ground truth) is "small" while the image data is "large"; matching small data against big data directly is what the traditional template-matching approach does. By going from the original image -> anchor boxes -> boxes that have been fine-tuned, filtered and corrected and that very likely frame an object, we have first made the "picture" smaller.

The proposal_target_creator network, in one sentence: assign ground-truth bounding boxes to the given RoIs.
Input:

  • roi (regions of interest) from the RPN
  • label (the class of each ground-truth object)
  • the ground-truth bounding boxes

Output:

  • Result 1: sample_roi (from the input rois, select several regions that coincide with the labelled boxes)
  • Result 2: gt_roi_loc (the distance / fine-tuning parameters between sample_roi and its target box, used to compute the loss; gt_roi_loc is later put into the loss function together with roi_loc)
    a. Compute the IoU between every roi and every labelled box, giving an IoU matrix of dimension (number of rois * number of labels). It measures how similar each box is to each target box.
    b. Assign a ground truth to each box.
    c. According to each roi's IoU, select roughly 200 positive samples out of the 2000 rois (boxes that overlap strongly with a labelled box) and roughly 200 negative samples (boxes that are pure background) as Result 1. A sketch of these steps follows.
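A rough sketch of these steps, assuming rois and ground-truth boxes are (y1, x1, y2, x2) arrays; the IoU thresholds and the label handling are illustrative choices of mine, not the blog's exact settings.

```python
import numpy as np

def bbox_iou(a, b):
    """IoU matrix of shape (len(a), len(b)); boxes are (y1, x1, y2, x2)."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area_a = np.prod(a[:, 2:] - a[:, :2], axis=1)
    area_b = np.prod(b[:, 2:] - b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def proposal_target_creator(roi, gt_bbox, gt_label,
                            n_pos=200, n_neg=200,
                            pos_thresh=0.5, neg_thresh=0.5):
    iou = bbox_iou(roi, gt_bbox)              # a. (n_roi, n_gt) similarity matrix
    gt_assignment = iou.argmax(axis=1)        # b. best-matching ground truth per roi
    max_iou = iou.max(axis=1)

    # c. sample positives (large overlap) and negatives (background).
    pos_idx = np.where(max_iou >= pos_thresh)[0]
    neg_idx = np.where(max_iou < neg_thresh)[0]
    pos_idx = np.random.choice(pos_idx, min(n_pos, len(pos_idx)), replace=False)
    neg_idx = np.random.choice(neg_idx, min(n_neg, len(neg_idx)), replace=False)

    keep = np.concatenate([pos_idx, neg_idx])
    sample_roi = roi[keep]                                    # Result 1
    sample_label = gt_label[gt_assignment[keep]].copy()
    sample_label[len(pos_idx):] = 0                           # negatives become background

    # Result 2: gt_roi_loc, the (dy, dx, dh, dw) offsets from each sampled roi
    # to its assigned ground-truth box (same parameterisation as rpn_locs).
    assigned = gt_bbox[gt_assignment[keep]]
    h = sample_roi[:, 2] - sample_roi[:, 0]
    w = sample_roi[:, 3] - sample_roi[:, 1]
    cy = sample_roi[:, 0] + 0.5 * h
    cx = sample_roi[:, 1] + 0.5 * w
    gh = assigned[:, 2] - assigned[:, 0]
    gw = assigned[:, 3] - assigned[:, 1]
    gcy = assigned[:, 0] + 0.5 * gh
    gcx = assigned[:, 1] + 0.5 * gw
    gt_roi_loc = np.stack([(gcy - cy) / h, (gcx - cx) / w,
                           np.log(gh / h), np.log(gw / w)], axis=1)
    return sample_roi, gt_roi_loc, sample_label
```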

Source: blog.csdn.net/qq_43110298/article/details/110237792