Object detection algorithm: Faster-RCNN paper interpretation

Foreword

There are already many good interpretations of this paper online, but I decided to write my own. My main purpose is to help myself sort out and deeply understand the paper: writing an article forces me to organise what I have learned and make sure it is clear and correct, which I think is good practice. Of course, if it also helps other readers, that makes me happy too.

Notes

If you spot a typo or a mistake, please point it out (gently); if you have a better take on anything, feel free to share it and I will study it.

Original paper address

The original paper is available at the link below:

https://arxiv.org/abs/1506.01497

1. Overview of the paper:

Detection algorithms at the time all relied on region proposal algorithms. Although advances such as SPP-net and Fast-RCNN shortened the running time of the detection network itself, they also exposed region proposal computation as the bottleneck.

Therefore, the authors introduce the RPN (Region Proposal Network) on top of Fast-RCNN. By adding a few convolutional layers to the original CNN backbone and sharing its weights, region proposals become nearly cost-free.

2. Introduction to the Faster-RCNN pipeline:

The pipeline, as drawn in the original figure from the paper, is as follows:

[Figure: Faster-RCNN architecture from the original paper]

If you only want to know the overall flow and are not interested in the internals, this figure is enough; the process is:

  • First, an image is fed into the network
  • The image is passed through the CNN backbone, which outputs a feature map
  • The RPN then generates region proposal boxes from the feature map
  • Each proposal box is mapped onto the feature map, and ROI Pooling produces a fixed-length output for it
  • Finally, the ROI Pooling output is used for classification and box regression

**However, the description above is very brief and leaves several open questions.** Therefore, others have summarised a detailed flow chart based on the official PyTorch code and the paper (from reference 1):

[Figure: detailed Faster-RCNN flow chart, based on the official PyTorch code]

A brief description of the figure above:

  • First, the image is scaled to a specified size and then fed into VGG16 with the fully connected layers removed, which outputs a feature map

Question: why scale the image to a specified size?

Since ROI Pooling already produces a fixed-size output regardless of its input, why scale the image at all? This was a limitation of the implementation tools of the time: convenient frameworks such as TensorFlow and PyTorch did not exist yet, so for ease of processing the images were scaled to the same size.

  • Then, the feature map goes down two paths: one is sent to the RPN, the other to ROI Pooling

    • RPN part: first, anchors are generated by traversing the feature map; then the feature map passes through a 3*3 convolutional layer, followed by regression (to refine the anchors) and classification (to decide whether each anchor contains an object). In the RPN structure in the figure above, the upper path is classification and the lower path is regression
  • Then, the proposal boxes produced by the RPN (i.e. the refined anchors) are mapped onto the feature map; the region of the feature map corresponding to each proposal box (which varies in size) is extracted and passed through ROI Pooling, which outputs a feature vector of fixed length for each proposal

  • Finally, the ROI Pooling feature vector passes through two fully connected layers, followed by classification (softmax) and box regression
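
To make the flow above concrete, here is a minimal, purely schematic Python sketch of the forward pass. `backbone`, `rpn`, `roi_pool` and `head` are hypothetical stand-ins for VGG16 (fully connected layers removed), the RPN, ROI Pooling and the final fully connected layers; this is not the official implementation.

```python
def faster_rcnn_forward(image, backbone, rpn, roi_pool, head):
    # 1. CNN backbone: image -> feature map, e.g. [1, 512, H/16, W/16] for VGG16
    feature_map = backbone(image)

    # 2. RPN: a 3x3 conv slides over the feature map, then anchors are
    #    classified and regressed to produce region proposals in image coords
    proposals = rpn(feature_map)                      # [N, 4] boxes

    # 3. ROI Pooling: crop each proposal's region of the feature map and pool
    #    it to a fixed size (e.g. 7x7), regardless of the proposal's size
    roi_features = roi_pool(feature_map, proposals)   # [N, 512, 7, 7]

    # 4. Two fully connected layers, then classification and box regression
    class_scores, box_deltas = head(roi_features.flatten(1))
    return proposals, class_scores, box_deltas
```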

Next, let's interpret the key points above.

3. Anchor/Anchor boxes:

The RPN is the most important structure in Faster-RCNN. We first need to understand its purpose: to generate good region proposal boxes.

Now look at the original figure from the paper:

[Figure: RPN sliding window and anchor boxes, from the original paper]

First, a brief description of the figure, with the details explained afterwards. The conv feature map in the figure is the final feature map output by the CNN backbone. The red box is the so-called sliding window: a window slides over the feature map, takes an n*n patch of the feature map as input (3*3 in the figure), and its output (256-d in the figure) is sent to classification and regression.

In fact, it is not hard to see that this so-called sliding window is simply a 3*3 convolutional layer, nothing magical. It corresponds to the 3*3 convolutional layer in the Faster-RCNN flow chart above.

Explanation: The meaning of the numbers above

256-d means 256 dimensions. In the Faster-RCNN paper, when the CNN backbone is the ZF model, the feature map output by its last convolutional layer has 256 channels. Similarly, with VGG16 the value here would be 512-d.

k anchor boxes: the anchor boxes are our region proposal boxes, but at this stage they are still very rough, because they have not yet been refined by classification and regression. Here k denotes the number of anchors per location.

2k scores and 4k coordinates: one is the classification score (a probability), the other the coordinate values. Each box has 2 scores (probability of being an object + probability of not being an object), so k boxes give 2k values. Similarly, each box has 4 coordinate values (x, y, w, h), so k boxes give 4k values.

Explanation: How anchor boxes are generated

First of all, an anchor generally refers to the center point of a box, while an anchor box refers to the box itself.

From the figure, it is not hard to see that the centers of the k anchor boxes coincide (blue dotted lines) with the center of the sliding window. **This means that each position of the sliding window generates k anchor boxes. Since the sliding window is really a 3*3 convolutional layer whose input is the feature map, every point of the feature map generates k anchor boxes.** In the paper, k = 9.

That answers the first question: how anchor boxes are generated. The next question is: each point generates k anchor boxes, but what do those boxes look like? This is a design choice: in the paper, the author uses three scales and three aspect ratios, which produce 3*3 = 9 boxes, as shown in the figure below (drawn by myself). (In code, a base anchor is usually given and the others are generated from it according to the scales and ratios; see the sketch after the figure.)

[Figure: the 9 anchor boxes produced by 3 scales × 3 aspect ratios (drawn by the author)]
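
As a sketch of how the 9 base anchors can be generated in code, assuming the paper's three scales (boxes of roughly 128, 256 and 512 pixels on a side) and three aspect ratios (1:2, 1:1, 2:1); real implementations differ in details such as rounding and the choice of reference anchor:

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the 9 base anchors (x1, y1, x2, y2) centred on (0, 0).

    Each anchor keeps an area of roughly scale*scale while its
    height/width follow the given aspect ratio (h/w)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)   # width shrinks as h/w grows
            h = scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)             # shape (9, 4)

print(base_anchors().astype(int))
```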

Explanation: The number of Anchor boxes on a feature map

Assume the original image is 400*400 and the backbone is VGG16. After the image passes through VGG16, the feature map is 16 times smaller than the input (4 pooling layers, each halving the spatial size).

Then the total number of anchor boxes is:

(400/16)*(400/16)*9 = 5625
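
Continuing the sketch above (and reusing the hypothetical `base_anchors` helper), shifting the 9 base anchors to every feature-map cell with a stride of 16 reproduces this count:

```python
import numpy as np

def all_anchors(feat_h, feat_w, stride=16):
    """Shift the 9 base anchors to every feature-map cell (image coordinates)."""
    base = base_anchors()                           # (9, 4), from the sketch above
    xs = (np.arange(feat_w) + 0.5) * stride         # anchor centres in the image
    ys = (np.arange(feat_h) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(xs, ys)
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1)  # (H, W, 4)
    return (shifts[:, :, None, :] + base).reshape(-1, 4)              # (H*W*9, 4)

print(all_anchors(400 // 16, 400 // 16).shape)      # (5625, 4)
```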

Here comes a problem: more than 5,000 proposal boxes are far too many. The author therefore samples 256 suitable boxes, with a 1:1 ratio of positive to negative examples.

How do we judge whether a box is suitable? According to the following rules:

  • Anchor boxes with an IoU > 0.7 against a ground-truth box are positive examples
  • For each ground-truth box, the anchor box with the highest IoU is a positive example
  • Anchor boxes with an IoU < 0.3 are negative examples
  • Anchor boxes with an IoU between 0.3 and 0.7 are ignored
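
A hedged sketch of this labelling rule, assuming an IoU matrix of shape (num_anchors, num_gt_boxes) has already been computed; the variable names are illustrative, not from the official code:

```python
import numpy as np

def label_anchors(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored.

    `iou` has shape (num_anchors, num_gt_boxes)."""
    labels = -np.ones(iou.shape[0], dtype=np.int64)   # ignore by default
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0                  # IoU < 0.3 -> negative
    labels[max_iou > pos_thresh] = 1                  # IoU > 0.7 -> positive
    labels[iou.argmax(axis=0)] = 1                    # best anchor per gt box -> positive
    return labels
```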

Explanation: NMS and out-of-bounds removal

After obtaining the proposal boxes, NMS (non-maximum suppression; if you are unfamiliar with it, see my RCNN paper interpretation) and out-of-bounds removal (dropping proposal boxes that extend beyond the original image) are applied.
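
For reference, a minimal (unoptimised) NMS sketch over boxes and their objectness scores; production code would normally use a library implementation such as torchvision's:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]                 # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]       # drop boxes that overlap too much
    return keep
```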

Explanation: How are 18 and 36 in the figure below obtained?

[Figure: RPN head with an 18-channel classification branch and a 36-channel regression branch]

First of all, both 18 and 36 are channel counts. 36 = 4*9: 9 boxes, each with 4 coordinate values, so the corresponding tensor shape is:

[batch_size, 4*9, H, W]

18 = 2*9: 9 boxes, each with 2 class scores (positive / negative example), so the corresponding tensor shape is:

[batch_size, 2*9, H, W]
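
A minimal PyTorch-style sketch of an RPN head that produces tensors of these shapes, assuming a VGG16 feature map with 512 channels and k = 9 anchors per location:

```python
import torch
import torch.nn as nn

k = 9  # anchors per feature-map location

rpn_conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)   # the 3x3 "sliding window"
cls_head = nn.Conv2d(512, 2 * k, kernel_size=1)            # 18 channels: object / not object
reg_head = nn.Conv2d(512, 4 * k, kernel_size=1)            # 36 channels: 4 offsets per anchor

feat = torch.randn(1, 512, 25, 25)                         # e.g. a 400x400 image / stride 16
x = torch.relu(rpn_conv(feat))
print(cls_head(x).shape)   # torch.Size([1, 18, 25, 25])
print(reg_head(x).shape)   # torch.Size([1, 36, 25, 25])
```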

Question: are the anchor boxes generated on the original image or on the feature map?

In fact, from my personal point of view, there is not much difference between generating the anchor boxes on the original image and on the feature map. If they are generated on the original image, you can map them onto the feature map (for VGG16) by dividing by 16, and vice versa.

However, in practice the anchor boxes are expressed in original-image coordinates. Why? Because we need to compute the IoU between the anchor boxes and the ground-truth boxes, which to some extent implies working in image coordinates (unless you want to add an extra conversion step and waste resources).
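
For illustration, mapping a box from original-image coordinates to VGG16 feature-map coordinates is just a division by the stride of 16 (the numbers below are made up):

```python
# Mapping a box from image coordinates to feature-map coordinates (stride 16).
stride = 16
box_img = [64, 32, 320, 256]                       # hypothetical (x1, y1, x2, y2) in the image
box_feat = [round(c / stride) for c in box_img]    # -> [4, 2, 20, 16] on the feature map
print(box_feat)
```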

How is the regression in the RPN implemented?

This is the usual bounding-box regression problem, almost the same as the box regression in RCNN, and there is nothing special about it. If you are interested, see the bounding-box regression part of my RCNN paper interpretation, or copy the link below:

https://blog.csdn.net/weixin_46676835/article/details/129929232
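
For completeness, here is a sketch of the standard RCNN-style box parameterisation (the t and t* values used in the loss below), computed relative to an anchor box:

```python
import numpy as np

def regression_targets(anchors, gt_boxes):
    """Encode ground-truth boxes relative to anchors as (tx, ty, tw, th).

    Both inputs are (N, 4) arrays of (x1, y1, x2, y2) boxes."""
    def to_cxcywh(b):
        w, h = b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]
        return b[:, 0] + w / 2, b[:, 1] + h / 2, w, h

    ax, ay, aw, ah = to_cxcywh(anchors)
    gx, gy, gw, gh = to_cxcywh(gt_boxes)
    tx, ty = (gx - ax) / aw, (gy - ay) / ah      # centre offsets, scaled by anchor size
    tw, th = np.log(gw / aw), np.log(gh / ah)    # log-scale width/height ratios
    return np.stack([tx, ty, tw, th], axis=1)
```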

4. Loss function:

The loss function of Faster-RCNN is fairly standard and does not change much from Fast-RCNN.

The formula is as follows:

L({pi}, {ti}) = (1/Ncls) Σi Lcls(pi, pi*) + λ (1/Nreg) Σi pi* Lreg(ti, ti*)

where:

  • i denotes the i-th anchor box in a mini-batch
  • pi is the predicted probability that anchor i contains an object (for the RPN this is a binary object / not-object score)
  • pi* ∈ {0, 1} is determined by whether anchor i is a positive example: 1 if positive, 0 otherwise. This means the regression term is counted only when pi* = 1, i.e. only anchors that actually contain an object contribute to the regression loss; for the others it would be meaningless
  • ti is the vector of 4 parameterised coordinates predicted for the i-th anchor, and ti* is that of the corresponding ground-truth box
  • In the paper, Ncls is the mini-batch size (256) and Nreg is the number of anchor locations on the feature map (for a 600*1000 input with VGG16, roughly 2400); λ balances the two terms, and since 2400/256 ≈ 10, λ is set to 10
  • In the code, both Ncls and Nreg are simply set to the mini-batch size, so λ can be taken as 1, which is simpler

Lcls is a commonly used classification loss function, namely the cross entropy loss function:

Lcls(pi, pi*) = −[pi* log(pi) + (1 − pi*) log(1 − pi)]

Lreg is the smooth L1 loss commonly used in object detection, namely:

Lreg(ti, ti*) = Σ smoothL1(ti − ti*), where smoothL1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise
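
As a rough PyTorch sketch of how the two terms can be combined, assuming `cls_logits` of shape (N, 2), integer `labels` in {-1, 0, 1} (with -1 meaning "ignored"), and (N, 4) box tensors; following the code convention mentioned above, both terms are normalised by the sampled mini-batch size so that λ ≈ 1:

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, labels, box_preds, box_targets, lam=1.0):
    """Simplified RPN loss: cross-entropy + smooth L1 on positive anchors only."""
    keep = labels >= 0                                   # drop the "ignored" anchors
    cls_loss = F.cross_entropy(cls_logits[keep], labels[keep])

    pos = labels == 1                                    # pi* = 1: only positives regress
    reg_loss = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum")
    reg_loss = reg_loss / keep.sum().clamp(min=1)        # normalise by the sampled batch

    return cls_loss + lam * reg_loss
```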

5. Faster-RCNN training:

In hindsight, the training procedure of Faster-RCNN is just as important as its anchors.

The author discusses three ways to train Faster-RCNN: 4-step alternating training, approximate joint training, and non-approximate joint training.

Briefly, the three methods are:

  • Alternating training: the method adopted in the paper, described in detail below
  • Approximate joint training: RPN and Fast-RCNN are merged into one network during training. In the forward pass, the RPN generates region proposals, which are then fed to Fast-RCNN. In the backward pass, backpropagation proceeds as usual, except that for the shared convolutional layers (i.e. the CNN backbone) the RPN loss and the Fast-RCNN loss are combined.
    • This method is easy to implement, but it ignores the derivative with respect to the proposal box coordinates, i.e. the gradient through the box coordinates is not propagated (hence the name "approximate").
  • Non-approximate joint training: much more complicated; not considered here

The author ultimately chose the alternating training method for Faster-RCNN; the specific procedure is:

  • First, the CNN backbone (here VGG) is pre-trained on ImageNet
  • Then, the RPN layers are added on top of VGG (the weight/convolution sharing idea) and the RPN is trained, producing a set of region proposal boxes
  • Faster-RCNN is trained for the first time with these proposals (excluding the RPN part, so more precisely Fast-RCNN is trained)
  • The RPN is trained again to obtain new region proposal boxes
  • Faster-RCNN is trained a second time (again excluding the RPN part, i.e. Fast-RCNN)
  • ... (iterate until convergence; the author found about two rounds are enough)

6. Summary:

Faster-RCNN is a landmark two-stage detector, absorbing the essence of its predecessors. Its main contributions are:

  • It proposes the RPN structure, cleverly using convolution sharing to reduce computation
  • It proposes the idea of anchors
  • It introduces the alternating training scheme
