Object Detection Algorithms: An Interpretation of the Fast R-CNN Paper

Foreword

In fact, there are already many good articles online interpreting these papers, but I decided to write my own. My main purpose is to help myself sort out and deeply understand the paper: when writing an article, you must make sure that what you have learned comes out clear and correct, which I think is good exercise. Of course, if it also helps other readers, so much the better.

Note

If you find a typo or an error, please point it out (gently); if you have a better take, feel free to share it, and I will study it.

Original paper address

Click here, or copy the link:

https://arxiv.org/abs/1504.08083

Directory Structure

1. Overview of the paper:

Inspired by SPP-Net, the author proposed Fast R-CNN in 2015, which greatly improves running speed while maintaining mAP.

2. SPP-Net:

I didn't write a separate blog about SPP-Net, mainly because there isn't much to cover: the most important idea in SPP-Net is spatial pyramid pooling.

2.1 Source of ideas:

We know that an important shortcoming of R-CNN is that, for one image, CNN processing must be performed on about 2000 region proposal boxes, which is undoubtedly time-consuming. In addition, the image patch enclosed by each region proposal box must be warped to a fixed size, which undoubtedly hurts accuracy (as shown in the figure below, taken from the SPP-Net paper).

(Figure: cropping/warping of proposals in R-CNN versus SPP-Net, from the SPP-Net paper)

Therefore, researchers looked for a way to avoid both of these problems.

2.2 Measure 1: ROI Mapping

To solve the first problem (all 2000 proposal boxes being processed by the CNN), the author adopts ROI mapping.

  • First, 2000 proposal boxes are generated for the entire image.
  • Second, the entire image is passed as input to the CNN architecture.
  • The last convolutional layer of the CNN outputs a feature map.
  • Finally, the 2000 proposal boxes are mapped onto the feature map, so the CNN does not need to be re-run for each proposal.

Question: How does the original proposal box map to the correct position on the feature map?

For example, assume that all convolutional layers preserve spatial size and only change the number of channels, while every pooling layer halves the size. Then, if the original image is 256*256 and passes through 4 pooling layers, the output feature map is 16*16.

Assuming a proposal box on the original image has top-left corner (Xmin, Ymin) and bottom-right corner (Xmax, Ymax), the coordinates mapped onto the feature map are:

Top-left corner:
	x = Xmin / 16
	y = Ymin / 16
Bottom-right corner:
	x = Xmax / 16
	y = Ymax / 16

2.3 Measure 2: Spatial Pyramid Pooling

To solve the second problem, the requirement of a fixed input image size, the author uses spatial pyramid pooling. This operation generates a fixed-size output for any input size.

The figure from the SPP-Net paper is shown below:

(Figure: three-level spatial pyramid pooling, from the SPP-Net paper)

In the figure above, this is a three-level spatial pyramid pooling, where 256-d is the channel dimension of the output of the CNN's final convolutional layer. It is called three-level because it has three pooling grids: 16*256-d, 4*256-d, and 256-d, from left to right in the figure. The final output concatenates the three, so its length is fixed.

How do we understand the pooling operation? It is actually simple. Assume our pyramid has three levels with grid sizes 4*4, 2*2, and 1*1 as in the figure (this is what fixes the output length), and assume the feature map size is 16*16.

Take the 4*4 level as an example: it has 16 cells in total, and each cell takes the maximum value over a 4*4 region of the feature map. For the 2*2 level, there are 4 cells in total, and each cell takes the maximum over an 8*8 region.

This guarantees that no matter what size feature map is used as input, the SPP output is the same: a 4*4 grid, a 2*2 grid, and a 1*1 grid, concatenated.
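The three-level pooling described above can be sketched as follows (single channel for simplicity; a real SPP layer applies this per channel, so 256 channels give a 21*256-d vector; the cell-boundary arithmetic is one simple choice, not exactly the paper's):

```python
import numpy as np

# Sketch of three-level spatial pyramid pooling (grids 4x4, 2x2, 1x1) over a
# single-channel feature map: max-pool each grid cell and concatenate.

def spp(feature_map, levels=(4, 2, 1)):
    h, w = feature_map.shape
    out = []
    for n in levels:                       # n x n grid at this level
        for i in range(n):
            for j in range(n):
                # cell boundaries chosen so the cells tile the whole map
                y0, y1 = (i * h) // n, ((i + 1) * h + n - 1) // n
                x0, x1 = (j * w) // n, ((j + 1) * w + n - 1) // n
                out.append(feature_map[y0:y1, x0:x1].max())
    return np.array(out)

# Any input size yields the same output length: 16 + 4 + 1 = 21
print(spp(np.random.rand(16, 16)).shape)   # (21,)
print(spp(np.random.rand(13, 9)).shape)    # (21,)
```

This is exactly why the fixed-size input constraint disappears: the output length depends only on the grid sizes, never on the feature-map size.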

3. Fast R-CNN process:

The original figure from the paper is as follows:

(Figure: Fast R-CNN architecture, from the paper)

You can also look at a diagram from the Internet, from Reference 2:

(Figure: Fast R-CNN pipeline, from Reference 2)

​ Briefly talk about the process:

  • First, generate 2000 candidate boxes for the input image
  • Then, pass the entire image to the CNN architecture as input; the last convolutional layer outputs a feature map
  • Next, map the candidate boxes onto the feature map and send the features enclosed by each box to ROI Pooling, which produces a fixed-length output
  • Finally, pass the ROI Pooling output through two fully connected layers, then perform classification and regression respectively

4. ROI Pooling:

The ROI Pooling operation is actually a special case of spatial pyramid pooling: there is no multi-level pyramid here, only a single level.

The principle is as follows (if the SPP discussion above was unclear, you can refer back to it here): divide the h*w ROI into an H*W grid, where each grid cell has size (h/H, w/W), and the value of each cell is the maximum over the corresponding region of the ROI (a max-pooling operation). Here H*W is the fixed output size, playing the role of one pyramid level in SPP, and h*w is the size of the ROI on the feature map.

​ For example, see the figure below:

(Figure: ROI Pooling example)
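A minimal single-level sketch of ROI Pooling (single channel, max over each grid cell; the cell-boundary arithmetic is a simple illustrative choice, not exactly the paper's):

```python
import numpy as np

# Minimal ROI Pooling sketch: max-pool an h x w ROI of the feature map into
# a fixed H x W grid (a single pyramid level).

def roi_pool(feature_map, roi, H=2, W=2):
    """roi = (x0, y0, x1, y1) in feature-map coordinates, end-exclusive."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # each cell covers roughly h/H x w/W entries, never empty
            ys, ye = (i * h) // H, max((i * h) // H + 1, ((i + 1) * h) // H)
            xs, xe = (j * w) // W, max((j * w) // W + 1, ((j + 1) * w) // W)
            out[i, j] = region[ys:ye, xs:xe].max()
    return out

fm = np.arange(64, dtype=float).reshape(8, 8)
print(roi_pool(fm, (0, 0, 5, 5)))  # [[9, 12], [33, 36]]
```

Whatever the ROI's size, the output is always H*W values, which is what lets ROIs of different shapes feed the same fully connected layers.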

5. Joint optimization (loss function):

One change in Fast R-CNN is that classification and regression are optimized jointly, unlike R-CNN, which must additionally train SVM classifiers and bounding-box regressors separately.

From the Fast R-CNN flow chart, we can see there are two sibling outputs at the end: a classification output and a regression output. The former outputs a discrete probability distribution $p = (p_0, p_1, \dots, p_K)$, and the latter outputs bounding-box offsets $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, where $K$ is the total number of categories (not including the background) and $k$ indexes a category.

For each training ROI, there is a ground-truth class u and a ground-truth bounding box v. The joint loss can then be expressed as:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^u, v)$$

Here $L_{cls}$ is the classification loss, $L_{loc}$ is the localization (regression) loss, and $\lambda$ is a weighting parameter, generally set to 1.

For u, the background class is labeled 0 by convention, and the indicator $[u \ge 1]$ equals 1 when $u \ge 1$ and 0 otherwise. This means negative samples (ROIs that contain no object at all) do not participate in the regression loss.

The classification loss is still the commonly used log loss, where $p_u$ is the predicted probability of the true class u:

$$L_{cls}(p, u) = -\log p_u$$

The regression loss uses a smoothed L1 loss (plain L1 loss is MAE, i.e. the absolute-value difference you are thinking of):

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t^u_i - v_i)$$

The smoothed L1 expression is:

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

Why not use L2 loss, or a smoothed L2 loss?

Because L2 loss is sensitive to outliers. For example, for a residual of 5 (typical residuals are around 0.8 to 1, so 5 represents an outlier), the L1 loss is 5 while the L2 loss is 25.

Therefore, L2 loss is not used.
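The smooth L1 definition and the outlier comparison above can be checked numerically:

```python
import numpy as np

# Smooth L1 as defined above: quadratic near zero, linear once |x| >= 1,
# so outliers contribute linearly rather than quadratically.

def smooth_l1(x):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

print(smooth_l1([0.5, 5.0]))   # 0.125 for the typical residual, 4.5 for the outlier
print(0.5 * 5.0 ** 2)          # an L2-style penalty on the same outlier: 12.5
```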

6. Truncated singular value decomposition:

The author found that since Fast R-CNN performs fully connected computations for every ROI, the fully connected layers take a large share of the running time (another factor is that both regression and classification are handled by fully connected layers), as shown on the left of the figure below. He therefore uses SVD to accelerate the fully connected layers, as shown on the right of the figure:

(Figure: timing breakdown before and after truncated SVD, from the paper)

Let me briefly explain how the SVD trick is applied (only a shallow understanding of SVD is needed).

Assume the fully connected layer's input is x, the output is y, and the matrix of all its weights is W, with shape u*v.

Computed normally, y = Wx (bold denotes vectors), and the complexity is u*v (counting only multiplications). If it is unclear how this complexity is obtained, the picture below gives an example:

(Figure: multiplication count of a matrix-vector product)

Decomposing the weight matrix with SVD gives:

$$W = U \Sigma V^T$$

After the SVD, U has shape u*u, $\Sigma$ has shape u*v, and V has shape v*v. $\Sigma$ is a rectangular diagonal matrix: to reach size u*v, its diagonal holds the singular values (which can be computed from eigenvalues) and the rest is 0.

If we keep only the top t singular values (the significant, nonzero part), which is where the word "truncated" comes from, the whole expression can be rewritten as:

$$W \approx U_t \Sigma_t V_t^T$$

Then, computing the fully connected output at this point, the expression is:

$$y = Wx \approx U_t \left( \Sigma_t \left( V_t^T x \right) \right)$$

The complexity is u*t + v*t + t. To read this: first $V_t^T x$ costs t*v multiplications, then multiplying the result by the diagonal $\Sigma_t$ costs t, and multiplying that t*1 vector by $U_t$ costs u*t, giving u*t + v*t + t in total. (Other blogs report t(u+v) without the final t: if $\Sigma_t$ is folded into $U_t$ or $V_t$ in advance, as the paper does when it splits the layer into two smaller layers, that extra t disappears.)

When u and v are large (t is generally small), this decomposition greatly reduces the amount of computation, as the figure above has shown.
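A small NumPy experiment showing the decomposition and the multiplication counts (the dimensions u, v, t here are arbitrary choices for illustration):

```python
import numpy as np

# Truncated SVD of a fully connected weight matrix W (u x v): keep the top
# t singular values and replace one big matrix multiply with two small ones.

rng = np.random.default_rng(0)
u, v, t = 512, 512, 64
W = rng.standard_normal((u, v))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :t] * s[:t]       # U_t Sigma_t folded together, shape (u, t)
B = Vt[:t, :]              # V_t^T, shape (t, v)

x = rng.standard_normal(v)
y_full = W @ x             # u*v multiplications
y_trunc = A @ (B @ x)      # u*t + t*v multiplications

print(u * v, t * (u + v))  # 262144 vs 65536: about 4x fewer multiplications
```

Note that y_trunc approximates y_full well only when W is close to rank t; for a purely random W the approximation is crude, but the cost comparison is the point here.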

7. Several questions explored by the author:

The author also explored several smaller questions in the experiments:

  • Single-scale versus multi-scale input images
    • Multi-scale gives slightly higher mAP but takes more time
    • Single-scale is fast, and its mAP is not much lower
  • Is more training data better?
    • The author enlarged the training set, and mAP rose slightly, from 66.9% to about 70%
  • Is softmax better than SVM for classification?
    • The author finds softmax slightly better here, since softmax introduces competition between classes when scoring each ROI
  • Are more region proposals better?
    • No; the author found that too many proposals actually reduce accuracy (mAP)

8. Summary:

Compared with earlier algorithms, Fast R-CNN is already very good. It contributes ideas such as joint (multi-task) optimization and SVD-accelerated inference, which provided good directions for later work (for example, most later detectors adopt joint optimization).

In addition, from Fast R-CNN it is not hard to guess the next optimization directions: for example, improving how region proposals are obtained, and making the model truly end-to-end (give it an image and it directly outputs the result). Later algorithms such as Faster R-CNN and YOLO were modified along exactly these lines.

9. References:

1. https://blog.csdn.net/qq_36926037/article/details/105984501
2. https://blog.csdn.net/abc13526222160/article/details/90344816
3. https://zhuanlan.zhihu.com/p/139050020
4. Original paper
5. SVD decomposition: https://zhuanlan.zhihu.com/p/52890135


Origin blog.csdn.net/weixin_46676835/article/details/129971092