CV Object Detection Interview Essentials: The R-CNN Series, Part 1

Table of contents

Why learn Faster-RCNN?

Foreword

R-CNN

R-CNN Summary

Fast-RCNN

What is RoI? (also used in Faster-RCNN; a common interview point)


Why learn Faster-RCNN?

I have recently been preparing for a CV internship and went back over the R-CNN series. Most of my friends think the Faster-RCNN era is already history, so why read about it at all? Isn't that a waste of time? I thought so at first too, but classmates in my lab interviewing for CV positions keep running into questions like the structure of the RPN in Faster-RCNN, how it is trained, how the loss is computed, and so on. Although, with the development of the field, Faster-RCNN's results on most public datasets are no longer as strong as current YOLO and transformer-based models, it contains many foundational ideas that help us learn the newer methods better. As the saying goes, to understand a technology you should be familiar with its history. And while Faster-RCNN is not as fast as YOLO and similar detectors, it is more stable, so it is still used in many industrial settings.


Foreword

Faster-RCNN evolved step by step from Fast-RCNN and R-CNN. They are all two-stage algorithms (simply put, they first generate candidate boxes and then refine them at the end, unlike YOLO, which predicts the final result directly from its anchor boxes, e.g. 9 anchors). So let's first look at what R-CNN is and what each successor improved.

R-CNN

Looking at the figure, R-CNN can be broken down into four simple steps:

1. Use Selective Search to generate 1k~2k candidate regions that may contain objects. What is Selective Search? Given an input image, it merges regions based on the similarity of color, texture, and other features and draws boxes around them, so it is quite slow and time-consuming. (Personally, I think interviewers may ask about this; a rough sketch of how to call it is shown right after this list.)

2. After obtaining these candidate regions, crop them from the image, which gives 1k~2k small images. Since each box has a different size, the crops are of different sizes, so they are first warped to a uniform 227x227, then fed into a CNN (AlexNet, used as a feature extractor), and finally a fully connected layer outputs a feature vector for each region.

3. Feed the resulting 1k~2k feature vectors into SVMs for classification, one binary SVM per class; for example, with 20 classes the output has shape (2000x20).

4. Use a fully connected regressor to refine the positions of the 1k~2k boxes.
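
For reference, a minimal sketch of step 1 using the Selective Search implementation in opencv-contrib (the original R-CNN used its own implementation, so treat this purely as an illustration of what the algorithm produces, not the paper's code):

```python
import cv2  # requires opencv-contrib-python for the ximgproc module

def selective_search_proposals(image_path, max_boxes=2000):
    """Generate candidate regions with OpenCV's Selective Search (fast mode)."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()   # quality mode is slower but finds more boxes
    rects = ss.process()               # array of (x, y, w, h) candidate boxes
    return rects[:max_boxes]           # R-CNN keeps roughly 1k~2k proposals

# boxes = selective_search_proposals("dog.jpg")
```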

So do we really need that many boxes? Definitely not. How are they pruned? Here comes another must-know interview topic: NMS (non-maximum suppression). I will walk through the code in detail later; a minimal sketch is included below for reference.
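A minimal single-class NMS sketch in NumPy (the 0.5 IoU threshold is an assumed default, not a value from the paper); the detailed explanation comes in a later post:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]        # sort boxes by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]                      # keep the highest-scoring remaining box
        keep.append(i)
        # IoU of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```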

R-CNN Summary

1. Disadvantages: the above is the overall R-CNN pipeline, and from it you can already see why it is slow: candidate boxes have to be generated with the SS algorithm, then features are extracted with AlexNet separately for every box, so the repeated convolutions waste a lot of computation, and SVM training is slow as well. These are its shortcomings, and they are what Fast-RCNN set out to fix.

2. Advantages: using the SS algorithm improved accuracy, and the results were very good at the time.


Fast-RCNN

Moving straight on to Fast-RCNN: as the name suggests, it got faster.

First look at the network structure of Fast-RCNN. You can see there is still a big difference compared with R-CNN. The biggest change is the addition of a RoI (Region of Interest) pooling layer; the second is that the SVMs in R-CNN are replaced with fully connected layers.

The overall flow is: continue to use Selective Search from R-CNN to generate about 2k candidate boxes, but feed the whole image into the convolutional network once to obtain a feature map, and then map the candidate boxes onto that feature map. R-CNN, by contrast, maps each candidate box back to the original image and runs convolution on 2k small crops, which wastes a lot of computation. Afterwards, the 2k feature matrices are passed through fully connected layers to obtain the final results. A rough sketch of this pipeline is shown below.
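To make the contrast with R-CNN concrete, here is a rough PyTorch-style sketch of the Fast-RCNN forward pass (the VGG16 backbone, layer sizes, and 1/16 spatial scale are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torchvision
from torch import nn
from torchvision.ops import roi_pool

class FastRCNNSketch(nn.Module):
    """One conv pass over the whole image, then per-RoI pooling and two FC branches."""
    def __init__(self, num_classes=21, channels=512):
        super().__init__()
        # shared conv layers; dropping the last max pool gives a 1/16 feature map
        self.backbone = torchvision.models.vgg16(weights=None).features[:-1]
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(channels * 7 * 7, 4096), nn.ReLU())
        self.cls_score = nn.Linear(4096, num_classes)       # softmax classification branch
        self.bbox_pred = nn.Linear(4096, num_classes * 4)   # box-offset regression branch

    def forward(self, image, rois):
        feat = self.backbone(image)          # conv features computed once per image
        # rois: (K, 5) tensor of (batch_idx, x1, y1, x2, y2) in image coordinates
        pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
        x = self.fc(pooled)
        return self.cls_score(x), self.bbox_pred(x)
```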

What is RoI? (also used in Faster-RCNN; a common interview point)

RoI pooling maps each candidate box to the corresponding region on the feature map (2k regions in total), divides each region evenly into MxN blocks (7x7 in most code), and performs max pooling within each block. This converts candidate regions of different sizes on the feature map into feature vectors of a uniform size that can be sent to the next layer. Why must the size be unified? Because the fully connected layers that follow require a fixed input size and cannot adapt to varying shapes. The specific steps are as follows (a code sketch follows the diagram):

 

1. According to the ratio between the input image size and the feature map size, scale the 2k candidate boxes and map them onto the feature map to obtain the RoIs.

2. Perform the pooling operation on each RoI. Since every candidate box has a different size, each RoI is a different size and must be reduced to a common output size: divide it into MxN blocks and take the maximum value (max pooling) within each block. See the animated diagram below, where the grid is (2x2).

 
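A simplified PyTorch sketch of the two steps above (torchvision.ops.roi_pool provides a real implementation; the 1/16 spatial_scale below assumes a VGG16-style backbone and is only an illustration):

```python
import torch
import torch.nn.functional as F

def roi_pool_sketch(feature_map, rois, output_size=7, spatial_scale=1.0 / 16):
    """feature_map: (C, H, W); rois: list of (x1, y1, x2, y2) in image coordinates.
    Returns a (num_rois, C, output_size, output_size) tensor."""
    pooled = []
    for x1, y1, x2, y2 in rois:
        # step 1: map the box from image coordinates onto the feature map
        fx1 = int(round(x1 * spatial_scale))
        fy1 = int(round(y1 * spatial_scale))
        fx2 = max(int(round(x2 * spatial_scale)), fx1 + 1)
        fy2 = max(int(round(y2 * spatial_scale)), fy1 + 1)
        region = feature_map[:, fy1:fy2, fx1:fx2]
        # step 2: divide the region into output_size x output_size blocks and
        # take the max in each block, giving a fixed-size output for any RoI
        pooled.append(F.adaptive_max_pool2d(region, output_size))
    return torch.stack(pooled)

# Example: a 512-channel feature map and two candidate boxes of different sizes
# feats = torch.randn(512, 38, 50)
# out = roi_pool_sketch(feats, [(0, 0, 160, 160), (32, 48, 320, 240)])  # -> (2, 512, 7, 7)
```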

Finally, the overall flow chart given in the paper is fairly easy to follow: the SVMs in R-CNN are replaced with fully connected layers, and the loss computation is also different.

Smooth L1 loss is used for box regression because it is differentiable at x=0 and its gradient is well-behaved; the classification loss uses softmax (cross-entropy). The network does not predict the box directly but rather offsets, and the actual xywh of each box is recovered from them by the calculation below.
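A sketch of the two pieces just mentioned, assuming the standard R-CNN-style parameterization (dx, dy are center offsets relative to the proposal size; dw, dh are log scale factors):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise (differentiable at x=0)."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def decode_box(proposal, deltas):
    """Recover the predicted box (x, y, w, h) from a proposal box and predicted offsets."""
    px, py, pw, ph = proposal          # proposal center (px, py) and size (pw, ph)
    dx, dy, dw, dh = deltas            # offsets predicted by the regression branch
    gx = px + pw * dx                  # shift the center by a fraction of the width/height
    gy = py + ph * dy
    gw = pw * np.exp(dw)               # scale the width/height exponentially
    gh = ph * np.exp(dh)
    return gx, gy, gw, gh
```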

Summary:

Fast-RCNN is still a big improvement, and the speed is much better, but because the candidate boxes are still produced outside the network by the SS algorithm, it remains quite slow. Faster-RCNN improves on this further.


Since Faster-RCNN includes many improvements, covering it here would make this post a bit bloated, so I will put it in the next article.

Reference blog:

[Deep Learning] Two-Stage Target Detection Algorithm, MangoloD's Blog (CSDN)

