Detailed explanation of DynaSLAM code (2) — Mask R-CNN object detection framework

Table of contents

2.1 Preface

2.2 Advantages of Mask R-CNN

2.3 Analysis of Mask R-CNN framework

(1) Mask R-CNN algorithm steps

(2) Faster R-CNN

(3) FCN

(4) Analysis and comparison of ROI Pooling and ROIAlign

(5) Mask R-CNN loss


Reference link:

(1) Detailed explanation of the Mask R-CNN network

(2) Detailed explanation of Mask R-CNN (CSDN blog)

Video explanation: Detailed explanation of the Mask R-CNN network (Bilibili)

Paper address: https://arxiv.org/abs/1703.06870 

2.1 Preface

        Mask R-CNN is a paper published by Kaiming He et al. in 2017, and it won the Marr Prize (Best Paper Award) at ICCV 2017. Mask R-CNN is a very flexible framework: different branches can be added to complete different tasks, such as object classification, object detection, semantic segmentation, instance segmentation, and human pose estimation.

Mask R-CNN adds a branch for predicting the segmentation mask of each object on top of Faster R-CNN; that is, in addition to predicting each object's bounding box and category, it also predicts its segmentation mask.

Mask R-CNN can not only perform object detection and segmentation at the same time, but can also be easily extended to other tasks, such as simultaneously predicting human keypoints.

 

2.2 Advantages of Mask R-CNN

High speed and high accuracy: to achieve this, the authors combined the classic object detection algorithm Faster R-CNN with the classic semantic segmentation algorithm FCN. Faster R-CNN handles object detection quickly and accurately, and FCN handles semantic segmentation accurately; both are classics in their respective fields. Although Mask R-CNN is more complex than Faster R-CNN, it still runs at about 5 fps, comparable to the original Faster R-CNN. Having discovered the pixel-misalignment problem in ROI Pooling, the authors proposed the ROIAlign strategy; combined with FCN's precise per-pixel masks, this yields high accuracy.

Simple and intuitive: the idea of the entire Mask R-CNN algorithm is very simple: an FCN branch for generating masks is added to the original Faster R-CNN. That is, Mask R-CNN = Faster R-CNN + FCN, or in more detail, RPN + ROIAlign + Fast R-CNN + FCN.

Ease of use: the entire Mask R-CNN framework is very flexible and can be applied to a variety of tasks, including object classification, object detection, semantic segmentation, instance segmentation, and human pose estimation; few algorithms offer such good scalability and ease of use, which makes it well worth studying. In addition, different backbone and head architectures can be swapped in to obtain different performance trade-offs.

2.3 Analysis of Mask R-CNN framework

(Figure: Mask R-CNN algorithm framework)

(1) Mask R-CNN algorithm steps

  • First, input the image to be processed (or an already preprocessed image) and perform the corresponding preprocessing operations;
  • Then, feed it into a pretrained backbone network (ResNeXt, etc.) to obtain the corresponding feature map;
  • Then, set predefined ROIs at each point of the feature map, obtaining multiple candidate ROIs;
  • Then, send these candidate ROIs to the RPN for binary classification (foreground or background) and bounding-box regression, filtering out most of the candidates;
  • Then, perform the ROIAlign operation on the remaining ROIs (that is, first align pixels of the original image with the feature map, then align the feature map with the fixed-size ROI feature);
  • Finally, classify these ROIs (N-way classification), run bounding-box regression, and generate masks (an FCN operates inside each ROI); a minimal inference sketch is given below.
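As an illustration of these steps end to end, here is a minimal inference sketch using torchvision's pretrained Mask R-CNN (an off-the-shelf implementation assumed for illustration, not the build bundled with DynaSLAM; the image path is hypothetical and a recent torchvision is assumed):

```python
# Minimal Mask R-CNN inference sketch with torchvision's pretrained model.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("example.jpg").convert("RGB"))  # hypothetical file
with torch.no_grad():
    outputs = model([image])      # list of images in, list of dicts out

pred = outputs[0]
# pred["boxes"]:  (N, 4)          -> the box regression branch
# pred["labels"]: (N,)            -> the classification branch
# pred["scores"]: (N,)            -> confidence scores
# pred["masks"]:  (N, 1, H, W)    -> the parallel mask branch
keep = pred["scores"] > 0.5
print(pred["boxes"][keep].shape, pred["masks"][keep].shape)
```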

The structure of Mask R-CNN is also very simple: on top of the RoI features obtained by RoIAlign (RoIPool in the original Faster R-CNN), a mask branch (a small FCN) is added in parallel. Previously, Faster R-CNN attached a Fast R-CNN detection head to each RoI, i.e., the class and box branches; Mask R-CNN now adds the mask branch in parallel with them.

Note that Mask R-CNN with and without the FPN structure differs slightly in the mask branch. For Mask R-CNN with FPN, the class/box branch and the mask branch do not share one RoIAlign: during training, the class/box branch pools the proposals produced by the RPN (Region Proposal Network) to a size of 7x7, while the mask branch pools the same proposals to a size of 14x14, as in the configuration sketch below.
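This two-pooler setup can be seen in torchvision's FPN-based Mask R-CNN; the following is a sketch of that configuration (torchvision's API, assumed here for illustration, not DynaSLAM's code):

```python
# Two separate RoIAlign poolers over the FPN levels: 7x7 for the class/box
# head, 14x14 for the mask head.
from torchvision.ops import MultiScaleRoIAlign

box_roi_pool = MultiScaleRoIAlign(
    featmap_names=["0", "1", "2", "3"], output_size=7, sampling_ratio=2)

mask_roi_pool = MultiScaleRoIAlign(
    featmap_names=["0", "1", "2", "3"], output_size=14, sampling_ratio=2)
```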

Here, Mask R-CNN is decomposed into three modules: Faster R-CNN, ROIAlign, and FCN. These modules, explained separately below, form the core of the algorithm.

(2) Faster R-CNN

 

Faster R-CNN mainly comprises four key modules: the feature extraction network, ROI generation, ROI classification, and ROI regression.

  • Feature extraction network: extracts the important features of different targets from images; it is usually composed of conv + ReLU + pooling layers and commonly initialized from a pretrained network (VGG, Inception, ResNet, etc.); the result is called a feature map;
  • ROI generation: places multiple candidate ROIs (here, 9) at each point of the feature map; a classifier then separates these ROIs into foreground and background, and a regressor makes a preliminary adjustment to their positions;
  • ROI classification: in the RPN stage, used to distinguish foreground from background (in the original Faster R-CNN paper, an anchor counts as foreground when its IoU with a ground-truth box exceeds 0.7 and as background when the IoU is below 0.3); in the Fast R-CNN stage, used to distinguish between object categories (cat, dog, person, etc.);
  • ROI regression: the RPN stage makes a preliminary adjustment; the Fast R-CNN stage makes a precise adjustment.

Its overall process is as follows:

  • First, crop/resize the input image and feed it into the pretrained classification network to obtain the corresponding feature map;
  • Then, take 9 candidate ROIs (3 scales x 3 aspect ratios) at each anchor point of the feature map and map them back to the original image according to the feature stride (the feature extraction network generally contains conv and pooling layers, but only pooling changes the feature-map size, so the final size is determined by the number of pooling layers);
  • Then, feed these candidate ROIs into the RPN, which classifies them (i.e., decides whether each ROI is foreground or background) and performs a preliminary regression (i.e., computes the bounding-box offsets Δx, Δy, Δw, Δh between the foreground ROIs and the ground truth), followed by NMS (non-maximum suppression: sort the ROIs by classification score and keep the top N);
  • Then, perform the ROI Pooling operation on these differently sized ROIs (i.e., map each onto a fixed-size feature map, 7x7 in the paper), outputting fixed-size features;
  • Finally, feed them into a small detection head for classification (distinguishing N+1 categories, the extra one being background, used to discard inaccurate ROIs) and bounding-box regression (precisely adjusting the offset between the predicted ROI and the ground-truth box), thereby outputting the final set of bounding boxes; an anchor-generation sketch follows this list.
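To make the "9 candidate ROIs per anchor point" step concrete, here is a small sketch of anchor generation (the scale and ratio values are illustrative assumptions, not the exact configuration of any particular implementation):

```python
# Generate the 9 anchor boxes (3 scales x 3 aspect ratios) centered on one
# feature-map point, expressed in original-image coordinates.
import numpy as np

def anchors_at_point(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:              # r is the width/height ratio
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)        # keeps the anchor area close to s**2
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)            # (9, 4) in (x1, y1, x2, y2)

stride = 32                           # stride after five 2x2 poolings (2**5)
fx, fy = 10, 8                        # a point on the feature map
print(anchors_at_point(fx * stride, fy * stride))
```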

(3) FCN

(Figure: FCN network architecture)

The FCN algorithm is a classic semantic segmentation algorithm that can accurately segment the objects in an image; its overall architecture is shown in the figure above. It is an end-to-end network whose main modules are convolution and deconvolution: the image is first convolved and pooled to shrink the feature map, then deconvolution (i.e., an interpolation/upsampling operation) progressively enlarges the feature map again, and finally every pixel is classified, thereby achieving accurate segmentation of the input image.
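The following is a minimal PyTorch sketch of this conv-then-deconv pattern (layer sizes and class count are illustrative assumptions, far smaller than a real FCN):

```python
# Tiny FCN-style network: conv + pool downsample, transposed conv upsamples,
# and a 1x1 conv classifies every pixel.
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(              # downsample by 4
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.decoder = nn.Sequential(              # upsample back by 4
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 2, stride=2), nn.ReLU())
        self.classifier = nn.Conv2d(64, num_classes, 1)   # per-pixel scores

    def forward(self, x):
        return self.classifier(self.decoder(self.encoder(x)))

out = TinyFCN()(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 21, 224, 224]): one score map per class
```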
 

(4) Analysis and comparison of ROI Pooling and ROIAlign

(Figure: Comparison of ROI Pooling and ROIAlign)

The biggest difference between ROI Pooling and ROIAlign is that the former performs two quantization (rounding) operations, while the latter performs none and instead relies on bilinear interpolation, as shown in the figure above.

 

(Figure: ROI Pooling)

As shown in the figure above, to obtain a fixed-size (7x7) feature map, ROI Pooling performs two quantization operations:

  • image coordinates → feature map coordinates,
  • feature map coordinates → ROI feature coordinates.

Let's go through the details. As shown in the figure, the input is an 800x800 image containing two objects (a cat and a dog).

The first quantization error: the dog's bounding box is 665x665. Passing the image through VGG16 yields the corresponding feature map. With padding on the convolutional layers, the spatial size is preserved through convolution, but the pooling layers shrink the feature map relative to the original image by a factor determined by the number and size of the pooling layers. VGG16 contains five 2x2 pooling operations, so the final feature map is 800/32 x 800/32 = 25x25 (an integer). The dog's region, however, maps to 665/32 x 665/32 = 20.78 x 20.78 on the feature map; since pixel coordinates cannot be fractional, the authors of ROI Pooling quantize it (a rounding operation), so the result becomes 20 x 20.

The second quantization error: the feature map contains ROIs of different sizes, but the subsequent network requires a fixed-size input, so ROIs of different sizes must be converted into fixed-size ROI features, 7x7 here. Mapping the 20x20 ROI onto the 7x7 ROI feature gives bins of 20/7 x 20/7 = 2.86 x 2.86, again a fractional result, which is quantized in the same way.

In fact, the error introduced here causes a misalignment between pixels in the image and pixels in the feature map, i.e., a large deviation between the ROI in feature space and in the original image. Consider just the second quantization: 2.86 is quantized to 2, introducing an error of 0.86. This looks small, but remember that it lives in feature space, which relates to image space at a ratio of 1:32 here; the corresponding gap in the original image is 0.86 x 32 = 27.52 pixels. That is not a small gap, and it only accounts for the second quantization error. It can significantly hurt the performance of the whole detection pipeline, and is therefore a serious problem (the arithmetic is reproduced in the sketch below).
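The arithmetic from the last three paragraphs can be reproduced in a few lines:

```python
# The two ROI Pooling quantization errors from the 800x800 example.
feature_stride = 32                  # VGG16: five 2x2 poolings, 2**5 = 32

roi_img = 665                        # dog RoI side length in the image
roi_feat = roi_img / feature_stride  # 20.78... on the feature map
roi_feat_q = int(roi_feat)           # first quantization: 20

bin_size = roi_feat_q / 7            # map the 20x20 RoI to a 7x7 RoI feature
bin_size_q = int(bin_size)           # second quantization: 2.857... -> 2

# the second rounding alone loses ~0.86 feature-map pixels, which in image
# space (ratio 1:32) is a gap of roughly 27 pixels:
print((bin_size - bin_size_q) * feature_stride)   # ~27.4
```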

(Figure: ROIAlign)

As shown in the figure above, to obtain a fixed-size (7x7) feature map, ROIAlign avoids quantization entirely so as not to introduce quantization error: 665/32 = 20.78 is kept as 20.78 instead of being rounded to 20, and 20.78/7 = 2.97 is kept as 2.97 instead of being rounded to 2. This is the core idea of ROIAlign. How, then, are these floating-point coordinates handled? The solution is the "bilinear interpolation" algorithm.

Bilinear interpolation is a high-quality image scaling algorithm. It uses the pixel values of the four real (integer-coordinate) points surrounding a virtual point in the original image to jointly determine one pixel value in the target image; for example, the pixel value at the virtual coordinate 20.56 can be estimated from its four integer-coordinate neighbors, as in the sketch below.
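Here is a small sketch of bilinear interpolation at one virtual point (the toy feature map and coordinates are assumptions for illustration):

```python
# Estimate the value at a fractional coordinate from its 4 integer neighbors.
import numpy as np

def bilinear(fmap, x, y):
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                    # fractional offsets in [0, 1)
    return ((1 - wx) * (1 - wy) * fmap[y0, x0] +
            wx * (1 - wy) * fmap[y0, x1] +
            (1 - wx) * wy * fmap[y1, x0] +
            wx * wy * fmap[y1, x1])

fmap = np.arange(25.0).reshape(5, 5)           # toy 5x5 feature map
print(bilinear(fmap, 2.78, 1.5))               # value at virtual point (2.78, 1.5)
```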

(Figure: Schematic diagram of bilinear interpolation)

As shown in the figure above, the dashed blue box represents the feature map obtained after convolution, the solid black box represents the ROI feature, and the final output size is 2x2. Bilinear interpolation is used to estimate the pixel values at the blue points (virtual coordinate points, also called the sampling points of bilinear interpolation), which together yield the corresponding output. These blue points are sampled inside each 2x2 cell; the authors point out that the number and position of the sampling points do not greatly affect performance, and other sampling schemes can be used. Max pooling or average pooling is then performed inside each orange region to obtain the final 2x2 output. No quantization is used anywhere in the process and no error is introduced; that is, pixels in the original image and pixels in the feature map are fully aligned with no deviation, which not only improves detection accuracy but also benefits instance segmentation (the two pooling ops are compared in the sketch below).
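Both operations are available in torchvision, which makes the comparison easy to run (torchvision's API is assumed here; the feature map and RoI are the toy values from the 800x800 example):

```python
# RoI Pooling vs. RoIAlign on one RoI given in image coordinates;
# spatial_scale=1/32 maps it onto the stride-32 feature map, where roi_pool
# rounds the coordinates and roi_align interpolates instead.
import torch
from torchvision.ops import roi_align, roi_pool

fmap = torch.randn(1, 256, 25, 25)               # 800x800 image, stride 32
rois = torch.tensor([[0., 0., 0., 665., 665.]])  # (batch_idx, x1, y1, x2, y2)

pooled  = roi_pool(fmap, rois, output_size=7, spatial_scale=1 / 32)
aligned = roi_align(fmap, rois, output_size=7, spatial_scale=1 / 32,
                    sampling_ratio=2)            # 2x2 sample points per bin
print(pooled.shape, aligned.shape)               # both (1, 256, 7, 7)
```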

(5) Mask R-CNN loss

The loss of Mask R-CNN adds the loss of the mask branch on top of the Faster R-CNN loss, namely L = L_cls + L_box + L_mask. L_mask applies a per-pixel sigmoid to the mask prediction and is defined as the average binary cross-entropy, computed only on the mask channel corresponding to the RoI's ground-truth class, so mask prediction and class prediction are decoupled. A sketch of L_mask follows.
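A minimal sketch of the mask term (the shapes are illustrative assumptions: 8 RoIs, 80 classes, 28x28 masks):

```python
# Mask loss: binary cross-entropy with per-pixel sigmoid, evaluated only on
# the mask channel of each RoI's ground-truth class.
import torch
import torch.nn.functional as F

mask_logits = torch.randn(8, 80, 28, 28)              # (RoIs, classes, H, W)
gt_masks = torch.randint(0, 2, (8, 28, 28)).float()   # binary target per RoI
gt_labels = torch.randint(0, 80, (8,))                # ground-truth class ids

idx = torch.arange(8)
selected = mask_logits[idx, gt_labels]                # (8, 28, 28)
L_mask = F.binary_cross_entropy_with_logits(selected, gt_masks)
print(L_mask)
```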

For a source-code analysis of Mask R-CNN, please refer to other blog posts.
