Reprinted from: https://blog.csdn.net/linolzhang/article/details/71774168
1. Introduction to Mask-RCNN
The previous article introduced FCN. This article introduces Mask-RCNN, a concept that is easy to grasp: it adds a Mask branch on top of the R-CNN family.
Mask-RCNN comes from Kaiming He's group. By adding a branch network on top of Faster-RCNN, it segments targets at the pixel level while performing object detection.
Paper download: Partial translation of Mask R-CNN
Code download: [ Github ]
The network structure of Mask-RCNN (modified on the basis of the author's original image):
This assumes familiarity with Faster-RCNN; readers who are not are advised to read the earlier blog post: [ Object Detection - the RCNN Series ]
The black part is the original Faster-RCNN, and the red part is the modification on the Faster network:
1) Replace the Roi Pooling layer with RoiAlign;
2) Add a parallel FCN layer (mask layer);
Let's first outline several features of Mask-RCNN (from the paper's Abstract):
1) A branch network is added on top of bounding-box recognition to predict a semantic mask;
2) Training is simple, adding only a small overhead over Faster-RCNN, and it runs at about 5 FPS;
3) It extends easily to other tasks, such as human pose estimation;
4) Without bells and whistles, it outperforms all existing single-model entries on every task, including the COCO 2016 winners.
PS: A reminder here: I suggest reading the original paper first; coming back to this post afterwards will give you a second, deeper understanding.
2. The RCNN detection framework
Many improvements have been built on the original Faster-RCNN framework. Three are worth reading:
1) The article the author recommends:
Speed/accuracy trade-offs for modern convolutional object detectors
Paper download: [ arXiv ]
2) ResNet
Also the author's own work at MSRA; see the blog post [ ResNet Residual Network ]
Paper download: [ arXiv ]
3) FPN
Feature Pyramid Networks for Object Detection, which fuses features from multiple CNN layers through a feature pyramid for detection.
Paper download: [ arXiv ]
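The FPN idea just mentioned (fusing multi-layer backbone features via a top-down pathway with lateral connections) can be sketched minimally in NumPy. This is an illustrative toy, not the paper's implementation: `fpn_top_down` and its arguments are assumed names, the channel-mixing matrices stand in for learned 1x1 convolutions, and upsampling is plain nearest-neighbor.

```python
import numpy as np

def fpn_top_down(c_feats, lateral_ws):
    """Toy FPN top-down pass.

    c_feats:    list of (C_i, H_i, W_i) arrays, finest level first
                (each level half the spatial size of the previous one)
    lateral_ws: list of (out_dim, C_i) matrices standing in for the
                1x1 lateral convolutions
    Returns the list of fused pyramid maps, finest first.
    """
    p = [None] * len(c_feats)
    # Coarsest level: lateral projection only, nothing to merge from above.
    p[-1] = np.einsum('oc,chw->ohw', lateral_ws[-1], c_feats[-1])
    for i in range(len(c_feats) - 2, -1, -1):
        lateral = np.einsum('oc,chw->ohw', lateral_ws[i], c_feats[i])
        # Nearest-neighbor 2x upsampling of the coarser fused map.
        up = p[i + 1].repeat(2, axis=1).repeat(2, axis=2)
        p[i] = lateral + up        # lateral connection + top-down signal
    return p
```

The key design point this illustrates: every pyramid level ends up with both high-resolution detail (from the lateral path) and high-level semantics (from the top-down path).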
Let's look at schematic diagrams of Mask-RCNN combined with the latter two networks (figures taken directly from the paper):
The gray part in the figure is the original RCNN combined with a ResNet or FPN network, and the black part below is the newly added parallel Mask layer. The figure itself conveys the same idea as the one above; its point is to illustrate the generality of the Mask-RCNN approach: it can be combined with a variety of RCNN frameworks and performs well with each.
3. Technical points of Mask-RCNN
● Technical point 1 - Strengthened basic network
Using ResNeXt-101 + FPN as the feature extraction network achieves state-of-the-art results.
● Technical Point 2 - ROIAlign
RoIAlign replaces RoIPooling, improving the pooling operation by introducing interpolation: features are first bilinearly interpolated (to 14*14) and then pooled (to 7*7). This largely resolves the misalignment caused by RoIPooling's direct quantized sampling.
PS: Although this misalignment has little effect on classification, it introduces large errors for a pixel-level Mask.
The results for comparison appear later (Table 2 c & d): RoIAlign brings a large improvement, and the larger the stride, the more pronounced the improvement.
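The interpolation step above can be sketched as follows. This toy single-channel `roi_align` takes one bilinear sample at each bin center; the actual layer samples several points per bin and pools them, and the 14*14-then-7*7 schedule described above is likewise simplified away, so treat everything here as an illustrative assumption.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feature map `feat` (H, W) at a continuous point (y, x)
    by bilinear interpolation -- the core of RoIAlign: coordinates are
    never rounded, unlike RoIPool's quantized bin boundaries."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx       * feat[y0, x1] +
            wy       * (1 - wx) * feat[y1, x0] +
            wy       * wx       * feat[y1, x1])

def roi_align(feat, box, out_size=7):
    """Toy RoIAlign: one bilinear sample at each bin center."""
    y1, x1, y2, x2 = box               # box in continuous feature coords
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bin_h   # bin center, not rounded
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear_sample(feat, cy, cx)
    return out
```

Because no coordinate is ever snapped to the integer grid, the extracted features stay aligned with the box at sub-pixel precision, which is exactly what a pixel-level mask needs.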
● Technical point 3 - Loss Function
For each RoI, the mask branch after RoIAlign produces an output of dimension K * m^2: K is the number of classes (K masks are output), and m is the pooled mask resolution (7*7). The loss function is defined as:
Lmask = the average binary cross-entropy loss, computed by applying a per-pixel Sigmoid to the m*m mask of the ground-truth class k.
Why K masks? Assigning each class its own mask effectively avoids inter-class competition (the other classes contribute no loss).
The comparison of results (Table 2 b) shows that this decoupling, as the author calls it, works much better than a multi-class Softmax.
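The per-class mask loss described above can be sketched in NumPy. The function name and argument layout are illustrative assumptions, but the logic follows the description: a per-pixel sigmoid plus binary cross-entropy, with only the ground-truth class's mask contributing to the loss.

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """Average binary cross-entropy mask loss for one RoI.

    mask_logits: (K, m, m) raw scores, one m x m mask per class
    gt_class:    index of the ground-truth class for this RoI
    gt_mask:     (m, m) binary target mask (0/1 values)
    """
    logits = mask_logits[gt_class]        # other classes contribute no loss
    p = 1.0 / (1.0 + np.exp(-logits))     # per-pixel sigmoid
    eps = 1e-12                           # numerical guard for log
    bce = -(gt_mask * np.log(p + eps) +
            (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                     # average over the m*m pixels
```

Indexing out a single class before the sigmoid is what makes the masks decoupled: unlike a per-pixel softmax over K classes, the K masks never compete with each other.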
4. Comparative experiment effect
In addition, the author gives many example segmentation results; not all are reproduced here, only a comparison chart with FCIS (FCIS has problems with overlapping instances):
5. Mask-RCNN extension
Mask-RCNN also extends well to pose estimation, with good results; interested readers can refer to the paper.