Mask-RCNN technical analysis

Reprinted from: https://blog.csdn.net/linolzhang/article/details/71774168

1. Introduction to Mask-RCNN

       The previous article introduced FCN. This article introduces a new concept, Mask-RCNN, which is easy to understand: it adds a Mask branch on top of RCNN.

       Mask-RCNN comes from Kaiming He. By adding a branch network on top of Faster-RCNN, it segments targets at the pixel level while performing object detection.

       Paper download: Partial translation of Mask R-CNN           

       Code download: [ Github ]

       The network structure of Mask-RCNN (modified from the author's original figure):

        [Figure: Mask-RCNN network structure]

       This assumes readers are already familiar with Faster R-CNN; those who are not should first read the earlier post: [ Target Detection-RCNN Series ]

       The black parts are the original Faster-RCNN, and the red parts are the modifications to the Faster network:

1) Replace the RoI Pooling layer with RoIAlign;

2) Add a parallel FCN layer (the mask layer);


       Let's first outline several features of Mask-RCNN (from the paper's abstract):

1) Adds a branch network on top of the detection framework for semantic mask prediction;

2) Training is simple; it adds only a small overhead compared to Faster R-CNN and runs at about 5 FPS;

3) It is easily extended to other tasks, such as human pose estimation;

4) Without bells and whistles, it outperforms all existing single-model entries on every task, including the winners of COCO 2016.

        PS: A reminder here: I suggest reading the original paper first; coming back to this post afterwards will give you a second, deeper pass over the material.


2. RCNN pedestrian detection framework

       Many improvements have been built on the original Faster RCNN framework. Three main ones are worth reading:

1) The article recommended by the author:

     Speed/accuracy trade-offs for modern convolutional object detectors

     Paper download: [ arxiv ]

2) ResNet

     ResNet is from MSRA and is also the author's own work; see the blog post 【ResNet Residual Network】

     Paper download: [ arxiv ]

3)FPN

     Feature Pyramid Networks for Object Detection, which fuses multi-layer features through a feature pyramid to improve CNN-based detection.

     Paper download: [ arxiv ]

       Let's look at the schematic diagram combining the latter two networks (ResNet and FPN) with the Mask branch (the author's original figure is attached directly):

        [Figure: RCNN with ResNet / FPN backbone plus the parallel Mask layer]

       The gray parts in the figure are the original RCNN combined with a ResNet or FPN network, and the black parts below are the newly added parallel Mask layers. This figure is essentially no different from the one above; its purpose is to illustrate the generality and adaptability of the Mask R-CNN approach: it can be combined with a variety of RCNN frameworks and performs well.


3. Technical points of Mask-RCNN

● Technical point 1 - Strengthened backbone network

     Using ResNeXt-101 + FPN as the feature extraction network achieves state-of-the-art results.

● Technical point 2 - ROIAlign

     ROIAlign replaces RoI Pooling (an improved pooling operation). It introduces an interpolation step: regions are first resampled with bilinear interpolation to 14×14 and then pooled to 7×7, which largely solves the misalignment problem caused by the rounding (quantization) in plain RoI Pooling.

     PS: Although misalignment has little effect on classification, it introduces large errors in pixel-level masks.

     The results are compared later (Table 2 c & d): ROIAlign brings a large improvement, and the larger the stride, the more obvious the improvement.
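To make the difference concrete, here is a minimal NumPy sketch of RoIAlign's core idea: bin coordinates are never rounded, and each bin is filled by bilinearly interpolating the feature map at a real-valued sampling point. This is a simplified illustration (single channel, one sampling point per bin, RoI assumed inside the feature map; the paper's version samples four points per bin and averages them), not the actual implementation:

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a real-valued point (y, x) via bilinear interpolation."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    dy, dx = y - y0, x - x0
    top = feature[y0, x0] * (1 - dx) + feature[y0, x1] * dx
    bot = feature[y1, x0] * (1 - dx) + feature[y1, x1] * dx
    return top * (1 - dy) + bot * dy

def roi_align(feature, roi, out_size):
    """RoIAlign sketch: one sampling point per output bin, no quantization anywhere."""
    y1, x1, y2, x2 = roi                 # RoI in continuous feature-map coordinates
    out = np.zeros((out_size, out_size))
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    for i in range(out_size):
        for j in range(out_size):
            # sample at the center of each bin -- coordinates stay real-valued
            cy = y1 + (i + 0.5) * bin_h
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear_sample(feature, cy, cx)
    return out
```

RoI Pooling, by contrast, would round the RoI boundaries and every bin boundary to integer coordinates; those two rounding steps are exactly the misalignment that RoIAlign removes.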

● Technical point 3 - Loss Function


    For the mask branch, each RoI after ROIAlign corresponds to an output of dimension K * m^2: K is the number of classes (i.e., K masks are output), and m is the pooled resolution (7×7). The total multi-task loss is L = L_cls + L_box + L_mask, and the mask term is defined as:

            L_mask: apply a per-pixel sigmoid to the mask of the ground-truth class k, then take the average binary cross-entropy loss over its m×m pixels.
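Concretely, writing σ for the sigmoid, x^k_ij for the logit at pixel (i, j) of the ground-truth class k's mask, and y_ij ∈ {0, 1} for the ground-truth binary mask, this average binary cross-entropy is:

```latex
L_{mask} = -\frac{1}{m^{2}} \sum_{1 \le i,j \le m}
  \left[ y_{ij} \log \sigma\!\left(x^{k}_{ij}\right)
       + \left(1 - y_{ij}\right) \log\!\left(1 - \sigma\!\left(x^{k}_{ij}\right)\right) \right]
```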

     Why K masks? Assigning one mask to each class effectively avoids inter-class competition: masks for the other classes contribute no loss.
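As a sketch, this per-class mask loss takes only a few lines of NumPy. Shapes and names here are illustrative (a `(K, m, m)` logit tensor per RoI), not the paper's code; the key point is that only the ground-truth class's mask is touched:

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """Per-pixel sigmoid + average binary cross-entropy on the ground-truth
    class's mask only; the other K-1 masks contribute no loss.

    mask_logits: (K, m, m) raw scores, one mask per class
    gt_mask:     (m, m) binary ground-truth mask
    gt_class:    index k of the RoI's ground-truth class
    """
    logits = mask_logits[gt_class]            # select the k-th mask only
    p = 1.0 / (1.0 + np.exp(-logits))         # per-pixel sigmoid
    eps = 1e-12                               # numerical stability
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                         # average over the m*m pixels
```

Because `mask_logits[gt_class]` is selected before any loss is computed, gradients flow only into the k-th mask; this is the decoupling that avoids inter-class competition.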


     From the comparison of results (Table 2 b), the decoupling the author refers to performs much better than a multi-class softmax.


4. Comparative experimental results


       In addition, the author presents many experimental segmentation results; not all are listed here, only a comparison with FCIS (FCIS suffers from an overlap problem):

       [Figure: segmentation comparison between Mask-RCNN and FCIS]

5. Mask-RCNN extension

       The extension of Mask-RCNN to pose estimation works well. Interested readers can refer to the paper.

        [Figure: Mask-RCNN applied to human pose estimation]


