Introduction to the Mask R-CNN Network

Foreword

  Before introducing Mask R-CNN, it is recommended to first read the introductions to FPN, Faster R-CNN, and FCN; the links are attached below:

  When we introduced the datasets earlier, we said that image segmentation is divided into semantic segmentation and instance segmentation. Compare the two animations below:

(animation: semantic segmentation)

(animation: instance segmentation)

  Today's introduction to Mask R-CNN is about instance segmentation. We will mainly explain it in the following parts:

  • Mask R-CNN network
  • RoIAlign
  • Mask branch (FCN)
  • Loss function
  • Mask branch prediction

1. Mask R-CNN network

  Let's look at the Mask R-CNN network structure first:

(figure: Mask R-CNN network structure)

  From the network structure above, we can see that the front part, RoIAlign + CNN, is the same as in Faster R-CNN (except that Mask R-CNN uses RoIAlign rather than RoIPool). After that, additional convolutional branches can be connected in parallel for segmentation, and a keypoint-detection branch works the same way.

(figure: parallel branches after RoIAlign)

  Now let's look at the structure of the Mask branch, which is very similar to FCN. There are two main variants: one without the FPN feature pyramid and one with FPN. The one we commonly use is the FPN variant on the right.

(figure: Mask branch structures, without and with FPN)

1.1. RoIPool and RoIAlign

  As mentioned above, Mask R-CNN replaces Faster R-CNN's RoIPool layer with RoIAlign. Why? Because RoIPool involves two rounding operations, which lead to deviations in localization. Let's compare the two operations.

RoIPool:

(figure: RoIPool with max pooling)

  As the figure shows, two rounding operations may be involved. Take a detection box as an example: the first rounding happens when the box is projected onto the final feature map of the network; the second happens during max pooling, because the projected region cannot be guaranteed to divide evenly into bins, so the bin boundaries are rounded again.

RoIAlign:

(figure: RoIAlign sampling)

  In contrast, RoIAlign involves no rounding in the first projection: the computed coordinates are kept exactly as they are. In the second step, the projected region is divided into bins, and in each bin a sampling point (the bin center here, though several sampling points can also be averaged) takes its value by bilinear interpolation from the nearest surrounding feature-map points, again without any rounding.

  From this comparison, RoIAlign involves no rounding operations at all, so its localization is more accurate.
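The sampling idea behind RoIAlign can be sketched in a few lines. This is a minimal NumPy illustration (the feature map and coordinates are made up): the value at a fractional coordinate is computed by bilinear interpolation from the four nearest feature-map points, so no coordinate is ever rounded.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at the fractional location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0  # fractional parts act as interpolation weights
    return (feat[y0, x0] * (1 - wy) * (1 - wx) +
            feat[y0, x1] * (1 - wy) * wx +
            feat[y1, x0] * wy * (1 - wx) +
            feat[y1, x1] * wy * wx)

feat = np.arange(16, dtype=float).reshape(4, 4)
# sample at the fractional center of a bin -- no rounding involved
print(bilinear_sample(feat, 1.5, 2.5))  # → 8.5 (average of 6, 7, 10, 11)
```

RoIPool would instead round 1.5 and 2.5 down to integer indices, which is exactly the localization error RoIAlign avoids.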

1.2. Mask branch

  We mentioned above that there are two types of Mask branches: with FPN and without. The most commonly used is the FPN structure below:

(figure: Mask branch with FPN)
Note 1:
  There are two RoIAlign layers in the figure above. The RoIAlign used by the Mask branch is not the same one used by the Faster R-CNN branch: one outputs 7×7 features, the other 14×14. Segmentation requires more spatial information to be retained, and the stronger the pooling, the more information is lost. The final output of the Mask branch is 28×28×80, meaning a 28×28 mask is predicted for each category (COCO, which is usually used, has 80 categories).
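The shapes in Note 1 can be reproduced with a small PyTorch sketch of the FPN-style mask head (assuming PyTorch is available; the layer names and channel width of 256 follow the common convention, but this is an illustrative head, not the exact reference implementation): four 3×3 convolutions keep the 14×14 RoIAlign output, a transposed convolution upsamples to 28×28, and a final 1×1 convolution outputs one mask per class.

```python
import torch
import torch.nn as nn

num_classes = 80  # COCO

mask_head = nn.Sequential(
    # four 3x3 convs keep the 14x14 spatial size
    *[layer
      for _ in range(4)
      for layer in (nn.Conv2d(256, 256, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True))],
    # transposed conv doubles 14x14 -> 28x28
    nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2),
    nn.ReLU(inplace=True),
    # 1x1 conv: one 28x28 mask logit map per class
    nn.Conv2d(256, num_classes, kernel_size=1),
)

rois = torch.randn(3, 256, 14, 14)  # three RoIAlign outputs (random, illustrative)
logits = mask_head(rois)
print(logits.shape)  # torch.Size([3, 80, 28, 28])
```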

  What does it mean that Mask R-CNN decouples the predicted mask and class? In FCN, for each pixel a probability score is predicted for every category, and a softmax is applied along the channel direction; after the softmax, each pixel's probabilities across channels sum to 1, so if one category's score is large, the other categories' scores must be small. The categories therefore compete with each other, i.e. mask and class are coupled. So how does Mask R-CNN decouple them? As just described, the Mask branch predicts one mask per category, but it does not apply a softmax along the channel direction; instead, it uses the category predicted by the Faster R-CNN branch to pick out the mask of that category from the Mask branch output. This may sound a bit convoluted, but the core point is that the Mask branch no longer does classification itself: it takes the classification result of the Faster R-CNN branch as its own.
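A small NumPy comparison (with made-up scores for a single pixel) makes the coupling/decoupling point concrete: softmax across the class channel forces the scores to compete, while an independent sigmoid per class does not, and the box branch simply picks the channel.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

pixel_logits = np.array([2.0, 1.5, -1.0])  # one pixel, 3 classes (illustrative)

coupled = softmax(pixel_logits)     # FCN-style: scores compete, sum to 1
decoupled = sigmoid(pixel_logits)   # Mask R-CNN-style: independent per class

print(coupled.sum())                # → 1.0: raising one class lowers the others
predicted_class = 0                 # class chosen by the Faster R-CNN branch
print(decoupled[predicted_class])   # used directly as the mask probability
```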

Note 2:
  When training the network, the targets input to the Mask branch are the proposals provided by RPN, and it should be noted that all proposals input to the Mask branch are positive samples. The positive samples are obtained when the Faster R-CNN branch performs positive/negative sample matching: the proposals are input to the Faster R-CNN branch, where the matching determines whether each proposal is a positive or negative sample and which GT it corresponds to, and all the positive samples obtained are passed to the Mask branch.
  When predicting, the targets input to the Mask branch are provided by the Faster R-CNN branch, i.e. the final predicted bounding boxes, not by RPN. The boxes RPN provides may not be accurate, and multiple boxes may be provided for one target; as just noted, all the samples provided to the Mask branch during training are positive samples, so there is necessarily overlap among them, and those proposals can be used for training. But in the final prediction, the output of the Faster R-CNN branch is used directly, because only the most accurate bounding box is needed for each target; after NMS in the Faster R-CNN branch, many overlapping detections are filtered out, so fewer targets are sent to the Mask branch and the amount of computation decreases.
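The training-time routing in Note 2 can be sketched in NumPy (the proposals and matched labels below are invented for illustration): after matching, only the positive proposals, i.e. those whose matched label is not background, are forwarded to the mask branch.

```python
import numpy as np

# four proposals as (x1, y1, x2, y2) boxes -- illustrative values
proposals = np.array([[ 10,  10,  50,  50],
                      [ 12,  14,  48,  52],
                      [100, 100, 140, 160],
                      [  0,   0, 300, 300]], dtype=float)

# class each proposal was matched to by the Faster R-CNN branch
# (0 = background, i.e. a negative sample)
matched_labels = np.array([17, 17, 3, 0])

positive = matched_labels > 0
mask_proposals = proposals[positive]        # only positives reach the mask branch
mask_target_classes = matched_labels[positive]

print(len(mask_proposals))  # → 3
```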

2. Loss function

The loss function has three terms in total, i.e. the mask branch loss is added on top of the Faster R-CNN losses:

$$Loss = L_{rpn} + L_{fast\_rcnn} + L_{mask}$$

(figure: mask loss computation)
  How is the mask branch loss computed? Borrowing a figure drawn by another blogger (above): input an image, pass it through the backbone and FPN to get feature layers at different sampling rates, then through RPN to generate a series of proposals. Suppose RPN produces one proposal (the black rectangle in the figure). The proposal is input to RoIAlign, which crops out the corresponding region of the matching feature layer (shape 14×14×C), and then the Mask branch predicts the mask logits for every category. As mentioned above, the proposals input to the Mask branch are provided by RPN and are all positive samples, so through the Fast R-CNN matching the corresponding GT is known. Suppose the GT category of this proposal is "cat": the logits channel for the cat category (shape 28×28) is extracted. Note that although no softmax is applied along the channel here, the logits are passed through a sigmoid activation, which maps every predicted value to between 0 and 1. Then the GT corresponding to the proposal on the original image is cropped and scaled to 28×28, giving the GT mask in the figure (target area 1, background area 0). Finally, the loss is simply the BCELoss (BinaryCrossEntropyLoss) between the predicted cat-category mask and the GT mask. This is only one proposal as an example; in reality there will be many.
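The steps above can be sketched in NumPy (shapes follow the text: 80 classes, 28×28 masks; the logits, class, and GT mask are random/illustrative): select the logits channel for the proposal's GT class, apply a sigmoid, and compute binary cross-entropy against the GT mask.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, m = 80, 28

# mask-branch output for one positive proposal: one logit map per class
logits = rng.normal(size=(num_classes, m, m))
gt_class = 15                                        # matched GT category ("cat")
gt_mask = (rng.random((m, m)) > 0.5).astype(float)   # 1 = target, 0 = background

# sigmoid only on the GT class's channel -- no softmax across channels
p = 1.0 / (1.0 + np.exp(-logits[gt_class]))

# binary cross-entropy, averaged over the 28x28 mask
eps = 1e-7
bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps)).mean()
print(bce)
```

The other 79 channels contribute nothing to the loss for this proposal, which is exactly the decoupling described earlier.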

3. Mask branch prediction

(figure: Mask branch inference)
  At true inference time, the targets input to the Mask branch are provided by the Fast R-CNN branch. As shown in the figure above, the backbone + FPN and RPN parts are the same as introduced above and will not be repeated here. The proposals output by RPN pass through the Fast R-CNN branch (note that its RoIAlign is different from the Mask branch's), and we obtain the final predicted bounding boxes and category information. The bounding boxes are then provided to the Mask branch, which obtains the corresponding features through RoIAlign and predicts one mask per category as logits; according to the category information provided by the Fast R-CNN branch, the channel corresponding to that category is extracted from the logits. That is the mask predicted for this target (shape 28×28; because of the sigmoid activation, the values lie between 0 and 1). Bilinear interpolation then scales the mask to the size of the predicted bounding box, and it is placed in the corresponding region of the original image. It is then converted into a binary mask through a set threshold (default 0.5): areas with a predicted value greater than the threshold are set as foreground, and the remaining areas as background. Now, for each predicted target, we can draw the bounding box, category, and mask information on the original image.
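The post-processing steps above can be sketched in NumPy (the box coordinates and image size are made up for illustration): the 28×28 sigmoid mask is resized to the predicted box with bilinear interpolation, thresholded at 0.5, and pasted into a full-image binary mask.

```python
import numpy as np

def resize_bilinear(mask, out_h, out_w):
    """Resize a 2-D array to (out_h, out_w) with bilinear interpolation."""
    h, w = mask.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    return (mask[np.ix_(y0, x0)] * (1 - wy) * (1 - wx) +
            mask[np.ix_(y0, x1)] * (1 - wy) * wx +
            mask[np.ix_(y1, x0)] * wy * (1 - wx) +
            mask[np.ix_(y1, x1)] * wy * wx)

mask28 = np.random.default_rng(1).random((28, 28))  # sigmoid output in [0, 1]
bx1, by1, bx2, by2 = 40, 60, 120, 160               # predicted box (illustrative)
image_mask = np.zeros((480, 640), dtype=np.uint8)   # full-image binary mask

# scale the 28x28 mask to the box size, threshold at 0.5, paste into the image
box_mask = resize_bilinear(mask28, by2 - by1, bx2 - bx1)
image_mask[by1:by2, bx1:bx2] = (box_mask > 0.5).astype(np.uint8)

print(box_mask.shape)  # → (100, 80)
```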


Origin blog.csdn.net/qq_38683460/article/details/129436676