Foreword
Before reading this introduction to Mask R-CNN, it is recommended to first read the introductions to FPN, Faster R-CNN, and FCN. The links are listed below:
- Introduction to R-CNN, Fast R-CNN and Faster R-CNN networks
- FCN network introduction
- FPN network introduction
When we introduced the datasets earlier, we said that image segmentation is divided into semantic segmentation and instance segmentation. Look at the following two animations, the first showing semantic segmentation and the second showing instance segmentation:
The Mask R-CNN introduced today is designed for instance segmentation. We will explain it in the following parts:
- Mask R-CNN network
- RoIAlign
- Mask branch (FCN)
- loss function
- Mask branch prediction
1. Mask R-CNN network
Let's first look at the network structure of Mask R-CNN:
From the network structure above, we can see that the first part, RoIAlign + CNN, is the Faster R-CNN introduced earlier (in fact, the Faster R-CNN used here relies on RoIAlign rather than RoIPool). The convolutional layers that follow form an additional branch attached in parallel. This branch can be used for segmentation, and it can equally be used for keypoint detection.
Let's take a look at the structure of the Mask branch, which is very similar to FCN. There are two main variants: one without the FPN feature pyramid and one with FPN. The one we commonly use is the one on the right, with FPN.
1.1. RoIPool and RoIAlign
As mentioned above, the RoIPool used in Faster R-CNN is replaced with a RoIAlign layer in Mask R-CNN. Why? Because RoIPool involves two rounding operations, which lead to deviations in localization. Let's look at the RoIPool operation first.
As can be seen from the figure above, RoIPool may involve two rounding operations. Take the labeled box of an object detection target as an example. The first rounding happens when the box is projected onto the final output feature layer of the network. The second happens during max pooling, because the projected box cannot be guaranteed to divide evenly into pooling bins, so the bin boundaries are rounded again.
In contrast, as can be seen from the figure above, RoIAlign involves no rounding in the first projection: the computed coordinates are kept exactly as they are. In the second step, the pooling grid is laid directly on the region obtained from the first projection. For each bin, we find the coordinates of its center point and the four nearest feature points around it (several sampling points per bin can also be used and averaged; a single point is used here as an example), and compute the value directly with bilinear interpolation, again without any rounding.
From this comparison, RoIAlign involves no rounding operation at all, so its localization is more accurate.
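The difference between the two operations can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation of any library: the stride of 32, the box coordinates, the toy feature map, and the helper names `bilinear`, `roi_pool_box`, `roi_align_box`, and `roi_align` are all made up for the example. One sampling point per bin is used, and the rounding in RoIPool is shown as rounding down.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate the 2D list `feat` at the fractional location (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_pool_box(box, stride):
    """RoIPool-style projection: coordinates are quantized to integers (first rounding)."""
    return [int(math.floor(c / stride)) for c in box]

def roi_align_box(box, stride):
    """RoIAlign-style projection: fractional coordinates are kept as they are."""
    return [c / stride for c in box]

def roi_align(feat, box, out_size):
    """Sample one point at the center of each bin using bilinear interpolation."""
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    return [
        [bilinear(feat, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
         for j in range(out_size)]
        for i in range(out_size)
    ]

# Toy 8x8 feature map whose value encodes its position: feat[y][x] = x + 10 * y.
feat = [[x + 10 * y for x in range(8)] for y in range(8)]
image_box = (50, 70, 200, 230)  # (x1, y1, x2, y2) on the input image, assumed values
stride = 32                     # assumed downsampling factor of the feature layer

quantized = roi_pool_box(image_box, stride)  # position information is lost here
aligned = roi_align_box(image_box, stride)   # exact fractional position is kept
pooled = roi_align(feat, aligned, out_size=2)

print("RoIPool box :", quantized)
print("RoIAlign box:", aligned)
print("RoIAlign 2x2 output:", pooled)
```

Because the toy feature map is linear in x and y, each RoIAlign output equals the value at the exact bin center, which makes it easy to see that no position information was discarded.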
1.2. Mask branch
We mentioned above that the Mask branch comes in two variants, with FPN and without FPN. The most commonly used one is the following structure with FPN:
Note 1:
There are two RoIAlign layers in the figure above. The upper one belongs to the Faster R-CNN branch and is not the same as the RoIAlign used by the Mask branch: the output size of one is $7\times7$, and the output size of the other is $14\times14$. Segmentation needs to retain more information, and the more the features are pooled down, the more information is lost, so the Mask branch uses the larger output.
The final output of the Mask branch is $28\times28\times80$, which means that a mask of size $28\times28$ is predicted for each category (there are 80 categories when COCO is used).
What does it mean that Mask R-CNN decouples the predicted mask and class?
In FCN, a score is predicted for every category at each pixel, and softmax is then applied to each pixel along the channel direction. After softmax, we obtain the probability of each pixel belonging to each category, so the categories compete with each other: the probabilities of each pixel along the channel direction sum to 1, and if the score of one category is large, the scores of the other categories must be small. This competition means that mask and class are coupled.
So how does Mask R-CNN decouple mask and class? As just said, the mask branch predicts a mask for every category, but it does not apply softmax along the channel direction. Instead, according to the category that the Faster R-CNN branch predicts for the target, the mask for that category is extracted from the mask branch output and used. The core idea is that the mask branch no longer needs its own classification information and simply takes the classification result of the Faster R-CNN branch.
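The contrast between the coupled and the decoupled prediction can be made concrete with a single pixel. The sketch below is only an illustration: the three logit values and the variable names are assumptions for the example, and the per-class sigmoid stands in for the activation that the mask branch applies instead of softmax.

```python
import math

def softmax(scores):
    """Softmax over a list of scores (categories compete, probabilities sum to 1)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Independent per-category activation used in place of softmax."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits of ONE pixel for 3 categories.
pixel_logits = [2.0, 1.5, -1.0]

coupled = softmax(pixel_logits)                 # FCN style: mask and class coupled
decoupled = [sigmoid(v) for v in pixel_logits]  # Mask R-CNN style: no competition

# Raise only the logit of category 0 and see what happens to category 1.
boosted = [pixel_logits[0] + 3.0] + pixel_logits[1:]
coupled_boosted = softmax(boosted)
decoupled_boosted = [sigmoid(v) for v in boosted]

# The category comes from the detection branch, not from the mask branch itself.
predicted_class = 1  # assumed output of the Faster R-CNN branch
mask_prob = decoupled[predicted_class]

print("softmax :", coupled, "-> after boost:", coupled_boosted)
print("sigmoid :", decoupled, "-> after boost:", decoupled_boosted)
print("mask probability for the predicted class:", mask_prob)
```

With softmax, boosting category 0 pushes the probability of category 1 down. With the per-category sigmoid, the value for category 1 is unchanged, which is what decoupling means here.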
Note 2:
When training the network, the targets fed to the Mask branch are provided by the RPN, that is, they are proposals. Note that all proposals fed to the mask branch are positive samples. These positive samples are obtained when the Faster R-CNN branch performs positive and negative sample matching: the proposals are first input to the Faster R-CNN branch, where matching determines whether each proposal is a positive or a negative sample and which GT it corresponds to, and all positive samples are then passed to the Mask branch.
When predicting, the targets fed to the mask branch are provided by Faster R-CNN, that is, they are the final predicted bounding boxes.
Why the difference? The bounding boxes provided by the RPN may not be accurate, and for one target the RPN may provide several bounding boxes. As we just said, all proposals provided to the mask branch are positive samples, so they must overlap with the target. These proposals can therefore all be provided to the mask branch for training. In the final prediction, however, the output of the Faster R-CNN branch is used directly, because prediction only needs the most accurate bounding box. There may be just one box per target, and that box is provided to the Mask branch. In addition, NMS in Faster R-CNN filters out many overlapping boxes, so fewer targets are sent to the mask branch, and fewer targets means less computation.
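The selection of positive proposals for training can be sketched as follows. This is a simplified illustration rather than the matching code of any framework: the IoU threshold of 0.5, the example boxes, and the function names `iou` and `select_positive_proposals` are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_positive_proposals(proposals, gt_boxes, pos_thresh=0.5):
    """Return (proposal, matched GT index) for every proposal counted as positive.

    pos_thresh is an assumed threshold; real configurations may differ.
    """
    positives = []
    if not gt_boxes:
        return positives
    for p in proposals:
        ious = [iou(p, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda k: ious[k])
        if ious[best] >= pos_thresh:
            positives.append((p, best))
    return positives

gt_boxes = [(10, 10, 50, 50), (60, 60, 100, 100)]
proposals = [
    (12, 12, 50, 50),    # overlaps GT 0 well
    (0, 0, 20, 20),      # barely overlaps anything, negative
    (58, 62, 98, 100),   # overlaps GT 1 well
    (10, 10, 48, 52),    # a second proposal for GT 0
]

positives = select_positive_proposals(proposals, gt_boxes)
for box, gt_index in positives:
    print("proposal", box, "-> GT", gt_index)
```

In this toy input, GT 0 receives two positive proposals. During training both would be passed to the mask branch, while at prediction time NMS would keep fewer boxes per target.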
2. Loss function
The loss function has three terms in total, that is, the loss of the mask branch is added on top of the Faster R-CNN losses:
$Loss = L_{rpn} + L_{fast\_rcnn} + L_{mask}$
How is the loss of the mask branch calculated? Here we borrow a figure drawn by another blogger. As shown in the figure above, an image is input and passed through the backbone and FPN to obtain feature layers at different sampling rates, and then through the RPN to generate a series of proposals. Suppose the RPN yields one proposal (the black rectangle in the figure). The proposal is input to RoIAlign, which crops the corresponding features from the matching feature layer according to the size of the proposal (shape $14\times14\times C$). The Mask Branch then predicts the mask information of each category, shown as logits in the figure (the logits are the values before the sigmoid activation function).
As mentioned above, the proposals fed to the mask branch are provided by the RPN, and these proposals are positive samples. Which proposals are positive is known from the Fast R-CNN branch. When the proposal passes through Faster R-CNN, the category of its corresponding GT is obtained (cat in the figure), so the mask predicted for category cat (shape $28\times28$) is extracted from the logits. Note that although no softmax is applied along the channel here, the logits are activated with sigmoid, that is, each predicted value is mapped to a value between 0 and 1.
Then the proposal is used to crop the corresponding GT on the original image, and the crop is scaled to $28\times28$ to obtain the GT mask in the figure (the target area is 1 and the background area is 0). Finally, we compute the BCELoss (BinaryCrossEntropyLoss) between the mask predicted for category cat in the logits and the GT mask. The above uses a single proposal as an example; in practice there will be many.
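The loss for a single proposal can be sketched in plain Python. This is a minimal illustration and not the loss code of any framework: the masks are $2\times2$ instead of $28\times28$ to keep the example readable, there are two categories instead of 80, and the function name `mask_bce_loss` and all numeric values are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_bce_loss(logits, gt_class, gt_mask):
    """Binary cross entropy between one class channel of `logits` and `gt_mask`.

    logits:  [num_classes][H][W] raw scores from the mask branch
    gt_class: category of the GT matched to this proposal
    gt_mask: [H][W] of 0/1 values, cropped by the proposal and resized
    """
    class_logits = logits[gt_class]  # only the GT category's channel is used
    eps = 1e-12                      # avoids log(0)
    total, count = 0.0, 0
    for row_logits, row_gt in zip(class_logits, gt_mask):
        for z, t in zip(row_logits, row_gt):
            p = sigmoid(z)
            total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
            count += 1
    return total / count

# Hypothetical mask-branch output for one proposal: 2 categories, 2x2 masks.
logits = [
    [[5.0, 5.0], [5.0, 5.0]],    # category 0, ignored for this proposal
    [[4.0, -4.0], [4.0, -4.0]],  # category 1 ("cat" in the text)
]
gt_mask = [[1, 0], [1, 0]]       # target area is 1, background is 0
gt_class = 1

loss = mask_bce_loss(logits, gt_class, gt_mask)
wrong = mask_bce_loss(logits, gt_class, [[0, 1], [0, 1]])
print("loss with matching GT mask:", loss)
print("loss with inverted GT mask:", wrong)
```

The channel of category 0 never enters the computation, which reflects the point made above: only the mask of the GT category contributes to $L_{mask}$ for a given proposal.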
3. Mask branch prediction
At actual inference time, the targets fed to the Mask branch are provided by the Fast R-CNN branch. As shown in the figure above, the preceding backbone + FPN and RPN are the same as described above and are not repeated here. The proposals output by the RPN pass through the Fast R-CNN branch (note that this RoIAlign is different from the one in the mask branch), which gives the final predicted bounding boxes and category information.
The predicted bounding boxes are then provided to the Mask branch. RoIAlign extracts the corresponding features, and the Mask branch predicts the logits of the target, with one mask per category. Then, according to the category information provided by the Fast R-CNN branch, the mask corresponding to that category is extracted from the logits. This is the mask predicted for the target (shape $28\times28$; because of the sigmoid activation function, the values are between 0 and 1).
Next, bilinear interpolation is used to scale the mask to the size of the predicted bounding box, and the result is placed in the corresponding area of the original image. It is then converted into a binary image with a set threshold (default value 0.5): areas with a predicted value greater than 0.5 are set as foreground, and the remaining areas are background. Now, for each predicted target, we can draw the bounding box, the category, and the mask on the original image.
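The last two steps, resizing the mask to the box and thresholding it, can be sketched as follows. This is a simplified illustration under stated assumptions: the mask is $2\times2$ instead of $28\times28$, the box uses integer pixel coordinates with exclusive right and bottom edges, and the helper names `resize_bilinear` and `paste_mask` are hypothetical.

```python
def resize_bilinear(mask, out_h, out_w):
    """Resize a 2D list to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for i in range(out_h):
        # Map the center of the output pixel back to input coordinates.
        y = (i + 0.5) * in_h / out_h - 0.5
        y = min(max(y, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = (j + 0.5) * in_w / out_w - 0.5
            x = min(max(x, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = mask[y0][x0] * (1 - dx) + mask[y0][x1] * dx
            bottom = mask[y1][x0] * (1 - dx) + mask[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

def paste_mask(mask_prob, box, image_h, image_w, threshold=0.5):
    """Scale a probability mask to `box` and binarize it on an image-sized canvas."""
    x1, y1, x2, y2 = box
    resized = resize_bilinear(mask_prob, y2 - y1, x2 - x1)
    canvas = [[0] * image_w for _ in range(image_h)]
    for i, row in enumerate(resized):
        for j, p in enumerate(row):
            yy, xx = y1 + i, x1 + j
            inside = 0 <= yy < image_h and 0 <= xx < image_w
            if inside and p > threshold:
                canvas[yy][xx] = 1  # foreground
    return canvas

# Hypothetical sigmoid output for the predicted category (left half is the object).
mask_prob = [[0.9, 0.1], [0.9, 0.1]]
box = (2, 1, 6, 5)  # predicted bounding box on a 6x8 image, assumed values

canvas = paste_mask(mask_prob, box, image_h=6, image_w=8)
for row in canvas:
    print(row)
```

Only pixels inside the predicted box can become foreground, so an inaccurate box directly limits the quality of the final mask.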