Big Talk Object Detection Classic Models: Mask R-CNN

In the previous article, the classic object detection models (R-CNN, Fast R-CNN, Faster R-CNN) were introduced. Object detection generally aims to achieve the following effect:

R-CNN, Fast R-CNN, and Faster R-CNN identify targets and locate them with bounding boxes. To locate targets more precisely and distinguish different targets at the pixel level, "image segmentation" is used to find the exact pixels belonging to each target, as shown in the figure below (people, cars, traffic lights, etc. are accurately segmented):

Mask R-CNN is an important model for this kind of image segmentation.

The idea of Mask R-CNN is very simple. Since Faster R-CNN already detects targets well, and each candidate region outputs a category label and bounding-box coordinates, Mask R-CNN adds a branch to Faster R-CNN with one more output, the object mask, turning the original two tasks (classification + regression) into three tasks (classification + regression + segmentation). As shown in the figure below, Mask R-CNN consists of two branches:
 
These two branches of Mask R-CNN run in parallel, so training is simple, and the model is only slightly more computationally expensive than Faster R-CNN.
Classification and localization were already introduced in the Faster R-CNN article (see: Dahua target detection classic model RCNN, Fast RCNN, Faster RCNN) and will not be repeated here. The following focuses on the second branch, namely how pixel-level image segmentation is achieved.

As shown in the figure below, Mask R-CNN adds a fully convolutional network branch to Faster R-CNN (the white part in the figure), which outputs a binary mask indicating whether each pixel is part of the target. In the binary mask, positions where the pixel belongs to the target are marked 1, and all other positions are marked 0.
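As a concrete illustration of such a binary mask (a minimal sketch with made-up scores, not the actual Mask R-CNN code), the mask branch's per-pixel scores can be passed through a sigmoid and thresholded at 0.5:

```python
import numpy as np

# Hypothetical per-pixel scores from the mask branch for one ROI (3x3 for brevity).
logits = np.array([[ 2.0, -1.0, -3.0],
                   [ 1.5,  0.8, -2.0],
                   [-0.5,  3.0,  1.0]])

# Sigmoid squashes raw scores into per-pixel probabilities.
probs = 1.0 / (1.0 + np.exp(-logits))

# Thresholding at 0.5 yields the binary mask: 1 = pixel belongs to the target.
binary_mask = (probs >= 0.5).astype(np.uint8)
print(binary_mask)
# → [[1 0 0]
#    [1 1 0]
#    [0 1 1]]
```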
 
As can be seen from the figure above, the binary mask is produced from the feature map. The original image undergoes a series of convolution and pooling operations, so its size changes many times. If the binary mask produced from the feature map were mapped directly back onto the original image, the segmentation would certainly be inaccurate. A correction is therefore needed: RoIAlign is used to replace RoIPooling.
 
As shown in the figure above, the original image is 128x128, and after the convolutional network the feature map becomes 25x25. Now suppose we want to select the region corresponding to the top-left 15x15 pixels of the original image: how do we pick the corresponding pixels in the feature map?
Each pixel of the original image corresponds to 25/128 of a feature-map pixel, so selecting 15x15 pixels in the original image means selecting 15x25/128 ≈ 2.93 pixels in the feature map. In RoIAlign, bilinear interpolation is used to accurately obtain the content at such fractional positions, which largely avoids the misalignment problem.
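The coordinate mapping and bilinear sampling described above can be sketched as follows (a simplified illustration of RoIAlign's core idea; real RoIAlign averages several such samples per output bin, and all names here are illustrative):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feature map `feat` at fractional coordinates (y, x)
    by bilinearly interpolating the 4 surrounding pixels."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx) +
            feat[y0, x1] * (1 - wy) * wx +
            feat[y1, x0] * wy * (1 - wx) +
            feat[y1, x1] * wy * wx)

# A 128x128 image reduced to a 25x25 feature map gives a scale of 25/128,
# so a 15x15 image region maps to 15 * 25/128 ≈ 2.93 feature-map pixels.
scale = 25 / 128
print(15 * scale)  # → 2.9296875

feat = np.arange(25 * 25, dtype=np.float64).reshape(25, 25)
# RoIAlign keeps the fractional coordinate instead of rounding it away:
v = bilinear_sample(feat, 2.93, 2.93)
```

RoIPooling would round 2.93 to an integer grid position, introducing the misalignment; keeping the fraction and interpolating is exactly what avoids it.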
The modified network structure is shown in the following figure (the black part is the original Faster R-CNN, and the red part is the modified part of Mask R-CNN).
 
From the figure above, it can be seen that the loss function becomes

L = classification error + detection (regression) error + segmentation error

The classification and detection (regression) errors already exist in Faster R-CNN; the segmentation error is newly added in Mask R-CNN.
For each MxM ROI, the mask branch outputs KxMxM values (K is the number of classes). A sigmoid function is applied to each pixel, and the average per-pixel binary cross-entropy gives the mask loss Lmask; in effect, logistic regression is performed on each pixel. By predicting K outputs, one independent mask per class, inter-class competition is avoided, decoupling mask prediction from class prediction.
For each ROI, once its class has been detected, only the cross-entropy of that class's mask is used in the loss; that is, of the KxMxM output for an ROI, only the MxM output of the detected class contributes. As shown in the figure below:
 
For example, suppose there are 3 classes: cat, dog, and person. If the current ROI is detected as "person", then Lmask is computed only from the mask of the "person" branch.
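A minimal sketch of this class-specific Lmask (hypothetical shapes and data, not the paper's implementation): the KxMxM output is indexed by the detected class, and the average per-pixel binary cross-entropy is computed over that single MxM map:

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, cls):
    """Average binary cross-entropy over pixels, using only the mask
    channel of the detected class `cls` (mask_logits: K x M x M)."""
    logits = mask_logits[cls]              # pick the M x M map of one class
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-pixel sigmoid
    eps = 1e-12                            # guard against log(0)
    bce = -(gt_mask * np.log(probs + eps) +
            (1 - gt_mask) * np.log(1 - probs + eps))
    return bce.mean()

K, M = 3, 2                                # 3 classes (cat, dog, person), 2x2 masks
rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(K, M, M))   # hypothetical mask-branch output
gt_mask = np.array([[1, 0], [0, 1]], dtype=np.float64)

# ROI detected as class 2 ("person"): only that channel enters Lmask.
L_mask = mask_loss(mask_logits, gt_mask, cls=2)
```

Note that changing the cat or dog channels leaves this loss unchanged, which is precisely the "no inter-class competition" property described above.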

Mask R-CNN combines these binary masks with classification and bounding boxes from Faster R-CNN, resulting in a stunningly accurate segmentation of the image, as shown below:

Mask R-CNN is a small, flexible, general-purpose instance segmentation framework. It can not only detect objects in images, but also output a high-quality segmentation result for each object. In addition, Mask R-CNN generalizes easily to other tasks, such as human keypoint detection, as shown in the following figure:

From R-CNN and Fast R-CNN through Faster R-CNN to Mask R-CNN, no single advance was necessarily a giant leap; each was an intuitive, incremental improvement, but together they add up to a very dramatic effect.
Finally, summarize the development process of the target detection algorithm model, as shown in the following figure:

Strongly Recommended Reading

In 2017, Kaiming He et al. published the classic paper "Mask R-CNN", which introduces the idea, principle, and experimental results of Mask R-CNN in detail. Reading the paper is recommended for a deeper understanding of the model.

 
