Neural Network Study Notes 7 - RCNN, Fast RCNN, Faster RCNN, FCN and Mask RCNN for Object Detection, Semantic Segmentation and Instance Segmentation

Series Article Directory

RCNN series reference video
FCN reference video
Mask R-CNN reference video



Object Detection

There are four broad categories of tasks in computer vision for image recognition:

(1) Classification: answers "what is it?" — given an image or video, determine which categories of objects it contains.

(2) Localization: answers "where is it?" — locate the position of the target.

(3) Detection: answers "where is it, and what is it?" — locate the target and identify its category.

(4) Segmentation: divided into instance-level segmentation and scene-level segmentation, answering "which object or scene does each pixel belong to?".

  1. What is object detection?
    The task of object detection is to find all the objects of interest in an image and determine their categories and positions; it is one of the core problems in computer vision. Because objects vary in appearance, shape and pose, and imaging is affected by illumination, occlusion and other factors, object detection has long been one of the most challenging problems in the field. Detection algorithms are commonly grouped into two families: the RCNN series, representative of region-proposal-based (two-stage) methods, and the YOLO series, representative of regression-based (one-stage) methods.

  2. The core problems of object detection
    (1) Classification: which category the object in the image (or in a certain region) belongs to.
    (2) Localization: the target may appear anywhere in the image.
    (3) Scale: targets come in various sizes.
    (4) Shape: targets may have various shapes.

  3. Classification of object detection algorithms
    Deep-learning-based object detection algorithms fall into two main categories: two-stage and one-stage.

    1) Two-stage
    First generate regions, called region proposals (RP for short, pre-selected boxes that may contain the objects to be detected), then classify the samples with a convolutional neural network.
    Task flow: feature extraction --> generate RPs --> classification/localization regression.
    The representative two-stage detection algorithms are the RCNN series.

    2) One-stage
    No RP step; features extracted by the network are used directly to predict object class and location.
    Task flow: feature extraction --> classification/localization regression.
    The representative one-stage detection algorithms are the YOLO series.

Semantic segmentation

In plain terms, semantic segmentation is a refined version of classification. For an image, traditional image classification detects the objects that appear in the image and identifies which categories they belong to — it classifies the picture as a whole. Semantic segmentation instead classifies every pixel in the picture. Unlike classification, where only the final output of the deep network matters, semantic segmentation not only requires discrimination at the pixel level, but also needs a mechanism to project the discriminative features learned at different stages of the encoder back onto the pixel space.

When classifying pixels, each pixel is assigned a category. After every pixel has been labeled, the pixels of each category are given a new color and reassembled into an image. At this point everything in the picture is distinguished at the pixel level, and once recolored, each object appears separated from the rest of the image while carrying all of its semantic information. In this sense, image segmentation is the fine-grained counterpart of image classification — a move from coarse inference to fine inference.

Semantic segmentation can only separate categories; it cannot separate different objects of the same category. Instance segmentation is therefore needed afterwards to distinguish different instances within the same category.

Instance Segmentation

Instance segmentation uses the results of object detection and semantic segmentation together: using the highest-confidence category index provided by detection, it extracts the Mask corresponding to the target from the semantic segmentation output. As the name suggests, instance segmentation separates specific objects (specific instances) within a category. For example, if a photo contains several people, semantic segmentation only needs to assign the pixels of all people to one class, whereas instance segmentation must assign the pixels of different people to different instances. In other words, instance segmentation goes one step further than semantic segmentation.



Preface to RCNN Series Algorithms

In engineering applications, one-stage detectors such as the YOLO series dominate because they are usually fast enough for good real-time detection, while two-stage algorithms such as the RCNN series are slightly slower. Nevertheless, some deep learning frameworks such as Baidu PaddlePaddle open-source RCNN models for object detection, making it possible to quickly build applications for many scenarios, including but not limited to security monitoring, medical image recognition, traffic vehicle detection, traffic light recognition and food detection.

Faster R-CNN is a classic object detection framework. Although the paper dates from 2015, it is still the basis of many detection algorithms, which is rare in the rapidly developing field of deep learning. Mask R-CNN, an improvement built on Faster R-CNN, was proposed in 2017 and won the ICCV 2017 best paper award. Mask R-CNN can also be applied to human pose estimation, and it achieves good results on the three tasks of instance segmentation, object detection and human keypoint detection.

Mask R-CNN inherits from Faster R-CNN: it simply adds a Mask prediction branch to Faster R-CNN and proposes RoI Align as a replacement for RoI Pooling. So to understand Mask R-CNN, you must first be familiar with Faster R-CNN. Likewise, Faster R-CNN inherits from Fast R-CNN, and Fast R-CNN inherits from R-CNN. Therefore, to better understand CNN-based object detection, we start from R-CNN and work our way up to Mask R-CNN.

1. Pioneering work: RCNN

The RCNN algorithm was published by Ross Girshick et al. at CVPR 2014. It applied convolutional neural networks to feature extraction and, thanks to the strong feature extraction ability of CNNs, raised detection accuracy (mAP) on the PASCAL VOC dataset from 35.1% to 53.7%.
RCNN still follows the traditional object detection idea of treating detection as a classification task: first extract a set of candidate regions, then classify each candidate region. The pipeline consists of four main steps:

1. Candidate region generation

Use a region proposal method, such as the SS (Selective Search) algorithm, to extract candidate regions: first divide the image into small rectangular regions, then merge the regions most likely to belong to the same object and output them. About 2000 candidate regions are extracted in this step. Each region is then normalized by warping it to 227×227 to obtain a fixed-size image.

The SS algorithm first splits the picture into many small regions with a simple over-segmentation algorithm, and then repeatedly merges adjacent regions according to similarity and region size (small regions are merged first, which prevents a large region from continuously swallowing small regions and breaking the hierarchical structure), in a manner similar to clustering. This handles the problem of object hierarchy.


2. CNN feature extraction

Each fixed-size image is fed through a deep CNN to obtain a fixed-dimensional feature vector. For example, with AlexNet as the CNN, the network does not go on to the final fully connected classification layer; instead it stops after flattening, so the 2000 regions yield a 2000×4096 feature matrix.


3. SVM classifier

Linear binary SVMs are used to classify the extracted features, deciding whether each region belongs to each category, and hard negative mining is used to balance positive and negative samples. Taking the PASCAL VOC dataset as an example, there are 20 categories, so 20 SVM classifiers are trained. Multiplying the 2000×4096 feature matrix by the 4096×20 weight matrix formed by the 20 SVMs yields a 2000×20 matrix, where each entry is the score of a candidate region for one of the 20 categories.
In other words, there are 2000 candidate boxes in total, each described by a 4096-dimensional feature, and each candidate box is judged by 20 binary SVMs, one per category. Each row of the 2000×4096 matrix corresponds to one candidate box, and each column of the 4096×20 matrix corresponds to one category classifier. Multiplying the first row by each column gives the scores of the first candidate box for the 20 categories — the first row of the 2000×20 matrix — and so on for all 2000 candidate boxes.


For each class (each column of the 2000×20 matrix), non-maximum suppression (NMS) is performed to remove overlapping proposal boxes, leaving only the candidate boxes with the highest scores for that class. NMS works as follows:
Step 1: Define the IoU (Intersection over Union) metric, (A∩B)/(A∪B), i.e. the ratio of the overlapping area of boxes A and B to the area of their union. Intuitively, IoU measures how much A and B overlap: the larger the IoU, the larger the overlapping portion, i.e. the more similar A and B are.

Step 2: For each category, find the candidate box with the highest class score among the 2000 boxes, compute the IoU between every other box and this box, and delete all boxes whose IoU exceeds the threshold. Only a few boxes with low overlap are retained and duplicated regions are removed, because the same target may be covered by many candidate boxes of different sizes; this procedure keeps the box with the highest classification score and removes the other heavily overlapping boxes.

For example, in the figure below, A is the box with the highest score among all candidate boxes for the sunflower class and B is another box. If the IoU of A and B is greater than the threshold, A and B are considered to belong to the same sunflower, so A is kept and B is deleted, because only one box is needed per target — this is non-maximum suppression.
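As a concrete illustration, here is a minimal NumPy sketch of the IoU computation and greedy NMS just described; the function names and the 0.5 threshold are illustrative, not taken from the original paper's code.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, boxes given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]
    return keep
```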

4. Location refinement

A regressor refines the box boundaries from the features to obtain a more accurate target region. The position given by the SS algorithm is not necessarily accurate, so 20 regressors (one per class) perform regression on the candidate boxes remaining in each of the 20 categories, yielding a corrected target region for each class. The details are as follows:

As shown in the figure, the yellow box is the candidate box (Region Proposal), the green window is the ground-truth region (manually annotated), and the red window is the predicted region after regression is applied to the Region Proposal; the linear regression problem can be solved with least squares.

The regressor outputs four parameters for the candidate region: the x and y offsets and the scaling factors for width and height. These four parameters fine-tune the position of the candidate region to produce the red predicted region.
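A minimal sketch of how such offsets and scale factors could be applied to a proposal box; the parameterization below is the standard R-CNN one, and the function name is illustrative.

```python
import numpy as np

def apply_deltas(proposal, deltas):
    """Apply (dx, dy, dw, dh) to a proposal given as (px, py, pw, ph) center/size."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px          # shift center by a fraction of the proposal width
    gy = ph * dy + py          # shift center by a fraction of the proposal height
    gw = pw * np.exp(dw)       # scale width
    gh = ph * np.exp(dh)       # scale height
    return gx, gy, gw, gh
```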


5. Overall


Advantages:
The main improvement of R-CNN over previous object recognition algorithms is the use of a pre-trained convolutional neural network for feature extraction, which effectively improves recognition accuracy.

Although RCNN significantly improves detection results, it still has three main problems:

  1. Training and testing are slow. RCNN requires multi-step training, the steps are cumbersome, and training is slow. For one image, roughly 2000 regions of interest are selected, and each region is passed through the CNN separately, which leads to a huge amount of redundant computation.
  2. Because the fully connected layers used for classification require a fixed input size, every region must be warped to a fixed size, which degrades accuracy.
  3. Candidate-region features must be extracted and saved in advance, which takes up a lot of space.

2. End-to-end: Fast RCNN

After RCNN, the SPPNet algorithm solved the problems of repeated convolution computation and the fixed output scale, but the other drawbacks of RCNN remained. In 2015, Ross Girshick proposed the faster and stronger Fast RCNN algorithm. Not only can the training steps be carried out end-to-end, but the algorithm, built on the VGG16 network, trains nearly 9 times faster than RCNN, tests 213 times faster, and achieves 68.4% mAP on the VOC 2012 dataset.

1. Candidate area and feature extraction

Fast RCNN also uses the SS (Selective Search) algorithm, but unlike RCNN, which crops and processes each rectangular region separately, Fast RCNN first determines the candidate-box positions, then feeds the entire image into the convolutional network once to compute the full-image feature map, and finally maps each RoI candidate box onto the feature map by its coordinates to obtain the desired RoI feature maps. In this way the feature maps of the roughly 2000 candidate boxes are obtained without repeated computation.


The entire image is sent through the convolutional network once, instead of one candidate region at a time as in RCNN. Although the Selective Search method is still used, sharing the convolution greatly reduces the amount of computation.

2. RoI full connection, classifier and bounding box regression

For each candidate-box feature matrix, feature pooling (RoI Pooling) is used to perform a fixed scale transformation. This allows images of any size as input, making training more flexible and accurate without restricting the image size.

The candidate feature map is divided into a fixed 7×7 grid, max pooling is applied in each cell to obtain a 7×7 feature matrix, and the result is flattened into a one-dimensional vector. After two fully connected layers (FCs), the RoI feature vector is obtained. The RoI feature vector is then fed into two parallel FC heads: one is a softmax classifier for class probability prediction, the other is a bbox regressor for the bounding-box parameters (bbox means bounding box).
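A minimal sketch of the RoI Pooling idea, assuming the RoI is already given in feature-map coordinates: split the RoI into a 7×7 grid of bins and max-pool each bin. The boundary rounding here is exactly the quantization that RoI Align later removes.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Max-pool an RoI (x1, y1, x2, y2 in feature-map coords) into output_size x output_size bins."""
    c = feature_map.shape[0]                       # feature map shape: (C, H, W)
    x1, y1, x2, y2 = [int(round(v)) for v in roi]  # quantize the RoI boundary to integers
    out = np.zeros((c, output_size, output_size), dtype=feature_map.dtype)
    h, w = max(y2 - y1, 1), max(x2 - x1, 1)
    for i in range(output_size):
        for j in range(output_size):
            ys = y1 + int(np.floor(i * h / output_size))               # quantize bin boundaries
            ye = y1 + max(int(np.ceil((i + 1) * h / output_size)), ys - y1 + 1)
            xs = x1 + int(np.floor(j * w / output_size))
            xe = x1 + max(int(np.ceil((j + 1) * w / output_size)), xs - x1 + 1)
            out[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```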
Unlike RCNN, no SVM is used here: the classification and regression networks are trained together. To avoid the drawbacks of separate training and the slow speed brought by the SVM classifier, a softmax function is used for classification. The softmax classifier has N+1 outputs, where N is the number of object categories and the extra 1 is a background class for everything else. Taking the PASCAL VOC dataset as an example, there are 20 categories, so the softmax output has 21 nodes.

Because regression parameters are predicted for each of the N+1 classes, and each class has four parameters $(d_x, d_y, d_w, d_h)$, the FC layer of the bbox regressor outputs 4×(N+1) nodes.


Fast RCNN uses a regression formula in which the candidate box $(P_x, P_y, P_w, P_h)$ and the FC-layer outputs $(d_x, d_y, d_w, d_h)$ are combined to compute the refined box.


Because Fast RCNN must predict both the probabilities of the N+1 categories and the bounding-box regression parameters, two loss functions are defined — a classification loss and a bounding-box regression loss — and their sum is the total loss.
The regression targets $(v_x, v_y, v_w, v_h)$ are obtained by inverting the regression formula above.
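For reference, the standard R-CNN-family parameterization (what the missing figures showed) can be written out as below; it is reconstructed from the Fast R-CNN paper's definitions rather than copied from the original figures.

$$\hat{G}_x = P_w d_x + P_x, \quad \hat{G}_y = P_h d_y + P_y, \quad \hat{G}_w = P_w \exp(d_w), \quad \hat{G}_h = P_h \exp(d_h)$$

$$v_x = \frac{G_x - P_x}{P_w}, \quad v_y = \frac{G_y - P_y}{P_h}, \quad v_w = \log\frac{G_w}{P_w}, \quad v_h = \log\frac{G_h}{P_h}$$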


3. Overall

RCNN consists of four separate parts and therefore requires multi-step training, which is very cumbersome. Fast RCNN merges three of them — CNN feature extraction, the bounding-box classifier and the bbox regression — into a single CNN. As a result Fast RCNN has only two parts: first obtain the candidate boxes with the SS algorithm, then complete feature extraction, classification and bounding-box regression with the CNN.

Although Fast RCNN achieves remarkable results, in this algorithm Selective Search takes 2 to 3 seconds while feature extraction takes only about 0.2 seconds, so region generation limits how much the algorithm can exploit its speed. This also points to the direction of improvement taken by the later Faster RCNN algorithm.

Advantages:
Feature extraction is performed once on the entire image before selecting proposal regions, which reduces repeated computation;

Disadvantages:
1. Although a GPU is used, region proposal is still computed on the CPU. Extracting the candidate regions of one picture with the SS algorithm takes about 2 s on the CPU, while the whole CNN takes only 0.32 s, so region proposal is the bottleneck of Fast RCNN's speed.

2. Real-time applications cannot be satisfied, and end-to-end training and testing are not yet truly achieved;

3. Going to real time: Faster RCNN

Faster RCNN is another masterpiece by Ross Girshick and colleagues after RCNN and Fast RCNN. Also using VGG16 as the backbone, it reaches an inference speed of 5 fps on a GPU (including region proposal generation) with further improved accuracy, and it won first place in several tracks of the 2015 ILSVRC and COCO competitions. The biggest innovation of the algorithm is the RPN (Region Proposal Network), which replaces the SS algorithm and uses the anchor mechanism to couple region generation with the convolutional network, raising the detection speed to 14 FPS (frames per second) and achieving 70.4% mAP on the VOC 2012 test set.


1. RPN and anchor

Faster RCNN is essentially RPN + Fast RCNN: the RPN takes the place of SS and extracts candidate boxes after the feature map has been generated.

The image is fed into the convolutional network to obtain feature maps, the RPN generates candidate boxes on the feature map, and the candidate boxes generated by the RPN are projected onto the feature map to obtain the feature matrices of the RoI regions.
The conv feature map in the figure is the feature map obtained by feeding the original image through the convolutional network. A 3×3 sliding window is applied to it, and at every position a 256-d vector is produced (256 is the channel depth of the ZF network used in the paper). This vector then passes through two sibling fully connected layers, which output 2k classification scores and 4k bounding-box regression coordinates, where k is the number of anchor boxes. The 2k scores are the foreground and background probabilities of each anchor box (note that only foreground vs. background is distinguished here; all object categories count as foreground), and there are 4k coordinates because each anchor box has four regression parameters.

So what is an anchor? First, be clear that an anchor is not a proposal. A sliding window is placed on the feature map, and the center of the window is mapped back to a point in the original image: dividing the original image's height and width by the feature map's height and width gives the stride ratio, and this ratio locates the corresponding pixel in the original image. Centered on that pixel, n boxes of different sizes and aspect ratios are drawn — these are the anchors. An anchor may or may not contain a target. Because the size and aspect ratio of the objects we are looking for are not fixed, n anchors of different sizes and aspect ratios are placed at the same point for prediction.

How is n determined? The paper designs the area and aspect ratio of each anchor as follows:
areas $(128^2, 256^2, 512^2)$
aspect ratios $(1{:}1,\ 1{:}2,\ 2{:}1)$
so n = 3×3 = 9. As shown in the figure below, the blue anchors have an area of 128×128, the red anchors 256×256 and the green anchors 512×512.
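A minimal sketch of generating the 9 base anchors from these scales and ratios, centered at a given pixel; the exact rounding details of the official implementation are omitted.

```python
import numpy as np

def make_anchors(center_x, center_y,
                 scales=(128, 256, 512),
                 ratios=(1.0, 0.5, 2.0)):
    """Return 9 anchors as (x1, y1, x2, y2), one per (scale, ratio) pair."""
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the area s*s constant while changing the aspect ratio h/w = r
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return np.array(anchors)
```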

The 2k scores are the probabilities that each anchor box is foreground or background (again, only foreground and background are distinguished here; all object categories count as foreground).

As in Fast RCNN, the reg output of the fully connected layer gives four parameters $(d_x, d_y, d_w, d_h)$ for each anchor box.

What is the relationship between anchors and candidate boxes? Suppose there is a 1000×600×3 image. After the backbone downsamples it, a roughly 60×40 feature map is obtained, so there are 60×40×9 ≈ 20,000 anchors in total. After discarding anchors that cross the image boundary, about 6,000 anchors remain. Each of these anchors is adjusted into a candidate box using the bounding-box regression parameters produced by the RPN (as mentioned above, each anchor yields 2 scores and 4 regression parameters) — so proposals are derived from anchors, and this is exactly how the RPN generates candidate boxes. The candidate boxes generated by the RPN overlap heavily, so non-maximum suppression based on the cls score is applied with an IoU threshold of 0.7, leaving about 2,000 candidate boxes per image.

2. The loss of RPN and Fast_RCNN


The RPN predicts cls and reg, and its loss is computed much like the loss of Fast RCNN: the sum of a classification loss and a bounding-box loss. Here, however, the cls classification is foreground vs. background, and the reg term regresses the anchor boundaries.

There are two ways to implement the cls term: one treats the 2k outputs per position as a two-way softmax classification per anchor, the other applies binary cross-entropy to k outputs, one per anchor.

Multi-classification loss:
Each anchor corresponds to either background or foreground (all targets to be detected are collectively called foreground). $p_i^*$ is the ground-truth label (1 for a positive anchor, 0 for a negative one), and $p_i$ is the predicted probability (e.g. 0.1, 0.9).
Depending on $p_i^*$, the corresponding $p_i$ is plugged in and the $-\log$ term is computed.

Binary classification loss:
Each rectangle represents one anchor. $p_i^*$ is the ground-truth label (1 or 0) and $p_i$ is the predicted probability (e.g. 0.1, 0.9). $p_i^*$ determines which of the two terms in the formula is zeroed out.
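The missing figures showed these classification terms; in the standard Faster R-CNN formulation (reconstructed from the paper, not from the figures) they are:

$$L_{cls}(p_i, p_i^*) = -\log p_{i,\,p_i^*} \quad \text{(softmax over \{background, foreground\})}$$

$$L_{cls}(p_i, p_i^*) = -\left[\, p_i^* \log p_i + (1 - p_i^*) \log (1 - p_i) \,\right] \quad \text{(binary cross-entropy)}$$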

Bounding box regression loss:
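The figures showed the smooth L1 regression loss; for reference, the standard Faster R-CNN form (again reconstructed from the paper, not from the missing figures) is:

$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x,y,w,h\}} \operatorname{smooth}_{L_1}\!\left(t_{i,j} - t_{i,j}^*\right),
\qquad
\operatorname{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}$$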

Fast RCNN loss:
As mentioned earlier, Faster RCNN is essentially RPN + Fast RCNN, with the RPN replacing SS for candidate-box extraction after the feature map is generated. The Fast RCNN loss described above is therefore also used for the final classification stage.


The candidate boxes predicted by the RPN are mapped onto the feature map to obtain feature matrices, which are pooled to 7×7 feature maps by the RoI pooling layer, flattened into vectors, and then passed through a series of fully connected layers to produce the final predictions.

3. Overall


Anchors can be regarded as many boxes of fixed sizes and aspect ratios laid over the image. Since the objects to be detected are also boxes of various sizes and aspect ratios, Faster RCNN treats anchors as strong prior knowledge: it only needs to match anchors to the real objects and fine-tune their classes and positions. Compared with anchor-free detection algorithms, such a prior undoubtedly makes it easier for the network to converge. Together with a series of engineering optimizations, Faster RCNN marked a peak in two-stage object detection.


FCN Preface

The fully convolutional network FCN comes from the paper Fully Convolutional Networks for Semantic Segmentation, the first end-to-end fully convolutional network for pixel-level prediction in semantic segmentation. It classifies images at the pixel level and thus solves the semantic-level image segmentation problem.

Unlike a classic CNN, which follows the convolutional layers with fully connected layers to obtain a fixed-length feature vector for classification (fully connected layer + softmax output), FCN accepts input images of any size and uses deconvolution (transposed-convolution) layers to upsample the feature map of the last convolutional layer back to the input image size. A prediction can then be produced for every pixel while the spatial information of the original input is preserved, and pixel-wise classification is finally performed on the upsampled feature map. Although FCN now looks dated and its accuracy is limited, it opened a new era of semantic segmentation, and subsequent developments in the field are largely based on the ideas it proposed.

The convolutional part of FCN can use VGG, GoogLeNet, AlexNet, etc. as the pre-trained base network, with transfer learning and fine-tuning applied on top. The deconvolution results are fused (summed) with the corresponding forward feature maps to obtain more accurate pixel-level segmentation, and the variants are named FCN-8s, FCN-16s and FCN-32s according to the upsampling factor.

main idea:

  1. A fully convolutional network without a fully connected layer can adapt to input of any size. The meaning of full convolution is to replace all the fully connected layers of the classification network with convolutional layers;
  2. The deconvolution layer increases the image size and outputs fine results;
  3. A skip-level structure that combines results from different depth layers ensures robustness and accuracy.

Shortcomings:

  1. The results obtained are not fine enough and not sensitive enough to details;
  2. Pixel-to-pixel relationships are not considered, lack of spatial consistency, etc.

1. The overall structure of FCN

In a traditional CNN, several fully connected layers and a softmax follow the last convolutional layer, mapping the feature map produced by the convolutions into a fixed-length feature vector. This structure is suited to image-level classification and regression tasks, where the expected output is the class probability of the whole input image. For example, the VGG16 network finally outputs a 1000-dimensional vector of probabilities that the input image belongs to each category.

FCN, however, differs from the classic CNN that classifies with fully connected layers after the convolutions: FCN converts all fully connected layers into convolutional layers so that it can accept input images of any size, upsamples the feature map back to the input size, produces a prediction for every pixel while retaining the spatial information of the original input, and finally classifies each pixel on the upsampled feature map.

The figure below shows the VGG16 architecture. After the 7×7×512 output, three fully connected layers follow: 1×1×4096, 1×1×4096, 1×1×1000, and then softmax.

In the VGG classification network, the 7×7×512 tensor is first flattened into a vector of 25088 nodes, and the fully connected layer turns it into a 4096-dimensional vector. Every output dimension is fully connected to the input, so there are 25088×4096 weight parameters in total.

As mentioned above, FCN does not use a fully connected layer; it replaces it with a convolution of kernel size 7×7, stride 1 and 4096 kernels. Each kernel has 7×7×512 = 25088 parameters — exactly the number of input nodes of the fully connected layer — so one output node of the FC layer has the same number of parameters as one convolution kernel, and the whole layer has the same 25088×4096 = 102,760,448 parameters. The FC weights can therefore simply be reshaped and assigned to the convolutional layer, while the height and width information is preserved.

The remaining fully connected layers are also transformed.
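A minimal PyTorch sketch of this weight "convolutionalization", assuming a VGG16-style fc6 layer; the layer names are illustrative.

```python
import torch
import torch.nn as nn

# VGG16-style fc6: 25088 (= 7*7*512) inputs -> 4096 outputs
fc6 = nn.Linear(7 * 7 * 512, 4096)

# Equivalent convolution: 4096 kernels of size 7x7x512, stride 1
conv6 = nn.Conv2d(512, 4096, kernel_size=7)

# Reshape the FC weights into conv weights: (4096, 25088) -> (4096, 512, 7, 7)
with torch.no_grad():
    conv6.weight.copy_(fc6.weight.view(4096, 512, 7, 7))
    conv6.bias.copy_(fc6.bias)

# On a 7x7x512 input both give the same 4096 numbers; on larger inputs,
# conv6 slides spatially and keeps the height/width information.
x = torch.randn(1, 512, 7, 7)
print(torch.allclose(conv6(x).flatten(1), fc6(x.flatten(1)), atol=1e-5))
```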

2. FCN subdivision structure

Relative to the conv and pool structure of VGG16, conv6 denotes the convolutional layer that replaces the first fully connected layer, conv7 the one that replaces the second, and so on.
FCN-32s means that the prediction made on top of pool5 is upsampled 32× back to the original image size. Correspondingly, pool4's output is at 1/16 of the input resolution, pool3 at 1/8, pool2 at 1/4 and pool1 at 1/2.


1、FCN-32s


  1. Starting from pool5 in the VGG16 backbone, save its output of size $\frac{w}{32}×\frac{h}{32}×512$ and continue with the following operations.
  2. FC6 and FC7 replace the last two FC-4096 layers of the VGG16 backbone (1×1×4096 and 1×1×4096): FC6 is the convolutional layer corresponding to the first fully connected layer, FC7 the one corresponding to the second.
  3. The input image has width and height w×h; after the repeated downsampling of the VGG16 backbone, the output after Conv5 has size $\frac{w}{32}×\frac{h}{32}$.
  4. Pool5 outputs $\frac{w}{32}×\frac{h}{32}×512$, which is passed to FC6 as input.
  5. The FC6 convolution has stride 1 and padding 3, so the height and width are unchanged and the output is $\frac{w}{32}×\frac{h}{32}×4096$.
  6. Since FC7 uses 1×1 convolution kernels, the output size does not change.
  7. After FC7 a Conv2d with 1×1 kernels is applied; the height and width stay the same and the number of output channels becomes the number of classes (including the background class).
  8. A ConvTranspose2d transposed convolution with kernel size 64 and stride 32 upsamples by 32× to obtain an output of size $h×w×\text{NumCls}$, restoring the original image size; softmax then gives the predicted class of each pixel.
The drawback of FCN-32s is that the direct 32× upsampling ratio is too large, so the result is coarse.
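A minimal PyTorch sketch of the FCN-32s head described above (FC6/FC7 as convolutions, a 1×1 classifier and a 32× transposed convolution); the backbone and weight initialization are omitted, and `num_cls` is a placeholder.

```python
import torch
import torch.nn as nn

class FCN32sHead(nn.Module):
    """FCN-32s head on top of pool5 features (shape: N x 512 x H/32 x W/32)."""
    def __init__(self, num_cls):
        super().__init__()
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7, padding=3)    # replaces first FC-4096
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)               # replaces second FC-4096
        self.score = nn.Conv2d(4096, num_cls, kernel_size=1)          # per-class score map
        self.up32 = nn.ConvTranspose2d(num_cls, num_cls,
                                       kernel_size=64, stride=32, padding=16)  # 32x upsample
        self.relu = nn.ReLU(inplace=True)

    def forward(self, pool5):
        x = self.relu(self.fc6(pool5))
        x = self.relu(self.fc7(x))
        x = self.score(x)
        return self.up32(x)           # N x num_cls x H x W, then softmax/argmax per pixel

out = FCN32sHead(num_cls=21)(torch.randn(1, 512, 8, 8))   # -> (1, 21, 256, 256)
```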

2、FCN-16s


  1. Starting from pool4 in the VGG16 backbone, save the pool4 and pool5 outputs of size $\frac{w}{16}×\frac{h}{16}×512$ and $\frac{w}{32}×\frac{h}{32}×512$ and continue with the following operations.
  2. FC6 and FC7 replace the last two FC-4096 layers of the VGG16 backbone (1×1×4096 and 1×1×4096): FC6 is the convolutional layer corresponding to the first fully connected layer, FC7 the one corresponding to the second.
  3. The input image has width and height w×h; after the repeated downsampling of the VGG16 backbone, the output after Conv5 has size $\frac{w}{32}×\frac{h}{32}$.
  4. Pool5 outputs $\frac{w}{32}×\frac{h}{32}×512$, which is passed to FC6 as input.
  5. The FC6 convolution has stride 1 and padding 3, so the height and width are unchanged and the output is $\frac{w}{32}×\frac{h}{32}×4096$.
  6. Since FC7 uses 1×1 convolution kernels, the output size does not change.
  7. After FC7 a Conv2d with 1×1 kernels is applied; the height and width stay the same and the number of output channels becomes the number of classes (including the background class).
  8. A ConvTranspose2d transposed convolution with kernel size 4×4 and stride 2 upsamples by 2× to obtain an output of size $\frac{w}{16}×\frac{h}{16}×\text{NumCls}$.
  9. The saved pool4 output $\frac{w}{16}×\frac{h}{16}×512$ goes through a 1×1 convolution with stride 1 to obtain an output of size $\frac{w}{16}×\frac{h}{16}×\text{NumCls}$.
  10. Add the outputs of steps 8 and 9, then apply a ConvTranspose2d transposed convolution with kernel size 32×32 and stride 16 to upsample by 16× and restore the original image size; softmax then gives the predicted class of each pixel.

The first difference from FCN-32s is the first transposed convolution: FCN-32s upsamples directly by 32×, whereas FCN-16s first upsamples by 2×, adds the result to the feature map from Max-pooling4, and finally upsamples by 16× to reach the original image size.
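A minimal sketch of this skip fusion (the FCN-16s pattern), assuming `score32` is the num_cls-channel map from step 7 and `pool4` is the saved backbone feature; layer names and shapes are illustrative.

```python
import torch
import torch.nn as nn

num_cls = 21
up2 = nn.ConvTranspose2d(num_cls, num_cls, kernel_size=4, stride=2, padding=1)    # 2x upsample
score_pool4 = nn.Conv2d(512, num_cls, kernel_size=1)                              # 1x1 scoring of pool4
up16 = nn.ConvTranspose2d(num_cls, num_cls, kernel_size=32, stride=16, padding=8) # 16x upsample

score32 = torch.randn(1, num_cls, 8, 8)     # prediction at 1/32 resolution
pool4 = torch.randn(1, 512, 16, 16)         # backbone feature at 1/16 resolution

fused = up2(score32) + score_pool4(pool4)   # fuse at 1/16 resolution
out = up16(fused)                           # back to full resolution: (1, 21, 256, 256)
```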

3、FCN-8s


  1. Starting from pool3 in the VGG16 backbone, save the pool3, pool4 and pool5 outputs of size $\frac{w}{8}×\frac{h}{8}×256$, $\frac{w}{16}×\frac{h}{16}×512$ and $\frac{w}{32}×\frac{h}{32}×512$ and continue with the following operations.
  2. FC6 and FC7 replace the last two FC-4096 layers of the VGG16 backbone (1×1×4096 and 1×1×4096): FC6 is the convolutional layer corresponding to the first fully connected layer, FC7 the one corresponding to the second.
  3. The input image has width and height w×h; after the repeated downsampling of the VGG16 backbone, the output after Conv5 has size $\frac{w}{32}×\frac{h}{32}$.
  4. Pool5 outputs $\frac{w}{32}×\frac{h}{32}×512$, which is passed to FC6 as input.
  5. The FC6 convolution has stride 1 and padding 3, so the height and width are unchanged and the output is $\frac{w}{32}×\frac{h}{32}×4096$.
  6. Since FC7 uses 1×1 convolution kernels, the output size does not change.
  7. After FC7 a Conv2d with 1×1 kernels is applied; the height and width stay the same and the number of output channels becomes the number of classes (including the background class).
  8. A ConvTranspose2d transposed convolution with kernel size 4×4 and stride 2 upsamples by 2× to obtain an output of size $\frac{w}{16}×\frac{h}{16}×\text{NumCls}$.
  9. The saved pool4 output $\frac{w}{16}×\frac{h}{16}×512$ goes through a 1×1 convolution with stride 1 to obtain an output of size $\frac{w}{16}×\frac{h}{16}×\text{NumCls}$.
  10. Add the outputs of steps 8 and 9, then apply another ConvTranspose2d transposed convolution with kernel size 4×4 and stride 2 to upsample by 2× and obtain an output of size $\frac{w}{8}×\frac{h}{8}×\text{NumCls}$.
  11. The saved pool3 output $\frac{w}{8}×\frac{h}{8}×256$ goes through a 1×1 convolution with stride 1 to obtain an output of size $\frac{w}{8}×\frac{h}{8}×\text{NumCls}$.
  12. Add the outputs of steps 10 and 11, then apply a ConvTranspose2d transposed convolution with kernel size 16×16 and stride 8 to upsample by 8× and restore the original image size; softmax then gives the predicted class of each pixel.

FCN-8s uses the outputs of both Max-pooling4 and Max-pooling3, and finally upsamples by 8× to reach the original image size.

Mask R-CNN Preface

Target segmentation in the usual sense refers to semantic segmentation, which has a long development history and has made good progress; many researchers currently work in this area. Instance segmentation, by contrast, is a small, relatively independent sub-field that has only developed in recent years. It is more complex than semantic segmentation, fewer researchers study it, and it remains a hot field with plenty of room for research.

Mask R-CNN is an instance segmentation algorithm that mainly adds segmentation on top of object detection. It can be used for object detection, instance segmentation and keypoint detection. In short, Mask R-CNN is Faster R-CNN + FCN, or more specifically ResNet/FPN + RPN + RoI Align + Fast R-CNN + FCN.


1. ROI Pooling and ROI Align

In Fast RCNN, candidate boxes are obtained by the SS algorithm and mapped onto the feature map as RoIs. The original image is fed in, a fixed number of candidate-box coordinates are produced by the proposal method, and the whole image is passed through the backbone network to obtain the feature map. The RoI is then mapped to the corresponding position on the feature map according to the ratio between the input image and the feature map, aligned to the feature-map grid. The RoI is therefore just the candidate box carried onto the feature map — it can be understood simply as a box on the "feature map".

RoI Pooling involves rounding both when the candidate box is mapped from the original image onto the feature map and when the RoI on the feature map is divided into an n×n grid of bins; that is, RoI Pooling contains two quantization steps:

  1. Round the boundary of the candidate box, quantizing it to integer coordinates.
  2. Divide the quantized region evenly into n×n units (bins) and round (quantize) the boundary of each unit.

After these two quantizations, the candidate box has drifted from its originally regressed position, and this deviation hurts detection and segmentation accuracy. The paper calls this "misalignment".

The specific operation of ROI Pooling is:

  1. As shown in the cat-and-dog picture below, the original input image is 800×800 and the candidate box extracted by the SS algorithm is 665×665.
  2. The VGG16 backbone downsamples by 32×: 800 is exactly divisible by 32, giving 25, but 665/32 = 20.78, which has a fractional part, so RoI Pooling quantizes it to 20. The result is a 25×25 feature map and a 20×20 RoI candidate box.
  3. To pool the features inside the 20×20 RoI down to 7×7, the RoI is divided evenly into 7×7 bins, each with a side length of 20/7 ≈ 2.86, again fractional, so RoI Pooling quantizes once more to 2. As a result the bins cannot be divided evenly and end up with different sizes.

After these two quantizations, the candidate box has deviated from its originally regressed position, and this deviation affects the accuracy of detection and segmentation: a 0.1-pixel deviation on this feature map becomes 3.2 pixels when scaled back to the original image, and a 0.8-pixel deviation is a difference of nearly 30 pixels in the original image — not a gap to be underestimated. The paper summarizes this as "misalignment".


To fix these shortcomings of RoI Pooling, the authors propose RoI Align. The idea is simple: cancel the quantization operations and use bilinear interpolation to obtain the image values at floating-point coordinates, turning the whole feature aggregation process into a continuous operation.

The specific operation of ROI Align is:

  1. As shown in the cat-and-dog picture below, the original input image is 800×800 and the candidate box extracted by the SS algorithm is 665×665.
  2. The VGG16 backbone downsamples by 32×: 800/32 = 25 exactly, while 665/32 = 20.78, which has a fractional part. RoI Align keeps the floating-point value instead of rounding the candidate-box boundary, giving a 25×25 feature map and a 20.78×20.78 RoI candidate box.
  3. Because nothing is rounded, the candidate box no longer lines up exactly with the feature-point grid of the feature map. To pool the features inside the 20.78×20.78 RoI down to 7×7, the RoI is divided evenly into 7×7 bins, each with a side length of 2.97 — still fractional, but the bin boundaries are not quantized, so every bin has the same size.
  4. A sampling_ratio = 2 is set, so a fixed $(\text{sampling\_ratio})^2 = 4$ regular sampling points are placed in each bin (the bin is further divided evenly into four sub-regions and the center of each sub-region is taken). Bilinear interpolation computes the values at these four positions, and a max-pooling operation is then applied to them.


To unpack step 4, a few things must first be made clear:

  1. As shown in the figure below, the light-blue grid is the feature map produced by the convolutional backbone, the black grid is the candidate-box feature region kept at floating-point precision, and the blue dots are the four regular sampling points obtained with the sampling ratio.
  2. Three kinds of points are involved, each carrying two kinds of information (pixel value and coordinates). The first kind is the intersections of the light-blue grid, which can be understood as known pixels. The second kind is the four corner points $(x_1, y_1), (x_2, y_2)$ of the black grid. The third kind is the blue sampling points evenly distributed inside each black cell. Only the grid intersections have both coordinates and pixel values; the corner points and sampling points have only coordinates, which is why bilinear interpolation is used to compute the pixel values of the sampling points.
  3. The computation follows the formula below: each sampling point is interpolated from the four nearest intersections of the light-blue grid cell that contains it, where $(u, v)$ are the sampling point's coordinates relative to that cell and $(f_1, f_2, f_3, f_4)$ are the pixel values at the cell's four corners.
  4. Finally, the maximum of the four interpolated values is taken as the value of that bin, and likewise for all 7×7 = 49 bins, producing 49 values that form a 7×7 feature map.

$$f = (1-u)(1-v)f_1 + u(1-v)f_2 + v(1-u)f_3 + uv\,f_4$$
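A minimal NumPy sketch of sampling one bin with this formula (bilinear interpolation at the four regular sampling points, followed by a max); the helper names are illustrative and edge clamping is omitted for brevity.

```python
import numpy as np

def bilinear(feature, x, y):
    """Interpolate a feature map (H x W) at the floating-point location (x, y)."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    u, v = x - x1, y - y1                       # relative coordinates inside the grid cell
    f1, f2 = feature[y1, x1], feature[y1, x1 + 1]
    f3, f4 = feature[y1 + 1, x1], feature[y1 + 1, x1 + 1]
    return (1 - u) * (1 - v) * f1 + u * (1 - v) * f2 + v * (1 - u) * f3 + u * v * f4

def roi_align_bin(feature, x1, y1, bin_w, bin_h, sampling_ratio=2):
    """Sample sampling_ratio^2 regular points inside one bin and max-pool them."""
    samples = []
    for i in range(sampling_ratio):
        for j in range(sampling_ratio):
            # center of each of the sampling_ratio x sampling_ratio sub-regions
            sx = x1 + (j + 0.5) * bin_w / sampling_ratio
            sy = y1 + (i + 0.5) * bin_h / sampling_ratio
            samples.append(bilinear(feature, sx, sy))
    return max(samples)
```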


2. Mask branch

The left branch uses a ResNet backbone and the right one uses an FPN backbone; the Mask branch differs between Mask R-CNN with and without FPN. The grey part is the original Faster R-CNN box/class prediction branch and the white part is the Mask branch. The Mask branch of the FPN version uses a different RoI Align, whose feature map is pooled to 14×14 instead of 7×7.

Use a picture to explain the mask convolution process:

  1. The 14×14×256 feature map from RoI Align passes through four 3×3 convolutions (stride 1, padding 1), each followed by ReLU, keeping the output at 14×14×256.
  2. A transposed convolution ConvTranspose2d followed by ReLU doubles the height and width to 28×28×256.
  3. A final 1×1 convolution (stride 1, no padding) keeps the spatial size but changes the number of channels to the number of classes $num_{cls}$, giving an output of 28×28×$num_{cls}$.

The final output feature map has size 28×28×$num_{cls}$; in other words, a 28×28 mask is predicted for every category.
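A minimal PyTorch sketch of this mask head (four 3×3 convs, a 2× transposed conv, then a 1×1 classifier); `num_cls` is a placeholder and FPN-specific details are omitted.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Mask branch: 14x14x256 RoI features -> 28x28xnum_cls mask logits."""
    def __init__(self, num_cls, channels=256):
        super().__init__()
        convs = []
        for _ in range(4):                              # four 3x3 conv + ReLU blocks
            convs += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                      nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.upsample = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)  # 14 -> 28
        self.predictor = nn.Conv2d(channels, num_cls, kernel_size=1)                     # per-class mask

    def forward(self, roi_feats):                       # roi_feats: (N, 256, 14, 14)
        x = self.convs(roi_feats)
        x = torch.relu(self.upsample(x))
        return self.predictor(x)                        # (N, num_cls, 28, 28); sigmoid applied in the loss

logits = MaskHead(num_cls=21)(torch.randn(8, 256, 14, 14))   # -> (8, 21, 28, 28)
```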

When training the network, the inputs to the Mask branch are the positive-sample proposals provided by the RPN. Why not use the boxes provided by the Faster R-CNN head directly? Because the candidate boxes in Faster R-CNN have been filtered by IoU, the positive proposals provided by the RPN are more numerous than the samples the Faster R-CNN head would provide, which in effect increases the number of training samples.
During prediction, however, the inputs to the Mask branch are the boxes output by the Faster R-CNN head.

3. Mask R-CNN loss

The loss of Mask R-CNN is the Faster R-CNN loss plus the loss of the Mask branch.
The loss computation of the RPN and of Faster R-CNN was explained briefly above, so it is not repeated here.

1. Mask branch loss

Before discussing the Mask-branch loss, we need to be clear about what the logits (the network's predicted output) and the targets (the corresponding ground truth, GT) are. As mentioned above, during training the inputs to the Mask branch are the proposals provided by the RPN, so the logits predicted by the network are the mask information for every category for each proposal (note that each predicted mask is 28×28). The proposals fed in here are all positive samples (sampled during the Fast R-CNN stage), and their corresponding GT information (box, class) is known.
As shown in the figure below, suppose a proposal (the black rectangle) is obtained from the RPN. Its feature (of shape 14×14×C) is obtained through RoIAlign, and the Mask branch then predicts the mask of every category, giving the logits for this proposal (after a sigmoid activation, all values are mapped to between 0 and 1). From the positive/negative matching of the Fast R-CNN branch we know that the GT class of this proposal is cat, so the predicted mask of the class cat (shape 28×28) is extracted from the logits. The GT mask in the original image is then cropped by the proposal and scaled to 28×28 (target area 1, background 0). Finally, the BCELoss (binary cross-entropy loss) is computed between the cat mask from the logits and the GT mask.
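A minimal sketch of selecting the GT-class mask from the logits and computing the binary cross-entropy, as described above; the tensor shapes follow the text and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

num_proposals, num_cls = 8, 21
mask_logits = torch.randn(num_proposals, num_cls, 28, 28)        # output of the mask head
gt_classes = torch.randint(0, num_cls, (num_proposals,))         # GT class of each positive proposal
gt_masks = torch.randint(0, 2, (num_proposals, 28, 28)).float()  # GT masks cropped/resized to 28x28

# For each proposal, keep only the predicted mask of its GT class
idx = torch.arange(num_proposals)
selected = mask_logits[idx, gt_classes]                          # (num_proposals, 28, 28)

# BCE with logits applies the sigmoid internally
mask_loss = F.binary_cross_entropy_with_logits(selected, gt_masks)
```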


2. Using the Mask branch at prediction time

As shown in the figure below, the Fast R-CNN branch gives the final predicted bounding boxes and class information. These predicted boxes (note: not the proposals obtained from the RPN) are then given to the Mask branch: RoIAlign extracts their features and the mask logits of each target are predicted. According to the class provided by the Fast R-CNN branch, the mask of that class is extracted from the logits — this is the mask predicted for the target (shape 28×28, values between 0 and 1 thanks to the sigmoid). Bilinear interpolation scales this mask to the size of the predicted bounding box, and it is placed on the corresponding region of the original image. The mask is then converted to a binary image with a threshold (0.5 by default): regions with predicted values above 0.5 are foreground, the rest background. For every predicted target we can now draw the bounding box, the class and the mask on the original image.


Origin blog.csdn.net/qq_45848817/article/details/127965649