# Overview of classic object detection and recognition methods: RCNN / Fast RCNN / Faster RCNN / Mask RCNN / SSD / DSSD / YOLO v1 / YOLO v2, etc.

Overview of classic object detection and recognition methods @陈子逸

Since I studied some object detection methods during my master's, here is an overview. There are bound to be imperfections; please correct me so we can improve together.

Summary

1. Brief descriptions of the RCNN series:

In my opinion, RCNN is the benchmark. It first uses traditional image-processing methods to generate candidate boxes: selective search over texture features of the kind used with HOG and SIFT. These candidate boxes are then fed into a convolutional neural network for training and classification. This is obviously not an end-to-end algorithm; as I recall, it takes more than 47 s to recognize one image.
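As a rough illustration of that proposal step, here is a minimal sketch of selective search using OpenCV's contrib module (an assumption for illustration: the file name is a placeholder, and this stands in for the paper's own implementation):

```python
# Minimal sketch: generating region proposals with selective search.
# Assumes opencv-contrib-python is installed (cv2.ximgproc lives in the contrib package).
import cv2

img = cv2.imread("input.jpg")  # placeholder input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # faster mode; Quality() yields more boxes
rects = ss.process()              # array of (x, y, w, h) candidate boxes
print(len(rects), "proposals")    # R-CNN keeps ~2000 of these per image
```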

On this basis, Fast RCNN was proposed (by Girshick, combining He et al.'s SPPnet to improve RCNN). Instead of the traditional SVM / linear-regression heads, a neural network (a multi-task network) is used for classification as well. It introduces ROI Pooling, and this ROI Pooling layer is used to obtain a fixed input for the FC layers. However, selective search is still used in the process of extracting the candidate regions.
Linear + softmax: classification
Bounding-box regressors: regression
But the ROIs still have to be found on the original image by traditional algorithms, and the processed results then imported into the CNN. This step can only be done on the CPU!!! There is no way to use the GPU, so it cannot be folded into the neural network.
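A minimal sketch of the ROI Pooling operation using torchvision (an assumption for illustration; the original Caffe implementation differs in details, but the operation is the same idea):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)            # (N, C, H, W) conv features
# ROIs as (batch_index, x1, y1, x2, y2) in original-image coordinates
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 300.0],
                     [0, 50.0, 40.0, 400.0, 380.0]])
# spatial_scale maps image coordinates onto the feature map (e.g. 1/16 for VGG16 conv5)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): a fixed-size input for the FC layers
```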

Faster RCNN optimizes the candidate-region extraction step by introducing the RPN network. The most direct feature of the RPN is the anchor mechanism: for each pixel on the feature map it performs an inverse mapping at three scales, using the receptive field and the aspect ratios to produce a pile of boxes on the original image. The boxes that may contain a target are then filtered and refined, with a few small tricks: boxes overlapping the image border are removed, a portion is dropped by IOU ratio, and non-maximum suppression is added, leaving about 2000 boxes in the final image. Two heads follow, for classification and regression. The first judges foreground vs. background; from the loss function one can see that if an anchor is not foreground, p_i* (the probability that the anchor is an object) is 0, so the right half vanishes and no bbox regression needs to be computed, while the left half is clearly a two-class cross-entropy loss. The lambda value assigns the relative weight of classification and regression.
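For reference, the RPN multi-task loss from the Faster R-CNN paper, matching the description above:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^{*}\,L_{reg}(t_i, t_i^{*})$$

Here $p_i$ is the predicted objectness of anchor $i$, and $p_i^*$ is 1 for a positive anchor and 0 otherwise, so the regression term (the right half) vanishes for non-foreground anchors; $L_{cls}$ is the two-class cross-entropy, $L_{reg}$ is a smooth-$L_1$ loss on the box offsets $t_i$, and $\lambda$ balances the two parts.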
And Mask RCNN has two most obvious improvements compared to its predecessors: the backbone uses the newer ResNet + FPN, and RoI Align replaces ROI Pooling. More specifically: ResNeXt + RPN + RoI Align + Fast R-CNN + FCN.

A deep residual network is a network that connects input and output across layers: the mapping is F(x) + x, where x is the output of the previous layer. As network depth increases, CNN performance does not keep getting better; on the contrary, it decreases, and the deep residual network can largely preserve the original features because the results of earlier layers are fed to later layers without being rescaled by activation functions.

The FPN has two pathways, left and right. The bottom-up pathway on the left is a simple convolutional forward pass. Here the author divides the layers of the deep residual network into stages: the feature-map size changes after some layers (pooling) but not after others, and the layers that do not change the feature-map size are grouped into one stage, so the feature extracted each time is the output of the last layer of each stage, which forms a feature pyramid. The result of each stage is output as feature maps C1-C5, with resolutions from 512 down to 32. A 1×1 convolution of C5 gives P5 (with the same channel count as the top level); P5 is then upsampled and added to C4 (after its own 1×1 convolution) to obtain P4, and so on. After P2-P5 are generated, a 3×3 convolution produces the new P2-P5 (eliminating the aliasing effect), and P6, which is used only by the RPN network, is obtained by downsampling P5.
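A minimal sketch of that top-down pathway, assuming stage outputs c2-c5 from the bottom-up backbone (the channel counts are illustrative ResNet values, not from the original text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs bring every Ci to a common channel count
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth the merged maps (the anti-aliasing step in the text)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p2, p3, p4, p5 = (conv(p) for conv, p in zip(self.smooth, (p2, p3, p4, p5)))
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # P6 feeds only the RPN
        return p2, p3, p4, p5, p6
```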
With the FPN, each level contributes anchors of one scale at three aspect ratios, 15 anchor shapes in total across the levels. So how do you decide which level's feature map an ROI should be pooled from? There is a formula that assigns large-scale ROIs to high-level features: $k = \lfloor k_0 + \log_2(\sqrt{wh}/224)\rfloor$. The RPN has three outputs: rpn-logits / rpn-class / rpn-bbox.

RoIAlign is an improvement on the feature-map information lost to the floating-point rounding of ROI Pooling's coordinate conversion: bilinear interpolation is performed over the four pixels around each sampling point, and then the max-pooled value is taken, which is much better than simple rounding. After pooling, the FCN on the mask branch generates K binary masks with resolution m×m for each ROI, where K is the total number of object classes; which mask to use is determined by the class predicted in the Faster R-CNN head. For each pixel of the predicted binary mask we apply the sigmoid activation function, and the two-class cross-entropy loss is used overall. This allows each class to generate an independent mask, avoiding competition between classes (decoupling). If, as in FCN, a softmax were applied to each pixel with a multi-class cross-entropy overall, the classes would compete and the segmentation would be worse.
FCN is fully convolutional, so it accepts inputs of any size; because pooling reduces the resolution, upsampling is used, and a skip structure combines features from different depths.
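A minimal sketch of the mask-branch loss just described: per-pixel sigmoid plus binary cross-entropy, computed only on the mask of the ground-truth class. All tensor shapes here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

num_rois, K, m = 8, 80, 28                    # RoIs, classes, mask resolution (illustrative)
mask_logits = torch.randn(num_rois, K, m, m)  # K binary masks per RoI from the FCN head
gt_classes = torch.randint(0, K, (num_rois,))
gt_masks = torch.randint(0, 2, (num_rois, m, m)).float()

# Only the mask of the ground-truth class contributes to the loss (decoupling)
selected = mask_logits[torch.arange(num_rois), gt_classes]  # (num_rois, m, m)
loss = F.binary_cross_entropy_with_logits(selected, gt_masks)
```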
SSD?

SSD is an algorithm that uses multi-scale feature maps for detection. The backbone network is VGG16 with the final FC layers removed: the FC layers of VGG16 are for classification, and only the front part is used to extract features, giving a total of 6 feature maps of different scales, from 38×38 down to 1×1. It also borrows from the RPN in Faster R-CNN; the so-called prior boxes are actually similar, with the scales likewise defined by hand. Per location, the 19×19, 10×10 and 5×5 maps each generate 6 boxes, while the other three (38×38, 3×3, 1×1) generate 4. The final output dimensionality is (C classes + 4) × k × m × n. As with the RPN, thresholds are set: a default box above one threshold is a positive sample, below another threshold a negative sample, and everything in between is omitted; note the two thresholds are different. The negative samples also include some hard examples, keeping the positive-to-negative ratio at roughly 1:3. Data augmentation is achieved by randomly sampling among multiple strategies; softmax loss is used for classification and smooth L1 for regression. The jaccard overlap used in the loss function is actually the same IOU computation as in the RPN. The final loss function is a weighted sum of the confidence loss and the localization loss.
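As a worked check of the box counts above (assuming the standard SSD300 configuration), the total number of default boxes is

$$38^2\times 4 + 19^2\times 6 + 10^2\times 6 + 5^2\times 6 + 3^2\times 4 + 1^2\times 4 = 8732.$$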

DSSD ?

DSSD replaces the front part of the SSD backbone (the VGG16 model) with ResNet, and then adds a deconvolution network behind it, which brings in a large amount of context information. Some authors also add different prediction structures at the back; worth a look.

YOLO V1

Divide the image into S×S grid cells, and the cell containing a ground truth is responsible for regressing it. Note that regardless of image size and resolution the grid is fixed at S×S, so detection of small objects is not specially optimized and is error-prone. But this is an end-to-end, one-stage algorithm: there is no candidate-box extraction step as in Faster R-CNN, and everything runs inside one CNN. Each grid cell produces B boxes; each box carries 5 values, 4 coordinates and 1 confidence. The confidence is the IOU between the box and the ground truth, so the total output is a tensor of shape (5B + C) × S × S, where C is the predicted class information. The confidence is the product of the probability that the box contains an object and the IOU (which measures how accurate the box is). From this it follows that if there is no object, the first term is 0, the total confidence is 0, and the box takes no part in subsequent computation. The final boxes are obtained, as usual, by thresholding followed by NMS. One characteristic is the use of small convolutions instead of the inception module (I don't yet understand the inception module, so I will study it later). To be more refined, the resolution is also raised from 224 to 448. A dropout layer with ratio = 0.5 is placed after the first fully connected layer to prevent overfitting.
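A quick sketch of the output-tensor bookkeeping with the original paper's settings S = 7, B = 2, C = 20:

```python
S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (YOLO v1 settings)
per_cell = 5 * B + C        # 4 coords + 1 confidence per box, plus C class probabilities
print(per_cell)             # 30
print(S * S * per_cell)     # 1470 values in the 7x7x30 output tensor
```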
For the loss function, the author simply takes the sum of squares of the three parts: classification, regression, and confidence, which is rather crude. For the original paper's model, B = 2 and C = 20 categories, so 5B + C = 30. It is not reasonable for the 8-dimensional regression coordinates and the 20-dimensional classification to share the same weight in a plain squared sum. And since most cells are not objects, their confidences are driven to 0, which may cause the network to diverge or collapse later in training; it is therefore better to introduce weighting parameters into the objective function. A further disadvantage is that only a fixed input size can be used, and recognition of small targets is poor.

YOLO V2

The most obvious changes are the use of convolutional layers instead of FC layers and the introduction of the anchor mechanism. At the same time, batch norm is applied to all layers: batch normalization normalizes each layer's inputs per batch so that they stay close to a Gaussian distribution, which makes training more regular, keeps errors from propagating, and speeds up training. The input size changes from 448 in v1 to 416: since the total pooling downsampling factor is 32, the last layer becomes 13×13, and because in typical pictures the object is most likely to sit in the middle, a 13×13 map has a single center point and captures that information better. The backbone is reworked into Darknet, replacing the previous plain CNN structure and running more efficiently; a characteristic of the network is that the channel count doubles after every downsampling. Because the box scales of SSD and Faster R-CNN are set by hand, v2 instead uses K-means clustering to choose the boxes, and it does not use Euclidean distance (under which larger boxes produce larger errors) but 1 − IOU for the cluster analysis.

Below are more detailed notes on each method.

RCNN

Region proposal: first give some candidate boxes, then search within them.

Selective search: first segments the image, using texture and other low-level features, into potential candidate boxes that may contain objects.

SVM classifier, Bbox reg: regression

RCNN: an image generates ~2k candidate boxes, which are then processed by the CNN one by one.
Slow!!! ~47 s per image.
Convolution does not limit the input size; the fully connected layers do: their input must be kept consistent.

ROI pooling uses a pooling layer to bring feature maps of different sizes to a common size (so they can be stitched together).
Optimization of SPPnet:
Input: the feature map, whatever its size, is pooled down to a fixed 21-dimensional feature per channel (16 + 4 + 1 = 21 bins), i.e. 21 × channels (256). This lets us ignore the size of the feature map and convolve the original image only once.
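A minimal sketch of that spatial pyramid pooling, assuming a PyTorch feature map and the usual 1x1 + 2x2 + 4x4 pyramid (1 + 4 + 16 = 21 bins per channel):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    # x: (N, C, H, W) with arbitrary H and W
    n, c = x.shape[:2]
    pooled = [F.adaptive_max_pool2d(x, level).view(n, c, -1) for level in levels]
    return torch.cat(pooled, dim=2)   # (N, C, 21) regardless of input size

features = spatial_pyramid_pool(torch.randn(1, 256, 13, 17))
print(features.shape)                 # torch.Size([1, 256, 21])
```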

Fast RCNN

SVM/Reg is no longer used for classification; a CNN is used instead.

The difference from SPPnet is that an ROI pooling layer is used to obtain a fixed input for the FC layer.
ROI Pooling: convolution comes first; the candidate areas come from selective search.

First convolve the whole image once.
RoIs: the size of the corresponding region is found from the original image through the receptive field.

Basically it is end-to-end.
Linear + softmax: classification
Bounding-box regressors: regression.
But the RoIs still need to be found on the original image by traditional algorithms, and the processed results then imported into the CNN. This step can only be done on the CPU!!! There is no way to use the GPU, so it cannot be folded into the neural network.

RCNN vs. Fast RCNN comparison (figure omitted).

Faster RCNN

Proposed at the end of 2015.
Region Proposal Network: previously there was no good way to generate candidate boxes, and they were not generated inside a neural network. The RPN mainly performs a two-way classification, background vs. foreground, and an ROI Pooling layer follows it.
Rough classification and coarse positioning.
A 3×3 sliding window; its center point serves as the anchor.

The three scales (anchor areas) are 128², 256², 512².

Candidate boxes are generated from the feature map!!! There are four loss values:
Two classifications are performed on the ~20k candidate boxes generated, to see whether they are objects and what they are.
1. Determine whether it is foreground or background: a 2-way classification.
2. Fine-tune: use bounding-box regression against the current anchors.
3. A 20-way classification of what the object is.
4. Find the most suitable final box-regression position.
End-to-end!!!
Past methods: 1. the image pyramid enlarges and shrinks the image at different ratios and then convolves each scale; 2. the feature map is scaled to different sizes.

RPN network: takes an image of any size as input.
Output: whether the current box is an object, and its objectness score.

Perform another convolution on the feature map: each point of the feature map has a receptive field on the original image, and mapping each feature-map point back to the original input gives that receptive-field area.

K anchor boxes
9 boxes: 3 initial aspect ratios × 3 scales (128/256/512).

cls layer: classification, 2k scores, foreground vs. background for each of the k anchors.
reg layer: regression of the coordinates, 4k outputs.

Sliding window: keep sliding.
Two 1×1 convolutions replace the fully connected layers.
For a 400×600 original image, the last conv layer is 40×60 = 2400 points; with 9 anchors each, roughly 20k candidate boxes in total.

Loss Function

1. The anchor with the largest overlap with a ground-truth box is labeled positive.
2. Anchors with IOU greater than 0.7 are labeled positive; those below 0.3 are negatives; everything in between is discarded.


RPN network training

Mini-batch = 1 image.
Randomly sample 256 anchors, keeping positive and negative samples at about 1:1; new layers are Gaussian-initialized.

Boxes crossing the image boundary are removed; about 6000 are left.

Most of the boxes overlap, so non-maximum suppression (NMS) is introduced: among overlapping boxes, keep the one whose score is larger (Sa > Sb). About 2000 are left.
Then take the top N, e.g. 128.
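A minimal NumPy sketch of the NMS step just described (the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,) objectness
    order = scores.argsort()[::-1]           # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IOU of the kept box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou < iou_threshold]  # drop boxes that overlap too much
    return keep
```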

"Is it an object?" is all the RPN decides; at the end there is another classifier for what it is, a 20-way classification...

Mask RCNN

Following a bottom-up order, we explain in turn: backbone, FPN, RPN, anchors, RoIAlign, classification, box regression, and mask.
FPN: Feature Pyramid Network.

In addition to class and box, the task of mask segmentation is added.

Transposed convolution*2

RPN: a class-agnostic object detector based on a sliding window.

The RPN creates a large number of boxes (anchors) on the image and runs a lightweight binary classifier on them to return object/non-object scores. Anchors with high scores (positive anchors, i.e. positive samples) are passed to the next stage for classification.

Normally positive anchors will not cover the target exactly, so while scoring the anchors the RPN also regresses an offset and scale to correct each anchor's position and size.
Semantic segmentation, instance segmentation.
Later, SNIP and SNIPER are also based on Image Pyramid.

SSD: the image feature pyramid lets SSD detect on feature maps of different scales, but the FPN authors argue that SSD does not make enough use of low-level features (in SSD the lowest-level feature used is conv4_3 of the VGG network), and in their opinion sufficiently low-level features are very helpful for detecting small objects.

FPN: connect the low-resolution, semantically strong high-level features with the high-resolution, semantically weak low-level features through a top-down pathway and lateral connections, so that the features at all scales carry rich semantic information.

VGG16 / RoIAlign

Assume we input an 800x800 image containing two targets (cat and dog), and the dog's BB is 665x665. After the VGG16 network, the feature map obtained is smaller than the original image by a ratio determined by the number and stride of the pooling layers:

In VGG16 there are 5 pooling operations, each a stride-2 pooling, so the final feature map is 800/32 x 800/32 = 25x25 (an integer).

But if the dog's BB is mapped onto the feature map, we get 665/32 x 665/32 = 20.78 x 20.78, a floating-point result with decimals, which is rounded to 20 x 20. This introduces the first quantization error.
Next, the 20 x 20 ROI has to be pooled into a 7 x 7 ROI feature: 20/7 x 20/7 = 2.86 x 2.86, again a floating-point number that gets rounded, introducing the second quantization error.

To obtain a fixed-size (7x7) feature map, RoIAlign uses no quantization at all; bilinear interpolation is used instead. A virtual point in the feature map (such as the floating-point coordinate 20.56; actual pixel positions are all integer values, with no floating-point positions) takes its value jointly from the four real pixels around it, so the pixel value at the virtual location 20.56 can be estimated.

In the (omitted) figure, the blue dashed box represents the feature map obtained after convolution and the black solid box the ROI feature, with a final output size of 2x2. Bilinear interpolation is used to estimate the values at the blue sample points (virtual coordinate points, i.e. the grid points of the bilinear interpolation), giving the corresponding outputs.
Then a max pooling or average pooling operation in each orange-red area gives the final 2x2 output. No quantization is used anywhere in the process and no error is introduced: pixels in the original image and pixels in the feature map are completely aligned, with no deviation. This not only improves detection accuracy but also facilitates instance segmentation.
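A minimal sketch of the bilinear interpolation at the heart of RoIAlign: sampling a feature map at a non-integer location such as 20.56:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    # feat: (H, W) single-channel feature map; (y, x) may be fractional
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    # weighted combination of the four surrounding real pixels
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

fm = np.arange(25.0 * 25).reshape(25, 25)  # the 25x25 feature map from the example
print(bilinear_sample(fm, 20.56, 20.56))   # value at the virtual point, no rounding
```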
The key point: comparing various networks as backbones, the author finds that ResNet-FPN feature extraction gives both higher accuracy and faster running speed, so most practical work adopts the fully parallel mask / classification-regression structure of the right-hand figure.

The mask branch generates a K x m x m output for each RoI, i.e. K binary masks with resolution m x m, where K is the number of object classes. According to the class predicted by the classification branch, only the output of the i-th class enters the loss computation.

For the predicted binary mask output, we apply the sigmoid function to each pixel, and the overall loss is defined as the average binary cross-entropy. Predicting K separate outputs lets each class generate an independent mask and avoids competition between classes; this decouples mask and class prediction. The FCN approach is different: softmax applied to each pixel with an overall multi-class cross-entropy, which leads to competition between classes and ultimately to poorer segmentation.

SSD

Make predictions on feature maps of different scales.

One-stage direct regression; no proposals required.
1. The backbone network is VGG16 with the original FC layers removed (detection does not need VGG's classification head); extra convolutional layers are added to improve model performance.
2. Feature maps at 6 scales, from 38x38 down to 1x1.

3. Prior box
Each cell acts as an anchor, and the corresponding box on the original image is found proportionally, similar to the RPN.
Scale definition: 1. $s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$; 2. combined with the aspect ratios, this gives up to 6 different boxes per location.
The output dimensionality is (C classes + 4 coordinate offsets) × k anchors × m × n feature map.
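A quick numeric check of the scale formula, under the common assumption $s_{min} = 0.2$, $s_{max} = 0.9$ and m = 5 feature maps (conv4_3 is usually handled separately):

```python
s_min, s_max, m = 0.2, 0.9, 5
scales = [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]
print([round(s, 3) for s in scales])  # [0.2, 0.375, 0.55, 0.725, 0.9]
```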

Positive samples: above the threshold; the middle range is omitted.
Negative samples: hard examples are mined, keeping the positive-to-negative ratio at about 1:3.

Data enhancement: random sampling of multiple paths.
Classification: softmax loss.
Regression: smooth L1.

DSSD

What are the innovations of DSSD?
1. Backbone: ResNet replaces the VGG network in SSD, which enhances the feature extraction capability
2. Added deconvolution layers, which bring in a lot of context information

YOLO V1

For each grid cell, a vector of length 5B + C is predicted.

Total: (5B + C) × S × S.
YOLO makes 5 predictions for each bounding box: x, y, w, h, and confidence.
The coordinates x, y represent the center of the predicted bounding box relative to the grid-cell boundary.
The coordinates w, h represent the ratio of the width and height of the predicted bounding box to the width and height of the entire image.
Confidence is the IOU value of the predicted bounding box and ground truth box.
Each grid cell also predicts C conditional class probabilities: Pr(Class_i | Object), i.e. the probability that the object belongs to class i, given that the cell contains an object.
We only predict a set of (C) class probabilities for each grid, regardless of the number of boxes B.
The conditional class probabilities are per grid cell; confidence is per bounding box. At test time, each box's confidence is multiplied by its cell's conditional class probabilities:
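A minimal sketch of that test-time multiplication, producing a class-specific confidence per box (shapes follow S = 7, B = 2, C = 20; random values stand in for network outputs):

```python
import numpy as np

S, B, C = 7, 2, 20
box_conf = np.random.rand(S, S, B)     # Pr(Object) * IOU for each box
class_prob = np.random.rand(S, S, C)   # Pr(Class_i | Object) per grid cell
# class-specific confidence Pr(Class_i) * IOU, shape (S, S, B, C)
class_scores = box_conf[..., None] * class_prob[:, :, None, :]
print(class_scores.shape)              # (7, 7, 2, 20); thresholding + NMS follow
```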
24 convolutional layers + 2 fully connected layers.
YOLO v1 uses 1x1 followed by 3x3 convolutions instead of the inception module.
To be more refined, the resolution is increased from 224x224 to 448x448.
To prevent over-fitting, a dropout layer with ratio = 0.5 is sandwiched in after the first fully connected layer.
Loss function
The loss function is designed so that the coordinates (x, y, w, h), the confidence, and the classification reach a good balance.
Simply using a sum-squared error loss for this has the following shortcomings:
a) It is obviously unreasonable for the 8-dimensional localization error and the 20-dimensional classification error to be weighted equally.
b) If some grid cells contain no objects (and most cells in a picture are like this), the confidences of the bounding boxes in these cells are pushed to 0; since they far outnumber the cells containing objects, their contribution to the gradient update is much larger, which can make the network unstable or even divergent.
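For reference, the loss from the YOLO v1 paper, which addresses both points with the weights $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ and by taking square roots of w and h:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\,\in\,classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

where $\mathbb{1}_{ij}^{obj}$ indicates that box $j$ of cell $i$ is responsible for an object.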
Confidence: the product of two terms, the objectness probability and the IOU.
As written in the loss function, loss is only generated when a bounding box is responsible for an object.

Disadvantages: a fixed input size, and relatively poor detection of small objects (see the summary above).

YOLO V2/YOLO9000

Batch norm: applied to every layer.
Working backwards from the 13x13 output through the total 32x downsampling gives the 416 input (13 × 32 = 416).

Bounding-box distance for clustering: the overlap of the two boxes over the union of the two areas (IOU), which eliminates the influence of box size.
Redesigned network:
Each time max pooling (downsampling), the channel is doubled.
For each box: predict (5 + C) values, i.e. (5 + C) × B per cell; because of the anchors, the class prediction now attaches to each box.

YOLO v2 drops the fully connected layers and uses anchor boxes instead (probably borrowed from SSD / Faster R-CNN). One pooling layer is removed to obtain a higher resolution. To end up with an odd number of positions, so that there is a single well-defined center, the resolution is changed to 416x416, which finally yields a 13x13 feature map.
K-means clusters the bounding boxes. Traditional clustering uses Euclidean distance, under which larger boxes tend to produce larger errors, so the author uses 1 − IOU as the clustering distance. Because the model uses only convolution and pooling, the input size can be changed at will; the author trains one network over multiple sizes, {320, 352, ..., 608}. The larger the size, the higher the accuracy and the slower the speed, so the resolution can be chosen as needed. Most previous detection frameworks are based on VGG-16, a powerful and accurate network whose computation cost is too high, so the author built a new backbone, Darknet-19.
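A minimal sketch of k-means over box shapes with the 1 − IOU distance described above (the (w, h) data here are random placeholders for normalized ground-truth box shapes):

```python
import numpy as np

def iou_wh(box, centroids):
    # compare (w, h) pairs as if the boxes shared a top-left corner
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_boxes(boxes, k=5, iters=100):
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest 1 - IOU distance
        assign = np.array([np.argmin(1 - iou_wh(b, centroids)) for b in boxes])
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

boxes = np.random.rand(500, 2)    # placeholder (w, h) ground-truth shapes
print(kmeans_boxes(boxes, k=5))   # 5 anchor priors, as in YOLO v2
```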

Some new networks

HyperNet: proposed by Tsinghua University in 2016.

RFCN: Kaiming He.

Light-Head RCNN: Megvii and Tsinghua University.

Sources (partial): 1. https://blog.csdn.net/u011974639/article/details/78483779
2. https://www.cnblogs.com/hellcat/p/9749538.html

The picture materials above come from public papers and open-source websites; if anything infringes, please contact me privately to have it removed.

Origin: blog.csdn.net/weixin_43152520/article/details/101285504