Object Detection: Fast R-CNN, YOLO v3

Table of contents

Object Detection

R-CNN

SPPNet

Fast R-CNN

Faster R-CNN

YOLO v1

YOLO v2

YOLO v3


Object Detection

Object detection is a fundamental problem in computer vision and underpins tasks such as image segmentation, object tracking, and image captioning. The task is to determine whether objects of given categories are present in an input image and, if so, to output each object's location in the image as a rectangular bounding box (Xmin, Ymin, Xmax, Ymax).

Early object detection algorithms did not use deep learning and were generally divided into three stages: region selection, feature extraction, and feature classification.

  • Region selection: a sliding-window search enumerates the positions where objects may appear in the image. This produces a large number of redundant candidate boxes and is computationally expensive.

  • Feature extraction: hand-designed feature extractors (such as SIFT and HOG) compute a descriptor for each window.

  • Feature classification: a classifier (such as an SVM) labels the features extracted in the previous step.

The workflow of the traditional HOG + SVM detector is as follows:

[Figure: HOG + SVM detection pipeline]
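
As a concrete illustration of this three-stage pipeline, here is a minimal sketch (not from the original post) of sliding-window detection with HOG features and a linear SVM, assuming scikit-image and scikit-learn; the window size, stride, and HOG parameters are illustrative choices.

```python
import numpy as np
from skimage.feature import hog          # hand-crafted HOG descriptor
from sklearn.svm import LinearSVC

def extract_hog(patch):
    # HOG descriptor for one fixed-size grayscale window
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def sliding_window_detect(image, clf, win=(64, 128), stride=16):
    """Slide a fixed-size window over the image and score each position.
    Returns candidate boxes (x1, y1, x2, y2, score); note how many
    redundant windows this enumerates."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - win[1], stride):
        for x in range(0, w - win[0], stride):
            patch = image[y:y + win[1], x:x + win[0]]
            score = clf.decision_function([extract_hog(patch)])[0]
            if score > 0:  # positive SVM margin -> candidate box
                detections.append((x, y, x + win[0], y + win[1], score))
    return detections

# Training (labels: 1 = object, 0 = background):
# clf = LinearSVC().fit([extract_hog(p) for p in patches], labels)
```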

In 2014, R-CNN (Regions with CNN features) brought deep learning to object detection, opening the deep-learning era of the field. Object detection models can be roughly divided into one-stage and two-stage models. A one-stage model has no independent region-proposal (Region Proposal) step: the image is fed in directly and the model outputs the categories and locations of the objects it contains. Typical one-stage models include SSD (Single Shot MultiBox Detector) and the YOLO (You Only Look Once) family. A two-stage model has an independent proposal step: it first screens out candidate regions that may contain objects, then judges whether each candidate actually contains a target and, if so, outputs its category and location. The classic two-stage models are R-CNN, SPPNet, Fast R-CNN, and Faster R-CNN.

The following figure summarizes the development history of some classic object detection models:

[Figure: development timeline of classic object detection models]

In general, one-stage models have the advantage in computational efficiency, while two-stage models have the advantage in detection accuracy. This gap in speed and accuracy is generally attributed to the following:

1. Most one-stage models rely on preset anchor boxes (Anchor Box) to cover the regions where objects may appear. Since the boxes that actually contain objects are far fewer than the total number of anchors, the positive and negative samples used to train the classifier are extremely imbalanced, which hurts classifier performance.

2. Two-stage models refine the positions of the candidate boxes, which brings higher localization accuracy but also increases model complexity.

The following sections briefly trace the development of the two-stage models.

R-CNN

First, the unsupervised Selective Search (SS) method merges regions of the input image with similar color and texture to generate about 2000 candidate regions;

Then the image patch for each candidate region is cropped and warped to a fixed size and fed into a CNN to extract features;

The features are fed into a per-category SVM classifier to decide whether the region belongs to that category;

A linear regressor refines the position and size of each box, and finally Non-Maximum Suppression (NMS) is applied to the detection results.

[Figure: R-CNN pipeline]
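
The NMS step in the last stage removes duplicate detections of the same object. Below is a minimal NumPy sketch of the standard greedy algorithm; the 0.5 IoU threshold is a common but illustrative choice.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy Non-Maximum Suppression: keep the highest-scoring box,
    drop any remaining box whose IoU with it exceeds the threshold.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) array."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # keep only low-overlap boxes
    return keep
```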

SPPNet

In R-CNN, cropping and warping each candidate region to a fixed size distorts the aspect ratio of the cropped patch and loses information. To address this, SPPNet introduces a spatial pyramid pooling (Spatial Pyramid Pooling) layer, placed after the last convolutional layer, so the input no longer needs to be scaled to a fixed size. In the figure below, the first row is R-CNN and the second row is SPPNet; comparing them shows the difference.

[Figure: R-CNN (top row) vs. SPPNet (bottom row)]

The idea of SPPNet is to divide a feature map of arbitrary size into 16, 4, and 1 blocks, max-pool each block, and concatenate the pooled features to obtain a fixed-dimensional output.

[Figure: spatial pyramid pooling layer]
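
A minimal PyTorch sketch of this pooling scheme, using the 4×4, 2×2, and 1×1 pyramid levels described above (16 + 4 + 1 = 21 blocks); the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(4, 2, 1)):
    """Max-pool a feature map of any spatial size into 4x4, 2x2 and 1x1
    grids and concatenate, giving a fixed-length vector per image.
    feat: (N, C, H, W) with arbitrary H and W."""
    n, c = feat.shape[:2]
    pooled = [F.adaptive_max_pool2d(feat, (s, s)).view(n, -1)
              for s in levels]
    return torch.cat(pooled, dim=1)  # (N, C * 21), independent of H and W

# e.g. a (1, 256, 13, 17) map and a (1, 256, 9, 9) map both yield (1, 5376)
```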

Fast R-CNN

The idea of Fast R-CNN is consistent with that of SPPNet; the difference is that Fast R-CNN uses Region-of-Interest (RoI) pooling, a single-level simplification of spatial pyramid pooling. Compared with R-CNN, Fast R-CNN replaces the separate SVM classifiers and linear regressors with a fully connected network that performs both object classification and detection-box refinement. Fast R-CNN thus has two outputs: a category prediction through a softmax layer, and the object's detection box.

[Figure: Fast R-CNN architecture]
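
For illustration, the sketch below uses torchvision's roi_pool operator to show how regions of different sizes are pooled to a fixed 7×7 grid before the fully connected heads; the feature-map size, image size, and example boxes are made-up values.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 50, 50)            # backbone feature map
# one region per row: (batch_index, x1, y1, x2, y2) in input-image pixels
rois = torch.tensor([[0, 32., 32., 400., 240.],
                     [0, 100., 60., 300., 360.]])
# spatial_scale maps image coordinates (assumed 800x800 input)
# onto the 50x50 feature map
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> fed to the FC heads
```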

Faster R-CNN

Building on Fast R-CNN, Faster R-CNN replaces its most time-consuming step, candidate region extraction, with a Region Proposal Network (RPN). In Faster R-CNN, the RPN first extracts candidate regions from the input image; the feature-map patch corresponding to each candidate is then taken out and sent to the Fast R-CNN half of the network (the part after the RPN) for object classification and position regression.

[Figure: Faster R-CNN architecture]
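
A minimal sketch of an RPN head, following the structure described in the Faster R-CNN paper: a shared 3×3 convolution followed by two sibling 1×1 convolutions that predict, for each of the k anchors at every spatial position, an objectness score and 4 box offsets. The channel and anchor counts are illustrative defaults.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding 3x3 conv over the shared feature map, then two sibling
    1x1 convs: one for object/background scores, one for box deltas."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(in_channels, num_anchors * 2, 1)  # objectness
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box deltas

    def forward(self, feat):
        h = self.relu(self.conv(feat))
        return self.cls(h), self.reg(h)
```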

The Region-CNN (Region-based Convolutional Neural Network) family reduces object detection to a classification problem: first find the regions (bounding boxes) where a target may exist, then classify those boxes to identify the target. YOLO instead casts detection as a regression problem, directly predicting bounding boxes and the associated class information. YOLO is a single network that can be trained end to end; it needs neither a separate search for region proposals nor a separate classifier, so it is particularly fast: YOLO reaches 45 FPS, and Fast YOLO reaches 155 FPS. YOLO makes fewer background errors and generalizes fairly well to new domains, but its biggest weakness is inaccurate detection of small objects.

YOLO v1

1. The input image (448 × 448) is divided into S × S grid cells (S = 7 in YOLO v1). If the center of an object falls inside a grid cell, that cell is responsible for detecting the object.

2. Each grid cell predicts B bounding boxes, a confidence score for each box, and the conditional probabilities of the object classes.

  • Each bounding box consists of five parameters: center x coordinate, center y coordinate, width, height, and confidence.

  • The confidence score indicates how likely the box is to contain an object and how accurate the box is: confidence = Pr(Object) × IoU(pred, truth).

  • If the cell contains an object, it also predicts the probability that the object belongs to each class, Pr(Class_i | Object).

3. With the input image divided into S × S grid cells, each cell predicting B bounding boxes with confidence plus C class probabilities, the final prediction is a tensor of shape S × S × (B × 5 + C). For YOLO v1 on PASCAL VOC, S = 7, B = 2, and C = 20, so the output is 7 × 7 × 30.

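A small NumPy sketch (not from the original post) of how this output tensor is laid out and decoded; the random tensor stands in for a real network output.

```python
import numpy as np

S, B, C = 7, 2, 20                      # YOLO v1 settings for PASCAL VOC
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for a network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence)
class_probs = pred[..., B * 5:]                # Pr(class_i | object)

# class-specific score = Pr(object) * IoU * Pr(class_i | object)
scores = boxes[..., 4:5] * class_probs[:, :, None, :]  # (S, S, B, C)
cell_y, cell_x, box, cls = np.unravel_index(scores.argmax(), scores.shape)
print(f"best box in cell ({cell_y},{cell_x}), class {cls}")
```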

YOLO v2

YOLO v2 improves on YOLO v1 in roughly three areas: the network structure, the design of the prior boxes, and training techniques.

1. Network structure: a new backbone network is proposed, called DarkNet-19.

[Figure: DarkNet-19 network structure]

  • BN layers: a batch normalization (BN) layer is added after each convolutional layer.

  • The 7×7 convolution of v1 is replaced by stacked 3×3 convolutions, which reduces computation while increasing network depth. In addition, DarkNet removes the fully connected layers and the Dropout layer.

  • Passthrough layer: DarkNet also fuses deep and shallow features, so fine-grained detail from earlier layers helps detect smaller objects.

2. Prior box design: YOLO v2 uses a clustering algorithm (k-means with an IoU-based distance) on the training-set boxes to determine the scales of the prior boxes; see the sketch after this list.

3. Training techniques: YOLO v2 trains on images of multiple scales; during training, the model changes the size of the input image every 10 batches.
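
Below is a minimal NumPy sketch of the anchor clustering from point 2, using the 1 − IoU distance that YOLO v2's dimension clusters are based on; the initialization and median update are illustrative simplifications (the sketch assumes no cluster goes empty).

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k cluster boxes, all anchored
    at the origin, so only shape matters."""
    inter = (np.minimum(box[0], clusters[:, 0])
             * np.minimum(box[1], clusters[:, 1]))
    return inter / (box[0] * box[1]
                    + clusters[:, 0] * clusters[:, 1] - inter)

def kmeans_anchors(wh, k=5, iters=100):
    """Cluster training-set box sizes with distance = 1 - IoU."""
    clusters = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the cluster with the highest IoU
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in wh])
        # move each cluster to the median size of its members
        clusters = np.array([np.median(wh[assign == j], axis=0)
                             for j in range(k)])
    return clusters

# wh = np.array of (width, height) pairs from the training labels
# anchors = kmeans_anchors(wh, k=5)
```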

YOLO v3

YOLO v3 makes a few changes on top of YOLO v2.

1. YOLO v3 replaces the Softmax function with independent Logistic (sigmoid) classifiers. The reason is that Softmax's class predictions suppress one another, so only one class can be predicted per box, whereas independent logistic classifiers allow multi-label prediction.

2. YOLO v3 uses a deeper network as its feature extractor (DarkNet-53), containing 53 convolutional layers. To avoid the vanishing-gradient problem of such a deep network, DarkNet-53 borrows the residual idea from ResNet and uses a large number of residual connections, as sketched below.
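
A minimal PyTorch sketch of one such residual block, following the commonly described DarkNet-53 pattern (a 1×1 channel-reducing convolution, a 3×3 convolution, and a skip connection); exact layer settings are assumptions.

```python
import torch.nn as nn

class DarkResidual(nn.Module):
    """DarkNet-53-style residual block: a 1x1 conv halves the channels,
    a 3x3 conv restores them, and the skip connection adds the input
    back (the ResNet idea that keeps gradients flowing in deep nets)."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)
```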


Source: blog.csdn.net/qq_38998213/article/details/132502477