Common Classical Object Detection Algorithms

1. Basic Concepts of Object Detection

1.1 What is object detection?

The task of object detection is to find all the objects of interest in an image and determine their categories and positions.


1.2 The core problems in object detection

In addition to image classification, object detection must also solve these core problems:
1. A target may appear anywhere in the image.
2. Targets come in various sizes.
3. Targets may have various shapes.

1.3 Classification of Object Detection Algorithms

  • Two-stage object detection algorithms
    first perform region generation (region proposal, RP: pre-selected boxes that may contain the objects to be detected), and then classify each proposal with a convolutional neural network.
    Task: feature extraction -> generate RPs -> classification / location regression.
    Common two-stage algorithms include: R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, and R-FCN.
  • One-stage object detection algorithms
    skip region proposals and predict object classes and locations directly from network features.
    Task: feature extraction -> classification / location regression.
    Common one-stage algorithms include: OverFeat, YOLOv1, YOLOv3, SSD, and RetinaNet.
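The two pipelines above can be contrasted in a tiny sketch. Every function here is an illustrative placeholder (not a real library API): the backbone, proposal generator, and head are stand-ins whose only job is to show which steps each family runs.

```python
# Sketch contrasting two-stage and one-stage pipelines with placeholder
# functions; all names are illustrative, not a real detection API.

def extract_features(image):
    return {"feature_map": image}          # stand-in for a CNN backbone

def generate_proposals(features, n=3):
    return [f"RP_{i}" for i in range(n)]   # stand-in for Selective Search / RPN

def classify_and_regress(features, proposals=None):
    # with proposals: one prediction per RP; without: dense direct prediction
    regions = proposals if proposals is not None else ["dense_grid"]
    return [(r, "class", (0, 0, 10, 10)) for r in regions]

def two_stage_detector(image):
    feats = extract_features(image)
    proposals = generate_proposals(feats)   # the extra proposal stage
    return classify_and_regress(feats, proposals)

def one_stage_detector(image):
    feats = extract_features(image)
    return classify_and_regress(feats)      # predict directly, no RP step

print(len(two_stage_detector("img")))  # one detection per proposal: 3
print(len(one_stage_detector("img")))  # direct prediction: 1
```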

1.4 Application fields

  • Face detection
  • Pedestrian detection
  • Vehicle detection
  • Road detection
  • Obstacle detection
  • etc.

2. Two-Stage Object Detection Algorithms

2.1 R-CNN

2.1.1 Innovations of R-CNN

  • Use a CNN (ConvNet) to compute feature vectors for region proposals. Moving from hand-crafted features (SIFT, HOG) to data-driven features (CNN feature maps) improves how well the features represent the samples.
  • Use supervised pre-training on a large dataset (ILSVRC) followed by fine-tuning on a small one (PASCAL) to solve the problem that small datasets are hard to train on and prone to overfitting.

Note: ILSVRC is the well-known ImageNet challenge, with a huge amount of data; the PASCAL datasets (covering object detection, image segmentation, etc.) are relatively small.
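The pre-train/fine-tune recipe can be sketched in a few lines of PyTorch. This is a minimal illustration with a toy backbone standing in for AlexNet; the layer sizes and the 21-class head (20 PASCAL VOC classes plus background) are the only numbers taken from the text, everything else is illustrative.

```python
import torch
import torch.nn as nn

# Toy backbone standing in for a network pre-trained on ILSVRC.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(8, 1000)      # original head: 1000 ImageNet classes

# Fine-tuning on the small dataset: freeze the pre-trained backbone and
# swap in a fresh head sized for the detection classes.
for p in backbone.parameters():
    p.requires_grad = False
classifier = nn.Linear(8, 21)        # PASCAL VOC: 20 classes + background

x = torch.randn(2, 3, 32, 32)        # dummy batch of 2 images
logits = classifier(backbone(x))
print(logits.shape)                  # torch.Size([2, 21])
```

In the real R-CNN recipe the backbone is not fully frozen; it is fine-tuned with a small learning rate, but the freeze above keeps the sketch short.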

2.1.2 Introduction to R-CNN

As the first-generation algorithm of the R-CNN series, R-CNN does not rely on "deep learning" alone; it combines deep learning with traditional computer vision techniques. Several steps of the R-CNN pipeline are classic computer-vision methods: Selective Search extracts the region proposals, and SVMs perform the classification.

  1. Pre-trained model. Choose a pre-trained convolutional network (such as AlexNet or VGG).

  2. Retrain the fully connected layer. Re-train the last fully connected layer on the classes of objects to be detected.

  3. Extract proposals and compute CNN features. Use the Selective Search algorithm to extract all proposals (about 2000 per image), resize/warp them to the fixed CNN input size, and save the resulting features to local disk.

  4. Train SVMs. Use the saved features to train SVMs that separate objects from background (one binary SVM per class).

  5. Bounding-box regression. Train a linear regressor that outputs correction factors for the boxes.
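The five steps above can be strung together as a high-level sketch. Every function is a placeholder for the real component named in its comment; the boxes, scores, and feature values are made up purely to make the control flow runnable.

```python
# High-level sketch of the R-CNN pipeline; each function is a placeholder
# for the real component named in the comment.

def selective_search(image):
    # step 3: ~2000 class-independent region proposals per image (2 here)
    return [(0, 0, 50, 80), (10, 10, 120, 60)]

def warp(image, box, size=(227, 227)):
    # resize/warp each proposal to the fixed CNN input size
    return ("warped", size)

def cnn_features(region):
    # steps 1-3: the pre-trained ConvNet turns each warped region into a vector
    return [0.1, 0.2, 0.3]

def svm_scores(feat):
    # step 4: one binary SVM per class scores object vs. background
    return {"person": 0.9, "background": 0.1}

def bbox_regress(feat, box):
    # step 5: a linear regressor outputs small corrections to the box
    return tuple(c + 1 for c in box)

image = "input.jpg"
detections = []
for box in selective_search(image):
    feat = cnn_features(warp(image, box))
    scores = svm_scores(feat)
    detections.append((max(scores, key=scores.get), bbox_regress(feat, box)))
print(detections)   # [(label, corrected_box), ...] per proposal
```

Note how the CNN runs once per proposal; this repeated computation is exactly what Fast R-CNN removes.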

2.1.3 R-CNN experimental results

R-CNN achieved an mAP of 58.5% on the VOC 2007 test set, beating all object detection algorithms of its time.

2.2 Fast R-CNN

2.2.1 What are the innovations of Fast R-CNN?

  1. Features are extracted only once over the entire image.
  2. The last max-pooling layer is replaced with an RoI pooling layer, which takes the proposal boxes as additional input and extracts a feature for each proposal.
  3. At the end of the network, two fully connected layers run in parallel to output the classification result and the window-regression result simultaneously, enabling end-to-end multi-task training (except for proposal extraction) with no extra feature storage (in R-CNN, features had to be kept on disk to train the SVMs and bounding-box regressors).
  4. SVD is used to factor the parallel fully connected layers at the end of the network, reducing computational complexity and speeding up detection.
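The SVD trick in point 4 can be shown concretely with NumPy: factor a fully connected layer's weight matrix W (out × in) into two thinner layers, so the layer costs roughly (in + out) × k multiply-adds instead of in × out when k is small. The sizes below are illustrative, not the exact Fast R-CNN layer shapes.

```python
import numpy as np

# Truncated-SVD factorization of a fully connected layer's weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 4096))     # original FC weight (out x in)
x = rng.standard_normal(4096)            # one input feature vector

U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 64                                   # keep the top-k singular values
W1 = Vt[:k]                              # k x 4096: first, thin FC layer
W2 = U[:, :k] * S[:k]                    # 256 x k: second, thin FC layer

y_full = W @ x                           # original: 256*4096 multiply-adds
y_svd = W2 @ (W1 @ x)                    # factored: (4096 + 256)*k
rel_err = np.linalg.norm(y_full - y_svd) / np.linalg.norm(y_full)
print(y_svd.shape, rel_err)
```

The approximation error depends on how fast the singular values decay; trained FC weights are far more compressible than the random matrix used here.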

2.2.2 Fast R-CNN Introduction

Fast R-CNN improves on both R-CNN and SPPnets. SPPnets' innovation is to compute one shared feature map for the entire image and then map each object proposal onto it to obtain a feature vector (so features need not be computed repeatedly). SPPnets has drawbacks, though: like R-CNN, its training is a multi-stage pipeline, it is still not "fast" enough, and features must be saved to local disk.

Fast R-CNN applies the region proposals directly to the shared feature map and uses RoI pooling to convert them into fixed-size features. The flowchart of Fast R-CNN follows.

2.2.3 Detailed explanation of the RoI pooling layer

Because Fast R-CNN ends in fully connected layers, RoI pooling is applied to convert RoIs of different sizes into a fixed size.
RoI pooling is a pooling layer that operates on RoIs. Its defining property is that the input feature map can be any size while the output feature map has a fixed size (such as 7×7).

  • What is an RoI?
    RoI is short for Region of Interest. It generally means a region box on an image; here it refers to a candidate box extracted by Selective Search.
  1. Extracting candidate boxes.
    The proposal stage (e.g. the RPN) usually outputs more than one rectangle, so RoI pooling is applied to multiple RoIs at once.
  2. The input of RoI pooling consists of two parts:
    1. The feature map: the shared feature map, which in Fast R-CNN sits just before the RoI pooling layer. In Faster R-CNN it is shared with the RPN, and is often called "share_conv".
    2. The RoIs: an N×5 matrix holding all RoIs, where N is the number of RoIs, the first column is the image index, and the remaining four columns are the coordinates of each box's upper-left and lower-right corners.

In Fast R-CNN the RoIs are the output of Selective Search; in Faster R-CNN they are the output of the RPN, a set of rectangular candidate boxes each described by 4 coordinates plus an image index. It is worth noting that the coordinates are given in the reference frame of the original image (the initial input of the network), not the feature map. This often causes confusion: whose coordinates are these? The answer is simple: we know the size of the original image and the coordinates of each candidate box extracted by Selective Search, and the feature-map size follows from the network's downsampling, so a "mapping relationship" converts the boxes to coordinates on the feature map. The conversion is just a ratio, as introduced below. An RoI can therefore be understood as a candidate box (region proposal) on the original image.
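The N×5 RoI matrix and the ratio-based mapping can be made concrete with a small NumPy example. The box values are made up, and the stride of 16 is an assumption matching a VGG-style backbone that halves the resolution four times.

```python
import numpy as np

# N x 5 RoI matrix: image index + box corners on the ORIGINAL image.
rois = np.array([
    # image_index, x1, y1, x2, y2
    [0,  32,  64, 223, 191],
    [0, 160,  16, 319, 127],
], dtype=float)

# "Mapping relationship": divide coordinates by the ratio of input-image
# size to feature-map size (the backbone stride, assumed 16 here).
stride = 16.0
rois_on_fmap = rois.copy()
rois_on_fmap[:, 1:] = np.round(rois[:, 1:] / stride)
print(rois_on_fmap)   # same boxes, now in feature-map coordinates
```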

2.2.4 Specific operation of RoI pooling

  1. According to the input image, map each RoI to the corresponding position on the feature map.
    Note: the mapping rule is simple: divide each coordinate by the ratio of the input-image size to the feature-map size to obtain the box coordinates on the feature map.
  2. Divide the mapped region into sections of equal size (the number of sections equals the output dimension).
  3. Perform a max-pooling operation on each section.

In this way, boxes of different sizes yield feature maps of a fixed size. It is worth noting that the output size depends on neither the RoI size nor the convolutional feature-map size. The biggest advantage of RoI pooling is the large speedup it brings.
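The three steps above can be sketched as a minimal RoI max-pooling function in NumPy. This is an illustration, not the exact Fast R-CNN implementation: a single channel, simple rounding for the coordinate mapping, and an assumed stride of 16.

```python
import numpy as np

def roi_pool(fmap, roi, out_h=2, out_w=2, stride=16):
    """Max-pool one RoI (image coords) on a 2-D feature map to out_h x out_w."""
    # Step 1: map the RoI to feature-map coordinates by the stride ratio.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]
    region = fmap[y1:y2 + 1, x1:x2 + 1]
    # Step 2: split the region into an out_h x out_w grid of sections.
    h, w = region.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    # Step 3: take the max of each section.
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 feature map
pooled = roi_pool(fmap, roi=(0, 0, 79, 79))       # RoI given in input-image pixels
print(pooled.shape)   # always (2, 2), whatever the RoI size
```

Whatever rectangle is passed in, the output is always out_h × out_w, which is exactly what the fully connected layers downstream require.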

2.2.5 Output of RoI pooling

The output is a batch of feature maps, where the batch size equals the number of RoIs and each one has size channels × w × h. RoI pooling thus maps box rectangles of different sizes into fixed-size (w × h) rectangles.
(Figure: RoI pooling example)

2.3 Faster R-CNN

2.3.1 What are the innovations of Faster R-CNN?

Fast R-CNN relies on an external proposal method such as Selective Search. These algorithms run on the CPU and are slow: at test time, Fast R-CNN needs 2.3 seconds per prediction, of which 2 seconds are spent generating 2000 RoIs. Faster R-CNN adopts the same design as Fast R-CNN but replaces the external region-proposal method with an internal deep network. The new Region Proposal Network (RPN) generates RoIs far more efficiently, running at about 10 ms per image.
(Figure: Faster R-CNN flowchart)
The Region Proposal Network (RPN) takes the output feature map of the backbone convolutional network as input. It slides a 3×3 convolutional kernel over the feature map to build class-agnostic proposals (the original paper used a ZF network; deeper networks such as VGG or ResNet give richer features at the expense of speed). The ZF network outputs a 256-d vector at each position, which is fed into two separate fully connected layers that predict the bounding box and two objectness scores measuring whether the box contains an object. A single regressed objectness score would also work, but for simplicity Faster R-CNN uses a classifier with only two classes: object and not-object.
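The RPN head just described can be sketched in PyTorch: a 3×3 convolution over the shared feature map, followed by two sibling 1×1 convolutions that emit, per anchor, 2 objectness scores and 4 box offsets. The 512-channel input, 256-d intermediate, and k = 9 anchors follow the common VGG-style setup; the feature-map size is arbitrary.

```python
import torch
import torch.nn as nn

k = 9                                        # anchors per sliding-window position
shared = nn.Conv2d(512, 256, 3, padding=1)   # the 3x3 sliding window -> 256-d
cls_head = nn.Conv2d(256, 2 * k, 1)          # object vs. not-object per anchor
reg_head = nn.Conv2d(256, 4 * k, 1)          # 4 box offsets per anchor

fmap = torch.randn(1, 512, 38, 50)           # shared conv feature map (toy size)
h = torch.relu(shared(fmap))
scores, deltas = cls_head(h), reg_head(h)
print(scores.shape, deltas.shape)            # (1, 18, 38, 50), (1, 36, 38, 50)
```

Because both heads are 1×1 convolutions, the same weights slide over every position of the feature map, which is what makes the RPN so much cheaper than running an external proposal algorithm.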


Origin blog.csdn.net/qq_43679351/article/details/125066980