Object Detection Study Notes (Overview, Traditional Algorithms, and Deep Learning Algorithms)


Object Detection

These notes were prepared for a course discussion. They cover the theory of object detection algorithms only and contain no hands-on material. Comments and discussion are welcome.


1. Basic concepts of object detection

(1) What is object detection

The task of object detection is to find every object of interest in an image and to determine both its category and its position. Unlike plain classification, which only answers what is in the image, detection has to solve classification and localization together, which makes it one of the core problems in computer vision.
Figure 1 Object detection algorithm

(2) The tasks of object detection

Image recognition in computer vision covers four major categories of tasks:
(1) Classification: answers "what is it?". Given an image or a video, decide which categories of objects it contains.
(2) Localization: answers "where is it?". Find the position of the object.
(3) Detection: answers "where is it, and what is it?". Find the position of each object and identify its category.
(4) Segmentation: answers "which object or scene does each pixel belong to?", that is, the "where and what" question at the pixel level.

(3) Classification of object detection algorithms

1. Traditional object detection algorithms

For details, see the reference: Common Classical Object Detection Algorithms
Unlike today's convolutional neural networks, which learn effective image features automatically, traditional object detection algorithms rely mainly on hand-crafted features.
The traditional detection algorithm process can be summarized as follows:

1) Select regions of interest, that is, regions that may contain objects
2) Extract features from those regions
3) Classify the extracted features to detect objects

Limitations of traditional detection algorithms:
Traditional object detection algorithms based on hand-crafted features have three main shortcomings:

1) Recognition quality is limited and accuracy is low
2) The amount of computation is large, so they run slowly
3) Several overlapping correct detections may be produced for the same object

(1) Viola Jones Detector

The VJ (Viola-Jones) detector slides a window over the image and checks whether the target is present in each window. The approach is simple and stable, but evaluating every window is computationally very expensive. To address this, the VJ detector combines three techniques that greatly improve detection speed:
1) Fast feature computation: the integral image;
2) Effective classifier learning: AdaBoost;
3) An efficient classification strategy: a cascade structure.
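
To make the first of these techniques concrete, here is a minimal sketch of an integral image in plain Python. The function names `integral_image` and `box_sum` and the 3×3 toy image are assumptions chosen for illustration, not code from the VJ detector itself. The point is that, once the table is built, the sum of any rectangle costs four lookups regardless of its size.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum above this row plus the running sum of the current row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in rows y0..y1-1 and columns x0..x1-1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


# Toy 3x3 "image" (hypothetical data).
img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)         # whole image
bottom_right = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
print(total, bottom_right)
```

Rectangle sums are the building block of Haar-like features, which is why this table makes each feature cheap to evaluate at every window position.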

(2) HOG Detector

The HOG (Histogram of Oriented Gradients) detector was proposed in 2005 and was an important improvement over the Scale-Invariant Feature Transform (SIFT) and Shape Contexts of the time. To balance feature invariance (to translation, scale, illumination, and so on) against nonlinearity (the ability to discriminate between object categories), HOG is computed on a dense grid of evenly spaced cells and uses overlapping local contrast normalization to improve detection accuracy. The detector therefore extracts feature histograms from local pixel blocks, which makes it robust to local deformations of objects and to lighting changes. It laid an important foundation for many later detection methods, and related techniques are widely used across computer vision applications.

(3) DPM Detector

As the winner of the VOC 2007-2009 object detection challenges, DPM (Deformable Parts Model) was the well-deserved SOTA (state of the art) among traditional object detection algorithms. It was proposed in 2008 and has been improved many times since, and it can be regarded as an extension of the HOG detector. The model consists of a root filter and several part filters, and it improves detection accuracy through hard negative mining, bounding box regression, and context priming. As the SOTA among traditional detectors, DPM is fast and can adapt to object deformation, but it cannot handle large rotations, so its stability is poor.

2. Object detection algorithms based on deep learning

Traditional detection algorithms based on hand-crafted features progressed slowly and performed poorly. That changed in 2012, when the rise of convolutional neural networks (CNNs) pushed object detection to a new level. CNN-based detectors follow two main technical routes, anchor-based and anchor-free, which differ in whether anchors are used for training and prediction. Anchor-based methods include one-stage and two-stage detectors: two-stage detectors are generally more accurate, while one-stage detectors are faster. Anchor-free algorithms have also improved steadily in recent years.

(1) Two Stage

First generate region proposals (RPs for short: candidate boxes that may contain the objects to be detected), then classify each proposal with a convolutional neural network.

Task flow: feature extraction --> generate RP --> classification/localization regression.

Common two-stage object detection algorithms include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, and R-FCN.

(2) One Stage

No region proposals are used. Features are extracted by the network and used directly to predict object categories and locations.

Task flow: feature extraction --> classification/localization regression.

Common one-stage object detection algorithms include OverFeat, the YOLO series, SSD, and RetinaNet.

(3) Anchor-Free

Without anchors, there are two main ways to represent the detection box:

1. Keypoint-based detection: first detect the top-left and bottom-right corners of the object, then pair the corners to form a detection box.
2. Center-based detection: directly detect the center region and boundary information of the object, decoupling classification and regression into two sub-networks.

(4) Applications of object detection algorithms

[Figure: applications of object detection algorithms]

2. Principles of object detection

Reference article: Object Detection (Object Detection)
Object detection algorithms fall into two main series, the R-CNN series and the YOLO series. The R-CNN series represents algorithms based on region proposals, while YOLO represents algorithms that regress boxes and categories directly from the whole image. The well-known SSD builds on ideas from both series.

(1) Generation of candidate regions

1. Sliding window

The sliding window method works as follows. Windows of several sizes are slid over the input image from left to right and from top to bottom. At every position a classifier, trained in advance, is run on the current window, and an object is considered detected if the window receives a high classification probability. Repeating this for every window size yields detections from different windows, many of which overlap heavily, so non-maximum suppression (NMS) is applied to filter them. The boxes that survive NMS are the final detections.
The sliding window method is simple and easy to understand, but searching the whole image at many window sizes is inefficient, and the aspect ratios of objects must be taken into account when choosing window sizes. It is therefore not recommended for applications with strict real-time requirements.
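
The scan described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_classifier` is a made-up stand-in for a pre-trained classifier, and the image size, window sizes, stride, and threshold are arbitrary values.

```python
def sliding_windows(img_w, img_h, window_sizes, stride):
    """Yield (x, y, w, h) for every window, left to right and top to bottom."""
    for (w, h) in window_sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


def detect(img_w, img_h, window_sizes, stride, classifier, threshold):
    """Run the classifier on every window and keep the high-scoring ones."""
    hits = []
    for box in sliding_windows(img_w, img_h, window_sizes, stride):
        score = classifier(box)
        if score >= threshold:
            hits.append((box, score))
    return hits


def toy_classifier(box):
    """Hypothetical classifier: scores a window by how close its centre is to (50, 55)."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    return 1.0 / (1.0 + abs(cx - 50) + abs(cy - 55))


sizes = [(40, 40), (60, 30)]  # two window shapes, i.e. two aspect ratios
windows = list(sliding_windows(100, 100, sizes, stride=20))
hits = detect(100, 100, sizes, 20, toy_classifier, threshold=0.5)
print(len(windows), hits)
```

Even on this 100×100 toy image with a coarse stride, two window shapes already produce dozens of classifier calls, which illustrates why the exhaustive search becomes expensive at realistic resolutions.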

2. Selective search

The sliding window method amounts to an exhaustive search over image sub-regions, yet most sub-regions contain no object. Researchers therefore proposed searching only the regions most likely to contain objects, which improves computational efficiency. Selective search (SS) is the best-known region proposal algorithm of this kind; it was proposed by Koen E. A. van de Sande et al. in 2011.
The main idea of selective search is that a region containing an object should be internally similar or continuous. Selective search therefore extracts bounding boxes by merging sub-regions. First, a segmentation algorithm is run on the input image to produce many small sub-regions. Second, the sub-regions are merged according to their similarity (the criteria mainly include color, texture, and size), and the merging is repeated iteratively. In each iteration, a bounding box (circumscribed rectangle) is drawn around every merged sub-region; these rectangles are what is usually called candidate boxes.

The process is as follows:

step0: Generate the initial region set R
step1: Compute the similarity of every pair of adjacent regions in R, giving S = {s1, s2, ...}
step2: Find the two regions with the highest similarity, merge them into a new region, and add it to R
step3: Remove from S all similarities that involve the two regions merged in step2
step4: Compute the similarity between the new region and its adjacent regions, and add these to S
step5: Go back to step2 until S is empty
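
The merging loop in steps 0 to 5 can be sketched as follows. This is a toy illustration, not the reference implementation: regions are sets of hypothetical superpixel ids arranged in a chain, and the similarity is simply the negative difference of mean intensity, whereas the real algorithm combines color, texture, and size measures on image segments.

```python
def mean_value(region, values):
    """Mean intensity of a region (a set of superpixel ids)."""
    return sum(values[i] for i in region) / len(region)


def hierarchical_grouping(initial, adjacent, similarity):
    """Greedy merging loop; returns every region ever created, in order."""
    regions = dict(initial)                                   # step 0: region set R
    S = {frozenset(p): similarity(regions[p[0]], regions[p[1]])
         for p in adjacent}                                   # step 1
    next_id = max(regions) + 1
    proposals = list(regions.values())
    while S:                                                  # step 5
        a, b = sorted(max(S, key=S.get))                      # step 2: best pair
        merged = regions[a] | regions[b]
        regions[next_id] = merged
        proposals.append(merged)
        neighbours = set()
        for pair in [p for p in S if a in p or b in p]:       # step 3
            neighbours |= pair
            del S[pair]
        neighbours -= {a, b}
        for n in neighbours:                                  # step 4
            S[frozenset((next_id, n))] = similarity(merged, regions[n])
        next_id += 1
    return proposals


# Four superpixels in a row with made-up intensities.
values = {0: 10, 1: 12, 2: 50, 3: 55}
initial = {i: frozenset({i}) for i in values}
adjacent = [(0, 1), (1, 2), (2, 3)]


def sim(r1, r2):
    return -abs(mean_value(r1, values) - mean_value(r2, values))


proposals = hierarchical_grouping(initial, adjacent, sim)
print(proposals[4:])
```

In a real pipeline the bounding rectangle of every region in `proposals` would become a candidate box, which is how the method yields boxes at many scales from one pass.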

The advantages are as follows:

1. It is more computationally efficient than the sliding window method
2. Because of the region merging strategy, the candidate boxes cover objects of various sizes
3. Using several different similarity measures for merging raises the probability of finding objects

(2) Data representation

Labeled samples:
[Figure: labeled samples]
Predicted output:
[Figure: predicted output]

(3) Effect evaluation

IoU (Intersection over Union) is used to judge the quality of a prediction. It is the ratio of the area of intersection to the area of union of the predicted box and the ground-truth box. An IoU of 0.5 is generally accepted as the threshold for a correct detection.
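
As a small worked example, IoU for axis-aligned boxes can be computed as below. The `(x1, y1, x2, y2)` corner format and the sample boxes are assumptions made for this sketch; the figure in the original may use a different box encoding.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
score = iou(pred, truth)  # intersection 50, union 150
print(round(score, 3))
```

Here the two boxes share half of their area, yet the IoU is only one third, which would fall below the usual 0.5 threshold.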

(4) Non-Maximum Suppression (NMS)

Several predictions may overlap on the same object. Non-maximum suppression (NMS) keeps the prediction with the highest confidence and removes the other predictions that overlap it heavily, as measured by IoU. For example, if the predictions for one object carry the probabilities 0.8, 0.9, and 0.95, only the prediction with probability 0.95 is retained after NMS.
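
A minimal sketch of greedy NMS is shown below, reusing the 0.8/0.9/0.95 example. The box coordinates, the fourth box, and the 0.5 IoU threshold are made-up values for illustration; production code would normally use a library routine instead.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Three overlapping predictions for one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51), (100, 100, 140, 140)]
scores = [0.8, 0.9, 0.95, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The box with score 0.95 suppresses the two boxes that overlap it, while the distant box survives because its IoU with the kept box is zero.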

3. Object detection models

(1) R-CNN series

Reference article: R-CNN Series Algorithm Essentials: The Road from R-CNN to Fast R-CNN to Faster R-CNN
R-CNN (Regions with CNN features) is the first generation of the R-CNN series. It does not rely heavily on deep learning; rather, it combines deep learning with traditional computer vision. For example, the second and fourth steps of the R-CNN pipeline are traditional computer vision techniques: selective search extracts the region proposals, and an SVM performs the classification.

1. R-CNN

1) Process
① Pre-trained model. Choose a pre-trained neural network (e.g. AlexNet, VGG).
② Retrain the fully connected layer. Retrain the last fully connected layer on the object categories to be detected.
③ Extract proposals and compute CNN features. Use the Selective Search algorithm to extract all proposals (about 2000 per image), resize/warp them to a fixed size to meet the CNN input requirement (imposed by the fully connected layers), and save the resulting features to local disk.
④ Train the SVMs. Use the saved features to train SVMs that separate object from background (one binary SVM per class).
⑤ Bounding box regression. Train a linear regressor that outputs correction factors for each proposal box.
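
The "correction factors" in step ⑤ are usually expressed as four regression targets relative to the proposal. The sketch below uses the parameterization from the R-CNN paper with boxes written as (center x, center y, width, height); the function names and sample numbers are assumptions, and the learned regressor that predicts the targets from CNN features is not shown.

```python
import math


def encode(proposal, truth):
    """Regression targets (tx, ty, tw, th) that map a proposal onto the ground truth."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = truth
    return ((gx - px) / pw,        # shift in x, in units of proposal width
            (gy - py) / ph,        # shift in y, in units of proposal height
            math.log(gw / pw),     # log-scale change of width
            math.log(gh / ph))     # log-scale change of height


def decode(proposal, t):
    """Apply predicted correction factors to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)   # hypothetical proposal box
truth = (54.0, 46.0, 30.0, 20.0)      # hypothetical ground-truth box
targets = encode(proposal, truth)
recovered = decode(proposal, targets)
print(targets, recovered)
```

Normalizing the shifts by the proposal size and taking logs of the scale ratios keeps the targets in a similar numeric range for small and large boxes.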
2) Effect
R-CNN achieved an mAP of 58.5% on the VOC 2007 test set, beating all object detection algorithms of the time.
3) Disadvantages
① Repeated computation: every region proposal goes through AlexNet feature extraction separately, so extracting features for all RoIs (regions of interest) takes about 47 seconds and also takes up storage space.
② Generating region proposals with selective search takes about 2 seconds per image.
③ The three modules (feature extraction, classification, regression) are trained separately, and training consumes a large amount of storage space.

2. SPPNet

Paper link: SPPNet
[Introduction] SPPNet[5] proposed the spatial pyramid pooling (SPP) layer. Its main idea is to divide the input into blocks at several scales (for example 1, 4, and 16 blocks) and to concatenate the features pooled from each block, so that features at multiple scales are taken into account. SPP lets the network produce a fixed-size feature representation before the fully connected layers regardless of the input image size. When SPPNet is used for object detection, the feature map of the whole image is computed only once. Whatever the size of a candidate box, SPP turns its region of the feature map into a fixed-size representation, which avoids recomputing the convolutional features.
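
The fixed-size property can be illustrated with a single-channel sketch in plain Python. The pyramid levels (1×1, 2×2, 4×4), the function name `spp_max_pool`, and the toy feature maps are assumptions for illustration; a real SPP layer pools every channel of a convolutional feature map.

```python
def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D map into n x n bins for each n in levels and concatenate.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for by in range(n):
            # Floor for the start and ceiling for the end, so bins are never empty.
            y0, y1 = (by * h) // n, -((-(by + 1) * h) // n)
            for bx in range(n):
                x0, x1 = (bx * w) // n, -((-(bx + 1) * w) // n)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled


small = [[y * 6 + x for x in range(6)] for y in range(4)]  # 4 x 6 map
large = [[y * 5 + x for x in range(5)] for y in range(7)]  # 7 x 5 map
vec_small = spp_max_pool(small)
vec_large = spp_max_pool(large)
print(len(vec_small), len(vec_large))
```

Both maps produce a vector of the same length (1 + 4 + 16 bins), which is what allows fully connected layers to follow regions of arbitrary size.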
[Performance] Compared with R-CNN, SPPNet is more than 20 times faster at inference without sacrificing detection accuracy (mAP = 59.2% on VOC-07).
[Deficiencies] Like R-CNN, SPPNet still trains a CNN to extract features and then trains SVMs to classify them, which requires a large amount of storage, and the multi-stage training process is complicated. In addition, SPPNet fine-tunes only the fully connected layers and leaves the parameters of the other layers untouched.
To address some of these shortcomings, R. Girshick proposed Fast R-CNN in 2015.

3. Fast R-CNN

Fast R-CNN improves on R-CNN and SPPNet. It adopts the key innovation of SPPNet: image features are extracted only once for the whole image (rather than once per candidate region), and each candidate region is then mapped onto the feature map of the whole image to obtain its features.
See the references for details.

4. Faster R-CNN

Building on R-CNN and Fast R-CNN, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun proposed Faster R-CNN in 2015. It integrates feature extraction, region proposal generation, bounding box regression, and classification into a single network, which greatly improves overall performance, especially detection speed.
See the references for details.

(2) YOLO series

Reference articles: Overview of Object Detection Algorithms
Object Detection (Object Detection)
YOLO Series Algorithm Essentials: The Road from yolov1 to yolov5 (a complete 20,000-word summary)

1. YOLOv1

Before YOLOv1, the R-CNN series dominated object detection. The R-CNN series is accurate, but because of its two-stage structure its detection speed falls short of real time, for which it was often criticized. Designing a faster object detector became the natural next step.

In 2016, Joseph Redmon, Santosh Divvala, Ross Girshick, and others proposed a one-stage object detection network. It is very fast: it processes 45 frames per second and runs comfortably in real time. Because of its speed and the particular approach it takes, the authors named it You Only Look Once (the full name of YOLO). The work was published at CVPR 2016 and attracted widespread attention.

The core idea of YOLO is to treat object detection as a regression problem: the whole image is the input of the network, and a single pass through one neural network yields the locations of the bounding boxes and their categories.
In hindsight, the network structure of YOLOv1 is straightforward; it is a conventional one-stage convolutional neural network:

Network input: a 448×448×3 color image.
Intermediate layers: several convolutional layers and max pooling layers, which extract abstract features from the image.
Fully connected layers: two fully connected layers, which predict the positions and class probabilities of objects.
Network output: a 7×7×30 prediction tensor.

(1) Detection strategy
YOLOv1 adopts a "divide and conquer" strategy: it divides an image evenly into a 7×7 grid, and each grid cell is responsible for predicting the objects whose center falls inside that cell. Recall that Faster R-CNN uses an RPN to obtain the regions of interest. That approach is accurate, but the extra RPN has to be trained, which adds to the training burden. In YOLOv1 the 49 cells of the 7×7 grid play the role of the regions of interest, so no extra RPN is needed. This is what makes YOLOv1 simple and fast as a single-stage network.
The specific implementation process is as follows:

1. Divide the image into an S×S grid of cells. If the center of an object falls in a cell, that cell is responsible for predicting the object.
2. Each cell predicts B bounding boxes, and each bounding box consists of 5 values: (x, y, w, h) and a confidence.
3. Each cell also predicts class probabilities for C categories.
4. In total there are S×S cells, each predicting B bounding boxes and C class probabilities, so the network output is a tensor of size S × S × (5×B+C).
In practice, YOLOv1 divides an image into a 7×7 grid, and each cell predicts 2 boxes (Box1 and Box2) and 20 categories. So S=7, B=2, C=20, and the shape of the network output is 7×7×30.
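
The arithmetic above, and the rule for which cell is responsible for an object, can be checked with a short sketch. The helper names and the sample object center are assumptions for illustration only.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each with 5*B + C values."""
    return (S, S, 5 * B + C)


def responsible_cell(cx, cy, img_w, img_h, S):
    """(row, col) of the grid cell that contains the object centre (cx, cy) in pixels."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centres on the right/bottom edge
    row = min(int(cy / img_h * S), S - 1)
    return row, col


shape = output_shape(S=7, B=2, C=20)
cell = responsible_cell(cx=224, cy=100, img_w=448, img_h=448, S=7)
print(shape, cell)
```

With a 448×448 input each cell covers 64×64 pixels, so an object centered at (224, 100) is assigned to the cell in row 1, column 3.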
(2) Loss function
[Figure: YOLOv1 loss function]

The loss consists of three parts, namely: coordinate prediction loss, confidence prediction loss, and category prediction loss.

The loss is a sum of squared errors. Note that the error on w and h is computed on their square roots. The reason is that, across bounding boxes of different sizes, a given offset matters much more for a small box than for a large one, yet a plain sum of squared errors assigns the same loss to the same offset.
To alleviate this problem, the authors use a simple trick: they replace w and h of the bounding box with their square roots.
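
A quick numeric check shows the effect of the square root. The widths below are made-up pixel values (YOLOv1 actually normalizes w and h by the image size, which does not change the conclusion), and `wh_error` is a hypothetical helper for this sketch.

```python
import math


def wh_error(w_pred, w_true, use_sqrt):
    """Squared error on a width (or height), with or without the square root."""
    if use_sqrt:
        return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2
    return (w_pred - w_true) ** 2


# The same absolute offset of 10 on a small box and on a large box.
small_plain = wh_error(30, 20, use_sqrt=False)
large_plain = wh_error(310, 300, use_sqrt=False)
small_sqrt = wh_error(30, 20, use_sqrt=True)
large_sqrt = wh_error(310, 300, use_sqrt=True)
print(small_plain, large_plain, small_sqrt, large_sqrt)
```

Without the square root both errors are identical; with it, the small box is penalized roughly an order of magnitude more than the large one for the same offset.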

Localization errors and classification errors should not be weighted equally, so the penalty on localization error is increased by setting λcoord=5.

In every image, many grid cells contain no object. Training pushes the confidence scores of the boxes in those cells toward zero, and this gradient often overpowers the gradient from the cells that do contain objects, which can make the model unstable and cause training to diverge early. The confidence loss of boxes that contain no object is therefore down-weighted by setting λnoobj=0.5.

(3) Advantages:

1. YOLO is very fast. The standard version processes 45 images per second, and the fast version processes 150 images per second, so YOLO can process video in real time with less than 25 milliseconds of latency. Even for systems without strict real-time requirements, YOLO is faster than other methods at comparable accuracy.
2. The mean average precision of YOLO is more than twice that of other real-time detection systems.
3. It generalizes well and can be applied to new domains (such as detecting objects in artwork).
(4) Limitations:

1. YOLO struggles with objects that are close together and with groups of small objects, because each grid cell predicts only 2 boxes, and both belong to the same category.
2. Because of the loss function, localization error is the main factor limiting detection quality, and the handling of large versus small objects in particular needs improvement (a small error has a much greater impact on a small bounding box).
3. YOLO generalizes poorly to objects with unusual aspect ratios or configurations.

2. YOLOv2

In 2017, Joseph Redmon and Ali Farhadi made many improvements to YOLOv1 and proposed YOLOv2 and YOLO9000, focusing on YOLOv1's weaknesses in recall and localization accuracy.

YOLOv2 is an advanced object detection algorithm that is faster than other detectors. In addition, the network accepts input images of various sizes and offers a good trade-off between detection accuracy and speed.

Whereas YOLOv1 uses fully connected layers to predict bounding box coordinates directly, YOLOv2 borrows the anchor mechanism from Faster R-CNN. K-means clustering on the training set is used to find better anchor templates, which greatly improves recall. YOLOv2 also uses fine-grained features, connecting shallow features to deep features, which helps detect small objects.
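
The anchor clustering step can be sketched as follows. The YOLOv2 paper describes using 1 - IoU as the distance so that large boxes do not dominate; everything else here (the function names, the fixed initial anchors, the six sample boxes, and the stopping rule) is an assumption made to keep the example small and deterministic.

```python
def wh_iou(a, b):
    """IoU of two boxes that share the same centre, given only (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, init, iters=20):
    """Cluster (w, h) pairs with distance 1 - IoU, starting from the given anchors."""
    anchors = list(init)
    for _ in range(iters):
        groups = [[] for _ in anchors]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(len(anchors)), key=lambda k: wh_iou(box, anchors[k]))
            groups[best].append(box)
        new = []
        for k, g in enumerate(groups):
            if g:
                new.append((sum(b[0] for b in g) / len(g),
                            sum(b[1] for b in g) / len(g)))
            else:
                new.append(anchors[k])  # keep the anchor of an empty cluster
        if new == anchors:
            break
        anchors = new
    return anchors


# Hypothetical ground-truth box sizes: three small boxes and three tall ones.
boxes = [(10, 12), (12, 10), (11, 11), (50, 80), (55, 85), (45, 75)]
anchors = kmeans_anchors(boxes, init=[(10, 10), (60, 60)])
print(anchors)
```

The resulting anchors match the typical shapes in the data (one small square, one tall rectangle), which is the reason clustered anchors fit the ground truth better than hand-picked ones.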

YOLO9000 uses WordTree to mix training data from different sources and uses joint optimization to train on the ImageNet and COCO datasets at the same time, which lets it detect more than 9,000 object categories in real time. Since the detection network of YOLO9000 is still YOLOv2, this part focuses on the more widely used YOLOv2.
For details, see the reference: YOLO Series Algorithm Essentials: The Road from yolov1 to yolov5 (a complete 20,000-word summary)

3. YOLOv3

In 2018, Redmon made further improvements on top of YOLOv2. For feature extraction, the darknet-53 network replaces the original darknet-19, and a feature pyramid network structure provides multi-scale detection. For classification, logistic regression replaces softmax. Together these changes maintain detection accuracy while keeping real-time performance.

From YOLOv1 to YOLOv3, the performance gain of each generation is closely tied to improvements in the backbone network. YOLOv3 provides not only darknet-53 but also a lightweight tiny-darknet. If you want both accuracy and speed, choose darknet-53 as the backbone; if you want faster detection and can accept lower accuracy, tiny-darknet is a good choice. This flexibility has made YOLOv3 popular in engineering practice.
For details, see the reference: YOLO Series Algorithm Essentials: The Road from yolov1 to yolov5 (a complete 20,000-word summary)

4. YOLOv4

For details, see the reference: YOLO Series Algorithm Essentials: The Road from yolov1 to yolov5 (a complete 20,000-word summary)

5. YOLOv5

For details, see the reference: YOLO Series Algorithm Essentials: The Road from yolov1 to yolov5 (a complete 20,000-word summary)

6. SSD

SSD stands for Single Shot MultiBox Detector. "Single shot" indicates that SSD is a one-stage method, and "MultiBox" indicates that it predicts multiple boxes.
SSD is a very good one-stage object detection method. In a one-stage algorithm, localization and classification are completed at the same time. The main idea is to extract features with a CNN and to sample densely and evenly at different positions of the image, using different scales and aspect ratios, while performing object classification and bounding box regression together. The whole process takes a single step, so its advantage is speed.
However, uniform dense sampling has an important disadvantage: training is difficult, mainly because positive samples and negative samples (background) are extremely unbalanced (see Focal Loss), which leads to slightly lower accuracy.
1) Backbone network
[Figure: SSD backbone network]
For implementation details, see the reference: Smart Object Detection 23: Building an SSD Object Detection Platform with PyTorch

(3) Anchor-Free series

For details, see the reference: Detailed Explanation of AnchorFree Series Algorithms
How is the detection box represented once anchors are removed?
In anchor-based methods, the detection box is represented by an anchor plus its encoding information, in other words its offsets.
Anchor-free methods currently represent the detection box in two main ways:

1. Keypoint-based detection: first detect the top-left and bottom-right corners of the object, then pair the corners to form a detection box.
2. Center-based detection: directly detect the center region and boundary information of the object, decoupling classification and regression into two sub-networks.
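
The two representations can be compared with a small decoding sketch. Encoding the boundary information as distances from a point to the four box edges is one common convention and is an assumption here, as are the function names and sample numbers; specific anchor-free detectors differ in the details.

```python
def box_from_corners(top_left, bottom_right):
    """Keypoint-based: a box is a matched pair of corner points."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    return (x1, y1, x2, y2)


def box_from_center(point, distances):
    """Center-based: a point inside the object plus distances to the four edges.

    distances = (left, top, right, bottom), all non-negative.
    """
    x, y = point
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)


corner_box = box_from_corners((40, 35), (70, 55))
center_box = box_from_center((50, 40), (10, 5, 20, 15))
print(corner_box, center_box)
```

Both calls describe the same rectangle, which shows that neither representation needs a predefined anchor as a reference box.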


References

1. Object Detection (Object Detection)
2. Common Classical Object Detection Algorithms
3. CNN Notes: Popular Understanding of Convolutional Neural Networks
4. Overview of Object Detection Algorithms
5. Detailed Explanation of AnchorFree Series Algorithms
6. Smart Object Detection 23: Building an SSD Object Detection Platform with PyTorch
7. YOLO Series Algorithm Essentials: The Road from yolov1 to yolov5 (a complete 20,000-word summary)

Origin blog.csdn.net/huangweijie0426/article/details/127583034