Object detection model summary


1 Problem overview

1.1 Definition

The task of object detection is to locate and classify the objects in an image: localization means regressing the rectangular bounding box of each target, and classification means assigning a category label to each box.

1.2 Main issues

Object detection must handle several difficulties: the variety and number of target categories, variation in target size, external interference factors, the demands of specific scene tasks, and so on.

2 Method overview

Object detection algorithms can be divided into traditional algorithms and deep-learning-based algorithms. In traditional algorithms, the main work is concentrated on hand-designed low-level feature extractors such as SIFT and HOG. Deep-learning-based algorithms instead combine data and a loss function through network design, making feature extraction automatic and simplifying the whole detection task. In recent years, with breakthroughs in computing hardware and deep-learning architectures, deep-learning-based algorithms have far outperformed traditional methods, and traditional algorithms have gradually left the stage. Many deep-learning methods, however, are still built on ideas from traditional algorithms. The rest of this article walks through the development of both, to give a general picture of object detection algorithms.

2.1 Traditional target detection algorithm

The flow of a traditional object detection algorithm is shown below:
[Figure: traditional detection pipeline]

Candidate box extraction : select target candidate boxes with a sliding window, the Selective Search (SS) algorithm, etc.;
Feature extraction : hand-designed algorithms extract features, e.g. VJ (Viola–Jones) and HOG; the main work of traditional methods is concentrated on the design of low- and mid-level feature extraction;
Classifier : classify each candidate box with algorithms such as SVM;
NMS (non-maximum suppression) : remove redundant overlapping boxes; the steps are as follows (a minimal implementation is given after the list):

  1. Use the classifier to assign every candidate box a category, and discard the background class, since background boxes do not need NMS;
  2. For the bounding boxes (B_BOX) of each object class, sort them in descending order of classification confidence;
  3. Within one class, take the highest-confidence box B_BOX1, remove it from the input list, and add it to the output list;
  4. Compute the IoU of B_BOX1 with each remaining box B_BOX2; if IoU(B_BOX1, B_BOX2) > threshold TH, remove B_BOX2 from the input;
  5. Repeat steps 3 and 4 until the input list is empty, completing the traversal of one object class;
  6. Repeat steps 2 to 5 until NMS has been applied to every object class;
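The per-class procedure above maps directly to a few lines of code. Below is a minimal NumPy sketch for a single class (in practice it runs once per object class); the function name and the 0.5 threshold are illustrative choices, not from any particular paper:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes:  (N, 4) array of [x1, y1, x2, y2] for one class
    # scores: (N,) classification confidences
    # returns indices of the boxes to keep
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # step 2: sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                        # step 3: highest-confidence box
        keep.append(i)
        # step 4: IoU of the top box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]   # step 5: drop overlapping boxes
    return keep
```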

2.2 Target detection algorithm based on deep learning

At present, deep-learning-based detection algorithms can be divided into anchor-based and anchor-free methods; the difference is whether anchors are used to generate candidate boxes. The field first moved from anchor-free to anchor-based methods, and there is now a trend of returning to anchor-free. These iterative updates in academia are also driving changes in industry. Anchor-based methods can further be divided into two-stage and one-stage approaches.

Anchor

An anchor, also called an anchor box, is one of a set of preset bounding boxes; during training the network learns the offset of the ground-truth box relative to these presets. In plain terms, likely target locations are set in advance, and predictions are fine adjustments on top of these preset boxes. In essence, anchors are a solution to the label-assignment problem.
An anchor encodes prior box information, and its generation involves the following parts:

  • The points of the network's feature map locate the position of the box;
  • the anchor scale sets the size of the box;
  • the anchor aspect ratio sets the shape of the box.
    By presetting prior boxes of different scales and sizes, there is a higher probability that some prior box matches the target object well.

Concretely, anchors are a group of rectangular boxes obtained before training by clustering the training set with methods such as k-means; they represent the dominant width and height scales of the targets in the dataset. A small generation sketch follows.
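As an illustration of the three ingredients above (location, scale, aspect ratio), here is a sketch of scale-and-ratio anchor generation in the style of Faster R-CNN's RPN; the base size and the exact scale/ratio values are illustrative defaults, not the only possible configuration:

```python
import itertools
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    # Returns len(scales) * len(ratios) anchors centered at (0, 0) as
    # [x1, y1, x2, y2]; shifting them to every feature-map location
    # (mapped back to image coordinates) gives the full anchor set.
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        area = (base_size * scale) ** 2   # the scale sets the size
        w = np.sqrt(area / ratio)         # the ratio (h / w) sets the shape
        h = w * ratio
        anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(generate_anchors().shape)  # (9, 4): 3 scales x 3 ratios, as in Faster R-CNN
```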

Anchor Base

Representative algorithms : Faster R-CNN, SSD, the YOLO series (v2–v5), etc.;
Anchor-based methods rely on prior anchor boxes: anchor scales and sizes are first obtained by clustering a certain amount of data, and the predicted box is then produced by combining a prior anchor with the predicted offset.
For the past few years, anchor-based detectors have dominated the field. Their process can be divided into three steps:

  1. Preset a large number of anchors (2D/3D) over the image or point-cloud space;
  2. Regress the four offsets of the target box relative to the anchors;
  3. Refine the matched anchors with the regressed offsets to obtain precise target positions.

Example: Faster R-CNN sets 3 scales and 3 aspect ratios, 9 anchors in total, to extract candidate boxes.

Anchor Free

Representative algorithms : CornerNet, ExtremeNet, CenterNet, FCOS, etc.;
Anchor-free means there is no prior anchor box: target boxes are obtained directly from predictions at specific points. Anchor-free detection comes in two flavors:

  1. Keypoint-based methods, which constrain the search space by locating several keypoints of the target object;
  2. Center-based methods, which locate the center point of the target and then predict the distances from the center to the box boundaries.

The most direct benefit is that there is no need to cluster anchor width/height parameters on the training data before training.
Example: CornerNet directly predicts, for each location, the probability of being a top-left or bottom-right corner, and extracts target boxes by pairing matching top-left and bottom-right corners.

Two-Stage (two-stage algorithm)

RCNN series

The R-CNN series is the classic two-stage family. R-CNN itself was the first work to bring CNNs into object detection, the pioneering work of CNNs in this field.
The following figure shows the iterations of the R-CNN series:
[Figure: evolution of the R-CNN series]

RCNN

Paper address : Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper

Algorithm flow :

  • Candidate region generation: the Selective Search (SS) algorithm generates 1K–2K candidate regions per image
  • Feature extraction: for each candidate region, a deep convolutional network (CNN) extracts features
  • Category judgment: the features are fed to one SVM classifier per category to decide whether the region belongs to that category
  • Position refinement: a regressor fine-tunes the position of the candidate box
    Improvement points :
  • The SS algorithm is more efficient and more accurate than sliding windows;
    Disadvantages :
  • The SS algorithm is still time-consuming;
  • The four modules are trained largely separately (the CNN is fine-tuned from a pre-trained model, the SVMs and the box regressor are each trained on their own), which makes joint tuning difficult;

Fast RCNN

Paper address : https://arxiv.org/abs/1504.08083v2

Algorithm flow :

  1. Start from a CNN model (VGG16) pre-trained on ImageNet;
  2. as in R-CNN, extract about 2000 candidate regions per image with Selective Search;
  3. feed the entire image through the CNN once to extract the image's overall features (shared convolutional features);
  4. map each candidate region onto the feature map extracted in the previous step;
  5. use an RoI pooling layer to pool each candidate region's features to a fixed size (see the sketch after this list);
  6. classify and regress the candidate-region features with a softmax loss and a smooth L1 loss.
    Improvement points :
  • Shared convolutional features greatly improve the efficiency of feature extraction for candidate boxes;
  • the classification task and the regression task are output together;
    Insufficient :
  • candidate box extraction with the SS algorithm is still too time-consuming;
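RoI pooling is what turns arbitrarily sized candidate regions into fixed-size features. A minimal sketch using torchvision's roi_pool follows; the feature shapes and the 1/16 spatial scale (a VGG16-style total stride) are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

# Shared conv feature map of one image: (batch, channels, H, W)
features = torch.randn(1, 256, 50, 50)

# Two candidate regions in *image* coordinates: [batch_idx, x1, y1, x2, y2]
rois = torch.tensor([[0., 10., 10., 200., 300.],
                     [0., 50., 40., 400., 400.]])

# spatial_scale maps image coordinates onto the feature map
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): a fixed-size feature per region
```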

Faster RCNN

Paper address : https://arxiv.org/abs/1506.01497

Algorithm flow : Faster R-CNN can be regarded as RPN + Fast R-CNN.
The overall pipeline is similar to Fast R-CNN, except that the time-consuming Selective Search proposal step is replaced by an RPN (Region Proposal Network) that extracts candidate boxes directly and is integrated into the overall network; comprehensive performance improves greatly, especially detection speed.
Improvements :

  • The RPN network is proposed to replace the SS algorithm;
    Deficiencies :
  • compared with one-stage algorithms, it is slower and not real-time;
    Advantages :
  • high network accuracy;
  • the RPN network is significantly faster than the SS algorithm;
  • end-to-end training is realized;

One-Stage (one-stage algorithm)

SSD

Paper address : https://arxiv.org/abs/1512.02325
The full name of SSD is Single Shot MultiBox Detector; it is a one-stage detector released after YOLOv1 and before YOLOv2. SSD uses multi-scale feature maps, an idea later adopted in YOLOv3's Darknet-53 era as well. On shallower feature maps each cell has a small receptive field, suitable for detecting smaller objects; on deeper feature maps each cell has a large receptive field, suitable for detecting larger objects.

SSD uses VGG16 as the base model and adds convolutional layers on top of it to obtain more feature maps for detection, as shown below:
[Figure: SSD network architecture]

The network is divided into three parts:

  • Backbone : VGG16 network for image feature extraction
  • Extra : A network for eliciting multi-scale feature maps
  • Loc and cls : Networks for box location regression and object classification

Features of SSD:

  • Single-shot detection : "single shot" indicates that SSD is a one-stage algorithm; localization and classification are done in one forward pass of the network, which significantly improves speed.
  • Multi-scale feature maps : "MultiBox" indicates multi-box prediction. SSD takes feature maps from convolutional layers of different depths, so that different resolutions locate and detect targets of different scales (a sketch of the multi-scale heads follows this list).
  • Anchor boxes : rectangular boxes of different sizes and aspect ratios are predefined at each feature-map position and matched against the ground-truth boxes.
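To make the multi-scale idea concrete, here is a minimal sketch of SSD-style per-scale loc/cls heads; the channel counts, anchor count, and class count are illustrative, not SSD's exact configuration:

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    # One 3x3 loc head and one 3x3 cls head per feature map.
    def __init__(self, in_channels=(512, 1024, 512), num_anchors=6, num_classes=21):
        super().__init__()
        self.loc = nn.ModuleList(
            nn.Conv2d(c, num_anchors * 4, 3, padding=1) for c in in_channels)
        self.cls = nn.ModuleList(
            nn.Conv2d(c, num_anchors * num_classes, 3, padding=1) for c in in_channels)

    def forward(self, feature_maps):
        locs, confs = [], []
        for fmap, loc_layer, cls_layer in zip(feature_maps, self.loc, self.cls):
            # shallow, high-resolution maps handle small objects;
            # deep, low-resolution maps handle large objects
            locs.append(loc_layer(fmap).permute(0, 2, 3, 1).flatten(1))
            confs.append(cls_layer(fmap).permute(0, 2, 3, 1).flatten(1))
        return torch.cat(locs, 1), torch.cat(confs, 1)

fmaps = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19),
         torch.randn(1, 512, 10, 10)]
loc, conf = MultiScaleHead()(fmaps)  # predictions from all scales, concatenated
```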

YOLO series

The YOLO series was created by Joseph Redmon and can be considered the foundation of one-stage detection. YOLO stands for "you only look once": category classification and box regression are unified into a single pass. From YOLOv1 to YOLOv3 the author completed the basic framework of the YOLO family; most subsequent networks (YOLOv4, YOLOv5, YOLObile, YOLOX, etc.) are modifications of YOLOv3. In February 2020 Joseph Redmon announced on Twitter that he was leaving the CV field (reportedly after the military showed him YOLOv3 running on military drones, which he felt violated his original intentions). After his departure, YOLOv4 and YOLOv5 were released on April 23 and June 25 of that year, respectively.

YOLOv1

Paper address : https://arxiv.org/abs/1506.02640

Algorithm flow :

  • Divide the input image into a 7×7 grid;
  • each grid cell predicts 2 candidate boxes but is responsible for only one target (box confidence plus scores for each category);
  • for the 7×7×2 boxes, first delete low-confidence boxes by thresholding, then delete redundant boxes with NMS (a decoding sketch follows this list).
    Advantages :
  • high speed;
  • low background false-detection rate;
  • strong generality;
    Disadvantages :
  • poor detection of small and densely grouped targets (e.g. flocks of birds);
  • low recall;
  • box localization is not precise enough;
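The 7×7-grid description above corresponds to decoding a tensor of shape S×S×(B·5+C). The sketch below shows that decoding on random stand-in data; the 0.2 threshold is an arbitrary example value:

```python
import torch

S, B, C = 7, 2, 20                         # grid size, boxes per cell, classes
pred = torch.rand(S, S, B * 5 + C)         # stand-in for a YOLOv1 network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # each box: (x, y, w, h, confidence)
class_probs = pred[..., B * 5:]                # per-cell class scores, shared by both boxes

# class-specific confidence = box confidence * class probability
scores = boxes[..., 4:5] * class_probs.unsqueeze(2)   # (S, S, B, C)

# threshold low-confidence boxes; survivors then go through NMS
mask = scores.max(dim=-1).values > 0.2
print(mask.sum().item(), "boxes kept out of", S * S * B)
```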

YOLOv2(YOLO9000)

Paper address : https://arxiv.org/abs/1612.08242
YOLO9000 is the YOLOv2 network; it plays a bridging role in the YOLO series.
[Table: incremental improvements from YOLOv1 to YOLOv2, from the paper]

The table above lists the improvements made in the paper and the gain from each one:
Improvement points :

  1. Added BN (batch normalization) layers;
  2. used a high-resolution classifier;
  3. used prior anchor boxes;
  4. obtained the anchor sizes by clustering (a clustering sketch follows this list);
  5. constrained the anchor center-point positions;
  6. proposed a passthrough layer to add fine-grained features;
  7. used multi-scale input training;
  8. proposed the Darknet-19 network;
  9. proposed the WordTree structure, supporting 9000+ categories;
    Advantages :
  • significantly improved accuracy;
  • significantly increased the number of detectable categories;
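YOLOv2's anchor clustering replaces hand-picked anchors with k-means over the training boxes, using d(box, centroid) = 1 − IoU so that large boxes do not dominate the way they would under Euclidean distance. Below is a sketch of that procedure on fake data; the function names and the data are illustrative:

```python
import numpy as np

def wh_iou(boxes, centroids):
    # IoU between (w, h) pairs, treating all boxes as sharing one corner
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = wh_iou(wh, centroids).argmax(axis=1)   # nearest = highest IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

wh = np.abs(np.random.randn(1000, 2)) * 50 + 20   # fake (w, h) pairs for illustration
print(kmeans_anchors(wh, k=5))                     # 5 anchor (w, h) priors
```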

YOLOv3

Paper address : https://pjreddie.com/media/files/papers/YOLOv3.pdf
Strictly speaking, YOLOv3 is not an academic paper but a technical report: the author needed to cite YOLOv3 in another article and wrote up a quick report, so its descriptions and result comparisons are quite casual.
[Figure: speed/accuracy comparison from the opening of the YOLOv3 report]

In that figure the author drew YOLOv3's result beyond the chart axes to show that it clearly outperforms the then-SOTA RetinaNet (arguably showing off; or the author simply took the figure from the RetinaNet paper and overlaid YOLOv3's result rather than redrawing it). Either way, YOLOv3 essentially fixed the basic framework of the YOLO series; subsequent versions add contemporary techniques on top of this framework.

Improvements :

  • Multi-scale prediction (introducing an FPN);
  • a better backbone classifier (Darknet-53, which introduces residual structures similar to ResNet);
  • the classifier no longer uses softmax: the classification loss is a binary cross-entropy loss (see the snippet after this list).
    Advantages :
  • improved detection accuracy;
  • multi-scale prediction greatly increases the number of predicted boxes (8K+) and the ability to detect small targets;
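Replacing softmax with per-class binary cross-entropy turns classification into independent yes/no decisions, so one box can carry overlapping labels (e.g. both "person" and "woman" in a dataset with nested classes). A minimal illustration, with made-up shapes and labels:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 80)        # 4 predicted boxes, 80 classes (COCO-sized)
targets = torch.zeros(4, 80)
targets[0, [0, 17]] = 1.0          # one box holding two non-exclusive labels

loss = nn.BCEWithLogitsLoss()(logits, targets)  # independent binary losses
probs = torch.sigmoid(logits)      # per-class probabilities; classes don't compete
```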

YOLOv4

Paper address : https://arxiv.org/abs/2004.10934

YOLOv4 reads like a literature review of object detection tricks: the author tried essentially every method then known to work in the field, which shows considerable "alchemy" skill. YOLOv4 greatly improves accuracy while maintaining the same speed as YOLOv3.

Improvements :

  • Compared with YOLOv3's DarkNet53, YOLOv4 uses CSPDarkNet53;
  • compared with YOLOv3's FPN, YOLOv4 uses SPP+PAN;
  • CutMix and Mosaic data augmentation;
  • DropBlock regularization;
  • etc.
    Advantages :
  • network accuracy is greatly improved at the same speed;

YOLOv5

Project address : https://github.com/ultralytics/yolov5

YOLOv5 mainly made a series of engineering optimizations. At accuracy comparable to YOLOv4, its speed rose from roughly 50 FPS to 140 FPS, which is very meaningful for industry.

Improvements :

  • Uses the PyTorch framework, which is user-friendly and makes training on custom datasets easy; compared with the Darknet framework of YOLOv4, PyTorch is easier to put into production.
  • The code is easy to read and integrates a large number of computer-vision techniques, which makes it good for learning and reference.
  • The environment is easy to configure, model training is fast, and batch inference produces results in real time.
  • It performs efficient inference directly on single images, batches of images, video, and even webcam input.
  • PyTorch weight files convert easily to ONNX, and from there to formats usable by OpenCV or, via CoreML, to iOS formats for direct deployment in mobile applications.
  • Finally, YOLOv5s reaches up to 140 FPS object recognition, which is impressive in practice.
    Advantages :
  • high speed;
  • easy engineering deployment;

YOLOX

Paper address : http://arxiv.org/abs/2107.08430
The backbone of YOLOX follows the backbone structure of YOLOv3.

Improvements :

  • The feature extractor uses the Focus structure, CSPDarknet, and SPP;
  • decoupled head: earlier YOLO versions used a coupled head, implementing classification and regression in a single 1×1 convolution, which YOLOX argues hurts recognition; YOLOX splits the head into two branches computed separately and merged at the final prediction (a sketch follows this list);
  • Mosaic data augmentation;
  • no prior boxes: anchor-based is changed to anchor-free;
  • SimOTA: dynamically matches positive samples for targets of different sizes;
    Advantages :
  • improved network accuracy;
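A minimal sketch of a decoupled head of this kind is below; the channel widths and activation are illustrative choices, not YOLOX's exact configuration:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, 256, 1)   # 1x1 conv to unify channel width
        def branch():
            return nn.Sequential(
                nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
                nn.Conv2d(256, 256, 3, padding=1), nn.SiLU())
        self.cls_branch, self.reg_branch = branch(), branch()
        self.cls_pred = nn.Conv2d(256, num_classes, 1)  # classification
        self.reg_pred = nn.Conv2d(256, 4, 1)            # box regression
        self.obj_pred = nn.Conv2d(256, 1, 1)            # objectness

    def forward(self, x):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)   # classification and regression run on
        reg_feat = self.reg_branch(x)   # separate branches...
        return torch.cat([self.reg_pred(reg_feat), self.obj_pred(reg_feat),
                          self.cls_pred(cls_feat)], dim=1)  # ...merged at the end

out = DecoupledHead()(torch.randn(1, 256, 20, 20))  # (1, 4 + 1 + 80, 20, 20)
```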

YOLOv6

On the basis of YOLOv5, YOLOv6 designs the EfficientRep backbone structure.

Compared with YOLOv5's backbone, YOLOv6's backbone makes more efficient use of hardware compute and has stronger representational capacity.

In YOLOv6's backbone, ordinary convolutions are replaced by the RepConv structure, and the RepBlock structure is built on top of RepConv; the first RepConv in a RepBlock performs channel-dimension transformation and alignment. (A sketch of the re-parameterization trick behind RepConv follows the improvements list below.)

In addition, YOLOv6 optimizes SPPF into the more efficient SimSPPF, increasing the efficiency of feature reuse.

Improvements :

  • Hardware-friendly backbone design: introduces RepVGG-style structures into the backbone and improves them with hardware in mind, yielding the more efficient EfficientRep;
  • a more concise and efficient decoupled head: keeps the decoupled-head structure but simplifies its design; a Hybrid Channels strategy rebuilds a more efficient decoupled head that reduces latency while maintaining accuracy and mitigates the extra latency of the 3×3 convolutions in the decoupled head;
  • more effective training strategies: the anchor-free paradigm, the SimOTA label-assignment strategy, and the SIoU bounding-box regression loss;
    Advantages :
  • improved detection accuracy;
  • faster training;
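The RepConv idea rests on structural re-parameterization, introduced by RepVGG: train with parallel 3×3, 1×1, and identity branches, then fold them into a single 3×3 convolution for inference. A sketch of that fusion, assuming bias-free convs before BN, groups=1, and equal in/out channels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv_w, bn):
    # fold a BatchNorm into the preceding (bias-free) conv's weight and bias
    std = (bn.running_var + bn.eps).sqrt()
    w = conv_w * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def reparameterize(conv3, bn3, conv1, bn1, bn_id, channels):
    # merge 3x3, 1x1 and identity branches into one 3x3 conv
    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    w1, b1 = fuse_conv_bn(F.pad(conv1.weight, [1, 1, 1, 1]), bn1)  # 1x1 -> centered 3x3
    wid = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        wid[i, i, 1, 1] = 1.0               # identity as a one-hot 3x3 kernel
    wid, bid = fuse_conv_bn(wid, bn_id)
    fused = nn.Conv2d(channels, channels, 3, padding=1)
    fused.weight.data, fused.bias.data = w3 + w1 + wid, b3 + b1 + bid
    return fused

# quick equivalence check (BNs in eval mode so running stats are used)
c = 8
conv3 = nn.Conv2d(c, c, 3, padding=1, bias=False); bn3 = nn.BatchNorm2d(c).eval()
conv1 = nn.Conv2d(c, c, 1, bias=False);            bn1 = nn.BatchNorm2d(c).eval()
bnid = nn.BatchNorm2d(c).eval()
x = torch.randn(2, c, 16, 16)
y_train = bn3(conv3(x)) + bn1(conv1(x)) + bnid(x)             # training-time structure
y_fused = reparameterize(conv3, bn3, conv1, bn1, bnid, c)(x)  # inference-time conv
print(torch.allclose(y_train, y_fused, atol=1e-5))            # True
```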

YOLOv7

The backbone of YOLOv7 is based on YOLOv5, with the newly designed E-ELAN and MPConv structures.

Improvements :

  • Model re-parameterization: introduces re-parameterization into the network architecture; the idea first appeared in RepVGG;
  • dynamic label assignment: combines the strengths of YOLOv5 and YOLOX, adopting YOLOv5's cross-grid search together with YOLOX's matching strategy;
  • the ELAN efficient network architecture, newly proposed in YOLOv7 with efficiency as the main goal;
  • training with an auxiliary head: YOLOv7 proposes an auxiliary-head training method whose main purpose is to spend more training cost for better accuracy without affecting inference time, since the auxiliary head exists only during training;
    Advantages :
  • improved detection accuracy;
  • faster training;

Algorithm based on Transformer

Since the Transformer came out, it has dominated NLP. Its early reception in CV, however, was lukewarm, and it was once considered unsuitable for vision — until several Transformer papers appeared in computer vision with performance close to CNN-based SOTA, opening a new space of possibilities for the field.
In Transformer-based computer vision research, ViT is used for image classification, while DETR and Deformable DETR are used for object detection. From these works, a Transformer paradigm for computer vision has begun to take shape, roughly: Embedding -> Transformer -> Head.

ViT

Paper address : https://arxiv.org/abs/2010.11929

Vision Transformer (ViT) splits the input image into 16×16-pixel patches. Each patch is linearly projected to a lower dimension with position information embedded, and the resulting sequence is fed to the Transformer, avoiding pixel-level attention. Mirroring BERT's [class] token, ViT prepends an extra learnable [class] token to the Transformer input sequence, and the Transformer encoder output at that position is used as the image feature.
ViT gives up CNN's inductive biases, which is advantageous when learning from ultra-large-scale data — that is, large-scale pre-training beats inductive bias — and it approaches SOTA on many image classification tasks.
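The patch-splitting step is just a strided convolution plus a token concatenation. A minimal sketch of ViT-Base-style patch embedding (16×16 patches, 768-dim tokens, both standard ViT hyperparameters) follows:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        # a conv with kernel = stride = patch size is exactly the per-patch linear projection
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # position information

    def forward(self, x):                                # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, 196, dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # ready for the Transformer encoder

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```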

DETR

Paper address : https://arxiv.org/abs/2005.12872
DETR stands for DEtection TRansformer, a CV model proposed by Facebook AI Research, used mainly for object detection but also applicable to segmentation. It uses a Transformer to replace the complex traditional detection machinery: no two-stage versus one-stage split, no anchor-based versus anchor-free choice, no NMS post-processing. It also skips the usual bag of tricks — no multi-scale feature fusion, no special convolutions (group, deformable, dynamically generated, etc.) for feature extraction, no mapping of different feature maps to decouple classification and regression, not even data augmentation. The whole pipeline is simply: extract features with a CNN, then encode and decode with a Transformer to obtain the predicted output.

The overall work is very solid. Although its results do not reach SOTA, taking the Transformer — usually considered NLP territory by the "alchemists" — across to CV and making it work is significant and worth studying. Tradition-breaking, era-opening work of this kind tends to become popular, as with Faster R-CNN and YOLO, and much subsequent work has been built by improving on it.

In a nutshell, DETR treats object detection as a set prediction problem. For an image, it predicts a fixed number of objects (100 in the paper, configurable in the code), and the model directly outputs the prediction set in parallel based on the global context — that is, the Transformer decodes the predictions for all objects in the image at once. This parallelism makes DETR very efficient.


Network Architecture:

  • Backbone : a CNN first produces the image's feature (patch) vectors; positional encoding is added and the result is fed to the encoder. The encoder extracts features from this backbone input to produce the K and V used later (similar to ViT).
  • Encoder : the backbone feature map is flattened into a one-dimensional sequence and combined with positional encoding as the encoder input. Each encoder layer consists of multi-head self-attention and an FFN. Unlike the standard Transformer encoder, because self-attention is permutation-invariant, DETR adds the positional encoding to every multi-head self-attention layer to preserve the position sensitivity that detection requires.
  • Decoder : reconstructs the 100 input vectors (object queries) according to the features extracted by the encoder. The object queries are the core: they learn how to find object positions in the features the encoder extracted.
  • Prediction heads : classify and regress the reconstructed object queries.
  • FFN : the box head is a 3-layer perceptron with ReLU activations and a hidden layer (equivalently, 1×1 convolutions) that predicts the normalized center coordinates, height, and width of each box, while a separate linear layer with softmax activation produces the predicted class label. (A toy end-to-end sketch follows this list.)
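In the spirit of the short demo in the DETR paper, here is a toy version of the CNN → encoder/decoder → heads pipeline. It omits what real DETR needs for training (proper 2-D sine positional encodings, Hungarian matching), and the 50-position row/column embeddings are an illustrative cap on feature-map size:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    def __init__(self, num_classes=91, dim=256, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # CNN features
        self.conv = nn.Conv2d(2048, dim, 1)                # project to transformer width
        self.transformer = nn.Transformer(dim, nhead=8)
        self.query_embed = nn.Parameter(torch.rand(num_queries, dim))   # object queries
        self.row_embed = nn.Parameter(torch.rand(50, dim // 2))         # learned positional
        self.col_embed = nn.Parameter(torch.rand(50, dim // 2))         # encoding pieces
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Linear(dim, 4)

    def forward(self, img):
        feat = self.conv(self.backbone(img))               # (B, dim, H, W)
        B, C, H, W = feat.shape
        pos = torch.cat([self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
                         self.row_embed[:H].unsqueeze(1).repeat(1, W, 1)],
                        dim=-1).flatten(0, 1).unsqueeze(1)    # (H*W, 1, dim)
        src = pos + feat.flatten(2).permute(2, 0, 1)          # (H*W, B, dim)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)   # (100, B, dim)
        hs = self.transformer(src, tgt)      # all object predictions decoded in parallel
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MinimalDETR()(torch.randn(1, 3, 800, 800))  # 100 class logits + boxes
```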

Deformable DETR

Disadvantages of DETR :

  • The model converges slowly and is hard to train: it needs a much longer schedule than existing detectors — about 500 epochs on COCO, roughly 10 to 20 times Faster R-CNN's;
  • DETR performs poorly on small objects. Existing detectors usually use multi-scale features and detect small objects on high-resolution feature maps, while DETR does not, mainly because high-resolution feature maps would raise DETR's computational complexity to an unacceptable level.
    Causes of these problems :
  • at initialization, the attention weights are nearly uniform over all feature-map pixels (the map of a query multiplied with all keys is roughly flat, whereas ideally a query should correlate strongly with only a sparse set of relevant keys), so it takes a long time to learn a good attention map;
  • processing high-resolution features is computationally heavy and memory-intensive;
    Improvement plan :
  • make the encoder's initial attention no longer uniform: instead of computing similarity with all keys, compute it only with the more meaningful keys — deformable convolution is an effective way to attend to sparse spatial locations;
  • propose Deformable DETR, which integrates deformable convolution's sparse spatial sampling with the Transformer's relation-modeling capability: over the whole feature map, the model attends to a small set of sampling locations that act as a pre-filter over the key elements.

Based on this, the Deformable DETR model is proposed. Deformable DETR combines the sparse spatial-sampling advantage of deformable convolution with the Transformer's ability to model relations between elements. DETR's computational complexity comes from the Transformer's attention over the global context, yet the authors noticed that although this attention is computed globally, in the end each visual element establishes strong weighted connections with only a small number of other visual elements.

Deformable DETR therefore abandons global attention and computes attention only over a small number of points around each reference point, which reduces computation and speeds up convergence.


The Deformable Attention proposed by Deformable DETR alleviates DETR's slow convergence and high complexity while combining deformable convolution's sparse spatial-sampling ability with the Transformer's relation modeling. Deformable attention considers a small set of sampling locations as a pre-filter that highlights the key features of the whole feature map, and it extends naturally to fusing multi-scale features: multi-scale deformable attention itself exchanges information across multi-scale feature maps, so no FPN is required.

Network architecture :

  • Encoder : all attention modules are replaced with multi-scale deformable attention; the encoder's input and output are multi-scale feature maps with unchanged resolution;
  • Decoder : of its two modules, cross-attention and self-attention, only cross-attention is replaced with multi-scale deformable attention, while self-attention is unchanged. The reference points in cross-attention are obtained from the object queries [300×C] via a linear mapping plus a sigmoid.
  • Prediction heads : the detection head outputs offsets relative to the reference points, which the paper says reduces training difficulty. (A single-scale sketch of the sampling idea follows this list.)
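Stripping away multi-scale handling and multiple heads, the core of deformable attention is: each query predicts a few sampling offsets around its reference point plus attention weights, and aggregates features sampled only at those points. A heavily simplified single-head, single-scale sketch (all names and the 0.1 offset scale are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, num_points * 2)   # where each query samples
        self.weights = nn.Linear(dim, num_points)       # how much each sample counts
        self.num_points = num_points

    def forward(self, query, ref_points, feat):
        # query: (B, Q, dim); ref_points: (B, Q, 2) in [0, 1]; feat: (B, dim, H, W)
        B, Q, _ = query.shape
        offs = self.offsets(query).view(B, Q, self.num_points, 2) * 0.1  # small learned offsets
        w = self.weights(query).softmax(dim=-1)                          # (B, Q, K)
        grid = (ref_points.unsqueeze(2) + offs).clamp(0, 1) * 2 - 1      # to [-1, 1] for grid_sample
        sampled = F.grid_sample(feat, grid, align_corners=False)         # (B, dim, Q, K)
        # attend to K sampled points only, instead of all H*W positions
        return (sampled * w.unsqueeze(1)).sum(dim=-1).transpose(1, 2)    # (B, Q, dim)

out = DeformableSampling()(torch.randn(2, 300, 256),   # 300 object queries
                           torch.rand(2, 300, 2),      # their reference points
                           torch.randn(2, 256, 32, 32))
print(out.shape)  # torch.Size([2, 300, 256])
```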

