2022 Object Detection Overview

Table of contents

0 Preface

1. Background

1.1. Problem description

1.2. The core problem of target detection

1.3. Key challenges in target detection

2. About the loss function

3. About IOUs

4. Datasets and Evaluation Indicators

4.1. Evaluation indicators

4.2. Dataset

5. Development context of target detection

5.1. Classification of target detection algorithms

5.2. Overview of Target Detection Development

6. Backbone architecture

6.1. AlexNet

6.2. VGG

6.3. GoogLeNet/Inception

6.4. ResNet

6.5. ResNeXt

6.6. CSPNet

6.7. EfficientNet

7. Target detector

7.1. Traditional detection methods

1) Viola-Jones

2)HOG

3)DPM

7.2. Two-stage detector

1)R-CNN

2)SPP-Net

3)Fast R-CNN

4)Faster R-CNN

5) FPN

6) R-FCN

7) Mask R-CNN

8)DetectoRS

7.3. One-stage detector

1)YOLO

2)SSD

3) YOLOv2 and YOLO9000

4)RetinaNet

5)YOLOv3

6) CenterNet

7) EfficientDet

8) YOLOv4

9)Swin Transformer

7.4. Target detection algorithm based on Anchor Free

7.5. Target detection algorithm based on Transformer

1)DETR

2)YOLOS

3)Swin Transformer

4)Know It

8. Lightweight network 

8.1. SqueezeNet

8.2. MobileNet

8.3. ShuffleNet

8.4. MobileNetv2

8.5. PeleeNet

8.6. ShuffleNetv2

8.7. MnasNet

8.8. MobileNetv3

8.9. Once-For-All (OFA)

9. Future trends

10. Summary


0 Preface

The field of target detection has been developing for more than 20 years and is a core direction of computer vision. Its main tasks are target localization and target classification. As one of the fundamental problems of computer vision, target detection forms the basis of many other visual tasks, such as instance segmentation, image captioning, and target tracking. From the application perspective, pedestrian detection, face detection, text detection, traffic sign and traffic light detection, and remote sensing target detection are collectively referred to as the five major applications of target detection.

Before deep learning entered this field, the traditional target detection pipeline consisted of region selection, manual feature extraction, and classifier-based classification. Since manually designed features are often unable to capture the diverse appearance of targets, traditional methods never solved the target detection problem well.

After the rise of deep learning, neural networks can automatically learn powerful feature extraction and fitting capabilities from a large amount of data, so many target detection algorithms with excellent performance have emerged. Target detection methods based on deep learning can be roughly divided into three categories: two-stage target detection, single-stage target detection, and transformer-based target detection.

The editor has sorted out the mainstream target detection algorithms of recent years and organized the available online material. This article gives a systematic introduction to the development of the target detection field, aiming to build a complete knowledge system for readers, so that both the editor and readers can quickly understand target detection technology and its future development trends and study the relevant algorithms in a targeted manner.


1. Background

1.1. Problem description

Detection, classification, and segmentation are the three main tasks of computer vision (CV). The differences between them are mainly shown in the figure below. From the task perspective, target detection can be seen as a bridge between the classification and segmentation tasks, which is one of the reasons why target detection research is so important.

Object detection is a natural extension of object classification, which simply aims to identify objects in images. The goal of object detection is to detect all instances of a predefined class and provide their coarse localization in the image via axis-aligned boxes. A detector should be able to identify all instances of an object class and draw a bounding box around them. This is often viewed as a supervised learning problem. Modern object detection models have access to large sets of labeled images for training and evaluation on various canonical benchmarks.

1.2. The core problem of target detection

From the perspective of target positioning, there are three core problems that need to be solved in target detection:

  • Diversity of size
    Multiple different or identical targets may appear on the same image at the same time, and the size differences between them are large
  • Arbitrary position
    The target can appear anywhere in the image
  • Variation in shape
    The shape of the same object may vary greatly, and the target may have various shapes

1.3. Key challenges in target detection

Computer vision has come a long way in the past decade, but it still has some significant challenges to overcome. Some of the key challenges faced in practical applications are:

From the perspective of accuracy:
From the perspective of high accuracy, the common challenges in real-world scenarios mainly include:

  • Intra-category diversity
    Interference caused by the variety of materials, textures, postures, etc. within the category, for example, the materials and shapes of the chairs in the yellow box are very different, but they all belong to the large category of chairs
  • Interference from the external environment
    Noise interference from the external environment, such as the recognition and regression challenges brought by the light, fog, and occlusion in the blue box.
  • Similarity between classes
    The similarity interference caused by texture and posture between classes, for example, there are different species of animals in the yellow box, but the differences between them are very small; this can actually be derived into the field of fine-grained recognition
  • Cluster small target problem
    Cluster target detection faces a large number of problems with diverse categories, such as pedestrian detection, remote sensing detection, etc.

From an efficiency point of view:

Object detection is a very down-to-earth practical application technology, which usually needs to be applied in real-time processing scenarios, such as automatic driving systems. And it may also need to process thousands of data at the same time. Therefore, in addition to considering high accuracy, it is also necessary to consider efficiency issues in terms of processing time, memory usage, and traffic consumption. Today's models require massive computing resources to generate accurate detection results, but on mobile or edge devices, computational efficiency is even more critical.


2. About the loss function

 Portal: IOU Variant Chapter of Target Detection Algorithm Review

The loss function for classification:

  • CE loss
  • Focal loss
  • AP loss
  • DR loss

Positioning loss function:

  • smooth L1 loss
  • Balanced L1 loss
  • KL loss
  • IOU loss


3. About IOUs

Portal: IOU Variant Chapter of Target Detection Algorithm Review

 The development of IOU in target detection:

  • smooth L1 loss
  • IOU
  • GIOU
  • DIOU
  • CIOU
  • EIOU
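To make the basic quantity concrete, below is a minimal sketch (plain Python, illustrative only) of how IoU and its GIoU variant are computed for two axis-aligned boxes; the later variants (DIOU/CIOU/EIOU) add center-distance and aspect-ratio penalty terms on top of this.

```python
def iou_giou(box_a, box_b):
    """IoU and GIoU for two boxes given as (x1, y1, x2, y2)."""
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # GIoU subtracts the fraction of the smallest enclosing box not covered by the union
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area
    return iou, giou

print(iou_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # IoU = 1/7 ≈ 0.143, GIoU ≈ -0.079
```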


4. Datasets and Evaluation Indicators

4.1. Evaluation indicators

Target detectors are evaluated with a variety of metrics: FPS, precision, recall, and most commonly mAP. Precision is derived from IoU, which is defined as the ratio of the intersection to the union of the predicted box and the ground truth (GT) box. An IoU threshold is then set to decide whether a detection is correct: if the IoU with a GT box is greater than the threshold, the detection is counted as a True Positive (TP); if it is smaller, it is counted as a False Positive (FP). GT objects that the model fails to detect are counted as False Negatives (FN). Precision and recall are then defined as follows:
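With TP, FP, and FN as defined above:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}
```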

Based on these definitions, average precision (AP) is computed per class as the area under the precision-recall curve. Then, to compare different detectors, the AP over all classes is averaged to obtain the single metric mAP.

4.2. Dataset

1. Mainstream data sets

Although there are many public datasets, the most popular classic datasets are COCO and PASCAL VOC. These two data sets are data sets that must be compared for common tasks of target detection or backbone. Some available datasets commonly used for object detection tasks are outlined below.

1) PASCAL VOC 07/12
The Pascal Visual Object Classes (VOC) Challenge is a long-running competition for advancing visual perception. It started in 2005 with classification and detection of four object categories, but two versions of VOC are mainly used as benchmarks today. VOC2007 has 5K training images and more than 12K labeled objects; VOC2012 increases the training images to 11K with more than 27K labeled objects, expands the object categories to 20, and adds semantic segmentation and action recognition tasks. Pascal VOC introduced mAP at 0.5 IoU (mAP@0.5) as the metric for evaluating model performance. Figure 3 shows the distribution of the number of images per category in the Pascal VOC dataset:

2) ILSVRC
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), held annually from 2010 to 2017, became a standard benchmark for evaluating model performance. The dataset was expanded to 1000 categories and more than 1 million images, of which 200 categories and more than 500K images are used for target detection. The detection data comes from multiple sources, including ImageNet and Flickr. ILSVRC also relaxes the IoU threshold to accommodate small object detection. Figure 4 shows the distribution of the number of images per category in the ImageNet dataset:

3) MS-COCO
The Microsoft Common Objects in Context (MS-COCO) dataset is currently one of the most challenging datasets. It contains 91 categories of common objects found in natural environments that a four-year-old could easily identify. MS-COCO was introduced in 2015, and its popularity has only increased since then. It contains more than 2 million instances, with an average of 3.5 categories and 7.7 instances per image, and includes images from multiple viewpoints. MS-COCO introduced a more rigorous way to evaluate detectors: unlike VOC and ILSVRC, COCO computes mAP at IoU thresholds from 0.5 to 0.95 in steps of 0.05, then averages the ten values to get AP. In addition, it reports AP for small, medium, and large objects separately to compare performance at different scales. Figure 5 shows the distribution of the number of images per category in the MS-COCO dataset:

4) Open Image
Google's Open Images dataset consists of 9.2 million images annotated with image-level labels, object bounding boxes, and segmentation masks. It was launched in 2017 and has been updated six times. For object detection, Open Images has 16 million bounding boxes covering 600 categories on 1.9 million images, making it the largest object localization dataset. Its creators took extra care to select images that are interesting, complex, and diverse, with an average of 8.3 object categories per image. Open Images makes some changes to the AP metric introduced in Pascal VOC, such as ignoring unannotated classes and requiring detection of both classes and their subclasses. The distribution of the number of images per category is shown in Figure 6:

2. Two major annotation software

To facilitate collecting and producing COCO- and PASCAL VOC-style datasets, two annotation tools are commonly used in the target detection field: labelme and labelImg. Both have corresponding Python packages, and the generated data can be exported in the three common dataset formats: VOC, COCO, and YOLO.

3. Three commonly used label formats

The differences between these three formats lie mainly in the label file type and the bbox format: VOC uses per-image XML label files with corner coordinates, COCO uses a single JSON file with top-left corner plus width/height, and YOLO uses plain txt files with normalized center coordinates plus width/height.
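As a concrete illustration (a minimal sketch; the helper names are made up for this example), converting a VOC-style box into the other two formats looks like this:

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """VOC corners (absolute pixels) -> YOLO (normalized center x, center y, width, height)."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h

def voc_to_coco(xmin, ymin, xmax, ymax):
    """VOC corners -> COCO (top-left x, top-left y, width, height) in absolute pixels."""
    return xmin, ymin, xmax - xmin, ymax - ymin

# Example: a 100x200 box with top-left corner (50, 80) in a 640x480 image
print(voc_to_yolo(50, 80, 150, 280, 640, 480))  # (0.15625, 0.375, 0.15625, 0.4166...)
print(voc_to_coco(50, 80, 150, 280))            # (50, 80, 100, 200)
```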


5. Development context of target detection

5.1. Classification of target detection algorithms

The development of target detection can be divided into two periods: the era of traditional target detection algorithms (1998-2014) and the era of deep-learning-based target detection algorithms (2014-present). Deep-learning-based target detection has developed along four technical routes: anchor-based methods (one-stage and two-stage), anchor-free methods, Transformer-based methods, and NAS-based methods.

From the perspective of model development, target detection algorithms can be divided into roughly five categories. Traditional algorithms rely heavily on hand-crafted feature design. For anchor-based methods, development can be viewed from two angles: the training paradigm and the anchor shape. In terms of training paradigm, anchor-based detectors are mainly divided into one-stage and two-stage models: one-stage models are mostly used in mobile scenarios because of their fast inference, while two-stage models are mostly used on high-performance hardware because of their higher detection accuracy. In terms of anchor shape, using anchors better matched to the object's own shape yields more accurate results: rectangles and polygons are mostly used in remote sensing and text detection scenarios, while ellipses and circles are mostly used in remote sensing and medical imaging.

5.2. Overview of Target Detection Development


6. Backbone architecture

The backbone is an important part of the target detector, and the features of the input image are extracted through it. Several classic backbone architectures are discussed here.

6.1、AlexNet

After Dropout and ReLU were proposed, AlexNet was born in 2012. The AlexNet paper is considered one of the most influential papers in computer vision; as of 2019 it had been cited about 47,000 times, which shows its influence. AlexNet is the first classic network that truly shaped the later development of CNNs.

Krizhevsky et al. proposed AlexNet, an image classifier based on convolutional neural networks, and won the ILSVRC 2012 challenge by a wide margin over the runner-up (15.3% vs. 26.2% top-5 error). AlexNet includes 8 learnable layers: 5 convolutional layers and 3 fully connected layers. The last fully connected layer feeds an N-way (N being the number of categories) softmax classifier. AlexNet uses convolution kernels of several sizes to extract image features, and uses dropout and ReLU for regularization and faster training respectively. It brought convolutional neural networks back into the public eye and quickly triggered a wave of follow-up research.

The breakthrough points of AlexNet mainly include:

  • The network is bigger and deeper. LeNet-5 (for details, see the post explaining the LeNet-5 structure with animations) has 2 convolutional layers + 3 fully connected layers with about 60,000 parameters, while AlexNet has 5 convolutional layers + 3 fully connected layers with 60 million parameters and 650,000 neurons.
  • ReLU is used as the activation function, whereas LeNet-5 uses Sigmoid. Although ReLU was not invented by Alex, it was this work that made ReLU widely known. For activation functions, see my other blog post on the advantages and disadvantages of activation functions commonly used in deep neural networks. AlexNet's ability to train a deeper network is closely related to its use of ReLU.
  • Data augmentation and dropout are used to address overfitting. The data augmentation includes now well-known techniques such as random cropping and PCA-based color jittering with Gaussian-sampled magnitudes, and dropout has proved to be a very effective way of preventing overfitting.
  • Max pooling replaces average pooling to avoid the blurring effect of averaging, and the pooling stride is smaller than the pooling kernel size, so adjacent pooling outputs overlap, which improves the richness of the features.
  • The LRN layer is proposed to create a competition mechanism among the activities of local neurons, so that relatively large responses become relatively larger while neurons with small feedback are suppressed, which enhances the generalization ability of the model.

6.2、VGG

VGGNet was proposed by Karen Simonyan and Andrew Zisserman in Very Deep Convolutional Networks for Large-Scale Image Recognition in 2014.

While AlexNet and its successor ZFNet focused on smaller receptive-field window sizes to improve accuracy, Simonyan and Zisserman studied the impact of network depth. They proposed VGG, a family of networks of different depths built from small convolution kernels: a large receptive field can be achieved by stacking a series of smaller kernels, which greatly reduces the number of parameters and helps convergence. Their paper shows how deep networks (16-19 layers) can be used for classification and localization with high accuracy. VGG consists of a series of convolutional layers followed by 3 fully connected layers and a softmax layer; the number of convolutional layers varies from 8 to 16. A minimal 11-layer architecture is first trained with random initialization, and its weights are then used to initialize larger networks to prevent unstable gradients. In single-network performance, VGG outperformed GoogLeNet, the 2014 ILSVRC winner, and quickly became the most commonly used backbone in object classification and detection models.

The breakthrough points of VGG mainly include:

  • It refreshed the results of that year's ImageNet Challenge and made great progress compared with previous networks.
  • The error rate dropped below 10%, and the number of layers broke out of single digits, reaching 16-19.
  • It chooses a relatively small convolution kernel (3x3), whereas AlexNet and LeNet-5 used larger kernels such as 11x11 and 7x7. Using small kernels has two main benefits. First, for the same receptive field (e.g., two stacked 3x3 convolutions cover the same receptive field as one 5x5 convolution), the amount of computation and the number of parameters are much smaller; the original paper explains this point in detail and is worth reading directly (a worked parameter comparison follows this list). Second, two layers of 3x3 introduce more nonlinearity than one layer of 5x5, making the model's fitting ability stronger, which is verified experimentally. A further practical advantage is that small kernels are easier to optimize; for example, the Winograd algorithm accelerates small-kernel convolutions particularly well.
  • Using a 1x1 convolution kernel does not change the spatial dimensions of the input and output, and applying ReLU after it adds nonlinearity to the model. Of course, this was not VGGNet's invention; it was first proposed in Network In Network.
  • It proved that increasing network depth can improve accuracy.
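As a concrete illustration of the small-kernel point above, for a layer with C input and C output channels (ignoring bias):

```latex
\underbrace{2 \times (3 \times 3 \times C \times C)}_{\text{two stacked } 3\times 3 \text{ layers}} = 18C^{2}
\;<\;
\underbrace{5 \times 5 \times C \times C}_{\text{one } 5\times 5 \text{ layer}} = 25C^{2}
```

So the stacked 3x3 design saves roughly 28% of the parameters for the same receptive field while adding one extra nonlinearity.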

6.3. GoogLeNet/Inception

GoogLeNet was introduced in Going Deeper with Convolutions by Christian Szegedy et al. in 2014.

Although classification networks were becoming faster and more accurate, they were still far from deployable in real-world applications because they are resource-intensive; computational cost grows rapidly as networks are enlarged for better performance. Szegedy et al. argued that the main reason is wasted computation in the network: larger models have more parameters and tend to overfit. They proposed replacing fully connected architectures with locally sparsely connected architectures to address these issues. GoogLeNet is a 22-layer network consisting of multiple stacked Inception modules. An Inception module applies convolution kernels of several sizes at the same level; the resulting feature maps are concatenated and passed to the next layer. The network also has auxiliary classifiers at intermediate layers for regularization and to help gradients propagate. GoogLeNet shows that efficient use of computational blocks can match networks with far more parameters. It achieved 93.3% top-5 accuracy on ImageNet without external data, while being faster than other contemporary models. Several subsequent iterations further improved performance and further demonstrated the value of carefully designed sparse connection architectures.

The breakthroughs made by GoogLeNet mainly include:

  • The network structure differs markedly from previous networks, with a depth of 22 layers. GoogLeNet introduced branches instead of a single path from input to output. This most intuitive difference is the Inception module, shown in the figure below: each module applies kernels of different sizes and then concatenates the resulting feature maps, effectively acting as an image pyramid, i.e., multi-resolution processing.
  • There are many 1x1 convolution kernels in the figure above. The 1x1 convolution here is used differently from what was mentioned before: it changes the number of output channels, specifically reducing them, in order to reduce computation.
  • FC is replaced by global average pooling. For the final FC layer the parameter count would be 7x7x1024x1024 ≈ 51.3M, but after switching to average pooling it drops to 0, which also helps prevent overfitting. In addition, the authors found that using average pooling improves top-1 accuracy by about 0.6%. Note, however, that FC layers are not completely removed in GoogLeNet.
  • Auxiliary classifiers are used. The whole model has three outputs (previous networks had only one), and the extra outputs are used only during training; at test or deployment time only the final output is used. During training, the losses of the auxiliary outputs are added to the main loss with a weight of 0.3, which alleviates vanishing gradients and, according to the authors, also has a regularization effect. The idea is somewhat similar to the voting mechanism in traditional machine learning, where the final result is decided by several decision makers together, which often improves accuracy by about 2%.

6.4. ResNet

ResNet was proposed by Kaiming He et al. in Deep Residual Learning for Image Recognition in 2015.

As CNNs get deeper, Kaiming He et al. showed that network accuracy saturates and then degrades rapidly. They proposed residual learning for stacked convolutional layers to counter this degradation, implemented by adding skip connections between layers. Such a connection is an element-wise addition between the input and output of a block and adds no extra parameters or computational complexity to the network. A typical 34-layer ResNet is basically a large (7x7) convolution kernel followed by 16 residual blocks (each a pair of small 3x3 filters with an identity skip connection around them) and finally a fully connected layer. For deeper networks, the bottleneck variant of the block stacks 3 convolutional layers (1x1, 3x3, 1x1). He et al. also showed that even the much deeper ResNet-101 and ResNet-152 have lower computational complexity than VGG-16/19 while achieving higher accuracy. In a follow-up paper the authors proposed ResNet v2, which places BN and ReLU before the convolutions within each block (pre-activation), making it more general and easier to train. ResNet is a backbone widely used in classification and detection, and its idea has inspired many other networks.

Key points of ResNet:

  • Shortcut connections are used to alleviate the vanishing-gradient problem caused by very deep networks. ResNet18 and ResNet34 use the connection on the left of the figure below, while very deep networks (ResNet50, ResNet101, ResNet152) use the bottleneck connection on the right, mainly to reduce the number of parameters and the amount of computation.
  • ResNet makes extensive use of batch normalization.
  • Apart from the pooling layers at the very beginning and end of ResNet, downsampling in the middle is done with stride-2 convolutions instead of pooling.
  • From the ResNet network structure it can be seen that shortcut connections are drawn either as solid or as dotted lines. A solid-line connection (red box in the figure) means the channels are the same, and the computation is H(x) = F(x) + x; a dotted-line connection (red box in the figure) means the channels differ, and the computation is H(x) = F(x) + Wx, where W is a convolution used to adjust the dimension of x. (A minimal code sketch of both forms follows this list.)
  • There are two main directions for improving ResNet: making it deeper (e.g., pre-activation ResNet) and making it wider (e.g., ResNeXt).
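Below is a minimal PyTorch-style sketch of the basic residual block described above (illustrative only, not the torchvision implementation); it shows both the identity shortcut and the 1x1 projection used when the shapes differ:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: H(x) = F(x) + x, with a 1x1 projection W when shapes change."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Dotted-line shortcut: project x with a 1x1 convolution when channels or stride change
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # element-wise addition of the shortcut

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28])
```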

6.5. ResNeXt

ResNeXt originated from the paper "Aggregated Residual Transformations for Deep Neural Networks" and was the runner-up in the ILSVRC 2016 classification task. ResNeXt is a combination of ResNet and Inception: by adjusting the width of the network (borrowing from Inception) and increasing the number of branches, it widens the network and effectively improves its performance.

The traditional ways to improve model accuracy are to increase the depth or the width of the model. However, increasing either one raises model complexity and parameter count, while the gains diminish rapidly. Xie et al. proposed the ResNeXt architecture, which is simpler and more efficient than existing models. ResNeXt is inspired by the stacking of similar blocks in VGG/ResNet and the "split-transform-merge" approach of the Inception module. It is essentially a ResNet in which each ResNet block is replaced by an Inception-like ResNeXt module, and Inception's complex, customized transformation branches are replaced by topologically identical branches, making the network easier to scale and generalize. Xie et al. also emphasize that cardinality (the number of parallel paths in a ResNeXt block) can be considered a third dimension, alongside depth and width, for improving model accuracy. ResNeXt is elegant and concise: compared with similarly deep ResNet architectures, it achieves higher accuracy with fewer hyperparameters, and it secured second place in the ILSVRC 2016 classification challenge.

  

6.6. CSPNet

Existing neural networks have achieved incredible results in computer vision tasks; however, they rely on excessive computing resources. Wang et al. believe that a large amount of inference computation can be saved by reducing duplicated gradient information in the network. They proposed CSPNet [25], which creates distinct paths for gradient flow within the network. CSPNet splits the base-layer feature map into two parts: one part is sent through the convolutional block (for example, the Dense and Transition blocks in DenseNet or the Res(X) blocks in ResNeXt), and the other part is combined with the output of the first part at a later stage. This reduces the number of parameters, increases the utilization of computing units, and reduces the memory footprint. It is easy to implement and general enough to be applied to other architectures such as ResNet, ResNeXt, DenseNet, Scaled-YOLOv4, etc. Applying CSPNet to these networks reduces computation by 10% to 20% while accuracy stays the same or improves. The approach also greatly reduces memory overhead and computational bottlenecks, and it is used both in many advanced detector models and on mobile and edge devices.

Key points of CSPNet:

  • Enhance the learning ability of CNNs: the authors want to strengthen the learning ability of CNNs so that accuracy is maintained even when the network is made lighter. CSPNet can easily be added to ResNet, ResNeXt, and DenseNet; computation generally drops by 10% to 20% while accuracy exceeds the original network.
  • Remove structures that create computational bottlenecks: if a computational bottleneck consumes too much compute, inference takes too long, or some computing units sit idle. The authors distribute the computation evenly across layers so that the utilization of each computing unit is effectively improved and unnecessary resource consumption is reduced. CSPNet cuts the computational bottleneck of PeleeNet almost in half, and on the MS COCO dataset it reduces the computational bottleneck of the YOLOv3-based model by 80%.
  • Reduce memory usage: to reduce memory footprint, the authors use cross-channel pooling to compress feature maps during feature pyramid generation. CSPNet reduces the memory consumption of PeleeNet during feature pyramid generation by 75%.

6.7. EfficientNet

Tan et al. systematically studied network scaling and its impact on model performance. They summarize how changing network depth, width, and input resolution affects accuracy, and how scaling any single dimension has an associated cost. Increasing depth helps capture richer and more complex features, but deeper networks are harder to train due to vanishing gradients. Likewise, scaling up width makes it easier to capture fine-grained features but harder to learn high-level ones, and gains from increasing image resolution, like those from depth and width, saturate as the model grows. Tan et al. therefore propose a compound coefficient that scales all three dimensions together. Each dimension has an associated constant, found by fixing the coefficient to 1 and performing a grid search on the baseline network. The baseline architecture was inspired by their previous work (MnasNet), which was developed by neural architecture search optimizing both accuracy and computation. EfficientNet is a simple and efficient architecture: it outperforms existing models in both accuracy and speed while being much smaller, and by greatly improving efficiency it has the potential to usher in a new era of efficient networks.
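Written out, the compound scaling rule described above (following the EfficientNet paper; φ is the user-chosen compound coefficient) is:

```latex
\text{depth: } d = \alpha^{\phi}, \quad
\text{width: } w = \beta^{\phi}, \quad
\text{resolution: } r = \gamma^{\phi},
\qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\;\; \alpha, \beta, \gamma \ge 1
```

The constants α, β, γ are found once by grid search with φ = 1, so that increasing φ by one roughly doubles the total FLOPs.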


7. Target detector

In this survey, detectors are divided into two categories: two-stage detectors and one-stage detectors; traditional detection methods are also briefly reviewed. If a network has a separate module for generating region proposals (candidate boxes), it is called a two-stage detector: the first stage finds a certain number of object proposals, and the second stage localizes and classifies each proposal. Because of these multiple stages, such networks usually spend a long time generating proposals, have complex structures, and lack global information. One-stage detectors classify and localize semantic objects directly via dense sampling, using predefined boxes/points of different scales and aspect ratios to locate objects; they surpass two-stage detectors in real-time performance and simplicity of design.

7.1. Traditional detection methods

1) Viola-Jones

Proposed in 2001, the Viola-Jones detector is mainly used for face detection and is an accurate and powerful detector. It combines several techniques, such as Haar features, integral images, Adaboost, and cascaded classifiers. The first step searches for Haar-like features by sliding a window over the input image, using the integral image for fast computation. Adaboost is then used to train classifiers for the Haar features, which are combined into a cascade. The Viola-Jones algorithm is still used on small devices because it is very efficient and fast.

2)HOG

Dalal and Triggs proposed the Histogram of Oriented Gradients (HOG) feature descriptor in 2005 for feature extraction in target detection. Compared with earlier descriptors, HOG is an improvement that extracts gradient magnitudes and edge orientations to build feature histograms: the image is divided into a grid, and a histogram of gradient orientations is built for each cell. HOG features are generated for regions of interest and fed into a linear SVM classifier for detection. It was proposed as a pedestrian detector, although it can be trained to detect various other classes.

3)DPM

Deformable Parts Model (DPM) was introduced by Felzenszwalb et al. and was the champion of the 2009 Pascal VOC Challenge. It utilizes individual "parts" of objects for detection with higher accuracy than HOG. It follows a divide-and-conquer philosophy; during inference, parts of objects are detected individually and one possible permutation of them is labeled as the detection result. For example, a human body can be thought of as a collection of parts such as head, arms, legs, and torso. A model will be assigned to capture a portion of the entire image, and the process repeated for all of these portions. Then, another model removes those impossible combinations to generate final detections. DPM-based models were among the most successful algorithms before the era of deep learning.

7.2. Two-stage detector

The two-stage target detection algorithm first extracts candidate boxes from the image, and then performs a second round of correction based on the candidate regions to obtain the final detection result. Its detection accuracy is high, but its detection speed is slow.

The pioneering work of this type of algorithm is RCNN[3], and then Fast RCNN[4] and Faster RCNN[5] improved it in turn.

Due to its excellent performance, Faster RCNN is still a very competitive algorithm in the field of target detection. Subsequently, algorithms such as FPN[6] and Mask RCNN[7] proposed improvements for the shortcomings of Faster RCNN, which further enriched the components of Faster RCNN and improved its performance.

1)R-CNN

Region-based Convolutional Neural Network (R-CNN), the first work in the R-CNN series, proved that CNNs can greatly improve detection performance. R-CNN turns detection into a classification and localization problem using a class-agnostic region proposal module plus CNN features. The mean-subtracted input image is first passed through the region proposal module to generate about 2000 candidate regions; this module uses Selective Search (SS) to find parts of the image with a high probability of containing an object. These candidates are then warped and passed through a CNN, which extracts a 4096-dimensional feature vector for each proposal. Girshick et al. used AlexNet as the detector backbone. The feature vectors are then fed into trained, category-specific SVMs to obtain confidence scores, and non-maximum suppression (NMS) filters the scored regions based on IoU and class. Once the category is determined, a trained bounding box regressor predicts the box as four parameters: the center coordinates (x, y) and the width and height (w, h).
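Rather than absolute coordinates, the regressor learns offsets of the proposal P toward the ground-truth box G; the standard parameterization used by R-CNN and its successors is:

```latex
t_{x} = \frac{G_{x} - P_{x}}{P_{w}}, \quad
t_{y} = \frac{G_{y} - P_{y}}{P_{h}}, \quad
t_{w} = \log\frac{G_{w}}{P_{w}}, \quad
t_{h} = \log\frac{G_{h}}{P_{h}}
```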

R-CNN has a complex multi-stage training process: in the first stage, a large classification dataset is used to pre-train the CNN; in the second stage, domain-specific images (mean-subtracted, warped proposals) are used to fine-tune it for detection, with the classification layer of the CNN replaced by an (N+1)-way classifier (N being the number of categories); finally, a linear SVM and a bounding box regressor are trained for each category.

R-CNN has caused a new wave in the field of object detection, but it is slow (47 seconds per image) and has high temporal and spatial complexity. It has a complex training process that takes days to train on small datasets even when some computation is shared.

2)SPP-Net

He et al. proposed the spatial pyramid pooling (SPP) layer to handle images of any size and aspect ratio. They observed that only the fully connected layers require fixed-size input. SPP-Net therefore runs the convolutional layers of the CNN once over the whole image and inserts a pyramid pooling layer before the fully connected layers, so the network no longer depends on input size/aspect ratio and redundant computation is avoided. The algorithm for generating candidate windows is still Selective Search (SS). Feature maps are extracted from the input image by the convolutional layers of the ZF-5 network; the candidate windows are then mapped onto the feature maps and converted into fixed-length representations by the spatial bins of the pyramid pooling layer. Finally, the resulting vector is fed into fully connected layers, and an SVM classifier predicts the category and score. Similar to R-CNN, SPP-Net also has a post-processing bounding box regression step to improve localization accuracy. It also uses a multi-stage training process; except for fine-tuning, the later steps are performed only on the fully connected layers.

Under the premise of similar accuracy, SPP-Net is much faster than R-CNN. It can also process images of any size and ratio, so it also avoids target deformation caused by input deformation. However, since its architecture is similar to R-CNN, it also has the disadvantages of R-CNN, like multi-stage training, expensive computation and training time.

SPP-Net key points:

  • The entire image is fed into the CNN once to extract a full-image feature map; the feature corresponding to each region proposal is then cropped from this feature map according to its position, avoiding running the CNN separately on every region proposal.
  • The SPP layer is added after the conv5 layer of the original CNN, so warping the region proposals becomes unnecessary, because the SPP layer accepts inputs of different sizes and produces outputs of the same size.

3)Fast R-CNN

A major disadvantage of R-CNN and SPP-Net is the need for separate training of multiple stages. Fast R-CNN addresses this by creating a single end-to-end trainable system. The network feeds an image through a series of convolutional layers, and the object proposals are mapped onto the resulting feature maps. Girshick replaced the pyramid-structured pooling of SPP-Net with an ROI-Pooling layer, followed by two fully connected layers, which then branch into an (N+1)-way softmax layer and a fully connected bounding box regression layer. The model also changes the loss of the bounding box regressor from L2 to smooth L1 to improve performance, and introduces a multi-task loss to train the network.

The author also uses improved pre-trained models as the backbone. The network is trained in a single step using stochastic gradient descent (SGD) with mini-batches of 2 images, which helps the network converge faster because backpropagation shares computation among the RoIs sampled from the same image.

Fast R-CNN was introduced mainly as a speed improvement (146× faster than R-CNN); the accuracy improvement is secondary.

4)Faster R-CNN

Although Fast R-CNN moved closer to real-time object detection, its region proposal generation was still an order of magnitude slower (2 seconds per image vs. 0.2 seconds per image). Ren et al. proposed a fully convolutional Region Proposal Network (RPN) that takes an arbitrary input image and outputs a set of candidate windows, each with an associated objectness score indicating the likelihood that it contains an object. RPN introduced the concept of anchors: multiple bounding boxes of different scales and aspect ratios placed at each location, on top of which offsets are regressed to localize objects. The input image first passes through a CNN to obtain feature maps, which are forwarded to the RPN to generate bounding boxes and their classifications. The selected proposals are then mapped back onto the feature maps extracted by the earlier CNN layers, and finally sent to fully connected layers for classification and box regression. Faster R-CNN is essentially Fast R-CNN with the RPN as its region proposal module.

Training Faster R-CNN is more complicated because there are shared layers between the two models that perform different tasks. First, RPN is pre-trained on the ImageNet dataset and fine-tuned on the PASCAL VOC dataset. Then, use the region proposals obtained by the RPN in the first step to train a Fast R-CNN. So far, the network has not shared convolutional layers. Now, we fix the convolutional layers of the detector and fine-tune the RPN. Finally, Fast R-CNN is fine-tuned from the updated RPN.

5) FPN

Using image pyramids at multiple levels to obtain feature pyramids (featurized image pyramids) is a common way to improve small object detection. Although it improves the average accuracy of a detector, the increase in inference time is substantial. Lin et al. proposed the Feature Pyramid Network (FPN), which uses a top-down architecture with lateral connections to build high-level semantic features at all scales. FPN has two pathways: a bottom-up pathway, in which a ConvNet computes a feature hierarchy at several scales, and a top-down pathway, which upsamples coarse feature maps from higher levels into higher-resolution features. These pathways are connected laterally by 1x1 convolutions to enhance the semantic information in the features. Here, FPN is used in the RPN of Faster R-CNN, with ResNet-101 as the backbone.

FPN provides high-level semantics at all scales, reducing the detection error rate. It became a standard building block of later detection models, improving overall accuracy, and it paved the way for improved successors such as PANet, NAS-FPN, and EfficientDet's BiFPN.

6) R-FCN

Dai et al. proposed the region-based fully convolutional network (R-FCN), which shares almost all computation across the network, unlike earlier two-stage detectors that applied resource-intensive per-proposal computation. They argued against fully connected layers and used convolutional layers instead. However, the deep layers of ConvNets are translation-invariant, which makes them ineffective for localization; the authors propose position-sensitive score maps as a remedy. These score maps encode relevant spatial information and are later pooled to determine accurate localization. R-FCN divides each RoI into a k×k grid, computes a score for each cell, and then averages these scores to predict the object category. The R-FCN detector is a combination of four convolutional networks: the input image first passes through ResNet-101 to obtain feature maps; the intermediate output (Conv4) is sent to the RPN to determine RoI proposals, while the final output is further processed by a convolutional layer and sent to the classifier and regressor. The classification layer generates predictions by combining the position-sensitive maps with the RoI proposals, while the regression network outputs the bounding box details. R-FCN adopts a 4-step training scheme similar to Faster R-CNN, with a combined cross-entropy and bounding box regression loss, and Online Hard Example Mining (OHEM) is used during training.

Dai et al. propose a new approach to address translation invariance in convolutional neural networks. R-FCN combines Faster R-CNN and FCN to achieve a fast and more accurate detector. Although it doesn't improve accuracy much, it is 2.5-20 times faster than similar products.

7)MaskR-CNN

Mask R-CNN extends Faster R-CNN by adding a parallel branch that performs pixel-level instance segmentation. This branch is a small fully convolutional network applied to each RoI, segmenting every pixel at little extra cost. It uses an architecture similar to Faster R-CNN for proposal extraction, but adds a mask head parallel to the classification and regression heads. A key difference is the use of a RoIAlign layer instead of RoIPool to avoid the pixel-level misalignment caused by spatial quantization. For better accuracy and speed, the authors chose ResNeXt-101 with a Feature Pyramid Network (FPN) as the backbone. The loss function of the original Faster R-CNN was extended with a mask loss, and, as in FPN, 5 scales and 3 aspect ratios are used for the anchors. The overall training of Mask R-CNN is similar to Faster R-CNN.

Mask R-CNN performs better than the existing SOTA architectures of its time, adding instance segmentation with very little extra overhead. The algorithm is simple and flexible to train and generalizes well to tasks such as keypoint detection and human pose estimation. However, it still falls short of real-time performance (30 fps).

8)DetectoRS

Many contemporary two-stage detectors use a "look and think twice" mechanism, where object proposals are first computed and features are then extracted to detect objects. DetectoRS applies this mechanism at both the macro and micro levels of the network. At the macro level, it proposes the Recursive Feature Pyramid (RFP), which stacks multiple Feature Pyramid Networks (FPN) with extra feedback connections from the top-down levels back to the bottom-up levels. The output of one FPN is processed by an Atrous Spatial Pyramid Pooling (ASPP) layer and sent to the next FPN, and an attention-based fusion module combines the outputs of the different FPNs. At the micro level, Qiao et al. proposed Switchable Atrous Convolution (SAC) to adjust the dilation rate of convolutions. An average pooling layer with a 5x5 filter and a 1x1 convolution is used as the switch function to decide the atrous rate [55], which helps the backbone dynamically detect objects of various scales. They also place the SAC between two global context modules, as this makes the switching more stable. Combining the two techniques, Recursive Feature Pyramid and Switchable Atrous Convolution, yields the DetectoRS detector. The authors use these techniques with Hybrid Task Cascade (HTC) as the baseline and a ResNeXt-101 backbone.

DetectoRS combines several systems to improve detector performance and set a new state of the art among two-stage detectors. Its RFP and SAC modules generalize well and can be used in other detection models. However, running at only about 4 frames per second, it is not suitable for real-time detection.

7.3. One-stage detector

1)YOLO

Two-stage detectors treat detection as a classification problem: a module enumerates candidate boxes, which the network classifies as foreground or background. YOLO, in contrast, reformulates detection as a regression problem, directly predicting objectness and bounding box attributes from image pixels. In YOLO, the input image is divided into an S×S grid, and the cell containing the center of an object is responsible for detecting it. Each grid cell predicts multiple bounding boxes, and each prediction consists of five elements: the box center (x, y), the box width and height (w, h), and a confidence score.
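Concretely, with B boxes per cell and C classes, the network's output is an S × S × (B·5 + C) tensor; with the original paper's Pascal VOC settings (S = 7, B = 2, C = 20) this is:

```latex
7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30
```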

YOLO was inspired by the GoogLeNet model for image classification, which uses cascaded modules of smaller convolutional networks. It pre-trains on ImageNet data until the model reaches high accuracy, and then refines the model by adding randomly initialized convolutions and fully connected layers. During training, each network cell only predicts one class, which can lead to better convergence, but during inference, multiple classes can be predicted. The model is optimized using a multi-task loss, the combined loss of all prediction components. Non-maximum suppression (NMS) removes multiple detections of specific classes.

YOLO far outperforms its contemporary single-stage real-time models in both accuracy and speed. However, it also has obvious disadvantages. The localization accuracy of small or clustered objects and the limitation of the number of objects per cell are its main disadvantages. These issues have been fixed in subsequent versions of YOLO.

2)SSD

The Single Shot MultiBox Detector (SSD) was the first one-stage detector to match the accuracy of contemporary two-stage detectors such as Faster R-CNN while maintaining real-time speed. SSD is built on VGG-16 with additional auxiliary structures to improve performance. These auxiliary convolutional layers are added to the end of the model and progressively decrease in size. SSD detects smaller objects in earlier layers, where the image features are not yet too coarse, while deeper layers with coarser features handle larger objects via default boxes of different scales and aspect ratios.

During training, SSD matches each GT box to the default box with the best Jaccard overlap and trains the network accordingly, similar to Multibox. Hard negative mining and extensive data augmentation are also used. Like DPM, SSD uses a weighted sum of localization and confidence losses to train the model, and the final output is obtained by non-maximum suppression.

Although SSD is faster and more accurate than state-of-the-art networks like YOLO and Faster R-CNN, it has trouble detecting small objects. This problem was later solved by using better backbone architectures (such as ResNet) and other small patches.

3) YOLOv2 and YOLO9000

YOLOv2 is an improved YOLO that provides a simple trade-off between speed and accuracy, while the YOLO9000 model can predict 9000 object classes in real time. Both replace the GoogLeNet-style backbone of YOLO with DarkNet-19. They combine many effective techniques, such as batch normalization to improve convergence, joint training on classification and detection datasets to increase the number of detectable categories, removing fully connected layers to improve speed, and using anchor boxes obtained by k-means clustering to improve recall and provide prior knowledge. Redmon et al. also used WordNet to merge hierarchically structured classification and detection datasets into a WordTree: even when the exact class cannot be predicted confidently, the hierarchy allows predicting its parent category with higher conditional probability, improving overall performance.

YOLOv2 provides better flexibility in choosing the speed and accuracy of the model, and the new architecture has fewer parameters. As the title of the article suggests "better, faster and stronger".

4)RetinaNet

Given the accuracy gap between one-stage and two-stage detectors, Lin et al. attributed the lag of one-stage detectors to the "extreme foreground-background class imbalance". They proposed a modified cross-entropy loss, called Focal Loss, to address the imbalance by down-weighting the loss contribution of easy examples. The authors demonstrated its effectiveness with a simple one-stage detector, RetinaNet, which predicts objects by densely sampling locations, scales, and aspect ratios over the input image. RetinaNet uses a ResNet augmented with a Feature Pyramid Network (FPN) as the backbone, and two similar subnetworks for classification and box regression. Every level of the FPN is passed to the subnetworks, enabling detection of objects at different scales. The classification subnet predicts an object score for each location, while the box regression subnet regresses the offset of each anchor to the GT. Both subnets are small FCNs and share parameters across FPN levels. Unlike most previous networks, the authors used a class-agnostic bounding box regressor and found it equally effective.
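For reference, the Focal Loss for the predicted probability p_t of the true class is:

```latex
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma}\, \log(p_t)
```

where the focusing parameter γ (typically 2) shrinks the loss of well-classified examples and α_t balances the classes; with γ = 0 and α_t = 1 it reduces to standard cross-entropy.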

RetinaNet is simple to train, fast to converge, and easy to implement. It outperforms two-stage detectors in both accuracy and running time. RetinaNet also advances methods for object detector optimization by introducing a new loss function.

5)YOLOv3

YOLOv3 has "incremental improvements" compared to previous YOLO versions. Redmon et al. replaced the original feature extractor with a larger Darknet-53 network. They also integrated various techniques such as data augmentation, multi-scale training, batch normalization, etc.; the Softmax of the classifier layer was replaced by a logistic classifier.

Although YOLOv3 is faster than YOLOv2, it does not have any breakthrough changes compared to the previous version, and its accuracy is not even as good as a year-old SOTA detector.

6) CenterNet

Zhou et al. take a very different approach: modeling objects as points rather than as the traditional bounding boxes. CenterNet predicts each object as a single point at the center of its bounding box. The input image is passed through an FCN that generates a heatmap, whose peaks correspond to the centers of detected objects. It uses an ImageNet-pretrained Hourglass-101 as the feature extraction network and has 3 heads: a heatmap head for object center points, a size (w, h) head, and a center-offset head. At training time, the multi-task loss of the three heads is backpropagated into the feature extractor. During inference, the output of the offset head is used to refine the object point, and a box is finally generated. Since the predictions are unique peak points rather than overlapping boxes, non-maximum suppression (NMS) post-processing is not needed.

CenterNet departs from the usual routines of target detection and offers a novel perspective. It is more accurate and has shorter inference time than earlier comparable methods, and it can be applied to various tasks such as 3D object detection, keypoint estimation, pose estimation, instance segmentation, and orientation detection. However, different tasks require different backbone architectures, because architectures that work well with other detectors tend to perform poorly with CenterNet, and vice versa.

For more information about CenterNet, please refer to: CenterNet (Objects as Points): Minimalist Anchor-free target detection framework, CenterNet post-processing process and source code analysis.

7) EfficientDet

EfficientDet builds on the idea of a scalable detector with higher accuracy and efficiency, introducing efficient multi-scale features (BiFPN) and model scaling. BiFPN is a bidirectional feature pyramid network with learnable weights for cross-scale connection of input features at different scales. It improves on NAS-FPN, which requires extensive search and produces a complex network, by removing nodes with only one input edge and adding extra lateral connections, eliminating inefficient nodes and enhancing high-level feature fusion. Unlike existing detectors that scale up by using larger, deeper backbones or stacking FPN layers, EfficientDet introduces a compound coefficient that jointly scales up "all dimensions of the backbone network, BiFPN network, class/box network, and resolution". EfficientDet uses EfficientNet as the backbone feature extractor, followed by several stacked BiFPN layers whose outputs are fed to the class and box prediction networks. The model is trained with an SGD optimizer and synchronized BN, and uses the swish activation instead of the standard ReLU, which is differentiable, more efficient, and performs better.
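The cross-scale fusion in BiFPN is weighted; following the EfficientDet paper's "fast normalized fusion", the output O of a node with inputs I_i and learnable weights w_i ≥ 0 is:

```latex
O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i
```

where ε is a small constant for numerical stability, so each input's contribution is normalized to lie between 0 and 1.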

EfficientDet achieves better efficiency and accuracy than previous detectors, while being smaller and less computationally expensive. It is easily scalable, can be well applied to other tasks, and is the current state-of-the-art model for single-stage object detection.

8) YOLOv4

YOLOv4 combines many effective ideas to build an object detector that is fast and easy to train on commodity systems. It uses a "Bag of Freebies": techniques that only increase training time without affecting inference time, such as data augmentation, regularization methods, class label smoothing, CIoU loss, Cross mini-Batch Normalization (CmBN), self-adversarial training, and cosine annealing learning-rate scheduling. It also adds methods that slightly affect inference time, called the "Bag of Specials", including Mish activation, cross-stage partial connections (CSP), the SPP block, the PAN path-aggregation block, and multi-input weighted residual connections (MiWRC), and it uses a genetic algorithm for hyperparameter search. YOLOv4 uses CSPDarknet-53 pre-trained on ImageNet as the backbone, SPP and PAN blocks as the neck, and the YOLOv3 head as the head.

Most current detection algorithms require multiple GPUs to train the model, but YOLOv4 can be easily trained on a single GPU. It is twice as fast as EfficientDet, yet has similar performance, reaching SOTA.

9)Swin Transformer

Transformer has had a profound impact on the field of natural language processing (NLP) since its inception. Its application in language models such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-To-Text Transfer Transformer) has driven technological progress in the field. Transformer [75] uses an attention model to establish dependencies between sequence elements and can attend to longer contexts than other sequential architectures. This success in natural language processing has sparked interest in applying it to computer vision. CNNs have long been the pillar of CV, but they have some inherent shortcomings, such as limited use of global context and weights that are fixed after training.

Swin Transformer aims to provide a Transformer-based backbone for computer vision tasks. It splits the input image into multiple non-overlapping patches and converts them into tokens. A stack of Swin Transformer blocks is then applied to the patches over 4 stages, with each subsequent stage reducing the number of patches to build a hierarchical representation. The Swin Transformer block consists of a local multi-head self-attention (MSA) module based on windows that are alternately shifted in successive blocks. With local self-attention the computational complexity scales linearly with image size, while the shifted windows enable cross-window connections. The authors also show that shifted windows improve detection accuracy with little overhead.

Transformers provide a paradigm different from CNNs, but their application in CV is still in its infancy, and their potential to replace convolution in these tasks is very large. Swin Transformer reached a new SOTA on MS-COCO, although its parameter count is higher than that of comparable CNN models.

For more information about Swin Transformer, please refer to: Swin Transformer: Hierarchical Visual Transformer Using Sliding Window.

7.4. Target detection algorithm based on Anchor-Free

We discussed anchor-based methods earlier. Their common feature is that they generate multiple candidate boxes of different sizes and aspect ratios at each position (one-stage models usually generate them via sliding windows or clustering, while two-stage models mostly generate them via an RPN), and these candidates are screened before classification and regression. To a certain extent this handles the problems of varying target scales and occlusion, and improves detection accuracy.

The benefits of anchors:

  • The network can perform classification and regression tasks directly on the anchor. (With higher resolution, the extracted features are more abundant)
  • Adds prior knowledge, making the model more stable and robust. (Because an artificial prior distribution is imposed and the regression targets cover a relatively small range during training, anchor-based networks are easier to train and more stable.)
  • Can improve the recall rate, especially for small object detection.
  • To a certain extent, the problem of object occlusion and scale inconsistency is solved.

Anchor limitations:

  • Rely on too much manual design!
  • The training and prediction process is inefficient! (Generating the candidate boxes greatly increases the time and computing power required.)
  • Positive and negative sample imbalance problem!

The difference between anchor-free and anchor-based:

  • The anchor-based method is to represent the object through the anchor and the corresponding encoding information. (It is necessary to pre-set a certain number of anchors for each position in the feature map of the image, and then classify and return each anchor)
  • The anchor-free method mainly represents objects through multiple key points (corner points) or through center points and corresponding boundary information. (There is no need to pre-set the anchor, and the target detection is performed directly on the image)
  • The difference between the two lies in whether to use the anchor to generate the candidate frame proposal. It can also be said that the difference between the two lies in the difference in the solution space.

Compared with anchor-based methods, the biggest advantage of anchor-free detectors is detection speed: since no anchors need to be preset, only the object center point and width/height need to be regressed on feature maps of different scales, which greatly reduces the time and computing power required. The disadvantage is that their accuracy generally does not reach the SOTA of anchor-based methods. In the past two years, detectors combining anchor-based and anchor-free designs have also been proposed.
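To make the contrast concrete, the sketch below shows the anchor tiling step that anchor-based detectors must perform and anchor-free detectors skip; the scales, ratios, and stride are illustrative only.

```python
import torch

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * len(scales) * len(ratios), 4) anchors in
    (x1, y1, x2, y2) image coordinates, centered on each feature-map cell."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride   # cell center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)     # area ~ s^2, aspect ratio r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

# e.g. a 38x50 stride-16 feature map yields 38 * 50 * 9 = 17100 candidate boxes to score
```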

7.5. Target detection algorithm based on Transformer

Transformer has been a hot topic in recent years. It was first applied in the field of NLP, and judging from publication statistics, research output in CV, speech and video, and multimodality has grown very rapidly in recent years. Transformers in the CV field are generally referred to as Vision Transformers, or ViT for short.

At present, Transformer-based target detection algorithms are mainly based on the DETR and ViT series. The Transformer detection models extended from DETR mainly improve and extend the object query (adding prior knowledge), the attention mechanism (sparse attention that focuses on meaningful regions), the label assignment mechanism (OTA, etc.), and the feature matching and assignment mechanisms.

1)DETR

The full name of DETR is DEtection TRansformer, which is an end-to-end target detection network based on Transformer proposed by Facebook and published in ECCV2020.

Transformer has been widely used since it was proposed in 2017. Not only has it essentially become the unified paradigm in NLP, it has also been applied to visual tasks such as image classification, target detection, and action recognition, replacing CNNs in some roles, and there is a tendency toward unifying NLP and CV. As the pioneering work of Transformers in target detection, DETR is a hurdle that cannot be avoided when learning Transformers for CV. The predecessors planted the trees and later generations enjoy the shade: studying its classic ideas and code is a big step forward in itself.

The idea of DETR is similar in essence to traditional target detection, but the way it is expressed is very different. Traditional methods such as anchor-based approaches essentially classify predefined dense anchors and regress box coefficients. DETR treats target detection as a set prediction problem (the set plays a role similar to anchors). Since the Transformer is essentially a sequence transformation, DETR can be regarded as converting an image sequence into a set sequence; this set is in fact a learnable positional encoding (the object queries).

The network structure of DETR is divided into three parts:

  • The first part is a traditional CNN for extracting high-dimensional image features;
  • The second part is a Transformer encoder-decoder structure that produces the bounding boxes;
  • Finally, use Bipartite matching loss to train the network.

First, a 3-channel image is fed into the CNN backbone to extract image features; these are combined with positional information and fed into the encoder and decoder of the Transformer to obtain the detection results. Each result is a tuple containing the object category and the position of the detection box.
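Below is a heavily simplified sketch of this pipeline, adapted from the minimal demo given in the DETR paper; the hyperparameters are illustrative, and the training-time Hungarian bipartite matching loss and auxiliary losses are omitted.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """A minimal DETR-style sketch: CNN backbone -> Transformer encoder-decoder ->
    class and box heads driven by learnable object queries."""
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.conv = nn.Conv2d(2048, hidden_dim, 1)                      # project to hidden_dim
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)        # +1 for "no object"
        self.bbox_head = nn.Linear(hidden_dim, 4)                       # (cx, cy, w, h)
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # simple learnable 2D positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, x):
        h = self.conv(self.backbone(x))                   # (B, hidden_dim, H, W)
        B, C, H, W = h.shape
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)              # (H*W, 1, hidden_dim)
        src = pos + h.flatten(2).permute(2, 0, 1)          # image token sequence
        tgt = self.query_pos.unsqueeze(1).repeat(1, B, 1)  # object queries
        out = self.transformer(src, tgt)                   # (num_queries, B, hidden_dim)
        return self.class_head(out), self.bbox_head(out).sigmoid()
```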

2)YOLOS

YOLOS draws on DETR's set-prediction formulation and ViT's encoder-only backbone to redesign a pure encoder-only detector.

YOLOS background:

  1. Transformer detection series with a CNN backbone: for example, the DETR series uses randomly initialized Transformers to encode and decode CNN features, which does not reveal the transferability of pre-trained Transformers to target detection.
  2. ViT can use the transformer directly as a backbone to perform image classification from a pure sequence-to-sequence perspective. It is important to know that ViT is different from CNN in that it models long-range dependencies and global contextual information instead of local and region-level relationships. In addition, ViT lacks a layered structure (multi-scale) like CNN to deal with changes in the scale of visual entities. So can ViT be a target detection backbone?
  3. Previous transformer detection series with ViT as backbone: ViT-FRCNN is the first to use pre-trained ViT as the backbone of R-CNN object detector. However, this design cannot get rid of the dependence on convolutional neural network (CNN) and strong 2D inductive bias, because ViT-FRCNN reinterprets the output sequence of ViT as a 2D spatial feature map and relies on the region pooling operation (i.e., RoIPool or RoIAlign) and a region-based CNN architecture to decode ViT features for object-level perception.

YOLOS framework:

  1. YOLOS removes the [CLS] tokens for image classification and adds one hundred randomly initialized detection [DET] tokens to the sequence of input patch embeddings for object detection.
  2. The image classification loss used in ViT is replaced by a binary matching loss to perform DETR-like object detection. This avoids reinterpreting ViT's output sequence as a 2D feature map and prevents manual injection of heuristics and prior knowledge of the object's 2D spatial structure during label assignment.

The starting point of YOLOS is not better performance, but accurately revealing the transferability of ViT to target detection. With only very minor modifications to ViT, the architecture can be successfully migrated to the challenging COCO object detection benchmark and achieve 42 box AP. Precisely because the modifications are minimal, YOLOS reveals the flexibility and generalization ability of the Transformer.
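As a rough sketch of the [DET]-token idea, the module below replaces ViT's [CLS] token with 100 randomly initialized detection tokens; dimensions and the module name are illustrative, not the official implementation, and positional embeddings and the ViT encoder itself are omitted.

```python
import torch
from torch import nn

class YOLOSTokens(nn.Module):
    """Append [DET] tokens to the patch-embedding sequence fed to a plain ViT encoder."""
    def __init__(self, embed_dim=768, num_det_tokens=100):
        super().__init__()
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
        nn.init.trunc_normal_(self.det_tokens, std=0.02)   # randomly initialized [DET] tokens

    def forward(self, patch_embeddings):
        # patch_embeddings: (B, num_patches, embed_dim), no [CLS] token
        B = patch_embeddings.shape[0]
        det = self.det_tokens.expand(B, -1, -1)
        return torch.cat([patch_embeddings, det], dim=1)   # sequence for the ViT encoder
```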

3)Swin Transformer

Original paper address:  https://arxiv.org/abs/2103.14030
Official open source code address: https://github.com/microsoft/Swin-Transformer

Pytorch implementation code:  http://pytorch_classification/swin_transformer

Swin Transformer is a paper published by Microsoft Research at ICCV 2021, and it won the ICCV 2021 Best Paper award.

Transformer has set off an upsurge in the CV field: ViT for image classification, DETR for target detection, SETR for image segmentation, and METRO for 3D human pose. Although these Transformers designed for different tasks can indeed do the work of CNNs, the computational complexity of native Self-Attention remains unsolved: Self-Attention must compute an N^{2} attention matrix over all N input tokens. Considering that visual information is inherently two-dimensional (images) or even three-dimensional (video), the computation becomes prohibitive once the resolution is even slightly higher. What Swin Transformer sets out to solve is exactly this computational complexity problem.
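For reference, with h x w patch tokens, channel dimension C and window size M, the complexity comparison given in the Swin Transformer paper is:

\Omega(\text{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad \Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC

The second term of global MSA is quadratic in the number of tokens hw, while for W-MSA it is linear in hw when the window size M is fixed.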

About the difference between Swin Transformer and Vision Transformer:

  • Swin Transformer uses a hierarchical construction similar to convolutional neural networks (hierarchical feature maps): for example, the feature maps are downsampled by 4x, 8x and 16x. Such a backbone makes it convenient to build tasks such as target detection and instance segmentation on top of it. The earlier Vision Transformer instead downsamples by 16x right from the start, and all subsequent feature maps keep the same downsampling rate.
  • Swin Transformer uses the concept of Windows Multi-Head Self-Attention (W-MSA): for example, at the 4x and 8x downsampling stages in the figure below, the feature map is divided into multiple disjoint windows, and Multi-Head Self-Attention is performed only within each window. Compared with performing Multi-Head Self-Attention directly on the entire (global) feature map as Vision Transformer does, the purpose is to reduce computation, especially when the shallow feature maps are large. Although this reduces computation, it also cuts off information transfer between different windows, so the paper proposes the concept of Shifted Windows Multi-Head Self-Attention (SW-MSA), which allows information to be transferred between adjacent windows.

The basic process of the architecture of the Swin Transformer (Swin-T) network is as follows:

  • First, the image is fed into the Patch Partition module for patch splitting: every 4x4 block of adjacent pixels forms a Patch, which is then flattened along the channel direction. Assuming an RGB three-channel input, each patch contains 4x4 = 16 pixels, each with R, G and B values, so flattening yields 16x3 = 48 values; after Patch Partition the image shape thus changes from [H, W, 3] to [H/4, W/4, 48]. A Linear Embedding layer then linearly transforms the channel dimension of each position from 48 to C, so the shape changes from [H/4, W/4, 48] to [H/4, W/4, C]. In the source code, Patch Partition and Linear Embedding are actually implemented together as a single convolutional layer, exactly like the Embedding layer in the earlier Vision Transformer (a minimal sketch of this step and of the window partition used by W-MSA follows this list).
  • Then, feature maps of different sizes are constructed through four stages. Stage 1 starts with the Linear Embedding layer, while the remaining three stages each start with a Patch Merging layer for downsampling (described in detail later). Swin Transformer Blocks are then stacked repeatedly. Note that the Block actually has two structures, as shown in Figure (b): one uses the W-MSA structure and the other uses the SW-MSA structure. These two structures are used in pairs, first a W-MSA block and then an SW-MSA block, which is why the number of stacked Swin Transformer Blocks is always even.
  • Finally, for the classification network, a Layer Norm layer, a global pooling layer, and a fully connected layer will be connected to obtain the final output. It is not drawn in the picture, but it is done in the source code.
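A minimal sketch of the Patch Partition + Linear Embedding step and of the window partition used by W-MSA, assuming Swin-T-like sizes (C = 96, 7x7 windows); this mirrors the structure of the official code but is simplified.

```python
import torch
from torch import nn

class PatchEmbed(nn.Module):
    """Patch Partition + Linear Embedding, implemented as a single strided convolution."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        # 4x4 non-overlapping patches: 4*4*3 = 48 raw values -> embed_dim channels
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)       # (B, H/4 * W/4, C) token sequence
        return self.norm(x)

def window_partition(x, window_size=7):
    """Split a (B, H, W, C) feature map into non-overlapping windows for W-MSA."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # (num_windows * B, window_size * window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
```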

In general, when the Transformer craze swept the CV field, Swin Transformer chose the right problem to be solved. The computational complexity problem is very critical for the application of the Transformer structure on CV, and the solution is reasonable and intuitive. The final performance is also very good, making it a non-negligible SOTA method.

4) ViTDet

Both the one-stage and anchor-free models mentioned earlier rely on multi-scale feature extraction to improve detection accuracy, so FPN is considered standard in current detection tasks. The motivation of FPN is to combine early high-resolution features with later, stronger features, which FPN achieves through top-down and lateral connections. If the backbone network is not hierarchical, the basis for FPN's motivation disappears, since all feature maps in the backbone have the same resolution.

For the original ViT, since there is no downsampling, it cannot supply feature maps of different resolutions the way a CNN can, and detection on high-resolution images is relatively inefficient. Therefore, like Swin Transformer, ViT-based Mask R-CNN designs reintroduce a hierarchical structure into the ViT model and downsample step by step. Although this does work, does target detection necessarily require FPN? Is it possible to remove the hierarchical constraint on the backbone and use a plain backbone for object detection?

The ViTDet paper starts from this direction. It abandons the common FPN design (as YOLOF also does) and constructs a simple feature pyramid from the single-scale feature map of a plain ViT: it directly takes the last-layer ViT features (which should be the most powerful), applies a set of convolutions and deconvolutions for simple upsampling and downsampling, and thereby rebuilds a simple pyramid without extracting features from different stages as a CNN does and without the top-down and bottom-up feature fusion of a standard FPN, yet it achieves performance comparable to FPN. Specifically, the default ViT feature map at a scale of 1/16 (stride = 16) is used.
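A hedged sketch of this "simple feature pyramid" idea is given below; channel dimensions and layer details are illustrative, and the actual implementation adds normalization layers between the operations.

```python
import torch
from torch import nn

class SimpleFeaturePyramid(nn.Module):
    """Build multi-scale maps only from the last 1/16-scale ViT feature map."""
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.to_1_4 = nn.Sequential(                      # 1/4 scale via two deconvolutions
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2),
            nn.Conv2d(dim // 4, out_dim, 1))
        self.to_1_8 = nn.Sequential(                      # 1/8 scale via one deconvolution
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
            nn.Conv2d(dim // 2, out_dim, 1))
        self.to_1_16 = nn.Conv2d(dim, out_dim, 1)         # 1/16 scale, identity resolution
        self.to_1_32 = nn.Sequential(                     # 1/32 scale via max pooling
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(dim, out_dim, 1))

    def forward(self, feat_1_16):                         # (B, dim, H/16, W/16) from the last ViT block
        return {
            "p2": self.to_1_4(feat_1_16),
            "p3": self.to_1_8(feat_1_16),
            "p4": self.to_1_16(feat_1_16),
            "p5": self.to_1_32(feat_1_16),
        }
```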

ViTDet chooses the Mask R-CNN architecture as the main research object. The optimized version is used here. The specific improvements mainly include the following points:

  • RPN uses 2 hidden convolutional layers (the default is 1);
  • The box head of ROI heads is changed from the original 2 fully connected layers to 4 convolutional layers + 1 fully connected layer;
  • LayerNorm is used between the box head of ROI heads and the convolution layer of mask head (the earliest version uses BatchNorm, but SyncBN is often required, and LN is not affected by batch size).

Since the windows used by ViTDet do not overlap (attention inside a single patch or window cannot obtain global information), some means is needed to let information interact across windows. ViTDet does not use the shift operation (moving windows across layers) like Swin; instead it uses global attention and convolution for information interaction. In practice, the backbone blocks are divided into four subsets, window attention is used within each subset, and a global propagation is performed in the last block of each subset. This simple interaction incorporates both global and local information into learning while greatly reducing the memory and computation required for training.

8. Lightweight network 

In recent years, a new branch of research has formed around designing small and efficient networks for the resource-constrained environments common in the Internet of Things, and this trend has also permeated the design of powerful object detectors. Although a large number of object detectors achieve good accuracy and real-time inference, most of these models require excessive computing resources and therefore cannot be deployed on edge devices.

Many different approaches have shown exciting results in the past. Utilizing high-efficiency components and compression techniques, such as pruning, quantization, hashing, etc., improves the efficiency of deep learning models. Using a trained large network to train smaller models, called distillation, has also shown interesting results. In this section, however, we explore some typical examples of efficient neural network designs that achieve high performance on edge devices. The list looks like this:

8.1、SqueezeNet

SqueezeNet is a network model proposed by Forrest N. Iandola et al. in 2016 in SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. As the title suggests, the authors achieve accuracy comparable to AlexNet with only 1/50 of its parameters.

Recent advances in CNNs had mainly focused on improving state-of-the-art accuracy on benchmark datasets, leading to an explosion in model size and parameters. In 2016, Iandola et al. proposed a smaller, smarter network called SqueezeNet that reduced the parameters while maintaining performance. They adopted three main design strategies: using smaller filters, reducing the number of input channels to the 3x3 filters, and placing downsampling layers later in the network. The first two strategies reduce the number of parameters while maintaining accuracy, and the third increases the accuracy of the network. The building block of SqueezeNet is called the fire module and consists of two layers, a squeeze layer and an expand layer, each with a ReLU activation. The squeeze layer consists of multiple 1x1 filters, and the expand layer is a mix of 1x1 and 3x3 filters, thus limiting the number of input channels. The SqueezeNet architecture consists of 8 fire modules interspersed between convolutional layers. Inspired by ResNet, a SqueezeNet with residual connections is also proposed, which improves accuracy over the plain model. The authors also experimented with deep compression, and compared with AlexNet the model size was compressed by 510x. SqueezeNet is a good candidate for improving the hardware efficiency of neural network architectures.

The main ideas of SqueezeNet are as follows:

  • Use more 1x1 convolution kernels and less 3x3 convolution kernels. Because the advantage of 1x1 is that it can reduce the channel while maintaining the feature map size.
  • When using 3x3 convolution, reduce the number of channels as much as possible, thereby reducing the amount of parameters.
  • Use pooling later in the network: pooling reduces the feature map size, so delaying it lets earlier layers keep larger feature maps, which improves accuracy.
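As a hedged sketch of the fire module described above (channel sizes are illustrative), a 1x1 squeeze layer feeds two expand branches whose outputs are concatenated:

```python
import torch
from torch import nn

class Fire(nn.Module):
    """SqueezeNet-style fire module: 1x1 squeeze, then mixed 1x1/3x3 expand."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)   # limits 3x3 input channels
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        # concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

# e.g. the first fire module in the paper roughly corresponds to Fire(96, 16, 64, 64)
```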

8.2、MobileNet

MobileNet (v1) was proposed by Andrew G. Howard et al. in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications in 2017.

MobileNet moves away from the traditional approaches to small models, such as shrinking, pruning, quantization or compression, and instead relies on an efficient network architecture. The network uses depthwise separable convolutions, decomposing a traditional convolution into a depthwise convolution and a 1x1 pointwise convolution. A standard convolution filters all channels and merges them in one step; a depthwise separable convolution uses a separate kernel for each input channel and then merges the results with a pointwise convolution. This separation of feature filtering and feature combination reduces computational cost and model size. MobileNet consists of 28 separate convolutional layers, each followed by batch normalization and a ReLU activation. Howard et al. also introduce two model-shrinking hyperparameters, a width multiplier and a resolution multiplier, to further increase speed and reduce model size. The width multiplier uniformly thins the network by reducing input and output channels, while the resolution multiplier affects the size of the input image and its representations throughout the network. MobileNet achieves accuracy close to some mature models while being many times smaller. Howard et al. also show how it generalizes to various applications such as face attributes, geolocation, and object detection. However, it is as simple and linear as VGG, so there are few paths for gradient flow; this is addressed in later iterations of the model.
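A minimal sketch of the depthwise separable convolution block described above (channel counts illustrative):

```python
import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution (one kernel per channel) followed by a
    1x1 pointwise convolution, each with BN + ReLU, as in MobileNet v1."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1, groups=in_ch),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```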

8.3、ShuffleNet

ShuffleNet was proposed by Xiangyu Zhang et al. in 2017 in ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. The core idea of ShuffleNet is to use grouped convolutions to reduce computation; however, since grouping confines each convolution to a fixed subset of the inputs, a shuffle operation is used to scramble the channels and restore information flow between groups.

This is an extremely computationally efficient neural network architecture designed specifically for mobile devices. The authors observe that many efficient networks become less effective as they scale down, and attribute this to expensive 1x1 convolutions. They propose grouped convolutions, combined with channel shuffle, to overcome the resulting limitation of restricted information flow. ShuffleNet mainly consists of a standard convolution followed by ShuffleNet units arranged in three stages. The ShuffleNet unit is similar to a ResNet block: it uses a depthwise convolution in the 3x3 layer and replaces the 1x1 layers with pointwise group convolutions, with a channel shuffle operation before the depthwise convolution layer. The computational cost of ShuffleNet is managed by two hyperparameters: the number of groups, which controls connection sparsity, and a scaling factor, which controls model size. As the number of groups increases, the error rate saturates because each group has fewer input channels, potentially degrading representational power. ShuffleNet outperforms contemporary models with a fairly small model size, but since its main improvement is the channel shuffle, inference speed is not improved much.
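A minimal sketch of the channel shuffle operation, which is the core trick described above:

```python
import torch

def channel_shuffle(x, groups):
    """Reorder channels so information can flow between the groups of a grouped convolution."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave channels across groups
    return x.view(b, c, h, w)

# usage: y = channel_shuffle(feature_map, groups=3)
```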

8.4、MobileNetv2

Based on MobileNetv1, Sandler et al. proposed MobileNetv2 in 2018, introducing the novel inverted residual with linear bottleneck module, which reduces computational complexity and improves accuracy. This module expands a low-dimensional input representation to a higher dimension, filters it with a depthwise convolution, and then projects it back to a low dimension, unlike the usual residual block that first compresses, then convolves, and finally expands. MobileNetv2 consists of one convolutional layer, followed by 19 residual bottleneck blocks, followed by two convolutional layers. The residual bottleneck block has a shortcut connection only when the stride is 1; for larger strides the shortcut is omitted because the sizes differ. ReLU6 is used as the non-linearity instead of plain ReLU to limit computation. For target detection, the authors use MobileNetv2 as the backbone in an SSD variant called SSDLite, which is claimed to have 8x fewer parameters than the original SSD while achieving competitive accuracy. It generalizes well to other datasets and is easy to implement, so it has been well received by the community.
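A minimal sketch of the inverted residual block with linear bottleneck (channel counts and expansion ratio illustrative):

```python
import torch
from torch import nn

class InvertedResidual(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> 1x1 linear projection, with a shortcut
    only when stride == 1 and the input/output channels match."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1), nn.BatchNorm2d(out_ch))  # linear bottleneck: no activation

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```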

8.5、PeleeNet

Existing lightweight deep learning models rely heavily on depthwise separable convolutions, which lack efficient implementations. Wang et al. therefore proposed a new efficient architecture built on conventional convolution, named PeleeNet, designed with computational cost in mind. The core of PeleeNet is DenseNet, but it draws inspiration from many other models. It introduces two-way dense layers, a dynamic number of channels in the bottleneck, transition-layer compression, and conventional post-activation to reduce computational cost and improve speed. The two-way dense layers help obtain receptive fields of different scales, making larger objects easier to recognize, and a stem block is used to reduce information loss. The compression factor used in DenseNet is discarded because it hurts feature expressiveness and reduces accuracy. PeleeNet consists of a stem block, four stages of modified dense and transition layers, and finally a classification layer. The authors also propose a real-time object detection system called Pelee, a variant based on PeleeNet and SSD. Its performance on mobile and edge devices improves over contemporary detectors, showing that simple design choices can make a large difference in overall performance.

8.6、ShuffleNetv2

In 2018, Ma Ningning et al. proposed a set of comprehensive guidelines for designing efficient network architectures in ShuffleNetv2. They advocate measuring computational complexity with direct indicators such as speed or latency rather than indirect ones such as FLOPs. ShuffleNetv2 is built on four guiding principles: 1) input and output channels should have equal width to minimize memory access cost; 2) group convolutions should be chosen carefully according to the target platform and task; 3) the accuracy gains of multi-path structures come at the cost of efficiency; 4) element-wise operations such as add and ReLU are computationally non-negligible. Based on these principles, they design a new building block that splits the input into two parts through a channel-split layer; one part passes through three convolutional layers, is concatenated with the residual branch, and then passes through a channel shuffle layer. In the downsampling variant, the channel split is removed and the residual branch has its own depthwise separable convolutional layers. Inserting a stack of these blocks between two convolutional layers yields ShuffleNetv2. The authors also experimented with larger models (50/162 layers) and achieved higher accuracy with little increase in FLOPs. ShuffleNetv2 also outperforms other SOTA models in computational efficiency.

8.7、MnasNet

With the increasing demand for accurate, fast, low-latency models on various edge devices, designing such neural networks is more challenging than ever. In 2018, Tan et al. proposed MnasNet, designed with an automated neural architecture search (NAS) method. They formulate the search as a multi-objective optimization with high accuracy and low latency as objectives. The search space is factorized by dividing the CNN into unique blocks and then searching for the operations and connections within each block separately, which shrinks the search space and allows each block to have a unique design, unlike earlier models that stacked identical blocks. The authors use an RNN-based reinforcement-learning agent as the controller and a trainer to measure accuracy, with latency measured on real mobile devices. Each sampled model is trained on a task to obtain its accuracy and run on a real device to test latency; together these form a soft reward target used to update the controller. The process repeats until the maximum number of iterations is reached or a better candidate is found. The resulting network consists of 16 distinct blocks, some with residual connections. MnasNet is almost twice as fast as MobileNetv2 with higher accuracy. However, like other reinforcement-learning-based NAS models, the search requires massive computing resources.
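Following the MnasNet paper, the multi-objective soft reward described above can be written as follows, where T is the target latency and α, β are small negative exponents balancing accuracy against latency:

\text{maximize} \;\; ACC(m) \times \left[\frac{LAT(m)}{T}\right]^{w}, \qquad w = \begin{cases} \alpha, & LAT(m) \le T \\ \beta, & \text{otherwise} \end{cases}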

8.8、MobileNetv3

The core of MobileNetv3 is the same method used to create MnasNet, with some modifications. A platform-aware automated neural architecture search is performed in a factorized hierarchical search space and then refined by NetAdapt, which removes under-utilized components of the network over multiple iterations. Once an architecture proposal is obtained, the channels are trimmed, the weights are re-initialized, and the network is fine-tuned to improve the target metric. The model is further modified to remove some computationally expensive layers and obtain additional latency gains. Howard et al. note that the filters in the architecture are often mirror images of each other, so accuracy can be maintained even after removing half of the filters, reducing computation. MobileNetv3 uses a mix of ReLU and hard swish activations, the latter mainly in the later part of the network. Hard swish is not noticeably different from swish, but it is cheaper to compute while retaining accuracy. For different resource budgets, the authors propose two models: MobileNetv3-Large, with 15 bottleneck blocks, and MobileNetv3-Small, with 11 bottleneck blocks. The building blocks also include squeeze-and-excitation layers. Like MobileNetv2, these models serve as feature extractors in SSDLite and are 35% faster than earlier models while achieving higher mAP.
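A minimal sketch of the hard swish activation mentioned above:

```python
import torch
from torch import nn
import torch.nn.functional as F

class HardSwish(nn.Module):
    """Hard swish: x * ReLU6(x + 3) / 6, a cheap piecewise approximation of swish
    used in the later layers of MobileNetv3."""
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0
```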

8.9、Once-For-All (OFA)

Architecture design with neural architecture search (NAS) has produced many SOTA results in recent years, but it is computationally expensive because sampled models must be trained. Cai et al. proposed a new method that decouples the model training phase from the architecture search phase: the model is trained only once, and sub-networks can then be extracted from it on demand. The OFA (Once-For-All) network allows flexible selection of sub-networks along four important dimensions: depth, width, kernel size, and input resolution. Since the sub-networks are nested inside the OFA network and interfere with one another's training, progressive shrinking is introduced. First, the largest network is trained with all parameters set to their maximum values; the network is then fine-tuned while the parameter dimensions such as kernel size, depth, and width are gradually reduced. For elastic kernels, small kernels share the center of large kernels, and a kernel transformation matrix is used to maintain performance when centers are shared. To change the depth, only the first few layers of the large network are used and the later layers are skipped. Elastic width uses a channel-ordering operation that redistributes channels so that the most important channels are used in the smaller models. OFA reached SOTA with 80% top-1 accuracy on ImageNet and won the 4th Low-Power Computer Vision Challenge (LPCVC), while reducing GPU training hours by orders of magnitude compared with per-deployment search. It demonstrates a new paradigm for designing lightweight models for diverse hardware requirements.

9. Future trends

Over the past decade, object detection has made tremendous progress. The algorithm has achieved human-level accuracy in several vertical domains, but there are still many exciting challenges to be solved. In this section, we discuss some open problems in the field of object detection.

AutoML: Determining the design of object detectors automatically with neural architecture search (NAS) has become a hot research area. Earlier sections showed some detectors designed by NAS, but the field is still in its infancy, and the search process is complex and resource-intensive.

Lightweight detectors: Although lightweight networks can achieve performance comparable to mature classification networks, showing great potential, their detection accuracy is still below 50%. As more and more on-device machine learning applications hit the market, the need for models that are small, efficient, and equally accurate will increase.

Weakly supervised/few-shot detection: Most SOTA object detection models are trained on millions of labeled samples, which is time-consuming, labor-intensive, and hard to scale. Training on weakly supervised data (i.e., image-level annotations) would greatly reduce this cost.

Domain transfer: Domain transfer refers to using a model trained on annotated images of a specific source task on an independent but related target task. It encourages the reuse of trained models and reduces reliance on the availability of large datasets to achieve high accuracy.

3D object detection: 3D object detection is a particularly critical problem in autonomous driving. Even if a model achieves high accuracy, any application operating below human-level performance raises safety concerns.

Object detection in video: Object detectors are designed to reason on individual images, which ignores the correlation across video frames. Using the spatio-temporal relationships between multiple frames for object recognition is an open problem.

10. Summary

With powerful feature extraction capabilities, deep learning has helped object detection algorithms to make great progress.

Starting from RCNN, relevant researchers have continuously introduced new mechanisms and new tricks to improve the accuracy of such algorithms. In the end, the two-stage target detection algorithm represented by Faster RCNN achieved a high prediction accuracy.

Compared with Faster RCNN, the single-stage target detection algorithm represented by YOLO has achieved high calculation speed while ensuring high prediction accuracy. With its excellent real-time performance, it has been widely used in the industry.

The visual Transformer algorithm represented by DETR proposed in 2020 introduces the attention mechanism into the field of target detection. With its simple and elegant structure and higher accuracy than Faster RCNN, DETR has attracted more and more target detection practitioners to carry out research on visual transformers.

From the previous review of the target detection algorithm, we can see that the target detection algorithm actually develops from complex to simple, from rough to fine.

In terms of components and training techniques:

  • Candidate-region design has moved from anchor-based to anchor-free, evolving from learning bounding boxes directly, to learning adjustments of bounding boxes, to point/pixel-based learning.
  • Model post-processing has developed from traditional NMS to the NMS-free era.
  • The IoU used for training and evaluation has step by step incorporated more information about the relative position of the two boxes, moving from a hand-computed quantity to something the network learns adaptively, and now toward an IoU-free era.

From the training stage:

  • The pipeline has evolved from the complex traditional workflows of the early days, to the two-stage era led by R-CNN, and then to the one-stage era: an evolution from slow to fast.

From the model and characteristics:

  • Model development can be summarized as moving from traditional algorithms to CNN-based and then to Transformer-based models; features have likewise moved from hand-designed, to learned abstract features, to features of interest.

Although object detection has made great progress in the past decade, the best detectors are still far from saturating in performance. As real-world applications increase, the demand for lightweight models that can be deployed on mobile and embedded systems will grow rapidly; interest in this area is growing, but it remains an open challenge. In this article we have shown how two-stage and one-stage detectors developed step by step and surpassed earlier work. Two-stage detectors are generally more accurate but slower, making them hard to use in real-time applications such as autonomous driving. Over the past few years this has changed with the development of one-stage detectors, which can achieve comparable performance much faster. As shown in Figure 10, judging from current developments, vision Transformers will further accelerate progress in target detection; Swin Transformer is also the most accurate detector to date. Given the current positive trend in detector accuracy, we have high hopes for more accurate and faster detectors.


The above content is for learning reference only.

The main reference links of this article are attached below (if there is any infringement, please contact us and they will be removed):

A Survey of Target Detection Algorithms - Zhihu (zhihu.com)

Detailed explanation of mainstream target detection algorithms: from RCNN to DETR - Zhihu (zhihu.com)

A Review of 2021 Deep Learning Target Detection_Ye Zhou's Blog-CSDN Blog_2021 Target Recognition

The latest target detection algorithm review 2022 notes_xiaobai_Ry's Blog-CSDN Blog_The latest progress in target detection

Summary of the classic basic network structure (backbone) in deep learning_kuweicai's blog-CSDN blog_backbone network structure

Origin blog.csdn.net/weixin_44074191/article/details/128057231