Detailed Explanation of the YOLO Series (YOLOv1-YOLOv5)

Table of Contents

1. Foreword

2. YOLOv1

3. YOLOv2

4. YOLOv3

5. YOLOv4

6. YOLOv5

7. YOLOv6


1. Foreword

The YOLO family are one-stage, deep-learning-based regression methods, whereas R-CNN, Fast R-CNN, Faster R-CNN and the like are two-stage, deep-learning-based classification methods.

YOLO official repository: GitHub - pjreddie/darknet: Convolutional Neural Networks

1.1 YOLO vs Faster R-CNN
1. Unified network: YOLO has no explicit region-proposal stage. In Faster R-CNN, although the RPN shares convolutional layers with Fast R-CNN, the RPN and Fast R-CNN networks still have to be trained alternately during model training. Compared with the R-CNN family's "two looks" (proposal extraction, then classification), YOLO only needs to Look Once.

2. YOLO casts detection as a single regression problem, while Faster R-CNN solves it in two parts: object category (a classification problem) and object location, i.e. the bounding box (a regression problem).
 

2. YOLOv1


Paper address: https://arxiv.org/abs/1506.02640

Official code: GitHub - pjreddie/darknet: Convolutional Neural Networks

The core idea of YOLOv1:

The core idea of YOLOv1 is to take the entire image as the input of the network and directly regress, at the output layer, the bounding-box positions and the categories they belong to.
Faster R-CNN also takes the entire image as input, but it still follows the R-CNN "proposal + classifier" paradigm as a whole; it merely moves the proposal-extraction step into the CNN. YOLOv1, in contrast, uses a direct-regression approach.
2.1 Implementation method
Divide the image into an SxS grid of cells. If the center of an object falls inside a cell, that cell is responsible for predicting the object.

Each cell predicts B bounding boxes. Besides regressing its own position, each bounding box also predicts a confidence value. This confidence reflects both how sure the model is that the box contains an object and how accurate the box is. It is computed as:

confidence = Pr(Object) x IOU_pred^truth

        Meaning of the expression: if an object falls in the grid cell, the first term is 1, otherwise it is 0. The second term is the IoU between the predicted bounding box and the actual ground truth.

Each bounding box predicts 5 values: (x, y, w, h) and confidence. Each grid cell additionally predicts class information over C categories. With SxS cells, each predicting B bounding boxes and C class probabilities, the output is a tensor of size S x S x (5*B+C).
       Note: the class information is per grid cell, while the confidence information is per bounding box.

For example:

In PASCAL VOC, the input image is 448x448 pixels, S=7, B=2, and there are 20 categories (C=20), so the output is a 7x7x(2x5+20) = 7x7x30 tensor. The entire network structure is shown in the figure below:

At test time, the class probabilities predicted by each grid cell are multiplied by the confidence predicted by each bounding box, giving a class-specific confidence score for every box:

Pr(Class_i | Object) x Pr(Object) x IOU_pred^truth = Pr(Class_i) x IOU_pred^truth

After obtaining the class-specific confidence score of each box, apply a threshold to filter out low-scoring boxes, then run NMS on the remaining boxes to get the final detections.


       Meaning of this expression: the first term on the left is the class probability predicted by the grid cell, and the second and third terms together are the confidence predicted by the bounding box. The product encodes both the probability that the predicted box belongs to a certain class and how accurate the box is.
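As a minimal sketch of this test-time step (assuming a raw 7x7x30 output tensor laid out as two (x, y, w, h, conf) boxes followed by 20 class probabilities, which is one common layout, and an assumed score threshold of 0.2):

    import numpy as np

    S, B, C = 7, 2, 20
    pred = np.random.rand(S, S, B * 5 + C)                 # dummy network output, shape (7, 7, 30)

    boxes = pred[..., :B * 5].reshape(S, S, B, 5)          # (x, y, w, h, conf) per box
    box_conf = boxes[..., 4]                               # Pr(Object) * IOU, shape (7, 7, 2)
    class_prob = pred[..., B * 5:]                         # Pr(Class_i | Object), shape (7, 7, 20)

    # class-specific confidence score = Pr(Class_i | Object) * Pr(Object) * IOU
    scores = box_conf[..., None] * class_prob[:, :, None, :]   # shape (7, 7, 2, 20)

    scores[scores < 0.2] = 0                               # threshold low scores before NMS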

Note:
Since the output layer is fully connected, at inference time the YOLOv1 model only supports the same input resolution as the training images.
Although each grid cell can predict B bounding boxes, only the bounding box with the highest IOU is finally kept as the detection output, i.e. each cell predicts at most one object. When objects are small relative to the grid, for example a herd of animals or a flock of birds in the image, a cell may contain multiple objects but can only detect one of them.
A simple summary:

Given an input image, first divide it into a 7*7 grid;
for each grid cell, predict 2 boxes (each box with a confidence that it contains a target, plus the cell's probabilities over the categories);
this gives 7*7*2 candidate windows; remove the low-probability windows by thresholding, then remove redundant windows with NMS, as sketched below.
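A minimal sketch of the NMS step mentioned above (plain NumPy, greedy suppression with an assumed IoU threshold of 0.5; boxes are in (x1, y1, x2, y2) form):

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        # boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes.
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            # IoU of the top-scoring box with all remaining boxes
            xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter)
            order = rest[iou < iou_thresh]                 # drop boxes that overlap too much
        return keep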
2.2 Loss function
Each grid cell's prediction has 30 dimensions: 8 dimensions are the coordinates of the two regressed boxes, 2 dimensions are the box confidences, and 20 dimensions are the categories. The x and y coordinates are normalized to 0-1 as offsets within the corresponding cell, and w and h are normalized to 0-1 by the image width and height. The key implementation question is how to design the loss function so that these three parts are well balanced. The author simply uses a sum-squared error loss for all of them.

This approach has the following problems:

First, treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is clearly unreasonable;
second, if a grid cell contains no object (most cells in an image do not), the confidence of its boxes is pushed toward 0; these cells greatly outnumber the cells that do contain objects, so they dominate the gradients, which can make the network unstable or even diverge.
Solution:

Give the 8-dimensional coordinate predictions more attention by assigning them a larger loss weight;
give the confidence loss of boxes without an object a small loss weight;
keep the weights of the confidence loss for boxes with an object and of the classification loss at 1.
When predicting boxes of different sizes, the same absolute offset should matter more for a small box than for a big one, yet the sum-squared error penalizes the offset identically. To alleviate this, the author uses a trick: the square roots of the box width and height are regressed instead of the raw width and height. The figure below makes this easy to see: a small box has a small value on the horizontal axis, so for the same offset its change on the y-axis is larger than that of a large box. (This is only an approximation.)
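As a rough sketch of how these weights and the square-root trick enter the coordinate part of the loss (lambda_coord = 5 and lambda_noobj = 0.5 are the values used in the YOLOv1 paper; the mask and tensor shapes here are assumptions for illustration):

    import torch

    lambda_coord, lambda_noobj = 5.0, 0.5

    def coord_loss(pred, target, obj_mask):
        # pred / target: (..., 4) tensors holding [x, y, w, h]; obj_mask selects responsible boxes
        p, t = pred[obj_mask], target[obj_mask]
        loss_xy = ((p[:, 0:2] - t[:, 0:2]) ** 2).sum()
        # square roots of w and h: the same offset now costs more for small boxes than big ones
        loss_wh = ((p[:, 2:4].sqrt() - t[:, 2:4].sqrt()) ** 2).sum()
        return lambda_coord * (loss_xy + loss_wh)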

A grid cell predicts multiple boxes, and we want each box predictor to specialize in predicting one object. Concretely, the predictor whose current box has the larger IoU with the ground-truth box is made responsible for it. This is called box predictor specialization.

In the loss function of YOLOv1:

The classification error is only penalized when there is an object in the grid cell.
The coordinate error of a box is only penalized when that box predictor is responsible for a ground-truth box, and responsibility is assigned to the predictor whose box has the highest IoU with the ground truth among all boxes in that cell.
Note:

YOLOv1 training relies on detection data with labeled objects, so for objects with unconventional shapes or aspect ratios its detection performance is not ideal.
YOLOv1 uses multiple downsampling layers, so the object features the network learns are relatively coarse, which also hurts detection.
In the YOLOv1 loss, the IOU error of large objects and of small objects contribute roughly equally to the training loss (the square-root trick helps but does not solve the problem fundamentally). As a result, a small IOU error on a small object still has a large effect on optimization, which lowers localization accuracy.
Disadvantages of YOLO
YOLO performs poorly on objects that are close together or in small groups, because each grid cell predicts only two boxes and they share a single class;
when the same kind of object appears with an unusual aspect ratio or in other uncommon configurations, generalization is weak;
because of the loss-function design, localization error is the main factor limiting detection performance; in particular, the handling of large versus small objects needs strengthening.


3. YOLOv2


Paper address: https://arxiv.org/abs/1612.08242

Official code: YOLO: Real-Time Object Detection

3.1 Introduction to YOLOv2
Compared with v1, YOLOv2 improves in three respects while maintaining its processing speed: more accurate prediction (Better), faster speed (Faster), and recognition of more objects (Stronger). "Recognizing more objects" means extending detection to 9000 different object classes, hence the name YOLO9000.

The paper proposes a new training method, a joint training algorithm, which can mix a detection dataset and a classification dataset. It uses a hierarchical view of object classes and augments the detection dataset with data from the much larger classification dataset. The basic idea of the joint training algorithm is to train the object detector on detection data and classification data simultaneously: detection data is used to learn precise object locations, while classification data is used to increase the number of categories and improve robustness.

YOLO9000 is trained with this joint training algorithm. Its 9000 classes of classification information are learned from the ImageNet classification dataset, while object localization is learned from the COCO detection dataset.

YOLOv1 has many shortcomings; the author's directions for improvement are to raise recall and localization accuracy while maintaining classification accuracy. The current trend in computer vision is toward larger and deeper networks, and better performance usually comes from training larger networks or ensembling multiple models, but YOLOv2 instead focuses on simplifying the network. The specific improvements are listed in the table below:

3.2 Improvements of YOLOv2
3.2.1 Batch Normalization
Batch normalization helps address vanishing and exploding gradients during backpropagation, reduces sensitivity to some hyperparameters (such as the learning rate, the scale of the network parameters, and the choice of activation function), and, because each batch is normalized separately, provides a certain regularization effect (so YOLOv2 no longer uses dropout), giving better convergence speed and quality.

Optimizing the network with Batch Normalization improves convergence while removing the dependence on other forms of regularization. Adding Batch Normalization to every convolutional layer of YOLOv2 increases mAP by 2% and also regularizes the model; with Batch Normalization, Dropout can be removed from the model without overfitting.

For more information about batch normalization, please refer to: Batch Normalization principle and actual combat

3.2.2 High resolution classifier
There are plenty of training samples for image classification, but far fewer detection samples with annotated boxes, because labeling boxes is labor-intensive. Detection models therefore usually pre-train their convolutional layers on image-classification samples to learn feature extraction, which creates another problem: classification samples usually have low resolution. YOLOv1 pre-trains its convolutional layers on ImageNet classification images at 224*224 and then switches to higher-resolution 448*448 images when training for detection; such an inconsistent input resolution inevitably hurts the model.

YOLOv2 therefore pre-trains the classification model at 224*224, then fine-tunes it on 448*448 high-resolution samples for 10 epochs so that the network features gradually adapt to the 448*448 resolution, and only then trains on 448*448 detection samples. This alleviates the impact of the sudden resolution switch. The high-resolution classifier increases mAP by about 4%.

3.2.3 Convolution with anchor boxes
YOLOv1 contains fully connected layers that predict the bounding-box coordinates directly. Faster R-CNN instead uses only convolutional layers and a Region Proposal Network to predict offsets and confidences relative to anchor boxes rather than coordinates. The YOLOv2 authors found that predicting offsets instead of coordinates simplifies the problem and makes the network easier to learn.

Borrowing from Faster R-CNN, YOLOv2 also tries prior boxes (anchors): a set of boxes with different sizes and aspect ratios is preset at every grid cell to cover different positions and scales of the whole image. These prior boxes serve as predefined candidate regions; the network detects whether each one contains an object and fine-tunes the box position.

YOLOv1 used no prior boxes and predicted only two bounding boxes per cell, i.e. only 98 boxes per image. If YOLOv2 used 9 prior boxes per cell there would be 13*13*9 = 1521 prior boxes in total. In the end YOLOv2 removes the fully connected layers and uses anchor boxes to predict the bounding boxes. The authors also remove one pooling layer so that the convolutional output has a higher resolution, and shrink the network input so that it runs on 416*416 images instead of 448*448.

Objects, especially large ones, tend to appear near the center of the image, so it helps to have a single cell at the center responsible for predicting them. YOLOv2's convolutional layers downsample the image by a factor of 32, so choosing 416*416 as the input size yields a 13*13 feature map. Using anchor boxes lowers accuracy slightly, but it lets YOLOv2 predict more than a thousand boxes, with recall reaching 88% and mAP reaching 69.2%.

3.2.4 Dimension Clusters
Previously, anchor-box sizes were picked by hand, so there is room for optimization. YOLOv2 tries to find prior boxes that better match the object sizes in the training samples, which reduces how much the network must adjust the priors to reach the actual box positions. The approach is to run k-means clustering on the annotated boxes in the training set to find box sizes that match the samples as well as possible. With standard Euclidean-distance k-means, larger boxes would produce larger errors than smaller ones; since the goal is a better IOU score, which does not depend on box size, the following distance metric is used instead:

d(box, centroid) = 1 - IOU(box, centroid)

Here centroid is the box chosen as a cluster center and box is any other box; the larger their IOU, the smaller their "distance". YOLOv2's clustering results are shown in the figure below. Analyzing the experimental results (Figure 2) and trading off model complexity against recall, the number of clusters is set to K = 5.

 

Table 1 shows that when using k-means to select the anchor boxes, with 5 clusters (Cluster IOU) the average IOU is 61.0, higher than the 60.9 obtained without clustering; with 9 clusters the average IOU improves markedly. In short, clustering the priors is effective.
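A minimal sketch of this dimension clustering (boxes are (w, h) pairs; the IoU of two boxes is computed as if they shared the same center, and empty clusters are ignored for brevity):

    import numpy as np

    def wh_iou(boxes, centroids):
        # IoU between (N, 2) boxes and (K, 2) centroids, both given as (w, h) with centers aligned
        inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(boxes[:, None, 1], centroids[None, :, 1])
        union = boxes[:, 0:1] * boxes[:, 1:2] + centroids[:, 0] * centroids[:, 1] - inter
        return inter / union

    def kmeans_anchors(boxes, k=5, iters=100):
        centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
        for _ in range(iters):
            d = 1.0 - wh_iou(boxes, centroids)      # d(box, centroid) = 1 - IOU(box, centroid)
            assign = d.argmin(axis=1)               # each box joins the closest centroid
            centroids = np.array([boxes[assign == i].mean(axis=0) for i in range(k)])
        return centroids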

 

3.2.5 Direct location prediction
Using the anchor-box approach makes the model unstable, especially in the early iterations. Most of the instability comes from predicting the (x, y) location of the box. Following the earlier YOLOv1 approach, the network does not predict an unconstrained offset but predicts the location relative to the grid cell's position, which keeps the ground-truth values between 0 and 1; a logistic activation is used to constrain the network's predictions to the same 0-1 range. The network predicts 5 bounding boxes per grid cell, each with five values tx, ty, tw, th, to, whose relationship is shown in the figure below. If the grid cell is offset from the top-left corner of the image by (cx, cy) and the prior box has width and height pw, ph, the prediction is decoded as in the formula on the right of the figure below:
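The decoding described by that formula (from the YOLOv2 paper) is bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw * exp(tw), bh = ph * exp(th), with confidence sigmoid(to); a minimal sketch:

    import torch

    def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
        # (cx, cy): offset of the grid cell from the image's top-left corner
        # (pw, ph): width and height of the prior (anchor) box
        bx = torch.sigmoid(tx) + cx        # predicted center x, constrained to stay inside the cell
        by = torch.sigmoid(ty) + cy        # predicted center y
        bw = pw * torch.exp(tw)            # predicted width, scaling the prior
        bh = ph * torch.exp(th)            # predicted height
        conf = torch.sigmoid(to)           # objectness, i.e. Pr(object) * IOU
        return bx, by, bw, bh, conf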

3.2.6 Fine-Grained Features
One problem in object detection is that targets come in very different sizes. After the input image passes through many layers, the final feature map is small (for example, a 416*416 input in YOLOv2 is downsampled by the convolutional network to 13*13), and the features of smaller objects may become faint or be lost entirely. To detect relatively small objects better, the final output feature map needs to keep more fine-grained detail. YOLOv2 therefore introduces a passthrough layer to retain some of that detail: specifically, the 26*26*512 feature map before the last pooling stage is split one-into-four and passed directly (passthrough) to be stacked with the pooled (and further convolved) feature map to form the output feature map.

How one feature map is split into four is shown in the figure below; the example splits a 4*4 map into four 2*2 maps (the depth is unchanged and therefore not drawn). A code sketch follows.
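A minimal sketch of the passthrough (reorg) operation, turning the 26x26x512 feature map into a 13x13x2048 map that can be stacked with the deeper 13x13 features:

    import torch

    def passthrough(x, stride=2):
        # rearranges (N, C, H, W) into (N, C * stride * stride, H / stride, W / stride)
        n, c, h, w = x.shape
        x = x.view(n, c, h // stride, stride, w // stride, stride)
        x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
        return x.view(n, c * stride * stride, h // stride, w // stride)

    x = torch.randn(1, 512, 26, 26)
    print(passthrough(x).shape)            # torch.Size([1, 2048, 13, 13])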

 

3.2.7 Multi-Scale Training
The authors want YOLOv2 to run robustly on images of different sizes, so this idea is built into training. Unlike approaches that fix the image size, YOLOv2 changes the input size every few iterations: every 10 batches the network randomly picks a new image size. Since the downsampling factor is 32, the sizes are multiples of 32, i.e. {320, 352, ..., 608}, with a minimum of 320*320 and a maximum of 608*608; the network resizes itself and continues training. This policy lets the same network predict well at different input sizes and run detection at different resolutions: with smaller inputs it runs faster, with larger inputs it is more accurate, so the speed and accuracy of YOLOv2 can be traded off against each other.
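A small sketch of this schedule (every 10 batches a new input size is drawn from the multiples of 32 between 320 and 608; the actual resize and training step are left as a comment):

    import random

    sizes = list(range(320, 608 + 1, 32))       # {320, 352, ..., 608}, all multiples of 32
    img_size = 416

    for batch_idx in range(1000):               # dummy loop standing in for the training loop
        if batch_idx % 10 == 0:
            img_size = random.choice(sizes)     # pick a new input size every 10 batches
        # resize this batch to (img_size, img_size) and run one training step here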

 

3.3 YOLOv2 Faster
YOLOv1's backbone is based on GoogLeNet, which is faster than VGG-16: a forward pass needs only 8.52 billion operations versus 30.69 billion for VGG-16, though YOLOv1's accuracy is slightly lower than VGG-16's.

3.3.1 Darknet-19
YOLOv2 is built on a new classification model, somewhat similar to VGG. It uses 3*3 filters and doubles the number of channels after each pooling step, and uses global average pooling and Batch Normalization to make training more stable, speed up convergence, and regularize the model. The final model, Darknet-19, has 19 convolutional layers and 5 max-pooling layers; it needs only 5.58 billion operations to process an image and reaches 72.9% top-1 and 91.2% top-5 accuracy on ImageNet.

3.3.2 Training for classification
The network is trained for 160 epochs on the ImageNet 1000-class classification dataset using stochastic gradient descent, with an initial learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9. Standard data augmentation is used during training: random crops, rotations, and shifts of hue, saturation and exposure. The whole network is then fine-tuned for 10 epochs at the larger 448*448 resolution with an initial learning rate of 0.001, reaching 76.5% top-1 and 93.3% top-5 accuracy.

3.3.3 Training for detection
The network removes the last convolutional layer and adds three 3*3 convolutional layers with 1024 filters each, each followed by a 1*1 convolutional layer. For VOC data, each grid cell predicts five bounding boxes, each with 5 coordinates and 20 class scores, so the final layer has 125 filters. A passthrough layer is added to bring in fine-grained information from an earlier layer. The network is trained for 160 epochs with an initial learning rate of 0.001 and the same data augmentation; the training strategy is the same for the COCO and VOC datasets.

4. YOLOv3

Paper address: https://pjreddie.com/media/files/papers/YOLOv3.pdf 

 

 

 

DBL: Darknetconv2d_BN_Leaky in the code, the basic building block of YOLOv3: convolution + BN + Leaky ReLU.
resn: n is a number (res1, res2, ..., res8, etc.) indicating how many res_units this res_block contains; these are standard ResNet-style residual units.
concat: tensor concatenation; the upsampled feature from a middle layer of darknet is concatenated with a later layer. Unlike the add operation in the residual layers, concatenation expands the tensor's dimensions, whereas add leaves the dimensions unchanged.

Backbone: darknet-53
For better classification performance, the authors designed and trained darknet-53. Experiments on the ImageNet dataset show that darknet-53 is indeed very strong: compared with ResNet-152 and ResNet-101, darknet-53 reaches similar classification accuracy while computing much faster and using fewer layers. The test results are shown in the figure.

 

The network structure of darknet-53 is shown in the figure below. YOLOv3 uses the first 52 layers of darknet-53 (without the fully connected layer). The YOLOv3 network is fully convolutional and makes heavy use of residual skip connections. To reduce the negative effect that pooling has on gradients, the authors abandon pooling entirely and use the stride of convolutions for downsampling; in this structure, convolutions with stride 2 perform the downsampling.

To improve accuracy on small targets, YOLOv3 adopts an FPN-like upsample-and-fuse approach (ultimately fusing 3 scales, the other two being 26x26 and 52x52) and runs detection on feature maps at multiple scales.

The three prediction branches are also fully convolutional. The number of kernels in the last convolutional layer is 255, which comes from the 80 classes of the COCO dataset: 3*(80+4+1) = 255, where 3 is the number of bounding boxes per grid cell, 4 the box coordinates, and 1 the objectness score.

 

Output

The "multi-scale" refers to these three prediction paths. The depths of y1, y2 and y3 are all 255, and their side lengths are 13, 26 and 52. YOLOv3 has each grid cell predict 3 boxes, and each box carries five basic parameters (x, y, w, h, confidence) plus probabilities for 80 categories, so 3 x (5 + 80) = 255; that is where the 255 comes from.

Let's look at where y1, y2 and y3 come from.
The network performs detection at three places: at 32x, 16x and 8x downsampling, so detecting on multi-scale feature maps is somewhat like SSD. The reason the network uses upsampling: the deeper the network, the better the feature representation. For the 16x-downsampled detection, directly using the shallow 16x features would generally not work well; the 32x-downsampled features would be preferable, but their spatial size is too small. YOLOv3 therefore upsamples the 32x feature map by a factor of 2, bringing it to the size of the 16x map. Likewise, the 8x branch upsamples the 16x features by a factor of 2, so that deep features can be used for detection at every scale.

The deep features obtained by upsampling have the same spatial size as the feature layer they are fused with (the channel counts differ). As shown in the figure below, layer 85 upsamples the 13x13x256 features to 26x26x256 and concatenates them with the layer-61 features to give 26x26x768. To reach 255 channels, a series of 3x3 and 1x1 convolutions follows, which adds nonlinearity and generalization ability and improves accuracy while keeping the parameter count down and preserving real-time performance. The 52x52x255 feature map is produced by a similar process, sketched below.
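A minimal sketch of that upsample-and-concatenate step (shapes follow the 13x13x256 deep feature and the 26x26x512 layer-61 feature from the example above; the 3x3/1x1 convolution stack that reduces to 255 channels is abbreviated to a single 1x1 convolution):

    import torch
    import torch.nn as nn

    deep = torch.randn(1, 256, 13, 13)          # deep feature map (32x downsampling)
    shallow = torch.randn(1, 512, 26, 26)       # earlier feature map (16x downsampling)

    up = nn.Upsample(scale_factor=2, mode="nearest")(deep)   # 1 x 256 x 26 x 26
    fused = torch.cat([up, shallow], dim=1)                  # 1 x 768 x 26 x 26
    head = nn.Conv2d(768, 255, kernel_size=1)                # 255 = 3 * (5 + 80)
    print(head(fused).shape)                                 # torch.Size([1, 255, 26, 26])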

 

Bounding Box

YOLOv3's bounding boxes improve on YOLOv2's. Both YOLOv2 and YOLOv3 obtain their anchors by running k-means on the objects in the training images. Each cell in a feature map predicts 3 bounding boxes, and each bounding box predicts three things: (1) the position of the box (4 values: center coordinates tx and ty, box height bh and width bw), (2) an objectness score, and (3) N class scores (80 for the COCO dataset, 20 for VOC).

The three detections have different receptive fields. The 32x-downsampled branch has the largest receptive field and suits large targets, so with a 416x416 input its three anchor boxes per cell are (116,90), (156,198), (373,326). The 16x branch suits objects of ordinary size, with anchors (30,61), (62,45), (59,119). The 8x branch has the smallest receptive field and suits small targets, with anchors (10,13), (16,30), (33,23). So with a 416x416 input there are (52x52 + 26x26 + 13x13) x 3 = 10647 proposal boxes in total.

To get a feel for the sizes of the 9 prior boxes: in the figure below, the blue boxes are the prior boxes obtained by clustering, the yellow box is the ground truth, and the red box is the grid cell containing the object's center point.

 

Note the difference between a bounding box and an anchor box here:
the bounding box outputs the box position (center coordinates, width and height), confidence, and N class scores; the anchor box is just a scale, i.e. only a width and a height.

Loss Function
One important change in YOLOv3: the classes are no longer passed through a softmax. YOLOv3 now performs multi-label classification on the objects detected in the image.

Logistic regression is used to produce an objectness score for the region covered by each anchor, i.e. how likely that location is to contain a target. This is done before prediction, removing unnecessary anchors and reducing computation.

Even if a template box exceeds the threshold we set, it is not used for prediction unless it is the best one. Unlike Faster R-CNN, YOLOv3 assigns only 1 prior to each object, namely the best prior. Logistic regression is used to find, among the 9 anchor priors, the one with the highest objectness score (the probability that a target exists); in other words, it models the mapping from the prior to the objectness score.

    import torch
    import torch.nn as nn

    # p: list of raw predictions from each detection layer; targets: ground-truth labels;
    # ft: float-tensor constructor; build_targets: matches anchors to labels (defined elsewhere)
    lxy, lwh, lcls, lconf = ft([0]), ft([0]), ft([0]), ft([0])
    # on the 13/26/52 grids, find the anchor boxes whose IoU with a label exceeds the
    # threshold and use them as targets
    txy, twh, tcls, indices = build_targets(model, targets)
    # txy: target (x, y); twh: target (w, h); indices = [image index, anchor index, gi, gj]

    # Define criteria
    MSE = nn.MSELoss()
    CE = nn.CrossEntropyLoss()
    BCE = nn.BCEWithLogitsLoss()

    # Compute losses
    h = model.hyp          # hyperparameters
    bs = p[0].shape[0]     # batch size
    k = h['k'] * bs        # loss gain
    for i, pi0 in enumerate(p):                # layer i predictions
        b, a, gj, gi = indices[i]              # image, anchor, grid y, grid x
        tconf = torch.zeros_like(pi0[..., 0])  # target objectness

        if len(b):  # number of targets
            pi = pi0[b, a, gj, gi]             # predictions matched to the targets
            tconf[b, a, gj, gi] = 1            # positive samples get target objectness 1
            # pi[..., 2:4] = torch.sigmoid(pi[..., 2:4])  # wh power loss (uncomment)

            lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i])  # xy loss
            lwh += (k * h['wh']) * MSE(pi[..., 2:4], twh[i])                 # wh loss
            lcls += (k * h['cls']) * CE(pi[..., 5:], tcls[i])                # class loss

        # pos_weight = ft([gp[i] / min(gp) * 4.])
        # BCE = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        lconf += (k * h['conf']) * BCE(pi0[..., 4], tconf)  # objectness loss
    loss = lxy + lwh + lconf + lcls
 

The above is YOLOv3's loss-function code in the PyTorch framework. Ignoring the constant coefficients, a few points are worth emphasizing:

First, YOLOv3 needs to build the targets. Positive samples are the anchor boxes whose IoU with a label exceeds 0.5, so the matching anchor boxes are found from the labels. Each label stores [image, class, x (normalized), y, w, h (normalized)]; with these coordinates the label is mapped onto the 13x13, 26x26 or 52x52 grid, the IoU against the 9 anchors is computed, the anchors that meet the requirement are kept, and their indices and positions are recorded. The recorded indices are then used to pick out the corresponding predictions.
The xywh loss is a mean-squared error. The predicted xy is passed through a sigmoid before being compared with the label xy, because the label xy is the offset of the object center within the grid cell and lies between 0 and 1.
Classification uses multi-class cross-entropy and confidence uses binary cross-entropy. Only positive samples contribute to the class and xywh losses; negative samples contribute only to the confidence loss.

5. YOLOv4
YOLOv4: Optimal Speed and Accuracy of Object Detection

Paper: https://arxiv.org/abs/2004.10934

Code: GitHub - AlexeyAB/darknet: YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )

YOLOv4 is essentially an algorithm that combines a large number of previous research techniques, assembles them, and adds appropriate innovations, achieving a strong balance between speed and accuracy. There are arguably many tricks for improving the accuracy of a convolutional neural network (CNN), but some only work on certain models, certain problems, or small datasets. Let's go through the tuning methods the authors use one by one: weighted residual connections (WRC), cross-stage partial connections (CSP), cross mini-batch normalization (CmBN), self-adversarial training (SAT), Mish activation, Mosaic data augmentation, DropBlock regularization, CIoU loss, and so on. Stacking this series of techniques finally achieves the best experimental result: 43.5% AP on the MS COCO dataset at a real-time speed of about 65 FPS on a Tesla V100.

5.1 YOLOv4 framework principle


The overall schematic of YOLOv4 (sourced from the web) is shown below:

5.1.1 CSPDarknet53
In YOLOv3 the feature-extraction network is Darknet53; in YOLOv4, Darknet53 is slightly improved by borrowing from CSPNet. CSPNet stands for Cross Stage Partial Network. CSPNet addresses the problem of duplicated gradient information during optimization in large backbone networks and integrates the gradient changes into the feature map from beginning to end, which reduces the model's parameter count and FLOPS, preserving inference speed and accuracy while shrinking the model size. As shown below:

CSPNet builds on the DenseNet idea: it copies the feature map of the base layer and sends the copy through the dense block to the next stage, thereby separating out the base layer's feature map. This effectively alleviates the vanishing-gradient problem (the signal is hard to propagate back through very deep networks), supports feature propagation, and encourages feature reuse, which reduces the number of network parameters. The CSPNet idea can be combined with ResNet, ResNeXt and DenseNet; at present the two main backbone transformations are CSPResNeXt50 and CSPDarknet53.

Several aspects must be balanced: input resolution, number of convolutional layers, number of parameters, and output dimensionality. A model that classifies well does not necessarily detect well. Good detection requires the following:

Larger network input resolution - for detecting small targets;
deeper networks - to cover a larger receptive field;
more parameters - to better detect targets of different sizes in the same image.
The final CSPDarknet53 structure looks like the picture below:

CSPNet paper: https://arxiv.org/pdf/1911.11929v1.pdf

In order to increase the receptive field, the author also used SPP-block, using PANet instead of FPN for parameter aggregation to be suitable for target detection at different levels.

5.1.2 SPP structure
We have encountered the SPP-Net structure before. SPP-Net (Spatial Pyramid Pooling Network) was originally designed to let feature maps of arbitrary size enter a fully connected layer: as the figure below shows, a fixed-size pooling is applied directly to a feature map of any size to obtain a fixed number of features.

As shown in the figure above, take pooling at 3 scales as an example: one max pooling is performed over the whole feature map, i.e. each channel keeps its maximum value, giving 1*d features (d is the number of channels of the feature map); the feature map is then divided into a 2x2 grid and max-pooled per cell, giving 4*d features; likewise a 4x4 grid gives 16*d features. The features from each pooling are then concatenated, yielding a fixed-length feature vector (the channel count of the feature map being fixed), which can be fed into the fully connected layer to train the network. In YOLOv4 this idea is used instead to enlarge the receptive field.
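As used in YOLOv4-style detectors, the SPP block keeps the spatial size and concatenates stride-1 max-poolings with different kernel sizes (5, 9, 13, plus the identity) to enlarge the receptive field; a minimal sketch:

    import torch
    import torch.nn as nn

    class SPP(nn.Module):
        # parallel stride-1 max-poolings with different kernels, concatenated with the input
        def __init__(self, kernels=(5, 9, 13)):
            super().__init__()
            self.pools = nn.ModuleList(
                nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
            )

        def forward(self, x):
            return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

    x = torch.randn(1, 512, 13, 13)
    print(SPP()(x).shape)                      # torch.Size([1, 2048, 13, 13])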

5.1.3 PAN structure
YOLOv4 uses PANet (Path Aggregation Network) instead of FPN for parameter aggregation to be suitable for different levels of target detection. The method used for fusion in the PANet paper is Addition, and the YOLOv4 algorithm changes the fusion method from addition to Concatenation . As shown below:

5.2 BackBone training strategy
Here we mainly learn the BackBone training strategy from the aspects of data enhancement, DropBlock regularization, and class label smoothing.

5.2.1 Data enhancement
1. CutMix
YOLOv4 uses the CutMix augmentation. CutMix is fairly simple and also operates on a pair of images: a crop box is generated at random and the corresponding region of image A is cut out, then the ROI at the same position in image B is pasted into the cropped area of image A to form a new sample. The ground-truth label is adjusted in proportion to the patch area, e.g. 0.6 dog and 0.4 cat, and the loss is likewise computed as a weighted sum. Some related augmentation methods are compared below:

The picture above is the CutMix authors' comparison of several augmentation methods; the results are clear, with CutMix performing best on all three datasets. Mixup sums two images directly, which looks like ghosting and makes it hard for the model to learn accurate feature-map responses. Cutout removes a region of the image outright, forcing the model not to over-rely on specific features when classifying, but the removed region is filled with useless information, which is wasteful. CutMix cuts and pastes part of one image onto another, which makes it easier for the model to distinguish the two sources.
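A minimal sketch of CutMix on a single pair of images (lambda drawn from a Beta distribution as in the paper; the images are (C, H, W) tensors and the labels one-hot vectors, all assumed inputs):

    import numpy as np
    import torch

    def cutmix(img_a, img_b, label_a, label_b, alpha=1.0):
        lam = np.random.beta(alpha, alpha)
        _, h, w = img_a.shape
        cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
        cy, cx = np.random.randint(h), np.random.randint(w)
        y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
        x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

        mixed = img_a.clone()
        mixed[:, y1:y2, x1:x2] = img_b[:, y1:y2, x1:x2]      # paste the patch from B into A
        lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)            # weight labels by the actual patch area
        return mixed, lam * label_a + (1 - lam) * label_b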

CutMix paper: https://arxiv.org/pdf/1905.04899v2.pdf

2. Mosaic
YOLOv4's Mosaic data augmentation draws on CutMix and is similar in spirit. The difference is that Mosaic stitches 4 training images into one for training (instead of 2 as in CutMix). This improves detection of objects outside their normal context and enriches the context of detected objects. In addition, each mini-batch effectively contains a much larger variety of images (4x), which reduces the need for large mini-batches when estimating the mean and variance and lowers the training cost. As shown below:

5.2.2 DropBlock Regularization
Regularization techniques help avoid the most common problem faced by data science professionals, namely overfitting. For regularization, several methods have been proposed, such as L1 and L2 regularization, dropout, early stopping, and data augmentation. Here YOLOv4 uses the DropBlock regularization method.

The DropBlock method was introduced to overcome the main disadvantage of Dropout, namely that randomly dropping individual features, while an effective strategy for fully connected networks, does not work well in convolutional layers where features are spatially correlated. DropBlock drops features in contiguous correlated regions called blocks. This produces a simpler model and, in each training iteration, forces the network to learn weights that compensate for the dropped blocks, thereby reducing overfitting. As shown below:

In the DropBlock paper, the authors report that with a ResNet-50 on the ImageNet classification task accuracy improves by 1.6 percentage points, and on the COCO detection task accuracy likewise improves by 1.6 percentage points.

DropBlock paper: https://arxiv.org/pdf/1810.12890.pdf

5.2.3 Class label smoothing
For classification problems, especially multi-class ones, the target is often converted into a one-hot vector. One-hot targets cause problems: the loss function asks the predicted probabilities to fit the "true" probabilities, and fitting a one-hot distribution brings two issues:

the model's generalization ability cannot be guaranteed and it easily overfits;
the full-probability and zero-probability targets encourage the gap between the true class and all other classes to be as large as possible, which the bounded gradient has difficulty achieving, so the model becomes over-confident in its predicted class.
Having 100% confidence in a prediction may mean the model is memorizing the data rather than learning. Label smoothing caps the adjusted prediction target at a lower value, say 0.9, and uses that instead of 1.0 when computing the loss, which alleviates overfitting. Put plainly, the smoothing narrows the gap between the min and max values in the label to some degree; pulling the extreme values at both ends toward the middle improves generalization, as in the sketch below.
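A minimal sketch of label smoothing on one-hot targets (smoothing factor eps = 0.1 assumed):

    import torch

    def smooth_labels(one_hot, eps=0.1):
        # move the 1.0 peak down toward 1 - eps and spread eps uniformly over all classes
        n_classes = one_hot.size(-1)
        return one_hot * (1.0 - eps) + eps / n_classes

    y = torch.tensor([0., 0., 1., 0.])
    print(smooth_labels(y))                   # tensor([0.0250, 0.0250, 0.9250, 0.0250])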

5.3 BackBone inference strategy
5.3.1 Mish Activation Function
Research on activation functions has never stopped. ReLU still dominates deep-learning activation functions, but this may be changed by Mish. Mish is another activation function very similar to ReLU and Swish; as the paper claims, it can outperform them in many deep networks across different datasets. The formula is as follows:

Mish(x) = x * tanh(ln(1 + e^x))

Mish is a smooth curve, and a smooth activation function lets information flow deeper into the neural network, giving better accuracy and generalization; it is not completely truncated on the negative side, allowing relatively small negative values to flow through. In the experiments, as the depth of the network increases the accuracy of ReLU drops rapidly, while Mish shows an overall improvement in training stability, average accuracy (1%-2.8%) and peak accuracy (1.2%-3.6%). As shown below:

Mish paper: https://arxiv.org/pdf/1908.08681.pdf

5.3.2 MiWRC strategy
MiWRC is the abbreviation of Multi-input weighted residual connections. In BiFPN, it is proposed to use MiWRC to perform scale-level reweighting and add feature maps of different scales. We have discussed FPN and PAN as examples. Figure (d) below shows another neck design known as BiFPN, which according to the BiFPN paper has a better accuracy and efficiency tradeoff.

In the figure above: (a) FPN introduces a top-down path to fuse multi-scale features from level 3 to level 7 (P3-P7); (b) PANet adds an additional bottom-up path on top of FPN; (c) NAS-FPN uses neural architecture search to find an irregular feature topology and then repeatedly applies the same block; (d) is BiFPN, which has a better accuracy and efficiency trade-off. The neck is placed into the connections of the whole network as shown in the figure below:

The above figure uses EfficientNet as the backbone network, BiFPN as the feature network, and shares the class/box prediction network. Both the BiFPN layer and the class/boxnet layer are repeated multiple times based on different resource constraints.

BiFPN paper: https://arxiv.org/pdf/1911.09070.pdf

5.4 Detection head training strategy
5.4.1 CIoU-loss

The loss function tells us how to adjust the weights to reduce the loss, so when we make a wrong prediction we expect it to point us in the right direction. But if plain IoU is used and neither prediction overlaps the ground truth, the IoU loss cannot tell which one is better, i.e. which one is closer to the ground truth. Let's look at the commonly used forms of box-regression loss:

1. Classic IoU loss:

The IoU algorithm is the most widely used algorithm, and most of the detection algorithms use this algorithm.

It can be seen that the IOU loss is actually very simple, essentially based on intersection/union, but it has two problems.

 

 

Problem 1: in state 1, when the prediction box and the target box do not intersect, IOU = 0, which cannot reflect how far apart the two boxes are. The loss function is then not differentiable in a useful way, so IOU_Loss cannot optimize the case where the two boxes are disjoint.

Problem 2: in states 2 and 3, the two prediction boxes have the same size and the same IOU, so IOU_Loss cannot distinguish how differently the two boxes intersect the target.

Therefore GIOU_Loss was proposed in 2019 as an improvement.

2. GIoU: Generalized IoU

GIoU addresses the fact that the IoU loss is identical whenever the detection box and the ground-truth box do not overlap: it adds the box C, defined as the smallest rectangle that contains both the detection box and the ground-truth box, which resolves the case where the two boxes do not overlap.

 

 

Problem: in states 1, 2 and 3 the prediction box lies inside the target box and has the same size, so the difference between the prediction box and the target box is the same and the GIOU values of the three states are also the same; GIOU then degenerates into IOU and cannot distinguish the relative positions.
To address this problem, DIOU_Loss was proposed at AAAI 2020.

3. DIoU: Distance IoU

A good box-regression loss should consider three important geometric factors: overlapping area, center-point distance, and aspect ratio. Regarding the problems of IOU and GIOU, the author considers two questions:

(1): How to minimize the normalized distance between the prediction box and the target box?

(2): How to make the regression more accurate when the prediction frame and the target frame overlap?

For the first question, DIOU_Loss (Distance_IOU_Loss) is proposed

DIOU_Loss considers both the overlapping area and the center-point distance. When the target box encloses the prediction box, it directly measures the distance between the two boxes, so DIOU_Loss converges faster.

But, as the description of a good box-regression loss above noted, the aspect ratio is still not taken into account.

 

For example, in the three cases above the target box encloses the prediction box, so DIOU_Loss can work, but the centers of the prediction boxes are in the same position, so by the DIOU_Loss formula the three cases all get the same value.

To address this problem, CIOU_Loss was proposed in turn; science keeps solving problems and making progress!

4. CIOU_Loss

CIOU_Loss has the same leading terms as DIOU_Loss but adds an influence factor that takes the aspect ratios of the prediction box and the target box into account:

CIOU_Loss = 1 - IOU + rho^2(b, b_gt) / c^2 + alpha * v

where rho is the distance between the two box centers, c is the diagonal length of the smallest enclosing box, and v is a parameter measuring the consistency of the aspect ratios, defined as:

v = (4 / pi^2) * (arctan(w_gt / h_gt) - arctan(w / h))^2,  with  alpha = v / ((1 - IOU) + v)

In this way, CIOU_Loss takes three important geometric factors into account in the regression function of the target frame: overlapping area, center point distance, and aspect ratio.

Let's take a comprehensive look at the differences between the various Loss functions:

IOU_Loss: Mainly consider the overlapping area of ​​the detection frame and the target frame.

GIOU_Loss: On the basis of IOU, solve the problem when the bounding boxes do not coincide.

DIOU_Loss: On the basis of IOU and GIOU, consider the information of the distance from the center point of the bounding box.

CIOU_Loss: On the basis of DIOU, consider the scale information of the bounding box aspect ratio.

YOLOv4 adopts the CIOU_Loss regression method, which makes the predicted boxes regress faster and more accurately; a rough sketch follows.
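A minimal sketch of a CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) form, following the formulation above (not the exact code of any particular repository):

    import math
    import torch

    def ciou_loss(pred, target, eps=1e-7):
        # pred, target: (N, 4) boxes as [x1, y1, x2, y2]; returns 1 - CIoU for each pair
        x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
        x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        iou = inter / (area_p + area_t - inter + eps)

        # squared center distance over squared diagonal of the smallest enclosing box
        cpx, cpy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        ctx, cty = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
        rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
        cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
        ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
        c2 = cw ** 2 + ch ** 2 + eps

        # aspect-ratio consistency term v and its weight alpha
        wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
        wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
        v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
        alpha = v / (1 - iou + v + eps)

        return 1 - (iou - rho2 / c2 - alpha * v)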

5.4.2 CmBN strategy
BN normalizes using the statistics of the current iteration only, whereas CBN also takes the statistics of the previous k iterations into account when computing the current statistics, effectively enlarging the batch size. The authors note that CBN does not introduce much extra memory overhead and barely affects throughput, although training becomes somewhat slower, slower than with GN.

CmBN is an improved version of CBN: it treats the four mini-batches inside one big batch as a whole, isolated from the outside. At time t, CBN would also fold in the statistics of the previous three moments, whereas CmBN does not; it no longer slides across batches, it only accumulates statistics inside the mini-batches of one batch, and it keeps BN's behaviour of updating the trainable parameters once per batch.

 

BN: however many mini-batches a batch is split into, after each mini-batch's forward pass the current BN statistics (the mean and variance of each neuron) are computed and used for normalization; the BN statistics are independent of the other mini-batches. CBN: the BN statistics in each iteration are an accumulation of the previous n iterations and the current one (the statistics from non-current batches are compensated before entering the calculation), and the accumulated values normalize the current batch; the benefit is that each batch can be made smaller. CmBN: applies the CBN method only inside each batch. If a batch is split into a single mini-batch, the effect is the same as BN; if split into several mini-batches, it is similar to CBN except that the mini-batches are treated as one batch, the weight-update timing differs, and the weights within a batch are the same, so no compensation is needed in the calculation.

5.4.3 Self-Adversarial Training (SAT)
SAT is a new data enhancement method. In the first stage, the neural network alters the original image instead of the network weights. In this way, the neural network conducts an adversarial attack on itself, changing the original image to create the illusion that there is no object on the image. In the second stage, the neural network is trained to perform normal object detection on the modified image.

Self-Adversarial Training is a data enhancement technology that resists adversarial attacks to a certain extent. CNN calculates the Loss, and then changes the picture information through backpropagation to form the illusion that there is no target on the picture, and then performs normal target detection on the modified image. It should be noted that in the process of backpropagation of SAT, there is no need to change the network weights. Using adversarial generation can improve the weak links in the learned decision boundary and improve the robustness of the model. Therefore, this data enhancement method is used by more and more object detection frameworks.

5.4.4 Eliminating grid sensitivity
For a box center that lies exactly on a cell boundary, bx = cx or bx = cx + 1, we need sigmoid(tx) to reach 0 or 1, which requires tx to take a very large negative or positive value, respectively. These extremes can be reached much more easily by multiplying the sigmoid by a scaling factor (> 1.0).

5.4.5 Cosine simulated annealing

Cosine scheduling adjusts the learning rate following a cosine function: the learning rate starts high and decreases slowly at first, drops faster around the midpoint, and then decreases very slowly again at the end.

The graph shows how the learning rate decays (learning-rate warm-up is also applied in the figure) and its effect on mAP. Although it may not look dramatic, this scheduling approach makes steadier progress instead of stalling for a while and then jumping forward.

 

Cosine simulated annealing paper: https://arxiv.org/pdf/1608.03983.pdf 
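A minimal sketch using PyTorch's built-in cosine annealing schedule (the warm-up phase shown in the figure is omitted; the model and optimizer here are dummies):

    import torch

    model = torch.nn.Linear(10, 1)                          # dummy model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100, eta_min=1e-5)

    for epoch in range(100):
        # ... run one training epoch here ...
        opt.step()
        sched.step()                                        # learning rate follows a cosine curve
        if epoch % 20 == 0:
            print(epoch, sched.get_last_lr())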

5.5 Detection head inference strategy
5.5.1 SAM module
Attention mechanisms are widely adopted in deep-learning design. In SAM, max pooling and average pooling are applied separately to the input feature map to create two sets of feature maps. The result is fed into a convolutional layer followed by a sigmoid to create the spatial attention.

The spatial attention mask is applied to the input features, producing a refined feature map.

In YOLOv4, a modified SAM is used that does not apply max pooling and average pooling.

In YOLOv4, the FPN concept is gradually implemented/replaced with the modified SPP, PAN and SAM.

5.5.2 DIoU-NMS

NMS filters out the other bounding boxes that predict the same object and keeps only the box with the highest confidence.

DIoU (discussed earlier) is used as the criterion for non-maximum suppression (NMS). This method uses both the IoU and the distance between the center points of two bounding boxes when suppressing redundant boxes, which makes it more robust to occlusion.

6. YOLOv5
Shortly after YOLOv4 appeared, YOLOv5 was released. YOLOv5 makes further improvements on top of the YOLOv4 algorithm, and its detection performance improves further as well. Although YOLOv5 has not been formally compared with YOLOv4, its results on the COCO dataset are quite good. People are divided over how innovative YOLOv5 is, some positive and some negative. In my opinion there is still a lot to learn from the YOLOv5 detection algorithm: although its improvements look relatively simple or not especially novel, they definitely improve detection performance, and in practice industry often prefers such methods to an extremely complex algorithm for obtaining high detection accuracy.

6.1 Introduction to YOLOv5


YOLOv5 is a single-stage target detection algorithm. On top of YOLOv4 it adds some new improvement ideas, which give it a large boost in both speed and accuracy. The main improvements are:

Input end: in the model training phase, several improvements are proposed, mainly Mosaic data augmentation, adaptive anchor-box calculation, and adaptive image scaling;
Benchmark network: some new ideas from other detection algorithms are integrated, mainly the Focus structure and the CSP structure;
Neck network: detection networks often insert some layers between the backbone and the final head output; YOLOv5 adds the FPN+PAN structure here;
Head output layer: the anchor-box mechanism of the output layer is the same as in YOLOv4; the main improvements are the GIOU_Loss loss function used during training and DIOU_NMS for prediction-box filtering.
6.2 Detailed explanation of YOLOv5 algorithm
6.2.1 YOLOv5 network architecture

The figure above shows the overall block diagram of the YOLOv5 target detection algorithm. A detection algorithm can usually be divided into 4 general modules: the input end, the benchmark (backbone) network, the Neck network and the Head output end, corresponding to the 4 red modules in the figure. YOLOv5 comes in four versions: YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. This article focuses on YOLOv5s; the other versions deepen and widen the network on top of it.

Input - The input represents the input image. The input image size of the network is 608*608, and this stage usually includes an image preprocessing stage, which is to scale the input image to the input size of the network, and perform operations such as normalization. In the network training phase, YOLOv5 uses Mosaic data enhancement operations to improve the training speed of the model and the accuracy of the network; and proposes an adaptive anchor frame calculation and adaptive image scaling method.
Benchmark network - the benchmark network is usually a classification network with excellent performance, used to extract general feature representations. YOLOv5 uses not only the CSPDarknet53 structure but also the Focus structure in its benchmark network.
Neck network-Neck network is usually located in the middle of the benchmark network and the head network, and it can further improve the diversity and robustness of features. Although YOLOv5 also uses the SPP module and the FPN+PAN module, the implementation details are somewhat different.
Head output terminal - Head is used to complete the output of target detection results. For different detection algorithms, the number of branches at the output end is different, usually including a classification branch and a regression branch. YOLOv4 uses GIOU_Loss to replace the Smooth L1 Loss function, thereby further improving the detection accuracy of the algorithm.
6.2.2 YOLOv5 basic components
CBL-CBL module consists of Conv+BN+Leaky_relu activation function, as shown in module 1 in the above figure.
Res unit - draw on the residual structure in the ResNet network to build a deep network. CBM is a sub-module in the residual module, as shown in module 2 in the above figure.
CSP1_X - borrowing the CSPNet network structure, this module is composed of CBL modules, Res unit modules, a convolutional layer and a Concat operation, as shown in module 3 in the figure above.
CSP2_X - borrowing the CSPNet network structure, this module is composed of convolutional layers and X Res unit modules concatenated together, as shown in module 4 in the figure above.
Focus-As shown in module 5 in the above figure, the Focus structure first concats the results of multiple slices, and then sends them to the CBL module.
SPP-uses 1×1, 5×5, 9×9 and 13×13 maximum pooling methods for multi-scale feature fusion, as shown in module 6 in the above figure.
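To make the CBL and SPP blocks concrete, the following is a minimal PyTorch sketch; the class names and pooling kernel sizes follow the description above, while the exact hyper-parameters (channel split, activation slope) are illustrative rather than the official YOLOv5 code.

import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BN + Leaky_relu, i.e. module 1 above."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPP(nn.Module):
    """Spatial pyramid pooling: identity path plus 5x5, 9x9 and 13x13 max pooling."""
    def __init__(self, c_in, c_out, ks=(5, 9, 13)):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = CBL(c_in, c_hid, 1)
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in ks)
        self.cv2 = CBL(c_hid * (len(ks) + 1), c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))

# Shape check: an SPP block keeps the spatial resolution unchanged.
print(SPP(512, 512)(torch.randn(1, 512, 19, 19)).shape)  # torch.Size([1, 512, 19, 19])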
6.2.3 Detailed explanation of input details
Mosaic data enhancement - YOLOv5 still uses the Mosaic data enhancement method in the model training stage, which improves on the CutMix data enhancement method. CutMix splices only two images, while Mosaic splices four images with random scaling, random cropping, and random arrangement; the effect is shown in the figure below. This enhancement combines several pictures into one, which enriches the data set, greatly improves the training speed of the network, and reduces the memory requirements of the model.
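For illustration, here is a heavily simplified sketch of the four-image splicing idea, assuming the source images are already loaded as HxWx3 uint8 numpy arrays; the real YOLOv5 implementation also remaps the bounding-box labels into the mosaic canvas and follows up with a random affine transform.

import random
import numpy as np

def mosaic(images, out_size=640, fill=114):
    """images: a list of four HxWx3 uint8 arrays; returns one mosaic canvas."""
    canvas = np.full((out_size * 2, out_size * 2, 3), fill, dtype=np.uint8)
    # pick a random mosaic centre, then fill the four quadrants around it
    cx = int(random.uniform(out_size * 0.5, out_size * 1.5))
    cy = int(random.uniform(out_size * 0.5, out_size * 1.5))
    corners = [(0, 0, cx, cy),                         # top-left
               (cx, 0, out_size * 2, cy),              # top-right
               (0, cy, cx, out_size * 2),              # bottom-left
               (cx, cy, out_size * 2, out_size * 2)]   # bottom-right
    for img, (x1, y1, x2, y2) in zip(images, corners):
        h, w = y2 - y1, x2 - x1
        # nearest-neighbour resize by index selection, to keep the sketch dependency-free
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas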

Adaptive anchor box calculation - in the YOLOv5 series, anchor boxes with specific widths and heights must be set for each data set. During training, the model outputs prediction boxes based on the initial anchor boxes, computes the gap between them and the ground-truth boxes, and back-propagates to update the parameters of the entire network, so setting the initial anchor boxes well is a crucial step. In YOLOv3 and YOLOv4, the initial anchor boxes for a new data set are obtained by running a separate program. YOLOv5 embeds this function into the training code and adaptively computes the best anchor boxes for the training data set at the start of each run. The check is enabled by default and is defined by parser.add_argument('--noautoanchor', action='store_true', help='disable autoanchor check'); users who want to turn it off only need to add the --noautoanchor option to the training command.
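The following is an illustrative sketch of how initial anchors can be estimated from the data set's ground-truth box sizes with a plain k-means clustering; the actual YOLOv5 autoanchor additionally refines the clusters with a genetic evolution step and checks the achievable recall, so this is only a conceptual approximation.

import numpy as np

def kmeans_anchors(wh, k=9, iters=30, seed=0):
    """wh: (N, 2) array of ground-truth box widths and heights in pixels."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # assign every box to its nearest anchor in (w, h) space, then recentre
        dist = np.linalg.norm(wh[:, None, :] - anchors[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted from small to large area

# Example: anchors = kmeans_anchors(all_label_wh, k=9) gives 3 anchors for each of 3 scales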
Adaptive image scaling - most target detection algorithms scale the original input image to a fixed size before sending it into the detection network; sizes commonly used in the YOLO series include 416*416 and 608*608. The original scaling method has a drawback: since real images have many different aspect ratios, scaling and padding leave black borders of different sizes at the two ends, and too much padding introduces a large amount of redundant information, which slows down inference. To further improve the inference speed of the YOLOv5 algorithm, it adaptively adds the smallest possible black border to the scaled image.
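A sketch of this "minimal black border" idea follows: the image is resized by the limiting ratio, and the shorter side is then padded only up to the next multiple of the network stride (32 for YOLOv5) instead of all the way to a square; the function name and the nearest-neighbour resize are for illustration only.

import numpy as np

def letterbox_minimal(img, new_size=640, stride=32, pad_value=114):
    """img: HxWx3 uint8 array; returns the padded image, the scale, and the padding."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)                 # scale so the longer side fits
    new_h, new_w = int(round(h * r)), int(round(w * r))
    pad_h = (stride - new_h % stride) % stride          # pad only to a multiple of the stride
    pad_w = (stride - new_w % stride) % stride
    ys = np.linspace(0, h - 1, new_h).astype(int)       # nearest-neighbour resize for the sketch
    xs = np.linspace(0, w - 1, new_w).astype(int)
    out = np.full((new_h + pad_h, new_w + pad_w, 3), pad_value, dtype=img.dtype)
    out[pad_h // 2:pad_h // 2 + new_h, pad_w // 2:pad_w // 2 + new_w] = img[ys][:, xs]
    return out, r, (pad_w / 2, pad_h / 2)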
6.2.4 Details of the benchmark network
Focus structure - the main idea of this structure is to slice the input image. As shown in the figure below, the original 608*608*3 input becomes a 304*304*12 feature map after the Slice and Concat operations; it then passes through a Conv layer with 32 output channels (this channel count applies only to the YOLOv5s structure; other versions change accordingly), producing a 304*304*32 feature map.
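A minimal PyTorch sketch of this slicing is shown below: the input is sampled at every second pixel in four phase-shifted slices, which are concatenated along the channel dimension before the following convolution. The channel counts match the YOLOv5s numbers quoted above; the rest is illustrative.

import torch
import torch.nn as nn

class Focus(nn.Module):
    def __init__(self, c_in=3, c_out=32, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in * 4, c_out, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        # four slices: (even, even), (odd, even), (even, odd), (odd, odd) pixels
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

# 608*608*3 -> 304*304*12 (after slicing) -> 304*304*32 (after the Conv layer)
print(Focus()(torch.randn(1, 3, 608, 608)).shape)  # torch.Size([1, 32, 304, 304])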

 

CSP structure - the YOLOv4 network borrows the design idea of CSPNet, but designs a CSP structure only in the backbone network. YOLOv5 designs two CSP structures: taking the YOLOv5s network as an example, the CSP1_X structure is applied to the Backbone, while the CSP2_X structure is applied to the Neck network. The composition of the CSP1_X and CSP2_X modules is described in 6.2.2 above.
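As a rough illustration of the CSP idea (split the feature map into a residual path and a shortcut path, then concatenate and fuse), here is a simplified PyTorch sketch; the layer counts and channel split are illustrative and do not reproduce the exact YOLOv5 layout.

import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1, s=1):
    # Conv + BN + LeakyReLU helper (the CBL block from 6.2.2)
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1, inplace=True))

class ResUnit(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(cbl(c, c, 1), cbl(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CSP1(nn.Module):
    """CSP1_X-style block: a residual path with X Res units plus a shortcut path."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hid = c_out // 2
        self.main = nn.Sequential(cbl(c_in, c_hid, 1), *[ResUnit(c_hid) for _ in range(n)])
        self.shortcut = nn.Conv2d(c_in, c_hid, 1, bias=False)
        self.fuse = cbl(2 * c_hid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.main(x), self.shortcut(x)], dim=1))

print(CSP1(64, 128, n=3)(torch.randn(1, 64, 76, 76)).shape)  # torch.Size([1, 128, 76, 76])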
6.2.5 Detailed explanation of Neck network details
FPN+PAN - YOLOv5's Neck network still uses the FPN+PAN structure, but with some improvements. The Neck structure of YOLOv4 uses ordinary convolution operations, whereas the Neck network of YOLOv5 adopts the CSP2 structure designed with reference to CSPNet, which strengthens the network's feature fusion capability. The figure below shows the Neck details of YOLOv4 and YOLOv5. By comparison we can find: (1) the gray area marks the first difference: YOLOv5 not only replaces part of the CBL modules with the CSP2_1 structure, but also removes the lower CBL module; (2) the green area marks the second difference: YOLOv5 replaces the CBL module after the Concat operation with a CSP2_1 module and also changes the position of another CBL module; (3) the blue area marks the third difference: the original CBL module is replaced with a CSP2_1 module in YOLOv5.

7. YOLOv6


YOLOv6 is a target detection framework developed by the Meituan Visual Intelligence Department and dedicated to industrial applications. The framework pays attention to detection accuracy and inference efficiency at the same time. Among the model sizes commonly used in industry, YOLOv6-nano reaches 35.0% AP on COCO with an inference speed of 1242 FPS on a T4, and YOLOv6-s reaches 43.1% AP on COCO with an inference speed of 520 FPS on a T4. For deployment, YOLOv6 supports GPU (TensorRT), CPU (OPENVINO) and ARM (MNN, TNN, NCNN) platforms, which greatly simplifies the adaptation work during project deployment.

A new framework whose accuracy and speed far exceed YOLOv5 and YOLOX

As a basic technology in the field of computer vision, target detection is widely used in industry, and the YOLO series of algorithms have gradually become the preferred framework for most industrial applications because of their good overall performance. So far, the industry has derived many YOLO detection frameworks, of which YOLOv5, YOLOX and PP-YOLOE are the most representative, but in actual use we found that these frameworks still leave considerable room for improvement in both speed and accuracy. Based on this, we developed a new target detection framework, YOLOv6, by studying and drawing on existing advanced techniques in the industry. The framework supports the full industrial chain of model training, inference and multi-platform deployment, and makes many improvements and optimizations at the algorithm level, such as the network structure and training strategy. On the COCO dataset, YOLOv6 surpasses other algorithms of the same size in both accuracy and speed; the relevant results are shown in the figure below:

YOLOv6 has made many improvements in Backbone, Neck, Head and training strategies:

Unified design of a more efficient Backbone and Neck: inspired by the design idea of hardware-aware neural networks, a reparameterizable, more efficient backbone network (EfficientRep Backbone) and a Rep-PAN Neck are designed based on the RepVGG style.

Optimized the design of a more concise and effective Efficient Decoupled Head, which further reduces the additional delay overhead caused by general decoupled heads while maintaining accuracy.

In the training strategy, we adopt the Anchor-free paradigm, supplemented by the SimOTA label allocation strategy and the SIoU bounding box regression loss to further improve the detection accuracy.

7.1 Hardware-friendly Backbone Network Design
The Backbone and Neck used by YOLOv5/YOLOX are based on CSPNet[5], and adopt multi-branch and residual structure. For hardware such as GPU, this structure will increase the delay to a certain extent, while reducing the utilization rate of memory bandwidth. The figure below is an introduction to the Roofline Model in the field of computer architecture, showing the relationship between computing power and memory bandwidth in hardware.

Therefore, we redesigned and optimized the Backbone and Neck based on the idea of hardware-aware neural network design. This idea starts from the characteristics of the hardware and of the inference/compilation framework, and takes hardware- and compiler-friendly structures as the design principle. When building the network, we comprehensively consider hardware computing power, memory bandwidth, compilation optimization features, network representation capability and so on, and then obtain a network structure that is both fast and accurate. We call the two redesigned detection components EfficientRep Backbone and Rep-PAN Neck in YOLOv6, and their main contributions are:

The RepVGG[4] style structure is introduced.
Backbone and Neck are redesigned based on the idea of ​​hardware awareness.
The RepVGG Style structure is a reparameterizable structure that has a multi-branch topology during training and can be equivalently fused into a single 3x3 convolution for deployment (the fusion process is shown in the figure below). The fused 3x3 convolution makes effective use of compute-dense hardware (such as GPUs) and also benefits from the highly optimized NVIDIA cuDNN and Intel MKL libraries on GPU/CPU.
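To make the fusion concrete, here is a small numerical sketch of folding the 3x3, 1x1 and identity branches into one 3x3 kernel and bias; for simplicity the BatchNorm layers are assumed to have already been folded into each branch's weight and bias, which the real RepVGG conversion does first.

import torch
import torch.nn.functional as F

def fuse_repvgg_branches(w3x3, b3x3, w1x1, b1x1, channels):
    """w3x3: (C, C, 3, 3); w1x1: (C, C, 1, 1); the identity branch has no weights."""
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])        # lift the 1x1 kernel to 3x3
    w_id = torch.zeros(channels, channels, 3, 3)   # identity as a centred 3x3 kernel
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_padded + w_id, b3x3 + b1x1

# Sanity check: the fused convolution matches the sum of the three branches.
C = 8
x = torch.randn(1, C, 16, 16)
w3, b3 = torch.randn(C, C, 3, 3), torch.randn(C)
w1, b1 = torch.randn(C, C, 1, 1), torch.randn(C)
reference = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
w, b = fuse_repvgg_branches(w3, b3, w1, b1, C)
print(torch.allclose(reference, F.conv2d(x, w, b, padding=1), atol=1e-5))  # True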

Experiments show that through the above strategy, YOLOv6 reduces the delay on the hardware, and significantly improves the accuracy of the algorithm, making the detection network faster and stronger. Taking the nano size model as an example, compared with the network structure adopted by YOLOv5-nano, this method increases the speed by 21%, and the accuracy increases by 3.6% AP.

EfficientRep Backbone: In terms of Backbone design, we designed an efficient Backbone based on the above Rep operator. Compared with the CSP-Backbone used by YOLOv5, the Backbone can efficiently utilize the computing power of hardware (such as GPU) and has strong representation ability.

The following figure is the specific design structure diagram of EfficientRep Backbone. We replaced the ordinary Conv layer with stride=2 in Backbone with the RepConv layer with stride=2. At the same time, the original CSP-Block is redesigned as RepBlock, in which the first RepConv of RepBlock will do channel dimension transformation and alignment. In addition, we optimized the original SPPF into a more efficient SimSPPF.

Rep-PAN: in the Neck design, in order to make inference on hardware more efficient and to achieve a better balance between accuracy and speed, we designed a more effective feature fusion network for YOLOv6 based on the hardware-aware neural network design idea.

Rep-PAN is based on the PAN topology, replacing the CSP-Block used in YOLOv5 with RepBlock and adjusting the operators in the overall Neck, with the goal of achieving efficient inference on hardware while maintaining good multi-scale feature fusion capability (the Rep-PAN structure is shown in the figure below).

7.2 More concise and efficient Decoupled Head
In YOLOv6, we adopted the Decoupled Head structure and simplified its design. The detection head of the original YOLOv5 is a coupled head in which the classification and regression branches are fused and shared, while the detection head of YOLOX decouples the classification and regression branches and adds two extra 3x3 convolution layers; this improves detection accuracy but increases network latency to a certain extent.

Therefore, we simplified the design of the decoupled head, weighing the representation capability of the relevant operators against their computing overhead on hardware, and redesigned a more efficient decoupled head structure with a Hybrid Channels strategy. It maintains accuracy while alleviating the extra latency overhead caused by the 3x3 convolutions in the decoupled head. In an ablation experiment on the nano-size model, compared with a decoupled head with the same number of channels, accuracy increases by 0.2% AP and speed increases by 6.8%.
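For reference, here is a minimal sketch of a decoupled detection head: a shared 1x1 stem followed by separate classification and regression/objectness branches. The channel counts and layer depths are illustrative and do not reproduce the exact Efficient Decoupled Head or its Hybrid Channels sizing.

import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, c_in, num_classes, c_mid=64):
        super().__init__()
        self.stem = nn.Conv2d(c_in, c_mid, 1)
        self.cls_branch = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.SiLU(),
                                        nn.Conv2d(c_mid, num_classes, 1))
        self.reg_branch = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.SiLU(),
                                        nn.Conv2d(c_mid, 4 + 1, 1))  # box (4) + objectness (1)

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)

cls_out, reg_out = DecoupledHead(256, 80)(torch.randn(1, 256, 20, 20))
print(cls_out.shape, reg_out.shape)  # torch.Size([1, 80, 20, 20]) torch.Size([1, 5, 20, 20])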

7.3 More effective training strategy
In order to further improve detection accuracy, we absorbed advanced research progress from other detection frameworks in academia and industry: the Anchor-free paradigm, the SimOTA label assignment strategy and the SIoU bounding box regression loss.

7.3.1 Anchor-free paradigm
YOLOv6 adopts a more concise Anchor-free detection method. An Anchor-based detector needs to run a clustering analysis before training to determine the optimal anchor set, which increases the complexity of the detector to a certain extent; meanwhile, in some edge applications, moving a large number of detection results between pieces of hardware also introduces extra latency. The Anchor-free paradigm has been widely adopted in recent years because of its strong generalization ability and simpler decoding logic. In our experiments on Anchor-free detection, we found that, compared with the extra latency brought by the complexity of Anchor-based detectors, the Anchor-free detector is 51% faster.
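The simpler decoding logic mentioned above can be illustrated as follows: each grid cell directly predicts an offset from its own centre plus a log-scale width and height, multiplied by the stride, with no anchor priors to cluster beforehand. The exact decoding used by YOLOv6 may differ in its details; this is only a conceptual sketch.

import torch

def decode_anchor_free(pred, stride):
    """pred: (H, W, 4) raw outputs per grid cell = (dx, dy, dw, dh)."""
    h, w, _ = pred.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs + 0.5 + pred[..., 0]) * stride        # box centre in image coordinates
    cy = (ys + 0.5 + pred[..., 1]) * stride
    bw = pred[..., 2].exp() * stride               # box size, decoded from log scale
    bh = pred[..., 3].exp() * stride
    return torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=-1)

print(decode_anchor_free(torch.randn(20, 20, 4), stride=32).shape)  # torch.Size([20, 20, 4])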

7.3.2 SimOTA label assignment strategy
In order to obtain more high-quality positive samples, YOLOv6 introduces the SimOTA [4] algorithm to dynamically assign positive samples to further improve detection accuracy. The label allocation strategy of YOLOv5 is based on Shape matching, and the number of positive samples is increased through the cross-grid matching strategy, so that the network can quickly converge. However, this method is a static allocation method and will not be adjusted during the network training process.

In recent years, many methods based on dynamic label assignment have emerged. During training, such methods allocate positive samples according to the network's own outputs, generating more high-quality positive samples and thereby promoting the optimization of the network. For example, OTA models sample matching as an optimal transport problem and obtains the globally optimal matching strategy to improve accuracy, but the Sinkhorn-Knopp algorithm it relies on greatly increases training time, whereas the SimOTA[4] algorithm uses a Top-K approximation to obtain the best sample matches, which speeds up training considerably. Therefore, YOLOv6 adopts the SimOTA dynamic assignment strategy; combined with the anchor-free paradigm, the average detection accuracy on the nano-size model increases by 1.3% AP.
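A heavily simplified sketch of the dynamic top-k idea behind SimOTA follows: for each ground-truth box, the number of positive samples (the dynamic k) is estimated from the sum of its top IoUs, and the k lowest-cost candidate predictions are selected. The real SimOTA also restricts candidates to a centre region and resolves predictions matched to multiple ground truths, which is omitted here.

import torch

def simota_like_assign(cost, ious, topk=10):
    """cost: (num_gt, num_pred) matching cost; ious: (num_gt, num_pred) overlaps."""
    num_gt, num_pred = cost.shape
    matching = torch.zeros(num_gt, num_pred, dtype=torch.bool)
    for g in range(num_gt):
        # dynamic k: roughly "how many predictions already overlap this GT well"
        k = int(ious[g].topk(min(topk, num_pred)).values.sum().item())
        k = max(1, min(k, num_pred))
        idx = cost[g].topk(k, largest=False).indices   # the k lowest-cost predictions
        matching[g, idx] = True
    return matching

m = simota_like_assign(torch.rand(3, 100), torch.rand(3, 100))
print(int(m.sum()), "positive matches")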

7.3.3 SIoU bounding box regression loss
In order to further improve regression accuracy, YOLOv6 uses the SIoU bounding box regression loss function to supervise the learning of the network. Training a target detection network generally requires defining at least two loss functions: a classification loss and a bounding box regression loss, and the definition of the loss function often has a great impact on detection accuracy and training speed.

In recent years, commonly used bounding box regression losses include the IoU, GIoU, CIoU and DIoU losses. These loss functions measure the relationship between the predicted box and the target box by considering factors such as their degree of overlap, the distance between their centre points and their aspect ratios, and guide the network to minimize the loss and improve regression accuracy; however, they do not take into account how well the direction between the predicted box and the target box matches. The SIoU loss function redefines the distance loss by introducing the vector angle of the required regression, which effectively reduces the degrees of freedom of the regression, speeds up network convergence and further improves regression accuracy. Experiments with the SIoU loss on YOLOv6s show that, compared with the CIoU loss, the average detection accuracy increases by 0.3% AP.
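For concreteness, here is a sketch of the plain IoU and GIoU losses mentioned above, with boxes in (x1, y1, x2, y2) format; SIoU additionally folds in angle and distance terms, which are omitted here for brevity.

import torch

def giou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2); returns the mean 1 - GIoU."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest enclosing box of the two boxes
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / (enclose + eps)
    return (1.0 - giou).mean()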
