[Object Detection] YOLOv4 explained in detail

1. YOLOv4 network structure

Having covered V1, V2, and V3 above, the network difference between YOLOv4 and V3 is the addition of the CSP and PAN structures, plus an SPP module. Here is the network diagram.

 First, let's introduce the various components that appear in the network structure.

CBM: composed of Conv + BN + the Mish activation function. The difference from V3 is that the activation here is changed from Leaky_ReLU to Mish.

CBL: composed of Conv + BN + Leaky_ReLU. This was the smallest component in YOLOv3; in V4 it is used in the Neck rather than in the Backbone.

Res unit: borrows the residual structure from the ResNet network so that the network can be built deeper.

CSPX: borrows the CSPNet network structure; it consists of convolutional layers and X Res unit modules joined by a Concat.

SPP: uses 1×1, 5×5, 9×9, and 13×13 max pooling for multi-scale fusion.

One point worth emphasizing here: the Concat and Add operations must be kept distinct. Concat stacks feature maps along the channel dimension, while Add sums them element-wise without changing the number of channels.
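
As a rough PyTorch sketch (the helper names `cbm` and `cbl` are hypothetical, not from the official implementation), the two basic blocks differ only in the activation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def cbm(in_ch, out_ch, k=3, s=1):
    """CBM = Conv + BN + Mish (used in the Backbone)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        Mish(),
    )

def cbl(in_ch, out_ch, k=3, s=1):
    """CBL = Conv + BN + Leaky_ReLU (used in the Neck)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )
```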

With these components introduced, you can see that the overall YOLOv4 model divides into four major blocks, each containing its own tricks.

Input: data augmentation methods such as Mosaic augmentation, cmBN, and SAT (self-adversarial training).

Backbone: replaced with CSPDarknet53, plus the Mish activation function and DropBlock.

Neck: detection networks often insert layers between the Backbone and the final output layer; in YOLOv4 these are the SPP module and the FPN+PAN structure.

Prediction: the anchor-box mechanism of the output layer is the same as in YOLOv3. The main improvements are the CIOU_Loss loss function during training, and DIOU_NMS replacing plain NMS for prediction-box filtering.

Take a look at the author's overall diagram.

2. Innovations at the input end

Here we mainly discuss the Mosaic data augmentation method. This augmentation splices images together via random scaling, random cropping, and random arrangement; four images are used per splice. The details are shown in the following figure:

Data augmentation is a very effective technique, especially when the dataset is small. The main benefits of Mosaic augmentation in YOLOv4 are (see the sketch after this list):

  1. A richer dataset: four images are randomly selected, randomly scaled, and then randomly arranged for splicing, which greatly enriches the detection dataset. In particular, the random scaling adds many small targets, making the network more robust.
  2. Reduced GPU requirements: one might argue that random scaling with ordinary augmentation achieves the same thing, but the author considered that many people have only one GPU. With Mosaic augmentation, the data of four images is processed in a single pass, so the mini-batch size does not need to be large and a single GPU can achieve good results.
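
A minimal sketch of the idea (Python with OpenCV and NumPy assumed; `mosaic4` is a hypothetical helper, and the bounding-box label remapping that a real implementation needs is omitted):

```python
import random
import numpy as np
import cv2

def mosaic4(images, out_size=608):
    """Toy Mosaic: tile 4 images (HxWx3 uint8 arrays) around a random
    center point. A real implementation would also remap each image's
    bounding-box labels into the new canvas; that step is omitted here."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    # random mosaic center, kept away from the borders
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    # quadrants: top-left, top-right, bottom-left, bottom-right
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        # resize (i.e. randomly rescale) each image into its quadrant
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```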

3. Innovations in the Backbone

First, let's look at the overall framework. It is based on YOLOv3's Darknet53, with the CSPNet idea applied so that the stacks of Res units are wrapped into CSP modules.

 

The full name of CSPNet is Cross Stage Partial Network; it mainly tackles the heavy computation of inference from the angle of network architecture design. The CSPNet authors argue that the high inference cost is caused by duplicated gradient information during network optimization. The CSP module therefore first splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy, which reduces computation while preserving accuracy. Accordingly, YOLOv4 adopts the CSPDarknet53 structure in its Backbone, which brings three main advantages: it enhances the learning ability of the CNN, keeping the model lightweight while maintaining accuracy; it reduces computational bottlenecks; and it reduces memory cost.
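
A minimal PyTorch sketch of the split-and-merge idea (`CSPBlock` and `ResUnit` are hypothetical, simplified modules; the real CSPDarknet53 stages also use BN, Mish, and different channel splits):

```python
import torch
import torch.nn as nn

class ResUnit(nn.Module):
    """Minimal Res unit: 1x1 then 3x3 conv with a skip connection
    (BN and Mish omitted here for brevity)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 1, bias=False)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class CSPBlock(nn.Module):
    """CSP idea: split the base feature map into two parts, send only
    one part through the X Res units, then merge across the stage."""
    def __init__(self, ch, num_units):
        super().__init__()
        half = ch // 2
        self.part1 = nn.Conv2d(ch, half, 1, bias=False)   # shortcut branch
        self.part2 = nn.Conv2d(ch, half, 1, bias=False)   # residual branch
        self.units = nn.Sequential(*[ResUnit(half) for _ in range(num_units)])
        self.merge = nn.Conv2d(2 * half, ch, 1, bias=False)

    def forward(self, x):
        a = self.part1(x)                # part 1 skips the heavy computation
        b = self.units(self.part2(x))    # part 2 runs through X Res units
        return self.merge(torch.cat([a, b], dim=1))   # cross-stage merge
```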

The second change is applying the Mish function in the backbone. Its difference from Leaky_ReLU is shown in the figure below:
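
Since the comparison figure is not reproduced here, the two functions themselves are (with $\alpha = 0.1$, the negative slope YOLO typically uses for Leaky_ReLU):

$$\text{Mish}(x) = x \cdot \tanh\big(\ln(1 + e^{x})\big), \qquad \text{Leaky\_ReLU}(x) = \max(\alpha x,\ x)$$

Unlike Leaky_ReLU, Mish is smooth everywhere and slightly non-monotonic for small negative inputs, which is reported to bring a small accuracy gain at some extra compute cost.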

The third is the application of DropBlock, which randomly discards parts of the feature map to regularize the network. The difference from Dropout is this: Dropout randomly discards individual neurons, but convolutional layers are not very sensitive to that, because adjacent activation units carry highly correlated information and the dropped signal can still be recovered from neighbors. So Dropout, which works on fully connected layers, is not well suited to convolutional layers; DropBlock instead discards contiguous regions of the feature map. Specifically, as shown in the figure below:
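
A toy, training-time-only sketch of the mechanism (`drop_block` is a hypothetical helper; edge handling and the exact normalization differ from the paper):

```python
import torch
import torch.nn.functional as F

def drop_block(x, block_size=7, drop_prob=0.1):
    """Toy DropBlock for x of shape (N, C, H, W): sample seed points,
    grow each into a block_size x block_size square of zeros, then
    rescale the surviving activations."""
    n, c, h, w = x.shape
    # seed density chosen so the expected dropped fraction is ~drop_prob
    gamma = drop_prob * h * w / (block_size ** 2) \
            / ((h - block_size + 1) * (w - block_size + 1))
    seeds = (torch.rand_like(x) < gamma).float()
    # expand every seed into a square block via stride-1 max pooling
    block_mask = F.max_pool2d(seeds, block_size, stride=1,
                              padding=block_size // 2)
    keep_mask = 1.0 - block_mask.clamp(max=1.0)
    # rescale so the expected activation magnitude stays the same
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```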

4. Innovations in the Neck

The first one is the addition of the SPP component. The main principle of SPP is to apply max pooling with kernels K = {1×1, 5×5, 9×9, 13×13} and then Concat the resulting multi-scale feature maps. Padding is applied so that each pooled feature map keeps the same spatial size as the input, which is what makes the Concat possible.

This effectively enlarges the receptive field of the backbone features.
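
A minimal PyTorch sketch of this module (the `SPP` class below is hypothetical but follows the stride-1, padding-k//2 scheme just described; the 1×1 branch is simply the identity):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP sketch: parallel max-pools with k = 5, 9, 13 (stride 1,
    padding k // 2 keeps the spatial size), concatenated with the
    identity (k = 1) branch along the channel axis."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels
        )

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```

For example, a (1, 512, 19, 19) input becomes a (1, 2048, 19, 19) output, since the identity branch and the three pooled branches are concatenated along the channel axis.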

The second change is the use of FPN+PAN, which is also the biggest difference from YOLOv3. First, look at the structure diagram:

The convolution kernel in front of each CSP module is 3×3 with stride 2, which is equivalent to a downsampling operation, so the feature maps at the three purple arrows are 76×76, 38×38, and 19×19. You can see that the difference from YOLOv3 is not only the FPN structure but also the way the large feature maps are combined back with the small ones: this is PAN. The specific pictures are as follows:

The advantage is that large feature maps contain stronger localization features, which help regress the object box more precisely, while deeper feature maps carry stronger semantic information; combining the two is more effective.
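
A schematic sketch of the two fusion paths (PyTorch assumed; `fpn_pan`, `up`, and `down` are hypothetical stand-ins, and `conv` defaults to the identity so the snippet actually runs, whereas the real YOLOv4 uses CBL stacks and stride-2 convolutions there):

```python
import torch
import torch.nn.functional as F

def fpn_pan(p3, p4, p5, conv=lambda t: t):
    """Schematic FPN+PAN fusion over the 76x76 (p3), 38x38 (p4) and
    19x19 (p5) maps. `conv` stands in for the CBL stacks that would
    also fix up channel counts."""
    up = lambda t: F.interpolate(t, scale_factor=2)     # 19->38->76
    down = lambda t: F.max_pool2d(t, 2)  # stand-in for a stride-2 conv

    # FPN top-down path: semantic information flows to the large maps
    f4 = conv(torch.cat([p4, up(p5)], dim=1))
    f3 = conv(torch.cat([p3, up(f4)], dim=1))

    # PAN bottom-up path: localization information flows back down
    n3 = f3
    n4 = conv(torch.cat([f4, down(n3)], dim=1))
    n5 = conv(torch.cat([p5, down(n4)], dim=1))
    return n3, n4, n5    # heads predict at 76x76, 38x38 and 19x19
```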

5. Prediction

The loss function of a detection task generally consists of two parts: Classification Loss and Bounding Box Regression Loss. The development of bounding-box regression losses in recent years has been: Smooth L1 Loss -> IoU Loss (2016) -> GIoU Loss (2019) -> DIoU Loss (2020) -> CIoU Loss (2020). Let's start from the most basic IOU_Loss, compare and dissect each in turn, and see why YOLOv4 chose CIOU_Loss.

a. IOU_Loss

In the first picture you can see that IoU is very simple: it is just intersection over union, IoU = (A ∩ B) / (A ∪ B), with IOU_Loss = 1 − IoU. But two problems appear in the figure below.

 

Question 1: in state 1, when the prediction box and the target box are disjoint, IoU = 0, which cannot reflect how far apart the two boxes are. The loss then has zero gradient, so IOU_Loss cannot optimize the disjoint case.

Question 2: in states 2 and 3, the two prediction boxes are the same size and the two IoU values are equal, so IOU_Loss cannot distinguish the two different intersection geometries.
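
A minimal PyTorch sketch (`box_iou` and `iou_loss` are hypothetical helpers, with boxes in (x1, y1, x2, y2) form; the GIoU/DIoU/CIoU sketches below reuse `box_iou`):

```python
import torch

def box_iou(pred, target, eps=1e-7):
    """Element-wise IoU of two box tensors; also returns the union area."""
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter + eps
    return inter / union, union

def iou_loss(pred, target):
    """IOU_Loss = 1 - IoU: zero gradient once the boxes are disjoint."""
    iou, _ = box_iou(pred, target)
    return 1.0 - iou
```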

So GIOU_Loss appeared as an improvement.

b. GIOU_Loss

As you can see in the GIOU_Loss on the right, a measure of the enclosing scale is added, which alleviates the awkwardness of plain IOU_Loss. But why only "alleviates"? Because a shortcoming remains: in states 1, 2, and 3, the prediction box lies inside the target box and has the same size in all three cases. The difference set between the prediction box and the target box is then identical, so the three GIoU values are also the same. GIoU degenerates into IoU and cannot distinguish the relative positions. To address this problem, DIOU_Loss was proposed.
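
Extending the sketch above with the smallest enclosing box C (`giou_loss` is again a hypothetical helper):

```python
def giou_loss(pred, target, eps=1e-7):
    """GIoU = IoU - |C \ (A u B)| / |C|, where C is the smallest box
    enclosing both; disjoint boxes now receive a nonzero gradient."""
    iou, union = box_iou(pred, target)    # helper sketched above
    cx1 = torch.min(pred[..., 0], target[..., 0])
    cy1 = torch.min(pred[..., 1], target[..., 1])
    cx2 = torch.max(pred[..., 2], target[..., 2])
    cy2 = torch.max(pred[..., 3], target[..., 3])
    c_area = (cx2 - cx1) * (cy2 - cy1) + eps
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```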

c. DIOU_Loss

A good bounding-box regression loss should consider three important geometric factors: overlap area, center-point distance, and aspect ratio. Regarding the problems of IoU and GIoU, the author considers two questions. One: how to minimize the normalized distance between the prediction box and the target box? Two: how to make the regression more accurate when the prediction box and the target box already overlap? For the first question, DIOU_Loss (Distance IoU Loss) was proposed.

DIOU_Loss considers the overlap area and the center-point distance. When the target box encloses the prediction box, it directly measures the distance between the two boxes, so DIOU_Loss converges faster. But, as the criteria for a good regression loss above suggest, the aspect ratio is still not taken into account. For example, in the three cases above, the target box encloses the prediction box and DIOU_Loss can work, but the center point of the prediction box is the same in all three, so by the DIOU_Loss formula the three values are identical. To address this problem, CIOU_Loss was proposed.
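
A sketch of the distance term (`diou_loss` is hypothetical, reusing `box_iou` from above):

```python
def diou_loss(pred, target, eps=1e-7):
    """DIoU = IoU - d^2 / c^2: d is the distance between box centers,
    c the diagonal of the smallest enclosing box."""
    iou, _ = box_iou(pred, target)        # helper sketched above
    # squared center distance d^2
    px = (pred[..., 0] + pred[..., 2]) / 2
    py = (pred[..., 1] + pred[..., 3]) / 2
    tx = (target[..., 0] + target[..., 2]) / 2
    ty = (target[..., 1] + target[..., 3]) / 2
    d2 = (px - tx) ** 2 + (py - ty) ** 2
    # squared diagonal c^2 of the smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    return 1.0 - (iou - d2 / c2)
```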

d. CIOU_Loss

The first part of the CIOU_Loss formula is the same as DIOU_Loss, but an influence factor is added on top of it, taking the aspect ratios of the prediction box and the target box into account.

Here v is a parameter measuring the consistency of the aspect ratios, defined as follows:
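
The formula figures are not reproduced here; for reference, the standard definitions from the CIoU paper are:

$$\text{CIoU\_Loss} = 1 - \text{IoU} + \frac{\rho^{2}(b,\ b^{gt})}{c^{2}} + \alpha v, \qquad \alpha = \frac{v}{(1-\text{IoU}) + v},$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},$$

where $\rho$ is the distance between the two box centers, $c$ is the diagonal length of the smallest enclosing box, and $(w, h)$ and $(w^{gt}, h^{gt})$ are the widths and heights of the prediction box and the target box.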

In this way, the CIOU_Loss regression function takes all three important geometric factors into account: overlap area, center-point distance, and aspect ratio.
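
As a final sketch (PyTorch assumed, reusing the hypothetical `box_iou` and `diou_loss` helpers above; alpha is treated as a constant during backpropagation, as in the paper):

```python
import math

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss = DIoU loss + alpha * v, where v measures the
    aspect-ratio mismatch between the two boxes."""
    iou, _ = box_iou(pred, target)
    w_p = pred[..., 2] - pred[..., 0]
    h_p = (pred[..., 3] - pred[..., 1]).clamp(min=eps)
    w_t = target[..., 2] - target[..., 0]
    h_t = (target[..., 3] - target[..., 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(w_t / h_t) - torch.atan(w_p / h_p)) ** 2
    alpha = (v / ((1.0 - iou) + v + eps)).detach()   # no gradient through alpha
    return diou_loss(pred, target, eps) + alpha * v
```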

Let's take a comprehensive look at the differences between the various Loss functions:

IOU_Loss: mainly considers the overlap area between the prediction box and the target box.

GIOU_Loss: on the basis of IoU, solves the case where the bounding boxes do not overlap.

DIOU_Loss: on the basis of IoU and GIoU, additionally considers the distance between the box center points.

CIOU_Loss: on the basis of DIoU, additionally considers the aspect-ratio information of the boxes.

YOLOv4 adopts the CIOU_Loss regression method, which makes prediction-box regression faster and more accurate.

Reference: A Complete Explanation of the Core Basics of YOLOv3, YOLOv4, YOLOv5, and YOLOX in the YOLO Series (Zhihu)
