YOLOv5

YOLO Series (5) - YOLOv5



Foreword

This series is the author's summary of self-study on the YOLO family of detectors, and this article summarizes and analyzes YOLOv5. The YOLOv5 family contains four models of different scales: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Taking YOLOv5s as the main example, this article divides the network into four parts (Input, Backbone, Neck, and Head), analyzes the innovations in YOLOv5 one by one, and then explains the differences between YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x.


1. Input

(1) Adaptive anchor box calculation

The earlier YOLOv3 and YOLOv4 also compute prior anchor boxes for each dataset. During training, the network predicts offsets relative to these anchors, outputs predicted boxes, compares them with the ground-truth boxes, and then backpropagates the gradients. In YOLOv3 and YOLOv4, a separate script has to be run to compute the initial anchor boxes when training on a new dataset. In YOLOv5, this function is embedded in the training code itself, so before each training run starts, the anchors are adaptively recomputed for the dataset at hand.
The specific process of the adaptive calculation is as follows (a simplified sketch is given after the list):

  1. Collect the width and height of all labeled boxes in the dataset.
  2. Scale each image proportionally to the specified resize size, making sure the larger of the width and height matches that size.
  3. Convert the boxes from relative coordinates to absolute coordinates by multiplying with the scaled width and height.
  4. Filter the boxes, keeping only those whose width and height are both at least two pixels.
  5. Run k-means clustering on the box dimensions to obtain n anchors, the same operation as in v3 and v4.
  6. Randomly mutate the widths and heights of the anchors with a genetic algorithm. If a mutation improves the result, the mutated values are assigned to the anchors; if it makes things worse, it is skipped. By default 1000 mutations are tried, and the quality of a set of anchors is evaluated by the fitness computed in the anchor_fitness method.
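
For concreteness, below is a minimal sketch of steps 5 and 6, assuming `wh` already holds the scaled and filtered box widths and heights from steps 1-4. The fitness metric is simplified and all names are illustrative; this is not the actual autoanchor code.

```python
import numpy as np
from scipy.cluster.vq import kmeans

def autoanchor_sketch(wh, n=9, iters=1000):
    # wh: float array of shape (N, 2) with box widths/heights,
    # already scaled to the training resolution and filtered to >= 2 px.

    # Step 5: k-means on the box dimensions gives the initial n anchors.
    std = wh.std(0)
    anchors, _ = kmeans(wh / std, n)   # cluster in whitened space
    anchors *= std

    def fitness(anc):
        # Simplified fitness: mean best width/height ratio between each box
        # and its best-matching anchor (higher is better).
        r = wh[:, None] / anc[None]
        return np.minimum(r, 1 / r).min(2).max(1).mean()

    # Step 6: genetic-style mutation; keep a mutation only if fitness improves.
    best = fitness(anchors)
    for _ in range(iters):
        mutated = anchors * np.clip(np.random.normal(1, 0.1, anchors.shape), 0.5, 2.0)
        f = fitness(mutated)
        if f > best:
            best, anchors = f, mutated

    return anchors[np.argsort(anchors.prod(1))]   # sorted by area, small to large
```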

(2) Adaptive image scaling

In common object detection algorithms, different images have different widths and heights, so the usual approach is to uniformly scale the original image to a standard size and then feed it into the detection network. In practice, however, many images have different aspect ratios, so after scaling and padding, the black borders at the two ends differ in size; the more padding there is, the more redundant information, which slows down inference. YOLOv5 therefore modifies the letterbox function in datasets.py to adaptively add the smallest possible black border to the original image. The black-border calculation works as follows:
1. Calculate the scale ratio from the size of the original image and the input size of the network.
Suppose the original image is 800×600 and the target size is 416×416. The two candidate ratios are 416/800 = 0.52 and 416/600 = 0.69; the smaller one, 0.52, is chosen as the final scale ratio.
2. Calculate the scaled image size from the original size and the scale ratio.
Original size * scale ratio: 800 * 0.52 = 416 and 600 * 0.52 = 312.
3. Calculate the padding of the black border.
416 - 312 = 104 is the height that would otherwise have to be filled. Taking the remainder with np.mod(104, 32) gives 8 pixels, and dividing by 2 gives the 4 pixels of padding to add at the top and at the bottom of the image. Why take the remainder modulo 32? Because YOLOv5 downsamples 5 times and 2 to the 5th power is 32, so the padded height only needs to be a multiple of 32; the full multiples of 32 are dropped and only the remainder is padded. A minimal sketch of this letterbox procedure is given below.
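
The following is a minimal sketch of these three steps, assuming an OpenCV BGR image as input; the function name, parameter names, and padding color are illustrative, and the real letterbox function in datasets.py has more options.

```python
import numpy as np
import cv2

def letterbox_sketch(img, new_size=416, stride=32, color=(114, 114, 114)):
    # Simplified sketch of the adaptive letterbox idea (illustrative names).
    h, w = img.shape[:2]

    # 1) Scale ratio: take the smaller of the two so the longer side fits.
    r = min(new_size / h, new_size / w)            # e.g. min(416/600, 416/800) = 0.52

    # 2) Size after proportional scaling.
    new_w, new_h = int(round(w * r)), int(round(h * r))   # e.g. 416 x 312

    # 3) Padding: only pad up to the nearest multiple of the stride (32),
    #    instead of padding all the way to new_size.
    dw = np.mod(new_size - new_w, stride) / 2      # per-side padding, width
    dh = np.mod(new_size - new_h, stride) / 2      # e.g. np.mod(104, 32) = 8 -> 4 per side

    img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=color)
```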
Note that the reduced-black-border method is not used during training; there, the traditional padding method is used, i.e. images are scaled and padded to 416×416. Only at test time, when the model runs inference, is the reduced-black-border method used, to speed up object detection.


2. Backbone

The YOLOv5 network structure diagram below is quoted from blogger Jiang Dabai. It shows that the innovations of the Backbone in YOLOv5 lie in the Focus module and the CSP1_X module.
[Figure: overall YOLOv5 network structure, from blogger Jiang Dabai]

(1) Focus

[Figure: the Focus module]
The Focus module expands the input channels by a factor of 4; its purpose is to downsample without losing information while speeding up computation. Focus was introduced in YOLOv5: it first slices the feature map, then concatenates the slices, and then passes the result to the following module. Taking YOLOv5s as an example, the original 608×608×3 image enters the Focus structure, the slice operation first turns it into a 304×304×12 feature map, and a convolution with 32 kernels then turns it into a 304×304×32 feature map. Note that the Focus structure of YOLOv5s uses 32 convolution kernels at this point, while the other three models use more; the differences between the four models are explained later.
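
As a minimal PyTorch sketch of the slicing idea (illustrative class and argument names, not the official ultralytics module), assuming a 3-channel input and 32 output kernels as in YOLOv5s:

```python
import torch
import torch.nn as nn

class FocusSketch(nn.Module):
    # Sketch of the Focus idea: slice the image into 4 sub-images and stack
    # them on the channel axis, then apply one convolution.
    def __init__(self, in_channels=3, out_channels=32):
        super().__init__()
        # After slicing, channels are 4x the input (3 -> 12); the convolution
        # then maps them to the model width (32 for YOLOv5s).
        self.conv = nn.Conv2d(in_channels * 4, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        # (B, 3, 608, 608) -> (B, 12, 304, 304): take every second pixel in
        # both spatial dimensions, in the four possible phases.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)   # (B, 12, 304, 304) -> (B, 32, 304, 304)
```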

(2) CSP1_X

[Figure: the CSP1_X module]

A difference from YOLOv4 is that YOLOv4 only uses the CSP structure in its backbone network, while YOLOv5 designs two CSP structures: CSP1_X is applied in the Backbone, and the other, CSP2_X, is applied in the Neck.
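
Based on the diagram (one branch with X residual units followed by a convolution, a parallel convolution branch, then concatenation, BN, LeakyReLU and a final CBL), a rough PyTorch sketch could look like the following; the helper names cbl, ResUnit and CSP1Sketch are illustrative, not the ultralytics code.

```python
import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1):
    # "CBL" block from the diagrams: Conv + BatchNorm + LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class ResUnit(nn.Module):
    # Residual unit: two CBL blocks plus a skip connection.
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(cbl(c, c, 1), cbl(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CSP1Sketch(nn.Module):
    # CSP1_X: branch 1 = CBL -> X Res Units -> Conv, branch 2 = Conv,
    # then concat -> BN -> LeakyReLU -> CBL.
    def __init__(self, c_in, c_out, x=1):
        super().__init__()
        c_mid = c_out // 2
        self.branch1 = nn.Sequential(
            cbl(c_in, c_mid, 1),
            *[ResUnit(c_mid) for _ in range(x)],
            nn.Conv2d(c_mid, c_mid, 1, bias=False),
        )
        self.branch2 = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(2 * c_mid),
            nn.LeakyReLU(0.1, inplace=True),
            cbl(2 * c_mid, c_out, 1),
        )

    def forward(self, x):
        return self.fuse(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```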


3. Neck

As shown in the figure below, YOLOv5 makes some changes on top of the FPN + PAN structure, using the other CSP structure, CSP2_X, to strengthen the network's feature fusion ability.
[Figure: FPN + PAN structure of the Neck]

(1) CSP2_X

[Figure: the CSP2_X module]
The difference between CSP2_X and CSP1_X lies in the modules between the CBL and the Conv: in CSP1_X this part consists of X residual units (Res Units), while in CSP2_X it consists of 2X CBL blocks.
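
Reusing the cbl helper from the CSP1_X sketch above, the CSP2_X variant can be sketched by simply replacing the X residual units with 2X plain CBL blocks; again, this is illustrative only, not the official implementation.

```python
class CSP2Sketch(nn.Module):
    # CSP2_X: same two-branch layout as CSP1_X, but the X Res Units are
    # replaced by 2 * X plain CBL blocks (no skip connections).
    def __init__(self, c_in, c_out, x=1):
        super().__init__()
        c_mid = c_out // 2
        self.branch1 = nn.Sequential(
            cbl(c_in, c_mid, 1),
            *[cbl(c_mid, c_mid, 3) for _ in range(2 * x)],
            nn.Conv2d(c_mid, c_mid, 1, bias=False),
        )
        self.branch2 = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(2 * c_mid),
            nn.LeakyReLU(0.1, inplace=True),
            cbl(2 * c_mid, c_out, 1),
        )

    def forward(self, x):
        return self.fuse(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```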


4. Head

In the training phase, YOLOv5 uses CIOU_Loss just like YOLOv4. In the inference phase, YOLOv4 uses DIOU_nms on top of DIOU_Loss, while YOLOv5 uses weighted NMS. My guess (untested) is that this part could simply be made the same as YOLOv4, because as mentioned in the YOLOv4 analysis, DIOU_NMS works better for overlapping objects and makes little difference in other cases.
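
To make "weighted NMS" concrete, here is a small NumPy sketch in which the box kept for each overlapping group is the confidence-weighted average of the group's boxes; the names and details are illustrative and do not reproduce the ultralytics NMS code.

```python
import numpy as np

def box_iou(box, boxes):
    # IoU of one box (x1, y1, x2, y2) against an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def weighted_nms_sketch(boxes, scores, iou_thresh=0.5):
    # Process boxes in order of decreasing confidence; each kept box is the
    # confidence-weighted average of all not-yet-used boxes that overlap it.
    order = scores.argsort()[::-1]
    boxes, scores = boxes[order], scores[order]
    used = np.zeros(len(boxes), dtype=bool)
    kept = []
    for i in range(len(boxes)):
        if used[i]:
            continue
        group = (~used) & (box_iou(boxes[i], boxes) > iou_thresh)  # includes i itself
        used |= group
        w = scores[group]
        kept.append((boxes[group] * w[:, None]).sum(0) / w.sum())
    return np.array(kept)
```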


5. Comparison of Four Networks

In the Yolov5 code, the content of the four networks is basically the same; only the two parameters depth_multiple and width_multiple at the top of each config differ. It is these two parameters (network depth and network width) that determine the differences between the four versions.
[Figure: depth_multiple and width_multiple of the four models]

(1) Depth of the four networks

The network depth here refers to the total number of layers of the network. As can be seen from the figure below, the X in CSP1_X differs among the four versions s, m, l, and x; X is the number of Res Units, and the more Res Units there are, the deeper the network.
(The pictures in this part are also quoted from blogger Jiang Dabai's blog post.)
[Figure: depth comparison of the four networks]
Taking YOLOv5s as an example, as shown in the figure below, the parameter 3 in the first BottleneckCSP layer is denoted n, and the depth parameter of YOLOv5s is gd = 0.33. The X in CSP1_X, i.e. the number of residual units, is X = round(n * gd) = round(3 * 0.33) = 1. Similarly, n in the second BottleneckCSP layer is 9, giving X = round(9 * 0.33) = 3, which corresponds to the second CSP1 in the figure above. YOLOv5m, YOLOv5l, and YOLOv5x work the same way, so they are not repeated here.
[Figure: YOLOv5s depth calculation]
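
As a rough illustration of this calculation (the actual parse_model logic in the ultralytics code is a bit more involved, so treat this as a paraphrase):

```python
def scaled_depth(n, gd):
    # n: unit count in the config (e.g. 3 or 9); gd: depth_multiple.
    # The product is rounded, with a minimum of one unit.
    return max(round(n * gd), 1)

print(scaled_depth(3, 0.33))   # YOLOv5s first CSP1_X:  round(0.99) -> 1
print(scaled_depth(9, 0.33))   # YOLOv5s second CSP1_X: round(2.97) -> 3
```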

(2) Width of the four networks

The network width refers to the number of convolution kernels in the convolutional layer, that is, the number of channels (the third dimension of the convolutional feature map), which can be understood as the thickness of the network.
[Figure: width comparison of the four networks]
Taking YOLOv5s as an example, as shown in the figure below, the numbers in the boxes indicate the number of convolution kernels. The width coefficient of YOLOv5s is gw = 0.5, and the base number of convolution kernels in Focus is 64, so the number of kernels actually used for the downsampling operation is gw * 64 = 32. Each subsequent Conv layer doubles the base count relative to the previous layer, which is consistent with the figure above. The same applies to YOLOv5m, YOLOv5l, and YOLOv5x.
[Figure: YOLOv5s width calculation]
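
A similarly rough illustration of the width scaling (as far as I recall, the real code also rounds the result to a multiple of 8; that detail is omitted here):

```python
def scaled_width(c, gw):
    # c: base kernel count in the config; gw: width_multiple.
    return int(c * gw)

print(scaled_width(64, 0.5))    # YOLOv5s Focus: 64 * 0.5 = 32 kernels
print(scaled_width(128, 0.5))   # next Conv layer (base doubled): 128 * 0.5 = 64
```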


Source: blog.csdn.net/Fredzj/article/details/125717022