[Target detection] YOLOv5 detailed explanation

1. Network Architecture

YOLOv5 shares many similarities with YOLOv4; the biggest change is in the base architecture. The official YOLOv5 code provides four versions of the detection network: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Let's first look at the network architecture of YOLOv5s.

The figure above shows the network structure of YOLOv5. It can be seen that it is still divided into four parts: input, Backbone, Neck, and Prediction.

(1) Input end: Mosaic data enhancement, adaptive anchor box calculation, adaptive image scaling
(2) Backbone: Focus structure, CSP structure
(3) Neck: FPN+PAN structure
(4) Prediction: CIOU_Loss

Below is the algorithm performance test chart of the author of Yolov5:

The author of YOLOv5 also tested on the COCO dataset. As Dabai noted in the previous article, the COCO dataset has a high proportion of small targets, so the four network structures each have their own performance trade-offs. YOLOv5s has the smallest network, the fastest speed, and the lowest AP accuracy; still, if detection mainly involves large targets and speed is the priority, it is a good choice. The other three networks deepen and widen the network on this basis, so AP accuracy keeps improving, but at an increasing cost in speed.

2. Input end

The input end of YOLOv5 uses the same Mosaic data augmentation as YOLOv4: images are stitched together with random scaling, random cropping, and random arrangement, which works particularly well for small-target detection. These were already covered in the previous YOLOv4 article.
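The stitching idea behind Mosaic can be sketched in a few lines of NumPy. This is an illustrative simplification only, not the actual YOLOv5 implementation (which also remaps bounding-box labels); the function name `mosaic4` and the flat-colored dummy images are assumptions made for this demo:

```python
import random
import numpy as np

def mosaic4(imgs, out_size=416):
    """Stitch 4 images around a random center point.
    Simplified sketch; the real version also transforms the labels."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    xc = random.randint(out_size // 4, 3 * out_size // 4)  # random center x
    yc = random.randint(out_size // 4, 3 * out_size // 4)  # random center y
    # Four quadrants around the random center tile the whole canvas
    regions = [(0, 0, xc, yc), (xc, 0, out_size, yc),
               (0, yc, xc, out_size), (xc, yc, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(imgs, regions):
        h, w = y2 - y1, x2 - x1
        ih, iw = img.shape[:2]
        top = random.randint(0, max(0, ih - h))    # random crop offset
        left = random.randint(0, max(0, iw - w))
        canvas[y1:y2, x1:x2] = img[top:top + h, left:left + w]
    return canvas

# Four flat-colored dummy images make the quadrants easy to tell apart
imgs = [np.full((416, 416, 3), 50 * i, dtype=np.uint8) for i in range(4)]
mosaic = mosaic4(imgs)
```

Because the four crops come from four different images, one stitched sample exposes the network to four contexts at once, which is part of why Mosaic helps with small targets.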

In common object detection algorithms, input images have different widths and heights, so the usual approach is to scale the original image uniformly to a standard size and then feed it into the detection network.

For example, sizes such as 416*416 and 608*608 are commonly used in the YOLO algorithms, e.g. scaling the 800*600 image below to 416*416.

However, the YOLOv5 code improves on this, and it is one of the tricks that makes YOLOv5 inference so fast. The author observed that in real projects many images have different aspect ratios, so after scaling and padding, the black borders at the two ends differ in size; excess padding introduces information redundancy and slows down inference. The letterbox function in YOLOv5's datasets.py was therefore modified to adaptively add the minimum black border to the original image.

With the black borders at the two ends of the image height reduced, the amount of computation during inference decreases, and detection speed increases. According to the discussions, this simple improvement raised inference speed by 37%, which is very effective.

Step 1: Calculate the scaling ratio

The target size is 416*416. Dividing 416 by each side of the original 800*600 image gives two scale factors, 0.52 and 0.69; the smaller one, 0.52, is chosen.

Step 2: Calculate the scaled size

Multiplying the original width and height by the minimum scale factor 0.52, the width becomes 416 and the height becomes 312.

Step 3: Calculate the black border fill value

416-312=104 is the total height that would need to be filled. Taking the remainder with np.mod in NumPy gives 8 pixels, which divided by 2 gives 4 pixels of padding for each end of the image height.

In addition, it is important to note that:

a. YOLOv5 pads with gray, i.e. (114, 114, 114).

b. The border-reduction method is not used during training; training uses the conventional padding, i.e. scaling to 416*416. The reduced-border method is only used at test time, during model inference, to improve detection speed.

c. Why is 32 used in the np.mod function? Because YOLOv5's network downsamples 5 times, and 2 to the 5th power is 32, the padded side must remain a multiple of 32. So the multiples of 32 are stripped from the padding and only the remainder is kept (312 + 8 = 320 is still divisible by 32).
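Putting steps 1-3 and note c together, the adaptive padding computation can be sketched as follows. This is a simplified illustration of the idea behind the letterbox function in datasets.py, not the original code; `letterbox_shape` is a name made up for this demo:

```python
import numpy as np

def letterbox_shape(orig_w, orig_h, target=416, stride=32):
    """Compute the adaptively padded output shape (simplified sketch)."""
    # Step 1: take the smaller of the two scale factors
    r = min(target / orig_w, target / orig_h)
    # Step 2: scaled size
    new_w, new_h = round(orig_w * r), round(orig_h * r)
    # Step 3: keep only the remainder modulo the stride (note c),
    # so the padded side stays a multiple of 32
    dw = np.mod(target - new_w, stride)
    dh = np.mod(target - new_h, stride)
    # Split the padding evenly between the two ends
    return int(new_w + dw), int(new_h + dh), float(dw) / 2, float(dh) / 2

# 800*600 example from the text: scale 0.52 -> 416*312 -> pad 8 -> 4 px per end
print(letterbox_shape(800, 600))  # (416, 320, 0.0, 4.0)
```

Note how the inference input becomes 416*320 instead of 416*416, which is where the computation savings come from.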

3. Backbone

The Focus module in YOLOv5 slices the image before it enters the backbone. Concretely, it takes every other pixel from the image, similar to nearest-neighbor downsampling. This yields four complementary images of equal size, with no information lost. In this way the W and H information is concentrated into the channel dimension: the input channels expand 4-fold, so the stitched result has 12 channels compared to the original 3-channel RGB image. A convolution is then applied to this new image, finally producing a feature map downsampled by a factor of 2 with no information loss.

Taking YOLOv5s as an example, the original 640×640×3 image enters the Focus structure; the slice operation first turns it into a 320×320×12 feature map, and then a convolution produces a 320×320×32 feature map. The slice operation is as follows:


According to the author, this structure improves inference speed.
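The slice operation itself is easy to reproduce. Below is a minimal NumPy sketch (the real Focus module operates on PyTorch tensors and is followed by the convolution, which is omitted here):

```python
import numpy as np

# Toy 3-channel "image" in (C, H, W) layout
x = np.arange(3 * 640 * 640).reshape(3, 640, 640)

# Focus slice: take every other pixel at each of the four phase offsets,
# giving four complementary half-resolution images stacked on channels
slices = [x[:, ::2, ::2],    # even rows, even cols
          x[:, 1::2, ::2],   # odd rows,  even cols
          x[:, ::2, 1::2],   # even rows, odd cols
          x[:, 1::2, 1::2]]  # odd rows,  odd cols
y = np.concatenate(slices, axis=0)

print(y.shape)  # (12, 320, 320)
```

Every pixel of `x` appears exactly once in `y`, which is why the 2x downsampling loses no information.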

A difference between YOLOv5 and YOLOv4 is that in YOLOv4 only the backbone network uses the CSP structure.

In YOLOv5, two CSP structures are designed. Taking the YOLOv5s network as an example, the CSP1_X structure is applied in the Backbone, and the other, CSP2_X, is applied in the Neck.

4. Neck and output

YOLOv5's current Neck, like YOLOv4's, uses the FPN+PAN structure.

In Yolov5, CIOU_Loss is used as the loss function of Bounding box.

CIOU_Loss is also used in Yolov4 as the loss of the target Bounding box.
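For reference, CIoU extends plain IoU with a center-distance penalty and an aspect-ratio consistency term. Below is a single-pair sketch in plain Python following the published CIoU formula; it is an illustration, not YOLOv5's vectorized implementation:

```python
import math

def ciou_loss(box1, box2, eps=1e-7):
    """CIoU loss between two boxes in (x1, y1, x2, y2) format."""
    # Intersection over union
    iw = max(0.0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    ih = max(0.0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = iw * ih
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared diagonal of the smallest enclosing box
    cw = max(box1[2], box2[2]) - min(box1[0], box2[0])
    ch = max(box1[3], box2[3]) - min(box1[1], box2[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Squared distance between box centers
    rho2 = ((box1[0] + box1[2]) - (box2[0] + box2[2])) ** 2 / 4 + \
           ((box1[1] + box1[3]) - (box2[1] + box2[3])) ** 2 / 4
    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(w2 / (h2 + eps))
                              - math.atan(w1 / (h1 + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU loss, the center-distance term still provides a gradient when the predicted and ground-truth boxes do not overlap at all.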

Reference: A complete explanation of the core basic knowledge of Yolov5 in the Yolo series – Zhihu


Origin blog.csdn.net/qq_38375203/article/details/125541460