Introduction to YOLOv5


1. Input

1. Mosaic data augmentation:


CutMix data augmentation: randomly generate a crop box, cut out the corresponding region of image A, and fill it with the patch at the corresponding position of image B to form a new sample. The classification label is mixed in proportion to the area contributed by each source image, and the loss is computed as the weighted sum of the losses for the two labels.
Mosaic data augmentation: stitch four images together, each with its own ground-truth boxes. After stitching, a new image is obtained along with the boxes belonging to it, and this new image is passed into the network for learning, which is equivalent to learning from four images at once.

The paper notes that this greatly enriches the backgrounds of the detected objects. Moreover, during batch normalization, the statistics of four images are computed at once. Mosaic is mainly helpful for small-object detection.
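A minimal sketch of the mosaic idea (a hypothetical `mosaic4` helper, not the ultralytics implementation): four images are pasted around a random center on a double-size canvas and their boxes are shifted accordingly.

```python
import random
import numpy as np

def mosaic4(images, boxes_list, size=640):
    """Stitch 4 images (HWC, uint8) into one (2*size, 2*size) mosaic.

    boxes_list: per-image arrays of [x1, y1, x2, y2] pixel coordinates.
    Returns the mosaic image and the shifted boxes.
    """
    canvas = np.full((size * 2, size * 2, 3), 114, dtype=np.uint8)
    # Random mosaic center inside the middle half of the canvas.
    xc = random.randint(size // 2, size * 3 // 2)
    yc = random.randint(size // 2, size * 3 // 2)
    out_boxes = []
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]  # quadrant positions
    for (img, boxes), (ix, iy) in zip(zip(images, boxes_list), corners):
        h, w = img.shape[:2]
        # Top-left corner where this image is pasted (clipped to the canvas).
        x0 = max(xc - w, 0) if ix == 0 else xc
        y0 = max(yc - h, 0) if iy == 0 else yc
        x1, y1 = min(x0 + w, size * 2), min(y0 + h, size * 2)
        canvas[y0:y1, x0:x1] = img[: y1 - y0, : x1 - x0]
        if len(boxes):
            shifted = boxes.astype(np.float64).copy()
            shifted[:, [0, 2]] = np.clip(shifted[:, [0, 2]] + x0, 0, size * 2)
            shifted[:, [1, 3]] = np.clip(shifted[:, [1, 3]] + y0, 0, size * 2)
            out_boxes.append(shifted)
    return canvas, np.concatenate(out_boxes) if out_boxes else np.zeros((0, 4))
```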

2. Adaptive anchor box calculation:

Sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the region boundaries to predict the ground-truth bounding box more accurately. Different models may use different region-sampling methods. Anchor boxes: multiple bounding boxes with different scales and aspect ratios are generated, centered on each pixel. In addition, before training YOLOv5 adaptively computes the best anchor sizes for the dataset by clustering the ground-truth boxes.
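For illustration, a minimal sketch of the generic per-pixel anchor sampling described above (YOLOv5 itself uses a fixed set of 9 anchors refined before training):

```python
import itertools
import torch

def make_anchors(fmap_h, fmap_w, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x1, y1, x2, y2) centered on every feature-map cell."""
    anchors = []
    for y, x in itertools.product(range(fmap_h), range(fmap_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center in pixels
        for s, r in itertools.product(sizes, ratios):
            w, h = s * r ** 0.5, s / r ** 0.5  # same area, different aspect ratio
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

boxes = make_anchors(20, 20, stride=32)  # 20 * 20 * 6 = 2400 anchors
```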

3. Adaptive image scaling:


1. Principle: when input images are uniformly scaled to the same size before entering the network, the detection effect is better (training images do not pass through the letterbox; the letterbox is used at detection time). Simply resizing is likely to distort and lose image information, so the letterbox adaptive image scaling technique was proposed.
The main idea of letterbox is to make full use of the network's receptive field. For example, the maximum stride of YOLOv5's last layer is 32, i.e. each point in the last feature map corresponds to a 32×32 region of the original image.
2. Idea: scale width and height by the same ratio, preserving the aspect ratio. To use the receptive field effectively, the side that is not divisible by the stride after scaling is padded with gray bars until it is divisible by 32.
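A minimal letterbox sketch (assuming OpenCV and the gray value 114 commonly used in YOLOv5):

```python
import cv2

def letterbox(img, new_size=640, stride=32, color=(114, 114, 114)):
    """Resize keeping aspect ratio, then pad so both sides are multiples of stride."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)          # single scale ratio for both sides
    new_w, new_h = round(w * r), round(h * r)
    img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    # Minimal padding: only up to the next multiple of the stride, split evenly.
    pad_w = (stride - new_w % stride) % stride
    pad_h = (stride - new_h % stride) % stride
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    img = cv2.copyMakeBorder(img, top, bottom, left, right,
                             cv2.BORDER_CONSTANT, value=color)
    return img, r, (left, top)
```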

2. Backbone


1. Focus layer (now deprecated):

After obtaining the input, the Focus layer first converts width and height information of the image into channel information. Concretely, it takes every second pixel of the image, obtaining four independent feature maps, and then stacks them along the channel dimension. The purpose is to reduce computation and increase running speed.
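A minimal PyTorch sketch of the Focus slicing, which turns an (N, C, H, W) input into (N, 4C, H/2, W/2) before a convolution:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice every second pixel into 4 maps, stack on channels, then convolve."""
    def __init__(self, c_in=3, c_out=64, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in * 4, c_out, k, stride=1, padding=k // 2)

    def forward(self, x):
        # (N, C, H, W) -> (N, 4C, H/2, W/2): four interleaved sub-images.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

y = Focus()(torch.randn(1, 3, 640, 640))  # -> (1, 64, 320, 320)
```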

In the latest versions of YOLOv5, the Focus layer has been replaced by an ordinary 6×6 convolution with stride 2 and 64 output channels, which is equivalent but runs faster on current hardware.


2. Conv2D_BN_SiLU convolution block:

Among them, the SiLU activation function is an improved version combining sigmoid and ReLU; it is unbounded above, bounded below, smooth, and non-monotonic.

$f(x) = x \times \mathrm{sigmoid}(x)$

Some versions also use the LeakyReLU activation function.
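A quick sanity check of the SiLU formula above (PyTorch ships it as `nn.SiLU` / `F.silu`):

```python
import torch

x = torch.linspace(-6, 6, 5)
silu = x * torch.sigmoid(x)  # f(x) = x * sigmoid(x)
assert torch.allclose(silu, torch.nn.functional.silu(x))
```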

3. Bottleneck:

The building block and the bottleneck were both proposed in ResNet: the building block is used in ResNet-34 and the bottleneck in ResNet-50. In the bottleneck, the 1×1 convolutions reduce and then restore the number of channels, which cuts the parameter count. Experiments showed that the bottleneck reduces parameters and computation while maintaining the original accuracy. If the network is shallow, the building block is used; if the network is very deep, the bottleneck is chosen to reduce computation.

The shortcut uses add rather than concat: the feature maps are added element-wise, keeping the number of channels unchanged.
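A minimal PyTorch sketch of a YOLOv5-style bottleneck (1×1 conv to halve channels, 3×3 conv to restore, plus the additive shortcut); `ConvBNSiLU` is the CBS block from section 2:

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv2d + BatchNorm + SiLU, the basic CBS block."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """1x1 conv to shrink channels, 3x3 conv to restore, plus residual add."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c // 2, 1)
        self.cv2 = ConvBNSiLU(c // 2, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y  # add keeps the channel count unchanged
```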

4. CSPLayer:

The CSPNet structure of YOLOv5 splits the original input into two branches, and each branch performs a convolution that halves the number of channels. One branch then performs the Bottleneck × N operation, after which the two branches are concatenated, stacking the channels back up. Abstractly, the CSPLayer can be seen as having a large "residual edge". This makes the number of output channels of the CSPLayer match the number of input channels, and the purpose is to let the model learn richer features.

The CSPLayer structure of the Backbone is:

(CBL is Conv+BN+leakyReLU, CBS is Conv+BN+SiLU)
It splits the input into two branches. One branch first passes through a CBL block (later changed to CBS) and then through multiple residual structures (Bottleneck × N), followed by a convolution that adjusts the number of channels. The other branch directly performs a convolution to adjust the number of channels. The two branches are then stacked along the channel dimension and finally pass through another CBL (later CBS).

The CSPLayer structure of the Neck is:
The CSPLayer of the Neck replaces the intermediate residual structures with 2 × X CBL blocks (CBS in later versions), i.e. the bottlenecks without the shortcut add. The main reason is that the Neck is relatively shallow.
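A minimal sketch of the CSPLayer (called C3 in the YOLOv5 code), reusing `ConvBNSiLU` and `Bottleneck` from the sketch above; `shortcut=False` gives the Neck variant without residual adds:

```python
class CSPLayer(nn.Module):
    """Two branches: conv + N bottlenecks vs. a bare conv, concatenated."""
    def __init__(self, c_in, c_out, n=1, shortcut=True):
        super().__init__()
        c_mid = c_out // 2                       # each branch carries half
        self.cv1 = ConvBNSiLU(c_in, c_mid, 1)    # branch with the bottlenecks
        self.cv2 = ConvBNSiLU(c_in, c_mid, 1)    # plain "residual edge" branch
        self.cv3 = ConvBNSiLU(2 * c_mid, c_out, 1)
        self.m = nn.Sequential(*(Bottleneck(c_mid, shortcut) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))
```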

5. SPPBottleneck:

Traditional SPP (spatial pyramid pooling) can convert feature maps of any size into fixed-size feature vectors. In YOLOv5, however, the main function of SPPBottleneck is to extract features through max pooling with different kernel sizes, enlarging the receptive field of the network. The shortcut here is a concat, so each pooling must keep the spatial size of the feature map unchanged (stride 1 with appropriate padding).
SPPF was proposed in later versions of YOLOv5; it changes the parallel max-pooling layers into serial max-pooling layers. Although the structure is modified, the purpose and the result are exactly the same.
Two serial 5×5 poolings are equivalent to one 9×9 pooling, and three serial 5×5 poolings are equivalent to one 13×13 pooling. The serial form therefore gives the same result as the parallel form, but is more efficient.
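A minimal SPPF sketch reusing `ConvBNSiLU`; three chained 5×5 max-pools reproduce the 5/9/13 parallel kernels of SPP:

```python
class SPPF(nn.Module):
    """Serial 5x5 max-pools; concatenating the stages matches SPP(5, 9, 13)."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_mid, 1)
        self.cv2 = ConvBNSiLU(c_mid * 4, c_out, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)  # size-preserving

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # receptive field 5
        y2 = self.pool(y1)   # two 5x5 == one 9x9 pool
        y3 = self.pool(y2)   # three 5x5 == one 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```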

3. Neck layer

YOLOv5 extracts three feature layers for object detection, taken from the middle, lower-middle, and bottom of the backbone. For a 640×640 input, the shapes of the three features are (80, 80, 256), (40, 40, 512), and (20, 20, 1024).

The feature pyramid fuses feature layers of different shapes, which helps extract better features. Upsampling is done by interpolation, and downsampling by strided convolution.

FPN (semantic information) + PAN (positioning information) :

It is generally believed that deep feature layers carry strong semantic information but weak positioning information, while shallow feature layers carry strong positioning information but weak semantic information. FPN transfers deep semantic features to the shallow layers, enhancing semantic expression at multiple scales; PAN transfers shallow positioning information to the deep layers, enhancing localization at multiple scales.

FPN is top-down: high-level semantic features are passed downward to enhance the whole pyramid, but only the semantic information is enhanced, not the positioning information. Adding a bottom-up PAN after the FPN supplements the positioning information, passing the shallow location information up to the top.
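A minimal sketch of the FPN + PAN data flow (shapes only; the 1×1 convs and CSPLayers that YOLOv5 inserts at each step are omitted, and a max-pool stands in for the stride-2 downsampling conv):

```python
import torch
import torch.nn.functional as F

def fpn_pan(p3, p4, p5):
    """p3: (N,256,80,80), p4: (N,512,40,40), p5: (N,1024,20,20).

    Returns the three fused maps fed to the YOLO heads; only the data flow
    is shown, so channel counts simply grow at each concatenation.
    """
    # FPN, top-down: upsample deep semantics and merge into shallower maps.
    t4 = torch.cat([F.interpolate(p5, scale_factor=2, mode="nearest"), p4], 1)
    t3 = torch.cat([F.interpolate(t4, scale_factor=2, mode="nearest"), p3], 1)
    # PAN, bottom-up: downsample shallow localization and merge back down.
    d4 = torch.cat([F.max_pool2d(t3, 2), t4], 1)
    d5 = torch.cat([F.max_pool2d(d4, 2), p5], 1)
    return t3, d4, d5
```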

4. YOLO Head

Through the feature pyramid, three enhanced features are obtained, with shapes (20, 20, 1024), (40, 40, 512), and (80, 80, 256). These three enhanced features are then passed into the YOLO Head to obtain the prediction results.

For each feature layer, a convolution is first used to adjust the number of channels, and the final number of channels is related to the number of categories in the dataset. Among them, there are 3 anchor boxes for each feature point of each feature layer.

Assuming the VOC dataset is used, there are 20 classes, and the final number of channels is 75 = 3 × (4 + 1 + 20).

Among them: 3 represents the three anchor boxes; 4 represents the regression parameters of each anchor box; 1 indicates whether the feature point contains an object (i.e. whether it is background); and 20 is used to judge the class of the object.

Assuming the COCO dataset is used (80 classes), the final number of channels is 255 = 3 × (4 + 1 + 80).
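A one-line sketch of the head's final convolution for one feature level (1024 input channels assumed for the 20×20 level):

```python
import torch.nn as nn

num_classes = 80  # COCO
na = 3            # anchor boxes per feature point
head = nn.Conv2d(1024, na * (4 + 1 + num_classes), kernel_size=1)  # -> 255 ch
```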

5. Prediction

1. Decoding

Assuming the COCO dataset, the prediction results of the three feature layers have shapes (N, 20, 20, 255), (N, 40, 40, 255), and (N, 80, 80, 255). The 255 channels are then split into three groups of 85, corresponding to the 85 parameters of the three anchor boxes.

The (N, 20, 20, 255) feature layer is reshaped into (N, 20, 20, 3, 85), and then the three anchor boxes corresponding to each feature point are decoded:

where $p_w$, $p_h$ are the width and height of the anchor box; $c_x$, $c_y$ are the offsets of the grid cell containing the prediction from the top-left corner of the image; $(t_x, t_y, t_w, t_h)$ are the predicted offsets; and $\sigma$ is the sigmoid function.

The anchor box regression of YOLOv2/3/4:

$b_x = \sigma(t_x) + c_x,\quad b_y = \sigma(t_y) + c_y,\quad b_w = p_w e^{t_w},\quad b_h = p_h e^{t_h}$

The anchor box regression of YOLOv5:

$b_x = 2\sigma(t_x) - 0.5 + c_x,\quad b_y = 2\sigma(t_y) - 0.5 + c_y,\quad b_w = p_w (2\sigma(t_w))^2,\quad b_h = p_h (2\sigma(t_h))^2$

  1. Width and height calculation: the author considers the original YOLO/Darknet box equations to have a serious flaw. Although width and height are always > 0, they are unbounded, since they are computed as $b_w = p_w e^{t_w}$; such an exponential is dangerous, easily causes gradient instability, and increases training difficulty. Using $(2\sigma(t_w))^2$ not only guarantees width and height > 0 but also bounds them, with a maximum of 4 times the anchor's width and height.
  2. Offset calculation: because YOLOv5 defines positive samples differently from the earlier YOLO series, this formula also differs. Positive and negative samples are introduced under "Training".
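A minimal decode sketch for one feature level (PyTorch), assuming the 20×20 level with stride 32; `anchors` holds the per-anchor (w, h) in pixels:

```python
import torch

def decode(pred, anchors, stride=32):
    """pred: (N, H, W, 3, 85) raw head output. Returns boxes in pixels."""
    n, h, w, na, _ = pred.shape
    xy, wh, conf = pred[..., :2], pred[..., 2:4], pred[..., 4:]
    # Grid of cell offsets c_x, c_y.
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([gx, gy], -1).view(1, h, w, 1, 2).float()
    # b_xy = (2 * sigmoid(t_xy) - 0.5 + c_xy) * stride
    bxy = (2 * xy.sigmoid() - 0.5 + grid) * stride
    # b_wh = p_wh * (2 * sigmoid(t_wh))^2, bounded to 4x the anchor size
    bwh = anchors.view(1, 1, 1, na, 2) * (2 * wh.sigmoid()) ** 2
    return torch.cat([bxy, bwh, conf.sigmoid()], -1)

anchors = torch.tensor([[116., 90.], [156., 198.], [373., 326.]])
out = decode(torch.randn(1, 20, 20, 3, 85), anchors)
```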

2. Score screening and non-maximum suppression:

The process of score screening and non-maximum suppression can be summarized as follows (a code sketch follows the list):

  1. Find the boxes in the image whose score exceeds the threshold. Screening by score before screening overlapping boxes greatly reduces the number of boxes.
  2. Loop over the classes. Non-maximum suppression keeps only the highest-scoring box of a given class within a local region; looping over the classes lets us perform non-maximum suppression for each class separately.
  3. Sort the boxes of the class by score, from largest to smallest.
  4. Each time, take out the box with the highest score, compute its overlap with all other predicted boxes, and eliminate those that overlap too much.
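A minimal per-class NMS sketch in NumPy, assuming boxes as (x1, y1, x2, y2) with one score and one class id each:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a1 = (box[2] - box[0]) * (box[3] - box[1])
    a2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a1 + a2 - inter + 1e-9)

def nms(boxes, scores, classes, score_thr=0.25, iou_thr=0.45):
    keep = []
    mask = scores > score_thr                  # step 1: score screening
    boxes, scores, classes = boxes[mask], scores[mask], classes[mask]
    for c in np.unique(classes):               # step 2: loop over classes
        idx = np.where(classes == c)[0]
        idx = idx[np.argsort(-scores[idx])]    # step 3: sort by score
        while idx.size:                        # step 4: suppress overlaps
            best, rest = idx[0], idx[1:]
            keep.append(best)
            idx = rest[iou(boxes[best], boxes[rest]) <= iou_thr]
    return keep
```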

6. Training

1. Composition of LOSS

The loss of the network mirrors the composition of its predictions and is divided into: the Reg part (regression parameters of the feature points), the Obj part (whether a feature point contains an object), and the Cls part (the class of the feature point's object).

2. Positive sample matching:

YOLOv5 moves from IoU matching to shape matching. First, the width and height ratios between the gt box and the 9 anchor boxes are computed; if the ratio is below a set threshold, the gt matches the corresponding anchor box, so one gt may match several anchors. As in earlier YOLO versions, YOLOv5 has three output layers and 9 anchor boxes ordered from small to large, with every 3 anchors assigned to one layer, so one gt may be trained on different layers, which greatly increases the number of positive samples. Of course, a gt may also fail to match any anchor box, in which case it is treated as background and does not participate in training.
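A minimal sketch of the shape-matching rule (the threshold 4 is YOLOv5's default `anchor_t` hyperparameter; the anchors are the 9 COCO defaults):

```python
import torch

def match_anchors(gt_wh, anchors, thr=4.0):
    """gt_wh: (M, 2) box sizes; anchors: (9, 2). Returns (M, 9) bool matches."""
    r = gt_wh[:, None, :] / anchors[None, :, :]   # per-dimension size ratio
    # A gt matches an anchor if neither w nor h differs by more than thr times.
    return torch.max(r, 1 / r).max(dim=2).values < thr

anchors = torch.tensor([[10., 13.], [16., 30.], [33., 23.],
                        [30., 61.], [62., 45.], [59., 119.],
                        [116., 90.], [156., 198.], [373., 326.]])
print(match_anchors(torch.tensor([[50., 80.]]), anchors))
```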

3. Grid matching and filtering:

After a gt box is matched with an anchor box, the grid of the corresponding network layer is obtained. Looking at which grid cell the gt center point falls in, not only the matching anchors in that cell are taken as positive samples, but the matching anchors in the two nearest adjacent cells are taken as positives too (see the sketch below). Meanwhile, since a gt may match several anchor boxes, possibly on different network layers, one gt can yield 3 to 9 positive samples, further increasing the number of positive samples.
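A minimal sketch of picking the two extra neighbor cells from the gt center's position inside its cell (the 0.5-cell bias matches YOLOv5's default):

```python
def extra_cells(cx, cy, bias=0.5):
    """cx, cy: gt center in grid units. Returns the 3 cells used as positives."""
    i, j = int(cx), int(cy)
    cells = [(i, j)]
    cells.append((i - 1, j) if cx % 1 < bias else (i + 1, j))  # left or right
    cells.append((i, j - 1) if cy % 1 < bias else (i, j + 1))  # up or down
    return cells

print(extra_cells(3.2, 5.8))  # [(3, 5), (2, 5), (3, 6)]
```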

4. LOSS calculation:

1. DIoU Loss: a good bounding box loss should consider three important geometric factors: overlap area, center point distance, and aspect ratio. DIoU Loss minimizes the normalized distance between the centers of the predicted box and the target box:

$DIoU\_Loss = 1 - \left(IOU - \frac{Distance\_2^2}{Distance\_C^2}\right)$

where $Distance\_2$ is the distance between the centers of the two boxes and $Distance\_C$ is the diagonal length of the smallest enclosing box. Because DIoU considers both the overlap area and the center distance, it can directly measure the distance between the two boxes even when the target box encloses the predicted box, so DIoU Loss converges faster.

Problem: the aspect ratio is not considered. For example, when the target box encloses predicted boxes whose center points coincide but whose aspect ratios differ, DIoU Loss should tell them apart, yet by its formula all such cases give the same value.

2. CIoU Loss: CIoU Loss adds an influence factor on the basis of DIoU Loss, taking into account the aspect ratios of the predicted box and the target box.
$CIoU\_Loss = 1 - CIoU = 1 - \left(IOU - \frac{Distance\_2^2}{Distance\_C^2} - \frac{v^2}{(1 - IOU) + v}\right)$
where $v$ is a parameter measuring the consistency of the aspect ratios, defined as:
$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w^p}{h^p}\right)^2$
In this way, CIoU Loss takes into account the three important geometric factors of bounding box regression: overlap area, center point distance, and aspect ratio.
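A minimal CIoU loss sketch (PyTorch) following the formulas above, with boxes as (x1, y1, x2, y2):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-9):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns (N,) CIoU loss."""
    # Intersection and union for plain IoU.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (wp * hp + wg * hg - inter + eps)
    # Squared center distance (Distance_2^2) over the squared diagonal of the
    # smallest enclosing box (Distance_C^2).
    d2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
          (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and its weight alpha = v / ((1 - IoU) + v).
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - (iou - d2 / c2 - alpha * v)
```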

3. For the Obj part, the anchor boxes corresponding to each ground-truth box are known from the positive sample matching described above. All anchor boxes matched to ground-truth boxes are positive samples, and the remaining anchor boxes are negative samples. The cross-entropy loss between these labels and the predicted objectness (whether a feature point contains an object) is computed as the Obj component of the loss.

4. For the Cls part, the anchor box corresponding to each ground-truth box is known from part 3. After obtaining each box's matched anchor, its class prediction is taken, and the cross-entropy loss between the ground-truth class and the predicted class is computed as the Cls component of the loss.


Origin blog.csdn.net/qq_44733706/article/details/129092559