YOLOv5 Series 1 --- YOLO development history and a detailed explanation of the YOLOv5 model

Recently, I have been doing detection-related work. A few years ago I analyzed the code and papers of Faster R-CNN; now it is time to sort out YOLOv5, the latest and fastest model in the YOLO family. This series starts from the development of YOLO, moves on to the loss function and the concept of mAP, and finally shows, at the code level, how to train on your own customized data set. Okay, let's get started~

1. Development history of YOLO (You Only Look Once)

This part is mainly adapted from the Zhihu articles by Technology Beast (科技猛兽). I only give a simplified account here; please read those articles for the details [1-3].

1.1 YOLO v0

The idea of YOLO v0 originated from extending the basic CNN idea from classification tasks to detection. So first, let's look at the difference between detection and classification tasks:

  • Detection: the output of the network should be the coordinates of a bounding box (rectangular box), represented by at least 4 numbers; the two common formats are (x1, y1, x2, y2) and (x1, y1, w, h). A rectangle is used because, compared with circles and other polygons, its geometric properties let it enclose the target object with minimal extra redundancy.


  • Classification: take the most basic single-label classification task as an example, where an object belongs to exactly one category. The input is an image and the output is its category. The input image is generally represented by a tensor of shape [N, C, H, W], and the output is generally a one-hot vector such as [0, 0, 0, 1]: the dimension that equals 1 indicates which category the image belongs to.

Okay, now that we know the basic difference between classification and detection, we can regard the detection task as an exhaustive "classify everywhere" task, i.e. multi-target detection. (If it is known that an image contains at most one target, then the network design and output are similar to those of a classification task, except for the activation of the last fully connected layer.)

So how do we traverse? The approach of the R-CNN family is to traverse positions of the image with sliding windows and classify each box.

However, a problem follows: an incomplete traversal hurts accuracy. The finer the traversal, the more accurate the detector, but also the higher the cost, because boxes of many different scales have to be slid over the whole image, which is very expensive.

For example, if the input image size is (320, 320), there are 320 × 320 = 102,400 positions. The smallest window is (1, 1) and the largest is (320, 320), so the number of possible windows to traverse is essentially unbounded. Let's look at the pseudocode:
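(The original pseudocode was shown as an image; below is a minimal Python sketch of the sliding-window idea. The classifier here is a hypothetical callable that returns a foreground probability, and the window sizes, stride and 64 × 64 input size are arbitrary placeholders.)

import itertools
import cv2

def sliding_window_detect(image, classifier, window_sizes, stride=8, score_thresh=0.5):
    """Brute-force detection: classify the content of every window (illustration only)."""
    H, W = image.shape[:2]
    detections = []
    for wh, ww in window_sizes:                     # e.g. [(64, 64), (128, 128), (200, 200)]
        for y, x in itertools.product(range(0, H - wh + 1, stride), range(0, W - ww + 1, stride)):
            crop = image[y:y + wh, x:x + ww]
            crop = cv2.resize(crop, (64, 64))       # every crop is resized to the classifier's fixed input
            score = classifier(crop)                # binary classifier: foreground / background probability
            if score > score_thresh:
                detections.append((x, y, ww, wh, score))
    return detections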

That is, essentially, we are training a binary classifier. The input of this binary classifier is the content of a box, and the output is (foreground/background).

And this brings 2 more questions:

  • ① Boxes have different sizes. Are boxes of different sizes fed into the same binary classifier?

We need to deal with this. The usual way is to resize every box to the same fixed size before feeding it to the binary classifier, which obviously causes problems. For example, if the classifier's fixed input is 64 × 64, then after sliding the windows some boxes are 200 × 200 and some are 10 × 10, and all of them have to be resized to 64 × 64 before they can be fed into the classifier.

  • ② There are many background boxes and few foreground boxes: the binary-classification samples are imbalanced.

This sliding window classification method will be very slow, and the class imbalance problem is serious.

So far we have built a detector out of a classification algorithm, and it has all kinds of problems. Now it is time to optimize it (and here we officially enter the YOLO series of methods):

The author of YOLO thought this way at the time: the classifier's last fully connected layer outputs a one-hot vector; what if we replace it with (x, y, w, h, c), where c is the confidence? Wouldn't it be better to turn the problem into a regression problem and directly regress the position of the bounding box?

Okay, now the model is:
(figure: a network that directly regresses (c, x, y, w, h))
So how do we organize training? Label the data yourself and set the label to (1, x*, y*, w*, h*), where * denotes the ground truth (the actual label). With data and labels, training can proceed.

You can see that this is much simpler than the sliding-window classification approach above. This version of the idea is called YOLO v0 here, because it is the simplest version of You Only Look Once.

1.2 YOLO v1

YOLO v1 solves several problems based on YOLO v0:

  • ① YOLO v0 can only perform single target detection and is in urgent need of expansion.

The solution of YOLO v1 is to make each (c, x, y, w, h) responsible for the target in a certain sub-region of the input image.
That is, whereas YOLO v0 produces only one (c, x, y, w, h) per image, YOLO v1 produces n of them. So, out of these n predictions, how do we get the boxes we actually want (say, the faces of Rick and Morty)? Here the author of YOLO v1 uses NMS (non-maximum suppression) to filter the bboxes. The specific algorithm is described in [1]; a sketch is given below.
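(The original article shows the NMS procedure as a figure; the following is a minimal NumPy sketch of greedy NMS for a single class, assuming boxes in (x1, y1, x2, y2) format.)

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # indices sorted by score, high to low
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # discard boxes that overlap the kept box too much
    return keep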

NMS automatically solves the problem of not knowing in advance how many targets there are in the image.

  • ② YOLO v0 can only perform single-category detection and is in urgent need of expansion.

A general detection task requires detecting a lot of content. For example, it is necessary to detect both the faces of Rick and Morty and the telescope. So what should we do?

Taking the 2 categories face and telescope as an example, we change what the network predicts from N × (c, x, y, w, h) to N × (c, x, y, w, h, one-hot). With 2 categories, the one-hot vector is [0, 1] or [1, 0] (figure omitted).

  • ③ Small target detection

Small targets are always poorly detected, so YOLO v1 specifically designs neurons to fit small targets.
In the actual code, YOLO v1 uses two five-tuples (c, x, y, w, h) for each region: one is responsible for regressing the large target, and the other for the small target. Each is also given a one-hot vector, [0, 1] or [1, 0], to indicate which category it belongs to (face or telescope).

Compared with v0, the core of YOLO v1 is solving these three problems. Its architecture is shown below (the grid indicates how the image is divided into regions; the example above uses 4 × 4, but YOLO v1 actually uses 7 × 7, with 20 categories, so the output tensor is 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30):

(figure: YOLO v1 architecture diagram)

1.3 YOLO v2

Although YOLO v1 is much faster than the R-CNN family of detectors, it still has problems: the predicted boxes are inaccurate, and many targets are missed.

  • ① The box predicted by YOLO v1 is inaccurate

The reason is mainly that YOLO v1 directly predicts the bbox values (x, y, w, h), whose range is very large, so the predictions tend to be inaccurate. Think about what we normally do in CV tasks: we normalize images, scaling common 8-bit values (0-255) to [0, 1] or [-1, 1].

YOLO v1's strategy of directly predicting the position makes the network unstable early in training; predicting offsets instead makes training more stable and improves the performance metric by about 5%.

So YOLO v2 borrows this idea, together with ideas from the R-CNN family, and proposes not to predict the bbox coordinates directly, but to predict grid-based offsets and anchor-based offsets instead.

The author calls this location prediction .

  • The grid-based offset means that the position of the grid cell is fixed (the grid cells are the green squares in the Rick and Morty example above), and offset = target position − grid position.

  • The anchor-based offset means that the anchor is fixed (the anchor is obtained by clustering the GT boxes of the whole data set when the data set is built), and offset = target position − anchor position.

The diagram about location prediction comes from the Zhihu article of Tech Beast [2],

(figure: location-prediction example with a 3 × 3 grid, GT in red, anchor in purple)
As shown in the figure, suppose the image is divided into 9 grid cells; the GT (ground truth) is the red box and the anchor is the purple box (this anchor is computed from the GTs of the data set: it is the anchor with the largest IoU with the target GT). The numbers in the figure are the real coordinates in the image.

The values YOLO v2 predicts change from (x, y, w, h), with x, y, w, h ∈ [0, 447], to $t_x, t_y, t_w, t_h$, as shown below; the range of these values is much smaller, which greatly helps the detection network converge.

(figure: the GT box and the grid cell that contains its centre)
It can be seen that the centre of the ground-truth red bbox, measured from the top-left corner (1, 1) of the grid cell containing it, lies at (1.543, 1.463), i.e. offsets of (0.543, 0.463) inside the cell. The calculation formulas are:

$t_x = \log\frac{bbox_x - c_x}{1 - (bbox_x - c_x)}$
$t_y = \log\frac{bbox_y - c_y}{1 - (bbox_y - c_y)}$
$t_w = \log(gt_w / p_w)$
$t_h = \log(gt_h / p_h)$

The figure in [2] that illustrates this improvement of YOLO v2 defines the parameters as follows: $c_x, c_y$ are the coordinates of the top-left corner of the grid cell, $p_w, p_h$ are the width and height of the anchor (prior), and $\sigma$ is the sigmoid function. The predicted offsets are decoded into a box by
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
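(As a concrete illustration of these formulas — not code from the YOLO v2 repository — here is a small NumPy sketch that decodes one predicted offset tuple into a box in grid units; the example numbers are made up.)

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_yolov2(t, cell_xy, anchor_wh):
    """Decode (t_x, t_y, t_w, t_h) into (b_x, b_y, b_w, b_h) in grid units."""
    tx, ty, tw, th = t
    cx, cy = cell_xy          # top-left corner of the grid cell
    pw, ph = anchor_wh        # anchor (prior) width / height in grid units
    bx = sigmoid(tx) + cx     # sigmoid keeps the centre offset inside the cell, in (0, 1)
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)      # width / height rescale the anchor
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Example: a cell whose top-left corner is (1, 1), with a 3.2 x 2.4 anchor
print(decode_yolov2((0.17, -0.15, 0.1, -0.2), (1, 1), (3.2, 2.4)))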

  • ② YOLO v1 will miss many targets, that is, the missed detection phenomenon is obvious.
    This is because in multi-target and multi-category detection, the size and aspect ratio of the target objects are different. For example, pedestrians are long and narrow boxes, while cars are square boxes.
    Based on this, the author of YOLO v2 prepared several bounding boxes with relatively high probability of occurrence in the data set in advance, and then used them as a basis for prediction. This is the original intention of Anchor.
    The specific method: YOLO v2 divides the image into 13 × 13 regions, each region has 5 anchors, and each anchor carries its own class prediction. Taking 2 categories as an example, the last dimension of the tensor predicted by the network is 35, computed as
    35 = 5 anchors × (5 values (c, t_x, t_y, t_w, t_h) + 2 classes)

    • How are the 5 anchors in each area obtained?
      As shown in the figure below, for any data set, e.g. COCO (purple anchors), first cluster the GT (ground truth) bounding boxes of the training set. Into how many clusters? The author's experiments showed that 5 clusters give a good trade-off between recall and complexity, so 5 clusters are used. Of course, for complex tasks, more clusters give a somewhat higher mAP and more comprehensive priors, but the complexity grows a lot while the accuracy gain is small, so the compromise of 5 clusters, i.e. 5 prior boxes, was chosen.
      (figure: clustering the COCO GT boxes into 5 anchors)
    • Note: the anchors of YOLO v2 are obtained from statistics of the data set (whereas the widths, heights and scales of the anchors in Faster R-CNN are chosen manually).
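    (A minimal sketch of how such anchors can be obtained with k-means using a 1 − IoU distance between box sizes and cluster centres, in the spirit of the YOLO v2 paper; the function names and rounding details here are my own simplification.)

import numpy as np

def wh_iou(wh, centers):
    """IoU between GT sizes and cluster centres, using only (w, h) with boxes aligned at a corner."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    """Cluster GT (w, h) pairs into k anchors using a 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centers), axis=1)          # nearest centre = highest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = np.median(wh[assign == j], axis=0)  # median is more robust than mean
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]    # sort anchors by area

# wh = np.array([[w, h], ...], dtype=float)   # GT box sizes collected from the whole training set
# anchors = kmeans_anchors(wh, k=5)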

1.4 YOLO v3

At this point, the basic idea of YOLO is settled, but up to YOLO v2 small-target detection is still not good enough (the Darknet-19 backbone has no residual connections and is not strong enough at feature extraction~).

With YOLO v3, the main improvements are the addition of multi-scale prediction and changing the backbone of YOLO v2 from the 19-layer Darknet-19 to the 53-layer Darknet-53 [4].

  • ① Multi-scale prediction [2]
    The YOLO v3 detection head is bifurcated into 3 branches, and each scale has 3 anchors:

YOLO v3, in total uses 9 anchor boxes. Three for each scale (3 large, 3 medium, 3 small). If you're training YOLO on your own dataset, you should go about using K-Means clustering to generate 9 anchors.[4]

  • 13 × 13 × 3 × (4 + 1 + 2)
  • 26 × 26 × 3 × (4 + 1 + 2)
  • 52 × 52 × 3 × (4 + 1 + 2)
    (figure: the three-scale detection heads of YOLO v3)
    Compared with YOLO v2, the number of predicted bboxes in YOLO v3 is
    (13 × 13 + 26 × 26 + 52 × 52) × 3 = 10,647 (v3), far more than the 845 of v2 (13 × 13 × 5).

With so many more predicted bounding boxes, the model's capability is obviously enhanced.

The official YOLO v3 model is shown below:

(figure: the official YOLO v3 architecture diagram)

1.5 YOLO v4

YOLO v4 adds a number of features on top of v3; the main three are:

  • Using multi-anchors for single ground truth
    The previous YOLO v3 used one anchor to be responsible for one GT, while YOLO v4 uses multiple anchors for one GT. Concretely: for a ground truth $GT_j$, every anchor with $IoU(anchor_i, GT_j) > threshold$ is assigned to be responsible for $GT_j$ (a small sketch follows this bullet).
    This means the total number of anchor boxes does not change, but the proportion of anchors selected as positive samples increases, which alleviates the imbalance between positive and negative samples (there is usually far too much background).
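    (A minimal NumPy sketch of this assignment rule, using a hypothetical iou_matrix of shape [num_anchors, num_gt]; the real YOLO v4 assignment also involves grid locations and other details.)

import numpy as np

def assign_positives_v3(iou_matrix):
    """YOLO v3-style: only the single best-matching anchor is positive for each GT."""
    return {j: [int(np.argmax(iou_matrix[:, j]))] for j in range(iou_matrix.shape[1])}

def assign_positives_v4(iou_matrix, thresh=0.3):
    """YOLO v4-style: every anchor whose IoU with the GT exceeds the threshold is positive."""
    positives = {}
    for j in range(iou_matrix.shape[1]):
        idx = np.where(iou_matrix[:, j] > thresh)[0]
        if idx.size == 0:                      # fall back to the best anchor so no GT is left unmatched
            idx = np.array([np.argmax(iou_matrix[:, j])])
        positives[j] = idx.tolist()
    return positives

iou = np.array([[0.62, 0.10],
                [0.45, 0.05],
                [0.08, 0.71]])                # 3 anchors x 2 GTs
print(assign_positives_v3(iou))               # {0: [0], 1: [2]}
print(assign_positives_v4(iou))               # {0: [0, 1], 1: [2]} -> more positive samples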

  • Eliminate grid sensitivity
    Do you remember the earlier figure from YOLO v2? YOLO v2 and YOLO v3 both predict the four offsets $t_x, t_y, t_w, t_h$.

    There is actually a problem hidden here: since $b_x = \sigma(t_x) + c_x$, placing the box centre exactly on a grid-cell boundary requires $\sigma(t_x)$ to be exactly 0 or 1, which would need $t_x \to \pm\infty$. YOLO v4 therefore multiplies the sigmoid by a factor larger than 1 (and YOLO v5 later uses $b_x = 2\sigma(t_x) - 0.5 + c_x$) so that the boundary values become easily reachable.

  • CIoU Loss
    This will not be introduced here for now; for details, please refer to the Zhihu article by Technology Beast [2].

1.6 YOLO v5

YOLO v5 is essentially a modified YOLO v3 structure. The introduction below is divided into several modules:

1.6.1 Network module

Taking an input of shape (N, 3, 640, 640) and the most lightweight model, YOLOv5s, as an example, its structure is as follows:
(figure: the YOLOv5s network structure)

Below I will describe in detail the important modules of Focus, BottleneckCSP, SPP, and PANET. Since this project uses the YOLO v5s network structure to train the model, the network diagrams and examples below are all based on YOLO v5s, and the input image is 3x640x640 .

The YOLO network consists of three main components:

1) Backbone : a convolutional neural network that aggregates and forms image features at different granularities.

2) Neck : A series of network layers that mix and combine image features and pass the image features to the prediction layer. (usually FPN or PANET)

3) Head : Predict image features, generate bounding boxes and predict categories.

Important modules used in YOLO V5 1.0 include Focus, BottleneckCSP, SPP, and PANet. Upsampling in the model uses 2× nearest-neighbour interpolation, nn.Upsample(scale_factor=2, mode="nearest").

It is worth noting that the pretrained models of YOLO V5 1.0 trained on the COCO data set originally used FPN as the Neck; after June 22 (2020), Ultralytics updated the Neck of the model to PANet. Many YOLO V5 network-structure introductions on the Internet are based on the FPN neck, while the model training in this article is based on the PANet neck, so only the PANet neck is introduced below.

For YOLO V5, whether V5s, V5m, V5l or V5x, the Backbone, Neck and Head are the same; the only difference lies in the depth and width settings of the model, and you only need to modify these two parameters to adjust the network structure. V5l uses the default parameters.

depth_multiple is used to control the depth of the model. For example, the depth of V5s is 0.33 while the depth of V5l is 1, which means the number of Bottlenecks in V5l is 3 times that of V5s.

width_multiple is used to control the number of convolution kernels. The width of V5s is 0.5 while the width of V5l is 1, which means V5s uses half the default number of convolution kernels; you can also set it to 1.25 times, which gives V5x. For example, in the yolov5s yaml below, the first backbone layer is [[-1, 1, Focus, [64, 3]]; since the width of V5s is 0.5, this layer is actually [[-1, 1, Focus, [32, 3]]. A small sketch of how the two multipliers are applied is given below.
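(A minimal sketch of how depth_multiple and width_multiple scale one yaml layer entry, in the spirit of the parse_model logic in YOLOv5's yolo.py; the rounding rules here are my own simplification.)

import math

def make_divisible(x, divisor=8):
    """Round the channel count up to a multiple of `divisor` (channels stay divisible by 8)."""
    return math.ceil(x / divisor) * divisor

def scale_layer(repeats, out_channels, depth_multiple, width_multiple):
    """Apply the model depth/width multipliers to one layer definition."""
    n = max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats
    c = make_divisible(out_channels * width_multiple)
    return n, c

# [-1, 3, BottleneckCSP, [128]] under V5s (0.33, 0.50) -> 1 repeat, 64 channels
print(scale_layer(3, 128, 0.33, 0.50))   # (1, 64)
# the same layer under V5l (1.0, 1.0)    -> 3 repeats, 128 channels
print(scale_layer(3, 128, 1.0, 1.0))     # (3, 128)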

Because my goal is a very lightweight detection model, I only consider YOLOv5s. Its model definition file yolov5s.yaml (for the COCO data set) is as follows; you can see that it corresponds well to the figure above.

# parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, BottleneckCSP, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, BottleneckCSP, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, BottleneckCSP, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, BottleneckCSP, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, BottleneckCSP, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, BottleneckCSP, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, BottleneckCSP, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, BottleneckCSP, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
  • Focus
    The picture below shows the Focus interlaced sampling splicing structure of YOLO V5s.
    (figure: the Focus interleaved-sampling and concatenation structure)

YOLO V5 takes a default input of 3x640x640. The Focus layer slices the image into four 3x320x320 slices by taking every other pixel on four shifted grids, concatenates the four slices along the channel dimension into a 12x320x320 tensor, and then passes it through a convolution layer with 32 kernels to produce a 32x320x320 output, which finally goes through batch_norm and leaky_relu before being fed to the next convolution layer.

The effect of the Focus layer, using a 4 × 4 image as an example: the left is the original input and the right is the feature map after Focus.
(figure: the Focus effect on a 4 × 4 image)

As of now (2020-09-28), its implementation looks like this [5]; it is essentially the same as the passthrough layer of YOLO v2.
The core is this code self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1)). x[..., ::2, ::2]is the yellow part, x[..., 1::2, ::2]is the red part, x[..., ::2, 1::2]is the green part, and so on.
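(A minimal, self-contained sketch of the Focus module described above; Conv here is my own compact stand-in for YOLOv5's Conv block (conv + BN + activation), not the exact repository code.)

import torch
import torch.nn as nn

class Conv(nn.Module):
    """Compact conv + batch-norm + activation block (stand-in for YOLOv5's Conv)."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True)
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    """Slice the image into 4 interleaved sub-images, concat on channels, then convolve."""
    def __init__(self, c1, c2, k=3):
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, 1)
    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2],     # even rows, even cols (yellow)
                                    x[..., 1::2, ::2],    # odd rows, even cols (red)
                                    x[..., ::2, 1::2],    # even rows, odd cols (green)
                                    x[..., 1::2, 1::2]], 1))

x = torch.randn(1, 3, 640, 640)
print(Focus(3, 32)(x).shape)   # torch.Size([1, 32, 320, 320])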

  • BottleneckCSP
    The picture below shows the first BottleneckCSP structure of YOLO V5s; you can see that BottleneckCSP consists of two parts, Bottleneck and CSP.
    (figure: the first BottleneckCSP of YOLOv5s)
    Among them, Bottleneck is the classic residual structure: first a 1x1 convolution layer (conv + batch_norm + leaky relu), then a 3x3 convolution layer, and finally the result is added to the initial input through the residual shortcut [6].
    (figure: the Bottleneck residual block)
    It is worth noting that YOLO V5 controls the depth of the model through depth_multiple: the depth of V5s is 0.33 while that of V5l is 1, which means the number of Bottlenecks in a BottleneckCSP of V5l is 3 times that of V5s. The first BottleneckCSP in the model defaults to 3 Bottlenecks, so for V5s there is only one Bottleneck, as in the figure above.

    The author's code is as follows (a sketch is given below). It is worth noting that e here is an expansion ratio: the ratio of the number of hidden convolution kernels used inside the block to the default (output) number [7].
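    (A sketch of Bottleneck and BottleneckCSP consistent with the description below; it reuses the compact Conv block from the Focus sketch above and is an illustration of the structure rather than the exact repository code.)

import torch
import torch.nn as nn
# Conv is the compact conv + BN + activation block defined in the Focus sketch above

class Bottleneck(nn.Module):
    """Classic residual block: 1x1 conv then 3x3 conv, with an optional shortcut add."""
    def __init__(self, c1, c2, shortcut=True, e=0.5):
        super().__init__()
        c_ = int(c2 * e)                        # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1)
        self.add = shortcut and c1 == c2
    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

class BottleneckCSP(nn.Module):
    """CSP wrapper: branch y1 runs n Bottlenecks, branch y2 only reduces channels; then concat."""
    def __init__(self, c1, c2, n=1, shortcut=True, e=0.5):
        super().__init__()
        c_ = int(c2 * e)                        # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = nn.Conv2d(c1, c_, 1, 1, bias=False)   # plain conv; BN/act applied after the concat
        self.cv3 = nn.Conv2d(c_, c_, 1, 1, bias=False)
        self.cv4 = Conv(2 * c_, c2, 1, 1)
        self.bn = nn.BatchNorm2d(2 * c_)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, e=1.0) for _ in range(n)])
    def forward(self, x):
        y1 = self.cv3(self.m(self.cv1(x)))      # branch 1: 1x1 conv -> n Bottlenecks -> 1x1 conv
        y2 = self.cv2(x)                        # branch 2: channel reduction only
        return self.cv4(self.act(self.bn(torch.cat((y1, y2), dim=1))))

print(BottleneckCSP(64, 64, n=1)(torch.randn(1, 64, 160, 160)).shape)   # torch.Size([1, 64, 160, 160])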
    As can be seen from the BottleneckCSP code, the input is split into 2 branches, y1 and y2:
    branch 1 (y1) performs the Bottleneck × N operation, while branch 2 (y2) only performs a channel reduction.
    Finally the two branches are concatenated and passed through bn, act and a final Conv (the picture below is from an article published by William on Zhihu [6]).
    (figure: the BottleneckCSP data flow)

  • SPP
    SPP is a spatial pyramid pooling layer. Here its input is 512x20x20; after a 1x1 convolution layer the output is 256x20x20, which is then passed through three parallel MaxPool layers with kernel sizes 5, 9 and 13. Note that the padding of the three branches is [5, 9, 13] // 2 respectively and the stride is 1, so each pooled result is still 256x20x20; the three results are concatenated with the initial feature map to give 1024x20x20, and finally a convolution with 512 kernels restores it to 512x20x20 (the picture below is from a Zhihu article by Technology Beast [3]).
    (figure: the SPP structure)
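    (A sketch of the SPP module matching the shapes above, again reusing the compact Conv block from the Focus sketch; an illustration rather than the exact repository code.)

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pools with different kernels, concatenated with the input."""
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super().__init__()
        c_ = c1 // 2                                   # 512 -> 256 hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)   # 4 * 256 = 1024 -> 512
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])
    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))

print(SPP(512, 512)(torch.randn(1, 512, 20, 20)).shape)   # torch.Size([1, 512, 20, 20])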

  • PANet
    The PAN structure comes from the paper Path Aggregation Network [8], which was originally proposed for the instance segmentation task. Its model structure is as follows:
    (figure: the PANet architecture, parts a–e)
    The feature extractor of the network adopts a new FPN structure with an enhanced bottom-up path, which improves the propagation of low-level features (part a). Each stage of this third path takes the feature map of the previous stage as input and processes it with a 3x3 convolutional layer; the output is added through a lateral connection to the feature map of the same stage of the top-down path, and these feature maps provide information for the next stage (part b).

    At the same time, adaptive feature pooling is used to restore the broken information path between each candidate region and all feature levels: each candidate region is pooled on every feature level, so that it is not arbitrarily assigned to a single level (part c).

    YOLO V5 borrows the modified PANet structure from YOLO V4: PANet originally uses adaptive feature pooling to add adjacent layers together for mask prediction, but this is somewhat troublesome when used in YOLO v4, so the authors of YOLO v4 do not add adjacent layers with adaptive feature pooling; instead they concatenate them, which improves prediction accuracy.
    (figure: PAN with concatenation instead of addition, as used in YOLO v4/v5)
    YOLO V5 also uses this concatenation; for details, refer to the full model diagram and the corresponding Concat operations in the Netron network graph.

1.6.2 Data processing improvements

The following content is adapted from a Zhihu article by Technology Beast [3].

  • Mosaic data enhancement[3]
    (figure: an example of Mosaic data augmentation)

CutMix stitches only two images together, while Mosaic data augmentation uses four images that are randomly scaled, randomly cropped and randomly arranged before being stitched.

Its main advantages are:

Enriches the data set : 4 images are chosen at random, randomly scaled, and then randomly laid out and stitched, which greatly enriches the detection data set; in particular, random scaling adds many small targets, making the network more robust.
Reduces GPU requirements : some may say that random scaling in ordinary data augmentation can achieve this too, but the author considered that many people may only have one GPU; with Mosaic training, the data of 4 images is processed at once, so the mini-batch size does not need to be large and a single GPU can achieve good results.
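(A much-simplified sketch of the Mosaic stitching step: four images are placed around a random centre on a 2s × 2s canvas. Label handling and the random scaling/cropping details of the real datasets.py implementation are omitted, and the function and parameter names here are my own.)

import random
import numpy as np
import cv2

def simple_mosaic(images, s=640, fill=114):
    """Stitch 4 images (H x W x 3 uint8 arrays) around a random centre on a 2s x 2s canvas."""
    canvas = np.full((2 * s, 2 * s, 3), fill, dtype=np.uint8)
    xc, yc = (random.randint(s // 2, 3 * s // 2) for _ in range(2))   # random mosaic centre
    corners = [(0, 0, xc, yc), (xc, 0, 2 * s, yc), (0, yc, xc, 2 * s), (xc, yc, 2 * s, 2 * s)]
    for img, (x1, y1, x2, y2) in zip(images, corners):
        w, h = x2 - x1, y2 - y1
        canvas[y1:y2, x1:x2] = cv2.resize(img, (w, h))   # real Mosaic crops/scales instead of resizing
    return canvas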

  • Adaptive anchor frame calculation[3]

In the Yolo algorithm, for different data sets, there will be anchor boxes with initial length and width.

During training, the network outputs predicted boxes based on these initial anchors, compares them with the ground-truth boxes, computes the difference between the two, and then back-propagates to update the network parameters.

Therefore the initial anchor boxes are also an important part; for example, the anchors Yolov5 initially sets for the COCO data set are the ones listed in the anchors section of yolov5s.yaml above (P3/8, P4/16, P5/32).
In Yolov3 and Yolov4, when training on a different data set, the values of the initial anchor boxes are computed by a separate program.

However, Yolov5 embeds this function into the code and adaptively computes the best anchor values for the given training set at the start of each training run.

Of course, if you feel that the calculated anchor box effect is not very good, you can also turn off the automatic anchor box calculation function in the code [9].

parser.add_argument('--noautoanchor', action='store_true', help='disable autoanchor check')
  • Adaptive Image Scaling[3]

In common target detection algorithms, different pictures have different lengths and widths, so a common method is to uniformly scale the original pictures to a standard size and then send them to the detection network.

For example, sizes such as 416 × 416 and 608 × 608 are commonly used in the YOLO algorithm; an 800 × 600 image, for instance, would be scaled and padded to such a size (figure omitted). However, this has been improved in the Yolov5 code, and it is also a nice trick that makes Yolov5 inference faster.

The author believes that in practical use many images have different aspect ratios, so after scaling and padding the black borders at the two ends have different sizes; if there is a lot of padding, there is information redundancy, which slows down inference.

Therefore, the letterbox function in datasets.py of the Yolov5 code was modified to adaptively add the minimum black border to the original image (figure omitted): the black borders at the top and bottom of the image are reduced, so the amount of computation during inference is smaller and the detection speed improves (a minimal sketch of the idea is given at the end of this subsection).

Through this simple improvement, the inference speed has been improved by 37%, which can be said to have an obvious effect.

The padding in Yolov5 is gray, i.e. (114, 114, 114), which works equally well. Note also that the reduced-border method is not used during training; training uses the traditional letterbox, i.e. scaling and padding to a square size such as 416*416. The reduced black border is only used at test/inference time, which speeds up object detection and inference.
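(A minimal sketch of the letterbox idea described above: scale to fit, then pad only as much as needed to reach a stride-32 multiple, split evenly on both sides. Parameter names follow common usage and are not necessarily the exact signature in datasets.py.)

import numpy as np
import cv2

def letterbox(img, new_size=640, color=(114, 114, 114), stride=32):
    """Resize keeping aspect ratio, then pad to the next multiple of `stride` with gray borders."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)                 # scale ratio
    new_w, new_h = int(round(w * r)), int(round(h * r))
    pad_w = (stride - new_w % stride) % stride          # minimal padding to a stride multiple
    pad_h = (stride - new_h % stride) % stride
    left, right = pad_w // 2, pad_w - pad_w // 2
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    return cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)

out = letterbox(np.zeros((600, 800, 3), dtype=np.uint8))
print(out.shape)   # (480, 640, 3): no gray border is needed at all for this aspect ratio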

  • Positive samples increase[3]

This is the same as YOLO v4's Using multi-anchors for single ground truth .

2. Summary

The picture below is the author’s summary of the characteristics of each generation of the YOLO series. Among them, the Loss part will be discussed in Series 3.
(figure: summary table of the characteristics of YOLO v1–v5)

Reference articles

[1] You must have never seen such an easy-to-understand YOLO series (from v1 to v5) model interpretation (Part 1)
[2] You must have never seen such an easy-to-understand YOLO series (from v1 to v5) model interpretation (Part 2)
[3] You must have never seen such an easy-to-understand YOLO series (from v1 to v5) model interpretation (Part 3)
[4] What's new in YOLO v3?
[5] Focus layer of YOLO v5
[6] Use YOLO V5 to train an autonomous-driving object detection network
[7] Bottleneck layer of YOLO v5
[8] Path Aggregation Network for Instance Segmentation, CVPR 2018
[9] YOLO v5 code: train.py


Original post: blog.csdn.net/g11d111/article/details/108845799