YOLOv5's sample allocation strategy

The loss in object detection usually consists of three parts: a classification loss Lcls, a confidence (objectness) loss Lobj and a bounding-box IoU loss Lbox. Lcls and Lbox are computed only on positive samples, while Lobj is computed on all samples.
Unlike DETR, an end-to-end object detection algorithm, YOLO generates a large number of prediction boxes, and each prediction box is called a sample. So among all of these predictions, which ones should serve as positive samples and compute Lbox and Lcls against the gt (ground truth), and which ones should serve as negative samples that contribute only to Lobj? That is decided by the sample allocation method.
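
As a rough sketch of how the three parts combine (the tensor names, the helper box_iou_xyxy and the loss weights below are illustrative placeholders, not the actual YOLOv5 variables), positive samples feed all three losses while negative samples only push the objectness output towards 0:

import torch
import torch.nn.functional as F

def box_iou_xyxy(a, b, eps=1e-7):
    # IoU of (x1, y1, x2, y2) boxes, one pair per row (YOLOv5 itself uses CIoU here)
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).clamp(min=0).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area_a + area_b - inter + eps)

def detection_loss(pred_cls, pred_box, pred_obj, tgt_cls, tgt_box, tgt_obj, pos_mask,
                   w_cls=0.5, w_box=0.05, w_obj=1.0):
    # Lcls and Lbox: only the predictions selected as positive samples contribute
    lcls = F.binary_cross_entropy_with_logits(pred_cls[pos_mask], tgt_cls[pos_mask])
    lbox = (1.0 - box_iou_xyxy(pred_box[pos_mask], tgt_box[pos_mask])).mean()
    # Lobj: every prediction contributes; negatives simply have an objectness target of 0
    lobj = F.binary_cross_entropy_with_logits(pred_obj, tgt_obj)
    return w_cls * lcls + w_box * lbox + w_obj * lobj

In the actual YOLOv5 loss the box term uses CIoU and the objectness target of a positive sample is tied to the IoU of its predicted box rather than fixed at 1, but the positive/negative split works the same way.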

YOLOv3&v5 sample allocation strategy

Before starting to talk about sample allocation, we must first clarify two points:

  • Both YOLOv3 and YOLOv5 use a feature pyramid structure for multi-scale prediction, and both finally output three feature maps with different downsampling factors
  • Both YOLOv3 and YOLOv5 are anchor-based object detectors: each point on each output feature map corresponds to three anchors (see the sketch after this list)
    (Figure: YOLOv3 network structure diagram)
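
For concreteness, a small sketch of the three output scales; the strides are the standard YOLOv3/v5 values and the anchor sizes are the defaults from the YOLOv5 model yaml (quoted from memory, so treat them as indicative):

# three detection layers, downsampled by 8x, 16x and 32x
strides = [8, 16, 32]

# default COCO anchors (w, h) in input-image pixels, 3 per scale
anchors = [
    [(10, 13), (16, 30), (33, 23)],       # P3, stride 8  -> small objects
    [(30, 61), (62, 45), (59, 119)],      # P4, stride 16 -> medium objects
    [(116, 90), (156, 198), (373, 326)],  # P5, stride 32 -> large objects
]

img_size = 640
for s, a in zip(strides, anchors):
    fmap = img_size // s
    # every grid point on this feature map carries 3 anchors
    print(f"stride {s:2d}: {fmap}x{fmap} grid, {fmap * fmap * 3} anchor boxes")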

Sample allocation is performed layer by layer on the three feature maps with different downsampling factors that the network finally outputs:

  • First, map the normalized gt boxes to the size of the corresponding feature map;
  • On each scale, compute the width and height ratios between the gt and the three preset anchors of that feature map and check whether 1/thr < ratio < thr holds (equivalently, max(ratio, 1/ratio) < thr). If the condition holds, the gt matches the size of that anchor and positive samples will be assigned for it in the next step; if it does not hold, the gt does not match that anchor's size and no positive sample will be matched to that anchor. Assuming there are m labeled ground-truth boxes gt, there will theoretically be at most 3*m successful gt-anchor matches on one feature map (because each grid point in YOLOv3 & v5 corresponds to 3 anchors). Core code analysis (a toy illustration follows the snippet):
t = targets * gain  # map gt coordinates onto the corresponding feature map
if nt:
    # Matches: this step matches each target against the 3 anchors of this layer
    r = t[..., 4:6] / anchors[:, None]  # wh ratio
    j = torch.max(r, 1 / r).max(2)[0] < self.hyp['anchor_t']  # compare  (3, n)
    t = t[j]  # filter
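A toy illustration of this filter with made-up numbers (anchor_t plays the role of self.hyp['anchor_t'], which defaults to 4.0; the anchors are the stride-8 defaults divided by the stride, since the matching is done in feature-map units):

import torch

anchor_t = 4.0

# 3 anchors of the stride-8 layer, (w, h) in feature-map units
anchors = torch.tensor([[1.25, 1.625], [2.0, 3.75], [4.125, 2.875]])

# 2 made-up gt boxes, (w, h) also mapped to feature-map units
gt_wh = torch.tensor([[1.5, 1.4], [20.0, 16.0]])

r = gt_wh[None] / anchors[:, None]            # (3 anchors, 2 gts, 2) width/height ratios
j = torch.max(r, 1 / r).max(2)[0] < anchor_t  # (3, 2): True where gt and anchor sizes match
print(j)
# the first gt matches all three anchors; the second is far too large for this layer's
# anchors and matches none, so it yields no positive samples here (it may still match
# the larger anchors of a coarser feature map)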
  • For YOLOv3, the last step only determines which grid cell the center of the gt falls in; the grid point at the top-left corner of that cell is then responsible for predicting this gt, i.e. it becomes its positive sample:
    (Figure: YOLOv3 sample allocation)
  • For YOLOv5, each gt is not only predicted by the grid point of the cell its center falls in, but is also assigned two additional positive samples depending on whether the center lies in the top-left, bottom-left, top-right or bottom-right region of that cell:
    (Figure: YOLOv5 sample allocation)

For the mapping from the region of the grid cell in which the gt center falls to the extra cells assigned in YOLOv5, I drew a picture summarizing the cases:
(Figure: region of the cell containing the gt center → the two extra neighboring cells assigned as positive samples)
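
A small standalone sketch of the same rule (the helper assigned_cells is a simplified re-implementation for illustration, not the library code, and its border handling is cruder than in YOLOv5's build_targets): the cell containing the gt center is always assigned, and YOLOv5 additionally assigns the horizontal and vertical neighbors on the side the center is closest to (note that y grows downward on the feature map):

def assigned_cells(gx, gy, W, H):
    # (gx, gy): gt center on the feature map; (W, H): feature-map width and height
    cx, cy = int(gx), int(gy)
    cells = [(cx, cy)]              # YOLOv3 stops here: only the cell containing the center
    fx, fy = gx - cx, gy - cy       # position of the center inside its cell
    if fx < 0.5 and cx - 1 >= 0:
        cells.append((cx - 1, cy))  # left half  -> also the cell to the left
    if fx > 0.5 and cx + 1 < W:
        cells.append((cx + 1, cy))  # right half -> also the cell to the right
    if fy < 0.5 and cy - 1 >= 0:
        cells.append((cx, cy - 1))  # upper half -> also the cell above
    if fy > 0.5 and cy + 1 < H:
        cells.append((cx, cy + 1))  # lower half -> also the cell below
    return cells

print(assigned_cells(2.3, 5.7, 80, 80))  # [(2, 5), (1, 5), (2, 6)]: center cell + left + below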

So under the same conditions, YOLOv5 assigns three times as many positive samples to each gt as YOLOv3 does. Core code analysis:

t = targets * gain  # map gt coordinates onto the corresponding feature map
if nt:
    # Matches: this step matches each target against the 3 anchors of this layer
    r = t[..., 4:6] / anchors[:, None]  # wh ratio
    j = torch.max(r, 1 / r).max(2)[0] < self.hyp['anchor_t']  # compare  (3, n)
    t = t[j]  # filter

    # Offsets (g = 0.5 and the 5 candidate offsets `off` are defined earlier in build_targets)
    gxy = t[:, 2:4]  # grid xy: gt centers measured from the top-left corner of the feature map
    gxi = gain[[2, 3]] - gxy  # inverse: gt centers measured from the bottom-right corner
    # Build masks according to whether the gt center falls in the top-left, bottom-left,
    # top-right or bottom-right region of its grid cell.
    # Except for cells in the first/last row and first/last column, j and l (and k and m)
    # are complementary. Cells on those borders need special handling: e.g. if a gt falls
    # in the top-left region of the top-left-most cell, only one grid point is responsible
    # for predicting it -- the top-left point of the cell it lies in.
    j, k = ((gxy % 1 < g) & (gxy > 1)).T
    l, m = ((gxi % 1 < g) & (gxi > 1)).T
    # torch.ones_like(j) is prepended because each gt is predicted by three grid points
    # (see the grid figure above): torch.ones_like(j) stands for the cell containing the
    # center itself, and two of j/k/l/m are True, standing for the two neighboring cells.
    j = torch.stack((torch.ones_like(j), j, k, l, m))   # (5, n)
    # the gts that end up assigned as positive samples, and their corresponding offsets
    t = t.repeat((5, 1, 1))[j]
    offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
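
For context, in YOLOv5's build_targets (utils/loss.py) the snippet above is followed, roughly and quoting the logic from memory, by subtracting the offsets from the gt centers and flooring, which recovers the indices of the up-to-three responsible grid cells:

gxy = t[:, 2:4]               # gt centers on the feature map
gij = (gxy - offsets).long()  # shift by the +/-0.5 offsets and floor -> cell indices
gi, gj = gij.T                # x and y indices of the cell each positive sample is assigned to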

Origin: blog.csdn.net/Fyw_Fyw_/article/details/129819656