YOLOv7 learning notes: the training process

While studying YOLOv7 we have already covered its network structure. In practice, however, the difficulty of the YOLOv7 project lies not in the network model but in the design of its loss function, that is, in how to train a suitable bbox.
A neural network model has a training process and a testing (inference) process. The training process of YOLOv7 involves model construction, label assignment, and loss function calculation; the model itself was covered earlier. The testing process includes loading the model, loss calculation, decoding the output values, non-maximum suppression, mAP calculation, and so on. Today we will walk through the training process of YOLOv7.
These files are mainly used in the training process:
(figure: the training-related files in the project)

An important idea in the YOLOv7 training process is its positive-sample matching strategy, which reads like a combination of YOLOv5 and YOLOX. Let's look at the matching strategy together with the code.

YOLOv5/v7 positive and negative sample assignment strategy

The biggest difference between YOLOv5/v7 and YOLOv3/v4 is that in v3 and v4 a gt matches only one positive sample, while in v5 and v7 a gt can be assigned to multiple anchors, possibly spread across two or even all three of the three feature maps.

Matching strategy: the description below assumes no auxiliary head. In the paper, the head responsible for the final output is the lead head, and the head used to assist training is called the auxiliary head. This blog does not dwell on the auxiliary head, because the improvement it brings in the paper's ablation experiments is relatively limited (0.3 points).

The strategy mainly draws on YOLOv5's matching together with the currently popular SimOTA introduced by YOLOX.

S1. Before training, run k-means clustering on the gt boxes of the training set to obtain 9 prior anchor boxes, sorted from small to large. (Optional)

S2. Match each gt against the 9 anchors: YOLOv5 computes the width-to-width and height-to-height ratios between the gt and each of the 9 anchors (the larger divided by the smaller, so each ratio is greater than 1), then takes the larger of the two ratios. If that ratio is below the configured threshold (4 in YOLOv5's default hyperparameters), this anchor's predicted box is treated as a positive sample. A gt may match several anchors (at most 9 at this point), so a gt may be trained on several network layers, which greatly increases the number of positive samples. Of course a gt may also fail to match any anchor, in which case it is treated as background and excluded from training, a sign that the anchor sizes were poorly designed. (See the sketch after this list.)

S3. Expand the positive samples. Based on the position of the gt center within its cell, the 2 nearest neighboring grid cells are also used as prediction cells, so one ground-truth box can be predicted by 3 cells; a rough estimate puts the number of positive samples at three times that of the earlier YOLO series (at most 27 matches at this point). See the pale yellow region in the figure below: the solid lines are the real YOLO grid and the dashed lines split a cell into four quadrants. In this example the GT center falls in the lower-right dashed quadrant, so the real cells to the right and below are also taken as positive samples.

S4. Take the prediction results with the top-10 largest IoU against the current gt. Summing these top-10 IoUs (anywhere between 5 and 15 works; the choice is not sensitive) gives the k for the current gt. k is at least 1.

S5. Compute the loss between each GT and its candidate anchors (the classification loss weight is raised early in training and lowered later, e.g. 1:5 -> 1:3), and keep the k candidates with the smallest loss.

S6. Remove cases where the same anchor is assigned to multiple GTs.
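
As mentioned in S2, the width/height ratio filter can be sketched as follows. This is a minimal illustration, not the project's code: t_wh and anchors are names I chose, sizes are in feature-map units, and 4.0 is the commonly used threshold:

    import torch

    # Hypothetical sizes: 2 gt boxes and 3 anchors, (w, h) in feature-map units
    t_wh = torch.tensor([[4.0, 6.0], [30.0, 10.0]])
    anchors = torch.tensor([[3.0, 5.0], [6.0, 12.0], [20.0, 9.0]])

    r = t_wh[:, None] / anchors[None]            # (2, 3, 2) width and height ratios
    ratio = torch.max(r, 1.0 / r).max(dim=2)[0]  # larger/smaller, keep the worse of w, h
    keep = ratio < 4.0                           # the ratio threshold
    print(keep)
    # tensor([[ True,  True, False],
    #         [False, False,  True]])

Each True marks a (gt, anchor) pair that survives the coarse screen.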

Positive and negative sample assignment

The function build_targets (in yolo_training.py) that performs positive/negative sample assignment has the following structure:

├── Data preparation
└── Loop over each feature map
        ├── ① Match anchors and gts to find which gts are positive samples on the current feature map (find_3_positive) — coarse screening
        └── ② Assign the current feature map's positive samples to the corresponding grid cells (fine screening: IoU, class)

Step 1: Match anchors and gts to find which gts are positive samples on the current feature map (find_3_positive). What happens here is that the gt is offset by 0.5 up, down, left, and right to collect the surrounding cells for prediction, and an anchor is considered a match if its aspect ratios against the gt are appropriate (each ratio between 1/4 and 4); the current gt is then matched on the current feature map.
As shown in the figure: this is the lead head's positive-sample matching strategy.
(figure: lead head positive-sample matching)
YOLOv7 also introduces the auxiliary head; its positive samples are assigned as shown below.
(figure: aux head positive-sample matching)
As shown in the figure: the distribution of positive samples in the lead head and aux head during training (the blue dot marks the gt center, the solid-line grid is the feature-map grid, and the dashed lines split each cell into 4 quadrants for sample assignment; if a gt sits at the blue dot, then in the lead head the yellow cells become positive samples, while in the aux head the yellow + orange cells become positive samples).

Coarse screening (find_3_positive)

Set the offset directions and the offset size:

 g = 0.5  # offset distance, used to grab more positive samples
 off = torch.tensor([  # offset directions
     [0, 0],
     [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
     # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
 ], device=targets.device).float() * g

In YOLOv5/v7, each cell is divided into four quadrants. For each gt matched in step 1, the code determines which of the four quadrants the gt center (the blue point in the figure above) falls in, and the two adjacent cells on that side are also taken as positive samples. In the figure above, for example, the gt is biased toward the lower-right quadrant, so the cells to the right of and below the cell containing the gt are also used as positive samples.

# Corresponding to: center, left, top, right, bottom
off = torch.tensor([[0, 0],
                    [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
                    ], device=targets.device).float() * g

# gain = [1, 1, feature-map w, feature-map h, feature-map w, feature-map h]
gxy = t[:, 2:4]  # gt xy coordinates, origin at the feature map's top-left corner
gxi = gain[[2, 3]] - gxy  # gt xy coordinates measured from the bottom-right corner
# j, k, l, m indicate whether the left/top/right/bottom neighbor qualifies. g = 0.5
# j and l are mutually exclusive, as are k and m; (x, y) % 1 yields two values,
# so together they can form the four quadrants
j, k = ((gxy % 1 < g) & (gxy > 1)).T
l, m = ((gxi % 1 < g) & (gxi > 1)).T
j = torch.stack((torch.ones_like(j), j, k, l, m))  # stack into five masks (center is always kept)
# each gt was stored once; repeat it 5 times, then mask down to its valid copies
t = t.repeat((5, 1, 1))[j]
# offsets
offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]

To understand the indexing in the code above: j stacks five boolean masks (the center is always true, plus the four neighbor tests), t.repeat((5, 1, 1))[j] keeps one copy of each gt per selected offset, and offsets keeps the matching displacement for each copy.
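
A tiny runnable trace of those masks with made-up numbers (assume an 80×80 feature map and one gt centered at (2.3, 2.2) in grid units):

    import torch

    g = 0.5
    gain = torch.tensor([1., 1., 80., 80., 80., 80.])  # assumed 80x80 feature map
    gxy = torch.tensor([[2.3, 2.2]])                   # one gt center, in grid units
    gxi = gain[[2, 3]] - gxy                           # measured from the bottom-right corner

    j, k = ((gxy % 1 < g) & (gxy > 1)).T  # fractional part < 0.5: left / top neighbors
    l, m = ((gxi % 1 < g) & (gxi > 1)).T  # fractional part > 0.5: right / bottom neighbors
    print(j.item(), k.item(), l.item(), m.item())  # True True False False

The gt sits in the top-left quadrant of cell (2, 2), so cells (1, 2) and (2, 1) are added as positive samples alongside (2, 2).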
It can be seen that v5 and v7 are very similar here. Compared with YOLOv3/v4, where a gt matches only one positive sample, this scheme assigns far more positive samples, which helps speed up training and balance positive and negative samples.
Having expanded the positive samples (we found t prior boxes), we next need to determine which image each box belongs to, its class, the top-left coordinate of the cell responsible for predicting it, and the w, h scaling of its anchor.

# -------------------------------------------#
#   b   is the image index each t belongs to
#   gxy is the gt box's x, y center coordinates
#   gwh is the gt box's w, h
#   gij is the coordinate of the feature-map cell the gt belongs to
# -------------------------------------------#
b, c = t[:, :2].long().T  # image, class
gxy = t[:, 2:4]  # grid xy
gwh = t[:, 4:6]  # grid wh
gij = (gxy - offsets).long()  # .long() truncates the fractional part: e.g. gxy (2.3, 2.2)
                              # shifted left by 0.5 gives (1.8, 2.2), truncated to (1, 2),
                              # so cell (1, 2) becomes responsible after the offset
gi, gj = gij.T  # grid xy indices

# -------------------------------------------#
#   gj, gi must not exceed the feature-map bounds
#   a is the index of the prior box (anchor) at this feature point
# -------------------------------------------#
a = t[:, 6].long()  # anchor indices
indices.append(
    (b, a, gj.clamp_(0, shape[2] - 1), gi.clamp_(0, shape[3] - 1)))  # image, anchor, grid indices
anchors.append(anchors_i[a])  # anchor scales

Because on every feature map all gts are matched against that map's anchors, a single gt may be assigned positive samples on multiple feature maps.
The return values of find_3_positive are:
indices, of shape [3, 4, number of positive samples]: for each of the 3 feature maps, the image index, anchor index, and grid y, x of every positive sample;
anchors, the corresponding anchor sizes for those samples.

Fine screening (get_target())

So far we have completed the coarse matching of positive samples. Next we re-screen the prior boxes obtained from the coarse screening, this time using the model's predictions: we compute the IoU and the class cost between the predicted boxes and the real boxes.

# -------------------------------------------#
#   Take out the predictions corresponding to this gt
# -------------------------------------------#
fg_pred = prediction[b, a, gj, gi]
# objectness and class predictions, used to judge object/class agreement
p_obj.append(fg_pred[:, 4:5])
p_cls.append(fg_pred[:, 5:])

# -------------------------------------------#
#   With the grid in hand, decode the predictions; restore by the
#   stride to obtain the decoded prediction results
# -------------------------------------------#
grid = torch.stack([gi, gj], dim=1).type_as(fg_pred)
pxy = (fg_pred[:, :2].sigmoid() * 2. - 0.5 + grid) * self.stride[i]
pwh = (fg_pred[:, 2:4].sigmoid() * 2) ** 2 * anch[i][idx] * self.stride[i]
pxywh = torch.cat([pxy, pwh], dim=-1)
pxyxy = self.xywh2xyxy(pxywh)  # convert xywh to top-left / bottom-right corner form
pxyxys.append(pxyxy)


Compute the IoU between the real boxes and the predicted boxes in the current image. Its range is 0-1; after taking -log, the loss ranges over 0 to inf, so the larger the IoU, the smaller pair_wise_iou_loss. The result has shape (number of real boxes, number of candidate boxes).

 pair_wise_iou = self.box_iou(txyxy, pxyxys)
 pair_wise_iou_loss = -torch.log(pair_wise_iou + 1e-8)
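
self.box_iou computes the pairwise IoU between the two box sets. As a reference, here is a minimal sketch of such a pairwise IoU, assuming both inputs are x1y1x2y2 tensors (my own illustration, not the project's exact code):

    import torch

    def box_iou(boxes1, boxes2):
        """Pairwise IoU: boxes1 is (N, 4), boxes2 is (M, 4), both x1y1x2y2; returns (N, M)."""
        area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
        area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
        # intersection corners, broadcast to (N, M, 2)
        lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])
        rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[..., 0] * wh[..., 1]
        return inter / (area1[:, None] + area2[None, :] - inter + 1e-8)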

After computing the IoU, take the top 20 values per gt (or as many as exist, if there are fewer than 20):

top_k, _ = torch.topk(pair_wise_iou, min(20, pair_wise_iou.shape[1]), dim=1)
dynamic_ks = torch.clamp(top_k.sum(1).int(), min=1)
#   gt_cls_per_image: the ground-truth classes, converted to one-hot
#   and replicated once per candidate box
gt_cls_per_image = F.one_hot(this_target[:, 1].to(torch.int64), self.num_classes).float().unsqueeze(
    1).repeat(1, pxyxys.shape[0], 1)
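
A quick numeric illustration of the dynamic-k rule with made-up IoUs (one gt, six candidate boxes):

    import torch

    pair_wise_iou = torch.tensor([[0.82, 0.61, 0.40, 0.23, 0.11, 0.05]])
    top_k, _ = torch.topk(pair_wise_iou, min(20, pair_wise_iou.shape[1]), dim=1)
    dynamic_ks = torch.clamp(top_k.sum(1).int(), min=1)
    print(dynamic_ks)  # tensor([2], dtype=torch.int32): this gt keeps its 2 best candidates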

Combine the class and objectness predictions, then compute the cross entropy against the one-hot ground truth:

cls_preds_ = p_cls.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_() \
           * p_obj.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
y = cls_preds_.sqrt_()
pair_wise_cls_loss = F.binary_cross_entropy_with_logits(torch.log(y / (1 - y)), gt_cls_per_image,
                                                        reduction="none").sum(-1)

Sum the two losses into a total cost; then, for each gt, topk picks its dynamic_ks lowest-cost candidates (note largest=False, since we want the smallest losses):

cost = (pair_wise_cls_loss + 3.0 * pair_wise_iou_loss)
matching_matrix = torch.zeros_like(cost)
for gt_idx in range(num_gt):  # for each gt, find its k lowest-cost candidates
    _, pos_idx = torch.topk(cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False)
    matching_matrix[gt_idx][pos_idx] = 1.0


To prevent one anchor from being assigned to predict multiple gts, conflicts are resolved by keeping, for each such anchor, only the gt with the smallest cost:

anchor_matching_gt = matching_matrix.sum(0)  # sum(0) sums each column
if (anchor_matching_gt > 1).sum() > 0:  # columns with sum > 1: an anchor matched several gts
    _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)  # find the cheapest gt
    matching_matrix[:, anchor_matching_gt > 1] *= 0.0  # zero out the others
    matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0  # keep only the cheapest
fg_mask_inboxes = matching_matrix.sum(0) > 0.0
fg_mask_inboxes = fg_mask_inboxes.to(torch.device(device))  # which candidates are positive
matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)  # gt index of each positive
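
A self-contained run of this conflict-resolution step on a made-up cost matrix (2 gts, 4 candidates, where candidate 2 was initially matched to both gts):

    import torch

    cost = torch.tensor([[0.2, 0.9, 0.3, 0.8],    # gt 0's cost against 4 candidates
                         [0.7, 0.6, 0.1, 0.4]])   # gt 1's cost against the same 4
    matching_matrix = torch.tensor([[1., 0., 1., 0.],
                                    [0., 0., 1., 1.]])  # candidate 2 matched both gts

    anchor_matching_gt = matching_matrix.sum(0)   # tensor([1., 0., 2., 1.])
    if (anchor_matching_gt > 1).sum() > 0:
        _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
        matching_matrix[:, anchor_matching_gt > 1] *= 0.0
        matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0
    print(matching_matrix)
    # tensor([[1., 0., 0., 0.],
    #         [0., 0., 1., 1.]])  -> gt 1 wins candidate 2 (cost 0.1 < 0.3)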

In the end the matches are organized per batch, and the values we obtain are:
matching_bs, matching_as, matching_gjs, matching_gis, matching_targets, matching_anchs
Their meanings are:
matching_bs: the matched batch (image) indices
matching_as: the matched anchor ids [0, 1, 2]
matching_gjs, matching_gis: the y and x coordinates of the matched cells (the ones responsible for predicting the positive samples)
matching_targets: the matched labels; with the labels in the batch, xywh (the real box) can be computed against the previously matched anchor. An anchor id was appended in find_3_positive, but it is not used here and is dropped.
matching_anchs: the matched anchor scales

Calculating the loss

After the build_targets function completes, the matched positive-sample information described above is obtained in the loss's call function:

 bs, as_, gjs, gis, targets, anchors = self.build_targets(predictions, targets, imgs)

Start calculating the loss:

for i, prediction in enumerate(predictions):
    # -------------------------------------------#
    #   image, anchor, gridy, gridx
    # -------------------------------------------#
    b, a, gj, gi = bs[i], as_[i], gjs[i], gis[i]
    tobj = torch.zeros_like(prediction[..., 0], device=device)  # target obj

    # -------------------------------------------#
    #   Get the number of targets; if it is greater than 0,
    #   compute the classification and regression losses
    # -------------------------------------------#
    n = b.shape[0]
    if n:
        prediction_pos = prediction[b, a, gj, gi]  # prediction subset corresponding to targets

        # -------------------------------------------#
        #   Regression loss of the matched positive samples
        # -------------------------------------------#
        # -------------------------------------------#
        #   grid: the x, y coordinates of the positive-sample cells
        # -------------------------------------------#
        grid = torch.stack([gi, gj], dim=1)
        # -------------------------------------------#
        #   Decode to get the predictions; note this mirrors exactly
        #   the computation used in build_targets
        # -------------------------------------------#
        xy = prediction_pos[:, :2].sigmoid() * 2. - 0.5
        wh = (prediction_pos[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
        box = torch.cat((xy, wh), 1)
        # -------------------------------------------#
        #   Map the real boxes onto this feature layer
        # -------------------------------------------#
        selected_tbox = targets[i][:, 2:6] * feature_map_sizes[i]
        selected_tbox[:, :2] -= grid.type_as(prediction)
        # -------------------------------------------#
        #   Regression loss between the predicted and real boxes
        # -------------------------------------------#
        iou = self.bbox_iou(box.T, selected_tbox, x1y1x2y2=False, CIoU=True)
        box_loss += (1.0 - iou).mean()
        # -------------------------------------------#
        #   Build the confidence target from the predictions' IoU,
        #   i.e. use IoU in place of a hard confidence label
        # -------------------------------------------#
        tobj[b, a, gj, gi] = (1.0 - self.gr) + self.gr * iou.detach().clamp(0).type(tobj.dtype)  # iou ratio

        # -------------------------------------------#
        #   Classification loss of the matched positive samples
        # -------------------------------------------#
        selected_tcls = targets[i][:, 1].long()
        t = torch.full_like(prediction_pos[:, 5:], self.cn, device=device)  # targets
        t[range(n), selected_tcls] = self.cp
        cls_loss += self.BCEcls(prediction_pos[:, 5:], t)  # BCE

    # -------------------------------------------#
    #   Objectness confidence loss, weighted by each
    #   feature layer's balance ratio
    # -------------------------------------------#
    obj_loss += self.BCEobj(prediction[..., 4], tobj) * self.balance[i]  # obj loss

# -------------------------------------------#
#   Weight each loss term by its ratio, sum them up,
#   then multiply by the batch size
# -------------------------------------------#
box_loss *= self.box_ratio
obj_loss *= self.obj_ratio
cls_loss *= self.cls_ratio
bs = tobj.shape[0]

loss = box_loss + obj_loss + cls_loss
return loss * bs  # multiply by batch size, as the comment above describes
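
self.cp and self.cn above are the positive and negative class targets. Their definition is not shown in this post; assuming YOLOv5-style label smoothing, they would come from something like:

    def smooth_BCE(eps=0.1):
        # positive and negative BCE targets under label smoothing;
        # eps = 0 recovers the hard targets 1.0 and 0.0
        return 1.0 - 0.5 * eps, 0.5 * eps  # cp, cn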

Finally, the regression values are computed from the raw model outputs according to the decoding formula.
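Reconstructing that formula from the decoding code used above (sigmoid * 2 - 0.5 for the center, (sigmoid * 2)^2 times the anchor for the size):

$$b_x = 2\sigma(t_x) - 0.5 + c_x, \qquad b_y = 2\sigma(t_y) - 0.5 + c_y$$

$$b_w = p_w \cdot (2\sigma(t_w))^2, \qquad b_h = p_h \cdot (2\sigma(t_h))^2$$

where $(c_x, c_y)$ is the grid cell's coordinate, $(p_w, p_h)$ is the anchor size on this feature layer, and $\sigma$ is the sigmoid.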
With that, the positive-sample matching and loss computation of YOLOv7 is complete.


Origin blog.csdn.net/pengxiang1998/article/details/128393512#comments_27544850