Understand YOLO v8 in one article

In 2023, the YOLO series has been iterated to v8. Both v8 and v5 come from the same author (Ultralytics). For ease of understanding, this article explains v8 by comparing it with v5; if you want to learn about v5 first, refer to the yolov5 article. Next I plan to integrate pruning and distillation work into v8, so stay tuned. If anything is unclear, feel free to leave a comment.

First, a quick review of YOLOv5:

  • Backbone: CSPDarkNet structure; its core idea is embodied in the C3 module, which carries the gradient-splitting idea of CSPNet;
  • PAN-FPN: dual-stream FPN; besides the upsampling and CBS convolution modules, the key component is again the C3 module;
  • Head: Coupled Head + Anchor-Based; YOLOv3, YOLOv4, YOLOv5 and YOLOv7 are all anchor-based;
  • label assignment: multiple positive reference points with a shape-matching rule: the width and height of the GT are divided by the width and height of the anchor, and if the ratio is below the threshold the anchor is taken as a positive sample point;
  • Loss: BCE Loss for classification and CIoU Loss for regression, plus a confidence (objectness) loss for the presence of objects; the total loss is the weighted sum of the three.

The specific improvements of yolov8 are as follows:

  • Backbone: the CSP idea is kept, but the C3 module of YOLOv5 is replaced by the C2f module for further lightweighting; YOLOv8 still uses the SPPF module found in YOLOv5 and other architectures;
  • PAN-FPN: YOLOv8 still follows the PAN idea, but comparing the structure diagrams of YOLOv5 and YOLOv8 shows that YOLOv8 removes the 1x1 CBS convolution before upsampling in YOLOv5's PAN-FPN, and also replaces the C3 module with the C2f module;
  • Decoupled-Head: YOLOv8 uses a Decoupled Head, i.e., cls and reg are output by two separate heads;
  • Anchor-Free: YOLOv8 abandons the previous anchor-based design and adopts the anchor-free idea;
  • Loss: YOLOv8 uses VFL Loss as the classification loss (not used in actual training) and DFL Loss + CIoU Loss as the regression loss;
  • label assignment: YOLOv8 abandons the previous IoU matching and single-side ratio assignment, and instead uses the Task-Aligned Assigner.

Readers familiar with YOLO should have a general picture of the v8 structure after the comparison above. The most important updates are the C2/C2f structures, the decoupling of cls and reg in Detect, and the use of the DFL integral to obtain the regression output. DFL is part of GFL; if you are not familiar with it, refer to GFL.

c2f

The C3 module mainly borrows the splitting idea of CSPNet and combines it with the residual structure to build the C3 block. The gradient module on the CSP main branch is the BottleNeck module, i.e., the residual module. The structure is shown in the figure below.
(C3 module structure)
To make the model lighter, v8 designs the C2f structure. Compared with C3, it has one fewer conv layer, and the features are split with split instead of a conv.
(C2f module structure)

import torch
import torch.nn as nn
# Conv and Bottleneck are the standard ultralytics building blocks

class C2(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv(2 * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n)))

    def forward(self, x):
        a, b = self.cv1(x).split((self.c, self.c), 1)
        return self.cv2(torch.cat((self.m(a), b), 1))
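
For comparison, here is a sketch of the C2f module in the same style (based on the ultralytics implementation, shown for illustration; the exact source may differ slightly): the features are split once with chunk, and every Bottleneck output is kept and concatenated before the final 1x1 conv.

class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions, the lighter variant used by YOLOv8
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # fuses the two split halves plus every Bottleneck output
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, 1))   # split into two branches
        y.extend(m(y[-1]) for m in self.m)  # each Bottleneck feeds the next; its output is kept
        return self.cv2(torch.cat(y, 1))    # concatenate all branches and fuse with a 1x1 conv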

Decoupled-head
YOLOv8 follows YOLOX and YOLOv6 and uses a Decoupled Head, i.e., separate convolution branches for classification and regression. At the same time, because of the DFL idea, the number of channels in the regression head becomes 4 * reg_max:

As shown in the code below, reg is output by cv2 and cls by cv3. Like v5, YOLOv8 has three feature levels (8x, 16x and 32x downsampling), and the levels are traversed with for x in ch. GFL's experiments on COCO show that, measured on the 8x, 16x and 32x feature maps, the distance from the center point to the object boundary generally does not exceed 16 units, so self.reg_max is set to 16.

# regression branch: each level outputs 4 * self.reg_max channels (a discrete distribution for each of the four box sides)
self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch)
# classification branch: each level outputs self.nc class channels
self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
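
To see how the 4 * reg_max regression channels are turned back into four distances, here is a minimal sketch of the DFL "integral" decoding (a simplified stand-alone version, not the exact ultralytics DFL module): softmax over the reg_max bins gives a distribution per box side, and its expectation gives the distance.

import torch

def dfl_decode(reg_logits, reg_max=16):
    # reg_logits: (..., 4, reg_max) raw outputs of the regression branch
    prob = reg_logits.softmax(dim=-1)               # discrete distribution over the bins for each side
    bins = torch.arange(reg_max, dtype=prob.dtype)  # 0, 1, ..., reg_max - 1
    return (prob * bins).sum(dim=-1)                # expectation = predicted distance (in stride units)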


label assignment

The most important update in v8's label assignment is the anchor-free design together with Task-Alignment Learning borrowed from TOOD to align the cls and reg tasks, so let's go through TOOD's label assignment in detail.

A well-aligned anchor should predict a high classification score and a precise location. TOOD therefore designs a new anchor alignment metric, obtained by multiplying the cls score with the IoU between the predicted box and the GT; it serves as a quality estimate for each anchor and measures the degree of task alignment at the anchor level. The alignment metric is also integrated into the sample assignment and the loss function to dynamically optimize the prediction of each anchor.
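
Written out with the notation of the TOOD paper, where s is the predicted classification score and u is the IoU between the predicted box and the GT, the alignment metric is

t = s^α · u^β

with α and β controlling the relative weight of the two tasks (self.alpha and self.beta in the code below).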

    # forward pass of the Task-Aligned Assigner
    def forward(self, pd_scores, pd_bboxes, anc_points, gt_labels, gt_bboxes, mask_gt):
        self.bs = pd_scores.size(0)
        self.n_max_boxes = gt_bboxes.size(1)

        if self.n_max_boxes == 0:
            device = gt_bboxes.device
            return (torch.full_like(pd_scores[..., 0], self.bg_idx).to(device), torch.zeros_like(pd_bboxes).to(device),
                    torch.zeros_like(pd_scores).to(device), torch.zeros_like(pd_scores[..., 0]).to(device),
                    torch.zeros_like(pd_scores[..., 0]).to(device))

        mask_pos, align_metric, overlaps = self.get_pos_mask(pd_scores, pd_bboxes, gt_labels, gt_bboxes, anc_points,
                                                             mask_gt)

        target_gt_idx, fg_mask, mask_pos = select_highest_overlaps(mask_pos, overlaps, self.n_max_boxes)

        # assigned target
        target_labels, target_bboxes, target_scores = self.get_targets(gt_labels, gt_bboxes, target_gt_idx, fg_mask)

        # normalize
        align_metric *= mask_pos
        pos_align_metrics = align_metric.amax(axis=-1, keepdim=True)  # b, max_num_obj
        pos_overlaps = (overlaps * mask_pos).amax(axis=-1, keepdim=True)  # b, max_num_obj
        # norm_align_metric = (align_metric * pos_overlaps / (pos_align_metrics + self.eps)).amax(-2).unsqueeze(-1)
        norm_align_metric = (align_metric  / (pos_align_metrics + self.eps)).amax(-2).unsqueeze(-1)
        target_scores = target_scores * norm_align_metric

        return target_labels, target_bboxes, target_scores, fg_mask.bool(), target_gt_idx

In v8, label assignment is divided into three steps. First, self.get_pos_mask() computes the positive-sample mask, the IoU between GT and pred_bboxes, and the align_metric, where

align_metric = bbox_scores.pow(self.alpha) * overlaps.pow(self.beta)

mask_pos (the positive-sample mask) is obtained as the product mask_topk * mask_in_gts * mask_gt. mask_in_gts marks the anchor points that lie inside a GT: the offsets from the anchor center to the GT's top-left corner and from the GT's bottom-right corner to the anchor center are stacked into bbox_deltas, and when all values of bbox_deltas are greater than 0 the point lies inside the GT, which gives mask_in_gts.

bbox_deltas = torch.cat((xy_centers[None] - lt, rb - xy_centers[None]), dim=2).view(bs, n_boxes, n_anchors, -1)
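
Put together, a sketch of this in-GT test as a stand-alone function (the function name and signature are illustrative; in ultralytics this logic lives in select_candidates_in_gts):

import torch

def candidates_in_gts(xy_centers, gt_bboxes, eps=1e-9):
    # xy_centers: (h*w, 2) anchor center points; gt_bboxes: (b, n_boxes, 4) in xyxy format
    n_anchors = xy_centers.shape[0]
    bs, n_boxes, _ = gt_bboxes.shape
    lt, rb = gt_bboxes.view(-1, 1, 4).chunk(2, 2)  # left-top and right-bottom corners
    bbox_deltas = torch.cat((xy_centers[None] - lt, rb - xy_centers[None]), dim=2).view(bs, n_boxes, n_anchors, -1)
    return bbox_deltas.amin(3).gt_(eps)  # inside the GT iff all four distances are positive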

mask_topk is easy to understand: it keeps the topk anchors by align_metric. Since align_metric takes both cls and reg into account, it aligns the two tasks better; cls and reg are model predictions, and combining them gives a good estimate of which grid cells predict a GT well, so selecting positive sample points by the topk of align_metric is more reasonable.

    def get_pos_mask(self, pd_scores, pd_bboxes, gt_labels, gt_bboxes, anc_points, mask_gt):
        # get anchor_align metric, (b, max_num_obj, h*w)
        align_metric, overlaps = self.get_box_metrics(pd_scores, pd_bboxes, gt_labels, gt_bboxes)
        # get in_gts mask, (b, max_num_obj, h*w)
        mask_in_gts = select_candidates_in_gts(anc_points, gt_bboxes)
        # get topk_metric mask, (b, max_num_obj, h*w)
        mask_topk = self.select_topk_candidates(align_metric * mask_in_gts,
                                                topk_mask=mask_gt.repeat([1, 1, self.topk]).bool())
        # merge all mask to a final mask, (b, max_num_obj, h*w)
        mask_pos = mask_topk * mask_in_gts * mask_gt

        return mask_pos, align_metric, overlaps
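
The select_topk_candidates call above produces mask_topk. A simplified sketch of the underlying idea (ignoring the topk_mask handling of padded GTs; names are illustrative):

import torch

def topk_mask(metrics, topk=10):
    # metrics: (b, max_num_obj, h*w) alignment metric of every anchor w.r.t. every GT
    _, topk_idx = metrics.topk(topk, dim=-1)  # indices of the topk anchors per GT
    mask = torch.zeros_like(metrics)
    mask.scatter_(-1, topk_idx, 1.0)          # mark the selected anchors with 1
    return mask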

At this point we have the positive-sample mask, but GTs can overlap, so one point may be assigned to multiple GTs. This has to be resolved: the ambiguous point is assigned to the GT with which its predicted box has the highest IoU. The select_highest_overlaps function does exactly this.

def select_highest_overlaps(mask_pos, overlaps, n_max_boxes):
    # (b, n_max_boxes, h*w) -> (b, h*w)
    fg_mask = mask_pos.sum(-2)
    if fg_mask.max() > 1:  # one anchor is assigned to multiple gt_bboxes
        mask_multi_gts = (fg_mask.unsqueeze(1) > 1).repeat([1, n_max_boxes, 1])  # (b, n_max_boxes, h*w)
        max_overlaps_idx = overlaps.argmax(1)  # (b, h*w)
        is_max_overlaps = F.one_hot(max_overlaps_idx, n_max_boxes)  # (b, h*w, n_max_boxes)
        is_max_overlaps = is_max_overlaps.permute(0, 2, 1).to(overlaps.dtype)  # (b, n_max_boxes, h*w)
        mask_pos = torch.where(mask_multi_gts, is_max_overlaps, mask_pos)  # (b, n_max_boxes, h*w)
        fg_mask = mask_pos.sum(-2)
    # find each grid serve which gt(index)
    target_gt_idx = mask_pos.argmax(-2)  # (b, h*w)
    return target_gt_idx, fg_mask, mask_pos

mask_pos is summed over the n_max_boxes dimension; when fg_mask.max() > 1 there is an ambiguous point. We then find the mask of the ambiguous points, mask_multi_gts, and the index max_overlaps_idx of the GT with the highest IoU for each predicted box, turn max_overlaps_idx into one-hot form, and overwrite the ambiguous points with is_max_overlaps to remove the ambiguity. Finally, mask_pos.argmax(-2) gives target_gt_idx, i.e., which GT each point is assigned to.
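
A tiny toy example (made-up numbers) illustrates the behaviour, assuming the select_highest_overlaps above is available (it also needs import torch.nn.functional as F): anchor 1 is claimed by both GTs and is resolved to GT 1, whose predicted-box IoU is higher.

import torch

# 1 image, 2 GTs, 3 anchors; anchor 1 is claimed by both GTs
mask_pos = torch.tensor([[[1., 1., 0.],
                          [0., 1., 1.]]])      # (b, n_max_boxes, h*w)
overlaps = torch.tensor([[[0.6, 0.3, 0.0],
                          [0.0, 0.7, 0.5]]])   # IoU of each GT with each anchor's predicted box
target_gt_idx, fg_mask, mask_pos = select_highest_overlaps(mask_pos, overlaps, 2)
print(target_gt_idx)  # tensor([[0, 1, 1]]) -> anchor 1 goes to GT 1 (IoU 0.7 > 0.3)
print(fg_mask)        # tensor([[1., 1., 1.]])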

Finally, the targets used to compute the loss are gathered according to target_gt_idx. The logic of get_targets is fairly straightforward and is not described in detail. With that, v8's label assignment has been covered.


    def get_targets(self, gt_labels, gt_bboxes, target_gt_idx, fg_mask):
        """
        Args:
            gt_labels: (b, max_num_obj, 1)
            gt_bboxes: (b, max_num_obj, 4)
            target_gt_idx: (b, h*w)
            fg_mask: (b, h*w)
        """

        # assigned target labels, (b, 1)
        batch_ind = torch.arange(end=self.bs, dtype=torch.int64, device=gt_labels.device)[..., None]
        target_gt_idx = target_gt_idx + batch_ind * self.n_max_boxes  # (b, h*w)
        target_labels = gt_labels.long().flatten()[target_gt_idx]  # (b, h*w)

        # assigned target boxes, (b, max_num_obj, 4) -> (b, h*w, 4)
        target_bboxes = gt_bboxes.view(-1, 4)[target_gt_idx]

        # assigned target scores
        target_labels.clamp_(0)  # in-place clamp so the labels are valid class indices for one_hot
        target_scores = F.one_hot(target_labels, self.num_classes)  # (b, h*w, 80)
        fg_scores_mask = fg_mask[:, :, None].repeat(1, 1, self.num_classes)  # (b, h*w, 80)
        target_scores = torch.where(fg_scores_mask > 0, target_scores, 0)

        return target_labels, target_bboxes, target_scores

Loss
For YOLOv8, the classification loss is VFL Loss (though, as noted above, it is not enabled in actual training), and the regression loss takes the form CIoU Loss + DFL, where reg_max defaults to 16.

The main improvement of VFL is an asymmetric weighting scheme, whereas FL and QFL treat positive and negative samples symmetrically. The idea of asymmetric weighting comes from the PISA paper, which points out that besides the imbalance between positive and negative samples, the positive samples themselves should not be weighted equally, because mAP is mainly determined by the positive samples. VFL therefore emphasizes the positive samples: positives use plain BCE while negatives use FL to attenuate the loss.
The VFL loss is defined as ($p$ is the predicted score, $q$ the target score):

$$\mathrm{VFL}(p,q)=\begin{cases}-q\big(q\log p+(1-q)\log(1-p)\big), & q>0\\ -\alpha p^{\gamma}\log(1-p), & q=0\end{cases}$$
In the formula above, p is the predicted score and q is the target: for positive samples q is the value computed from norm_align_metric, and for negative samples q = 0. In fact, for positive samples FL is not used; it is ordinary BCE with an extra adaptive norm_align_metric weighting to highlight the main samples. For negative samples it is the standard FL. Clearly VFL is simpler than QFL; its main features are the asymmetric weighting of positive and negative samples and the emphasis on positives as the main samples.
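
As a sketch (following the shape of ultralytics' VarifocalLoss but simplified, so it may not match the source line by line), the asymmetric weighting can be written as:

import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, label, alpha=0.75, gamma=2.0):
    # pred_logits:  raw classification logits
    # target_score: soft target q (norm_align_metric / IoU for positives, 0 for negatives)
    # label:        1 for positive samples, 0 for negatives
    weight = alpha * pred_logits.sigmoid().pow(gamma) * (1 - label) + target_score * label
    loss = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction='none') * weight
    return loss.sum()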

DFL (Distribution Focal Loss) changes coordinate regression from predicting a single value to outputting n+1 values, each representing the probability of the corresponding regression distance, and then uses the integral (expectation) to obtain the final distance. DFL models the box location as a general distribution and lets the network quickly focus on the distribution of locations close to the target location.
$$\mathrm{DFL}(S_i,S_{i+1})=-\big((y_{i+1}-y)\log S_i+(y-y_i)\log S_{i+1}\big)$$

where $y_i$ and $y_{i+1}$ are the integer positions to the left and right of the continuous target $y$, and $S_i$, $S_{i+1}$ are their predicted probabilities.
DFL lets the network focus faster on the values near the target y and increase their probabilities.
The meaning of DFL is that, in cross-entropy form, it optimizes the probabilities of the two positions closest to the label y, one on the left and one on the right, so that the network concentrates on the distribution in the neighborhood of the target position faster. In other words, the learned distribution should in theory peak around the real floating-point coordinate, and the weights of the left and right integer positions are obtained by linear interpolation.
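
A minimal sketch of this loss for one box side (assuming the continuous target has already been clamped into [0, reg_max - 1); names are illustrative):

import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_dist, target):
    # pred_dist: (n, reg_max) logits over the discretized distances for one box side
    # target:    (n,) continuous regression target, assumed clamped to [0, reg_max - 1)
    tl = target.long()            # left integer neighbour y_i
    tr = tl + 1                   # right integer neighbour y_{i+1}
    wl = tr.float() - target      # linear-interpolation weight of the left bin
    wr = 1.0 - wl                 # weight of the right bin
    return (F.cross_entropy(pred_dist, tl, reduction='none') * wl +
            F.cross_entropy(pred_dist, tr, reduction='none') * wr).mean()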

Thoughts

Compared with v5, the main updates in v8 are the new label assignment and the corresponding loss: positive and negative samples are chosen with a dynamic strategy (topk of the metric) to improve the consistency between the output cls and reg, and the loss uses the metric as a soft label. To explore how much the soft label contributes to detection performance, I ran some ablation experiments.
(evaluation results for label 1)

Label 1: the positive sample points selected by v8 are given hard labels, i.e., the classification targets are all set to 1.

(evaluation results for label 2)

Label 2: the metric is used as a soft label for the classification target, and VFL is used as the classification loss.

Comparing label 1 and label 2, the soft label and hard label perform similarly on AP50, but the mAP of the soft label is 1.6% higher than that of the hard label. The soft label's AP75 and AR are also higher, which indicates better consistency between cls and reg on hard samples: anchors with high cls scores also regress coordinates accurately, and the boxes kept after post-processing have a larger IoU with the GT. But why does the soft label have this effect? We can look at the heat maps of their scores and regressions.

(heat-map visualizations: hard-label vs. soft-label IoU heat maps, and hard-label vs. soft-label cls heat maps)

I visualized the cls score of each anchor and the IoU between its regressed box and the GT in the final output. The confidence of the soft label is slightly lower than that of the hard label, but it is roughly Gaussian overall and its heat map also covers the difficult samples. The hard label's confidence is higher, but its heat map is irregular and the index of the maximum confidence may drift away from the center. Meanwhile, the soft-label IoU heat map generally covers a larger area than the hard-label one and has a higher IoU with the GT. Therefore the soft label gives a higher mAP than the hard label and also helps on difficult samples.


Original post: blog.csdn.net/litt1e/article/details/128804663