Object Detection Loss Functions, with YOLOS and DETR as Examples

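Apart from YOLOS having no convolutional backbone, YOLOS and DETR work almost identically, and in particular they compute their loss the same way.
See the official Hugging Face (HF) documentation.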

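Because an object detection model outputs hundreds or even thousands of candidate boxes, computing the loss is fairly involved. The loss used is the bipartite matching loss; see this blog post for reference.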
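For reference, the matching objective can be written as follows. This uses the notation of the DETR paper rather than anything defined in this post: N corresponds to num_detection_tokens, y_i is the i-th target (the target set is padded with "no object" entries up to N), and \hat{y}_{\sigma(i)} is the prediction assigned to it by the permutation \sigma.

```latex
\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)
```

The pairwise cost \mathcal{L}_{\text{match}} is the weighted sum of the classification, L1 and GIoU costs described in the steps below.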

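A target is a dictionary holding class_labels and boxes. Suppose that for one image we have 5 target boxes.
num_detection_tokens is the maximum number of boxes the model can produce for one image.
The loss computation proceeds roughly as follows: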
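As a concrete illustration, a target for one image with 5 boxes might look like the sketch below. The label values and coordinates are made up, and the boxes are assumed to be in normalized (center_x, center_y, width, height) format, which is what the matcher code later in this post converts from.

```python
import torch

# Hypothetical annotation for one image with 5 ground-truth boxes.
# Boxes are assumed to be normalized (center_x, center_y, width, height).
target = {
    "class_labels": torch.tensor([0, 2, 2, 5, 1]),  # shape [5]
    "boxes": torch.tensor(
        [
            [0.50, 0.50, 0.20, 0.30],
            [0.25, 0.40, 0.10, 0.10],
            [0.70, 0.30, 0.15, 0.25],
            [0.60, 0.80, 0.30, 0.20],
            [0.10, 0.90, 0.05, 0.10],
        ]
    ),  # shape [5, 4]
}

# The matcher and the loss take one such dict per image in the batch.
targets = [target]
num_target_box = len(target["boxes"])
```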

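  1. The ViT model takes the preprocessed image as input and outputs the last hidden state, of shape [batch_size, seq_len, hidden_size].

  2. Take the hidden states of the last num_detection_tokens tokens, giving a tensor of shape
    [batch_size, num_detection_tokens, hidden_size]. The classification and box heads turn each of these tokens into one class prediction and one box.

  3. The model outputs num_detection_tokens boxes while the target has only 5, so predictions and targets must be matched one-to-one.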

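  4. The matching procedure:

    1. Compute 3 cost matrices, each of shape [num_detection_tokens, num_target_box]. Each element is the cost between one prediction and one target, so each matrix holds the pairwise costs over all prediction-target pairs.
    2. The 3 cost matrices are the classification cost (the negative predicted probability of the target class, which the matcher uses as an approximation of the cross-entropy loss), the box coordinate cost (the L1 distance between the 4 values that describe a box), and the GIoU cost (the negative GIoU between two boxes).
    3. A weighted sum of the three gives the overall cost matrix, again of shape [num_detection_tokens, num_target_box].
    4. Apply linear_sum_assignment to this matrix to obtain the matching with minimum total cost (that is, pick 5 elements of the cost matrix, no two in the same row or column, whose sum is minimal). The matching is expressed as index pairs, min(num_detection_tokens, num_target_box) of them; 5 in this example.
  5. Given this matching, compute the loss between the matched predictions and targets. In this example the box losses are computed over the 5 matched pairs and summed, while the classification loss is a cross-entropy over all predictions, with every unmatched prediction assigned the "no object" class. The final loss is the weighted sum of the 3 kinds of loss described above.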

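  6. There are in fact two more losses:

    1. The "cardinality" loss: the L1 error between the number of predictions, among the num_detection_tokens outputs, whose class_label is not "no object" and num_target_box. Put simply, apart from the 5 boxes with a real class, all other boxes should be classified as "no object", so that the model does not detect too many objects. This loss produces no gradient and is used only for monitoring.
    2. The mask loss: its purpose is not yet clear to me.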
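The matching in step 4 can be sketched on toy data as follows. This is a NumPy illustration under stated assumptions, not the library implementation: the 6 predictions, 2 targets, the probabilities, and the helper names to_corners and pairwise_giou are all made up for the example, and the three costs are weighted equally.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def to_corners(b):
    # (cx, cy, w, h) -> (x0, y0, x1, y1)
    cx, cy, w, h = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)


def pairwise_giou(a, b):
    # a: [N, 4], b: [M, 4], both in corner format; returns an [N, M] matrix.
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(rb - lt, 0, None).prod(axis=2)
    union = area_a[:, None] + area_b[None, :] - inter
    # Smallest box enclosing each pair.
    hull_lt = np.minimum(a[:, None, :2], b[None, :, :2])
    hull_rb = np.maximum(a[:, None, 2:], b[None, :, 2:])
    hull = (hull_rb - hull_lt).prod(axis=2)
    return inter / union - (hull - union) / hull


# Toy predictions: 6 detection tokens, 2 real classes plus "no object" (index 2).
probs = np.array([[0.1, 0.1, 0.8], [0.9, 0.05, 0.05], [0.3, 0.3, 0.4],
                  [0.1, 0.1, 0.8], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])
pred_boxes = np.array([[0.1, 0.1, 0.1, 0.1], [0.7, 0.6, 0.3, 0.4],
                       [0.5, 0.5, 0.9, 0.9], [0.9, 0.2, 0.1, 0.2],
                       [0.31, 0.29, 0.2, 0.22], [0.2, 0.8, 0.2, 0.2]])
# Toy targets: 2 ground-truth boxes.
target_labels = np.array([1, 0])
target_boxes = np.array([[0.3, 0.3, 0.2, 0.2], [0.7, 0.6, 0.3, 0.4]])

# Three [num_pred, num_target] cost matrices.
class_cost = -probs[:, target_labels]
bbox_cost = np.abs(pred_boxes[:, None, :] - target_boxes[None, :, :]).sum(-1)
giou_cost = -pairwise_giou(to_corners(pred_boxes), to_corners(target_boxes))

cost_matrix = class_cost + bbox_cost + giou_cost  # equal weights, for illustration
rows, cols = linear_sum_assignment(cost_matrix)   # rows: predictions, cols: targets
pairs = sorted(zip(rows.tolist(), cols.tolist()))
```

Here prediction 1 is an exact copy of target 1 and prediction 4 is a slightly shifted copy of target 0, so those are the two pairs the assignment selects; the other four predictions stay unmatched.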

The official matching function, which uses the Hungarian algorithm:

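import torch
from scipy.optimize import linear_sum_assignment
from torch import nn
from transformers.image_transforms import center_to_corners_format
from transformers.utils import requires_backends

# Note: generalized_box_iou is a helper assumed to be defined alongside this class in the transformers source.
# Copied from transformers.models.detr.modeling_detr.DetrHungarianMatcher with Detr->Yolos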
class YolosHungarianMatcher(nn.Module):
    """
    This class computes an assignment between the targets and the predictions of the network.

    For efficiency reasons, the targets don't include the no_object. Because of this, in general, there are more
    predictions than targets. In this case, we do a 1-to-1 matching of the best predictions, while the others are
    un-matched (and thus treated as non-objects).

    Args:
        class_cost:
            The relative weight of the classification error in the matching cost.
        bbox_cost:
            The relative weight of the L1 error of the bounding box coordinates in the matching cost.
        giou_cost:
            The relative weight of the giou loss of the bounding box in the matching cost.
    """

    def __init__(self, class_cost: float = 1, bbox_cost: float = 1, giou_cost: float = 1):
        super().__init__()
        requires_backends(self, ["scipy"])

        self.class_cost = class_cost
        self.bbox_cost = bbox_cost
        self.giou_cost = giou_cost
        if class_cost == 0 and bbox_cost == 0 and giou_cost == 0:
            raise ValueError("All costs of the Matcher can't be 0")

    @torch.no_grad()
    def forward(self, outputs, targets):
        """
        Args:
            outputs (`dict`):
                A dictionary that contains at least these entries:
                * "logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                * "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates.
            targets (`List[dict]`):
                A list of targets (len(targets) = batch_size), where each target is a dict containing:
                * "class_labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of
                  ground-truth objects in the target) containing the class labels
                * "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates.

        Returns:
            `List[Tuple]`: A list of size `batch_size`, containing tuples of (index_i, index_j) where:
            - index_i is the indices of the selected predictions (in order)
            - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds: len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        batch_size, num_queries = outputs["logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        target_ids = torch.cat([v["class_labels"] for v in targets])
        target_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it with 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, so it can be omitted.
        class_cost = -out_prob[:, target_ids]

        # Compute the L1 cost between boxes
        bbox_cost = torch.cdist(out_bbox, target_bbox, p=1)

        # Compute the giou cost between boxes
        giou_cost = -generalized_box_iou(center_to_corners_format(out_bbox), center_to_corners_format(target_bbox))

        # Final cost matrix
        cost_matrix = self.bbox_cost * bbox_cost + self.class_cost * class_cost + self.giou_cost * giou_cost
        cost_matrix = cost_matrix.view(batch_size, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(cost_matrix.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
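To show how the index pairs returned by the matcher are consumed, here is a minimal sketch of the classification loss, the box L1 loss and the cardinality error for a single image. The sizes, the random logits and the hard-coded matching are assumptions made for the example. It also leaves out the GIoU loss and the down-weighting of the "no object" class, so it is a simplification rather than the library's loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_queries, num_classes = 6, 2  # toy sizes; class index 2 means "no object"
logits = torch.randn(num_queries, num_classes + 1)
pred_boxes = torch.rand(num_queries, 4)

target_labels = torch.tensor([1, 0])
target_boxes = torch.tensor([[0.3, 0.3, 0.2, 0.2], [0.7, 0.6, 0.3, 0.4]])

# Pretend the matcher returned these (prediction index, target index) pairs.
pred_idx = torch.tensor([1, 4])
tgt_idx = torch.tensor([1, 0])

# Classification: every query gets a label; unmatched queries get "no object".
target_classes = torch.full((num_queries,), num_classes, dtype=torch.int64)
target_classes[pred_idx] = target_labels[tgt_idx]
loss_ce = F.cross_entropy(logits, target_classes)

# Box L1 loss: only over the matched pairs, normalized by the number of targets.
loss_bbox = F.l1_loss(
    pred_boxes[pred_idx], target_boxes[tgt_idx], reduction="sum"
) / len(target_boxes)

# Cardinality error: |#predictions that are not "no object" - #targets|.
# argmax is not differentiable, so this value carries no gradient.
num_pred_objects = (logits.argmax(-1) != num_classes).sum()
cardinality_error = (num_pred_objects.float() - len(target_labels)).abs()
```

Only the two matched queries contribute to loss_bbox, while all six contribute to loss_ce; this is what pushes unmatched queries toward the "no object" class.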

Object detection involves many more details; I will cover them in later updates.


Reposted from blog.csdn.net/qq_51750957/article/details/129114388