VarifocalNet: An IoU-aware Dense Object Detector (CVPR 2021) Principle and Code Analysis

Paper: VarifocalNet: An IoU-aware Dense Object Detector

Official implementation: https://github.com/hyz-xmaster/VarifocalNet

Third-party implementation: mmdetection/vfnet_head.py (open-mmlab/mmdetection on GitHub)

Background

Most current object detection models first generate a set of redundant detection boxes and then rely on NMS to filter out duplicate boxes for the same object. NMS typically ranks the boxes by their classification scores, but this can hurt performance: the classification score is not always a good indicator of how accurately a box is localized, so a well-localized box with a low classification score may be wrongly suppressed by NMS.

To address this problem, existing detectors predict an additional IoU score or centerness score as a measure of localization accuracy and multiply it with the classification score to obtain the ranking metric used in NMS. These methods can alleviate the misalignment between classification score and localization accuracy, but the result is still sub-optimal, because multiplying two imperfect predictions can produce a worse estimate, and the authors show experimentally that this approach has a limited performance upper bound. Moreover, adding an extra network branch to predict the localization score is not an elegant solution and brings extra computation.

Contributions of This Paper

To overcome the above problems, a natural question arises: instead of predicting an additional localization accuracy score, can we fold it into the classification score? That is, predict a localization-aware, or IoU-aware, classification score (IACS) that simultaneously represents the classification confidence of an object and the localization accuracy of its box.

The contributions of this paper are as follows:

  1. This paper demonstrates that accurately ranking candidate detection boxes is critical to the performance of dense detectors, and that the IACS achieves a better ranking than other metrics.
  2. This paper proposes a new loss function, Varifocal Loss, to train the model to regress the IACS.
  3. This paper designs a star-shaped bounding box feature representation, used both to predict the IACS and to refine the detection box.
  4. Based on FCOS+ATSS and the new components above, a new object detector, VarifocalNet (VFNet for short), is built.

The overall method of this paper is illustrated in the figure below.

Motivation

The authors first investigate the performance ceiling of the FCOS model, identify its main performance bottleneck, and demonstrate the importance of using an IoU-aware classification score as the NMS ranking metric. To study the performance upper bound of FCOS+ATSS, the authors alternately replace the classification score, the distance offsets, and the centerness score predicted for the foreground points before NMS with the corresponding ground-truth values, and evaluate the AP on COCO val2017. For the classification score there are two options: replace the element at the gt class position with either 1 or the IoU between the predicted box and the corresponding gt box (i.e., gt_IoU). Similarly, besides replacing the centerness score with its true value, replacing it with gt_IoU is also considered.

The results are shown in Table 1. The original FCOS+ATSS obtains 39.2 AP. Replacing the centerness with its true value (gt_ctr) increases AP by only about 2.0 points, and replacing the centerness score with gt_IoU (gt_ctr_iou) only reaches 43.5 AP. This shows that neither multiplying the classification score by the predicted centerness nor multiplying it by a predicted IoU score can bring a significant performance gain.

In contrast, substituting the ground-truth box (gt_bbox) for the predicted box achieves 56.1 AP even without the centerness score (w/o ctr). If instead the classification score at the gt class position is replaced with 1, whether centerness is used becomes very important (43.1 AP vs. 58.1 AP), because centerness can to some extent distinguish accurate from inaccurate boxes.

The most striking result comes from replacing the classification score with gt_IoU (gt_cls_iou): without centerness, AP reaches 74.7, significantly higher than the other settings. This indicates that the large pool of candidate boxes already contains accurately localized ones, and the key to high detection accuracy is to pick those high-quality boxes out of the pool. The results above show that replacing the classification score with the gt IoU is the most effective ranking metric; the authors call this score the IoU-aware Classification Score (IACS).

Method Introduction

Based on the above observations, the authors build a new detector, VarifocalNet, on top of FCOS+ATSS and remove the centerness branch. Compared with the original FCOS+ATSS, VFNet has three new components: the varifocal loss, the star-shaped bounding box feature representation, and bounding box refinement.

IACS - IoU-Aware Classification Score

The value at the gt class position of the classification target vector is changed from 1 to the IoU between the predicted box and the corresponding gt box, while all other positions remain 0.
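
As a rough illustration (not the mmdetection implementation; the function name, arguments, and shapes below are made up for this sketch), the IACS target matrix for a set of points could be built like this:

import torch


def build_iacs_targets(num_points, num_classes, pos_inds, pos_labels, pos_ious):
    """Build IoU-aware classification score (IACS) targets.

    pos_inds:   indices of the foreground points
    pos_labels: gt class index of each foreground point
    pos_ious:   IoU between each predicted box and its matched gt box
    """
    targets = torch.zeros(num_points, num_classes)
    # foreground: the gt class position stores the IoU instead of 1
    targets[pos_inds, pos_labels] = pos_ious
    # background rows stay all-zero
    return targets


# toy usage: 5 points, 3 classes, points 1 and 3 are positives
targets = build_iacs_targets(5, 3, torch.tensor([1, 3]), torch.tensor([0, 2]),
                             torch.tensor([0.82, 0.56]))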

Varifocal Loss

The authors borrow the weighting idea of focal loss to deal with the extreme class imbalance that arises when regressing the continuous IACS during training. Unlike focal loss, however, positive and negative samples are treated asymmetrically, as follows:

\(VFL(p,q)=\begin{cases}-q\left(q\log(p)+(1-q)\log(1-p)\right), & q>0\\-\alpha p^{\gamma}\log(1-p), & q=0\end{cases}\)   (2)

where \(p\) is the predicted IACS and \(q\) is the target score: for a foreground point, \(q\) is the IoU between the predicted box and its gt box (gt_IoU); for a background point, \(q\) is 0.

As formula (2) shows, the varifocal loss only down-weights the contribution of negative samples (q=0) via the factor \(p^{\gamma}\); it does not down-weight positive samples in the same way, because positives are far fewer than negatives and their precious learning signal should be preserved. On the other hand, inspired by PISA, the positive samples are weighted by their target \(q\): if a positive sample has a large gt_IoU, its contribution to the loss is correspondingly larger. This makes the model focus more on high-quality positive samples, which yields a higher AP.
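
A small numeric illustration of this asymmetric weighting, computed directly from formula (2) (a sketch, not code from the repository):

import math

alpha, gamma = 0.75, 2.0

# negative sample (q = 0): plain BCE would be -log(1 - p) ~= 2.30 for p = 0.9,
# but varifocal loss scales it by alpha * p**gamma = 0.6075, giving ~1.40
p = 0.9
neg_loss = alpha * p ** gamma * -math.log(1 - p)

# positive sample (q = 0.8): BCE with the soft label q, scaled by q itself,
# so higher-IoU positives contribute more to the loss
q, p = 0.8, 0.6
pos_loss = q * -(q * math.log(p) + (1 - q) * math.log(1 - p))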

Star-Shaped Box Feature Representation

The authors design a star-shaped bounding box feature representation, shown as the yellow circles in Figure 1, which uses a deformable convolution to represent a box with the features at nine fixed sampling points. This representation captures the geometry of the bounding box and its nearby contextual information, which is important for encoding the misalignment between the predicted box and the ground-truth box.

Specifically, given a point \((x,y)\) on the feature map, an initial box is first regressed with a 3x3 convolution. As in FCOS, this box is encoded by a 4-dimensional vector \((l',t',r',b')\), which gives the distances from the point to the left, top, right, and bottom sides of the box. From this distance vector, nine sampling points are chosen: \((x,y)\), \((x-l',y)\), \((x,y-t')\), \((x+r',y)\), \((x,y+b')\), \((x-l',y-t')\), \((x+r',y-t')\), \((x-l',y+b')\), \((x+r',y+b')\), which are then projected onto the feature map. Their offsets relative to the point \((x,y)\) are used as the offsets of a deformable convolution, so the features at these nine points are aggregated by a single deformable convolution to represent the box.
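
The snippet below simply enumerates the nine sampling points from one location and its distance vector (an illustration only; the offset computation actually used in mmdetection is the star_dcn_offset function shown later in this post):

def star_points(x, y, l, t, r, b):
    """Nine star-shaped sampling points for a box encoded as distances
    (l, t, r, b) from the point (x, y) to its four sides."""
    xs = (x - l, x, x + r)   # left edge, center column, right edge
    ys = (y - t, y, y + b)   # top edge, center row, bottom edge
    return [(px, py) for py in ys for px in xs]


# a point at (10, 20) whose box extends 4 left, 3 up, 6 right, 5 down
print(star_points(10, 20, l=4, t=3, r=6, b=5))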

Bounding Box Refinement

The authors further improve localization accuracy with a box refinement step. Box refinement has been used in Cascade R-CNN and single-shot refinement networks, but it is rarely used in dense detectors due to the lack of an effective object descriptor; with the star-shaped representation proposed in this paper, it can be applied efficiently in a dense detector.

The refinement is modeled as a residual learning problem. For an initially regressed box \((l',t',r',b')\), the star-shaped representation is first extracted as its encoding. Based on this representation, four distance scaling factors \((\triangle l,\triangle t,\triangle r,\triangle b)\) are learned to scale the initial distance vector, so the refined box is \((l,t,r,b)=(\triangle l\times l',\triangle t\times t',\triangle r\times r',\triangle b\times b')\).
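
In tensor form the refinement is just an element-wise product of the predicted scaling factors and the (detached) initial distances; a minimal sketch with made-up tensors:

import torch

# (N, 4, H, W): initial distances (l', t', r', b') and predicted scaling factors
bbox_pred = torch.rand(2, 4, 38, 38) * 64
scale_factors = torch.rand(2, 4, 38, 38) + 0.5   # (Δl, Δt, Δr, Δb)

# refined distances; detach() keeps the refinement a residual on top of a
# fixed initial prediction, matching the mmdetection head shown below
bbox_pred_refine = scale_factors * bbox_pred.detach()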

VarifocalNet

Add the above three parts to FCOS and remove the centerness branch to get the VarifocalNet proposed in this paper.

The complete structure of VFNet is shown in Figure 3. The backbone and FPN of VFNet are the same as in FCOS; the difference lies in the head. The VFNet head contains two sub-networks. The localization sub-network performs bounding box regression and the subsequent refinement. It takes the feature map of each FPN level as input and first applies three 3x3 convolutions with ReLU activations, producing a 256-channel feature map. One branch of the localization sub-network then applies another convolution to output a 4-dimensional distance vector \((l',t',r',b')\) at every spatial position, representing the initial box. Given the initial box and the feature map produced by the three 3x3 convolutions, the other branch applies a deformable convolution over the nine star-shaped sampling points and outputs the distance scaling factors \((\triangle l,\triangle t,\triangle r,\triangle b)\), which are multiplied with the initial distance vector to give the refined box \((l,t,r,b)\).

The other sub-network predicts the IACS. Its structure is similar to that of the localization sub-network, except that its output has \(C\) channels (the number of classes), where each element is a joint representation of object presence confidence and localization accuracy.
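
To make the head structure concrete, below is a rough sketch of its layers in plain PyTorch, with torchvision's DeformConv2d standing in for mmcv's deformable convolution; the module names mirror those in vfnet_head.py, but the normalization choice and other details are assumptions, not the actual mmdetection code.

import torch.nn as nn
from torchvision.ops import DeformConv2d  # stand-in for mmcv's deformable conv


class VFNetHeadSketch(nn.Module):
    """Rough sketch of the VFNet head layers in Figure 3 (illustrative only)."""

    def __init__(self, num_classes, feat_channels=256, num_stacked_convs=3):
        super().__init__()

        def conv_tower():
            layers = []
            for _ in range(num_stacked_convs):
                layers += [nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
                           nn.GroupNorm(32, feat_channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        self.cls_convs = conv_tower()   # classification tower (three 3x3 convs)
        self.reg_convs = conv_tower()   # regression tower (three 3x3 convs)
        # initial box branch: 3x3 conv + 3x3 conv -> (l', t', r', b')
        self.vfnet_reg_conv = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)
        self.vfnet_reg = nn.Conv2d(feat_channels, 4, 3, padding=1)
        # star-shaped branches: deformable 3x3 convs driven by the 9-point offsets
        # (the offset tensor has 2 * 9 = 18 channels)
        self.vfnet_reg_refine_dconv = DeformConv2d(feat_channels, feat_channels, 3, padding=1)
        self.vfnet_reg_refine = nn.Conv2d(feat_channels, 4, 3, padding=1)   # (Δl, Δt, Δr, Δb)
        self.vfnet_cls_dconv = DeformConv2d(feat_channels, feat_channels, 3, padding=1)
        self.vfnet_cls = nn.Conv2d(feat_channels, num_classes, 3, padding=1)  # IACS output
        self.relu = nn.ReLU(inplace=True)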

Loss Function and Inference

The loss function of VFNet is as follows:

\(Loss=\frac{1}{N_{pos}}\sum_{i}\sum_{c}VFL(p_{c,i},q_{c,i})+\frac{\lambda_{0}}{N_{pos}}\sum_{i}q_{c^{*},i}L_{bbox}(bbox_{i}',bbox_{i}^{*})+\frac{\lambda_{1}}{N_{pos}}\sum_{i}q_{c^{*},i}L_{bbox}(bbox_{i},bbox_{i}^{*})\)

where \(p_{c,i}\) and \(q_{c,i}\) are the predicted and target IACS for class \(c\) at location \(i\) on each FPN level, \(L_{bbox}\) is the GIoU loss, and \(bbox_{i}',bbox_{i},bbox_{i}^{*}\) are the initial, refined, and gt boxes, respectively. The training target \(q_{c^{*},i}\) is used to weight \(L_{bbox}\); it equals gt_IoU for a foreground point and 0 for background. \(\lambda_{0}\) and \(\lambda_{1}\) are weight coefficients, set to 1.5 and 2.0 in this paper, and \(N_{pos}\) is the total number of foreground points.
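
A sketch of how these three terms could be combined, assuming the varifocal_loss function shown later in this post and a hypothetical per-box giou_loss helper; the argument names are illustrative and all inputs are assumed to be flattened over FPN levels and spatial positions:

def vfnet_loss(cls_score, iacs_target, bbox_init, bbox_refine, bbox_gt,
               q_star, num_pos, giou_loss, lambda0=1.5, lambda1=2.0):
    """Total VFNet loss following the formula above (sketch only).

    giou_loss is a hypothetical helper returning a per-box GIoU loss.
    """
    # classification term: varifocal loss over all points and classes
    loss_cls = varifocal_loss(cls_score, iacs_target, reduction='sum') / num_pos
    # both bbox terms are weighted by the IACS target q* (gt_IoU for positives, 0 otherwise)
    loss_bbox_init = (q_star * giou_loss(bbox_init, bbox_gt)).sum() / num_pos
    loss_bbox_refine = (q_star * giou_loss(bbox_refine, bbox_gt)).sum() / num_pos
    return loss_cls + lambda0 * loss_bbox_init + lambda1 * loss_bbox_refine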

Experimental Results

The authors first determine the values of the two hyperparameters \(\alpha\) and \(\gamma\) of the varifocal loss through experiments; the results are shown below. The accuracy is highest when \(\alpha=0.75\) and \(\gamma=2\).

The contribution of each component is then investigated, with the results below. All three components improve performance, and the best result is obtained when they are used together.

 

Finally, the comparison with other state-of-the-art methods is shown below. Under the same configuration (backbone, whether DCN is used, multi-scale training, etc.), VFNet achieves the highest accuracy.

Code Analysis

Here we take the implementation in mmdetection as an example to explain the details. With input_shape=(2, 3, 300, 300) and backbone='resnet-50', the output sizes of P3~P7 after the FPN are [(2,256,38,38), (2,256,19,19), (2,256,10,10), (2,256,5,5), (2,256,3,3)]. The novel parts of VFNet are all in the head, as shown in Figure 3. Taking the output of P3 as an example, the complete implementation of the head is as follows.

def forward_single(self, x, scale, scale_refine, stride, reg_denom):
    """Forward features of a single scale level.

    Args:
        x (Tensor): FPN feature maps of the specified stride.
        scale (:obj: `mmcv.cnn.Scale`): Learnable scale module to resize
            the bbox prediction.
        scale_refine (:obj: `mmcv.cnn.Scale`): Learnable scale module to
            resize the refined bbox prediction.
        stride (int): The corresponding stride for feature maps,
            used to normalize the bbox prediction when
            bbox_norm_type = 'stride'.
        reg_denom (int): The corresponding regression range for feature
            maps, only used to normalize the bbox prediction when
            bbox_norm_type = 'reg_denom'.

    Returns:
        tuple: iou-aware cls scores for each box, bbox predictions and
            refined bbox predictions of input feature maps.
    """
    cls_feat = x  # (2,256,38,38)
    reg_feat = x

    for cls_layer in self.cls_convs:  # three 3x3 convs
        cls_feat = cls_layer(cls_feat)
    # (2,256,38,38)

    for reg_layer in self.reg_convs:  # three 3x3 convs
        reg_feat = reg_layer(reg_feat)
    # (2,256,38,38)

    # predict the bbox_pred of different level
    reg_feat_init = self.vfnet_reg_conv(reg_feat)  # 3x3conv, (2,256,38,38)
    if self.bbox_norm_type == 'reg_denom':
        bbox_pred = scale(
            self.vfnet_reg(reg_feat_init)).float().exp() * reg_denom  # 3x3conv, 64, (2,4,38,38)
    elif self.bbox_norm_type == 'stride':
        bbox_pred = scale(
            self.vfnet_reg(reg_feat_init)).float().exp() * stride
    else:
        raise NotImplementedError

    # compute star deformable convolution offsets
    # converting dcn_offset to reg_feat.dtype thus VFNet can be
    # trained with FP16
    dcn_offset = self.star_dcn_offset(bbox_pred, self.gradient_mul,
                                      stride).to(reg_feat.dtype)  # _, 0.1, 8, (2,18,38,38)

    # refine the bbox_pred
    reg_feat = self.relu(self.vfnet_reg_refine_dconv(reg_feat, dcn_offset))  # (2,256,38,38)
    bbox_pred_refine = scale_refine(
        self.vfnet_reg_refine(reg_feat)).float().exp()  # (2,4,38,38)
    bbox_pred_refine = bbox_pred_refine * bbox_pred.detach()  # (2,4,38,38)

    # predict the iou-aware cls score
    cls_feat = self.relu(self.vfnet_cls_dconv(cls_feat, dcn_offset))  # (2,256,38,38)
    cls_score = self.vfnet_cls(cls_feat)  # (2,20,38,38)

    if self.training:
        return cls_score, bbox_pred, bbox_pred_refine
    else:
        return cls_score, bbox_pred_refine

The classification and regression sub-networks both start with three consecutive 3x3 convolutions, self.cls_convs and self.reg_convs in the code. The lower branch of the regression sub-network goes through a 3x3 convolution self.vfnet_reg_conv and then an offset-prediction 3x3 convolution self.vfnet_reg to obtain the initial bounding box prediction bbox_pred, the orange feature map in the middle of Figure 3, with shape (2, 4, 38, 38). What is predicted here is the distance from each point to the four sides of the corresponding box. Then, from the point coordinates and these four distances, the nine points of the star-shaped representation in Figure 1 are obtained, which is implemented by the function self.star_dcn_offset, shown below.

def star_dcn_offset(self, bbox_pred, gradient_mul, stride):
    """Compute the star deformable conv offsets.

    Args:
        bbox_pred (Tensor): Predicted bbox distance offsets (l, r, t, b).
            Note: the values are actually ordered as (l, t, r, b).
        gradient_mul (float): Gradient multiplier.
        stride (int): The corresponding stride for feature maps,
            used to project the bbox onto the feature map.

    Returns:
        dcn_offsets (Tensor): The offsets for deformable convolution.
    """
    dcn_base_offset = self.dcn_base_offset.type_as(bbox_pred)
    bbox_pred_grad_mul = (1 - gradient_mul) * bbox_pred.detach() + \
        gradient_mul * bbox_pred
    # detach() cuts the gradient; only a fraction gradient_mul of the
    # gradient flows back through bbox_pred
    # map to the feature map scale
    bbox_pred_grad_mul = bbox_pred_grad_mul / stride  # (2,4,38,38)
    N, C, H, W = bbox_pred.size()

    x1 = bbox_pred_grad_mul[:, 0, :, :]  # (2,38,38)
    y1 = bbox_pred_grad_mul[:, 1, :, :]
    x2 = bbox_pred_grad_mul[:, 2, :, :]
    y2 = bbox_pred_grad_mul[:, 3, :, :]
    bbox_pred_grad_mul_offset = bbox_pred.new_zeros(
        N, 2 * self.num_dconv_points, H, W)
    # The 9 offsets are ordered row by row (top row left to right, then the
    # middle row, then the bottom row), and for each point the y offset
    # comes before the x offset
    bbox_pred_grad_mul_offset[:, 0, :, :] = -1.0 * y1  # -y1
    bbox_pred_grad_mul_offset[:, 1, :, :] = -1.0 * x1  # -x1
    bbox_pred_grad_mul_offset[:, 2, :, :] = -1.0 * y1  # -y1
    bbox_pred_grad_mul_offset[:, 4, :, :] = -1.0 * y1  # -y1
    bbox_pred_grad_mul_offset[:, 5, :, :] = x2  # x2
    bbox_pred_grad_mul_offset[:, 7, :, :] = -1.0 * x1  # -x1
    bbox_pred_grad_mul_offset[:, 11, :, :] = x2  # x2
    bbox_pred_grad_mul_offset[:, 12, :, :] = y2  # y2
    bbox_pred_grad_mul_offset[:, 13, :, :] = -1.0 * x1  # -x1
    bbox_pred_grad_mul_offset[:, 14, :, :] = y2  # y2
    bbox_pred_grad_mul_offset[:, 16, :, :] = y2  # y2
    bbox_pred_grad_mul_offset[:, 17, :, :] = x2  # x2
    dcn_offset = bbox_pred_grad_mul_offset - dcn_base_offset

    return dcn_offset

Then the refined regression features are obtained through the deformable convolution self.vfnet_reg_refine_dconv, and a 3x3 convolution self.vfnet_reg_refine produces the scaling vector bbox_pred_refine, which is the \((\triangle l,\triangle t,\triangle r,\triangle b)\) above. Multiplying it by the initial bbox_pred completes the box refinement and gives the final distance predictions.

The classification sub-network is similar to the regression sub-network and will not be described in detail.

Finally, the implementation of the varifocal loss is as follows.

import torch.nn.functional as F
# helper from mmdet.models.losses.utils used to reduce the element-wise loss
from mmdet.models.losses.utils import weight_reduce_loss


def varifocal_loss(pred,
                   target,
                   weight=None,
                   alpha=0.75,
                   gamma=2.0,
                   iou_weighted=True,
                   reduction='mean',
                   avg_factor=None):
    """`Varifocal Loss <https://arxiv.org/abs/2008.13367>`_

    Args:
        pred (torch.Tensor): The prediction with shape (N, C), C is the
            number of classes
        target (torch.Tensor): The learning target of the iou-aware
            classification score with shape (N, C), C is the number of classes.
        weight (torch.Tensor, optional): The weight of loss for each
            prediction. Defaults to None.
        alpha (float, optional): A balance factor for the negative part of
            Varifocal Loss, which is different from the alpha of Focal Loss.
            Defaults to 0.75.
        gamma (float, optional): The gamma for calculating the modulating
            factor. Defaults to 2.0.
        iou_weighted (bool, optional): Whether to weight the loss of the
            positive example with the iou target. Defaults to True.
        reduction (str, optional): The method used to reduce the loss into
            a scalar. Defaults to 'mean'. Options are "none", "mean" and
            "sum".
        avg_factor (int, optional): Average factor that is used to average
            the loss. Defaults to None.
    """
    # pred and target should be of the same size
    assert pred.size() == target.size()
    pred_sigmoid = pred.sigmoid()
    target = target.type_as(pred)
    if iou_weighted:
        focal_weight = target * (target > 0.0).float() + \
            alpha * (pred_sigmoid - target).abs().pow(gamma) * \
            (target <= 0.0).float()
    else:
        focal_weight = (target > 0.0).float() + \
            alpha * (pred_sigmoid - target).abs().pow(gamma) * \
            (target <= 0.0).float()
    loss = F.binary_cross_entropy_with_logits(
        pred, target, reduction='none') * focal_weight
    loss = weight_reduce_loss(loss, weight, reduction, avg_factor)
    return loss

Here iou_weighted=True, and the target is the IoU between the predicted box and the corresponding gt box at foreground positions (0 elsewhere).
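
A toy call of the function above (shapes and values are illustrative):

import torch

# 4 points, 3 classes
pred = torch.randn(4, 3)        # raw logits from the classification branch
target = torch.zeros(4, 3)      # IACS targets
target[0, 1] = 0.85             # point 0 matched to class 1 with IoU 0.85
target[2, 0] = 0.40             # point 2 matched to class 0 with IoU 0.40

loss = varifocal_loss(pred, target, alpha=0.75, gamma=2.0, reduction='mean')
print(loss)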

Source: blog.csdn.net/ooooocj/article/details/130545488