paper: VarifocalNet: An IoU-aware Dense Object Detector
official implementation: https://github.com/hyz-xmaster/VarifocalNet
third-party implementation: mmdetection/vfnet_head.py (main branch of open-mmlab/mmdetection on GitHub)
Background
Most current object detectors first generate a redundant set of candidate boxes and then use NMS to remove duplicate detections of the same object. NMS usually ranks the boxes by their classification scores. This can hurt performance, because the classification score is not always a good estimate of localization accuracy: a box that is accurately localized but has a low classification score may be wrongly suppressed by NMS.
To address this, existing detectors predict an additional IoU score or centerness score as a measure of localization accuracy and multiply it by the classification score to obtain the ranking score used in NMS. These methods alleviate the misalignment between classification score and localization accuracy, but they are sub-optimal: multiplying two imperfect predictions can give an even worse ranking basis, and the authors show experimentally that this approach has a limited performance upper bound. Adding an extra network branch to predict the localization score is also inelegant and incurs extra computation.
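To make the ranking problem concrete, here is a minimal sketch of greedy NMS on two overlapping boxes of the same object. The boxes, the scores, and the helper names `iou` and `nms` are made up for illustration and are not taken from the paper or from any library.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-ranked box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


gt = (10, 10, 110, 110)
boxes = [(12, 11, 109, 112),   # box 0: well localized
         (30, 25, 130, 125)]   # box 1: poorly localized
cls_scores = [0.60, 0.90]      # hypothetical: the classifier prefers box 1
gt_ious = [iou(b, gt) for b in boxes]

kept_by_cls = nms(boxes, cls_scores)  # ranking by classification score
kept_by_iou = nms(boxes, gt_ious)     # oracle ranking by IoU with the gt box
```

Ranking by classification score keeps the poorly localized box and suppresses the accurate one, while ranking by a localization-aware score keeps the accurate box.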
Contributions of the paper
To overcome the above problems, it is natural to ask: instead of predicting an additional localization accuracy score, can we merge it into the classification score? That is, predict a localization-aware or IoU-aware classification score (IACS) that simultaneously represents the presence of an object class and the localization accuracy of its box.
The contributions of this paper are as follows:
- It demonstrates that accurately ranking candidate boxes is critical to detector performance, and that IACS achieves a better ranking than other methods.
- It proposes a new loss function, Varifocal Loss, for training the model to regress the IACS.
- It designs a star-shaped bounding box feature representation for computing the IACS and refining the bounding box.
- Based on FCOS+ATSS and the new components above, it builds a new object detector, VarifocalNet (VFNet for short).
The method is illustrated in the figure below (Figure 1).
Motivation
The authors first investigate the performance upper bound of FCOS, identify its main bottleneck, and demonstrate the importance of using an IoU-aware classification score as the ranking metric in NMS. To study the upper bound of FCOS+ATSS, they replace, one at a time, the classification score, the distance offsets, and the centerness score predicted at foreground points with the corresponding ground-truth values before NMS, and evaluate the AP on COCO val2017. For the classification score there are two options: replace the element at the gt class position with 1, or with the IoU between the predicted box and its gt box (i.e. gt_IoU). Similarly, the centerness score is replaced either with its true value or with gt_IoU.
The results are shown in Table 1. The original FCOS+ATSS obtains 39.2 AP. Replacing the centerness with its true value (gt_ctr) raises the AP by only about 2.0, and replacing it with gt_IoU (gt_ctr_iou) yields only 43.5 AP. This shows that multiplying the classification score by either a predicted centerness score or a predicted IoU score cannot bring a significant improvement.
In contrast, replacing the predicted boxes with the ground-truth boxes (gt_bbox) reaches 56.1 AP even without the centerness score (w/o ctr). If instead the classification score is replaced with the true value 1, whether centerness is used becomes very important (43.1 AP without vs. 58.1 AP with), because centerness can to some extent distinguish accurate boxes from inaccurate ones.
The most surprising result comes from replacing the classification score with gt_IoU (gt_cls_iou): without centerness, the AP reaches 74.7, significantly higher than in all other cases. This indicates that the large pool of candidates already contains accurately localized boxes, and that the key to high detection performance is to pick these high-quality boxes out of the pool. Replacing the classification score with gt_IoU is therefore the most effective choice. The authors call this score the IoU-aware Classification Score (IACS).
Method
Based on the above experimental results, the authors build a new detector, VarifocalNet, on top of FCOS+ATSS with the centerness branch removed. Compared with FCOS+ATSS, VFNet has three new components: the varifocal loss, the star-shaped bounding box feature representation, and bounding box refinement.
IACS - IoU-Aware Classification Score
In the classification target vector, the value at the gt class position is changed from 1 to the IoU between the predicted box and its gt box; all other positions are 0.
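As a rough illustration of this target, the sketch below builds the per-point classification target from a class index and an IoU value. The function name `iacs_target` and the example numbers are hypothetical.

```python
def iacs_target(num_classes, gt_class, gt_iou):
    """Return the IACS training target for one point.

    gt_class is None for a background point. Otherwise the target holds
    the IoU between the predicted box and its gt box at the gt class.
    """
    target = [0.0] * num_classes
    if gt_class is not None:
        target[gt_class] = gt_iou
    return target


foreground = iacs_target(num_classes=4, gt_class=2, gt_iou=0.83)
background = iacs_target(num_classes=4, gt_class=None, gt_iou=0.0)
```

Compared with a one-hot target, only the non-zero entry changes: it carries the localization quality instead of a constant 1.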
Varifocal Loss
The authors borrow the sample-weighting idea of focal loss to handle the class imbalance that arises when regressing the continuous IACS during training. Unlike focal loss, varifocal loss treats positive and negative samples asymmetrically. It is defined as follows
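The equation itself did not survive in this text. The form below is reconstructed from the definitions in this section and from the mmdetection code shown later, so the exact notation is an assumption that should be checked against the paper.

```latex
\mathrm{VFL}(p, q) =
\begin{cases}
-q \left( q \log p + (1 - q) \log (1 - p) \right), & q > 0 \\
-\alpha \, p^{\gamma} \log (1 - p), & q = 0
\end{cases}
```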
where \(p\) is the predicted IACS and \(q\) is the target score: for a foreground point, \(q\) at the gt class is the IoU between the predicted box and its gt box, and it is 0 otherwise.
It can be seen from formula (2) that varifocal loss down-weights only the negative samples (\(q=0\)), through the factor \(p^{\gamma}\), and does not down-weight positive samples in the same way: positive samples are very rare compared with negative ones, so their valuable learning signal should be preserved. In addition, inspired by PISA, the authors weight each positive sample by its target \(q\). A positive sample with a large gt_IoU therefore contributes relatively more to the loss, which makes the model focus on high-quality positive samples and leads to higher AP.
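The asymmetry can be checked numerically with a scalar version of the loss. The sketch below works on probabilities rather than logits and uses \(\alpha=0.75,\gamma=2\); the function name `varifocal_loss_scalar` and the sample values are made up for this example.

```python
import math


def varifocal_loss_scalar(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss for one predicted score p in (0, 1) and target q."""
    if q > 0:
        # Positive sample: binary cross-entropy weighted by the target q.
        bce = -(q * math.log(p) + (1 - q) * math.log(1 - p))
        return q * bce
    # Negative sample: down-weighted by alpha * p**gamma.
    return -alpha * (p ** gamma) * math.log(1 - p)


easy_negative = varifocal_loss_scalar(p=0.1, q=0.0)
hard_negative = varifocal_loss_scalar(p=0.9, q=0.0)
low_quality_positive = varifocal_loss_scalar(p=0.5, q=0.3)
high_quality_positive = varifocal_loss_scalar(p=0.5, q=0.9)
```

With the same prediction, the positive sample with the larger target contributes more, while an easy negative contributes almost nothing compared with a hard one.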
Star-Shaped Box Feature Representation
The authors design a new star-shaped bounding box feature representation, shown in the yellow circle in Figure 1, which represents a bounding box by the features at nine fixed sampling points gathered with a deformable convolution. This representation captures the geometry of the bounding box and its nearby context, which is important for encoding the offset between the predicted box and the ground-truth box.
Specifically, given a point \((x,y)\) on the feature map, an initial box is first regressed with a 3x3 convolution. As in FCOS, this box is encoded by a 4-dimensional vector \((l',t',r',b')\), the distances from the point to the left, top, right, and bottom sides of the box. With this distance vector, nine sampling points are chosen: \((x, y)\), \((x-l', y)\), \((x, y-t')\), \((x+r', y)\), \((x, y+b')\), \((x-l', y-t')\), \((x+r', y-t')\), \((x-l', y+b')\), \((x+r', y+b')\). These points are then mapped onto the feature map, and their offsets relative to \((x,y)\) serve as the offsets of a deformable convolution, which convolves the features at the nine points to represent the box.
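The nine points can be enumerated directly from the distance vector. This is a small sketch in image coordinates; the function name `star_points` and the numbers are illustrative only.

```python
def star_points(x, y, l, t, r, b):
    """Nine star-shaped sampling points for a box given by the distances
    (l, t, r, b) from the point (x, y) to its four sides."""
    return [
        (x, y),                                  # the point itself
        (x - l, y), (x, y - t), (x + r, y), (x, y + b),  # side midpoints as seen from (x, y)
        (x - l, y - t), (x + r, y - t),          # top-left, top-right corners
        (x - l, y + b), (x + r, y + b),          # bottom-left, bottom-right corners
    ]


points = star_points(x=100, y=80, l=30, t=20, r=10, b=40)
```

The four corner points span exactly the predicted box, so the sampled features cover the box boundary as well as the original location.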
Bounding Box Refinement
The authors further improve localization accuracy with a bounding box refinement step. Box refinement has been used in Cascade R-CNN and single-shot refinement detectors, but it is rarely used in dense object detectors because they lack an effective object descriptor. With the star-shaped representation proposed in this paper, it can be used efficiently in a dense object detector.
The authors model box refinement as a residual learning problem. For an initially regressed box \((l',t',r',b')\), the star-shaped representation is first extracted to encode it. Based on this representation, four distance scaling factors \((\Delta l,\Delta t,\Delta r,\Delta b)\) are learned to scale the initial distance vector, so the refined box can be expressed as \((l,t,r,b)=(\Delta l\times l',\Delta t\times t',\Delta r\times r',\Delta b\times b')\).
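A minimal sketch of this scaling step, followed by decoding the distances into corner coordinates, is given below. The function name `refine_box` and the numbers are hypothetical; in the model the scaling factors come from the refinement branch.

```python
def refine_box(x, y, init_dists, scale_factors):
    """Scale the initial (l', t', r', b') distances element-wise and
    decode the result into a box (x1, y1, x2, y2) around the point (x, y)."""
    l, t, r, b = (s * d for s, d in zip(scale_factors, init_dists))
    return (x - l, y - t, x + r, y + b)


box = refine_box(x=100.0, y=80.0,
                 init_dists=(30.0, 20.0, 10.0, 40.0),
                 scale_factors=(1.25, 1.0, 0.5, 1.0))
```

A factor of 1 leaves a side where it was, so the refinement branch only has to learn a correction relative to the initial box.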
VarifocalNet
Adding the three components above to FCOS+ATSS and removing the centerness branch yields the VarifocalNet proposed in this paper.
The complete structure of VFNet is shown in Figure 3. The backbone and FPN of VFNet are the same as in FCOS; the difference lies in the head, which consists of two subnetworks. The localization subnetwork performs bounding box regression and the subsequent refinement. It takes the feature map from each FPN level as input and first applies three 3x3 convolutions with ReLU activation, producing a feature map with 256 channels. One branch of the localization subnetwork then convolves this feature map again and outputs a 4-dimensional distance vector \((l',t',r',b')\) at each spatial position, which represents the initial box. Given the initial box and the output of the three 3x3 convolutions, the other branch applies a deformable convolution at the nine star-shaped sampling points and produces the distance scaling factors \((\Delta l,\Delta t,\Delta r,\Delta b)\), which are multiplied by the initial distance vector to give the refined box \((l,t,r,b)\).
The other subnetwork predicts the IACS. Its structure is similar to that of the localization subnetwork, except that its output vector has length \(C\) (the number of classes), where each element jointly represents object presence confidence and localization accuracy.
Loss Function and Inference
The loss function of VFNet is as follows
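The equation is missing from this text. The form below is reconstructed from the symbol definitions that follow, so it is an assumption and should be checked against the paper.

```latex
Loss = \frac{1}{N_{pos}} \sum_{i} \sum_{c} \mathrm{VFL}(p_{c,i}, q_{c,i})
     + \frac{\lambda_{0}}{N_{pos}} \sum_{i} q_{c^{*},i} \, L_{bbox}(bbox_{i}', bbox_{i}^{*})
     + \frac{\lambda_{1}}{N_{pos}} \sum_{i} q_{c^{*},i} \, L_{bbox}(bbox_{i}, bbox_{i}^{*})
```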
where \(p_{c,i}\) and \(q_{c,i}\) are the predicted and target IACS for class \(c\) at location \(i\) on the feature map of each FPN level, \(L_{bbox}\) is the GIoU loss, and \(bbox_{i}',bbox_{i},bbox_{i}^{*}\) are the initial, refined, and gt boxes respectively. The authors weight \(L_{bbox}\) by the training target \(q_{c^{*},i}\), which is gt_IoU for foreground points and 0 for background points. \(\lambda_{0}\) and \(\lambda_{1}\) are balance weights, set to 1.5 and 2.0 respectively in the paper. \(N_{pos}\) is the total number of foreground points.
Experimental results
The authors first determine the two hyperparameters \(\alpha, \gamma\) of varifocal loss experimentally, with the results as follows. The accuracy is highest when \(\alpha=0.75,\gamma=2\).
The contribution of each component is then investigated, with the results as follows. All three components improve performance, and performance is highest when they are used together.
Finally, the comparison with other state-of-the-art methods is as follows. Under the same configuration (backbone, whether DCN is used, multi-scale training, etc.), VFNet achieves the highest accuracy.
Code analysis
Here we take the implementation in mmdetection as an example to explain the implementation details. With an input of shape input_shape=(2, 3, 300, 300) and backbone='resnet-50', the outputs P3~P7 of the FPN have shapes [(2,256,38,38),(2,256,19,19),(2,256,10,10),(2,256,5,5),(2,256,3,3)]. The novel part of VFNet is the head, shown in Figure 3. Taking the P3 output as an example, the complete implementation of the head is as follows
def forward_single(self, x, scale, scale_refine, stride, reg_denom):
    """Forward features of a single scale level.

    Args:
        x (Tensor): FPN feature maps of the specified stride.
        scale (:obj: `mmcv.cnn.Scale`): Learnable scale module to resize
            the bbox prediction.
        scale_refine (:obj: `mmcv.cnn.Scale`): Learnable scale module to
            resize the refined bbox prediction.
        stride (int): The corresponding stride for feature maps,
            used to normalize the bbox prediction when
            bbox_norm_type = 'stride'.
        reg_denom (int): The corresponding regression range for feature
            maps, only used to normalize the bbox prediction when
            bbox_norm_type = 'reg_denom'.

    Returns:
        tuple: iou-aware cls scores for each box, bbox predictions and
            refined bbox predictions of input feature maps.
    """
    cls_feat = x  # (2,256,38,38)
    reg_feat = x
    for cls_layer in self.cls_convs:  # three 3x3 convs
        cls_feat = cls_layer(cls_feat)
    # (2,256,38,38)
    for reg_layer in self.reg_convs:  # three 3x3 convs
        reg_feat = reg_layer(reg_feat)
    # (2,256,38,38)

    # predict the bbox_pred of different level
    reg_feat_init = self.vfnet_reg_conv(reg_feat)  # 3x3 conv, (2,256,38,38)
    if self.bbox_norm_type == 'reg_denom':
        # 3x3 conv, reg_denom is 64 here, output (2,4,38,38)
        bbox_pred = scale(
            self.vfnet_reg(reg_feat_init)).float().exp() * reg_denom
    elif self.bbox_norm_type == 'stride':
        bbox_pred = scale(
            self.vfnet_reg(reg_feat_init)).float().exp() * stride
    else:
        raise NotImplementedError

    # compute star deformable convolution offsets
    # converting dcn_offset to reg_feat.dtype thus VFNet can be
    # trained with FP16
    # here gradient_mul is 0.1 and stride is 8, output (2,18,38,38)
    dcn_offset = self.star_dcn_offset(bbox_pred, self.gradient_mul,
                                      stride).to(reg_feat.dtype)

    # refine the bbox_pred
    reg_feat = self.relu(
        self.vfnet_reg_refine_dconv(reg_feat, dcn_offset))  # (2,256,38,38)
    bbox_pred_refine = scale_refine(
        self.vfnet_reg_refine(reg_feat)).float().exp()  # (2,4,38,38)
    bbox_pred_refine = bbox_pred_refine * bbox_pred.detach()  # (2,4,38,38)

    # predict the iou-aware cls score
    cls_feat = self.relu(
        self.vfnet_cls_dconv(cls_feat, dcn_offset))  # (2,256,38,38)
    cls_score = self.vfnet_cls(cls_feat)  # (2,20,38,38)

    if self.training:
        return cls_score, bbox_pred, bbox_pred_refine
    else:
        return cls_score, bbox_pred_refine
Both the classification and regression subnetworks start with three consecutive 3x3 convolutions, self.cls_convs and self.reg_convs in the code. The lower branch of the regression subnetwork then applies a 3x3 convolution self.vfnet_reg_conv followed by a 3x3 offset-prediction convolution self.vfnet_reg to obtain the initial bounding box prediction bbox_pred, the orange feature map in the middle of Figure 3, with shape=(2, 4, 38, 38). It holds, for each point, the predicted distances to the four sides of its box. From the coordinates of the point and these four distances, the nine points of the star-shaped representation are obtained as in Figure 1. This is implemented by the function self.star_dcn_offset, whose code is as follows.
def star_dcn_offset(self, bbox_pred, gradient_mul, stride):
    """Compute the star deformable conv offsets.

    Args:
        bbox_pred (Tensor): Predicted bbox distance offsets (l, r, t, b).
            Annotator's note: this should be (l, t, r, b).
        gradient_mul (float): Gradient multiplier.
        stride (int): The corresponding stride for feature maps,
            used to project the bbox onto the feature map.

    Returns:
        dcn_offsets (Tensor): The offsets for deformable convolution.
    """
    dcn_base_offset = self.dcn_base_offset.type_as(bbox_pred)
    # detach() cuts off the gradient, so the forward value is unchanged
    # while the gradient flowing back into bbox_pred is scaled by
    # gradient_mul
    bbox_pred_grad_mul = (1 - gradient_mul) * bbox_pred.detach() + \
        gradient_mul * bbox_pred
    # map to the feature map scale
    bbox_pred_grad_mul = bbox_pred_grad_mul / stride  # (2,4,38,38)
    N, C, H, W = bbox_pred.size()

    x1 = bbox_pred_grad_mul[:, 0, :, :]  # (2,38,38)
    y1 = bbox_pred_grad_mul[:, 1, :, :]
    x2 = bbox_pred_grad_mul[:, 2, :, :]
    y2 = bbox_pred_grad_mul[:, 3, :, :]
    bbox_pred_grad_mul_offset = bbox_pred.new_zeros(
        N, 2 * self.num_dconv_points, H, W)
    # Point order: first row left to right, then second row, then third
    # row. For each point the y offset comes first, then the x offset.
    bbox_pred_grad_mul_offset[:, 0, :, :] = -1.0 * y1  # -y1
    bbox_pred_grad_mul_offset[:, 1, :, :] = -1.0 * x1  # -x1
    bbox_pred_grad_mul_offset[:, 2, :, :] = -1.0 * y1  # -y1
    bbox_pred_grad_mul_offset[:, 4, :, :] = -1.0 * y1  # -y1
    bbox_pred_grad_mul_offset[:, 5, :, :] = x2  # x2
    bbox_pred_grad_mul_offset[:, 7, :, :] = -1.0 * x1  # -x1
    bbox_pred_grad_mul_offset[:, 11, :, :] = x2  # x2
    bbox_pred_grad_mul_offset[:, 12, :, :] = y2  # y2
    bbox_pred_grad_mul_offset[:, 13, :, :] = -1.0 * x1  # -x1
    bbox_pred_grad_mul_offset[:, 14, :, :] = y2  # y2
    bbox_pred_grad_mul_offset[:, 16, :, :] = y2  # y2
    bbox_pred_grad_mul_offset[:, 17, :, :] = x2  # x2
    dcn_offset = bbox_pred_grad_mul_offset - dcn_base_offset

    return dcn_offset
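To see why the base offset is subtracted at the end, recall that a 3x3 deformable convolution samples each kernel tap at its regular grid position plus an offset. The sketch below is a plain-Python illustration for a single location. The row-major (y, x) layout of `base` reflects my understanding of how `dcn_base_offset` is built in mmdetection and should be treated as an assumption, and `star_offsets` is a hypothetical helper.

```python
def star_offsets(l, t, r, b):
    """(dy, dx) of the nine star points relative to the center, on the
    feature-map scale, ordered row by row like a 3x3 kernel."""
    return [(-t, -l), (-t, 0.0), (-t, r),
            (0.0, -l), (0.0, 0.0), (0.0, r),
            (b, -l), (b, 0.0), (b, r)]


# Regular 3x3 kernel grid: the positions a plain convolution would sample.
base = [(dy, dx) for dy in (-1.0, 0.0, 1.0) for dx in (-1.0, 0.0, 1.0)]

star = star_offsets(l=3.0, t=2.0, r=1.0, b=4.0)
# Offsets fed to the deformable conv: star position minus regular position.
dcn_offset = [(sy - by, sx - bx) for (sy, sx), (by, bx) in zip(star, base)]
# The conv samples at base + offset, which lands exactly on the star points.
sampled = [(by + oy, bx + ox) for (by, bx), (oy, ox) in zip(base, dcn_offset)]
```

The center tap gets a zero offset, and every other tap is moved from the regular grid onto a side or corner of the predicted box.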
The refined regression features are then obtained with the deformable convolution self.vfnet_reg_refine_dconv, and a 3x3 convolution self.vfnet_reg_refine produces the scaling vector bbox_pred_refine, i.e. \((\Delta l,\Delta t,\Delta r,\Delta b)\). Multiplying it by the initial bbox_pred completes the box refinement and gives the final distance prediction.
The classification subnetwork is similar to the regression subnetwork and is not described in detail.
Finally, the implementation of varifocal loss is as follows
import torch.nn.functional as F
# weight_reduce_loss is the reduction helper defined in mmdetection's
# mmdet/models/losses/utils.py


def varifocal_loss(pred,
                   target,
                   weight=None,
                   alpha=0.75,
                   gamma=2.0,
                   iou_weighted=True,
                   reduction='mean',
                   avg_factor=None):
    """`Varifocal Loss <https://arxiv.org/abs/2008.13367>`_

    Args:
        pred (torch.Tensor): The prediction with shape (N, C), C is the
            number of classes
        target (torch.Tensor): The learning target of the iou-aware
            classification score with shape (N, C), C is the number of classes.
        weight (torch.Tensor, optional): The weight of loss for each
            prediction. Defaults to None.
        alpha (float, optional): A balance factor for the negative part of
            Varifocal Loss, which is different from the alpha of Focal Loss.
            Defaults to 0.75.
        gamma (float, optional): The gamma for calculating the modulating
            factor. Defaults to 2.0.
        iou_weighted (bool, optional): Whether to weight the loss of the
            positive example with the iou target. Defaults to True.
        reduction (str, optional): The method used to reduce the loss into
            a scalar. Defaults to 'mean'. Options are "none", "mean" and
            "sum".
        avg_factor (int, optional): Average factor that is used to average
            the loss. Defaults to None.
    """
    # pred and target should be of the same size
    assert pred.size() == target.size()
    pred_sigmoid = pred.sigmoid()
    target = target.type_as(pred)
    if iou_weighted:
        # positives are weighted by the IoU target, negatives by
        # alpha * p^gamma
        focal_weight = target * (target > 0.0).float() + \
            alpha * (pred_sigmoid - target).abs().pow(gamma) * \
            (target <= 0.0).float()
    else:
        focal_weight = (target > 0.0).float() + \
            alpha * (pred_sigmoid - target).abs().pow(gamma) * \
            (target <= 0.0).float()
    loss = F.binary_cross_entropy_with_logits(
        pred, target, reduction='none') * focal_weight
    loss = weight_reduce_loss(loss, weight, reduction, avg_factor)
    return loss
Here iou_weighted=True, and target holds, at the gt class of each foreground point, the IoU between the predicted box and its gt box, and 0 elsewhere.