CVPR 2022 | New Weighting Paradigm! The Hong Kong Polytechnic University proposes DW: a dual-weighting label assignment method for object detection


Reprinted from: Jizhi Shutong


The purpose of Label Assignment (LA) is to assign a positive (pos) and a negative (neg) loss weight to each training sample, and LA plays an important role in object detection. Existing LA methods mainly focus on the design of the pos weighting function, while the neg weight is simply derived from the pos weight. This mechanism limits the learning ability of the detector.

This paper explores a new weighting paradigm, called dual weighting (DW), to specify pos and neg weights separately. The key factors influencing the pos/neg weights are first identified by analyzing the evaluation metrics of object detection, and the pos and neg weighting functions are then designed accordingly.

Specifically, the pos weight of a sample is determined by the degree of consistency between its classification and localization, while the neg weight is decomposed into two terms: the probability that it is a negative sample, and its importance conditioned on being a negative sample.

This weighting strategy provides more flexibility for distinguishing important from unimportant samples, resulting in a more effective object detector. With the proposed DW method, an FCOS-ResNet-50 detector achieves 41.5% mAP on COCO, outperforming other existing LA methods. It consistently improves the baseline on COCO without any bells and whistles.

A Dual Weighting Label Assignment Scheme for Object Detection

Paper: https://arxiv.org/abs/2203.09730

GitHub: https://github.com/strongwolf/DW

1 Introduction

Object detection, as a fundamental vision task, has received extensive attention from researchers for decades. Current state-of-the-art detectors mostly perform dense detection by using a set of predefined anchors to predict class labels and regression offsets.

Each anchor, as the basic training unit of the detector, needs to be assigned appropriate classification (cls) and regression (reg) labels to supervise the training process. This label assignment (LA) process can be viewed as the task of assigning loss weights to each anchor. An anchor's cls loss (the reg loss can be defined similarly) can usually be expressed as:

L_cls = w_pos · BCE(s, 1) + w_neg · BCE(s, 0)

where w_pos and w_neg are the weights of the positive and negative terms, respectively, s is the predicted classification score, and BCE denotes the binary cross-entropy loss. According to the design of w_pos and w_neg, LA methods can be roughly divided into two categories: Hard Label Assignment and Soft Label Assignment.
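As a minimal illustration (not the authors' exact implementation), the weighted cls loss of one anchor could be computed as follows, assuming binary cross-entropy as the base classification loss:

import torch
import torch.nn.functional as F

def weighted_cls_loss(s, w_pos, w_neg):
    # s: predicted classification score in (0, 1) for one anchor
    # w_pos / w_neg: loss weights assigned to this anchor by the LA scheme
    pos_term = F.binary_cross_entropy(s, torch.ones_like(s), reduction='none')   # BCE(s, 1) = -log(s)
    neg_term = F.binary_cross_entropy(s, torch.zeros_like(s), reduction='none')  # BCE(s, 0) = -log(1 - s)
    return w_pos * pos_term + w_neg * neg_term

# Hard LA uses (w_pos, w_neg) = (1, 0) for positives and (0, 1) for negatives;
# Soft LA lets both weights take continuous values.
loss = weighted_cls_loss(torch.tensor([0.8]), w_pos=0.6, w_neg=0.1)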

Hard Label Assignment assumes that each anchor is either pos or neg, which means (w_pos, w_neg) is either (1, 0) or (0, 1). The core idea of this strategy is to find a suitable division boundary to split the anchors into a positive set and a negative set. The division rules along this research direction can be further divided into static rules and dynamic rules. Static rules use predefined metrics, such as the IoU or the distance from the anchor center to the GT center, to match each anchor to an object or the background.

This static matching strategy ignores the fact that the division boundary may differ for objects of different sizes and shapes.

In recent years, many dynamic matching strategies have been proposed. For example, ATSS divides the training anchors of each object into positives and negatives according to the statistics of their IoU distribution. Prediction-aware assignment strategies use the predicted confidence score as a reliable indicator for estimating anchor quality.

However, both static and dynamic matching strategies ignore the fact that samples are not equally important. The evaluation metric of object detection indicates that the optimal prediction should not only have a high classification score but also an accurate location, which means that during training, anchors with high consistency between the classification and regression heads should be more important.

Based on the above motivation, researchers have chosen to assign soft weights to anchors. GFL and VFL are two typical methods that define soft label targets based on the IoU and then convert them into loss weights by multiplying them with a modulation factor.

Some other works compute sample weights by jointly considering the reg and cls scores. Existing methods mainly focus on the design of the pos weighting function, while the neg weights are simply derived from the pos weights, which may limit the learning ability of the detector since the neg weights provide little new supervision information. The authors argue that this coupled weighting mechanism cannot distinguish training samples at a finer level.

[Figure 1: four anchors with different prediction results but nearly identical (pos, neg) weight pairs under GFL and VFL.]

Figure 1 gives an example. The four anchors have different prediction results, yet GFL and VFL assign almost the same pair of (pos, neg) weights to (B, D) and (C, D), respectively. GFL even assigns zero pos and neg weights to anchors A and C, since each of them has a cls score equal to its IoU. Because the neg weighting function in existing Soft LA methods is highly correlated with the pos one, anchors with different properties can sometimes receive almost the same (pos, neg) weights, which may compromise the effectiveness of the trained detector.

In order to provide the detector with more discriminative supervision signals, the authors propose a new label assignment scheme, called Dual Weighting (DW), which assigns the pos and neg weights from different perspectives and makes them complementary to each other.

Specifically, the pos weight is dynamically determined by combining the confidence score (from the cls head) and the reg score (from the reg head). The neg weight of each anchor is decomposed into two terms:

  • The probability that the anchor is a negative sample

  • The importance of the anchor conditioned on it being a negative sample

The pos weight reflects the degree of consistency between the cls head and the reg head and pushes more consistent anchors toward the front of the ranking list, while the neg weight reflects the degree of inconsistency and pushes inconsistent anchors toward the back of the list.

With this approach, at inference time, bounding boxes with higher cls scores and more precise locations have a greater probability of being preserved after NMS, while those with imprecise locations are ranked lower and filtered out. As shown in Figure 1, DW distinguishes the four anchors by assigning them different (pos, neg) weight pairs, thereby providing a finer-grained supervision signal for training the detector.

To provide a more accurate reg score to the weighting functions, the authors further propose a box refinement operation. Specifically, a learnable prediction module is designed to generate four boundary points based on the coarse regression map, and their predictions are then aggregated to obtain the updated bounding box of the current anchor. This lightweight module provides more accurate reg scores for DW at the cost of a modest computational overhead.

The advantages of the proposed DW method are demonstrated through comprehensive experiments on MS COCO. In particular, it improves the ResNet-50 based FCOS detector to 42.2 AP on the COCO validation set under a single-scale training scheme, outperforming other LA methods.

2 Related methods

2.1 Hard Label Assignment

Labeling each anchor as a pos or neg sample is a critical step in training the detector. Classic anchor-based object detectors set the labels of anchors by measuring their IoU with GT objects. In recent years, anchor-free detectors have attracted a lot of attention due to their compact design and promising performance.

Both FCOS and FoveaBox select pos samples through a center sampling strategy: anchors close to the GT center are sampled as positive samples, while the other anchors are treated as negatives or ignored during training. The above LA methods employ fixed rules for GT boxes of different shapes and sizes, which is sub-optimal.

Many scholars have also proposed some advanced Label Assignment strategies to dynamically select pos samples for each GT:

  • ATSS selects the top-k anchors from each level of the feature pyramid and uses the mean plus standard deviation of their IoUs as the pos/neg division threshold (a minimal sketch is given at the end of this subsection).

  • PAA adaptively separates anchors into pos/neg anchors according to the joint state of cls and reg losses.

  • OTA addresses the label assignment problem from a global perspective by formulating the assignment process as an optimal transport problem.

  • Transformer-based detectors employ a one-to-one assignment scheme to find the best pos samples for each GT.

Hard Label Assignment treats all samples equally, which is not well aligned with the evaluation metrics of object detection.
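To make the ATSS-style dynamic rule above concrete, here is a minimal sketch (single GT, hypothetical input names; the original method additionally requires the anchor center to lie inside the GT box, which is omitted here):

import torch

def atss_style_threshold(anchor_centers_per_level, anchor_ious_per_level, gt_center, k=9):
    # anchor_centers_per_level: list of (N_l, 2) tensors, anchor centers on each FPN level
    # anchor_ious_per_level:    list of (N_l,) tensors, IoU of each anchor box with the GT
    candidate_ious, candidate_ids = [], []
    for level, (centers, ious) in enumerate(zip(anchor_centers_per_level, anchor_ious_per_level)):
        dists = (centers - gt_center).norm(dim=1)                      # distance to the GT center
        topk = dists.topk(min(k, len(dists)), largest=False).indices   # k closest anchors per level
        candidate_ious.append(ious[topk])
        candidate_ids.append((level, topk))
    all_ious = torch.cat(candidate_ious)
    thr = all_ious.mean() + all_ious.std()                             # adaptive pos/neg threshold
    pos_masks = [ious >= thr for ious in candidate_ious]
    return candidate_ids, pos_masks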

2.2 Soft Label Assignment

Since the predicted boxes have different qualities in evaluation, the samples should be treated differently during training. Many works have been proposed to address the inequality of training samples.

Focal Loss adds a modulation factor to the cross-entropy loss to reduce the loss assigned to well-classified samples, which will push the detector to focus on difficult samples.

Generalized focal loss assigns a soft weight to each anchor by jointly considering the cls score and reg quality.

Varifocal loss utilizes IoU-aware cls labels to train the cls head.

Most of the methods mentioned above focus on computing the pos weight and simply define the neg weight as a function of the pos weight.

In this paper, we decouple this process and assign the pos and neg loss weights to each anchor separately. Most Soft Label Assignment methods assign weights to the losses; a special case is to assign the weights to the predicted scores instead.

Typical methods of this kind include FreeAnchor and AutoAssign.

It should be noted that our method differs from them. In order to match anchors in a fully differentiable manner, the weights in AutoAssign still receive gradients. In our method, however, the loss weights are carefully designed and completely detached from the network, which is also the common practice for weighted losses.

3 The proposed method

3.1 What is the motivation?

To be compatible with NMS, a good dense detector should be able to predict consistent bounding boxes that have both high classification scores and precise locations. However, if all training samples are treated equally, a misalignment arises between the two heads: the location with the highest cls score is usually not the best location for regressing the object boundary. This misalignment can degrade detector performance, especially under high IoU metrics. Soft Label Assignment handles training samples in a soft way by weighting their losses, in an attempt to enhance the consistency between the cls and reg heads. Under Soft Label Assignment, the loss of an anchor can be expressed as:

L = w_pos · BCE(s, 1) + w_neg · BCE(s, 0) + w_reg · L_reg(b, b')

In the formula, s is the predicted cls score, b and b' are the positions of the predicted bounding box and the GT box, respectively, and L_reg is a regression loss such as Smooth L1 Loss, IoU Loss or GIoU Loss. The inconsistency between the cls and reg heads can be mitigated by assigning larger w_pos and w_reg to anchors with higher consistency. Such well-trained anchors are then able to predict both high cls scores and precise locations at inference time.

Existing works usually set w_reg equal to w_pos and mainly focus on how to define the consistency and integrate it into the loss weights.

[Table 1: formulations of w_pos and w_neg for pos anchors in recent representative methods.]

Table 1 summarizes the formulations of w_pos and w_neg for pos anchors in recent representative methods. It can be seen that current methods usually define a metric t to represent the degree of consistency between the two heads at the anchor level, and then design the inconsistency metric as a function of 1 − t. By adding scale factors, the consistency and inconsistency measures are finally integrated into the pos and neg loss weights, respectively.
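To make this coupling concrete, the following sketch writes out the Quality Focal Loss used in GFL (from the standard QFL formulation, not the authors' code): the soft label y is the IoU, and the same modulating factor |y − s|^β scales both the positive and the negative log terms, so the neg weight is entirely determined by the pos one.

import torch

def quality_focal_loss(s, y, beta=2.0):
    # s: predicted cls score (sigmoid output); y: IoU-based soft label in [0, 1]
    scale = (y - s).abs() ** beta                                  # shared modulating factor
    pos_term = -y * torch.log(s.clamp(min=1e-6))                   # effectively w_pos = scale * y
    neg_term = -(1 - y) * torch.log((1 - s).clamp(min=1e-6))       # effectively w_neg = scale * (1 - y)
    return scale * (pos_term + neg_term)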

Different from the above methods, in which w_pos and w_neg are highly correlated, the authors propose to set the pos and neg weights separately in a prediction-aware manner.

Specifically, the pos weighting function takes the predicted cls score s and the IoU between the predicted box and the GT box as inputs, and sets the pos weight by estimating the degree of agreement between the cls and reg heads.

The neg weighting function takes the same inputs as the pos weighting function, but expresses the neg weight as the product of two terms: the probability that the anchor is a negative sample, and its importance conditioned on being a negative sample. In this way, ambiguous anchors with similar pos weights can receive finer-grained supervision signals through different neg weights, which is not possible in existing methods.

[Figure 2: overall framework of the proposed DW method.]

The DW framework is shown in Figure 2. Following common practice, a bag of candidate positive samples is first constructed for each GT box by selecting the anchors near the GT center (center prior). Anchors outside the candidate bag are treated as negative samples and do not participate in the design of the weighting functions, because their statistics (such as IoU and cls score) are very noisy in the early training stage. Anchors inside the candidate bag are assigned three weights, w_pos, w_neg and w_reg, to supervise the training process more effectively.

3.2 Positive sample weighting function

The pos weight of an anchor should reflect its importance for detecting an object accurately in both classification and localization. The authors identify the factors that influence this importance by analyzing the evaluation metric of object detection. During COCO evaluation, all predictions for a class are first sorted by a ranking score; existing methods usually use the cls score, or a combination of the cls score and the predicted IoU, as this ranking score. The correctness of each bounding box is then checked from the top of the ranking list. A prediction is defined as correct if:

  • The IoU between the predicted bounding box and its nearest GT is greater than a threshold θ;

  • no box ranked ahead of the current box satisfies the above condition for the same GT.

In other words, only the first bounding box in the ranking list whose IoU with a GT is greater than θ is counted as a positive detection, while all other bounding boxes matched to the same GT are treated as false positives.

It follows that a high ranking score and a high IoU are the two sufficient and necessary conditions for a positive prediction. Anchors satisfying both conditions are more likely to be counted as positive predictions during testing, so they should have higher importance during training.
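A simplified sketch of this matching procedure (single class, single image; a stand-in for the full COCO protocol) makes the two conditions explicit:

def match_predictions(pred_gt_ious, iou_thr=0.5):
    # pred_gt_ious[i][j]: IoU between the i-th prediction (sorted by ranking score,
    # highest first) and the j-th GT box
    num_gts = len(pred_gt_ious[0]) if pred_gt_ious else 0
    gt_matched = [False] * num_gts
    labels = []                          # True = true positive, False = false positive
    for ious in pred_gt_ious:            # walk down the ranking list
        best_j = max(range(num_gts), key=lambda j: ious[j], default=None)
        if best_j is not None and ious[best_j] > iou_thr and not gt_matched[best_j]:
            gt_matched[best_j] = True    # condition 1 (IoU > thr) and condition 2 both hold
            labels.append(True)
        else:
            labels.append(False)
    return labels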

From this perspective, the pos weight w_pos should be positively correlated with both the IoU and the ranking score. To specify the pos weighting function, a consistency measure, denoted by t, is first defined to measure the degree of alignment between these two conditions:

[Equation: definition of the consistency measure t, which combines the cls score s and the IoU between the predicted box and the GT, balanced by β.]

Here β is used to balance the two conditions. In order to enlarge the gap between the pos weights of different anchors, an exponential modulation factor is added:

w_pos = e^(µ·t) · t

where µ is a hyperparameter that controls the relative gap between different pos weights. Finally, the pos weight of each anchor for each instance is normalized by the sum of the pos weights of all anchors in the corresponding candidate bag.

# Localization score: map the per-anchor regression loss to a (0, 1] score
p_loc = torch.exp(-reg_loss * 5)
# Joint cls/objectness confidence for the assigned GT class
p_cls = (cls_score * objectness)[:, gt_labels]
# Consistency measure t between the cls and reg heads
p_pos = p_cls * p_loc
# Pos weight e^(µ·t)·t, masked by the center prior and normalized over each candidate bag
p_pos_weight = (torch.exp(5 * p_pos) * p_pos * center_prior_weights) / \
               (torch.exp(3 * p_pos) * p_pos * center_prior_weights).sum(0, keepdim=True).clamp(min=EPS)
p_pos_weight = p_pos_weight.detach()  # weights receive no gradient

3.3 Negative sample weighting function

While the pos weights can force consistent anchors to have both high cls scores and large IoUs, the importance of inconsistent anchors cannot be distinguished by the pos weights alone. As shown in Figure 1, anchor D has a better location (IoU larger than θ) but a lower cls score, while anchor B has a coarser location (IoU smaller than θ) but a higher cls score. They may have the same consistency t and thus be pushed forward with the same strength, which does not reflect their difference. To provide the detector with more discriminative supervision, the authors propose to indicate their importance by assigning them clearly different neg weights, defined as the product of the following two terms.

1. The probability of being a negative sample

According to the COCO evaluation metric, an IoU below θ is a sufficient condition for a wrong prediction. This means that a predicted bounding box that does not satisfy the IoU criterion will be counted as a negative detection even if it has a high cls score. In other words, the IoU is the only factor that determines the probability of being a negative sample, denoted by P_neg. Since COCO adopts IoU thresholds from 0.5 to 0.95 to estimate the AP, P_neg should satisfy the following rules:

P_neg = 1 if IoU ≤ 0.5;  P_neg = 0 if IoU ≥ 0.95;  P_neg decreases monotonically from 1 to 0 for IoU in (0.5, 0.95).

Any function that decreases monotonically on the interval [0.5, 0.95] will do. For simplicity, P_neg is instantiated as the following function:

P_neg = k · IoU^γ1 + b

It passes through the points (0.5, 1) and (0.95, 0). Once γ1 is chosen, the parameters k and b can be obtained by the method of undetermined coefficients. Figure 3 plots P_neg against the IoU for different values of γ1.
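For a chosen γ1, k and b follow directly from the two boundary conditions k·0.5^γ1 + b = 1 and k·0.95^γ1 + b = 0. A small sketch (our own helper, consistent with the description above, not the released code):

def solve_pneg_coefficients(gamma1):
    # Boundary conditions: P_neg(0.5) = 1 and P_neg(0.95) = 0
    k = 1.0 / (0.5 ** gamma1 - 0.95 ** gamma1)
    b = -k * 0.95 ** gamma1
    return k, b

def p_neg(iou, gamma1=2):
    k, b = solve_pneg_coefficients(gamma1)
    if iou <= 0.5:
        return 1.0
    if iou >= 0.95:
        return 0.0
    return k * iou ** gamma1 + b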

[Figure 3: P_neg vs. IoU for different values of γ1.]

2. The importance of negative samples

During inference, negative predictions in the ranking list do not affect recall but do reduce precision. To delay this effect, negative bounding boxes should be ranked as low as possible, i.e., their ranking scores should be as small as possible. Consequently, negative predictions with higher ranking scores are more important than those with lower ranking scores, because they are hard samples for network optimization.

Therefore, the importance of a negative sample, denoted by I_neg, should be a function of its ranking score. For simplicity, it is set to:

I_neg = s^γ2

where γ2 is a factor that indicates how much priority should be given to important negative samples.

Finally, the negative sample weights are:

w_neg = P_neg · I_neg = (k · IoU^γ1 + b) · s^γ2

w_neg is negatively correlated with the IoU but positively correlated with s. For two anchors with the same pos weight, the one with the smaller IoU therefore has the larger neg weight. This definition is well aligned with the inference process and can further distinguish ambiguous anchors that have almost the same pos weight.

# Start from a neg weight of 1 for every (anchor, class) pair.
p_neg_weight = torch.ones_like(joint_conf)
neg_metrics = torch.zeros_like(ious).fill_(-1)
alpha = 2  # exponent shaping the P_neg curve (gamma_1 in the paper)
# P_neg as a decreasing function of IoU: equals 1 at IoU = 0.5 and 0 at IoU = 1.
t = lambda x: 1 / (0.5 ** alpha - 1) * x ** alpha - 1 / (0.5 ** alpha - 1)
if num_gts > 0:
    def normalize(x):
        # Rescale P_neg within the candidate bag; anchors with IoU < 0.5 keep weight 1.
        x_ = t(x)
        t1 = x_.min()
        t2 = min(1., x_.max())
        y = (x_ - t1 + EPS) / (t2 - t1 + EPS)
        y[x < 0.5] = 1
        return y
    for instance_idx in range(num_gts):
        # Only anchors inside the GT box (center prior) get an instance-specific P_neg.
        idxs = inside_gt_bbox_mask[:, instance_idx]
        if idxs.any():
            neg_metrics[idxs, instance_idx] = normalize(ious[idxs, instance_idx])
    foreground_idxs = torch.nonzero(neg_metrics != -1, as_tuple=True)
    # Write P_neg into the (anchor, gt-class) entries of the weight map.
    p_neg_weight[foreground_idxs[0], gt_labels[foreground_idxs[1]]] = neg_metrics[foreground_idxs]

p_neg_weight = p_neg_weight.detach()  # no gradient through the weights
neg_avg_factor = (1 - p_neg_weight).sum()
# Final neg weight: w_neg = P_neg * s^gamma_2 (gamma_2 = 2), with s the joint confidence.
p_neg_weight = p_neg_weight * joint_conf ** 2
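As a quick sanity check of the property stated above, the toy numbers below (reusing the hypothetical p_neg helper sketched earlier and γ2 = 2) show that, of two anchors like B and D in Figure 1, the one with the smaller IoU and higher score receives the larger neg weight and is pushed further toward the back of the ranking list:

def w_neg(iou, score, gamma1=2, gamma2=2):
    return p_neg(iou, gamma1) * score ** gamma2   # w_neg = P_neg * I_neg

print(w_neg(iou=0.55, score=0.8))   # coarse box, high score  -> about 0.59
print(w_neg(iou=0.80, score=0.5))   # precise box, low score  -> about 0.10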

3.4 Box Refinement

Since both the pos and neg weighting functions take the IoU as input, a more accurate IoU leads to higher-quality samples, which is beneficial for learning stronger features. The authors therefore propose a box refinement operation based on the predicted offset map, which stores the predicted distances (l, t, r, b) from the current anchor center to the left, top, right and bottom sides of the box, as shown in Figure 4. Since points close to object boundaries are more likely to predict accurate locations, a learnable prediction module is designed to generate a boundary point for each side based on the coarse bounding box.

[Figure 4: illustration of the box refinement operation.]

Referring to Figure 4, the coordinates of the four boundary points are defined as:

[Equation: coordinates of the four boundary points, defined by offsetting the sides of the coarse box with the learned offsets O.]

where O is the output of the refinement module.

The refined offset map O' is then updated to:

[Equation: the refined offset map O', obtained by aggregating the regression predictions at the four boundary points.]
if self.with_reg_refine:
    # Decode the coarse (l, t, r, b) predictions into absolute boxes.
    reg_dist = bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4)
    points = self.prior_generator.single_level_grid_priors((h, w), self.strides.index(stride),
                                                           dtype=x.dtype, device=x.device)
    points = points.repeat(b, 1)
    decoded_bbox_preds = distance2bbox(points, reg_dist).reshape(b, h, w, 4).permute(0, 3, 1, 2)
    # Predict one 2-D sampling offset per box side (8 channels in total).
    reg_offset = self.reg_offset(reg_feat)
    bbox_pred_d = bbox_pred / stride
    # Shift the offsets so that each sampling point starts from the corresponding coarse box side.
    reg_offset = torch.stack([reg_offset[:, 0], reg_offset[:, 1] - bbox_pred_d[:, 0],
                              reg_offset[:, 2] - bbox_pred_d[:, 1], reg_offset[:, 3],
                              reg_offset[:, 4], reg_offset[:, 5] + bbox_pred_d[:, 2],
                              reg_offset[:, 6] + bbox_pred_d[:, 3], reg_offset[:, 7]], 1)
    # Sample the decoded boxes at the four boundary points (deformable sampling).
    bbox_pred = self.deform_sampling(decoded_bbox_preds.contiguous(), reg_offset.contiguous())
    # Re-encode the refined boxes back to (l, t, r, b) distances from the anchor points.
    bbox_pred = F.relu(bbox2distance(points, bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4))
                       .reshape(b, h, w, 4).permute(0, 3, 1, 2).contiguous())
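The snippet above relies on mmdetection utilities such as distance2bbox, bbox2distance and the head's deform_sampling. The first two conversions are simple; the following is a self-contained sketch of the (l, t, r, b) ↔ box mapping that the refinement builds on (our own re-implementation, without the clamping options of the library versions):

import torch

def distance2bbox(points, distances):
    # points: (N, 2) anchor centers (x, y); distances: (N, 4) predicted (l, t, r, b)
    x1 = points[:, 0] - distances[:, 0]
    y1 = points[:, 1] - distances[:, 1]
    x2 = points[:, 0] + distances[:, 2]
    y2 = points[:, 1] + distances[:, 3]
    return torch.stack([x1, y1, x2, y2], dim=-1)

def bbox2distance(points, bboxes):
    # Inverse mapping: decoded boxes back to per-side distances from the anchor center
    l = points[:, 0] - bboxes[:, 0]
    t = points[:, 1] - bboxes[:, 1]
    r = bboxes[:, 2] - points[:, 0]
    b = bboxes[:, 3] - points[:, 1]
    return torch.stack([l, t, r, b], dim=-1)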

3.5 Loss function

The proposed DW scheme can be applied to most existing dense detectors. Here, the representative dense detector FCOS is used to implement DW. As shown in Figure 2, the whole network consists of a backbone, an FPN and the detection head. Following common practice, the output of the centerness branch is multiplied with the classification branch output to form the final cls score. The overall loss of the network is:

[Equation: overall training loss, combining the cls loss and the reg loss with a balance factor β.]

where β is a balance factor

[Equation: definitions of the cls and reg loss terms.]

where N and M are the total numbers of anchors in the two summations, FL is the Focal Loss, L_GIoU is the GIoU regression loss, s is the predicted cls score, and b and b' are the positions of the predicted box and the GT box, respectively.
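For orientation only, a rough per-image sketch of how such a loss could be assembled is given below. It relies on our own simplifying assumptions (binary cross-entropy for the dual-weighted cls terms, Focal Loss for anchors outside the candidate bags, w_pos reused as the reg weight, and a simple normalization); the exact terms and normalization follow the paper and the released code.

import torch

def focal_loss_neg(s, gamma=2.0, alpha=0.25):
    # Focal Loss for a pure negative target (label 0), written out explicitly
    return -(1 - alpha) * s ** gamma * torch.log((1 - s).clamp(min=1e-6))

def dw_total_loss(scores, giou_loss, w_pos, w_neg, cand_mask, reg_weight=2.0):
    # scores: (N,) cls scores; giou_loss: (N,) per-anchor GIoU regression loss
    # w_pos, w_neg: (N,) dual weights; cand_mask: (N,) bool, anchors in the candidate bags
    cls_pos = -torch.log(scores.clamp(min=1e-6))           # BCE(s, 1)
    cls_neg = -torch.log((1 - scores).clamp(min=1e-6))     # BCE(s, 0)
    loss_cls = (w_pos * cls_pos + w_neg * cls_neg)[cand_mask].sum()
    loss_bg = focal_loss_neg(scores[~cand_mask]).sum()     # anchors outside the bags
    loss_reg = (w_pos[cand_mask] * giou_loss[cand_mask]).sum()
    norm = w_pos[cand_mask].sum().clamp(min=1.0)           # assumed normalization factor
    return (loss_cls + loss_bg + reg_weight * loss_reg) / norm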

4 Experiments

4.1 Ablation experiment

1. Hyperparameters for positive sample weighting

[Table: ablation on the pos weighting hyperparameters β and µ.]

The pos weight has 2 hyperparameters: β and µ;

  • β balances the contributions of the cls score and the IoU in the consistency measure t. As β increases, the contribution of the IoU also increases.

  • µ controls the relative scale of the pos weights. A larger µ gives the most consistent samples a relatively larger pos weight compared with less consistent ones.

The performance of DW when varying β from 3 to 7 and µ from 3 to 8 is shown in the table. The best result is obtained with β = 5 and µ = 5, while other combinations of β and µ reduce the AP by 0.1 to 0.7 points. Therefore, β and µ are set to 5 in all remaining experiments.

2. Hyperparameters for Negative Sample Weighting

[Table: ablation on the neg weighting hyperparameters γ1 and γ2.]

The authors also conduct several experiments to investigate the robustness of DW to the hyperparameters γ1 and γ2, which tune the relative scale of the neg weights. The AP obtained with different combinations of γ1 and γ2 ranges from 41.0 to 41.5, as shown in the table, meaning that the performance of DW is not sensitive to these two hyperparameters. Therefore, the default values (γ1 = 2 and γ2 = 2 in the released code) are used in all experiments.

3. Construction of candidate set

[Table 4: ablation on the construction of the candidate bag.]

As is common practice in object detection, Soft LA is only applied to the anchors in the candidate bag. The authors test three ways of constructing the candidate bag, all based on the distance r from the anchor to the corresponding GT center (normalized by the feature stride).

  • The first method is to select Anchors whose distance is less than a threshold.

  • The second method is to select the top-k nearest anchors from each level of the FPN (a minimal sketch is given below).

  • The third method is to give each anchor a soft center-prior weight and multiply it with w_pos.

The results are shown in Table 4. The AP fluctuates only slightly, between 41.1 and 41.5, which indicates that DW is robust to the way the candidate bag is constructed.
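A minimal sketch of the second construction (top-k nearest anchors per FPN level, hypothetical input names), as referenced in the list above:

import torch

def build_candidate_bag(anchor_centers_per_level, gt_center, strides, k=4):
    # anchor_centers_per_level: list of (N_l, 2) tensors of anchor centers per FPN level
    # strides: matching list of feature strides, used to normalize the distances
    bag = []
    for level, (centers, stride) in enumerate(zip(anchor_centers_per_level, strides)):
        r = (centers - gt_center).norm(dim=1) / stride        # stride-normalized distance r
        topk = r.topk(min(k, len(r)), largest=False).indices  # k closest anchors on this level
        bag.append((level, topk))
    return bag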

4. Design of negative sample weighting function

[Table: ablation on the design of the neg weighting function.]

The effect of the neg weighting function is investigated by replacing it with several alternatives, as shown in the table. Using only the pos weights reduces the performance to 39.5 AP, which indicates that for low-quality anchors, merely assigning them small pos weights is not enough to lower their ranking scores; an explicit neg weight is needed to push them toward the back of the ranking list, which leads to higher AP during testing.

Removing P_neg or I_neg gives 40.5 AP and 40.0 AP respectively, which verifies that both terms are necessary. Deriving the neg weight from the pos weight instead, as existing methods do, achieves 40.7 AP, 0.8 points lower than the standard DW.

5. Box Refinement

Without box refinement, the DW method achieves 41.5 AP, making it the first method to exceed 41 AP on COCO with a plain FCOS-ResNet-50 and no extra tricks. With box refinement, DW reaches 42.2 AP, as shown in Table 6. Table 7 also shows that box refinement consistently improves DW with different backbones.

6. Weighting strategy

To demonstrate the effectiveness of the DW strategy, it is compared with other LA methods that use different weighting strategies. The results are shown in the table: the first five rows are Hard LA methods, while the others are Soft LA methods.

[Table: comparison of different label assignment / weighting strategies.]

The best Hard LA performance, 40.7 AP, is achieved by OTA. Since OTA treats LA as an optimal transport problem, it increases the training time by more than 20%. GFLv2 utilizes an extra, more complex branch to estimate localization quality and achieves the second-best performance of 41.1 AP among the Soft LA methods.

Unlike mainstream methods that assign weights to the losses, AutoAssign assigns weights to the cls scores and updates them through their gradients during training. The authors try to detach the weights in AutoAssign and apply them to the losses instead, but only obtain 39.8 and 36.6 AP, which are 0.6 and 3.8 points lower than the original performance, respectively. This means that the weighting scheme in AutoAssign does not work well when adapted to the mainstream practice.

4.2 Comparison of SOTA methods

[Table: comparison with state-of-the-art detectors on COCO.]

4.3 Discussion

1. Visualize DW

To further understand how DW differs from existing methods, Figure 5 visualizes the cls scores, IoUs, pos weights and neg weights of DW and of two representative methods, GFL and VFL. It can be seen that the pos and neg weights in DW are mainly concentrated in the central region of the GT, while GFL and VFL distribute their weights over wider regions. This difference means that DW can focus more on important samples and reduce the contribution of easy samples, such as those near object boundaries. This is also why DW is more robust to the choice of the candidate bag.

[Figure 5: visualization of cls scores, IoUs, pos and neg weights for DW, GFL and VFL.]

We can also see that anchors in the central region have different (pos, neg) weight pairs under DW. In contrast, the neg weights in GFL and VFL are highly correlated with the pos weights. The anchors highlighted by the orange circles have almost the same pos and neg weights in GFL and VFL, while DW can distinguish them by assigning different weights, giving the network a higher learning capacity.

2. Limitations of DW

Although DW can distinguish well the importance of different anchors for an object, it also reduces the number of effective training samples, as shown in Figure 5. This may hurt the training of small objects. As shown in Table 7, the improvement of DW on small objects is not as large as that on large objects. To alleviate this problem, the hyperparameters could be set dynamically according to the object size to balance the training samples between large and small objects.


Source: blog.csdn.net/amusi1994/article/details/123700822