[Reading notes for object detection papers] Reducing Label Noise in Anchor-Free Object Detection

(Augmentation for small object detection)

Abstract

        Current anchor-free object detectors label all features that spatially fall within a predefined central region of a ground-truth box as positive. This approach introduces label noise during training, because some of these positively labeled features may lie on the background or on an occluding object, or may simply not be discriminative. In this paper, we propose a new labeling strategy aimed at reducing label noise in anchor-free detectors. We aggregate the predictions derived from individual features into a single prediction, which allows the model to reduce the contribution of non-discriminative features during training. We develop a new single-stage anchor-free object detector, PPDet, that employs this labeling strategy during training and a similar prediction pooling approach during inference. On the COCO dataset, PPDet achieves the best performance among anchor-free top-down detectors and is on par with other state-of-the-art methods. It also outperforms all major one-stage and two-stage methods in small object detection (APS 31.4).

Code is available at https://github.com/nerminsamet/ppdet.


1 Introduction

        Early deep learning-based object detectors were two-stage, proposal-driven methods [7, 22]. In the first stage, a sparse set of candidate boxes is generated; in the second stage, these candidates are classified by a convolutional neural network (CNN). Later, the idea of unifying detection in a single stage gained increasing attention [6, 14, 16, 21], with candidate boxes replaced by predefined anchor boxes. On the one hand, anchor boxes must densely cover the image (in terms of location, shape, and scale) to maximize recall; on the other hand, their number should be kept to a minimum to reduce both inference time and the imbalance problems [19] they create during training.

        Substantial effort has been spent on addressing the shortcomings of anchors: several methods have been proposed to improve the quality of anchors [27, 29] and to address the extreme foreground-background imbalance [14, 19, 24], and more recently single-stage anchor-free approaches have been developed. There are two main groups of anchor-free object detectors. The first group consists of keypoint-based, bottom-up methods, popularized by the seminal work CornerNet [11]. These detectors [4, 11, 17, 32] first detect object keypoints (such as corners, centers, and extreme points) and then group them to produce whole-object detections. The second group [10, 25, 33] follows a top-down approach that directly predicts the class and bounding box coordinates at each location of the final feature map.

        An important aspect of training an object detector is the strategy used to label object candidates, which may be proposals, anchors, or locations (i.e., features) in the final feature map. To label a candidate as "positive" (foreground) or "negative" (background) during training, existing strategies rely on keypoints [4, 11, 17, 32] or on the candidate's position relative to the ground-truth box [10, 25, 26]. In top-down anchor-free detectors in particular, after the input image passes through the backbone feature extractor and the FPN [13], the features that spatially fall inside a ground-truth box are labeled as positive and the others as negative, with an "ignore" region in between. Each positively labeled feature contributes to the loss function as a separate prediction. The problem with this approach is that some of these positive labels may be plainly wrong or of poor quality, and therefore inject label noise during training.

Label noise comes from (i) non-discriminative features on the object, (ii) background features inside the ground-truth box, and (iii) occluders (Fig. 1). In this paper, we propose an anchor-free object detection method that relaxes the positive labeling strategy, so that the model can reduce the contribution of non-discriminative features during training. Consistent with this training strategy, our detector uses an inference procedure in which highly overlapping predictions reinforce each other.

        In our method, during training, we define a "positive region" inside each ground-truth (GT) box that is concentric with the GT box and has the same shape. We tune the size of the positive region relative to the GT box experimentally. Since this is an anchor-free approach, each feature (i.e., each location in the final feature map) predicts a class probability vector and bounding box coordinates. The class predictions from the positive region of a GT box are pooled together and contribute to the loss as a single prediction. This sum-pooling alleviates the noisy labeling problem described above, because the contributions of features from non-object (background or occluder) regions and of non-discriminative features are automatically down-weighted during training. At inference time, the class probabilities of highly overlapping boxes are pooled again to obtain the final class probabilities. We name our method "PPDet", which is short for "prediction pooling detector".

        Our contributions to this work are twofold:

(i) a relaxed labeling strategy that allows the model to reduce the contribution of non-discriminative features during training, and

(ii) PPDet, a novel object detection method that uses this strategy for training and a new inference process based on prediction pooling.

We demonstrate the effectiveness of our proposal on the COCO dataset. PPDet outperforms all anchor-free top-down detectors and performs on par with other state-of-the-art methods. PPDet is especially effective at detecting small objects (31.4 APS, better than the state of the art).


2 Related Work

        In addition to the classic division of object detectors into one-stage [6, 14, 16, 21] and two-stage [3, 7, 22] methods, current methods can also be divided into two categories: anchor-based and anchor-free.

Top-down anchor-free object detectors simplify the training process by eliminating complex IoU operations and focusing on identifying regions likely to contain objects. In this sense, FCOS [25], FSAF [33] and FoveaBox [10] first map GT boxes to FPN levels, and then label locations (i.e., features) as positive or negative depending on whether they are inside a GT box. Bounding box prediction is performed only at positively labeled locations. FoveaBox [10] and FSAF [33] define three regions for each object instance: positive, ignore, and negative. FoveaBox defines the positive ("fovea") region as the region concentric with the GT box whose dimensions are scaled by a shrink factor of 0.3. All locations inside this positive region are labeled as positive. Similarly, a second region is obtained with a shrink factor of 0.4, and any location outside this region is labeled as negative. A location that is neither positive nor negative is ignored during training. FSAF follows the same approach with shrink factors of 0.2 and 0.5, respectively. FCOS does not predefine discrete regions as in [10, 26, 33]; instead, it uses a centerness branch to down-weight features according to their distance from the center of the box. FoveaBox and FCOS implement static feature pyramid level selection, assigning objects to levels according to GT box scale and GT box regression distance, respectively. In contrast, FSAF relaxes the level selection step and dynamically assigns each object to the most suitable feature pyramid level.
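
The three-region labeling described above can be sketched as follows. This is a minimal illustration and not the FoveaBox or FSAF implementation: the function names, the (x1, y1, x2, y2) box format, and the treatment of locations on a region boundary as inside are assumptions made for the example. The default shrink factors 0.3 and 0.4 are the FoveaBox values quoted in the text.

```python
def shrink_box(box, factor):
    """Return a box concentric with `box` whose width and height are scaled by `factor`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * factor / 2.0
    half_h = (y2 - y1) * factor / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def contains(box, x, y):
    """True if point (x, y) lies inside `box` (boundary counted as inside, by assumption)."""
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def label_location(x, y, gt_box, pos_factor=0.3, neg_factor=0.4):
    """Label one feature-map location against one GT box."""
    if contains(shrink_box(gt_box, pos_factor), x, y):
        return "positive"
    if contains(shrink_box(gt_box, neg_factor), x, y):
        return "ignore"  # between the two shrunk regions
    return "negative"
```

For a 100 x 100 GT box, the positive region spans 35 to 65 along each axis and the ignore band covers the rest of the 30 to 70 square. Passing pos_factor=0.2 and neg_factor=0.5 gives the FSAF setting.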

        Bottom-up anchor-free object detection methods [4, 11, 17, 32] aim to detect certain keypoints of objects, such as corners and centers. Their labeling strategy uses heatmaps, and in this sense it is quite different from that of top-down anchor-free approaches. Recently, HoughNet [23], a novel bottom-up voting-based approach that exploits near- and long-range evidence to detect object centers, showed performance comparable to leading one-stage and two-stage top-down methods.

        In anchor-based methods [3, 14, 16, 21, 22, 26, 31], objects are predicted from regressed anchor boxes. During training, the label of an anchor box is determined by its Intersection over Union (IoU) with the GT boxes. Different detectors use different criteria: for example, Faster R-CNN [22] labels anchors as positive if IoU > 0.7 and as negative if IoU < 0.3, while R-FCN [3], SSD [16] and RetinaNet [14] label anchors as positive if IoU > 0.5 but use slightly different criteria for negative labels. There are two prominent anchor-based approaches that directly address the labeling problem. Guided Anchoring [26] introduces a new adaptive anchoring scheme that learns arbitrarily shaped boxes instead of dense, predefined ones. Similar to FSAF [33], FoveaBox [10] and our method PPDet, Guided Anchoring follows region-based labeling and defines three types of regions for each ground-truth object: center, ignore, and outside regions. A generated anchor is labeled positive if it lies inside the center region, negative if it lies in the outside region, and is ignored otherwise. On the other hand, FreeAnchor [31] applies the idea of relaxing positive labels to anchor-based detectors. This is the method most similar to ours. It replaces hand-crafted anchor assignment with a maximum likelihood estimation procedure, in which anchors are free to choose their GT boxes. Since FreeAnchor uses a custom loss function to optimize object-anchor matching, it cannot be directly applied to anchor-free object detectors.
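
As a rough illustration of IoU-based anchor labeling, the sketch below applies the Faster R-CNN thresholds quoted above (positive if IoU > 0.7, negative if IoU < 0.3). It is a simplified example under stated assumptions: boxes are (x1, y1, x2, y2) tuples, the function names are hypothetical, and any additional rules a real detector may apply are left out.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label an anchor by its best IoU with any GT box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return "positive"
    if best < neg_thr:
        return "negative"
    return "ignore"  # neither clearly foreground nor clearly background
```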


3 Methods

Labeling strategy and training.

        Anchor-free detectors limit which GT boxes each level predicts by assigning GT boxes to appropriate FPN levels according to their scale [10] or their regression distance [25]. Here, we follow the scale-based assignment strategy of [10], because it is a natural way to associate GT boxes with feature pyramid levels. Then, we construct two distinct regions for each GT box. We define the "positive region" as the region that is concentric with the GT box and has the same shape. We set the size of the positive region experimentally. We then label all locations (i.e., features) that spatially fall inside the positive region of a GT box as "positive (foreground)" features and all remaining locations as "negative (background)" features. Each positive feature is assigned to the GT box that contains it. In Figure 2, the blue and red cells are foreground cells, and the rest (blank or white) are background cells. Blue cells are assigned to the frisbee object and red cells to the person object. To obtain the final detection score for an object instance, we pool the classification scores of all features assigned to that object by summing them into a single C-dimensional vector, where C is the number of classes. This pooled prediction vector is fed to the focal loss (FL), so each object instance is represented by a single prediction. For example, consider the red foreground features assigned to the person object in Figure 2. Let p be their pooled prediction vector and let y be the ground-truth label for the person class. This object instance then contributes FL(p, y) to the loss function during training. All features other than the positively labeled ones are negative, and each negative feature contributes to the loss individually (i.e., without pooling).
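
The sum-pooling step used in training can be sketched as below. This is an illustrative example, not the authors' code: the function name, the list-based data layout, and the use of None to mark negative features are assumptions made for the example.

```python
def pool_instance_scores(class_probs, assignment, num_instances):
    """Sum the class probability vectors of the features assigned to each instance.

    class_probs   : one C-dimensional score list per feature
    assignment    : per feature, the index of its object instance,
                    or None for a negative (background) feature
    num_instances : number of GT object instances
    Returns one pooled C-dimensional vector per instance.
    """
    num_classes = len(class_probs[0])
    pooled = [[0.0] * num_classes for _ in range(num_instances)]
    for probs, inst in zip(class_probs, assignment):
        if inst is None:
            continue  # negative features are not pooled; each enters the loss on its own
        for c, p in enumerate(probs):
            pooled[inst][c] += p
    return pooled
```

Each pooled vector then enters the classification loss once, so an object instance contributes a single positive prediction no matter how many features fall inside its positive region.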

        By default, each positive feature is assigned to the object instance whose box contains it. Features that lie in the intersection of different GT boxes therefore need special handling: we assign such a feature to the GT box whose center is closest to it. Similar to other anchor-free methods [10, 25, 32, 33], in our model each foreground feature assigned to an object is trained to predict the coordinates of that object's GT box.
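
A minimal sketch of this tie-breaking rule is given below. The function name and box format are hypothetical, and using squared Euclidean distance to the region center is an assumption; since the positive region is concentric with its GT box, the two share the same center.

```python
def assign_feature(x, y, positive_regions):
    """Return the index of the GT box that the feature at (x, y) is assigned to.

    positive_regions: list of (x1, y1, x2, y2) positive regions, one per GT box.
    If several regions contain the point, the one with the nearest center wins.
    Returns None when the point lies in no positive region (a negative feature).
    """
    best_index, best_dist = None, float("inf")
    for i, (x1, y1, x2, y2) in enumerate(positive_regions):
        if x1 <= x <= x2 and y1 <= y <= y2:
            cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
            dist = (x - cx) ** 2 + (y - cy) ** 2  # squared distance is enough for ranking
            if dist < best_dist:
                best_index, best_dist = i, dist
    return best_index
```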

        We use the focal loss [14] (α = 0.4 and γ = 1.5) for the classification branch and the smooth L1 loss [7] for the regression branch.
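
For reference, the focal loss with these hyperparameters can be sketched as follows. This is a plain-Python illustration of the standard per-class binary focal loss, not the code used in the paper. Clamping the scores to (eps, 1 - eps) is an assumption added here so that the logarithm stays defined, for example when a summed score reaches or exceeds 1; how the original implementation handles that case is not stated in these notes.

```python
import math


def focal_loss(probs, target_class, alpha=0.4, gamma=1.5, eps=1e-6):
    """Focal loss of one C-dimensional prediction, summed over classes.

    probs        : per-class scores in [0, 1] (values outside are clamped)
    target_class : index of the true class, or None for a background prediction
    """
    loss = 0.0
    for c, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clamp for numerical safety
        if c == target_class:
            # positive class: down-weight predictions that are already confident
            loss += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            # negative classes: penalize high scores
            loss += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return loss
```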


Inference.

        The inference pipeline of PPDet is shown in Figure 3. First, an input image is fed to the backbone network (described in the next section), which produces an initial set of detections. Each detection is associated with (i) a bounding box, (ii) an object class (chosen as the class with the highest probability), and (iii) a confidence score. Detections labeled with the background class are eliminated. We treat each remaining detection as a vote for the object it belongs to, where the box is a hypothesis for the object location and the confidence score is the strength of the vote. Next, these detections are pooled as follows. If two detections of the same object class overlap by more than a certain amount (i.e., Intersection over Union (IoU) > 0.6), we treat them as votes for the same object, and the score of each detection is increased by the score of the other multiplied by k^{IoU-1}, where k is a constant. The higher the IoU, the greater the increase. After applying this process to every pair of detections, we obtain the final detection scores. This step is followed by a class-aware non-maximum suppression (NMS) operation that produces the final detections.
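
The pairwise score pooling at inference can be sketched as below, using k = 40 and the IoU threshold 0.6 from the text. This is an illustrative reading of the description rather than the authors' implementation: the function names and the tuple layout are hypothetical, and computing every increase from the original (not yet pooled) scores is an assumption. The NMS step that follows is not shown.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pool_detection_scores(detections, k=40.0, iou_thr=0.6):
    """Let highly overlapping same-class detections reinforce each other.

    detections: list of (box, class_id, score) tuples.
    Returns the pooled scores in the same order as the input.
    """
    pooled = [score for _, _, score in detections]
    for i in range(len(detections)):
        box_i, cls_i, score_i = detections[i]
        for j in range(i + 1, len(detections)):
            box_j, cls_j, score_j = detections[j]
            if cls_i != cls_j:
                continue  # only detections of the same class vote for each other
            overlap = iou(box_i, box_j)
            if overlap > iou_thr:
                weight = k ** (overlap - 1.0)  # equals 1 at IoU = 1, shrinks as IoU drops
                pooled[i] += weight * score_j
                pooled[j] += weight * score_i
    return pooled
```

With two identical boxes of the same class and scores 0.5 and 0.3, both pooled scores become 0.8, which is the plain sum.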

        Note that although the prediction pooling used in inference may appear different from the pooling used in training, they are in fact the same process. The pooling used in training assumes that the bounding boxes predicted by the features in a positive region overlap each other perfectly (i.e., IoU = 1). In that case k^{IoU-1} = 1, so the weighted pooling reduces to a plain sum of the scores.


Network Architecture.

        PPDet uses the network architecture of RetinaNet [14], which consists of a backbone convolutional neural network (CNN) and a feature pyramid network (FPN) [13]. The FPN computes multi-scale feature representations and generates feature maps at five different scales. On top of each FPN level are two independent parallel networks, a classification network and a regression network. The classification network outputs a W × H × C tensor, where W and H are the spatial dimensions (width and height, respectively) and C is the number of classes. Similarly, the regression network outputs a W × H × 4 tensor, where 4 is the number of bounding box coordinates. We refer to each pixel in these tensors as a feature.


4 Experiments

        This section describes the experiments we performed to demonstrate the effectiveness of our proposed method. First, we present ablation experiments to find the optimal relative area of the positive region inside the GT box and the optimal regression loss weight. Next, we present several performance comparisons on the COCO dataset. Finally, we provide sample heatmaps that show the locations, relative to the GT box, of the features responsible for correct detections.

Implementation details.

        We use a Feature Pyramid Network (FPN) [13] on top of ResNet [9] and ResNeXt [28] as our backbone for the ablation experiments and the state-of-the-art comparison, respectively. For all experiments, we resize the images to 800 pixels on the short side and at most 1300 pixels on the long side. The constant k used in vote aggregation (i.e., in k^{IoU-1}) is experimentally set to 40. We train all models on 4 Tesla V100 GPUs and test on a single Tesla V100 GPU. We implement our model using the MMDetection [2] framework and PyTorch [20].


4.1 Ablation experiment

        In the ablation experiments, we use ResNet-50 with an FPN backbone unless otherwise stated. Models are trained for 12 epochs with a batch size of 16 using stochastic gradient descent (SGD) with a weight decay of 0.0001 and a momentum of 0.9. The initial learning rate of 0.01 is dropped by a factor of 10 at epochs 8 and 11. All ablation models are trained on the COCO [12] train2017 set and tested on the val2017 set.
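
The step schedule described above can be written as a small function. This is only a sketch: the function name is hypothetical, and the convention that epochs are counted from 0 with each drop taking effect at the start of the milestone epoch is an assumption, since the notes do not specify the indexing.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(8, 11), factor=0.1):
    """Step learning-rate schedule: multiply by `factor` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

The 24-epoch schedule used later in the notes corresponds to milestones=(16, 22) under the same assumption.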

The size of the "positive region".

        As mentioned before, we define the "positive region" as a region that is concentric with the GT box and has the same shape. We resize this positive region by multiplying the width and height of the GT box by a shrink factor. We experimented with shrink factors from 1.0 down to 0.2. The results are shown in Table 1. AP increases as the shrink factor goes from 1.0 to 0.4, but performance drops sharply below 0.4. Based on this ablation, we set the shrink factor to 0.4 for the rest of the experiments.


Regression loss weight.

        To find the best balance between the classification and regression losses, we conduct ablation experiments on the weight of the regression loss. As Table 2 shows, a weight of 0.75 yields the best results. We use this value for the rest of the experiments.


Improvements.

        We also adopt improvements used in other state-of-the-art object detectors [10, 25, 32]. First, we train our baseline model using ResNet-101 with an FPN backbone. Then, we replace the last convolutional layer before class prediction in the classification branch with a deformable convolutional layer. This modification improves all AP metrics by around 0.3 points (see Table 3). On top of this modification, we add group normalization after each convolutional layer in the regression and classification branches. As shown in Table 3, this second modification improves AP by 0.6 and AP50 by 1.1. In this table, we also report the recently introduced moLRP [18] metric, which combines localization, precision, and recall in a single metric. Lower values are better. The models are trained for 24 epochs with a batch size of 16 using stochastic gradient descent (SGD) with a weight decay of 0.0001 and a momentum of 0.9. The initial learning rate of 0.01 is dropped by a factor of 10 at epochs 16 and 22. We include both modifications in the final model.


Class imbalance.

        PPDet aggregates predictions into a single prediction for each object instance, which reduces the number of positive samples during training. One might argue that this further exacerbates class imbalance [19]. To analyze this, we calculated the average number of positive samples per image: 7 for PPDet, 41 for FoveaBox, and 165 for RetinaNet. PPDet greatly reduces the number of positive samples. However, this reduction is still small compared to the number of negative samples (tens of thousands), so it does not exacerbate the existing class imbalance problem. We use the focal loss to address the imbalance.


4.2 Comparison of state-of-the-art

        To compare our model with state-of-the-art methods, we use ResNet-101 with FPN and ResNeXt-101-64x4d with FPN backbones. They are trained for 24 and 16 epochs with batch sizes of 16 and 8, respectively, using SGD with a weight decay of 0.0001 and a momentum of 0.9. For the ResNet backbone, the initial learning rate of 0.01 is dropped by a factor of 10 at epochs 16 and 22. For the ResNeXt backbone, the initial learning rate of 0.005 is dropped by a factor of 10 at epochs 11 and 14. The models are trained on the COCO [12] train2017 set and tested on the test-dev set. We use the scales (800,480), (1067,640), (1333,800), (1600,960), (1867,1120), and (2133,1280) for multi-scale testing. Table 4 shows the performance of PPDet and several established state-of-the-art detectors.

        FSAF [33] and FoveaBox [10] construct "positive regions" in a way similar to ours. While the single-scale test performance of PPDet is on par with that of FSAF on the same ResNeXt-101-64x4d with FPN backbone, its multi-scale test performance is 1.7 AP points higher. Both of our models tested at a single scale achieve better results than FoveaBox, and outperform it by more than 1.0 point on small objects. With multi-scale testing, our model outperforms FoveaBox by 1 AP point on the same ResNet-101 with FPN backbone.

        Our multi-scale performance is the best among all anchor-free top-down methods. Furthermore, our multi-scale performance on small objects (i.e., APS) sets a new state of the art among all detectors in Table 4.

        We conduct experiments to analyze the impact of prediction pooling on training and inference. When we remove prediction pooling from the inference pipeline of the ResNet-101-FPN model, we observe a drop of 2.5 AP points on the val2017 set. To analyze the impact of prediction pooling on training, we add prediction pooling to RetinaNet [14] and FoveaBox [10] at inference time only (i.e., without prediction pooling in training). This leads to AP drops of 0.5 and 2.8 points for RetinaNet and FoveaBox, respectively.

        We also conduct an experiment to test the effectiveness of sum-pooling relative to max-pooling. For max-pooling, we identify the feature within the positive region whose predicted box overlaps most with the GT box. Then, only this feature is included in the focal loss to represent its GT box during training. This strategy lowers AP by more than 2 points, yielding 38.4 for ResNet-101 with an FPN backbone.

        As an additional result, we report the performance of PPDet on the PASCAL VOC dataset [5]. For training, we use the union ("07+12") of the PASCAL VOC 2007 trainval and VOC 2012 trainval images. For testing, we use the PASCAL VOC 2007 test set. With both models using the ResNet-50 backbone, PPDet achieves 77.8 mean average precision (mAP), outperforming the 76.6 mAP of FoveaBox [10], which we consider the baseline here.

        Figure 4 shows heatmaps of the locations, relative to the ground-truth box, of the cell centers responsible for detections. RetinaNet's heatmap is concentrated at the center of the ground-truth box. In contrast, PPDet's final detections come from a relatively wider region, which supports its dynamic, automatic weighting of the features in the positive region. In addition to the center of the ground-truth box, a large number of detections may come from other parts of the box.


5 Conclusion 

        In this work, we introduce a novel labeling strategy for training anchor-free object detectors. While current anchor-free methods enforce positive labels on all features within a predefined central region of the ground-truth box, our labeling strategy relaxes this constraint by sum-pooling the predictions derived from individual features into a single prediction. This allows the model to reduce the contribution of non-discriminative features during training. We develop PPDet, a single-stage anchor-free object detector that employs this labeling strategy during training and a new inference method based on pooled predictions. We analyze our idea through several ablation experiments. We report results on COCO test-dev, showing that PPDet performs on par with the state of the art and achieves state-of-the-art results on small objects (APS 31.4). We further validate the effectiveness of our method by visual inspection.

Origin blog.csdn.net/YoooooL_/article/details/130094352