NWD (2022)

A Normalized Gaussian Wasserstein Distance for Tiny Object Detection

Abstract

Detecting tiny objects is a very challenging problem because a tiny object contains only a few pixels. We demonstrate that state-of-the-art detectors fail to produce satisfactory results on tiny objects due to the lack of appearance information. Our main observation is that intersection-over-union (IoU) based metrics, such as IoU itself and its extensions, are very sensitive to the location deviation of tiny objects and significantly degrade detection performance when used in anchor-based detectors. To alleviate this, we propose a new evaluation metric for tiny object detection using the Wasserstein distance. Specifically, we first model bounding boxes as 2D Gaussian distributions, and then propose a new metric called Normalized Wasserstein Distance (NWD) to compute the similarity between them from their corresponding Gaussian distributions. The proposed NWD metric can be easily embedded into the label assignment, non-maximum suppression and loss function of any anchor-based detector to replace the commonly used IoU metric. We evaluate our metric on a new dataset for Tiny Object Detection (AI-TOD), in which the average object size is much smaller than in existing object detection datasets. Extensive experiments show that, when equipped with the NWD metric, our approach yields performance that is 6.7 AP points higher than a standard fine-tuning baseline and 6.0 AP points higher than the state-of-the-art competitor. The code is available at: https://github.com/jwwangchn/NWD.

1 Introduction

Tiny objects are ubiquitous in many real-world applications, including driving assistance, large-scale surveillance, and maritime rescue. Although object detection has made significant progress thanks to the development of deep neural networks [21, 15, 27], most detectors are designed for normal-sized objects. Tiny objects (smaller than 16×16 pixels in the AI-TOD dataset [29]) usually exhibit extremely limited appearance information, which makes it harder to learn discriminative features and leads to a large number of failure cases [25, 29, 35].

Recent advances in tiny object detection (TOD) have mainly focused on improving feature discrimination [14, 37, 20, 12, 1, 19]. Some efforts normalize the scale of the input image to increase the resolution of small objects and their corresponding features [24, 25]. Others use Generative Adversarial Networks (GANs) to directly generate super-resolved representations for small objects [12, 1, 19]. In addition, Feature Pyramid Networks (FPN) have been proposed to learn multi-scale features for scale-invariant detection [14, 37, 20]. These methods do improve TOD performance to some extent, but the accuracy gain usually comes at additional cost.

Besides learning discriminative features, the quality of training sample selection also plays an important role in anchor-based tiny object detectors [36], where the assignment of positive and negative (pos/neg) labels is crucial. However, because a tiny object covers only a few pixels, training sample selection becomes more difficult. As shown in Figure 1, the sensitivity of IoU varies greatly with object scale. Specifically, for a tiny object of 6×6 pixels, a minor location deviation leads to a significant drop in IoU (from 0.53 to 0.06), resulting in inaccurate label assignment. For a normal object of 36×36 pixels, however, the IoU changes only slightly (from 0.90 to 0.65) under the same location deviation. In addition, Figure 2 shows the IoU-Deviation curves for four different object scales; the curves drop faster as the object size becomes smaller. It is worth noting that this sensitivity of IoU stems from the fact that the location of a bounding box can only change discretely.

Figure 1: Sensitivity analysis of IoU for tiny and normal scale objects. Note that each grid represents one pixel, box A represents the ground truth bounding box, and boxes B and C represent predicted bounding boxes with 1-pixel and 4-pixel diagonal bias, respectively.
Figure 2: Comparison of IoU-bias curves and NWD-bias curves in two different scenarios. The abscissa value represents the number of pixel deviations between the center points of A and B, and the ordinate value represents the corresponding measurement value. Note that the location of the bounding box can only vary discretely, and the Value-Deviation curve is presented as a scatterplot.
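The IoU values quoted for Figure 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes square boxes in (cx, cy, w, h) form and a prediction shifted by the same number of pixels along both axes, and the function names are hypothetical.

```python
def iou_xywh(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def shifted_iou(side, shift):
    """IoU between a square box and a copy shifted diagonally by `shift` pixels."""
    gt = (0.0, 0.0, float(side), float(side))
    pred = (float(shift), float(shift), float(side), float(side))
    return iou_xywh(gt, pred)


for side in (6, 36):
    print(side, [round(shifted_iou(side, d), 2) for d in (1, 4)])
# 6 [0.53, 0.06]
# 36 [0.9, 0.65]
```

The same 1-pixel and 4-pixel shifts that barely affect the 36×36 box push the 6×6 box from 0.53 down to 0.06, far below a typical positive threshold.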

This phenomenon implies that the IoU metric is not invariant to object scale under discrete location deviations, which ultimately leads to two flaws in label assignment. Specifically, IoU thresholds (θp, θn) are used to assign pos/neg training samples in anchor-based detectors; for example, (0.7, 0.3) are used in the Region Proposal Network (RPN) [7]. First, the sensitivity of IoU on tiny objects means that a minor location deviation can flip an anchor's label, so pos/neg samples end up with similar features and the network has difficulty converging. Second, we find that the average number of positive samples assigned to each ground-truth (gt) box in the AI-TOD dataset [29] is less than 1, because the IoU between some gt boxes and every anchor is lower than the minimum positive threshold. The supervision available for training tiny object detectors is therefore insufficient. Although dynamic assignment strategies such as ATSS [36] can adaptively derive the IoU threshold for assigning pos/neg labels from the statistical characteristics of the objects, the sensitivity of IoU makes it difficult to find a good threshold and to provide high-quality pos/neg samples for tiny object detectors.

Given that IoU is not a good metric for tiny objects, this paper proposes a new metric that uses the Wasserstein distance to measure the similarity of bounding boxes and replaces the standard IoU. Specifically, we first model bounding boxes as 2D Gaussian distributions, and then use our proposed Normalized Wasserstein Distance (NWD) to measure the similarity of the resulting Gaussian distributions. The main advantage of the Wasserstein distance is that it can measure the similarity of distributions even when their overlap is zero or negligible. In addition, NWD is insensitive to object scale and is therefore better suited to measuring the similarity between tiny objects.

NWD can be applied to both one-stage and multi-stage anchor-based detectors. Furthermore, NWD can replace IoU not only in label assignment, but also in non-maximum suppression (NMS) and in the regression loss function. Extensive experiments on the new TOD dataset AI-TOD [29] show that our proposed NWD consistently improves the detection performance of all detectors tested. The contributions of this paper are summarized as follows.

• We analyze the sensitivity of IoU to small object position biases and propose NWD as a better measure of the similarity between two bounding boxes.
• We design a robust tiny object detector by applying NWD to label assignment, NMS, and loss functions in anchor-based detectors.
• Our proposed NWD significantly improves the TOD performance of popular anchor-based detectors, raising the AP of Faster R-CNN from 11.1% to 17.6% on the AI-TOD dataset.

2 Related Work

2.1 Tiny Object Detection

Most of the previous small/tiny object detection methods can be roughly classified into three categories: multi-scale feature learning, designing better training strategies, and GAN-based detection [28].

Multi-scale feature learning: A simple and classic approach is to resize the input image to different scales and train a separate detector for each, so that every detector performs best within a certain range of scales. To reduce the computational cost, some works [18, 14, 37] build feature-level pyramids at different scales. For example, SSD [18] detects objects from feature maps of different resolutions. Feature Pyramid Network (FPN) [14] constructs a top-down structure with lateral connections and combines feature information from different scales to improve detection performance. Building on this, many methods have been proposed to further improve FPN, including PANet [17], BiFPN [26], and Recursive-FPN [20]. In addition, TridentNet [13] builds a parallel multi-branch architecture with different receptive fields to generate scale-specific feature maps.

Designing better training strategies: Singh et al. observed that it is difficult to detect small and large objects at the same time. Inspired by this, they proposed SNIP [24] and SNIPER [25], which selectively train on objects within a certain scale range. In addition, Kim et al. [10] introduced the Scale-Aware Network (SAN), which maps features extracted at different scales into a scale-invariant subspace, making the detector more robust to scale variation.

GAN-based detection: Perceptual GAN [12] is the first attempt to apply GANs to small object detection; it improves small object detection by narrowing the representation difference between small and large objects. Bai et al. [1] proposed MT-GAN, which trains an image-level super-resolution model to enhance the features of small RoIs. The work in [19] proposes a feature-level super-resolution method to improve the small object detection performance of proposal-based detectors.

2.2 Evaluation Metric in Object Detection

IoU is the most widely used metric for measuring the similarity between bounding boxes. However, IoU only works when the bounding boxes overlap. To solve this problem, Generalized IoU (GIoU) [22] adds a penalty term based on the smallest enclosing box. However, GIoU degrades to IoU when one bounding box contains the other. DIoU [38] and CIoU [38] were therefore proposed; by considering three geometric properties (overlap area, center point distance and aspect ratio), they overcome the limitations of IoU and GIoU. GIoU, CIoU, and DIoU are mainly used to replace IoU in NMS and in loss functions to improve general object detection performance, but their use in label assignment is rarely discussed. In concurrent work, Yang et al. [32] proposed a Gaussian Wasserstein Distance (GWD) loss for oriented object detection, which measures the positional relationship between oriented bounding boxes. However, the motivation of GWD is to solve the boundary discontinuity and square-like problems in oriented object detection. Our motivation is to reduce the sensitivity of IoU to the location deviation of tiny objects, and our proposed method can replace IoU in all parts of an anchor-based object detector.

2.3 Label Assignment Strategies

Assigning high-quality anchors to the gt boxes of tiny objects is a challenging task. A simple approach is to lower the IoU threshold when selecting positive samples. Although this matches tiny objects to more anchors, the overall quality of the training samples decreases. Many recent works try to make the label assignment process more adaptive in order to improve detection performance [6]. For example, Zhang et al. [36] proposed Adaptive Training Sample Selection (ATSS), which automatically computes the pos/neg threshold for each gt box from the IoU statistics of a set of anchors. Kang et al. [9] introduced Probabilistic Anchor Assignment (PAA), which assumes that the joint loss distribution of pos/neg samples follows a Gaussian distribution. Optimal Transport Assignment (OTA) [6] formulates label assignment as an optimal transport problem from a global perspective. However, all of these methods use the IoU metric to measure the similarity between two bounding boxes and mainly focus on how the threshold is set in label assignment, which makes them unsuitable for TOD. In contrast, our approach focuses on designing a better evaluation metric that can replace the IoU metric in tiny object detectors.

3 Methodology

IoU is in fact the Jaccard similarity coefficient, which computes the similarity between two finite sample sets. Inspired by this, we design a better metric for tiny objects based on the Wasserstein distance, because it can consistently reflect the distance between distributions even when they do not overlap. The new metric therefore performs better than IoU when measuring the similarity between tiny objects. Details are as follows.

3.1 Gaussian Distribution Modeling for Bounding Box

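For tiny objects, the bounding box tends to contain some background pixels, since most real objects are not strictly rectangular. In these bounding boxes, foreground and background pixels are concentrated at the center and at the border of the box, respectively [30]. To better describe the weights of different pixels in the bounding box, the box can be modeled as a two-dimensional (2D) Gaussian distribution, in which the center pixel has the highest weight and the importance of a pixel decreases gradually from the center to the border. Specifically, consider a horizontal bounding box R = (cx, cy, w, h), where (cx, cy), w and h denote the center coordinates, width and height, respectively. The equation of its inscribed ellipse can be expressed as

$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1,$$

where (μx, μy) = (cx, cy) is the center of the ellipse, and σx = w/2 and σy = h/2 are the lengths of its semi-axes along the x and y axes. This ellipse is the density contour $(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = 1$ of a 2D Gaussian distribution, so R can be modeled as N(μ, Σ) with

$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}.$$

The similarity between two bounding boxes A and B can then be converted into the distance between two Gaussian distributions.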
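As a small illustration of this modeling step, the sketch below maps a box (cx, cy, w, h) to a Gaussian with mean (cx, cy) and diagonal covariance (w²/4, h²/4), which is the parameterization implied by the inscribed ellipse with semi-axes w/2 and h/2. The function names are hypothetical, and the check only confirms that edge midpoints of the box lie on the unit contour of that Gaussian.

```python
def box_to_gaussian(box):
    """Map a horizontal box (cx, cy, w, h) to (mean, diagonal covariance)."""
    cx, cy, w, h = box
    mean = (cx, cy)
    cov_diag = (w * w / 4.0, h * h / 4.0)  # sigma_x = w / 2, sigma_y = h / 2
    return mean, cov_diag


def mahalanobis_sq(point, mean, cov_diag):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    return dx * dx / cov_diag[0] + dy * dy / cov_diag[1]


mean, cov = box_to_gaussian((10.0, 20.0, 6.0, 4.0))
print(mahalanobis_sq((13.0, 20.0), mean, cov))  # right-edge midpoint -> 1.0
print(mahalanobis_sq((10.0, 22.0), mean, cov))  # top-edge midpoint   -> 1.0
print(mahalanobis_sq((13.0, 22.0), mean, cov))  # box corner          -> 2.0
```

Points on the inscribed ellipse have squared Mahalanobis distance 1, while the box corners, which are more likely to be background, fall outside that contour.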

3.2 Normalized Gaussian Wasserstein Distance




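We use the Wasserstein distance from optimal transport theory to measure the distance between distributions. For the Gaussian distributions Na and Nb modeled from bounding boxes A = (cxa, cya, wa, ha) and B = (cxb, cyb, wb, hb), whose covariance matrices are diagonal, the second-order Wasserstein distance simplifies to

$$W_2^2(N_a, N_b) = \left\lVert \left[ cx_a,\ cy_a,\ \frac{w_a}{2},\ \frac{h_a}{2} \right]^{\top} - \left[ cx_b,\ cy_b,\ \frac{w_b}{2},\ \frac{h_b}{2} \right]^{\top} \right\rVert_2^2 .$$

However, $W_2^2(N_a, N_b)$ is a distance and cannot be used directly as a similarity metric (that is, a value between 0 and 1, like IoU). We therefore normalize it in exponential form and obtain the Normalized Wasserstein Distance (NWD):

$$NWD(N_a, N_b) = \exp\left( -\frac{\sqrt{W_2^2(N_a, N_b)}}{C} \right),$$

where C is a constant closely related to the dataset. In the following experiments, we empirically set C to the average absolute size of AI-TOD, which gives the best performance. We also observe that the results are robust to C within a certain range; see the Supplementary Material for details.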
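A minimal sketch of the metric follows. It assumes the closed form for horizontal boxes, in which the Wasserstein distance is the Euclidean distance between the vectors (cx, cy, w/2, h/2) of the two boxes and NWD = exp(-W2 / C). The default C = 12.8 is taken from the average absolute size of AI-TOD reported in the experiments section; the function names are hypothetical and this is not the authors' released implementation.

```python
import math


def wasserstein2(box_a, box_b):
    """Second-order Wasserstein distance between the Gaussians of two (cx, cy, w, h) boxes."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    return math.sqrt(
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2.0) ** 2
        + ((ha - hb) / 2.0) ** 2
    )


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance in (0, 1]; 1 means identical boxes."""
    return math.exp(-wasserstein2(box_a, box_b) / c)


tiny = nwd((0, 0, 6, 6), (1, 1, 6, 6))
normal = nwd((0, 0, 36, 36), (1, 1, 36, 36))
far = nwd((0, 0, 6, 6), (10, 10, 6, 6))  # these two boxes do not overlap at all

print(nwd((0, 0, 6, 6), (0, 0, 6, 6)))   # 1.0 for identical boxes
print(math.isclose(tiny, normal))        # True: same shift, same value at both scales
print(far > 0.0)                         # True: still informative without overlap
```

For equally sized boxes the size terms cancel, so the value depends only on the center offset.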

Compared with IoU, NWD has the following advantages in tiny object detection:
(1) scale invariance,
(2) smoothness to position deviation,
(3) ability to measure the similarity between bounding boxes that do not overlap or that contain each other.
As shown in Figure 2, without loss of generality, we discuss how the metric values change in the following two scenarios. In the first row of Figure 2, we keep the scales of boxes A and B the same and move B along the diagonal of A. The four NWD curves coincide completely, which indicates that NWD is insensitive to the scale of the boxes. Furthermore, IoU is overly sensitive to minor location deviations, while NWD changes much more smoothly under the same deviations. This smoothness suggests that, under the same threshold, NWD may discriminate between pos/neg samples better than IoU. In the second row of Figure 2, we set the side length of B to half the side length of A and again move B along the diagonal of A. Compared with IoU, the NWD curve is much smoother and consistently reflects the similarity between A and B, even when |A∩B| = |A| or |B| (one box contains the other) or |A∩B| = 0 (no overlap).
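The containment scenario can be reproduced numerically. In the sketch below, B has half the side length of A and moves along the diagonal while staying inside A. The NWD form (exponential of the negative Euclidean distance between (cx, cy, w/2, h/2) vectors, divided by C) and the value C = 12.8 are assumptions carried over from the definition of the metric, and the helper names are hypothetical.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes."""
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


box_a = (0.0, 0.0, 36.0, 36.0)
shifts = [0.0, 3.0, 6.0, 9.0]  # B stays fully inside A for all of these shifts
ious = [iou(box_a, (d, d, 18.0, 18.0)) for d in shifts]
nwds = [nwd(box_a, (d, d, 18.0, 18.0)) for d in shifts]

print(ious)                        # [0.25, 0.25, 0.25, 0.25]
print([round(v, 3) for v in nwds])  # strictly decreasing as B moves away from the center
```

While B is contained in A, IoU is stuck at 0.25 and carries no information about where B sits, whereas NWD keeps decreasing as the centers move apart.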

3.3 NWD-based Detectors

The proposed NWD can be easily integrated into any anchor-based detector to replace IoU. Without loss of generality, this paper adopts the representative anchor-based Faster R-CNN to describe the use of NWD. Specifically, all modifications are made in the three parts that originally use IoU: pos/neg label assignment, NMS and the regression loss function. Details are as follows.

NWD-based label assignment. Faster R-CNN [21] consists of two networks: an RPN that generates region proposals and an R-CNN [7] that detects objects based on these proposals. Both RPN and R-CNN include a label assignment process. For RPN, anchors of different scales and aspect ratios are first generated, and binary labels are then assigned to the anchors to train the classification and regression heads. The label assignment process of R-CNN is similar to that of RPN; the difference is that the input of R-CNN is the output of RPN. To overcome the shortcomings of IoU in tiny object detection described above, we design an NWD-based label assignment strategy that uses NWD to assign labels. Specifically, for training RPN, a positive label is assigned to two kinds of anchors:
(1) the anchor with the highest NWD value with a gt box, provided that this NWD value is greater than θn, or
(2) an anchor whose NWD value with any gt box is higher than the positive threshold θp.
Accordingly, an anchor is assigned a negative label if its NWD value is lower than the negative threshold θn for all gt boxes. Anchors that are assigned neither a positive nor a negative label do not participate in training. Note that, in order to apply NWD directly to anchor-based detectors, the experiments use the same θp and θn as the original detectors.
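The assignment rules above can be sketched as follows. This is an illustration rather than the detector's actual assigner: the thresholds (0.7, 0.3) are the RPN values quoted in the introduction, C = 12.8 and the closed-form NWD are assumptions, the label encoding (1 positive, 0 negative, -1 ignored) and the function names are hypothetical, and the sketch assumes at least one anchor and one gt box.

```python
import math


def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes."""
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


def assign_labels(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    sim = [[nwd(a, g) for g in gts] for a in anchors]
    labels = []
    for row in sim:
        best = max(row)
        if best > pos_thr:        # rule (2): above the positive threshold for some gt
            labels.append(1)
        elif best < neg_thr:      # below the negative threshold for every gt
            labels.append(0)
        else:
            labels.append(-1)     # in between: left out of training
    for j in range(len(gts)):     # rule (1): best anchor of each gt, if above neg_thr
        i_best = max(range(len(anchors)), key=lambda i: sim[i][j])
        if sim[i_best][j] > neg_thr:
            labels[i_best] = 1
    return labels


gts = [(50, 50, 8, 8), (120, 120, 8, 8)]
anchors = [(50, 50, 8, 8), (54, 54, 8, 8), (80, 80, 8, 8), (126, 126, 8, 8)]
print(assign_labels(anchors, gts))  # [1, -1, 0, 1]
```

The last anchor has an NWD of roughly 0.52 with the second gt box, below θp, yet it becomes positive through rule (1) because it is that gt box's best match.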

NWD-based NMS. NMS is an integral part of the object detection pipeline; it suppresses redundant predicted bounding boxes and normally relies on the IoU metric. First, it sorts all predicted boxes by score. The predicted box M with the highest score is selected, and all other predicted boxes that overlap significantly with M (according to a predefined threshold Nt) are suppressed. This process is applied recursively to the remaining boxes. However, because IoU is so sensitive for tiny objects, the IoU values of many predicted boxes fall below Nt, so they are not suppressed and become false positive predictions. To address this issue, we argue that NWD is a better criterion for NMS in tiny object detection, because NWD overcomes the scale sensitivity problem. Furthermore, NWD-based NMS can be flexibly integrated into any tiny object detector with only a few lines of code.
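A greedy NMS loop with NWD as the similarity criterion might look like the sketch below. The threshold value 0.5 simply reuses the IoU threshold mentioned in the experimental setup and is an assumption here, as are C = 12.8, the closed-form NWD and the function names.

```python
import math


def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes."""
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


def nwd_nms(boxes, scores, nwd_thr=0.5, c=12.8):
    """Greedy NMS: keep the top-scoring box, drop boxes too similar to it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # highest-scoring remaining box
        keep.append(m)
        order = [i for i in order if nwd(boxes[m], boxes[i], c) <= nwd_thr]
    return keep


boxes = [(50, 50, 6, 6), (52, 52, 6, 6), (90, 90, 6, 6)]
scores = [0.9, 0.8, 0.7]
print(nwd_nms(boxes, scores))  # [0, 2]
```

The first two boxes are 6×6 and offset by 2 pixels on each axis, so their IoU is only 16/56 ≈ 0.29; an IoU threshold of 0.5 would keep both, while their NWD of about 0.80 marks the second as a duplicate.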

NWD-based regression loss. IoU-Loss [34] was introduced to eliminate the performance gap between training and testing [22]. However, IoU-Loss cannot provide gradients for optimizing the network in the following two cases:
(1) the predicted bounding box P does not overlap with the ground-truth box G (i.e. |P∩G| = 0), or
(2) box P completely contains box G or vice versa (i.e. |P∩G| = |P| or |G|).
Both cases are very common for tiny objects. On the one hand, a deviation of only a few pixels in P can leave P and G with no overlap; on the other hand, tiny objects are easily mispredicted, which leads to |P∩G| = |P| or |G|. IoU-Loss is therefore not suitable for tiny object detectors. Although CIoU and DIoU can handle these two cases, they are both based on IoU and thus remain very sensitive to the location deviation of tiny objects. To solve these problems, we use the NWD metric as a loss function:

$$\mathcal{L}_{NWD} = 1 - NWD(N_p, N_g),$$

where Np is the Gaussian distribution model of the predicted box P, and Ng is the Gaussian distribution model of the gt box G. As described in Section 3.2, the NWD-based loss can provide gradients in both cases, |P∩G| = 0 and |P∩G| = |P| or |G|.
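The gradient claim can be made explicit with a short derivation. It assumes the loss takes the form $\mathcal{L}_{NWD} = 1 - NWD(N_p, N_g)$ with $NWD = \exp(-W_2 / C)$, where $W_2$ is the Euclidean distance between the vectors $(cx, cy, w/2, h/2)$ of P and G; the subscripted symbols below are notation introduced for this sketch.

```latex
\begin{aligned}
W_2 &= \sqrt{(cx_p - cx_g)^2 + (cy_p - cy_g)^2
      + \tfrac{1}{4}(w_p - w_g)^2 + \tfrac{1}{4}(h_p - h_g)^2}, \\[4pt]
\frac{\partial \mathcal{L}_{NWD}}{\partial cx_p}
  &= \frac{1}{C}\, e^{-W_2 / C}\, \frac{\partial W_2}{\partial cx_p}
   = \frac{NWD(N_p, N_g)}{C\, W_2}\,(cx_p - cx_g), \\[4pt]
\frac{\partial \mathcal{L}_{NWD}}{\partial w_p}
  &= \frac{NWD(N_p, N_g)}{4\, C\, W_2}\,(w_p - w_g).
\end{aligned}
```

Neither expression depends on |P∩G|, so the gradient with respect to the center is non-zero whenever the centers differ and the gradient with respect to the width is non-zero whenever the widths differ, which covers both the non-overlapping and the containment cases.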

4 Experiments

We evaluate the proposed method on the AI-TOD [29] and VisDrone2019 [4] datasets. Ablation studies are performed on AI-TOD, a challenging dataset for tiny object detection. It has 8 categories and 700,621 object instances across 28,036 aerial images of 800 × 800 pixels. The average absolute size of objects in AI-TOD is only 12.8 pixels, which is much smaller than in other object detection datasets such as PASCAL VOC (156.6 pixels) [5], MS COCO (99.5 pixels) [16] and DOTA (55.3 pixels). VisDrone2019 [4] is a UAV dataset for object detection. It consists of 10,209 images in 10 categories. VisDrone2019 contains many complex scenes and a large number of tiny objects, because the images were captured in different places and at different altitudes.

We adopt the same evaluation metrics as the AI-TOD [29] dataset, including AP, AP0.5, AP0.75, APvt, APt, APs, and APm. Here AP is the average of the APs over the IoU thresholds IoU = {0.5, 0.55, ..., 0.95}, and AP0.5 and AP0.75 are the APs at IoU thresholds of 0.5 and 0.75, respectively. Furthermore, following AI-TOD [29], APvt, APt, APs, and APm evaluate very tiny (2-8 pixels), tiny (8-16 pixels), small (16-32 pixels) and medium (32-64 pixels) objects, respectively.
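The bookkeeping behind these metrics can be sketched as below. Two assumptions are labeled in the code: absolute size is taken to be the square root of the box area, and the size ranges are treated as half-open intervals, since the text does not say which boundary is inclusive. The function names are hypothetical.

```python
import math


def iou_thresholds():
    """The ten IoU thresholds 0.5, 0.55, ..., 0.95 over which AP is averaged."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_ap(ap_per_threshold):
    """Average of the per-threshold AP values."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


def size_category(w, h):
    """Scale bucket of a w x h box; returns None outside the 2-64 pixel range."""
    size = math.sqrt(w * h)  # assumption: absolute size = sqrt(area)
    buckets = [(2, 8, "very tiny"), (8, 16, "tiny"), (16, 32, "small"), (32, 64, "medium")]
    for low, high, name in buckets:
        if low <= size < high:  # assumption: half-open intervals [low, high)
            return name
    return None


print(iou_thresholds())
print(size_category(6, 6), size_category(12, 12), size_category(36, 36))
# very tiny tiny medium
```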

We conduct all experiments on a computer with 4 NVIDIA Titan X GPUs, and the code used for our experiments is based on the MMDetection [3] codebase. An ImageNet [23] pretrained ResNet-50 [8] with FPN [14] is used as the backbone unless otherwise specified. All models are trained for 12 epochs using the SGD optimizer with a momentum of 0.9, a weight decay of 0.0001, and a batch size of 8. We set the initial learning rate to 0.01 and decay it by a factor of 0.1 at the 8th and 11th epochs. The batch sizes of RPN and Fast R-CNN are set to 256 and 512, respectively, and the sampling ratio of positive to negative samples is set to 1/3. The number of proposals generated by RPN is set to 3000. At inference, we filter out background bounding boxes with a score threshold of 0.05 and apply NMS with an IoU threshold of 0.5. All experiments use the above training and inference parameters unless otherwise stated.
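The learning rate schedule described above amounts to a two-step decay. The sketch below assumes 1-based epoch numbers and assumes that each decay takes effect once the milestone epoch has finished; the text does not state the exact convention, and the function name is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate used during a 1-based `epoch` under step decay."""
    # Assumption: a milestone m lowers the rate for all epochs after m.
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * gamma ** drops


for epoch in (1, 8, 9, 11, 12):
    print(epoch, f"{learning_rate(epoch):.5f}")
# 1 0.01000
# 8 0.01000
# 9 0.00100
# 11 0.00100
# 12 0.00010
```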

4.1 Comparison with Other IoU-based Metrics

As mentioned in Section 2, several IoU-based metrics can be used to measure the similarity between bounding boxes. In this work, we re-implement four of them (i.e. IoU, GIoU, CIoU and DIoU) and our proposed NWD on the same base network (i.e. Faster R-CNN) to compare their performance on tiny objects. Specifically, they are applied to label assignment, NMS, and the loss function, respectively. The experimental results on the AI-TOD dataset are shown in Table 1.

Comparison in label assignment. Note that the metric is modified in the assignment modules of both RPN and R-CNN. Compared with the IoU metric, NWD achieves the highest AP of 16.1% and improves APt by 9.6%, which shows that NWD-based label assignment provides more high-quality training samples for tiny objects. To analyze where the improvement comes from, we conduct a set of statistical experiments. Specifically, we compute the average number of positive anchors matched to each gt box under the same default threshold; for IoU, GIoU, DIoU, CIoU, and NWD the values are 0.72, 0.71, 0.19, 0.19, and 1.05, respectively. Only NWD guarantees a reasonable number of positive training samples. Furthermore, although simply lowering the threshold of an IoU-based metric provides more positive anchors for training, IoU-based tiny object detectors still do not outperform NWD-based detectors after the threshold is fine-tuned, as discussed further in the Supplementary Material. This is because NWD addresses the sensitivity of IoU to the location deviation of tiny objects.

Comparison in NMS. In this experiment, we only modify the NMS module of RPN, because only the NMS in RPN directly affects the training process of the detector. Using different metrics to filter redundant predictions during training also affects detection performance. Specifically, NWD achieves the best AP of 11.9%, which is 0.8% higher than the commonly used IoU. This means that NWD is a better metric for filtering redundant bounding boxes when detecting tiny objects.

Comparison in loss function. Note that we modify the loss function in both RPN and R-CNN, which affects the convergence of the detector. The NWD-based loss function achieves the highest AP, 12.1%.

4.2 Ablation Study

In this section, Faster R-CNN [21] is used as the baseline, which consists of two stages: RPN and R-CNN. Our proposed method can be applied to the label assignment, NMS, and loss function modules of both RPN and R-CNN, so there are six modules in total that can be switched from the IoU metric to the NWD metric. To verify the effectiveness of our proposed method in different modules, we conduct the following two sets of ablation studies: a comparison of applying NWD to one of the six modules in RPN or R-CNN, and a comparison of applying NWD to all modules.

Apply NWD to a single module. The experimental results are shown in Table 2. Compared with the baseline, the NWD-based assignment modules in RPN and R-CNN yield the highest AP improvements, 6.2% and 3.2% respectively, which indicates that the label assignment problem caused by IoU is the most pronounced for tiny objects and that our NWD-based assignment strategy greatly improves assignment quality. We can also observe that our proposed method improves performance in 5 out of 6 modules, which confirms the effectiveness of the NWD-based approach. The performance drop for NMS in R-CNN may be because the default NMS threshold is suboptimal and needs to be fine-tuned.

Apply NWD to multiple modules . The experimental results are shown in Table 3. When training for 12 epochs, the detection performance improves significantly when using NWD in RPN, R-CNN, or all modules. When we apply NWD to the three modules of RPN, the best performance of 17.8% can be obtained. However, we find that when NWD is used in all six modules, AP drops by 2.6% compared to using NWD only in RPN. In order to analyze the reasons for the performance drop, we added a set of experiments and trained the network for 24 epochs. It can be seen that the performance gap is reduced from 2.6% to 0.9%, which indicates that when using NWD in R-CNN, the network needs more time to converge. Therefore, in the following experiments, we only used NWD in RPN, which achieved a considerable performance improvement in a shorter time.

4.3 Main Results

To reveal the effectiveness of NWD on TOD, we conduct experiments on the tiny object detection datasets AI-TOD [29] and VisDrone2019 [4].

Main results on AI-TOD. To verify that NWD can be applied to any anchor-based detector and improve TOD performance, we select 5 baseline detectors, including one-stage anchor-based detectors (i.e. RetinaNet [15], ATSS [36]) and multi-stage anchor-based detectors (i.e. Faster R-CNN [21], Cascade R-CNN [2], DetectoRS [20]). The experimental results are shown in Table 4. The current state-of-the-art detectors have extremely low APvt, close to zero, which means that they cannot produce satisfactory results on tiny objects. Moreover, our proposed NWD-based detectors improve the AP of RetinaNet, ATSS, Faster R-CNN, Cascade R-CNN and DetectoRS by 4.5%, 0.7%, 6.7%, 4.9% and 6.0%, respectively. The improvement is even more pronounced when the objects are very tiny. Notably, NWD-based DetectoRS achieves state-of-the-art performance (20.8% AP) on AI-TOD. Visualization results for the IoU-based detector (first row) and the NWD-based detector (second row) on the AI-TOD dataset are shown in Figure 3. We can observe that NWD significantly reduces false negatives (FN) compared to IoU.


Main results on VisDrone. In addition to AI-TOD, we also use VisDrone2019 [4], which contains many tiny objects in different scenes, to verify the generalization of NWD-based detectors. The results are shown in Table 5. The NWD-based versions of the anchor-based detectors all achieve considerable improvements over their baselines.

5 Conclusion

We observe that IoU-based metrics are very sensitive to the location deviation of tiny objects, which greatly degrades tiny object detection performance. To address this issue, we propose a new metric called Normalized Wasserstein Distance (NWD) to measure the similarity between the bounding boxes of tiny objects. On this basis, we further propose a novel NWD-based tiny object detector, which embeds NWD into the label assignment, non-maximum suppression and loss function of an anchor-based detector to replace the original IoU metric. The experimental results show that this method greatly improves tiny object detection performance and achieves state-of-the-art results on the AI-TOD dataset.

Origin blog.csdn.net/m0_60534571/article/details/129836106