[Reading notes for object detection papers] Extended Feature Pyramid Network for Small Object Detection

(no code found, only yaml file)

Abstract.

        Small object detection is still an unsolved challenge because it is difficult to extract information about small objects that cover only a few pixels. While feature pyramid networks alleviate this problem by detecting each scale on a corresponding pyramid level, we find that feature coupling across scales still hurts performance on small objects. In this paper, we propose the Extended Feature Pyramid Network (EFPN), which adds ultra-high-resolution pyramid levels specialized for small object detection. Specifically, we design a new module named Feature Texture Transfer (FTT) to super-resolve features and simultaneously extract plausible regional details. Furthermore, we design a foreground-background balanced loss function to alleviate the regional imbalance between foreground and background. In our experiments, the proposed EFPN is efficient in terms of computation and memory, and produces state-of-the-art results on the small traffic sign dataset Tsinghua-Tencent 100K and on the small-object category of the general object detection dataset MS COCO.


1 Introduction

        Object detection is a fundamental task for many advanced computer vision problems, such as segmentation, image captioning, and video understanding. In the past few years, the rapid development of deep learning has promoted the popularity of CNN-based detectors, which mainly include two-stage pipelines [8, 7, 28, 5] and single-stage pipelines [24, 27, 20]. Although these general-purpose object detectors greatly improve accuracy and efficiency, they still perform poorly on small objects that cover only a few pixels. Since CNNs repeatedly apply pooling layers to extract high-level semantics, the pixels of small objects can be filtered out during downsampling.

        Exploiting low-level features is one way to obtain information about small objects. Feature Pyramid Network (FPN) [19] is the first method to enhance features by fusing features from different levels and constructing a feature pyramid, where upper feature maps are responsible for detecting larger objects, while lower feature maps are responsible for smaller ones. Although FPN improves multi-scale detection performance, the heuristic mapping between pyramid level and proposal size in FPN detectors may confuse small object detection. As shown in Fig. 1(a), small objects must share the same feature map with medium objects and some large objects, while easy cases like large objects can select features from suitable levels. Furthermore, as shown in Fig. 1(b), the detection precision and recall of the FPN bottom layer drop sharply as the object scale decreases. Figure 1 shows that feature coupling across scales in common FPN detectors still degrades the ability to detect small objects.

        Intuitively, another way to compensate for the information loss of small objects is to increase the feature resolution. Therefore, some super-resolution (SR) methods have been introduced into object detection. Early approaches [11, 3] directly super-resolve input images, but extracting features from them in subsequent networks is computationally expensive. Li et al. [14] introduced GAN [10] to lift the features of small objects to a higher resolution. Noh et al. [25] use high-resolution object features to supervise the SR of the entire feature map containing contextual information. These feature SR methods avoid burdening the CNN backbone, but they only imagine missing details from low-resolution feature maps, while ignoring plausible details encoded in other features of the backbone. Therefore, they tend to create false textures and artifacts on CNN features, leading to false positives.

[25 Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection.]

        In this paper, we propose the Extended Feature Pyramid Network (EFPN), which employs large-scale SR features with rich regional details to decouple small and medium object detection. EFPN extends the original FPN with a high-resolution level dedicated to small object detection. To avoid the expensive computation caused by direct high-resolution image input, the extended high-resolution feature maps of our method are generated by an FPN-like framework with an embedded feature SR module. After building the common feature pyramid, the proposed Feature Texture Transfer (FTT) module first combines deep semantics from low-resolution features with shallow regional textures from high-resolution feature references. Then, the subsequent FPN-like lateral connections further enrich the regional features through tailor-made intermediate CNN feature maps. One advantage of EFPN is that the generation of high-resolution feature maps relies on the original real features produced by the CNN and FPN, rather than on the unreliable imagination in other similar methods. As shown in Fig. 1(b), the extended pyramid level with credible details in EFPN significantly improves the detection performance for small objects.

        Furthermore, we introduce features generated from large-scale input images as supervision for optimizing EFPN, and design a foreground-background balanced loss function. We argue that a general reconstruction loss will lead to under-learning of positive pixels, since small instances cover only a small part of the entire feature map. Given the importance of foreground-background balance [20], we add the loss of object regions to the global loss function, drawing attention to the feature quality of positive pixels.

        We evaluate our method on the challenging small traffic sign dataset Tsinghua-Tencent 100K and the general object detection dataset MS COCO. The results show that the proposed EFPN outperforms other state-of-the-art methods on both datasets. Furthermore, single-scale EFPN achieves performance similar to multi-scale testing while using fewer computational resources.

        For clarity, the main contributions of our work can be summarized as:

(1) We propose the Extended Feature Pyramid Network (EFPN), which improves the performance of small object detection.

(2) We design a key feature-reference-based SR module named Feature Texture Transfer (FTT) to endow the extended feature pyramid with credible details for more accurate small object detection.

(3) We introduce a foreground-background balanced loss function to draw attention to positive pixels and alleviate the regional imbalance between foreground and background.

(4) Our efficient method significantly improves the performance of the detector and achieves state-of-the-art results on Tsinghua-Tencent 100K and on the small-object category of MS COCO.


2 Related Work

2.1 Deep Object Detectors

        Deep-learning-based detectors have dominated general object detection due to their high performance. Successful two-stage methods [8, 7, 28, 5] first generate regions of interest (RoIs) and then use classifiers and regressors to refine the RoIs. One-stage detectors [24, 27, 20] are another popular family; they classify and localize objects directly on CNN feature maps with the help of predefined anchor boxes. Recently, anchor-free frameworks [13, 38, 31, 39] have also become more popular. Despite the development of deep object detectors, small object detection remains an unsolved challenge. Dilated convolutions [34] were introduced in [23, 17, 16] to enlarge the receptive field for multi-scale detection. However, general-purpose detectors tend to focus on improving performance on easier large-scale instances, since the metric for general-purpose object detection is the average precision across all scales. Detectors specialized for small objects still need more exploration.


2.2 Cross-scale features

        Utilizing cross-scale features is an effective way to alleviate the problem of object scale variation. Building image pyramids is a traditional way to generate features across scales. Using features from different network layers is another cross-scale practice. SSD [24] and MS-CNN [4] detect objects of different scales at different layers of the CNN backbone. FPN [19] builds a feature pyramid by merging features from lower and higher layers along a top-down path. Following FPN, FPN variants explore more information pathways in the feature pyramid. PANet [22] adds an additional bottom-up path to pass shallow localization information upward. G-FRNet [1] introduces gate units on the path, passing key information and blocking ambiguous information. NAS-FPN [6] uses AutoML to search for optimal path configurations. Although these FPN variants improve the performance of multi-scale object detection, they keep the same number of layers as the original FPN. These layers are not well suited to small object detection, which results in poor performance on small objects.


2.3 Super-resolution in object detection

        Some studies introduce SR into object detection, since small object detection always benefits from large scales. Image-level SR is adopted in some specific cases where extremely small objects exist, such as satellite images [15] and images with crowded small faces [2]. But large-scale images are a burden for subsequent networks. Instead of super-resolving the whole image, SOD-MTGAN [3] super-resolves only the regions of RoIs, but a large number of RoIs still requires considerable computation. Another way of applying SR is to super-resolve features directly. Li et al. [14] used Perceptual GAN to enhance the features of small objects with those of large objects. STDN [37] employs sub-pixel convolutions on top of DenseNet [12] to detect small objects while reducing network parameters. Noh et al. [25] super-resolve the entire feature map and introduce supervisory signals during training. However, the aforementioned feature SR methods are all based on restricted information from a single feature map. Recent reference-based SR methods [35, 36] can enhance SR images using the texture or content of reference images. Inspired by reference-based SR, we design a novel module that super-resolves features using shallow features with plausible details as the reference, resulting in features that are more suitable for small object detection.


3 Our Approach

        In this section, we introduce the proposed Extended Feature Pyramid Network (EFPN) in detail. First, we construct an extended feature pyramid, which is specialized for small objects with high-resolution feature maps at the bottom. Specifically, we design a module named Feature Texture Transfer (FTT) to generate intermediate features for the extended feature pyramid. Furthermore, we adopt a novel foreground-background balanced loss function to further strengthen the learning of positive pixels. Sections 3.1 and 3.2 explain the pipeline of the EFPN network and the FTT module, and Section 3.3 elaborates our loss function design.


3.1 Extended Feature Pyramid Network

        Vanilla FPN builds a 4-layer feature pyramid by upsampling high-level CNN feature maps and fusing them with lower features via lateral connections. Although features on different pyramid levels are responsible for objects of different sizes, small object detection and medium object detection are still coupled on the same bottom layer P2 of FPN, as shown in Figure 1. To alleviate this problem, we propose EFPN to extend the original feature pyramid with a new level, which can detect small objects with more regional details.

        We implement the extended feature pyramid through an FPN-like framework embedded with a feature SR module. This pipeline generates high-resolution features directly from low-resolution images to support small object detection while keeping the computational cost low. The overview of EFPN is shown in Fig. 2.

        The first 4 pyramid layers are constructed by a top-down path for medium and large object detection. The bottom extension of EFPN, which contains the FTT module, a top-down pathway and the purple pyramid layer in Fig. 2, aims to capture the regional details of small objects. More specifically, in the extension, the 3rd and 4th pyramid layers of EFPN (denoted by the green and yellow layers respectively in Fig. 2) are mixed in the feature SR module FTT to produce the intermediate feature P3' with selected regional information, which is represented by a blue diamond in Fig. 2. Then, the top-down path merges P3' with the customized high-resolution CNN feature map C2', yielding the final extended pyramid layer P2'. We remove a max-pooling layer in ResNet/ResNeXt stage2, and take C2' as the output of stage2, as shown in Table 1. C2' has the same level of representation as the original C2, but contains more regional detail due to its higher resolution. The smaller receptive field of C2' also helps to better localize small objects. Mathematically, the extension operation in the proposed EFPN can be described as

$$P_2' = P_3' \uparrow_{2\times} + \, C_2'$$

where $\uparrow_{2\times}$ denotes 2× upscaling by nearest-neighbor interpolation.
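
A minimal sketch of this merge step (P3' upsampled 2× by nearest neighbor, then added to C2'), assuming single-channel feature maps stored as nested Python lists. The helper names `upsample_nearest_2x` and `extend_pyramid` are made up for illustration, and the 1×1 and 3×3 convolutions that FPN normally places around a lateral connection are left out:

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def extend_pyramid(p3_prime, c2_prime):
    """P2' = upsample(P3') + C2', element-wise (illustrative only)."""
    up = upsample_nearest_2x(p3_prime)
    if len(up) != len(c2_prime) or len(up[0]) != len(c2_prime[0]):
        raise ValueError("C2' must have twice the resolution of P3'")
    return [[u + c for u, c in zip(up_row, c_row)]
            for up_row, c_row in zip(up, c2_prime)]


# Toy inputs: a 2x2 P3' and a 4x4 C2' filled with ones.
p3_prime = [[1, 2], [3, 4]]
c2_prime = [[1] * 4 for _ in range(4)]
p2_prime = extend_pyramid(p3_prime, c2_prime)
```

In a real detector the maps would be multi-channel tensors, but the shape relationship is the same: C2' has twice the height and width of P3'.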

        In the EFPN detector, the mapping between candidate box size and pyramid level still follows the one in [19]:

$$l = \left\lfloor l_0 + \log_2\left(\sqrt{wh}/224\right) \right\rfloor$$

Here l represents the pyramid level, w and h are the width and height of the candidate box, 224 is the canonical ImageNet pre-training size, and l0 is the target level to which a candidate box with w × h = 224^{2} should be mapped. Since the detector after EFPN adapts to various receptive fields, the receptive field drift mentioned in [25] can be ignored.
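
A small sketch of the FPN-style rule from [19], where the level is the floor of l0 + log2(sqrt(w·h) / 224). The defaults here are assumptions rather than values stated in these notes: l0 = 4 as in the original FPN, and the extended level P2' is given the hypothetical index 1 so that it sits below P2. The function name `roi_to_level` is also made up:

```python
import math


def roi_to_level(w, h, l0=4, min_level=1, max_level=5, canonical=224):
    """Map a w x h box to a pyramid level, clamped to the available levels.

    min_level=1 stands in for the extended level P2' (an assumption).
    """
    level = math.floor(l0 + math.log2(math.sqrt(w * h) / canonical))
    return max(min_level, min(max_level, level))


levels = {size: roi_to_level(size, size) for size in (28, 56, 112, 224)}
```

With these assumed defaults, each halving of the box side moves the box one level down the pyramid, and boxes too small for P2 land on the extended level.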


3.2 Feature Texture Transfer

        Inspired by image-reference-based SR [35], we design an FTT module to super-resolve features and simultaneously extract regional textures from reference features. Without FTT, noise in EFPN layer 4, P2, would pass directly down to the extended pyramid layers and drown out meaningful semantics. The proposed FTT output, however, synthesizes the strong semantics in the low-resolution features of the upper layers and the key local details in the high-resolution reference features of the lower layers, while discarding the interfering noise in the references.

        As shown in Fig. 3, the main input of the FTT module is the feature map P3 from EFPN layer 3, and the reference is the feature map P2 from EFPN layer 4. The output P3' can be defined as

$$P_3' = E_t\left(P_2 \,\|\, E_c(P_3) \uparrow_{2\times}\right) + E_c(P_3) \uparrow_{2\times}$$

where $E_t(\cdot)$ represents the texture extractor component, $E_c(\cdot)$ represents the content extractor component, $\uparrow_{2\times}$ here denotes 2× upscaling through sub-pixel convolution [29], and $\|$ denotes feature concatenation. Both the content extractor and the texture extractor consist of residual blocks.

        In the main stream, we apply sub-pixel convolution to upscale the spatial resolution of the content features of the main input P3, considering its efficiency. Sub-pixel convolution enlarges the width and height dimensions by moving pixels out of the channel dimension. Denote the features generated by the convolutional layers as $F \in \mathbb{R}^{H \times W \times C \cdot r^{2}}$. The pixel shuffle operator in sub-pixel convolution rearranges the features into a map of shape rH × rW × C. This operation can be defined mathematically as

$$PS(F)_{x,y,c} = F_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \mathrm{mod}(y, r) + C \cdot \mathrm{mod}(x, r) + c}$$

where $PS(F)_{x,y,c}$ represents the output feature pixel at coordinates (x, y, c) after the pixel shuffle operation PS(·), and r represents the upscaling factor. In our FTT module, we use r = 2 to double the spatial scale.
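
To make the rearrangement concrete, here is a plain-Python sketch of pixel shuffle on an H × W × (C·r²) nested list. It assumes the ESPCN [29] index convention, in which output pixel (x, y, c) is read from input position (⌊x/r⌋, ⌊y/r⌋) and channel C·r·mod(y, r) + C·mod(x, r) + c. Framework implementations may order the channels differently, so treat this as an illustration rather than a drop-in replacement:

```python
def pixel_shuffle(feat, r):
    """Rearrange an H x W x (C*r*r) nested list into rH x rW x C."""
    h, w = len(feat), len(feat[0])
    c_in = len(feat[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    out = [[[0] * c for _ in range(w * r)] for _ in range(h * r)]
    for x in range(h * r):
        for y in range(w * r):
            for ch in range(c):
                # Source channel index, following the assumed convention.
                src = c * r * (y % r) + c * (x % r) + ch
                out[x][y][ch] = feat[x // r][y // r][src]
    return out


# One spatial position with 4 channels becomes a 2x2 map with 1 channel.
feat = [[[10, 11, 12, 13]]]
shuffled = pixel_shuffle(feat, 2)
# shuffled == [[[10], [12]], [[11], [13]]]
```

No values are created or interpolated: the operator only moves channel entries into spatial positions, which is why it is cheap compared with deconvolution.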

        In the reference stream, the concatenation of the reference features P2 and the super-resolved content features of P3 is fed to the texture extractor. The texture extractor aims to extract reliable textures for small object detection and to block useless noise in the concatenated input.

        The element-wise addition of texture and content at the end ensures that the output integrates semantic and regional information from both input and reference. Therefore, the feature map P3' possesses reliable textures selected from the shallow feature reference P2, and similar semantics from the deeper layer P3.
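
The data flow of the two streams can be sketched as below. This only illustrates the wiring (upscale content, concatenate with the reference, extract texture, add). The extractors and the upscaling step are passed in as callables, and the toy stand-ins used in the example (`identity`, `keep_reference`, `nearest_2x`) are hypothetical placeholders, not the residual blocks or sub-pixel convolution of the actual module:

```python
def concat_channels(a, b):
    """Concatenate two H x W x C maps along the channel axis."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_maps(a, b):
    """Element-wise sum of two maps with identical shapes."""
    return [[[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def ftt(p3, p2, content_extractor, texture_extractor, upsample_2x):
    """P3' = E_t(P2 || up(E_c(P3))) + up(E_c(P3)), wiring only."""
    content = upsample_2x(content_extractor(p3))   # main stream
    wrapped = concat_channels(p2, content)         # reference stream input
    texture = texture_extractor(wrapped)
    return add_maps(texture, content)


# Toy stand-ins, for illustration only.
def nearest_2x(fmap):
    out = []
    for row in fmap:
        wide = [list(px) for px in row for _ in range(2)]
        out.append(wide)
        out.append([list(px) for px in wide])
    return out


def identity(fmap):
    return fmap


def keep_reference(fmap):
    return [[px[:1] for px in row] for row in fmap]  # keep the P2 channel


p3 = [[[1]]]                          # 1 x 1 x 1
p2 = [[[10], [20]], [[30], [40]]]     # 2 x 2 x 1
p3_prime = ftt(p3, p2, identity, keep_reference, nearest_2x)
```

The output has the resolution of P2 and carries both the upscaled content of P3 and whatever the texture extractor chose to keep from the reference.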


3.3 Training Loss

Foreground-background-balance loss.

        The foreground-background-balance loss aims to improve the overall quality of EFPN. Common global losses lead to under-learning of small object regions, since small objects occupy only a small portion of the entire image. The foreground-background balanced loss function improves the feature quality of both background and foreground through two parts: 1) a global reconstruction loss and 2) a positive patch loss.

        The global reconstruction loss mainly strengthens the similarity to real background features, since background pixels constitute most of the image. Here we use the l1 loss commonly used in SR as the global reconstruction loss $L_{glob}$:

$$L_{glob}(F, F^t) = \| F^t - F \|_1$$

where $F$ represents the generated feature map and $F^t$ represents the target feature map.

        The positive patch loss is used to draw attention to positive pixels, since severe foreground-background imbalance can hurt detector performance [20]. We use the l1 loss on the foreground area as the positive patch loss $L_{pos}$ (the original text is wrong here):

$$L_{pos}(F, F^t) = \frac{1}{N} \sum_{(x,y) \in P_{pos}} \| F^t_{x,y} - F_{x,y} \|_1$$

where $P_{pos}$ represents the patches of the ground-truth objects, N represents the total number of positive pixels, and (x, y) represents the coordinates of a pixel on the feature map. The positive patch loss acts as a stronger constraint on the regions where objects are located, forcing the network to learn the true representation of these regions.

        The foreground-background-balanced loss function $L_{fbb}$ is defined as

$$L_{fbb}(F, F^t) = L_{glob}(F, F^t) + \lambda L_{pos}(F, F^t)$$

where λ is the weight balancing factor. The balanced loss function mines true positives by improving the feature quality of the foreground area, and eliminates false positives by improving the feature quality of the background area.
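
A toy sketch of the balanced loss on single-channel 2D maps. It assumes that the global term is an unnormalized l1 sum over all pixels and that the positive term is the l1 sum over foreground pixels divided by their count N. The boolean `pos_mask` argument and the function name `fbb_loss` are illustrative choices, not names from the paper:

```python
def fbb_loss(f, f_t, pos_mask, lam=1.0):
    """Global l1 loss plus lam times the mean l1 loss over positive pixels."""
    glob, pos, n_pos = 0.0, 0.0, 0
    for f_row, t_row, m_row in zip(f, f_t, pos_mask):
        for g, t, is_pos in zip(f_row, t_row, m_row):
            diff = abs(t - g)
            glob += diff              # every pixel counts in the global term
            if is_pos:
                pos += diff           # foreground pixels are counted again
                n_pos += 1
    l_pos = pos / n_pos if n_pos else 0.0
    return glob + lam * l_pos


generated = [[0, 0], [0, 0]]
target = [[1, 1], [1, 3]]
mask = [[False, False], [False, True]]   # one foreground pixel
loss = fbb_loss(generated, target, mask, lam=1.0)
# global term = 6, positive term = 3 / 1, so loss == 9.0
```

Because the foreground error is averaged over N rather than over the whole map, a small object still contributes a noticeable term even when it covers only a handful of pixels.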


Overall loss.

        Feature maps from a 2×-scale FPN are introduced to supervise the training process of EFPN. Not only the bottom extended pyramid level is supervised, but the FTT module as well. The overall training objective of EFPN is defined as

$$L = L_{fbb}(P_3', P_3^{2\times}) + L_{fbb}(P_2', P_2^{2\times})$$

Here $P_2^{2\times}$ is the target P2 from the FPN with 2× input, and $P_3^{2\times}$ is the target P3 from the FPN with 2× input.


4 Experiments

4.1 Datasets and Evaluation Metrics

Tsinghua-Tencent 100K.

        Tsinghua-Tencent 100K [40] is a dataset for traffic sign detection and classification. It contains 100,000 high-resolution (2048×2048) images and 30,000 traffic sign instances. Importantly, in the test set, 92% of the instances cover less than 0.2% of the entire image. Tsinghua-Tencent 100K is dominated by small and medium objects, making it an excellent benchmark for small object detection. The Tsinghua-Tencent 100K benchmark divides all instances into three scales: small (area < 32^{2}), medium (32^{2} < area < 96^{2}), and large (area > 96^{2}). Following the protocol in [40, 25, 14], we select the 45 classes with more than 100 instances for evaluation and report accuracy and recall at IoU = 0.5 on the three scales.

MS COCO.

        Microsoft COCO (MS COCO) [21] is a widely used large-scale dataset for general object detection, segmentation and captioning. It consists of three subsets: a train subset with 118k images, a val subset with 5k images, and a test-dev subset with 20k images. Object detection on MS COCO faces three challenges: (1) small objects: about 65% of the instances are smaller than 6% of the image size; (2) more instances in a single image than in other similar datasets; (3) varied lighting and object shapes.

        We report average precision (AP) and average recall (AR) for the small category (area < 32^{2}) on the test-dev subset to highlight small object detection performance. In MS COCO, AP and AR are averaged over 10 IoU thresholds (IoU = 0.5 : 0.05 : 0.95), which rewards detectors with better localization.
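
As a reminder of what the 10 thresholds mean in practice, the sketch below computes the IoU of two axis-aligned boxes and counts how many of the thresholds 0.50, 0.55, ..., 0.95 a single match would pass. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the helper names are illustrative. This is not the official COCO evaluation code, which additionally handles score ranking, crowd regions and per-category averaging:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(iou_value):
    """Number of thresholds at which this match counts as a true positive."""
    return sum(1 for t in COCO_THRESHOLDS if iou_value >= t)


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))   # half-overlapping boxes
```

A loosely localized box passes only the lower thresholds, so averaging over all ten favors detectors that localize tightly, which is especially demanding for objects a few pixels wide.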


4.2 Implementation Details

        We implement our proposed EFPN on the Faster R-CNN detector, with ResNet-50 and ResNeXt-101 [32] as backbones. The original Faster R-CNN with FPN is first trained as a baseline. Then, we train EFPN with the backbone and heads frozen. When EFPN converges, we fine-tune a new detector head for the extended pyramid level with the help of OHEM [30], since there is always a gap between the extended feature map P2' and the target map P2 from the 2× input image. During inference, the new detection head outputs small bounding boxes from the extended pyramid level, while the original detection head outputs medium and large bounding boxes from the first 4 layers of the pyramid. Finally, all predicted boxes from the different pyramid levels are combined to produce the final detection result.

        We use 2 residual blocks each for the content extractor and the texture extractor in the FTT module. The weight λ for balancing the foreground and background in the training loss is set to 1.

        In the Tsinghua-Tencent 100K experiment, because the number of instances per class is uneven, we augment each class to about 1000 instances by cropping and color jittering. Labels outside the 45 evaluated classes are also used in training for better generalization. The model is trained on the training split and tested on the testing split. Single-scale testing uses images resized to 1400×1400, and RoIs smaller than 56 are assigned to pyramid level P2' accordingly.

        In the MS COCO experiment, we follow the training scheme in Detectron [9] and add data augmentation with scale and color jitter. The model is trained on the train split and tested on the test-dev split. Single-scale testing uses images resized to 800 on the shorter side, and RoIs of size less than 112 are assigned to pyramid level P2' accordingly.


4.3 Performance Comparison

Tsinghua-Tencent 100K

        We present our model results in Table 2 and compare them with other state-of-the-art methods on Tsinghua-Tencent 100K. EFPN demonstrates its ability to localize and classify small objects more precisely. Compared with Faster R-CNN using ResNeXt-101-FPN, single-scale EFPN improves small object accuracy by 3.4% and small object recall by 0.2%. Accuracy and recall for medium objects also improve slightly, by 0.6% and 0.8% respectively. We infer that this is because some medium objects shrink after image resizing and are assigned to the extended pyramid level P2' for detection. Notably, the 1400 × 1400 single-scale test of our proposed EFPN outperforms the state-of-the-art 1600 × 1600 single-scale test of Noh et al. 82.1%, and 87.1% and 86.6%.

        Furthermore, we introduce the F1 score to comprehensively evaluate the detector's performance. The multi-scale evaluation of EFPN not only yields the best accuracy on small and medium objects, but also produces new state-of-the-art composite F1 scores at all three scales.
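
For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A tiny helper is sketched below. The input values are arbitrary illustrations, not numbers from the paper's tables:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)    # equal precision and recall
lopsided = f1_score(1.0, 0.5)    # high precision, low recall
```

Because the harmonic mean is pulled toward the smaller value, a detector cannot obtain a high F1 by trading all of its recall for precision or the other way around.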

MS COCO.

        We report single-scale model results on the small category of the MS COCO test-dev split for our proposed EFPN and other general detectors. Although the number of small objects in MS COCO is smaller than that in Tsinghua-Tencent 100K, EFPN still significantly enhances the capabilities of general-purpose object detectors. EFPN is suitable for different backbone networks and achieves significant gains on ResNet-50/ResNeXt-101 compared with FPN. Moreover, EFPN not only outperforms other FPN variants such as Libra R-CNN [26] and PANet [22] on small object detection, but also surpasses similar SR-based approaches such as Noh et al. [25] and Bai et al. [3]. Specifically, our model outperforms other state-of-the-art multi-scale general detectors such as TridentNet [16], FSAF [39] and RPDet [33] on small objects.


4.4 Ablation studies

        We conduct ablation experiments to verify the efficiency of EFPN and the contribution of each network component. The ResNeXt-101 backbone and the Faster R-CNN detection head are adopted. All models are trained on the Tsinghua-Tencent 100K train split and tested on the test split. The results are shown in Table 4 and Table 5.

EFPN is efficient in terms of computation and memory.

        As shown in Table 4, we compare the performance of EFPN with FPN tested at different scales. All models were tested on a single GTX 1080Ti GPU. The large input scale in FPN-2800 improves performance on small objects by 1.6%, but sacrifices 20.8% of the performance on large objects. Combining FPN-1400 with P2 from FPN-2800 achieves high performance across scales, but costs more runtime and GPU memory than the 2× test. Our proposed EFPN achieves the same high accuracy as FPN-1400+P2-2800, at an affordable computational cost that lies between the 1× test and the 2× test of FPN. EFPN efficiently achieves multi-scale FPN test accuracy with a single forward pass.


Extending the pyramid level alone is not enough.

        We test the effect of the extended feature pyramid without the FTT module and the foreground-background balanced loss, because FPN-1400+P2-2800 is effective in Table 4. ESPCN [29] is an SR method based on single-image SR. We replace the FTT module with a three-layer ESPCN, which serves the same function of creating the intermediate feature map in the EFPN extension and passing it to the subsequent lateral connection. The supervision of P2 and P3 from the 2× image input is achieved by a global l1 loss. As shown in Table 5, extending the pyramid level without the FTT module and foreground attention has limited effect, improving the F1 score of the small category by only 0.3%. The extended pyramid level hardly recalls any additional missed small objects.


The foreground-background balance loss is crucial.

        A balanced loss function with foreground attention is added to the extended feature pyramid embedded with ESPCN. Table 5 shows that the balanced loss improves small-category accuracy by 2.2%, leading to a 0.7% improvement in F1 score, suggesting that the foreground-background-balanced loss encourages meaningful changes to the positive regions of the extended feature map.

        We further investigate different configurations of the balancing hyperparameter λ. When λ is set to 0.5/1.0/1.5, we get F1 scores of 84.8/85.3/85.1 on small categories. Therefore, we adopt λ = 1.0 to achieve a better balance between precision and recall.


The FTT module further improves the quality of EFPN.

        Finally, we replace ESPCN with our proposed FTT module. In Table 5, it improves accuracy and recall for the small category by 0.8% and 1.0%, respectively. Compared with single-image SR, the FTT module mines more hard small cases. At the same time, the FTT module also ensures fewer false positives by reducing artifacts on the background.

        In Fig. 4, we visualize the extended features of the ESPCN and FTT modules to further demonstrate the superiority of the FTT module. We find that features from the FTT module are more similar to the target features and have sharper boundaries between object and background regions. Richer regional details help the detector distinguish positive and negative samples, thus giving better localization and classification.


4.5 Qualitative results

        In Figure 5, we show detection examples for Tsinghua-Tencent 100K and MS COCO. Compared with the FPN baseline, our proposed EFPN recalls tiny and crowded instances better. Although the original ground-truth labels in MS COCO do not include all small objects, our method still detects objects that are present but not labeled, which can be regarded as reasonable cases of false positives.


5 Conclusion

        In this paper, we propose EFPN to address the problem of small object detection. A novel FTT module is embedded into an FPN-like framework to efficiently capture more regional details at the extended pyramid level. Furthermore, we design a foreground-background balanced training loss to alleviate the regional imbalance between foreground and background. State-of-the-art performance on various datasets demonstrates the superiority of EFPN in small object detection.

        EFPN can be combined with various detectors to enhance small object detection, which means that EFPN can be transferred to more specific small object detection scenarios, such as face detection or satellite image detection. We hope to further explore the application of EFPN in more fields.

 


Origin blog.csdn.net/YoooooL_/article/details/130052426