EfficientDet Paper: Translation and Interpretation of "EfficientDet: Scalable and Efficient Object Detection" (Continued)


Introduction: On November 21, 2019, the Google Brain team released the paper EfficientDet: Scalable and Efficient Object Detection.
Three AutoML leaders from the Google Brain team, Mingxing Tan, Ruoming Pang, and Quoc V. Le, recently published this article on arXiv, and some netizens speculated that it had been submitted to CVPR 2020. By improving the multi-scale feature fusion structure of FPN and drawing on the model scaling method of EfficientNet, the authors propose EfficientDet, a scalable and efficient object detection algorithm.
This work can be seen as an extension of EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (an ICML 2019 oral), extending from classification tasks to object detection.
Figure 1 shows the trade-off between a network's FLOPS (and thus speed) and its mAP accuracy, which can be balanced according to the requirements of the application scenario. Along the curve from EfficientDet-D1 to EfficientDet-D7, FLOPS gradually increases (the models become slower) while mAP gradually improves.

Contents

Translation and interpretation of Scalable and Efficient Object Detection

Abstract

1. Introduction

2. Related Work

3. BiFPN

3.1. Problem Formulation

3.2. Cross-Scale Connections

3.3. Weighted Feature Fusion


Translation and interpretation of Scalable and Efficient Object Detection

Paper address: https://arxiv.org/pdf/1911.09070.pdf
Paper authors: Mingxing Tan, Ruoming Pang, Quoc V. Le (Google Research, Brain Team), {tanmingxing, rpang, qvl}@google.com

Abstract

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study various neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations, we have developed a new family of object detectors, called EfficientDet, which consistently achieve an order-of-magnitude better efficiency than prior art across a wide spectrum of resource constraints. In particular, without bells and whistles, our EfficientDet-D7 achieves state-of-the-art 51.0 mAP on the COCO dataset with 52M parameters and 326B FLOPS, being 4x smaller and using 9.3x fewer FLOPS yet still more accurate (+0.3% mAP) than the best previous detector.

1. Introduction

Figure 1: Model FLOPS vs COCO accuracy – All numbers are for single-model single-scale. Our EfficientDet achieves much better accuracy with fewer computations than other detectors. In particular, EfficientDet-D7 achieves new state-of-the-art 51.0% COCO mAP with 4x fewer parameters and 9.3x fewer FLOPS. Details are in Table 2.

Tremendous progress has been made in recent years towards more accurate object detection; meanwhile, state-of-the-art object detectors have also become increasingly more expensive. For example, the latest AmoebaNet-based NAS-FPN detector [37] requires 167M parameters and 3045B FLOPS (30x more than RetinaNet [17]) to achieve state-of-the-art accuracy. The large model sizes and expensive computation costs deter their deployment in many real-world applications such as robotics and self-driving cars, where model size and latency are highly constrained. Given these real-world resource constraints, model efficiency becomes increasingly important for object detection.
There have been many previous works aiming to develop more efficient detector architectures, such as one-stage [20, 25, 26, 17] and anchor-free detectors [14, 36, 32], or to compress existing models [21, 22]. Although these methods tend to achieve better efficiency, they usually sacrifice accuracy. Moreover, most previous works only focus on a specific or a small range of resource requirements, but the variety of real-world applications, from mobile devices to datacenters, often demands different resource constraints.
A natural question is: Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPS)? This paper aims to tackle this problem by systematically studying various design choices of detector architectures. Based on the one-stage detector paradigm, we examine the design choices for backbone, feature fusion, and class/box network, and identify two main challenges:
  • Challenge 1: efficient multi-scale feature fusion – Since introduced in [16], FPN has been widely used for multi-scale feature fusion. Recently, PANet [19], NAS-FPN [5], and other studies [13, 12, 34] have developed more network structures for cross-scale feature fusion. While fusing different input features, most previous works simply sum them up without distinction; however, since these different input features are at different resolutions, we observe they usually contribute to the fused output feature unequally. To address this issue, we propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion.
  • Challenge 2: model scaling – While previous works mainly rely on bigger backbone networks [17, 27, 26, 5] or larger input image sizes [8, 37] for higher accuracy, we observe that scaling up the feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency. Inspired by recent works [31], we propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for all backbone, feature network, and box/class prediction networks.

Finally, we also observe that the recently introduced EfficientNets [31] achieve better efficiency than previous commonly used backbones (e.g., ResNets [9], ResNeXt [33], and AmoebaNet [24]). Combining EfficientNet backbones with our proposed BiFPN and compound scaling, we have developed a new family of object detectors, named EfficientDet, which consistently achieve better accuracy with an order-of-magnitude fewer parameters and FLOPS than previous object detectors. Figure 1 and Figure 4 show the performance comparison on the COCO dataset [18]. Under a similar accuracy constraint, our EfficientDet uses 28x fewer FLOPS than YOLOv3 [26], 30x fewer FLOPS than RetinaNet [17], and 19x fewer FLOPS than the recent NAS-FPN [5]. In particular, with a single model and single test-time scale, our EfficientDet-D7 achieves state-of-the-art 51.0 mAP with 52M parameters and 326B FLOPS, being 4x smaller and using 9.3x fewer FLOPS yet still more accurate (+0.3% mAP) than the best previous models [37]. Our EfficientDet models are also up to 3.2x faster on GPU and 8.1x faster on CPU than previous detectors, as shown in Figure 4 and Table 2.

Our contributions can be summarized as:

• We proposed BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion.
• We proposed a new compound scaling method, which jointly scales up backbone, feature network, box/class network, and resolution, in a principled way.
• Based on BiFPN and compound scaling, we developed EfficientDet, a new family of detectors with significantly better accuracy and efficiency across a wide spectrum of resource constraints.

2. Related Work

One-Stage Detectors: Existing object detectors are mostly categorized by whether they have a region-of-interest proposal step (two-stage [6, 27, 3, 8]) or not (one-stage [28, 20, 25, 17]). While two-stage detectors tend to be more flexible and more accurate, one-stage detectors are often considered to be simpler and more efficient by leveraging predefined anchors [11]. Recently, one-stage detectors have attracted substantial attention due to their efficiency and simplicity [14, 34, 36]. In this paper, we mainly follow the one-stage detector design, and we show it is possible to achieve both better efficiency and higher accuracy with optimized network architectures.
Multi-Scale Feature Representations: One of the main difficulties in object detection is to effectively represent and process multi-scale features. Earlier detectors often directly perform predictions based on the pyramidal feature hierarchy extracted from backbone networks [2, 20, 28]. As one of the pioneering works, feature pyramid network (FPN) [16] proposes a top-down pathway to combine multi-scale features. Following this idea, PANet [19] adds an extra bottom-up path aggregation network on top of FPN; STDL [35] proposes a scale-transfer module to exploit cross-scale features; M2det [34] proposes a U-shape module to fuse multi-scale features, and G-FRNet [1] introduces gate units for controlling information flow across features. More recently, NAS-FPN [5] leverages neural architecture search to automatically design feature network topology. Although it achieves better performance, NAS-FPN requires thousands of GPU hours during search, and the resulting feature network is irregular and thus difficult to interpret. In this paper, we aim to optimize multi-scale feature fusion with a more intuitive and principled way.
Model Scaling: In order to obtain better accuracy, it is common to scale up a baseline detector by employing bigger backbone networks (e.g., from mobile-size models [30, 10] and ResNet [9], to ResNeXt [33] and AmoebaNet [24]), or increasing input image size (e.g., from 512x512 [17] to 1536x1536 [37]). Some recent works [5, 37] show that increasing the channel size and repeating feature networks can also lead to higher accuracy. These scaling methods mostly focus on single or limited scaling dimensions. Recently, [31] demonstrates remarkable model efficiency for image classification by jointly scaling up network width, depth, and resolution. Our proposed compound scaling method for object detection is mostly inspired by [31].

3. BiFPN

In this section, we first formulate the multi-scale feature fusion problem, and then introduce the two main ideas for our proposed BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion.

Figure 2: Feature network design – (a) FPN [16] introduces a top-down pathway to fuse multi-scale features from level 3 to 7 (P3 - P7); (b) PANet [19] adds an additional bottom-up pathway on top of FPN; (c) NAS-FPN [5] uses neural architecture search to find an irregular feature network topology; (d)-(f) are three alternatives studied in this paper. (d) adds expensive connections from all input features to output features; (e) simplifies PANet by removing nodes if they only have one input edge; (f) is our BiFPN with better accuracy and efficiency trade-offs.

3.1. Problem Formulation

Multi-scale feature fusion aims to aggregate features at different resolutions. Formally, given a list of multi-scale features P̃^in = (P^in_l1, P^in_l2, ...), where P^in_li represents the feature at level l_i, our goal is to find a transformation f that can effectively aggregate different features and output a list of new features: P̃^out = f(P̃^in). As a concrete example, Figure 2(a) shows the conventional top-down FPN [16]. It takes level 3-7 input features P̃^in = (P^in_3, ..., P^in_7), where P^in_i represents a feature level with resolution 1/2^i of the input image. For instance, if the input resolution is 640x640, then P^in_3 represents feature level 3 (640/2^3 = 80) with resolution 80x80, while P^in_7 represents feature level 7 with resolution 5x5. The conventional FPN aggregates multi-scale features in a top-down manner:

P^out_7 = Conv(P^in_7)
P^out_6 = Conv(P^in_6 + Resize(P^out_7))
...
P^out_3 = Conv(P^in_3 + Resize(P^out_4))

where Resize is usually an upsampling or downsampling op for resolution matching, and Conv is usually a convolutional op for feature processing.
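To make the aggregation above concrete, here is a minimal sketch of the conventional top-down FPN in PyTorch. It is an illustration under stated assumptions, not the paper's released code: the module name, channel count, and nearest-neighbor resizing are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of conventional top-down FPN fusion over levels 3-7."""
    def __init__(self, channels=64, levels=(3, 4, 5, 6, 7)):
        super().__init__()
        self.levels = levels
        # One 3x3 conv per level for feature processing after fusion.
        self.convs = nn.ModuleDict({
            str(l): nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for l in levels
        })

    def forward(self, feats):
        # feats: dict {level: tensor of shape (N, C, H_l, W_l)}; the resolution
        # halves at each level (e.g. 80x80 at level 3, 5x5 at level 7 for 640x640 input).
        outs = {}
        top = max(self.levels)
        outs[top] = self.convs[str(top)](feats[top])            # P^out_7 = Conv(P^in_7)
        for l in sorted(self.levels, reverse=True)[1:]:         # 6, 5, 4, 3
            up = F.interpolate(outs[l + 1], size=feats[l].shape[-2:], mode="nearest")
            outs[l] = self.convs[str(l)](feats[l] + up)         # P^out_l = Conv(P^in_l + Resize(P^out_{l+1}))
        return outs
```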

3.2. Cross-Scale Connections

Conventional top-down FPN is inherently limited by the one-way information flow. To address this issue, PANet [19] adds an extra bottom-up path aggregation network, as shown in Figure 2(b). Cross-scale connections are further studied in [13, 12, 34]. Recently, NAS-FPN [5] employs neural architecture search to search for better cross-scale feature network topology, but it requires thousands of GPU hours during search and the found network is irregular and difficult to interpret or modify, as shown in Figure 2(c).
By studying the performance and efficiency of these three networks (Table 4), we observe that PANet achieves better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation. To improve model efficiency, this paper proposes several optimizations for cross-scale connections: First, we remove those nodes that only have one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, then it will contribute less to a feature network that aims at fusing different features. This leads to a simplified PANet as shown in Figure 2(e); Second, we add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost, as shown in Figure 2(f); Third, unlike PANet [19], which only has one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion. Section 4.2 will discuss how to determine the number of layers for different resource constraints using a compound scaling method. With these optimizations, we name the new feature network the bidirectional feature pyramid network (BiFPN), as shown in Figures 2(f) and 3.
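The resulting connection pattern of one BiFPN layer can be sketched as follows. This is an illustrative rendering of Figure 2(f) only: it uses plain sums (the learnable fusion weights of Section 3.3 are omitted), and `conv` / `resize` are placeholder callables for a convolution block and a resolution-matching resize (e.g. `F.interpolate` in PyTorch).

```python
def bifpn_layer(p_in, conv, resize):
    # p_in: dict {level: feature tensor} for levels 3..7.
    levels = sorted(p_in)                      # [3, 4, 5, 6, 7]
    top, bottom = max(levels), min(levels)

    # Top-down pass: intermediate nodes P6_td, P5_td, P4_td. P7 and P3 get no
    # intermediate node, since such a node would have only one input edge.
    td = {}
    prev = p_in[top]
    for l in reversed(levels[1:-1]):           # 6, 5, 4
        td[l] = conv(p_in[l] + resize(prev, like=p_in[l]))
        prev = td[l]

    # Bottom-up pass: output nodes. P3_out closes the top-down path; levels 4-6
    # also take the extra same-level edge from the original input; P7_out fuses
    # its input with the downsampled P6_out.
    out = {bottom: conv(p_in[bottom] + resize(td[bottom + 1], like=p_in[bottom]))}
    for l in levels[1:]:                       # 4, 5, 6, 7
        fused = p_in[l] + resize(out[l - 1], like=p_in[l])
        if l in td:                            # extra same-level edge exists for 4, 5, 6
            fused = fused + td[l]
        out[l] = conv(fused)
    return out
```

Repeating this layer several times gives the stacked BiFPN used in EfficientDet; how many repetitions to use is decided by the compound scaling method of Section 4.2.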

3.3. Weighted Feature Fusion

When fusing multiple input features with different resolutions, a common way is to first resize them to the same resolution and then sum them up. Pyramid attention network [15] introduces global self-attention upsampling to recover pixel localization, which is further studied in [5].
Previous feature fusion methods treat all input features equally without distinction. However, we observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this issue, we propose to add an additional weight for each input during feature fusion, and let the network learn the importance of each input feature. Based on this idea, we consider three weighted fusion approaches:
Unbounded fusion: O = Σ_i w_i · I_i, where w_i is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). We find that a scalar can achieve comparable accuracy to the other approaches with minimal computational cost. However, since the scalar weight is unbounded, it could potentially cause training instability. Therefore, we resort to weight normalization to bound the value range of each weight.
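A minimal sketch of this unbounded fusion with one learnable scalar per input, assuming PyTorch (the class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class UnboundedFusion(nn.Module):
    def __init__(self, num_inputs):
        super().__init__()
        # One unconstrained scalar per input feature; nothing bounds its range,
        # which is why this variant can destabilize training.
        self.w = nn.Parameter(torch.ones(num_inputs))

    def forward(self, inputs):
        # inputs: list of tensors already resized to a common resolution.
        return sum(w * x for w, x in zip(self.w, inputs))
```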
Softmax-based fusion: O = Σ_i (e^{w_i} / Σ_j e^{w_j}) · I_i. An intuitive idea is to apply softmax to each weight, such that all weights are normalized to be a probability with value range from 0 to 1, representing the importance of each input. However, as shown in our ablation study in Section 6.3, the extra softmax leads to a significant slowdown on GPU hardware. To minimize the extra latency cost, we further propose a fast fusion approach.
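For comparison, a corresponding sketch of the softmax-based variant, again assuming PyTorch and illustrative names:

```python
import torch
import torch.nn as nn

class SoftmaxFusion(nn.Module):
    def __init__(self, num_inputs):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))

    def forward(self, inputs):
        # Normalize the weights to probabilities in (0, 1) that sum to 1,
        # then take the weighted sum of the (already resized) inputs.
        weights = torch.softmax(self.w, dim=0)
        return sum(w * x for w, x in zip(weights, inputs))
```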
