[Paper reading notes] EfficientDet: Scalable and Efficient Object Detection

Paper address: EfficientDet

Paper summary

  This paper builds a detection network on top of EfficientNet. It proposes a new feature fusion method (BiFPN) and a compound scaling scheme for the detector (jointly scaling multiple dimensions, as in EfficientNet), yielding a family of networks that are both efficient and accurate.
  BiFPN adds a bottom-up path to FPN with some further modifications, and assigns a learnable weight to each input of a feature fusion node, unlike the traditional approach of directly adding feature maps of the same size.

  The network performance is shown as follows:

Introduction

  BiFPN is an efficient bidirectional cross-scale connection scheme, as shown in (d) below.

  The effective cross-scale connection scheme is as follows:

  1. Remove nodes that have only one input edge; the authors argue that a node with a single input performs no feature fusion, so it contributes relatively little to the feature network;
  2. Add an extra edge from the original input to the output node when they sit at the same level, so that more feature information is fused without adding much computation;
  3. Unlike PANet, which has only one top-down and one bottom-up path, BiFPN treats each top-down plus bottom-up pair as a basic layer that can be repeated multiple times to obtain higher-level feature fusion (see the sketch after this list).
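
  As a rough illustration of this connection pattern, here is a minimal Python sketch of one BiFPN layer over levels P3-P7; the fuse/up/down functions are placeholders for the operations described in the following paragraphs, not the authors' implementation.

```python
# Minimal sketch of one BiFPN layer over levels P3-P7 (our illustration, not the
# paper's code). fuse / up / down are placeholders for the weighted fusion,
# 2x upsampling and 2x downsampling described below.
def bifpn_layer(feats):
    """feats: dict {3: P3, ..., 7: P7}; returns a fused dict of the same levels."""
    fuse = lambda *xs: sum(xs) / len(xs)   # placeholder: plain averaging
    up = lambda x: x                       # placeholder: 2x upsample
    down = lambda x: x                     # placeholder: 2x downsample

    # Top-down path: intermediate nodes P6_td, P5_td, P4_td and the P3 output.
    td = {7: feats[7]}
    for lvl in range(6, 2, -1):            # 6, 5, 4, 3
        td[lvl] = fuse(feats[lvl], up(td[lvl + 1]))

    # Bottom-up path: each output fuses the original input, the top-down node,
    # and the next lower output (the extra same-level edge from point 2 above).
    out = {3: td[3]}
    for lvl in range(4, 7):                # 4, 5, 6
        out[lvl] = fuse(feats[lvl], td[lvl], down(out[lvl - 1]))
    out[7] = fuse(feats[7], down(out[6]))  # P7 has no separate top-down node
    return out

# The full BiFPN simply repeats this layer D_bifpn times:
#   for _ in range(d_bifpn): feats = bifpn_layer(feats)
```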

  At the same time, the authors argue that the different inputs of a fusion node should not all carry the same weight, so the paper proposes to learn a weight for each fused input. In general, there are several granularities for such weights: (1) scalar, i.e. each input gets one scalar weight; (2) vector, i.e. each channel of each input gets its own weight; (3) tensor, i.e. every element gets its own weight. Experimentally, the authors found that scalar weights achieve accuracy comparable to vector and tensor weights.
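
  For concreteness, a small sketch of what the three granularities mean for a feature map of shape (N, C, H, W); the variable names and shapes here are illustrative assumptions, not from the paper's code.

```python
import torch
import torch.nn as nn

# Three possible weight granularities for ONE fusion input of shape (N, C, H, W).
N, C, H, W = 2, 64, 32, 32
x = torch.randn(N, C, H, W)

w_scalar  = nn.Parameter(torch.ones(()))       # (1) one scalar per input
w_channel = nn.Parameter(torch.ones(C, 1, 1))  # (2) one weight per channel
w_element = nn.Parameter(torch.ones(C, H, W))  # (3) one weight per element

# All three broadcast against x; the paper finds the scalar version is enough.
y = w_scalar * x
```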

  However, these weights are unbounded, which may make training unstable, so the authors normalize them into the range $[0, 1]$. Softmax normalization was tried first, but it noticeably degrades speed, so the weights are instead normalized directly as $O = \sum_i \frac{w_i}{\varepsilon + \sum_j w_j} \cdot I_i$ (fast normalized fusion, with each $w_i \ge 0$ enforced by a ReLU), which achieves similar accuracy.
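
  A minimal PyTorch sketch of this fast normalized fusion, assuming one learnable scalar per input kept non-negative with a ReLU (our reading of the description above, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Fuse a list of same-shape feature maps with normalized scalar weights."""

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)       # keep each w_i >= 0
        w = w / (self.eps + w.sum())   # normalize to [0, 1] without softmax
        return sum(wi * x for wi, x in zip(w, inputs))

# usage: fuse = FastNormalizedFusion(2); p = fuse([p_in, p_td_upsampled])
```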

  A convolution is usually applied after feature fusion. To further improve efficiency, EfficientDet uses a depthwise separable convolution for this post-fusion operation.
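
  A sketch of such a post-fusion block as a depthwise separable convolution (depthwise 3x3 followed by pointwise 1x1, plus batch norm); the exact ordering of conv/norm/activation here is an assumption for illustration:

```python
import torch.nn as nn

def separable_conv_block(channels):
    """Depthwise 3x3 + pointwise 1x1 conv with batch norm, applied after fusion."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                  groups=channels, bias=False),                    # depthwise 3x3
        nn.Conv2d(channels, channels, kernel_size=1, bias=False),  # pointwise 1x1
        nn.BatchNorm2d(channels),
    )
```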

Network Architecture

Compound scaling

  The backbone network reuses EfficientNet's scaling, i.e. EfficientNet-B0~B6, so ImageNet pre-trained checkpoints can be used directly.
  BiFPN network: the width $W_{bifpn}$ (channels) is scaled exponentially, $W_{bifpn} = 64 \cdot (1.35^{\phi})$, where the base was chosen by a grid search over {1.2, 1.25, 1.3, 1.35, 1.4, 1.45}, with 1.35 picked as the width scaling factor; the depth $D_{bifpn}$ (layers) is scaled linearly, $D_{bifpn} = 3 + \phi$.
  Box/class prediction network: the number of channels is the same as the BiFPN width, while the number of layers grows as $D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor$.
  Input image resolution: because the network downsamples by up to $2^7 = 128$ (feature levels 3-7 are used), the resolution must be divisible by 128, so $R_{input} = 512 + \phi \cdot 128$.
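
  Putting these rules together, a small sketch that computes the configuration for a given compound coefficient phi (integer phi = 0..6 pairs with EfficientNet-B0..B6); note the paper rounds channel counts to nearby hardware-friendly values, so the plain rounding here only approximates the published widths:

```python
def efficientdet_config(phi):
    """Compound scaling rules for a given (integer) compound coefficient phi."""
    w_bifpn = int(round(64 * (1.35 ** phi)))  # BiFPN width (channels), exponential
    d_bifpn = 3 + phi                         # BiFPN depth (layers), linear
    d_head = 3 + phi // 3                     # box/class head depth
    r_input = 512 + phi * 128                 # input resolution, divisible by 128
    return dict(width=w_bifpn, depth=d_bifpn, head_depth=d_head, resolution=r_input)

# e.g. efficientdet_config(0) -> {'width': 64, 'depth': 3, 'head_depth': 3, 'resolution': 512}
```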

  It should be noted that this scaling scheme is heuristic rather than optimal.

Experiments

  The loss function is Focal Loss with coefficients $\alpha = 0.25$ and $\gamma = 1.5$. Training preprocessing follows RetinaNet: multi-scale crop/resize and horizontal flip; no auto-augmentation strategy is used.
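
  For reference, a minimal sketch of focal loss with the quoted coefficients (binary, per-anchor sigmoid formulation; the sum reduction is an assumption):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=1.5):
    """Binary focal loss; logits and targets share the same shape, targets in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).sum()
```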

  Performance comparison:

  Experimental comparison between softmax normalization and ordinary normalization:

  Figure 5 plots the learned weights of three randomly chosen nodes: the learning curves of the two schemes are similar, and the final accuracy (AP) is close, but there is a large speed difference in favor of the simple normalization.

Origin blog.csdn.net/qq_19784349/article/details/107215668