EfficientDet: Scalable and Efficient Object Detection Paper Notes

Paper: https://arxiv.org/pdf/1911.09070.pdf
GitHub: https://github.com/google/automl/tree/master/efficientdet

Motivation

The authors build EfficientDet, an efficient object detection network, on top of EfficientNet; the main contributions are BiFPN and compound scaling. Scaling the architecture yields a family of detectors, EfficientDet D0-D7, which are progressively slower but also progressively more accurate. Among them, EfficientDet-D7 achieves 51.0 mAP, a state-of-the-art result, on the COCO 2017 validation set with 326B FLOPs and 52M parameters.

Methods

First, for the multi-scale feature fusion stage, the authors propose a bidirectional FPN structure. FPN (CVPR 2017) pointed out the importance of fusing features across layers, using a relatively simple heuristic: upsample the deeper features by 2x and add them to the shallower ones. Various other fusion schemes followed. For example, PANet augments FPN's top-down path with an additional bottom-up path; M2Det adds skip-connections between levels. All of these methods combine candidate operations such as Conv, Sum, Concatenate, Resize, and Skip Connection. Based on several observations, the authors propose BiFPN. Its structural unit is shown in the figure.
(Figure: BiFPN structural unit)
The authors observed that PANet is more accurate than FPN and NAS-FPN, but also more expensive to compute. Starting from PANet, they therefore make three changes. First, they remove nodes that have only one input edge, on the assumption that a node with a single input contributes little to feature fusion. Second, they add an edge between the input and output nodes at the same scale, similar to a skip-connection, so that more features are integrated at little extra cost. Third, whereas PANet has only a single top-down and a single bottom-up path, the authors treat this pair of paths as one basic layer that can be repeated several times; the number of repetitions is a trade-off between speed and accuracy. Combining an EfficientNet backbone with the BiFPN and the two detection heads gives the overall EfficientDet framework shown in the figure below.
(Figure: EfficientDet architecture)
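The wiring described above can be sketched in code. The following is a deliberately simplified illustration of one BiFPN unit over three levels, with unweighted sums, nearest-neighbour resizing, and no convolutions (the real network uses depthwise separable convs and weighted fusion); the function names and three-level setup are assumptions for illustration only.

```python
import numpy as np

def resize(x, size):
    """Nearest-neighbour resize of a square feature map (illustrative only)."""
    idx = (np.arange(size) * x.shape[0] / size).astype(int)
    return x[np.ix_(idx, idx)]

def bifpn_layer(p3, p4, p5):
    """One simplified BiFPN unit over three levels: a top-down pass, then a
    bottom-up pass with an extra skip edge from the original input at the
    same scale. Convs and fusion weights are omitted for clarity."""
    # top-down: deeper features are upsampled and added to shallower ones
    p4_td = p4 + resize(p5, p4.shape[0])
    p3_out = p3 + resize(p4_td, p3.shape[0])
    # bottom-up: shallower outputs are downsampled and added, plus the
    # skip edge from the original input at the same level
    p4_out = p4 + p4_td + resize(p3_out, p4.shape[0])
    p5_out = p5 + resize(p4_out, p5.shape[0])
    return p3_out, p4_out, p5_out

p3, p4, p5 = np.ones((32, 32)), np.ones((16, 16)), np.ones((8, 8))
o3, o4, o5 = bifpn_layer(p3, p4, p5)
```

Note how each output level mixes information from both coarser and finer scales, and how `p4` feeds its output node twice: once through the top-down intermediate node and once through the skip edge.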
In addition, previous methods fuse features with equal weights; the authors argue that, from a modeling perspective, it is better to fuse features with learnable per-input weights. The paper adopts a "fast normalized fusion": each weight is passed through a ReLU to keep it non-negative and then divided by the sum of all weights plus a small ε, so the normalized weights lie in [0, 1] without the cost of a softmax.
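A minimal sketch of this fast normalized fusion, using numpy; the ε value of 1e-4 follows the paper, while the helper name and the toy inputs are illustrative assumptions.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted feature fusion: O = sum_i (w_i / (eps + sum_j w_j)) * I_i,
    with each w_i kept non-negative via ReLU so the normalized weights
    fall in [0, 1]."""
    w = np.maximum(weights, 0.0)      # ReLU keeps each weight >= 0
    norm = w / (eps + w.sum())        # cheap alternative to softmax
    return sum(wi * fi for wi, fi in zip(norm, features))

# Two same-shape feature maps fused with (learnable) scalar weights
f1 = np.ones((4, 4))
f2 = np.full((4, 4), 3.0)
out = fast_normalized_fusion([f1, f2], np.array([1.0, 1.0]))
```

With equal weights this reduces (up to ε) to a plain average; during training the weights are learned per input edge, so the network can decide how much each resolution contributes.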
The second contribution of this paper is compound scaling. Model scaling means adjusting a model to fit a resource budget: for example, to scale up the backbone and obtain a larger model, one can deepen the network, widen it, or increase the resolution of the input image.
Building on EfficientNet's compound scaling over the three factors of width, depth, and resolution, EfficientDet scales further: using EfficientNet B0 through B6 as the backbone controls the backbone's scale; the number of BiFPN channels and repeated layers is tied to the same coefficient; and together with the number of layers in the head and the input image resolution, these form EfficientDet's scaling config.
The formula for determining the width and depth of BiFPN is:
$W_{bifpn} = 64 \cdot (1.35^{\phi}), \qquad D_{bifpn} = 3 + \phi$
The formula for determining the number of layers in the Head part is:
$D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor$
The formula for determining the image resolution is:
$R_{input} = 512 + \phi \cdot 128$
$\phi$ is the scaling level of the model, taking values from 0 to 7.
(Table: scaling configs for EfficientDet D0-D7)
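The three formulas above can be collected into one small function. Note that the published config table rounds the raw formula values (e.g. D6 uses 384 BiFPN channels and caps the resolution at 1280), so this sketch returns the unrounded values.

```python
import math

def efficientdet_scaling(phi):
    """Compound-scaling config following the EfficientDet formulas.
    Returns raw formula values; the paper's final table rounds the
    channel count and caps resolution for the largest models."""
    w_bifpn = 64 * (1.35 ** phi)        # BiFPN width (channels)
    d_bifpn = 3 + phi                   # BiFPN depth (repeated layers)
    d_head = 3 + math.floor(phi / 3)    # box/class head depth
    r_input = 512 + phi * 128           # input image resolution
    return w_bifpn, d_bifpn, d_head, r_input
```

For example, $\phi = 0$ (D0) gives 64 channels, 3 BiFPN layers, a 3-layer head, and 512-pixel inputs, while each increment of $\phi$ grows all four dimensions together rather than tuning each one independently.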

Experiments

Dataset: COCO2017
Optimizer: SGD
Learning rate: increases linearly from 0 to 0.16 over the first training epoch, then decays following a cosine schedule.
Loss function: focal loss
Preprocessing: flipping and scaling
Number of epochs: 300 for D0-D6, 600 for D7
Batch size: 128, trained on 32 TPUv3 devices
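The learning-rate schedule above can be sketched as a simple function of the step index; the concrete step counts passed in are assumptions, since the paper specifies the schedule in epochs.

```python
import math

def learning_rate(step, total_steps, warmup_steps, peak_lr=0.16):
    """Linear warm-up from 0 to peak_lr over the first epoch's steps,
    then cosine decay to 0 over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

The warm-up avoids unstable updates early in training at a large batch size; the cosine tail anneals the rate smoothly to zero instead of using step drops.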

Results

(Table: EfficientDet results on COCO compared with other detectors)

Origin: https://blog.csdn.net/qq_43812519/article/details/107143911