Paper reading notes: EfficientDet: Scalable and Efficient Object Detection

Topic: EfficientDet: Scalable and Efficient Object Detection

Literature Address: https://arxiv.org/pdf/1911.09070v1.pdf

(Unofficial) Source Address:

  (1) PyTorch version: https://github.com/toandaominh1997/EfficientDet.Pytorch

  (2) Keras & TensorFlow version: https://github.com/xuannianz/EfficientDet

  The official source code has not been released yet. The reported results really shine, but how they were measured remains to be seen.

  

  In the field of machine vision, improving model efficiency is increasingly important. Recently, the Google Brain team systematically studied neural architecture design choices for object detection and proposed several key optimizations that improve model efficiency. First, they propose a weighted bi-directional feature pyramid network (BiFPN), which enables easy and fast multi-scale feature fusion. Second, they propose a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, the feature network, and the bounding-box/class prediction network. Based on these optimizations, they propose a new family of detectors called EfficientDet, which is consistently up to an order of magnitude more efficient than existing models across a wide range of resource constraints. Specifically, on the COCO dataset, EfficientDet-D7 uses only 52M parameters and 326B FLOPS yet achieves a state-of-the-art 51.0 mAP, +0.3% mAP over the best previous detector.

EfficientDet results on the COCO dataset

  For devices with different resource constraints (from 3B to 300B FLOPS), the authors designed a family of scalable models. As shown in the figure, comparing EfficientDet-D0 through EfficientDet-D6 with models such as YOLOv3, Mask R-CNN, and NAS-FPN, EfficientDet leads in both accuracy and computational cost.

  In general, one-stage detectors such as YOLO and SSD are fast enough for real-time detection, but less accurate than detectors such as the two-stage Mask R-CNN or RetinaNet. As the figure below shows, however, EfficientDet strikes a good balance between FLOPS and mAP, and is currently a detector that is both accurate and fast.

  The authors also compared model size (parameter count), GPU latency, and CPU latency against accuracy for existing models. As shown below, in every respect EfficientDet (results for D0-D6) sits on the non-dominated frontier relative to the other models.

The state of object detection

  In recent years, object detection has made a great deal of progress toward higher accuracy. However, the cost of state-of-the-art detectors has also grown higher and higher. For example, the recently proposed AmoebaNet-based NAS-FPN detector contains 167M parameters and 3045B FLOPS (30 times those of RetinaNet), and achieves the current best accuracy only at this high cost. The huge model sizes and expensive computation limit the deployment of these algorithms in real-world applications such as robotics and autonomous driving, where latency and resources are constrained. Given the limited resources of the real world, model efficiency matters more and more for object detection.

  There has been plenty of prior work devoted to more efficient detector architectures, such as one-stage models (SSD, YOLOv3, YOLO9000, etc.), anchor-free detectors (CornerNet, etc.), and compression of existing models (YOLO-Lite, etc.). Although these approaches achieve better efficiency, they usually pay for it with accuracy. Moreover, most previous studies focus on one specific resource budget, while real deployments, from mobile devices to data centers, often face different resource limits.

  A natural question then arises: is it feasible to build a scalable detection architecture with both high accuracy and high efficiency that can serve different resource constraints (3B to 300B FLOPS)?

  Starting from the one-stage detector paradigm and examining the design choices for the backbone network, the feature fusion network, and the bounding-box/class prediction network, the authors identified two main challenges:

  Challenge 1: efficient multi-scale feature fusion. When fusing different input features, most previous work treats them all alike, simply resizing and summing them. However, since these input features come from different resolutions, their contributions to the output feature should not be equal.

  Based on this, the authors propose BiFPN, a simple and efficient weighted bi-directional feature pyramid network. It introduces learnable weights so the network can learn the importance of different input features, and it repeatedly applies bottom-up and top-down multi-scale feature fusion. The weighting idea is somewhat similar to that of SENet.

  Challenge 2: model scaling. To obtain high accuracy, previous work relies on a large backbone network and a large input image size. The authors note that, with both accuracy and efficiency in mind, scaling up the feature network and the bounding-box/class prediction network is also critical.

  Based on this, the authors propose a compound scaling method for object detection, which uniformly scales the resolution, depth, and width of the backbone network, the feature network, and the bounding-box/class prediction network.

  Finally, the authors also found that EfficientNet is more efficient than the backbone structures widely used in the recent literature, such as ResNet, ResNeXt, and AmoebaNet. They therefore combined BiFPN and compound scaling with an EfficientNet backbone and named the result EfficientDet.

Improved network structure

  The authors first formalize the multi-scale fusion problem, then present the two ideas behind the proposed BiFPN:

  • Efficient bi-directional cross-scale connections
  • Weighted feature fusion

1. FPN formulation

  Multi-scale feature fusion aims to aggregate features at different resolutions. Consider a list of features at different scales:

  P_in = (P_{l1}, P_{l2}, ..., P_{li})

  where l1, l2, ..., li denote different levels, i.e., different depths of the network. As the level i increases, the feature resolution typically shrinks at a rate of 1/2^i. For example, for a 640*640 input, after three downsampling steps the resolution becomes 640/2^3 = 80 (i.e., 80*80 features). The goal of multi-scale fusion is to find a transformation f that effectively aggregates a list of features at different scales into one output:

  P_out = f(P_in)
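The resolution arithmetic above is easy to check in a couple of lines of Python (the input size and level are just the example values from the text):

```python
def level_resolution(input_size, level):
    # Feature resolution at pyramid level i shrinks by a factor of 2**i
    return input_size // (2 ** level)

# The worked example from the text: a 640x640 input at level 3
print(level_resolution(640, 3))  # -> 80
```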

  As shown below, the left side is a conventional FPN, which contains only top-down fusion (figure a). It can be related to the YOLOv3 network structure (right figure): the 13*13 features are upsampled and fused with the 26*26 features, and the 26*26 features are upsampled and fused with the 52*52 features. As can be seen, the fusion only runs from low resolution to high resolution, i.e., features from deep layers are merged into the features of shallow layers; this is the top-down pattern described in the paper.

  The top-down FPN fusion pattern can be formulated as follows, where Resize is generally an upsampling (or downsampling) operation for resolution matching, Conv is a conventional convolution, and "+" denotes feature fusion (addition in the paper's formulation; some implementations concatenate along the channel dimension instead):

  P7_out = Conv(P7_in)
  P6_out = Conv(P6_in + Resize(P7_out))
  ...
  P3_out = Conv(P3_in + Resize(P4_out))
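As a minimal sketch of this top-down recursion, here is a NumPy version that stands in for Conv with the identity and for Resize with nearest-neighbor upsampling (both simplifications, chosen only to expose the fusion pattern):

```python
import numpy as np

def resize_nn(x, scale):
    # Nearest-neighbor upsampling by an integer factor along H and W
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def fpn_top_down(feats):
    # feats: features ordered shallow (high-res) to deep (low-res),
    # e.g. [P3, P4, ..., P7]; fusion runs deep -> shallow
    out = [None] * len(feats)
    out[-1] = feats[-1]  # P7_out = Conv(P7_in), Conv dropped here
    for i in range(len(feats) - 2, -1, -1):
        scale = feats[i].shape[0] // out[i + 1].shape[0]
        out[i] = feats[i] + resize_nn(out[i + 1], scale)
    return out
```

With three all-ones levels of sizes 8, 4, and 2, the deepest output stays at 1 and each shallower level accumulates one more fused copy, which makes the deep-to-shallow flow visible.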

2. Cross-scale connections

  The traditional top-down FPN is inherently limited by its one-way information flow. PANet adds an extra bottom-up path, as shown on the left; NAS-FPN uses neural architecture search to find an irregular cross-scale topology, as shown on the right.

  To improve the efficiency of the model, the Google Brain team made several optimizations to the cross-scale connection structure:

  • First, nodes with only one input edge are removed. The intuition is simple: if a node has only one input and performs no feature fusion, its contribution to the feature network is small. This yields a simplified PANet, shown at the lower left; compared with PANet, the nodes that have only one input can evidently be omitted.

  • Second, for the same level, an extra edge is added from the original input to the output node (the purple connection in the upper-right figure). This fuses more features without adding any parameters.
  • Finally, unlike PANet, which has only one top-down and one bottom-up path, the researchers treat each bidirectional (top-down plus bottom-up) path as a single feature network layer and repeat that layer multiple times to enable higher-level feature fusion. This forms the final BiFPN.

3. Weighted feature fusion

  The conventional way to fuse multiple features of different resolutions is to first resize them to a common resolution and then sum or concatenate them. The pyramid attention network introduces global self-attention upsampling to recover pixel localization.

  These earlier fusion schemes treat all input features equally, yet features at different resolutions contribute differently to the output. To address this, the researchers propose adding an extra weight for each input during fusion and letting the network learn the importance of each input feature.

  

  SENet, winner of the final ImageNet competition, also recognized this problem. The figure below shows the structure of the SE-block, which uses fully connected layers to let the network learn the contribution of each feature channel:
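A minimal NumPy sketch of an SE-block may help here: squeeze by global average pooling, excite with two FC layers (ReLU then sigmoid), and reweight each channel. The two weight matrices w1 and w2 are hypothetical parameters passed in by the caller:

```python
import numpy as np

def se_block(x, w1, w2):
    # x: (C, H, W); w1: (C//r, C); w2: (C, C//r) for reduction ratio r
    s = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)             # excitation FC1 + ReLU -> (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))  # excitation FC2 + sigmoid -> (C,)
    return x * gate[:, None, None]          # scale each channel by its gate
```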

  

  The authors propose three weighted fusion schemes:

  • Unbounded fusion:

Here the fused output is O = Σ_i wi · Ii, where wi is a learned weight that can be per-feature, per-channel, or per-pixel. Because the weights are unbounded, training can become unstable, so some form of weight normalization is needed to bound the range of each weight.

  • softmax-based fusion:

This scheme bounds the weight range via softmax: O = Σ_i (e^{wi} / Σ_j e^{wj}) · Ii. However, softmax is not very GPU-friendly and slows down computation (YOLOv3 likewise avoids softmax for classification).

  • Fast normalized fusion:

This scheme largely avoids the cost of the softmax operation. Each weight wi first passes through a ReLU activation to guarantee wi ≥ 0, and a small ε = 0.0001 is added to the denominator to avoid numerical instability: O = Σ_i (wi / (ε + Σ_j wj)) · Ii. With this treatment, each weight is again normalized to between 0 and 1. For the BiFPN structure, the weighted fusion at level 6, for example, is:

  P6_td = Conv((w1 · P6_in + w2 · Resize(P7_in)) / (w1 + w2 + ε))
  P6_out = Conv((w1' · P6_in + w2' · P6_td + w3' · Resize(P5_out)) / (w1' + w2' + w3' + ε))

  Here P6_td denotes the output feature of the intermediate node at level 6. For efficiency, the authors use depthwise separable convolutions (as in MobileNet) to reduce the parameter count and computation.
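The three fusion variants can be sketched in NumPy as follows. The inputs are features already resized to a common resolution; this illustrates only the weighting, not the surrounding convolutions:

```python
import numpy as np

def unbounded_fusion(feats, w):
    # O = sum_i w_i * I_i, with w unconstrained (can destabilize training)
    return sum(wi * f for wi, f in zip(w, feats))

def softmax_fusion(feats, w):
    # Bound the weights to (0, 1) via softmax; slower on GPU
    a = np.exp(w) / np.exp(w).sum()
    return sum(ai * f for ai, f in zip(a, feats))

def fast_normalized_fusion(feats, w, eps=1e-4):
    # ReLU keeps each weight >= 0; normalize without the exponential
    w = np.maximum(w, 0.0)
    a = w / (w.sum() + eps)
    return sum(ai * f for ai, f in zip(a, feats))
```

With two equal weights, softmax and fast normalized fusion both give (nearly) the mean of the two inputs, while unbounded fusion gives their raw weighted sum.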

  As the table below shows, EfficientNet-B3 with BiFPN improves accuracy by 4 mAP points over the same backbone with FPN, while greatly reducing the parameter count and computation.

EfficientDet

1. Network structure

  The figure below shows the EfficientDet architecture, which broadly follows the one-stage detector paradigm. EfficientNet serves as the backbone and BiFPN as the feature network; the features {P3, P4, P5, P6, P7} coming out of the backbone are fused top-down and bottom-up by repeated BiFPN layers. The fused features are then fed to a class prediction net and a box prediction net, which predict object classes and bounding boxes respectively.
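The pipeline just described can be sketched schematically. The components are passed in as plain callables, so this shows only the data flow, not a real implementation:

```python
def efficientdet_forward(image, backbone, bifpn_layers, class_net, box_net):
    feats = backbone(image)     # multi-scale features {P3, ..., P7}
    for bifpn in bifpn_layers:  # repeated bidirectional feature fusion
        feats = bifpn(feats)
    # shared prediction heads run on the fused features
    return class_net(feats), box_net(feats)
```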

EfficientNet-B0 structure:

Here MBConv is the inverted bottleneck unit, i.e., the MobileNetV2 bottleneck block, with an SE-block added on the shortcut branch.

2. Compound scaling

  To optimize both accuracy and efficiency, the authors want to develop a family of models that can satisfy a wide range of resource constraints.

  Previous work mostly scales up a baseline detector by using a larger backbone (e.g., ResNeXt), using larger input images, or stacking more FPN layers. Such approaches are usually inefficient.

  The researchers propose a compound scaling method for object detection that uses a single compound coefficient φ to uniformly scale all dimensions of the backbone, the BiFPN, and the bounding-box/class prediction networks.
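For reference, the paper's scaling rules for the detector parts can be written out directly. The formulas below are my recollection of the paper (Section 4.2), and the paper additionally rounds the resulting widths, so treat the exact constants as an assumption:

```python
import math

def efficientdet_config(phi):
    # phi: compound coefficient (0 for D0, 1 for D1, ...)
    w_bifpn = 64 * (1.35 ** phi)      # BiFPN channel width (rounded in the paper)
    d_bifpn = 3 + phi                 # number of BiFPN layers
    d_head = 3 + math.floor(phi / 3)  # depth of the box/class prediction nets
    r_input = 512 + phi * 128         # input image resolution
    return w_bifpn, d_bifpn, d_head, r_input
```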

 

  

  Compound scaling is hard to understand without having read the EfficientNet paper, so its idea is briefly described here.

  The figure below illustrates model scaling: (a) is a baseline network; (b) to (d) scale up the network width, network depth, and resolution separately; (e) scales width, depth, and resolution uniformly.

  For a convolutional layer, the operation can be defined as a function:

  Yi = Fi(Xi)

  where Fi denotes the convolution operation, Yi is the output tensor, and Xi is the input tensor.

  A ConvNet N can then be expressed as a composition of such operations. Accounting for repeated layers (as in Inception-style designs, where one stage applies an operation several times), the network N can be defined as:

  N = ⊙_{i=1..s} Fi^{Li}(X_{<Hi, Wi, Ci>})

  Fi^{Li}: the convolution operation of stage i, repeated Li times (as in Inception, one stage performs multiple convolution operations)

  H: height of the input feature

  W: width of the input feature

  C: number of input feature channels

  X: input tensor
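The composed-network definition above amounts to a simple loop over stages, where each stage's operation Fi is repeated Li times; a stage-as-callable sketch:

```python
def convnet(x, stages):
    # stages: list of (F, L) pairs; stage i applies layer F_i exactly L_i times
    for F, L in stages:
        for _ in range(L):
            x = F(x)
    return x
```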

  Therefore, in a setting with limited computation and resources, an optimization problem can be defined as:

  max_{d,w,r} Accuracy(N(d, w, r))  subject to FLOPS and memory limits

  In short, the problem is to maximize accuracy by optimizing the network depth d, width w, and resolution r within the allowed computation and resource budget.

  For such an optimization problem, the range of the optimization variables must be constrained; otherwise the search space is too large to find an optimum. A scaling rule is therefore fixed, as follows:

  depth: d = α^φ,  width: w = β^φ,  resolution: r = γ^φ
  subject to α · β^2 · γ^2 ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1

  where α, β, γ are constants and φ is a user-defined scaling coefficient.

  Doubling the network depth d doubles the computation, while doubling the channel width w or the resolution r quadruples it. Therefore, with α · β^2 · γ^2 ≈ 2 fixed, the total FLOPS grows by approximately 2^φ.
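This FLOPS bookkeeping is easy to verify numerically. The α, β, γ values below are the ones grid-searched for EfficientNet-B0 (an assumption carried over from that paper, since this post does not list them):

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # EfficientNet-B0 constants (assumed)

def flops_ratio(phi):
    # FLOPS scale linearly with depth d = alpha**phi and quadratically
    # with width w = beta**phi and resolution r = gamma**phi
    d = ALPHA ** phi
    w = BETA ** phi
    r = GAMMA ** phi
    return d * (w ** 2) * (r ** 2)
```

Since ALPHA * BETA**2 * GAMMA**2 ≈ 1.92, the ratio grows by roughly a factor of 2 for each increment of φ, matching the ~2^φ claim above.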

 


Origin www.cnblogs.com/monologuesmw/p/12423232.html