Combining Speed and Accuracy - A Detailed Look at EfficientNet


EfficientNet, from Google, proposes a compound scaling method that mixes several model dimensions. Paper link (code links are at the end of this article):

https://arxiv.org/pdf/1905.11946.pdf

The authors set out to find a scaling method that balances speed and accuracy. To do so, they revisit the three dimensions along which earlier work scaled models: network depth, network width, and image resolution. Previous work typically enlarged a single dimension to gain accuracy; for example, going from ResNet-18 to ResNet-152 improves accuracy by increasing network depth.

Stepping back from the conventional view of model scaling, the authors argue that these three dimensions influence one another and search for the best combination of all three. Based on this idea they propose a new network, EfficientNet, whose performance is shown below:

 
[Figure: model performance comparison - accuracy vs. model size]

The red curve is EfficientNet; the horizontal axis is model size (number of parameters) and the vertical axis is accuracy. One glance at this chart shows how strong EfficientNet is: familiar networks such as ResNet, Xception, and ResNeXt are all left behind. In accuracy, EfficientNet beats the previous SOTA, GPipe, by only 0.1%, but GPipe needs 556M parameters to get there while EfficientNet uses only 66M, a startling gap. In practice a 0.1% accuracy gain is barely perceptible, but the speedup is the real payoff: roughly an 8x speedup greatly improves the network's practicality and makes industrial deployment feasible.

Problem formulation

The problem can be expressed with the following formulas. There are a few symbols, but they are not hard to follow. Call the entire convolutional network N; its i-th layer can then be viewed as a function mapping:
Y_i=F_i(X_i)

where Y_i is the output tensor and X_i is the input tensor, whose shape is <H_i, W_i, C_i> (the batch dimension is omitted for brevity). The whole convolutional network N, made up of k convolutional layers, can then be written as:
N=F_k \odot ...\odot F_2 \odot F_1(X_1) = \odot_{j=1...k} F_j(X_1)

In practice, several consecutive layers with the same structure are usually grouped into a stage; ResNet, for example, can be divided into five stages, and within each stage every layer has the same structure (except the first layer, which downsamples). Expressed in units of stages, the convolutional network N becomes:

 
N = \odot_{i=1...s} F_i^{L_i}(X_{<H_i, W_i, C_i>})

where the subscript i (running from 1 to s) is the stage index, F_i^{L_i} denotes the i-th stage, which consists of the convolutional layer F_i repeated L_i times, and <H_i, W_i, C_i> is the shape of that stage's input tensor.
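As a concrete illustration, here is a minimal PyTorch sketch of a network written in units of stages, where each stage repeats the same layer structure F_i a total of L_i times. The layer counts and channel widths below are made up for the example and are not EfficientNet-B0's real configuration.

```python
import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, num_layers, stride=2):
    """One stage: the first layer downsamples, the remaining
    num_layers - 1 layers repeat the same structure F_i."""
    layers = [nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )]
    for _ in range(num_layers - 1):
        layers.append(nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ))
    return nn.Sequential(*layers)

# N = F_s^{L_s} o ... o F_1^{L_1}(X): stages composed one after another.
# (L_i, C_i) per stage are illustrative values only.
net = nn.Sequential(
    make_stage(3,   32, num_layers=2),
    make_stage(32,  64, num_layers=3),
    make_stage(64, 128, num_layers=4),
)

x = torch.randn(1, 3, 224, 224)   # <H, W, C> input (batch dim added back)
print(net(x).shape)               # spatial size halves at each stage
```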

To shrink the search space, the authors fix the basic network structure and vary only the three scaling dimensions mentioned above: network depth (L_i), network width (C_i), and input resolution (H_i, W_i). Even with just three dimensions the search space is still too large, so a further restriction is added: starting from a base network (the EfficientNet-B0 introduced later), each dimension may only be enlarged by multiplying by a constant ratio. All that remains to optimize are these ratios, which gives the final mathematical formulation:

 
\max_{d, w, r} \; Accuracy(N(d, w, r))
s.t. \; N(d, w, r) = \odot_{i=1...s} \hat{F}_i^{d \cdot \hat{L}_i}(X_{<r \cdot \hat{H}_i, \; r \cdot \hat{W}_i, \; w \cdot \hat{C}_i>})
Memory(N) \le target\_memory
FLOPS(N) \le target\_flops

where d, w, and r are the magnification ratios for network depth, network width, and input resolution, and the hatted symbols are the fixed parameters of the base network.
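To make the roles of d, w, and r concrete, here is a small sketch of how the three ratios rescale a fixed architecture. The base configuration below is hypothetical, not the real B0; only the idea of multiplying each dimension by a constant ratio is taken from the text above.

```python
import math

# Hypothetical base network: (repeats, channels) per stage plus input
# resolution. The structure is fixed; only the ratios d, w, r change.
base_depths     = [1, 2, 2, 3]       # \hat{L}_i per stage
base_widths     = [16, 24, 40, 80]   # \hat{C}_i per stage
base_resolution = 224                # \hat{H}_i = \hat{W}_i at the input

def scale_config(d, w, r):
    depths = [math.ceil(d * L) for L in base_depths]     # deeper
    widths = [int(round(w * C)) for C in base_widths]    # wider
    resolution = int(round(r * base_resolution))         # larger input
    return depths, widths, resolution

print(scale_config(d=1.0, w=1.0, r=1.0))    # the base network itself
print(scale_config(d=1.2, w=1.1, r=1.15))   # one candidate scaling
```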

Experiments

What makes the problem hard is that the three ratios are intrinsically linked; for instance, higher-resolution images call for a deeper network whose larger receptive field can capture the corresponding features. The authors therefore ran two experiments to verify this. In the first, two of the three dimensions are fixed and only the remaining one is enlarged, giving the results below:

 
[Figure: accuracy when scaling a single dimension - width (w), depth (d), and resolution (r), from left to right]

The three plots show, from left to right, the results of enlarging only the network width (w), only the network depth (d), and only the image resolution (r). Scaling a single dimension tops out at an accuracy of roughly 80%. This gives the first observation: scaling up any one of the three dimensions improves accuracy, but the gain shrinks as the ratio grows.

A second experiment therefore scales the width w under different fixed combinations of d and r, producing the figure below:

 
[Figure: accuracy for width scaling under different depth (d) and resolution (r) combinations]

The results show that the best accuracy improves on the single-dimension experiments, that different combinations behave differently, and that the best reaches about 82%. This gives the second observation: the key to higher accuracy and efficiency is balancing the scaling ratios (d, r, w) of network depth, network width, and image resolution.

The authors therefore propose a compound scaling method, which uses a single compound coefficient \phi to determine the magnification ratios of all three dimensions:

 
depth: d = \alpha^\phi
width: w = \beta^\phi
resolution: r = \gamma^\phi
s.t. \; \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha \ge 1, \beta \ge 1, \gamma \ge 1

Here \alpha, \beta, and \gamma are constants (they cannot be arbitrarily large, because all three drive the computational cost) obtained by grid search, while the compound coefficient \phi can be tuned by hand. Doubling the network depth doubles the FLOPS, whereas doubling the network width or the image resolution quadruples it; in other words, the FLOPS of the convolution operations are proportional to d, w^2, and r^2, which is why the constraint above contains two squared terms. Under this constraint, once the compound coefficient \phi is chosen, the network's FLOPS grow by roughly a factor of 2^\phi.
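A small sketch of the compound rule. The \alpha, \beta, \gamma values below are the ones reported later in the article; the FLOPS estimate simply restates the proportionality argument from the paragraph above.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # constants found by grid search

def compound_scaling(phi):
    """Turn a single compound coefficient phi into the three ratios."""
    d = ALPHA ** phi      # depth multiplier
    w = BETA ** phi       # width multiplier
    r = GAMMA ** phi      # resolution multiplier
    # FLOPS scale roughly with d * w^2 * r^2 = (alpha * beta^2 * gamma^2)^phi
    flops_ratio = d * (w ** 2) * (r ** 2)
    return d, w, r, flops_ratio

# Since alpha * beta^2 * gamma^2 ~ 2, each +1 in phi roughly doubles FLOPS.
for phi in range(0, 4):
    d, w, r, flops = compound_scaling(phi)
    print(f"phi={phi}: d={d:.2f}, w={w:.2f}, r={r:.2f}, ~{flops:.2f}x FLOPS")
```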

Network architecture

For the network architecture the authors mainly drew on MnasNet, jointly optimizing accuracy (ACC) and computational cost (FLOPS), which yielded the baseline EfficientNet-B0 with the structure shown below:

 
[Figure: EfficientNet-B0 baseline architecture]

With the baseline architecture in place, scaling up proceeds in two steps:

  • Step 1: fix \phi = 1, i.e. budget roughly twice the original FLOPS, and run a grid search on this small model. The best coefficients found are \alpha = 1.2, \beta = 1.1, \gamma = 1.15.
  • Step 2: fix \alpha = 1.2, \beta = 1.1, \gamma = 1.15 and enlarge the baseline with different compound coefficients \phi to obtain EfficientNet-B1 through EfficientNet-B7.

Running the grid search only on the small model drastically reduces the computation; doing a grid search on a large model would simply be too expensive. A sketch of this two-step procedure follows.
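The sketch below is only an illustration of what step 1's grid search might look like, under stated assumptions: train_and_evaluate is a hypothetical stand-in for training and scoring a scaled B0 (which the authors do on ImageNet), and the candidate grid and tolerance band are invented for the example.

```python
import itertools

def grid_search_coefficients(train_and_evaluate, step=0.05):
    """Step 1: with phi fixed to 1, search alpha, beta, gamma such that
    alpha * beta^2 * gamma^2 is close to 2 (roughly a 2x FLOPS budget)."""
    best = None
    candidates = [1.0 + step * k for k in range(0, 11)]   # 1.00 .. 1.50
    for alpha, beta, gamma in itertools.product(candidates, repeat=3):
        if not (1.9 <= alpha * beta**2 * gamma**2 <= 2.1):
            continue  # keep only combinations near the 2x-FLOPS budget
        acc = train_and_evaluate(d=alpha, w=beta, r=gamma)
        if best is None or acc > best[0]:
            best = (acc, alpha, beta, gamma)
    return best

# Toy stand-in for demonstration only; the real routine trains a model.
def fake_evaluate(d, w, r):
    return d * w * r  # pretend bigger is always (slightly) better

print(grid_search_coefficients(fake_evaluate))
# Step 2 then freezes (alpha, beta, gamma) - the paper reports
# (1.2, 1.1, 1.15) - and varies only phi to get B1 ... B7.
```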

Results

Comparison with other networks:

 
[Figure: ImageNet performance comparison between EfficientNet and other networks]

The authors also applied this compound scaling method to common networks such as MobileNets and ResNets, and at comparable computational cost the scaled models are more accurate than before. This part can be summed up as: faster than you, and more accurate too.

Summary

The paper presents a new network that combines speed and accuracy. It is very practical and works well as a common baseline to build further changes on.

Code

pytorch:https://github.com/lukemelas/EfficientNet-PyTorch

tensorflow:https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

keras: https://github.com/qubvel/efficientnet
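For reference, a minimal usage sketch of the PyTorch repo above, assuming it is installed as the efficientnet_pytorch package (pip install efficientnet_pytorch); consult the repo's README for the authoritative API.

```python
import torch
from efficientnet_pytorch import EfficientNet

# Load EfficientNet-B0 with ImageNet-pretrained weights
# (use EfficientNet.from_name(...) for randomly initialized weights).
model = EfficientNet.from_pretrained('efficientnet-b0')
model.eval()

# Dummy 224x224 RGB input; real images should be normalized per the README.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)

print(logits.shape)  # torch.Size([1, 1000]) - ImageNet class scores
```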
