Google open-sources new model EfficientNet: image recognition efficiency increased by up to 10 times, parameters reduced by 88%

Introduction

Model scaling is widely used to improve the accuracy of convolutional networks. For example, the ResNet series can be scaled from ResNet-18 up to ResNet-200 by increasing the number of layers, and GPipe, Google's open-source neural network training library, reaches 84.3% top-1 accuracy on ImageNet by scaling up a baseline network four times. However, although there are many ways to scale a convolutional network, little work has tried to understand scaling in depth. Most previous work tunes just one of the three dimensions of a network: depth, width, or image resolution. While tuning two or all three of them seems feasible, in practice it has required a great deal of manual adjustment for only marginal gains. Before going further, let's look at what EfficientNet achieves:

[Figure: number of parameters vs. ImageNet top-1 accuracy for EfficientNet and other convolutional networks]

In the figure, the horizontal axis is the number of parameters and the vertical axis is top-1 accuracy on ImageNet. The EfficientNet series outperforms all the other convolutional networks. In particular, EfficientNet-B7 sets a new state-of-the-art accuracy of 84.4%, while using 8.4 times fewer parameters than GPipe and running inference 6.1 times faster. More detailed numbers are given in the experiments section below.

Research motivation

The author reflects on how neural networks are scaled and, in particular, raises a question: is there a principled scaling method that improves a network's accuracy and efficiency at the same time? A critical step toward this goal is balancing the three dimensions of width, depth, and resolution. Through empirical study, the author finds that this balance can be achieved with a simple fixed-ratio scaling operation, and proposes a simple but effective compound scaling method. For example, to use 2^N times more computing resources on a given baseline model, one only needs to increase the network depth by α^N, the width by β^N, and the image size by γ^N, where α, β, and γ are constant coefficients obtained by a small grid search on the original baseline model. To illustrate the difference between the proposed compound scaling method and traditional approaches, the author provides the following figure:
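To make the arithmetic concrete, here is a minimal Python sketch, assuming the coefficients α = 1.2, β = 1.1, γ = 1.15 that the paper later reports for EfficientNet-B0 (any other baseline would need its own grid search):

```python
# Minimal sketch of compound scaling: given a compound coefficient phi,
# scale depth, width, and resolution by fixed per-dimension bases.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (EfficientNet-B0)

def compound_multipliers(phi: float):
    depth_mult = ALPHA ** phi   # multiply the number of layers
    width_mult = BETA ** phi    # multiply the number of channels
    res_mult = GAMMA ** phi     # multiply the input image side length
    return depth_mult, width_mult, res_mult

for phi in range(4):
    d, w, r = compound_multipliers(phi)
    # alpha * beta^2 * gamma^2 ~= 2, so FLOPS grow roughly 2^phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```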

[Figure: (a) baseline network; (b)–(d) conventional scaling of width, depth, and resolution separately; (e) the proposed compound scaling]

Here, (a) is the baseline network, (b) through (d) are three traditional methods that scale width, depth, and image resolution separately, and (e) is the compound scaling method proposed in this paper, which scales all three dimensions simultaneously with a fixed ratio. Intuitively, compound scaling makes sense: for larger input images, the network needs more layers to enlarge the receptive field and more channels to capture fine-grained patterns. Overall, the core contributions of this paper are twofold:

  1. A compound scaling method, the first attempt to jointly scale all three dimensions of a convolutional network. The method effectively improves how well existing network architectures exploit large-scale computing resources.

  2. A new, high-performing network architecture, EfficientNet. The network not only outperforms other architectures by a wide margin, but also has fewer parameters and faster inference.

Compound model scaling method

This section explains the network scaling problem in detail, examines the different approaches, and builds up to our protagonist: the compound scaling method.

Problem modeling

In essence, a convolutional network is a mapping function, which can be written in the following form:

N = F_k ⊙ … ⊙ F_2 ⊙ F_1(X_1)

where F_i denotes the operation performed by layer i and X_i is its input tensor, whose shape we assume to be <H_i, W_i, C_i>. For convenience, the batch dimension of the tensor is omitted.

Usually, a complete convolutional network is composed of multiple stacked sub-modules. For example, ResNet consists of five sub-modules, also called five stages; apart from the downsampling at the start of a stage, all layers within each stage perform the same convolution operation. A neural network can therefore also be defined in the following form (Equation (1)):

N = ⊙_{i=1…s} F_i^{L_i}( X_{<H_i, W_i, C_i>} )

where F_i^{L_i} denotes that layer F_i is repeated L_i times in stage i, and <H_i, W_i, C_i> is the shape of the input tensor X of layer i. As an input tensor flows through the convolutional network, its spatial dimensions usually shrink while the number of channels grows; for example, an input of shape <224, 224, 3> may end up with shape <7, 7, 512> after passing through a particular convolutional network.
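As a toy illustration of this composition (not the actual EfficientNet code), the sketch below builds a "network" by repeating a layer function L_i times per stage and composing the stages, exactly in the spirit of Equation (1):

```python
from functools import reduce

# Toy illustration of Equation (1): a network is a composition of stages,
# and each stage applies the same layer function L_i times.
def repeat(layer_fn, times):
    """Return a stage that applies layer_fn `times` times."""
    def stage(x):
        for _ in range(times):
            x = layer_fn(x)
        return x
    return stage

def compose(stages):
    """N = F_k ∘ ... ∘ F_1 — apply the stages in order."""
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)

# Trivial "layers" acting on a number instead of a tensor:
net = compose([repeat(lambda x: x + 1, times=3), repeat(lambda x: x * 2, times=2)])
print(net(0))  # ((0 + 1 + 1 + 1) * 2) * 2 = 12
```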

For a given neural network, the author stipulates that all layers must be scaled uniformly by the same constant ratio. The model scaling problem can then be formulated as (Equation (2)):

max_{d, w, r} Accuracy( N(d, w, r) )
s.t. N(d, w, r) = ⊙_{i=1…s} F̂_i^{d·L̂_i}( X_{<r·Ĥ_i, r·Ŵ_i, w·Ĉ_i>} )
     Memory(N) ≤ target_memory
     FLOPS(N) ≤ target_flops

where w, d, and r are the width, depth, and resolution coefficients of the scaled network, and F̂_i, L̂_i, Ĥ_i, Ŵ_i, Ĉ_i are the parameters predefined in the baseline network.
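A small sketch of what Equation (2) fixes and what it leaves free: the baseline stage parameters are constants, and scaling only chooses the three coefficients d, w, r. The stage numbers below are invented placeholders, not EfficientNet-B0's real configuration:

```python
import math

# Hypothetical baseline stage specs; placeholders for the hatted constants
# (L̂_i, Ĥ_i, Ŵ_i, Ĉ_i) in Equation (2), not EfficientNet-B0's real values.
BASELINE_STAGES = [
    {"layers": 2, "height": 224, "width": 224, "channels": 32},
    {"layers": 3, "height": 112, "width": 112, "channels": 64},
]

def scale_network(d: float, w: float, r: float):
    """Apply the same (d, w, r) to every stage, as Equation (2) requires:
    d scales layer counts, r scales spatial size, w scales channel counts."""
    return [
        {
            "layers": math.ceil(d * s["layers"]),
            "height": math.ceil(r * s["height"]),
            "width": math.ceil(r * s["width"]),
            "channels": math.ceil(w * s["channels"]),
        }
        for s in BASELINE_STAGES
    ]

print(scale_network(d=1.2, w=1.1, r=1.15))
```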

Single-dimension scaling

One difficulty in solving Equation (2) is that d, w, and r depend on one another and change under different resource constraints. Traditional methods therefore mostly scale one of these dimensions in isolation. The figure below shows some experimental results on how changing each dimension affects model performance:

From left to right, the panels show the effect of different width, depth, and resolution coefficients on model performance. As width, depth, or resolution grows, the larger network achieves better accuracy, but accuracy quickly saturates after reaching about 80%, which shows the limitation of single-dimension scaling. All of these experiments use EfficientNet-B0 as the baseline network, whose structure is given in the table below:

Table 1: the EfficientNet-B0 network. Each row describes one stage of the multi-layer network; resolution is the input tensor size and Channels is the number of output channels. The notation in the table matches that of Equation (1).

From this comparison, the author draws:

Observation 1: Scaling up any one of a network's width, depth, or resolution improves its accuracy, but the gain diminishes as the model grows larger.

Compound scaling

In fact, the different scaling dimensions are not independent of one another. Intuitively, higher-resolution images call for a deeper network, so that a larger receptive field can sample and extract features from the image. Likewise, the network width should also increase, in order to capture finer-grained patterns from the larger number of pixels in a high-resolution image. Based on this intuition, the author puts forward a hypothesis: "Instead of scaling a single dimension as traditional methods do, we should balance the different scaling dimensions together."

To verify this hypothesis, the author compares width scaling of the network under different depth and resolution settings:

[Figure: ImageNet accuracy vs. FLOPS when scaling width under different depth (d) and resolution (r) configurations]

Each point on a line in the figure shows the model's accuracy under a different width coefficient. All baseline networks use the structure in Table 1. The first baseline network (d=1.0, r=1.0) has 18 convolutional layers and an input resolution of 224x224; the last one (d=2.0, r=1.3) has 36 convolutional layers and an input resolution of 299x299. As the figure shows, when depth and resolution are held fixed and only width is scaled, accuracy saturates quickly; at the same FLOPS (floating-point operations) cost, deeper and higher-resolution models achieve better accuracy. From this analysis the author draws:

Observation 2: To obtain better accuracy and efficiency, it is critical to balance all three dimensions of network width, depth, and resolution when scaling a convolutional network.

In fact, some similar work has tried to balance network width and depth in an ad-hoc way, but all of it requires tedious manual tuning. Unlike those methods, the author proposes a new compound scaling method, which uses a compound coefficient Φ to scale the network's depth, width, and resolution in a unified, principled way:

depth: d = α^Φ
width: w = β^Φ
resolution: r = γ^Φ
s.t. α · β² · γ² ≈ 2, α ≥ 1, β ≥ 1, γ ≥ 1    (Equation (3))

where α, β, and γ are constants determined by a small grid search, and Φ is a user-specified coefficient that controls how many additional resources are available for model scaling. For an ordinary convolution operation, FLOPS are proportional to d, w², and r². Since convolutions usually dominate the computation cost of a convolutional network, scaling the network with Equation (3) multiplies the total FLOPS by approximately (α·β²·γ²)^Φ. The author constrains the three parameters with α·β²·γ² ≈ 2, so total FLOPS increase by about 2^Φ.
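A quick numeric check of this claim, using the α, β, γ values reported later for EfficientNet-B0:

```python
# FLOPS of a conv net scale roughly with d * w^2 * r^2, so under compound
# scaling total FLOPS grow by (alpha * beta^2 * gamma^2) ** phi.
alpha, beta, gamma = 1.2, 1.1, 1.15       # EfficientNet-B0's searched values
base = alpha * beta**2 * gamma**2
print(base)                                # ~1.92, i.e. approximately 2
for phi in (1, 2, 3):
    print(phi, base ** phi)                # FLOPS multiplier ~ 2^phi
```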

EfficientNet architecture

The scaling method above does not change the per-layer operators of the baseline network, so to improve model accuracy a good baseline network is also essential. In the experiments, the author evaluates the compound scaling method on existing convolutional networks, but to better demonstrate its effectiveness, the author also designs a new lightweight baseline network, EfficientNet. (Note: "lightweight" here means a convolutional network with few enough parameters to run on mobile devices.)

The structure of EfficientNet-B0 is given in Table 1. Its backbone is built from MBConv blocks, and the author also applies the squeeze-and-excitation operation to optimize the network (see SENet, the ILSVRC 2017 winner). Scaling up EfficientNet-B0 with the compound method takes two steps:
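For reference, here is a minimal PyTorch-style sketch of a squeeze-and-excitation block. This is a generic SENet-style block, assumed for illustration; it is not the exact implementation inside EfficientNet's MBConv:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Generic squeeze-and-excitation block (SENet-style), shown for illustration."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.gate = nn.Sequential(                   # excite: per-channel gates in [0, 1]
            nn.Conv2d(channels, squeezed, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(squeezed, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(self.pool(x))           # reweight channels

x = torch.randn(1, 32, 56, 56)
print(SqueezeExcite(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```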

  • Step 1: fix Φ at 1, assume roughly twice the resources are available, and run a small grid search over α, β, and γ based on Equations (2) and (3). For EfficientNet-B0 in particular, under the constraint α·β²·γ² ≈ 2, the network performs best with α = 1.2, β = 1.1, and γ = 1.15.

  • Step 2: fix α, β, and γ as constants, then scale the baseline network with different values of Φ using Equation (3) to obtain EfficientNet-B1 through EfficientNet-B7 (see the sketch below).
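Below is a compact sketch of both steps under the constraint α·β²·γ² ≈ 2. The grid ranges are illustrative assumptions (the paper does not publish its exact search grid), and in the real step 1 each candidate would be scored by actually training and evaluating the network:

```python
import itertools

# Step 1 (illustrative): small grid search for alpha, beta, gamma at phi = 1,
# keeping only candidates that satisfy alpha * beta^2 * gamma^2 ~= 2.
grid = [round(1.0 + 0.05 * i, 2) for i in range(9)]   # 1.00 .. 1.40, assumed range
candidates = [
    (a, b, g)
    for a, b, g in itertools.product(grid, repeat=3)
    if abs(a * b**2 * g**2 - 2.0) < 0.1
]
print(len(candidates), "candidates satisfy the FLOPS constraint")

# Step 2: fix the searched constants and scale with different phi.
alpha, beta, gamma = 1.2, 1.1, 1.15                   # values reported in the paper
for phi in range(1, 8):                                # roughly B1 .. B7
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input size x{r:.2f}")
```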

Grid search is run only on the small baseline network (step 1), and the resulting coefficients are then applied directly to larger models (step 2), because searching the parameters directly on a large model would be prohibitively expensive. The two-step procedure keeps the cost of determining the scaling parameters manageable.

Experiments

In the experiments, the author first validates the proposed scaling method on the widely used MobileNets and ResNet. The results are shown in Table 3:

[Table 3: scaling up MobileNets and ResNet with single-dimension vs. compound scaling]

Compared with single-dimension scaling, compound scaling improves accuracy on all three network models, which shows that the method is effective for existing networks.

The author then trains EfficientNet on the ImageNet database. The results show that EfficientNet models use an order of magnitude fewer parameters and FLOPS than other convolutional networks while reaching comparable accuracy. In particular, EfficientNet-B7 achieves 84.4% top-1 and 97.1% top-5 accuracy, more accurate than GPipe with a model 8.4 times smaller:

[Table: EfficientNet vs. other convolutional networks on ImageNet]

The author also evaluates EfficientNet on commonly used transfer learning datasets, where it consistently improves over other types of networks. The results are shown below:

[Figure: transfer learning results of EfficientNet on common datasets]

To understand why compound scaling achieves better results, the author visualizes the networks' class activation maps, comparing how the activation maps of baseline networks with the same configuration change under different scaling methods:

[Figure: class activation maps under different scaling methods]

As the maps show, the compound-scaled model attends to the regions that carry the target's details, while the models under the other configurations fail to capture the fine-grained details of the target in the image.

Summary

This paper examines the problems with existing model scaling methods and proposes a compound scaling method built on balancing a network's depth, width, and resolution. The method is validated on two network families, MobileNets and ResNet. In addition, the author uses neural architecture search to design a new baseline network, EfficientNet, and scales it into the EfficientNet family. On standard image classification datasets, EfficientNets surpass previous convolutional networks while using fewer parameters and running inference faster.


Source: blog.51cto.com/15060462/2678456