Paper Reading: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

This week I plan to learn about lightweight networks, starting with a review of MobileNetV1 from 2017.

Summary

MobileNetV1: a lightweight network for mobile and embedded devices, proposed by Google in 2017; it introduced the depthwise separable convolution.


Section I Introduction

Since AlexNet won the 2012 ILSVRC competition, networks have kept growing deeper and more complex, often ignoring efficiency and the real-time requirements of some tasks. MobileNet designs a low-latency, lightweight neural network suitable for mobile and embedded devices, tunable via two hyperparameters (a width multiplier and a resolution multiplier).

Section II Prior Work

With the help of depthwise separable convolutions, MobileNet targets latency and model size directly while keeping the architecture streamlined. Other approaches to small networks either design compact architectures, such as SqueezeNet's bottleneck strategy, or compress an existing network via factorization, quantization, hashing, pruning, and network distillation.

Section III MobileNet

This section first introduces MobileNet's core design, the depthwise separable convolution, and then introduces the MobileNet network structure.

Part A Depthwise Separable Convolution

Depthwise separable convolution decomposes a standard convolution into a depthwise (DW) convolution and a pointwise (PW) convolution. In DW convolution, each kernel convolves a single input channel, so the number of filters equals the number of input channels; PW convolution then applies 1x1xM kernels (M is the number of channels) to linearly combine the feature maps produced by the DW stage. DW and PW thus split a standard convolution into a filtering step and a combination step, decoupling the number of output channels from the kernel size and significantly reducing both computation and model size.
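As a rough illustration (a minimal NumPy sketch, not the paper's code), the two stages can be written out directly: the depthwise stage filters each channel independently, and the pointwise stage mixes channels with a 1x1 convolution.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """x: (H, W, M) input; dw_kernels: (K, K, M), one filter per channel;
    pw_kernels: (M, N) 1x1 kernels that mix channels. Valid padding, stride 1."""
    H, W, M = x.shape
    K = dw_kernels.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    # Depthwise stage: filter each channel independently (no cross-channel mixing)
    dw_out = np.zeros((Ho, Wo, M))
    for m in range(M):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[i, j, m] = np.sum(x[i:i+K, j:j+K, m] * dw_kernels[:, :, m])
    # Pointwise stage: a 1x1 convolution is a linear combination of channels
    # at each spatial position, i.e. a matrix product along the channel axis
    return dw_out @ pw_kernels   # shape (Ho, Wo, N)
```

For a (4, 4, 2) input with 3x3 depthwise kernels and (2, 3) pointwise kernels, the output has shape (2, 2, 3): the spatial size comes from the DW stage and the channel count from the PW stage.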


DW only filters each channel; it does not combine channels to generate new feature representations. The 1x1 PW convolution is responsible for this linear combination. The computational cost of a depthwise separable convolution is therefore:

D_K · D_K · M · D_F · D_F + M · N · D_F · D_F

where D_K is the kernel size, M and N are the input and output channel counts, and D_F is the feature-map size.

A standard convolution costs D_K · D_K · M · N · D_F · D_F, so the computation is compressed by a factor of:

(D_K · D_K · M · D_F · D_F + M · N · D_F · D_F) / (D_K · D_K · M · N · D_F · D_F) = 1/N + 1/D_K²

For example, MobileNet uses 3x3 convolutions, so depthwise separable convolutions use roughly 8 to 9 times less computation than standard convolutions. For a visual explanation, please refer to: Depthwise convolution and Pointwise convolution.
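The cost arithmetic above can be checked with a small helper (the symbols D_K, M, N, D_F follow the paper's notation; the layer dimensions below are illustrative):

```python
def standard_cost(D_K, M, N, D_F):
    # Standard convolution: D_K * D_K * M * N * D_F * D_F multiply-adds
    return D_K * D_K * M * N * D_F * D_F

def separable_cost(D_K, M, N, D_F):
    # Depthwise term (D_K * D_K * M * D_F * D_F) plus pointwise term (M * N * D_F * D_F)
    return D_K * D_K * M * D_F * D_F + M * N * D_F * D_F

# A typical MobileNet layer: 3x3 kernels, 512 channels, 14x14 feature map
ratio = separable_cost(3, 512, 512, 14) / standard_cost(3, 512, 512, 14)
# ratio = 1/N + 1/D_K^2 = 1/512 + 1/9, close to the 1/9 quoted above
```

Because 1/N is tiny when N is large, the 1/D_K² term dominates, which is where the "about 1/9 for 3x3 kernels" figure comes from.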
Part B MobileNet

Except for the final fully connected layer, every layer is followed by BN + ReLU. Fig. 3 compares a conventional convolution block with a depthwise separable block. Down-sampling is handled by strided convolutions, and an average pooling layer compresses the spatial resolution to 1 before the features are sent to the FC layer. Counting DW and PW convolutions as separate layers, the entire MobileNet has 28 layers; see Table 1 for details.

Table 2 shows that the main computation time of MobileNet (about 95%) is spent in 1x1 convolutions, which are also the main source of network parameters; the other notable source of parameters is the fully connected layer. Since small-scale models are not so easy to overfit, regularization and data augmentation are not overemphasized in MobileNet training.
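A quick sanity check on the 28-layer count (assuming, per Table 1, one initial standard convolution, 13 depthwise separable blocks each counted as a DW layer plus a PW layer, and the final FC layer):

```python
# 1 initial standard conv + 13 depthwise separable blocks
# (each counted as one DW layer plus one PW layer) + 1 fully connected layer
n_layers = 1 + 13 * 2 + 1
print(n_layers)  # 28
```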

Part C Hyperparameter: Width Multiplier (α)


Although the constructed MobileNet is already much smaller, some applications require an even more streamlined network. To further thin the network, the width multiplier α is introduced. α ranges over (0, 1] and scales the number of input and output channels of each layer, reducing the overall computation and parameter count to roughly α² times the original.
With the width multiplier applied, the cost of one layer becomes:

D_K · D_K · αM · D_F · D_F + αM · αN · D_F · D_F
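A minimal sketch of how α shrinks a layer's cost (the layer dimensions are illustrative, not taken from the paper):

```python
def layer_cost(D_K, M, N, D_F, alpha=1.0):
    # Width multiplier alpha thins both the input and output channel counts
    M_a, N_a = int(alpha * M), int(alpha * N)
    return D_K * D_K * M_a * D_F * D_F + M_a * N_a * D_F * D_F

ratio = layer_cost(3, 512, 512, 14, alpha=0.5) / layer_cost(3, 512, 512, 14)
# close to alpha^2 = 0.25, since the pointwise term dominates the cost
```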

Hyperparameter: Resolution Multiplier (ρ)



The resolution multiplier ρ, also in (0, 1], is applied to the input image, which implicitly scales the feature representation of every intermediate layer; it similarly reduces the computational cost to roughly ρ² times the original.



With both multipliers applied, the cost of one layer becomes:

D_K · D_K · αM · ρD_F · ρD_F + αM · αN · ρD_F · ρD_F
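Similarly, a sketch of the resolution multiplier ρ, which scales the feature-map side D_F (again with illustrative layer dimensions):

```python
def layer_cost_rho(D_K, M, N, D_F, rho=1.0):
    # Resolution multiplier rho shrinks the spatial resolution of the layer
    D = int(rho * D_F)
    return D_K * D_K * M * D * D + M * N * D * D

ratio = layer_cost_rho(3, 512, 512, 14, rho=0.5) / layer_cost_rho(3, 512, 512, 14)
# exactly rho^2 = 0.25 here, since every term scales with D_F^2
```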







Section IV Experiments




Part A Image classification

The experiments first verify the effectiveness of the depthwise separable convolution: compared with an otherwise identical network built from standard convolutions, ImageNet accuracy drops by only 1.1%, while the amount of computation falls by an order of magnitude. When streamlining the network further, compressing the width is more effective than compressing the resolution (roughly a 3% accuracy advantage). MobileNet achieves classification accuracy similar to VGG-16 with a small fraction of the parameters; compared with GoogLeNet at a similar parameter count, it uses significantly fewer mult-adds and performs better. Even after further compression by the width and resolution multipliers, the reduced MobileNet still classifies better than AlexNet.

Part B Fine-grained image classification





This article also tests MobileNet on fine-grained image classification, reporting accuracy on the Stanford Dogs fine-grained dataset (see Table 10 for details). MobileNet achieves accuracy close to the state of the art while greatly reducing the number of model parameters.


In addition, the effectiveness of MobileNet was also tested in tasks such as object detection and face recognition.




Section V Conclusion





This paper proposes MobileNet, a lightweight network based on depthwise separable convolutions, well suited to applications with strict real-time requirements and to mobile or embedded devices. The network can be further compressed by adjusting the two hyperparameters, the width and resolution multipliers, trading off accuracy against model size and computation. The experiments verify the effectiveness and efficiency of MobileNet on several classic tasks.





Summary: the core contribution of MobileNetV1 is the depthwise separable convolution.

Origin blog.csdn.net/qq_37151108/article/details/106208453