【MobileNetsv1】: Efficient Convolutional Neural Networks for Mobile Vision Applications

https://arxiv.org/pdf/1704.04861.pdf
2017
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam


Abstract

We present an efficient class of models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. We introduce two simple global hyperparameters that efficiently trade off latency and accuracy. These hyperparameters allow the model builder to choose the right-sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy trade-offs and show strong performance compared with other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases, including object detection, fine-grained classification, face attributes, and large-scale geolocation.


1 Introduction

Convolutional neural networks have become ubiquitous in computer vision since AlexNet [19] popularized deep convolutional neural networks by winning the ImageNet Challenge: ILSVRC 2012 [24]. The general trend is to make deeper and more complex networks to achieve higher accuracy [27, 31, 29, 8]. However, these advances in accuracy do not necessarily make the network more efficient in terms of size and speed. In many real-world applications, such as robotics, autonomous vehicles, and augmented reality, recognition tasks need to be completed in a timely manner on platforms with limited computing resources.

This paper describes an efficient network architecture and a set of two hyperparameters for building very small, low-latency models that can easily match the design requirements of mobile and embedded vision applications. Section 2 reviews prior work on building small models. Section 3 introduces the MobileNet architecture and two hyperparameters, the width multiplier and the resolution multiplier, which define smaller and more efficient MobileNets. Section 4 describes experiments on ImageNet and a variety of different applications and use cases. Section 5 summarizes and concludes.


Figure 1. MobileNet models can be applied to various recognition tasks for efficient on-device intelligence.


2. Prior Work

Interest in building small and efficient neural networks has been growing in the recent literature, e.g. [16, 34, 12, 36, 22]. These approaches can generally be categorized as either compressing pretrained networks or training small networks directly. This paper proposes a class of network architectures that allows a model developer to specifically choose a small network that matches the resource constraints (latency, size) of their application. MobileNets primarily focus on optimizing for latency, but also yield small networks. Many papers on small networks focus only on size and do not consider speed.

MobileNets are built primarily from depthwise separable convolutions, originally introduced in [26] and subsequently used in the Inception model [13] to reduce the computation in the first few layers. Flattened Networks [16] build a network out of fully factorized convolutions and demonstrate the potential of extremely factorized networks. Independently of this paper, Factorized Networks [34] introduce a similar use of factorized convolutions as well as topological connections. Subsequently, the Xception network [3] demonstrated how to scale up depthwise separable filters to outperform the Inception V3 network. Another small network is SqueezeNet [12], which uses a bottleneck approach to design a very small network. Other networks that reduce computational load include structured transform networks [28] and deep fried convnets [37].

Another way to obtain small networks is to shrink, factorize, or compress pretrained networks. Compression methods based on product quantization [36], hashing [2], and pruning, vector quantization, and Huffman coding [5] have been proposed in the literature. Furthermore, various factorizations have been proposed to speed up pretrained networks [14, 20]. Another method for training small networks is distillation [9], which uses a larger network to teach a smaller network. It is complementary to our approach and is covered in some of the use cases in Section 4. Another emerging approach is low-bit networks [4, 22, 11].


3. MobileNet architecture

In this section, we first describe the core layer of the MobileNet architecture, namely depthwise separable convolutions. We then describe the MobileNet network structure and summarize two model reduction hyperparameters: width multiplier and resolution multiplier.

3.1. Depthwise separable convolution


The MobileNet model is built on depthwise separable convolutions, a form of factorized convolution that decomposes a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution. In MobileNets, the depthwise convolution applies a single filter to each input channel; the pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines the input in a single step, whereas the depthwise separable convolution splits this into two layers, one for filtering and one for combining. This factorization drastically reduces computation and model size. Figure 2 shows how a standard convolution is factorized into a depthwise convolution and a 1×1 pointwise convolution.
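To make the two-step structure concrete, here is a minimal sketch of a depthwise separable block in PyTorch (our own illustration, not code from the paper): a 3×3 depthwise convolution with `groups` equal to the number of input channels, followed by a 1×1 pointwise convolution. The BatchNorm/ReLU after each convolution and the channel sizes in the usage example are assumptions for illustration, based on the full MobileNet architecture described later.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Minimal sketch of a depthwise separable convolution block.

    Assumed details (not stated in the text above): BatchNorm + ReLU after
    each convolution, as in the full MobileNet body architecture.
    """
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: 1x1 convolution that combines the depthwise outputs.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))   # filtering step
        x = self.relu(self.bn2(self.pointwise(x)))   # combining step
        return x

# Hypothetical example: a 32-channel 112x112 feature map mapped to 64 channels.
x = torch.randn(1, 32, 112, 112)
block = DepthwiseSeparableConv(32, 64)
print(block(x).shape)  # torch.Size([1, 64, 112, 112])
```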

The standard convolutional layer takes as input a $D_F \times D_F \times M$ feature map F and produces a $D_F \times D_F \times N$ feature map G, where $D_F$ is the spatial width and height of the (square) input feature map, M is the number of input channels (input depth), $D_G$ is the spatial width and height of the (square) output feature map, and N is the number of output channels (output depth).

The standard convolutional layer is parameterized by a convolution kernel K of size $D_K \times D_K \times M \times N$, where $D_K$ is the spatial dimension of the kernel (assumed to be square), M is the number of input channels, and N is the number of output channels, as defined previously.

Assuming a stride of 1 and padding, the output feature map of a standard convolution is calculated as follows:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m} \tag{1}$$
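As a sanity check on the indexing in equation (1), a naive NumPy implementation might look like the sketch below (our own illustration, using 0-based indices and "same" zero padding; the array shapes and example sizes are assumed for illustration):

```python
import numpy as np

def standard_conv(F: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Naive standard convolution per equation (1), stride 1, 'same' padding.

    F: input feature map of shape (D_F, D_F, M)
    K: kernel of shape (D_K, D_K, M, N)
    Returns G of shape (D_F, D_F, N).
    """
    d_f, _, m = F.shape
    d_k, _, _, n = K.shape
    pad = d_k // 2
    Fp = np.pad(F, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    G = np.zeros((d_f, d_f, n))
    for k in range(d_f):
        for l in range(d_f):
            for n_out in range(n):
                # G[k, l, n] = sum_{i,j,m} K[i, j, m, n] * F[k+i-1, l+j-1, m]
                # (0-based equivalent of the 1-based indexing in equation (1))
                G[k, l, n_out] = np.sum(
                    K[:, :, :, n_out] * Fp[k:k + d_k, l:l + d_k, :]
                )
    return G

# Tiny hypothetical example: 5x5 input with 3 channels, 3x3 kernel, 2 output channels.
F = np.random.randn(5, 5, 3)
K = np.random.randn(3, 3, 3, 2)
print(standard_conv(F, K).shape)  # (5, 5, 2)
```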

The computational cost of standard convolution is:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \tag{2}$$
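Plugging concrete numbers into equation (2) shows how quickly this cost grows. The helper below is our own sketch, and the example layer sizes (3×3 kernel, 32→64 channels, 112×112 feature map) are hypothetical choices for illustration:

```python
def standard_conv_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-adds of a standard convolution, per equation (2):
    D_K * D_K * M * N * D_F * D_F (stride 1, padded)."""
    return d_k * d_k * m * n * d_f * d_f

# Hypothetical example layer: 3x3 kernel, 32 -> 64 channels, 112x112 feature map.
print(standard_conv_cost(d_k=3, m=32, n=64, d_f=112))  # 231211008 mult-adds
```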
