Interpretation of paper | MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Reprinted: https://blog.csdn.net/u013082989/article/details/77970196

1 Overview

  • A neural network model proposed by Google in 2017 for mobile devices
  • Its main idea is the depthwise separable convolution (Depthwise Separable Convolution), which decomposes the convolution kernel to reduce the amount of computation
  • It introduces two hyperparameters to further reduce computation and the number of parameters

              Width multiplier (Width Multiplier): reduces the number of input and output channels

              Resolution multiplier (Resolution Multiplier): reduces the size of the feature maps

2 Depthwise Separable Convolution

  • A standard convolution can be split into a depthwise convolution and a 1x1 convolution (called a pointwise convolution), as shown below

depthwise separable convolution

2.1 Standard convolution

  • A standard convolution layer transforms an input of dimension D_{F}\times D_{F}\times M into an output of dimension D_{G}\times D_{G}\times N (how this is computed is described in detail in another post).

D_{F} is the width/height of the input feature map, M is the number of input channels

D_{G} is the width/height of the output feature map, N is the number of output channels

  • Suppose the convolution kernel (filter) has size D_{k}\times D_{k}; the computational cost of the standard convolution is

D_{k}\cdot D_{k} \cdot M \cdot N \cdot D_{F} \cdot D_{F}

In the figure referenced from the paper, look at the kernel matrix portion: each kernel is a D_{k}\times D_{k} square, multiplied by the number of input and output channels, and it is then applied over the input feature map.

kernel matrix

In a standard convolution, no matter how many channels the current position has, each filter combines all of them into a single output channel after the convolution.
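To make the cost formula concrete, here is a minimal Python sketch of it (the function name and the example numbers are illustrative, not from the paper):

```python
def standard_conv_cost(d_f, d_k, m, n):
    """Multiply-accumulate count of a standard convolution:
    D_k * D_k * M * N * D_F * D_F (stride 1, 'same' padding assumed)."""
    return d_k * d_k * m * n * d_f * d_f

# Example: 14x14 feature map, 3x3 kernel, 512 input and 512 output channels
print(standard_conv_cost(d_f=14, d_k=3, m=512, n=512))  # 462422016
```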

 

2.2 Depthwise Separable Convolution

  • It consists of two steps

Step 1, depthwise convolution: the convolution kernel size is D_{k}\cdot D_{k} \cdot M \cdot 1 (one filter per input channel), and the total computation is:

D_{k}\cdot D_{k} \cdot M \cdot D_{F} \cdot D_{F}

Step 2, pointwise convolution: the convolution kernel size is 1 \cdot 1 \cdot M \cdot N, and the total computation is: M \cdot N \cdot D_{F} \cdot D_{F}

The ratio of this computation to that of the standard convolution is:

(D_{k}\cdot D_{k} \cdot M \cdot D_{F} \cdot D_{F} + M \cdot N \cdot D_{F} \cdot D_{F}) / (D_{k}\cdot D_{k} \cdot M \cdot N \cdot D_{F} \cdot D_{F}) = 1/N + 1/D_{k}^{2}

Since MobileNet uses 3x3 convolution kernels, the amount of computation is reduced by a factor of about 8-9 (the ratio is 1/N + 1/9).
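As an illustration of the two steps, here is a minimal PyTorch sketch (assuming stride 1 and a 3x3 kernel; the class and variable names are my own, not from the paper):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution (one filter per input channel) followed by a
    1x1 pointwise convolution that mixes the channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # groups=in_channels gives each filter exactly one input channel (D_k x D_k x 1)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # 1x1 convolution with M inputs and N outputs (1 x 1 x M x N)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 16, 32, 32)          # M = 16 input channels, 32x32 feature map
y = DepthwiseSeparableConv(16, 32)(x)   # N = 32 output channels
print(y.shape)                          # torch.Size([1, 32, 32, 32])
```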
Step 1, the depthwise convolution, performs the convolution separately on each channel

regular conv

Step 2, the pointwise convolution, then combines the channels

PointwiseConvolution

3: The depthwise separable convolution splits a standard convolution kernel into a depthwise kernel and a 1x1 pointwise kernel. Suppose the input is a feature map with M channels, the kernel size is D_{K}\times D_{K}, and the number of output channels is N; the standard convolution kernel is then M\times D_{K}\times D_{K}\times N. For example, if the input feature map is m\times n\times 16 and we want 32 output channels, the standard convolution kernel is 16\times 3\times 3\times 32. It can be decomposed into a depthwise convolution of 16\times 3\times 3, which yields a 16-channel feature map, followed by a pointwise convolution of 16\times 1\times 1\times 32. With the standard convolution the computation is m\cdot n\cdot 16\cdot 3\cdot 3\cdot 32 = m\cdot n\cdot 4608; with the depthwise separable convolution it is m\cdot n\cdot 16\cdot 3\cdot 3 + m\cdot n\cdot 16\cdot 1\cdot 1\cdot 32 = m\cdot n\cdot 656.
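The arithmetic in this example can be checked directly with a short Python sketch (per-pixel multiply-accumulates for an m x n feature map; variable names are my own):

```python
M, N, DK = 16, 32, 3

standard  = M * DK * DK * N   # per-pixel MACs of the standard convolution
depthwise = M * DK * DK       # per-pixel MACs of the depthwise step
pointwise = M * 1 * 1 * N     # per-pixel MACs of the pointwise step

print(standard)                            # 4608  -> m*n*4608 in total
print(depthwise + pointwise)               # 656   -> m*n*656 in total
print(standard / (depthwise + pointwise))  # about 7x fewer operations
```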

4: Compared with the standard convolution, the ratio of computation is therefore: (D_{k}\cdot D_{k}\cdot D_{F}\cdot D_{F}\cdot M + D_{F}\cdot D_{F}\cdot M\cdot N) / (D_{k}\cdot D_{k}\cdot M\cdot N\cdot D_{F}\cdot D_{F}) = 1/N + 1/D_{k}^{2}

5: A schematic of the depthwise separable convolution operation is shown below:
 


6: MobileNet has 28 layers in total (a depthwise convolution and a pointwise convolution are each counted as a separate layer), and each layer is followed by a batchnorm layer and a ReLU layer.
 

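A minimal PyTorch sketch of the repeated building block (depthwise conv + BN + ReLU, then pointwise conv + BN + ReLU); this shows only one block, not the full 28-layer network, and the function name is my own:

```python
import torch.nn as nn

def conv_dw(in_ch, out_ch, stride=1):
    """One MobileNet-style block: depthwise 3x3 conv + BN + ReLU,
    then pointwise 1x1 conv + BN + ReLU (counted as two layers)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, 1, 0, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```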

7: Two hyperparameters are introduced, the width multiplier and the resolution multiplier.
The width multiplier \alpha is mainly used to reduce the number of channels: the number of input channels M becomes \alpha M, and the number of output channels N becomes \alpha N.
With the width multiplier, the total computation becomes: D_{k}\cdot D_{k}\cdot \alpha M\cdot D_{F}\cdot D_{F} + \alpha M\cdot \alpha N\cdot D_{F}\cdot D_{F}

The resolution multiplier \rho is mainly used to reduce the resolution of the input image, i.e., it acts on the feature map.
With the resolution multiplier, the total computation becomes: D_{k}\cdot D_{k}\cdot \alpha M\cdot \rho D_{F}\cdot \rho D_{F} + \alpha M\cdot \alpha N\cdot \rho D_{F}\cdot \rho D_{F}
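A small Python sketch of how \alpha and \rho enter the formula (the function name and the example values are illustrative, not from the paper):

```python
def separable_conv_cost(d_k, d_f, m, n, alpha=1.0, rho=1.0):
    """Multiply-accumulate count of one depthwise separable layer with
    width multiplier alpha and resolution multiplier rho:
    D_k*D_k*(alpha*M)*(rho*D_F)^2 + (alpha*M)*(alpha*N)*(rho*D_F)^2"""
    m_a, n_a, d_f_r = alpha * m, alpha * n, rho * d_f
    return d_k * d_k * m_a * d_f_r ** 2 + m_a * n_a * d_f_r ** 2

base = separable_conv_cost(d_k=3, d_f=14, m=512, n=512)
thin = separable_conv_cost(d_k=3, d_f=14, m=512, n=512, alpha=0.75, rho=160 / 224)
print(thin / base)  # roughly alpha^2 * rho^2, since the pointwise term dominates
```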

A look at MobileNet_ssd
MobileNet_ssd caffe model visualization address: MobileNet_ssd
It can be seen that conv13 is the last layer of the backbone network. Following the structure of VGG-SSD, MobileNet-SSD adds eight convolution layers after conv13, and a total of six layers are extracted as detection layers; there seems to be no layer with 38*38 resolution, probably because it would sit too early in the network.
The six layers from which default boxes are extracted are conv11, conv13, conv14_2, conv15_2, conv16_2 and conv17_2, and the number of default boxes generated per cell of each of these feature map layers is 3, 6, 6, 6, 6, 6 respectively. That is, the 3*3 convolution layers attached after these six layers for coordinate regression (named conv11_mbox_loc, ......) have num_output values of 12, 24, 24, 24, 24, 24.

Similarly, the 3*3 convolution layers attached after these six layers for the class scores (named conv11_mbox_conf, ......) have num_output values of 3*21 = 63 (21 classes, 3 default boxes), 126, 126, 126, 126, 126.
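The num_output arithmetic above can be reproduced with a short Python sketch (the variable names are my own):

```python
num_classes = 21                      # 20 VOC classes + background
default_boxes = [3, 6, 6, 6, 6, 6]    # per cell, for the six detection layers

loc_outputs = [b * 4 for b in default_boxes]             # 4 box coordinates per default box
conf_outputs = [b * num_classes for b in default_boxes]  # one score per class per box

print(loc_outputs)   # [12, 24, 24, 24, 24, 24]
print(conf_outputs)  # [63, 126, 126, 126, 126, 126]
```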
 


Origin blog.csdn.net/weixin_39875161/article/details/90615482