The MobileNet Series: A Detailed Explanation (one article is enough)

Foreword

This article covers a CV topic that most readers already know to some degree: MobileNet. Many of its modules and lightweight design ideas are still widely used, so I want to summarize the series, organize it, and share my own understanding and opinions. Convolutional neural networks (CNNs) are used throughout computer vision and have achieved excellent results. In recent years CNN models have grown ever deeper and more complex; the deep residual network ResNet, for example, has as many as 152 layers. However, in real application scenarios such as mobile or embedded devices, such large and complex models are hard to deploy. First, the models are too large and run into memory limits. Second, these scenarios demand low latency, i.e. fast response: imagine what would happen if the pedestrian detection system of a self-driving car were slow. So, at least for now, studying small and efficient CNN models is crucial in these scenarios, even though future hardware will keep getting faster.

The current research can be summarized into two directions:

  • One is to compress a trained complex model to obtain a small model;
  • The other is to directly design a small model and train it from scratch.

Either way, the goal is to reduce the model size (number of parameters) while maintaining accuracy, and at the same time to increase speed (lower latency). The protagonist of this article, MobileNet, belongs to the latter direction: it is a family of small, efficient CNN models proposed by Google that makes a deliberate trade-off between accuracy and latency.

1. Detailed Explanation of MobileNet V1

The main innovation in V1 is replacing ordinary convolution with depthwise separable convolution, plus two hyperparameters that let you flexibly scale the model to fit your resource budget.

So what is Depthwise separable convolution?

Historically, the idea dates back to the 2012 paper Simplifying ConvNets for Fast Learning, where the authors proposed the concept of separable convolutions:

  Dr. Laurent Sifre extended separable convolution to the depth dimension during his internship at Google in 2013, and described it in detail in his doctoral thesis Rigid-motion scattering for image classification; interested readers can consult the thesis. There are two main types of separable convolutions: spatially separable convolutions and depthwise separable convolutions.

Spatial Separable Convolution

As the name implies, spatial separability splits one large convolution kernel into two smaller ones, for example splitting a 3*3 kernel into a 3*1 kernel followed by a 1*3 kernel:

Since MobileNet does not use this, I won't go into detail here.

Depthwise Separable Convolution

Depthwise separable convolution is a factorized convolution: it decomposes a standard convolution into two smaller operations, a depthwise convolution and a pointwise convolution.

For a standard convolution, suppose the input feature map is 12*12*3 and one 5*5*3 kernel produces an 8*8*1 output feature map. If we use 256 such kernels, we obtain an 8*8*256 output feature map, as shown in the following figure:

In a depthwise convolution (effectively a grouped convolution in which the number of groups equals the number of channels), the channels of the feature map are separated: each channel is treated as a single-channel map and convolved on its own, producing an output with the same number of channels as the input. Assuming a 12*12*3 input feature map, a 5*5*1*3 depthwise convolution yields an 8*8*3 output. The number of channels stays fixed at 3, which creates a problem: the channel count is too small, the feature dimensionality is too low, and information cannot be captured effectively.

Pointwise convolution is simply 1*1 convolution, and its main role is to increase or reduce the channel dimension of the feature map. The depthwise convolution above gives us an 8*8*3 output feature map. We then apply 256 kernels of size 1*1*3 to it, and the output is again 8*8*256, the same as with the standard convolution. As shown below:

The process of standard convolution and depthwise separable convolution is compared as follows:
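To make the two paths concrete, here is a minimal PyTorch sketch of the 12*12*3 example above (the toy shapes and variable names are mine, purely for illustration):

```python
import torch
import torch.nn as nn

# Depthwise: groups == in_channels, so each 5x5 kernel sees exactly one channel.
depthwise = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5, groups=3, bias=False)
# Pointwise: 1x1 convolution that mixes the 3 channels up to 256.
pointwise = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=1, bias=False)

x = torch.randn(1, 3, 12, 12)             # N, C, H, W
print(pointwise(depthwise(x)).shape)       # torch.Size([1, 256, 8, 8])

# The standard convolution it replaces produces the same output shape:
standard = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=5, bias=False)
print(standard(x).shape)                   # torch.Size([1, 256, 8, 8])
```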

 Advantages of Depthwise Separable Convolutions

For standard convolution, each kernel has size Dk*Dk*M and there are N of them, so the parameter count of standard convolution is:

Dk*Dk*M*N

The parameter count of a depthwise separable convolution consists of two parts, the depthwise convolution and the pointwise convolution. The depthwise kernels have size Dk*Dk, one per input channel (M in total); the pointwise kernels have size 1*1*M and there are N of them. So the parameter count of the depthwise separable convolution is:

Dk*Dk*M + M*N

For standard convolution, each kernel has size Dk*Dk*M, there are N of them, and each is applied at Dw*Dh output positions, so the computation (multiply-adds) of standard convolution is:

Dk*Dk*M*N*Dw*Dh

The computation of a depthwise separable convolution also consists of the depthwise part and the pointwise part. The depthwise kernels of size Dk*Dk (one per channel, M in total) are each applied at Dw*Dh positions; the pointwise kernels of size 1*1*M (N in total) are also applied at Dw*Dh positions. So the computation of the depthwise separable convolution is:

Dk*Dk*M*Dw*Dh + M*N*Dw*Dh

Dividing the two, the ratio of depthwise separable to standard convolution works out to 1/N + 1/(Dk*Dk), roughly an 8 to 9 times reduction for 3*3 kernels. Depthwise separable convolution is therefore much cheaper than ordinary convolution in both computation and parameter count, which greatly reduces the network's inference latency.
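To see the saving in numbers, here is a quick check of the formulas above in Python (the layer sizes Dk=3, M=N=512, Dw=Dh=14 are arbitrary values picked only for illustration):

```python
# Back-of-the-envelope check of the parameter and multiply-add formulas above.
Dk, M, N, Dw, Dh = 3, 512, 512, 14, 14

std_params = Dk * Dk * M * N                       # standard convolution
dws_params = Dk * Dk * M + M * N                   # depthwise + pointwise

std_madds = Dk * Dk * M * N * Dw * Dh
dws_madds = Dk * Dk * M * Dw * Dh + M * N * Dw * Dh

print(dws_params / std_params)   # ~0.113, i.e. 1/N + 1/Dk**2
print(dws_madds / std_madds)     # ~0.113, the same ratio
```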

That wraps up depthwise separable convolution; now let's see how MobileNet V1 applies it, as shown in the figure below (the figure should say ReLU6; the label in the figure is wrong). The left side is a traditional convolution block, the right side is a depthwise separable convolution block. The extra ReLU6 adds nonlinearity and strengthens the model's generalization ability.
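As a sketch of the right-hand block, here is what one V1-style depthwise separable unit might look like in PyTorch (a rough reimplementation for illustration, not the official code):

```python
import torch
import torch.nn as nn

def dw_separable_block(cin, cout, stride=1):
    """3x3 depthwise + BN + ReLU6, then 1x1 pointwise + BN + ReLU6."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin, bias=False),
        nn.BatchNorm2d(cin),
        nn.ReLU6(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU6(inplace=True),
    )

x = torch.randn(1, 32, 112, 112)
print(dw_separable_block(32, 64)(x).shape)   # torch.Size([1, 64, 112, 112])
```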

By now the basic meaning of the block above should be clear. Next, let's talk about why ReLU6 is used, what the benefit is, and why plain ReLU is not used.

ReLU6 is just an ordinary ReLU whose output is clamped at a maximum value of 6. The ReLU6 function and its derivative are shown in the figure below:

So why use ReLU6 instead of ReLU? The reason is low-precision arithmetic (such as float16/int8) on mobile devices: capping the activation preserves good numerical resolution. Without the cap, ReLU's output ranges from 0 to infinity, so activations can become very large and spread over a huge range; low-precision formats such as float16 on embedded devices cannot describe such a large range precisely, which causes a loss of accuracy. This is the first innovation of the paper.
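A tiny sketch of ReLU6, for reference (PyTorch also provides it directly as nn.ReLU6):

```python
import torch
import torch.nn as nn

def relu6(x):
    # ReLU capped at 6: min(max(x, 0), 6)
    return torch.clamp(x, min=0.0, max=6.0)

x = torch.tensor([-2.0, 3.0, 10.0])
print(relu6(x))         # tensor([0., 3., 6.])
print(nn.ReLU6()(x))    # built-in equivalent
```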

The second innovation: although MobileNet's structure is already small and fast, many applications need an even smaller and faster model. For this, the paper introduces the width multiplier alpha and, to further control model size, the resolution multiplier rho.

The width multiplier alpha (Width Multiplier) shrinks the number of input and output channels at every layer: the input channels go from M to alpha*M and the output channels from N to alpha*N. The computation after applying it becomes:

Dk*Dk*(alpha*M)*Dw*Dh + (alpha*M)*(alpha*N)*Dw*Dh

Usually alpha is in (0, 1], with typical values 1, 0.75, 0.5 and 0.25. Both the computation and the parameter count shrink to roughly alpha**2 of their values without the width multiplier.

The resolution multiplier rho (resolution multiplier) controls the resolution of the input image and of the internal feature maps. With both multipliers, the computation of the depthwise and pointwise convolutions becomes:

Dk*Dk*(alpha*M)*(rho*Dw)*(rho*Dh) + (alpha*M)*(alpha*N)*(rho*Dw)*(rho*Dh)

Usually rho is in (0, 1], and typical input resolutions are 224, 192, 160 and 128. The computation shrinks to roughly alpha**2 * rho**2 of the original, while the parameter count is unaffected by rho. With these two hyperparameters the model can be shrunk further; the paper also gives concrete experimental results.
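To illustrate how the two multipliers scale a layer's cost, here is a small sketch using the same hypothetical layer sizes as before:

```python
# Multiply-adds of one depthwise separable layer under the width and resolution multipliers.
def dws_madds(Dk, M, N, Dw, Dh, alpha=1.0, rho=1.0):
    M, N = int(alpha * M), int(alpha * N)        # width multiplier shrinks the channels
    Dw, Dh = int(rho * Dw), int(rho * Dh)        # resolution multiplier shrinks the feature map
    return Dk * Dk * M * Dw * Dh + M * N * Dw * Dh

base = dws_madds(3, 512, 512, 14, 14)
print(dws_madds(3, 512, 512, 14, 14, alpha=0.5) / base)           # ~0.25, roughly alpha**2
print(dws_madds(3, 512, 512, 14, 14, alpha=0.5, rho=0.5) / base)  # ~0.064, roughly alpha**2 * rho**2
```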

  • L2 regularization (weight decay) is not applied to the depthwise convolution weights, because they contain very few parameters.
  • Reducing the width of the network is more cost-effective than reducing its depth, which suggests depth matters more (shown experimentally in the paper).

In MobileNet V1, the depthwise convolution applies a single filter to each input channel, and the pointwise convolution then applies a 1*1 convolution to combine the depthwise outputs. A standard convolution does both, filtering and combining the inputs into a new set of outputs, in a single step; the depthwise separable convolution splits this into two layers, one layer for filtering and one layer for combining. The overall MobileNet structure is shown below. It has 28 layers in total (not counting the AvgPool and FC layers, and counting the depthwise and pointwise convolutions separately); except for the first layer, which uses a standard convolution, every convolutional layer uses depthwise separable convolution.

2. Detailed Explanation of MobileNet V2

The MobileNet V2 architecture was released in early 2018. It builds on the ideas of MobileNet V1 and adds new ones. Architecturally, MobileNet V2 introduces two new modules: (1) linear bottlenecks between the layers, and (2) shortcut connections between the bottlenecks. The core idea is that the bottlenecks encode the model's intermediate inputs and outputs, while the inner layers encapsulate the model's ability to transform from lower-level concepts (such as pixels) to higher-level descriptors (such as image categories); don't worry if this isn't clear yet, I will explain it in detail below. Finally, as with traditional residual connections, the shortcuts enable faster training and better accuracy.

MobileNet V1's structure is quite simple, and its main problem lies in the depthwise convolution. Depthwise convolution does reduce computation, but its kernels train poorly and easily become degenerate, i.e. many of the trained depthwise kernels end up all zeros. The author attributes this to ReLU: applying ReLU to feature maps with few channels causes a large loss of information, and that loss is what drives many of the kernels to zero.

Concretely, when low-dimensional information is mapped to a higher-dimensional space, passed through ReLU, and then mapped back to the low-dimensional space, little information is lost if the intermediate dimension is high; but if the intermediate dimension is low, the information recovered after mapping back is badly damaged, as shown in the following figure:

As the figure shows, when the original input is expanded to 15 dimensions and then passed through ReLU, not much information is lost; but if it is only expanded to 2~5 dimensions before ReLU, the loss of information is severe. The conclusion is that applying ReLU in low dimensions easily destroys information, whereas applying it in high dimensions loses very little. Another way to see it: mapping high-dimensional information back down to low dimensions is a form of feature compression, which already loses some information, and applying ReLU afterwards loses even more, since ReLU zeroes out all negative inputs. For this reason the author replaced the ReLU after the dimensionality-reducing layer with a linear activation function.

V2 adds a new pointwise (PW) convolution before the depthwise (DW) convolution. The reason is that, by construction, a DW convolution cannot change the number of channels: it can only output as many channels as the previous layer gives it. So if the previous layer has very few channels, DW can only extract features in a low-dimensional space, and the results suffer. V2 therefore pairs every DW with a preceding PW whose sole job is to expand the dimension, with an expansion factor of t=6, so that no matter how small the input channel count Cin is, the DW works in a relatively high-dimensional space (t*Cin). In addition, V2 removes the activation function after the second PW; the paper's authors call this a Linear Bottleneck. The reasoning is that an activation function adds useful nonlinearity in high-dimensional space but destroys features in low-dimensional space, where a linear layer works better. Since the second PW's main job is dimensionality reduction, using ReLU6 after it would be inappropriate by the argument above. This design is called Inverted Residuals and Linear Bottlenecks.

The name can be translated as "inverted residual module". What does that mean? Let's compare the residual module and the inverted residual module.

  • Residual module: the input is first compressed with a 1*1 convolution, features are then extracted with a 3*3 convolution, and finally a 1*1 convolution expands the channels back. The whole process is "compression-convolution-expansion"; its purpose is to reduce the computation of the 3*3 convolution and make the residual module more efficient.
  • Inverted residual module: the input first goes through a 1*1 convolution that expands the channels, then a 3*3 depthwise convolution, and finally a 1*1 pointwise convolution that compresses the channels back. The whole process is "expansion-convolution-compression". Why do it this way? Because the depthwise convolution cannot change the channel count, its feature extraction is limited by the number of input channels, so the channels are expanded first. The expansion factor in the paper is 6.

It is clearer to represent this with the following figure.

The MobileNet V2 building block is shown in the figure below. When stride=1, the input first goes through a 1*1 convolution that expands the channels, with ReLU6 as the activation; then a 3*3 depthwise convolution, again with ReLU6; then a 1*1 pointwise convolution compresses the channels back, with a linear activation; finally the shortcut adds the input and the output together. When stride=2, the input and output feature maps have different sizes, so there is no shortcut.
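Putting the pieces together, a minimal PyTorch sketch of such an inverted residual block could look like this (an illustrative reimplementation of the block described above, with t=6; not the reference code):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand (1x1 + ReLU6) -> depthwise (3x3 + ReLU6) -> project (1x1, linear)."""
    def __init__(self, cin, cout, stride=1, t=6):
        super().__init__()
        hidden = cin * t
        self.use_shortcut = (stride == 1 and cin == cout)
        self.block = nn.Sequential(
            nn.Conv2d(cin, hidden, 1, bias=False),               # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                # 3x3 depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, cout, 1, bias=False),              # 1x1 linear projection (no ReLU6)
            nn.BatchNorm2d(cout),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24, 24, stride=1)(x).shape)   # shortcut used: torch.Size([1, 24, 56, 56])
print(InvertedResidual(24, 32, stride=2)(x).shape)   # no shortcut:   torch.Size([1, 32, 28, 28])
```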

 3D graphics:

Finally, here is the network structure of V2. In the table, t is the expansion factor, c is the number of output channels, n is the number of times the layer is repeated, and s is the stride. You can see that the V2 network is much deeper than V1: V2 has 54 layers.

 3. Detailed Explanation of MobileNet V3

MobileNet V3 was published in 2019. It comes in two versions, MobileNet-V3 Large and MobileNet-V3 Small, targeted at different resource budgets. V3 combines the depthwise separable convolution of V1, the Inverted Residuals and Linear Bottleneck of V2, and the SE module, and uses NAS (Neural Architecture Search) to search for the network configuration and parameters. Coincidentally, I worked on NAS during my postgraduate studies, so this part was comfortable to write; I won't say too much about NAS here, but it is nice to see it again.

First, here are the key innovations of V3; I will then explain each of them in turn.

  • Modify the number of initial convolution kernels
  • Change the computationally expensive layer at the end of the network
  • Introduced the SE module
  • H-Swish activation function

Modify the number of initial convolution kernels

In V2's input layer, a 3*3 convolution expands the input to 32 channels. The authors found that 16 filters are actually enough, so V3 reduces this to 16; with no loss in accuracy, latency drops by about 3ms. You can see this change in the network structure given at the end.

Change the computationally expensive layer at the end of the network

In MobileNetV2, there is a 1*1 convolutional layer before the Avg Pooling whose purpose is to raise the feature map to a higher dimension, which helps prediction but also adds computation. The V3 authors therefore move this 1*1 convolution to after the Avg Pooling: the pooling first shrinks the feature map from 7*7 to 1*1, and only then is the 1*1 convolution used to raise the dimension, cutting that layer's computation by a factor of 7*7 = 49. To reduce computation further, they also remove the 3*3 depthwise and 1*1 projection convolutions of the preceding bottleneck, giving the structure shown in the second row of the figure below. After removing them, accuracy is not lost; this change cuts roughly 10ms of latency and speeds up inference by about 15%, with almost no loss of accuracy.

 Introduce H-Swish activation function

In V2, the activation after the last pointwise convolution of the block was already replaced with a linear function, and the other activations are ReLU6. The V3 authors found that the Swish function noticeably improves network accuracy, but Swish contains a Sigmoid, which is expensive to compute, so they propose the h-swish activation function instead.

Swish is defined as swish(x) = x * sigmoid(x); its curve looks like this:

(I will write a separate article explaining activation functions in detail.) Key features of Swish:

h-swish replaces the Sigmoid with a ReLU6-based approximation: h-swish(x) = x * ReLU6(x + 3) / 6.

This nonlinearity keeps accuracy while bringing several advantages. First, ReLU6 is available in virtually every software and hardware framework. Second, it avoids the numerical precision loss that the sigmoid suffers under quantization, and it runs fast. Using h-swish does add about 15% to the model's latency, but its overall effect on the accuracy/latency trade-off is positive, and the remaining overhead can be eliminated by fusing the nonlinearity with the preceding layers.
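For reference, here is a minimal sketch of both functions (PyTorch's nn.Hardswish implements the same piecewise formula):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def swish(x):
    # swish(x) = x * sigmoid(x): accurate, but the sigmoid is expensive
    return x * torch.sigmoid(x)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6: a cheap piecewise-linear approximation
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-4.0, 4.0, 9)
print(swish(x))
print(h_swish(x))
print(nn.Hardswish()(x))   # built-in equivalent of h_swish
```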

Introduced the SE module

This is the part I most want to focus on, because other MobileNet write-ups usually skim over it. The core computation in a CNN is the convolution operator, which learns new feature maps from the input feature maps via convolution kernels. In essence, convolution fuses features over a local region, both spatially (the H and W dimensions) and across channels (the C dimension). Because convolution only fuses local regions, an ordinary CNN's receptive field is small; you can enlarge it by adding more channels or branches, but that greatly increases computation. To fuse more spatial information, or to extract multi-scale spatial information, many approaches have been proposed, such as the multi-branch structure of the Inception networks. For fusion along the channel dimension, a convolution by default simply mixes all channels of the input feature map. The innovation of SENet is to focus explicitly on the relationships between channels, letting the model automatically learn how important each channel's features are. To this end, SENet proposes the Squeeze-and-Excitation (SE) module, shown in the following figure:

In effect this is a channel-wise attention mechanism: by assigning each channel a weight, the network can boost feature channels that are useful and suppress those that contribute little to the current task. The concrete SE operations are shown in the figure below.

These are all the operations in the SE module. First, a global pooling layer reduces each channel to a single value; then two fully connected layers learn the per-channel weights; finally a sigmoid-type gate (H-Sigmoid in MobileNet V3) produces the weights, which are multiplied back onto the original feature map. Most mainstream networks today are built by repeatedly stacking a small number of such block units, so the SE module can be embedded into nearly any modern architecture; inserting SE modules into the building blocks of existing networks yields variants such as SE-BN-Inception and SE-ResNet.

The 3D graph in V3 is as follows:

In V3 the SE module is inserted into V2's bottleneck structure, after the depthwise filter. It is a lightweight channel attention module, and because it does cost some time, it is applied to the pooled output: after the depthwise convolution, a pooling layer squeezes the feature map, the first FC layer reduces the channel count by a factor of 4, the second FC layer expands it back (by 4x), and the resulting weights are multiplied channel-wise with the depthwise output. The authors found that this improves accuracy with essentially no increase in running time.
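A minimal sketch of such an SE block (reduction factor 4 and a hard-sigmoid gate, as described above; the class and variable names are mine, for illustration only):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Global pool -> FC (reduce by 4) -> ReLU -> FC (expand back) -> h-sigmoid -> channel-wise scale."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: one value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Hardsigmoid(),                            # h-sigmoid gate used in MobileNet V3
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.pool(x).view(n, c)          # (N, C)
        w = self.fc(w).view(n, c, 1, 1)      # per-channel weights in [0, 1]
        return x * w                         # excite: reweight each channel

x = torch.randn(2, 64, 14, 14)
print(SqueezeExcite(64)(x).shape)            # torch.Size([2, 64, 14, 14])
```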

MobileNet V3 network structure

MobileNet V3 first uses the MnasNet approach to search for a coarse structure, using reinforcement learning to select the best configuration from a discrete set of choices. It then fine-tunes the architecture with NetAdapt, a complementary technique that trims under-utilized activation channels in small decrements.

(1) Resource-constrained NAS (platform-aware NAS): searches for each module of the network under constraints on computation and parameter count, hence the name block-wise search (Block-wise Search).

(2) NetAdapt: used to fine-tune each layer once the modules have been determined.

        Network architecture search is a powerful tool for exploring and optimizing model structures. The researchers first used platform-aware NAS to build the global network structure, and then used the NetAdapt algorithm to optimize the number of filters in each layer. For the global search they used the same RNN-based controller and hierarchical search space as MnasNet, optimizing the accuracy/latency trade-off for a specific hardware platform within a target latency range (~80ms). Each layer was then tuned in turn with NetAdapt, reducing the size of the expansion layers and the bottlenecks in each layer to cut latency as much as possible while maintaining accuracy.

  In addition, another novel idea in MobileNet V3 is adding the "Squeeze-and-Excitation" network (SENet, also the ImageNet 2017 image classification champion) to the core architecture. Its central idea is to improve the quality of the representations the network produces by explicitly modeling the interdependencies between the network's convolutional feature channels: the importance of each feature channel is learned automatically, and that result is used to promote useful features and suppress features that are less useful for the current task.

  To this end, the developers propose a mechanism that lets the network perform feature recalibration: the network learns to use global information to selectively emphasize informative features and suppress less useful ones. In MobileNet V3, the architecture extends MobileNetV2 by including SE blocks as part of the search space, which results in a more robust architecture.

  Another interesting optimization in MobileNet V3 is the redesign of some of the architecture's more expensive layers. Certain layers in MobileNetV2 are essential for accuracy but also introduce latency; by applying some basic optimizations, MobileNet V3 removes three expensive layers of the MobileNetV2 architecture without sacrificing accuracy.

   The structure of V3 is shown in the figure below. The authors provide two versions, Large and Small, for high-resource and low-resource scenarios respectively; both were found via NAS.

With that, this series is basically complete. I dare not claim this is the most complete write-up on the web, but basically everything I could think of is here, compiled and further summarized after going through many excellent blogs. If anything is wrong, I hope you will point it out promptly. Thank you!

Reference: Convolutional Neural Network Study Notes - Lightweight Network MobileNet Series (V1, V2, V3)
