4. CNN network architectures - depthwise separable convolution (MobileNet v1, MobileNet v2, MobileNet v3, Xception, ShuffleNet v1, ShuffleNet v2)

The paper "A review of convolutional neural network architectures and their optimizations" points out that high-performance convolutional neural network methods inevitably bring huge computing costs and often require the support of high-performance GPUs or highly optimized distributed CPU architectures. Although CNN applications are expanding to mobile terminals, most mobile devices have neither powerful compute nor large memory. Research on lightweight network architectures is therefore needed to handle these constraints. A lightweight convolutional neural network generally refers to a smaller network structure obtained after compression and acceleration, with the following characteristics:

1. Small communication requirements with the server;

2. Few network parameters and a small model size;

3. Suitable for deployment on devices with limited memory.

To let CNNs meet these requirements while maintaining performance, the depthwise separable convolution architecture was proposed. This post mainly introduces the following network structures: MobileNet v1, MobileNet v2, MobileNet v3, Xception, ShuffleNet v1, and ShuffleNet v2.

Below, this post first explains what a depthwise separable convolution is and then introduces each of the above network architectures.

Table of contents

1. Depthwise separable convolution

1.1 Depthwise Convolution (channel-by-channel convolution)

1.2 Pointwise Convolution

1.3 Comparison of parameters and computation

2. MobileNet v1 (2017)

2.1 Network structure

2.2 Paper

2.3 Reference blog post

3. MobileNet v2 (2018)

3.1 Network structure

3.2 Linear Bottlenecks (linear bottleneck structure)

3.3 Inverted Residuals

3.4 Paper

4. MobileNet v3 (2019)

4.1 Network Architecture

4.2 Network Improvements

4.3 Paper

5. Xception (2017)

5.1 Improvements based on the Inception v3 model

5.2 Network structure

5.3 Paper

5.4 Reference blog post

6. ShuffleNet v1 (2017)

6.1 Group convolution

6.2 Channel shuffle

6.3 ShuffleNet Units

6.4 Network structure

6.5 Paper

6.6 Reference blog post

7. ShuffleNet v2 (2018)

7.1 Four tips for designing lightweight networks

7.2 ShuffleNet Unit

7.3 Network structure

7.4 Paper

7.5 Reference blog post


1. Depthwise separable convolution

Depthwise separable convolution combines two convolution variants: Depthwise Convolution and Pointwise Convolution.

1.1 Depthwise Convolution (channel-by-channel convolution)

A depthwise convolution kernel has only one channel, and each channel of the input is convolved by exactly one kernel, so the number of output feature-map channels is exactly the same as the number of input channels. Take a 5×5×3 color image (height and width 5, three RGB channels) as an example: each depthwise convolution layer has as many kernels as the previous layer has channels (channels and kernels correspond one to one). With stride 1 and no padding, the three-channel image produces three 3×3 feature maps after the operation, as shown in the blue dashed box in the figure above.

After depthwise convolution, the number of feature maps is the same as the number of input channels; the operation can neither expand nor compress the channel dimension. Moreover, it convolves each input channel independently, so it does not exploit the correlation between features of different channels at the same spatial position. In short, although the computation is reduced, cross-channel information exchange is lost. Pointwise convolution is therefore needed to combine these feature maps into new ones.

1.2  Pointwise Convolution

Pointwise convolution is essentially a 1×1 convolution, so it operates much like a conventional convolution. Its kernel size is 1×1×M, where M is the number of channels output by the previous layer. Each pointwise kernel forms a weighted combination of the previous feature maps along the channel direction to produce one new feature map; with N kernels, N new feature maps are output, as shown in the red dashed box in the figure.
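To make the two steps concrete, here is a minimal sketch, assuming PyTorch (the framework choice and class name are ours): the depthwise step sets groups equal to the number of input channels, and the pointwise step is an ordinary 1×1 convolution. The shapes mirror the 5×5×3 example above.

```python
# A minimal sketch of depthwise separable convolution, assuming PyTorch.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depthwise: groups=in_ch gives every input channel its own kernel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   groups=in_ch, bias=False)
        # Pointwise: a 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 3, 5, 5)                 # the 5x5x3 image example
y = DepthwiseSeparableConv(3, 4)(x)
print(y.shape)                              # torch.Size([1, 4, 3, 3])
```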

1.3 Comparison of parameters and computation

Number of parameters of a conventional convolution:

The convolutional layer has 4 filters, each containing a 3×3 kernel with 3 channels (matching the input). The parameter count is therefore (kernel W × kernel H × input channels × output channels):

N_std = 4 × 3 × 3 × 3 = 108

Number of depthwise separable convolution parameters:

N_depthwise = 3 × 3 × 3 = 27

N_pointwise = 1 × 1 × 3 × 4 = 12

N_separable = N_depthwise + N_pointwise = 39

The amount of conventional convolution calculation:

The computation (with stride 1 and no padding) is given by: kernel W × kernel H × (image W − kernel W + 1) × (image H − kernel H + 1) × input channels × output channels:

C_std =3×3×(5-2)×(5-2)×3×4=972

Depthwise separable convolution computation:

The computation of the depthwise convolution is: kernel W × kernel H × (image W − kernel W + 1) × (image H − kernel H + 1) × input channels:

C_depthwise=3x3x(5-2)x(5-2)x3=243

The computation of the pointwise convolution is: 1 × 1 × feature map W × feature map H × input channels × output channels:

C_pointwise = 1 × 1 × 3 × 3 × 3 × 4 = 108

C_separable = C_depthwise + C_pointwise = 351

In conclusion:

With the same input, 4 feature maps are again obtained, but the depthwise separable convolution needs only about 1/3 of the parameters and computation of the conventional convolution. Therefore, for the same parameter budget, a network built from depthwise separable convolutions can be made deeper.
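These counts can be checked mechanically. A small sketch, assuming PyTorch and bias-free convolutions to match the hand calculation above:

```python
# Verify N_std = 108 versus N_separable = 39, assuming PyTorch.
import torch.nn as nn

std = nn.Conv2d(3, 4, kernel_size=3, bias=False)            # conventional
dw  = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)  # depthwise
pw  = nn.Conv2d(3, 4, kernel_size=1, bias=False)            # pointwise

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std))                # 108
print(count(dw) + count(pw))     # 27 + 12 = 39
```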

2. MobileNet v1 (2017)

MobileNet v1 decomposes the traditional convolution into two steps, depthwise convolution and pointwise convolution, achieving significant savings in both model size and computation.

2.1 Network structure

The network structure of MobileNet is shown in the table below. It begins with a 3×3 standard convolution, followed by a stack of depthwise separable convolutions; as the table shows, some of the depthwise convolutions downsample with stride 2. After feature extraction, average pooling reduces the feature map to 1×1, a fully connected layer sized to the number of predicted classes follows, and finally a softmax layer.

Viewed as a whole, and not counting the average pooling and softmax layers, the network has 28 layers; the stride-2 convolutions double as downsampling. Every convolutional layer is followed by batch normalization (BN) and a ReLU activation. In the structure diagram, the left side shows a standard convolution followed by BN and ReLU, while the right side shows the depthwise separable version, in which the depthwise convolution and the pointwise convolution are each followed by their own BN and ReLU.
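As a sketch of the right-hand block just described (assuming PyTorch; the helper name is ours), the depthwise and pointwise convolutions each get their own BN and ReLU:

```python
# MobileNet v1 style depthwise separable block: dw conv -> BN -> ReLU,
# then pw conv -> BN -> ReLU. A sketch, assuming PyTorch.
import torch.nn as nn

def dw_separable_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```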

2.2 Paper

Paper: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"

https://arxiv.org/pdf/1704.04861.pdf

Contribution: Replacing ordinary convolution with depthwise separable convolution;

Defect: some depthwise convolution kernels can train to all zeros, leaving invalid (dead) kernels;

2.3 Reference blog post

1. Detailed Explanation of Lightweight Neural Network MobileNet Family Bucket_Lightweight Network Model_Rayne Layne's Blog-CSDN Blog

2. Image Classification for Deep Learning (9): MobileNet Series (V1, V2, V3) - Elementary School Student of Magic Academy

3. MobileNet v2 (2018)

Based on MobileNet v1, Mark Sandler et al. proposed the inverted residual with linear bottleneck and built MobileNet v2 around it, which is both faster and more accurate than MobileNet v1.

3.1 Network structure 

The MobileNet v2 network structure is shown in the figure below: t is the expansion factor of the first 1×1 convolution in the inverted residual; c is the number of output channels; n is the number of stacked bottlenecks; s is the stride of the depthwise convolution (1 or 2), and different strides correspond to different modules (as shown in figure (d)). In practice, on the ImageNet classification task, the parameter count is reduced compared with v1 and the accuracy is better.

The MobileNet v2 model has 17 bottleneck layers in total (each bottleneck contains two pointwise convolution layers and one depthwise convolution layer), one standard convolution layer (conv), and two pointwise convolution layers (pw conv), for a total of 54 layers with trainable parameters. MobileNet v2 optimizes the network with the linear bottleneck and inverted residual structures, making the network deeper while keeping the model smaller and faster.

3.2 Linear Bottlenecks (linear bottleneck structure)

This module removes the ReLU at the bottleneck of the inverted residual block. As the figure below shows, the ReLU after the final 1×1 convolution is replaced with a linear activation: the 1×1 projection already reduces dimensionality and loses some information, and applying ReLU to tensors with few channels destroys still more information by zeroing out values. There are two variants of the linear bottleneck structure: when the stride is 1, a residual connection is used; when the stride is 2, it is not.

3.3 Inverted Residuals

As the comparison in the figure below shows, if the ResNet residual block were used directly with channel-wise convolution in MobileNet, the already low feature dimensionality would be compressed further and the representation would become too small. The inverted residual therefore first expands the channels (by a factor of 6) and then compresses them, so the features are not squeezed excessively. The residual structure of MobileNet v2 is essentially the linear bottleneck structure plus a residual connection.

The residual structure in ResNet first uses a pointwise convolution to reduce the dimension, then a standard 3×3 convolution, and then a pointwise convolution to restore the dimension.

The residual structure in MobileNet v2 first uses a pointwise convolution to increase the dimension, with a ReLU6 activation instead of ReLU, then a depthwise convolution, also with ReLU6, and then a pointwise convolution to reduce the dimension, followed by a linear activation.
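Putting 3.2 and 3.3 together, here is a minimal sketch of the inverted residual block, assuming PyTorch (class name ours; the expansion factor defaults to 6, though the first block of the real network uses 1): 1×1 expand with ReLU6, 3×3 depthwise with ReLU6, 1×1 linear projection, plus a residual only when the stride is 1 and the channel counts match.

```python
# Inverted residual with linear bottleneck, a sketch assuming PyTorch.
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),      # 1x1 expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),         # 3x3 depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),     # 1x1 project
            nn.BatchNorm2d(out_ch),                       # linear: no ReLU
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```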

3.4 Paper

Paper: "MobileNetV2: Inverted Residuals and Linear Bottlenecks"

https://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf

Contribution: Introducing inverted residual and linear bottleneck structures;

Defect: feature extraction efficiency still leaves room for optimization;

4. MobileNet v3 (2019)

Andrew G. Howard et al. proposed MobileNet v3 in 2019. The paper presents two models, MobileNetV3-Small and MobileNetV3-Large, corresponding to versions for low and high compute-and-storage budgets respectively.

  • The Large version has a total of 15 bottleneck layers, one standard convolution layer, and three pointwise convolution layers.

  • The Small version has a total of 12 bottleneck layers, one standard convolutional layer, and two pointwise convolutional layers.

4.1 Network Architecture 

MobileNet V3-Large structure diagram:

MobileNet v3 introduces 5×5 depthwise convolutions to replace some of the 3×3 depthwise convolutions, and adds the Squeeze-and-Excitation (SE) module and the h-swish (HS) activation function to improve accuracy. The last two pointwise convolution layers do not use batch normalization; they are marked NBN in the MobileNet v3 structure diagrams.

MobileNet V3-Small structure diagram:

4.2 Network Improvements 

1. Improvements based on MobileNetV2:

The authors found that the last stage of the v2 network could be optimized. As shown in the figure above, layers such as the 3×3 depthwise convolution and a 1×1 convolution are removed. The original structure used 1×1 convolutions to adjust the feature dimension, which helps prediction accuracy but costs time. The authors therefore move the average pooling earlier, shrinking the feature map from 7×7 to 1×1 before those convolutions, with almost no loss in accuracy.

2. h-swish:

The paper points out that the swish activation function is more accurate than ReLU, but swish is expensive to compute on mobile devices, so the authors propose the h-swish activation function as a replacement: its effect is similar to swish at a much lower computational cost. The expression is h-swish(x) = x · ReLU6(x + 3) / 6.
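A minimal sketch of both "hard" functions, assuming PyTorch:

```python
# h-sigmoid(x) = ReLU6(x + 3) / 6 and h-swish(x) = x * h-sigmoid(x).
import torch
import torch.nn.functional as F

def h_sigmoid(x):
    return F.relu6(x + 3.0) / 6.0

def h_swish(x):
    return x * h_sigmoid(x)

x = torch.linspace(-5.0, 5.0, 5)
print(h_swish(x))   # piecewise-linear, cheap approximation of swish
```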

Comparison of the sigmoid and swish nonlinearities with their "hard" counterparts:

3. Squeeze-and-Excite (SE):

The SE module is an image-recognition structure published by the autonomous-driving company Momenta in 2017. It improves accuracy by modeling the correlation between feature channels and strengthening the important ones: the SE module learns larger weights for effective features and smaller weights for ineffective or weak ones, which leads to a better model. The improved SE module used here preserves accuracy and performance while reducing parameters and computation compared with the traditional SE module.

Compared with MobileNet v2, MobileNet v3 adds the SE structure and reduces the number of channels in the expand layer of blocks that contain it. This was found to increase accuracy with only a moderate increase in parameters and no noticeable latency cost.
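A minimal sketch of the SE block in this style, assuming PyTorch (class name ours); the reduction ratio of 4 and the h-sigmoid gate follow the MobileNetV3 paper:

```python
# Squeeze-and-Excitation block with h-sigmoid gating, a sketch in PyTorch.
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)        # squeeze: B x C x 1 x 1
        s = F.relu(self.fc1(s))
        s = F.relu6(self.fc2(s) + 3.0) / 6.0   # h-sigmoid excitation
        return x * s                           # reweight the channels
```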

 Reference link:

SE: Interpretation of Squeeze-and-Excitation Networks (SENet) - Programmer Sought

4.3 Paper

Paper: "Searching for MobileNetV3"

https://openaccess.thecvf.com/content_ICCV_2019/papers/Howard_Searching_for_MobileNetV3_ICCV_2019_paper.pdf

Contribution: Introduce the Squeeze-and-excitation (SE) module and h-swish (HS) activation function, and introduce the network structure generated by Neural Architecture Search (NAS) technology;

Defects: poor interpretability and a complicated model structure, which hinder deployment; the architecture search is computationally expensive;

5. Xception (2017)

Xception builds on the idea of depthwise separable convolution and can be viewed as an extreme version of the Inception architecture. The figure below depicts the Xception module: it widens the original Inception module and replaces the different spatial convolutions (1×1, 3×3, 3×3) with a 1×1 convolution followed by 3×3 convolutions, which adjusts the computational complexity. First, the input feature map is processed by a 1×1 convolution. Then each channel of the resulting feature map is convolved with its own 3×3 kernel. Finally, all outputs are concatenated to form the new feature map. Although Xception has slightly fewer training parameters than Inception v3, its recognition accuracy and training speed match Inception v3, and it performs better on larger datasets.

Compared with a standard depthwise separable convolution, the only difference is the order of operations: Xception applies the 1×1 convolution before the depthwise convolution rather than after it. Both reveal the power of depthwise separable convolution from different angles: MobileNet reduces parameters by factorizing ordinary convolution, while Xception arrives at the same place by fully decoupling Inception.
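The ordering difference fits in a few lines; a sketch assuming PyTorch, with BN and activations omitted for brevity (function names ours):

```python
# Xception-style "extreme Inception" (1x1 first) versus the standard
# depthwise separable ordering (depthwise first). A sketch in PyTorch.
import torch.nn as nn

def extreme_inception(in_ch, out_ch):        # pointwise, then depthwise
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch, bias=False),
    )

def depthwise_separable(in_ch, out_ch):      # depthwise, then pointwise
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
    )
```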

5.1 Improvements based on the Inception v3 model

Improvement steps for Inceptionv3:

(1) The original module of Inceptionv3 is shown in the figure below

(2) Simplify the Inception module:

(3) Inception improvement

For an input feature map, first obtain three sets of feature maps through three separate 1×1 convolutions. This is exactly equivalent to applying a single 1×1 convolution and then splitting the resulting feature maps into three groups.

(4) Xception module

An "extreme" version of the Inception module: one spatial (3×3) convolution per output channel of the 1×1 convolution. The parameter count of this module is 1/k that of an ordinary convolution, where k is the number of 3×3 convolutions.

5.2 Network structure

The data first passes through the Entry flow, then through the Middle flow eight times, and finally through the Exit flow. Note that every convolution and separable convolution layer is followed by batch normalization (not shown in the figure), and all separable convolution layers use a depth multiplier of 1 (no depth expansion).

5.3 Paper

Paper: "Xception: Deep Learning with Depthwise Separable Convolutions"

https://arxiv.org/pdf/1610.02357v3.pdf

Contribution: pushing the Inception idea to its extreme by combining it with depthwise separable convolution;

Defects: high computational overhead and complex model structure;

5.4 Reference blog post

1. Image Classification for Deep Learning (5): GoogLeNet - Elementary School Student at Magic Academy

6. ShuffleNet v1 (2017)

ShuffleNet v1 is a lightweight convolutional neural network for mobile devices, proposed by Megvii Technology at the end of 2017.

The innovation of this network is its use of group convolution and channel shuffle, which preserve accuracy while greatly reducing the required computing resources. The authors propose pointwise group convolution to cut the cost of pointwise convolution; however, in a group convolution there is no connection between groups, which hurts accuracy to some degree, so they add a channel shuffle operation to strengthen the exchange of information between groups. For a given computational budget, the network can then afford more channels and retain more information.

6.1 Group convolution


For the same number of parameters and the same amount of computation, a group convolution produces g times as many feature maps as an ordinary convolution, so it extracts more feature information from a small parameter and computation budget. As shown in the figure below, g controls the number of groups: its minimum is 1 and its maximum is the number of input channels. When g = 1, it is an ordinary convolution; when g equals the number of input channels, it becomes a depthwise convolution.
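In PyTorch this is just the groups argument. The sketch below shows the dual view of the claim above: with fixed channel counts the parameter count shrinks by a factor of g, so for a fixed parameter budget a group convolution can afford roughly g times as many channels:

```python
# Group convolution parameter counts: out_ch * (in_ch / g) * k * k.
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
for g in (1, 2, 4, 8):
    conv = nn.Conv2d(8, 8, kernel_size=3, groups=g, bias=False)
    print(g, count(conv))    # prints 576, 288, 144, 72
```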

 

6.2 Channel shuffle

In a group convolution, only the feature maps within a group are fused; there is no computation across groups, so feature maps in different groups learn less and less about the features of other groups. The authors therefore propose that after each group's feature maps are computed by group convolution, the results are shuffled in a controlled order before being fed to the next group convolution layer, increasing the number of connections and fusions between feature maps. The process is shown in the figure below:

In the figure above, (a) is a normal group convolution, while (b) and (c) are channel shuffle methods.
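Channel shuffle itself is just a reshape, a transpose, and a flatten back; a minimal sketch, assuming PyTorch:

```python
# Channel shuffle: split channels into groups, transpose, flatten back.
import torch

def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

x = torch.arange(6.0).view(1, 6, 1, 1)     # channels 0..5, 3 groups of 2
print(channel_shuffle(x, 3).flatten())     # tensor([0., 2., 4., 1., 3., 5.])
```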

6.3 ShuffleNet Units

As shown in the figure below: (a) a bottleneck unit with depthwise convolution (DWConv); (b) a ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle; (c) a ShuffleNet unit with stride = 2.

(b) is the case where the 3×3 convolution has stride 1. It is very similar to the depthwise bottleneck in (a), but to further reduce parameters, the 1×1 convolutions are replaced with 1×1 group convolutions, and a channel shuffle is added to guarantee information exchange between groups. Note that the channel shuffle comes after the first 1×1 group convolution: the channels are first compressed, then shuffled, and finally a second group convolution restores the original channel count. (c) is the case with stride 2, where the output feature map is halved in size and doubled in channel dimension. To make the final concat possible, the two branches must output feature maps of the same size, so a 3×3 average pooling with stride 2 is added to the shortcut branch.
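A sketch of the stride-1 unit in (b), assuming PyTorch (class name ours); the bottleneck ratio of 4 and g = 3 follow the paper's defaults, and the shuffle from 6.2 is inlined:

```python
# Stride-1 ShuffleNet v1 unit: 1x1 group conv -> shuffle -> 3x3 depthwise
# -> 1x1 group conv, with an identity shortcut. A sketch in PyTorch.
import torch
import torch.nn as nn

class ShuffleUnit(nn.Module):
    def __init__(self, channels, groups=3, bottleneck_ratio=4):
        super().__init__()
        mid = channels // bottleneck_ratio
        self.groups = groups
        self.reduce = nn.Sequential(             # compress channels
            nn.Conv2d(channels, mid, 1, groups=groups, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.dw = nn.Sequential(                 # depthwise, no ReLU
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid))
        self.expand = nn.Sequential(             # restore channels
            nn.Conv2d(mid, channels, 1, groups=groups, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.reduce(x)
        b, c, h, w = out.shape                   # channel shuffle (see 6.2)
        out = out.view(b, self.groups, c // self.groups, h, w)
        out = out.transpose(1, 2).contiguous().view(b, c, h, w)
        out = self.expand(self.dw(out))
        return self.relu(x + out)                # add, then ReLU

y = ShuffleUnit(240)(torch.randn(1, 240, 28, 28))
print(y.shape)                                   # torch.Size([1, 240, 28, 28])
```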

6.4 Network structure

The figure below shows the ShuffleNet v1 network structure. Stride denotes the stride, and different strides correspond to different ShuffleNet units; repeat denotes the number of repetitions. For example, in stage 3 the ShuffleNet unit with stride 2 is applied once, and the unit with stride 1 is then repeated seven times.

On the basis of the architecture above, and following the idea of MobileNet v1, the authors also introduce a hyperparameter s that scales the number of channels. s = 1 is the standard network structure; s = 0.5 means the input and output channel counts of every stage are half of those in the table, and so on. Scaling the channels by a factor of s reduces the overall computational complexity and parameter count by roughly a factor of s².

6.5 Paper

Paper: "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices"

https://arxiv.org/pdf/1707.01083.pdf

Contribution: introduce pointwise group convolution and channel shuffle;

Defect: produces boundary effects;

6.6 Reference blog post 

1.ShuffleNet: Disruption and shuffling of channels

 2. Image classification for deep learning (10): ShuffleNet series (V1, V2) - Elementary school student of Magic Academy

7. ShuffleNet v2 (2018)

In 2018, Ma et al. proposed a new architecture, ShuffleNet v2, and pointed out in the paper that FLOPs, as a measure of computational complexity, are not actually equivalent to speed: networks with similar FLOPs can differ hugely in speed, so memory access cost and the degree of parallelism on the GPU also need to be considered.

7.1 Four tips for designing lightweight networks

In the paper, the authors point out four guidelines for designing lightweight networks:

G1) Equal channel width minimizes memory access cost (MAC).

When the input and output of a convolution have the same number of channels, the memory access cost is minimized. As the table below shows, throughput (images processed per second) is highest when input:output = 1:1.
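The paper backs this up with a short derivation for a 1×1 convolution on an h × w feature map with c1 input channels and c2 output channels. The FLOPs are B = h × w × c1 × c2, and the memory access cost (reading the input, writing the output, and reading the weights) is:

MAC = h × w × (c1 + c2) + c1 × c2

By the AM-GM inequality,

MAC ≥ 2 × √(h × w × B) + B / (h × w)

and the lower bound is attained exactly when c1 = c2, i.e. when the channel widths are equal.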

G2) Excessive group convolution increases MAC.

Too many groups increase the memory access cost: as the table below shows, speed drops sharply as the number of groups grows, especially on the GPU. This also shows that fewer parameters and lower computational complexity do not automatically mean faster execution. The authors therefore recommend choosing the group count carefully for the target platform and task; simply using a large number of groups is unwise, because the accuracy gained from the extra channels it allows is easily outweighed by the rapidly rising memory access cost.

G3) Network fragmentation reduces degree of parallelism.

Fragmentation (decomposing one large convolution into many small ones) reduces the degree of parallelism of the network. In experiments with fixed FLOPs, the authors compared sequential and parallel branch structures and found that the heavily fragmented, seemingly more parallel structures actually run slower; the results are shown in the table below:

G4) Element-wise operations are non-negligible.

Element-wise operations, that is, operations such as ReLU, tensor addition, and bias addition that act on individual matrix elements, must not be ignored: they have small FLOPs but relatively heavy MAC. To verify this, the authors modified the bottleneck block and benchmarked it with and without the ReLU and short-cut operations. As the table below shows, removing both ReLU and the short-cut yields roughly a 20% speedup on both GPU and ARM.

Conclusion and Discussions:

1) use "balanced" convolutions (equal channel width);

2) Be aware of the cost of using grouped convolutions;

3) Reduce the degree of fragmentation;

4) Reduce element operations.

7.2 ShuffleNet Unit

As shown in the figure below, (a) and (b) are the ShuffleNet v1 units, while (c) and (d) are the improved v2 units. To keep the number of input and output channels equal for most convolutions without resorting to group convolution, the authors propose the channel split operation.

Channel split divides the feature map into two groups, mimicking the grouping of a group convolution while letting the following 1×1 convolutions remain ordinary convolutions; this avoids the large group counts of group convolution and so conforms to G2. After the split, one group passes through the short-cut path while the other passes through the bottleneck branch. Since the split has already halved the channel count, the bottleneck's 1×1 convolution does not need to reduce dimensions further, and its input and output channel counts can stay equal, conforming to G1. Finally, because the two paths are merged with a concat rather than a tensor add, no element-wise addition is used, conforming to G4.
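A sketch of the stride-1 v2 unit in (c), assuming PyTorch (class name ours): channel split, a branch of ordinary 1×1 convolutions around a 3×3 depthwise convolution, a concat instead of an add, then a shuffle with two groups:

```python
# Stride-1 ShuffleNet v2 unit with channel split, a sketch in PyTorch.
import torch
import torch.nn as nn

class ShuffleV2Unit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False),            # ordinary 1x1
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half,
                      bias=False),                           # 3x3 depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False),            # ordinary 1x1
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        shortcut, main = x.chunk(2, dim=1)                   # channel split
        out = torch.cat([shortcut, self.branch(main)], dim=1)  # concat (G4)
        b, c, h, w = out.shape                               # shuffle, g=2
        out = out.view(b, 2, c // 2, h, w).transpose(1, 2)
        return out.contiguous().view(b, c, h, w)

y = ShuffleV2Unit(116)(torch.randn(1, 116, 28, 28))
print(y.shape)                                # torch.Size([1, 116, 28, 28])
```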

7.3 Network structure

7.4 Paper

Paper: "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"

https://arxiv.org/pdf/1807.11164.pdf

Contributions: proposed four practical guidelines for designing lightweight, fast models; optimized the ShuffleNet network structure;

Defect: Significant loss of features;

7.5 Reference blog post

1. Lightweight neural network - shuffleNet2_Bald Xiaosu's Blog-CSDN Blog
