MobileNet v1 v2 v3 network detailed explanation notes

Learning videos from:

"7.1 MobileNet network detailed explanation" (bilibili)

When deploying deep convolutional networks on mobile devices, whatever the vision task, the backbone has to deliver high accuracy with low computation and few parameters. Lightweight networks are therefore the focus of mobile-side research.

MobileNet_v1

Traditional convolutional neural networks require large amounts of memory and computation, which makes them impractical to run on mobile and embedded devices.

MobileNet was proposed by the Google team in 2017 as a lightweight CNN aimed at mobile and embedded devices. Compared with traditional convolutional networks, it greatly reduces the number of parameters and the amount of computation at the cost of a small drop in accuracy. (Compared to VGG16, accuracy drops by 0.9%, while the model has only about 1/32 of the parameters.)

paper:

 Highlights of the network:

  • Depthwise Convolution (greatly reduces the amount of operations and number of parameters) 
  • Two hyperparameters are added: α (width multiplier) and β (resolution multiplier)

1. Convolution

Traditional convolution:

  • Convolution kernel channel = input feature matrix channel
  • Output feature matrix channel = number of convolution kernels

DW convolution—Depthwise Conv:

  •  Convolution kernel channel=1
  •  Input feature matrix channel = number of convolution kernels = output feature matrix channel

Depthwise Separable Conv:  composed of DW convolution and PW convolution

Pointwise Conv:

  • It is an ordinary convolution, just with a 1 * 1 kernel
  • Convolution kernel channel = input feature matrix channel
  • Output feature matrix channel = number of convolution kernels

 Understanding: conventional (standard) convolution applies each convolution kernel across all channels of the input and sums the results into a single output channel; a DW kernel, by contrast, acts on exactly one channel.
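To make the contrast concrete, here is a minimal sketch of a depthwise separable convolution, a DW convolution followed by a PW convolution (PyTorch is an assumption of mine; these notes do not prescribe a framework, and BN/ReLU layers are omitted for brevity):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DW conv (one single-channel kernel per input channel) + 1 * 1 PW conv."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels -> each kernel sees exactly one input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # 1 * 1 pointwise conv mixes channels and sets the output depth
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)            # N, C, H, W
y = DepthwiseSeparableConv(32, 64)(x)     # -> shape (1, 64, 56, 56)
```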

2. How much computation does Depthwise Separable Conv save compared with ordinary convolution?

As shown in the figure below, both ordinary convolution and depthwise separable convolution ultimately produce a feature matrix with a depth of 4.

 

$D_F$: the height and width of the input feature matrix

$D_K$: the size of the convolution kernel

$M$: the depth of the input feature matrix

$N$: the depth of the output feature matrix, i.e. the number of convolution kernels

Ordinary convolution: $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$

To read this formula: the first four factors ($D_K \cdot D_K \cdot M \cdot N$) count the kernel parameters, and the two $D_F$ factors count how many spatial positions each parameter is applied to; multiplying them gives the amount of computation.

DW + PW: $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$

Theoretically, the calculation amount of ordinary convolution is 8 to 9 times that of DW + PW. 
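Dividing the two expressions gives the savings directly (a one-line derivation using the symbols defined above):

```latex
\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}
     {D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}
  = \frac{1}{N} + \frac{1}{D_K^{2}}
```

For a 3 * 3 kernel ($D_K = 3$) and a reasonably large $N$, the ratio is close to $1/9$, which is where the "8 to 9 times" figure comes from.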

3. MobileNet

Similar to the VGG network, MobileNet v1 is simply a stack of (depthwise separable) convolution layers.

 Multiply–Add operations (MAdds) are used to measure the computation amount.

 Hyperparameters: $\alpha$ (Width Multiplier) and $\beta$ (Resolution Multiplier)

 A small decrease in accuracy buys a significant reduction in the amount of computation.
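Applying both multipliers, the cost of one depthwise separable layer becomes (the original paper writes the resolution multiplier as ρ; these notes call it β):

```latex
D_K \cdot D_K \cdot \alpha M \cdot \beta D_F \cdot \beta D_F
  \;+\; \alpha M \cdot \alpha N \cdot \beta D_F \cdot \beta D_F
```

α thins the channels of every layer and β shrinks the input resolution, so computation falls roughly quadratically in each.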

Disadvantage: the depthwise convolution kernels are easily wasted during training, i.e. many of the depthwise kernel parameters end up at or near zero.

MobileNet_v2

The MobileNet v2 network was proposed by the Google team in 2018. Compared with MobileNet v1, it is both more accurate and smaller.

paper:

Highlights of the network:

  • Inverted Residuals (inverted residual structure)
  • Linear Bottlenecks

1. Inverted residual structure

 1. The left side is the residual structure used in the ResNet network: a 1 * 1 convolution first compresses the input feature matrix (reduces its channels), a 3 * 3 convolution then processes the compressed features, and a final 1 * 1 convolution expands the channels again, forming a bottleneck that is wide at both ends and narrow in the middle.

        The activation function used is ReLU.

2. The right side is the inverted residual structure of MobileNet v2: a 1 * 1 convolution first raises the dimension (makes the channels deeper), a 3 * 3 DW convolution then processes the features, and a final 1 * 1 convolution reduces the dimension again, i.e. narrow at both ends and wide in the middle, the opposite of the ResNet bottleneck.

        The activation function used is ReLU6.

ReLU6

 The figure below shows the ReLU function: inputs below 0 are set to zero, and inputs above 0 are passed through unchanged.

The figure below shows the ReLU6 activation function: inputs below 0 are set to zero, inputs in the range 0–6 are passed through unchanged, and any input greater than 6 is clipped to 6.
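In formula form:

```latex
\mathrm{ReLU}(x) = \max(x, 0), \qquad \mathrm{ReLU6}(x) = \min\bigl(\max(x, 0),\, 6\bigr)
```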

2. Linear Bottlenecks

In the original paper, the last 1 * 1 convolution layer of the inverted residual structure uses a linear activation function instead of the ReLU activation function mentioned earlier. Why? The paper motivates this with an experiment:

 First, the input is a two-dimensional feature matrix (one channel);

        A matrix T is used to transform it into a higher-dimensional space, for several different output dimensions;

        A ReLU activation function is applied to obtain an output;

        The inverse $T^{-1}$ of the matrix T is then used to map the output back to a two-dimensional feature matrix;

        When the output dimension of T is 2 or 3 (Figures 2 and 3 above), a lot of information has been lost;

        As the output dimension of T grows, less and less information is lost.

Conclusion: the ReLU activation function causes a large loss of information for low-dimensional features but very little loss for high-dimensional features. Because the inverted residual structure is narrow at both ends and wide in the middle, its final output is low-dimensional, so a linear activation function is used there instead of ReLU to avoid information loss.
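A minimal sketch of that experiment, assuming a random projection matrix T and the Moore–Penrose pseudo-inverse for $T^{-1}$ (the exact matrices used in the paper are not given here):

```python
import numpy as np

rng = np.random.default_rng(0)
# points on a 2-D spiral: a low-dimensional manifold to be embedded
t = np.linspace(0, 4 * np.pi, 200)
x = np.stack([t * np.cos(t), t * np.sin(t)])              # shape (2, 200)

for dim in (2, 3, 15, 30):
    T = rng.standard_normal((dim, 2))                     # embed 2-D -> dim-D
    recovered = np.linalg.pinv(T) @ np.maximum(T @ x, 0)  # ReLU, then T^-1
    err = np.linalg.norm(recovered - x) / np.linalg.norm(x)
    print(f"dim={dim:3d}  relative reconstruction error = {err:.3f}")
```

The reconstruction error is typically large for small dimensions and shrinks as the dimension grows, in line with the conclusion above.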

3. Structure diagram in the paper

 The description below corresponds to the structure information in the figure below.

First layer: h, w and k are the height, width and depth of the input; a 1 * 1 convolution raises the dimension, with t the expansion factor, so the number of 1 * 1 kernels is tk and the output feature matrix has depth tk;

Second layer: the input is the output of the previous layer; a 3 * 3 DW convolution with stride s is applied. The output depth equals the input depth, and because of the stride the height and width become 1/s of the original;

Third layer: a 1 * 1 convolution performs the dimensionality reduction; with k' kernels, the depth of the feature matrix goes from tk to k'.

 A shortcut connection exists only when stride = 1 and the input and output feature matrices have the same shape; when this condition is not met, there is no shortcut connection.
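A PyTorch sketch of this three-layer inverted residual block, including the shortcut condition (a simplified version under the same framework assumption as the earlier sketch; BN follows every convolution and the last 1 * 1 convolution is linear):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_c, out_c, stride, expand_t):
        super().__init__()
        hidden = in_c * expand_t
        # shortcut only when stride == 1 and input/output shapes match
        self.use_shortcut = (stride == 1 and in_c == out_c)
        layers = []
        if expand_t != 1:
            # 1 * 1 conv raises the dimension to t*k channels
            layers += [nn.Conv2d(in_c, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True)]
        layers += [
            # 3 * 3 depthwise conv with stride s
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1 * 1 linear projection down to k' channels (no ReLU6 here)
            nn.Conv2d(hidden, out_c, 1, bias=False),
            nn.BatchNorm2d(out_c),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```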

Parameters of network structure: 

 t is the expansion factor;

c is the number of channels of the output feature matrix (the k' above);

n is the number of times the bottleneck (the inverted residual structure in the paper) is repeated;

s is the stride (it applies only to the first bottleneck in each sequence; the remaining repeats use stride 1).
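As an illustration of how a table row (t, c, n, s) expands into layers (reusing the InvertedResidual sketch above; only the first few rows of the v2 table are shown, and a 32-channel stem is assumed):

```python
import torch.nn as nn

# (t, c, n, s) rows, as in the first lines of the v2 table
cfg = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2)]

layers, in_c = [], 32                    # 32 channels come from the stem conv
for t, c, n, s in cfg:
    for i in range(n):
        stride = s if i == 0 else 1      # the stride s applies only to the first repeat
        layers.append(InvertedResidual(in_c, c, stride, t))
        in_c = c
features = nn.Sequential(*layers)
```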

Performance comparison:

Classification 

 Object Detection

MobileNet_v3

paper: 

Improvements:

  • Updated block (bneck)
  • Parameters searched with NAS (Neural Architecture Search)
  • Redesigned time-consuming layer structure

As can be seen from the figure below, v3 is more accurate and efficient

1. Updated block (bneck)

1. Add SE module (attention mechanism)

2. Updated activation function 

Shortcut connection is available only when stride == 1 and input_c == output_c 

 The attention mechanism of MobileNet v3 is shown below:

        Average-pool each channel of the feature matrix; the resulting one-dimensional vector has as many elements as the feature matrix has channels, and it is then passed through two fully connected layers to obtain the output vector;

        The first fully connected layer has a number of nodes equal to 1/4 of the feature matrix's channels, and the second fully connected layer has as many nodes as the feature matrix has channels;

        The output vector assigns a weight to each channel: channels judged more important receive larger weights, and less important channels receive smaller weights.

Use the diagram below to deepen your understanding 

Assume the feature matrix has 2 channels;

First, average pooling computes the mean of each channel, giving a vector with 2 elements;

Then the vector passes through two fully connected layers in sequence:

        FC1 has a number of nodes equal to 1/4 of the feature matrix's channels and uses the ReLU activation function.

        FC2 has as many nodes as the feature matrix has channels and uses the H-sig (hard-sigmoid) activation function.

The result is a vector with two elements, one weight per channel. Multiplying every element of the first channel by its weight (0.5 here) gives the new, re-weighted channel data; the second channel is handled the same way.
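A PyTorch sketch of this SE block (the reduction to 1/4 of the channels and the hard-sigmoid follow the description above; the class and layer names are my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # FC1: C -> C/4
        self.fc2 = nn.Linear(channels // reduction, channels)  # FC2: C/4 -> C

    def forward(self, x):
        # squeeze: average-pool each channel down to a single value
        w = F.adaptive_avg_pool2d(x, 1).flatten(1)      # (N, C)
        w = F.relu(self.fc1(w))
        w = F.hardsigmoid(self.fc2(w))                  # per-channel weights in [0, 1]
        # excite: rescale every channel by its weight
        return x * w.view(x.size(0), -1, 1, 1)

x = torch.randn(1, 2, 4, 4)              # toy input with 2 channels, as in the example
y = SqueezeExcite(2, reduction=2)(x)     # reduction=2 so FC1 keeps at least 1 node
```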

Note: The last 1*1 dimensionality reduction convolution layer does not use an activation function.

2. Redesign the time-consuming layer structure

1. Reduce the number of convolution kernels in the first convolution layer (32->16)

2. Streamline Last Stage

The original article said that using fewer convolution kernels can achieve the same accuracy. 

Compared with the original Last Stage above, the Efficient Last Stage clearly has far fewer layers. After the adjustment there is no change in accuracy, but about 7 ms of inference time is saved.

3. Redesign the activation function 

Using swish does improve the accuracy of the network, but it is costly to compute and differentiate and unfriendly to quantization, especially on mobile devices. The authors therefore propose the h-swish activation function; to understand it, first look at the h-sigmoid activation function.

As shown in the figure, the h-sigmoid and sigmoid functions are quite close, so h-sigmoid is used instead of sigmoid in many situations.

Substituting h-sigmoid for the sigmoid $\sigma(x)$ in swish gives the h-swish activation function.

As shown in the figure, the swish and h-swish curves are very similar; the replacement helps speed and is also very friendly to the quantization process.
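Written out, the functions involved are:

```latex
\mathrm{swish}(x) = x \cdot \sigma(x), \qquad
\text{h-sigmoid}(x) = \frac{\mathrm{ReLU6}(x + 3)}{6}, \qquad
\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}
```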

 4. Network structure

 out: the number of channels of the output feature matrix (the channels after the dimensionality-reduction convolution);

 NL: the non-linearity (activation function), of which there are two types (RE = ReLU, HS = h-swish);

 s: the stride;

exp size: the expansion size, i.e. the number of channels the 1 * 1 convolution raises the dimension to;

SE: whether the attention mechanism is used;

NBN: indicates that batch normalization (BN) is not used.

Note: in the first bneck the expansion size equals the channel count of the input feature matrix, so there is no dimensionality increase; the first 1 * 1 convolution layer is therefore omitted and the DW operation is applied directly to the feature matrix.
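Putting the pieces together, a hedged sketch of a v3 bneck built from the table columns (reusing the SqueezeExcite sketch above; the official implementation may order or fuse these layers slightly differently):

```python
import torch.nn as nn

class BNeck(nn.Module):
    """Sketch of a v3 bneck: expand -> DW -> (SE) -> linear projection."""
    def __init__(self, in_c, exp_size, out_c, kernel, stride, use_se, use_hs):
        super().__init__()
        act = nn.Hardswish if use_hs else nn.ReLU       # NL column: HS or RE
        self.use_shortcut = (stride == 1 and in_c == out_c)
        layers = []
        if exp_size != in_c:                            # first bneck: no 1 * 1 expansion
            layers += [nn.Conv2d(in_c, exp_size, 1, bias=False),
                       nn.BatchNorm2d(exp_size), act(inplace=True)]
        layers += [nn.Conv2d(exp_size, exp_size, kernel, stride,
                             kernel // 2, groups=exp_size, bias=False),
                   nn.BatchNorm2d(exp_size), act(inplace=True)]
        if use_se:
            layers.append(SqueezeExcite(exp_size))      # SE column
        layers += [nn.Conv2d(exp_size, out_c, 1, bias=False),
                   nn.BatchNorm2d(out_c)]               # linear projection, no activation
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```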


Origin blog.csdn.net/weixin_45897172/article/details/128482645