A clear and concise explanation of MobileNet

  • Traditional convolutional neural networks have large memory requirements and a large amount of computation, which makes them impossible to run on mobile and embedded devices. For example, the weight file of the VGG-16 model is about 490 MB and that of the ResNet-152 model is about 644 MB, so models with this many parameters basically cannot run on mobile or embedded devices. Deep learning is ultimately meant to serve society rather than simply being played with in the laboratory, and the MobileNet network makes it possible to run deep learning networks on mobile devices.
  • The MobileNet network was proposed by the Google team in 2017. It is a lightweight network focused on mobile and embedded devices. Compared with traditional convolutional neural networks, it greatly reduces the number of model parameters and the amount of computation at the cost of only a small drop in accuracy. For example, on the ImageNet dataset the classification accuracy of MobileNet-v1 is only 0.9% lower than that of VGG-16, but its model parameters are only 1/32 of VGG's.
  • Highlights of the network: 1> The biggest highlight of the paper: Depthwise Convolution (DW convolution, which greatly reduces the amount of computation and the number of parameters)
    2> Two hyperparameters are added: α (a hyperparameter that controls the number of convolution kernels in each convolutional layer) and β (a hyperparameter that controls the size of the input image)

Depthwise Convolution (DW convolution)

The picture on the left shows traditional convolution, which can be summarized in two points:

  • The depth of any convolution kernel = the depth of the input feature matrix
  • Output feature matrix depth = number of convolution kernels

The picture on the right shows DW convolution. The depth of each convolution kernel is 1, and each kernel is responsible for convolving only one channel of the input feature matrix, producing one channel of the output feature matrix. Since each kernel handles exactly one channel, the number of kernels must equal the depth of the input feature matrix, and the depth of the output feature matrix therefore equals the depth of the input feature matrix. In other words, DW convolution does not change the depth of the feature matrix.
[Figure: standard convolution (left) vs. DW convolution (right)]
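As a concrete sketch (my own illustration, not from the original post), DW convolution can be written in PyTorch by setting groups equal to the number of input channels, so each kernel of depth 1 sees exactly one channel; the channel counts below are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 56, 56)  # arbitrary example: 32-channel input feature matrix

# Traditional convolution: each of the 64 kernels has depth 32 (the input depth),
# and the output depth equals the number of kernels (64).
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1)

# DW convolution: groups=32 gives 32 kernels of depth 1, one per input channel,
# so the output depth stays equal to the input depth (32).
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)

print(standard(x).shape)   # torch.Size([1, 64, 56, 56])
print(depthwise(x).shape)  # torch.Size([1, 32, 56, 56])
```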

Depthwise Separable Convolution

[Figure: depthwise separable convolution = DW convolution + PW convolution]
The depthwise separable convolution consists of two parts: DW convolution + PW convolution (Pointwise Convolution). PW convolution is just an ordinary convolution whose kernel size is 1 × 1, and DW and PW are usually used together. In PW convolution, the depth of each convolution kernel equals the depth of the input feature matrix, and the depth of the output feature matrix equals the number of convolution kernels. How many parameters and how much computation can depthwise separable convolution save compared with ordinary convolution? Let's make a comparison:
[Figure: computation of standard convolution vs. DW + PW convolution; MobileNet-v1 network structure]
The upper part of the figure is the ordinary convolution operation, and the lower part is the DW + PW operation (as used in the MobileNet-v1 network structure). Taking feature matrices of the same depth as an example: DF is the height and width of the input feature matrix, DK is the size of the convolution kernel, M is the depth of the input feature matrix, and N is the depth of the output feature matrix, i.e. the number of convolution kernels. Since DK equals 3 by default in MobileNet, the calculation formula shows that, in theory, ordinary convolution requires 8 to 9 times as much computation as DW + PW convolution. The figure also shows the model structure of the MobileNet-v1 network: conv/s2 means an ordinary convolution with stride 2; in 3 × 3 × 3 × 32, the third 3 means the input is a three-channel image and 32 means there are 32 convolution kernels; conv dw means a DW convolution operation, and since the depth of a DW convolution kernel is 1, that 1 is omitted from the notation.
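Written out explicitly, the multiply-add counts behind that comparison (as given in the MobileNet-v1 paper, using the symbols defined above) are:

```latex
% standard convolution vs. DW + PW
\underbrace{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}_{\text{standard convolution}}
\quad\text{vs.}\quad
\underbrace{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F}_{\text{DW}}
+ \underbrace{M \cdot N \cdot D_F \cdot D_F}_{\text{PW}}

% ratio of DW + PW to standard convolution
\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}
     {D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}
  = \frac{1}{N} + \frac{1}{D_K^{2}}
```

With DK = 3 and any reasonably large N, the ratio 1/N + 1/DK² is close to 1/9, which is where the 8-to-9-times saving mentioned above comes from.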
[Tables 8, 6 and 7 from the MobileNet-v1 paper]
Table 8 compares the accuracy, amount of computation, and number of model parameters of the MobileNet, GoogLeNet, and VGG networks on ImageNet. Compared with VGG, the accuracy of MobileNet is only 0.9% lower, but its model parameters are only 1/32 of VGG's.
Table 6 shows the effect of different values of α, which controls the number of convolution kernels. When α = 1.0, the accuracy is 70.6%. When α = 0.75, the number of convolution kernels is reduced to 0.75 times the original and the accuracy drops to 68.4%, with correspondingly smaller computation and model parameters. When α = 0.5, the number of convolution kernels is reduced to half of the original, with the accuracy shown in the table. An appropriate value of α can therefore be chosen according to project requirements.
Table 7 compares the classification accuracy and the amount of computation of the network for different input image sizes; β is the resolution hyperparameter. For a 224 × 224 RGB image the accuracy is 70.6% and the computation is 569 million multiply-adds, and so on for the other sizes. By appropriately reducing the size of the input image, the amount of computation can be greatly reduced with only a small loss in accuracy.
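As a rough illustration of how the two multipliers act on the cost formula above (my own sketch with arbitrary channel counts; note that the original paper names the resolution multiplier ρ rather than β):

```python
def dw_pw_cost(dk, m, n, df, alpha=1.0, beta=1.0):
    """Multiply-adds of one DW+PW layer with width multiplier alpha
    and resolution multiplier beta (called rho in the paper)."""
    m, n = int(alpha * m), int(alpha * n)           # alpha thins the channels of every layer
    df = int(beta * df)                             # beta shrinks the input resolution
    return dk * dk * m * df * df + m * n * df * df  # DW cost + PW cost

base = dw_pw_cost(3, 32, 64, 112)
print(dw_pw_cost(3, 32, 64, 112, alpha=0.75) / base)  # ~0.59 (the PW term scales with alpha^2)
print(dw_pw_cost(3, 32, 64, 112, beta=0.5) / base)    # 0.25 = beta^2 exactly
```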

In practice with the MobileNet-v1 network, it was found that after training some of the DW convolution kernels are effectively wasted: when observing the trained DW parameters, most of them turn out to be equal to 0, meaning those DW convolution kernels contribute nothing. MobileNet-v2 makes improvements in response to this problem.

  • The MobileNet-v2 network was proposed by the Google team in 2018, only one year after MobileNet-v1, with higher accuracy and a smaller model.
  • Two highlights of the network: Inverted Residuals (inverted residual structure) and Linear Bottlenecks

Let's first review the residual structure provided by the ResNet network: first, a 1 × 1 convolution kernel is applied to the input feature matrix to compress it and reduce the number of channels; then a 3 × 3 convolution is performed; finally, a 1 × 1 convolution kernel is used to expand the channels again. This forms a bottleneck structure that is wide at the two ends and narrow in the middle, and the activation function used is ReLU.
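A minimal PyTorch sketch of that ordinary bottleneck, for contrast with the inverted residual below (shortcut addition omitted; channel counts are arbitrary illustrative values):

```python
import torch.nn as nn

# ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, with ReLU activations.
# Example channel flow: 256 -> 64 -> 64 -> 256 (wide ends, narrow middle).
resnet_bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),            # compress channels
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),   # 3x3 convolution
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),             # expand channels back
    nn.BatchNorm2d(256),
)
```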
The structure used in MobileNet-v2 is called the inverted residual structure: first a 1 × 1 convolution kernel is used to increase the dimension and make the channels deeper, then a 3 × 3 DW convolution is applied, and finally a 1 × 1 convolution is used for dimensionality reduction. Its activation function is ReLU6.
[Figure: residual block vs. inverted residual block]
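For reference, ReLU6 is simply the ordinary ReLU clipped at 6 (a standard definition, not specific to this post):

```latex
\mathrm{ReLU6}(x) = \min\bigl(\max(x, 0),\, 6\bigr)
```

Clipping the activations to a small range is commonly motivated by making the network friendlier to low-precision arithmetic on mobile devices.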

For the last 1 × 1 convolutional layer of the inverted residual structure, a linear activation function is used instead of the ReLU activation function. Why is this done?
In the original paper, the author did the following experiment: take a two-dimensional input matrix, transform it to a higher dimension with a matrix T, apply ReLU to get the output, and then map the output back to two dimensions with the inverse (pseudo-inverse) of T. When the output dimension of T is 2 or 3, the figure in the paper shows that a lot of information is lost after this round trip, but as the dimension of T increases, less and less information is lost. In other words, the ReLU activation function causes a relatively large loss of information for low-dimensional features, but only a small loss for high-dimensional features. Since the inverted residual structure is narrow at both ends and wide in the middle, its output is a low-dimensional feature vector, so a linear activation function replaces ReLU there to avoid losing information.
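A minimal numpy reconstruction of that experiment (my own sketch with random data and a random T, not the paper's exact setup): embed 2-D points into n dimensions, apply ReLU, map back with the pseudo-inverse of T, and watch the reconstruction error fall as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))          # 2-D input points (one point per column)

for n in [2, 3, 5, 15, 30]:
    T = rng.standard_normal((n, 2))         # random embedding into n dimensions
    y = np.maximum(T @ x, 0)                # ReLU in the n-dimensional space
    x_rec = np.linalg.pinv(T) @ y           # map back to 2-D with the pseudo-inverse of T
    err = np.mean((x - x_rec) ** 2)
    print(f"n={n:3d}  reconstruction MSE={err:.4f}")
```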
MobileNet-v2
The left side of the figure below is the diagram of the inverted residual structure given in the original MobileNet-v2 paper; it corresponds to the per-layer information in the table, where t is the expansion factor. One point needs to be emphasized: not every inverted residual structure in the MobileNet-v2 network has a shortcut branch. Only when stride = 1 and the input feature matrix and the output feature matrix have the same shape is there a shortcut connection; if this condition is not satisfied, there is no shortcut.
[Figure: inverted residual block structure (left) and per-layer configuration table (right)]
Note: t = 1 in the first bottleneck, which means that the depth of the input feature matrix is not expanded in that first layer (see the sketch below).
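Putting the pieces together, here is a minimal PyTorch sketch of one inverted residual block as described above (expansion factor t, 3 × 3 DW convolution, 1 × 1 linear projection, and a shortcut only when stride = 1 and the input and output shapes match); the class name and details are my own, not the official implementation:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_c, out_c, stride, t):
        super().__init__()
        hidden = in_c * t
        # Shortcut only when stride == 1 and input/output shapes match
        self.use_shortcut = (stride == 1 and in_c == out_c)

        layers = []
        if t != 1:  # the first bottleneck uses t = 1, so it has no 1x1 expansion
            layers += [nn.Conv2d(in_c, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True)]
        layers += [
            # 3x3 DW convolution (groups = channels)
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 linear projection: no activation after this layer
            nn.Conv2d(hidden, out_c, 1, bias=False),
            nn.BatchNorm2d(out_c),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24, 24, stride=1, t=6)(x).shape)  # shortcut used
print(InvertedResidual(24, 32, stride=2, t=6)(x).shape)  # no shortcut
```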

Performance comparison

[Table: classification and detection performance comparison from the MobileNet-v2 paper]
For classification tasks, the accuracy of MobileNet-v2 is 72%, while that of MobileNet-v1 is 70.6%, so the accuracy clearly improves, and the model parameters, amount of computation, and inference time are all better than MobileNet-v1's. Tested on a specific mobile phone, it takes only 75 ms to predict one image, which is basically real-time on a mobile device. When α = 1.4, the accuracy reaches 74.7%.
For object detection, MobileNet-v2 is combined with SSD: MobileNet-v2 is placed at the front of the network as the backbone, and the convolution layers of the SSD part are also replaced with depthwise separable convolutions (DW + PW). Compared with SSD300, SSD512, and YOLOv2, its accuracy is 22.1 mAP while its execution time and amount of computation are the best.


Origin: blog.csdn.net/qq_42308217/article/details/110585357