Another new member of the MobileNet series: MobileNetV3

Original link: https://blog.csdn.net/Chunfengyanyulove/article/details/91358187

Copyright statement: This article is the blogger's original work, released under the CC 4.0 BY-SA license. Please attach the original source link and this statement when reprinting.

Paper: Searching for MobileNetV3

Authors: Google

Paper link: https://arxiv.org/abs/1905.02244


Brief overview of the paper

MobileNetV3 is Google's follow-up to MobileNetV2, and as a new member of the MobileNet family it naturally improves on its predecessors. It comes in two variants, MobileNetV3-Large and MobileNetV3-Small, targeting different resource budgets. According to the paper, MobileNetV3-Large is about 3.2% more accurate than MobileNetV2 on ImageNet classification while reducing latency by about 20%, and MobileNetV3-Small is about 6.6% more accurate than a MobileNetV2 model of comparable latency. On COCO detection, MobileNetV3-Large reaches roughly the same accuracy as MobileNetV2 while being more than 25% faster, and it also brings some improvement for segmentation. Another highlight is that the architecture was designed with NAS (network architecture search) combined with the NetAdapt algorithm. The paper additionally introduces several tricks that noticeably improve both accuracy and speed.

Introduction

In recent years, with the growth of mobile and embedded applications, lightweight networks have become a research hotspot; after all, not every device has a GPU for computation. As the name suggests, lightweight networks have fewer parameters and run faster. Commonly used ways to reduce a network's computation include:

  • Lightweight architecture design: the MobileNet series, the ShuffleNet series, Xception, and so on use techniques such as group convolution and 1x1 convolution to reduce computation while preserving accuracy as much as possible.
  • Model pruning: large networks often contain a degree of redundancy; trimming the redundant parts reduces computation.
  • Quantization: representing weights and activations in lower precision (for example with TensorRT) typically yields a severalfold speedup on GPU.
  • Knowledge distillation: a large teacher model guides a small student model during training, improving the student's accuracy.

The MobileNet series is, of course, the classic example of the first approach. Before introducing MobileNetV3, let's review the innovations of MobileNetV1 and V2:

MobileNetV1

  • Uses group convolution to cut computation, pushed to the extreme: the number of groups equals the number of channels (depthwise convolution), which minimizes computation but leaves no interaction between channels. The authors therefore add a point-wise convolution, a 1x1 convolution that fuses information across channels.
  • A plain straight-through structure, without residual connections.
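To make the savings concrete, here is a rough multiply-add count for a standard convolution versus MobileNet's depthwise-plus-pointwise pair. The shapes below are illustrative, not a specific layer from the paper:

```python
# Multiply-add cost of a standard conv vs. a depthwise separable conv
# (depthwise + 1x1 pointwise), as used in MobileNetV1.

def standard_conv_madds(h, w, k, c_in, c_out):
    # every output pixel mixes all input channels over a k x k window
    return h * w * k * k * c_in * c_out

def separable_conv_madds(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in   # one k x k filter per channel
    pointwise = h * w * c_in * c_out   # 1x1 conv fuses the channels
    return depthwise + pointwise

h = w = 112
k, c_in, c_out = 3, 64, 128
std = standard_conv_madds(h, w, k, c_in, c_out)
sep = separable_conv_madds(h, w, k, c_in, c_out)
print(std / sep)  # close to k*k = 9x fewer multiply-adds
```

For 3x3 kernels the separable form is roughly 8-9x cheaper, which is where MobileNetV1's speedup comes from.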

MobileNetV2:

  • Introduces the bottleneck structure.
  • Inverts the bottleneck into a "spindle" shape: where ResNet first reduces channels to 1/4 and then restores them, MobileNetV2 first expands channels to 6x the input and then projects them back down.
  • Removes the last ReLU of the residual block (the linear bottleneck).
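The channel flow of the two designs can be traced with plain numbers (256 and 24 input channels below are just examples, not layers from either paper):

```python
def resnet_bottleneck_channels(c):
    # ResNet bottleneck: compress to 1/4, convolve, restore
    return [c, c // 4, c // 4, c]

def inverted_residual_channels(c, expansion=6):
    # MobileNetV2 inverted residual: expand by `expansion`, depthwise
    # convolve, then project back down. The final projection has NO
    # ReLU -- the "linear bottleneck".
    return [c, c * expansion, c * expansion, c]

print(resnet_bottleneck_channels(256))   # narrow in the middle
print(inverted_residual_channels(24))    # wide in the middle
```

Hence the "spindle" description: the residual path is wide in the middle instead of narrow.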

So what new tricks does MobileNetV3 introduce? Without further ado, here are the highlights:

  • 1. Introducing the SE structure
    An SE (Squeeze-and-Excitation) module is added to the bottleneck structure and placed after the depthwise filter, as shown in the figure below. Because the SE module costs some time, in blocks that contain SE the authors fix the SE bottleneck's channel count to 1/4 of the expansion layer's channels; they found this improves accuracy without adding noticeable latency.

[Figure: the bottleneck block with the SE module inserted after the depthwise convolution]
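As a sketch of the mechanism, the SE block can be written in a few lines of NumPy. The random weights below stand in for trained ones, and the hidden width follows the 1/4 reduction mentioned above; everything else is an assumption for illustration:

```python
import numpy as np

def hard_sigmoid(x):
    # the cheap gate MobileNetV3 uses in place of sigmoid
    return np.clip(x + 3.0, 0.0, 6.0) / 6.0

def se_block(feature_map, w1, b1, w2, b2):
    """Squeeze-and-Excitation on a (C, H, W) feature map."""
    squeezed = feature_map.mean(axis=(1, 2))      # squeeze: global avg pool -> (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)  # FC + ReLU, C -> C//4
    gate = hard_sigmoid(w2 @ hidden + b2)         # FC + h-sigmoid, C//4 -> C
    return feature_map * gate[:, None, None]      # excite: rescale each channel

rng = np.random.default_rng(0)
c, h, w = 8, 4, 4
x = rng.standard_normal((c, h, w))
w1 = rng.standard_normal((c // 4, c)); b1 = np.zeros(c // 4)
w2 = rng.standard_normal((c, c // 4)); b2 = np.zeros(c)
out = se_block(x, w1, b1, w2, b2)
print(out.shape)  # same (C, H, W) shape; each channel scaled by a [0, 1] gate
```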

  • 2. Modifying the tail structure:
    In MobileNetV2 there is a 1x1 convolution before the average pooling, whose purpose is to raise the feature-map dimensionality; this helps prediction but costs computation. The authors therefore move it behind the average pooling: first avg pooling shrinks the feature map from 7x7 to 1x1, and only then does the 1x1 convolution raise the dimensionality, cutting that layer's computation by a factor of 7x7 = 49. To save further, they also remove the 3x3 depthwise and 1x1 projection convolutions of the preceding spindle-shaped bottleneck, arriving at the structure shown in the second row of the figure below. Removing them loses no accuracy, and the paper reports the efficient last stage saves about 7 ms, roughly 11% of the runtime.

[Figure: the original MobileNetV2-style tail vs. the modified MobileNetV3 tail]
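The 49x saving can be checked with a one-line multiply-add count. The 960 -> 1280 channel counts below are taken as MobileNetV3-Large's last stage; treat them as illustrative:

```python
# Multiply-adds of the final 1x1 dimensionality-raising conv,
# before vs. after moving it behind the 7x7 average pool.

def conv1x1_madds(h, w, c_in, c_out):
    return h * w * c_in * c_out

before = conv1x1_madds(7, 7, 960, 1280)  # applied on the 7x7 feature map
after = conv1x1_madds(1, 1, 960, 1280)   # applied after pooling to 1x1
print(before // after)  # 49: the 7x7 = 49x reduction described above
```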

  • 3. Modifying the number of channels
    The channel count of the head convolution is reduced. MobileNetV2 uses 32 filters of 3x3 in the first layer; the authors found that 32 can be trimmed, so they changed it to 16, which preserves accuracy while saving about 2 ms. Here is a structural comparison between MobileNetV2 and MobileNetV3:

[Figures: layer-by-layer specifications of MobileNetV2 and MobileNetV3]

  • 4. Changing the non-linearity
    h-swish replaces swish. Swish is Google's own earlier result (swish(x) = x * sigmoid(x)), of which they are rather proud; this time it is optimized for speed. Because computing the sigmoid is expensive, especially on mobile devices, the authors approximate it with ReLU6(x+3)/6, giving h-swish(x) = x * ReLU6(x+3)/6. As the figures below show, the two curves differ very little. The ReLU6-based form has several advantages: 1. it can be computed on virtually any hardware or software platform; 2. it removes a potential loss of precision under quantization, improving efficiency by about 15% in quantized mode. Moreover, the benefit of h-swish is more pronounced in the deeper layers of the network.
    [Figures: the swish and h-swish formulas, and a comparison of their activation curves]
    The following figure shows the impact of h-swish on latency and accuracy: using h-swish@16 raises accuracy by about 0.2% but increases latency by about 20%.
    [Figure: accuracy and latency impact of h-swish]
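A minimal NumPy sketch of the two activations makes it easy to check how close the piecewise approximation stays to the original:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def h_swish(x):
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0  # x * ReLU6(x+3)/6

x = np.linspace(-6.0, 6.0, 1001)
gap = np.max(np.abs(swish(x) - h_swish(x)))
print(gap)  # the worst-case gap is small, peaking near |x| = 3
```

The approximation is exact for x <= -3 and close elsewhere, and unlike sigmoid it needs no exponential, which is why it quantizes and deploys so well.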

You now know the structure of MobileNetV3, but the real focus of the paper is how this network was designed, that is, how the structure was searched. It deserves a mention, although I did not study it in depth; readers who want more detail can consult the paper.

The overall process is simple: first, the NAS algorithm optimizes each block to obtain a rough network structure, and then the NetAdapt algorithm determines the number of channels of each filter.

Because accuracy and latency behave quite differently at small model sizes, MobileNetV3-Large and MobileNetV3-Small were each searched with NAS separately.

After NAS, the NetAdapt algorithm tunes each layer. The process is as follows:

  • First use NAS to find a usable structure A.

    1. Generate a series of candidate structures based on A, each consuming slightly less than A; this is essentially an enumeration of sub-structures.
    2. For each candidate, initialize from the previous model (randomly initializing only the parameters the previous model does not have), fine-tune for T epochs, and obtain a rough accuracy.
    3. Among these candidates, pick the best one.
  • Iterate repeatedly until the target latency is reached, yielding the most suitable result.
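The loop above can be sketched as a toy greedy search. Everything here, the proxy accuracy, the width-proportional latency, and the fixed shrink step, is a made-up stand-in for the paper's short fine-tuning runs and on-device latency measurements:

```python
# Toy NetAdapt-style loop: repeatedly generate slightly cheaper
# candidates, keep the one whose (proxy) accuracy survives best,
# and stop once the latency target is met.

def proxy_accuracy(widths):
    # hypothetical proxy: wider layers help, with diminishing returns
    return sum(w ** 0.5 for w in widths)

def latency(widths):
    # hypothetical cost model: latency proportional to total width
    return sum(widths)

def netadapt(widths, target, step=8):
    widths = list(widths)
    while latency(widths) > target:
        candidates = []
        for i in range(len(widths)):
            if widths[i] > step:          # shrink one layer a little
                cand = widths[:]
                cand[i] -= step
                candidates.append(cand)
        if not candidates:
            break
        # all candidates save the same amount, so pick the most accurate
        widths = max(candidates, key=proxy_accuracy)
    return widths

final = netadapt([64, 128, 256], target=300)
print(final, latency(final))  # total width trimmed down to the budget
```

The greedy choice here always trims the layer whose reduction hurts the proxy least, which is the spirit of NetAdapt's per-step selection.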

How are candidates generated?

  • Reduce the size of the expansion layer.
  • Reduce the bottleneck (projection) channels.

Experiments

First, classification. Google, never short of compute, is no exception this time: the authors train with 16 TPUs at a batch size of 4096, then benchmark on Google's Pixel phones.

The following figure shows the ImageNet results. V3-Large is about 3 points more accurate than V2 1.0, while its latency drops from 78 ms to 66 ms (on a Pixel 1). Comparing V3-Small with V2 0.35, accuracy improves from 60% to 67%, while latency rises slightly from 19 ms to 21 ms (Pixel 1).
[Figure: ImageNet classification accuracy and latency results]

The following figure compares latency on different Google phones after model quantization (float quantization, not int8), where P-1, P-2, and P-3 are phones of different performance levels. Taking V3-Large as the main example: after quantization its top-1 accuracy drops from 75.2% (previous figure) to 73.8%, about 1.5 points, which is within the normal range. As for speedups, P-1 gains 9 ms (from 66 ms), P-2 gains 20 ms (from 77 ms), and P-3 gains 15 ms (from 52.6 ms). Compared with the V2 network, the quantized speeds are actually quite close. (Why is the latency gap large before quantization but small after? A question worth pondering.)
[Figure: latency on P-1/P-2/P-3 phones after float quantization]

The following figure compares accuracy across different input resolutions and model widths. Resolutions are chosen from [96, 128, 160, 192, 224, 256] and width multipliers from the usual [0.35, 0.5, 0.75, 1.0, 1.25]. Reducing resolution turns out to give a better accuracy/speed trade-off than reducing width: at a comparable speed, shrinking the resolution costs less accuracy than shrinking the model width. (In many applications, though, the resolution is dictated by the scenario; detection and segmentation, for example, need larger images.)
[Figure: accuracy trade-offs across resolutions and width multipliers]

The following figure shows how a series of modifications takes MnasNet, step by step, to MobileNetV3 in accuracy and latency.
[Figure: the progression from MnasNet to MobileNetV3]
The following figure shows COCO test-set results of MobileNetV3 plugged into SSD-Lite. On V3-Large, mAP does not improve much, but latency is indeed reduced.
[Figure: COCO detection results of SSD-Lite with MobileNetV3]
The following figure shows the segmentation architecture; it is not described in detail here, and interested readers can consult the paper.
[Figure: segmentation architecture]

Summary

In summary, MobileNetV3 does not really propose any stunning new structure. Its main contribution is combining tricks such as SE and h-swish with Google's NAS and NetAdapt algorithms to search the architecture automatically, gaining some accuracy and shedding some latency. Perhaps the real point of the paper is to further demonstrate the effectiveness of automatic architecture search, which matches the title: "Searching for...". After all, that is a direction the field is moving in.

If there are shortcomings or mistakes in this article, readers are welcome to point them out.


Origin: blog.csdn.net/yyyllla/article/details/102523139