Efficient CNN design for mobile devices: manual design and NAS analysis — MobileNet V1/V2/V3, MnasNet, and EfficientNet

      A summary and analysis of efficient network model design, both manual and NAS-based

This article focuses on efficient neural network design, both manual and via neural architecture search (NAS), for mobile and resource-constrained platforms.

Efficient CNNs are no longer used only on servers, clouds, and other devices with sufficient resources; they are gradually migrating to mobile devices, robots, and similar platforms, which are characterized by limited memory, constrained compute, and sensitivity to application latency. Recent work has moved from the time- and resource-consuming design of large models to lightweight models that can actually run on mobile devices. This article briefly summarizes the design and development of efficient mobile-size models. (Updated gradually; not all models are listed at once.)

The progression is roughly: manual design (MobileNet V1 and V2); then stage-wise NAS (MnasNet); then MobileNet V3, which builds on MnasNet by first performing a stage-wise block search and then layer-wise fine-tuning. EfficientNet, like MnasNet, uses block-wise search. These papers all come from the same team at Google. Subsequently, Facebook and Berkeley also used NAS to find efficient models for mobile devices, including the layer-wise FBNet and the model-adaptive ChamNet. The main theme of this article is efficient network design, covering both manual design and platform-aware NAS.

 

  • 1. MobileNet V1, V2, V3

MobileNet V1 and V2 are manually designed.

  • MobileNet V1

MobileNet V1 was the first to use depthwise separable convolution to greatly reduce FLOPs and parameters. A depthwise separable convolution factors a standard convolution into a depthwise convolution followed by a 1×1 pointwise convolution; for a k×k kernel with N output channels, this cuts the computation to roughly 1/N + 1/k² of the standard cost. This was an innovation in 2017: while everyone else was focused on designing ever more powerful models to reach state-of-the-art results on various tasks (with no regard for parameter count, computation, or model size), MobileNet V1's smaller models pioneered network design for mobile devices and opened up a new direction.

Paper name: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Figure 1: Comparison of standard convolution and depthwise separable convolution.
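To make the factorization concrete, here is a minimal PyTorch sketch of a depthwise separable convolution. This is an illustration of the idea rather than the paper's code; the channel sizes in the example are arbitrary.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A standard k x k conv factored into depthwise conv + 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 conv mixes channels and sets the output width.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Example: 32 -> 64 channels on a 112x112 feature map.
y = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 112, 112))
print(y.shape)  # torch.Size([1, 64, 112, 112])
```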

  • MobileNet V2

MobileNet V2 mainly introduces the inverted residual block and the linear bottleneck to further reduce FLOPs and parameters. The inverted residual block first uses a 1×1 pointwise convolution to expand the input to a wider intermediate layer by an expansion factor t (set to 6 in the paper), then applies a 3×3 depthwise convolution, and finally uses a linear 1×1 pointwise convolution to compress the intermediate layer back down to the width of the original input.

Figure 2: Analysis of the calculation steps of the bottleneck residual block.

Architectures based on this module can run tasks such as object classification and object detection on memory-limited mobile phones.

Paper name: MobileNetV2: Inverted Residuals and Linear Bottlenecks

Figure 3: Comparison of residual block and inverted residual block.
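For reference, here is a minimal PyTorch sketch of the inverted residual block described above: a 1×1 expansion by factor t, a 3×3 depthwise convolution, and a linear 1×1 projection, with a residual connection when the shapes allow. A sketch of the idea, not the official implementation.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet V2 block: 1x1 expand -> 3x3 depthwise -> linear 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, t=6):
        super().__init__()
        hidden = in_ch * t  # expansion factor t (6 in the paper)
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Linear bottleneck: no activation after the projection.
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```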

 


  • MobileNet V3

MobileNet V3 first uses platform-aware NAS to perform a block-wise search: a multi-objective function of accuracy ACC(m) and latency LAT(m) is set up, and reinforcement learning searches for the global network architecture. In effect, this finds the sequence of blocks and the fixed operations within each block, much like MnasNet (the authors state that the search follows MnasNet and is then fine-tuned). A large stage, also called a block, contains 2, 3, 4 or more layers, and the operation of every layer within it is fixed; the inverted residual bottleneck block is used, as shown in Figure 4.

Then NetAdapt performs a complementary, layer-wise search: within the NAS framework, it finds the optimal number of output channels and expansion channels for each layer of every block. The basic MobileNet V2 module is also improved: a Squeeze-and-Excite module, a lightweight attention mechanism that strengthens the feature extraction of the convolution module, is added to the inverted residual block. The activation function is modified as well: MobileNet V1 and V2 simply used ReLU, while V3 adopts the hardware-friendly h-swish nonlinearity, a modification of the swish function. Finally, running time is evaluated on CPU and GPU.

The schematic diagram of the main modules and structure is as follows:

Figure 4: The basic module of MobileNet V3.
Figure 5: The widely used activation function in the article: h-swish.
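Concretely, h-swish replaces the sigmoid in swish(x) = x · sigmoid(x) with the piecewise-linear ReLU6(x + 3)/6, which is cheap to compute and quantization-friendly. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a hardware-friendly approximation
    # of swish(x) = x * sigmoid(x).
    return x * F.relu6(x + 3.0) / 6.0

print(h_swish(torch.tensor([-4.0, 0.0, 4.0])))  # tensor([-0., 0., 4.])
```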

In the end, the two versions, MobileNet V3 Large and Small, achieve good results in latency, accuracy, and parameter count on an actual Google Pixel phone. Figure 6 compares the MobileNet series: overall accuracy improves substantially while latency remains relatively low.

Figure 6: Comparison of actual running latency of the MobileNet series on a Google Pixel 1 phone.

Paper name: Searching for MobileNetV3


 

PS, digression:

Google's official blog has measured MobileNet V3's performance on Google Pixel 4 phones and described the implementation details. It gives a general overview of the MobileNet series, including why on-device efficient neural networks are designed and how the series is used in the Pixel 4 (for the camera and phone unlocking); it is worth a closer look. When the model is streamlined and adapted to the MobileNet Edge TPU, performance and efficiency improve further still. The post highlights the importance of hardware-aware model customization: every CPU or edge device has its own particularities, and getting the best performance requires re-customizing the design. For the details, see the original Google article.

Address: Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU


  • 2. MnasNet

MnasNet uses block-wise (also called stage-wise) search. It adopts platform-aware, mobile-size NAS based on reinforcement learning: automated mobile neural architecture search. The training reward directly combines accuracy with latency measured on an actual phone.

Figure 7: Flow chart of platform-aware NAS, i.e., the search process of MnasNet.
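For reference, the MnasNet paper combines the two metrics into a single soft-constraint reward, ACC(m) × [LAT(m)/T]^w, with target latency T and exponent w = -0.07. A minimal sketch (the 75 ms default below is the Pixel-phone budget used in the paper; treat the exact numbers as illustrative):

```python
def mnasnet_reward(acc, latency_ms, target_ms=75.0, w=-0.07):
    # Soft-constraint reward from the MnasNet paper:
    #   reward = ACC(m) * (LAT(m) / T) ** w, with w = -0.07.
    # Running faster than the target raises the reward slightly;
    # running slower lowers it.
    return acc * (latency_ms / target_ms) ** w

print(mnasnet_reward(0.75, 60.0))  # faster than target -> reward > 0.75
print(mnasnet_reward(0.75, 90.0))  # slower than target -> reward < 0.75
```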

For mobile-size NAS, the most important thing is the predefined search space. MnasNet proposes the Factorized Hierarchical Search Space, which decomposes the search space hierarchically: the network is divided into blocks that are connected in sequence; the operation within each block is fixed, but the layers (operations) in different blocks can differ. This reduces the complexity of the search space. The contents searched for each block are listed below (a toy sampling sketch follows Figure 8):

  1. Convolution operations: regular conv, depthwise conv, and the inverted residual bottleneck block (the MobileNet V2 module).
  2. Kernel size: 3×3 or 5×5.
  3. Squeeze-and-excitation ratio: 0 or 0.25. A ratio of 0 means no SE module, i.e., no attention mechanism; 0.25 is the standard setting, squeezing channels to 1/4 and expanding back.
  4. Skip operations: pooling, identity residual, or no skip.
  5. Output filter size: the number of output channels.
  6. Number of layers per block.

Note that the number of blocks is predefined, which fixes the basic skeleton (the paper uses 7 blocks). The layers therefore differ from block to block, but all layers within the same block perform the same operation. The overall idea is shown in Figure 8.

Figure 8: Factorized Hierarchical Search Space, search content and structure diagram.
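A toy sketch of what sampling from such a factorized hierarchical search space could look like; the choice lists below are illustrative, not the paper's exact configuration:

```python
import random

# Per-block choices from the factorized hierarchical search space (illustrative).
BLOCK_CHOICES = {
    "conv_op":     ["conv", "dwconv", "mbconv"],  # regular / depthwise / inverted bottleneck
    "kernel_size": [3, 5],
    "se_ratio":    [0.0, 0.25],                   # 0 = no squeeze-and-excitation
    "skip_op":     ["pool", "identity", "none"],
    "filters":     [16, 24, 40, 80, 112, 160, 320],
    "num_layers":  [1, 2, 3, 4],
}

def sample_architecture(num_blocks=7):
    # Each of the 7 predefined blocks samples its own settings independently;
    # all layers inside one block then share those settings.
    return [{k: random.choice(v) for k, v in BLOCK_CHOICES.items()}
            for _ in range(num_blocks)]

print(sample_architecture()[0])  # e.g. {'conv_op': 'mbconv', 'kernel_size': 5, ...}
```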

This method is better than simple cell-based search, which searches for a single cell and then stacks it repeatedly, so that every cell in the whole network is identical and there is no diversity across layers or blocks. The search results are shown in Figure 9.

Figure 9: Structure diagram of MnasNet-A1. The operations differ across blocks, but all layers within the same block share the same operation.

As Figure 9 shows, each block of MnasNet is different, including separable conv, MBConv6, MBConv3, and so on. The number of layers also varies from block to block: 1, 2, 3, 4, 2, 3, 1. MobileNet V3 builds on this result and further optimizes each layer within each block individually.

Paper name: MnasNet: Platform-Aware Neural Architecture Search for Mobile


 

  • 3. EfficientNet and others.

In addition to the series of models above, there is also EfficientNet, which searches using the basic block of MobileNet V2. Here actual latency is no longer considered; FLOPs and memory serve as the objective function instead. The search produced EfficientNet-B0, which was then scaled up with a compound method. The main study is the influence of different scaling dimensions on model performance; the finding is that jointly adjusting depth, width, and resolution yields models that meet different resource budgets.

Figure 10: The objective function of search optimization
Figure 11: Model scaling methods: width, depth, and resolution. The paper uses a compound scaling method.
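Compound scaling ties all three dimensions to one coefficient φ: depth scales as α^φ, width as β^φ, and resolution as γ^φ, under the constraint α·β²·γ² ≈ 2 so that each increment of φ roughly doubles FLOPs (the paper's grid search finds α = 1.2, β = 1.1, γ = 1.15 for the B0 baseline). A minimal sketch:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    # EfficientNet compound scaling: one coefficient phi scales all three
    # dimensions; alpha * beta**2 * gamma**2 ~= 2, so FLOPs grow ~2**phi.
    depth_mult = alpha ** phi       # number of layers
    width_mult = beta ** phi        # number of channels
    res_mult = gamma ** phi         # input resolution
    return depth_mult, width_mult, res_mult

d, w, r = compound_scale(phi=3)  # multipliers roughly at the B3 scale
print(f"depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```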

In the end, promising results were achieved. The proposed method explores not only block-wise search and cell stacking but also the joint adjustment of the three scaling dimensions of the resulting model, which further improves performance.

Figure 12: Performance comparison between scaling each dimension individually and compound scaling.

As Figure 12 shows, compound scaling is better than adjusting any one of the three dimensions alone.

Paper name: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

On EfficientNet in edge computing: Google also explained that EfficientNet was searched with CPUs in mind, so its building blocks are very efficient on CPUs; however, when a custom hardware accelerator is used, some basic operations need to change. This post describes the performance EfficientNet achieves on the Edge TPU: EfficientNet-EdgeTPU: Creating Accelerator-Optimized Neural Networks with AutoML.

There is also a large body of related work, including ShuffleNet V1 and V2, ProxylessNAS, CondenseNet, and other optimizations for resource-constrained platforms.


 

An aside, with an example of efficient implementation: I recently discovered an article implementing MobileNet V1 on an FPGA at up to 3000 FPS, the fastest implementation I have seen recently. The entire network fits in the FPGA's on-chip RAM and needs no off-chip memory. The network uses a multi-precision implementation with hardware-software co-design. The whole implementation flow is shown in the figure.

Figure 13: MobileNet V1 accelerator flow based on FPGA implementation

The entire convolution is computed with registers, which makes it very efficient. Readers interested in FPGA implementations of deep learning, neural network accelerators, and high-efficiency hardware can jump to my next article for an interpretation of the whole implementation and my take on it: FPGA-based MobileNet V1, FPGA deep learning accelerator.

Figure 14: Implementation of the network

 

Paper name: Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs

Address: https://arxiv.org/pdf/1910.10075v1.pdf

 


To be continued!!!

I felt that putting manual efficient network design and NAS-based efficient network search together would make the content too long and tiring to read, and I considered splitting it into two parts; instead, I will conclude it in this article.

 

Update 11-30:

Two recent papers improve on MobileNet V2's MBBlock. The first introduces an idle transformation and designs the Idle Block: the input is divided into two parts. One part goes through an MBBlock-like path (a 1×1 pointwise convolution, a 3×3 depthwise convolution, then a 1×1 pointwise convolution); the other part is passed directly to the output without any transformation, and the two are concatenated as the output, as shown below. The proposed network is a hybrid composition network.

Figure 15: Schematic diagram of Idle Block
Figure 16: The specific construction and calculation process of Idle Block, compared to MBBlock and ShuffleNetV2
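A minimal PyTorch sketch of the idle idea; the split ratio and layer details here are simplified assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class IdleBlock(nn.Module):
    """Split input channels: one part goes through an MBBlock-like path,
    the 'idle' part passes through untouched, and the two are concatenated."""
    def __init__(self, channels, idle_ratio=0.5, t=6):
        super().__init__()
        self.idle = int(channels * idle_ratio)  # channels left untransformed
        active = channels - self.idle
        hidden = active * t
        self.transform = nn.Sequential(
            nn.Conv2d(active, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, active, 1, bias=False),
            nn.BatchNorm2d(active),
        )

    def forward(self, x):
        idle, active = x[:, :self.idle], x[:, self.idle:]
        # Only the active part is transformed; the idle part is concatenated back.
        return torch.cat([self.transform(active), idle], dim=1)

y = IdleBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```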

Paper name: Hybrid Composition with IdleBlock: More Efficient Networks for Image Recognition

Address: https://arxiv.org/abs/1911.08609


 

The second paper is MixNet, which generalizes the fixed 3×3 kernel of the depthwise convolution in the middle of MBBlock: the input channels are partitioned into multiple groups, and each group is convolved with a different kernel size (3×3, 5×5, 7×7, and so on), as shown below.

Figure 17: Module schematic diagram of Mixed Depthwise convolution

The paper has extensive discussion of group size, per-group kernel size, and per-group channel size. On top of existing group convolution and depthwise convolution, the idea takes only a line or two of code and is easy to implement, without special tricks. Google also officially released TensorFlow code for this mixed convolution:

Figure 18: TensorFlow code implementing Mixed Conv
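For comparison, here is a minimal PyTorch version of the same idea (the official reference code is in TensorFlow; this sketch simply splits the channels across the kernel sizes):

```python
import torch
import torch.nn as nn

class MixConv(nn.Module):
    """Mixed depthwise conv: split channels into groups, apply a different
    depthwise kernel size to each group, then concatenate the results."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        self.splits[0] += channels - sum(self.splits)  # absorb any remainder
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False)
            for c, k in zip(self.splits, kernel_sizes)
        )

    def forward(self, x):
        chunks = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)

# Example: 96 channels split into three groups of 32 with 3x3/5x5/7x7 kernels.
y = MixConv(96)(torch.randn(1, 96, 56, 56))
print(y.shape)  # torch.Size([1, 96, 56, 56])
```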

 

Paper name: MixNet: Mixed Depthwise Convolutional Kernels

Address: https://arxiv.org/abs/1907.09595

 

 

 
