MobileOne

This post introduces MobileOne, a neural network designed for easy deployment on mobile devices. Its inference time on an iPhone 12 is under 1 ms while it reaches 75.9% top-1 accuracy on ImageNet, and it extends to multiple tasks: image classification, object detection, and semantic segmentation.

The paper identifies and analyzes bottlenecks in the architecture and optimization of recent efficient neural networks, and provides methods for mitigating them. Building on this analysis, the authors design an efficient backbone, MobileOne, whose inference time on an iPhone 12 is under 1 ms at 75.9% top-1 accuracy on ImageNet. MobileOne is an efficient on-device architecture that also generalizes to multiple tasks: image classification, object detection, and semantic segmentation, with significant improvements in latency and accuracy compared to existing efficient architectures deployed on mobile devices.

Because MobileOne has a plain, single-branch architecture at inference time, its memory access cost is low, so its width can be increased accordingly. For example, MobileOne-S1 has 4.8M parameters and a latency of 0.89 ms, while MobileNet-V2 has only 3.4M parameters but a latency of 0.98 ms. Moreover, MobileOne-S1's top-1 accuracy is 3.9% higher than MobileNet-V2's.

Figure 1: Accuracy and Latency Comparison for MobileOne

MobileOne: Mobile Vision Architecture with 1ms Inference Latency

Paper name: An Improved One millisecond Mobile Backbone (CVPR 2023)

Paper address:

https://arxiv.org/pdf/2206.04040.pdf

One direction in designing and deploying efficient deep learning architectures for mobile devices is to keep reducing floating-point operations (FLOPs) and parameter count (Params) while increasing accuracy. However, the relationship between these two metrics and a model's actual latency is not so clear.

Take FLOPs, for example: two models with the same FLOPs may have very different latencies, because FLOPs only accounts for a model's total computation while ignoring memory access cost (MAC) and degree of parallelism [1].

  1. Regarding MAC: Add and Concat operations require almost no computation, but their memory access cost is not negligible. So under the same FLOPs, a model with a large MAC will have higher latency.

  2. Regarding parallelism: at the same FLOPs, a model with a high degree of parallelism can be much faster than one with a low degree of parallelism. In ShuffleNet V2 [2], the authors report that a large number of fragmented operators is unfriendly to devices with strong parallel computing ability (such as GPUs) and introduces extra overhead such as kernel launch and synchronization. By contrast, the Inception architecture has multiple branches, whereas a VGG-like plain architecture is single-branch; see the micro-benchmark sketch below.
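
To make the MAC and fragmentation effects concrete, here is a minimal PyTorch micro-benchmark sketch (not from the paper): two modules with roughly the same FLOPs, one single-branch and one split into four parallel branches. Exact timings depend on hardware, but the fragmented version typically pays extra kernel-launch and memory-access costs.

```python
import time
import torch
import torch.nn as nn

class SingleBranch(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

class MultiBranch(nn.Module):
    # Four parallel convs, each producing c/4 channels, concatenated:
    # roughly the same total FLOPs, but more kernel launches and more
    # intermediate activation tensors kept alive (higher MAC).
    def __init__(self, c=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c // 4, 3, padding=1) for _ in range(4)
        )

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

def bench(m, x, iters=100):
    with torch.no_grad():
        for _ in range(10):          # warm-up
            m(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            m(x)
    return (time.perf_counter() - t0) / iters * 1e3  # ms per iteration

x = torch.randn(1, 64, 56, 56)
print(f"single-branch: {bench(SingleBranch(), x):.3f} ms")
print(f"multi-branch : {bench(MultiBranch(), x):.3f} ms")
```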

Another example is Params: two models with the same Params will not necessarily have the same latency.

  1. Regarding MAC: Add and Concat require zero parameters, yet their MAC cannot be ignored. So under the same Params, a model with a large MAC will have higher latency.

  2. When a model shares parameters across layers, its FLOPs remain high while its Params drop; the sketch below illustrates this.
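
A small sketch of the parameter-sharing point above (illustrative, not from the paper): applying one convolution twice halves the Params relative to two distinct convolutions, while the executed FLOPs are identical.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    # One conv applied twice: Params counted once, FLOPs of two convs.
    def __init__(self, c=64):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return self.conv(self.conv(x))

class UnsharedBlock(nn.Module):
    # Two distinct convs: twice the Params, same FLOPs as SharedBlock.
    def __init__(self, c=64):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return self.conv2(self.conv1(x))

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(SharedBlock()), params(UnsharedBlock()))  # shared has half the Params
```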

Therefore, the goal of this paper is to design a neural network with lower latency on real devices; latency is tested on an iPhone 12 using models converted with the CoreML [3] toolkit. The optimization of small models is another bottleneck. To address it, the authors draw on the structural reparameterization technique of RepVGG [1], and further relieve the optimization bottleneck by dynamically relaxing regularization throughout training, preventing the under-fitting caused by over-regularizing small models. Concretely, this paper designs a new architecture, MobileOne, whose variants run in under 1 ms on an iPhone 12 while achieving high accuracy on real devices.
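
As a rough sketch of the latency-measurement pipeline (the paper's iOS benchmarking app is not public, and the stand-in network here is arbitrary), a PyTorch model can be traced and converted with the coremltools package:

```python
import torch
import coremltools as ct
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()                  # stand-in network
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# Recent coremltools defaults to the ML Program format (.mlpackage);
# the converted model is then timed on-device, e.g. with Xcode or a custom app.
mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=(1, 3, 224, 224))])
mlmodel.save("model.mlpackage")
```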

Correlation between the lightweight metrics Params, FLOPs, and Latency

The authors first evaluated how the lightweight metrics Params and FLOPs correlate with latency. Specifically, they developed an iOS application to measure model latency on an iPhone 12, converting the ONNX models with the CoreML [3] toolkit.

Figure 2 below plots latency against FLOPs and latency against Params. Many models with higher Params nevertheless have lower latency, and the relationship between FLOPs and latency is similar. The authors also estimated the Spearman rank correlation, shown in Figure 3 below: for efficient mobile architectures, latency is moderately correlated with FLOPs and only weakly correlated with Params, and the correlation is even lower on CPU.

Figure 3: Spearman correlation coefficient between FLOPs and Params and Latency
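
For reference, the Spearman rank correlation in Figure 3 can be estimated with scipy; the numbers below are placeholders, not the paper's measurements.

```python
from scipy.stats import spearmanr

flops   = [300, 600, 150, 1200, 450]   # MFLOPs (hypothetical)
params  = [3.4, 5.0, 2.1, 6.9, 4.8]    # millions (hypothetical)
latency = [0.9, 1.4, 1.1, 2.3, 0.8]    # ms (hypothetical)

rho_f, _ = spearmanr(flops, latency)
rho_p, _ = spearmanr(params, latency)
print(f"latency~FLOPs rho={rho_f:.2f}, latency~Params rho={rho_p:.2f}")
```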

Where is the latency bottleneck?

Activation functions

To analyze the effect of activation functions on latency, the authors built a 30-layer convolutional neural network and benchmarked it on the iPhone 12 with different activation functions. The results in Figure 4 below show large differences in latency: the simple ReLU has the lowest cost, while complex activation functions can be expensive because of synchronization costs. Such activations may be accelerated by dedicated hardware in the future, but for now MobileOne uses only ReLU.

 Figure 4: Latency comparison of different activation functions
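
A minimal sketch of this kind of ablation (CPU timing with common PyTorch activations; the paper times on an iPhone 12 and includes other activations as well):

```python
import time
import torch
import torch.nn as nn

def make_net(act, depth=30, c=32):
    # A 30-layer conv stack where only the activation function varies.
    layers = []
    for _ in range(depth):
        layers += [nn.Conv2d(c, c, 3, padding=1), act()]
    return nn.Sequential(*layers).eval()

x = torch.randn(1, 32, 56, 56)
for name, act in [("ReLU", nn.ReLU), ("GELU", nn.GELU),
                  ("SiLU", nn.SiLU), ("Hardswish", nn.Hardswish)]:
    net = make_net(act)
    with torch.no_grad():
        for _ in range(5):
            net(x)                       # warm-up
        t0 = time.perf_counter()
        for _ in range(20):
            net(x)
    print(f"{name}: {(time.perf_counter() - t0) / 20 * 1e3:.1f} ms")
```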

Memory access cost (MAC) and degree of parallelism [1]

As mentioned above, memory access cost increases significantly in multi-branch architectures, because the activations of each branch must be kept in memory to compute the next tensor in the graph. This memory bottleneck can be avoided if the network has few branches. To demonstrate the impact of hidden costs such as MAC, the authors ran ablations adding skip connections and squeeze-excite blocks to a 30-layer convolutional neural network. The results in Figure 5 show that each of these additions increases latency. MobileOne therefore adopts a branch-free architecture at inference time, which reduces memory access cost.

 Figure 5: Effect of SE blocks and skip connections on Latency
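
A sketch of the ablated components (illustrative): the same conv block optionally wrapped with a skip connection or a standard squeeze-excite (SE) gate. The skip keeps the input tensor alive until the add, and SE adds a pooling/gating side path, both raising MAC.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    # Standard squeeze-excite: global pool -> two 1x1 convs -> channel gate.
    def __init__(self, c, r=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // r, 1), nn.ReLU(),
            nn.Conv2d(c // r, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class ConvBlock(nn.Module):
    def __init__(self, c=32, skip=False, se=False):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.act = nn.ReLU()
        self.se = SE(c) if se else nn.Identity()
        self.skip = skip

    def forward(self, x):
        y = self.se(self.act(self.conv(x)))
        # With skip=True, x must stay resident until the add -> higher MAC.
        return x + y if self.skip else y
```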

MobileOne Block Architecture

The basic block of MobileOne is designed after MobileNet-V1 [4]: a 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
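
In PyTorch terms, this basic unit looks roughly like the sketch below (BN/ReLU placement follows the usual MobileNet-V1 convention):

```python
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    # 3x3 depthwise conv (groups=c_in) followed by 1x1 pointwise conv.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in),
        nn.BatchNorm2d(c_in), nn.ReLU(),
        nn.Conv2d(c_in, c_out, 1),
        nn.BatchNorm2d(c_out), nn.ReLU(),
    )
```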

During training, MobileOne uses structural reparameterization to add several parallel branches to the 3x3 depthwise convolution and several parallel branches to the 1x1 pointwise convolution, as shown in Figure 6 below. However, it differs from RepVGG:

RepVGG adds to each 3x3 convolution during training only a 1x1 convolution branch and a BN-only shortcut branch, whereas MobileOne replicates its conv-BN branches k times (an over-parameterization factor) alongside the BN-only skip branch, as sketched after Figure 6.

Figure 6: MobileOne Architecture
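
A simplified sketch of such a train-time block (illustrative, not the official implementation): k parallel conv+BN branches plus a BN-only skip branch, summed. MobileOne applies this pattern separately to the depthwise and pointwise convolutions.

```python
import torch
import torch.nn as nn

class RepBranchBlock(nn.Module):
    def __init__(self, c, k=4, kernel=3, groups=1):
        super().__init__()
        pad = kernel // 2
        # k re-parameterizable conv+BN branches (over-parameterization factor k).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c, c, kernel, padding=pad, groups=groups, bias=False),
                nn.BatchNorm2d(c))
            for _ in range(k)
        )
        self.bn_skip = nn.BatchNorm2d(c)   # BN-only identity branch

    def forward(self, x):
        return self.bn_skip(x) + sum(b(x) for b in self.branches)

# Depthwise variant: RepBranchBlock(c, kernel=3, groups=c)
# Pointwise variant: RepBranchBlock(c, kernel=1)
```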

At inference time, the MobileOne model has no branches and is a plain, single-path architecture. The structural reparameterization procedure is the same as RepVGG's: first "absorb" each BN into its preceding convolution, then merge the parameters of the parallel convolutions, as the sketch below shows.
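
A sketch of the two fusion steps (standard conv-BN folding plus branch summation; a complete implementation also pads 1x1 and identity branches up to the 3x3 kernel size before summing):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    # Fold BN into the conv:  W' = W * gamma / sqrt(var + eps)
    #                         b' = beta - gamma * mean / sqrt(var + eps)
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        b = b + conv.bias * scale
    return w, b

def merge_parallel(fused):
    # Sum the folded kernels and biases of parallel branches into one conv.
    w = torch.stack([wb[0] for wb in fused]).sum(dim=0)
    b = torch.stack([wb[1] for wb in fused]).sum(dim=0)
    return w, b
```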

Figure 8: Effect of using structural reparameterization

MobileOne architecture

The specific architecture parameters of MobileOne are shown in Figure 9 below. The general design principle is to use fewer blocks in the shallow stages of the model and stack more blocks in the deep stages, because the shallow stages operate at a larger input resolution and therefore contribute more to the latency of the whole model.

Because the MobileOne model has no multi-branch structure during inference, it incurs no extra data-movement cost, which in turn saves latency. The model can therefore use more parameters without paying a significant latency penalty.

Figure 9: MobileOne Architecture
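
In code form, the design rule reads roughly like the illustrative stage table below; the block counts and widths are placeholders, not the exact MobileOne-S0..S4 specification in Figure 9.

```python
# (num_blocks, out_channels, stride) per stage -- hypothetical values.
stages = [
    (1,   64, 2),   # stem, 224 -> 112
    (2,   64, 2),   # 112 -> 56: shallow, high resolution -> few blocks
    (8,  128, 2),   # 56 -> 28
    (10, 256, 2),   # 28 -> 14: deeper stage carries most blocks
    (1,  512, 2),   # 14 -> 7
]
```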

The authors ablate the various training strategies while keeping all other settings constant, and find that annealing the weight-decay coefficient yields a 0.5% accuracy improvement; a sketch of such a schedule follows the figure.

Figure 10: Ablation experiment for the training policy
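
A minimal sketch of annealing the weight-decay coefficient with a cosine schedule (the exact schedule used in the paper is an assumption here):

```python
import math
import torch

model = torch.nn.Linear(10, 10)                      # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

total_epochs, wd_start, wd_end = 300, 1e-4, 1e-5
for epoch in range(total_epochs):
    # Cosine-anneal the weight-decay coefficient from wd_start to wd_end.
    wd = wd_end + 0.5 * (wd_start - wd_end) * (1 + math.cos(math.pi * epoch / total_epochs))
    for group in opt.param_groups:
        group["weight_decay"] = wd
    # ... run one training epoch ...
```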

Experimental results

Image classification experiment results

The image classification results on ImageNet-1K are shown in Figure 11 below. The state-of-the-art MobileFormer reaches 79.3% top-1 accuracy at a latency of 70.76 ms, while MobileOne-S4 reaches 79.4% at only 1.86 ms, about 38× faster on mobile. MobileOne-S3's top-1 accuracy is 1% higher than EfficientNet-B0's, and it is 11× faster on mobile.

 Figure 11: Image classification experiment results

As a small on-device model, MobileOne was also evaluated on downstream tasks beyond image classification; the results are shown in Figure 12 below.

Figure 12: Experimental results for downstream tasks

COCO object detection experiment results

The authors use a backbone pre-trained on ImageNet-1K with SSDLite as the detection head, trained on the MS COCO dataset at an input resolution of 320×320 for 200 epochs. The best MobileOne model outperforms MNASNet by 27.8% and the best version of MobileViT by 6.1%.

Pascal VOC and ADE20K Semantic Segmentation Experimental Results

The authors use a backbone pre-trained on ImageNet-1K with Deeplab V3 as the segmentation head, trained on the Pascal VOC and ADE20K datasets. On VOC, MobileOne outperforms MobileViT by 1.3% and MobileNetV2 by 5.8%. On ADE20K, MobileOne outperforms MobileNetV2 by 12.0%, and even the smaller MobileOne-S1 is 2.9% higher.

Summary

MobileOne is a highly efficient on-device visual backbone built with the help of structural reparameterization. It generalizes to multiple tasks: image classification, object detection, and semantic segmentation, with significant improvements in latency and accuracy compared to existing efficient architectures deployed on mobile devices.

References

  1. RepVGG: Making VGG-style ConvNets Great Again

  2. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design

  3. https://coremltools.readme.io/docs

  4. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
