New version of the ultra-lightweight AI inference engine MindSpore Lite released, supporting a comprehensive upgrade of HMS Core AI capabilities

Since Huawei open-sourced MindSpore Lite 1.0.0 on September 20, its ease of use, operator performance and completeness, and extensive support for third-party models have been widely recognized by mobile application developers. MindSpore Lite provides a full-scenario AI inference framework for the HMS Core AI field: it supports AI-related modules on Huawei phones such as the camera, gallery, wallet, browser QR code scanning, and object recognition, and it also provides basic AI services for Huawei wearables, smart screens, and other devices. At the same time, as one of the important HMS Core capabilities open to global developers, Huawei's machine learning service has been integrated by 1,000+ applications worldwide, with an average of more than 300 million calls per day.

Now, at the start of 2021, Huawei has released MindSpore Lite 1.1.0, with upgrades across operator performance, model miniaturization, automatic cropping of the acceleration library, end-side model training, voice model support, a Java API, and model visualization. The upgraded version is lighter, faster, and easier to use, and the new features will also show up in the new version of HMS Core.

1. Optimization and expansion of operator library

Inference performance optimization is the highlight of this release. In addition to continued performance optimization on ARM CPU (FP16/FP32/INT8), ARM GPU and X86_64 optimization are also a major focus this time. On the GPU side, besides conventional operator optimization, we added techniques such as online fusion and AutoTuning, greatly improving ARM GPU inference performance. To better support PC-side inference, we also did extensive assembly-level optimization of the X86_64 operators. Testing on a large number of models shows that MindSpore Lite 1.1.0 is highly competitive in inference performance among the various frameworks in the industry.

1.1 ARM CPU optimization

From introducing better algorithms that reduce the amount of computation, to minimizing memory accesses to increase instruction throughput, the performance of MindSpore Lite's CPU operators has improved significantly. We used 100+ end-side preset models from the official TF Hub website to compare inference latency. The results show that on high-end phones such as the Mate 30 and P30, MindSpore Lite fully surpasses the officially published numbers, and on mid-to-low-end phones such as the P20, its inference performance beats the official numbers on 97% of the models.

1.1.1 FP16 inference performance

MindSpore Lite fully supports ARMv8.2 FP16 inference, with inference latency roughly half that of FP32 inference while the accuracy still meets business requirements. Our FP16 inference solution is already used in Huawei HMS MLKit and is widely applied in various AI services preset on Huawei phones.
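
As a rough illustration of why FP16 halves the data width while usually keeping enough precision for vision workloads, here is a minimal numpy sketch (not MindSpore Lite code; the tensor shapes are made up for illustration) that converts FP32 weights to FP16 and measures the resulting error:

```python
import numpy as np

# Hypothetical FP32 convolution weights, roughly normally distributed as in trained CV models.
rng = np.random.default_rng(0)
w_fp32 = rng.normal(0.0, 0.05, size=(64, 3, 3, 3)).astype(np.float32)

# Casting to FP16 halves the storage: 2 bytes per value instead of 4.
w_fp16 = w_fp32.astype(np.float16)
print("FP32 bytes:", w_fp32.nbytes, "FP16 bytes:", w_fp16.nbytes)

# The rounding error introduced by FP16 is tiny relative to typical weight magnitudes,
# which is why accuracy usually still meets business requirements.
abs_err = np.abs(w_fp32 - w_fp16.astype(np.float32))
print("max abs error: %.2e, max relative error: %.2e"
      % (abs_err.max(), (abs_err / (np.abs(w_fp32) + 1e-12)).max()))
```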

Since TF Lite does not support FP16 inference, we compared only against the latest MNN 1.1 in the FP16 inference performance test. The results show that MindSpore Lite has lower inference latency and better performance in FP16.


Comparison of overall network delay on Huawei Mate30


Comparison of FP16 inference delay on Huawei Mate30


Comparison of FP16 inference delay on Snapdragon 865+

1.1.2 INT8 quantized model inference performance

For quantized operators, this version of MindSpore Lite adds Winograd optimization algorithms such as the 3x3 convolution kernel at the algorithm level (currently mainly for non-ARMv8.2 devices), uses SDOT instructions to optimize operators such as MatMul, Fully Connected, and Convolution, and applies a series of optimization strategies that improve the hit rate of the underlying cache. Together these greatly improve MindSpore Lite's quantized inference performance, which is 40%+ faster than FP16 inference.
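
To make the idea concrete, here is a small numpy sketch (illustrative only, not the MindSpore Lite kernel) of symmetric 8-bit quantized matrix multiplication: int8 values are multiplied and accumulated in int32, which is the kind of dot-product the ARM SDOT instruction accelerates, and the result is then rescaled back to floating point:

```python
import numpy as np

def quantize_sym(x, bits=8):
    """Symmetric per-tensor linear quantization: returns int8 data and the scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16)).astype(np.float32)
b = rng.normal(size=(16, 8)).astype(np.float32)

qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# int8 x int8 products accumulated into int32 (what SDOT does four lanes at a time),
# then dequantized with the product of the two scales.
acc_i32 = qa.astype(np.int32) @ qb.astype(np.int32)
approx = acc_i32.astype(np.float32) * (sa * sb)

print("max abs error vs FP32 matmul:", np.abs(approx - a @ b).max())
```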

We chose the latest TF Lite 2.4 and the latest MNN 1.1 for the inference performance comparison, using the quantized models preset on TF Hub (during testing we found that MNN could not convert a large number of quantized models, and even TF Lite had conversion problems with some of its own models). The results show that MindSpore Lite has the lowest latency and the best performance for quantized models, both in model support and in inference performance.


Comparison of overall quantized network delay on Huawei Mate30

Test on ARMv8.2 devices


Comparison of quantization model delay on Snapdragon 865+

Test on ARMv8 devices

Comparison of quantized model delays on Huawei P20

1.1.3 FP32 inference performance

To ensure that MindSpore Lite also delivers the industry's best inference performance on low-end CPUs, we continue to optimize FP32 inference. We benchmarked against TF Lite (version 2.4) and MNN (version 1.1) on a Huawei P20. The results show that MindSpore Lite FP32 has the lowest inference latency and the best performance, although the gap with the other frameworks is not large.


Comparison of FP32 inference delay on Huawei P20

1.2 ARM GPU optimization

In MindSpore Lite 1.1 we focused on optimizing GPU inference performance. In addition to regular operator-level optimization, we added several optimization methods such as online fusion, AutoTuning, and an OpenCL kernel binary cache mechanism, improving overall performance by 25%+ over MindSpore Lite 1.0.
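
The AutoTuning idea can be sketched in a few lines: for each operator and shape, benchmark a handful of candidate work-group configurations once, keep the fastest, and cache the choice (and, analogously, the compiled OpenCL kernel binary) so later runs skip the search. The snippet below is a conceptual Python sketch with a simulated kernel timer, not the actual MindSpore Lite implementation:

```python
import time

def run_kernel(op_key, workgroup):
    """Stand-in for launching and timing an OpenCL kernel with a given work-group size."""
    start = time.perf_counter()
    # Simulated work: pretend some work-group sizes fit the hardware better than others.
    _ = sum(i * workgroup[0] % 7 for i in range(20000 // workgroup[1]))
    return time.perf_counter() - start

tuning_cache = {}   # (op, shape) -> best work-group size, reused on later runs

def autotune(op_key, candidates):
    if op_key in tuning_cache:              # cache hit: no re-tuning needed
        return tuning_cache[op_key]
    best = min(candidates, key=lambda wg: run_kernel(op_key, wg))
    tuning_cache[op_key] = best
    return best

best_wg = autotune(("Conv2D", (1, 224, 224, 32)), [(4, 4), (8, 8), (16, 4), (16, 16)])
print("selected work-group size:", best_wg)
```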

We also used the 100+ preset models from the official TF Hub website on a Huawei Mate 30 for a GPU inference performance comparison against MNN (version 1.1) and TF (version 2.4). As shown in the figure below, MindSpore Lite's GPU inference has the lowest latency on most models, while MNN's latency is relatively high.


Comparison of GPU FP32 inference latency on Huawei Mate30

1.3 X86_64 CPU optimization

In this version we also put a lot of optimization work into inference performance on the X86_64 platform. We benchmarked several classic CV networks on an Intel Core i7-8700 CPU against Intel OpenVINO and MNN; the results show that MindSpore Lite again has the lowest latency.


Intel Core i7-8700 X86_64 CPU inference performance comparison

1.4 More operator fusion

The current version of MindSpore Lite basically covers the common convolution-related fusion patterns in computer vision. It also performs deep fusion optimization for Transformer-based voice models and LSTM-structured models, mainly fusing small operators into large operators such as LayerNorm and LSTM, fusing multiple MatMuls into a single BatchMatMul operator, and forward-fusing Slice operators that split matrices. This gives voice models a 20%+ improvement. Going forward, we will explore automatic scheduling of fusion patterns.
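
As an example of one of these fusions, fusing several independent MatMuls into a single BatchMatMul is numerically equivalent to running them one by one, but lets the runtime issue one better-parallelized kernel instead of several. A small numpy sketch (illustrative, not the converter's actual fusion pass):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three separate MatMul ops, e.g. projecting the same sequence with different heads.
a = [rng.normal(size=(32, 64)).astype(np.float32) for _ in range(3)]
b = [rng.normal(size=(64, 16)).astype(np.float32) for _ in range(3)]

# Unfused: three kernel launches.
separate = [x @ y for x, y in zip(a, b)]

# Fused: stack the inputs and run a single BatchMatMul.
batched = np.matmul(np.stack(a), np.stack(b))   # shape (3, 32, 16)

for i in range(3):
    assert np.allclose(separate[i], batched[i])
print("fused BatchMatMul matches the three separate MatMuls")
```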

2. Operator completeness extension

MindSpore Lite supports multiple hardware platforms including ARM CPU, ARM GPU, X86 CPU, Kirin NPU, MTK APU.

2.1 ARM CPU

MindSpore Lite is currently one of the end-side inference frameworks with the richest CPU operator support. Our model conversion tool supports the operator definitions of third-party frameworks including TF Lite (100 operators), TF (53), ONNX (96), and Caffe (26), achieving high compatibility. As mentioned in the performance tests above, MNN cannot convert many models, and even TF Lite's support for the preset models on its own website is incomplete. MindSpore Lite implements 121 FP32, 55 FP16, and 71 INT8 CPU operators. In version 1.1 we also made major adjustments and improvements to the control-flow operators to better support voice models.

2.2 ARM GPU

We added 10+ OpenCL operators, bringing the total number of supported GPU operators to 58 and basically covering common CV networks. We also added features such as online fusion and AutoTuning, and added weight quantization support, so 8-bit weight-quantized networks can now run on the GPU.

2.3 Kirin NPU

In version 1.1 we improved support for the Huawei Kirin NPU hardware platform, added support for the Kirin 9000 chip, and added 50+ NPU operators, so that most CV scenarios can be accelerated on the NPU. We ran benchmark verification on several typical networks on Huawei's latest Mate 40 phone; inference latency on the NPU is significantly better than CPU inference.


Comparison of inference delay between NPU and CPU FP32/16 on Mate 40

3. End-side training support

Because models trained on public datasets deviate somewhat from real user scenarios, in scenes such as face recognition and speech recognition we often need to fine-tune a pre-trained model with local data to improve the accuracy of local inference and the user experience.


In MindSpore Lite 1.1 we open-sourced the end-side training framework. The first version provides the following features:

1) Supports 30+ backward (gradient) operators, and provides common optimizers such as SGD and Adam, and loss functions such as CrossEntropy/SparseCrossEntropy/MSE; you can train a model from scratch, or fine-tune specific network layers for transfer learning (a minimal fine-tuning sketch follows this list);

2) Supports training networks such as LeNet/AlexNet/ResNet/MobileNetV1/V2/V3 and EfficientNet, and provides complete model loading, conversion, and training scripts to make them easy to use and debug;

3) MindSpore cloud-side training and end-side training are seamlessly connected, and the cloud-side model can be directly loaded to the end-side for training;

4) Supports a checkpoint mechanism, so that after training is interrupted abnormally it can quickly recover and continue training.
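
As a minimal illustration of the transfer-learning case mentioned in point 1), the sketch below fine-tunes only the last layer of a tiny network with SGD and a cross-entropy loss in plain numpy. It stands in for the idea of freezing a pre-trained feature extractor and training only selected layers on local data; it is not MindSpore Lite API code, and all names and shapes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are frozen, pre-trained feature-extractor weights (never updated).
w_frozen = rng.normal(size=(20, 8)).astype(np.float32)
# Only the classifier layer is trained (transfer learning / fine-tuning).
w_head = rng.normal(scale=0.1, size=(8, 3)).astype(np.float32)

x = rng.normal(size=(64, 20)).astype(np.float32)          # local on-device samples
y = rng.integers(0, 3, size=64)                           # local labels

lr = 0.1
for step in range(100):
    feats = np.maximum(x @ w_frozen, 0.0)                  # frozen layers, forward only
    logits = feats @ w_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                      # softmax
    loss = -np.log(p[np.arange(len(y)), y] + 1e-9).mean()  # cross-entropy loss
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    w_head -= lr * feats.T @ grad_logits                   # SGD update, head only
print("final training loss: %.3f" % loss)
```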

Our end-side training framework has already been used commercially in AI applications on some Huawei devices, such as the home photo album, and has delivered a good user experience.

4. Post-training quantization

As AI applications are deployed on end-side devices more and more widely, and given the constraints of end-side resources, miniaturizing models and improving inference performance is an increasing challenge. MindSpore Lite provides a simple and practical post-training quantization feature that minimizes model size, reduces memory usage, improves inference speed, and lowers power consumption.

Compared with quantization-aware retraining, post-training quantization has two clear advantages: it does not require a large training dataset, and it does not require retraining, converting the model quickly offline. The MindSpore Lite post-training quantization tool provides both weight quantization and full quantization, supports 1-16 bit quantization, and supports classification, detection, NLP, and other models.

To keep the accuracy loss of post-training quantization small, we use a pipeline combination quantization method: the first stage applies conventional linear quantization to the weights and activation values; the second stage analyzes the quantization error and uses statistical methods to correct the quantized model, compensating for the accuracy loss caused by quantization.
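
The two stages can be sketched roughly as follows: linear quantization of a weight tensor, then measuring the statistics of the output error on calibration data and folding the mean error back into the bias. This is an illustrative numpy sketch of the general bias-correction idea, not the exact algorithm inside the MindSpore Lite tool:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(128, 64)).astype(np.float32)            # layer weights
bias = np.zeros(64, dtype=np.float32)
# Calibration activations, shaped like post-ReLU outputs (non-negative, non-zero mean).
calib = np.maximum(rng.normal(0.5, 1.0, size=(256, 128)), 0.0).astype(np.float32)

# Stage 1: conventional linear (symmetric, per-tensor) quantization of the weights.
qmax = 127
scale = np.abs(w).max() / qmax
w_q = np.clip(np.round(w / scale), -qmax, qmax) * scale                  # quantize + dequantize

# Stage 2: analyze the quantization error statistically on calibration data
# and compensate the mean output error through the bias.
err = calib @ w - calib @ w_q                                            # per-sample output error
bias_corrected = bias + err.mean(axis=0)

before = np.abs(err).mean()
after = np.abs(calib @ w - (calib @ w_q + bias_corrected)).mean()
print("mean output error  before correction: %.5f  after correction: %.5f" % (before, after))
```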


Pipeline combination quantization

Take the MobileNet_v2 model from the official TF website as an example. With MindSpore Lite post-training A8W8 quantization (8-bit activation quantization, 8-bit weight quantization), the accuracy loss relative to the FP32 model drops from 0.82% to 0.4% after loss correction; with 7-bit quantization the accuracy loss still stays below 1%.


Accuracy comparison of fully quantized mobilenet_v2 model after training

In the HMS Face scenario, the models use INT8 weight quantization (model sizes range from 364 KB to 2.9 MB), and the actual end-to-end recognition accuracy fully meets service requirements. The relative accuracy error with and without the loss-correction scheme is compared below; the quantization accuracy loss under the loss-correction scheme is clearly reduced.


Comparison of relative accuracy loss for weight-quantized Face scene models with and without loss correction

Extensive internal testing and feedback from real commercial deliveries show that the pipeline combination quantization method is very effective: even for models as small as 300 KB, accuracy still meets commercial requirements after INT8 quantization and compression.

5. Enhanced ease of use

5.1 Automatic acceleration library cropping tool

To serve scenarios that demand an extremely small release package, we provide a one-click cropping tool that, based on a user-specified model list, automatically crops out the minimal MindSpore Lite build sufficient to run the listed models.
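
Conceptually, the tool reads the operator list of each model the user specifies, takes the union, and keeps only the matching kernels in the build. The toy Python sketch below illustrates that idea with made-up operator and model names; it is not the actual tool:

```python
# Toy illustration of "crop the library down to what the listed models need".
AVAILABLE_KERNELS = {"Conv2D", "DepthwiseConv2D", "MatMul", "Relu", "Softmax",
                     "LSTM", "LayerNorm", "Resize", "Transpose"}

def ops_used_by(model_path):
    """Stand-in for parsing a model file and listing the operators it contains."""
    fake_model_ops = {
        "mobilenet_v2.ms": {"Conv2D", "DepthwiseConv2D", "Relu", "Softmax"},
        "face_detect.ms":  {"Conv2D", "Resize", "Transpose", "Softmax"},
    }
    return fake_model_ops[model_path]

def crop(model_list):
    needed = set().union(*(ops_used_by(m) for m in model_list))
    kept = AVAILABLE_KERNELS & needed
    dropped = AVAILABLE_KERNELS - needed
    return kept, dropped

kept, dropped = crop(["mobilenet_v2.ms", "face_detect.ms"])
print("kernels kept in the cropped build:", sorted(kept))
print("kernels stripped from the build:  ", sorted(dropped))
```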

5.2 Simplified offline tool parameters

We have streamlined the parameters of the offline conversion tool to maximize its ease of use: when converting third-party models, developers no longer need to know the model's quantization type, input/output node names, or their data types.

5.3 Support Java interface

Version 1.1 officially opened the Java interface to make it easier for Android developers to use MindSpore Lite for application development.

5.4 Model visualization

To make debugging easier for developers, we contributed code to the Netron open-source community to support visualization of MindSpore Lite models. Developers can now use Netron to visualize MindSpore Lite models, which should be a great help when debugging, especially for models with complex structures.

6. More end-side preset models opened up

To help developers quickly deploy their own AI services on the end side, MindSpore has opened up more models suitable for end-side use, including some original models first released by MindSpore. These models can easily be obtained from MindSpore Hub.

6.1 ResNet-50 pruned with the SCOP algorithm on the Oxford-IIIT Pet dataset

SCOP (Scientific Control for Reliable Neural Network Pruning) is a scientific-control pruning mechanism jointly proposed by Huawei's Noah's Ark Lab and Peking University to minimize the influence of pruned nodes on the network output. With this pruning method, ResNet-101 on the ImageNet dataset loses only 0.01% top-1 accuracy while its parameter count and computation are reduced by 57.8% and 60.2% respectively, significantly better than SOTA methods. Model link: https://www.mindspore.cn/resources/hub/details?noah-cvlab/gpu/1.0/resnet-0.65x_v1.0_oxford_pets

6.2 VGG-Small model based on SLB lightweight quantization technology

This model uses the SLB quantization technology (Searching for Low-Bit Weights in Quantized Neural Networks), Huawei Noah's Ark Lab's model-lightweighting work accepted at NeurIPS 2020. It is an end-side model built on CIFAR-10 with 2-bit weight and 2-bit activation quantization. Model link: https://www.mindspore.cn/resources/hub/details?noah-cvlab/gpu/1.0/VGG-Small-low-bit_cifar10

To learn more about the lightweighting techniques used in the models above, please refer to: https://mp.weixin.qq.com/s/H1zg3ezZDdXQ-IQ7ki0Mtw

The test data above comes from Huawei's internal laboratory tests. If you have any questions, you can leave feedback on the MindSpore forum: https://bbs.huaweicloud.com/forum/forum-1076-1.html

MindSpore open-source code repository: https://gitee.com/mindspore/mindspore


Original link: https://developer.huawei.com/consumer/cn/forum/topic/0202453926225910779?fid=18

Author: Pepper
