Behind the Open Source | How Do Alibaba Engineers Tackle the Challenges of On-Device Inference Engines?


Editor's note: MNN (Mobile Neural Network) was officially open-sourced on GitHub on May 7 this year. At the GMTC Global Front-End Technology Conference, Taobao wireless development expert Chen Yiliu (nickname: Liqing) shared the thinking and lessons behind MNN's development and open-sourcing, together with Taobao's practical experience with mobile AI. From this talk you can learn about the preconditions and scenarios for applying AI on mobile, as well as the optimization strategies of an on-device inference engine, and gain a deeper understanding of deep learning on mobile/IoT devices.

Open Source and Background


Since 2006, artificial intelligence has been riding its third wave. With AlphaGo defeating Lee Sedol in 2016 and Ke Jie in 2017, artificial intelligence fully entered the public eye. Behind the AI boom lie the accumulation of big data, the development of deep learning, and the growth of device computing power. Meanwhile, deep learning frameworks have kept evolving: from Torch and Caffe to TensorFlow and PyTorch, and then to mobile-oriented ones such as CoreML, NNAPI, NCNN, and MACE. Taobao's deep learning inference engine MNN was also open-sourced in May 2019.


MNN is a lightweight on-device deep learning inference engine. It solves the core problem of running deep neural network models for inference on end devices, covering model optimization, conversion, and inference. Currently, MNN is used in more than 20 apps, including Mobile Taobao, Mobile Tmall, Youku, Juhuasuan, UC, Fliggy, and Qianniu, covering scenarios such as live streaming, short video, search recommendation, product image search, interactive marketing, benefits issuance, and security risk control, and it runs stably hundreds of millions of times a day. It has also been applied on IoT devices such as Cainiao pickup lockers. During the 2018 Double 11 shopping festival, MNN powered scenarios such as Tmall's smile-to-win red packets and the scan-to-play celebrity rock-paper-scissors battle.

Open Source Address

The project is open-sourced on GitHub, where you can get the download link and more details.


The MNN project started in 2017. After more than a year of development and iteration, and having passed the test of Taobao's Double 11, we began preparing for open source at the end of 2018; after about half a year of open-source preparation, MNN was officially open-sourced on GitHub in May this year.

We open-sourced MNN first of all because, after going through Double 11, we felt we were ready, and open source helps spur us on to make MNN even better. On the other hand, the open-source solutions in the industry, whether TensorFlow Lite, NCNN, or MACE, have all given us valuable input and reference, and we hope that through open source we can give our own thinking and innovations back to the community.

The rest of this article centers on MNN to introduce some of Taobao's practical experience with mobile AI.

Challenges and Responses


Among the challenges an on-device inference engine faces, fragmentation is the most prominent, and this fragmentation is multi-level and multi-dimensional:

  • Training frameworks: Caffe, TensorFlow, PyTorch, and MXNet are all commonly used to train models;
  • Computing devices: CPUs and GPUs are mainstream, NPUs and TPUs are gradually becoming standard, and DSPs and FPGAs are also common on IoT;
  • Operators: numerous parameters form different combinations, each calling for different optimizations, so a trade-off between being lightweight and being general is required.


An excellent on-device inference engine must, in such a fragmented environment, use the device's limited resources to extract as much performance as possible. To this end, corresponding optimization strategies are needed in conversion, scheduling, and execution. The following sections elaborate on some of them.

Conversion Tool

Model Optimization


In model optimization, MNN introduces the concept of frontends to unify training frameworks. Each frontend is responsible for loading models from one training framework and converting them into MNN's model format. For the most commonly used training frameworks, TensorFlow and Caffe, we provide dedicated frontends; models from other training frameworks, such as MXNet, need to be converted to ONNX first and then loaded through the ONNX frontend. Because TensorFlow's operator granularity is finer than that of Caffe and ONNX, we introduce a graph-optimization module to align operator granularity. After conversion, the model goes through the optimizer, which performs operator fusion, operator replacement, layout adjustment, and so on. After that, quantization compression can optionally be applied to the floating-point model. The model-compression module has not been open-sourced yet; we will release the code once it is polished. When all these steps are done, FlatBuffers is used to serialize the deployment model.
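
Conceptually, the frontend abstraction might be sketched as below. This is illustrative pseudostructure with hypothetical type names, not MNN's converter code; it only shows the dispatch idea described above.

```cpp
#include <memory>
#include <string>

struct Graph { /* stand-in for the unified intermediate representation */ };

// Each frontend loads one training framework's model into the unified graph.
struct Frontend {
    virtual ~Frontend() = default;
    virtual Graph load(const std::string& modelPath) = 0;
};
struct TFFrontend : Frontend {      // dedicated frontend; finer op granularity
    Graph load(const std::string&) override { return {}; }
};
struct CaffeFrontend : Frontend {   // dedicated frontend
    Graph load(const std::string&) override { return {}; }
};
struct OnnxFrontend : Frontend {    // MXNet and others export to ONNX first
    Graph load(const std::string&) override { return {}; }
};

std::unique_ptr<Frontend> makeFrontend(const std::string& format) {
    if (format == "tf")    return std::make_unique<TFFrontend>();
    if (format == "caffe") return std::make_unique<CaffeFrontend>();
    return std::make_unique<OnnxFrontend>();  // everything else via ONNX
}
```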

Graph Optimization


Here we take the RNN-GRU cell as an example to illustrate graph optimization.

The left figure is the visualization of an RNN-GRU cell in TensorBoard. It contains no fewer than 3584 nodes, and every node represents some amount of data read/write or computation, which adds up to a very large total. However, all of these nodes can be replaced with a single coarse-grained operator. This not only greatly reduces the size of the deployment model, but also lets the implementation of the coarse-grained operator aggregate a large amount of computation and avoid unnecessary data reads and writes.

The right figure shows the performance of a real business model before and after graph optimization. On the Huawei P10, Redmi 3X, and Xiaomi 6, performance roughly doubles, and for a bidirectional GRU the effect is even more pronounced.

Operator Fusion


Next, take Convolution, Batchnorm, Scale, and ReLU as an example of operator fusion in the optimizer.

First, Convolution and Batchnorm are fused: the Convolution weight becomes weight times alpha, and the bias becomes bias times alpha plus beta. Then Convolution and Scale are fused, in a process similar to Batchnorm. Finally, Convolution and ReLU are fused: the ReLU activation is simply computed before the result is written out.

In this way, the four operators are merged into one. The fusion avoids three tensor reads/writes and two tensor multiply-adds. As for the effect: MobileNet V1 gains 20~40% performance on the Xiaomi 5 and Huawei P10, which is quite significant.
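
A minimal sketch of the Conv+Batchnorm folding arithmetic described above (illustrative code, not MNN's implementation; it assumes the usual Batchnorm parameterization, from which alpha and beta are derived):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fold BatchNorm into the convolution's weight and bias, per output channel c:
//   alpha = gamma / sqrt(var + eps),  beta = bn_bias - mean * alpha
//   W'[c] = W[c] * alpha,             b'[c] = b[c] * alpha + beta
void fuseConvBatchNorm(std::vector<float>& weight,       // [outC * inC * kh * kw]
                       std::vector<float>& bias,         // [outC]
                       const std::vector<float>& gamma,  // BN scale
                       const std::vector<float>& bnBias, // BN shift
                       const std::vector<float>& mean,
                       const std::vector<float>& var,
                       float eps = 1e-5f) {
    const size_t outC = bias.size();
    const size_t perChannel = weight.size() / outC;
    for (size_t c = 0; c < outC; ++c) {
        const float alpha = gamma[c] / std::sqrt(var[c] + eps);
        const float beta  = bnBias[c] - mean[c] * alpha;
        for (size_t i = 0; i < perChannel; ++i)
            weight[c * perChannel + i] *= alpha;
        bias[c] = bias[c] * alpha + beta;
    }
}
```

Fusing Scale is the same arithmetic with alpha and beta taken directly from the Scale parameters, and fusing ReLU simply clamps the result at write-out time.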

Intelligent Scheduling

Overall Design


For scheduling, MNN abstracts each class of computing device as a backend, and the implementation of an operator on a particular backend as an executor. The backend is responsible for resource allocation and computation scheduling on its device; the executor is responsible for the concrete implementation. Backends and operators are both added through registries, in a two-level registry structure, which makes extension relatively flexible.

At scheduling time, a backend can be selected for each subgraph, and the backend then creates the corresponding executors, which together form a pipeline; a group of backends can also be selected for a subgraph to achieve hybrid scheduling. For example, when the GPU is ill-suited to implementing a sort operator, execution can fall back to the CPU.

Currently, MNN implements 76 operators on CPU and 55 on Metal; the OpenGL backend covers common CNN networks, while OpenCL and Vulkan implement 29 and 31 operators respectively.
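
A minimal sketch of the two-level registry idea, including the CPU fallback used for hybrid scheduling (simplified stand-in types, not MNN's actual classes):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Execution { virtual ~Execution() = default; };  // operator impl on a backend
struct Backend   { virtual ~Backend()   = default; };  // device resources/scheduling

enum class ForwardType { CPU, Metal, OpenGL, OpenCL, Vulkan };
using OpType           = std::string;
using BackendCreator   = std::function<std::unique_ptr<Backend>()>;
using ExecutionCreator = std::function<std::unique_ptr<Execution>()>;

// Level 1: forward type -> backend creator.
std::map<ForwardType, BackendCreator>& backendRegistry() {
    static std::map<ForwardType, BackendCreator> r;
    return r;
}
// Level 2: (forward type, op type) -> executor creator.
std::map<ForwardType, std::map<OpType, ExecutionCreator>>& executionRegistry() {
    static std::map<ForwardType, std::map<OpType, ExecutionCreator>> r;
    return r;
}

// Pick the executor for an op on the chosen backend; fall back to CPU
// when the chosen backend has no implementation for this operator.
std::unique_ptr<Execution> createExecution(ForwardType type, const OpType& op) {
    auto& table = executionRegistry();
    auto it = table[type].find(op);
    if (it == table[type].end()) {
        it = table[ForwardType::CPU].find(op);  // CPU fallback
        if (it == table[ForwardType::CPU].end()) return nullptr;
    }
    return it->second();
}
```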

Cache Management



Once the executors have been created, the subgraphs and pipelines are in place. Next, the shapes of all tensors are computed and memory for them is allocated on the respective backends. Then, when the executors are prepared, each executor pre-allocates the buffers it needs on its backend. After that, running the pipeline only requires returning the output tensors.

Since all the memory required for inference has been allocated during preparation, subsequent inferences can reuse the tensors and buffers as long as the input shapes stay the same, avoiding frequent allocation and deallocation; only when an input shape changes do we need to start again from shape computation and adjust the memory allocation. At the same time, because the backend manages the caches of all its executors in a unified way, caches can be fully reused across executors, which greatly reduces memory requirements. In addition, MNN aligns memory to 32 bytes by default when allocating, which benefits memory reads and writes.
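
For illustration, a common way to implement 32-byte-aligned allocation looks like the sketch below (illustrative only; MNN's actual allocator additionally handles the cache reuse described above):

```cpp
#include <cstdint>
#include <cstdlib>

// Over-allocate, round the pointer up to a 32-byte boundary, and stash
// the raw pointer just before the aligned block so it can be freed later.
void* allocAligned32(size_t size) {
    const size_t align = 32;
    void* raw = std::malloc(size + align + sizeof(void*));
    if (!raw) return nullptr;
    uintptr_t base    = reinterpret_cast<uintptr_t>(raw) + sizeof(void*);
    uintptr_t aligned = (base + align - 1) & ~static_cast<uintptr_t>(align - 1);
    reinterpret_cast<void**>(aligned)[-1] = raw;  // remember the raw pointer
    return reinterpret_cast<void*>(aligned);
}

void freeAligned32(void* p) {
    if (p) std::free(reinterpret_cast<void**>(p)[-1]);
}
```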

Execution Optimization

Data Layout and Sliding-Window Convolution


Data layout has a tremendous impact on performance.

First, consider how a 3x3 depthwise convolution is accelerated with SIMD under the NCHW layout.

First, when reading data, four floats of the first row are read at a time, and the second and third rows likewise; the three rows read this way are enough to compute two outputs, i.e., part of the data can be reused. Then, to further improve data reuse, a fourth row can be read so that two rows and two columns of outputs are computed at a time, i.e., loop unrolling can be introduced. However, the leftover boundary elements cannot be computed with SIMD and have to be finished one by one in a scalar loop. Proceeding in this way, the remaining channels can be computed one after another.

However, under the NCHW layout, SIMD acceleration cannot be fully exploited; moreover, the more branches the optimizations introduce, the larger the package size becomes.


Now consider what SIMD acceleration looks like under the NC/4HW4 layout.

这里的 "C/4" 指的是按照 4 个通道对齐的方式重排数据。重排所有输入和权重数据后,每次 SIMD 读写都天然是 4 个通道的输入数据和 4 个通道的权重数据。这样,不论 kernel、stride、dilation 怎么变化,我们都可以简单地使用 for 循环和 SIMD 的一套通用优化完成卷积计算。既不会有边缘数据无法加速的问题,也不会对包大小造成影响。

Winograd


For KxK convolution, the Winograd algorithm can be used for further acceleration. MNN supports Winograd implementations from 2x2 up to 7x7. When computing with Winograd, the output is split into NxN tiles and the input into (N+K-1)x(N+K-1) tiles. The problem is thereby reduced to convolutions between two small matrices.


Applying the Winograd formulas then converts the convolution between matrices into element-wise matrix multiplication. In this process, besides the element-wise product, three matrix transforms are introduced: transforms of the input matrix d, the weight matrix g, and the result matrix Y'. For the weight transform, the G matrix can be computed with the Chinese remainder theorem, so GgGT can be computed ahead of time when the executor is prepared; the A and B matrices used in the input and output transforms depend on N and K, and several optimized combinations are built into the code, so in actual computation these two transforms do not require full matrix multiplications.
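
In standard notation (as in the usual presentation of Winograd convolution), one output tile is computed as below; this matches the d, g, and Y transforms described above, with B, G, and A determined by N and K:

```latex
% Winograd convolution of one tile, F(N x N, K x K):
%   d: (N+K-1) x (N+K-1) input tile, g: K x K weight tile, Y: N x N output tile
%   GgG^T can be precomputed; \odot denotes the element-wise product
Y = A^{T}\left[\left(G\,g\,G^{T}\right)\odot\left(B^{T}\,d\,B\right)\right]A
```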

In this way, the 9x4 multiplications originally needed for the matrix convolution can be replaced by the 4x4 multiplications of the element-wise product. Counting only multiplication time, that is a 2.25x speedup. In this example K=3 and N=2, but in practice a larger N can be chosen for a higher speedup ratio, at the cost of more memory.

Strassen


MNN may well be the first on-device inference engine to apply the Strassen algorithm to optimize matrix multiplication.

When computing a matrix product, Strassen first splits each matrix evenly into four submatrices. Denoting the submatrices a11 ~ a22, b11 ~ b22, and c11 ~ c22, the computation naively requires 8 submatrix multiplications.

Intermediate submatrices s1 ~ s4, t1 ~ t4, m1 ~ m7, and u1 ~ u7 can be introduced. Among them, only m1 ~ m7 involve submatrix multiplications, 7 in total; all the others involve only submatrix additions and subtractions. In other words, 4 + 4 + 7 submatrix additions/subtractions replace one submatrix multiplication.
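
The s/t/m/u naming and the 4 + 4 + 7 count correspond to Winograd's variant of the Strassen algorithm (7 multiplications, 15 additions/subtractions). For reference, one standard formulation is given below; the exact assignment of intermediates in MNN's slides may differ:

```latex
\begin{aligned}
s_1 &= a_{21} + a_{22}, & s_2 &= s_1 - a_{11}, & s_3 &= a_{11} - a_{21}, & s_4 &= a_{12} - s_2,\\
t_1 &= b_{12} - b_{11}, & t_2 &= b_{22} - t_1, & t_3 &= b_{22} - b_{12}, & t_4 &= t_2 - b_{21},\\
m_1 &= a_{11} b_{11},   & m_2 &= a_{12} b_{21}, & m_3 &= s_4 b_{22},     & m_4 &= a_{22} t_4,\\
m_5 &= s_1 t_1,         & m_6 &= s_2 t_2,       & m_7 &= s_3 t_3,        & &\\
u_1 &= m_1 + m_2,       & u_2 &= m_1 + m_6,     & u_3 &= u_2 + m_7,      & u_4 &= u_2 + m_5,\\
u_5 &= u_4 + m_3,       & u_6 &= u_3 - m_4,     & u_7 &= u_3 + m_5,      & &\\
c_{11} &= u_1,          & c_{12} &= u_5,        & c_{21} &= u_6,         & c_{22} &= u_7.
\end{aligned}
```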

Compared with ordinary matrix multiplication, Strassen reduces the time complexity from n cubed to n to the power of 2.81. The larger the matrices, the more matrix multiplication dominates matrix addition/subtraction, and the more pronounced the gain.


In MNN, we apply Strassen recursively, i.e., we split matrices recursively: when a matrix is large enough, it is split further; when it is not, the ordinary matrix multiplication algorithm is used. The saved submatrix multiplication counts as the benefit, and the total cost of the additions/subtractions on the s, t, and u submatrices counts as the cost; when the benefit exceeds the cost, the Strassen algorithm is worth considering.
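
A sketch of that benefit-versus-cost decision (hypothetical cutoff and cost model for an n x n split, not MNN's actual heuristic):

```cpp
#include <cstddef>

// Recurse with Strassen only while the saved (n/2)^3 sub-multiplication
// outweighs the 15 extra (n/2)^2 add/subs for the s, t and u matrices
// (4 + 4 + 7); below the cutoff, fall back to the ordinary algorithm.
bool worthStrassen(size_t n) {
    const double half   = static_cast<double>(n / 2);
    const double benefit = half * half * half;  // one submatrix multiply saved
    const double cost    = 15.0 * half * half;  // extra add/sub traffic
    return n >= 64 && benefit > cost;           // 64: assumed minimum size
}
```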

Pipeline Optimization


A good example of pipeline optimization is Taobao's "scan New Year goods" campaign during the 2019 Spring Festival. After the phone camera input is obtained, each frame first goes through preprocessing to scale the image to the input size of the New Year goods detection model, and then through inference to decide whether the image contains New Year goods; if it does, the related rewards are issued. In this process, the time spent on image preprocessing is not negligible. Reducing it helps us raise the frame rate and thus improve the user experience. For this, we introduced a lightweight 2D image processing library that can efficiently perform color-value changes, color-space conversion, affine transformations, and so on. With it, MNN users no longer need to pull in libyuv or OpenCV for image processing.
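
For reference, using MNN's bundled image processing module looks roughly like the sketch below (based on the public ImageProcess interface seen in MNN's demos; exact signatures may differ across versions, and the format choices here are assumptions):

```cpp
#include <cstdint>
#include <memory>
#include <MNN/ImageProcess.hpp>

// Scale a camera frame (assumed RGBA) to the model's input size and write
// it into the input tensor, without libyuv/OpenCV.
void preprocess(const uint8_t* rgba, int srcW, int srcH,
                int dstW, int dstH, MNN::Tensor* inputTensor) {
    MNN::CV::ImageProcess::Config config;
    config.sourceFormat = MNN::CV::RGBA;  // camera frame format (assumed)
    config.destFormat   = MNN::CV::BGR;   // model input format (assumed)
    std::shared_ptr<MNN::CV::ImageProcess> pretreat(
        MNN::CV::ImageProcess::create(config));

    // Affine transform mapping destination pixels back to source pixels
    MNN::CV::Matrix trans;
    trans.setScale(static_cast<float>(srcW) / dstW,
                   static_cast<float>(srcH) / dstH);
    pretreat->setMatrix(trans);

    // Color conversion and resampling in one pass
    pretreat->convert(rgba, srcW, srcH, 0, inputTensor);
}
```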

Performance Comparison


After all these optimizations, this is the report card we can hand in on performance.

We ran a series of performance comparisons with MobileNet V2 on the OPPO R17 and iPhone 7 Plus.

As the figure shows, MNN holds a clear advantage on both CPU and GPU.

Summary


In summary, MNN absorbs the experience of its predecessors and, combined with our own understanding of on-device inference engines, makes many innovations. Taking into account performance, model and backend extensibility, caching, CPU and GPU operator implementations, and the CV library, MNN is a choice well worth trying for mobile AI users among on-device inference engines.

Future Plans


For the conversion part, we plan to add more operators and more graph-optimization matching templates, and we also plan to open-source the model quantization tool. For the scheduling part, we plan to implement on-device learning and training, and automatic selection of computing devices is also planned. For the execution part, we will keep optimizing the existing operators, plan to optimize quantized convolution and the matrix multiplication algorithm, plan to support running the CV library directly on the GPU, and are considering organizing the existing NC/4HW4 algorithms into a standalone high-performance computing library; automatic algorithm selection is likewise in the plan. For other aspects such as usability, we will continue building up the project and keep adding more documentation and examples.

The client-side intelligence team of Taobao's basic platform unit welcomes inference-engine R&D engineers, AR R&D engineers, and high-performance-computing R&D engineers to join. If you are interested in new technology, good at innovation and breakthroughs, and eager to create innovative user experiences with new technology, please send your resume to [email protected].

Original publication date: 2019-07-2
Author: Chen Yiliu (Liqing)
This article is from the Yunqi Community partner "Alibaba Technology". For related information, follow "Alibaba Technology".


Source: yq.aliyun.com/articles/707074