Performance optimization of the Int8 quantization operator on mobile CPUs




This article introduces the performance optimization of the Depthwise Convolution Int8 operator on mobile CPUs. Upgrades of the ARM architecture and the corresponding instruction set keep raising the performance ceiling of each operator on mobile devices. Combining data rearrangement with the sdot instruction greatly improves the performance of the DepthwiseConv quantization operator.


Background

MNN has greatly optimized the performance of the ConvolutionDepthwise Int8 quantization operator on ARM V8 (64-bit) and ARM V8.2. The main optimization methods are changing the data layout and using the sdot instruction. For brevity, DepthwiseConvInt8 is used below to denote the ConvolutionDepthwise Int8 quantization operator. Before optimization, measured on an Android phone, this quantization operator took about three times as long as FP16 inference, which seriously hurt the on-device performance of quantized models. This article explains the optimization of the DepthwiseConvInt8 operator from the perspective of ARM assembly.


Locating the performance bottleneck and the optimization plan


ARM V8: optimizing the data layout


  • An Introduction


The input data format of DepthwiseConvInt8 in MNN is NC4HW4 (hereinafter referred to as C4), that is, the data of every 4 channels are packed together. Similarly, we use C16 to denote the layout in which the data of every 16 channels are stored contiguously. The following figure illustrates the C4 layout: each small square represents one element, the four colors represent four coordinate points on the feature map, and the numbers in the squares indicate their order in memory.
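To make the two layouts concrete in code, the following C sketch (illustrative only, not MNN's actual code) computes the linear offset of one element under a generic pack layout; pack=4 corresponds to C4 and pack=16 to C16.

#include <stddef.h>

/* Offset of element (c, h, w) in a packed layout: channels are grouped into
 * blocks of `pack`, and each (h, w) point stores `pack` consecutive channels
 * contiguously, i.e. memory order is [block][h][w][inner]. */
static size_t packedOffset(int c, int h, int w,
                           int height, int width, int pack) {
    int block = c / pack;   /* which channel block the element belongs to */
    int inner = c % pack;   /* position inside that block                 */
    return ((size_t)block * height * width + (size_t)h * width + w) * (size_t)pack
           + (size_t)inner;
}

/* C4:  packedOffset(c, h, w, H, W, 4)
 * C16: packedOffset(c, h, w, H, W, 16) */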



Generally, we use multi-threading together with ARM assembly for parallel computation. This article focuses on single-threaded performance optimization. In ARM assembly, the results of several points on the output feature map are usually computed at the same time, which requires reading the data of several points on the input feature map at the same time. For example, when the stride of the Depthwise layer is 1 and the results of 4 output coordinates are computed together (4 coordinate points * 4 channels = 16 elements), we need to read the corresponding squares in the figure above. When each element occupies 4 bytes (32 bits), 4 input elements exactly fill a vector register. However, in the Int8 quantization operator each input element occupies only 1 byte (8 bits); with the C4 layout, 4 elements cannot fill a vector register. In particular, when the stride of the Depthwise layer is not 1, the C4 layout prevents several elements on the feature map from being read contiguously, causing a performance loss. Therefore, for Int8 input data on the ARM platform we first change the C4 layout to C16, so that 16 Int8 elements can be read contiguously and fill a vector register. The assembly code below shows the difference between the two ways of reading data.


/* x0: source address; read 4 points on the feature map */
/* pack=4, stridex=2, sizeof(inputData)=1: each point occupies 4 bytes and the
   next point is 8 bytes away, so x1 = 8 (ld1 single-lane post-index only
   accepts a register for a step other than the element size) */
ld1 {v0.s}[0], [x0], x1
ld1 {v1.s}[0], [x0], x1
ld1 {v2.s}[0], [x0], x1
ld1 {v3.s}[0], [x0], x1

/* pack=16, stridex=2, sizeof(inputData)=1: each point occupies 16 bytes and
   the next point is 32 bytes away, so x2 = 32 */
ld1 {v0.4s}, [x0], x2
ld1 {v1.4s}, [x0], x2
ld1 {v2.4s}, [x0], x2
ld1 {v3.4s}, [x0], x2


The same 4 load instructions are used in both cases, but with C16 each instruction reads 16 bytes instead of 4, i.e. four times as much data as C4. For the ARM V8 platform we therefore changed pack=4 to pack=16. Although this adds a layout-conversion step during inference, the gain from better parallelism inside the ARM assembly outweighs that cost.
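The pack=4 to pack=16 conversion mentioned above can be sketched in plain C as follows; the function name and signature are illustrative assumptions rather than MNN's actual API.

#include <stdint.h>
#include <string.h>

/* Repack an int8 feature map from C4 to C16: every C16 block gathers 4
 * consecutive C4 blocks; channels beyond `channel` stay zero-padded. */
static void repackC4ToC16(int8_t* dst, const int8_t* src,
                          int channel, int area /* = height * width */) {
    int c4Blocks  = (channel + 3) / 4;
    int c16Blocks = (channel + 15) / 16;
    memset(dst, 0, (size_t)c16Blocks * area * 16);
    for (int b = 0; b < c16Blocks; ++b) {
        for (int i = 0; i < area; ++i) {
            for (int j = 0; j < 4; ++j) {          /* 4 source C4 blocks per C16 block */
                int srcBlock = b * 4 + j;
                if (srcBlock >= c4Blocks) break;
                memcpy(dst + ((size_t)b * area + i) * 16 + j * 4,
                       src + ((size_t)srcBlock * area + i) * 4, 4);
            }
        }
    }
}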


  • ARM V8 performance improvement results


All performance data in this article come from tests of a face-beautification model on a Huawei Mate40 Pro. The model contains 23 ConvolutionDepthwise operators, all with 3x3 kernels; 19 of them have stride=1 and 4 have stride=2. The operator time recorded in each table is the total time spent in all 23 ConvolutionDepthwise operators during one inference of this model, in ms.

| Huawei Mate40 Pro ARM V8 | C4 layout (before optimization) | C16 layout (after optimization) |
| Convolution Depthwise Int8 quantization operator | 4.46 ms | 2.78 ms |

Changing the data layout alone gives the operator a speedup of 1.6x.

ARM V8.2: using the sdot instruction to improve performance


  • Why can the sdot instruction improve operator performance?


Although the performance gains of the DepthwiseConvInt8 operator on the ARM V8 platform are already significant, the ARM V8.2 instruction set leaves more room for accelerating CPU operators on ARM. At the same time, as ARM V8.2 becomes common on flagship phones, floating-point models can use FP16 inference to greatly improve on-device performance, so quantized models also need further optimization to remain competitive. The DepthwiseConvInt8 operator consists mainly of multiply-accumulate operations. Taking a 3x3 kernel as an example, each output requires 9 multiplications whose products are then accumulated, which costs 9 multiply-accumulate instructions per output. With the sdot instruction provided by ARM V8.2, the same output can be obtained with only 3 instructions. The assembly code below shows the difference between the two:
// 3x3 kernel, without sdot: 9 loop iterations
Loop_Kernel_H3:
  Loop_Kernel_W3:
    smlal v0.4s, v1.4h, v2.4h   // the accumulated result is kept in v0

// 3x3 kernel, with sdot
sdot  v0.4s, v1.16b, v3.16b
sdot  v0.4s, v2.16b, v4.16b
smlal v0.4s, v5.4h, v6.4h
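As a scalar reference, the following C sketch (illustrative, assuming the C4 layout and stride 1) spells out the arithmetic that both assembly variants above perform for one output point: 9 int8 multiplications per channel, accumulated in int32.

#include <stdint.h>

/* One output point of a 3x3 DepthwiseConvInt8 in C4 layout (4 channels).
 * `input` points at the top-left of the 3x3 window, `weight` holds the 3x3
 * kernel in the same C4 order, `srcRowStride` is the byte distance between
 * input rows. With stride s, the horizontal step would be s * 4 bytes. */
static void dwConv3x3Int8Ref(int32_t acc[4], const int8_t* input,
                             const int8_t* weight, int srcRowStride) {
    for (int c = 0; c < 4; ++c) acc[c] = 0;
    for (int ky = 0; ky < 3; ++ky) {
        for (int kx = 0; kx < 3; ++kx) {
            const int8_t* in = input + ky * srcRowStride + kx * 4;
            const int8_t* wt = weight + (ky * 3 + kx) * 4;
            for (int c = 0; c < 4; ++c) {
                acc[c] += (int32_t)in[c] * (int32_t)wt[c];   /* one smlal-style MAC */
            }
        }
    }
}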

  • Does using the sdot instruction bring additional overhead?


Let's first look at how the sdot instruction accelerates the multiply-accumulate of several elements. The following code shows the calculation principle of the sdot instruction.

// v0.16b : [0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15]
// v1.16b : [0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15]
// v2.s[0] += v0.b[0]*v1.b[0] + v0.b[1]*v1.b[1]
//          + v0.b[2]*v1.b[2] + v0.b[3]*v1.b[3]
sdot v2.4s, v0.16b, v1.16b   // with v2 zeroed beforehand, v2.4s: [14,126,366,734]
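For reference, here is a scalar C model of the same dot-product semantics (on ARM V8.2 all four lanes are handled by a single sdot instruction, exposed in C as the vdotq_s32 intrinsic). With both inputs equal to [0..15] and a zeroed accumulator it reproduces [14, 126, 366, 734].

#include <stdint.h>

/* Each 32-bit accumulator lane receives the dot product of the corresponding
 * group of 4 int8 lanes, added on top of its previous value. */
static void sdotRef(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int k = 0; k < 4; ++k) {
            acc[lane] += (int32_t)a[lane * 4 + k] * (int32_t)b[lane * 4 + k];
        }
    }
}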

The example shows that the four elements to be multiplied and accumulated into one lane must be contiguous in the register, and therefore read contiguously from memory. As mentioned above, however, the input layout of this operator is C4, so the elements inside one kernel window are not contiguous in memory.


Before the sdot instruction can be used, the data must therefore be rearranged to obtain correct results. Note that this rearrangement is completely different from the one used in the ARM V8 optimization: the ARM V8 rearrangement described above only changes C4 into C16 along the channel dimension, whereas the rearrangement required before sdot on ARM V8.2 groups the 9 elements of a kernel window so that every 4 of them lie consecutively. The following diagram explains the rearrangement rule. We assume a 3x3 kernel here, and this assumption holds for the rest of the article.


Obviously, the time spent on data rearrangement strongly affects operator performance, so efficient data rearrangement is the key step in the ARM V8.2 DepthwiseConvInt8 optimization.

  • How to rearrange data in ARM V8.2


When thinking about the rearrangement, we simplify the problem and work out the most efficient solution on paper before writing any code (provided you are familiar with the ARM instruction set; the more instructions you know, the more candidate solutions you have). The problem can be abstracted as: how to efficiently rearrange 16 int8_t elements inside one vector register. The layouts before and after rearrangement are compared below:
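On AArch64, arbitrarily permuting the 16 int8_t lanes of one vector register can be done with a single tbl instruction. The sketch below uses the corresponding vqtbl1q_s8 intrinsic; the index pattern shown (a 4x4 byte transpose within the register) is only an illustration, not necessarily the exact permutation MNN uses.

#include <arm_neon.h>
#include <stdint.h>

/* Permute 16 bytes with one table lookup: output byte i = src byte idx[i]. */
static int8x16_t permute16(int8x16_t src) {
    static const uint8_t idx[16] = { 0, 4, 8, 12,  1, 5, 9, 13,
                                     2, 6, 10, 14, 3, 7, 11, 15 };
    return vqtbl1q_s8(src, vld1q_u8(idx));
}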


Coming back from the abstract problem to the 3x3 kernel: 9 elements ultimately have to be multiplied and accumulated, so we rearrange the kernel-window elements in groups of 4 and accumulate the remaining one with the smlal instruction. The corresponding elements of the weight matrix can be rearranged before inference, so that step costs no inference time. Unlike the ARM V8 optimization, the rearrangement on ARM V8.2 is based on C4 rather than C16: both C8 and C16 would require more rearrangement instructions and thus more overhead, so the optimal rearrangement schemes for C8 and C16 are not discussed here.
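One arrangement consistent with the rule above (two groups of 4 plus a remainder, grouped per channel so that each sdot lane accumulates a single channel) could look like the following sketch; the helper and its buffer layout are hypothetical, not MNN's actual code.

#include <stdint.h>

/* src[p * 4 + c]: kernel tap p (0..8) of channel c (0..3), i.e. one 3x3
 * window already gathered in C4 (point-major) order. The output places taps
 * 0..3 and 4..7 of each channel contiguously for two sdot instructions, and
 * keeps tap 8 separately for a final smlal-style accumulation. */
static void rearrangeWindow3x3C4(int8_t group0[16], int8_t group1[16],
                                 int8_t tail[4], const int8_t src[36]) {
    for (int c = 0; c < 4; ++c) {
        for (int k = 0; k < 4; ++k) {
            group0[c * 4 + k] = src[k * 4 + c];        /* taps 0..3 of channel c */
            group1[c * 4 + k] = src[(k + 4) * 4 + c];  /* taps 4..7 of channel c */
        }
        tail[c] = src[8 * 4 + c];                      /* tap 8 of channel c     */
    }
}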

Because the data rearrangement step is indispensable, the prerequisite for using the sdot instruction in this operator is that the kernel size is known in advance. Currently, MNN's ARM V8.2 optimization of DepthwiseConvInt8 only supports 3x3 kernels.


  • ARM V8.2 performance improvement results


| Huawei Mate40 Pro | ARM V8, C4 layout (before optimization) | ARM V8, C16 layout (optimized) | ARM V8.2, sdot (optimized) |
| Convolution Depthwise Int8 quantization operator | 4.46 ms | 2.78 ms | 1.75 ms |
By using the sdot instruction, the operator reaches a speedup of 2.55x.

Summary of the differences between the ARM V8.2 and ARM V8 optimizations


The core of optimizing the DepthwiseConvInt8 operator on ARM V8.2 is rearranging the input data so that the sdot instruction can be used for accumulation. Because the computational complexity of the Depthwise operator is lower than that of ordinary Convolution, the time spent on data rearrangement has a larger relative impact on operator performance. After working the schemes out on paper and testing them in practice, we settled on the rearrangement with the lowest cost: when the input data is in NC4HW4 layout, the tbl instruction splits the 9 elements covered by the 3x3 kernel into 3 groups, where the first and second groups each hold 4 consecutively arranged elements and the last element is placed on its own. The optimization on ARM V8 is limited by the available instructions, so for now it improves operator performance only by reducing data-loading time.


Summary


Upgrades of the ARM architecture and its instruction set keep raising the performance ceiling of every operator on mobile devices. On ARM V8.2, the performance of the ConvolutionDepthwise Int8 quantization operator is now close to optimal, but because of the extra overhead of data rearrangement, the Int8 quantization operator still cannot surpass half-precision (FP16) floating-point inference.


Team introduction

The Meta Team of Taobao Technology (大淘宝技术) is responsible for building 3D/XR foundational technologies for consumer scenarios and exploring innovative applications, aiming to create new 3D/XR shopping experiences on mobile phones and new XR devices through technical and application innovation. The team has deep expertise in on-device intelligence, product 3D reconstruction, 3D engines and XR engines, and has released the on-device inference engine MNN, the on-device real-time vision algorithm library PixelAI, and the product 3D reconstruction tool Object Drawer, among others. The team has published papers at top conferences and in journals such as OSDI, MLSys, CVPR, ICCV, NeurIPS and TPAMI.
