MNN has significantly optimized the performance of the ConvolutionDepthwise Int8 quantization operator on ARM V8 (64-bit) and ARM V8.2. The main optimization techniques are changing the data arrangement and using the sdot instruction. For brevity, DepthwiseConvInt8 is used below to refer to the ConvolutionDepthwise Int8 quantization operator. Measured on an Android phone before optimization, this quantized operator took three times as long as FP16 inference, which seriously hurt the performance of quantized models on the device side. This article explains the optimization of the DepthwiseConvInt8 operator from the perspective of ARM assembly.
▐ Optimizing the data arrangement on ARM V8
Introduction
The input data format of DepthwiseConvInt8 in MNN is NC4HW4 (hereinafter referred to as C4), that is, the data of every 4 channels are packed together. Likewise, we use C16 for the arrangement in which the data of every 16 channels are stored contiguously. The figure below illustrates the C4 arrangement: each small square represents one piece of data, the four colors represent four coordinate points on the feature map, and the numbers in the squares indicate their order in memory.
In general, we use both multi-threading and instruction-level parallelism in ARM assembly; this article focuses on the single-threaded, assembly-level dimension. In ARM assembly, the results of several points on the output feature map are usually computed at the same time, which requires reading the data of several input points at the same time. For example, when the stride of the Depthwise layer is 1 and the outputs of 4 coordinate points are computed together (4 coordinate points * 4 channels = 16 values in total), we need to read the data in the numbered squares of the figure above. When each value occupies 4 bytes (32 bits), the 4 values of one point exactly fill a vector register. In the Int8 quantized operator, however, each input value occupies only 1 byte (8 bits); with the C4 arrangement, 4 values cannot fill a vector register. Worse, when the stride of the Depthwise layer is not 1, the C4 arrangement makes it impossible to read multiple points of the feature map contiguously, which costs performance. Therefore, for Int8 input data on the ARM platform, we first change the C4 arrangement to C16, which guarantees that 16 Int8 values can be read contiguously to fill a vector register. The assembly code below shows the difference between the two reading patterns.
/* x0: source address; read 4 points on the feature map */
/* pack=4, stridex = 2, sizeof(inputData) = 1 */
mov x1, #8                 /* step between points: 2 * 4 bytes */
ld1 {v0.s}[0], [x0], x1    /* each load reads 4 bytes */
ld1 {v1.s}[0], [x0], x1
ld1 {v2.s}[0], [x0], x1
ld1 {v3.s}[0], [x0], x1
/* pack=16, stridex = 2, sizeof(inputData) = 1 */
mov x1, #32                /* step between points: 2 * 16 bytes */
ld1 {v0.16b}, [x0], x1     /* each load reads 16 bytes */
ld1 {v1.16b}, [x0], x1
ld1 {v2.16b}, [x0], x1
ld1 {v3.16b}, [x0], x1
As the code shows, both cases use the same four load instructions, but C16 reads four times as much data as C4 (64 bytes versus 16). So on the ARM V8 platform we changed pack=4 to pack=16. Although this adds a layout-conversion step during inference, the gain from better parallelism inside the ARM assembly outweighs that cost.
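For reference, below is a minimal C sketch of what the C4-to-C16 repacking step could look like, assuming the channel count is already padded to a multiple of 16. The function name and the (area, channel) parameters are illustrative assumptions, not MNN's actual API.

#include <stdint.h>
#include <string.h>

/* C4 layout:  [channel/4][area][4]
 * C16 layout: [channel/16][area][16]
 * Channel c = 4*(4z + j) + i maps to C16 group z, byte 4j + i. */
void repackC4ToC16(int8_t* dst, const int8_t* src, int area, int channel) {
    const int c16 = channel / 16;              /* number of 16-channel groups */
    for (int z = 0; z < c16; ++z) {            /* C16 group index */
        for (int x = 0; x < area; ++x) {       /* spatial position (h*w) */
            for (int j = 0; j < 4; ++j) {      /* 4 C4 groups per C16 group */
                const int8_t* s = src + ((4 * z + j) * (size_t)area + x) * 4;
                int8_t* d = dst + (z * (size_t)area + x) * 16 + j * 4;
                memcpy(d, s, 4);               /* move one 4-channel block */
            }
        }
    }
}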
ARM V8 performance improvement results
Huawei Mate40 Pro, ARM V8 | C4 arrangement (before optimization) | C16 arrangement (after optimization)
ConvolutionDepthwise Int8 quantization operator | 4.46 ms | 2.78 ms
▐ Using the sdot instruction on ARM V8.2 to improve performance
Why can the sdot instruction improve operator performance?
The sdot (signed dot product) instruction, available on ARM V8.2, multiplies four pairs of int8 values and adds their sum into one int32 lane, performing 16 multiply-accumulates in a single instruction. For a 3x3 kernel, the 9 multiply-accumulates per channel that previously required a 9-iteration smlal loop can therefore be covered by two sdot instructions plus one smlal for the remaining tap:
// 3x3 kernel, without sdot: 9 loop iterations
Loop_Kernel_H3:
Loop_Kernel_W3:
smlal v0.4s, v1.4h, v2.4h // widening multiply-accumulate into v0
// 3x3 kernel, with sdot: two sdot instructions cover taps 0-7
sdot v0.4s, v1.16b, v3.16b
sdot v0.4s, v2.16b, v4.16b
smlal v0.4s, v5.4h, v6.4h // the 9th tap
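In C intrinsics, the same 4+4+1 accumulation pattern can be sketched as follows. vdotq_s32 is the intrinsic for sdot; all names here are illustrative rather than MNN's internal ones, and the 4+4+1 grouping is the one summarized at the end of this article.

#include <arm_neon.h>  /* compile with -march=armv8.2-a+dotprod */

/* Accumulate a 3x3 kernel for 4 channels: in0/w0 hold taps 0-3 of each
 * channel, in1/w1 taps 4-7, in2/w2 tap 8 (one byte per channel). */
static inline int32x4_t dw3x3_acc(int8x16_t in0, int8x16_t w0,
                                  int8x16_t in1, int8x16_t w1,
                                  int8x8_t in2, int8x8_t w2) {
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, in0, w0);              /* sdot: taps 0-3 */
    acc = vdotq_s32(acc, in1, w1);              /* sdot: taps 4-7 */
    int16x4_t i9 = vget_low_s16(vmovl_s8(in2)); /* widen tap 8 to 16-bit */
    int16x4_t w9 = vget_low_s16(vmovl_s8(w2));
    return vmlal_s16(acc, i9, w9);              /* smlal: the 9th tap */
}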
Does using the sdot instruction bring additional overhead?
// v0.16b : [0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15]
// v1.16b : [0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15]
// v2 starts at 0; each 32-bit lane of v2 accumulates a 4-element dot product:
// v2.s[0] = v0.b[0]*v1.b[0] + v0.b[1]*v1.b[1]
//         + v0.b[2]*v1.b[2] + v0.b[3]*v1.b[3]
sdot v2.4s, v0.16b, v1.16b // v2.4s: [14, 126, 366, 734]
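These semantics can be verified with a small runnable program using the vdotq_s32 intrinsic; it assumes a toolchain targeting armv8.2-a with the dotprod extension (e.g. gcc -O2 -march=armv8.2-a+dotprod).

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    int8_t a[16];
    for (int i = 0; i < 16; ++i) a[i] = (int8_t)i;   /* [0,1,...,15] */
    int8x16_t v = vld1q_s8(a);
    int32x4_t acc = vdupq_n_s32(0);                  /* v2 starts at 0 */
    acc = vdotq_s32(acc, v, v);   /* sdot v2.4s, v0.16b, v1.16b */
    printf("%d %d %d %d\n",
           vgetq_lane_s32(acc, 0), vgetq_lane_s32(acc, 1),
           vgetq_lane_s32(acc, 2), vgetq_lane_s32(acc, 3));
    /* prints: 14 126 366 734 */
    return 0;
}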
The example shows that the four elements accumulated into one int32 lane must lie contiguously in memory. As described above, however, the operator's input arrangement is C4, in which the elements covered by one kernel window of a given channel are not contiguous in memory. So the extra overhead of using sdot is that the input data must first be rearranged.
How to rearrange data in ARM V8.2
When thinking about the rearrangement, we first simplify the problem so that the most efficient solution can be worked out before writing any code (provided you are familiar with ARM assembly instructions: the more of the instruction set you know, the more candidate solutions you have). The problem reduces to how to efficiently rearrange 16 int8_t values within a vector register, which on A64 maps naturally to the tbl table-lookup instruction.
[Figure: data arrangement before and after rearrangement]
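As a concrete illustration, here is a minimal sketch of such an in-register shuffle using the vqtbl1q_s8 intrinsic, the C counterpart of tbl. The function name and index pattern are illustrative assumptions; MNN's actual shuffle masks depend on the kernel geometry and stride.

#include <arm_neon.h>

/* Every byte of `idx` selects one source byte of `v`. The example
 * permutation gathers one byte from each 4-byte block so that values of
 * the same channel become contiguous. */
int8x16_t shuffle16(int8x16_t v) {
    static const uint8_t idx_bytes[16] = { 0, 4,  8, 12,
                                           1, 5,  9, 13,
                                           2, 6, 10, 14,
                                           3, 7, 11, 15 };
    const uint8x16_t idx = vld1q_u8(idx_bytes);
    return vqtbl1q_s8(v, idx);   /* tbl v0.16b, {v0.16b}, v1.16b */
}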
ARM V8.2 performance improvement results
▐ Summary of the differences between the ARM V8.2 and ARM V8 optimizations
The core of optimizing the DepthwiseConvInt8 operator on ARM V8.2 is rearranging the input data so that the sdot instruction can be used for accumulation. Because the computational complexity of the Depthwise operator is lower than that of regular Convolution, the time spent on data rearrangement has a larger relative impact on operator performance. After working the schemes out on paper and testing them in practice, we settled on the rearrangement with the lowest cost: when the input data is NC4HW4, the tbl instruction splits the 9 values covered by the 3x3 kernel into 3 groups; the first and second groups each hold 4 contiguously arranged values, and the last value is placed on its own. The optimization on ARM V8 is limited by the smaller set of available instructions, so for now it improves operator performance only by reducing data-read time.