ARM 32-bit Architecture Optimization

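Preface
This article introduces NEON assembly optimization on the 32-bit ARM architecture and is meant to be readable regardless of background.
Note: when compiling for embedded devices (that is, ARM boards), it is best to pass -fsigned-char, because plain char is unsigned by default on ARM rather than signed. Also, when compiling ARM assembly source files, add the -c option so that the compiler produces an object file without linking.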
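
To make the effect of -fsigned-char concrete, here is a minimal C sketch. The function names are hypothetical and chosen only for this illustration. It shows that widening a byte through plain char gives a result that depends on the target, while spelling out signed char does not.

```c
#include <limits.h>

/* Returns 1 when plain `char` is signed for this compiler/target, 0 otherwise. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widens one byte through plain `char`; the result depends on its signedness:
   0xFF becomes -1 when char is signed and 255 when it is unsigned. */
int widen_plain_char(unsigned char byte)
{
    char c = (char)byte;
    return (int)c;
}

/* Same widening with the signedness spelled out, so every target agrees. */
int widen_signed_char(unsigned char byte)
{
    return (int)(signed char)byte;
}
```

Code that stores pixel differences or other negative values in plain char behaves like widen_plain_char, which is why the flag (or an explicit signed char) matters on ARM.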

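1. Getting to know ARM assembly syntax

Pure ARM assembly comes in two syntaxes, armasm and GNU asm. This article uses the GNU asm syntax throughout.

1.1 Common syntax

(1) Defining a function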

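    .text
    .align  4                       /* on ARM this is a power of two: 2^4 = 16-byte alignment */
    .global name
    .type   name, %function         /* .type needs the symbol name before the type */
name:

     FUNCTION STATEMENT

     bx lr                          /* return to the caller */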

(2) Defining a macro

   .macro  name arg1, arg2, arg3
        ldr        r0,  \arg1                 /* load the destination address into r0 */
        vst1.32    {\arg2\()[0]}, [r0]        /* store lane 0 of arg2 (e.g. d0) to [r0] */
   .endm

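This example shows that a macro parameter is referenced with a leading backslash, as in \arg1. Where the parameter name runs directly into other text, use \() to mark where the name ends. Suppose arg2 is the d0 register and lane d0[0] has to be stored to the address held in r0: write \arg2\()[0] rather than \arg2[0], so that the assembler cannot take arg2[0] as the parameter name.

(3) Using .ltorg
If the literal pool ends up too far from the code, a function that loads a constant from it will fail to assemble with an out-of-range error. In that case, add .ltorg just before the function, after the end of the previous function.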

.ltorg Insert the literal pool of constants at this point in the program. The literal pool is used by the ldr = and adrl assembly language pseudo-instructions and is specific to the ARM. Using this assembler directive is almost always optional, as the GNU Assembler is smart enough to figure out when and where to put any literal pool. However, there are situations when it is very useful to include this directive, such as when you need absolute control over where the assembler places your code.

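(4) Comments
Note: several comment styles exist, but to make it easier to port ARM32 optimized code to ARM64, prefer "//" or "/* */", because ARM64 assembly does not accept comments that start with @.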

     Inline comment char:  ‘@’
     Line comment char: ‘#’
     Statement separator: ‘;’

Alternatively, use /* */ for multi-line comments and // for single-line comments. Note that // requires the file name to end in .S.

There are two basic comment styles: multi-line and single-line. Multi-line comments start with /* and everything is ignored until a matching sequence of */ is found. These comments are exactly the same as multi-line comments in C and C++. In ARM assembly, single line comments begin with an @ character, and all remaining characters on the line are ignored. Listing 2.1 shows both types of comment. In addition, if the name of the file ends in .S , then single line comments can begin with // . If the file name does not end with a capital .S , then the // syntax is not allowed.

GNU syntax overview: https://www.eecs.umich.edu/courses/eecs373/readings/Assembler.pdf
GNU syntax quick start: http://www.ic.unicamp.br/~celio/mc404-2014/docs/gnu-arm-directives.pdf
GNU syntax quick reference: http://www.coranac.com/files/gba/re-ejected-gasref.pdf

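2. Registers of the 32-bit ARM architecture

2.1 ARM registers

ARM has sixteen 32-bit general-purpose registers (R0-R15); the register list is shown in Figure 3-5. Points to note: R14 (LR) holds the return address when a subroutine is called; R0-R3 pass function arguments; the other registers must be pushed and later restored if the callee uses them. R12 is the exception: the callee may use it without pushing it. For the detailed calling rules see the AAPCS, the successor to the ATPCS (Reference 2, 5.1.1 Core registers). The procedure call standard uses a full descending stack (STMFD/LDMFD).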

Reference 1: http://home.deib.polimi.it/agosta/lib/exe/fetch.php?id=teaching%3Ainfo1tlc&cache=cache&media=teaching:asm_guide.pdf
Reference 2: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf

2.2 NEON registers

NEON was first implemented in the ARM Cortex-A8 processor, as part of the ARMv7 architecture (the ARMv7-A and ARMv7-R profiles). The NEON register file can be viewed as sixteen 128-bit Q registers or thirty-two 64-bit D registers (from Reference 1, 5.1.1 Core registers); the register list is shown in Figure A2-1 (from Reference 2, A2.6.1 Advanced SIMD and VFP extension registers). Note that S0 is the low 32 bits of D0 and S1 is the high 32 bits of D0; likewise, D0 is the low 64 bits of Q0 and D1 is the high 64 bits of Q0. The relationship between the S, D and Q registers is:
The mapping between the registers is as follows:
    • S<2n> maps to the least significant half of D<n>
    • S<2n+1> maps to the most significant half of D<n>
    • D<2n> maps to the least significant half of Q<n>
    • D<2n+1> maps to the most significant half of Q<n>.
For example, you can access the least significant half of the elements of a vector in Q6 by referring to D12,
and the most significant half of the elements by referring to D13.
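
The register aliasing described above is plain index arithmetic. The following C sketch restates it; the helper names are hypothetical and exist only for this illustration.

```c
/* Index of the D register that holds the low 64 bits of Q<q> (q in 0..15). */
int q_low_d(int q)
{
    return 2 * q;
}

/* Index of the D register that holds the high 64 bits of Q<q>. */
int q_high_d(int q)
{
    return 2 * q + 1;
}

/* Index of the D register that contains S<s> (s in 0..31). */
int s_parent_d(int s)
{
    return s / 2;
}

/* 0 when S<s> is the low 32 bits of its D register, 1 when it is the high 32 bits. */
int s_half(int s)
{
    return s % 2;
}
```

For instance, q_low_d(6) and q_high_d(6) give 12 and 13, which matches the Q6/D12/D13 example in the quoted text.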

Reference 1: http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf
Reference 2: http://vision.gel.ulaval.ca/~jflalonde/cours/1001/h17/docs/ARM_v7.pdf

Note: d8-d15 (q4-q7) must be saved on the stack and restored if a subroutine uses them. Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf 5.1.2.1 VFP register usage conventions (VFP v2, v3 and the Advanced SIMD Extension)

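2.3 NEON instruction set

2.3.1 ARMv7/AArch32 instruction format

Every NEON instruction mnemonic begins with the letter V. Taking the 32-bit instructions as the example, the general format is as follows (see Reference 1, Armv7-A/AArch32 instruction syntax):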

V{<mod>}<op>{<shape>}{<cond>}{.<dt>}{<dest>}, src1, src2

<mod>:
- Q: The instruction uses saturating arithmetic, so that the result is saturated within the range of the specified data type, such as VQABS, VQSHL etc.
- H: The instruction will halve the result. It does this by shifting right by one place (effectively a divide by two with truncation), such as VHADD, VHSUB.
- D: The instruction doubles the result, such as VQDMULL, VQDMLAL, VQDMLSL and VQ{R}DMULH.
- R: The instruction will perform rounding on the result, equivalent to adding 0.5 to the result before truncating, such as VRHADD, VRSHR.
<op>: the operation (for example, ADD, SUB, MUL).
<cond>: Condition, used with IT instruction.
<.dt>: Data type, such as s8, u8, f32 etc.
<dest>: Destination.
<src1>: Source operand 1.
<src2>: Source operand 2.
<shape>: Shape, that is, the NEON data processing type: Long (L), Wide (W), Narrow (N).
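
The Q, H, R and D modifiers can be read as small arithmetic rules applied per lane. The scalar C sketch below models one lane for a few representative instructions; the function names are hypothetical, and the code is an illustration of the arithmetic rather than of how NEON is implemented.

```c
#include <stdint.h>

/* Q (saturating), modelled on VQADD.U8: clamp the sum to the u8 range. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* H (halving), modelled on VHADD.U8: (a + b) >> 1, truncating. */
uint8_t halving_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* R (rounding), modelled on VRHADD.U8: add 1 before the shift, i.e. +0.5. */
uint8_t rounding_halving_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* Q + D, modelled on VQDMULH.S16: high half of the doubled product.
   Assumes >> on a negative int32_t is an arithmetic shift (true for GCC/Clang). */
int16_t sat_doubling_mul_high_s16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    if (p == 0x40000000)          /* only for a == b == INT16_MIN: doubling overflows */
        return INT16_MAX;
    return (int16_t)((2 * p) >> 16);
}
```

With these definitions, halving_add_u8(3, 4) truncates to 3 while rounding_halving_add_u8(3, 4) rounds to 4, and sat_add_u8(200, 100) stops at 255 instead of wrapping.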

NEON data processing instructions are classified as Normal, Long, Wide and Narrow:
- Normal instructions can operate on any vector types, and produce result vectors the same size, and usually the same type, as the operand vectors.
- Long instructions operate on doubleword vector operands and produce a quadword vector result. The result elements are usually twice the width of the operands, and of the same type. Long instructions are specified using an L appended to the instruction.
- Wide instructions operate on a doubleword vector operand and a quadword vector operand, producing a quadword vector result. The result elements and the first operand are twice the width of the second operand elements. Wide instructions have a W appended to the instruction.
- Narrow instructions operate on quadword vector operands, and produce a doubleword vector result. The result elements are usually half the width of the operand elements. Narrow instructions are specified using an N appended to the instruction.
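
The Long, Wide and Narrow shapes differ only in the element widths of their operands and results. The C sketch below models them with arrays standing in for D registers (8 x 8-bit) and Q registers (8 x 16-bit). The function names and the LANES constant are hypothetical, and the instructions named in the comments are just one example of each shape.

```c
#include <stdint.h>

#define LANES 8   /* 8 lanes: a D register of u8 or a Q register of u16 */

/* Long, modelled on VADDL.U8: D (8 x u8) + D (8 x u8) -> Q (8 x u16). */
void add_long_u8(uint16_t out[LANES], const uint8_t a[LANES], const uint8_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint16_t)(a[i] + b[i]);      /* cannot overflow 16 bits */
}

/* Wide, modelled on VADDW.U8: Q (8 x u16) + D (8 x u8) -> Q (8 x u16). */
void add_wide_u8(uint16_t out[LANES], const uint16_t a[LANES], const uint8_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint16_t)(a[i] + b[i]);      /* wraps modulo 2^16 like the instruction */
}

/* Narrow, modelled on VMOVN.I16: Q (8 x u16) -> D (8 x u8), keeping the low byte. */
void move_narrow_u16(uint8_t out[LANES], const uint16_t a[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)(a[i] & 0xFF);
}
```

A typical pattern is to widen with a Long or Wide instruction, accumulate at the larger width, and narrow once at the end, which avoids overflow in the intermediate sums.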

Reference 1: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

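3. Instruction manuals for the 32-bit ARM architecture

3.1 Manuals

3.1.1 Chinese manual

Download: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0013d/index.html

Reading advice: read chapters 2 and 3 in full, which helps with the ARM fundamentals. When actually writing assembly optimizations, the Chinese manual is usually sufficient; for anything it leaves unclear, look up the more detailed explanation in "3.1.2 English manual".

3.1.2 English manual

Download (registration required): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/
Download (no registration required): https://static.docs.arm.com/ddi0406/cd/DDI0406C_d_armv7ar_arm.pdf

3.1.3 Programmer’s Guide

Download: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html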

4. NEON optimization techniques

Skill1: Reduce data dependencies

As the referenced article puts it:

On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To reduce instruction latency, it’s better to avoid using the destination register of current instruction as the source register of next instruction.
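
The same idea can be shown in C before it is applied to registers in assembly. The sketch below is an illustration under the assumption that the work is a simple byte sum; the function names are hypothetical. The first version forms one long dependency chain, the second splits the work across four independent accumulators so that consecutive additions do not wait on each other.

```c
#include <stddef.h>
#include <stdint.h>

/* Every addition needs the result of the previous one: a single dependency chain. */
uint32_t sum_single_chain(const uint8_t *src, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

/* Four independent accumulators: neighbouring additions use different destinations. */
uint32_t sum_four_chains(const uint8_t *src, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += src[i];
        acc1 += src[i + 1];
        acc2 += src[i + 2];
        acc3 += src[i + 3];
    }
    for (; i < n; i++)            /* leftover elements when n is not a multiple of 4 */
        acc0 += src[i];
    return acc0 + acc1 + acc2 + acc3;
}
```

In NEON assembly the equivalent move is to accumulate into several different q registers and interleave the instructions, then combine the partial results once after the loop. Whether it pays off on a given core has to be measured.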

Skill2: Reduce branching

The NEON instruction set has no branch instructions of its own; when assembly code needs to branch, it uses the ARM branch instructions. As the referenced article puts it:

There isn’t branch jump instruction in NEON instruction set. When the branch jump is needed, jump instructions of ARM are used. In ARM processors, branch prediction techniques are widely used. But once the branch prediction fails, the punishment is rather high. So it’s better to avoid the using jump instructions. In fact, logical operations can be used to replace branch in some cases.
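
As an illustration of replacing a branch with logical operations, the C sketch below computes a per-element maximum with a mask instead of an if. The function name is hypothetical. The pattern mirrors what a compare followed by a bitwise select does on NEON (for example VCGT followed by VBSL), although for this particular operation NEON also offers VMAX directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-element maximum of two byte arrays without a data-dependent branch. */
void max_u8_branchless(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* mask is 0xFF where a[i] > b[i] and 0x00 otherwise, like a compare result lane */
        uint8_t mask = (uint8_t)-(a[i] > b[i]);
        /* take a[i] where the mask is set and b[i] elsewhere, like a bitwise select */
        dst[i] = (uint8_t)((a[i] & mask) | (b[i] & (uint8_t)~mask));
    }
}
```

The loop counter still branches, but the per-element decision no longer does, so its cost does not depend on how predictable the data is.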

Skill3: Use the preload instruction PLD

As the referenced article puts it:

ARM processors are a load/store system. Except load/store instructions, all operations perform on registers. Therefore increasing the efficiency of load/store instructions is very important for optimizing application.
Preload instruction allows the processor to signal the memory system that a data load from an address is likely in the near future. If the data is preloaded into cache correctly, it would be helpful to improve the rate of cache hit which can boost performance significantly. But the preload is not a panacea. It’s very hard to use on recent processors and it can be harmful too. A bad preload will reduce performance.
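
From C, a preload hint can be requested with the GCC/Clang builtin __builtin_prefetch, which these compilers typically lower to PLD on ARMv7 targets. The sketch below is only an illustration: the function name and the 64-byte BLOCK distance are assumptions made for this example, and a real prefetch distance has to be tuned by benchmarking on the target core.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64   /* hypothetical prefetch distance in bytes; tune by measurement */

/* Sums n bytes, hinting the next block to the memory system before processing
   the current one. The hint does not change the result, only (possibly) the timing. */
uint32_t sum_with_prefetch(const uint8_t *src, size_t n)
{
    uint32_t acc = 0;
    size_t i = 0;
    while (i < n) {
        size_t end = (n - i > BLOCK) ? i + BLOCK : n;
        if (end < n)
            __builtin_prefetch(src + end, 0, 3);   /* 0 = read access, 3 = keep in cache */
        for (; i < end; i++)
            acc += src[i];
    }
    return acc;
}
```

In hand-written assembly the corresponding instruction is written as pld [rN, #offset] ahead of the loads it is meant to help. As the quoted text warns, a badly placed preload can reduce performance, so compare timings with and without it.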

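Skill4: Misc

In ARM NEON programming, different instruction sequences can perform the same operation, but fewer instructions do not always mean better performance. Which sequence wins depends on the benchmark and profiling results for the specific case. One such case is analyzed below.

*Floating-point VMLA/VMLS instruction*

A VMUL+VADD or VMUL+VSUB pair can usually be replaced by a single VMLA or VMLS, which reduces the instruction count. However, floating-point VMLA/VMLS has a longer latency than floating-point VMUL. If no other instructions can be scheduled into that latency, the VMUL+VADD or VMUL+VSUB sequence will perform better.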

Reference 1: https://community.arm.com/android-community/b/android/posts/arm-neon-optimization

5. Debugging optimized code

Add the following macro to the assembly source (the .S file):

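.macro print_m in1=r0, in2=d0
       push    {r0-r3, r12, lr}       /* save caller-saved core registers; 6 words keep sp 8-byte aligned */
       vst1.64 {\in2\()}, [\in1\()]   /* dump the NEON register to the buffer */
       mov     r0, \in1               /* the buffer address is the only argument of cprintf */
       bl      cprintf
       pop     {r0-r3, r12, lr}       /* restore lr; popping into pc would return from the enclosing function */
.endm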

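Note: in1 must be an ARM register that holds the address of a writable buffer, and in2 is a NEON register such as d0. A D register fills only the first 8 bytes of the buffer, while cprintf below prints 16 bytes. Because cprintf is an ordinary C function, it may also overwrite the caller-saved NEON registers (q0-q3, q8-q15).

Add the following code to a C file: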

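#include <stdio.h>

/* Prints 16 bytes starting at srcu8, first as unsigned and then as signed values. */
void cprintf(unsigned char *srcu8)
{
  int i = 0;
  /* signed char is spelled out because plain char is unsigned by default on ARM */
  signed char *srcs8 = (signed char *)srcu8;
  for (i = 0; i < 16; i++) {
      printf("%d ", srcu8[i]);
  }
  for (i = 0; i < 16; i++) {
      printf("%d ", srcs8[i]);
  }
  printf("\n");
}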

Reference: https://people.cs.clemson.edu/~rlowe/cs2310/notes/debugging-with-printf.pdf