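Preface
This article introduces NEON assembly optimization on the 64-bit ARM architecture (AArch64) and is intended for readers of any level. The previous article, "Optimization on the 32-bit ARM Architecture", already covered the basic ARM syntax.
1. ARM 64-bit registers
1.1 ARM registers
Unless otherwise stated, "ARM registers" in this article means AArch64 registers.
AArch64 has 31 64-bit general-purpose registers (X0~X30). The low 32 bits of each one are accessible as a W register (W0~W30). The mapping between Xn and Wn is shown in the figure:
Figure reference: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf B1.2.1 Register in AArch64 state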
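The Xn/Wn relationship can be modelled in a few lines of C. This is only an illustrative sketch: the helper names `read_w` and `write_w` are invented for this example and are not part of any ARM API. It reflects two architectural facts: reading Wn yields the low 32 bits of Xn, and writing Wn zero-extends, so the upper 32 bits of Xn are cleared.

```c
#include <stdint.h>

/* Reading Wn returns bits 31:0 of Xn. */
static uint32_t read_w(uint64_t xn) {
    return (uint32_t)xn;
}

/* Writing Wn replaces the whole of Xn: the new low word is `w` and
   bits 63:32 become zero, whatever Xn held before. */
static uint64_t write_w(uint64_t xn_old, uint32_t w) {
    (void)xn_old;            /* the old contents do not survive */
    return (uint64_t)w;
}
```

For example, `write_w(0xFFFFFFFFFFFFFFFF, 0x12345678)` gives `0x0000000012345678`, which is why a value computed in a W register can be used directly as a 64-bit unsigned quantity.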
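Note that register usage follows the AAPCS64 procedure call standard, as shown in the figure:
X0~X7 pass function arguments and return results. In general, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
X8 is the indirect result location register: when a result is returned through memory, the caller passes the address of that memory in X8. The return address of a subroutine (here meaning the callee; the same applies below unless stated otherwise) is held in X30 (LR), not in X8;
X9~X15 are temporary (caller-saved) registers that a subroutine may overwrite freely, while X19~X28 are callee-saved and must be saved and restored by any subroutine that uses them. X16 and X17 (IP0, IP1) are intra-procedure-call scratch registers;
X18 (Platform Register, PR) is reserved for platform-specific use; do not use it;
Note: SP must be 16-byte aligned, so take particular care when pushing Xn registers onto the stack. For more information see: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf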
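1.2 NEON registers
There are 32 128-bit NEON registers (V0~V31).
1.2.1 Scalar registers
Each register can be viewed as a scalar register of a different width, depending on the data type:
one 128-bit register (Q0~Q31);
one 64-bit register (D0~D31);
one 32-bit register (S0~S31);
one 16-bit register (H0~H31);
one 8-bit register (B0~B31).
Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on. This differs from AArch32, where Q0 is made up of D0 and D1. See the figure:
Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 54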
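The nesting of the scalar views can be sketched in C as follows. The type `q_reg` and the `view_*` helpers are hypothetical names chosen for this illustration; the sketch simply treats one 128-bit register as two 64-bit halves and shows that every narrower view is the low part of the wider one.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit register Vn/Qn as two 64-bit halves. */
typedef struct {
    uint64_t lo;   /* bits 63:0   */
    uint64_t hi;   /* bits 127:64 */
} q_reg;

static uint64_t view_d(q_reg q) { return q.lo; }            /* Dn: bits 63:0 */
static uint32_t view_s(q_reg q) { return (uint32_t)q.lo; }  /* Sn: bits 31:0 */
static uint16_t view_h(q_reg q) { return (uint16_t)q.lo; }  /* Hn: bits 15:0 */
static uint8_t  view_b(q_reg q) { return (uint8_t)q.lo; }   /* Bn: bits 7:0  */
```

With `lo = 0x1122334455667788`, the D view is the whole low half, the S view is `0x55667788`, the H view is `0x7788` and the B view is `0x88`; the high half never appears in any scalar view narrower than Q.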
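1.2.2 Vector registers
A 64-bit or 128-bit vector register holds one or more elements, as shown in the figure:
An individual element is then accessed by index, for example V0.2D[0] (commonly also written V0.D[0], since the lane count does not matter when indexing).
Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 55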
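Lane indexing can be modelled in C as below. This is a sketch under stated assumptions: `v_reg` and `lane_read` are invented names, and the register is represented as 16 bytes with `byte[0]` holding bits 7:0. An element size of 8 bytes with index 0 then corresponds to V0.2D[0], and an element size of 4 bytes with index 3 corresponds to V0.4S[3].

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register; byte[0] holds bits 7:0. */
typedef struct {
    uint8_t byte[16];
} v_reg;

/* Read lane `index` when the register is viewed as elements of `esize`
   bytes (1 = B, 2 = H, 4 = S, 8 = D). The caller must keep
   (index + 1) * esize <= 16. */
static uint64_t lane_read(const v_reg *v, unsigned esize, unsigned index) {
    uint64_t out = 0;
    for (unsigned i = 0; i < esize; i++)
        out |= (uint64_t)v->byte[index * esize + i] << (8 * i);
    return out;
}
```

The same 16 bytes therefore yield two D lanes, four S lanes, eight H lanes or sixteen B lanes; only the arrangement named in the instruction decides how they are grouped.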
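1.2.3 Calling convention
V0~V7 pass function arguments and return results;
V8~V15 are callee-saved: a subroutine that uses them must save them on the stack and restore them before returning. Only the low 64 bits (D8~D15) have to be preserved;
V0~V7 and V16~V31 are caller-saved: the caller must save them itself if it needs their values after the call;
Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.2 SIMD and Floating-Point Registers
2. NEON instruction set
2.1 ARMv8/AArch64 instruction format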
In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:
{<prefix>}<op>{<suffix>} Vd.<T>, Vn.<T>, Vm.<T>
Where:
- <prefix> - prefix, such as S/U/F/P for the signed/unsigned/floating-point/polynomial data types.
- <op> - operation, such as ADD, AND etc.
- <suffix> - suffix
  - P: "pairwise" operations, such as ADDP.
  - V: the new reduction (across-all-lanes) operations, such as FMAXV.
  - 2: the new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2.
    ADDHN2: adds two 128-bit vectors and produces a 64-bit vector result (the high half of each sum), which is stored in the high 64 bits of the destination NEON register.
    SADDL2: adds the high 64-bit halves of two NEON registers and produces a 128-bit vector result.
- <T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).
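To make the "second part" suffix concrete, the following C functions model, lane by lane, what `SADDL2 Vd.8H, Vn.16B, Vm.16B` and `ADDHN2 Vd.16B, Vn.8H, Vm.8H` compute. They are reference sketches for reasoning about results, not a substitute for the instructions, and the function names are invented for this example.

```c
#include <stdint.h>

/* SADDL2 Vd.8H, Vn.16B, Vm.16B: sign-extend the upper eight bytes of each
   source and add them, producing eight 16-bit results. */
static void saddl2_8h(int16_t d[8], const int8_t n[16], const int8_t m[16]) {
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)((int16_t)n[8 + i] + (int16_t)m[8 + i]);
}

/* ADDHN2 Vd.16B, Vn.8H, Vm.8H: add the 16-bit lanes (wrapping), keep the
   high byte of each sum, and write those eight bytes to the upper half of
   Vd. The lower half of Vd is left unchanged. */
static void addhn2_16b(uint8_t d[16], const uint16_t n[8], const uint16_t m[8]) {
    for (int i = 0; i < 8; i++) {
        uint16_t sum = (uint16_t)(n[i] + m[i]);
        d[8 + i] = (uint8_t)(sum >> 8);
    }
}
```

The forms without the `2` suffix (SADDL, ADDHN) do the same arithmetic on, or write to, the lower 64 bits instead, which is why the two are usually issued as a pair when a full 128-bit vector has to be processed.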
For example:
UADDLP V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S
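As a rough guide to reading the first example, here is a C model of `UADDLP Vd.8H, Vn.16B`: each adjacent pair of unsigned bytes is added and the sum is stored as a 16-bit element. The function name is hypothetical, and the model uses separate source and destination arrays, whereas the example above reuses V0 for both.

```c
#include <stdint.h>

/* UADDLP Vd.8H, Vn.16B: unsigned add long pairwise.
   d[i] = n[2i] + n[2i+1], widened to 16 bits so the sum cannot overflow. */
static void uaddlp_8h(uint16_t d[8], const uint8_t n[16]) {
    for (int i = 0; i < 8; i++)
        d[i] = (uint16_t)(n[2 * i] + n[2 * i + 1]);
}
```

Because the result elements are twice as wide as the inputs, sixteen bytes of 255 produce eight half-words of 510 rather than wrapping around.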
For more information, please refer to the references below.
Reference: http://caxapa.ru/thumbs/845405/armv8-neon-programming.pdf
Reference: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
2.2 Post-index and pre-index addressing in instructions
Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf, page 150
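The difference between the two addressing modes can be sketched in C. In this illustrative model (the function names are invented), `base` stands for the base register Xn holding a byte address. A pre-index load such as `ldr x0, [x1, #8]!` updates the base first and then accesses memory at the new address; a post-index load such as `ldr x0, [x1], #8` accesses memory at the old address and updates the base afterwards.

```c
#include <stdint.h>
#include <string.h>

/* Pre-index, e.g. ldr x0, [x1, #imm]! : base += imm, then load from base. */
static uint64_t ldr_pre_index(const uint8_t **base, int64_t imm) {
    uint64_t val;
    *base += imm;                      /* write-back happens before the access */
    memcpy(&val, *base, sizeof val);
    return val;
}

/* Post-index, e.g. ldr x0, [x1], #imm : load from base, then base += imm. */
static uint64_t ldr_post_index(const uint8_t **base, int64_t imm) {
    uint64_t val;
    memcpy(&val, *base, sizeof val);   /* the access uses the old base */
    *base += imm;
    return val;
}
```

Post-index suits walking forward through an array (load, then advance), while pre-index with a negative offset is the usual way to push onto a descending stack, as in the `stp ..., [sp, #-16]!` forms used later in this article.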
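3. AArch64 instruction manuals
3.1 AArch64 manual (English)
Download: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf
3.2 Mapping table between ARM32 and AArch64 instructions
3.3 Instruction quick-reference card
Download: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf
4. Moving from ARM32 optimization to AArch64
4.1 Function return
4.1.1 Pushing general-purpose registers
Because SP must stay 16-byte aligned, AArch64 pushes registers onto the stack in pairs. A64 has no PUSH/POP and no multi-register LDM/STM, so pairs are stored and loaded with STP/LDP.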
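The alignment rule translates into a simple size calculation. The helper below is a hypothetical sketch (its name is not from any ARM document): it returns how many bytes of stack to reserve for a given number of 64-bit registers so that SP remains a multiple of 16. Two registers fit exactly into one 16-byte slot, which is why the usual idiom is a pair such as `stp x19, x20, [sp, #-16]!` on entry and `ldp x19, x20, [sp], #16` on exit.

```c
#include <stddef.h>

/* Stack bytes needed to save `nregs` 64-bit registers, rounded up to a
   multiple of 16 so that SP stays 16-byte aligned. */
static size_t frame_bytes(size_t nregs) {
    return (nregs * 8 + 15) & ~(size_t)15;
}
```

An odd register count still costs a full slot: saving three registers needs 32 bytes, the same as saving four, so pairing registers wastes nothing.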
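4.1.2 Pushing NEON registers
.macro push_v_regs
    stp d8, d9, [sp, #-16]!      // pre-index: sp -= 16, then store d8 and d9
    stp d10, d11, [sp, #-16]!
    stp d12, d13, [sp, #-16]!
    stp d14, d15, [sp, #-16]!
.endm
.macro pop_v_regs
    ldp d14, d15, [sp], #16      // post-index: load d14 and d15, then sp += 16
    ldp d12, d13, [sp], #16      // pops run in the reverse order of the pushes
    ldp d10, d11, [sp], #16
    ldp d8, d9, [sp], #16
.endm
The registers being used are v8~v15, so why are d8~d15 pushed? See "1.2.3 Calling convention": only the low 64 bits of v8~v15 have to be preserved across a call.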
Unfortunately, when debugging with GDB, pushing registers this way triggers:
tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt' failed.
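Workaround (note that, as written, these macros store the general-purpose registers x8~x15 in 32-byte slots rather than the SIMD registers d8~d15):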
.macro push_v_regs
stp x8, x9, [sp, #-32]!
stp x10, x11, [sp, #-32]!
stp x12, x13, [sp, #-32]!
stp x14, x15, [sp, #-32]!
.endm
.macro pop_v_regs
ldp x14, x15, [sp], #32
ldp x12, x13, [sp], #32
ldp x10, x11, [sp], #32
ldp x8, x9, [sp], #32
.endm
For more on pushing to the stack in AArch64, see:
Stack reference 1: https://stackoverflow.com/questions/40271180/push-and-pop-a-full-128-bit-neon-register-to-from-the-stack-in-aarch64
Stack reference 2: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch32-and-aarch64
Stack reference 3: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop
5. Further reading
Reference: https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARM_CPU_Architecture.pdf
Reference: http://my.presentations.techweb.com/events/esc/boston/2017/conference/download/5299
Reference: https://www.nxp.com/docs/en/application-note/AN12212.pdf