ARM 64-bit architecture optimization

Preface

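This article introduces NEON assembly optimization on the 64-bit ARM architecture and is suitable for readers of any level. The previous article, "ARM 32-bit architecture optimization", already covered the basic ARM syntax.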

1. Introduction to 64-bit ARM registers

1.1 ARM registers

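Unless stated otherwise, "ARM registers" in this article means the AArch64 registers.
There are 31 64-bit general-purpose registers (X0~X30); their low 32 bits are called the W registers (W0~W30). The correspondence between Xn and Wn is shown in the figure:
[Figure: correspondence between the Xn and Wn registers]
Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, B1.2.1 Register in AArch64 state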
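The Xn/Wn relationship can be sketched in portable C. This is only a model under stated assumptions: the type `gpr_file` and the helpers `read_w` and `write_w` are hypothetical names introduced for illustration, not part of any ARM API. It shows that reading Wn yields the low 32 bits of Xn, and that writing Wn zero-extends the value, which clears bits 63..32 of Xn.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the general-purpose register file X0..X30. */
typedef struct {
    uint64_t x[31];
} gpr_file;

/* Reading Wn returns the low 32 bits of Xn. */
uint32_t read_w(const gpr_file *r, unsigned n)
{
    assert(n < 31);
    return (uint32_t)(r->x[n] & 0xFFFFFFFFu);
}

/* Writing Wn zero-extends the value: bits 63..32 of Xn become 0. */
void write_w(gpr_file *r, unsigned n, uint32_t value)
{
    assert(n < 31);
    r->x[n] = (uint64_t)value;
}
```

In this model, writing through `write_w` discards whatever was in the upper half of the register, so a 32-bit operation never leaves stale high bits behind.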
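Note that ARM register usage follows the AAPCS (AAPCS64) calling convention:

X0~X7 pass function arguments and return results. In general, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
X8 is the indirect result location register: it holds the address to which the subroutine (here meaning the callee function; the same applies below unless stated otherwise) writes a result that is returned through memory. The return address itself is held in X30 (LR);
X9~X15 are temporary registers that a subroutine may overwrite freely, so the caller saves them if it still needs them; X19~X28 are callee-saved, so a subroutine that uses them must save and restore them;
X18 (Platform Register, PR) is platform-specific and reserved for special purposes; do not use it;
Note: SP must be 16-byte aligned, so take particular care when pushing Xn registers onto the stack. For more information, see: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf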
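The 16-byte alignment requirement on SP can be illustrated with a small C sketch. The helper names `frame_size_16` and `sp_is_aligned` are assumptions made for this example. The idea is that whatever amount of stack a function needs, the amount actually subtracted from SP is rounded up to a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a stack frame size up to the next multiple of 16 bytes. */
size_t frame_size_16(size_t bytes_needed)
{
    assert(bytes_needed <= (size_t)-1 - 15u);   /* avoid overflow */
    return (bytes_needed + 15u) & ~(size_t)15u;
}

/* Check whether a stack pointer value is 16-byte aligned. */
int sp_is_aligned(uintptr_t sp)
{
    return (sp & 15u) == 0;
}
```

For instance, saving a single 8-byte X register still costs 16 bytes of stack in this model, which is why registers are usually saved in pairs.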

1.2 NEON registers

There are 32 128-bit NEON registers (V0~V31).

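1.2.1 Scalar registers

Each register can be mapped to different scalar registers depending on the data type, for example:
one 128-bit register (Q0~Q31);
one 64-bit register (D0~D31);
one 32-bit register (S0~S31);
one 16-bit register (H0~H31);
one 8-bit register (B0~B31).
Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on, as shown in the figure:
[Figure: how the B, H, S, D and Q scalar views map onto one V register]
Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 54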
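The nesting of the scalar views can be modelled in C. This is a sketch, not real register access: the type `vreg128` and the `view_*` helpers are hypothetical names. The 128-bit register is stored as two 64-bit halves, and each narrower view is simply the low bits of the wider one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit register (the Qn view of Vn). */
typedef struct {
    uint64_t lo;   /* bits 63..0   */
    uint64_t hi;   /* bits 127..64 */
} vreg128;

/* Dn is the low 64 bits of Qn. */
uint64_t view_d(const vreg128 *v) { assert(v != NULL); return v->lo; }

/* Sn is the low 32 bits of Dn. */
uint32_t view_s(const vreg128 *v) { return (uint32_t)view_d(v); }

/* Hn is the low 16 bits of Sn. */
uint16_t view_h(const vreg128 *v) { return (uint16_t)view_s(v); }

/* Bn is the low 8 bits of Hn. */
uint8_t view_b(const vreg128 *v) { return (uint8_t)view_h(v); }
```

Because every view starts at bit 0 of the same register, D1 is not the upper half of Q0 as it was in the 32-bit architecture; in this model the upper half is only reachable through `hi`.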

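1.2.2 Vector registers

A 64-bit or 128-bit wide vector register can hold one or more elements, as shown in the figure:
[Figure: element layouts of 64-bit and 128-bit vector registers]
An index is then used to access an individual element, for example V0.D[0].
Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf, page 55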
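Indexed element access can be sketched as follows. The names `vreg128`, `lane_count` and `get_lane` are assumptions for this example only. Lane 0 occupies the least significant bits of the register, and the lane count is the register width divided by the element width.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register. */
typedef struct {
    uint64_t lo;   /* bits 63..0   */
    uint64_t hi;   /* bits 127..64 */
} vreg128;

/* Number of lanes, e.g. 128/16 = 8 for the 8H arrangement. */
unsigned lane_count(unsigned reg_bits, unsigned elem_bits)
{
    assert(reg_bits == 64 || reg_bits == 128);
    return reg_bits / elem_bits;
}

/* Read lane `index` of width `elem_bits` (8, 16, 32 or 64). */
uint64_t get_lane(const vreg128 *v, unsigned elem_bits, unsigned index)
{
    assert(elem_bits == 8 || elem_bits == 16 ||
           elem_bits == 32 || elem_bits == 64);
    assert(index < 128 / elem_bits);

    unsigned bit = index * elem_bits;            /* first bit of the lane */
    uint64_t word = (bit < 64) ? v->lo : v->hi;  /* lanes never straddle  */
    uint64_t mask = (elem_bits == 64) ? ~0ULL : ((1ULL << elem_bits) - 1);
    return (word >> (bit % 64)) & mask;
}
```

With this model, `get_lane(&v, 64, 0)` corresponds to element 0 of the 64-bit lanes and `get_lane(&v, 8, 15)` to the most significant byte of the register.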

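1.2.3 Calling convention

V0~V7 pass function arguments and return results;
V8~V15 are callee-saved: a subroutine that uses them must first save them on the stack. Only the low 64 bits of each register (D8~D15) need to be preserved;
V0~V7 and V16~V31 are caller-saved: the caller saves them if it needs their values after the call.

Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf, 5.1.2 SIMD and Floating-Point Registers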

2. NEON instruction set

2.1 ARMv8/AArch64 instruction format

In the AArch64 execution state, the syntax of NEON instructions has changed. It can be described as follows:

{<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>

　　Where:

<prefix> - prefix, such as S/U/F/P for signed/unsigned/floating-point/polynomial data types.
<op> - operation, such as ADD, AND etc.
<suffix> - suffix
- P: “pairwise” operations, such as ADDP.
- V: the new reduction (across-all-lanes) operations, such as FMAXV.
- 2: new widening/narrowing “second part” instructions, such as ADDHN2, SADDL2.

ADDHN2: add two 128-bit vectors and produce a 64-bit vector result, which is stored in the high 64-bit part of the NEON register.
SADDL2: add the high 64-bit halves of two NEON registers and produce a 128-bit vector result.

<T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents double-word (64-bit).
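The two "second part" instructions described above can be modelled lane by lane in plain C. This is a scalar sketch of the semantics for the 4S/8H arrangements only, not NEON code, and the function names `addhn2_8h_4s` and `saddl2_4s_8h` are hypothetical. Arrays stand in for registers, with index 0 as the lowest lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of ADDHN2 Vd.8H, Vn.4S, Vm.4S: add the 32-bit lanes, keep the
 * high 16 bits of each sum, and write them to the upper four lanes of
 * the destination. The lower four lanes of d are left unchanged. */
void addhn2_8h_4s(uint16_t d[8], const uint32_t n[4], const uint32_t m[4])
{
    assert(d != NULL && n != NULL && m != NULL);
    for (int i = 0; i < 4; i++) {
        uint32_t sum = n[i] + m[i];          /* wraps modulo 2^32 */
        d[4 + i] = (uint16_t)(sum >> 16);
    }
}

/* Model of SADDL2 Vd.4S, Vn.8H, Vm.8H: take the upper four 16-bit lanes
 * of each source, sign-extend them to 32 bits, and add them. */
void saddl2_4s_8h(int32_t d[4], const int16_t n[8], const int16_t m[8])
{
    assert(d != NULL && n != NULL && m != NULL);
    for (int i = 0; i < 4; i++) {
        d[i] = (int32_t)n[4 + i] + (int32_t)m[4 + i];
    }
}
```

The narrowing form halves the element width and fills only half of the destination, while the widening form reads only half of each source and doubles the element width, so neither loses lanes.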
　For example:

UADDLP    V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S
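What these two example instructions compute can be sketched in scalar C. This is a model of the lane arithmetic only; the function names `uaddlp_8h_16b` and `fadd_4s` are assumptions, and arrays stand in for registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of UADDLP Vd.8H, Vn.16B: add each adjacent pair of unsigned
 * bytes and store the sum in a 16-bit lane, so no overflow can occur. */
void uaddlp_8h_16b(uint16_t d[8], const uint8_t n[16])
{
    assert(d != NULL && n != NULL);
    for (int i = 0; i < 8; i++) {
        d[i] = (uint16_t)((unsigned)n[2 * i] + (unsigned)n[2 * i + 1]);
    }
}

/* Model of FADD Vd.4S, Vn.4S, Vm.4S: lane-wise single-precision add.
 * d, n and m may be the same array, as in FADD V0.4S, V0.4S, V0.4S. */
void fadd_4s(float d[4], const float n[4], const float m[4])
{
    assert(d != NULL && n != NULL && m != NULL);
    for (int i = 0; i < 4; i++) {
        d[i] = n[i] + m[i];
    }
}
```

Calling `fadd_4s(v, v, v)` doubles every lane, which matches the example where V0 is both sources and the destination.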

For more information, please refer to the references below.
Reference: http://caxapa.ru/thumbs/845405/armv8-neon-programming.pdf
Reference: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

2.2 Post-index and pre-index addressing in instructions

[Figure: post-index and pre-index addressing modes]

Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf, page 150
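The difference between the addressing modes can be sketched in C by modelling only the address calculation. The helper names `addr_offset`, `addr_pre_index` and `addr_post_index` are hypothetical; `xn` stands for the base register and the return value is the address the load or store would access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain offset "[Xn, #imm]": access Xn + imm, Xn is not modified. */
uint64_t addr_offset(const uint64_t *xn, int64_t imm)
{
    assert(xn != NULL);
    return *xn + (uint64_t)imm;
}

/* Pre-index "[Xn, #imm]!": update Xn first, then access the new Xn. */
uint64_t addr_pre_index(uint64_t *xn, int64_t imm)
{
    assert(xn != NULL);
    *xn += (uint64_t)imm;
    return *xn;
}

/* Post-index "[Xn], #imm": access the old Xn, then update Xn. */
uint64_t addr_post_index(uint64_t *xn, int64_t imm)
{
    assert(xn != NULL);
    uint64_t address = *xn;
    *xn += (uint64_t)imm;
    return address;
}
```

A pre-index access with -16 followed by a post-index access with +16 touches the same address and leaves the base register where it started, which is the pattern used for push and pop.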

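3. ARM 64-bit architecture instruction manual

4.1 Function return

[Figure: function return]

4.1.1 Pushing general-purpose registers

[Figure: pushing general-purpose registers onto the stack]
Because SP must stay 16-byte aligned, AArch64 pushes registers onto the stack in pairs.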
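Pushing in pairs can be simulated in C to show that SP stays aligned. This is a sketch under stated assumptions: `sim_stack`, `stack_init`, `push_pair` and `pop_pair` are hypothetical names, and the stack is a small array rather than real memory. `push_pair` models `stp a, b, [sp, #-16]!` and `pop_pair` models `ldp a, b, [sp], #16`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 64

/* Simulated stack: sp is a byte offset into mem and grows downwards. */
typedef struct {
    uint64_t mem[STACK_WORDS];
    size_t   sp;
} sim_stack;

void stack_init(sim_stack *s)
{
    s->sp = sizeof s->mem;             /* empty stack, multiple of 16 */
}

/* Model of: stp a, b, [sp, #-16]!  (pre-index, two 8-byte registers) */
void push_pair(sim_stack *s, uint64_t a, uint64_t b)
{
    assert(s->sp >= 16);
    s->sp -= 16;
    s->mem[s->sp / 8]     = a;         /* first register, lower address */
    s->mem[s->sp / 8 + 1] = b;
    assert(s->sp % 16 == 0);           /* SP is still 16-byte aligned */
}

/* Model of: ldp a, b, [sp], #16  (post-index) */
void pop_pair(sim_stack *s, uint64_t *a, uint64_t *b)
{
    assert(s->sp + 16 <= sizeof s->mem);
    *a = s->mem[s->sp / 8];
    *b = s->mem[s->sp / 8 + 1];
    s->sp += 16;
}
```

Pairs must be popped in the reverse order of the pushes. Pushing a single 8-byte register with an 8-byte adjustment would leave `sp` at a value that fails the alignment check in this model.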

4.1.2 Pushing NEON registers

.macro push_v_regs
    stp    d8, d9, [sp, #-16]!
    stp    d10, d11, [sp, #-16]!
    stp    d12, d13, [sp, #-16]!
    stp    d14, d15, [sp, #-16]!
.endm
.macro pop_v_regs
    ldp    d14, d15, [sp], #16
    ldp    d12, d13, [sp], #16
    ldp    d10, d11, [sp], #16
    ldp    d8, d9, [sp], #16
.endm

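As for why D8~D15 are pushed when the registers being used are V8~V15: only the low 64 bits of V8~V15 have to be preserved across a call (see section 1.2.3).
Unfortunately, when debugging with GDB, this way of pushing triggers the following error: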

tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt’ failed.

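Workaround (note that, as written, these macros store X8~X15 with a 32-byte stride rather than D8~D15, so they avoid the GDB assertion but do not themselves preserve the low 64 bits of V8~V15):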

.macro push_v_regs
    stp    x8, x9, [sp, #-32]!
    stp    x10, x11, [sp, #-32]!
    stp    x12, x13, [sp, #-32]!
    stp    x14, x15, [sp, #-32]!
.endm
.macro pop_v_regs
    ldp    x14, x15, [sp], #32
    ldp    x12, x13, [sp], #32
    ldp    x10, x11, [sp], #32
    ldp    x8, x9, [sp], #32
.endm

For more information on using the stack in AArch64, see:
Stack reference 1: https://stackoverflow.com/questions/40271180/push-and-pop-a-full-128-bit-neon-register-to-from-the-stack-in-aarch64
Stack reference 2: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch32-and-aarch64
Stack reference 3: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop

5. Further reading

Reference: https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARM_CPU_Architecture.pdf
Reference: http://my.presentations.techweb.com/events/esc/boston/2017/conference/download/5299
Reference: https://www.nxp.com/docs/en/application-note/AN12212.pdf
