ARM 64-bit (AArch64) Optimization

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/listener51/article/details/82530464

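  This article introduces NEON assembly optimization on 64-bit ARM (AArch64) and is suitable for readers of any background. The previous article, 《arm架构32位优化》 (ARM 32-bit Optimization), already covered basic ARM assembly syntax.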

1. AArch64 registers

1.1 ARM registers

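   Unless stated otherwise, "ARM registers" in this article means AArch64 registers.
   AArch64 has 31 64-bit general-purpose registers (X0~X30). The low 32 bits of each are accessible as the W registers (W0~W30). The mapping between Xn and Wn is shown in the figure:
[Figure: mapping between Xn and Wn]
  Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf B1.2.1 Register in AArch64 state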
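The Xn/Wn relationship can be modelled in a few lines of portable C. This is only an illustrative sketch, not code from the referenced guide: `gpr_t`, `read_w` and `write_w` are hypothetical names. It also shows one architectural detail worth remembering: a write to Wn clears the upper 32 bits of Xn.

```c
#include <stdint.h>

/* Hypothetical model of one general-purpose register; x is the full
   64-bit Xn view. */
typedef struct {
    uint64_t x;
} gpr_t;

/* Reading Wn yields the low 32 bits of Xn. */
static uint32_t read_w(const gpr_t *r)
{
    return (uint32_t)r->x;
}

/* Writing Wn zero-extends the value: bits 63:32 of Xn become zero. */
static void write_w(gpr_t *r, uint32_t value)
{
    r->x = (uint64_t)value;
}
```

For example, after a write through `write_w`, any stale data in the upper half of the modelled Xn is gone, which is why code that mixes W and X accesses to the same register does not need an explicit zero-extension.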
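  Note that register usage follows the AAPCS64 procedure call standard, as shown in the figure:
   [Figure: AAPCS64 usage of the general-purpose registers]
   X0~X7 pass function arguments and return results. In general, a single 64-bit result is returned in X0 and a single 128-bit result in X1:X0;
   X8 is the indirect result location register: when a result is returned through memory, X8 holds the address of that memory. The return address of a subroutine (here and below, the callee) is held in X30 (LR), not in X8;
   X9~X15 are temporary (caller-saved) registers that a subroutine may overwrite freely; X16 and X17 (IP0, IP1) are intra-procedure-call scratch registers; X19~X28 are callee-saved, so a subroutine that uses them must save and restore them; X29 is the frame pointer (FP);
   X18 (Platform Register, PR) is reserved for platform-specific purposes; do not use it;
   Note: SP must be 16-byte aligned, so take particular care when pushing Xn registers. More information: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf  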
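The 16-byte alignment rule means that the space reserved on the stack is always rounded up to a multiple of 16, even when only one 8-byte X register is saved. A minimal sketch of that arithmetic in C (the helper name `align_up_16` is hypothetical):

```c
#include <stddef.h>

/* Round a stack-frame size in bytes up to the next multiple of 16,
   so that SP stays 16-byte aligned after the frame is allocated. */
static size_t align_up_16(size_t bytes)
{
    return (bytes + 15u) & ~(size_t)15u;
}
```

Saving a single X register (8 bytes) therefore still costs 16 bytes of stack, which is one reason registers are usually saved in pairs.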

1.2 NEON registers

   There are 32 128-bit NEON registers (V0~V31).

1.2.1 Scalar registers

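   Depending on the data type, each register can be viewed as a scalar register of a different width:
    one 128-bit register (Q0~Q31);
    one 64-bit register (D0~D31);
    one 32-bit register (S0~S31);
    one 16-bit register (H0~H31);
    one 8-bit register (B0~B31).
  Note: S0 is the bottom half of D0, which is the bottom half of Q0. S1 is the bottom half of D1, which is the bottom half of Q1, and so on. See the figure:
   [Figure: B/H/S/D/Q scalar views of a V register]
   Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf page 54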
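The nesting of the scalar views can be sketched in portable C. This is an illustrative model only; `vreg_t` and the `view_*` helpers are hypothetical names, and the 128-bit register is represented as two 64-bit halves.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit register Vn (the Qn view),
   split into a low and a high 64-bit half. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} vreg_t;

/* Each narrower view is the least significant part of the wider one. */
static uint64_t view_d(vreg_t v) { return v.lo; }           /* Dn */
static uint32_t view_s(vreg_t v) { return (uint32_t)v.lo; } /* Sn */
static uint16_t view_h(vreg_t v) { return (uint16_t)v.lo; } /* Hn */
static uint8_t  view_b(vreg_t v) { return (uint8_t)v.lo; }  /* Bn */
```

All five views with the same index n refer to the same physical register, so Bn, Hn, Sn and Dn never alias a different Q register.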

1.2.2 Vector registers

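   A 64-bit or 128-bit vector register holds one or more elements, as shown in the figure:
   [Figure: element layouts of 64-bit and 128-bit vector registers]
   An index is then used to access an individual element, for example V0.2D[0] (written V0.D[0] in the canonical ARM syntax).
   Figure source: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf page 55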

1.2.3 Calling convention

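   V0~V7 pass function arguments and return results;
   V8~V15 are callee-saved: a subroutine that uses them must first save them on the stack. Only the low 64 bits (D8~D15) need to be preserved;
   V0~V7 and V16~V31 are caller-saved: the caller saves them if it still needs their values after the call;
   
   Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf 5.1.2 SIMD and Floating-Point Registers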

2. NEON instruction set
2.1 ARMv8/AArch64 instruction format

  In the AArch64 execution state, the syntax of NEON instructions has changed. It can be described as follows:

{<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>

  Where:

  • <prefix> - prefix, such as S/U/F/P to represent the signed/unsigned/floating-point/polynomial data type.
  • <op> - operation, such as ADD, AND etc.
  • <suffix> - suffix
    • P: "pairwise" operations, such as ADDP.
    • V: the new reduction (across-all-lanes) operations, such as FMAXV.
    • 2: new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2.

ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as high 64-bit part of NEON register.
SADDL2: add two high 64-bit vectors of NEON register and produce a 128-bit vector result.

  • <T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).
     For example:
UADDLP    V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S
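To make the suffix and arrangement notation concrete, the lane-by-lane behaviour of three of the instructions named above can be written as scalar C. This is a sketch of the semantics for one arrangement each, not a replacement for the instructions themselves; the function names are hypothetical, and each array stands for the lanes of one vector register, lane 0 first.

```c
#include <stdint.h>

/* UADDLP Vd.8H, Vn.16B: add each adjacent pair of unsigned bytes and
   widen the sum to 16 bits. */
static void uaddlp_8h_16b(uint16_t dst[8], const uint8_t src[16])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
}

/* SADDL2 Vd.4S, Vn.8H, Vm.8H: add the upper four signed 16-bit lanes
   of each source and widen the sums to 32 bits. */
static void saddl2_4s_8h(int32_t dst[4], const int16_t a[8], const int16_t b[8])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (int32_t)a[4 + i] + (int32_t)b[4 + i];
}

/* ADDHN2 Vd.8H, Vn.4S, Vm.4S: add 32-bit lanes, keep the high 16 bits
   of each sum, and write them to the upper four lanes of dst. The lower
   four lanes of dst are left unchanged. */
static void addhn2_8h_4s(uint16_t dst[8], const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 0; i < 4; i++)
        dst[4 + i] = (uint16_t)((a[i] + b[i]) >> 16);
}
```

The "2" variants only differ from their base forms (SADDL, ADDHN) in which half of the 128-bit register they read or write, which is what lets a pair of instructions process a full 128-bit vector.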

  For more information, please refer to the references below.
Reference: http://caxapa.ru/thumbs/845405/armv8-neon-programming.pdf
Reference: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

2.2 Post-index and pre-index addressing in instructions

[Figures: post-index and pre-index addressing modes]
Reference: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf page 150
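Since the figures are not available here, the difference between the two modes can be sketched in C using a pair store and a pair load as examples. This is an illustrative model under stated assumptions: `store_pair_pre` and `load_pair_post` are hypothetical helpers, the returned pointer plays the role of the updated base register, and offsets are given in bytes as in the assembly syntax.

```c
#include <stdint.h>

/* Pre-index, e.g.  stp x0, x1, [sp, #-16]!
   The base is updated first, then the access uses the new base. */
static uint64_t *store_pair_pre(uint64_t *base, int64_t byte_off,
                                uint64_t first, uint64_t second)
{
    base = (uint64_t *)((char *)base + byte_off);
    base[0] = first;   /* first register goes to the lower address */
    base[1] = second;
    return base;       /* written back to the base register */
}

/* Post-index, e.g.  ldp x0, x1, [sp], #16
   The access uses the old base, then the base is updated. */
static uint64_t *load_pair_post(uint64_t *base, int64_t byte_off,
                                uint64_t *first, uint64_t *second)
{
    *first = base[0];
    *second = base[1];
    return (uint64_t *)((char *)base + byte_off);
}
```

Used together with offsets of -16 and +16, the two helpers behave like a push and a pop on a descending stack: the pre-index store makes room before writing, and the post-index load releases the space after reading.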

3. AArch64 instruction manuals
3.1 AArch64 reference manual (English)

  Download: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf

3.2 ARM 32-bit to AArch64 instruction cross-reference

  Download: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

3.3 Instruction quick-reference card

  Download: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf

4. Migrating from ARM32 optimizations to AArch64

  Reference: https://blog.linuxplumbersconf.org/2014/ocw/system/presentations/2343/original/08%20-%20Migrating%20code%20from%20ARM%20to%20ARM64.pdf

4.1 Function return

[Figure: function return]

4.1.1 Pushing general-purpose registers

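[Figure: pushing general-purpose registers]
  Because SP must stay 16-byte aligned and each X register is 8 bytes wide, AArch64 code pushes registers in pairs.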

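4.1.2 Pushing NEON registers
// Save D8~D15 (the low halves of V8~V15), 16 bytes per pair, using pre-index addressing.
.macro push_v_regs
    stp    d8, d9, [sp, #-16]!
    stp    d10, d11, [sp, #-16]!
    stp    d12, d13, [sp, #-16]!
    stp    d14, d15, [sp, #-16]!
.endm
// Restore D8~D15 in reverse order, using post-index addressing.
.macro pop_v_regs
    ldp    d14, d15, [sp], #16
    ldp    d12, d13, [sp], #16
    ldp    d10, d11, [sp], #16
    ldp    d8, d9, [sp], #16
.endm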

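  As for why d8~d15 are pushed when the registers actually used are v8~v15: the calling convention only requires the low 64 bits of V8~V15 (that is, D8~D15) to be preserved; see section 1.2.3.
Unfortunately, when debugging with GDB, this way of pushing triggers the following error: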

tbreak _Unwind_RaiseException aarch64-tdep.c:335: internal-error: CORE_ADDR aarch64_analyze_prologue(gdbarch*, CORE_ADDR, CORE_ADDR, aarch64_prologue_cache*): Assertion `inst.operands[0].type == AARCH64_OPND_Rt' failed.

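Workaround given by the author. Note that these macros store the general-purpose registers x8~x15, not d8~d15, so they avoid the assertion above but do not by themselves preserve V8~V15:

// Each stp reserves 32 bytes and stores one 16-byte pair of X registers at the new SP.
.macro push_v_regs
    stp    x8, x9, [sp, #-32]!
    stp    x10, x11, [sp, #-32]!
    stp    x12, x13, [sp, #-32]!
    stp    x14, x15, [sp, #-32]!
.endm
// Reload the pairs in reverse order and release 32 bytes each time.
.macro pop_v_regs
    ldp    x14, x15, [sp], #32
    ldp    x12, x13, [sp], #32
    ldp    x10, x11, [sp], #32
    ldp    x8, x9, [sp], #32
.endm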

More on using the stack in AArch64:
  Stack reference 1: https://stackoverflow.com/questions/40271180/push-and-pop-a-full-128-bit-neon-register-to-from-the-stack-in-aarch64
  Stack reference 2: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch32-and-aarch64
  Stack reference 3: https://community.arm.com/processors/b/blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop

5. Further reading

Reference: https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARM_CPU_Architecture.pdf
Reference: http://my.presentations.techweb.com/events/esc/boston/2017/conference/download/5299
Reference: https://www.nxp.com/docs/en/application-note/AN12212.pdf
