An overview of the ARMv8 architecture, related technical documents, and an introduction to the ARMv8 processor

ARMv8 architecture

Reference documents

The ARMv8 Architecture Reference Manual (known as the ARM ARM) provides a comprehensive description of the ARMv8 instruction set architecture, programmer's model, system registers, debugging features, and memory model. It forms a detailed specification to which all implementations of ARM processors must adhere.

ARMv8 Development Reference Documentation

  1. 754-2008 - IEEE Standard for Floating-Point Arithmetic

  2. 1003.1, 2016 Edition - IEEE Standard for Information Technology—Portable Operating System Interface (POSIX™) Base Specifications, Issue 7

  3. 1149.1-2001 - IEEE Standard Test Access Port and Boundary Scan Architecture

  4. ARM® Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile (ARM DDI0487)

  5. ARM® Cortex®-A Series Programmer’s Guide for ARMv7-A (DEN 0013)

  6. ARM® NEON™ Programmer's Guide (DEN 0018)

  7. ARM® Cortex®-A53 MPCore Processor Technical Reference Manual (DDI 0500)

  8. ARM® Cortex®-A57 MPCore Processor Technical Reference Manual (DDI 0488)

  9. Arm Cortex-A73 MPCore Processor Technical Reference Manual

  10. ARM® Generic Interrupt Controller Architecture Specification (ARM IHI 0048)

  11. ARM® Compiler armasm Reference Guide v6.01 (DUI 0802)

  12. ARM® Compiler Software Development Guide v5.05 (DUI 0471)

  13. ARM® C Language Extensions (IHI 0053)

  14. ELF for the ARM® Architecture (ARM IHI 0044)

Overview of the ARMv8 Architecture

The ARMv8 architecture defines both a 32-bit and a 64-bit execution state. It introduces the ability to execute with 64-bit wide registers, and it provides backward-compatibility mechanisms that allow existing ARMv7 software to run.

  • AArch64: 64-bit execution state in ARMv8.
  • AArch32: 32-bit execution state in ARMv8, almost identical to ARMv7.

In GNU and Linux documentation (except for Red Hat and Fedora distributions), AArch64 is sometimes referred to as ARM64.

The Cortex-A family of processors now includes both ARMv7-A and ARMv8-A implementations:

  • The Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15 and Cortex-A17 processors all implement the ARMv7-A architecture.
  • The Cortex-A53, Cortex-A57 and Cortex-A73 processors implement the ARMv8-A architecture.

ARMv8 processors still support software written for ARMv7-A processors (with some exceptions). This means, for example, that 32-bit code written for an ARMv7 Cortex-A series processor also runs on an ARMv8 processor such as the Cortex-A57, but only while the ARMv8 processor is in the AArch32 execution state.

Conversely, the A64 instruction set used in the AArch64 state does not run on ARMv7 processors; it runs only on ARMv8 processors.

The changes from 32 bits to 64 bits

Moving from a 32-bit to a 64-bit architecture brings several changes that improve performance:

1. Larger register pool

The A64 instruction set offers some significant performance benefits, including a larger register pool. A64 has 31 64-bit general-purpose registers, and the ARM Architecture Procedure Call Standard (AAPCS) takes advantage of them. In ARMv7, a function call that passes more than four parameters (and so needs more than four registers) may have to place the extra ones on the stack. In AArch64, up to eight parameters can be passed in registers, which improves performance and reduces stack usage.

Parameter passing rules in armeabi

Registers R0-R3 pass the first four parameters of a subroutine call, and any further parameters are passed on the stack. R0 also holds the subroutine's return value; if the result is wider than 32 bits, R0 and R1 hold it together. The called function does not need to restore the contents of these registers before returning.
For floating-point numbers, armeabi does not use hardware floating-point instructions; floating-point operations are emulated in software (softvfp).

Thumb mode

Parameter passing in Thumb mode follows the same rules. However, Thumb code cannot use the 32-bit stmfd and ldmfd instructions; push and pop are used instead.

armeabi-v7a

Parameter passing in armeabi-v7a is basically the same, apart from floating-point values. This ABI supports hardware floating-point instructions, so floating-point operations are carried out by those instructions rather than emulated. Because a double-precision value is 64 bits wide, it occupies two adjacent 32-bit registers.

arm64-v8a

The parameter size and the addressing space change considerably in arm64-v8a, and the parameter passing convention changes with them:

  • 32-bit integer parameters: the first eight are passed in registers W0-W7 (the lower 32 bits of X0-X7), and any further parameters are passed on the stack.
  • 64-bit integer parameters: the first eight are passed in registers X0-X7, and any further parameters are passed on the stack.
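
The difference between the four-register and eight-register conventions described above can be illustrated with a small model. This is only a sketch under simplifying assumptions: it handles integer arguments that each fit in one register, and the function name `assign_integer_args` and the `stack[n]` labels are made up for illustration. The real procedure call standards also cover floating-point values, structures and alignment.

```python
# Simplified model: integer arguments only, one register per argument.
# Floating-point, structure and alignment rules are deliberately ignored.

AARCH32_ARG_REGS = ["r0", "r1", "r2", "r3"]
AARCH64_ARG_REGS = [f"x{i}" for i in range(8)]


def assign_integer_args(num_args, arg_registers):
    """Return (register_locations, stack_locations) for num_args arguments."""
    in_registers = arg_registers[:num_args]
    overflow = max(0, num_args - len(arg_registers))
    on_stack = [f"stack[{i}]" for i in range(overflow)]
    return in_registers, on_stack


regs32, stack32 = assign_integer_args(6, AARCH32_ARG_REGS)
regs64, stack64 = assign_integer_args(6, AARCH64_ARG_REGS)
print("32-bit convention:", regs32, stack32)
print("64-bit convention:", regs64, stack64)
```

With six arguments, the model places two of them on the stack under the four-register convention and none under the eight-register convention.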

2. Wider integer registers

Wider integer registers allow code that operates on 64-bit data to work more efficiently. A 32-bit processor may require multiple operations to perform arithmetic on 64-bit data. A 64-bit processor may be able to perform the same task in a single operation, usually at the same speed as the same processor performing a 32-bit operation. Therefore, code that performs many 64-bit-sized operations is significantly faster.
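
To make the "multiple operations" point concrete, the sketch below adds two 64-bit values using only 32-bit-wide steps, in the style of an add followed by an add-with-carry. It is a model in Python rather than real machine code, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_with_32bit_ops(a, b):
    """Add two unsigned 64-bit values using two 32-bit-wide additions."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    lo_sum = a_lo + b_lo                  # first step: add the low words
    carry = lo_sum >> 32                  # carry out of the low word (0 or 1)
    lo = lo_sum & MASK32

    hi = (a_hi + b_hi + carry) & MASK32   # second step: add high words plus carry
    return (hi << 32) | lo


def add64_single_op(a, b):
    """The same addition as one 64-bit-wide operation (wraps at 2**64)."""
    return (a + b) & MASK64


print(hex(add64_with_32bit_ops(0xFFFFFFFF, 1)))
```

The two-step version needs the carry to travel from the low half to the high half, which is the extra work a 64-bit register avoids.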

3. Larger virtual address space

64-bit operation enables applications to use a larger virtual address space. Although the Large Physical Address Extension (LPAE) extends the physical address space of a 32-bit processor to 40 bits, it does not extend the virtual address space. Even with LPAE, a single application is therefore limited to a 32-bit (4GB) address space, and in practice part of that space is reserved for the operating system.

A larger virtual address space also makes it possible to memory-map larger files, that is, to map the file contents into the address space of a thread. This works even when physical RAM is not large enough to hold the entire file.
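
As a minimal sketch of memory-mapped file access, the following uses Python's standard `mmap` module to read a few bytes from the middle of a file without reading the whole file into a buffer first. The file contents and the helper name `read_slice_via_mmap` are invented for the demonstration; a small temporary file stands in for a large one.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded on demand by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * 4096 + b"needle" + b"B" * 4096)
    demo_path = tmp.name

demo_bytes = read_slice_via_mmap(demo_path, 4096, 6)
os.remove(demo_path)
print(demo_bytes)
```

The mapping consumes virtual address space equal to the file size, which is why the size of the virtual address space limits how large a file can be mapped at once.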

32-bit address space

As a 32-bit architecture, ARM supports a maximum address space of 4GB (2^32 bytes), which can be viewed as 2^32 8-bit byte units. Each byte address is an unsigned 32-bit value in the range 0 to 2^32 - 1. The address space can also be viewed as 2^30 32-bit word units (1 word = 4 bytes). Word addresses are divisible by 4, that is, the lowest two bits of the address are 00. The word at address A consists of the four bytes at addresses A, A+1, A+2 and A+3.

When every instruction is 4 bytes long, as in the 32-bit ARM instruction set, the program counter advances by 4 bytes each time an instruction is executed.
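
The byte and word addressing arithmetic above can be checked with a few lines of Python. The constant and function names here are illustrative only.

```python
ADDRESS_BITS = 32
BYTE_UNITS = 2 ** ADDRESS_BITS    # number of byte addresses (4GB)
WORD_UNITS = BYTE_UNITS // 4      # number of 32-bit word units (2**30)


def is_word_aligned(address):
    """A word address has its two lowest bits equal to 00."""
    return address & 0b11 == 0


def bytes_of_word(address):
    """Return the four byte addresses that make up the word at `address`."""
    if not is_word_aligned(address):
        raise ValueError("word addresses must be divisible by 4")
    return [address + i for i in range(4)]


print(BYTE_UNITS, WORD_UNITS)
print([hex(a) for a in bytes_of_word(0x1000)])
```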

4. Larger physical address space

Software running on a 32-bit architecture may need to map some data in or out of memory while it executes. Having a larger address space (using 64-bit pointers) avoids this problem.

However, 64-bit pointers come at a cost: the same piece of code typically uses more memory than it would with 32-bit pointers.

Each pointer stored in memory requires 8 bytes instead of 4. This may sound trivial, but it can add up to a significant burden. In addition, the larger memory footprint can lower cache hit rates, which in turn can degrade performance.

  • 64-bit pointers: 8 bytes
  • 32-bit pointers: 4 bytes
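
A rough calculation shows how the pointer sizes listed above add up. The sketch counts only the pointers themselves in a hypothetical table of one million entries; the figure of one million is an arbitrary assumption, and `struct.calcsize("P")` simply reports the pointer size of the Python build that runs the code.

```python
import struct


def pointer_array_bytes(count, pointer_size):
    """Memory taken by `count` pointers of `pointer_size` bytes each."""
    return count * pointer_size


# Pointer size of the interpreter running this sketch (4 or 8 on common builds).
NATIVE_POINTER_SIZE = struct.calcsize("P")

entries = 1_000_000   # arbitrary example size
bytes_32 = pointer_array_bytes(entries, 4)
bytes_64 = pointer_array_bytes(entries, 8)
print(f"32-bit pointers: {bytes_32} bytes, 64-bit pointers: {bytes_64} bytes")
print(f"this interpreter uses {NATIVE_POINTER_SIZE}-byte pointers")
```

Doubling the pointer size doubles the footprint of pointer-heavy data structures, so fewer of their entries fit in a cache of a given size.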

ARMv8-A architecture

The ARM architecture dates back to 1985, and it has grown tremendously since the early ARM cores, adding features and functionality at every step.

ARMv4 and earlier

These early processors used only the ARM 32-bit instruction set.

ARMv4T

The ARMv4T architecture added the Thumb 16-bit instruction set to the ARM 32-bit instruction set. It was the first widely licensed architecture and is implemented by the ARM7TDMI® and ARM9TDMI® processors.

ARMv5TE

The ARMv5TE architecture adds improvements for DSP type operations, saturated arithmetic, and ARM and Thumb interworking. The ARM926EJ-S® implements this architecture.

ARMv6

ARMv6 includes several enhancements, including support for unaligned memory accesses, significant changes to the memory architecture, and support for multiprocessors. It also includes some support for SIMD operations on bytes or halfwords held in 32-bit registers. The ARM1136JF-S® implements this architecture. The ARMv6 architecture also provides some optional extensions, notably Thumb-2 and the Security Extensions (TrustZone®). Thumb-2 extends Thumb to a mixed-length 16-bit and 32-bit instruction set.

ARMv7-A

The ARMv7-A architecture makes the Thumb-2 extensions mandatory and adds the Advanced SIMD extensions (NEON). Prior to ARMv7, all cores followed essentially the same architecture or feature set. To address a growing range of applications, ARM introduced a set of architecture profiles:

  • ARMv7-A provides all the features required to support platform operating systems such as Linux.
  • ARMv7-R provides predictable real-time high performance.
  • ARMv7-M targets deeply embedded microcontrollers. An M profile was also retrofitted to the ARMv6 architecture to bring these features to the older architecture. The ARMv6-M profile is used by low-power, low-cost microprocessors.

ARMv8-A

The ARMv8 architecture includes 32-bit and 64-bit execution. It introduces the use of 64-bit wide registers while maintaining backward compatibility with existing ARMv7 software.

Development of the ARMv8 architecture

The ARMv8-A architecture introduces a number of changes that allow the design of higher performing processor implementations:

Larger physical address

This enables the processor to access more than 4GB of physical memory.

64-bit virtual addressing

This allows virtual memory beyond the 4GB limit. This is important for modern desktop and server software that uses memory-mapped file I/O or sparse addressing.

Automatic event signaling

This enables power-efficient, high-performance spinlocks.

Larger Register File

Thirty-one 64-bit general-purpose registers improve performance and reduce stack usage.

Efficient 64-bit immediate data generation

There is less need for literal pools.
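
One way a 64-bit constant can be built directly in a register, rather than loaded from a constant pool in memory, is a MOVZ instruction followed by MOVK instructions that each insert one 16-bit chunk. The sketch below only generates illustrative assembly text for that pattern; the function name `mov_sequence` is hypothetical, and real compilers and assemblers may choose other encodings (for example MOVN or bitmask immediates) when they are shorter.

```python
def mov_sequence(value, reg="x0"):
    """Return MOVZ/MOVK-style lines that build a 64-bit constant in `reg`."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value must fit in 64 bits")

    # Split the constant into four 16-bit chunks, lowest chunk first.
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    # Zero chunks need no instruction; keep one chunk so zero itself is emitted.
    nonzero = [(s, c) for s, c in chunks if c != 0] or [(0, 0)]

    lines = []
    for index, (shift, chunk) in enumerate(nonzero):
        # MOVZ writes the chunk and zeroes the rest; MOVK keeps the other bits.
        opcode = "movz" if index == 0 else "movk"
        lines.append(f"{opcode} {reg}, #0x{chunk:x}, lsl #{shift}")
    return lines


for line in mov_sequence(0x0000000100005678):
    print(line)
```

A constant with only a few non-zero 16-bit chunks needs correspondingly few instructions, and at most four are needed for any 64-bit value with this scheme.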

Larger PC-relative addressing range

A +/-4GB addressing range enables efficient data addressing in shared libraries and position-independent executables.
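
The +/-4GB figure follows from page-relative addressing of the kind performed by the A64 ADRP instruction: a signed 21-bit count of 4KB pages is added to the page of the current PC, and the low 12 bits of the target are applied by a following instruction. The sketch below models that arithmetic in Python; the function names are hypothetical and the model ignores instruction encoding details.

```python
ADRP_IMM_BITS = 21   # signed page count; 2**20 pages * 4KB = 4GB each way


def adrp_split(pc, target):
    """Split `target` into (page_delta, low12) relative to the page of `pc`."""
    page_delta = (target >> 12) - (pc >> 12)   # signed number of 4KB pages
    limit = 1 << (ADRP_IMM_BITS - 1)
    if not -limit <= page_delta < limit:
        raise ValueError("target is outside the +/-4GB page-relative range")
    return page_delta, target & 0xFFF


def adrp_resolve(pc, page_delta, low12):
    """Rebuild the target address from the PC page, page delta and low bits."""
    return (pc & ~0xFFF) + (page_delta << 12) + low12


delta, low12 = adrp_split(0x00400123, 0x00500ABC)
print(delta, hex(low12), hex(adrp_resolve(0x00400123, delta, low12)))
```

Because the result depends only on the distance between the code and the data, the same instructions work wherever the shared library or position-independent executable is loaded.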

Additional 16KB and 64KB translation granules

This reduces the translation lookaside buffer (TLB) miss rate and the depth of page table walks.
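
The effect of the 4KB, 16KB and 64KB sizes on lookup depth can be estimated with a short calculation. This sketch assumes 8-byte table descriptors and that each table fills exactly one unit of the chosen size, so each lookup level resolves log2(size) - 3 address bits. Real configurations can differ, for example by using a smaller or concatenated top-level table, so treat the results as an upper-bound estimate.

```python
def translation_levels(va_bits, granule_bytes):
    """Estimate the number of table lookup levels for a virtual address size."""
    page_bits = granule_bytes.bit_length() - 1   # log2 of a power-of-two size
    index_bits = page_bits - 3                   # entries per table = size / 8 bytes
    remaining = va_bits - page_bits              # bits the tables must resolve
    return -(-remaining // index_bits)           # ceiling division


for size in (4 * 1024, 16 * 1024, 64 * 1024):
    print(f"{size // 1024}KB: {translation_levels(48, size)} levels for 48-bit addresses")
```

Under these assumptions a 64KB size resolves a 48-bit address in fewer levels than a 4KB size, and each TLB entry also covers a larger region of memory.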

New exception model

This reduces the complexity of the operating system and hypervisor software.

Efficient cache management

User-space cache operations improve dynamic code generation efficiency. The Data Cache Zero instruction (DC ZVA) provides a fast way to clear a block of memory.

Hardware-accelerated cryptography

Provides 3× to 10× better software encryption performance. This is useful for encryption and decryption of small blocks that are too small to offload efficiently to a hardware accelerator, for example in HTTPS.

Load-Acquire, Store-Release instructions

Designed for the C++11, C11 and Java memory models. They improve the performance of thread-safe code by eliminating explicit memory barrier instructions.

NEON Double Precision Floating Point Advanced SIMD

This enables SIMD vectorization to be applied to a wider set of algorithms, such as scientific computing, High Performance Computing (HPC) and supercomputers.

ARMv8-A processors: A53, A57 and A73

[Figure: Comparison of A53 and A73]
[Table: Cortex-A73 processor implementation options]

All A73 cores share a common L2 cache, and each core has the same configuration for all parameters.

Cortex-A53 processor

The Cortex-A53 processor is a mid-range, low-power processor with one to four cores in a single cluster. Each core has an L1 cache subsystem, and the processor offers an optional integrated GICv3/4 interface and an optional L2 cache controller.
The Cortex-A53 processor is an extremely power-efficient processor capable of supporting 32-bit and 64-bit code. It offers significantly higher performance than the highly successful Cortex-A7 processor. It can be deployed as a standalone application processor or paired with a Cortex-A57 processor in a big.LITTLE configuration for optimal performance, scalability and power efficiency.

The Cortex-A53 processor has the following features:

  • In-order, eight-stage pipeline.
  • Lower power consumption through hierarchical clock gating, power domains, and advanced retention modes.
  • Increased dual-issue capability through duplicated execution resources and dual instruction decoders.
  • Power-optimized L2 cache design that provides lower latency and balances performance and efficiency.

Cortex-A57 processor

The Cortex-A57 processor targets mobile and enterprise computing applications, including compute-intensive 64-bit applications such as high-end computers, tablets and server products. It can be used in an ARM big.LITTLE configuration with a Cortex-A53 processor for scalable performance and more efficient energy usage.

The Cortex-A57 processor features cache-coherent interoperability with other processors, including the ARM Mali™ family of graphics processing units (GPUs) for GPU computing, and offers optional reliability and scalability features for high-performance enterprise applications. It offers higher performance than the ARMv7 Cortex-A15 processor and is more energy efficient. The inclusion of cryptographic extensions improves the performance of cryptographic algorithms by a factor of 10 compared to previous generation processors.

Cortex-A57 processor core

The Cortex-A57 processor fully implements the ARMv8-A architecture. It supports multi-core operation, with one to four cores multiprocessing within a single cluster. Multiple coherent SMP clusters are possible through AMBA 5 CHI or AMBA 4 ACE technology. Debugging and tracing are available through CoreSight technology.

The Cortex-A57 processor has the following features:

  • Out-of-order pipeline with 15 or more stages.
  • Power-saving features that include way prediction, tag reduction, and cache lookup suppression.
  • Increased peak instruction throughput through duplicated execution resources. Power-optimized instruction decoding with localized decoding and 3-wide decode bandwidth.
  • Performance-optimized L2 cache design that lets multiple cores in the cluster access the L2 simultaneously.

Cortex-A73 processor

At its release in 2016, the Cortex-A73 was ARM's latest A-series processor. It implements the full ARMv8-A architecture and includes a 128-bit AMBA 4 ACE interface and ARM's big.LITTLE system integration interface. Built on a 10nm process, it can deliver 30% higher sustained performance than the Cortex-A72, which makes it well suited to mobile and consumer devices.
[Figure: Example Cortex-A73 processor configuration]

Origin blog.csdn.net/luolaihua2018/article/details/124559092