L1 Cache architecture in ARM

This article summarizes material from the ARM Programmer's Guide and discusses the differences between VIVT (virtually indexed, virtually tagged), VIPT (virtually indexed, physically tagged), and PIPT (physically indexed, physically tagged) caches through practical examples.

Virtually Addressed Caches (VIVT)

ARM processor L1 caches from more than a decade ago adopted the VIVT cache architecture. The main reason is given in section 5.5.1 of the ARM architecture reference of that era:

It allows cache line look-up to proceed in parallel with address translation.

Since cache access and address translation execute at the same time, the time to obtain a physical memory address after a cache miss is shortened (reducing the miss penalty). Moreover, since cache hits are the common case and a hit does not need to go through address translation at all, this also helps performance.
But because of the characteristics of virtual addresses, a VIVT cache causes some serious problems, including:

Problems

    1. Homonyms
       The same virtual address in different processes points to different physical addresses. Because the virtual address is the same, the same cache line is used; after switching to another process, the access is misjudged as a cache hit and the wrong data is returned.
    2. Synonyms
       Different virtual addresses point to the same physical address, such as with mmap or multiple processes sharing the same data. Because the virtual addresses differ, the same data is cached in different cache lines, so one process may update its cached copy while the other process's copy is not updated, returning stale data.
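The homonym problem above can be made concrete with a minimal sketch (not a model of any real ARM core): a direct-mapped cache that indexes and tags purely by virtual address, accessed by two hypothetical processes that map the same virtual address to different physical pages. All names and parameters here are illustrative.

```python
# A toy direct-mapped VIVT cache: hits are decided entirely on the
# virtual address, so a context switch without a flush returns the
# previous process's data (the homonym problem).

LINE_SIZE = 64
NUM_SETS = 64

class VivtCache:
    def __init__(self, memory):
        self.memory = memory          # "physical memory": dict pa -> value
        self.lines = {}               # set index -> (virtual tag, value)

    def read(self, va, translate):
        index = (va // LINE_SIZE) % NUM_SETS
        vtag = va // (LINE_SIZE * NUM_SETS)
        hit = self.lines.get(index)
        if hit and hit[0] == vtag:    # hit decided on the virtual tag only
            return hit[1]
        value = self.memory[translate(va)]
        self.lines[index] = (vtag, value)
        return value

memory = {0x1000: "data-of-process-A", 0x2000: "data-of-process-B"}
cache = VivtCache(memory)

# Two processes map the same VA 0x0 to different physical pages.
print(cache.read(0x0, lambda va: 0x1000 + va))  # process A: miss, loads A's data
# Context switch to process B *without* flushing the cache:
print(cache.read(0x0, lambda va: 0x2000 + va))  # false hit: still A's data
```

The second read never consults the page tables at all, which is exactly why a flush (or some other mechanism) is required on every context switch.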

Possible solutions

  • Flush Cache
    For the problem of cache data inconsistency, a direct, brute-force solution is to flush the cache: on every context switch, the cache contents are cleared and fetched from main memory again. Taking ARMv5 as an example, because the L1 cache is split into a data cache and an instruction cache, the flush procedure is:
    • if the data cache is write-back, clean it
    • invalidate the data cache
    • invalidate the instruction cache
    • drain the write buffer
    In addition, a virtual-cache access still has to check page permissions, so the cache unit actually asks the TLB for the permissions of the virtual address; therefore, on a context switch the TLB must also be invalidated.
    From the above we can see that the cost of flushing the cache is quite high: data must be cleaned and reloaded on every context switch. As mentioned in The ARM Fast Context Switch Extension for Linux, the flush during a context switch can take as long as 200 microseconds, which is a serious problem for applications with strict real-time requirements.

Fast Context Switch Extension

To reduce the number of cache flushes, the ARMv5-era architecture provides the Fast Context Switch Extension (FCSE); a 2009 paper, The ARM Fast Context Switch Extension for Linux, examines its use in software under Linux. FCSE mainly addresses the problem of different processes using the same virtual address range.
(Image: The ARM Fast Context Switch Extension for Linux)

The concept is to replace the top 7 bits of the virtual address with a process ID, so that different processes occupy different virtual address ranges. The problem with this design is that the number of processes is limited: 7 bits allow at most 2^7 = 128 processes. The paper also notes that this mechanism is only applicable to certain embedded-system scenarios.
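The address modification described above can be sketched as follows. This is a simplified model of the scheme, assuming a 32-bit virtual address whose top 7 bits (bits 31:25) are replaced by the process ID when they are zero, i.e. when the address falls in the bottom 32 MB window; addresses outside that window pass through unchanged.

```python
# Sketch of the FCSE-style address modification described above:
# a VA in the bottom 32 MB has its top 7 bits replaced by the 7-bit PID.

FCSE_PID_SHIFT = 25          # 32-bit VA, top 7 bits are bits [31:25]

def to_modified_va(va, pid):
    assert 0 <= pid < 128, "7 bits allow at most 2**7 = 128 process IDs"
    if (va >> FCSE_PID_SHIFT) == 0:       # VA lies in the relocated 32 MB window
        return va | (pid << FCSE_PID_SHIFT)
    return va                             # addresses above 32 MB pass through

# Two processes using the same VA now present distinct addresses to the cache:
print(hex(to_modified_va(0x8000, pid=1)))   # 0x2008000
print(hex(to_modified_va(0x8000, pid=2)))   # 0x4008000
```

Because the cache sees the modified address, the two processes no longer collide on the same cache line, at the cost of each process being confined to the low 32 MB window.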

From Virtual Cache to Physical Cache

As application scenarios grew more complex and multi-core processors developed, the cost of invalidating and cleaning caches came to have a non-negligible impact on performance (a supporting data chart will be added later); this point is specifically mentioned in ARM's Cortex-A series processor documentation. Starting from ARMv6, the L1 cache therefore changed to a VIPT instruction cache and a PIPT data cache.

(Image: ARM® Cortex™-A Series Programmer's Guide)

Physically-Addressed Caches (VIPT)

Most L1 caches of modern ARM processors have adopted physically addressed caches to reduce the coherency problems caused by VIVT. The virtually indexed, physically tagged (VIPT) cache exists to strike a balance between address-translation latency and cache-access latency: the index is taken from the virtual address, so part of the cache-line lookup can proceed first, while the physical tag becomes available after address translation, at which point the final tag comparison is performed.
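The lookup flow just described can be sketched in a few lines. This is a simplified model under assumed parameters (64-byte lines, 64 sets), not any real core's implementation; in hardware the set selection and the translation happen in parallel, which the comments indicate.

```python
# Sketch of a VIPT lookup: the set index comes from the virtual address,
# so set selection can start while the MMU translates; the tag comparison
# then uses the physical address.

OFFSET_BITS = 6    # 64-byte cache line
INDEX_BITS = 6     # 64 sets

def vipt_lookup(sets, va, translate):
    index = (va >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # from VA, no TLB needed
    candidates = sets[index]                  # set selection proceeds in parallel...
    pa = translate(va)                        # ...with the address translation
    ptag = pa >> (OFFSET_BITS + INDEX_BITS)   # tag taken from the physical address
    return ptag in candidates                 # final compare uses the physical tag

sets = [set() for _ in range(1 << INDEX_BITS)]
identity = lambda va: va                      # identity mapping for illustration
va = 0x1234
sets[(va >> OFFSET_BITS) & 0x3F].add(va >> 12)   # pre-fill the line for this address
print(vipt_lookup(sets, va, identity))           # hit via the physical tag
```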

However, because VIPT uses part of the virtual address, avoiding the synonym problem of VIVT imposes a constraint tied to the hardware page size.

For example, with the common 4 KB (2^12) page size, bits 0-11 of the virtual and physical address have the same value. Exploiting this, as long as the cache index and cache offset bits fit within those 12 bits, data at a given physical address is guaranteed to land in the same cache set.

This limits the cache size to 2^12 × the number of ways. For example, with a 4 KB (2^12) physical page, a 64-byte (2^6) cache line, and a four-way set-associative cache, the cache size is 4 KB × 4 = 16 KB.

What happens if the page size is 4KB and the four-way associative cache size is 64KB?

If the cache size is 64 KB, the cache index needs 8 bits and the cache-line offset needs 6 bits. 8 + 6 = 14 exceeds the 12 bits of the page offset, and the extra 2 bits can differ after address translation, reintroducing the synonym problem.
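The arithmetic in the two examples above can be checked with a small helper. The function below is illustrative: it derives the index and offset bit counts from the cache geometry and compares them against the 12-bit page offset.

```python
# Quick check of the arithmetic above: how many index+offset bits a VIPT
# cache uses, and how many spill past the 12-bit page offset (aliasing bits).

def index_offset_bits(cache_size, ways, line_size):
    sets = cache_size // (ways * line_size)
    # bit_length() - 1 gives log2 for powers of two
    return (sets.bit_length() - 1) + (line_size.bit_length() - 1)

PAGE_OFFSET_BITS = 12   # 4 KB pages

for size_kb in (16, 64):
    bits = index_offset_bits(size_kb * 1024, ways=4, line_size=64)
    overlap = max(0, bits - PAGE_OFFSET_BITS)
    print(f"{size_kb} KB, 4-way: {bits} index+offset bits, "
          f"{overlap} aliasing bit(s)")
```

For 16 KB the 12 bits fit exactly within the page offset; for 64 KB two bits spill over, matching the 2 aliasing bits discussed above.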
Possible solutions - page coloring

Faced with the above problem, one solution is page coloring: treat the extra 2 bits as a "color", which with 2 bits gives 2^2 = 4 colors.

The OS then constrains mappings so that a virtual page is only mapped to a physical page of the same color. This way, different virtual addresses that map the same physical address always use the same cache index, avoiding the synonym problem.
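The coloring constraint can be sketched as a simple predicate. This is a schematic illustration, assuming 4 KB pages and the 2 aliasing bits from the 64 KB example above; the color is simply those 2 bits of the page number.

```python
# Sketch of the page-coloring constraint with 2 aliasing bits (4 colors):
# the OS only maps a virtual page to a physical page of the same color, so
# every alias of a physical line lands in the same cache set.

COLOR_BITS = 2
PAGE_SHIFT = 12          # 4 KB pages

def color(addr):
    return (addr >> PAGE_SHIFT) & ((1 << COLOR_BITS) - 1)

def mapping_allowed(va, pa):
    return color(va) == color(pa)

print(mapping_allowed(0x0000_3000, 0x0004_3000))  # allowed: both color 3
print(mapping_allowed(0x0000_3000, 0x0004_2000))  # rejected: colors 3 vs 2
```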

Physically-Addressed Caches (PIPT)

If we do not want the cache size limited by the page size, and also want to avoid page coloring, the simplest and most direct way is to adopt PIPT. This does increase cache-hit latency, but compared with the flush costs and possible data inconsistency that VIVT incurs on every context switch, PIPT is relatively simple. The L1 data cache from ARMv6 onward uses PIPT.

VIPT vs. PIPT

In the ARM Cortex-A series documents mentioned above, the L1 instruction cache is a VIPT four-way set-associative cache, while the data cache is a PIPT four-way set-associative cache. What considerations led to the two caches being designed differently?

Someone raised this question in the ARM community:
Why is the I-cache designed as VIPT, while the D-cache as PIPT?
And an ARM engineer gave this answer:

PIPT caches are more flexible (can share data across processes without needing a cache flush on context switch), but more power intensive (I need to do a TLB lookup on every access rather than just on line-fill (i.e. when I miss)).
The instruction cache doesn’t generally need these advantages - it’s read-only - and the disadvantage is a significant one if you care about building a very power efficient core.

Origin blog.csdn.net/weixin_45264425/article/details/132434358