Cache consistency

Preface

Questions and thoughts to study the cache with:
1. Where are the L1/L2/L3 caches located? How big is each of them?
2. How is the L1/L2/L3 cache organized? N-way set associative?
3. Have you ever seen a VIVT cache? Why learn about VIVT caches at all? They mostly interfere with your understanding of the cache, so you might as well not learn them.
4. So is the cache VIPT or PIPT? Or can VIPT and PIPT both exist in one core?
5. Do you want to learn the MESI protocol? Can you actually remember it? Is it MESI you don't understand, or the cache architecture?
6. What is MOESI? Do mainstream cores today use MESI or MOESI?
7. MESI is just a protocol; some piece of hardware has to execute it. Which hardware is that?
8. The MESI protocol has four states. Where are these four states recorded?
9. Among the L1/L2/L3 caches, or between the per-core caches and the cluster cache, which caches are kept coherent by the MESI protocol and which are not? Why is it designed this way?
10. How many bytes of data are in a cache line? When analyzing problems, why do people always argue by cases, "what if the cache line is 16 bytes, what if it is 64 bytes"? Don't you know that the cache lines of mainstream Arm cores are all 64 bytes?
11. What is the cache TAG, and what does it contain? Don't tell me the cache TAG is simply a physical address.
12. What is in a cache line? Why is there no index stored in it?
13. Is the L2 cache inside the core or inside the cluster?
14. If a piece of memory is configured as non-cacheable, why does it not end up in the cache?
15. The caching policy is defined in the attributes of the page table entry. If the MMU is disabled, what caching policy applies when the CPU reads and writes memory?
16. As a software engineer, what can you modify about the caching policy of the L1/L2/L3 caches? Which parts are fixed in hardware and cannot be modified? And what are the replacement policies?
17. What is an inclusive cache? What is an exclusive cache? What about strictly and weakly inclusive?
18. Do you understand concepts such as CCI, SCU, DSU, ACE, CHI?
19. How do you configure the cacheable attribute of a page? How do you configure the cacheable attribute of the page table itself?

Preface

As a low-level security engineer and a front-line customer support FAE, my work involves many modules such as TF-A, TEE, TAs, the Linux kernel, Linux native programs, and so on, as well as some hardware module drivers. These different pieces of hardware and system software use different memory attribute configurations and different caching policies, so when multiple hardware blocks and multiple pieces of software communicate through shared memory, we run into all kinds of problems. Very often it is the customer who asks the soul-searching questions, and to come across as professional, an FAE has to understand the underlying principles...

I am not an expert, let alone a big shot. I just read some Arm documents, added my own understanding, and summarized it into this article. When summarizing, I based everything on official material and tried not to talk nonsense; for the information I could not find, I consulted some ASIC experts. Compared with other modules (such as the MMU, exceptions, the GIC...), the cache is probably the hardest to understand. Fortunately, most of its work is done for us by hardware, which keeps the software side simple; but the more automatic the hardware behavior, the harder it is for us software engineers to understand, because we cannot see the data or the design, and much of it has to be inferred.

Finally, I hope this series of articles is helpful to everyone. Study hard, make progress every day, and keep at it, comrades.

Notes:

  • This series is based on the Armv8/Armv9 architecture. Where the execution state matters, it is AArch64; where a specific core matters, it is the Cortex-A710 and Cortex-A53.
  • Most of the content comes from official Arm documents, a small part from consulting ASIC colleagues, plus some of my own understanding...

Before introducing cache consistency, let's clarify a few concepts.

1. When we talk about the memory space, remember that it includes both ordinary memory and register space.

2. The relationship between the CPU, cache, memory, and peripherals, and how cache inconsistency arises.

There are two ways for the CPU to write to memory:
1. Write-through: every CPU write is propagated to memory at the same time as it updates the cache, so cache and memory always agree.
2. Write-back: the CPU writes only into the cache; the cache hardware later writes dirty lines back to memory when they are evicted by the replacement algorithm (for example LRU). This is the usual mode.

Assume there is a red region in MEM, and the CPU has read it, so the red region has also been brought into the CACHE:

Now suppose DMA moves white data from a peripheral into the memory location that used to be red:

At this point, although the memory is white, the CPU still reads red, because the read hits in the cache; the cache has become incoherent with memory.
Conversely, when the CPU writes data towards memory, the write may only land in the cache first (not necessarily in memory). If a DMA transfer from memory to the peripheral is started at this point, the peripheral may get the stale data that is still in memory.


So the simplest way to keep things coherent is to let the CPU access the DMA buffer without the cache. In fact, the memory allocated by dma_alloc_coherent() is configured as uncached by default.
 

However, modern SoCs can be very capable, and some of them use hardware to keep the CPU and peripherals cache coherent, such as the cache coherent interconnect in the figure:

Such SoC vendors can override the kernel's generic implementation, and then the memory allocated by dma_alloc_coherent() can also be cacheable. This part was explained by Arnd Bergmann:

From: https://www.spinics.net/lists/arm-kernel/msg322447.html

Arnd Bergmann:

dma_alloc_coherent() is a wrapper around a device-specific allocator, based on the dma_map_ops implementation. The default allocator from arm_dma_ops gives you uncached, buffered memory. It is expected that the driver uses a barrier (which is implied by readl/writel but not __raw_readl/__raw_writel or readl_relaxed/writel_relaxed) to ensure the write buffers are flushed.

If the machine sets arm_coherent_dma_ops rather than arm_dma_ops, the memory will be cacheable, as it's assumed that the hardware is set up for cache-coherent DMAs.
 

When I grepped the kernel source, I found that some SoCs do implement it this way:

So the memory allocated by dma_alloc_coherent() can also be cacheable. It depends on the hardware and the vendor, but by default it is not cached.

Therefore, there are generally three ways to solve cache inconsistency:

1. A hardware solution. For example, the "Cache Coherent Interconnect" hardware integrated into the SoC, described above, can let DMA snoop into the CPU's caches or help refresh them. In this case, the memory allocated by dma_alloc_coherent() does not need to be non-cacheable.

2. Disable the cache in software (kernel mechanism: consistency mapping).

3. DMA streaming mapping (kernel mechanism: streaming mapping).

The following mainly introduces the latter two.

In projects there are generally two situations that lead to cache inconsistency, and the kernel has a corresponding mechanism for each.

(1) Accessing register address space.

Registers are the interface through which the CPU communicates with peripherals. Some status registers are changed by the peripheral according to its own state, and this is opaque to the CPU: between two reads the register may have changed, yet the CPU could still read the old value held in its cache. Register accesses, however, must be consistent; this is the basis on which the kernel controls peripherals. The I/O space is mapped into kernel space through ioremap(), which configures the page table entries as uncached, so the data does not go through the cache and is read directly from the device address space, and consistency is guaranteed.

In this case the kernel already ensures data consistency, and the usage scenario is simple.
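For illustration, a minimal sketch of this pattern (the base address, size, and register offset below are hypothetical, not from the original article):

#include <linux/errno.h>
#include <linux/io.h>

#define DEMO_REG_BASE   0x10000000UL  /* hypothetical physical base of the device */
#define DEMO_REG_SIZE   0x1000
#define DEMO_REG_STATUS 0x04          /* hypothetical status register offset */

static void __iomem *demo_regs;

static int demo_init(void)
{
    u32 status;

    /* ioremap() maps the register space with uncached device attributes */
    demo_regs = ioremap(DEMO_REG_BASE, DEMO_REG_SIZE);
    if (!demo_regs)
        return -ENOMEM;

    /* readl()/writel() bypass the cache and include the needed barriers */
    status = readl(demo_regs + DEMO_REG_STATUS);
    writel(status | 0x1, demo_regs + DEMO_REG_STATUS);

    return 0;
}

Because every access goes straight to the device, the driver never sees a stale cached copy of the register.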

(2) Memory allocated for DMA.

DMA operations are also opaque to the CPU. Data updated in memory by DMA is completely invisible to the CPU, and vice versa: when the CPU writes data into a DMA buffer, the write may actually land only in the cache, so if DMA is started at that point, what it transfers from DDR is not what the CPU really intended.

In this case both the CPU and DMA can operate on the memory asynchronously, which leads to data inconsistency.

For memory that both the CPU and DMA can access, the kernel has proper management mechanisms, divided into two kinds:

1. Disable the cache for the memory used by DMA. This is the simplest approach, but it hurts performance.

2. During use, guarantee consistency by flushing / invalidating the cache at the right moments.

The generic DMA layer mainly provides two kinds of DMA mapping:

(1) Consistency mapping. Representative functions:

void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp);
void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle);
Drivers use this a lot: allocate an uncached buffer, so there is no need to worry about data consistency at all. Roughly, the code flow is: the attributes of the pages managed by the kernel are set to uncached, and when the translation entry is filled in (for example on a page fault), this attribute is written into the memory attribute field, which guarantees that the address range mapped by dma_alloc_coherent() is uncached.

dma_alloc_coherent() first flushes the cache for the allocated buffer and then changes the buffer's page table attributes to uncached, thereby ensuring that DMA and the CPU see the same data in this block.

On common hardware platforms (no hardware cache coherence component), with memory from dma_alloc_coherent() the CPU operates on memory directly, with no cache involved.

But there are exceptions: some SoCs are powerful enough that dma_alloc_coherent() can also hand out cacheable memory.
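As a usage sketch (the struct, buffer size, and function names below are hypothetical; error handling is trimmed), a driver typically allocates and frees a coherent buffer like this:

#include <linux/dma-mapping.h>
#include <linux/string.h>

#define DEMO_BUF_SIZE 4096          /* hypothetical buffer size */

struct demo_dev {
    struct device *dev;             /* the driver's underlying struct device */
    void *cpu_addr;                 /* CPU virtual address of the buffer */
    dma_addr_t dma_handle;          /* bus address to program into the device */
};

static int demo_alloc_coherent(struct demo_dev *d)
{
    /* Uncached (or hardware-coherent) buffer: no explicit cache maintenance needed */
    d->cpu_addr = dma_alloc_coherent(d->dev, DEMO_BUF_SIZE,
                                     &d->dma_handle, GFP_KERNEL);
    if (!d->cpu_addr)
        return -ENOMEM;

    /* The CPU can fill the buffer directly; DMA is given d->dma_handle */
    memset(d->cpu_addr, 0, DEMO_BUF_SIZE);
    return 0;
}

static void demo_free_coherent(struct demo_dev *d)
{
    dma_free_coherent(d->dev, DEMO_BUF_SIZE, d->cpu_addr, d->dma_handle);
}

Whether the returned memory is truly uncached or cacheable-but-coherent is decided by the platform's dma_map_ops, exactly as Arnd Bergmann describes above.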

(2) Streaming DMA mapping.

In real projects, within our own drivers we can use dma_alloc_coherent() to allocate consistent memory; but sometimes we cannot, because the memory to be transferred was not allocated by ourselves and is hard for us to control. The representative interfaces are:

dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, size_t size, enum dma_data_direction dir)
void dma_unmap_single(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir)
 
void dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir)
void dma_sync_single_for_device(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir)
 
 
int dma_map_sg(struct device *, struct scatterlist *, int, enum dma_data_direction);
void dma_unmap_sg(struct device *, struct scatterlist *, int, enum dma_data_direction);

The relevant interfaces are dma_map_sg(), dma_unmap_sg(), dma_map_single(), dma_unmap_single().
With the consistency-mapping approach, the kernel allocates a dedicated block of memory for DMA. Sometimes the driver does not do that, but lets the DMA engine work directly on memory handed down from an upper layer.

For example, a packet coming down from the protocol stack needs to be sent out through the network card.
The protocol stack does not know where this packet will end up, so the memory was allocated without any special treatment, and the memory holding the packet is normally cacheable. The socket buffer handed to you was not allocated by you, and it is not consistent memory from dma_alloc_coherent(), yet you need to send its contents out, or drop a received packet into it. What do you do then?

In this case, before the memory is used by DMA, you must call dma_map_sg() or dma_map_single() once; which one depends on whether your DMA engine supports scatter-gather. If it does, use dma_map_sg(); otherwise use dma_map_single(). After the DMA is done, the corresponding unmap interface must be called.

Since the data of the packet coming down from the protocol stack may still sit in the cache, dma_map_single() makes the CPU do a cache flush, writing the cached data out to memory, so that DMA reads the new data when it reads memory.

Sending (memory to device):
dma_map_single() (does a cache flush, writing the cache contents back to memory)
DMA sends the packet (the flush has been done, so the correct packet goes out)
dma_unmap_single()

Receiving (device to memory):
dma_map_single() (does a cache invalidate, so the CPU will later re-read from memory)
DMA receives the packet (the invalidate has been done, so the CPU sees the correct packet)
dma_unmap_single()

dma_map_single() takes a direction parameter that decides whether to invalidate or flush. Note that a direction must be specified when mapping, indicating whether the data flows from the peripheral to memory or from memory to the peripheral:
From memory to the peripheral: the CPU performs a cache flush, writing the new data in the cache back to memory.
From the peripheral to memory: the CPU invalidates the cache, so that subsequent CPU reads miss and fetch the new data from memory.

(The CPU's reading through the cache is done automatically by hardware; software cannot intervene in that, but it can do cache maintenance, i.e. invalidate or flush.)

Also note that these mappings are one-shot: map and unmap must be called around every data transfer, and while the mapping is active the CPU must not touch this memory, otherwise the data may become inconsistent (that is what dma_sync_single_for_cpu()/dma_sync_single_for_device() are for).
Likewise, the back-end implementation of dma_map_sg() and dma_map_single() depends on the hardware characteristics.
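A minimal sketch of the streaming flow in both directions (the device, buffer, and length are hypothetical; DMA engine programming and error paths are trimmed):

#include <linux/dma-mapping.h>

/* Transmit: the CPU filled buf, the device will read it (DMA_TO_DEVICE). */
static int demo_tx(struct device *dev, void *buf, size_t len)
{
    dma_addr_t dma;

    /* Mapping cleans (flushes) the cache so the device sees the CPU's data */
    dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, dma))
        return -ENOMEM;

    /* ... program the DMA engine with 'dma' and wait for completion ... */

    dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
    return 0;
}

/* Receive: the device writes into buf, the CPU reads it afterwards (DMA_FROM_DEVICE). */
static int demo_rx(struct device *dev, void *buf, size_t len)
{
    dma_addr_t dma;

    /* Mapping invalidates the cache so later CPU reads fetch fresh data */
    dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, dma))
        return -ENOMEM;

    /* ... program the DMA engine with 'dma' and wait for completion ... */

    /* Unmapping gives the buffer back to the CPU; only read buf after this */
    dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
    return 0;
}

If the CPU really has to touch the buffer while it is still mapped, dma_sync_single_for_cpu() and dma_sync_single_for_device() hand ownership back and forth without unmapping.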

Other methods

What was described above is conventional DMA. Some SoCs can use hardware to keep the CPU and peripherals cache coherent; for example, a hardware block called a "Cache Coherent Interconnect" is integrated into the SoC, which can let DMA snoop into the CPU's caches or help refresh the cache. In this case, the memory allocated by dma_alloc_coherent() does not need to be non-cacheable.
 

Extra Story
Keep in mind that dma_alloc_coherent() is just a front end; its concrete implementation depends on the hardware and the platform.

(1) In the following example, a hardware component guarantees cache coherence, so dma_alloc_coherent() can also hand out cacheable memory.

(2) DMA scatter-gather (whether the DMA engine supports scatter-gather): physical memory does not need to be contiguous; used with streaming mapping.

DMA scatter-gather (does the DMA engine support scatter-gather?)

Scatter: scattered (non-contiguous)

Gather: gathered (contiguous)

With scatter-gather, the DMA memory can be physically non-contiguous: data in contiguous memory can be moved into non-contiguous memory, and data in non-contiguous memory can be moved into contiguous memory.

The hardware can transfer several buffers back to back without requiring physically contiguous memory.

If you want to do a streaming mapping in this case, use dma_map_sg(), which flushes (or invalidates) the cache for each of the non-contiguous chunks, as sketched below.
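A rough sketch of mapping a scatterlist for a transmit (the scatterlist is assumed to be already built; the names and direction are hypothetical):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Map 'nents' possibly non-contiguous buffers for the device to read. */
static int demo_map_sg(struct device *dev, struct scatterlist *sgl, int nents)
{
    int mapped;

    /* Each entry gets the cache maintenance required by DMA_TO_DEVICE */
    mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
    if (mapped == 0)
        return -ENOMEM;

    /* ... hand the 'mapped' entries (sg_dma_address()/sg_dma_len()) to the DMA engine ... */

    dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
    return 0;
}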

(3) A DMA engine behind an MMU (called an IOMMU or SMMU).

dma_alloc_coherent() is just a front end; the back end may take memory directly from the buddy allocator, or from CMA, or go through an IOMMU / SMMU.

An IOMMU or SMMU can map non-contiguous physical addresses to a contiguous range of device-visible virtual addresses through its page tables, and DMA then uses those addresses.

All of this is done by hardware, so the physical memory behind dma_alloc_coherent() can be non-contiguous.

For specific usage of dma, please refer to:

Documentation/Dmaengine.txt
Documentation/DMA-API-HOWTO.txt
Documentation/DMA-API
drivers/dma/dmatest.c (DMA driver test code in Linux)
Source: https://blog.csdn.net/Adrian503/article/details/115536886 (CSDN blogger "Adrian503")

See also: https://www.cnblogs.com/dream397/p/15660063.html

DMA and multi-core consistency

See also the articles: "Diagram | CPU-Cache | Consistency" on Zhihu, and "The basic principles of cache and multi-core cache consistency" on Jianshu.

Origin: https://blog.csdn.net/u012294613/article/details/132321861