A Deep Dive into Cache Coherency and the MESI Cache Coherency Protocol (Part 1)

First, let's sort out some basics about processor caches.

A cache is a storage component whose access speed is much higher than main memory's but whose capacity is smaller; each processor has its own cache. Once a cache is introduced, the processor no longer reads and writes main memory directly when executing memory operations; it goes through the cache instead. A variable name corresponds to a memory address, and the variable's value is stored in the corresponding memory location.

Internally, a cache is structured like a chained hash table. It contains a number of buckets (called Sets in hardware), and each bucket can contain several cache entries (Cache Entry).

Figure: internal structure of a cache

A cache entry can be further divided into three parts: Tag, Data Block, and Flag. The Data Block, also called a cache line (Cache Line), is the minimum unit of data exchange between the cache and main memory; it stores data read from memory or data that is about to be written to memory. The Tag contains part of the memory address (the upper bits) of the data in the cache line. The Flag indicates the status of the corresponding cache line. The capacity of a cache line (also called the cache line width) is generally a power of two, typically between 16 and 256 bytes. A cache line can store the values of several variables, and the values of several variables may end up in the same cache line.

Cache entry
When the processor performs a memory access, it decodes the memory address. The decoded result consists of three parts: tag, index, and offset. The index corresponds to the bucket number: it is used to locate the bucket that the memory address maps to;

since a bucket may contain several cache entries, the tag is effectively a relative number of a cache entry within the bucket: it is compared against the Tag portion of each cache entry in the same bucket to locate one specific entry. Since the cache line of a cache entry can store the values of several variables, the offset is the position within the cache line: it determines the starting position at which a variable's value is stored in the cache line.
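As a rough sketch of this decoding (the geometry below — 64-byte cache lines and 64 buckets — is an illustrative assumption, not taken from the text), splitting an address into tag, index, and offset is just bit manipulation:

```python
# Decompose a memory address into (tag, index, offset) for a cache with
# 2**offset_bits bytes per line and 2**index_bits buckets (sets).
# The 64-byte line / 64-bucket geometry is an illustrative assumption.

def decode(address, offset_bits=6, index_bits=6):
    offset = address & ((1 << offset_bits) - 1)                  # position within the line
    index = (address >> offset_bits) & ((1 << index_bits) - 1)  # which bucket
    tag = address >> (offset_bits + index_bits)                  # upper address bits
    return tag, index, offset

print(decode(0x12345))   # -> (18, 13, 5)
print(decode(64))        # -> (0, 1, 0): 64 bytes further lands in the next bucket
```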

Cache Hit

Given the decoded memory address, if the cache subsystem can find the corresponding cache line and the Flag of the cache entry containing it indicates that the entry is valid, we say the memory operation produces a cache hit (Cache Hit); otherwise, the memory operation produces a cache miss (Cache Miss).

From a performance perspective, we want to reduce cache misses.

Cache misses include read misses (Read Miss) and write misses (Write Miss), corresponding to memory read and write operations respectively. When a read miss occurs, the processor needs to load the data to be read from main memory and store it in the corresponding cache line. This process causes the processor to stall (Stall), unable to execute other instructions, which wastes the processor's processing capacity.

Cache misses are inevitable

Since the total capacity of the cache is much smaller than the total capacity of main memory, the same cache line may hold different data at different times, so cache misses are inevitable.
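To make this concrete, here is a toy direct-mapped cache (one entry per bucket — an illustrative simplification, not how real set-associative caches are built) that counts hits and misses for an access sequence; because capacity is finite, addresses that map to the same bucket evict each other:

```python
class DirectMappedCache:
    """Toy direct-mapped cache: one cache entry per bucket."""

    def __init__(self, line_bytes=64, num_sets=4):
        self.line_bytes = line_bytes
        self.num_sets = num_sets
        self.entries = {}            # index -> tag of the resident line
        self.hits = self.misses = 0

    def access(self, address):
        line = address // self.line_bytes
        index = line % self.num_sets     # which bucket
        tag = line // self.num_sets      # identifies the line within the bucket
        if self.entries.get(index) == tag:
            self.hits += 1               # cache hit
        else:
            self.misses += 1             # cache miss: fetch line, evict old one
            self.entries[index] = tag

cache = DirectMappedCache()
for addr in [0, 8, 16, 256, 0]:   # 0 and 256 map to the same bucket
    cache.access(addr)
print(cache.hits, cache.misses)    # -> 2 3: the access to 256 evicted line 0
```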

On Linux, we can use the kernel's perf tool to inspect a program's cache misses while it runs, for example `perf stat -e cache-references,cache-misses ./program`.

Caches are commonly organized into a level-1 cache (L1 Cache), a level-2 cache (L2 Cache), and a level-3 cache (L3 Cache).

The closer a cache is to the CPU, the faster it is, but the smaller its capacity. Most modern processors have two or three levels of cache, from bottom to top: L3 cache, L2 cache, L1 cache. Caches are also divided into instruction caches and data caches: the instruction cache holds program code, and the data cache holds program data.

L1 cache: local to each core, split into a 32 KB data cache (L1d) and a 32 KB instruction cache (L1i). An L1 access takes about 3 cycles, roughly 1 ns.

L2 cache: also local to each core, a buffer between the L1 cache and the shared L3 cache, 256 KB in size. An L2 access takes about 12 cycles, roughly 3 ns.

L3 cache: shared by all cores in a socket, divided into multiple 2 MB segments. An L3 access takes about 38 cycles, roughly 12 ns.

The L1 cache can be integrated directly into the processor core, so accessing it is very efficient. It generally consists of two parts: one for storing instructions and one for storing data. The closer a cache is to the processor, the faster it is, the more it costs to manufacture, and the smaller its capacity.

Cache coherency protocol

Cache coherency problem

When multiple threads concurrently access the same shared variable, the processor running each of these threads keeps a copy of the shared variable in its cache. This raises a new problem: after one processor updates its copy of the data, how do the other processors become "aware" of the update and respond appropriately, so that subsequent reads of the shared variable see the updated value? This is the cache coherency problem. For example: CPU-0 reads data from main memory into its cache; CPU-1 does the same thing, then modifies the value of count to 2 and updates its own cache, but the modified value is not written back to main memory. When CPU-0 accesses the variable again, its cache has not been updated, so it still holds the previous value, leading to inconsistent data.
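The count example can be reproduced with a toy model of two private caches and no coherency protocol (the class and variable names are illustrative): once each CPU has cached its own copy, CPU-1's update is invisible to CPU-0.

```python
main_memory = {"count": 1}

class Cpu:
    """A CPU with a private cache and NO coherency protocol."""

    def __init__(self):
        self.cache = {}                        # variable -> cached value

    def read(self, var):
        if var not in self.cache:              # miss: fetch from main memory
            self.cache[var] = main_memory[var]
        return self.cache[var]                 # hit: may return a stale value

    def write(self, var, value):
        self.cache[var] = value                # updates this cache only;
                                               # no write-back, no invalidation

cpu0, cpu1 = Cpu(), Cpu()
cpu0.read("count")          # CPU-0 caches count = 1
cpu1.read("count")          # CPU-1 caches count = 1
cpu1.write("count", 2)      # CPU-1 updates its copy to 2
print(cpu0.read("count"))   # -> 1  (CPU-0 still sees the stale value)
```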

The essence of the problem

In essence, the problem is how to prevent dirty reads and lost updates. Hardware vendors proposed many solutions and ultimately settled on cache coherency protocols.

With a bus lock, a CPU asserts a LOCK signal on the bus while it operates on the data in its cache. Requests from other processors are then blocked, and the locking processor has exclusive access to shared memory. A bus lock effectively locks the communication channel between the CPUs and memory, so it degrades the performance of all CPUs. For this reason, starting with the P6 family of processors, another mechanism appeared: cache locking.

With cache locking, if the memory region operated on during a LOCK operation is already cached in the processor's cache line, the processor does not assert the LOCK signal on the bus when it writes the locked region back to memory; instead it modifies the memory address internally and relies on the cache coherency mechanism to guarantee atomicity. The coherency mechanism prevents a memory region cached by two or more processors from being modified simultaneously: when another processor writes back data for a cache line that has been locked, that cache line becomes invalid. So when a CPU executes a locked instruction (one carrying the LOCK prefix), there are two effects:

  1. The LOCK prefix causes the processor's cache to be written back to memory; on P6 and later processors, the LOCK signal generally does not lock the bus, but locks the cache instead.
  2. One processor writing its cache back to memory causes the corresponding cache lines in other processors' caches to become invalid.

x86 solves this problem with the MESI protocol.

MESI (Modified-Exclusive-Shared-Invalid) is a widely used cache coherency protocol; the cache coherency protocol used by x86 processors is based on MESI.

MESI controls access to memory data in a way similar to a read-write lock: read operations on the same memory address can be concurrent, while write operations on the same address are exclusive, i.e. at any given time a write to a given memory address can be performed by only one processor. Under MESI, a processor must hold ownership of the data before writing it to memory.

To guarantee data consistency, MESI divides the status of a cache entry into four states: Modified, Exclusive, Shared, and Invalid, and on top of that defines a set of messages (Message) for coordinating the memory read and write operations of the various processors.

Under MESI, the Flag value of a cache entry has four possible values:

• Invalid (denoted I). The corresponding cache line does not contain a valid copy of the data for any memory address. This is the initial state of a cache entry.

• Shared (denoted S). The corresponding cache line contains a copy of the data for the corresponding memory address, and caches on other processors may also contain copies of the data for the same address. Accordingly, if a cache entry is in the Shared state and a cache entry with the same Tag value exists on another processor, then that entry is also in the Shared state. In this state, the data in the cache line is consistent with the data in main memory.

• Exclusive (denoted E). The corresponding cache line contains a copy of the data for the corresponding memory address, and it holds that copy exclusively: the caches of all other processors currently hold no valid copy of the data. In this state, the data in the cache line is consistent with the data in main memory.

• Modified (denoted M). The corresponding cache line contains the result of updates made to the corresponding memory address. Since MESI allows only one processor at a time to hold updated data for a given memory address, among the cache entries with the same Tag value across all processors' caches, at most one can be in this state at any time. In this state, the data in the cache line is inconsistent with the data in main memory.
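The four states and the transitions triggered by local and snooped events can be sketched as a small table-driven state machine (a simplified model: it tracks only the Flag of one cache entry and ignores the bus messages themselves):

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

# (current state, event) -> next state, for one cache entry.
# Events: local_read / local_write performed by this processor,
#         remote_read / remote_write snooped from another processor.
TRANSITIONS = {
    (I, "local_read"):   S,  # may instead become E if no other cache holds a copy
    (I, "local_write"):  M,  # after the Read Invalidate exchange completes
    (S, "local_write"):  M,  # after the Invalidate exchange completes
    (E, "local_write"):  M,
    (E, "remote_read"):  S,
    (M, "remote_read"):  S,  # after writing the line back to main memory
    (S, "remote_write"): I,
    (E, "remote_write"): I,
    (M, "remote_write"): I,
}

def next_state(state, event):
    # Events not listed (e.g. a local read in M/E/S) leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = I
for event in ["local_read", "remote_write", "local_write"]:
    state = next_state(state, event)
print(state)   # -> Modified
```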

MESI defines a set of messages (Message) for coordinating the memory read and write operations of the various processors, as shown in Table 11-1. By analogy with the HTTP protocol, MESI messages can be divided into request messages and response messages. When performing a memory read or write, a processor sends specific request messages on the bus (Bus) as needed; each processor also sniffs (Snoop, also called intercepts) the request messages that other processors put on the bus and, under certain conditions, sends the corresponding response messages on the bus.

[Table 11-1: MESI protocol messages (image)]

How does a processor read and write under the MESI protocol?

First, consider how Processor 0 reads the data S.

Suppose the data S at memory address A can be shared by Processor 0 and Processor 1.

Processor 0 looks up the cache entry corresponding to address A and reads the entry's Tag and Flag values (the cache entry state). For ease of discussion, we won't dwell on Tag matching here. If the state of the cache entry Processor 0 finds is M, E, or S, the processor can read the data at address A directly from the corresponding cache line, without sending any message on the bus. If the state of the cache entry Processor 0 finds is I, its cache does not contain a valid copy of the data S; Processor 0 then needs to send a Read message on the bus to read the data at address A, and another processor (Processor 1) or main memory must reply with a Read Response carrying the corresponding data.

When Processor 0 receives the Read Response message, it stores the data carried in the message (a data block containing the data S) into the corresponding cache line and updates the state of the corresponding cache entry to S. The Read Response that Processor 0 receives may come from main memory or from another processor (Processor 1).

Processor 1 sniffs the messages that other processors send on the bus. When Processor 1 sniffs a Read message, it extracts the memory address to be read from the message and looks up the corresponding cache entry in its own cache. If Processor 1 finds a cache entry whose state is not I (the cases shown in Table 11-2), then its cache holds a copy of the data being read; in that case Processor 1 constructs a Read Response message containing the entire data block of the corresponding cache line (not just the data S that Processor 0 requested) and sends it on the bus. If the state of the cache entry Processor 1 finds is M, Processor 1 may write the data of the corresponding cache line to main memory before putting the Read Response on the bus; after sending the Read Response, the state of the corresponding cache entry is updated to S. If Processor 1 finds the cache entry in state I, then the Read Response that Processor 0 receives comes from main memory.

It follows that when Processor 0 reads memory, even if Processor 1 has updated the data and the update still sits in Processor 1's cache (making the data in that cache inconsistent with the corresponding data in main memory), under the coordination of the MESI messages this inconsistency does not cause Processor 0 to read a stale value.
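The read flow just described can be modeled in a few lines (a sketch under the assumptions above, not a cycle-accurate model): Processor 1 holds the updated data in state M; when it snoops Processor 0's Read, it writes the line back, replies, and both entries end up in S.

```python
main_memory = {"A": 1}

class Cache:
    def __init__(self):
        self.state = "I"
        self.data = None

p0, p1 = Cache(), Cache()
p1.state, p1.data = "M", 2        # Processor 1 updated A to 2 (not yet written back)

# Processor 0 reads address A: its entry is I, so it sends Read on the bus.
# Processor 1 snoops the Read; its entry is M, so it writes back first.
if p0.state == "I":
    if p1.state == "M":
        main_memory["A"] = p1.data    # write the modified line back to memory
        p1.state = "S"
    # Read Response comes from Processor 1 if it holds a copy, else from memory.
    response = p1.data if p1.state != "I" else main_memory["A"]
    p0.data, p0.state = response, "S"

print(p0.data, p0.state, p1.state, main_memory["A"])   # -> 2 S S 2
```

Note that Processor 0 reads 2, not the stale 1: the snooping processor supplied the fresh data.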

Next, consider how Processor 0 writes data to address A.

Any processor performing a memory write must first own the corresponding data. When performing a memory write, Processor 0 first looks up the corresponding cache entry based on the memory address A. If the state of the cache entry Processor 0 finds is E or M, the processor already owns the data; in that case it can write the data directly into the corresponding cache line and update the entry's state to M. If the cache entry Processor 0 finds is not in state E or M, the processor needs to send an Invalidate message on the bus to obtain ownership of the data. On receiving the Invalidate message, the other processors update the state of the corresponding cache entries in their caches to I (which amounts to deleting their copies of the data) and reply with an Invalidate Acknowledge message. The processor that sent the Invalidate message (i.e. the one performing the write) must wait until it has received the Invalidate Acknowledge replies from all other processors before it writes the data into the corresponding cache line.


If the state of the cache entry Processor 0 finds is S, the cache of Processor 1 may also hold a copy of the data for address A (scenario 1); in that case Processor 0 sends an Invalidate message on the bus. After receiving the Invalidate Acknowledge replies from all other processors, Processor 0 updates the state of the corresponding cache entry to E; at that point Processor 0 owns the data at address A. Next, Processor 0 can write the data into the corresponding cache line and update the entry's state to M. If the state of the cache entry Processor 0 finds is I, its cache does not hold a valid copy of the data at address A (scenario 2); in that case Processor 0 sends a Read Invalidate message on the bus. After Processor 0 receives the Read Response message and the Invalidate Acknowledge replies from all other processors, it updates the state of the corresponding cache entry to E, indicating that it has obtained ownership of the data. Next, Processor 0 can write the data into the corresponding cache line and update the entry's state to M. When the other processors receive the Invalidate message or the Read Invalidate message, they must look up the corresponding cache entry in their own caches based on the address contained in the message. If Processor 1 finds the cache entry in a state other than I, it must update the state of that entry to I (deleting its copy of the data) and reply on the bus with an Invalidate Acknowledge message. As we can see, the Invalidate and Invalidate Acknowledge messages ensure that a write to a given memory address can be performed by only one processor at any time, thereby avoiding the data inconsistency that could result from multiple processors updating the same data simultaneously.
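The two write scenarios can be sketched the same way (again a simplified model with illustrative names; acknowledgements are assumed to arrive immediately): Processor 0 gains ownership by invalidating the other copies, and only then writes.

```python
class Cache:
    def __init__(self, state="I", data=None):
        self.state, self.data = state, data

def write(writer, others, value):
    """Processor `writer` writes `value`; `others` snoop the bus messages."""
    if writer.state not in ("E", "M"):
        # Send Invalidate (or Read Invalidate when starting from I); every
        # other processor invalidates its copy and replies Invalidate Acknowledge.
        for other in others:
            other.state, other.data = "I", None
        writer.state = "E"               # ownership obtained
    writer.data, writer.state = value, "M"

# Scenario 1: both caches hold the line in Shared state.
p0, p1 = Cache("S", 1), Cache("S", 1)
write(p0, [p1], 2)
print(p0.state, p1.state)             # -> M I

# Scenario 2: Processor 0 starts from Invalid.
p0, p1 = Cache("I"), Cache("S", 1)
write(p0, [p1], 3)
print(p0.data, p0.state, p1.state)    # -> 3 M I
```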

From the examples above, we can see that when multiple threads share a variable, the MESI protocol already guarantees that one thread's updates to the shared variable are visible to threads running on other processors. That being so, how can visibility problems still exist? The answer lies in write buffers and invalidate queues, which we will explain next.

Compiling this was not easy; if you found it useful, give it a like.


Source: juejin.im/post/5d67e75a5188256db0644778