An in-depth look at cache coherency and the MESI cache coherency protocol (Part 2)

Write buffers and invalidation queues

Background

The MESI protocol solves the cache coherency problem, but it has a performance weakness of its own: to perform a write, a processor must wait until all other processors holding copies of the data have deleted those copies and replied with Invalidate Acknowledge / Read Response messages before it can write the data into its cache. To avoid or reduce the latency this waiting imposes on write operations, hardware designers introduced the write buffer and the invalidation queue.

Write buffer

A write buffer (Store Buffer, also called Write Buffer) is a small, private, high-speed storage component inside the processor, with a capacity smaller than the cache. Each processor has its own write buffer, which internally may contain multiple entries. A processor cannot read the contents of another processor's write buffer.


With a write buffer in place, the processor handles a write operation as follows:

- If the corresponding cache entry is in state E or M, the processor writes the data directly into the cache line without sending any message.
- If the corresponding cache entry is in state S, the processor first stores the write (the memory address and the data to be written) in a write buffer entry and sends an Invalidate message.
- If the corresponding cache entry is in state I, the write is called a write miss (Write Miss); the processor first stores the write in a write buffer entry and sends a Read Invalidate message.
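The three cases above can be sketched as a small decision function. This is a toy model: the enum and the message strings are illustrative, not a real protocol implementation.

```java
// Toy model of the per-state write handling described above.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

class WritePath {
    /** Returns the bus message (if any) the writing processor sends. */
    static String onWrite(MesiState state) {
        switch (state) {
            case EXCLUSIVE:
            case MODIFIED:
                // Exclusive owner: write the cache line directly, no message.
                return "none";
            case SHARED:
                // Other copies exist: park the write in the write buffer
                // and ask the other processors to invalidate their copies.
                return "Invalidate";
            case INVALID:
                // Write miss: buffer the write and request the line as well.
                return "Read Invalidate";
        }
        throw new IllegalStateException();
    }
}
```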

Thus, with a write buffer the processor can perform a write operation without waiting for the Invalidate Acknowledge messages, reducing write latency. While the other processors are still preparing their Invalidate Acknowledge / Read Response replies, the writing processor can go on executing other instructions, improving its instruction throughput.

With an invalidation queue (Invalidate Queue), a processor that receives an Invalidate message does not delete the copy of the data at the specified address right away; instead, it stores the message in its invalidation queue and replies with Invalidate Acknowledge immediately, further shortening the time the writing processor must wait. Some processors (such as x86) may not use an invalidation queue.
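The deferred deletion can be sketched with a toy model, assuming String addresses and a plain map as the cache (all names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of an invalidation queue: acknowledge at once, delete later.
class InvalidateQueueDemo {
    final Map<String, Integer> cache = new HashMap<>();
    final Queue<String> invalidateQueue = new ArrayDeque<>();

    /** On receiving Invalidate: queue it and acknowledge immediately,
     *  instead of deleting the cached copy right away. */
    String onInvalidate(String addr) {
        invalidateQueue.add(addr);
        return "Invalidate Acknowledge";
    }

    /** Applying the queue is what actually removes the stale copies. */
    void applyInvalidateQueue() {
        for (String addr; (addr = invalidateQueue.poll()) != null; ) {
            cache.remove(addr);
        }
    }
}
```

Note that until `applyInvalidateQueue()` runs, the stale copy is still in the cache — which is exactly the visibility hazard discussed later in this post.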

The write buffer and the invalidation queue, however, bring some new problems of their own: memory reordering and visibility.

Store Forwarding

The technique by which a processor satisfies a read directly from its own write buffer is called store forwarding (Store Forwarding). Store forwarding lets the processor read the result of a write that is still sitting in its write buffer, so that buffering its writes does not affect the results of its own reads.
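Store forwarding can be sketched with a minimal model, assuming String addresses and int values (all names are illustrative): a load consults the processor's own write buffer first and falls back to the cache.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of store forwarding.
class ForwardingCore {
    final Map<String, Integer> cache = new HashMap<>();
    // Pending writes, oldest first, not yet visible in the cache.
    final Map<String, Integer> storeBuffer = new LinkedHashMap<>();

    void store(String addr, int value) { storeBuffer.put(addr, value); }

    int load(String addr) {
        Integer pending = storeBuffer.get(addr); // store forwarding:
        if (pending != null) return pending;     // read our own buffered write
        return cache.getOrDefault(addr, 0);
    }

    /** Drain the buffer into the cache (what a store barrier forces). */
    void drain() { cache.putAll(storeBuffer); storeBuffer.clear(); }
}
```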

Memory Reordering Revisited

The write buffer and the invalidation queue can both cause memory reordering.
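The classic reordering a write buffer allows (a later load passing an earlier buffered store, the StoreLoad case) can be simulated deterministically. This is a sketch with illustrative names; the static `memory` map stands in for the coherent cache/memory the cores share.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: buffered stores are invisible to the other core,
// so each core's later load effectively passes its earlier store.
class BufferedCore {
    static final Map<String, Integer> memory = new HashMap<>();
    final Map<String, Integer> storeBuffer = new LinkedHashMap<>();

    void store(String addr, int v) { storeBuffer.put(addr, v); }

    int load(String addr) {
        Integer own = storeBuffer.get(addr);   // store forwarding
        return own != null ? own : memory.getOrDefault(addr, 0);
    }

    void drain() { memory.putAll(storeBuffer); storeBuffer.clear(); }
}
```

Run core 0 as `store x=1; load y` and core 1 as `store y=1; load x`: both loads can return 0, even though each core's store "already happened" — it just happened into a private buffer.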

- At the hardware level, memory barriers fall into two kinds: the Load Barrier (read barrier) and the Store Barrier (write barrier).

- A processor drains (Drain) or flushes (Flush) its write buffer only under certain conditions (for example, when the write buffer is full, or when an I/O instruction is executed), writing the buffered contents into the cache. From the point of view of a program, or of the set of variables it updates, the processor itself gives no guarantee that this flush happens promptly.

To guarantee that the updates one processor makes to a shared variable can be synchronized to other processors, the compiler and other parts of the underlying system rely on a class of special instructions called memory barriers. Among them, the store barrier (Store Barrier) causes the processor that executes it to flush its write buffer.

However, flushing the write buffer solves only half of the visibility problem; the other half is caused by the invalidation queue. The invalidation queue introduces a new problem of its own: if a processor performs a read without first deleting the cached copies named by the entries in its invalidation queue, it may read old, stale data, so that the updates made by other processors are effectively lost.

Therefore, for a thread running on one processor to read the updates a thread on another processor has made to a shared variable, the reading processor must first invalidate the corresponding copies in its cache according to the Invalidate messages stored in its invalidation queue, so that the updates made by threads on other processors can be synchronized into its cache under the cache coherency protocol.

The load barrier (Load Barrier) is the memory barrier that solves this problem. A load barrier marks, according to the contents of the invalidation queue, the corresponding cache entries on the processor as Invalid (I). This forces the processor's subsequent reads of those addresses (the addresses named in the queued Invalidate messages) to send Read messages, thereby synchronizing the updates other processors have made to the shared variables into this processor's cache.

Different processor architectures support (allow) different memory reorderings. For example, modern processors generally adopt write buffers, but some processors (such as x86) guarantee that writes become visible in order, i.e. they do not allow StoreStore reordering.

Visibility Revisited

We have said that the write buffer is the hardware root cause of visibility problems.

As noted above, a processor drains (Drain) or flushes (Flush) its write buffer only under certain conditions (for example, when the write buffer is full, or when an I/O instruction is executed), and from the point of view of the program and its updated variables, the processor itself gives no guarantee that this flush happens promptly. Accordingly, to ensure that the updates one processor makes to shared variables can be synchronized to other processors, the compiler and other parts of the underlying system require the special instructions known as memory barriers; the store barrier (Store Barrier) causes the processor that executes it to flush its write buffer.

Again, flushing the write buffer solves only half of the visibility problem; the other half comes from the invalidation queue. A processor that performs a read without first deleting the cached copies named in its invalidation queue may read old, stale data, losing the updates made by other processors. Accordingly, for a thread on one processor to read the updates a thread on another processor has made to a shared variable, the reading processor must first invalidate the corresponding copies in its cache according to the Invalidate messages stored in its invalidation queue, so that the updates made by threads on other processors can be synchronized into its cache under the cache coherency protocol.

The load barrier (Load Barrier) solves this problem: according to the contents of the invalidation queue, it marks the corresponding cache entries on the processor as Invalid (I), forcing the processor's subsequent reads of those addresses (the addresses named in the queued Invalidate messages) to send Read messages, thereby synchronizing other processors' updates to the shared variables into this processor's cache.
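The load-barrier effect can be sketched as follows. This is a toy model with illustrative names; the static `sharedMemory` map stands in for the up-to-date value another processor has published.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model: without applying the invalidation queue, a load is served
// from a stale cached copy; after the "load barrier", it re-reads.
class LoadBarrierDemo {
    static final Map<String, Integer> sharedMemory = new HashMap<>();
    final Map<String, Integer> cache = new HashMap<>();
    final Queue<String> invalidateQueue = new ArrayDeque<>();

    int load(String addr) {
        Integer cached = cache.get(addr);
        if (cached != null) return cached;        // stale copy still served
        int fresh = sharedMemory.getOrDefault(addr, 0); // "Read" message
        cache.put(addr, fresh);
        return fresh;
    }

    /** Load barrier: mark every queued address Invalid before loading. */
    void loadBarrier() {
        for (String addr; (addr = invalidateQueue.poll()) != null; ) {
            cache.remove(addr);
        }
    }
}
```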

Solving the visibility problem therefore takes two steps. First, the updates a writing thread makes to shared variables must reach (be stored into) the cache, so that they can be synchronized to the other processors. Second, the processor running the reading thread must "apply" the contents of its invalidation queue to its cache, so that the updates threads on other processors have made to the shared variables can be synchronized into its cache.

This is achieved by using the store barrier and the load barrier in pairs: a store barrier executed by the writing thread's processor ensures that the updates that thread has made to shared variables can be synchronized to the reading thread; a load barrier executed by the reading thread's processor synchronizes the writing thread's updates to the shared variables into that processor's cache.
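In Java, for example, this pairing is conceptually what a volatile field provides: the JVM emits the store-barrier effect with the volatile write and the load-barrier effect with the volatile read. A minimal sketch (class and field names are illustrative):

```java
// Sketch of barrier pairing in Java terms: the volatile write publishes,
// the volatile read observes; the plain field rides along safely.
class Message {
    int payload;                 // plain shared variable
    volatile boolean ready;      // volatile write/read supplies the barriers

    void writer() {
        payload = 42;
        ready = true;            // store-barrier effect: publish payload too
    }

    int reader() {
        return ready ? payload : -1; // load-barrier effect on the read
    }
}
```

In a real two-thread run the reader may still see `ready == false` and return -1, but it can never observe `ready == true` together with a stale `payload`.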

Store forwarding can also cause visibility problems.

Putting this together was not easy — if it helped you, please give it a thumbs-up.


Origin juejin.im/post/5d67e78d5188251e073a39e5