Discussion on MESI-CPU Cache Coherency Protocol

Why does the CPU have a cache

To speed up data access, modern processors place several levels of small, fast caches on each CPU core (an L1 and L2 cache per core, plus an L3 cache shared across cores) that hold frequently used data. Because main memory is roughly 100 times slower than the CPU, a modified value is first written only to the cache rather than straight back to main memory (the CPU does not talk to memory directly; it talks to the cache, and the cache talks to main memory), so the data in the cache can differ from the data in memory. On a single-core processor this is harmless: every thread sees the latest value in the one cache. On a multi-core processor, however, the same piece of main memory may be cached by several cores at once. If one core modifies its cached copy and the other cores are not notified in time, the caches become inconsistent and the program computes wrong results. The MESI protocol exists to solve exactly this cache-inconsistency problem in the multi-core era.
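To make the visibility problem described above concrete, here is a minimal Java sketch (the class and field names are made up for this illustration). Strictly speaking, what makes this demo hang is the Java memory model rather than MESI itself: without volatile, the JIT is free to keep the flag in a register, so the reader may never go back to the coherent cache at all. It still shows the practical symptom the paragraph describes: one thread's write is not seen by another in time.

public class VisibilityDemo {

    // No volatile here: the reading thread is allowed to keep using a stale copy.
    private static boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) {
                // spin; without volatile this loop may never observe stop = true
            }
            System.out.println("worker saw stop = true");
        });
        worker.start();

        Thread.sleep(1000);
        stop = true; // write from the main thread
        System.out.println("main set stop = true");

        worker.join(3000); // may time out because the worker never sees the write
        System.out.println("worker still running: " + worker.isAlive());
    }
}

Declaring the flag as volatile boolean stop forces the read and write to go through the memory system, and the loop then terminates promptly.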

The flow of a calculation performed by a CPU with a cache

1 The program and data are loaded into main memory

2 Instructions and data are loaded into the CPU's caches

3 The CPU executes the instructions and writes the results to its cache

4 The data in the cache is written back to main memory

 

The current popular multi-level cache structure

 

Because the computing speed of the CPU outpaces the data I/O capability of even the level-1 cache, CPU manufacturers introduced a multi-level cache structure.

Multi-level cache structure

On a Windows 10 system, you can see the L1/L2/L3 cache sizes in Task Manager (Performance → CPU):
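The effect of the cache levels can also be observed from software: as a program's working set grows past the size of each level, the average access time jumps. Below is a rough, self-contained Java sketch of such a measurement (the array sizes and the random-walk access pattern are assumptions chosen for illustration; the random walk keeps the hardware prefetcher from hiding the misses, and the absolute numbers vary by machine and are distorted by JIT warm-up, so only the trend is meaningful).

public class CacheLevelProbe {

    static final int ACCESSES = 10_000_000;

    public static void main(String[] args) {
        // Working sets from 32 KB (around L1 size) up to 64 MB (well past most L3 caches)
        for (int kb = 32; kb <= 64 * 1024; kb *= 2) {
            long[] data = new long[kb * 1024 / 8];   // length is a power of two
            long nanos = touch(data);
            System.out.printf("%6d KB -> %.2f ns per access%n",
                    kb, (double) nanos / ACCESSES);
        }
    }

    // Walk the array with a pseudo-random pattern so that each access
    // is likely to hit a different cache line.
    static long touch(long[] data) {
        int mask = data.length - 1;              // works because length is a power of two
        int idx = 0;
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < ACCESSES; i++) {
            idx = (idx * 1_664_525 + 1_013_904_223) & mask;  // LCG step, kept in range by the mask
            sink += data[idx];
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink);  // keep 'sink' from being optimized away
        return elapsed;
    }
}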

 

Multi-core CPU multi-level cache coherency protocol MESI

With a multi-core CPU there are multiple level-1 caches. How do we keep the data in those caches consistent so that the cores do not operate on conflicting values? This is where the cache coherency protocol MESI comes in.

 

MESI protocol cache line states

MESI takes its name from the first letters of the four states a cache line can be in. Each cache line is in one of 4 states, which can be encoded in 2 bits:

Cache line: the basic unit in which a cache stores data.

M (Modified): the line exists only in this cache and has been modified, so it differs from main memory; it must be written back before being reused.

E (Exclusive): the line exists only in this cache and matches main memory.

S (Shared): the line may exist in several caches and matches main memory.

I (Invalid): the line's contents are stale and must not be used.

Note:
The M and E states are always accurate: they match the real situation of the cache line. The S state, however, may be conservative. If another cache drops (invalidates) its copy of an S-state line, a remaining cache may in fact be the sole holder of that line, but it will not promote the line to the E state, because caches do not broadcast a notification when they drop a shared line. And since a cache does not track how many copies of a line exist elsewhere, it has no way to tell that it now holds the only copy (even if such a notification existed).

Seen this way, the E state is a speculative optimization: to modify a cache line in the S state, a bus transaction is needed first to invalidate all other copies of the line, whereas modifying a line in the E state requires no bus transaction at all.
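To make the four states and their transitions concrete, here is a toy Java model of the MESI state machine for a single cache line as seen from one cache. This is a simplified sketch for illustration only; a real implementation also handles write-backs, bus arbitration, and many edge cases that are ignored here.

public class MesiToyModel {

    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // This core reads the line.
    static State onLocalRead(State s, boolean otherCachesHoldCopy) {
        if (s == State.INVALID) {
            // miss: fetch the line; exclusive if no other cache holds it
            return otherCachesHoldCopy ? State.SHARED : State.EXCLUSIVE;
        }
        return s; // M, E, S: a read hit keeps the current state
    }

    // This core writes the line.
    static State onLocalWrite(State s) {
        // From S or I the other copies must be invalidated first (a bus transaction);
        // from E the write is silent -- this is the optimization noted above.
        return State.MODIFIED;
    }

    // Another core reads the line.
    static State onRemoteRead(State s) {
        if (s == State.MODIFIED || s == State.EXCLUSIVE) {
            return State.SHARED; // a MODIFIED line is written back before being shared
        }
        return s;
    }

    // Another core writes (takes ownership of) the line.
    static State onRemoteWrite(State s) {
        return State.INVALID; // our copy becomes stale
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = onLocalRead(s, false);   // -> EXCLUSIVE
        s = onLocalWrite(s);         // -> MODIFIED, no bus transaction needed
        s = onRemoteRead(s);         // -> SHARED, after the line is written back
        s = onRemoteWrite(s);        // -> INVALID
        System.out.println("final state: " + s);
    }
}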

Data consistency in modern CPUs = cache locking (MESI and related protocols) + bus locking (used as a fallback when the data cannot be handled at cache-line granularity).

When two unrelated variables happen to sit in the same cache line and are written by threads on two different CPUs, the line is locked back and forth and the threads slow each other down: this is the false sharing problem.

The cache system stores data in units of cache lines. A cache line is a power-of-two number of contiguous bytes, typically 32 to 256 bytes; the most common cache line size is 64 bytes. When multiple threads modify variables that are logically independent but share the same cache line, they unintentionally hurt each other's performance; this is false sharing.

 

Figure 1 illustrates the false sharing problem. A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y; unfortunately, the two variables live in the same cache line. Each thread has to compete for ownership of that line in order to update its variable. If core 1 gains ownership, the cache subsystem invalidates the corresponding line in core 2; when core 2 then gains ownership and performs its update, core 1 must invalidate its copy. The line ping-pongs back and forth through the L3 cache, which hurts performance badly. If the competing cores sit in different sockets, the traffic must also cross the socket interconnect and the problem gets even worse.
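Whether two fields or objects really end up within the same 64-byte line can be checked by inspecting the object layout. Below is a small sketch using the OpenJDK JOL tool; it is not part of the original example and assumes the jol-core dependency (org.openjdk.jol:jol-core) is on the classpath.

import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {

    // Two plain long fields: they are usually laid out next to each other,
    // i.e. close enough to fall inside the same 64-byte cache line.
    static class TwoCounters {
        volatile long x;
        volatile long y;
    }

    public static void main(String[] args) {
        // Prints field offsets and sizes, from which you can see
        // whether x and y can share a cache line.
        System.out.println(ClassLayout.parseClass(TwoCounters.class).toPrintable());
    }
}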

 

Case:

 

 

public class T_CacheLinePadding {

    private static class T {
        // A single volatile long; two T instances allocated one after another
        // will usually end up in the same 64-byte cache line.
        public volatile long x = 0L;
    }

    public static T[] arr = new T[2];

    static {
        arr[0] = new T();
        arr[1] = new T();
    }

    public static void main(String[] args) throws InterruptedException {
        // Each thread writes only its own T instance, yet the two instances
        // share a cache line, so every write invalidates the other core's copy.
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) {
                arr[0].x = i;
            }
        });

        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) {
                arr[1].x = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Elapsed time in milliseconds
        System.out.println((System.nanoTime() - start) / 100_0000);
    }
}

Result of the run:

After optimization, the two hot variables no longer share a cache line: using cache line padding (alignment) improves efficiency.

 

 

public class T_CacheLinePadding2 {

    private static class Padding {
        // 7 longs of padding (56 bytes); together with the object header and
        // the field x below, this pushes the two x fields into different cache lines.
        public volatile long p1, p2, p3, p4, p5, p6, p7;
    }

    private static class T extends Padding {
        public volatile long x = 0L;
    }

    public static T[] arr = new T[2];

    static {
        arr[0] = new T();
        arr[1] = new T();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) {
                arr[0].x = i;
            }
        });

        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 10000_0000L; i++) {
                arr[1].x = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Elapsed time in milliseconds
        System.out.println((System.nanoTime() - start) / 100_0000);
    }
}

Result of the run:

 

 

Origin: blog.csdn.net/huzhiliayanghao/article/details/106872909