Practical notes: CPU cache basics

Introduction

CPU cache knowledge is a fundamental topic for interviews at large tech companies, and it is also very important in practice. If you have a good grasp of this area, it will be a real plus!

Let’s talk about history:

In the first few decades of computing, main memory was slow and incredibly expensive, but CPUs were not particularly fast either. Beginning in the 1980s, the gap began to widen rapidly: microprocessor clock speeds grew quickly, while memory access times improved far more slowly. As this gap widened, it became increasingly obvious that a new type of fast memory was needed to bridge it.

1980 and before: CPUs had no cache

1980~1995: CPUs began to have two levels of cache

Today: some CPUs have an L4 and some have an L0, but the typical configuration is L1, L2, and L3

image

Hands-on walkthrough

Basic knowledge of CPU cache

A register is a small storage area inside the central processor used to temporarily hold instructions, data, and addresses. Register capacity is limited, but read and write speeds are very fast. In computer architecture, registers hold the intermediate results of computations at a known point in time, speeding up programs by providing fast access to data.

Registers sit at the top of the memory hierarchy and are the fastest memory the CPU can read and write. A register is usually measured by the number of bits it can store, for example an 8-bit register or a 32-bit register. Within the CPU, components that contain registers include the instruction register (IR), the program counter, and the accumulator. Registers are now usually implemented as register files (arrays), but on some machines they have also been implemented with individual flip-flops, high-speed core memory, thin-film memory, and other techniques.

"Register" can also refer to the group of registers that can be directly addressed by an instruction's inputs and outputs; the more precise name for these is "architectural registers". For example, the 32-bit x86 instruction set defines a set of eight 32-bit registers, but a CPU that implements the x86 instruction set may contain many more physical registers than that.

CPU cache

In computer systems, the CPU cache (referred to simply as "cache" in this article) is a component used to reduce the average time the processor needs to access memory. It sits at the second level from the top of the memory-hierarchy pyramid, just below the CPU registers. Its capacity is much smaller than main memory, but its speed can approach the processor's clock frequency.

When the processor issues a memory access request, it first checks whether the requested data is in the cache. If it is (a hit), the data is returned directly without touching memory; if it is not (a miss), the corresponding data must first be loaded from memory into the cache and then returned to the processor.

The cache is effective mainly because a running program accesses memory with locality. This locality includes both spatial locality and temporal locality. By exploiting this locality, the cache can achieve a very high hit rate.

From the processor's point of view, the cache is a transparent component, so programmers usually cannot directly intervene in how the cache operates. However, it is possible to apply specific optimizations to program code based on the cache's characteristics, so as to make better use of it.
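To make this concrete, here is a minimal sketch (the class name and array size are mine, chosen only for illustration) contrasting a traversal with good spatial locality against one with poor locality; on most machines the row-major loop is noticeably faster because it walks memory one cache line after another:

public class LocalityDemo {
    static final int N = 4096;
    static int[][] matrix = new int[N][N];

    public static void main(String[] args) {
        long t1 = System.currentTimeMillis();
        long sumRowMajor = 0;
        // Row-major traversal: consecutive elements share cache lines (good spatial locality)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sumRowMajor += matrix[i][j];
        System.out.println("row-major:    " + (System.currentTimeMillis() - t1) + " ms");

        long t2 = System.currentTimeMillis();
        long sumColMajor = 0;
        // Column-major traversal: each access jumps to a different cache line (poor spatial locality)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sumColMajor += matrix[i][j];
        System.out.println("column-major: " + (System.currentTimeMillis() - t2) + " ms");
    }
}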

Current computers generally have three levels of cache (L1, L2, L3). Let's take a look at the structure:

image

Specifically:

  • L1 is split into two caches: an instruction cache and a data cache. L2 and L3 do not distinguish between instructions and data.
  • L1 and L2 are private to each CPU core, while L3 is shared by all cores.
  • The closer a cache level is to the CPU, the smaller and faster it is; the farther away, the larger and slower.
  • Behind the caches sits main memory, and behind main memory sits the hard disk.

Let's take a look at their speed:

image

Here is the processor I use at work, even though it's a bit underwhelming~

image

We can see specific information:

L1 is roughly 27~36 times faster than main memory. L1 and L2 are sized in KB, while L3 is sized in MB. L1 is split into a 32 KB data cache and a 32 KB instruction cache. Now, why isn't there an L4?

Let's look at a picture below

image

This chart of Haswell from Anandtech is useful because it illustrates the performance impact of adding a huge (128 MB) L4 cache on top of a conventional L1/L2/L3 structure. Each step represents a new level of cache. The red line is the chip with the L4; note that for large file sizes it is still almost twice as fast as the other two Intel chips. But larger caches require more transistors, are slower and more expensive, and increase the size of the chip.

Is there any L0?

The answer is: yes. Modern CPUs also usually have a very small "L0" cache, typically only a few KB in size, used to store micro-operations (µOPs). Both AMD and Intel use such a cache. Zen's µOP cache holds 2,048 µOPs, while Zen 2's holds 4,096. These tiny buffer pools operate under the same general principles as L1 and L2, but represent an even smaller pool of memory that the CPU can access with lower latency than L1. Vendors usually tune these structures against each other: Zen 1 and Zen+ (Ryzen 1xxx, 2xxx, and 3xxx APUs) have a 64 KB, 4-way associative L1 instruction cache and a 2,048-µOP L0 cache. Zen 2 (Ryzen 3xxx desktop CPUs, Ryzen Mobile 4xxx) has a 32 KB, 8-way associative L1 instruction cache and a 4,096-µOP cache. Doubling the associativity and the µOP cache size allowed AMD to cut the L1 instruction cache size in half.

Having said that, how does the CPU cache actually work?

The purpose of the CPU cache, from the CPU's point of view: "I am so fast that going all the way to main memory for every piece of data costs too much, so I keep a small memory pool of my own to store the data I am most likely to need." So what gets loaded into the CPU cache? The data used in ongoing calculations and program code.

What if the data I want is not in the L1 pool? That is a cache miss.

What then? Go look in L2. Some processors use an inclusive cache design (data stored in L1 is duplicated in L2), while others use an exclusive design (the two caches never share data). If the data is not found in L2 either, the CPU continues down the chain to L3 (usually still on the die), then L4 (if present), and finally main memory (DRAM).

That raises another question: how can the cache be searched efficiently? The CPU certainly cannot scan every cache line one by one.

CPU cache hit

cache line

A cache line (also called a cache block) is the unit in which the CPU loads data: data is always loaded block by block. On most CPUs a cache line is 64 bytes = sixteen 32-bit integers (some CPUs use 32-byte or 128-byte lines). Below is the cache line size of my machine's processor.

image

In the above figure, the L1 data cache of my processor has 32KBytes:

32 KB / 64 B = 512 cache lines

Mapping strategies between cache lines and memory:

  • Hashing: something like (memory address % number of cache lines) * 64; simple, but hash conflicts are likely to occur

  • N-way set associative: simply put, cache lines are grouped into sets of N, and within a set the target line is located by its tag and offset

From the figure above, the 32 KB L1 data cache is divided into 8 ways, so each way is 4 KB.

How is it addressed?

As noted above, most cache lines are 64 bytes, so a memory address breaks down as follows:

  • Tag: each cache line stores an independently allocated tag alongside its data; here it is 24 bits = 3 bytes, taken from the high 24 bits of the memory address.
  • Index: the next 6 bits of the address select the cache line's position within a way; with 6 bits we can index 2^6 = 64 cache lines per way.
  • Offset: the low 6 bits of the address, after the index, give the byte offset within the 64-byte cache line.

The specific lookup process:

  1. Use the index to locate the corresponding cache block.
  2. Use the tag to try to match the corresponding tag value of the cache block. The result is a hit or a miss.
  3. If hit, use the offset in the block to locate the target word in this block. Then directly rewrite the word.
  4. On a miss, there are two strategies depending on the system design, called write-allocate and no-write-allocate. With write-allocate, the missing line is first read into the cache (just as for a read miss), and then the word is written into the newly loaded line. With no-write-allocate, the data is written directly to memory.
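To pin down the tag/index/offset split and the lookup steps, here is a toy sketch in Java using the numbers from above (64-byte lines, 64 sets, 8 ways). It is a simplified model of my own, not how the hardware is actually built, and the eviction choice on a miss is deliberately naive:

public class SetAssociativeCacheSketch {
    static final int LINE_SIZE = 64;   // bytes per cache line  -> 6 offset bits
    static final int NUM_SETS  = 64;   // 32 KB / 64 B / 8 ways -> 6 index bits
    static final int WAYS      = 8;

    // tags[set][way]; -1 means that way is empty
    static long[][] tags = new long[NUM_SETS][WAYS];
    static { for (long[] set : tags) java.util.Arrays.fill(set, -1L); }

    static boolean access(long address) {
        long offset = address & (LINE_SIZE - 1);                  // low 6 bits: byte within the line
        int  index  = (int) ((address >>> 6) & (NUM_SETS - 1));   // next 6 bits: which set
        long tag    = address >>> 12;                             // remaining high bits: the tag

        // 1. Locate the set by index, 2. compare the tag against every way in that set
        for (int way = 0; way < WAYS; way++) {
            if (tags[index][way] == tag) {
                return true;                                      // hit: 'offset' would select the word in the line
            }
        }
        // 3. Miss: load the line into some way of this set (naive choice: overwrite way 0)
        tags[index][0] = tag;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(access(0x12345678L)); // false: first touch is a miss
        System.out.println(access(0x12345678L)); // true: now it hits
        System.out.println(access(0x1234567CL)); // true: same 64-byte line, different offset
    }
}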

What if a given way (set) of the cache is full?

Evict the line that was accessed least recently, a policy known as LRU (least recently used).
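As a rough illustration of the LRU idea (not of real hardware, which usually uses a cheaper pseudo-LRU approximation), Java's LinkedHashMap in access order makes a one-class sketch:

import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU "set": holds at most 'capacity' cache lines, evicting the least recently used one.
public class LruWaySketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruWaySketch(int capacity) {
        super(16, 0.75f, true);     // accessOrder = true: iteration order = least recently used first
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;   // evict the least recently used entry when over capacity
    }

    public static void main(String[] args) {
        LruWaySketch<Long, String> set = new LruWaySketch<>(2);
        set.put(0x100L, "line A");
        set.put(0x200L, "line B");
        set.get(0x100L);            // touch line A so it becomes most recently used
        set.put(0x300L, "line C");  // evicts line B, the least recently used
        System.out.println(set.keySet()); // [256, 768], i.e. 0x100 and 0x300 remain
    }
}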

Having analyzed the L1 data cache, you can analyze L2 and L3 the same way, so I won't repeat it here.

Cache coherency

Partly from Wikipedia:

To keep data consistent with lower-level storage (such as main memory), updates must be propagated in a timely manner. This propagation is done by writing the data back, and there are two general strategies: write-back and write-through.

Combining the write strategy with the miss-allocation strategy mentioned above gives the following table:

image

From the table above:

Write-back: on a cache hit, memory does not need to be updated immediately, which reduces memory write operations; the allocation strategy is usually write-allocate.

  • How does another CPU know the line has changed when it loads it? Each cache line has a dirty bit that records whether the line has been modified since it was loaded. (As mentioned above, the CPU loads data line by line, not byte by byte.)

  • image

Write-through:

  • Write-through means that whenever the cache receives a write, it immediately writes the data through to memory. If the address is also in the cache, the cached copy is updated at the same time. Because this design causes a large number of memory writes, a buffer is used to reduce hardware stalls; this buffer is called the write buffer and usually holds no more than 4 cache blocks. For the same reason, a write buffer can also be used with write-back caches.

  • Write-through is easier to implement than write-back, and it is easier to maintain data consistency.

  • The allocation strategy is usually no-write-allocate

In a two-level cache system, the first-level cache may use write-through to simplify the implementation, while the second-level cache uses write-back to ensure data consistency.
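A toy sketch of the two write policies, assuming a single cached line and main memory modeled as an array (the names and structure are mine, purely illustrative):

public class WritePolicySketch {
    static long[] mainMemory = new long[1024];

    // One cached line, identified by its index in mainMemory.
    static int  cachedIndex = 0;
    static long cachedValue = 0;
    static boolean dirty = false;     // only meaningful for write-back

    // Write-through: every write updates the cache AND goes straight to main memory.
    static void writeThrough(int index, long value) {
        if (index == cachedIndex) {
            cachedValue = value;
        }
        mainMemory[index] = value;    // memory is always up to date
    }

    // Write-back: writes only touch the cache and set the dirty bit;
    // memory is updated later, when the line is evicted.
    static void writeBack(int index, long value) {
        if (index != cachedIndex) {
            evict();                  // write the old line back first if it is dirty
            cachedIndex = index;
            cachedValue = mainMemory[index];
        }
        cachedValue = value;
        dirty = true;
    }

    static void evict() {
        if (dirty) {
            mainMemory[cachedIndex] = cachedValue;
            dirty = false;
        }
    }

    public static void main(String[] args) {
        writeBack(0, 42);
        System.out.println(mainMemory[0]); // 0: memory not yet updated (dirty line still in cache)
        evict();
        System.out.println(mainMemory[0]); // 42: written back on eviction

        writeThrough(0, 7);
        System.out.println(mainMemory[0]); // 7: updated immediately
    }
}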

The MESI protocol:

There is a webpage (MESI Interactive Animations) that is excellent; after going through a lot of material, nothing explains this as well as the animation: https://www.scss.tcd.ie/Jeremy.Jones/VivioJS/caches/MESIHelp.htm

I recommend playing with the animation at the URL above first to understand how each CPU cache and main memory read and write data.

Here is a brief walkthrough: main memory holds a value x = 0, and the processor has two cores, cpu0 and cpu1.

  • cpu0 reads x: cpu0 first looks in its own cache and does not find it. The address bus connects the CPUs and main memory, so the request goes to the other CPU and to main memory at the same time; the versions are compared, x is fetched from main memory, and its value is delivered into cpu0's cache over the data bus.

  • image

  • cpu0 writes x = x + 1: it takes x = 0 directly from its own cache and adds 1 (main memory is not updated, nor is cpu1's cache, which does not hold x at all).

  • image

  • cpu1 reads x: it first looks in its own cache and does not find it, so the request goes out over the address bus to the other CPU and main memory at the same time, and the versions are compared (if the versions are the same, main memory's value takes priority). cpu0's copy of x is found to be newer, so cpu0 forwards its value of x to cpu1's cache over the data bus and also updates x in main memory.

  • image

  • cpu1 writes x = x + 1: it takes x = 1 directly from its own cache and adds 1 (main memory is updated here, cpu0's cache is not updated, but the other CPUs are notified via an RFO request).

  • image

You can try it yourself in other situations.

Notification protocol:

The Snoopy protocol. This protocol works more like a bus-based notification technique: through it, a CPU cache can identify the state of data in other caches. If data is shared, the state of the shared data can be broadcast to the other CPU caches. The protocol requires every CPU cache to be able to "snoop" on notifications of data events and react accordingly.

The states of the MESI protocol:

Modified, Exclusive, Shared, and Invalid.

Walk through the animation again with these states in mind; it's actually not that complicated.
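To pin down the vocabulary, here is a rough sketch of the four states and a few typical transitions (heavily simplified and of my own making; the real protocol is driven by bus messages such as RFO and by snooping):

public class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // Simplified transitions for one cache line in one core.
    static State onLocalRead(State s, boolean otherCachesHaveIt) {
        if (s == State.INVALID) {
            // Miss: load the line; Shared if another cache also holds it, otherwise Exclusive
            return otherCachesHaveIt ? State.SHARED : State.EXCLUSIVE;
        }
        return s; // M, E, S can all satisfy a local read without changing state
    }

    static State onLocalWrite(State s) {
        // A write requires ownership: broadcast RFO (Read For Ownership) so that
        // other caches invalidate their copies, then the line becomes Modified.
        return State.MODIFIED;
    }

    static State onRemoteWrite(State s) {
        // Another core wrote this line: our copy becomes stale.
        return State.INVALID;
    }

    public static void main(String[] args) {
        State line = State.INVALID;
        line = onLocalRead(line, false);   // EXCLUSIVE: we are the only cache holding x
        line = onLocalWrite(line);         // MODIFIED: x updated locally, memory is stale
        line = onRemoteWrite(line);        // INVALID: another CPU took ownership of x
        System.out.println(line);
    }
}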

Expand it a bit:

MOESI: MOESI is a full cache coherency protocol that contains all of the states commonly used in other protocols. In addition to the four MESI states, it adds a fifth, "Owned" state, representing data that has been modified and is shared. This avoids having to write modified data back to main memory before sharing it; the data must still be written back eventually, but the write-back can be postponed.

MESIF: data in the Forward state is clean and can be discarded without notification.

AMD uses MOESI, Intel uses MESIF

I won’t go further here~

A round of examples

Example 1:

 

public class CpuCache {

    static final int LEN = 64 * 1024 * 1024;
    static int[] arr = new int[LEN]; // 64M ints (256 MB)
    public static void main(String[] args) {
        long currAddTwo = System.currentTimeMillis();
        addTwo();
        System.out.println(System.currentTimeMillis() - currAddTwo);
        long currAddEight = System.currentTimeMillis();
        addEight();
        System.out.println(System.currentTimeMillis() - currAddEight);
    }
    private static void addTwo() {
        for (int i = 0;i<LEN;i += 2) {
            arr[i]*=i;
        }
    }
    private static void addEight() {
        for (int i = 0;i<LEN;i += 8) {
            arr[i]*=i;
        }
    }
}

Can you guess the difference between the two printed times, or by what factor they differ?

Looking only at iteration counts: addTwo does four times as many iterations as addEight (4n vs. n).

But don't forget that the CPU loads a whole 64-byte cache line at a time, so whether the stride is 2 or 8, essentially the same cache lines get loaded and the two loops take about the same time. On my machine the times are:

 

48
36

False sharing:

Quoting Martin's example, slightly modified, the code is as follows:

 

public class FalseShare implements Runnable {
        public static int NUM_THREADS = 2; // change
        public final static long ITERATIONS = 500L * 1000L * 1000L;
        private final int arrayIndex;
        private static VolatileLong[] longs;

        public FalseShare(final int arrayIndex) {
            this.arrayIndex = arrayIndex;
        }

        public static void main(final String[] args) throws Exception {
            Thread.sleep(1000);
            System.out.println("starting....");
            if (args.length == 1) {
                NUM_THREADS = Integer.parseInt(args[0]);
            }

            longs = new VolatileLong[NUM_THREADS];
            for (int i = 0; i < longs.length; i++) {
                longs[i] = new VolatileLong();
            }
            final long start = System.currentTimeMillis();
            runTest();
            System.out.println("duration = " + (System.currentTimeMillis() - start));
        }

        private static void runTest() throws InterruptedException {
            Thread[] threads = new Thread[NUM_THREADS];
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(new FalseShare(i));
            }
            for (Thread t : threads) {
                t.start();
            }
            for (Thread t : threads) {
                t.join();
//                System.out.println(t);
            }
        }

        public void run() {
            long i = ITERATIONS + 1;
            while (0 != --i) {
                longs[arrayIndex].value = i;
            }
        }

        public final static class VolatileLong {
            public volatile long value = 0L;
            public long p1, p2, p3, p4, p5, p6;//, p7, p8, p9;
        }
}

The logic of the code: several threads (NUM_THREADS; 2 by default in the code above, and I tested with 4) each modify a different element of an array. The element type is VolatileLong, with a single long member value and 6 unused padding longs. value is declared volatile so that its modifications are visible to all threads.

With the thread count set to 4: with the padding declaration in VolatileLong (public long p1, p2, p3, p4, p5, p6;) holding 6 longs, it runs in 13 s; with only 4 padding longs it runs in 9 s; and with the padding line commented out entirely it runs in 24 s. This test result seemed a bit odd.

Let's first sort out the definition of false sharing:

In a Java program, array elements are laid out contiguously, so they are also contiguous in the cache. In fact, adjacent member variables of a Java object are loaded into the same cache line as well. If multiple threads operate on different member variables that happen to share the same cache line, false sharing can occur.

The following figure and example come from the blog post on false sharing (https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html) by the lead of the Disruptor project (https://github.com/LMAX-Exchange/disruptor).

image

A thread running on core 1 wants to update variable X, while a thread running on core 2 wants to update variable Y, but these two frequently modified variables live on the same cache line. The two threads take turns sending RFO messages to claim ownership of that cache line.

On the surface, X and Y are operated on by independent threads and the two operations have nothing to do with each other; they merely share a cache line, yet all of the contention stems from that sharing.

Back to the code example: roughly speaking, a Java object's header is 8 bytes on a 32-bit JVM or 12 bytes on a 64-bit JVM, and the volatile long value adds 8 more. Adding 6 padding longs contributes another 48 bytes, which pushes each object onto its own cache line, so the caches no longer need to exchange frequent RFO messages for a shared line and contention drops. So why did my test show the opposite, with 6 longs taking longer than 4?

The reason is that my machine has only 2 cores. With the thread count set to 2, the 6-long version drops to 4 s, and commenting out the padding line gives 10 s.

So by padding out the cache line, each object occupies a cache line of its own as much as possible, which reduces cache line synchronization between cores.
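Besides hand-written padding fields, JDK 8 also offers the @sun.misc.Contended annotation, which asks the JVM to do the padding for you (for classes outside the JDK it only takes effect with -XX:-RestrictContended; in JDK 9+ the annotation moved to jdk.internal.vm.annotation). A minimal sketch of both approaches, with names of my own choosing:

import sun.misc.Contended; // JDK 8 only; moved in later JDKs

public class ContendedSketch {

    // Hand-rolled padding, as in the example above: dummy longs push 'value'
    // onto its own cache line.
    static final class PaddedLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6, p7;
    }

    // JVM-assisted padding: the runtime inserts padding around the annotated field.
    // Requires running with -XX:-RestrictContended for non-JDK classes.
    static final class ContendedLong {
        @Contended
        public volatile long value = 0L;
    }

    public static void main(String[] args) {
        // Either variant could stand in for VolatileLong in the FalseShare test above.
        PaddedLong a = new PaddedLong();
        ContendedLong b = new ContendedLong();
        a.value = 1;
        b.value = 2;
        System.out.println(a.value + ", " + b.value);
    }
}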

False sharing in queues

In the JDK's LinkedBlockingQueue, there is a reference head pointing to the head of the queue and a reference last pointing to the tail. This kind of queue is often used in asynchronous programming, and these two references are frequently modified by different threads, yet they are very likely to sit on the same cache line, so false sharing occurs. The more threads and the more cores, the worse the impact on performance.

But: don't optimize for false sharing just for the sake of optimizing. Grizzly ships its own LinkedTransferQueue, which differs from the one bundled with JDK 7 in that it uses PaddedAtomicReference to improve concurrent performance. In fact, this is a misguided coding technique and doesn't really help.

Netty once used PaddedAtomicReference to replace the original Node, using padding to work around queue false sharing, but that change was later reverted.

The essence of AtomicReference and LinkedTransferQueue is optimistic locking, and optimistic locking performs poorly under heavy contention; it should be used in low-contention scenarios. Optimizing optimistic locks for heavy contention is the wrong direction, because if contention really is fierce, pessimistic locks should be used instead.

Padded-AtomicReference is likewise a false proposition: if contention is fierce, why not use a lock plus volatile? If contention is not fierce, PaddedAtomicReference has no advantage over plain AtomicReference. So using Padded-AtomicReference is the wrong coding technique.

So in 1.8, the padding logic in LinkedTransferQueue was removed.

 


With 50 threads competing for 10 objects, LinkedBlockingQueue outperforms LinkedTransferQueue: it is several times faster on JDK 1.7, while on 1.8 the two run at almost the same speed.

Finally, I will talk about Disruptor

  • Circular array structure

To reduce garbage collection, it uses an array rather than a linked list; an array is also friendlier to the processor's caching mechanism.

  • Element positioning

The array length is 2^n, so an element's position can be computed quickly with a bit operation (see the sketch after this list). The sequence number only ever increases; don't worry about it overflowing, since it is a long: even at 1 million operations per second it would take about 300,000 years to exhaust.

  • Lock-free design

Each producer or consumer thread first claims the slot it wants to operate on in the array; once the claim succeeds, it writes or reads data at that position directly.

Setting the ring structure aside for a moment, the lock-free design works by using atomic CAS operations on the sequence to keep the whole claiming process thread-safe.
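Here is a minimal sketch of the power-of-two indexing mentioned in the element-positioning bullet above (names are illustrative, not Disruptor's actual internals):

public class RingIndexSketch {
    static final int BUFFER_SIZE = 16;              // must be a power of two
    static final int MASK = BUFFER_SIZE - 1;        // 0b1111

    // sequence & MASK is equivalent to sequence % BUFFER_SIZE, but cheaper,
    // and it keeps working as the long sequence grows forever.
    static int slotFor(long sequence) {
        return (int) (sequence & MASK);
    }

    public static void main(String[] args) {
        System.out.println(slotFor(0));              // 0
        System.out.println(slotFor(15));             // 15
        System.out.println(slotFor(16));             // 0 -> wrapped around the ring
        System.out.println(slotFor(1_000_000_007L)); // some slot in [0, 15]
    }
}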

Consumer wait strategies:

  • BlockingWaitStrategy: uses locking; for scenarios where CPU resources are scarce and throughput and latency are not critical
  • BusySpinWaitStrategy: busy-spins, retrying constantly to avoid the system calls caused by thread switching and to reduce latency; recommended when threads are pinned to fixed CPU cores
  • PhasedBackoffWaitStrategy: spin + yield + a custom fallback strategy; for CPU-constrained scenarios where throughput and latency are not critical
  • SleepingWaitStrategy: spin + yield + sleep; a good compromise between performance and CPU usage, but latency is uneven
  • TimeoutBlockingWaitStrategy: locking with a timeout; for CPU-constrained scenarios where throughput and latency are not critical (log4j2 uses this strategy by default)
  • YieldingWaitStrategy: spin + yield + spin; a good compromise between performance and CPU usage, with relatively even latency

 

import com.lmax.disruptor.*;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

import java.util.concurrent.ThreadFactory;


public class DisruptorMain
{
    public static void main(String[] args) throws Exception
    {
        // Element stored in the ring buffer
        class Element {

            private String value;

            public String get(){
                return value;
            }

            public void set(String value){
                this.value= value;
            }

        }

        // Thread factory used to create the Disruptor's processing threads
        ThreadFactory threadFactory = new ThreadFactory(){
            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "simpleThread");
            }
        };

        // Event factory used to pre-populate the RingBuffer when it is created
        EventFactory<Element> factory = new EventFactory<Element>() {
            @Override
            public Element newInstance() {
                return new Element();
            }
        };

        // Handler that processes each Event
        EventHandler<Element> handler = new EventHandler<Element>(){
            @Override
            public void onEvent(Element element, long sequence, boolean endOfBatch)
            {
                System.out.println("Element: " + element.get());
            }
        };

        // Blocking wait strategy
        BlockingWaitStrategy strategy = new BlockingWaitStrategy();

        // Size of the RingBuffer (must be a power of 2)
        int bufferSize = 16;

        // Create the Disruptor in single-producer mode
        Disruptor<Element> disruptor = new Disruptor<>(factory, bufferSize, threadFactory, ProducerType.SINGLE, strategy);

        // Register the EventHandler
        disruptor.handleEventsWith(handler);

        // Start the Disruptor's processing threads
        disruptor.start();

        RingBuffer<Element> ringBuffer = disruptor.getRingBuffer();

        for (int l = 0; true; l++)
        {
            // Claim the next available sequence (slot index)
            long sequence = ringBuffer.next();
            try
            {
                // Get the pre-allocated element at that slot
                Element event = ringBuffer.get(sequence);
                // Set this element's value
                event.set(l+"rs");
            }
            finally
            {
                System.out.println("push" + sequence);
                ringBuffer.publish(sequence);
            }
            Thread.sleep(10);
        }
    }
}

That's the test case! The next article will dig deeper; it will be about volatile. Let me take a break~~

Author: Ting Yu notes
Links: https://juejin.cn/post/6932243675653095438
Source: Nuggets
