[Operating System] 7 examples to explain the CPU cache

Example 1: Memory access and operation

How fast do you think loop 2 will run compared to loop 1?

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

The first loop multiplies every value in the array by 3; the second multiplies only every 16th value by 3. The second loop does only about 6% of the work of the first, yet on modern machines the two run in almost the same time: 80 milliseconds and 78 milliseconds on my machine.

The reason both loops take the same time comes down to memory. Their running time is determined by the number of memory accesses to the array, not by the number of integer multiplications. After the second example below, you will see that the hardware performs the same number of memory accesses for both loops.
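For reference, here is a minimal sketch of how the two loops above might be timed. It assumes a plain .NET console program and the Stopwatch class; the class name and output format are mine, not the article's.

using System;
using System.Diagnostics;

class LoopTiming
{
    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];

        // Loop 1: touch every element.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
        sw.Stop();
        Console.WriteLine($"Loop 1: {sw.ElapsedMilliseconds} ms");

        // Loop 2: touch every 16th element (one element per 64-byte cache line).
        sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
        sw.Stop();
        Console.WriteLine($"Loop 2: {sw.ElapsedMilliseconds} ms");
    }
}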

Example 2: Impact of cache lines

Let's explore this example further. We will try different loop steps, not just 1 and 16.

for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;

The following figure shows the loop's running time for different step lengths (K):
[Figure: loop running time vs. step length K]
Note that while the step size is in the range 1 to 16, the running time is almost unchanged. But from 16 onward, each doubling of the step roughly halves the running time.

The reason behind this is that today's CPUs do not access memory byte by byte; they fetch it in 64-byte chunks called cache lines. When you read a particular memory address, the whole cache line is brought from main memory into the cache, and accessing other values in the same cache line afterwards costs very little.

Since 16 integers occupy 64 bytes (one cache line), every step size between 1 and 16 touches the same number of cache lines: all of the cache lines in the array. With a step of 32 we touch only every other cache line, and with a step of 64 only every fourth.

Understanding cache lines can be important for certain kinds of program optimization. For example, the alignment of data may determine whether an operation touches one cache line or two. As the example above shows, operating on unaligned data can easily be twice as slow.
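As a sketch of the step-size experiment, one way to sweep K might look like this; the sweep range and output format are my own choices.

using System;
using System.Diagnostics;

class StepSweep
{
    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];

        // Double the step K each time; times should stay roughly flat up to K = 16,
        // then roughly halve with each further doubling.
        for (int k = 1; k <= 1024; k *= 2)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < arr.Length; i += k) arr[i] *= 3;
            sw.Stop();
            Console.WriteLine($"K = {k,4}: {sw.ElapsedMilliseconds} ms");
        }
    }
}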

Example 3: L1 and L2 cache size

Today's computers have two or three levels of cache, usually called L1, L2, and possibly L3. If you want to know the sizes of the different caches, you can use the Sysinternals tool CoreInfo or the Windows API call GetLogicalProcessorInformation. Both will tell you the cache line size as well as the sizes of the caches themselves.

On my machine, CoreInfo shows that I have a 32KB L1 data cache, a 32KB L1 instruction cache, and a 4MB L2 cache. Each core has its own L1 caches, while each pair of cores shares an L2 cache.

Logical Processor to Cache Map:
*--- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Unified Cache 0, Level 2, 4 MB, Assoc 16, LineSize 64
--*- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
---* Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
--** Unified Cache 1, Level 2, 4 MB, Assoc 16, LineSize 64

(This is a quad-core machine, so there are L1 data and instruction caches numbered 0 through 3, one pair per core. The L2 caches are unified (data and instructions together), numbered 0 and 1, and each is shared by a pair of cores. The Assoc (associativity) field is explained in a later example.)

Let us verify these figures with an experiment. We iterate over an integer array, incrementing every 16th value: a cheap way to touch each cache line. When we reach the last value, we start over from the beginning. We will try different array sizes, and we should see a sharp drop in performance whenever the array overflows one of the cache levels.

int steps = 64 * 1024 * 1024; // arbitrary number of steps
int lengthMod = arr.Length - 1; // arr.Length must be a power of two
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++; // (x & lengthMod) is equal to (x % arr.Length)
}

[Figure: running time vs. array size]
You can see that performance drops sharply after 32KB and again after 4MB: exactly the L1 and L2 cache sizes on this machine.
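If you want to reproduce this measurement, a minimal driver might look like the following sketch. The MeasureWorkingSet name and the range of sizes are my own assumptions, and the array length must stay a power of two for the & lengthMod trick to work.

using System;
using System.Diagnostics;

class CacheSizeProbe
{
    // Runs a fixed number of strided updates over an array of the given size
    // and returns the elapsed time in milliseconds.
    static long MeasureWorkingSet(int sizeInBytes)
    {
        int[] arr = new int[sizeInBytes / sizeof(int)]; // length is a power of two
        int steps = 64 * 1024 * 1024;                   // arbitrary number of steps
        int lengthMod = arr.Length - 1;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < steps; i++)
        {
            arr[(i * 16) & lengthMod]++; // (x & lengthMod) == (x % arr.Length)
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Array sizes from 4KB up to 64MB, doubling each time.
        for (int size = 4 * 1024; size <= 64 * 1024 * 1024; size *= 2)
        {
            Console.WriteLine($"{size / 1024,6} KB: {MeasureWorkingSet(size)} ms");
        }
    }
}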

Example 4: Instruction-level parallelism

Now let's look at something different. Which of the following two loops do you think is faster?

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }

It turns out that the second loop is about twice as fast as the first, at least on the machines I tested. Why? This has to do with the dependencies between the operations in the two loop bodies.

In the first loop, the operations depend on each other (Translator's note: each increment depends on the result of the previous one):
[Figure: dependency chain between operations on the same value]

But in the second loop, the dependencies are different:
[Figure: independent operations on different values]

Modern processors can execute different parts of instructions with some degree of parallelism (Translator's note: this is related to pipelining; the Pentium processor, for example, has the U and V pipelines described later). This lets the CPU access two memory locations in L1 at the same time, or perform two simple arithmetic operations at once. In the first loop the processor cannot exploit this instruction-level parallelism, but in the second loop it can.
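A minimal sketch of timing the two variants, using the same Stopwatch approach as before; the class name and output format are mine.

using System;
using System.Diagnostics;

class IlpDemo
{
    static void Main()
    {
        int steps = 256 * 1024 * 1024;
        int[] a = new int[2];

        // Loop 1: both increments hit a[0], so each depends on the previous one.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }
        sw.Stop();
        Console.WriteLine($"Dependent increments:   {sw.ElapsedMilliseconds} ms");

        // Loop 2: the increments hit different elements and can proceed in parallel.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
        sw.Stop();
        Console.WriteLine($"Independent increments: {sw.ElapsedMilliseconds} ms");
    }
}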

Example 5: Cache associativity

A key decision in cache design is whether each block of main memory can be stored in any cache slot, or only in some of them (here a slot means a cache line).

There are three ways to map memory blocks to cache slots; a small sketch of the index arithmetic follows the list:

  1. Direct mapped cache
    Each memory block can be stored in only one particular cache slot. The simplest scheme maps the block with index chunk_index to slot (chunk_index % cache_slots). Two memory blocks that map to the same slot cannot be in the cache at the same time. (chunk_index can be computed as physical address / cache line size in bytes.)
  2. N-way set associative cache
    Each memory block can be stored in any one of N particular cache slots. For example, in a 16-way cache, each memory block can go into any of 16 different slots. Memory blocks whose addresses share certain low-order bits share the same set of 16 slots. (Blocks that share those low-order bits are spaced apart in memory by a fixed stride.)
  3. Fully associative cache
    Each memory block can be stored in any cache slot. In effect, the cache works like a hash table.
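To make the index arithmetic of these policies concrete, here is a small sketch; the slot and set counts are illustrative toy numbers, not taken from any particular machine.

using System;

class MappingPolicies
{
    // Direct mapped: a block can live in exactly one slot.
    static long DirectMappedSlot(long chunkIndex, long cacheSlots)
        => chunkIndex % cacheSlots;

    // N-way set associative: a block can live in any of the N slots of one set.
    static long SetAssociativeSet(long chunkIndex, long sets)
        => chunkIndex % sets;

    // Fully associative: a block can live in any slot, so there is no index at all;
    // the cache must search for it, much like a hash table lookup.

    static void Main()
    {
        // Example: a toy cache with 1,024 slots, or equivalently 64 sets of 16 ways.
        Console.WriteLine(DirectMappedSlot(chunkIndex: 70000, cacheSlots: 1024)); // 368
        Console.WriteLine(SetAssociativeSet(chunkIndex: 70000, sets: 64));        // 48
    }
}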

A direct mapped cache is prone to conflicts: when multiple values compete for the same slot, they keep evicting each other and the hit rate plummets. A fully associative cache, on the other hand, is complicated and expensive to implement in hardware. N-way set associativity is the typical scheme for processor caches, striking a good compromise between simple circuitry and a high hit rate.

For example, the 4MB L2 cache on my machine is 16-way set associative. All 64-byte memory blocks are divided into sets, and blocks that map to the same set compete for the 16 slots of that set in the L2 cache.

The L2 cache has 65,536 cache lines (4MB / 64 bytes). With 16 cache lines per set, that gives 4,096 sets. Which set a block belongs to is determined by the lower 12 bits of its block index (2^12 = 4096). As a result, cache lines whose physical addresses differ by a multiple of 262,144 bytes (4096 * 64) compete for the same set, and my machine can hold at most 16 such lines at a time. (Translator's note: compare this with a 2-way associative diagram. Each block index covers 64 bytes; chunk 0 can go into any slot of set 0, chunk 1 into any slot of set 1, and so on up to chunk 4095 and set 4095. Chunk 4096 has the same lower 12 bits as chunk 0, so chunks 4096, 8192, and so on compete with chunk 0 for the slots of set 0; their addresses differ by multiples of 262,144 bytes. At most 16 of them can be cached at once, after which one of them must be evicted.)
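To make the arithmetic concrete, here is a small sketch that computes which L2 set an address would fall into, using the figures described above for my machine (64-byte lines, 16 ways, 4,096 sets); the helper name is mine.

using System;

class SetIndexDemo
{
    const long LineSize = 64;   // bytes per cache line
    const long Ways     = 16;   // lines per set
    const long Sets     = 4096; // (4 MB / 64 bytes) / 16 ways

    // Which L2 set a (physical) address maps to, under the scheme described above.
    static long SetIndex(long address) => (address / LineSize) % Sets;

    static void Main()
    {
        // Addresses that differ by a multiple of 262,144 bytes (4096 * 64)
        // land in the same set and compete for its 16 slots.
        Console.WriteLine(SetIndex(0));          // 0
        Console.WriteLine(SetIndex(262144));     // 0 (same set as address 0)
        Console.WriteLine(SetIndex(2 * 262144)); // 0 again
        Console.WriteLine(SetIndex(64));         // 1 (next cache line, next set)
    }
}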

To make the effect of cache associativity clearly visible, I need to repeatedly access more than 16 elements that belong to the same set. The following method demonstrates this:

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    Stopwatch sw = Stopwatch.StartNew();
    const int rep = 1024*1024; // Number of iterations – arbitrary
    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length) p = 0;
    }
    sw.Stop();
    return sw.ElapsedMilliseconds;
}

This method steps through the array K bytes at a time, wrapping around to the start when it reaches the end, and stops after running long enough (2^20 iterations).

I called UpdateEveryKthByte() with different array sizes (increasing by 1MB each time) and different step values K. Below is a chart of the results, with blue representing longer running times and white representing shorter ones:
[Figure: running time by array size and step K]
The blue areas (longer times) are where the updated values could not all stay in the cache at the same time. The light blue areas correspond to about 80 ms and the white areas to about 10 ms.
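Here is a sketch of how the measurements behind this chart might be gathered. The exact ranges of array sizes and steps plotted above are not stated precisely, so the ranges below are assumptions, and the method body is repeated so the sketch is self-contained.

using System;
using System.Diagnostics;

class AssociativityChart
{
    public static long UpdateEveryKthByte(byte[] arr, int K)
    {
        Stopwatch sw = Stopwatch.StartNew();
        const int rep = 1024 * 1024; // number of iterations, 2^20
        int p = 0;
        for (int i = 0; i < rep; i++)
        {
            arr[p]++;
            p += K;
            if (p >= arr.Length) p = 0;
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Array sizes from 1 MB to 64 MB in 1 MB increments; steps in multiples of 64 bytes.
        for (int mb = 1; mb <= 64; mb++)
        {
            byte[] arr = new byte[mb * 1024 * 1024];
            for (int k = 64; k <= 576; k += 64)
            {
                Console.WriteLine($"{mb} MB, K = {k,3}: {UpdateEveryKthByte(arr, k)} ms");
            }
        }
    }
}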

Let's explain the blue part of the chart:

  1. Why are there vertical lines? A vertical line shows a step value that touches too many memory locations in the same set (more than 16). For those steps, my machine cannot keep all of the touched values in its 16-way associative cache at the same time.

    Some of the worst step values are powers of 2: 256 and 512, for example. Consider traversing an 8MB array with a step of 512. The array contains 32 elements that are spaced 262,144 bytes apart, and the loop updates all 32 of them, because 512 divides 262,144 evenly (recall that the array elements are single bytes).

    Since 32 is greater than 16, these 32 elements keep competing for the 16 slots of the same set in the cache.

    (Translator's note: why is the vertical line at step 512 darker than the one at 256? With the same total number of iterations, a step of 512 hits the contended blocks twice as often as a step of 256. For example, crossing a 262,144-byte span takes 512 steps at a stride of 512 but 1,024 steps at a stride of 256, so over 2^20 iterations the 512-byte stride reaches a contended block 2,048 times while the 256-byte stride reaches it only 1,024 times. The worst case is a step that is itself a multiple of 262,144, where every iteration triggers a cache line eviction.)

    Some step values that are not powers of 2 also have long running times, simply through bad luck: they end up accessing a disproportionate number of elements from the same set. These step values show up as blue lines as well.

  2. Why do the vertical lines stop at an array length of 4MB? Because for arrays of 4MB or less, the 16-way associative cache behaves just like a fully associative cache.

    A 16-way associative cache can hold at most 16 cache lines that are spaced 262,144 bytes apart. Within a 4MB array there can never be 17 or more such lines, because 16 * 262,144 = 4,194,304 bytes = 4MB.

  3. Why is there a blue triangle in the upper left corner? In that region we cannot keep all the necessary data in the cache at the same time, not because of associativity but simply because of the L2 cache size.

    For example, consider traversing a 16MB array with a step of 128. The array is updated every 128 bytes, which means we touch every other 64-byte memory block. To cache every other line of a 16MB array we would need an 8MB cache, but my machine has only 4MB (Translator's note: so conflicts and misses are unavoidable).

    Even if the 4MB cache on my machine were fully associative, it still could not hold 8MB of data at once.

  4. Why is the leftmost part of the triangle faded?

    Notice the left edge, covering steps from 0 to 64 bytes: exactly one cache line! As Examples 1 and 2 showed, additional accesses to the same cache line cost almost nothing. With a step of 16 bytes, for example, it takes 4 steps to reach the next cache line, so only 1 of every 4 memory accesses incurs real cost.

    So for the same number of iterations, these low-cost step values finish in less time.

    Here is the same chart extended to larger parameters:
    [Figure: extended chart of running time by array size and step]
    Cache associativity is interesting to understand and can be demonstrated, but compared with the other issues discussed in this article, it is certainly not the first thing you should worry about when programming.

Example 6: False sharing of cache lines (false sharing)

On multi-core machines, caches run into another problem: coherence. Different cores have fully or partially separate caches. On my machine the L1 caches are separate (as is very common), and there are two pairs of cores, each pair sharing an L2 cache. The details vary from machine to machine, but in general, on a modern multi-core machine with multiple levels of cache, the faster and smaller caches are private to each core.

When one core modifies a value in its own cache, the other cores can no longer use their old copy of that value, because the corresponding memory location is invalidated in every other cache. And since caches operate on whole cache lines rather than individual bytes, the entire cache line is invalidated in every cache!

To demonstrate this problem, consider the following example:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter from four threads with the parameters 0, 1, 2, and 3, it takes 4.3 seconds for all of them to finish.

If instead I pass in 16, 32, 48, and 64, the whole operation takes only 0.28 seconds!

Why does this happen? In the first case, the four values are very likely to end up on the same cache line. Every time a core increments its counter, it invalidates the cache line holding all four counters, so the other cores take a cache miss the next time they access their own counter (Translator's note: the array is shared; each thread updates only its own element, yet the threads still interfere with each other). This multi-threaded behavior effectively disables caching and cripples the program's performance.
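The article does not show its threading harness, so here is a hedged sketch of one way such a four-thread test might be driven, using Task; the Run helper is my own.

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class FalseSharingDemo
{
    private static int[] s_counter = new int[1024];

    private static void UpdateCounter(int position)
    {
        for (int j = 0; j < 100000000; j++)
        {
            s_counter[position] = s_counter[position] + 3;
        }
    }

    // Runs UpdateCounter on one task per position and returns the total elapsed time.
    private static long Run(params int[] positions)
    {
        var sw = Stopwatch.StartNew();
        Task[] tasks = positions.Select(p => Task.Run(() => UpdateCounter(p))).ToArray();
        Task.WaitAll(tasks);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Positions 0..3 share one 64-byte cache line; 16, 32, 48, 64 land on separate lines.
        Console.WriteLine($"Counters 0,1,2,3:     {Run(0, 1, 2, 3)} ms");
        Console.WriteLine($"Counters 16,32,48,64: {Run(16, 32, 48, 64)} ms");
    }
}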

Example 7: Hardware complexity

Even when you understand the basics of how caches work, the hardware can still surprise you sometimes. Different processors differ in their optimizations, heuristics, and subtle details.

On some processors, the L1 cache can handle two accesses in parallel if they go to different banks, while accesses to the same bank must be processed serially. Processors' clever optimizations can surprise you too: for example, in the false-sharing case, some machines performed poorly without fine-tuning, while my home machine managed to optimize the simplest version and reduce the cache invalidations.

Here is a strange example of "hardware oddities":

private static int A, B, C, D, E, F, G;
private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        // do something...
    }
}

When I put three different operations in the loop body, I got the following running times:

Operation               Time
A++; B++; C++; D++;     719 ms
A++; C++; E++; G++;     448 ms
A++; C++;               518 ms

Incrementing the fields A, B, C, and D takes longer than incrementing A, C, E, and G. Stranger still, incrementing only A and C takes longer than incrementing A, C, E, and G!
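For reference, here is a sketch of how the three variants might be timed. The article edits the loop body by hand, so the three separate methods here are simply my way of keeping each variant's loop intact.

using System;
using System.Diagnostics;

class HardwareWeirdness
{
    // F is declared to mirror the article's field layout but is never touched here.
    private static int A, B, C, D, E, F, G;

    static long VariantABCD()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 200000000; i++) { A++; B++; C++; D++; }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static long VariantACEG()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 200000000; i++) { A++; C++; E++; G++; }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static long VariantAC()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 200000000; i++) { A++; C++; }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine($"A++; B++; C++; D++;  {VariantABCD()} ms");
        Console.WriteLine($"A++; C++; E++; G++;  {VariantACEG()} ms");
        Console.WriteLine($"A++; C++;            {VariantAC()} ms");
    }
}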

I can't be sure of the reason behind these numbers, but I suspect it is related to the memory banks. If anyone can explain them, I'd be glad to hear it.

The lesson of this example is that it is very hard to fully predict hardware behavior. You can predict a lot, but in the end it is essential to measure and verify your assumptions.

A reply to the 7th example

Goz: I asked some Intel engineers about the final example and got the following reply:

"Clearly this comes down to how instructions get retired in the execution units, how fast the machine handles a store-hit-load, and how it handles the heuristics of speculative loop-unrolling execution (for example, whether the loop runs extra iterations because of internal conflicts). But it means you would need a very detailed pipeline tracer and simulator to figure it out. Predicting out-of-order instruction behavior in the pipeline on paper is extremely difficult, even for the people who design the chips. For a layman, there is no way. Sorry!"

PS: some personal notes on the principle of locality and pipeline parallelism

Program execution exhibits locality in both time and space. Temporal locality means that once a value has been brought from memory into the cache, it is likely to be referenced again in the near future. Spatial locality means that values adjacent to it in memory are brought into the cache along with it and are likely to be used as well. If you pay attention to the principle of locality when programming, you will be rewarded with better performance.

For example, in C, references to static variables should be kept to a minimum. Static variables live in the global data segment, so in a function that is called repeatedly, referencing them may require repeatedly swapping them in and out of the cache. Local variables allocated on the stack, on the other hand, can usually be found in the cache on every call, because the stack is reused very heavily.

For another example, the code inside a loop body should be kept as small as possible, because code lives in the instruction cache, which is a first-level cache only a few tens of kilobytes in size. If the loop's code exceeds the L1 instruction cache, the caching advantage is lost.

A brief note on CPU pipeline parallelism: the Intel Pentium processor has two pipelines, U and V. Each pipeline can read and write the cache independently, so two instructions can execute in a single clock cycle. The two pipelines are not equal, though: the U pipeline can handle the full instruction set, while the V pipeline can only handle simple instructions.

Pentium instructions can be roughly divided into four categories. The first category consists of common simple instructions such as mov, nop, push, pop, add, sub, and, or, xor, inc, dec, cmp, and lea. These can execute in either pipeline and, as long as they have no dependencies on each other, can run fully in parallel.

The second category consists of instructions that must pair with the other pipeline, such as some carry and shift operations. If such an instruction is in the U pipeline, other instructions can run concurrently in the V pipeline; if it is in the V pipeline, the U pipeline stalls.

The third category consists of branch instructions, such as jmp, call, and conditional branches. They are the opposite of the second category: they can pair with the U pipeline only when they run in the V pipeline; otherwise they monopolize the CPU.

The fourth category consists of the remaining complex instructions. They can only monopolize the CPU and are generally not commonly used.

When programming at the assembly level, achieving instruction-level parallelism requires paying attention to how instructions pair: prefer the first category, avoid the fourth, and reduce ordering dependencies between neighbouring instructions.


Origin blog.csdn.net/LU_ZHAO/article/details/105520354