Things about CPU cache

The CPU cache is integrated inside the CPU and is one of the components that enables the CPU to run efficiently. This article focuses on the following three topics to explain the role of the CPU cache:

  • Why do you need caching?
  • What is the internal structure of the cache?
  • How to make good use of cache and optimize code execution efficiency?

Why do you need caching?

Modern computer architecture introduces several kinds of storage for data:

  • 1. CPU registers
  • 2. CPU cache
  • 3. Memory
  • 4. Hard disk

Going from 1 to 4, access speed gets slower and slower, price per byte gets lower and lower, and capacity gets larger and larger. This design keeps the overall price of a computer within a reasonable range and is what allowed computers to enter ordinary households.

Because the hard disk is much slower than memory, application developers often introduce components such as redis/memcached to cache hot data in memory and speed things up.

In the same way, the speed gap between the CPU and memory is what gives rise to the CPU cache.

The following table reflects the speed gap between CPU cache and memory.

Memory type    Access latency (clock cycles)
L1 cache       4
L2 cache       11
L3 cache       24
Memory         167

A CPU usually has three levels of cache: L1, L2, and L3. The L1 cache is split into a data cache and an instruction cache. The CPU first looks for instructions and data in the L1 cache; if they are not there, it goes to the L2 cache. Each CPU core has its own L1 and L2 caches. If the data is not in the L2 cache either, it is fetched from the L3 cache, which is shared by all CPU cores. If the data is not in the L3 cache, it is fetched from memory, and if it is not in memory, it can only come from the hard disk.
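
On Linux with glibc, the sizes of these cache levels can be queried programmatically. The sketch below uses the _SC_LEVEL*_CACHE_SIZE constants, which are glibc extensions and may report 0 on systems that do not expose the information:

#include <unistd.h>
#include <iostream>

// Minimal sketch: query cache sizes via glibc's sysconf() extensions.
// The _SC_LEVEL*_CACHE_SIZE constants are Linux/glibc specific and may
// return 0 (or -1) when the value is not available.
int main() {
    std::cout << "L1 data cache: " << sysconf(_SC_LEVEL1_DCACHE_SIZE) << " bytes\n";
    std::cout << "L2 cache:      " << sysconf(_SC_LEVEL2_CACHE_SIZE)  << " bytes\n";
    std::cout << "L3 cache:      " << sysconf(_SC_LEVEL3_CACHE_SIZE)  << " bytes\n";
}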

With this layered structure in mind, we can now look at the internal details of the cache.

Cache internal structure

When the CPU cache reads data from memory, it does not read a single word or byte at a time; it reads a small block at a time. Each such block is called a CPU cache line. This is an application of the principle of locality: when an instruction or a piece of data is accessed, data at nearby addresses is very likely to be accessed soon, so keeping that nearby data in the cache improves the hit rate.
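
On mainstream x86 CPUs a cache line is typically 64 bytes. As a small illustration (assuming a C++17 toolchain that provides the constant), the standard library exposes a cache-line-size hint in <new>:

#include <new>
#include <iostream>

// Minimal sketch: print the cache-line size hint from the C++17 standard
// library. Not every toolchain defines it; where it exists, it usually
// matches the L1 data cache line size (64 bytes on typical x86 CPUs).
int main() {
#ifdef __cpp_lib_hardware_interference_size
    std::cout << std::hardware_destructive_interference_size << " bytes\n";
#else
    std::cout << "hardware_destructive_interference_size not available\n";
#endif
}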

Caches can be organized in one of three ways: direct-mapped, set-associative, or fully associative.

Each is introduced in turn below.

Direct-mapped cache

A direct-mapped cache maps each memory address to one fixed cache line.

The idea is to split a memory address into three fields: Tag, Index, and Offset (the address here is a virtual address). If the cache is viewed as an array of cache lines, the Index is the array subscript: it directly selects one cache line. Once that cache line is obtained, its stored Tag is compared with the Tag field of the address. If they are equal, the address is present in that cache line, i.e. the cache hits. Finally, the Offset selects the required data within the line. The whole process is roughly as shown below:

[Figure: direct-mapped cache lookup process]

The following example assumes the cache contains 8 cache lines.

[Figure: a cache with 8 cache lines]

For direct-mapped caches, the mapping relationship between memory and cache is as follows:

[Figure: direct-mapped cache: mapping between memory addresses and cache lines]

From the figure we can see that the addresses 0x00, 0x40, and 0x80 have the same Index field, so they are all loaded into the same cache line.
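
A small sketch makes the mapping concrete. The geometry is an assumption chosen to match the figure: 8 cache lines of 8 bytes each, i.e. a 3-bit Offset and a 3-bit Index:

#include <cstdio>

// Minimal sketch of direct-mapped address decomposition, assuming
// 8-byte cache lines (3 offset bits) and 8 cache lines (3 index bits),
// matching the figure above.
constexpr unsigned kOffsetBits = 3;
constexpr unsigned kIndexBits  = 3;

void decompose(unsigned long addr) {
    unsigned long offset = addr & ((1ul << kOffsetBits) - 1);
    unsigned long index  = (addr >> kOffsetBits) & ((1ul << kIndexBits) - 1);
    unsigned long tag    = addr >> (kOffsetBits + kIndexBits);
    std::printf("addr 0x%02lx -> tag 0x%lx, index %lu, offset %lu\n",
                addr, tag, index, offset);
}

int main() {
    // 0x00, 0x40 and 0x80 all produce index 0, so in a direct-mapped
    // cache they compete for the same cache line.
    decompose(0x00);
    decompose(0x40);
    decompose(0x80);
}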

Just imagine what would happen if we accessed 0x00, 0x40, and 0x00 in sequence?

When we access 0x00, the cache misses, so the data is loaded from memory into cache line 0. When 0x40 is then accessed, the Tag stored in cache line 0 does not match the Tag field of the address, so the data at 0x40 has to be loaded from memory into cache line 0, replacing the previous contents. When 0x00 is finally accessed again, the cache misses once more, because line 0 now holds the data for 0x40. The cache plays no role in this whole sequence: even though the same memory address is accessed twice, the data always has to be reloaded from memory.

This phenomenon is called cache thrashing. To mitigate it, the set-associative cache was introduced. The next section explains how a set-associative cache works.

Set-associative cache

A set-associative cache works on a more complicated principle than a direct-mapped cache. We will explain it using a two-way set-associative cache.

"Multi-way" means that, whereas the Index in the address used to determine a single cache line, it now selects a group of cache lines. "Two-way" means the Index locates 2 cache lines. After the two cache lines are found, both are examined and their stored Tags are compared with the Tag field of the address; if either is equal, the cache hits.

[Figure: structure of a two-way set-associative cache]

Continuing with the 8-cache-line example, now organized as a two-way set-associative cache: suppose there is a virtual address 0b0000001100101100, whose Tag is 0x19, Index is 1, and Offset is 4. The Index value 1 selects a set containing two cache lines. The Tag of the first cache line in the set is 0x10, so it does not match; the Tag of the second is 0x19, which matches, so the cache hits.

[Figure: two-way set-associative lookup for address 0b0000001100101100]

For a set-associative cache, the mapping relationship between memory and cache is as follows:

[Figure: set-associative cache: mapping between memory addresses and cache sets]

Because a set-associative cache has to compare several Tags per lookup, and these comparisons are usually done in parallel to keep lookups fast, its hardware is more complex and more expensive than that of a direct-mapped cache.

In addition, what should be done when the cache does not hit?

Taking the two-way case as an example, the Index locates two cache lines. If both are free when a miss occurs, either one can be chosen to receive the data. If exactly one of the two is free, the free line is used. If both are valid, a replacement (eviction) policy is needed, such as PLRU, NRU, FIFO, or round-robin.
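
The following is a minimal software sketch of such a lookup, assuming the same geometry as before (8-byte lines, 4 sets of 2 ways) and the simple round-robin policy from the list above; real hardware does this with parallel comparators rather than loops:

#include <cstdio>

// Minimal sketch of a two-way set-associative lookup, assuming 8-byte
// lines (3 offset bits) and 4 sets of 2 ways (2 index bits, 8 lines total).
// A miss loads the line, choosing a victim way by round-robin.
struct Line { bool valid = false; unsigned long tag = 0; };

constexpr unsigned kOffsetBits = 3, kIndexBits = 2, kWays = 2;
constexpr unsigned kSets = 1u << kIndexBits;

Line cache_[kSets][kWays];
unsigned victim[kSets] = {};     // round-robin victim pointer per set

bool access(unsigned long addr) {
    unsigned long index = (addr >> kOffsetBits) & (kSets - 1);
    unsigned long tag   = addr >> (kOffsetBits + kIndexBits);
    for (unsigned w = 0; w < kWays; w++)            // compare both ways
        if (cache_[index][w].valid && cache_[index][w].tag == tag)
            return true;                            // hit
    unsigned w = victim[index];                     // miss: choose a victim way
    victim[index] = (w + 1) % kWays;
    cache_[index][w].valid = true;                  // "load" the line
    cache_[index][w].tag   = tag;
    return false;
}

int main() {
    unsigned long addrs[] = {0x00, 0x40, 0x00};
    for (unsigned long a : addrs)
        std::printf("access 0x%02lx -> %s\n", a, access(a) ? "hit" : "miss");
}

Running the access sequence 0x00, 0x40, 0x00 through this sketch prints miss, miss, hit, which matches the walkthrough below.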

What will happen if we access 0x00, 0x40, and 0x00 in sequence at this time?

When we access 0x00, the cache misses, so the data is loaded from memory into way 0 of set 0. When 0x40 is accessed, the Tag in way 0 of set 0 does not match the Tag field of the address, so the data is loaded from memory into way 1 of set 0. When 0x00 is finally accessed again, it hits in way 0 of set 0, so the cache takes effect. The set-associative cache thus alleviates the cache thrashing problem.

Fully associative cache

We have seen that set-associativity reduces cache thrashing, and the more ways there are, the better the effect. Taking this to the extreme: what if all the cache lines were placed in a single set? Would that give the best result? Based on this idea, the fully associative cache was born.

[Figure: fully associative cache]

The following example uses a fully associative cache with 8 cache lines. Suppose again the virtual address 0b0000001100101100, with a Tag value of 0x19 and an Offset of 4. The cache lines are checked one after another until the 4th cache line is reached and its Tag matches.

[Figure: fully associative lookup example]

In a fully associative cache, all cache lines belong to a single set, so no part of the address is set aside as an Index. To decide whether there is a hit, every cache line's Tag has to be compared with the Tag field of the virtual address; a match means a hit. As a result, data at any address can be cached in any cache line, which avoids cache thrashing, but the hardware cost is also the highest.
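
As a companion to the previous sketch, here is a minimal fully associative lookup under the same assumed geometry (8 lines of 8 bytes); again, real hardware compares all tags in parallel rather than in a loop:

#include <cstdio>

// Minimal sketch of a fully associative lookup: there are no index bits,
// so every line's tag must be compared. Assumes 8 lines of 8 bytes
// (3 offset bits).
struct Line { bool valid = false; unsigned long tag = 0; };

constexpr unsigned kOffsetBits = 3, kLines = 8;
Line lines[kLines];

int lookup(unsigned long addr) {
    unsigned long tag = addr >> kOffsetBits;        // everything above the offset
    for (unsigned i = 0; i < kLines; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return static_cast<int>(i);             // hit: which line holds it
    return -1;                                      // miss
}

int main() {
    lines[3].valid = true;                          // pretend line 3 holds the block
    lines[3].tag   = 0x32Cul >> kOffsetBits;        // containing address 0x32C
    std::printf("0x32C -> %s\n", lookup(0x32C) >= 0 ? "hit" : "miss");
}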

How to use cache to write efficient code?

Look at the following example. When summing a two-dimensional array, you can traverse by row or by column. So which one will be faster?

const int row = 1024;
const int col = 1024;
int matrix[row][col];

// traverse by row
int sum_row = 0;
for (int r = 0; r < row; r++) {
    for (int c = 0; c < col; c++) {
        sum_row += matrix[r][c];
    }
}

// traverse by column
int sum_col = 0;
for (int c = 0; c < col; c++) {
    for (int r = 0; r < row; r++) {
        sum_col += matrix[r][c];
    }
}

We write the two test programs below. First, timing the row-wise traversal:

#include <chrono>
#include <iostream>

const int row = 1024;
const int col = 1024;
int matrix[row][col];

// row-wise traversal
int main(){
    // initialize the matrix
    for (int r = 0; r < row; r++) {
        for (int c = 0; c < col; c++) {
            matrix[r][c] = r + c;
        }
    }

    auto start = std::chrono::steady_clock::now();

    // traverse by row
    int sum_row = 0;
    for (int r = 0; r < row; r++) {
        for (int c = 0; c < col; c++) {
            sum_row += matrix[r][c];
        }
    }

    auto finish = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
    std::cout << duration.count() << "ms" << std::endl;
}

Standard output printed: 2ms

Next is the test code for column-wise traversal:

#include <chrono>
#include <iostream>

const int row = 1024;
const int col = 1024;
int matrix[row][col];

// column-wise traversal
int main(){
    // initialize the matrix
    for (int r = 0; r < row; r++) {
        for (int c = 0; c < col; c++) {
            matrix[r][c] = r + c;
        }
    }

    auto start = std::chrono::steady_clock::now();

    // traverse by column
    int sum_col = 0;
    for (int c = 0; c < col; c++) {
        for (int r = 0; r < row; r++) {
            sum_col += matrix[r][c];
        }
    }

    auto finish = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
    std::cout << duration.count() << "ms" << std::endl;
}

Standard output printed: 8ms

The answer is clear: row-wise traversal is much faster than column-wise traversal.

The reason is that when traversing by row, accessing matrix[r][c] also loads the following elements of that row into the cache line, so matrix[r][c+1], matrix[r][c+2], and so on hit the cache, which greatly speeds up the accesses.
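
To put rough numbers on this (assuming a typical 64-byte cache line, which the measurements above do not state explicitly): one cache line holds 64 / sizeof(int) = 16 ints, so row-wise traversal misses at most roughly once every 16 accesses. Column-wise traversal jumps 1024 * 4 = 4096 bytes between consecutive accesses, so in the worst case almost every access is a miss.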

As shown in the figure below, when matrix[0][0] is accessed, the neighbouring elements matrix[0][1], matrix[0][2], and so on are loaded into the same cache line, so the subsequent accesses of the traversal can be served from the cache.

[Figure: adjacent elements of a row loaded into one cache line]

When traversing by column, after matrix[0][0] is accessed, the next element needed is matrix[1][0], which is not in the same cache line, so it has to be fetched from memory again. This is why column-wise traversal is so much slower than row-wise traversal.

