The principle of caching: understanding computer architecture and performance optimization

Caching based on the memory hierarchy

Author: Once Day
Date: May 9, 2023

It's a long way to go, and we've only just started!

This content is collected and organized from the book "In-depth Understanding of Computer Systems" (Computer Systems: A Programmer's Perspective).


1 Overview

In an ideal world, the memory system is a linear array of bytes, and the CPU can access each memory location in constant time.


But in reality, a memory system is a hierarchy of storage devices with varying capacities, costs, and access times:

  • CPU registers hold the most frequently used data.
  • Then comes the small, fast cache memory.
  • Then there is the slower, but larger main memory, which holds instructions and data.
  • Further down are physical or virtual storage devices with larger capacities, such as disks, cloud storage, tapes, and networked storage.


Programs tend to exhibit locality: most of the time they access the contents of a small set of locations, which allows this hierarchical structure to strike a good balance between cost and efficiency.

Generally speaking, data in a register can be accessed directly in 0 cycles, data in the cache takes 4~75 cycles, data in main memory takes hundreds of cycles, and data on disk may take millions of cycles to access.

1.1 Random-Access Memory (RAM)

Random-access memory is divided into two categories: static (SRAM) and dynamic (DRAM). SRAM is generally more expensive, so its capacity is smaller, typically a few megabytes.

SRAM stores each bit in a bistable memory cell. Each cell is implemented with a transistor circuit that can be held indefinitely in either of two stable voltage configurations (states); any other state is unstable. Even when disturbed by electrical noise, the cell returns to one of the stable states.

DRAM stores each bit as charge on a capacitor, which makes it more sensitive: many disturbances can change the capacitor's voltage, so the value of each bit must be refreshed periodically (every few tens of milliseconds).

1.2 Principle of locality

A well-written program often has good locality: it tends to reference data items that are near other recently referenced items, or the recently referenced items themselves. There are usually two forms:

  • Temporal locality: a memory location that has been referenced once is likely to be referenced again many times in the near future.
  • Spatial locality: once a memory location has been referenced, the program is likely to reference nearby locations in the near future.

The principle of locality underlies a general class of performance optimizations: programs with good locality run faster than programs with poor locality. Examples include hardware caches and operating systems using DRAM to cache disk contents.

Here is an example of data locality of reference, summing a two-dimensional array:

// Stride 1
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += g_data[i][j];
    }
}

// Stride N (N = M = 1000)
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += g_data[j][i];
    }
}

On the test machine, the stride-1 version took only one-third as long as the stride-N version.

Sequentially accessing each element of a vector, as in the first loop, is said to have a stride-1 reference pattern, also called a sequential reference pattern. As the stride increases, spatial locality decreases.

The basic ideas of locality are as follows:

  • Repeated references to the same variable have good temporal locality.
  • For programs with a stride-k reference pattern, the smaller the stride, the better the spatial locality.
  • Loops have good temporal and spatial locality with respect to instruction fetches.
1.3 Caching

Generally speaking, a cache is a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device. The process of using a cache is called caching.

Cache hit: when the program needs a data object d from level k+1 and d happens to be cached at level k.

Cache miss: if data object d is not cached at level k, then we have a cache miss.

On a cache miss, the cache at level k fetches the block containing d from level k+1. If the cache at level k is already full, it may overwrite an existing block.

The process of overwriting an existing block is called replacing or evicting the block. The evicted block is sometimes called the victim block.

Which block gets replaced is determined by the cache's replacement policy, such as random replacement or least-recently-used (LRU) replacement.

There are the following types of cache misses:

  • An empty cache is called a cold cache, and misses against it are called compulsory misses or cold misses.
  • A restrictive placement policy can also cause misses, called conflict misses, which occur when multiple data objects map to the same cache block even though the cache has spare room.
  • The set of cache blocks a program accesses during a given phase is called its working set. When the working set exceeds the cache size, the cache suffers capacity misses.

In practice, L1/L2/L3 caches typically use 64-byte blocks (cache lines).
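
As a small illustration of this, assuming the 64-byte line size above, the following C sketch shows which cache-line-sized block an address falls into and the byte offset within it (the address value is purely hypothetical):

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed 64-byte cache line */

int main(void)
{
    uint64_t addr = 0x7ffe12345678u;                            /* arbitrary example address   */
    uint64_t line_base = addr & ~(uint64_t)(LINE_SIZE - 1);     /* first byte of the block     */
    unsigned offset = (unsigned)(addr & (LINE_SIZE - 1));       /* byte offset inside the block */

    printf("address   = 0x%llx\n", (unsigned long long)addr);
    printf("line base = 0x%llx\n", (unsigned long long)line_base);
    printf("offset    = %u\n", offset);
    return 0;
}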

2. Cache memory

2.1 General purpose cache memory organization

Assume a computer system in which each memory address has m bits, forming M = 2^m different addresses.


The machine's cache is then organized as an array of S = 2^s cache sets, each set containing E cache lines. Each line consists of a B = 2^b byte data block, a valid bit that indicates whether the line contains meaningful information, and t = m - (b + s) tag bits, which are a subset of the bits of the current block's memory address and uniquely identify the block stored in the cache line.

The structure of a cache can be described by the tuple (S, E, B, m). The capacity of the cache is the sum of the sizes of all its blocks; the tag bits and valid bits are not counted, so C = S × E × B.

The cache divides an m-bit address into the following fields:

|-(high bits)---------------m bits---------------(low bits)-|
|----tag (t bits)----|--set index (s bits)--|--block offset (b bits)--|
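
A minimal C sketch of this address split, using hypothetical values s = 6 and b = 6 (as in a typical 64-set cache with 64-byte blocks), simply applies shifts and masks to recover the three fields:

#include <stdio.h>

/* Illustrative parameters only: s set-index bits and b block-offset bits. */
#define S_BITS 6u   /* s: set index bits    */
#define B_BITS 6u   /* b: block offset bits */

static unsigned block_offset(unsigned long addr) { return (unsigned)(addr & ((1ul << B_BITS) - 1)); }
static unsigned set_index(unsigned long addr)    { return (unsigned)((addr >> B_BITS) & ((1ul << S_BITS) - 1)); }
static unsigned long tag(unsigned long addr)     { return addr >> (B_BITS + S_BITS); }

int main(void)
{
    unsigned long addr = 0x12345678ul;  /* arbitrary example address */
    printf("tag=0x%lx set=%u offset=%u\n", tag(addr), set_index(addr), block_offset(addr));
    return 0;
}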

The following parameters are commonly used:

| Parameter | Description | Derived quantity |
| S = 2^s | Number of sets | s = log2(S), number of set index bits |
| E | Number of lines per set | |
| B = 2^b | Block size (bytes) | b = log2(B), number of block offset bits |
| m = log2(M) | Number of (main memory) physical address bits | t = m - (s + b), number of tag bits |
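
As a worked example with a common L1 configuration (32 KB capacity, 8-way, 64-byte blocks, as in the Core i7 table in Section 3.2), S = C/(E × B) = 32768/(8 × 64) = 64 sets, so s = log2(64) = 6 and b = log2(64) = 6; assuming, say, 48-bit physical addresses, that leaves t = 48 - (6 + 6) = 36 tag bits.
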
2.2 Direct-mapped cache

Caches are classified by the number of cache lines E per set. A cache with exactly one line per set (E = 1) is called a direct-mapped cache.


Consider a simple system consisting of a CPU, registers, an L1 cache, and DRAM. When the CPU executes an instruction that reads a memory word w, it requests w from the L1 cache; if the L1 cache has a cached copy of w, we get an L1 cache hit.

If there is a cache miss, the CPU must wait while the L1 cache requests a copy of the block containing w from main memory (DRAM).

Eventually the L1 cache holds a copy of the block, the word w is extracted from the cache block, and w is returned to the CPU.

The process by which a cache determines whether a request hits and then extracts the requested word breaks down into three steps (a minimal software sketch follows the list):

  • (1) Set selection
  • (2) Line matching
  • (3) Word extraction
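
To make the three steps concrete, here is a simplified C sketch of a direct-mapped lookup at byte granularity, assuming hypothetical parameters of S = 64 sets and B = 64-byte blocks; real hardware performs these steps in parallel logic rather than software:

#include <stdint.h>

#define B      64                 /* block size in bytes (assumed) */
#define S      64                 /* number of sets (assumed)      */
#define B_BITS 6                  /* log2(B)                       */
#define S_BITS 6                  /* log2(S)                       */

struct line {                     /* one cache line; E = 1 for a direct-mapped cache */
    int      valid;
    uint64_t tag;
    uint8_t  data[B];
};

static struct line cache[S];      /* the whole cache: S sets x 1 line */

/* Returns 1 on a hit (byte copied to *out), 0 on a miss. */
static int dm_lookup(uint64_t addr, uint8_t *out)
{
    unsigned offset = (unsigned)(addr & (B - 1));        /* offset field used in step (3) */
    unsigned set    = (unsigned)((addr >> B_BITS) & (S - 1)); /* (1) set selection        */
    uint64_t tag    = addr >> (B_BITS + S_BITS);

    struct line *ln = &cache[set];
    if (ln->valid && ln->tag == tag) {                   /* (2) line matching             */
        *out = ln->data[offset];                         /* (3) word (byte) extraction    */
        return 1;
    }
    return 0;                                            /* miss: caller fetches the block */
}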

As shown below, the s set index bits are extracted from the middle of the address and interpreted as an unsigned integer that identifies a set.


Using the index formed by the s set index bits, the corresponding cache set is found. For a direct-mapped cache, since there is only one line per set, we simply check whether the valid bit is set and whether the tags match; if both hold, a copy of w is contained in this line.


If either of the checks in steps (1) and (2) fails, we have a cache miss. The lowest bits of the address are the block offset bits, whose width is determined by the block size; since cache lines are typically 64 bytes, the offset field is typically 6 bits.

The line replacement policy on a direct-mapped cache miss is trivial: the newly fetched line simply replaces the current line.

2.3 A direct-mapped cache example

The best way to understand how a cache operates is to walk through a concrete example.

Assume the following configuration:

(S, E, B, m) = (4, 1, 2, 4)

That is, the cache has 4 sets, each set has one line, each line's block holds 2 bytes, addresses are 4 bits wide, and the word size is 1 byte. Here is the full address mapping:

| Address | Tag bit (t=1) | Index bits (s=2, binary) | Offset bit (b=1) | Block number |
| 0 | 0 | 00 | 0 | 0 |
| 1 | 0 | 00 | 1 | 0 |
| 2 | 0 | 01 | 0 | 1 |
| 3 | 0 | 01 | 1 | 1 |
| 4 | 0 | 10 | 0 | 2 |
| 5 | 0 | 10 | 1 | 2 |
| 6 | 0 | 11 | 0 | 3 |
| 7 | 0 | 11 | 1 | 3 |
| 8 | 1 | 00 | 0 | 4 |
| 9 | 1 | 00 | 1 | 4 |
| 10 | 1 | 01 | 0 | 5 |
| 11 | 1 | 01 | 1 | 5 |
| 12 | 1 | 10 | 0 | 6 |
| 13 | 1 | 10 | 1 | 6 |
| 14 | 1 | 11 | 0 | 7 |
| 15 | 1 | 11 | 1 | 7 |

From the table above we can see:

  • The tag and index bits together uniquely identify each block in memory.
  • There are 8 blocks but only 4 cache sets, so multiple blocks map to the same cache set (they share the same set index).
  • Blocks that map to the same cache set are distinguished by their tag bits.
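
As a quick check, the following small C loop reproduces the table above by splitting each 4-bit address into its tag, set index, and offset fields for the (S, E, B, m) = (4, 1, 2, 4) parameters:

#include <stdio.h>

int main(void)
{
    /* (S, E, B, m) = (4, 1, 2, 4): t = 1 tag bit, s = 2 index bits, b = 1 offset bit */
    for (unsigned addr = 0; addr < 16; addr++) {
        unsigned offset = addr & 0x1;         /* b = 1 offset bit          */
        unsigned set    = (addr >> 1) & 0x3;  /* s = 2 index bits          */
        unsigned tag    = addr >> 3;          /* t = 1 tag bit             */
        unsigned block  = addr >> 1;          /* block number = addr / B   */
        printf("addr=%2u tag=%u index=%u%u offset=%u block=%u\n",
               addr, tag, (set >> 1) & 1, set & 1, offset, block);
    }
    return 0;
}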

The following performs a simulation run:

  1. Initially, the cache is empty, so the four cache sets look like this:

    | Set | Valid | Tag | [0] | [1] |
    |  0  |   0   |     |     |     |
    |  1  |   0   |     |     |     |
    |  2  |   0   |     |     |     |
    |  3  |   0   |     |     |     |
    
  2. Read the word at address 0. A cache miss occurs (a cold miss, since the cache is cold). The cache fetches block 0 from memory and stores it in set 0, i.e. the contents of addresses m[0] and m[1], and then returns the contents of m[0].

    | Set | Valid | Tag | [0]  | [1]  |
    |  0  |   1   |  0  | m[0] | m[1] |
    |  1  |   0   |     |      |      |
    |  2  |   0   |     |      |      |
    |  3  |   0   |     |      |      |
    
  3. Read the word at address 1. This time the cache hits, so the value m[1] is returned directly from byte [1] of the block in set 0.

  4. Read the word at address 13. The cache line in set 2 is not valid, so there is a cache miss; the cache loads block 6 into set 2 and then returns m[13] from byte [1] of the new cache line.

    | Set | Valid | Tag | [0]   | [1]   |
    |  0  |   1   |  0  | m[0]  | m[1]  |
    |  1  |   0   |     |       |       |
    |  2  |   1   |  1  | m[12] | m[13] |
    |  3  |   0   |     |       |       |
    
  5. Read the word at address 8. A cache miss occurs: the line in set 0 is valid but the tags do not match, so the cache loads block 4 into set 0, replacing the old data, and returns m[8] from byte [0] of the new block.

    | Set | Valid | Tag | [0]   | [1]   |
    |  0  |   1   |  1  | m[8]  | m[9]  |
    |  1  |   0   |     |       |       |
    |  2  |   1   |  1  | m[12] | m[13] |
    |  3  |   0   |     |       |       |
    
  6. Read the word at address 0 again. This is another cache miss, because block 0 was evicted in the previous step. The cache has plenty of spare capacity; the miss is caused by a cache conflict (a conflict miss).

This phenomenon, where a cache repeatedly loads and evicts the same set of cache blocks, is commonly referred to as thrashing. Consider the following example:

float dotprod(float x[8], float y[8])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

Although this function has good spatial locality, if the difference between the addresses of x and y happens to be an integer multiple of the cache size, then x[i] and y[i] map to the same cache set and conflict misses (thrashing) are likely.
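
One common mitigation is to add padding between the arrays so that x[i] and y[i] no longer map to the same set. The sketch below is illustrative only; the amount of padding actually needed depends on the cache geometry of the target machine:

/* Sketch: pad x so that, when x and y are laid out back to back in memory,
 * x[i] and y[i] fall into different cache sets. The padding amount (4 floats
 * here) is an illustrative value, not a universal constant. */
float x[12];   /* 8 elements used, 4 elements of padding */
float y[8];

float dotprod_padded(void)
{
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}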

2.4 Set-associative cache

The conflict-miss problem of direct-mapped caches stems from having only one line per set. Set-associative caches relax this restriction by allowing each set to hold more than one cache line, i.e. 1 < E < C/B; such a cache is called an E-way set-associative cache.

Set selection in a set-associative cache is identical to that in a direct-mapped cache: the set index bits identify the set.

The difference is that each set in a set-associative cache has multiple lines. Each set can be regarded as a small associative memory, i.e. an array of (key, value) pairs: given a key, the matching value is returned. Here the key is the combination of the tag and valid bits, and the value is the contents of the block.

Any line in a set of a set-associative cache can hold any of the memory blocks that map to that set, so the cache must search every line in the set for a valid line whose tag matches the tag in the address.

If the word requested by the CPU is not in any line of the set, there is a cache miss and the cache must fetch the block containing the word from the next lower level. If the set contains an empty line, replacing that empty line is the obvious choice.

If there is no empty line in the set, a non-empty line must be chosen as the victim. Common policies include the following (application code generally does not need to care about this; a minimal LRU sketch is given after the list):

  • Random replacement
  • Least-Frequently-Used (LFU)
  • Least-Recently-Used (LRU)
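
For reference, here is a minimal software sketch of LRU victim selection within one set, assuming a hypothetical 8-way set and a per-line access timestamp; real hardware usually uses cheaper approximations such as pseudo-LRU rather than full timestamps:

#include <stdint.h>

#define E 8   /* lines per set (assumed 8-way) */

struct line {
    int      valid;
    uint64_t tag;
    uint64_t last_used;   /* timestamp of the most recent access */
};

/* Pick the victim line in one set under an LRU policy: prefer an invalid
 * (empty) line, otherwise the line with the oldest access timestamp. */
static int choose_victim_lru(const struct line set[E])
{
    int victim = 0;
    for (int i = 0; i < E; i++) {
        if (!set[i].valid)
            return i;                               /* empty line: use it directly   */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                             /* older access => better victim */
    }
    return victim;
}
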
2.5 Fully Associative Cache

A fully associative cache consists of a single set that contains all of the cache lines (E = C/B).

Set selection in a fully associative cache is trivial: since there is only one set, the address contains no set index bits and is divided only into a tag and a block offset.

Line matching and word selection in a fully associative cache work the same way as in a set-associative cache. However, it is hard to build a fully associative cache that is both large and fast, so such caches are generally small. A typical example is the translation lookaside buffer (TLB) in the virtual memory system.

3. Cache performance analysis

3.1 Handling writes

Compared with reads, handling writes is more complicated.

Suppose we write to a word w that is already cached (a write hit). After the cache updates its copy of w, how should the copy at the next lower level be updated? There are two options:

  • Write-through, the simplest method: immediately write w's cache block to the next lower level. The downside is that every write causes bus traffic.
  • Write-back: defer the update as long as possible, writing the block to the next lower level only when the replacement algorithm evicts it. The downside is added complexity in the control logic, since the cache must track which blocks are dirty.

On a write miss, there are two approaches:

  • Write-allocate: load the corresponding block from the lower level into the cache, then update the cache block. This approach tries to exploit the spatial locality of writes, but every miss causes a block transfer from the lower level to the cache.
  • Not-write-allocate: bypass the cache and write the word directly to the lower level.

Write-through caches are usually not-write-allocate, and write-back caches are usually write-allocate.

The caches of modern CPUs generally use a write-back policy: when data in the cache is modified, it is not written back to main memory immediately; instead the modified block is marked as dirty, and the dirty data is written back to main memory only when the cache line is eventually replaced. This reduces the number of writes to main memory and improves system performance.

In contrast, a write-through policy writes data back to main memory immediately whenever the cached copy is modified, which results in frequent main-memory writes and lower performance.

Therefore the write-back policy is usually the more common and the better-performing choice. However, write-back can also lead to data inconsistency, which must be addressed by a well-designed cache coherence protocol.

In general it can be assumed that modern systems use write-back, write-allocate caches, which makes handling writes symmetric to handling reads (though the details remain processor-specific).
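
The following C sketch is illustrative only (fetch_block and writeback_block are hypothetical stand-ins for traffic to the next lower level); it shows how a write-back, write-allocate store path uses a dirty bit to defer the write until eviction:

#include <stdint.h>
#include <string.h>

#define B 64   /* block size in bytes (assumed) */

struct line {
    int      valid;
    int      dirty;     /* set when the cached copy differs from memory */
    uint64_t tag;
    uint8_t  data[B];
};

/* Stand-ins for traffic to the next lower level; in this sketch they do nothing useful. */
static void fetch_block(uint64_t block_addr, uint8_t *dst)
{
    (void)block_addr;
    memset(dst, 0, B);
}

static void writeback_block(uint64_t block_addr, const uint8_t *src)
{
    (void)block_addr;
    (void)src;
}

/* Write-back, write-allocate store of one byte into a given cache line. */
static void store_byte(struct line *ln, uint64_t addr, uint64_t tag, uint8_t value)
{
    unsigned offset = (unsigned)(addr & (B - 1));

    if (!(ln->valid && ln->tag == tag)) {                   /* write miss                */
        if (ln->valid && ln->dirty)                         /* evict: flush dirty block  */
            writeback_block(0 /* old block address, omitted in this sketch */, ln->data);
        fetch_block(addr & ~(uint64_t)(B - 1), ln->data);   /* write-allocate            */
        ln->valid = 1;
        ln->tag   = tag;
        ln->dirty = 0;
    }
    ln->data[offset] = value;                               /* update the cached copy     */
    ln->dirty = 1;                                          /* defer the write to eviction */
}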

Accordingly, programs should be written at a high level to exhibit good spatial and temporal locality, rather than being tuned for one particular memory system.

3.2 Analysis of real cache structure

A cache that holds only instructions is called an i-cache, a cache that holds only program data is called a d-cache, and a cache that holds both instructions and data is called a unified cache.

Advantages of i-cache:

  • Accelerate the acquisition of instructions: i-cache can cache the instructions that the CPU needs to execute, reducing the time to acquire instructions from the main memory, thereby improving the execution speed of the program.
  • Reduce instruction access conflicts: Since instructions are usually executed sequentially according to the execution order of the program, the access mode of i-cache is relatively regular, and access conflicts are not easy to occur.

Disadvantages of i-cache:

  • It can waste part of the cache capacity: since instructions are usually executed in program order, the instructions stored in the i-cache tend to be contiguous, so some cache lines may hold instructions that are never needed, wasting part of the cache capacity.
  • Cache invalidation has a large impact: if a cache line in the i-cache is invalidated, the CPU must re-fetch instructions from main memory, which causes a significant performance loss.

Advantages of d-cache:

  • Accelerated reading and writing of data: d-cache can cache the data that the CPU needs to read and write, reducing the time to obtain data from the main memory, thereby improving the execution speed of the program.
  • Support for multiple access modes: unlike the i-cache, the data stored in the d-cache may be accessed by multiple threads or processes at the same time, so the d-cache must support reads, writes, and mixed read/write access, which better suits the needs of multi-threaded programs.

Disadvantages of d-cache:

  • Prone to access conflicts: Since the access mode of data is relatively random, the access mode of d-cache is relatively irregular, which is prone to access conflicts, thus affecting the execution efficiency of the program.
  • Consistency problems may occur: Since d-cache stores data that needs to be modified by the CPU, if multiple threads or processes access the same piece of data at the same time, consistency problems may occur. In order to solve this problem, some synchronization mechanisms, such as locks or atomic operations, need to be adopted, which will increase the complexity and overhead of the program.

Below is the cache hierarchy of the Intel Core i7 processor, for reference (from Chapter 6 of "In-depth Understanding of Computer Systems").


The relevant data are summarized as follows:

| Cache type | Access time (cycles) | Cache size (C) | Associativity (E) | Block size (B) | Number of sets (S) |
| L1 i-cache | 4 | 32 KB | 8 | 64 B | 64 |
| L1 d-cache | 4 | 32 KB | 8 | 64 B | 64 |
| L2 unified cache | 10 | 256 KB | 8 | 64 B | 512 |
| L3 unified cache | 40~75 | 8 MB | 16 | 64 B | 8192 |
3.3 Performance Impact of Cache Parameters

The following are common metrics for measuring cache performance:

  • Miss rate: the fraction of memory references that miss in the cache, i.e. (number of misses) / (number of references).
  • Hit rate: the fraction of memory references that hit; it equals 1 - miss rate.
  • Hit time: the time required to deliver a word from the cache to the CPU, including set selection, line matching, and word extraction. For an L1 cache, the hit time is on the order of a few clock cycles.
  • Miss penalty: the extra time incurred by a miss. A miss served from L2 typically costs on the order of 10 cycles, from L3 around 50 cycles, and from main memory around 200 cycles.
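
These metrics combine into a rough estimate of average access time: average time ≈ hit time + miss rate × miss penalty. As a back-of-the-envelope example using the cycle counts above, assuming a 4-cycle L1 hit time, a hypothetical 5% L1 miss rate, and a 10-cycle penalty to L2, the average access time is roughly 4 + 0.05 × 10 = 4.5 cycles; even a small miss rate adds noticeably to the average.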

Cache size, block size, associativity, and write policy all affect cache performance, as follows:

  1. Cache Size : Cache size refers to the amount of data that the cache can store. The larger the cache size, the more data can be cached and the hit rate may be higher. However, cache size is also limited by factors such as manufacturing cost and power consumption.
  2. Block size : the block size is the amount of data stored in each cache block, and it affects both hit rate and access latency. Larger blocks exploit spatial locality better, but for a fixed cache size they mean fewer lines, so programs with more temporal than spatial locality may see lower hit rates, and each miss transfers more data, increasing the miss penalty. Smaller blocks reduce that transfer cost but increase the relative tag overhead and capture less spatial locality.
  3. Associativity : associativity is the number of cache lines per set in which a block may be placed, and it affects both hit rate and access latency. Lower associativity reduces access latency and hardware cost, because each lookup examines fewer lines, but it lowers the hit rate when multiple blocks map to the same set and conflict. Higher associativity improves hit rates by reducing conflicts, at the cost of higher access latency and more tag-comparison hardware.
  4. Write strategy : The write strategy refers to how the cache handles data updates when the CPU writes to the cache. Common write strategies include write-back and write-through. The write-back strategy means that when the CPU writes to the cache, the data is first written to the cache instead of being written to the main memory immediately. This reduces access to main memory and improves performance. But it also increases cache complexity and consistency issues. The write-through strategy means that when the CPU writes to the cache, the data will be written directly into the main memory. This can ensure data consistency, but it will increase access to main memory and reduce performance.
3.4 Writing cache-friendly code

Cache-friendly code must first have good locality. There are two core principles:

  • Let the most common cases run fast , ignore some trivial paths, and focus on the most core paths.
  • Minimize the number of cache misses inside each loop .

Writing cache-friendly code is an optimization technique aimed at maximizing the efficient use of cache memory, thereby improving program performance. Here are some practical ways to write cache-friendly code:

  1. Spatial locality : The cache is usually managed in units of cache blocks. Therefore, if the data access mode in the program is relatively local, that is, accessing adjacent data, the hit rate of the cache can be improved.
  2. Temporal locality : The data stored in the cache is usually the most recently accessed data. Therefore, if the data access pattern in the program is repetitive, that is, the same block of data is accessed multiple times, the cache hit rate can be improved.
  3. Data alignment : The cache is usually managed according to the size of the cache block. Therefore, if the data in the program is not aligned to the boundary of the cache block, it will cause access across the boundary of the cache block, thereby reducing the hit rate and performance of the cache.
  4. Loop unrolling : loop unrolling is an optimization that replicates the loop body to reduce the number of loop iterations, which can improve both spatial and temporal locality and thereby the cache hit rate and performance (a minimal sketch follows this list).
  5. Write cache-friendly algorithms : Some algorithms have good cache locality, such as algorithms such as matrix multiplication and convolution.
  6. Avoid cache pollution : Cache pollution refers to the storage of unnecessary data in the cache, thereby reducing the hit rate and performance of the cache. Techniques such as local variables and static variables can be used to reduce access to global variables and dynamically allocated memory to improve cache hit ratios and performance.
  7. Avoid cache conflicts : Cache conflicts refer to the mapping of multiple data into the same cache block, resulting in mutual replacement of data and reducing the hit rate and performance of the cache. Techniques such as different data structures and algorithms can be used to reduce data collisions, thereby improving cache hit rates and performance.
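
To illustrate item 4 above, here is a minimal sketch of 4-way loop unrolling for an array sum; note that modern compilers often perform this transformation automatically at higher optimization levels, so it is worth measuring before hand-unrolling:

/* 4-way unrolled sum with multiple accumulators (illustrative sketch). */
float sum_unrolled(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;

    for (i = 0; i + 4 <= n; i += 4) {   /* process 4 elements per iteration */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                  /* handle any leftover elements */
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
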
3.5 Blocking (tiling)

Blocking is an optimization technique used to improve temporal locality. Temporal locality describes a program's tendency to access the same data multiple times within a short period; exploiting it reduces the number of cache misses and thereby speeds up the program.

The core idea of blocking is to divide large data structures into smaller blocks that fit in the cache. When the program works on one of these smaller blocks, it accesses the same data repeatedly within a short period, increasing cache utilization. Blocking is used in many areas, such as matrix multiplication, image processing, and database query optimization.

The following takes matrix multiplication as an example to introduce how to use block technology to improve temporal locality.

Suppose we have two matrices A and B of size N x N respectively. We want to calculate the product C = A x B of these two matrices. The traditional matrix multiplication algorithm is as follows:

# Naive triple loop: B[k][j] walks down a column of B, so its accesses have poor locality.
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]

This implementation suffers from many cache misses because of the poor locality of its data accesses. To improve temporal locality, we can use blocking to divide matrices A, B, and C into smaller submatrices. Assuming a block size of blockSize x blockSize, the blocked matrix multiplication looks like this:

# Blocked version: work on blockSize x blockSize tiles that fit in the cache.
for i in range(0, N, blockSize):
    for j in range(0, N, blockSize):
        for k in range(0, N, blockSize):
            for ii in range(i, min(i + blockSize, N)):
                for jj in range(j, min(j + blockSize, N)):
                    for kk in range(k, min(k + blockSize, N)):
                        C[ii][jj] += A[ii][kk] * B[kk][jj]

In this way, we divide the large matrix into smaller sub-matrices and perform calculations between the sub-matrices. This improves cache utilization because the same data is accessed multiple times in a short period of time. At the same time, the block technology can adjust the size of blockSize according to hardware characteristics to adapt to different cache structures.
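
A rough rule of thumb for choosing blockSize, stated here as a hedged guideline rather than a fixed recipe: the three sub-blocks touched in the inner loops (one each from A, B, and C) should fit in the target cache together. For a hypothetical 32 KB L1 d-cache and 4-byte floats, 3 × blockSize² × 4 ≤ 32768 gives blockSize of roughly 52 or less, so powers of two such as 32 (or 64 when targeting a larger L2) are common starting points; the best value should still be found by measurement.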

It should be noted that the blocking technology does not always improve performance, it needs to be optimized according to the specific program and hardware environment. In practical applications, programmers need to have an in-depth understanding of the problems they are dealing with, and choose the appropriate optimization method according to the specific situation.


Origin blog.csdn.net/Once_day/article/details/130939543