Caching and Locality Principles

1 Overview

A well-written program tends to exhibit locality. Locality takes two forms: temporal locality and spatial locality. Temporal locality means that once a memory location has been referenced, it is very likely to be referenced again in the near future; spatial locality means that once a memory location has been referenced, a nearby memory location is likely to be referenced in the near future.
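As a small illustration of the two kinds of locality, here is a sketch in C. The function names sum_rows, sum_cols and fill, and the array dimensions, are made up for this example and are not part of the test program shown later. Both functions compute the same sum; only the order of memory accesses differs.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4
#define COLS 8

/* Fill the array so that a[i][j] = i * COLS + j (values 0..ROWS*COLS-1). */
void fill(long a[ROWS][COLS])
{
    assert(a != NULL);
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] = (long)(i * COLS + j);
}

/* Row-major traversal: consecutive reads touch adjacent addresses
 * (stride 1), so spatial locality is good. */
long sum_rows(long a[ROWS][COLS])
{
    assert(a != NULL);
    long sum = 0;   /* sum is reused on every iteration: temporal locality */
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: consecutive reads are COLS elements apart
 * (stride COLS), so spatial locality is poorer. */
long sum_cols(long a[ROWS][COLS])
{
    assert(a != NULL);
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

With an array this small the difference is not measurable; the point is only the access pattern. On arrays much larger than the cache, the row-major version would be expected to run faster.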

It is precisely because programs exhibit locality that computer designers add cache memory hardware to speed up a program's accesses to main memory. Caches are typically organized in three levels (L1, L2, L3); from one level to the next, capacity increases and access speed decreases. If the data is not found in any cache level (a miss), the program fetches it from main memory.

2 A real cache hierarchy

The figure below shows the cache hierarchy of the Intel Core i7 processor. Each CPU chip has four cores, and each core has its own private L1 i-cache (instruction cache), L1 d-cache (data cache) and L2 unified cache. All cores share the on-chip L3 unified cache.

An access to the L1 cache is almost as fast as a register access; the access time of the L2 cache is about 10 clock cycles, and that of the L3 cache is about 50 clock cycles.
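To get a feel for what these latencies mean, the following sketch estimates the average cost of one memory access from per-level hit rates. The L2 and L3 figures are the ones quoted above; the L1 figure (4 cycles) and the main memory figure (200 cycles) are assumptions chosen for illustration, and the function name avg_access_cycles is hypothetical. The model is deliberately simple: each latency is treated as the total cost of an access that hits at that level.

```c
#include <assert.h>

/* Approximate access times in clock cycles. */
#define L1_CYCLES   4.0    /* assumption, not stated in the text */
#define L2_CYCLES   10.0   /* from the text */
#define L3_CYCLES   50.0   /* from the text */
#define MEM_CYCLES  200.0  /* assumption, not stated in the text */

/* h1: fraction of all accesses that hit in L1
 * h2: fraction of L1 misses that hit in L2
 * h3: fraction of L2 misses that hit in L3
 * Returns the expected cost of one access, in cycles. */
double avg_access_cycles(double h1, double h2, double h3)
{
    assert(h1 >= 0.0 && h1 <= 1.0);
    assert(h2 >= 0.0 && h2 <= 1.0);
    assert(h3 >= 0.0 && h3 <= 1.0);

    double after_l3 = h3 * L3_CYCLES + (1.0 - h3) * MEM_CYCLES;
    double after_l2 = h2 * L2_CYCLES + (1.0 - h2) * after_l3;
    return h1 * L1_CYCLES + (1.0 - h1) * after_l2;
}
```

For example, under these assumed numbers, avg_access_cycles(0.9, 0.8, 0.5) comes out to roughly 6.9 cycles, which shows how strongly the average depends on the L1 hit rate.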

3 Impact of cache memory on program performance

3.1 Test procedure

The test program issues read requests in a loop; the measured read throughput then reveals the performance of each level of the memory hierarchy.

test() generates the read sequence by scanning the first elems elements of an array with a stride of stride.

The size and stride parameters of run() control the temporal and spatial locality of the generated read sequence. The smaller size is, the smaller the working set and the better the temporal locality; the smaller stride is, the smaller the scanning step and the better the spatial locality. run() calls test() and returns the measured read throughput.

/* mountain.c - Generate the memory mountain. */
/* $begin mountainmain */
#include <stdlib.h>
#include <stdio.h>
#include "fcyc2.h" /* measurement routines */
#include "clock.h" /* routines to access the cycle counter */

#define MINBYTES (1 << 14)  /* First working set size */
#define MAXBYTES (1 << 27)  /* Last working set size */
#define MAXSTRIDE 15        /* Stride x8 bytes */
#define MAXELEMS (MAXBYTES / sizeof(long))  /* Parenthesized so the macro expands safely */

/* $begin mountainfuns */
long data[MAXELEMS];      /* The global array we'll be traversing */

/* $end mountainfuns */
/* $end mountainmain */
void init_data(long *data, int n);
int test(int elems, int stride);
double run(int size, int stride, double Mhz);

/* $begin mountainmain */
int main()
{
    int size;        /* Working set size (in bytes) */
    int stride;      /* Stride (in array elements) */
    double Mhz;      /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data */
    Mhz = mhz(0);              /* Estimate the clock frequency */
/* $end mountainmain */
    /* Not shown in the text */
    printf("Clock frequency is approx. %.1f MHz\n", Mhz);
    printf("Memory mountain (MB/sec)\n");

    printf("\t");
    for (stride = 1; stride <= MAXSTRIDE; stride++)
	printf("s%d\t", stride);
    printf("\n");

 /* $begin mountainmain */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
/* $end mountainmain */
	/* Not shown in the text */
	if (size > (1 << 20))
	    printf("%dm\t", size / (1 << 20));
	else
	    printf("%dk\t", size / 1024);

/* $begin mountainmain */
        for (stride = 1; stride <= MAXSTRIDE; stride++) {
            printf("%.0f\t", run(size, stride, Mhz));
        }
	printf("\n");
    }
    exit(0);
}
/* $end mountainmain */

/* init_data - initializes the array */
void init_data(long *data, int n)
{
    int i;

    for (i = 0; i < n; i++)
	data[i] = i;
}

/* $begin mountainfuns */
/* test - Iterate over first "elems" elements of array "data" with
 *        stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride)
{
    long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i += stride) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}

/* run - Run test(elems, stride) and return read throughput (MB/s).
 *       "size" is in bytes, "stride" is in array elements, and Mhz is
 *       CPU clock frequency in Mhz.
 */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(long);         /* data[] holds long elements */

    test(elems, stride);                     /* Warm up the cache */       //line:mem:warmup
    cycles = fcyc2(test, elems, stride, 0);  /* Call test(elems,stride) */ //line:mem:fcyc
    return (size / stride) / (cycles / Mhz); /* Convert cycles to MB/s */  //line:mem:bwcompute
}
/* $end mountainfuns */
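The unit conversion in the last line of run() is easy to misread, so here is a worked sketch of it. Scanning a working set of size bytes with a given stride reads about size / stride bytes in total; cycles / Mhz is the elapsed time in microseconds; and bytes per microsecond is the same as MB/s (taking 1 MB as 10^6 bytes). The helper name read_throughput_mbs is hypothetical, and it takes the cycle count as a parameter instead of measuring it with fcyc2().

```c
#include <assert.h>

/* Mirror of the conversion in run(): bytes read divided by elapsed
 * microseconds gives MB/s. */
double read_throughput_mbs(int size, int stride, double cycles, double mhz)
{
    assert(size > 0 && stride > 0);
    assert(cycles > 0.0 && mhz > 0.0);

    double bytes_read = (double)(size / stride); /* same integer division as run() */
    double micros = cycles / mhz;                /* cycles / (cycles per microsecond) */
    return bytes_read / micros;
}
```

As an example with made-up measurements: a 32768-byte working set scanned with stride 8 reads 4096 bytes; if that takes 8192 cycles on a 2000 MHz clock, the elapsed time is 4.096 microseconds and the throughput is about 1000 MB/s.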

3.2 Test results

With stride fixed at 8 and size varied, the read throughput is as shown in the figure below:

  • size=32KB: the working set fits entirely in the L1 d-cache, with a peak throughput of 12GB/s;
  • size=256KB: the working set fits entirely in the L2 cache, with a throughput of 4GB/s;
  • size=8MB: the working set fits entirely in the L3 cache, with a throughput of 1GB/s.
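The three thresholds above can be written down as a small lookup. This is only a sketch: the capacities (32KB, 256KB, 8MB) are the ones quoted in the list, the enum and function names are hypothetical, and a real cache is shared with other data, so a working set exactly at a capacity limit will not necessarily stay resident.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest level of the hierarchy that can hold the whole working set. */
enum level { IN_L1 = 1, IN_L2, IN_L3, IN_MEMORY };

#define L1_BYTES (32UL * 1024)          /* L1 d-cache capacity from the text */
#define L2_BYTES (256UL * 1024)         /* L2 capacity from the text */
#define L3_BYTES (8UL * 1024 * 1024)    /* L3 capacity from the text */

enum level smallest_level_holding(size_t working_set_bytes)
{
    assert(working_set_bytes > 0);

    if (working_set_bytes <= L1_BYTES)
        return IN_L1;
    if (working_set_bytes <= L2_BYTES)
        return IN_L2;
    if (working_set_bytes <= L3_BYTES)
        return IN_L3;
    return IN_MEMORY;
}
```

Read against the results above, each step from one level to the next corresponds to a drop in read throughput.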

With size fixed at 4MB and stride varied, the read throughput is as shown in the figure below:

  • As stride increases from 1 to 8, the read throughput decreases steadily, because a growing fraction of reads miss in L2;
  • Once stride reaches 8 elements (8 × 8 bytes = 64 bytes, the cache block size on this system), every read request misses in L2 and must be served from the L3 cache.
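The effect of stride on the miss rate can be estimated with a little arithmetic. The sketch below assumes 8-byte array elements (sizeof(long) == 8, as on x86-64 Linux), the 64-byte block size mentioned above, and a sequential scan over a working set too large to stay in the cache level being considered. The function name expected_miss_fraction is hypothetical. Under those assumptions each block is fetched once and then serves 64 / (stride × 8) reads, so the miss fraction is stride × 8 / 64, capped at 1.

```c
#include <assert.h>

#define BLOCK_BYTES 64   /* cache block size stated in the text */
#define ELEM_BYTES  8    /* assumes sizeof(long) == 8 */

/* Estimated fraction of reads that miss when scanning with the given
 * stride (in array elements). For strides that do not divide the block
 * evenly this is an average, not an exact per-block count. */
double expected_miss_fraction(int stride)
{
    assert(stride >= 1);

    long step_bytes = (long)stride * ELEM_BYTES;
    if (step_bytes >= BLOCK_BYTES)
        return 1.0;                      /* every read lands in a new block */
    return (double)step_bytes / BLOCK_BYTES;
}
```

This gives 1/8 for stride 1, 1/2 for stride 4, and 1 for stride 8 and above, which matches the shape described in the list: throughput falls as stride grows to 8 and then levels off.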

4 Appendix

All of the above is drawn from "In-depth Understanding of Computer Systems" (the Chinese edition of "Computer Systems: A Programmer's Perspective").

In-depth understanding of computer systems (3rd edition).pdf https://www.aliyundrive.com/s/deuwEqV81Z1

Origin blog.csdn.net/qq_49588762/article/details/128932700