高速缓存与局部性原理

1 概述

一个编写良好的程序常常具有局部性原理。这儿的局部性表现为两方面：时间局部性和空间局部性。时间局部性即为，如果一个内存位置被引用，则在不久的将来，它很有可能将被再次引用；空间局部性即为，如果一个内存位置被引用，则在不久的将来，它附近的内存位置很可能被引用。

正是因为程序局部性原理，计算机设计者增加了高速缓存存储器这个硬件，从而提高程序对主存的访问速度。一般来说，高速缓存设计为3层，容量依次增大，访问速度依次减小。如果没有在高速缓冲中命中数据，程序再访问主存获取数据。

2 真实的高速缓存

下图给出了Intel Core i7处理器的高速缓存层次结构。每个CPU芯片有四个核，每个核有自己私有的L1 i-cache(指令cache)、L1 d-cache(数据cache)和L2统一的高速缓存。所有的核片共享片上L3统一的高速缓存。

L1高速缓存的访问速度几乎和寄存器一样快；L2高速缓存的访问时间大约为10个时钟周期；L3高速缓存的访问时间大约为50个时钟周期。

3 高速缓存对程序性能的影响

3.1 测试程序

编写一个程序，这个程序通过循环语句发出读请求，那么，测量出的读吞吐量可以显示出高速缓存的存储性能。

test()通过以步长stride扫描一个数组的头elems个元素来产生读序列。

run()的参数size和stride允许控制产生的读序列的时间和空间局部性的程序。size越小，得到的工作集越小，程序的时间局部性越好；stride越小，扫描的步长越小，程序的空间局部性越好。run()调用test()，并返回测量出的读吞吐量。

/* mountain.c - Generate the memory mountain. */
/* $begin mountainmain */
#include <stdlib.h>
#include <stdio.h>
#include "fcyc2.h" /* measurement routines */
#include "clock.h" /* routines to access the cycle counter */

#define MINBYTES (1 << 14)  /* First working set size */
#define MAXBYTES (1 << 27)  /* Last working set size */
#define MAXSTRIDE 15        /* Stride x8 bytes */
#define MAXELEMS MAXBYTES/sizeof(long) 

/* $begin mountainfuns */
long data[MAXELEMS];      /* The global array we'll be traversing */

/* $end mountainfuns */
/* $end mountainmain */
void init_data(long *data, int n);
int test(int elems, int stride);
double run(int size, int stride, double Mhz);

/* $begin mountainmain */
int main()
{
    
    
    int size;        /* Working set size (in bytes) */
    int stride;      /* Stride (in array elements) */
    double Mhz;      /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data */
    Mhz = mhz(0);              /* Estimate the clock frequency */
/* $end mountainmain */
    /* Not shown in the text */
    printf("Clock frequency is approx. %.1f MHz\n", Mhz);
    printf("Memory mountain (MB/sec)\n");

    printf("\t");
    for (stride = 1; stride <= MAXSTRIDE; stride++)
	printf("s%d\t", stride);
    printf("\n");

 /* $begin mountainmain */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
    
    
/* $end mountainmain */
	/* Not shown in the text */
	if (size > (1 << 20))
	    printf("%dm\t", size / (1 << 20));
	else
	    printf("%dk\t", size / 1024);

/* $begin mountainmain */
	for (stride = 1; stride <= MAXSTRIDE; stride++) {
    
    
	    printf("%.0f\t", run(size, stride, Mhz));
	    
	}
	printf("\n");
    }
    exit(0);
}
/* $end mountainmain */

/* init_data - initializes the array */
void init_data(long *data, int n)
{
    
    
    int i;

    for (i = 0; i < n; i++)
	data[i] = i;
}

/* $begin mountainfuns */
/* test - Iterate over first "elems" elements of array "data" with
 *        stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride)
{
    
    
    long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
    
    
	acc0 = acc0 + data[i];     
        acc1 = acc1 + data[i+stride];
	acc2 = acc2 + data[i+sx2]; 
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i += stride) {
    
    
	acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}

/* run - Run test(elems, stride) and return read throughput (MB/s).
 *       "size" is in bytes, "stride" is in array elements, and Mhz is
 *       CPU clock frequency in Mhz.
 */
double run(int size, int stride, double Mhz)
{
    
       
    double cycles;
    int elems = size / sizeof(double);       

    test(elems, stride);                     /* Warm up the cache */       //line:mem:warmup
    cycles = fcyc2(test, elems, stride, 0);  /* Call test(elems,stride) */ //line:mem:fcyc
    return (size / stride) / (cycles / Mhz); /* Convert cycles to MB/s */  //line:mem:bwcompute
}
/* $end mountainfuns */

3.2 测试结果

取stride=8，改变size的大小，获取的读吞吐量如下图：

size=32KB的工作集可以完全放入L1 d-cache，吞吐量为峰值12GB/s；
size=256KB的工作集可以完全放入L2 cache，吞吐量为4GB/s；
size=8MB的工作集可以完全放入L3 cache，吞吐量为1GB/s。

取size=4MB，改变stride的大小，获取的读吞吐量如下图：

随着stride由1增长到8，读吞吐量下降；
一旦stride增长到8，在这个系统上就等于一个块的大小64字节，每个读请求在L2中均不会命中，必须查询L3缓存。

4 附

以上内容均参考自《深入理解计算机系统》。

深入理解计算机系统(第3版).pdf https://www.aliyundrive.com/s/deuwEqV81Z1