CSAPP Reading Notes: Programming for the Cache

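In the early days of computing, the CPU talked to main memory directly. CPU speeds then grew far faster than memory speeds and left main memory well behind. The solution that hardware designers settled on was to insert a cache between the CPU and main memory, bridging a speed gap of several orders of magnitude. As a result, the memory hierarchy of a modern computer looks like the figure below, which is taken from CSAPP.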

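When the CPU needs to read data, it first accesses the L1 cache. On a miss, it requests the data from the L2 cache. If L2 also misses, the request moves further down the hierarchy until it reaches main memory. The L1 cache runs at close to CPU speed. The L2 cache is somewhat slower, but still far faster than main memory. I used AIDA64 to measure the read and write speeds of the caches and main memory on my machine, as shown below: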

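The measurements show that, for reads, the L1 cache is nearly 180 times faster than main memory. I used CPU-Z to inspect the cache parameters of my machine, shown below: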

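The L1 cache is only 32 KB per physical core. "8-way set associative, 64-byte line size" means that the cache uses an 8-way set-associative organization: each cache line is 64 bytes and each set has 8 lines. 32 * 1024 / (8 * 64) = 64, so the L1 cache on this machine has only 64 cache sets.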
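The arithmetic above can be sketched in a few lines of Java. The class and method names here (CacheGeometry, numSets, setIndex, blockOffset) are my own illustration, and the address-to-set mapping assumes the simplest textbook scheme, where the set is the line number modulo the number of sets. Real hardware may index differently.

```java
// Illustrative only: a textbook model of a set-associative cache layout.
final class CacheGeometry {
    // Number of sets = capacity / (ways * line size).
    static int numSets(int capacityBytes, int ways, int lineBytes) {
        return capacityBytes / (ways * lineBytes);
    }

    // Byte offset of an address inside its cache line.
    static long blockOffset(long address, int lineBytes) {
        return address % lineBytes;
    }

    // Set an address maps to, assuming simple modulo indexing (an assumption).
    static long setIndex(long address, int lineBytes, int sets) {
        return (address / lineBytes) % sets;
    }

    public static void main(String[] args) {
        int sets = numSets(32 * 1024, 8, 64);
        System.out.println("sets = " + sets);                               // sets = 64
        long addr = 0x12345L;                                               // arbitrary example address
        System.out.println("set index = " + setIndex(addr, 64, sets));      // set index = 13
        System.out.println("block offset = " + blockOffset(addr, 64));      // block offset = 5
    }
}
```

With 64 sets of 64-byte lines, addresses that are a multiple of 4096 bytes apart land in the same set under this model, which is why large power-of-two strides can make a cache behave as if it were much smaller.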

Each cache line holds 64 bytes that are adjacent in memory. Code with good spatial locality therefore gets the best performance.
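To make the effect of spatial locality concrete, here is a small sketch that counts how many distinct 64-byte lines are touched when reading the same number of ints at different strides. The class name LinesTouched is hypothetical, and the code assumes the data starts at a line-aligned address, which the JVM does not guarantee for real arrays.

```java
// Illustrative only: counts cache lines touched by a strided read of int values.
final class LinesTouched {
    static final int LINE_BYTES = 64;
    static final int INT_BYTES = Integer.BYTES; // 4 bytes per int in Java

    // Distinct lines touched when reading `count` ints spaced `stride` elements
    // apart, starting at byte offset 0 of a line (alignment is an assumption).
    static long linesTouched(int count, int stride) {
        long lines = 0;
        long lastLine = -1;
        for (int k = 0; k < count; k++) {
            long line = ((long) k * stride * INT_BYTES) / LINE_BYTES;
            if (line != lastLine) { // offsets only grow, so a change means a new line
                lines++;
                lastLine = line;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(linesTouched(1024, 1));  // 64
        System.out.println(linesTouched(1024, 4));  // 256
        System.out.println(linesTouched(1024, 16)); // 1024
    }
}
```

Reading 1024 ints with stride 1 pulls in 64 lines, while stride 16 pulls in 1024 lines for the same amount of useful data, so every single read needs a fresh line.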

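A 64-byte line holds 16 int values (4 bytes each). A set has 8 lines, so one set holds 16 * 8 = 128 ints. Now let's access a two-dimensional array in two ways, once by row and once by column, and compare the performance.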

package com.jd.lvsheng;

import java.util.Random;

/**
 * Created by cdlvsheng on 2016/6/6.
 */
public class CacheContrast {
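	// 131072 rows x 512 columns of int: about 256 MB of data, far more than the caches described above.
	// The JVM may need a larger heap (for example -Xmx1g) to allocate it.
	int     ROW    = 1024 * 128;
	int     COL    = 16 * 32;
	int[][] matrix = new int[ROW][COL];
	Random  rand   = new Random();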

	static String row = "addByRow";
	static String col = "addByCol";

	void init() {
		for (int i = 0; i < ROW; i++)
			for (int j = 0; j < COL; j++)
				matrix[i][j] = rand.nextInt(100);
	}

	// Row-major traversal: the inner loop walks one row with stride 1.
	public void addByRow() {
		long sum = 0;
		for (int i = 0; i < ROW; i++)
			for (int j = 0; j < COL; j++)
				sum += matrix[i][j];

		System.out.println(row + " result:\t" + sum);
	}

	// Column-major traversal: the inner loop jumps to a different row on every access.
	public void addByCol() {
		long sum = 0;
		for (int i = 0; i < COL; i++)
			for (int j = 0; j < ROW; j++)
				sum += matrix[j][i];

		System.out.println(col + " result:\t" + sum);
	}

	public static void main(String[] args) {
		CacheContrast cc = new CacheContrast();
		cc.init();
		// Single-shot wall-clock timing without JIT warm-up: treat the numbers as a rough comparison.
		long start = System.currentTimeMillis();
		cc.addByRow();
		long end1      = System.currentTimeMillis();
		long costOfRow = end1 - start;
		System.out.println(row + " time cost :\t" + costOfRow);
		cc.addByCol();
		long end2      = System.currentTimeMillis();
		long costOfCol = end2 - end1;
		System.out.println(col + " time cost :\t" + costOfCol);
		System.out.println("performance ratio :\t" + (costOfCol * 1.0) / costOfRow);
	}
}

Output

addByRow result:	3322121552
addByRow time cost :	133
addByCol result:	3322121552
addByCol time cost :	1406
performance ratio :	10.571428571428571

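Column-wise access took more than 10 times as long as row-wise access. Each row holds 512 ints (2 KB), so consecutive accesses in the column-wise loop are at least 2 KB apart and each one lands in a different cache line, whereas the row-wise loop uses all 16 ints of every line it loads. Spatial locality is critical to cache performance. To write cache-friendly code, prefer a stride-1 reference pattern when looping over arrays.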
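One way to see why the gap is so large is a toy cache simulation that counts misses instead of measuring time. The sketch below is my own illustration, not something from CSAPP or from the benchmark above: the class CacheMissSim is hypothetical, it models only a single 32 KB, 8-way, 64-byte-line cache with LRU replacement, and it assumes the matrix is one flat row-major block, whereas a real Java int[][] is an array of separately allocated rows. It also uses 1024 rows rather than 131072 to keep the run short.

```java
// Illustrative only: toy LRU model of a 32 KB, 8-way, 64-byte-line cache.
final class CacheMissSim {
    static final int LINE_BYTES = 64;
    static final int WAYS = 8;
    static final int SETS = 64;

    private final long[][] tags = new long[SETS][WAYS];
    private final long[][] lastUsed = new long[SETS][WAYS]; // 0 means the way is empty
    private long clock = 0;
    private long misses = 0;

    // Simulate one read of the byte at `address`.
    void access(long address) {
        long line = address / LINE_BYTES;
        int set = (int) (line % SETS); // simple modulo indexing (an assumption)
        clock++;
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (lastUsed[set][w] != 0 && tags[set][w] == line) {
                lastUsed[set][w] = clock; // hit: refresh the LRU timestamp
                return;
            }
            if (lastUsed[set][w] < lastUsed[set][victim]) victim = w;
        }
        misses++; // miss: fill an empty way or evict the least recently used line
        tags[set][victim] = line;
        lastUsed[set][victim] = clock;
    }

    // Count misses for a full traversal of a rows x cols int matrix,
    // laid out as one flat row-major block (an assumption, see above).
    static long countMisses(int rows, int cols, boolean byRow) {
        CacheMissSim cache = new CacheMissSim();
        int outer = byRow ? rows : cols;
        int inner = byRow ? cols : rows;
        for (int a = 0; a < outer; a++) {
            for (int b = 0; b < inner; b++) {
                int i = byRow ? a : b; // row index
                int j = byRow ? b : a; // column index
                cache.access(4L * ((long) i * cols + j));
            }
        }
        return cache.misses;
    }

    public static void main(String[] args) {
        System.out.println("row-major misses:    " + countMisses(1024, 512, true));  // 32768
        System.out.println("column-major misses: " + countMisses(1024, 512, false)); // 524288
    }
}
```

In this model the row-wise traversal misses once per 16 reads, while the column-wise traversal misses on every read, because each column walk cycles through far more lines per set than the 8 ways can hold. The measured slowdown on real hardware is smaller than this 16x miss ratio, which is expected: the model ignores L2, L3, hardware prefetching, and everything else that affects wall-clock time.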
