Digging into Java Multithreading (5): Understanding How the CPU Cache Works

We have said that the Java Memory Model is a language-level memory model abstraction: it shields the memory-consistency differences of the underlying hardware and provides the upper layers with a unified programming interface for ensuring memory consistency.

In this consistency problem domain, the roles of the different layers are as follows:

  1. Consistency model layer: defines the rules of each consistency model.
  2. Hardware layer: provides the hardware capability to implement certain consistency models. By default the hardware runs according to its basic rules, for example:
    instructions in the same thread that have no data dependency may be reordered for optimization, while data-dependent instructions execute in program order, so that a single-threaded program is guaranteed to read the data most recently written to the same location.
  3. Language layer: a few programming languages provide consistency capabilities that satisfy a language-level consistency model, while the rest simply expose the consistency capabilities of the hardware layer. A language that provides its own consistency capability works roughly like this:
    it specifies rules that programs can use to meet their consistency requirements, such as volatile, synchronized, and the happens-before rules; when the application layer needs this capability it must request it explicitly, for example by declaring a variable volatile; and the compiler adapts these consistency primitives to the various hardware platforms, for example some platforms use memory barriers while others use read-modify-write instructions, with the language layer shielding these differences.
  4. Application layer: for example distributed systems or concurrent server programs, which handle consistency roughly like this:
    they define the consistency requirements the application actually needs; they define and select algorithms that satisfy those requirements, such as Paxos, Zab, or multi-phase commit protocols for message-based distributed consistency; and they build on the basic consistency primitives provided by the programming language to implement those algorithms.

Having said all this about consistency requirements, the question is: why does this demand for memory consistency exist in the first place?

The need for memory consistency arises mainly from the emergence of multi-core CPUs and multiple levels of cache, which complicate concurrent memory reads and writes and thus create memory consistency problems.

So the cache is a major cause of memory consistency problems. Many articles about the Java memory model say that the CPU performs write operations through a write buffer, so a write cannot be flushed back to main memory immediately and other threads cannot see the newly written value; this is the so-called visibility problem. And because the write buffer performs lazy writes, the CPU may not have flushed a write to memory before a subsequent read begins, which also produces reordering effects, the so-called ordering problem.
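To make the visibility problem concrete in Java terms, here is a minimal sketch (my own example, not from the original article; the class and field names are made up). Without volatile, the Java memory model does not guarantee that the reader thread ever sees the write, whether the stale value comes from a hardware cache/write buffer or from JIT optimizations the model permits:

```java
// A minimal sketch of the visibility problem described above.
// Whether it actually hangs depends on the JIT and the hardware.
public class VisibilityDemo {
    static boolean stop = false;   // deliberately NOT volatile

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait; may spin forever, because the reader is allowed
                // to keep using a stale cached/register copy of `stop`
            }
            System.out.println("reader saw stop = true");
        });
        reader.start();
        Thread.sleep(100);
        stop = true;               // this write may never become visible to the reader
        System.out.println("writer set stop = true");
    }
}
```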

This article describes how the CPU cache works and what the write buffer actually is. I am not a hardware specialist, and some points are based on my own understanding; if anything is wrong, please consult further references.

First look at a figure: this is the conceptual model of the Java memory model, in which working memory is an abstraction of the CPU's registers and caches.
[Figure: conceptual model of the Java memory model]
Now look at another figure, taken from "Computer Systems: A Programmer's Perspective", which shows the conceptual model of the caches in an Intel Core i7 processor.
[Figure: Intel Core i7 cache hierarchy, from "Computer Systems: A Programmer's Perspective"]
Comparing these two figures, we can see that each thread's working memory in the Java memory model is really an abstraction of registers and caches. In current mainstream multi-core processor designs, each core generally has its own L1 and L2 cache, while multiple cores share an L3 cache. The cores are connected to each other and to main memory through the system bus, which consists of a data bus, an address bus, and a control bus, collectively called the system bus. What we must remember is that the bus is a shared resource: if it is used unreasonably, for example if a cache coherence protocol causes a storm of bus traffic, program execution efficiency will suffer.

This figure shows some parameters of the various cache levels. A few key points:

  1. CPU registers interact directly only with the L1 cache.
  2. A modern L1 cache is split into two physically separate blocks:
    the i-cache, which stores instructions and is read-only,
    and the d-cache, which stores data and is writable.
  3. The L2 and L3 caches store both instructions and data.
  4. Note the cache sizes: on the Core i7, the L1 cache is 64KB, the L2 cache is 256KB, and the L3 cache is 8MB.
  5. The cache is divided into blocks, and blocks are grouped into sets.
  6. An L1 access takes about 4 clock cycles; an L2 access takes about 3 times that, and an L3 access about 3 times L2.
  7. A main-memory access takes roughly 3 times as many clock cycles as L3, about two orders of magnitude more than L1.
  8. A (conventional spinning) hard-disk access is on the order of 1-10 ms, about four orders of magnitude slower than a memory access and more than six orders of magnitude slower than a cache access.
  9. An SSD access is on the order of 10-100 microseconds, roughly 1-2 orders of magnitude faster than a regular hard disk and 2-3 orders of magnitude slower than a memory access.

Whenever we talk about caches, we have to mention the principle of locality (Principle of Locality), which is the theoretical foundation of caching. Locality comes in two forms:

  1. Temporal locality: in a program with good temporal locality, a memory location that has been referenced once is likely to be referenced again multiple times in the near future.
  2. Spatial locality: in a program with good spatial locality, if a memory location is referenced once, the program is likely to reference a nearby memory location in the near future.

We know that a 64-bit machine reads 64 bits of memory data at a time, i.e., 8 bytes from 8 consecutive memory locations, so the cache also holds data from consecutive locations; this is locality at work.

Some programming guidance derived from locality:

  1. Repeatedly referencing the same variable gives good temporal locality.
  2. For a program with a stride-k reference pattern, the shorter the stride, the better the spatial locality. This matters especially when operating on arrays and multi-dimensional arrays, where locality has a large impact (see the sketch after this list).
  3. For instruction fetch, loops have good temporal and spatial locality: the smaller the loop body and the larger the number of iterations, the better the locality.
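As a rough illustration of point 2 (my own sketch; the class name, matrix size, and timing approach are arbitrary), the following Java code compares a stride-1 row-major traversal with a large-stride column-major traversal of the same matrix. On typical hardware the row-major version is noticeably faster, though the exact numbers depend on the JIT and the machine's cache sizes:

```java
// A minimal sketch comparing stride-1 (row-major) and stride-N (column-major)
// traversal of the same matrix. Timings are rough, not a proper benchmark.
public class LocalityDemo {
    static final int N = 2048;
    static final int[][] matrix = new int[N][N];

    static long sumRowMajor() {          // stride-1: good spatial locality
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += matrix[i][j];
        return sum;
    }

    static long sumColumnMajor() {       // large stride: poor spatial locality
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += matrix[i][j];
        return sum;
    }

    public static void main(String[] args) {
        long t1 = System.nanoTime();
        sumRowMajor();
        long t2 = System.nanoTime();
        sumColumnMajor();
        long t3 = System.nanoTime();
        System.out.printf("row-major: %d ms, column-major: %d ms%n",
                (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
    }
}
```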

Next, let's look at the memory hierarchy.
[Figure: the memory hierarchy]
A few key points:

  1. The higher the level, the smaller the capacity, the faster the access, and the higher the cost, and vice versa.
  2. Each level of storage interacts only with the level directly below it; there is no cross-level access.
  3. Each level acts as a cache for the level below it. Data the CPU accesses generally passes through main memory, which serves as a cache for the other devices below it; data from those devices must first be brought into main memory before the CPU can access it. For a disk file read, for example, the CPU only issues the request; the data transfer itself does not go through the CPU but is handled by DMA (Direct Memory Access), which moves data between the I/O device and main memory, and when the transfer completes the I/O device raises an interrupt to notify the CPU.
  4. Each cache level needs a manager, for example to divide the cache into blocks, transfer blocks between levels, and determine hits and misses. The manager can be hardware, software, or a combination of the two; the CPU caches, for instance, are managed entirely by hardware built into the cache.

Now let's get to the main topic of how the cache works. First look at the basic structure of a cache:

  1. The cache is divided into S cache sets.
  2. Each set contains E cache lines; the number of lines E is also called the associativity of the cache.
  3. Each line has a valid bit indicating whether the line holds valid data (and, in a write-back cache, a dirty bit), t tag bits that identify which memory block the line holds, and one B-byte cache block. A line holds exactly one block.
  4. The cache size C = B * E * S, counting only the data bytes, not the valid or tag bits.
  5. A cache can be described by the 4-tuple (S, E, B, m), where m is the machine's address width in bits. For the Core i7 L1 d-cache, S = 64, E = 8, B = 64, m = 64, i.e., (64, 8, 64, 64).

So the L1 d-cache size is 32KB = 64 bytes (block size) * 8 (lines per set) * 64 (sets).
[Figure: general organization of a cache (S sets, E lines per set, B-byte blocks)]
First let's see how the cache locates a target memory address in the current cache and achieves a read hit. This happens in three steps:

  1. Set selection
  2. Line matching
  3. Word extraction

This lookup process is somewhat like a hash operation: an m-bit memory address is mapped to a set index (s bits), a line tag (t bits), and a block offset (b bits).
[Figure: an m-bit address split into tag (t bits), set index (s bits), and block offset (b bits)]
Again taking the Core i7 L1 cache (64, 8, 64, 64) as an example, given a 64-bit memory address:

  1. Set selection: there are 64 sets, so s = 6 bits of the 64-bit address (000000-111111) are used to encode the 64 set numbers; these s bits locate the set.
  2. Line matching: each set has 8 lines. A 64-byte block gives b = 6, so t = m - (b + s) = 64 - 12 = 52; that is, the high 52 bits of the address form the tag. This tag is compared against the tag bits of the 8 lines in the set; if a valid line matches, it is a hit, otherwise a miss.
  3. If it is a hit, the low b bits of the address give the offset of the target within the block. The block can be viewed as a byte array: a 64-byte block has offsets block[0]…block[63], and the low b bits select the offset, locating the data (the sketch below shows this bit slicing in code).
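The bit slicing described above can be sketched in a few lines of Java (my own illustration; the class name and the example address are made up, and the S, B, m values are the Core i7 L1 figures quoted above):

```java
// A minimal sketch of decomposing an address for a cache described by (S, E, B, m).
public class CacheAddressDemo {
    static final int S = 64;                                // number of sets
    static final int B = 64;                                // block size in bytes
    static final int s = Integer.numberOfTrailingZeros(S);  // 6 set-index bits
    static final int b = Integer.numberOfTrailingZeros(B);  // 6 offset bits
    static final int m = 64;                                // address width in bits
    static final int t = m - s - b;                         // 52 tag bits

    public static void main(String[] args) {
        long address = 0x7ffe_1234_5678L;           // an arbitrary example address
        long blockOffset = address & (B - 1);       // low b bits
        long setIndex = (address >>> b) & (S - 1);  // next s bits
        long tag = address >>> (b + s);             // high t bits
        System.out.printf("t=%d bits, tag=%d, set=%d, offset=%d%n",
                t, tag, setIndex, blockOffset);
    }
}
```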

For example, take an int array of 32 elements, int[32]. A block is 64 bytes, which holds exactly 16 int values, so int[0] through int[15] fall into one cache block and int[16] through int[31] into the next. When int[0] is first accessed there is a miss, and the whole block is loaded from the next level of storage, bringing int[0] through int[15] into the cache, so subsequent accesses to int[1] through int[15] all hit. Accessing int[16] misses again, and int[16] through int[31] are likewise loaded from the next level as another block, so subsequent accesses to int[16] through int[31] all hit.
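A tiny sketch of that grouping (mine; it assumes the array data starts at a 64-byte-aligned address, which a JVM does not guarantee):

```java
// Which 64-byte cache block does each element of int[32] fall into?
public class ArrayBlockDemo {
    public static void main(String[] args) {
        final int blockSize = 64;   // bytes per cache block
        final int intSize = 4;      // bytes per int
        for (int i = 0; i < 32; i++) {
            int block = (i * intSize) / blockSize;  // 0 for i = 0..15, 1 for i = 16..31
            System.out.println("int[" + i + "] -> block " + block);
        }
    }
}
```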

Caches come in three kinds: direct-mapped caches, E-way set-associative caches, and fully associative caches. In a direct-mapped cache each set has only one line, so once the set is located you immediately know whether it is a hit. A fully associative cache is the opposite: there is only one set, and a hit is determined solely by matching the t-bit tag.

An E-way set-associative cache is the compromise. For example, the Core i7 L1 d-cache is an 8-way set-associative cache: each set has 8 lines, so after locating the set you still need to match the tag against the 8 lines in the set to determine whether it is a hit.

In cache terminology, a hit means the target address's data was found in the current cache, and a miss means it was not.
Combined with reads and writes, there are four cases:

  1. Read hit
  2. Read miss
  3. Write hit
  4. Write miss

Now that we know how a memory address is mapped to a cache block, let's analyze how each of these four cases behaves.

Read hit
The simplest case: following the steps of set selection, line matching, and word extraction, the matching data is returned.

Read miss
On a read miss, the corresponding data must be loaded from the next level of storage into a cache line; note that the entire cache block is replaced by a new block. Replacement is the more complicated part, because the cache must decide which line to evict. The most common policy is LRU (least recently used), which evicts the line whose most recent access is furthest in the past; the requested data is then returned from the newly loaded block. (A small LRU sketch follows below.)
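As a software analogy for the LRU policy (my own sketch; real caches implement LRU, or an approximation of it, directly in hardware), Java's LinkedHashMap in access order expresses the eviction rule in a few lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal LRU "set" with a fixed number of entries (think: E lines per set).
public class LruSet<K, V> extends LinkedHashMap<K, V> {
    private final int lines;

    public LruSet(int lines) {
        super(16, 0.75f, true);   // accessOrder = true: iteration order = LRU order
        this.lines = lines;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > lines;    // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruSet<Long, String> set = new LruSet<>(2);   // a 2-way "set"
        set.put(0x1L, "block A");
        set.put(0x2L, "block B");
        set.get(0x1L);                 // touch A, so B becomes least recently used
        set.put(0x3L, "block C");      // evicts B
        System.out.println(set.keySet());  // prints [1, 3]
    }
}
```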

Writes are more complicated, and this is the source of the often-mentioned CPU "lazy write". There are two ways for the CPU to write to the cache:

  1. Write-through: this approach writes both the cache and memory.
  2. Write-back: this approach writes only the cache and marks the corresponding cache line as dirty; as mentioned earlier, each cache line carries status bits, and the dirty bit records that the line has been modified but not yet written back. Only when this dirty line is about to be evicted and replaced is it written back to memory.

On a write hit, write-through writes both the cache and memory, so every write generates bus traffic; write-back writes only the cache and generates no bus traffic.
On a write miss, there are two approaches: write-allocate and no-write-allocate. Write-allocate loads the corresponding block from the next level of storage into the cache and then updates the cache block; no-write-allocate bypasses the cache and writes directly to main memory. In general, write-back is paired with write-allocate, and write-through with no-write-allocate. (A rough simulation of both pairings is sketched below.)
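To make the two pairings concrete, here is a deliberately simplified, single-line simulation (my own sketch, not a hardware model); it only shows when main memory actually gets written under each policy:

```java
// A toy single-line "cache" illustrating the two write-policy pairs above.
public class WritePolicyDemo {
    static long mainMemory = 0;        // the single memory word we model
    static Long cachedValue = null;    // null = not cached
    static boolean dirty = false;

    // write-back + write-allocate: load into the cache on a miss, delay the memory write
    static void writeBack(long value) {
        if (cachedValue == null) cachedValue = mainMemory;  // write-allocate: fetch block
        cachedValue = value;
        dirty = true;                  // memory is now stale until eviction
    }

    static void evict() {              // only here does write-back touch memory
        if (dirty) { mainMemory = cachedValue; dirty = false; }
        cachedValue = null;
    }

    // write-through + no-write-allocate: on a miss, bypass the cache entirely
    static void writeThrough(long value) {
        if (cachedValue != null) cachedValue = value;  // keep the cache consistent on a hit
        mainMemory = value;            // every write reaches memory immediately
    }

    public static void main(String[] args) {
        writeBack(42);
        System.out.println("after write-back:    memory=" + mainMemory);  // still 0
        evict();
        System.out.println("after eviction:      memory=" + mainMemory);  // 42
        writeThrough(7);
        System.out.println("after write-through: memory=" + mainMemory);  // 7
    }
}
```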

Let's compare the characteristics of write-through and write-back:

write-through: every write goes to memory, which generates bus traffic, so performance is lower; the advantages are that writes take effect immediately and data is not lost on power failure.

write-back: it exploits locality fully, and a dirty cache line can still be read immediately by subsequent reads, so performance is higher; the drawbacks are weaker immediacy and possible data loss if a failure occurs.

Today CPU caches basically all use write-back, although which write policy the CPU uses can often be configured through the BIOS or operating-system kernel parameters.

The following two figures from Wikipedia make the write-through and write-back flows clear.
[Figure: write-through flow (from Wikipedia)]

[Figure: write-back flow (from Wikipedia)]

So what exactly is the write buffer that people so often mention? The write buffer is used with write-through to buffer the data being written back to main memory. We know a memory write takes on the order of 100 ns; the CPU does not wait for the write to reach memory before continuing with subsequent instructions. Instead, it puts the data to be written into the write buffer and then goes on executing the following instructions. You can think of this as an asynchronous write that optimizes write-through performance. If the write buffer is full, subsequent writes have to wait until a slot in the write buffer becomes free.

This is easy to understand through the general concept of a buffer, a conventional way to adapt the different speeds of two components, such as BufferedWriter in Java I/O or the queue in the producer-consumer pattern, both of which can greatly improve system performance.
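As a loose software analogy (mine, not the article's), a bounded queue between a fast producer and a slow consumer behaves much like a write buffer: the producer keeps going until the buffer fills up, and only then has to wait:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A loose analogy for a write buffer: a bounded queue between a fast writer
// and a slow "memory". put() only blocks when the buffer is full.
public class WriteBufferAnalogy {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(4);

        Thread slowMemory = new Thread(() -> {
            try {
                while (true) {
                    int value = buffer.take();     // drain one entry
                    Thread.sleep(100);             // pretend memory is slow
                    System.out.println("flushed " + value);
                }
            } catch (InterruptedException ignored) { }
        });
        slowMemory.setDaemon(true);
        slowMemory.start();

        for (int i = 0; i < 10; i++) {
            buffer.put(i);                         // blocks only when the buffer is full
            System.out.println("buffered " + i);
        }
        Thread.sleep(1500);                        // let the consumer drain
    }
}
```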

As we can see, whether write-through or write-back is used, the existence of the cache and the write buffer leads to lazy writes: a write is not immediately propagated back to main memory, which causes the visibility and ordering problems for data. That is why a memory model must be defined to provide means of meeting certain consistency requirements, for example using memory barriers to force the data in the cache/write buffer to be written back to memory, or to force cached data to be refreshed, thereby guaranteeing visibility and ordering.
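In Java, the language-level tool alluded to here is, for example, volatile. A minimal sketch (mine, reusing the earlier visibility example): marking the flag volatile makes the JMM insert the necessary barriers, so the reader thread is guaranteed to see the write:

```java
// The visibility example again, fixed with volatile: the JMM guarantees that
// a write to a volatile field becomes visible to subsequent reads of it.
public class VisibilityFixedDemo {
    static volatile boolean stop = false;   // volatile forces the needed barriers

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) { /* spin */ }
            System.out.println("reader saw stop = true");
        });
        reader.start();
        Thread.sleep(100);
        stop = true;                        // now guaranteed to become visible
    }
}
```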

Original link: https://blog.csdn.net/ITer_ZC/article/details/41979189
