Java multi-threading - Understanding the cache coherency protocol and its impact on concurrent programming

Because Java is a cross-platform language, its implementation has to run on many different underlying hardware systems, so an intermediate model is needed to shield developers from the hardware differences and present a consistent interface. The Java memory model is exactly such an intermediate layer: it hides the hardware implementation details from the programmer while supporting the mainstream hardware platforms. To understand the Java memory model, as well as many high-concurrency techniques, some basic hardware knowledge is necessary. This article covers the hardware concepts most relevant to concurrent programming.

A basic computation executed by a CPU proceeds as follows:

1. The program and its data are loaded into main memory

2. The instructions and data are loaded into the CPU cache

3. The CPU executes the instructions and writes the results to the cache

4. The data in the cache is written back to main memory

From this process we can see two problems:

1. Modern computer chips integrate an L1 cache, which we can think of as private storage for each core. When different cores access the same memory address, the value at that address ends up with multiple copies in the caches of different cores. How are these copies kept in sync?

2. The CPU reads and writes directly against the cache, not against main memory. A main-memory access typically takes tens to hundreds of clock cycles, while a write to the L1 cache takes only a few clock cycles and a write to the L2 cache only a few tens of cycles. So when is a value written to the cache written back to main memory? And if multiple cores are working on the same memory address, how is this timing gap handled?
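The two problems above can be illustrated with a toy model. The class below is a sketch, not real hardware: each "core" holds a private copy of a main-memory value (steps 2-4 of the process above), and a write only becomes visible to other cores after an explicit write-back.

```java
// Toy model of per-core caching: each core keeps a private copy of a
// main-memory value and only synchronizes on explicit load/write-back.
// This illustrates the stale-copy problem, not any real protocol.
public class ToyCacheDemo {
    static int mainMemory = 0;          // the "main memory" copy

    static class Core {
        int cachedCopy;                 // this core's private cached copy
        void load()      { cachedCopy = mainMemory; } // step 2: load into cache
        void compute()   { cachedCopy += 1; }         // step 3: write result to cache
        void writeBack() { mainMemory = cachedCopy; } // step 4: write back
    }

    public static void main(String[] args) {
        Core a = new Core(), b = new Core();
        a.load(); b.load();               // both cores cache the value 0
        a.compute();                      // core A now caches 1 ...
        System.out.println(b.cachedCopy); // ... but core B still sees its stale 0
        a.writeBack();
        System.out.println(mainMemory);   // main memory now holds 1
        System.out.println(b.cachedCopy); // B's copy is still stale: 0
    }
}
```

Without extra machinery, nothing ever refreshes B's stale copy; the cache coherency protocol discussed below is that machinery.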

For the first question, different hardware architectures handle it differently. First we need to understand the concept of the interconnect.

The interconnect is the communication medium between processors, and between a processor and main memory. There are two basic interconnect structures: SMP (symmetric multiprocessing) and NUMA (non-uniform memory access).

[Figure: SMP and NUMA interconnect structures]

The SMP structure is very common because it is the easiest to build; many small servers use it. Processors and memory are connected by an interconnect bus, and both processors and memory have bus controller units responsible for sending and listening to messages broadcast on the bus. At any given time only one processor (or memory controller) can broadcast on the bus, but all processors can listen.

It is easy to see that the bus is the bottleneck of the SMP structure.

In a NUMA system, a series of nodes are interconnected by a point-to-point network, like a small internet; each node contains one or more processors and its own local memory. A node's local memory is visible to the other nodes, and the local memories of all nodes together form a global memory shared by all processors. So in NUMA, local memory is shared rather than private, which is a difference from SMP. NUMA's problem is that a point-to-point network is more complex than a bus and requires more complex protocols, and a processor accesses the memory of its own node faster than the memory of other nodes. NUMA scales well, and many medium-sized servers now use the NUMA structure.

For application programmers, the key point is that the interconnect is an important shared resource; how well a program uses it directly affects the program's performance.

Now that we understand the different interconnect structures, let's look at the cache coherency protocol. Its main job is to handle multiple processors working with the same main-memory address.

MESI is a mainstream cache coherency protocol that has been used in Pentium and PowerPC processors. It defines four states for a cache block:

Modified: the cache block has been modified and must eventually be written back to main memory; no other processor may cache this block
Exclusive: the cache block has not been modified, and no other processor has it loaded in its cache
Shared: the cache block has not been modified, and other processors may also have it loaded in their caches
Invalid: the data in the cache block is invalid

[Figure: MESI state transition example, panels (a)-(d)]


The figure shows an example of MESI state transitions:

1. In (a), processor A reads data from address a, stores the data in its cache, and sets the block to Exclusive

2. In (b), when processor B attempts to read from the same address a, A detects the address conflict and responds with the data. At this point, address a is loaded into both A's and B's caches in the Shared state

3. In (c), when B writes to the shared address a, it changes its block's state to Modified and broadcasts a notice to A, telling A to set its cache block's state to Invalid

4. In (d), when A then attempts to read from a, it broadcasts its request; B sends the modified data both to A and to main memory, and sets both copies to the Shared state in response
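The four steps above can be replayed with a minimal state-machine sketch. This is a simplification for two caches and one block only (the `Cache` class and its transition rules are illustrative, not the full MESI specification):

```java
// Minimal sketch of MESI state transitions for one cache block shared by
// two snooping caches. Simplified: only the transitions needed for the
// scenario in steps (a)-(d); no real data movement.
public class MesiDemo {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    static final class Cache {
        State state = State.INVALID;
        Cache other;                     // the peer cache snooping the bus

        // Read the block: load it from memory or from the peer's copy.
        void read() {
            if (other.state == State.INVALID) {
                state = State.EXCLUSIVE;           // sole cached copy
            } else {
                if (other.state == State.MODIFIED) {
                    other.state = State.SHARED;    // peer writes back dirty data
                } else if (other.state == State.EXCLUSIVE) {
                    other.state = State.SHARED;
                }
                state = State.SHARED;              // both now share the block
            }
        }

        // Write the block: take ownership and invalidate the peer's copy.
        void write() {
            state = State.MODIFIED;
            other.state = State.INVALID;
        }
    }

    public static void main(String[] args) {
        Cache a = new Cache(), b = new Cache();
        a.other = b; b.other = a;

        a.read();                                    // (a)
        System.out.println(a.state);                 // EXCLUSIVE
        b.read();                                    // (b)
        System.out.println(a.state + " " + b.state); // SHARED SHARED
        b.write();                                   // (c)
        System.out.println(a.state + " " + b.state); // INVALID MODIFIED
        a.read();                                    // (d)
        System.out.println(a.state + " " + b.state); // SHARED SHARED
    }
}
```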

One of the biggest problems with cache coherency protocols is that they can cause coherency traffic storms. As we saw earlier, the bus can only be used by one processor at a time; when a large number of cache blocks are modified, or the same cache block is modified over and over, the resulting coherency traffic occupies the bus and delays other normal read and write requests.

A common example: if multiple threads keep performing CAS operations on the same variable, there will be a large number of modifications and therefore a large amount of coherency traffic, because every CAS operation must broadcast a notice to the other processors, which hurts program performance.
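This is easy to reproduce in Java. In the sketch below, every `AtomicLong.incrementAndGet()` is a CAS on the same cache line, so all threads contend for it; `LongAdder` (added in Java 8) spreads the updates over several internal cells precisely to reduce that coherency traffic. Both produce an exact total:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Sketch: many threads hammering one AtomicLong with CAS contend for the
// same cache line; LongAdder stripes its updates across cells to reduce
// the coherency traffic. The final counts are exact either way.
public class CasContentionDemo {
    public static void main(String[] args) throws InterruptedException {
        final int THREADS = 4, PER_THREAD = 100_000;
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();

        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < PER_THREAD; j++) {
                    atomic.incrementAndGet(); // CAS loop on one shared cache line
                    adder.increment();        // striped cells, less contention
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();

        System.out.println(atomic.get()); // 400000
        System.out.println(adder.sum());  // 400000
    }
}
```

Under heavy contention `LongAdder` usually scales better, at the cost of a slightly more expensive `sum()`.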

As for the second question - how to handle the timing gap between modifying data in the cache and writing it back to main memory - the usual mechanism is the memory barrier; see In-depth understanding of the Java Memory Model (9) - understanding memory barriers.
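In Java, memory barriers are exposed indirectly through `volatile`: a volatile write makes all prior writes visible, and a volatile read sees them (the happens-before rule of the Java memory model). A minimal safe-publication sketch:

```java
// Sketch of Java's memory-barrier effect via volatile: the volatile write
// to `flag` publishes the earlier plain write to `payload`, so once the
// reader's volatile read sees flag == true, it must also see payload == 42.
public class VolatilePublishDemo {
    static int payload;            // plain field, published via the flag
    static volatile boolean flag;  // volatile write/read act as barriers

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!flag) { }            // spin until the volatile read sees true
            System.out.println(payload); // guaranteed to print 42
        });
        reader.start();

        payload = 42;  // ordinary write ...
        flag = true;   // ... made visible by the volatile write
        reader.join();
    }
}
```

Without `volatile` on `flag`, the reader might spin forever or observe a stale `payload` - exactly the write-back timing problem described above.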

 


Origin blog.csdn.net/weixin_42073629/article/details/104743313