[Concurrent Programming] The Visibility Problem and Its Essence in Multithreading

Visibility

Visibility means that when one thread modifies a shared variable, the modification can immediately be seen by other threads.

In plain terms: two threads share a variable, and no matter which thread modifies it, the other thread can immediately observe the new value.
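
As a concrete illustration, here is a minimal sketch of the visibility problem (the class and field names are invented for this example). The worker thread may keep reading a stale value of the non-volatile flag and never exit its loop:

public class VisibilityDemo {

    // Plain (non-volatile) shared flag: the worker thread may keep seeing a stale value.
    static boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) {
                // Busy loop: without a visibility guarantee this may never terminate.
            }
            System.out.println("worker observed stop = true");
        });
        worker.start();

        Thread.sleep(1000);
        stop = true;   // The worker is not guaranteed to ever see this write.
        System.out.println("main set stop = true");
    }
}

Declaring the flag volatile (discussed later in this article) is one way to restore visibility.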

Causes of Visibility Issues

A computer uses the CPU to perform calculations, but the CPU can only operate on data in memory; data on disk must first be read into memory before the CPU can process it. CPU, memory, and disk all affect the overall performance of the computer, and there is a core contradiction among the three: the huge difference in their processing speeds. The CPU is the fastest, memory comes next, and I/O devices such as disks are the slowest. In other words, the CPU's computing speed is far higher than the I/O speed of memory and disk.

Although CPUs have moved from single core to multiple cores (and hyper-threading) to maximize processing power, improving the CPU alone is not enough: if memory and disk cannot keep up, the overall computing speed is limited by the slowest device. To balance the speed difference among the three and make the best use of the CPU, many optimizations have been introduced at the hardware level, the operating-system level, and the compiler level:

  • The CPU adds caches to bridge the gap between CPU and memory speed.
  • The operating system adds processes and threads, maximizing CPU utilization through time-slice switching.
  • The compiler reorders and optimizes instructions to make better use of the CPU cache.

Each of these optimizations, however, brings its own problems, and these problems are the root cause of thread-safety issues.

CPU cache

The CPU cache exists mainly to resolve the mismatch between CPU computing speed and memory read/write speed: because the CPU computes much faster than memory can be read or written, the CPU would otherwise spend a long time waiting for data to arrive or to be written back to memory.

The cache holds copies of data stored in memory. Each time the CPU needs data for a calculation, it first looks in the cache; only if the data is not there is it loaded from main memory.
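
As a rough software analogy only (this is not how the hardware actually works, and all names below are invented for illustration), the read path just described behaves like a read-through cache:

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy analogy: check the "cache" first, fall back to "main memory" on a miss.
public class ReadThroughCache<K, V> {

    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> mainMemory;   // slow backing store

    public ReadThroughCache(Function<K, V> mainMemory) {
        this.mainMemory = mainMemory;
    }

    public V read(K key) {
        // Cache hit: fast path.
        V value = cache.get(key);
        if (value == null) {
            // Cache miss: load from the slow backing store, then keep a copy.
            value = mainMemory.apply(key);
            cache.put(key, value);
        }
        return value;
    }
}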

On mainstream x86 platforms, the CPU cache is divided into three levels: L1, L2, and L3 (access speed: L1 > L2 > L3).

There is no visibility problem on a single-core CPU

We also need to note that there is no visibility problem on a single-core CPU. Why is this?

Because on a single-core CPU, no matter how many threads are created, only one thread can hold the CPU and execute at any given moment, even though the single-core CPU also has a cache. All of these threads run on the same core and share the same CPU cache, so as soon as one thread modifies a shared variable, any other thread that runs next necessarily reads the latest value.

False sharing and cache line padding

In systems engineering, whether in a database or in system memory, data access usually exhibits locality: some data has a high probability of being accessed again, both in time and in space.

  • Temporal locality: if a piece of main-memory data is being accessed, the probability that it will be accessed again in the near future is high.
  • Spatial locality: when the CPU uses data in a certain memory region, the adjacent data following that region is very likely to be used soon. For example, arrays and collections are often accessed sequentially (their memory addresses are contiguous or adjacent), as the sketch below illustrates.
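
A minimal sketch of spatial locality (the class name is invented for this example): traversing a two-dimensional array row by row touches contiguous memory and is cache-friendly, while traversing it column by column keeps jumping across memory. On most machines the first loop finishes noticeably faster:

public class LocalityDemo {

    public static void main(String[] args) {
        int n = 2048;
        int[][] matrix = new int[n][n];
        long sum = 0;

        // Row-major traversal: consecutive elements share cache lines.
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sum += matrix[i][j];
            }
        }
        System.out.println("row-major:    " + (System.currentTimeMillis() - start) + " ms");

        // Column-major traversal: each access jumps to a different row (poor locality).
        start = System.currentTimeMillis();
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < n; i++) {
                sum += matrix[i][j];
            }
        }
        System.out.println("column-major: " + (System.currentTimeMillis() - start) + " ms");
        System.out.println(sum);
    }
}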

Therefore, data in the CPU cache is stored, read, and written in units of cache lines (a cache line is the smallest unit of storage that can be allocated in the cache) at every cache level. Each cache line maps a contiguous range of main-memory addresses and is usually 64 bytes.

Because the cache line is the smallest unit of the CPU cache, several variables (or objects) may end up in the same cache line. If multiple threads then operate on that cache line concurrently, the line can be repeatedly invalidated, a problem known as false sharing.

For example, a thread on CPU1 and a thread on CPU2 both load the same cache line from main memory into their respective L1/L2 caches, and that cache line contains three variables x, y, and z. Since the line now exists in two CPU caches, all three variables are in the Shared state (the specific cache states are introduced below).

Now the thread on CPU1 modifies the variable x. To keep the caches consistent, the corresponding cache line in CPU2 must be marked invalid. If the thread on CPU2 then wants to operate on y, CPU1 must first write the cache line back to memory, and CPU2 must reload the latest data from memory. If the two threads keep operating on x and y concurrently, every operation forces the data to be reloaded, and the cache effectively stops working. This repeated cache invalidation is exactly the false-sharing problem.
Because the default size of a cache line is 64 bytes and the objects we define may be smaller than that, several variables can land in the same cache line. The fix is simple: pad the object so that each hot variable occupies a full 64-byte cache line by itself.

Java 8 provides the @Contended annotation to perform this byte padding. It can be placed on a field or on a class: on a field it means that field occupies its own cache line, and on a class it means every field in the class gets an exclusive cache line.
Using the @Contended annotation outside the JDK requires the JVM parameter -XX:-RestrictContended.
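
A minimal sketch of the annotation-based approach, assuming it is run on a HotSpot JVM with -XX:-RestrictContended (note that the annotation lives in sun.misc on Java 8 and in jdk.internal.vm.annotation on Java 9+, where compiling against it also needs an --add-exports flag; the class name is invented for this example):

import jdk.internal.vm.annotation.Contended;   // sun.misc.Contended on Java 8

public class ContendedPair {

    // Each annotated field is padded so that it occupies its own cache line,
    // which avoids false sharing between threads updating x and y.
    @Contended
    volatile long x = 0;

    @Contended
    volatile long y = 0;
}

The two classes below achieve the same effect by hand: ValuePaddingTest pads each volatile long with seven extra longs so that x and y fall on different cache lines, while NoValuePaddingTest leaves them adjacent.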

public class ValuePaddingTest {

    public static void main(String[] args) throws InterruptedException {
        Pair pair = new Pair();

        // Two threads hammer on two different fields of the same object.
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 2000000000; i++) {
                pair.x++;
            }
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 2000000000; i++) {
                pair.y++;
            }
        });
        long start = System.currentTimeMillis();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(System.currentTimeMillis() - start);
    }

    static class Pair {
        // Seven padding longs (7 * 8 = 56 bytes) plus the 8-byte field itself fill a
        // 64-byte cache line, so x and y end up on different cache lines.
        long x1, x2, x3, x4, x5, x6, x7;
        volatile long x = 0;
        long y1, y2, y3, y4, y5, y6, y7;
        volatile long y = 0;
    }
}


public class NoValuePaddingTest {

    public static void main(String[] args) throws InterruptedException {
        Pair pair = new Pair();

        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 2000000000; i++) {
                pair.x++;
            }
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 2000000000; i++) {
                pair.y++;
            }
        });
        long start = System.currentTimeMillis();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(System.currentTimeMillis() - start);
    }

    static class Pair {
        // No padding: x and y are adjacent and very likely share a cache line,
        // so the two threads above keep invalidating each other's cache (false sharing).
        volatile long x = 0;
        volatile long y = 0;
    }
}
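When these two programs are run on a typical multi-core machine, one would expect the padded version to finish considerably faster than the unpadded one, because the two counters no longer invalidate each other's cache line on every increment. The exact numbers depend on the CPU, the JVM, and the individual run, so any single measurement should be treated as indicative rather than definitive.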


Cache coherency problem and cache coherence protocol

In a multi-threaded environment, when multiple threads running on different cores load the same piece of memory in parallel, each CPU caches that data in its own private L1/L2 cache. Because the CPUs cannot see each other's caches while executing instructions, the cached copies can diverge; this is the cache coherence problem.

To keep data access consistent, locks (cache locks or bus locks) can be used to make operations on each cache line mutually exclusive. This requires every processor to follow a protocol when accessing its cache and to read and write according to that protocol. Common protocols include MSI, MESI, and MOSI; the most common is the MESI protocol.
MESI stands for the four states a cache line can be in:

  1. M (Modified): the data is cached only in the current CPU's cache and has been modified, i.e. the cached data is inconsistent with main memory.
  2. E (Exclusive): the data is cached only in the current CPU's cache and has not been modified.
  3. S (Shared): the data may be cached by multiple CPUs, and every cached copy is consistent with main memory.
  4. I (Invalid): the cache line has been invalidated.

At any given moment, each cache line in a CPU's cache is in exactly one of these states; a line that is held by several CPU caches at once can only be in the Shared (or Invalid) state.
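
As a rough illustration only (a real MESI implementation lives in hardware and also exchanges bus messages), here is a simplified Java model of how a single cache line's state might change in response to local and remote reads and writes; all names are invented for this sketch:

public class MesiLineSketch {

    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    private State state = State.INVALID;

    // The local CPU reads the line.
    void localRead(boolean otherCachesHoldIt) {
        if (state == State.INVALID) {
            // Load from memory: Exclusive if we are the only holder, otherwise Shared.
            state = otherCachesHoldIt ? State.SHARED : State.EXCLUSIVE;
        }
        // MODIFIED / EXCLUSIVE / SHARED: the read hits the cache, state unchanged.
    }

    // The local CPU writes the line.
    void localWrite() {
        // Writing requires ownership: other copies must be invalidated first,
        // then the line becomes Modified (dirty with respect to main memory).
        state = State.MODIFIED;
    }

    // Another CPU reads the same line.
    void remoteRead() {
        // A Modified line is written back (or forwarded) before being shared.
        if (state != State.INVALID) {
            state = State.SHARED;
        }
    }

    // Another CPU writes the same line.
    void remoteWrite() {
        // Our copy is no longer current.
        state = State.INVALID;
    }
}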


In Java code, when we declare a variable with the volatile keyword, the generated machine code contains a lock-prefixed instruction, which triggers the cache lock (the cache-coherence mechanism described above) and thereby guarantees visibility.
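
Applied to the VisibilityDemo sketch from the beginning of the article (a hypothetical example class), a one-word change is enough:

    // Declaring the flag volatile makes every write visible to other threads
    // (on x86 this is backed by a lock-prefixed instruction), so the worker
    // thread's loop is guaranteed to observe stop = true and exit.
    static volatile boolean stop = false;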

CPU instruction reordering

Programs are also optimized at the CPU level and at the JVM level by reordering instructions. Why reorder at all? Consider multiple CPUs operating on the same variable: when one CPU modifies the value of variable x, the cache coherence protocol requires it to notify the other CPUs that their cached copies are invalid, and to wait for all of them to acknowledge, before the new value can be written to main memory. During that wait, the CPU would otherwise sit idle.

Store Buffer and Store Forwarding

To make the most of CPU resources, the store buffer was introduced. The CPU writes the modified data that it wants to flush back to main memory into its store buffer, sends invalidation messages to the other CPUs, and then carries on processing subsequent instructions. Only after every invalidation it sent out has been acknowledged is the data finally synchronized to main memory (an asynchronous-processing idea, similar to a message queue).


function() {
    a = 1;
    b = a + 1;
    assert(b == 2);
}

If the assignment a = 1 is still sitting in the store buffer when b = a + 1 executes, the CPU may read the stale value of a from its cache, and the result will differ from what we expect.
Because the program's apparent execution order can be broken in this way, engineers built store forwarding on top of the store buffer: the CPU can load data directly from its own store buffer, i.e. a value waiting in the store buffer is forwarded to subsequent load operations instead of using the stale copy in the cache.
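
A deliberately simplified software model of the idea (all names are invented; real store buffers are hardware structures invisible to software): writes land in a per-CPU buffer and reach "main memory" only later, while that CPU's own loads check the buffer first (store forwarding):

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class StoreBufferSketch {

    // Shared "main memory".
    private final Map<String, Integer> mainMemory = new HashMap<>();

    // Pending writes of this CPU that have not reached main memory yet.
    private final Map<String, Integer> storeBuffer = new LinkedHashMap<>();

    // A write first lands in the store buffer, not in main memory.
    void store(String var, int value) {
        storeBuffer.put(var, value);
    }

    // Store forwarding: this CPU's own loads see its pending writes first.
    int load(String var) {
        Integer pending = storeBuffer.get(var);
        if (pending != null) {
            return pending;
        }
        return mainMemory.getOrDefault(var, 0);
    }

    // Later (after the invalidation acks arrive) the buffered writes are drained.
    void drain() {
        mainMemory.putAll(storeBuffer);
        storeBuffer.clear();
    }

    public static void main(String[] args) {
        StoreBufferSketch cpu0 = new StoreBufferSketch();
        cpu0.store("a", 1);
        int b = cpu0.load("a") + 1;   // forwarding returns 1, so b == 2 as expected
        System.out.println("b = " + b);
    }
}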

But this still has problems under multi-threading:

a = 0; b = 0;

void function1() {
    a = 1;
    b = 1;
}

void function2() {
    while (b == 1) {
        assert(a == 1);
    }
}

The initial values of a and b are both 0. Assume a is cached only by CPU1 and b is cached only by CPU0, both in the Exclusive state; CPU0 executes function1() and CPU1 executes function2().

When CPU1 executes function2(), it has to fetch the value of b from CPU0, so it does obtain the latest value b = 1.
When CPU0 executes function1(), it has to ask CPU1 (asynchronously) to invalidate its copy of a, so it first puts a = 1 into its store buffer; b = 1, on the other hand, is written straight through, because CPU0 already owns that variable exclusively. At this point CPU1 has not yet received the invalidation for a, so while executing function2() it still uses the stale value a = 0 from its own cache, and the assertion fails.

The root cause is that the CPU does not know there is a dependency between a and b. CPU0's write to a has to wait for communication with the other CPU and is therefore delayed, while its write to b only touches the local cache. As a result b becomes visible before a: CPU1 reads b = 1 while a = 1 is still sitting in CPU0's store buffer.

Invalid Queue

To address the previous problem, the invalidate queue (Invalid Queue) was introduced, again using an asynchronous idea. The main cost of an Invalidate ACK is that the receiving CPU must actually mark the corresponding cache line Invalid before it replies, and a very busy CPU can keep the other CPUs waiting for its acknowledgement. With the invalidate queue, a CPU can simply enqueue the incoming Invalidate message and return the Invalidate ACK immediately, processing the queued messages later. This greatly shortens the response time of the Invalidate ACK.

However, because the invalidate queue is also processed asynchronously, if CPU1 has not yet processed the invalidation message in its queue when it evaluates the assertion, it may still read a = 0 and the assertion can still fail.

So the invalidate queue does not solve the visibility problem caused by the loss of sequential consistency, but it does keep the store buffer from backing up. After modifying shared data, a CPU must send an invalidate message and wait for the other CPU's Invalidate ACK before the data in its store buffer can be written to main memory. While it waits, the local store buffer keeps accumulating new writes, and if the other CPU responds too slowly (say, because it is busy), the store buffer can easily fill up. Introducing the invalidate queue greatly shortens the time data spends sitting in the store buffer.

Memory barrier

The ordering problems introduced by these CPU performance optimizations cannot be solved by the CPU itself: the CPU is just a computing tool that receives and executes instructions, and it has no way of knowing whether a particular piece of logic must not be reordered. In other words, the visibility problems caused by the loss of sequential consistency cannot be fixed purely at the hardware level.

Therefore the CPU provides write-barrier, read-barrier, and full-barrier instructions, letting developers decide where such optimizations must not be applied.

On the x86 architecture these three instructions are SFENCE, LFENCE, and MFENCE:

  • sfence: store fence, the write barrier. All write operations before the sfence instruction must complete before any write operation after it.
  • lfence: load fence, the read barrier. All read operations before the lfence instruction must complete before any read operation after it.
  • mfence: the full (mixed) barrier. All reads and writes before the mfence instruction must complete before any read or write after it.

In the Linux kernel these are wrapped as three functions: smp_wmb (write barrier), smp_rmb (read barrier), and smp_mb (full read-write barrier).

The read barrier drains the invalidate queue (when the CPU executes a read barrier, it first processes the messages currently in its invalidate queue and only then performs the reads that follow the barrier), and the write barrier drains the store buffer (when the CPU executes a write barrier, it first flushes the data currently in its store buffer to the cache and only then performs the writes that follow the barrier).
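
Java code does not use these instructions directly, but since Java 9 the VarHandle class exposes explicit fence methods. As a sketch of how the earlier function1/function2 example could be protected with them (the class is invented for this illustration; in everyday code one would simply declare the fields volatile):

import java.lang.invoke.VarHandle;

public class FenceSketch {

    static int a = 0;
    static int b = 0;

    // Runs on one thread (the role of cpu0 in the example above).
    static void function1() {
        a = 1;
        VarHandle.releaseFence();   // the write to a must become visible before the write to b
        b = 1;
    }

    // Runs on another thread (the role of cpu1).
    static void function2() {
        while (b == 1) {
            VarHandle.acquireFence();   // the read of a below must not be reordered before the read of b
            assert a == 1;
        }
    }
}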

Summary


The underlying implementation of the volatile keyword is the lock-prefixed instruction. So what is the relationship between the lock prefix and memory barriers?
In my view there is no direct relationship; it is simply that some effects of the lock prefix happen to achieve what a memory barrier achieves.
This point can also be found in the corresponding description in the "IA-32 Architecture Software Developer's Manual".
The manual defines the lock prefix as a bus lock: the lock-prefixed instruction guarantees visibility and prevents instruction reordering by locking the bus.

Although "bus lock" is an old term and modern processors mostly lock a cache line instead, the point stands: the core idea behind the lock prefix is still a lock, which is fundamentally different from a memory barrier.

The content about JMM and Happens-Before is in the next article: JMM and the happens-before principle

References:
https://www.cnblogs.com/coderw/p/16380057.html
