Deep understanding of the volatile keyword

Thoughts from a visibility issue

Let's look at a piece of code:

    public class VisibilityDemo {
        public static boolean flg = false;

        public static void main(String[] args) throws InterruptedException {
            Thread thread = new Thread(() -> {
                int i = 0;
                while (!flg) {
                    i++;
                    // 1. System.out.println("i:=" + i);

                    // 2. Thread.sleep(1000), i.e.:
                    /*try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }*/
                }
            });
            thread.start();
            Thread.sleep(1000);
            flg = true;
        }
    }

Result: the program does not terminate. The loop thread keeps spinning because it never observes the write flg = true made by the main thread.
If you uncomment either the first commented line (the println) or the second one (the Thread.sleep), the program ends normally. Why? Let's analyze.

println causes the loop to end

  • Because println is implemented with the synchronized keyword under the hood, every call performs a lock operation. Releasing the lock forces the write operations held in the thread's working memory to be flushed to main memory, and the next lock acquisition forces the working memory to be refreshed from main memory, so the loop thread eventually observes flg = true.

    public void println(String x) {
        synchronized (this) {
            print(x);
            newLine();
        }
    }
    
  • From an I/O point of view, print is essentially an I/O operation, and disk/console I/O is far slower than CPU computation, so while the I/O is in progress the CPU has time to refresh the value from main memory, which leads to the same phenomenon. You can verify this by replacing the println with some other I/O, for example operations on a new File().

Thread.sleep(long)

  • Thread.sleep(long) causes a thread switch, which invalidates the CPU cache, so when the loop resumes it reads the latest value of flg.

volatile

We know that the root cause of the problem shown at the beginning of the article is visibility. To guarantee visibility between threads, Java provides the volatile keyword.
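
As a quick sketch (the class name is mine, not from the original post), simply declaring the flag volatile makes the opening example terminate, because every read of the flag is forced to observe the latest write:

    public class VolatileFixDemo {
        // volatile guarantees the loop thread sees the main thread's write to flg.
        public static volatile boolean flg = false;

        public static void main(String[] args) throws InterruptedException {
            Thread thread = new Thread(() -> {
                int i = 0;
                while (!flg) {
                    i++;                       // no println or sleep needed any more
                }
                System.out.println("loop exited, i = " + i);
            });
            thread.start();
            Thread.sleep(1000);
            flg = true;                        // becomes visible to the loop thread
            thread.join();
        }
    }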

What is visibility?

In a multithreaded environment, after one thread updates a shared variable, threads that subsequently access that variable may not read the updated value immediately, or may even never read it. This is one manifestation of thread-safety problems: visibility.

Why is there a visibility problem?

1. Cache

The processing speed of a modern processor (CPU) far exceeds the access speed of main memory (DRAM): the time main memory needs for a single read or write is enough for the processor to execute hundreds of instructions. To bridge this gap, hardware designers introduced caches between the processor and main memory.
A cache is a storage component whose access speed is much higher, and whose capacity is much smaller, than main memory, and each processor has its own. Once caches are introduced, the processor no longer reads and writes main memory directly when performing memory operations; it goes through the cache.
Modern processors generally have several levels of cache: Level 1 (L1 Cache), Level 2 (L2 Cache) and Level 3 (L3 Cache), accessed in the order L1 > L2 > L3.
When multiple threads access the same shared variable, each thread's processor keeps a copy of the variable in its own cache. This raises a question: when one processor updates its copy, how do the other processors learn about it and react appropriately? This is the visibility problem, also known as the cache coherency problem.

Cache consistency problem (MESI)

The MESI (Modified-Exclusive-Shared-Invalid) protocol is a widely used cache coherency protocol; the coherency protocol used by x86 processors is based on it.
To keep data consistent, MESI divides each cache entry into four states: Modified, Exclusive, Shared and Invalid, and defines a set of messages (Message) to coordinate read and write operations between the processors.

  • Invalid (denoted I): the corresponding cache line does not contain a valid copy of the data at the memory address. This is the initial state of a cache entry.
  • Shared (denoted S): the cache line contains a copy of the data at the corresponding memory address, and caches on other processors may also hold copies of the same address.
  • Exclusive (denoted E): the cache line contains a copy of the data at the corresponding memory address, and no other processor's cache holds a copy of it.
  • Modified (denoted M): the cache line contains updated data for the corresponding memory address. In the MESI protocol, only one processor at a time may be updating the data of a given memory address.
    The MESI protocol also defines a set of messages (Message), such as Read, Read Response, Invalidate and Invalidate Acknowledge, to coordinate the read and write operations of the different processors.
    From the protocol's workflow we can see a performance weakness: after issuing a write for an address it does not own, the processor must wait until the other processors have invalidated the corresponding copies in their caches and it has received their Invalidate Acknowledge/Read Response messages before it can write the data into its own cache.
    To avoid and reduce the write latency caused by this waiting, hardware designers introduced the write buffer and the invalidate queue.
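
Before moving on, here is a rough sketch of the state transitions described above (the enum and method names are my own; a real cache controller is implemented in hardware, so treat this purely as a mental model):

    // Toy model of the MESI cache-line states; illustrative only.
    enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    class MesiCacheLine {
        MesiState state = MesiState.INVALID;      // initial state of a cache entry

        // The local processor reads this line.
        void onLocalRead(boolean otherCachesHoldCopies) {
            if (state == MesiState.INVALID) {
                // Send a Read message; once the Read Response arrives the line becomes
                // Shared if other caches also hold it, otherwise Exclusive.
                state = otherCachesHoldCopies ? MesiState.SHARED : MesiState.EXCLUSIVE;
            }
            // In M, E or S the copy is already valid and can be read directly.
        }

        // The local processor writes this line.
        void onLocalWrite() {
            if (state == MesiState.SHARED || state == MesiState.INVALID) {
                // Send Invalidate (or Read Invalidate) and wait for every
                // Invalidate Acknowledge before updating the line -- this wait
                // is the latency the store buffer was introduced to hide.
            }
            state = MesiState.MODIFIED;            // only one cache may hold the line as Modified
        }

        // Another processor announces a write: our copy must be dropped.
        void onRemoteInvalidate() {
            state = MesiState.INVALID;             // an Invalidate Acknowledge is sent back
        }
    }
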
Write buffer (Store Buffer) and invalidate queue (Invalidate Queue)

The write buffer (Store Buffer) is a private component inside each processor whose capacity is even smaller than the cache. With a store buffer, a write is handled as follows: if the corresponding cache entry is in state S, the processor first places the data of the write operation (the value together with the memory address) into a store-buffer entry and sends the Invalidate message asynchronously. In other words, the writing processor considers the write complete as soon as it has been put into the store buffer; it does not wait for the other processors' Invalidate Acknowledge/Read Response messages and goes on executing other instructions, which reduces the latency of the write.
The invalidate queue (Invalidate Queue) works on the receiving side: when a processor receives an Invalidate message, it does not immediately delete the copy of the specified memory address from its cache; it puts the message into its invalidate queue and returns Invalidate Acknowledge right away, which further shortens the time the writing processor has to wait.
However, the store buffer and the invalidate queue introduce a new problem: instruction reordering.

2. Instruction reordering

Let's use an example to explain the instruction reordering problem in detail:

    int data = 0;
    boolean ready = false;

    void threadDemo1() {
        data = 1;       // S1
        ready = true;   // S2
    }

    void threadDemo2() {
        while (!ready) {
            // S3: spin until ready is observed to be true
        }
        System.out.println(data);   // S4: may print 0 if the write S1 is not yet visible
    }

Suppose CPU0's cache holds only a copy of ready, CPU1's cache holds only a copy of data, and CPU0 runs threadDemo1 while CPU1 runs threadDemo2. Execution then proceeds roughly as follows: for S1, CPU0 does not own data, so it places the write into its store buffer and sends an Invalidate message; for S2, ready is already in CPU0's own cache, so the write is applied immediately. When CPU1 reads ready it obtains the new value true and leaves the loop at S3, but the Invalidate message for data may still be sitting in CPU1's invalidate queue, so at S4 it reads the stale value 0 from its own cache.
From the perspective of CPU1, this creates the phenomenon that S2 was executed before S1.

Memory barrier

For each kind of memory reordering a processor may perform, it also provides instructions that can forbid that reordering; these instructions are called memory barriers.
A memory barrier can be written as XY, where X and Y each stand for Load (read) or Store (write). An XY barrier forbids reordering between any X operation before the instruction and any Y operation after it, thus guaranteeing that all X operations before the barrier are committed before any Y operation after it. This gives four kinds of barriers: LoadLoad, LoadStore, StoreLoad and StoreStore.
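
Java itself does not expose raw CPU barrier instructions, but since JDK 9 java.lang.invoke.VarHandle offers explicit fence methods that roughly correspond to these barriers. The sketch below is my own illustration of how the barrier names map onto that API (the fields remain plain, so this is not properly synchronized code in JMM terms; in ordinary code volatile or synchronized is still the idiomatic tool):

    import java.lang.invoke.VarHandle;

    public class FenceSketch {
        static int data = 0;
        static boolean ready = false;

        static void writer() {
            data = 1;
            // Roughly a StoreStore barrier: the store to data may not be
            // reordered with the following store to ready.
            VarHandle.storeStoreFence();
            ready = true;
        }

        static void reader() {
            if (ready) {
                // Roughly a LoadLoad barrier: the load of ready may not be
                // reordered with the following load of data.
                VarHandle.loadLoadFence();
                System.out.println(data);
            }
        }
    }

VarHandle also provides acquireFence(), releaseFence() and fullFence() for the combined cases.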

Principle

volatile is in fact implemented with these underlying memory barriers. Let's look at the example again after adding the volatile keyword to the flag ready, with the barriers shown as comments:

    int data = 0;
    volatile boolean ready = false;

    void threadDemo1() {
        data = 1;
        // StoreStore barrier: guarantees the write to data is committed to the
        // cache before the volatile write to ready.
        ready = true;
    }

    void threadDemo2() {
        while (!ready) {
            // spin
        }
        // LoadLoad barrier: guarantees the read of ready happens before the read of data.
        System.out.println(data);
    }

To summarize, the rules for inserting memory barriers around volatile reads and writes are as follows (a schematic sketch follows the list):

  • A LoadLoad barrier and a LoadStore barrier are inserted after each volatile read operation.
  • A StoreStore barrier is inserted before, and a StoreLoad barrier after, each volatile write operation.
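
Put together, the placement looks roughly like this (a schematic only; the JIT may drop barriers that are redundant on a given CPU, for example all but StoreLoad on x86):

    // volatile write of v:
    //     [StoreStore barrier]   earlier normal writes cannot move below the volatile store
    //     v = newValue;          // the volatile store itself
    //     [StoreLoad barrier]    the volatile store cannot move below later reads
    //
    // volatile read of v:
    //     r = v;                 // the volatile load itself
    //     [LoadLoad barrier]     later normal reads cannot move above the volatile load
    //     [LoadStore barrier]    later normal writes cannot move above the volatile load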

happens-before model

The Java Memory Model (JMM) defines the behavior of the volatile, final and synchronized keywords and guarantees that correctly synchronized Java programs run correctly on processors of different architectures.
Regarding atomicity, JMM specifies that reads and writes of shared variables of basic types other than long/double, and of reference types, are atomic; in addition, it specifically guarantees that reads and writes of volatile-modified long/double shared variables are atomic as well.
For visibility and ordering, JMM uses the happens-before model. The happens-before rules are as follows (a small example follows the list):

  • Program order rule: as-if-serial semantics. The result of any action in a thread is visible to later actions of the same thread in program order, and to the thread itself these actions appear to be executed and committed strictly in program order.
  • Monitor lock rule: the release of a monitor lock happens-before every subsequent acquisition of that same lock.
    Note: "release" and "acquire" must refer to the same lock instance; releasing one lock has no happens-before relationship with acquiring a different lock.
  • Volatile variable rule: a write to a volatile variable happens-before every subsequent read of that variable.
    Note: it must be the same volatile variable, and the write must actually precede the read in time.
  • Thread start rule: a call to a thread's start() method happens-before every action of the started thread.
  • Thread termination rule: every action of a thread happens-before the return of another thread's join() on it; after join() returns, the joining thread can see all of those results.
  • Transitivity rule: if A happens-before B and B happens-before C, then A happens-before C.
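
A minimal sketch of how these rules combine (class and field names are my own): the write to value is ordered before start() by the program order rule, start() happens-before the worker's actions by the thread start rule, the worker's write to result happens-before main's read after join() by the thread termination rule, and transitivity ties the chain together:

    public class HappensBeforeDemo {
        static int value;       // plain (non-volatile) fields
        static int result;

        public static void main(String[] args) throws InterruptedException {
            value = 42;                           // program order: before worker.start()
            Thread worker = new Thread(() -> {
                result = value + 1;               // thread start rule: guaranteed to see value == 42
            });
            worker.start();
            worker.join();                        // thread termination rule: the worker's writes
                                                  // are visible after join() returns
            System.out.println(result);           // guaranteed to print 43
        }
    }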

Summary

  • volatile achieves visibility by relying on the cache coherency mechanism.
  • volatile forbids instruction reordering by means of memory barriers, thereby guaranteeing ordering.
  • JMM uses the happens-before model to describe visibility and ordering more concisely.


Origin blog.csdn.net/xzw12138/article/details/106403512