An analysis of common volatile interview questions

Today we look at another important keyword in concurrent programming: volatile. It comes up in interviews less often than synchronized, but it is still far from negligible.

I have collected 8 common interview questions about volatile, focused on its applications, characteristics, and implementation principles.

  1. What does volatile do?
  2. Why do visibility issues occur in a multi-threaded environment?
  3. What is the difference between synchronized and volatile?
  4. Describe in detail the implementation principle of volatile (involving memory barriers).
  5. What are the characteristics of volatile? How does it guarantee these properties?
  6. volatile guarantees the visibility of variables between threads. Does that make volatile variables thread-safe?
  7. Why do variables in methods not need to use volatile?
  8. How does instruction reordering happen?

This article starts from the applications of volatile, then analyzes its implementation from the source code, and tries to answer the questions above along the way.

What is volatile

Like synchronized, volatile is a keyword Java provides for concurrency control, but there are clear differences between the two.

The first is how to use it:

  • synchronized can modify methods and code blocks
  • volatile can only modify member variables (instance and static fields)

volatile is also the "weaker" of the two in capability:

  • it guarantees the visibility of the variable it modifies;
  • it forbids instruction reordering around reads and writes of that variable.

Let's slightly modify the visibility-problem code from 8 questions you must know about threads (part 1), marking the variable flag as volatile:

private static volatile boolean flag = true;

public static void main(String[] args) throws InterruptedException {
	new Thread(() -> {
		while (flag) {}
		System.out.println("thread: " + Thread.currentThread().getName() + ", flag: " + flag);
	}, "block_thread").start();

	TimeUnit.MICROSECONDS.sleep(500);

	new Thread(() -> {
		flag = false;
		System.out.println("thread: " + Thread.currentThread().getName() + ", flag: " + flag);
	}, "change_thread").start();
}

It is easy to observe that block_thread exits its loop, meaning the write to flag is "seen" by the other thread. This is volatile's visibility guarantee at work.

Next, modify the double-checked-lock code from the article on ordering problems caused by reordering in the JMM and Happens-Before, marking the variable instance as volatile:

public class Singleton {

	static volatile Singleton instance;

	private Singleton() {}

	public static Singleton getInstance() {
		if (instance == null) {
			synchronized(Singleton.class) {
				if (instance == null) {
					instance = new Singleton();
				}
			}
		}
		return instance;
	}
}

After many runs, no uninitialized instance object is ever observed. This is volatile's prohibition of instruction reordering at work.

Tips: as a reminder, Happens-Before describes a relationship between the results of actions, not a literal execution order.
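Why does reordering matter in the singleton above? Because `instance = new Singleton()` is not one atomic action. The sketch below annotates the typical decomposition in comments (the class is renamed `DclDemo` only to keep the example self-contained; the step breakdown is the usual JIT-level view, not literal bytecode):

```java
class DclDemo {
    static volatile DclDemo instance; // volatile forbids reordering steps 2 and 3 below

    private DclDemo() {}

    static DclDemo getInstance() {
        if (instance == null) {                 // first check, no lock
            synchronized (DclDemo.class) {
                if (instance == null) {         // second check, under lock
                    // 'new DclDemo()' is roughly three steps:
                    //   1. allocate memory
                    //   2. run the constructor (initialize fields)
                    //   3. publish the reference into 'instance'
                    // Without volatile, step 3 may be reordered before step 2,
                    // so another thread can see a non-null but uninitialized object.
                    instance = new DclDemo();
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        // single-threaded sanity check: always the same instance
        System.out.println(getInstance() == getInstance()); // prints "true"
    }
}
```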

Implementation of volatile

The following content is based on the JDK 11 HotSpot virtual machine, mainly on the X86 implementation, with comparisons to the ARM implementation. The reason for choosing these two is simple: each dominates its own field.

volatile is easy to use and its effects are easy to state, but a complex implementation often hides behind the simplicity. As with the earlier analysis of synchronized, we start from the bytecode and work down into the JVM, trying to connect volatile's behavior, the memory barriers, and the hardware.

The implementation of volatile differs considerably across architectures. As an example, here is the getfield_or_static method from templateTable_x86.cpp for the X86 architecture:

void TemplateTable::getfield_or_static(int byte_no, bool is_static, RewriteControl rc) {
	// type-dispatch code omitted
	__ bind(Done);
	// [jk] not needed currently
	// volatile_barrier(Assembler::Membar_mask_bits(Assembler::LoadLoad | Assembler::LoadStore));
}

And the getfield_or_static method from templateTable_arm.cpp for the ARM architecture:

void TemplateTable::getfield_or_static(int byte_no, bool is_static, RewriteControl rc) {
	// type-dispatch code omitted
	__ bind(Done);
	if (gen_volatile_check) {
		Label notVolatile;
		__ tbz(Rflagsav, ConstantPoolCacheEntry::is_volatile_shift, notVolatile);
		volatile_barrier(MacroAssembler::Membar_mask_bits(MacroAssembler::LoadLoad | MacroAssembler::LoadStore), Rtemp);
		__ bind(notVolatile);
	}
}

Under X86, no special handling is needed for volatile reads here, while under ARM a memory barrier must be added to achieve the same effect. This example shows how one specification is implemented differently on different CPU architectures, and it is a reminder not to mistake the X86 implementation for the standard. X86 places stronger constraints on reordering and can "naturally" satisfy some of the JMM's requirements, which is why the JVM-level implementation looks so simple there.

So far we have been looking at the template interpreter, but the rest of this article uses the bytecode interpreter, bytecodeInterpreter. Why not the template interpreter? Because it is too "far" from OrderAccess, and the detailed memory-barrier comments in OrderAccess are the key to understanding how volatile works.

First, though, let's take a moment to look at the memory-barrier implementation in assembler_x86.hpp for the X86 architecture:

enum Membar_mask_bits {
	StoreStore = 1 << 3,
	LoadStore  = 1 << 2,
	StoreLoad  = 1 << 1,
	LoadLoad   = 1 << 0
};

void membar(Membar_mask_bits order_constraint) {
	if (os::is_MP()) {
		if (order_constraint & StoreLoad) {
			int offset = -VM_Version::L1_line_size();
			if (offset < -128) {
				offset = -128;
			}
			lock();
			addl(Address(rsp, offset), 0);
		}
	}
}

A bitmask defines the memory-barrier enumeration, the same technique seen when analyzing biased locks. The focus is on the last two lines of the membar method:

lock();
addl(Address(rsp, offset), 0);

The inserted lock addl instruction is the key to the memory-barrier implementation under X86; the implementation in orderAccess_linux_x86.hpp is the same.

Tips: membar is short for memory barrier, which is also called a memory fence, or simply a fence. The terms barrier and fence are used more or less interchangeably.

Start with the bytecode

Bytecode generated from the double-checked-lock singleton (excerpted from javap -v output):

public class com.wyz.keyword.keyword_volatile.Singleton
  static volatile com.wyz.keyword.keyword_volatile.Singleton instance;
    flags: (0x0048) ACC_STATIC, ACC_VOLATILE

  public static com.wyz.keyword.keyword_volatile.Singleton getInstance();
    Code:
      stack=2, locals=2, args_size=0
        24: putstatic     #7    // Field instance:Lcom/wyz/keyword/keyword_volatile/Singleton;
        37: getstatic     #7    // Field instance:Lcom/wyz/keyword/keyword_volatile/Singleton;

Two parts of the bytecode matter here:

  • the ACC_VOLATILE flag, which marks volatile variables;
  • the putstatic and getstatic instructions, which write and read static variables.

Chapter 4 of the Java 11 Virtual Machine Specification describes ACC_VOLATILE as follows:

ACC_STATIC      0x0008      Declared static.
ACC_VOLATILE    0x0040      Declared volatile; cannot be cached.

The specification requires that a volatile variable cannot be cached. We know the CPU cache is the "culprit" behind visibility problems, so "cannot be cached" eliminates the visibility problem, although it does not mean the CPU cache is bypassed entirely.

The role of the getstatic instruction is described in Chapter 6 of the Java 11 Virtual Machine Specification:

Get static field from class.

And the role of the putstatic instruction:

Set static field in class.

From this we can roughly guess how the JVM implements volatile: the methods backing the getstatic/putstatic instructions check whether the variable is marked ACC_VOLATILE, and if so apply special handling.

Tips

  • 0x0048 is ACC_STATIC (0x0008) combined with ACC_VOLATILE (0x0040);
  • for non-static fields, the read and write instructions are getfield and putfield.
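The same access flags can be checked from Java itself: `Modifier.isVolatile` reads the ACC_VOLATILE bit (0x0040) that `javap -v` prints. A small sketch (the class `Flags` and its fields are illustrative, not from the article's project):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

class Flags {
    static volatile boolean ready;  // ACC_STATIC | ACC_VOLATILE = 0x0048
    int plain;                      // no ACC_VOLATILE bit

    public static void main(String[] args) throws NoSuchFieldException {
        Field readyField = Flags.class.getDeclaredField("ready");
        Field plainField = Flags.class.getDeclaredField("plain");
        System.out.println(Modifier.isVolatile(readyField.getModifiers())); // true
        System.out.println(Modifier.isVolatile(plainField.getModifiers())); // false
        // Modifier.VOLATILE is exactly the spec's 0x0040 bit
        System.out.println(Modifier.VOLATILE == 0x0040);                    // true
    }
}
```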

Implementation of bytecode interpreter

Here we only look at the source code of putstatic; the template-interpreter section above already sketched getstatic, and the rest is left as an exercise.

The implementation of putstatic is at line 2026 of bytecodeInterpreter.cpp:

CASE(_putfield):
CASE(_putstatic):
{
	if ((Bytecodes::Code)opcode == Bytecodes::_putstatic) {
		// static handling
	} else {
		// non-static handling
	}

	// ACC_VOLATILE -> JVM_ACC_VOLATILE -> is_volatile()
	if (cache->is_volatile()) {
		// volatile handling
		if (tos_type == itos) {
			obj->release_int_field_put(field_offset, STACK_INT(-1));
		} else {
			// many other type branches omitted
		}
		OrderAccess::storeload();
	} else {
		// non-volatile handling
	}
}

The logic is simple: check the field's type, perform a release-style store, and then call OrderAccess::storeload() to complete the volatile guarantees.
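The pattern above, a release-style store followed by a StoreLoad fence, can be imitated from Java 9+ application code with `VarHandle`: `setRelease` plus `VarHandle.fullFence` roughly correspond to the interpreter's `release_..._field_put` plus `OrderAccess::storeload()`. A hedged sketch of the analogy, not the JVM's actual mechanism (class and method names are invented for illustration):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class PutStaticSketch {
    static int value;
    static final VarHandle VALUE;

    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findStaticVarHandle(PutStaticSketch.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void volatileStyleWrite(int v) {
        VALUE.setRelease(v);     // release store: earlier writes cannot sink below it
        VarHandle.fullFence();   // StoreLoad: the store is ordered before any later load
    }

    public static void main(String[] args) {
        volatileStyleWrite(42);
        System.out.println((int) VALUE.getVolatile()); // prints 42
    }
}
```

(In practice you would simply declare the field volatile, or use `setVolatile`; the two-step form is only spelled out to mirror the interpreter code.)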

JVM's memory barrier

On top of the various operating systems and CPU architectures, the JVM builds a set of memory barriers that conform to the JMM specification, hiding the differences between architectures and giving the barriers consistent semantics. This part explains the four main memory barriers the JVM implements, then looks at the X86 implementation and the differences hardware introduces.

Let's look at the explanation of the four memory barriers in orderAccess.hpp:

Memory Access Ordering Model

LoadLoad: Load1(s); LoadLoad; Load2

Ensures that Load1 completes (obtains the value it loads from memory) before Load2 and any subsequent load operations. Loads before Load1 may not float below Load2 and any subsequent load operations.

StoreStore: Store1(s); StoreStore; Store2

Ensures that Store1 completes (the effect on memory of Store1 is made visible to other processors) before Store2 and any subsequent store operations.  Stores before Store1 may not float below Store2 and any subsequent store operations.

LoadStore: Load1(s); LoadStore; Store2

Ensures that Load1 completes before Store2 and any subsequent store operations.  Loads before Load1 may not float below Store2 and any subsequent store operations.

StoreLoad: Store1(s); StoreLoad; Load2

Ensures that Store1 completes before Load2 and any subsequent load operations.  Stores before Store1 may not float below Load2 and any subsequent load operations.

An attempted translation of the four main barriers:

  • LoadLoad, instruction sequence Load1; LoadLoad; Load2: ensures that Load1 completes its read before Load2 and any subsequent reads, and that loads before the barrier cannot be reordered below Load2 and subsequent reads;
  • StoreStore, instruction sequence Store1; StoreStore; Store2: ensures that Store1 completes its write before Store2 and any subsequent writes, so the result of Store1 is visible to Store2, and stores before the barrier cannot be reordered below Store2 and subsequent writes;
  • LoadStore, instruction sequence Load1; LoadStore; Store2: ensures that Load1 completes its read before Store2 and any subsequent writes, and that loads before the barrier cannot be reordered below Store2 and subsequent writes;
  • StoreLoad, instruction sequence Store1; StoreLoad; Load2: ensures that Store1 completes its write before Load2 and any subsequent reads, and that stores before the barrier cannot be reordered below Load2 and subsequent reads.

Although the translation is a little stiff, the idea is not hard to grasp; it is worth reading this part of the comments (and what follows) carefully.

As the comments show, memory barriers are what preserve the ordering of the program.
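Since Java 9, similar barrier flavours are exposed to application code as static fence methods on `VarHandle` (note that only StoreStore and LoadLoad have dedicated methods; LoadStore is covered by `acquireFence`/`releaseFence`, and StoreLoad only by `fullFence`). A minimal sketch mapping them to the names in orderAccess:

```java
import java.lang.invoke.VarHandle;

class FenceSketch {
    static int a, b;

    public static void main(String[] args) {
        a = 1;
        VarHandle.storeStoreFence(); // StoreStore: the write to a is ordered before the write to b
        b = 2;

        int r1 = b;
        VarHandle.loadLoadFence();   // LoadLoad: the read of b is ordered before the read of a
        int r2 = a;

        VarHandle.fullFence();       // full barrier, including the expensive StoreLoad
        System.out.println(r1 + "," + r2); // prints 2,1
    }
}
```

Single-threaded, the fences change nothing observable; their effect only matters when another thread is reading or writing the same fields concurrently.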

Memory barrier implementation of X86 architecture

On Linux, the X86 memory barriers are defined in orderAccess_linux_x86.hpp:

inline void OrderAccess::loadload()   { compiler_barrier(); }
inline void OrderAccess::storestore() { compiler_barrier(); }
inline void OrderAccess::loadstore()  { compiler_barrier(); }
inline void OrderAccess::storeload()  { fence();            }
inline void OrderAccess::acquire()    { compiler_barrier(); }
inline void OrderAccess::release()    { compiler_barrier(); }

The implementation is very simple; there are only two core methods, compiler_barrier and fence:

static inline void compiler_barrier() {
	__asm__ volatile ("" : : : "memory");
}

inline void OrderAccess::fence() {
	// always use locked addl since mfence is sometimes expensive
#ifdef AMD64
	__asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
	__asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
	compiler_barrier();
}

The above is GCC extended inline assembly. Briefly, in compiler_barrier:

  • asm inserts assembly instructions;
  • volatile forbids the compiler from optimizing away or moving these instructions;
  • the "memory" clobber tells the compiler that memory may be modified, so cached values must be re-read.

Then there is the fence method backing the StoreLoad barrier, the same as the implementation in templateTable_x86.cpp: the core is the lock addl instruction. A lock-prefixed instruction can be understood as a CPU instruction-level lock on the bus and caches, with two main effects:

  • a lock-prefixed instruction causes the processor's cache to be written back to memory;
  • that write-back invalidates the corresponding cache lines of other processors.

X86 actually provides dedicated barrier instructions — lfence, sfence, and mfence — so why not use them? The reason is in the comment inside fence:

mfence is sometimes expensive

That is, the mfence instruction can carry a large performance overhead. At this point we have the principle behind volatile on X86:

  • from the JVM's perspective, the memory barriers provide the visibility and ordering guarantees;
  • from the X86 perspective, asm volatile with the "memory" clobber forbids compiler reordering, and the lock-prefixed instruction forces cache write-back and invalidation.

Tips: the AMD64 and X86 branches of fence differ slightly (rsp vs. esp). For the history of the two architectures, see pansz's answer on Zhihu.

Reasons for other architecture implementation differences

As we saw earlier, the getfield_or_static methods in the X86 and ARM template interpreters differ in their use of memory barriers: ARM reaches "Rome" via a memory barrier, while X86 was born in "Rome".

The reason is not hard to guess: CPU architectures constrain reordering differently, so the JVM needs different handling on each to achieve a uniform effect. For the reorderings each CPU permits, I adapted a figure:

The figure comes from Memory Barriers: a Hardware View for Software Hackers, the classic introduction to CPU caches and memory barriers. Although it is relatively old, it is still worth reading. (The column headings in the original figure are vertical and hard to read, so they have been straightened here.)

The figure also shows that X86 only permits "stores reordered after loads", which is why the JVM implements only StoreLoad as a real fence there; the guarantees of the other barriers come from the CPU itself.

Epilogue

volatile is a hard topic to write about: the features are not difficult, and neither is the source code, but memory barriers are genuinely hard to explain.

Say too little and it is hard to understand; say too much and the article "crosses the border" into a piece about hardware. So the strategy here was to base the discussion on X86, the most common desktop architecture, compare it with ARM, the most common mobile architecture, and explain the implementation of volatile as briefly as possible.

The memory-barrier section also left out the two one-way barriers, acquire and release, which you can explore on your own.
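For the curious, the one-way acquire/release pair has a Java-level counterpart too: a `setRelease` write matched with a `getAcquire` read gives release/acquire ordering without the cost of a full StoreLoad barrier. A minimal sketch (class and field names are invented for illustration; the guarantee only bites when publish and consume run on different threads):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class AcquireReleaseSketch {
    static int payload;          // ordinary field, published via 'ready'
    static boolean ready;
    static final VarHandle READY;

    static {
        try {
            READY = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireReleaseSketch.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void publish() {
        payload = 42;             // ordinary write
        READY.setRelease(true);   // release: the payload write cannot sink below this store
    }

    static int consume() {
        if ((boolean) READY.getAcquire()) { // acquire: later reads cannot float above this load
            return payload;                 // sees 42 whenever ready was observed true
        }
        return -1;
    }

    public static void main(String[] args) {
        publish();
        System.out.println(consume()); // prints 42
    }
}
```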

Now, back to the questions at the start — I believe you can easily answer the first six, right?
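As a parting check on question 6: visibility does not make a compound action like `count++` thread-safe, because it is a read, an add, and a write, and two threads can interleave those steps. A hedged sketch (the class name is illustrative; the final value is at most 20,000, and under real contention it is usually less):

```java
class LostUpdateSketch {
    static volatile int count = 0;   // visible to all threads, but count++ is still not atomic

    public static void main(String[] args) throws InterruptedException {
        Runnable add = () -> {
            for (int i = 0; i < 10_000; i++) {
                count++;             // read-modify-write: increments can be lost
            }
        };
        Thread t1 = new Thread(add);
        Thread t2 = new Thread(add);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // use AtomicInteger or synchronized for a correct concurrent counter
        System.out.println(count);
    }
}
```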


If this article helped you, please give it a like. If there are any mistakes in it, criticism and corrections are welcome. Finally, everyone is welcome to follow Wang Youzhi, a finance man. See you next time!


Origin blog.csdn.net/m0_74433188/article/details/132553826