Lock elimination + escape analysis

If the JIT compiler can confirm that a locked object will never escape its local scope, the object can only ever be accessed by one thread at a time, so there is no need to guard it against other threads and the lock can be removed. This optimization is called lock elimination, and it is the topic of this installment in our series on JVM implementation mechanisms.

java.lang.StringBuffer, the well-known thread-safe class built on synchronized methods, illustrates lock elimination nicely. StringBuffer dates back to Java 1.0 and provides efficient concatenation of immutable String objects. All of its append methods are synchronized, so that when multiple threads write to the same StringBuffer object at once, the string under construction is built safely.

Many programs do not actually need this thread-safety guarantee, so Java 5 introduced the unsynchronized java.lang.StringBuilder class as an alternative. Both classes extend the package-private java.lang.AbstractStringBuilder class (that is, a class declared without an access modifier), and their append implementations are very similar.

The difference lies in the synchronization operation of StringBuffer:
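The original article showed the JDK source here as an image. The listing cannot be recovered, so the following is a simplified sketch of the pattern StringBuffer follows (hypothetical class and field names, not the real java.lang implementation): every append overload is declared synchronized on the instance and delegates to shared buffer logic.

```java
// Sketch of the StringBuffer pattern (not the actual JDK source):
// every mutating method is synchronized on the instance's monitor lock.
class SafeBuilder {
    // Stand-in for the shared AbstractStringBuilder state.
    private final StringBuilder delegate = new StringBuilder();

    public synchronized SafeBuilder append(String s) {
        delegate.append(s);
        return this;
    }

    @Override
    public synchronized String toString() {
        return delegate.toString();
    }
}
```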

 


 

 

Compare with StringBuilder:
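Again, the original image listing is not recoverable; a StringBuilder-style append (a sketch with hypothetical names, not the JDK source) looks the same except that the synchronized keyword is absent:

```java
// Sketch of the StringBuilder pattern: same append logic, no synchronization.
class FastBuilder {
    // Fixed capacity for brevity; the real class grows its internal array.
    private final char[] buf = new char[64];
    private int len = 0;

    public FastBuilder append(String s) {
        for (int i = 0; i < s.length(); i++) {
            buf[len++] = s.charAt(i);
        }
        return this;
    }

    @Override
    public String toString() {
        return new String(buf, 0, len);
    }
}
```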

 


 

 

A thread calling StringBuffer's append method must acquire the object's intrinsic lock (also called the monitor lock) to enter the method, and must release it before exiting. StringBuilder needs no such step, so its performance should be higher than StringBuffer's, at least at first glance.

However, since the HotSpot virtual machine introduced escape analysis, the lock can be eliminated automatically when a synchronized method is called on an object such as a StringBuffer. This happens only for objects created inside the method body, because only then can the JVM guarantee that no escape occurs.

Java performance testing is generally done with the Java Microbenchmark Harness (JMH). Let's use JMH to see how, when a modern JVM can confirm that a StringBuffer object is reachable by only one thread, it narrows the performance gap by eliminating the lock on that StringBuffer.
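The article's JMH benchmark was an image and cannot be recovered, and JMH needs an external dependency, so here is a self-contained sketch (hypothetical names, not the original benchmark) of the measured pattern: each method builds a string in a local, non-escaping buffer, which is exactly the situation where HotSpot can elide StringBuffer's locks.

```java
class ConcatBench {
    // Candidate for lock elimination: the StringBuffer never escapes this method.
    static String withBuffer() {
        StringBuffer sb = new StringBuffer();
        sb.append("foo").append("bar").append("baz");
        return sb.toString();
    }

    // Same work with the unsynchronized StringBuilder, for comparison.
    static String withBuilder() {
        StringBuilder sb = new StringBuilder();
        sb.append("foo").append("bar").append("baz");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Warm-up loop so the JIT compiles both methods; with escape analysis
        // enabled, the StringBuffer's lock operations can then be removed.
        for (int i = 0; i < 1_000_000; i++) {
            withBuffer();
            withBuilder();
        }
        System.out.println(withBuffer().equals(withBuilder())); // prints "true"
    }
}
```

A real measurement should still use JMH rather than a hand-rolled loop, since JMH handles warm-up, dead-code elimination, and statistical reporting.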

 


 

 

Lock elimination is a very effective optimization. It is enabled by default in Java 8, and it can be turned off with the VM flag -XX:-DoEscapeAnalysis so you can observe the optimization's effect. With escape analysis enabled (the default), StringBuffer and StringBuilder perform essentially the same. (The benchmark reports operations per second; a higher score means better performance.)

 


 

 

With escape analysis turned off, the StringBuffer code runs about 15% slower, and the difference comes mainly from the lock operations performed when append() is called.

Lock Coarsening

The HotSpot virtual machine has some additional lock optimizations as well. They are not technically part of the escape-analysis subsystem, but they also improve the performance of intrinsic locks by analyzing scope. When locks on the same object are acquired in succession, the HotSpot VM checks whether several lock regions can be merged into one larger region. This aggregation, called lock coarsening, reduces the cost of locking and unlocking.

When the HotSpot JVM is about to acquire a lock, it looks for a preceding unlock operation on the same object. If it finds one, it considers merging the two lock regions and deleting one unlock/lock pair.

Let's look at a program that continuously acquires monitor locks for the same object:
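The original example was an image; a sketch of the same shape (hypothetical names) acquires the same object's monitor in back-to-back synchronized blocks, which HotSpot may coarsen into a single lock region:

```java
class CoarsenDemo {
    private final Object lock = new Object();
    private int counter = 0;

    // Two adjacent lock regions on the same monitor. HotSpot may merge them
    // into one region, removing one unlock/lock pair between the blocks.
    public int incrementTwice() {
        synchronized (lock) {
            counter++;
        }
        synchronized (lock) {
            counter++;
        }
        return counter;
    }
}
```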

 


 

 

Its bytecode is as follows, which looks very verbose:

 

[javap disassembly omitted in this transcription: the method body contains two monitorenter instructions, each paired with two monitorexit instructions.]

 

 


First, a quick review: the bytecode instructions that operate the intrinsic lock are monitorenter and monitorexit.

Each monitorenter instruction in the bytecode corresponds to two monitorexit instructions, one per execution path: the first monitorexit releases the monitor lock on a normal exit from the lock region, while the second releases it on an abnormal exit.

The bytecode may look strange, because the synchronized block in the source contains only a single increment of an int variable and nothing that throws. But the lock region can indeed be exited abnormally: an unchecked exception can arrive asynchronously, for example ThreadDeath triggered by a call to the executing thread's deprecated stop() method. The second execution path exists to guarantee that the monitor lock is released even when such an unchecked exception is thrown; the JVM specification covers this in detail. Lock coarsening is enabled by default, and it can be turned off with the startup flag -XX:-EliminateLocks.

Nested locks

Synchronized blocks may be nested inside one another, with both blocks synchronizing on the same object's monitor lock. This situation is called nested locking. The HotSpot virtual machine can recognize it and remove the lock in the inner block: a thread has already acquired the lock when entering the outer block, so when it reaches the inner block it must still hold the lock, and the inner lock operation can safely be deleted.

At the time of writing, nested lock elimination in Java 8 happens only when the lock is declared static final or the lock is the this object.

Here is an example where the inner lock of a nested synchronized block is removed:
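The original listing was an image; this sketch (hypothetical names) uses a static final lock object, which is one of the cases eligible for nested-lock elimination:

```java
class NestedDemo {
    // static final lock: eligible for nested-lock elimination in HotSpot.
    private static final Object LOCK = new Object();

    private int value = 0;

    public int update() {
        synchronized (LOCK) {
            synchronized (LOCK) { // inner acquisition: the thread already holds LOCK
                value++;
            }
        }
        return value;
    }
}
```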

 


 

 

The HotSpot virtual machine deletes the inner nested lock, so the code effectively becomes:
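The transformed listing was also an image; a hand-written equivalent of the result (a sketch, not actual compiler output) keeps only the outer lock region:

```java
// Roughly what executes after HotSpot removes the inner lock
// (hand-written equivalent, not compiler output):
class NestedDemoElided {
    private static final Object LOCK = new Object();

    private int value = 0;

    public int update() {
        synchronized (LOCK) { // single remaining lock region
            value++;
        }
        return value;
    }
}
```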

 


 

 

Nested lock elimination is enabled by default, and it can be turned off with the startup flag -XX:-EliminateNestedLocks.

Arrays and escape analysis

Space that is not allocated on the heap lives either on the stack or in CPU registers, both relatively scarce resources, so escape analysis, like any optimization, involves trade-offs in its implementation. One default limit in the HotSpot JVM is that arrays longer than 64 elements are not considered for the escape-analysis optimization. The limit is controlled by the startup flag -XX:EliminateAllocationArraySizeLimit=n, where n is the maximum array length.

Suppose some hot code allocates a temporary array for reading data out of a cache. If escape analysis shows that the array never escapes the method body, the JVM need not allocate it on the heap. But if the array has more than 64 elements, even if not all of them are used, it is still stored on the heap: the escape-analysis optimization does not apply, and memory is allocated from the heap as usual.

In the following JMH benchmark, the test methods create fresh non-escaping arrays of sizes 63, 64, and 65. (The size-63 array is included to show that size 64 outperforming size 65 is not merely a memory-alignment effect.)

Each test uses only the first two elements of the array, a[0] and a[1]. Note that escape analysis is limited only by the array's length; how many elements are actually used does not matter.
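The benchmark listing was an image; a dependency-free sketch of the measured methods (hypothetical names, without the JMH annotations the original presumably used) looks like this:

```java
class ArrayBench {
    // Fresh, non-escaping arrays; only the first two elements are used.
    // With the default -XX:EliminateAllocationArraySizeLimit=64, the 63- and
    // 64-element arrays are eligible for allocation elimination, while the
    // 65-element array is always heap-allocated.
    static int size63() { int[] a = new int[63]; a[0] = 1; a[1] = 2; return a[0] + a[1]; }
    static int size64() { int[] a = new int[64]; a[0] = 1; a[1] = 2; return a[0] + a[1]; }
    static int size65() { int[] a = new int[65]; a[0] = 1; a[1] = 2; return a[0] + a[1]; }
}
```

In the real benchmark each method would be annotated with @Benchmark and its return value consumed by JMH so the allocation is not dead-code eliminated.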

 


 

 

The results show that once an array allocation can no longer benefit from escape analysis, performance drops significantly. (The score is again operations per second; higher is better.)

 


 

 

If you need to allocate a larger array in hot code, you can configure the VM to optimize it as well. Raise the element limit to 65 with -XX:EliminateAllocationArraySizeLimit=65 and run the benchmark again, and the performance lines up.


Running with the raised limit shows the size-65 case now matching the smaller arrays.

Conclusion

This article and the previous article on escape analysis show you some of the magic behind the HotSpot JVM. At the same time, you can also see the complexity behind these optimizations. Every major release of Java will add some new features to the JVM.

In fact, Oracle is already working on the next generation of compilation technology: Graal, a pluggable, extensible just-in-time (JIT) compiler implemented in Java. It is an important part of Project Metropolis, whose goal is to rebuild the JVM runtime in Java as far as possible.

As described in JEP 317, the Graal compiler is an experimental feature of Java 10. Its main goal is to let developers and platform specialists write their own JIT compilers tailored to their particular needs, and it is a well-suited platform for designing and prototyping new optimization techniques.

The scope-analysis approach covered in this article and the previous one enables many optimization techniques. The first is allocation elimination, that is, scalar replacement: decomposing an object into primitive values such as int that can live on the stack or in registers, so that no heap allocation, and hence no GC, is needed. Then there are the lock-related techniques we discussed. These are just a few examples of the JIT compilation technology provided by the mature C2 compiler in the HotSpot JVM. Later articles will introduce other techniques HotSpot uses to improve code performance.



Origin: www.cnblogs.com/zhuyeshen/p/12735734.html