Java multi-thread tuning

Optimization of Synchronized synchronization lock

  • Lock is implemented at the Java level, in JDK code.
  • Synchronized is implemented on top of the underlying operating system's Mutex Lock. Every acquisition and release of the lock switches between user mode and kernel mode, which increases system overhead. Under intense lock contention, Synchronized performs very poorly, which is why it is also known as a heavyweight lock.

Since JDK1.6, Java has extensively optimized the Synchronized synchronization lock.

Synchronized synchronization lock implementation principle

javac -encoding UTF-8 SyncTest.java  // first compile to a .class file
javap -v SyncTest.class // then dump the bytecode with javap

When Synchronized decorates a synchronized code block, synchronization is implemented by the monitorenter and monitorexit instructions. On the monitorenter instruction, the thread acquires and holds the Monitor object; on the monitorexit instruction, the thread releases the Monitor object.

When Synchronized decorates a synchronized method, the method carries the ACC_SYNCHRONIZED flag. When the method is invoked, the call instruction checks whether the ACC_SYNCHRONIZED access flag is set. If it is, the executing thread first acquires the Monitor object before running the method body; while the method executes, no other thread can obtain that Monitor, and it is released when the method completes.
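A minimal SyncTest.java sketch that exercises both forms (the field and method names are illustrative); compiling it and dumping it with the javap command above shows monitorenter/monitorexit around the block and the ACC_SYNCHRONIZED flag on the method:

// SyncTest.java -- illustrative class for inspecting Synchronized bytecode
public class SyncTest {
    private int count = 0;

    // javap shows monitorenter/monitorexit around this block
    public void blockSync() {
        synchronized (this) {
            count++;
        }
    }

    // javap shows the ACC_SYNCHRONIZED flag on this method
    public synchronized void methodSync() {
        count++;
    }
}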

Synchronization in the JVM is based on entering and exiting the Monitor object. Each object instance has a Monitor, which is created and destroyed together with the object. The Monitor is implemented by ObjectMonitor, which is defined in the HotSpot C++ source file ObjectMonitor.hpp.

ObjectMonitor() {
   _header = NULL;
   _count = 0; // record count
   _waiters = 0,
   _recursions = 0;
   _object = NULL;
   _owner = NULL;
   _WaitSet = NULL; // threads in the wait state are added to _WaitSet
   _WaitSetLock = 0 ;
   _Responsible = NULL ;
   _succ = NULL ;
   _cxq = NULL ;
   FreeNext = NULL ;
   _EntryList = NULL ; // threads blocked waiting for the lock are added to this list
   _SpinFreq = 0 ;
   _SpinClock = 0 ;
   OwnerIsThread = 0 ;
}
  • When multiple threads access a synchronized code section at the same time, they are first placed in the _EntryList collection: threads in the blocked state are added to this list. When a thread then acquires the object's Monitor, the Monitor relies on the underlying operating system's Mutex Lock for mutual exclusion: if the thread's request for the Mutex succeeds, it holds the Mutex and no other thread can obtain it.

  • If the thread calls the wait() method, it releases the Mutex it currently holds and enters the _WaitSet collection to wait to be woken up. The Mutex is also released when the thread finishes executing the method.

    In this implementation of the synchronization lock, because the Monitor depends on the underlying operating system, each lock operation switches between user mode and kernel mode, which increases the performance overhead.
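A brief sketch of how a thread moves between these two sets (the class and field names are illustrative): a thread that calls wait() inside a synchronized block releases the Monitor and parks in the wait set until another thread calls notify() or notifyAll().

public class WaitNotifyDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    public void await() throws InterruptedException {
        synchronized (lock) {            // compete via _EntryList, then hold the Monitor
            while (!ready) {
                lock.wait();             // release the Monitor, enter _WaitSet
            }
        }
    }

    public void signal() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();            // waiting threads re-compete for the Monitor
        }
    }
}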

Lock Escalation Optimization

JDK1.6 introduced the concepts of biased locks, lightweight locks, and heavyweight locks to reduce the context switching caused by lock contention; the lock-upgrade mechanism is implemented through the Java object header.

When a Java object is used as a Synchronized lock, the series of upgrade operations around this lock all revolve around the Java object header.

Java object header

In the JDK1.6 JVM, an object instance in heap memory is divided into three parts: object header, instance data, and alignment padding. The Java object header itself consists of three parts: the Mark Word, a pointer to the object's class (the klass pointer), and, for arrays, the array length.

The lock-upgrade mechanism mainly depends on the lock flag bits and the biased-lock flag in the Mark Word. A Synchronized lock starts as a biased lock; as contention grows fiercer, it is upgraded to a lightweight lock and finally to a heavyweight lock.

Biased lock

Biased locks mainly optimize the case where the same thread applies for the same lock repeatedly. For example, a thread that runs a monitoring loop must acquire and release the same lock on every iteration, and each acquire/release would otherwise switch between user mode and kernel mode.

With a biased lock, when a thread accesses the synchronized code or method again, it only needs to check whether the Mark Word of the object header holds a biased lock pointing to its thread ID; there is no need to enter the Monitor and compete for the object. When the object is first used as a synchronization lock and a thread grabs it, the lock flag stays 01, the biased-lock flag is set to 1, and the ID of the thread that grabbed the lock is recorded, indicating that the lock has entered the biased state.

As soon as another thread competes for the lock resource, the biased lock is revoked. Revocation must wait for a global safe point: the thread holding the lock is suspended, and the JVM checks whether it is still executing inside the synchronized region. If it is, the lock is upgraded; otherwise other threads can preempt the lock.

In a high-concurrency scenario where many threads compete for the same lock resource at once, the biased lock is revoked, and the resulting stop-the-world pauses mean that keeping biased locking enabled brings more overhead than it saves. In that case we can add JVM parameters to turn off biased locking and tune system performance:

-XX:-UseBiasedLocking // disable biased locking (enabled by default)
or
-XX:+UseHeavyMonitors  // always use heavyweight locks
  • stop the world: all user threads run to a safe point and block while the JVM performs some global operation.

Lightweight lock

When a second thread competes for a lock that is currently biased and finds that the thread ID in the object header's Mark Word is not its own, it performs a CAS operation to acquire the lock. If the CAS succeeds, the thread ID in the Mark Word is simply replaced with its own ID and the lock stays in the biased state; if the CAS fails, the lock has real contention, and the biased lock is upgraded to a lightweight lock.

Lightweight locks suit scenarios where threads execute synchronized blocks alternately and most locks see no long-term competition over the whole synchronization cycle.

Spin locks and heavyweight locks

If the lightweight lock's CAS fails to grab the lock, the thread would be suspended and enter the blocked state. But if the thread holding the lock releases it within a short time, the newly blocked thread must immediately be woken up to apply for the lock again, which wastes two context switches.

For this reason the JVM provides spin locks: a thread keeps trying to acquire the lock in a busy loop instead of being suspended and blocked. This bets on the fact that in most cases a lock is not held for long, so suspending and resuming the thread would cost more than it saves.

Since JDK1.7, spinning is enabled by default and the spin count is decided by the JVM itself. Retrying too many times is counterproductive, because CAS retries keep the CPU occupied.

If the lock still cannot be grabbed after the spin retries, the synchronization lock is upgraded to a heavyweight lock and the lock flag changes to 10. In this state, every thread that fails to grab the lock enters the Monitor and is blocked in its _EntryList.

In scenarios where lock competition is not fierce and locks are held only briefly, spin locks improve system performance. But once competition is fierce or the lock is held too long, spinning leaves large numbers of threads stuck in CAS retry loops, occupying CPU resources and increasing system overhead.

-XX:-UseSpinning // disable spin-lock optimization (enabled by default)
-XX:PreBlockSpin // change the default spin count (removed after JDK1.7; the JVM now controls spinning itself)

Dynamic compilation implements lock elimination/lock coarsening

Java also applies compiler optimizations to locks. When the JIT compiler dynamically compiles a synchronized block, it uses a technique called escape analysis to determine whether the lock object used by the block can only ever be accessed by one thread and never escapes to other threads. If so, the JIT compiler does not emit the machine code for the lock acquire and release that synchronized represents; the lock is eliminated.

Also during dynamic compilation, if the JIT compiler finds several adjacent synchronized blocks using the same lock instance, it merges them into one larger synchronized block, avoiding the performance overhead of a thread repeatedly applying for and releasing the same lock.

In short: when the JIT compiler determines during dynamic compilation that a synchronized block can only be accessed by one thread, it omits the lock acquire/release machine code entirely, which is called lock elimination; when adjacent synchronized blocks use the same lock instance, merging them to avoid repeatedly acquiring and releasing the same lock is called lock coarsening.
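An illustrative sketch of both situations (the class below is hypothetical, and whether the JIT actually applies these optimizations depends on the JVM version and escape-analysis settings):

public class JitLockDemo {
    // Lock elimination: sb never escapes this method, so the JIT can
    // drop StringBuffer's internal synchronization entirely.
    public String concat(String a, String b) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    private final Object lock = new Object();
    private int counter = 0;

    // Lock coarsening: adjacent blocks on the same lock instance may be
    // merged into a single acquire/release pair by the JIT.
    public void update() {
        synchronized (lock) { counter++; }
        synchronized (lock) { counter++; }
        synchronized (lock) { counter++; }
    }
}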

Reduce lock granularity

Reducing lock granularity at the code level is another common lock optimization.

When our lock object is an array or a queue, funneling all threads onto that single object makes competition very intense and quickly upgrades the lock to a heavyweight lock. We can consider splitting one array or queue object into several smaller objects to reduce lock competition and improve parallelism.

The classic case of reducing lock granularity is ConcurrentHashMap as implemented before JDK1.8, which cleverly used the segmented lock Segment to reduce competition for lock resources.

In JDK1.8, ConcurrentHashMap was reworked substantially and the Segment concept was abandoned. Because Synchronized performance has improved greatly since Java 6, JDK1.8 re-adopted the Synchronized lock, using it on the head node of each bucket as the lock granularity. This change makes the data structure simpler and the operations clearer. Taking the put method as an example: when there is no hash conflict, JDK1.8 uses CAS to add the element; when there is a conflict, it locks the bucket's linked list with Synchronized before performing the next operation.
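As a sketch of the same granularity-reduction idea outside of ConcurrentHashMap, a counter can be split into independent lock stripes so that threads rarely contend on the same lock. The class below is illustrative; java.util.concurrent.atomic.LongAdder applies this striping technique inside the JDK:

public class StripedCounter {
    private static final int STRIPES = 16;
    private final long[] counts = new long[STRIPES];
    private final Object[] locks = new Object[STRIPES];

    public StripedCounter() {
        for (int i = 0; i < STRIPES; i++) locks[i] = new Object();
    }

    public void increment() {
        int i = (int) (Thread.currentThread().getId() % STRIPES);
        synchronized (locks[i]) {   // contend only on one stripe, not the whole counter
            counts[i]++;
        }
    }

    public long sum() {             // weakly consistent total, like LongAdder.sum()
        long total = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) { total += counts[i]; }
        }
        return total;
    }
}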

  • For large data sets: among non-thread-safe Map containers, use TreeMap; among thread-safe Map containers, use ConcurrentSkipListMap.
  • If you have strong consistency requirements for the data, use Hashtable; in the majority of scenarios where weak consistency is acceptable, use ConcurrentHashMap; if the data volume is in the tens of millions with heavy insertion, deletion, and modification, consider ConcurrentSkipListMap.


To summarize: in JDK1.6 the JVM introduced a tiered lock mechanism to optimize Synchronized. When a thread first acquires a lock, the object lock becomes a biased lock, which optimizes away the user-mode/kernel-mode switching caused by the same thread repeatedly re-acquiring the lock. If multiple threads then compete for the lock resource, it is upgraded to a lightweight lock, which suits scenarios where the lock is held briefly and threads acquire it alternately; lightweight locks use spinning to avoid frequent switches between user mode and kernel mode, greatly improving performance. But if competition becomes too fierce, the synchronization lock is upgraded to a heavyweight lock.

Reducing lock competition is the key to optimizing Synchronized. You should try to keep a Synchronized lock at the lightweight or biased level to get the best performance. You can also raise the chance that a spinning thread acquires the lock by shortening lock hold times, avoiding an upgrade to a heavyweight lock. Reducing lock granularity to reduce lock competition is likewise one of the most commonly used optimizations.

Lock synchronization lock


Under low concurrency and low contention, Synchronized enjoys the benefit of tiered locking and performs about the same as Lock; under high load and high concurrency, however, Synchronized is upgraded to a heavyweight lock and its performance is not as stable as Lock's.

Implementation principle of Lock lock

Lock is a lock implemented at the Java level. Lock is an interface; its commonly used implementation classes are ReentrantLock and ReentrantReadWriteLock (RRW), both of which are built on the AbstractQueuedSynchronizer (AQS) class.

The AQS class contains a linked-list-based waiting queue (a variant of the CLH queue) that stores all blocked threads, plus a state variable that, for ReentrantLock, represents the lock's held state. Operations on the queue and on state are performed with CAS.
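As a minimal sketch of how AQS is used, the non-reentrant lock below follows the pattern from the AbstractQueuedSynchronizer Javadoc: tryAcquire CASes state from 0 to 1, and threads that fail are parked in the CLH queue by acquire():

import java.util.concurrent.locks.AbstractQueuedSynchronizer;

public class SimpleLock {
    private static class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int arg) {
            // CAS state 0 -> 1; on failure AQS enqueues the thread in the CLH queue
            if (compareAndSetState(0, 1)) {
                setExclusiveOwnerThread(Thread.currentThread());
                return true;
            }
            return false;
        }

        @Override
        protected boolean tryRelease(int arg) {
            setExclusiveOwnerThread(null);
            setState(0);   // only the owner releases, so a plain write suffices
            return true;
        }
    }

    private final Sync sync = new Sync();

    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}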


Optimizing the Lock synchronization lock with lock separation

Read-write lock ReentrantReadWriteLock

In read-heavy, write-light scenarios, Java provides another implementation of the Lock interface: the read-write lock RRW. RRW allows multiple read threads to proceed at the same time, but read/write and write/write access are mutually exclusive. Internally, the read-write lock maintains two locks: a ReadLock for read operations and a WriteLock for write operations.
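A small read-mostly cache is the typical usage sketch for RRW (the cache class itself is illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RwCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

    public V get(K key) {
        rw.readLock().lock();          // many readers may hold this at once
        try {
            return map.get(key);
        } finally {
            rw.readLock().unlock();
        }
    }

    public void put(K key, V value) {
        rw.writeLock().lock();         // exclusive: blocks readers and writers
        try {
            map.put(key, value);
        } finally {
            rw.writeLock().unlock();
        }
    }
}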

RRW is also implemented on top of AQS. Its custom synchronizer (which inherits from AQS) must track the state of multiple read threads and one write thread in the single synchronization variable state, and the design of this state is the key to the read-write lock. RRW cleverly uses the high and low bits of one integer to track two states: the high 16 bits record the read lock and the low 16 bits record the write lock.
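The JDK performs this split with a shift and a mask; the snippet below mirrors the SHARED_SHIFT/EXCLUSIVE_MASK helpers in ReentrantReadWriteLock.Sync:

// How one int state encodes two counts (mirrors ReentrantReadWriteLock.Sync)
class RwStateMath {
    static final int SHARED_SHIFT   = 16;
    static final int EXCLUSIVE_MASK = (1 << SHARED_SHIFT) - 1;   // 0x0000FFFF

    static int sharedCount(int state)    { return state >>> SHARED_SHIFT; } // read-lock holds
    static int exclusiveCount(int state) { return state & EXCLUSIVE_MASK; } // write-lock holds
}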

When a thread tries to acquire the write lock, it first checks whether the synchronization state state is 0. If state equals 0, no other thread currently holds any lock; if state is not 0, some thread has already acquired a lock.

In that case it checks whether the low 16 bits (w) of state are 0. If w is 0, other threads hold the read lock, so the current thread enters the CLH queue and blocks. If w is not 0, some thread holds the write lock, and it must be determined whether that thread is the current thread: if not, the current thread enters the CLH queue and blocks; if so, it checks whether the reentrancy count would exceed the maximum, raising an error if it would, and otherwise updates the synchronization state.

Optimistic locking optimizes parallel operations

CAS is the core algorithm for implementing optimistic locking. It involves three operands: V (the variable to update), E (the expected value), and N (the new value).

The variable is set to the new value only when its current value equals the expected value. If they differ, another thread has already updated the variable; in that case the current thread does nothing and simply returns the actual current value of V.

CAS achieves atomicity by invoking the processor's underlying instructions, which rely on mechanisms such as cache locking.
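A sketch of the usual CAS retry loop, using AtomicInteger.compareAndSet, which compiles down to such a processor instruction (the class itself is illustrative):

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);  // V: the variable to update

    public int increment() {
        for (;;) {
            int expected = value.get();      // E: the expected value
            int next = expected + 1;         // N: the new value
            if (value.compareAndSet(expected, next)) {
                return next;                 // CAS succeeded
            }
            // another thread updated V first; loop and retry with the fresh value
        }
    }
}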

Context switch

In a concurrent program, if the number of threads is set too small, the program cannot fully utilize system resources; if the number of threads is set too large, it may cause excessive competition for resources, resulting in additional system overhead caused by context switching.

A thread being deprived of the processor and suspended is a "switch-out"; a thread being chosen to occupy the processor and start or resume running is a "switch-in". Across each switch-out and switch-in, the operating system saves and restores the corresponding progress information, and this progress information is the "context".

A thread's state changes from RUNNING to BLOCKED, then from BLOCKED to RUNNABLE, and it is then selected by the scheduler to execute again; this whole sequence is one context-switch process.

When a thread goes from the RUNNING state to the BLOCKED state, we say the thread is suspended. Once it is switched out, the operating system saves its context so that when the thread later re-enters the RUNNABLE state it can continue from its previous progress.

When a thread goes from the BLOCKED state to the RUNNABLE state, we say the thread is woken up. The thread then obtains the context saved last time and continues executing from where it left off.

Spontaneous context switching means the thread switches out because of a call made by the Java program itself, such as:

sleep()
wait()
yield()
join()
park()
synchronized
Lock

Non-spontaneous context switching means the thread is forced out by the scheduler. Common causes are the thread exhausting its allocated time slice, a garbage collection pause in the virtual machine, or preemption by a higher-priority thread.

In the Java virtual machine, object memory is allocated from the virtual machine's heap. While the program runs, new objects are created continuously; if old objects were never reclaimed after use, heap memory would quickly run out. The Java virtual machine therefore provides a garbage collection mechanism that reclaims objects no longer in use, ensuring heap memory can keep being allocated. This garbage collection mechanism can trigger stop-the-world events, which are effectively thread-suspension behavior.

Concurrent execution is not necessarily faster than serial execution, because multi-threading also pays for context switching. The designs of Redis and Node.js illustrate the advantages of single-threaded serial execution well.

  • On Linux, you can use the vmstat command provided by the Linux kernel to monitor the system's context-switching frequency while the Java program runs.
  • To monitor the context switches of a specific application, use the pidstat command against the target process: pidstat -w -l -p <pid>, as shown below.
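For example (the pid is a placeholder for the target Java process; the trailing 1 samples once per second):

vmstat 1                 // the "cs" column reports system-wide context switches per second
pidstat -w -l -p <pid> 1 // cswch/s = voluntary switches, nvcswch/s = involuntary switches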

In which specific steps of the switching process does the system overhead occur?

  • The operating system saves and restores the context;
  • the scheduler performs thread scheduling;
  • the processor cache reloads;
  • context switching may also flush the entire cache area, incurring further time overhead.

Contention for lock resources among multiple threads causes context switching, and the more threads blocked by lock contention, the more frequent the context switches and the greater the system's performance overhead. Clearly, in multi-threaded programming, locks themselves are not the source of performance overhead; contended locks are.


Origin blog.csdn.net/weixin_46488959/article/details/127149879