Concurrency: CAS operations, implementation, and optimization principles

Brief introduction

In Java, many concurrency tools use CAS (Compare And Swap, exposed in the atomic classes as compareAndSet) to guarantee the correctness and efficiency of concurrently accessed data, for example:

  • Classes in java.util.concurrent and java.util.concurrent.atomic, such as AtomicInteger
  • Classes in java.util.concurrent.locks, such as ReentrantLock and WriteLock
  • Others

Most people meet CAS through the AtomicXXX classes and the various Lock implementations. We know these use CAS underneath. CAS takes an expected value (expect) and a new value (update): if the current value equals the expected value, the update is performed; otherwise the operation fails. This is how atomic updates of the data are achieved.
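The expect/update contract can be seen with the public AtomicInteger.compareAndSet method, without touching Unsafe. This is a minimal sketch. The class name CasSemanticsDemo is made up for illustration. The first CAS succeeds because the expected value matches. The second fails because the value has already changed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: shows when a CAS succeeds and when it fails
class CasSemanticsDemo {
    static String run() {
        AtomicInteger value = new AtomicInteger(5);

        // expect (5) equals the current value: the update to 6 is performed, returns true
        boolean first = value.compareAndSet(5, 6);

        // expect (5) no longer equals the current value (6): nothing changes, returns false
        boolean second = value.compareAndSet(5, 7);

        return first + " " + second + " " + value.get();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "true false 6"
    }
}
```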

CAS guarantees atomicity at the hardware level. Its benefits are:

  • It avoids the heavy overhead of synchronous blocking, i.e. suspending and waking threads
  • It hands the job of guaranteeing atomicity to the underlying hardware. The hardware performs much better than blocking, suspending and waking threads, so concurrency is better
  • The caller can decide what to do next from the result CAS returns, which keeps the data consistent. For example, a failed increment can be retried in a loop until it succeeds (discussed below)
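The "retry in a loop until it succeeds" pattern from the last bullet can be sketched with AtomicInteger.compareAndSet. This is a simplified illustration, and the class CasLoopCounter is hypothetical. The JDK's own AtomicInteger.incrementAndGet is implemented differently, through Unsafe. Each iteration reads the current value as the expectation, computes the update, and starts over if another thread changed the value in between.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical counter showing a CAS retry loop; not the JDK's own AtomicInteger code
class CasLoopCounter {
    private final AtomicInteger value = new AtomicInteger();

    int incrementAndGet() {
        for (;;) {
            int current = value.get();   // read the current value (the "expect")
            int next = current + 1;      // compute the new value (the "update")
            if (value.compareAndSet(current, next)) {
                return next;             // CAS succeeded: nobody changed value in between
            }
            // CAS failed: another thread updated value first, so read again and retry
        }
    }

    int get() {
        return value.get();
    }

    // Starts `threads` threads that each call incrementAndGet() `perThread` times
    static int runConcurrently(int threads, int perThread) throws InterruptedException {
        CasLoopCounter counter = new CasLoopCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join(); // wait so the final value is complete
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runConcurrently(100, 1000)); // prints 100000
    }
}
```

Unlike the plain value++ shown next, no increment is lost here. A thread whose CAS fails simply tries again with the fresh value.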

First, let's look at an incorrect increment()

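    public class Test {

        private int value = 0;

        public static void main(String[] args) throws InterruptedException {
            Test test = new Test();
            test.increment();
            System.out.println("Expected value: " + 100 * 100 + ", final result: " + test.value);
        }

        private void increment() throws InterruptedException {
            Thread[] threads = new Thread[100];
            for (int i = 0; i < 100; i++) {
                threads[i] = new Thread(() -> {
                    for (int j = 0; j < 100; j++) {
                        value++; // not atomic: load, add, store
                    }
                });
                threads[i].start();
            }
            // Wait for all threads so the printed value is final
            for (Thread t : threads) {
                t.join();
            }
        }
    }

Output: Expected value: 10000, final result: 9900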

The output shows that the result is wrong. value++ is not an atomic operation. It is split into three steps: load, add and store. With multiple threads, one thread may load the value while another thread has already done its add but not yet its store. Both threads then store results computed from the same old value, so increments are lost and the final value can be smaller than expected.

Declaring the field as volatile int value would not help. A single read or write of a 32-bit int is already atomic by itself. volatile only forbids reordering of the corresponding instructions and guarantees visibility. It cannot make the three steps (load, add, store) execute as one atomic unit. For 64-bit types such as long, however, adding volatile does matter. On 32-bit machines a 64-bit write may be split across two bus transactions, so the write becomes a two-step operation: one 32-bit half is written first, then the other. The order in which bus transactions execute is decided by bus arbitration and is not guaranteed. Another operation may therefore slip in between the two halves. For example, a read in between would see a value of which only half has been written.

Using CAS to make increment() correct

CAS operations are essentially encapsulated in the Unsafe class. Unsafe is not meant for external use, because it is considered unsafe. For example, calling Unsafe unsafe = Unsafe.getUnsafe(); directly from application code throws Exception in thread "main" java.lang.SecurityException: Unsafe.

Looking at the source code, this happens because getUnsafe() checks the class loader of its caller:

    public static Unsafe getUnsafe() {
        Class var0 = Reflection.getCallerClass();
        if (!VM.isSystemDomainLoader(var0.getClassLoader())) {
            throw new SecurityException("Unsafe");
        } else {
            return theUnsafe;
        }
    }

So we can obtain it via reflection instead. This is not recommended in practice and is used here only for demonstration:

import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class Test {

    private static final Unsafe UNSAFE;
    // Memory offset of the field "value" inside a Test object, used to locate value directly
    private static final long VALUE_OFFSET;

    static {
        try {
            UNSAFE = getUnsafe();
            // The offset is bound to the member variable value of this class
            VALUE_OFFSET = UNSAFE.objectFieldOffset(Test.class.getDeclaredField("value"));
        } catch (Exception ex) { throw new Error(ex); }
    }

    private int value;

    public static void main(String[] args) throws InterruptedException {
        Test test = new Test();
        test.increment();
    }

    private void increment() throws InterruptedException {
        Thread[] threads = new Thread[100];
        for (int i = 0; i < 100; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    // Atomically adds 1 to this.value
                    UNSAFE.getAndAddInt(this, VALUE_OFFSET, 1);
                }
            });
            threads[i].start();
        }
        // Wait for all threads to finish before reading the result
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("Expected result: " + 100 * 1000);
        System.out.println("Actual result: " + value);
    }

    // Obtain Unsafe via reflection
    private static Unsafe getUnsafe() throws NoSuchFieldException, IllegalAccessException {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        return (Unsafe) field.get(null);
    }
}

This time the output shows the correct result.

How CAS works underneath

Going one level deeper, getAndAddInt calls unsafe.compareAndSwapInt(Object obj, long valueOffset, int expect, int update). How is this method implemented in HotSpot? compareAndSwapInt is a native method. The HotSpot source defines it in unsafe.cpp:

UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapInt(JNIEnv *env, jobject unsafe, 
jobject obj, jlong offset, jint e, jint x))
  UnsafeWrapper("Unsafe_CompareAndSwapInt");
  oop p = JNIHandles::resolve(obj);
  jint* addr = (jint *) index_oop_from_field_offset_long(p, offset);
  return (jint)(Atomic::cmpxchg(x, addr, e)) == e;
UNSAFE_END

The work is done by Atomic::cmpxchg(x, addr, e). Different hardware needs different code, and HotSpot hides these details from us. There are separate implementations for solaris, windows, linux_x86 and other platforms. For linux_x86, the most common server platform, the code is as follows:

inline jint Atomic::cmpxchg (jint exchange_value, volatile jint*  dest, 
jint compare_value) {
  int mp = os::is_MP();
  __asm__ volatile (LOCK_IF_MP(%4) "cmpxchgl %1,(%3)"
                    : "=a" (exchange_value)
                    : "r" (exchange_value), "a" (compare_value), "r" (dest), "r" (mp)
                    : "cc", "memory");
  return exchange_value;
}

A few points can be seen from the code above:

  • HotSpot implements the function by directly using the corresponding assembly instruction
  • __asm__ indicates that what follows is inline assembly code
  • volatile here differs from Java's volatile: it tells the compiler not to optimize the assembly code
  • LOCK_IF_MP: if the system is a multiprocessor (multi-core) system, a lock is needed to guarantee atomicity
  • cmpxchgl is the assembly compare-and-exchange instruction

From this we can see that, at the lowest level, CAS relies on a lock to guarantee atomicity. Early Intel implementations locked the bus directly. Other processors then could not access the bus to carry out their own operations, which greatly reduced performance.

Intel optimized this in later processors. On x86 it may only be necessary to lock the specific memory address. Other processors can keep using the bus to access memory, and they are blocked only when they access the data at that locked address. This greatly improves performance.

But consider the following problems:

  1. When concurrency is very high, threads may keep competing for the same value. Many threads can then stay in a loop in which their CAS keeps failing and the data cannot be updated, consuming excessive CPU resources
  2. The ABA problem: for example, one thread changes a value from A to B and then back to A. A later thread (whose CAS expects A) concludes that the data has not changed, when in fact it has been modified

Java 8's optimizations for CAS

The ABA problem can be handled by adding a version number that is incremented on every operation. A changed version number shows that the value was modified at some point, even if it now looks the same. Java provides the AtomicStampedReference class to solve this problem.
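The difference can be sketched with AtomicReference and AtomicStampedReference from java.util.concurrent.atomic. To keep the outcome deterministic, this minimal sketch simulates the interleaving of two threads inside a single thread, and the class name AbaDemo is hypothetical. Both classes compare references with ==, and the String literals here are interned, so "A" is always the same object. The plain CAS is fooled by A -> B -> A. The stamped CAS fails because the stamp moved from 0 to 2.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicStampedReference;

// Simulates the ABA interleaving of two threads in one thread so the result is deterministic
class AbaDemo {
    // Plain CAS compares only the reference, so A -> B -> A goes unnoticed
    static boolean plainCasAfterAba() {
        AtomicReference<String> ref = new AtomicReference<>("A");
        String seen = ref.get();              // "thread 1" reads A
        ref.compareAndSet("A", "B");          // "thread 2" changes A -> B
        ref.compareAndSet("B", "A");          // "thread 2" changes B -> A
        return ref.compareAndSet(seen, "C");  // thread 1's CAS succeeds anyway
    }

    // Stamped CAS also compares a version number that is bumped on every update
    static boolean stampedCasAfterAba() {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);
        String seen = ref.getReference();     // "thread 1" reads A ...
        int seenStamp = ref.getStamp();       // ... with stamp 0
        ref.compareAndSet("A", "B", 0, 1);    // "thread 2": A -> B, stamp 0 -> 1
        ref.compareAndSet("B", "A", 1, 2);    // "thread 2": B -> A, stamp 1 -> 2
        // Fails: the reference is A again, but the stamp is 2, not the 0 thread 1 saw
        return ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println(plainCasAfterAba() + " " + stampedCasAfterAba()); // prints "true false"
    }
}
```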

Java 8 also adds optimizations for the first problem. It provides new tools for it: LongAdder, DoubleAdder, LongAccumulator and DoubleAccumulator.

Let's look at one of them; the others are similar.

Its class hierarchy shows that it is Serializable and a Number, and that it extends Striped64, which supports dynamic striping.

It works mainly through a segmented CAS mechanism and an automatic segment migration mechanism. At first, CAS operations are performed on a single base value. Once there are too many concurrent threads, the threads are spread across a cells array. Each cell is accumulated separately by the threads assigned to it, and the final result is obtained by merging base and all the cells.

Figure from "[Compilation] Java 8's optimizations for CAS"
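A short usage sketch of LongAdder (java.util.concurrent.atomic.LongAdder) repeats the 100 threads x 1000 increments example from earlier. The striping into base and cells is internal and not visible through the public API, so this only shows usage. The class name LongAdderDemo is hypothetical. sum() merges base and all cells. It is exact here only because it is called after every thread has finished. During concurrent updates it is not an atomic snapshot.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo: many threads increment one shared LongAdder
class LongAdderDemo {
    static long count(int threads, int perThread) throws InterruptedException {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    adder.increment(); // under contention this may update a cell instead of base
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        // sum() adds base and every cell; exact here because all updates have finished
        return adder.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(100, 1000)); // prints 100000
    }
}
```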

To sum up

Compared with directly blocking threads through synchronization (suspending and waking them), using CAS appropriately, or combining the two approaches, can improve concurrent performance by up to an order of magnitude.

  • ReentrantLock and similar locks combine synchronous blocking with CAS to achieve high performance. In ReentrantLock, tryAcquire() first tries to obtain the lock with CAS. If that fails, the thread is placed in a blocking queue to wait to be woken later
  • A spin lock, for example, tries to acquire the lock with CAS a specified number of times. If it still fails, the thread is suspended and enters a blocking queue to wait to be woken
  • AtomicInteger's increment, for example, keeps polling: it reads the value and tries to update it until the operation succeeds
  • CAS polling involves no blocking, suspending or waking, so it can respond to requests quickly and consume fewer resources. Suspending and resuming a thread requires switching from user mode to kernel mode and saving a "snapshot" of the thread's state, which is slow and expensive. On the other hand, polling costs CPU time, so in practice the two approaches are often combined to some extent
  • Understanding CAS is therefore very important
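The "spin with CAS a limited number of times, then block and wait to be woken" strategy can be sketched as a small lock. This is a simplified, non-reentrant illustration. SpinThenParkLock and its MAX_SPINS value are hypothetical names and numbers, and this is not how ReentrantLock is actually implemented. The blocking part follows the FIFOMutex example in the java.util.concurrent.locks.LockSupport documentation: waiting threads queue up, park, and are unparked by unlock(). Spinning threads may barge ahead of queued ones, so the lock is not fair.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical non-reentrant lock: spin with CAS a bounded number of times, then block
class SpinThenParkLock {
    private static final int MAX_SPINS = 100; // assumed spin budget, not a tuned value
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private final Queue<Thread> waiters = new ConcurrentLinkedQueue<>();

    void lock() {
        // Fast path: try to take the lock with CAS a limited number of times
        for (int i = 0; i < MAX_SPINS; i++) {
            if (locked.compareAndSet(false, true)) {
                return;
            }
        }
        // Slow path: join the queue and park until this thread is the head and its CAS succeeds
        Thread current = Thread.currentThread();
        boolean wasInterrupted = false;
        waiters.add(current);
        while (waiters.peek() != current || !locked.compareAndSet(false, true)) {
            LockSupport.park(this);
            if (Thread.interrupted()) {
                wasInterrupted = true; // ignore interrupts while waiting, restore them afterwards
            }
        }
        waiters.remove(); // removes the head, which is this thread
        if (wasInterrupted) {
            current.interrupt();
        }
    }

    void unlock() {
        locked.set(false);
        LockSupport.unpark(waiters.peek()); // wake the first waiter; no effect if the queue is empty
    }

    // Usage: several threads increment a plain int, protected by the lock
    static int demo(int threads, int perThread) throws InterruptedException {
        SpinThenParkLock lock = new SpinThenParkLock();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return counter[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo(8, 10000)); // prints 80000
    }
}
```

The spin budget trades CPU time for latency. A larger MAX_SPINS keeps short critical sections entirely in user mode. A smaller one parks sooner and wastes less CPU when the lock is held for a long time.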

Reference: CAS in Java

Origin juejin.im/post/5d9b2ca0e51d4577f3534e93