Java Multithreading and High Concurrency: An In-Depth Look at CAS (Optimistic Locking)

Table of Contents

What is CAS

Introduction

Background

Source code analysis

AtomicInteger

unsafe.cpp

What problem is solved?

What are the defects?

1. The ABA problem (linked lists can lose data)

2. Long spins are very CPU-intensive

3. Only atomic operations on a single shared variable are guaranteed

Application scenarios

Java 8 incrementAndGet optimization

False sharing

 


What is CAS

  • CAS (compare and swap), "compare and replace", is a technique used in concurrent, lock-free algorithms

  • CAS is an atomic operation that guarantees safety under concurrency; it does not by itself provide higher-level synchronization

  • CAS is a CPU instruction

  • CAS is a non-blocking, lightweight optimistic lock

Introduction

CAS is short for Compare-And-Swap, a mechanism for achieving synchronization in a multi-threaded environment. A CAS operation involves three operands: a memory location, an expected value, and a new value. The logic of CAS is to compare the value at the memory location with the expected value; if they are equal, the value at the memory location is replaced with the new value, and if they are not equal, nothing is done [many articles online describe CAS itself as a loop, which is inaccurate — the loop is built around it].
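These semantics are easy to see with AtomicInteger.compareAndSet, which exposes a single CAS with no loop around it (class name CasDemo is mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger value = new AtomicInteger(10);

        // Expected value matches the memory value: the swap happens, returns true
        boolean swapped = value.compareAndSet(10, 42);
        System.out.println(swapped + " " + value.get()); // true 42

        // Expected value is stale: nothing is written, returns false
        boolean failed = value.compareAndSet(10, 99);
        System.out.println(failed + " " + value.get());  // false 42
    }
}
```

A single call either succeeds or fails immediately; retrying on failure is the caller's decision.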

Background

The CPU transfers data to and from memory over a bus. In the multi-core era, multiple cores communicate with memory and other hardware over a shared bus, as shown below:

Image source: "In-Depth Understanding of Computer Systems"

The figure above is a simplified diagram of a computer's structure, but it is enough to illustrate the problem. In it, the CPU communicates with memory over the bus marked by the two blue arrows. Now consider: what happens if multiple CPU cores operate on the same memory location at the same time, with no coordination?

Suppose Core 1 writes 64 bits of data to memory over a 32-bit bus; it needs two write operations to complete the transfer. If Core 2 starts reading the 64-bit value from that memory location after Core 1 has written only the first 32 bits, the data it reads is bound to be torn. For this particular case, however, there is no need to worry: starting with the Pentium, Intel processors guarantee atomic reads and writes of quadwords aligned on a 64-bit boundary.

So Intel processors guarantee that aligned instructions which access memory once execute atomically. But what about an instruction that accesses memory twice? The answer is that atomicity is not guaranteed. For example, the increment instruction inc dword ptr [...] is equivalent to DEST = DEST + 1. It consists of three operations, read -> modify -> write, and therefore accesses memory twice. Consider a situation where the value 1 is stored at some memory location and both CPU cores execute the instruction on it at the same time. The interleaved execution of the two cores might proceed as follows:

1. Core 1 reads the value 1 from the specified location in memory and loads it into the register

2. Core 2 reads the value 1 from the specified location in memory and loads it into the register

3. Core 1 increments the value in the register by 1

4. Core 2 increments the value in the register by 1

5. Core 1 writes the modified value back to memory

6. Core 2 writes the modified value back to memory

After this interleaving, the final value in memory is 2 while we expected 3 — a lost update. To deal with this problem, two or more cores must be kept from operating on the same memory region at the same time. How? This introduces the protagonist of this article: the lock prefix.
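The lost-update interleaving above has a direct Java counterpart: an unlocked `plain++` performs the same read -> modify -> write, while AtomicInteger's increment goes through a lock-prefixed cmpxchg. A runnable sketch (class name LostUpdateDemo is mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdateDemo {
    static int plain = 0;
    static final AtomicInteger atomic = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                  // read -> modify -> write, not atomic
                atomic.incrementAndGet(); // CAS loop, atomic
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("plain  = " + plain);        // usually less than 200000
        System.out.println("atomic = " + atomic.get()); // always 200000
    }
}
```

The plain counter loses updates exactly as in the two-core trace above; the atomic counter never does.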

LOCK—Assert LOCK# Signal Prefix
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of the shared memory involved. The lock prefix may be applied to the following instructions: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG

By adding the lock prefix before the inc instruction, the instruction becomes atomic: when multiple cores execute the same locked inc at the same time, they proceed serially, which avoids the interleaving described above.

How does the lock prefix let a core monopolize a region of memory?

Intel processors have two ways of giving one core exclusive access to a memory region. The first is to lock the bus, so that one core uses it exclusively. This is expensive: while the bus is locked no other core can access memory at all, which can effectively stall the other cores for a while. The second is to lock the cache: if the memory data is already cached in the processor, the LOCK# signal issued by the processor does not lock the bus but instead locks the cache line backing that memory region. While the line is locked, other processors cannot perform operations on that region. Compared with locking the bus, locking the cache is significantly cheaper.

Simply put, the lock prefix does the following:

  • If the memory region accessed by the lock-prefixed instruction is already held in the processor's internal cache (that is, the cache line containing it is in the exclusive or modified state), and the region is contained entirely within a single cache line, the processor executes the instruction directly. Because the cache line stays locked for the duration of the instruction, other processors cannot read or write the memory region the instruction accesses, so the instruction's atomicity is guaranteed

  • Reordering of this instruction with earlier and later reads and writes is forbidden

  • All data in the write buffer is flushed to memory

Source code analysis

With the background above, we can now read the CAS source at leisure. We will use the compareAndSet method of java.util.concurrent.atomic.AtomicInteger as the entry point for the analysis.

Description:

cmpxchg: the "compare and exchange" instruction

dword: short for double word. On x86/x64, a word = 2 bytes, so dword = 4 bytes = 32 bits

ptr: short for pointer; used together with the preceding dword, it indicates that the memory operand is a double-word unit. [edx]: [...] denotes a memory operand; edx is a register holding the dest pointer value, so [edx] is the memory unit whose address is dest

AtomicInteger 

public class AtomicInteger extends Number implements java.io.Serializable {

    // setup to use Unsafe.compareAndSwapInt for updates
    private static final Unsafe unsafe = Unsafe.getUnsafe();
    private static final long valueOffset;

    static {
        try {
            // compute the offset of the field `value` within the object
            valueOffset = unsafe.objectFieldOffset
                (AtomicInteger.class.getDeclaredField("value"));
        } catch (Exception ex) { throw new Error(ex); }
    }

    private volatile int value;

    public final boolean compareAndSet(int expect, int update) {
        /*
         * compareAndSet is really just a shell; the main logic is
         * encapsulated in Unsafe's compareAndSwapInt method
         */
        return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
    }

    // adds 1 to the Integer value
    public final int getAndIncrement() {
        // the interesting part is getAndAddInt
        return unsafe.getAndAddInt(this, valueOffset, 1);
    }

    // var1 is the `this` pointer
    // var2 is the field offset
    // var4 is the delta to add (1 or N)
    public final int getAndAddInt(Object var1, long var2, int var4) {
        int var5;
        do {
            // read the current value from memory; it may already be stale,
            // so call it the expected value E
            var5 = this.getIntVolatile(var1, var2);
            // compareAndSwapInt is the key call:
            // var5 is the expected value, var5 + var4 the value to write.
            // It invokes the CAS intrinsic via JNI: the thread compares the
            // value in memory M with the expected value E; if they match,
            // M is updated to var5 + var4, otherwise the loop spins and retries
        } while(!this.compareAndSwapInt(var1, var2, var5, var5 + var4));

        return var5;
    }
    // ......
}

public final class Unsafe {
    // compareAndSwapInt is a native method; keep reading
    public final native boolean compareAndSwapInt(Object o, long offset,
                                                  int expected,
                                                  int x);
    // ......
}

unsafe.cpp 

// unsafe.cpp
/*
 * This may not look like a function, but don't worry, it is not the point.
 * UNSAFE_ENTRY and UNSAFE_END are macros that are expanded into real code
 * during preprocessing. jboolean, jlong, jint and so on are typedefs:
 *
 * jni.h
 *     typedef unsigned char   jboolean;
 *     typedef unsigned short  jchar;
 *     typedef short           jshort;
 *     typedef float           jfloat;
 *     typedef double          jdouble;
 *
 * jni_md.h
 *     typedef int jint;
 *     #ifdef _LP64 // 64-bit
 *     typedef long jlong;
 *     #else
 *     typedef long long jlong;
 *     #endif
 *     typedef signed char jbyte;
 */
UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapInt(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jint e, jint x))
  UnsafeWrapper("Unsafe_CompareAndSwapInt");
  oop p = JNIHandles::resolve(obj);
  // compute the address of `value` from the offset; the offset here is
  // the valueOffset from AtomicInteger
  jint* addr = (jint *) index_oop_from_field_offset_long(p, offset);
  // call Atomic::cmpxchg, which is declared in atomic.hpp
  return (jint)(Atomic::cmpxchg(x, addr, e)) == e;
UNSAFE_END

// atomic.cpp
unsigned Atomic::cmpxchg(unsigned int exchange_value,
                         volatile unsigned int* dest, unsigned int compare_value) {
  assert(sizeof(unsigned int) == sizeof(jint), "more work to do");
  /*
   * Dispatch to the overload for the current platform; the preprocessor
   * decides at compile time which platform's overload is included. The
   * relevant preprocessing logic:
   *
   * atomic.inline.hpp:
   *    #include "runtime/atomic.hpp"
   *
   *    // Linux
   *    #ifdef TARGET_OS_ARCH_linux_x86
   *    # include "atomic_linux_x86.inline.hpp"
   *    #endif
   *
   *    // some code omitted
   *
   *    // Windows
   *    #ifdef TARGET_OS_ARCH_windows_x86
   *    # include "atomic_windows_x86.inline.hpp"
   *    #endif
   *
   *    // BSD
   *    #ifdef TARGET_OS_ARCH_bsd_x86
   *    # include "atomic_bsd_x86.inline.hpp"
   *    #endif
   *
   * Next we analyze the cmpxchg implementation in atomic_windows_x86.inline.hpp
   */
  return (unsigned int)Atomic::cmpxchg((jint)exchange_value, (volatile jint*)dest,
                                       (jint)compare_value);
}

The analysis above looks long, but the main flow is not complicated, and it reads fairly easily if you don't get tangled up in the detailed code. Next I will analyze the Atomic::cmpxchg function for the Windows platform.

// atomic_windows_x86.inline.hpp
#define LOCK_IF_MP(mp) __asm cmp mp, 0  \
                       __asm je L0      \
                       __asm _emit 0xF0 \
                       __asm L0:
              
inline jint Atomic::cmpxchg (jint exchange_value, volatile jint* dest, jint compare_value) {
  // alternative for InterlockedCompareExchange
  // mp is the result of os::is_MP(), an inline function that checks whether
  // the current system is a multiprocessor: it returns 1 if so, 0 otherwise.
  int mp = os::is_MP();
  __asm {
    mov edx, dest
    mov ecx, exchange_value
    mov eax, compare_value
    // LOCK_IF_MP(mp) decides from mp whether to prepend the lock prefix to
    // cmpxchg: it is added on a multiprocessor (mp == 1) and omitted otherwise.
    // This is an optimization: on a uniprocessor the lock prefix is considered
    // unnecessary, and it is only added in the multi-core case, because lock
    // costs performance. cmpxchg is the assembly instruction that compares and
    // exchanges operands.
    LOCK_IF_MP(mp)
    cmpxchg dword ptr [edx], ecx
  }
}

The code above consists of the LOCK_IF_MP macro and the cmpxchg function. To see it more clearly, let's expand LOCK_IF_MP inside cmpxchg to its actual content, as follows:

inline jint Atomic::cmpxchg (jint exchange_value, volatile jint* dest, jint compare_value) {
  // check whether this is a multi-core CPU
  int mp = os::is_MP();
  __asm {
    // move the arguments into registers
    mov edx, dest    // note: dest is a pointer; the memory address goes into edx
    mov ecx, exchange_value
    mov eax, compare_value

    // LOCK_IF_MP
    cmp mp, 0
    /*
     * If mp = 0, the thread is running on a uniprocessor. je then jumps to
     * the label L0, skipping over the `_emit 0xF0` instruction and executing
     * cmpxchg directly, i.e. without a lock prefix on the cmpxchg below.
     */
    je L0
    /*
     * 0xF0 is the machine code of the lock prefix; it is emitted directly as
     * a byte instead of writing `lock`. For the reason, see this Zhihu answer:
     *     https://www.zhihu.com/question/50878124/answer/123099923
     */
    _emit 0xF0
L0:
    /*
     * Compare and exchange. A brief explanation of the instruction below
     * (readers fluent in assembly can skip it):
     *   cmpxchg: the "compare and exchange" instruction
     *   dword: short for double word; on x86/x64,
     *          word = 2 bytes, dword = 4 bytes = 32 bits
     *   ptr: short for pointer; together with the preceding dword it says the
     *        memory operand is a double-word unit
     *   [edx]: [...] denotes a memory operand; edx is a register holding the
     *          dest pointer, so [edx] is the memory unit at address dest
     *
     * The instruction compares the value in eax (compare_value) with the value
     * in the double-word unit [edx]; if they are equal, the value in ecx
     * (exchange_value) is stored into [edx].
     */
    cmpxchg dword ptr [edx], ecx
  }
}

 

At this point the walk-through of the CAS implementation is complete; CAS cannot be implemented without processor support. For all the code above, the core is really just a cmpxchg instruction carrying a lock prefix: lock cmpxchg dword ptr [edx], ecx

Note: CAS only guarantees the atomicity of the operation; it does not guarantee visibility of the variable, which is why the variable must also be declared volatile.

 

What problem is solved?

Before JDK 1.5, the Java language relied on the synchronized keyword for synchronization, which means relying on locks. The lock mechanism has the following problems:

  • Under multi-threaded contention, acquiring and releasing locks causes extra context switches and scheduling delays, hurting performance

  • While one thread holds a lock, every other thread that needs that lock is suspended

  • If a high-priority thread has to wait for a low-priority thread to release a lock, priority inversion occurs, a performance hazard

Volatile is a useful mechanism that guarantees visibility of data between threads, but volatile cannot guarantee atomicity. So for synchronization it seems we must fall back on the lock mechanism.

An exclusive lock is a pessimistic lock, and synchronized is an exclusive lock: it suspends all other threads that need the lock until the holder releases it. A more efficient alternative is optimistic locking. So-called optimistic locking means performing the operation without taking a lock, assuming there is no conflict, and, if the operation fails because of a conflict, retrying until it succeeds.
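A minimal sketch of such an optimistic, retry-on-conflict update — here a lock-free running maximum (class name OptimisticMax is mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OptimisticMax {
    private final AtomicInteger max = new AtomicInteger(Integer.MIN_VALUE);

    // Optimistic update: read, compute, CAS; retry only if another
    // thread changed the value in between
    public void offer(int candidate) {
        int current;
        do {
            current = max.get();
            if (candidate <= current) return; // nothing to update
        } while (!max.compareAndSet(current, candidate));
    }

    public int get() { return max.get(); }

    public static void main(String[] args) {
        OptimisticMax m = new OptimisticMax();
        m.offer(3); m.offer(7); m.offer(5);
        System.out.println(m.get()); // 7
    }
}
```

No thread is ever suspended; a losing thread simply re-reads and tries again.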

What are the defects?

1. The ABA problem (linked lists can lose data)

Because CAS checks whether the old value has changed before updating it, a value that was originally A, changed to B, and then changed back to A looks unchanged when CAS checks it, even though it did change. The solution to the ABA problem is a version number: attach a version to the variable and increment it on every update, so A-B-A becomes 1A-2B-3A.
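The JDK ships this version-number idea as AtomicStampedReference; a minimal sketch of how the stale stamp defeats an A-B-A sequence (class name AbaDemo is mine):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        // reference "A" with initial stamp (version) 1
        AtomicStampedReference<String> ref =
                new AtomicStampedReference<>("A", 1);

        int stamp = ref.getStamp(); // 1

        // another "thread" performs A -> B -> A, bumping the stamp each time
        ref.compareAndSet("A", "B", stamp, stamp + 1);
        ref.compareAndSet("B", "A", stamp + 1, stamp + 2);

        // the value looks like "A" again, but the stale stamp makes CAS fail
        boolean ok = ref.compareAndSet("A", "C", stamp, stamp + 1);
        System.out.println(ok + " " + ref.getReference()); // false A
    }
}
```

A plain AtomicReference CAS would have succeeded here and silently missed the A-B-A change.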

2. Long spins are very CPU-intensive

A spin is one iteration of the CAS loop. If a thread is particularly unlucky and the value it reads has always been modified by another thread by the time it tries the CAS, it keeps spinning and comparing until it succeeds. The CPU overhead of this process is very high, so it should be avoided where possible. If the JVM can use the pause instruction provided by the processor, efficiency improves somewhat. The pause instruction does two things. First, it delays the pipelined execution of the next instruction (de-pipelining) so that the CPU does not consume excessive execution resources; the delay depends on the implementation version, and on some processors it is zero. Second, it avoids the CPU pipeline flush caused by a memory-order violation when exiting the loop, improving CPU execution efficiency.

3. Only atomic operations on a single shared variable are guaranteed

When operating on one shared variable, a CAS loop guarantees atomicity, but when operating on several shared variables, a CAS loop cannot. Since Java 1.5 the JDK provides the AtomicReference class, which guarantees atomic updates of a reference: you can put several variables into one object and perform the CAS on the reference.
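A minimal sketch of the several-variables-in-one-object idea with AtomicReference (class and field names are mine):

```java
import java.util.concurrent.atomic.AtomicReference;

public class MultiVarCas {
    // immutable snapshot of two related variables
    static final class Range {
        final int lo, hi;
        Range(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    private final AtomicReference<Range> range =
            new AtomicReference<>(new Range(0, 10));

    // update both bounds atomically with a single CAS on the reference
    public void widen(int by) {
        Range cur, next;
        do {
            cur = range.get();
            next = new Range(cur.lo - by, cur.hi + by);
        } while (!range.compareAndSet(cur, next));
    }

    public static void main(String[] args) {
        MultiVarCas m = new MultiVarCas();
        m.widen(5);
        Range r = m.range.get();
        System.out.println(r.lo + " " + r.hi); // -5 15
    }
}
```

Because Range is immutable, no thread can ever observe a half-updated pair of bounds.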

Application scenarios

  • Spin locks

  • Token-bucket rate limiter (RateLimiter::refillToken in Eureka): guarantees that, under multiple threads, refilling and consuming tokens never block

public class RateLimiter {

    private final long rateToMsConversion;

    private final AtomicInteger consumedTokens = new AtomicInteger();
    private final AtomicLong lastRefillTime = new AtomicLong(0);

    @Deprecated
    public RateLimiter() {
        this(TimeUnit.SECONDS);
    }

    public RateLimiter(TimeUnit averageRateUnit) {
        switch (averageRateUnit) {
            case SECONDS:
                rateToMsConversion = 1000;
                break;
            case MINUTES:
                rateToMsConversion = 60 * 1000;
                break;
            default:
                throw new IllegalArgumentException("TimeUnit of " + averageRateUnit + " is not supported");
        }
    }

    // public entry point for acquiring a token
    public boolean acquire(int burstSize, long averageRate) {
        return acquire(burstSize, averageRate, System.currentTimeMillis());
    }

    public boolean acquire(int burstSize, long averageRate, long currentTimeMillis) {
        if (burstSize <= 0 || averageRate <= 0) { // Instead of throwing exception, we just let all the traffic go
            return true;
        }

        // refill tokens
        refillToken(burstSize, averageRate, currentTimeMillis);

        // consume a token
        return consumeToken(burstSize);
    }

    private void refillToken(int burstSize, long averageRate, long currentTimeMillis) {
        long refillTime = lastRefillTime.get();
        long timeDelta = currentTimeMillis - refillTime;

        // compute how many tokens to add from the elapsed time and the rate
        long newTokens = timeDelta * averageRate / rateToMsConversion;
        if (newTokens > 0) {
            long newRefillTime = refillTime == 0
                    ? currentTimeMillis
                    : refillTime + newTokens * rateToMsConversion / averageRate;

            // CAS guarantees that exactly one thread performs the refill
            if (lastRefillTime.compareAndSet(refillTime, newRefillTime)) {
                while (true) {
                    int currentLevel = consumedTokens.get();
                    int adjustedLevel = Math.min(currentLevel, burstSize); // In case burstSize decreased
                    int newLevel = (int) Math.max(0, adjustedLevel - newTokens);
                    // loop until the update succeeds
                    if (consumedTokens.compareAndSet(currentLevel, newLevel)) {
                        return;
                    }
                }
            }
        }
    }

    private boolean consumeToken(int burstSize) {
        while (true) {
            int currentLevel = consumedTokens.get();
            if (currentLevel >= burstSize) {
                return false;
            }

            // loop until no token is left or one is acquired
            if (consumedTokens.compareAndSet(currentLevel, currentLevel + 1)) {
                return true;
            }
        }
    }

    public void reset() {
        consumedTokens.set(0);
        lastRefillTime.set(0);
    }
}
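The spinlock scenario listed above can be sketched with an AtomicBoolean (class name SpinLock is mine; Thread.onSpinWait is the Java 9+ hint that maps to the pause instruction discussed earlier):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // spin until the CAS from false -> true succeeds
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // hint to the CPU that we are busy-waiting
        }
    }

    public void unlock() {
        locked.set(false);
    }

    public static void main(String[] args) throws InterruptedException {
        SpinLock lock = new SpinLock();
        int[] counter = {0};
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                try { counter[0]++; } finally { lock.unlock(); }
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter[0]); // 200000
    }
}
```

Threads never sleep while waiting, which is exactly why long critical sections should not be protected this way.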

Java 8 incrementAndGet optimization

Because CAS does not lock the method, all threads can enter getAndIncrement(). When too many threads enter at once, a problem appears: every time a thread reaches the CAS step, the value of i has already been modified by another thread, so the thread goes back to the read step and starts over.

And this causes a problem: with threads packed this densely, too many of them want to modify i at the same time, so most of them fail the CAS and waste resources retrying in vain.

Let's briefly describe the optimization (the Cell[] scheme below is the one implemented by Striped64/LongAdder). Internally it maintains an array Cell[] plus a base, and the values live in the Cells. When contention occurs, the JDK picks a Cell by an algorithm and operates on its value; if there is still contention, it retries with another Cell. Finally, the values in Cell[] and base are summed to produce the result.
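The Cell[]-plus-base design just described is available directly as java.util.concurrent.atomic.LongAdder; a minimal usage sketch:

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder adder = new LongAdder();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                adder.increment(); // contending threads spread across Cells
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // sum() adds base plus every Cell value
        System.out.println(adder.sum()); // 200000
    }
}
```

Under heavy contention this typically outperforms a single AtomicLong, at the cost of sum() being a weakly consistent snapshot rather than a single atomic read.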

The code inside is fairly complex, so I picked a few important questions and read the source with them in mind:

  1. When is Cell[] initialized?

  2. Where do we see that, with no contention, only base is operated on?

  3. What are the rules for initializing Cell[]?

  4. When does Cell[] grow?

  5. How do initialization and growth of Cell[] stay thread-safe?

public void add(long x) {
    Cell[] cs; long b, v; int m; Cell c;
    if ((cs = cells) != null || !casBase(b = base, b + x)) { // line 1
        boolean uncontended = true;
        if (cs == null || (m = cs.length - 1) < 0 ||         // line 2
            (c = cs[getProbe() & m]) == null ||              // line 3
            !(uncontended = c.cas(v = c.value, v + x)))      // line 4
            longAccumulate(x, null, uncontended);            // line 5
    }
}

casBase itself is relatively simple: it calls compareAndSet on base and reports whether it succeeded:

  • If there is currently no contention, it returns true.

  • If there is contention, some thread's CAS returns false.

Back to line 1. Reading the condition as a whole: if Cell[] has already been initialized, or there is contention on base, execution proceeds to line 2. With no contention and no initialization, it does not.

This answers the second question: with no contention, only base is operated on, as seen here.

Line 2: an || chain. The first test checks whether cs is null; the second checks whether (cs.length - 1) is less than 0, i.e. whether the array is empty. Both determine whether Cell[] has been initialized; if it has not, execution falls through to line 5.

Line 3: if Cell[] is initialized, compute an index with getProbe() & m, assign cs[index] to c, and check whether it is null; if it is null, execution falls through to line 5. Let's take a quick look at what getProbe() does:

static final int getProbe() {
    return (int) THREAD_PROBE.get(Thread.currentThread());
}

private static final VarHandle THREAD_PROBE;

Line 4: perform a CAS on c and assign the result to uncontended: with no contention it succeeds, under contention it fails. Because of the enclosing !(...), a failed CAS leads to line 5. Note that at this point we are already operating on a Cell element.

Line 5: this method is internally very complicated; let's first look at its overall structure:

There are three top-level ifs. 1. Check whether the cells are initialized; if they are, enter this if.

There are 6 ifs inside it, which looks terrifying, but we don't need to study them all here, because our goal is just to answer the questions raised above.

Let's look at the first one:

The first judgment: use the probe algorithm to pick an element of cs[], assign it to c, and check whether it is null; if it is null, enter this if.

if (cellsBusy == 0) {       // cellsBusy == 0 means "not busy": enter this if
    Cell r = new Cell(x);   // create a Cell
    if (cellsBusy == 0 && casCellsBusy()) { // re-check cellsBusy == 0 and take the lock, so only one thread gets in
        // attach the newly created Cell to Cell[]
        try {
            Cell[] rs; int m, j;
            if ((rs = cells) != null &&
                (m = rs.length) > 0 &&
                rs[j = (m - 1) & h] == null) {
                rs[j] = r;
                break done;
            }
        } finally {
            cellsBusy = 0; // "not busy" again
        }
        continue;           // Slot is now non-empty
    }
}
collide = false;

This supplements the first question: when Cell[] is created, one of its elements is left null; the null element is initialized here — that is, lazily, only when it is first used.

The sixth judgment: check whether cellsBusy is 0 and take the lock; on success, enter this if and expand Cell[].

try {
    if (cells == cs)        // Expand table unless stale
        cells = Arrays.copyOf(cs, n << 1);
} finally {
    cellsBusy = 0;
}
collide = false;
continue;

This answers half of the fifth question: when expanding Cell[], a CAS-based lock is taken, which guarantees thread safety.

What about the fourth question? First note that the outermost construct is an infinite for (;;) loop, which only terminates on a break.

Initially collide is false. In the third if, a CAS is attempted on the Cell; if it succeeds we break, so assume it fails and we reach the fourth if, which checks whether the length of Cell[] has reached the number of CPU cores. If it is still smaller, we reach the fifth judgment; collide is false at this point, so we enter that if, set collide to true (meaning a collision occurred), go to advanceProbe to generate a new THREAD_PROBE, and loop again. If the CAS in the third if fails again, the length check is repeated; if the length is still below the core count we reach the fifth judgment again, but collide is now true, so we skip that if and fall into the sixth judgment, which expands the array. Complicated, isn't it?

In short, Cell[] grows when its length is smaller than the number of CPU cores and a Cell CAS has failed twice in a row.


2. The first two judgments are easy to follow; the main one is the third:

final boolean casCellsBusy() {
    return CELLSBUSY.compareAndSet(this, 0, 1);
}

The CAS sets CELLSBUSY to 1, which can be read as taking a lock, because initialization is about to happen.

try {                           // Initialize table
    if (cells == cs) {
        Cell[] rs = new Cell[2];
        rs[h & 1] = new Cell(x);
        cells = rs;
        break done;
    }
} finally {
    cellsBusy = 0;
}

Cell[] is initialized with length 2, and by the probe algorithm only one of its elements is created; the other remains null. At the end, cellsBusy is set back to 0, meaning "not busy" again.

This answers the first question: Cell[] is initialized when contention occurs and it has not been initialized yet. The third question: the initialization rule is to create an array of length 2 with only one element instantiated; the other element is left null. And the other half of the fifth question: when initializing Cell[], the CAS-based lock is taken, so thread safety is guaranteed.

3. If everything above fails, a CAS operation is performed on base.

If you read the source alongside me, you will also notice an annotation you may never have seen before: @jdk.internal.vm.annotation.Contended. What is it for? Contended is used to combat false sharing.

Which leads us to a blind spot: what exactly is false sharing?

False sharing

We know the relationship between the CPU and memory: when the CPU needs a piece of data, it looks in its cache first; if the data is not in the cache, it goes to main memory and fetches it.

But that description is incomplete. Data in the cache is stored in cache lines. What does that mean? A single cache line usually holds more than one piece of data: if the cache line size is 64 bytes, then when the CPU fetches a value from memory, it takes the adjacent 64 bytes and copies them into the cache together.

This is an optimization that helps a single thread. Imagine the CPU needs datum A: the neighboring data B, C, D and E are pulled from memory into the cache as well, so if the CPU needs any of them next, it can fetch them straight from the cache.

But under multithreading there is a downside: when one core writes any value in a cache line, the entire line is invalidated for the other cores, so two threads updating unrelated variables that happen to share a line keep stealing the line from each other. This is what is called false sharing.

Is there a way around this? Clever developers found one: if the cache line size is 64 bytes, add some redundant fields as padding so the hot field fills a full 64 bytes on its own.

For example, suppose I need only one field of type long. I add six more long fields as padding; one long occupies 8 bytes, so seven longs make 56 bytes, and with the 8-byte object header that comes to exactly 64 bytes — exactly one cache line.

But this trick is not elegant, so Java 8 introduced the Contended annotation (sun.misc.Contended in Java 8, moved to jdk.internal.vm.annotation in Java 9) to solve the false sharing problem. To use it, developers must also add a JVM parameter (-XX:-RestrictContended); I won't go into the details here, because I haven't tested it myself.
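A hedged sketch of the manual-padding trick just described (PaddedCounter is a made-up name, and the exact field layout a given JVM produces can differ, so treat this as illustrative only):

```java
public class PaddedCounter {
    // p1..p6 are never read; together with the 8-byte object header they are
    // intended to push `value` onto its own 64-byte cache line, so updates to
    // neighboring objects' fields do not invalidate the line holding `value`
    long p1, p2, p3, p4, p5, p6;
    volatile long value;

    public static void main(String[] args) {
        PaddedCounter c = new PaddedCounter();
        c.value = 42;
        System.out.println(c.value); // 42
    }
}
```

Note that the JVM is free to reorder fields, which is part of why the @Contended annotation is the more reliable mechanism.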

 


 

Origin blog.csdn.net/yueyazhishang/article/details/105621191