LongAdder class study notes


  1. Introduction to LongAdder | LongAccumulator
  2. Source code reading: Comprehensive explanation of LongAdder

Any discussion of LongAdder has to start with AtomicLong. AtomicLong was introduced in JDK 1.5 and wraps a single volatile long value as a member variable. It relies on the underlying CAS (compare-and-swap) operation to update that value atomically: to increase or decrease the value, it loops, retrying the CAS until it succeeds. Under high concurrency most of those attempts fail, so threads spin through many useless iterations and waste CPU. The decompiled Unsafe method below shows the shape of this retry loop:

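// Decompiled from sun.misc.Unsafe (JDK 8). getAndAddLong has the same shape,
// except that it CASes in var6 + var4 instead of var4.
public final long getAndSetLong(Object var1, long var2, long var4) {
    long var6;
    do {
        // Read the current value of the field at offset var2 in object var1.
        var6 = this.getLongVolatile(var1, var2);
        // Retry until no other thread changed the value between the read and the CAS.
    } while(!this.compareAndSwapLong(var1, var2, var6, var4));

    return var6;
}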
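To make the retry loop concrete, here is a minimal sketch of the same pattern written against the public AtomicLong API. The class name CasLoopDemo and the method addWithRetries are made up for illustration; counting the attempts is only there to show where the wasted work comes from under contention.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasLoopDemo {
    /** Adds delta with an explicit CAS retry loop and returns the number of CAS attempts. */
    static int addWithRetries(AtomicLong counter, long delta) {
        int attempts = 0;
        long current;
        do {
            current = counter.get();   // volatile read of the latest value
            attempts++;
            // compareAndSet fails if another thread changed the value after the read
        } while (!counter.compareAndSet(current, current + delta));
        return attempts;
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong(5);
        int attempts = addWithRetries(counter, 3);
        System.out.println(counter.get() + " after " + attempts + " attempt(s)");
    }
}
```

With a single thread the loop body runs exactly once. With many threads hitting the same AtomicLong, every failed compareAndSet is one more empty iteration.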

Striped64 class

public class LongAdder extends Striped64 implements Serializable

LongAdder extends the Striped64 class to implement accumulation. It is a utility class for accumulating values under high concurrency.

The core idea of Striped64's design is to avoid contention by spreading updates across several internal variables.

Striped64 contains a base value and an array of Cell objects, cells, which acts as a hash table. When there is no contention, the value to be added is accumulated into base through CAS. When there is contention, the value is accumulated into one element of the cells array instead. Only when the current total is requested is base added to the value of every element in cells. The value of the whole Striped64 is therefore sum = base + ∑ cells[i].value, for i from 0 to n-1.
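From the caller's side none of this is visible: you call add or increment and read the total with sum(). The following is a small usage sketch using only the public java.util.concurrent.atomic.LongAdder API; the class name LongAdderSumDemo and the helper countWithThreads are illustrative names.

```java
import java.util.concurrent.atomic.LongAdder;

final class LongAdderSumDemo {
    /** Starts the given number of threads, each incrementing a shared LongAdder, and returns the final sum. */
    static long countWithThreads(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    adder.increment();          // lands in base or in one of the cells
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return adder.sum();                     // base plus every cell's value
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10_000));
    }
}
```

Note that sum() is only guaranteed to be exact once all updating threads have finished, as they have here after join(); while updates are still in flight it is a moving snapshot.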


Three important member variables inside Striped64:

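/** 
 * Hash table of Cells; its size is a power of 2.
 */  
transient volatile Cell[] cells;  
/** 
 * Base value.
 * 1. Updated directly when there is no contention.
 * 2. While cells is being initialized and is therefore unusable, threads also try to accumulate into base via CAS.
 */  
transient volatile long base;  
/** 
 * Spin lock, acquired via CAS, that guards creating or expanding the Cell table.
 */  
transient volatile int cellsBusy; 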

Member variable cells

The cells array is the key to LongAdder's high performance:

AtomicLong has only one value, and every thread must compete for it through CAS, so contention is severe under high concurrency. LongAdder accumulates into two places. The first is base, which plays the same role as the value in AtomicLong. While there is no contention, the cells array is not used and stays null, and all updates go to base. Once contention occurs, the cells array is used. It is initialized with length 2 and doubles on each expansion, and it stops expanding once its length is greater than or equal to the number of CPUs on the machine (think about why it stops there). Each thread accumulates into the Cell at cells[threadLocalRandomProbe & (cells.length - 1)], a masked index that behaves like threadLocalRandomProbe % cells.length because the length is a power of 2. In effect, each thread is bound to one Cell object in cells.

  1. The array does not grow beyond the number of CPUs because the CPU count bounds how many threads can actually run at the same time. Extra elements beyond that would do little to reduce contention.
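The sizing rule and the slot selection can be sketched in a few lines. This is only an illustration of the arithmetic, not code from the JDK: the class CellTableSizing and its methods are hypothetical names, and the real table only grows after repeated collisions, not eagerly as the loop here does.

```java
import java.util.ArrayList;
import java.util.List;

final class CellTableSizing {
    /** Table lengths in order: start at 2 and double for as long as the length is below ncpu. */
    static List<Integer> growthSteps(int ncpu) {
        List<Integer> steps = new ArrayList<>();
        int n = 2;                      // initial length of the cells array
        steps.add(n);
        while (n < ncpu) {              // expansion is refused once n >= NCPU
            n <<= 1;
            steps.add(n);
        }
        return steps;
    }

    /** Slot index for a probe in a table whose length is a power of two. */
    static int slotFor(int probe, int length) {
        return probe & (length - 1);    // keeps only the low bits, so it is never negative
    }

    public static void main(String[] args) {
        System.out.println(growthSteps(6));                 // lengths for a 6-CPU machine
        System.out.println(slotFor(-1647531835, 8));        // slot for a negative probe
    }
}
```

For a power-of-two length the mask gives the same result as Math.floorMod(probe, length), which is why it is safe even when the probe is negative, whereas the plain % operator could return a negative index.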

Member variable cellsBusy

cellsBusy takes one of two values, 0 or 1. It acts as a lock whenever the cells array is about to be modified, preventing multiple threads from modifying the array at the same time: 0 means unlocked and 1 means locked. The lock is taken in three situations:

  1. When the cells array is initialized;
  2. When the cells array is expanded;
  3. If an element in the cells array is null, when creating a new Cell object for this position;
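The locking idea behind cellsBusy can be sketched with an AtomicInteger. This is a simplified stand-in, assuming nothing beyond the public AtomicInteger API; Striped64 itself uses a volatile int field and Unsafe, and the class name BusyFlag is made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BusyFlag {
    private final AtomicInteger busy = new AtomicInteger(0); // 0 = unlocked, 1 = locked

    /** Tries once to take the lock; never blocks. */
    boolean tryAcquire() {
        // The cheap read first avoids a CAS when the lock is visibly taken.
        return busy.get() == 0 && busy.compareAndSet(0, 1);
    }

    /** Releases the lock; only the holder should call this. */
    void release() {
        busy.set(0);
    }

    public static void main(String[] args) {
        BusyFlag flag = new BusyFlag();
        if (flag.tryAcquire()) {
            try {
                System.out.println("modifying the table while holding the lock");
            } finally {
                flag.release();         // always unlock, as the finally blocks in longAccumulate do
            }
        }
    }
}
```

A thread that fails to take the lock does not wait for it. It moves on and tries something else, which is the same choice longAccumulate makes when cellsBusy is already 1.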

Member variable base

It has two functions:

  1. At the beginning, while there is no contention, values are accumulated into base.
  2. While cells is being initialized and is not yet usable, threads try to accumulate the value into base instead.

Cell inner class

// The @sun.misc.Contended annotation improves performance by avoiding false sharing.
@sun.misc.Contended static final class Cell {
    // Holds the value being accumulated.
    volatile long value;
    Cell(long x) { value = x; }
    // Updates value with the CAS provided by the UNSAFE class.
    final boolean cas(long cmp, long val) {
        return UNSAFE.compareAndSwapLong(this, valueOffset, cmp, val);
    }
    private static final sun.misc.Unsafe UNSAFE;
    // Offset of the value field within the Cell class.
    private static final long valueOffset;
    // This static initializer looks up the offset.
    static {
        try {
            UNSAFE = sun.misc.Unsafe.getUnsafe();
            Class<?> ak = Cell.class;
            valueOffset = UNSAFE.objectFieldOffset
                (ak.getDeclaredField("value"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }
}

This class is very simple: it is final, holds a single value field, and updates that field with CAS. The only thing that needs attention in the Cell class is its @sun.misc.Contended annotation.

false sharing

To understand the role of the Contended annotation, we must first figure out what false sharing is, what impact it will have, and how to solve false sharing.

cache line

To understand false sharing, we must first understand what a cache line is. The CPU cache is organized in units of cache lines. A cache line is a run of consecutive bytes whose length is a power of 2, typically 32 to 256 bytes. The most common cache line size is 64 bytes, and the cache line is the smallest unit of data transfer between cache and memory.

Most modern CPUs integrate both L1 and L2 caches on the die. L1 caches are often write-through, while the L2 cache is write-back and does not write to memory immediately, which leaves the contents of cache and memory inconsistent. In addition, in a multiprocessor (MP) environment each cache is private to its CPU, so the caches of different CPUs can also disagree with one another. For this reason, many multiprocessor architectures, whether ccNUMA or SMP, implement a cache coherence mechanism that keeps the caches of different CPUs consistent.

Write-through: when data is updated, it is written to the cache and to the backing store at the same time. The advantage of this mode is that it is simple; the disadvantage is that writes are slower, because every modification must also reach the backing store.

Write-back: when data is updated, it is written only to the cache. The modified data is written to the backing store only when it is evicted from the cache. The advantage of this mode is that writes are fast, because the backing store is not touched on each update; the disadvantage is that if the system loses power before the updated data has been written back, that data is lost.

One way to implement cache coherence is a cache-snooping protocol, in which each CPU watches the cache reads and writes of the other CPUs by snooping on the bus:

  1. When cpu1 wants to write to its cache, the other CPUs check the corresponding cache line in their own caches. If the line is dirty, they write it back to memory and refresh the relevant cache line of cpu1; if it is not dirty, they invalidate the cache line.

  2. When cpu1 wants to read from its cache, the other CPUs write back to memory whatever is marked dirty in the corresponding cache line of their own caches, and refresh the relevant cache line of cpu1.

Therefore, increasing the cache hit rate of the CPU and reducing the data transfer between the cache and memory will improve the performance of the system.

Cache line alignment therefore matters a great deal when memory is laid out for programs and binary objects. Without it, processes or threads running concurrently on multiple CPUs are much more likely to read and write the same cache line at the same time. The caches and memory then go through repeated write-backs and refreshes, a situation called cache thrashing.

In order to effectively avoid cache thrashing, there are usually two ways:

  1. For heap allocation, many systems implement mandatory alignment in malloc calls.
  2. For stack allocation, many compilers provide the stack aligned option.

Of course, if stack alignment is requested from the compiler, the program grows and uses more memory, so the trade-off needs to be considered carefully.

For more details about false sharing, see the introductions here and here.

To solve this problem, code written for JDK 1.6 used long padding: 7 long variables are added before and after the variable that must be protected from false sharing, as shown below:

public class VolatileLongPadding {
    volatile long p0, p1, p2, p3, p4, p5, p6;
    volatile long v = 0L;
    volatile long q0, q1, q2, q3, q4, q5, q6;
}
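Why 7 longs on each side? The arithmetic can be checked with a small sketch. It assumes a 64-byte cache line, 8-byte alignment of v, and that the padding fields really sit directly before and after v in memory; the JVM does not promise that field layout, so treat this as a model of the idea rather than a guarantee. PaddingCheck and isolates are names invented for the example.

```java
final class PaddingCheck {
    static final int LINE = 64;        // assumed cache line size in bytes
    static final int LONG_BYTES = 8;

    /**
     * Returns true if, for every 8-byte-aligned position of v inside a cache line,
     * the whole line holding v is covered by v plus paddingLongs longs on each side.
     */
    static boolean isolates(int paddingLongs) {
        long pad = (long) paddingLongs * LONG_BYTES;
        for (long addr = 1024; addr < 1024 + LINE; addr += LONG_BYTES) {
            long lineStart = addr & ~(long) (LINE - 1);   // start of the line containing v
            long lineEnd = lineStart + LINE;
            long regionStart = addr - pad;                // first byte of the leading padding
            long regionEnd = addr + LONG_BYTES + pad;     // one past the trailing padding
            if (lineStart < regionStart || lineEnd > regionEnd) {
                return false;                             // a foreign field could share v's line
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("6 longs: " + isolates(6));
        System.out.println("7 longs: " + isolates(7));
    }
}
```

Under these assumptions, 7 longs (56 bytes) per side is the smallest padding that always works: together with the 8 bytes of v it covers a full 64-byte line no matter where v falls inside it, while 6 longs leaves a gap at either end of the line.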

Some versions of JDK 1.7 optimize this long padding away. To solve that problem, @sun.misc.Contended was added in JDK 1.8.

LongAdder

A lot has been said before, and now I finally get to the point.

LongAdder->add method

The add method is LongAdder's accumulation method; its parameter x is the value to be added.

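public void add(long x) {

    Cell[] as; long b, v; int m; Cell a;
    /**
     * Execution enters the if body in either of two cases:
     * 1. The cells array is not null. (Without contention cells is always null; it is initialized only after a CAS on base has failed.)
     * 2. The cells array is null and casBase fails. If casBase succeeds, add returns immediately; a failed casBase means the first contention has occurred and cells needs to be initialized.
     * casBase is simple: it uses the UNSAFE CAS to set the member variable base to base + x.
     * casBase succeeds only when there is no contention. cells is still null and unused at that point, so the uncontended case works like AtomicLong: accumulate with CAS.
     */
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        // uncontended records whether the CAS on the Cell this thread hashes to ran without contention. A failed CAS means contention: uncontended=false means contended, uncontended=true means not contended.

        boolean uncontended = true;
        /**
         * 1. as == null: the cells array has not been initialized; if true, go straight into the if body to initialize it.
         * 2. (m = as.length - 1) < 0: the cells array has length 0.
         * Conditions 1 and 2 both mean that cells has not been initialized successfully; a freshly initialized cells array has length 2.
         * 3. (a = as[getProbe() & m]) == null: cells is initialized and its length is not 0. getProbe returns the current Thread's threadLocalRandomProbe (initially 0), and threadLocalRandomProbe & (cells.length - 1) acts as threadLocalRandomProbe % cells.length. If that slot is null, no thread has accumulated there yet, so enter the if body to create a new Cell at that position.
         * 4. !(uncontended = a.cas(v = a.value, v + x)): try to add x to the value of the Cell in that slot and record the result. If the CAS fails, enter the if body, where a new threadLocalRandomProbe is computed.
         *
         * longAccumulate is therefore called in three situations:
         * 1. Conditions 1 and 2: cells has not been initialized.
         * 2. Condition 3: no thread has accumulated yet in the slot this thread hashes to.
         * 3. Condition 4: a conflict occurred, uncontended=false.
         */
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}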
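The two-level idea in add (try base first, fall back to a per-thread cell only after a failed CAS) can be shown with a much smaller model. The sketch below is not the JDK's implementation: MiniStripedCounter is a made-up class, its cell table has a fixed size instead of being created and expanded lazily, it picks the slot from the thread's hashCode instead of a rehashable probe, and AtomicLongArray elements are not padded, so it does nothing about false sharing.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

final class MiniStripedCounter {
    private final AtomicLong base = new AtomicLong();
    private final AtomicLongArray cells;        // fixed table; Striped64 grows its table lazily

    MiniStripedCounter(int powerOfTwoSize) {
        if (powerOfTwoSize <= 0 || Integer.bitCount(powerOfTwoSize) != 1) {
            throw new IllegalArgumentException("size must be a positive power of two");
        }
        cells = new AtomicLongArray(powerOfTwoSize);
    }

    void add(long x) {
        long b = base.get();
        if (!base.compareAndSet(b, b + x)) {    // fast path failed: someone else touched base
            int slot = Thread.currentThread().hashCode() & (cells.length() - 1);
            cells.addAndGet(slot, x);           // fall back to the cell this thread maps to
        }
    }

    long sum() {
        long total = base.get();
        for (int i = 0; i < cells.length(); i++) {
            total += cells.get(i);
        }
        return total;
    }

    /** Runs several threads against one counter and returns the final sum. */
    static long demo(int threads, int addsPerThread) {
        MiniStripedCounter counter = new MiniStripedCounter(8);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < addsPerThread; j++) counter.add(1);
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 10_000));
    }
}
```

Every add lands in exactly one place, either base or one cell, so the sum is exact once all threads have finished. What the real add and longAccumulate contribute on top of this is the lazy table, the cellsBusy lock, rehashing on collision, and padded cells.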

longAccumulate method

Of the three parameters, the first is the value to be accumulated, the second is null when the call comes from LongAdder, and the third, wasUncontended, indicates whether the CAS attempted in add before this call ran without contention.

final void longAccumulate(long x, LongBinaryOperator fn,
                          boolean wasUncontended) {
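    // Use the current thread's threadLocalRandomProbe as the hash. If it is 0, this thread is entering the method for the first time, so force initialization: the probe is set from probeGenerator, a private static field of ThreadLocalRandom. Hash generation is covered in detail later.
    // Note that threadLocalRandomProbe == 0 means a new thread is joining the contention for cells:
    // 1. The thread has not competed for cells before (cells may not even be initialized yet, and the thread is here to initialize it), and this is its first failed CAS on base;
    // 2. or, in add, its first CAS on a Cell in cells failed and wasUncontended was set to false, in which case it is reset to true here.
    // Every thread that has competed for a cell has a non-zero threadLocalRandomProbe.
    int h;
    if ((h = getProbe()) == 0) {
        // Initialize ThreadLocalRandom for this thread.
        ThreadLocalRandom.current(); // force initialization
        // h is now the freshly generated non-zero probe (0x9e3779b9 for the first thread to be initialized).
        h = getProbe();
        // Mark the call as uncontended.
        wasUncontended = true;
    }
    // CAS conflict flag: the CAS on the slot this thread hashes to collided with another thread and failed. collide=true means a conflict was seen, collide=false means none.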
    boolean collide = false;
    for (;;) {
        Cell[] as; Cell a; int n; long v;
        // This main if has three branches:
        // 1. Main branch 1: cells is already initialized (handles conditions 3 and 4 of the add method).
        // 2. Main branch 2: cells is not initialized or has length 0 (handles conditions 1 and 2 of the add method).
        // 3. Main branch 3: cells is not initialized and another thread is initializing it, i.e. cellsBusy=1; try to accumulate into base via CAS.
        // Main branch 1 first.
        if ((as = cells) != null && (n = as.length) > 0) {
            /**
             * Inner branch 1: handles condition 3 of the if in add. The slot hashed to is null, so no thread has stored a value there and it can be claimed directly: create a new Cell with x as its initial value, lock cells with cellsBusy, and store the Cell at cells[h & (cells.length - 1)].
             */
            if ((a = as[(n - 1) & h]) == null) {
                // cellsBusy == 0 means no thread is currently modifying cells.
                if (cellsBusy == 0) {
                    // Create a new Cell with x as its initial value.
                    Cell r = new Cell(x);
                    // If the lock is still free, take it by CASing cellsBusy from 0 to 1.
                    if (cellsBusy == 0 && casCellsBusy()) {
                        // Whether the Cell was created and stored in its slot.
                        boolean created = false;
                        try {
                            Cell[] rs; int m, j;
                            // Recheck under the lock: cells is non-null, non-empty, and the slot is still null.
                            if ((rs = cells) != null &&
                                    (m = rs.length) > 0 &&
                                    rs[j = (m - 1) & h] == null) {
                                // Store the new cell in the slot.
                                rs[j] = r;
                                created = true;
                            }
                        } finally {
                            // Release the lock.
                            cellsBusy = 0;
                        }
                        // Created successfully: exit the loop.
                        if (created)
                            break;
                        // created is false: another thread has already put a cell into this slot; run the loop again.
                        continue;
                    }
                }
                // Reaching this line means cellsBusy=1: another thread is modifying cells. Set collide to false.
                collide = false;

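                /**
                 * Inner branch 2: the CAS in condition 4 of add, which tried to set the value of the Cell in this slot to v + x, failed, so contention has already occurred. Reset wasUncontended to true, fall out of this if chain, compute a new probe at the bottom, and run the loop again.
                 */
            } else if (!wasUncontended)
                // Set the uncontended flag back to true and carry on; a new probe is computed below and the loop runs again.
                wasUncontended = true;
            /**
             * Inner branch 3: the slot holds a Cell and wasUncontended is true (for example a thread that entered with threadLocalRandomProbe=0, or one that has just rehashed). Try to add x to that Cell's value via CAS; on success, exit the loop.
             */
            else if (a.cas(v = a.value, ((fn == null) ? v + x :
                    fn.applyAsLong(v, x))))
                break;
            /**
             * Inner branch 4: the CAS in branch 3 failed. If cells has already reached its maximum length (greater than or equal to the number of CPUs) or has been replaced by another thread's expansion, set collide to false; the probe is recomputed below.
             */
            else if (n >= NCPU || cells != as)
                collide = false;
            /**
             * Inner branch 5: a conflict occurred while collide is still false, so set it to true; after the hash is recomputed below, the for loop runs again.
             */
            else if (!collide)
                // Set the conflict flag: a conflict occurred, so rehash and retry. If the next retry reaches this point again, collide is already true, !collide is false, and control moves on to the next branch.
                collide = true;
            /**
             * Inner branch 6: expand the cells array. Reached when the thread has failed twice in a row and the table is still eligible for expansion.
             */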
            else if (cellsBusy == 0 && casCellsBusy()) {
                try {
                    // Check that cells has not already been expanded by another thread.
                    if (cells == as) {      // Expand table unless stale
                        Cell[] rs = new Cell[n << 1];
                        for (int i = 0; i < n; ++i)
                            rs[i] = as[i];
                        cells = rs;
                    }
                } finally {
                    cellsBusy = 0;
                }
                collide = false;
                continue;                   // Retry with expanded table
            }
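            // Compute a new hash for the current thread.
            h = advanceProbe(h);

            // This main branch handles conditions 1 and 2 of add: the cell table is not initialized or has length 0, so first try to take the cellsBusy lock.
        }else if (cellsBusy == 0 && cells == as && casCellsBusy()) {
            boolean init = false;
            try {                           // Initialize table
                // Initialize cells with capacity 2 and store a Cell holding x at index h & 1, i.e. slot 0 or slot 1.
                if (cells == as) {
                    Cell[] rs = new Cell[2];
                    rs[h & 1] = new Cell(x);
                    cells = rs;
                    init = true;
                }
            } finally {
                // Release the lock.
                cellsBusy = 0;
            }
            // init is true: initialization succeeded, exit the loop.
            if (init)
                break;
        }
        /**
         * If none of the above succeeded, try to accumulate the value into base.
         */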
        else if (casBase(v = base, ((fn == null) ? v + x :
                fn.applyAsLong(v, x))))
            break;                          // Fall back on using base
    }
}
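The fn parameter is what lets the same method serve both LongAdder and LongAccumulator: with fn == null the update is v + x, otherwise it is fn.applyAsLong(v, x). A short usage sketch with the public API shows the difference from the outside. AccumulatorDemo, maxOf and sumOf are illustrative names.

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

final class AccumulatorDemo {
    /** Tracks a running maximum; the function passed here plays the role of fn. */
    static long maxOf(long... values) {
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);
        for (long v : values) {
            max.accumulate(v);
        }
        return max.get();
    }

    /** Plain summation; LongAdder is the special case where no function is supplied. */
    static long sumOf(long... values) {
        LongAdder adder = new LongAdder();
        for (long v : values) {
            adder.add(v);
        }
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println(maxOf(3, 9, 4));
        System.out.println(sumOf(3, 9, 4));
    }
}
```

Because updates from different threads can be applied to different cells in any order, the function given to LongAccumulator should be associative and commutative, as max and plus are, for the result to be predictable.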

About hash generation

The hash determines which slot of the cells array the current thread accumulates into, so the hash algorithm is very important. Let's take a look at its implementation.

There is a member variable in java's Thread class:

@sun.misc.Contended("tlr")
int threadLocalRandomProbe;

LongAdder uses the value of threadLocalRandomProbe to hash into the cells array. Most threads never use this variable, so its value stays 0 until it is initialized.

Striped64, the parent class of LongAdder, reads the current thread's threadLocalRandomProbe through the getProbe method:

static final int getProbe() {
    // PROBE is the offset of the threadLocalRandomProbe field within the Thread class, so the statement below reads the value of threadLocalRandomProbe.
    return UNSAFE.getInt(Thread.currentThread(), PROBE);
}

Initialization of threadLocalRandomProbe

Until a thread's accumulation on LongAdder first enters the longAccumulate method, its threadLocalRandomProbe is 0 (unless the thread has already used ThreadLocalRandom). When contention occurs, the thread enters longAccumulate, which first checks whether threadLocalRandomProbe is 0. If it is, the probe is initialized to a non-zero value taken from probeGenerator, which is 0x9e3779b9 for the first thread to be initialized.

int h;
if ((h = getProbe()) == 0) {
    ThreadLocalRandom.current(); 
    h = getProbe();
    // Mark the call as uncontended.
    wasUncontended = true;
}

The key is this line: ThreadLocalRandom.current();

public static ThreadLocalRandom current() {
    if (UNSAFE.getInt(Thread.currentThread(), PROBE) == 0)
        localInit();
    return instance;
}

In the current() method, if the probe value is 0, localInit() is called to set the current thread's probe to a non-zero value. This method is implemented as follows:

static final void localInit() {
    // private static final AtomicInteger probeGenerator = new AtomicInteger();
    // private static final int PROBE_INCREMENT = 0x9e3779b9;
    int p = probeGenerator.addAndGet(PROBE_INCREMENT);
    // The probe must not be 0.
    int probe = (p == 0) ? 1 : p; // skip 0
    long seed = mix64(seeder.getAndAdd(SEEDER_INCREMENT));
    // Get the current thread.
    Thread t = Thread.currentThread();
    UNSAFE.putLong(t, SEED, seed);
    // Store the probe derived from probeGenerator in the thread.
    UNSAFE.putInt(t, PROBE, probe);
}

probeGenerator is a static AtomicInteger. Each call to localInit() adds 0x9e3779b9 to probeGenerator. The value 0x9e3779b9 is 2^32 divided by a constant, the well-known golden ratio 1.6180339887. The current thread's threadLocalRandomProbe is then set to the new value of probeGenerator, or to 1 if that value is 0.
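Both claims are easy to check numerically. The sketch below recomputes the constant from the golden ratio and mimics what probeGenerator hands out to the first few threads. It uses a local AtomicInteger as a stand-in for the private generator inside ThreadLocalRandom, and ProbeSeedDemo and its method names are invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ProbeSeedDemo {
    static final int PROBE_INCREMENT = 0x9e3779b9;

    /** floor(2^32 / phi), computed in double precision. */
    static long goldenRatioConstant() {
        double phi = (1.0 + Math.sqrt(5.0)) / 2.0;      // 1.6180339887...
        return (long) Math.floor(4294967296.0 / phi);
    }

    /** Probes that the first count threads would receive from a fresh generator. */
    static int[] firstProbes(int count) {
        AtomicInteger generator = new AtomicInteger();  // stand-in for probeGenerator
        int[] probes = new int[count];
        for (int i = 0; i < count; i++) {
            int p = generator.addAndGet(PROBE_INCREMENT);
            probes[i] = (p == 0) ? 1 : p;               // skip 0, as localInit does
        }
        return probes;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(goldenRatioConstant()));
        for (int p : firstProbes(4)) {
            System.out.println(Integer.toHexString(p) + " -> slot " + (p & 1) + " of a 2-cell table");
        }
    }
}
```

Stepping by this constant spreads successive probes across the int range (the addition wraps around on overflow), so threads initialized one after another get well-separated starting probes.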

Regenerating threadLocalRandomProbe

The probe value is shifted and XORed with itself three times (left, right, left):

static final int advanceProbe(int probe) {
    probe ^= probe << 13;   // xorshift
    probe ^= probe >>> 17;
    probe ^= probe << 5;
    UNSAFE.putInt(Thread.currentThread(), PROBE, probe);
    return probe;
}

Starting from probe = 1, applying advanceProbe 10 times in a row gives the following values (the first line is the starting value):

1 
270369 
67634689 
-1647531835 
307599695 
-1896278063 
745495504 
632435482 
435756210 
2005365029 
-1378868364
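The list above can be reproduced without Unsafe by dropping the write back to the Thread object and keeping only the xorshift arithmetic. This is a sketch for experimenting; XorShiftProbeDemo and its methods are illustrative names, not part of the JDK.

```java
final class XorShiftProbeDemo {
    /** Same shifts as advanceProbe, minus the UNSAFE.putInt that stores the result in the thread. */
    static int advance(int probe) {
        probe ^= probe << 13;
        probe ^= probe >>> 17;
        probe ^= probe << 5;
        return probe;
    }

    /** Returns the start value followed by the results of applying advance the given number of times. */
    static int[] sequence(int start, int steps) {
        int[] out = new int[steps + 1];
        out[0] = start;
        for (int i = 1; i <= steps; i++) {
            out[i] = advance(out[i - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int p : sequence(1, 10)) {
            // The slot is what matters to Striped64: the low bits of the probe.
            System.out.println(p + " -> slot " + (p & 7) + " of an 8-cell table");
        }
    }
}
```

Printing the slot next to each probe shows the point of rehashing: after a collision the thread usually moves to a different cell, even though the table itself has not changed.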

LongAdder and Atomic performance comparison

[Figure: LongAdder and Atomic performance comparison chart; image not available]

Origin http://43.154.161.224:23101/article/api/json?id=325124161&siteId=291194637