Java Concurrency: Accumulators

Background

When concurrent programming comes up, the first thing most people think of is the piece of code most often cited as an example of a thread-safety problem:

	...
	i++;	// increment
	...

It is then natural to recall that the i++ operation actually consists of three operations under the hood:

tmp1 = i;
tmp2 = tmp1 + 1;
i = tmp2;

Therefore i++ is not an atomic operation, and it is not thread-safe in a multithreaded environment.
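
To make the lost update concrete, here is a minimal single-threaded sketch that replays one unlucky interleaving of the three steps by hand. The class and the "thread A / thread B" naming are hypothetical and exist only for illustration; a real race depends on scheduling and would not be this predictable.

```java
// Hypothetical illustration: replays one bad interleaving of i++ deterministically.
class LostUpdateSketch {
    static long i = 0;

    // Two simulated threads, A and B, each run the three steps of i++.
    // Both read i before either one writes it back.
    static long replayInterleaving() {
        i = 0;
        long tmp1A = i;          // A: tmp1 = i        (reads 0)
        long tmp1B = i;          // B: tmp1 = i        (also reads 0)
        long tmp2A = tmp1A + 1;  // A: tmp2 = tmp1 + 1
        long tmp2B = tmp1B + 1;  // B: tmp2 = tmp1 + 1
        i = tmp2A;               // A: i = tmp2        (writes 1)
        i = tmp2B;               // B: i = tmp2        (writes 1 again)
        return i;
    }

    public static void main(String[] args) {
        // Two increments were performed, but only one is reflected in i.
        System.out.println(replayInterleaving()); // prints 1
    }
}
```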

So the question is: if you want to implement an accumulator that provides i++ semantics under concurrency, how should you do it?

For example:

Recording how many times an interface method is called within 1s
Counting how many times an object is used within a period of time in an LFU (Least Frequently Used) algorithm
......

Option 1: AtomicLong

AtomicLong lives in the java.util.concurrent.atomic package. It is a lock-free, thread-safe class that implements operations such as i++ or i += x. Its author is Doug Lea.

Doug Lea needs no introduction; coming from him, the efficiency and thread safety of AtomicLong are hardly in doubt. And if you have not heard of him, it is worth looking him up: most classes in the java.util.concurrent package carry his name.

AtomicLong has a field named value that holds the long value.

This value is volatile, which means a modification made by one thread is visible to other threads.

The class achieves thread safety without locks by combining a loop with CAS (Compare And Swap).

The class provides methods such as getAndAdd and addAndGet that implement i++, i += x, and similar operations atomically.

In the AtomicLong implementation, the get and set methods read or write value directly; because the field is volatile, both operations are thread-safe.
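
As a rough usage sketch of the points above, the following counter is incremented from several threads and still ends at the exact total. The class name, method name, and thread counts are assumptions made for this example; only AtomicLong and its methods come from the JDK.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class: several threads share one AtomicLong.
class AtomicCounterSketch {
    // Starts `threads` workers that each increment the counter `perThread` times.
    static long countWith(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    counter.incrementAndGet(); // atomic equivalent of ++i
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // after join, this worker's updates are visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWith(8, 10_000)); // prints 80000
    }
}
```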

Core methods in Java 7

compareAndSet

The method takes two parameters: the expected value and the replacement value. It compares the expected value with the actual value, updates when they are equal, and fails otherwise. The ABA problem also applies here, but it is not discussed for now.

It performs the CAS through the native method compareAndSwapLong of the low-level utility class Unsafe, an operation supported natively by the hardware. Under heavy contention the call may fail, because another thread has already modified the actual value and the comparison no longer matches.

    public final boolean compareAndSet(long expect, long update) {
        return unsafe.compareAndSwapLong(this, valueOffset, expect, update);
    }
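
A small sketch of how compareAndSet behaves from the caller's side, using only the public AtomicLong API. The class name and the values 5, 6, and 7 are arbitrary choices for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class showing a successful and a failed CAS.
class CompareAndSetSketch {
    // Expects `value` to start at 5 and returns the outcome of two CAS attempts.
    static boolean[] attempts(AtomicLong value) {
        boolean first = value.compareAndSet(5L, 6L);  // expectation matches: 5 becomes 6
        boolean second = value.compareAndSet(5L, 7L); // stale expectation: value is already 6
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        AtomicLong value = new AtomicLong(5L);
        boolean[] r = attempts(value);
        System.out.println(r[0] + " " + r[1] + " " + value.get()); // prints: true false 6
    }
}
```

The second call fails without blocking; it is up to the caller to re-read the value and decide whether to retry.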

getAndSet

Reads the old value and writes the new value, implemented as a loop plus CAS. If the CAS assignment fails, another thread is competing for the value, so the method keeps retrying until it succeeds.

    public final long getAndSet(long newValue) {
        while (true) {
            long current = get();
            if (compareAndSet(current, newValue))
                return current;
        }
    }

Other core methods

Several other commonly used core methods implement increment and decrement operations:

getAndIncrement: i++
getAndDecrement: i--
getAndAdd: returns the value first, then adds
incrementAndGet: ++i
decrementAndGet: --i
addAndGet: adds first, then returns the value
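
The difference between the "getAnd..." and "...AndGet" variants is only in which value is returned. A short single-threaded sketch (the class name and starting value 10 are arbitrary) makes this visible:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class comparing the return values of the six methods.
class ReturnValueSketch {
    static long[] demo() {
        AtomicLong i = new AtomicLong(10L);
        long a = i.getAndIncrement(); // like i++ : returns 10, value becomes 11
        long b = i.incrementAndGet(); // like ++i : value becomes 12, returns 12
        long c = i.getAndAdd(5L);     // returns 12, value becomes 17
        long d = i.addAndGet(5L);     // value becomes 22, returns 22
        long e = i.getAndDecrement(); // like i-- : returns 22, value becomes 21
        long f = i.decrementAndGet(); // like --i : value becomes 20, returns 20
        return new long[] { a, b, c, d, e, f, i.get() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [10, 12, 12, 22, 22, 20, 20]
    }
}
```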

All of these methods achieve atomicity through a loop plus CAS. Take getAndIncrement as an example:

	// getAndIncrement
	while (true) {
		long current = get();
		long next = current + 1;
		if (compareAndSet(current, next))	// in practice this is unsafe.compareAndSwapLong
			return current;
	}
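
The same loop plus CAS pattern can be applied by user code to operations that AtomicLong does not provide directly. The sketch below uses it to atomically raise a value to a maximum; getAndMax is a hypothetical helper written for this example, not a JDK method.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class: a user-level CAS retry loop.
class CasLoopSketch {
    // Raises `target` to `candidate` if candidate is larger.
    // Returns the value that was observed before the (possible) update.
    static long getAndMax(AtomicLong target, long candidate) {
        while (true) {
            long current = target.get();
            if (candidate <= current) {
                return current; // nothing to update
            }
            if (target.compareAndSet(current, candidate)) {
                return current; // CAS succeeded
            }
            // CAS failed: another thread changed the value, so re-read and retry.
        }
    }

    public static void main(String[] args) {
        AtomicLong max = new AtomicLong(3L);
        System.out.println(getAndMax(max, 7L)); // prints 3
        System.out.println(getAndMax(max, 5L)); // prints 7
        System.out.println(max.get());          // prints 7
    }
}
```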

Implementation in Java 8

Unsafe gained the methods getAndSetLong and getAndAddLong, which are essentially the Java 7 AtomicLong methods moved over.

Apart from compareAndSet, which still uses compareAndSwapLong, the following methods are all implemented on top of getAndAddLong:

getAndIncrement
getAndDecrement
getAndAdd
incrementAndGet
decrementAndGet
addAndGet

Of course, getAndAddLong and getAndSetLong are still essentially CAS loops. Here is part of the decompiled code of the Java 8 Unsafe class:

    public final long getAndAddLong(Object var1, long var2, long var4) {
        long var6;
        do {
            var6 = this.getLongVolatile(var1, var2);
        } while(!this.compareAndSwapLong(var1, var2, var6, var6 + var4));
        return var6;
    }

However, according to the article "How do the CAS-related changes to the AtomicLong class in Java 8 work? - Code Log", the if statement inside the Java 7 CAS loop interacts badly with branch prediction, which lowers efficiency under high concurrency. The translated article reads oddly, so here is my own interpretation:

When the CAS in the loop fails often, the CPU's branch predictor starts predicting that it will fail again in order to speed up execution. Once a prediction is wrong, that is, once the CAS succeeds, the processor pays for a costly stall, rollback, and warm-up.

Branch prediction will be covered in a later post.

PS. Another explanation is that the underlying CAS instruction in Java 7 is LOCK CMPXCHG while Java 8 uses LOCK XADD, which makes these operations more efficient in Java 8 than in Java 7. See the discussion of replacing CMPXCHG with XADD at https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7023898. As I understand it, this suggestion was adopted and implemented by the Java developers in releases after JDK7u40.

Option 2: LongAdder

In Java 8, Doug Lea added several new classes to the java.util.concurrent.atomic package, one of which is LongAdder.

According to its documentation, this class is designed for accumulation under high concurrency.

This class is usually preferable to {@link AtomicLong} when multiple threads update a common sum that is used for purposes such as collecting statistics, not for fine-grained synchronization control. Under low update contention, the two classes have similar characteristics. But under high contention, expected throughput of this class is significantly higher, at the expense of higher space consumption.

In other words, when multiple threads update a shared sum that is used for purposes such as statistics rather than fine-grained synchronization control, LongAdder usually performs better than AtomicLong. Under low contention the two classes behave similarly, but under high contention the expected throughput of LongAdder is significantly higher, at the cost of more space.
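
A minimal usage sketch for the statistics use case, counting how often a method is called from several threads. CallCounterSketch, handleRequest, and the thread counts are hypothetical names and numbers chosen for this example; only LongAdder comes from the JDK.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical example class: counts calls to a method with a LongAdder.
class CallCounterSketch {
    private final LongAdder calls = new LongAdder();

    // Stand-in for an interface method whose invocations we want to count.
    void handleRequest() {
        calls.increment(); // returns nothing, unlike AtomicLong.incrementAndGet
    }

    long totalCalls() {
        return calls.sum();
    }

    // Calls handleRequest from several threads and returns the final total.
    static long simulate(int threads, int callsPerThread) {
        CallCounterSketch counter = new CallCounterSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < callsPerThread; k++) {
                    counter.handleRequest();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.totalCalls(); // exact here because all writers have finished
    }

    public static void main(String[] args) {
        System.out.println(simulate(8, 10_000)); // prints 80000
    }
}
```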

LongAdder extends Striped64. Let's look at what these two classes do and why the expected efficiency is higher for statistics.

Striped64

It contains a nested class Cell, which is effectively a subset of AtomicLong: it keeps only the value and a CAS method, without others such as get or incrementAndGet. The Cell class is annotated with @Contended to avoid false sharing.

It defines transient volatile long base to hold one part of the value, and transient volatile Cell[] cells to hold the other part. Yes, base and cells together make up the final value.

Striped64 has two core methods, longAccumulate and doubleAccumulate, with similar logic: one handles integers (used by LongAdder, for example) and the other handles floating-point values (used by DoubleAdder, for example).

Both methods are fairly complex and long. One of them is listed here and is worth a close read; if you really do not want to read it, you can skip ahead. The code is as follows:

    final void longAccumulate(long x, LongBinaryOperator fn,
                              boolean wasUncontended) {
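        int h;  // probe value
        if ((h = getProbe()) == 0) {
            // Initialize the probe, a pseudo-random value tied to the thread
            ThreadLocalRandom.current(); // force initialization
            h = getProbe();
            // The probe was only just created, so the earlier attempt is
            // treated as uncontended; callers pass false after a failed Cell CAS
            wasUncontended = true;
        }
        // Collision flag: true if the last slot tried was non-empty,
        // that is, when threads collided on a Cell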
        boolean collide = false;
        for (;;) {
            Cell[] as; Cell a; int n; long v;
            // cells is already initialized: the branch that most calls take
            if ((as = cells) != null && (n = as.length) > 0) {
                // Probe modulo n (n is a power of two):
                // check whether the Cell for the current thread is null
                if ((a = as[(n - 1) & h]) == null) {
                    if (cellsBusy == 0) {       // lock is free: try to attach a new Cell
                        Cell r = new Cell(x);   // optimistically create
                        if (cellsBusy == 0 && casCellsBusy()) { // acquire the lock via CAS
                            boolean created = false;
                            try {               // recheck under the lock
                                Cell[] rs; int m, j;
                                if ((rs = cells) != null &&
                                        (m = rs.length) > 0 &&
                                        rs[j = (m - 1) & h] == null) {
                                    rs[j] = r;
                                    created = true;
                                }
                            } finally {
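                                cellsBusy = 0;  // always release the lock
                            }
                            // The Cell was created: leave the loop, the update is done
                            if (created)
                                break;
                            // On recheck another thread had already filled the slot: next round
                            continue;           // Slot is now non-empty
                        }
                    }
                    // The slot is empty but the lock could not be acquired
                    collide = false;
                }
                // The thread's Cell exists and the CAS made before calling longAccumulate failed;
                // the code below rehashes the probe and loops again.
                // This branch is entered at most once per call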
                else if (!wasUncontended)       // CAS already known to fail
                    wasUncontended = true;      // Continue after rehash
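                // The thread's Cell exists and no earlier CAS failure is pending:
                // compute the new value and try to CAS it into this Cell
                else if (a.cas(v = a.value, ((fn == null) ? v + x :
                        fn.applyAsLong(v, x))))
                    break;
                // The thread's Cell exists, but the CAS in the previous branch condition failed
                else if (n >= NCPU || cells != as)
                    // The array has reached its size limit, or cells was replaced by another thread:
                    // clear collide and go to the next round
                    collide = false;            // At max size or stale
                // The Cell exists, this CAS failed, the array is below its size limit
                // and has not been replaced, and collide is still false.
                // The collide flag is effectively the last line of defense before expansion
                else if (!collide)
                    // set the collision flag
                    collide = true;
                // Every other branch was tried without success: lock and expand as a last resort.
                // If the lock cannot be acquired, keep looping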
                else if (cellsBusy == 0 && casCellsBusy()) {
                    try {
                        if (cells == as) {      // Expand table unless stale
                            Cell[] rs = new Cell[n << 1];
                            for (int i = 0; i < n; ++i)
                                rs[i] = as[i];
                            cells = rs;
                        }
                    } finally {
                        cellsBusy = 0;
                    }
                    collide = false;
                    continue;                   // Retry with expanded table
                }
                h = advanceProbe(h);
            }
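            // cells is not initialized, has not been changed by another thread, and the lock CAS succeeded
            else if (cellsBusy == 0 && cells == as && casCellsBusy()) {
                boolean init = false;
                try {                           // Initialize table
                    // recheck after acquiring the lock
                    if (cells == as) {
                        // create an array of length 2 and put the current value into the thread's Cell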
                        Cell[] rs = new Cell[2];
                        rs[h & 1] = new Cell(x);
                        cells = rs;
                        init = true;
                    }
                } finally {
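                    // release the lock
                    cellsBusy = 0;
                }
                // Leave the loop only if this thread created the array;
                // if another thread changed cells in the meantime, keep looping
                if (init)
                    break;
            }
            // cells is not initialized and another thread holds the lock: fall back to a CAS on base.
            // If that CAS fails as well, keep looping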
            else if (casBase(v = base, ((fn == null) ? v + x :
                    fn.applyAsLong(v, x))))
                break;                          // Fall back on using base
        }
    }

The code above looks complicated, but it comes down to a few core points:

As described above, the value of Striped64 consists of two parts, long base and Cell[] cells, and each thread is mapped to one Cell in the cells array
It takes the optimistic view: avoid locking where possible, and use CAS when competing for resources
It does not loop when there is no conflict, and even with a conflict it does not loop under certain conditions (Doug Lea thought this through very carefully)
In essence, most branches deal with conflicts; there are only two truly core branches:
- Create or update the Cell value corresponding to the current thread
- When contention is heavy, expand the cells array, subject to an upper limit on the array length
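
The cellsBusy handling in the listing follows a recurring pattern: test the flag, acquire it with a CAS, recheck the state under the lock, and release it in finally. Below is a stripped-down sketch of that pattern, assuming an AtomicInteger in place of the cellsBusy field and a plain long[] in place of cells. BusyFlagSketch and its members are hypothetical and do not reproduce the JDK code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class: a CAS-acquired flag guarding lazy initialization.
class BusyFlagSketch {
    private final AtomicInteger busy = new AtomicInteger(); // 0 = free, 1 = held
    private volatile long[] table;                          // created lazily

    // Tries once to create the table. Returns true only if this call created it.
    boolean tryInit() {
        if (table != null) {
            return false;                       // cheap check without the lock
        }
        if (busy.get() == 0 && busy.compareAndSet(0, 1)) { // test, then CAS to acquire
            try {
                if (table == null) {            // recheck under the lock
                    table = new long[2];
                    return true;
                }
                return false;                   // someone else created it first
            } finally {
                busy.set(0);                    // always release
            }
        }
        return false;                           // flag was held; caller decides what to do next
    }

    boolean isInitialized() {
        return table != null;
    }

    public static void main(String[] args) {
        BusyFlagSketch s = new BusyFlagSketch();
        System.out.println(s.tryInit()); // prints true
        System.out.println(s.tryInit()); // prints false
    }
}
```

Unlike a blocking lock, a failed acquisition returns immediately, which is what lets the caller fall through to another strategy, as longAccumulate does when it falls back to base.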

Why Striped64 uses base + cells

Under high concurrency, CAS operations in AtomicLong fail more frequently, wasting a lot of resources on retries and degrading performance.

Striped64 takes into account that AtomicLong makes every thread compete for a single resource through CAS. When a conflict occurs, it spreads the competition out: roughly speaking, each thread is assigned a Cell, so each thread competes for its own Cell, which significantly reduces conflicts.
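
The idea of spreading updates across cells can be sketched in a few lines. This is a deliberately simplified illustration and not how Striped64 is implemented: it assumes a fixed-size AtomicLongArray, has no base field and no expansion, picks the cell from the thread's hash code instead of a probe, and does not pad cells against false sharing. All names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical example class: a fixed-size striped counter.
class StripedCounterSketch {
    private final AtomicLongArray cells;
    private final int mask;

    StripedCounterSketch(int size) {
        if (size <= 0 || Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        cells = new AtomicLongArray(size);
        mask = size - 1;
    }

    // Adds x to the cell selected by the calling thread's hash.
    void add(long x) {
        int h = Thread.currentThread().hashCode();
        h ^= h >>> 16;                 // mix high bits into the low bits
        cells.addAndGet(h & mask, x);  // threads mapped to different cells do not contend
    }

    // Sums all cells; not an atomic snapshot while writers are active.
    long sum() {
        long s = 0;
        for (int i = 0; i < cells.length(); i++) {
            s += cells.get(i);
        }
        return s;
    }

    static long simulate(int threads, int addsPerThread) {
        StripedCounterSketch counter = new StripedCounterSketch(8);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < addsPerThread; k++) {
                    counter.add(1L);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, 5_000)); // prints 20000
    }
}
```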

LongAdder

Striped64 is the parent class of LongAccumulator, LongAdder, and similar classes. It factors out the common concurrent computation logic, while the parts that read the value are left to the specific subclasses.

So LongAdder inherits from Striped64 the methods needed to implement an accumulator, and additionally provides several methods for reading the value.
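
For comparison with its sibling class, here is a brief sketch of LongAccumulator, which takes the combining function and an identity value as constructor arguments. The helper class and method names are hypothetical; the constructor, accumulate, and get are part of the JDK API.

```java
import java.util.concurrent.atomic.LongAccumulator;

// Hypothetical example class: LongAccumulator with two different functions.
class AccumulatorSketch {
    // Tracks the maximum of the samples.
    static long maxOf(long... samples) {
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);
        for (long s : samples) {
            max.accumulate(s);
        }
        return max.get();
    }

    // With addition and identity 0, the accumulator behaves like a LongAdder.
    static long sumOf(long... samples) {
        LongAccumulator sum = new LongAccumulator(Long::sum, 0L);
        for (long s : samples) {
            sum.accumulate(s);
        }
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(maxOf(3L, 9L, 4L)); // prints 9
        System.out.println(sumOf(3L, 9L, 4L)); // prints 16
    }
}
```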

Core methods

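	// add a value
    public void add(long x) {
        Cell[] as; long b, v; int m; Cell a;
		// In the optimistic case with no contention at all, the parent method Striped64.casBase updates base
		// Once casBase hits contention, cells becomes non-null; each thread then finds its own Cell through its probe value and updates that Cell's value via CAS
		// Once the CAS on the Cell also hits contention, Striped64.longAccumulate is used to update cells or base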
        if ((as = cells) != null || !casBase(b = base, b + x)) {
            boolean uncontended = true;
            if (as == null || (m = as.length - 1) < 0 ||
                (a = as[getProbe() & m]) == null ||
                !(uncontended = a.cas(v = a.value, v + x)))
                longAccumulate(x, null, uncontended);
        }
    }

	// increment
    public void increment() {
        add(1L);
    }

	// decrement
    public void decrement() {
        add(-1L);
    }

As can be seen, the core logic of spreading out contention lives in the longAccumulate method of Striped64, which add calls.

The biggest difference from AtomicLong is that the increment methods do not return a value. (Obvious, really: the method names say so clearly.)

sum

Returns the sum, that is, base plus the values of all elements in the cells array.

Because no lock is held during the traversal, other threads may update the values while the sum is being computed, so the result may not be exact.

reset

Resets the data without locking, clearing both the elements of cells and base. This means the caller must make sure that no other thread is updating at the same time, otherwise the data may not be completely cleared.

The size of the cells array does not change, and its elements are not removed. This means that when the object is reused after reset, the expansion it has already gone through does not have to be repeated.

sumThenReset

Clears the elements of the cells array while traversing them to compute the sum. Note that this is also done without locking, so the problems that sum and reset have with concurrent modification by other threads apply to this method as well.

Other methods

Value methods such as longValue and doubleValue are provided; at their core they all call sum.
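
A single-threaded sketch of these read methods, framed around the "calls per time window" use case from the background section. WindowedCounterSketch and its method names are hypothetical. Because nothing else is writing in this example the results are exact; with concurrent writers, the caveats described above apply.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical example class: a per-window call counter built on LongAdder.
class WindowedCounterSketch {
    private final LongAdder calls = new LongAdder();

    void record() {
        calls.increment();
    }

    // Current count in the open window (delegates to sum()).
    long current() {
        return calls.longValue();
    }

    // Returns the count for the window and starts a new one.
    long closeWindow() {
        return calls.sumThenReset();
    }

    public static void main(String[] args) {
        WindowedCounterSketch counter = new WindowedCounterSketch();
        counter.record();
        counter.record();
        counter.record();
        System.out.println(counter.current());     // prints 3
        System.out.println(counter.closeWindow()); // prints 3
        System.out.println(counter.current());     // prints 0
    }
}
```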

LongAdder shortcomings

As mentioned earlier, the set and get operations of AtomicLong work directly on value, so their time complexity is O(1).

In contrast, LongAdder has several shortcomings:

No assignment method is provided. Reusing the object means calling reset, whose drawback was described above, and its time complexity is also higher than that of AtomicLong.set because it traverses the array
Reading the value also has higher time complexity, because the sum requires a traversal
As the author notes, it uses more space, because of the @Contended annotation used to avoid false sharing

The cells array that LongAdder traverses when summing or clearing is bounded by roughly the number of CPU cores, which is not large. Still, given that the scenario is a concurrent accumulator, if these methods are called very often or make up a high proportion of calls, the performance impact will show.

Comparing LongAdder and AtomicLong

LongAdder is better suited to high-concurrency scenarios where writes far outnumber reads. This is what Doug Lea describes in the class documentation: for statistics rather than synchronization control, where its expected efficiency is much higher than that of AtomicLong.

AtomicLong is better suited to scenarios where reads far outnumber writes, or where the number of threads is small.

Under low concurrency, the write efficiency of AtomicLong is roughly the same as that of LongAdder, or even slightly better;

The read efficiency of AtomicLong is always better than that of LongAdder;

However, the write efficiency of AtomicLong degrades roughly linearly as contention increases, whereas the write efficiency of LongAdder stays nearly constant.

Overall, the two fit very different scenarios, and LongAdder should not be used as a drop-in replacement for AtomicLong.

References

Summary of the principles of the Java concurrency tool LongAdder · Issue #22 · aCoder2013/blog

How do the CAS-related changes to the AtomicLong class in Java 8 work? - Code Log

Bug ID: JDK-7023898 Intrinsify AtomicLongFieldUpdater.getAndIncrement() - java.com

This article was moved from my blog; you are welcome to visit!