Java Memory Model (JMM), Part 7: The CAS Mechanism

The concept

In the previous article's exploration of the synchronized lock mechanism, we saw that the underlying implementations of those locks all rely on CAS operations to some degree. In fact, nearly the entire java.util.concurrent package in Java is built on CAS, which shows how important CAS is to synchronization in Java.

 

CAS is short for Compare and Swap. It provides atomic operations at the hardware level: CAS is implemented with the atomic instructions of the hardware platform, and the JVM merely wraps the call to those assembly instructions. A CAS operation compares the current value of a memory location with a given expected value; if they are equal, it writes the new value, and if they are not, it leaves the location unchanged. The comparison and the write happen as one indivisible step.
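To make the compare-then-write rule concrete, here is a minimal sketch using the standard AtomicInteger.compareAndSet method. The class name CasSemanticsDemo and the starting value 5 are illustrative choices, not taken from the original article.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasSemanticsDemo {
    // Performs two CAS attempts on a value that starts at 5 and
    // records "<success> -> <value after the attempt>" for each.
    static String demo() {
        AtomicInteger value = new AtomicInteger(5);
        StringBuilder log = new StringBuilder();

        // Expected 5 matches the current value 5, so 6 is written.
        boolean first = value.compareAndSet(5, 6);
        log.append(first).append(" -> ").append(value.get()).append('\n');

        // Expected 5 no longer matches the current value 6, so nothing is written.
        boolean second = value.compareAndSet(5, 7);
        log.append(second).append(" -> ").append(value.get());

        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
        // prints:
        // true -> 6
        // false -> 6
    }
}
```

The failed second attempt leaves the value untouched and simply reports false; deciding what to do next (retry, give up) is left to the caller.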

 

CAS case study

The atomicity of AtomicInteger is a typical application of the CAS mechanism. The relevant source snippets are shown below (based on JDK 1.7 and OpenJDK 7):

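// Excerpt from java.util.concurrent.atomic.AtomicInteger (JDK 1.7), uses sun.misc.Unsafe

// Unsafe gives access to the hardware CAS instruction
private static final Unsafe unsafe = Unsafe.getUnsafe();
// Memory offset of the "value" field inside an AtomicInteger object
private static final long valueOffset;

static {
    try {
        valueOffset = unsafe.objectFieldOffset
            (AtomicInteger.class.getDeclaredField("value"));
    } catch (Exception ex) { throw new Error(ex); }
}

// volatile: a write by one thread is visible to later reads by other threads
private volatile int value;

public final int get() {
    return value;
}

public final int incrementAndGet() {
    for (;;) {                        // spin until the CAS succeeds
        int current = get();          // read the latest value
        int next = current + 1;       // compute the new value
        if (compareAndSet(current, next))
            return next;              // no other thread interfered
        // another thread changed value in between: retry
    }
}

public final boolean compareAndSet(int expect, int update) {
    // atomically: if value == expect, set value = update and return true
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}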

 

Without any lock, AtomicInteger relies on the volatile keyword to make value visible across threads, which is why get() always returns the latest value in memory.

The increment (the equivalent of ++i) is implemented with CAS. Each iteration reads the latest value from memory and adds 1 to it. When writing the result back, it first compares the current value in memory with the value it read before the addition. If they differ, another thread has modified the value in the meantime, so the write fails and the loop retries until it succeeds.
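The retry loop described above can be reproduced outside the JDK as a small experiment. The sketch below is a hypothetical CasCounter class (not JDK code) that copies the read, compute, compareAndSet, retry pattern of incrementAndGet() and runs it from several threads. If the loop is correct, no increment is lost even though no lock is ever taken.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Same pattern as incrementAndGet(): read, compute, CAS, retry on failure.
    int increment() {
        for (;;) {
            int current = value.get();     // latest value (volatile read)
            int next = current + 1;        // value we want to publish
            if (value.compareAndSet(current, next)) {
                return next;               // nobody changed value in between
            }
            // another thread won the race; loop and try again
        }
    }

    int get() {
        return value.get();
    }

    // Starts `threads` threads that each call increment() `perThread` times,
    // waits for all of them, and returns the final count.
    static int runConcurrently(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // 4 threads x 100,000 increments: expected total is 400000
        System.out.println(runConcurrently(4, 100_000));
    }
}
```

Replacing the CAS loop with a plain `value++` on an ordinary int field would typically print a smaller number, because concurrent read-modify-write sequences overwrite each other.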

 

compareAndSet delegates to the compareAndSwapInt method of the Unsafe class. compareAndSwapInt is a native method invoked through the Java Native Interface (JNI), which calls the C++ implementation appropriate to the platform the JDK runs on. In OpenJDK, for example, this native method calls, in turn, into unsafe.cpp, atomic.cpp and atomic_windows_x86.inline.hpp; the last one is located at openjdk7\hotspot\src\os_cpu\windows_x86\vm\atomic_windows_x86.inline.hpp. Here is the relevant snippet:

 

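// Adding a lock prefix to an instruction on MP machine
// VC++ doesn't like the lock prefix to be on a single line
// so we can't insert a label after the lock prefix.
// By emitting a lock prefix, we can define a label after it.
#define LOCK_IF_MP(mp) __asm cmp mp, 0  \
                       __asm je L0      \
                       __asm _emit 0xF0 \
                       __asm L0:

inline jint Atomic::cmpxchg(jint exchange_value, volatile jint* dest, jint compare_value) {
  // alternative for InterlockedCompareExchange
  int mp = os::is_MP();              // non-zero on a multiprocessor machine
  __asm {
    mov edx, dest                    // edx = address of the shared variable
    mov ecx, exchange_value          // ecx = new value to store
    mov eax, compare_value           // eax = expected value (cmpxchg compares with eax)
    LOCK_IF_MP(mp)                   // 0xF0 is the lock prefix; skipped when mp == 0
    cmpxchg dword ptr [edx], ecx     // if *dest == eax then *dest = ecx, else eax = *dest
  }
  // the old value of *dest is left in eax, which VC++ uses as the return value
}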

 

The source above shows that CAS is implemented by the processor's cmpxchg instruction. Depending on whether the current machine is a multiprocessor (os::is_MP()), the program decides whether to add a lock prefix to the cmpxchg instruction (LOCK_IF_MP). On a single-core processor the lock prefix is omitted, because a single processor already maintains sequential consistency for its own operations and does not need the memory barrier effect that the lock prefix provides. The lock prefix has three effects:

1. The lock prefix locks either the bus or the processor's internal cache, so other processors cannot read or write the memory area the instruction accesses. This keeps the execution of the instruction atomic.

2. An instruction with the lock prefix cannot be reordered with the read and write instructions that precede or follow it.

3. The lock prefix immediately flushes all data in the write buffer to main memory.

By tracing CAS down to its platform-specific implementation, we can now see how it guarantees the atomicity of an operation.

 

A note on bus locks and cache locks
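1. Early processors could guarantee atomicity only through bus locking. A bus lock uses the LOCK# signal provided by the processor: while one processor asserts this signal on the bus, requests from the other processors are blocked, so that processor has exclusive use of the shared memory. Clearly, this is expensive.
2. Cache locking is the improved scheme. At any given moment we only need to guarantee that operations on one particular memory address are atomic, but bus locking locks all communication between the CPU and memory, so while the lock is held, other processors cannot operate on data at any other address either. Because bus locking is so costly, recent processors optimize by using cache locking instead of bus locking in some situations.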

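However, there are two cases in which the processor does not use cache locking. First, when the data being operated on cannot be cached inside the processor, or spans multiple cache lines, the processor falls back to bus locking. Second, some processors do not support cache locking at all: Intel 486 and Pentium processors use bus locking even when the locked memory region is in the processor's cache line.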

  

Drawbacks of CAS

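1. The ABA problem. CAS checks whether a value has changed when operating on it and updates it only if it has not. But if a value is A, changes to B, and then changes back to A, a CAS check will conclude that it has not changed, even though it actually has. The solution to ABA is a version number: attach a version to the variable and increment it on every update, so that A-B-A becomes 1A-2B-3A.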

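Since Java 1.5, the JDK's atomic package has provided the class AtomicStampedReference to solve the ABA problem. Its compareAndSet method first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp; if both are equal, it atomically sets the reference and the stamp to the given new values.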
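A short sketch of both situations, assuming the thread interleaving is simulated in a single thread so the outcome is deterministic (AbaDemo and its method names are illustrative). The first method shows a plain AtomicReference being fooled by A-B-A; the second shows AtomicStampedReference rejecting the same stale update because the stamp has moved from 1 to 3.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicStampedReference;

class AbaDemo {
    // Plain CAS: a "slow" thread reads A, another thread changes A -> B -> A,
    // and the slow thread's CAS still succeeds. String literals are interned,
    // so the reference comparison (==) used by CAS matches them here.
    static boolean plainCasSucceeds() {
        AtomicReference<String> ref = new AtomicReference<>("A");
        String seen = ref.get();               // slow thread reads A

        ref.compareAndSet("A", "B");           // other thread: A -> B
        ref.compareAndSet("B", "A");           // other thread: B -> A

        return ref.compareAndSet(seen, "C");   // succeeds: the change is invisible
    }

    // Stamped CAS: the same interleaving, but every update bumps a stamp,
    // so A-B-A becomes 1A-2B-3A and the stale CAS is rejected.
    static boolean stampedCasSucceeds() {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 1);
        int[] stampHolder = new int[1];
        String seen = ref.get(stampHolder);    // slow thread reads A with stamp 1
        int seenStamp = stampHolder[0];

        ref.compareAndSet("A", "B", 1, 2);     // other thread: 1A -> 2B
        ref.compareAndSet("B", "A", 2, 3);     // other thread: 2B -> 3A

        // The reference matches, but stamp 1 != 3, so this CAS fails.
        return ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println("plain CAS succeeded:   " + plainCasSucceeds());   // true
        System.out.println("stamped CAS succeeded: " + stampedCasSucceeds()); // false
    }
}
```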

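2. Long spinning is expensive. If a spinning CAS keeps failing for a long time, it places a heavy execution burden on the CPU. Efficiency improves somewhat if the JVM can use the pause instruction provided by the processor. The pause instruction has two effects. First, it delays the pipelined execution of instructions (de-pipeline) so that the CPU does not consume too many execution resources; the length of the delay depends on the specific implementation, and on some processors it is zero. Second, it avoids the CPU pipeline flush caused by a memory order violation when exiting the loop, which improves CPU efficiency.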
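Since Java 9 (JEP 285), the JDK exposes a spin-wait hint, Thread.onSpinWait(), which HotSpot on x86 may implement with the pause instruction described above. This API is newer than the JDK 7 code discussed in this article, so treat the following as a sketch assuming Java 9 or later; addWithSpinHint is a hypothetical helper name.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SpinHintDemo {
    // Hypothetical helper: a CAS retry loop that issues a spin-wait hint
    // after every failed attempt (requires Java 9+ for Thread.onSpinWait).
    static int addWithSpinHint(AtomicInteger value, int delta) {
        for (;;) {
            int current = value.get();
            int next = current + delta;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Tell the runtime/CPU we are busy-waiting; on x86 this may become PAUSE.
            Thread.onSpinWait();
        }
    }

    public static void main(String[] args) {
        AtomicInteger v = new AtomicInteger(10);
        System.out.println(addWithSpinHint(v, 5)); // 15
    }
}
```

The hint does not bound the loop: if contention is severe, a spinning thread can still retry for a long time, which is why heavily contended code often falls back to a lock.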

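3. Atomicity is guaranteed for only one shared variable. When operating on a single shared variable, we can use a CAS loop to guarantee atomicity, but a CAS loop cannot make an operation on several shared variables atomic. In that case you can use a lock, or use a trick: merge the shared variables into one and operate on that. For example, given two shared variables i=2 and j=a, merge them into ij=2a and then use CAS on ij. Since Java 1.5, the JDK has provided the AtomicReference class to guarantee atomicity for object references, so you can put multiple variables into one object and perform CAS on that object.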
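Both workarounds can be sketched in a few lines. All helper names here (pack, high, low, incrementBoth, Range, setLower) are hypothetical. The first part packs two int variables into one long so that a single AtomicLong CAS updates both; the second keeps two fields in an immutable object and swaps the whole object with AtomicReference.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

class MultiVarCas {
    // Approach 1: merge i and j into one long ("ij"), i in the high 32 bits, j in the low 32 bits.
    static long pack(int i, int j) {
        return ((long) i << 32) | (j & 0xFFFFFFFFL);
    }
    static int high(long packed) { return (int) (packed >>> 32); }
    static int low(long packed)  { return (int) packed; }

    // Increments i and j together with a single CAS on the packed long.
    static void incrementBoth(AtomicLong ij) {
        for (;;) {
            long current = ij.get();
            long next = pack(high(current) + 1, low(current) + 1);
            if (ij.compareAndSet(current, next)) return;
        }
    }

    // Approach 2: several fields in one immutable object behind an AtomicReference.
    static final class Range {
        final int lower;
        final int upper;
        Range(int lower, int upper) { this.lower = lower; this.upper = upper; }
    }

    // Sets the lower bound only if it does not exceed the current upper bound;
    // the check and the update use one consistent snapshot of both fields.
    static boolean setLower(AtomicReference<Range> range, int lower) {
        for (;;) {
            Range current = range.get();
            if (lower > current.upper) return false;
            if (range.compareAndSet(current, new Range(lower, current.upper))) return true;
        }
    }

    public static void main(String[] args) {
        AtomicLong ij = new AtomicLong(pack(2, 7));
        incrementBoth(ij);
        System.out.println(high(ij.get()) + " " + low(ij.get())); // 3 8

        AtomicReference<Range> range = new AtomicReference<>(new Range(0, 10));
        System.out.println(setLower(range, 5));  // true
        System.out.println(setLower(range, 20)); // false
    }
}
```

Because Range is immutable, a successful CAS always publishes a fully constructed pair, so no thread can observe a new lower bound combined with a stale upper bound.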

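4. Local latency caused by bus storms. In the previous chapter's introduction to biased locking, we mentioned that CAS instructions incur local latency. What exactly does that mean? In a multiprocessor architecture, all processors share one bus that connects them to main memory, and each processor core has its own cache. The cores are arranged symmetrically with respect to the bus, so this structure is called a symmetric multiprocessor (SMP). When data from main memory is present in several processors' caches at once and one processor updates its cached copy, the other processors' copies are invalidated over the bus, forcing them to reload the latest data from main memory over the bus. This back-and-forth communication over the bus is called cache coherence traffic. Because the bus has a fixed communication capacity, excessive cache coherence traffic turns the bus into a bottleneck. CAS happens to generate cache coherence traffic: if many threads share the same object, every successful CAS by one core inevitably triggers a bus storm. This is the so-called local latency, and biased locking exists precisely to eliminate CAS and reduce cache coherence traffic.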

 

 

例外之言 wrote:
In fact, not every CAS causes a bus storm; this depends on the cache coherence protocol. For details, see: http://blogs.oracle.com/dave/entry/biased_locking_in_hotspot

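NUMA (Non-Uniform Memory Access Architecture):
The counterpart to SMP is the asymmetric multiprocessor architecture, used today mainly in some high-end processors. Its main characteristics are that there is no shared bus and no common main memory; each core has its own memory.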
