[In-depth understanding of JVM] 3. CPU storage + MESI + CPU false sharing + CPU out-of-order execution with code demonstrations + volatile + synchronized [interview essential]

Previous post: [In-depth understanding of JVM] 1. How are classes loaded in the JVM? Parent delegation, mixed mode, and a code demonstration [interview essential]

1. The memory hierarchy

2. Cache line

Shared variables are stored in the CPU cache with the cache line as the basic unit, so one cache line may hold several variables (as many as fit in the line's byte capacity). Because the cache line is also the smallest unit of cache modification and coherence traffic, this gives rise to the false sharing problem.

A cache line can be understood as the smallest unit of the CPU cache. Modern CPUs no longer access memory byte by byte; they fetch it in 64-byte chunks called cache lines. When you read a specific memory address, the entire cache line containing it is swapped from main memory into the cache, so subsequently accessing the other values in the same cache line is very cheap.
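To make the 64-byte granularity tangible, here is a minimal sketch (my own example, not from the original article) that walks the same matrix in row-major and column-major order. The row-major walk touches each cache line once; the column-major walk jumps across lines and misses far more often:

import java.util.concurrent.TimeUnit;

public class CacheLineDemo {
    private static final int N = 2048;                 // 2048 * 2048 longs = 32 MB
    private static final long[][] matrix = new long[N][N];

    public static void main(String[] args) {
        long sum = 0;

        long start = System.nanoTime();
        for (int row = 0; row < N; row++)              // sequential walk: one miss per 64-byte line
            for (int col = 0; col < N; col++)
                sum += matrix[row][col];
        System.out.println("row-major (ms): " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));

        start = System.nanoTime();
        for (int col = 0; col < N; col++)              // strided walk: nearly every access misses
            for (int row = 0; row < N; row++)
                sum += matrix[row][col];
        System.out.println("col-major (ms): " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));

        System.out.println(sum);                       // keep sum observable so the loops are not eliminated
    }
}

On typical hardware the column-major pass is noticeably slower, purely because of how well each fetched cache line is utilized.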

3. Why does false sharing occur?

As shown in the figure below: with two threads T1 and T2, suppose the shared variables X and Y sit in the same cache line. When CPU1 modifies X, the copies of X and Y in CPU2's cache are invalidated together. From the point of view of the thread on CPU1 only X was modified, yet every variable in that cache line becomes invalid and the line must be refreshed (this does not necessarily mean reloading from main memory each time; the data may also come from another CPU's cache, depending on the chip vendor's implementation). If the thread on CPU2 happens to want to modify Y at the same moment, the two CPUs keep invalidating each other's copy and compete for the line. This is false sharing.

 4. How to solve false sharing?

  • Cache line alignment/padding can improve efficiency: pad hot variables so that each one occupies its own cache line.
  • Modern CPUs implement data consistency through a combination of cache locking (MESI) and bus locking (data that is too large to cache, or that spans multiple cache lines, must fall back to a bus lock).
  • Since JDK 1.8, the @sun.misc.Contended annotation can be used:
    @sun.misc.Contended
    public static class T {
        // 8 bytes
        private volatile long x = 0L;
    }

Note: this annotation has no effect by default outside the JDK itself; you need to add -XX:-RestrictContended when starting the JVM.

Code verification:

T01_CacheLinePadding measures the cost when the two variables share a cache line. Elapsed time: about 180 ms.

public class T01_CacheLinePadding {
    private static class T {
        public volatile long x = 0L;
    }

    // The two T objects are allocated back to back on the heap, so their
    // x fields are likely to fall into the same 64-byte cache line.
    public static T[] arr = new T[2];

    static {
        arr[0] = new T();
        arr[1] = new T();
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(()->{
            for (long i = 0; i < 1000_0000L; i++) {
                arr[0].x = i;
            }
        });

        Thread t2 = new Thread(()->{
            for (long i = 0; i < 1000_0000L; i++) {
                arr[1].x = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println((System.nanoTime() - start)/100_0000);
    }
}

In T02_CacheLinePadding we separate the two objects into different cache lines: the seven long fields inherited from the parent class put 56 bytes of padding in front of each x. Elapsed time: about 60 ms.

public class T02_CacheLinePadding {
    private static class Padding {
        // 7 longs = 56 bytes of padding placed before x in the subclass,
        // so neighboring objects' x fields cannot share a 64-byte cache line
        public volatile long p1, p2, p3, p4, p5, p6, p7;
    }

    private static class T extends Padding {
        public volatile long x = 0L;
    }

    public static T[] arr = new T[2];

    static {
        arr[0] = new T();
        arr[1] = new T();
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(()->{
            for (long i = 0; i < 1000_0000L; i++) {
                arr[0].x = i;
            }
        });

        Thread t2 = new Thread(()->{
            for (long i = 0; i < 1000_0000L; i++) {
                arr[1].x = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println((System.nanoTime() - start)/100_0000);
    }
}

The test result is clear: T02_CacheLinePadding runs in far less time than T01_CacheLinePadding. The padding costs a little extra memory but buys a large gain in efficiency.
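Alternatively, the manual padding can be replaced by the @sun.misc.Contended annotation mentioned above. A minimal sketch (the class name T03_CacheLinePadding is my own; on JDK 8 the annotation only takes effect when run with java -XX:-RestrictContended):

public class T03_CacheLinePadding {
    // @sun.misc.Contended asks the JVM to pad this class's fields onto their
    // own cache line; outside the JDK it only works with -XX:-RestrictContended.
    @sun.misc.Contended
    private static class T {
        public volatile long x = 0L;
    }

    public static T[] arr = new T[]{new T(), new T()};

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> { for (long i = 0; i < 1000_0000L; i++) arr[0].x = i; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < 1000_0000L; i++) arr[1].x = i; });
        final long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println((System.nanoTime() - start) / 100_0000); // elapsed ms
    }
}

With the flag enabled, the timing should be close to the manually padded T02 version.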

5. Data consistency at the hardware level: MESI (cache lock)

MESI (Modified, Exclusive, Shared, Invalid), also known as the Illinois protocol because it originated at the University of Illinois, is a widely used cache coherency protocol that supports write-back caches.

1. The states in the MESI protocol

Each cache line in a CPU cache is marked with one of four states (encoded in two extra bits):

Modified

The cache line is present only in this CPU's cache and has been modified (it is dirty), i.e. it differs from the data in main memory. The line must be written back to main memory at some point in the future, before any other CPU reads the corresponding address from main memory.

After being written back to main memory, the state of the cache line becomes Exclusive.

Exclusive

The cache line is present only in this CPU's cache, has not been modified (it is clean), and matches the data in main memory. The state can change to Shared at any moment when another CPU reads the same memory.

Likewise, when this CPU modifies the content of the line, the state changes to Modified.

Shared

The cache line may be cached by multiple CPUs, and each cached copy matches main memory (clean). When one CPU modifies the line, the copies in the other CPUs' caches are invalidated (they change to the Invalid state).

Invalid

The cached copy is invalid (another CPU may have modified the cache line); the data must be fetched again from memory (or from another cache).
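MESI lives in hardware, but as a mental model the transitions above can be summarized in a few lines of Java (a hypothetical sketch for intuition, not how any real CPU is programmed):

public class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // This CPU writes to the line: any successful write leaves it dirty.
    static State onLocalWrite(State s) {
        return State.MODIFIED;
    }

    // Another CPU reads the same address: a Modified line is written back
    // and then shared; Exclusive/Shared lines simply become Shared.
    static State onRemoteRead(State s) {
        return s == State.INVALID ? State.INVALID : State.SHARED;
    }

    // Another CPU writes the same address: our copy becomes stale.
    static State onRemoteWrite(State s) {
        return State.INVALID;
    }
}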

6. The CPU out-of-order execution problem

For details, please refer to: CPU out-of-order execution and its problems

If the memory a CPU needs during execution is not in the cache, the CPU must fetch it from main memory over the memory bus. Waiting for the data to return takes roughly the time of hundreds of instructions, at least two orders of magnitude longer than executing a single instruction. What does the CPU do in the meantime?

The answer: the CPU continues executing other instructions that are ready to run. For example, given the instruction sequence instruction 1, instruction 2, instruction 3, ..., if instruction 1 has to access main memory, then before its data comes back the CPU moves on to "independent instructions" that have no dependence on instruction 1. The CPU generally judges this independence from the memory-reference relationships between instructions (see each CPU vendor's documentation for details). This is one of the root causes of out-of-order instruction execution.

In short: to improve instruction throughput, while one instruction is stalled (for example, reading from main memory, which is roughly 100 times slower), the CPU executes other instructions in parallel, provided there is no dependency between them.

It is a bit more complicated for writing data:

When the CPU executes a store instruction, it first tries to write the data to the L1 cache, the level closest to the CPU. If it misses in L1, it moves on to the next cache level. In terms of speed, L1 can roughly keep pace with the CPU, while every other level is markedly slower: L2 is about 20-30 times slower than the CPU, L2 can also miss, and reading from main memory costs even more cycles. In fact, after an L1 miss on a store, the CPU turns to another buffer, the write-combining buffer (the WC buffer). It is even faster than L1 and correspondingly expensive, so it typically has only 4 slots. This technique is called write combining. While ownership of the target L2 cache line is still being acquired, the CPU writes the pending data into the WC buffer, whose slots are the size of a cache line, generally 64 bytes. The buffer lets the CPU continue executing other instructions while data is being written to or read from it, which softens the performance impact of a cache miss on writes.

These buffers become interesting when subsequent writes modify the same cache line: the writes can be merged in the buffer before being committed to the L2 cache. Each 64-byte buffer maintains a 64-bit bitfield; the bit for a byte is set whenever that byte is updated, so that when the buffer is flushed outward it is known which bytes are valid. Of course, if the program reads data that has already been written into the buffer, the buffer is consulted before the cache.

Even after these steps, the buffered data is still flushed to the outer cache (L2) after some delay. If we can fill a buffer slot as completely as possible before it is transferred, we make better use of the transfer buses at every level and thereby improve program performance.

Write-combining code verification:

/**
 * The WC buffer has only 4 slots.
 */
public final class WriteCombining {

    private static final int ITERATIONS = Integer.MAX_VALUE;
    private static final int ITEMS = 1 << 24;
    private static final int MASK = ITEMS - 1;

    private static final byte[] arrayA = new byte[ITEMS];
    private static final byte[] arrayB = new byte[ITEMS];
    private static final byte[] arrayC = new byte[ITEMS];
    private static final byte[] arrayD = new byte[ITEMS];
    private static final byte[] arrayE = new byte[ITEMS];
    private static final byte[] arrayF = new byte[ITEMS];

    public static void main(final String[] args) {

        for (int i = 1; i <= 3; i++) {
            System.out.println(i + " SingleLoop duration (ns) = " + runCaseOne());
            System.out.println(i + " SplitLoop  duration (ns) = " + runCaseTwo());
        }
    }

    public static long runCaseOne() {
        long start = System.nanoTime();
        int i = ITERATIONS;

        while (--i != 0) {
            int slot = i & MASK;
            byte b = (byte) i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }
        return System.nanoTime() - start;
    }

    public static long runCaseTwo() {
        long start = System.nanoTime();
        int i = ITERATIONS;
        while (--i != 0) {
            int slot = i & MASK;
            // the write of b here takes up one slot
            byte b = (byte) i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
        }
        i = ITERATIONS;
        while (--i != 0) {
            int slot = i & MASK;
            // the write of b here takes up one slot
            byte b = (byte) i;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }
        return System.nanoTime() - start;
    }
}

The results show that the split-loop case is faster, because it makes full use of the write-combining technique.

Why is writing 6 arrays per loop slower?

Because the WC buffer has only 4 slots: 6 = 4 + 2. Four of the writes can be combined in one pass through the buffer, but the remaining two must wait for slots to free up and take an extra pass, which wastes efficiency.

 

Proof of out-of-order execution:

It may take a long time before the loop terminates with a result:

public class T04_Disorder {
    private static int x = 0, y = 0;
    private static int a = 0, b =0;

    public static void main(String[] args) throws InterruptedException {
        int i = 0;
        for(;;) {
            i++;
            x = 0; y = 0;
            a = 0; b = 0;
            Thread one = new Thread(new Runnable() {
                public void run() {
                    // Since thread one starts first, the line below makes it wait a bit for thread two.
                    // Readers can tune the wait time to their machine's actual performance.
                    //shortWait(100000);
                    a = 1;
                    x = b;
                }
            });

            Thread other = new Thread(new Runnable() {
                public void run() {
                    b = 1;
                    y = a;
                }
            });
            one.start();other.start();
            one.join();other.join();
            String result = "Iteration " + i + ": (" + x + "," + y + ")";
            if(x == 0 && y == 0) {
                System.err.println(result);
                break;
            } else {
                //System.out.println(result);
            }
        }
    }


    public static void shortWait(long interval){
        long start = System.nanoTime();
        long end;
        do{
            end = System.nanoTime();
        }while(start + interval >= end);
    }
}
Sample output: Iteration 1342606: (0,0)

How can we prevent reordering in the situations where it matters?

1. Hardware memory barriers (on x86)

  • sfence (store fence): all write operations issued before the sfence instruction must complete before any write operation issued after it. It is a store (write) barrier.
  • lfence (load fence): all read operations issued before the lfence instruction must complete before any read operation issued after it. It is a load (read) barrier.
  • mfence (memory fence): all reads and writes issued before the mfence instruction must complete before any read or write issued after it. It is a full barrier, combining the effects of lfence and sfence.
  • Atomic instructions: on x86, the lock prefix acts as a full barrier. lock is not itself a memory barrier, but it has a similar effect: it locks the CPU bus and cache, which can be understood as a lock at the CPU-instruction level. While the locked instruction executes, the memory subsystem is locked to guarantee ordering, even across multiple CPUs. Software locks usually rely on memory barriers or atomic instructions to achieve visibility and preserve program order. The lock prefix can precede instructions such as ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. See the example after this list.
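To connect the lock prefix back to Java (my own example, not from the original article): AtomicInteger.getAndIncrement() is JIT-compiled by HotSpot on x86 to a lock xadd instruction, exactly the kind of lock-prefixed atomic read-modify-write described in the last bullet:

import java.util.concurrent.atomic.AtomicInteger;

public class LockPrefixDemo {
    private static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable inc = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.getAndIncrement(); // on x86 HotSpot this becomes "lock xadd"
            }
        };
        Thread a = new Thread(inc), b = new Thread(inc);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(counter.get()); // always 200000: the lock prefix makes the increment atomic
    }
}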

2. How the JVM specifies barriers (JSR-133)

(These barriers are logical constructs; hardware memory barriers are real. The JVM only defines a specification, and the actual implementation depends on the particular virtual machine and CPU.)

  1. LoadLoad barrier:
  • For a sequence Load1; LoadLoad; Load2,
  • before the data read by Load2 and subsequent read operations is accessed, the data read by Load1 is guaranteed to have been read.
  2. StoreStore barrier:
  • For a sequence Store1; StoreStore; Store2,
  • before Store2 and subsequent write operations execute, Store1's write is guaranteed to be visible to other processors.
  3. LoadStore barrier:
  • For a sequence Load1; LoadStore; Store2,
  • before Store2 and subsequent write operations are flushed out, the data read by Load1 is guaranteed to have been read.
  4. StoreLoad barrier:
  • For a sequence Store1; StoreLoad; Load2,
  • before Load2 and all subsequent read operations execute, Store1's write is guaranteed to be visible to all processors. (A Java-level sketch follows this list.)
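These four families are also reachable from Java code: JDK 8's sun.misc.Unsafe exposes loadFence(), storeFence() and fullFence(). A minimal sketch (Unsafe's singleton field is private, so reflection is needed):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class FenceDemo {
    public static void main(String[] args) throws Exception {
        // The theUnsafe field is private static final; grab it reflectively (JDK 8).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe u = (Unsafe) f.get(null);

        u.loadFence();  // roughly LoadLoad + LoadStore
        u.storeFence(); // roughly StoreStore + LoadStore
        u.fullFence();  // full barrier, including StoreLoad
    }
}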

Implementation details of volatile

Many articles explaining volatile are muddled, so let's analyze it at the bytecode level, the JVM level, and the hardware level.

1. Bytecode level

(Looking at the compiled bytecode file) the only change is the ACC_VOLATILE flag on the field:

public class TestVolatile {
    int i;
    volatile int j;
}

Bytecode:
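For reference, compiling this class and running javap -v TestVolatile prints something along these lines for the two fields (abbreviated; exact formatting varies by JDK version):

  int i;
    descriptor: I
    flags:

  volatile int j;
    descriptor: I
    flags: ACC_VOLATILE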

2. JVM level

The JVM surrounds volatile reads and writes with memory barriers.

For the barrier definitions, see "How the JVM specifies barriers (JSR-133)" above.

StoreStoreBarrier
volatile write operation
StoreLoadBarrier

LoadLoadBarrier
volatile read operation
LoadStoreBarrier
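A sketch (my own illustration of the layout above) of where the JVM conceptually inserts these barriers around a volatile field:

class VolatileBarrierSketch {
    volatile long v;
    long plain;

    void write(long x) {
        plain = 1L;     // ordinary store
        // --- StoreStoreBarrier ---
        v = x;          // volatile write
        // --- StoreLoadBarrier ---
    }

    long read() {
        long r = v;     // volatile read
        // --- LoadLoadBarrier ---
        // --- LoadStoreBarrier ---
        return r + plain;
    }
}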

3. OS and hardware level

Observing this level requires tools.

If you want to learn more, you can read this article: https://blog.csdn.net/qq_26222859/article/details/52235930

Use hsdis (the HotSpot disassembler) to observe the assembly code the JIT emits. On Windows/x86, a volatile write is implemented with a lock-prefixed instruction, which locks the memory region while the instruction executes; MESI then keeps the caches coherent.

synchronized implementation details

1. Bytecode level

Synchronized method: the ACC_SYNCHRONIZED flag.

Synchronized block: monitorenter / monitorexit.

public class TestSync {
    synchronized void m() {

    }

    void n() {
        synchronized (this) {

        }
    }

    public static void main(String[] args) {

    }
}

monitorenter: acquires the monitor on entry.

The first monitorexit: the normal exit path.

The second monitorexit: generated by the compiler so the monitor is still released if an exception is thrown. See the javap sketch below.
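A sketch of what javap -c TestSync typically shows for the block-synchronized method n() (offsets may vary slightly; the // annotations are mine, javap does not print them):

  void n();
    Code:
       0: aload_0
       1: dup
       2: astore_1
       3: monitorenter        // acquire the monitor on this
       4: aload_1
       5: monitorexit         // first monitorexit: normal exit
       6: goto          14
       9: astore_2
      10: aload_1
      11: monitorexit         // second monitorexit: exception path
      12: aload_2
      13: athrow
      14: return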

2. JVM level

The JVM's C/C++ implementation calls the synchronization mechanisms provided by the operating system.

3. OS and hardware level

x86: the lock prefix applied to an instruction, e.g. lock cmpxchg (lock provides the locking; the instruction that follows performs the modification). A CAS sketch follows.
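For illustration (my own snippet, not from the original article): the compare-and-swap primitive that java.util.concurrent builds on corresponds to lock cmpxchg on x86:

import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    public static void main(String[] args) {
        AtomicLong v = new AtomicLong(0);
        long old;
        do {
            old = v.get();
            // compareAndSet is JIT-compiled to "lock cmpxchg" on x86
        } while (!v.compareAndSet(old, old + 1));
        System.out.println(v.get()); // 1
    }
}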

(Saved for now; to be organized later.)

Picture reference: https://blog.csdn.net/baidu_38083619/article/details/82527461?spm=1001.2014.3001.5506

Details: https://blog.csdn.net/21aspnet/article/details/88571740

Next article: [In-depth understanding of JVM] 4. The object creation process, what exactly the object header contains, and object alignment [interview essential]


Source: https://blog.csdn.net/zw764987243/article/details/109502616