To build a high-performance queue, you need to understand the fundamentals!

Foreword

This article is part of the album http://dwz.win/HjK ; click through to unlock more data structures and algorithms content.

Hello, my name is Tong.

In the previous section, we learned how to rewrite recursion as non-recursive code, where the main data structure used was the stack.

Stacks and queues are arguably the most fundamental data structures after arrays and linked lists. They show up in many scenarios, and we will keep running into them.

Today, I want to show how to build a high-performance queue in Java, and the underlying knowledge we need to master along the way.

Readers working in other languages can also use this as a reference for building high-performance queues in their own language.

Queue

A queue is a first-in-first-out (FIFO) data structure, similar to queuing in real life: whoever arrives first is served first.
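As a quick illustration of the contract (using the JDK's ArrayDeque only as a stand-in), elements leave the queue in exactly the order they entered:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class FifoDemo {
    // drain the queue and record the order in which elements come out
    static String drainOrder() {
        Queue<String> queue = new ArrayDeque<>();
        queue.offer("first");   // first to arrive...
        queue.offer("second");
        queue.offer("third");
        // ...is first to be served
        return queue.poll() + "," + queue.poll() + "," + queue.poll();
    }

    public static void main(String[] args) {
        System.out.println(drainOrder()); // prints "first,second,third"
    }
}
```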

We have already covered implementing simple queues with arrays and linked lists, so I won't repeat that here. Interested readers can follow the link below:

Revisit the four basic data structures: arrays, linked lists, queues and stacks

Today we mainly learn how to implement high-performance queues.

By a high-performance queue, we of course mean a queue that works well in highly concurrent environments. "Well" here mainly means two things: concurrency safety and good performance.

Concurrency-safe queue

In Java, the JDK already provides some concurrency-safe queues out of the box:

| Queue | Boundedness | Lock | Data structure |
| --- | --- | --- | --- |
| ArrayBlockingQueue | bounded | lock | array |
| LinkedBlockingQueue | optionally bounded | lock | linked list |
| ConcurrentLinkedQueue | unbounded | lock-free | linked list |
| SynchronousQueue | unbounded | lock-free | queue or stack |
| LinkedTransferQueue | unbounded | lock-free | linked list |
| PriorityBlockingQueue | unbounded | lock | heap |
| DelayQueue | unbounded | lock | heap |

> For a shortcut to source-code analyses of these queues, see my Java concurrent collections series.

To sum up, the main data structures used to implement concurrency-safe queues are arrays, linked lists, and heaps.

From the perspective of boundedness, only ArrayBlockingQueue and LinkedBlockingQueue can be bounded; the others are all unbounded.

From the perspective of locking, ArrayBlockingQueue and LinkedBlockingQueue both use locks, while the others rely on lock-free CAS techniques.

From a safety perspective, we generally choose a bounded queue, to prevent a producer that runs too fast from exhausting memory.
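For example, with the JDK's bounded ArrayBlockingQueue, a non-blocking offer simply fails once the capacity is reached, instead of letting a fast producer grow the queue without limit (a minimal sketch):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BoundedQueueDemo {
    // returns whether a third offer succeeds on a queue with capacity 2
    static boolean thirdOfferSucceeds() {
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);
        queue.offer(1);
        queue.offer(2);
        return queue.offer(3); // capacity reached: offer returns false
    }

    public static void main(String[] args) {
        System.out.println(thirdOfferSucceeds()); // prints "false"
    }
}
```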

From a performance perspective, we generally prefer a lock-free approach, to reduce the cost of thread context switching.

From the JVM's perspective, we generally prefer an array-based implementation, because a linked list constantly allocates and discards nodes, triggering frequent garbage collection, which is another source of performance loss.

So the ideal combination is: array + bounded + lock-free.

The JDK does not provide such a queue, so many open-source frameworks implement their own high-performance queues, such as Disruptor and the JCTools library used by Netty.

High-performance queue

We won't discuss any specific framework here; instead, we'll cover the general techniques for implementing high-performance queues and then implement one ourselves.

Ring array

From the discussion above, we know that the only suitable data structure for a high-performance queue is the array, and an array-based queue must be implemented as a circular (ring) array.

A circular array is generally implemented with two pointers: putIndex and takeIndex, or writeIndex and readIndex, one for writing and one for reading.

When the write pointer reaches the end of the array, it wraps around to the beginning, but it must not overtake the read pointer. Likewise, when the read pointer reaches the end of the array, it wraps around too, but it must not read data that has not been written yet.

When the write pointer and the read pointer coincide, you cannot tell whether the queue is full or empty, so a size field is generally added to disambiguate.

Therefore, the data structure of a queue implemented with a circular array generally looks like this:

public class ArrayQueue<T> {
    private T[] array;
    private long writeIndex;
    private long readIndex;
    private long size;
}
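To make the wrap-around behavior concrete, here is a minimal single-threaded sketch built on exactly this structure (the capacity must be a power of two so that `index & (capacity - 1)` can replace the modulo; everything beyond the four fields above is my own illustration):

```java
public class ArrayQueueSketch<T> {
    private final Object[] array;
    private long writeIndex;
    private long readIndex;
    private long size;

    public ArrayQueueSketch(int capacity) {
        this.array = new Object[capacity]; // capacity must be a power of two
    }

    public boolean put(T t) {
        if (size == array.length) return false;             // full
        array[(int) (writeIndex & (array.length - 1))] = t; // wraps past the end
        writeIndex++;
        size++;
        return true;
    }

    @SuppressWarnings("unchecked")
    public T take() {
        if (size == 0) return null;                         // empty
        T t = (T) array[(int) (readIndex & (array.length - 1))];
        readIndex++;
        size--;
        return t;
    }
}
```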

In a single-threaded program this is fine, but in a multi-threaded environment it introduces a serious false-sharing problem.

False sharing

What is sharing?

A computer has many kinds of storage. The one we deal with most is memory, also called main memory. In addition, the CPU has three levels of cache: L1, L2, and L3. L1 sits closest to the CPU and is also the smallest; L2 is slightly larger; L3 is the largest and can cache data for multiple cores at once. When the CPU fetches data, it first looks in the L1 cache; on a miss it tries L2, then L3, and finally falls back to main memory. The farther from the CPU core, the longer the access takes. So for very frequent operations, try to keep the data in L1; that can greatly improve performance.

Cache line

The caches do not store data one value at a time; they fetch a batch at a time. This batch is called a cache line (Cache Line), and it is usually 64 bytes.

Every time the CPU fetches data from main memory, it also brings along the data next to it (64 bytes in total). Take a long array as an example: when the CPU loads one long from the array, it also loads the following seven longs into the same cache line.

This speeds up processing to a certain extent: while working on the element at index 0, you will probably process index 1 next, and fetching it straight from the cache is much faster.

However, this brings a new problem: false sharing.

False sharing

Imagine two threads (on two CPU cores) processing this array at the same time, each core having cached the line. One core adds 1 to array[0] while the other adds 1 to array[1]. When the results are written back to main memory, whose cache line should win? (Write-backs also happen a whole cache line at a time.) To stay correct, the hardware effectively serializes access to the line: one core modifies its data and writes the line back to main memory first, and only then can the other core re-read the line, modify its own data, and write it back. This back-and-forth inevitably costs performance. The phenomenon is called false sharing, and the coordination behind it involves cache coherence and memory barriers, which we won't dig into here.
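The access pattern just described is easy to reproduce (a sketch; it only shows the contended layout, since measuring the actual slowdown would need a benchmark harness such as JMH):

```java
public class FalseSharingDemo {
    // two counters adjacent in memory, hence (very likely) on the same cache line
    static long[] run() {
        final long[] counters = new long[2];
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) counters[0]++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) counters[1]++; });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counters;
    }

    public static void main(String[] args) {
        long[] c = run();
        // each thread owns its own slot, so the result is correct;
        // false sharing costs time, not correctness
        System.out.println(c[0] + ", " + c[1]); // prints "1000000, 1000000"
    }
}
```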

So, how to solve the problem caused by false sharing?

Taking the circular-array queue as an example, writeIndex, readIndex, and size are handled as follows: we simply insert 7 long fields between writeIndex and readIndex to push them onto separate cache lines, and do the same between readIndex and size.

This eliminates false sharing between writeIndex and readIndex. Since writeIndex and readIndex are always updated from two different threads, the performance gain from eliminating this false sharing is substantial.

If there are multiple producers, writeIndex will certainly be contended. So how do we update writeIndex gracefully? That is, once one producer thread updates writeIndex, the change must be immediately visible to the other producer threads.

Your first thought is probably volatile. Right, but volatile alone is not enough: volatile only guarantees visibility and ordering, not atomicity. So we also need an atomic CAS instruction. Who provides CAS? AtomicInteger and AtomicLong both have CAS operations, so should we just use them? Not quite: look closely and you'll find they are all ultimately implemented by calling Unsafe.
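The idea can be sketched with AtomicLong (which itself calls Unsafe under the hood): each producer retries until its compareAndSet wins, so no two producers ever claim the same index, and no increment is lost.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasClaimDemo {
    // every producer claims distinct, consecutive indices via a CAS retry loop
    static long runProducers(int threads, int claimsPerThread) {
        final AtomicLong writeIndex = new AtomicLong(0);
        Thread[] producers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            producers[i] = new Thread(() -> {
                for (int j = 0; j < claimsPerThread; j++) {
                    long current;
                    do {
                        current = writeIndex.get();  // volatile read: always the latest value
                    } while (!writeIndex.compareAndSet(current, current + 1)); // retry if another producer won
                }
            });
            producers[i].start();
        }
        for (Thread p : producers) {
            try { p.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return writeIndex.get();
    }

    public static void main(String[] args) {
        System.out.println(runProducers(4, 10_000)); // prints "40000": no increment was lost
    }
}
```

(In production code, AtomicLong's getAndIncrement does the same thing in one call; the explicit loop above just makes the CAS retry visible.)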

OK, now it's time for the ultimate low-level weapon: Unsafe.

Unsafe

Unsafe provides not only CAS instructions but also many other low-level capabilities, such as operating on direct memory, modifying private fields, instantiating classes, parking/unparking threads, and memory-barrier methods.

> For more about Unsafe, see this article: an analysis of the Java "magic class" Unsafe.

Of course, for building a high-performance queue, we mainly need Unsafe's CAS instruction and its memory-barrier-related methods:

// atomic CAS instruction
public final native boolean compareAndSwapLong(Object var1, long var2, long var4, long var6);
// volatile read: reads the value as if the field were declared volatile
public native long getLongVolatile(Object var1, long var2);
// ordered (lazy) write: the change is not flushed to main memory immediately,
// so another thread may not see it right away
public native void putOrderedLong(Object var1, long var2, long var4);

Well, that covers most of the background. Time to show the real thing: a handwritten high-performance queue.

Handwritten high-performance queue

Let's assume the following scenario: there are multiple producers (Multiple Producer) but only one consumer (Single Consumer). This is a classic scenario in Netty. How do we implement such a queue?

Go directly to the code:

/**
 * A multi-producer, single-consumer queue
 *
 * @param <T> the element type
 */
public class MpscArrayQueue<T> {

    long p01, p02, p03, p04, p05, p06, p07;
    // where the elements live
    private T[] array;
    long p1, p2, p3, p4, p5, p6, p7;
    // write pointer; there are multiple producers, so it must be volatile
    private volatile long writeIndex;
    long p11, p12, p13, p14, p15, p16, p17;
    // read pointer; there is only one consumer, so volatile is not needed
    private long readIndex;
    long p21, p22, p23, p24, p25, p26, p27;
    // element count; both producers and the consumer modify it, so it must be volatile
    private volatile long size;
    long p31, p32, p33, p34, p35, p36, p37;

    // the Unsafe instance
    private static final Unsafe UNSAFE;
    // base offset of the array
    private static final long ARRAY_BASE_OFFSET;
    // shift for the size of an array element
    private static final long ARRAY_ELEMENT_SHIFT;
    // offset of the writeIndex field
    private static final long WRITE_INDEX_OFFSET;
    // offset of the readIndex field
    private static final long READ_INDEX_OFFSET;
    // offset of the size field
    private static final long SIZE_OFFSET;

    static {
        Field f = null;
        try {
            // grab the Unsafe instance via reflection
            f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);

            // compute the base offset of the array
            ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(Object[].class);
            // compute the per-element shift;
            // simply put, on a 64-bit JVM a reference takes 4 bytes with
            // compressed oops and 8 bytes without
            int scale = UNSAFE.arrayIndexScale(Object[].class);
            if (4 == scale) {
                ARRAY_ELEMENT_SHIFT = 2;
            } else if (8 == scale) {
                ARRAY_ELEMENT_SHIFT = 3;
            } else {
                throw new IllegalStateException("unknown pointer size");
            }

            // compute the offset of writeIndex
            WRITE_INDEX_OFFSET = UNSAFE
                    .objectFieldOffset(MpscArrayQueue.class.getDeclaredField("writeIndex"));
            // compute the offset of readIndex
            READ_INDEX_OFFSET = UNSAFE
                    .objectFieldOffset(MpscArrayQueue.class.getDeclaredField("readIndex"));
            // compute the offset of size
            SIZE_OFFSET = UNSAFE
                    .objectFieldOffset(MpscArrayQueue.class.getDeclaredField("size"));
        } catch (Exception e) {
            // preserve the cause instead of swallowing it
            throw new RuntimeException(e);
        }
    }

    // constructor
    public MpscArrayQueue(int capacity) {
        // round up to the next power of two (overflow not handled)
        capacity = 1 << (32 - Integer.numberOfLeadingZeros(capacity - 1));
        // instantiate the array
        this.array = (T[]) new Object[capacity];
    }

    // produce an element
    public boolean put(T t) {
        if (t == null) {
            return false;
        }
        long size;
        long writeIndex;
        do {
            // re-read size on every iteration
            size = this.size;
            // the queue is full; return immediately
            if (size >= this.array.length) {
                return false;
            }

            // re-read writeIndex on every iteration
            writeIndex = this.writeIndex;

            // atomically bump writeIndex with CAS;
            // if the CAS fails, start over from the top
        } while (!UNSAFE.compareAndSwapLong(this, WRITE_INDEX_OFFSET, writeIndex, writeIndex + 1));

        // reaching here means the CAS above succeeded,
        // so store the element at the claimed writeIndex slot
        // and then update size
        long eleOffset = calcElementOffset(writeIndex, this.array.length - 1);
        // ordered (lazy) write: not flushed to main memory immediately
        UNSAFE.putOrderedObject(this.array, eleOffset, t);

        // keep retrying until the update succeeds
        do {
            size = this.size;
        } while (!UNSAFE.compareAndSwapLong(this, SIZE_OFFSET, size, size + 1));

        return true;
    }

    // consume an element
    public T take() {
        long size = this.size;
        // size 0 means the queue is empty; return immediately
        if (size <= 0) {
            return null;
        }
        // size > 0, so there must be an element;
        // with a single consumer there is no thread-safety concern here
        long readIndex = this.readIndex;
        // compute the offset of the element at the read pointer
        long offset = calcElementOffset(readIndex, this.array.length - 1);
        // read the element with volatile semantics,
        // forcing us to see the producer's latest write
        T e = (T) UNSAFE.getObjectVolatile(this.array, offset);

        // advance the read pointer
        UNSAFE.putOrderedLong(this, READ_INDEX_OFFSET, readIndex + 1);
        // decrement size
        do {
            size = this.size;
        } while (!UNSAFE.compareAndSwapLong(this, SIZE_OFFSET, size, size - 1));

        return e;
    }

    private long calcElementOffset(long index, long mask) {
        // index & mask acts like a remainder: when index passes the end of
        // the array, it wraps back to the beginning
        return ARRAY_BASE_OFFSET + ((index & mask) << ARRAY_ELEMENT_SHIFT);
    }

}
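One detail worth verifying is the constructor's rounding trick: the mask in calcElementOffset only works when the array length is a power of two, and `1 << (32 - Integer.numberOfLeadingZeros(capacity - 1))` rounds any requested capacity up to the next power of two (as the code comment notes, very large inputs would overflow):

```java
public class RoundUpDemo {
    // round capacity up to the next power of two (overflow not handled)
    static int roundUpToPowerOfTwo(int capacity) {
        return 1 << (32 - Integer.numberOfLeadingZeros(capacity - 1));
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(1));  // prints "1"
        System.out.println(roundUpToPowerOfTwo(10)); // prints "16"
        System.out.println(roundUpToPowerOfTwo(16)); // prints "16": powers of two are unchanged
        System.out.println(roundUpToPowerOfTwo(17)); // prints "32"
    }
}
```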

Can't follow all of it yet? That's normal. Read the queue code a few more times and you'll be able to shine in your next interview.

What we use here is inserting 7 long fields between each pair of hot variables to eliminate false sharing. You may notice that some open-source frameworks do this padding via inheritance instead, and some add 15 longs. In addition, JDK 8 provides the @Contended annotation to eliminate false sharing.
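For reference, the inheritance-based padding style mentioned above usually looks something like this (a sketch modeled on what frameworks like Disruptor do; the class and field names are illustrative, not from any particular library):

```java
// each level of the hierarchy contributes fields; superclass fields are laid
// out before subclass fields, so the volatile value ends up surrounded by padding
abstract class LhsPadding { long p1, p2, p3, p4, p5, p6, p7; }
abstract class Value extends LhsPadding { volatile long value; }
abstract class RhsPadding extends Value { long p9, p10, p11, p12, p13, p14, p15; }

public class PaddedCounter extends RhsPadding {
    public void increment() { value++; }
    public long get() { return value; }
}
```

(JDK 8's sun.misc.Contended annotation achieves the same effect declaratively, though for non-JDK classes it only takes effect with the -XX:-RestrictContended JVM flag.)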

In fact, this example still has room for optimization. For instance: can we get rid of the size field entirely? How would the queue work without it?
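One possible direction (shown single-threaded to keep the sketch short): drop size entirely and derive both "empty" and "full" from the two indices themselves, since 64-bit indices never wrap in practice. In the multi-producer case this needs more care (Disruptor, for instance, uses per-slot sequence numbers), so treat this as an outline, not a finished solution:

```java
public class NoSizeRingBuffer<T> {
    private final Object[] array;
    private long writeIndex;
    private long readIndex;

    public NoSizeRingBuffer(int capacity) {
        this.array = new Object[capacity]; // capacity must be a power of two
    }

    public boolean put(T t) {
        // full: the writer is a whole lap ahead of the reader
        if (writeIndex - readIndex == array.length) return false;
        array[(int) (writeIndex & (array.length - 1))] = t;
        writeIndex++;
        return true;
    }

    @SuppressWarnings("unchecked")
    public T take() {
        // empty: the reader has caught up with the writer
        if (readIndex == writeIndex) return null;
        T t = (T) array[(int) (readIndex & (array.length - 1))];
        readIndex++;
        return t;
    }
}
```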

Postscript

In this section, we learned how to build a high-performance queue in Java, along with the underlying fundamentals. It's no exaggeration to say that with these fundamentals, the topic of queues alone can keep an interviewer busy for an hour.

In addition, I recently received questions from some readers: how are hash, hash table, and hash function related to each other? Why does Object have a hashCode() method? And how is it related to the equals() method?

In the next section, we'll look at everything about hashing. Want to get new articles as soon as they're out? Follow me!

> Follow the official account "Tong Ge Read Source Code" to unlock more knowledge of source code, fundamentals, and architecture.
