Disruptor in-memory high-performance message queue

Disruptor Introduction

Disruptor is a high-performance queue developed by LMAX, a British foreign exchange trading company. It was originally built to solve the latency problem of in-memory queues. Unlike Kafka and RabbitMQ, which are message queues between services, Disruptor is generally used to pass messages between threads. A single thread of the system built on Disruptor can support 6 million orders per second.

Disruptor is a message queue used between multiple threads in one JVM. Its role is similar to that of ArrayBlockingQueue, but Disruptor is far better than ArrayBlockingQueue in both features and performance. When performance requirements are high, consider using Disruptor as a replacement for ArrayBlockingQueue.

The official project also compared the performance of Disruptor and ArrayBlockingQueue in different application scenarios; the reported improvement is roughly 5 to 10 times.

Queue

A queue is a data structure that works on a FIFO (first in, first out) basis. New elements (elements waiting to enter the queue) are always inserted at the tail, and reads are always taken from the head. In computing, queues are generally used for queuing (such as thread pool wait queues and lock wait queues), decoupling (the producer-consumer pattern), asynchronous processing, and so on.

The queues in the JDK all implement the java.util.Queue interface and fall into two groups: thread-unsafe queues such as ArrayDeque and LinkedList, and thread-safe queues in the java.util.concurrent package. In real environments our programs are multi-threaded. When multiple threads operate on the same queue and the queue is not thread-safe, unpredictable problems such as overwritten or lost data can occur, so we can only choose thread-safe queues.
Among the thread-safe queues, ArrayBlockingQueue and LinkedBlockingQueue both achieve thread safety with ReentrantLock; the difference is that one is backed by an array and the other by a linked list. With a queue, after taking one element you are very likely to take the next one soon, or to take several at once. The elements of an array are contiguous in memory, so they benefit from the CPU cache (cache lines are introduced below) and access is somewhat faster, so we prefer ArrayBlockingQueue where possible. Indeed, many third-party frameworks, such as the early asynchronous log4j, chose ArrayBlockingQueue.
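
As a point of reference for the rest of the article, here is a minimal sketch of the producer-consumer pattern on top of ArrayBlockingQueue. The class and method names (BlockingQueueDemo, transfer) are made up for illustration; only the java.util.concurrent calls are real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch; BlockingQueueDemo and transfer are hypothetical names.
final class BlockingQueueDemo {

    // Sends 0..count-1 from a producer thread to the calling thread through a
    // bounded ArrayBlockingQueue and returns the values in arrival order.
    static List<Integer> transfer(int count, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    queue.put(i); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "producer");
        producer.start();

        List<Integer> received = new ArrayList<>();
        try {
            for (int i = 0; i < count; i++) {
                received.add(queue.take()); // blocks while the queue is empty
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(transfer(5, 2)); // prints [0, 1, 2, 3, 4]
    }
}
```

Both put and take go through the queue's internal lock, which is the cost the later sections try to avoid.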

The following is a brief list of some of the thread-safe queues provided in the JDK:

[Image: table of thread-safe JDK queues]

We can see that the lock-free queues are unbounded, while the bounded queues all use locks. This raises a problem: in a real production environment an unbounded queue is risky, because it can grow until memory overflows, so we first exclude unbounded queues. That is not to say unbounded queues are useless, only that they must be ruled out in some scenarios. That leaves ArrayBlockingQueue and LinkedBlockingQueue (LinkedBlockingQueue is in fact bounded, but its capacity defaults to Integer.MAX_VALUE when no size is set). Both share a drawback: because they lock, their performance is relatively low. Why did the JDK add lock-free queues? To improve performance. But those are unbounded, which is the dilemma. Is there a queue that is both lock-free and bounded? The answer is Disruptor.
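
The difference between bounded and unbounded queues can be seen with a small sketch. BoundedQueueDemo and offerAll are names invented here; the queue classes and the offer and remainingCapacity methods are standard JDK API.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch; BoundedQueueDemo and offerAll are hypothetical names.
final class BoundedQueueDemo {

    // Tries to add `attempts` items without blocking and returns how many were accepted.
    static int offerAll(Queue<Integer> queue, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            if (queue.offer(i)) { // offer returns false instead of blocking when full
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Bounded: only 4 of the 10 offers fit.
        System.out.println(offerAll(new ArrayBlockingQueue<Integer>(4), 10));   // prints 4
        // Unbounded: every offer is accepted, so memory is the only limit.
        System.out.println(offerAll(new ConcurrentLinkedQueue<Integer>(), 10)); // prints 10
        // LinkedBlockingQueue without a size is bounded at Integer.MAX_VALUE.
        System.out.println(new LinkedBlockingQueue<Integer>().remainingCapacity() == Integer.MAX_VALUE); // prints true
    }
}
```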

Disruptor

Disruptor is a high-performance queue and open source concurrency framework developed by LMAX, a British foreign exchange trading company. It won the 2011 Duke's Choice Award for Innovative Programming Framework. It implements concurrent queue operations without locks, and a single thread of the system built on Disruptor can support 6 million orders per second. Well-known frameworks including Apache Storm, Camel and Log4j2 have integrated Disruptor internally in place of the JDK queues to achieve high performance.

Why is it so fast?

Disruptor relies on three key techniques:

  • CAS
  • Eliminate false sharing
  • RingBuffer

3.1.1 Locks and CAS

The reason we move away from ArrayBlockingQueue is that it uses a heavyweight lock. A thread that fails to acquire the lock is suspended and is resumed only after the lock is released, which carries a certain overhead, and while it waits the thread can do nothing else.

CAS (compare and swap), as the name suggests, first compares and then swaps: it checks whether the current value still equals the expected old value and, if so, sets the new value. Anyone familiar with optimistic locking knows that CAS can be used to implement it. CAS involves no thread context switch, which removes unnecessary overhead, and Disruptor is built on CAS.
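
A minimal sketch of the CAS retry loop, using the JDK's AtomicLong. CasCounter and its methods are hypothetical names for illustration; AtomicLong.compareAndSet is the real API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch; CasCounter is a hypothetical name.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Adds delta without taking a lock: read the old value, compute the new one,
    // and retry if another thread changed the value in between.
    long add(long delta) {
        long current;
        long next;
        do {
            current = value.get();
            next = current + delta;
        } while (!value.compareAndSet(current, next));
        return next;
    }

    long get() {
        return value.get();
    }

    // Runs several threads against one counter and returns the final value.
    static long countWithThreads(int threads, int incrementsPerThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.add(1);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10_000)); // prints 40000
    }
}
```

A thread that loses the race simply loops again; it is never suspended, which is the contrast with the lock described above.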

3.1.2 False sharing

To discuss false sharing, we first have to talk about the CPU cache. Cache size is one of the important indicators of a CPU, and the structure and size of the cache have a great impact on CPU speed. The cache inside the CPU runs at a very high frequency, so it works far more efficiently than system memory or the hard disk. In practice the CPU often needs to read the same block of data repeatedly, and a larger cache greatly improves the hit rate for reads inside the CPU, without having to go to memory or disk, which improves system performance. But because of CPU die area and cost, the cache is very small.

CPU caches are divided into a first-level and a second-level cache. Mainstream CPUs today also have a third-level cache, and some even have a fourth-level cache. All the data stored in each level of cache is part of the next level of cache. The technical difficulty and manufacturing cost of these caches decrease from level to level, so their capacity increases from level to level.

Each time Intel releases a new CPU, such as the i7-7700K or 8700K, the cache size is one of the things it tunes. If you are interested, you can look up the launch events or published articles.

Martin and Mike's QCon presentation gives access times for each level of cache:

[Image: table of access times for each cache level]

Cache line

Data in the CPU's multi-level cache is not stored as independent items. Instead, much like a page cache, it is stored in cache lines, and a cache line is usually 64 bytes. A Java long is 8 bytes, so one cache line holds 8 longs. For example, when you access one long variable, its 7 neighbours are loaded along with it. This is why we said above to choose an array rather than a linked list: with an array you can rely on cache lines for fast access.

Are cache lines a cure-all? No, because they bring a drawback of their own. Here is an example. Imagine an array-based queue, ArrayQueue, with the following data structure:

class ArrayQueue {
    long maxSize;
    long currentIndex;
}

maxSize is defined at the start as the size of the array, while currentIndex marks the current position in the queue and changes quickly. When you access maxSize, currentIndex is loaded into the same cache line as well. If another thread then updates currentIndex, the whole cache line in the CPU is invalidated. Note that this is how the CPU works: it does not invalidate currentIndex alone. Reading maxSize again then has to go back to memory. But maxSize was fixed at the start, so it should have been served from the cache; instead it is affected by the frequently changing currentIndex.

The Magic of Padding

In order to solve the problem of the above cache line, the Padding method is adopted in the Disruptor.

class LhsPadding
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding
{
    protected volatile long value;
}

class RhsPadding extends Value
{
    protected long p9, p10, p11, p12, p13, p14, p15;
}

The value field is surrounded by otherwise unused long variables. That way, modifying value does not affect the cache lines of other variables, and changes to other variables do not invalidate the line that holds value.

Finally, JDK 8 provides the @Contended annotation. It is normally honoured only inside the JDK; to use it in your own code you have to pass the JVM flag -XX:-RestrictContended, which lifts that restriction. Many articles that analyze ConcurrentHashMap overlook this annotation, although ConcurrentHashMap uses it: its element count is spread over counter cells that change constantly, so the annotation is used to pad them against false sharing and improve performance.

The following example compares the running time of an access pattern that benefits from cache lines with one that does not.

public class CacheLineEffect {
    // Assume a typical cache line of 64 bytes; a long takes 8 bytes.
    static long[][] arr;

    public static void main(String[] args) {
        arr = new long[1024 * 1024][];
        for (int i = 0; i < 1024 * 1024; i++) {
            arr[i] = new long[8];
            for (int j = 0; j < 8; j++) {
                arr[i][j] = 0L;
            }
        }
        long sum = 0L;

        // Row by row: the 8 longs of one row sit next to each other in memory.
        long marked = System.currentTimeMillis();
        for (int i = 0; i < 1024 * 1024; i += 1) {
            for (int j = 0; j < 8; j++) {
                sum += arr[i][j];
            }
        }
        System.out.println("Loop times:" + (System.currentTimeMillis() - marked) + "ms");

        // Column by column: every access lands in a different row.
        marked = System.currentTimeMillis();
        for (int i = 0; i < 8; i += 1) {
            for (int j = 0; j < 1024 * 1024; j++) {
                sum += arr[j][i];
            }
        }
        System.out.println("Loop times:" + (System.currentTimeMillis() - marked) + "ms");
    }
}


What is false sharing

ArrayBlockingQueue has three member variables:

takeIndex: index of the next element to be taken
putIndex: index of the position where the next element can be inserted
count: number of elements in the queue

These three variables easily end up in the same cache line, yet they are modified largely independently of one another. Each modification therefore invalidates the cached copies of the others, so the cache line cannot be shared effectively.

When the producer thread puts an element into the ArrayBlockingQueue, putIndex is modified. This invalidates the cache line in the consumer thread's cache, which then has to be re-read from main memory.

This inability to make full use of the cache line is called false sharing.
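
The effect can be observed with a rough sketch like the one below, in which two threads each update their own counter, once with the counters next to each other and once with padding between them. All names are invented for this example. Field layout is decided by the JVM, so the padding relies on typical HotSpot behaviour (superclass fields are laid out before subclass fields), and the timings are only indicative; this is not a rigorous benchmark.

```java
// Illustrative sketch; all names are hypothetical and timings vary by machine.
final class FalseSharingDemo {
    // Two hot fields that will most likely share one cache line.
    static class SharedPair { volatile long a; volatile long b; }

    // The same two fields kept apart by 7 unused longs (56 bytes).
    static class FieldA { volatile long a; }
    static class Gap extends FieldA { long p1, p2, p3, p4, p5, p6, p7; }
    static class PaddedPair extends Gap { volatile long b; }

    // Runs both tasks on their own threads and returns the elapsed nanoseconds.
    static long timeNanos(Runnable first, Runnable second) {
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    // Each field is written by exactly one thread, so ++ on it is safe here.
    // Returns the final counters: {shared.a, shared.b, padded.a, padded.b}.
    static long[] run(int iterations) {
        SharedPair shared = new SharedPair();
        PaddedPair padded = new PaddedPair();
        long sharedNanos = timeNanos(
                () -> { for (int i = 0; i < iterations; i++) shared.a++; },
                () -> { for (int i = 0; i < iterations; i++) shared.b++; });
        long paddedNanos = timeNanos(
                () -> { for (int i = 0; i < iterations; i++) padded.a++; },
                () -> { for (int i = 0; i < iterations; i++) padded.b++; });
        System.out.println("same cache line: " + sharedNanos / 1_000_000 + " ms");
        System.out.println("padded apart:    " + paddedNanos / 1_000_000 + " ms");
        return new long[] {shared.a, shared.b, padded.a, padded.b};
    }

    public static void main(String[] args) {
        run(50_000_000);
    }
}
```

On many multi-core machines the padded run is noticeably faster, but the gap depends on the CPU, the JVM and how the threads are scheduled.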

3.1.3 RingBuffer

What exactly is a ring buffer?
It is a ring (joined end to end) that you can use as a buffer to transfer data between different contexts (threads).

Basically, a ring buffer has a sequence number that points to the next available element in the array, for example a sequence number pointing to index 4 of the array.

As you keep filling the buffer (and possibly reading from it as well), the sequence number keeps increasing until it wraps around the ring.

To find the array element that the current sequence number points to, use sequence & (array length - 1) = array index. For example, with 8 slots in total, 3 & (8 - 1) = 3. HashMap locates array elements the same way, and it is faster than the modulo operation.
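
The mapping from sequence number to slot can be checked with a few lines of Java. RingIndex and indexOf are hypothetical names; the sketch assumes the buffer size is a power of 2, which is what makes the mask equivalent to the modulo.

```java
// Illustrative sketch; RingIndex and indexOf are hypothetical names.
final class RingIndex {

    // Maps an ever-increasing sequence number to a slot, assuming bufferSize is a power of 2.
    static int indexOf(long sequence, int bufferSize) {
        return (int) (sequence & (bufferSize - 1));
    }

    public static void main(String[] args) {
        int bufferSize = 8;
        for (long sequence = 0; sequence < 20; sequence++) {
            int byMask = indexOf(sequence, bufferSize);
            int byModulo = (int) (sequence % bufferSize);
            // Both columns agree; after sequence 7 the slot wraps back to 0.
            System.out.println(sequence + " -> slot " + byMask + " (modulo gives " + byModulo + ")");
        }
    }
}
```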

Differences from commonly used queues

There is no tail pointer. Only a sequence number pointing to the next available position is maintained.
Data in the buffer is not deleted: it stays in the buffer until new data overwrites it.

Why the ring buffer adopts this data structure

First, because it is an array, it is faster than a linked list, and the elements of an array are stored at consecutive memory addresses. This is CPU-cache friendly: at the hardware level the elements of the array are preloaded, so with a ring buffer the CPU does not have to go to main memory for each next element. As soon as one element is loaded into a cache line, several adjacent elements are loaded into the same cache line.
Second, the memory for the array can be pre-allocated, so the array object exists for as long as the program runs. That means little time is spent on garbage collection. A linked list, by contrast, has to create a node object for every element added, and when a node is removed the corresponding memory has to be cleaned up.

How to read from the RingBuffer

A consumer is a thread that wants to read data from the RingBuffer. It can access a ConsumerBarrier object (named SequenceBarrier in later Disruptor versions), which is created by the RingBuffer and interacts with the RingBuffer on the consumer's behalf. Just as the RingBuffer needs a sequence number to find the next available slot, a consumer needs to know the next sequence number it will process. For example, if the consumer has processed everything in the RingBuffer up to and including sequence 8, the next sequence number it expects to access is 9.

The consumer calls the waitFor() method of the ConsumerBarrier object, passing the next sequence number it needs.

final long availableSeq = consumerBarrier.waitFor(nextSequence);

ConsumerBarrier returns the highest sequence number available in the RingBuffer, 12 in this example. ConsumerBarrier uses a WaitStrategy to decide how it waits for that sequence number.

Next

The consumer then keeps waiting for more data to be written to the RingBuffer, and it is notified when data has been written: slots 9, 10, 11 and 12 are now filled. Since sequence 12 has been reached, the consumer can ask the ConsumerBarrier for the data at those sequence numbers.
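
The read loop described above can be sketched without the library as follows. This is a simplified stand-in, not Disruptor's implementation: BatchConsumerSketch, waitFor and consumeAll are names invented here, the barrier is reduced to a single AtomicLong cursor, and the buffer is assumed to be large enough that the producer never wraps around.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch; not the Disruptor API. All names are hypothetical.
final class BatchConsumerSketch {

    // Spins until at least `next` has been published, then returns the
    // highest published sequence, which may be well beyond `next`.
    static long waitFor(AtomicLong cursor, long next) {
        long available;
        while ((available = cursor.get()) < next) {
            Thread.yield();
        }
        return available;
    }

    // One producer publishes `count` values; the caller consumes them in batches.
    static List<Long> consumeAll(int count) {
        int bufferSize = 16; // assumption: count <= bufferSize, so no slot is overwritten
        if (count > bufferSize) {
            throw new IllegalArgumentException("this sketch does not handle wrap-around");
        }
        long[] slots = new long[bufferSize];
        AtomicLong cursor = new AtomicLong(-1); // -1 means nothing published yet

        Thread producer = new Thread(() -> {
            for (long seq = 0; seq < count; seq++) {
                slots[(int) (seq & (bufferSize - 1))] = seq * 10; // fill the slot
                cursor.set(seq);                                   // then publish it
            }
        });
        producer.start();

        List<Long> consumed = new ArrayList<>();
        long next = 0;
        while (next < count) {
            long available = waitFor(cursor, next);
            // Everything from next to available can be read in one batch.
            for (long seq = next; seq <= available; seq++) {
                consumed.add(slots[(int) (seq & (bufferSize - 1))]);
            }
            next = available + 1;
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(consumeAll(12));
    }
}
```

Because waitFor returns the highest available sequence rather than just the requested one, a consumer that has fallen behind catches up in a single pass.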

In the Disruptor, an array stores the data. We explained above that an array makes good use of the cache; the Disruptor goes further and uses a ring array, the RingBuffer. Note that the ring array is not physically a ring: the RingBuffer maps a sequence to a slot by taking the remainder. For example, with an array size of 10, sequence 0 maps to subscript 0, and so do sequences 10, 20, and so on.

In practice these frameworks do not compute the remainder with the % operation but with the bitwise AND (&) operation. This requires the size to be a power of 2, which in binary is 10, 100, 1000, and so on; subtracting 1 gives 1, 11, 111, so index & (size - 1) yields the remainder, and the bit operation speeds up access.
If you do not set the size to a power of 2, the Disruptor throws an exception saying that bufferSize must be a power of 2.
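
A minimal sketch of such a check, together with the binary patterns mentioned above. BufferSizeCheck and requirePowerOfTwo are hypothetical names and the exception message is only modelled on the description in the text; the Integer methods are standard JDK API.

```java
// Illustrative sketch; BufferSizeCheck and requirePowerOfTwo are hypothetical names.
final class BufferSizeCheck {

    // A power of 2 has exactly one bit set, so bitCount must be 1.
    static int requirePowerOfTwo(int bufferSize) {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must not be less than 1");
        }
        if (Integer.bitCount(bufferSize) != 1) {
            throw new IllegalArgumentException("bufferSize must be a power of 2");
        }
        return bufferSize;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(8));     // prints 1000
        System.out.println(Integer.toBinaryString(8 - 1)); // prints 111
        System.out.println(13 & (8 - 1));                  // prints 5, the same as 13 % 8
        try {
            requirePowerOfTwo(10);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());            // prints bufferSize must be a power of 2
        }
    }
}
```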


The Producer fills the RingBuffer with elements. To fill an element, it first obtains the next Sequence from the RingBuffer, then fills the slot at that Sequence with data, and then publishes it.
The Consumer consumes data from the RingBuffer. A SequenceBarrier coordinates the consumption order of different Consumers and provides the next Sequence to consume.
When the Producer has filled the RingBuffer, it continues writing from the beginning, replacing earlier data. But if a SequenceBarrier still points at the next position, that position is not overwritten; the Producer blocks until it has been consumed. Likewise, once the Consumer has consumed everything up to the barrier, it blocks until new data arrives.
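
The claim, fill, publish cycle and the two blocking conditions can be sketched for the simplest case of one producer and one consumer. This is a toy model under stated assumptions, not Disruptor's code: SpscRingSketch and its methods are invented names, the slots hold plain long values, and waiting is done by yielding in a loop.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative single-producer, single-consumer sketch; all names are hypothetical.
final class SpscRingSketch {
    private final long[] slots;
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1); // last sequence visible to the consumer
    private final AtomicLong consumed = new AtomicLong(-1);  // last sequence the consumer finished
    private long nextToClaim = 0;                            // touched only by the producer thread

    SpscRingSketch(int size) {
        if (size < 1 || Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of 2");
        }
        slots = new long[size];
        mask = size - 1;
    }

    // Producer side: claim the next sequence, fill the slot, then publish.
    void publish(long value) {
        long sequence = nextToClaim++;
        long wrapPoint = sequence - slots.length; // sequence that last used this slot
        while (consumed.get() < wrapPoint) {
            Thread.yield(); // the slot still holds unconsumed data, so wait
        }
        slots[(int) (sequence & mask)] = value;
        published.set(sequence);
    }

    // Consumer side: wait for the next sequence, read it, then mark it consumed.
    long take() {
        long sequence = consumed.get() + 1;
        while (published.get() < sequence) {
            Thread.yield(); // nothing new has been published yet
        }
        long value = slots[(int) (sequence & mask)];
        consumed.set(sequence);
        return value;
    }

    // Pushes 1..count through a ring of the given size and returns their sum.
    static long sumThrough(int size, int count) {
        SpscRingSketch ring = new SpscRingSketch(size);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                ring.publish(i);
            }
        });
        producer.start();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += ring.take();
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumThrough(8, 100)); // prints 5050
    }
}
```

With 100 values and only 8 slots the producer wraps around many times, and the wrapPoint check is what keeps it from overwriting a slot the consumer has not read yet.
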

Disruptor's design scheme
Disruptor solves the problem of slow queues through the following design:

Ring array structure
To avoid garbage collection, an array is used instead of a linked list. An array is also friendlier to the processor's cache mechanism.

Element positioning
The array length is 2^n, so positions are located quickly with bit operations. The subscript simply increments. There is no need to worry about index overflow: the index is of type long, and even at a processing rate of 1 million QPS it would take about 300,000 years to run out.

Lock-free design
Each producer or consumer thread first applies for the position of the elements it may operate on in the array. Once the position has been granted, it writes or reads data directly at that position.

The following ignores the ring structure of the array and explains how the lock-free design is implemented. The whole process relies on CAS on atomic variables to keep the operations thread-safe.
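
The overflow estimate is easy to verify. The sketch below (SequenceLifetime and yearsUntilOverflow are names made up here) divides the largest long value by the event rate and by the number of seconds in a 365-day year.

```java
// Illustrative arithmetic check; SequenceLifetime is a hypothetical name.
final class SequenceLifetime {

    // Years until a long sequence counter overflows at the given event rate.
    static long yearsUntilOverflow(long eventsPerSecond) {
        long seconds = Long.MAX_VALUE / eventsPerSecond;
        long secondsPerYear = 365L * 24 * 60 * 60;
        return seconds / secondsPerYear;
    }

    public static void main(String[] args) {
        System.out.println(yearsUntilOverflow(1_000_000)); // prints 292471
    }
}
```

About 292,000 years at 1 million events per second, which matches the rough figure of 300,000 years.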

Single producer
Writing with a single producer thread is relatively simple:

Apply to write m elements.
If m elements can be written, return the largest sequence number. The main check here is whether unread elements would be overwritten.
If the application succeeds, the producer starts writing the elements.


Multiple producers
With multiple producers, the problem is how to prevent several threads from writing the same element. Disruptor's solution is for each thread to obtain a different section of the array to operate on. This is easy to achieve with CAS: when allocating elements, CAS is used to determine whether the space has already been allocated.

This raises a new problem: how to prevent elements that have not been written yet from being read. For the multi-producer case, Disruptor introduces a buffer of the same size as the RingBuffer, the availableBuffer. When a position is written successfully, the corresponding position of the availableBuffer is set to mark it as written. When reading, the availableBuffer is traversed to determine whether each element is ready.

Reading data
Reading is more complicated when multiple producer threads are writing:

Apply to read up to sequence number n.
If the writer cursor >= n, the largest subscript that can be read contiguously is still unknown at this point. Starting from the reader cursor, scan the availableBuffer up to the first unavailable element, then return the position of the last contiguously readable element.
The consumer reads the elements.

For example, suppose the reading thread has read up to subscript 2, and three threads, Writer1, Writer2 and Writer3, are writing data to their positions in the RingBuffer; the largest subscript allocated to the writers is 11.
The reading thread applies to read subscripts 3 to 11 and finds that the writer cursor >= 11. It then scans the availableBuffer from 3 onwards and finds that the element at subscript 7 has not been produced yet, so waitFor(11) returns 6.

The consumer then reads 4 elements in total, subscripts 3 to 6.
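
The scan over the availableBuffer can be sketched as follows, reproducing the example above. This is a simplified model rather than the library's source: AvailableBufferSketch and its methods are hypothetical names, and each slot stores the lap number of the sequence last written there (sequence divided by the buffer size), so that a flag left over from an earlier lap is not mistaken for a fresh write.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative sketch; not the Disruptor source. All names are hypothetical.
final class AvailableBufferSketch {
    private final AtomicIntegerArray available;
    private final int mask;
    private final int shift; // log2 of the buffer size

    AvailableBufferSketch(int size) { // assumption: size is a power of 2
        available = new AtomicIntegerArray(size);
        for (int i = 0; i < size; i++) {
            available.set(i, -1); // -1 means never written
        }
        mask = size - 1;
        shift = Integer.numberOfTrailingZeros(size);
    }

    // Called by a producer after it has written the slot for this sequence.
    void publish(long sequence) {
        available.set((int) (sequence & mask), (int) (sequence >>> shift));
    }

    // A sequence is readable only if its slot carries the matching lap number.
    boolean isAvailable(long sequence) {
        return available.get((int) (sequence & mask)) == (int) (sequence >>> shift);
    }

    // Scans from `from` up to `claimedUpTo` and returns the last sequence that
    // can be read without a gap (from - 1 if even the first one is not ready).
    long highestPublished(long from, long claimedUpTo) {
        for (long sequence = from; sequence <= claimedUpTo; sequence++) {
            if (!isAvailable(sequence)) {
                return sequence - 1;
            }
        }
        return claimedUpTo;
    }

    public static void main(String[] args) {
        AvailableBufferSketch buffer = new AvailableBufferSketch(16);
        for (long sequence : new long[] {3, 4, 5, 6, 8, 9}) {
            buffer.publish(sequence); // 7, 10 and 11 are claimed but not written yet
        }
        System.out.println(buffer.highestPublished(3, 11)); // prints 6
    }
}
```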

Writing data
When multiple producers write:

Apply to write m elements.
If m elements can be written, return the largest sequence number. Each producer is allocated its own exclusive space.
The producer writes the elements and, as it does so, sets the corresponding positions in the availableBuffer to mark which positions have been written.

For example, two threads, Writer1 and Writer2, write to the array and each applies for writable space. Writer1 is allocated subscripts 3 to 5, and Writer2 is allocated subscripts 6 to 9.
Writer1 writes the element at subscript 3 and sets the corresponding position of the availableBuffer to mark it as written, then moves on one position and writes the element at subscript 4. Writer2 proceeds in the same way, until all the writes are complete.

The code to prevent different producers from writing to the same space is as follows:

public long tryNext(int n) throws InsufficientCapacityException
{
    if (n < 1)
    {
        throw new IllegalArgumentException("n must be > 0");
    }

    long current;
    long next;

    do
    {
        current = cursor.get();
        next = current + n;

        if (!hasAvailableCapacity(gatingSequences, n, current))
        {
            throw InsufficientCapacityException.INSTANCE;
        }
    }
    while (!cursor.compareAndSet(current, next));

    return next;
}

The do/while condition cursor.compareAndSet(current, next) determines whether the space being requested has already been taken by another producer. If it has, compareAndSet returns false and the loop runs again to apply for write space.

The consumer's process is very similar to the producer's, so it is not described here. Through this careful lock-free design, Disruptor achieves high performance under high concurrency.

3.2 How to use Disruptor

package concurrent;

import sun.misc.Contended;

import java.util.concurrent.ThreadFactory;

import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

/**
 * Created on 2019-10-04
 */
public class DisruptorTest {
    public static void main(String[] args) throws Exception {
        // The element stored in the queue
        class Element {
            // sun.misc.Contended exists on JDK 8 only and is ignored in application
            // classes unless the JVM runs with -XX:-RestrictContended
            @Contended
            private String value;

            public String getValue() {
                return value;
            }

            public void setValue(String value) {
                this.value = value;
            }
        }

        // Thread factory (the Disruptor uses it to create the event-handler threads)
        ThreadFactory threadFactory = new ThreadFactory() {
            int i = 0;

            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "simpleThread" + String.valueOf(i++));
            }
        };

        // Event factory for the RingBuffer, used when the RingBuffer is initialized
        EventFactory<Element> factory = new EventFactory<Element>() {
            @Override
            public Element newInstance() {
                return new Element();
            }
        };

        // Handler that processes each event
        EventHandler<Element> handler = new EventHandler<Element>() {
            @Override
            public void onEvent(Element element, long sequence, boolean endOfBatch) throws InterruptedException {
                System.out.println("Element: " + Thread.currentThread().getName() + ": " + element.getValue() + ": " + sequence);
//                Thread.sleep(10000000);
            }
        };

        // Blocking wait strategy
        BlockingWaitStrategy strategy = new BlockingWaitStrategy();

        // Size of the RingBuffer (must be a power of 2)
        int bufferSize = 8;

        // Create the disruptor in single-producer mode
        Disruptor<Element> disruptor = new Disruptor<>(factory, bufferSize, threadFactory, ProducerType.SINGLE, strategy);

        // Register the EventHandler
        disruptor.handleEventsWith(handler);

        // Start the disruptor's threads
        disruptor.start();
        for (int i = 0; i < 10; i++) {
            disruptor.publishEvent((element, sequence) -> {
                System.out.println("Previous data: " + element.getValue() + ", current sequence: " + sequence);
                element.setValue("I am element #" + sequence);
            });
        }
    }
}

Disruptor has several key components:

ThreadFactory: the thread factory that creates the threads the Disruptor needs for production and consumption.
EventFactory: the event factory that creates the queue elements. The Disruptor fills the whole RingBuffer with them at initialization, all in one go.
EventHandler: the handler that processes events. Each EventHandler can be regarded as a consumer, but multiple EventHandlers each consume the queue independently.
WorkHandler: also a handler that processes events. The difference from the above is that multiple consumers share the same queue.
WaitStrategy: the waiting strategy. The Disruptor offers several strategies that decide what a consumer does when there is no data to consume. A few of them:
BlockingWaitStrategy: blocks the thread until the producer wakes it; once woken, it loops to check whether the sequences it depends on have been consumed.
BusySpinWaitStrategy: the thread spins continuously while waiting, which may consume more CPU.
YieldingWaitStrategy: spins 100 times, then calls Thread.yield() to give up the CPU.
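
The yielding behaviour described in the last item can be sketched in a few lines. This is a simplified illustration of the idea, not the library's YieldingWaitStrategy source: YieldingWaitSketch and its waitFor signature are invented here, and the cursor is modelled as a plain AtomicLong.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of "spin 100 times, then yield"; all names are hypothetical.
final class YieldingWaitSketch {
    private static final int SPIN_TRIES = 100;

    // Waits until the cursor has reached `sequence` and returns the cursor value.
    static long waitFor(long sequence, AtomicLong cursor) {
        int counter = SPIN_TRIES;
        long available;
        while ((available = cursor.get()) < sequence) {
            if (counter > 0) {
                counter--;      // busy-spin for the first 100 checks
            } else {
                Thread.yield(); // after that, offer the CPU to other threads
            }
        }
        return available;
    }

    public static void main(String[] args) {
        AtomicLong cursor = new AtomicLong(-1);
        Thread producer = new Thread(() -> cursor.set(5));
        producer.start();
        System.out.println(waitFor(3, cursor)); // prints 5
    }
}
```

Spinning first keeps latency low when data arrives quickly, and yielding afterwards limits the CPU cost when it does not.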

Origin blog.csdn.net/heqiushuang110/article/details/125599239