Still using BlockingQueue? Read this article to learn about Disruptor

1. What is a queue

Everyone is familiar with queues; they are everywhere in real life. When you check out at a supermarket, you see people standing in line waiting their turn. Why line up? Imagine if everyone ignored the rules and swarmed the registers: not only would the supermarket descend into chaos, it could easily cause a stampede. Sadly, such things do happen in reality.

In the computer world, a queue is a data structure that follows FIFO (first in, first out): new elements are always inserted at the tail, and elements are read from the head. In computing, queues are generally used for queuing (such as a thread pool's waiting queue or a lock's waiting queue), decoupling (the producer-consumer pattern), asynchronous processing, and so on.
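As a quick illustration, here is a minimal sketch using the JDK's ArrayDeque (the class name FifoDemo is my own):

import java.util.ArrayDeque;
import java.util.Queue;

public class FifoDemo {
    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.offer("first");   // new elements are inserted at the tail
        queue.offer("second");
        queue.offer("third");
        System.out.println(queue.poll()); // "first" - elements are read from the head
        System.out.println(queue.poll()); // "second"
    }
}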

2. Queues in the JDK

The queues in the JDK all implement the java.util.Queue interface and fall into two categories: thread-unsafe ones such as ArrayDeque and LinkedList, and thread-safe ones under the java.util.concurrent package. In a real production environment our machines run many threads, and when multiple threads operate on the same queue, a thread-unsafe queue produces unpredictable behavior such as overwritten or lost data, so we can only choose thread-safe queues. Below is a brief list of some of the thread-safe queues provided in the JDK:

| Queue | Locked? | Data structure | Key technique | Bounded? |
| --- | --- | --- | --- | --- |
| ArrayBlockingQueue | Yes | Array | ReentrantLock | Bounded |
| LinkedBlockingQueue | Yes | Linked list | ReentrantLock | Bounded |
| LinkedTransferQueue | No | Linked list | CAS | Unbounded |
| DelayQueue | No | Heap | CAS | Unbounded |
As the table shows, the lock-free queues are unbounded while the locked queues are bounded. This raises a problem: in a real production environment, an unbounded queue can hurt the system badly, possibly causing memory to overflow outright, so we must rule out unbounded queues first. That is not to say unbounded queues are useless, only that they must be excluded in certain scenarios. That leaves the two queues ArrayBlockingQueue and LinkedBlockingQueue, both made thread-safe by ReentrantLock; the difference between them is array versus linked list. When consuming a queue, we usually fetch the next element immediately after the current one, or fetch several elements at a time. An array's addresses in memory are contiguous, so the operating system's cache can optimize access (cache lines are introduced below) and the access speed is slightly better, so we prefer ArrayBlockingQueue. Indeed, many third-party frameworks, such as early asynchronous log4j, chose ArrayBlockingQueue.
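For reference, a minimal producer/consumer sketch on a bounded ArrayBlockingQueue (the class name and the capacity of 1024 are my own illustrative choices):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueDemo {
    public static void main(String[] args) {
        // Bounded: the queue can never grow past 1024 elements, so no OOM risk
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            for (long i = 0; i < 10; i++) {
                try {
                    queue.put(i); // blocks when the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                try {
                    System.out.println("consumed: " + queue.take()); // blocks when empty
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        producer.start();
        consumer.start();
    }
}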

Of course, ArrayBlockingQueue has its own drawback: relatively low performance, which is exactly why the JDK added lock-free queues in the first place. Wanting a queue that is both lock-free and bounded, you might exclaim in exasperation, "Why don't you just fly to the heavens?" Well, someone really did.

3. Disruptor

Disruptor is the one that "flew to the heavens." Disruptor is a high-performance queue developed by LMAX, a British foreign-exchange trading company, and an open-source concurrency framework that won the 2011 Duke's Choice Award. It implements concurrent queue operations without locks, and a single-threaded system built on Disruptor can support six million orders per second. Well-known frameworks including Apache Storm, Camel, and Log4j2 have integrated Disruptor internally to replace the JDK queues and achieve high performance.

3.1 Why is it so awesome?

Disruptor has been hyped quite a bit above, and you must be wondering: can it really be that good? My answer is yes. Disruptor has three killer features:

  • CAS
  • Eliminating false sharing
  • RingBuffer

With these three killers in hand, Disruptor becomes as good as advertised.

3.1.1 Lock and CAS

The reason we abandoned ArrayBlockingQueue is that it uses a heavyweight lock. Locking suspends a thread, and unlocking resumes it, and this process carries a certain overhead. Worse, a thread that fails to acquire the lock can only wait, unable to do anything at all.

CAS (compare and swap), as the name suggests, first compares and then swaps: generally it compares against an expected old value, and only if they match does it set the new value. Anyone familiar with optimistic locking knows that CAS can be used to implement it. CAS involves no thread context switching, which reduces unnecessary overhead. Here I used JMH, with two threads and one invocation per call, tested on my local machine; the code is as follows:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode({Mode.SampleTime})
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 1, batchSize = 100000000)
@Threads(2)
@Fork(1)
@State(Scope.Benchmark)
public class MyClass {

    Lock lock = new ReentrantLock();
    long i = 0;
    AtomicLong atomicLong = new AtomicLong(0);

    @Benchmark
    public void measureLock() {
        lock.lock();
        i++;
        lock.unlock();
    }

    @Benchmark
    public void measureCAS() {
        atomicLong.incrementAndGet();
    }

    @Benchmark
    public void measureNoLock() {
        i++;
    }
}

The test results are as follows:

Lock: 26000ms
CAS: 4840ms
No lock: 197ms

Lock lands in five digits, CAS in four, and no-lock in three. From this we can see the cost ordering: lock > CAS > no lock.

Disruptor uses CAS to set the sequence numbers (subscripts) in its queue, which reduces lock contention and improves performance.

In addition, the JDK's other lock-free queues also use CAS, as do the atomic classes.
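To make the CAS idea concrete, here is a minimal sketch of the retry loop that classes like AtomicLong build on (compareAndSet is the real java.util.concurrent.atomic API; the CasCounter class is my own illustration):

import java.util.concurrent.atomic.AtomicLong;

public class CasCounter {
    private final AtomicLong value = new AtomicLong(0);

    // Equivalent to value.incrementAndGet(), spelled out as an explicit CAS loop
    public long increment() {
        while (true) {
            long current = value.get();               // read the old value
            long next = current + 1;
            if (value.compareAndSet(current, next)) { // swap only if nobody changed it
                return next;                          // success: no lock, no thread suspension
            }
            // CAS failed: another thread won the race, so retry with the fresh value
        }
    }
}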

3.1.2 False sharing

To talk about false sharing, we must first talk about the CPU cache. Cache size is one of the important metrics of a CPU, and the structure and size of the cache greatly affect CPU speed. The cache inside a CPU runs at an extremely high frequency, and it works far more efficiently than system memory or disk. In real workloads the CPU often reads the same data block repeatedly, so increasing cache capacity greatly improves the read hit rate, avoiding trips to memory or disk and thereby improving system performance. But due to CPU chip area and cost constraints, the cache is very small.


The CPU cache is divided into a first-level (L1) and second-level (L2) cache; mainstream CPUs today also have an L3 cache, and some even have an L4. All the data in each level of cache is part of the data held in the next level. The technical difficulty and manufacturing cost of these levels decrease as you go down, so their capacities correspondingly increase.

Why do CPUs have the L1/L2/L3 cache design? Mainly because modern processors are so fast that reading from main memory is painfully slow (partly because memory itself is not fast enough, partly because it sits far from the CPU; in general a read makes the CPU wait tens or even hundreds of clock cycles). To keep the CPU fed, smaller but lower-latency memories are needed to help, and these are the caches. If you are interested, you can even pull the CPU out of a computer and study it yourself.

Whenever Intel releases a new CPU, such as the i7-7700K or 8700K, the cache sizes are among the things optimized; if you are interested, you can search for the launch events or articles yourself.

Martin and Mike's QCon presentation gives approximate access times for each level:

| From CPU to | CPU cycles needed | Approximate time |
| --- | --- | --- |
| Main memory | - | ~60-80 ns |
| QPI bus transfer (between sockets, not drawn) | - | ~20 ns |
| L3 cache | ~40-45 cycles | ~15 ns |
| L2 cache | ~10 cycles | ~3 ns |
| L1 cache | ~3-4 cycles | ~1 ns |
| Register | 1 cycle | - |
Cache lines

The CPU's multi-level caches do not store data as isolated items. Much like the page cache, they store data in cache lines, and a cache line is typically 64 bytes. In Java a long is 8 bytes, so one cache line holds 8 longs: when you access one long variable, 7 neighbors are loaded along with it for free. This is why we said above that we choose arrays over linked lists: array elements can ride the cache line for fast access.
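A simple way to feel this effect is to walk the same long[] sequentially versus with a 64-byte stride. This is my own rough sketch, not a rigorous benchmark (JMH would be better, and timings vary by machine); on a typical machine the strided loop takes roughly as long despite doing an eighth of the accesses, because both loops touch every cache line:

public class CacheLineDemo {
    static final long[] data = new long[8 * 1024 * 1024]; // 64 MB of longs

    public static void main(String[] args) {
        long sum = 0;

        // Sequential walk: each 64-byte cache line serves 8 consecutive longs
        long start = System.nanoTime();
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        System.out.println("sequential: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Strided walk: touching every 8th long forces a new cache line per access
        start = System.nanoTime();
        for (int i = 0; i < data.length; i += 8) {
            sum += data[i];
        }
        System.out.println("strided:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

        System.out.println(sum); // keep the JIT from eliminating the loops
    }
}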
Are cache lines all upside? No, they bring a drawback too, which I will illustrate with an example. Imagine an array-backed queue, ArrayQueue, with the following data structure:

class ArrayQueue {
    long maxSize;
    long currentIndex;
}

maxSize is the size of the array, defined once at the start; currentIndex marks the current position in the queue and changes rapidly. Now imagine: when you access maxSize, the CPU also loads currentIndex into the same cache line. If another thread then updates currentIndex, the whole cache line is invalidated in your CPU. Note that this is how the CPU works: it does not invalidate just currentIndex, it invalidates the entire line. If you then access maxSize again, you must read it from memory, even though maxSize has never changed since it was defined and ought to have stayed cached: it is dragged down by its frequently changing neighbor currentIndex. This is false sharing.

The magic of padding

To solve the cache-line problem above, Disruptor uses padding:

class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

class Value extends LhsPadding {
    protected volatile long value;
}

class RhsPadding extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15;
}

The value field is padded on both sides with otherwise useless long fields. This way, whenever you modify value, it cannot share a cache line with any other variable, so it never invalidates anyone else's cache line.

Finally, a side note: JDK 8 provides the @Contended annotation to do this padding for you. By default it only takes effect inside the JDK itself; to use it in your own code you must add the JVM parameter -XX:-RestrictContended to lift that restriction. Many articles analyzing ConcurrentHashMap overlook this annotation: ConcurrentHashMap uses separate counter cells for its size bookkeeping, and those counters change constantly, so the annotation is applied to pad their cache lines and improve performance.
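As a sketch of how that looks (assuming JDK 8's sun.misc.Contended; on JDK 9+ the annotation moved to jdk.internal.vm.annotation.Contended):

import sun.misc.Contended;

public class ContendedCounter {
    // Run with: java -XX:-RestrictContended ContendedCounter
    // The JVM pads this field so it sits alone on its cache line,
    // replacing the hand-written p1..p7 filler fields shown above.
    @Contended
    volatile long value;
}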

3.1.3 RingBuffer

Disruptor saves its data in an array. We explained above how an array helps the cache; Disruptor goes a step further and uses a circular array to save the data, and that is the RingBuffer. Let me first clarify that it is not a physically circular array: the RingBuffer uses remainder arithmetic for access. For example, with an array of size 10, sequence 0 accesses index 0 of the array, and sequences 10, 20, and so on also access index 0.

In practice, these frameworks compute the remainder not with the % operator but with the bitwise & operator, which requires the size to be a power of 2, i.e. 10, 100, 1000, and so on in binary. Then size - 1 is 1, 11, 111, ... in binary, and index & (size - 1) does the job nicely, using a bit operation to speed up access. If you do not set the Disruptor's size to a power of 2, it throws an exception saying the bufferSize must be a power of 2.
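In code, the mapping looks like this (a sketch of the idea, not the Disruptor's actual source):

public class RingIndex {
    public static void main(String[] args) {
        int bufferSize = 8;        // must be a power of 2
        int mask = bufferSize - 1; // 8 - 1 = 7, i.e. 111 in binary

        // Sequences 0, 8, 16, ... all land on slot 0; & is much cheaper than %
        for (long sequence : new long[]{0, 7, 8, 9, 16}) {
            System.out.println(sequence + " -> slot " + (sequence & mask));
        }

        // The kind of power-of-2 check the Disruptor performs at construction time
        if (Integer.bitCount(bufferSize) != 1) {
            throw new IllegalArgumentException("bufferSize must be a power of 2");
        }
    }
}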

Beyond fast array access, the ring also means memory never needs to be allocated again, which reduces garbage collection: sequences 0, 10, 20, and so on all reuse the same memory slots, so there are no repeated allocations for the JVM garbage collector to keep reclaiming.

That covers the three killer features, and they lay the foundation for Disruptor's high performance. Next, I will explain how to use Disruptor and how it actually works.

3.2 How to use Disruptor

Here is a simple example:

import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

import java.util.concurrent.ThreadFactory;

import sun.misc.Contended;

public class DisruptorDemo {

    public static void main(String[] args) throws Exception {
        // The element that circulates in the queue
        class Element {
            @Contended // from the false-sharing discussion; needs -XX:-RestrictContended
            private String value;

            public String getValue() {
                return value;
            }

            public void setValue(String value) {
                this.value = value;
            }
        }

        // Thread factory for the Disruptor's consumer threads
        ThreadFactory threadFactory = new ThreadFactory() {
            int i = 0;

            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "simpleThread" + String.valueOf(i++));
            }
        };

        // Event factory, used to pre-fill the RingBuffer at initialization
        EventFactory<Element> factory = new EventFactory<Element>() {
            @Override
            public Element newInstance() {
                return new Element();
            }
        };

        // Handler that processes events
        EventHandler<Element> handler = new EventHandler<Element>() {
            @Override
            public void onEvent(Element element, long sequence, boolean endOfBatch) throws InterruptedException {
                System.out.println("Element: " + Thread.currentThread().getName()
                        + ": " + element.getValue() + ": " + sequence);
//                Thread.sleep(10000000);
            }
        };

        // Blocking wait strategy
        BlockingWaitStrategy strategy = new BlockingWaitStrategy();

        // Size of the RingBuffer (must be a power of 2)
        int bufferSize = 8;

        // Create the Disruptor in single-producer mode
        Disruptor<Element> disruptor = new Disruptor<>(factory, bufferSize, threadFactory,
                ProducerType.SINGLE, strategy);

        // Attach the EventHandler
        disruptor.handleEventsWith(handler);

        // Start the Disruptor's consumer threads
        disruptor.start();

        for (int i = 0; i < 10; i++) {
            disruptor.publishEvent((element, sequence) -> {
                System.out.println("previous value: " + element.getValue()
                        + ", current sequence: " + sequence);
                element.setValue("I am event #" + sequence);
            });
        }
    }
}

Several key pieces appear in this Disruptor example:

  • ThreadFactory: the thread factory for the threads the Disruptor uses to run its consumers.
  • EventFactory: the event factory used to generate queue elements; the Disruptor uses it to fill the entire RingBuffer at initialization, allocating everything in one go.
  • EventHandler: a handler for processing events. One EventHandler can be seen as one consumer, and multiple EventHandlers each consume the queue independently.
  • WorkHandler: also a handler for processing events; the difference from the above is that multiple WorkHandlers share the same queue (see the sketch after the strategy list below).
  • WaitStrategy: the waiting strategy. Disruptor provides several strategies that decide how a consumer waits when no data is available. Here is a brief list of some of the strategies in Disruptor:

  • BlockingWaitStrategy: the thread blocks waiting for the producer to wake it; once awakened it loops, re-checking whether the dependent sequence has been published.
  • BusySpinWaitStrategy: the thread spins the whole time while waiting, which can burn CPU.
  • LiteBlockingWaitStrategy: the thread blocks waiting for the producer to wake it; compared with BlockingWaitStrategy the difference is the signalNeeded.getAndSet flag: if one thread hits waitFor while another hits signalAll, the number of lock acquisitions can be reduced.
  • LiteTimeoutBlockingWaitStrategy: like LiteBlockingWaitStrategy, but with a bound on the blocking time; an exception is thrown once the time expires.
  • YieldingWaitStrategy: spins for 100 attempts, then yields the CPU with Thread.yield().
  • EventTranslator: implementing this interface converts our other data structures into the events that circulate in the Disruptor.

3.3 Working principle

The three killers (CAS, eliminating false sharing, and the RingBuffer) were introduced above. Now let me walk through the complete producer and consumer flow inside Disruptor.

3.3.1 Producer

Producers come in two flavors, multi-producer and single-producer, distinguished by ProducerType.SINGLE and ProducerType.MULTI. The multi-producer path simply adds CAS operations; a single producer is single-threaded, so it needs no thread-safety guarantees.

In the Disruptor, disruptor.publishEvent and disruptor.publishEvents() are typically used for publishing a single event and a batch of events, respectively.
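Both take an EventTranslator variant that writes into the pre-allocated event; a small sketch continuing the Element example (the string payloads are illustrative):

// Single publish: the translator receives the pre-allocated event plus extra args
disruptor.publishEvent((element, sequence, msg) -> element.setValue(msg), "hello");

// Batch publish: claims a contiguous block of sequences and publishes them together
String[] batch = {"a", "b", "c"};
disruptor.publishEvents((element, sequence, msg) -> element.setValue(msg), batch);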

Publishing an event into the Disruptor's queue takes the following steps (a sketch of the raw RingBuffer calls behind these steps follows the list):

  1. First, obtain the next writable position on the RingBuffer. These positions fall into two categories:
    • positions that have never been written, and
    • positions whose events have already been read by all consumers and may be overwritten. If the consumers have not caught up yet, the producer keeps retrying; cleverly, the Disruptor does not hog the CPU the whole time, but briefly blocks the thread with LockSupport.parkNanos(), so that an empty spin loop cannot starve other threads of CPU time slices. Once a position is obtained it is claimed with CAS; a single producer does not need this.
  2. Next, the EventTranslator introduced above is called to rewrite the event sitting at the position obtained in step 1.
  3. Finally, publish. The Disruptor keeps an additional array, availableBuffer, that records which sequence currently occupies each RingBuffer slot. Taking the earlier 0, 10, 20 example: when sequence 10 is written, the corresponding slot of availableBuffer records that it now belongs to 10 (what this is for will be introduced later). Publishing updates availableBuffer and then wakes up all blocked consumers.

Let me briefly walk through the process. The example above is not quite legal, since bufferSize must be a power of 2, so take bufferSize = 8 instead. Suppose we have already pushed 8 events, exactly one full circle; pushing the next 3 messages then proceeds as follows: 1. First call next(3); we are currently at position 7, so the next three sequences are 8, 9, 10, whose remainders are 0, 1, 2. 2. Rewrite the data in the three memory slots 0, 1, and 2. 3. Write availableBuffer.
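Concretely, those three steps map onto the raw RingBuffer API that publishEvent wraps; a minimal sketch under the bufferSize = 8 assumption (error handling omitted):

RingBuffer<Element> ringBuffer = disruptor.getRingBuffer();

// Step 1: claim the next 3 sequences; next() parks briefly when consumers lag
long hi = ringBuffer.next(3);              // returns 10 if we were at 7
long lo = hi - 2;                          // 8
try {
    // Step 2: rewrite the pre-allocated events in slots 8&7=0, 9&7=1, 10&7=2
    for (long seq = lo; seq <= hi; seq++) {
        ringBuffer.get(seq).setValue("event " + seq);
    }
} finally {
    // Step 3: publish, stamping availableBuffer and waking blocked consumers
    ringBuffer.publish(lo, hi);
}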

By the way, does this process look familiar? It resembles 2PC, two-phase commit: first lock (claim) the RingBuffer position, then commit and notify the consumers. For a detailed introduction to 2PC, see my other article on distributed transactions.

3.3.2 Consumer

Consumers come in the two kinds described above: multiple consumers consuming independently, and multiple consumers sharing the same queue. Here I explain the more complicated case, multiple consumers sharing one queue; once you understand it, independent consumption follows easily. The disruptor.start() method launches our consumer threads for background consumption. Consumers involve two sequences we must pay attention to: one progress sequence shared by all consumers, and each consumer's own independent progress sequence. 1. CAS-claim the next position on the shared consumer sequence, and record the current progress on the consumer's own progress sequence. 2. Request the next readable position on the RingBuffer for yourself; the request may return a range larger than next. Under the blocking strategy the request proceeds as follows:

  • Get the latest position the producer has written to the RingBuffer.
  • Check whether it is smaller than the position I want to request.
  • If it is greater, that position has already been written, and it is returned to the consumer.
  • If it is smaller, that position has not been written yet, so the consumer blocks inside the blocking strategy and is awakened during the producer's commit phase.

3. Perform a readability check on the returned range, because the claimed positions may not be contiguous: for example, while slots 8 and 10 have been written, slot 9 may not have been filled in yet; step 1 would return 10, yet 9 is actually unreadable, so the position must be shrunk down to 8.
4. If after shrinking it is smaller than the current next, keep looping and re-requesting.
5. Hand the events to handler.onEvent() for processing (a simplified sketch of this consumer loop follows below).

Let's take an example: we want to claim position next = 8. 1. First CAS-claim progress 8 on the shared sequence and write progress 7 to our own sequence. 2. Get the maximum readable position for 8; different strategies behave differently here, and we chose blocking. Because the producer has produced 8, 9, and 10, the returned value is 10, and there is no need to compare against the availableBuffer again. 3. Finally hand the events to the handler for processing.
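Put together, the consumer side boils down to a loop like the following, a simplified pseudo-Java sketch of the logic in Disruptor's BatchEventProcessor (field names assumed, exception handling omitted):

long nextSequence = sequence.get() + 1L;               // my own progress + 1
while (running) {
    // Block per the wait strategy until at least nextSequence is published;
    // may return a higher sequence if a whole batch is already available
    long availableSequence = sequenceBarrier.waitFor(nextSequence);

    // Hand every ready event to the handler, flagging the end of the batch
    while (nextSequence <= availableSequence) {
        Element event = ringBuffer.get(nextSequence);
        handler.onEvent(event, nextSequence, nextSequence == availableSequence);
        nextSequence++;
    }

    // Publish my own progress so producers know these slots can be reused
    sequence.set(availableSequence);
}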

4. Disruptor in Log4j

The original post includes a chart comparing Log4j throughput using Disruptor, ArrayBlockingQueue, and synchronized logging; you can see that Disruptor leaves the others in the dust. Of course, many more frameworks use Disruptor, but they will not be introduced here.
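If you want to try this yourself, Log4j2's all-async mode is switched on with a single system property; a minimal sketch (assuming the com.lmax:disruptor jar is on the classpath):

# Route all loggers through Log4j2's Disruptor-backed async implementation
-DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector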

Finally

This article first introduced the shortcomings of traditional blocking queues, then focused on the Disruptor: why it is so fast, and how its workflow operates step by step.

If someone asks you one day to design an efficient lock-free queue, how would you design it? I believe you can piece the answer together from this article. If you have questions or want to exchange ideas, follow my official account and add me as a friend to discuss; there you can also receive the latest Java learning videos and interview materials for free. Follows and shares are my greatest support, O(∩_∩)O

