High-Performance Design of the Disruptor Concurrency Framework


Architecture UML

1 Single-Threaded Writes

Disruptor's RingBuffer can be completely lock-free because only a single thread writes to it. This is the precondition behind every other optimization: without it, no technique can make writes completely lock-free. The same core idea runs through other high-performance frameworks such as Redis and Netty.
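
To make the single-writer idea concrete, here is a minimal sketch. `SingleWriterCounter` is a hypothetical class, not part of Disruptor, and it assumes that exactly one thread ever calls `increment()`. Under that assumption the writer can keep a plain private counter and publish it with an ordered store, with no lock and no CAS retry loop.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the single-writer principle (not Disruptor code).
final class SingleWriterCounter {
    // Value visible to reader threads.
    private final AtomicLong published = new AtomicLong(-1);
    // Touched only by the single writer thread, so a plain field is enough.
    private long local = -1;

    // Assumption: called from exactly one thread.
    long increment() {
        long next = ++local;      // no contention, so no CAS loop is needed
        published.lazySet(next);  // ordered store makes the new value visible to readers
        return next;
    }

    // Safe to call from any thread.
    long get() {
        return published.get();
    }
}
```

If a second thread also called `increment()`, the plain `local` field would race, and the class would have to fall back to a lock or a CAS loop. That is the cost the single-writer premise avoids.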

2 System Memory Optimization - Memory Barrier

Lock-free operation needs a second key technique: memory barriers.

In Java, this corresponds to volatile variables and happens-before semantics.

See also: memory barriers in the Linux kernel, smp_wmb()/smp_rmb(). For example, Linux's kfifo uses smp_wmb() in both its read path and its write path: github.com/opennetwork…
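
As a rough illustration of what volatile and happens-before buy us, the sketch below publishes a plain field through a volatile flag. `VolatilePublish` and its method names are made up for this example. The Java memory model guarantees that a write to a volatile field happens-before a later read of that field that sees the written value, so a reader that observes `ready == true` also observes the earlier plain write to `payload`.

```java
// Hypothetical example of safe publication through a volatile flag.
final class VolatilePublish {
    private int payload;             // plain field, written before the flag
    private volatile boolean ready;  // volatile flag acts as the barrier

    void publish(int value) {
        payload = value;  // 1. plain write
        ready = true;     // 2. volatile write, orders step 1 before it
    }

    int awaitAndRead() {
        while (!ready) {          // volatile read
            Thread.onSpinWait();  // busy-wait hint (Java 9+)
        }
        return payload;           // guaranteed to see the value written in step 1
    }

    // Runs one reader thread against one writer and returns what the reader saw.
    static int demo() {
        VolatilePublish box = new VolatilePublish();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = box.awaitAndRead());
        reader.start();
        box.publish(42);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Without `volatile` on `ready`, the reader could spin forever or read a stale `payload`, because nothing would order the two writes for other threads.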

3 System Cache Optimization - Eliminate False Sharing

The CPU cache stores data in units of cache lines. A cache line is a run of consecutive bytes whose length is a power of 2, typically 32 to 256 bytes; 64 bytes is the most common size.

When multiple threads modify mutually independent variables that happen to share a cache line, they inadvertently degrade each other's performance. This is false sharing.

Core: Sequence

A Sequence can be regarded as an AtomicLong that tracks progress. It also prevents false sharing of CPU cache lines between different Sequences.

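Sequence is designed so that the stored value always has a cache line to itself (8 longs are exactly 64 bytes), trading space for time. The extra variables carry no meaning of their own; they only pad the cache line so that we can rely on the CPU cache as much as possible. Accessing the L1 or L2 cache built into the CPU has a latency of 1/15 or even 1/100 of a main-memory access, and main memory is far slower than the CPU. To chase peak performance, we need to fetch data from the CPU cache rather than from memory as often as we can.

The CPU cache does not load data from memory field by field; it loads a whole cache line. If we define a long array of length 64, the data is not brought into the CPU cache one array element at a time, but one fixed-length cache line at a time.

On 64-bit Intel CPUs a cache line is usually 64 bytes. A long takes 8 bytes, so 8 longs are loaded in one go, that is, 8 consecutive values of the array. This makes traversing the array fast, because the next 7 accesses all hit the cache and do not have to read from memory again.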
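
The arithmetic above can be written down as a small sketch. `CacheLineMath` is a hypothetical helper, and it assumes a 64-byte cache line and array data that starts on a cache-line boundary; real hardware and JVM object layout may differ.

```java
// Hypothetical helper that models cache-line loads for a sequential scan of a long[].
final class CacheLineMath {
    static final int CACHE_LINE_BYTES = 64;                          // assumption: typical x86-64 size
    static final int LONGS_PER_LINE = CACHE_LINE_BYTES / Long.BYTES; // 64 / 8 = 8

    // Cache lines that must be loaded to read `count` consecutive longs,
    // assuming the first element is aligned to a line boundary.
    static int linesLoaded(int count) {
        return (count + LONGS_PER_LINE - 1) / LONGS_PER_LINE;
    }

    // Accesses served from an already loaded line during the same scan.
    static int cacheHits(int count) {
        return count - linesLoaded(count);
    }
}
```

For the 64-element long array mentioned above, this model gives 8 line loads and 56 cache hits: each load is followed by 7 hits.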

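The trouble starts when we use individual variables instead of an array. Disruptor's RingBuffer defines a RingBufferFields class that holds indexMask and a few other variables describing the RingBuffer's internal state. When the CPU loads one of these values, it naturally pulls it from memory into the cache. Along with it, however, the cache also loads whatever variables are defined immediately before and after it.

This is where the problem arises. Disruptor is a multithreaded framework, and the variables defined before and after this data may be updated and read by several different threads, with those writes and reads coming from different CPU cores. To keep the data in sync, the contents of the CPU cache have to be written back to memory or reloaded from memory.

These write-backs and reloads do not operate on a single variable; they operate on a whole cache line. So when the variables around INITIAL_CURSOR_VALUE are written back to memory, this field is written back too, and the cached copy of the constant is invalidated. The next read of the value has to go to memory again, which is much slower. To counter this, Disruptor uses cache line padding: it defines 7 long variables before and 7 long variables after the variables in RingBufferFields: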

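  • The 7 in front come from the inherited RingBufferPad class
  • The 7 behind are defined directly in the RingBuffer class

These 14 variables serve no practical purpose. We neither read them nor write them.

The variables defined in RingBufferFields are all final, so after the first write they are never modified again. Once they have been loaded into the CPU cache, they will not be evicted as long as they are read frequently. Reads of these values therefore always run at CPU cache speed rather than at memory speed.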
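
The padding layout described here can be sketched as a three-level class hierarchy. The class and field names below (`LeftPad`, `PaddedValue`, `PaddedLong`, `p1` to `p15`) are hypothetical and only mirror the pattern; they are not copied from Disruptor. The sketch also assumes the common HotSpot behavior of laying out superclass fields before subclass fields, which the Java language itself does not guarantee.

```java
// Hypothetical padding sketch: 7 longs before the hot field, 7 longs after it.
class LeftPad {
    protected long p1, p2, p3, p4, p5, p6, p7;       // 56 bytes of padding in front
}

class PaddedValue extends LeftPad {
    protected volatile long value;                   // the only field that is actually used
}

final class PaddedLong extends PaddedValue {
    protected long p9, p10, p11, p12, p13, p14, p15; // 56 bytes of padding behind

    PaddedLong(long initial) {
        value = initial;
    }

    long get() {
        return value;
    }

    void set(long newValue) {
        value = newValue;
    }
}
```

With 56 bytes on each side, `value` cannot share a 64-byte cache line with a hot field of a neighboring object, wherever the object happens to start. The price is roughly 112 extra bytes per instance, which is the space-for-time trade mentioned earlier.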

Using RingBuffer to Exploit Caching and Branch Prediction

This idea of exploiting CPU cache performance runs through the whole of Disruptor. Taken as a whole, the framework is a high-speed queue for the producer-consumer model.

The producer keeps adding new pending tasks to the queue, and the consumer keeps taking tasks from the queue and processing them. The most natural way to implement a queue is a linked list: you only maintain its head and tail. The producer keeps appending new nodes at the tail, and the consumer keeps removing the oldest node from the head. In Java, LinkedBlockingQueue can be used directly for the producer-consumer pattern. Disruptor does not use LinkedBlockingQueue; it uses a RingBuffer, which is backed by a fixed-length array. Compared with a linked list, the data of an array has spatial locality in memory.

Several consecutive elements of the array are loaded into the CPU cache together, so access and traversal are faster. The nodes of a linked list are mostly not adjacent in memory, so they do not get the benefit of reading consecutive data from the cache after a whole cache line has been loaded.

Another major advantage of traversing array data sequentially is that branch prediction at the CPU level becomes very accurate. This lets the CPU's multi-stage pipeline be used more efficiently, so the program runs faster.
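
A minimal sketch of an array-backed ring can show how a growing sequence number maps onto a fixed-length array. `MiniRing` is a hypothetical class written for this article; it stores plain longs and assumes the size is a power of 2, so that the slot index can be computed with a bit mask (the role indexMask plays in RingBufferFields) instead of a modulo.

```java
// Hypothetical array-backed ring; not Disruptor's RingBuffer.
final class MiniRing {
    private final long[] slots;   // contiguous storage, so neighbors share cache lines
    private final int indexMask;  // size - 1, valid only for power-of-2 sizes

    MiniRing(int size) {
        if (size < 1 || Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of 2");
        }
        this.slots = new long[size];
        this.indexMask = size - 1;
    }

    // Maps an ever-increasing sequence to a slot, e.g. sequence 9 -> slot 1 when size is 8.
    int indexOf(long sequence) {
        return (int) (sequence & indexMask);
    }

    void set(long sequence, long value) {
        slots[indexOf(sequence)] = value;
    }

    long get(long sequence) {
        return slots[indexOf(sequence)];
    }
}
```

Sequences 3 and 11 land in the same slot of an 8-element ring, which is exactly why the producer must not get a full lap ahead of the slowest consumer.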

4 Algorithm Optimization - Sequence Barrier Mechanism

When a producer publishes an event, it always starts by claiming a sequence:

 long sequence = ringBuffer.next();

In Disruptor 3.0, SequenceBarrier and Sequence work together to coordinate the pace of consumers and producers, avoiding the use of locks and CAS.

  • A consumer's sequence must not exceed the producer's sequence
  • A consumer's sequence must not exceed the sequence of the consumers it depends on (its predecessors)
  • The producer's sequence must not run more than one buffer length ahead of the smallest consumer sequence; otherwise a fast producer would overwrite messages that have not been consumed yet
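
The producer-side rule can be sketched as a single predicate. `GatingCheck.canClaim` is a hypothetical helper; its parameters are named after the variables used in Disruptor's single-producer `next(int n)`, but the method itself is only an illustration of the wrap-point arithmetic.

```java
// Hypothetical predicate modelling the producer's wrap-point check.
final class GatingCheck {
    /**
     * @param nextValue           last sequence the producer has already claimed (-1 initially)
     * @param n                   number of slots the producer wants to claim now
     * @param bufferSize          length of the ring buffer
     * @param minConsumerSequence smallest sequence any gating consumer has finished
     * @return true if claiming would not overwrite a slot that is still unconsumed
     */
    static boolean canClaim(long nextValue, int n, int bufferSize, long minConsumerSequence) {
        long nextSequence = nextValue + n;
        // The slot for nextSequence last held the sequence one full lap earlier.
        long wrapPoint = nextSequence - bufferSize;
        // Safe only if the slowest consumer has already processed that older sequence.
        return wrapPoint <= minConsumerSequence;
    }
}
```

With a buffer of 8, a producer at sequence 9 and the slowest consumer at sequence 2, claiming one slot is allowed (wrap point 2), while claiming two is not (wrap point 3). In the second case the real sequencer waits until the consumer advances.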

SingleProducerSequencer#next

     /**
      * @see Sequencer#next(int)
      */
     @Override
     public long next(int n)
     {
         if (n < 1)
         {
             throw new IllegalArgumentException("n must be > 0");
         }

         // nextValue is a field of SingleProducerSequencer; its initial value is -1
         long nextValue = this.nextValue;

         long nextSequence = nextValue + n;
         // Used to detect whether the claimed sequence would wrap around the whole ring buffer
         long wrapPoint = nextSequence - bufferSize;
         // Cached minimum consumer sequence, so the consumers are not re-read on every call
         long cachedGatingSequence = this.cachedValue;

         if (wrapPoint > cachedGatingSequence || cachedGatingSequence > nextValue)
         {
             long minSequence;
             // Wait until the slowest consumer has moved past the wrap point
             while (wrapPoint > (minSequence = Util.getMinimumSequence(gatingSequences, nextValue)))
             {
                 LockSupport.parkNanos(1L); // TODO: Use waitStrategy to spin?
             }

             this.cachedValue = minSequence;
         }

         this.nextValue = nextSequence;

         return nextSequence;
     }


Origin juejin.im/post/7083848134157140005