[Posts] Real-Time Stream Processing System Backpressure Mechanisms (Backpressure): A Summary

A Review of Backpressure Mechanisms (BackPressure) in Real-Time Stream Processing Systems

https://blog.csdn.net/qq_21125183/article/details/80708142

 

Disclaimer: This is an original article by the blogger, licensed under the CC 4.0 BY-SA copyright agreement. When reproducing it, please include the original source link and this statement.
This link: https://blog.csdn.net/qq_21125183/article/details/80708142

        Backpressure mechanisms (BackPressure) are widely used in real-time stream processing systems; a stream processing system must be able to handle backpressure gracefully. Backpressure typically arises in this scenario: a short-term load peak causes the rate at which a system receives data to be much higher than the rate at which it can process that data. Many everyday situations can cause backpressure. For example, a garbage collection pause may make incoming data pile up rapidly, or a promotional event or a major traffic spike may cause the load to surge. If backpressure is not handled properly, it can lead to resource exhaustion and even a system crash. A backpressure mechanism means the system can detect which of its Operators are blocked and then adaptively reduce the sending rate of the source or of the upstream stages. The mainstream stream processing systems Apache Storm, JStorm, Spark Streaming, S4, Apache Flink, and Twitter Heron all use backpressure mechanisms to solve this problem, but their implementations differ.

        Different components can run at different speeds (and the processing speed of each component changes over time). For example, because of workflow or task scheduling, or because of data skew, some data may be processed very slowly. In that case, if the upstream stages do not slow down, either the buffer queues keep growing or the system has to drop tuples. If tuples are dropped in the middle of the pipeline, efficiency is lost, because the computation already performed for those tuples is wasted. Some stream processing systems, such as Storm, resend the lost tuples, but this can lead to data consistency problems and to some Operators double-counting state, so the whole program produces inaccurate results. Because the rate at which the system receives data changes over time, a short-term load peak during which the receive rate far exceeds the processing rate would otherwise cause tuples to be lost in transit. A real-time stream processing system must therefore be able to handle the case where the sending rate is much higher than the rate the system can process, and most real-time stream processing systems use a backpressure (BackPressure) mechanism to solve this problem. Below we introduce the backpressure mechanisms used by different real-time stream processing systems:

1. Storm backpressure mechanism

1.1 Backpressure before Storm 1.0

        For Storm topologies with the acker mechanism enabled, a backpressure effect can be achieved through the conf.setMaxSpoutPending parameter: if downstream components (Bolts) cannot keep up with the sending speed and the number of tuples that have not been acknowledged in time exceeds the configured value, the Spout stops emitting data. The drawback of this approach is that conf.setMaxSpoutPending is hard to tune for the best backpressure effect: setting it too small limits throughput, while setting it too large can cause worker OOM; the data flow also tends to oscillate, which gradually reduces the effectiveness of the backpressure; furthermore, the approach has no effect for topologies that have the acker mechanism disabled.
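As a minimal sketch of this static approach (using Storm's Java API; the topology name, parallelism, and the value 5000 are placeholders chosen for illustration), the snippet below caps the number of unacknowledged tuples per spout task:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class MaxSpoutPendingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... spouts and bolts would be registered on the builder here ...

        Config conf = new Config();
        // Only effective when acking is enabled (at least one acker task).
        conf.setNumAckers(1);
        // At most 5000 tuples may be pending (emitted but not yet acked)
        // per spout task; beyond this limit the spout stops emitting.
        conf.setMaxSpoutPending(5000);

        StormSubmitter.submitTopology("demo-topology", conf, builder.createTopology());
    }
}
```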

1.2 Storm Automatic Backpressure

        Storm's new automatic backpressure mechanism (Automatic Back Pressure) monitors the receive queues of the bolts. When a queue rises above its high watermark, a dedicated backpressure thread writes the backpressure information to Zookeeper; a Zookeeper watch then notifies all Workers of the topology that they have entered the backpressure state, and finally the Spouts reduce the rate at which they emit tuples.

        Each Executor has a receive queue and a send queue, used to receive tuples and to send the tuples produced by its Spout or Bolt. Each Worker process has a single receive thread that listens on the receive port and puts every message arriving from the network into the corresponding Executor's receive queue. The Executor receive queue holds messages sent by other Workers or by other Executors inside the same Worker. The Executor's worker thread takes data out of the receive queue, calls the execute method, and puts the tuples to be emitted into the Executor's send queue. The Executor's send thread then takes messages from the send queue and, according to the destination address, delivers them either to the Worker's transfer queue or to another Executor's receive queue. Finally, the Worker's send thread reads messages from the transfer queue and sends the tuples over the network.

1. When the Executor thread of a Worker process finds that its receive queue is full, i.e. the queue has reached its high-watermark threshold, it sends a notification to the backpressure thread.

2. The backpressure thread registers the backpressure information of the current Worker process under a Znode in Zookeeper, at a path such as /Backpressure/topo1/wk1.

3. A Zookeeper Watcher monitors changes of the child nodes under the directory /Backpressure/topo1. If it finds that a znode has been added or otherwise changed, it concludes that topo1 needs backpressure control and notifies all Workers of topo1 to enter the backpressure state (see the sketch after this list).

4. Finally, the Spouts reduce the rate at which they emit tuples.
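To make step 3 concrete, here is a minimal, hypothetical sketch (not Storm's actual source) of watching a topology's backpressure directory with the plain ZooKeeper Java client; the path /Backpressure/topo1 follows the example above:

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BackpressureWatcher implements Watcher {
    private static final String BP_PATH = "/Backpressure/topo1";
    private final ZooKeeper zk;

    public BackpressureWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        watch();
    }

    // (Re-)register the watch and read the current children.
    private void watch() throws Exception {
        List<String> workers = zk.getChildren(BP_PATH, this);
        boolean backpressure = !workers.isEmpty();
        // In Storm this would notify every Worker of the topology;
        // here we only print the state.
        System.out.println("topo1 backpressure = " + backpressure + ", workers = " + workers);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
            try {
                watch(); // a child znode was added/removed -> re-evaluate the backpressure state
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```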

2. JStorm backpressure mechanism

        JStorm implements two levels of backpressure. The first level is similar to Storm's: it monitors the execution queues, but instead of coordinating through ZK it coordinates through the Topology Master. A high water mark and a low water mark are defined on the queue; when the execution queue exceeds the high water mark, the bolt is considered unable to keep up, a control message is sent to the TM, and the upstream starts to slow its sending rate until the downstream drops below the low water mark, at which point the backpressure is released.

        In addition, another level of backpressure is implemented at the Netty layer. Each Worker Task has its own send and receive buffers, whose sizes can be capped and controlled. If the spout produces an especially large amount of data, its buffer fills up, which in turn fills the receive buffer of the downstream bolt and creates backpressure.

 

        Flow-control mechanism: in JStorm's flow-control mechanism, backpressure is triggered when downstream bolts become blocked and the proportion of blocked tasks exceeds a certain ratio (currently 0.1 by default).

        Flow-control method: the execution time of the blocked Task's execution thread is measured and sent to the Spout; from then on, the Spout waits for this execution time after every tuple it emits.

        How a blocked Task is detected: JStorm samples the queue status over 4 consecutive sampling periods; when the queue occupancy exceeds 80% (configurable), the task is considered to be in a blocked state.
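A much-simplified sketch of this idea, not JStorm's actual implementation (class and field names such as ThrottledEmitter and delayMillis are invented for illustration): once the flow-control logic reports an execution delay, the spout side sleeps for that long after each emitted tuple.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration of JStorm-style flow control on the spout side. */
public class ThrottledEmitter {
    // Delay reported by the flow-control logic; 0 means no backpressure.
    private final AtomicLong delayMillis = new AtomicLong(0);

    /** Called when a control message arrives from the Topology Master. */
    public void onControlMessage(long blockedTaskExecTimeMillis) {
        delayMillis.set(blockedTaskExecTimeMillis);
    }

    /** Emit one tuple, then wait for the currently requested delay. */
    public void emit(Object tuple) throws InterruptedException {
        send(tuple);
        long d = delayMillis.get();
        if (d > 0) {
            Thread.sleep(d); // one wait per emitted tuple, as described above
        }
    }

    private void send(Object tuple) {
        // ... hand the tuple to the transport layer ...
    }
}
```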

3. Spark Streaming backpressure mechanism

3.1 Why the Backpressure mechanism was introduced

        By default, Spark Streaming receives data through Receivers at the rate at which the producer generates it. During computation it can happen that batch processing time > batch interval, where batch processing time is the time actually spent computing one batch and batch interval is the batch interval configured for the Streaming application. This means Spark Streaming receives data faster than Spark removes it from the queue, i.e. the processing capacity is too low and the data received at the current rate cannot be fully processed within the configured interval. If this situation lasts too long, data accumulates in memory and the Executor hosting the Receiver may run out of memory, among other problems (if the StorageLevel includes disk, data that does not fit in memory spills to disk, which increases latency). Before Spark 1.5, a user who wanted to limit the Receiver's data-receiving rate could set the static configuration parameter "spark.streaming.receiver.maxRate". Although limiting the receive rate this way can match the current processing capacity and prevent out-of-memory errors, it introduces other problems. For example, if the producer generates data faster than maxRate and the cluster could also process faster than maxRate, resource utilization drops. To better match the data-receiving rate to the cluster's processing capacity, Spark Streaming introduced a backpressure mechanism in v1.5 that dynamically controls the receive rate to adapt to the cluster's processing capability.

3.2 The Backpressure mechanism

        Spark Streaming Backpressure: the Receiver's data-receiving rate is adjusted dynamically based on the job execution information fed back by the JobScheduler. The property "spark.streaming.backpressure.enabled" controls whether the backpressure mechanism is enabled; its default value is false, i.e. disabled.
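For reference, a minimal sketch of how these settings could be supplied when building a streaming context (the property names are the Spark ones quoted above; the concrete values, app name, and batch interval are placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BackpressureConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("backpressure-demo")
                // Enable dynamic rate control (disabled by default).
                .set("spark.streaming.backpressure.enabled", "true")
                // Optional static cap per receiver, records/second (the pre-1.5 approach).
                .set("spark.streaming.receiver.maxRate", "10000");

        // 2-second batch interval; receivers and output operations are omitted.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));
    }
}
```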

The Spark Streaming architecture is shown in the figure below:

The Spark Streaming backpressure process is shown in the figure below:

        A new component, RateController, is added on top of the original architecture. This component listens for "OnBatchCompleted" events and extracts the processingDelay and schedulingDelay information from them. Based on this information, an Estimator computes the maximum processing rate (rate), and finally the Receiver-based Input Stream forwards the rate through ReceiverTracker and ReceiverSupervisorImpl to the BlockGenerator (which extends RateLimiter).
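As a rough illustration only (Spark's real estimator is PID-based; the formula below is a naive simplification written for this article, not Spark's code): if the last batch of numRecords elements took processingDelay milliseconds to process and waited schedulingDelay milliseconds before being scheduled, one crude estimate of a sustainable rate is:

```java
/** Naive rate estimate for illustration; Spark actually uses a PID-based estimator. */
public final class NaiveRateEstimator {
    /**
     * @param numRecords       elements processed in the last batch
     * @param processingDelay  time spent processing the batch, in ms
     * @param schedulingDelay  time the batch waited before being scheduled, in ms
     * @return estimated records/second the pipeline can currently sustain
     */
    public static double estimate(long numRecords, long processingDelay, long schedulingDelay) {
        // Treat queueing time as part of the effective cost of the batch.
        double effectiveMillis = Math.max(1, processingDelay + schedulingDelay);
        return numRecords * 1000.0 / effectiveMillis;
    }
}
```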

4. Heron backpressure mechanism

        When the downstream cannot keep up with the upstream sending rate, as soon as a StreamManager finds that one or more Heron Instances have slowed down, it immediately throttles its local spouts, lowering their sending rate and stopping reading data from them. The affected StreamManager also sends a special "start backpressure" message to the other StreamManagers, asking them to throttle their local spouts as well. When the other StreamManagers receive this special message, they throttle by no longer reading tuples from their local spouts. Once the problematic Heron Instance recovers its speed, the local SM sends a "stop backpressure" message to lift the throttling.

        Each socket channel is associated with an application-level buffer, which has a high watermark and a low watermark. Backpressure is triggered when the buffer size reaches the high watermark and remains in effect until the buffer size falls below the low watermark. The rationale for this design is to prevent the topology from rapidly oscillating between entering and leaving backpressure mode.
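A small sketch of this hysteresis idea in isolation (the class and thresholds are invented for illustration, not Heron's code): the state flips on only above the high mark and off only below the low mark, so values in between never cause oscillation.

```java
/** High/low watermark hysteresis, used to avoid oscillating in and out of backpressure. */
public class WatermarkGate {
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private boolean backpressure = false;

    public WatermarkGate(long highWatermarkBytes, long lowWatermarkBytes) {
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    /** Called whenever the buffered byte count changes; returns the current state. */
    public boolean update(long bufferedBytes) {
        if (!backpressure && bufferedBytes >= highWatermarkBytes) {
            backpressure = true;   // start backpressure only above the high mark
        } else if (backpressure && bufferedBytes < lowWatermarkBytes) {
            backpressure = false;  // release only once we fall below the low mark
        }
        return backpressure;       // between the marks, keep the previous state
    }
}
```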

5. Flink backpressure mechanism

        Flink does not use any complex mechanism to solve the backpressure problem, because it does not need one! It responds to backpressure gracefully by virtue of being a pure data-flow engine. Below we analyze in depth how Flink transfers data between Tasks and how the data flow slows down naturally.
        At runtime, Flink consists mainly of two components: operators and streams. Each operator consumes an intermediate stream, applies a transformation to it, and produces a new stream. A good analogy for Flink's network mechanism is that Flink makes effective use of distributed bounded blocking queues, just like Java's ordinary bounded blocking queue (BlockingQueue). Remember the classic thread-communication example, the producer-consumer model? With a BlockingQueue, a slower consumer automatically slows the sender down, because once the (bounded) queue is full the sender blocks. Flink's solution to backpressure has exactly this feel.
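To recall that analogy, here is a plain-Java producer-consumer sketch (queue size and sleep times are arbitrary): the bounded ArrayBlockingQueue makes the fast producer block as soon as the slow consumer falls behind, which is exactly the "natural" slowdown Flink relies on.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueBackpressure {
    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10); // bounded capacity

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; ; i++) {
                    queue.put(i);              // blocks when the queue is full
                    System.out.println("produced " + i);
                }
            } catch (InterruptedException ignored) { }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Integer v = queue.take();  // blocks when the queue is empty
                    Thread.sleep(100);         // slow consumer
                    System.out.println("consumed " + v);
                }
            } catch (InterruptedException ignored) { }
        });

        producer.start();
        consumer.start();
    }
}
```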

        In Flink, these distributed blocking queues are the logical data streams, and their bounded capacity is implemented through buffer pools (LocalBufferPool). Every stream that is produced or consumed is assigned a buffer pool. The pool manages a set of buffers (Buffer); a buffer can be recycled and reused after it has been consumed. This is easy to understand: you take a buffer from the pool, fill it with data, and after the data has been consumed you return the buffer to the pool, where it can be used again.

5.1 Memory management for Flink network transfers

        The figure below shows how Flink manages memory in a network-transfer scenario. Data transmitted over the network is written into the Task's InputGate (IG); after the Task has processed it, the Task writes it out through its ResultPartition (RP). All of a Task's input and output data is held in buffers (all of it as raw bytes); a Buffer is a wrapper class around a MemorySegment.

 

 

  1. When a TaskManager (TM) starts, it first initializes a NetworkEnvironment object; everything network-related in the TM (such as the Netty connections) is managed by this class, including the NetworkBufferPool. According to the configuration, Flink creates a certain number of MemorySegment memory blocks (2048 by default) inside the NetworkBufferPool (Flink's memory management will be covered in detail in a later article); the total number of memory blocks represents all the memory available for network transfers. NetworkEnvironment and NetworkBufferPool are shared among Tasks; each TM instantiates only one of each.
  2. When a Task thread starts, it registers with the NetworkEnvironment, which creates a LocalBufferPool (buffer pool) for the Task's InputGate (IG) and another for its ResultPartition (RP), and sets how many MemorySegments (memory blocks) each pool may request. The initial number of memory blocks in the IG's pool equals the number of InputChannels in the IG, and the initial number in the RP's pool equals the number of ResultSubpartitions in the RP. However, whenever a buffer pool is created or destroyed, the NetworkBufferPool recomputes the remaining free memory blocks and distributes them evenly among the existing buffer pools. Note that this step only specifies how many memory blocks a pool may use; it does not actually allocate them. Memory blocks are allocated only when needed. Why grow the buffer pools dynamically? Because more memory means the system can absorb transient pressure (such as GC pauses) more easily without frequently entering the backpressure state, so we want to make use of those idle memory blocks.
  3. While the Task thread is running, when the Netty receiving side gets data, the InputChannel (actually a RemoteInputChannel) requests a memory block from its buffer pool in order to copy the data out of Netty into the Task (① in the figure above). If the pool has no free memory block and its quota has not yet been reached, it requests a memory block from the NetworkBufferPool (②) and hands it to the InputChannel to be filled with data (③ and ④). What if the pool has already reached its quota, or the NetworkBufferPool has no free memory blocks either? In that case, the Task's Netty channel pauses reading, the upstream sender responds by stopping sending, and the topology enters the backpressure state. When the Task thread writes data to the ResultPartition, it likewise requests memory blocks from its buffer pool; if none is available, it blocks on the request, which effectively pauses the writing (a simplified sketch of this request/recycle cycle follows the list).
  4. When a memory block has been fully consumed (on the input side this means its bytes have been deserialized into objects; on the output side it means its bytes have been written to the Netty channel), Buffer.recycle() is called and the memory block is returned to the LocalBufferPool (⑤). If the pool now holds more blocks than its capacity allows (the capacity may have shrunk dynamically, as described above, e.g. because a newly registered Task reduced its share), the LocalBufferPool hands the block back to the NetworkBufferPool (⑥); otherwise the block stays in the pool, reducing the overhead of repeated allocation.
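The following toy buffer pool, written from scratch for illustration (it is not Flink's LocalBufferPool, and the names are invented), shows the blocking request / recycle behaviour that steps 3 and 4 rely on:

```java
import java.util.ArrayDeque;

/** Toy fixed-capacity buffer pool; requesting a buffer blocks until one is recycled. */
public class ToyBufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    public ToyBufferPool(int numBuffers, int bufferSize) {
        for (int i = 0; i < numBuffers; i++) {
            free.add(new byte[bufferSize]);
        }
    }

    /** Blocks until a buffer is available -- this is where a backpressured writer waits. */
    public synchronized byte[] requestBufferBlocking() throws InterruptedException {
        while (free.isEmpty()) {
            wait();
        }
        return free.poll();
    }

    /** Called once the buffer's contents have been consumed downstream. */
    public synchronized void recycle(byte[] buffer) {
        free.add(buffer);
        notifyAll();
    }
}
```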

5.2 Flink's backpressure mechanism

The figure below gives a simple illustration of how data is transferred between two Tasks and how Flink perceives backpressure:

 

  1. Record "A" enters Flink and is processed by Task 1. (The Netty receive, deserialization, and similar steps are omitted here.)
  2. The record is serialized into a buffer.
  3. The buffer is sent to Task 2, and Task 2 reads the record out of this buffer.
Do not forget: a record can be processed by Flink only if there is a free Buffer available.

Look at the figure above together with the two points below: Task 1 has a LocalBufferPool associated with its output (call it buffer pool 1), and Task 2 has a LocalBufferPool associated with its input (call it buffer pool 2). If buffer pool 1 has a free buffer available to serialize record "A" into, we serialize the record and send the buffer.

Here we need to pay attention to two scenarios:

  • Local transfer: if Task 1 and Task 2 run in the same worker node (TaskManager), the buffer can be handed directly to Task 2. Once Task 2 has consumed it, the buffer is recycled back to buffer pool 1. If Task 2 is slower than Task 1, buffers are recycled more slowly than Task 1 takes them, so eventually buffer pool 1 has no free buffers left and Task 1 has to wait for one to become available. This is what eventually slows Task 1 down.
  • Remote transfer: if Task 1 and Task 2 run on different worker nodes, the buffer is recycled as soon as it has been handed to the network (TCP channel). On the receiving side, a buffer is requested from the LocalBufferPool and the data is copied from the network into it; if no buffer is available, reading from the TCP connection stops. On the output side, Netty's watermark mechanism ensures that not too much data is in flight on the network (more on this later): if the amount of data queued in Netty's output buffer exceeds the high watermark, we wait until it drops below the low watermark before writing more, which guarantees that there is never too much data on the wire (a sketch follows below). If the receiving side stops consuming (because it has no available buffer in its pool), data accumulates in the network buffers, and the sending side eventually pauses as well. In addition, the sender's buffers are no longer recycled, so the writer blocks when requesting a buffer from its LocalBufferPool, and the ResultSubPartition stops accepting writes.
        These fixed-size buffer pools act like bounded blocking queues and give Flink a robust backpressure mechanism, ensuring that a Task never produces data faster than it can be consumed. The scheme described above for data transfer between two Tasks extends naturally to more complex pipelines, so backpressure can propagate through the whole pipeline.
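The output-side watermark check can be illustrated with Netty's public API (a generic sketch, not Flink's internal code; the watermark values are arbitrary): writes are attempted only while the channel reports itself writable, i.e. while its outbound buffer is below the configured high watermark.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public class WatermarkAwareWriter {
    /** Configure the low/high watermarks (32 KB / 64 KB here, chosen arbitrarily). */
    public static void configure(Channel channel) {
        channel.config().setOption(
                ChannelOption.WRITE_BUFFER_WATER_MARK,
                new WriteBufferWaterMark(32 * 1024, 64 * 1024));
    }

    /** Write only while the outbound buffer is below the high watermark. */
    public static boolean tryWrite(Channel channel, ByteBuf data) {
        if (channel.isWritable()) {
            channel.writeAndFlush(data);
            return true;
        }
        // Above the high watermark: the caller should pause and retry once
        // channelWritabilityChanged() fires (i.e. we are below the low watermark again).
        return false;
    }
}
```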

5.3 Backpressure experiment

        In addition, the official blog includes a simple experiment to show the effect of backpressure. The figure shows, over time, the throughput of the producer (yellow line) and the consumer (green line) as a percentage of the maximum throughput (up to 8 million records per second in a single JVM), averaged over 5-second windows. The average throughput was measured by counting the number of records processed by the task every five seconds. The experiment was run in a single JVM, but it exercised the complete Flink stack.

        First, we run the producer task at 60% of its maximum speed (the slowdown is simulated with Thread.sleep()). The consumer processes data at the same speed. Then we slow the consumer task down to 30% of its maximum speed. At this point the backpressure effect appears: as we can see, the producer's speed also naturally drops to 30% of its maximum. Next, we remove the artificial slowdown of the consumer task, and both the producer and the consumer reach their maximum throughput. Then we slow the consumer down to 30% again, and the pipeline reacts immediately: the producer's speed is automatically reduced to 30% as well. Finally, we remove the limit once more, and both tasks return to 100% of their speed. All in all, we can see that the producer and the consumer adjust to each other's throughput within the pipeline, which is exactly the backpressure behaviour we want to see.

5.4 Flink backpressure monitoring

        In Storm/JStorm, it is enough to detect that a queue is full to record that the topology has entered backpressure. Flink's backpressure, however, is so "natural" that we cannot simply monitor a queue to observe the backpressure state. Flink therefore uses a trick to monitor backpressure: if a Task has been slowed down by backpressure, it will be stuck requesting a memory block from the LocalBufferPool, and its stack trace will look like this:

 

```
java.lang.Object.wait(Native Method)
o.a.f.[...].LocalBufferPool.requestBuffer(LocalBufferPool.java:163)
o.a.f.[...].LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:133) <--- BLOCKING request
[...]
```

Then the matter becomes simple: backpressure monitoring can be implemented by repeatedly sampling the stack trace of each task.

In Flink's implementation, backpressure detection is triggered for a Job only when the web UI is switched to that Job's Backpressure page, because the detection is fairly expensive. The JobManager sends a TriggerStackTraceSample message to every TaskManager via Akka. By default, a TaskManager takes 100 stack trace samples at 50 ms intervals (so one backpressure check takes at least 5 seconds). The results of these 100 samples are returned to the JobManager, which computes the backpressure ratio (number of samples showing backpressure / number of samples) and finally displays it in the UI. The UI refreshes at a default interval of one minute, so as not to put too much load on the TaskManagers.
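The sampling idea can be sketched in plain Java (this is not Flink's code; thread selection, sample count, and the matched method name are simplified for illustration): repeatedly dump a task thread's stack and count how often it sits in the blocking buffer request.

```java
public class BackpressureSampler {
    /** Ratio of samples in which the thread was blocked requesting a buffer. */
    public static double sample(Thread taskThread, int samples, long intervalMillis)
            throws InterruptedException {
        int blocked = 0;
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement frame : taskThread.getStackTrace()) {
                if (frame.getMethodName().contains("requestBufferBlocking")) {
                    blocked++;   // this sample counts as "backpressured"
                    break;
                }
            }
            Thread.sleep(intervalMillis);
        }
        return (double) blocked / samples;   // e.g. 100 samples at 50 ms, as Flink does
    }
}
```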

Summary

Flink does not need a special mechanism to handle backpressure, because the way data is transferred in Flink already provides one. The maximum throughput Flink can achieve is therefore determined by the slowest component in its pipeline. Compared with the Storm/JStorm implementations, Flink's is more concise and elegant: there is no backpressure-specific code to be found in the source, and no Zookeeper/TopologyMaster involvement is needed, which both reduces the load on the system and allows a faster response to backpressure.


Origin www.cnblogs.com/jinanxiaolaohu/p/11691789.html