Flink Back Pressure: Implementation and Monitoring

Disclaimer: This series of blog posts is compiled from SGG videos and is well suited for getting started.

1. What is Back Pressure
If you see a back pressure warning for a task (such as High), it means that data is being produced faster than the downstream operators can consume it. Take a simple Source -> Sink job as an example. If you see a warning on the Source, it means the Sink is consuming data more slowly than the Source is producing it, so the Sink is applying back pressure upstream to the Source.
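To make this concrete, here is a minimal sketch of such a Source -> Sink style job using the Flink DataStream API. The endless counter source, the artificial Thread.sleep in the map stage, and the print sink are purely illustrative, chosen only so that the downstream stage is slower than the source:

// Illustrative sketch of a fast source feeding a slow downstream stage.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class SourceSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;
            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long i = 0;
                while (running) {
                    ctx.collect(i++);              // fast producer
                }
            }
            @Override
            public void cancel() { running = false; }
        })
        .map(x -> { Thread.sleep(1); return x; })  // slow downstream stage
        .print();                                  // sink

        env.execute("source-sink-backpressure-demo");
    }
}

Running something like this, the slow downstream stage will eventually apply back pressure and throttle the fast source.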

Many situations can cause back pressure. For example, a GC pause can cause incoming data to pile up, or a data source may send data at a temporary peak rate. If back pressure is not handled properly, it can lead to resource exhaustion or, in the worst case, data loss.

Consider a simple example. Assume a data flow pipeline (abstracted as Source, streaming job and Sink) processes data at a rate of 5 million elements per second in steady state, as shown in the normal situation below (each black bar represents 1 million elements; the figure is a one-second snapshot of the system):

If at some moment the Source's send rate peaks and the data produced per second doubles, while the downstream processing capacity stays unchanged:

The message processing rate is now lower than the message sending rate, messages back up, and the system no longer runs smoothly. How can this situation be handled?

a. Drop the excess elements; however, for many streaming applications data loss is unacceptable.
b. Buffer the backlogged messages and tell the sender to slow down. The buffer should be durable, because after a failure this data must be replayed to prevent loss. A minimal sketch of this idea follows the list.
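Option b can be illustrated with a generic (non-Flink) sketch: a bounded, blocking buffer sits between producer and consumer, and when the buffer is full the producer's put() blocks, which is exactly the "tell the sender to slow down" effect. Durability is omitted here, and all class and variable names are made up for this example:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded buffer between producer and consumer: when it is full,
// put() blocks and back pressure propagates to the producer.
public class BoundedBufferDemo {
    public static void main(String[] args) {
        BlockingQueue<Long> buffer = new ArrayBlockingQueue<>(1_000);

        Thread producer = new Thread(() -> {
            long i = 0;
            try {
                while (true) {
                    buffer.put(i++);      // blocks when the buffer is full
                }
            } catch (InterruptedException ignored) { }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Long element = buffer.take();
                    Thread.sleep(1);      // simulate a slow consumer
                }
            } catch (InterruptedException ignored) { }
        });

        producer.start();
        consumer.start();
    }
}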

2. Back Pressure Implementation
Stack trace sampling

Back pressure monitoring works by repeatedly sampling the stack traces of running tasks: the JobManager repeatedly calls Thread.getStackTrace() for the task threads of the job.

If the samples show that a task thread is stuck in a certain internal method call (requesting a buffer from the network stack), this indicates that there is back pressure on the task.

By default, the JobManager triggers 100 stack trace samples per task, with 50 ms between samples, to determine back pressure. The ratio shown in the web interface tells you how many of these samples were blocked in the internal method call; e.g. 0.01 means only 1 out of 100 samples was stuck. The mapping from ratio to status is as follows:
OK: 0 <= Ratio <= 0.10
LOW: 0.10 < Ratio <= 0.5
HIGH: 0.5 < Ratio <= 1
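The following is a rough sketch of the idea, not Flink's actual implementation: sample a task thread's stack trace repeatedly, count how many samples are blocked in the internal (buffer-requesting) method, and map the resulting ratio to the statuses above. The method names and the string matched against are assumptions for illustration only:

// Sketch of stack-trace-based back pressure detection (illustrative, not Flink internals).
public class BackPressureSampler {

    enum Status { OK, LOW, HIGH }

    // Sample the given thread `numSamples` times with `delayMs` between samples
    // and return the fraction of samples blocked in the "internal" method.
    static double sampleBlockedRatio(Thread task, int numSamples, long delayMs)
            throws InterruptedException {
        int blocked = 0;
        for (int i = 0; i < numSamples; i++) {
            StackTraceElement[] trace = task.getStackTrace();
            if (isBlockedInBufferRequest(trace)) {
                blocked++;
            }
            Thread.sleep(delayMs);
        }
        return (double) blocked / numSamples;
    }

    // Hypothetical check: stands in for "stuck requesting an output buffer".
    static boolean isBlockedInBufferRequest(StackTraceElement[] trace) {
        for (StackTraceElement frame : trace) {
            if (frame.getMethodName().contains("requestBuffer")) {
                return true;
            }
        }
        return false;
    }

    // Map the ratio to the status shown in the web UI (thresholds from the text above).
    static Status toStatus(double ratio) {
        if (ratio <= 0.10) return Status.OK;
        if (ratio <= 0.5)  return Status.LOW;
        return Status.HIGH;
    }
}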

To avoid overloading the TaskManagers with stack trace samples, the sample statistics are only refreshed every 60 seconds.

Configuration

The JobManager's sampling can be configured with the following parameters:

web.backpressure.refresh-interval: time after which the available statistics are discarded and refreshed (default: 60000, i.e. 1 minute).

web.backpressure.num-samples: number of stack trace samples used to determine back pressure (default: 100).

web.backpressure.delay-between-samples: delay between stack trace samples (default: 50, i.e. 50 ms).
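For example, these keys would go into the cluster's flink-conf.yaml; the values below simply restate the defaults listed above:

# Back pressure sampling settings in flink-conf.yaml (values shown are the defaults)
web.backpressure.refresh-interval: 60000
web.backpressure.num-samples: 100
web.backpressure.delay-between-samples: 50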

3. Web UI Display
In the job view of the Flink Web UI, you can find the Back Pressure tab.

Sampling in progress
This means the JobManager is triggering stack trace sampling of the running tasks. With the default configuration (100 samples, 50 ms apart), one round of sampling takes roughly 100 × 50 ms ≈ 5 seconds.

Back pressure status

Figure: normal operation

Figure: back pressure present

4. Comparison with Spark Streaming
Spark Streaming's back pressure was introduced in version 1.5. Before that, the only option was to limit the maximum consumption rate. The drawback of such a static rate limit is obvious: if the downstream processing capacity exceeds the limit, resources are wasted. Moreover, each Spark Streaming job has to be load-tested to estimate a suitable limit, which is relatively costly.

Since version 1.5, back pressure has been available to adjust the ingestion rate automatically. It listens to the onBatchCompleted event of every batch and, based on processingDelay, schedulingDelay, the number of records in the current batch and the batch's processing end time, estimates a new rate. That rate is used to update the maximum number of records the stream may process per second, and it is continually adjusted to the data volume so that Spark Streaming runs smoothly.
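For comparison, enabling Spark Streaming's back pressure is a matter of configuration. The sketch below shows the commonly used settings on a SparkConf; the application name and the optional initial-rate value are illustrative:

import org.apache.spark.SparkConf;

public class SparkBackPressureConfig {
    public static void main(String[] args) {
        // Enable Spark Streaming back pressure (available since Spark 1.5).
        SparkConf conf = new SparkConf()
                .setAppName("backpressure-demo")
                .set("spark.streaming.backpressure.enabled", "true")
                // Optional: cap the rate before the estimator produces its first estimate.
                .set("spark.streaming.backpressure.initialRate", "1000");
    }
}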

In contrast, Spark Streaming's back pressure mechanism is relatively simple: it mainly controls the upstream ingestion rate based on how the downstream batches execute. Flink's mechanism is different: back pressure is determined by sampling task stack traces over a period of time and monitoring the ratio of blocked samples.
