Pitfall report | Flink event-time semantics and severely out-of-order data

❝ This article describes in detail a data-loss problem and its solution: an upstream flink task using processing-time semantics fails, restarts, and consumes a large backlog of upstream data, so the data it outputs downstream is severely out of order; the downstream flink task, which uses event-time semantics, then loses a large amount of data.

This article is divided into the following sections:

  • "1. The application scenario of stepping on the pit this time"

  • "2. Analysis of lost count failures in application scenarios"

  • "3. Faults to be repaired"

  • "4. Solution and Principle of Lost Count Failure"

  • "5. Summary"

Application scenario

The application scenario is as follows:

  • "Flink task A"  uses the semantics of "processing time" to filter output and adds xx detail data to  "Kafka Y"

  • ``flink task B''  uses the semantics of ``event time'' to consume  ``Kafka Y''  as a window aggregation operation to produce minute-level aggregation indicators to  ``Kafka Z''

  • "Kafka Z" is  imported to  "Druid"  in real time for real-time OLAP analysis, and displayed on the BI application Kanban

Data-loss failure analysis

Here is a brief description of this production failure. The full fault-tracing chain is as follows:

Failure 1:

  • An alarm was received reporting that the input traffic of "flink task A" had dropped to 0

  • Investigation located a failed operator in "flink task A" that caused the whole job to get stuck

  • This led to a large backlog of data in "Kafka X", the upstream of "flink task A"

  • After "flink task A" was restarted, it consumed the large backlog in the upstream "Kafka X" and the task returned to normal

Failure 1 then triggered downstream failure 2:

  • Because "flink task A" processes data with processing-time semantics and contains filtering and keyBy windowing logic, consuming the large backlog after the restart (together with the rebalance before the sink) makes the server_timestamp values in each partition of the downstream "Kafka Y" severely out of order

  • The downstream "flink task B" consumes "Kafka Y" with event-time semantics, using the server_timestamp field in the data as the event-time timestamp

  • After "flink task B" consumes this severely out-of-order data, a large amount of it arrives far behind the watermark and is dropped during window aggregation (a sketch of the original event-time job follows this list)

  • The report finally displayed in the BI application is therefore missing data
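
The original event-time aggregation in "flink task B" presumably looked something like the query below (using the rowtime attribute from the source sketch above); the exact query is an assumption. The point is that records whose server_timestamp is hours behind the current watermark fall into windows that have already fired, so they are silently discarded.

-- Hypothetical event-time version of the minute-level aggregation; severely late
-- records fall into already-closed windows and are dropped.
SELECT
  TUMBLE_START(rowtime, INTERVAL '1' MINUTE) AS window_start,
  count(id) AS id_cnt,
  sum(duration) AS duration_sum
FROM
  source_table
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' MINUTE)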

Failure points to be fixed

  • 1. The stability failure of "flink task A"; its solution is not covered in this article

  • 2. The data-loss failure caused by "flink task B" consuming out-of-order upstream data; its solution is described below

Solution and principle

Data-loss failure solution

The solution takes the downstream "flink task B" as its entry point, and the SQL for "flink task B" is given directly below. A Java (DataStream API) implementation can follow the same approach; the underlying principle is identical and is explained afterwards.

SELECT
  to_unix_timestamp(server_timestamp / bucket * bucket) AS `timestamp`, -- format the bucket start back into the original event timestamp
  count(id) AS id_cnt,
  sum(duration) AS duration_sum
FROM
  source_table
GROUP BY
  TUMBLE(proctime, INTERVAL '1' MINUTE),
  server_timestamp / bucket -- bucket by event time: rows whose event timestamps fall in the same range (e.g. 1 minute) go into the same bucket
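
To make this concrete, here is one way the query could be wired to "Kafka Z" with the bucket placeholder filled in as 60000 ms (1 minute). The sink table name, its connector options, and the plain epoch-millisecond timestamp format are assumptions for illustration; source_table with its proctime attribute is the source sketched earlier.

-- Hypothetical sink table for "Kafka Z" holding the minute-level aggregates.
CREATE TABLE sink_table (
  `timestamp`  BIGINT,    -- event timestamp (bucket start) carried through to the BI layer
  id_cnt       BIGINT,
  duration_sum BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'kafka_z',    -- placeholder topic name
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

INSERT INTO sink_table
SELECT
  (server_timestamp / 60000) * 60000 AS `timestamp`, -- bucket start in epoch milliseconds
  count(id) AS id_cnt,
  sum(duration) AS duration_sum
FROM
  source_table
GROUP BY
  TUMBLE(proctime, INTERVAL '1' MINUTE),
  server_timestamp / 60000;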

Solution principle

First, let's clarify an unavoidable fact: unless the watermark is configured with an extremely large allowed lateness, whenever the upstream uses processing-time semantics and the downstream uses event-time semantics, the data-loss failure described above will inevitably occur once the upstream fails, restarts, and consumes a large backlog of data in a short time.

The downstream consumer still needs the BI report to display data keyed by the event timestamp. On the premise that the whole link uses processing-time semantics, which guarantees that no data is lost, the solution is to make the aggregation ultimately produce data keyed by the event timestamp.

The final solution is as follows: the entire link uses processing-time semantics, and the window computation also uses processing time, but the timestamps carried in the output data are event timestamps. In the failure scenario, the event timestamps of the data inside a single one-minute processing-time window may differ by several hours, but within that window the aggregation can still assign each record to the corresponding event-time bucket according to its event timestamp, and the downstream BI application simply displays this event timestamp.

Note: the bucket in the SQL needs to be set according to the specific usage scenario. If it is set too small, for example a 1-minute processing-time window combined with a bucket of 60000 ms (1 minute), then in the non-failure case the server_timestamp values of all data in one window are very likely to be concentrated in some two minutes, so the data is divided into only two buckets, which causes serious data skew.
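
As a concrete illustration of the bucketing arithmetic, using the sample data below and assuming server_timestamp is an epoch-millisecond value (taken as UTC here purely for illustration) and bucket = 60000:

-- 2020-09-01 21:14:38 UTC -> server_timestamp = 1598994878000 ms
-- bucket key:   1598994878000 / 60000 = 26649914            (integer division)
-- bucket start: 26649914 * 60000      = 1598994840000 ms -> 2020-09-01 21:14:00
-- 21:14:50 maps to the same bucket; 21:25:38 maps to a different one.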

Sample input data

To simulate the failure described above, suppose the data entering one window of "flink task B" is as follows.

server_timestamp     id  duration
2020/9/01 21:14:38   1   300
2020/9/01 21:14:50   1   500
2020/9/01 21:25:38   2   600
2020/9/01 21:25:38   3   900
2020/9/01 21:25:38   2   800

Sample output data

After the SQL in the solution above is applied, the output data is as follows; this resolves this type of data-loss failure.

timestamp            id_cnt  duration_sum
2020/9/01 21:14:00   2       800
2020/9/01 21:25:00   3       2300
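
Tracing the aggregation by hand (with bucket = 60000, rows are grouped by the minute of their server_timestamp):

-- bucket 2020/9/01 21:14:00 <- (21:14:38, id 1, 300) and (21:14:50, id 1, 500)
--   id_cnt = 2, duration_sum = 300 + 500 = 800
-- bucket 2020/9/01 21:25:00 <- (21:25:38, id 2, 600), (21:25:38, id 3, 900), (21:25:38, id 2, 800)
--   id_cnt = 3, duration_sum = 600 + 900 + 800 = 2300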

Summary

This article analyzed, for this Flink application:

  • "When the upstream flink task that uses processing time semantics fails, restarts and consumes a large amount of backlog data, and outputs to downstream data, the disorder is particularly serious, the downstream uses event time semantics to encounter a large number of lost count problems"

  • "On the premise that the entire link is the processing time semantics, the output data timestamp is the event timestamp to solve the above problems"

  • "A sample of the solution to the missing count failure is given in the sql code"

 

Origin: blog.csdn.net/bieber007/article/details/108691979