Time Semantics and Watermark of Flink Streaming Computing

1 Description

Flink's stream processing distinguishes three notions of time:
Event Time: the time at which the event was created, usually carried as a timestamp inside the event. For example, each collected log line records its own generation time. Flink reads the event timestamp through a timestamp assigner.
Ingestion Time: the time at which the data enters Flink.
Processing Time: the local system time of the machine running each operator that performs a time-based operation. In the Flink versions this article targets (before 1.12), Processing Time is the default time characteristic.
For example, suppose a log enters Flink at 2017-11-12 10:00:00.123 and reaches the window operator at system time 2017-11-12 10:00:01.234. The content of the log is as follows:

2017-11-02 18:37:15.624 INFO Fail over to rm2

If the business wants to count the fault logs within 1 minute, which time is the most meaningful? Event time, because the statistics should be based on when each log was generated.
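As a rough illustration of counting by event time, the sketch below parses the timestamp at the start of each log line and groups lines into 1-minute buckets. It is plain Scala, not Flink code; the object name `EventTimeBuckets`, its helper functions, the UTC assumption, and the two extra log lines are all hypothetical.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object EventTimeBuckets {
  // Timestamp layout of the sample log line; UTC is an assumption of this sketch.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Event time in epoch milliseconds, parsed from the first 23 characters of the line.
  def eventTimeMillis(line: String): Long =
    LocalDateTime.parse(line.take(23), fmt).toInstant(ZoneOffset.UTC).toEpochMilli

  // Start of the 1-minute bucket that a timestamp falls into.
  def minuteBucket(ts: Long): Long = ts - (ts % 60000L)

  // Number of log lines per 1-minute bucket of event time.
  def countPerMinute(lines: Seq[String]): Map[Long, Int] =
    lines.groupBy(l => minuteBucket(eventTimeMillis(l))).map { case (k, v) => k -> v.size }

  def main(args: Array[String]): Unit = {
    val logs = Seq(
      "2017-11-02 18:37:15.624 INFO Fail over to rm2",
      "2017-11-02 18:37:48.001 INFO Fail over to rm1", // hypothetical extra line
      "2017-11-02 18:38:02.500 INFO Fail over to rm2"  // hypothetical extra line
    )
    countPerMinute(logs).toSeq.sortBy(_._1).foreach(println)
  }
}
```

The bucket is determined only by the timestamp inside the log line, so the result does not depend on when the line reaches the program.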

2 Introducing eventTime

In Flink stream processing, most business logic uses eventTime. ProcessingTime or IngestionTime is generally a fallback for cases where eventTime cannot be used.
To use EventTime, set the EventTime time characteristic on the environment as follows:

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Add time characteristics to each stream created by env from the moment of calling

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

3 Watermark

In stream processing, an event passes through several stages, from its generation, through the source, to the operator, and each stage takes time. In most cases the data reaches the operator in the order in which the events were generated, but out-of-order arrival caused by the network, distribution, and similar factors cannot be ruled out. Out-of-order means that the order in which Flink receives events does not strictly follow their event-time order.
This creates a problem. Once data can arrive out of order, a window driven purely by eventTime can never be sure that all of its data has arrived, yet it cannot wait indefinitely. There must be a mechanism guaranteeing that, after a specific time, the window is triggered and computed. That mechanism is Watermark.
 Watermark is a mechanism for measuring the progress of Event Time.
 Watermark is used to handle out-of-order events, and correct handling is usually achieved by combining the Watermark mechanism with windows.
 A Watermark in the data stream indicates that all data with a timestamp smaller than the Watermark has arrived. Window execution is therefore triggered by the Watermark.
 Watermark can be understood as a delayed trigger. Given a Watermark delay t, the system tracks the maximum event time maxEventTime among the data that has arrived and assumes that all data with eventTime less than maxEventTime - t has arrived. If a window's end time is at or before maxEventTime - t, that window is triggered.
 [Figure: Watermarks in an ordered stream (Watermark delay set to 0)]
 [Figure: Watermarks in an out-of-order stream (Watermark delay set to 2)]
When Flink receives data, it generates Watermarks according to a fixed rule: the Watermark equals the maxEventTime of all data that has arrived so far minus the delay. In other words, the Watermark is derived from the timestamps carried by the data. Once the Watermark reaches the end time of a window that has not yet fired, that window is triggered. Because event time is carried by the data, if no new data arrives at runtime, the windows that have not fired will never fire.
In the out-of-order example, the maximum allowed lateness is 2s, so an event with a timestamp of 7s yields a Watermark of 5s, and an event with a timestamp of 12s yields a Watermark of 10s. If window 1 covers 1s to 5s and window 2 covers 6s to 10s, then the Watermark produced when the 7s event arrives triggers window 1, and the Watermark produced when the 12s event arrives triggers window 2.
A Watermark acts as the "closing time" that triggers the preceding window. Once the window closes, all data that arrived with a timestamp inside the window's range is included in its result.
As long as the Watermark has not reached the window's end, the window will not close, no matter how far real (wall-clock) time advances.
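The rule described above (Watermark = maxEventTime - delay) can be traced with a small plain-Scala sketch. It does not use Flink; the names `WatermarkSketch` and `watermarks` and the sample event times are illustrative assumptions.

```scala
object WatermarkSketch {
  // Watermark after each event: the largest event time seen so far minus the allowed delay.
  def watermarks(eventTimes: Seq[Long], delay: Long): Seq[Long] =
    eventTimes
      .scanLeft(Long.MinValue)((maxTs, ts) => math.max(maxTs, ts)) // running maximum
      .tail                                                       // drop the initial seed
      .map(_ - delay)

  def main(args: Array[String]): Unit = {
    // Event times in seconds, arriving out of order (hypothetical data).
    val events = Seq(1L, 4L, 3L, 7L, 6L, 12L)
    events.zip(watermarks(events, 2L)).foreach { case (ts, wm) =>
      println(s"event=$ts watermark=$wm")
    }
  }
}
```

With a delay of 2, the events at 7 and 12 produce watermarks 5 and 10, while the late events at 3 and 6 leave the watermark unchanged, because it only follows the running maximum.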

3 Introduction of Watermark

Introducing a watermark is straightforward. For out-of-order data, the most common usage is as follows:

dataStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.milliseconds(1000)) {
    override def extractTimestamp(element: SensorReading): Long = {
      element.timestamp * 1000
    }
  }
)

To use Event Time, the data source must provide a timestamp; otherwise the program cannot know the event time of an event. (If the data has no timestamp, only Processing Time can be used.)
The example above creates a class that looks somewhat complicated, but all it does is implement the interface for assigning timestamps. Flink exposes the TimestampAssigner interface so that we can customize how timestamps are extracted from event data.

val env = StreamExecutionEnvironment.getExecutionEnvironment
// From this call onward, every stream created by env uses the event-time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val readings: DataStream[SensorReading] = env
  .addSource(new SensorSource)
  .assignTimestampsAndWatermarks(new MyAssigner())

MyAssigner can implement one of two interfaces:
 AssignerWithPeriodicWatermarks
 AssignerWithPunctuatedWatermarks
Both interfaces extend TimestampAssigner.

3.1 Assigner with periodic watermarks

Periodic watermark generation: the system periodically inserts watermarks into the stream (a watermark is itself a special event). The default period is 200 milliseconds, and it can be changed with ExecutionConfig.setAutoWatermarkInterval().

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Emit a watermark every 5 seconds
env.getConfig.setAutoWatermarkInterval(5000)

Watermark generation logic: every 5 seconds, Flink calls the getCurrentWatermark() method of AssignerWithPeriodicWatermarks. If the returned watermark has a timestamp greater than that of the previous watermark, a new watermark is inserted into the stream. This check keeps the watermark monotonically increasing. If the returned timestamp is less than or equal to that of the previous watermark, no new watermark is generated.
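The monotonicity check described above can be modelled without Flink. In the sketch below, `BoundedAssigner` and `PeriodicEmitter` are hypothetical stand-ins for the assigner and for the runtime's periodic caller; only the method names `extractTimestamp` and `getCurrentWatermark` mirror the interface mentioned in the text, and their signatures are simplified.

```scala
object PeriodicWatermarkSketch {
  // Hypothetical stand-in for the assigner: tracks the largest timestamp seen so far.
  final class BoundedAssigner(bound: Long) {
    // Seeded so that maxTs - bound does not underflow before the first element arrives.
    private var maxTs: Long = Long.MinValue + bound
    def extractTimestamp(ts: Long): Long = { maxTs = math.max(maxTs, ts); ts }
    def getCurrentWatermark: Long = maxTs - bound
  }

  // Hypothetical stand-in for the periodic caller: emits only when the watermark advanced.
  final class PeriodicEmitter(assigner: BoundedAssigner) {
    private var last: Long = Long.MinValue
    def onTimer(): Option[Long] = {
      val candidate = assigner.getCurrentWatermark
      if (candidate > last) { last = candidate; Some(candidate) } else None
    }
  }

  // Feed each batch of timestamps, then fire one timer tick; collect what each tick emits.
  def simulate(batches: Seq[Seq[Long]], bound: Long): Seq[Option[Long]] = {
    val assigner = new BoundedAssigner(bound)
    val emitter = new PeriodicEmitter(assigner)
    batches.map { batch =>
      batch.foreach(ts => assigner.extractTimestamp(ts))
      emitter.onTimer()
    }
  }

  def main(args: Array[String]): Unit = {
    // Four timer ticks: no data, one record, an older record, a newer record.
    val ticks = simulate(Seq(Seq.empty[Long], Seq(5000L), Seq(3000L), Seq(9000L)), 1000L)
    ticks.foreach(println)
  }
}
```

A tick that follows no data, or only older data, emits nothing, which is how the watermark stays monotonically increasing.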

3.2 Assigner with punctuated watermarks

Punctuated watermarks are generated irregularly. Unlike the periodic approach, generation is not tied to a fixed interval; instead each record can be inspected and a watermark emitted as needed. As an example, the following code inserts a watermark only for records from sensor_1:

class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[SensorReading] {
  val bound: Long = 60 * 1000

  override def checkAndGetNextWatermark(r: SensorReading, extractedTS: Long): Watermark = {
    if (r.id == "sensor_1") {
      new Watermark(extractedTS - bound)
    } else {
      null
    }
  }

  override def extractTimestamp(r: SensorReading, previousTS: Long): Long = {
    r.timestamp
  }
}

4 The use of EventTime in windows

4.1 Tumbling window (TumblingEventTimeWindows)

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

object EventTimeDemo02 {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Use event time as the time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Set parallelism to 1 to make the output easier to observe
    env.setParallelism(1)
    // Interval (ms) at which watermarks are generated automatically
    env.getConfig.setAutoWatermarkInterval(200)

    val dataStream: DataStream[String] = env.socketTextStream("192.168.199.101", 7777)
    // Parse each comma-separated line into a tuple
    val mapStream = dataStream.map(
      data => {
        val strings = data.split(",")
        (strings(0), strings(1).toLong, strings(2).toDouble)
      }
    )

    /* Assign watermarks.
     * The argument of BoundedOutOfOrdernessTimestampExtractor is the delay:
     * watermark = event time - delay
     */
    val waterMarkStream: DataStream[(String, Long, Double)] = mapStream.
      assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Double)](Time.seconds(2)) {
        override def extractTimestamp(element: (String, Long, Double)): Long = {
          element._2
        }
      })

    // Group by the first tuple field
    val keyedStream: KeyedStream[(String, Long, Double), Tuple] = waterMarkStream.keyBy(0)

    // Tumbling window of 5 seconds
    val timeWindowStream: WindowedStream[(String, Long, Double), Tuple, TimeWindow] = keyedStream
      .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    // Keep the maximum value per key
    val result: DataStream[(String, Long, Double)] = timeWindowStream.reduce((r1, r2) => (r1._1, r1._2.max(r2._2), r1._3.max(r2._3)))
    result.print("result")
    env.execute()
  }
}

Here the delay is 2 seconds and the tumbling window size is 5 seconds. The first window (0 to 5 seconds) therefore closes and fires once the watermark reaches the window end of 5 seconds, that is, once an event with event time >= 7 seconds (window size 5 seconds + delay 2 seconds) arrives.
While the event time of the input stays below 7 seconds, no calculation is triggered.
When the event time of the input is >= 7 seconds, the first window fires. In the original run, the maximum value of sensor_1 is 43.92, entered at the 1st second. Why is the maximum for sensor_5 not the 62.53 entered at the 5th second? Because that record has an event time of 5 seconds, it is assigned to the second tumbling window, [5,10). Closing the second window requires an event time >= 12 seconds.
Note: data with an event time inside the first 5 seconds that is sent at this point is no longer accepted, because the first window has already closed. To handle such data, use allowedLateness() to keep the window open for late records, or use sideOutputLateData to route late records to a side output stream and merge them back into the main stream later.
After three more records are entered, the calculation fires when the event time reaches 12 seconds (watermark 10 seconds). Window 2 then shows a maximum of 62.53 for sensor_5.
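The behaviour walked through above can be replayed in plain Scala for a single key. This is a simplified model rather than Flink code: `TumblingSketch` and `run` are hypothetical names, the sample records other than the values 43.92 and 62.53 are invented, and the model assumes Flink's convention that a window [start, end) fires once the watermark reaches end - 1.

```scala
object TumblingSketch {
  val size = 5000L  // window size in ms, as in the example above
  val delay = 2000L // watermark delay in ms, as in the example above

  // Start of the tumbling window a timestamp belongs to (offset 0, non-negative timestamps).
  def windowStart(ts: Long): Long = ts - (ts % size)

  // Replays (timestamp, value) records in arrival order.
  // Returns the fired windows as (window start, max value) and the late records that were dropped.
  def run(events: Seq[(Long, Double)]): (Seq[(Long, Double)], Seq[(Long, Double)]) = {
    var open = Map.empty[Long, Double]
    var fired = Vector.empty[(Long, Double)]
    var late = Vector.empty[(Long, Double)]
    var maxTs = Long.MinValue + delay
    var watermark = Long.MinValue
    for ((ts, value) <- events) {
      val start = windowStart(ts)
      if (start + size - 1 <= watermark) late :+= ((ts, value)) // window already closed
      else open = open.updated(start, open.get(start).fold(value)(math.max(_, value)))
      // Advance the watermark, then fire every window whose end it has reached.
      maxTs = math.max(maxTs, ts)
      watermark = maxTs - delay
      val (ready, pending) = open.partition { case (s, _) => s + size - 1 <= watermark }
      fired ++= ready.toSeq.sortBy(_._1)
      open = pending
    }
    (fired, late)
  }

  def main(args: Array[String]): Unit = {
    val (fired, late) = run(Seq(
      (1000L, 43.92), (3000L, 40.0), (5000L, 62.53),
      (7000L, 10.0),  // watermark reaches 5000: window [0, 5000) fires
      (2000L, 99.0),  // belongs to the closed window: dropped as late
      (12000L, 5.0)   // watermark reaches 10000: window [5000, 10000) fires
    ))
    println(s"fired=$fired late=$late")
  }
}
```

The record at 5000 lands in the second window, so the first window reports 43.92 and the second reports 62.53, while the record at 2000 arrives after its window closed and is dropped.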

4.2 Sliding window (SlidingEventTimeWindows)

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Test watermarks with a sliding window
object EventTimeDemo03 {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // Use event time as the time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Set parallelism to 1 to make the output easier to observe
    env.setParallelism(1)
    // Interval (ms) at which watermarks are generated automatically
    env.getConfig.setAutoWatermarkInterval(200)

    val dataStream: DataStream[String] = env.socketTextStream("192.168.199.101", 7777)
    // Map each line to (id, timestamp, 1) so that records can be counted
    val mapStream: DataStream[(String, Long, Long)] = dataStream.map(
      data => {
        val strings = data.split(",")
        (strings(0), strings(1).toLong, 1L)
      }
    )

    /**
     * Assign watermarks with a delay of 2 seconds
     */
    val waterStream: DataStream[(String, Long, Long)] = mapStream.
      assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Long)](Time.seconds(2)) {
        override def extractTimestamp(element: (String, Long, Long)): Long = {
          element._2
        }
      })

    val keyedStream: KeyedStream[(String, Long, Long), Tuple] = waterStream.keyBy(0)

    // Sliding window: size 10 seconds, slide 5 seconds
    val slideWindowStream: WindowedStream[(String, Long, Long), Tuple, TimeWindow] = keyedStream.
      window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))

    // Keep the latest timestamp and sum the counts per key
    val resultStream: DataStream[(String, Long, Long)] = slideWindowStream.reduce((r1, r2) => (r1._1, r1._2.max(r2._2), r1._3 + r2._3))

    resultStream.print("watermark slide")
    env.execute()
  }

}

The sliding window size is 10 seconds, the slide is 5 seconds, and the delay is 2 seconds. The first window calculation therefore fires when the event time is >= slide 5 seconds + delay 2 seconds = 7 seconds. Until that condition is met, no window fires.
The next window calculation fires at an event time >= 12 seconds. Because each window spans 10 seconds but slides by only 5 seconds, it also contains the data from the preceding 5-second slice. In the original run, after the calculation fires at 12 seconds the window holds 8 records, 4 of which were already part of the previous window.
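The overlap between sliding windows can be made concrete with a small plain-Scala sketch. `SlidingSketch`, `windowStarts`, and `countsPerWindow` are hypothetical names, the eight timestamps are invented, and the assignment rule (offset 0, non-negative timestamps) is a simplified model of how sliding windows are assigned.

```scala
object SlidingSketch {
  // Start times of every sliding window that contains ts (offset 0, non-negative timestamps).
  def windowStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - (ts % slide) // the latest window that starts at or before ts
    Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > ts - size).toList
  }

  // Number of records assigned to each window, keyed by window start.
  def countsPerWindow(timestamps: Seq[Long], size: Long, slide: Long): Map[Long, Int] =
    timestamps
      .flatMap(ts => windowStarts(ts, size, slide))
      .groupBy(identity)
      .map { case (start, xs) => start -> xs.size }

  def main(args: Array[String]): Unit = {
    // Four records in the first 5 seconds and four in the next 5 seconds (hypothetical data, ms).
    val ts = Seq(1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 7000L, 8000L)
    countsPerWindow(ts, 10000L, 5000L).toSeq.sortBy(_._1).foreach(println)
  }
}
```

With a size of 10 seconds and a slide of 5 seconds, every record belongs to two windows, which is why the window starting at 0 counts 8 records while its neighbours count 4 each.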

4.3 Session window (EventTimeSessionWindows)

A session window fires when the EventTime difference between two adjacent records exceeds the specified gap. With a Watermark delay added, the trigger is postponed accordingly: the window fires only once the watermark, which lags by the delay, passes the end of the session.

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/**
 * Test watermarks with a session window
 */
object EventTimeDemo04 {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Use event time as the time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Set parallelism to 1 to make the output easier to observe
    env.setParallelism(1)
    // Interval (ms) at which watermarks are generated automatically
    env.getConfig.setAutoWatermarkInterval(200)

    val dataStream: DataStream[String] = env.socketTextStream("192.168.199.101", 7777)
    // Parse each comma-separated line into a tuple
    val mapStream: DataStream[(String, Long, Double)] = dataStream.map(
      data => {
        val strings = data.split(",")
        (strings(0), strings(1).toLong, strings(2).toDouble)
      }
    )

    // Assign watermarks with a delay of 1 second
    val waterStream: DataStream[(String, Long, Double)] = mapStream.
      assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Double)](Time.seconds(1)) {
        override def extractTimestamp(element: (String, Long, Double)): Long = {
          element._2
        }
      })

    // Session window with a gap of 1 second
    val windowsStream: WindowedStream[(String, Long, Double), Tuple, TimeWindow] = waterStream.
      keyBy(0).window(EventTimeSessionWindows.withGap(Time.seconds(1)))

    // Keep the latest timestamp and sum the values per key
    val result = windowsStream.reduce((r1, r2) => (r1._1, r1._2.max(r2._2), r1._3 + r2._3))

    result.print("water session")
    env.execute()
  }

}

While the event-time interval between adjacent input records stays below the session gap of 1 second + the delay of 1 second, no window calculation is triggered.
When a record arrives whose event time is at least 2 seconds (session gap 1 second + delay 1 second) after the previous record, the window calculation fires. In the original run, the result 105.4 is the sum of the values of all records before event time 4200. Note that the record with event time 4200 enters the next session window, not the current one.
In summary, adjacent records whose event times are within 1 second of each other fall into the same session window, while a record more than 1 second after its predecessor starts the next session window. A session window is computed only once a record arrives at least 2 seconds (session gap 1 second + delay 1 second) after the window's last record.
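The grouping rule above can be sketched in plain Scala. This is a simplified, single-key model and not Flink code: `SessionSketch`, `sessions`, and `closingEventTime` are hypothetical names, the timestamps other than 4200 are invented, and records exactly one gap apart are assumed to stay in the same session.

```scala
object SessionSketch {
  // Split event times into sessions: a new session starts when a record is
  // more than `gap` after the previous record.
  def sessions(timestamps: Seq[Long], gap: Long): List[List[Long]] =
    timestamps.sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, ts) if ts - current.head <= gap => (ts :: current) :: done
      case (acc, ts) => List(ts) :: acc
    }.map(_.reverse).reverse

  // Following the rule stated above, a record at or after this event time closes
  // a session whose last record has event time lastTs.
  def closingEventTime(lastTs: Long, gap: Long, delay: Long): Long = lastTs + gap + delay

  def main(args: Array[String]): Unit = {
    val ts = Seq(1000L, 1500L, 2200L, 4200L, 4800L) // hypothetical event times in ms
    sessions(ts, 1000L).foreach(println)
    println(closingEventTime(2200L, 1000L, 1000L))
  }
}
```

In this sample the record at 4200 both closes the first session (2200 + 1000 + 1000) and opens the second one, which mirrors the note that the triggering record belongs to the next session window.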

Origin blog.csdn.net/Keyuchen_01/article/details/118498889