2.5 Apache Flink EventTime and Window

1. Introducing EventTime

In Flink stream processing, the vast majority of jobs use EventTime. ProcessingTime or IngestionTime is generally used only when EventTime is not available.

To use EventTime, set the EventTime time characteristic on the execution environment, as follows:

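import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// From this call onward, every stream created from env uses the EventTime characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)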

2. Watermark

2.1 Basic Concepts

In stream processing, an event is generated, passes through the Source, and then flows to the operators, and this takes time. In most cases the data reaches the operators in the order in which the events were generated, but network delays, back pressure and other factors can disturb that order. "Out of order" means that the sequence in which Flink receives events does not strictly follow the order of their EventTime.

[Figure: out-of-order data]

This raises a problem: once data can arrive out of order, a window that runs purely on EventTime cannot know whether all of its data has arrived, yet it cannot wait indefinitely either. There must be a mechanism that guarantees the window is triggered and computed after a certain time. That mechanism is the Watermark.

A Watermark is a mechanism for measuring the progress of EventTime. It can be regarded as a hidden attribute of the data: the Watermark travels through the stream together with the data.

Watermarks are used to handle out-of-order events. Correct handling of out-of-order events is usually achieved by combining Watermarks with the window mechanism.

A Watermark in the data stream indicates that all data with a timestamp less than the Watermark has arrived. Window execution is therefore triggered by the Watermark.

A Watermark can be understood as a delayed trigger mechanism. We set a delay t for the Watermark; the system tracks the largest EventTime that has arrived so far, maxEventTime, and assumes that all data with an EventTime less than maxEventTime - t has arrived. If a window's end time is no later than maxEventTime - t, that window is triggered.

Watermarks in an ordered stream are shown below (Watermark delay set to 0):

[Figure: Watermarks in an ordered stream]

Watermarks in an out-of-order stream are shown below (Watermark delay set to 2):

[Figure: Watermarks in an out-of-order stream]

Every time Flink receives a piece of data it produces a Watermark, equal to the maxEventTime of all data received so far minus the delay. In other words, the Watermark is carried by the data. Once the Watermark carried by a piece of data reaches or passes the end time of a window that has not yet been triggered, the corresponding window is executed. Because the Watermark is carried by the data, if no new data arrives while the job is running, windows that have not yet been triggered will never be triggered.

In the figure above, the maximum allowed arrival delay is set to 2s, so the Watermark for the event with timestamp 7s is 5s and the Watermark for the event with timestamp 12s is 10s. If window 1 covers 1s to 5s and window 2 covers 6s to 10s, then the Watermark produced when the event with timestamp 7s arrives triggers window 1, and the Watermark produced when the event with timestamp 12s arrives triggers window 2.
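The arithmetic in this example can be replayed with a few lines of plain Scala. This is only a sketch of the rule "Watermark = maxEventTime - delay", not Flink code: the object name, the `Window` case class and the arrival order used below are assumptions made for illustration, and window end times are treated as inclusive seconds to match the example.

```scala
// Plain-Scala sketch (no Flink dependency) of watermark = maxEventTime - delay.
object WatermarkSketch {
  // Hypothetical window type; `end` is the last second covered, as in the example.
  final case class Window(name: String, start: Long, end: Long)

  /** Watermark after each event: the largest event time seen so far minus the delay. */
  def watermarks(eventTimes: Seq[Long], delay: Long): Seq[Long] =
    eventTimes
      .scanLeft(Long.MinValue)((maxSeen, t) => math.max(maxSeen, t))
      .tail                 // drop the initial Long.MinValue seed
      .map(_ - delay)

  /** For each window, the event time of the first event whose watermark reaches the window end. */
  def firedBy(eventTimes: Seq[Long], delay: Long, windows: Seq[Window]): Map[String, Long] = {
    val withWatermark = eventTimes.zip(watermarks(eventTimes, delay))
    windows.flatMap { w =>
      withWatermark.collectFirst { case (t, wm) if wm >= w.end => w.name -> t }
    }.toMap
  }
}
```

With an assumed arrival order of 1, 4, 3, 7, 9, 12 (seconds) and a delay of 2, `watermarks` yields -1, 2, 2, 5, 7, 10, so window 1 (ending at 5) is fired by the event at 7s and window 2 (ending at 10) by the event at 12s, matching the description above.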

2.2 Introducing Watermark

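import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// From this call onward, every stream created from env uses the EventTime characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Allow events to arrive up to 200 ms out of order
val stream = env.readTextFile("eventTest.txt").assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(200)) {
    override def extractTimestamp(t: String): Long = {
      // EventTime is the time the log line was generated; parse it from the first field of the line
      t.split(" ")(0).toLong
    }
  })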

3. EventTimeWindow API

With an EventTime window, all windows are laid out on the EventTime axis: starting from the initial EventTime, one window is created for every interval of the window size. If the window size is 3 seconds, 1 minute is divided into the following windows:

[00:00:00,00:00:03)
[00:00:03,00:00:06)
...
[00:00:57,00:01:00)

If the window size is 10 seconds, 1 minute is divided into the following windows:

[00:00:00,00:00:10)
[00:00:10,00:00:20)
...
[00:00:50,00:01:00)

Note that a window is closed on the left and open on the right, in the form [window_start_time, window_end_time).
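To make the left-closed, right-open layout concrete, here is a minimal plain-Scala sketch of how a timestamp maps to its window when windows are aligned to multiples of the window size. The object and method names are hypothetical, timestamps are assumed to be in milliseconds, and no window offset is considered.

```scala
// Plain-Scala sketch: map a timestamp to its fixed-size window [start, end).
object TumblingAssignSketch {
  /** Start of the window containing `timestamp`, aligned to a multiple of `size`. */
  def windowStart(timestamp: Long, size: Long): Long =
    timestamp - java.lang.Math.floorMod(timestamp, size)

  /** The window as a (start, end) pair; start is inclusive, end is exclusive. */
  def window(timestamp: Long, size: Long): (Long, Long) = {
    val start = windowStart(timestamp, size)
    (start, start + size)
  }
}
```

For a 3-second window, timestamps 3000 and 5999 both map to (3000, 6000), while 6000 already belongs to (6000, 9000), which is exactly the boundary behaviour described above.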

Window boundaries do not depend on the data; they are defined by the system. Windows are divided according to the specified interval whether or not any data falls into them, and data whose EventTime lies within a window's range enters that window.

Windows are created continuously, and data belonging to a window's range is added to it as it arrives. Every window that has not been triggered waits for its trigger, and as long as it has not been triggered, data belonging to its range keeps being added to it. Once a window is triggered it stops accepting data, and data belonging to that window that arrives after it has been triggered is discarded.

A window is triggered and executed when both of the following conditions are met:

  • watermark >= window_end_time;
  • data exists in [window_start_time, window_end_time).
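
The two conditions can be written as a single predicate. The sketch below is plain Scala with hypothetical names; it mirrors the conditions as stated in this section and is not a copy of Flink's internal trigger code, whose boundary handling may differ in detail.

```scala
// Plain-Scala sketch of the two trigger conditions listed above.
object TriggerSketch {
  // Hypothetical window type: start is inclusive, end is exclusive.
  final case class Window(start: Long, end: Long)

  /** True when the watermark has reached the window end and the window holds at least one element. */
  def shouldFire(window: Window, watermark: Long, eventTimes: Seq[Long]): Boolean =
    watermark >= window.end &&
      eventTimes.exists(t => t >= window.start && t < window.end)
}
```

For example, a window [0, 10) containing an element at 3 fires at watermark 10 but not at watermark 9, and an element at exactly 10 does not count because it belongs to the next window.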

The following figure illustrates the relationship between Watermark, EventTime and Window.

3.1 Tumbling window (TumblingEventTimeWindows)

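import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Get the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Create the socket source
val stream = env.socketTextStream("localhost", 11111)

// Assign timestamps and watermarks (3 s delay), then map to (key, 1) and key the stream
val streamKeyBy = stream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(3000)) {
    override def extractTimestamp(element: String): Long = {
      // The first space-separated field is the event time
      val sysTime = element.split(" ")(0).toLong
      println(sysTime)
      sysTime
    }
  }).map(item => (item.split(" ")(1), 1)).keyBy(0)

// Apply a 10-second tumbling window
val streamWindow = streamKeyBy.window(TumblingEventTimeWindows.of(Time.seconds(10)))

// Aggregate: sum the counts per key
val streamReduce = streamWindow.reduce(
  (item1, item2) => (item1._1, item1._2 + item2._2)
)

// Print the aggregated result
streamReduce.print

// Run the job
env.execute("TumblingWindow")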

The result is computed per time window according to EventTime and does not depend on system time (including the rate at which input arrives).

3.2 Sliding window (SlidingEventTimeWindows)

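import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Get the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Create the socket source
val stream = env.socketTextStream("localhost", 11111)

// Assign timestamps and watermarks (no delay), then map to (key, 1) and key the stream
val streamKeyBy = stream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(0)) {
    override def extractTimestamp(element: String): Long = {
      // The first space-separated field is the event time
      val sysTime = element.split(" ")(0).toLong
      println(sysTime)
      sysTime
    }
  }).map(item => (item.split(" ")(1), 1)).keyBy(0)

// Apply a sliding window: 10 seconds long, sliding every 5 seconds
val streamWindow = streamKeyBy.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))

// Aggregate: sum the counts per key
val streamReduce = streamWindow.reduce(
  (item1, item2) => (item1._1, item1._2 + item2._2)
)

// Print the aggregated result
streamReduce.print

// Run the job
env.execute("SlidingWindow")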
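
With a window size of 10 seconds and a slide of 5 seconds, windows overlap, so each element belongs to more than one window. The plain-Scala sketch below lists the windows that contain a given timestamp; the names are hypothetical, timestamps are assumed to be in milliseconds, and windows are assumed to start at multiples of the slide.

```scala
// Plain-Scala sketch: which sliding windows [start, start + size) contain a timestamp?
object SlidingAssignSketch {
  def windows(timestamp: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Latest window start that is not after the timestamp
    val lastStart = timestamp - java.lang.Math.floorMod(timestamp, slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > timestamp - size) // window must still cover the timestamp
      .map(start => (start, start + size))
      .toList
  }
}
```

An element at 12000 ms therefore falls into both (10000, 20000) and (5000, 15000), which is why the same element contributes to two results in the job above.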

3.3 Session window (EventTimeSessionWindows)

When the EventTime difference between two adjacent elements exceeds the specified gap, execution is triggered. If a Watermark is added, then when execution is triggered, all windows that satisfy the gap condition and have not yet been triggered are executed at the same time.

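import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Get the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Create the socket source
val stream = env.socketTextStream("localhost", 11111)

// Assign timestamps and watermarks (no delay), then map to (key, 1) and key the stream
val streamKeyBy = stream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(0)) {
    override def extractTimestamp(element: String): Long = {
      // The first space-separated field is the event time
      val sysTime = element.split(" ")(0).toLong
      println(sysTime)
      sysTime
    }
  }).map(item => (item.split(" ")(1), 1)).keyBy(0)

// Apply a session window with a 5-second gap
val streamWindow = streamKeyBy.window(EventTimeSessionWindows.withGap(Time.seconds(5)))

// Aggregate: sum the counts per key
val streamReduce = streamWindow.reduce(
  (item1, item2) => (item1._1, item1._2 + item2._2)
)

// Print the aggregated result
streamReduce.print

// Run the job
env.execute("SessionWindow")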
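
The gap rule for session windows can be illustrated without Flink. The following plain-Scala sketch groups event times for a single key into sessions, starting a new session whenever the difference to the previous event exceeds the gap. The names are hypothetical, and the sketch only models the grouping rule described in this section, not Flink's window merging implementation.

```scala
// Plain-Scala sketch: split event times into sessions separated by more than `gap`.
object SessionSketch {
  def sessions(eventTimes: Seq[Long], gap: Long): List[List[Long]] =
    eventTimes.sorted.foldLeft(List.empty[List[Long]]) {
      // Same session: the difference to the latest event of the current session is within the gap
      case (current :: done, t) if t - current.head <= gap => (t :: current) :: done
      // Otherwise the gap was exceeded (or this is the first event): start a new session
      case (acc, t) => List(t) :: acc
    }.map(_.reverse).reverse
}
```

With a 5-second gap, the event times 1000, 2000, 8000, 9000 and 20000 (milliseconds) form three sessions: (1000, 2000), (8000, 9000) and (20000).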

Origin blog.csdn.net/org_hjh/article/details/90552013