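Window operations in Flink

Flink processes data as streams. A window operation slices an unbounded stream into finite chunks according to some rule and places them in buckets, where they can be computed on.
When does a window fire, and when are elements that have not arrived yet dropped?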

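someStream.keyBy(...)
    .window(TumblingEventTimeWindows.of(...))
    .allowedLateness(...)
    .reduce(...)   // or sum / max / min / minBy / ...
// By default the allowed lateness is 0. Once the watermark reaches the window's end time,
// the window fires its computation and is closed, elements that did not make it into the
// window are dropped, and the next window opens. The allowed lateness is set with .allowedLateness().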

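Note: do not confuse this with the delay configured when generating watermarks. That delay decides how far the watermark lags behind the event timestamps, i.e. when an event's timestamp replaces the previous lastEmitedwatermark.

Window types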

timewindow

Every time window has a startTimeStamp and an endTimeStamp, and the range of data it holds is closed on the left and open on the right:

[elem1,elem2,…)

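-> Tumbling time windows

a. The data is split according to a fixed window length.

b. Windows are time-aligned, have a fixed length, and do not overlap.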
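
To make the alignment concrete, here is a minimal plain-Java sketch (no Flink dependency) of how a timestamp could be mapped to its tumbling window. The class and method names are made up for illustration, and the 10 s size is an assumed example value.

```java
public class TumblingWindowSketch {
    /** Start of the tumbling window [start, start + size) that contains the timestamp. */
    public static long windowStart(long timestamp, long size) {
        // floorMod keeps the alignment correct for negative timestamps as well
        return timestamp - Math.floorMod(timestamp, size);
    }

    public static void main(String[] args) {
        long size = 10_000L; // assumed 10 s window, in milliseconds
        for (long ts : new long[] {0L, 9_999L, 10_000L, 17_500L}) {
            long start = windowStart(ts, size);
            // the interval is closed on the left and open on the right
            System.out.println(ts + " -> [" + start + ", " + (start + size) + ")");
        }
    }
}
```

Because every timestamp maps to exactly one start value, windows never overlap, and 9999 and 10000 land in different windows.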

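-> Sliding time windows

a. The window size is fixed, but there is also a slide (step). If the slide equals the window size, the sliding window is simply a tumbling window.

b. The window size is fixed, but consecutive windows can overlap, so one element may belong to several windows.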
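
The overlap can be sketched in plain Java as follows. This is an illustration under assumed names (SlidingWindowSketch, windowsFor), not Flink's own code: it lists every [start, end) window that contains a given timestamp.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {
    /** Returns {start, end} for every sliding window that contains the timestamp. */
    public static List<long[]> windowsFor(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        // latest window start that is <= timestamp and aligned to the slide
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // walk backwards one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        for (long[] w : windowsFor(12L, 10L, 5L)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

With size 10 and slide 5 each element falls into two windows; with slide equal to size it falls into exactly one, which is the tumbling case.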

-> Session windows

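a. The window size is not fixed and windows are not time-aligned. A session is a series of events followed by a timeout gap of a specified length: when no new data arrives for that long, the current window ends and the next element starts a new one.

b. The session gap is the idle time between windows.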
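
A minimal plain-Java sketch of the idea, assuming the timestamps are already sorted and that a pause strictly longer than the gap starts a new session (the boundary case is an assumption here; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class SessionWindowSketch {
    /** Groups ascending timestamps into sessions separated by pauses longer than the gap. */
    public static List<List<Long>> sessions(long[] sortedTimestamps, long gap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            // a pause longer than the gap closes the current session
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sessions(new long[] {1, 2, 3, 10, 11, 30}, 5));
    }
}
```

Each session ends up with a different length, which is why session windows have neither a fixed size nor a fixed alignment.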

countwindow

-> Tumbling count windows

[1,2],[3,4],[5,6]…

-> Sliding count windows

[1,2],[2,3],[3,4],[4,5],[5,6]…
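
Both listings above can be reproduced with one small helper. This is only a plain-Java illustration of the grouping, with made-up names; it emits complete windows only and does not model when Flink actually fires a count window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CountWindowSketch {
    /** Cuts the elements into windows of `size` elements, advancing by `slide` elements (slide > 0). */
    public static List<List<Integer>> countWindows(List<Integer> elements, int size, int slide) {
        List<List<Integer>> windows = new ArrayList<>();
        for (int start = 0; start + size <= elements.size(); start += slide) {
            windows.add(new ArrayList<>(elements.subList(start, start + size)));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(countWindows(data, 2, 2)); // tumbling: slide == size
        System.out.println(countWindows(data, 2, 1)); // sliding: slide < size
    }
}
```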

In production, though, time windows are used far more often.

Flink window operations come in two kinds:

Keyed Windows

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Non-Keyed Windows

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Window assigners

-tumbling windows

-sliding windows

-session windows

-global windows

-custom window assigners, written by extending WindowAssigner

How to specify an assigner:

val input: DataStream[T] = ...

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>)

Event-Time&Watermarks

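Time semantics in windows (Time)

EventTime: the time at which the event was created, e.g. the timestamp written to the log when a click happened
IngestionTime: the time at which the data enters the Flink source
ProcessingTime: the time at which an operator task actually starts processing the data

In practice, EventTime is usually the one we care about most!

How to set the time characteristic of a Flink program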

env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

// alternatively:
// env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
// env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

WaterMarks

What is a watermark? It is a concept for measuring the progress of event time, and it can be thought of as a special record in the stream.

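Watermarks exist to handle out-of-order events. Progress is measured by the task's event-time clock, which is generally the minimum of the last watermarks the task has received from its inputs. A task can only emit watermarks <= its event-time clock to the downstream stream.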

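Once a watermark arrives, all data with timestamps before it is assumed to have arrived. For example, after the data with timestamp 17 has arrived, lastEmitedwatermark becomes 17; if the watermark delay is set to 1s, the task's lastEmitedwatermark is 16 instead, and so on.

Because of the network, distribution, and similar causes, data can arrive out of order, which makes window results inaccurate.

---- the official docs call the unordered case "out-of-order"

---- and the ordered case "in-order"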

Solution:

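This is where watermarks come in. With out-of-order data, when an element's timestamp reaches the window's closing time, the window should not fire right away. It waits for a while so that late data can arrive, and only then closes.

-> Watermarks are a mechanism for measuring the progress of event time, and can be configured to delay firing.

-> Watermarks are used to handle out-of-order events. Handling them correctly usually means combining the watermark mechanism with windows.

-> A watermark in the stream asserts that all data with a timestamp smaller than the watermark has arrived, so window execution is triggered by watermarks too.

-> Watermarks let the program balance latency against the correctness of results.

Properties of watermarks:

A watermark is a special kind of data record;

Watermarks must be monotonically increasing, so that a task's event time always moves forward and never backward;

Watermarks are tied to the timestamps of the data.

Something to think about: what is the difference between allowedLateness() and watermarks?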
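
One way to approach that question is to simulate both knobs for a single window. The plain-Java sketch below is an illustration only: the names are made up, the watermark is assumed to be "highest timestamp seen minus the delay", and the thresholds follow my understanding that an event-time window fires once the watermark reaches the last timestamp inside the window (end - 1). The watermark delay postpones the first firing; the allowed lateness decides how long the window still accepts elements after it has fired.

```java
public class LatenessSketch {
    public enum Outcome { ON_TIME, LATE_BUT_KEPT, DROPPED }

    /** Watermark of a bounded-out-of-orderness generator: highest timestamp seen minus the delay. */
    public static long watermarkFor(long maxTimestampSeen, long maxOutOfOrderness) {
        return maxTimestampSeen - maxOutOfOrderness;
    }

    /** Fate of an element of the window ending at windowEnd (exclusive) for a given current watermark. */
    public static Outcome classify(long watermark, long windowEnd, long allowedLateness) {
        long lastTimestampInWindow = windowEnd - 1;
        if (watermark < lastTimestampInWindow) {
            return Outcome.ON_TIME;        // window has not fired yet
        }
        if (watermark < lastTimestampInWindow + allowedLateness) {
            return Outcome.LATE_BUT_KEPT;  // window already fired, but its state is still kept
        }
        return Outcome.DROPPED;            // window state is gone
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L, delay = 3_500L, lateness = 2_000L; // assumed example values
        for (long maxSeen : new long[] {12_000L, 13_500L, 16_000L}) {
            long wm = watermarkFor(maxSeen, delay);
            System.out.println("max ts " + maxSeen + ", watermark " + wm + ": "
                    + classify(wm, windowEnd, lateness));
        }
    }
}
```

With the allowed lateness left at 0, the middle case disappears: an element is either on time or dropped.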

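Watermark propagation

A task may have several input partitions (parallelism), and the dataflow of each partition carries its own watermarks. For every partition the task keeps the latest watermark, which is the highest level seen on that partition. When a later watermark arrives on a partition, it replaces that partition's previous watermark, and the task's event-time clock is then the smallest watermark across all partitions.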
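
The bookkeeping described above can be sketched in plain Java. The class below is a hypothetical stand-in for a task, not Flink's implementation: it stores one watermark per input partition and advances its event-time clock to the minimum of them.

```java
import java.util.Arrays;

public class WatermarkPropagationSketch {
    private final long[] partitionWatermarks;
    private long eventTimeClock = Long.MIN_VALUE;

    /** Creates a task with the given number of input partitions (at least one). */
    public WatermarkPropagationSketch(int partitions) {
        partitionWatermarks = new long[partitions];
        Arrays.fill(partitionWatermarks, Long.MIN_VALUE); // nothing received yet
    }

    /** Records a watermark from one partition; returns true if the task's clock advanced. */
    public boolean onWatermark(int partition, long watermark) {
        // a partition's watermark only ever moves forward
        partitionWatermarks[partition] = Math.max(partitionWatermarks[partition], watermark);
        long min = Arrays.stream(partitionWatermarks).min().getAsLong();
        if (min > eventTimeClock) {
            eventTimeClock = min; // this is the value the task would forward downstream
            return true;
        }
        return false;
    }

    public long eventTimeClock() {
        return eventTimeClock;
    }

    public static void main(String[] args) {
        WatermarkPropagationSketch task = new WatermarkPropagationSketch(3);
        task.onWatermark(0, 5);
        task.onWatermark(1, 3);
        task.onWatermark(2, 7);
        System.out.println("clock = " + task.eventTimeClock());
    }
}
```

The clock stays put until every partition has reported, so one slow partition holds back event time for the whole task.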

How to introduce watermarks in code

There are two styles of watermark:

1. Periodic watermarks: With Periodic Watermarks

2. Punctuated watermarks: With Punctuated Watermarks

Timestamps and watermarks must be assigned before the window operation.

val withTimestampsAndWatermarks: DataStream[MyEvent] = stream
        .filter( _.severity == WARNING )
        // in-order data:
        //.assignAscendingTimestamps(_.timeStamp*1000)
        // out-of-order data:
        .assignTimestampsAndWatermarks(new MyTimestampsAndWatermarks())

withTimestampsAndWatermarks
        .keyBy( _.getGroup )
        .timeWindow(Time.seconds(10))
        .reduce( (a, b) => a.add(b) )
        .addSink(...)

Custom watermark assigners

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// periodic watermark
class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {
    // Allows a delay of 3.5 s. If an element with timestamp 30.5 arrives, the task's watermark
    // becomes 30.5 - 3.5 = 27, so the (15-27) window closes, fires, and writes out its result.
    val maxOutOfOrderness = 3500L // 3.5 seconds, the fixed allowed delay
    // largest timestamp seen so far; the task's watermark is derived from it
    var currentMaxTimestamp: Long = _

    // Called first, for each element: compare the incoming timestamp with the largest one
    // seen so far and keep the larger. The max timestamp is not the watermark itself,
    // only the basis for computing it.
    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        val timestamp = element.getCreationTime()
        currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
        timestamp
    }

    // Called afterwards, once the max timestamp is up to date:
    // new watermark = max timestamp - allowed delay
    override def getCurrentWatermark(): Watermark = {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
    }
}

------------------------------------------------------------------------------------
// periodic watermark, variant 2
class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {

    val maxTimeLag = 5000L // 5 seconds

    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        element.getCreationTime
    }

    override def getCurrentWatermark(): Watermark = {
        // this variant derives the watermark from the system clock
        new Watermark(System.currentTimeMillis() - maxTimeLag)
    }
}

Window functions

Window functions also come in two kinds:

Incremental aggregation functions

ReduceFunction, AggregateFunction, FoldFunction…

reduce/aggregate/fold()......

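For [1,2,3,4,5,6,7], the data inside the window is still processed in a streaming fashion, one element at a time, keeping only a running result: 1+2=3, 3+3=6, 6+4=10, 10+5=15…

Full aggregation functions (full-window functions)

All the data of the window is collected first; when the computation runs, it iterates over every element.

ProcessWindowFunction (low-level API) can also be used to work with time information.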
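
The difference in what each style has to keep in memory can be sketched in plain Java. Both classes below are hypothetical illustrations rather than Flink interfaces: the incremental one holds a single accumulator, the full one buffers every element until the window fires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

public class AggregationSketch {
    /** Incremental aggregation: each arriving element is folded into one accumulator. */
    public static final class Incremental {
        private final BinaryOperator<Integer> reduce;
        private Integer accumulator; // null until the first element arrives

        public Incremental(BinaryOperator<Integer> reduce) {
            this.reduce = reduce;
        }

        public void add(int value) {
            if (accumulator == null) {
                accumulator = value;
            } else {
                accumulator = reduce.apply(accumulator, value);
            }
        }

        public Integer result() {
            return accumulator;
        }
    }

    /** Full-window aggregation: elements are buffered and only traversed when the window fires. */
    public static final class Full {
        private final List<Integer> buffer = new ArrayList<>();

        public void add(int value) {
            buffer.add(value);
        }

        public int bufferedCount() {
            return buffer.size();
        }

        public int fireSum() {
            int sum = 0;
            for (int v : buffer) {
                sum += v;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        Incremental inc = new Incremental(Integer::sum);
        Full full = new Full();
        for (int v = 1; v <= 7; v++) {
            inc.add(v);
            full.add(v);
        }
        System.out.println(inc.result() + " " + full.fireSum() + " buffered=" + full.bufferedCount());
    }
}
```

Both give the same sum, but the full-window variant pays for it with state proportional to the number of elements, in exchange for seeing the whole window at once.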

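Other optional window APIs

trigger() - the trigger; defines when the window closes and emits its result

evictor() - the evictor; defines the logic for removing certain elements

allowedLateness() - allows late data to be processed

sideOutputLateData() - routes late data into a side output stream

getSideOutput() - retrieves the side output stream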

稳哥的哥