A detailed explanation of windows for stream processing in Apache Flink.

1. About Window

1.1 Window overview

The official Flink documentation describes windows as follows: "Windows are at the heart of processing infinite streams. Windows split the stream into 'buckets' of finite size, over which we can apply computations."

In other words, windows are the core mechanism for dealing with unbounded streams: a window cuts the stream into bounded buckets, and computations are then applied to each bucket.

1.2 Window types

Windows can be divided into two categories:
1. CountWindow: generates a window from a specified number of elements, regardless of time.
2. TimeWindow: generates a window based on time.
A TimeWindow can be further divided into three types according to how the window is implemented: tumbling window, sliding window, and session window.
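
A count window fires purely on the number of elements. The following plain-Scala sketch (no Flink dependency; `CountWindowSketch` and `tumblingCount` are illustrative names, not Flink APIs) mimics what a tumbling count window of a given size does for the elements of one key: a window is emitted each time `size` elements have arrived, and an incomplete trailing group is not emitted.

```scala
object CountWindowSketch {
  // Splits the elements of one key into consecutive groups of exactly `size`.
  // A trailing group with fewer elements is dropped, because a count window
  // only fires once it is full.
  def tumblingCount[A](elements: Seq[A], size: Int): List[Seq[A]] = {
    require(size > 0, "size must be positive")
    elements.grouped(size).filter(_.size == size).toList
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical temperature readings for a single thermometer.
    val temps = Seq(35.8, 36.1, 34.9, 37.2, 36.5, 35.0, 36.6)
    // With size 3, two full windows fire; the 7th reading stays buffered.
    tumblingCount(temps, 3).foreach(w => println(w.mkString("[", ", ", "]")))
  }
}
```

This only models the grouping; in a real job the grouping is done per key and incrementally as elements arrive.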

1.2.1 Detailed description of TimeWindow

1. Tumbling window: The data is sliced according to a fixed window length (in essence, it is a special sliding window whose slide equals the window length).
Features: time alignment, fixed window length, no overlap

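2. Sliding window: A sliding window is a more general form of the fixed window. It is defined by a fixed window length and a slide interval (the sliding step). When the slide is smaller than the window length, consecutive windows overlap and one element can belong to several windows.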

Features: time alignment, fixed window length, can overlap

3. Session window: A session window groups a series of events that are separated from other events by a timeout gap of a specified length, similar to a session in a web application. That is, if no new data arrives for the length of the gap, the current window closes and the next element starts a new window.

Features: no time alignment, window length is not fixed
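
The three assignment rules above can be sketched in plain Scala without any Flink dependency. Everything here (`TimeWindowSketch`, `Window`, `tumbling`, `sliding`, `sessions`) is an illustrative name rather than a Flink API, timestamps are assumed to be milliseconds with a window offset of zero, and the session rule assumes that a new session starts only when the distance to the previous event is strictly larger than the gap.

```scala
object TimeWindowSketch {
  // A window covers the half-open interval [start, end) in milliseconds.
  final case class Window(start: Long, end: Long)

  // Tumbling: a timestamp falls into exactly one window, aligned to multiples of `size`.
  def tumbling(ts: Long, size: Long): Window = {
    val start = ts - Math.floorMod(ts, size)
    Window(start, start + size)
  }

  // Sliding: a timestamp belongs to every window of length `size` whose start
  // is a multiple of `slide` and which still covers the timestamp.
  def sliding(ts: Long, size: Long, slide: Long): List[Window] = {
    val lastStart = ts - Math.floorMod(ts, slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > ts - size)
      .map(s => Window(s, s + size))
      .toList
  }

  // Session: walk the sorted timestamps and extend the current session while
  // the next event arrives within `gap` of the previous one; otherwise start a new one.
  def sessions(timestamps: Seq[Long], gap: Long): List[Window] =
    timestamps.sorted.foldLeft(List.empty[Window]) {
      case (cur :: done, t) if t <= cur.end =>
        Window(cur.start, math.max(cur.end, t + gap)) :: done
      case (acc, t) =>
        Window(t, t + gap) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    println(tumbling(7000L, 5000L))                   // one window
    println(sliding(7000L, 10000L, 5000L))            // two overlapping windows
    println(sessions(Seq(1000L, 2000L, 9000L), 3000L)) // two sessions of different length
  }
}
```

Calling `sliding(ts, size, size)` yields the same single window as `tumbling(ts, size)`, which is the sense in which a tumbling window is a special sliding window.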

1.2.2 Structure of a windowed Flink program

1. Keyed stream:

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

2. Non-keyed stream:

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Note that in practice we usually convert the DataStream to a KeyedStream before applying a window. Non-keyed windows are used relatively rarely, because windowAll processes all elements in a single task and therefore cannot run in parallel.
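
To make the difference between the two skeletons concrete, here is a minimal plain-Scala sketch (no Flink dependency) that groups a small batch of readings the way a keyed and a non-keyed tumbling window would. The names `KeyedVsNonKeyedSketch`, `Reading`, `keyedMin` and `globalMin` and the sample data are hypothetical; timestamps are assumed to be milliseconds.

```scala
object KeyedVsNonKeyedSketch {
  final case class Reading(id: String, ts: Long, temp: Double)

  // Start of the tumbling window (offset zero) that contains `ts`.
  private def windowStart(ts: Long, size: Long): Long = ts - Math.floorMod(ts, size)

  // Keyed: one result per (key, window) pair, analogous to keyBy(...).window(...).
  def keyedMin(rs: Seq[Reading], size: Long): Map[(String, Long), Double] =
    rs.groupBy(r => (r.id, windowStart(r.ts, size)))
      .map { case (k, group) => k -> group.map(_.temp).min }

  // Non-keyed: one result per window across all ids, analogous to windowAll(...).
  def globalMin(rs: Seq[Reading], size: Long): Map[Long, Double] =
    rs.groupBy(r => windowStart(r.ts, size))
      .map { case (w, group) => w -> group.map(_.temp).min }

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading("s1", 1000L, 35.8),
      Reading("s2", 2000L, 15.4),
      Reading("s1", 4000L, 34.2),
      Reading("s2", 11000L, 16.0)
    )
    println(keyedMin(readings, 10000L))  // minimum per thermometer and window
    println(globalMin(readings, 10000L)) // minimum per window over all thermometers
  }
}
```

The keyed variant produces independent results per key, which is what allows Flink to spread the keys over parallel tasks.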

1.3 Window API test

Requirement: for each thermometer, compute the minimum temperature every 60 s and output the latest timestamp.

package com.mo.apiTest

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, SlidingEventTimeWindows, TumblingEventTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time


// Case class for a thermometer reading
case class thermometer(id: String, time: String, Temp: Double)

object Time_window {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Read data from a socket text stream
    val inputStream = env.socketTextStream("hadoop102",7777)

    // First convert each comma-separated line into the case class type
    val dataStream = inputStream
      .map( data => {
        val arr = data.split(",")
        thermometer(arr(0), arr(1), arr(2).toDouble)
      } )

    val res = dataStream
      .map(data => (data.id,data.Temp,data.time))
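      .keyBy(_._1)   // group by id
//      .window(TumblingEventTimeWindows.of(Time.seconds(15)))  // underlying tumbling window implementation
//      .window(SlidingEventTimeWindows.of(Time.seconds(15),Time.milliseconds(3))) // underlying sliding window implementation
//      .window(EventTimeSessionWindows.withGap(Time.seconds(15)))  // session window
//      .countWindow(10)  // tumbling count window
//      .countWindow(10,2) // sliding count window
      .timeWindow(Time.seconds(60))  // Flink's shorthand for a tumbling window (one argument) or a sliding window (two arguments); deprecated since Flink 1.12 in favor of .window(...)
      .reduce((currdata,newdata)=>(currdata._1,currdata._2.min(newdata._2),newdata._3))  // every 60 s, compute each thermometer's minimum temperature and keep the most recent timestamp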

    res.print()
    env.execute("Tumblingwindow test")

  }
}

The output is the minimum temperature of each thermometer in each 60 s window, together with the latest timestamp seen in that window.
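
The reduce logic can be checked without a socket source. The sketch below applies a function with the same body as the reduce in the job to a hand-written batch that stands in for the contents of one window of one thermometer. `ReduceSketch`, `keepMinAndLatest` and the sample readings are hypothetical.

```scala
object ReduceSketch {
  // (id, temperature, time), the same tuple layout the job builds before keyBy.
  type Reading = (String, Double, String)

  // Keeps the id, the smaller temperature, and the time of the newer element.
  val keepMinAndLatest: (Reading, Reading) => Reading =
    (curr, next) => (curr._1, curr._2.min(next._2), next._3)

  def main(args: Array[String]): Unit = {
    // Hypothetical contents of one 60 s window for a single thermometer, in arrival order.
    val window: Seq[Reading] = Seq(
      ("sensor_1", 35.8, "1547718199"),
      ("sensor_1", 32.1, "1547718201"),
      ("sensor_1", 36.4, "1547718205")
    )
    println(window.reduce(keepMinAndLatest)) // (sensor_1,32.1,1547718205)
  }
}
```

Note that "latest" here means the time field of the element that arrived last, not the largest timestamp, so the two differ if elements arrive out of order.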

Origin blog.csdn.net/weixin_44080445/article/details/112156744