How to understand Window in Flink?

Window overview

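Stream processing is designed to handle unbounded data sets, that is, data sets that keep growing and are essentially infinite. A window is a means of cutting such unbounded data into finite chunks for processing.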

Windows are at the core of processing infinite data streams. A window splits an infinite stream into finite-size "buckets", and we can perform calculations on these buckets.

Window type

Windows can be divided into two categories:

  • CountWindow: A window is generated according to a specified number of elements, regardless of time.

  • TimeWindow: A window is generated according to time.

    Depending on how the window is implemented, a TimeWindow can be one of three types: Tumbling Window, Sliding Window and Session Window.

  1. Tumbling Windows

    The data is sliced according to a fixed window length.

    Features: Time alignment, fixed window length, no overlap.

    Applicable scenarios: BI statistics and similar tasks (aggregate calculations for each time period).

    The tumbling window assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a 5-minute tumbling window, the windows are created as shown in the figure below:
    [Figure: tumbling windows]
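
The alignment described above can be sketched in plain Scala, without Flink. This is a minimal illustration, assuming windows are aligned to timestamp 0 and timestamps are in milliseconds; `TumblingSketch`, `windowStart` and `windowFor` are hypothetical names, not part of the Flink API.

```scala
object TumblingSketch {
  // Start (in ms) of the tumbling window that contains `timestamp`.
  // Windows are assumed to be aligned to 0: [0, size), [size, 2 * size), ...
  def windowStart(timestamp: Long, size: Long): Long =
    timestamp - java.lang.Math.floorMod(timestamp, size)

  // Half-open window [start, end) that the element falls into.
  def windowFor(timestamp: Long, size: Long): (Long, Long) = {
    val start = windowStart(timestamp, size)
    (start, start + size)
  }
}
```

With a 5-minute window (300000 ms), an element at minute 7 (420000 ms) lands in the window from minute 5 to minute 10, and every element belongs to exactly one window.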

  2. Sliding Windows

    The sliding window is a more generalized form of the tumbling window. It is defined by a fixed window length and a sliding interval.

    Features: Time alignment, fixed window length, windows can overlap.

    Applicable scenarios: Statistics over the most recent time period (for example, computing the failure rate of an interface over the last 5 minutes to decide whether to raise an alarm).

    The sliding window assigner assigns elements to windows of fixed length. As with a tumbling window, the window size is configured by the window size parameter; a second parameter, the window slide, controls how often a new sliding window starts. Therefore, if the slide is smaller than the window size, windows overlap, and an element is assigned to multiple windows.

    For example, with a 10-minute window size and a 5-minute slide, a new window starts every 5 minutes and contains the data generated in the last 10 minutes, as shown in the following figure:
    [Figure: sliding windows]
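
The overlap can be made concrete with a small plain-Scala sketch that lists every window an element belongs to. It assumes window starts are multiples of the slide and that the slide is not larger than the window size; `SlidingSketch` and `windowsFor` are hypothetical names used only for illustration.

```scala
object SlidingSketch {
  // All half-open windows [start, start + size) that contain `timestamp`,
  // assuming windows start at multiples of `slide` and slide <= size.
  def windowsFor(timestamp: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Latest window start that is not after the timestamp.
    val lastStart = timestamp - java.lang.Math.floorMod(timestamp, slide)
    // Step backwards by `slide` while the window still covers the timestamp.
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > timestamp - size)
      .map(start => (start, start + size))
      .toList
  }
}
```

For a 10-minute size and a 5-minute slide, an element at minute 12 falls into two windows: minutes 10 to 20 and minutes 5 to 15. In general each element belongs to size / slide windows.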

  3. Session Windows

    A session window consists of a series of events followed by a timeout interval of a specified length, similar to a web application session: if no new data is received for a period of time, the current window ends and the next element starts a new one.

    Features: Time is not aligned.

    Applicable scenarios: Online user behavior analysis.

    The session window assigner groups elements by session activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. Instead, a window is closed when no elements are received for a fixed period of time, that is, when a gap of inactivity occurs. A session window is configured with a session interval that defines the length of this inactive period. When the period elapses, the current session is closed and subsequent elements are assigned to a new session window.
    [Figure: session windows]
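
The gap-based grouping can be sketched in plain Scala as follows. This is only an illustration of the idea for a single key: `SessionSketch` and `sessions` are hypothetical names, and treating a distance exactly equal to the gap as "same session" is an assumption of this sketch.

```scala
object SessionSketch {
  // Groups event timestamps (ms) of one key into sessions. A new session
  // starts when the distance to the previous event exceeds `gap`.
  def sessions(timestamps: Seq[Long], gap: Long): List[List[Long]] =
    timestamps.sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, t) if t - current.head <= gap =>
        (t :: current) :: done   // still inside the active session
      case (acc, t) =>
        List(t) :: acc           // inactivity gap passed: open a new session
    }.map(_.reverse).reverse
}
```

For example, `SessionSketch.sessions(Seq(1000L, 2000L, 10000L, 11000L, 30000L), 5000L)` yields three sessions: `List(1000, 2000)`, `List(10000, 11000)` and `List(30000)`. The sessions have different lengths and start times, which is what "time is not aligned" means.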

Window API

TimeWindow

TimeWindow groups all data within a specified time range into one window and computes over all the data of a window at once.

  1. Tumbling window
    By default, Flink's time windows are based on Processing Time: the data Flink receives is divided into windows according to the time at which it enters Flink.

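    // Minimum temperature per sensor id over 15-second tumbling windows.
    // Note: timeWindow is deprecated since Flink 1.12; the explicit form is
    // .window(TumblingProcessingTimeWindows.of(Time.seconds(15))).
    val minTempPerWindow = dataStream
      .map(r => (r.id, r.temperature))
      .keyBy(_._1)
      .timeWindow(Time.seconds(15))
      .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))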
    

    The time interval can be specified by one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), etc.

  2. Sliding Window (SlidingEventTimeWindows)

    The sliding window uses the same function name as the tumbling window, but takes two parameters: window_size and sliding_size.

    In the following code sliding_size is set to 5s, so a result is output every 5s, and each calculation covers all elements within the last 15s.

    val minTempPerWindow: DataStream[(String, Double)] = dataStream
      .map(r => (r.id, r.temperature))
      .keyBy(_._1)
      .timeWindow(Time.seconds(15), Time.seconds(5))
      // explicit event-time form of the line above:
      // .window(SlidingEventTimeWindows.of(Time.seconds(15), Time.seconds(5)))
      .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))
    


CountWindow

CountWindow triggers execution according to the number of elements with the same key in the window. On each execution, only the results for keys whose element count has reached the window size are calculated.

Note: The window_size of CountWindow refers to the number of elements with the same Key, not the total number of all input elements.

  1. Tumbling window

    The default CountWindow is a tumbling window. You only need to specify the window size. When the number of elements reaches the window size, the execution of the window is triggered.

    // Maximum temperature per sensor id for every 5 elements with the same key.
    val maxTempPerWindow: DataStream[(String, Double)] = dataStream
      .map(r => (r.id, r.temperature))
      .keyBy(_._1)
      .countWindow(5)
      .reduce((r1, r2) => (r1._1, r1._2.max(r2._2)))
    
  2. Sliding window

    The sliding window uses the same function name as the tumbling window, but takes two parameters: window_size and sliding_size.

    In the following code sliding_size is set to 2, so a calculation is triggered every time two elements with the same key are received, and each calculation covers the last 10 elements of that key.

    // temperature is a Double, as in the examples above
    val keyedStream: KeyedStream[(String, Double), Tuple] = dataStream
      .map(r => (r.id, r.temperature))
      .keyBy(0)
    // Whenever two more elements of a key have arrived, trigger a calculation over that key's 10 most recent elements
    val windowedStream: WindowedStream[(String, Double), Tuple, GlobalWindow] = keyedStream.countWindow(10, 2)
    val sumDstream: DataStream[(String, Double)] = windowedStream.sum(1)
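
To see what the sliding count window emits for a single key, here is a plain-Scala simulation. It assumes, as described above, that a result is produced every sliding_size elements and covers at most the last window_size elements of that key; `CountWindowSketch` and `slidingCountSums` are hypothetical names and the sketch does not use Flink.

```scala
object CountWindowSketch {
  // For one key: every `slide` elements, emit the sum of the most recent
  // (at most) `size` elements seen so far.
  def slidingCountSums(values: Seq[Int], size: Int, slide: Int): List[Int] =
    (slide to values.length by slide).map { seen =>
      values.slice(math.max(0, seen - size), seen).sum
    }.toList
}
```

With values 1 to 6, a size of 4 and a slide of 2, the sums are computed over (1, 2), then (1, 2, 3, 4), then (3, 4, 5, 6). Under this assumption the first results cover fewer than window_size elements.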
    

Window function

Window function defines the calculation operations to be performed on the data collected in the window, which can be divided into two categories:

  • Incremental aggregation functions perform a calculation every time a piece of data arrives, keeping only a simple state. Typical incremental aggregation functions are ReduceFunction and AggregateFunction.
  • Full window functions first collect all the data in the window and then iterate over all of it when calculating. ProcessWindowFunction is a full window function.
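
The difference between the two categories can be sketched in plain Scala. The names below (`WindowFunctionSketch`, `AvgAcc`, `add`, `result`, `median`) are hypothetical and only mimic the idea; they are not Flink's ReduceFunction, AggregateFunction or ProcessWindowFunction interfaces.

```scala
object WindowFunctionSketch {
  // Incremental style: each arriving value is folded into a small accumulator.
  // Only (sum, count) is kept, never the elements themselves.
  final case class AvgAcc(sum: Double, count: Long)

  def add(acc: AvgAcc, value: Double): AvgAcc =
    AvgAcc(acc.sum + value, acc.count + 1)

  def result(acc: AvgAcc): Double =
    if (acc.count == 0) 0.0 else acc.sum / acc.count

  // Full-window style: needs every element of the window at once,
  // for example to compute a median.
  def median(all: Seq[Double]): Double = {
    require(all.nonEmpty, "window must not be empty")
    val sorted = all.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid) else (sorted(mid - 1) + sorted(mid)) / 2
  }
}
```

An average can be maintained incrementally, for example `Seq(20.0, 22.0, 27.0).foldLeft(WindowFunctionSketch.AvgAcc(0.0, 0L))(WindowFunctionSketch.add)`, while a median has to buffer the whole window first. That buffering is the memory cost of a full window function.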

Other optional API

  • .trigger(): trigger

    Defines when the window is closed, triggers the calculation and outputs the result

  • .evictor(): evictor

    Defines the logic to remove certain data

  • .allowedLateness(): allows processing of late data

  • .sideOutputLateData(): puts late data into the side output stream

  • .getSideOutput(): gets the side output stream
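
How allowed lateness and the side output interact can be modeled with a small plain-Scala sketch. This is a simplified model under stated assumptions, not Flink code: `LatenessSketch`, `Decision` and `route` are hypothetical names, time is event time in milliseconds, and the exact boundary condition is an assumption of this sketch.

```scala
object LatenessSketch {
  sealed trait Decision
  case object InWindow extends Decision    // still handled by the window
  case object SideOutput extends Decision  // too late: goes to the late-data stream

  // An element is treated as too late once the watermark has reached the
  // last timestamp of its window (windowEnd - 1) plus the allowed lateness.
  def route(windowEnd: Long, allowedLateness: Long, watermark: Long): Decision =
    if (windowEnd - 1 + allowedLateness <= watermark) SideOutput else InWindow
}
```

For a window ending at 15 s with one minute of allowed lateness, an element arriving when the watermark is at 20 s is still added to the window, while one arriving when the watermark has passed 75 s would be routed to the side output.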


Origin blog.csdn.net/lp284558195/article/details/114309993