Flink Learning from 0 to 1, Chapter 6: Windows in Flink

1. Window Overview

A stream processing engine is designed to process unbounded data sets, that is, data sets that keep growing and are essentially infinite. A window is a means of cutting this infinite data into finite blocks for processing.

Windows are at the core of processing infinite data streams. A window splits an infinite stream into finite-size "buckets", and computations are then performed on these buckets.

2. Window types

Windows can be divided into two categories:

  • CountWindow: generates a window once a specified number of elements has arrived, independent of time.
  • TimeWindow: generates windows based on time.

TimeWindow can be further divided into three kinds according to how the window is implemented: Tumbling Window, Sliding Window, and Session Window.

2.1 Tumbling Windows

The data is sliced according to a fixed window length.

Features: time-aligned, fixed window length, no overlap.

The tumbling window assigner assigns each element to a window of the specified size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a 5-minute tumbling window, windows are created as shown in the figure below:

Figure: tumbling window

Applicable scenarios: BI statistics and similar jobs that aggregate over each time period.
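
The assignment rule described above can be sketched in plain Scala, without Flink. This is only an illustration: the object and function names are made up for this example, timestamps are assumed to be non-negative milliseconds, and windows are assumed to be aligned to multiples of the window size with no offset.

```scala
object TumblingSketch {
  // Start of the tumbling window that contains `timestampMs`.
  // Assumes non-negative timestamps and windows aligned to multiples of `sizeMs`.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Group (timestamp, value) records by the start of the window they fall into.
  def assign(records: Seq[(Long, Double)], sizeMs: Long): Map[Long, Seq[Double]] =
    records
      .groupBy { case (ts, _) => windowStart(ts, sizeMs) }
      .map { case (start, rs) => start -> rs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val fiveMinutes = 5 * 60 * 1000L
    val records = Seq((10000L, 1.0), (299999L, 2.0), (300000L, 3.0))
    // The first two records share one window; the third opens the next one.
    assign(records, fiveMinutes).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Each record lands in exactly one window, which is the "no overlap" property listed above.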

2.2 Sliding Windows

A sliding window is a generalization of the fixed (tumbling) window. It is defined by a fixed window length and a slide interval.

Features: time-aligned, fixed window length, windows may overlap.

The sliding window assigner assigns elements to fixed-length windows. As with a tumbling window, the window size is set by the window size parameter; a second parameter, the window slide, controls how frequently a new window starts. If the slide is smaller than the window size, the windows overlap, and an element is assigned to multiple windows.

For example, with a 10-minute window and a 5-minute slide, a window is produced every 5 minutes and contains the data generated in the last 10 minutes, as shown in the following figure:

Figure: sliding window

Applicable scenario: statistics over the most recent period (for example, computing an interface's failure rate over the last 5 minutes to decide whether to raise an alarm).
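
To make the "one element, several windows" behaviour concrete, here is a small plain-Scala sketch that lists the start times of every sliding window covering a given timestamp. The names are hypothetical, windows are assumed to start at multiples of the slide, and each window is treated as the half-open range [start, start + size).

```scala
object SlidingSketch {
  // Start times of all sliding windows that contain `timestampMs`,
  // newest window first. Assumes non-negative timestamps.
  def windowStarts(timestampMs: Long, sizeMs: Long, slideMs: Long): List[Long] = {
    // The most recent window start at or before the timestamp.
    val lastStart = timestampMs - (timestampMs % slideMs)
    // Step back one slide at a time while the window still covers the timestamp.
    Iterator
      .iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > timestampMs - sizeMs)
      .toList
  }

  def main(args: Array[String]): Unit = {
    val minute = 60 * 1000L
    // An element at minute 12 with a 10-minute window and a 5-minute slide.
    println(windowStarts(12 * minute, 10 * minute, 5 * minute))
  }
}
```

With a 10-minute size and a 5-minute slide, every element belongs to two windows; when the slide equals the size, the function returns a single start and the sketch degenerates to a tumbling window.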

2.3 Session Windows

A session window consists of a series of events followed by a timeout gap of a specified length, similar to a web application session: if no new data arrives for a period of time, the current window closes and the next element starts a new window.

Features: not time-aligned.

The session window assigner groups elements by session activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. Instead, a session window closes when no elements have been received for a certain period, that is, when a gap of inactivity occurs. A session window is configured with a session gap, which defines how long the inactivity must last. When that period expires, the current session closes and subsequent elements are assigned to a new session window.
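
The gap-based grouping can be sketched in plain Scala as follows. This is a simplified illustration and not Flink's implementation: the names are hypothetical, all timestamps are assumed to be available up front, and the choice that a gap of exactly `gapMs` still belongs to the same session is an assumption of this sketch.

```scala
object SessionSketch {
  // Split timestamps into sessions: a new session starts whenever the
  // distance to the previous element is larger than `gapMs`.
  def sessions(timestampsMs: Seq[Long], gapMs: Long): List[List[Long]] =
    timestampsMs.sorted
      .foldLeft(List.empty[List[Long]]) {
        // The newest element of the current session is at its head.
        case (current :: done, ts) if ts - current.head <= gapMs =>
          (ts :: current) :: done
        // Gap exceeded (or no session yet): open a new session.
        case (acc, ts) =>
          List(ts) :: acc
      }
      .map(_.reverse)
      .reverse

  def main(args: Array[String]): Unit = {
    // With a 5-second gap, the pause between 2000 and 10000 splits the stream in two.
    println(sessions(Seq(1000L, 2000L, 10000L, 11000L), 5000L))
  }
}
```

The resulting sessions have different lengths and start wherever the data happens to start, which is what "not time-aligned" means here.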

Figure: session window

3. Window API

3.1 TimeWindow

A TimeWindow groups all data within a specified time range into one window and computes over all the data in that window at once.

3.1.1 Tumbling window

By default, the time windows used here are divided according to ProcessingTime: Flink assigns each element to a window based on the time at which the element enters Flink.

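import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Minimum temperature per id over 15-second tumbling windows.
val minTemperature: DataStream[(String, Double)] = stream
  .map(r => (r.id, r.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(15))   // one argument: tumbling window
  .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))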

The time interval can be specified by one of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), etc.

3.1.2 Sliding Window (SlidingEventTimeWindows)

Sliding and tumbling windows use the same function name, but a sliding window takes two arguments: window_size and sliding_size.

In the code below, sliding_size is set to 5s, so a window is computed every 5s, and each computation covers all elements from the last 15s.

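// Minimum temperature per id over 15-second windows that slide every 5 seconds.
val minTemperature: DataStream[(String, Double)] = stream
  .map(r => (r.id, r.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(15), Time.seconds(5))   // (window_size, sliding_size)
  .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))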

3.2 CountWindow

A CountWindow triggers execution based on the number of elements with the same key in the window. When it fires, a result is computed only for the keys whose element count has reached the window size.

Note: the window_size of a CountWindow refers to the number of elements with the same key, not the total number of input elements.

3.2.1 Tumbling window

By default, a CountWindow is a tumbling window. You only need to specify the window size; when the number of elements reaches the window size, the window's execution is triggered.

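// Minimum temperature per id over every 5 elements with the same key.
val minTemperature: DataStream[(String, Double)] = stream
  .map(r => (r.id, r.temperature))
  .keyBy(_._1)
  .countWindow(5)   // fires once a key has collected 5 elements
  .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))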
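
The per-key counting rule from the note above can be imitated in plain Scala. The sketch below is not Flink code, and its names are invented for illustration; it buffers values per key and emits the minimum as soon as one key has collected `size` elements.

```scala
import scala.collection.mutable

object CountWindowSketch {
  // Emits (key, min) each time a key has collected `size` elements,
  // then starts a fresh window for that key.
  def tumblingMin(events: Seq[(String, Double)], size: Int): List[(String, Double)] = {
    val buffers = mutable.Map.empty[String, List[Double]]
    val out = mutable.ListBuffer.empty[(String, Double)]
    for ((key, value) <- events) {
      val window = value :: buffers.getOrElse(key, Nil)
      if (window.size == size) {
        out += ((key, window.min))   // this key's window is full: emit and reset
        buffers.remove(key)
      } else {
        buffers(key) = window        // keep collecting for this key
      }
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(("a", 3.0), ("b", 9.0), ("a", 1.0), ("a", 2.0), ("b", 7.0))
    // Five elements arrive in total, but only key "a" reaches 3 elements.
    println(tumblingMin(events, 3))
  }
}
```

Key "b" never produces a result in this run because it has only two elements, even though the total input is larger than the window size.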

3.2.2 Sliding window

Sliding and tumbling windows use the same function name, but a sliding window takes two arguments: window_size and sliding_size.

In the following code, sliding_size is set to 2, so a result is computed each time two elements with the same key have been received, and each computation covers that key's most recent 10 elements.

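val sumStream: DataStream[(String, Double)] = stream
  .map(r => (r.id, r.temperature))
  .keyBy(_._1)
  // Each time a key has received 2 more elements, trigger the computation
  // over the most recent 10 elements of that key.
  .countWindow(10, 2)
  .sum(1)   // sum the temperature field (tuple position 1)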
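
The "fire every sliding_size elements over the last window_size elements" behaviour can be sketched without Flink as well. The names below are hypothetical, and the handling of the first firings, which cover fewer than `size` elements because the key has not yet received that many, is a property of this sketch.

```scala
import scala.collection.mutable

object SlidingCountSketch {
  // For each key: keep the last `size` values, and emit their sum every
  // time `slide` new values have arrived for that key.
  def slidingSum(events: Seq[(String, Double)], size: Int, slide: Int): List[(String, Double)] = {
    val recent = mutable.Map.empty[String, Vector[Double]]   // last `size` values per key
    val sinceFire = mutable.Map.empty[String, Int]           // arrivals since the last firing
    val out = mutable.ListBuffer.empty[(String, Double)]
    for ((key, value) <- events) {
      val kept = (recent.getOrElse(key, Vector.empty[Double]) :+ value).takeRight(size)
      recent(key) = kept
      val n = sinceFire.getOrElse(key, 0) + 1
      if (n == slide) {
        out += ((key, kept.sum))   // fire over the retained elements
        sinceFire(key) = 0
      } else {
        sinceFire(key) = n
      }
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(("a", 1.0), ("a", 2.0), ("a", 3.0), ("a", 4.0), ("a", 5.0))
    // Window size 3, slide 2: fires after the 2nd and 4th element of key "a".
    println(slidingSum(events, 3, 2))
  }
}
```

The fifth element does not trigger a result on its own; it is included the next time the key fires.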

3.3 Window Function

The window function defines the computation performed on the data collected in a window. Window functions fall into two categories:

  • Incremental aggregation functions

    Each element is folded into the result as it arrives, so only a small state is kept. Typical incremental aggregation functions are ReduceFunction and AggregateFunction.

  • Full window functions

    All the data in the window is collected first, and the function iterates over all of it when the window is computed.

    ProcessWindowFunction is a full window function.
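
The difference between the two styles can be shown with a plain-Scala sketch that does not use Flink's interfaces. The function names are invented for this example: the first function folds elements one at a time and only ever holds a single running value, the second has to buffer the whole window before it can answer.

```scala
object WindowFunctionSketch {
  // Incremental style: combine each element with the running result as it
  // arrives. Only one Double is held as state at any time.
  def incrementalMin(window: Iterator[Double]): Option[Double] =
    window.reduceOption((a, b) => a.min(b))

  // Full-window style: buffer every element first, then compute with the
  // whole collection available. Returns the upper median, which cannot be
  // computed from a single running value.
  def fullWindowMedian(window: Iterator[Double]): Option[Double] = {
    val all = window.toVector.sorted
    if (all.isEmpty) None else Some(all(all.size / 2))
  }

  def main(args: Array[String]): Unit = {
    println(incrementalMin(Iterator(3.0, 1.0, 2.0)))
    println(fullWindowMedian(Iterator(3.0, 1.0, 2.0)))
  }
}
```

The trade-off is memory versus flexibility: the incremental form keeps constant state per window, while the full-window form stores every element but can compute results that need the complete data.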

3.4 Other optional API

  • .trigger(): the trigger defines when the window is closed, the computation is triggered, and the result is emitted

  • .evictor(): the evictor defines the logic for removing certain data

  • .allowedLateness(): allows late data to be processed

  • .sideOutputLateData(): puts late data into a side output stream

  • .getSideOutput(): gets the side output stream

Origin blog.csdn.net/dwjf321/article/details/109068500