Window in Flink

1. What is a Window?

The official documentation introduces windows as follows:

Windows are at the heart of processing infinite streams. Windows split the stream into "buckets" of finite size, over which we can apply computations.

In plain terms, a window cuts an unbounded data stream into finite chunks, either by a fixed time span or by a fixed number of elements. Aggregate functions can then be applied to the elements captured in each window to obtain statistics for that range.

2. Basic structure

The basic structure of a windowed Flink program differs between keyed and non-keyed streams in only one respect: a keyed stream calls keyBy(...) and then window(...), whereas a non-keyed stream calls windowAll(...) directly. The two cases are:

1. Keyed Windows

2. Non-Keyed Windows

Analysis:

A keyed stream lets the window computation run in parallel across multiple tasks. The original stream is split into multiple logical streams, one per key, and each logical keyed stream can be processed independently. Elements with the same key are always sent to the same task.

A non-keyed stream is not split into logical streams, so all window computation is done by a single task. The effective parallelism is 1, which limits throughput.
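
To make the difference concrete, here is a minimal plain-Java sketch (no Flink dependency) of how keyed records could be spread across parallel tasks while non-keyed records all land on one task. The class KeyRoutingSketch, its methods, and the modulo-hash routing are illustrative assumptions. Flink assigns keys to tasks with its own hashing scheme, so the actual task indices will differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Flink API.
final class KeyRoutingSketch {

    // Keyed stream: every record with the same key maps to the same task index.
    // (Assumption: simple modulo hashing stands in for Flink's real key routing.)
    static int keyedTask(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Non-keyed stream (windowAll): one logical stream, hence one task.
    static int nonKeyedTask() {
        return 0;
    }

    // Groups records by the task that would process them.
    static Map<Integer, List<String>> route(List<String> keys, int parallelism, boolean keyed) {
        Map<Integer, List<String>> byTask = new TreeMap<>();
        for (String key : keys) {
            int task = keyed ? keyedTask(key, parallelism) : nonKeyedTask();
            byTask.computeIfAbsent(task, t -> new ArrayList<>()).add(key);
        }
        return byTask;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a", "c", "b", "a");
        System.out.println("keyed:     " + route(words, 4, true));
        System.out.println("non-keyed: " + route(words, 4, false));
    }
}
```

In the keyed case the words are distributed over several task indices, and equal words always share an index. In the non-keyed case everything is handled by task 0.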

3. Window classification

Flink provides several kinds of windows that cover most use cases: tumbling windows, sliding windows, session windows, and global windows.

1. Tumbling Windows

A tumbling window assigns each element to exactly one window. The window size is fixed and consecutive windows do not overlap. Tumbling windows can be time-based or count-based.

Usage scenario: computing a metric once per fixed period.

For example, with a tumbling window size of 5 seconds, a window is evaluated every 5 seconds and a new window is started.

Usage:

DataStream<T> input = ...;

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// tumbling processing-time windows
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// daily tumbling event-time windows offset by -8 hours.
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);

Time intervals can be specified using Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.
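
The effect of the size and offset parameters can be worked through by hand. The following plain-Java sketch (no Flink dependency) computes the start of the tumbling window that a timestamp falls into. The class TumblingWindowSketch and the method windowStart are hypothetical names, and the arithmetic is an assumption about how aligned windows are derived rather than a copy of Flink's code.

```java
// Hypothetical helper; illustrates the alignment arithmetic only, not the Flink API.
final class TumblingWindowSketch {

    // Start (inclusive) of the tumbling window that contains the timestamp.
    // All values are in milliseconds; the window covers [start, start + size).
    static long windowStart(long timestamp, long offset, long size) {
        long remainder = (timestamp - offset) % size;
        // Java's % can yield a negative remainder, so handle both signs.
        return remainder < 0 ? timestamp - (remainder + size) : timestamp - remainder;
    }

    public static void main(String[] args) {
        long fiveSeconds = 5_000L;
        // 12.3 s falls into the 5-second window [10 s, 15 s).
        System.out.println(windowStart(12_300L, 0L, fiveSeconds)); // prints 10000

        long day = 24L * 60 * 60 * 1000;
        long hour = 60L * 60 * 1000;
        // Timestamp: day 1, 01:00 UTC. With a -8 h offset the daily window
        // starts at 16:00 UTC on day 0, which is 00:00 in UTC+8.
        System.out.println(windowStart(day + hour, -8 * hour, day)); // prints 57600000
    }
}
```

This is why the daily window example uses an offset of -8 hours: without an offset, daily windows are aligned to midnight UTC, and the offset shifts them to local midnight in UTC+8.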

2. Sliding Windows

A sliding window assigns elements to fixed-size buckets that may overlap, so an element can belong to several windows at once. The window size parameter sets the length of each window, and an additional window slide parameter controls how often a new window starts. Sliding windows can be time-based or count-based.

Usage scenario: computing a metric over a fixed window length, refreshed at a shorter (or different) interval.

For example, every 5 seconds, compute over the data of the previous 10 seconds (window size 10, slide 5: a new window is produced every 5 seconds and contains the data that arrived in the previous 10 seconds).
When window size > slide, consecutive windows overlap, and the first two windows are window1 and window2 in the figure below.
When window size = slide, the result is a tumbling window, and the first two windows are window1 and window3 in the figure below.
When window size < slide, the windows do not overlap and there are gaps between them. The first two windows are window1 and window4 in the figure below. Elements that fall into a gap are not assigned to any window, so this setting loses data.
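
The three cases above can be sketched in plain Java (no Flink dependency) by listing every window that a given timestamp belongs to. SlidingWindowSketch and windowStarts are hypothetical names, and the loop is an assumption about how overlapping windows can be enumerated, written for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; illustrates sliding-window assignment, not the Flink API.
final class SlidingWindowSketch {

    // Returns the start times of all windows [start, start + size) that contain
    // the timestamp, newest window first. All values are in milliseconds.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // size > slide: the element at 12 s belongs to two overlapping windows.
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L)); // prints [10000, 5000]
        // size = slide: behaves like a tumbling window, exactly one window.
        System.out.println(windowStarts(12_000L, 5_000L, 5_000L));  // prints [10000]
        // size < slide: the element at 7 s falls into a gap and is lost.
        System.out.println(windowStarts(7_000L, 5_000L, 10_000L));  // prints []
    }
}
```

With size 10 s and slide 5 s, every element lands in size / slide = 2 windows, which is also why sliding windows cost more state than tumbling windows of the same size.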

Usage:

DataStream<T> input = ...;

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);

3. Session Windows

Session windows group elements by sessions of activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. A session window closes when no element has arrived for a certain period, that is, after a gap of inactivity. The session gap defines how long that period is; it can be a fixed value or be computed per element. Once the gap has elapsed, the current session closes and subsequent elements are assigned to a new session window.

Session windows are useful in common real-world scenarios where neither tumbling nor sliding windows fit.
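
As a rough illustration of the gap rule, the following plain-Java sketch (no Flink dependency) splits a list of timestamps into sessions with a fixed gap. SessionWindowSketch and sessions are hypothetical names. The sketch works on a complete, sorted list, whereas Flink builds and merges session windows incrementally as elements arrive, and the handling of elements exactly one gap apart is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; illustrates the session-gap idea, not the Flink API.
final class SessionWindowSketch {

    // Groups timestamps (milliseconds) into sessions. A new session starts when
    // an element arrives more than `gap` after the previous element.
    static List<List<Long>> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sorted) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gap) {
                result.add(current);          // inactivity exceeded the gap: close the session
                current = new ArrayList<>();  // and open a new one
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] clicks = {1_000L, 2_000L, 3_000L, 20_000L, 21_000L};
        // With a 10 s gap, the 17 s pause between 3 s and 20 s splits the stream.
        System.out.println(sessions(clicks, 10_000L)); // prints [[1000, 2000, 3000], [20000, 21000]]
    }
}
```

Note that the sessions have different lengths and start wherever the first element happens to arrive, which is the main contrast with the aligned tumbling and sliding windows.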

Usage:

DataStream<T> input = ...;

// event-time session windows with a static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// event-time session windows with a dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return the session gap
    }))
    .<windowed transformation>(<window function>);

// processing-time session windows with a static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// processing-time session windows with a dynamic gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return the session gap
    }))
    .<windowed transformation>(<window function>);

4. Global Windows

A global window assigns all elements with the same key to one single window. It is only useful together with a custom trigger. Without one, no computation ever happens, because a global window has no natural end at which the accumulated elements could be processed.

Usage:

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>);
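
To illustrate why a trigger is needed, here is a plain-Java sketch (no Flink dependency) of one global buffer per key that emits a result only when a count-based trigger fires. GlobalWindowSketch, add, and the fire-and-clear behaviour after maxCount elements are assumptions made for illustration. In Flink the trigger would be attached to the windowed stream with trigger(...), and its exact firing and purging semantics depend on the trigger chosen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Flink API.
final class GlobalWindowSketch {

    private final int maxCount;
    // One never-ending buffer per key, like one global window per key.
    private final Map<String, List<Integer>> buffers = new HashMap<>();

    GlobalWindowSketch(int maxCount) {
        this.maxCount = maxCount;
    }

    // Adds a value to the key's buffer. Returns the sum of the buffered values
    // when the count trigger fires (and clears the buffer), otherwise null.
    Integer add(String key, int value) {
        List<Integer> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() < maxCount) {
            return null; // trigger has not fired: nothing is emitted
        }
        int sum = 0;
        for (int v : buffer) {
            sum += v;
        }
        buffer.clear();
        return sum;
    }

    public static void main(String[] args) {
        GlobalWindowSketch window = new GlobalWindowSketch(3);
        System.out.println(window.add("a", 1)); // prints null
        System.out.println(window.add("a", 2)); // prints null
        System.out.println(window.add("a", 3)); // prints 6
    }
}
```

If maxCount were never reached, add would keep returning null and the buffer would grow without bound, which mirrors a global window that has no trigger.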

4. Examples

1. Tumbling window example

Requirement: every 5 seconds, count how often each word appeared in the last 5 seconds.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;


public class TestTumblingTimeWindows {

    public static void main(String[] args) throws Exception {
        // 1. Create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read text lines from a socket (host "node1", port 9000)
        DataStream<String> source = env.socketTextStream("node1", 9000);

        DataStream<Tuple2<String, Integer>> windowCounts = source
                // split each line into (word, 1) pairs
                .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                    for (String word : value.split("\\s")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                // a lambda loses its generic types to erasure, so declare the output type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // keyBy(0) is deprecated; use a key selector instead
                .keyBy(t -> t.f0)
                // tumbling window of 5 seconds
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                // sum the counts (tuple field 1) per word
                .sum(1);

        // 3. Print the per-window word counts
        windowCounts.print();

        env.execute("TestTumblingTimeWindows");
    }

}

2. Sliding window example

Requirement: every 5 seconds, count how often each word appeared in the last 10 seconds.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;


public class TestSlidingTimeWindows {

    public static void main(String[] args) throws Exception {
        // 1. Create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read text lines from a socket (host "node1", port 9000)
        DataStream<String> source = env.socketTextStream("node1", 9000);

        DataStream<Tuple2<String, Integer>> windowCounts = source
                // split each line into (word, 1) pairs
                .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                    for (String word : value.split("\\s")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                // a lambda loses its generic types to erasure, so declare the output type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // keyBy(0) is deprecated; use a key selector instead
                .keyBy(t -> t.f0)
                // sliding window: size 10 seconds, slide 5 seconds
                .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                // sum the counts (tuple field 1) per word
                .sum(1);

        // 3. Print the per-window word counts
        windowCounts.print();

        env.execute("TestSlidingTimeWindows");
    }

}

Origin blog.csdn.net/icanlove/article/details/126349532