Flink’s commonly used processing functions

Handling function

Overview

Processing Function is one of the core components in Apache Flink used to process elements on the data stream. The processing function is responsible for defining how the data on the data stream is processed, allowing developers to write custom logic to perform various operations, such as transformation, aggregation, filtering, joining, etc., and generate an output data stream after processing.

For data streams, you can directly call the .process() method for customized processing. The parameters passed in are called processing functions, which can also be divided into conversion operators.

Basic processing functions

ProcessFunction is the most basic processing function. It is passed in as a parameter when calling .process() directly based on DataStream.

Introduction to ProcessFunction

ProcessFunction is an abstract class, which inherits AbstractRichFunction and has two generic type parameters:

1.输入的数据类型

2.处理完成之后输出数据类型

Two methods are defined internally:

1.必须要实现的抽象方法.processElement()

2.一个非抽象方法.onTimer()

The ProcessFunction class is as follows:

/**
 * 处理流元素的函数
 *
 * 对于输入流中的每个元素,调用processElement(Object,ProcessFunction.Context,Collector) 可以产生零个或多个元素作为输出
 * 还可以通过提供的ProcessFunction.Context查询时间和设置计时器
 *
 * 对于触发计时器,将调用onTimer(long,ProcessFunction.OnTimerContext,Collector) 可以再次产生零个或多个元素作为输出,并注册其他计时器
 *
 * @param <I> 输入元素的类型
 * @param <O> 输出元素的类型
 */
@PublicEvolving
public abstract class ProcessFunction<I, O> extends AbstractRichFunction {
    
    

    private static final long serialVersionUID = 1L;

    /**
     * 处理输入流中的一个元素,对于流中的每个元素都会调用一次
     *
     * 可以使用输出零个或多个元素收集器参数,并使用更新内部状态或设置计时器ProcessFunction.Context参数
     *
     * @param value 输入值,类型与流中数据类型一致
     * @param ctx ProcessFunction的内部抽象类Context,表示当前运行的上下文,可以获取当前时间戳,用于查询时间和注册定时器的定时服务
     * @param out 用于返回结果值的收集器,与out.collect()方法向下游发数据类似
     */
    public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

    /**
     * 当使用设置计时器时调用TimerService
     * 
     * 只有在注册好的定时器触发的时候才会调用,而定时器是通过定时服务TimerService来注册的
     * 
     * 事件时间语义下就是由水位线watermark来触发
     * 
     * 也可以自定义数据按照时间分组、定时触发计算输出结果,实现类似窗口window的功能
     *
     * @param timestamp 触发计时器的时间戳,指设定好的触发时间
     * @param ctx 上下文
     * @param out 用于返回结果值的收集器
     */
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {
    
    }
}

Usage example

The use of the basic processing function ProcessFunction is similar to the basic conversion operation. The .process() method is directly called based on the DataStream and a ProcessFunction is passed in as a parameter to define the processing logic.

Specific examples are as follows:

    public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        DataStreamSource<Integer> stream = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, -6));
        /**
         * 创建OutputTag对象
         * 分别指定: 标签名、放入侧输出流的数据类型(Typeinformation)
         */
        OutputTag<Integer> evenTag = new OutputTag<>("even", Types.INT);
        OutputTag<Integer> oddTag = new OutputTag<>("odd", Types.INT);

        // 使用process算子
        SingleOutputStreamOperator<Integer> process = stream.process(
                new ProcessFunction<Integer, Integer>() {
    
    
                    @Override
                    public void processElement(Integer value, Context ctx, Collector<Integer> out) throws Exception {
    
    
                        if (value > 0) {
    
    
                            if (value % 2 == 0) {
    
    
                                // 偶数放到侧输出流evenTag中
                                // 调用上下文对象ctx的output方法,分别传入 Tag对象、放入侧输出流中的数据
                                ctx.output(evenTag, value);
                            } else if (value % 2 == 1) {
    
    
                                // 奇数放到侧输出流oddTag中
                                ctx.output(oddTag, value);
                            }
                        } else {
    
    
                            // 负数 数据,放到主流中
                            out.collect(value);
                        }
                    }
                }
        );

        // 在主流中,根据标签 获取 侧输出流
        SideOutputDataStream<Integer> even = process.getSideOutput(evenTag);
        SideOutputDataStream<Integer> odd = process.getSideOutput(oddTag);

        // 打印主流
        process.printToErr("主流-负数-job");
        //打印 侧输出流
        even.print("偶数-job");
        odd.print("奇数-job");
        env.execute();
    }
奇数-job:1> 1
偶数-job:2> 2
奇数-job:1> 3
偶数-job:2> 4
奇数-job:1> 5
主流-负数-job:2> -6

Key partition processing function

KeyedProcessFunction is the processing function after key partitioning of the stream, and is passed in as a parameter when calling .process() based on KeyedStream. To use a timer, it must be based on KeyedStream

Introduction to KeyedProcessFunction

The definitions of KeyedProcessFunction and ProcessFunction are almost identical. The only difference is that the type parameter has an extra K, which is the type of key for the current key partition.

The key partition processing function interface is as follows:

public abstract class KeyedProcessFunction<K, I, O> extends AbstractRichFunction {
    
    

    public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;


    public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {
    
    }
}    

Timer Timer and timing service TimerService

In addition, KeyedStream supports the use of timing service TimerService, through which you can access events, timestamps, watermarks in the stream, and even register timed events.

Timing processing logic can be implemented in the onTimer() method, and the premise for it to trigger is that the timer has been registered before and the trigger time has now reached.

The function of registering a timer is implemented through the timing service provided in the context.

// 获取定时服务
TimerService timerService = ctx.timerService();

TimerService is Flink's basic service interface for time and timers. There are three main corresponding operations: getting the current time, registering the timer, and deleting the timer. The specific methods are as follows:

// 获取当前的处理时间
long currentProcessingTime();
// 获取当前的水位线(事件时间)
long currentWatermark();

// 注册处理时间定时器,当处理时间超过time时触发
void registerProcessingTimeTimer(long time);
// 注册事件时间定时器,当水位线超过time时触发
void registerEventTimeTimer(long time);

// 删除触发时间为time的处理时间定时器
void deleteProcessingTimeTimer(long time);
// 删除触发时间为time的处理时间定时器
void deleteEventTimeTimer(long time);

Notice:

Although TimerService can be accessed in all processing functions, only KeyedStream-based processing functions can call the methods of registering and deleting timers.

Usage example

Based directly on the KeyedStream after keyBy, directly call the .process() method and pass in the implementation class parameters of KeyedProcessFunction.

必须实现processElement()抽象方法,用来处理流中的每一个数据

必须实现非抽象方法onTimer(),用来定义定时器触发时的回调操作
 public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 从socket接收数据流
        SingleOutputStreamOperator<Tuple2<String, Integer>> streamSource = env.socketTextStream("IP", 8086)
                // 将输入数据转换为Tuple2
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
    
    
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
    
    
                        String[] split = value.split(",");
                        return Tuple2.of(split[0], Integer.valueOf(split[1]));
                    }
                })
                // 指定 watermark策略
                .assignTimestampsAndWatermarks(
                        // 定义Watermark策略
                        WatermarkStrategy
                                .<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((value, ts) -> value.f1 * 1000L)
                );

        // keyBy分区
        KeyedStream<Tuple2<String, Integer>, String> keyByStream = streamSource.keyBy(a -> a.f0);

        // 按键分区处理函数
        SingleOutputStreamOperator<Integer> process = keyByStream.process(
                new KeyedProcessFunction<String, Tuple2<String, Integer>, Integer>() {
    
    
                    /**
                     * 来一条数据调用一次
                     * @param value 当前数据
                     * @param ctx 上下文
                     * @param out 收集器
                     * @throws Exception
                     */
                    @Override
                    public void processElement(Tuple2<String, Integer> value, Context ctx, Collector<Integer> out) throws Exception {
    
    
                        //获取当前数据的key
                        String currentKey = ctx.getCurrentKey();
p();
                        // 获取定时服务
                        TimerService timerService = ctx.timerService();
						// 数据中提取出来的事件时间
                        Long currentEventTime = ctx.timestam
						// 注册事件时间定时器
                        timerService.registerEventTimeTimer(3000L);
                        System.out.println("key: " + currentKey + " 当前数据: " + value + " 当前时间: " + currentEventTime + " 注册一个3s定时器");

                    /**
                     * 时间进展到定时器注册的时间,调用该方法
                     * @param timestamp 定时器被触发时的时间
                     * @param ctx       上下文
                     * @param out       采集器
                     */
                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Integer> out) throws Exception {
    
    
                        super.onTimer(timestamp, ctx, out);
                        String currentKey = ctx.getCurrentKey();
                        System.out.println("key: " + currentKey + " 时间: " + timestamp + " 定时器触发");
                    }
                }
        );
        process.print();
        env.execute();
    }

other

1. Register a timer for event time

事件时间定时器,通过watermark来触发,即watermark >= 注册的时间

水印watermark = 当前最大事件时间 - 等待时间 -1ms

例子:等待3s,3s定时器,事件时间6s 则watermark = 6s - 3s -1ms = 2.99s,不会触发3s的定时器
// 数据中提取出来的事件时间
Long currentEventTime = ctx.timestam
// 注册事件时间定时器
timerService.registerEventTimeTimer(3000L);
System.out.println("key: " + currentKey + " 当前数据: " + value + " 当前时间: " + currentEventTime + " 注册一个3s定时器");

The input data is as follows. When 7 is entered, the water level is 7-3=4s-1ms=3.99s, that is, the water level exceeds the timer for 3s, and the callback operation is triggered.

nc -lk 8086
key1,1
key1,2
key2,3
key2,4
key1,5
key2,6
key1,7

Console output:

key: key1 当前数据: (key1,1) 当前时间: 1000 注册一个3s定时器
key: key1 当前数据: (key1,2) 当前时间: 2000 注册一个3s定时器
key: key2 当前数据: (key2,3) 当前时间: 3000 注册一个3s定时器
key: key2 当前数据: (key2,4) 当前时间: 4000 注册一个3s定时器
key: key1 当前数据: (key1,5) 当前时间: 5000 注册一个3s定时器
key: key2 当前数据: (key2,6) 当前时间: 6000 注册一个3s定时器
key: key1 当前数据: (key1,7) 当前时间: 7000 注册一个3s定时器
key: key1 时间: 3000 定时器触发
key: key2 时间: 3000 定时器触发

Notice:

TimerService will deduplicate timers based on key and timestamp. Therefore, for each key and timestamp, there can be at most one timer. If registered multiple times, the onTimer() method will only be called once.

2. Register a timer for processing time

long currentTs = timerService.currentProcessingTime();
timerService.registerProcessingTimeTimer(currentTs + 3000L);
 System.out.println("key: " + currentKey + " 当前数据: " + value + " 当前时间: " + currentTs + " 注册一个3s后的定时器");

Enter the test data as follows:

key1,1
key2,2

When registering a timer for processing time, the timer will trigger the operation after 3 seconds.

key: key1 当前数据: (key1,1) 当前时间: 1688136512301 注册一个3s后的定时器
key: key2 当前数据: (key2,2) 当前时间: 1688136514179 注册一个3s后的定时器
key: key1 时间: 1688136515301 定时器触发
key: key2 时间: 1688136517179 定时器触发

3. Get the current watermark of the process

long currentWatermark = timerService.currentWatermark();
System.out.println("当前数据: " + value + " 当前watermark: " + currentWatermark);
key1,1
key1,2
key1,3

Conclusion: Each time the process is processed, the watermark refers to the event time of the previous data - the waiting time, for example: 3-2-1ms=-1001

当前数据=(key1,1),当前watermark=-9223372036854775808
当前数据=(key1,2),当前watermark=-2001
当前数据=(key1,3),当前watermark=-1001

4. Delete a processing time timer

                        // 注册处理时间定时器
                        long currentTs = timerService.currentProcessingTime();
                        long timer = currentTs + 3000;
                        timerService.registerProcessingTimeTimer(timer);
                        System.out.println("key: " + currentKey + " 当前数据: " + value + " 当前时间: " + currentTs + " 注册一个3s后的定时器");

                        // 在3000毫秒后删除处理时间定时器
                        if("key1".equals(currentKey)){
    
    
                            timerService.deleteProcessingTimeTimer(timer)
                        }

Enter test data:

key1,1
key2,2

Console output:

key: key1 当前数据: (key1,1) 当前时间: 1688138104565 注册一个3s后的定时器
key: key2 当前数据: (key2,2) 当前时间: 1688138106441 注册一个3s后的定时器
key: key2 时间: 1688138109441 定时器触发

window handling function

The window processing function is a typical full-window function. It directly calls the .process() method based on WindowedStream.

There are 2 window processing functions:

1.ProcessWindowFunction:

The processing function after opening the window is also a representative of the full window function. When calling .process() based on WindowedStream, it is passed in as a parameter and must be a keyBy data stream.

2.ProcessAllWindowFunction:

It is also the processing function after opening the window. It is passed in as a parameter when calling .process() based on AllWindowedStream. It must be a non-keyBy data stream.

Introduction to ProcessWindowFunction

ProcessWindowFunction is both a processing function and a full window function. The specific interface is as follows:

      /**
     * ProcessWindowFunction它有四个类型参数:

     * @param <IN> 数据流中窗口任务的输入数据类型
     * @param <OUT> 窗口任务进行计算之后的输出数据类型
     * @param <KEY> 数据中键key的类型
     * @param <W>  窗口的类型,是Window的子类型。一般情况下我们定义时间窗口,W就是TimeWindow
     */

    public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> extends AbstractRichFunction {
    
    

        /**
         * 处理数据的核心方法process()方法
         *
         * @param key 窗口做统计计算基于的键,也就是之前keyBy用来分区的字段
         * @param context 当前窗口进行计算的上下文,它的类型就是ProcessWindowFunction内部定义的抽象类Context
         * @param elements 窗口收集到用来计算的所有数据,这是一个可迭代的集合类型
         * @param out 用来发送数据输出计算结果的收集器,类型为Collector
         * @throws Exception
         */
        public abstract void process(KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;

        /**
         * 主要是进行窗口的清理工作
         * 如果自定义了窗口状态,那么必须在clear()方法中进行显式地清除,避免内存溢出
         * @param context 当前窗口进行计算的上下文
         * @throws Exception
         */
        public void clear(Context context) throws Exception {
    
    }
    }

Introduction to ProcessAllWindowFunction

The usage of ProcessAllWindowFunction is similar, but it is based on AllWindowedStream, that is, directly opening the window for the data stream without keyBy and calling the .process() method

stream.windowAll( TumblingEventTimeWindows.of(Time.seconds(10)) )
    .process(new MyProcessAllWindowFunction())

Usage example

Take using ProcessWindowFunction as an example:

  public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 从socket接收数据流
        SingleOutputStreamOperator<String> source = env.socketTextStream("IP", 8086);


        // 将输入数据转换为(key, value)元组
        DataStream<Tuple2<String, Integer>> dataStream = source.map(new MapFunction<String, Tuple2<String, Integer>>() {
    
    
            @Override
            public Tuple2 map(String s) throws Exception {
    
    
                int number = Integer.parseInt(s);
                String key = number % 2 == 0 ? "key1" : "key2";
                Tuple2 tuple2 = new Tuple2(key, number);
                return tuple2;
            }
        }).returns(Types.TUPLE(Types.STRING, Types.INT));

        // 将数据流按键分组,并定义滚动窗口(处理时间窗口)
        DataStream<String> resultStream = dataStream
                .keyBy(tuple -> tuple.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .process(new MyProcessWindowFunction());

        resultStream.print();

        env.execute("ProcessWindowFunction Example");
    }

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple2<String, Integer>, String, String, TimeWindow> {
    
    
        @Override
        public void process(String key, Context context, Iterable<Tuple2<String, Integer>> elements, Collector<String> out) {
    
    
            int sum = 0;
            for (Tuple2<String, Integer> element : elements) {
    
    
                sum += element.f1;
            }

            out.collect("Key: " + key + ", Window: " + context.window() + ", Sum: " + sum);
        }
    }

Stream merge processing function

CoProcessFunction is a processing function after merging two connect streams. It is passed in as a parameter when calling .process() based on ConnectedStreams.

Introduction to CoProcessFunction

When calling .process(), pass in a CoProcessFunction. What it needs to implement is the two methods processElement1() and processElement2()

The specific structure of the CoProcessFunction class is as follows:

/**
 * 用于同时处理两个连接的流
 * 它允许定义自定义处理逻辑,以处理来自两个不同输入流的事件并生成输出
 *
 * @param <IN1> 第一个输入流的元素类型
 * @param <IN2> 第二个输入流的元素类型
 * @param <OUT> 输出元素的类型
 */
public abstract class CoProcessFunction<IN1, IN2, OUT> extends AbstractRichFunction {
    
    

    /**
     * 处理第一个输入流的元素
     *
     * @param value 第一个输入流的元素
     * @param ctx 用于访问上下文信息,例如事件时间和状态的Context对象
     * @param out 用于发射输出元素的Collector对象
     * @throws Exception 处理时可能抛出的异常
     */
    public abstract void processElement1(IN1 value, Context ctx, Collector<OUT> out) throws Exception;

    /**
     * 处理第二个输入流的元素
     *
     * @param value 第二个输入流的元素
     * @param ctx 用于访问上下文信息,可以使用Context对象来访问事件时间、水位线和状态等上下文信息
     * @param out 用于发射输出元素的Collector对象
     * @throws Exception 处理时可能抛出的异常
     */
    public abstract void processElement2(IN2 value, Context ctx, Collector<OUT> out) throws Exception;

    /**
     * 当定时器触发时调用的方法。可以重写这个方法来执行基于时间的操作
     *
     * @param timestamp 触发定时器的时间戳
     * @param ctx 用于访问上下文信息,如事件时间和状态的OnTimerContext对象
     * @param out 用于发射输出元素的Collector对象
     * @throws Exception 处理时可能抛出的异常
     */
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception {
    
    }
}

Usage example

Assume there are two input streams, combine the two streams to calculate the total corresponding to each key, and output the result stream

    public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<Tuple2<String, Integer>> source1 = env.fromElements(Tuple2.of("key1", 1), Tuple2.of("key2", 4), Tuple2.of("key1", 2));
        DataStreamSource<Tuple2<String, Integer>> source2 = env.fromElements(Tuple2.of("key1", 3), Tuple2.of("key2", 5), Tuple2.of("key2", 6));
        ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, Integer>> connect = source1.connect(source2);

        // 进行keyby操作,将key相同数据放到一起
        ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, Integer>> connectKeyby = connect.keyBy(s1 -> s1.f0, s2 -> s2.f0);

        /**
         * 对2个流中相同key的值求和
         */
        SingleOutputStreamOperator<String> process = connectKeyby.process(
                new CoProcessFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
    
    
                    Map<String, Integer> map = new HashMap<>();

                    /**
                     * 第一条流的处理逻辑
                     * @param value 第一条流的数据
                     * @param ctx   上下文
                     * @param out   采集器
                     * @throws Exception
                     */
                    @Override
                    public void processElement1(Tuple2<String, Integer> value, Context ctx, Collector<String> out) throws Exception {
    
    
                        String key = value.f0;
                        if (!map.containsKey(key)) {
    
    
                            // 如果key不存在,则将值直接put进map
                            map.put(key, value.f1);
                        } else {
    
    
                            // key存在,则计算:获取上一次put的值 + 本次的值
                            Integer total = map.get(key) + value.f1;
                            map.put(key, total);
                        }

                        out.collect("processElement1  key = " + key + " value = " + value + "total = " + map.get(key));
                    }

                    /**
                     * 第二条流的处理逻辑
                     * @param value 第二条流的数据
                     * @param ctx   上下文
                     * @param out   采集器
                     * @throws Exception
                     */
                    @Override
                    public void processElement2(Tuple2<String, Integer> value, Context ctx, Collector<String> out) throws Exception {
    
    
                        String key = value.f0;
                        if (!map.containsKey(key)) {
    
    
                            // 如果key不存在,则将值直接put进map
                            map.put(key, value.f1);
                        } else {
    
    
                            // key存在,则计算:获取上一次put的值 + 本次的值
                            Integer total = map.get(key) + value.f1;
                            map.put(key, total);
                        }
                        out.collect("processElement2  key = " + key + " value = " + value + "total = " + map.get(key));
                    }
                }
        );

        process.print();
        env.execute();
    }
3> processElement1  key = key2 value = (key2,4)total = 4
4> processElement1  key = key1 value = (key1,1)total = 1
4> processElement2  key = key1 value = (key1,3)total = 4
4> processElement1  key = key1 value = (key1,2)total = 6
3> processElement2  key = key2 value = (key2,5)total = 9
3> processElement2  key = key2 value = (key2,6)total = 15

Stream join handler function

JoinFunction and ProcessJoinFunction are two different interfaces in Flink used to perform window join operations.

Window Join JoinFunction

Flink specifically provides a window joint result for merging two streams based on a period of time, which can define a time window and place the data sharing a common key in the two streams in the window for pairing processing.

The JoinFunction interface is as follows:

    /**
     * 联接通过在指定的键上联接两个数据集的元素来组合它们,每对连接元素都调用此函数
     * 
     * 默认情况下,连接严格遵循SQL中 “inner join” 的语义
     *
     * @param <IN1> 第一个输入中元素的类型
     * @param <IN2> 第二个输入中元素的类型
     * @param <OUT> 结果元素的类型
     */
    public interface JoinFunction<IN1, IN2, OUT> extends Function, Serializable {
    
    

        /**
         * join方法,每对联接的元素调用一次
         *
         * @param first 来自第一个输入的元素
         * @param second 来自第二个输入的元素
         * @return 生成的元素
         */
        OUT join(IN1 first, IN2 second) throws Exception;
    }

The specific syntax format is as follows:

/**
 * 1.调用DataStream的.join()方法来合并两条流,得到一个JoinedStreams
 * 2.通过.where()和.equalTo()方法指定两条流中联结的key。注意:两者相同的元素,如果在同一窗口中,才可以匹配起来
 * 3.通过.window()开窗口,并调用.apply()传入联结窗口函数进行处理计算
 */
stream1.join(stream2)
        // where()参数是KeySelector键选择器,用来指定第一条流中的key
        .where(<KeySelector>)
        // equalTo()传入KeySelector则指定第二条流中的key
        .equalTo(<KeySelector>)
        // window()传入窗口分配器
        .window(<WindowAssigner>)
        // apply()看作实现一个特殊的窗口函数,只能调用.apply()。传入JoinFunction是一个函数类接口,使用时需要实现内部的.join()方法,方法有两个参数,分别表示两条流中成对匹配的数据。
        .apply(<JoinFunction>)

Examples are as follows:

 public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 生成数据源1
        DataStreamSource<Tuple2<String, Integer>> streamSource1 = env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3), Tuple2.of("c", 4));
        // 定义 使用 Watermark策略
        SingleOutputStreamOperator<Tuple2<String, Integer>> stream1 = streamSource1
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Integer>>forMonotonousTimestamps().withTimestampAssigner((value, ts) -> value.f1 * 1000L));


        // 生成数据源2
        DataStreamSource<Tuple2<String, Integer>> streamSource2 = env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3), Tuple2.of("c", 4), Tuple2.of("d", 5), Tuple2.of("e", 6));
        // 定义 使用 Watermark策略
        SingleOutputStreamOperator<Tuple2<String, Integer>> stream2 = streamSource2
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Integer>>forMonotonousTimestamps().withTimestampAssigner((value, ts) -> value.f1 * 1000L));

        /**
         * 根据keyby的key进行匹配关联
         *
         * 注意:落在同一个时间窗口范围内才能匹配
         */
        DataStream<String> join = stream1.join(stream2)
                // stream1的keyby
                .where(r1 -> r1.f0)
                // stream2的keyby
                .equalTo(r2 -> r2.f0)
                // 传入窗口分配器
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // 传入JoinFunction函数类接口,实现内部的.join()方法,方法有两个参数,分别表示两条流中成对匹配的数据
                .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
    
    
                    /**
                     * 关联上的数据,调用join方法
                     * @param first  stream1的数据
                     * @param second stream2的数据
                     */
                    @Override
                    public String join(Tuple2<String, Integer> first, Tuple2<String, Integer> second) throws Exception {
    
    
                        return "stream1 数据: " + first + " 关联 stream2 数据: " + second;
                    }
                });

        join.print();
        env.execute();
    }

The execution results are as follows:

stream1 数据: (a,1) 关联 stream2 数据: (a,1)
stream1 数据: (a,1) 关联 stream2 数据: (a,2)
stream1 数据: (a,2) 关联 stream2 数据: (a,1)
stream1 数据: (a,2) 关联 stream2 数据: (a,2)
stream1 数据: (c,4) 关联 stream2 数据: (c,4)
stream1 数据: (b,3) 关联 stream2 数据: (b,3)

Interval Join ProcessJoinFunction

Interval Join is an interval join. For each data in a stream, a period of time before and after its timestamp is opened up to see if there is a match between data from another stream during this period.

The ProcessJoinFunction interface is as follows:

/**
 * 处理两个连接流的关联操作的抽象类
 * 该类允许定义自定义的处理逻辑,以在连接两个流时处理匹配的元素
 *
 * @param <IN1> 第一个输入流的元素类型
 * @param <IN2> 第二个输入流的元素类型
 * @param <OUT> 输出元素的类型
 */
public interface ProcessJoinFunction<IN1, IN2, OUT> {
    
    

    /**
     * 处理连接两个流的元素
     *
     * @param left  第一个输入流的元素
     * @param right 第二个输入流的元素
     * @param ctx   用于访问上下文信息的 Context 对象
     * @param out   用于发射输出元素的 Collector 对象
     * @throws Exception 处理时可能抛出的异常
     */
    void processElement(IN1 left, IN2 right, Context ctx, Collector<OUT> out) throws Exception;
}

The syntax for interval join is as follows:

// 第一条流进行KeyedStream
stream1
    .keyBy(<KeySelector>)
    // 得到KeyedStream之后,调用.intervalJoin()合并两条流,传入一个KeyedStream参数,两者key类型应该一致,最终得到一个IntervalJoin类型
    .intervalJoin(stream2.keyBy(<KeySelector>))
    // 通过.between()方法指定间隔的上下界
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    // 调用.process()方法,定义对匹配数据对的处理操作,传入一个处理函数
    .process (new ProcessJoinFunction<Integer, Integer, String(){
    
    
            @Override
            public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
    
    
                out.collect(left + "," + right);
            }
        });

Usage examples are as follows:

    public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 生成数据源1
        DataStreamSource<Tuple2<String, Integer>> streamSource1 = env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3), Tuple2.of("c", 4));
        // 定义 使用 Watermark策略
        SingleOutputStreamOperator<Tuple2<String, Integer>> stream1 = streamSource1
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Integer>>forMonotonousTimestamps().withTimestampAssigner((value, ts) -> value.f1 * 1000L));


        // 生成数据源2
        DataStreamSource<Tuple2<String, Integer>> streamSource2 = env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3), Tuple2.of("c", 4), Tuple2.of("d", 5), Tuple2.of("e", 6));
        // 定义 使用 Watermark策略
        SingleOutputStreamOperator<Tuple2<String, Integer>> stream2 = streamSource2
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Integer>>forMonotonousTimestamps().withTimestampAssigner((value, ts) -> value.f1 * 1000L));


        // 对2条流分别做keyby,key就是关联条件
        KeyedStream<Tuple2<String, Integer>, String> keyedStream1 = stream1.keyBy(r1 -> r1.f0);
        KeyedStream<Tuple2<String, Integer>, String> keyedStream2 = stream2.keyBy(r2 -> r2.f0);


        // 执行间隔联结
        keyedStream1.intervalJoin(keyedStream2)
                .between(Time.seconds(-2), Time.seconds(2))
                .process(
                        new ProcessJoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
    
    
                            /**
                             * 当两条流数据匹配上时调用这个方法
                             * @param left  stream1的数据
                             * @param right stream2的数据
                             * @param ctx   上下文
                             * @param out   采集器
                             * @throws Exception
                             */
                            @Override
                            public void processElement(Tuple2<String, Integer> left, Tuple2<String, Integer> right, Context ctx, Collector<String> out) throws Exception {
    
    
                                // 关联的数据
                                out.collect("stream1 数据: " + left + " 关联 stream2 数据: " + right);
                            }
                        })
                .print();

        env.execute();
    }
stream1 数据: (a,1) 关联 stream2 数据: (a,1)
stream1 数据: (a,1) 关联 stream2 数据: (a,2)
stream1 数据: (a,2) 关联 stream2 数据: (a,2)
stream1 数据: (a,2) 关联 stream2 数据: (a,1)
stream1 数据: (b,3) 关联 stream2 数据: (b,3)
stream1 数据: (c,4) 关联 stream2 数据: (c,4)

Handling late data

The window interval connection processing function can handle late data

    public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Tuple2<String, Integer>> streamSource1 = env.socketTextStream("112.74.96.150", 8086)
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
    
    
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
    
    
                        String[] split = value.split(",");
                        return Tuple2.of(split[0], Integer.valueOf(split[1]));
                    }
                })
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((value, ts) -> value.f1 * 1000L)
                );


        SingleOutputStreamOperator<Tuple2<String, Integer>> streamSource2 = env.socketTextStream("112.74.96.150", 8087)
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
    
    
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
    
    
                        String[] split = value.split(",");
                        return Tuple2.of(split[0], Integer.valueOf(split[1]));
                    }
                })
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((value, ts) -> value.f1 * 1000L)
                );


        // 对2条流分别做keyby,key就是关联条件
        KeyedStream<Tuple2<String, Integer>, String> keyedStream1 = streamSource1.keyBy(r1 -> r1.f0);
        KeyedStream<Tuple2<String, Integer>, String> keyedStream2 = streamSource2.keyBy(r2 -> r2.f0);

        // 定义 标记操作符的侧面输出
        OutputTag<Tuple2<String, Integer>> keyedStream1OutputTag = new OutputTag<>("keyedStream1", Types.TUPLE(Types.STRING, Types.INT));
        OutputTag<Tuple2<String, Integer>> keyedStream2OutputTag = new OutputTag<>("keyedStream2", Types.TUPLE(Types.STRING, Types.INT));

        // 执行间隔联结
        SingleOutputStreamOperator<String> process = keyedStream1.intervalJoin(keyedStream2)
                // 指定间隔的上界、下界的偏移,负号代表时间往前,正号代表时间往后
                // 若keyedStream1中某事件时间为5,则其水位线是5-3=2,其上界是 5-2=3 下界是5+2=7 即2-7这个区间能匹配keyedStream2中事件时间是2-7的数据
                .between(Time.seconds(-2), Time.seconds(2))
                // 将streamSource1迟到数据,放入侧输出流
                .sideOutputLeftLateData(keyedStream1OutputTag)
                // 将streamSource2迟到数据,放入侧输出流
                .sideOutputRightLateData(keyedStream2OutputTag)
                // 对匹配数据对的处理操作 只能处理 join上的数据
                .process(
                        new ProcessJoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
    
    
                            /**
                             * 当两条流数据匹配上时调用这个方法
                             * @param left  stream1的数据
                             * @param right stream2的数据
                             * @param ctx   上下文
                             * @param out   采集器
                             * @throws Exception
                             */
                            @Override
                            public void processElement(Tuple2<String, Integer> left, Tuple2<String, Integer> right, Context ctx, Collector<String> out) throws Exception {
    
    
                                // 进入这个方法,是关联上的数据
                                out.collect("stream1 数据: " + left + " 关联 stream2 数据: " + right);
                            }
                        });

        process.print("主流");
        process.getSideOutput(keyedStream1OutputTag).printToErr("streamSource1迟到数据");
        process.getSideOutput(keyedStream2OutputTag).printToErr("streamSource2迟到数据");
        env.execute();
    }

1.2 stream data matching
If an event time in keyedStream1 is 5, then its water level is 5-3=2, its upper bound is 5-2=3, and its lower bound is 5+2=7 means that the range 2-7 can match the data whose event time is 2-7 in keyedStream2

 nc -lk 8086
key1,5
nc -lk 8087
key1,3
key1,7
key1,8
主流> stream1 数据: (key1,5) 关联 stream2 数据: (key1,3)
主流> stream1 数据: (key1,5) 关联 stream2 数据: (key1,7)

2.keyedStream2 late data
At this time, the watermark in keyedStream1 is 5-3=2, the watermark in keyedStream2 is 8-3=5, and the watermark is under multiple degrees of parallelism. Take the minimum, that is, take the water level 2

Input event time 1 in keyedStream2

nc -lk 8087
key1,3
key1,7
key1,8
key1,1

Event time 1 < water mark 2, and event time 1 is not included in the upper bound 5-2=3 and lower bound 5+2=7 of event time 5 of keyedStream1, that is, the data does not match and the streamSource2 data is late

streamSource2迟到数据> (key1,1)

3.keyedStream1 late data

keyedStream1 input event time 7

nc -lk 8086
key1,5
key1,7

At this time, 8 and 7 in streamSource2 are matched.

主流> stream1 数据: (key1,7) 关联 stream2 数据: (key1,8)
主流> stream1 数据: (key1,7) 关联 stream2 数据: (key1,7)

At this time, the water level of keyedStream1 is 7-3=4, and the water level of keyedStream2 is 8-3=5. Under multiple degrees of parallelism, the water level is the smallest, that is, the water level is 4.

keyedStream1 input event time 3

 nc -lk 8086
key1,5
key1,7
key1,3

Event time 3 < water mark 4, and event time 3 is included by the upper bound 3-2=1 and lower bound 3+2=5 of event time 3 of keyedStream2, that is, the data matches and the streamSource1 data is late.

streamSource1迟到数据> (key1,3)

Broadcast stream processing function

Used to connect a main data stream and multiple broadcast data streams. The processElement method can be implemented to process each element of the main data stream, and can also handle broadcast data streams, usually used for data broadcasting and connections.

There are 2 broadcast stream processing functions:

1.BroadcastProcessFunction:

Broadcast connection stream processing function, passed in as a parameter when calling .process() based on BroadcastConnectedStream. It is the product of connecting an ordinary DataStream without keyBy and a broadcast stream BroadcastStream.

2.KeyedBroadcastProcessFunction:

The broadcast connection stream processing function of key partition is also passed in as a parameter when calling .process() based on BroadcastConnectedStream. It is a KeyedStream connected to the broadcast stream

KeyedBroadcastProcessFunction

/**
 * @param <KS>  输入键控流的键类型
 * @param <IN1> 键控 (非广播) 端的输入类型
 * @param <IN2> 广播端的输入类型
 * @param <OUT> 运算符的输出类型
 */
public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT>
        extends BaseBroadcastProcessFunction {
    
    

    private static final long serialVersionUID = -2584726797564976453L;

    /**
     * (非广播) 的键控流中的每个元素调用此方法
     *
     * @param value 流元素
     * @param ctx   允许查询元素的时间戳、查询当前处理/事件时间以及以只读访问迭代广播状态
     * @param out   将结果元素发出
     */
    public abstract void processElement(
            final IN1 value, final ReadOnlyContext ctx, final Collector<OUT> out) throws Exception;

    /**
     * 针对broadcast stream中的每个元素调用该方法
     *
     * @param value stream元素
     * @param ctx   上下文 许查询元素的时间戳、查询当前处理/事件时间和更新广播状态
     * @param out   将结果元素发射到
     */
    public abstract void processBroadcastElement(
            final IN2 value, final Context ctx, final Collector<OUT> out) throws Exception;

    /**
     * 当使用TimerService设置的计时器触发时调用
     *
     * @param timestamp 触发计时器的时间戳
     * @param ctx       上下文
     * @param out       返回结果值的收集器
     */
    public void onTimer(final long timestamp, final OnTimerContext ctx, final Collector<OUT> out)
            throws Exception {
    
    
    }
}

BroadcastProcessFunction

BroadcastProcessFunction is similar to KeyedBroadcastProcessFunction, but it is based on AllWindowedStream, that is, directly opening the window for the data stream without keyBy and calling the .process() method

    public abstract class BroadcastProcessFunction<IN1, IN2, OUT> extends BaseBroadcastProcessFunction {
    
    

        public abstract void processElement( final IN1 value, final ReadOnlyContext ctx, final Collector<OUT> out) throws Exception;

        public abstract void processBroadcastElement(final IN2 value, final Context ctx, final Collector<OUT> out) throws Exception;
    }

Usage example

Take using KeyedBroadcastProcessFunction as an example:

public class KeyedBroadcastProcessFunctionExample {
    
    

    /**
     * 主流 数据对象
     */
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class MainRecord {
    
    
        private String key;
        private int value;
    }

    /**
     * 广播流 数据对象
     */
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class BroadcastRecord {
    
    
        private String configKey;
        private int configValue;
    }

    /**
     * 结果 数据对象
     */
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class ResultRecord {
    
    
        private String key;
        private int result;
    }


    // 使用给定的名称和给定的类型信息新建一个MapStateDescriptor
    static MapStateDescriptor<String, Integer> mapStateDescriptor = new MapStateDescriptor<>("broadcastState", String.class, Integer.class);


    public static void main(String[] args) throws Exception {
    
    
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 创建主数据流
        DataStream<MainRecord> mainStream = env.fromElements(
                new MainRecord("A", 10),
                new MainRecord("B", 20),
                new MainRecord("A", 30)
        );

        // 创建广播数据流
        DataStream<BroadcastRecord> broadcastStream = env.fromElements(
                new BroadcastRecord("config", 5)
        );

        // 将广播数据流转化为 BroadcastStream
        BroadcastStream<BroadcastRecord> broadcast = broadcastStream.broadcast(mapStateDescriptor);

        // 使用 KeyedBroadcastProcessFunction 连接主数据流和广播数据流
        DataStream<ResultRecord> resultStream = mainStream
                .keyBy(new MainRecordKeySelector())
                .connect(broadcast)
                .process(new MyKeyedBroadcastProcessFunction());

        resultStream.print();

        env.execute("KeyedBroadcastProcessFunction Example");
    }


    /**
     * 使用提供的键对其运算符状态进行分区
     */
    public static class MainRecordKeySelector implements KeySelector<MainRecord, String> {
    
    
        @Override
        public String getKey(MainRecord mainRecord) {
    
    
            return mainRecord.getKey();
        }
    }

    /**
     *
     */
    public static class MyKeyedBroadcastProcessFunction extends KeyedBroadcastProcessFunction<String, MainRecord, BroadcastRecord, ResultRecord> {
    
    

        @Override
        public void processBroadcastElement(BroadcastRecord value, Context ctx, Collector<ResultRecord> out) throws Exception {
    
    
            // 通过上下文获取广播状态
            BroadcastState<String, Integer> broadcastState = ctx.getBroadcastState(mapStateDescriptor);
            // 处理广播数据流中的每个元素,更新广播状态
            broadcastState.put(value.getConfigKey(), value.getConfigValue());
        }

        @Override
        public void processElement(MainRecord value, ReadOnlyContext ctx, Collector<ResultRecord> out) throws Exception {
    
    
            // 在 processElement 中访问广播状态
            ReadOnlyBroadcastState<String, Integer> broadcastState = ctx.getBroadcastState(mapStateDescriptor);

            // 从广播状态中获取配置值
            Integer configValue = broadcastState.get("config");
            // 注意:刚启动时,可能是数据流的第1 2 3...条数据先来 不是广播流先来
            if (configValue == null) {
    
    
                return;
            }

            System.out.println(String.format("主数据流的Key: %s, value: %s, 广播更新结果: %s", value.key, value.value, value.value + configValue));
            // 根据配置值和主数据流中的元素执行处理逻辑
            int result = value.getValue() + configValue;

            // 发出结果记录
            out.collect(new ResultRecord(value.getKey(), result));
        }
    }
}


主数据流的Key: A, value: 10, 广播更新结果: 15
主数据流的Key: B, value: 20, 广播更新结果: 25
2> KeyedBroadcastProcessFunctionExample.ResultRecord(key=B, result=25)
7> KeyedBroadcastProcessFunctionExample.ResultRecord(key=A, result=15)
主数据流的Key: A, value: 30, 广播更新结果: 35
7> KeyedBroadcastProcessFunctionExample.ResultRecord(key=A, result=35)

Guess you like

Origin blog.csdn.net/qq_38628046/article/details/131466269