Flink's Low-Level API: ProcessFunction

1. Flink's transformation operators cannot access an event's timestamp or the current watermark, so the DataStream API provides a set of low-level APIs for reading event timestamps and watermarks and for registering timers. Flink SQL is implemented on top of ProcessFunction.

Flink provides eight ProcessFunctions:

ProcessFunction

KeyedProcessFunction

CoProcessFunction

ProcessJoinFunction

BroadcastProcessFunction

KeyedBroadcastProcessFunction

ProcessWindowFunction

ProcessAllWindowFunction
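The examples below cover KeyedProcessFunction and ProcessFunction. For orientation, here is a minimal sketch of one of the others, a CoProcessFunction over two connected streams; the ports, the stream contents, the job name, and the simple tagging logic are illustrative assumptions rather than part of the original post, and imports are omitted in the same style as the snippets below.

// minimal sketch: a CoProcessFunction reacts to elements from two connected streams
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> first = env.socketTextStream("localhost", 7777);
DataStream<String> second = env.socketTextStream("localhost", 8888);
first.connect(second)
        .process(new CoProcessFunction<String, String, String>() {
            @Override
            public void processElement1(String value, Context ctx, Collector<String> out) throws Exception {
                out.collect("first: " + value);   // called for every element of the first stream
            }

            @Override
            public void processElement2(String value, Context ctx, Collector<String> out) throws Exception {
                out.collect("second: " + value);  // called for every element of the second stream
            }
        })
        .print();
env.execute("CoProcessFunctionSketch");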

Take KeyedProcessFunction as an example:

StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
// input lines: a key and an integer separated by whitespace
DataStreamSource<String> stream = environment.socketTextStream("localhost", 7777);
SingleOutputStreamOperator<String> operator = stream.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        String[] split = s.split("\\s");
        return new Tuple2<String, Integer>(split[0], Integer.valueOf(split[1]));
    }
}).keyBy(0)
        .process(new KeyedProcessFunction<Tuple, Tuple2<String, Integer>, String>() {
            private long lazyTime = 10 * 1000; // delay before the timer fires: 10 seconds

            @Override
            public void processElement(Tuple2<String, Integer> stringIntegerTuple2, Context context, Collector<String> collector) throws Exception {
                Integer value = stringIntegerTuple2.f1;
                if (value > 10) { // value exceeds the threshold: register a processing-time timer
                    context.timerService().registerProcessingTimeTimer(context.timerService().currentProcessingTime() + lazyTime);
                }
                collector.collect(stringIntegerTuple2.toString());
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                out.collect(ctx.getCurrentKey() + " exceeds the threshold");
            }
        });
operator.print();
environment.execute(KeyProcessFunctionTest.class.getSimpleName());

Notes:

(1) processElement is invoked once for every element in the stream.

(2) The context gives access to the current watermark, the timer service (timerService), and related information (a short sketch follows these notes).

(3) context.timerService().registerProcessingTimeTimer(context.timerService().currentProcessingTime() + lazyTime)

Each timer is identified by its timestamp; the argument passed to registerProcessingTimeTimer is the point in time at which the timer should fire.
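To make notes (2) and (3) concrete, here is a small fragment assumed to run inside processElement of a KeyedProcessFunction such as the one above (lazyTime is the field from that example; the local variable names are illustrative). It only shows which methods the context and the timer service expose.

// illustrative fragment, assumed to sit inside processElement of a KeyedProcessFunction
TimerService timerService = context.timerService();
Long elementTimestamp = context.timestamp();                        // timestamp of the current element (null if none is assigned)
long currentWatermark = timerService.currentWatermark();            // current event-time watermark
long currentProcessingTime = timerService.currentProcessingTime();  // current processing time

long fireAt = currentProcessingTime + lazyTime;
timerService.registerProcessingTimeTimer(fireAt);                   // onTimer fires at this processing time
timerService.registerEventTimeTimer(currentWatermark + lazyTime);   // onTimer fires once the watermark passes this timestamp
timerService.deleteProcessingTimeTimer(fireAt);                     // timers are addressed by their timestamp and can be deleted again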

2. Side-output streams (side output)

Most DataStream API operators produce a single output stream. The split operator can produce multiple streams, but all of those streams carry the same data type.

The side-output mechanism of ProcessFunction can produce multiple streams with different data types.

StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

// input lines: a key and an integer separated by whitespace
DataStreamSource<String> stream = environment.socketTextStream("localhost", 7777);

SingleOutputStreamOperator<Tuple2<String, Integer>> outputTest = stream.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        String[] split = s.split("\\s");
        return new Tuple2<String, Integer>(split[0], Integer.valueOf(split[1]));
    }
}).process(new ProcessFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
    @Override
    public void processElement(Tuple2<String, Integer> stringIntegerTuple2, Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
        Integer value = stringIntegerTuple2.f1;
        if (value > 10) {
            // values above 10 go to the Tuple2-typed side output
            OutputTag<Tuple2<String, Integer>> outputTag = new OutputTag<Tuple2<String, Integer>>("outputTest_tuple"){};
            context.output(outputTag, stringIntegerTuple2);
        } else if (value > 5) {
            // values above 5 go to the String-typed side output (only the key is emitted)
            OutputTag<String> outputTag = new OutputTag<String>("outputTest_String"){};
            context.output(outputTag, stringIntegerTuple2.f0);
        } else {
            // everything else stays on the main output
            collector.collect(stringIntegerTuple2);
        }
    }
});
//retrieve the Tuple2-typed side output by the tag id outputTest_tuple
DataStream<Tuple2<String, Integer>> sideOutput_tuple = outputTest.getSideOutput(new OutputTag<Tuple2<String, Integer>>("outputTest_tuple"){});
sideOutput_tuple.print();
//retrieve the String-typed side output by the tag id outputTest_String
DataStream<String> sideOutput_String = outputTest.getSideOutput(new OutputTag<String>("outputTest_String"){});
sideOutput_String.print();
environment.execute(SideoutputTest.class.getSimpleName());
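One note on the design: the tag passed to context.output and the tag passed to getSideOutput only have to agree on id and type, and the anonymous subclass ({}) is what lets Flink capture the generic type. A common variant, sketched below against the same stream source and with illustrative names (tupleTag, stringTag, mainStream), declares each OutputTag once and reuses it in both places; it is meant as a drop-in replacement for the map/process/getSideOutput pipeline above, not as additional code.

// a variant of the example above: declare each OutputTag once and reuse it
final OutputTag<Tuple2<String, Integer>> tupleTag = new OutputTag<Tuple2<String, Integer>>("outputTest_tuple"){};
final OutputTag<String> stringTag = new OutputTag<String>("outputTest_String"){};

SingleOutputStreamOperator<Tuple2<String, Integer>> mainStream = stream
        .map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split("\\s");
                return new Tuple2<String, Integer>(split[0], Integer.valueOf(split[1]));
            }
        })
        .process(new ProcessFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public void processElement(Tuple2<String, Integer> element, Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
                if (element.f1 > 10) {
                    context.output(tupleTag, element);       // Tuple2-typed side output
                } else if (element.f1 > 5) {
                    context.output(stringTag, element.f0);   // String-typed side output
                } else {
                    collector.collect(element);              // main output
                }
            }
        });

mainStream.getSideOutput(tupleTag).print();
mainStream.getSideOutput(stringTag).print();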


Reprinted from blog.csdn.net/zuodaoyong/article/details/103637871