[Flink] Watermark

Table of contents

1. About time semantics

1.1 Event time

1.2 Processing time

2. What is a watermark?

2.1 Ordered and out-of-order streams

2.2 Processing of out-of-order data

2.3 Characteristics of watermarks

3. Generating watermarks

3.1 General principles for generating watermarks

3.2 Watermark generation strategy

3.3 Flink's built-in watermarks

3.3.1 Built-in watermarks for ordered streams

3.3.2 Built-in watermarks for out-of-order streams

3.4 Custom watermark generators

3.4.1 Periodic generator

3.4.2 Punctuated generator

3.4.3 Sending watermarks from the data source

4. Watermark propagation

5. Handling of late data


1. About time semantics

1.1 Event time

Business log data normally records the timestamp at which each record was generated, and this timestamp is the basis for event time. Since version 1.12, Flink has used event time as its default time semantics.

1.2 Processing time

2. What is a watermark?

In Flink, the marker used to measure the progress of event time is called a "watermark". Put simply, a watermark is a timestamp carried in the stream: it declares that event time has advanced to that point.

2.1 Ordered and out-of-order streams

In an ordered stream, records arrive in the order in which they were generated, and in the ideal case every record produces its own watermark.

This ideal only works when the data volume is small. In practice the volume is usually large and the interval between records is tiny, so for efficiency a watermark is generated periodically rather than per record.

In production, factors such as network transmission between services mean that records often do not arrive in the order we expect. Such a stream is an out-of-order stream.

2.2 Processing of out-of-order data

Because the data is out of order, a window that closes as soon as event time reaches its end would miss records that arrive "late". To let the window collect them, we can make it wait for a while, for example 2 seconds. In other words, the watermark is set to the largest timestamp seen so far minus a delay, which reduces the chance of losing data.
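The waiting idea can be sketched in plain Java without any Flink dependency. The class name, the sample timestamps and the window [0 s, 10 s) below are illustrative assumptions, not part of the original example; the sketch only models a watermark that trails the largest timestamp seen so far by a fixed delay.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model (no Flink dependency) of a watermark that trails the
// largest timestamp seen so far by a fixed delay.
final class DelayedWatermarkSketch {

    // Returns the watermark (in ms) as it stands after each event has been seen.
    static List<Long> watermarks(long[] eventTimestampsMs, long delayMs) {
        List<Long> result = new ArrayList<>();
        long maxTs = Long.MIN_VALUE + delayMs; // start value avoids underflow when subtracting
        for (long ts : eventTimestampsMs) {
            maxTs = Math.max(maxTs, ts);
            result.add(maxTs - delayMs);
        }
        return result;
    }

    public static void main(String[] args) {
        // Event times in ms; the 9 s record arrives after the 11 s record.
        long[] events = {7_000, 11_000, 9_000, 12_000};
        long windowEnd = 10_000; // hypothetical window [0 s, 10 s)
        List<Long> wms = watermarks(events, 2_000);
        for (int i = 0; i < events.length; i++) {
            boolean fires = wms.get(i) >= windowEnd;
            System.out.println("event=" + events[i]
                    + " watermark=" + wms.get(i)
                    + " windowFires=" + fires);
        }
    }
}
```

With a 2-second delay, the watermark is still at 9 s when the 9 s record arrives after the 11 s record, so the window has not fired yet and can still collect it. The window fires only once the 12 s record pushes the watermark to 10 s.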

2.3 Characteristics of watermarks

3. Generating watermarks

3.1 General principles for generating watermarks

A perfect watermark is "absolutely correct": once it appears, all data with earlier timestamps has arrived and no such data will ever appear again. Guaranteeing that requires waiting long enough, which means higher latency.

For faster, more real-time processing, we can set a smaller watermark delay. Then more late data may arrive after the watermark, so windows miss records and produce inaccurate results. If accuracy does not matter at all and only speed counts, we can use processing-time semantics directly, which in theory gives the lowest latency.

The watermark in Flink is therefore a mechanism for trading off low latency against result correctness in stream processing, and it hands that control to the programmer: the watermark generation strategy is defined in code.
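To make the trade-off concrete, the following plain-Java sketch counts how many records would already be behind the watermark on arrival for a given delay. The method name and the sample timestamps are assumptions chosen for illustration, and the model ignores windows entirely.

```java
// Plain-Java model of the latency/correctness trade-off: a larger delay
// means fewer late records but a watermark that lags further behind.
final class WatermarkTradeoffSketch {

    // Counts records whose timestamp is already behind the watermark
    // (largest earlier timestamp minus delay) at the moment they arrive.
    static int countLate(long[] timestampsMs, long delayMs) {
        int late = 0;
        long maxTs = Long.MIN_VALUE + delayMs; // start value avoids underflow when subtracting
        for (long ts : timestampsMs) {
            long watermark = maxTs - delayMs; // watermark before this record is seen
            if (ts < watermark) {
                late++;
            }
            maxTs = Math.max(maxTs, ts);
        }
        return late;
    }

    public static void main(String[] args) {
        long[] sample = {1_000, 5_000, 2_000, 4_000, 9_000, 3_000};
        for (long delay : new long[]{0, 2_000, 6_000}) {
            System.out.println("delay=" + delay + "ms late=" + countLate(sample, delay));
        }
    }
}
```

On this sample, a delay of 0 leaves three records late, 2 seconds leaves two, and 6 seconds leaves none, at the cost of every result arriving 6 seconds later.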

3.2 Watermark generation strategy

Flink's DataStream API has a dedicated method for generating watermarks, .assignTimestampsAndWatermarks(). It assigns timestamps to the elements of a stream and generates watermarks that indicate event-time progress.

DataStream<Event> stream = env.addSource(new ClickSource());

DataStream<Event> withTimestampsAndWatermarks = 
stream.assignTimestampsAndWatermarks(<watermark strategy>);

The method takes a WatermarkStrategy as its parameter; this is the "watermark generation strategy". WatermarkStrategy is an interface that combines a timestamp assigner (TimestampAssigner) and a watermark generator (WatermarkGenerator).

public interface WatermarkStrategy<T> 
    extends TimestampAssignerSupplier<T>,
            WatermarkGeneratorSupplier<T>{

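    // Extracts a timestamp from a field of each stream element and assigns it to the element.
    // Timestamp assignment is the basis for generating watermarks.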
    @Override
    TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context);

    // Generates watermarks from the timestamps according to the chosen policy.
    @Override
    WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);
}

3.3 Flink's built-in watermarks

3.3.1 Built-in watermarks for ordered streams

In an ordered stream the timestamps increase monotonically, so late data never occurs. This is the simplest case of periodic watermark generation and is handled by calling WatermarkStrategy.forMonotonousTimestamps().

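import org.apache.commons.lang3.time.DateFormatUtils;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// WaterSensor and WaterSensorMapFunction are the author's own classes (not shown here).
public class WatermarkMonoDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> sensorDS = env
                .socketTextStream("hadoop102", 7777)
                .map(new WaterSensorMapFunction());

        // 1. Define the watermark strategy
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // 1.1 Watermark generation: ascending timestamps, no waiting time
                .<WaterSensor>forMonotonousTimestamps()
                // 1.2 Timestamp assigner: extract the timestamp from the data
                .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                    @Override
                    public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                        // The returned timestamp must be in milliseconds
                        System.out.println("data=" + element + ",recordTs=" + recordTimestamp);
                        return element.getTs() * 1000L;
                    }
                });

        // 2. Apply the watermark strategy
        SingleOutputStreamOperator<WaterSensor> sensorDSwithWatermark = sensorDS.assignTimestampsAndWatermarks(watermarkStrategy);


        sensorDSwithWatermark.keyBy(sensor -> sensor.getId())
                // 3. Use a window with event-time semantics
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(
                        new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {

                            @Override
                            public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                                long startTs = context.window().getStart();
                                long endTs = context.window().getEnd();
                                String windowStart = DateFormatUtils.format(startTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                String windowEnd = DateFormatUtils.format(endTs, "yyyy-MM-dd HH:mm:ss.SSS");

                                long count = elements.spliterator().estimateSize();

                                out.collect("key=" + s + " window [" + windowStart + "," + windowEnd + ") contains " + count + " records ===> " + elements.toString());
                            }
                        }
                )
                .print();

        env.execute();
    }
}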

3.3.2 Built-in watermarks for out-of-order streams

For out-of-order streams, call WatermarkStrategy.forBoundedOutOfOrderness().

The method takes a maxOutOfOrderness parameter, the "maximum out-of-orderness": the largest amount by which a record's timestamp may lag behind the timestamps of records that have already arrived.

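import java.time.Duration;

import org.apache.commons.lang3.time.DateFormatUtils;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// WaterSensor and WaterSensorMapFunction are the author's own classes (not shown here).
public class WatermarkOutOfOrdernessDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);


        SingleOutputStreamOperator<WaterSensor> sensorDS = env
                .socketTextStream("hadoop102", 7777)
                .map(new WaterSensorMapFunction());


        // 1. Define the watermark strategy
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // 1.1 Watermark generation: out-of-order data, wait 3 s
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // 1.2 Timestamp assigner: extract the timestamp from the data
                .withTimestampAssigner(
                        (element, recordTimestamp) -> {
                            // The returned timestamp must be in milliseconds
                            System.out.println("data=" + element + ",recordTs=" + recordTimestamp);
                            return element.getTs() * 1000L;
                        });

        // 2. Apply the watermark strategy
        SingleOutputStreamOperator<WaterSensor> sensorDSwithWatermark = sensorDS.assignTimestampsAndWatermarks(watermarkStrategy);


        sensorDSwithWatermark.keyBy(sensor -> sensor.getId())
                // 3. Use a window with event-time semantics
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(
                        new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {

                            @Override
                            public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                                long startTs = context.window().getStart();
                                long endTs = context.window().getEnd();
                                String windowStart = DateFormatUtils.format(startTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                String windowEnd = DateFormatUtils.format(endTs, "yyyy-MM-dd HH:mm:ss.SSS");

                                long count = elements.spliterator().estimateSize();

                                out.collect("key=" + s + " window [" + windowStart + "," + windowEnd + ") contains " + count + " records ===> " + elements.toString());
                            }
                        }
                )
                .print();

        env.execute();
    }
}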
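The "maximum out-of-orderness" can be illustrated with a small plain-Java helper that measures it on a recorded sample of timestamps. The helper and its name are hypothetical, not a Flink API, and a sample can only show the disorder that happened to occur in it, so a production value would likely need extra headroom.

```java
// Plain-Java helper (not part of Flink) that measures how far records in a
// sample lag behind the largest timestamp that arrived before them.
final class OutOfOrdernessEstimator {

    // Returns the largest observed lag in ms, or 0 if the sample is in order.
    static long maxOutOfOrdernessMs(long[] timestampsInArrivalOrderMs) {
        long maxSeen = Long.MIN_VALUE;
        long worstLag = 0;
        for (long ts : timestampsInArrivalOrderMs) {
            if (ts < maxSeen) {
                // This record arrived after a record with a larger timestamp.
                worstLag = Math.max(worstLag, maxSeen - ts);
            }
            maxSeen = Math.max(maxSeen, ts);
        }
        return worstLag;
    }

    public static void main(String[] args) {
        long[] sample = {1_000, 4_000, 2_000, 6_000, 3_000, 7_000};
        System.out.println("max out-of-orderness (ms): " + maxOutOfOrdernessMs(sample));
    }
}
```

In this sample the 3 s record arrives after the 6 s record, so the measured value is 3000 ms, which matches the Duration.ofSeconds(3) used in the demo above.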

3.4 Custom watermark generators

3.4.1 Periodic generator

A periodic generator typically observes incoming events in onEvent() and emits watermarks in onPeriodicEmit().

import com.atguigu.bean.Event;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Custom watermark generation.
// ClickSource is the author's own source class (not shown here).
public class CustomPeriodicWatermarkExample {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .addSource(new ClickSource())
                .assignTimestampsAndWatermarks(new CustomWatermarkStrategy())
                .print();

        env.execute();
    }

    public static class CustomWatermarkStrategy implements WatermarkStrategy<Event> {

        @Override
        public TimestampAssigner<Event> createTimestampAssigner(TimestampAssignerSupplier.Context context) {

            return new SerializableTimestampAssigner<Event>() {

                @Override
                public long extractTimestamp(Event element, long recordTimestamp) {
                    return element.timestamp; // Tells Flink which field of the record holds the timestamp
                }
            };
        }

        @Override
        public WatermarkGenerator<Event> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
            return new CustomBoundedOutOfOrdernessGenerator();
        }
    }

    public static class CustomBoundedOutOfOrdernessGenerator implements WatermarkGenerator<Event> {

        private long delayTime = 5000L; // Delay in milliseconds
        private long maxTs = -Long.MAX_VALUE + delayTime + 1L; // Largest timestamp observed so far

        @Override
        public void onEvent(Event event, long eventTimestamp, WatermarkOutput output) {
            // Called once for every incoming record
            maxTs = Math.max(event.timestamp, maxTs); // Update the largest timestamp
        }

        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            // Emits the watermark; called every 200 ms by default
            output.emitWatermark(new Watermark(maxTs - delayTime - 1L));
        }
    }
}

To change the default emission interval, use the following method.

// Change the default interval to 400 ms
env.getConfig().setAutoWatermarkInterval(400L);

3.4.2 Punctuated generator

A punctuated generator inspects every event in onEvent() and emits a watermark immediately when it sees an event that carries watermark information. The emission logic simply goes in the onEvent() method.
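As a rough illustration, here is a simplified plain-Java model of a punctuated generator. It does not use Flink's WatermarkGenerator interface; the Event class, its marker flag and the emitted list are stand-ins invented for this sketch, and only the method names mirror onEvent() and onPeriodicEmit().

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a punctuated generator (stand-in types, no Flink dependency).
final class PunctuatedGeneratorSketch {

    // Hypothetical event type: 'marker' flags records that carry watermark information.
    static final class Event {
        final long timestampMs;
        final boolean marker;

        Event(long timestampMs, boolean marker) {
            this.timestampMs = timestampMs;
            this.marker = marker;
        }
    }

    // Stand-in for the watermark output: collects every emitted watermark.
    private final List<Long> emitted = new ArrayList<>();

    // Plays the role of onEvent(): called once per record, emits immediately on a marker.
    void onEvent(Event event) {
        if (event.marker) {
            // Subtract 1 ms so records with the same timestamp are not considered late.
            emitted.add(event.timestampMs - 1);
        }
    }

    // Plays the role of onPeriodicEmit(): intentionally empty, nothing is emitted on a timer.
    void onPeriodicEmit() {
    }

    List<Long> emittedWatermarks() {
        return emitted;
    }

    public static void main(String[] args) {
        PunctuatedGeneratorSketch generator = new PunctuatedGeneratorSketch();
        generator.onEvent(new Event(1_000, false));
        generator.onEvent(new Event(2_000, true));
        generator.onEvent(new Event(3_000, true));
        System.out.println("emitted watermarks: " + generator.emittedWatermarks());
    }
}
```

In a real job the same condition would sit inside the onEvent() method of a WatermarkGenerator implementation, with the periodic method left empty.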

3.4.3 Sending watermarks from the data source

We can also extract event times and emit watermarks inside a custom data source. Note that once the custom source emits watermarks, the program can no longer call assignTimestampsAndWatermarks to generate them: choose one of the two approaches, not both.

env.fromSource(
kafkaSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(3)), "kafkasource"
)

4. Watermark propagation

In stream processing, when an upstream task handles a watermark and its clock advances, it sends the new watermark on, broadcasting it to all downstream subtasks. When a task receives watermarks from several parallel upstream tasks, it uses the smallest one as its own event-time clock.

Passing watermarks between upstream and downstream tasks neatly sidesteps the lack of a unified clock in a distributed system: each task sets its own clock on the criterion that all earlier data has been processed.

In other words, watermark propagation is based on the minimum event time.
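The minimum rule can be modelled in a few lines of plain Java. This is a simplified sketch, not Flink's internal implementation; the class and method names are assumptions.

```java
import java.util.Arrays;

// Simplified model of a task that tracks the latest watermark per upstream
// channel and uses the minimum as its own event-time clock.
final class MinWatermarkTracker {

    private final long[] inputWatermarks;
    private long currentWatermark = Long.MIN_VALUE;

    MinWatermarkTracker(int numInputs) {
        if (numInputs <= 0) {
            throw new IllegalArgumentException("numInputs must be positive");
        }
        inputWatermarks = new long[numInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    // Records a watermark from one upstream channel. Returns true when the
    // task's own clock advanced, i.e. when it should broadcast downstream.
    boolean onWatermark(int inputIndex, long watermark) {
        inputWatermarks[inputIndex] = Math.max(inputWatermarks[inputIndex], watermark);
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        if (min > currentWatermark) {
            currentWatermark = min;
            return true;
        }
        return false;
    }

    long currentWatermark() {
        return currentWatermark;
    }

    public static void main(String[] args) {
        MinWatermarkTracker task = new MinWatermarkTracker(2);
        System.out.println(task.onWatermark(0, 5_000) + " clock=" + task.currentWatermark());
        System.out.println(task.onWatermark(1, 3_000) + " clock=" + task.currentWatermark());
        System.out.println(task.onWatermark(1, 8_000) + " clock=" + task.currentWatermark());
    }
}
```

The clock stays at its initial value until every input has reported a watermark, and afterwards it only moves when the slowest input moves.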

5. Handling of late data

5.1 Delaying watermark advancement

When the watermark is generated, an out-of-orderness tolerance is set. It delays the advancement of the event-time clock, so window computation is triggered later and out-of-order data has more time to enter the window.

WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(10));

5.2 Delaying window closing

When the window computation is triggered, the current result is computed and emitted first, but the window is not closed yet. Until the watermark passes the window end time plus the allowed lateness, each late record that arrives triggers the computation again; only after that point is the window actually closed.

.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.allowedLateness(Time.seconds(3))
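The timing rules can be summarised in a small plain-Java sketch. It assumes, as Flink's event-time windows do, that a window covering [start, end) has a maximum timestamp of end - 1; the class, enum and method names are invented for illustration.

```java
// Plain-Java model of how a record for a given window is treated,
// depending on where the watermark stands when the record arrives.
final class LatenessSketch {

    enum Outcome {
        ON_TIME,           // window has not fired yet
        LATE_BUT_ACCEPTED, // window already fired once; the record triggers another firing
        TOO_LATE           // window is closed; record is dropped unless a side output is configured
    }

    static Outcome classify(long windowEndMs, long allowedLatenessMs, long currentWatermarkMs) {
        long maxTimestamp = windowEndMs - 1; // last timestamp that belongs to the window
        if (currentWatermarkMs < maxTimestamp) {
            return Outcome.ON_TIME;
        }
        if (currentWatermarkMs < maxTimestamp + allowedLatenessMs) {
            return Outcome.LATE_BUT_ACCEPTED;
        }
        return Outcome.TOO_LATE;
    }

    public static void main(String[] args) {
        long windowEnd = 10_000;      // hypothetical window [0 s, 10 s)
        long allowedLateness = 2_000; // 2 s of allowed lateness
        for (long watermark : new long[]{9_000, 11_000, 11_999}) {
            System.out.println("watermark=" + watermark + " -> "
                    + classify(windowEnd, allowedLateness, watermark));
        }
    }
}
```

For a window ending at 10 s with 2 s of allowed lateness, a record arriving while the watermark is at 9 s is on time, one arriving at watermark 11 s still updates the window, and one arriving once the watermark has reached 11.999 s is too late.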

5.3 Using a side output to receive late data

.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.allowedLateness(Time.seconds(3))
.sideOutputLateData(lateWS)

Complete example:

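import java.time.Duration;

import org.apache.commons.lang3.time.DateFormatUtils;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// WaterSensor and WaterSensorMapFunction are the author's own classes (not shown here).
public class WatermarkLateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);


        SingleOutputStreamOperator<WaterSensor> sensorDS = env
                .socketTextStream("hadoop102", 7777)
                .map(new WaterSensorMapFunction());

        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner((element, recordTimestamp) -> element.getTs() * 1000L);

        SingleOutputStreamOperator<WaterSensor> sensorDSwithWatermark = sensorDS.assignTimestampsAndWatermarks(watermarkStrategy);


        OutputTag<WaterSensor> lateTag = new OutputTag<>("late-data", Types.POJO(WaterSensor.class));

        SingleOutputStreamOperator<String> process = sensorDSwithWatermark.keyBy(sensor -> sensor.getId())
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .allowedLateness(Time.seconds(2)) // Keep the window open for 2 more seconds
                .sideOutputLateData(lateTag) // Late data arriving after the window closes goes to the side output
                .process(
                        new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {

                            @Override
                            public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                                long startTs = context.window().getStart();
                                long endTs = context.window().getEnd();
                                String windowStart = DateFormatUtils.format(startTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                String windowEnd = DateFormatUtils.format(endTs, "yyyy-MM-dd HH:mm:ss.SSS");

                                long count = elements.spliterator().estimateSize();

                                out.collect("key=" + s + " window [" + windowStart + "," + windowEnd + ") contains " + count + " records ===> " + elements.toString());
                            }
                        }
                );


        process.print();
        // Get the side output from the main stream and print it
        process.getSideOutput(lateTag).printToErr("late data after window close");

        env.execute();
    }
}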

Origin blog.csdn.net/weixin_38996079/article/details/134521600