Flink watermark principle summary

How does Flink deal with out-of-order data?

Flink assigns incoming records to windows by event time and aggregates them there, so that data is generally processed in the order in which the events occurred, and it uses the watermark to trigger each window. (Watermark + window mechanism)

What is a watermark, and how does it work?

A watermark is the mechanism Flink provides for window computations over event time (the other time types do not have to consider disorder). In essence it is also a timestamp. Watermarks exist to handle out-of-order events, and handling them correctly usually means combining the watermark with the window mechanism.

When an operator processes data in event-time windows, it can only start computing a window once it is sure that all messages belonging to that window have reached it. Because messages may arrive out of order, the operator cannot confirm this directly. A WaterMark carries a timestamp, and Flink uses it to mark that all messages with smaller timestamps have already flowed in. Once the Flink data source has confirmed that all messages below a given timestamp have been emitted into the stream processing system, it generates a WaterMark containing that timestamp and inserts it into the stream. An operator caches the incoming messages by time window. When a WaterMark reaches the operator, it processes the data of every window that lies entirely below the WaterMark's timestamp, sends the results to the next operator node, and then forwards the WaterMark to the next operator node as well.
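The buffering-and-firing behaviour described above can be sketched in plain Java. This is a toy model under stated assumptions, not Flink's implementation: the class `ToyWindowOperator` and its methods are hypothetical names, windows are tumbling, timestamps are non-negative milliseconds, and a window fires once the watermark reaches its end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of an event-time window operator (hypothetical, not Flink code). */
class ToyWindowOperator {
    private final long windowSize;
    // window start -> event times buffered for that window, ordered by window start
    private final TreeMap<Long, List<Long>> buffers = new TreeMap<>();

    ToyWindowOperator(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Buffers a record in the tumbling window that covers its event time. */
    void onElement(long eventTime) {
        // assumes eventTime >= 0, so the remainder is non-negative
        long windowStart = eventTime - (eventTime % windowSize);
        buffers.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(eventTime);
    }

    /** Fires and removes every window whose end is at or below the watermark. */
    List<String> onWatermark(long watermark) {
        List<String> fired = new ArrayList<>();
        while (!buffers.isEmpty() && buffers.firstKey() + windowSize <= watermark) {
            Map.Entry<Long, List<Long>> e = buffers.pollFirstEntry();
            fired.add("[" + e.getKey() + "," + (e.getKey() + windowSize) + ") -> " + e.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        ToyWindowOperator op = new ToyWindowOperator(10_000);
        op.onElement(5_000);
        op.onElement(13_000);
        op.onElement(8_000);                         // out of order, but still buffered
        System.out.println(op.onWatermark(9_000));   // nothing fires yet
        System.out.println(op.onWatermark(10_000));  // the first window fires
    }
}
```

The out-of-order record at 8 000 ms still lands in the first window because it arrives before any watermark has passed that window's end.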

What is the watermark for?

In stream processing, an event travels from where it is generated through the source and on to the operators, and this takes time. In most cases data reaches an operator in the order in which the events were generated, but network delays, back pressure, and other causes can scramble that order (out-of-order or late elements).

We cannot wait indefinitely for late elements. There has to be a mechanism that guarantees that, after a certain period of time, the window is triggered and computed. That mechanism is the watermark.

How are watermarks produced?

Typically a watermark is generated immediately after the data is received from the source. It may, however, also be generated after a simple map or filter has been applied downstream of the source.

There are two ways to generate watermarks:

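1. Periodic Watermarks
Periodic watermarks are generated periodically, that is, one watermark is produced every fixed time interval or every fixed number of records.

In real production use, the periodic approach must combine both dimensions, time and record count; otherwise it can easily introduce a large delay in extreme cases.

2. Punctuated Watermarks
With punctuated watermarks, every increasing event time in the stream produces a watermark.
In real production use, the punctuated approach generates a large number of watermarks when TPS is high, which puts a certain amount of pressure on downstream operators, so it is chosen only for scenarios with very strict real-time requirements.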
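To make the volume difference between the two approaches concrete, here is a minimal simulation in plain Java. It is only a sketch: `WatermarkCountSketch`, `punctuatedCount`, and `periodicCount` are hypothetical names, the periodic variant ticks on a record count rather than on wall-clock time so the result is deterministic, and neither method uses Flink's own assigner interfaces.

```java
/** Counts how many watermarks each generation style would emit (hypothetical sketch). */
class WatermarkCountSketch {

    /** Punctuated style: one watermark for every record that raises the max event time. */
    static int punctuatedCount(long[] eventTimes) {
        long max = Long.MIN_VALUE;
        int emitted = 0;
        for (long t : eventTimes) {
            if (t > max) {
                max = t;
                emitted++;
            }
        }
        return emitted;
    }

    /** Periodic style: look at the max once every everyN records, emit only if it advanced. */
    static int periodicCount(long[] eventTimes, int everyN) {
        long max = Long.MIN_VALUE;
        long lastEmitted = Long.MIN_VALUE;
        int emitted = 0;
        for (int i = 0; i < eventTimes.length; i++) {
            max = Math.max(max, eventTimes[i]);
            boolean tick = (i + 1) % everyN == 0;
            if (tick && max > lastEmitted) {
                lastEmitted = max;
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // 1000 records with strictly increasing event times
        long[] eventTimes = new long[1000];
        for (int i = 0; i < eventTimes.length; i++) {
            eventTimes[i] = i;
        }
        System.out.println(punctuatedCount(eventTimes));     // 1000
        System.out.println(periodicCount(eventTimes, 100));  // 10
    }
}
```

On an in-order stream the punctuated style emits one watermark per record, which is the downstream pressure described above; the periodic style trades that volume for a coarser, slightly delayed watermark.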

Code:

Note: Why is Watermark = currentMaxTimestamp - maxLateTime?

First ignore delay, so that watermark = currentMaxTimestamp. The water level rises as data arrives, and once the water level (that is, the current maximum event time) passes the window's end time, all of the window's data has entered the window.

Now allow for delay. To let data that is up to maxLateTime late still get into the window, lower the water level by maxLateTime in advance. If the lowered water level still passes the end time, all data within the allowed delay has entered the window.

If the data is in order, the window is triggered as soon as the latest event time >= windowEndTime.

If the data is out of order, we wait for maxLateTime, so the window is triggered when the latest event time - maxLateTime >= windowEndTime. The code uses this second form: as soon as currentMaxTimestamp - maxLateTime >= windowEndTime, the window is triggered. (Strictly speaking, Flink's default event-time trigger compares the watermark with the window's last timestamp, windowEndTime - 1 ms, so the window fires one millisecond earlier than this condition suggests.)
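The effect of subtracting maxLateTime can be checked by hand on the sample records used in this article. The sketch below is plain Java, not Flink code, and makes several simplifying assumptions: `LatenessSketch` and `firedCount` are hypothetical names, timestamps are milliseconds relative to 11:34:00.000, the watermark is recomputed after every record (the real periodic assigner is only polled every 200 ms by default), and the window fires when watermark >= windowEndTime as in the note above.

```java
/** Replays the sample records against the first 10 s window (hypothetical sketch). */
class LatenessSketch {

    /**
     * Returns how many records the window [0, windowEnd) holds at the moment it fires,
     * or -1 if it never fires. Records that arrive after the firing are not counted.
     */
    static int firedCount(long[] arrivals, long maxLateTime, long windowEnd) {
        long currentMaxTimestamp = 0;   // same starting value as MyWaterMark
        int buffered = 0;
        for (long ts : arrivals) {
            if (ts < windowEnd) {
                buffered++;             // record belongs to the first window
            }
            currentMaxTimestamp = Math.max(ts, currentMaxTimestamp);
            long watermark = currentMaxTimestamp - maxLateTime;
            if (watermark >= windowEnd) {
                return buffered;        // window fires here
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 11:34:05.890, 07.890, 13.890, 08.890, 16.890, 19.890, 21.890 in arrival order
        long[] arrivals = {5_890, 7_890, 13_890, 8_890, 16_890, 19_890, 21_890};
        System.out.println(firedCount(arrivals, 0, 10_000));      // 2
        System.out.println(firedCount(arrivals, 5_000, 10_000));  // 3
    }
}
```

With no allowed lateness the record at 13.890 pushes the watermark past 10.000 and the window fires with two records, so the late 08.890 record misses it. With maxLateTime = 5000 the watermark only reaches 11.890 when 16.890 arrives, by which time 08.890 is already in the window.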

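/**
 * Flink 1.7
 * Sample input, one record per line:
 * hello,2019-09-17 11:34:05.890
 * hello,2019-09-17 11:34:07.890
 * hello,2019-09-17 11:34:13.890
 * hello,2019-09-17 11:34:08.890
 * hello,2019-09-17 11:34:16.890
 * hello,2019-09-17 11:34:19.890
 * hello,2019-09-17 11:34:21.890
 */

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import javax.annotation.Nullable;
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class WaterMarkTest {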

    public static void main(String[] args) throws ParseException {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.setParallelism(1);
        // How often the current watermark is polled; the default is 200 ms
//        env.getConfig().setAutoWatermarkInterval(10000);
//        System.err.println("interval : " + env.getConfig().getAutoWatermarkInterval());

        DataStreamSource<String> streamSource = env.socketTextStream("hdp-01", 9999);

//
        DataStream<String> dataStream = streamSource.assignTimestampsAndWatermarks(new MyWaterMark());


        dataStream.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                String[] split = value.split(",");
                String key = split[0];

                return new Tuple2<>(key, 1);
            }
        }).keyBy(0)
                .timeWindow(Time.seconds(10))
//                .sum(1)
                // a custom window computation
                .apply(new MyWindowFunction())
                .printToErr();

        try {
            env.execute();
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}

/*
 * For each incoming record: extract its timestamp first, update the max value,
 * and then generate the watermark from that max.
 */
class MyWaterMark implements AssignerWithPeriodicWatermarks<String> {
    // largest event time seen so far across all records
    long currentMaxTimestamp = 0;
    long maxLateTime = 5000; // allow records to be up to 5 s late

    Watermark wm=  null;

    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

    // Called periodically to get the current watermark (every 200 ms by default)
    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        // Variant that ignores late / out-of-order data
//        wm = new Watermark(currentMaxTimestamp);

        // Variant that tolerates late / out-of-order data
        wm = new Watermark(currentMaxTimestamp - maxLateTime);
        System.out.println(format.format(System.currentTimeMillis()) + " current watermark: " + wm + "," + format.format(wm.getTimestamp()));
        return wm;
    }

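    /**
     * @param element  a record from the stream, e.g. "hello,2019-09-17 10:24:50.958"
     * @param previousElementTimestamp the timestamp previously assigned to this record
     *                                 (negative if none has been assigned yet)
     * @return the new timestamp
     */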
    @Override
    public long extractTimestamp(String element, long previousElementTimestamp) {
        String[] split = element.split(",");

        String key = split[0];

        long timestamp = 0 ;
        try {
            // parse a time such as 2019-09-17 10:24:50.958 into epoch milliseconds
            timestamp = format.parse(split[1]).getTime();
        } catch (ParseException e) {
            e.printStackTrace();
        }

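        // Keep the larger of the new record's timestamp and the current maximum
        currentMaxTimestamp  = Math.max(timestamp, currentMaxTimestamp);

        // wm stays null until getCurrentWatermark() has run once, so guard against it
        String lastWatermark = (wm == null) ? "none yet" : wm + "," + format.format(wm.getTimestamp());
        System.err.println(key + ", record timestamp: " + timestamp + "," + format.format(timestamp)
                + "|max timestamp so far: " + currentMaxTimestamp + "," + format.format(currentMaxTimestamp)
                + "|watermark timestamp: " + lastWatermark);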

        return timestamp;
    }
}

class MyWindowFunction implements WindowFunction<Tuple2<String,Integer>, String, Tuple, TimeWindow> {

    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<String> out) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

        int sum = 0;

        for(Tuple2<String,Integer> tuple2 : input){
            sum +=tuple2.f1;
        }

        long start = window.getStart();
        long end = window.getEnd();

        out.collect("key:" + tuple.getField(0) + " value: " + sum + "| window_start :"
        + format.format(start) + "  window_end :" + format.format(end)
        );

    }
}

Origin blog.csdn.net/The_Inertia/article/details/104089489