Flink WaterMark with TumblingEventWindow

1, WaterMark. The term can be translated as "watermark" or as "water level line"; "watermark" is the more abstract rendering, while "water level line" is more down to earth.

Watermarks are used to handle out-of-order events, usually in combination with the window mechanism.

Between the moment an event is generated and the moment it reaches an operator, it passes through the source and then the operator chain, which takes time. In most cases data arrive at the operator in the order in which the events occurred, but network problems, back pressure and other causes can lead to out-of-order or late elements.

For late or out-of-order elements we cannot wait indefinitely. There must be a mechanism which guarantees that, after a certain time, the window is triggered and computed. That mechanism is the watermark. Window boundaries are aligned to natural time and follow the left-closed, right-open principle.

Ordered stream: the watermark effectively coincides with the timestamp of the event.

 

Out-of-order stream: the watermark is what triggers window computation. Until the watermark reaches the end of a window, that window is not triggered, even if records have already fallen into several windows. Once the watermark reaches it, the window is triggered even if some of its data have not yet arrived. By default, data that arrive after that point are discarded.
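The relationship between event timestamps and the watermark can be sketched in plain Java. This is a simplified model, not Flink code: the class name BoundedWatermarkSketch and its method onEvent are made up for illustration, and it assumes the watermark is simply the largest timestamp seen so far minus a fixed delay.

```java
/** Simplified model of a bounded-out-of-orderness watermark (hypothetical class, not Flink API). */
public class BoundedWatermarkSketch {
    private final long maxDelayMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    public BoundedWatermarkSketch(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
    }

    /** Records an event timestamp and returns the watermark after seeing it. */
    public long onEvent(long eventTimestamp) {
        // The watermark never moves backwards: it follows the largest timestamp seen.
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        return maxTimestampSeen - maxDelayMs;
    }

    public static void main(String[] args) {
        BoundedWatermarkSketch wm = new BoundedWatermarkSketch(3000);
        long[] events = {10000, 12000, 11000, 23000};
        for (long ts : events) {
            System.out.println("event " + ts + " -> watermark " + wm.onEvent(ts));
        }
    }
}
```

With a delay of 0 the watermark equals the largest timestamp seen, which corresponds to the ordered case. With a delay of 3000 the out-of-order record 11000 does not move the watermark, and 23000 moves it to 20000.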

2, TumblingEventWindow combined with WaterMark: verifying ordered and out-of-order streams with code.

Receive text from a socket. Each line is a pair (timestamp + text) with a space as the field separator and "\n" as the line delimiter. The job counts the received text per key over 10-second tumbling windows.
Ordered case: the watermark delay is 0, i.e. the watermark does not lag behind the received data.
Out-of-order case: the watermark delay is 3 s, so window computation is triggered 3 seconds late.

code:

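import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TumblingEventWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        env.setParallelism(1);
        // Use the timestamps carried by the records instead of processing time.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Each input line has the form "<timestamp> <text>".
        DataStream<String> socketStream = env.socketTextStream("192.168.31.10", 9000);
        DataStream<Tuple2<String, Long>> resultStream = socketStream
                // For the ordered case, change Time.seconds(3) to Time.seconds(0).
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(3)) {
                    @Override
                    public long extractTimestamp(String element) {
                        // The first field is the event timestamp in milliseconds.
                        long eventTime = Long.parseLong(element.split(" ")[0]);
                        // Echo the timestamp so it shows up in the console next to the results.
                        System.out.println(eventTime);
                        return eventTime;
                    }
                })
                // Turn each line into (text, 1) so it can be counted per key.
                .map(new MapFunction<String, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(String value) throws Exception {
                        return Tuple2.of(value.split(" ")[1], 1L);
                    }
                }).keyBy(0)
                // 10-second tumbling windows based on event time.
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Sum the counts of records that share a key within a window.
                .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> reduce(Tuple2<String, Long> value1, Tuple2<String, Long> value2) throws Exception {
                        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
                    }
                });
        resultStream.print();

        env.execute();
    }
}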

 

2.1 Ordered case, watermark delay of 0 s

The first window:

10000
11000
12000
13000
14000
19888
13000
20000
1> (b,2)
3> (a,5)

The record with timestamp 20000 triggers the computation of the first window; in fact 19999 would already have triggered it. Because windows are left-closed and right-open, timestamp 20000 is not counted in the first window: the first window is [10000,20000), the second window is [20000,30000), and so on.
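The left-closed, right-open boundaries can be checked with a little arithmetic. The sketch below is plain Java rather than Flink code; the class and method names are made up for illustration, and it assumes non-negative timestamps and windows aligned to the epoch with no offset.

```java
/** Computes tumbling window bounds for a timestamp (hypothetical helper, not Flink API). */
public class TumblingWindowBounds {

    /** Inclusive start of the window that contains the timestamp. */
    public static long windowStart(long timestamp, long sizeMs) {
        return timestamp - (timestamp % sizeMs);
    }

    /** Exclusive end of the window that contains the timestamp. */
    public static long windowEnd(long timestamp, long sizeMs) {
        return windowStart(timestamp, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long size = 10000;
        for (long ts : new long[] {19999, 20000}) {
            System.out.println(ts + " -> [" + windowStart(ts, size) + "," + windowEnd(ts, size) + ")");
        }
    }
}
```

Under these assumptions 19999 is the last timestamp of [10000,20000), while 20000 is the first timestamp of [20000,30000).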

 

The second window:

10000
11000
12000
13000
14000
19888
13000
20000
1> (b,2)
3> (a,5)
11000
12000
21000
22000
29999
3> (a,3)
1> (b,1)

After the first window has been triggered, the two records 11000 and 12000 that arrive afterwards are discarded. The record 29999 directly triggers the computation of the second window, and because it belongs to the second window itself, it is also included in the result.

 

 2.2 Out-of-order case, watermark delay of 3 s

10000
11000
12000
20000
21000
22000
23000
3> (a,2)
1> (b,1)

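The data confirm that the first window was not triggered at 20000 but at 23000. What was computed is the first window [10000,20000), so 20000, 21000, 22000 and 23000 belong to the second window and were not included in the result.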
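The trigger point can be worked out by hand. The sketch below assumes that a window [start, end) fires once the watermark reaches end - 1 and that the watermark is the largest timestamp seen minus the delay; the class and method names are made up for illustration. In a running job watermarks are emitted periodically, so the actual firing happens at the next emission after this threshold is crossed.

```java
/** Arithmetic for when a tumbling window fires (hypothetical helper, not Flink API). */
public class WindowFiringSketch {

    /** A window [start, end) fires once the watermark has reached its last timestamp, end - 1. */
    public static boolean fires(long watermark, long windowEnd) {
        return watermark >= windowEnd - 1;
    }

    /** Smallest event timestamp whose watermark (timestamp - delay) fires the window. */
    public static long firstFiringTimestamp(long windowEnd, long delayMs) {
        return windowEnd - 1 + delayMs;
    }

    public static void main(String[] args) {
        long windowEnd = 20000;
        System.out.println("delay 0 s: fires from timestamp " + firstFiringTimestamp(windowEnd, 0));
        System.out.println("delay 3 s: fires from timestamp " + firstFiringTimestamp(windowEnd, 3000));
        System.out.println("22000 with 3 s delay fires: " + fires(22000 - 3000, windowEnd));
        System.out.println("23000 with 3 s delay fires: " + fires(23000 - 3000, windowEnd));
    }
}
```

Under these assumptions the first window fires from 19999 with no delay and from 22999 with a 3 s delay, which is consistent with 22000 not triggering it and 23000 triggering it in the trace above.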

 

The second window:

10000
11000
12000
20000
21000
22000
23000
3> (a,2)
1> (b,1)
24000
29000
30000
22000
23000
33000
3> (a,6)
1> (b,2)

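The second window [20000,30000) is triggered at 33000, and the late records 22000 and 23000 are included in the result (had these two records arrived after the record 33000 advanced the watermark, they would have been discarded). 30000 and 33000 belong to the third window and are not included.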
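The whole trace can be replayed with a small stand-alone simulation. This is a simplified model and not Flink itself: the class TumblingWindowReplay is made up for illustration, it counts records per window without splitting them by key, it updates the watermark after every record instead of periodically, and it assumes no allowed lateness.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Replays event timestamps through tumbling windows (hypothetical simulation, not Flink API). */
public class TumblingWindowReplay {

    /** Returns one log line per fired window or dropped late record, in the order they occur. */
    public static List<String> replay(long[] timestamps, long sizeMs, long delayMs) {
        List<String> log = new ArrayList<>();
        TreeMap<Long, Integer> openWindows = new TreeMap<>(); // window start -> record count
        long watermark = Long.MIN_VALUE;
        for (long ts : timestamps) {
            long start = ts - (ts % sizeMs);
            if (start + sizeMs - 1 <= watermark) {
                // The window of this record has already fired, so the record is late.
                log.add("dropped " + ts);
            } else {
                openWindows.merge(start, 1, Integer::sum);
            }
            // The watermark trails the largest timestamp by the configured delay.
            watermark = Math.max(watermark, ts - delayMs);
            // Fire every open window whose last timestamp the watermark has reached.
            while (!openWindows.isEmpty() && openWindows.firstKey() + sizeMs - 1 <= watermark) {
                Map.Entry<Long, Integer> w = openWindows.pollFirstEntry();
                log.add("[" + w.getKey() + "," + (w.getKey() + sizeMs) + ") count=" + w.getValue());
            }
        }
        return log;
    }

    public static void main(String[] args) {
        long[] delayed = {10000, 11000, 12000, 20000, 21000, 22000, 23000,
                          24000, 29000, 30000, 22000, 23000, 33000};
        for (String line : replay(delayed, 10000, 3000)) {
            System.out.println(line);
        }
    }
}
```

For the 3 s trace the model reports 3 records in [10000,20000) and 8 in [20000,30000), which matches the sums of the per-key counts printed by the job. Feeding it the trace from section 2.1 with a delay of 0 reports 7 and 4 records, with 11000 and 12000 dropped.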

 

 


Origin www.cnblogs.com/asker009/p/11299848.html