Flink Watermarks, with a Demo Example

        In production, an event's creation time and its processing time often differ for various reasons, and the delay introduced while collecting data can noticeably affect real-time recommendations. For that reason we usually work with the creation time (event time) and define time windows in Flink on top of it. The question is then how to make sure that all events belonging to a window have arrived before the window is computed. This is where the watermark (waterMark) comes in.

Concept: watermarks support window operations based on event time. Because event times come from the source systems, network latency, distributed processing and differences between source systems often cause events to arrive out of order. To deal with this, a threshold can be set, the watermark, which defines the maximum out-of-orderness that is tolerated. For example, if a log record carries the time 2019-01-01 08:00:10 and the maximum allowed out-of-orderness is 10s, then all events generated before 2019-01-01 08:00:00 are considered to have arrived and can be computed.
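The rule in this example can be sketched in plain Scala, without Flink. This is only an illustration under assumptions: the object and helper names are made up, timestamps are seconds relative to 08:00:00, and the watermark is modelled as the largest event time seen so far minus the allowed out-of-orderness.

```scala
object WatermarkSketch {
  // Hypothetical helper: watermark = largest event time seen so far - allowed out-of-orderness.
  def watermark(maxEventTime: Long, maxOutOfOrderness: Long): Long =
    maxEventTime - maxOutOfOrderness

  // Watermark after each event of an out-of-order stream; it never moves backwards.
  def runningWatermarks(eventTimes: Seq[Long], maxOutOfOrderness: Long): Seq[Long] =
    eventTimes
      .scanLeft(Long.MinValue)((maxSoFar, t) => math.max(maxSoFar, t))
      .tail
      .map(watermark(_, maxOutOfOrderness))

  def main(args: Array[String]): Unit = {
    // 10 stands for the 08:00:10 record from the text; 7 is a late, out-of-order record.
    val events = Seq(3L, 10L, 7L, 12L)
    println(runningWatermarks(events, maxOutOfOrderness = 10L))
  }
}
```

With a 10-second allowance, the record at second 10 moves the watermark to 0 (08:00:00), and the late record at second 7 does not move it back.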

Time window: a window defined by a fixed time interval.

First, the sliding window

1. SlidingEventTimeWindows.of(Time.seconds(4), Time.seconds(3)): a sliding window whose size is 4 seconds and whose slide step is 3 seconds, i.e. the window slides forward once every 3 seconds;

2. each record stays alive for as long as the sliding window size;

3. if the slide step is larger than the window size, the data that falls between the previous window and the next one is lost;

4. when a record arrives, the window slides along with the data (a window is computed once when it moves; if it does not move, nothing is computed), and the calculation is triggered only when the window edge reaches the specified position.
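The points above can be illustrated with a small, Flink-free sketch that lists the windows a timestamp falls into. It is a simplified model under assumptions: window starts are multiples of the slide (no offset), timestamps are non-negative, and the names `SlidingWindowSketch` and `windowsFor` are hypothetical.

```scala
object SlidingWindowSketch {
  // All windows [start, end) that contain `timestamp`,
  // assuming window starts are multiples of `slide` and timestamp >= 0.
  def windowsFor(timestamp: Long, size: Long, slide: Long): Seq[(Long, Long)] = {
    val lastStart = timestamp - (timestamp % slide) // latest window start <= timestamp
    Iterator
      .iterate(lastStart)(_ - slide)                // walk back one slide at a time
      .takeWhile(start => start > timestamp - size) // stop once the window no longer covers the timestamp
      .map(start => (start, start + size))
      .toList
  }

  def main(args: Array[String]): Unit = {
    println(windowsFor(3L, size = 4L, slide = 3L)) // overlapping windows: the record belongs to two of them
    println(windowsFor(5L, size = 4L, slide = 3L)) // the record belongs to one window
    println(windowsFor(5L, size = 2L, slide = 3L)) // slide > size: the record falls into a gap
  }
}
```

The last call returns an empty list, which is the data loss described in point 3.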

Formula for the position up to which data is calculated:

// n: timestamp; size: window size; slide: slide length
// derived from the arithmetic-sequence formula
an = a1 + (x - 1) * s
a1 = size - slide - 1
x  = [n - (size - slide)] / slide      // integer division, the result is then multiplied by slide
s  = slide

// when an event with timestamp n arrives, all events up to the specified position are considered to have arrived
specified position = (size - slide - 1) + [(n - waterMark) - (size - slide)] / slide * slide
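As a quick way to try out numbers, the last formula can be transcribed into a small function. This is a literal transcription of the formula in the text, not Flink's internal code; the names `PositionSketch` and `specifiedPosition` are hypothetical, `waterMark` is taken to be the allowed out-of-orderness, and the division is assumed to round down.

```scala
object PositionSketch {
  // Direct transcription of the formula in the text.
  // n: event timestamp; waterMark: allowed out-of-orderness; size/slide: window size and slide length.
  // Assumption: "/" is integer division rounding down, hence Math.floorDiv.
  def specifiedPosition(n: Long, waterMark: Long, size: Long, slide: Long): Long =
    (size - slide - 1) + Math.floorDiv((n - waterMark) - (size - slide), slide) * slide

  def main(args: Array[String]): Unit = {
    // size = 4, slide = 2, no allowed delay; all values in the same unit as the timestamps
    (5L to 9L).foreach(n => println(s"n=$n -> ${specifiedPosition(n, 0L, 4L, 2L)}"))
  }
}
```

When slide equals size, the expression reduces to (n - waterMark) / size * size - 1.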

Second, the tumbling window

Based on time windows: the stream is processed in consecutive windows that do not overlap. A tumbling window is a special case of the sliding window: when the window size equals the slide length, the sliding window becomes a tumbling window.
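The non-overlapping behaviour can be sketched without Flink as follows. The sketch assumes window starts are multiples of the window size (no offset) and non-negative timestamps; `TumblingWindowSketch` and `windowFor` are hypothetical names.

```scala
object TumblingWindowSketch {
  // The single window [start, end) that contains `timestamp`,
  // assuming window starts are multiples of `size` and timestamp >= 0.
  def windowFor(timestamp: Long, size: Long): (Long, Long) = {
    val start = timestamp - (timestamp % size)
    (start, start + size)
  }

  def main(args: Array[String]): Unit = {
    // Every timestamp maps to exactly one window of size 3.
    (0L to 6L).foreach(t => println(s"$t -> ${windowFor(t, 3L)}"))
  }
}
```

Each timestamp lands in exactly one window, unlike the sliding case where a record may belong to several windows or to none.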

Formula for the position up to which data is calculated:

specified position = -1 + (n - waterMark) / size * size      // integer division first, then multiply by size; size is the window size, n is the timestamp

Third, the session window

Events are grouped by activity: the calculation only covers events that arrive within a given time gap of one another.
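A rough, Flink-free sketch of grouping by gap is shown below. It assumes the event times are already sorted; the names are hypothetical, and the choice to start a new session when the distance to the previous event is equal to or larger than the gap is an arbitrary one made for the example.

```scala
object SessionWindowSketch {
  // Splits sorted event times into sessions; a new session starts
  // when the distance to the previous event is >= `gap`.
  def sessions(sortedTimes: Seq[Long], gap: Long): List[List[Long]] =
    sortedTimes
      .foldLeft(List.empty[List[Long]]) {
        // Close enough to the most recent event: extend the current session.
        case (current :: done, t) if t - current.head < gap => (t :: current) :: done
        // Otherwise open a new session.
        case (acc, t) => List(t) :: acc
      }
      .map(_.reverse)
      .reverse

  def main(args: Array[String]): Unit = {
    println(sessions(Seq(1L, 2L, 3L, 10L, 11L), gap = 5L))
  }
}
```

Here the events at 1, 2 and 3 form one session and the events at 10 and 11 form another, because the pause between 3 and 10 exceeds the gap.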


Test code (requires a producer on the cluster, e.g. telnet, that sends data to the socket):

package com.cjs

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{SlidingEventTimeWindows, TumblingEventTimeWindows}

object WaterMarkTest {

    /**
      * Using a WaterMark requires 3 steps:
      * 1. extract the timestamp from the data, i.e. call assignTimestampsAndWatermarks
      *    with an instance of BoundedOutOfOrdernessTimestampExtractor that overrides extractTimestamp
      * 2. set the time characteristic to event time, because the WaterMark is based on event time
      * 3. define a time window: tumbling window (TumblingEventTimeWindows) or sliding window (SlidingEventTimeWindows)
      * If any step is missing, an exception is reported: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?
      */
    def main(args: Array[String]): Unit = {
        val senv = StreamExecutionEnvironment.getExecutionEnvironment

        val streamAdd = senv.socketTextStream("192.168.112.10", 9999)
        val stream = streamAdd.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(0)) {   // set the WaterMark
                    // called for each record to obtain its timestamp; the data stream itself is not changed
                    override def extractTimestamp(element: String): Long = {
                        // defines how the timestamp is extracted from the data
                        val eventTime = element.split(" ")(0).toLong
                        print(s"$eventTime\n")
                        eventTime
                    }
                })   // after timestamp extraction, the stream carries event time for the windows
            .map(x => (x.split(" ")(1), 1L)).keyBy(0)

        // set event time, since the WaterMark is based on event time
        senv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // define a tumbling window
//         stream.window(TumblingEventTimeWindows.of(Time.seconds(3))).sum(1).print()
//         stream.sum(1).print()    // output directly without an event-time window; by default Flink keeps a running total and emits one result per record
        // define a sliding window
        stream.window(SlidingEventTimeWindows.of(Time.seconds(4), Time.seconds(2))).sum(1).print()
        senv.execute("Watermark")
    }

}

 


Origin www.cnblogs.com/SysoCjs/p/11466274.html