In production, for various reasons, the time an event is created and the time it is processed are often inconsistent, and the resulting delay has a considerable effect on real-time recommendations. So in the general case we choose the creation time (event time) and define Flink time windows over it. But the question is: how do we make sure that all events belonging to a window have arrived? This is where the watermark (waterMark) comes in.
The concept: window operations are based on event time, and event time comes from the source system. Because of network latency, distributed processing, multiple source systems and similar causes, events often arrive out of order with respect to their event time. To deal with this, a threshold called the watermark (Watermark) can be set, which defines the maximum allowed out-of-orderness. For example, if a log entry has the time 2019-01-01 08:00:10 and the maximum allowed out-of-orderness is 10s, then all events generated before 2019-01-01 08:00:00 are considered to have arrived and can be computed.
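The example in the paragraph above can be reproduced with a few lines of plain Scala. This is only a sketch of the idea (largest event time seen minus the allowed out-of-orderness), not Flink's internal implementation; the object and method names are made up for illustration.

```scala
import java.time.{Duration, LocalDateTime}

// Hypothetical helper, not part of Flink.
object WatermarkSketch {
  /** Watermark = largest event time seen so far minus the allowed out-of-orderness. */
  def watermark(maxEventTime: LocalDateTime, maxOutOfOrderness: Duration): LocalDateTime =
    maxEventTime.minus(maxOutOfOrderness)

  /** Events strictly before the watermark are treated as having all arrived. */
  def isComplete(eventTime: LocalDateTime, currentWatermark: LocalDateTime): Boolean =
    eventTime.isBefore(currentWatermark)
}
```

With a log time of 2019-01-01 08:00:10 and a bound of 10 seconds, `watermark` returns 2019-01-01 08:00:00, so an event stamped 07:59:59 counts as complete while one stamped 08:00:05 does not.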
Time windows: windows defined by a fixed time interval
First, sliding window
1, SlidingEventTimeWindows.of(Time.seconds(4), Time.seconds(3)): a sliding window whose size is 4 seconds and whose slide step is 3 seconds, i.e. the window slides forward once every 3 seconds;
2, each record lives for as long as the sliding window size;
3, once the window has slid past an earlier window, the data that belonged only to that earlier window is dropped;
4, when a record arrives, the sliding window moves along with the data (each time the window comes to rest it is computed once; if it does not move, nothing is computed), and the computation fires only when the edge of the window reaches the specified position.
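To make the overlap concrete, here is a small plain-Scala sketch that lists the windows a timestamp falls into for a given size and slide. It follows my understanding of how sliding windows are aligned (window starts are multiples of the slide, zero offset, non-negative timestamps); the names are hypothetical and nothing here calls Flink.

```scala
// Hypothetical helper, not part of Flink.
object SlidingAssignSketch {
  /** Half-open windows [start, end) in milliseconds that contain `ts`, newest first.
    * Assumes ts >= 0 and that window starts are aligned to multiples of `slide`. */
  def windowsFor(ts: Long, size: Long, slide: Long): List[(Long, Long)] = {
    val lastStart = ts - (ts % slide) // latest aligned start that is <= ts
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > ts - size)       // a window contains ts only if start > ts - size
      .map(start => (start, start + size))
      .toList
  }
}
```

With size 4s and slide 3s, a record at 3500 ms belongs to both [3000, 7000) and [0, 4000), while a record at 5000 ms belongs only to [3000, 7000).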
Formula for the time position up to which data is computed:
// n: event timestamp; size: window size; slide: slide length; waterMark: the watermark threshold
// derived from the arithmetic progression formula
an = a1 + (x - 1) * s
a1 = size - slide - 1
x - 1 = [(n - waterMark) - (size - slide)] / slide // integer division; the quotient is then multiplied by slide
s = slide
// when an event with timestamp n arrives, all events up to the specified position are considered to have arrived
specified position = (size - slide - 1) + [(n - waterMark) - (size - slide)] / slide * slide
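The formula above can be transcribed directly into Scala to try out a few values. This is a literal transcription under the assumption that all quantities are in the same time unit and the numerator is non-negative (Scala's integer division truncates toward zero); it has not been checked against Flink's trigger logic, and the names are hypothetical.

```scala
// Hypothetical helper, not part of Flink.
object SlidingPositionSketch {
  /** Literal transcription of the "specified position" formula for sliding windows. */
  def specifiedPosition(n: Long, waterMark: Long, size: Long, slide: Long): Long =
    (size - slide - 1) + ((n - waterMark) - (size - slide)) / slide * slide
}
```

For size 4, slide 2 and waterMark 0, timestamps 10 and 11 both give 9: the position only advances in steps of the slide.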
Second, tumbling window
Windows are based on time and are computed one after another over the continuous data, without overlapping. A tumbling window is a special sliding window: when the window size equals the slide length, the sliding window becomes a tumbling window.
Formula for the time position up to which data is computed:
specified position = (n - waterMark) / size * size - 1 // integer division, then multiplied by size; size is the window size, n is the timestamp
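A short plain-Scala sketch of the non-overlapping behaviour: every timestamp maps to exactly one window whose start is the timestamp rounded down to a multiple of the size. Zero offset and non-negative timestamps are assumed, and the names are hypothetical rather than taken from Flink.

```scala
// Hypothetical helper, not part of Flink.
object TumblingAssignSketch {
  /** The single half-open window [start, end) that contains `ts`.
    * Assumes ts >= 0 and windows aligned to multiples of `size`. */
  def windowFor(ts: Long, size: Long): (Long, Long) = {
    val start = ts - (ts % size)
    (start, start + size)
  }
}
```

With a size of 3000 ms, timestamps 6000 and 8999 both land in [6000, 9000), and 9000 opens the next window [9000, 12000).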
Third, session window
The computation is performed only once a certain time gap has been reached, i.e. the window closes when no new events arrive within the specified gap.
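The gap rule can be illustrated by grouping a list of timestamps into sessions in plain Scala. This is a sketch of the idea only: the names are hypothetical, and treating a gap exactly equal to the threshold as still belonging to the same session is an assumption of this sketch.

```scala
// Hypothetical helper, not part of Flink.
object SessionSketch {
  /** Groups timestamps into sessions; a new session starts when the distance
    * to the previous event is larger than `gapMs`. */
  def sessions(timestamps: Seq[Long], gapMs: Long): List[List[Long]] =
    timestamps.sorted
      .foldLeft(List.empty[List[Long]]) {
        // the newest event of the current session is at its head
        case (current :: done, ts) if ts - current.head <= gapMs => (ts :: current) :: done
        case (acc, ts)                                          => List(ts) :: acc
      }
      .map(_.reverse) // restore chronological order inside each session
      .reverse        // and between sessions
}
```

For the timestamps 1000, 2000, 10000, 10500 and a gap of 3000 ms this yields two sessions: (1000, 2000) and (10000, 10500).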
Test code (requires a telnet producer running on the cluster):
package com.cjs

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{SlidingEventTimeWindows, TumblingEventTimeWindows}

object WaterMarkTest {
  /**
   * Using a watermark requires 3 steps:
   * 1, extract the timestamp from the data, i.e. call assignTimestampsAndWatermarks,
   *    instantiate BoundedOutOfOrdernessTimestampExtractor and override extractTimestamp
   * 2, set the time characteristic to event time, because the watermark is based on event time
   * 3, define a time window: tumbling window (TumblingEventTimeWindows) or sliding window (SlidingEventTimeWindows)
   * If any of these is missing, an exception is reported: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?
   */
  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val streamAdd = senv.socketTextStream("192.168.112.10", 9999)
    val stream = streamAdd.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(0)) { // set the watermark (maximum out-of-orderness)
        // the stream is only read here to obtain the timestamp; the data itself is not affected
        override def extractTimestamp(element: String): Long = {
          // define how the timestamp is extracted from the data (first space-separated field)
          val eventTime = element.split(" ")(0).toLong
          print(s"$eventTime\n")
          eventTime
        }
      }) // after the timestamp is extracted, the stream carries an event time that the windows can use
      .map(x => (x.split(" ")(1), 1L)).keyBy(0)
    // set event time, because the watermark is based on event time
    senv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // define a tumbling window
    // stream.window(TumblingEventTimeWindows.of(Time.seconds(3))).sum(1).print()
    // stream.sum(1).print() // output directly without an event-time window; by default Flink keeps a running total and emits one result per record
    // define a sliding window
    stream.window(SlidingEventTimeWindows.of(Time.seconds(4), Time.seconds(2))).sum(1).print()
    senv.execute("Watermark")
  }
}