[Flink] Handling late data in Flink

The allowed lateness is set with allowedLateness(lateness: Time).

Late data is saved with sideOutputLateData(outputTag: OutputTag[T]).

Late data is retrieved with DataStream.getSideOutput(tag: OutputTag[X]).

The sections below first explain each of these methods, then work through a concrete example to deepen understanding.

1. allowedLateness(lateness: Time)
def allowedLateness(lateness: Time): WindowedStream[T, K, W] = {
  javaStream.allowedLateness(lateness)
  this
}
This method takes a Time value that sets how long late data is still accepted. This time is a different concept from the time used in the watermark. To recap:

watermark = event time of the data - allowed out-of-orderness

As new data arrives, the watermark is updated to the latest event time minus the allowed out-of-orderness. If a historical (older) record arrives, the watermark is not updated. Overall, the watermark exists so that as much out-of-order data as possible is still received.
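The update rule above can be sketched in plain Scala, without Flink. This is only a simplified model: `WatermarkTracker` and `observe` are hypothetical names chosen for illustration, and times are given in seconds for readability.

```scala
// Simplified model of the watermark rule described above (not a Flink class).
final class WatermarkTracker(maxOutOfOrderness: Long) {
  private var currentMaxTimestamp: Long = Long.MinValue

  /** Registers an event time and returns the watermark after seeing it. */
  def observe(eventTime: Long): Long = {
    // Only a newer event time moves the maximum; an older (historical) record leaves it unchanged.
    currentMaxTimestamp = math.max(currentMaxTimestamp, eventTime)
    currentMaxTimestamp - maxOutOfOrderness
  }
}

object WatermarkTrackerDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new WatermarkTracker(maxOutOfOrderness = 3L)
    // Event time 12 arrives after 15, so it does not move the watermark.
    for (t <- Seq(10L, 15L, 12L, 20L)) {
      println(s"event time $t -> watermark ${tracker.observe(t)}")
    }
  }
}
```

With these inputs the watermark goes 7, 12, 12, 17: the out-of-order record with event time 12 leaves it at 12.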

So what is the Time value here for? It is mainly for waiting for late data: within a certain time range, if a record that belongs to the window arrives, it is still included in the computation. How this range is calculated is explained in detail later.

Note: this method only applies to event-time windows. If the window is based on processing time and a non-zero time value is specified, an exception is thrown.

2. sideOutputLateData(outputTag: OutputTag[T])
def sideOutputLateData(outputTag: OutputTag[T]): WindowedStream[T, K, W] = {
  javaStream.sideOutputLateData(outputTag)
  this
}
This method saves late data under the given outputTag parameter. The OutputTag is an object used to mark late data.

3. DataStream.getSideOutput(tag: OutputTag[X])
Call this method on the DataStream returned by the window operation, passing in the OutputTag that marks late data, to obtain the late data.

4. Understanding late data
Late data means the following:

Suppose the current window covers the range [10-15] and has already been computed. If another record arrives whose event time belongs to that window, say 13, and it triggers the window computation again, that record is called late data.

So the question is: how is the lateness calculated?

Suppose the window range is 10-15 and the allowed lateness is 2s. As long as watermark < 15 + 2, a record that belongs to the window can still trigger the window computation. Once a record pushes the watermark to >= 15 + 2, the 10-15 window no longer triggers a computation, even if the new record's event time is < 15 + 2 + 3.
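The rule for the 10-15 window can be written as a small decision function. This is a sketch of the rule as stated in this article, in whole seconds; `LatenessRule`, `classify` and the three outcome names are hypothetical, and Flink's internal check works in milliseconds, so it may differ at the exact boundary.

```scala
object LatenessRule {
  sealed trait Outcome
  case object OnTime extends Outcome          // window has not fired yet
  case object LateButAccepted extends Outcome // window already fired, still within the allowed lateness
  case object TooLate extends Outcome         // beyond the allowed lateness

  /** Classifies a record of the given window by the current watermark. */
  def classify(windowEnd: Long, allowedLateness: Long, watermark: Long): Outcome =
    if (watermark < windowEnd) OnTime
    else if (watermark < windowEnd + allowedLateness) LateButAccepted
    else TooLate

  def main(args: Array[String]): Unit = {
    // Window 10-15 with 2s of allowed lateness, as in the text.
    for (wm <- Seq(14L, 15L, 16L, 17L)) {
      println(s"watermark $wm -> ${classify(15L, 2L, wm)}")
    }
  }
}
```

For watermarks 15 and 16 the record is late but still accepted. At 17 the window no longer takes it.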

5. Explanation through a code example
The code works as follows:

1. Listen on port 9000 of a host and read data from the socket (format: name:timestamp)

2. Assign a watermark to the data entering the Flink program, computed as eventTime - 3s

3. Group by the name value, divide the data into 5s windows, set the allowed lateness to 2s, and collect the data of each name in window order

4. Output the statistics result and the late data

5. Start the job
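Steps 1 and 3 rely on two small pieces of logic: parsing a `name:timestamp` line and deciding which 5s window a timestamp falls into. The sketch below shows both in plain Scala, without Flink. `WindowAssignment`, `parse` and `windowFor` are hypothetical helper names, and the window arithmetic assumes non-negative timestamps and a window offset of zero.

```scala
import scala.util.Try

object WindowAssignment {
  /** Parses "name:timestamp"; returns None for malformed input. */
  def parse(line: String): Option[(String, Long)] =
    line.split(":") match {
      case Array(name, ts) => Try(ts.trim.toLong).toOption.map(t => (name, t))
      case _               => None
    }

  /** Start and end of the tumbling window that contains the timestamp. */
  def windowFor(timestamp: Long, size: Long): (Long, Long) = {
    val start = timestamp - (timestamp % size)
    (start, start + size)
  }

  def main(args: Array[String]): Unit = {
    // A 5s window in milliseconds, as in step 3.
    parse("flink:1559223685000").foreach { case (name, ts) =>
      println(s"$name -> window ${windowFor(ts, 5000L)}")
    }
  }
}
```

Unlike the program below, which maps malformed lines to a placeholder tuple and filters it out afterwards, this sketch rejects them directly by returning `None`.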

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, OutputTag, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ArrayBuffer

/**
  * Late data test
  * Detailed explanation on the blog: https://blog.csdn.net/hlp4207/article/details/90717905
  */
object WaterMarkFunc02 {
  // thread-safe date formatter
  val sdf: FastDateFormat = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss:SSS")

  def main(args: Array[String]): Unit = {
    val hostName = "S102"
    val port = 9000
    val delimiter = '\n'
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // use event time as the time characteristic of the stream
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val streams: DataStream[String] = env.socketTextStream(hostName, port, delimiter)
    import org.apache.flink.api.scala._
    val data = streams.map(data => {
      // input data format: name:timestamp
      // flink:1559223685000
      try {
        val items = data.split(":")
        (items(0), items(1).toLong)
      } catch {
        case _: Exception => println("the input data does not conform to the format: " + data)
          ("0", 0L)
      }
    }).filter(data => !data._1.equals("0") && data._2 != 0L)

    // assign timestamps to the stream elements and periodically create watermarks to track event-time progress
    val waterStream: DataStream[(String, Long)] = data.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, Long)] {
      // event time
      var currentMaxTimestamp = 0L
      val maxOutOfOrderness = 3000L
      var lastEmittedWatermark: Long = Long.MinValue

      // Returns the current watermark
      override def getCurrentWatermark: Watermark = {
        // allow 3 seconds of out-of-orderness
        val potentialWM = currentMaxTimestamp - maxOutOfOrderness
        // ensure the watermark only ever increases
        if (potentialWM >= lastEmittedWatermark) {
          lastEmittedWatermark = potentialWM
        }
        new Watermark(lastEmittedWatermark)
      }

      // Assigns a timestamp to an element, in milliseconds since the Epoch
      override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
        // use the time field of the element as its timestamp
        val time = element._2
        if (time > currentMaxTimestamp) {
          currentMaxTimestamp = time
        }
        val outData = String.format("key: %s EventTime: %s waterMark: %s",
          element._1,
          sdf.format(time),
          sdf.format(getCurrentWatermark.getTimestamp))
        println(outData)
        time
      }
    })
    val lateData = new OutputTag[(String, Long)]("late")
    val result: DataStream[String] = waterStream.keyBy(0) // group by the name value
      .window(TumblingEventTimeWindows.of(Time.seconds(5L))) // tumbling event-time window with a 5s span
      /**
        * Allow 2 seconds of late data for this window: the first firing happens when watermark >= end-of-window,
        * and a second (or further) firing happens while watermark < end-of-window + allowedLateness and late data arrives for this window
        */
      .allowedLateness(Time.seconds(2L))
      .sideOutputLateData(lateData)
      .apply(new WindowFunction[(String, Long), String, Tuple, TimeWindow] {
        override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
          // collect the formatted event times of all records in this window
          val timeArr = ArrayBuffer[String]()
          val iterator = input.iterator
          while (iterator.hasNext) {
            val tup2 = iterator.next()
            timeArr.append(sdf.format(tup2._2))
          }
          val outData = String.format("key: %s data: %s startTime: %s endTime: %s",
            key.toString,
            timeArr.mkString("-"),
            sdf.format(window.getStart),
            sdf.format(window.getEnd))
          out.collect(outData)
        }
      })
    result.print("window result:")

    // records that arrived after the allowed lateness had passed
    val late = result.getSideOutput(lateData)
    late.print("late data:")

    env.execute(this.getClass.getName)
  }
}
Next, start the job and enter test data to verify:

You can see that the window range is [15-20]. Now we enter a few more records that belong to this range:

The three records with event times 17, 16 and 15 each trigger the window computation. Next, let's try entering data for the window range [10-15]:

Data in the window range [10-15] is late data that has exceeded the maximum waiting time. We can work out the watermark value up to which this window still accepts late data:

window end time + allowed lateness = maximum watermark

15 + 2 = 17

The current watermark is 20, which is greater than 17, so data in the 10-15 window range is already too late.

Now calculate the critical value for the 15-20 window:

20 + 2 = 22

That is, once the watermark rises to 22, data in the 15-20 window range is too late and can no longer take part in the computation.

Keep in mind the critical value of 22 that we calculated, and continue entering test data:

When record A is entered, the watermark rises to 21. When record B, which belongs to the 15-20 window, is then entered, the window computation can still be triggered.

When record C is entered, the watermark rises to 22, exactly the critical value we calculated. When record B is entered again, the window computation is no longer triggered and the record is treated as late data.
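The sequence above can be replayed with a small plain-Scala simulation. The original test inputs are not preserved here, so the event times are assumptions chosen to reproduce the watermarks in the text: 23 brings the watermark to 20, A = 24, B = 18 and C = 25. `LateDataSimulation` and `replay` are hypothetical names, and the model is a simplification in whole seconds, not Flink's implementation.

```scala
object LateDataSimulation {
  val windowSize = 5L        // seconds
  val allowedLateness = 2L   // seconds
  val maxOutOfOrderness = 3L // seconds

  /** Replays event times in arrival order; returns (event time, watermark, label) per record. */
  def replay(eventTimes: Seq[Long]): Seq[(Long, Long, String)] = {
    var maxTs = Long.MinValue
    eventTimes.map { ts =>
      // The watermark follows the largest event time seen so far.
      maxTs = math.max(maxTs, ts)
      val watermark = maxTs - maxOutOfOrderness
      // End of the tumbling window this record belongs to.
      val windowEnd = ts - (ts % windowSize) + windowSize
      val label =
        if (watermark < windowEnd) "on time"
        else if (watermark < windowEnd + allowedLateness) "late, window fires again"
        else "too late, side output"
      (ts, watermark, label)
    }
  }

  def main(args: Array[String]): Unit =
    replay(Seq(23L, 24L, 18L, 25L, 18L)).foreach { case (ts, wm, label) =>
      println(s"event time $ts, watermark $wm: $label")
    }
}
```

The first record with event time 18 arrives at watermark 21 and still re-fires the 15-20 window. The second arrives at watermark 22 and goes to the side output.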

Finally, to summarize how Flink handles late data:

If the business needs late data, set an allowed lateness. Each window then has its own limit on how long it waits for late data: the watermark must stay below the window end time plus the allowed lateness.

Origin www.cnblogs.com/java9188/p/11945964.html