Flink late data handling: the three-piece set

  • watermark
  • allowedLateness (maximum allowed lateness)
  • sideOutputLateData (side output stream for late data)

Sample code

package com.andy.flink.demo.datastream.sideoutputs

import com.andy.flink.demo.datastream.sideoutputs.FlinkHandleLateDataTest2.SensorReading
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time


object FlinkHandleLateDataTest2 {

  // Entity model for a single sensor reading
  case class SensorReading(id: String,
                           timestamp: Long,
                           temperature: Double)

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism( 1 )
    // From this call onward, every stream created by env uses event time
    env.setStreamTimeCharacteristic( TimeCharacteristic.EventTime )
    // Default watermark generation interval in milliseconds: emit a watermark every 100 ms.
    // This is a global setting; a setting made on an operator overrides it.
    env.getConfig.setAutoWatermarkInterval( 100L )

    val inputDStream: DataStream[String] = env.socketTextStream( "localhost", 9999 )

    val dataDstream: DataStream[SensorReading] = inputDStream
      .map( data => {
        val dataArray: Array[String] = data.split( "," )
        SensorReading( dataArray( 0 ), dataArray( 1 ).toLong, dataArray( 2 ).toDouble )
      } )
      // .assignAscendingTimestamps( _.timestamp * 1000L ) // Ideal case: no delay, data arrives in ascending time order, so naming the timestamp field is enough
      .assignTimestampsAndWatermarks(
        // Initial watermark delay; as a rule of thumb it should cover roughly 70%~80% of the delayed data
        new BoundedOutOfOrdernessTimestampExtractor[SensorReading]( Time.milliseconds( 1000 ) ) {
          // The timestamp field is in seconds, so multiply by 1000 (milliseconds are required here; convert the time in your own data accordingly)
          override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
        } )

    val lateOutputTag = new OutputTag[SensorReading]( "late" )

    // Three layers of protection for late data: watermark | allowedLateness (maximum allowed lateness) | sideOutputLateData (side output stream)
    val resultDStream: DataStream[SensorReading] = dataDstream
      .keyBy( "id" ) // Group by the id field to form a keyed stream
      .timeWindow( Time.seconds( 5 ) ) // A tumbling window, for simplicity
      .allowedLateness( Time.minutes( 1 ) ) // Maximum allowed lateness; the window closes at window size + watermark delay + allowed lateness, here 5 + 1 + 60 seconds
      .sideOutputLateData( lateOutputTag )
      .reduce( new MyReduceFunc() )

    dataDstream.print( "main-flow" )
    resultDStream.print( "result-flow" )
    // Get the side output stream "late" and print it
    resultDStream.getSideOutput( lateOutputTag ).print( "late-flow" )

    env.execute( "FlinkHandleLateDataTest2" )
  }
}

/**
 * Custom reduce function: keeps the most recent timestamp and the minimum temperature seen so far.
 */
class MyReduceFunc extends ReduceFunction[SensorReading] {
  override def reduce(value1: SensorReading,
                      value2: SensorReading): SensorReading = {
    SensorReading(
      value1.id,
      value2.timestamp,
      value1.temperature.min( value2.temperature )
    )
  }
}

Piece one of three: watermark

The window size is 5 seconds and the watermark delay is 1 second. The first window therefore holds the data in [0,5); timestamp 5 itself is excluded. (Where a window starts and ends is decided by a fixed formula; it is not simply the first element's timestamp extended by 5 seconds, even when that timestamp happens to be 0. The source walkthrough below shows the formula.) Because the watermark lags the largest timestamp seen so far by 1 s, the window result is not emitted until an element with timestamp 6 s arrives. At that point the window has fired but is not yet closed, because allowedLateness is set.
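
To make the timing concrete, here is a small dependency-free Java sketch of the arithmetic. It assumes whole-second timestamps, a watermark equal to the largest timestamp seen minus the 1 s delay, and the usual rule that an event-time window fires once the watermark reaches window end minus 1 ms. The class and method names are made up for illustration and are not part of Flink.

```java
// Sketch only: models when the event-time window [0 s, 5 s) first fires.
// Assumption: watermark = largest timestamp seen - 1 s delay,
// and the window fires once watermark >= windowEnd - 1 ms.
class WatermarkFiringSketch {
    static final long WINDOW_END_MS = 5_000L;
    static final long MAX_OUT_OF_ORDERNESS_MS = 1_000L;

    // Watermark after maxTimestampMs has been seen as the largest timestamp so far.
    static long watermarkFor(long maxTimestampMs) {
        return maxTimestampMs - MAX_OUT_OF_ORDERNESS_MS;
    }

    // True once the watermark has reached the last timestamp that belongs to the window.
    static boolean windowFires(long maxTimestampMs) {
        return watermarkFor(maxTimestampMs) >= WINDOW_END_MS - 1;
    }

    public static void main(String[] args) {
        for (long second = 4; second <= 6; second++) {
            System.out.println("timestamp " + second + " s -> fires: " + windowFires(second * 1000L));
        }
    }
}
```

Under these assumptions, timestamps 4 s and 5 s leave the watermark below the end of the window, and 6 s is the first whole-second timestamp that pushes it far enough.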

Piece two of three: allowedLateness

The allowed lateness is set to one minute. Within that minute (measured in event time, by the watermark), every late element whose timestamp falls in [0,5) is still added to the window, and each such element immediately triggers one more output. The window closes for good once the minute has passed.
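
The "5 + 1 + 60" figure from the code comment can be checked with a short Java sketch. It assumes the window state is cleaned up once the watermark reaches window end minus 1 ms plus the allowed lateness, and that timestamps arrive as whole seconds. All names here are illustrative, not Flink API.

```java
// Sketch only: finds the first whole-second timestamp that closes window [0 s, 5 s).
// Assumption: cleanup happens once watermark >= (windowEnd - 1 ms) + allowedLateness.
class AllowedLatenessSketch {
    static final long WINDOW_END_MS = 5_000L;
    static final long MAX_OUT_OF_ORDERNESS_MS = 1_000L;
    static final long ALLOWED_LATENESS_MS = 60_000L;

    // Event time at which the window state is discarded.
    static long cleanupTime() {
        return WINDOW_END_MS - 1 + ALLOWED_LATENESS_MS;
    }

    // Smallest timestamp (in whole seconds) whose watermark reaches the cleanup time.
    static long firstClosingSecond() {
        long second = 0;
        while (second * 1000L - MAX_OUT_OF_ORDERNESS_MS < cleanupTime()) {
            second++;
        }
        return second;
    }

    public static void main(String[] args) {
        System.out.println("cleanup time (ms): " + cleanupTime());
        System.out.println("window closes when timestamp (s) arrives: " + firstClosingSecond());
    }
}
```

With these numbers the sketch arrives at 66 s, which matches window size + watermark delay + allowed lateness.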

Piece three of three: sideOutputLateData

This is the final safety net: once the window has closed, any data that still arrives for it is sent to the side output stream, and that late data can then be merged manually with the earlier results.
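
The three pieces together decide where each element ends up. The following Java sketch models that routing for the 5-second tumbling window and 1-minute lateness used in the sample. It is a simplified model under stated assumptions (non-negative timestamps, no window offset), and the Route enum and route method are hypothetical names, not Flink classes.

```java
// Sketch only: which of the three paths an element takes, given the current watermark.
// Assumptions: 5 s tumbling windows without offset, non-negative timestamps,
// 1 min allowed lateness. Route and route(...) are illustrative names.
class LateDataRoutingSketch {
    enum Route { ON_TIME, LATE_FIRING, SIDE_OUTPUT }

    static final long WINDOW_SIZE_MS = 5_000L;
    static final long ALLOWED_LATENESS_MS = 60_000L;

    static Route route(long elementTimestampMs, long currentWatermarkMs) {
        long windowStart = elementTimestampMs - (elementTimestampMs % WINDOW_SIZE_MS);
        long windowMaxTimestamp = windowStart + WINDOW_SIZE_MS - 1;
        if (windowMaxTimestamp + ALLOWED_LATENESS_MS <= currentWatermarkMs) {
            return Route.SIDE_OUTPUT;   // window already closed: goes to the late side output
        }
        if (windowMaxTimestamp <= currentWatermarkMs) {
            return Route.LATE_FIRING;   // window already fired but still open: fires again
        }
        return Route.ON_TIME;           // window has not fired yet
    }

    public static void main(String[] args) {
        System.out.println(route(3_000L, 2_000L));
        System.out.println(route(3_000L, 10_000L));
        System.out.println(route(3_000L, 65_000L));
    }
}
```

An element with timestamp 3 s is on time while the watermark is at 2 s, causes a late firing at watermark 10 s, and is routed to the side output at watermark 65 s.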

How the window start and end timestamps are actually computed

  1. Start from how the time window is used

The late-data handling part of the sample code:

    val lateOutputTag = new OutputTag[SensorReading]( "late" )

    // Three layers of protection for late data: watermark | allowedLateness (maximum allowed lateness) | sideOutputLateData (side output stream)
    val resultDStream: DataStream[SensorReading] = dataDstream
      .keyBy( "id" ) // Group by the id field to form a keyed stream
      .timeWindow( Time.seconds( 5 ) ) // A tumbling window, for simplicity
      .allowedLateness( Time.minutes( 1 ) ) // Maximum allowed lateness; the window closes at window size + watermark delay + allowed lateness, here 5 + 1 + 60 seconds
      .sideOutputLateData( lateOutputTag )
      .reduce( new MyReduceFunc() )

This sets up a tumbling window with a size of 5 seconds.

  2. The Scala keyed stream class
    KeyedStream.scala
import org.apache.flink.streaming.api.datastream.{DataStream => JavaStream, KeyedStream => KeyedJavaStream, WindowedStream => WindowedJavaStream}

@Public
class KeyedStream[T, K](javaStream: KeyedJavaStream[T, K]) extends DataStream[T](javaStream) {

  // ------------------------------------------------------------------------
  //  Properties
  // ------------------------------------------------------------------------

  /**
   * Gets the type of the key by which this stream is keyed.
   */
  @Internal
  def getKeyType = javaStream.getKeyType()

  /**
   * Windows this [[KeyedStream]] into tumbling time windows.
   *
   * This is a shortcut for either `.window(TumblingEventTimeWindows.of(size))` or
   * `.window(TumblingProcessingTimeWindows.of(size))` depending on the time characteristic
   * set using
   * [[StreamExecutionEnvironment.setStreamTimeCharacteristic()]]
   *
   * @param size The size of the window.
   */
  def timeWindow(size: Time): WindowedStream[T, K, TimeWindow] = {
    new WindowedStream(javaStream.timeWindow(size))
  }

From this code and the import statement, we can see that in
new WindowedStream(javaStream.timeWindow(size))
the argument passed to WindowedStream comes from javaStream, that is, from the timeWindow method of the Java class in KeyedStream.java.

  3. The Java keyed stream class
    KeyedStream.java
/**
 * A {@link KeyedStream} represents a {@link DataStream} on which operator state is
 * partitioned by key using a provided {@link KeySelector}. Typical operations supported by a
 * {@code DataStream} are also possible on a {@code KeyedStream}, with the exception of
 * partitioning methods such as shuffle, forward and keyBy.
 *
 * <p>Reduce-style operations, such as {@link #reduce}, {@link #sum} and {@link #fold} work on
 * elements that have the same key.
 *
 * @param <T> The type of the elements in the Keyed Stream.
 * @param <KEY> The type of the key in the Keyed Stream.
 */
@Public
public class KeyedStream<T, KEY> extends DataStream<T> {

	/**
	 * The key selector that can get the key by which the stream if partitioned from the elements.
	 */
	private final KeySelector<T, KEY> keySelector;

	/** The type of the key by which the stream is partitioned. */
	private final TypeInformation<KEY> keyType;

	// ------------------------------------------------------------------------
	//  Windowing
	// ------------------------------------------------------------------------

	/**
	 * Windows this {@code KeyedStream} into tumbling time windows.
	 *
	 * <p>This is a shortcut for either {@code .window(TumblingEventTimeWindows.of(size))} or
	 * {@code .window(TumblingProcessingTimeWindows.of(size))} depending on the time characteristic
	 * set using
	 * {@link org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#setStreamTimeCharacteristic(org.apache.flink.streaming.api.TimeCharacteristic)}
	 *
	 * @param size The size of the window.
	 */
	public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
		if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
			return window(TumblingProcessingTimeWindows.of(size));
		} else {
			return window(TumblingEventTimeWindows.of(size));
		}
	}

  4. The tumbling event-time window assigner
    TumblingEventTimeWindows.java
/**
 * A {@link WindowAssigner} that windows elements into windows based on the timestamp of the
 * elements. Windows cannot overlap.
 *
 * <p>For example, in order to window into windows of 1 minute:
 * <pre> {@code
 * DataStream<Tuple2<String, Integer>> in = ...;
 * KeyedStream<Tuple2<String, Integer>, String> keyed = in.keyBy(...);
 * WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowed =
 *   keyed.window(TumblingEventTimeWindows.of(Time.minutes(1)));
 * } </pre>
 */
@PublicEvolving
public class TumblingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
	private static final long serialVersionUID = 1L;

	private final long size;

	private final long offset;

	protected TumblingEventTimeWindows(long size, long offset) {
		if (Math.abs(offset) >= size) {
			throw new IllegalArgumentException("TumblingEventTimeWindows parameters must satisfy abs(offset) < size");
		}

		this.size = size;
		this.offset = offset;
	}

	@Override
	public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
		if (timestamp > Long.MIN_VALUE) {
			// Long.MIN_VALUE is currently assigned when no timestamp is present
			long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
			return Collections.singletonList(new TimeWindow(start, start + size));
		} else {
			throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). " +
					"Is the time characteristic set to 'ProcessingTime', or did you forget to call " +
					"'DataStream.assignTimestampsAndWatermarks(...)'?");
		}
	}

From the assignWindows method, you can see that the calculation of the starting position of the window comes from:
long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
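
As a rough illustration of what assignWindows does, the plain-Java sketch below maps a few timestamps to their window for the sample's settings (size 5000 ms, offset 0). It mirrors the start and start + size logic quoted above but is not the Flink class itself; the class name and the long[] return type are simplifications chosen for the sketch.

```java
import java.util.Arrays;

// Sketch only: assigns a timestamp to its tumbling window [start, end).
// Assumptions: window size 5000 ms and offset 0, as in the sample job.
class TumblingAssignSketch {
    static final long SIZE_MS = 5_000L;
    static final long OFFSET_MS = 0L;

    // Returns {start, end}; start is inclusive, end is exclusive.
    static long[] assignWindow(long timestampMs) {
        long start = timestampMs - (timestampMs - OFFSET_MS + SIZE_MS) % SIZE_MS;
        return new long[] {start, start + SIZE_MS};
    }

    public static void main(String[] args) {
        for (long ts : new long[] {3_000L, 4_999L, 5_000L, 12_000L}) {
            System.out.println(ts + " ms -> " + Arrays.toString(assignWindow(ts)));
        }
    }
}
```

Note that an element arriving first with timestamp 3000 ms still lands in the window starting at 0, and that 5000 ms belongs to the next window because the end is exclusive.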

  5. The time window class
    TimeWindow.java

In TimeWindow.java you can then find the following method:

/**
 * A {@link Window} that represents a time interval from {@code start} (inclusive) to
 * {@code end} (exclusive).
 */
@PublicEvolving
public class TimeWindow extends Window {

	private final long start;
	private final long end;

	public TimeWindow(long start, long end) {
		this.start = start;
		this.end = end;
	}
	
	/**
	 * Method to get the window start for a timestamp.
	 *
	 * @param timestamp epoch millisecond to get the window start.
	 * @param offset The offset which window start would be shifted by.
	 * @param windowSize The size of the generated windows.
	 * @return window start
	 */
	public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
		return timestamp - (timestamp - offset + windowSize) % windowSize;
	}

This shows that the window start is computed as:
timestamp - (timestamp - offset + windowSize) % windowSize
The remainder term (timestamp - offset + windowSize) % windowSize first subtracts the offset, then adds one window size, and finally takes the result modulo the window size.

  1. Subtracting offset and then adding windowSize keeps the left operand of % from going negative for non-negative timestamps, because abs(offset) < windowSize.
  2. For a value that is already non-negative, adding one whole window size before taking the modulo does not change the remainder, so the addition has no other effect on the result.
  3. The term (timestamp - offset + windowSize) % windowSize is therefore just the remainder of the offset-shifted timestamp divided by the window size. Subtracting that remainder from timestamp rounds it down to a window boundary, and that boundary is the start of the window.
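
A few worked values make the three points easier to check. The Java sketch below copies the formula from the excerpt above and, for comparison only, adds a hypothetical variant that leaves out "+ windowSize" (that variant is not Flink code). The sample inputs are assumptions chosen for illustration.

```java
// Sketch only: worked examples for the window start formula.
// startWithoutAddingSize is a hypothetical variant, shown only to illustrate point 1.
class WindowStartSketch {
    // Same arithmetic as the excerpt from TimeWindow.java.
    static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // Variant without "+ windowSize"; Java's % keeps the sign of a negative left operand.
    static long startWithoutAddingSize(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset) % windowSize;
    }

    public static void main(String[] args) {
        // No offset: 7000 - (12000 % 5000) = 7000 - 2000
        System.out.println(getWindowStartWithOffset(7_000L, 0L, 5_000L));
        // Offset 1000: 7000 - (11000 % 5000) = 7000 - 1000
        System.out.println(getWindowStartWithOffset(7_000L, 1_000L, 5_000L));
        // Timestamp smaller than the offset: 500 - (4500 % 5000) = 500 - 4500
        System.out.println(getWindowStartWithOffset(500L, 1_000L, 5_000L));
        // Same input without "+ windowSize": 500 - (-500 % 5000) = 500 + 500
        System.out.println(startWithoutAddingSize(500L, 1_000L, 5_000L));
    }
}
```

In the third case the formula yields a start of -4000, so the window [-4000, 1000) contains timestamp 500. The variant without the added window size yields 1000, a window that would not contain the element at all, which is the situation point 1 guards against.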

Origin blog.csdn.net/liuwei0376/article/details/123682090