Watermarks in Flink

Table of Contents

Three time concepts
Processing time
Event Time
Ingestion time
watermark
Watermarks for parallel streams
Late event
watermark dispenser
Two dispensers of watermark

Three time concepts

Before discussing watermarks, we first need to understand Flink's three notions of time: Event Time, Processing Time, and Ingestion Time. Watermarks only matter when a job runs on Event Time. The three notions are described below.

Processing time

Processing time is the system clock of the machine that executes the operator. When a job runs on processing time, all time-based operations (such as time windows) use the local clock of the machine running the operator. For example, with a one-hour window, if the application starts at 9:15 am, the first window contains the events processed between 9:15 am and 10:00 am, the next window contains the events processed between 10:00 am and 11:00 am, and so on.

Processing time is the simplest notion of time and requires no coordination between streams and machines. It provides the best performance and the lowest latency. However, in distributed and asynchronous environments, processing time is not deterministic: results depend on the speed at which records from upstream systems (such as a message queue) reach Flink, the speed at which records flow between operators inside Flink, and outages (scheduled or otherwise).
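The first window in the example is shorter than an hour because windows are aligned to the clock, not to the moment the job starts. Here is a minimal sketch of that alignment, assuming windows start at whole multiples of the window size with no offset. The class and method names are made up for illustration and are not part of Flink.

```java
// Sketch only: clock-aligned tumbling windows with no offset (an assumption of this example).
final class WindowAlignment {
    static final long MINUTE_MS = 60_000L;
    static final long HOUR_MS = 60L * MINUTE_MS;

    /** Start (inclusive) of the window that contains timestampMs, for windows of windowSizeMs. */
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    public static void main(String[] args) {
        long startedAt = 9 * HOUR_MS + 15 * MINUTE_MS; // 9:15, counted from midnight
        long start = windowStart(startedAt, HOUR_MS);
        long end = start + HOUR_MS;
        // The window is [9:00, 10:00), but only events processed from 9:15 on can fall into it.
        System.out.println(start / HOUR_MS + ":00 to " + end / HOUR_MS + ":00"); // prints 9:00 to 10:00
    }
}
```

An event processed at exactly 10:00 already belongs to the next window, because the window end is exclusive.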

Event Time

Event time is the time at which each event was generated on the device that produced it, that is, the timestamp the element already carries before it reaches Flink.

The Event Time timestamp therefore depends only on the data and has nothing to do with any other clock. To use Event Time, you first set the EventTime time characteristic on the execution environment (from Flink 1.12 on, event time is the default and this call is deprecated). For example:

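import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// From this call onward, every stream created from env uses the event-time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)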

Then call the DataStream's assignTimestampsAndWatermarks method to specify how the event-time timestamp is extracted. The details are left to the assigner examples later in this article.

Ideally, events arrive in event-time order. In practice, distributed execution and network delays mean they may not. Flink's answer to out-of-order data is an allowed delay: elements that arrive within the allowed delay trigger the calculation again. This delay is measured in event time, not in any other time, and event time is not determined by Flink. So how does Flink know what the current event time is? It uses watermarks to track how far event time has progressed, which is the focus of the rest of this article.

Ingestion time

Ingestion time is the time when the event enters Flink, that is, the time when the source operation is performed.

Conceptually, Ingestion time sits between Event Time and Processing time.

Compared with Processing time, it costs slightly more resources but gives more predictable results. Ingestion time uses a stable timestamp (assigned only once, at the source), so the different window operations applied to a record all refer to the same timestamp. With Processing time, each window operator reads its own local clock, so records that shared one upstream window may be assigned to different downstream windows (depending on the local system clock and any transport delay).

Compared with Event Time, an Ingestion time program cannot handle out-of-order events or late data, but it does not have to specify how watermarks are generated.

 

The following figure shows the three kinds of time semantics:

 

watermark

As mentioned above, supporting Event Time requires a way to measure the progress of time, so that the current event time can be determined. That mechanism is the watermark.

The watermark advances with the timestamps of the events in the data stream. A watermark with timestamp t declares that event time in the stream has reached t: every event with a timestamp less than or equal to t is assumed to have already arrived, so events that arrive later should carry timestamps greater than t.

In the example below, the events arrive in timestamp order. In this ideal case, the watermark is simply a periodic marker carrying the largest timestamp seen so far:

In practice, however, events usually arrive out of order rather than sorted by timestamp, as in the figure below. A watermark then declares a point in time before which all data should have arrived. That is how the official documentation puts it, and the description is somewhat vague. For out-of-order events, the watermark is normally combined with a maximum allowed delay: the watermark trails the largest timestamp seen so far by that delay. A window is therefore triggered once the largest timestamp seen reaches the window's end time plus the maximum allowed delay, which is the moment the watermark reaches the window's end time. At that point all elements with timestamps before the end time are assumed to have arrived, and the window computes over the arrived events whose timestamps are less than the end time.
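The triggering rule above can be sketched in a few lines. This is a simplified model, not Flink's implementation: the class name, the method names, and the sample timestamps are all made up for illustration, and the comparison against the window end is rounded to "watermark >= end time" as in the text.

```java
// Sketch only: watermark = largest timestamp seen - maximum allowed delay.
final class BoundedDelayFiring {
    /** Watermark that trails the largest timestamp seen so far by maxDelay. */
    static long watermark(long maxTimestampSeen, long maxDelay) {
        return maxTimestampSeen - maxDelay;
    }

    /** Index of the event whose arrival lets the window ending at windowEnd fire, or -1 if it never fires. */
    static int firingIndex(long[] timestamps, long windowEnd, long maxDelay) {
        long maxSeen = Long.MIN_VALUE + maxDelay; // keeps the subtraction in watermark() from overflowing
        for (int i = 0; i < timestamps.length; i++) {
            maxSeen = Math.max(maxSeen, timestamps[i]);
            if (watermark(maxSeen, maxDelay) >= windowEnd) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] timestamps = {1, 4, 3, 9, 7, 10, 13}; // out-of-order arrival
        // Window end 10, allowed delay 3: the window fires when timestamp 13 arrives.
        System.out.println(firingIndex(timestamps, 10, 3)); // prints 6
    }
}
```

With an allowed delay of 0 the same window would fire one event earlier, when timestamp 10 arrives, and any element arriving afterwards with a smaller timestamp would count as late.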

So in what form does a watermark exist? A watermark is a special event mixed into the DataStream. Once an operator in Flink has generated it, it flows through the whole program together with the regular events, as shown in the following figure:

Below is the source of the Watermark class. A watermark is a stream element that carries nothing but a timestamp:

public final class Watermark extends StreamElement {

	/** The watermark that signifies end-of-event-time. */
	public static final Watermark MAX_WATERMARK = new Watermark(Long.MAX_VALUE);

	// ------------------------------------------------------------------------

	/** The timestamp of the watermark in milliseconds. */
	private final long timestamp;

	/**
	 * Creates a new watermark with the given timestamp in milliseconds.
	 */
	public Watermark(long timestamp) {
		this.timestamp = timestamp;
	}

	// getTimestamp(), equals(), hashCode() and toString() omitted
}

Watermarks for parallel streams

Watermarks can be generated at the source, or by an operator after the source, such as map or filter. If the source runs with a parallelism greater than one, each parallel subtask generates its own watermarks independently, and these watermarks define the event time of that partition.
When the streams are repartitioned (that is, when one upstream partition may feed several downstream partitions), for example by keyBy(...) or partition(...), each upstream partition broadcasts its watermarks to every downstream partition. An operator that receives several input streams uses the smallest watermark among all its inputs as its own, and updates it whenever the watermark of one of its input partitions advances.

For example, with an upstream parallelism of 4, the watermark seen by the window operator in one downstream partition evolves as follows:

1. The watermarks received from the four partitions are 2, 4, 3, and 6, so the window's watermark is 2. The window calculation that corresponds to a watermark of 2 is triggered, and watermark=2 is broadcast downstream.

2. The watermark of the first partition is updated to 4. The smallest watermark over all partitions is now 3, so the window's watermark is updated to 3, the corresponding window calculation is triggered, and watermark=3 is broadcast downstream.

3. The watermark of the second partition is updated to 7. The smallest watermark over all partitions is still 3, so nothing happens.

4. The watermark of the third partition is updated to 6. The smallest watermark over all partitions is now 4, so the window's watermark is updated to 4, the corresponding window calculation is triggered, and watermark=4 is broadcast downstream.
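The four steps above can be replayed with a small tracker that keeps the latest watermark per upstream partition and exposes their minimum. This is only a sketch of the rule, not Flink's internal code, and the class MinWatermarkTracker is a hypothetical name.

```java
// Sketch only: an operator's watermark is the minimum over its input partitions.
final class MinWatermarkTracker {
    private final long[] partitionWatermarks;
    private long current = Long.MIN_VALUE;

    MinWatermarkTracker(int partitions) {
        partitionWatermarks = new long[partitions];
        java.util.Arrays.fill(partitionWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one partition; returns true if the operator's own watermark advanced. */
    boolean update(int partition, long watermark) {
        partitionWatermarks[partition] = Math.max(partitionWatermarks[partition], watermark);
        long min = Long.MAX_VALUE;
        for (long w : partitionWatermarks) {
            min = Math.min(min, w);
        }
        if (min > current) {
            current = min; // this is the point where windows fire and the watermark is forwarded
            return true;
        }
        return false;
    }

    long currentWatermark() {
        return current;
    }

    public static void main(String[] args) {
        MinWatermarkTracker tracker = new MinWatermarkTracker(4);
        long[] initial = {2, 4, 3, 6};
        for (int p = 0; p < initial.length; p++) {
            tracker.update(p, initial[p]);
        }
        System.out.println(tracker.currentWatermark()); // prints 2 (step 1)
        tracker.update(0, 4);
        System.out.println(tracker.currentWatermark()); // prints 3 (step 2)
        tracker.update(1, 7);
        System.out.println(tracker.currentWatermark()); // prints 3 (step 3, no change)
        tracker.update(2, 6);
        System.out.println(tracker.currentWatermark()); // prints 4 (step 4)
    }
}
```

Note that nothing is forwarded until every partition has reported at least one watermark, which is why a single silent partition can hold back the whole operator.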

The following figure shows an example of event and watermark in a parallel stream, and how the operator tracks the event time:

 

Late event

As mentioned when introducing watermarks, real streams are often out of order: an event that is delayed for some reason can arrive with an event time smaller than the current watermark. This clearly violates the assumption the watermark was built on. Flink therefore provides an allowed-delay mechanism for out-of-order events, so that events that are late by no more than a configured amount are still treated as valid.
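As a rough illustration of this rule, the sketch below classifies a single event against the current watermark and an allowed lateness. It is a deliberately simplified per-event model with made-up names (LatenessCheck, Verdict, classify). Flink itself applies the lateness bound per window rather than per event, so treat this as a way to think about the rule, not as Flink's logic.

```java
// Sketch only: per-event lateness check against the current watermark.
final class LatenessCheck {
    enum Verdict { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    /** Classifies an event by how far its timestamp lies behind the current watermark. */
    static Verdict classify(long eventTimestamp, long watermark, long allowedLateness) {
        if (eventTimestamp > watermark) {
            return Verdict.ON_TIME;            // the watermark has not passed this timestamp yet
        }
        if (watermark - eventTimestamp <= allowedLateness) {
            return Verdict.LATE_BUT_ACCEPTED;  // late, but within the allowed delay
        }
        return Verdict.DROPPED;                // too late to be considered
    }

    public static void main(String[] args) {
        long watermark = 8;
        long allowedLateness = 5;
        System.out.println(classify(10, watermark, allowedLateness)); // prints ON_TIME
        System.out.println(classify(7, watermark, allowedLateness));  // prints LATE_BUT_ACCEPTED
        System.out.println(classify(2, watermark, allowedLateness));  // prints DROPPED
    }
}
```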

watermark dispenser

When watermarks are derived purely from event time, the watermark is not updated if no elements arrive. During such a gap the watermark does not advance and no window calculation is triggered. If the gap is long, the elements that have already arrived in the window wait a long time before their results are output.

To avoid this, you can use a periodic watermark assigner (AssignerWithPeriodicWatermarks, described below) whose watermark is not based on event time alone. For example, the assigner can use the current time as the watermark when no new event has been received for a while.
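One way to picture such an assigner is sketched below. It is not a Flink class: IdleAwareWatermarks, its parameters, and the idle timeout are assumptions made for this example, and the wall-clock time is passed in explicitly so the behaviour is easy to follow. It also assumes that event timestamps are epoch milliseconds comparable to the machine's clock.

```java
// Sketch only: event-time watermark with a wall-clock fallback after an idle period.
final class IdleAwareWatermarks {
    private final long maxOutOfOrderness; // how far the watermark trails, in ms
    private final long idleTimeout;       // silence after which the clock takes over, in ms
    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEventSeenAt;
    private long lastEmitted = Long.MIN_VALUE;

    IdleAwareWatermarks(long maxOutOfOrderness, long idleTimeout, long nowMs) {
        this.maxOutOfOrderness = maxOutOfOrderness;
        this.idleTimeout = idleTimeout;
        this.lastEventSeenAt = nowMs;
    }

    /** Called for every element, like extractTimestamp. */
    void onEvent(long eventTimestamp, long nowMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        lastEventSeenAt = nowMs;
    }

    /** Called periodically, like getCurrentWatermark. */
    long currentWatermark(long nowMs) {
        long candidate = (maxTimestamp == Long.MIN_VALUE)
                ? Long.MIN_VALUE
                : maxTimestamp - maxOutOfOrderness;
        if (nowMs - lastEventSeenAt >= idleTimeout) {
            // No element for a while: let the wall clock push the watermark forward.
            candidate = Math.max(candidate, nowMs - maxOutOfOrderness);
        }
        lastEmitted = Math.max(lastEmitted, candidate); // a watermark must never move backwards
        return lastEmitted;
    }
}
```

The trade-off is that once the clock has pushed the watermark forward, an element that shows up afterwards with an older timestamp is treated as late, so the idle timeout should be chosen generously.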

 

Two dispensers of watermark

Flink offers two mechanisms for generating watermarks:

  • AssignerWithPeriodicWatermarks: assigns timestamps and generates watermarks periodically (the watermark can depend on event time or on processing time).
  • AssignerWithPunctuatedWatermarks: assigns timestamps and may generate a watermark for each element (every element is checked, which costs more performance).

Normally the first mechanism is used. Besides being cheaper, it can also keep the watermark advancing when no events arrive, as described in the previous section.

The two mechanisms are introduced below.

AssignerWithPeriodicWatermarks

The extractTimestamp method is called for each element to obtain its timestamp, and the assigner keeps track of the largest timestamp seen. ExecutionConfig.setAutoWatermarkInterval(...) defines the interval (every n milliseconds) at which watermarks are generated. At this interval, Flink periodically calls the assigner's getCurrentWatermark() method to obtain the watermark value.

In the BoundedOutOfOrdernessGenerator below, getCurrentWatermark returns the largest timestamp seen so far minus the allowed delay, so the watermark is periodically updated to that value.

Here are two simple examples of timestamp assigners that implement AssignerWithPeriodicWatermarks:

/**
 * This generator generates watermarks assuming that elements arrive out of order,
 * but only to a certain degree. The latest elements for a certain timestamp t will arrive
 * at most n milliseconds after the earliest elements for timestamp t.
 */
class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {

    val maxOutOfOrderness = 3500L // 3.5 seconds

    var currentMaxTimestamp: Long = _

    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        val timestamp = element.getCreationTime()
        currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
        timestamp
    }

    override def getCurrentWatermark(): Watermark = {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
    }
}

/**
 * This generator generates watermarks that are lagging behind processing time by a fixed amount.
 * It assumes that elements arrive in Flink after a bounded delay.
 */
class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {

    val maxTimeLag = 5000L // 5 seconds

    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        element.getCreationTime
    }

    override def getCurrentWatermark(): Watermark = {
        // return the watermark as current time minus the maximum time lag
        new Watermark(System.currentTimeMillis() - maxTimeLag)
    }
}

 

AssignerWithPunctuatedWatermarks

This assigner can generate a watermark for each element, based on the element's event time. The extractTimestamp(...) method assigns a timestamp to the element, and the checkAndGetNextWatermark(...) method inspects the element and decides whether to emit a new watermark.

The second parameter of checkAndGetNextWatermark(...) is the timestamp returned by extractTimestamp(...). Based on this timestamp, the method decides whether to generate a watermark. Whenever checkAndGetNextWatermark(...) returns a non-null watermark that is greater than the previous watermark, the watermark is updated.

class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[MyEvent] {

	override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
		element.getCreationTime
	}

	override def checkAndGetNextWatermark(lastElement: MyEvent, extractedTimestamp: Long): Watermark = {
		if (lastElement.hasWatermarkMarker()) new Watermark(extractedTimestamp) else null
	}
}
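To see which watermarks such an assigner ends up emitting, here is a small stand-alone simulation of the rule described above: a watermark is produced only for elements that carry a marker, and only if it is greater than the previous one. The class PunctuatedDemo and the sample data are invented for this illustration and do not use the Flink API.

```java
// Sketch only: simulate punctuated watermark emission over a finite stream.
final class PunctuatedDemo {
    /** Returns, in order, the watermarks emitted for the given elements. */
    static long[] emittedWatermarks(long[] timestamps, boolean[] hasMarker) {
        long[] out = new long[timestamps.length];
        int count = 0;
        long last = Long.MIN_VALUE;
        for (int i = 0; i < timestamps.length; i++) {
            // Mirrors checkAndGetNextWatermark: null (nothing) unless the element carries a marker.
            if (hasMarker[i] && timestamps[i] > last) {
                last = timestamps[i];
                out[count++] = last;
            }
        }
        return java.util.Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        long[] timestamps = {5, 9, 6, 12};
        boolean[] hasMarker = {true, true, true, false};
        // 6 carries a marker but is not greater than 9, and 12 carries no marker.
        System.out.println(java.util.Arrays.toString(emittedWatermarks(timestamps, hasMarker))); // prints [5, 9]
    }
}
```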

 

 


Origin blog.csdn.net/x950913/article/details/106246807