Flink window

Table of Contents

Overview

Types of windows

Tumbling Windows

Sliding Windows

Session Windows

Global Windows

Window functions

ReduceFunction

AggregateFunction

FoldFunction

ProcessWindowFunction

ProcessWindowFunction with incremental aggregation

Use window state in ProcessWindowFunction

Triggers

Fire and purge

Default and custom triggers

Evictors

How to deal with late elements?

Output late elements to a side output stream

Notes on handling late elements

What else can be done after the window is calculated?

The effect of watermarks on windows 

Consecutive windowed operations

How to estimate the window storage size? 


Overview

Windows are at the heart of processing infinite streams. They are a way of cutting an unbounded stream into finite blocks for processing: windows split the stream into "buckets" of bounded size, over which we can apply computations.

The general structure of a Flink windowed program is shown below. The first snippet is for keyed streams, the second for non-keyed streams. The only difference is that a keyed stream calls window(...) after keyBy(...), while a non-keyed stream calls windowAll(...) directly on the DataStream.

Keyed Windows

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

 

Non-Keyed Windows

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

  • trigger(): defines when the window is ready to be evaluated and its results emitted
  • evictor(): defines logic for removing elements from the window
  • allowedLateness(): allows late elements to still be processed
  • sideOutputLateData(): sends late data to a side output stream
  • getSideOutput(): retrieves the side output stream

The calls in square brackets ([...]) are optional, which shows that Flink lets you customize the window logic in many different ways to fit your use case.

The difference between the two: a keyed stream is logically split into multiple streams by key, and each logical stream can be processed independently of the others, i.e. with multiple parallel tasks. A non-keyed stream is not split, so the window logic is executed by a single task, i.e. with a parallelism of 1.
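
To make the skeleton above concrete, here is a minimal runnable sketch of a keyed tumbling-window word count. The socket source, host, and port are illustrative assumptions, not part of the original text:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // lines of text from a socket (host/port are placeholders)
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .keyBy(_._1)                                               // keyed stream
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) // window assigner
      .sum(1)                                                    // window function

    counts.print()
    env.execute("Window WordCount")
  }
}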

 

Types of windows

Tumbling Windows

The tumbling window assigner assigns each element to a window of a specified, fixed size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a 5-minute tumbling window, a new window starts every 5 minutes and every element is assigned to exactly one of them.

Code:

val input: DataStream[T] = ...

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>)

// tumbling processing-time windows
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>)

// daily tumbling event-time windows offset by -8 hours.
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .<windowed transformation>(<window function>)

The time interval can be specified with Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

As the last example shows, tumbling window assigners also take an optional offset parameter that changes the alignment of the windows. Without an offset, hourly tumbling windows are aligned with epoch time, i.e. you get windows such as 1:00:00.000-1:59:59.999, 2:00:00.000-2:59:59.999, and so on. With an offset of 15 minutes, the windows become 1:15:00.000-2:14:59.999, 2:15:00.000-3:14:59.999, etc. The most common use of an offset is to adjust windows to a time zone other than UTC; for example, in China you would specify an offset of Time.hours(-8).

Sliding Windows

The sliding window assigner assigns each element to one or more windows of fixed length. As with tumbling windows, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a new sliding window is started. Therefore, if the slide is smaller than the window size, sliding windows overlap, and in this case an element is assigned to multiple windows.

For example, with a 10-minute window that slides every 5 minutes, every 5 minutes you get a window that contains the events of the last 10 minutes.

Code example:

val input: DataStream[T] = ...

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>)

Time intervals and offsets for sliding windows can be specified in the same way as for tumbling windows.

Session Windows

The session window assigner groups elements by sessions of activity. In contrast to tumbling and sliding windows, session windows do not overlap and do not have a fixed start and end time. Instead, a session window closes when it has not received elements for a certain period of time, i.e. when a gap of inactivity occurs. The session window assigner is configured with a session gap, which defines this period of inactivity: when no new element arrives within the gap, the current session closes and subsequent elements are assigned to a new session window.

Code example:

val input: DataStream[T] = ...

// event-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)

// event-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[String] {
      override def extract(element: String): Long = {
        // determine and return session gap
      }
    }))
    .<windowed transformation>(<window function>)

// processing-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)


// processing-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(DynamicProcessingTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[String] {
      override def extract(element: String): Long = {
        // determine and return session gap
      }
    }))
    .<windowed transformation>(<window function>)

A static gap can be specified with Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

A dynamic gap is specified by implementing the SessionWindowTimeGapExtractor interface.

Note: Since session windows do not have a fixed start and end, they are evaluated differently from tumbling and sliding windows. Internally, the session window operator creates a new window for every arriving record and merges windows whose gap is smaller than the configured session gap. To be mergeable, the session window operator needs a merging trigger and a merging window function, such as ReduceFunction, AggregateFunction, or ProcessWindowFunction (FoldFunction cannot merge).

Global Windows

The global window assigner assigns all elements with the same key to one single global window. This scheme is only useful if you also specify a custom trigger; otherwise no computation is performed, because the global window has no natural end at which it could be evaluated.

Global windows are rarely used in practice.

Code example:

val input: DataStream[T] = ...

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>)
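
Since a global window never fires on its own, a trigger is normally attached. A minimal sketch using the built-in CountTrigger follows; the trigger and the count of 100 are illustrative assumptions, not part of the original example:

import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

// emit a result for every 100 elements per key; without the trigger the
// global window would never fire
input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .trigger(CountTrigger.of[GlobalWindow](100))
    .<windowed transformation>(<window function>)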

Window functions

ReduceFunction

ReduceFunction specifies how two elements of the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.

The definition and usage of ReduceFunction are as follows:

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }

The example above sums the second field of all tuples in the window, where v1 is the partially reduced value accumulated so far and v2 is the newly arrived element.

AggregateFunction

AggregateFunction is more general than ReduceFunction (though in practice a plain reduce is often sufficient). It has three type parameters: an input type (IN), an accumulator type (ACC), and an output type (OUT). The input type is the type of the elements in the input stream, and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, merging two accumulators into one, and extracting the output (of type OUT) from an accumulator. The example below shows how this works.

Like ReduceFunction, Flink will incrementally aggregate the input elements of the window as they arrive.

The definition and usage of AggregateFunction are as follows:

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  override def getResult(accumulator: (Long, Long)) = accumulator._1 / accumulator._2.toDouble

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate)

The code above computes the average of the second field of the elements in the window.

FoldFunction

FoldFunction specifies how an input element of the window is combined with an element of the output type. The FoldFunction is called incrementally for every element added to the window, together with the current output value. It is similar to ReduceFunction, except that a FoldFunction starts from a preset initial value, which reduce does not use.

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .fold("") { (acc, v) => acc + v._2 }

The example above appends the second field of every input element to an initially empty String.

Note: fold() cannot be used with session windows or other mergeable windows.

ProcessWindowFunction

ProcessWindowFunction is a low-level function that receives an Iterable containing all elements of the window, together with a Context object that gives access to time and state information. This makes it more flexible than the other window functions, but the flexibility comes at the cost of performance and resource consumption, because elements cannot be aggregated incrementally and instead have to be buffered internally until the window is ready to be processed.

The structure of ProcessWindowFunction is as follows:

abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window] extends Function {

  /**
    * Evaluates the window and outputs none or several elements.
    *
    * @param key      The key for which this window is evaluated.
    * @param context  The context in which the window is being evaluated.
    * @param elements The elements in the window being evaluated.
    * @param out      A collector for emitting elements.
    * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
    */
  def process(
      key: KEY,
      context: Context,
      elements: Iterable[IN],
      out: Collector[OUT])

  /**
    * The context holding window metadata
    */
  abstract class Context {
    /**
      * Returns the window that is being evaluated.
      */
    def window: W

    /**
      * Returns the current processing time.
      */
    def currentProcessingTime: Long

    /**
      * Returns the current event-time watermark.
      */
    def currentWatermark: Long

    /**
      * State accessor for per-key and per-window state.
      */
    def windowState: KeyedStateStore

    /**
      * State accessor for per-key global state.
      */
    def globalState: KeyedStateStore
  }

}

Note: The key parameter is the key extracted by the KeySelector specified in keyBy(). If a tuple index or a string field reference is used as the key, the key type is always a Tuple, and you have to manually cast it to a tuple of the correct size to extract the key fields.

The definition and usage of ProcessWindowFunction are as follows:

val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  .timeWindow(Time.minutes(5))
  .process(new MyProcessWindowFunction())

/* ... */

class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]) = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}

This example uses a ProcessWindowFunction to count the elements in the window and to emit some information about the window itself.

Note that using a ProcessWindowFunction for simple aggregates such as a count is quite inefficient. The next section shows how a ReduceFunction or AggregateFunction can be combined with a ProcessWindowFunction to get both incremental aggregation and the extra metadata a ProcessWindowFunction provides.

ProcessWindowFunction with incremental aggregation

A ProcessWindowFunction can be combined with a ReduceFunction, AggregateFunction, or FoldFunction so that elements are aggregated incrementally as they arrive in the window. When the window is closed, the ProcessWindowFunction receives the aggregated result. This allows incremental computation while still giving access to the window's context information.

Tip: In practice, incremental aggregation is usually used on its own and is only rarely combined with a ProcessWindowFunction.

The following example demonstrates how to combine an incremental ReduceFunction with ProcessWindowFunction to return the smallest event in the window and the start time of the window.

val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  .timeWindow(<duration>)
  .reduce(
    (r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
    ( key: String,
      context: ProcessWindowFunction[_, _, _, TimeWindow]#Context,
      minReadings: Iterable[SensorReading],
      out: Collector[(Long, SensorReading)] ) =>
      {
        val min = minReadings.iterator.next()
        out.collect((context.window.getStart, min))
      }
  )

The following example demonstrates how to combine AggregateFunction and ProcessWindowFunction to calculate the average value, and output the key and window as well as the average value at the same time.

val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .timeWindow(<duration>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction())

// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  override def getResult(accumulator: (Long, Long)) = accumulator._1 / accumulator._2.toDouble

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

class MyProcessWindowFunction extends ProcessWindowFunction[Double, (String, Double), String, TimeWindow] {

  def process(key: String, context: Context, averages: Iterable[Double], out: Collector[(String, Double)]) = {
    val average = averages.iterator.next()
    out.collect((key, average))
  }
}

Use window state in ProcessWindowFunction

In addition to accessing keyed state (as any rich function can), a ProcessWindowFunction can also use keyed state that is scoped to the window the function is currently processing. This is called per-window state.

Per-window state comes in two flavors in ProcessWindowFunction:

  • globalState: keyed state that is not scoped to a particular window.
  • windowState: keyed state that is scoped to the current window.

Per-window state is useful when the same window can fire multiple times, for example when an event-time trigger is combined with an allowed lateness. You could, for instance, store how many times the window has fired and information about the last firing, to be used by the next firing. When using per-window state, remember to clean it up when the window is cleared by overriding the clear() method of ProcessWindowFunction, as sketched below.
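
As an illustration (not from the original text), here is a rough sketch of per-window state that counts how often a window has fired; the state name and the emitted string are arbitrary choices:

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class CountFiringsFunction
    extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  // per-window state: how many times this particular window has fired so far
  private val firingsDesc =
    new ValueStateDescriptor[java.lang.Integer]("firings", classOf[java.lang.Integer])

  override def process(key: String, context: Context,
                       elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
    val firings = context.windowState.getState(firingsDesc)
    val n = Option(firings.value()).map(_.intValue()).getOrElse(0) + 1
    firings.update(n)
    out.collect(s"key=$key window=${context.window} firing #$n with ${elements.size} elements")
  }

  // clean up the per-window state when the window is purged
  override def clear(context: Context): Unit =
    context.windowState.getState(firingsDesc).clear()
}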

Triggers

A trigger determines when a window is ready to be handed to the window function for processing. Every WindowAssigner comes with a default trigger; if the default trigger does not fit your needs, you can specify a custom one.

The Trigger interface has five methods:

  • onElement(): called for every element that is added to a window.
  • onEventTime(): called when a registered event-time timer fires.
  • onProcessingTime(): called when a registered processing-time timer fires.
  • onMerge(): relevant for stateful triggers; merges the state of two triggers when their windows are merged, e.g. when session windows are merged.
  • clear(): called when the window is removed, used for any cleanup work.

Two things should be noted about these methods:

1) The first three methods decide how to act on their invocation event by returning a TriggerResult, which can be one of the following:

  • CONTINUE: do nothing
  • FIRE: trigger the window computation
  • PURGE: clear the elements in the window
  • FIRE_AND_PURGE: trigger the computation and clear the elements in the window afterwards

2) Any of these methods can register processing-time or event-time timers for future actions. A sketch of a custom trigger built on these methods is shown below.
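
As an illustration of these methods (a sketch under assumptions, not code from the original article), the following custom trigger fires either when a per-window element count reaches a threshold or when the watermark passes the end of the window:

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

class CountOrEndOfWindowTrigger(maxCount: Long) extends Trigger[Any, TimeWindow] {

  // per-window, per-key element count kept as partitioned trigger state
  private val countDesc =
    new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long])

  override def onElement(element: Any, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    // fire at the latest when the watermark passes the end of the window
    ctx.registerEventTimeTimer(window.maxTimestamp())
    val count = ctx.getPartitionedState(countDesc)
    val newCount = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1
    if (newCount >= maxCount) {
      count.clear()
      TriggerResult.FIRE        // early firing based on count
    } else {
      count.update(newCount)
      TriggerResult.CONTINUE
    }
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    if (time == window.maxTimestamp()) TriggerResult.FIRE else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    ctx.getPartitionedState(countDesc).clear()
    ctx.deleteEventTimeTimer(window.maxTimestamp())
  }
}

It would be attached to an event-time window with .trigger(new CountOrEndOfWindowTrigger(100)), with the threshold chosen to suit the use case.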

Fire and purge

Once a trigger decides that a window is ready to be processed, it fires by returning FIRE or FIRE_AND_PURGE. This signals the window operator to evaluate the window and emit results. For a window with a ProcessWindowFunction, firing means all elements are passed to the ProcessWindowFunction (after going through the evictor first, if one is set).

If the window function is a ReduceFunction, AggregateFunction, or FoldFunction, only the aggregated result is emitted, because these functions have already aggregated the elements incrementally inside the window.

FIRE keeps the contents of the window, while FIRE_AND_PURGE removes them after the computation. The built-in triggers use FIRE by default.

Note: Purging only removes the contents of the window; the window metadata and the trigger state are kept.

Default and custom triggers

The default trigger of a WindowAssigner is appropriate for many use cases. For example, all event-time window assigners use an event-time trigger as their default, which fires once the watermark passes the end of the window.

Note: The default trigger of GlobalWindows never fires. Therefore, when using GlobalWindows you must always specify a custom trigger.

Note: Specifying a trigger with trigger() replaces the default trigger of the WindowAssigner. For example, if you set a CountTrigger on TumblingEventTimeWindows, the window fires only based on the count and no longer based on time progress; the two do not take effect together. If you want a window to fire both on time and on count, you have to write a custom trigger, as in the CountOrEndOfWindowTrigger sketch above.

Flink ships with four built-in triggers:

EventTimeTrigger: event-time trigger, fires based on the progress of the watermark.

ProcessingTimeTrigger: processing-time trigger, fires based on the system clock of the machine processing the elements.

CountTrigger: count trigger, fires once the number of elements in the window exceeds a given limit.

PurgingTrigger: takes another trigger as an argument and turns it into a purging trigger, i.e. it additionally clears the window contents whenever the wrapped trigger fires.
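
For example (a sketch, not from the original text), a built-in trigger is attached with trigger(); here PurgingTrigger wraps CountTrigger so that the window contents are discarded after every firing (the count of 1000 is arbitrary):

import org.apache.flink.streaming.api.windowing.triggers.{CountTrigger, PurgingTrigger}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    // fire every 1000 elements and purge the window contents afterwards
    .trigger(PurgingTrigger.of(CountTrigger.of[TimeWindow](1000)))
    .<windowed transformation>(<window function>)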

If you need a custom trigger, you have to extend the Trigger abstract class, as in the CountOrEndOfWindowTrigger sketch above. Be aware that the Trigger API is still evolving and may change in future versions of Flink.

Evictors

Flink's windows can also use an evictor to remove elements from a window; it is set with the evictor(...) call shown at the beginning of this article. The evictor runs after the trigger fires and either before or after the window function is applied. The Evictor interface has two methods:

/**
 * Optionally evicts elements. Called before windowing function.
 *
 * @param elements The elements currently in the pane.
 * @param size The current number of elements in the pane.
 * @param window The {@link Window}
 * @param evictorContext The context for the Evictor
 */
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

/**
 * Optionally evicts elements. Called after windowing function.
 *
 * @param elements The elements currently in the pane.
 * @param size The current number of elements in the pane.
 * @param window The {@link Window}
 * @param evictorContext The context for the Evictor
 */
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

evictBefore(): Used to remove elements before the window function is executed.

evictAfter(): Used to remove elements after the window function is executed.

Flink ships with three pre-implemented evictors:

CountEvictor: keeps up to a user-specified number of elements in the window and evicts the remaining ones from the beginning of the window buffer.

DeltaEvictor: takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the remaining elements, and evicts those whose delta is greater than or equal to the threshold.

TimeEvictor: takes an interval as an argument, finds the maximum timestamp max_ts among the elements of the window, and removes all elements with a timestamp smaller than max_ts minus the interval, i.e. it evicts outdated elements.

By default, the pre-implemented evictors run before the window function.

Note: Specifying an evictor prevents any pre-aggregation, because all elements of the window must be passed to the evictor before the computation is applied.

Note: Flink provides no guarantees about the order of elements within a window. This means that although an evictor may remove elements from the beginning of the window, those are not necessarily the elements that arrived first or last.
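
As a sketch (not from the original article), an evictor is attached with evictor(); here the built-in CountEvictor keeps only the last 10 elements of each window (the count is arbitrary):

import org.apache.flink.streaming.api.windowing.evictors.CountEvictor

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    // keep at most the last 10 elements of each window; older elements are
    // evicted before the window function runs
    .evictor(CountEvictor.of(10))
    .<windowed transformation>(<window function>)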

How to deal with late elements?

When using event-time windows, elements can arrive late: an element that belongs to an earlier window reaches Flink only after the watermark has passed the end of that window and its computation has already been triggered, so the element is not processed as part of that window.

Watermarks are introduced in another article of mine: https://blog.csdn.net/x950913/article/details/106246807

By default, late elements are dropped once the watermark has passed the end of the window. However, Flink lets you specify a maximum allowed lateness for a window: it defines how long after the watermark has passed the end of the window elements may still arrive and be processed; its default value is 0. When the watermark passes the end of the window, a computation is triggered; elements that arrive within the allowed lateness are still added to the window and trigger the computation again. So, with the EventTimeTrigger, late but not dropped elements can cause a window to fire multiple times.

In other words, with the default allowed lateness of 0, elements that arrive after the end of the window (according to the watermark) are dropped.

The allowed lateness is specified as follows:

val input: DataStream[T] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .<windowed transformation>(<window function>)

Note that when using GlobalWindows, no data is ever considered late, because the end timestamp of the global window is Long.MAX_VALUE.

Output late elements to a side output stream

Late elements can be emitted to a side output stream.

First create an OutputTag to receive the late data, then tell the windowed operation to send late data to that tag:

val lateOutputTag = OutputTag[T]("late-data")

val input: DataStream[T] = ...

val result = input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .sideOutputLateData(lateOutputTag)
    .<windowed transformation>(<window function>)

val lateStream = result.getSideOutput(lateOutputTag)

 

Notes on handling late elements

When the allowed lateness is greater than 0, the window and its contents are kept after the watermark has passed the end of the window. In that case, a late but not dropped element can cause the window to fire again. These firings are called late firings, as opposed to the main firing, which is the first firing of the window. For session windows, late firings can additionally lead to window merging, because a late element may "bridge" the gap between two previously separate windows.

Note: The result emitted by a late firing should be treated as an update to a previous result: the window is evaluated once when the watermark reaches its end, and late elements trigger additional evaluations that update that result, so the output stream will contain multiple results for the same window. Depending on the application, you may need to deduplicate or otherwise reconcile these updated results.

What else can be done after the window is calculated?

The result of a windowed computation is again a DataStream, and its elements do not carry any information about the window they came from. If you need window metadata, you have to include it in the output elements yourself, for example in a ProcessWindowFunction. The only window-related information set on an output element is its timestamp, which is set to the maximum allowed timestamp of the window, i.e. end timestamp - 1, since elements with timestamps strictly before the end belong to this window while elements at or after the end belong to the next one (this holds for both event-time and processing-time windows).

Elements emitted by a windowed operation therefore always carry a timestamp, which can be an event-time or a processing-time timestamp.

This is not particularly useful for processing-time windows, but for event-time windows it means that, together with watermarks, results that belong to the same window upstream can be assigned to the same window in a downstream operation. This enables consecutive windowed operations, described below.

The effect of watermarks on windows 

A brief note on how watermarks interact with windows.

Watermarks interact with windows in two ways:

  1. A watermark triggers the computation of all windows whose maximum timestamp (end timestamp - 1) is smaller than the new watermark.
  2. The watermark is then forwarded as-is to downstream operations, so that they can handle the emitted results consistently in event time.

Consecutive windowed operations

As mentioned above, the results of a windowed operation carry timestamps, and together with watermarks this makes it possible to chain multiple windowed operations. For example, the results of an upstream window computation can be windowed again with a different key and a different function:

DataStream<Integer> input = ...;

DataStream<Integer> resultsPerKey = input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .reduce(new Summer());

DataStream<Integer> globalResults = resultsPerKey
    .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new TopKWindowFunction());

In this example, elements with event timestamps in [0, 5) seconds are aggregated by the first window, and the results it emits also fall into [0, 5) seconds in the second window, i.e. elements that belonged to the same window upstream end up in the same window downstream. Here, the first window computes the per-key sum for [0, 5) seconds, and the second window then computes the top-k of those sums for the same interval.

How to estimate the window storage size? 

Windows can be defined over long time ranges (days, weeks, or months) and can therefore accumulate very large state. Keep the following rules in mind when estimating the storage requirements of a window computation:

  • Flink creates one copy of each element per window it belongs to. A tumbling window therefore keeps a single copy of each element, since an element belongs to exactly one window (unless it is dropped as late). A sliding window, in contrast, creates several copies of each element, one per window it is assigned to, so a window size of 1 day with a slide of 1 second is something to avoid.
  • ReduceFunction, AggregateFunction, and FoldFunction can significantly reduce the storage requirements, because they aggregate elements eagerly and store only a single value per window. A ProcessWindowFunction, in contrast, has to store every element.
  • Using an Evictor prevents any pre-aggregation, because all elements of the window must be passed through the evictor before the computation.

 
