Flink window

Table of contents

window

Flink "buckets"

window classification

Classified by driver type

Classified by the rule for assigning data to windows

tumbling window

sliding window

session window

global window

window life cycle

window

Window: a window cuts an unbounded stream into bounded "data chunks" so that each chunk can be processed as a finite data set.

When an unbounded stream is processed, the stream is split into segments, the data in each segment is aggregated separately, and each segment outputs its result only once. This turns aggregation over an unbounded stream into aggregation over a series of bounded data sets.

Flink "buckets"

In Flink, a window divides the stream into multiple "buckets" of limited size. Each element is distributed to the bucket it belongs to, and when the end time of a window is reached, the data collected in that bucket is computed and the result is emitted.

Window processing flow:

Window creation:

Windows in Flink are not prepared statically in advance; they are created dynamically. A window is created only when the first element that falls within its range arrives.
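The bucket idea and the on-demand creation of windows can be sketched in plain Java. This is only an illustration and does not use Flink: the class BucketSketch, its methods, and the choice to key buckets by window start time are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Flink code: buckets keyed by window start time.
final class BucketSketch {
    private final long sizeMs;
    // No bucket exists up front; each one is created when its first element arrives.
    private final Map<Long, List<String>> buckets = new TreeMap<>();

    BucketSketch(long sizeMs) {
        this.sizeMs = sizeMs;
    }

    // Put one element into the bucket covering [start, start + sizeMs).
    void add(long timestampMs, String value) {
        long start = Math.floorDiv(timestampMs, sizeMs) * sizeMs;
        buckets.computeIfAbsent(start, s -> new ArrayList<>()).add(value);
    }

    // Number of buckets created so far.
    int bucketCount() {
        return buckets.size();
    }

    // Contents of the bucket starting at the given time (empty if it was never created).
    List<String> bucket(long start) {
        return buckets.getOrDefault(start, Collections.emptyList());
    }
}
```

With a size of 1000 ms, elements at 100 and 900 land in the bucket starting at 0, while an element at 1000 causes a second bucket to be created.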

window classification

Classified by driver type

Driver type refers to the criterion a window uses to decide where to start and stop slicing the data.

Slicing by time period: time window

Window size: the end time minus the start time gives the length of the period, which is the window size.

Flink uses the TimeWindow class to represent time windows.

The time range of a window is a left-closed, right-open interval, [start, end).
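The left-closed, right-open interval can be illustrated with a simplified stand-in for TimeWindow. SimpleTimeWindow below is a hypothetical class written for this example, not Flink's source; it only mirrors the idea that the largest timestamp belonging to a window is one millisecond before its end.

```java
// Simplified, hypothetical stand-in for a time window; only the interval logic is sketched.
final class SimpleTimeWindow {
    final long start; // inclusive
    final long end;   // exclusive

    SimpleTimeWindow(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Largest timestamp that still belongs to the window.
    long maxTimestamp() {
        return end - 1;
    }

    // Window size: end time minus start time.
    long size() {
        return end - start;
    }

    // True if the timestamp lies in [start, end).
    boolean contains(long timestamp) {
        return timestamp >= start && timestamp < end;
    }
}
```

For a window from 0 to 1000, timestamp 999 is inside and timestamp 1000 already belongs to the next window.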

Slicing by a fixed number of elements: count window

A count window slices data by element count: once a fixed number of elements has arrived, the calculation is triggered and the window is closed.

Classified by the rule for assigning data to windows

tumbling window

A tumbling window has a fixed size and divides the data into "uniform slices".

Windows do not overlap and leave no gaps; they sit "end to end", so every element is assigned to exactly one window.

A tumbling window can be defined by time or by element count, and it needs only one parameter: the window size. For example, a tumbling time window of 1 hour computes statistics once per hour, and a tumbling count window of 10 computes statistics once every 10 elements.
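Because windows of this kind sit end to end, the window that owns an element can be computed from the timestamp and the size alone. The helper below is a minimal sketch under that assumption; TumblingSketch and its method names are made up for the example, and timestamps are taken as milliseconds with windows aligned to 0.

```java
// Hypothetical helper: finds the fixed-size, non-overlapping window that owns a timestamp.
final class TumblingSketch {
    // Start (inclusive) of the window containing the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // End (exclusive) of the window containing the timestamp.
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }
}
```

With a size of one hour (3,600,000 ms), a timestamp that corresponds to 8:55 maps to the window starting at 8:00, and a timestamp of exactly 9:00 starts the next window.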

sliding window

A sliding window also has a fixed size, but the windows are not placed end to end; each one is "staggered" from the previous one by a fixed offset.

Parameters: window size and sliding step

The window size is fixed and is the length of each window, from its start time to its end time.

The sliding step is the interval between the start times of two consecutive windows, so it also determines how often a window result is computed.

When the sliding step is smaller than the window size, the windows overlap and an element can be assigned to several windows at once; the number of windows is the ratio of the window size to the sliding step.

For example, with a window size of 1 hour and a sliding step of 30 minutes, an element at 8:55 belongs to both [8:00, 9:00) and [8:30, 9:30), and an element at 8:10 belongs to both [7:30, 8:30) and [8:00, 9:00).
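The 8:55 and 8:10 examples can be checked with a small assignment routine. This is a sketch, not Flink's assigner: SlidingSketch is a hypothetical class, windows are assumed to be aligned to 0, and time is expressed in minutes since midnight to keep the numbers readable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every sliding window that contains a timestamp.
final class SlidingSketch {
    // Returns {start, end} pairs (end exclusive), latest window first.
    static List<long[]> assignWindows(long ts, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = ts - Math.floorMod(ts, slide);
        // Step backwards while the window [start, start + size) still covers ts.
        for (long start = lastStart; start > ts - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }
}
```

Calling assignWindows(535, 60, 30), where 535 is 8:55 in minutes, yields the windows starting at 510 (8:30) and 480 (8:00); the count of two matches the ratio of window size to sliding step.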

session window

  • When data arrives, a session window is opened. As long as more data keeps arriving, the session stays open; if no data is received for a period of time, the session is considered timed out and the window is closed automatically.
  • A session window can only be defined by time, because what ends a "session" is "no data arriving for a period of time".
  • Parameter: the session timeout (gap)
  • Problem: if the gap between two adjacent elements is larger than the specified size, they are treated as belonging to two session windows and the earlier window is closed. With out-of-order data, however, a late element may arrive whose timestamp lies between those two elements. The interval that was judged to contain no data is then not empty after all, and the remaining gaps may be smaller than the size, which means all three elements should belong to the same session window.
  • Solution: every new element first creates its own session window; the distance between it and the existing windows is then checked, and windows closer than the given size are merged.
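The create-then-merge approach from the last bullet can be sketched as follows. This is a simplified illustration rather than Flink's merging logic: SessionSketch is a hypothetical class, the timestamps are sorted up front instead of being merged incrementally as they arrive, and treating an element that lands exactly on a window's end as a new session is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds session windows by merging per-element windows.
final class SessionSketch {
    // Each timestamp opens its own window [ts, ts + gap); overlapping windows are merged.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted); // arrival order does not matter for the final result
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && ts < last[1]) {
                // The element falls inside the current session: extend it.
                last[1] = Math.max(last[1], ts + gap);
            } else {
                // The gap was exceeded: start a new session.
                merged.add(new long[] {ts, ts + gap});
            }
        }
        return merged;
    }
}
```

With a gap of 10, the elements 0 and 15 form two sessions. If a late element with timestamp 8 is added, it bridges them and all three end up in the single session [0, 25).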

global window

  • Globally valid: all data with the same key is assigned to the same window (equivalent to having no windows at all)
  • By default it never triggers a calculation. To compute over the data, you must define a custom trigger (Trigger)

The count window (Count Window) in Flink is implemented on top of the global window.
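How a count window can sit on top of a global window is easier to see in a small simulation. The class below does not call Flink; CountWindowSketch and its methods are hypothetical. It keeps one never-ending bucket per key and adds a trigger that fires and clears the bucket once a fixed number of elements has been collected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation: a per-key global bucket plus a count-based trigger.
final class CountWindowSketch {
    private final int maxCount;
    // One "global window" per key; it has no end time and never fires on its own.
    private final Map<String, List<Integer>> globalWindow = new HashMap<>();

    CountWindowSketch(int maxCount) {
        this.maxCount = maxCount;
    }

    // Adds a value; returns the fired window contents, or null if the trigger did not fire.
    List<Integer> add(String key, int value) {
        List<Integer> bucket = globalWindow.computeIfAbsent(key, k -> new ArrayList<>());
        bucket.add(value);
        if (bucket.size() < maxCount) {
            return null; // trigger condition not met yet
        }
        List<Integer> fired = new ArrayList<>(bucket);
        bucket.clear(); // purge so the next count starts from zero
        return fired;
    }
}
```

With a count of 3, the third element for key "a" fires the window with all three values, and elements for other keys do not affect that count.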

window life cycle

① Window creation: the window assigner specifies the type and basic information of a window, but creation itself is driven by data. A window is created when the first element that belongs to it arrives.

② Triggering of window calculation: the trigger fires the window function, which performs the calculation.

③ Destruction of the window: in general, when time reaches the window's end, the calculation is triggered and its output emitted, then the state is cleared and the window is destroyed. In special scenarios, destruction and triggering happen at different times.

Under event-time semantics, if an allowed lateness is set, the window is not destroyed when the watermark reaches the window's end time; the window is only deleted completely at the window's end time plus the allowed lateness specified by the user.
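The difference between the moment a window fires and the moment it is destroyed can be written as two conditions on the watermark. This is a sketch only: LifecycleSketch is a hypothetical class, and comparing against end - 1 (the last timestamp inside the half-open window) rather than end is an assumption of this example.

```java
// Hypothetical helper: when an event-time window [start, end) fires and when it is destroyed.
final class LifecycleSketch {
    // The window fires once the watermark reaches the last timestamp inside it.
    static boolean shouldFire(long end, long watermark) {
        return watermark >= end - 1;
    }

    // State is kept for allowedLateness longer, then the window is destroyed.
    static boolean shouldDestroy(long end, long allowedLateness, long watermark) {
        return watermark >= end - 1 + allowedLateness;
    }
}
```

For a window ending at 1000 with an allowed lateness of 200, a watermark of 999 fires the window but does not destroy it; destruction only happens once the watermark reaches 1199. With an allowed lateness of 0 the two moments coincide, which is the general case described in ③.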

④Window API:

Origin blog.csdn.net/qq_51235856/article/details/130599711