Flink Windows (1) - Basic Concepts

Window: cuts unbounded data into bounded "chunks" so that an unbounded stream can be processed more efficiently.

When processing an unbounded data stream, the stream is split into segments, each segment is aggregated separately, and the result for each segment is output only once. This effectively turns aggregation over an unbounded stream into aggregation over bounded data sets.

Flink “bucket”

In Flink, a window cuts the stream into multiple "buckets" of limited size, and each element is distributed to its corresponding bucket. When the end time of a window is reached, the data collected in its bucket is processed and the calculation is performed.

Window processing flow:

Window creation:

Windows in Flink are not prepared statically in advance but are created dynamically: when data that falls within a window's range arrives, the corresponding window is created.
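The on-demand creation described above can be illustrated with a minimal sketch in plain Python. This is not Flink code; the function name `assign_to_buckets`, the 5-second window size, and the sample events are assumptions made for the illustration.

```python
def assign_to_buckets(events, size_ms=5_000):
    """Group (timestamp_ms, value) pairs into [start, end) buckets.

    A bucket only comes into existence when the first element that
    falls into its range arrives; empty ranges never get a bucket.
    """
    buckets = {}
    for ts, value in events:
        start = ts - (ts % size_ms)          # align to a window boundary
        window = (start, start + size_ms)    # left-closed, right-open
        buckets.setdefault(window, []).append(value)
    return buckets


events = [(1_000, "a"), (4_999, "b"), (12_000, "c")]
buckets = assign_to_buckets(events)
```

No element falls into the range [5000, 10000), so no bucket is ever created for it.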

Window classification

Classified by driver type

The driver type is the criterion a window uses to decide where to start and stop collecting data.


Cut by time period: time window

Window size: the end time minus the start time gives the length of this period, which is the size of the window.

Flink uses the TimeWindow class to represent time windows.

The time range of a window is a left-closed, right-open interval, [start, end).
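The left-closed, right-open range can be made concrete with a small stand-in class. This is a sketch in plain Python, not Flink's `TimeWindow` source; the class name `TimeWindowSketch` and its methods are hypothetical and only mirror the idea of a start, an end, and a largest timestamp that still belongs to the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeWindowSketch:
    """Hypothetical stand-in for a time window covering [start, end)."""
    start: int  # inclusive, in milliseconds
    end: int    # exclusive, in milliseconds

    def size(self) -> int:
        # Window size is the end time minus the start time.
        return self.end - self.start

    def max_timestamp(self) -> int:
        # Largest millisecond timestamp that still belongs to the window.
        return self.end - 1

    def contains(self, ts: int) -> bool:
        return self.start <= ts < self.end


w = TimeWindowSketch(start=0, end=10_000)
```

An element stamped exactly at `end` is outside this window and belongs to the next one, which is why adjacent windows never share an element.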


Cut by a fixed number of elements: count window

A count window collects data by number of elements. When the count reaches a fixed number, the calculation is triggered and the window is closed.
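A minimal plain-Python sketch of the counting behaviour follows. The generator name `count_windows` is an assumption, and the sketch ignores keys and state handling entirely.

```python
def count_windows(elements, size):
    """Yield lists of `size` consecutive elements.

    Each list is emitted (the window "fires" and closes) as soon as it
    holds `size` elements. A trailing partial window never fires here.
    """
    current = []
    for element in elements:
        current.append(element)
        if len(current) == size:
            yield current
            current = []


windows = list(count_windows(range(1, 8), size=3))
```

With seven input elements and a size of 3, two windows fire and the seventh element stays buffered, waiting for two more.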

Classified by the rule for assigning data to windows

Tumbling window

A tumbling window has a fixed size and slices the data "evenly".

Windows neither overlap nor leave gaps; they sit end to end, so each element is assigned to exactly one window.

A tumbling window can be defined by time or by number of elements, and it takes a single parameter, the window size. For example, a tumbling time window of 1 hour produces statistics once per hour, and a tumbling count window of 10 produces statistics once every 10 elements.
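Because these windows sit end to end, the window of an element can be computed directly from its timestamp. The sketch below is plain Python, not Flink's API; the names `tumbling_window` and `hm`, the use of minutes since midnight, and the optional `offset` parameter are assumptions for the illustration.

```python
def tumbling_window(ts, size, offset=0):
    """Return the single [start, end) window of length `size` containing ts.

    Window starts are aligned to multiples of `size`, shifted by `offset`.
    """
    start = ts - (ts - offset) % size
    return (start, start + size)


def hm(hours, minutes):
    """Convert a clock time to minutes since midnight."""
    return hours * 60 + minutes


hourly = tumbling_window(hm(8, 55), size=60)              # 8:00 to 9:00
shifted = tumbling_window(hm(8, 55), size=60, offset=15)  # 8:15 to 9:15
```

The function always returns exactly one window, which matches the "each element belongs to one and only one window" property.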

Sliding window

The size of a sliding window is also fixed, but the windows are not laid end to end; each one is "staggered" from the previous one by a certain offset.

Parameters: window size and slide (sliding step)

The window size is fixed and is the length of each window, that is, its end time minus its start time.

The slide is the interval between the start times of two adjacent windows, so it also determines how often a window calculation is produced.

When the slide is smaller than the window size, the windows overlap and an element may be assigned to several windows at once; the number of windows is the ratio of window size to slide.

For example, with a window size of 1 hour and a slide of 30 minutes, the data at 8:55 belongs to both [8:00, 9:00) and [8:30, 9:30), and the data at 8:10 belongs to both [8:00, 9:00) and [7:30, 8:30).
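The 8:55 and 8:10 example can be checked with a short plain-Python sketch. It is not Flink's API; the function names and the use of minutes since midnight are assumptions, and window starts are assumed to be aligned to multiples of the slide.

```python
def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length `size` that contains ts."""
    windows = []
    start = ts - ts % slide           # latest window start that is <= ts
    while start > ts - size:          # walk back while ts is still covered
        windows.append((start, start + size))
        start -= slide
    return windows


def hm(hours, minutes):
    """Convert a clock time to minutes since midnight."""
    return hours * 60 + minutes


at_855 = sliding_windows(hm(8, 55), size=60, slide=30)
at_810 = sliding_windows(hm(8, 10), size=60, slide=30)
```

Each call returns size / slide = 2 windows, which is the ratio mentioned above.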

Session window

  • When data arrives, a session window is opened. As long as data keeps arriving, the session is maintained; if no data is received for a period of time, the session is considered timed out and the window closes automatically.
  • A session window can only be defined based on time, because the sign that a "session" has ended is "no data arrives for a period of time"
  • Parameters: session timeout
  • A problem that may occur: when the interval between two adjacent elements is greater than the specified size, they are treated as belonging to two session windows and the earlier window is closed. If the data is out of order, however, a late element may arrive whose timestamp falls between those two elements. The gap judged earlier was then not really "no data at all", and the resulting shorter intervals may each be smaller than size, which means the three elements should have belonged to the same session window.
  • Solution: every time a new element arrives, a new session window is created for it; the distance to the existing windows is then checked, and windows closer than the given size are merged.
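The create-then-merge idea in the last bullet can be sketched in plain Python. This is a batch-style illustration rather than Flink's incremental implementation; the function name `merge_sessions` is an assumption, and windows are merged when the distance between elements is strictly less than the gap, following the description above.

```python
def merge_sessions(timestamps, gap):
    """Give each timestamp its own window [ts, ts + gap), then merge overlaps."""
    windows = sorted((ts, ts + gap) for ts in timestamps)
    merged = []
    for start, end in windows:
        if merged and start < merged[-1][1]:
            # Overlaps the previous session: extend it instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


in_order = merge_sessions([1, 15], gap=10)      # two separate sessions
with_late = merge_sessions([1, 15, 8], gap=10)  # late element 8 bridges them
```

The late element at 8 is within the gap of both neighbours, so the two earlier sessions collapse into one.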

Global window

  • Globally valid: all data with the same key is assigned to the same window (equivalent to having no window at all)
  • By default no calculation is ever triggered. To make it calculate and process data, you also need to define a custom "Trigger"

Count windows in Flink are implemented on top of the global window under the hood.
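The relationship between a global window, a trigger, and a count window can be sketched in plain Python. The class `GlobalWindowSketch` and the helper `count_trigger` are hypothetical names; this is an illustration of the idea, not Flink's implementation.

```python
class GlobalWindowSketch:
    """Hypothetical single window per key that fires only when its trigger says so."""

    def __init__(self, trigger):
        self.trigger = trigger   # callable: buffered elements -> bool
        self.buffers = {}        # key -> elements collected so far
        self.results = []        # (key, elements) for every firing

    def add(self, key, value):
        buffer = self.buffers.setdefault(key, [])
        buffer.append(value)
        if self.trigger(buffer):
            self.results.append((key, list(buffer)))
            buffer.clear()       # purge after firing, as a count window would


def count_trigger(n):
    """Fire once a key has buffered n elements."""
    return lambda buffer: len(buffer) >= n


never = GlobalWindowSketch(trigger=lambda buffer: False)
by_count = GlobalWindowSketch(trigger=count_trigger(2))
for key, value in [("a", 1), ("b", 2), ("a", 3), ("a", 4)]:
    never.add(key, value)
    by_count.add(key, value)
```

With a trigger that never fires, data only accumulates; swapping in a count-based trigger turns the same structure into a per-key count window.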

Window life cycle

①Creation of the window: the type and basic information of the window are specified by the window assigner, but creation itself is driven by data. The window is created when the first element that belongs to it arrives.

②Triggering of the window calculation: the trigger fires the window function, which performs the calculation on the data.

③Destruction of the window: normally, when time reaches the window's end, the calculation is triggered and the result is output, and then the state is cleared and the window is destroyed. In special scenarios, destroying the window and triggering the calculation happen at different times.

Under event-time semantics, if an allowed lateness is set, the window is not destroyed when the watermark reaches the window's end time; the window is only fully deleted at the window's end time plus the user-specified allowed lateness.
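The gap between firing and destruction can be sketched as a small state function in plain Python. The function name and the state labels are assumptions, and the comparisons follow the simplified description above; the exact boundary conditions inside Flink may differ slightly.

```python
def window_state(watermark, window_end, allowed_lateness=0):
    """Classify a window by comparing the watermark with its end time."""
    if watermark < window_end:
        return "collecting"
    if watermark < window_end + allowed_lateness:
        # The result has been emitted, but state is kept for late elements.
        return "fired, still accepting late data"
    return "destroyed"


states = [window_state(wm, window_end=10, allowed_lateness=5)
          for wm in (9, 10, 14, 15)]
```

With no allowed lateness the middle state disappears, and the window is destroyed at the same moment it fires.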

④Window API:

Origin blog.csdn.net/qq_51235856/article/details/135111260