Window Mechanism in Stream Computing
What Is a Window
In stream computing, data arrives continuously, so it is impossible to wait for all of it before starting to process. A window splits the unbounded stream into bounded batches, and a computation can then be applied to the data in each window.
Basic Functions of Typical Windows
This article covers tumbling windows, sliding windows, and session windows.
Tumbling Window
Tumbling window features:
Windows do not overlap, so each record belongs to exactly one window
The window length is fixed
When the time is greater than or equal to the window end, the window's result is output in a single firing
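The assignment rule above can be sketched in a few lines. This is a minimal illustration and not the API of any particular engine: the function name tumbling_window, the millisecond timestamps, and the half-open [start, end) convention are all assumptions made for the example.

```python
def tumbling_window(timestamp_ms: int, size_ms: int):
    """Return the (start, end) bounds of the single window holding the timestamp.

    Assumes windows are aligned to multiples of size_ms and are half-open:
    start is inclusive, end is exclusive.
    """
    start = timestamp_ms - (timestamp_ms % size_ms)
    return (start, start + size_ms)


# Three records against a 5-second window: the first two share a window,
# the third falls exactly on a boundary and opens the next one.
for ts in (1_000, 4_999, 5_000):
    print(ts, tumbling_window(ts, 5_000))
```

Because every timestamp maps to exactly one start value, a record can never land in two windows.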
Sliding Window
Sliding window features:
The window slides forward continuously by a fixed step, and the window length is fixed
Windows may overlap
When the window length is greater than the step, a record may belong to multiple windows
When the window length is less than the step, a record may not belong to any window
When the time is greater than or equal to the window end, the window's result is output in a single firing
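The overlap and gap cases listed above can be checked with a small sketch. The function name sliding_windows and the alignment of window starts to multiples of the step are assumptions for illustration, not the behaviour of a specific engine.

```python
def sliding_windows(timestamp_ms: int, size_ms: int, slide_ms: int):
    """Return every half-open (start, end) window that contains the timestamp.

    Assumes window starts are aligned to multiples of slide_ms.
    """
    windows = []
    # Latest window start that is not after the timestamp.
    start = timestamp_ms - (timestamp_ms % slide_ms)
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows


# Length 10 s, step 5 s: the record belongs to two overlapping windows.
print(sliding_windows(7_000, 10_000, 5_000))
# Length 5 s, step 10 s: the record falls in the gap between windows.
print(sliding_windows(7_000, 5_000, 10_000))
```

With the length equal to the step, the function returns exactly one window, which is the tumbling case.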
Session Window
The session gap is the interval between consecutive records. A maximum gap is usually configured, for example 1 minute; when the gap between two records is greater than 1 minute, the records are assigned to different sessions.
The window length varies, because it depends on when records arrive
When the time is greater than or equal to the window end, the window's result is output in a single firing
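A minimal sketch of session grouping follows. It assumes that a session window ends one gap after its last record and that a gap strictly greater than the configured maximum starts a new session, as described above; the function name session_windows is hypothetical, and a real engine would merge windows incrementally rather than sort a finished list.

```python
def session_windows(timestamps_ms, gap_ms: int):
    """Group timestamps into sessions.

    Returns a list of (start, end, members), where end is the last record's
    timestamp plus gap_ms (an assumption made for this sketch).
    """
    sessions = []
    for ts in sorted(timestamps_ms):
        if sessions and ts - sessions[-1][-1] <= gap_ms:
            sessions[-1].append(ts)   # still within the gap: extend the session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return [(s[0], s[-1] + gap_ms, s) for s in sessions]


# With a 1-minute gap, the 70-second pause splits the records into two sessions
# of different lengths.
for window in session_windows([0, 30_000, 100_000, 120_000], 60_000):
    print(window)
```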
Handling of Late Data
Definition of late data: if a record arrives for a window after the watermark has already triggered that window's output, the record is considered late.
Solutions:
Discard the record directly (default)
Set an allowed lateness. In this case the window's data is not cleared as soon as the window fires; it is kept for an additional "late time". If a record arrives within this period, it is still included in the calculation
Route late records into a separate stream and let the user decide what to do with them (side output stream)
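The three options can be illustrated together in a toy tumbling-window operator that sums values. Everything here is an assumption made for the sketch: the class name LateAwareWindow, the rule that a late record within the allowed lateness re-emits an updated result, and the plain list standing in for a side output stream. Ignoring that list corresponds to the default of discarding late records.

```python
class LateAwareWindow:
    """Toy tumbling-window sum driven by explicit watermarks."""

    def __init__(self, size_ms: int, allowed_lateness_ms: int = 0):
        self.size_ms = size_ms
        self.allowed_lateness_ms = allowed_lateness_ms
        self.watermark = float("-inf")
        self.sums = {}         # window start -> running sum (window state)
        self.fired = set()     # window starts that already produced output
        self.emitted = []      # (start, end, sum) results
        self.side_output = []  # records that arrived after state was cleared

    def on_event(self, ts: int, value: int) -> None:
        start = ts - ts % self.size_ms
        end = start + self.size_ms
        if self.watermark >= end + self.allowed_lateness_ms:
            # State is gone: hand the record to the side output.
            self.side_output.append((ts, value))
            return
        self.sums[start] = self.sums.get(start, 0) + value
        if start in self.fired:
            # Late, but within the allowed lateness: emit an updated result.
            self.emitted.append((start, end, self.sums[start]))

    def on_watermark(self, watermark: int) -> None:
        self.watermark = max(self.watermark, watermark)
        for start in sorted(self.sums):
            end = start + self.size_ms
            if start not in self.fired and self.watermark >= end:
                self.fired.add(start)
                self.emitted.append((start, end, self.sums[start]))
        # Clear state only once the allowed lateness has also passed.
        expired = [s for s in self.sums
                   if self.watermark >= s + self.size_ms + self.allowed_lateness_ms]
        for start in expired:
            del self.sums[start]
            self.fired.discard(start)


w = LateAwareWindow(size_ms=10, allowed_lateness_ms=5)
w.on_event(1, 1)
w.on_event(2, 2)
w.on_watermark(10)   # window [0, 10) fires with sum 3
w.on_event(3, 4)     # late but allowed: updated sum 7 is emitted
w.on_watermark(15)   # allowed lateness has passed, state is cleared
w.on_event(4, 8)     # too late: goes to the side output
print(w.emitted, w.side_output)
```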
Incremental Calculation and Full Calculation
Incremental calculation: each record is folded into the running result as soon as it arrives, but the result is not output until the window fires
Full calculation: each record is first placed in a buffer that is kept in state; when the window fires, all buffered records are taken out and computed together
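The difference shows up in what each approach has to keep in state. The sketch below uses two hypothetical classes: an average, which can be maintained incrementally from a sum and a count, and a median, which is used here as an example of a result that needs the whole buffer.

```python
import statistics


class IncrementalAverage:
    """Incremental calculation: state is two numbers, however many records arrive."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value: float) -> None:
        # Each record is folded in on arrival; nothing is output yet.
        self.total += value
        self.count += 1

    def result(self) -> float:
        # Called when the window fires.
        return self.total / self.count


class FullMedian:
    """Full calculation: every record is buffered until the window fires."""

    def __init__(self):
        self.buffer = []

    def add(self, value: float) -> None:
        self.buffer.append(value)

    def result(self) -> float:
        # All buffered records are processed together at firing time.
        return statistics.median(self.buffer)


avg, med = IncrementalAverage(), FullMedian()
for v in (3, 1, 2):
    avg.add(v)
    med.add(v)
print(avg.result(), med.result(), len(med.buffer))
```

The incremental version keeps constant-size state, while the buffer in the full version grows with the number of records in the window.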
EMIT Trigger
Background: a normal window outputs only when the window ends. For example, if the window length is one day, the result is output only at the end of the day, and the point of real-time computation is lost.
Purpose: an EMIT trigger is a mechanism for outputting a window's contents early. For example, a window with a length of one day can be set to output every 5 s, so that downstream consumers get the window's results sooner.
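As a rough sketch of the idea, the function below replays one window that starts at time 0 and emits the partial sum at a fixed interval before the final result. The name emit_schedule, the (time, value) record shape, and the choice to emit even when the partial result has not changed are assumptions for illustration; real engines expose this through their own trigger or EMIT configuration.

```python
def emit_schedule(events, window_end_ms: int, emit_interval_ms: int):
    """Return (fire_time, sum_so_far, is_final) outputs for one window [0, window_end_ms).

    events is an iterable of (time_ms, value) pairs.
    """
    in_window = sorted(e for e in events if 0 <= e[0] < window_end_ms)
    outputs = []
    total = 0
    i = 0
    # Early firings at every interval strictly before the window end.
    for fire_time in range(emit_interval_ms, window_end_ms, emit_interval_ms):
        while i < len(in_window) and in_window[i][0] < fire_time:
            total += in_window[i][1]
            i += 1
        outputs.append((fire_time, total, False))
    # Final firing at the window end includes all remaining records.
    total += sum(value for _, value in in_window[i:])
    outputs.append((window_end_ms, total, True))
    return outputs


# A 15-second window with an early result every 5 seconds.
for out in emit_schedule([(1_000, 1), (6_000, 2), (12_000, 3)], 15_000, 5_000):
    print(out)
```

Downstream consumers see partial sums at 5 s and 10 s instead of waiting for the single final result at 15 s.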