[Big Data] Detailed Explanation of Flink (2): Core Part I


14. What are the four cornerstones of Flink?

The four cornerstones of Flink are:

  • Checkpoint
  • State
  • Time
  • Window

15. Talk about the Flink window and the division mechanism.

Window concept: unbounded stream data is divided into multiple chunks (for example, by time interval), and statistics (aggregations) are computed on each chunk separately.

Flink supports two ways of dividing windows: by time and by count. The former is time-driven, the latter is data-driven.


  • Time-driven: the Time Window can be divided into tumbling windows and sliding windows.
  • Data-driven: the Count Window can likewise be divided into tumbling windows and sliding windows.
  • A window has two important attributes (window length size and sliding interval interval); tumbling windows are distinguished from sliding windows by comparing the two:
    • If size = interval, a tumbling-window is formed (no overlapping data).
    • If size (e.g. 1 min) > interval (e.g. 30 s), a sliding-window is formed (with overlapping data).

Combining these yields four basic windows (a combined code sketch follows the four definitions):

(1) Time-based tumbling window: time-tumbling-window, a time window with no overlapping data. Example: timeWindow(Time.seconds(5)).

(2) Time-based sliding window: time-sliding-window, a time window with overlapping data. Example: timeWindow(Time.seconds(10), Time.seconds(5)).


(3) Count-based tumbling window: count-tumbling-window, a count window with no overlapping data. Example: countWindow(5).


(4) Count-based sliding window: count-sliding-window, a count window with overlapping data. Example: countWindow(10, 5).
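The four setting methods combined into one minimal sketch, following the older DataStream API style used in this article (newer Flink versions replace timeWindow with window(TumblingProcessingTimeWindows.of(...)) and similar assigners). The (word, count) tuple stream is a hypothetical example input:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FourBasicWindows {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (word, count) stream; any keyed stream works the same way.
        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        // (1) time-tumbling-window: size = interval = 5 s, no overlap
        input.keyBy(0).timeWindow(Time.seconds(5)).sum(1);

        // (2) time-sliding-window: size 10 s > interval 5 s, overlapping data
        input.keyBy(0).timeWindow(Time.seconds(10), Time.seconds(5)).sum(1);

        // (3) count-tumbling-window: one window per 5 elements of a key
        input.keyBy(0).countWindow(5).sum(1);

        // (4) count-sliding-window: 10-element window evaluated every 5 elements
        input.keyBy(0).countWindow(10, 5).sum(1);

        env.execute("four basic windows");
    }
}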

Flink also supports a special window: the session window (SessionWindows).

The session window assigner groups elements by session activity. Compared with tumbling and sliding windows, session windows have no overlap and no fixed start and end times.

If a session window receives no elements for a certain period of time, i.e. an inactivity gap occurs, the window is closed.

A session window is configured with a session gap, which defines the length of the inactive period. When such an inactive period occurs, the current session closes and subsequent elements are assigned to a new session window, as in the sketch below.
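A minimal sketch with a 5-minute gap, reusing the hypothetical keyed tuple stream from the previous sketch; EventTimeSessionWindows is Flink's session window assigner:

import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// A new session starts whenever a key sees no elements for 5 minutes.
input.keyBy(0)
     .window(EventTimeSessionWindows.withGap(Time.minutes(5)))
     .sum(1);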


16. Introduce Flink's window mechanism. How do its components work together?

The components of the window mechanism cooperate as follows:
WindowAssigner

1. The window operator (WindowOperator) is responsible for window processing. As the data stream continuously enters the operator, each arriving element is first handed to the WindowAssigner. The WindowAssigner decides which window or windows the element is placed into, and may create new windows. Because one element can be placed into multiple windows (this happens with sliding windows; it does not happen with tumbling windows), multiple windows can exist at the same time. Note that a Window itself is just an identifier that may carry some metadata, such as the start and end time of a TimeWindow; the window does not store its elements. The elements are actually stored in Key/Value state, where the key is the Window and the value is the collection of elements (or an aggregated value). This implementation relies on Flink's state mechanism to make windows fault-tolerant.
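To make the assignment step concrete, here is a minimal custom WindowAssigner sketch modeled on Flink's tumbling event-time assigner (the class and its simplifications are mine, not Flink's actual implementation). It returns exactly one TimeWindow per element; a sliding assigner would return several:

import java.util.Collection;
import java.util.Collections;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class TumblingAssignerSketch extends WindowAssigner<Object, TimeWindow> {
    private final long size; // window length in milliseconds

    public TumblingAssignerSketch(long sizeMillis) { this.size = sizeMillis; }

    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp,
                                                WindowAssignerContext context) {
        // The Window is only an identifier (start/end); the elements themselves
        // are stored in keyed state, not inside the window.
        long start = timestamp - (timestamp % size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    }

    @Override
    public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
        return EventTimeTrigger.create();
    }

    @Override
    public TypeSerializer<TimeWindow> getWindowSerializer(ExecutionConfig executionConfig) {
        return new TimeWindow.Serializer();
    }

    @Override
    public boolean isEventTime() { return true; }
}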

WindowTrigger

2. Each window has its own Trigger, which uses timers to decide when the window can be evaluated or cleared. The Trigger is invoked whenever an element is added to the window or a previously registered timer fires. The Trigger's result can be:

  • Continue (do nothing)
  • Fire (trigger evaluation and process the window's data)
  • Purge (trigger cleanup and remove the window together with its data)
  • Fire + Purge (trigger evaluation and cleanup: process the data, then remove the window and its data)

When data arrives, the Trigger is called to decide whether evaluation should be triggered. If the result is only Fire, the window is evaluated and left as it is: its data is not cleaned up and is evaluated again the next time the Trigger fires, so the window's data may be computed repeatedly until the trigger result includes Purge. The window and its data are freed only when they are purged, so until then the window keeps occupying memory.
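A minimal custom Trigger sketch illustrating these results (a hypothetical trigger, not one of Flink's built-ins): it fires on every element without purging, and fires-and-purges when the event-time timer at the window end goes off. It would be attached with .window(...).trigger(new EveryElementTrigger()):

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class EveryElementTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   TimeWindow window, TriggerContext ctx) {
        // Register a timer for the window end so the window is eventually purged.
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.FIRE;               // evaluate now, keep the contents
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp()
                ? TriggerResult.FIRE_AND_PURGE   // final evaluation, then clean up
                : TriggerResult.CONTINUE;        // some other timer: do nothing
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;           // processing-time timers unused here
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}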

Trigger firing process

3. When the Trigger fires, the window's collection of elements is handed to the Evictor (if one is specified). The Evictor traverses the list of elements in the window and decides how many of the elements that entered the window first must be removed; the remaining elements are handed to the user-specified function for window evaluation. If no Evictor is set, all elements in the window are handed to the function.
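For example, Flink ships a CountEvictor that keeps only the last n elements of a window; a sketch reusing the hypothetical tuple stream from earlier:

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.evictors.CountEvictor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Before the window function runs, evict everything except the last 10 elements.
input.keyBy(0)
     .window(TumblingEventTimeWindows.of(Time.seconds(10)))
     .evictor(CountEvictor.of(10))
     .sum(1);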

4. The evaluation function receives the window's elements (possibly filtered by the Evictor), computes the window's result value(s), and sends them downstream; the result can be one value or several. The DataStream API accepts different kinds of evaluation functions, including the predefined sum(), min(), max(), as well as ReduceFunction, FoldFunction, and WindowFunction. WindowFunction is the most general evaluation function; the other predefined functions are essentially implemented on top of it.

5. Flink optimizes some aggregating window computations (such as sum and min): the aggregation does not need to keep all elements of the window, only a single result value. Each element entering the window runs the aggregate function once and updates that result, which greatly reduces memory consumption and improves performance. But if the user specifies an Evictor, this optimization is disabled, because the Evictor has to traverse all elements in the window, so all elements must be kept.
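A sketch of the contrast in points 4 and 5, again on the hypothetical (word, count) stream: reduce() aggregates incrementally (one running value per key and window), while a generic WindowFunction receives all buffered elements at once:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incremental aggregation: only one running value is kept per key and window.
input.keyBy(0)
     .timeWindow(Time.seconds(10))
     .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
         @Override
         public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a,
                                               Tuple2<String, Integer> b) {
             return Tuple2.of(a.f0, a.f1 + b.f1);
         }
     });

// General WindowFunction: every element is buffered until the window fires.
input.keyBy(0)
     .timeWindow(Time.seconds(10))
     .apply(new WindowFunction<Tuple2<String, Integer>, String, Tuple, TimeWindow>() {
         @Override
         public void apply(Tuple key, TimeWindow window,
                           Iterable<Tuple2<String, Integer>> elements,
                           Collector<String> out) {
             int sum = 0;
             for (Tuple2<String, Integer> e : elements) {
                 sum += e.f1;
             }
             out.collect(key + " @ " + window.getEnd() + ": " + sum);
         }
     });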

17. Talk about Flink's Time concept.

Flink's stream processing involves different notions of time, which fall into three time mechanisms:


  • EventTime (event time)
    • The time at which the event occurred, for example when a link on a website was clicked; each log record carries its own generation time.
    • A time window defined on EventTime forms an EventTimeWindow and requires that the message itself carries the EventTime.
  • IngestionTime (ingestion time)
    • The time at which data enters Flink, i.e. when a source operator receives the data, for example data consumed from Kafka by a source.
    • A time window defined on IngestionTime forms an IngestionTimeWindow, based on the source's system time.
  • ProcessingTime (processing time)
    • The time at which a Flink operator executes an operation, for example the system time when a timeWindow processes its data; this is the default time characteristic.
    • A time window defined on ProcessingTime forms a ProcessingTimeWindow, based on the operator's system time.

In Flink stream processing, most business scenarios use EventTime; ProcessingTime or IngestionTime is generally only resorted to when EventTime is unavailable.

18. How are these time characteristics set when calling the API?

final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// use processing time
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

// use ingestion time
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

// use event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

19. In streaming data processing, have you encountered problems such as data delay, and how did you deal with them?

Yes, data delay does come up. For example:

Case 1:

  • Suppose you are on your way to an underground parking garage and order takeaway on your phone.
  • After choosing the food you pay online; it is 11:50. Just then you walk into the underground garage, where there is no cell signal, so the online payment does not go through immediately and the payment system keeps retrying the "payment" operation.
  • By the time you find your car and drive out of the garage it is already 12:05. Your phone has signal again, the payment data is finally delivered to the takeaway payment system, and the payment completes.
  • In this scenario, the event time of the payment data is 11:50, while its processing time is 12:05.

Case 2:

  • An App records all of a user's click behaviors and sends the logs back (if the network is poor, they are saved locally first and sent back later).
  • User A operates the App at 11:02, and user B operates it at 11:03.
  • However, user A's network is unstable and the log upload is delayed; as a result, the server first receives user B's 11:03 message and only then user A's 11:02 message. The messages are out of order.

Data delay and out-of-order messages are generally handled with WaterMark (watermarks), summarized as follows:

A watermark is essentially a timestamp, and Flink can insert watermarks into a data stream:

  • The watermark does not change the original EventTime of the events.
  • Once watermarks are added to the stream, window evaluation is triggered according to the watermark time; in other words, the WaterMark is what triggers window computation.
  • The watermark time is set a few seconds less than the event time; the difference is the maximum tolerated delay.
  • Watermark time = event time - allowed delay time (for example: 10:09:57 = 10:10:00 - 3 s). A code sketch follows this list.
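A sketch matching the 3-second example above, using the legacy BoundedOutOfOrdernessTimestampExtractor (newer Flink versions use WatermarkStrategy.forBoundedOutOfOrderness instead). The MyEvent type, the events stream, and getEventTime() are hypothetical names:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Watermark = max event time seen so far - 3 s allowed delay.
DataStream<MyEvent> withWatermarks = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(3)) {
            @Override
            public long extractTimestamp(MyEvent element) {
                return element.getEventTime(); // event time in epoch milliseconds
            }
        });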

20. Explain the principle of WaterMark?


Suppose windows fire every 10 minutes, and consider the window covering 12:00 - 12:10. A record that belongs in this window is delayed and only arrives at 12:12. Without a watermark, the 12:00 - 12:10 window would already have been closed by then, so the record could only be sent to the next window, making the computation inaccurate.

Now add a watermark that allows a 2-minute delay. The window fires only when the watermark reaches the window end: when a record with event time 12:12 arrives, the watermark becomes 12:12 - 2 minutes = 12:10 >= the window end time, so only then does the 12:00 - 12:10 window fire, and data delayed by up to 2 minutes has still been counted into it.

The DataStream API provides the TimestampAssigner interface to define how timestamps are extracted. It has two sub-interfaces: the AssignerWithPeriodicWatermarks interface and the AssignerWithPunctuatedWatermarks interface.

There are two ways to extract timestamps and generate WaterMarks:

  • AssignerWithPeriodicWatermarks
    • Generates WaterMarks periodically: the system inserts a WaterMark into the stream at fixed intervals.
    • The default period is 200 milliseconds; it can be changed with ExecutionConfig.setAutoWatermarkInterval().
    • BoundedOutOfOrdernessTimestampExtractor is based on periodic WaterMarks.
  • AssignerWithPunctuatedWatermarks
    • Follows no periodic rule; WaterMark generation is driven by the records themselves and can happen at any point in the stream.
| | Periodic WaterMark | Punctuated WaterMark |
|---|---|---|
| Generation | Generated periodically | Generated from special records |
| Driven by | Time-driven | Data-driven |
| Generate method called | At fixed intervals | Every time a timestamp is assigned |
| Interface to implement | AssignerWithPeriodicWatermarks | AssignerWithPunctuatedWatermarks |
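To show what such a periodic assigner does internally, here is a minimal hand-rolled implementation sketch: the watermark trails the highest event time seen so far by a fixed out-of-orderness bound. MyEvent and getEventTime() remain hypothetical names:

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class PeriodicAssignerSketch implements AssignerWithPeriodicWatermarks<MyEvent> {
    private static final long MAX_OUT_OF_ORDERNESS = 3000L; // 3 s allowed delay
    private long currentMaxTimestamp = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        long ts = element.getEventTime();
        // Track the max event time seen, so the watermark never goes backwards.
        currentMaxTimestamp = Math.max(currentMaxTimestamp, ts);
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // Called periodically (every 200 ms by default).
        return new Watermark(currentMaxTimestamp - MAX_OUT_OF_ORDERNESS);
    }
}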

21. What if the data delay is very serious? Can it be handled with only WaterMark? How should it be solved?

Using the WaterMark + EventTimeWindow mechanism solves out-of-order data to a certain extent, but the WaterMark is not a panacea: in some cases data is delayed so severely that even WaterMark + EventTimeWindow cannot wait for all of it to enter the window before processing. Once a window has fired, Flink by default discards any severely delayed data that belongs to it.

If you want to prevent delayed data within a certain range from being discarded, you can use Allowed Lateness (the allowed-lateness mechanism together with side output), which sets an allowed lateness period and a side-output target.

With the WaterMark + EventTimeWindow + Allowed Lateness solution (including side output), data loss can be avoided.

API call

  • allowedLateness(lateness: Time): set the allowed lateness

This method takes a Time value that sets how late data is allowed to arrive; this is a different concept from the delay in the WaterMark.

To recap: WaterMark = event time of the data - allowed out-of-order time. As new data arrives, the WaterMark advances to the latest event time - allowed out-of-order time; but if a historical (older) record arrives, the WaterMark is not updated.

In general, the WaterMark never goes backwards; it is designed to accept as much out-of-order data as possible.

What about the Time value here? It exists to wait for late data: if data belonging to the window still arrives within this period, the window is evaluated again. The evaluation details are explained later.

Note: This method is only for EventTime-based windows.

  • sideOutputLateData(outputTag: OutputTag[T]): save late data

This method saves late data under the given outputTag parameter; OutputTag is an object used to tag late data.

  • DataStream.getSideOutput(tag:OutputTag[X]): get delayed data

Call this method on the DataStream returned by window operations, passing in the tag that marks the late data, to retrieve the late data.
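Putting the three calls together, reusing the hypothetical keyed (word, count) stream from the earlier sketches:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Anonymous subclass so Flink can capture the element type at runtime.
final OutputTag<Tuple2<String, Integer>> lateTag =
        new OutputTag<Tuple2<String, Integer>>("late-data") {};

SingleOutputStreamOperator<Tuple2<String, Integer>> result = input
        .keyBy(0)
        .timeWindow(Time.seconds(10))
        .allowedLateness(Time.seconds(5))  // late data within 5 s re-fires the window
        .sideOutputLateData(lateTag)       // anything later lands in the side output
        .sum(1);

// Retrieve the late records for separate handling.
DataStream<Tuple2<String, Integer>> lateStream = result.getSideOutput(lateTag);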

Source: blog.csdn.net/be_racle/article/details/132136929