Big Data Theory No. 2: Timestamps and Watermarks in Flink Stream Computing

Chapter 1 Time Semantics

Three time semantics are defined in Flink: Event Time, Ingestion Time, and Processing Time.

Across the whole stream computation, these represent, respectively, the time when the event occurred, the time when the data first entered Flink, and the time when the data is processed by a Flink operator.

Event Time: the time when the event occurred, i.e., when the data was actually generated in the real world. No matter how long the data spends in transmission and computation, the Event Time never changes; it is fixed at the moment the event occurs.

Ingestion Time: the earliest time at which the data enters Flink, i.e., the time the data arrives at the Source. Ingestion Time is likewise unaffected by operator computation and data transmission speed.

Processing Time: the local time on the machine hosting an operator when the data enters that operator. Processing Time depends only on the system clock of the machine where the current operator runs.
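As an illustration, here is a minimal sketch of choosing between these semantics, assuming the DataStream API of Flink 1.x before 1.12, where the time characteristic is set on the execution environment (from 1.12 on this setting is deprecated and event time is the default):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeSemanticsSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use event time for all time-based operators in this job; the
        // alternatives are TimeCharacteristic.IngestionTime and
        // TimeCharacteristic.ProcessingTime.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    }
}
```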


Chapter 2 Timestamp

In Event Time mode, all data flowing into Flink from the Source carries a Timestamp, which represents the time the event occurred. The Timestamp can also be custom-defined, but we must make sure the timestamps keep increasing.

For example, in practice we often need window computations that count events within a period of time: the number of confirmed cases in each region over the past two weeks, the number of visits and clicks for each module over the past hour, and so on. All of these use the Timestamp to perform window computations in Event Time mode, as the sketch below illustrates.
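A minimal sketch of the per-module hourly click count, assuming a hypothetical `clicks` stream of (module, 1L) pairs whose event-time timestamps have already been assigned upstream:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ModuleClickCounts {
    // clicks: hypothetical stream of (module, 1L) records with event-time
    // timestamps assigned upstream
    static DataStream<Tuple2<String, Long>> hourlyCounts(DataStream<Tuple2<String, Long>> clicks) {
        return clicks
                .keyBy(t -> t.f0)                                    // group by module
                .window(TumblingEventTimeWindows.of(Time.hours(1)))  // one-hour event-time windows
                .sum(1);                                             // total clicks per module per hour
    }
}
```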

In Flink, a Timestamp is defined as an 8-byte long value. When an operator receives the data, by default it interprets this long as a Unix timestamp with millisecond precision, i.e., the number of milliseconds since 1970-01-01 00:00:00.000. A custom operator can, of course, define its own way of parsing the timestamp.
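A sketch of assigning such timestamps in the DataStream API, using a hypothetical ClickEvent type; the classic AscendingTimestampExtractor matches the increasing-timestamp requirement mentioned above (in newer Flink versions, WatermarkStrategy.forMonotonousTimestamps plays the same role):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;

public class TimestampAssignment {
    // Hypothetical event type carrying the epoch-millisecond time of occurrence.
    public static class ClickEvent {
        public String module;
        public long eventTimeMillis;
    }

    static DataStream<ClickEvent> withTimestamps(DataStream<ClickEvent> events) {
        return events.assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<ClickEvent>() {
                    @Override
                    public long extractAscendingTimestamp(ClickEvent e) {
                        // The 8-byte long: milliseconds since 1970-01-01 00:00:00.000.
                        return e.eventTimeMillis;
                    }
                });
    }
}
```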

However, time-based applications face a complication: computing units differ in processing power, network transmission rates vary, and today's big data systems are distributed architectures. For all of these reasons there is some uncertainty in when data reaches the Source and each computing unit, which is the problem of out-of-order data. Next, let's look at how Flink solves this out-of-order problem: the Watermark.


Chapter 3 Watermark

Flink defines the Watermark as a stream element (StreamElement) that travels between operators together with ordinary data. It is essentially a Timestamp of type long, and it serves as a global progress indicator.

A Watermark can be emitted at the Source, or generated on any operator in the stream, and it is propagated through the operators of the topology, as the sketch below suggests.
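One way to see the Watermark as a stream element is a punctuated assigner that emits a Watermark after every record; this is a sketch against the classic AssignerWithPunctuatedWatermarks interface, reusing the hypothetical ClickEvent type from the previous sketch:

```java
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Emits a Watermark element into the stream after every record, carrying that
// record's timestamp; downstream operators see it interleaved with the data.
public class PerEventWatermarks implements AssignerWithPunctuatedWatermarks<ClickEvent> {

    @Override
    public long extractTimestamp(ClickEvent e, long previousElementTimestamp) {
        return e.eventTimeMillis;
    }

    @Override
    public Watermark checkAndGetNextWatermark(ClickEvent lastElement, long extractedTimestamp) {
        return new Watermark(extractedTimestamp);
    }
}
```

Emitting a Watermark per record is the extreme case; periodic generation (the sketch after the next paragraph) is usually cheaper.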

In the Flink stream computation process, data transmission is delayed and data arrives out of order, so when should we trigger the window computation? Put another way: how long do we have to wait before we can be sure that all the data belonging to a window has reached the operator?

Flink's answer is the Watermark interspersed in the data stream. A Watermark lets an operator assume that no more events older than it will arrive, and that assumption is what triggers the window computation. Could data belonging to the window still show up after its Watermark? It could: the smaller the delay built into the Watermark, the more likely data is missed; the larger the delay, the less likely data is missed, but the longer the window trigger is postponed.
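This delay is exactly the knob that Flink's classic BoundedOutOfOrdernessTimestampExtractor exposes; a sketch, again with the hypothetical ClickEvent, assuming a tolerance of 5 seconds:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DelayedWatermarks {
    static DataStream<ClickEvent> withWatermarks(DataStream<ClickEvent> events) {
        // The watermark trails the largest timestamp seen so far by 5 seconds:
        // a larger delay misses fewer late events but postpones window results.
        return events.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<ClickEvent>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(ClickEvent e) {
                        return e.eventTimeMillis;
                    }
                });
    }
}
```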

Therefore, in practice the Watermark delay needs to be tuned against your own workload to find a balanced value. In some cases, to keep computation latency low, users do not want the data excluded by the Watermark to simply be discarded; such late data can instead be written to a log or used to correct previously emitted results, as sketched below.
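Flink's allowed lateness and side outputs support exactly this pattern; a sketch under the same assumptions as the earlier click-count example (the `LATE` tag and the one-minute lateness are hypothetical values chosen for illustration):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataHandling {
    // Anonymous subclass so the element type survives erasure.
    static final OutputTag<Tuple2<String, Long>> LATE =
            new OutputTag<Tuple2<String, Long>>("late-clicks") {};

    static DataStream<Tuple2<String, Long>> hourlyCounts(DataStream<Tuple2<String, Long>> clicks) {
        SingleOutputStreamOperator<Tuple2<String, Long>> counts = clicks
                .keyBy(t -> t.f0)
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                .allowedLateness(Time.minutes(1))  // re-fire the window for data up to 1 minute late
                .sideOutputLateData(LATE)          // anything later than that goes to the side output
                .sum(1);

        // Instead of being discarded, late records can be logged or used to
        // correct previously emitted results.
        DataStream<Tuple2<String, Long>> lateClicks = counts.getSideOutput(LATE);
        lateClicks.print();                        // e.g. just log them here

        return counts;
    }
}
```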

One final note: window computations based on Processing Time do not require Watermarks.

Original post: https://blog.csdn.net/dzh284616172/article/details/109250973