Flink Table API Window and Watermark

Preamble

This post mainly aims to clarify how unified stream/batch processing works, since it uses SQL to drive both batch and stream computation. So how is operator parallelism set? How are windows set? How is streaming data processed? There are many such questions.

I still think it is better to use the streaming API directly. The unified stream/batch API is eventually translated into a streaming job anyway, and most importantly, configuring operators or windows through SQL is not intuitive. Since it is itself converted into stream operations, we might as well use streams directly. In addition, stream/batch unification was said to be immature back in version 1.12, and even in 1.17 it is still not fully mature and there are still bugs, as shown in the screenshot below.

Dynamic tables & continuous queries

First, look at how unified stream/batch processing differs from traditional relational processing:

Relational databases / batch processing vs. stream processing:

  • A relation (or table) is a bounded (multi-)set of tuples; a stream is an infinite sequence of tuples.
  • A query on batch data (such as a table in a relational database) has access to the complete input. A streaming query has no access to all of the data at startup and must "wait" for data to flow in.
  • A batch query terminates after producing a fixed-size result. A streaming query constantly updates its result based on the records it receives and never ends.

To paraphrase what the Flink documentation says:

  • A dynamic table is a table that changes continuously (through insert, delete, and update operations).
  • A continuous query continuously queries the latest changes of the dynamic table.

Dynamic tables are the core concept of Flink's Table API and SQL support for streaming data. Unlike the static tables that represent batch data, dynamic tables change over time. They can be queried just like static batch tables.

Querying a dynamic table will generate a  continuous query  . A continuous query never terminates and results in a dynamic table. A query is constantly updating its (dynamic) result table to reflect changes on its (dynamic) input table. Essentially, a continuous query on a dynamic table is very similar to a query defining a materialized view.

Note that the result of a continuous query is always semantically equivalent to the result of the same query executed in batch mode on a snapshot of the input table.

The following diagram shows the relationship between streams, dynamic tables, and continuous queries:

  1. Convert a stream to a dynamic table. (dynamic input table)
  2. Evaluates a continuous query on a dynamic table, producing a new dynamic table. (Dynamic result table)
  3. The resulting dynamic table is converted back to a stream.
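To make these three steps concrete, here is a minimal sketch in Java (assuming Flink 1.13+ with the Table API on the classpath); the in-memory clicks source and the column names user_name and url are purely illustrative:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class StreamToDynamicTable {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // (1) convert a stream into a dynamic (input) table
        DataStream<Tuple2<String, String>> clicks = env
                .fromElements(Tuple2.of("Mary", "./home"),
                              Tuple2.of("Bob", "./cart"),
                              Tuple2.of("Mary", "./prod?id=1"))
                .returns(Types.TUPLE(Types.STRING, Types.STRING));
        tEnv.createTemporaryView("clicks", tEnv.fromDataStream(clicks).as("user_name", "url"));

        // (2) evaluate a continuous query on the dynamic table -> dynamic result table
        Table result = tEnv.sqlQuery(
                "SELECT user_name, COUNT(url) AS cnt FROM clicks GROUP BY user_name");

        // (3) convert the dynamic result table back into a changelog stream
        tEnv.toChangelogStream(result).print();
        env.execute();
    }
}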

From stream to dynamic table

First define a table structure

[
  user:  VARCHAR,   // user name
  cTime: TIMESTAMP, // time the URL was visited
  url:   VARCHAR    // URL visited by the user
]

In order to process a stream with relational queries, it must be converted into a Table. Conceptually, each record of the stream is interpreted as an INSERT operation on the resulting table. Essentially, we are building the table from an INSERT-only changelog stream.

The diagram below shows how the stream of click events (left side) translates into a table (right side). The result table will keep growing as more clickstream records are inserted.

Note:  Tables defined on streams are not materialized internally.

Continuous query

The SQL of a continuous query determines the quality of the program; the SQL here directly affects:

  • Whether the dynamic result table needs update operations. If rows are only ever appended, efficiency is much higher.
  • The size of the storage space needed for intermediate state. The more aggregation keys there are, the larger the memory footprint becomes.

Give an example of how SQL affects efficiency.

For example, if we use Kafka as the source and the SQL uses a GROUP BY aggregation, update operations will be generated for the dynamic result table. The sink must therefore support update operations; if it does not, an error will be reported. You can also check the SQL API documentation to see which operations produce updates.

For example, execute the statement

Table table = tEnv.sqlQuery(
    "SELECT id, COUNT(name) AS mycount FROM jjjk GROUP BY id");

table.execute().print();

The printed information is:

In the printed output, - marks a retraction (the row as it was before the change) and + marks the row after the change; I means insert, U means update, D means delete.
For example, -U is the old row retracted before an update, and +U is the new row after the update.
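For example, assuming the jjjk table receives two rows with id = 1 and then one row with id = 2, the printed changelog might look roughly like this (illustrative output only; the op column carries the flags described above):

+----+----------------------+----------------------+
| op |                   id |              mycount |
+----+----------------------+----------------------+
| +I |                    1 |                    1 |
| -U |                    1 |                    1 |
| +U |                    1 |                    2 |
| +I |                    2 |                    1 |
+----+----------------------+----------------------+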

Official example

The first query is a simple GROUP BY COUNT aggregation. It groups the clicks table by the user field and counts the number of visited URLs. The figure below shows how the query is evaluated as the clicks table is updated with additional rows.

When the query starts, the clicks table (on the left) is empty. The query starts computing the result table when the first row is inserted into the clicks table. After the first row [Mary, ./home] is inserted, the result table (right, top) consists of a single row [Mary, 1]. When the second row [Bob, ./cart] is inserted into the clicks table, the query updates the result table and inserts a new row [Bob, 1]. The third row [Mary, ./prod?id=1] yields an update of an already computed result row: [Mary, 1] is updated to [Mary, 2]. Finally, when the fourth row is appended to the clicks table, the query inserts a third row [Liz, 1] into the result table.

The second query is similar to the first, but groups the clicks table not only by the user attribute but also by hourly tumbling windows before counting URLs (time-based computations such as windows rely on special time attributes, which are discussed later). Again, the figure shows the input and output at different points in time to visualize the changing nature of the dynamic table.

As before, the input table clicks is shown on the left. The query continuously computes results every hour and updates the result table. The clicks table contains four rows with timestamps (cTime) between 12:00:00 and 12:59:59. The query computes two result rows from this input (one per user) and appends them to the result table. For the next window between 13:00:00 and 13:59:59, the clicks table contains three rows, which results in two more rows being appended to the result table. The result table keeps being updated as more rows are appended to clicks over time.
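A sketch of this second query in Flink SQL, assuming a registered clicks table with the schema [user, cTime, url] from earlier and cTime declared as an event-time attribute with a watermark (this uses the classic group-window syntax; newer versions also offer the TUMBLE table-valued function):

// hourly tumbling window per user; `user` is backtick-quoted because it is a reserved word
Table hourlyCounts = tEnv.sqlQuery(
        "SELECT `user`, " +
        "       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT, " +
        "       COUNT(url) AS cnt " +
        "FROM clicks " +
        "GROUP BY `user`, TUMBLE(cTime, INTERVAL '1' HOUR)");
hourlyCounts.execute().print();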

The difference between the above two queries

  • The first query updates previously emitted results, i.e. the changelog stream that defines the result table contains INSERT and UPDATE operations.
  • The second query only appends to the result table, i.e. the changelog stream for the result table contains only  INSERT operations.

Query Limitations

  • State Size:  Continuous queries are computed on unbounded streams and should typically run for weeks or months. Therefore, the total amount of data processed by continuous queries can be very large. Queries that must update previously outputted results need to maintain all outputted rows in order to be able to update them. For example, the first query example needs to store a per-user URL count so that it can be incremented and new results sent when the input table receives a new row. If you're only tracking registered users, the number of counts to maintain may not be too high. However, if unregistered users are assigned a unique username, the number of counts to maintain will grow over time and may eventually cause the query to fail.
  • Computing Updates:  Some queries require recomputing and updating a large number of emitted result rows even if only a single input record is added or updated. Clearly, such queries are not well suited to run as continuous queries. An example is the kind of query sketched below, which computes a RANK for each user based on the time of the last click. As soon as the clicks table receives a new row, the user's lastAction is updated and a new rank must be computed. However, since two rows cannot have the same rank, all lower-ranked rows also need to be updated.
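A sketch of the kind of query meant here, following the shape of the example in the Flink documentation (it illustrates the problem rather than a recommended pattern, and may be rejected or perform badly as a continuous query):

// every new click can change a user's lastAction and therefore the ranks of many other users
Table ranks = tEnv.sqlQuery(
        "SELECT `user`, RANK() OVER (ORDER BY lastAction) AS rnk " +
        "FROM ( " +
        "  SELECT `user`, MAX(cTime) AS lastAction FROM clicks GROUP BY `user` " +
        ")");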

Table to Stream Conversion

If this matters to you, it is simpler to just use the DataStream API directly.

For details, refer to Dynamic Table | Apache Flink
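As a rough sketch of what the conversion boils down to (continuing with the tEnv from earlier; appendOnlyResultTable and updatingResultTable are hypothetical tables defined elsewhere): an append-only result table can become a plain row stream, while an updating result table needs a changelog stream whose records carry a RowKind:

// types: org.apache.flink.streaming.api.datastream.DataStream, org.apache.flink.types.Row

// insert-only results: a plain stream of rows (fails if the table produces updates)
DataStream<Row> appendOnly = tEnv.toDataStream(appendOnlyResultTable);

// updating results: a changelog stream whose rows are flagged +I / -U / +U / -D
DataStream<Row> changelog = tEnv.toChangelogStream(updatingResultTable);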

Non-determinism

Quoting the SQL standard's description of determinism: "An operation is deterministic if it is guaranteed to compute the same result when it repeats the same input values. "

Obviously Flink cannot be fully deterministic, and neither can batch processing or traditional databases. For example, when querying the latest two records, even though the SQL is the same, data is being inserted all the time, so batch processing is not deterministic either.

How to reduce the non-determinism of Flink (ultimately by using watermarks)

Non-deterministic update (NDU) problems in streaming queries are usually not intuitive; a small change to a condition in a complex query can introduce the risk of an NDU problem. Starting from version 1.16, Flink SQL (FLINK-27849) introduces an experimental NDU handling mechanism controlled by 'table.optimizer.non-deterministic-update.strategy'. When the TRY_RESOLVE mode is enabled, Flink checks whether the streaming query has NDU problems and tries to eliminate the non-deterministic updates caused by lookup joins (by adding materialization internally). If the remaining non-deterministic factors cannot be eliminated automatically, Flink SQL reports as detailed an error message as possible, prompting the user to adjust the SQL to avoid introducing non-determinism (given the high cost and complexity of materialization, there is currently no automatic resolution mechanism for those cases).
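The switch itself is just a table config option; a minimal sketch (Flink 1.16+, using the option name from the paragraph above and the tEnv from earlier):

// turn on the experimental NDU check; the default strategy is IGNORE
tEnv.getConfig().set("table.optimizer.non-deterministic-update.strategy", "TRY_RESOLVE");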

Time attributes

This corresponds to watermarks in stream processing.

 Flink can process data based on several different  notions of time .

  • Processing time refers to the machine's local time when the specific operation is executed (ordinary wall-clock time, e.g. Java System.currentTimeMillis()).
  • Event time  refers to the time carried by the data itself. This time is the time when the event was generated.
  • Ingestion time  refers to the time when data enters Flink; internally, it is treated as event time.

Each type of table can carry a time attribute, which can be specified:

  • in the CREATE TABLE DDL when creating a table,
  • when converting from a DataStream,
  • when defining a TableSource.

Once a time attribute is defined, it can be used like a regular column and in time-based operations (since everything is expressed as SQL operations on fields).

As long as a time attribute is not modified, but simply passed from one table to another, it remains a valid time attribute. Time attributes can be used and computed like ordinary timestamp columns, but once a time attribute is used in a computation, it is materialized and becomes an ordinary timestamp. Ordinary timestamps do not work with Flink's time and watermark mechanism, so they cannot be used in time-based operations. In other words, if you want to use a time field for windowing, it must not participate in other computations first.
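A small sketch of this rule, assuming the clicks table with the event-time attribute cTime from earlier:

// cTime stays a valid time attribute when it is simply selected / forwarded,
// so a downstream query could still define a window on it
Table forwarded = tEnv.sqlQuery("SELECT `user`, url, cTime FROM clicks");

// once the time attribute participates in a computation it is materialized into a
// plain TIMESTAMP; `shifted` can no longer be used for windows or watermarks
Table materialized = tEnv.sqlQuery(
        "SELECT `user`, url, cTime + INTERVAL '1' HOUR AS shifted FROM clicks");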

Processing Time

Time Attributes | Apache Flink
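A minimal sketch of declaring a processing-time attribute in DDL (the table name, columns and the datagen connector are illustrative):

// a processing-time attribute is a computed column defined with PROCTIME()
tEnv.executeSql(
        "CREATE TABLE page_views_proc (" +
        "  user_name STRING," +
        "  url STRING," +
        "  proc_time AS PROCTIME()" +
        ") WITH (" +
        "  'connector' = 'datagen'" +
        ")");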

Event Time

Time Attributes | Apache Flink
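A minimal sketch of declaring an event-time attribute together with a watermark (here allowing 5 seconds of out-of-orderness; the table name, columns and connector are illustrative):

// an event-time attribute is a TIMESTAMP(3)/TIMESTAMP_LTZ(3) column with a WATERMARK declaration
tEnv.executeSql(
        "CREATE TABLE page_views_evt (" +
        "  user_name STRING," +
        "  url STRING," +
        "  action_time TIMESTAMP(3)," +
        "  WATERMARK FOR action_time AS action_time - INTERVAL '5' SECOND" +
        ") WITH (" +
        "  'connector' = 'datagen'" +
        ")");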

Temporal Tables

A temporal table contains one or more versioned snapshots of a table in order to track all of its change records. A temporal table can be a changelog-style table (such as the changelog of a database table, which contains multiple table snapshots) or a table that materializes all changes (such as a database table, which only holds the latest snapshot). This is similar to Redis backups: either a log of all operations, or a backup of the final in-memory result.

A temporal table is a finer-grained classification of the tables we create, mainly used in specific business scenarios; it sits below the classification into regular and virtual tables.

Version: a temporal table can be divided into a series of versioned table snapshots. The version of a snapshot represents the validity interval of all records in that snapshot; the start and end of the interval can be specified by the user. Depending on whether the temporal table can track its own historical versions, temporal tables are divided into version tables and ordinary tables.

  • Version table: if the records in a temporal table can be tracked back to and accessed at their historical versions, we call it a version table; the changelog of a database table can be defined as a version table. --- versions are distinguished by primary key and time
  • Ordinary table: if only the latest version of the records in a temporal table can be tracked and accessed, we call it an ordinary table; a table from a database or HBase can be defined as an ordinary table. --- the final-result table

Version Table Description

Take the scenario of joining an order stream with a product table as an example. The orders table contains the real-time order stream from Kafka, and the product_changelog table comes from the changelog of the database table products, in which product prices change in real time.

SELECT * FROM product_changelog;

(changelog kind)  update_time  product_id product_name price
================= ===========  ========== ============ ===== 
+(INSERT)         00:01:00     p_001      scooter      11.11
+(INSERT)         00:02:00     p_002      basketball   23.11
-(UPDATE_BEFORE)  12:00:00     p_001      scooter      11.11
+(UPDATE_AFTER)   12:00:00     p_001      scooter      12.99
-(UPDATE_BEFORE)  12:00:00     p_002      basketball   23.11 
+(UPDATE_AFTER)   12:00:00     p_002      basketball   19.99
-(DELETE)         18:00:00     p_001      scooter      12.99 

The product_changelog table represents the ever-growing changelog of the database table products. For example, the initial price of the product scooter at time 00:01:00 is 11.11, the price rises to 12.99 at 12:00:00, and the price record of the product is deleted at 18:00:00.

If we want to output the version of the product_changelog table corresponding to time 10:00:00, the content of the table is as follows:

update_time  product_id product_name price
===========  ========== ============ ===== 
00:01:00     p_001      scooter      11.11
00:02:00     p_002      basketball   23.11

If we want to output the version of the product_changelog table corresponding to time 13:00:00, the content of the table is as follows:

update_time  product_id product_name price
===========  ========== ============ ===== 
12:00:00     p_001      scooter      12.99
12:00:00     p_002      basketball   19.99

In the above example, the version of the products table is tracked by product_id and update_time: product_id corresponds to the primary key of the product_changelog table, and update_time corresponds to its event time.
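This is exactly what an event-time temporal join consumes: each order is joined against the version of product_changelog that was valid at the order's time. A sketch of such a join, assuming an orders table with an event-time attribute order_time and columns order_id and product_id (these names follow the scenario above but are otherwise illustrative):

Table enrichedOrders = tEnv.sqlQuery(
        "SELECT o.order_id, o.product_id, p.product_name, p.product_price " +
        "FROM orders AS o " +
        "LEFT JOIN product_changelog FOR SYSTEM_TIME AS OF o.order_time AS p " +
        "ON o.product_id = p.product_id");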

Ordinary table description

On the other hand, some use cases need to join against a changing dimension table, which is an external database table.

Suppose LatestRates is a table that materializes the latest exchange rates (for example, an HBase table); LatestRates always represents the latest content of the HBase table Rates.

The content queried at 10:15:00 is as follows:

10:15:00 > SELECT * FROM LatestRates;

currency  rate
========= ====
US Dollar 102
Euro      114
Yen       1

The content queried at 11:00:00 is as follows:

11:00:00 > SELECT * FROM LatestRates;

currency  rate
========= ====
US Dollar 102
Euro      116
Yen       1

Create temporal table

Flink uses primary key constraints and event time to define a version table and version view.

Example of creating a version table

In Flink, a table with primary key constraints and event-time attributes defined is a version table.

-- define a version table
CREATE TABLE product_changelog (
  product_id STRING,
  product_name STRING,
  product_price DECIMAL(10, 4),
  update_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
  PRIMARY KEY(product_id) NOT ENFORCED,      -- (1) define the primary key constraint
  WATERMARK FOR update_time AS update_time   -- (2) define the event time via the watermark
) WITH (
  'connector' = 'kafka',
  'topic' = 'products',
  'scan.startup.mode' = 'earliest-offset',
  'properties.bootstrap.servers' = 'localhost:9092',
  'value.format' = 'debezium-json'
);

Line (1) defines the primary key of the product_changelog table, and line (2) defines update_time as the event time of the product_changelog table, so product_changelog is a version table.

Note: the METADATA FROM 'value.source.timestamp' VIRTUAL syntax means extracting, from each changelog record, the time at which the operation was executed in the underlying database table. It is strongly recommended to use the operation's execution time in the database as the event time; otherwise the version extracted by time may not match the version in the database.

Declaring version views

Flink also supports defining versioned views: as long as a view contains a primary key and an event time attribute, it is a versioned view.

Suppose we have the following table RatesHistory:

-- define an append-only table
CREATE TABLE RatesHistory (
    currency_time TIMESTAMP(3),
    currency STRING,
    rate DECIMAL(38, 10),
    WATERMARK FOR currency_time AS currency_time   -- define the event time
) WITH (
  'connector' = 'kafka',
  'topic' = 'rates',
  'scan.startup.mode' = 'earliest-offset',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'                                -- a plain append-only stream
);

The table RatesHistory represents an ever-growing append-only table of currency exchange rates against the Japanese Yen (the Yen rate is 1). For example, the Euro-to-Yen rate from 09:00:00 to 10:45:00 is 114, and from 10:45:00 to 11:15:00 it is 116.

SELECT * FROM RatesHistory;

currency_time currency  rate
============= ========= ====
09:00:00      US Dollar 102
09:00:00      Euro      114
09:00:00      Yen       1
10:45:00      Euro      116
11:15:00      Euro      119
11:49:00      Pounds    108

In order to define a version table on top of RatesHistory in Flink, Flink supports defining a versioned view via a deduplication query. The deduplication query produces an ordered changelog stream; it can infer the primary key and preserves the event time attribute of the original data stream.

CREATE VIEW versioned_rates AS
SELECT currency, rate, currency_time            -- (1) `currency_time` preserves the event time
  FROM (
      SELECT *,
      ROW_NUMBER() OVER (PARTITION BY currency  -- (2) `currency` is the unique key of the deduplication query and can serve as the primary key
         ORDER BY currency_time DESC) AS rowNum
      FROM RatesHistory )
WHERE rowNum = 1;

-- the view `versioned_rates` will produce the following changelog:

(changelog kind) currency_time currency   rate
================ ============= =========  ====
+(INSERT)        09:00:00      US Dollar  102
+(INSERT)        09:00:00      Euro       114
+(INSERT)        09:00:00      Yen        1
+(UPDATE_AFTER)  10:45:00      Euro       116
+(UPDATE_AFTER)  11:15:00      Euro       119
+(INSERT)        11:49:00      Pounds     108

Line (1) keeps the event time as the event time of the view versioned_rates, and line (2) gives the view versioned_rates a primary key, so the view versioned_rates is a versioned view.

The deduplication query in the view will be optimized by Flink and efficiently generate the changelog stream. The generated changelog retains the primary key constraint and event time.

If we want to output the version of the versioned_rates view corresponding to time 11:00:00, the content is as follows:

currency_time currency   rate  
============= ========== ====
09:00:00      US Dollar  102
09:00:00      Yen        1
10:45:00      Euro       116

If we want to output the version of the versioned_rates view corresponding to time 12:00:00, the content is as follows:

currency_time currency   rate  
============= ========== ====
09:00:00      US Dollar  102
09:00:00      Yen        1
11:15:00      Euro       119
11:49:00      Pounds     108

Declaring an ordinary table

The declaration of ordinary tables is consistent with the DDL for creating tables in Flink. Refer to  the create table  page for more information on how to create tables.

-- define an HBase table with DDL; we can then use it as a temporal table in SQL
-- the 'currency' column is the rowKey of the HBase table
 CREATE TABLE LatestRates (   
     currency STRING,   
     fam1 ROW<rate DOUBLE>   
 ) WITH (   
    'connector' = 'hbase-1.4',   
    'table-name' = 'rates',   
    'zookeeper.quorum' = 'localhost:2181'   
 );


Origin blog.csdn.net/cuiyaonan2000/article/details/131188999