Preamble
This time, the main purpose is to clarify the unified batch/stream processing model, since it uses SQL to drive both batch and stream computation. That raises many questions: how is operator parallelism set? How are windows defined? How is streaming data processed?
I still think it is better to use the stream-computing API directly. The unified batch/stream API is ultimately converted into stream computation anyway, and expressing operators and windows in SQL is not intuitive; since it is itself converted into stream operations, we might as well use streams directly. In addition, batch/stream unification was said to be immature in version 1.12, and it is still not fully mature in 1.17; there are still bugs. The screenshot is as follows:
Dynamic tables & continuous queries
First, look at the differences between unified stream/batch processing and the traditional approach:
| Relational database / batch processing | Stream processing |
| --- | --- |
| A relation (or table) is a bounded (multi-)set of tuples. | A stream is an infinite sequence of tuples. |
| A query executed on batch data (e.g., a table in a relational database) has access to the complete input data. | A streaming query cannot access all of the input data when it starts and must "wait" for data to stream in. |
| A batch query terminates after producing a fixed-size result. | A streaming query continuously updates its result based on the records it receives and never terminates. |
To understand what the Flink documentation says:
- A dynamic table is a table that changes continuously (through insert, delete, and update operations).
- A continuous query is a query that continuously reads the latest changes of a dynamic table.
Dynamic tables are the core concepts of Flink's Table API and SQL that support streaming data. Unlike static tables, which represent batches of data, dynamic tables change over time. They can be queried just like static batch tables.
Querying a dynamic table will generate a continuous query . A continuous query never terminates and results in a dynamic table. A query is constantly updating its (dynamic) result table to reflect changes on its (dynamic) input table. Essentially, a continuous query on a dynamic table is very similar to a query defining a materialized view.
Note that the result of a continuous query is always semantically equivalent to the result of the same query executed in batch mode on a snapshot of the input table.
The following diagram shows the relationship between streams, dynamic tables, and continuous queries:
- A stream is converted into a dynamic table (the dynamic input table).
- A continuous query is evaluated on the dynamic table, producing a new dynamic table (the dynamic result table).
- The resulting dynamic table is converted back into a stream.
Streams to dynamic tables
First, define a table structure:
[
  user:  VARCHAR,   // name of the user
  cTime: TIMESTAMP, // time when the URL was accessed
  url:   VARCHAR    // the URL accessed by the user
]
In order to process a stream with a relational query, it has to be converted into a Table. Conceptually, each record of the stream is interpreted as an INSERT operation on the resulting table. Essentially, we are building a table from an INSERT-only changelog stream.
The diagram below shows how the stream of click events (left side) translates into a table (right side). The result table will keep growing as more clickstream records are inserted.
Note: Tables defined on streams are not materialized internally.
Continuous queries
The SQL of a continuous query determines the quality of the program. The SQL here directly affects:
- Whether the dynamic result table needs update operations. If rows are only appended, efficiency is very high.
- The size of the storage needed for intermediate state. If there are too many aggregation keys, memory usage grows accordingly.
Here is an example of how the SQL affects efficiency.
For example, with Kafka as the source: if the SQL uses a group aggregation, update operations will be generated in the dynamic result table, so the sink must support update operations; if it does not, an error is reported. You can also check the SQL API documentation to see which operations produce updates.
For example, execute the statements:
Table table = tEnv.sqlQuery("SELECT id, COUNT(name) AS mycount FROM jjjk GROUP BY id");
table.execute().print();
The printed information uses these markers: - means a retraction, + means an addition; I is insert, U is update, D is delete. For example, -U is the row before the update (the retracted data), and +U is the updated data.
Official example
The first query is a simple GROUP-BY COUNT aggregation. It groups the clicks table on the user field and counts the number of visited URLs. The figure below shows how the query is evaluated over time as the clicks table is updated with additional rows.

When the query starts, the clicks table (left-hand side) is empty. The query begins computing the result table when the first row is inserted. After the first row [Mary, ./home] is inserted, the result table (right-hand side, top) consists of a single row [Mary, 1]. When the second row [Bob, ./cart] is inserted into the clicks table, the query updates the result table by inserting a new row [Bob, 1]. The third row [Mary, ./prod?id=1] yields an update of an already computed result row: [Mary, 1] is updated to [Mary, 2]. Finally, when the fourth row is appended to the clicks table, the query inserts a third row [Liz, 1] into the result table.
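The mechanics of this example can be sketched in plain Java: the query state is a per-user count, and each incoming click either inserts a new result row (+I) or retracts the old count (-U) and emits the new one (+U). This is only an illustration of the changelog semantics, with invented class and method names, not Flink's runtime code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContinuousCountSketch {
    // Replays a stream of {user, url} clicks and returns the changelog of the
    // result table of "SELECT user, COUNT(url) FROM clicks GROUP BY user".
    public static List<String> run(String[][] clicks) {
        Map<String, Integer> counts = new HashMap<>(); // query state: per-user count
        List<String> changelog = new ArrayList<>();
        for (String[] click : clicks) {
            String user = click[0];
            Integer old = counts.get(user);
            if (old == null) {
                counts.put(user, 1);
                changelog.add("+I[" + user + ", 1]");                 // new result row
            } else {
                counts.put(user, old + 1);
                changelog.add("-U[" + user + ", " + old + "]");       // retract old count
                changelog.add("+U[" + user + ", " + (old + 1) + "]"); // emit new count
            }
        }
        return changelog;
    }

    public static void main(String[] args) {
        String[][] clicks = {
            {"Mary", "./home"}, {"Bob", "./cart"},
            {"Mary", "./prod?id=1"}, {"Liz", "./home"}
        };
        // reproduces the example: [Mary, 1], [Bob, 1], [Mary, 1] -> [Mary, 2], [Liz, 1]
        System.out.println(run(clicks));
    }
}
```

Note that the state (the counts map) grows with the number of distinct users, which is exactly the state-size concern discussed later in this section.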
The second query is similar to the first one, but it groups the clicks table not only on the user attribute but also on hourly tumbling windows before counting URLs (time-based computations such as windows are based on special time attributes, discussed later). Again, the figure shows the input and output at different points in time to visualize the changing nature of the dynamic table.

As before, the input table clicks is shown on the left. The query continuously computes results every hour and updates the result table. The clicks table contains four rows with timestamps (cTime) between 12:00:00 and 12:59:59. The query computes two result rows from this input (one for each user) and appends them to the result table. For the next window between 13:00:00 and 13:59:59, the clicks table contains three rows, which results in another two rows being appended to the result table. As time progresses, more rows are appended to the clicks table and the result table keeps growing.
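The append-only behavior of the windowed query can likewise be sketched: rows are bucketed per hour and per user, and each window only appends rows to the result table. Again a plain-Java illustration with invented names, not Flink code; it simply bucket-counts after the fact rather than handling out-of-order data the way a real window operator with watermarks would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TumblingCountSketch {
    // Replays {cTime "HH:mm:ss", user, url} clicks, buckets them into hourly
    // tumbling windows, and returns the append-only result table rows.
    public static List<String> run(String[][] clicks) {
        // window start -> (user -> count); TreeMaps keep a deterministic order
        TreeMap<String, TreeMap<String, Integer>> windows = new TreeMap<>();
        for (String[] c : clicks) {
            String windowStart = c[0].substring(0, 2) + ":00:00";
            windows.computeIfAbsent(windowStart, w -> new TreeMap<>())
                   .merge(c[1], 1, Integer::sum);
        }
        List<String> result = new ArrayList<>(); // append-only: rows are never updated
        windows.forEach((window, counts) ->
            counts.forEach((user, cnt) ->
                result.add("+I[" + user + ", " + window + ", " + cnt + "]")));
        return result;
    }

    public static void main(String[] args) {
        String[][] clicks = {
            {"12:00:00", "Mary", "./home"}, {"12:02:00", "Bob", "./cart"},
            {"12:55:00", "Mary", "./prod?id=1"}, {"13:01:00", "Bob", "./prod?id=3"}
        };
        System.out.println(run(clicks));
    }
}
```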
The difference between the two queries:
- The first query updates previously emitted results, i.e., the changelog stream that defines the result table contains INSERT and UPDATE operations.
- The second query only appends to the result table, i.e., the changelog stream of the result table contains only INSERT operations.
Query Restrictions #
- State Size: Continuous queries are computed on unbounded streams and should typically run for weeks or months. Therefore, the total amount of data processed by continuous queries can be very large. Queries that must update previously outputted results need to maintain all outputted rows in order to be able to update them. For example, the first query example needs to store a per-user URL count so that it can be incremented and new results sent when the input table receives a new row. If you're only tracking registered users, the number of counts to maintain may not be too high. However, if unregistered users are assigned a unique username, the number of counts to maintain will grow over time and may eventually cause the query to fail.
- Computing updates: Some queries require recomputing and updating a large number of emitted result rows even when only a single input record is added or updated. Clearly, such queries are not well suited for execution as continuous queries. An example is the following query, which computes a RANK for every user based on the time of the last click. As soon as the clicks table receives a new row, the user's lastAction is updated and a new rank must be computed. However, since two rows cannot have the same rank, all lower-ranked rows also need to be updated.
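The cascading effect can be demonstrated in a few lines of plain Java (invented names, not Flink code): rank users by their last click time and observe that a single new click shifts the rank of every other user.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RankUpdateSketch {
    // lastAction: user -> "HH:mm:ss"; returns user -> rank (1 = most recent click)
    public static Map<String, Integer> ranks(Map<String, String> lastAction) {
        List<String> users = new ArrayList<>(lastAction.keySet());
        // sort descending by last click time ("HH:mm:ss" compares correctly as a string)
        users.sort((a, b) -> lastAction.get(b).compareTo(lastAction.get(a)));
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (int i = 0; i < users.size(); i++) ranks.put(users.get(i), i + 1);
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, String> lastAction = new HashMap<>();
        lastAction.put("Mary", "12:00:01");
        lastAction.put("Bob", "12:00:02");
        lastAction.put("Liz", "12:00:03");
        System.out.println(ranks(lastAction)); // Liz=1, Bob=2, Mary=3
        lastAction.put("Mary", "12:00:04");    // one new click by Mary...
        System.out.println(ranks(lastAction)); // ...changes every user's rank
    }
}
```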
Table to Stream Conversion #
If you are interested, just use the stream API directly.
For details, refer to Dynamic Table | Apache Flink
Uncertainty
Quoting the SQL standard's description of determinism: "An operation is deterministic if it is guaranteed to compute the same result when it repeats the same input values."
Obviously Flink cannot be fully deterministic, but neither can batch processing or traditional databases. For example, when querying the latest two records, the same SQL can return different results because data is being inserted all the time, so batch processing cannot achieve determinism either.
How to reduce Flink's non-determinism (ultimately, by using watermarks)
The non-deterministic update (NDU) problem in streaming queries is usually not intuitive, and a small change to a condition in a complex query may introduce the risk of NDU problems. Starting from version 1.16, Flink SQL (FLINK-27849) introduces an experimental NDU-handling mechanism via 'table.optimizer.non-deterministic-update.strategy'. When TRY_RESOLVE mode is enabled, Flink checks whether the streaming query has an NDU problem and tries to eliminate the non-deterministic updates caused by Lookup Join (internally adding materialization). If the problem cannot be eliminated automatically, Flink SQL gives as detailed an error message as possible, prompting the user to adjust the SQL to avoid introducing non-determinism (given the high cost and complexity of materialization, no automatic resolution mechanism is currently supported for the remaining cases).
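As a sketch of how this would be enabled (assuming Flink 1.16+, using the option name quoted above; I have not verified every place this option may be set):

```
table.optimizer.non-deterministic-update.strategy: TRY_RESOLVE
```

Programmatically, the same option should be settable through the table configuration, e.g. `tEnv.getConfig().set("table.optimizer.non-deterministic-update.strategy", "TRY_RESOLVE")` in recent Flink versions.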
Time attributes #
Refer to watermarks in stream computing.
Flink can process data based on several different notions of time:
- Processing time refers to the machine time when the specific operation is performed (the familiar absolute time, such as Java's System.currentTimeMillis()).
- Event time refers to the time carried by the data itself, i.e., the time when the event was generated.
- Ingestion time refers to the time when data enters Flink; internally, it is treated as event time.
Each type of table can have a time attribute, which can be specified:
- in CREATE TABLE DDL when creating the table,
- when converting from a DataStream,
- when defining a TableSource.
Once a time attribute is defined, it can be used like a regular column and in time-based operations.
As long as a time attribute is not modified, but simply forwarded from one table to another, it remains a valid time attribute. Time attributes can be used in calculations like ordinary timestamp columns, but once a time attribute is used in a calculation, it is materialized and becomes a regular timestamp. Regular timestamps cannot be used together with Flink's time and watermark mechanisms, so they cannot be used in time-based operations. In other words, a time field that you want to use for windowing must not participate in calculations.
Processing Time #
Time Attributes | Apache Flink
Event Time #
Time Attributes | Apache Flink
Temporal Tables #
A temporal table contains one or more versioned snapshots of a table and tracks all change records. A temporal table can be a changing history table (such as the changelog of a database table, which contains multiple table snapshots) or a table that materializes all changes (such as a database table, which contains only the latest snapshot). It is like a Redis backup: either the collection of all operations, or a backup of the final in-memory result.
A temporal table is a finer classification of the tables we create, mainly used in business scenarios; it sits below regular and virtual tables in the classification.
Version: A temporal table can be divided into a series of versioned table snapshots. The version of a snapshot represents the valid interval of all records in that snapshot, and the start and end times of the interval can be specified by the user. Depending on whether a temporal table can track its own historical versions, temporal tables are divided into version tables and regular tables.
- Version table: if the records in a temporal table can be tracked back through their historical versions, we call it a version table; the changelog of a database table can be defined as a version table. (Versions are distinguished by primary key and time.)
- Regular table: if only the latest version of each record can be tracked, we call it a regular table; a table from a database or HBase can be defined as a regular table. (This is the final-result table.)
Version table description
Take the scenario of joining an order stream with a product table as an example. The orders table contains the real-time order stream from Kafka, and the product_changelog table comes from the changelog of the database table products, in which product prices change in real time.
SELECT * FROM product_changelog;
(changelog kind) update_time product_id product_name price
================= =========== ========== ============ =====
+(INSERT) 00:01:00 p_001 scooter 11.11
+(INSERT) 00:02:00 p_002 basketball 23.11
-(UPDATE_BEFORE) 12:00:00 p_001 scooter 11.11
+(UPDATE_AFTER) 12:00:00 p_001 scooter 12.99
-(UPDATE_BEFORE) 12:00:00 p_002 basketball 23.11
+(UPDATE_AFTER) 12:00:00 p_002 basketball 19.99
-(DELETE) 18:00:00 p_001 scooter 12.99
The product_changelog table represents the ever-growing changelog of the database table products. For example, the initial price of the product scooter is 11.11 at time 00:01:00, the price rises to 12.99 at 12:00:00, and the product's price record is deleted at 18:00:00.
If we want to output the version of the product_changelog table at time 10:00:00, the content of the table is as follows:
update_time product_id product_name price
=========== ========== ============ =====
00:01:00 p_001 scooter 11.11
00:02:00 p_002 basketball 23.11
If we want to output the version of the product_changelog table at time 13:00:00, the content of the table is as follows:
update_time product_id product_name price
=========== ========== ============ =====
12:00:00 p_001 scooter 12.99
12:00:00 p_002 basketball 19.99
In the above example, versions of the products table are tracked by product_id and update_time: product_id corresponds to the primary key of the product_changelog table, and update_time corresponds to its event time.
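How a version table answers "what did the table look like at time T" can be sketched by replaying the changelog above up to T, keyed by the primary key product_id. This is a plain-Java illustration with invented names, only to show the semantics:

```java
import java.util.Map;
import java.util.TreeMap;

public class VersionSnapshotSketch {
    // entry: {kind, update_time, product_id, price}; kinds follow the changelog above
    public static Map<String, String> snapshotAt(String[][] changelog, String time) {
        Map<String, String> snapshot = new TreeMap<>(); // product_id -> price
        for (String[] e : changelog) {
            if (e[1].compareTo(time) > 0) break;        // entries are time-ordered
            switch (e[0]) {
                case "INSERT":
                case "UPDATE_AFTER":
                    snapshot.put(e[2], e[3]);
                    break;
                case "DELETE":
                    snapshot.remove(e[2]);
                    break;
                default:
                    // UPDATE_BEFORE carries the retracted row; the following
                    // UPDATE_AFTER overwrites it, so there is nothing to do
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        String[][] changelog = {
            {"INSERT",        "00:01:00", "p_001", "11.11"},
            {"INSERT",        "00:02:00", "p_002", "23.11"},
            {"UPDATE_BEFORE", "12:00:00", "p_001", "11.11"},
            {"UPDATE_AFTER",  "12:00:00", "p_001", "12.99"},
            {"UPDATE_BEFORE", "12:00:00", "p_002", "23.11"},
            {"UPDATE_AFTER",  "12:00:00", "p_002", "19.99"},
            {"DELETE",        "18:00:00", "p_001", "12.99"}
        };
        System.out.println(snapshotAt(changelog, "10:00:00")); // {p_001=11.11, p_002=23.11}
        System.out.println(snapshotAt(changelog, "13:00:00")); // {p_001=12.99, p_002=19.99}
    }
}
```

Replaying up to 10:00:00 and up to 13:00:00 reproduces the two snapshots shown above.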
Regular table description
On the other hand, some use cases need to join against a changing dimension table, which is an external database table.
Suppose LatestRates is a materialized table of the latest exchange rates (for example, an HBase table); LatestRates always represents the latest content of the HBase table Rates.
The content queried at 10:15:00 is as follows:
10:15:00 > SELECT * FROM LatestRates;
currency rate
========= ====
US Dollar 102
Euro 114
Yen 1
The content queried at 11:00:00 is as follows:
11:00:00 > SELECT * FROM LatestRates;
currency rate
========= ====
US Dollar 102
Euro 116
Yen 1
Creating temporal tables
Flink uses primary key constraints and event time to define version tables and version views.
Declaring a version table
In Flink, a table whose definition contains a primary key constraint and an event-time attribute is a version table.
-- define a version table
CREATE TABLE product_changelog (
product_id STRING,
product_name STRING,
product_price DECIMAL(10, 4),
update_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
  PRIMARY KEY(product_id) NOT ENFORCED,      -- (1) define the primary key constraint
  WATERMARK FOR update_time AS update_time   -- (2) define the event time via watermark
) WITH (
'connector' = 'kafka',
'topic' = 'products',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = 'localhost:9092',
'value.format' = 'debezium-json'
);
Line (1) defines the primary key of the table product_changelog, and line (2) defines update_time as the event time of the table, so product_changelog is a version table.
Note: the syntax METADATA FROM 'value.source.timestamp' VIRTUAL means extracting, from each changelog entry, the execution time of the corresponding operation on the underlying database table. It is strongly recommended to use the operation's execution time in the database as the event time; otherwise the version extracted by time may not match the version in the database.
Declaring version views #
Flink also supports defining version views: as long as a view contains a primary key and an event time, it is a version view.
Suppose we have a table RatesHistory as shown below:
-- define an append-only table
CREATE TABLE RatesHistory (
currency_time TIMESTAMP(3),
currency STRING,
rate DECIMAL(38, 10),
  WATERMARK FOR currency_time AS currency_time -- define the event time
) WITH (
'connector' = 'kafka',
'topic' = 'rates',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json' -- a regular append-only stream
)
The table RatesHistory represents an ever-growing append-only table of currency exchange rates against the Japanese Yen (the rate for Yen is 1). For example, the exchange rate of Euro to Yen is 114 from 09:00:00 to 10:45:00, and 116 from 10:45:00 to 11:15:00.
SELECT * FROM RatesHistory;
currency_time currency rate
============= ========= ====
09:00:00 US Dollar 102
09:00:00 Euro 114
09:00:00 Yen 1
10:45:00 Euro 116
11:15:00 Euro 119
11:49:00 Pounds 108
In order to define a version table on RatesHistory, Flink supports defining a version view through a deduplication query. A deduplication query produces an ordered changelog stream; it can infer the primary key and preserves the event-time attribute of the original data stream.
CREATE VIEW versioned_rates AS
SELECT currency, rate, currency_time             -- (1) `currency_time` preserves the event time
FROM (
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY currency  -- (2) `currency` is the unique key of the deduplication query and can serve as the primary key
         ORDER BY currency_time DESC) AS rowNum
FROM RatesHistory )
WHERE rowNum = 1;
-- the view `versioned_rates` will produce the following changelog:
(changelog kind) currency_time currency rate
================ ============= ========= ====
+(INSERT) 09:00:00 US Dollar 102
+(INSERT) 09:00:00 Euro 114
+(INSERT) 09:00:00 Yen 1
+(UPDATE_AFTER) 10:45:00 Euro 116
+(UPDATE_AFTER) 11:15:00 Euro 119
+(INSERT) 11:49:00 Pounds 108
Line (1) preserves the event time as the event time of the view versioned_rates, and line (2) gives the view versioned_rates a primary key; therefore the view versioned_rates is a version view.
The deduplication query in the view will be optimized by Flink and efficiently generate the changelog stream. The generated changelog retains the primary key constraint and event time.
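The effect of this deduplication can be sketched in plain Java (invented names, not Flink's optimizer output): keep the latest row per currency and emit +I for the first row of a key and +U (UPDATE_AFTER) for subsequent ones, matching the changelog shown above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupViewSketch {
    // row: {currency_time "HH:mm:ss", currency, rate}
    public static List<String> changelog(String[][] rows) {
        Map<String, String> latest = new HashMap<>(); // currency -> latest rate
        List<String> log = new ArrayList<>();
        for (String[] r : rows) {
            // first row for a key is an insert; later rows update the latest version
            String kind = latest.containsKey(r[1]) ? "+U" : "+I";
            latest.put(r[1], r[2]);
            log.add(kind + "[" + r[0] + ", " + r[1] + ", " + r[2] + "]");
        }
        return log;
    }

    public static void main(String[] args) {
        String[][] rates = {
            {"09:00:00", "US Dollar", "102"}, {"09:00:00", "Euro", "114"},
            {"09:00:00", "Yen", "1"}, {"10:45:00", "Euro", "116"},
            {"11:15:00", "Euro", "119"}, {"11:49:00", "Pounds", "108"}
        };
        System.out.println(changelog(rates));
    }
}
```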
If we want to output the version of the view versioned_rates at time 11:00:00, the content of the table is as follows:
currency_time currency rate
============= ========== ====
09:00:00 US Dollar 102
09:00:00 Yen 1
10:45:00 Euro 116
If we want to output the version of the view versioned_rates at time 12:00:00, the content of the table is as follows:
currency_time currency rate
============= ========== ====
09:00:00 US Dollar 102
09:00:00 Yen 1
11:15:00 Euro 119
11:49:00 Pounds 108
Declaring a regular table #
Declaring a regular table uses the same DDL as creating any Flink table. Refer to the CREATE TABLE page for more information on how to create tables.
-- define an HBase table with DDL; it can then be used as a temporal table in SQL
-- the 'currency' column is the rowKey of the HBase table
CREATE TABLE LatestRates (
currency STRING,
fam1 ROW<rate DOUBLE>
) WITH (
'connector' = 'hbase-1.4',
'table-name' = 'rates',
'zookeeper.quorum' = 'localhost:2181'
);