[Translation] Flink Table API & SQL - Streaming Concepts: Dynamic Tables

This is a translation of the official documentation on dynamic tables in the Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/streaming/dynamic_tables.html

SQL and relational algebra were not designed with streaming data in mind. As a consequence, there are a few conceptual gaps between relational algebra (and SQL) and stream processing.

This page discusses these differences and explains how Flink can achieve the same semantics on unbounded data that a conventional database engine achieves on bounded data.

Relational queries on data streams

The following comparison contrasts traditional relational algebra and stream processing with respect to input data, execution, and result output.

Relational algebra / SQL vs. stream processing:
  • A relation (or table) is a bounded (multi-)set of tuples; a stream is an infinite sequence of tuples.
  • A query executed on batch data (e.g., a table in a relational database) has access to the complete input data; a streaming query cannot access all data when it is started, but has to "wait" for data to be streamed in.
  • A batch query terminates after it produces a fixed-size result; a streaming query continuously updates its result based on the received records and never completes.

Despite these differences, processing streams with relational queries and SQL is not impossible. Advanced relational database systems offer a feature called materialized views (a view whose result is stored as a table and refreshed regularly; Oracle, for example, supports them). A materialized view is defined as a SQL query, just like a regular virtual view. In contrast to a virtual view, a materialized view caches the result of the query, so the query does not need to be evaluated when the view is accessed. A common challenge for caching is to prevent the cache from serving outdated results: a materialized view becomes outdated when the base tables of its defining query are modified. Eager view maintenance is a technique to update a materialized view as soon as its base tables are updated.

The connection between eager view maintenance and SQL queries on streams becomes obvious once we consider the following:

  • A database table is the result of a stream of INSERT, UPDATE, and DELETE DML statements, often called a changelog stream.
  • A materialized view is defined as a SQL query. In order to update the view, the query continuously processes the changelog streams of the view's base relations.
  • The materialized view is the result of the streaming SQL query.
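The maintenance loop described by these points can be sketched in a few lines of plain Python (this is not Flink code; the function and data names are illustrative assumptions): a table, viewed as the result of its changelog stream, is kept up to date by replaying INSERT, UPDATE, and DELETE changes keyed by a primary key.

```python
# Minimal sketch (not Flink code): a table maintained from a changelog
# stream of INSERT, UPDATE, and DELETE changes, keyed by a primary key.

def apply_changelog(changes):
    """Replay a changelog stream and return the resulting table state."""
    table = {}  # primary key -> row
    for op, key, row in changes:
        if op in ("INSERT", "UPDATE"):
            table[key] = row          # add or overwrite the row
        elif op == "DELETE":
            table.pop(key, None)      # remove the row if present
    return table

changelog = [
    ("INSERT", 1, {"user": "Mary", "url": "./home"}),
    ("INSERT", 2, {"user": "Bob",  "url": "./cart"}),
    ("UPDATE", 1, {"user": "Mary", "url": "./prod?id=1"}),
    ("DELETE", 2, None),
]
print(apply_changelog(changelog))
# {1: {'user': 'Mary', 'url': './prod?id=1'}}
```

Eager view maintenance corresponds to running this replay continuously, applying each change as soon as it arrives instead of in batches.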

With these points in mind, we introduce the concept of dynamic tables in the next section.

Dynamic tables and continuous queries

Dynamic tables are the core concept of Flink's Table API and SQL support for streaming data. In contrast to static tables that represent batch data, dynamic tables change over time. They can be queried just like static batch tables. Querying a dynamic table yields a continuous query. A continuous query never terminates and produces a dynamic table as its result. The query continuously updates its (dynamic) result table to reflect changes on its (dynamic) input table. In essence, a continuous query on a dynamic table is very similar to a query that defines a materialized view.

It is important to note that the result of a continuous query is always semantically equivalent to the result of the same query executed in batch mode on a snapshot of the input table.

The following figure visualizes the relationship between streams, dynamic tables, and continuous queries:

Dynamic table
  1. A stream is converted into a dynamic table.
  2. A continuous query is evaluated on the dynamic table, producing a new dynamic table.
  3. The resulting dynamic table is converted back into a stream.

Note: Dynamic tables are foremost a logical concept. A dynamic table is not necessarily (fully) materialized during query execution.

In the following, we explain the concepts of dynamic tables and continuous queries with a stream of click events that has the following schema:

[
  user:  VARCHAR,   // the name of the user
  cTime: TIMESTAMP, // the time when the URL was accessed
  url:   VARCHAR    // the URL that was accessed by the user
]

Defining a table on a stream

In order to process a stream with a relational query, it has to be converted into a Table. Conceptually, each record of the stream is interpreted as an INSERT modification on the resulting table. In essence, we are building a table from an INSERT-only changelog stream.

The figure below visualizes how the stream of click events (left) is converted into a table (right). The resulting table grows continuously as more clickstream records are inserted.

Append mode

Note: A table that is defined on a stream is internally not materialized.
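The conversion just described can be sketched as follows (this is not Flink code; names like `stream_to_table` and the sample records are illustrative assumptions). Each stream record becomes one INSERT, so the table only ever grows.

```python
# Minimal sketch (not Flink code): interpreting every stream record as an
# INSERT modification builds an ever-growing table from the stream.

def stream_to_table(stream):
    table = []
    for record in stream:
        table.append(record)  # each record is one INSERT into the table
    return table

clicks_stream = [
    ("Mary", "12:00:00", "./home"),
    ("Bob",  "12:00:00", "./cart"),
    ("Mary", "12:00:05", "./prod?id=1"),
]
table = stream_to_table(clicks_stream)
print(len(table))  # 3 rows after three INSERTs
```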

Continuous queries


A continuous query is evaluated on a dynamic table and produces a new dynamic table as its result. In contrast to a batch query, a continuous query never terminates and updates its result table according to updates on its input table. At any point in time, the result of a continuous query is semantically equivalent to the result of the same query executed in batch mode on a snapshot of the input table.

In the following, we show two example queries on a clicks table that is defined on the stream of click events.

The first query is a simple GROUP-BY COUNT aggregation. It groups the clicks table by the user field and counts the number of visited URLs. The following figure shows how the query is evaluated over time as the clicks table is updated with additional rows.

Continuous non-windowed query

When the query starts, the clicks table (left) is empty. The query starts to compute the result table when the first row is inserted into the clicks table. After the first row [Mary, ./home] is inserted, the result table (right, top) consists of a single row [Mary, 1]. When the second row [Bob, ./cart] is inserted into the clicks table, the query updates the result table and inserts a new row [Bob, 1]. The third row [Mary, ./prod?id=1] yields an update of an already computed result row, such that [Mary, 1] is updated to [Mary, 2]. Finally, the query inserts a third row [Liz, 1] into the result table when the fourth row is appended to the clicks table.
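The walkthrough above can be sketched as a small simulation (this is not Flink code; the function name and emitted changelog format are illustrative assumptions). Each new click either INSERTs a new result row or UPDATEs an existing one, which is exactly the change sequence described in the figure.

```python
# Minimal sketch (not Flink code) of the continuous GROUP BY COUNT query:
# each new click either INSERTs a new result row or UPDATEs an existing one.

def continuous_count(clicks):
    counts = {}     # result table state: user -> COUNT(url)
    changelog = []  # changes emitted for the result table
    for user, url in clicks:
        if user not in counts:
            counts[user] = 1
            changelog.append(("INSERT", user, 1))
        else:
            counts[user] += 1
            changelog.append(("UPDATE", user, counts[user]))
    return counts, changelog

clicks = [("Mary", "./home"), ("Bob", "./cart"),
          ("Mary", "./prod?id=1"), ("Liz", "./home")]
counts, log = continuous_count(clicks)
print(counts)  # {'Mary': 2, 'Bob': 1, 'Liz': 1}
print(log)     # [('INSERT', 'Mary', 1), ('INSERT', 'Bob', 1),
               #  ('UPDATE', 'Mary', 2), ('INSERT', 'Liz', 1)]
```

Note that the state (`counts`) must keep one entry per distinct user forever, which foreshadows the state-size restriction discussed below.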

The second query is similar to the first one, but groups the clicks table on an hourly tumbling window in addition to the user attribute before it counts the number of URLs (time-based computations such as windows are based on special time attributes, which are discussed later). Again, the figure shows the input and output at different points in time to visualize the changing nature of dynamic tables.

Continuous group-window query

As before, the input table clicks is shown on the left. The query continuously computes results every hour and updates the result table. The clicks table contains four rows with timestamps (cTime) between 12:00:00 and 12:59:59. The query computes two result rows from this input (one for each user) and appends them to the result table. For the next window between 13:00:00 and 13:59:59, the clicks table contains three rows, which results in another two rows being appended to the result table. The result table is updated as more rows are appended to clicks over time.
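The window behavior can be sketched like this (not Flink code; the function name, the string-based window key, and the sample data are illustrative assumptions). Rows for a window are emitted once per (window, user) pair and are never changed afterwards, so the result is append-only.

```python
# Minimal sketch (not Flink code) of the hourly tumbling-window query:
# per-user counts are emitted once per window and never updated afterwards.
from collections import Counter

def hourly_window_counts(clicks):
    """clicks: list of (user, 'HH:MM:SS') tuples."""
    windows = {}  # window start hour -> per-user URL counts
    for user, ctime in clicks:
        hour = ctime.split(":")[0]  # tumbling one-hour window key
        windows.setdefault(hour, Counter())[user] += 1
    # Emit one appended row per (window, user); rows are INSERT-only.
    return [(hour + ":00:00", user, cnt)
            for hour, counter in windows.items()
            for user, cnt in counter.items()]

clicks = [("Mary", "12:00:00"), ("Bob", "12:10:00"),
          ("Mary", "12:55:00"), ("Bob", "13:01:00"),
          ("Liz", "13:30:00")]
print(hourly_window_counts(clicks))
# [('12:00:00', 'Mary', 2), ('12:00:00', 'Bob', 1),
#  ('13:00:00', 'Bob', 1), ('13:00:00', 'Liz', 1)]
```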

Update and append queries

Although the two example queries appear quite similar (both compute a grouped count aggregate), they differ in one important aspect:

  • The first query updates previously emitted results, i.e., the changelog stream that defines the result table contains INSERT and UPDATE changes.
  • The second query only appends to the result table, i.e., the changelog stream of the result table only consists of INSERT changes.

Whether a query produces an append-only table or an updated table has some implications:

  • Queries that produce update changes usually have to maintain more state (see the following section).
  • The conversion of an append-only table into a stream is different from the conversion of an updated table (see the table-to-stream conversion section).

Query restrictions

Many, but not all, semantically valid queries can be evaluated as continuous queries on streams. Some queries are too expensive to compute, either due to the size of the state they need to maintain or because computing updates is too expensive.

  • State size: Continuous queries are evaluated on unbounded streams and are often supposed to run for weeks or months. Hence, the total amount of data that a continuous query processes can be very large. Queries that have to update previously emitted results need to maintain all emitted rows in order to be able to update them. For instance, the first example query needs to store the URL count for each user to be able to increase the count and emit a new result when the input table receives a new row. If only registered users are tracked, the number of counts to maintain might not be too high. However, if non-registered users get a unique user name assigned, the number of counts to maintain grows over time and might eventually cause the query to fail.
SELECT user, COUNT(url)
FROM clicks
GROUP BY user;
  • Computing updates: Some queries require recomputing and updating a large fraction of the emitted result rows even if only a single input record is added or updated. Clearly, such queries are not well suited to be executed as continuous queries. An example is the following query, which computes for each user a RANK based on the time of the last click. As soon as the clicks table receives a new row, the user's lastAction is updated and a new rank must be computed. However, since two rows cannot have the same rank, all lower-ranked rows need to be updated as well.
SELECT user, RANK() OVER (ORDER BY lastAction)
FROM (
  SELECT user, MAX(cTime) AS lastAction FROM clicks GROUP BY user
);
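The update cascade in the RANK example can be demonstrated with a small simulation (not Flink code; the function name and sample timestamps are illustrative assumptions). A single new click changes one user's lastAction, yet every result row's rank changes.

```python
# Minimal sketch (not Flink code) illustrating why the RANK query is
# expensive: a single new click can change the rank of many result rows.

def rank_by_last_action(last_action):
    """last_action: dict user -> latest cTime. Returns user -> rank,
    where rank 1 is the earliest last action (ascending ORDER BY)."""
    ordered = sorted(last_action, key=lambda u: last_action[u])
    return {user: i + 1 for i, user in enumerate(ordered)}

last_action = {"Mary": "12:00:05", "Bob": "12:00:10", "Liz": "12:00:20"}
before = rank_by_last_action(last_action)

# A single new click by Mary updates her lastAction ...
last_action["Mary"] = "12:00:30"
after = rank_by_last_action(last_action)

# ... and forces every other row's rank to change as well.
changed = [u for u in before if before[u] != after[u]]
print(sorted(changed))  # ['Bob', 'Liz', 'Mary'] - all rows updated
```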

The "Query Configuration" section discusses parameters to control the execution of continuous queries. Some parameters can be used to trade the size of maintained state for result accuracy.

Table-to-stream conversion

A dynamic table can be continuously modified by INSERT, UPDATE, and DELETE changes, just like a regular database table. It might be a table with a single row that is constantly updated, an insert-only table without UPDATE and DELETE modifications, or anything in between.

When converting a dynamic table into a stream or writing it to an external system, these changes need to be encoded. Flink's Table API and SQL support three ways to encode the changes of a dynamic table:

  • Append-only stream: A dynamic table that is only modified by INSERT changes can be converted into a stream by emitting the inserted rows.

  • Retract stream: A retract stream is a stream with two types of messages, add messages and retract messages. A dynamic table is converted into a retract stream by encoding an INSERT change as an add message, a DELETE change as a retract message, and an UPDATE change as a retract message for the updated (previous) row and an add message for the updating (new) row. The following figure visualizes the conversion of a dynamic table into a retract stream.

Dynamic table



  • Upsert stream: An upsert stream is a stream with two types of messages, upsert messages and delete messages. A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key. A dynamic table with a unique key is converted into a stream by encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages. The stream-consuming operator needs to be aware of the unique key attribute in order to apply the messages correctly. The main difference to a retract stream is that UPDATE changes are encoded with a single message and are hence more efficient. The following figure visualizes the conversion of a dynamic table into an upsert stream.
Dynamic table



The API to convert a dynamic table into a DataStream is discussed on the Common Concepts page. Note that only append and retract streams are supported when converting a dynamic table into a DataStream. The TableSink interface for emitting a dynamic table to an external system is discussed on the TableSources and TableSinks page.


 


Origin www.cnblogs.com/Springmoon-venn/p/11839715.html