Flink Principles (7) - Dynamic Tables

Foreword

  This article is based on Flink's official documentation combined with my own understanding; if there are mistakes, feel free to point them out in the comments, thank you! The figures in this article are taken from the official documentation (link [1]).

  This article is organized around the following question; a more vivid explanation of it can be found in the shared talk at link [2].

  Is SQL suitable for stream processing scenarios?

  In stream processing, the arrival of each record triggers the query, produces a result, and emits it downstream. For the same data source and the same SQL query, batch and streaming produce the same output; that is, the SQL semantics are consistent across the two modes (One Query, One Result), and the final result is the same.

1. Dynamic Tables and Continuous Queries (Dynamic Table & Continuous Query)

  A dynamic table is the counterpart of a static table, i.e. a conventional database table or a batch table, whose data does not change while it is being queried. A dynamic table changes over time, even while it is being queried. How to understand this? Data on a stream arrives continuously; the arrival of one record triggers the query, and by the time the next record arrives and the query runs again, the data in the table itself has already changed.

  Querying a dynamic table is continuous, i.e. a continuous query. A continuous query over a dynamic table is conceptually very similar to the query that defines a materialized view. A materialized view is defined by a SQL query, just like a regular virtual view; the difference is that a materialized view caches the query result, so the query does not need to be recomputed when the view is accessed. The challenge brought by caching is that the view may serve outdated results; Eager View Maintenance is a technique for updating a materialized view in a timely manner, which is not discussed further here.

  The relationship among streams, dynamic tables, and continuous queries is shown below:

  In one sentence: a stream is converted into a dynamic table, a continuous query on the dynamic table produces a new dynamic table (the result table), and the result table is converted back into a stream.
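  To make this loop concrete, here is a minimal Java sketch against the Flink 1.9 Table API. It borrows the click-event example introduced in section 2 below; the class name, the inline sample data, and the column alias cnt are illustrative choices rather than anything prescribed by the documentation.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class DynamicTableDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1) stream -> dynamic table: interpret the insert-only click stream as a table
        DataStream<Tuple2<String, String>> clickStream = env.fromElements(
                Tuple2.of("Mary", "./home"),
                Tuple2.of("Bob", "./cart"),
                Tuple2.of("Mary", "./prod?id=1"));
        tableEnv.registerTable("clicks", tableEnv.fromDataStream(clickStream, "user, url"));

        // 2) continuous query on the dynamic table -> new dynamic table (the result table);
        //    `user` is quoted because USER is a reserved keyword in SQL
        Table result = tableEnv.sqlQuery(
                "SELECT `user`, COUNT(url) AS cnt FROM clicks GROUP BY `user`");

        // 3) result table -> stream: the per-user count is updated over time,
        //    so the result is emitted as a retract stream of (flag, row) pairs
        DataStream<Tuple2<Boolean, Row>> resultStream = tableEnv.toRetractStream(result, Row.class);
        resultStream.print();

        env.execute("dynamic table demo");
    }
}

  The three numbered comments correspond to the three steps of the loop described above.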

2. Defining a Table on a Stream

  2.1 Defining a Table

  In order to run relational queries on a stream, the stream has to be converted into a table. The following analysis uses the example described on the official website (link [1]).

  1) The schema of the click event stream is as follows:

[
  user:  VARCHAR,   // the name of the user
  cTime: TIMESTAMP, // the time when the URL was accessed
  url:   VARCHAR    // the URL that was accessed by the user
]

  2) Conceptually, every record of the stream is interpreted as an INSERT modification of the dynamic table. Essentially, the table is built from an INSERT-only changelog stream. The table built from the click event stream is shown below; as more click records are inserted, the resulting table keeps growing:

  Note: a table defined on a stream is not materialized internally.

  2.2 Continuous Query

  A continuous query never terminates, and its result table keeps being updated according to changes of the input table. Two example queries are given below.

  1) A simple GROUP BY count aggregation query

  In the figure below, the table on the left is the input table clicks, which grows as new records arrive over time; on the right is the query's result table. At the beginning, clicks holds a single record [Mary, ./home] and the result is table-1; when a new record [Bob, ./cart] arrives in clicks, the result table becomes table-2, and so on. Each incoming record either UPDATEs an existing row of the result table or INSERTs a new row, so with every new record the query produces a fresh result computed from all the data received so far.

  2) An aggregation query with windows

  The window interval is one hour: window-1 corresponds to table-1, window-2 to table-2, and so on. Unlike the first query, each table only aggregates the data of its own window; data from earlier windows has no effect on it, and the query results of different windows are appended to the result table. A sketch of such a windowed query is given below.
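  As a sketch of what this windowed query looks like, the hourly tumbling-window aggregation can be written as follows, reusing tableEnv from the sketch in section 1. It assumes that the clicks table was registered with cTime declared as an event-time attribute, which the minimal sketch above did not do.

// hourly tumbling-window count per user; cTime must be a declared time attribute
Table hourlyCounts = tableEnv.sqlQuery(
        "SELECT `user`, TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT, COUNT(url) AS cnt " +
        "FROM clicks " +
        "GROUP BY `user`, TUMBLE(cTime, INTERVAL '1' HOUR)");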

   2.3 Update and Append Queries

  The two examples in 2.2 correspond to two different kinds of queries:

  1) Example 1 corresponds to an update query, which has to modify results it has already emitted; the changes include both INSERT and UPDATE. Having to change previously emitted results means that this kind of query must maintain more state.

  2) Example 2 corresponds to an append query, whose results are only ever appended to the result table and involve only INSERT operations. A table produced this way is converted into a stream differently from a table produced by an update query (see below).

  2.4 Query Restrictions

  Some SQL queries are not worth running as continuous queries, either because the state that must be retained grows too large, or because the cost of recomputing and updating already emitted records is too high.

  1) State size: take the SQL below as an example. In a continuous query, when a new record arrives, the previous computation results, i.e. the state, have to be kept so that already emitted results can be updated (recall example 1 in 2.2). When the continuous query runs for a long time, the amount of state to keep grows larger and larger over time; even worse, if new users (distinct usernames) keep appearing, the number of counts to store grows still faster, and the job may eventually fail.

SELECT user, COUNT(url) FROM clicks GROUP BY user;

  2) Computing updates: take the SQL below as an example. When a new record is added to the clicks table, computing the rank requires recomputing and updating a large portion of the previously emitted results; adding a single record may change the rank of many users.

SELECT user, RANK() OVER (ORDER BY lastAction)
FROM (
  SELECT user, MAX(cTime) AS lastAction FROM clicks GROUP BY user
);

   3) Query configuration (link [3])

  In a common scenario, a continuous query runs as a long-lived job. To prevent the stored state from growing beyond the available storage and causing the job to fail, the state size may be bounded by deleting state. But this can cause another problem: the query results may become inaccurate. The query configuration of Flink's Table API and SQL tries to strike a balance between accuracy and resource consumption.

   Idle State Retention Time specifies how long a key's state may remain without being updated before it is removed, i.e. the retention time of state that is no longer updated. It is used as follows:

StreamQueryConfig qConfig = ...

// set idle state retention time: min = 12 hours, max = 24 hours
qConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(24));
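  As a usage sketch (assuming the tableEnv and result table from the sketch in section 1, together with the qConfig above), the configuration takes effect by passing it along when the result table is translated into a stream:

// pass the query configuration when converting the result table into a retract stream
DataStream<Tuple2<Boolean, Row>> stream =
        tableEnv.toRetractStream(result, Row.class, qConfig);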

3. Converting a Table into a Stream

  A dynamic table can be changed by INSERT, UPDATE, and DELETE modifications, just like a regular table. When converting a dynamic table into a stream or writing it to an external system, these changes need to be encoded. Flink's Table API and SQL support three ways of encoding the changes of a dynamic table:

  1) Append-only stream: a dynamic table that is modified only by INSERT changes can be converted into a stream by emitting the inserted rows (recall example 2 in 2.2). In a stream produced this way, the data arrives in segments, with one segment per window.

  2) Retract stream: a retract stream contains two kinds of messages, add messages and retract messages. A dynamic table is converted into a retract stream by encoding an INSERT change as an add message, a DELETE change as a retract message, and an UPDATE change as a retract message for the previous (old) row plus an add message for the new row. The figure below shows the conversion of a dynamic table into a retract stream.

  Each message on the stream carries a flag: + marks an add (INSERT) and - marks a retract (DELETE). The first and second rows of the clicks table, [Mary, ./home] and [Bob, ./cart], are converted into the first and second messages of the stream. When the third row of clicks, [Mary, ./prod?id=1], is processed, the previously emitted first result is first marked with DELETE to inform downstream operators, and then a fourth message re-inserts the updated result for user Mary, and so on. This is how the correctness of the output is guaranteed.

  3) Upsert stream: an upsert stream contains upsert messages and delete messages. To be converted into an upsert stream, a dynamic table needs a (possibly composite) unique key. A dynamic table with a unique key is converted into a stream by encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages. The operators consuming the stream need to know the unique key attribute in order to apply the messages correctly. The main difference from a retract stream is that an UPDATE is encoded with a single message (on the unique key), which makes it more efficient.

  (Personal understanding, to be verified.) The difference between an upsert stream and a retract stream shows up when the data is stored in a third-party system: the former may leave duplicate data behind, while the latter does not.

 

   NOTE: when converting a dynamic table into a DataStream, only the append and retract modes are supported (see the sketch below).
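  A minimal Java sketch of the two supported conversions, again reusing tableEnv, the result table of example 1 from the sketch in section 1, and the hourlyCounts table of example 2 from the sketch after 2.2:

// update-mode result (example 1 in 2.2): must be emitted as a retract stream;
// each element is a (flag, row) pair, flag = true for add (+), false for retract (-)
DataStream<Tuple2<Boolean, Row>> retractStream = tableEnv.toRetractStream(result, Row.class);
// e.g. (true, Mary, 1), (true, Bob, 1), then (false, Mary, 1) and (true, Mary, 2) when Mary clicks again

// append-only result (example 2 in 2.2): can be emitted as an append stream;
// Flink rejects this conversion for tables that are changed by anything other than INSERT
DataStream<Row> appendStream = tableEnv.toAppendStream(hourlyCounts, Row.class);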

  

Ref:

  [1]https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/streaming/dynamic_tables.html#table-to-stream-conversion

  [2]http://www.itdks.com/Course/detail?id=13213&from=search

  [3]https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/streaming/query_configuration.html
