Table API & SQL: Joins in Continuous Queries

This article is mainly a translation of the relevant content from the official Flink documentation. Original address: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/streaming/time_attributes.html

Joins are a common and well-understood operation in batch data processing, used to connect the rows of two relations. On dynamic tables, however, the semantics of joins are much less obvious and can even be confusing.

For that reason, there are several ways to actually perform a join with the Table API or SQL in Flink.

For background on time attributes and temporal tables, please refer to the author's earlier articles.

Regular Joins

Regular joins are the most generic type of join: any new record or change on either side of the join input is visible and affects the whole join result. For example, if there is a new record on the left side, it will be joined with all previous and future records on the right side.

SELECT * FROM Orders
INNER JOIN Product
ON Orders.productId = Product.id

These semantics allow any kind of update (insert, update, delete) on the input tables.

However, this operation has an important implication: it requires both sides of the join input to be kept in Flink's state forever. Thus, if one or both input tables grow continuously, resource usage will grow indefinitely as well.
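
To make the state requirement concrete, here is a minimal Java sketch of a symmetric hash join. It is only an illustration under simplifying assumptions (insert-only inputs, string payloads), not Flink's actual operator; the class RegularJoinSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy symmetric hash join, not Flink's implementation.
final class RegularJoinSketch {
    // Both inputs are retained per join key; nothing is ever evicted.
    private final Map<Integer, List<String>> orders = new HashMap<>();
    private final Map<Integer, List<String>> products = new HashMap<>();

    // A new order is stored, then matched against every product seen so far.
    List<String> onOrder(int productId, String order) {
        orders.computeIfAbsent(productId, k -> new ArrayList<>()).add(order);
        List<String> out = new ArrayList<>();
        for (String p : products.getOrDefault(productId, Collections.emptyList())) {
            out.add(order + "|" + p);
        }
        return out;
    }

    // A new product is stored, then matched against every order seen so far.
    List<String> onProduct(int id, String product) {
        products.computeIfAbsent(id, k -> new ArrayList<>()).add(product);
        List<String> out = new ArrayList<>();
        for (String o : orders.getOrDefault(id, Collections.emptyList())) {
            out.add(o + "|" + product);
        }
        return out;
    }

    // Number of records held in state across both sides.
    int stateSize() {
        int n = 0;
        for (List<String> l : orders.values()) n += l.size();
        for (List<String> l : products.values()) n += l.size();
        return n;
    }
}
```

Every record that arrives on either side increases stateSize(), which is why an unbounded input leads to unbounded state.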

Time-windowed Joins

A time-windowed join is defined by a join predicate that checks whether the time attributes of the input records are within certain time constraints, i.e., a time window.

SELECT *
FROM
  Orders o,
  Shipments s
WHERE o.id = s.orderId AND
      o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime

Compared with a regular join, this kind of join only supports append-only tables with time attributes. Because time attributes are quasi-monotonically increasing, Flink can remove old values from its state without affecting the correctness of the result.
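
The following Java sketch illustrates both ideas: the window predicate from the query above, and how a watermark lets buffered rows be dropped. It is a simplified model, not Flink's operator; WindowJoinSketch, its methods, and the use of a single watermark for both inputs are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models the 4-hour window predicate and state cleanup.
final class WindowJoinSketch {
    static final long WINDOW_MS = 4L * 60 * 60 * 1000; // INTERVAL '4' HOUR

    // Mirrors: o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
    static boolean inWindow(long orderTime, long shipTime) {
        return orderTime >= shipTime - WINDOW_MS && orderTime <= shipTime;
    }

    // Buffered order timestamps (payload omitted for brevity).
    private final List<Long> bufferedOrders = new ArrayList<>();

    void addOrder(long orderTime) {
        bufferedOrders.add(orderTime);
    }

    // All future shipments have shiptime later than the watermark, so an order
    // with ordertime + 4h before the watermark can never match again.
    void onWatermark(long watermark) {
        bufferedOrders.removeIf(t -> t + WINDOW_MS < watermark);
    }

    int bufferedCount() {
        return bufferedOrders.size();
    }
}
```

Because time only moves forward, the buffer stays bounded by the window size rather than by the total input size.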

Join with a Temporal Table Function

A join with a temporal table function joins an append-only table (left input / probe side) with a temporal table (right input / build side), i.e., a table that changes over time and tracks its changes.

The following example shows an append-only table Orders that should be joined with the continuously changing currency rates table RatesHistory. Orders is an append-only table that represents payments of a given amount in a given currency. For example, at 10:15 there was an order for an amount of 2 Euro.

SELECT * FROM Orders;
 
rowtime amount currency
======= ====== =========
10:15        2 Euro
10:30        1 US Dollar
10:32       50 Yen
10:52        3 Euro
11:04        5 US Dollar

RatesHistory is an ever-changing append-only table of currency exchange rates with respect to Yen (which has a rate of 1). For example, the exchange rate of Euro to Yen was 114 from 09:00 to 10:45, and 116 from 10:45 to 11:15.

SELECT * FROM RatesHistory;
 
rowtime currency   rate
======= ======== ======
09:00   US Dollar   102
09:00   Euro        114
09:00   Yen           1
10:45   Euro        116
11:15   Euro        119
11:49   Pounds      108

Suppose we want to calculate the amount of all orders converted to a common currency (Yen).

For example, we want to convert the following order using the conversion rate that applies at its rowtime (114):

rowtime amount currency
======= ====== =========
10:15        2 Euro

Without the concept of temporal tables, you would need to write a query like the following:

SELECT
  SUM(o.amount * r.rate) AS amount
FROM Orders AS o,
  RatesHistory AS r
WHERE r.currency = o.currency
AND r.rowtime = (
  SELECT MAX(rowtime)
  FROM RatesHistory AS r2
  WHERE r2.currency = o.currency
  AND r2.rowtime <= o.rowtime);

With a temporal table function Rates defined over RatesHistory, this kind of query can be expressed in SQL as:

SELECT
  o.amount * r.rate AS amount
FROM
  Orders AS o,
  LATERAL TABLE (Rates(o.rowtime)) AS r
WHERE r.currency = o.currency

Each record from the probe side is joined with the version of the build-side table at the time given by the probe record's correlated time attribute. To support updates (overwrites) of previous values in the build-side table, the table must define a primary key.

In our example, each record from Orders is joined with the version of Rates at time o.rowtime. The currency field was defined as the primary key of Rates beforehand and is used to connect the two tables in our example. If the query used processing time instead, newly appended orders would always be joined with the most recent version of Rates at the moment the operation is executed.

In contrast to regular joins, this means that a new record on the build side (the temporal table) does not affect previous join results. This again allows Flink to limit the number of elements it must keep in state.

Compared with time-windowed joins, temporal table joins do not define a time window within which records are joined. Records from the probe side are always joined with the build side's version at the time specified by the time attribute. Accordingly, records on the build side may be arbitrarily old. As time passes, previous versions of a record (for a given primary key) that are no longer needed are removed from state.

Usage

Once a temporal table function has been defined, we can start using it. Temporal table functions are used in the same way as ordinary table functions.

The following code snippets solve our motivating problem of converting the currencies of the Orders table:

SQL:
SELECT
  SUM(o_amount * r_rate) AS amount
FROM
  Orders,
  LATERAL TABLE (Rates(o_proctime))
WHERE
  r_currency = o_currency
  
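JAVA:
// lateral join against the temporal table function, evaluated at o_proctime
Table result = orders
    .joinLateral("rates(o_proctime)", "o_currency = r_currency")
    .select("(o_amount * r_rate).sum as amount");

SCALA:
// lateral join against the temporal table function, evaluated at o_proctime
val result = orders
    .joinLateral(rates('o_proctime), 'r_currency === 'o_currency)
    .select(('o_amount * 'r_rate).sum as 'amount)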

Note: State retention defined in a query configuration is not yet implemented for temporal joins. This means that the state required to compute the query result might grow indefinitely, depending on the number of distinct primary keys in the history table.

Processing-time Temporal Joins

With a processing-time attribute, it is impossible to pass a past time attribute as an argument to the temporal table function. By definition, it is always the current timestamp. Therefore, invoking a processing-time temporal table function always returns the latest known versions of the underlying table, and any update to the underlying history table immediately overwrites the current values.

Only the latest versions of the build-side records (with respect to the defined primary key) are kept in state. Updates on the build side have no effect on previously emitted join results.

A processing-time temporal join can be thought of as a simple HashMap<K, V> that stores all records from the build side. When a new build-side record has the same key as a previous record, the old value is simply overwritten. Every record from the probe side is always evaluated against the most recent (current) state of the HashMap.
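
The HashMap analogy can be written down directly. The Java sketch below is a mental model only, not Flink code; ProcTimeTemporalJoinSketch and its method names are hypothetical, and rates are stored as whole numbers to match the example tables.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the HashMap<K, V> view of a processing-time temporal join.
final class ProcTimeTemporalJoinSketch {
    // Latest rate per primary key (currency); older versions are overwritten.
    private final Map<String, Long> latestRates = new HashMap<>();

    // Build side: a new rate replaces the previous one for the same currency.
    void onRate(String currency, long rate) {
        latestRates.put(currency, rate);
    }

    // Probe side: convert with whatever rate is current right now.
    // Returns null when no rate is known yet (an inner join emits nothing).
    Long onOrder(long amount, String currency) {
        Long rate = latestRates.get(currency);
        if (rate == null) {
            return null;
        }
        return amount * rate;
    }
}
```

For instance, an order of 2 Euro yields 228 while the stored Euro rate is 114, and 232 once the rate has been overwritten with 116; results emitted earlier are not revised.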

Event-time Temporal Joins

With an event-time attribute (i.e., a rowtime attribute), it is possible to pass past time attributes to the temporal table function. This allows the two tables to be joined at a common point in time.

Compared with processing-time temporal joins, the temporal table keeps not only the latest version of the build-side records (with respect to the defined primary key) in state, but stores all versions (identified by time) since the last watermark.

For example, following the concept of temporal tables, an incoming row with an event-time timestamp of 12:30:00 that is appended to the probe-side table is joined with the version of the build-side table at time 12:30:00. Accordingly, the incoming row is joined only with rows whose timestamp is lower than or equal to 12:30:00, with updates applied per primary key up to that point in time.

By the definition of event time, watermarks allow the join operation to move forward in time and discard versions of the build table that are no longer needed, because no incoming row with a lower or equal timestamp is expected.
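
As a rough model of the event-time case, the Java sketch below keeps a sorted map of versions per primary key, looks up the version valid at the order's rowtime, and prunes old versions when a watermark arrives. It is an assumption-laden illustration rather than Flink's implementation: EventTimeTemporalJoinSketch is a hypothetical class, times are expressed as minutes since midnight, and rows are assumed to arrive in a convenient order.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: versioned build side with watermark-driven cleanup.
final class EventTimeTemporalJoinSketch {
    // Per currency: rate versions keyed by the time they became valid.
    private final Map<String, TreeMap<Integer, Long>> versions = new HashMap<>();

    void onRate(int time, String currency, long rate) {
        versions.computeIfAbsent(currency, k -> new TreeMap<>()).put(time, rate);
    }

    // Joins the order with the rate version valid at the order's rowtime.
    // Returns null when no version exists at or before that time.
    Long onOrder(int rowtime, long amount, String currency) {
        TreeMap<Integer, Long> v = versions.get(currency);
        if (v == null) {
            return null;
        }
        Map.Entry<Integer, Long> valid = v.floorEntry(rowtime);
        if (valid == null) {
            return null;
        }
        return amount * valid.getValue();
    }

    // Rows arriving after the watermark only need the version valid at the
    // watermark and newer ones, so strictly older versions are discarded.
    void onWatermark(int watermark) {
        for (TreeMap<Integer, Long> v : versions.values()) {
            Integer valid = v.floorKey(watermark);
            if (valid != null) {
                v.headMap(valid, false).clear();
            }
        }
    }

    int versionCount(String currency) {
        TreeMap<Integer, Long> v = versions.get(currency);
        return v == null ? 0 : v.size();
    }
}
```

With the RatesHistory data from earlier, the order of 2 Euro at 10:15 (615) is converted with the 09:00 version (114) and gives 228, while an order of 3 Euro at 10:52 (652) uses the 10:45 version (116) and gives 348.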

Join with a Temporal Table

A join with a temporal table joins an arbitrary table (left input / probe side) with a temporal table (right input / build side), i.e., an external dimension table that changes over time.

Note: Users cannot use an arbitrary table as a temporal table; the table must be backed by a LookupableTableSource. A LookupableTableSource can only be used as a temporal table in a temporal join. For more details, see "How to define a LookupableTableSource" in the official documentation.

The following example shows an Orders stream that should be joined with the continuously changing currency rates table LatestRates.

LatestRates is a dimension table that is materialized with the latest rates. At times 10:15, 10:30, and 10:52, the content of LatestRates looks as follows:

10:15> SELECT * FROM LatestRates;

currency   rate
======== ======
US Dollar   102
Euro        114
Yen           1

10:30> SELECT * FROM LatestRates;

currency   rate
======== ======
US Dollar   102
Euro        114
Yen           1


10:52> SELECT * FROM LatestRates;

currency   rate
======== ======
US Dollar   102
Euro        116     <==== changed from 114 to 116
Yen           1

The content of LatestRates is the same at 10:15 and 10:30. The Euro rate changes from 114 to 116 at 10:52. Orders is an append-only table that represents payments of a given amount in a given currency. For example, at 10:15 there was an order for an amount of 2 Euro.

SELECT * FROM Orders;

amount currency
====== =========
     2 Euro             <== arrived at time 10:15
     1 US Dollar        <== arrived at time 10:30
     2 Euro             <== arrived at time 10:52

Suppose we want to calculate the amount of all orders converted to a common currency (Yen). For example, we want to convert the following orders using the latest rate in LatestRates. The result would be:

amount currency     rate  amount*rate
====== ========= ======= ============
     2 Euro          114          228    <== arrived at time 10:15
     1 US Dollar     102          102    <== arrived at time 10:30
     2 Euro          116          232    <== arrived at time 10:52

With a temporal table join, we can express this query in SQL as:

SELECT
  o.amount, o.currency, r.rate, o.amount * r.rate
FROM
  Orders AS o
  JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
  ON r.currency = o.currency

Each record from the probe side is joined with the current version of the build-side table. In our example, the query uses processing time, so newly appended orders are always joined with the most recent version of LatestRates at the moment the operation is executed. Note that the result is not deterministic for processing time.

In contrast to regular joins, previous results of the temporal table join are not affected by changes on the build side. In addition, the temporal table join operator is very lightweight and does not keep any state.

Compared with time-windowed joins, temporal table joins do not define a time window within which records are joined. At processing time, records from the probe side are always joined with the latest version of the build side. Accordingly, records on the build side may be arbitrarily old.
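
A stateless lookup of this kind can be pictured with the Java sketch below. It is only a conceptual model: LookupJoinSketch is a hypothetical class, and the Function passed to it merely stands in for a query against the external dimension table; it does not use Flink's LookupableTableSource API.

```java
import java.util.function.Function;

// Illustrative only: the join holds no rate data itself and asks the
// external table for the current rate each time an order arrives.
final class LookupJoinSketch {
    // Stand-in for a lookup against an external dimension table.
    private final Function<String, Long> lookup;

    LookupJoinSketch(Function<String, Long> lookup) {
        this.lookup = lookup;
    }

    // One lookup per probe record; returns null when the key is not found.
    Long onOrder(long amount, String currency) {
        Long rate = lookup.apply(currency);
        if (rate == null) {
            return null;
        }
        return amount * rate;
    }
}
```

Passing a lookup backed by the LatestRates contents shown above reproduces the result table: 2 Euro gives 228 while the rate is 114, and 232 after the external table has changed to 116.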

The temporal table function join and the temporal table join come from the same motivation, but differ in SQL syntax and runtime implementation:

  • The SQL syntax of the temporal table function join is a join with a UDTF, while the temporal table join uses the standard temporal table syntax introduced in SQL:2011.
  • The temporal table function join actually joins two streams and keeps them in state, while the temporal table join receives only a single input stream and looks up the external database by the key in each record.
  • The temporal table function join is usually used to join a changelog stream, while the temporal table join is usually used to join an external table (i.e., a dimension table).

This behavior makes the temporal table join a good candidate for expressing stream enrichment in relational terms.

In the future, the temporal table join will support the features of the temporal table function join, i.e., temporally joining a changelog stream.

Usage

The syntax of a temporal table join is as follows:

SELECT [column_list]
FROM table1 [AS <alias1>]
[LEFT] JOIN table2 FOR SYSTEM_TIME AS OF table1.proctime [AS <alias2>]
ON table1.column-name1 = table2.column-name1

Currently, only INNER JOIN and LEFT JOIN are supported. FOR SYSTEM_TIME AS OF table1.proctime must follow the temporal table. proctime is a processing-time attribute of table1. This means that a snapshot of the temporal table is taken at processing time for every record joined from the left table.

For example, after defining the temporal table, we can use it as follows.

SELECT
  SUM(o_amount * r_rate) AS amount
FROM
  Orders
  JOIN LatestRates FOR SYSTEM_TIME AS OF o_proctime
  ON r_currency = o_currency

Note:

  1. It is only supported in the Blink planner.
  2. It is only supported in SQL, and not yet in the Table API.
  3. Flink does not currently support event-time temporal table joins.

Origin blog.csdn.net/lp284558195/article/details/104392048