Flink SQL Regular Join, Interval Join, Temporal Join, and Lookup Join Explained

Flink supports many ways to join data, which mainly fall into the following three categories:

  • Join between a dynamic table (stream) and another dynamic table (stream)
  • Join between a dynamic table (stream) and an external dimension table (such as Redis)
  • Column-to-row expansion of a dynamic table's fields (a special kind of Join)

Breakdown of joins supported by Flink SQL:

Regular Join: stream-to-stream join, including Inner Equal Join and Outer Equal Join

Interval Join: stream-to-stream join in which the two streams are joined within a given period of time

Temporal Join: stream-to-stream temporal join, supporting event time and processing time, similar to an offline snapshot join

Lookup Join: join between a stream and an external dimension table

Array Expansion: expands an array column into rows, similar to Hive's explode

Table Function: expands columns into rows with a user-defined table function (UDTF), supporting Inner Join and Left Outer Join; minimal sketches of the last two forms are shown below
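For quick reference, here are minimal sketches of the last two forms (the table events, its ARRAY column tags, the column line, and the UDTF SplitFunction are assumptions used only for illustration):

-- Array Expansion: one output row per element of the array column `tags`
SELECT e.id, t.tag
FROM events AS e
CROSS JOIN UNNEST(e.tags) AS t (tag);

-- Table Function (UDTF): expand each row into multiple rows, Left Outer Join variant
SELECT e.id, s.word
FROM events AS e
LEFT JOIN LATERAL TABLE(SplitFunction(e.line)) AS s (word) ON TRUE;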

1.Regular Join

**Regular Join definition (supports Batch\Streaming):** Regular Join works the same way as a regular join in offline Hive SQL: it joins the data of two streams on a condition and emits the result.

**Application scenarios:** for example, joining logs with dimension data to build wide tables, or joining logs by ID to compute CTR.

Regular Join includes the following types (L denotes a record from the left stream, R a record from the right stream):

  • Inner Join (Inner Equal Join): in a streaming task, a result is emitted only when the two streams join, and the output is +[L, R]
  • Left Join (Outer Equal Join): in a streaming task, a left-stream record is emitted as soon as it arrives, whether or not it joins a right-stream record (+[L, R] if it joins, +[L, null] if it does not). If a right-stream record arrives later and finds left-stream records that were previously emitted without a match, a retraction stream is issued: first -[L, null] is emitted, then +[L, R]
  • Right Join (Outer Equal Join): same as Left Join, with the execution logic of the left and right tables reversed.
  • Full Join (Outer Equal Join): in a streaming task, a record from either stream is emitted as soon as it arrives, whether or not it joins the other stream (for the right stream: +[L, R] if it joins, +[null, R] if it does not; for the left stream: +[L, R] if it joins, +[L, null] if it does not). If a record from one stream arrives later and finds that the other stream previously emitted unmatched records, a retraction stream is issued (when a left-stream record arrives: retract -[null, R], then emit +[L, R]; when a right-stream record arrives: retract -[L, null], then emit +[L, R])

**Actual case:** join the exposure log with the click log, keep only records that have both an exposure and a click, and attach the click's extended parameters.

a) Inner Join case:
-- Exposure log data
CREATE TABLE show_log_table (
 log_id BIGINT,
 show_params STRING
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '2',
 'fields.show_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '100'
);

-- Click log data
CREATE TABLE click_log_table (
 log_id BIGINT,
 click_params STRING
)
WITH (
 'connector' = 'datagen',
 'rows-per-second' = '2',
 'fields.click_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE sink_table (
 s_id BIGINT,
 s_params STRING,
 c_id BIGINT,
 c_params STRING
) WITH (
 'connector' = 'print'
);

-- INNER JOIN of the two streams on log_id
INSERT INTO sink_table
SELECT
 show_log_table.log_id as s_id,
 show_log_table.show_params as s_params,
 click_log_table.log_id as c_id,
 click_log_table.click_params as c_params
FROM show_log_table
INNER JOIN click_log_table 
ON show_log_table.log_id = click_log_table.log_id;

The output is as follows:

+I[5, d, 5, f]
+I[5, d, 5, 8]
+I[5, d, 5, 2]
+I[3, 4, 3, 0]
+I[3, 4, 3, 3]
b) Left Join case:
CREATE TABLE show_log_table (
 log_id BIGINT,
 show_params STRING
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.show_params.length' = '3',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE click_log_table (
 log_id BIGINT,
 click_params STRING
)
WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.click_params.length' = '3',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE sink_table (
 s_id BIGINT,
 s_params STRING,
 c_id BIGINT,
 c_params STRING
) WITH (
 'connector' = 'print'
);

set sql-client.execution.result-mode=changelog;

INSERT INTO sink_table
SELECT
 show_log_table.log_id as s_id,
 show_log_table.show_params as s_params,
 click_log_table.log_id as c_id,
 click_log_table.click_params as c_params
FROM show_log_table
LEFT JOIN click_log_table 
ON show_log_table.log_id = click_log_table.log_id;

The output is as follows:

+I[5, f3c, 5, c05]
+I[5, 6e2, 5, 1f6]
+I[5, 86b, 5, 1f6]
+I[5, f3c, 5, 1f6]
-D[3, 4ab, null, null]
-D[3, 6f2, null, null]
+I[3, 4ab, 3, 765]
+I[3, 6f2, 3, 765]
+I[2, 3c4, null, null]
+I[3, 4ab, 3, a8b]
+I[3, 6f2, 3, a8b]
+I[2, c03, null, null]
...
c) Full Join case:
CREATE TABLE show_log_table (
 log_id BIGINT,
 show_params STRING
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '2',
 'fields.show_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE click_log_table (
 log_id BIGINT,
 click_params STRING
)WITH (
 'connector' = 'datagen',
 'rows-per-second' = '2',
 'fields.click_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE sink_table (
 s_id BIGINT,
 s_params STRING,
 c_id BIGINT,
 c_params STRING
) WITH (
 'connector' = 'print'
);

INSERT INTO sink_table
SELECT
 show_log_table.log_id as s_id,
 show_log_table.show_params as s_params,
 click_log_table.log_id as c_id,
 click_log_table.click_params as c_params
FROM show_log_table
FULL JOIN click_log_table 
ON show_log_table.log_id = click_log_table.log_id;

The output is as follows:

+I[null, null, 7, 6]
+I[6, 5, null, null]
-D[1, c, null, null]
+I[1, c, 1, 2]
+I[3, 1, null, null]
+I[null, null, 7, d]
+I[10, 0, null, null]
+I[null, null, 2, 6]
-D[null, null, 7, 6]
-D[null, null, 7, d]
...

Notes on Regular Join:

  • A real-time Regular Join does not have to be an equi-join. The difference between an equi-join and a non-equi join lies in the data shuffle strategy: an equi-join uses a Hash shuffle, routing records to the corresponding downstream instance keyed by the equality condition in the JOIN ON clause; a non-equi join uses a Global shuffle, sending all data to a single parallel instance, where records are matched against the non-equality condition.

    Equi-join:

[figure: hash shuffle, records routed by the join key]

    Non-equi join:

[figure: global shuffle, all records sent to a single parallel instance]

  • The join proceeds as follows: when a new record arrives on the left stream, it is joined against all right-stream records that satisfy the condition, and the results are emitted.

  • Streams are unbounded. To be able to join them, Flink stores all data of both streams in state, so the state of the task grows without bound. An appropriate state TTL must be configured to keep the state from growing too large; see the sketch below.
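A minimal sketch of configuring the state TTL in the SQL client, assuming a 1-day retention is acceptable for the business (table.exec.state.ttl applies to the whole job, so it also affects other stateful operators):

-- expire idle join state after 1 day
SET 'table.exec.state.ttl' = '1 d';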

2.Interval Join (time interval join)

**Interval Join definition (supports Batch\Streaming):** Interval Join lets one stream join the data of another stream that falls within a given time range.

**Application scenario:** Regular Join produces a retraction stream. In a real-time data warehouse the sink is usually a message queue such as Kafka, followed by engines such as ClickHouse, and these engines are not built to handle retraction streams. Interval Join can be used to avoid producing a retraction stream.

Interval Join includes the following types (L denotes a record from the left stream, R a record from the right stream):

  • Inner Interval Join: in a streaming task, a result is emitted only when the two streams join (the JOIN ON conditions are met: the two records fall within the time interval and satisfy the other equality conditions), and the output is +[L, R]
  • Left Interval Join: in a streaming task, when a left-stream record arrives and finds no match on the right stream, it waits (it is stored in state). If a right-stream record arrives later and can join it, +[L, R] is emitted. As the watermark advances (event time; processing time is also supported), left-stream records whose interval has expired are removed from state and emitted as +[L, null], while expired right-stream records are simply removed from state.
  • Right Interval Join: same execution logic as Left Interval Join, with the roles of the left and right tables reversed.
  • Full Interval Join: in a streaming task, when a record from either stream arrives and finds no match on the other stream, it waits (left-stream records are stored in the left-stream state, right-stream records in the right-stream state). If a record from the other stream arrives later and can join it, +[L, R] is emitted. As the watermark advances (event time; processing time is also supported), expired records are removed from state and emitted (an expired left-stream record is emitted as +[L, null], an expired right-stream record as +[null, R])

The difference between **Inner Interval Join and Outer Interval Join**: an Outer Interval Join also emits records that never find a match; when such records expire over time, they are still emitted. The time condition itself can be written in several forms, as shown below.
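For reference, the time condition of an Interval Join can be expressed in several ways (l_time and r_time stand for the time attributes of the left and right tables; the interval sizes are only illustrative):

-- all of the following are valid interval conditions in the JOIN ... ON clause
l_time = r_time
l_time >= r_time AND l_time < r_time + INTERVAL '10' MINUTE
l_time BETWEEN r_time - INTERVAL '10' SECOND AND r_time + INTERVAL '5' SECOND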

**Actual case:** join the exposure log with the click log, keep only records that have both an exposure and a click where the click happens within 4 hours after the exposure, and attach the click's extended parameters (the demo below uses a 4-second interval so the datagen data joins quickly).

a)Inner Interval Join
CREATE TABLE show_log_table (
 log_id BIGINT,
 show_params STRING,
 row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
 WATERMARK FOR row_time AS row_time
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.show_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE click_log_table (
 log_id BIGINT,
 click_params STRING,
 row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
 WATERMARK FOR row_time AS row_time
)
WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.click_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE sink_table (
 s_id BIGINT,
 s_params STRING,
 c_id BIGINT,
 c_params STRING
) WITH (
 'connector' = 'print'
);

INSERT INTO sink_table
SELECT
 show_log_table.log_id as s_id,
 show_log_table.show_params as s_params,
 click_log_table.log_id as c_id,
 click_log_table.click_params as c_params
FROM show_log_table 
INNER JOIN click_log_table 
ON show_log_table.log_id = click_log_table.log_id
AND show_log_table.row_time BETWEEN click_log_table.row_time - INTERVAL '4' SECOND AND click_log_table.row_time

The output is as follows:

6> +I[2, a, 2, 6]
6> +I[2, 6, 2, 6]
2> +I[4, 1, 4, 5]
2> +I[10, 8, 10, d]
2> +I[10, 7, 10, d]
2> +I[10, d, 10, d]
2> +I[5, b, 5, d]
6> +I[1, a, 1, 7]
b)Left Interval Join
CREATE TABLE show_log (
 log_id BIGINT,
 show_params STRING,
 row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
 WATERMARK FOR row_time AS row_time
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.show_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE click_log (
 log_id BIGINT,
 click_params STRING,
 row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
 WATERMARK FOR row_time AS row_time
)
WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.click_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE sink_table (
 s_id BIGINT,
 s_params STRING,
 c_id BIGINT,
 c_params STRING
) WITH (
 'connector' = 'print'
);

INSERT INTO sink_table
SELECT
 show_log.log_id as s_id,
 show_log.show_params as s_params,
 click_log.log_id as c_id,
 click_log.click_params as c_params
FROM show_log LEFT JOIN click_log ON show_log.log_id = click_log.log_id
AND show_log.row_time BETWEEN click_log.row_time - INTERVAL '5' SECOND AND click_log.row_time

The output is as follows:

+I[6, e, 6, 7]
+I[11, d, null, null]
+I[7, b, null, null]
+I[8, 0, 8, 3]
+I[13, 6, null, null]
c)Full Interval Join
CREATE TABLE show_log (
 log_id BIGINT,
 show_params STRING,
 row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
 WATERMARK FOR row_time AS row_time
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.show_params.length' = '1',
 'fields.log_id.min' = '5',
 'fields.log_id.max' = '15'
);

CREATE TABLE click_log (
 log_id BIGINT,
 click_params STRING,
 row_time AS cast(CURRENT_TIMESTAMP as timestamp(3)),
 WATERMARK FOR row_time AS row_time
)
WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.click_params.length' = '1',
 'fields.log_id.min' = '1',
 'fields.log_id.max' = '10'
);

CREATE TABLE sink_table (
 s_id BIGINT,
 s_params STRING,
 c_id BIGINT,
 c_params STRING
) WITH (
 'connector' = 'print'
);

INSERT INTO sink_table
SELECT
 show_log.log_id as s_id,
 show_log.show_params as s_params,
 click_log.log_id as c_id,
 click_log.click_params as c_params
FROM show_log FULL JOIN click_log ON show_log.log_id = click_log.log_id
AND show_log.row_time BETWEEN click_log.row_time - INTERVAL '5' SECOND AND click_log.row_time

The output is as follows:

+I[6, 1, null, null]
+I[7, 3, 7, 8]
+I[null, null, 6, 6]
+I[null, null, 4, d]
+I[8, d, null, null]
+I[null, null, 3, b]

Notes on Interval Join:

A real-time Interval Join does not have to be an equi-join. The difference between an equi-join and a non-equi join lies in the data shuffle strategy: an equi-join uses a Hash shuffle, routing records downstream keyed by the equality condition in the JOIN ON clause; a non-equi join uses a Global shuffle, sending all data to a single parallel instance, where records satisfying the condition are joined and emitted.

3.Temporal Join (snapshot Join)

**Temporal Join definition (supports Batch\Streaming):** the Flink SQL counterpart of an offline zipper (snapshot) table is called a Versioned Table, and joining a detail (fact) table against such a Versioned Table is called a Temporal Join.

In a Temporal Join, the Versioned Table maintains the historical versions of each key over time (the key is the primary key declared in the DDL). When a detail table joins this table, each detail record selects, based on its own time attribute, the snapshot of the Versioned Table that was valid at that time.

**Application scenario:** for example, exchange-rate data (computing a total amount in real time from an exchange rate): before 12:00 (event time) the RMB/USD rate is 7:1 and after 12:00 it changes to 6:1, so data before 12:00 is computed at 7:1 and data after 12:00 at 6:1.

**Versioned Table:** the data stored in a Versioned Table usually comes from CDC or other updating sources. For a Versioned Table, Flink SQL maintains all historical versions of the data under each primary key.

**Example:** two ways to define the Versioned Table used in the exchange-rate calculation.

-- Define an exchange-rate versioned table
CREATE TABLE currency_rates (
 currency STRING,
 conversion_rate DECIMAL(32, 2),
 update_time TIMESTAMP(3) METADATA FROM `values.source.timestamp` VIRTUAL,
 WATERMARK FOR update_time AS update_time,
 -- PRIMARY KEY definition
 PRIMARY KEY(currency) NOT ENFORCED
) WITH (
 'connector' = 'kafka',
 'value.format' = 'debezium-json',
 /* ... */
);

-- Define the source table as a Versioned Table via deduplication
CREATE VIEW versioned_rates AS
SELECT currency, conversion_rate, update_time -- 1. `update_time` is the time attribute
 FROM (
 SELECT *,
 ROW_NUMBER() OVER (PARTITION BY currency -- 2. `currency` is the primary key
 ORDER BY update_time DESC -- 3. ORDER BY must be on the timestamp column
 ) AS rownum 
 FROM currency_rates)
WHERE rownum = 1;

**Time semantics supported by Temporal Join:** event time and processing time

**Actual case:** exchange-rate calculation based on event time

-- 1. Define an input orders table
CREATE TABLE orders (
 order_id BIGINT,
 price BIGINT,
 currency STRING,
 order_time TIMESTAMP(3),
 WATERMARK FOR order_time AS order_time
) WITH (
  'connector' = 'filesystem', 
  'path' = 'file:///Users/hhx/Desktop/orders.csv',
  'format' = 'csv'
);

1,100,a,2023-11-01 10:10:10.100
2,200,a,2023-11-02 10:10:10.100
3,300,a,2023-11-03 10:10:10.100
4,300,a,2023-11-04 10:10:10.100
5,300,a,2023-11-05 10:10:10.100
6,300,a,2023-11-06 10:10:10.100

-- 2. Define an exchange-rate versioned table (see the Versioned Table definition above)
CREATE TABLE currency_rates (
 currency STRING,
 conversion_rate BIGINT,
 update_time TIMESTAMP(3),
 WATERMARK FOR update_time AS update_time,
 PRIMARY KEY(currency) NOT ENFORCED
) WITH (
 'connector' = 'filesystem', 
  'path' = 'file:///Users/hhx/Desktop/currency_rates.csv',
  'format' = 'csv'
);

a,10,2023-11-01 09:10:10.100
a,11,2023-11-01 10:00:10.100
a,12,2023-11-01 10:10:10.100
a,13,2023-11-01 10:20:10.100
a,14,2023-11-02 10:20:10.100
a,15,2023-11-03 10:20:10.100
a,16,2023-11-04 10:20:10.100
a,17,2023-11-05 10:20:10.100
a,18,2023-11-06 10:00:10.100
a,19,2023-11-06 10:11:10.100

SELECT
 order_id,
 price,
 orders.currency,
 conversion_rate,
 order_time,
 update_time
FROM orders
-- 3. Temporal Join logic
-- SQL syntax: FOR SYSTEM_TIME AS OF
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF orders.order_time
ON orders.currency = currency_rates.currency;

As the result shows, the same currency joins a different exchange rate depending on each record's event time: each order joins the most recent rate that was valid at its order time.

[result figure omitted]

Notice:

For an event-time Temporal Join, a watermark must be set on both the left and right tables.

An event-time Temporal Join must include the primary key of the Versioned Table in the JOIN ON condition.

**Actual case:** exchange-rate calculation based on processing time

10:15> SELECT * FROM LatestRates;

currency rate
======== ======
US Dollar 102
Euro 114
Yen 1

10:30> SELECT * FROM LatestRates;

currency rate
======== ======
US Dollar 102
Euro 114
Yen 1

-- At 10:42, the Euro rate changes from 114 to 116
10:52> SELECT * FROM LatestRates;

currency rate
======== ======
US Dollar 102
Euro 116 
Yen 1

-- Query the Orders table
SELECT * FROM Orders;

amount currency
====== =========
 2 Euro <== record arriving at processing time 10:15
 1 US Dollar <== record arriving at processing time 10:30
 2 Euro <== record arriving at processing time 10:52
 
-- Run the join query
SELECT
 o.amount,
 o.currency,
 r.rate, 
 o.amount * r.rate
FROM
 Orders AS o
 JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
 ON r.currency = o.currency
 
-- Results:
amount currency rate amount*rate
====== ========= ======= ============
 2 Euro 114 228 <== record arriving at processing time 10:15
 1 US Dollar 102 102 <== record arriving at processing time 10:30
 2 Euro 116 232 <== record arriving at processing time 10:52

Under processing-time semantics, the exchange rate used is determined by the arrival time of the left-stream record: Flink only keeps the latest state of LatestRates and does not need to maintain historical versions.

Notice:

Processing-time temporal join is not supported yet.
4.Lookup Join

**Lookup Join definition (supports Batch\Streaming):** a Lookup Join is a dimension-table join; in real-time data warehouse scenarios, dimension data is fetched from external storage in real time.

**Application scenarios:** Regular Join, Interval Join, and the other joins above are all joins between streams, whereas a Lookup Join is a join between a stream and external storage such as Redis, MySQL, or HBase; "lookup" means the dimension data is queried in real time.

**Actual case:** use the exposure user log stream (show_log) to join the user profile dimension table (user_profile) on user_id, so that downstream jobs can compute the number of exposed users by gender and age group.

Input data: exposure user log stream (show_log) data (stored in Kafka):

log_id timestamp user_id
1 2021-11-01 00:01:03 a
2 2021-11-01 00:03:00 b
3 2021-11-01 00:05:00 c
4 2021-11-01 00:06:00 b
5 2021-11-01 00:07:00 c

User profile dimension table (user_profile) data (stored in Redis):

user_id (primary key) age sex
a 12-18 男
b 18-24 女
c 18-24 男

**Note:** the data in Redis is stored as key/value pairs, where the key is the user_id and the value is a JSON document containing age and sex, as illustrated below.
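A sketch of what the Redis data might look like (the exact key naming is an assumption for illustration):

key: a    value: {"age": "12-18", "sex": "男"}
key: b    value: {"age": "18-24", "sex": "女"}
key: c    value: {"age": "18-24", "sex": "男"}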

CREATE TABLE show_log (
 log_id BIGINT,
 `timestamp` TIMESTAMP(3),
 user_id STRING,
 proctime AS PROCTIME()
) WITH (
  'connector' = 'filesystem', 
  'path' = 'file:///Users/hhx/Desktop/show_log.csv',
  'format' = 'csv'
);

1 2021-11-01 00:01:03 a
2 2021-11-01 00:03:00 b
3 2021-11-01 00:05:00 c
4 2021-11-01 00:06:00 b
5 2021-11-01 00:07:00 c

CREATE TABLE user_profile (
 user_id STRING,
 age STRING,
 sex STRING,
 proctime AS PROCTIME(),
 PRIMARY KEY(user_id) NOT ENFORCED
) WITH (
 'connector' = 'filesystem', 
  'path' = 'file:///Users/hhx/Desktop/user_profile.csv',
  'format' = 'csv'
);

a 12-18 男
b 18-24 女
c 18-24 男

CREATE TABLE sink_table (
 log_id BIGINT,
 `timestamp` TIMESTAMP(3),
 user_id STRING,
 proctime TIMESTAMP(3),
 age STRING,
 sex STRING
) WITH (
 'connector' = 'print'
);

-- Lookup Join query logic
INSERT INTO sink_table
SELECT
 s.log_id as log_id
 , s.`timestamp` as `timestamp`
 , s.user_id as user_id
 , s.proctime as proctime
 , u.sex as sex
 , u.age as age
FROM show_log AS s
LEFT JOIN user_profile FOR SYSTEM_TIME AS OF s.proctime AS u
ON s.user_id = u.user_id

The output data is as follows:

log_id timestamp user_id age sex
1 2021-11-01 00:01:03 a 12-18 男
2 2021-11-01 00:03:00 b 18-24 女
3 2021-11-01 00:05:00 c 18-24 男
4 2021-11-01 00:06:00 b 18-24 女
5 2021-11-01 00:07:00 c 18-24 男

Real-time lookup dimension-table joins can use processing time for the association.

Notice:

a) The dimension data joined to the same record may differ over time.

The dimension tables used in real-time data warehouses change continuously. Once a stream record has been joined with dimension data, the join result is not updated retroactively if the dimension data for the same key later changes.

For example, suppose the age of user_id 1 in the dimension table changed from 12-18 to 18-24 at 08:00. If the task fails over at 08:01 and replays data from 07:59, records that should have been joined with age 12-18 will now be joined with age 18-24, which may affect data quality.

b) For dimension tables that are created and updated in real time, set up data-delay monitoring to avoid the case where stream records arrive before the corresponding dimension data and therefore fail to join.

c) Common performance issues and optimization ideas for dimension tables

Dimension-table performance issue: under high QPS, accessing the dimension table's storage engine causes task backpressure and delayed output.

For example:

**Without a dimension table:** if a record takes 0.1 ms from entering the Flink task to being emitted, a task with parallelism 1 can reach a throughput of 1 query / 0.1 ms = 10,000 QPS.

**With a dimension table:** if each record spends 2 ms accessing the dimension table's external storage, the latency from input to output becomes 2.1 ms, and the throughput of a task with parallelism 1 drops to 1 query / 2.1 ms ≈ 476 QPS. That is a difference of about 21x, which causes backpressure at the dimension-table join operator and delays the task's output.

Commonly used optimization solutions-DataStream:

  • **Bucketing by the Redis dimension-table key + local cache:** by keying the data on the dimension-table key, most records can be served from a local cache populated by earlier lookups, turning a 2.1 ms external-storage access into roughly a 0.1 ms in-memory access per query.
  • **Asynchronous access to external storage:** the DataStream API provides async operators that use a thread pool to issue multiple dimension-table requests concurrently, turning "2.1 ms per 1 query" into roughly "2.1 ms per 10 queries", improving throughput to about 10 queries / 2.1 ms ≈ 4,761 QPS.
  • **Batched access to external storage:** besides asynchronous access, requests can be batched. For example, when querying a Redis dimension table, 1 query takes 2.1 ms, of which about 2 ms is network overhead and only 0.1 ms is the Redis server processing the request. With Redis pipelining, a batch of keys can be buffered on the client (that is, inside the Flink lookup-join operator) and sent to the Redis server together, turning "2.1 ms per 1 query" into about 7 ms (2 ms + 50 * 0.1 ms) per 50 queries, for a throughput of roughly 50 queries / 7 ms ≈ 7,143 QPS.

**Actual measurement:** of the optimizations above, the combination of 1 + 3 works best; option 3 outperforms option 2 because option 2 still sends requests one by one.

Commonly used optimization solutions-Flink SQL:

**Bucketing by the Redis dimension-table key + local cache:** bucketing in SQL would require a GROUP BY first, and with a GROUP BY aggregation Redis could only be accessed inside a UDAF, which can emit only a single result and is complicated to implement. So instead of keyBy bucketing, a local cache is used directly. Although caching directly is less effective than bucketing by key first and then caching, it still reduces the pressure on Redis.

**Asynchronous access to external storage:** the officially provided HBase connector supports asynchronous lookup; see the 'lookup.async' option.

https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/hbase/ 
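A minimal sketch of an HBase dimension table with asynchronous lookup and a local lookup cache enabled (the connector version, column-family layout, ZooKeeper address, and cache sizes are illustrative; the option names follow the HBase connector documentation linked above):

CREATE TABLE user_profile_hbase (
 rowkey STRING,
 cf ROW<age STRING, sex STRING>,
 PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
 'connector' = 'hbase-2.2',
 'table-name' = 'user_profile',
 'zookeeper.quorum' = 'localhost:2181',
 -- asynchronous lookup
 'lookup.async' = 'true',
 -- local cache: at most 10000 rows, each entry kept for 10 minutes
 'lookup.cache.max-rows' = '10000',
 'lookup.cache.ttl' = '10min'
);

The stream side then joins it like any other lookup table, e.g. LEFT JOIN user_profile_hbase FOR SYSTEM_TIME AS OF s.proctime AS u ON s.user_id = u.rowkey.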

**Batched access to external storage:** for a Redis-based batched-lookup optimization, refer to the article below.

https://mp.weixin.qq.com/s/ku11tCZp7CAFzpkqd4J1cQ
5.Regular Join, Interval Join, Temporal Join, and Lookup Join Summary
a) Flink SQL joins can be divided into
  • Stream-to-stream joins: Regular Join, Interval Join, Temporal Join
  • Stream-to-external-storage joins: Lookup Join
b) The difference between Inner Join and Outer Join

Inner Join: emits output only when the two streams join; no retraction stream is involved.

Outer Join: emits null when the join finds no match. A Regular Outer Join involves a retraction stream, while an Interval Outer Join does not.

c) Differences between Regular Join, Interval Join and Temporal Join

Regular Join: unless a state TTL is configured, all data of both streams is kept in state for joining; a retraction stream is involved.

Interval Join: joins one stream against the other stream's data within a specified time interval; no retraction stream is involved.

Temporal Join: based on the time attribute of one stream, selects the historical version of the other stream that was valid at that time to join; no retraction stream is involved.


Source: https://blog.csdn.net/m0_50186249/article/details/134249277