43. Flink's Hive reading and writing and detailed verification example

Flink series of articles

1. Links to a series of comprehensive articles covering Flink deployment, concept introduction, source/transformation/sink usage examples, and an introduction to the four cornerstones with examples

13. Basic concepts of Flink's table api and sql, introduction to the general api, and getting-started examples
14. Data types of Flink's table api and sql: built-in data types and their attributes
15. Flink's table api and sql: detailed introduction to dynamic tables, time attribute configuration (how to handle update results), temporal tables, joins on streams, determinism on streams, and query configuration
16. Flink's table api and sql connections to external systems: connectors and formats for reading and writing external systems, with FileSystem examples (1)
16. Flink's table api and sql connections to external systems: connectors and formats for reading and writing external systems, with Elasticsearch examples (2)
16. Flink's table api and sql connections to external systems: connectors and formats for reading and writing external systems, with Apache Kafka examples (3)
16. Flink's table api and sql connections to external systems: connectors and formats for reading and writing external systems, with JDBC examples (4)
16. Flink's table api and sql connections to external systems: connectors and formats for reading and writing external systems, with Apache Hive examples (6)

20. SQL Client of Flink SQL: try Flink SQL without writing code, and submit SQL tasks directly to the cluster

22. Flink's table api and sql: create table DDL
24. Flink's table api and sql: catalogs

30. Flink SQL's SQL client (introducing the use of configuration files for tables, views, etc. through kafka and filesystem examples)
41. Flink's Hive dialect: introduction and detailed examples
42. Flink's table api and sql: Hive Catalog
43. Flink's Hive reading and writing, with detailed verification examples



This article describes in detail the integration of Flink and Hive, and reading and writing Hive data through Flink SQL.
This article depends on a complete environment, including hadoop, hive, kafka, and flink.
It is divided into four parts: reading Hive data, applying temporal tables, writing data to Hive, and file formats.
The versions used in these examples are Hive 3.1.2, Flink 1.13.6, Hadoop 3.1.4, and Kafka 2.12-3.0.0.

1. Introduction to reading and writing in Hive

Apache Flink can perform unified BATCH and STREAM processing of Apache Hive tables through HiveCatalog. This means that Flink can be used as a high-performance alternative to Hive's batch processing engine, or to continuously write data to and read from Hive tables to support real-time data warehouse applications.

1. Reading Hive data

Flink supports reading data from Hive in BATCH and STREAMING modes. When running as a BATCH application, Flink performs queries on the state of the table at the point in time when the query is executed. Streaming reads continuously monitor the table and incrementally fetch new data as it becomes available. Flink will read bounded tables by default.

Streaming reads support the use of both partitioned and non-partitioned tables. For partitioned tables, Flink monitors the generation of new partitions and incrementally reads them when available. For non-partitioned tables, Flink monitors the generation of new files in the folder and reads new files incrementally.
The streaming read options of the Hive connector are:

  • streaming-source.enable (default false): whether to enable streaming reads.
  • streaming-source.partition.include (default all): which partitions to read, 'all' or 'latest'; 'latest' is only valid for temporal joins.
  • streaming-source.monitor-interval (no default): the interval at which new partitions/files are discovered.
  • streaming-source.partition-order (default partition-name): the partition order, one of 'partition-name', 'create-time', or 'partition-time'.
  • streaming-source.consume-start-offset (no default): the start offset for streaming consumption.
SQL Hints can be used to apply configurations to Hive tables without changing their definition in the Hive metastore.
Examples are as follows:

SELECT * 
FROM hive_table 
/*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2023-08-20') */;

Notice:

  • The monitoring strategy scans all directories/files under the current location path. A large number of partitions may cause performance degradation.
  • Streaming reads of non-partitioned tables require that each file be written atomically to the target directory.
  • Streaming reads of partitioned tables require that each partition be added atomically in the view of the Hive metastore; otherwise, new data added to an existing partition will be consumed.
  • Streaming reads do not support the watermark syntax in Flink DDL, so these tables cannot be used with window operators.

1), Reading Hive views

Flink can read data from views defined in Hive, but there are some limitations:

  • The current catalog must be set to the HiveCatalog. There are two ways to do this: via the Table API, tenv.useCatalog("alan_hivecatalog"), or via the SQL client, USE CATALOG alan_hivecatalog.
  • Hive and Flink SQL have different syntax, such as different reserved keywords and literals. Make sure the view's query is compatible with Flink's syntax.
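A minimal sketch of these two steps in the SQL client (the catalog name alan_hivecatalog comes from this series; the view name alan_user_view is hypothetical):

```sql
-- Switch the current catalog to the registered HiveCatalog
USE CATALOG alan_hivecatalog;

-- Query a view defined in Hive; its underlying query must also be valid Flink SQL
SELECT * FROM alan_user_view;
```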

2), Vectorized Optimization upon Read

Flink will automatically use vectorized reads of Hive tables when the following conditions are met:

  • The data format is ORC or Parquet.
  • The fields contain no complex data types, such as List, Map, Struct, or Union.

This feature is enabled by default. It can be disabled using the following configuration.

table.exec.hive.fallback-mapred-reader=true

3), Source Parallelism Inference

By default, Flink infers optimal parallelism for its Hive readers based on the number of files and the number of blocks in each file.

Flink allows you to flexibly configure parallel inference strategies. You can configure the following parameters in TableConfig (note that these parameters affect all sources of the job):
The parallelism inference options are:

  • table.exec.hive.infer-source-parallelism (default true): if true, source parallelism is inferred from the number of files and blocks; if false, the configured parallelism is used.
  • table.exec.hive.infer-source-parallelism.max (default 1000): the upper bound for the inferred parallelism.
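These options can be set in the SQL client (or on TableConfig in the Table API); a sketch, assuming the Hive connector's documented option names table.exec.hive.infer-source-parallelism and table.exec.hive.infer-source-parallelism.max:

```sql
-- Turn off automatic inference and fall back to the configured source parallelism
SET table.exec.hive.infer-source-parallelism=false;
-- Or keep inference but cap the inferred parallelism
SET table.exec.hive.infer-source-parallelism.max=100;
```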

4), Adjusting the data split size when reading Hive tables

When reading a Hive table, the data files are divided into several splits, each of which covers part of the data to be read. Splits are Flink's basic granularity for task assignment and parallel data reading. Users can tune read performance by adjusting the split size with the following parameters.
  • table.exec.hive.split-max-size (default 128mb): the maximum size of a split; smaller values produce more splits.
  • table.exec.hive.file-open-cost (default 4mb): the estimated cost of opening a file; larger values produce fewer splits.

To adjust the split size, Flink first calculates the sizes of all files in all partitions, which can be time-consuming when the number of partitions is large. You can set the job parameter table.exec.hive.calculate-partition-size.thread-num (default 3) to a larger value to use more threads and speed this up.
Currently, the above parameters are only applicable to Hive tables in ORC format.
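A sketch of tuning these split parameters in the SQL client (the option names are the Hive connector's documented table.exec.hive.split-max-size, table.exec.hive.file-open-cost, and table.exec.hive.calculate-partition-size.thread-num; the values here are illustrative):

```sql
-- Allow bigger splits, producing fewer but larger read tasks
SET table.exec.hive.split-max-size=256mb;
-- Raise the assumed cost of opening a file, which also biases toward fewer splits
SET table.exec.hive.file-open-cost=8mb;
-- Use more threads when computing partition sizes over many partitions
SET table.exec.hive.calculate-partition-size.thread-num=8;
```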

5), Loading partition splits

Flink uses multiple threads to concurrently divide Hive partitions into splits for reading. You can configure the number of threads with table.exec.hive.load-partition-splits.thread-num; the default is 3, and the configured value must be greater than 0.

6), Reading partitions with subdirectories

In some cases, you might create an external table that references another table, but whose partition columns are a subset of the other table's partition fields. For example, you create a partitioned table fact_tz with partition fields day and hour:

CREATE TABLE fact_tz(x int) PARTITIONED BY (day STRING, hour STRING);

Then you create an external table fact_daily based on fact_tz, using a coarser-grained partition field ds:

CREATE EXTERNAL TABLE fact_daily(x int) PARTITIONED BY (ds STRING) LOCATION '/user/hive/warehouse/test.db/fact_tz';

Then you add a partition to fact_daily that points to the corresponding location:

ALTER TABLE fact_daily ADD PARTITION (ds='2023-08-11') location '/user/hive/warehouse/test.db/fact_tz/ds=2023-08-11';

When reading the external table fact_daily, there are subdirectories (hour=1 to hour=24) under the table's partition directory.

By default, partitions with subdirectories can be added to external tables. Flink SQL recursively scans all subdirectories and reads the data in all of them.

You can set the job property table.exec.hive.read-partition-with-subdirectory.enabled (default true) to false to prevent Flink from reading subdirectories. If it is set to false and a partition directory contains subdirectories instead of data files, Flink will throw a java.io.IOException: Not a file: /path/to/data/* exception.
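For example, to disable recursive subdirectory scanning for the current session (the property name is the one documented for the Hive connector):

```sql
-- When false, a partition directory is expected to contain files only;
-- nested subdirectories will trigger "java.io.IOException: Not a file"
SET table.exec.hive.read-partition-with-subdirectory.enabled=false;
```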

2. Temporal Table Join

You can use a Hive table as a temporal table, and a data stream can then be joined to the Hive table with a temporal join. Please refer to temporal join for more information about temporal joins.

Flink supports processing-time temporal joins to Hive tables, and processing-time temporal joins are always associated with the latest version of the temporal table. Flink supports temporal join for partitioned and non-partitioned tables of Hive. For partitioned tables, Flink supports automatic tracking of the latest partitions of Hive tables.

Note: Flink does not yet support event-time temporal joins to Hive tables.

1), Temporal Join latest partition

For a partitioned table that changes over time, we can read it as an unbounded stream. If each partition contains the complete data of one version, the partition can serve as one version of the temporal table, with each version holding the data of its partition.

Flink supports automatically tracking the latest partition (version) when using processing time temporal join, and defines the latest partition (version) through streaming-source.partition-order. The most common case used by users is to use Hive tables as dimension tables in Flink streaming jobs.

Note: This feature only supports Flink streaming mode.

The following case demonstrates a classic business pipeline: a Hive table serves as a dimension table that is updated once a day by a batch task or a Flink job (changed to hourly updates here for easier verification). A Kafka stream carries real-time online business data or logs, and this stream needs to be joined with the dimension table to enrich it.

1. Code examples

-- Assume the data in the Hive table is updated daily; each day holds the latest and complete dimension data
-- Note: only one of the three TBLPROPERTIES option groups below can be used at a time; options 2 and 3 are shown commented out
SET table.sql-dialect=hive;
CREATE TABLE alan_dim_user_table (
  u_id STRING,
  u_name STRING,
  balance DECIMAL(10, 4),
  age INT
) PARTITIONED BY (pt_year STRING, pt_month STRING, pt_day STRING) TBLPROPERTIES (
  -- Option 1 (recommended): use the default partition-name order and load the latest partition every 12 hours
  'streaming-source.enable' = 'true',
  'streaming-source.partition.include' = 'latest',
  'streaming-source.monitor-interval' = '12 h',
  'streaming-source.partition-order' = 'partition-name'  -- default value; may be omitted

  -- Option 2: use the partition files' create-time and load the latest partition every 12 hours
  -- 'streaming-source.enable' = 'true',
  -- 'streaming-source.partition.include' = 'latest',
  -- 'streaming-source.partition-order' = 'create-time',
  -- 'streaming-source.monitor-interval' = '12 h'

  -- Option 3: use partition-time and load the latest partition every 12 hours
  -- 'streaming-source.enable' = 'true',
  -- 'streaming-source.partition.include' = 'latest',
  -- 'streaming-source.monitor-interval' = '12 h',
  -- 'streaming-source.partition-order' = 'partition-time',
  -- 'partition.time-extractor.kind' = 'default',
  -- 'partition.time-extractor.timestamp-pattern' = '$pt_year-$pt_month-$pt_day 00:00:00'
);

SET table.sql-dialect=hive;
CREATE TABLE alan_dim_user_table (
  u_id BIGINT,
  u_name STRING,
  balance DECIMAL(10, 4),
  age INT
) PARTITIONED BY (t_year STRING, t_month STRING, t_day STRING) 
  row format delimited 
  fields terminated by "," 
  TBLPROPERTIES (
  -- Use the default partition-name order, loading the latest partition every hour (recommended)
  'streaming-source.enable' = 'true',
  'streaming-source.partition.include' = 'latest',
  'streaming-source.monitor-interval' = '1 h',
  'streaming-source.partition-order' = 'partition-name' -- default value; may be omitted
);

-- streaming sql: kafka temporal join against the Hive dimension table. Flink automatically loads the latest partition's data at the 'streaming-source.monitor-interval' interval.
SELECT * FROM orders_table AS o 
JOIN alan_dim_user_table FOR SYSTEM_TIME AS OF o.proctime AS u
ON o.u_id = u.u_id;

2. Flink verification steps


-------------------------flink, kafka and hive example----------------------------------
----This example was verified on flink 1.13.6----------------------------------
---------1. Create the Flink dimension table, refreshing its data every hour----------------------------------
Flink SQL> SET table.sql-dialect=hive;
[INFO] Session property has been set.

Flink SQL> show tables;
+--------------+
|   table name |
+--------------+
| alan_student |
|  student_ext |
|          tbl |
|  test_change |
|    user_dept |
+--------------+
5 rows in set
Flink SQL> CREATE TABLE alan_dim_user_table (
>   u_id BIGINT,
>   u_name STRING,
>   balance DECIMAL(10, 4),
>   age INT
> ) PARTITIONED BY (t_year STRING, t_month STRING, t_day STRING) 
>   row format delimited 
>   fields terminated by "," 
>   TBLPROPERTIES (
>   -- Use the default partition-name order, loading the latest partition every hour (recommended)
>   'streaming-source.enable' = 'true',
>   'streaming-source.partition.include' = 'latest',
>   'streaming-source.monitor-interval' = '1 h',
>   'streaming-source.partition-order' = 'partition-name' -- default value; may be omitted
> );
[INFO] Execute statement succeed.

-----------2. Load data manually in Hive; the first load adds only one row----------------------------------
0: jdbc:hive2://server4:10000> show tables;
+----------------------+
|       tab_name       |
+----------------------+
| alan_dim_user_table  |
| alan_student         |
| student_ext          |
| tbl                  |
| test_change          |
| user_dept            |
+----------------------+
6 rows selected (0.05 seconds)

0: jdbc:hive2://server4:10000> load data  inpath '/flinktest/hivetest' into table alan_dim_user_table partition(t_year='2023',t_month='09',t_day='04');
0: jdbc:hive2://server4:10000> select * from alan_dim_user_table;
+---------------------------+-----------------------------+------------------------------+--------------------------+-----------------------------+------------------------------+----------------------------+
| alan_dim_user_table.u_id  | alan_dim_user_table.u_name  | alan_dim_user_table.balance  | alan_dim_user_table.age  | alan_dim_user_table.t_year  | alan_dim_user_table.t_month  | alan_dim_user_table.t_day  |
+---------------------------+-----------------------------+------------------------------+--------------------------+-----------------------------+------------------------------+----------------------------+
| 1                         | alan                        | 12.2300                      | 18                       | 2023                        | 09                           | 04                         |
+---------------------------+-----------------------------+------------------------------+--------------------------+-----------------------------+------------------------------+----------------------------+

-----3. Create the fact table in Flink----------------------------------
Flink SQL> SET table.sql-dialect=default;
[INFO] Session property has been set.

Flink SQL> CREATE TABLE alan_fact_order_table (
>     o_id STRING,
>     o_amount DOUBLE,
>     u_id BIGINT, -- user id
>     item_id BIGINT, -- item id
>     action STRING,  -- user action
>     ts     BIGINT,  -- timestamp of the user action
>     proctime as PROCTIME(),   -- processing-time column generated via a computed column
>     `event_time` TIMESTAMP(3) METADATA FROM 'timestamp', -- event time
>     WATERMARK FOR event_time as event_time - INTERVAL '5' SECOND  -- define a watermark on event_time
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'test_hive_topic',
>   'properties.bootstrap.servers' = '192.168.10.41:9092,192.168.10.42:9092,192.168.10.43:9092',
>   'properties.group.id' = 'testhivegroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'csv'
> );
[INFO] Execute statement succeed.

---------4. Create the Kafka topic and send messages (messages are sent after the Flink streaming query is started)----------------------------------
[alanchan@server2 bin]$ kafka-topics.sh --delete --topic test_hive_topic --bootstrap-server server1:9092
[alanchan@server2 bin]$ kafka-topics.sh --create --bootstrap-server server1:9092 --topic test_hive_topic --partitions 1 --replication-factor 1
Created topic test_hive_topic.
[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic test_hive_topic
>1,123.34,1,8001,'b',1693874219248
----------5. Flink streaming query (check whether the dimension table data is loaded)----------------------------------
Flink SQL> SELECT
>   o.o_id,
>   o.u_id,
>   o.action,
>   o.ts,
>   o.event_time,
>   u.u_name,
>   u.t_year,
>   u.t_month,
>   u.t_day 
> FROM alan_fact_order_table AS o 
> JOIN alan_dim_user_table FOR SYSTEM_TIME AS OF o.proctime AS u ON o.u_id = u.u_id;

+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
| op |                           o_id |                 u_id |                         action |                   ts |              event_time |                         u_name |                         t_year |                        t_month |                          t_day |
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
| +I |                              1 |                    1 |                            'b' |        1693874219248 | 2023-09-05 00:51:28.407 |                           alan |                           2023 |                             09 |                             04 |
-------6. Load more dimension data in Hive (verify that the dimension table refreshes every hour)----------------------------------
0: jdbc:hive2://server4:10000> load data  inpath '/flinktest/hivetest2' into table alan_dim_user_table partition(t_year='2023',t_month='09',t_day='05');
No rows affected (0.194 seconds)
0: jdbc:hive2://server4:10000> select * from alan_dim_user_table;
+---------------------------+-----------------------------+------------------------------+--------------------------+-----------------------------+------------------------------+----------------------------+
| alan_dim_user_table.u_id  | alan_dim_user_table.u_name  | alan_dim_user_table.balance  | alan_dim_user_table.age  | alan_dim_user_table.t_year  | alan_dim_user_table.t_month  | alan_dim_user_table.t_day  |
+---------------------------+-----------------------------+------------------------------+--------------------------+-----------------------------+------------------------------+----------------------------+
| 1                         | alan                        | 12.2300                      | 18                       | 2023                        | 09                           | 04                         |
| 2                         | alanchan                    | 22.2300                      | 10                       | 2023                        | 09                           | 05                         |
| 3                         | alanchanchn                 | 32.2300                      | 28                       | 2023                        | 09                           | 05                         |
| 4                         | alan_chan                   | 12.4300                      | 29                       | 2023                        | 09                           | 05                         |
| 5                         | alan_chan_chn               | 52.2300                      | 38                       | 2023                        | 09                           | 05                         |
+---------------------------+-----------------------------+------------------------------+--------------------------+-----------------------------+------------------------------+----------------------------+
5 rows selected (0.143 seconds)
--------------7. Keep sending messages in Kafka and watch how the Flink streaming query results change----------------------------------
[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic test_hive_topic
>1,123.34,1,8001,'b',1693874219248-----------this row was already sent above; it is kept to show the continuity of the data
>20,321.34,3,9001,'a',1693874222274
>30,41.34,5,7001,'c',1693874223285    
>50,666.66,2,3001,'d',1693875816640

--------------8. Flink streaming query results after the Kafka messages were sent----------------------------------
Flink SQL> SELECT
>   o.o_id,
>   o.u_id,
>   o.action,
>   o.ts,
>   o.event_time,
>   u.u_name,
>   u.t_year,
>   u.t_month,
>   u.t_day 
> FROM alan_fact_order_table AS o 
> JOIN alan_dim_user_table FOR SYSTEM_TIME AS OF o.proctime AS u ON o.u_id = u.u_id;
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
| op |                           o_id |                 u_id |                         action |                   ts |              event_time |                         u_name |                         t_year |                        t_month |                          t_day |
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
| +I |                             20 |                    3 |                            'a' |        1693874222274 | 2023-09-05 00:54:49.526 |                    alanchanchn |                           2023 |                             09 |                             05 |
| +I |                             30 |                    5 |                            'c' |        1693874223285 | 2023-09-05 00:55:55.461 |                  alan_chan_chn |                           2023 |                             09 |                             05 |
| +I |                             50 |                    2 |                            'd' |        1693875816640 | 2023-09-05 01:07:23.891 |                       alanchan |                           2023 |                             09 |                             05 |

--------------9. With the Hive dimension data unchanged, send Kafka messages again and observe the Flink streaming query results----------------------------------
[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic test_hive_topic
>1,123.34,1,8001,'b',1693874219248
>20,321.34,3,9001,'a',1693874222274
>30,41.34,5,7001,'c',1693874223285    
>50,666.66,2,3001,'d',1693875816640
>60,666.66,4,3001,'e',1693880868579
>
--------------10. With the Hive dimension data unchanged, after sending Kafka messages again, observe the Flink streaming query results (same query session as before)---------------
Flink SQL> SELECT
>   o.o_id,
>   o.u_id,
>   o.action,
>   o.ts,
>   o.event_time,
>   u.u_name,
>   u.t_year,
>   u.t_month,
>   u.t_day 
> FROM alan_fact_order_table AS o 
> JOIN alan_dim_user_table FOR SYSTEM_TIME AS OF o.proctime AS u ON o.u_id = u.u_id;
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
| op |                           o_id |                 u_id |                         action |                   ts |              event_time |                         u_name |                         t_year |                        t_month |                          t_day |
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
| +I |                             20 |                    3 |                            'a' |        1693874222274 | 2023-09-05 00:54:49.526 |                    alanchanchn |                           2023 |                             09 |                             05 |
| +I |                             30 |                    5 |                            'c' |        1693874223285 | 2023-09-05 00:55:55.461 |                  alan_chan_chn |                           2023 |                             09 |                             05 |
| +I |                             50 |                    2 |                            'd' |        1693875816640 | 2023-09-05 01:07:23.891 |                       alanchan |                           2023 |                             09 |                             05 |
| +I |                             60 |                    4 |                            'e' |        1693880868579 | 2023-09-05 02:30:58.368 |                      alan_chan |                           2023 |                             09 |                             05 |

---The data changes were picked up promptly-------------------

2), Temporal join the latest table

For a Hive table, we can read it as a bounded stream. In this case, the Hive table can only track its latest version at the time the query is executed; the latest version of the table retains all the data of the Hive table.

When the latest Hive table is used in a temporal join, the Hive table is cached in slot memory, and each record in the data stream is matched against the cached table by the join key. Using the latest Hive table as a temporal table requires no additional configuration. Optionally, you can configure the TTL of the Hive table cache with the following option; when the cache expires, the Hive table is rescanned and reloaded with the latest data.
  • lookup.join.cache.ttl (default 60 min): the cache TTL of the Hive table used in a lookup join; when it expires, the table is reloaded.
The following case demonstrates loading all data of the Hive table as a temporal table.

1. Code examples

-- Assume the data in the Hive table is overwritten by a batch pipeline.
SET table.sql-dialect=hive;
CREATE TABLE alan_dim_user_table2 (
  u_id BIGINT,
  u_name STRING,
  balance DECIMAL(10, 4),
  age INT
)
  row format delimited 
  fields terminated by "," 
  TBLPROPERTIES (
  'streaming-source.enable' = 'false',           -- default value; may be omitted
  'streaming-source.partition.include' = 'all',  -- default value; may be omitted
  'lookup.join.cache.ttl' = '1 h'
);

SET table.sql-dialect=default;
CREATE TABLE alan_fact_order_table2 (
    o_id STRING,
    o_amount DOUBLE,
    u_id BIGINT, -- user id
    item_id BIGINT, -- item id
    action STRING,  -- user action
    ts     BIGINT,  -- timestamp of the user action
    proctime as PROCTIME()   -- processing-time column generated via a computed column
) WITH (
  'connector' = 'kafka',
  'topic' = 'test_hive2_topic',
  'properties.bootstrap.servers' = '192.168.10.41:9092,192.168.10.42:9092,192.168.10.43:9092',
  'properties.group.id' = 'testhivegroup',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'csv'
);

-- streaming sql: kafka join against the Hive dimension table. When the cache expires, Flink reloads all data of the dimension table.
SELECT
  o.o_id,
  o.u_id,
  o.action,
  o.ts,
  o.proctime,
  dim.u_name,
  dim.age,
  dim.balance 
FROM alan_fact_order_table2 AS o 
JOIN alan_dim_user_table2 FOR SYSTEM_TIME AS OF o.proctime AS dim
ON o.u_id = dim.u_id;

2. Flink verification steps

----This example was verified on flink 1.13.6----------------------------------
----The ttl is set to 1 hour here for easier verification----------------------------------
----1. Create the dimension table in Flink----------------------------------
Flink SQL> show tables;
+-----------------------+
|            table name |
+-----------------------+
|   alan_dim_user_table |
| alan_fact_order_table |
|          alan_student |
|           student_ext |
|                   tbl |
|           test_change |
|             user_dept |
+-----------------------+
7 rows in set

Flink SQL> SET table.sql-dialect=hive;
[INFO] Session property has been set.

Flink SQL> CREATE TABLE alan_dim_user_table2 (
>   u_id BIGINT,
>   u_name STRING,
>   balance DECIMAL(10, 4),
>   age INT
> )
>   row format delimited 
>   fields terminated by "," 
>   TBLPROPERTIES (
>   'streaming-source.enable' = 'false',           -- default value; may be omitted
>   'streaming-source.partition.include' = 'all',  -- default value; may be omitted
>   'lookup.join.cache.ttl' = '1 h'
> );
[INFO] Execute statement succeed.

----2. Insert data into the dimension table in Hive----------------------------------
0: jdbc:hive2://server4:10000> load data  inpath '/flinktest/hivetest' into table alan_dim_user_table2;
No rows affected (0.139 seconds)
0: jdbc:hive2://server4:10000> select * from alan_dim_user_table2;
+----------------------------+------------------------------+-------------------------------+---------------------------+
| alan_dim_user_table2.u_id  | alan_dim_user_table2.u_name  | alan_dim_user_table2.balance  | alan_dim_user_table2.age  |
+----------------------------+------------------------------+-------------------------------+---------------------------+
| 1                          | alan                         | 12.2300                       | 18                        |
| 2                          | alanchan                     | 22.2300                       | 10                        |
| 3                          | alanchanchn                  | 32.2300                       | 28                        |
+----------------------------+------------------------------+-------------------------------+---------------------------+
3 rows selected (0.124 seconds)

----3. Create the fact table in Flink----------------------------------

Flink SQL> SET table.sql-dialect=default;
Hive Session ID = 4d502166-65b7-4079-af12-35919101ed8d
[INFO] Session property has been set.

Flink SQL> CREATE TABLE alan_fact_order_table2 (
>     o_id STRING,
>     o_amount DOUBLE,
>     u_id BIGINT, -- user id
>     item_id BIGINT, -- item id
>     action STRING,  -- user action
>     ts     BIGINT,  -- timestamp of the user action
>     proctime as PROCTIME()   -- processing-time column generated via a computed column
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'test_hive2_topic',
>   'properties.bootstrap.servers' = '192.168.10.41:9092,192.168.10.42:9092,192.168.10.43:9092',
>   'properties.group.id' = 'testhivegroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'csv'
> );
[INFO] Execute statement succeed.

----4. Create the Kafka topic and send data----------------------------------
[alanchan@server2 bin]$ kafka-topics.sh --create --bootstrap-server server1:9092 --topic test_hive2_topic --partitions 1 --replication-factor 1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic test_hive2_topic.
[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic test_hive2_topic
>1,123.34,1,8001,'b',1693887925763
>30,41.34,5,7001,'c',1693874222274
>30,41.34,5,7001,'c',1693887926780
>20,321.34,3,9001,'a',1693887928801
>50,666.66,2,3001,'d',1693887927790

----5. Query in Flink and observe the results----------------------------------
Flink SQL> SELECT
>   o.o_id,
>   o.u_id,
>   o.action,
>   o.ts,
>   o.proctime,
>   dim.u_name,
>   dim.age,
>   dim.balance 
> FROM alan_fact_order_table2 AS o 
> JOIN alan_dim_user_table2 FOR SYSTEM_TIME AS OF o.proctime AS dim
> ON o.u_id = dim.u_id;
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+-------------+--------------+
| op |                           o_id |                 u_id |                         action |                   ts |                proctime |                         u_name |         age |      balance |
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+-------------+--------------+
| +I |                              1 |                    1 |                            'b' |        1693887925763 | 2023-09-05 04:24:47.825 |                           alan |          18 |      12.2300 |
| +I |                             20 |                    3 |                            'a' |        1693887928801 | 2023-09-05 04:26:06.437 |                    alanchanchn |          28 |      32.2300 |
| +I |                             50 |                    2 |                            'd' |        1693887927790 | 2023-09-05 04:26:46.404 |                       alanchan |          10 |      22.2300 |

----6. Load new data in Hive, send new messages in Kafka, and observe Flink's query results----------------------------------
0: jdbc:hive2://server4:10000> load data  inpath '/flinktest/hivetest' into table alan_dim_user_table2;
No rows affected (0.129 seconds)
0: jdbc:hive2://server4:10000> select * from alan_dim_user_table2;
+----------------------------+------------------------------+-------------------------------+---------------------------+
| alan_dim_user_table2.u_id  | alan_dim_user_table2.u_name  | alan_dim_user_table2.balance  | alan_dim_user_table2.age  |
+----------------------------+------------------------------+-------------------------------+---------------------------+
| 1                          | alan                         | 12.2300                       | 18                        |
| 2                          | alanchan                     | 22.2300                       | 10                        |
| 3                          | alanchanchn                  | 32.2300                       | 28                        |
| 4                          | alan_chan                    | 12.4300                       | 29                        |
| 5                          | alan_chan_chn                | 52.2300                       | 38                        |
+----------------------------+------------------------------+-------------------------------+---------------------------+

[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic test_hive2_topic
>1,123.34,1,8001,'b',1693887925763
>30,41.34,5,7001,'c',1693874222274
>30,41.34,5,7001,'c',1693887926780
>20,321.34,3,9001,'a',1693887928801
>50,666.66,2,3001,'d',1693887927790
>30,41.34,5,7001,'c',1693887926780   -----this record does not appear in Flink's query result

Flink SQL> SELECT
>   o.o_id,
>   o.u_id,
>   o.action,
>   o.ts,
>   o.proctime,
>   dim.u_name,
>   dim.age,
>   dim.balance 
> FROM alan_fact_order_table2 AS o 
> JOIN alan_dim_user_table2 FOR SYSTEM_TIME AS OF o.proctime AS dim
> ON o.u_id = dim.u_id;
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+-------------+--------------+
| op |                           o_id |                 u_id |                         action |                   ts |                proctime |                         u_name |         age |      balance |
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+-------------+--------------+
| +I |                              1 |                    1 |                            'b' |        1693887925763 | 2023-09-05 04:24:47.825 |                           alan |          18 |      12.2300 |
| +I |                             20 |                    3 |                            'a' |        1693887928801 | 2023-09-05 04:26:06.437 |                    alanchanchn |          28 |      32.2300 |
| +I |                             50 |                    2 |                            'd' |        1693887927790 | 2023-09-05 04:26:46.404 |                       alanchan |          10 |      22.2300 |

----7. After the TTL expires, send new messages to Kafka and observe Flink's query result----------------------------------
[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic test_hive2_topic
..., the two records below were sent after the TTL expired; as expected, they appear in the query result
>30,41.34,5,7001,'c',1693893016308
>1,123.34,1,8001,'b',1693893020334

Flink SQL> SELECT
>   o.o_id,
>   o.u_id,
>   o.action,
>   o.ts,
>   o.proctime,
>   dim.u_name,
>   dim.age,
>   dim.balance 
> FROM alan_fact_order_table2 AS o 
> JOIN alan_dim_user_table2 FOR SYSTEM_TIME AS OF o.proctime AS dim
> ON o.u_id = dim.u_id;
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+-------------+--------------+
| op |                           o_id |                 u_id |                         action |                   ts |                proctime |                         u_name |         age |      balance |
+----+--------------------------------+----------------------+--------------------------------+----------------------+-------------------------+--------------------------------+-------------+--------------+
| +I |                              1 |                    1 |                            'b' |        1693887925763 | 2023-09-05 04:24:47.825 |                           alan |          18 |      12.2300 |
| +I |                             20 |                    3 |                            'a' |        1693887928801 | 2023-09-05 04:26:06.437 |                    alanchanchn |          28 |      32.2300 |
| +I |                             50 |                    2 |                            'd' |        1693887927790 | 2023-09-05 04:26:46.404 |                       alanchan |          10 |      22.2300 |
| +I |                             30 |                    5 |                            'c' |        1693893016308 | 2023-09-05 05:49:47.984 |                  alan_chan_chn |          38 |      52.2300 |
| +I |                              1 |                    1 |                            'b' |        1693893020334 | 2023-09-05 05:50:23.696 |                           alan |          18 |      12.2300 |

------The verification above is now complete

Each subtask participating in the join needs to keep the Hive table in its cache, so make sure the Hive table fits into the memory of a TM task slot.
It is recommended to set relatively large values for streaming-source.monitor-interval (when the latest partition is used as the temporal table) and lookup.join.cache.ttl (when all partitions are used as the temporal table); otherwise, tasks will reload the table too frequently, which easily causes performance problems.
Currently (as of Flink 1.17), the entire Hive table is reloaded whenever the cache is refreshed, so there is no way to distinguish new data from old data.
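As a sketch of where these two options are configured, the DDL below shows both temporal-table variants. The table names are illustrative; the option keys follow the Flink Hive connector documentation.

```sql
SET table.sql-dialect=hive;

-- Variant 1: use the latest partition as the temporal table; check for new
-- partitions at most once per hour to avoid frequent full reloads.
CREATE TABLE alan_dim_latest_partition (
  u_id BIGINT,
  u_name STRING
) PARTITIONED BY (dt STRING) TBLPROPERTIES (
  'streaming-source.enable' = 'true',
  'streaming-source.partition.include' = 'latest',
  'streaming-source.monitor-interval' = '1 h',
  'streaming-source.partition-order' = 'partition-name'
);

-- Variant 2: use all partitions as the temporal table; each lookup subtask
-- caches the whole table and reloads it at most once per hour.
CREATE TABLE alan_dim_all_partitions (
  u_id BIGINT,
  u_name STRING
) TBLPROPERTIES (
  'lookup.join.cache.ttl' = '1 h'
);
```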

3. Writing Hive data

Flink supports writing data into Hive in both batch and streaming modes. In batch mode, the data Flink writes into a Hive table becomes visible only after the job completes. Batch writes support both appending to and overwriting existing tables.

1), code example 1

# ------ INSERT INTO appends to the table or partition, preserving existing data ------ 
Flink SQL> INSERT INTO mytable SELECT 'Tom', 25;

# ------ INSERT OVERWRITE replaces all existing data in the table or partition ------ 
Flink SQL> INSERT OVERWRITE mytable SELECT 'Tom', 25;

2), flink verification steps

-------------Example run in a Flink 1.13.6 environment---------
Flink SQL> CREATE TABLE alan_w_user_table (
>   u_id BIGINT,
>   u_name STRING,
>   balance DECIMAL(10, 4),
>   age INT
> )
>   row format delimited 
>   fields terminated by "," 
>  ;
Hive Session ID = 30451c4a-5ca9-470c-9274-9ecf5330c76d
[INFO] Execute statement succeed.

Flink SQL> show tables;
Hive Session ID = 8c5f20ac-989e-423c-b936-d8274ceff5b1
+------------------------+
|             table name |
+------------------------+
|    alan_dim_user_table |
|   alan_dim_user_table2 |
|  alan_fact_order_table |
| alan_fact_order_table2 |
|           alan_student |
|      alan_w_user_table |
|            student_ext |
|                    tbl |
|            test_change |
|              user_dept |
+------------------------+
10 rows in set

Flink SQL> INSERT INTO alan_w_user_table values (1,'alan',12.4,18);
Job ID: ea03b7c37aca92197c608da292cbb8f3

Flink SQL> select * from alan_w_user_table;
+----+----------------------+--------------------------------+--------------+-------------+
| op |                 u_id |                         u_name |      balance |         age |
+----+----------------------+--------------------------------+--------------+-------------+
| +I |                    1 |                           alan |      12.4000 |          18 |
+----+----------------------+--------------------------------+--------------+-------------+
Received a total of 1 row
-----INSERT OVERWRITE is not supported in Flink streaming mode; switch to batch mode
Flink SQL> INSERT OVERWRITE  alan_w_user_table values (1,'alanchan',22.4,19);
Hive Session ID = 58ec8fbd-aa1b-40c1-ab09-6da083e6327e
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalStateException: Streaming mode not support overwrite.
-----The default is streaming mode; set it to batch mode
Flink SQL> SET execution.runtime-mode = batch;
Hive Session ID = 3eb977f9-1036-42e3-8b0f-22c2357706fc
[INFO] Session property has been set.
------In Flink batch mode, checkpointing cannot be enabled; it must be turned off for batch jobs to run
Flink SQL> INSERT OVERWRITE  alan_w_user_table values (1,'alanchan',22.4,19);
Hive Session ID = 5b2db357-5c12-44a0-8159-f6f18ba5fbea
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalArgumentException: Checkpoint is not supported for batch jobs.
----------This only demonstrates the difference between INSERT INTO and INSERT OVERWRITE, which is the same as in Hive; see the Hive column for details
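As a sketch of how the overwrite would complete once checkpointing is disabled (assuming checkpointing has been turned off in flink-conf.yaml; the result is not from the transcript above):

```sql
-- Hedged sketch: with checkpointing disabled in flink-conf.yaml,
-- the overwrite runs in batch mode and replaces the existing row.
SET execution.runtime-mode = batch;

INSERT OVERWRITE alan_w_user_table VALUES (1, 'alanchan', 22.4, 19);

-- A subsequent SELECT * FROM alan_w_user_table would then return
-- only the overwritten row.
```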

It is also possible to insert data into specific partitions.

3), code example 2

# ------ Insert into a static partition ------ 
Flink SQL> INSERT OVERWRITE myparttable PARTITION (my_type='type_1', my_date='2019-08-08') SELECT 'Tom', 25;

# ------ Insert into a dynamic partition ------ 
Flink SQL> INSERT OVERWRITE myparttable SELECT 'Tom', 25, 'type_1', '2019-08-08';

# ------ Insert into a static (my_type) and a dynamic (my_date) partition ------ 
Flink SQL> INSERT OVERWRITE myparttable PARTITION (my_type='type_1') SELECT 'Tom', 25, '2019-08-08';

4), flink verification steps

------------------------Example run in a Flink 1.13.6 environment---------------------------------------------
----------Static partition: insert data----------
Flink SQL> SET table.sql-dialect=hive;
[INFO] Session property has been set.

Flink SQL> CREATE TABLE alan_wp_user_table (
>   u_id BIGINT,
>   u_name STRING,
>   balance DECIMAL(10, 4),
>   age INT
> ) PARTITIONED BY (dt STRING,hr STRING) 
>   row format delimited 
>   fields terminated by "," 
>   TBLPROPERTIES (
>   'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
>   'sink.partition-commit.trigger'='partition-time',
>   'sink.partition-commit.delay'='10 s',
>   'sink.partition-commit.policy.kind'='metastore,success-file'
> );
[INFO] Execute statement succeed.

Flink SQL> INSERT into alan_wp_user_table PARTITION (dt='2023-09-05', hr = '05') values (1,'alan',12.4,18);
Job ID: 8b88ccfb6e6e47a79334e79bbc946389

Flink SQL> select * from alan_wp_user_table;
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| op |                 u_id |                         u_name |      balance |         age |                             dt |                             hr |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| +I |                    1 |                           alan |      12.4000 |          18 |                     2023-09-05 |                             05 |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
Received a total of 1 row

---------Another way to insert----------
Flink SQL> INSERT into alan_wp_user_table PARTITION (dt='2023-09-05', hr = '05') SELECT 2,'alanchan', 25.8,19;
Job ID: 93dbf92c01e41c245a38fb5776eb7d59


Flink SQL>  select * from alan_wp_user_table;
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| op |                 u_id |                         u_name |      balance |         age |                             dt |                             hr |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| +I |                    2 |                       alanchan |      25.8000 |          19 |                     2023-09-05 |                             05 |
| +I |                    1 |                           alan |      12.4000 |          18 |                     2023-09-05 |                             05 |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+

------ Insert into a dynamic partition ------ 
INSERT into alan_wp_user_table SELECT 3,'alanchanchn', 35.8,29, '2023-09-05', '05';

Flink SQL> INSERT into alan_wp_user_table PARTITION (dt='2023-09-05', hr = '05') values (1,'alan',12.4,18);
------If the data is visible in Hive but not in Flink SQL, run SET table.sql-dialect=hive; in the Flink SQL CLI and query again

Flink SQL> select * from alan_wp_user_table;

+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| op |                 u_id |                         u_name |      balance |         age |                             dt |                             hr |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| +I |                    1 |                           alan |      12.4000 |          18 |                     2023-09-05 |                             05 |
| +I |                    2 |                       alanchan |      25.8000 |          19 |                     2023-09-05 |                             05 |
| +I |                    3 |                    alanchanchn |      35.8000 |          29 |                     2023-09-05 |                             05 |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+

------ Insert into a static (my_type) and a dynamic (my_date) partition ------ 
------This kind of insert requires batch mode, and batch mode does not support checkpointing, so this case was not verified further
Flink SQL> SET execution.runtime-mode = batch;
[INFO] Session property has been set.

Flink SQL> INSERT OVERWRITE alan_wp_user_table PARTITION (dt='2023-09-05') SELECT 4,'alan_chanchn', 45.8,39, '06';
Hive Session ID = 26829c28-8581-4bf4-b4f7-bea17042e6de
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalArgumentException: Checkpoint is not supported for batch jobs.

Streaming writes continuously add new data to Hive and commit records to make them visible. Users can control when commits are triggered through several table properties.

Streaming writes do not support INSERT OVERWRITE.

5), code example 3

The following example demonstrates how to stream data from Kafka into a Hive table with partition commit, and then run a batch query to read the data back.

---Create the Hive table
SET table.sql-dialect=hive;
CREATE TABLE alan_hive_user_table (
  u_id BIGINT,
  u_name STRING,
  balance DECIMAL(10, 4),
  age INT
) PARTITIONED BY (dt STRING,hr STRING) 
  row format delimited 
  fields terminated by "," 
  TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='10 s',
  'sink.partition-commit.policy.kind'='metastore,success-file',
  'sink.rolling-policy.rollover-interval'='5s',
  'sink.partition-commit.watermark-time-zone'='Asia/Shanghai' -- assume the user-configured time zone is 'Asia/Shanghai'
);

---Create the Kafka table
SET table.sql-dialect=default;
CREATE TABLE alan_kafka_table (
  u_id BIGINT,
  u_name STRING,
  balance DECIMAL(10, 4),
  age INT,
  `event_time` TIMESTAMP(3) METADATA FROM 'timestamp', -- event time
  WATERMARK FOR event_time as event_time - INTERVAL '5' SECOND  -- define the watermark on event_time
) WITH (
  'connector' = 'kafka',
  'topic' = 'alan_kafka_hive_topic',
  'properties.bootstrap.servers' = '192.168.10.41:9092,192.168.10.42:9092,192.168.10.43:9092',
  'properties.group.id' = 'testGroup',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'csv'
);

-- streaming SQL: insert into the Hive table
INSERT INTO alan_hive_user_table 
SELECT u_id, u_name,balance,age, DATE_FORMAT(`event_time`, 'yyyy-MM-dd'), DATE_FORMAT(`event_time`, 'HH')
FROM alan_kafka_table;

-- batch SQL: query by partition
SELECT * FROM alan_hive_user_table WHERE dt='2023-09-05' and hr='07';

6), flink verification steps

-----Set the runtime mode
Flink SQL> SET execution.runtime-mode = streaming;
[INFO] Session property has been set.

----Set the Hive dialect
Flink SQL> SET table.sql-dialect=hive;
Hive Session ID = b64d5e77-1f0e-4480-a680-0f7ebf7e34c4
[INFO] Session property has been set.
-----Create the Hive table
Flink SQL> CREATE TABLE alan_hive_user_table (
>   u_id BIGINT,
>   u_name STRING,
>   balance DECIMAL(10, 4),
>   age INT
> ) PARTITIONED BY (dt STRING,hr STRING) 
>   row format delimited 
>   fields terminated by "," 
>   TBLPROPERTIES (
>   'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
>   'sink.partition-commit.trigger'='partition-time',
>   'sink.partition-commit.delay'='10 s',
>   'sink.partition-commit.policy.kind'='metastore,success-file',
>   'sink.rolling-policy.rollover-interval'='5s',
>   'sink.partition-commit.watermark-time-zone'='Asia/Shanghai' -- assume the user-configured time zone is 'Asia/Shanghai'
> );
[INFO] Execute statement succeed.
----Switch back to the default Flink dialect
Flink SQL> SET table.sql-dialect=default;
[INFO] Session property has been set.
------Create the Kafka table
Flink SQL> CREATE TABLE alan_kafka_table (
>   u_id BIGINT,
>   u_name STRING,
>   balance DECIMAL(10, 4),
>   age INT,
>   `event_time` TIMESTAMP(3) METADATA FROM 'timestamp', -- event time
>   WATERMARK FOR event_time as event_time - INTERVAL '5' SECOND  -- define the watermark on event_time
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'alan_kafka_hive_topic',
>   'properties.bootstrap.servers' = '192.168.10.41:9092,192.168.10.42:9092,192.168.10.43:9092',
>   'properties.group.id' = 'testGroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'csv'
> );
------Streaming SQL: continuously insert into partitions; this runs as a Flink job
Flink SQL> INSERT INTO alan_hive_user_table 
> SELECT u_id, u_name,balance,age, DATE_FORMAT(`event_time`, 'yyyy-MM-dd'), DATE_FORMAT(`event_time`, 'HH')
> FROM alan_kafka_table;

Job ID: 95fceba5540315957ed7d0b873461e43
-----Send data from Kafka
[alanchan@server2 bin]$ kafka-console-producer.sh --broker-list server1:9092 --topic alan_kafka_hive_topic
>1,'alan',123.34,18
>2,'alanchan',223.34,28
>

---Query in Flink SQL; query once after each Kafka send
Flink SQL> select * from alan_hive_user_table where dt='2023-09-05' and hr='07';
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| op |                 u_id |                         u_name |      balance |         age |                             dt |                             hr |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| +I |                    1 |                         'alan' |     123.3400 |          18 |                     2023-09-05 |                             07 |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
Received a total of 1 row

Flink SQL> select * from alan_hive_user_table where dt='2023-09-05' and hr='07';


+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| op |                 u_id |                         u_name |      balance |         age |                             dt |                             hr |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
| +I |                    1 |                         'alan' |     123.3400 |          18 |                     2023-09-05 |                             07 |
| +I |                    2 |                     'alanchan' |     223.3400 |          28 |                     2023-09-05 |                             07 |
+----+----------------------+--------------------------------+--------------+-------------+--------------------------------+--------------------------------+
Received a total of 2 rows


If the watermark is defined on a TIMESTAMP_LTZ column and partition-time commit is used, sink.partition-commit.watermark-time-zone must be set to the session time zone; otherwise, partitions may be committed several hours later than expected.
For the following example, refer to 16. Flink's table api and sql connection to external systems: connectors and formats for reading and writing external systems and FileSystem examples (1). Only the connector differs; the settings are the same, so they are not repeated here.

SET table.sql-dialect=hive;
CREATE TABLE hive_table (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='1 h',
  'sink.partition-commit.watermark-time-zone'='Asia/Shanghai', -- assume the user-configured time zone is 'Asia/Shanghai'
  'sink.partition-commit.policy.kind'='metastore,success-file'
);

SET table.sql-dialect=default;
CREATE TABLE kafka_table (
  user_id STRING,
  order_amount DOUBLE,
  ts BIGINT, -- time in epoch milliseconds
  ts_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
  WATERMARK FOR ts_ltz AS ts_ltz - INTERVAL '5' SECOND -- declare the watermark on the TIMESTAMP_LTZ column
) WITH (...);

-- streaming sql, insert into hive table
INSERT INTO TABLE hive_table 
SELECT user_id, order_amount, DATE_FORMAT(ts_ltz, 'yyyy-MM-dd'), DATE_FORMAT(ts_ltz, 'HH')
FROM kafka_table;

-- batch sql, select with partition pruning
SELECT * FROM hive_table WHERE dt='2020-05-20' and hr='12';

By default, for streaming writes Flink only supports renaming committers, which means exactly-once streaming writes are not supported for the S3 file system. Exactly-once writes to S3 can be achieved by setting a configuration parameter to false; this makes the sink use Flink's native writers, but only works for the Parquet and ORC file formats. The option is set in TableConfig and affects all sinks of the job.
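According to the Flink documentation, the parameter referred to here is table.exec.hive.fallback-mapred-writer (default true). A minimal sketch:

```sql
-- Hedged sketch: switch to Flink native writers (Parquet/ORC only)
-- instead of Hadoop mapred writers, enabling exactly-once streaming
-- writes to S3. Option key per the Flink documentation; default 'true'.
SET 'table.exec.hive.fallback-mapred-writer' = 'false';
```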

7), dynamic partition writing

Unlike static partitioning, which always requires the user to specify values for the partition columns, dynamic partitioning allows the user to omit those values when writing data. For example, given the following partitioned table:

CREATE TABLE alan_wp_user_table (
  u_id BIGINT,
  u_name STRING,
  balance DECIMAL(10, 4),
  age INT
) PARTITIONED BY (dt STRING,hr STRING) 
  row format delimited 
  fields terminated by "," 
  TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='10 s',
  'sink.partition-commit.policy.kind'='metastore,success-file'
);

The user can write data into this partitioned table with the following SQL statement:

INSERT into alan_wp_user_table SELECT 3,'alanchanchn', 35.8,29, '2023-09-05', '05';

In this statement, the user does not specify the values of the partition columns; this is a typical dynamic partition write.

By default, for a dynamic partition write, Flink first sorts the data by the dynamic partition columns before writing it to the target table. The sink node therefore receives the data ordered by partition: all data for one partition arrives before the data for the next, and data from different partitions is never interleaved. This way, the Hive sink only needs to keep one partition writer open at a time; otherwise, it would have to keep a writer open for every partition present in the incoming data, and too many open partition writers can cause an OutOfMemory exception.

To avoid the extra sorting, you can set the job configuration item table.exec.hive.sink.sort-by-dynamic-partition.enable (default true) to false. With that setting, as noted above, a single sink node that receives too many dynamic partitions may still run out of memory.

If the data skew is not severe, you can add DISTRIBUTED BY <partition_field> to the SQL statement to route the data of the same partition to the same sink node, alleviating the problem of a single sink node holding too many partition writers.

In addition, you can add SORTED BY <partition_field> to the SQL statement to achieve the same effect as enabling table.exec.hive.sink.sort-by-dynamic-partition.enable.

The configuration item table.exec.hive.sink.sort-by-dynamic-partition.enable only takes effect in batch mode.
Currently (as of Flink 1.17), DISTRIBUTED BY and SORTED BY are only supported when using the Hive dialect in Flink batch mode.
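A sketch of both clauses, reusing the partitioned table above. The source table alan_user_src is hypothetical, and in HiveQL the clauses are written as DISTRIBUTE BY / SORT BY on the SELECT:

```sql
-- Hedged sketch: requires the Hive dialect and batch execution mode.
SET table.sql-dialect = hive;
SET execution.runtime-mode = batch;

INSERT INTO alan_wp_user_table
SELECT u_id, u_name, balance, age, dt, hr
FROM alan_user_src          -- hypothetical source table
DISTRIBUTE BY dt, hr        -- same partition -> same sink subtask
SORT BY dt, hr;             -- sort within each subtask by partition columns
```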

8), automatic statistics collection

When Flink writes to a Hive table, it automatically collects statistics on the written data by default and commits them to the Hive metastore. In some cases you may not want this, because collecting statistics can take some time. To prevent Flink from collecting statistics automatically, set the job parameter table.exec.hive.sink.statistic-auto-gather.enable (default true) to false.

If the Hive table is stored in Parquet or ORC format, Flink can collect the numFiles/totalSize/numRows/rawDataSize statistics; otherwise, only numFiles/totalSize are collected.

For tables in Parquet or ORC format, Flink reads only the file footers to quickly collect numRows/rawDataSize. With a large number of files, however, this can still be time-consuming; you can set the job parameter table.exec.hive.sink.statistic-auto-gather.thread-num (default 3) to a larger value to speed up collection.

Only batch mode supports automatic collection of statistics. Streaming mode currently does not support automatic collection of statistics.
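The two parameters above can be set in the SQL client before submitting the batch job; a minimal sketch (option keys per the Flink documentation):

```sql
-- Disable automatic statistics collection entirely:
SET 'table.exec.hive.sink.statistic-auto-gather.enable' = 'false';

-- Or keep it enabled but read Parquet/ORC footers with more threads:
SET 'table.exec.hive.sink.statistic-auto-gather.thread-num' = '8';
```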

9), file merge

When writing to Hive tables, Flink also supports automatically compacting small files to reduce their number.
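A sketch of how compaction can be enabled as table properties. The table name is illustrative, and the option keys follow the Flink filesystem/Hive sink documentation:

```sql
SET table.sql-dialect = hive;
CREATE TABLE alan_compact_user_table (
  u_id BIGINT,
  u_name STRING
) TBLPROPERTIES (
  'auto-compaction' = 'true',        -- compact small files before commit
  'compaction.file-size' = '128MB',  -- target size of compacted files
  'sink.rolling-policy.file-size' = '128MB'
);
```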

4. Format

Flink's integration with Hive has been tested with the following file formats: Text, CSV, SequenceFile, ORC, and Parquet.

This concludes the detailed introduction to integrating Flink with Hive and reading and writing Hive data through Flink SQL.


Origin blog.csdn.net/chenwewi520feng/article/details/132668626