Hive/Spark/Flink incremental queries on Hudi tables: best practices in one go


1. Hive incremental query of a Hudi table

Synchronize Hive

When we write data, we can configure the Hive sync parameters so that corresponding Hive tables are generated, and then query the Hudi table through them. Specifically, two Hive tables named after the table name are created during the write. For example, if table name = hudi_tbl, we get:

hudi_tbl — a read-optimized view of the dataset, implemented by HoodieParquetInputFormat, providing purely columnar data

hudi_tbl_rt — a real-time view of the dataset, implemented by HoodieParquetRealtimeInputFormat, providing a merged view of base and log data

The two descriptions above are taken from the official website. A clarification: the real-time _rt table only exists when a MOR table syncs its Hive metadata. When the table type is MOR and skipROSuffix=true is configured, hudi_tbl is the read-optimized view; when it is false (the default), the read-optimized view is hudi_tbl_ro. When the table type is COW, hudi_tbl is a real-time view. So be careful with how the official website describes this part.
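For reference, a minimal sketch of a Spark write that turns on Hive sync (an illustration only, not the project's actual job: the option keys are the standard hoodie.datasource.hive_sync.* configs, while df, the database name, the sync mode and the path are placeholders to adjust for your environment):

// assumes a DataFrame df with columns id, name, price, ts, dt, run from spark-shell
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "hudi_tbl").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.datasource.hive_sync.database", "default").
  option("hoodie.datasource.hive_sync.table", "hudi_tbl").
  option("hoodie.datasource.hive_sync.partition_fields", "dt").
  option("hoodie.datasource.hive_sync.skip_ro_suffix", "true").
  mode(SaveMode.Append).
  save("/tmp/hudi_tbl")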

Incremental query

Modify the configuration hive-site.xml

Add hoodie.* to the Hive SQL whitelist (the other entries below are existing configuration); you can also whitelist other prefixes as needed, such as tez.*|parquet.*|planner.*

hive.security.authorization.sqlstd.confwhitelist.append hoodie.*|mapred.*|hive.*|mapreduce.*|spark.*
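In hive-site.xml this corresponds to a property entry roughly like the following (merge it with whatever whitelist values your cluster already has):

<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hoodie.*|mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>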
Setting parameters

Take the table name hudi_tbl as an example

Connect to Hive (Beeline connect or the Hive shell)

Set the table as an incremental table

set hoodie.hudi_tbl.consume.mode=INCREMENTAL;

Set the start timestamp of the increment (exclusive); its purpose is to filter at the file level and reduce the number of map tasks

set hoodie.hudi_tbl.consume.start.timestamp=20211015182330;

Set the maximum number of commits to consume incrementally; -1 means consuming all new data up to the present

set hoodie.hudi_tbl.consume.max.commits=-1;

Modify the number of commits as needed

Query statement

select * from hudi_tbl where `_hoodie_commit_time` > "20211015182330";

Because of the small-file merging mechanism, files with a new commit timestamp may also contain old data, so a where clause is needed for secondary filtering

Note: the parameters set here only take effect within the current connect session.
In Hudi 0.9.0 the parameters only support the table name and cannot be restricted by database: after setting hudi_tbl as an incremental table, tables with that name in every database will be queried incrementally, using the most recently set mode, start time and other parameters. Later versions add a database restriction, e.g. the hudi database.
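Putting the statements above together, a typical incremental-query session in Beeline/the Hive shell looks like this (a consolidation of the examples above; the timestamps are the example values, and switching the mode back to SNAPSHOT at the end is simply how I restore normal queries, adjust as needed):

-- Hive, within one session
set hoodie.hudi_tbl.consume.mode=INCREMENTAL;
set hoodie.hudi_tbl.consume.start.timestamp=20211015182330;
set hoodie.hudi_tbl.consume.max.commits=3;
select * from hudi_tbl where `_hoodie_commit_time` > "20211015182330";
-- switch the table back to snapshot mode when finished
set hoodie.hudi_tbl.consume.mode=SNAPSHOT;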

2. Spark SQL incremental query of a Hudi table

Programming method (DF+SQL)

First, let's look at how the official documentation describes Spark SQL incremental queries

Address 1: https://hudi.apache.org/cn/docs/quick-start-guide#incremental-query
Address 2: https://hudi.apache.org/cn/docs/querying_data#incremental-query

The idea is to first read the Hudi table as a DataFrame by adding the incremental parameters to spark.read, then register the DataFrame as a temporary view, and finally query the temporary view with Spark SQL to achieve the incremental query

parameter

  • hoodie.datasource.query.type=incremental — query type; the value incremental means incremental query. The default is snapshot, so this parameter is required for incremental queries

  • hoodie.datasource.read.begin.instanttime — incremental query start time, required, e.g. 20221126170009762

  • hoodie.datasource.read.end.instanttime — incremental query end time, optional, e.g. 20221126170023240

  • hoodie.datasource.read.incr.path.glob — partition path glob for the incremental query, optional, e.g. /dt=2022-11/

The query range is (BEGIN_INSTANTTIME, END_INSTANTTIME], i.e. strictly greater than the start time (exclusive) and less than or equal to the end time (inclusive). If no end time is specified, everything newer than BEGIN_INSTANTTIME up to now is queried; if INCR_PATH_GLOB is specified, only data under the matching partition paths is queried.

code example

// Run in spark-shell / spark-sql started with the Hudi Spark bundle and the configs Hudi requires, e.g.
//   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
//   --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, END_INSTANTTIME, INCR_PATH_GLOB, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

val tableName = "test_hudi_incremental"

spark.sql(
  s"""
     |create table $tableName (
     |  id int,
     |  name string,
     |  price double,
     |  ts long,
     |  dt string
     |) using hudi
     | partitioned by (dt)
     | options (
     |  primaryKey = 'id',
     |  preCombineField = 'ts',
     |  type = 'cow'
     | )
     |""".stripMargin)

spark.sql(s"insert into $tableName values (1,'hudi',10,100,'2022-11-25')")
spark.sql(s"insert into $tableName values (2,'hudi',10,100,'2022-11-25')")
spark.sql(s"insert into $tableName values (3,'hudi',10,100,'2022-11-26')")
spark.sql(s"insert into $tableName values (4,'hudi',10,100,'2022-12-26')")
spark.sql(s"insert into $tableName values (5,'hudi',10,100,'2022-12-27')")

val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
val basePath = table.storage.properties("path")

// look at the full table and its commit timeline, then pick the incremental range from the commits
spark.table(tableName).show()
val commitsDF = spark.sql(s"select distinct(_hoodie_commit_time) as commit_time from $tableName order by commit_time desc")
commitsDF.show()
val commits = commitsDF.collect().map(_.getString(0))
val beginTime = commits(3) // start time (exclusive), e.g. 20221126170009762
val endTime = commits(1)   // end time (inclusive), e.g. 20221126170023240
println(beginTime)
println(endTime)

// incrementally query data
val incrementalDF = spark.read.format("hudi").
  option(QUERY_TYPE.key, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME.key, beginTime).
  option(END_INSTANTTIME.key, endTime).
  option(INCR_PATH_GLOB.key, "/dt=2022-11*/*").
  load(basePath)
//  table(tableName)

incrementalDF.createOrReplaceTempView(s"temp_$tableName")

spark.sql(s"select * from temp_$tableName").show()
spark.stop()

Results: first the full table contents, then the distinct commit times, the chosen beginTime and endTime, and finally the incremental query result

+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price| ts|        dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|  20221126165954300|20221126165954300...|              id:1|         dt=2022-11-25|de99b299-b9de-423...|  1|hudi| 10.0|100|2022-11-25|
|  20221126170009762|20221126170009762...|              id:2|         dt=2022-11-25|de99b299-b9de-423...|  2|hudi| 10.0|100|2022-11-25|
|  20221126170030470|20221126170030470...|              id:5|         dt=2022-12-27|75f8a760-9dc3-452...|  5|hudi| 10.0|100|2022-12-27|
|  20221126170023240|20221126170023240...|              id:4|         dt=2022-12-26|4751225d-4848-4dd...|  4|hudi| 10.0|100|2022-12-26|
|  20221126170017119|20221126170017119...|              id:3|         dt=2022-11-26|2272e513-5516-43f...|  3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+

+-----------------+
|      commit_time|
+-----------------+
|20221126170030470|
|20221126170023240|
|20221126170017119|
|20221126170009762|
|20221126165954300|
+-----------------+

20221126170009762
20221126170023240
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price| ts|        dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|  20221126170017119|20221126170017119...|              id:3|         dt=2022-11-26|2272e513-5516-43f...|  3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+

Commenting out INCR_PATH_GLOB, the result is:

+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price| ts|        dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|  20221127155346067|20221127155346067...|              id:4|         dt=2022-12-26|33e7a2ed-ea28-428...|  4|hudi| 10.0|100|2022-12-26|
|  20221127155339981|20221127155339981...|              id:3|         dt=2022-11-26|a5652ae0-942a-425...|  3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+

Then also commenting out END_INSTANTTIME, the result is:

20221127161253433
20221127161311831
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price| ts|        dt|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+
|  20221127161320347|20221127161320347...|              id:5|         dt=2022-12-27|7b389e57-ca44-4aa...|  5|hudi| 10.0|100|2022-12-27|
|  20221127161311831|20221127161311831...|              id:4|         dt=2022-12-26|2707ce02-548a-422...|  4|hudi| 10.0|100|2022-12-26|
|  20221127161304742|20221127161304742...|              id:3|         dt=2022-11-26|264bc4a9-930d-4ec...|  3|hudi| 10.0|100|2022-11-26|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+---+----------+

You can see that the start time is excluded while the end time is included

Pure SQL

In real projects, the pure SQL approach is generally used for incremental queries, which is more convenient. The parameters for the pure SQL approach are the same as those above. Next, let's look at how to do it with pure SQL.

Create a table and generate data
create table hudi.test_hudi_incremental (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
 partitioned by (dt)
 options (
  primaryKey = 'id',
  preCombineField = 'ts',
  type = 'cow'
);

insert into hudi.test_hudi_incremental values (1,'a1', 10, 1000, '2022-11-25');
insert into hudi.test_hudi_incremental values (2,'a2', 20, 2000, '2022-11-25');
insert into hudi.test_hudi_incremental values (3,'a3', 30, 3000, '2022-11-26');
insert into hudi.test_hudi_incremental values (4,'a4', 40, 4000, '2022-12-26');
insert into hudi.test_hudi_incremental values (5,'a5', 50, 5000, '2022-12-27');

Check which commit_time values exist

select distinct(_hoodie_commit_time) from test_hudi_incremental order by _hoodie_commit_time
+----------------------+
| _hoodie_commit_time  |
+----------------------+
| 20221130163618650    |
| 20221130163703640    |
| 20221130163720795    |
| 20221130163726780    |
| 20221130163823274    |
+----------------------+
Pure SQL method (1)

Use Call Procedures: copy_to_temp_view and copy_to_table. Both procedures have already been merged into master and were contributed by scxwhite (Su Chengxiang). They are similar; copy_to_temp_view is recommended, because copy_to_table first writes the data to disk while copy_to_temp_view only creates a temporary view, which is more efficient. Besides, writing the data to disk is pointless here, since the on-disk table gets deleted afterwards anyway.

Supported parameters

  • table

  • query_type

  • view_name

  • begin_instance_time

  • end_instance_time

  • as_of_instant

  • replace

  • global

Test SQL

call copy_to_temp_view(table => 'test_hudi_incremental', query_type => 'incremental', 
view_name => 'temp_incremental', begin_instance_time=> '20221130163703640', end_instance_time => '20221130163726780');

select _hoodie_commit_time, id, name, price, ts, dt from temp_incremental;

result

+----------------------+-----+-------+--------+-------+-------------+
| _hoodie_commit_time  | id  | name  | price  |  ts   |     dt      |
+----------------------+-----+-------+--------+-------+-------------+
| 20221130163726780    | 4   | a4    | 40.0   | 4000  | 2022-12-26  |
| 20221130163720795    | 3   | a3    | 30.0   | 3000  | 2022-11-26  |
+----------------------+-----+-------+--------+-------+-------------+

As you can see, this approach does achieve incremental queries. Note, however, that if you need to change the start time of the incremental query you have to run copy_to_temp_view again, and because the temporary view temp_incremental already exists, you must either use a new view name or drop the old one first and recreate it. I suggest dropping it first, with the following command:

drop view if exists temp_incremental;
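Alternatively, since replace is listed among the supported parameters, it should be possible to overwrite the existing view in place, something like the following (hedged: I have not verified the exact semantics of replace):

call copy_to_temp_view(table => 'test_hudi_incremental', query_type => 'incremental',
view_name => 'temp_incremental', begin_instance_time => '20221130163720795', replace => true);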
Pure SQL method (2)

PR address: https://github.com/apache/hudi/pull/7182

This PR was also contributed by scxwhite. It currently only supports Spark 3.2 and above, and has not yet been merged by the community.

Incremental query SQL

select id, name, price, ts, dt from tableName
[
'hoodie.datasource.query.type'=>'incremental',
'hoodie.datasource.read.begin.instanttime'=>'$instant1',
'hoodie.datasource.read.end.instanttime'=>'$instant2'
]

This approach adds a new syntax: parameters are appended inside [ ] after the query SQL. If you are interested, you can pull the code and try it yourself.
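For example, with the test table and commit times used earlier, the query would presumably look like this (the syntax comes from the unmerged PR and may still change):

select id, name, price, ts, dt from test_hudi_incremental
[
'hoodie.datasource.query.type'=>'incremental',
'hoodie.datasource.read.begin.instanttime'=>'20221130163703640',
'hoodie.datasource.read.end.instanttime'=>'20221130163726780'
]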

Pure SQL method (3)

The final effect is as follows

select
  /*+
    hoodie_prop(
      'default.h1',
      map('hoodie.datasource.read.begin.instanttime', '20221127083503537', 'hoodie.datasource.read.end.instanttime', '20221127083506081')
    ),
    hoodie_prop(
      'default.h2',
      map('hoodie.datasource.read.begin.instanttime', '20221127083508715', 'hoodie.datasource.read.end.instanttime', '20221127083511803')
    )
  */
  id, name, price, ts
from (
  select id, name, price, ts
  from default.h1
  union all
  select id, name, price, ts
  from default.h2
)

In other words, the incremental query parameters are put in a hint: first the table name, then the parameters. However, the article does not seem to give the full code address; you can try it yourself if you have time.

Pure SQL method (4)

This method is my own modification of the source code, following the way Hive does incremental queries on Hudi: the incremental query is driven by set statements

PR address: https://github.com/apache/hudi/pull/7339

We already know that in Hudi the optParams parameter of DefaultSource.createRelation comes from readDataSourceTable as options = table.storage.properties ++ pathOption, i.e. the configuration in the table's own properties plus the path; createRelation receives no other parameters after that, so incremental queries cannot be made simply by setting parameters

Same as Hive incremental query, specify the incremental query parameters of the specific table name

set hoodie.test_hudi_incremental.datasource.query.type=incremental;
set hoodie.test_hudi_incremental.datasource.read.begin.instanttime=20221130163703640;
select _hoodie_commit_time, id, name, price, ts, dt from test_hudi_incremental;
+----------------------+-----+-------+--------+-------+-------------+
| _hoodie_commit_time  | id  | name  | price  |  ts   |     dt      |
+----------------------+-----+-------+--------+-------+-------------+
| 20221130163823274    | 5   | a5    | 50.0   | 5000  | 2022-12-27  |
| 20221130163726780    | 4   | a4    | 40.0   | 4000  | 2022-12-26  |
| 20221130163720795    | 3   | a3    | 30.0   | 3000  | 2022-11-26  |
+----------------------+-----+-------+--------+-------+-------------+

If different databases have tables with the same name, you can use the database.table form

-- You must first enable the config that qualifies table names with the database name; once enabled, the settings above without the database name no longer take effect
set hoodie.query.use.database = true;
set hoodie.hudi.test_hudi_incremental.datasource.query.type=incremental;
set hoodie.hudi.test_hudi_incremental.datasource.read.begin.instanttime=20221130163703640;
set hoodie.hudi.test_hudi_incremental.datasource.read.end.instanttime=20221130163726780;
set hoodie.hudi.test_hudi_incremental.datasource.read.incr.path.glob=/dt=2022-11*/*;
refresh table test_hudi_incremental;
select _hoodie_commit_time, id, name, price, ts, dt from test_hudi_incremental;
+----------------------+-----+-------+--------+-------+-------------+
| _hoodie_commit_time  | id  | name  | price  |  ts   |     dt      |
+----------------------+-----+-------+--------+-------+-------------+
| 20221130163720795    | 3   | a3    | 30.0   | 3000  | 2022-11-26  |
+----------------------+-----+-------+--------+-------+-------------+

You can also try it yourself for the case of joining tables across different databases

One thing to note: after changing the parameters you need to run refresh table before querying, otherwise the modified parameters will not take effect, because the cached parameters would be used

This method only makes a small change to the source code so that parameters set via set take effect for the query
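Conceptually the change boils down to something like the following (a sketch of the idea only, not the actual code in the PR): session configs of the form hoodie.<table>.datasource.* are merged into the options passed on to createRelation, with the table-name prefix stripped, so the set values override the table properties.

// Sketch only (not the actual PR code): merge session configs named
// hoodie.<tableName>.datasource.* into the relation options, stripping the table prefix,
// so that `set` statements can drive the incremental query.
def withSessionOverrides(
    tableName: String,
    baseOptions: Map[String, String],
    sessionConf: Map[String, String]): Map[String, String] = {
  val prefix = s"hoodie.$tableName."
  val overrides = sessionConf.collect {
    // e.g. hoodie.test_hudi_incremental.datasource.query.type -> hoodie.datasource.query.type
    case (k, v) if k.startsWith(prefix) => ("hoodie." + k.stripPrefix(prefix)) -> v
  }
  baseOptions ++ overrides // later entries win, so session settings override table properties
}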

To save readers the trouble of building the bundle themselves, here is a download address for hudi-spark3.1-bundle_2.12-0.13.0-SNAPSHOT.jar: https://download.csdn.net/download/dkl12/87221476

3. Flink SQL incremental query of a Hudi table

Official website document

Address: https://hudi.apache.org/cn/docs/querying_data#incremental-query

parameter

  • read.start-commit — incremental query start time. For streaming reads, if this value is not specified, the latest instantTime is used by default, i.e. streaming reads start from the latest instantTime (inclusive). For batch reads, if this parameter is not specified and only read.end-commit is specified, you get time-travel behavior and can query historical records

  • read.end-commit — incremental query end time. If not specified, the latest records are read by default. This parameter generally only makes sense for batch reads, because streaming reads usually need to consume all incremental data

  • read.streaming.enabled — whether to read as a stream, default false

  • read.streaming.check-interval — check interval for streaming reads, in seconds; the default is 60, i.e. one minute

The query range is [BEGIN_INSTANTTIME, END_INSTANTTIME], i.e. both the start time and the end time are included. For the default values, see the parameter descriptions above.

Version

Create the table and generate data with:
  • Hudi 0.9.0

  • Spark 2.4.5

Here I create the table with Hudi Spark SQL 0.9.0, to simulate Hudi tables created in real projects with the Java Client and Spark SQL, and to verify whether Flink SQL incremental queries are compatible with Hudi tables created by an older version (if you don't have this requirement, you can create the table and generate the data however you like).

Query with:
  • Hudi 0.13.0-SNAPSHOT

  • Flink 1.14.3 (incremental query)

  • Spark 3.1.2 (mainly for viewing commit information with the Call Procedures command)

Create a table and generate data

-- Spark SQL Hudi 0.9.0
create table hudi.test_flink_incremental (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
 partitioned by (dt)
 options (
  primaryKey = 'id',
  preCombineField = 'ts',
  type = 'cow'
);

insert into hudi.test_flink_incremental values (1,'a1', 10, 1000, '2022-11-25');
insert into hudi.test_flink_incremental values (2,'a2', 20, 2000, '2022-11-25');
update hudi.test_flink_incremental set name='hudi2_update' where id = 2;
insert into hudi.test_flink_incremental values (3,'a3', 30, 3000, '2022-11-26');
insert into hudi.test_flink_incremental values (4,'a4', 40, 4000, '2022-12-26');

Use show_commits to see which commits exist (here I query with Hudi master, because show_commits is only supported from version 0.11.0; you can also look at the .commit files under the .hoodie folder with the hadoop command)

call show_commits(table => 'hudi.test_flink_incremental');
20221205152736
20221205152723
20221205152712
20221205152702
20221205152650

Create the Hudi table with Flink SQL (in the default in-memory catalog, pointing at the existing Hudi path)

CREATE TABLE test_flink_incremental (
  id int PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  price double,
  ts bigint,
  dt VARCHAR(10)
)
PARTITIONED BY (dt)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_flink_incremental'
);

The incremental-query parameters are not specified when creating the table; instead we specify them dynamically at query time, which is more flexible. To specify them dynamically, append the following hint to the query statement:

/*+ 
options(
  'read.start-commit' = '20221205152723',
  'read.end-commit'='20221205152736'
) 
*/

batch reading

Flink SQL has two modes for reading Hudi: batch reads and streaming reads. Batch reading is the default; let's first look at incremental queries with batch reads.

Verify whether the start time is included, with the default end time
select * from test_flink_incremental 
/*+ 
options(
    'read.start-commit' = '20221205152723' -- start time corresponds to the record with id=3
) 
*/

The result includes the start time; when no end time is specified, the latest data is read by default

id   name     price        ts                 dt
 4     a4      40.0      4000      dt=2022-12-26
 3     a3      30.0      3000      dt=2022-11-26
Verify that the end time is included
select * from test_flink_incremental 
/*+ 
options(
    'read.start-commit' = '20221205152712',  -- start time corresponds to the record with id=2
    'read.end-commit'='20221205152723'       -- end time corresponds to the record with id=3
) 
*/

The result includes the end time

id           name        price       ts                 dt
 3             a3        30.0      3000      dt=2022-11-26
 2   hudi2_update        20.0      2000      dt=2022-11-25
Verify default start time

Here the end time is specified but the start time is not. (If neither is specified, all records of the latest version of the table are read.)

select * from test_flink_incremental 
/*+ 
options(
    'read.end-commit'='20221205152712'       -- end time corresponds to the update record of id=2
) 
*/

Result: only the record corresponding to end-commit is queried

id           name        price       ts                 dt
 2   hudi2_update        20.0      2000      dt=2022-11-25
Time travel (query history)

Verify whether historical records can be queried. We updated the name of id=2: before the update the name was a2, afterwards it is hudi2_update. Let's verify whether Flink SQL can query the historical Hudi record; the expected result is id=2, name=a2.

select * from test_flink_incremental 
/*+ 
options(
    'read.end-commit'='20221205152702'       -- end time corresponds to the historical record of id=2
) 
*/

Result: History can be queried correctly

id           name        price       ts                 dt
 2             a2        20.0      2000      dt=2022-11-25

streaming

Parameters to enable stream reading

read.streaming.enabled = true

Streaming reads do not need an end time, because the usual requirement is to read all incremental data, so we only need to verify the start time

Verify default start time
select * from test_flink_incremental 
/*+ 
options(
    'read.streaming.enabled'='true',
    'read.streaming.check-interval' = '4'
) 
*/

Result: incremental reading starts from the latest instantTime, i.e. the default read.start-commit is the latest instantTime

id   name     price        ts                 dt
 4     a4      40.0      4000      dt=2022-12-26
Verify specified start time
select * from test_flink_incremental 
/*+ 
options(
    'read.streaming.enabled'='true',
    'read.streaming.check-interval' = '4',
    'read.start-commit' = '20221205152712'
) 
*/

result

id           name        price       ts                 dt
 2   hudi2_update        20.0      2000      dt=2022-11-25
 3             a3        30.0      3000      dt=2022-11-26
 4             a4        40.0      4000      dt=2022-11-26

If you want to query all historical data on the first read, you can set read.start-commit to an earlier time, such as last year: 'read.start-commit' = '20211205152712'

select * from test_flink_incremental 
/*+ 
options(
    'read.streaming.enabled'='true',
    'read.streaming.check-interval' = '4',
    'read.start-commit' = '20211205152712'
) 
*/
id           name        price       ts                 dt
 1             a1        10.0      1000      dt=2022-11-25
 2   hudi2_update        20.0      2000      dt=2022-11-25
 3             a3        30.0      3000      dt=2022-11-26
 4             a4        40.0      4000      dt=2022-11-26
Verify continuity of streaming reads

Verify that when new incremental data arrives, the stream keeps consuming the Hudi incremental data, and verify the accuracy and consistency of the data. To make verification easier, I use Flink SQL to incrementally stream-read the Hudi table and sink it into a MySQL table, and finally verify the correctness of the data by reading the MySQL table.

Flink SQL needs the JDBC connector jar to read and write MySQL; just put flink-connector-jdbc_2.12-1.14.3.jar under lib. Download address: https://repo1.maven.org/maven2/org/apache/flink/flink-connector-jdbc_2.12/1.14.3/flink-connector-jdbc_2.12-1.14.3.jar

First create a Sink table in MySQL

-- MySQL
CREATE TABLE `test_sink` (
  `id` int(11),
  `name` text DEFAULT NULL,
  `price` int(11),
  `ts` int(11),
  `dt`  text DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Create the corresponding sink table in Flink

create table test_sink (
  id int,
  name string,
  price double,
  ts bigint,
  dt string
) with (
 'connector' = 'jdbc',
 'url' = 'jdbc:mysql://192.468.44.128:3306/hudi?useSSL=false&useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8',
 'username' = 'root',
 'password' = 'root-123',
 'table-name' = 'test_sink',
 'sink.buffer-flush.max-rows' = '1'
);

Then incrementally stream-read the Hudi table and sink it into MySQL

insert into test_sink
select * from test_flink_incremental 
/*+ 
options(
    'read.streaming.enabled'='true',
    'read.streaming.check-interval' = '4',
    'read.start-commit' = '20221205152712'
) 
*/

This starts a long-running job that stays in the RUNNING state; we can check it on the yarn-session UI

(screenshot: the Flink job running on the yarn-session UI)

Then first verify the accuracy of the historical data in MySQL
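For example, with a quick check on the MySQL side:

-- MySQL
select * from test_sink order by id;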

(screenshot: historical data in the MySQL test_sink table)

Then use Spark SQL to insert two rows into the source table

-- Spark SQL
insert into hudi.test_flink_incremental values (5,'a5', 50, 5000, '2022-12-07');
insert into hudi.test_flink_incremental values (6,'a6', 60, 6000, '2022-12-07');

The check interval of our incremental read is set to 4s. After the inserts succeed, wait 4s and then verify the data in the MySQL table.

(screenshot: the MySQL test_sink table after the two new inserts)

The newly added data has been successfully sunk into MySQL, and there are no duplicates

Finally, verify updated incremental data: use Spark SQL to update the Hudi source table

-- Spark SQL
update hudi.test_flink_incremental set name='hudi5_update' where id = 5;

Continue to verify results

(screenshot: the MySQL test_sink table after the update)

The result: the updated incremental data is also inserted into the MySQL sink table, but the original row is not updated

So how do we get actual updates? We need to add the primary key field to both the MySQL sink table and the Flink sink table; both are required, as follows

-- MySQL
CREATE TABLE `test_sink` (
  `id` int(11),
  `name` text DEFAULT NULL,
  `price` int(11),
  `ts` int(11),
  `dt`  text DEFAULT NULL,
   PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Flink SQL
create table test_sink (
  id int PRIMARY KEY NOT ENFORCED,
  name string,
  price double,
  ts bigint,
  dt string
) with (
 'connector' = 'jdbc',
 'url' = 'jdbc:mysql://192.468.44.128:3306/hudi?useSSL=false&useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8',
 'username' = 'root',
 'password' = 'root-123',
 'table-name' = 'test_sink',
 'sink.buffer-flush.max-rows' = '1'
);

Stop the long-running job we just started, re-execute the insert statement from before to load the historical data first, and finally verify the incremental effect

-- Spark SQL
update hudi.test_flink_incremental set name='hudi6_update' where id = 6;
insert into hudi.test_flink_incremental values (7,'a7', 70, 7000, '2022-12-07');

As you can see, the expected effect is achieved: an update is performed for id=6 and an insert for id=7

(screenshot: the MySQL test_sink table showing the update for id=6 and the insert for id=7)
