Streaming data lake platform Apache Paimon (2): Integrating the Flink Engine

Chapter 2 Integrating the Flink Engine

Paimon currently supports Flink 1.17, 1.16, 1.15 and 1.14. This course uses Flink 1.17.0.

2.1 Environment preparation


2.1.1 Install Flink

1) Upload and decompress the Flink installation package

tar -zxvf flink-1.17.0-bin-scala_2.12.tgz -C /opt/module/

2) Configure environment variables

sudo vim /etc/profile.d/my_env.sh

export HADOOP_CLASSPATH=`hadoop classpath`

source /etc/profile.d/my_env.sh

2.1.2 Upload jar package

1) Download and upload Paimon's jar package

jar package download address: https://repository.apache.org/snapshots/org/apache/paimon/paimon-flink-1.17/0.5-SNAPSHOT/

2) Copy the jar package of paimon to the lib directory of flink

cp paimon-flink-1.17-0.5-20230703.002437-67.jar /opt/module/flink-1.17.0/lib

2.1.3 Start Hadoop

(Omitted.)

2.1.4 Start sql-client

1) Modify flink-conf.yaml configuration

vim /opt/module/flink-1.17.0/conf/flink-conf.yaml

# Fix garbled Chinese characters; before Flink 1.17 the parameter was env.java.opts

env.java.opts.all: -Dfile.encoding=UTF-8

classloader.check-leaked-classloader: false

taskmanager.numberOfTaskSlots: 4

execution.checkpointing.interval: 10s

state.backend: rocksdb

state.checkpoints.dir: hdfs://hadoop102:8020/ckps

state.backend.incremental: true

2) Start the Flink cluster

(1) Solve the dependency problem

cp /opt/module/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.4.jar /opt/module/flink-1.17.0/lib/

(2) Here we take the Yarn-Session mode as an example

/opt/module/flink-1.17.0/bin/yarn-session.sh -d

3) Start Flink's sql-client

/opt/module/flink-1.17.0/bin/sql-client.sh -s yarn-session


4) Set the result display mode

SET 'sql-client.execution.result-mode' = 'tableau';

2.2 Catalog

Paimon Catalog can persist metadata and currently supports two types of metastores:

filesystem (default): store metadata and table files in the filesystem.

hive: Store metadata in hive metastore. Users can access tables directly from Hive.

2.2.1 File system

CREATE CATALOG fs_catalog WITH (

  'type' = 'paimon',

  'warehouse' = 'hdfs://hadoop102:8020/paimon/fs'

);

USE CATALOG fs_catalog;

2.2.2 Hive Catalog

By using the Hive Catalog, changes to the Catalog will directly affect the corresponding hive metastore. Tables created in such catalogs can also be accessed directly from Hive.

To use Hive Catalog, database names, table names, and field names should be lowercase.

1) Upload hive-connector

Upload flink-sql-connector-hive-3.1.3_2.12-1.17.0.jar to the lib directory of Flink

2) Restart the yarn-session cluster

3) Start the metastore service of hive

nohup hive --service metastore &

4) Create Hive Catalog

CREATE CATALOG hive_catalog WITH (

  'type' = 'paimon',

  'metastore' = 'hive',

'uri' = 'thrift://hadoop102:9083',

'hive-conf-dir' = '/opt/module/hive/conf',

  'warehouse' = 'hdfs://hadoop102:8020/paimon/hive'

);


USE CATALOG hive_catalog;

5) Precautions

When using the Hive Catalog to change incompatible column types through ALTER TABLE, see HIVE-17832. The following needs to be configured:

vim /opt/module/hive/conf/hive-site.xml;

  <property>

    <name>hive.metastore.disallow.incompatible.col.type.changes</name>

    <value>false</value>

  </property>

The above configuration needs to be configured in hive-site.xml, and the hive metastore service needs to be restarted.

If using Hive3, disable Hive ACID:

hive.strict.managed.tables=false

hive.create.as.insert.only=false

metastore.create.as.acid=false
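
If you prefer to manage these settings in hive-site.xml rather than per session, they would look like the following (a sketch mirroring the property block above):

<property>
  <name>hive.strict.managed.tables</name>
  <value>false</value>
</property>
<property>
  <name>hive.create.as.insert.only</name>
  <value>false</value>
</property>
<property>
  <name>metastore.create.as.acid</name>
  <value>false</value>
</property>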

2.2.3 SQL initialization file

1) Create an initialization sql file

vim conf/sql-client-init.sql

CREATE CATALOG fs_catalog WITH (

  'type' = 'paimon',

  'warehouse' = 'hdfs://hadoop102:8020/paimon/fs'

);

 

CREATE CATALOG hive_catalog WITH (

  'type' = 'paimon',

  'metastore' = 'hive',

'uri' = 'thrift://hadoop102:9083',

'hive-conf-dir' = '/opt/module/hive/conf',

  'warehouse' = 'hdfs://hadoop102:8020/paimon/hive'

);

 

 

USE CATALOG hive_catalog;

 

SET 'sql-client.execution.result-mode' = 'tableau';

2) When starting sql-client, specify the sql initialization file

bin/sql-client.sh -s yarn-session -i conf/sql-client-init.sql

3) View the catalog

show catalogs;

show current catalog;

2.3 DDL

2.3.1 Create table

2.3.1.1 Managed tables

Tables created in a Paimon Catalog are Paimon managed tables, which are managed by the Catalog. When a table is deleted from the Catalog, its table files are also deleted, similar to Hive internal tables.

1) create table

CREATE TABLE test (

  user_id BIGINT,

  item_id BIGINT,

  behavior STRING,

  dt STRING,

  hh STRING,

  PRIMARY KEY (dt, hh, user_id) NOT ENFORCED

);

2) Create a partition table

CREATE TABLE test_p (

  user_id BIGINT,

  item_id BIGINT,

  behavior STRING,

  dt STRING,

  hh STRING,

  PRIMARY KEY (dt, hh, user_id) NOT ENFORCED

) PARTITIONED BY (dt, hh);

By configuring partition.expiration-time, expired partitions can be automatically deleted.
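
For example, a minimal sketch of enabling automatic expiration on test_p (option names taken from the Paimon configuration list referenced later in this chapter; the values are illustrative and not verified against this exact version):

ALTER TABLE test_p SET (
  'partition.expiration-time' = '7 d',
  'partition.expiration-check-interval' = '1 d',
  'partition.timestamp-formatter' = 'yyyy-MM-dd',
  'partition.timestamp-pattern' = '$dt'
);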

If a primary key is defined, the partition field must be a subset of the primary key.

The following three types of fields can be defined as partition fields:

Creation time (recommended): Creation time is usually immutable, so you can safely treat it as a partition field and add it to your primary key.

Event time: the event time is a field in the original table. For CDC data, such as tables synchronized from MySQL CDC or changelogs generated by Paimon, the data is complete CDC data including UPDATE_BEFORE records, so even if you declare a primary key that contains the partition fields, uniqueness is still guaranteed.

CDC op_ts: this cannot be defined as a partition field, because the timestamp of the previous record cannot be known.

3) Create Table As

Tables can be created and populated from query results. For example, given the statement CREATE TABLE table_b AS SELECT id, name FROM table_a, the resulting table table_b is equivalent to creating the table and inserting the data with: CREATE TABLE table_b (id INT, name STRING); INSERT INTO table_b SELECT id, name FROM table_a;

When using CREATE TABLE AS SELECT we can specify the primary key or partition.

CREATE TABLE test1(

user_id BIGINT,

item_id BIGINT

);

CREATE TABLE test2 AS SELECT * FROM test1;

-- specify the partition

CREATE TABLE test2_p WITH ('partition' = 'dt') AS SELECT * FROM test_p;

-- specify the configuration

CREATE TABLE test3(

  user_id BIGINT,

  item_id BIGINT

) WITH ('file.format' = 'orc');

CREATE TABLE test3_op WITH ('file.format' = 'parquet') AS SELECT * FROM test3;

-- specify the primary key

CREATE TABLE test_pk WITH ('primary-key' = 'dt,hh') AS SELECT * FROM test;

-- specify primary key and partition

CREATE TABLE test_all WITH ('primary-key' = 'dt,hh', 'partition' = 'dt') AS SELECT * FROM test_p;

4) Create Table Like

Create a table with the same schema, partitions, and table attributes as another table.

CREATE TABLE test_ctl LIKE test;

5) Table properties

Users can specify table properties to enable Paimon's functionality or to improve Paimon's performance. See Configurations for a complete list of such properties: https://paimon.apache.org/docs/master/maintenance/configurations/.

CREATE TABLE tbl(

  user_id BIGINT,

  item_id BIGINT,

  behavior STRING,

  dt STRING,

  hh STRING,

  PRIMARY KEY (dt, hh, user_id) NOT ENFORCED

) PARTITIONED BY (dt, hh) 

WITH (

  'bucket' = '2',

  'bucket-key' = 'user_id'

);

2.3.1.2 External Tables

External tables are recorded but not managed by the Catalog. If you delete an external table, its table file will not be deleted, similar to Hive's external table.

Paimon external tables can be used in any Catalog. If you don't want to create a Paimon Catalog and just want to read/write tables, you can consider external tables.

CREATE TABLE ex (

  user_id BIGINT,

  item_id BIGINT,

  behavior STRING,

  dt STRING,

  hh STRING,

  PRIMARY KEY (dt, hh, user_id) NOT ENFORCED

) WITH (

  'connector' = 'paimon',

  'path' = 'hdfs://hadoop102:8020/paimon/external/ex',

  'auto-create' = 'true' 

);

2.3.1.3 Temporary tables

Temporary tables are only supported by Flink. Like external tables, temporary tables are just recorded but not managed by the current Flink SQL session. If a temporary table is dropped, its resources are not deleted; the table is simply dropped when the Flink SQL session is closed. Unlike external tables, temporary tables can be created inside a Paimon Catalog.

If you want to use Paimon Catalog tables together with tables from other systems, but do not want to store the latter in other catalogs, you can create temporary tables.

USE CATALOG hive_catalog;

 

CREATE TEMPORARY TABLE temp (

  k INT,

  v STRING

) WITH (

  'connector' = 'filesystem',

  'path' = 'hdfs://hadoop102:8020/temp.csv',

  'format' = 'csv'

);

2.3.2 Modify table

2.3.2.1 Modify table

1) Change/add table properties

ALTER TABLE test SET (

  'write-buffer-size' = '256 MB'

);

2) Rename the table name

ALTER TABLE test1 RENAME TO test_new;

3) Delete the table attribute

ALTER TABLE test RESET ('write-buffer-size');

2.3.2.2 Modify Columns

1) Add new column

ALTER TABLE test ADD (c1 INT, c2 STRING);

2) Rename the column names

ALTER TABLE test RENAME c1 TO c0;

3) Delete the column

ALTER TABLE test DROP (c0, c2);

4) Change the nullability of the column

CREATE TABLE test_null(

id INT PRIMARY KEY NOT ENFORCED,

coupon_info FLOAT NOT NULL

);

-- Modify the column coupon_info to allow NULL

ALTER TABLE test_null MODIFY coupon_info FLOAT;

-- Modify the column coupon_info to not allow NULL

-- If the table already contains NULL values, set the following parameter to drop the NULL values before modifying

SET 'table.exec.sink.not-null-enforcer' = 'DROP';

ALTER TABLE test_null MODIFY coupon_info FLOAT NOT NULL;

5) Change the column annotation

ALTER TABLE test MODIFY user_id BIGINT COMMENT 'user id';

6) Add column position

ALTER TABLE test ADD a INT FIRST;

ALTER TABLE test ADD b INT AFTER a;

7) Change column position

ALTER TABLE test MODIFY b INT FIRST;

ALTER TABLE test MODIFY a INT AFTER user_id;

8) Change the column type

ALTER TABLE test MODIFY a DOUBLE;

2.3.2.3 Modify watermark

1) Add watermark

CREATE TABLE test_wm (

id INT,

name STRING,

ts BIGINT

);

ALTER TABLE test_wm ADD(

et AS to_timestamp_ltz(ts,3),

WATERMARK FOR et AS et - INTERVAL '1' SECOND

);

2) Change the watermark

ALTER TABLE test_wm MODIFY WATERMARK FOR et AS et - INTERVAL '2' SECOND;

3) Remove the watermark

ALTER TABLE test_wm DROP WATERMARK;

2.4 DML

2.4.1 Insert data

The INSERT statement inserts new rows into a table or overwrites existing data in the table. Inserted rows can be specified by value expressions or query results, consistent with standard sql syntax.

INSERT { INTO | OVERWRITE } table_identifier [ part_spec ] [ column_list ] { value_expr | query }

part_spec

Optional, a list of key-value pairs for the specified partition, separated by commas. Type literals can be used (for example, date'2019-01-02').

Syntax: PARTITION (partition column name = partition column value [ , ... ] )

column_list

Optionally, specify a comma-separated list of fields.

Syntax: (col_name1 [, column_name2, ...])

All specified columns should exist in the table and not be duplicates of each other. It includes all columns except static partition columns. The field list should be exactly the same size as the data in the VALUES clause or query.

value_expr

Specifies the value to insert. An explicitly specified value or NULL can be inserted. Each value in the clause must be separated by a comma. More than one set of values can be specified to insert multiple rows.

Syntax: VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ]

Currently, Flink does not support the direct use of NULL, so NULL needs to be converted to an actual data type value, such as "CAST (NULL AS STRING)"
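
For example, a minimal sketch against the test table created earlier, combining a column list with an explicit NULL cast for the nullable behavior column:

INSERT INTO test (user_id, item_id, behavior, dt, hh)
VALUES (4, 4, CAST(NULL AS STRING), '2023-07-01', '1');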

Note: writing nullable columns to not-null columns

A nullable column from one table cannot be inserted directly into a non-nullable column of another table. In Flink, the COALESCE function can be used to handle this; for example, if key1 of table A is NOT NULL and key2 of table B is nullable:

INSERT INTO A (key1) SELECT COALESCE(key2, <non-null default>) FROM B;

case:

INSERT INTO test VALUES(1,1,'order','2023-07-01','1'), (2,2,'pay','2023-07-01','2');

INSERT INTO test_p PARTITION(dt='2023-07-01',hh='1') VALUES(3,3,'pv');

-- The execution mode distinguishes streaming and batch

INSERT INTO test_p SELECT * from test;

Paimon supports data shuffle through partitions and buckets in the sink phase.

2.4.2 Overwrite data

INSERT OVERWRITE only supports batch mode. By default, streaming reads ignore the commits generated by INSERT OVERWRITE. If you want streaming reads to see OVERWRITE commits, you can configure streaming-read-overwrite.
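
For example, a sketch of a streaming read that also consumes OVERWRITE commits (assuming the streaming-read-overwrite option name from the configuration list):

SELECT * FROM test_p /*+ OPTIONS('streaming-read-overwrite' = 'true') */;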

RESET 'execution.checkpointing.interval';

SET 'execution.runtime-mode' = 'batch';

1) Overwrite non-partitioned tables

INSERT OVERWRITE test VALUES(3,3,'pay','2023-07-01','2');

2) Overwrite the partition table

For partitioned tables, Paimon's default overwrite mode is dynamic partition overwrite (that is, Paimon only deletes the partitions that appear in the overwritten data). You can configure dynamic-partition-overwrite to change this.

INSERT OVERWRITE test_p SELECT * from test;

Overwrite the specified partition:

INSERT OVERWRITE test_p PARTITION (dt = '2023-07-01', hh = '2') SELECT user_id,item_id,behavior from test;

3) Empty table

You can use INSERT OVERWRITE to clear a table by inserting an empty result set (with dynamic partition overwrite turned off):

INSERT OVERWRITE test_p /*+ OPTIONS('dynamic-partition-overwrite'='false') */ SELECT * FROM test_p WHERE false;

2.4.3 Update data

Currently, Paimon supports UPDATE to update records in Flink 1.17 and later versions. You can perform UPDATEs in Flink's batch mode.

Only primary key tables support this feature. Updating primary keys is not supported.

The merge engine must be deduplicate or partial-update to support this feature (the default is deduplicate).

UPDATE test SET item_id = 4, behavior = 'pv' WHERE user_id = 3;

2.4.4 Delete data

Delete from table (Flink 1.17):

This feature is only supported for tables whose write mode is change-log (for tables with a primary key, change-log is the default).

If the table has a primary key, the merge engine must be deduplicate (the default).

DELETE FROM test WHERE user_id = 3;

2.4.5 Merge Into

Row-level updates are implemented through merge into, and only primary key tables support this function. This operation does not produce UPDATE_BEFORE, so setting 'changelog-producer' = 'input' is not recommended.

The merge-into operation uses "upsert" semantics instead of "update", which means that if the row exists, an update is performed, otherwise an insert is performed.

1) Syntax description:

<FLINK_HOME>/bin/flink run \

  /path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

  merge-into \

  --warehouse <warehouse-path> \

  --database <database-name> \

  --table <target-table> \

  [--target-as <target-table-alias>] \

  --source-table <source-table-name> \

  [--source-sql <sql> ...]\

  --on <merge-condition> \

  --merge-actions <matched-upsert,matched-delete,not-matched-insert,not-matched-by-source-upsert,not-matched-by-source-delete> \

  --matched-upsert-condition <matched-condition> \

  --matched-upsert-set <upsert-changes> \

  --matched-delete-condition <matched-condition> \

  --not-matched-insert-condition <not-matched-condition> \

  --not-matched-insert-values <insert-values> \

  --not-matched-by-source-upsert-condition <not-matched-by-source-condition> \

  --not-matched-by-source-upsert-set <not-matched-upsert-changes> \

  --not-matched-by-source-delete-condition <not-matched-by-source-condition> \

  [--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]]

--source-sql <sql> can be used to pass SQL that configures the environment and creates the source table at runtime.

Description of "match":

(1) matched: the changed rows come from the target table, and each row matches a source table row (source ∩ target) according to:

the merge condition (--on)

the matched condition (--matched-xxx-condition)

(2) not-matched: the changed rows come from the source table, and none of them match any target table row (source - target) according to:

the merge condition (--on)

the not-matched condition (--not-matched-xxx-condition): columns of the target table cannot be used to construct the condition expression.

(3) not-matched-by-source: the changed rows come from the target table, and none of them match any source table row (target - source) according to:

the merge condition (--on)

the not-matched-by-source condition (--not-matched-by-source-xxx-condition): columns of the source table cannot be used to construct the condition expression.

2) Case Practice

You need to use paimon-flink-action-xxxx.jar; upload it:

cp paimon-flink-action-0.5-20230703.002437-53.jar /opt/module/flink-1.17.0/opt

download link:

https://repository.apache.org/snapshots/org/apache/paimon/paimon-flink-action/0.5-SNAPSHOT/

(1) Prepare the test tables:

use catalog hive_catalog;

create database test;

use test;

 

CREATE TABLE ws1 (

  id INT,

  ts BIGINT,

  vc INT,

  PRIMARY KEY (id) NOT ENFORCED

);

 

INSERT INTO ws1 VALUES(1,1,1),(2,2,2),(3,3,3);

 

 

CREATE TABLE ws_t (

  id INT,

  ts BIGINT,

  vc INT,

  PRIMARY KEY (id) NOT ENFORCED

);

INSERT INTO ws_t VALUES(2,2,2),(3,3,3),(4,4,4),(5,5,5);

(2) Case 1: match ws_t with ws1 on id; for matched rows, set vc to 10 where ws_t.ts > 2, and delete the rows where ws_t.ts <= 2

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

merge-into \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table ws_t \

--source-table test.ws1 \

--on "ws_t.id = ws1.id" \

--merge-actions matched-upsert,matched-delete \

--matched-upsert-condition "ws_t.ts > 2" \

--matched-upsert-set "vc = 10" \

--matched-delete-condition "ws_t.ts <= 2"

(3) Case 2: match ws_t with ws1 on id; for matched rows, add 10 to vc in ws_t; rows of ws1 that do not match are inserted into ws_t

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

merge-into \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table ws_t \

--source-table test.ws1 \

--on "ws_t.id = ws1.id" \

--merge-actions matched-upsert,not-matched-insert \

--matched-upsert-set "vc = ws_t.vc + 10" \

--not-matched-insert-values "*"

(4) Case 3: match ws_t with ws1 on id; for rows of ws_t that have no match in ws1, add 20 to vc if ts > 4, and delete the row if ts = 4

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

merge-into \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table ws_t \

--source-table test.ws1 \

--on "ws_t.id = ws1.id" \

--merge-actions not-matched-by-source-upsert,not-matched-by-source-delete \

--not-matched-by-source-upsert-condition "ws_t.ts > 4" \

--not-matched-by-source-upsert-set "vc = ws_t.vc + 20" \

--not-matched-by-source-delete-condition "ws_t.ts = 4"

(5) Case 4: use --source-sql to create the source table under a new catalog, match it with ws_t on id, and insert the rows that do not match into ws_t

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

merge-into \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table ws_t \

--source-sql "CREATE CATALOG fs2 WITH ('type' = 'paimon','warehouse' = 'hdfs://hadoop102:8020/paimon/fs2')" \

--source-sql "CREATE DATABASE IF NOT EXISTS fs2.test" \

--source-sql "CREATE TEMPORARY VIEW fs2.test.ws2 AS SELECT id+10 as id,ts,vc FROM test.ws1" \

--source-table fs2.test.ws2 \

--on "ws_t.id = ws2.id" \

--merge-actions not-matched-insert \

--not-matched-insert-values "*"

2.5 DQL query table

2.5.1 Batch query

Just like all other tables, Paimon tables can be queried using the SELECT statement.

Paimon's batch read returns all the data in a snapshot of the table. By default, a batch read returns the latest snapshot.

In sql-client, set the execution mode to batch:

RESET 'execution.checkpointing.interval';

SET 'execution.runtime-mode' = 'batch';

2.5.1.1 Time travel

1) Read the snapshot of the specified id

SELECT * FROM ws_t /*+ OPTIONS('scan.snapshot-id' = '1') */;

SELECT * FROM ws_t /*+ OPTIONS('scan.snapshot-id' = '2') */;

2) Read the snapshot of the specified timestamp

-- View snapshot information

SELECT * FROM ws_t$snapshots;

SELECT * FROM ws_t /*+ OPTIONS('scan.timestamp-millis' = '1688369660841') */;

3) Read the specified tag

SELECT * FROM ws_t /*+ OPTIONS('scan.tag-name' = 'my-tag') */;

2.5.1.2 Incremental query

Read the incremental changes between the start snapshot (exclusive) and the end snapshot. For example, "3,5" means the changes between snapshot 3 and snapshot 5:

SELECT * FROM ws_t /*+ OPTIONS('incremental-between' = '3,5') */;

In batch mode, DELETE records cannot be returned, so records of kind -D are dropped. If you want to see DELETE records, you can query the audit_log table instead:

SELECT * FROM ws_t$audit_log /*+ OPTIONS('incremental-between' = '3,5') */;

2.5.2 Streaming query

By default, Streaming read produces the latest snapshot on the table when it is first started, and continues to read the latest changes.

SET 'execution.checkpointing.interval'='30s';

SET 'execution.runtime-mode' = 'streaming';

It is also possible to read only the latest changes, by setting the scan mode:

SELECT * FROM ws_t /*+ OPTIONS('scan.mode' = 'latest') */;

2.5.2.1 Time travel

If you only want to process today's data and later, you can use partition filters to achieve this:

SELECT * FROM test_p WHERE dt > '2023-07-01';

If the table is not partitioned, or filtering by partition is not possible, stream reads with time travel can be used.

1) Read the changed data from the specified snapshot id

SELECT * FROM ws_t /*+ OPTIONS('scan.snapshot-id' = '1') */;

2) Start reading from the specified timestamp

SELECT * FROM ws_t /*+ OPTIONS('scan.timestamp-millis' = '1688369660841') */;

3) Read the specified snapshot data when starting for the first time, and continue to read changes

SELECT * FROM ws_t /*+ OPTIONS('scan.mode'='from-snapshot-full','scan.snapshot-id' = '3') */;

2.5.2.2 Consumer ID

1) Advantages

Specify a consumer-id when streaming-reading the table; this is an experimental feature.

When the stream reads the Paimon table, the next snapshot id will be recorded to the file system. This has several advantages:

When the previous job is stopped, the newly started job can continue to consume the previous progress without needing to resume from the state. New reads will start reading from the next snapshot ID found in the consumer file.

When judging whether a snapshot has expired, Paimon will look at all consumers of the table in the file system. If there are still consumers relying on the snapshot, the snapshot will not be deleted due to expiration.

When no watermark is defined, the Paimon table will pass the watermark from the snapshot to the downstream Paimon table, which means you can track the progress of the watermark throughout the pipeline.

NOTE: Consumers will prevent snapshots from expiring. A "consumer.expiration-time" can be specified to manage the lifetime of the consumer.
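
For example, a sketch that combines a consumer ID with the expiration time mentioned above (adjust the duration to your retention needs):

SELECT * FROM ws_t /*+ OPTIONS('consumer-id' = 'atguigu', 'consumer.expiration-time' = '1 d') */;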

2) Case presentation

Specify consumer-id to start streaming query:

SELECT * FROM ws_t /*+ OPTIONS('consumer-id' = 'atguigu') */;

Stop the original streaming query and insert data:

insert into ws_t values(6,6,6);

Specify the consumer-id streaming query again:

SELECT * FROM ws_t /*+ OPTIONS('consumer-id' = 'atguigu') */;

2.5.3 Query Optimization

It is strongly recommended to specify partition and primary key filters at query time, which speeds up data skipping for the query.

Filter functions that can speed up data skipping are:

=

<

>

<=

>=

IN (...)

LIKE 'abc%'

IS NULL

Paimon sorts data by primary key, which speeds up point and range queries. When using a composite primary key, the query filter should preferably form the leftmost prefix of the primary key for good speedup.

CREATE TABLE orders (

  catalog_id BIGINT,

  order_id BIGINT,

  ...,

  PRIMARY KEY (catalog_id, order_id) NOT ENFORCED -- composite primary key

)

By specifying a range filter on the leftmost prefix of the primary key, the query gets a nice speedup.

SELECT * FROM orders WHERE catalog_id=1025;

SELECT * FROM orders WHERE catalog_id=1025 AND order_id=29495;

SELECT * FROM orders

WHERE catalog_id=1025

AND order_id>2035 AND order_id<6000;

The following example filter does not speed up the query very well:

SELECT * FROM orders WHERE order_id=29495;

SELECT * FROM orders WHERE catalog_id=1025 OR order_id=29495;

2.6 System tables

System tables contain metadata and information about each table, such as snapshots created and options used. Users can access system tables through batch queries.

2.6.1 Snapshots Table

Through the snapshots table, you can query the snapshot history information of the table, including the number of records that occurred in the snapshot.

SELECT * FROM ws_t$snapshots;

By querying the snapshot table, you can learn about the table's commit and expiration information, as well as the time travel of the data.

2.6.2 Schemas Table

The historical schema of the table can be queried through the schemas table.

SELECT * FROM ws_t$schemas;

A snapshot table and a schema table can be joined to get the fields for a given snapshot.

SELECT s.snapshot_id, t.schema_id, t.fields

FROM ws_t$snapshots s JOIN ws_t$schemas t

ON s.schema_id=t.schema_id where s.snapshot_id=3;

2.6.3 Options Table

The option information of the table specified in the DDL can be queried through the option table. Options not shown will be default values.

SELECT * FROM ws_t$options;

2.6.4 Audit log Table

If you need to audit the changelog of a table, you can use the audit_log system table. The audit_log table exposes a rowkind column along with the table's incremental data; you can filter on this column (and apply further operations) to perform the audit.

rowkind has four values:

+I: insert operation.

-U: update operation carrying the previous content of the updated row.

+U: update operation carrying the new content of the updated row.

-D: delete operation.

SELECT * FROM ws_t$audit_log;
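
For example, to review only the delete records (a sketch; rowkind is exposed as a string column of the audit_log table):

SELECT * FROM ws_t$audit_log WHERE rowkind = '-D';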

2.6.5 Files Table

You can query the data files of a specific snapshot of the table.

-- Query the files of the latest snapshot

SELECT * FROM ws_t$files;

-- Query the files of the specified snapshot

SELECT * FROM ws_t$files /*+ OPTIONS('scan.snapshot-id'='1') */;

2.6.6 Tags Table

Through the tags table, you can query the tag history information of the table, including which snapshots are used to tag and some historical information of the snapshots. You can also get all tag names and time travel data to a specific tag by name.

SELECT * FROM ws_t$tags;

2.7 Dimensional table Join

Paimon supports the Lookup Join syntax, which is used to supplement dimension fields with data queried from Paimon. Requires that one table has a processing time attribute and the other table is backed by a lookup source connector.

Paimon supports tables with primary keys and append-only table lookup joins in Flink. The following example illustrates this functionality.

USE CATALOG fs_catalog;

CREATE TABLE customers (

id INT PRIMARY KEY NOT ENFORCED,

name STRING,

country STRING,

zip STRING

);

INSERT INTO customers VALUES(1,'zs','ch','123'),(2,'ls','ch','456'),(3,'ww','ch','789');

CREATE TEMPORARY TABLE Orders (

order_id INT,

total INT,

customer_id INT,

proc_time AS PROCTIME()

) WITH (

'connector' = 'datagen',

'rows-per-second'='1',

'fields.order_id.kind'='sequence',

'fields.order_id.start'='1',

'fields.order_id.end'='1000000',

'fields.total.kind'='random',

'fields.total.min'='1',

'fields.total.max'='1000',

'fields.customer_id.kind'='random',

'fields.customer_id.min'='1',

'fields.customer_id.max'='3'

);

SELECT o.order_id, o.total, c.country, c.zip

FROM Orders AS o

JOIN customers

FOR SYSTEM_TIME AS OF o.proc_time AS c

ON o.customer_id = c.id;

The Lookup Join operator maintains a RocksDB cache locally and pulls the latest updates of the table in real time. The lookup join operator only pulls the necessary data, so your filter criteria are very important for performance.

If a record from Orders (the main table) fails to join because the corresponding data in customers (the lookup table) is not ready yet, consider using Flink's delayed retry strategy for lookup.
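
A sketch of such a delayed-retry lookup using Flink's LOOKUP hint (hint options as provided by Flink 1.16+; verify the option names and values against your Flink version):

SELECT /*+ LOOKUP('table'='customers', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='1s', 'max-attempts'='3') */
  o.order_id, o.total, c.country, c.zip
FROM Orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;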

2.8 CDC Integration

Paimon supports several methods of extracting data into Paimon tables through schema evolution. This means that added columns are synced to the Paimon table in real-time and the sync job will not be restarted for this.

The following synchronization methods are currently supported:

MySQL table synchronization: synchronize one or more tables in MySQL to a Paimon table.

MySQL Sync Database: Synchronize the entire MySQL database into a Paimon database.

API Sync Table: Synchronize your custom DataStream input into a Paimon table.

Kafka synchronization table: Synchronize a Kafka topic table to a Paimon table.

Kafka synchronization database: Synchronize a Kafka topic containing multiple tables or multiple topics containing one table each to a Paimon database.

2.8.1 MySQL

Add Flink CDC connector.

cp flink-sql-connector-mysql-cdc-2.4.0.jar /opt/module/flink-1.17.0/lib

Restart the yarn-session cluster and sql-client.

2.8.1.1 Synchronization table

1) Syntax description

<FLINK_HOME>/bin/flink run \

/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

mysql-sync-table \

--warehouse <warehouse-path> \

--database <database-name> \

--table <table-name> \

[--partition-keys <partition-keys>] \

[--primary-keys <primary-keys>] \

[--computed-column <'column-name=expr-name(args[, ...])'> [--computed-column ...]] \

[--mysql-conf <mysql-cdc-source-conf> [--mysql-conf <mysql-cdc-source-conf> ...]] \

[--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]] \

[--table-conf <paimon-table-sink-conf> [--table-conf <paimon-table-sink-conf> ...]]

Parameter Description:

--warehouse: Paimon warehouse path.
--database: database name in the Paimon Catalog.
--table: Paimon table name.
--partition-keys: the partition keys of the Paimon table. If there are multiple partition keys, join them with commas, e.g. "dt,hh,mm".
--primary-keys: the primary key of the Paimon table. If there are multiple primary key columns, join them with commas, e.g. "buyer_id,seller_id".
--computed-column: computed column definitions. The referenced fields come from the MySQL table's field names.
--mysql-conf: configuration of the Flink CDC MySQL source. Each configuration should be specified in the format "key=value". Hostname, username, password, database name and table name are required; the others are optional.
--catalog-conf: configuration of the Paimon Catalog. Each configuration should be specified in the format "key=value".
--table-conf: configuration of the Paimon table sink. Each configuration should be specified in the format "key=value".

This action will automatically create the specified Paimon table if it does not exist. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared with the schemas of all specified MySQL tables.

2) Case Practice

(1) Synchronize a table in MySQL to a table in Paimon

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

mysql-sync-table \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table order_info_cdc \

--primary-keys id \

--mysql-conf hostname=hadoop102 \

--mysql-conf username=root \

--mysql-conf password=000000 \

--mysql-conf database-name=gmall \

--mysql-conf table-name='order_info' \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hadoop102:9083 \

--table-conf bucket=4 \

--table-conf changelog-producer=input \

--table-conf sink.parallelism=4

(2) Synchronize multiple MySQL tables to one Paimon table

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

mysql-sync-table \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table order_cdc \

--primary-keys id \

--mysql-conf hostname=hadoop102 \

--mysql-conf username=root \

--mysql-conf password=000000 \

--mysql-conf database-name=gmall \

--mysql-conf table-name='order_.*' \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hadoop102:9083 \

--table-conf bucket=4 \

--table-conf changelog-producer=input \

--table-conf sink.parallelism=4

2.8.1.2 Synchronizing the database

1) Syntax description

<FLINK_HOME>/bin/flink run \

/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

mysql-sync-database \

--warehouse <warehouse-path> \

--database <database-name> \

[--ignore-incompatible <true/false>] \

[--table-prefix <paimon-table-prefix>] \

[--table-suffix <paimon-table-suffix>] \

[--including-tables <mysql-table-name|name-regular-expr>] \

[--excluding-tables <mysql-table-name|name-regular-expr>] \

[--mysql-conf <mysql-cdc-source-conf> [--mysql-conf <mysql-cdc-source-conf> ...]] \

[--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]] \

[--table-conf <paimon-table-sink-conf> [--table-conf <paimon-table-sink-conf> ...]]

Parameter Description:

--warehouse: Paimon warehouse path.
--database: database name in the Paimon Catalog.
--ignore-incompatible: defaults to false; in that case, if a MySQL table name already exists in Paimon and their schemas are incompatible, an exception is thrown. You can explicitly set it to true to ignore incompatible tables and the exception.
--table-prefix: the prefix for all Paimon tables to be synchronized. For example, if you want all synchronized tables to be prefixed with "ods_", specify "--table-prefix ods_".
--table-suffix: the suffix for all Paimon tables to be synchronized. Usage is the same as "--table-prefix".
--including-tables: specifies which source tables to synchronize. Multiple tables must be separated with '|', for example 'a|b|c'. Regular expressions are supported; for example, "--including-tables test|paimon.*" synchronizes the table 'test' and all tables whose names start with "paimon".
--excluding-tables: specifies which source tables not to synchronize. Usage is the same as "--including-tables". If both are specified, "--excluding-tables" takes precedence over "--including-tables".
--mysql-conf: configuration of the Flink CDC MySQL source. Each configuration should be specified in the format "key=value". Hostname, username, password, database name and table name are required; the others are optional.
--catalog-conf: configuration of the Paimon Catalog. Each configuration should be specified in the format "key=value".
--table-conf: configuration of the Paimon table sink. Each configuration should be specified in the format "key=value".

Only tables with primary keys will be synchronized.

For each MySQL table that needs to be synchronized, if the corresponding Paimon table does not exist, this operation will automatically create the table. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared with the schemas of all specified MySQL tables.

2) Case Practice

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

mysql-sync-database \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table-prefix "ods_" \

--table-suffix "_cdc" \

--mysql-conf hostname=hadoop102 \

--mysql-conf username=root \

--mysql-conf password=000000 \

--mysql-conf database-name=gmall \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hadoop102:9083 \

--table-conf bucket=4 \

--table-conf changelog-producer=input \

--table-conf sink.parallelism=4 \

--including-tables 'user_info|order_info|activity_rule'

3) Synchronize the newly added tables under the database

First assume that the Flink job is synchronizing the tables [product, user, address] under the database source_db. The command to submit the job is as follows:

<FLINK_HOME>/bin/flink run \

/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

mysql-sync-database \

--warehouse hdfs:///path/to/warehouse \

--database test_db \

--mysql-conf hostname=127.0.0.1 \

--mysql-conf username=root \

--mysql-conf password=123456 \

--mysql-conf database-name=source_db \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hive-metastore:9083 \

--table-conf bucket=4 \

--table-conf changelog-producer=input \

--table-conf sink.parallelism=4 \

--including-tables 'product|user|address'

Later, we want the job to also synchronize the tables [order, custom], which contain historical data. We can achieve this by restoring the job from one of its previous snapshots, thereby reusing its existing state. The resumed job will first take a snapshot of the newly added tables and then automatically continue reading the changelog from where it left off.

The command to restore from a previous snapshot and add a new table for synchronization is as follows:

<FLINK_HOME>/bin/flink run \

--fromSavepoint savepointPath \

/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

mysql-sync-database \

--warehouse hdfs:///path/to/warehouse \

--database test_db \

--mysql-conf hostname=127.0.0.1 \

--mysql-conf username=root \

--mysql-conf password=123456 \

--mysql-conf database-name=source_db \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hive-metastore:9083 \

--table-conf bucket=4 \

--including-tables 'product|user|address|order|custom'

2.8.2 Kafka

Flink provides several Kafka CDC formats: canal-json, debezium-json, ogg-json, maxwell-json. If the messages in a Kafka topic are change events captured from another database by a change data capture (CDC) tool, you can use Paimon's Kafka CDC support to write the parsed INSERT, UPDATE and DELETE messages into Paimon tables. See Paimon's official website for the full list of supported formats.

Add Kafka connector:

cp flink-sql-connector-kafka-1.17.0.jar /opt/module/flink-1.17.0/lib

Restart the yarn-session cluster and sql-client.

2.8.2.1 Synchronization table

1) Syntax description

Synchronize one or more tables in a Kafka topic to a Paimon table.

<FLINK_HOME>/bin/flink run \

/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

kafka-sync-table \

--warehouse <warehouse-path> \

--database <database-name> \

--table <table-name> \

[--partition-keys <partition-keys>] \

[--primary-keys <primary-keys>] \

[--computed-column <'column-name=expr-name(args[, ...])'> [--computed-column ...]] \

[--kafka-conf <kafka-source-conf> [--kafka-conf <kafka-source-conf> ...]] \

[--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]] \

[--table-conf <paimon-table-sink-conf> [--table-conf <paimon-table-sink-conf> ...]]

Parameter Description

--warehouse: Paimon warehouse path.
--database: database name in the Paimon Catalog.
--table: Paimon table name.
--partition-keys: the partition keys of the Paimon table. If there are multiple partition keys, join them with commas, e.g. "dt,hh,mm".
--primary-keys: the primary key of the Paimon table. If there are multiple primary key columns, join them with commas, e.g. "buyer_id,seller_id".
--computed-column: computed column definitions. The referenced fields come from the Kafka topic table's field names.
--kafka-conf: configuration of the Flink Kafka source. Each configuration should be specified in the format "key=value". properties.bootstrap.servers, topic, properties.group.id and value.format are required; the other configurations are optional.
--catalog-conf: configuration of the Paimon Catalog. Each configuration should be specified in the format "key=value".
--table-conf: configuration of the Paimon table sink. Each configuration should be specified in the format "key=value".

This action will automatically create the specified Paimon table if it does not exist. Its schema will be derived from the tables of all specified Kafka topics, using the earliest non-DDL data in the topic to parse the schema. If the Paimon table already exists, its schema will be compared with the schemas of all specified Kafka topic tables.

2) Case Practice

(1) Prepare data (canal-json format)

For convenience, insert canal-format data directly into the topic (single-table data from user_info):

kafka-console-producer.sh --broker-list hadoop102:9092 --topic paimon_canal

#Insert data as follows:

{“data”:[{“id”:“6”,“login_name”:“t7dk2h”,“nick_name”:“冰冰11”,“passwd”:null,“name”:“淳于冰”,“phone_num”:“13178654378”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“1997-12-08”,“gender”:null,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689150607000,“id”:1,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“冰冰”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151566836,“type”:“UPDATE”}

{“data”:[{“id”:“7”,“login_name”:“vihcj30p1”,“nick_name”:“豪心22”,“passwd”:null,“name”:“魏豪心”,“phone_num”:“13956932645”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“1991-06-07”,“gender”:“M”,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151623000,“id”:2,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“豪心”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151623139,“type”:“UPDATE”}

{“data”:[{“id”:“8”,“login_name”:“02r2ahx”,“nick_name”:“卿卿33”,“passwd”:null,“name”:“穆卿”,“phone_num”:“13412413361”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“2001-07-08”,“gender”:“F”,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151626000,“id”:3,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“卿卿”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151626863,“type”:“UPDATE”}

{“data”:[{“id”:“9”,“login_name”:“mjhrxnu”,“nick_name”:“武新44”,“passwd”:null,“name”:“罗武新”,“phone_num”:“13617856358”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“2001-08-08”,“gender”:null,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151630000,“id”:4,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“武新”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151630781,“type”:“UPDATE”}

{“data”:[{“id”:“10”,“login_name”:“kwua2155”,“nick_name”:“纨纨55”,“passwd”:null,“name”:“姜纨”,“phone_num”:“13742843828”,“email”:“[email protected]”,“head_img”:null,“user_level”:“3”,“birthday”:“1997-11-08”,“gender”:“F”,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151633000,“id”:5,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“纨纨”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151633697,“type”:“UPDATE”}

(2) Synchronize one Kafka topic (containing single-table data) to a Paimon table

bin/flink run \

  /opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

  kafka-sync-table \

  --warehouse hdfs://hadoop102:8020/paimon/hive \

  --database test \

  --table kafka_user_info_cdc \

  --primary-keys id \

  --kafka-conf properties.bootstrap.servers=hadoop102:9092 \

  --kafka-conf topic=paimon_canal \

--kafka-conf properties.group.id=atguigu \

--kafka-conf scan.startup.mode=earliest-offset \

  --kafka-conf value.format=canal-json \

  --catalog-conf metastore=hive \

  --catalog-conf uri=thrift://hadoop102:9083 \

  --table-conf bucket=4 \

  --table-conf changelog-producer=input \

  --table-conf sink.parallelism=4

2.8.2.2 Synchronizing the database

1) Syntax description

Synchronize multiple topics or one topic into one Paimon database.

<FLINK_HOME>/bin/flink run \

  /path/to/paimon-flink-action-0.5-SNAPSHOT.jar \

  kafka-sync-database

  --warehouse <warehouse-path> \

  --database <database-name> \

  [--schema-init-max-read <int>] \

  [--ignore-incompatible <true/false>] \

  [--table-prefix <paimon-table-prefix>] \

  [--table-suffix <paimon-table-suffix>] \

  [--including-tables <table-name|name-regular-expr>] \

  [--excluding-tables <table-name|name-regular-expr>] \

  [--kafka-conf <kafka-source-conf> [--kafka-conf <kafka-source-conf> ...]] \

  [--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]] \

  [--table-conf <paimon-table-sink-conf> [--table-conf <paimon-table-sink-conf> ...]]

Parameter Description:

--warehouse: Paimon warehouse path.
--database: database name in the Paimon Catalog.
--schema-init-max-read: if your tables all come from one topic, you can set this parameter to initialize the number of tables to be synchronized. The default value is 1000.
--ignore-incompatible: defaults to false; in that case, if a table name already exists in Paimon and the schemas are incompatible, an exception is thrown. You can explicitly set it to true to ignore incompatible tables and the exception.
--table-prefix: the prefix for all Paimon tables to be synchronized. For example, if you want all synchronized tables to be prefixed with "ods_", specify "--table-prefix ods_".
--table-suffix: the suffix for all Paimon tables to be synchronized. Usage is the same as "--table-prefix".
--including-tables: specifies which source tables to synchronize. Multiple tables must be separated with '|', for example 'a|b|c'. Regular expressions are supported; for example, "--including-tables test|paimon.*" synchronizes the table 'test' and all tables whose names start with "paimon".
--excluding-tables: specifies which source tables not to synchronize. Usage is the same as "--including-tables". If both are specified, "--excluding-tables" takes precedence over "--including-tables".
--kafka-conf: configuration of the Flink Kafka source. Each configuration should be specified in the format "key=value". properties.bootstrap.servers, topic, properties.group.id and value.format are required; the other configurations are optional. See the Kafka connector documentation for the complete list.
--catalog-conf: configuration of the Paimon Catalog. Each configuration should be specified in the format "key=value". See the Paimon documentation for the complete list of catalog configurations.
--table-conf: configuration of the Paimon table sink. Each configuration should be specified in the format "key=value". See the Paimon documentation for the complete list of table configurations.

Only tables with primary keys will be synchronized.

For each Kafka topic table to be synchronized, if the corresponding Paimon table does not exist, this action automatically creates it. Its schema is derived from the tables of all specified Kafka topics, using the earliest non-DDL data in the topic to parse the schema. If the Paimon table already exists, its schema is compared with the schemas of all specified Kafka topic tables.

2) Case Practice

(1) Prepare data (canal-json format)

For convenience, insert canal-format data directly into the topic (multi-table data from user_info and spu_info):

kafka-console-producer.sh --broker-list hadoop102:9092 --topic paimon_canal_2

# Insert the following data (be careful not to include blank lines):

{“data”:[{“id”:“6”,“login_name”:“t7dk2h”,“nick_name”:“冰冰11”,“passwd”:null,“name”:“淳于冰”,“phone_num”:“13178654378”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“1997-12-08”,“gender”:null,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689150607000,“id”:1,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“冰冰”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151566836,“type”:“UPDATE”}

{“data”:[{“id”:“7”,“login_name”:“vihcj30p1”,“nick_name”:“豪心22”,“passwd”:null,“name”:“魏豪心”,“phone_num”:“13956932645”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“1991-06-07”,“gender”:“M”,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151623000,“id”:2,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“豪心”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151623139,“type”:“UPDATE”}

{“data”:[{“id”:“8”,“login_name”:“02r2ahx”,“nick_name”:“卿卿33”,“passwd”:null,“name”:“穆卿”,“phone_num”:“13412413361”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“2001-07-08”,“gender”:“F”,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151626000,“id”:3,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“卿卿”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151626863,“type”:“UPDATE”}

{“data”:[{“id”:“9”,“login_name”:“mjhrxnu”,“nick_name”:“武新44”,“passwd”:null,“name”:“罗武新”,“phone_num”:“13617856358”,“email”:“[email protected]”,“head_img”:null,“user_level”:“1”,“birthday”:“2001-08-08”,“gender”:null,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151630000,“id”:4,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“武新”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151630781,“type”:“UPDATE”}

{“data”:[{“id”:“10”,“login_name”:“kwua2155”,“nick_name”:“纨纨55”,“passwd”:null,“name”:“姜纨”,“phone_num”:“13742843828”,“email”:“[email protected]”,“head_img”:null,“user_level”:“3”,“birthday”:“1997-11-08”,“gender”:“F”,“create_time”:“2022-06-08 00:00:00”,“operate_time”:null,“status”:null}],“database”:“gmall”,“es”:1689151633000,“id”:5,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“login_name”:“varchar(200)”,“nick_name”:“varchar(200)”,“passwd”:“varchar(200)”,“name”:“varchar(200)”,“phone_num”:“varchar(200)”,“email”:“varchar(200)”,“head_img”:“varchar(200)”,“user_level”:“varchar(200)”,“birthday”:“date”,“gender”:“varchar(1)”,“create_time”:“datetime”,“operate_time”:“datetime”,“status”:“varchar(200)”},“old”:[{“nick_name”:“纨纨”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“login_name”:12,“nick_name”:12,“passwd”:12,“name”:12,“phone_num”:12,“email”:12,“head_img”:12,“user_level”:12,“birthday”:91,“gender”:12,“create_time”:93,“operate_time”:93,“status”:12},“table”:“user_info”,“ts”:1689151633697,“type”:“UPDATE”}

{“data”:[{“id”:“12”,“spu_name”:“华为智慧屏 4K全面屏智能电视机1”,“description”:“华为智慧屏 4K全面屏智能电视机”,“category3_id”:“86”,“tm_id”:“3”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151648000,“id”:6,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“spu_name”:“华为智慧屏 4K全面屏智能电视机”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151648872,“type”:“UPDATE”}

{“data”:[{“id”:“3”,“spu_name”:“Apple iPhone 13”,“description”:“Apple iPhone 13”,“category3_id”:“61”,“tm_id”:“2”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151661000,“id”:7,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“spu_name”:“Apple iPhone 12”,“description”:“Apple iPhone 12”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151661828,“type”:“UPDATE”}

{“data”:[{“id”:“4”,“spu_name”:“HUAWEI P50”,“description”:“HUAWEI P50”,“category3_id”:“61”,“tm_id”:“3”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151669000,“id”:8,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“spu_name”:“HUAWEI P40”,“description”:“HUAWEI P40”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151669966,“type”:“UPDATE”}

{“data”:[{“id”:“1”,“spu_name”:“小米12sultra”,“description”:“小米12”,“category3_id”:“61”,“tm_id”:“1”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151700000,“id”:9,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“description”:“小米10”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151700998,“type”:“UPDATE”}

Also prepare a topic that contains only single-table data from spu_info:

kafka-console-producer.sh --broker-list hadoop102:9092 --topic paimon_canal_1

# Insert the following data:

{“data”:[{“id”:“12”,“spu_name”:“华为智慧屏 4K全面屏智能电视机1”,“description”:“华为智慧屏 4K全面屏智能电视机”,“category3_id”:“86”,“tm_id”:“3”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151648000,“id”:6,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“spu_name”:“华为智慧屏 4K全面屏智能电视机”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151648872,“type”:“UPDATE”}

{“data”:[{“id”:“3”,“spu_name”:“Apple iPhone 13”,“description”:“Apple iPhone 13”,“category3_id”:“61”,“tm_id”:“2”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151661000,“id”:7,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“spu_name”:“Apple iPhone 12”,“description”:“Apple iPhone 12”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151661828,“type”:“UPDATE”}

{“data”:[{“id”:“4”,“spu_name”:“HUAWEI P50”,“description”:“HUAWEI P50”,“category3_id”:“61”,“tm_id”:“3”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151669000,“id”:8,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“spu_name”:“HUAWEI P40”,“description”:“HUAWEI P40”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151669966,“type”:“UPDATE”}

{“data”:[{“id”:“1”,“spu_name”:“小米12sultra”,“description”:“小米12”,“category3_id”:“61”,“tm_id”:“1”,“create_time”:“2021-12-14 00:00:00”,“operate_time”:null}],“database”:“gmall”,“es”:1689151700000,“id”:9,“isDdl”:false,“mysqlType”:{“id”:“bigint”,“spu_name”:“varchar(200)”,“description”:“varchar(1000)”,“category3_id”:“bigint”,“tm_id”:“bigint”,“create_time”:“datetime”,“operate_time”:“datetime”},“old”:[{“description”:“小米10”}],“pkNames”:[“id”],“sql”:“”,“sqlType”:{“id”:-5,“spu_name”:12,“description”:12,“category3_id”:-5,“tm_id”:-5,“create_time”:93,“operate_time”:93},“table”:“spu_info”,“ts”:1689151700998,“type”:“UPDATE”}

(2) Synchronize one Kafka topic (containing multi-table data) to a Paimon database

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

kafka-sync-database \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table-prefix "t1_" \

--table-suffix "_cdc" \

--schema-init-max-read 500 \

--kafka-conf properties.bootstrap.servers=hadoop102:9092 \

--kafka-conf topic=paimon_canal_2 \

--kafka-conf properties.group.id=atguigu \

--kafka-conf scan.startup.mode=earliest-offset \

--kafka-conf value.format=canal-json \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hadoop102:9083 \

--table-conf bucket=4 \

--table-conf changelog-producer=input \

--table-conf sink.parallelism=4

(3) Synchronize multiple Kafka topics to a Paimon database

bin/flink run \

/opt/module/flink-1.17.0/opt/paimon-flink-action-0.5-20230703.002437-53.jar \

kafka-sync-database \

--warehouse hdfs://hadoop102:8020/paimon/hive \

--database test \

--table-prefix "t2_" \

--table-suffix "_cdc" \

--kafka-conf properties.bootstrap.servers=hadoop102:9092 \

--kafka-conf topic="paimon_canal;paimon_canal_1" \

--kafka-conf properties.group.id=atguigu \

--kafka-conf scan.startup.mode=earliest-offset \

--kafka-conf value.format=canal-json \

--catalog-conf metastore=hive \

--catalog-conf uri=thrift://hadoop102:9083 \

--table-conf bucket=4 \

--table-conf changelog-producer=input \

--table-conf sink.parallelism=4

2.8.3 Supported schema changes

The CDC integration supports a limited set of schema changes. Currently, the framework cannot drop columns, so DROP operations are ignored and RENAME adds a new column instead. The currently supported schema changes include:

(1) Adding columns.

(2) Changing column types:

changing from a string type (char, varchar, text) to another string type with a longer length,

changing from a binary type (binary, varbinary, blob) to another binary type with a longer length,

changing from an integer type (tinyint, smallint, int, bigint) to another integer type with a wider range,

changing from a floating point type (float, double) to another floating point type with a wider range.
