Streaming data lake platform Apache Paimon (6): Spark integration, inserting data with DML

4.4. Insert data

The INSERT statement inserts new rows into a table. The inserted rows can be given as value expressions or as the result of a query, consistent with standard SQL syntax.

INSERT INTO table_identifier [ part_spec ] [ column_list ] { value_expr | query }

part_spec

Optional. A comma-separated list of key-value pairs that specifies the target partition. Typed literals can be used (for example, date'2019-01-02').

Syntax: PARTITION (partition column name = partition column value [ , ... ] )

column_list

Optional. A comma-separated list of columns.

Syntax: (col_name1 [, col_name2, ...])

All specified columns must exist in the table and must not duplicate one another. The list must include every column except the static partition columns, and its size must exactly match the number of values supplied by the VALUES clause or the query.

value_expr

Specifies the values to insert. Either an explicitly specified value or NULL can be inserted. The values within a clause must be separated by commas. More than one set of values can be specified to insert multiple rows.

Syntax: VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ]

Note: writing nullable fields to not-null fields

A nullable column of one table cannot be inserted into a non-nullable column of another table. In Spark, the nvl function can be used to handle this. For example, key1 of table A is not null, and key2 of table B is nullable (the second argument of nvl is a placeholder for a non-null default of your choice):

INSERT INTO A (key1) SELECT nvl(key2, <non-null default>) FROM B;
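The defaulting idea can be tried out locally. The following is a minimal sketch that uses Python's built-in sqlite3 module rather than Spark or Paimon, so COALESCE stands in for nvl, and the tables A and B, their single columns, and the default value 0 are illustrative assumptions. SQLite rejects the NULL row when the statement runs, which may differ from the point at which Spark reports the mismatch.

```python
import sqlite3

def copy_with_default(default=0):
    """Copy B.key2 into the NOT NULL column A.key1, replacing NULLs with `default`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (key1 INTEGER NOT NULL)")
    conn.execute("CREATE TABLE B (key2 INTEGER)")  # key2 is nullable
    conn.executemany("INSERT INTO B (key2) VALUES (?)", [(1,), (None,), (3,)])

    # A direct copy fails because one source row is NULL.
    try:
        conn.execute("INSERT INTO A (key1) SELECT key2 FROM B")
        direct_copy_failed = False
    except sqlite3.IntegrityError:
        direct_copy_failed = True

    # COALESCE (playing the role of nvl here) supplies a non-null default.
    conn.execute("INSERT INTO A (key1) SELECT COALESCE(key2, ?) FROM B", (default,))
    rows = [r[0] for r in conn.execute("SELECT key1 FROM A ORDER BY key1")]
    conn.close()
    return direct_copy_failed, rows

failed, rows = copy_with_default(0)
```

Here the NULL in B.key2 ends up as the chosen default in A.key1, while the non-null values are copied unchanged.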

Example:

INSERT INTO tests VALUES(1,1,'order','2023-07-01','1'), (2,2,'pay','2023-07-01','2');

INSERT INTO tests_p SELECT * from tests;

4.5. Query data

Just like all other tables, Paimon tables can be queried using the SELECT statement.

A Paimon batch read returns all the data in one snapshot of the table. By default, a batch read returns the latest snapshot.

4.5.1 Time travel

Time travel can be done using VERSION AS OF and TIMESTAMP AS OF in queries.

1) Read the snapshot with the specified ID

SELECT * FROM tests VERSION AS OF 1;

SELECT * FROM tests VERSION AS OF 2;

2) Read the snapshot at the specified timestamp

-- View snapshot information

SELECT * FROM `tests$snapshots`;

SELECT * FROM tests TIMESTAMP AS OF '2023-07-03 15:34:20.123';

-- Timestamp specified in seconds (rounded up)

SELECT * FROM tests TIMESTAMP AS OF 1688369660;

3) Read the specified tag

SELECT * FROM tests VERSION AS OF 'my-tag';
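How the three forms pick a snapshot can be modelled in a few lines. This is only a sketch in plain Python, not Paimon code: the snapshot log, the tag mapping, and both function names are hypothetical, and the timestamp rule used here (take the latest snapshot committed at or before the given time) is an assumption that the text above does not spell out.

```python
from bisect import bisect_right

# Hypothetical snapshot log: (snapshot_id, commit_time_in_epoch_seconds), sorted by time.
SNAPSHOTS = [(1, 1688369600), (2, 1688369660), (3, 1688369720)]
# Hypothetical tag names mapped to the snapshot each one points at.
TAGS = {"my-tag": 2}

def resolve_version(version):
    """Model VERSION AS OF: an integer is a snapshot id, a string is a tag name."""
    if isinstance(version, int):
        if version not in {sid for sid, _ in SNAPSHOTS}:
            raise KeyError(f"no snapshot {version}")
        return version
    return TAGS[version]

def resolve_timestamp(epoch_seconds):
    """Model TIMESTAMP AS OF under the assumed 'latest snapshot at or before' rule."""
    times = [t for _, t in SNAPSHOTS]
    pos = bisect_right(times, epoch_seconds)
    if pos == 0:
        raise LookupError("no snapshot at or before the given timestamp")
    return SNAPSHOTS[pos - 1][0]
```

In this model, resolve_version("my-tag") and resolve_timestamp(1688369660) both land on snapshot 2, which is why querying the snapshots table first is useful when choosing a timestamp.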

4.5.2 Incremental query

Read incremental changes between the start snapshot (exclusive) and the end snapshot. For example, "3,5" indicates changes between snapshot 3 and snapshot 5:

spark.read()
    .format("paimon")
    .option("incremental-between", "3,5")
    .load("path/to/table")
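The range semantics can be illustrated with a small model in plain Python. This is a sketch, not the Paimon implementation: the function names and the sample data are hypothetical, and treating the end snapshot as inclusive is an assumption (the text only states that the start is exclusive).

```python
def parse_range(option_value):
    """Split an option string such as "3,5" into (start, end) integers."""
    start_text, end_text = option_value.split(",")
    return int(start_text), int(end_text)

def incremental_between(changes_by_snapshot, start, end):
    """Collect changes from snapshots with start < id <= end (end assumed inclusive)."""
    result = []
    for snapshot_id in sorted(changes_by_snapshot):
        if start < snapshot_id <= end:
            result.extend(changes_by_snapshot[snapshot_id])
    return result

# Hypothetical data: snapshot id -> rows written by that snapshot.
changes = {3: ["a"], 4: ["b"], 5: ["c", "d"], 6: ["e"]}
start, end = parse_range("3,5")
selected = incremental_between(changes, start, end)
```

With this data, the rows written by snapshot 3 are skipped because the start is exclusive, and only the rows from snapshots 4 and 5 are selected.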

4.6. System tables

System tables contain metadata and information about each table, such as snapshots created and options used. Users can access system tables through batch queries.

4.6.1 Snapshots Table

Through the snapshots table, you can query the snapshot history of the table, including the record counts of each snapshot. In Spark, the name must be wrapped in backticks, in the form `table_name$system_table_name`.

SELECT * FROM `tests$snapshots`;

By querying the snapshots table, you can learn about the table's commit and expiration information, and find the snapshots to use for time travel.

4.6.2 Schemas Table

The historical schema of the table can be queried through the schemas table.

SELECT * FROM `tests$schemas`;

The snapshots table and the schemas table can be joined to get the fields of a given snapshot.

SELECT s.snapshot_id, t.schema_id, t.fields
FROM `tests$snapshots` s JOIN `tests$schemas` t
ON s.schema_id = t.schema_id WHERE s.snapshot_id = 3;
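The shape of this join can be checked locally. The sketch below runs the same query against two stand-in tables in Python's built-in sqlite3 module, not against Paimon: the tables hold only the columns used by the query, and their contents are made up for illustration. SQLite happens to accept backtick-quoted names as well, so the quoting mirrors the Spark form.

```python
import sqlite3

def fields_for_snapshot(snapshot_id):
    """Return (snapshot_id, schema_id, fields) for one snapshot from stand-in tables."""
    conn = sqlite3.connect(":memory:")
    # Stand-ins for the system tables, reduced to the columns the join needs.
    conn.execute("CREATE TABLE `tests$snapshots` (snapshot_id INTEGER, schema_id INTEGER)")
    conn.execute("CREATE TABLE `tests$schemas` (schema_id INTEGER, fields TEXT)")
    conn.executemany(
        "INSERT INTO `tests$snapshots` VALUES (?, ?)", [(1, 0), (2, 0), (3, 1)]
    )
    conn.executemany(
        "INSERT INTO `tests$schemas` VALUES (?, ?)",
        [(0, "id,name"), (1, "id,name,dt")],
    )
    row = conn.execute(
        "SELECT s.snapshot_id, t.schema_id, t.fields "
        "FROM `tests$snapshots` s JOIN `tests$schemas` t "
        "ON s.schema_id = t.schema_id WHERE s.snapshot_id = ?",
        (snapshot_id,),
    ).fetchone()
    conn.close()
    return row
```

Because several snapshots can share one schema_id, the join returns the same fields for snapshots 1 and 2 in this sample and a different set for snapshot 3.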

4.6.3 Options Table

The options specified for the table in its DDL can be queried through the options table. Options that are not listed take their default values.

SELECT * FROM `tests$options`;

4.6.4 Audit log Table

If you need to audit the changelog of a table, you can use the audit_log system table. It exposes a rowkind column alongside the table's incremental data, and you can filter on this column or process it further to complete the audit.

rowkind has four values:

+I: insert operation.

-U: update operation, carrying the previous content of the updated row.

+U: update operation, carrying the new content of the updated row.

-D: delete operation.

SELECT * FROM `tests$audit_log`;
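To make the four rowkind values concrete, here is a minimal sketch in plain Python that filters a changelog by rowkind and replays it into the resulting table state. The record layout (rowkind, key, value), the function names, and the sample rows are hypothetical and are not the actual columns of the audit_log table.

```python
def only(changelog, *kinds):
    """Filter the changelog down to records with the given rowkind values."""
    return [record for record in changelog if record[0] in kinds]

def replay(changelog):
    """Apply (rowkind, key, value) records in order and return the resulting table."""
    table = {}
    for rowkind, key, value in changelog:
        if rowkind in ("+I", "+U"):
            table[key] = value        # insert, or new content of an update
        elif rowkind in ("-U", "-D"):
            table.pop(key, None)      # old content of an update, or a delete
        else:
            raise ValueError(f"unknown rowkind: {rowkind}")
    return table

# Hypothetical changelog rows: (rowkind, order_id, status).
log = [
    ("+I", 1, "order"),
    ("+I", 2, "order"),
    ("-U", 1, "order"),
    ("+U", 1, "pay"),
    ("-D", 2, "order"),
]
deletes = only(log, "-D")
final_state = replay(log)
```

An update always appears as a -U/+U pair, so an audit that only cares about what a row looked like before it changed can filter on -U and -D alone.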

4.6.5 Files Table

The files table lets you query the files of the table at a specific snapshot.

-- Query the files of the latest snapshot

SELECT * FROM `tests$files`;

4.6.6 Tags Table

Through the tags table, you can query the tag history of the table, including which snapshot each tag is based on and some historical information about those snapshots. You can also list all tag names and use a tag name to time travel to the data of that tag.

SELECT * FROM `tests$tags`;


Origin blog.csdn.net/xianyu120/article/details/132130940