Operating Hudi tables with Spark SQL in practice

Hudi table concepts

  • Table type

    • cow (copy-on-write)
    • mor (merge-on-read)
  • Partitioned and non-partitioned tables

    Spark SQL can create both partitioned and non-partitioned Hudi tables. To create a partitioned table, specify the partitioning columns with a partitioned by clause; a table created without a partitioned by clause is treated as non-partitioned.

  • Internal and external tables

    Spark SQL generally supports two kinds of tables, internal (managed) and external. If you specify a path with the location clause, or create the table explicitly with create external table, the table is external; otherwise it is considered internal.

Note:

  1. Starting from Hudi 0.10.0, primaryKey should be specified to indicate the primary key field when creating a Hudi table. If primaryKey is not specified, Hudi uses uuid as the primary key field name by default. We recommend always specifying the primary key field when creating a Hudi table.
  2. For a mor table, the preCombineField field must be specified to determine the ordering of records.
  3. If the table is partitioned, the configuration hoodie.datasource.write.hive_style_partitioning must be explicitly set to true; if it is non-partitioned, the configuration must be explicitly set to false.

Use the hudi catalog

After Hudi is installed for Spark in WDP, the hudi catalog is used by default, and metadata is saved to the Hive metastore when a table is created.
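
If you start a session yourself rather than through WDP, the Hudi session extension and catalog usually have to be configured at launch. A minimal sketch following the Hudi quickstart; the bundle coordinates are an assumption and must match your Spark and Hudi versions:

spark-sql \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.hudi.catalog.HoodieCatalog'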

Create non-partitioned internal tables

-- create a non-partitioned cow table with uuid as the primary key
create table hudi_cow_nonpcf_tbl (
  uuid int,
  name string,
  price double
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'uuid',
  hoodie.datasource.write.hive_style_partitioning = 'false'
);

-- create a non-partitioned mor table with id as the primary key and ts as the preCombine field
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.hive_style_partitioning = 'false'
);

Create partitioned external tables

create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.hive_style_partitioning = 'true',
  hoodie.datasource.hive_sync.mode = 'hms'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl';

create external table hudi_cow_pt_tbl_2 (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.hive_style_partitioning = 'true'
 )
partitioned by (dt, hh);

More information about the available tblproperties settings can be found in the official documentation: https://hudi.apache.org/docs/basic_configurations

CTAS

Hudi supports using CTAS (Create Table As Select) in Spark SQL to create Hudi tables.

Note: for better performance when loading data into Hudi tables, CTAS uses bulk insert as the write operation.

The following example uses CTAS to create a non-partitioned COW table without a preCombineField.

create table hudi_ctas_cow_nonpcf_tbl
using hudi
tblproperties (primaryKey = 'id')
as
select 1 as id, 'a1' as name, 10 as price;

Use CTAS to create a COW table with a primary key and partition fields

create table hudi_ctas_cow_pt_tbl
using hudi
tblproperties (type = 'cow', primaryKey = 'id', preCombineField = 'ts')
partitioned by (dt)
as
select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-12-01' as dt;

Use CTAS to load data from another table.

-- create a managed parquet table
create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet';

-- CTAS: load the data into a hudi table
create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (datestr) as select * from parquet_mngd;

Insert data

-- insert into non-partitioned table
insert into hudi_cow_nonpcf_tbl select 1, 'a1', 20;
insert into hudi_mor_tbl select 1, 'a1', 20, 1000;

-- insert dynamic partition
insert into hudi_cow_pt_tbl partition (dt, hh)
select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt, '10' as hh;

-- insert static partition
insert into hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='11') select 2, 'a2', 1000;

Note:

By default, insert into uses upsert as the write operation if the table has a preCombineField; otherwise it uses insert.

bulk_insert is also supported as the write operation; it only requires setting two configurations, hoodie.sql.bulk.insert.enable and hoodie.sql.insert.mode. For example:

-- upsert (the default) for a table with preCombineField
insert into hudi_mor_tbl select 1, 'a1_1', 20, 1001;
select id, name, price, ts from hudi_mor_tbl;
1   a1_1    20.0    1001

-- bulk_insert for a table with preCombineField
set hoodie.sql.bulk.insert.enable=true;
set hoodie.sql.insert.mode=non-strict;

insert into hudi_mor_tbl select 1, 'a1_2', 20, 1002;
select id, name, price, ts from hudi_mor_tbl;
1   a1_1    20.0    1001
1   a1_2    20.0    1002
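
After experimenting with bulk_insert, you may want to restore the default behavior. A minimal sketch, assuming the defaults (bulk insert disabled, upsert insert mode):

-- switch back to the default insert behavior
set hoodie.sql.bulk.insert.enable=false;
set hoodie.sql.insert.mode=upsert;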

Data query

select * from hudi_mor_tbl where price > 10;

Starting from 0.9.0, Hudi supports a built-in FileIndex, HoodieFileIndex, for querying Hudi tables, with support for partition pruning and metadata-table-based file listing, which helps improve query performance. It also supports non-global query paths, meaning users can query a table through its base path without appending "*" to the query path. This feature is enabled by default for non-global query paths; for global query paths, Hudi falls back to the old path handling. For more information about all supported table types and query types, see Table Types and Queries in the Hudi documentation.
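
For example, Spark SQL's direct-path syntax can read a Hudi table through its base path. A sketch, assuming the external table created earlier at /tmp/hudi/hudi_cow_pt_tbl; note that no trailing "*" is required:

-- query a hudi table directly by its base path
select id, name, dt, hh from hudi.`/tmp/hudi/hudi_cow_pt_tbl` where id = 1;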

Time travel query

Hudi has supported time travel queries since 0.9.0. Three timestamp formats are currently supported, as shown below.

Note: time travel queries require Spark 3.2+.

create table hudi_cow_pt_tbl_3 (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.hive_style_partitioning = 'true'
 )
partitioned by (dt, hh);


insert into hudi_cow_pt_tbl_3 select 1, 'a0', 1000, '2021-12-09', '10';
select * from hudi_cow_pt_tbl_3;

-- record id=1 changes `name`
insert into hudi_cow_pt_tbl_3 select 1, 'a1', 1001, '2021-12-09', '10';
select * from hudi_cow_pt_tbl_3;

-- time travel based on first commit time, assume `20220725112636518`
select * from hudi_cow_pt_tbl_3 timestamp as of '20220725112636518' where id = 1;
-- time travel based on different timestamp formats
select * from hudi_cow_pt_tbl_3 timestamp as of '2022-07-25 11:26:36.100' where id = 1;
select * from hudi_cow_pt_tbl_3 timestamp as of '2022-07-26' where id = 1;

Data update

Updating data is similar to inserting new data. Spark SQL supports two types of DML for updating Hudi tables: Merge-Into and Update.

Update

Syntax

UPDATE tableIdentifier SET column = EXPRESSION (, column = EXPRESSION) [ WHERE boolExpression ]

Example

update hudi_mor_tbl set price = price * 2, ts = 1111 where id = 1;

update hudi_cow_pt_tbl set name = 'a1_1', ts = 1001 where id = 1;

-- update using non-PK field
update hudi_cow_pt_tbl set ts = 1001 where name = 'a1';

Note: the update operation requires the table to have a preCombineField.

MergeInto

Syntax

MERGE INTO tableIdentifier AS target_alias
USING (sub_query | tableIdentifier) AS source_alias
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]

<merge_condition> = a boolean equality condition
<matched_action>  =
  DELETE  |
  UPDATE SET *  |
  UPDATE SET column1 = expression1 [, column2 = expression2 ...]
<not_matched_action>  =
  INSERT *  |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Example

-- source table using hudi for testing merging into non-partitioned table
create table merge_source (id int, name string, price double, ts bigint) using hudi
tblproperties (
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.hive_style_partitioning = 'false'
);

insert into merge_source values (1, "old_a1", 22.22, 900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000);

-- this statement goes through Hive JDBC, so the source table must set the sync mode to hms (hoodie.datasource.hive_sync.mode = 'hms')
merge into hudi_mor_tbl as target
using merge_source as source
on target.id = source.id
when matched then update set *
when not matched then insert *
;

-- source table using parquet for testing merging into partitioned table
create table merge_source2 (id int, name string, flag string, dt string, hh string) using parquet;
insert into merge_source2 values (1, "new_a1", 'update', '2021-12-09', '10'), (2, "new_a2", 'delete', '2021-12-09', '11'), (3, "new_a3", 'insert', '2021-12-09', '12');

merge into hudi_cow_pt_tbl as target
using (
  select id, name, '1000' as ts, flag, dt, hh from merge_source2
) source
on target.id = source.id
when matched and flag != 'delete' then
 update set id = source.id, name = source.name, ts = source.ts, dt = source.dt, hh = source.hh
when matched and flag = 'delete' then delete
when not matched then
 insert (id, name, ts, dt, hh) values(source.id, source.name, source.ts, source.dt, source.hh)
;

Data deletion

Syntax

DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]

Example

delete from hudi_cow_nonpcf_tbl where uuid = 1;

delete from hudi_mor_tbl where id % 2 = 0;

-- delete using non-PK field
delete from hudi_cow_pt_tbl where name = 'a1';

Insert Overwrite

For batch ETL jobs that recompute an entire target partition at once (rather than incrementally updating the target table), insert overwrite can be faster than upsert, because it completely bypasses the indexing, precombining and other repartitioning steps of the upsert write path.

Insert overwrite on a partitioned table uses the INSERT_OVERWRITE type of write operation, while a non-partitioned table uses INSERT_OVERWRITE_TABLE.

-- insert overwrite non-partitioned table
insert overwrite hudi_mor_tbl select 99, 'a99', 20.0, 900;
insert overwrite hudi_cow_nonpcf_tbl select 99, 'a99', 20.0;

-- insert overwrite partitioned table with dynamic partition
insert overwrite table hudi_cow_pt_tbl select 10, 'a10', 1100, '2021-12-09', '10';

-- insert overwrite partitioned table with static partition
insert overwrite hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100;

More Spark SQL commands

Alter Table

Schema evolution can be achieved through the ALTER TABLE command. Some basic examples are shown below.

Syntax:

-- Alter table name
ALTER TABLE oldTableName RENAME TO newTableName

-- Alter table add columns
ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)

-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType

-- Alter table properties
ALTER TABLE tableIdentifier SET TBLPROPERTIES (key = 'value')

Example

--rename to:
ALTER TABLE hudi_cow_nonpcf_tbl RENAME TO hudi_cow_nonpcf_tbl2;

--add column:
ALTER TABLE hudi_cow_nonpcf_tbl2 add columns(remark string);

--change column:
ALTER TABLE hudi_cow_nonpcf_tbl2 change column uuid uuid bigint;

--set properties:
alter table hudi_cow_nonpcf_tbl2 set tblproperties (hoodie.keep.max.commits = '10');

Partition SQL commands

Syntax:

-- Drop Partition
ALTER TABLE tableIdentifier DROP PARTITION ( partition_col_name = partition_col_val [ , ... ] )

-- Show Partitions
SHOW PARTITIONS tableIdentifier

Example

--show partition:
show partitions hudi_cow_pt_tbl;

--drop partition:
alter table hudi_cow_pt_tbl drop partition (dt='2021-12-09', hh='10');

Note that the result of show partitions is currently based on the filesystem table path, so it may be inaccurate after deleting a whole partition's data or dropping a partition directly.

Procedures

Syntax

--Call procedure by positional arguments
CALL system.procedure_name(arg_1, arg_2, ... arg_n)

--Call procedure by named arguments
CALL system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1, ... arg_name_n => arg_n)

Example

--show commit info
call show_commits(table => 'test_hudi_table', limit => 10);

The Call command already supports a number of commit procedures and table optimization procedures. Please refer to Procedures in the Hudi documentation for more details.
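
As an example of a table optimization procedure, clustering can be triggered through Call. A sketch, assuming the run_clustering procedure available in recent Hudi releases:

-- trigger clustering on a hudi table
call run_clustering(table => 'test_hudi_table');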

Origin: blog.csdn.net/weixin_39636364/article/details/128343038