Introduction and use of the Iceberg data lake (integrating Hive, Spark SQL, and Flink SQL)


Introduction

Overview

Netflix developed Iceberg to solve the problem of adapting data storage to different compute engines. It entered the Apache Incubator on November 16, 2018, graduated on May 19, 2020, and became an Apache top-level project.

Iceberg is an open table format for large-scale data analytics. A table format can be understood as a way of organizing metadata and data files: it sits below the compute frameworks (Flink, Spark, ...) and above the data files.

Purpose

The big data field has gone through a long period of development and exploration. Although the emergence and iteration of big data technologies have lowered the barrier to processing massive data, one problem cannot be ignored: data formats must be adapted to different engines.

That is, when we use different engines for computation, we need to adapt the data to each engine, which is quite tricky.

A new solution emerged for this: a middle layer between the upper compute engines and the underlying storage formats. This middle layer is not a storage format itself; it only defines how a table's metadata is organized and provides the engines with unified, table-like semantics similar to a traditional database. The underlying storage still uses formats such as Parquet and ORC. On this basis, Netflix developed Iceberg, which is now an Apache top-level project.

Features

Pluggable data storage and compute engines

Iceberg provides an open, universal table format implementation that is not tied to any particular data store or compute engine. The common big data stores (HDFS, S3, ...) and compute engines (Flink, Spark, ...) can all connect to Iceberg.

In a production environment, components can be mixed and matched freely. You can even read the data on the file system directly without going through a compute engine.

Real-time streaming and batch integration

As soon as an upstream component finishes writing data to Iceberg, downstream components can read and query it promptly, which satisfies near-real-time scenarios. Iceberg also provides streaming/batch read interfaces and streaming/batch write interfaces, so streaming data and batch data can be processed in the same pipeline, greatly simplifying the ETL chain.

Table evolution (Table Evolution)

Iceberg can perform table-level evolution through SQL. The cost of these operations is extremely low: there is no time-consuming, laborious reading, rewriting, or migration of data.

For example, in Hive, if we need to change a table partitioned by day into one partitioned by hour, we cannot modify the original table directly. We can only create a new table partitioned by hour and insert the data into it. Moreover, even if we rename the new table to the original name so that upstream applications keep using it, their SQL may still need to be modified because the partition field changed, which is very cumbersome.

Schema Evolution

Iceberg supports the following modes of evolution:

  • ADD: Add new columns to a table or nested structure

  • Drop: Remove a column from a table or nested structure

  • Rename: Rename a column in a table or in a nested structure

  • Update: Widen the type of a column or of a field inside a complex structure (struct, map<key, value>, list), for example from int to long.

  • Reorder: Change the order of columns or fields in a nested structure

Iceberg guarantees that schema evolution is an independent, side-effect-free operation. It is a metadata-only change and does not rewrite data files. Specifically:

  • An added column never reads existing values from another column.

  • Deleting a column or field in a nested structure does not change the value of any other column.

  • When you update a column or field in a nested structure, it does not change the value of any other column.

  • Changing the order of columns or fields in a nested structure will not change the associated values.

Iceberg uses a unique ID to track each column in the table. A newly added column is assigned a new unique ID, and IDs that have already been used are never reused.

Locating columns by name or by position causes problems: names can collide after renames, and with positions the column order cannot be changed and obsolete fields cannot be deleted.
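
To make this concrete, the following is a minimal sketch of schema evolution in Spark SQL (the table name evolution_demo is illustrative, and hadoop_prod is the catalog configured in the Spark section later in this article); every statement is a metadata-only change:

CREATE TABLE hadoop_prod.default.evolution_demo (id bigint, name string) USING iceberg   -- illustrative table

-- add a column without rewriting existing data files
ALTER TABLE hadoop_prod.default.evolution_demo ADD COLUMNS (age int COMMENT 'added later')
-- rename a column; the column keeps its ID, so data files are untouched
ALTER TABLE hadoop_prod.default.evolution_demo RENAME COLUMN name TO user_name
-- widen a type (safe promotions only, e.g. int to bigint)
ALTER TABLE hadoop_prod.default.evolution_demo ALTER COLUMN age TYPE bigint
-- drop a column; other columns keep their values
ALTER TABLE hadoop_prod.default.evolution_demo DROP COLUMN age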

Partition Evolution

Iceberg partitioning can be modified directly on an existing table, because Iceberg's query process does not depend directly on the partition layout.

When we change a table's partitioning strategy, data written before the change is not touched and keeps its old partition layout, while new data uses the new strategy. In other words, the same table carries two partitioning strategies: old data under the old one and new data under the new one. In the metadata, the two strategies are independent of each other and do not overlap.

When a query spans both partitioning strategies, it is split into two separate execution plans, as shown in the figure from Iceberg's official website:

[Image: partition layout of booking_table, with monthly partitions for 2008 and daily partitions from 2009]

The booking_table in the figure was partitioned by month in 2008 and switched to daily partitioning starting in 2009. Both partitioning strategies coexist in the table.

Thanks to Iceberg's hidden partitioning, SQL queries do not need to specify partition filter conditions; Iceberg automatically resolves the partitions and filters out unnecessary data.

The Iceberg partition evolution operation is also a metadata operation and will not rewrite data files.
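
As an illustration (using the Spark SQL extension syntax shown later in this article; the table name booking_table and column event_ts are hypothetical), switching from monthly to daily partitioning is a single metadata statement:

-- replace the month transform with a day transform; existing files keep the old layout
ALTER TABLE hadoop_prod.default.booking_table REPLACE PARTITION FIELD months(event_ts) WITH days(event_ts)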

Sort Order Evolution

Iceberg can change the sort order on an existing table. After the change, old data keeps its old sort order. Engines writing to Iceberg always pick the latest sort order, but may skip sorting when its cost is prohibitively high.
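
A minimal sketch of changing the sort order (Spark SQL with the Iceberg SQL extensions covered in the Spark section below; sorted_demo is an illustrative table name):

-- new writes will be ordered by category, then id; existing files keep their old order
ALTER TABLE hadoop_prod.default.sorted_demo WRITE ORDERED BY category ASC, id DESC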

Hidden Partition

Iceberg's partition information does not need to be maintained manually; it can be hidden. Unlike Hive-style partitioning, Iceberg's partition field/strategy (computed from a column) does not have to be a column of the table, nor is it tied to the table's data storage directory. After the table is created or the partitioning strategy is changed, new data automatically has its partition computed. When querying, you do not need to know which fields or strategy the table is partitioned by; you only focus on the business logic, and Iceberg automatically filters out the unneeded partition data.

It is precisely because Iceberg's partition information is independent of the table's data storage directory that table partitions can be modified without any data migration.
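
A short Spark SQL sketch of hidden partitioning (the logs table is illustrative): the table is partitioned by a transform of ts, and queries never name a partition column; Iceberg derives the partition filter from the predicate on ts.

CREATE TABLE hadoop_prod.default.logs (      -- illustrative table
    id bigint,
    data string,
    ts timestamp)
USING iceberg
PARTITIONED BY (days(ts))

-- no partition column appears in the query; Iceberg prunes day partitions from the ts predicate
SELECT * FROM hadoop_prod.default.logs
WHERE ts BETWEEN TIMESTAMP '2023-02-01 00:00:00' AND TIMESTAMP '2023-02-02 00:00:00'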

Snapshot queries (Time Travel)

Iceberg can query a snapshot of a table as it existed at some point in its history. With this feature, the latest SQL logic can be applied to historical data.
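
For example, a time-travel query might look like the following sketch (Spark SQL syntax; TIMESTAMP AS OF / VERSION AS OF assume a recent Spark 3.3 + Iceberg 1.x combination, and the snapshot id is the example id used later in this article):

-- query the table as of a wall-clock time (timestamp value is illustrative)
SELECT * FROM hadoop_prod.default.a TIMESTAMP AS OF '2023-02-03 10:00:00'
-- query the table as of a specific snapshot id
SELECT * FROM hadoop_prod.default.a VERSION AS OF 7601163594701794741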

Support transactions (ACID)

By providing an ACID transaction mechanism, Iceberg gains upsert capability and allows reading while writing, so data can be consumed by downstream components sooner. Transactions ensure that downstream components only see committed data and never read partial or uncommitted data.

Concurrency support based on optimistic locking

Based on optimistic locking, Iceberg provides the ability for multiple programs to write concurrently and ensures linear consistency of data.

File-level data pruning

Iceberg's metadata records statistics for each data file, such as per-column minimum and maximum values and record counts. Therefore, in addition to conventional partition and column filtering, a query's filter conditions can be pushed down to the file level, which greatly speeds up queries.
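
For instance, in the sketch below (table name as created later in the Spark section), the range predicate on id can be checked against each file's min/max statistics, so files whose id range does not overlap [100, 200] are skipped without being opened:

SELECT id, data
FROM hadoop_prod.default.sample2
WHERE id BETWEEN 100 AND 200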

Comparison with other data lake frameworks

[Images: comparison of data lake frameworks]

Storage structure

[Images: Iceberg storage structure]

Data files

Data files are the files in which an Apache Iceberg table actually stores data, usually located under the data directory of the table's storage path. If the file format is Parquet, the files end with ".parquet".

For example: 00000-0-atguigu_20230203160458_22ee74c9-643f-4b27-8fc1-9cbd5f64dad4-job_1675409881387_0007-00001.parquet is a data file.

Iceberg generates multiple data files for each update.

Table snapshot (Snapshot)

A snapshot represents the state of a table at a certain moment in time. Each snapshot lists all of the table's data files at that moment. Data files are tracked in manifest files, manifest files are listed in a manifest list file, and one manifest list file represents one snapshot.

Manifest list

The manifest list is a metadata file that lists the manifest files making up a table snapshot. It stores a list of manifest files, one per entry. Each entry records the manifest file's path, the partition range of the data files it tracks, how many data files were added, how many were deleted, and other information that can be used to speed up filtering during queries.

For example: snap-6746266566064388720-1-52f2f477-2585-4e69-be42-bbad9a46ed17.avro is a Manifest List file.

Manifest file

The manifest file is also a metadata file; it lists the data files that make up a snapshot. Each entry describes one data file in detail, including its status, file path, partition information, column-level statistics (such as each column's minimum and maximum values and null counts), file size, and number of rows. The column-level statistics make it possible to skip unnecessary files when scanning table data.

The Manifest file is stored in the avro format and ends with the ".avro" suffix, for example: 52f2f477-2585-4e69-be42-bbad9a46ed17-m0.avro.
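
These structures can be inspected through Iceberg's metadata tables; the following is a hedged Spark SQL sketch (catalog and table names follow the examples later in this article):

-- one manifest list per snapshot
SELECT snapshot_id, manifest_list FROM hadoop_prod.default.a.snapshots
-- manifest files of the current snapshot, with data file counts
SELECT path, added_data_files_count FROM hadoop_prod.default.a.manifests
-- data files, with per-file statistics
SELECT file_path, record_count FROM hadoop_prod.default.a.files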

Integrate with Hive

Environmental preparation

(1) The version correspondence between Hive and Iceberg is as follows

Hive version Officially recommended Hive version Iceberg version
2.x 2.3.8 0.8.0-incubating – 1.1.0
3.x 3.1.2 0.10.0 – 1.1.0

Iceberg's integration with Hive 2 and Hive 3.1.2/3 supports the following features:

  • Create table

  • Delete table

  • Read table

  • INSERT into table

More features require Hive 4.x (currently in alpha).

(2) Upload the jar package and copy it to the auxlib directory of Hive

mkdir auxlib
cp iceberg-hive-runtime-1.1.0.jar /opt/module/hive/auxlib
cp libfb303-0.9.3.jar /opt/module/hive/auxlib

(3) Modify hive-site.xml and add configuration items

<property>
    <name>iceberg.engine.hive.enabled</name>
    <value>true</value>
</property>

<property>
    <name>hive.aux.jars.path</name>
    <value>/opt/module/hive/auxlib</value>
</property>

Notes on using TEZ engine:

  • When using Hive version >= 3.1.2, Tez version >= 0.10.1 is required

  • Specify tez update configuration:

    <property>
        <name>tez.mrreader.config.update.properties</name>
        <value>hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids</value>
    </property>
    
  • Starting from Iceberg 0.11.0, if Hive uses the Tez engine, vectorization execution needs to be turned off:

    <property>
        <name>hive.vectorized.execution.enabled</name>
        <value>false</value>
    </property>
    

(4) Start HMS service

(5) Start Hadoop

Create and manage catalogs

Iceberg supports many different catalog types, such as: Hive, Hadoop, Amazon's AWS Glue, and custom catalogs.

According to different configurations, there are three situations:

  • iceberg.catalog is not set: HiveCatalog is used by default.

  • iceberg.catalog is set to a catalog name: the specified catalog type is used, configured with the following properties:

    iceberg.catalog.<catalog_name>.type          Catalog type: hive or hadoop; leave unset when using a custom catalog
    iceberg.catalog.<catalog_name>.catalog-impl  Catalog implementation class; must be set if the type above is not set
    iceberg.catalog.<catalog_name>.<key>         Other catalog configuration items

  • iceberg.catalog=location_based_table: load the Iceberg table directly from the specified root path.

HiveCatalog is used by default

CREATE TABLE iceberg_test1 (i int) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
 
INSERT INTO iceberg_test1 values(1);

Looking at HDFS, you can find that the table directory is under the default hive warehouse path.

Specify Catalog type

(1) Using HiveCatalog

set iceberg.catalog.iceberg_hive.type=hive;
set iceberg.catalog.iceberg_hive.uri=thrift://hadoop1:9083;
set iceberg.catalog.iceberg_hive.clients=10;
set iceberg.catalog.iceberg_hive.warehouse=hdfs://hadoop1:8020/warehouse/iceberg-hive;

CREATE TABLE iceberg_test2 (i int) 
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES('iceberg.catalog'='iceberg_hive');
 
INSERT INTO iceberg_test2 values(1);

(2) Use HadoopCatalog

set iceberg.catalog.iceberg_hadoop.type=hadoop;
set iceberg.catalog.iceberg_hadoop.warehouse=hdfs://hadoop1:8020/warehouse/iceberg-hadoop;

CREATE TABLE iceberg_test3 (i int) 
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs://hadoop1:8020/warehouse/iceberg-hadoop/default/iceberg_test3'
TBLPROPERTIES('iceberg.catalog'='iceberg_hadoop');

INSERT INTO iceberg_test3 values(1);

Specify path to load

If an Iceberg-format table already exists on HDFS, we can map its data by creating an Iceberg table in Hive that points to the corresponding location path.

CREATE EXTERNAL TABLE iceberg_test4 (i int)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://hadoop1:8020/warehouse/iceberg-hadoop/default/iceberg_test3'
TBLPROPERTIES ('iceberg.catalog'='location_based_table');

Basic operations

Create table

(1) Create an external table

CREATE EXTERNAL TABLE iceberg_create1 (i int) 
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

describe formatted iceberg_create1;

(2) Create an internal table

CREATE TABLE iceberg_create2 (i int) 
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

describe formatted iceberg_create2;

(3) Create partition table

CREATE EXTERNAL TABLE iceberg_create3 (id int,name string)
PARTITIONED BY (age int)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

describe formatted iceberg_create3;

Creating a partitioned table with Hive syntax does not create partitions in HMS; instead, the partition columns are converted into Iceberg identity partitions. In this case Iceberg's partition transforms, such as days(timestamp), cannot be used. If you want an Iceberg table that uses partition transforms, you need to create the table with the Spark or Flink engine.

Modify table

Only HiveCatalog tables support modifying table properties; the Iceberg table properties and the Hive table properties are kept in sync in HMS.

ALTER TABLE iceberg_create1 SET TBLPROPERTIES('external.table.purge'='FALSE');

Insert into table

Supports standard single table INSERT INTO operation:

INSERT INTO iceberg_create2 VALUES (1);
INSERT INTO iceberg_create1 select * from iceberg_create2;

In Hive 3.x, INSERT OVERWRITE can be executed, but it actually appends data rather than overwriting.

Delete table

DROP TABLE iceberg_create1;

Integration with Spark SQL

Environmental preparation

(1) Install Spark

1) The version correspondence between Spark and Iceberg is as follows

Spark version Iceberg version
2.4 0.7.0-incubating – 1.1.0
3.0 0.9.0 – 1.0.0
3.1 0.12.0 – 1.1.0
3.2 0.13.0 – 1.1.0
3.3 0.14.0 – 1.1.0

2) Upload and decompress the Spark installation package

tar -zxvf spark-3.3.1-bin-hadoop3.tgz -C /opt/module/
mv /opt/module/spark-3.3.1-bin-hadoop3 /opt/module/spark-3.3.1

3) Configure environment variables

sudo vim /etc/profile.d/my_env.sh

export SPARK_HOME=/opt/module/spark-3.3.1
export PATH=$PATH:$SPARK_HOME/bin

source /etc/profile.d/my_env.sh

4) Copy iceberg’s jar package to Spark’s jars directory

cp /opt/software/iceberg/iceberg-spark-runtime-3.3_2.12-1.1.0.jar /opt/module/spark-3.3.1/jars

(2) Start Hadoop

Spark configuration catalog

Spark supports two catalog types: hive and hadoop. A Hive catalog stores Iceberg tables under Hive's default warehouse path, while a Hadoop catalog requires explicitly specifying the Iceberg table storage path.

vim spark-defaults.conf

Hive Catalog

spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type = hive
spark.sql.catalog.hive_prod.uri = thrift://hadoop1:9083

use hive_prod.db;

Hadoop Catalog

spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://hadoop1:8020/warehouse/spark-iceberg

use hadoop_prod.db;

SQL operations

Create table

use hadoop_prod;
create database default;
use default;

CREATE TABLE hadoop_prod.default.sample1 (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

The following optional clauses are supported:

  • PARTITIONED BY (partition-expressions): Configure partitions

  • LOCATION '(fully-qualified-uri)' : Specify table path

  • COMMENT 'table documentation': Configure the table comment

  • TBLPROPERTIES ('key'='value', …): Configure table properties

Table properties: https://iceberg.apache.org/docs/latest/configuration/

Every change to an Iceberg table generates a new metadata file (json file) to provide atomicity. By default, old metadata files are saved as history files and are not deleted.

To clean up metadata files automatically, set write.metadata.delete-after-commit.enabled=true in the table properties. This keeps a limited number of metadata files (up to write.metadata.previous-versions-max) and deletes the oldest metadata file each time a new one is created.
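
A sketch of enabling this on an existing table (property names as documented by Iceberg; the retention count of 10 is just an example):

ALTER TABLE hadoop_prod.default.sample1 SET TBLPROPERTIES (
    'write.metadata.delete-after-commit.enabled'='true',
    'write.metadata.previous-versions-max'='10'
)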

(1) Create partition table

1) Partition table

CREATE TABLE hadoop_prod.default.sample2 (
    id bigint,
    data string,
    category string)
USING iceberg
PARTITIONED BY (category)

2) Create a hidden partition table

CREATE TABLE hadoop_prod.default.sample3 (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

Supported conversions are:

  • years(ts): divided by year

  • months(ts): divided by month

  • days(ts) or date(ts): equivalent to dateint partitioning

  • hours(ts) or date_hour(ts): Equivalent to dateint and hour partitioning

  • bucket(N, col): divide mod N buckets by hash value

  • truncate(L, col): partition by the value truncated to length L

    Strings are truncated to the given length; integers and longs are truncated to bins: truncate(10, i) produces partitions 0, 10, 20, 30, …

(2) Use CTAS syntax to create tables

CREATE TABLE hadoop_prod.default.sample4
USING iceberg
AS SELECT * from hadoop_prod.default.sample3

If no partitioning is specified, the new table is unpartitioned; the partition spec and table properties need to be specified again:

CREATE TABLE hadoop_prod.default.sample5
USING iceberg
PARTITIONED BY (bucket(8, id), hours(ts), category)
TBLPROPERTIES ('key'='value')
AS SELECT * from hadoop_prod.default.sample3

(3) Use Replace table to create a table

REPLACE TABLE hadoop_prod.default.sample5
USING iceberg
AS SELECT * from hadoop_prod.default.sample3

REPLACE TABLE hadoop_prod.default.sample5
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT * from hadoop_prod.default.sample3


CREATE OR REPLACE TABLE hadoop_prod.default.sample6
USING iceberg
AS SELECT * from hadoop_prod.default.sample3

Delete table

For HadoopCatalog, running DROP TABLE will remove the table from the catalog and delete the table contents.

CREATE EXTERNAL TABLE hadoop_prod.default.sample7 (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

INSERT INTO hadoop_prod.default.sample7 values(1,'a')
DROP TABLE hadoop_prod.default.sample7

For HiveCatalog:

  • Prior to 0.14, running DROP TABLE would drop the table from the catalog and delete the table contents.

  • Starting from 0.14, DROP TABLE only removes the table from the catalog and does not delete the data. To delete the table contents as well, use DROP TABLE ... PURGE.

CREATE TABLE hive_prod.default.sample7 (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

INSERT INTO hive_prod.default.sample7 values(1,'a')

(1) Delete table

DROP TABLE hive_prod.default.sample7

(2) Delete tables and data

DROP TABLE hive_prod.default.sample7 PURGE

Modify table

Iceberg fully supports ALTER TABLE in Spark 3, including:

  • Rename table

  • Set or delete table properties

  • Add, remove and rename columns

  • Add, remove and rename nested fields

  • Reorder top-level columns and nested structure fields

  • Expand the types of int, float and decimal fields

  • Change required columns to optional columns

Additionally, SQL extensions can be used to add support for partition evolution and set the write order of tables.

CREATE TABLE hive_prod.default.sample1 (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

(1) Rename the table (renaming HadoopCatalog tables is not supported)

ALTER TABLE hive_prod.default.sample1 RENAME TO hive_prod.default.sample2

(2) Modify table attributes

  • Modify table properties

    ALTER TABLE hive_prod.default.sample1 SET TBLPROPERTIES (
        'read.split.target-size'='268435456'
    )
    
    ALTER TABLE hive_prod.default.sample1 SET TBLPROPERTIES (
        'comment' = 'A table comment.'
    )
    
  • Delete table attributes

    ALTER TABLE hive_prod.default.sample1 UNSET TBLPROPERTIES ('read.split.target-size')
    

(3) Add columns

ALTER TABLE hadoop_prod.default.sample1
ADD COLUMNS (
    category string comment 'new_column'
)

-- Add a struct-typed column
ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN point struct<x: double, y: double>;

-- Add a field to the struct-typed column
ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN point.z double

-- Add a column that is an array of structs
ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN points array<struct<x: double, y: double>>;

-- Add a field to the structs inside the array. Use the keyword 'element' to access the array's element column.
ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN points.element.z double

-- Add a map-typed column whose key and value are both structs
ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN pointsm map<struct<x: int>, struct<a: int>>;

-- Add a field to the struct of the map's value
ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN pointsm.value.b int

In Spark 2.4.4 and later, columns can be added anywhere by adding a FIRST or AFTER clause:

ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN new_column1 bigint AFTER id

ALTER TABLE hadoop_prod.default.sample1
ADD COLUMN new_column2 bigint FIRST

(4) Modify columns

  • Modify column name

    ALTER TABLE hadoop_prod.default.sample1 RENAME COLUMN data TO data1
    
  • Alter Column: modify the column type (only safe conversions are allowed)

    ALTER TABLE hadoop_prod.default.sample1
    ADD COLUMNS (
        idd int
      )
    ALTER TABLE hadoop_prod.default.sample1 ALTER COLUMN idd TYPE bigint
    
  • Alter Column: modify the column comment

    ALTER TABLE hadoop_prod.default.sample1 ALTER COLUMN id TYPE double COMMENT 'a'
    ALTER TABLE hadoop_prod.default.sample1 ALTER COLUMN id COMMENT 'b'
    
  • Alter Column changes the order of columns

    ALTER TABLE hadoop_prod.default.sample1 ALTER COLUMN id FIRST
    ALTER TABLE hadoop_prod.default.sample1 ALTER COLUMN new_column2 AFTER new_column1
    
  • Alter Column modifies whether the column is allowed to be null

    ALTER TABLE hadoop_prod.default.sample1 ALTER COLUMN id DROP NOT NULL
    

    ALTER COLUMN is not used to update struct types. Use ADD COLUMN and DROP COLUMN to add or delete fields of type struct.

(5) Delete columns

ALTER TABLE hadoop_prod.default.sample1 DROP COLUMN idd
ALTER TABLE hadoop_prod.default.sample1 DROP COLUMN point.z

(6) Add partitions (Spark3, need to configure extensions)

vim spark-defaults.conf
spark.sql.extensions = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Re-enter spark-sql shell:

ALTER TABLE hadoop_prod.default.sample1 ADD PARTITION FIELD category 

ALTER TABLE hadoop_prod.default.sample1 ADD PARTITION FIELD bucket(16, id)
ALTER TABLE hadoop_prod.default.sample1 ADD PARTITION FIELD truncate(data, 4)
ALTER TABLE hadoop_prod.default.sample1 ADD PARTITION FIELD years(ts)

ALTER TABLE hadoop_prod.default.sample1 ADD PARTITION FIELD bucket(16, id) AS shard

(7) Delete partition (Spark3, need to configure extension)

ALTER TABLE hadoop_prod.default.sample1 DROP PARTITION FIELD category
ALTER TABLE hadoop_prod.default.sample1 DROP PARTITION FIELD bucket(16, id)
ALTER TABLE hadoop_prod.default.sample1 DROP PARTITION FIELD truncate(data, 4)
ALTER TABLE hadoop_prod.default.sample1 DROP PARTITION FIELD years(ts)
ALTER TABLE hadoop_prod.default.sample1 DROP PARTITION FIELD shard

Note that despite dropping the partition, the column still exists in the table structure.

Dropping a partition field is a metadata operation and does not alter any existing table data. New data will be written to the new partitions, but existing data will remain in the old partition layout.

When the partitioning changes, dynamic partition overwrite behavior changes too. For example, if you switch from daily to hourly partitioning, overwrites will cover hourly partitions but no longer cover daily partitions.

Be careful when dropping partition fields, it may cause metadata queries to fail or produce different results.

(8) Modify the partition (Spark3, need to configure the extension)

ALTER TABLE hadoop_prod.default.sample1 REPLACE PARTITION FIELD bucket(16, id) WITH bucket(8, id)

(9) Modify the writing order of the table

ALTER TABLE hadoop_prod.default.sample1 WRITE ORDERED BY category, id
ALTER TABLE hadoop_prod.default.sample1 WRITE ORDERED BY category ASC, id DESC
ALTER TABLE hadoop_prod.default.sample1 WRITE ORDERED BY category ASC NULLS LAST, id DESC NULLS FIRST

The table writing order does not guarantee the data order of the query. It only affects the way data is written to the table.

WRITE ORDERED BY sets a global ordering, that is, an ordering of rows across tasks, just like using ORDER BY in an INSERT command:

INSERT INTO hadoop_prod.default.sample1
SELECT id, data, category, ts FROM another_table
ORDER BY ts, category

To sort within each task, rather than across tasks, use WRITE LOCALLY ORDERED BY:

ALTER TABLE hadoop_prod.default.sample1 WRITE LOCALLY ORDERED BY category, id

(10) Parallel writing by partition

ALTER TABLE hadoop_prod.default.sample1 WRITE DISTRIBUTED BY PARTITION
ALTER TABLE hadoop_prod.default.sample1 WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY category, id

Insert data

CREATE TABLE hadoop_prod.default.a (
    id bigint,
    count bigint)
USING iceberg

CREATE TABLE hadoop_prod.default.b (
    id bigint,
    count bigint,
    flag string)
USING iceberg

(1)Insert Into

INSERT INTO hadoop_prod.default.a VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO hadoop_prod.default.b VALUES (1, 1, 'a'), (2, 2, 'b'), (4, 4, 'd');

(2) MERGE INTO row-level update

MERGE INTO hadoop_prod.default.a t 
USING (SELECT * FROM hadoop_prod.default.b) u ON t.id = u.id
WHEN MATCHED AND u.flag='b' THEN UPDATE SET t.count = t.count + u.count
WHEN MATCHED AND u.flag='a' THEN DELETE
WHEN NOT MATCHED THEN INSERT (id,count) values (u.id,u.count)

Query data

(1) Ordinary query

SELECT count(1) as count, data
FROM local.db.table
GROUP BY data

(2) Query metadata

-- Query table snapshots
SELECT * FROM hadoop_prod.default.a.snapshots

-- Query data file information
SELECT * FROM hadoop_prod.default.a.files

-- Query table history
SELECT * FROM hadoop_prod.default.a.history

-- Query manifests
SELECT * FROM hadoop_prod.default.a.manifests

Stored procedures

Procedures can be invoked from any configured Iceberg catalog via CALL. All procedures are in the system namespace.

(1) Grammar

Pass parameters by name:

CALL catalog_name.system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1)

When passing parameters by position, only trailing parameters may be omitted, and only if they are optional.

CALL catalog_name.system.procedure_name(arg_1, arg_2, ... arg_n)

(2) Snapshot management

  • Roll back to the specified snapshot id

    CALL hadoop_prod.system.rollback_to_snapshot('default.a', 7601163594701794741)
    
  • Roll back to a snapshot at a specified time

    CALL hadoop_prod.system.rollback_to_timestamp('db.sample', TIMESTAMP '2021-06-30 00:00:00.000')
    
  • Set the current snapshot ID of the table

    CALL hadoop_prod.system.set_current_snapshot('db.sample', 1)
    
  • Cherry-pick a snapshot onto the current table state

    CALL hadoop_prod.system.cherrypick_snapshot('default.a', 7629160535368763452)
    CALL hadoop_prod.system.cherrypick_snapshot(snapshot_id => 7629160535368763452, table => 'default.a' )
    

(3) Metadata management

  • Delete snapshots older than the specified date and time, but keep the most recent 100 snapshots:

    CALL hive_prod.system.expire_snapshots('db.sample', TIMESTAMP '2021-06-30 00:00:00.000', 100)
    
  • Delete files in the Iceberg table that are not referenced in any metadata file

    -- List all candidate files to be removed (dry run)
    CALL catalog_name.system.remove_orphan_files(table => 'db.sample', dry_run => true)
    -- Remove any files in the given location that the db.sample table does not know about
    CALL catalog_name.system.remove_orphan_files(table => 'db.sample', location => 'tablelocation/data')
    
  • Merge data files (merge small files)

    CALL catalog_name.system.rewrite_data_files('db.sample')
    CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'id DESC NULLS LAST,name ASC NULLS FIRST')
    CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)')
    CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2'))
    CALL catalog_name.system.rewrite_data_files(table => 'db.sample', where => 'id = 3 and name = "foo"')
    
  • Rewrite table manifests to optimize the execution plan

    CALL catalog_name.system.rewrite_manifests('db.sample')
    
    -- Rewrite the manifests of table db.sample without using Spark caching; this avoids memory problems on the executors.
    CALL catalog_name.system.rewrite_manifests('db.sample', false)
    

(4) Table migration

  • Snapshot

    CALL catalog_name.system.snapshot('db.sample', 'db.snap')
    CALL catalog_name.system.snapshot('db.sample', 'db.snap', '/tmp/temptable/')
    
  • migrate

    CALL catalog_name.system.migrate('spark_catalog.db.sample', map('foo', 'bar'))
    CALL catalog_name.system.migrate('db.sample')
    
  • Add data file

    CALL spark_catalog.system.add_files(
        table => 'db.tbl',
        source_table => 'db.src_tbl',
        partition_filter => map('part_col_1', 'A')
    )
    
    CALL spark_catalog.system.add_files(
        table => 'db.tbl',
        source_table => '`parquet`.`path/to/table`'
    )
    

(5) Metadata information

  • Get the ancestor snapshots of the table's current snapshot

    CALL spark_catalog.system.ancestors_of('db.tbl')
    
  • Get all ancestor snapshots of the specified snapshot

    CALL spark_catalog.system.ancestors_of('db.tbl', 1)
    CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
    

DataFrame operations

Environmental preparation

(1) Create a maven project and configure the pom file

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.atguigu.iceberg</groupId>
    <artifactId>spark-iceberg-demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.3.1</spark.version>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <scope>provided</scope>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <scope>provided</scope>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.binary.version}</artifactId>
            <scope>provided</scope>
            <version>${spark.version}</version>
        </dependency>

        <!-- fastjson <= 1.2.80 has known security vulnerabilities -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.83</version>
        </dependency>


        <!-- https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-spark-runtime-3.3 -->
        <dependency>
            <groupId>org.apache.iceberg</groupId>
            <artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
            <version>1.1.0</version>
        </dependency>


    </dependencies>

    <build>
        <plugins>
            <!-- assembly packaging plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <archive>
                        <manifest>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>

            <!-- plugin required for Maven to compile Scala -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

(2) Configure Catalog

val spark: SparkSession = SparkSession.builder().master("local").appName(this.getClass.getSimpleName)
  //Specify a hive catalog named iceberg_hive
  .config("spark.sql.catalog.iceberg_hive", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg_hive.type", "hive")
  .config("spark.sql.catalog.iceberg_hive.uri", "thrift://hadoop1:9083")
  //    .config("iceberg.engine.hive.enabled", "true")
  //Specify a hadoop catalog named iceberg_hadoop
  .config("spark.sql.catalog.iceberg_hadoop", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg_hadoop.type", "hadoop")
  .config("spark.sql.catalog.iceberg_hadoop.warehouse", "hdfs://hadoop1:8020/warehouse/spark-iceberg")
  .getOrCreate()

Read table

(1) Load table

spark.read
.format("iceberg")
.load("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a")
.show()

or

// Only supported on Spark 3.0 and above
spark.table("iceberg_hadoop.default.a")
.show()

(2) Time travel: query at specified time

spark.read
    .option("as-of-timestamp", "499162860000")
    .format("iceberg")
.load("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a")
.show()

(3) Time travel: query by specifying snapshot id

spark.read
    .option("snapshot-id", 7601163594701794741L)
    .format("iceberg")
.load("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a")
.show()

(4) Incremental query

spark.read
.format("iceberg")
.option("start-snapshot-id", "10963874102873")
.option("end-snapshot-id", "63874143573109")
.load("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a")
.show()

Incremental reads only support tables whose data is written by appends; replace, overwrite, and delete operations are not supported.

Inspect tables

(1) Query metadata

spark.read.format("iceberg").load("iceberg_hadoop.default.a.files")
spark.read.format("iceberg").load("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a#files")

(2) Metadata table time travel query

spark.read
.format("iceberg")
.option("snapshot-id", 7601163594701794741L)
.load("iceberg_hadoop.default.a.files")

write table

(1) Create a case class and prepare a DataFrame

case class Sample(id:Int,data:String,category:String)

val df: DataFrame = spark.createDataFrame(Seq(Sample(1, "A", "a"), Sample(2, "B", "b"), Sample(3, "C", "c")))

(2) Insert data and create tables

df.writeTo("iceberg_hadoop.default.table1").create()

import spark.implicits._
df.writeTo("iceberg_hadoop.default.table1")
  .tableProperty("write.format.default", "orc")
  .partitionedBy($"category")
  .createOrReplace()

(3) Append

df.writeTo("iceberg_hadoop.default.table1").append()

(4) Dynamic partition overwrite

df.writeTo("iceberg_hadoop.default.table1").overwritePartitions()

(5) Static partition overwrite

import spark.implicits._
df.writeTo("iceberg_hadoop.default.table1").overwrite($"category" === "c")

(6) Insert into the partition table and sort within the partition

df.sortWithinPartitions("category")
    .writeTo("iceberg_hadoop.default.table1")
    .append()

Table maintenance

(1) Get Table object

1)HadoopCatalog

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

val conf = new Configuration()
val catalog = new HadoopCatalog(conf,"hdfs://hadoop1:8020/warehouse/spark-iceberg")
val table: Table = catalog.loadTable(TableIdentifier.of("db","table1"))

2)HiveCatalog

import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import java.util;

val catalog = new HiveCatalog()
catalog.setConf(spark.sparkContext.hadoopConfiguration)

val properties = new util.HashMap[String,String]()
properties.put("warehouse", "hdfs://hadoop1:8020/warehouse/spark-iceberg")
properties.put("uri", "thrift://hadoop1:9083")

catalog.initialize("hive", properties)
val table: Table = catalog.loadTable(TableIdentifier.of("db", "table1"))

(2) Snapshot expiration cleanup

Each write to an Iceberg table creates a new snapshot or version of the table. Snapshots can be used for time travel queries, or the table can be rolled back to any valid snapshot. It is recommended to set a snapshot expiration time, and old expired snapshots will be deleted from the metadata (no longer available for time travel queries).

// Expire snapshots older than 1 day
val tsToExpire: Long = System.currentTimeMillis() - (1000 * 60 * 60 * 24)

table.expireSnapshots()
  .expireOlderThan(tsToExpire)
  .commit()

Or use SparkActions to set expiration:

//SparkActions can run snapshot expiration for large tables in parallel
SparkActions.get()
  .expireSnapshots(table)
  .expireOlderThan(tsToExpire)
  .execute()

(3) Delete invalid files

In Spark and other distributed processing engines, task or job failures may leave files that are not referenced by table metadata, and in some cases normal snapshot expiration may not determine that the file is no longer needed and delete the file.

SparkActions
    .get()
    .deleteOrphanFiles(table)
    .execute()

(4) Merge small files

Too many data files mean more metadata is stored in the manifest files, and many small data files lead to an unnecessary amount of metadata and inefficient file-open costs.

SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("category", "a"))
    .option("target-file-size-bytes", 1024L.toString) //1KB
    .execute()

Integrate with Flink SQL

Apache Iceberg supports both Apache Flink's DataStream API and Table API.

Environmental preparation

(1) Install Flink

1) The version correspondence between Flink and Iceberg is as follows

Flink version Iceberg version
1.11 0.9.0 – 0.12.1
1.12 0.12.0 – 0.13.1
1.13 0.13.0 – 1.0.0
1.14 0.13.0 – 1.1.0
1.15 0.14.0 – 1.1.0
1.16 1.1.0 – 1.1.0

2) Upload and decompress the Flink installation package

tar -zxvf flink-1.16.0-bin-scala_2.12.tgz -C /opt/module/

3) Configure environment variables

sudo vim /etc/profile.d/my_env.sh
export HADOOP_CLASSPATH=`hadoop classpath`
source /etc/profile.d/my_env.sh

4) Copy iceberg’s jar package to Flink’s lib directory

cp /opt/software/iceberg/iceberg-flink-runtime-1.16-1.1.0.jar /opt/module/flink-1.16.0/lib

(2) Start Hadoop

(3) Start sql-client

1) Modify flink-conf.yaml configuration

vim /opt/module/flink-1.16.0/conf/flink-conf.yaml

classloader.check-leaked-classloader: false
taskmanager.numberOfTaskSlots: 4

state.backend: rocksdb
execution.checkpointing.interval: 30000
state.checkpoints.dir: hdfs://hadoop1:8020/ckps
state.backend.incremental: true

2) local mode

(1) Modify workers

vim /opt/module/flink-1.16.0/conf/workers
# This starts a local cluster with 3 TaskManagers
localhost
localhost
localhost

(2) Start Flink

/opt/module/flink-1.16.0/bin/start-cluster.sh

View webui: http://hadoop1:8081

(3) Start Flink’s sql-client

/opt/module/flink-1.16.0/bin/sql-client.sh embedded

Create and use Catalog

Syntax description

CREATE CATALOG <catalog_name> WITH (
  'type'='iceberg',
  '<config_key>'='<config_value>'
); 
  • type: must be iceberg. (must)

  • catalog-type: Two kinds of catalogs, hive and hadoop, are built in, and catalog-impl can also be used to customize the catalog. (optional)

  • catalog-impl: The fully qualified class name of a custom catalog implementation. Must be set if catalog-type is not set (a usage sketch follows this list). (optional)

  • property-version: A version number describing the property version. This attribute is available for backward compatibility in case the attribute format changes. The current property version is 1. (optional)

  • cache-enabled: Whether to enable catalog caching; the default value is true. (optional)

  • cache.expiration-interval-ms: How long (in milliseconds) catalog entries are kept in the local cache; a negative value such as -1 means no expiration, and 0 is not allowed. The default value is -1. (optional)
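
As a sketch of the catalog-impl option mentioned above, a custom catalog could be registered roughly like this (the implementation class com.example.MyCustomCatalog and the extra property are hypothetical):

CREATE CATALOG my_custom_catalog WITH (
  'type'='iceberg',
  'catalog-impl'='com.example.MyCustomCatalog',   -- hypothetical custom catalog class
  'my-additional-catalog-config'='my-value'       -- passed through to the custom catalog
);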

Hive Catalog

(1) Upload hive connector to flink’s lib

cp flink-sql-connector-hive-3.1.2_2.12-1.16.0.jar /opt/module/flink-1.16.0/lib/

(2) Start hive metastore service

hive --service metastore

(3) Create hive catalog

Restart the flink cluster and re-enter sql-client

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://hadoop1:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://hadoop1:8020/warehouse/iceberg-hive'
);

use catalog hive_catalog;

  • uri: thrift uri of Hive metastore. (required)

  • clients:Hive metastore client pool size, default is 2. (optional)

  • warehouse: data warehouse directory.

  • hive-conf-dir: The directory path containing the hive-site.xml configuration file. The value of hive.metastore.warehouse.dir in hive-site.xml will be overwritten by warehouse.

  • hadoop-conf-dir: Directory path containing core-site.xml and hdfs-site.xml configuration files.

Hadoop Catalog

Iceberg also supports directory-based catalogs in HDFS, which can be configured using 'catalog-type'='hadoop'.

CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://hadoop1:8020/warehouse/iceberg-hadoop',
  'property-version'='1'
);

use catalog hadoop_catalog;
  • warehouse: HDFS directory where metadata files and data files are stored. (required)

Configure sql-client initialization file

vim /opt/module/flink-1.16.0/conf/sql-client-init.sql

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://hadoop1:9083',
  'warehouse'='hdfs://hadoop1:8020/warehouse/iceberg-hive'
);

USE CATALOG hive_catalog;

When starting sql-client afterwards, add -i followed by the SQL file path to complete the catalog initialization.

/opt/module/flink-1.16.0/bin/sql-client.sh embedded -i conf/sql-client-init.sql

DDL statement

create database

CREATE DATABASE iceberg_db;
USE iceberg_db;

Create table

CREATE TABLE `hive_catalog`.`default`.`sample` (
    id BIGINT COMMENT 'unique id',
    data STRING
);

The table creation command now supports the most commonly used flink table creation syntax, including:

  • PARTITIONED BY (column1, column2, …): Configure partitions. Apache Flink does not yet support hidden partitioning.

  • COMMENT 'table document': Set the table comment

  • WITH ('key'='value', …): Set table properties

Computed columns and watermarks are not currently supported (primary keys are supported).

(1) Create partition table

CREATE TABLE `hive_catalog`.`default`.`sample` (
    id BIGINT COMMENT 'unique id',
    data STRING
) PARTITIONED BY (data);

Apache Iceberg supports hidden partitioning, but Apache Flink does not support partitioning by a function of a column, so hidden partitions cannot currently be declared in Flink DDL.

(2) Use LIKE syntax to create tables

The LIKE syntax is used to create a table with the same schema, partitions, and attributes as another table.

CREATE TABLE `hive_catalog`.`default`.`sample` (
    id BIGINT COMMENT 'unique id',
    data STRING
);

CREATE TABLE  `hive_catalog`.`default`.`sample_like` LIKE `hive_catalog`.`default`.`sample`;

Modify table

(1) Modify table attributes

ALTER TABLE `hive_catalog`.`default`.`sample` SET ('write.format.default'='avro');

(2) Modify the table name

ALTER TABLE `hive_catalog`.`default`.`sample` RENAME TO `hive_catalog`.`default`.`new_sample`;

Delete table

DROP TABLE `hive_catalog`.`default`.`sample`;

insert statement

INSERT INTO

INSERT INTO `hive_catalog`.`default`.`sample` VALUES (1, 'a');
INSERT INTO `hive_catalog`.`default`.`sample` SELECT id, data from sample2;

INSERT OVERWRITE

INSERT OVERWRITE is only supported in Flink's batch mode:

SET execution.runtime-mode = batch;
INSERT OVERWRITE sample VALUES (1, 'a');
INSERT OVERWRITE `hive_catalog`.`default`.`sample` PARTITION(data='a') SELECT 6;

UPSERT

Iceberg supports UPSERT based on primary key when writing data to v2 table format. There are two ways to enable upsert.

(1) Specify when creating the table

CREATE TABLE `hive_catalog`.`test1`.`sample5` (
    `id`  INT UNIQUE COMMENT 'unique id',
    `data` STRING NOT NULL,
    PRIMARY KEY(`id`) NOT ENFORCED
) with (
    'format-version'='2', 
    'write.upsert.enabled'='true'
);

(2) Specify when inserting

INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */
...

The target table must use format-version 2.

OVERWRITE and UPSERT cannot be enabled at the same time. In UPSERT mode, if the table is partitioned, the partition fields must also be part of the primary key, as in the sketch below.
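
For example, a partitioned upsert table could be declared as in the following sketch (the table name sample6 and its columns are illustrative; note that the partition column region is part of the primary key):

CREATE TABLE `hive_catalog`.`test1`.`sample6` (   -- illustrative table
    `id`  INT COMMENT 'unique id',
    `region` STRING NOT NULL,
    `data` STRING,
    PRIMARY KEY(`id`, `region`) NOT ENFORCED
) PARTITIONED BY (`region`) WITH (
    'format-version'='2',
    'write.upsert.enabled'='true'
);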

(3) Read a Kafka stream and upsert it into the Iceberg table

create table default_catalog.default_database.kafka(
    id int,
    data string
) with (
    'connector' = 'kafka'
    ,'topic' = 'test111'
    ,'properties.zookeeper.connect' = 'hadoop1:2181'
    ,'properties.bootstrap.servers' = 'hadoop1:9092'
    ,'format' = 'json'
    ,'properties.group.id'='iceberg'
    ,'scan.startup.mode'='earliest-offset'
);


INSERT INTO hive_catalog.test1.sample5 SELECT * FROM default_catalog.default_database.kafka;

Query statements

Iceberg supports Flink’s streaming and batch reading.

Batch mode

SET execution.runtime-mode = batch;
select * from sample;

Streaming mode

SET execution.runtime-mode = streaming;
SET table.dynamic-table-options.enabled=true;
SET sql-client.execution.result-mode=tableau;

(1) Read all records from the current snapshot, and then read incremental data from the snapshot

SELECT * FROM sample5 /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ;

(2) Read the incremental data after the specified snapshot id (not included)

SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987')*/ ;
  • monitor-interval: The time interval for continuously monitoring newly submitted data files (default is 10s).

  • start-snapshot-id: The snapshot id at which the streaming job starts.

Note: If unbounded data is upserted into the Iceberg table as a stream (read from Kafka, upsert into Iceberg), streaming reads of that Iceberg table will not return data. If unbounded data is appended to the Iceberg table as a stream (read from Kafka, append into Iceberg), streaming reads of the table show results normally.

Limitations of the Flink integration

Supported feature Remark
SQL create catalog
SQL create database
SQL create table
SQL create table like
SQL alter table Only supports modifying table properties; changing columns and partitions is not supported
SQL drop table
SQL select Supports streaming and batch modes
SQL insert into Supports streaming and batch modes
SQL insert overwrite
DataStream read
DataStream append
DataStream overwrite
Metadata tables Supported through the Java API; not supported in Flink SQL
Rewrite files action
  • Creating Iceberg tables with hidden partitions is not supported.

  • Creating Iceberg tables with computed columns is not supported.

  • Creating Iceberg tables with watermarks is not supported.

  • Adding, dropping, renaming, and changing columns are not supported.

  • Iceberg does not currently support querying table metadata through Flink SQL; the Java API must be used instead.

Integration with Flink DataStream

Environmental preparation

(1) Configure the pom file

Create a new Maven project; the pom file is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.atguigu.iceberg</groupId>
    <artifactId>flink-iceberg-demo</artifactId>
    <version>1.0-SNAPSHOT</version>


    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.16.0</flink.version>
        <java.version>1.8</java.version>
        <scala.binary.version>2.12</scala.binary.version>
        <slf4j.version>1.7.30</slf4j.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>   <!-- not packaged into the jar; compile-time only, not used at runtime -->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-planner -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-files</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- provides the web UI when running from the IDE -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-to-slf4j</artifactId>
            <version>2.14.0</version>
            <scope>provided</scope>
        </dependency>


        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-statebackend-rocksdb</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.3</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-flink-runtime-1.16 -->
        <dependency>
            <groupId>org.apache.iceberg</groupId>
            <artifactId>iceberg-flink-runtime-1.16</artifactId>
            <version>1.1.0</version>
        </dependency>

    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>log4j:*</exclude>
                                    <exclude>org.apache.hadoop:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers combine.children="append">
                                <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer">
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

(2) Configure log4j

Create log4j.properties under the resources directory:

log4j.rootLogger=error,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n

Reading data

Regular Source API

(1) Batch mode

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a");
DataStream<RowData> batch = FlinkSource.forRowData()
     .env(env)
     .tableLoader(tableLoader)
     .streaming(false)
     .build();

batch.map(r -> Tuple2.of(r.getLong(0),r.getLong(1) ))
      .returns(Types.TUPLE(Types.LONG,Types.LONG))
      .print();

env.execute("Test Iceberg Read");

(2) Streaming mode

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a"); 
DataStream<RowData> stream = FlinkSource.forRowData()
    .env(env)
    .tableLoader(tableLoader)
    .streaming(true)
    .startSnapshotId(3821550127947089987L)
    .build();

stream.map(r -> Tuple2.of(r.getLong(0),r.getLong(1) ))
    .returns(Types.TUPLE(Types.LONG,Types.LONG))
    .print();

env.execute("Test Iceberg Read");

FLIP-27 Source API

(1) Batch mode

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a");

IcebergSource<RowData> source1 = IcebergSource.forRowData()
    .tableLoader(tableLoader)
    .assignerFactory(new SimpleSplitAssignerFactory())
    .build();

DataStream<RowData> batch = env.fromSource(
    source1,
    WatermarkStrategy.noWatermarks(),
    "My Iceberg Source",
    TypeInformation.of(RowData.class));

batch.map(r -> Tuple2.of(r.getLong(0), r.getLong(1)))
    .returns(Types.TUPLE(Types.LONG, Types.LONG))
    .print();

env.execute("Test Iceberg Read");

(2) Streaming mode

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a");

IcebergSource source2 = IcebergSource.forRowData()
    .tableLoader(tableLoader)
    .assignerFactory(new SimpleSplitAssignerFactory())
    .streaming(true)
    .streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_LATEST_SNAPSHOT)
    .monitorInterval(Duration.ofSeconds(60))
    .build();

DataStream<RowData> stream = env.fromSource(
    source2,
    WatermarkStrategy.noWatermarks(),
    "My Iceberg Source",
    TypeInformation.of(RowData.class));

stream.map(r -> Tuple2.of(r.getLong(0), r.getLong(1)))
    .returns(Types.TUPLE(Types.LONG, Types.LONG))
    .print();

env.execute("Test Iceberg Read");

Writing data

Currently, data streams of type DataStream<RowData> and DataStream<Row> can be written to Iceberg tables.

(1) Write modes: append, overwrite, and upsert are supported

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);


SingleOutputStreamOperator<RowData> input = env.fromElements("")
    .map(new MapFunction<String, RowData>() {
        @Override
        public RowData map(String s) throws Exception {
            GenericRowData genericRowData = new GenericRowData(2);
            genericRowData.setField(0, 99L);
            genericRowData.setField(1, 99L);

            return genericRowData;
        }
    });

TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://hadoop1:8020/warehouse/spark-iceberg/default/a");


FlinkSink.forRowData(input)
    .tableLoader(tableLoader)
    .append()            // append mode
    //.overwrite(true)   // overwrite mode
    //.upsert(true)      // upsert mode
    ;

env.execute("Test Iceberg DataStream");

(2) Write options

FlinkSink.forRowData(input)
    .tableLoader(tableLoader)
    .set("write-format", "orc")
    .set(FlinkWriteOptions.OVERWRITE_MODE, "true");

The configurable options are as follows:

Option Default Description
write-format Same as write.format.default (Parquet) File format used for writing: parquet, avro, or orc
target-file-size-bytes Same as write.target-file-size-bytes (536870912, i.e. 512 MB) Controls the size of generated files; the target is roughly this many bytes
upsert-enabled Same as write.upsert.enabled
overwrite-enabled false Overwrite the table's data; cannot be enabled together with UPSERT mode
distribution-mode Same as write.distribution-mode (none) How data is distributed when writing: none: do not shuffle rows; hash: distribute by hash of the partition key; range: distribute by partition key or sort key if the table has a SortOrder
compression-codec Same as write.(fileformat).compression-codec
compression-level Same as write.(fileformat).compression-level
compression-strategy Same as write.orc.compression-strategy

Merge small files

Iceberg does not currently support inspecting table metadata through Flink SQL; you need to use Iceberg's Java API to read the metadata and obtain table information. Small files can be rewritten into large files by submitting a Flink batch job:

import org.apache.iceberg.flink.actions.Actions;

// 1. Obtain a Table object
// 1.1 Create a catalog object
Configuration conf = new Configuration();
HadoopCatalog hadoopCatalog = new HadoopCatalog(conf, "hdfs://hadoop1:8020/warehouse/spark-iceberg");

// 1.2 Load the Table object through the catalog
Table table = hadoopCatalog.loadTable(TableIdentifier.of("default", "a"));

// With the Table object, you can read metadata and perform table maintenance operations
//        System.out.println(table.history());
//        System.out.println(table.expireSnapshots().expireOlderThan());

// 2. Compact files through Actions
Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(1024L)
    .execute();

After getting the Table object, you can obtain metadata and perform table maintenance operations. For more API operations provided by Iceberg, check: https://iceberg.apache.org/docs/latest/api/


Origin blog.csdn.net/qq_44766883/article/details/131488124