Iceberg from Entry to Proficiency, Part 18: A Closer Look at Flink's Support for Iceberg

Apache Iceberg supports Apache Flink's DataStream API and Table API.

1. Iceberg features supported by Flink

Feature support         Flink   Notes
SQL create catalog      ✔️
SQL create database     ✔️
SQL create table        ✔️
SQL create table like   ✔️
SQL alter table         ✔️      Only altering table properties is supported; column and partition changes are not supported (a sketch follows this table)
SQL drop table          ✔️
SQL select              ✔️      Supports streaming and batch modes
SQL insert into         ✔️      Supports streaming and batch modes
SQL insert overwrite    ✔️
DataStream read         ✔️
DataStream append       ✔️
DataStream overwrite    ✔️
Metadata tables         ✔️
Rewrite files action    ✔️
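
Two of the SQL features listed above, CREATE TABLE LIKE and ALTER TABLE (table properties only), have no example later in this article. The following is a minimal Flink SQL sketch of both, assuming the hive_catalog catalog and sample table created in the Hive catalog section; the sample_like table name and the write.format.default value are illustrative choices.

-- Create a table with the same schema, partitioning, and table properties as an existing table.
CREATE TABLE `hive_catalog`.`default`.`sample_like` LIKE `hive_catalog`.`default`.`sample`;

-- ALTER TABLE can only change table properties; column and partition changes are not supported.
ALTER TABLE `hive_catalog`.`default`.`sample` SET ('write.format.default'='avro');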

2. Preparations for using Flink SQL Client

To create Iceberg tables in Flink, it is recommended to use the Flink SQL Client, which makes the concepts easier to follow.

Download Flink from the Apache download page. Iceberg uses Scala 2.12 when compiling the Apache iceberg-flink-runtime jar, so it is recommended to use Flink 1.16 bundled with Scala 2.12.

FLINK_VERSION=1.16.1
SCALA_VERSION=2.12
APACHE_FLINK_URL=https://archive.apache.org/dist/flink/
wget ${APACHE_FLINK_URL}/flink-${FLINK_VERSION}/flink-${FLINK_VERSION}-bin-scala_${SCALA_VERSION}.tgz
tar xzvf flink-${FLINK_VERSION}-bin-scala_${SCALA_VERSION}.tgz

Start a standalone Flink cluster in a Hadoop environment:

# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
APACHE_HADOOP_URL=https://archive.apache.org/dist/hadoop/
HADOOP_VERSION=2.8.5
wget ${APACHE_HADOOP_URL}/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
tar xzvf hadoop-${HADOOP_VERSION}.tar.gz
HADOOP_HOME=`pwd`/hadoop-${HADOOP_VERSION}

export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

# Start the flink standalone cluster
./bin/start-cluster.sh

Start the Flink SQL client. The Iceberg project has a separate flink-runtime module that generates a bundled jar which can be loaded directly by the Flink SQL client. To build the flink-runtime bundled jar manually, build the Iceberg project; the jar will be generated under <iceberg-root-dir>/flink-runtime/build/libs. Alternatively, download the iceberg-flink-runtime jar from the Apache Maven repository.

# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`   

./bin/sql-client.sh embedded -j <flink-runtime-directory>/iceberg-flink-runtime-1.16-1.3.0.jar shell

By default, Iceberg ships with Hadoop jars for the Hadoop catalog. To use the Hive catalog, load the Hive jars when opening the Flink SQL client. Fortunately, Flink provides a bundled Hive jar for the SQL client. An example of how to download the dependencies and get started:

# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

ICEBERG_VERSION=1.3.0
MAVEN_URL=https://repo1.maven.org/maven2
ICEBERG_MAVEN_URL=${MAVEN_URL}/org/apache/iceberg
ICEBERG_PACKAGE=iceberg-flink-runtime
FLINK_VERSION_MAJOR=1.16
wget ${ICEBERG_MAVEN_URL}/${ICEBERG_PACKAGE}-${FLINK_VERSION_MAJOR}/${ICEBERG_VERSION}/${ICEBERG_PACKAGE}-${FLINK_VERSION_MAJOR}-${ICEBERG_VERSION}.jar -P lib/

HIVE_VERSION=2.3.9
SCALA_VERSION=2.12
FLINK_VERSION=1.16.1
FLINK_CONNECTOR_URL=${MAVEN_URL}/org/apache/flink
FLINK_CONNECTOR_PACKAGE=flink-sql-connector-hive
wget ${FLINK_CONNECTOR_URL}/${FLINK_CONNECTOR_PACKAGE}-${HIVE_VERSION}_${SCALA_VERSION}/${FLINK_VERSION}/${FLINK_CONNECTOR_PACKAGE}-${HIVE_VERSION}_${SCALA_VERSION}-${FLINK_VERSION}.jar

./bin/sql-client.sh embedded shell

3. Flink's Python API

Install Apache Flink dependencies using pip:

pip install apache-flink==1.16.1

Provide the file:// path to the iceberg-flink-runtime jar, which can be obtained by building the project and looking in <iceberg-root-dir>/flink-runtime/build/libs, or by downloading it from the Apache Maven repository. Third-party jars can be added to pyflink in the following ways:

  • env.add_jars("file:///my/jar/path/connector.jar")
  • table_env.get_config().get_configuration().set_string("pipeline.jars", "file:///my/jar/path/connector.jar")

This is also mentioned in the official documentation. The following example uses env.add_jars(...):

import os

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
iceberg_flink_runtime_jar = os.path.join(os.getcwd(), "iceberg-flink-runtime-1.16-1.3.0.jar")

env.add_jars("file://{}".format(iceberg_flink_runtime_jar))

Next, create a StreamTableEnvironment and execute Flink SQL statements. The following example shows how to create a custom catalog through the Python Table API:

from pyflink.table import StreamTableEnvironment
table_env = StreamTableEnvironment.create(env)
table_env.execute_sql("""
CREATE CATALOG my_catalog WITH (
    'type'='iceberg', 
    'catalog-impl'='com.my.custom.CatalogImpl',
    'my-additional-catalog-config'='my-value'
)
""")

Run the query:

(table_env
    .sql_query("SELECT PULocationID, DOLocationID, passenger_count FROM my_catalog.nyc.taxis LIMIT 5")
    .execute()
    .print()) 
+----+----------------------+----------------------+--------------------------------+
| op |         PULocationID |         DOLocationID |                passenger_count |
+----+----------------------+----------------------+--------------------------------+
| +I |                  249 |                   48 |                            1.0 |
| +I |                  132 |                  233 |                            1.0 |
| +I |                  164 |                  107 |                            1.0 |
| +I |                   90 |                  229 |                            1.0 |
| +I |                  137 |                  249 |                            1.0 |
+----+----------------------+----------------------+--------------------------------+
5 rows in set

4. Adding catalogs

Flink supports creating catalogs using Flink SQL.

Catalog configuration

Create and name a catalog by executing the following query (replace <catalog_name> with your catalog name and <config_key>=<config_value> with the catalog implementation configuration):

CREATE CATALOG <catalog_name> WITH (
  'type'='iceberg',
  '<config_key>'='<config_value>'
); 

The following properties can be set globally and are not restricted to a specific catalog implementation:

  • type: must be iceberg. (required)
  • catalog-type: hive, hadoop, or rest for built-in catalogs, or leave unset and use catalog-impl for a custom catalog implementation (see the Hadoop catalog sketch after this list). (optional)
  • catalog-impl: The fully qualified class name of a custom catalog implementation. Must be set if catalog-type is unset. (optional)
  • property-version: A version number describing the property version. This property is used for backward compatibility if the property format changes. The current property version is 1. (optional)
  • cache-enabled: Whether to enable catalog caching; the default value is true. (optional)
  • cache.expiration-interval-ms: How long catalog entries are cached locally, in milliseconds; a negative value such as -1 disables expiration, and a value of 0 is not allowed. The default value is -1. (optional)
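
As a concrete example of these properties, the following is a minimal sketch of a catalog that uses the built-in hadoop catalog-type; the HDFS warehouse path is a placeholder.

CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://nn:8020/warehouse/path',
  'property-version'='1'
);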

5. Hive catalog

This creates an Iceberg catalog named hive_catalog that is configured with 'catalog-type'='hive' and loads tables from a Hive metastore:

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://nn:8020/warehouse/path'
);

If using the Hive catalog, the following properties can be set:

  • uri: The Thrift URI of the Hive metastore. (required)
  • clients: The Hive metastore client pool size; the default value is 2. (optional)
  • warehouse: The Hive warehouse location. Specify this path if you neither set hive-conf-dir to point to a directory containing hive-site.xml nor add a correct hive-site.xml to the classpath.
  • hive-conf-dir: Path to a directory containing the hive-site.xml configuration file, which will be used to provide custom Hive configuration values. If both hive-conf-dir and warehouse are set, the hive.metastore.warehouse.dir value from hive-site.xml (or from a hive-site.xml on the classpath) is overridden by the warehouse value when the Iceberg catalog is created (see the sketch after this list).
  • hadoop-conf-dir: Path to a directory containing core-site.xml and hdfs-site.xml configuration files, which will be used to provide custom Hadoop configuration values.
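
As a sketch of the hive-conf-dir and hadoop-conf-dir options, the hive_catalog above could instead pick up its Hive and Hadoop settings from local configuration directories; the /etc/hive/conf and /etc/hadoop/conf paths below are placeholder assumptions.

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'hive-conf-dir'='/etc/hive/conf',
  'hadoop-conf-dir'='/etc/hadoop/conf'
);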

create table

CREATE TABLE `hive_catalog`.`default`.`sample` (
    id BIGINT COMMENT 'unique id',
    data STRING
);
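
The INSERT OVERWRITE ... PARTITION example later in this article assumes a table partitioned by the data column. A minimal sketch of such a table, using Flink SQL's PARTITIONED BY clause, which creates identity partitions on the listed columns (the sample_partitioned name is a placeholder):

CREATE TABLE `hive_catalog`.`default`.`sample_partitioned` (
    id BIGINT COMMENT 'unique id',
    data STRING
) PARTITIONED BY (data);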

write data

To append new data to a table with a Flink streaming job, use INSERT INTO:

INSERT INTO `hive_catalog`.`default`.`sample` VALUES (1, 'a');
INSERT INTO `hive_catalog`.`default`.`sample` SELECT id, data from other_kafka_table;

To replace data in a table with the results of a query, use INSERT OVERWRITE in a batch job (Flink streaming jobs do not support INSERT OVERWRITE). Overwrites are atomic operations on Iceberg tables.

Partitions that have rows produced by the SELECT query will be replaced, for example:

INSERT OVERWRITE `hive_catalog`.`default`.`sample` VALUES (1, 'a');

Iceberg also supports overwriting a given partition with explicitly selected values:

INSERT OVERWRITE `hive_catalog`.`default`.`sample` PARTITION(data='a') SELECT 6;

Flink natively supports writing DataStream<RowData> and DataStream<Row> to Iceberg tables.

StreamExecutionEnvironment env = ...;

DataStream<RowData> input = ... ;
Configuration hadoopConf = new Configuration();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf);

FlinkSink.forRowData(input)
    .tableLoader(tableLoader)
    .append();

env.execute("Test Iceberg DataStream");

branch write

The toBranch API in FlinkSink also supports writing to branches in Iceberg tables.

FlinkSink.forRowData(input)
    .tableLoader(tableLoader)
    .toBranch("audit-branch")
    .append();

Read

Submit a Flink batch job using the following statement:

-- Execute the flink job in batch mode for current session context
SET execution.runtime-mode = batch;
SELECT * FROM `hive_catalog`.`default`.`sample`;

Iceberg supports processing incremental data in Flink streaming jobs starting from a historical snapshot ID:

-- Submit the flink job in streaming mode for current session.
SET execution.runtime-mode = streaming;

-- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options.
SET table.dynamic-table-options.enabled=true;

-- Read all the records from the iceberg current snapshot, and then read incremental data starting from that snapshot.
SELECT * FROM `hive_catalog`.`default`.`sample` /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ;

-- Read all incremental data starting from the snapshot-id '3821550127947089987' (records from this snapshot will be excluded).
SELECT * FROM `hive_catalog`.`default`.`sample` /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987')*/ ;

SQL is also the recommended way to inspect tables. To see all of the snapshots in a table, use the snapshots metadata table:

SELECT * FROM `hive_catalog`.`default`.`sample`.`snapshots`

Iceberg supports streaming or batch reading in the Java API:

DataStream<RowData> batch = FlinkSource.forRowData()
     .env(env)
     .tableLoader(tableLoader)
     .streaming(false)
     .build();

6. Type conversion

Iceberg's integration for Flink automatically converts between Flink and Iceberg types. When writing to a table with types that Flink does not support (such as UUID), Iceberg will accept and convert values from the Flink type.

Flink to Iceberg

Flink types are converted to Iceberg types according to the following table:

Flink                  Iceberg                      Notes
boolean                boolean
tinyint                integer
smallint               integer
integer                integer
bigint                 long
float                  float
double                 double
char                   string
varchar                string
string                 string
binary                 binary
varbinary              fixed
decimal                decimal
date                   date
time                   time
timestamp              timestamp without timezone
timestamp_ltz          timestamp with timezone
array                  list
map                    map
multiset               map
row                    struct
raw                    Not supported
interval               Not supported
structured             Not supported
timestamp with zone    Not supported
distinct               Not supported
null                   Not supported
symbol                 Not supported
logical                Not supported
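
To make the Flink-to-Iceberg mapping above more concrete, the sketch below creates a table from several Flink types and notes, in comments, the Iceberg type each column is stored as; the type_demo table name is a placeholder.

CREATE TABLE `hive_catalog`.`default`.`type_demo` (
    id      BIGINT,               -- stored as Iceberg long
    name    STRING,               -- stored as Iceberg string
    price   DECIMAL(10, 2),       -- stored as Iceberg decimal(10, 2)
    created TIMESTAMP(6),         -- stored as Iceberg timestamp without timezone
    tags    ARRAY<STRING>,        -- stored as Iceberg list<string>
    attrs   MAP<STRING, STRING>   -- stored as Iceberg map<string, string>
);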

Iceberg to Flink

Iceberg types are converted to Flink types according to the following table:

Iceberg                       Flink
boolean                       boolean
struct                        row
list                          array
map                           map
integer                       integer
long                          bigint
float                         float
double                        double
date                          date
time                          time
timestamp without timezone    timestamp(6)
timestamp with timezone       timestamp_ltz(6)
string                        varchar(2147483647)
uuid                          binary(16)
fixed(N)                      binary(N)
binary                        varbinary(2147483647)
decimal(P, S)                 decimal(P, S)
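
Conversely, to see which Flink types an existing Iceberg table is exposed with, it can be described from the SQL client; a minimal sketch against the sample table created earlier:

DESCRIBE `hive_catalog`.`default`.`sample`;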

7. Features not yet supported

Some features are not yet supported by the current Flink Iceberg integration:

  • Creating Iceberg tables with hidden partitioning is not supported
  • Creating Iceberg tables with computed columns is not supported
  • Creating Iceberg tables with watermarks is not supported
  • Adding, deleting, renaming, and changing columns are not supported; support is expected in Flink 1.18.0


Origin blog.csdn.net/zhengzaifeidelushang/article/details/131745396