Iceberg from Beginner to Master, Part 17: Synchronizing Data to Iceberg with Apache InLong

1. Overview

  • Apache Iceberg is a high-performance format for large analytical tables.

2. Version support

Load Node | Version
Iceberg | Iceberg: 0.12.x, 0.13.x

3. Dependencies

<dependency>
    <groupId>org.apache.inlong</groupId>
    <artifactId>sort-connector-iceberg</artifactId>
    <version>1.7.0</version>
</dependency>

4. SQL API usage

To create an Iceberg table in Flink, we recommend using the Flink SQL Client, because it makes the concepts easier to understand.

Step.1 Start a standalone Flink cluster in the Hadoop environment.

# HADOOP_HOME is your Hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

# Start the flink standalone cluster
./bin/start-cluster.sh

Step.2 Start the flink SQL client.

The Iceberg project provides a separate flink-runtime module that generates a bundled jar, which can be loaded directly by the Flink SQL client.

If you want to build the bundled jar manually, just build the inlong project; the jar will be located in <inlong-root-dir>/inlong-sort/sort-connectors/iceberg/target.

By default, Iceberg ships the Hadoop jars needed for the Hadoop catalog. If we want to use the Hive catalog, we need to load the Hive jars when starting the Flink SQL client. Fortunately, Apache InLong packages a bundled Hive jar into the Iceberg connector, so we can start the SQL client as follows:

# HADOOP_HOME is your Hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

./bin/sql-client.sh embedded -j <flink-runtime-directory>/iceberg-flink-runtime-xxx.jar shell

Step.3 Create a table in the current Flink catalog

By default, we don't need to create a catalog; the in-memory catalog is used. If catalog-database.catalog-table does not exist in the catalog, it will be created automatically. Here we just load data into it.
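
If you prefer a named catalog over the in-memory one, Iceberg's Flink integration also lets you register a catalog explicitly with CREATE CATALOG. The following is only a minimal sketch, reusing the Hive metastore URI and warehouse path from the examples below; the catalog name hive_prod is illustrative.

-- Register a named Iceberg catalog backed by the Hive metastore (optional alternative to the in-memory catalog)
CREATE CATALOG hive_prod WITH (
    'type'='iceberg',
    'catalog-type'='hive',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://nn:8020/path/to/warehouse'
);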

Tables managed in the Hive catalog

The following SQL will create a Flink table in the current Flink catalog, which maps to the Iceberg table default_database.iceberg_table managed in the Iceberg catalog. Since the catalog type is hive by default, there is no need to set catalog-type here.

CREATE TABLE flink_table (
    id   BIGINT,
    data STRING
) WITH (
    'connector'='iceberg',
    'catalog-name'='hive_prod',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://nn:8020/path/to/warehouse'
);

If you want to create a Flink table that maps to a different Iceberg table managed in the Hive catalog (e.g. hive_db.hive_iceberg_table in Hive), you can create the Flink table as follows:

CREATE TABLE flink_table (
    id   BIGINT,
    data STRING
) WITH (
    'connector'='iceberg',
    'catalog-name'='hive_prod',
    'catalog-database'='hive_db',
    'catalog-table'='hive_iceberg_table',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://nn:8020/path/to/warehouse'
);

When writing records to a Flink table, if the underlying catalog database (hive_db in the example above) does not exist, it will be created automatically.

Tables managed in the Hadoop catalog

The following SQL will create a Flink table in the current Flink catalog, which maps to the Iceberg table default_database.flink_table managed in the Hadoop catalog.

CREATE TABLE flink_table (
    id   BIGINT,
    data STRING
) WITH (
    'connector'='iceberg',
    'catalog-name'='hadoop_prod',
    'catalog-type'='hadoop',
    'warehouse'='hdfs://nn:8020/path/to/warehouse'
);

Step.4 Insert data into the Iceberg table

INSERT INTO `flink_table`
    SELECT
    `id` AS `id`,
    `d` AS `data`
    FROM `source_table`;
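
The source_table above is not defined in this article; for a quick local test, a hypothetical source using Flink's built-in datagen connector could look like this:

-- Hypothetical source table producing random rows, only for testing the INSERT above
CREATE TABLE `source_table` (
    `id` BIGINT,
    `d`  STRING
) WITH (
    'connector' = 'datagen',
    'rows-per-second' = '1'
);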

Tables managed in a custom Catalog

The following SQL will create a Flink table in the current Flink catalog that maps to the Iceberg table default_database.flink_table managed in a custom catalog.

CREATE TABLE flink_table (
    id   BIGINT,
    data STRING
) WITH (
    'connector'='iceberg',
    'catalog-name'='custom_prod',
    'catalog-type'='custom',
    'catalog-impl'='com.my.custom.CatalogImpl',
     -- More table properties for the customized catalog
    'my-additional-catalog-config'='my-value',
     ...
);

5. Multi-table writing

Currently, Iceberg supports writing to multiple tables at the same time. You need to add 'sink.multiple.enable' = 'true' to the Flink SQL CREATE TABLE parameters, and the schema of the sink table can only be defined as BYTES or STRING. The following is an example table creation statement:

CREATE TABLE `table_2`(
    `data` STRING)
WITH (
    'connector'='iceberg-inlong',
    'catalog-name'='hive_prod',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://localhost:8020/hive/warehouse',
    'sink.multiple.enable' = 'true',
    'sink.multiple.format' = 'canal-json',
    'sink.multiple.add-column.policy' = 'TRY_IT_BEST',
    'sink.multiple.database-pattern' = '${database}',
    'sink.multiple.table-pattern' = 'test_${table}'
);

To support multi-table writing, you need to set the serialization format of the upstream data through the option 'sink.multiple.format'; currently only canal-json and debezium-json are supported.
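
As a sketch of how such a sink is typically fed, the upstream table can carry the raw canal-json change records in a single STRING column, for example from a Kafka topic read with the raw format. The topic name, bootstrap servers and group id below are assumptions.

-- Hypothetical upstream table holding raw canal-json change records
CREATE TABLE `cdc_source` (
    `data` STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'cdc_topic',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'inlong_demo',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'raw'
);

-- Route every change record into the multi-table Iceberg sink defined above
INSERT INTO `table_2` SELECT `data` FROM `cdc_source`;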

6. Dynamic table name mapping

When writing to multiple tables, Iceberg can customize the rules for the mapped database name and table name. You can fill in placeholders and add prefixes or suffixes to modify the mapped target table name. The Iceberg Load Node parses 'sink.multiple.database-pattern' as the destination database name and 'sink.multiple.table-pattern' as the destination table name. Placeholders are parsed from the data and must be written as '${VARIABLE_NAME}'. The value of a variable comes from the data itself: it can be a metadata field of the format specified by 'sink.multiple.format', or a physical field in the data. Examples of 'topic-pattern' are as follows (a further sketch follows the list):

  • 'sink.multiple.format' is 'canal-json':
  • 'topic-pattern' is '${database}_${table}': the extracted name is 'inventory_products' ('database' and 'table' are metadata fields)
  • 'topic-pattern' is '${database}_${table}_${id}': the extracted name is 'inventory_products_111' ('database' and 'table' are metadata fields, 'id' is a physical field)
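
For instance, with the options used in the table_2 example above ('sink.multiple.database-pattern' = '${database}', 'sink.multiple.table-pattern' = 'test_${table}'), a canal-json record whose metadata says database is 'inventory' and table is 'products' is written to the Iceberg table inventory.test_products. A variant sketch that routes everything into a fixed database and adds a prefix (all names here are hypothetical) could look like this:

-- Route all captured tables into the fixed database 'ods' and prefix the table name,
-- so inventory.products would land in ods.ods_products
CREATE TABLE `table_3`(
    `data` STRING)
WITH (
    'connector'='iceberg-inlong',
    'catalog-name'='hive_prod',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://localhost:8020/hive/warehouse',
    'sink.multiple.enable' = 'true',
    'sink.multiple.format' = 'canal-json',
    'sink.multiple.database-pattern' = 'ods',
    'sink.multiple.table-pattern' = 'ods_${table}'
);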

7. Dynamic database and table creation

When writing to multiple tables, Iceberg automatically creates databases and tables that do not yet exist, and supports capturing additional tables into the warehouse while the job is running. The default parameters for automatically created Iceberg tables are: 'format-version' = '2', 'write.upsert.enabled' = 'true', 'engine.hive.enabled' = 'true'.
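
For reference, an auto-created table roughly corresponds to the following hand-written DDL in a registered Iceberg catalog; the catalog, database, table and column names here are only illustrative.

-- Hypothetical equivalent of a table created automatically by multi-table writing
CREATE TABLE `hive_prod`.`inventory`.`test_products` (
    `id`   BIGINT,
    `name` STRING,
    PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
    'format-version' = '2',
    'write.upsert.enabled' = 'true',
    'engine.hive.enabled' = 'true'
);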

8. Dynamic schema changes

Iceberg supports synchronizing source table schema changes to the target tables (DDL synchronization) when writing to multiple tables. The supported schema changes are as follows:

[Original image: table of supported schema change types]

9. Iceberg Load node parameters

Option | Required | Default | Type | Description
connector | required | (none) | String | Specify which connector to use; here it should be 'iceberg'.
catalog-type | required | hive | String | 'hive' or 'hadoop' for built-in catalogs, or left unset for custom catalog implementations that use catalog-impl.
catalog-name | required | (none) | String | Catalog name.
catalog-database | required | (none) | String | Database name managed in the Iceberg catalog.
catalog-table | required | (none) | String | Table name managed in the Iceberg catalog and database.
catalog-impl | optional for custom catalog | (none) | String | Fully qualified class name of a custom catalog implementation; must be set if catalog-type is unset.
cache-enabled | optional | true | Boolean | Whether to enable catalog caching; the default value is true.
uri | optional for hive catalog | (none) | String | The thrift URI of the Hive metastore.
clients | optional for hive catalog | 2 | Integer | The Hive metastore client pool size; the default value is 2.
warehouse | optional for hive or hadoop catalog | (none) | String | For the Hive catalog, the Hive warehouse location; users should specify this path if they neither set hive-conf-dir to a directory containing hive-site.xml nor add a correct hive-site.xml to the classpath. For the Hadoop catalog, the HDFS directory that stores metadata files and data files.
hive-conf-dir | optional for hive catalog | (none) | String | Path to a directory containing hive-site.xml, which is used to provide custom Hive configuration values. If both hive-conf-dir and warehouse are set, the value of hive.metastore.warehouse.dir from <hive-conf-dir>/hive-site.xml (or from the hive configuration file on the classpath) is overridden by the warehouse value.
inlong.metric.labels | optional | (none) | String | Label value of the InLong metric, in the form groupId={groupId}&streamId={streamId}&nodeId={nodeId} (see the example after this table).
sink.multiple.enable | optional | false | Boolean | Whether to enable multi-table writing.
sink.multiple.schema-update.policy | optional | TRY_IT_BEST | Enum | Policy for handling data whose schema is inconsistent with the target table. TRY_IT_BEST: handle as much as possible and ignore what cannot be handled. IGNORE_WITH_LOG: ignore it and write a log entry; later data for that table is not processed. THROW_IT_STOP: throw an exception and stop the task until the user resolves the schema inconsistency manually.
sink.multiple.pk-auto-generated | optional | false | Boolean | Whether to generate a primary key automatically; when a table is created automatically for multi-table writing and the source table has no primary key, all fields are used as the primary key.
sink.multiple.typemap-compatible-with-spark | optional | false | Boolean | Whether to adapt to Spark's type system when tables are created automatically for multi-table writing.
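
As an example of the metric option, the single-table DDL from section 4 can carry InLong metric labels as shown below; the groupId, streamId and nodeId values are placeholders.

-- Reported metrics for this sink are tagged with the InLong labels below (placeholder values)
CREATE TABLE flink_table_with_metrics (
    id   BIGINT,
    data STRING
) WITH (
    'connector'='iceberg',
    'catalog-name'='hive_prod',
    'catalog-database'='hive_db',
    'catalog-table'='hive_iceberg_table',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://nn:8020/path/to/warehouse',
    'inlong.metric.labels' = 'groupId=test_group&streamId=test_stream&nodeId=test_node'
);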

10. Data type mapping

See the Iceberg documentation for data type details. The table below shows how the load node converts Flink types to Iceberg types when writing data; a short example follows the table.

Flink SQL type | Iceberg type
CHAR | STRING
VARCHAR | STRING
STRING | STRING
BOOLEAN | BOOLEAN
BINARY | FIXED(L)
VARBINARY | BINARY
DECIMAL | DECIMAL(P,S)
TINYINT | INT
SMALLINT | INT
INTEGER | INT
BIGINT | LONG
FLOAT | FLOAT
DOUBLE | DOUBLE
DATE | DATE
TIME | TIME
TIMESTAMP | TIMESTAMP
TIMESTAMP_LTZ | TIMESTAMPTZ
INTERVAL | -
ARRAY | LIST
MULTISET | MAP
MAP | MAP
ROW | STRUCT
RAW | -
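
To make the mapping concrete, here is a sketch of a Flink table whose column types map to the Iceberg types noted in the comments; the table and column names are illustrative, and the connector options reuse those from section 4.

CREATE TABLE typed_flink_table (
    id     BIGINT,           -- stored as Iceberg LONG
    name   VARCHAR(64),      -- stored as Iceberg STRING
    price  DECIMAL(10, 2),   -- stored as Iceberg DECIMAL(10, 2)
    ts     TIMESTAMP(6),     -- stored as Iceberg TIMESTAMP
    tags   ARRAY<STRING>     -- stored as Iceberg LIST
) WITH (
    'connector'='iceberg',
    'catalog-name'='hive_prod',
    'catalog-database'='default_database',
    'catalog-table'='typed_iceberg_table',
    'uri'='thrift://localhost:9083',
    'warehouse'='hdfs://nn:8020/path/to/warehouse'
);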
