Practice data lake iceberg Lesson 37 Kafka writes to an iceberg table: enforced and not enforced test

Series Article Directory

Practice data lake iceberg Lesson 1 Getting started
Practice data lake iceberg Lesson 2 Iceberg's underlying data format based on hadoop
Practice data lake iceberg Lesson 4 In sqlclient, use SQL to read data from Kafka to iceberg (upgrade the version to flink1.12.7)
Practice data lake iceberg Lesson 5 hive catalog features
Practice data lake iceberg Lesson 6 Solving the failure to write from kafka to iceberg
Practice data lake iceberg Lesson 7 Write to iceberg in real time
Practice data lake iceberg Lesson 8 hive and iceberg integration
Practice data lake iceberg Lesson 9 Merge small files
Practice data lake iceberg Lesson 10 Snapshot deletion
Practice data lake iceberg Lesson 11 Test the complete partition-table process (generating data, building tables, merging, and deleting snapshots)
Practice data lake iceberg Lesson 12 What is a catalog
Practice data lake iceberg Lesson 13 Metadata is many times larger than the data files
Practice data lake iceberg Lesson 14 Data merging (to solve the problem of metadata expanding over time)
Practice data lake iceberg Lesson 15 spark installation and integration with iceberg (jersey package conflict)
Practice data lake iceberg Lesson 16 Open the door to understanding iceberg through spark3
Practice data lake iceberg Lesson 17 Hadoop2.7, spark3 on yarn configuration for running iceberg
Practice data lake iceberg Lesson 18 Startup commands for multiple clients interacting with iceberg (commonly used commands)
Practice data lake iceberg Lesson 19 flink count iceberg, no-result problem
Practice data lake iceberg Lesson 20 flink + iceberg CDC scenario (version problem, test failed)
Practice data lake iceberg Lesson 21 flink1.13.5 + iceberg0.131 CDC (INSERT tested successfully, change operations failed)
Practice data lake iceberg Lesson 22 flink1.13.5 + iceberg0.131 CDC (CRUD test successful)
Practice data lake iceberg Lesson 23 flink-sql restart from checkpoint
Practice data lake iceberg Lesson 24 iceberg metadata details analysis
Practice data lake iceberg Lesson 25 Running flink sql in the background: the effect of inserts, deletes, and updates
Practice data lake iceberg Lesson 26 checkpoint setting method
Practice data lake iceberg Lesson 27 Flink cdc test program failure restart: it can restart from the last checkpoint and continue working
Practice data lake iceberg Lesson 28 Deploy packages that do not exist in the public repository to the local repository
Practice data lake iceberg Lesson 29 How to obtain the flink jobId elegantly and efficiently
Practice data lake iceberg Lesson 30 mysql -> iceberg, different clients sometimes have time zone issues
Practice data lake iceberg Lesson 31 Use github's flink-streaming-platform-web tool to manage flink task flows, and test the cdc restart scenario
Practice data lake iceberg Lesson 32 DDL statement persistence through the hive catalog
Practice data lake iceberg Lesson 33 Upgrade flink to 1.14, with built-in functions to support json functions
Practice data lake iceberg Lesson 34 Stream-batch integrated architecture based on the data lake iceberg: stream architecture test practice
Practice data lake iceberg Lesson 35 Stream-batch integrated architecture based on the data lake iceberg: test whether incremental reading is full or incremental only
Practice data lake iceberg Lesson 36 Stream-batch integrated architecture based on the data lake iceberg: test whether update mysql select from iceberg syntax is an incremental update
Practice data lake iceberg Lesson 37 Kafka writes to an iceberg table: enforced and not enforced test
Practice data lake iceberg: more content directory



Foreword

This lesson tests reading data from Kafka into iceberg: when a record arrives whose id already exists, can iceberg automatically update the corresponding row in the lake? Test results for this scenario:
1. By default, iceberg appends the data flowing in from Kafka.
2. Upsert mode can be enabled by setting the 'write.upsert.enabled' = 'true' property on the iceberg table.


1. Test ideas

Produce data into Kafka and write it to iceberg; with a primary key (pk) declared on the iceberg table, observe whether the rows are appended or updated.

2. Test with NOT ENFORCED

2.1 Test code

Test ideas:
1. select from kafka
2. insert into iceberg
The code is as follows:

CREATE TABLE IF NOT EXISTS KafkaTableTest2_XXZH (
    `id` bigint,
    `data` STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'test2_xxzh',
    'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
    'properties.group.id' = 'testGroup',
    'scan.startup.mode' = 'latest-offset',
    'csv.ignore-parse-errors'='true',
    'format' = 'csv'
);
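
For context, a note on the assumed csv behavior (not spelled out in the original): each Kafka message is parsed as one csv line, so a message like 1,abc becomes the row (id=1, data='abc'), and 'csv.ignore-parse-errors' = 'true' skips records that fail to parse instead of failing the job.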


CREATE CATALOG hive_iceberg_catalog WITH (
    'type'='iceberg',
    'catalog-type'='hive',
    'uri'='thrift://hadoop101:9083',
    'clients'='5',
    'property-version'='1',
    'warehouse'='hdfs:///user/hive/warehouse/hive_iceberg_catalog'
);
use catalog hive_iceberg_catalog;
CREATE TABLE IF NOT EXISTS ods_base.IcebergTest2_XXZH (
    `id` bigint,
    `data` STRING,
    primary key (id) not enforced
) WITH (
    -- keep metadata from piling up: delete old metadata files after each commit
    'write.metadata.delete-after-commit.enabled'='true',
    'write.metadata.previous-versions-max'='5',
    -- iceberg format v2 is required for row-level updates/deletes
    'format-version'='2'
);
 

 
insert into hive_iceberg_catalog.ods_base.IcebergTest2_XXZH select * from default_catalog.default_database.KafkaTableTest2_XXZH;
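
A step the pipeline relies on but the snippet above does not show (checkpoint setup is covered in Lesson 26 of this series): the iceberg sink only commits files when a Flink checkpoint completes, so no data ever becomes visible to readers unless checkpointing is enabled. A minimal sketch for the sql-client session:

-- enable periodic checkpoints before running the insert
SET 'execution.checkpointing.interval' = '10s';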
 

2.2 Producing test data

[root@hadoop101 conf]#  kafka-console-producer.sh --broker-list  hadoop101:9092,hadoop102:9092,hadoop103:9092  --topic test2_xxzh
>1,abc
[2022-07-22 14:55:51,643] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 3 : {test2_xxzh=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
>2,bb
>3,cc
>4,dd
>5,ee
>3,cccc
>6,666
>4,ddddd
>

2.3 Running results

spark-sql (default)> select *  from ods_base.IcebergTest2_XXZH;
22/07/22 15:12:28 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
id      data
3       cc
4       ddddd
5       ee
3       cccc
6       666
4       dd
Time taken: 0.405 seconds, Fetched 6 row(s)

The running result of flink-sql:
(screenshot omitted)

2.4 Running conclusion

Both rows for id=3 and id=4 remain, so iceberg did not update according to the pk declared on the table: with NOT ENFORCED alone, data from Kafka is written in append mode.
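
A quick way to double-check this conclusion (a sketch, assuming the same spark-sql session as in 2.3): count rows per id, since any id with more than one row proves the writes were appends.

select id, count(*) as cnt
from ods_base.IcebergTest2_XXZH
group by id
having count(*) > 1;

With the data above this returns id=3 and id=4, each with cnt=2.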


3. Changing to ENFORCED reports an error

3.1 Test code

Change the pk of the iceberg table to ENFORCED and run again:


Flink SQL> CREATE TABLE IF NOT EXISTS KafkaTableTest3_XXZH (
>     `id` bigint,
>     `data` STRING
> ) WITH (
>     'connector' = 'kafka',
>     'topic' = 'test2_xxzh',
>     'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
>     'properties.group.id' = 'testGroup',
>     'scan.startup.mode' = 'latest-offset',
>     'csv.ignore-parse-errors'='true',
>     'format' = 'csv'
> );
> 
[INFO] Execute statement succeed.

Flink SQL> CREATE CATALOG hive_iceberg_catalog WITH (
>     'type'='iceberg',
>     'catalog-type'='hive',
>     'uri'='thrift://hadoop101:9083',
>     'clients'='5',
>     'property-version'='1',
>     'warehouse'='hdfs:///user/hive/warehouse/hive_iceberg_catalog'
> );
[INFO] Execute statement succeed.

Flink SQL> use catalog hive_iceberg_catalog;
[INFO] Execute statement succeed.

Flink SQL> CREATE TABLE IF NOT EXISTS ods_base.IcebergTest3_XXZH (
>     `id` bigint,
>     `data` STRING,
>     primary key (id) enforced
> )with(
>     'write.metadata.delete-after-commit.enabled'='true',
>     'write.metadata.previous-versions-max'='5',
>     'format-version'='2'
>  );
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.ValidationException: Flink doesn't support ENFORCED mode for PRIMARY KEY constraint. ENFORCED/NOT ENFORCED  controls if the constraint checks are performed on the incoming/outgoing data. Flink does not own the data therefore the only supported mode is the NOT ENFORCED mode


That is, Flink does not own the data itself, so the only supported mode is NOT ENFORCED.

Conclusion: Flink rejects ENFORCED outright, so it cannot be used to make iceberg update data according to the pk.

4. Set 'write.upsert.enabled' = 'true' to enable the upsert function

CREATE TABLE IF NOT EXISTS ods_base.IcebergTest4_XXZH (
    `id` bigint,
    `data` STRING,
    primary key (id) not enforced
) WITH (
  -- v2 format is required for upsert (row-level deletes)
  'format-version' = '2',
  -- turn appends into upserts keyed on the primary key
  'write.upsert.enabled' = 'true',
  'write.distribution-mode' = 'hash',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '3'
);
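
The original shows only the sink DDL for this test; for completeness, here is a sketch of the source side and the insert, assuming a Kafka source mirroring section 2.1 on topic test4_xxzh (the table name KafkaTableTest4_XXZH is hypothetical):

CREATE TABLE IF NOT EXISTS default_catalog.default_database.KafkaTableTest4_XXZH (
    `id` bigint,
    `data` STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'test4_xxzh',
    'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
    'properties.group.id' = 'testGroup',
    'scan.startup.mode' = 'latest-offset',
    'csv.ignore-parse-errors' = 'true',
    'format' = 'csv'
);

insert into hive_iceberg_catalog.ods_base.IcebergTest4_XXZH
    select * from default_catalog.default_database.KafkaTableTest4_XXZH;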
[root@hadoop102 module]#  kafka-console-producer.sh --topic test4_xxzh --broker-list hadoop101:9092,hadoop102:9092,hadoop103:9092
>2,222
>3,333  (pause here and check the table in spark)
>2,bbbb
>3,cccc
>4,444
>5,555

Initial data, before the updates:

spark-sql (default)> select * from  ods_base.IcebergTest4_XXZH ;
id      data
2       222
3       333

After sending the remaining rows, the contents for id=2 and id=3 are both updated:

spark-sql (default)> select * from  ods_base.IcebergTest4_XXZH ;
22/07/26 19:24:58 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
id      data
2       bbbb
4       444
5       555
3       cccc
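
Re-running the duplicate check from 2.4 against this table (same assumptions as before) should now return no rows: the second writes for id=2 and id=3 replaced the earlier rows instead of being appended.

select id, count(*) as cnt
from ods_base.IcebergTest4_XXZH
group by id
having count(*) > 1;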

Summary

1. By default, iceberg appends the data flowing in from Kafka.
2. Upsert mode can be enabled by setting the 'write.upsert.enabled' = 'true' property on the iceberg table; as in section 4, this also requires 'format-version' = '2' and a NOT ENFORCED primary key, so that iceberg knows which rows to replace.

Origin blog.csdn.net/spark_dev/article/details/125932957