Practice data lake iceberg Lesson 35 Stream-batch integrated architecture based on data lake iceberg -- test whether the incremental read is full or incremental only

Series Article Directory

Practice data lake iceberg Lesson 1 Getting started
Practice data lake iceberg Lesson 2 Iceberg's underlying data format based on hadoop
Practice data lake iceberg Lessons 3-4 Use SQL in the sql-client to read data from Kafka into iceberg (upgrading the version to flink 1.12.7)
Practice data lake iceberg Lesson 5 Hive catalog features
Practice data lake iceberg Lesson 6 Solving failures when writing from kafka to iceberg
Practice data lake iceberg Lesson 7 Writing to iceberg in real time
Practice data lake iceberg Lesson 8 Integrating hive and iceberg
Practice data lake iceberg Lesson 9 Merging small files
Practice data lake iceberg Lesson 10 Snapshot deletion
Practice data lake iceberg Lesson 11 Testing the complete partition-table workflow (generating data, building tables, merging, and deleting snapshots)
Practice data lake iceberg Lesson 12 What is a catalog
Practice data lake iceberg Lesson 13 Metadata many times larger than the data files
Practice data lake iceberg Lesson 14 Data merging (to solve metadata expansion over time)
Practice data lake iceberg Lesson 15 Spark installation and iceberg integration (jersey package conflict)
Practice data lake iceberg Lesson 16 Opening the door to iceberg through spark3
Practice data lake iceberg Lesson 17 Configuration for running iceberg with Hadoop 2.7 and spark3 on yarn
Practice data lake iceberg Lesson 18 Startup commands for the various clients that interact with iceberg (commonly used commands)
Practice data lake iceberg Lesson 19 flink count on iceberg returns no result
Practice data lake iceberg Lesson 20 flink + iceberg CDC scenario (version problem, test failed)
Practice data lake iceberg Lesson 21 flink 1.13.5 + iceberg 0.13.1 CDC (INSERT test successful, change operations failed)
Practice data lake iceberg Lesson 22 flink 1.13.5 + iceberg 0.13.1 CDC (CRUD test successful)
Practice data lake iceberg Lesson 23 Restarting flink-sql from a checkpoint
Practice data lake iceberg Lesson 24 Analyzing iceberg metadata in detail
Practice data lake iceberg Lesson 25 Running flink sql in the background; the effect of inserts, deletes, and updates
Practice data lake iceberg Lesson 26 How to set checkpoints
Practice data lake iceberg Lesson 27 Restarting after a Flink cdc test program failure: it can resume from the last checkpoint
Practice data lake iceberg Lesson 28 Deploying packages that do not exist in the public repository to the local repository
Practice data lake iceberg Lesson 29 How to obtain the flink jobId elegantly and efficiently
Practice data lake iceberg Lesson 30 mysql -> iceberg, different clients sometimes have time zone issues
Practice data lake iceberg Lesson 31 Using github's flink-streaming-platform-web tool to manage flink task flows and test the cdc restart scenario
Practice data lake iceberg Lesson 32 DDL statement persistence through the hive catalog
Practice data lake iceberg Lesson 33 Upgrading flink to 1.14; built-in functions support json functions
Practice data lake iceberg Lesson 34 Stream-batch integrated architecture based on data lake iceberg - stream architecture test
Practice data lake iceberg Lesson 35 Stream-batch integrated architecture based on data lake iceberg - test whether incremental reads are full or incremental only
Practice data lake iceberg: more content in the directory




Foreword


In the previous lesson, when we discussed incremental updates, my boss asked: when the job handles an increment, does it read only the incremental data, or does it re-read the historical data as well? According to my understanding it reads only the increment... but the boss did not agree. Well then, let's test it, and that is how this article came about. Readers, feel my pain. Have you ever had an experience like this?

1. Test ideas

Continuing from the Kafka case in the previous lesson: the producer sends data to Kafka, and a flink-sql job writes the Kafka data into iceberg; this lesson then continues by consuming from iceberg with:
select * from hive_iceberg_catalog.ods_base.IcebergSink_XXZH /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */
The question to test: when this SQL's incremental monitor triggers, does it read the full table again, or only the increment?
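
For context, a minimal sketch of what that upstream Kafka-to-iceberg job might look like; the source table name, group id, startup mode and csv format are illustrative assumptions, not the exact DDL from the previous lesson:

    -- Hypothetical Kafka source table; the field layout mirrors the iceberg table used in this article
    CREATE TABLE kafka_source_xxzh (
        `log` STRING,
        `dt`  INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test_xxzh',
        'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
        'properties.group.id' = 'iceberg_test',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'csv'
    );

    -- Continuously write the Kafka records into the iceberg table that this lesson monitors
    INSERT INTO hive_iceberg_catalog.ods_base.IcebergSink_XXZH
    SELECT `log`, `dt` FROM kafka_source_xxzh;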

2. Case test

1. Code

 
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    // Illustrative wrapper class for the main method below
    public class IcebergIncrementalReadTest {

        public static void main(String[] args) throws Exception {

            //TODO 1. Prepare the environment
            //1.1 Stream execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
            //1.2 Table execution environment
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            // Register the hive-backed iceberg catalog
            String sql2 = "CREATE CATALOG hive_iceberg_catalog WITH (\n" +
                    "    'type'='iceberg',\n" +
                    "    'catalog-type'='hive',\n" +
                    "    'uri'='thrift://hadoop101:9083',\n" +
                    "    'clients'='5',\n" +
                    "    'property-version'='1',\n" +
                    "    'warehouse'='hdfs:///user/hive/warehouse/hive_iceberg_catalog'\n" +
                    ")";
            String sql3 = "use catalog hive_iceberg_catalog";
            // Create the target table if it does not exist yet
            String sql4 = "CREATE TABLE IF NOT EXISTS ods_base.IcebergSink_XXZH (\n" +
                    "    `log` STRING,\n" +
                    "    `dt` INT\n" +
                    ")with(\n" +
                    "    'write.metadata.delete-after-commit.enabled'='true',\n" +
                    "    'write.metadata.previous-versions-max'='5',\n" +
                    "    'format-version'='2'\n" +
                    " )";
            // Streaming read: poll the table for newly committed snapshots every second.
            // Note: if the planner rejects the OPTIONS hint, 'table.dynamic-table-options.enabled'
            // may need to be set to true (it defaults to false on some Flink versions).
            String sql6 = "select * from hive_iceberg_catalog.ods_base.IcebergSink_XXZH " +
                    "/*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */";

            tableEnv.executeSql(sql2);
            tableEnv.executeSql(sql3);
            tableEnv.executeSql(sql4);
            // print() keeps consuming and printing rows as new snapshots are committed
            tableEnv.executeSql(sql6).print();

            //TODO 6. Execute the job (print() above blocks for the streaming query, so this line is effectively not reached)
            env.execute();
        }
    }
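
The same test can also be run interactively from the Flink SQL client (sql-client.sh), as in earlier lessons of this series. A minimal sketch, assuming the catalog and table above already exist; the SET statement is only needed if the planner rejects the OPTIONS hint:

    -- Allow /*+ OPTIONS(...) */ hints (off by default on some Flink versions)
    SET 'table.dynamic-table-options.enabled' = 'true';

    USE CATALOG hive_iceberg_catalog;

    SELECT * FROM ods_base.IcebergSink_XXZH
    /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */;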


2. Start the program

The console first outputs the historical data already in the table:

| +I |                              e |    20230101 |
| +I |                              e |    20230101 |
| +I |                              e |    20230101 |
| +I |                             >e |    20230101 |
| +I |                              e |     2023010 |
| +I |                            abc |    20240101 |
| +I |                           abcd |    20240101 |
| +I |                           abcd |    20240101 |
| +I |                              ; |      (NULL) |
| +I ||      (NULL) |
| +I |                              ; |      (NULL) |
| +I ||      (NULL) |
22/06/16 21:09:21 INFO compress.CodecPool: Got brand-new decompressor [.gz]
| +I |                              1 |    20220601 |
22/06/16 21:09:21 INFO compress.CodecPool: Got brand-new decompressor [.gz]
| +I |                              2 |    20220601 |

3. Start the producer and observe the result

[root@hadoop101 lib]# kafka-console-producer.sh --broker-list  hadoop101:9092,hadoop102:9092,hadoop103:9092  --topic test_xxzh

Enter one record:

2,20220606

Observing the console, we find exactly one more incremental record:

22/06/16 21:15:48 INFO compress.CodecPool: Got brand-new decompressor [.gz]
| +I |                              2 |    20220606 |
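
Records written by the upstream job are committed to iceberg as new snapshots (one commit per Flink checkpoint), and the streaming monitor only has to pick up snapshots committed after its last poll. To confirm that the record above arrived as a new snapshot, the table's snapshot history can be inspected, for example through the Spark 3 + iceberg integration set up in Lessons 15-18 of this series (the catalog name registered in Spark is an assumption here):

    -- Iceberg 'snapshots' metadata table, queried from Spark SQL
    SELECT committed_at, snapshot_id, operation, summary
    FROM hive_iceberg_catalog.ods_base.IcebergSink_XXZH.snapshots
    ORDER BY committed_at DESC;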

Summary

select * from hive_iceberg_catalog.ods_base.IcebergSink_XXZH /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */

This confirms that the syntax processes data incrementally: after the initial read of the existing table data, each monitor cycle reads only newly committed data rather than re-reading all of the data.
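
If even that initial full read of the historical data is unwanted, the iceberg Flink read options also let the streaming scan start from a known snapshot. A minimal sketch; the snapshot id below is a placeholder to be replaced with a real id from the table's metadata:

    -- Start the incremental scan after a specific snapshot instead of reading the current table contents first
    SELECT * FROM hive_iceberg_catalog.ods_base.IcebergSink_XXZH
    /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987') */;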


Origin blog.csdn.net/spark_dev/article/details/125323191