Integration of Flink in Hudi data lake technology


1 Install Flink

insert image description here

Use Flink 1.12, deploy a Flink Standalone cluster, and start the services. The steps are as follows:
Step 1: Download the installation package

https://archive.apache.org/dist/flink/flink-1.12.2/

Step 2: Upload the installation package

Upload flink-1.12.2-bin-scala_2.12.tgz to the specified directory on node1

Step 3: Extract the archive

tar -zxvf flink-1.12.2-bin-scala_2.12.tgz -C /export/server/
chown -R root:root /export/server/flink-1.12.2/

Step 4: Create a symbolic link (run in /export/server)

ln -s flink-1.12.2 flink

Step 5: Add the Hadoop dependency jar

cd /export/server/flink/lib
Upload the jar package (for example with rz): flink-shaded-hadoop-2-uber-2.7.5-10.0.jar

insert image description here

Step 6: Start the HDFS cluster

hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode

Step 7: Start the Flink Standalone cluster

/export/server/flink/bin/start-cluster.sh
The two Flink processes can then be seen with jps:

insert image description here

Stop Flink

/export/server/flink/bin/stop-cluster.sh

Step 8: Access the Flink Web UI
URL: http://node1:8081/#/overview
insert image description here

Step 9: Run the official example
Read text file data, compute word frequencies (WordCount), and print the result to the console.

/export/server/flink/bin/flink run
/export/server/flink/examples/batch/WordCount.jar
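
For reference, the packaged example is roughly equivalent to the following DataSet program. This is a minimal sketch written for this article, not the exact source of WordCount.jar; when no --input argument is given, the packaged example falls back to a built-in sample text.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCountSketch {

   public static void main(String[] args) throws Exception {
      // Batch execution environment
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      // Illustrative in-memory input; replace with env.readTextFile(path) to read a text file
      env.fromElements("hello flink", "hello hudi", "flink on hudi")
         // Split each line into (word, 1) pairs
         .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
               for (String word : line.toLowerCase().split("\\W+")) {
                  if (!word.isEmpty()) {
                     out.collect(Tuple2.of(word, 1));
                  }
               }
            }
         })
         // Group by the word (tuple field 0) and sum the counts (tuple field 1)
         .groupBy(0)
         .sum(1)
         // Print word frequencies to the console
         .print();
   }
}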

2 Quick Start

Operate Hudi table data with Flink and run query analysis. The software versions are described below:
insert image description here

2.1 Overview of integrating Flink

Integrating Flink with Hudi essentially comes down to putting the bundle jar hudi-flink-bundle_2.12-0.9.0.jar on the Flink application CLASSPATH; the Flink SQL Connector then supports Hudi as both source and sink. There are two ways to put the jar on the CLASSPATH:
● Method 1: when starting the Flink SQL Client, specify the jar with the parameter [-j xx.jar]
insert image description here

● Method 2: Put the jar package directly into the lib directory of the Flink software installation package [$FLINK_HOME/lib]

insert image description here

Next, use the SQL command line provided by the Flink SQL Client to integrate with Hudi. Start the Flink Standalone cluster first, and modify the configuration file [$FLINK_HOME/conf/flink-conf.yaml] so that each TaskManager allocates 4 slots (taskmanager.numberOfTaskSlots: 4).
insert image description here

2.2 Environment preparation

First start each framework service, then write DDL statements to create tables, and finally run DML statements to insert and query data. Start the environment in the following three steps:
● Step 1: Start the HDFS cluster

[root@node1 ~]# hadoop-daemon.sh start namenode
[root@node1 ~]# hadoop-daemon.sh start datanode

● Step 2: Start the Flink cluster
Since Flink needs to connect to the HDFS file system, set the HADOOP_CLASSPATH variable first, and then start the Standalone cluster service.

[root@node1 ~]# export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

[root@node1 ~]# /export/server/flink/bin/start-cluster.sh

● Step 3: Start the Flink SQL Cli command line

[root@node1 ~]# /export/server/flink/bin/sql-client.sh embedded shell

Use the specified parameter [-j xx.jar] to load the hudi-flink integration package. The command is as follows.

[root@node1 ~]# /export/server/flink/bin/sql-client.sh embedded -j /root/hudi-flink-bundle_2.12-0.9.0.jar shell

In the SQL CLI, set the result display mode with: set execution.result-mode=tableau;
insert image description here

2.3 Create table

Create a table t1 that stores its data in a Hudi table backed by HDFS, with table type MOR (MERGE_ON_READ). The statement is as follows:

CREATE TABLE t1(
  uuid VARCHAR(20), 
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://node1.oldlu.cn:8020/ehualu/hudi-warehouse/hudi-t1',
  'write.tasks' = '1',
  'compaction.tasks' = '1', 
  'table.type' = 'MERGE_ON_READ'
);

Execute the DDL statement on the Flink SQL CLI command line, the screenshot is as follows:
insert image description here

To view the table and structure, the command is as follows:
insert image description here
Next, write the INSERT statement to insert data into the Hudi table.

2.4 Insert data

Insert data into the table t1 created above. t1 is a partitioned table whose partition field is named partition, and the inserted rows use the partition values par1, par2, par3, and par4. The statement is as follows:

INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1');

INSERT INTO t1 VALUES
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

The screenshot of execution in Flink SQL CLI is as follows:
insert image description here

The log information shows that the SQL statement is submitted to the Flink Standalone cluster for execution, and the insert statement is executed successfully.
insert image description here
Query the data storage directory on HDFS:
insert image description here

2.5 Query data

After the data is inserted into the Hudi table through Flink SQL CLI, write an SQL statement to query the data. The statement is as follows:

select * from t1;

As with the insert, the SQL is submitted to the Standalone cluster as a job that queries the data.
insert image description here

Partitions are pruned by adding the partition value to the WHERE clause, as follows:

select * from t1 where `partition` = 'par1';
insert image description here

2.6 Update data

Change the age of id1 from 23 to 27 by executing the following SQL statement:

insert into t1 values ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');

Query the data of the table again, the result is as follows:
insert image description here

Open the Flink Standalone monitoring page on port 8081; you can see that 3 jobs have been executed.
insert image description here

3 Streaming query

After Flink writes data into a Hudi table, the table also supports streaming reads and incremental query analysis.

3.1 Create table

First create a table t2 with streaming read enabled, mapped to the storage path of the previous table t1. The statement is as follows.

CREATE TABLE t2(
  uuid VARCHAR(20), 
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/hudi-t1',
  'table.type' = 'MERGE_ON_READ',
  'read.tasks' = '1', 
  'read.streaming.enabled' = 'true',
  'read.streaming.start-commit' = '20210316134557',
  'read.streaming.check-interval' = '4' 
);

Description of core parameter options:
● read.streaming.enabled set to true enables streaming reads of the table data;
● read.streaming.check-interval sets the interval at which the source checks for new commits to 4 seconds;
● table.type sets the table type to MERGE_ON_READ.
insert image description here
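
The same streaming read can also be issued from a standalone Table API program instead of the SQL client, as the following sketch shows. It assumes the same HDFS path and start commit as the DDL above and requires the hudi-flink bundle on the classpath.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiStreamingReadSketch {

   public static void main(String[] args) {
      // Streaming table environment
      EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
      TableEnvironment tableEnv = TableEnvironment.create(settings);

      // Register the Hudi table with streaming read enabled (same options as the DDL above)
      tableEnv.executeSql(
         "CREATE TABLE t2(\n" +
            "  uuid VARCHAR(20), name VARCHAR(10), age INT, ts TIMESTAMP(3), `partition` VARCHAR(20)\n" +
            ")\n" +
            "PARTITIONED BY (`partition`)\n" +
            "WITH (\n" +
            "  'connector' = 'hudi',\n" +
            "  'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/hudi-t1',\n" +
            "  'table.type' = 'MERGE_ON_READ',\n" +
            "  'read.tasks' = '1',\n" +
            "  'read.streaming.enabled' = 'true',\n" +
            "  'read.streaming.start-commit' = '20210316134557',\n" +
            "  'read.streaming.check-interval' = '4'\n" +
            ")"
      );

      // Continuously print rows as new commits are detected
      tableEnv.executeSql("SELECT * FROM t2").print();
   }
}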

Next, insert more data with SQL and query the t2 table in streaming mode.

3.2 Query data

After creating table t2, the data visible in it is the data written earlier in batch mode.

select * from t2 ;

insert image description here

The query first displays all existing data in the table; the cursor then keeps blinking while the source checks for new commits every 4 seconds and incrementally reads data by commit timestamp.

3.3 Insert data

Open a new terminal, start another Flink SQL CLI, recreate the table t1, and insert one row in batch mode.

CREATE TABLE t1(
  uuid VARCHAR(20), 
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/hudi-t1',
  'write.tasks' = '1',
  'compaction.tasks' = '1', 
  'table.type' = 'MERGE_ON_READ'
);

insert into t1 values ('id9','test',27,TIMESTAMP '1970-01-01 00:00:01','par5');

A few seconds later, the newly inserted row appears in the streaming query on t2.
insert image description here

These simple demonstrations show that the Hudi and Flink integration is fairly complete, covering both reading and writing data.

4 Flink SQL Writer

The hudi-flink module provides a Flink SQL Connector that supports reading data from and writing data to Hudi tables.
insert image description here

Documentation: https://hudi.apache.org/docs/writing_data#flink-sql-writer

4.1 Flink SQL integrated with Kafka

First, configure Flink SQL to integrate Kafka and consume Kafka Topic data in real time. The specific steps are as follows:
insert image description here

● Step 1: Create a topic
Start the Zookeeper and Kafka services. This case demonstrates Flink SQL integrating with Kafka to load data in real time. Connect to the running Kafka service (for example with Kafka Tool) and create a topic: flink-topic.
insert image description here

You can use the command line to create a topic. The specific command is as follows:
# Create a topic: flink-topic

kafka-topics.sh --create --bootstrap-server node1.oldlu.cn:9092 --replication-factor 1 --partitions 1 --topic flink-topic

Start the Flink Standalone cluster service, run the flink-sql command line, and create a table mapped to Kafka.

■ Step 2: Start the HDFS cluster

[root@node1 ~]# hadoop-daemon.sh start namenode
[root@node1 ~]# hadoop-daemon.sh start datanode

■Step 3: Start the Flink cluster
Since Flink needs to connect to the HDFS file system, set the HADOOP_CLASSPATH variable first, and then start the Standalone cluster service.

[root@node1 ~]# export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
[root@node1 ~]# /export/server/flink/bin/start-cluster.sh

■Step 4: Start the Flink SQL Cli command line.
Use the specified parameter [-j xx.jar] to load the hudi-flink integration package. The command is as follows.

[root@node1 ~]# cd /export/server/flink
[root@node1 ~]# bin/sql-client.sh embedded -j /root/flink-sql-connector-kafka_2.11-1.12.0.jar shell

Set the analysis result display mode in SQL Cli to: tableau.
insert image description here

■ Step 5: Create a table mapped to the Kafka topic
The data in the Kafka topic is in CSV format with three fields: user_id, item_id, and behavior. When consuming from Kafka, start from the latest offset. The table creation statement is as follows:

CREATE TABLE tbl_kafka (
  `user_id` BIGINT,
  `item_id` BIGINT,
  `behavior` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'flink-topic',
  'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',
  'properties.group.id' = 'test-group-10001',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'csv'
);

After executing the command, view the table, the screenshot is as follows:
insert image description here

■ Step 6: Send data to the topic in real time and query it in Flink SQL
First, run the SELECT query in the Flink SQL CLI; the screenshot is as follows:
insert image description here

Second, send data to Topic through Kafka Console Producer, the command and data are as follows:
# Producer sends data
kafka-console-producer.sh --broker-list node1.oldlu.cn:9092 --topic flink-topic

/*
1001,90001,click
1001,90001,browser
1001,90001,click
1002,90002,click
1002,90003,click
1003,90001,order
1004,90001,order
*/

After sending the data, observe the Flink SQL interface: the query processes the data in real time. The screenshot is as follows:
insert image description here

At this point Flink SQL is integrated with Kafka and a table is associated with the topic data. Next, write Flink SQL programs to synchronize Kafka data to Hudi tables in real time.

4.2 Flink SQL written to Hudi

Rewrite the previously demonstrated Structured Streaming program as a Flink SQL program: consume topic data from Kafka in real time, transform it, and store it in a Hudi table. The schematic diagram is shown below.
insert image description here

4.2.1 Create Maven Module

insert image description here

Create a Maven module and add dependencies, here using Flink 1.12.2 and Hudi 0.9.0.

<repositories>
    <repository>
        <id>nexus-aliyun</id>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </repository>
    <repository>
        <id>central_maven</id>
        <name>central maven</name>
        <url>https://repo1.maven.org/maven2</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>apache.snapshots</id>
        <name>Apache Development Snapshot Repository</name>
        <url>https://repository.apache.org/content/repositories/snapshots/</url>
        <releases>
            <enabled>false</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>${java.version}</maven.compiler.source>
    <maven.compiler.target>${java.version}</maven.compiler.target>
    <java.version>1.8</java.version>
    <scala.binary.version>2.12</scala.binary.version>
    <flink.version>1.12.2</flink.version>
    <hadoop.version>2.7.3</hadoop.version>
    <mysql.version>8.0.16</mysql.version>
</properties>

<dependencies>
    <!-- Flink Client -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <!-- Flink Table API & SQL -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-json</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-flink-bundle_${scala.binary.version}</artifactId>
        <version>0.9.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-shaded-hadoop-2-uber</artifactId>
        <version>2.7.5-10.0</version>
    </dependency>

    <!-- MySQL/FastJson/lombok -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.68</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.12</version>
    </dependency>

    <!-- slf4j and log4j -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.7</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
        <scope>runtime</scope>
    </dependency>

</dependencies>

<build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <testSourceDirectory>src/test/java</testSourceDirectory>
    <plugins>
        <!-- Compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <!--<encoding>${project.build.sourceEncoding}</encoding>-->
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.18.1</version>
            <configuration>
                <useFile>false</useFile>
                <disableXmlReport>true</disableXmlReport>
                <includes>
                    <include>**/*Test.*</include>
                    <include>**/*Suite.*</include>
                </includes>
            </configuration>
        </plugin>
        <!-- Jar packaging plugin (shade; bundles all dependencies) -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<!-- <mainClass>com.oldlu.flink.batch.FlinkBatchWordCount</mainClass> -->
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

4.2.2 Consuming Kafka data

Create a class FlinkSQLKafakDemo that uses the Flink Table API to consume data from Kafka and extract field values (for later storage in the Hudi table).

package cn.oldlu.hudi;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

/**
 * Flink SQL Connector based implementation: consume topic data in real time, transform it, and store it into a Hudi table.
 */
public class FlinkSQLKafakDemo {

   public static void main(String[] args) {

      // 1 - Get the table execution environment
      EnvironmentSettings settings = EnvironmentSettings
         .newInstance()
         .inStreamingMode()
         .build();
      TableEnvironment tableEnv = TableEnvironment.create(settings) ;

      // 2 - Create the input table: consume data from Kafka
      tableEnv.executeSql(
         "CREATE TABLE order_kafka_source (\n" +
            "  orderId STRING,\n" +
            "  userId STRING,\n" +
            "  orderTime STRING,\n" +
            "  ip STRING,\n" +
            "  orderMoney DOUBLE,\n" +
            "  orderStatus INT\n" +
            ") WITH (\n" +
            "  'connector' = 'kafka',\n" +
            "  'topic' = 'order-topic',\n" +
            "  'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',\n" +
            "  'properties.group.id' = 'gid-1001',\n" +
            "  'scan.startup.mode' = 'latest-offset',\n" +
            "  'format' = 'json',\n" +
            "  'json.fail-on-missing-field' = 'false',\n" +
            "  'json.ignore-parse-errors' = 'true'\n" +
            ")"
      );

      // 3 - Transform: extract the order date from the order time as the Hudi partition field value
      Table etlTable = tableEnv
         .from("order_kafka_source")
         .addColumns(
            $("orderTime").substring(0, 10).as("partition_day")
         )
         .addColumns(
            $("orderId").substring(0, 17).as("ts")
         );
      tableEnv.createTemporaryView("view_order", etlTable);

      // 4 - Query the data
      tableEnv.executeSql("SELECT * FROM view_order").print();
   }

}

Run the streaming application together with the simulated data program and watch the console output.
insert image description here
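
The simulated data program itself is not listed in this article. A minimal, hypothetical producer along the following lines is enough to feed order-topic with JSON records matching the order_kafka_source schema; the class name, the generated field values, and the use of kafka-clients and Fastjson here are assumptions, not the original simulator.

package cn.oldlu.hudi;

import com.alibaba.fastjson.JSONObject;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.LocalDateTime;
import java.util.Properties;
import java.util.Random;
import java.util.UUID;

/**
 * Hypothetical order simulator: sends JSON orders to Kafka for the Flink SQL jobs above to consume.
 */
public class MockOrderProducer {

   public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put("bootstrap.servers", "node1.oldlu.cn:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      Random random = new Random();
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
         while (true) {
            // Build one order record matching the order_kafka_source schema
            JSONObject order = new JSONObject();
            order.put("orderId", LocalDateTime.now().toString().replaceAll("[-T:.]", "")
                  + UUID.randomUUID().toString().substring(0, 6));
            order.put("userId", "u-" + random.nextInt(10000));
            order.put("orderTime", LocalDateTime.now().toString().replace("T", " "));
            order.put("ip", "192.168.1." + random.nextInt(255));
            order.put("orderMoney", Math.round(random.nextDouble() * 10000) / 100.0);
            order.put("orderStatus", random.nextInt(2));

            // Send the JSON string to the topic consumed by the Flink SQL job
            producer.send(new ProducerRecord<>("order-topic", order.toJSONString()));
            Thread.sleep(500);
         }
      }
   }
}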

4.2.3 Save data to Hudi

Write the table creation DDL statement, map it to the Hudi table, and specify related attributes: primary key field, table type, etc.

CREATE TABLE order_hudi_sink (
  orderId STRING PRIMARY KEY NOT ENFORCED,
  userId STRING,
  orderTime STRING,
  ip STRING,
  orderMoney DOUBLE,
  orderStatus INT,
  ts STRING,
  partition_day STRING
)
PARTITIONED BY (partition_day)
WITH (
    'connector' = 'hudi',
    'path' = 'file:///D:/flink_hudi_order',
    'table.type' = 'MERGE_ON_READ',
    'write.operation' = 'upsert',
    'hoodie.datasource.write.recordkey.field'= 'orderId',
    'write.precombine.field' = 'ts',
    'write.tasks'= '1'
);

The Hudi table data here is saved to a directory on the local file system (LocalFS). Data is written into the Hudi table with an INSERT INTO ... SELECT subquery, as follows:

-- Insert via subquery: INSERT ... SELECT

INSERT INTO order_hudi_sink
SELECT
    orderId, userId, orderTime, ip, orderMoney, orderStatus,
    substring(orderId, 0, 17) AS ts, substring(orderTime, 0, 10) AS partition_day
FROM order_kafka_source ;

Create a class FlinkSQLHudiDemo that consumes data from Kafka, transforms it, and saves it to the Hudi table.

package cn.oldlu.hudi;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

/**
 * Flink SQL Connector based implementation: consume topic data in real time, transform it, and store it into a Hudi table in real time.
 */
public class FlinkSQLHudiDemo {

   public static void main(String[] args) {

      System.setProperty("HADOOP_USER_NAME","root");

      // 1 - Get the stream and table execution environments
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      env.setParallelism(1);
      env.enableCheckpointing(5000);
      EnvironmentSettings settings = EnvironmentSettings
         .newInstance()
         .inStreamingMode()
         .build();
      StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings) ;

      // 2 - Create the input table: consume data from Kafka
      tableEnv.executeSql(
         "CREATE TABLE order_kafka_source (\n" +
            "  orderId STRING,\n" +
            "  userId STRING,\n" +
            "  orderTime STRING,\n" +
            "  ip STRING,\n" +
            "  orderMoney DOUBLE,\n" +
            "  orderStatus INT\n" +
            ") WITH (\n" +
            "  'connector' = 'kafka',\n" +
            "  'topic' = 'order-topic',\n" +
            "  'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',\n" +
            "  'properties.group.id' = 'gid-1001',\n" +
            "  'scan.startup.mode' = 'latest-offset',\n" +
            "  'format' = 'json',\n" +
            "  'json.fail-on-missing-field' = 'false',\n" +
            "  'json.ignore-parse-errors' = 'true'\n" +
            ")"
      );

      // 3 - Transform: extract the order date from the order time as the Hudi partition field value
      Table etlTable = tableEnv
         .from("order_kafka_source")
         .addColumns(
            $("orderId").substring(0, 17).as("ts")
         )
         .addColumns(
            $("orderTime").substring(0, 10).as("partition_day")
         );
      tableEnv.createTemporaryView("view_order", etlTable);

      // 4 - Define the output table: save data into the Hudi table
      tableEnv.executeSql(
         "CREATE TABLE order_hudi_sink (\n" +
            "  orderId STRING PRIMARY KEY NOT ENFORCED,\n" +
            "  userId STRING,\n" +
            "  orderTime STRING,\n" +
            "  ip STRING,\n" +
            "  orderMoney DOUBLE,\n" +
            "  orderStatus INT,\n" +
            "  ts STRING,\n" +
            "  partition_day STRING\n" +
            ")\n" +
            "PARTITIONED BY (partition_day) \n" +
            "WITH (\n" +
            "  'connector' = 'hudi',\n" +
            "  'path' = 'file:///D:/flink_hudi_order',\n" +
            "  'table.type' = 'MERGE_ON_READ',\n" +
            "  'write.operation' = 'upsert',\n" +
            "  'hoodie.datasource.write.recordkey.field' = 'orderId',\n" +
            "  'write.precombine.field' = 'ts',\n" +
            "  'write.tasks' = '1'\n" +
            ")"
      );

      // 5 - Write data into the output table via a subquery
      tableEnv.executeSql(
         "INSERT INTO order_hudi_sink\n" +
            "SELECT\n" +
            "  orderId, userId, orderTime, ip, orderMoney, orderStatus, ts, partition_day\n" +
            "FROM view_order"
      );

   }

}

Run the streaming program written above and check the local file system directory, which now contains the Hudi table's data and metadata:
insert image description here

4.2.4 Loading Hudi table data

Create a class FlinkSQLReadDemo that loads data from the Hudi table in streaming mode. It creates a table with the same schema, mapped to the Hudi table's storage directory. The DDL statement is as follows:

CREATE TABLE order_hudi(
  orderId STRING PRIMARY KEY NOT ENFORCED,
  userId STRING,
  orderTime STRING,
  ip STRING,
  orderMoney DOUBLE,
  orderStatus INT,
  ts STRING,
  partition_day STRING
)
PARTITIONED BY (partition_day)
WITH (
    'connector' = 'hudi',
    'path' = 'file:///D:/flink_hudi_order',
    'table.type' = 'MERGE_ON_READ',
    'read.streaming.enabled' = 'true',
    'read.streaming.check-interval' = '4'
);

The complete Flink SQL streaming program code is as follows:

package cn.oldlu.hudi;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/**
 * Flink SQL Connector based implementation: load data from the Hudi table and query it with SQL.
 */
public class FlinkSQLReadDemo {

   public static void main(String[] args) {

      System.setProperty("HADOOP_USER_NAME","root");

      // 1 - Get the table execution environment
      EnvironmentSettings settings = EnvironmentSettings
         .newInstance()
         .inStreamingMode()
         .build();
      TableEnvironment tableEnv = TableEnvironment.create(settings) ;

      // 2 - Create the input table: load the Hudi table to query its data
      tableEnv.executeSql(
         "CREATE TABLE order_hudi(\n" +
            "  orderId STRING PRIMARY KEY NOT ENFORCED,\n" +
            "  userId STRING,\n" +
            "  orderTime STRING,\n" +
            "  ip STRING,\n" +
            "  orderMoney DOUBLE,\n" +
            "  orderStatus INT,\n" +
            "  ts STRING,\n" +
            "  partition_day STRING\n" +
            ")\n" +
            "PARTITIONED BY (partition_day)\n" +
            "WITH (\n" +
            "  'connector' = 'hudi',\n" +
            "  'path' = 'file:///D:/flink_hudi_order',\n" +
            "  'table.type' = 'MERGE_ON_READ',\n" +
            "  'read.streaming.enabled' = 'true',\n" +
            "  'read.streaming.check-interval' = '4'\n" +
            ")"
      );

      // 3 - Query the Hudi table and print the results
      tableEnv.executeSql(
         "SELECT \n" +
            "  orderId, userId, orderTime, ip, orderMoney, orderStatus, ts ,partition_day \n" +
            "FROM order_hudi"
      ).print();

   }

}

Run the streaming program and load the Hudi table data, the results are as follows:
insert image description here

4.3 Flink SQL Client writes to Hudi

Start the Flink Standalone cluster, run the SQL Client command line client, execute DDL and DML statements, and manipulate data.

4.3.1 Integrated environment

■Configure Flink cluster
Modify $FLINK_HOME/conf/flink-conf.yaml file

jobmanager.rpc.address: node1.oldlu.cn
jobmanager.memory.process.size: 1024m
taskmanager.memory.process.size: 2048m
taskmanager.numberOfTaskSlots: 4

classloader.check-leaked-classloader: false
classloader.resolve-order: parent-first

execution.checkpointing.interval: 3000
state.backend: rocksdb
state.checkpoints.dir: hdfs://node1.oldlu.cn:8020/flink/flink-checkpoints
state.savepoints.dir: hdfs://node1.oldlu.cn:8020/flink/flink-savepoints
state.backend.incremental: true

● Put the Hudi-Flink bundle jar and the other required jar packages into the $FLINK_HOME/lib directory
insert image description here

● Start the Standalone cluster

export HADOOP_CLASSPATH=`/export/server/hadoop/bin/hadoop classpath`
/export/server/flink/bin/start-cluster.sh

● Start the SQL Client; it is best to specify the Hudi integration jar again

/export/server/flink/bin/sql-client.sh embedded -j
/export/server/flink/lib/hudi-flink-bundle_2.12-0.9.0.jar shell

● Set properties

set execution.result-mode=tableau;
set execution.checkpointing.interval=3sec;

insert image description here

4.3.2 Execute SQL

First create an input table that consumes data from Kafka, then write SQL to extract field values, then create an output table that saves the data into the Hudi table, and finally write SQL to query the data from the Hudi table.
● Step 1. Create an input table and associate Kafka Topic
-- Input table: Kafka Source

CREATE TABLE order_kafka_source (
  orderId STRING,
  userId STRING,
  orderTime STRING,
  ip STRING,
  orderMoney DOUBLE,
  orderStatus INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'order-topic',
  'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',
  'properties.group.id' = 'gid-1001',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json',
  'json.fail-on-missing-field' = 'false',
  'json.ignore-parse-errors' = 'true'
);

SELECT orderId, userId, orderTime, ip, orderMoney, orderStatus FROM order_kafka_source ;

● Step 2: Process the Kafka message data and extract field values

SELECT 
  orderId, userId, orderTime, ip, orderMoney, orderStatus, 
  substring(orderId, 0, 17) AS ts, substring(orderTime, 0, 10) AS partition_day 
FROM order_kafka_source ;

● Step 3: Create the output table that saves data to the Hudi table, setting the related properties
-- Output table: Hudi Sink

CREATE TABLE order_hudi_sink (
  orderId STRING PRIMARY KEY NOT ENFORCED,
  userId STRING,
  orderTime STRING,
  ip STRING,
  orderMoney DOUBLE,
  orderStatus INT,
  ts STRING,
  partition_day STRING
)
PARTITIONED BY (partition_day) 
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/order_hudi_sink',
  'table.type' = 'MERGE_ON_READ',
  'write.operation' = 'upsert',
  'hoodie.datasource.write.recordkey.field'= 'orderId',
  'write.precombine.field' = 'ts',
  'write.tasks'= '1',
  'compaction.tasks' = '1', 
  'compaction.async.enabled' = 'true', 
  'compaction.trigger.strategy' = 'num_commits', 
  'compaction.delta_commits' = '1'
);

● Step 4: Use an INSERT INTO statement to save the data into the Hudi table
-- Insert via subquery: INSERT ... SELECT ...

INSERT INTO order_hudi_sink 
SELECT
  orderId, userId, orderTime, ip, orderMoney, orderStatus,
  substring(orderId, 0, 17) AS ts, substring(orderTime, 0, 10) AS partition_day 
FROM order_kafka_source ;

At this point the Flink job is submitted and runs on the Flink Standalone cluster, as shown below:
insert image description here

Once the simulated transaction order data program is running, data is sent to Kafka and then transformed and saved to the Hudi table. The screenshot is as follows:
insert image description here

■ Step 5: Write a SELECT statement to query the transaction order data in the Hudi table
-- Query Hudi table data

SELECT * FROM order_hudi_sink ;

insert image description here

5 Hudi CDC

CDC stands for Change Data Capture. It is oriented toward database changes and is a very common technique in the database field, used to capture changes in a database and deliver the changed data downstream.
insert image description here

In the industry there are two main kinds of CDC. One is query-based: a client queries the source database tables for changed data via SQL and sends it out. The other is log-based, which is the widely used approach: changed records are written to the binlog, the binlog is parsed and written into a message system, or processed directly with Flink CDC.
insert image description here

■ Query-based: this CDC technique is intrusive and needs to execute SQL statements against the data source, which can impact its performance; often an entire table containing a large number of records must be scanned.
■ Log-based: this CDC technique is non-intrusive and does not execute SQL statements against the data source. It identifies inserts, updates, and deletes on the source tables by reading the database's log files.

5.1 CDC data into the lake

The architecture for CDC data entering the lake is very simple: upstream data sources of all kinds, such as DB change data, event streams, and various external sources, are written into the table through the change stream, and external query analysis is then performed on it.

insert image description here

Typical CDC ingestion links: the first link is the one most companies adopt, where CDC data is first imported into Kafka or Pulsar by a CDC tool and then written into Hudi by Flink or Spark streaming consumers. The second architecture connects Flink CDC directly to the upstream MySQL data source and writes directly into the downstream Hudi table.
insert image description here

5.2 Flink CDC Hudi

Using Flink CDC, collect MySQL table data in real time, apply ETL transformations, and finally store the result in a Hudi table.
insert image description here

5.2.1 Business requirements

Create a table in MySQL and add data to it in real time; write the data into a Hudi table through Flink CDC; integrate Hudi with Hive so that the Hive table is created and partitions are added automatically; finally, query and analyze the data from the Hive Beeline terminal.
insert image description here

Hudi tables and Hive tables are associated and integrated automatically, which requires recompiling the Hudi source code with the Hive version specified and the Hive dependency jars included. The specific steps are as follows.
● Modify the Flink and Hive dependency versions in Hudi's build configuration
Reason: when compiled, the current Hudi version bundles flink-sql-connector-hive by default, which conflicts with the flink-sql-connector-hive jar under the Flink lib directory. Therefore only the Hive version is changed during compilation.
File: hudi-0.9.0/packaging/hudi-flink-bundle/pom.xml
insert image description here

● Compile Hudi source code

mvn clean install -DskipTests -Drat.skip=true -Dscala-2.12 -Dspark3
-Pflink-bundle-shade-hive2

After the compilation completes, two jar packages are important:
 hudi-flink-bundle_2.12-0.9.0.jar, located in hudi-0.9.0/packaging/hudi-flink-bundle/target, used by Flink to write and read data; copy it to the $FLINK_HOME/lib directory (if a jar with the same name already exists there, delete it first, then copy).
 hudi-hadoop-mr-bundle-0.9.0.jar, used by Hive to read Hudi data; copy it to the $HIVE_HOME/lib directory.
■ Put the jar package corresponding to Flink CDC MySQL into the $FLINK_HOME/lib directory
flink-sql-connector-mysql-cdc-1.3.0.jar

At this point the $FLINK_HOME/lib directory should contain the following required jar packages; all of them are indispensable, and pay attention to the version numbers.
insert image description here

5.2.2 Create MySQL table

First enable the MySQL binlog, then restart the MySQL service, and finally create a table.
■ Step 1: Enable the MySQL binlog

[root@node1 ~]# vim /etc/my.cnf

Add content under [mysqld]:

server-id=2
log-bin=mysql-bin
binlog_format=row
expire_logs_days=15
binlog_row_image=full

insert image description here

■ Step 2: Restart MySQL Server

service mysqld restart

Log in to the MySQL Client command line to check whether it takes effect.
insert image description here

■ Step 3: Create a database and table in MySQL
-- Create a database and table in MySQL

create database test ;
create table test.tbl_users(
   id bigint auto_increment primary key,
   name varchar(20) null,
   birthday timestamp default CURRENT_TIMESTAMP not null,
   ts timestamp default CURRENT_TIMESTAMP not null
);

insert image description here

5.2.3 Create CDC table

First start the HDFS service, Hive MetaStore and HiveServer2 services, and the Flink Standalone cluster, then run the SQL Client, and finally create a table associated with a MySQL table, using the MySQL CDC method.
● Start the HDFS service: start the NameNode and the DataNode
# Start the HDFS service

hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode

● Start the Hive services: MetaStore and HiveServer2
# Start the Hive services

/export/server/hive/bin/start-metastore.sh
/export/server/hive/bin/start-hiveserver2.sh

■ Start the Flink Standalone cluster
# Start the Flink Standalone cluster

export HADOOP_CLASSPATH=`/export/server/hadoop/bin/hadoop classpath`
/export/server/flink/bin/start-cluster.sh

■ Start the SQL Client

/export/server/flink/bin/sql-client.sh embedded -j
/export/server/flink/lib/hudi-flink-bundle_2.12-0.9.0.jar shell

Set properties:

set execution.result-mode=tableau;
set execution.checkpointing.interval=3sec;

● Create the input table, associated with the MySQL table through the MySQL CDC connector
-- Create the table in the Flink SQL Client

CREATE TABLE users_source_mysql (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  birthday TIMESTAMP(3),
  ts TIMESTAMP(3)
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'node1.oldlu.cn',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'server-time-zone' = 'Asia/Shanghai',
'debezium.snapshot.mode' = 'initial',
'database-name' = 'test',
'table-name' = 'tbl_users'
);

Query the structure of the table, where id is the primary key, and ts is the data merge field.
insert image description here

● Query the CDC table data
-- Query data

select * from users_source_mysql;

insert image description here

● Open the MySQL Client, execute DML statements, and insert data

insert into test.tbl_users (name) values ('zhangsan');
insert into test.tbl_users (name) values ('lisi');
insert into test.tbl_users (name) values ('wangwu');
insert into test.tbl_users (name) values ('laoda');
insert into test.tbl_users (name) values ('laoer');

5.2.4 Creating Views

Create a temporary view that adds a partition column part, to facilitate subsequent synchronization to the Hive partitioned table.
-- Create a temporary view with a partition column for later synchronization to the Hive partitioned table

create view view_users_cdc AS
SELECT *, DATE_FORMAT(birthday, 'yyyyMMdd') as part FROM users_source_mysql;

View the data in the view:

select * from view_users_cdc;

insert image description here

5.2.5 Create Hudi table

Create the CDC Hudi sink table, which is automatically synchronized to a Hive partitioned table. The DDL statement is as follows:

CREATE TABLE users_sink_hudi_hive(
id bigint ,
name string,
birthday TIMESTAMP(3),
ts TIMESTAMP(3),
part VARCHAR(20),
primary key(id) not enforced
)
PARTITIONED BY (part)
with(
'connector'='hudi',
'path'= 'hdfs://node1.oldlu.cn:8020/ehualu/hudi-warehouse/users_sink_hudi_hive', 
'table.type'= 'MERGE_ON_READ',
'hoodie.datasource.write.recordkey.field'= 'id', 
'write.precombine.field'= 'ts',
'write.tasks'= '1',
'write.rate.limit'= '2000', 
'compaction.tasks'= '1', 
'compaction.async.enabled'= 'true',
'compaction.trigger.strategy'= 'num_commits',
'compaction.delta_commits'= '1',
'changelog.enabled'= 'true',
'read.streaming.enabled'= 'true',
'read.streaming.check-interval'= '3',
'hive_sync.enable'= 'true',
'hive_sync.mode'= 'hms',
'hive_sync.metastore.uris'= 'thrift://node1.oldlu.cn:9083',
'hive_sync.jdbc_url'= 'jdbc:hive2://node1.oldlu.cn:10000',
'hive_sync.table'= 'users_sink_hudi_hive',
'hive_sync.db'= 'default',
'hive_sync.username'= 'root',
'hive_sync.password'= '123456',
'hive_sync.support_timestamp'= 'true'
);

The Hudi table type here is MOR (Merge On Read), which supports snapshot queries, incremental queries, and read-optimized (near real-time) queries. Data is stored as columnar files (Parquet) plus row-based log files (Avro). Updates are recorded to delta files and then compacted, synchronously or asynchronously, to produce new versions of the columnar files.
insert image description here

5.2.6 Write data to Hudi table

Write the INSERT statement, query the data from the view, and then write it into the Hudi table. The statement is as follows:

insert into users_sink_hudi_hive select id, name, birthday, ts, part
from view_users_cdc;

Flink web UI DAG diagram:
insert image description here

Hudi file directory situation on HDFS:
insert image description here

To query Hudi table data, the SELECT statement is as follows:

select * from users_sink_hudi_hive;

insert image description here

5.2.7 Hive table query

The hudi-hadoop-mr-bundle-0.9.0.jar package must be placed under $HIVE_HOME/lib.
insert image description here

Start the beeline client in Hive and connect to the HiveServer2 service:

/export/server/hive/bin/beeline -u jdbc:hive2://node1.oldlu.cn:10000
-n root -p 123456

insert image description here

Two tables have been generated automatically for the Hudi MOR table:
■ users_sink_hudi_hive_ro: the ro table (read optimized table) synchronized from the MOR table exposes only the compacted Parquet files. It is queried much like a COW table; after setting the hiveInputFormat it can be queried like an ordinary Hive table.
■ users_sink_hudi_hive_rt: the rt table exposes the incremental (real-time) view and is mainly used for incremental queries. The ro table can only read Parquet file data, while the rt table can read both Parquet file data and log file data.
Check the structure of the automatically generated table users_sink_hudi_hive_ro:

CREATE EXTERNAL TABLE `users_sink_hudi_hive_ro`(
  `_hoodie_commit_time` string COMMENT '', 
  `_hoodie_commit_seqno` string COMMENT '', 
  `_hoodie_record_key` string COMMENT '', 
  `_hoodie_partition_path` string COMMENT '', 
  `_hoodie_file_name` string COMMENT '', 
  `_hoodie_operation` string COMMENT '', 
  `id` bigint COMMENT '', 
  `name` string COMMENT '', 
  `birthday` bigint COMMENT '', 
  `ts` bigint COMMENT '')
PARTITIONED BY ( 
  `part` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='true', 
  'path'='hdfs://node1.oldlu.cn:8020/users_sink_hudi_hive') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://node1.oldlu.cn:8020/users_sink_hudi_hive'
TBLPROPERTIES (
  'last_commit_time_sync'='20211125095818', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='1', 
  'spark.sql.sources.schema.numParts'='1', 
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_operation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"birthday\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ts\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', 
  'spark.sql.sources.schema.partCol.0'='partition', 
  'transient_lastDdlTime'='1637743860')

View the partition information of the automatically generated table:

show partitions users_sink_hudi_hive_ro;
show partitions users_sink_hudi_hive_rt;

insert image description here

Query Hive partition table data

set hive.exec.mode.local.auto=true;
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hive.mapred.mode=nonstrict;

select id, name, birthday, ts, part from users_sink_hudi_hive_ro;

insert image description here

Specify the partition field to filter and query data

select name, ts from users_sink_hudi_hive_ro where part = '20211125';
select name, ts from users_sink_hudi_hive_rt where part = '20211125';

insert image description here

5.3 Hudi Client operates Hudi table

Enter the Hudi client command line: hudi-0.9.0/hudi-cli/hudi-cli.sh
insert image description here

Connect to Hudi table and view table information

connect --path hdfs://node1.oldlu.cn:8020/users_sink_hudi_hive

insert image description here

View Hudi commit information

commits show --sortBy "CommitTime"

insert image description here
View the Hudi compaction plans

compactions show all
insert image description here
