1 Install Flink
This section uses Flink 1.12.2 deployed as a Standalone cluster. The steps to install it and start the service are as follows:
step1, download the installation package
https://archive.apache.org/dist/flink/flink-1.12.2/
step2, upload the software package
Upload flink-1.12.2-bin-scala_2.12.tgz to the specified directory on node1
step3, unzip
tar -zxvf flink-1.12.2-bin-scala_2.12.tgz -C /export/server/
chown -R root:root /export/server/flink-1.12.2/
step4, create a symbolic link (run in /export/server)
ln -s flink-1.12.2 flink
step5, add hadoop dependent jar package
cd /export/server/flink/lib
Use rz to upload the jar package: flink-shaded-hadoop-2-uber-2.7.5-10.0.jar
step6, start HDFS cluster
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
step7, start Flink local cluster
/export/server/flink/bin/start-cluster.sh
Use jps to check that the two Flink processes (JobManager and TaskManager) are running
Stop Flink
/export/server/flink/bin/stop-cluster.sh
step8, access Flink's Web UI
URL: node1:8081/#/overview
step9, execute the official example
Read data from a text file, run the WordCount word-frequency job, and print the result to the console.
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar
2 Quick Start
This section uses Flink to operate on Hudi table data and run query analysis. The software versions are Flink 1.12.2 and Hudi 0.9.0.
2.1 Overview of integrating Flink
Integrating Flink with Hudi essentially means adding the jar package hudi-flink-bundle_2.12-0.9.0.jar to the CLASSPATH of the Flink application. For the Flink SQL Connector to support Hudi as a Source and Sink, there are two ways to put the jar package on the CLASSPATH:
● Method 1: When running the Flink SQL Client command line, specify the jar package through the parameter [-j xx.jar]
● Method 2: Put the jar package directly into the lib directory of the Flink software installation package [$FLINK_HOME/lib]
Next, use the SQL command line provided by the Flink SQL Client to work with Hudi. This requires a running Flink Standalone cluster, and the configuration file [$FLINK_HOME/conf/flink-conf.yaml] must be modified so that the number of Slots allocated to the TaskManager is 4 (taskmanager.numberOfTaskSlots: 4).
2.2 Environment preparation
First start each framework service, then write DDL statements to create tables, and finally write DML statements to insert data and run queries. Start the environment in the following three steps:
● Step 1: Start the HDFS cluster
[root@node1 ~]# hadoop-daemon.sh start namenode
[root@node1 ~]# hadoop-daemon.sh start datanode
● Step 2: Start the Flink cluster
Since Flink needs to connect to the HDFS file system, set the HADOOP_CLASSPATH variable first, and then start the Standalone cluster service.
[root@node1 ~]# export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
[root@node1 ~]# /export/server/flink/bin/start-cluster.sh
● Step 3: Start the Flink SQL Cli command line
[root@node1 ~]# /export/server/flink/bin/sql-client.sh embedded shell
Use the parameter [-j xx.jar] to load the hudi-flink integration package. The command is as follows.
[root@node1 ~]# /export/server/flink/bin/sql-client.sh embedded -j /root/hudi-flink-bundle_2.11-0.9.0.jar shell
In the SQL CLI, set the display mode for query results to tableau:
set execution.result-mode=tableau;
2.3 Create table
Create table t1, which stores its data in a Hudi table on HDFS. The table type is MOR (MERGE_ON_READ). The statement is as follows:
CREATE TABLE t1(
uuid VARCHAR(20),
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/hudi-t1',
'write.tasks' = '1',
'compaction.tasks' = '1',
'table.type' = 'MERGE_ON_READ'
);
Execute the DDL statement on the Flink SQL CLI command line, then check the table and its structure (for example with show tables; and desc t1;).
Next, write INSERT statements to insert data into the Hudi table.
2.4 Insert data
Insert data into table t1 created above. t1 is a partitioned table whose partition field is named partition, and the partition values used when inserting data are par1, par2, par3 and par4. The statements are as follows:
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1');
INSERT INTO t1 VALUES
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
The screenshot of execution in Flink SQL CLI is as follows:
The log information shows that the SQL statement is submitted to the Flink Standalone cluster for execution, and the insert statement is executed successfully.
Query the data storage directory on HDFS:
2.5 Query data
After the data is inserted into the Hudi table through Flink SQL CLI, write an SQL statement to query the data. The statement is as follows:
select * from t1;
As with inserting data, the SQL is submitted to the Standalone cluster, which runs a job to query the data.
Partition pruning is applied by filtering on the partition path in the WHERE clause, as follows:
select * from t1 where `partition` = 'par1';
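The effect of filtering on the partition column can be pictured with a small plain-Java model. This is only a sketch under assumptions: it treats each partition value as its own group of rows (standing in for a directory), and the class PartitionPruningSketch and its methods are hypothetical names, not part of Hudi or Flink.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of partition pruning; this is not Hudi's reader. */
final class PartitionPruningSketch {
    // Rows grouped by partition value, standing in for one directory per partition.
    private final Map<String, List<String>> rowsByPartition = new LinkedHashMap<>();
    // Number of partitions the most recent scan had to open.
    private int opened;

    void add(String partition, String uuid) {
        rowsByPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(uuid);
    }

    /** Full scan: every partition is opened. */
    List<String> scanAll() {
        opened = 0;
        List<String> out = new ArrayList<>();
        for (List<String> rows : rowsByPartition.values()) {
            opened++;
            out.addAll(rows);
        }
        return out;
    }

    /** Pruned scan: only the partition named in the filter is opened. */
    List<String> scanPartition(String partition) {
        opened = 0;
        List<String> rows = rowsByPartition.get(partition);
        if (rows == null) {
            return new ArrayList<>();
        }
        opened = 1;
        return new ArrayList<>(rows);
    }

    int partitionsOpened() {
        return opened;
    }

    public static void main(String[] args) {
        PartitionPruningSketch t1 = new PartitionPruningSketch();
        t1.add("par1", "id1");
        t1.add("par1", "id2");
        t1.add("par2", "id3");
        t1.add("par3", "id5");
        System.out.println(t1.scanPartition("par1") + " read from " + t1.partitionsOpened() + " partition(s)");
        System.out.println(t1.scanAll() + " read from " + t1.partitionsOpened() + " partition(s)");
    }
}
```

The pruned scan touches one group instead of all of them, which is the saving the WHERE clause on the partition column is meant to give.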
2.6 Update data
Change the age of id1 from 23 to 27 by executing the following SQL statement:
insert into t1 values ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');
Query the data of the table again, the result is as follows:
Open the Flink Standalone monitoring page on port 8081, and you can see that 3 jobs have been executed.
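The update in 2.6 is written as an INSERT: a row whose key already exists replaces the stored row instead of adding a second one. A minimal plain-Java sketch of that behavior follows. It assumes uuid acts as the record key (the DDL for t1 does not declare a key explicitly), and UpsertSketch and Row are hypothetical names used only for illustration, not Hudi classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of upsert-by-key; this is not Hudi's implementation. */
final class UpsertSketch {
    /** One row of t1, reduced to the fields needed here. */
    static final class Row {
        final String uuid;
        final String name;
        final int age;

        Row(String uuid, String name, int age) {
            this.uuid = uuid;
            this.name = name;
            this.age = age;
        }
    }

    // Assumption: uuid is the record key, so the map is keyed by uuid.
    private final Map<String, Row> rowsByKey = new LinkedHashMap<>();

    /** Inserts the row, or overwrites the row that has the same key. */
    void upsert(Row row) {
        rowsByKey.put(row.uuid, row);
    }

    Row get(String uuid) {
        return rowsByKey.get(uuid);
    }

    int size() {
        return rowsByKey.size();
    }

    public static void main(String[] args) {
        UpsertSketch t1 = new UpsertSketch();
        t1.upsert(new Row("id1", "Danny", 23));
        t1.upsert(new Row("id2", "Stephen", 33));
        t1.upsert(new Row("id1", "Danny", 27)); // same key: replaces the first row
        System.out.println(t1.size() + " rows, age of id1 = " + t1.get("id1").age);
    }
}
```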
3 Streaming query
When working with Hudi table data, Flink also supports loading the data in a streaming manner for incremental query analysis.
3.1 Create table
First create table t2 with the properties for streaming reads, and map it to the storage path of the previous table t1. The statement is as follows.
CREATE TABLE t2(
uuid VARCHAR(20),
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/hudi-t1',
'table.type' = 'MERGE_ON_READ',
'read.tasks' = '1',
'read.streaming.enabled' = 'true',
'read.streaming.start-commit' = '20210316134557',
'read.streaming.check-interval' = '4'
);
Description of core parameter options:
● read.streaming.enabled is set to true, indicating that the table data is read in streaming mode;
● read.streaming.check-interval sets the interval at which the source checks for new commits to 4s;
● table.type sets the table type to MERGE_ON_READ;
Next, query table t2 in streaming mode, and then insert new data with SQL.
3.2 Query data
After table t2 is created, the data in the table is the data written earlier in batch mode.
select * from t2 ;
The query first displays all the data in the table and then keeps running (the cursor keeps blinking): every 4 seconds it checks for new commits and queries incrementally according to the commit timestamp.
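The incremental behavior can be sketched in plain Java as a cursor over commit timestamps. This is a simplified model under assumptions: the timeline is a list of commit times in the fixed-width yyyyMMddHHmmss form and in ascending order, the start commit is treated as inclusive, and IncrementalReadSketch is a hypothetical name rather than a Hudi class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a streaming read that only returns commits it has not seen yet. */
final class IncrementalReadSketch {
    private final String startCommit;
    // Last commit handed out; null until the first poll returns something.
    private String lastConsumed;

    IncrementalReadSketch(String startCommit) {
        this.startCommit = startCommit;
    }

    /**
     * One check of the timeline (the tutorial runs a check every 4 seconds).
     * Assumes the timeline is sorted ascending; fixed-width timestamps
     * compare chronologically when compared as strings.
     */
    List<String> poll(List<String> timeline) {
        List<String> fresh = new ArrayList<>();
        for (String commit : timeline) {
            boolean atOrAfterStart = commit.compareTo(startCommit) >= 0;
            boolean unseen = lastConsumed == null || commit.compareTo(lastConsumed) > 0;
            if (atOrAfterStart && unseen) {
                fresh.add(commit);
            }
        }
        if (!fresh.isEmpty()) {
            lastConsumed = fresh.get(fresh.size() - 1);
        }
        return fresh;
    }

    public static void main(String[] args) {
        IncrementalReadSketch reader = new IncrementalReadSketch("20210316134557");
        List<String> timeline = new ArrayList<>();
        timeline.add("20210316134557");
        timeline.add("20210316134601");
        System.out.println("first check:  " + reader.poll(timeline));
        timeline.add("20210316134610"); // a new commit arrives
        System.out.println("second check: " + reader.poll(timeline));
        System.out.println("third check:  " + reader.poll(timeline));
    }
}
```

The first check returns everything from the start commit onward, and later checks return only commits newer than the last one consumed, which matches the way the query output grows only when new data is committed.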
3.3 Insert data
Open a new terminal, start the Flink SQL CLI, recreate table t1, and insert 1 record in batch mode.
CREATE TABLE t1(
uuid VARCHAR(20),
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/hudi-t1',
'write.tasks' = '1',
'compaction.tasks' = '1',
'table.type' = 'MERGE_ON_READ'
);
insert into t1 values ('id9','test',27,TIMESTAMP '1970-01-01 00:00:01','par5');
After a few seconds, the newly inserted record can be read from the streaming table t2.
These simple demonstrations show that the Hudi and Flink integration is already fairly complete and covers both reading and writing data.
4 Flink SQL Writer
The hudi-flink module provides a Flink SQL Connector that supports reading data from and writing data to Hudi tables.
Documentation: https://hudi.apache.org/docs/writing_data#flink-sql-writer
4.1 Flink SQL integrated with Kafka
First, configure Flink SQL to integrate with Kafka and consume Kafka topic data in real time. The specific steps are as follows:
● Step 1: Create a topic
Start the Zookeeper and Kafka service components. This case demonstrates integrating Flink SQL with Kafka to load data in real time. Use the KafkaTool tool to connect to the running Kafka service and create a topic: flink-topic.
You can also create the topic from the command line. The specific command is as follows:
# Create a topic: flink-topic
kafka-topics.sh --create --bootstrap-server node1.oldlu.cn:9092 --replication-factor 1 --partitions 1 --topic flink-topic
Then start the Flink Standalone cluster service, run the Flink SQL command line, and create a table mapped to Kafka.
■Step 2: Start the HDFS cluster
[root@node1 ~]# hadoop-daemon.sh start namenode
[root@node1 ~]# hadoop-daemon.sh start datanode
■Step 3: Start the Flink cluster
Since Flink needs to connect to the HDFS file system, set the HADOOP_CLASSPATH variable first, and then start the Standalone cluster service.
[root@node1 ~]# export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
[root@node1 ~]# /export/server/flink/bin/start-cluster.sh
■Step 4: Start the Flink SQL CLI command line
Use the parameter [-j xx.jar] to load the Flink Kafka connector package. The command is as follows.
[root@node1 ~]# cd /export/server/flink
[root@node1 ~]# bin/sql-client.sh embedded -j /root/flink-sql-connector-kafka_2.11-1.12.0.jar shell
Set the display mode for query results in the SQL CLI to tableau:
set execution.result-mode=tableau;
■Step 5: Create a table and map it to the Kafka topic
The data in the Kafka topic is in CSV format and has three fields: user_id, item_id, and behavior. When consuming data from Kafka, start from the latest offset. The table creation statement is as follows:
CREATE TABLE tbl_kafka (
`user_id` BIGINT,
`item_id` BIGINT,
`behavior` STRING
) WITH (
'connector' = 'kafka',
'topic' = 'flink-topic',
'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',
'properties.group.id' = 'test-group-10001',
'scan.startup.mode' = 'latest-offset',
'format' = 'csv'
);
After executing the command, view the table, the screenshot is as follows:
■Step 6: Send data to Topic in real time, and query in FlinkSQL
First, execute the SELECT query statement on the FlinkSQL page, the screenshot is as follows:
Second, send data to Topic through Kafka Console Producer, the command and data are as follows:
– Producer sends data
kafka-console-producer.sh --broker-list node1.oldlu.cn:9092 --topic flink-topic
/*
1001,90001,click
1001,90001,browser
1001,90001,click
1002,90002,click
1002,90003,click
1003,90001,order
1004,90001,order
*/
After the data is sent, observe the Flink SQL interface: the data is queried and processed in real time. The screenshot is as follows:
At this point, Flink SQL is integrated with Kafka, with a table mapped to the topic data. Next, write Flink SQL programs to synchronize Kafka data to Hudi tables in real time.
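To make the mapping between a CSV message and the three columns of tbl_kafka concrete, here is a small plain-Java sketch that splits one message into user_id, item_id and behavior by position. It is an illustration under assumptions, not the parser Flink uses: it does a plain comma split with no quoting or escaping, and CsvBehaviorRow is a hypothetical class name.

```java
/** Hypothetical holder for one CSV message of the flink-topic sample data. */
final class CsvBehaviorRow {
    final long userId;
    final long itemId;
    final String behavior;

    CsvBehaviorRow(long userId, long itemId, String behavior) {
        this.userId = userId;
        this.itemId = itemId;
        this.behavior = behavior;
    }

    /** Maps the fields to columns by position: user_id, item_id, behavior. */
    static CsvBehaviorRow parse(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields but got " + parts.length + ": " + line);
        }
        return new CsvBehaviorRow(
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim()),
                parts[2].trim());
    }

    public static void main(String[] args) {
        CsvBehaviorRow row = CsvBehaviorRow.parse("1001,90001,click");
        System.out.println(row.userId + " | " + row.itemId + " | " + row.behavior);
    }
}
```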
4.2 Flink SQL written to Hudi
Change the above-mentioned Structured Streaming stream program into a Flink SQL program: consume Topic data from Kafka in real time, analyze and convert it, and store it in the Hudi table. The schematic diagram is shown below.
4.2.1 Create Maven Module
Create a Maven module and add dependencies. Flink 1.12.2 and Hudi 0.9.0 are used here.
<repositories>
<repository>
<id>nexus-aliyun</id>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</repository>
<repository>
<id>central_maven</id>
<name>central maven</name>
<url>https://repo1.maven.org/maven2</url>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<repository>
<id>apache.snapshots</id>
<name>Apache Development Snapshot Repository</name>
<url>https://repository.apache.org/content/repositories/snapshots/</url>
<releases>
<enabled>false</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>${java.version}</maven.compiler.source>
<maven.compiler.target>${java.version}</maven.compiler.target>
<java.version>1.8</java.version>
<scala.binary.version>2.12</scala.binary.version>
<flink.version>1.12.2</flink.version>
<hadoop.version>2.7.3</hadoop.version>
<mysql.version>8.0.16</mysql.version>
</properties>
<dependencies>
<!-- Flink Client -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Flink Table API & SQL -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-common</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-json</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-flink-bundle_${scala.binary.version}</artifactId>
<version>0.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-shaded-hadoop-2-uber</artifactId>
<version>2.7.5-10.0</version>
</dependency>
<!-- MySQL/FastJson/lombok -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.68</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.12</version>
</dependency>
<!-- slf4j and log4j -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.7</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
<scope>runtime</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/java</sourceDirectory>
<testSourceDirectory>src/test/java</testSourceDirectory>
<plugins>
<!-- Compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<!--<encoding>${project.build.sourceEncoding}</encoding>-->
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
<!-- Jar packaging plugin (bundles all dependencies) -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<!-- <mainClass>com.oldlu.flink.batch.FlinkBatchWordCount</mainClass> -->
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
4.2.2 Consuming Kafka data
Create a class: FlinkSQLKafakDemo, based on the Flink Table API, consume data from Kafka and extract field values (for subsequent storage in the Hudi table).
package cn.oldlu.hudi;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
/**
* Based on the Flink SQL Connector: consume data from the topic in real time, transform it, and store it in the Hudi table in real time
*/
public class FlinkSQLKafakDemo {
public static void main(String[] args) {
// 1 - Get the table execution environment
EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.inStreamingMode()
.build();
TableEnvironment tableEnv = TableEnvironment.create(settings) ;
// 2 - Create the input table, TODO: consume data from Kafka
tableEnv.executeSql(
"CREATE TABLE order_kafka_source (\n" +
" orderId STRING,\n" +
" userId STRING,\n" +
" orderTime STRING,\n" +
" ip STRING,\n" +
" orderMoney DOUBLE,\n" +
" orderStatus INT\n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'order-topic',\n" +
" 'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',\n" +
" 'properties.group.id' = 'gid-1001',\n" +
" 'scan.startup.mode' = 'latest-offset',\n" +
" 'format' = 'json',\n" +
" 'json.fail-on-missing-field' = 'false',\n" +
" 'json.ignore-parse-errors' = 'true'\n" +
")"
);
// 3 - Transform data: extract the order date from the order time as the partition field value of the Hudi table
Table etlTable = tableEnv
.from("order_kafka_source")
.addColumns(
$("orderTime").substring(0, 10).as("partition_day")
)
.addColumns(
$("orderId").substring(0, 17).as("ts")
);
tableEnv.createTemporaryView("view_order", etlTable);
// 4 - Query data
tableEnv.executeSql("SELECT * FROM view_order").print();
}
}
Run the streaming application and the simulated data program to view the console.
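The transformation step derives two columns by taking prefixes: the first 10 characters of orderTime become partition_day, and the first 17 characters of orderId become ts. The plain-Java sketch below shows the intended result on made-up sample values. The value formats are assumptions (an orderTime shaped like yyyy-MM-dd HH:mm:ss.SSS and an orderId that starts with a 17-digit yyyyMMddHHmmssSSS timestamp), and OrderColumnsSketch is a hypothetical name.

```java
/** Hypothetical illustration of the two derived columns; sample values are made up. */
final class OrderColumnsSketch {
    /** Returns the first {@code length} characters, or the whole value if it is shorter. */
    static String prefix(String value, int length) {
        return value.length() <= length ? value : value.substring(0, length);
    }

    /** partition_day: the date part of the order time. */
    static String partitionDay(String orderTime) {
        return prefix(orderTime, 10);
    }

    /** ts: the timestamp part at the start of the order id. */
    static String ts(String orderId) {
        return prefix(orderId, 17);
    }

    public static void main(String[] args) {
        String orderId = "20211122103434136000001";   // assumed format: timestamp + sequence
        String orderTime = "2021-11-22 10:34:34.136"; // assumed format
        System.out.println("ts            = " + ts(orderId));
        System.out.println("partition_day = " + partitionDay(orderTime));
    }
}
```

Because ts has a fixed width, comparing two ts values as strings orders them by time, which is what makes it usable as an ordering field later on.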
4.2.3 Save data to Hudi
Write the table creation DDL statement, map it to the Hudi table, and specify related attributes: primary key field, table type, etc.
CREATE TABLE order_hudi_sink (
orderId STRING PRIMARY KEY NOT ENFORCED,
userId STRING,
orderTime STRING,
ip STRING,
orderMoney DOUBLE,
orderStatus INT,
ts STRING,
partition_day STRING
)
PARTITIONED BY (partition_day)
WITH (
'connector' = 'hudi',
'path' = 'file:///D:/flink_hudi_order',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'upsert',
'hoodie.datasource.write.recordkey.field'= 'orderId',
'write.precombine.field' = 'ts',
'write.tasks'= '1'
);
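The options 'write.operation' = 'upsert' and 'write.precombine.field' = 'ts' work together: records are matched on the record key, and when two records share a key the one with the larger ts is kept. The sketch below models that rule in plain Java. It is a simplified illustration, not Hudi's code: ts values are compared as strings, an incoming record wins ties (an assumption), and PrecombineSketch and Order are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of upsert with a precombine (ordering) field. */
final class PrecombineSketch {
    /** An order reduced to key, ordering field, and one payload field. */
    static final class Order {
        final String orderId;
        final String ts;
        final int orderStatus;

        Order(String orderId, String ts, int orderStatus) {
            this.orderId = orderId;
            this.ts = ts;
            this.orderStatus = orderStatus;
        }
    }

    private final Map<String, Order> latestByKey = new LinkedHashMap<>();

    /** Keeps the incoming record unless the stored one has a strictly larger ts. */
    void upsert(Order incoming) {
        latestByKey.merge(incoming.orderId, incoming,
                (stored, in) -> in.ts.compareTo(stored.ts) >= 0 ? in : stored);
    }

    Order get(String orderId) {
        return latestByKey.get(orderId);
    }

    int size() {
        return latestByKey.size();
    }

    public static void main(String[] args) {
        PrecombineSketch sink = new PrecombineSketch();
        sink.upsert(new Order("o1", "20211122103434136", 0));
        sink.upsert(new Order("o1", "20211122103434100", 1)); // older ts: ignored
        sink.upsert(new Order("o1", "20211122103435000", 2)); // newer ts: replaces
        System.out.println(sink.size() + " record, status = " + sink.get("o1").orderStatus);
    }
}
```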
The Hudi table data is saved in a directory of the local file system (LocalFS). Data is written to the Hudi table with an INSERT INTO statement. The specific DML statement is as follows:
-- Insert via subquery: INSERT ... SELECT ...
INSERT INTO order_hudi_sink
SELECT
orderId, userId, orderTime, ip, orderMoney, orderStatus,
substring(orderId, 0, 17) AS ts, substring(orderTime, 0, 10) AS partition_day
FROM order_kafka_source ;
Create a class: FlinkSQLHudiDemo, write code: consume data from Kafka, convert it, and save it to the Hudi table.
package cn.oldlu.hudi;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
/**
* Based on the Flink SQL Connector: consume data from the topic in real time, transform it, and store it in the Hudi table in real time
*/
public class FlinkSQLHudiDemo {
public static void main(String[] args) {
System.setProperty("HADOOP_USER_NAME","root");
// 1 - Get the table execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(5000);
EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.inStreamingMode()
.build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings) ;
// 2 - Create the input table, TODO: consume data from Kafka
tableEnv.executeSql(
"CREATE TABLE order_kafka_source (\n" +
" orderId STRING,\n" +
" userId STRING,\n" +
" orderTime STRING,\n" +
" ip STRING,\n" +
" orderMoney DOUBLE,\n" +
" orderStatus INT\n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'order-topic',\n" +
" 'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',\n" +
" 'properties.group.id' = 'gid-1001',\n" +
" 'scan.startup.mode' = 'latest-offset',\n" +
" 'format' = 'json',\n" +
" 'json.fail-on-missing-field' = 'false',\n" +
" 'json.ignore-parse-errors' = 'true'\n" +
")"
);
// 3 - Transform data: extract the order date from the order time as the partition field value of the Hudi table
Table etlTable = tableEnv
.from("order_kafka_source")
.addColumns(
$("orderId").substring(0, 17).as("ts")
)
.addColumns(
$("orderTime").substring(0, 10).as("partition_day")
);
tableEnv.createTemporaryView("view_order", etlTable);
// 4 - Define the output table, TODO: save data to the Hudi table
tableEnv.executeSql(
"CREATE TABLE order_hudi_sink (\n" +
" orderId STRING PRIMARY KEY NOT ENFORCED,\n" +
" userId STRING,\n" +
" orderTime STRING,\n" +
" ip STRING,\n" +
" orderMoney DOUBLE,\n" +
" orderStatus INT,\n" +
" ts STRING,\n" +
" partition_day STRING\n" +
")\n" +
"PARTITIONED BY (partition_day) \n" +
"WITH (\n" +
" 'connector' = 'hudi',\n" +
" 'path' = 'file:///D:/flink_hudi_order',\n" +
" 'table.type' = 'MERGE_ON_READ',\n" +
" 'write.operation' = 'upsert',\n" +
" 'hoodie.datasource.write.recordkey.field' = 'orderId',\n" +
" 'write.precombine.field' = 'ts',\n" +
" 'write.tasks'= '1'\n" +
")"
);
// 5 - Write data to the output table with INSERT ... SELECT
tableEnv.executeSql(
"INSERT INTO order_hudi_sink\n" +
"SELECT\n" +
" orderId, userId, orderTime, ip, orderMoney, orderStatus, ts, partition_day\n" +
"FROM view_order"
);
}
}
Run the streaming program written above, then view the directory in the local file system where the data and structure information of the Hudi table are saved:
4.2.4 Loading Hudi table data
Create a class: FlinkSQLReadDemo, which loads the data in the Hudi table in streaming mode. It creates a table with the same structure and maps it to the data storage directory of the Hudi table. The DDL statement for creating the table is as follows:
CREATE TABLE order_hudi(
orderId STRING PRIMARY KEY NOT ENFORCED,
userId STRING,
orderTime STRING,
ip STRING,
orderMoney DOUBLE,
orderStatus INT,
ts STRING,
partition_day STRING
)
PARTITIONED BY (partition_day)
WITH (
'connector' = 'hudi',
'path' = 'file:///D:/flink_hudi_order',
'table.type' = 'MERGE_ON_READ',
'read.streaming.enabled' = 'true',
'read.streaming.check-interval' = '4'
);
The complete Flink SQL streaming program code is as follows:
package cn.oldlu.hudi;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
/**
* Based on the Flink SQL Connector: load data from the Hudi table and query it with SQL
*/
public class FlinkSQLReadDemo {
public static void main(String[] args) {
System.setProperty("HADOOP_USER_NAME","root");
// 1 - Get the table execution environment
EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.inStreamingMode()
.build();
TableEnvironment tableEnv = TableEnvironment.create(settings) ;
// 2 - Create the input table, TODO: load the Hudi table to query data
tableEnv.executeSql(
"CREATE TABLE order_hudi(\n" +
" orderId STRING PRIMARY KEY NOT ENFORCED,\n" +
" userId STRING,\n" +
" orderTime STRING,\n" +
" ip STRING,\n" +
" orderMoney DOUBLE,\n" +
" orderStatus INT,\n" +
" ts STRING,\n" +
" partition_day STRING\n" +
")\n" +
"PARTITIONED BY (partition_day)\n" +
"WITH (\n" +
" 'connector' = 'hudi',\n" +
" 'path' = 'file:///D:/flink_hudi_order',\n" +
" 'table.type' = 'MERGE_ON_READ',\n" +
" 'read.streaming.enabled' = 'true',\n" +
" 'read.streaming.check-interval' = '4'\n" +
")"
);
// 3 - Query the Hudi table and print the results
tableEnv.executeSql(
"SELECT \n" +
" orderId, userId, orderTime, ip, orderMoney, orderStatus, ts ,partition_day \n" +
"FROM order_hudi"
).print();
}
}
Run the streaming program and load the Hudi table data, the results are as follows:
4.3 Flink SQL Client writes to Hudi
Start the Flink Standalone cluster, run the SQL Client command line client, execute DDL and DML statements, and manipulate data.
4.3.1 Integrated environment
■Configure Flink cluster
Modify $FLINK_HOME/conf/flink-conf.yaml file
jobmanager.rpc.address: node1.oldlu.cn
jobmanager.memory.process.size: 1024m
taskmanager.memory.process.size: 2048m
taskmanager.numberOfTaskSlots: 4
classloader.check-leaked-classloader: false
classloader.resolve-order: parent-first
execution.checkpointing.interval: 3000
state.backend: rocksdb
state.checkpoints.dir: hdfs://node1.oldlu.cn:8020/flink/flink-checkpoints
state.savepoints.dir: hdfs://node1.oldlu.cn:8020/flink/flink-savepoints
state.backend.incremental: true
● Put the integrated jar package of Hudi and Flink and other related jar packages into the $FLINK_HOME/lib directory
● Start the Standalone cluster
export HADOOP_CLASSPATH=`/export/server/hadoop/bin/hadoop classpath`
/export/server/flink/bin/start-cluster.sh
● Start the SQL Client; it is best to specify the Hudi integration jar package again
/export/server/flink/bin/sql-client.sh embedded -j /export/server/flink/lib/hudi-flink-bundle_2.12-0.9.0.jar shell
● Set properties
set execution.result-mode=tableau;
set execution.checkpointing.interval=3sec;
4.3.2 Execute SQL
First create an input table: consume data from Kafka, then write SQL to extract field values, then create an output table: save the data in the Hudi table, and finally write SQL to query the data in the Hudi table.
● Step 1. Create an input table and associate Kafka Topic
-- Input table: Kafka Source
CREATE TABLE order_kafka_source (
orderId STRING,
userId STRING,
orderTime STRING,
ip STRING,
orderMoney DOUBLE,
orderStatus INT
) WITH (
'connector' = 'kafka',
'topic' = 'order-topic',
'properties.bootstrap.servers' = 'node1.oldlu.cn:9092',
'properties.group.id' = 'gid-1001',
'scan.startup.mode' = 'latest-offset',
'format' = 'json',
'json.fail-on-missing-field' = 'false',
'json.ignore-parse-errors' = 'true'
);
SELECT orderId, userId, orderTime, ip, orderMoney, orderStatus FROM order_kafka_source ;
● Step 2, process and obtain Kafka message data, and extract field values
SELECT
orderId, userId, orderTime, ip, orderMoney, orderStatus,
substring(orderId, 0, 17) AS ts, substring(orderTime, 0, 10) AS partition_day
FROM order_kafka_source ;
● Step 3. Create an output table, save data to the Hudi table, and set related properties
-- Output table: Hudi Sink
CREATE TABLE order_hudi_sink (
orderId STRING PRIMARY KEY NOT ENFORCED,
userId STRING,
orderTime STRING,
ip STRING,
orderMoney DOUBLE,
orderStatus INT,
ts STRING,
partition_day STRING
)
PARTITIONED BY (partition_day)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://node1.oldlu.cn:8020/hudi-warehouse/order_hudi_sink',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'upsert',
'hoodie.datasource.write.recordkey.field'= 'orderId',
'write.precombine.field' = 'ts',
'write.tasks'= '1',
'compaction.tasks' = '1',
'compaction.async.enabled' = 'true',
'compaction.trigger.strategy' = 'num_commits',
'compaction.delta_commits' = '1'
);
● Step 4. Use the INSERT INTO statement to save the data in the Hudi table
-- Insert via subquery: INSERT ... SELECT ...
INSERT INTO order_hudi_sink
SELECT
orderId, userId, orderTime, ip, orderMoney, orderStatus,
substring(orderId, 0, 17) AS ts, substring(orderTime, 0, 10) AS partition_day
FROM order_kafka_source ;
At this point, submit the Flink Job to run on the FlinkStandalone cluster, as shown below:
As long as the simulated transaction order data program is run, the data will be sent to Kafka, and finally converted and saved to the Hudi table. The screenshot is as follows:
■Step 5, write SELECT statement, query Hudi table transaction order data
-- Query Hudi table data
SELECT * FROM order_hudi_sink ;
5 Hudi CDC
CDC stands for Change Data Capture. It targets database changes and is a very common technology in the database field: it captures changes made in a database and sends the changed data downstream.
There are two main types of CDC in the industry. The first is query-based: the client queries the changed data of the source database table through SQL and sends it out. The second is log-based, which is the method most widely used in the industry. It generally relies on the binlog: changed records are written to the binlog, and after the binlog is parsed the changes are written to a message system or processed directly with Flink CDC.
■Query-based: This CDC technology is intrusive and needs to execute SQL statements at the data source. Implementing CDC using this technique can impact the performance of the data source. Often an entire table containing a large number of records needs to be scanned.
■ Log-based: This CDC technology is non-invasive and does not require SQL statements to be executed at the data source. It reads the log files of the source database to identify the creation, modification or deletion of data in the source database tables.
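The log-based approach can be pictured as replaying an ordered list of change events against a copy of the table, without ever querying the source. The plain-Java sketch below models that idea. It is an illustration only: the event shape (operation, id, name) is a simplification chosen for this example, not the binlog or Flink CDC record format, and ChangeLogReplaySketch, Op and ChangeEvent are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of log-based CDC: rebuild table state by replaying change events. */
final class ChangeLogReplaySketch {
    enum Op { INSERT, UPDATE, DELETE }

    /** One change to one row, identified by its primary key. */
    static final class ChangeEvent {
        final Op op;
        final long id;
        final String name; // ignored for DELETE

        ChangeEvent(Op op, long id, String name) {
            this.op = op;
            this.id = id;
            this.name = name;
        }
    }

    /** Applies the events in log order and returns the resulting id -> name table. */
    static Map<Long, String> replay(List<ChangeEvent> log) {
        Map<Long, String> table = new TreeMap<>();
        for (ChangeEvent event : log) {
            switch (event.op) {
                case INSERT:
                case UPDATE:
                    table.put(event.id, event.name);
                    break;
                case DELETE:
                    table.remove(event.id);
                    break;
                default:
                    throw new IllegalStateException("unknown operation: " + event.op);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<ChangeEvent> log = new ArrayList<>();
        log.add(new ChangeEvent(Op.INSERT, 1L, "user-a"));
        log.add(new ChangeEvent(Op.INSERT, 2L, "user-b"));
        log.add(new ChangeEvent(Op.UPDATE, 1L, "user-a2"));
        log.add(new ChangeEvent(Op.DELETE, 2L, null));
        System.out.println(replay(log));
    }
}
```

Deletes show up as explicit events in the log, which is something a query-based approach that only re-reads the table has no direct way to observe.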
5.1 CDC data into the lake
With CDC data entering the lake, the architecture is very simple: various upstream data sources, such as DB change data, event streams, and other external data sources, are written into the table as a change stream, and external query analysis is then performed on the table.
There are two typical CDC links into the lake. The first is the one adopted by most companies: CDC data is first imported into Kafka or Pulsar through a CDC tool, and then consumed by Flink or Spark streaming and written to Hudi. The second connects directly to the upstream MySQL data source through Flink CDC and writes directly to the downstream Hudi table.
5.2 Flink CDC Hudi
Based on Flink CDC, the data of MySQL database tables is collected in real time, processed by ETL conversion, and finally stored in a Hudi table.
5.2.1 Business requirements
Create a table in the MySQL database and add data to it in real time; write the data into a Hudi table through Flink CDC; and integrate Hudi with Hive so that the table is created and partition information is added in Hive automatically. Finally, query and analyze the data from the Hive Beeline terminal.
To have Hudi tables and Hive tables associated automatically, the Hudi source code must be recompiled, specifying the Hive version and including the Hive dependency jar packages during compilation. The specific steps are as follows.
● Modify the Hive dependency version configuration used when compiling Hudi's Flink bundle
Reason: When compiled, the current version of Hudi already bundles the flink-sql-connector-hive package by default, which conflicts with the flink-sql-connector-hive package under Flink's lib directory. Therefore, only the Hive version used for compilation is modified.
File: hudi-0.9.0/packaging/hudi-flink-bundle/pom.xml
● Compile Hudi source code
mvn clean install -DskipTests -Drat.skip=true -Dscala-2.12 -Dspark3 -Pflink-bundle-shade-hive2
After the compilation is complete, there are 2 important jar packages:
hudi-flink-bundle_2.12-0.9.0.jar, located in hudi-0.9.0/packaging/hudi-flink-bundle/target, is used by Flink to write and read data. Copy it to the Flink lib directory; if a jar package with the same name already exists there, delete it first and then copy.
hudi-hadoop-mr… is copied to the $HIVE_HOME/lib directory.
■ Put the jar package corresponding to Flink CDC MySQL into the $FLINK_HOME/lib directory
flink-sql-connector-mysql-cdc-1.3.0.jar
So far, in the $FLINK_HOME/lib directory, there are the following required jar packages, all of which are indispensable, pay attention to the version number.
5.2.2 Create MySQL table
First enable the MySQL binlog, then restart the MySQL database service, and finally create a table.
■Step 1: Enable the MySQL binlog
[root@node1 ~]# vim /etc/my.cnf
Add content under [mysqld]:
server-id=2
log-bin=mysql-bin
binlog_format=row
expire_logs_days=15
binlog_row_image=full
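Before restarting MySQL, the edited file can be checked for the binlog settings above. The helper below is a rough sketch: check_binlog_conf is a hypothetical name, it does a plain line match after stripping whitespace, and it does not verify that the lines actually sit under the [mysqld] section.

```shell
# Sketch: confirm that a my.cnf-style file contains the binlog settings above.
# check_binlog_conf is a hypothetical helper name used only for illustration.
check_binlog_conf() {
  local conf status setting
  conf="$1"
  status=0
  for setting in "log-bin=mysql-bin" "binlog_format=row" "binlog_row_image=full"; do
    # Strip spaces and tabs so "binlog_format = row" also matches,
    # then compare whole lines literally and case-insensitively.
    if ! tr -d ' \t' < "$conf" | grep -qixF -- "$setting"; then
      echo "not set: $setting"
      status=1
    fi
  done
  return "$status"
}

# Example: check_binlog_conf /etc/my.cnf
```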
■ The second step: restart MySQL Server
service mysqld restart
Log in to the MySQL Client command line to check whether it takes effect.
■ The third step: create a table in the MySQL database
-- Create the database and table in MySQL
create database test ;
create table test.tbl_users(
id bigint auto_increment primary key,
name varchar(20) null,
birthday timestamp default CURRENT_TIMESTAMP not null,
ts timestamp default CURRENT_TIMESTAMP not null
);
5.2.3 Create CDC table
First start the HDFS service, Hive MetaStore and HiveServer2 services, and the Flink Standalone cluster, then run the SQL Client, and finally create a table associated with a MySQL table, using the MySQL CDC method.
● Start the HDFS service, start the NameNode and DataNode respectively
# Start the HDFS service
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
● Start Hive service: metadata MetaStore and HiveServer2
# Start the Hive services
/export/server/hive/bin/start-metastore.sh
/export/server/hive/bin/start-hiveserver2.sh
■Start the Flink Standalone cluster
# Start the Flink Standalone cluster
export HADOOP_CLASSPATH=$(/export/server/hadoop/bin/hadoop classpath)
/export/server/flink/bin/start-cluster.sh
■ Start the SQL Client
/export/server/flink/bin/sql-client.sh embedded -j /export/server/flink/lib/hudi-flink-bundle_2.12-0.9.0.jar shell
Set properties:
set execution.result-mode=tableau;
set execution.checkpointing.interval=3sec;
● Create the input table and associate it with the MySQL table through the MySQL CDC connector
-- Create the table in the Flink SQL Client
CREATE TABLE users_source_mysql (
id BIGINT PRIMARY KEY NOT ENFORCED,
name STRING,
birthday TIMESTAMP(3),
ts TIMESTAMP(3)
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'node1.oldlu.cn',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'server-time-zone' = 'Asia/Shanghai',
'debezium.snapshot.mode' = 'initial',
'database-name' = 'test',
'table-name' = 'tbl_users'
);
View the structure of the table: id is the primary key, and ts is the field used to merge records.
● Query CDC table data
-- Query the data
select * from users_source_mysql;
● Open the MySQL Client and execute DML statements to insert data
insert into test.tbl_users (name) values ('zhangsan');
insert into test.tbl_users (name) values ('lisi');
insert into test.tbl_users (name) values ('wangwu');
insert into test.tbl_users (name) values ('laoda');
insert into test.tbl_users (name) values ('laoer');
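When more test rows are needed, the INSERT statements can be generated rather than typed by hand. The following is a minimal sketch: gen_inserts is a hypothetical helper, it assumes the names contain no quote characters, and the MySQL login shown in the usage comment is an assumption about the local setup.

```shell
# Sketch: print one INSERT statement per name, each ending with a semicolon.
# gen_inserts is a hypothetical helper name used only for illustration.
gen_inserts() {
  local name
  for name in "$@"; do
    # Names are assumed to contain no single quotes, so no escaping is done.
    printf "insert into test.tbl_users (name) values ('%s');\n" "$name"
  done
}

# Example: pipe the statements into the MySQL client
# gen_inserts zhangsan lisi wangwu | mysql -uroot -p
```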
5.2.4 Creating Views
Create a temporary view that adds the partition column part, so that the data can later be synchronized to a Hive partitioned table.
-- Create a temporary view with the partition column part
create view view_users_cdc AS
SELECT *, DATE_FORMAT(birthday, 'yyyyMMdd') as part FROM users_source_mysql;
Query the data in the view:
select * from view_users_cdc;
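To see what value the part column takes for a given row, the 'yyyyMMdd' formatting can be reproduced outside Flink. This sketch assumes the timestamp is written as "yyyy-MM-dd HH:mm:ss"; part_of is a hypothetical helper that only mimics the result of DATE_FORMAT for that input shape.

```shell
# Sketch: derive the yyyyMMdd partition value from a "yyyy-MM-dd HH:mm:ss" string.
# part_of is a hypothetical helper name used only for illustration.
part_of() {
  # Keep the first 10 characters (the date part) and drop the dashes.
  printf '%s\n' "$1" | cut -c1-10 | tr -d '-'
}

# Example: part_of "2021-11-25 09:58:18"   # prints 20211125
```

Rows whose birthday falls on the same calendar day therefore end up in the same partition.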
5.2.5 Create Hudi table
Create the CDC Hudi sink table with automatic synchronization to a Hive partitioned table. The DDL statement is as follows:
CREATE TABLE users_sink_hudi_hive(
id bigint ,
name string,
birthday TIMESTAMP(3),
ts TIMESTAMP(3),
part VARCHAR(20),
primary key(id) not enforced
)
PARTITIONED BY (part)
with(
'connector'='hudi',
'path'= 'hdfs://node1.oldlu.cn:8020/ehualu/hudi-warehouse/users_sink_hudi_hive',
'table.type'= 'MERGE_ON_READ',
'hoodie.datasource.write.recordkey.field'= 'id',
'write.precombine.field'= 'ts',
'write.tasks'= '1',
'write.rate.limit'= '2000',
'compaction.tasks'= '1',
'compaction.async.enabled'= 'true',
'compaction.trigger.strategy'= 'num_commits',
'compaction.delta_commits'= '1',
'changelog.enabled'= 'true',
'read.streaming.enabled'= 'true',
'read.streaming.check-interval'= '3',
'hive_sync.enable'= 'true',
'hive_sync.mode'= 'hms',
'hive_sync.metastore.uris'= 'thrift://node1.oldlu.cn:9083',
'hive_sync.jdbc_url'= 'jdbc:hive2://node1.oldlu.cn:10000',
'hive_sync.table'= 'users_sink_hudi_hive',
'hive_sync.db'= 'default',
'hive_sync.username'= 'root',
'hive_sync.password'= '123456',
'hive_sync.support_timestamp'= 'true'
);
Here the Hudi table type is MOR (Merge on Read), which supports snapshot queries, incremental queries, and read-optimized queries (near real-time). Data is stored as a combination of columnar files (parquet) and row-based files (avro). Updates are recorded in delta files, which are then compacted synchronously or asynchronously to produce new versions of the columnar files.
5.2.6 Write data to Hudi table
Write an INSERT statement that queries the data from the view and writes it into the Hudi table:
insert into users_sink_hudi_hive
select id, name, birthday, ts, part from view_users_cdc;
Flink web UI DAG diagram:
Hudi file directory situation on HDFS:
To query Hudi table data, the SELECT statement is as follows:
select * from users_sink_hudi_hive;
5.2.7 Hive table query
The hudi-hadoop-mr-bundle-0.9.0.jar package needs to be imported and placed under $HIVE_HOME/lib.
Start the beeline client in Hive and connect to the HiveServer2 service:
/export/server/hive/bin/beeline -u jdbc:hive2://node1.oldlu.cn:10000 -n root -p 123456
Two tables have been generated automatically for the Hudi MOR table:
■ users_sink_hudi_hive_ro: ro stands for read optimized. For the xxx_ro table synchronized from a MOR table, only the compacted parquet files are exposed. It is queried in much the same way as a COW table: after setting hive.input.format, it can be queried like a normal Hive table.
■ users_sink_hudi_hive_rt: rt stands for real time; this table is mainly used for incremental queries. The ro table can only read the parquet file data, whereas the rt table can read both the parquet file data and the log file data.
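The naming rule for the two synchronized tables can be written down as a small helper, which is handy when scripting checks against both of them. This is a sketch only: mor_hive_tables is a hypothetical name, and the Beeline connection details in the usage comment are the ones used in this guide.

```shell
# Sketch: print the two Hive table names synchronized for a MOR table.
# mor_hive_tables is a hypothetical helper name used only for illustration.
mor_hive_tables() {
  # The _ro and _rt suffixes are appended to the hive_sync.table value.
  printf '%s_ro\n%s_rt\n' "$1" "$1"
}

# Example: list the partitions of both tables through Beeline
# for t in $(mor_hive_tables users_sink_hudi_hive); do
#   /export/server/hive/bin/beeline -u jdbc:hive2://node1.oldlu.cn:10000 -n root -p 123456 -e "show partitions $t;"
# done
```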
Check the structure of the automatically generated table users_sink_hudi_hive_ro:
CREATE EXTERNAL TABLE `users_sink_hudi_hive_ro`(
`_hoodie_commit_time` string COMMENT '',
`_hoodie_commit_seqno` string COMMENT '',
`_hoodie_record_key` string COMMENT '',
`_hoodie_partition_path` string COMMENT '',
`_hoodie_file_name` string COMMENT '',
`_hoodie_operation` string COMMENT '',
`id` bigint COMMENT '',
`name` string COMMENT '',
`birthday` bigint COMMENT '',
`ts` bigint COMMENT '')
PARTITIONED BY (
`part` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'hoodie.query.as.ro.table'='true',
'path'='hdfs://node1.oldlu.cn:8020/users_sink_hudi_hive')
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://node1.oldlu.cn:8020/users_sink_hudi_hive'
TBLPROPERTIES (
'last_commit_time_sync'='20211125095818',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_operation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"birthday\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ts\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
'spark.sql.sources.schema.partCol.0'='partition',
'transient_lastDdlTime'='1637743860')
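The last_commit_time_sync property in the DDL above appears to be a 14-digit instant in yyyyMMddHHmmss form. Assuming that layout, a small helper can turn it into a readable timestamp; format_instant is a hypothetical name and is not a Hudi command.

```shell
# Sketch: render a 14-digit yyyyMMddHHmmss instant as "yyyy-MM-dd HH:mm:ss".
# format_instant is a hypothetical helper name used only for illustration.
format_instant() {
  local t
  t="$1"
  # Reject anything that is not exactly 14 digits.
  case "$t" in
    ""|*[!0-9]*) echo "expected 14 digits: $t" >&2; return 1 ;;
  esac
  [ "${#t}" -eq 14 ] || { echo "expected 14 digits: $t" >&2; return 1; }
  # Split the digits into date and time fields.
  printf '%s\n' "$t" | sed -E 's/^(....)(..)(..)(..)(..)(..)$/\1-\2-\3 \4:\5:\6/'
}

# Example: format_instant 20211125095818   # prints 2021-11-25 09:58:18
```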
View the partition information of the automatically generated table:
show partitions users_sink_hudi_hive_ro;
show partitions users_sink_hudi_hive_rt;
Query Hive partition table data
set hive.exec.mode.local.auto=true;
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hive.mapred.mode=nonstrict;
select id, name, birthday, ts, part from users_sink_hudi_hive_ro;
Specify the partition field to filter and query data
select name, ts from users_sink_hudi_hive_ro where part = '20211125';
select name, ts from users_sink_hudi_hive_rt where part = '20211125';
5.3 Hudi Client operates Hudi table
Enter the Hudi client command line: hudi-0.9.0/hudi-cli/hudi-cli.sh
Connect to Hudi table and view table information
connect --path hdfs://node1.oldlu.cn:8020/users_sink_hudi_hive
View Hudi commit information
commits show --sortBy "CommitTime"
View the Hudi compaction plans
compactions show all