02_ Quickly experience Hudi: compile Hudi, install HDFS, install Spark 3.x, simulate data, insert data, query data, .hoodie files, data files, Hudi data storage overview, metadata, etc.

This article is from the "Dark Horse Programmer" hudi course

2. Chapter 2 Quick experience of Hudi
2.1 Compile Hudi
2.1.1 Step 1, Maven installation
2.1.2 Step 2, Download source package
2.1.3 Step 3, Add Maven mirror
2.1.4 Step 4, Execute compilation command
2.1.5 Step 5, Hudi CLI test
2.2 Environment preparation
2.2.1 Install HDFS
2.2.2 Install Spark 3.x
2.3 Using spark-shell
2.3.1 Start spark-shell
2.3.2 Simulate data
2.3.3 Insert data
2.3.4 Query data
2.3.5 Table data structure
2.3.5.1 .hoodie file
2.3.5.2 Data file
2.3.6 Hudi data storage overview
2.3.6.1 Metadata
2.3.6.2 Index
2.3.6.3 Data
2.4 IDEA programming development
2.4.1 Prepare environment
2.4.2 Code structure
2.4.3 Insert data (Insert)
2.4.4 Query data (Query)
2.4.5 Update data (Update)
2.4.6 Incremental query (Incremental Query)
2.4.7 Delete data (Delete)

2. Chapter 2 Quick experience of Hudi

Using the officially provided Spark DataSource, perform CRUD operations on Hudi table data to quickly get started with the Hudi data lake framework, first on the spark-shell command line and then through the API in IDEA.

2.1 Compile Hudi

When developing against the Apache Hudi data lake framework, the Maven dependencies can be added directly. However, to manage Hudi table data from the command line, you need to download and compile the Hudi source package. The steps are as follows.

2.1.1 The first step, Maven installation

Download and install Maven on 64-bit CentOS 7.7: simply decompress the Maven package and then configure the system environment variables. Maven version: 3.5.4, local repository directory: m2, as shown in the following figure:
insert image description here

After configuring the Maven environment variable, execute: mvn -version
insert image description here

2.1.2 The second step, download the source package

Go to the Apache software archive to download the Hudi 0.9.0 source package: http://archive.apache.org/dist/hudi/0.9.0/

wget https://archive.apache.org/dist/hudi/0.9.0/hudi-0.9.0.src.tgz

In addition, you can also download the Hudi source code from Github:

https://github.com/apache/hudi

The GitHub repository explains how to compile the Hudi source code:
insert image description here

2.1.3 The third step, add Maven mirror

Since Hudi downloads its dependencies during compilation, add Maven mirror repositories so the required JAR packages can be downloaded.
Edit the $MAVEN_HOME/conf/settings.xml file and add the following image:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>aliyunmaven</id>
    <mirrorOf>*</mirrorOf>
    <name>Aliyun Spring plugin repository</name>
    <url>https://maven.aliyun.com/repository/spring-plugin</url>
</mirror>
<mirror>
    <id>repo2</id>
    <name>Mirror from Maven Repo2</name>
    <url>https://repo.spring.io/plugins-release/</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>UK</id>
    <name>UK Central</name>
    <url>http://uk.maven.org/maven2</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>jboss-public-repository-group</id>
    <name>JBoss Public Repository Group</name>
    <url>http://repository.jboss.org/nexus/content/groups/public</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>CN</id>
    <name>OSChina Central</name>
    <url>http://maven.oschina.net/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>google-maven-central</id>
    <name>GCS Maven Central mirror Asia Pacific</name>
    <url>https://maven-central-asia.storage-download.googleapis.com/maven2/</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>confluent</id>
    <name>confluent maven</name>
    <url>http://packages.confluent.io/maven/</url>
    <mirrorOf>confluent</mirrorOf>
</mirror>

2.1.4 The fourth step, execute the compilation command

Upload the downloaded Hudi source package to the /root directory on the CentOS system, decompress the tar package, enter the source directory, and execute the compilation command:

[root@node1 hudi-0.9.0]# mvn clean install -DskipTests -DskipITs -Dscala-2.12 -Dspark3

insert image description here

After the compilation is successful, the screenshot is as follows:
insert image description here

2.1.5 The fifth step, Hudi CLI test

After the compilation is complete, enter the $HUDI_HOME/hudi-cli directory and run the **hudi-cli** script. If it runs, the compilation succeeded. The screenshot is as follows:
insert image description here

2.2 Environment preparation

The Apache Hudi data lake framework provides data management capabilities, storing data at the bottom layer on the reliable, distributed HDFS file system. By default it supports reading and writing data with Spark, and it also supports Flink and integrates with frameworks such as Hive. First, build a pseudo-distributed big data environment to make the subsequent use of Hudi easier.

2.2.1 Install HDFS

First install and deploy the HDFS distributed file system pseudo-distributed cluster to facilitate subsequent data storage.
Step 1. Decompress the software package
Decompress and configure HDFS on the node1.itcast.cn machine:

[root@node1 ~]# cd /export/software/
[root@node1 software]# rz
[root@node1 software]# tar -zxf hadoop-2.7.3.tar.gz -C /export/server/

After the decompression is complete, create a hadoop soft link to make subsequent software version upgrades and management easier.

[root@node1 ~]# cd /export/server/
[root@node1 server]# ln -s hadoop-2.7.3 hadoop
[root@node1 server]# ll
lrwxrwxrwx  1 root root  12 Feb 23 21:35 hadoop -> hadoop-2.7.3
drwxr-xr-x  9 root root 149 Nov  4 17:57 hadoop-2.7.3

Step 2. Configure environment variables
In Hadoop, the scripts in the bin and sbin directories, the configuration files in etc/hadoop, and many configuration items use the HADOOP_* environment variables. If only HADOOP_HOME is configured, these scripts derive the class library paths of Common, HDFS and YARN by appending the corresponding directory structure to HADOOP_HOME.

HADOOP_HOME: installation path of the Hadoop software;
HADOOP_CONF_DIR: path of the Hadoop configuration files;
HADOOP_COMMON_HOME: path of the Hadoop common class libraries;
HADOOP_HDFS_HOME: path of the Hadoop HDFS class libraries;
HADOOP_YARN_HOME: path of the Hadoop YARN class libraries;
HADOOP_MAPRED_HOME: path of the Hadoop MapReduce class libraries;

Edit the [ /etc/profile ] file; the command is as follows:

vim /etc/profile

Add the following:

export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Execute the following command to take effect:

source /etc/profile

Note: on a multi-node cluster, the environment variables must be configured on every machine for them to take effect, so that the commands can be used directly later.

Step 3. hadoop-env.sh
Configure the JDK and Hadoop installation directories in the Hadoop environment script; the command and content are as follows.
Execute the command:

[root@node1 ~]# vim /export/server/hadoop/etc/hadoop/hadoop-env.sh

The modification is as follows:

export JAVA_HOME=/export/server/jdk
export HADOOP_HOME=/export/server/hadoop

Step 4. core-site.xml
Configure the common properties of the Hadoop Common module by editing the core-site.xml file; the command and content are as follows.
Execute the command:

[root@node1 ~]# vim /export/server/hadoop/etc/hadoop/core-site.xml

Add configuration content:

<property>
	<name>fs.defaultFS</name>
	<value>hdfs://node1.itcast.cn:8020</value>
</property>
<property>
	<name>hadoop.tmp.dir</name>
	<value>/export/server/hadoop/datas/tmp</value>
</property>
<property>
	<name>hadoop.http.staticuser.user</name>
	<value>root</value>
</property>

Create a temporary data directory, the command is as follows:

[root@node1 ~]# mkdir -p  /export/server/hadoop/datas/tmp

Step 5. hdfs-site.xml
Configure the HDFS distributed file system related properties; the command and content are as follows.
Execute the command:

[root@node1 ~]# vim /export/server/hadoop/etc/hadoop/hdfs-site.xml 

Add configuration content:

<property>
	<name>dfs.namenode.name.dir</name>
	<value>/export/server/hadoop/datas/dfs/nn</value>
</property>
<property>
	<name>dfs.datanode.data.dir</name>
	<value>/export/server/hadoop/datas/dfs/dn</value>
</property>

<property>
	<name>dfs.replication</name>
	<value>1</value>
</property>
<property>
	<name>dfs.permissions.enabled</name>
	<value>false</value>
</property>
<property>
	<name>dfs.datanode.data.dir.perm</name>
	<value>750</value>
</property>

Create a data directory, the command is as follows:

[root@node1 ~]# mkdir -p  /export/server/hadoop/datas/dfs/nn
[root@node1 ~]# mkdir -p  /export/server/hadoop/datas/dfs/dn

Step 6. workers
Configure the machines that run the DataNode (slave) role in the HDFS cluster; the command and content are as follows.
Execute the command:

[root@node1 ~]# vim /export/server/hadoop/etc/hadoop/workers 

Add configuration content:

node1.itcast.cn

Step 7. Format HDFS
Before starting HDFS for the first time, format the HDFS file system. The command is as follows:

[root@node1 ~]# hdfs namenode -format

Step 8. Start the HDFS cluster
Start the HDFS services (NameNode and DataNode) on node1.itcast.cn; the commands are as follows:

[root@node1 ~]# hadoop-daemon.sh start namenode
[root@node1 ~]# hadoop-daemon.sh start datanode

View the HDFS Web UI at: http://node1.itcast.cn:50070/
insert image description here

2.2.2 Install Spark 3.x

Unzip the Spark installation package [spark-3.0.0-bin-hadoop2.7.tgz] to the [/export/server] directory:

## Decompress the package
tar -zxf /export/software/spark-3.0.0-bin-hadoop2.7.tgz -C /export/server/
## Create a soft link for easier upgrades later
ln -s /export/server/spark-3.0.0-bin-hadoop2.7 /export/server/spark

The meanings of each directory are as follows:
insert image description here

  • Step 1. Install Scala 2.12.10
## Decompress Scala
tar -zxf /export/softwares/scala-2.12.10.tgz -C /export/server/
## Create a soft link
ln -s /export/server/scala-2.12.10 /export/server/scala
## Set environment variables
vim /etc/profile
### Add the following content:
# SCALA_HOME
export SCALA_HOME=/export/server/scala
export PATH=$PATH:$SCALA_HOME/bin
  • Step 2. Rename the configuration file
## Enter the configuration directory
cd /export/server/spark/conf
## Rename the configuration file
mv spark-env.sh.template spark-env.sh
  • Step 3. Modify the configuration file $SPARK_HOME/conf/spark-env.sh and add the following content:
## Set the JAVA and SCALA installation directories
JAVA_HOME=/export/server/jdk
SCALA_HOME=/export/server/scala
## Hadoop configuration directory, used to read files on HDFS and run on the YARN cluster
HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop

Screenshot below:
insert image description here

Start spark-shell in local mode:

## Enter the Spark installation directory
cd /export/server/spark
## Start spark-shell
bin/spark-shell --master local[2]

After running successfully, there will be the following prompt information:
insert image description here

Upload the $SPARK_HOME/README.md file to the HDFS directory /datas, and use SparkContext to read the file. The commands are as follows:

## Upload the file to HDFS
hdfs dfs -mkdir -p /datas/
hdfs dfs -put /export/server/spark/README.md /datas
## Read the file
val datasRDD = sc.textFile("/datas/README.md")
## Number of lines
datasRDD.count
## Get the first line
datasRDD.first

Relevant screenshots are as follows:
insert image description here

Use the SparkSession object spark to load the text data and wrap it into a DataFrame. The screenshot is as follows:
insert image description here
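The commands in the screenshot are not reproduced above; a minimal sketch of one way to do this from spark-shell (the variable name datasDS is chosen here for illustration) is:

// Read the README.md on HDFS as a Dataset[String] through SparkSession
val datasDS = spark.read.textFile("/datas/README.md")
// Same checks as with the RDD above
datasDS.count()
datasDS.first()
// Show a few rows of the single-column ("value") DataFrame
datasDS.toDF().show(5, truncate = false)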

2.3 Using spark-shell

First run spark-shell in local mode (--master local[2]), simulate Trip ride transaction data, save it to a Hudi table, and then load data from the Hudi table for query analysis. The Hudi table data is ultimately stored on the HDFS distributed file system.
The command to start the pseudo-distributed HDFS file system is as follows:

[root@node1 ~]# hadoop-daemon.sh start namenode
[root@node1 ~]# hadoop-daemon.sh start datanode

2.3.1 Start spark-shell

To operate Hudi table data on the spark-shell command line, the relevant dependency packages must be added when starting spark-shell. The official command (for Spark 3 and Hudi 0.9) is as follows:

spark-shell \
--master local[2] \
--packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.0.1 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"

The above command requires an Internet connection: it downloads the relevant jar packages locally via Ivy and then adds them to the CLASSPATH. It involves 3 jar packages:
insert image description here

In addition, you can download the above three jar packages and upload them to the virtual machine. The command is as follows:

[root@node1 ~]# cd /root
[root@node1 ~]# mkdir -p hudi-jars
# Upload the JAR packages
[root@node1 ~]# rz       

insert image description here

When starting spark-shell, specify them with --jars; the specific command is as follows:

spark-shell \
--master local[2] \
--jars /root/hudi-jars/hudi-spark3-bundle_2.12-0.9.0.jar,\
/root/hudi-jars/spark-avro_2.12-3.0.1.jar,/root/hudi-jars/spark_unused-1.0.0.jar \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"  

A screenshot is shown below:
insert image description here

Next, execute the relevant code, save data to the Hudi table and load data from the Hudi table.
Official documentation: https://hudi.apache.org/docs/spark_quick-start-guide.html

2.3.2 Simulate data

First import Spark and Hudi related packages and define variables (table name and data storage path), the code is as follows:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "hdfs://node1.itcast.cn:8020/datas/hudi-warehouse/hudi_trips_cow"
val dataGen = new DataGenerator

The DataGenerator object constructed above is used to simulate Trip ride data. The code is as follows:

val inserts = convertToStringList(dataGen.generateInserts(10))

The above code generates 10 simulated trip records in JSON format, as follows:
insert image description here

Next, convert the simulated data List to a DataFrame dataset, the code is as follows:

val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

View the Schema information of the converted DataFrame dataset, as follows:

df.printSchema()

insert image description here

Select the relevant fields to view the simulated sample data, as follows:

df.select("rider", "begin_lat", "begin_lon", "driver", "fare", "uuid", "ts").show(10, truncate=false)

insert image description here

2.3.3 Insert data

Save the simulated Trip data into the Hudi table. Because Hudi was built on top of the Spark framework, Spark SQL supports Hudi as a data source: simply specify the data source via format and set the relevant options to save the data. The command is as follows:

df.write
  .mode(Overwrite)
  .format("hudi")
  .options(getQuickstartWriteConfigs)
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
  .option(TABLE_NAME, tableName)
  .save(basePath)

Use :paste mode in the Scala interactive shell to paste the code; the screenshot is as follows:
insert image description here

The relevant parameters are described as follows (see the sketch after this list for the equivalent plain string configuration keys):

  • getQuickstartWriteConfigs: sets the number of shuffle partitions used when writing/updating data to Hudi
    insert image description here

  • PRECOMBINE_FIELD_OPT_KEY: when records with the same key are merged, the record with the larger value of this field is kept
    insert image description here

  • RECORDKEY_FIELD_OPT_KEY: the unique id of each record; multiple fields are supported
    insert image description here

  • PARTITIONPATH_FIELD_OPT_KEY: the partition field used to lay out the stored data
    insert image description here
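For reference, these option constants map to plain string configuration keys. A minimal sketch of the same write using the string keys (assuming the same df, tableName and basePath as above; the parallelism values mirror what getQuickstartWriteConfigs sets) could look like this:

// Equivalent write using string configuration keys instead of the option constants
df.write
  .mode(Overwrite)
  .format("hudi")
  .option("hoodie.insert.shuffle.parallelism", "2")   // what getQuickstartWriteConfigs sets
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.table.name", tableName)
  .save(basePath)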

After the data is saved successfully, check the HDFS file system directory: /datas/hudi-warehouse/hudi_trips_cow, the structure is as follows:
insert image description here

As you can see, the Hudi table data is stored on HDFS as Parquet columnar files.

2.3.4 Query data

Reading data from the Hudi table also uses the Spark SQL external data source mechanism: specify the hudi format and the related options. The command is as follows:

val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")

It is enough to specify the storage path of the Hudi table using glob matching. Since the saved Hudi table is partitioned with a three-level partition path (equivalent to a Hive table with three partition fields), the expression /*/*/*/* is appended to load all the data.
insert image description here

Print the schema of the loaded Hudi table data, as follows:

tripsSnapshotDF.printSchema()

insert image description here

There are 5 more fields than in the data originally written; these are the metadata fields Hudi uses to manage the data.
Register the loaded DataFrame as a temporary view and use SQL to run the business queries.

tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

Query 1: trips whose fare is greater than 20

spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()

The query results are as follows:
insert image description here

**Query 2:** select specific fields to query

spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

The query results are as follows:
insert image description here

At this point, data has been saved to the Hudi table and loaded back from Hudi for analysis.

2.3.5 Table Data Structure

The data files of the Hudi table can be stored using the file system of the operating system, or can be stored using a distributed file system such as HDFS. For subsequent analysis performance and data reliability, HDFS is generally used for storage. From the perspective of HDFS storage, the storage files of a Hudi table are divided into two categories.
insert image description here

  • .hoodie folder: because CRUD operations are fragmented, each operation generates files, and a growing number of small files seriously hurts HDFS performance, so Hudi designed a file-merging mechanism; the .hoodie folder stores the log files related to the corresponding file-merge operations.
  • The americas and asia paths contain the actual data files, stored by partition; the partition path key can be specified.

2.3.5.1 .hoodie files

Hudi calls the series of CRUD operations performed on a table over time its Timeline. A single operation on the Timeline is called an Instant. An Instant contains the following information:

  • Instant Action, recording whether the operation is a data commit (COMMITS), file compaction (COMPACTION), or file cleaning (CLEANS);
  • Instant Time, the time when the operation occurred;
  • State, the state of the operation: requested (REQUESTED), in progress (INFLIGHT), or completed (COMPLETED).

The state records of the corresponding operations are stored in the .hoodie folder:
    insert image description here
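As a quick check, the timeline files can be listed directly from spark-shell with the Hadoop FileSystem API; a minimal sketch (assuming the basePath defined earlier) is shown below:

import org.apache.hadoop.fs.{FileSystem, Path}

// List the timeline files under the .hoodie folder of the Hudi table
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(basePath + "/.hoodie"))
  .filter(_.isFile)
  .map(_.getPath.getName)   // e.g. <instant time>.commit / .inflight / .requested
  .sorted
  .foreach(println)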

2.3.5.2 Data files

Hudi's real data files are stored in the Parquet file format. The screenshot is as follows:
insert image description here

Each partition contains a partition metadata file plus the Parquet columnar data files.
To implement CRUD on data, Hudi must be able to uniquely identify a record. Hudi combines the unique field of the dataset (record key) with the data partition (partitionPath) as the unique key of a record.
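These identifiers are also exposed as Hudi metadata columns when reading the table; a small sketch (run in the same spark-shell session, using the tripsSnapshotDF loaded earlier) shows the key, partition path and the file each record currently lives in:

// Hudi metadata columns: record key, partition path and the data file holding the record
tripsSnapshotDF
  .select("_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name")
  .show(5, truncate = false)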

2.3.6 Hudi data storage overview

The directory structure of a Hudi dataset is very similar to Hive's: a dataset corresponds to a root directory. The dataset is broken into multiple partitions, and the partition fields appear as folders that contain all the files of that partition.
insert image description here

  • Under the root directory, each partition has a unique partition path, and each partition data is stored in multiple files.
  • Each file is identified by a unique fileId and the commit that generated the file. If an update operation occurs, multiple files share the same fileId, but have different commits.
  • Each record is identified by its key value and mapped to a fileId.

The mapping between a record's key and fileId is permanently determined once the first version is written to the file. In other words, a fileId identifies a group of files, each file contains a specific set of records, and the same records in different files are distinguished by version numbers.

2.3.6.1 Metadata

The metadata of every operation performed on the dataset is maintained as a timeline, which supports instantaneous views of the dataset. This metadata is stored in the metadata directory under the root path. There are three types of metadata:

  • Commits: a single commit records an atomic write of a batch of data into the dataset. Commits are identified by monotonically increasing timestamps, marking the beginning of a write operation.
  • Cleans: background activity that removes old versions of files in the dataset that are no longer needed by queries.
  • Compactions: background activity that reconciles differential data structures within Hudi, for example moving updates from row-based log files into columnar data files.

2.3.6.2 Index

Hudi maintains an index so that, when a record key already exists, the incoming record's key can be quickly mapped to the corresponding fileId. The index implementation is pluggable; a configuration sketch follows the list below.

  • Bloom filter: stored in the footer of each data file; the default option, with no dependency on an external system, and data and index always remain consistent.
  • Apache HBase: can efficiently look up a small batch of keys; this option may be several seconds faster during index tagging.
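The index implementation is selected through a write option; a hedged sketch of switching it (reusing the df, tableName and basePath from the spark-shell example; the config key hoodie.index.type and the values listed are standard Hudi settings, and BLOOM is the default) might look like this:

// Choose the index implementation when writing to the Hudi table
// (BLOOM is the default; HBASE requires an external HBase cluster)
df.write
  .format("hudi")
  .options(getQuickstartWriteConfigs)
  .option("hoodie.index.type", "BLOOM")   // e.g. BLOOM, GLOBAL_BLOOM, SIMPLE, HBASE
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
  .option(TABLE_NAME, tableName)
  .mode(Append)
  .save(basePath)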

2.3.6.3 Data

Hudi stores all ingested data in two different storage formats. This part of the design is also pluggable; users can choose any data format that meets the following requirements:

  • Read-optimized column storage format (ROFormat), the default value is Apache Parquet;
  • Write-optimized row format (WOFormat), the default value is Apache Avro;

2.4 IDEA programming development

Apache Hudi was originally developed by Uber to achieve low-latency, high-efficiency data ingestion and access. Hudi provides the concept of a Hudi table, which supports CRUD operations. Next, use the Hudi API with the Spark framework to perform read and write operations.
insert image description here

2.4.1 Prepare the environment

Create a Maven project and add the Hudi and Spark related dependencies; the POM file content is as follows:

<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.12.10</scala.version>
    <scala.binary.version>2.12</scala.binary.version>
    <spark.version>3.0.0</spark.version>
    <hadoop.version>2.7.3</hadoop.version>
    <hudi.version>0.9.0</hudi.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>

    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <!-- hudi-spark3 -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark3-bundle_2.12</artifactId>
        <version>${hudi.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>

</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compiler plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Create the relevant Maven Project project directory structure, as shown in the following figure:
insert image description here

Put the HDFS client configuration files into the project's resources directory so that the Hudi table data can be stored on HDFS.

2.4.2 Code structure

Based on the Spark DataSource, simulate Trip ride transaction data, save it to a Hudi table (COW type: Copy on Write), and then load the data from the Hudi table for query analysis. The specific tasks are as follows:

Task 1: Simulate data and insert it into the Hudi table, using the COW mode
Task 2: Snapshot Query the data, using the DSL style
Task 3: Update the data
Task 4: Incremental Query the data, using the SQL style
Task 5: Delete the data

Create a package [cn.itcast.hudi.spark] in the project, create an object HudiSparkDemo, and write the main method that defines the task requirements and the code structure:

def main(args: Array[String]): Unit = {
   // Create a SparkSession instance and set its properties
   val spark: SparkSession = {
      SparkSession.builder()
         .appName(this.getClass.getSimpleName.stripSuffix("$"))
         .master("local[2]")
         // Set the serializer to Kryo
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate()
   }
   
   val tableName: String = "tbl_trips_cow"
   val tablePath: String = "/hudi-warehouse/tbl_trips_cow"
   
   // Build the data generator used to simulate insert and update data
   import org.apache.hudi.QuickstartUtils._
   
   // Task 1: simulate data and insert it into the Hudi table, using COW mode
   //insertData(spark, tableName, tablePath)
   
   // Task 2: Snapshot Query the data, using the DSL style
   //queryData(spark, tablePath)
   //queryDataByTime(spark, tablePath)
   
   //Thread.sleep(10000)
   // Task 3: update the data
   //val dataGen: DataGenerator = new DataGenerator()
   //insertData(spark, tableName, tablePath, dataGen)
   //updateData(spark, tableName, tablePath, dataGen)
   
   // Task 4: Incremental Query the data, using the SQL style
   //incrementalQueryData(spark, tablePath)
   
   // Task 5: delete the data
   deleteData(spark, tableName, tablePath)
   
   // Application finished, release resources
   spark.stop()
}

Next, complete the task code writing one by one according to the task description.

2.4.3 Insert data (Insert)

Use the DataGenerator class provided by the official QuickstartUtils to simulate 100 Trip ride transaction records, convert them into a DataFrame, and save them to the Hudi table. The code is essentially the same as on the spark-shell command line, as shown below:

/**
 * Official example: simulate data and insert it into a Hudi table of type COW
 */
def insertData(spark: SparkSession, table: String, path: String): Unit = {
   import spark.implicits._
   
   // TODO: a. simulate ride data
   import org.apache.hudi.QuickstartUtils._
   
   val dataGen: DataGenerator = new DataGenerator()
   val inserts = convertToStringList(dataGen.generateInserts(100))
   
   import scala.collection.JavaConverters._
   val insertDF: DataFrame = spark.read
      .json(spark.sparkContext.parallelize(inserts.asScala, 2).toDS())
   //insertDF.printSchema()
   //insertDF.show(10, truncate = false)
   
   // TODO: b. insert the data into the Hudi table
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   insertDF.write
      .mode(SaveMode.Append)
      .format("hudi") // use Hudi as the data source
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Hudi table property settings
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
}

Execute the code to check whether relevant data is saved in the HDFS file system path.

2.4.4 Query data (Query)

Use the Snapshot query method to read data from the Hudi table and analyze it with DSL code according to the business requirements. The code is as follows:

/**
 * Official example: query the table data using a Snapshot Query
 */
def queryData(spark: SparkSession, path: String): Unit = {
   import spark.implicits._
   
   val tripsDF: DataFrame = spark.read.format("hudi").load(path)
   //tripsDF.printSchema()
   //tripsDF.show(10, truncate = false)
   
   // Query rides whose fare is between 20 and 50
   tripsDF
      .filter($"fare" >= 20 && $"fare" <= 50)
      .select($"driver", $"rider", $"fare", $"begin_lat", $"begin_lon", $"partitionpath", $"_hoodie_commit_time")
      .orderBy($"fare".desc, $"_hoodie_commit_time".desc)
      .show(20, truncate = false)
}

Execute the above code; the results are as follows:
insert image description here

When querying Hudi table data, you can also filter by point in time by setting the property "as.of.instant", whose value format is "20210728141108" or "2021-07-28 14:11:08". The code is demonstrated as follows:

/**
 * Official example: query data filtered by point in time
 */
def queryDataByTime(spark: SparkSession, path: String): Unit = {
   import org.apache.spark.sql.functions._
   
   // Way 1: specify the instant as a string in the format yyyyMMddHHmmss
   val df1 = spark.read
      .format("hudi")
      .option("as.of.instant", "20211119095057")
      .load(path)
      .sort(col("_hoodie_commit_time").desc)
   df1.show(numRows = 5, truncate = false)
   
   // Way 2: specify the instant as a string in the format yyyy-MM-dd HH:mm:ss
   val df2 = spark.read
      .format("hudi")
      .option("as.of.instant", "2021-11-19 09:50:57")
      .load(path)
      .sort(col("_hoodie_commit_time").desc)
   df2.show(numRows = 5, truncate = false)
}

2.4.5 Update data (Update)

One of the biggest advantages of the Hudi data lake framework is its support for Upsert operations (insert or update). Next, update the data. Because the officially provided DataGenerator tool class generates update data based on previously generated keys, the same DataGenerator object used to generate the insert data must be reused, so the insertData method is overloaded as follows.

/**
 * Official example: simulate data and insert it into a Hudi table of type COW
 */
def insertData(spark: SparkSession, table: String, path: String, dataGen: DataGenerator): Unit = {
   import spark.implicits._
   
   // TODO: a. simulate ride data
   import org.apache.hudi.QuickstartUtils._
   val inserts = convertToStringList(dataGen.generateInserts(100))
   
   import scala.collection.JavaConverters._
   val insertDF: DataFrame = spark.read
      .json(spark.sparkContext.parallelize(inserts.asScala, 2).toDS())
   //insertDF.printSchema()
   //insertDF.show(10, truncate = false)
   
   // TODO: b. insert the data into the Hudi table
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   insertDF.write
      .mode(SaveMode.Overwrite)
      .format("hudi") // use Hudi as the data source
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Hudi table property settings
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
}

The update method updateData first generates the update data and then saves it to the Hudi table; the code is as follows:

/**
 * Official example: update Hudi data. The same DataGenerator object used for the inserts must be reused so that the keys of the update data already exist
 */
def updateData(spark: SparkSession, table: String, path: String, dataGen: DataGenerator): Unit = {
   import spark.implicits._
   
   // TODO: a. simulate update data
   import org.apache.hudi.QuickstartUtils._

   import scala.collection.JavaConverters._
   val updates = convertToStringList(dataGen.generateUpdates(100))
   val updateDF = spark.read.json(spark.sparkContext.parallelize(updates.asScala, 2).toDS())
   // TODO: b. write the update data to the Hudi table
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   updateDF.write
      .mode(SaveMode.Append)
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
}

2.4.6 Incremental query (Incremental Query)

When the Hudi table type is COW, it supports 2 query methods: Snapshot Queries and Incremental Queries. By default the query is a Snapshot Query; the behavior can be set through the parameter hoodie.datasource.query.type.
insert image description here

For an incremental query you need to specify a timestamp: data in the Hudi table satisfying instant_time > beginTime is loaded. You can also set a time range, endTime > instant_time > beginTime, to obtain the corresponding data. The official source code description is as follows:
insert image description here
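For reference, a bounded incremental read can be sketched with the plain string read options (beginTime and endTime are assumed here to hold two commit timestamps taken from _hoodie_commit_time, and path is the table path; the keys used are the standard Hudi read configs):

// Incremental read bounded by a start and an end commit time
// (only commits between beginTime and endTime are read)
val boundedIncrementalDF = spark.read
   .format("hudi")
   .option("hoodie.datasource.query.type", "incremental")
   .option("hoodie.datasource.read.begin.instanttime", beginTime)
   .option("hoodie.datasource.read.end.instanttime", endTime)
   .load(path)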

Next, first load all the data from the Hudi table and collect the _hoodie_commit_time values, choosing one as the beginTime for the incremental query; then set the read options and query the Hudi table incrementally. The specific code is as follows:

/**
 * Official example: query the table data incrementally using an Incremental Query
 */
def incrementalQueryData(spark: SparkSession, path: String): Unit = {
   import spark.implicits._
   
   // TODO: a. load the Hudi table data and obtain a commit time to use as the incremental query threshold
   import org.apache.hudi.DataSourceReadOptions._
   spark.read
      .format("hudi")
      .load(path)
      .createOrReplaceTempView("view_temp_hudi_trips")
   val commits: Array[String] = spark
      .sql(
         """
           |select
           |  distinct(_hoodie_commit_time) as commitTime
           |from
           |  view_temp_hudi_trips
           |order by
           |  commitTime DESC
           |""".stripMargin
      )
      .map(row => row.getString(0))
      .take(50)
   val beginTime = commits(commits.length - 1) // commit time we are interested in
   println(s"beginTime = ${beginTime}")
   
   // TODO: b. set the commit time threshold and query the data incrementally
   val tripsIncrementalDF = spark.read
      .format("hudi")
      // set the query type to incremental read
      .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
      // set the start time for the incremental read
      .option(BEGIN_INSTANTTIME.key(), beginTime)
      .load(path)
   
   // TODO: c. register the incremental data as a temporary view and query records with fare greater than 20
   tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
   spark
      .sql(
         """
           |select
           |  `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
           |from
           |  hudi_trips_incremental
           |where
           |  fare > 20.0
           |""".stripMargin
      )
      .show(10, truncate = false)
}

The above code registers the DataFrame as a temporary view and queries the incremental data with SQL. The results are as follows:
insert image description here

2.4.7 Delete data (Delete)

Use the DataGenerator to build the data to be deleted based on the existing data, and then save it to the Hudi table. When saving, set the property hoodie.datasource.write.operation to delete.
insert image description here

Write the deleteData method: first fetch 2 records from the Hudi table, then construct the delete payload, and finally save it back to the Hudi table. The specific code is as follows:

/**
 * Official example: delete Hudi table data by the primary key uuid; for a partitioned table, the partition path must also be specified
 */
def deleteData(spark: SparkSession, table: String, path: String): Unit = {
   import spark.implicits._
   
   // TODO: a. load the Hudi table data and count the records
   val tripsDF: DataFrame = spark.read.format("hudi").load(path)
   println(s"Count = ${tripsDF.count()}")
   
   // TODO: b. build the data to delete
   val dataframe: DataFrame = tripsDF.select($"uuid", $"partitionpath").limit(2)
   import org.apache.hudi.QuickstartUtils._

   val dataGen: DataGenerator = new DataGenerator()
   val deletes = dataGen.generateDeletes(dataframe.collectAsList())
   
   import scala.collection.JavaConverters._
   val deleteDF = spark.read.json(spark.sparkContext.parallelize(deletes.asScala, 2))
   
   // TODO: c. save the data to the Hudi table with the operation type set to DELETE
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   deleteDF.write
      .mode(SaveMode.Append)
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // set the write operation type to delete (the default is upsert)
      .option(OPERATION.key(), "delete")
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
   
   // TODO: d. load the Hudi table again and check whether the count has decreased by 2
   val hudiDF: DataFrame = spark.read.format("hudi").load(path)
   println(s"Delete After Count = ${hudiDF.count()}")
}

Origin blog.csdn.net/toto1297488504/article/details/132240039