4. Detailed walkthrough of the Hudi and Spark integration
The Hudi quick start gave only a brief taste of the Spark integration; this section explains it in detail.
For compiling Hudi 0.12 from source, integrating it with Spark, and basic create/read/update/delete of Hudi tables from IDEA and Spark, see part (2) of this data lake series.
4.1 Using the spark-shell method
# Start the spark-shell
spark-shell \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
4.1.1 Insert data
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "hdfs://192.168.42.104:9000/datas/hudi_warehouse/hudi_trips_cow"
val dataGen = new DataGenerator
# No separate CREATE TABLE is needed: if the table does not exist, the first batch write creates it (a COW table by default).
# Insert data: use the official utility class to generate some Trips ride data, load it into a DataFrame, and write the DataFrame to the Hudi table.
# Mode(Overwrite) drops and recreates the table if it already exists.
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
4.1.2 Query data
# Note: the table has three partition levels (region/country/city). Before Hudi 0.9.0 the load path had to append one "*" per partition level, e.g. load(basePath + "/*/*/*/*"); the current version does not need this.
# 1. Load into a DataFrame
val tripsSnapshotDF = spark.
read.
format("hudi").
load(basePath)
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
# 2. Run a query
scala> spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
+------------------+-------------------+-------------------+-------------+
| fare| begin_lon| begin_lat| ts|
+------------------+-------------------+-------------------+-------------+
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1677600005195|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1677240470730|
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1677696170708|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1677272458691|
| 43.4923811219014| 0.8779402295427752| 0.6100070562136587|1677360474147|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1677583109653|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1677602421735|
| 41.06290929046368| 0.8192868687714224| 0.651058505660742|1677721939334|
+------------------+-------------------+-------------------+-------------+
# 3. Query the extra metadata columns added by hudi
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20230302191855836|16df8361-18cd-461...| americas/united_s...|rider-213|driver-213| 64.27696295884016|
| 20230302191855836|d2bb2448-1e1f-45f...| americas/united_s...|rider-213|driver-213| 27.79478688582596|
| 20230302191855836|8d1b3b83-e88c-45e...| americas/united_s...|rider-213|driver-213| 93.56018115236618|
| 20230302191855836|ce2b0518-1875-48b...| americas/united_s...|rider-213|driver-213| 33.92216483948643|
| 20230302191855836|a5b03e52-31c7-4f9...| americas/united_s...|rider-213|driver-213|19.179139106643607|
| 20230302191855836|30263e49-3c95-489...| americas/brazil/s...|rider-213|driver-213| 43.4923811219014|
| 20230302191855836|dd70365d-5345-4d3...| americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
| 20230302191855836|ff01ba9d-92f0-410...| americas/brazil/s...|rider-213|driver-213|34.158284716382845|
| 20230302191855836|4d4e2563-bc21-4e6...| asia/india/chennai|rider-213|driver-213|17.851135255091155|
| 20230302191855836|3c495316-233e-418...| asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+
# 4. Time-travel query
Hudi supports time-travel queries since 0.9.0. Three timestamp formats are currently supported, as shown below.
spark.read.
format("hudi").
option("as.of.instant", "20230302191855836").
load(basePath).show(10)
spark.read.
format("hudi").
option("as.of.instant", "2023-03-02 19:18:55.836").
load(basePath).show(10)
# Means "as.of.instant = 2023-03-02 00:00:00"
spark.read.
format("hudi").
option("as.of.instant", "2023-03-02").
load(basePath).show(10)
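All three accepted formats resolve to a Hudi commit instant (yyyyMMddHHmmssSSS), and the query then sees every commit at or before that instant. A plain-Python sketch of that normalization (an illustration of the semantics, not Hudi's internal implementation; it assumes millisecond-precision instant strings):

```python
from datetime import datetime

def to_instant(ts: str) -> str:
    """Normalize the three supported formats to a yyyyMMddHHmmssSSS instant string."""
    if ts.isdigit():                      # already an instant, e.g. "20230302191855836"
        return ts
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d"):
        try:
            dt = datetime.strptime(ts, fmt)
            # a bare date parses as midnight, matching the comment above
            return dt.strftime("%Y%m%d%H%M%S") + f"{dt.microsecond // 1000:03d}"
        except ValueError:
            continue
    raise ValueError(f"unsupported timestamp format: {ts}")

def snapshot_commits(commits, as_of):
    """Commits visible to a time-travel query: all instants <= the requested one."""
    instant = to_instant(as_of)
    return [c for c in sorted(commits) if c <= instant]
```

For example, `to_instant("2023-03-02")` yields `20230302000000000`, which is why the third query above is equivalent to "as.of.instant = 2023-03-02 00:00:00".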
4.1.3 Update data
# Data before the update
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20230302191855836|16df8361-18cd-461...| americas/united_s...|rider-213|driver-213| 64.27696295884016|
| 20230302191855836|d2bb2448-1e1f-45f...| americas/united_s...|rider-213|driver-213| 27.79478688582596|
| 20230302191855836|8d1b3b83-e88c-45e...| americas/united_s...|rider-213|driver-213| 93.56018115236618|
| 20230302191855836|ce2b0518-1875-48b...| americas/united_s...|rider-213|driver-213| 33.92216483948643|
| 20230302191855836|a5b03e52-31c7-4f9...| americas/united_s...|rider-213|driver-213|19.179139106643607|
| 20230302191855836|30263e49-3c95-489...| americas/brazil/s...|rider-213|driver-213| 43.4923811219014|
| 20230302191855836|dd70365d-5345-4d3...| americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
| 20230302191855836|ff01ba9d-92f0-410...| americas/brazil/s...|rider-213|driver-213|34.158284716382845|
| 20230302191855836|4d4e2563-bc21-4e6...| asia/india/chennai|rider-213|driver-213|17.851135255091155|
| 20230302191855836|3c495316-233e-418...| asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+
# Update data
# Similar to inserting new data: use the data generator (note: the same generator object) to generate updates to the historical data, load them into a DataFrame, and write the DataFrame to the hudi table.
val updates = convertToStringList(dataGen.generateUpdates(5))
val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
# Note: the save mode is now Append. In general, always use Append mode unless you are creating the table for the first time.
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
# Query again
# 1. Load into a DataFrame
val tripsSnapshotDF = spark.
read.
format("hudi").
load(basePath)
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
# Data after the update
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20230302194751288|16df8361-18cd-461...| americas/united_s...|rider-243|driver-243|14.503019204958845|
| 20230302194751288|d2bb2448-1e1f-45f...| americas/united_s...|rider-243|driver-243| 51.42305232303094|
| 20230302194751288|8d1b3b83-e88c-45e...| americas/united_s...|rider-243|driver-243|26.636532270940915|
| 20230302194716880|ce2b0518-1875-48b...| americas/united_s...|rider-284|driver-284| 90.9053809533154|
| 20230302191855836|a5b03e52-31c7-4f9...| americas/united_s...|rider-213|driver-213|19.179139106643607|
| 20230302194751288|30263e49-3c95-489...| americas/brazil/s...|rider-243|driver-243| 89.45841313717807|
| 20230302194751288|dd70365d-5345-4d3...| americas/brazil/s...|rider-243|driver-243|2.4995362119815567|
| 20230302194716880|ff01ba9d-92f0-410...| americas/brazil/s...|rider-284|driver-284| 29.47661370147079|
| 20230302194751288|4d4e2563-bc21-4e6...| asia/india/chennai|rider-243|driver-243| 71.08018349571618|
| 20230302194716880|3c495316-233e-418...| asia/india/chennai|rider-284|driver-284| 9.384124531808036|
+-------------------+--------------------+----------------------+---------+----------+------------------+
4.1.4 Incremental query
Hudi also provides a way to query incrementally, which can get the data stream changed since a given commit timestamp. You need to specify the beginTime of the incremental query, and optionally specify the endTime. If we want all changes after a given commit, we don't need to specify endTime (which is the common case).
# 1. Load the data
spark.
read.
format("hudi").
load(basePath).
createOrReplaceTempView("hudi_trips_snapshot")
# 2. Get the beginTime
scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
commits: Array[String] = Array(20230302210112648, 20230302210408496)
scala> val beginTime = commits(commits.length - 2)
beginTime: String = 20230302210112648
# 3. Create a view for the incremental query
val tripsIncrementalDF = spark.read.format("hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
# 4. Query the incremental view
scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
+-------------------+-----------------+-------------------+------------------+-------------+
|_hoodie_commit_time| fare| begin_lon| begin_lat| ts|
+-------------------+-----------------+-------------------+------------------+-------------+
| 20230302210408496|60.34474295461695|0.03363698727131392|0.9886806054385373|1677343847695|
| 20230302210408496| 57.4289850003576| 0.9692506010574379|0.9566270007622102|1677699656426|
+-------------------+-----------------+-------------------+------------------+-------------+
4.1.5 Point-in-time query
# To query data as of a specific point in time, set endTime to that instant and beginTime to "000" (the earliest commit time).
# 1) Specify beginTime and endTime
val beginTime = "000"
val endTime = commits(commits.length - 2)
# 2) Create a view over the specified time range
val tripsPointInTimeDF = spark.read.format("hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
# 3) Query
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
+-------------------+-----------------+------------------+-------------------+-------------+
|_hoodie_commit_time| fare| begin_lon| begin_lat| ts|
+-------------------+-----------------+------------------+-------------------+-------------+
| 20230302210112648|75.67233311397607|0.7433519787065044|0.23986563259065297|1677257554148|
| 20230302210112648|72.88363497900701|0.6482943149906912| 0.682825302671212|1677446496876|
| 20230302210112648|41.57780462795554|0.5609292266131617| 0.6718059599888331|1677230346940|
| 20230302210112648|69.36363684236434| 0.621688297381891|0.13625652434397972|1677277488735|
| 20230302210112648|43.51073292791451|0.3953934768927382|0.39178349695388426|1677567017799|
| 20230302210112648|62.79408654844148|0.8414360533180016| 0.9115819084017496|1677314954780|
| 20230302210112648|66.06966684558341|0.7598920002419857| 0.1591418101835923|1677428809403|
| 20230302210112648|63.30100459693087|0.4878809010360382| 0.6331319396951335|1677336164167|
+-------------------+-----------------+------------------+-------------------+-------------+
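Both the incremental query (4.1.4) and the point-in-time query (4.1.5) reduce to filtering records by `_hoodie_commit_time`: take everything strictly after beginTime and, if given, at or before endTime. A minimal sketch of that semantics over plain dictionaries (illustration only; Hudi actually does this at the file-slice level):

```python
def incremental_query(records, begin_time, end_time=None):
    """Records committed strictly after begin_time and, if end_time is given, up to it."""
    out = []
    for r in records:
        commit = r["_hoodie_commit_time"]
        if commit <= begin_time:
            continue                       # not newer than beginTime: excluded
        if end_time is not None and commit > end_time:
            continue                       # past the requested end point: excluded
        out.append(r)
    return out
```

With `begin_time="000"` and an `end_time`, this is exactly the point-in-time query; with only a `begin_time`, it is the incremental query.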
4.1.6 Delete data
Hudi deletes records by the given HoodieKeys (uuid + partitionpath). Deletes are supported only in Append mode.
# 1) Get the total row count
scala> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
res50: Long = 10
# 2) Take 2 of the rows to delete
val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
# 3) Build a DataFrame from the 2 rows to be deleted
val deletes = dataGen.generateDeletes(ds.collectAsList())
val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
# 4) Execute the delete
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(OPERATION_OPT_KEY,"delete").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
# 5) Count the rows after the delete to verify it succeeded
val roAfterDeleteViewDF = spark.
read.
format("hudi").
load(basePath)
roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
// The total row count should now be 2 less than before
scala> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
res53: Long = 8
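Since deletion is keyed on (uuid, partitionpath), its effect can be sketched in plain Python as removing the matching HoodieKeys from a snapshot (an illustration of the semantics, not Hudi's implementation):

```python
def delete_by_keys(snapshot, delete_keys):
    """Drop records whose (uuid, partitionpath) pair appears in delete_keys."""
    keys = {(k["uuid"], k["partitionpath"]) for k in delete_keys}
    return [r for r in snapshot if (r["uuid"], r["partitionpath"]) not in keys]
```

Deleting 2 of 10 keys leaves 8 rows, matching the counts in the transcript above.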
4.1.7 Insert overwrite data
For a table or partition where most records change every cycle, doing an upsert or merge is inefficient. We want something like Hive's "insert overwrite" operation: ignore the existing data and simply create a commit with the new data provided.
It is also useful for certain operational tasks, such as repairing a problem partition: we can 'insert overwrite' the partition with the records from the source file. For some data sources this is much faster than restore-and-replay.
Insert overwrite can be faster than upsert for batch ETL jobs that recompute an entire target partition each batch, since it skips the indexing, precombining, and other repartitioning steps.
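The difference can be sketched over an in-memory partition map (illustration only, not Hudi's file-based implementation): upsert merges incoming records into existing keys, while insert_overwrite discards a touched partition's contents and keeps only the incoming records.

```python
from collections import defaultdict

# table maps partitionpath -> {uuid: record}
def upsert(table, incoming):
    """Merge incoming records into existing partitions by record key (uuid)."""
    for r in incoming:
        table[r["partitionpath"]][r["uuid"]] = r

def insert_overwrite(table, incoming):
    """Replace every touched partition with only the incoming records;
    untouched partitions are left as they are."""
    for p in {r["partitionpath"] for r in incoming}:
        table[p] = {}                       # drop the partition's old contents
    for r in incoming:
        table[r["partitionpath"]][r["uuid"]] = r
```

This mirrors the transcript below: after overwriting the san_francisco partition with one new row, that partition contains only the new uuid while the other partitions are unchanged.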
# 1) Look at the current keys in the table
scala> spark.
| read.format("hudi").
| load(basePath).
| select("uuid","partitionpath").
| sort("partitionpath","uuid").
| show(100, false)
+------------------------------------+------------------------------------+
|uuid |partitionpath |
+------------------------------------+------------------------------------+
|0a47c845-fb42-4187-af27-a85e6229a3c3|americas/brazil/sao_paulo |
|6f82914d-f7a0-4972-8691-d1404ed7cae3|americas/brazil/sao_paulo |
|e2d4fa5b-da34-4603-85c3-d2ad884ac090|americas/brazil/sao_paulo |
|26e8db50-755c-44e7-9200-988a78c1e5de|americas/united_states/san_francisco|
|5afb905d-7ed2-46f5-bba8-5e2fb8ac88da|americas/united_states/san_francisco|
|2947db75-fa72-43d5-993c-4530b9890c73|asia/india/chennai |
|74f3ec44-62fa-435f-b06c-4cb9e0f4defa|asia/india/chennai |
|f22b8c1c-7b57-4c5f-8bce-7ce6783047b0|asia/india/chennai |
+------------------------------------+------------------------------------+
# 2) Generate some new trip data
val inserts = convertToStringList(dataGen.generateInserts(2))
val df = spark.
read.json(spark.sparkContext.parallelize(inserts, 2)).
filter("partitionpath = 'americas/united_states/san_francisco'")
# 3) Overwrite the target partition
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(OPERATION.key(),"insert_overwrite").
option(PRECOMBINE_FIELD.key(), "ts").
option(RECORDKEY_FIELD.key(), "uuid").
option(PARTITIONPATH_FIELD.key(), "partitionpath").
option(TBL_NAME.key(), tableName).
mode(Append).
save(basePath)
# 4) Query the keys after the overwrite: they have changed
spark.
read.format("hudi").
load(basePath).
select("uuid","partitionpath").
sort("partitionpath","uuid").
show(100, false)
+------------------------------------+------------------------------------+
|uuid |partitionpath |
+------------------------------------+------------------------------------+
|0a47c845-fb42-4187-af27-a85e6229a3c3|americas/brazil/sao_paulo |
|6f82914d-f7a0-4972-8691-d1404ed7cae3|americas/brazil/sao_paulo |
|e2d4fa5b-da34-4603-85c3-d2ad884ac090|americas/brazil/sao_paulo |
|ea2fe685-ad87-4bba-b688-4436f729e005|americas/united_states/san_francisco|
|2947db75-fa72-43d5-993c-4530b9890c73|asia/india/chennai |
|74f3ec44-62fa-435f-b06c-4cb9e0f4defa|asia/india/chennai |
|f22b8c1c-7b57-4c5f-8bce-7ce6783047b0|asia/india/chennai |
+------------------------------------+------------------------------------+
4.2 Using spark-sql method
4.2.1 Installing Hive 3.1.2
Hive 3.1.2 can be downloaded from http://archive.apache.org/dist/hive/hive-3.1.2/
1. After downloading, upload it to /opt/apps
2. Unzip
tar -zxvf apache-hive-3.1.2-bin.tar.gz
3. Rename the directory
mv apache-hive-3.1.2-bin hive-3.1.2
4. Execute the following commands to rename the default configuration template
cd /opt/apps/hive-3.1.2/conf
mv hive-default.xml.template hive-default.xml
5. Execute the following command to create a new hive-site.xml configuration file
vim hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- JDBC connection URL -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://centos04:3306/hive?useSSL=false</value>
</property>
<!-- JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- JDBC connection username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<!-- JDBC connection password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<!-- Hive's default working directory on HDFS -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<!-- Metastore schema verification -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<!-- Metastore event DB notification API auth -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<!-- Address of the metastore service -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://centos04:9083</value>
</property>
<!-- Host for hiveserver2 connections -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>centos04</value>
</property>
<!-- Port for hiveserver2 connections -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
</configuration>
6. Configure hadoop
Add the following content to hadoop's core-site.xml, then restart
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>root</value>
<description>Allow the superuser root to impersonate members of the root group</description>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
<description>Allow the superuser root to connect from any host when impersonating users</description>
</property>
7. Hive's bundled guava jar version conflicts with the one in Hadoop
# hadoop 3.1.3 ships guava 27, while hive 3.1.2 ships guava 19
# Since they differ, delete the lower version and copy the higher one over.
rm -rf /opt/apps/hive-3.1.2/lib/guava-19.0.jar
cp /opt/apps/hadoop-3.1.3/share/hadoop/common/lib/guava-27.0-jre.jar /opt/apps/hive-3.1.2/lib
8. Configure the Hive metastore database in MySQL
1. First download the MySQL JDBC driver jar
2. Copy it into the hive/lib directory
3. Start and log in to MySQL
4. Create the hive database, grant all privileges to the root user with 123456 as the connection password from hive-site.xml, then flush the privilege tables
mysql> create database hive;
mysql> CREATE USER 'root'@'%' IDENTIFIED BY '123456';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' WITH GRANT OPTION;
mysql> flush privileges;
-- Initialize the Hive metastore schema
[root@centos04 conf]# schematool -initSchema -dbType mysql -verbose
9. Start Hive's Metastore
# Configure the environment variable
export HIVE_HOME=/opt/apps/hive-3.1.2
# Start the metastore
[root@centos04 conf]# nohup hive --service metastore &
[root@centos04 conf]# netstat -nltp | grep 9083
tcp6 0 0 :::9083 :::* LISTEN 10282/java
10. Start Hive
# Start the hadoop cluster first
start-dfs.sh
# After starting the cluster, wait until HDFS exits safe mode before starting hive.
[root@centos04 conf]# hive
# Start the remote connection service (hiveserver2)
[root@centos04 ~]# hiveserver2 &
[root@centos04 ~]# netstat -nltp | grep 10000
tcp6 0 0 :::10000 :::* LISTEN 10589/java
[root@centos04 ~]# netstat -nltp | grep 10002
tcp6 0 0 :::10002 :::* LISTEN 10589/java
beeline
!connect jdbc:hive2://centos04:10000
Enter the username: root
Enter the password: just press Enter
4.2.2 Create hudi table using spark-sql
# Start the spark-sql command line
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
Note: if the Hive environment variable is not configured, manually copy hive-site.xml into Spark's conf directory.
Parameter | Default | Description
---|---|---
primaryKey | uuid | Primary key field name(s) of the table; separate multiple fields with commas. Same as hoodie.datasource.write.recordkey.field
preCombineField | | Pre-combine field of the table. Same as hoodie.datasource.write.precombine.field
type | cow | Type of table to create: type = 'cow' or type = 'mor'. Same as hoodie.datasource.write.table.type
4.2.2.1 Create a non-partitioned table
use hudi_spark;
-- Create a cow table with the default primaryKey 'uuid' and no preCombineField
create table hudi_cow_nonpcf_tbl (
uuid int,
name string,
price double
) using hudi;
-- The default creation path is local: /root/spark-warehouse/hudi_spark.db/hudi_cow_nonpcf_tbl
-- Create a non-partitioned mor table
create table hudi_mor_tbl (
id int,
name string,
price double,
ts bigint
) using hudi
tblproperties (
type = 'mor',
primaryKey = 'id',
preCombineField = 'ts'
);
4.2.2.2 Create partition table
-- Create a partitioned cow external table, specifying primaryKey and preCombineField
create table hudi_cow_pt_tbl (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh)
location 'hdfs://192.168.42.104:9000/datas/hudi_warehouse/spark_sql/hudi_cow_pt_tbl';
4.2.2.3 Create a new table on the existing hudi table
-- No schema or any properties other than partition columns (if present) need to be specified; hudi identifies the schema and configuration automatically.
-- Non-partitioned table (created from an existing local path)
create table hudi_existing_tbl0 using hudi
location 'file:///root/spark-warehouse/hudi_spark.db/hudi_cow_nonpcf_tbl';
-- Partitioned table (created from an existing path on hdfs; this fails if there is no data there):
-- It is not allowed to specify partition columns when the table schema is not defined
create table hudi_existing_tbl1 using hudi
partitioned by (dt, hh)
location 'hdfs://192.168.42.104:9000/datas/hudi_warehouse/spark_sql/hudi_cow_pt_tbl';
4.2.2.4 Create a table through CTAS (Create Table As Select)
-- To improve data-loading performance into hudi tables, CTAS uses bulk insert as the write operation.
-- (1) Create a non-partitioned cow table via CTAS, without preCombineField
create table hudi_ctas_cow_nonpcf_tbl
using hudi
tblproperties (primaryKey = 'id')
as
select 1 as id, 'a1' as name, 10 as price;
-- (2) Create a partitioned cow table via CTAS, specifying preCombineField
create table hudi_ctas_cow_pt_tbl
using hudi
tblproperties (type = 'cow', primaryKey = 'id', preCombineField = 'ts')
partitioned by (dt)
as
select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-12-01' as dt;
-- (3) Load data from another table via CTAS
-- Create a managed parquet table
create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet';
-- Load the data via CTAS
create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (datestr) as select * from parquet_mngd;
4.2.3 Insert data
By default, the write operation type of insert into is upsert if a preCombineKey is provided; otherwise insert is used.
-- 1) Insert into non-partitioned tables
insert into hudi_cow_nonpcf_tbl select 1, 'a1', 20;
insert into hudi_mor_tbl select 1, 'a1', 20, 1000;
-- 2) Insert into a partitioned table with dynamic partitioning
insert into hudi_cow_pt_tbl partition (dt, hh)
select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt, '10' as hh;
-- 3) Insert into a partitioned table with static partitioning
insert into hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='11') select 2, 'a2', 1000;
-- 4) Insert with bulk_insert
-- hudi supports bulk_insert as the write operation type; only two configs need to be set:
-- hoodie.sql.bulk.insert.enable and hoodie.sql.insert.mode.
-- Inserting into a table with a preCombineKey specified performs an upsert
insert into hudi_mor_tbl select 1, 'a1_1', 20, 1001;
select id, name, price, ts from hudi_mor_tbl;
1 a1_1 20.0 1001
-- Insert into the same table with the write operation set to bulk_insert (no update happens this time)
set hoodie.sql.bulk.insert.enable=true;
set hoodie.sql.insert.mode=non-strict;
insert into hudi_mor_tbl select 1, 'a1_2', 20, 1002;
select id, name, price, ts from hudi_mor_tbl;
1 a1_1 20.0 1001
1 a1_2 20.0 1002
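The behavior above can be sketched in plain Python (an illustration of the semantics, not Hudi's implementation): upsert keeps one row per primary key, resolving conflicts with the larger precombine value, while bulk_insert appends without any key-based deduplication.

```python
def upsert_rows(existing, incoming, key="id", precombine="ts"):
    """Upsert: one row per key; on conflict keep the row with the larger precombine value."""
    by_key = {r[key]: r for r in existing}
    for r in incoming:
        cur = by_key.get(r[key])
        if cur is None or r[precombine] >= cur[precombine]:
            by_key[r[key]] = r
    return list(by_key.values())

def bulk_insert_rows(existing, incoming):
    """bulk_insert: plain append, no key-based deduplication."""
    return existing + incoming
```

This reproduces the transcript: upserting (1, 'a1_1', 1001) replaces the old row for id 1, while bulk-inserting (1, 'a1_2', 1002) leaves both rows in the table.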
4.2.4 Query data
-- 1) Query
select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0
-- 2) Time-travel query
Hudi supports time-travel queries since 0.9.0. The Spark SQL syntax requires Spark 3.2 or above.
create table hudi_cow_pt_tbl1 (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl1';
-- Insert a row with id 1
insert into hudi_cow_pt_tbl1 select 1, 'a0', 1000, '2021-12-09', '10';
select * from hudi_cow_pt_tbl1;
-- Modify the row with id 1 (an upsert via insert)
insert into hudi_cow_pt_tbl1 select 1, 'a1', 1001, '2021-12-09', '10';
select * from hudi_cow_pt_tbl1;
-- Time travel back to the first commit time
select * from hudi_cow_pt_tbl1 timestamp as of '20230303013452312' where id = 1;
-- Time travel using the other supported timestamp formats
select * from hudi_cow_pt_tbl1 timestamp as of '2023-03-03 01:34:52.312' where id = 1;
select * from hudi_cow_pt_tbl1 timestamp as of '2023-03-03' where id = 1;
4.2.5 Update data
-- 1) UPDATE
An update requires preCombineField to be specified.
(1) Syntax
UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE boolExpression]
(2) Run the updates
update hudi_mor_tbl set price = price * 2, ts = 1111 where id = 1;
update hudi_cow_pt_tbl1 set name = 'a1_1', ts = 1001 where id = 1;
-- update using non-PK field
update hudi_cow_pt_tbl1 set ts = 1111 where name = 'a1_1';
-- 2) MERGE INTO
(1) Syntax
MERGE INTO tableIdentifier AS target_alias
USING (sub_query | tableIdentifier) AS source_alias
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
<merge_condition> = an equality boolean condition
<matched_action> =
DELETE |
UPDATE SET * |
UPDATE SET column1 = expression1 [, column2 = expression2 ...]
<not_matched_action> =
INSERT * |
INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
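The MERGE semantics just described (matched rows are updated or deleted, unmatched rows are inserted) can be sketched in plain Python, keyed on the merge condition `target.id = source.id` (an illustration only, not Hudi's implementation):

```python
def merge_into(target, source, matched_action, not_matched_action):
    """Apply MERGE INTO semantics keyed on 'id'.

    matched_action(src_row) returns the updated row, or None to delete;
    not_matched_action(src_row) returns the row to insert.
    """
    by_id = {r["id"]: r for r in target}
    for s in source:
        if s["id"] in by_id:                      # WHEN MATCHED
            updated = matched_action(s)
            if updated is None:
                del by_id[s["id"]]                # ... THEN DELETE
            else:
                by_id[s["id"]] = updated          # ... THEN UPDATE SET
        else:                                     # WHEN NOT MATCHED
            by_id[s["id"]] = not_matched_action(s)
    return list(by_id.values())
```

Driving the two actions from a `flag` column, as the second example below does, reproduces the update/delete/insert branches in one statement.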
(2) Examples
-- 1. Prepare the source table: a non-partitioned hudi table, and insert data
create table merge_source (id int, name string, price double, ts bigint) using hudi
tblproperties (primaryKey = 'id', preCombineField = 'ts');
insert into merge_source values (1, "old_a1", 22.22, 2900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000);
merge into hudi_mor_tbl as target
using merge_source as source
on target.id = source.id
when matched then update set *
when not matched then insert *
;
-- 2. Prepare another source table: a partitioned parquet table, and insert data
create table merge_source2 (id int, name string, flag string, dt string, hh string) using parquet;
insert into merge_source2 values (1, "new_a1", 'update', '2021-12-09', '10'), (2, "new_a2", 'delete', '2021-12-09', '11'), (3, "new_a3", 'insert', '2021-12-09', '12');
merge into hudi_cow_pt_tbl1 as target
using (
select id, name, '2000' as ts, flag, dt, hh from merge_source2
) source
on target.id = source.id
when matched and flag != 'delete' then
update set id = source.id, name = source.name, ts = source.ts, dt = source.dt, hh = source.hh
when matched and flag = 'delete' then delete
when not matched then
insert (id, name, ts, dt, hh) values(source.id, source.name, source.ts, source.dt, source.hh)
;
4.2.6 Delete data
-- Delete data
1) Syntax
DELETE FROM tableIdentifier [ WHERE BOOL_EXPRESSION]
2) Examples
delete from hudi_cow_nonpcf_tbl where uuid = 1;
delete from hudi_mor_tbl where id % 2 = 0;
-- Delete using a non-primary-key field
delete from hudi_cow_pt_tbl1 where name = 'a1_1';
4.2.7 Insert overwrite data
Use the INSERT_OVERWRITE write operation to overwrite a partitioned table.
Use the INSERT_OVERWRITE_TABLE write operation to overwrite a non-partitioned table, or a partitioned table with dynamic partitioning.
-- 1) insert overwrite a non-partitioned table
insert overwrite hudi_mor_tbl select 99, 'a99', 20.0, 900;
insert overwrite hudi_cow_nonpcf_tbl select 99, 'a99', 20.0;
-- 2) insert overwrite table into a partitioned table via dynamic partitioning
insert overwrite table hudi_cow_pt_tbl1 select 10, 'a10', 1100, '2021-12-09', '11';
-- 3) insert overwrite a partitioned table via static partitioning
insert overwrite hudi_cow_pt_tbl1 partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100;
4.2.8 Altering table structure and partitions
-- Alter table
1) Syntax
-- Alter table name
ALTER TABLE oldTableName RENAME TO newTableName
-- Alter table add columns
ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)
-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
-- Alter table properties
ALTER TABLE tableIdentifier SET TBLPROPERTIES (key = 'value')
2) Examples
--rename to:
ALTER TABLE hudi_cow_nonpcf_tbl RENAME TO hudi_cow_nonpcf_tbl2;
--add column:
ALTER TABLE hudi_cow_nonpcf_tbl2 add columns(remark string);
--change column:
ALTER TABLE hudi_cow_nonpcf_tbl2 change column uuid uuid int;
--set properties;
alter table hudi_cow_nonpcf_tbl2 set tblproperties (hoodie.keep.max.commits = '10');
-- Alter partitions
1) Syntax
-- Drop Partition
ALTER TABLE tableIdentifier DROP PARTITION ( partition_col_name = partition_col_val [ , ... ] )
-- Show Partitions
SHOW PARTITIONS tableIdentifier
2) Examples
--show partition:
show partitions hudi_cow_pt_tbl1;
--drop partition:
alter table hudi_cow_pt_tbl1 drop partition (dt='2021-12-09', hh='10');
Note: the show partitions result is based on the table path in the file system, so after deleting all of a partition's data, or dropping a partition directory directly, the output is not exact.
4.3 Using the IDEA method
You can refer to: https://blog.csdn.net/qq_44665283/article/details/129271737?spm=1001.2014.3001.5501
4.4 Using the DeltaStreamer ingestion tool (Apache Kafka to Hudi table)
The HoodieDeltaStreamer tool (part of hudi-utilities-bundle) provides a way to ingest data from different sources such as DFS or Kafka, with the following capabilities:
- Exactly-once ingestion of new events from Kafka, plus incremental imports from the output of Sqoop or HiveIncrementalPuller, or from files under a DFS folder.
- Imported data may be json, avro, or a custom type.
- Manages checkpoints, rollback, and recovery.
- Reads Avro schemas from DFS files or the Confluent schema registry.
- Supports custom transformations.
The official website is as follows: https://hudi.apache.org/cn/docs/0.12.2/hoodie_deltastreamer/
The case given on the official website is based on Confluent Kafka, and this case is based on Apache Kafka.
1. Start ZooKeeper and Kafka
2. Create a test topic
/opt/apps/kafka_2.12-2.6.2/bin/kafka-topics.sh --bootstrap-server centos01:9092 --create --topic hudi_test
3. Prepare kafka producer program
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>hudi-start</artifactId>
<groupId>com.yyds</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>hudi-kafka</artifactId>
<dependencies>
<!-- kafka client -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.4.1</version>
</dependency>
<!-- fastjson <= 1.2.80 has a known security vulnerability, so use a newer release -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.83</version>
</dependency>
</dependencies>
</project>
package com.yyds;
import com.alibaba.fastjson.JSONObject;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
import java.util.Random;
public class HudiKafkaProducer {
public static void main(String[] args) {
Properties props = new Properties();
props.put("bootstrap.servers", "centos01:9092,centos02:9092,centos03:9092");
props.put("acks", "-1");
props.put("batch.size", "1048576");
props.put("linger.ms", "5");
props.put("compression.type", "snappy");
props.put("buffer.memory", "33554432");
props.put("key.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
Random random = new Random();
for (int i = 0; i < 1000; i++) {
JSONObject model = new JSONObject();
model.put("userid", i);
model.put("username", "name" + i);
model.put("age", 18);
model.put("partition", random.nextInt(100));
producer.send(new ProducerRecord<String, String>("hudi_test", model.toJSONString()));
}
producer.flush();
producer.close();
}
}
4. Prepare the configuration file of the DeltaStreamer tool
(1) Define the required Avro schema files (for both source and target)
mkdir /opt/apps/hudi-props/
vim /opt/apps/hudi-props/source-schema-json.avsc
# Schema for the Kafka message fields:
{
"type": "record",
"name": "Profiles",
"fields": [
{
"name": "userid",
"type": [ "null", "string" ],
"default": null
},
{
"name": "username",
"type": [ "null", "string" ],
"default": null
},
{
"name": "age",
"type": [ "null", "string" ],
"default": null
},
{
"name": "partition",
"type": [ "null", "string" ],
"default": null
}
]
}
# Schema for the Hudi (target) table
cp source-schema-json.avsc target-schema-json.avsc
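As a quick sanity check that the schema and the producer agree on field names, the stdlib-only sketch below extracts every field name from the `.avsc` text and compares it with the keys the Kafka producer writes; it has no Avro dependency (DeltaStreamer itself parses the schema with the Avro library).

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative check: the schema's field names should match the producer's keys.
public class SchemaFieldCheck {
    // Compact copy of source-schema-json.avsc from above
    static final String AVSC =
        "{\"type\":\"record\",\"name\":\"Profiles\",\"fields\":["
      + "{\"name\":\"userid\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"username\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"partition\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    /** Extract every "name" value from the schema text; drop the record name. */
    static Set<String> fieldNames(String avsc) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = Pattern.compile("\"name\"\\s*:\\s*\"(\\w+)\"").matcher(avsc);
        while (m.find()) names.add(m.group(1));
        names.remove("Profiles"); // the record name is not a field
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames(AVSC)); // → [userid, username, age, partition]
    }
}
```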
(2) Copy Hudi's base.properties configuration
cp /opt/apps/hudi-0.12.0/hudi-utilities/src/test/resources/delta-streamer-config/base.properties /opt/apps/hudi-props/
(3) Write the configuration file of kafka source
cp /opt/apps/hudi-0.12.0/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties /opt/apps/hudi-props/
vim /opt/apps/hudi-props/kafka-source.properties
include=hdfs://centos04:9000/hudi-props/base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=userid
hoodie.datasource.write.partitionpath.field=partition
# schema provider configs
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://centos04:9000/hudi-props/source-schema-json.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=hdfs://centos04:9000/hudi-props/target-schema-json.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=hudi_test
#Kafka props
bootstrap.servers=centos01:9092,centos02:9092,centos03:9092
auto.offset.reset=earliest
group.id=test-group
# Upload the configuration files to HDFS
hadoop fs -put /opt/apps/hudi-props/ /
5. Copy the required jar package to Spark
cp /opt/apps/hudi-0.12.0/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.12.0.jar /opt/apps/spark-3.2.2/jars/
You need to put hudi-utilities-bundle_2.12-0.12.0.jar into Spark's jars directory; otherwise, errors will be reported because some classes and methods cannot be found.
6. Run the import command
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/opt/apps/spark-3.2.2/jars/hudi-utilities-bundle_2.12-0.12.0.jar \
--props hdfs://centos04:9000/hudi-props/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field userid \
--target-base-path hdfs://centos04:9000/tmp/hudi/hudi_test \
--target-table hudi_test \
--op BULK_INSERT \
--table-type MERGE_ON_READ
7. View the import results
(1) Start spark-sql (remember to start Hive)
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
(2) Create the Hudi table, specifying its location
use spark_hudi;
create table hudi_test using hudi
location 'hdfs://centos04:9000/tmp/hudi/hudi_test';
(3) Query the hudi table
spark-sql> select * from hudi_test limit 10;
20230306182511817 20230306182511817_0_0 222 45 b7b4efa6-af0a-49b9-a9ac-fdff4139dcf3-85_0-15-13_20230306182511817.parquet 222 name222 18 45
20230306182511817 20230306182511817_0_1 767 45 b7b4efa6-af0a-49b9-a9ac-fdff4139dcf3-85_0-15-13_20230306182511817.parquet 767 name767 18 45
20230306182511817 20230306182511817_1_0 128 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 128 name128 18 45
20230306182511817 20230306182511817_1_1 150 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 150 name150 18 45
20230306182511817 20230306182511817_1_2 154 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 154 name154 18 45
20230306182511817 20230306182511817_1_3 163 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 163 name163 18 45
20230306182511817 20230306182511817_1_4 598 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 598 name598 18 45
20230306182511817 20230306182511817_1_5 853 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 853 name853 18 45
20230306182511817 20230306182511817_1_6 982 45 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet 982 name982 18 45
20230306182511817 20230306182511817_1_0 140 98 19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-78_1-15-14_20230306182511817.parquet 140 name140 18 98
Time taken: 5.119 seconds, Fetched 10 row(s)
4.5 Concurrency Control
4.5.1 Concurrency control supported by Hudi
1) MVCC
For table operations such as compaction, cleaning, and commit, Hudi uses multi-version concurrency control (MVCC) to provide snapshot isolation between writers and queries. Under the MVCC model, Hudi supports any number of concurrent table-service operations and guarantees that no conflicts occur; this is the default model. In MVCC mode, all table services run within the same writer process, which guarantees conflict-free execution and avoids race conditions.
2) OPTIMISTIC CONCURRENCY
Optimistic concurrency control (OCC) allows multiple writers to perform write operations (upsert, insert, etc.) on the same table. Hudi supports file-level optimistic concurrency: as long as concurrent writes do not modify overlapping files, both are allowed to succeed. This feature is experimental and requires ZooKeeper or HiveMetastore to acquire locks.
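The idea behind optimistic concurrency can be sketched generically (this is not Hudi's implementation): each writer records the table state it started from and commits only if that state is unchanged; a losing writer detects the conflict and can retry. Hudi applies the same check at file level, using the ZooKeeper/HiveMetastore lock only around the commit step.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal, generic sketch of optimistic concurrency control.
public class OccSketch {
    private final AtomicLong version = new AtomicLong(0);

    /** Commit succeeds only if no one else committed since we read. */
    boolean tryCommit(long observedVersion) {
        // compareAndSet plays the role of the lock-protected conflict check
        return version.compareAndSet(observedVersion, observedVersion + 1);
    }

    public static void main(String[] args) {
        OccSketch table = new OccSketch();
        long v = table.version.get();        // both writers observe version 0
        boolean first = table.tryCommit(v);  // writer A commits: succeeds
        boolean second = table.tryCommit(v); // writer B used the same base: conflict
        System.out.println(first + " " + second); // → true false
    }
}
```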
4.5.2 Using concurrent write mode
(1) To enable optimistic concurrent writing, the following properties need to be set:
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=<lock-provider-classname>
Hudi's lock provider service can use ZooKeeper, HiveMetastore, or Amazon DynamoDB (choose one).
(2) Related zookeeper parameters
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url
hoodie.write.lock.zookeeper.port
hoodie.write.lock.zookeeper.lock_key
hoodie.write.lock.zookeeper.base_path
(3) Related HiveMetastore parameters; the HiveMetastore URI is read from the Hadoop configuration loaded at runtime
hoodie.write.lock.provider=org.apache.hudi.hive.HiveMetastoreBasedLockProvider
hoodie.write.lock.hivemetastore.database
hoodie.write.lock.hivemetastore.table
4.5.3 Concurrent writing with Spark DataFrame
(1) start spark-shell
spark-shell \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
(2) Write code
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
option("hoodie.cleaner.policy.failed.writes", "LAZY").
option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
option("hoodie.write.lock.zookeeper.url", "centos01,centos02,centos03").
option("hoodie.write.lock.zookeeper.port", "2181").
option("hoodie.write.lock.zookeeper.lock_key", "test_table").
option("hoodie.write.lock.zookeeper.base_path", "/multiwriter_test").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath)
(3) Use the ZooKeeper client to verify that ZooKeeper is being used for locking.
/opt/apps/apache-zookeeper-3.5.7/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
(4) A corresponding directory is created in ZooKeeper; the node under /multiwriter_test is the lock_key specified in the code
[zk: localhost:2181(CONNECTED) 1] ls /multiwriter_test
4.5.4 Concurrent writing using Delta Streamer
Building on the previous DeltaStreamer example, use DeltaStreamer to consume Kafka data and write it to Hudi, this time adding the concurrent-write parameters.
1) Enter the configuration directory, copy and modify the configuration file to add the corresponding parameters, and upload it to HDFS
cd /opt/apps/hudi-props/
cp kafka-source.properties kafka-multiwriter-source.propertis
vim kafka-multiwriter-source.propertis
# Parameters for concurrency control
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=centos01,centos02,centos03
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=test_table2
hoodie.write.lock.zookeeper.base_path=/multiwriter_test2
hadoop fs -put /opt/apps/hudi-props/kafka-multiwriter-source.propertis /hudi-props
2) Run Delta Streamer
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/opt/apps/spark-3.2.2/jars/hudi-utilities-bundle_2.12-0.12.0.jar \
--props hdfs://centos04:9000/hudi-props/kafka-multiwriter-source.propertis \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field userid \
--target-base-path hdfs://centos04:9000/tmp/hudi/hudi_test_multi \
--target-table hudi_test_multi \
--op INSERT \
--table-type MERGE_ON_READ
3) Check whether a new directory was created in ZooKeeper
/opt/apps/apache-zookeeper-3.5.7-bin/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[zk: localhost:2181(CONNECTED) 1] ls /multiwriter_test2
4.6 Hudi tuning
4.6.1 General tuning
# Parallelism
Hudi partitions its input with a default parallelism of 1500 to keep each Spark partition within the 2 GB limit (this limit was removed after Spark 2.4.0); adjust accordingly for larger inputs. It is recommended to set the shuffle parallelism via hoodie.[insert|upsert|bulkinsert].shuffle.parallelism so that it is at least inputdatasize/500MB.
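The inputdatasize/500MB rule amounts to simple ceiling division; a minimal helper (illustrative only, the class and method names are made up):

```java
// Back-of-the-envelope helper: choose a shuffle parallelism of at least
// inputDataSize / 500 MB. The actual setting is
// hoodie.[insert|upsert|bulkinsert].shuffle.parallelism.
public class ShuffleParallelism {
    static long recommended(long inputBytes) {
        long targetBytes = 500L * 1024 * 1024;                // 500 MB per partition
        return Math.max(1, (inputBytes + targetBytes - 1) / targetBytes); // ceiling
    }

    public static void main(String[] args) {
        long input = 750L * 1024 * 1024 * 1024;               // say, 750 GB of input
        System.out.println(recommended(input));               // → 1536
    }
}
```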
# Off-heap memory
Hudi needs a certain amount of off-heap memory to write Parquet files. If you encounter such failures, consider raising spark.yarn.executor.memoryOverhead or spark.yarn.driver.memoryOverhead.
# Spark memory
Hudi usually needs to read a single file entirely into memory to perform a merge or compaction, so the executor memory should be large enough to hold such a file. Hudi also caches input data in order to place data intelligently, so reserving some spark.memory.storageFraction generally helps performance.
# Tuning file size
Set limitFileSize to balance ingest/write latency against the number of files, and against the metadata overhead associated with the file data.
# Time series / log data
For database/NoSQL changelogs with large individual records, tune the default configuration accordingly. Another very popular class of data is time series/event/log data, which tends to be much larger, with many more records per partition. In that case, consider tuning the Bloom filter precision via .bloomFilterFPP()/bloomFilterNumEntries() to speed up target index lookup time, and consider a key prefixed with event time, which enables range pruning and significantly speeds up index lookups.
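The Bloom-filter trade-off follows the standard sizing formulas: for n entries at false-positive probability p, the bit count is m = -n·ln(p)/(ln 2)² and the optimal hash count is k = (m/n)·ln 2. The sketch below (not Hudi code) evaluates them for 60000 entries at FPP 1e-9, values commonly cited as Hudi's bloom index defaults; verify against your version's configuration reference.

```java
// Standard Bloom-filter sizing formulas, for reasoning about
// bloomFilterNumEntries()/bloomFilterFPP() trade-offs.
public class BloomSizing {
    static long bits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int hashes(long n, double p) {
        return (int) Math.max(1, Math.round((double) bits(n, p) / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 60000;          // entries per filter
        double p = 0.000000001;  // target false-positive probability
        System.out.println(bits(n, p) / 8 / 1024 + " KiB, " + hashes(n, p) + " hashes");
    }
}
```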
# GC tuning
Be sure to follow the garbage-collection tuning tips in the Spark tuning guide to avoid OutOfMemory errors. [Required] Use the G1/CMS collector; an example of options to add to spark.executor.extraJavaOptions:
-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
# OutOfMemory errors
If OOM errors occur, try setting spark.memory.fraction=0.2 and spark.memory.storageFraction=0.2 to let data spill instead of OOMing (slower, but preferable to intermittent crashes).
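To see what spark.memory.fraction=0.2 means concretely: under Spark's unified memory model, usable execution+storage memory is (heap - 300 MB reserved) × spark.memory.fraction, of which the spark.memory.storageFraction share is protected from eviction. Illustrative arithmetic for a 6 GB executor heap:

```java
// Sketch of Spark's unified memory arithmetic (not Spark code).
public class SparkMemorySketch {
    static double unifiedMb(double heapMb, double fraction) {
        return (heapMb - 300.0) * fraction; // 300 MB is Spark's reserved memory
    }

    public static void main(String[] args) {
        double heap = 6 * 1024;                 // 6g executor heap
        double unified = unifiedMb(heap, 0.2);  // spark.memory.fraction=0.2
        double storage = unified * 0.2;         // spark.memory.storageFraction=0.2
        System.out.printf("unified=%.0f MB, protected storage=%.0f MB%n",
                unified, storage);
    }
}
```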
4.6.2 Configuration example
spark.driver.extraClassPath /etc/hive/conf
spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
spark.driver.maxResultSize 2g
spark.driver.memory 4g
spark.executor.cores 1
spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
spark.executor.id driver
spark.executor.instances 300
spark.executor.memory 6g
spark.rdd.compress true
spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled true
spark.sql.hive.convertMetastoreParquet false
spark.submit.deployMode cluster
spark.task.cpus 1
spark.task.maxFailures 4
spark.yarn.driver.memoryOverhead 1024
spark.yarn.executor.memoryOverhead 3072
spark.yarn.max.executor.failures 100
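A rough footprint check for the configuration above (illustrative; real YARN containers are also rounded up to the scheduler's minimum allocation increment): each executor container requests spark.executor.memory plus spark.yarn.executor.memoryOverhead, and 300 instances are requested in total.

```java
// Back-of-the-envelope YARN footprint for the example configuration.
public class ClusterFootprint {
    static long containerMb(long executorMb, long overheadMb) {
        return executorMb + overheadMb;
    }

    public static void main(String[] args) {
        long perContainer = containerMb(6 * 1024, 3072); // 6g + 3072m = 9216 MB
        long instances = 300;                            // spark.executor.instances
        System.out.println(perContainer * instances / 1024 + " GiB total"); // → 2700 GiB total
    }
}
```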