Data Lake Architecture with Hudi (4): Detailed Case Study of Hudi and Spark Integration

4. Hudi and Spark integration explained in detail

The Hudi quick-start only briefly touched on integrating Spark with Hudi; this post walks through it in detail.
For environment setup, see: Data Lake Architecture with Hudi (2): Hudi 0.12 source code compilation, Hudi integration with Spark, and using IDEA and Spark to create, read, update, and delete Hudi tables.

4.1 Using spark-shell

# Start the spark-shell

spark-shell \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

4.1.1 Insert data

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "hdfs://192.168.42.104:9000/datas/hudi_warehouse/hudi_trips_cow"
val dataGen = new DataGenerator




# No separate CREATE TABLE is needed. If the table does not exist, the first batch of writes will create it (a COW table by default).
# Insert data: use the official utility class to generate some Trips ride data, load it into a DataFrame, and then write the DataFrame to the Hudi table.
# Mode(Overwrite) will overwrite and recreate the table if it already exists.
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

4.1.2 Query data

# Note: this table has three partition levels (region/country/city). In Hudi versions before 0.9.0, the load path had to append "*" for each partition level, e.g. load(basePath + "/*/*/*/*"); the current version does not need this.
# 1. Convert to a DataFrame
val tripsSnapshotDF = spark.
  read.
  format("hudi").
  load(basePath)
  
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")


# 2. Run a query
scala> spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+------------------+-------------------+-------------------+-------------+
|              fare|          begin_lon|          begin_lat|           ts|
+------------------+-------------------+-------------------+-------------+
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1677600005195|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1677240470730|
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1677696170708|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1677272458691|
|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1677360474147|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1677583109653|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1677602421735|
| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1677721939334|
+------------------+-------------------+-------------------+-------------+

# 3. Query the extra metadata fields that Hudi adds
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|  20230302191855836|16df8361-18cd-461...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|  20230302191855836|d2bb2448-1e1f-45f...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|  20230302191855836|8d1b3b83-e88c-45e...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|  20230302191855836|ce2b0518-1875-48b...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|  20230302191855836|a5b03e52-31c7-4f9...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|  20230302191855836|30263e49-3c95-489...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|  20230302191855836|dd70365d-5345-4d3...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|  20230302191855836|ff01ba9d-92f0-410...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|  20230302191855836|4d4e2563-bc21-4e6...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|  20230302191855836|3c495316-233e-418...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+




# 4. Time travel query
# Hudi has supported time travel queries since 0.9.0. Three timestamp formats are currently supported, as shown below.
spark.read.
  format("hudi").
  option("as.of.instant", "20230302191855836").
  load(basePath).show(10)

spark.read.
  format("hudi").
  option("as.of.instant", "2023-03-02 19:18:55.836").
  load(basePath).show(10)

# Equivalent to "as.of.instant = 2023-03-02 00:00:00"
spark.read.
  format("hudi").
  option("as.of.instant", "2023-03-02").
  load(basePath).show(10)

4.1.3 Update data

# Data before the update
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|  20230302191855836|16df8361-18cd-461...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|  20230302191855836|d2bb2448-1e1f-45f...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|  20230302191855836|8d1b3b83-e88c-45e...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|  20230302191855836|ce2b0518-1875-48b...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|  20230302191855836|a5b03e52-31c7-4f9...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|  20230302191855836|30263e49-3c95-489...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|  20230302191855836|dd70365d-5345-4d3...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|  20230302191855836|ff01ba9d-92f0-410...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|  20230302191855836|4d4e2563-bc21-4e6...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|  20230302191855836|3c495316-233e-418...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+



# Update data
# Similar to inserting new data: use the data generator (note: the same generator object) to generate new records that update the existing ones, load them into a DataFrame, and write the DataFrame to the Hudi table.
val updates = convertToStringList(dataGen.generateUpdates(5))

val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))

# Note: the save mode is now Append. In general, always use Append mode unless you are creating the table for the first time.
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
  
  
# Query again
# 1. Convert to a DataFrame
val tripsSnapshotDF = spark.
  read.
  format("hudi").
  load(basePath)
  
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

# Data after the update
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|  20230302194751288|16df8361-18cd-461...|  americas/united_s...|rider-243|driver-243|14.503019204958845|
|  20230302194751288|d2bb2448-1e1f-45f...|  americas/united_s...|rider-243|driver-243| 51.42305232303094|
|  20230302194751288|8d1b3b83-e88c-45e...|  americas/united_s...|rider-243|driver-243|26.636532270940915|
|  20230302194716880|ce2b0518-1875-48b...|  americas/united_s...|rider-284|driver-284|  90.9053809533154|
|  20230302191855836|a5b03e52-31c7-4f9...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|  20230302194751288|30263e49-3c95-489...|  americas/brazil/s...|rider-243|driver-243| 89.45841313717807|
|  20230302194751288|dd70365d-5345-4d3...|  americas/brazil/s...|rider-243|driver-243|2.4995362119815567|
|  20230302194716880|ff01ba9d-92f0-410...|  americas/brazil/s...|rider-284|driver-284| 29.47661370147079|
|  20230302194751288|4d4e2563-bc21-4e6...|    asia/india/chennai|rider-243|driver-243| 71.08018349571618|
|  20230302194716880|3c495316-233e-418...|    asia/india/chennai|rider-284|driver-284| 9.384124531808036|
+-------------------+--------------------+----------------------+---------+----------+------------------+

4.1.4 Incremental query

Hudi also provides incremental queries, which return the stream of records changed since a given commit timestamp. You need to specify the beginTime of the incremental query and, optionally, an endTime. If you want all changes after a given commit, you do not need to specify endTime (the common case).

# 1. Load the data
spark.
  read.
  format("hudi").
  load(basePath).
  createOrReplaceTempView("hudi_trips_snapshot")
  

# 2. Determine the beginTime to use
scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)

commits: Array[String] = Array(20230302210112648, 20230302210408496)  

scala> val beginTime = commits(commits.length - 2) 
beginTime: String = 20230302210112648

# 3. Create a temp view for the incremental query
val tripsIncrementalDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  load(basePath)
  
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")


# 4. Query the incremental view
scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare < 20.0").show()
+-------------------+-----------------+-------------------+------------------+-------------+
|_hoodie_commit_time|             fare|          begin_lon|         begin_lat|           ts|
+-------------------+-----------------+-------------------+------------------+-------------+
|  20230302210408496|60.34474295461695|0.03363698727131392|0.9886806054385373|1677343847695|
|  20230302210408496| 57.4289850003576| 0.9692506010574379|0.9566270007622102|1677699656426|
+-------------------+-----------------+-------------------+------------------+-------------+

4.1.5 Point-in-time query

# To query data as of a specific point in time, point endTime at that instant and set beginTime to "000" (the earliest commit time).
# 1) Specify beginTime and endTime
val beginTime = "000" 
val endTime = commits(commits.length - 2) 

# 2) Create a temp view for the given time range
val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")

# 3) Query
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
+-------------------+-----------------+------------------+-------------------+-------------+
|_hoodie_commit_time|             fare|         begin_lon|          begin_lat|           ts|
+-------------------+-----------------+------------------+-------------------+-------------+
|  20230302210112648|75.67233311397607|0.7433519787065044|0.23986563259065297|1677257554148|
|  20230302210112648|72.88363497900701|0.6482943149906912|  0.682825302671212|1677446496876|
|  20230302210112648|41.57780462795554|0.5609292266131617| 0.6718059599888331|1677230346940|
|  20230302210112648|69.36363684236434| 0.621688297381891|0.13625652434397972|1677277488735|
|  20230302210112648|43.51073292791451|0.3953934768927382|0.39178349695388426|1677567017799|
|  20230302210112648|62.79408654844148|0.8414360533180016| 0.9115819084017496|1677314954780|
|  20230302210112648|66.06966684558341|0.7598920002419857| 0.1591418101835923|1677428809403|
|  20230302210112648|63.30100459693087|0.4878809010360382| 0.6331319396951335|1677336164167|
+-------------------+-----------------+------------------+-------------------+-------------+

4.1.6 Delete data

Records are deleted according to the HoodieKeys passed in (uuid + partitionpath). Deletes are only supported in Append mode.

# 1) Get the total row count
scala> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
res50: Long = 10

# 2) Take 2 of the rows to delete
val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)

# 3) Build a DataFrame from the 2 rows to be deleted
val deletes = dataGen.generateDeletes(ds.collectAsList())
val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2))

# 4) Execute the delete
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(OPERATION_OPT_KEY,"delete").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
  
  
# 5) Count the rows after the delete to verify it succeeded
val roAfterDeleteViewDF = spark.
  read.
  format("hudi").
  load(basePath)

roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")

// The total row count should be 2 less than before
scala> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
res53: Long = 8

4.1.7 Overwrite data

For a table or partition where most of the records change in every cycle, doing an upsert or merge is inefficient. We want something like Hive's "insert overwrite" operation, which ignores the existing data and simply creates a commit with the newly provided data.

It can also be used for certain operational tasks, such as repairing a specific problem partition: we can "insert overwrite" the partition with the records from a source file. For some data sources this is much faster than restore-and-replay.

Insert overwrite can be faster than upsert for batch ETL jobs that recompute the entire target partition in each batch, because it can skip the indexing, precombining, and other repartitioning steps of the upsert write path.

# 1) View the current keys in the table
scala> spark.
     |   read.format("hudi").
     |   load(basePath).
     |   select("uuid","partitionpath").
     |   sort("partitionpath","uuid").
     |   show(100, false)
+------------------------------------+------------------------------------+
|uuid                                |partitionpath                       |
+------------------------------------+------------------------------------+
|0a47c845-fb42-4187-af27-a85e6229a3c3|americas/brazil/sao_paulo           |
|6f82914d-f7a0-4972-8691-d1404ed7cae3|americas/brazil/sao_paulo           |
|e2d4fa5b-da34-4603-85c3-d2ad884ac090|americas/brazil/sao_paulo           |
|26e8db50-755c-44e7-9200-988a78c1e5de|americas/united_states/san_francisco|
|5afb905d-7ed2-46f5-bba8-5e2fb8ac88da|americas/united_states/san_francisco|
|2947db75-fa72-43d5-993c-4530b9890c73|asia/india/chennai                  |
|74f3ec44-62fa-435f-b06c-4cb9e0f4defa|asia/india/chennai                  |
|f22b8c1c-7b57-4c5f-8bce-7ce6783047b0|asia/india/chennai                  |
+------------------------------------+------------------------------------+
  
  
# 2) Generate some new trip data
val inserts = convertToStringList(dataGen.generateInserts(2))
val df = spark.
  read.json(spark.sparkContext.parallelize(inserts, 2)).
  filter("partitionpath = 'americas/united_states/san_francisco'")
  
# 3) Overwrite the specified partition
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(OPERATION.key(),"insert_overwrite").
  option(PRECOMBINE_FIELD.key(), "ts").
  option(RECORDKEY_FIELD.key(), "uuid").
  option(PARTITIONPATH_FIELD.key(), "partitionpath").
  option(TBL_NAME.key(), tableName).
  mode(Append).
  save(basePath)
  
# 4) Query the keys after the overwrite; they have changed
spark.
  read.format("hudi").
  load(basePath).
  select("uuid","partitionpath").
  sort("partitionpath","uuid").
  show(100, false)
  
+------------------------------------+------------------------------------+
|uuid                                |partitionpath                       |
+------------------------------------+------------------------------------+
|0a47c845-fb42-4187-af27-a85e6229a3c3|americas/brazil/sao_paulo           |
|6f82914d-f7a0-4972-8691-d1404ed7cae3|americas/brazil/sao_paulo           |
|e2d4fa5b-da34-4603-85c3-d2ad884ac090|americas/brazil/sao_paulo           |
|ea2fe685-ad87-4bba-b688-4436f729e005|americas/united_states/san_francisco|
|2947db75-fa72-43d5-993c-4530b9890c73|asia/india/chennai                  |
|74f3ec44-62fa-435f-b06c-4cb9e0f4defa|asia/india/chennai                  |
|f22b8c1c-7b57-4c5f-8bce-7ce6783047b0|asia/india/chennai                  |
+------------------------------------+------------------------------------+  

4.2 Using spark-sql

4.2.1 Installing Hive 3.1.2

Hive 3.1.2 download: http://archive.apache.org/dist/hive/hive-3.1.2/

1. After downloading, upload it to /opt/apps

2. Unzip

tar -zxvf apache-hive-3.1.2-bin.tar.gz 

3. Rename it

mv apache-hive-3.1.2-bin hive-3.1.2 

4. Execute the following commands to rename the default configuration template

 cd /opt/apps/hive-3.1.2/conf 
 mv hive-default.xml.template hive-default.xml

5. Execute the following command to create a new hive-site.xml configuration file

vim hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://centos04:3306/hive?useSSL=false</value>
    </property>

    <!-- JDBC driver class -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <!-- JDBC username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <!-- JDBC password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>

    <!-- Hive's default warehouse directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <!-- Disable metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <!-- Metastore event DB notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>

    <!-- Metastore URI to connect to -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://centos04:9083</value>
    </property>

    <!-- Host that HiveServer2 binds to -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>centos04</value>
    </property>

    <!-- Port that HiveServer2 listens on -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>

</configuration>

6. Configure hadoop

Add the following to Hadoop's core-site.xml, then restart Hadoop:

<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>root</value>
    <description>Allow the superuser root to impersonate members of this group</description>
</property>

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
    <description>Hosts from which the superuser root may connect to impersonate users (* = any host)</description>
</property>

7. Hive's guava.jar is inconsistent with the version shipped in Hadoop

 # hadoop 3.1.3 ships guava version 27, while hive 3.1.2 ships version 19

 # Since the two differ, delete the lower version and copy the higher version over.

 rm -rf /opt/apps/hive-3.1.2/lib/guava-19.0.jar

 cp /opt/apps/hadoop-3.1.3/share/hadoop/common/lib/guava-27.0-jre.jar  /opt/apps/hive-3.1.2/lib

8. Configure the Hive metastore database in MySQL

1. First download the mysql jdbc package

2. Copy it to the hive/lib directory.

3. Start and log in to mysql

4. Grant all privileges on all tables to the root user, using 123456 as the connection password configured in hive-site.xml, and then flush the privilege tables

mysql> create database hive; 
mysql> CREATE USER  'root'@'%'  IDENTIFIED BY '123456';

mysql> GRANT ALL PRIVILEGES ON  *.* TO 'root'@'%' WITH GRANT OPTION;

mysql> flush privileges;


-- Initialize the Hive metastore schema
[root@centos04 conf]# schematool -initSchema -dbType mysql -verbose

9. Start Hive's Metastore

# Configure environment variables
export HIVE_HOME=/opt/apps/hive-3.1.2

# Start the Hive metastore service

[root@centos04 conf]# nohup hive --service metastore & 

[root@centos04 conf]# netstat -nltp | grep 9083
tcp6       0      0 :::9083                 :::*                    LISTEN      10282/java   

10. Start Hive

# Start the Hadoop cluster first
start-dfs.sh

# After starting the Hadoop cluster, wait for HDFS to leave safe mode before starting Hive.
[root@centos04 conf]# hive



# Start HiveServer2 for remote connections
[root@centos04 ~]# hiveserver2  &
[root@centos04 ~]# netstat -nltp | grep  10000
tcp6       0      0 :::10000                :::*                    LISTEN      10589/java          
[root@centos04 ~]# netstat -nltp | grep  10002
tcp6       0      0 :::10002                :::*                    LISTEN      10589/java  

beeline
!connect jdbc:hive2://centos04:10000
Enter username: root
Enter password: just press Enter

4.2.2 Create hudi table using spark-sql

# Start the spark-sql CLI
spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

Note: if the Hive environment variables are not configured, manually copy hive-site.xml into Spark's conf directory.

The following table properties are supported (parameter, default, description):

- primaryKey (default: uuid): the record key name of the table; separate multiple fields with commas. Same as hoodie.datasource.write.recordkey.field.
- preCombineField (no default): the pre-combine field of the table. Same as hoodie.datasource.write.precombine.field.
- type (default: cow): the type of table to create, type = 'cow' or type = 'mor'. Same as hoodie.datasource.write.table.type.

4.2.2.1 Create a non-partitioned table

use hudi_spark;

-- Create a COW table; primaryKey defaults to 'uuid', and no preCombineField is provided
create table hudi_cow_nonpcf_tbl (
  uuid int,
  name string,
  price double
) using hudi;

-- By default the table is created on the local filesystem: /root/spark-warehouse/hudi_spark.db/hudi_cow_nonpcf_tbl


-- Create a non-partitioned MOR table
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);

4.2.2.2 Create a partitioned table

-- Create a partitioned COW external table, specifying primaryKey and preCombineField
create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location 'hdfs://192.168.42.104:9000/datas/hudi_warehouse/spark_sql/hudi_cow_pt_tbl';

4.2.2.3 Create a table over an existing Hudi table

-- You don't need to specify the schema or any properties except the partition columns (if they exist); Hudi can automatically recognize the schema and configuration.

-- Non-partitioned table (created from an existing local path)
create table hudi_existing_tbl0 using hudi
location 'file:///root/spark-warehouse/hudi_spark.db/hudi_cow_nonpcf_tbl';


-- Partitioned table (created from an existing HDFS path; if there is no data yet, this fails with:
-- "It is not allowed to specify partition columns when the table schema is not defined")
create table hudi_existing_tbl1 using hudi
partitioned by (dt, hh)
location 'hdfs://192.168.42.104:9000/datas/hudi_warehouse/spark_sql/hudi_cow_pt_tbl';

4.2.2.4 Create a table through CTAS (Create Table As Select)

-- To improve the performance of loading data into Hudi tables, CTAS uses bulk insert as the write operation.
-- (1) Create a non-partitioned COW table via CTAS, without specifying preCombineField
create table hudi_ctas_cow_nonpcf_tbl
using hudi
tblproperties (primaryKey = 'id')
as
select 1 as id, 'a1' as name, 10 as price;


-- (2) Create a partitioned COW table via CTAS, specifying preCombineField
create table hudi_ctas_cow_pt_tbl
using hudi
tblproperties (type = 'cow', primaryKey = 'id', preCombineField = 'ts')
partitioned by (dt)
as
select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-12-01' as dt;

-- (3) Load data from another table via CTAS
-- Create a managed parquet table
create table parquet_mngd using parquet location 'file:///tmp/parquet_dataset/*.parquet';

-- Load the data via CTAS
create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (datestr) as select * from parquet_mngd;

4.2.3 Insert data

By default, if preCombineKey is provided, insert into uses upsert as the write operation type; otherwise it uses insert.

-- 1) Insert data into non-partitioned tables

insert into hudi_cow_nonpcf_tbl select 1, 'a1', 20;
insert into hudi_mor_tbl select 1, 'a1', 20, 1000;

-- 2) Insert into a partitioned table using dynamic partitioning
insert into hudi_cow_pt_tbl partition (dt, hh)
select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt, '10' as hh;

-- 3) Insert into a partitioned table using static partitioning
insert into hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='11') select 2, 'a2', 1000;

-- 4) Insert data using bulk_insert
-- Hudi supports bulk_insert as the write operation type; it only requires setting two configs:
-- hoodie.sql.bulk.insert.enable and hoodie.sql.insert.mode.

-- Inserting into a table that has a preCombineKey results in an upsert write

insert into hudi_mor_tbl select 1, 'a1_1', 20, 1001;
select id, name, price, ts from hudi_mor_tbl;
1  a1_1   20.0   1001

 

-- Insert into the same table with the write operation set to bulk_insert (existing data is not updated this time)
set hoodie.sql.bulk.insert.enable=true;
set hoodie.sql.insert.mode=non-strict;

insert into hudi_mor_tbl select 1, 'a1_2', 20, 1002;
select id, name, price, ts from hudi_mor_tbl;

1  a1_1   20.0   1001
1  a1_2   20.0   1002

4.2.4 Query data

-- 1) Query
select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0

-- 2) Time travel query
-- Hudi has supported time travel queries since 0.9.0. The Spark SQL syntax requires Spark 3.2 or later.

create table hudi_cow_pt_tbl1 (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl1';


-- Insert a record with id 1
insert into hudi_cow_pt_tbl1 select 1, 'a0', 1000, '2021-12-09', '10';
select * from hudi_cow_pt_tbl1;

-- Update the record with id 1
insert into hudi_cow_pt_tbl1 select 1, 'a1', 1001, '2021-12-09', '10';
select * from hudi_cow_pt_tbl1;

-- Time travel based on the first commit time
select * from hudi_cow_pt_tbl1 timestamp as of '20230303013452312' where id = 1;

-- Time travel using the other supported timestamp formats
select * from hudi_cow_pt_tbl1 timestamp as of '2023-03-03 01:34:52.312' where id = 1;

select * from hudi_cow_pt_tbl1 timestamp as of '2023-03-03' where id = 1;

4.2.5 Update data

-- 1) UPDATE
-- An update requires preCombineField to be specified.
-- Syntax:
UPDATE tableIdentifier SET column = EXPRESSION (, column = EXPRESSION) [ WHERE boolExpression ]
-- Examples:
update hudi_mor_tbl set price = price * 2, ts = 1111 where id = 1;

update hudi_cow_pt_tbl1 set name = 'a1_1', ts = 1001 where id = 1;

-- update using non-PK field
update hudi_cow_pt_tbl1 set ts = 1111 where name = 'a1_1';




-- 2) MERGE INTO
-- Syntax:
MERGE INTO tableIdentifier AS target_alias
USING (sub_query | tableIdentifier) AS source_alias
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]

<merge_condition> = an equality boolean condition
<matched_action>  =
  DELETE  |
  UPDATE SET *  |
  UPDATE SET column1 = expression1 [, column2 = expression2 ...]
<not_matched_action>  =
  INSERT *  |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

-- Examples:
-- 1. Prepare a source table: a non-partitioned Hudi table, and insert data
create table merge_source (id int, name string, price double, ts bigint) using hudi
tblproperties (primaryKey = 'id', preCombineField = 'ts');
insert into merge_source values (1, "old_a1", 22.22, 2900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000);

merge into hudi_mor_tbl as target
using merge_source as source
on target.id = source.id
when matched then update set *
when not matched then insert *
;


-- 2. Prepare a source table: a partitioned parquet table, and insert data
create table merge_source2 (id int, name string, flag string, dt string, hh string) using parquet;
insert into merge_source2 values (1, "new_a1", 'update', '2021-12-09', '10'), (2, "new_a2", 'delete', '2021-12-09', '11'), (3, "new_a3", 'insert', '2021-12-09', '12');

merge into hudi_cow_pt_tbl1 as target
using (
  select id, name, '2000' as ts, flag, dt, hh from merge_source2
) source
on target.id = source.id
when matched and flag != 'delete' then
 update set id = source.id, name = source.name, ts = source.ts, dt = source.dt, hh = source.hh
when matched and flag = 'delete' then delete
when not matched then
 insert (id, name, ts, dt, hh) values(source.id, source.name, source.ts, source.dt, source.hh)
;

4.2.6 Delete data

-- Delete data
-- 1) Syntax
DELETE FROM tableIdentifier [ WHERE boolExpression ]
-- 2) Examples
delete from hudi_cow_nonpcf_tbl where uuid = 1;

delete from hudi_mor_tbl where id % 2 = 0;

-- Delete using a non-primary-key field
delete from hudi_cow_pt_tbl1 where name = 'a1_1';

4.2.7 Overwrite data

Use the INSERT_OVERWRITE write operation to overwrite a partitioned table.
Use the INSERT_OVERWRITE_TABLE write operation to overwrite a non-partitioned table, or a partitioned table with dynamic partitioning.

-- 1) insert overwrite a non-partitioned table
insert overwrite hudi_mor_tbl select 99, 'a99', 20.0, 900;
insert overwrite hudi_cow_nonpcf_tbl select 99, 'a99', 20.0;

-- 2) insert overwrite table into a partitioned table using dynamic partitioning
insert overwrite table hudi_cow_pt_tbl1 select 10, 'a10', 1100, '2021-12-09', '11';


-- 3) insert overwrite a partitioned table using static partitioning
insert overwrite hudi_cow_pt_tbl1 partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100;

4.2.8 Alter table structure and partitions

-- Alter table structure (ALTER TABLE)
-- 1) Syntax
-- Alter table name
ALTER TABLE oldTableName RENAME TO newTableName
-- Alter table add columns
ALTER TABLE tableIdentifier ADD COLUMNS(colAndType (,colAndType)*)
-- Alter table column type
ALTER TABLE tableIdentifier CHANGE COLUMN colName colName colType
-- Alter table properties
ALTER TABLE tableIdentifier SET TBLPROPERTIES (key = 'value')
-- 2) Examples
--rename to:
ALTER TABLE hudi_cow_nonpcf_tbl RENAME TO hudi_cow_nonpcf_tbl2;
--add column:
ALTER TABLE hudi_cow_nonpcf_tbl2 add columns(remark string);
--change column:
ALTER TABLE hudi_cow_nonpcf_tbl2 change column uuid uuid int;
--set properties;
alter table hudi_cow_nonpcf_tbl2 set tblproperties (hoodie.keep.max.commits = '10');





-- Alter partitions
-- 1) Syntax
-- Drop Partition
ALTER TABLE tableIdentifier DROP PARTITION ( partition_col_name = partition_col_val [ , ... ] )
-- Show Partitions
SHOW PARTITIONS tableIdentifier
-- 2) Examples
--show partition:
show partitions hudi_cow_pt_tbl1;

--drop partition:
alter table hudi_cow_pt_tbl1 drop partition (dt='2021-12-09', hh='10');
Note: the result of show partitions is based on the table path in the file system; it is not precise after deleting all of a partition's data or dropping a partition directory directly.

4.3 Using IDEA

You can refer to: https://blog.csdn.net/qq_44665283/article/details/129271737?spm=1001.2014.3001.5501
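
For completeness, below is a minimal sketch of what the IDEA-based approach looks like: a standalone Spark application that performs the same insert as the spark-shell example in 4.1.1. The object name, local master setting, and basePath are illustrative assumptions; you will also need the Hudi Spark bundle (e.g. hudi-spark3.2-bundle_2.12) and the Spark dependencies on the classpath.

import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.collection.JavaConverters._

object HudiIdeaDemo {
  def main(args: Array[String]): Unit = {
    // Same configuration as the spark-shell examples, set programmatically
    val spark = SparkSession.builder()
      .appName("hudi-idea-demo")
      .master("local[2]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    val tableName = "hudi_trips_cow"
    val basePath  = "file:///tmp/hudi_trips_cow_idea"   // illustrative local path

    // Generate sample trip data, as in section 4.1.1
    val dataGen = new DataGenerator
    val inserts = convertToStringList(dataGen.generateInserts(10)).asScala
    val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

    // Write to the Hudi table (the table is created on the first write)
    df.write.format("hudi")
      .options(getQuickstartWriteConfigs)
      .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
      .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
      .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
      .option(TABLE_NAME, tableName)
      .mode(SaveMode.Overwrite)
      .save(basePath)

    // Read it back to verify
    spark.read.format("hudi").load(basePath).show(false)
    spark.stop()
  }
}

Run it directly from IDEA with the local master, or remove .master(...) and submit it with spark-submit.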

4.4 Using the DeltaStreamer import tool (Apache Kafka to Hudi table case)

The HoodieDeltaStreamer tool (part of hudi-utilities-bundle) provides a way to ingest from different sources such as DFS or Kafka, and has the following functions:

- Exactly-once ingestion of new data from Kafka, and incremental imports from the output of Sqoop, HiveIncrementalPuller, or files under a DFS folder.

- Ingested data can be in JSON, Avro, or a custom record type.

- Manages checkpoints, rollback, and recovery.

- Uses Avro schemas stored on DFS or in a Confluent schema registry.

- Supports custom transformation operations.

The official website is as follows: https://hudi.apache.org/cn/docs/0.12.2/hoodie_deltastreamer/

The example on the official website is based on Confluent Kafka; this case uses Apache Kafka.

1. Start zk and kafka

2. Create a test topic

/opt/apps/kafka_2.12-2.6.2/bin/kafka-topics.sh --bootstrap-server centos01:9092 --create --topic hudi_test

3. Prepare a Kafka producer program

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>hudi-start</artifactId>
        <groupId>com.yyds</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>hudi-kafka</artifactId>

    <dependencies>
        <!-- Kafka client -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.4.1</version>
        </dependency>

        <!-- fastjson <= 1.2.80 has a known security vulnerability -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.83</version>
        </dependency>
    </dependencies>

</project>
package com.yyds;

import com.alibaba.fastjson.JSONObject;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.Random;

public class HudiKafkaProducer {

    public static void main(String[] args) {

        // Producer configuration
        Properties props = new Properties();
        props.put("bootstrap.servers", "centos01:9092,centos02:9092,centos03:9092");
        props.put("acks", "-1");
        props.put("batch.size", "1048576");
        props.put("linger.ms", "5");
        props.put("compression.type", "snappy");
        props.put("buffer.memory", "33554432");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

        // Send 1000 JSON records to the hudi_test topic
        Random random = new Random();
        for (int i = 0; i < 1000; i++) {
            JSONObject model = new JSONObject();
            model.put("userid", i);
            model.put("username", "name" + i);
            model.put("age", 18);
            model.put("partition", random.nextInt(100));
            producer.send(new ProducerRecord<String, String>("hudi_test", model.toJSONString()));
        }
        producer.flush();
        producer.close();
    }
}

4. Prepare the configuration file of the DeltaStreamer tool

(1) Define the required Avro schema files (for both source and target)

mkdir /opt/apps/hudi-props/
vim /opt/apps/hudi-props/source-schema-json.avsc
# The Kafka source fields are configured as follows
{
  "type": "record",
  "name": "Profiles",
  "fields": [
    {
      "name": "userid",
      "type": [ "null", "string" ],
      "default": null
    },
    {
      "name": "username",
      "type": [ "null", "string" ],
      "default": null
    },
    {
      "name": "age",
      "type": [ "null", "string" ],
      "default": null
    },
    {
      "name": "partition",
      "type": [ "null", "string" ],
      "default": null
    }
  ]
}

# The target Hudi table schema uses the same fields
cp source-schema-json.avsc target-schema-json.avsc

(2) Copy Hudi's base.properties configuration file

cp /opt/apps/hudi-0.12.0/hudi-utilities/src/test/resources/delta-streamer-config/base.properties /opt/apps/hudi-props/ 

(3) Write the Kafka source configuration file

cp /opt/apps/hudi-0.12.0/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties /opt/apps/hudi-props/
vim /opt/apps/hudi-props/kafka-source.properties 

include=hdfs://centos04:9000/hudi-props/base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=userid  
hoodie.datasource.write.partitionpath.field=partition
# schema provider configs
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://centos04:9000/hudi-props/source-schema-json.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=hdfs://centos04:9000/hudi-props/target-schema-json.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=hudi_test
#Kafka props
bootstrap.servers=centos01:9092,centos02:9092,centos03:9092
auto.offset.reset=earliest
group.id=test-group



# Upload the configuration files to HDFS
hadoop fs -put /opt/apps/hudi-props/ /

5. Copy the required jar package to Spark

cp /opt/apps/hudi-0.12.0/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.12.0.jar /opt/apps/spark-3.2.2/jars/

You need to put hudi-utilities-bundle_2.12-0.12.0.jar into Spark's jars directory; otherwise an error will be reported because some classes and methods cannot be found.

6. Run the import command

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
/opt/apps/spark-3.2.2/jars/hudi-utilities-bundle_2.12-0.12.0.jar \
--props hdfs://centos04:9000/hudi-props/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider  \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource  \
--source-ordering-field userid \
--target-base-path hdfs://centos04:9000/tmp/hudi/hudi_test  \
--target-table hudi_test \
--op BULK_INSERT \
--table-type MERGE_ON_READ

7. View the import results

(1) Start spark-sql (remember to start Hive)

spark-sql \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
 --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
 --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

(2) Specify location to create hudi table

use spark_hudi;

 

create table hudi_test using hudi

location 'hdfs://centos04:9000/tmp/hudi/hudi_test';

(3) Query the hudi table

spark-sql> select * from hudi_test limit 10;
20230306182511817       20230306182511817_0_0   222     45      b7b4efa6-af0a-49b9-a9ac-fdff4139dcf3-85_0-15-13_20230306182511817.parquet       222   name222  18      45
20230306182511817       20230306182511817_0_1   767     45      b7b4efa6-af0a-49b9-a9ac-fdff4139dcf3-85_0-15-13_20230306182511817.parquet       767   name767  18      45
20230306182511817       20230306182511817_1_0   128     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       128   name128  18      45
20230306182511817       20230306182511817_1_1   150     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       150   name150  18      45
20230306182511817       20230306182511817_1_2   154     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       154   name154  18      45
20230306182511817       20230306182511817_1_3   163     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       163   name163  18      45
20230306182511817       20230306182511817_1_4   598     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       598   name598  18      45
20230306182511817       20230306182511817_1_5   853     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       853   name853  18      45
20230306182511817       20230306182511817_1_6   982     45      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-76_1-15-14_20230306182511817.parquet       982   name982  18      45
20230306182511817       20230306182511817_1_0   140     98      19eb5a0a-aa85-492d-bfb7-c3ccd620d0ca-78_1-15-14_20230306182511817.parquet       140   name140  18      98
Time taken: 5.119 seconds, Fetched 10 row(s)

4.5 Concurrency Control

4.5.1 Concurrency control supported by Hudi

1) MVCC

For table operations such as compaction, cleaning, and commits, Hudi uses multi-version concurrency control to provide snapshot isolation between multiple concurrent table-service writes and queries. With the MVCC model, Hudi supports any number of concurrent operations and guarantees that no conflicts occur; this is Hudi's default model. In MVCC mode, all table services run from the same writer, which guarantees there are no conflicts and avoids race conditions.


2) OPTIMISTIC CONCURRENCY

Write operations (upsert, insert, etc.) from multiple writers use optimistic concurrency control. Hudi supports file-level optimistic concurrency: as long as the concurrent writes do not change overlapping files, both writes are allowed to succeed. This feature is experimental and requires ZooKeeper or the Hive Metastore to acquire locks.


4.5.2 Using concurrent write mode

(1) To enable optimistic concurrent writes, the following properties need to be set:

hoodie.write.concurrency.mode=optimistic_concurrency_control

hoodie.cleaner.policy.failed.writes=LAZY

hoodie.write.lock.provider=<lock-provider-classname>

Hudi's lock provider service can use ZooKeeper, the Hive Metastore, or Amazon DynamoDB (choose one).

(2) Related zookeeper parameters

hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider

hoodie.write.lock.zookeeper.url
hoodie.write.lock.zookeeper.port
hoodie.write.lock.zookeeper.lock_key
hoodie.write.lock.zookeeper.base_path

(3) Hive Metastore-related parameters; the metastore URI is picked up from the Hadoop configuration files loaded at runtime

hoodie.write.lock.provider=org.apache.hudi.hive.HiveMetastoreBasedLockProvider
hoodie.write.lock.hivemetastore.database
hoodie.write.lock.hivemetastore.table
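
For comparison with the ZooKeeper-based example in 4.5.3 below, a Hive Metastore-locked write might look like the following sketch, reusing a DataFrame df, tableName, and basePath prepared as in section 4.1.1. The lock database and table names ("default", "hudi_trips_cow") are illustrative assumptions, not values from the original post.

// Minimal sketch: optimistic concurrency control with the Hive Metastore lock provider.
// The lock database/table names ("default", "hudi_trips_cow") below are assumed placeholders.
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider", "org.apache.hudi.hive.HiveMetastoreBasedLockProvider").
  option("hoodie.write.lock.hivemetastore.database", "default").
  option("hoodie.write.lock.hivemetastore.table", "hudi_trips_cow").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)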

4.5.3 Concurrent writing with Spark DataFrame

(1) Start spark-shell

spark-shell \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

(2) Write code

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._


val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

 
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

df.write.format("hudi").
 options(getQuickstartWriteConfigs).
 option(PRECOMBINE_FIELD_OPT_KEY, "ts").
 option(RECORDKEY_FIELD_OPT_KEY, "uuid").
 option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
 option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
 option("hoodie.cleaner.policy.failed.writes", "LAZY").
 option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
 option("hoodie.write.lock.zookeeper.url", "centos01,centos02,centos03").
 option("hoodie.write.lock.zookeeper.port", "2181").
 option("hoodie.write.lock.zookeeper.lock_key", "test_table").
 option("hoodie.write.lock.zookeeper.base_path", "/multiwriter_test").
 option(TABLE_NAME, tableName).
 mode(Append).
 save(basePath)

(3) Use the zk client to verify whether zk is used.

/opt/apps/apache-zookeeper-3.5.7/bin/zkCli.sh 
[zk: localhost:2181(CONNECTED) 0] ls /

(4) The corresponding node is created in ZooKeeper; the node under /multiwriter_test corresponds to the lock_key specified in the code

[zk: localhost:2181(CONNECTED) 1] ls /multiwriter_test

4.5.4 Concurrent writing using Delta Streamer

Building on the previous DeltaStreamer example, use DeltaStreamer to consume Kafka data and write it to Hudi, this time adding the concurrent-write parameters.

1) Go to the configuration file directory, copy and modify the configuration file to add the corresponding parameters, and upload it to HDFS

cd /opt/apps/hudi-props/

cp kafka-source.properties kafka-multiwriter-source.propertis
vim kafka-multiwriter-source.propertis 

 
# Add the concurrency control parameters
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=centos01,centos02,centos03
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=test_table2
hoodie.write.lock.zookeeper.base_path=/multiwriter_test2


hadoop fs -put /opt/apps/hudi-props/kafka-multiwriter-source.propertis /hudi-props

2) Run Delta Streamer

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
/opt/apps/spark-3.2.2/jars/hudi-utilities-bundle_2.12-0.12.0.jar \
--props hdfs://centos04:9000/hudi-props/kafka-multiwriter-source.propertis \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider  \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource  \
--source-ordering-field userid \
--target-base-path hdfs://centos04:9000/tmp/hudi/hudi_test_multi  \
--target-table hudi_test_multi \
--op INSERT \
--table-type MERGE_ON_READ

3) Check whether the new node was created in ZooKeeper

/opt/apps/apache-zookeeper-3.5.7-bin/bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 0] ls /
[zk: localhost:2181(CONNECTED) 1] ls /multiwriter_test2

4.6 Hudi tuning

4.6.1 General tuning

# Parallelism
Hudi over-partitions its input with a default parallelism of 1500 to keep each Spark partition within the 2GB limit (the limit was removed after Spark 2.4.0); adjust this for larger inputs. It is recommended to set the shuffle parallelism, via hoodie.[insert|upsert|bulkinsert].shuffle.parallelism, to at least inputdatasize/500MB.


# Off-heap memory
Writing parquet files requires a certain amount of off-heap memory. If you run into such failures, consider setting something like spark.yarn.executor.memoryOverhead or spark.yarn.driver.memoryOverhead.


# Spark memory
Hudi usually needs to be able to read a single file into memory to perform merges or compaction, so the executor memory should be large enough to hold such a file. Hudi also caches input data in order to place data intelligently, so reserving some spark.memory.storageFraction generally helps performance.


# Adjusting file size
Set limitFileSize to balance ingest/write latency against the number of files, and to balance the metadata overhead associated with the file data.


# Time series / log data
For database/NoSQL changelogs with large individual records, adjust the default configuration accordingly. Another very popular class of data is time series / event / log data, which tends to be much larger, with more records per partition. In this case, consider tuning the Bloom filter accuracy via .bloomFilterFPP()/bloomFilterNumEntries() to speed up target index lookups, and consider a key prefixed with event time, which enables range pruning and significantly speeds up index lookups.


# GC tuning
Make sure to follow the garbage collection tuning tips from the Spark tuning guide to avoid OutOfMemory errors. [Required] Use the G1/CMS collector; an example of options to add to spark.executor.extraJavaOptions:
-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof


# OutOfMemory errors
If OOM errors occur, try spark.memory.fraction=0.2 and spark.memory.storageFraction=0.2, which allows spilling instead of OOMing (slower, but better than intermittent crashes).
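
As a concrete illustration, the sketch below applies a few of these knobs to the DataFrame write from section 4.1.1. The values (parallelism of 200, 128MB target file size) are placeholders rather than recommendations, and hoodie.parquet.max.file.size is assumed here to be the property behind the limitFileSize setting mentioned above.

// Minimal tuning sketch, reusing df / tableName / basePath from section 4.1.1.
// The numeric values are illustrative placeholders, not tuned recommendations.
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.insert.shuffle.parallelism", "200").
  option("hoodie.upsert.shuffle.parallelism", "200").
  option("hoodie.bulkinsert.shuffle.parallelism", "200").
  option("hoodie.parquet.max.file.size", (128 * 1024 * 1024).toString).
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

The memory-related settings (spark.yarn.executor.memoryOverhead, spark.memory.storageFraction, the GC options) are instead passed to spark-submit/spark-shell, as in the configuration example in 4.6.2 below.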

4.6.2 Configuration example

spark.driver.extraClassPath /etc/hive/conf
spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
spark.driver.maxResultSize 2g
spark.driver.memory 4g
spark.executor.cores 1
spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
spark.executor.id driver
spark.executor.instances 300
spark.executor.memory 6g
spark.rdd.compress true

spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled true
spark.sql.hive.convertMetastoreParquet false
spark.submit.deployMode cluster
spark.task.cpus 1
spark.task.maxFailures 4

spark.yarn.driver.memoryOverhead 1024
spark.yarn.executor.memoryOverhead 3072
spark.yarn.max.executor.failures 100
