Apache Hudi core function point analysis


The code excerpts in this article correspond to version 0.14.0.

Development Background

The initial requirement came from Uber, which has many record-level update scenarios. One of Hudi's main use cases inside Uber is matching passengers who place taxi orders with drivers who accept them. Passengers and drivers are two separate data streams; with Hudi's upsert capability and incremental read function, the two streams can be joined at minute-level latency to produce passenger-driver matching data.
To improve the timeliness of updates, a new framework was proposed as a near-real-time incremental solution.
 



As the name Hudi (Hadoop Upserts Deletes and Incrementals) suggests, its main capabilities are upsert and incremental processing, built on top of Hadoop.

Overall architecture



Core function points

Support update and delete

Indexing | Apache Hudi
Hudi relies mainly on indexing to achieve efficient upsert and delete. The index maps a record's Hoodie key (record key) to a file ID; based on the table type and the type of write, Hudi then decides how updated, deleted, and inserted records are actually written.

Index type

  1. BloomFilter index, the default implementation. Every time a file is committed, the bloom filter built from the keys contained in the file, together with the file's key range, is written to the footer of the Parquet file.
  2. HBase global index, which depends on an external HBase cluster.
  3. Simple index (checks whether a corresponding file exists by querying the key field).
  4. Bucket index, which first divides data into buckets and then hashes keys into them, to address the low efficiency of the bloom filter index in large-scale scenarios.

Bucket index


 

Index implementation class


Index types are further divided into global and non-global. The BloomFilter index and the Simple index both have global variants, and the HBase index is inherently global. A global index guarantees key uniqueness across all partitions of the table, at a correspondingly higher cost.
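As a rough illustration (a minimal sketch, not taken from the article; the builder methods come from the Hudi Java client and the class/enum names should be double-checked against 0.14.0), the index type and its global variant are chosen through the write config:

import org.apache.hudi.config.HoodieIndexConfig;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.index.HoodieIndex;

public class IndexConfigExample {
  public static HoodieWriteConfig indexConfig(String basePath) {
    // GLOBAL_BLOOM enforces key uniqueness across partitions; BLOOM, SIMPLE,
    // GLOBAL_SIMPLE, HBASE and BUCKET are the other options mentioned above.
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .forTable("demo_table") // hypothetical table name
        .withIndexConfig(HoodieIndexConfig.newBuilder()
            .withIndexType(HoodieIndex.IndexType.GLOBAL_BLOOM)
            .build())
        .build();
  }
}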


ODPS/MaxCompute also supports update and delete:
How to update or delete data_Cloud native big data computing service MaxCompute-Alibaba Cloud Help Center
It is also implemented with the base file + delta log idea.

Hive 3.0 also supports update, delete and ACID semantics:
Running Apache Hive 3, new features and tips and tricks | Adaltas
Hive Transactions - Apache Hive - Apache Software Foundation

The difference is that Hudi supports upsert on data tables, i.e. it can guarantee primary-key uniqueness at write time, while ODPS and Hive appear to support modifying data only through UPDATE and DELETE DML statements, so the scenarios they cover are different. The latter approach is mainly suited to data-correction scenarios; as a choice for data lake ingestion, native upsert support is still needed.

ACID transaction support

I consider transaction support the core part of Hudi, because data updates and deletions rely heavily on transactional capabilities. Traditional data warehouses only provide insert semantics and files can only be appended, so the demand for transaction guarantees is much weaker; at worst a reader sees data that has not fully arrived (an append happened after the partition data was first written).
However, once update and delete semantics need to be supported, the demand for transaction guarantees becomes much stronger. This is why, to enable update and delete on tables in Hive and ODPS, you first have to enable the table's transactional attributes.
How transactions are implemented in Hudi:
MVCC: snapshot isolation between multiple writers and readers is achieved through the MVCC mechanism.


OCC (Optimistic Concurrency Control)
By default, Hudi assumes a single writer, in which case throughput is maximal. If there are multiple writers, multi-writer concurrency control must be enabled:

 
 
hoodie.write.concurrency.mode=optimistic_concurrency_control
# Specify the lock implementation. The default is a filesystem-based lock
# (the filesystem must provide atomic create and delete guarantees).
hoodie.write.lock.provider=<lock-provider-classname>

Optimistic concurrency control is supported at file granularity. When the write finishes and it is time to commit, if OCC is enabled the lock is acquired first and only then is the commit performed. This lock appears to be a lock of global granularity. Taking the filesystem lock as an example, the commit process looks as follows.

 
 
protected void autoCommit(Option<Map<String, String>> extraMetadata, HoodieWriteMetadata<O> result) {
  final Option<HoodieInstant> inflightInstant = Option.of(new HoodieInstant(State.INFLIGHT,
      getCommitActionType(), instantTime));
  // Begin the transaction; under the OCC concurrency model this acquires the lock.
  this.txnManager.beginTransaction(inflightInstant,
      lastCompletedTxn.isPresent() ? Option.of(lastCompletedTxn.get().getLeft()) : Option.empty());
  try {
    setCommitMetadata(result);
    // reload active timeline so as to get all updates after current transaction have started. hence setting last arg to true.
    // Try to resolve conflicts. The conflict-resolution strategy is pluggable; the default checks whether
    // the sets of changed files intersect. Conflicting file changes cannot be reconciled, so the commit request is aborted.
    TransactionUtils.resolveWriteConflictIfAny(table, this.txnManager.getCurrentTransactionOwner(),
        result.getCommitMetadata(), config, this.txnManager.getLastCompletedTransactionOwner(), true, pendingInflightAndRequestedInstants);
    commit(extraMetadata, result);
  } finally {
    this.txnManager.endTransaction(inflightInstant);
  }
}

Lock acquisition process

 
 
@Override
public boolean tryLock(long time, TimeUnit unit) {
  try {
    synchronized (LOCK_FILE_NAME) {
      // Check whether lock is already expired, if so try to delete lock file
      // First check whether the lock file exists. The default path is base/.hoodie/lock,
      // i.e. every commit operation works against this single file.
      if (fs.exists(this.lockFile)) {
        if (checkIfExpired()) {
          fs.delete(this.lockFile, true);
          LOG.warn("Delete expired lock file: " + this.lockFile);
        } else {
          reloadCurrentOwnerLockInfo();
          return false;
        }
      }
      // If the file does not exist, acquire the lock by creating the file.
      acquireLock();
      return fs.exists(this.lockFile);
    }
  } catch (IOException | HoodieIOException e) {
    // Creation may fail; in that case return false (failed to acquire the lock).
    LOG.info(generateLogStatement(LockState.FAILED_TO_ACQUIRE), e);
    return false;
  }
}

If the files modified by two write requests do not overlap, both pass the resolveConflict phase directly. If they do overlap, the write that commits later fails and is rolled back.
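As a rough sketch of a multi-writer setup (the first two keys appear above; FileSystemBasedLockProvider is the stock filesystem lock shown in the tryLock excerpt, and the lazy failed-writes cleaner policy is the usual companion setting, so treat the exact values as assumptions to verify):

import java.util.Properties;

public class OccConfigExample {
  public static Properties occProps() {
    Properties props = new Properties();
    // Switch from the default single-writer mode to optimistic concurrency control.
    props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    // Filesystem-based lock provider; it relies on atomic create/delete of base/.hoodie/lock.
    props.setProperty("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider");
    // With multiple writers, failed writes are usually cleaned lazily instead of eagerly.
    props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
    return props;
  }
}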

FileLayouts

COW

MOR

  • A table corresponds to a base directory on the distributed file system.
  • Files within each partition are organized into file groups, and each file group corresponds to a file ID.
  • Each file group contains multiple file slices.
  • Each file slice has a base file (a Parquet file) and a set of .log delta files.

Base File is the main file that stores Hudi data, kept in a columnar format such as Parquet. Its name follows the format

 
 
<fileId>_<writeToken>_<instantTime>.parquet

Log File stores the changed data of a MOR table and is often called the delta log. A Log File never exists on its own; it must be attached to a Parquet-format Base File. One Base File plus the Log Files attached to it make up a File Slice.

 
 
.<fileId>_<baseCommitTime>.log.<fileVersion>_<writeToken>

File Slice: in a MOR table, the file set consisting of one Base File and the Log Files attached to it is called a File Slice. File Slice is a concept specific to MOR tables; since a COW table generates no Log Files, its File Slices contain only the Base File, i.e. each Base File is an independent File Slice.

Files with the same FileId belong to the same File Group. A File Group usually contains multiple versions (instantTime) of the Base File (for COW tables) or of Base File + Log File combinations (for MOR tables). When the latest Base File in a File Group grows large enough (> 100 MB), Hudi stops appending data to the current File Group and creates a new one.

Deciding whether to create a new File Group based on size limits is what Hudi calls adaptive file sizing. In effect it creates groups at a finer granularity than the partition, similar to the micro-partition technique in Snowflake. This is friendly to row-level updates: for both COW and MOR tables it narrows the range of data that has to be rewritten on update.
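For illustration (the two keys come from the Hudi configuration reference; the sizes here are arbitrary sample values), the bounds that drive this adaptive file sizing are ordinary write options:

import java.util.HashMap;
import java.util.Map;

public class FileSizingExample {
  public static Map<String, String> fileSizingOptions() {
    Map<String, String> options = new HashMap<>();
    // Upper bound on the size of a base (Parquet) file.
    options.put("hoodie.parquet.max.file.size", String.valueOf(120L * 1024 * 1024));
    // Base files below this threshold count as "small files" and receive new inserts first.
    options.put("hoodie.parquet.small.file.limit", String.valueOf(100L * 1024 * 1024));
    return options;
  }
}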

Multiple query types

  • Snapshot Queries return the snapshot as of the latest commit. For Merge On Read tables, the columnar base data and the real-time data in the logs are merged online at query time; for Copy On Write tables, the latest version of the Parquet data is read. Both Copy On Write and Merge On Read tables support this query type. Batch processing.
  • Incremental Queries return the latest data written after a given commit (a minimal read sketch follows this list). Both Copy On Write and Merge On Read tables support this query type. Streaming/incremental processing. The original purpose of incremental reads is to speed up the warehouse computation pipeline: a traditional offline warehouse can only commit at partition granularity, and since partitions cannot be made arbitrarily fine-grained (hourly or 30-minute partitions are about the limit), downstream jobs can only be scheduled at that granularity. Because Hudi is transactional, it can commit very quickly and expose incremental semantics after each commit, which accelerates the offline processing pipeline. A derived benefit is that it can offer message-queue-like functionality, so Hudi can also serve as a real-time warehouse (if the latency is good enough).
  • Read Optimized Queries only return the latest data within the bounded range before a given commit. They are an optimization of snapshot queries on Merge On Read tables: by sacrificing data freshness, they avoid the query latency of merging log data online. Because this query type reads only the base data, not the incremental data, and everything is in columnar file formats, it is comparatively efficient.
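A minimal sketch of an incremental read through the Spark datasource (option keys as documented for the Hudi Spark datasource; the table path and begin instant are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalQueryExample {
  public static Dataset<Row> readIncrementally(SparkSession spark, String basePath, String beginInstant) {
    // Read only the records committed after beginInstant.
    return spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", beginInstant)
        .load(basePath);
  }
}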

Metadata management

By default Hudi manages metadata for written tables; the metadata itself is a MOR hoodie table. The initial motivation was to avoid frequent file listing (usually a heavy operation on distributed file systems). The metadata is stored in HFile format (the HBase storage format), which provides efficient key-value point lookups.
The metadata-related configuration lives in org.apache.hudi.common.config.HoodieMetadataConfig.
What metadata does it provide?

  • hoodie.metadata.index.bloom.filter.enable stores the bloom filter index of data files
  • hoodie.metadata.index.column.stats.enable stores the column ranges of data files, used for pruning optimization

Flink data skipping support: [HUDI-4353] Column stats data skipping for flink by danny0405 · Pull Request #6026 · apache/hudi · GitHub
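For illustration (the two index keys are the ones listed above; hoodie.metadata.enable is the umbrella switch for the metadata table), these are plain write options, sketched here as a map:

import java.util.HashMap;
import java.util.Map;

public class MetadataConfigExample {
  public static Map<String, String> metadataOptions() {
    Map<String, String> options = new HashMap<>();
    // Enable the internal metadata table (itself a MOR hoodie table).
    options.put("hoodie.metadata.enable", "true");
    // Persist per-file bloom filters into the metadata table.
    options.put("hoodie.metadata.index.bloom.filter.enable", "true");
    // Persist per-column min/max ranges for data skipping / pruning.
    options.put("hoodie.metadata.index.column.stats.enable", "true");
    return options;
  }
}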

Catalog support: a catalog can be built on DFS or the Hive Metastore to manage the metadata of all tables stored in Hudi.

 
 
CREATE CATALOG hoodie_catalog
WITH (
'type'='hudi',
'catalog.path' = '${catalog default root path}',
'hive.conf.dir' = '${directory where hive-site.xml is located}',
'mode'='hms' -- supports 'dfs' mode that uses the DFS backend for table DDLs persistence
);

Other table services

Schema evolution, clustering, cleaning, file sizing, and so on.

Plugin implementation

Write types

Write Operations | Apache Hudi

  • Upsert (default): first looks up the index to decide whether each record updates an existing location or is simply inserted. Use this mode when building a mirror table of a database (a configuration sketch follows this list).
  • Insert: no deduplication logic (no lookup by record key). Suitable when there is no dedup requirement or duplicates are tolerable and only transactional guarantees and incremental reads are needed.
  • bulk_insert: used for the initial bulk import, usually run as a Flink batch job; by default data is sorted by partition key to avoid small files as much as possible.
  • delete: data deletion, either soft delete or hard delete.
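A minimal sketch of selecting the write operation through the Flink connector options (the 'write.operation' and 'table.type' keys come from the Flink Hudi options; the path value is a placeholder):

import java.util.HashMap;
import java.util.Map;

public class WriteOperationExample {
  public static Map<String, String> upsertSinkOptions(String basePath) {
    Map<String, String> options = new HashMap<>();
    options.put("connector", "hudi");
    options.put("path", basePath);
    // One of: upsert (default), insert, bulk_insert.
    options.put("write.operation", "upsert");
    // COPY_ON_WRITE or MERGE_ON_READ.
    options.put("table.type", "MERGE_ON_READ");
    return options;
  }
}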

The plugin supports multiple write modes; see org.apache.hudi.table.HoodieTableSink#getSinkRuntimeProvider. Common ones are listed in
Streaming Ingestion | Apache Hudi
BULK_INSERT: the bulk insert mode is usually used to import data in batches.
Every time a RowData record is written, the bloom filter index is updated at the same time (the record key is added to the bloom filter). After a Parquet file is finished, the built bloom filter is serialized into a string, together with the file's key range, and saved into the file footer (this step is also performed when the bloom filter index is not enabled).

 
 
public Map<String, String> finalizeMetadata() {
  HashMap<String, String> extraMetadata = new HashMap<>();
  // Serialize the bloom filter built from this file's record keys into the footer metadata.
  extraMetadata.put(HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY, bloomFilter.serializeToString());
  if (bloomFilter.getBloomFilterTypeCode().name().contains(HoodieDynamicBoundedBloomFilter.TYPE_CODE_PREFIX)) {
    extraMetadata.put(HOODIE_BLOOM_FILTER_TYPE_CODE, bloomFilter.getBloomFilterTypeCode().name());
  }
  // Also record the key range (min/max record key) of the file.
  if (minRecordKey != null && maxRecordKey != null) {
    extraMetadata.put(HOODIE_MIN_RECORD_KEY_FOOTER, minRecordKey.toString());
    extraMetadata.put(HOODIE_MAX_RECORD_KEY_FOOTER, maxRecordKey.toString());
  }
  return extraMetadata;
}

Append mode: insert-only data.
Upsert:

  • bootstrap index: generates a BootstrapOperator that builds the initial index from the historical dataset of an existing hoodie table (optional). It is enabled via the parameter index.bootstrap.enabled, default false. The loading process can be slow; when enabled, data cannot be processed until every task has finished loading. The load has to fetch the index of all partitions into state; in theory it needs to read the metadata columns _hoodie_record_key and _hoodie_partition_path and build IndexRecords, which is why it is slow.
  • stream writer: when writing, BucketAssignFunction first computes which bucket (file group) a record should land in. The word "bucket" here clashes a bit with "bucket index"; they are two different concepts. Here it is mainly about deciding which file the record belongs to, and this step uses the index built earlier, so by default Flink's index is state-based.
 
 
// Only changing records need looking up the index for the location,
// append only records are always recognized as INSERT.
HoodieRecordGlobalLocation oldLoc = indexState.value();
// "Changing records" are write types that modify data, e.g. update and delete.
if (isChangingRecords && oldLoc != null) {
  // Set up the instant time as "U" to mark the bucket as an update bucket.
  // After tagging, the partition may have changed (e.g. the value of the partition field changed);
  // the state stores the location where this record currently resides.
  if (!Objects.equals(oldLoc.getPartitionPath(), partitionPath)) {
    if (globalIndex) {
      // if partition path changes, emit a delete record for old partition path,
      // then update the index state using location with new partition path.
      // For a global index, the data in the old partition must be deleted first;
      // a non-global index does not make cross-partition changes.
      HoodieRecord<?> deleteRecord = new HoodieAvroRecord<>(new HoodieKey(recordKey, oldLoc.getPartitionPath()),
          payloadCreation.createDeletePayload((BaseAvroPayload) record.getData()));
      deleteRecord.unseal();
      deleteRecord.setCurrentLocation(oldLoc.toLocal("U"));
      deleteRecord.seal();
      out.collect((O) deleteRecord);
    }
    location = getNewRecordLocation(partitionPath);
  } else {
    location = oldLoc.toLocal("U");
    this.bucketAssigner.addUpdate(partitionPath, location.getFileId());
  }
} else {
  location = getNewRecordLocation(partitionPath);
}

As can be seen, the fileId a record lands in is already determined in the BucketAssigner step (i.e. the tagging process), so by default the state-based index is used. This is where org.apache.hudi.table.action.commit.FlinkWriteHelper#write differs from org.apache.hudi.table.action.commit.BaseWriteHelper#write. The advantage is that, unlike the BloomFilter index, there is no need to read file keys and there are no false positives; the disadvantage is that the index has to be maintained in state on the write side. Besides the default state-based index, Flink also supports the Bucket Index.

Overall, the index implementations feel fragmented, left to each engine to complete, and having streaming writes depend on an internal state index may raise stability concerns.

Summary

  1. Compared with traditional data warehouses, supports update and delete (in a more lightweight way)
  2. ACID transaction capability (the foundation) plus an indexing mechanism
  3. Supports both incremental and batch reads
  4. Provides comprehensive file and table metadata, accelerating data pruning on the query side
  5. Currently does not appear to support dimension-table (dim) joins
  6. Its positioning is unified streaming/batch storage and an upgrade of the traditional data warehouse; it does not replace OLAP or KV storage systems.

Overall, the core value of Hudi is:
Reduced end-to-end data latency.
In the traditional Hive-based T+1 update solution, data freshness is only at day level, depending on the partition granularity. In a traditional offline warehouse, commits can only happen at partition granularity, and partitions cannot be made particularly fine-grained because the file-management pressure would be too great; hourly or 30-minute partitions are about the limit, so downstream jobs can only be scheduled and computed at that granularity. Hudi, being transactional, can commit very quickly and provides incremental semantics after each commit, which accelerates the offline data processing pipeline.

Efficient upsert.
There is no need to overwrite the entire table or partition on every write; updates can be applied locally at file granularity, which improves storage and compute efficiency.

Both of these are guaranteed by ACID transactions. Hudi's name was therefore chosen very well; it basically spells out all of its core functions.


Origin blog.csdn.net/Gefangenes/article/details/132483906