Performance Improved by 30%: A Practical Analysis of Apache Hudi Performance Optimization in the Kangaroo Cloud Data Stack

Apache Hudi is an open-source data lake solution that helps enterprises manage and analyze massive data sets and supports efficient data updates and queries. It also provides a variety of data compression and storage formats as well as indexing capabilities, offering more flexible and efficient data processing for enterprise data warehouse practices.

In finance, enterprises can use Hudi to process large volumes of transaction data that must be queried and updated in real time. In e-commerce, Hudi can track order data, updating and querying orders in real time. In logistics and supply chain management, Hudi helps enterprises process and update large volumes of logistics data in real time while ensuring data consistency and reliability.

As one-stop big data infrastructure software, the Kangaroo Cloud Data Stack builds on Apache Hudi to provide customers with complete capabilities such as migrating existing data, ingesting data into the lake, and file management. Along the way, our team has accumulated some experience with Hudi performance optimization, which we hope to share in this article.

A brief analysis of Hudi's principles

Apache Hudi is an open-source data lake solution. Originally built on the Hadoop and Spark technology stack, it has since been extended to computing engines such as Flink and Trino. Its main purpose is to provide an efficient, scalable, and reliable data lake solution for managing and processing large-scale data sets.

At its core, Hudi supports incremental data updates and queries by dividing a data set into multiple data files and maintaining a data version and index information for each file. As shown in the figure below, when a user updates data, Hudi's copy-on-write operation writes the updated records into a new data file, copies the unchanged records from the original file into that new file, and applies the updates there.
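The copy-on-write flow above can be sketched in a few lines of plain Python. This is a conceptual simulation, not Hudi's actual implementation: the point is only that an update rewrites a whole file, carrying unchanged records over, while the old version stays immutable for in-flight readers.

```python
# Conceptual sketch (plain Python, not Hudi's implementation): copy-on-write
# rewrites the data file, carrying unchanged records over and applying the
# updates, then bumps the version so readers see one consistent file.

def copy_on_write(data_file, updates):
    """Return a new data file with `updates` applied; the original is untouched."""
    new_records = dict(data_file["records"])  # copy unchanged records over
    new_records.update(updates)               # apply updated/new records
    return {"version": data_file["version"] + 1, "records": new_records}

v1 = {"version": 1, "records": {"order-1": "created", "order-2": "created"}}
v2 = copy_on_write(v1, {"order-2": "shipped"})

print(v1["records"]["order-2"])  # created  (old file is immutable)
print(v2["records"]["order-2"])  # shipped  (new file holds the update)
```

The cost profile follows directly: every update rewrites a full file (write amplification), but reads only ever open one clean file.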

At the same time, Hudi updates the data version and index information so that users can access the latest records by data version and unique key. When users query data, Hudi uses the index to locate records and returns the latest version of each.

(figure: copy-on-write update flow)

In Hudi's merge-on-read mode, updates are applied by merging the original data with the updated data at query time. Specifically, when new data arrives, Hudi appends it to a new log file and records the file's information in the metadata. At query time, Hudi merges all the data files into a view and serves the query from that view.

Since Hudi only merges the data that needs updating at query time, and does not merge at write time, it avoids the write-path merge overhead and achieves fast update operations.
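The merge-on-read trade-off can be sketched the same way. Again this is a conceptual simulation in plain Python, not Hudi's code: writes are cheap appends to a log, and the merge cost is paid when a query builds its read view.

```python
# Conceptual sketch (plain Python, not Hudi's implementation): merge-on-read
# appends updates to a log at write time and only merges them with the base
# records when a query builds its view.

def write_update(log, updates):
    """Writes are cheap: just append, no merging."""
    log.append(updates)

def read_view(base, log):
    """Reads merge base records with log entries; later entries win."""
    view = dict(base)
    for updates in log:
        view.update(updates)
    return view

base = {"order-1": "created", "order-2": "created"}
log = []
write_update(log, {"order-2": "shipped"})   # update an existing record
write_update(log, {"order-3": "created"})   # insert a new record

print(read_view(base, log))
# {'order-1': 'created', 'order-2': 'shipped', 'order-3': 'created'}
```

Compared with the copy-on-write sketch earlier, the file rewrite has moved from the write path to the read path.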

Apache Hudi creates a new version on each write and generates a view by merging all versions on read. In the view, each record appears exactly once, at its latest version. Reads only touch the view and never modify the original data, which achieves read-write separation.

By implementing concurrency control through multiple versions, Hudi improves read performance while guaranteeing data consistency, reliability, and scalability.

Hudi optimization practice

The following sections describe Hudi performance optimizations based on practical experience with the Kangaroo Cloud Data Stack.

Support for multiple indexes

Hudi divides a data set into multiple data files and maintains a data version and index information for each file to support incremental updates and queries. With an index in place, the generated metadata can quickly locate the data a query needs, as shown in the figure below. This reduces or even avoids scanning and reading unnecessary data from the file system, lowers IO overhead, and greatly improves query efficiency. Hudi already supports several different indexing techniques and continues to improve and add more index implementations.
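The data skipping described above can be illustrated with min/max column statistics, one of the kinds of metadata an index can keep per file. The sketch below is plain Python with made-up file names and stats, not Hudi's implementation:

```python
# Conceptual sketch (plain Python): min/max column statistics kept as index
# metadata let a reader skip files that cannot contain matching rows.
# File names and stat values are made up for illustration.

file_stats = {
    "file_a.parquet": {"min_id": 1,   "max_id": 100},
    "file_b.parquet": {"min_id": 101, "max_id": 200},
    "file_c.parquet": {"min_id": 201, "max_id": 300},
}

def files_to_scan(stats, wanted_id):
    """Keep only files whose [min, max] range can contain wanted_id."""
    return [name for name, s in stats.items()
            if s["min_id"] <= wanted_id <= s["max_id"]]

print(files_to_scan(file_stats, 150))  # ['file_b.parquet'] -- two files skipped
```

A point lookup that would otherwise scan three files touches only one; no file contents are read to make that decision, only the index metadata.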

The Kangaroo Cloud Data Stack lets users choose an index type when creating a Hudi table, including SIMPLE, BLOOM FILTER, BUCKET, and other types. During writes, Hudi stores the index information in the parquet files or in external storage. During reads, the application compares against this information and skips unnecessary data files.
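A minimal sketch of the write options involved, using standard `hoodie.*` config keys; the table and field names are illustrative, and the commented-out write call assumes a Spark session with the Hudi bundle on the classpath:

```python
# Sketch of Hudi write options selecting an index type. Table and field
# names are illustrative; the hoodie.* keys are standard Hudi configs.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "update_time",
    # Index type: e.g. SIMPLE, BLOOM, GLOBAL_BLOOM, BUCKET
    "hoodie.index.type": "BLOOM",
}

# Typical usage with a Spark DataFrame `df` (requires a Spark + Hudi runtime):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```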

(figure: index metadata used to skip data files at query time)

Hudi 0.11.0 introduced the MetadataTable as a multi-modal index. By summarizing metadata in the MetadataTable, applications can avoid file-system calls for file listing (which is very time-consuming on object storage) and avoid reading footer information directly from parquet files, greatly improving query performance.

The Kangaroo Cloud Data Stack lets users enable the multi-modal index when creating a table, writing each file's index information to the MetadataTable during ingestion. The Data Stack also supports building the MetadataTable asynchronously: writes stay low-latency, and a background application generates the MetadataTable offline to improve read performance.
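A sketch of the standard write options that control this behavior; the values shown are illustrative choices, not the only valid ones:

```python
# Sketch of the standard Hudi options that enable the metadata table and
# its indexes. Values shown are illustrative.
metadata_options = {
    # Maintain the metadata table so readers can avoid file listing calls
    "hoodie.metadata.enable": "true",
    # Record column stats / bloom filters in the metadata table (0.11.0+)
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    # Build the metadata indexes asynchronously so writes stay low-latency
    "hoodie.metadata.index.async": "true",
}
```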

Because the MetadataTable relies on information such as column stats and bloom filters recorded in base files, it has no way to capture log file information in merge-on-read mode, and the open-source framework does not use it to filter log files.

However, since a base file and its log files share the same fileId, the Kangaroo Cloud technology team made a modification inside the Data Stack: after obtaining the base files through the MetadataTable, it filters log files by fileId to avoid unnecessary reads. Verification showed that this change gives merge-on-read mode the same filtering effect as copy-on-write mode.
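The idea of that modification can be sketched in plain Python. This is a conceptual illustration of the filtering logic, not the Data Stack's actual code; file names and fileIds are made up:

```python
# Conceptual sketch (plain Python) of the Data Stack modification: after the
# MetadataTable prunes base files, keep only the log files that share a
# fileId with a surviving base file. Names and ids are illustrative.

def filter_log_files(selected_base_files, all_log_files):
    """Keep log files whose fileId matches a base file kept by the index."""
    kept_ids = {bf["fileId"] for bf in selected_base_files}
    return [lf for lf in all_log_files if lf["fileId"] in kept_ids]

base_files = [{"fileId": "f1", "name": "f1_0.parquet"}]   # survived index pruning
log_files = [
    {"fileId": "f1", "name": ".f1_1.log"},
    {"fileId": "f2", "name": ".f2_1.log"},  # its base file was pruned, so skip it
]

print([lf["name"] for lf in filter_log_files(base_files, log_files)])
# ['.f1_1.log']
```

Because the log files carry no stats of their own, piggybacking on the base file's pruning decision is what restores copy-on-write-level filtering for merge-on-read tables.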

Optimize file layout

In big data storage, file layout optimization is an important performance technique. Its main purpose is to arrange data on the storage medium according to certain rules at write time, so as to improve the efficiency of later reads and processing. File layout can be optimized in a variety of ways, such as timestamp sorting, partition sorting, and merging small files.

Hudi provides a file layout optimization mechanism called Clustering. It can merge small files into larger ones to reduce the total number of files a query engine must scan, or use concepts such as space-filling curves to adapt the data lake layout and reduce the amount of data a query reads. With Clustering, data with the same query characteristics can be placed in a few adjacent files; at query time, index information then filters those files effectively, reducing both the number of files read and the compute cost.
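A sketch of the standard inline clustering options; the sort column and the size thresholds are illustrative values, to be tuned per table:

```python
# Sketch of inline clustering configs (standard Hudi options; the sort
# column and size thresholds are illustrative values).
clustering_options = {
    # Trigger clustering inline after every 4 commits
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Files below 100 MB are candidates for merging
    "hoodie.clustering.plan.strategy.small.file.limit": str(100 * 1024 * 1024),
    # Target size of rewritten files: 1 GB
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    # Co-locate rows that are queried together by sorting on this column
    "hoodie.clustering.plan.strategy.sort.columns": "order_id",
}
```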

The Kangaroo Cloud Data Stack provides a visual page where users can adjust the file layout, freely setting the sort strategy, sort fields, filter conditions, and other optimization options. Because Hudi organizes files in multiple versions, users need not worry about an optimization task affecting running read tasks; once optimization completes, new read tasks enjoy the efficiency gains of the new layout.

(figure: file layout optimization page in the Data Stack)

Explore new features

While rolling out Hudi, the Kangaroo Cloud Data Stack team also actively tracks new functions and features emerging in the community.

Hudi 0.13.0 implemented the "optimized record payload handling" feature. Setting the two parameters hoodie.datasource.write.record.merger.impls=org.apache.hudi.HoodieSparkRecordMerger and hoodie.logfile.data.block.format=parquet avoids extra copying and deserialization, so records are processed in a consistent format throughout the write path.

The Kangaroo Cloud Data Stack tested and adopted this feature; verification showed an update performance improvement of about 20% over the previous version, in line with the community's description. The Data Stack also adopted the lock-free Disruptor message queue for writing data, another feature introduced in Hudi 0.13.0, by setting hoodie.write.executor.type=DISRUPTOR and hoodie.write.executor.disruptor.wait.strategy=BUSY_SPIN_WAIT. Combined with the optimization above, overall update performance improved by more than 30%.
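The two 0.13.0 optimizations, combined into one set of write options; the keys and values are exactly those named in the text:

```python
# The two Hudi 0.13.0 optimizations described above, as write options
# (parameter names and values are taken directly from the text).
perf_options = {
    # Keep records in the Spark-native format end to end (no extra copy/deser)
    "hoodie.datasource.write.record.merger.impls":
        "org.apache.hudi.HoodieSparkRecordMerger",
    "hoodie.logfile.data.block.format": "parquet",
    # Lock-free Disruptor queue on the write path
    "hoodie.write.executor.type": "DISRUPTOR",
    "hoodie.write.executor.disruptor.wait.strategy": "BUSY_SPIN_WAIT",
}
```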

Summary

Apache Hudi's strengths are its support for incremental data processing, its strong data consistency and reliability, and its variety of performance optimization techniques, which improve data processing and query efficiency with good performance and scalability.

While rolling out Hudi, the Kangaroo Cloud Data Stack team verified Hudi's various indexes, applied its file layout optimization features, and summarized common tuning parameters, accumulating substantial experience in building reliable, efficient, and scalable data lake solutions for enterprises. This helps enterprises better manage and analyze data and improve the accuracy and efficiency of business decisions.

"DTStack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download: https://www.dtstack.com/resources/1001?src=szsm

To learn more about Kangaroo Cloud's big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Developers interested in big data open-source projects are also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" to exchange the latest open-source technology news. Group number: 30537511, project address: https://github.com/DTStack


Origin: my.oschina.net/u/3869098/blog/10084152