Exploration and Practice of Extreme Query Optimization Based on Apache Hudi

Abstract: This article introduces how Presto can make better use of Hudi's data layout and index information to speed up query performance.

This article is shared from the HUAWEI CLOUD community post "HUAWEI CLOUD's Exploration and Practice of Apache Hudi's Ultimate Query Optimization!" by FI_mengtao.

Background

LakeHouse is a new, open architecture that combines the best elements of data lakes and data warehouses, and it is an important development direction in the current big data field.

HUAWEI CLOUD began pre-research on the related technologies as early as 2020 and has landed them in the HUAWEI CLOUD FusionInsight MRS intelligent data lake solution.

At present, the three mainstream data lake components, Apache Hudi, Iceberg, and Delta, each have their own advantages, and the industry keeps exploring and choosing the solution that suits it best.

The core foundation of Huawei's lakehouse architecture is Apache Hudi: all data entering the lake is carried by Apache Hudi, and HetuEngine (an enhanced version of Presto) serves as the one-stop SQL analysis engine on top of it. Making Presto and Hudi work well together, so that query efficiency approaches that of a dedicated distributed data warehouse, is therefore of great significance.

Query performance optimization is a big topic that spans indexing, data layout, pre-aggregation, statistics, engine runtime optimization, and more. This article mainly introduces how Presto makes better use of Hudi's data layout and index information to speed up queries; pre-aggregation and statistics will be covered in a later article.

Data layout optimization

Big data analytic queries generally carry filter conditions. For such queries, if the target result set is small, we can in principle skip large amounts of irrelevant data when reading the table and scan only a very small dataset, significantly improving query efficiency. This technique is commonly called DataSkipping.

A good data layout makes related data more compact (and, as a side effect, mitigates the small file problem), and it is a key step in realizing DataSkipping. Choosing partition fields sensibly and sorting data, both routine daily tasks, are forms of data layout optimization. The current mainstream query engines Presto and Spark can perform Rowgroup-level filtering on Parquet files, and the latest versions even support Page-level filtering; with an appropriate data layout, the engine can use the column statistics in those files to easily filter out a large number of Rowgroups/Pages, thereby reducing IO.
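To make this concrete, here is a minimal sketch (Spark/Scala; the paths and column name are illustrative, not taken from the test below) of sorting on a frequently filtered column before writing, so that each Parquet Rowgroup covers a narrow min-max range and the engine can skip most Rowgroups for a selective filter:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("layout-demo").getOrCreate()

// Sorting clusters related rows together; the Rowgroup min-max ranges of the
// output files barely overlap, which is what makes Rowgroup-level filtering effective.
spark.read.parquet("/data/events_raw")
  .sort("event_time")
  .write
  .parquet("/data/events_sorted")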

So is data layout alone enough for DataSkipping? Not quite. The filtering above still requires opening every file in the table, so its effect is limited; data layout optimization delivers much more when combined with FileSkipping.

Once the data layout is done, we collect statistics on the relevant columns of each file. The figure below is a simple example: the data is sorted and written into the table, producing three files, and the query specifies where a < 10. It is clear that rows with a < 10 exist only in the file parquet1, while the minimum value of a in parquet2 and parquet3 is larger than 10, so those two files cannot contain any matching rows and can be pruned outright.

This is a simple form of FileSkipping. The purpose of FileSkipping is to prune as many unnecessary files as possible and reduce scan IO. There are many ways to implement it, such as min-max statistics filtering, BloomFilter, Bitmap, and secondary indexes; each has its own strengths and weaknesses. Min-max statistics filtering is the most common, and it is also the default implementation provided by Hudi/Iceberg/DeltaLake.
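The following sketch (illustrative Scala, mirroring the a < 10 example above) shows the core min-max decision: a file is scanned only if its statistics admit a possible match. Real engines read these statistics from Parquet footers or, in Hudi's case, from the column_stats index in MDT.

// Per-file column statistics, as collected at write time
case class ColumnStats(min: Long, max: Long, nullCount: Long)

// "a < 10" can only match inside a file if min(a) < 10
def mayContain(stats: ColumnStats, bound: Long): Boolean = stats.min < bound

val files = Map(
  "parquet1" -> ColumnStats(0, 9, 0),
  "parquet2" -> ColumnStats(10, 19, 0),
  "parquet3" -> ColumnStats(20, 29, 0)
)

val toScan = files.filter { case (_, s) => mayContain(s, 10) }.keys
// toScan contains only "parquet1"; parquet2 and parquet3 are pruned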

Apache Hudi core capabilities

Clustering

Hudi has provided Clustering to optimize data layout since version 0.7.0, and with the addition of the Z-Order/Hilbert higher-order clustering algorithms in version 0.10.0, Hudi's data layout optimization keeps getting more powerful. Hudi currently offers three different clustering sort strategies (linear sort, Z-Order, and Hilbert); you can choose the strategy that matches the filter conditions of your query scenario.

For the specific principles of Z-Order and Hilbert curves, please refer to the relevant Wikipedia articles (https://en.wikipedia.org/wiki/Z-order); this article will not go into the details.
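As a configuration sketch, the write-time options below trigger inline clustering with a Hilbert sort. The config keys are those introduced around Hudi 0.10 and may differ in your version, so treat them as an assumption to verify against the docs; the sort columns mirror the test later in this article.

df.write.format("hudi")
  .option("hoodie.table.name", "lineorder_flat")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  .option("hoodie.clustering.plan.strategy.sort.columns", "S_CITY,C_CITY,P_BRAND,LO_DISCOUNT")
  .option("hoodie.layout.optimize.strategy", "hilbert")  // or "z-order" / "linear"
  .mode("append")
  .save("/tables/lineorder_flat")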

Metadata Table(MDT)

Metadata Table (MDT) is Hudi's metadata table: a self-managed Hudi MoR table located in the .hoodie directory of the Hudi table, invisible to users once enabled. Hudi has supported MDT for a long time, and after continuous iteration, MDT matured in version 0.12. The current MDT provides the following capabilities.

(1) Column stats / BloomFilter

We introduced data layout optimization above; now let's look at the FileSkipping capabilities Hudi provides. Currently, Hudi supports collecting statistics for specified columns, including min-max values, null counts, and total counts, and guarantees that this collection is atomic. Used by the query engine, these statistics greatly reduce scan IO through FileSkipping. BloomFilter is another capability Hudi provides; it currently supports building BloomFilters only on the primary key. If a BloomFilter says a key does not exist, the key definitely does not exist, which is very convenient for FileSkipping: we can apply the query condition directly to each file's BloomFilter and prune the files that cannot contain the value. Note that a BloomFilter only works for equality filter conditions such as where a = 10; it can do nothing for a > 10.
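On the read path, a minimal sketch of turning this on in Spark SQL looks like the following (the two option keys also appear in the table definition later in this article; the SET-based usage is per the Hudi docs and worth verifying for your version). Min-max statistics help range predicates, while BloomFilter only helps the equality case:

// Enable MDT and let the reader consult the column_stats index to prune files before scanning
spark.sql("set hoodie.metadata.enable=true")
spark.sql("set hoodie.enable.data.skipping=true")

// Equality predicate: can be served by both column stats and BloomFilter
spark.sql("select * from lineorder_flat where LO_DISCOUNT = 5").show()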

(2) High-performance FileList

When querying ultra-large datasets, a FileList operation is unavoidable. On HDFS its cost is acceptable, but once object storage is involved, large-scale FileList becomes extremely inefficient. Hudi's MDT stores the file information directly, thereby avoiding large-scale FileList operations.
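On the read side this comes down to a single option; a sketch follows (the key name is per the Hudi docs, and the table path is illustrative):

// With MDT enabled, Hudi serves the file listing from its internal metadata
// table instead of issuing large recursive List calls against object storage.
val snapshot = spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .load("/tables/lineorder_flat")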

Presto and Hudi integration

As the query gateway of the data lake, HetuEngine (Presto) plays a critical role in querying Hudi. In this integration we optimize point queries and complex queries in different ways; the rest of this section focuses on the point query scenario. Before integrating with Hudi, the following questions had to be settled:

  1. How should we integrate Hudi: modify the Hive Connector directly, or build an independent Hudi Connector?
  2. Which indexes should DataSkipping support?
  3. Should DataSkipping run on the Coordinator side or on the Worker side?

Question 1: After discussion, we decided to host this optimization in a Hudi Connector. The current community connector is still somewhat immature: it lacks several optimizations, including statistics, Runtime Filter, and filter pushdown, which leads to unsatisfactory TPC-DS performance. We focused on these gaps in this round of work, and the related improvements will be contributed back to the community.

Question 2: HetuEngine already supports Bitmap and secondary indexes internally. This round focuses on integrating MDT's column statistics and BloomFilter capabilities, using the filters pushed down by Presto to prune files directly.

Question 3: We tested this. For column statistics, the overall data volume is small: the statistics for 10,000 files amount to only a few MB, which can be loaded into Coordinator memory without problems. We therefore filter directly on the Coordinator side.

BloomFilter and Bitmap are a completely different story. Our tests show that 1.4 TB of data generates more than 1 GB of BloomFilter indexes; loading these into the Coordinator is clearly unrealistic. Since the BloomFilters in Hudi's MDT are stored in HFiles, and HFile point lookups are very efficient, we push DataSkipping down to the Worker side: each Task looks up its own BloomFilter information in the HFile and filters accordingly.
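The sketch below (illustrative Scala; BloomFilter stands in for Hudi's internal type, and the loader parameter is a hypothetical helper, not a real Hudi API) captures the worker-side decision:

// Each Task loads the BloomFilter for its own files via a point lookup into the
// MDT HFile and skips files that definitely cannot contain the key.
trait BloomFilter { def mightContain(key: String): Boolean }

def shouldScan(loadFromMdt: String => BloomFilter)(file: String, key: String): Boolean =
  loadFromMdt(file).mightContain(key)  // false => the file is safely skipped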

Point query scenario test

Test Data

We use the same SSB dataset as ClickHouse for testing: 1.5 TB of data and 12 billion rows, generated as follows:

$ ./dbgen -s 2000 -T c
$ ./dbgen -s 2000 -T l
$ ./dbgen -s 2000 -T p
$ ./dbgen -s 2000 -T s

Test environment

1 CN + 3 WN containers: 170 GB memory, 136 GB JVM heap, 95 GB max query memory, 40 vcores

Data processing

We preprocess the data with Hudi's built-in Hilbert algorithm and write it directly into the target table, specifying S_CITY, C_CITY, P_BRAND, and LO_DISCOUNT as the sort columns.

// Imports assumed for Hudi 0.10+ (package paths may differ across versions)
import org.apache.hudi.config.HoodieClusteringConfig.LayoutOptimizationStrategy
import org.apache.hudi.sort.SpaceCurveSortingHelper
import org.apache.spark.sql.functions.expr

SpaceCurveSortingHelper
  .orderDataFrameBySamplingValues(df.withColumn("year", expr("year(LO_ORDERDATE)")), LayoutOptimizationStrategy.HILBERT, Seq("S_CITY", "C_CITY", "P_BRAND", "LO_DISCOUNT"), 9000)
  .registerTempTable("hilbert")
spark.sql("insert into lineorder_flat_parquet_hilbert select * from hilbert")

Test Results

We test in cold-start mode to reduce the impact of Presto's caches on performance.

[Figure: SSB query performance]

[Figure: file read volume]

  1. Across all the SQL queries we see a 2x to 11x performance improvement; the FileSkipping effect is even more pronounced, with the number of files read reduced by 2x to 200x.
  2. Even without MDT, Presto's powerful Rowgroup-level filtering combined with Hilbert data layout optimization greatly improves query performance.
  3. The SSB queries scan relatively few columns; in real scenarios that scan more columns, Presto + MDT + Hilbert can deliver more than a 30x improvement.
  4. The test also exposed a shortcoming of MDT: the MDT generated from 12 billion rows is close to 50 MB, and loading it into memory takes noticeable time. We plan to back MDT with a cache disk to speed up reads.

Since Hudi only supports building BloomFilters on the primary key, we constructed a 10-million-row dataset to test BloomFilter:

spark.sql(
 """
 |create table prestoc(
 |c1 int,
 |c11 int,
 |c12 int,
 |c2 string,
 |c3 decimal(38, 10),
 |c4 timestamp,
 |c5 int,
 |c6 date,
 |c7 binary,
 |c8 int
 |) using hudi
 |tblproperties (
 |primaryKey = 'c1',
 |preCombineField = 'c11',
 |hoodie.upsert.shuffle.parallelism = 8,
 |hoodie.table.keygenerator.class = 'org.apache.hudi.keygen.SimpleKeyGenerator',
 |hoodie.metadata.enable = "true",
 |hoodie.metadata.index.column.stats.enable = "true",
 |hoodie.metadata.index.column.stats.file.group.count = "2",
 |hoodie.metadata.index.column.stats.column.list = 'c1,c2',
 |hoodie.metadata.index.bloom.filter.enable = "true",
 |hoodie.metadata.index.bloom.filter.column.list = 'c1',
 |hoodie.enable.data.skipping = "true",
 |hoodie.cleaner.policy.failed.writes = "LAZY",
 |hoodie.clean.automatic = "false",
 |hoodie.metadata.compact.max.delta.commits = "1"
 |)
 |
 |""".stripMargin)

In the end, 8 files were generated in total, and BloomFilter skipping pruned 7 of them; the effect is very obvious.
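For reference, the shape of the point lookup this test exercises is a primary-key equality predicate, which is exactly what the BloomFilter index can serve (the key value below is illustrative; in the test the queries ran through HetuEngine):

// Only an equality predicate on the primary key c1 can be served by the BloomFilter index
spark.sql("select * from prestoc where c1 = 12345").show()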

Follow-up

Subsequent work on point queries will focus on Bitmap and secondary indexes. To close, here is how to choose among the various DataSkipping optimization techniques:

  1. The various sort methods in Clustering need to be combined with column statistics to achieve good results.
  2. BloomFilter suits equality conditions and does not require the data to be sorted, but choose high-cardinality fields; a BloomFilter on a low-cardinality field is of little use. Also avoid BloomFilters on ultra-high-cardinality columns, since the resulting BloomFilters become too large.

 
