An in-depth comparison of three open source data lake solutions: Delta, Iceberg, and Hudi

The three most popular open source data lake solutions on the market are Delta, Apache Iceberg, and Apache Hudi. Delta, launched by Databricks, the commercial company behind Apache Spark, draws particular attention thanks to Spark's enormous commercial success. Apache Hudi is a data lake project designed by Uber engineers to serve their internal data analytics needs; its fast upsert/delete and compaction features hit the pain points of a broad range of users, and the project members are active in community building, including sharing technical details and promoting the project in the Chinese community, which is gradually attracting potential users. Apache Iceberg currently looks comparatively low-key: its community attention is not as high as Delta's and its feature set is not as rich as Hudi's, but it is an ambitious project, because its highly abstract and very elegant design lays a solid foundation for a general-purpose data lake solution.

Faced with these three projects, many users wonder which data lake solution they should choose under which circumstances. In this article we deconstruct the core requirements of a data lake, compare the three products in depth, and help users pick the solution that best fits their own scenarios.

First, let's analyze, one by one, why these technology companies launched their own open source data lake solutions, what problems they ran into, and how their solutions address those problems. The goal is to look at the business scenarios objectively so we can judge which features are genuine pain points and hard requirements for users.

Databricks and Delta

Take Delta, launched by Databricks, as an example. The core problems it sets out to solve are summarized in the figure below.

Before Delta, Databricks customers generally used the classic lambda architecture to build their streaming and batch processing pipelines. Take user click-behavior analysis as an example: click events are consumed from Kafka by a downstream Spark Streaming job, which produces a real-time analysis result (business-level aggregation and so on). This real-time result only reflects the state at the current moment, not all click events on the timeline. To preserve the full click history, the Kafka stream is also processed by a separate Spark batch job and imported into a file system (usually written to HDFS or S3 in Parquet format, which can be regarded as a simplified data lake), where downstream batch jobs perform full-data analytics and AI processing.
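To make the two paths concrete, here is a minimal spark-shell style sketch of that lambda pipeline, assuming Spark Structured Streaming with a hypothetical Kafka broker, a hypothetical `clicks` topic, and an illustrative aggregation; none of these names come from the article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("lambda-demo").getOrCreate()
import spark.implicits._

// Speed layer: consume click events from Kafka and keep a running aggregate.
val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "clicks")                      // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

clicks
  .select(get_json_object($"json", "$.page").as("page"))
  .groupBy($"page")
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")   // in practice this would feed a serving store
  .start()

// Batch layer: periodically dump the full click history to Parquet on HDFS/S3
// for downstream batch analytics and AI jobs.
spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "clicks")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) AS json", "timestamp")
  .write.mode("append").parquet("s3a://datalake/clicks/")  // hypothetical path
```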

 

There are actually many problems with this scheme: 

First, data imported into the file system in batches generally lacks a strict global schema. Downstream Spark analysis jobs run into trouble when they encounter malformed data: every job has to filter and handle the malformed or missing records itself, which is costly.

Second, writing data to the file system carries no ACID guarantee, so users may read data in an intermediate state of an import. To sidestep this pitfall, upper-level batch jobs can only be scheduled around the import time window, which is obviously unfriendly to the business side. At the same time, there is no notion of snapshot versions across multiple imports; for example, if the business side wants to read the data versions of the last five imports, that is simply not possible.

Third, users cannot efficiently upsert or delete historical data. Once Parquet files are written to HDFS, changing the data means rewriting it in full, which is expensive. Yet this kind of requirement is widespread: a bug writes bad records to the file system and the business side wants to correct them; an online MySQL binlog continuously applies incremental updates and deletes to the downstream data lake; or data compliance rules mandate deletion, such as Europe's GDPR privacy regulation.

Fourth, frequent data imports generate a large number of small files, which overwhelms the file system, especially HDFS, which has a practical limit on the number of files it can manage.

Therefore, in Databricks' view, a data lake must address the following four points.

 

In fact, when designing Delta, Databricks also hoped to further unify streaming and batch jobs at the data level (as shown below): business data flows through Kafka into a single unified data lake, regardless of whether it is processed in batch or streaming mode, and the upper layers can then use various analysis engines for report analysis, stream computing, AI, and so on.
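The following is a minimal sketch of that unified pattern using the open source Delta Lake Spark API; the Kafka broker, topic, table path, and column names are made up for illustration. The streaming job appends to a Delta table with ACID guarantees, a batch job reads the same table, and corrections are applied in place with MERGE.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-unified-demo")
  // Extensions required for open source Delta Lake on Spark 3.x
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()
import spark.implicits._

val tablePath = "s3a://datalake/events"   // hypothetical table location

// Streaming ingestion: the Kafka click stream lands in the lake as ACID commits.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "clicks")                     // hypothetical topic
  .load()
  .selectExpr("CAST(key AS STRING) AS eventId", "CAST(value AS STRING) AS payload")
  .writeStream
  .format("delta")
  .option("checkpointLocation", s"$tablePath/_checkpoints/ingest")
  .start(tablePath)

// Batch analytics read the same table and never observe half-written commits.
spark.read.format("delta").load(tablePath).groupBy("eventId").count().show()

// Upserting corrections (e.g., fixing bad records or applying CDC changes) via MERGE.
val corrections = Seq(("42", "corrected-payload")).toDF("eventId", "payload")
DeltaTable.forPath(spark, tablePath).as("t")
  .merge(corrections.as("s"), "t.eventId = s.eventId")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```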

To sum up, I think Databricks mainly had the following core features in mind when designing Delta:

Uber and Apache Hudi

Uber's business scenario is mainly to synchronize trip and order data generated online into a unified data center, where it is analyzed and processed by city operations teams. In 2014 Uber's data lake architecture was fairly simple: business logs were synchronized to S3 via Kafka and analyzed with EMR on top, while online relational databases and NoSQL stores were synchronized via ETL tasks (some of which also pulled data from Kafka that had been synced to S3) into the closed-source Vertica analytical database, and city operations staff did their aggregations mainly through Vertica SQL. At the time they also ran into problems such as chaotic data formats, high scaling costs (Vertica is commercial, licensed software), and painful data backfills. A later migration to the open source Hadoop ecosystem solved the scalability problem, but many of the issues Databricks faced above remained; the core one was the inability to quickly upsert existing data.

As shown in the figure above, the ETL task synchronizes incremental updates into the analysis tables every 30 minutes and overwrites all the existing data files, resulting in high data latency and heavy resource consumption. In addition, there are streaming jobs downstream of the data lake that incrementally consume newly written data, so streaming consumption of the lake is also a must-have for them. They therefore wanted to design a data lake solution that, while satisfying the requirements of a general-purpose data lake, also supports fast upserts and streaming incremental consumption.

The Uber team implemented two table formats in Hudi, Copy On Write and Merge On Read, with Merge On Read designed to deliver fast upserts. Simply put, each batch of incremental updates is written to its own set of delta files, and these delta files are periodically merged with the existing base data files through compaction. Three different read views are then exposed to the upper-level analysis engines: reading only the delta files, reading only the base data files, and reading the merged combination of both, meeting the streaming and batch analysis needs of different business teams.
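The sketch below shows the shape of that workflow with Hudi's Spark datasource, assuming a spark-shell session with the Hudi bundle on the classpath; the table name, key fields, and paths are invented for illustration, and option names vary slightly across Hudi versions. Upserts land in a Merge On Read table, and a downstream job incrementally consumes only the commits it has not yet seen.

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

val basePath = "s3a://datalake/trips"   // hypothetical path

// A batch of changed trip records: (trip_id, city, fare, updated_at).
val updates = Seq(("t-001", "sf", 23.5, "2020-01-08 10:00:00"))
  .toDF("trip_id", "city", "fare", "updated_at")

// Upsert into a Merge On Read table: changes go into delta log files and are
// merged with the base files by periodic compaction.
updates.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")      // record key
  .option("hoodie.datasource.write.precombine.field", "updated_at")  // latest version wins
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .mode(SaveMode.Append)
  .save(basePath)

// Snapshot view: base files merged with delta files at read time.
val snapshot = spark.read.format("hudi").load(basePath)

// Incremental view: only records committed after a given instant, which is how
// downstream jobs consume the lake in a streaming fashion.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200101000000")  // last processed commit
  .load(basePath)
```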

 

In the end, Uber's data lake requirements can be distilled into the figure below, which also captures the core features Hudi focuses on.

Netflix and Apache Iceberg

 

Netflix's data lake was originally built on Hive, but after running into many flaws in Hive's design they turned to developing Iceberg in-house, which eventually grew into a highly abstract, general-purpose open source data lake project under Apache. Netflix used an internal time-series business case to illustrate Hive's problems: partitioning by a time field, a single month produced only 2,688 partitions but 2.7 million data files, and a simple SELECT query spent tens of minutes in the partition-pruning phase alone.

They found that Hive's metadata depends on both an external MySQL database and the HDFS file system: after locating the relevant partitions through MySQL, each partition requires a directory listing on HDFS, which is very time-consuming when there are many files. At the same time, because metadata is split between MySQL and HDFS, it is hard to guarantee the atomicity of a write operation; even with Hive ACID enabled, there are still corner cases where atomicity cannot be guaranteed. In addition, the Hive Metastore has no file-level statistics, so filters can only be pushed down to the partition level, not to the file level, and the resulting performance loss for upper-level analysis is unavoidable. Finally, Hive's complex semantic dependence on the underlying file system makes it difficult to build the data lake on cheaper storage such as S3.

 

To solve these pain points, Netflix designed its own lightweight data lake, Iceberg. From the start the authors positioned it as a general-purpose data lake project, so the implementation is highly abstracted. Although its feature set is not yet as rich as the other two, its solid underlying design makes it a very promising open source data lake solution once the missing features are filled in.
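A small sketch of how that abstraction shows up in practice with Iceberg's Spark integration; the catalog name, warehouse path, and table are assumptions, not from the article. The table owns its own schema and partition spec, partitioning by days(ts) is hidden from queries, and planning uses Iceberg's file-level metadata instead of directory listings.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-demo")
  .config("spark.sql.extensions",
          "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.lake.type", "hadoop")                 // or "hive"
  .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse")   // hypothetical
  .getOrCreate()

// The table carries its own schema and partition spec, independent of the engine.
spark.sql("""
  CREATE TABLE lake.db.events (
    id BIGINT,
    ts TIMESTAMP,
    payload STRING)
  USING iceberg
  PARTITIONED BY (days(ts))
""")

spark.sql("INSERT INTO lake.db.events VALUES (1, timestamp '2020-01-08 10:00:00', 'click')")

// No partition column appears in the query; Iceberg prunes files using its own
// metadata (manifests with per-file column statistics), not directory listings.
spark.sql("""
  SELECT count(*) FROM lake.db.events
  WHERE ts >= timestamp '2020-01-01 00:00:00'
""").show()
```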

 

In general, the core demands of Netflix in designing Iceberg can be summarized as follows:

Summary of pain points

 

We can put the pain points targeted by the three projects on a single picture; the features marked in red are essentially what a good data lake solution should provide.

Comparison across seven dimensions

 

Having understood the motivations and pain points of the three projects, we now compare and evaluate their differences across seven dimensions. When people select a data lake solution, Hive ACID is usually also a strong candidate, since it provides a relatively complete set of the functions people need, so we include Hive ACID in the comparison as well.

 

First, ACID and isolation level support

The main thing to explain here is what the three isolation levels mean for a data lake.

  1. Serialization: all readers and writers must execute serially;

  2. Write Serialization: multiple writers must be strictly serialized, while readers and writers can run concurrently;

  3. Snapshot Isolation: if the data written by concurrent writers does not overlap, they can execute concurrently; otherwise they must be serialized. Readers and writers can run concurrently.

Taken together, the Snapshot Isolation level offers relatively good concurrency.
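The intuition can be captured in a few lines of hypothetical pseudocode, which does not correspond to any specific project's API: each writer works against the snapshot it started from, and at commit time the transaction is rejected only if a concurrently committed writer touched overlapping data.

```scala
// Hypothetical optimistic-concurrency check, not taken from any of the projects.
final case class Commit(id: Long, filesWritten: Set[String])
final case class Txn(readSnapshot: Long, filesWritten: Set[String])

// Snapshot Isolation: concurrent writers are fine as long as their write sets
// do not intersect; otherwise the later committer must retry or fail.
def canCommitSnapshotIsolation(txn: Txn, committedAfterSnapshot: Seq[Commit]): Boolean =
  committedAfterSnapshot.forall(c => c.filesWritten.intersect(txn.filesWritten).isEmpty)

// Write Serialization: any concurrent commit at all forces the writer to retry,
// regardless of overlap; readers are never blocked in either model.
def canCommitWriteSerialization(txn: Txn, committedAfterSnapshot: Seq[Commit]): Boolean =
  committedAfterSnapshot.isEmpty
```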

Second, Schema change support and design

There are two comparison items here. One is support for schema changes: my understanding is that Hudi only supports backward-compatible DDL operations such as adding optional columns and deleting columns, while the other solutions do not have this restriction. The other is whether the data lake defines its own schema interface to decouple it from the computing engine's schema. Iceberg does better here, abstracting its own schema rather than binding to any particular engine's schema.
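For example, with Spark 3 and the Iceberg catalog configured as in the earlier sketch (the lake.db.events table there is hypothetical), schema changes go through the table's own metadata rather than the engine:

```scala
// Iceberg tracks columns by ID, so existing data files stay valid after DDL.
spark.sql("ALTER TABLE lake.db.events ADD COLUMNS (country STRING)")
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN payload TO body")

// Old rows simply read NULL for the newly added column.
spark.sql("SELECT id, country, body FROM lake.db.events").show()
```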

Third, streaming and batch interface support

Currently Iceberg and Hive do not support streaming consumption, although the Iceberg community is working on it in issue 179.
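Delta and Hudi, by contrast, already expose the lake table as a streaming source. A minimal Delta sketch, reusing the hypothetical table path from the earlier example, looks like this:

```scala
// Treat the Delta table as an unbounded source: each new commit to the table
// arrives downstream as a micro-batch.
val changes = spark.readStream
  .format("delta")
  .load("s3a://datalake/events")   // hypothetical table path

changes.writeStream
  .format("console")
  .option("checkpointLocation", "s3a://datalake/_checkpoints/events-reader")
  .start()
```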

Fourth, degree of interface abstraction and pluggability

The comparison here covers four aspects: the computing engine's write path, the computing engine's read path, pluggability of the underlying storage, and the file format. Iceberg is the data lake solution with the best level of abstraction; all four aspects are very cleanly decoupled. Delta is primarily pushed by Databricks and is naturally bound to Spark; Hudi's code is similar to Delta's and is also strongly bound to Spark. Storage pluggability means how easy it is to migrate to other distributed file systems or object stores (such as S3), which requires the data lake to have as little semantic dependence on the file system API as possible; for example, if the data lake's ACID guarantees rely heavily on an atomic rename operation in the file system, migrating to cheap storage like S3 becomes difficult, and at present only Hive has not taken this into account in its design. The file format item refers to whether the data files can be read and analyzed without the data lake's own tools, which requires the data lake not to design a proprietary file format but to use open source formats such as Parquet and Avro; the advantage is that migration cost is very low and users are not locked into a particular data lake solution.

Fifth, query performance optimization

Sixth, other functions

"One-line demo" here refers to whether the sample demo is simple enough, which reflects the ease of use of the solution. Iceberg is a little more involved (mainly, I think, because Iceberg abstracts its own schema, so the table schema must be defined before any operation). The best is actually Delta, because it closely follows Spark's emphasis on ease of use.

 

Python support is something many developers doing machine learning on top of a data lake will care about; here Iceberg and Delta are both very good options.

 

For data security, Iceberg also provides file-level encryption and decryption, an important point the other solutions have not addressed.

Seventh, the state of the communities (as of 2020-01-08)

What needs to be said here is that Delta and Hudi are both doing a relatively good job of building and promoting their open source communities. Both the open source and commercial versions of Delta provide detailed design documents that make it easy for users to understand the solution's internals and core features; Databricks also publishes a large number of technical videos and talks, and even invites its enterprise users to share their production experience with Delta. Uber's engineers likewise share many of Hudi's technical details and internal roadmap; going through the nearly ten slide decks on the official website makes the internals much easier to understand, and community members in China actively promote the project, maintaining an official technical-weekly public account and mailing list. Iceberg is comparatively quiet: most community discussion happens in GitHub issues and pull requests, there is less traffic on the mailing list, and many valuable design documents can only be found by carefully tracking issues and PRs. This is probably related to the style of the core community developers.

To sum up

 

We summarize the three products (with Delta split into the open source version and Databricks' commercial version) as follows:

If we use an analogy to illustrate the differences between Delta, Iceberg, Hudi, and Hive ACID, the four projects can be compared to building houses. Since open source Delta is a simplified version of Databricks' closed source Delta, essentially providing users with a table-format technical standard on top of which the closed source version builds many optimizations, we mainly use the closed source Delta for this comparison.

Delta's foundation is fairly strong and its functional floors are fairly tall, but the house really belongs to Databricks: at its core it exists to grow the Spark ecosystem, and it is hard for other computing engines to replace Spark on top of Delta, especially on the write path. Iceberg's architectural foundation is very solid, and it is very easy to extend to new computing engines or file systems, but its functional floors are still low; the most conspicuous missing features are upsert and compaction, and the Iceberg community is pushing both with the highest priority. Hudi's situation is somewhat different: its architectural foundation is not as strong as Iceberg's. For example, hooking up Flink as a sink would mean reworking the house from the bottom up, abstracting the interfaces while being careful not to break existing functionality. On the other hand, Hudi's functional floors are relatively complete, and its upsert and compaction features hit users' pain points directly. Hive's house looks like a mansion with most of the functionality in place, but using it as a data lake is a bit like leaning a new house against the mansion's wall, which is rather heavyweight; and, as Netflix's analysis above shows, that wall itself has some problems.


Source: blog.csdn.net/yyoc97/article/details/107474023