Thinking and Implementation of ByteDance Data Lake Technology Selection

This article is based on a talk given by the ByteDance data platform development kit team at the Flink Forward Asia 2021 summit, sharing the selection thinking and exploration practice behind ByteDance's data lake technology.

Text | Gary Li, Senior R&D Engineer on the ByteDance Data Platform Development Kit team and PMC member of the open source data lake project Apache Hudi

With the continuous development of the Flink community, more and more companies use Flink as their preferred big data computing engine. ByteDance, as one of many Flink users, keeps exploring Flink, and its investment in Flink grows year by year.

The Status Quo of ByteDance Data Integration

In 2018, we constructed a batch synchronization channel between heterogeneous data sources based on Flink, which is mainly used for importing online databases into offline data warehouses and batch transmission between different data sources.

In 2020, we constructed a real-time data integration channel of MQ-Hive based on Flink, which is mainly used to write the data in the message queue to Hive and HDFS in real time, and achieve stream-batch unification on the computing engine.

In 2021, we constructed a real-time data lake integration channel based on Flink, completing the construction of a lakehouse-integrated data integration system.

The ByteDance data integration system currently supports dozens of different data transmission pipelines, covering online databases such as MySQL, Oracle, and MongoDB; message queues such as Kafka and RocketMQ; and various components of the big data ecosystem such as HDFS, Hive, and ClickHouse.

Inside ByteDance, the data integration system serves almost all business lines, including well-known applications such as Douyin and Toutiao.

The whole system is mainly divided into three modes: batch integration, streaming integration, and incremental integration.

  • The batch integration mode is based on the Flink Batch mode, transmitting data in batches between different systems; it currently supports more than 20 data source types.
  • The streaming integration mode mainly imports data from MQ to Hive and HDFS; the stability and real-time performance of these tasks are widely recognized by users.
  • The incremental mode, i.e. the CDC mode, synchronizes data changes to external components by consuming the database change log (Binlog).

This mode currently supports five data sources. Although the number of sources is small, the number of tasks is very large, and they include many core links, such as billing and settlement for each business line, which demand very high data accuracy.

The overall CDC link is relatively long. The first import is a batch import: we connect directly to the MySQL database through Flink Batch mode, pull the full data, and write it to Hive, while incremental Binlog data is imported to HDFS through streaming tasks.

Since Hive does not support update operations, we still use a Spark-based batch job to merge the previous day's Hive table with the newly added Binlog via a T-1 incremental merge, producing the current day's Hive table.
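The T-1 merge logic can be sketched in a few lines of Python (the field names `id` and `amount`, and the event shape, are illustrative, not from the real Spark pipeline). The sketch makes clear why every run must materialize the full table, even when the day's Binlog is small:

```python
def t_minus_1_merge(yesterday_snapshot, binlog_events):
    """Sketch of the T-1 incremental merge: apply a day's Binlog
    (insert/update/delete events) onto yesterday's full Hive snapshot
    to produce today's snapshot. Records are keyed by primary key,
    and later events win."""
    # Full-table materialization: every run touches all historical rows.
    table = {row["id"]: row for row in yesterday_snapshot}
    for event in binlog_events:  # events assumed ordered by commit time
        if event["op"] == "delete":
            table.pop(event["id"], None)
        else:  # insert or update
            table[event["id"]] = {k: v for k, v in event.items() if k != "op"}
    return list(table.values())

snapshot = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
binlog = [{"op": "update", "id": 1, "amount": 15},
          {"op": "insert", "id": 3, "amount": 30}]
today = t_minus_1_merge(snapshot, binlog)
```

In the distributed Spark version, building that keyed view is exactly the full-data shuffle and full-data disk write whose cost motivated the architecture upgrade.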

With the rapid development of business, more and more problems are exposed by this link.

First of all, this Spark-based offline link consumes substantial resources. Every time new data is merged, it involves a full-data shuffle and a full-data write to disk, so the intermediate storage and compute consumption is heavy.

At the same time, with the rapid development of ByteDance business, the demand for near real-time analysis is also increasing.

Finally, the entire link is too long, involving two computing engines, Spark and Flink, and three different task types. The cost of use and learning for users is high, and it brings considerable operation and maintenance costs.

In order to solve these problems, we hope to make a complete architecture upgrade to the incremental mode, and merge the incremental mode into the streaming integration, so as to get rid of the dependence on Spark and achieve unity at the computing engine level.

After the transformation is completed, the Flink-based data integration engine can support batch, streaming and incremental modes at the same time, covering almost all data integration scenarios.

At the same time, the incremental mode provides data latency equivalent to the streaming channel, giving users near real-time analysis capabilities, while further reducing computational costs and improving efficiency.


After some exploration, we noticed the emerging data lake technology.

Thoughts on the selection of data lake technology

Our focus is on Iceberg and Hudi, two open source data lake frameworks under the Apache Software Foundation.

Both Iceberg and Hudi data lake frameworks are excellent. But the two projects were created to solve different problems, and so have different functional focuses:

  • Iceberg: The core abstraction is relatively inexpensive to interface with new computing engines, and provides advanced query optimization capabilities and complete schema changes.
  • Hudi: Focuses more on efficient Upserts and near real-time updates, providing the Merge On Read file format, and incremental query functions that facilitate building incremental ETL pipelines.

After comparison, we found that the two frameworks each have their own advantages and disadvantages, and both are far from the final form of the data lake we envision, so our thinking concentrated on two core questions:

Which framework can better support the core demands of our CDC data processing?

Which framework can quickly complement the functions of another framework and grow into a general and mature data lake framework?

After many internal discussions, we believe that Hudi is more mature in processing CDC data, and the community iteration speed is very fast, especially in the past year, many important functions have been completed, and the integration with Flink has become more mature. We chose Hudi as our data lake base.

01 - Indexing System

The most important reason we chose Hudi is its indexing system.

This graph compares writing with and without an index.

When writing CDC data, for a newly arrived Update record to take effect on the base table, we need to know exactly whether this record has appeared before and where, so that it can be written to the correct place. When merging, we then only need to merge a single file rather than process the global data.

Without an index, the merge can only be performed over the global data, which brings a global shuffle.

In the example in the figure, the merge cost without an index is twice that with an index, and as the amount of data in the base table grows, this performance gap widens exponentially.

Therefore, at the scale of ByteDance's business data, the performance benefits brought by indexing are huge.

Hudi provides a variety of indexes to adapt to different scenarios. Each index has different advantages and disadvantages, and the choice of index needs to be based on the specific data distribution to achieve the optimal write and query performance.

Below are examples of two different scenarios.

1. Log data deduplication scenario

In the log deduplication scenario, the data usually carries a create_time timestamp, and the base table is partitioned by this timestamp. Data from recent hours or days is updated frequently, while older data rarely changes.

Such hot-and-cold partition scenarios are well suited to Bloom indexes, State indexes with TTL, and hash indexes.
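To see why a Bloom index fits hot-and-cold partitions, note that each data file can carry a Bloom filter over its keys: an incoming update probes only the filters and opens the few files that might contain the key, so cold files are skipped without any data I/O. A minimal Python sketch of the idea (deliberately simplified, not Hudi's actual implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash probes. Answers
    'definitely absent' or 'possibly present' (false positives allowed)."""
    def __init__(self, size=1024, k=3):
        self.size, self.k, self.bits = size, k, 0

    def _probes(self, key):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._probes(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._probes(key))

# One filter per data file: an update only opens candidate files.
file_filters = {"file_a": BloomFilter(), "file_b": BloomFilter()}
file_filters["file_a"].add("key_1")
file_filters["file_b"].add("key_2")
candidates = [f for f, bf in file_filters.items() if bf.might_contain("key_1")]
```

Because cold partitions almost never match an incoming key, most files are ruled out by a cheap in-memory check, which is exactly the access pattern of log deduplication.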

2. CDC scene

The second example is a database export, i.e. the CDC scenario. Here, updated data is randomly distributed with no pattern at all; the base table is relatively large, and the newly added data is usually smaller than the base table.

In this scenario, we can choose the hash index, the State index, or the HBase index to achieve efficient global indexing.

These two examples illustrate that in different scenarios, the choice of index determines the read and write performance of the entire table. Hudi provides a variety of out-of-the-box indexes that cover most scenarios, at a very low cost to users.

02 - Merge On Read Table Format

In addition to the indexing system, Hudi's Merge On Read table format is another core feature we value. This table format enables real-time writes and near real-time queries.

In the construction of the big data system, there is a natural conflict between the writing engine and the query engine:

The writing engine prefers writing small files in a row-oriented data format, to avoid as much computational burden as possible during writing; ideally it appends record by record.

The query engine prefers reading large files stored in columnar formats, such as Parquet and ORC, with the data strictly distributed according to certain rules (for example, sorted by a commonly used field), so that queries can skip scanning useless data and reduce computational overhead.

To find the best trade-off in this natural conflict, Hudi supports the Merge On Read file format.

The MOR format contains two kinds of files: log files in the row-based Avro format, and base files in a columnar format such as Parquet or ORC.

The log file is usually small and contains newly updated data. The base file is large and contains all the historical data.

The write engine can write updated data to the log file with low latency.

The query engine merges the log file with the base file when reading, so that the latest view can be read; the compaction task periodically triggers the merge of the base file and the log file to avoid continuous expansion of the log file. Under this mechanism, the Merge On Read file format achieves real-time writing and near real-time querying.
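The read-side merge and the compaction step described above can be sketched as follows (a simplified model using dict records keyed by primary key, not Hudi's actual file formats):

```python
def read_latest_view(base_file, log_file):
    """Merge On Read sketch: the columnar base file holds history; the
    row-based log file holds recent updates. A snapshot read merges the
    two, with log records overriding base records by primary key."""
    view = {r["key"]: r for r in base_file}
    for r in log_file:          # log entries are newer, so they win
        view[r["key"]] = r
    return sorted(view.values(), key=lambda r: r["key"])

def compact(base_file, log_file):
    """Compaction sketch: fold the log into a new base file so the log
    does not grow without bound; later queries read the new base alone."""
    return read_latest_view(base_file, log_file)

base = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
log = [{"key": 2, "v": "b2"}, {"key": 3, "v": "c"}]
latest = read_latest_view(base, log)
```

The writer only appends to `log`, which is cheap; the merge cost is paid either at query time or, periodically and off the write path, at compaction time.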

03 - Incremental Calculation

The indexing system and Merge On Read format lay a very solid foundation for the real-time data lake, and incremental computing is another dazzling feature of Hudi based on this foundation:

Incremental computing gives Hudi a message-queue-like capability. Users can pull new data within a period of time on Hudi's timeline through a timestamp similar to offset.

In some scenarios where the data delay tolerance is at the minute level, Hudi can unify the Lambda architecture, serve both real-time and offline scenarios, and achieve stream-batch integration in storage.
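The offset-style incremental pull can be sketched by modeling the timeline as a map from instant time to the records committed at that instant (the `yyyyMMddHHmmss`-style instants mimic Hudi's format, but the structure is illustrative):

```python
def incremental_pull(timeline, since_instant):
    """Incremental query sketch: Hudi's timeline is an ordered sequence of
    commits, each tagged with an instant time. Like consuming from a
    message-queue offset, a reader asks for everything after an instant."""
    return [rec
            for instant, records in sorted(timeline.items())
            if instant > since_instant
            for rec in records]

timeline = {
    "20211201080000": [{"id": 1}],
    "20211201081000": [{"id": 2}, {"id": 3}],
    "20211201082000": [{"id": 4}],
}
new_records = incremental_pull(timeline, "20211201080000")
```

A downstream ETL job simply remembers the last instant it consumed and pulls forward from there, which is what makes incremental pipelines on Hudi resemble reading from a message queue.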

Practical thinking on the internal scene of ByteDance

After choosing Hudi as the data lake base, we created a customized implementation plan for ByteDance's internal scenarios. Our goal is to support all data links with updates via Hudi:

  • Efficient and low-cost Upserts
  • High-throughput support
  • End-to-end data visibility within 5-10 minutes

With the goal clear, we started testing the Hudi Flink Writer. This diagram shows the architecture of the Hudi on Flink Writer: when a new record arrives, it first passes through an index layer to find where it needs to go.

  • The State index saves the one-to-one mapping between all primary keys and file IDs. For Update data, the existing file ID is found. For Insert data, the index layer assigns it either a new file ID or the file ID of a historical small file, filling up that small file and thus avoiding the small-file problem.
  • After the index layer, every record carries a file ID, and Flink performs a shuffle on the file ID, routing records with the same file ID to the same subtask, which avoids multiple tasks writing to the same file.
  • The write subtask has a memory buffer that stores all the data of the current batch. When a Checkpoint is triggered, the data in the subtask buffer is handed to the Hudi Client, which performs operations in micro-batch mode, such as Insert/Upsert/Insert overwrite. The computation logic of each operation differs: for example, Insert generates a new file, while Upsert may merge with a historical file.
  • After the computation completes, the processed data is written to HDFS and metadata is collected at the same time.
  • The Compaction task is part of the streaming task. It periodically polls Hudi's timeline to check whether a Compaction plan exists; if so, the plan is executed by a separate Compaction operator.
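The index-then-shuffle part of this write path can be sketched as follows (a simplified single-process model; `new_file_id` stands in for Hudi's file-ID assignment and small-file selection logic, and the subtask routing imitates Flink's keyBy on file ID):

```python
from collections import defaultdict

def route_records(records, index, num_subtasks, new_file_id):
    """Sketch of the Hudi-on-Flink write path: the index maps each key to
    a file ID (assigning new IDs to inserts), then records are 'shuffled'
    so all records for one file ID land in the same write subtask."""
    subtask_buffers = defaultdict(list)
    for rec in records:
        file_id = index.get(rec["key"])
        if file_id is None:                      # insert: assign a file ID
            file_id = new_file_id(rec)
            index[rec["key"]] = file_id
        subtask = hash(file_id) % num_subtasks   # keyBy(file_id)
        subtask_buffers[subtask].append((file_id, rec))
    return subtask_buffers  # flushed to the Hudi Client at checkpoint

index = {"a": "f1"}                              # pre-existing key "a"
records = [{"key": "a", "v": 1}, {"key": "b", "v": 2}]
buffers = route_records(records, index, num_subtasks=4,
                        new_file_id=lambda rec: "f2")
```

Because the subtask is a pure function of the file ID, two records destined for the same file can never be written concurrently by different subtasks.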

During testing, we encountered the following problems:

  • In scenarios with a large amount of data, the mapping between all primary keys and file IDs lives in the State. The State volume expands very quickly, which brings extra storage overhead and sometimes causes Checkpoint timeouts.
  • The second problem is that during Checkpoint, the Hudi Client performs heavy operations, such as merging with the underlying base file. This involves reading historical files, deduplicating, and writing new files, so HDFS jitter easily causes Checkpoint timeouts.
  • The third problem is that the Compaction task is part of the streaming task, and its resources cannot be adjusted after the task starts. If adjustment is required, the entire task must be restarted, which is expensive. If Compaction is over-provisioned, resources run idle and are wasted; if under-provisioned, the task fails.

To address these issues, we started customizing optimizations for our scenarios.

ByteDance's customized optimization technical solution

01 - Index Layer

The purpose of the index is to find the location of the file holding the current record. If the index lives in the State, each record involves a read and a write of the State, and with large data volumes the computational and storage costs are both high.

ByteDance internally developed a hash-based index, which finds the file location by directly hashing the primary key. This method achieves a global index on non-partitioned tables and bypasses the dependency on State; after the transformation, the index layer becomes a simple hash operation.
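A minimal sketch of such a hash index (the bucket count and hash choice are illustrative; the key property is that the file location is a pure function of the primary key, so no State lookup or external store is needed):

```python
import hashlib

NUM_BUCKETS = 8  # fixed when the table is created (illustrative value)

def bucket_of(primary_key: str) -> int:
    """Hash-based index sketch: the target file (bucket) is computed
    directly from the primary key. md5 is used for a stable hash, since
    Python's built-in hash() is salted per process."""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Every writer and reader computes the same bucket for the same key,
# which is what makes the index global without any shared state.
```

The trade-off is that the bucket count must be chosen up front to match the data volume, whereas a State index adapts as keys arrive.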

02 - Write Layer

Early Hudi writing was strongly bound to Spark. At the end of 2020, the Hudi community split out the underlying Hudi Client and added support for the Flink engine. This transformation turned the Spark RDD operations into List operations, so the bottom layer is still a batch operation. For Flink, the computation required at each Checkpoint is similar to a Spark RDD job, which is equivalent to executing a batch operation and carries a relatively heavy computational burden.

The specific process of the write layer is: after a record passes through the index layer, it arrives at the write layer. The data first accumulates in Flink's memory buffer, with memory monitoring to avoid task failures from exceeding the memory limit. When the Checkpoint arrives, the data is handed to the Hudi Client, which computes the final output through operations such as Insert, Append, and Merge. After the computation completes, the new files are written to HDFS and the metadata is collected at the same time.

Our core goal was to make this micro-batch write mode more streaming-like, thereby reducing the computational burden during Checkpoint.

In terms of table format, we chose the Merge On Read format, which is better suited to streaming writes. The write operator is only responsible for appending to the log file and does no other extra work, such as merging with the base file.

In terms of memory, we removed Flink's first-layer memory buffer and built the buffer directly in the Hudi Client, monitoring memory while writing data to avoid exceeding the limit. We also decoupled the HDFS write from the Checkpoint: while the task runs, each small batch of data is written to HDFS as it arrives. Since HDFS supports append operations, this does not cause a small-file problem, making the Checkpoint as lightweight as possible and avoiding Checkpoint timeouts caused by HDFS jitter or excessive computation.
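The decoupling of flushing from Checkpoint can be sketched like this (an in-memory stand-in for the HDFS append stream; the threshold and class structure are illustrative, not the actual Hudi Client):

```python
class StreamingWriter:
    """Sketch of the reworked write layer: small batches are appended to
    the log file on HDFS continuously, while Checkpoint only seals the
    metadata of what was already flushed, keeping Checkpoint lightweight."""
    def __init__(self, flush_threshold=3):
        self.buffer = []          # in-client memory buffer
        self.log_file = []        # stand-in for the HDFS log file
        self.committed = 0        # records covered by the last commit
        self.flush_threshold = flush_threshold

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_threshold:  # memory watermark
            self._flush()

    def _flush(self):
        self.log_file.extend(self.buffer)  # HDFS append: no new small file
        self.buffer.clear()

    def checkpoint(self):
        self._flush()                        # drain the remainder
        self.committed = len(self.log_file)  # commit metadata only
        return self.committed

w = StreamingWriter(flush_threshold=2)
for i in range(5):
    w.write(i)
committed = w.checkpoint()
```

Because most data is already on HDFS before the Checkpoint fires, the Checkpoint itself only has to flush a small remainder and record metadata, instead of performing a full micro-batch write.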

03- Compaction layer

The Compaction task is essentially a batch task, so it needs to be decoupled from streaming writes. Currently, Hudi on Flink supports asynchronous execution of Compaction, and all our online tasks use this mode.

In this mode, streaming tasks can focus on writing, improving throughput and write stability, while batch-type Compaction tasks, decoupled from the streaming tasks, can elastically scale to use computing resources efficiently, focusing on resource utilization and cost savings.
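A sketch of the asynchronous split (the timeline and plan structures are simplified stand-ins for Hudi's): the streaming writer only marks compaction plans as requested on the timeline, and a separate batch task polls the timeline and executes pending plans with its own resources:

```python
def find_and_run_compactions(timeline, run_compaction):
    """Sketch of asynchronous compaction: the streaming writer only
    *schedules* plans on the timeline; this separately deployed batch
    task polls the timeline and executes whatever is still pending,
    so write resources and compaction resources scale independently."""
    executed = []
    for instant, action in sorted(timeline.items()):
        if action["type"] == "compaction" and action["state"] == "requested":
            run_compaction(action)        # merge base + log into a new base
            action["state"] = "completed"
            executed.append(instant)
    return executed

timeline = {
    "001": {"type": "commit", "state": "completed"},
    "002": {"type": "compaction", "state": "requested"},
}
merged_plans = []
executed = find_and_run_compactions(timeline, merged_plans.append)
```

If compaction falls behind, more batch workers can be launched without restarting the streaming job, which is exactly the elasticity the in-task Compaction operator lacked.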

After this series of optimizations, we tested on a Kafka data source at 2 million rows per second, importing into Hudi with a parallelism of 200. Compared with before, the Checkpoint time dropped from 3-5 minutes to under 1 minute, and the task failure rate caused by HDFS jitter was also greatly reduced. Because the Checkpoint time shrank, more of the task's time actually goes to processing data: data throughput doubled, while State storage overhead was reduced to a minimum.

This is the final CDC data import flowchart.

First, the various databases send their Binlog to the message queue. The Flink task converts all records into the HoodieRecord format and finds the corresponding file ID through the hash index. After a shuffle on the file ID, the records arrive at the write layer, where the write operator frequently appends data to HDFS. When a Checkpoint is triggered, Flink collects all the metadata and writes it into Hudi's metadata system; this marks the completion of a Commit, and a new Commit starts accordingly.

Users can query the committed data in near real time through query engines such as Flink, Spark, and Presto.

The Compaction service hosted on the data lake platform will periodically submit the Compaction task in Flink Batch mode to compress the Hudi table. This process is imperceptible to the user and does not affect the writing task.

This complete set of solutions will also be contributed to the community; interested readers can follow the latest developments in the Hudi community.

Typical landing scenarios of streaming data lake integration framework

After the transformation of the streaming data lake integration framework, we found some typical landing scenarios:

The most common application is importing an online database into an offline data warehouse for analysis. Compared with the previous Spark offline link, the end-to-end data latency has been reduced from more than an hour to 5-10 minutes, and users can perform near real-time data analysis.

In terms of resource utilization, we simulated a scenario of importing MySQL into an offline data warehouse for analysis, comparing Flink streaming import into Hudi against Spark offline merging. In a scenario with hourly user queries, end-to-end computing resources were saved by about 70%.

In the data warehouse scenario of ByteDance's EB-level data volume, the benefits brought by this improvement in resource utilization are huge.

Users who build real-time data warehouses based on message queues and Flink can import the real-time data of different warehouse levels into Hudi. Such data is frequently updated, so compared with Hive, Hudi provides efficient, low-cost Upsert operations, letting users query the full data in near real time and avoiding a separate deduplication step.

This is a Flink dual-stream Join scenario. Many Flink users use dual-stream Join for real-time field stitching. To use this function, users usually open a time window and stitch together the data arriving from different sources within that window. This field stitching can also be implemented at the Hudi level.

We are exploring a function that simply unions the data of different topics in Flink, writes records with the same primary key into the same file through Hudi's indexing mechanism, and then stitches the data together during the Compaction operation.

The advantage of this method is that we can perform global field splicing through Hudi's indexing mechanism without being limited by a window.

The entire stitching logic is implemented via HoodiePayload. Users can simply extend HoodiePayload and develop their own custom stitching logic. The stitching can happen either in the Compaction task or in the Merge On Read near real-time query, making flexible use of computing resources. However, compared with Flink dual-stream Join, this mode is less real-time and less easy to use.
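Payload-style stitching can be sketched in Python (the real interface is Java's HoodieRecordPayload; here a hypothetical `combine` function plays the role of its merge method, with `None` standing in for fields a topic did not supply):

```python
def combine(current, incoming):
    """Sketch of HoodiePayload-style field stitching: two partial records
    for the same primary key are merged field by field, with non-null
    incoming values overriding the stored ones."""
    merged = dict(current)
    merged.update({k: v for k, v in incoming.items() if v is not None})
    return merged

# Two topics each carry a slice of the row. They are unioned in Flink,
# routed to the same file by the index, and stitched at compaction time
# (or during a Merge On Read query), with no window constraint.
left = {"key": 1, "name": "alice", "city": None}
right = {"key": 1, "name": None, "city": "beijing"}
full_row = combine(left, right)
```

Because the merge happens in storage rather than inside a join window, slices that arrive hours apart still stitch correctly, at the cost of the latency and ease-of-use trade-offs noted above.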


After this series of work, we are full of expectations for the future of the data lake, and we also set clear goals.

First of all, we hope to use Hudi as the underlying storage for all CDC data sources, completely replacing the Spark-based offline merge scheme with streaming import through the data integration engine, bringing near real-time offline analysis capabilities to all online databases.

Next, incremental ETL is also an important landing scenario. For scenarios where minute-level data latency is tolerable, Hudi can serve as unified storage for both real-time and offline links, upgrading the traditional Lambda data warehouse architecture to true stream-batch integration.

Finally, we hope to build an intelligent data lake platform, which will host the operation and maintenance management of all data lakes and achieve a state of self-governance, so that users do not need to worry about operation and maintenance.

At the same time, we hope to provide automatic tuning that finds the best configuration based on the data distribution, for instance the performance trade-offs between the different indexes mentioned above. We hope algorithms can find the optimal configuration, improving resource utilization and lowering the threshold for users.

An excellent user experience is also one of our pursuits. We hope to offer one-click data lake ingestion on the platform side, greatly reducing users' development costs.

The data lake integration technology has also been opened to external users through DataLeap, the big data R&D governance suite of Volcano Engine.

Welcome to follow the ByteDance Data Platform official account of the same name.

