Highly reliable real-time ETL system based on Flink

GIAC (Global Internet Architecture Conference) is an annual technology architecture conference for architects, technical leaders, and senior technical practitioners, jointly launched by the high-availability architecture technology community and msup, with a long-standing focus on Internet technology and architecture. It is one of the largest technology conferences in China.

At this year's sixth GIAC conference, on the big data architecture track, Shi Xiaogang, head of real-time computing in Tencent's data platform department, delivered a keynote speech titled "A Flink-based Highly Reliable Real-Time ETL System". The following is a transcript of the talk:

Shi Xiaogang holds a doctorate from Peking University and is a committer on the Apache Flink project. He has published papers in top international conferences and journals such as SIGMOD, TODS, and IPDPS, and has served on the program committees of top international conferences such as KDD and DASFAA.

Real-time computing platform Oceanus

In recent years, real-time computing has been widely used within Tencent. To improve the efficiency of continuous integration and continuous delivery for users' stream computing tasks, the Tencent big data team built Oceanus around Flink in 2017: a one-stop, visual real-time computing platform that integrates development, testing, deployment, and operations.

Oceanus provides three application development methods, Canvas, SQL, and Jar, to meet the needs of different users. With these three methods, users in different application scenarios can quickly develop real-time computing tasks without understanding the technical details of the underlying framework, lowering the barrier to entry for development.

After completing job development, users can test, configure, and deploy the job through Oceanus. Oceanus provides a set of tools to assist with job testing: users can either use the one-click generation function provided by Oceanus to produce test data, or upload their own test data, and verify the correctness of the application logic by comparing expected results with actual results. For resource management and job deployment, Oceanus relies on Gaia, Tencent's internal resource scheduling system. Users can configure the CPU and memory resources required by a job through Oceanus and specify the cluster on which the job should be deployed. Once configuration is complete, Oceanus applies to Gaia for the corresponding resources and submits the job to Gaia to run.

Oceanus collects a number of runtime metrics from running Flink jobs, including TaskManager memory, I/O, and GC. These rich metrics give users a clear picture of how their applications are running and help them locate problems quickly when anomalies occur. Operations personnel can also use the collected metrics to set alerting policies and implement fine-grained operations.

On top of Oceanus, Tencent Big Data also provides scenario-based support for common real-time computing tasks such as ETL, monitoring and alerting, and online learning. For example, Oceanus-ML provides end-to-end online machine learning, covering the entire process of data access, data processing, feature engineering, algorithm training, model evaluation, and model deployment. With Oceanus-ML, users can draw on complete data processing functions and a rich set of online learning algorithms to build their own online learning tasks, easily complete model training and evaluation, and deploy models with one click.

For ETL scenarios, Oceanus also provides Oceanus-ETL, which helps users import data collected from applications and products into the data warehouse in real time. At present, the Tencent big data team provides data access services for Tencent's internal businesses including WeChat, QQ Music, and Tencent Games, processing more than 40 trillion messages per day with a peak of over 4 billion accesses per second.

Real-time data access platform Oceanus-ETL

Tencent Big Data began its data access work as early as 2012, building the first-generation Tencent Data Bank (TDBank) on Storm. TDBank served as the entry point of the Tencent big data platform, providing multiple access methods such as files, messages, and databases, unifying the data access entry, and providing efficient, real-time distributed data distribution.

In 2017, given Flink's advantages in usability, reliability, and performance, Tencent Big Data rebuilt TDBank's data access on Flink. Compared with Storm, Flink provides much better support for state. On the one hand, Flink keeps program state in local memory or in RocksDB, so users do not need to access state data remotely over the network and can obtain better job performance. On the other hand, Flink provides an efficient, lightweight checkpoint mechanism based on the Chandy-Lamport algorithm, which guarantees Exactly Once or At-Least Once data processing semantics even in the event of failures.
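To make this concrete, here is a minimal sketch, not taken from the talk, of configuring a Flink job with a RocksDB state backend and Exactly Once checkpointing; the checkpoint path, interval, and job name are placeholder values.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep operator state locally in RocksDB; checkpoints are written to a DFS path.
        // The HDFS path is a placeholder, not a value from the article.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Take a checkpoint every 60s with Exactly Once (barrier-aligned) snapshots,
        // leaving at least 30s between the end of one checkpoint and the start of the next.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // Placeholder pipeline; the real ETL sources and sinks go here.
        env.fromElements("event-1", "event-2").print();

        env.execute("oceanus-etl-sketch");
    }
}
```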

As Tencent's business continues to grow in scale, higher requirements have been placed on data access:

  • Guarantee end-to-end Exactly Once semantics and strong consistency

  • Provide ACID transactions and read-write isolation to avoid problems such as dirty reads downstream

  • Support data correction and format changes

To meet these requirements, we introduced Iceberg this year, using the ACID transaction mechanism and incremental update capabilities it provides to build a more reliable and powerful data access service.

Implementing end-to-end Exactly Once transmission based on Flink

Flink uses its checkpoint mechanism to back up and restore task state. When a task fails, it can be restored from the most recent checkpoint instead of being re-executed from the beginning. Through checkpoints, Flink can guarantee Exactly Once data transmission even when failures occur.

However, the complete data access link includes upstream middleware and the downstream data warehouse in addition to Flink. Relying on Flink's checkpoint mechanism alone only guarantees Exactly Once data transmission within the Flink job; it cannot guarantee end-to-end Exactly Once semantics across the entire link. If the data received by Flink were written directly to the downstream storage system, then when Flink fails and recovers, the data written downstream after the last checkpoint would be written again, causing errors in subsequent analysis.

To guarantee end-to-end Exactly Once data transmission, TDBank implements a two-phase commit protocol on top of Flink's checkpoint mechanism, and aggregates and reconciles the metrics produced at each stage of the data access link to verify the reliability of end-to-end transmission.

To guarantee Exactly Once on the data link, we first write the data received by Flink into a temporary directory and keep a list of the files written. When a checkpoint is executed, this file list is saved into the checkpoint. When the checkpoint completes, Flink notifies all nodes, and at that point each node moves the files recorded in the checkpoint into the official directory.

In this implementation, Flink's existing checkpoint mechanism serves as a two-phase commit mechanism. All nodes perform the pre-commit when taking a checkpoint, writing all data to reliable distributed storage. Once the checkpoint completes on the JobManager, the transaction is considered committed, and every node finishes the final commit after receiving the checkpoint-success notification.

If a node fails while moving files in this final step, the Flink job is restored from the last completed checkpoint, from which a complete file list is obtained. The job then checks the files in this list and moves any file that has not yet been moved into the final directory.
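The sketch below illustrates this pattern using Flink's public CheckpointedFunction and CheckpointListener interfaces. It is a simplified stand-in rather than TDBank's actual sink: it writes one file per record on the local file system, uses placeholder paths and file names, and assumes at most one concurrent checkpoint.

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointListener;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// A simplified two-phase file sink: data is written to a temporary directory,
// the file list is recorded in the checkpoint (pre-commit), and the files are
// moved to the final directory once the checkpoint completes (commit).
public class TwoPhaseFileSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private final String tmpDir;
    private final String finalDir;

    // Files pre-committed in checkpoints but not yet moved to the final directory.
    private transient ListState<String> pendingFiles;
    // Files written since the last checkpoint barrier.
    private final List<String> currentFiles = new ArrayList<>();

    public TwoPhaseFileSink(String tmpDir, String finalDir) {
        this.tmpDir = tmpDir;
        this.finalDir = finalDir;
    }

    @Override
    public void invoke(String record, Context context) throws Exception {
        // One file per record for brevity; the real writer batches many records per file.
        Path file = Paths.get(tmpDir, "part-" + System.nanoTime());
        Files.write(file, record.getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE_NEW);
        currentFiles.add(file.getFileName().toString());
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Pre-commit: record the written files in the checkpoint.
        for (String f : currentFiles) {
            pendingFiles.add(f);
        }
        currentFiles.clear();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // Commit: the checkpoint succeeded, so publish the pre-committed files.
        moveAll();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        pendingFiles = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending-files", String.class));
        if (context.isRestored()) {
            // Recovery: publish files the failed run pre-committed but never moved.
            moveAll();
        }
    }

    private void moveAll() throws Exception {
        for (String f : pendingFiles.get()) {
            Path src = Paths.get(tmpDir, f);
            if (Files.exists(src)) { // files already moved are skipped
                Files.move(src, Paths.get(finalDir, f), StandardCopyOption.ATOMIC_MOVE);
            }
        }
        pendingFiles.clear();
    }
}
```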

To ensure that data is neither lost nor duplicated across the entire access process, we collect and reconcile the number of records sent and received by each component in the data link. Since a general-purpose metrics system cannot guarantee the timeliness and correctness of these metrics, we also built highly reliable, consistent metric aggregation on top of Flink.

Similar to the data link, we use Flink's checkpoint mechanism to guarantee the consistency of the metric data. We use Flink to aggregate the collected metrics at minute granularity and save the aggregates to external storage when a checkpoint is taken. When saving aggregated metrics, in addition to the usual tags, we also attach the ID of the checkpoint in which they were written. When a checkpoint completes, each node records the completed checkpoint ID in external storage as well. To query the metrics, we simply join the completed checkpoint IDs with the aggregated metrics to obtain consistent results.
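Below is a minimal sketch of this idea, again assuming at most one concurrent checkpoint. MinuteCount and MetricStore are hypothetical stand-ins for the per-minute aggregate type and the external storage client; they are not Flink or Oceanus APIs.

```java
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointListener;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-minute aggregate emitted by the upstream aggregation.
class MinuteCount {
    final String minuteKey;
    final long count;
    MinuteCount(String minuteKey, long count) { this.minuteKey = minuteKey; this.count = count; }
}

// Hypothetical client for the external metric storage.
interface MetricStore extends Serializable {
    void writeAggregates(Map<String, Long> aggregates, long checkpointId);
    void markCheckpointCompleted(long checkpointId);
}

// Writes aggregates tagged with the checkpoint id at checkpoint time, then records
// the completed checkpoint id; queries join the two to read a consistent view.
public class ConsistentMetricSink extends RichSinkFunction<MinuteCount>
        implements CheckpointedFunction, CheckpointListener {

    private final MetricStore store;
    private final Map<String, Long> buffer = new HashMap<>(); // minute key -> count

    public ConsistentMetricSink(MetricStore store) {
        this.store = store;
    }

    @Override
    public void invoke(MinuteCount metric, Context context) {
        buffer.merge(metric.minuteKey, metric.count, Long::sum);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Aggregates written here carry the checkpoint id; rows from checkpoints
        // that never complete are simply ignored at query time.
        store.writeAggregates(new HashMap<>(buffer), context.getCheckpointId());
        buffer.clear();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Queries only consider aggregates whose checkpoint id has been marked completed.
        store.markCheckpointCompleted(checkpointId);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // After a failure the job restarts from the last completed checkpoint, and the
        // replayed input rebuilds any aggregates that were buffered but never written.
    }
}
```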

Through Flink's checkpoint mechanism, we ensure consistency on both the data link and the metric link, achieving end-to-end Exactly Once data transmission across the entire data access pipeline.

ACID real-time data access based on Iceberg

Apache Iceberg is a general table format (data organization format) that can work with engines such as Presto and Spark, providing high-performance reads, writes, and metadata management. Iceberg sits above the storage layer and below the compute engine. Although Iceberg calls itself a "table format", it is, more precisely, a data organization layer between the compute engine and the underlying data storage format: data and metadata are organized in a specific way, so "data organization format" is the more accurate name.

Iceberg implements ACID through a locking mechanism: each metadata update acquires a lock from the metastore before applying the change. Iceberg also guarantees serializable isolation, so table modifications are atomic and readers never see partial or uncommitted data. To reduce the impact of locking, Iceberg uses optimistic concurrency, resolving conflicts from concurrent writes by rolling back and retrying the commit.

On top of its ACID capabilities, Iceberg provides MVCC-like read-write separation. Each write operation produces a new snapshot, and snapshot IDs grow monotonically, preserving a linear history. Read operations only see committed snapshots; a snapshot that is still being produced is invisible to readers. Because each snapshot captures all of the table's data and metadata at that moment, users can travel back in time: they can read the table as of an earlier snapshot, roll back to a snapshot, or replay data.
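The snippet below is a small illustration of snapshots and time travel using Iceberg's Java API; the HadoopCatalog, warehouse path, and table name are assumptions made for the example, and it presumes the table already has at least two snapshots.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class IcebergTimeTravelSketch {
    public static void main(String[] args) {
        // A HadoopCatalog rooted at a placeholder path; any Iceberg catalog behaves the same.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs:///warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("ods", "events"));

        // Every committed write produces a new snapshot.
        for (Snapshot snapshot : table.snapshots()) {
            System.out.println(snapshot.snapshotId() + " committed at " + snapshot.timestampMillis());
        }

        // Time travel: scan the table as of an earlier snapshot (here, the current
        // snapshot's parent; assumes the table has more than one snapshot).
        long previousSnapshotId = table.currentSnapshot().parentId();
        TableScan scan = table.newScan().useSnapshot(previousSnapshotId);
        scan.planFiles().forEach(task -> System.out.println(task.file().path()));
    }
}
```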

Compared with Hudi and Delta Lake, Iceberg provides a more complete table format with richer type definitions and operation abstractions, and it is decoupled from both the data processing engines above it and the data storage formats below it. In addition, Iceberg was not tied to any particular storage engine from the start, and it avoids calling back into the upper engines, so it can easily be extended to support different engines.

In data access, Iceberg guarantees ACID transactions and strong consistency, so data is written exactly once; read-write separation allows interactive query engines (such as Hive and Presto) to read correct data immediately; row-level updates and deletes allow data correction through the compute engine; incremental consumption lets data that has already landed be fed back into the stream processing engine, processing and passing on only the changed portion; and Iceberg's efficient query capability can skip the step of importing data into MySQL or ClickHouse, letting report and BI systems consume it directly.

To make use of Iceberg, Tencent Big Data implemented a Flink connector for Iceberg so that Flink can write data into Iceberg tables. Flink's Iceberg Sink consists of two parts: a Writer and a Committer. The Writer writes the incoming data to external storage to form a series of DataFiles. For now, to simplify adaptation and reuse existing logic, Tencent internally uses Avro as the intermediate data format; the community will later introduce a Flink built-in type converter that takes Iceberg's built-in data types as input. When the Writer performs a checkpoint, it closes its current file and sends the DataFiles it has built to the downstream Committer.

The Committer is globally unique within a Flink job. After receiving the DataFiles sent by all upstream Writers, it writes them into a ManifestFile and saves the ManifestFile into the checkpoint. When the checkpoint completes, the Committer submits the ManifestFile to Iceberg via a merge append. Iceberg then completes the commit through a series of operations, finally making the newly added data visible to the downstream data warehouse.
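As a rough sketch of that final step, the code below appends a batch of DataFiles to an Iceberg table in one atomic commit (Iceberg's newAppend() merges the new files into the table's manifests). The table handle and the DataFile list, which the real Committer recovers from the ManifestFile stored in the checkpoint, are placeholders here.

```java
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

import java.util.List;

public class IcebergCommitStep {
    // Called once the Flink checkpoint that produced these DataFiles has completed.
    public static void commit(Table table, List<DataFile> dataFilesFromCheckpoint) {
        AppendFiles append = table.newAppend();
        for (DataFile file : dataFilesFromCheckpoint) {
            append.appendFile(file);
        }
        // One atomic commit: either all new files become visible to downstream readers, or none do.
        append.commit();
    }
}
```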

Tencent has made many improvements and optimizations to Iceberg. Beyond supporting reads and writes from Flink, Tencent has also implemented row-level delete and update operations, which greatly reduce the cost of data correction and deletion. Tencent has also adapted to the Data Source V2 API in Spark 3.0, so SQL and DataFrames in Spark 3.0 can connect to Iceberg seamlessly.

In future work, Tencent will continue to enhance Iceberg's core capabilities, including:

  • Add update and delete semantics to the Flink sink so that late-arriving data can be handled correctly, supporting CDC scenarios;

  • Add support for Hive;

  • Add row-level update and delete operations in Merge-On-Read mode.

Reply with the keyword [GIAC] in the official account backend to get the speakers' slides.


Source: blog.csdn.net/Tencent_TEG/article/details/108332340