The practice of building a real-time data lake based on Flink

This article is compiled from the talk "The Practice of Building a Real-Time Data Lake Based on Flink" given by Wang Zheng and Min Zhongyuan, cloud-native computing R&D engineers at Volcano Engine, in the Data Lake track at CommunityOverCode Asia 2023.
 
A real-time data lake is a core component of modern data architecture. As data lake technology has developed, users expect more from it: data must be imported from multiple sources, the lake and the sources must stay consistent in real time and be synchronized promptly when changes occur, and queries must be fast, returning results within seconds. We therefore chose Flink for lake ingestion, lake egress, and OLAP queries. Flink's unified batch/stream architecture, exactly-once guarantees, and rich community ecosystem of connectors meet these needs, and Flink is also well suited for OLAP queries, which this article covers in detail.

Overall architecture

In the overall architecture of the Flink-based real-time data lake, Kubernetes serves as the container orchestration and management platform at the bottom layer, and the storage layer supports HDFS or S3. Iceberg was chosen as the table format for its well-organized file layout and its ecosystem. The computing layer uses Flink for lake ingestion and egress: Flink SQL is the most common way to move data in and out, while some advanced features are built on the Flink DataStream API. Ingestion and egress jobs run on Kubernetes in Flink Application Mode. OLAP queries go through the Flink SQL Gateway and a Flink cluster in Session Mode, with results returned over JDBC and REST API interfaces. A Catalog manages metadata, covering not only Iceberg's metadata but also metadata from third-party data sources, and scheduled tasks handle ongoing data maintenance.

Data ingestion practice

During ingestion, Flink reads from the upstream data sources and writes to Iceberg in streaming or batch mode. Iceberg itself provides several maintenance actions, so each table also has scheduled tasks such as data expiration, snapshot expiration, orphan file cleanup, and small-file compaction. In practice these actions significantly improve performance.
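A minimal sketch of what such a scheduled maintenance task can look like, using Iceberg's core table API and its Flink Actions; the catalog type, warehouse path, table name, and thresholds below are illustrative assumptions, not the production setup described in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class IcebergMaintenanceJob {
    public static void main(String[] args) {
        // Illustrative warehouse path; in practice this points at HDFS or S3.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs:///warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Expire snapshots older than 7 days to bound metadata growth.
        long sevenDaysAgo = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
        table.expireSnapshots()
             .expireOlderThan(sevenDaysAgo)
             .commit();

        // Compact small files with Iceberg's Flink rewrite action (target size is illustrative).
        Actions.forTable(table)
               .rewriteDataFiles()
               .targetSizeInBytes(128 * 1024 * 1024)
               .execute();
    }
}
```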
When the target schema is fixed and the destination table already exists, Flink SQL is usually used for ingestion: the source can be declared as a temporary table, or its metadata can be stored in a Catalog and the catalog table used directly. To meet more complex customer needs, however, we built automatic CDC schema evolution on top of the DataStream API, which enables whole-database synchronization with automatic table creation.
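For reference, here is a minimal sketch of the two Flink SQL routes just mentioned: declaring the source as a temporary table, and writing into an Iceberg table registered in a catalog. The connector, hosts, credentials, and table names are placeholders, not actual configuration, and the destination table is assumed to exist already.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IngestionSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Route 1: define the source as a temporary table (MySQL CDC options are illustrative).
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE orders_src (" +
                "  id BIGINT, name STRING, amount DECIMAL(18, 2)," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc', 'hostname' = 'mysql-host', 'port' = '3306'," +
                "  'username' = 'user', 'password' = '***'," +
                "  'database-name' = 'shop', 'table-name' = 'orders')");

        // Route 2: register an Iceberg catalog and write into an existing catalog table.
        tEnv.executeSql(
                "CREATE CATALOG iceberg WITH (" +
                "  'type' = 'iceberg', 'catalog-type' = 'hadoop'," +
                "  'warehouse' = 'hdfs:///warehouse')");
        tEnv.executeSql("INSERT INTO iceberg.db.orders SELECT * FROM orders_src");
    }
}
```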

Flink SQL

The Iceberg community supports basic reading and writing. Flink 1.17 introduced row-level update and delete (FLIP-282), and on top of this we added batch UPDATE and DELETE operations, implementing Iceberg's row-level updates through the RowLevelModificationScanContext interface. In practice, two pieces of information are recorded in the context: the snapshot ID at the start of the transaction and the filter conditions of the UPDATE/DELETE, which guarantees the transactionality of batch updates and deletes.
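A minimal sketch of what the batch UPDATE/DELETE usage looks like from the SQL side, assuming an Iceberg catalog named `iceberg` and a `db.users` table; this only illustrates the statement shape, not the internal RowLevelModificationScanContext handling.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RowLevelModificationExample {
    public static void main(String[] args) throws Exception {
        // Row-level UPDATE/DELETE (FLIP-282) runs in batch execution mode.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        tEnv.executeSql(
                "CREATE CATALOG iceberg WITH ('type'='iceberg','catalog-type'='hadoop','warehouse'='hdfs:///warehouse')");
        tEnv.useCatalog("iceberg");

        // The filter condition and the snapshot id at transaction start are what the
        // scan context records to keep these statements transactional.
        tEnv.executeSql("UPDATE db.users SET age = age + 1 WHERE country = 'CN'").await();
        tEnv.executeSql("DELETE FROM db.users WHERE age > 120").await();
    }
}
```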

Schema Evolution

Schema evolution is a common problem in stream processing: the schema of the destination must change dynamically while the streaming job runs so that data keeps being written correctly. Iceberg itself supports schema changes well. In Iceberg's storage architecture, the Catalog does not store the schema; it only stores the location of the latest metadata file. The metadata file stores the mapping from every schema ID to its schema definition, along with the latest schema ID (Current-Schema-id). Each manifest below records a schema ID, meaning the Parquet files under that manifest use the corresponding schema.
When the schema changes in Iceberg, the metadata file records the new schema and points Current-Schema-id at it. Subsequent write jobs generate new Parquet data files and corresponding manifest files according to the new schema. Reads always follow the latest schema ID: even if older manifest files at the bottom carry a different schema, the data is read using the new schema information.
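This schema-id bookkeeping can be inspected directly through Iceberg's table API. Below is a small sketch that loads a table from a Hadoop catalog (path and names are illustrative) and prints every historical schema together with the current schema id.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

import java.util.Map;

public class SchemaInspection {
    public static void main(String[] args) {
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs:///warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "users"));

        // The metadata file keeps every schema ever used, keyed by schema id ...
        Map<Integer, Schema> allSchemas = table.schemas();
        // ... plus the id of the schema that new reads and writes should use.
        int currentSchemaId = table.schema().schemaId();

        allSchemas.forEach((id, schema) ->
                System.out.printf("schema %d%s: %s%n",
                        id, id == currentSchemaId ? " (current)" : "", schema));
    }
}
```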
Currently, the FlinkSink provided by Iceberg does not support schema changes. Iceberg's default FlinkSink creates a StreamWriter for each Parquet file to be written, and the schema of that StreamWriter is fixed; otherwise an error is thrown when writing Parquet files. In the example above, the original schema is (id, name, age). When the schema matches, writing succeeds, so Row 1 can be written. Writing Row 2 fails because the lengths do not match: Index out of range. Writing Row 3 fails because of a data type mismatch: ClassCastException. Writing Row 4 does not fail, because the types and length match, but the schema meaning differs, so a piece of dirty data ends up in the result file.
There are two main problems to solve for schema changes: 1) how to know which schema each row corresponds to, and 2) how to write data with multiple schemas in one job.
For the first problem, the Flink CDC connector can be configured to include schema information with each record. We therefore implement a deserializer that outputs records containing both the Row and its corresponding schema information (the purple part in the figure), which solves the first problem.
For the second problem, supporting mixed writes of multiple schemas requires creating a separate StreamWriter per schema, each bound to one schema. We added a new FlinkSchemaEvolvingSink to the Iceberg sink connector. It checks whether incoming data matches the current schema; if not, it commits the new schema information to Iceberg, obtains its schema ID, and then writes and commits data using that new schema (the blue line in the figure above). If the schema has already been registered, the existing schema ID is returned. FlinkSchemaEvolvingSink maintains a map of StreamWriters keyed by schema ID: when a record arrives, it checks whether a writer exists for that schema and creates one if not, so data with multiple schemas can be written within the same job.
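The following is a minimal, hypothetical sketch of that writer-map idea. Class and method names (SchemaEvolvingWriterPool, StreamWriter) are illustrative, not the actual FlinkSchemaEvolvingSink implementation; the real sink also has to handle commits, checkpoint state, and failure recovery.

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.UpdateSchema;

import java.util.HashMap;
import java.util.Map;

public class SchemaEvolvingWriterPool {

    /** Hypothetical per-schema writer; in practice this wraps Iceberg's Parquet writer. */
    interface StreamWriter {
        void write(Object row);
    }

    private final Table table;
    private final Map<Integer, StreamWriter> writersBySchemaId = new HashMap<>();

    SchemaEvolvingWriterPool(Table table) {
        this.table = table;
    }

    void write(Schema recordSchema, Object row) {
        // If the record's schema is unknown to the table, commit it first and use its new id.
        int schemaId = ensureSchema(recordSchema);
        writersBySchemaId
                .computeIfAbsent(schemaId, id -> createWriter(recordSchema))
                .write(row);
    }

    private int ensureSchema(Schema recordSchema) {
        // Simplified: only compares against the current schema, not all historical ones.
        if (table.schema().sameSchema(recordSchema)) {
            return table.schema().schemaId();
        }
        // Commit the evolved schema to Iceberg (simplified: union in the new columns).
        UpdateSchema update = table.updateSchema().unionByNameWith(recordSchema);
        update.commit();
        table.refresh();
        return table.schema().schemaId();
    }

    private StreamWriter createWriter(Schema schema) {
        // Placeholder: the real sink creates a Parquet StreamWriter bound to `schema`.
        return row -> System.out.println("write " + row + " with schema " + schema);
    }
}
```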

Whole database synchronization and automatic table creation

Before the Flink JobGraph is generated, a Catalog module reads the information of the source tables, creates or alters the corresponding destination tables on the Iceberg side, and adds the sink for each table to the JobGraph.
While the Flink job runs, each binlog entry is parsed by a deserializer into a record with two parts: the table ID and the Row (the purple part of the figure). The record stream is then split: rows are routed by table ID, keyed by partition, and written to the corresponding downstream tables.
The entire process mainly consists of the following four parts:
  1. The deserializer parses events and data. To prevent ClassCastExceptions during transfer, the data types must stay identical to the source schema, which requires testing every type; the test cases in Flink CDC can be used to compare them.
  2. The Catalog module is mainly responsible for automatically creating tables and updating table schemas, and it must use the same type-conversion rules as the deserializer.
  3. Table split enables source reuse: a side-output tag is created for each table, and records are emitted to the corresponding downstream table.
  4. Because the Iceberg sink creates a corresponding FanoutWriter for each partition, memory usage can be high. We therefore perform a keyBy on the table's partition fields to reduce OOMs; since Iceberg supports hidden (implicit) partitioning, the implicit partition fields must first be transformed before the keyBy. A DataStream sketch of steps 3 and 4 follows this list.
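To make steps 3 and 4 concrete, the sketch below routes parsed CDC records to per-table side outputs and keys each table's stream by its (transformed) partition value before the sink. Record and field names are illustrative; the real job derives the key from Iceberg's hidden partition transforms and ends in a per-table Iceberg sink rather than print().

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.io.Serializable;
import java.util.HashMap;

public class DatabaseSyncSketch {

    /** Illustrative record: the table id plus the parsed row fields. */
    public static class CdcRecord implements Serializable {
        public String tableId;
        public HashMap<String, Object> row;
    }

    public static void wireUp(DataStream<CdcRecord> source,
                              HashMap<String, OutputTag<CdcRecord>> tagsByTable) {
        // One shared source, one side-output tag per destination table (source reuse).
        SingleOutputStreamOperator<CdcRecord> split =
                source.process(new ProcessFunction<CdcRecord, CdcRecord>() {
                    @Override
                    public void processElement(CdcRecord record, Context ctx, Collector<CdcRecord> out) {
                        OutputTag<CdcRecord> tag = tagsByTable.get(record.tableId);
                        if (tag != null) {
                            ctx.output(tag, record);
                        }
                    }
                });

        for (OutputTag<CdcRecord> tag : tagsByTable.values()) {
            // keyBy the partition value so each FanoutWriter only sees the partitions
            // of its own key group, which keeps writer memory bounded.
            split.getSideOutput(tag)
                 .keyBy(new KeySelector<CdcRecord, String>() {
                     @Override
                     public String getKey(CdcRecord r) {
                         // Placeholder: apply the hidden-partition transform here.
                         return String.valueOf(r.row.get("partition_key"));
                     }
                 })
                 .print(); // stand-in for the per-table Iceberg sink
        }
    }
}
```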

Data query practice

Why choose Flink

  • Architecturally, Flink supports JDBC driver, SQL-Gateway and session mode. Flink session cluster is a typical MPP (massively parallel processing) architecture, and each query does not need to apply for new resources. Users can easily submit SELECT statements through the JDBC driver and get results back in seconds or even sub-seconds.
  • Powerful batch processing capabilities. Flink OLAP can reuse many batch operators and optimizations. OLAP workloads also include large queries, which Flink can handle with its own batch processing capabilities, without introducing an external batch engine the way other OLAP engines do.
  • Flink supports standard SQL syntax such as QUERY/INSERT/UPDATE to meet the interactive needs of OLAP users.
  • Powerful connector ecosystem. Flink defines comprehensive input and output interfaces and ships many built-in connectors for databases and lakehouse storage. Users can also easily implement custom connectors based on these interfaces.

OLAP architecture

The overall Flink OLAP architecture has two parts: the Flink SQL Gateway and the Flink Session Cluster. A user submits a query from a client through the REST interface. The Gateway parses and optimizes the SQL and generates the job's execution plan, then submits it over an efficient socket interface to the JobManager of the Flink Session Cluster. The Dispatcher on the JobManager creates a corresponding JobMaster, which deploys the tasks to the TaskManagers in the cluster according to its scheduling rules; after execution, the results are returned to the client.
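As an illustration of the client side of this flow, the snippet below submits a query through a standard JDBC connection. It assumes the Flink SQL Gateway JDBC driver is on the classpath; the URL format, host, port, and table names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OlapQueryClient {
    public static void main(String[] args) throws Exception {
        // URL assumes the Flink SQL Gateway JDBC driver; host and port are placeholders.
        try (Connection conn = DriverManager.getConnection("jdbc:flink://gateway-host:8083");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT user_id, COUNT(*) AS cnt FROM iceberg.db.events GROUP BY user_id LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}
```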

Optimization measures

Query generation optimization

  • Plan cache
The first optimization is the plan cache. Queries in OLAP scenarios have two typical characteristics: many queries are repeated (unlike in streaming), and latency requirements are sub-second. Our analysis showed that the plan phase takes tens to hundreds of milliseconds, a significant share of total latency. By caching the Transformations produced by planning, the same query no longer has to be re-planned.
In addition, a Catalog cache accelerates metadata access, and parallel translation of ExecNodes reduces TPC-DS planning time by about 10%.
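The sketch below illustrates the plan-cache idea only; it is not Flink's internal implementation. A small LRU map keyed by the SQL text holds the translated plan, so a repeated query skips parsing, optimization, and translation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Illustrative plan cache: maps the SQL text of a query to its already-translated plan
 * (e.g. the Transformations), so identical OLAP queries are not re-planned.
 */
public class PlanCache<P> {

    private final Map<String, P> cache;

    public PlanCache(int maxEntries) {
        // Access-ordered LinkedHashMap acting as a tiny LRU cache.
        this.cache = new LinkedHashMap<String, P>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, P> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Plans only on a cache miss; repeated SQL reuses the cached plan. */
    public synchronized P getOrPlan(String sql, Function<String, P> planner) {
        return cache.computeIfAbsent(sql, planner);
    }
}
```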
  • Operator pushdown
The second optimization is operator pushdown. Under a storage-compute-separated architecture, operator pushdown is a very important class of optimization: pushing some operators down to the storage layer greatly reduces the amount of data scanned, cutting external IO as well as the amount of data the Flink engine has to process, and thus significantly improves query performance.
In ByteDance's internal practice, one typical business mostly queries TopN data, so we support TopN pushdown. As the figure shows, the local SortLimit operator, i.e. the local TopN, is pushed down to the Scan node, and the TopN computation is finally performed in the storage layer, greatly reducing the amount of data read from storage. The effect is dramatic: the Scan node reads 99.9% less data from storage, and the business query latency drops by about 90.4%.
In addition, we support more operator pushdowns, including Aggregate, Filter, and Limit pushdown.
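On the connector side, pushdown is declared through Flink's ability interfaces. The sketch below shows the shape of limit pushdown with SupportsLimitPushDown; the TopN and Aggregate pushdowns mentioned above follow the same pattern but rely on custom rules, and the runtime provider here is deliberately left as a stub.

```java
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.abilities.SupportsLimitPushDown;

/** Sketch of a table source that accepts a pushed-down LIMIT from the planner. */
public class LimitPushDownSource implements ScanTableSource, SupportsLimitPushDown {

    private long limit = -1; // -1 means no limit was pushed down

    @Override
    public void applyLimit(long limit) {
        // Called by the planner; the storage scan should stop after `limit` rows.
        this.limit = limit;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        // Illustrative stub: a real source builds a reader that passes `limit` to storage.
        throw new UnsupportedOperationException("sketch only, limit=" + limit);
    }

    @Override
    public DynamicTableSource copy() {
        LimitPushDownSource copy = new LimitPushDownSource();
        copy.limit = this.limit;
        return copy;
    }

    @Override
    public String asSummaryString() {
        return "LimitPushDownSource";
    }
}
```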

Query execution optimization

  • ClassLoader reuse
For ClassLoader reuse, let's first look at a problem where frequent ClassLoader creation under OLAP causes excessive CPU usage. We observed high JM/TM CPU usage in production, and flame graph analysis showed the JVM's Dictionary::find method consuming more than 70% of the CPU. Digging into the JVM source, we found that after loading a class, the JVM maintains a SystemDictionary hash table (key: class name, value: ClassLoader instance) to speed up lookups from class name to ClassLoader. When the number of ClassLoaders is very large, for example more than 20,000 in our production environment, the hash table has many collisions and lookups become very slow, so most of the JM's CPU is spent in this step.
Further investigation showed that these ClassLoaders were all UserCodeClassLoaders, which dynamically load the user's JAR package, and every job creates new ones: as the figure below shows, both the JobMaster of a new job and that job's tasks on the TM create new UserCodeClassLoaders, leading to far too many ClassLoaders on JM and TM. Too many ClassLoaders also exhaust JVM Metaspace, frequently triggering Metaspace full GC.
We therefore optimized for ClassLoader reuse in two steps. First, we changed how JARs are provided: since the third-party JARs needed in OLAP scenarios are fairly fixed, they can be placed directly on the classpath that JM and TM start with, so no separate JAR needs to be submitted per job. Second, for each job the system ClassLoader is reused directly during JobMaster and Task initialization. After this change, the CPU share of Dictionary::find on the JM dropped from 76% to 1%, and the frequency of Metaspace full GC fell significantly.
  • Codegen cache optimization
This optimization addresses the problem that compiling Codegen source code under OLAP consumes too much TM CPU. Many operators in Flink SQL use Codegen to generate their computation logic, for example the GeneratedClass produced by the Codegen operator, whose Code field is the generated Java source. When the operator is initialized, that Java source must be compiled and loaded as a Class. To avoid repeated compilation, the current caching mechanism maps the class name, together with the ClassLoader used by the task, to the compiled Class.
However, the current caching mechanism has two problems. First, it only allows different parallel instances of the same task within one job to share the compiled class; running the same query multiple times still recompiles the code. To avoid naming conflicts, Codegen appends a process-level auto-incrementing ID to class names and variable names, so the class name and code content change on every execution of the same query and the cache can never be hit. Second, every compile-and-load creates a new ByteArrayClassLoader; frequent ClassLoader creation fragments the Metaspace badly and triggers Metaspace full GC, causing service jitter.
In order to avoid repeated compilation of cross-job code and achieve cross-job Class sharing, we need to optimize the caching logic and realize the mapping of the same source code to compiled Class. There are two difficulties here:
  1. The first is how to ensure that operators with the same logic generate the same code;
  2. How to design Cache Key to uniquely identify the same code.
For the first difficulty, when generating Codegen code we changed the auto-incrementing ID in class names and variable names from process-global granularity to local-context granularity, so operators with the same logic generate identical code. For the second difficulty, we designed a four-tuple cache key of the ClassLoader's hash value, the class name, the code length, and the MD5 of the code, which uniquely identifies the same code.
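A hypothetical sketch of that cache key and a shared compiled-class cache is shown below; it mirrors the four-tuple described above but is illustrative, not the production code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Illustrative cross-job codegen cache: identical generated source maps to the same
 * compiled Class, so repeated executions of a query skip compilation and avoid
 * creating new ClassLoaders.
 */
public class CompiledClassCache {

    /** Cache key: classloader hash + class name + code length + md5 of the code. */
    static final class Key {
        final int classLoaderHash;
        final String className;
        final int codeLength;
        final String codeMd5;

        Key(ClassLoader cl, String className, String code) throws Exception {
            this.classLoaderHash = System.identityHashCode(cl);
            this.className = className;
            this.codeLength = code.length();
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(code.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            this.codeMd5 = hex.toString();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return classLoaderHash == k.classLoaderHash
                    && codeLength == k.codeLength
                    && className.equals(k.className)
                    && codeMd5.equals(k.codeMd5);
        }

        @Override
        public int hashCode() {
            return Objects.hash(classLoaderHash, className, codeLength, codeMd5);
        }
    }

    private final ConcurrentHashMap<Key, Class<?>> cache = new ConcurrentHashMap<>();

    /** Returns the cached Class for this source, compiling it only on the first request. */
    public Class<?> getOrCompile(ClassLoader cl, String className, String code,
                                 Function<String, Class<?>> compiler) throws Exception {
        return cache.computeIfAbsent(new Key(cl, className, code), k -> compiler.apply(code));
    }
}
```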
The effect of the Codegen cache optimization is very clear: CPU usage for code compilation on the TM side drops from 46% to about 0.3%, end-to-end query latency falls by about 29.2%, and Metaspace full GC time is reduced by about 71.5%.

Materialized views

  1. First, the user sends a request to create a materialized view to the platform through Flink SQL;
  2. The platform is responsible for creating the Iceberg materialized view, starting the Flink job to refresh the materialized view, and hosting this job to ensure that it continues to run.
  3. The Flink refresh job will continue to stream incremental data from the source table, perform incremental calculations to obtain the incremental results, and then stream them to the materialized view.
  4. The end user can obtain results that would otherwise require a full computation simply by querying the materialized view.
The above is the main process for implementing materialized views. Currently, our Iceberg materialized view is just an ordinary Iceberg table. Going forward, more complete metadata will be recorded at the Iceberg level to judge data freshness, and user queries will be automatically rewritten and optimized against existing materialized views. Regular data maintenance will include cleaning expired data, expiring old snapshots, removing orphan files, and compacting small data/metadata files.
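A rough sketch of the refresh flow, given that today's materialized view is just an ordinary Iceberg table: the platform creates the target table and keeps a streaming INSERT INTO job running from the source table. Catalog options, table names, the streaming-read hint, and the upsert properties are illustrative assumptions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MaterializedViewRefresh {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql(
                "CREATE CATALOG iceberg WITH ('type'='iceberg','catalog-type'='hadoop','warehouse'='hdfs:///warehouse')");
        tEnv.useCatalog("iceberg");

        // The "materialized view" is just an ordinary Iceberg table holding the aggregate;
        // upsert-style properties are assumed so the aggregate can be updated in place.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS db.orders_daily_mv (" +
                "  dt STRING, order_cnt BIGINT, PRIMARY KEY (dt) NOT ENFORCED" +
                ") WITH ('format-version'='2', 'write.upsert.enabled'='true')");

        // Hosted refresh job: stream increments from the source table into the view.
        tEnv.executeSql(
                "INSERT INTO db.orders_daily_mv " +
                "SELECT dt, COUNT(*) FROM db.orders " +
                "/*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */ " +
                "GROUP BY dt");

        // End users then query db.orders_daily_mv instead of re-aggregating db.orders.
    }
}
```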

Summary and outlook

Follow-up work will focus on automated creation of materialized views, query rewriting against materialized views, automatic tuning of data-maintenance task parameters (such as execution frequency and merged file size), and related work on hot/cold data tiering and data caching.
 