[Big Data] Efficiently building a data ingestion channel into the lake with Flink CDC

Abstract: This article is compiled from a talk by Xu Bangjiang (Xue Jin), head of the Alibaba Cloud Flink data ingestion team, head of the Flink CDC open source community, and Apache Flink PMC Member & Committer, at the Streaming Lakehouse Meetup.

1. Analysis of Flink CDC core technology

Flink CDC is a data integration framework built on CDC (Change Data Capture) technology that reads database logs and unifies full and incremental reading. Combined with Flink's excellent pipeline capabilities and rich upstream and downstream ecosystem, Flink CDC can efficiently synchronize massive amounts of data in real time.

As shown in the figure above, a database table contains both historical full data and incrementally written real-time data. The Flink CDC framework synchronizes the full and incremental data to the downstream system with exactly-once semantics, without losing or duplicating records.

Flink CDC takes full advantage of Flink's rich upstream and downstream ecosystem, and its own connector ecosystem is already very complete. On the source side it supports MySQL, Oracle, MongoDB, OceanBase, TiDB, and SQL Server, as well as MySQL-protocol-compatible databases such as MariaDB and PolarDB. On the sink side it supports writing to message queues such as Kafka and Pulsar, data lakes such as Hudi, Iceberg, and Paimon, and various data warehouses.
The following introduces the development history of the Flink CDC community.


  • In July 2020, the Flink CDC community was officially launched.
  • In May 2021, version 1.5 was released, supporting MySQL and Postgres.
  • In August 2021, version 2.0 was released, introducing the incremental snapshot algorithm for MySQL CDC.
  • In November 2022, version 2.3 was released, providing the incremental snapshot framework.
  • In June 2023, version 2.4 was released, continuing the expansion to mainstream data sources and extending incremental snapshot support to the mainstream connectors.

In June this year, the community released Flink CDC version 2.4. The commit distribution in the figure below gives an overview of the key features and improvements of this release.


For example, the MySQL and MongoDB connectors received the most attention and contributions, and many contributions also went into PostgreSQL and OceanBase. A total of 32 contributors from various companies participated in this release, 141 issues were resolved, and nearly 100 PRs were merged.

The core features of Flink CDC 2.4 are as follows:

  • A new Vitess data source was added. This source has many users overseas but few in China; the feature was contributed by overseas contributors.
  • PostgreSQL and SQL Server now support incremental snapshots, enabling advanced features such as parallel, lock-free reading.
  • MySQL CDC supports non-PK tables, that is, tables without primary keys.
  • OceanBase supports MySQL Mode and Oracle Mode.
  • The Debezium dependency was upgraded to 1.9.7.Final, fixing multiple known issues.
  • The connectors are compatible with multiple Flink versions, from 1.13 to 1.17.
  • The incremental snapshot framework supports automatic release of idle readers.

After the 2.4 release, Flink CDC's incremental snapshot support matrix is as shown in the figure below:

The core advantage of the incremental snapshot algorithm is that it enables parallel reading in the full phase, when the data volume is large, and single-parallelism reading in the incremental phase, when writes are relatively few. For example, Task2 and Task3 in the figure can automatically release their resources after the automatic full-to-incremental switch, and the entire switch is a consistent transition implemented by a lock-free algorithm.

In addition, the figure above also shows the gap between classic TP (transactional) databases and big data systems. Big data systems often need to process massive amounts of data, including historical data, but traditional ingestion tools cannot pull such volumes over efficiently. Flink CDC bridges this gap by building a data channel on top of it.

To summarize, Flink CDC’s incremental snapshot framework has the following four advantages:

  • Supports parallel reading. Parallel reads can scale horizontally, allowing users to add resources to improve reading efficiency.
  • Supports lock-free reading , which means there is no need to lock the online database and no intrusion into the business.
  • Supports full-incremental integration, that is, the full and incremental phases are connected automatically without manual intervention.
  • Exactly-once semantics, ensuring that data is neither lost nor duplicated during synchronization.
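As a sketch of how these capabilities are exposed, a MySQL CDC source table can be declared in Flink SQL roughly as follows. The option names follow the flink-connector-mysql-cdc documentation; the host, credentials, and schema are placeholders:

```sql
-- Sketch of a MySQL CDC source table; hostname/credentials are placeholders.
CREATE TABLE orders_cdc (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql-host',
    'port'          = '3306',
    'username'      = 'flink_user',
    'password'      = '***',
    'database-name' = 'app_db',
    'table-name'    = 'orders',
    -- incremental snapshot reading is the default in 2.x; shown explicitly here
    'scan.incremental.snapshot.enabled' = 'true',
    -- a server-id range lets the full phase read with multiple parallel tasks
    'server-id'     = '5400-5408'
);
```

Querying `orders_cdc` in a Flink job first reads the historical snapshot in parallel and then switches to binlog reading automatically.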

2. Challenges of CDC data ingestion into lakes and warehouses

The main challenges of ingesting CDC data into lakes and warehouses are as follows:

  • Large historical data volume. The historical data held in databases is large; historical business data can exceed 100 TB.
  • High real-time requirements for incremental data. Incremental database data carries high business value, and that value decays over time, so it must be processed in real time.
  • Data ordering. The processing results of CDC data usually need to satisfy consistency semantics, which typically requires the ETL tool to preserve global order.
  • Dynamic schema changes. Incremental data grows over time, and the schema of the corresponding database tables continually evolves.

2.1 CDC data lake ingestion architecture

The traditional architecture is generally split into two separate parts, offline and real-time. Each part has its own technical and business characteristics, and the split is often also tied to the company's organizational structure, for example when offline and real-time workloads belong to two different teams. This is how the Lambda architecture naturally arises.

This traditional solution has several drawbacks: insufficient data freshness, separate synchronization links that are troublesome to maintain, a large number of components, and the extra cost introduced by the message queue.

2.2 CDC data ETL architecture

Before CDC data lands in the lake, there is often a need for ETL, that is, data cleaning, case conversion, or data widening.

In some early architectures, the data is first collected and processed, and then written to downstream storage. As shown in the figure above, the challenges of this ETL architecture are:

  • There are many components, the structure is complex, and the maintenance cost is relatively high.
  • The full and incremental phases are essentially separate, making it difficult to align the data between collection and computation.
  • The full data is read with a single parallelism and cannot scale horizontally.

3. Flink CDC-based lake and warehouse ingestion solution

After introducing the traditional warehousing solutions, let's look at the simpler and more efficient CDC ingestion solution.

3.1 Flink CDC's lake and warehouse ingestion architecture

The Flink CDC ingestion architecture is very simple, as shown in the figure below. For example, the ingestion link from MySQL to Paimon requires only one component, Flink CDC, without a lengthy pipeline.

Compared with the traditional lake and warehouse architecture mentioned above, the Flink CDC architecture has the following advantages:

  • Does not affect business stability, since the full data is read only once.
  • It has good real-time performance and supports minute-level output.
  • Full + incremental integration avoids manual operations.
  • Full concurrent reading, high throughput.
  • Short link with few components, resulting in low learning and operations costs.
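As a minimal sketch of such a link (assumed catalog and table names, a placeholder warehouse path; the catalog options follow the Paimon Flink documentation), ingesting a MySQL CDC table into Paimon can look like this:

```sql
-- Create a Paimon catalog; the warehouse path is a placeholder.
CREATE CATALOG paimon_catalog WITH (
    'type'      = 'paimon',
    'warehouse' = 'file:/tmp/paimon'
);

-- Target Paimon table, created inside the Paimon catalog.
CREATE TABLE paimon_catalog.`default`.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
);

-- orders_cdc is assumed to be a source table defined with the mysql-cdc
-- connector; this single statement carries both the full snapshot and
-- the subsequent increment.
INSERT INTO paimon_catalog.`default`.orders SELECT * FROM orders_cdc;
```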


3.2 Flink CDC ETL analysis

Flink CDC is an engine based on the Flink ecosystem. Once CDC data is ingested, it can be processed under the changelog semantics of the database, for example running aggregations such as GROUP BY or widening operations such as dual-stream joins on the CDC data.

For these operations, users only need to write Flink SQL to get an experience equivalent to operating on a materialized view of the table: SQL operations are applied to both the full and the incremental data in the database, entirely within Flink SQL. This greatly lowers the barrier to ETL processing of CDC data, requiring only that users can write SQL.
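For instance, a continuous aggregation over a CDC table can be written as plain SQL; a sketch with assumed table names:

```sql
-- Continuous aggregation over CDC data: the sink always reflects the
-- current state of the source table, like a materialized view.
-- orders_cdc is a mysql-cdc source table; customer_stats is an assumed sink.
INSERT INTO customer_stats
SELECT customer_id,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS total_amount
FROM orders_cdc
GROUP BY customer_id;
```

As rows are inserted, updated, or deleted in MySQL, Flink retracts and re-emits the affected aggregates downstream.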

The following figure shows the architecture for doing ETL with Flink CDC. The entire pipeline needs only one Flink component. The advantages of this architecture are:

  • Full incremental integration.
  • ETL can achieve real-time processing.
  • Supports concurrent reading.
  • The link is short with few components, so the maintenance cost is low.


3.3 Storage-friendly writing design

In the design of the full-phase read, Flink CDC, and the incremental snapshot framework in particular, gives full consideration to data consistency and the characteristics of downstream storage, especially in the data chunking part. For example, the checkpoint granularity is a key factor for downstream storage: if checkpoints were taken at table granularity, all of a table's data would have to be buffered in memory before it could be flushed and committed in one step, which is very unfriendly to downstream sink nodes.

Therefore, in the incremental snapshot framework, the checkpoint granularity is refined to the chunk level, and the chunk size is left open for the user to configure: the user decides how much data one chunk covers. With this fine-grained control, writing becomes much friendlier to the downstream sink nodes and does not put excessive pressure on memory.
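The chunk size is controlled through connector options in the table's WITH clause. The option names below are from the mysql-cdc connector documentation; the values shown are the documented defaults and are illustrative:

```sql
-- Fragment of a mysql-cdc source table's WITH clause:
'scan.incremental.snapshot.chunk.size' = '8096',  -- rows per chunk
'scan.snapshot.fetch.size'             = '1024'   -- rows fetched per poll in the full phase
```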


3.4 Flink CDC implements heterogeneous data source integration

Flink CDC makes it easy to integrate heterogeneous data sources. When there is more than one kind of database backing different business systems, the data in these databases often needs to be integrated, and with Flink CDC this takes only a few lines of Flink SQL.

As shown on the right side of the figure above, some business data lives in MySQL and some in PostgreSQL. All the user needs to do is write a few lines of Flink SQL: define CDC tables of the different types, join them, and then run an INSERT. The left side of the figure shows the widening of the product, order, and logistics tables, which can also be done entirely in Flink SQL. In this example, the user needs to understand neither PostgreSQL's replication slot mechanism nor MySQL's binlog mechanism, only a few pieces of Flink SQL syntax.
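A sketch of such a cross-source widening job, with placeholder hosts and illustrative schemas (the option names follow the mysql-cdc and postgres-cdc connector documentation):

```sql
-- Orders live in MySQL ...
CREATE TABLE orders (
    order_id   BIGINT,
    product_id BIGINT,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql-host', 'port' = '3306',
    'username' = 'flink_user', 'password' = '***',
    'database-name' = 'mydb', 'table-name' = 'orders'
);

-- ... shipments live in PostgreSQL.
CREATE TABLE shipments (
    shipment_id BIGINT,
    order_id    BIGINT,
    status      STRING,
    PRIMARY KEY (shipment_id) NOT ENFORCED
) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'pg-host', 'port' = '5432',
    'username' = 'flink_user', 'password' = '***',
    'database-name' = 'mydb', 'schema-name' = 'public',
    'table-name' = 'shipments', 'slot.name' = 'flink_slot'
);

-- Widen the two heterogeneous sources and write to an assumed sink table.
INSERT INTO enriched_orders
SELECT o.order_id, o.product_id, s.status
FROM orders AS o
LEFT JOIN shipments AS s ON o.order_id = s.order_id;
```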

3.5 Flink CDC implements sub-database and sub-table integration

For business systems that have grown in scale, sharding across databases and tables is a very common architecture for supporting highly concurrent requests. Flink CDC naturally supports synchronizing tables in this architecture: users only need to supply regular expressions for the database and table names in the DDL, and the historical and incremental data of all shards matching those expressions are synchronized downstream.

In this example, a few lines of Flink SQL are enough to efficiently integrate the sharded databases and tables.
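A sketch of the regular-expression DDL (connection details are placeholders; the mysql-cdc connector documents that `database-name` and `table-name` accept regular expressions):

```sql
-- One source table that matches every shard of user_db_* / users_*.
CREATE TABLE all_users (
    id   BIGINT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql-host', 'port' = '3306',
    'username' = 'flink_user', 'password' = '***',
    'database-name' = 'user_db_[0-9]+',  -- matches user_db_0, user_db_1, ...
    'table-name'    = 'users_[0-9]+'     -- matches users_0, users_1, ...
);
```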

4. Flink CDC + Paimon best practices

Before introducing the best practices of Flink CDC + Paimon, let's first introduce the overall architecture of Paimon.

In this picture, with Paimon as the lake storage, CDC is a very important part: CDC provides the first step of ingesting data from database systems or log systems into Paimon.

As shown in the figure above, in the entire data warehouse pipeline built around Paimon, data freshness is very high and can basically meet near-real-time business needs. The flow of data between layers can be implemented with Flink SQL, and the data can be read and analyzed with Flink or other compute engines, so it is a very open architecture. The architecture is also relatively simple, and using Flink SQL throughout unifies the semantics and ensures data consistency.

4.1 Community practice

Compared with other open source communities, the Paimon community's CDC support is very complete and provides a series of advanced features, such as:

  • Support Schema Evolution.
  • Supports automatic table creation and automatic field mapping.
  • A single command line automatically generates a synchronization pipeline.
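For example, whole-database synchronization from MySQL can be launched with the paimon-flink-action jar. The command shape follows the Paimon CDC documentation; the paths, host, and credentials below are placeholders:

```shell
# Sync every table of app_db into Paimon, with automatic table creation;
# <FLINK_HOME> and the jar version are placeholders.
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-<version>.jar \
    mysql_sync_database \
    --warehouse hdfs:///paimon/warehouse \
    --database app_db \
    --mysql_conf hostname=mysql-host \
    --mysql_conf username=flink_user \
    --mysql_conf password=*** \
    --mysql_conf database-name=app_db \
    --table_conf bucket=4
```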

Paimon also supports whole-database synchronization from MySQL, with Schema Evolution supported within the whole-database synchronization job. In addition, Paimon supports synchronizing data from Kafka message queues, including whole-database synchronization.

In general, the Paimon community's integration functions for CDC are very complete.

4.2 Internal practices

Within Alibaba Cloud, we developed the CTAS/CDAS syntax to support whole-database synchronization and Schema Evolution. Compared with the Paimon community practice just described, Alibaba Cloud's internal practice generates a pipeline from a single line of SQL, and the two provide similar core capabilities, such as automatic table creation and automatic field mapping.

Alibaba Cloud's internal practices also cover real-time ingestion of CDC data into lakes and warehouses, real-time ingestion of log data, and ETL analysis of CDC data. With a single CTAS or CDAS statement, an entire MySQL database or Kafka data can be synchronized to the downstream system.
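A sketch of what these statements look like. Note this is Alibaba Cloud-specific syntax rather than open-source Flink SQL, and the catalog and database names are illustrative:

```sql
-- CDAS: synchronize an entire MySQL database, with Schema Evolution.
CREATE DATABASE IF NOT EXISTS paimon_catalog.app_db
AS DATABASE mysql_catalog.app_db INCLUDING ALL TABLES;

-- CTAS: synchronize a single table, creating it automatically downstream.
CREATE TABLE IF NOT EXISTS paimon_catalog.app_db.orders
AS TABLE mysql_catalog.app_db.orders;
```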

For CDC ETL analysis, Alibaba Cloud also recommends some optimizations in its internal practice. As shown in the figure below, users can first synchronize the data to a message queue; when many downstream real-time jobs consume data from the same table, the database then only needs to be read once, which greatly reduces the pressure on it.


5. Q & A

Q: Can CDC data be moved between different systems? During development we have a requirement to move data from the IP side to PP. Will MySQL CDC support this?

A: At present, if the data does not need Schema Evolution, this requirement can already be met; if Schema Evolution is required, the binlog mechanism is needed to assist the implementation.

Q: What is the difference between Flink CDC and Paimon CDC? Is Paimon CDC implemented through Flink CDC?

A: Paimon's ability to read CDC data from external databases is implemented through Flink CDC. In addition, Paimon tables can themselves generate CDC data: the input to a Paimon table may be data produced by MySQL CDC, and the Paimon table in turn generates its own CDC changelog.

Q: When we use CDAS or CTAS, are changes to the table structure applied in real time, or at checkpoints?

A: The current implementation is real-time and does not rely on the checkpoint mechanism. Relying on checkpoints could mean waiting several minutes for the next checkpoint, which is unacceptable in CDC data scenarios.

Origin blog.csdn.net/be_racle/article/details/132769187