MRS CDL Architecture Design and Implementation

1 Introduction

MRS CDL is a real-time data synchronization service launched by FusionInsight MRS. It captures event information from traditional OLTP databases and pushes it to big data products in real time. This document introduces the overall architecture and key technologies of CDL in detail.

2 The concept of CDL

MRS CDL (Change Data Loader) is a CDC data synchronization service based on Kafka Connect. It can capture data from various OLTP data sources, such as Oracle, MySQL, and PostgreSQL, and transmit it to target storage, which can be big data storage such as HDFS or OBS, or a real-time data lake such as Hudi.

2.1 What is CDC?

CDC (Change Data Capture) is a design pattern that monitors data changes (insertions, updates, deletions, etc.) and further processes the changed data. It is usually used in data warehouses and in applications closely related to databases, such as data synchronization, backup, auditing, and ETL.

CDC technology has been around for quite a while; it was already being used to capture changes in application data more than two decades ago. CDC can synchronize changes to the corresponding data warehouse in a timely and effective manner, with almost no impact on the running production applications. Now that big data applications are becoming more and more common, this long-established technology has been given new life: connecting to big data scenarios is CDC's new mission.

At present, there are many mature CDC-to-big-data products in the industry, such as Oracle GoldenGate (for Kafka), Alibaba Canal, LinkedIn Databus, Debezium, and so on.

2.2 Scenarios supported by CDL

MRS CDL absorbs the successful experience of the mature products above. It uses Oracle LogMiner and the open-source Debezium to capture CDC events, and deploys tasks on top of the high-concurrency, high-throughput, and high-reliability frameworks of Kafka and Kafka Connect.

When existing CDC products connect to big data scenarios, they basically all choose to synchronize data to the Kafka message queue. On this basis, MRS CDL further provides the ability to write data directly into the lake: it can connect directly to MRS HDFS and Huawei OBS, as well as MRS Hudi, ClickHouse, and so on, solving the last-mile problem of data.

Scenario                        Data source    Target storage
Real-time data lake analytics   Oracle         Huawei OBS, MRS HDFS, MRS Hudi, MRS ClickHouse, MRS Hive
Real-time data lake analytics   MySQL          Huawei OBS, MRS HDFS, MRS Hudi, MRS ClickHouse, MRS Hive
Real-time data lake analytics   PostgreSQL     Huawei OBS, MRS HDFS, MRS Hudi, MRS ClickHouse, MRS Hive

Table 1 Scenarios supported by MRS CDL

3 Architecture of CDL

As a CDC system, the ability to extract data from the source data store and transfer it to the target store is a basic capability. Beyond that, flexibility, high performance, high reliability, scalability, reentrancy, and security are the directions that MRS CDL focuses on. The core design principles of CDL are therefore as follows:

  • The system architecture must satisfy the scalability principle: new source and target data stores can be added without compromising the functionality of the existing system.
  • The architecture design should keep the business concerns of different roles separate.
  • Reduce complexity and dependencies where reasonable, and minimize architectural, security, and resilience risks.
  • The system must meet customers' plug-in needs and provide general plug-in capabilities, making it flexible, easy to use, and configurable.
  • Ensure business security, avoiding horizontal privilege escalation and information leakage.

3.1 Architecture diagram/role introduction

Figure 1 CDL Architecture

MRS CDL includes two roles: CDL Service and CDL Connector. Their respective functions are as follows:

  • CDL Service: responsible for task management and scheduling, provides a unified API, and monitors the health status of the entire CDL service.
  • CDL Connector: essentially a Worker process of Kafka Connect, responsible for running the actual tasks. On top of Kafka Connect's high reliability, high availability, and scalability, it adds a heartbeat mechanism to help CDL Service monitor cluster health.

3.2 Why choose Kafka?

We compared Apache Kafka with various other options, such as Flume and NiFi, as shown in the table below:

Framework   Advantages                                                          Shortcomings
Flume       Configuration-based agent architecture; interceptors;              Data loss in some scenarios; no data backup;
            Source/Channel/Sink model                                           data size limit; no back-pressure mechanism
NiFi        Many out-of-the-box processors; back-pressure mechanism;           No data replication; fragile fault tolerance;
            handles messages of arbitrary size; MiNiFi agent for data          no message ordering support; poor scalability
            collection; supports edge-layer data flow

Table 2 Framework comparison

For CDC systems, Kafka has enough advantages to support our choice. At the same time, the architecture of Kafka Connect fits perfectly with the CDC system:

  • Parallelism - For a data replication task, it is possible to increase throughput by breaking it up into multiple subtasks and running them in parallel.
  • Order-preserving - Kafka's partition mechanism can ensure that data in a partition is strictly ordered, which helps us achieve data integrity.
  • Scalable - Kafka Connect runs Connectors distributed across the cluster.
  • Ease of use - Kafka's interface is abstracted to improve ease of use.
  • Balancing - Kafka Connect automatically detects failures and rebalances scheduling on the remaining processes based on their respective loads.
  • Life cycle management - Provides complete Connector life cycle management capabilities.

4 MRS CDL key technologies

Figure 2 CDL key technologies

4.1 CDL Job

MRS CDL abstracts the business at the upper layer and defines a complete business process by introducing the concept of a CDL Job. Within a Job, the user can select the data source and the target storage type, and can filter the data tables to be replicated.

Based on the Job structure, MRS CDL provides a mechanism for executing CDL Jobs. At runtime, a Kafka Connect Source Connector, combined with log-replication technology, captures CDC events from the source data store into Kafka; a Kafka Connect Sink Connector then extracts the data from Kafka, applies various transformation rules, and pushes the final result to the target store.

CDL also provides a mechanism for defining table-level and column-level mapping transformations; the transformation rules can be specified while defining a CDL Job, as illustrated in the sketch below.
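As a rough illustration of what such a Job boils down to on the Kafka Connect side, the sketch below builds a source/sink configuration pair in Java. The connector class names, job names, and topic names are hypothetical placeholders; the transform (SMT) classes and their properties are standard Kafka Connect ones, used here only to show how table-level routing and column-level renaming can be expressed. In practice each map would be wrapped in JSON and submitted through the Connect REST interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of the kind of configuration a CDL Job boils down to when it runs
 * on Kafka Connect. Connector classes and topic names are hypothetical; the SMT
 * classes and their properties are standard Kafka Connect ones.
 */
public class JobConfigSketch {

    static Map<String, String> sourceConnectorConfig() {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("name", "oracle-source-demo");                            // hypothetical job name
        cfg.put("connector.class", "com.example.OracleSourceConnector");  // placeholder class
        cfg.put("tasks.max", "4");
        // Table-level mapping: route captured table topics to a unified naming scheme.
        cfg.put("transforms", "route");
        cfg.put("transforms.route.type", "org.apache.kafka.connect.transforms.RegexRouter");
        cfg.put("transforms.route.regex", "ORCL\\.SALES\\.(.*)");
        cfg.put("transforms.route.replacement", "cdl.sales.$1");
        return cfg;
    }

    static Map<String, String> sinkConnectorConfig() {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("name", "hudi-sink-demo");                                // hypothetical job name
        cfg.put("connector.class", "com.example.HudiSinkConnector");      // placeholder class
        cfg.put("tasks.max", "4");
        cfg.put("topics.regex", "cdl\\.sales\\..*");
        // Column-level mapping: rename columns before writing to the target store.
        cfg.put("transforms", "rename");
        cfg.put("transforms.rename.type", "org.apache.kafka.connect.transforms.ReplaceField$Value");
        cfg.put("transforms.rename.renames", "EMPID:employee_id,ENAME:employee_name");
        return cfg;
    }
}
```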

4.2 Data Comparison

MRS CDL provides a special job for data consistency comparison. Users can select the source and target data store schemas and choose comparison pairs from them; the data is then compared to ensure that it is consistent between the source and target data stores.

Figure 3 Data Comparison abstract view

MRS CDL provides a dedicated Rest API to run Data Compare Jobs and provides the following capabilities:

  • Provides a variety of data comparison algorithms, such as row hashing algorithms, non-primary key column comparisons, etc.
  • Provides a special query interface, which can query the synchronization report and display the execution details of the current Compare task.
  • Provides repair scripts, generated from the current source and target storage, to fix out-of-sync data with one click.

The following is the execution process of the Data Compare Job:

Figure 4 Data Compare Job execution and viewing process
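To make the row-hash comparison idea concrete, here is a minimal, self-contained sketch (Java 17+). It is an illustration of the technique, not CDL's actual algorithm: each row is reduced to a digest of its non-key columns, keyed by primary key, and the source and target digests are then compared.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;

/** Illustration of a row-hash comparison: rows are reduced to digests keyed by
 *  primary key, and the two sides are compared digest by digest. */
public class RowHashCompare {

    /** Hash all non-key columns of one row into a short fingerprint. */
    static String rowHash(List<String> columnValues) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (String v : columnValues) {
            md.update((v == null ? "\u0000" : v).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0x1F); // column separator to avoid accidental collisions
        }
        return HexFormat.of().formatHex(md.digest());
    }

    /** Returns the primary keys whose rows differ or exist on only one side. */
    static Map<String, String> diff(Map<String, String> sourceHashes,
                                    Map<String, String> targetHashes) {
        Map<String, String> outOfSync = new HashMap<>();
        sourceHashes.forEach((pk, hash) -> {
            if (!hash.equals(targetHashes.get(pk))) {
                outOfSync.put(pk, "missing or different in target");
            }
        });
        targetHashes.forEach((pk, hash) -> {
            if (!sourceHashes.containsKey(pk)) {
                outOfSync.put(pk, "missing in source");
            }
        });
        return outOfSync;
    }
}
```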

4.3 Source Connectors

MRS CDL creates various source connectors through the Kafka Connect SDK; these connectors capture CDC events from the corresponding data sources and push them to Kafka. CDL provides dedicated REST APIs to manage the lifecycle of these data source connectors.

4.3.1 Oracle Source Connector

The Oracle Source Connector uses the LogMiner interface provided by the Oracle RDBMS to capture DDL and DML events from the Oracle database.

Figure 5 Schematic diagram of LogMiner capturing data
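The sketch below shows, in simplified form, how LogMiner can be driven over JDBC: a mining session is started for an SCN range and redo records are read from V$LOGMNR_CONTENTS. It assumes the required redo/archive logs are available to the session and that the mining user has the necessary privileges; connection details and the schema name are placeholders, and the real connector logic is far more involved than this.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Simplified sketch of reading DML changes through Oracle LogMiner over JDBC. */
public class LogMinerSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCLPDB", "miner_user", "secret")) {

            // Start a LogMiner session over an SCN range, resolving table and column
            // names from the online catalog (the redo logs must already be available).
            try (CallableStatement start = conn.prepareCall(
                    "BEGIN DBMS_LOGMNR.START_LOGMNR(" +
                    "  STARTSCN => ?, ENDSCN => ?," +
                    "  OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG); END;")) {
                start.setLong(1, 2_000_000L);   // last processed SCN + 1
                start.setLong(2, 2_100_000L);   // upper bound for this batch
                start.execute();
            }

            // Pull INSERT/UPDATE/DELETE records for the schema being replicated.
            String query = "SELECT SCN, OPERATION, SEG_OWNER, TABLE_NAME, SQL_REDO " +
                           "FROM V$LOGMNR_CONTENTS " +
                           "WHERE SEG_OWNER = 'SALES' AND OPERATION IN ('INSERT','UPDATE','DELETE')";
            try (PreparedStatement ps = conn.prepareStatement(query);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // A real connector would turn each row into a CDC event and emit it to Kafka.
                    System.out.printf("scn=%d op=%s table=%s.%s redo=%s%n",
                            rs.getLong("SCN"), rs.getString("OPERATION"),
                            rs.getString("SEG_OWNER"), rs.getString("TABLE_NAME"),
                            rs.getString("SQL_REDO"));
                }
            }

            // End the mining session when the batch is done.
            try (CallableStatement end = conn.prepareCall("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")) {
                end.execute();
            }
        }
    }
}
```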

When processing DML events, CDL can also provide support when BLOB/CLOB columns exist in the table. BLOB columns are handled as follows (a simplified merging sketch follows this list):

  • When an insert/update operation touches such a column, a series of LOB_WRITE operations is triggered.
  • LOB_WRITE is used to load the file into the BLOB field.
  • Each LOB_WRITE can only write up to 1 KB of data.
  • For a 1 GB image file, we collect the binary data from all one million LOB_WRITE operations and merge them into a single object. We store this object in Huawei OBS and, in the message written to Kafka, provide the location of the object in OBS.
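A simplified sketch of this chunk-merging step is shown below. It is not the actual CDL code: LOB_WRITE chunks are accumulated per row and merged into a single object on commit, and the upload to OBS is represented by a hypothetical helper.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/** Simplified illustration of merging LOB_WRITE chunks into one object per row. */
public class LobAssembler {
    // One in-progress buffer per row key; a 1 GB LOB arrives as ~1 million 1 KB chunks.
    private final Map<String, ByteArrayOutputStream> buffers = new HashMap<>();

    /** Called for every LOB_WRITE redo record belonging to the same row. */
    public void onLobWrite(String rowKey, byte[] chunk) {
        buffers.computeIfAbsent(rowKey, k -> new ByteArrayOutputStream())
               .writeBytes(chunk);
    }

    /** Called when the enclosing transaction commits: upload the merged object and
     *  return its location, which is what gets written into the Kafka message. */
    public String onCommit(String rowKey) {
        ByteArrayOutputStream buf = buffers.remove(rowKey);
        if (buf == null) {
            return null;
        }
        byte[] merged = buf.toByteArray();
        // Placeholder for the real upload; in MRS CDL the object is stored in Huawei OBS.
        return uploadToObjectStore("cdl-lobs/" + rowKey, merged);
    }

    private String uploadToObjectStore(String objectKey, byte[] data) {
        // Hypothetical helper: a real implementation would call the OBS SDK here.
        return "obs://bucket/" + objectKey;
    }
}
```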

For the capture of DDL events, we create a separate session to keep track of them. The currently supported DDL statements are as follows:

No.  DDL statement                              Example
1    CREATE TABLE                               CREATE TABLE TEST (EMPID INT PRIMARY KEY, ENAME VARCHAR2(10))
2    ALTER TABLE ... ADD (<name> <data type>)   ALTER TABLE TEST ADD (SALARY NUMBER)
3    ALTER TABLE ... DROP COLUMN ...            ALTER TABLE TEST DROP (SALARY)
4    ALTER TABLE ... MODIFY (<column> ...)      ALTER TABLE TEST MODIFY SALARY INT
5    ALTER ... RENAME ...                       ALTER TABLE TEST RENAME TO CUSTOMER
6    DROP ...                                   DROP TABLE TEST
7    CREATE UNIQUE INDEX ...                    CREATE UNIQUE INDEX TESTINDEX ON TEST (EMPID, ENAME)
8    DROP INDEX ...                             Drop an existing index

Table 3 Supported DDL statements

4.3.2 MySQL Source Connector

MySQL's binary log (binlog) sequentially records all operations committed to the database, including changes to table structures and changes to table data. The MySQL Source Connector generates CDC events by reading the binlog files and submits them to Kafka topics.

The main functional scenarios supported by the MySQL Source Connector are:

  • Capture DML events, and support parallel processing of the captured DML events to improve overall performance
  • Support table filtering
  • Support configuring the mapping relationship between tables and topics
  • In order to guarantee the absolute order of CDC events, we generally require that a table corresponds to only one partition. However, the MySQL Source Connector still provides the ability to write to multiple partitions for scenarios where message ordering can be sacrificed to improve performance (see the partitioning sketch after this list)
  • Provide the ability to restart a task from a specified binlog file, a specified position, or a GTID, to ensure that no data is lost in abnormal scenarios
  • Support multiple complex data types
  • Support capturing DDL events
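The following sketch illustrates the partitioning trade-off described above, in plain Java rather than CDL's actual code: in the default mode every event of a table is routed to one fixed partition, so Kafka preserves its order; the relaxed mode spreads a table across several partitions at the cost of strict ordering.

```java
/**
 * Illustration of per-table ordering: every event of a table goes to the same
 * partition, so Kafka preserves its order. Allowing more than one partition per
 * table (the optional mode mentioned above) trades ordering for throughput.
 */
public class TablePartitioner {

    /** Default mode: one table maps to one fixed partition. */
    static int partitionForTable(String tableName, int numPartitions) {
        return Math.floorMod(tableName.hashCode(), numPartitions);
    }

    /** Relaxed mode: spread a hot table across several partitions, losing strict order. */
    static int partitionForEvent(String tableName, long eventSequence,
                                 int partitionsPerTable, int numPartitions) {
        int base = partitionForTable(tableName, numPartitions);
        return (base + (int) (eventSequence % partitionsPerTable)) % numPartitions;
    }
}
```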

4.3.3 PostgreSQL Source Connector

PostgreSQL's logical decoding feature allows us to parse change events committed to the transaction log; this requires an output plugin to process the changes. The PostgreSQL Source Connector uses the pgoutput plugin for this purpose. pgoutput is the standard logical decoding plugin provided by PostgreSQL 10+ and requires no additional dependencies.
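For reference, the sketch below shows the usual server-side prerequisites of pgoutput-based capture, issued over plain JDBC: a publication declaring the tables to stream and a logical replication slot created with the built-in pgoutput plugin. It assumes wal_level is set to logical and that the user has the required privileges; database, slot, and table names are placeholders, and this is not CDL's own setup procedure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Server-side prerequisites for pgoutput-based logical decoding (sketch). */
public class PgOutputSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/appdb", "replicator", "secret");
             Statement st = conn.createStatement()) {

            // A publication declares which tables are streamed by pgoutput.
            st.execute("CREATE PUBLICATION cdl_pub FOR TABLE public.orders, public.customers");

            // A logical replication slot using the built-in pgoutput plugin keeps track
            // of how far the consumer has read, so no change is lost across restarts.
            st.execute("SELECT pg_create_logical_replication_slot('cdl_slot', 'pgoutput')");
        }
        // The connector then opens a replication connection and consumes changes from the slot.
    }
}
```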

Apart from differences in some data types, the functions of the PostgreSQL Source Connector are basically the same as those of the MySQL Source Connector.

4.4 Sink Connectors

MRS CDL provides a variety of sink connectors that pull data from Kafka and push it to different target storages. The currently supported Sink Connectors are:

  • HDFS Sink Connector
  • OBS Sink Connector
  • Hudi Sink Connector
  • ClickHouse Sink Connector
  • Hive Sink Connector

Among them, the Hudi Sink Connector and the ClickHouse Sink Connector also support scheduling and running through Flink/Spark applications.

4.5 Table filtering

When we want to capture changes from multiple tables at the same time in one CDL Job, we can use wildcards (regular expressions) instead of table names, so that CDC events are captured for every table whose name matches the rule. When a wildcard (regular expression) does not exactly match the intended targets, extra tables get captured. To address this, CDL provides table filtering to assist in wildcard fuzzy-matching scenarios; currently both whitelist and blacklist filtering are supported.
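A minimal sketch of how whitelist/blacklist filtering can be layered on top of wildcard capture is shown below; the class and rule syntax are illustrative only, not CDL's implementation.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of whitelist/blacklist table filtering on top of wildcard capture. */
public class TableFilter {
    private final List<Pattern> whitelist;
    private final List<Pattern> blacklist;

    TableFilter(List<String> whitelistRegex, List<String> blacklistRegex) {
        this.whitelist = whitelistRegex.stream().map(Pattern::compile).toList();
        this.blacklist = blacklistRegex.stream().map(Pattern::compile).toList();
    }

    /** Capture a table if the whitelist is empty or matched, and no blacklist rule matches. */
    boolean shouldCapture(String qualifiedTableName) {
        boolean allowed = whitelist.isEmpty()
                || whitelist.stream().anyMatch(p -> p.matcher(qualifiedTableName).matches());
        boolean denied = blacklist.stream().anyMatch(p -> p.matcher(qualifiedTableName).matches());
        return allowed && !denied;
    }

    public static void main(String[] args) {
        TableFilter filter = new TableFilter(
                List.of("SALES\\..*"),        // capture every table in the SALES schema ...
                List.of("SALES\\..*_TMP"));   // ... except temporary tables
        System.out.println(filter.shouldCapture("SALES.ORDERS"));     // true
        System.out.println(filter.shouldCapture("SALES.ORDERS_TMP")); // false
    }
}
```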

4.6 Unified Data Format

MRS CDL uses a unified message format in Kafka for the different data source types, such as Oracle, MySQL, and PostgreSQL. Back-end consumers only need to parse one data format for subsequent data processing and transmission, which avoids the increased back-end development cost that diverse data formats would cause.
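As an illustration only, a unified CDC envelope typically carries fields like the ones in the hypothetical record below; the field names are illustrative and are not CDL's actual message schema.

```java
import java.util.Map;

/**
 * Hypothetical unified CDC envelope: whatever the source (Oracle, MySQL, PostgreSQL),
 * consumers see the same fields. Field names are illustrative only.
 */
public record CdcEvent(
        String sourceType,            // "oracle" | "mysql" | "postgresql"
        String schemaName,
        String tableName,
        String operation,             // "INSERT" | "UPDATE" | "DELETE" | "DDL"
        long   commitTimestampMs,
        Map<String, Object> before,   // row image before the change (null for INSERT)
        Map<String, Object> after) {  // row image after the change  (null for DELETE)
}
```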

4.7 Task-level log browsing

Under normal circumstances, a CDL Connector runs multiple Task threads to capture CDC events. When one of the Tasks fails, it is difficult to extract the strongly relevant log information from the massive logs for further analysis.

To solve this problem, CDL standardizes the log output of the CDL Connector and provides a dedicated REST API through which users can obtain the log files of a specified Connector or Task with one click. Users can even specify start and end times to further narrow down the log query.

4.8 Monitoring

MRS CDL provides REST APIs to query the metric information of all core components of the CDL service, covering the service, role, instance, and task levels.

4.9 Application Error Handling

During business operation, it often happens that some messages cannot be sent to the target data source. We call such a message an error record. In CDL, there are many scenarios that produce error records, for example:

  • The message body in the topic does not match the configured serialization method, so it cannot be read normally
  • The table name carried in the message does not exist in the target storage, so the message cannot be delivered to the target

To deal with this kind of problem, CDL defines a "dead letter queue" dedicated to storing the error records produced during operation. In essence, the dead letter queue is a specific topic created by the Sink Connector: when an error record occurs, the Sink Connector sends it to the dead letter queue for storage.
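Kafka Connect itself ships with dead-letter-queue settings for sink connectors that implement this same mechanism; the properties below are the standard Kafka Connect ones and are shown for illustration (the article does not state whether CDL exposes exactly these keys).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standard Kafka Connect dead-letter-queue settings for a sink connector (sketch). */
public class DeadLetterQueueConfig {
    static Map<String, String> sinkErrorHandling() {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("errors.tolerance", "all");                               // keep running on bad records
        cfg.put("errors.deadletterqueue.topic.name", "cdl-dlq-demo");     // topic that stores error records
        cfg.put("errors.deadletterqueue.context.headers.enable", "true"); // attach failure context as headers
        cfg.put("errors.log.enable", "true");                             // also log the failure
        return cfg;
    }
}
```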

At the same time, CDL provides a REST API for users to query these error records at any time for further analysis, as well as a REST API that allows users to edit and resend them.

Figure 6 CDL Application Error Handling

5 Performance

CDL uses several performance optimizations to improve throughput:

  • Task concurrency: we take advantage of the task parallelization provided by Kafka Connect, which can split a job into multiple tasks and replicate the data in parallel, as follows:

Figure 7 Task concurrency

  • Executor-thread parallelism: due to the limitations of log-replication technologies such as LogMiner and the binlog, our Source Connectors can only capture CDC events sequentially. To improve performance, we therefore cache these CDC events in an in-memory queue first and then use Executor threads to process them in parallel: the threads read data from the internal queue, process it, and push it to Kafka.

Figure 8 Executor thread concurrency
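The sketch below illustrates this "sequential capture, parallel processing" pattern with a plain BlockingQueue and thread pool. It is a simplified illustration, not CDL's code: the single capture thread fills the in-memory queue, and executor threads drain it, transform the events, and hand them to a Kafka producer (represented here by a placeholder).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch of the pattern described above: a single reader fills an in-memory queue,
 * and a pool of executor threads drains it, transforms the events, and sends them on.
 */
public class ExecutorPipeline {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    /** Called by the single capture thread (LogMiner / binlog reader). */
    public void enqueue(String rawCdcEvent) throws InterruptedException {
        queue.put(rawCdcEvent); // blocks when the queue is full, providing back pressure
    }

    /** Start the parallel consumers. */
    public void start() {
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        String event = queue.take();      // read from the internal queue
                        String record = transform(event); // convert to the unified format
                        sendToKafka(record);              // push to Kafka
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private String transform(String event) { return event; }            // placeholder
    private void sendToKafka(String record) { /* producer.send(...) */ } // placeholder
}
```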

6 Summary

MRS CDL is an important piece of the puzzle for bringing data into the lake in real time. We still need to further expand and improve in areas such as data consistency, ease of use, multi-component interconnection, and performance, so as to create better value for customers in the future.

This article is published by HUAWEI CLOUD.
