Technical Deep Dive | How to Use ChunJun for Real-Time Data Synchronization

Real-time synchronization is an important ChunJun feature: during the synchronization process, changes are transmitted from the data source and applied to the target system almost simultaneously.

In the real-time synchronization scenario, the focus is on the source. When data in the source system changes, the changes are immediately transmitted and applied to the target system so that the two systems stay consistent. This requires the source plugin to access the source frequently and continuously while the job is running. In production scenarios, we recommend deploying long-running jobs with predictable resource usage and stability requirements in per-job mode.

The plugins support two configuration methods: JSON scripts and SQL scripts. For detailed parameter configuration, please refer to the ChunJun connector documentation: https://sourl.cn/vxq6Zp

This article introduces how to use ChunJun for real-time synchronization, as well as the features, collection logic, and principles of the RDB real-time collection plugins supported by ChunJun, to help you better understand ChunJun and real-time synchronization.

How to use ChunJun to sync in real time

To give a better sense of how to use ChunJun for real-time synchronization, assume the following scenario: an e-commerce website wants to synchronize its order data from a MySQL database to an HBase database in real time for subsequent analysis and processing.

In this scenario, we use Kafka as an intermediate message queue to synchronize data between MySQL and HBase. The advantage is that changes to the MySQL table are synchronized to the HBase result table in real time, without worrying that the HBase table will fall out of sync after historical data is modified.

If, in your actual application, you do not care whether historical data changes (or the historical data never changes at all) and the business table has an incrementing primary key, you can refer to the JDBC-Polling mode section later in this article.

· The deployment of the data source components and of ChunJun itself is not described in detail here

· The scripts in this case study use SQL scripts as examples. JSON scripts can achieve the same functionality, although parameter names may differ. Readers who use JSON scripts can refer to the parameter descriptions in the ChunJun connector documentation linked above

Collect MySQL data to Kafka

● Data preparation

First, create a topic named orders in Kafka, then create an orders table in MySQL and insert some test data. The SQL statements are as follows:

-- Create a database named ecommerce_db to store the e-commerce site's data
CREATE DATABASE IF NOT EXISTS ecommerce_db;
USE ecommerce_db;
-- Create a table named orders to store order information
CREATE TABLE IF NOT EXISTS orders (
 id INT AUTO_INCREMENT PRIMARY KEY, -- auto-increment primary key
 order_id VARCHAR(50) NOT NULL, -- order number, not null
 user_id INT NOT NULL, -- user ID, not null
 product_id INT NOT NULL, -- product ID, not null
 quantity INT NOT NULL, -- order quantity, not null
 order_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP -- order date, defaults to the current timestamp, not null
);
-- Insert some test data into the orders table
INSERT INTO orders (order_id, user_id, product_id, quantity)
VALUES ('ORD123', 1, 101, 2),
('ORD124', 2, 102, 1),
('ORD125', 3, 103, 3),
('ORD126', 1, 104, 1),
('ORD127', 2, 105, 5);

● Using the binlog plugin to collect data into Kafka

To represent and handle data change types, real-time collection plugins generally use the RowKind field of RowData (Flink's internal data structure) to record the event type (insert, delete, etc.) found in the log, and the binlog plugin is no exception. So when the data is sent to Kafka, how should this RowKind information be handled?

This is where upsert-kafka-x comes in: it recognizes RowKind and handles each event type as follows (illustrated after the SQL script below):

• Insert data: serialized and written directly

• Delete data: only the key is written; the value is set to null

• Update data: handled as a delete followed by an insert, i.e. the original row is first deleted by primary key and then the updated row is written

In the next section we will explain how to restore the data in Kafka into HBase or another database that supports upsert semantics. For now, the following SQL script collects MySQL data into Kafka in real time:

CREATE TABLE binlog_source (
 id INT,
 order_id STRING,
 user_id INT,
 product_id INT,
 quantity INT,
 order_date TIMESTAMP(3)
) WITH (
 'connector' = 'binlog-x',
 'username' = 'root',
 'password' = 'root',
 'cat' = 'insert,delete,update',
 'url' = 'jdbc:mysql://localhost:3306/ecommerce_db?useSSL=false',
 'host' = 'localhost',
 'port' = '3306',
 'table' = 'ecommerce_db.orders',
 'timestamp-format.standard' = 'SQL',
 'scan.parallelism' = '1'
);

CREATE TABLE kafka_sink (
 id INT,
 order_id STRING,
 user_id INT,
 product_id INT,
 quantity INT,
 order_date TIMESTAMP(3),
 PRIMARY KEY (id) NOT ENFORCED
) WITH (
 'connector' = 'upsert-kafka-x',
 'topic' = 'orders',
 'properties.bootstrap.servers' = 'localhost:9092',
 'key.format' = 'json',
 'value.format' = 'json',
 'value.fields-include' = 'ALL',
 'sink.parallelism' = '1'
);

INSERT INTO kafka_sink
SELECT *
FROM binlog_source;
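To make the RowKind handling described above concrete, here is a hedged illustration of what roughly lands in the orders topic when rows change in MySQL, assuming the job above is running with the json key/value formats configured in kafka_sink (the exact payload layout depends on the serialization settings):

-- Updating a row in MySQL ...
UPDATE orders SET quantity = 10 WHERE id = 1;
-- ... is split into a delete plus an insert, so the topic roughly receives two records
-- keyed by the primary key:
--   key = {"id":1}, value = null                                  (tombstone for the old row)
--   key = {"id":1}, value = {"id":1,"order_id":"ORD123", ...}     (the updated row)

-- Deleting a row ...
DELETE FROM orders WHERE id = 5;
-- ... produces a single tombstone record:
--   key = {"id":5}, value = null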

Restore the data in Kafka to HBase

In the steps above we collected MySQL data into Kafka in real time through binlog-x and upsert-kafka-x. The plugin that wrote the RowKind information is also the one that can read it back: we use upsert-kafka-x as a source to parse the data in Kafka into rows with upsert semantics.

When upsert-kafka-x is used as a source plugin, it checks whether the value of each Kafka record is null: if it is, the row's RowKind is marked as DELETE; otherwise it is marked as INSERT.

ChunJun's hbase-x plugin currently supports upsert semantics, so the data in Kafka can be restored into HBase with it. The following SQL script does this; to make the results easier to inspect in HBase, the INT columns are cast to STRING:

CREATE TABLE kafka_source (
 id INT,
 order_id STRING,
 user_id INT,
 product_id INT,
 quantity INT,
 order_date TIMESTAMP(3),
 PRIMARY KEY (id) NOT ENFORCED
) WITH (
 'connector' = 'upsert-kafka-x',
 'topic' = 'orders',
 'properties.bootstrap.servers' = 'localhost:9092',
 'properties.group.id' = 'test_group',
 'key.format' = 'json',
 'value.format' = 'json',
 'scan.parallelism' = '1'
);

CREATE TABLE hbase_sink (
 rowkey STRING,
 order_info ROW < order_id STRING,
 user_id STRING,
 product_id STRING,
 quantity STRING,
 order_date STRING >,
 PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
 -- hbase14 is used as an example here; if the HBase version is 2.x, use the hbase2-x plugin instead
 'connector' = 'hbase14-x',
 'zookeeper.quorum' = 'localhost:2181',
 'zookeeper.znode.parent' = '/hbase',
 'table-name' = 'ecommerce_db:orders',
 'sink.parallelism' = '1'
);

INSERT INTO hbase_sink
SELECT
 CAST(id AS STRING),
 ROW(
  CAST(order_id AS STRING),
  CAST(user_id AS STRING),
  CAST(product_id AS STRING),
  CAST(quantity AS STRING),
  CAST(order_date AS STRING)
 )
FROM kafka_source;

Tip: if the Kafka middle layer is not needed, the binlog-x plugin can also be connected directly to the hbase-x plugin, as sketched below.
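A hedged sketch of that direct pipeline, reusing the binlog_source and hbase_sink table definitions from the scripts above (adjust the connection options to your environment):

-- Reuse the binlog_source and hbase_sink DDL from the scripts above, then wire them
-- together directly, casting the numeric columns to STRING as before:
INSERT INTO hbase_sink
SELECT
 CAST(id AS STRING),
 ROW(
  CAST(order_id AS STRING),
  CAST(user_id AS STRING),
  CAST(product_id AS STRING),
  CAST(quantity AS STRING),
  CAST(order_date AS STRING)
 )
FROM binlog_source;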

RDB real-time collection plugins supported by ChunJun

This section introduces the features, collection logic, and principles of ChunJun's RDB real-time collection plugins.

ChunJun's RDB real-time collection monitors the database for changes and reads them as they occur, covering insert, update, and delete operations. This gives us information about database changes in real time, so that we can respond promptly and make better use of the data in the RDB.

ChunJun also provides failure recovery and resume-from-breakpoint capabilities to ensure data integrity. The general implementation steps of a ChunJun real-time collection plugin are as follows:

· Connect to the database and determine the reading position. The position can be understood as an offset; in Binlog, for example, it is the log file name plus the position within that file

· Starting from that position, read the redo log and obtain the operation records related to data changes

· Filter the log records by table name and operation event type (insert, delete, update, etc.)

· Parse the log records; the parsed event contains the table name, database name, operation type (insert, update, or delete), the changed data row, and so on

· Convert the parsed data into ChunJun's unified internal DdlRowData and send it downstream

The real-time collection connectors currently supported by ChunJun are: binlog (MySQL), oceanbasecdc, oraclelogminer, and sqlservercdc.

Introduction to Binlog

The main function of the ChunJun binlog plugin is to read MySQL binary log (binlog) files, which record all changes to the data, such as inserts, updates, and deletes. The plugin currently relies on the Canal component to read MySQL binlog files.

The core operation steps are as follows:

• Determine the reading position: in the binlog plugin, we can directly specify the journal-name (binlog file name) and position (offset within that file) in the start field of the script (see the example after this list)

• Read the binlog: the binlog plugin disguises itself as a MySQL slave node and asks the MySQL master to stream the binlog file data to it

• Failure recovery and resume: the plugin continuously records the current binlog position. After recovering from a checkpoint/savepoint, reading continues from the last recorded position, ensuring that no data changes are lost
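A hedged example of finding a concrete starting position on the MySQL side: SHOW MASTER STATUS reports the current binlog file and offset, which correspond to the journal-name and position mentioned above (the sample output values are only illustrative; the exact field names of the start block are described in the binlog plugin documentation linked below):

-- Run on the MySQL server to see the current binlog file name and offset
SHOW MASTER STATUS;
-- File: mysql-bin.000003, Position: 157   (illustrative values)

-- List all available binlog files if an earlier starting point is needed
SHOW BINARY LOGS;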

The permissions required to use the binlog plugin are detailed in the binlog plugin usage documentation, linked below:

https://sourl.cn/mvae9m

Introduction to Oracle Logminer

The Logminer plugin uses the LogMiner tool provided by Oracle and obtains the information in the Oracle redo log by querying its views.

The core operation steps are as follows:

01 Determine the starting point to read from (start_scn)

The Logminer plugin currently supports four strategies for specifying the start SCN:

· all: collect from the earliest archived log group in the Oracle database (not recommended)

· current: the SCN at the time the job starts

· time: the SCN corresponding to a specified point in time

· scn: a directly specified SCN

02 Determine the end point to read to (end_scn)

The plugin builds the list of redo log files to load based on start_scn and maxLogFileSize (default 5 GB); end_scn is the largest SCN found in that file list.

03 Load the redo logs into LogMiner

The redo logs within the SCN range are loaded into LogMiner through a stored procedure.

04 Read data from the view

The plugin queries the v$logmnr_contents view directly, using scn > ? as the WHERE condition, to obtain the data recorded in the redo log.
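A hedged sketch of what steps 03 and 04 roughly look like at the SQL level, using Oracle's standard DBMS_LOGMNR package and the v$logmnr_contents view; the file name and SCN values are illustrative, and the statements ChunJun actually issues may differ in options and column list:

-- Step 03: load the redo/archived log files covering the SCN range into LogMiner
BEGIN
  DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/u01/app/oracle/arch/arch_0001.arc', OPTIONS => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.START_LOGMNR(STARTSCN => 2000000, ENDSCN => 2000500,
                           OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/

-- Step 04: read the change records within the SCN range from the LogMiner view
SELECT scn, timestamp, operation, seg_owner, table_name, sql_redo
FROM v$logmnr_contents
WHERE scn > ?
  AND operation IN ('INSERT', 'UPDATE', 'DELETE');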

05 Repeat steps 1-4 to achieve continuous reading

The plugin repeats the cycle above to keep reading new changes as they are generated.

06 Failure recovery and resume from breakpoint

When a failure occurs, the plugin saves the SCN it has consumed up to; on restart it resumes reading from that SCN, ensuring data integrity.

• For a detailed introduction to the plugin's principle, please refer to the "Oracle Logminer Implementation Principle" document:

https://sourl.cn/6vqz4b

• For the prerequisites for using the Logminer plugin, see "Oracle Configuration LogMiner":

https://sourl.cn/eteyZY

Introduction to SqlServer CDC

The SqlServerCDC plugin relies on the views provided by SQL Server's CDC agent service to obtain the change information recorded in the log.

The core operation steps are as follows:

01 Determine the starting point to read from (from_lsn)

Currently, SqlserverCDC only supports configuring the LSN directly. If no LSN is configured, the current largest LSN in the database is used as from_lsn.

02 Determine the end point to read to (to_lsn)

The SqlserverCDC plugin periodically (at an interval set by the pollInterval parameter) fetches the current maximum LSN in the database as to_lsn.

03 Read data from the view

The plugin queries the views provided by the CDC agent service for data within the LSN range, and filters it down to the monitored tables and event types.
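A hedged sketch of the equivalent T-SQL using SQL Server's built-in CDC functions; the capture instance name dbo_orders is a made-up example, and ChunJun's actual queries may differ:

-- Steps 01/02: determine the LSN range to read
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- Step 03: read all changes for the capture instance within that range;
-- the __$operation column encodes the event type (1 = delete, 2 = insert, 3/4 = update before/after)
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all update old');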

04 Repeat steps 1-3 to achieve continuous reading

The plugin repeats the cycle above to keep reading new changes as they are generated.

05 Failure recovery and resuming from breakpoints

When a failure occurs, the plugin saves the LSN it has consumed up to; on restart it resumes reading from that LSN, ensuring data integrity.

• For a detailed introduction to the plugin's principle, please refer to the "Sqlserver CDC Implementation Principle" document:

https://sourl.cn/5pQvEM

• For details on configuring the SQL Server CDC agent service, see the "Sqlserver Configuration CDC" document:

https://sourl.cn/h5nd8j

Introduction to OceanBase CDC

OceanBase is a distributed relational database open-sourced by Ant Group; it uses binary logs (binlog) to record data changes. The OceanBaseCDC implementation relies on the LogProxy service provided by OceanBase. LogProxy offers a publish-subscribe service that lets OceanBase's logclient subscribe to specific binlog data streams.

OceanBaseCDC starts a listener thread. When the logclient connects to LogProxy, the listener subscribes to the filtered binlog entries and appends them to an internally maintained list. Once a COMMIT message is received, the listener hands the accumulated change records to a blocking queue; the main thread consumes the queue, converts the records into ChunJun's internal DdlRowData, and finally sends them downstream.

JDBC-Polling mode reading

The JDBC plugin's polling mode reads data using SQL statements. Compared with log-based real-time collection it is cheaper, but real-time synchronization with the JDBC plugin places stricter requirements on the business scenario:

· The table has a numeric or time-typed, incrementing primary key

· Historical data is never updated, or we do not care whether it is updated and only care about collecting new data

Introduction to Implementation Principles

• Configure the incrementing business primary key as the incremental key that polling mode depends on

• During incremental reading, record the latest value of increColumn as state, to be used as the starting point for the next read

• After a batch of data has been read, wait for the polling interval and then read the next batch starting from the recorded state (see the sketch below)
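A hedged sketch of the query shape each polling cycle boils down to; business_table and incre_col are placeholders for the configured table and increColumn, and the SQL ChunJun actually generates may differ in details:

-- Each polling cycle roughly issues a query of this shape, where ? is bound to the
-- largest incre_col value seen in the previous cycle (or the configured start location):
SELECT *
FROM business_table
WHERE incre_col > ?
ORDER BY incre_col;
-- The plugin records the new maximum incre_col value as state, waits pollingInterval
-- milliseconds, and then polls again with the updated state.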

Polling mode reuses part of the incremental synchronization logic. For more information on incremental synchronization, see:

https://sourl.cn/UC8n6K

How to configure a jdbc-polling job

First, here are the configuration items to pay attention to when enabling polling mode:

(Figure: table of polling-mode configuration items; the key parameters, polling, pollingInterval, and increColumn, appear in the example below.)

Taking MySQL as an example, suppose we have a history table that stores order information, whose order_id keeps increasing, and we want to fetch the new data from this table periodically.

CREATE TABLE `order`.realtime_order_archive (
 order_id INT PRIMARY KEY COMMENT 'unique order ID',
 customer_id INT COMMENT 'unique customer ID',
 product_id INT COMMENT 'unique product ID',
 order_date TIMESTAMP COMMENT 'order date and time',
 payment_method VARCHAR(255) COMMENT 'payment method (credit card, Alipay, WeChat Pay, etc.)',
 shipping_method VARCHAR(255) COMMENT 'shipping method (SF Express, YTO Express, etc.)',
 shipping_address VARCHAR(255) COMMENT 'shipping address',
 order_total DECIMAL(10,2) COMMENT 'order total amount',
 discount DECIMAL(10,2) COMMENT 'discount amount',
 order_status VARCHAR(255) COMMENT 'order status (completed, cancelled, etc.)'
);

The reader section of the JSON script can then be configured like this:

"name": "mysqlreader", 
"parameter": { 
"column" : [ 
"*" //这⾥假设我们读取所有字段,可以填写‘*’ 
], 
"increColumn": "id", 
"polling": true, 
"pollingInterval": 3000, 
"username": "username", 
"password": "password", 
"connection": [ 
{ 
"jdbcUrl": [ 
"jdbc:mysql://ip:3306/liuliu?useSSL=false" 
], 
"schema":"order", 
"table": [ 
"realtime_order_archive" ] 
} 
] 
} 
}

"White Paper on Data Stack Products": https://fs80.cn/cw0iw1

"Data Governance Industry Practice White Paper" download address: https://fs80.cn/380a4b

To learn more about or inquire about Kangaroo Cloud's big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Developers interested in big data open source projects are also welcome to join the Kangaroo Cloud Open Source Framework DingTalk technology group (group number: 30537511) to exchange the latest open source technology news. Project address: https://github.com/DTStack
