Technical Deep Dive | How to Use ChunJun for Offline Data Synchronization

ChunJun is a stable, easy-to-use, and efficient batch-stream unified data integration framework built on the Flink compute engine. It implements data synchronization and computation across multiple heterogeneous data sources, and can logically or physically consolidate data with different origins, formats, and characteristics, providing enterprises with comprehensive data sharing. It has been deployed in thousands of companies and runs stably.

Previously, we introduced how to use ChunJun for real-time data synchronization. As a companion piece, this article explains how to use ChunJun for offline data synchronization.

ChunJun Offline Synchronization Example

Offline synchronization is an important feature of ChunJun. The following uses the most common MySQL -> Hive synchronization task as an example to walk through offline synchronization.

Configure the Environment

Find an empty directory and set up the Flink and ChunJun environments there. The following uses /root/chunjun_demo/ as an example.

● Configure Flink

Download and extract Flink:

wget "http://archive.apache.org/dist/flink/flink-1.12.7/flink-1.12.7-bin-scala_2.12.tgz"
tar -zxvf chunjun-dist.tar.gz

● Configure ChunJun

# Download ChunJun; it depends on Flink 1.12.7 internally
wget https://github.com/DTStack/chunjun/releases/download/v1.12.8/chunjun-dist-1.12-SNAPSHOT.tar.gz
# Create a new directory
mkdir chunjun && cd chunjun
# Extract the package into this directory
tar -zxvf ../chunjun-dist-1.12-SNAPSHOT.tar.gz

The extracted ChunJun contains the following directories: bin, chunjun-dist, chunjun-examples, lib.

● Configure environment variables

# Configure the Flink environment variable
echo "FLINK_HOME=/root/chunjun_demo/flink-1.12.7" >> /etc/profile.d/sh.local
# Configure the ChunJun environment variable
echo "CHUNJUN_DIST=/root/chunjun_demo/chunjun/chunjun-dist" >> /etc/profile.d/sh.local
# Reload the environment variables
. /etc/profile.d/sh.local

● Start a Flink Session on YARN

# Start the Flink session
bash $FLINK_HOME/bin/yarn-session.sh -t $CHUNJUN_DIST -d

The output is as follows (it includes instructions on how to stop the Flink session later):

echo "stop" | $FLINK_HOME/bin/yarn-session.sh -id application_1683599622970_0270
If this should not be possible, then you can also kill Flink via YARN's web interface or via:
yarn application -kill application_1683599622970_0270

The YARN application ID of the Flink session (application_1683599622970_0270) will be used when submitting the task below.

● Other configurations

If you use the Parquet file format, you need to put flink-parquet_2.12-1.12.7.jar under Flink's lib directory; in the example above, that is $FLINK_HOME/lib.


Submit the Task

● Prepare data in MySQL

-- Create a database for storing the e-commerce site's data
CREATE DATABASE IF NOT EXISTS chunjun;
USE chunjun;
-- Create a table named orders for storing order information
CREATE TABLE IF NOT EXISTS orders (
 id INT AUTO_INCREMENT PRIMARY KEY, -- Auto-increment primary key
 order_id VARCHAR(50) NOT NULL, -- Order number, not null
 user_id INT NOT NULL, -- User ID, not null
 product_id INT NOT NULL, -- Product ID, not null
 quantity INT NOT NULL, -- Order quantity, not null
 order_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP -- Order date, defaults to the current timestamp, not null
);
-- Insert some test data into the orders table
INSERT INTO orders (order_id, user_id, product_id, quantity)
VALUES ('ORD123', 1, 101, 2),
       ('ORD124', 2, 102, 1),
       ('ORD125', 3, 103, 3),
       ('ORD126', 1, 104, 1),
       ('ORD127', 2, 105, 5);

SELECT * FROM chunjun.orders;

If you don't have a MySQL instance, you can quickly create one with Docker.

docker pull mysql:8.0.12
docker run --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=123456 -d mysql:8.0.12

● Create Hive tables

CREATE DATABASE IF NOT EXISTS chunjun;
USE chunjun;
-- Create a table named orders for storing order information
CREATE TABLE IF NOT EXISTS chunjun.orders (
 id INT,
 order_id VARCHAR(50),
 user_id INT,
 product_id INT,
 quantity INT,
 order_date TIMESTAMP
)
STORED AS PARQUET;
-- Check the HDFS location backing the Hive table: the Location field in the output of the
-- following statement is the HDFS path.
DESC FORMATTED chunjun.orders;
-- Location: hdfs://ns1/dtInsight/hive/warehouse/chunjun.db/orders
-- This path (hdfs://ns1/dtInsight/hive/warehouse/chunjun.db/orders) will be used later when configuring the synchronization task.

● Configure a task file mysql_hdfs.json in the current directory (/root/chunjun_demo/)

Run vim mysql_hdfs.json and enter the following content:

{
  "job": {
    "content": [
      {
        "reader": {
          "parameter": {
            "connection": [
              {
                "schema": "chunjun",
                "jdbcUrl": [ "jdbc:mysql://172.16.85.200:3306/chunjun" ],
                "table": [ "orders" ]
              }
            ],
            "username": "root",
            "password": "123456",
            "column": [
              { "name": "id", "type": "INT" },
              { "name": "order_id", "type": "VARCHAR" },
              { "name": "user_id", "type": "INT" },
              { "name": "product_id", "type": "INT" },
              { "name": "quantity", "type": "INT" },
              { "name": "order_date", "type": "TIMESTAMP" }
            ]
          },
          "name": "mysqlreader"
        },
        "writer": {
          "parameter": {
            "path": "hdfs://ns1/dtInsight/hive/warehouse/chunjun.db/orders",
            "defaultFS": "hdfs://ns1",
            "hadoopConfig": {
              "dfs.nameservices": "ns1",
              "dfs.ha.namenodes.ns1": "nn1,nn2",
              "dfs.namenode.rpc-address.ns1.nn1": "172.16.85.194:9000",
              "dfs.namenode.rpc-address.ns1.nn2": "172.16.85.200:9000",
              "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            },
            "column": [
              { "name": "id", "type": "INT" },
              { "name": "order_id", "type": "VARCHAR" },
              { "name": "user_id", "type": "INT" },
              { "name": "product_id", "type": "INT" },
              { "name": "quantity", "type": "INT" },
              { "name": "order_date", "type": "TIMESTAMP" }
            ],
            "writeMode": "overwrite",
            "encoding": "utf-8",
            "fileType": "parquet",
            "fullColumnName": [ "id", "order_id", "user_id", "product_id", "quantity", "order_date" ],
            "fullColumnType": [ "INT", "VARCHAR", "INT", "INT", "INT", "TIMESTAMP" ]
          },
          "name": "hdfswriter"
        }
      }
    ],
    "setting": {
      "errorLimit": {
        "record": 0
      },
      "speed": {
        "bytes": 0,
        "channel": 1
      }
    }
  }
}

We want to synchronize MySQL data into Hive, but writing to Hive directly goes through JDBC, and JDBC throughput is low. Instead, we synchronize the data straight to Hive's underlying HDFS directory, which is why the writer is hdfswriter. The script is annotated as follows:

{
  "job": {
    "content": [
      {
        "reader": {
          "parameter": {
            "connectionComment": "Database connection: JDBC URL, schema, table, username, password",
            "connection": [
              {
                "schema": "chunjun",
                "jdbcUrl": [ "jdbc:mysql://172.16.85.200:3306/chunjun" ],
                "table": [ "orders" ]
              }
            ],
            "username": "root",
            "password": "123456",
            "columnComment": "Columns to synchronize; a subset of the table's columns may be selected",
            "column": [
              { "name": "id", "type": "INT" },
              { "name": "order_id", "type": "VARCHAR" },
              { "name": "user_id", "type": "INT" },
              { "name": "product_id", "type": "INT" },
              { "name": "quantity", "type": "INT" },
              { "name": "order_date", "type": "TIMESTAMP" }
            ]
          },
          "nameComment": "The source is MySQL",
          "name": "mysqlreader"
        },
        "writer": {
          "parameter": {
            "pathComment": "Path on HDFS, taken from the Location field shown by the Hive statement desc formatted",
            "path": "hdfs://ns1/dtInsight/hive/warehouse/chunjun.db/orders",
            "defaultFS": "hdfs://ns1",
            "hadoopConfigComment": "The basic HDFS high-availability settings; they can be found in the Hadoop configuration file hdfs-site.xml",
            "hadoopConfig": {
              "dfs.nameservices": "ns1",
              "dfs.ha.namenodes.ns1": "nn1,nn2",
              "dfs.namenode.rpc-address.ns1.nn1": "172.16.85.194:9000",
              "dfs.namenode.rpc-address.ns1.nn2": "172.16.85.200:9000",
              "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            },
            "columnComment": "Columns to synchronize; a subset of the table's columns may be selected",
            "column": [
              { "name": "id", "type": "INT" },
              { "name": "order_id", "type": "VARCHAR" },
              { "name": "user_id", "type": "INT" },
              { "name": "product_id", "type": "INT" },
              { "name": "quantity", "type": "INT" },
              { "name": "order_date", "type": "TIMESTAMP" }
            ],
            "writeModeComment": "How to write the files on HDFS: overwrite, or append (the default)",
            "writeMode": "overwrite",
            "encoding": "utf-8",
            "fileTypeComment": "One of orc, parquet, text",
            "fileType": "parquet",
            "fullColumnNameComment": "All columns of the table; sometimes column only lists a subset, but the full column layout is still needed, e.g. when fileType is text",
            "fullColumnName": [ "id", "order_id", "user_id", "product_id", "quantity", "order_date" ],
            "fullColumnTypeComment": "Types of all columns",
            "fullColumnType": [ "INT", "VARCHAR", "INT", "INT", "INT", "TIMESTAMP" ]
          },
          "nameComment": "The sink is HDFS",
          "name": "hdfswriter"
        }
      }
    ],
    "setting": {
      "errorLimit": {
        "record": 0
      },
      "speed": {
        "bytes": 0,
        "channel": 1
      }
    }
  }
}

● Submit a task

bash chunjun/bin/chunjun-yarn-session.sh -job mysql_hdfs.json -confProp {\"yarn.application.id\":\"application_1683599622970_0270\"}

● View tasks


After the synchronization task completes, you can check the data on HDFS.


Then check the data in the Hive table.


Note that if the Hive table is partitioned, you need to manually refresh the Hive metadata with the MSCK command (MSCK is a Hive command that checks which partitions exist on HDFS and adds them to the Hive metastore).

MSCK REPAIR TABLE my_table;

Analysis of ChunJun Offline Synchronization Principle

HDFS file synchronization principle

· For a file system sink, data is first written during synchronization into a .data directory under path + [filename]. If the task fails, the files left in .data do not take effect.

· When all subtasks on the TaskManagers have finished, the finalizeGlobal method of FinalizeOnMaster is executed on the JobManager, which finally calls moveAllTmpDataFileToDir to move the files in .data one level up, out of the .data directory.

public interface FinalizeOnMaster {

    /**
     * The method is invoked on the master (JobManager) after all (parallel) instances of an
     * OutputFormat finished.
     *
     * @param parallelism The parallelism with which the format or functions was run.
     * @throws IOException The finalization may throw exceptions, which may cause the job to abort.
     */
    void finalizeGlobal(int parallelism) throws IOException;
}

// Executed on the JobManager
@Override
protected void moveAllTmpDataFileToDir() {
    if (fs == null) {
        openSource();
    }
    String currentFilePath = "";
    try {
        Path dir = new Path(outputFilePath);
        Path tmpDir = new Path(tmpPath);

        FileStatus[] dataFiles = fs.listStatus(tmpDir);
        for (FileStatus dataFile : dataFiles) {
            currentFilePath = dataFile.getPath().getName();
            fs.rename(dataFile.getPath(), dir);
            LOG.info("move temp file:{} to dir:{}", dataFile.getPath(), dir);
        }
        fs.delete(tmpDir, true);
    } catch (IOException e) {
        throw new ChunJunRuntimeException(
                String.format(
                        "can't move file:[%s] to dir:[%s]", currentFilePath, outputFilePath),
                e);
    }
}

Incremental synchronization

Incremental synchronization is mainly for tables that only receive INSERT operations. As the business grows, the data in such tables keeps increasing, and synchronizing the whole table on every run costs more and more time and resources. An incremental synchronization feature is therefore needed so that each run reads only the newly added data.

● Implementation principle

The implementation principle is to splice a filter condition on the increment key into the query SQL, such as WHERE id > ?, so that data that has already been read is filtered out.

Incremental synchronization only makes sense across two or more synchronization jobs. The first incremental job actually synchronizes the entire table; what distinguishes it from an ordinary job is that when it finishes it records an endLocation indicator and uploads this indicator to Prometheus for subsequent jobs to use.

Every subsequent incremental job then uses the endLocation of the previous job as its own filter starting point (startLocation). For example, if the endLocation after the first job is 10, the next job constructs an SQL statement such as SELECT id, name, age FROM table WHERE id > 10, so that only the newly added data is read.
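To make the splicing concrete, below is a minimal illustrative sketch of this logic. It is not ChunJun's actual code: the names increColumn and startLocation are simply placeholders for the increment key and the endLocation recorded by the previous job.

public class IncrementalFilterSketch {
    public static void main(String[] args) {
        // Hypothetical inputs: the increment key and the endLocation recorded by the
        // previous job (null on the very first run, which syncs the whole table).
        String baseSql = "SELECT id, name, age FROM orders";
        String increColumn = "id";
        Long startLocation = 10L;

        // Splice the increment filter onto the base query when a start location exists.
        String sql = (startLocation == null)
                ? baseSql
                : baseSql + " WHERE " + increColumn + " > " + startLocation;

        System.out.println(sql);
        // Prints: SELECT id, name, age FROM orders WHERE id > 10
    }
}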

● Restrictions on use

· Only the RDB Reader plug-in can be used

It is implemented by constructing SQL filter statements, so it can only be used with RDB plug-ins.

· Incremental synchronization only cares about reading, not writing, so it is only related to the Reader plug-in

· The increment field can only be of numeric type and time type

· The indicator needs to be uploaded to Prometheus, and Prometheus does not support strings, so only numeric and time types are supported; time values are converted to timestamps before being uploaded.

· Values of the increment key may repeat, but they must be increasing.

· Because '>' is used, the increment field is required to be increasing.

Breakpoint resume

Breakpoint resume is aimed at offline synchronization. For long-running synchronization tasks (for example, longer than one day), if the task fails partway through for some reason, starting over from the beginning is very costly. A resume capability is therefore needed so that the task can continue from where it failed.

● Implementation principle

· It is based on Flink's checkpoint mechanism: at checkpoint time, the source stores the value of a designated field of the last record it has read, and the sink plug-in commits its transaction at the same point.

· When the task fails and is later re-run from the checkpoint, the source uses the value saved in the state as a filter condition when generating the SELECT statement, so that reading resumes from the last failure point.

· When jdbcInputFormat splices the SQL it reads with, if the state restored from the checkpoint is not empty and restoreColumn is not empty, the checkpointed value is used as the starting point for reading data (see the sketch below).
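The sketch below illustrates, under the same assumptions, how a value kept in checkpoint state could be turned back into such a filter on recovery. FormatState, restoreColumn, and buildReadSql are illustrative names, not ChunJun's exact API.

public class ResumeFilterSketch {
    // Hypothetical holder for the restore-column value saved at the last checkpoint.
    static class FormatState {
        Object state;
    }

    // If a non-empty checkpoint state and a restore column exist, splice the state value
    // into the WHERE condition so reading resumes from the last failure point.
    static String buildReadSql(String baseSql, String restoreColumn, FormatState formatState) {
        if (formatState != null && formatState.state != null && restoreColumn != null) {
            return baseSql + " WHERE " + restoreColumn + " > " + formatState.state;
        }
        return baseSql; // no checkpoint state: read from the beginning
    }

    public static void main(String[] args) {
        FormatState restored = new FormatState();
        restored.state = 25000L; // value recovered from the Flink checkpoint

        String sql = buildReadSql("SELECT id, order_id, quantity FROM orders", "id", restored);
        System.out.println(sql);
        // Prints: SELECT id, order_id, quantity FROM orders WHERE id > 25000
    }
}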

● Applicable scenarios

From the principle above, the source must be an RDB-type plug-in, because resuming is implemented by splicing a WHERE condition onto the SELECT statement to filter out already-synchronized data. In addition, breakpoint resume requires a field to be designated as the filter condition, and that field must be incremental.

· The task needs to enable checkpoint

· The reader must be an RDB plug-in (all RDB reader plug-ins are supported), and the writer must support transactions (for example the RDB and file-system writers). If the downstream is idempotent, the writer plug-in does not need to support transactions.

· The field used for resuming must be incremental in the source table, because the filter condition uses >.

"Dutstack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address: https://www.dtstack.com/resources/1001?src=szsm If you want to know or consult more about Kangaroo Cloud big data products, industry solutions, and customer cases, visit Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Anyone interested in big data open source projects is also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" (group number: 30537511) to exchange the latest open source technology information. Project address: https://github.com/DTStack
