DataX, Alibaba's open-source offline synchronization tool: full and incremental synchronization from SQL Server to MySQL on Windows

Scenario

Kettle, an open-source ETL toolset, has already been used to synchronize data from SQL Server to MySQL tables on a Windows server:

Kettle - open-source ETL toolset - data synchronization from SqlServer to MySQL, deployed on a Windows server (CSDN blog)

With Kettle covered in that post, this one records the use of DataX, Alibaba's open-source synchronization tool for heterogeneous data sources.

DataX

DataX is an offline synchronization tool for heterogeneous data sources, dedicated to providing stable and efficient data synchronization between relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and other heterogeneous data sources.

GitHub - alibaba/DataX: DataX is an open source version of Alibaba Cloud DataWorks data integration.

 

Design concept

To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of pairwise synchronization links into a star-shaped data link, with DataX acting as the intermediate transmission carrier that connects the various data sources. When a new data source needs to be supported, it only has to be connected to DataX to synchronize seamlessly with all existing data sources.
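The benefit of the star topology can be seen with a quick count: synchronizing n sources pairwise needs a dedicated link per ordered (source, target) pair, while the star only needs one reader plugin and one writer plugin per source. A back-of-the-envelope sketch (illustration only, not DataX code):

```python
# Mesh topology: a dedicated sync link for every ordered (source, target) pair.
def mesh_links(n):
    return n * (n - 1)

# Star topology (DataX): one reader and one writer plugin per data source.
def star_plugins(n):
    return 2 * n

for n in (5, 10, 20):
    print(n, mesh_links(n), star_plugins(n))  # e.g. n=10: 90 mesh links vs 20 plugins
```

With 20 data sources, that is 380 pairwise links versus 40 plugins, which is why adding a new source only requires writing one reader and one writer.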

Current status

DataX is widely used within Alibaba Group, where it handles all offline synchronization of big data and has been running stably for six years. It currently completes more than 80,000 synchronization jobs per day, transferring over 300 TB of data daily.

Data sources supported by DataX

GitHub - alibaba/DataX: DataX is an open source version of Alibaba Cloud DataWorks data integration.

A concrete example is recorded below: synchronizing data from SQL Server to MySQL, where the two tables have the same structure.


Implementation

1. DataX installation on Windows

Refer to the quick start document on the official website:

DataX/userGuid.md at master · alibaba/DataX · GitHub

Install and configure the required environment dependencies.

 

There is no need to compile DataX yourself, so only the JDK 1.8 and Python 3 environment variables need to be configured.

Download the DataX toolkit from the address given in the document:

https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202210/datax.tar.gz

After downloading, unzip it.

2. Start and test stream2stream data conversion

After decompression, go to the bin directory and create a new job configuration file named stream2stream.json with the following content:

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
       }
    }
  }
}

This is the official template example, used to self-check whether DataX is configured correctly and starts successfully.
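Before handing a job file to datax.py, it can be sanity-checked with a few lines of Python. This is a hypothetical helper, not part of DataX; it only verifies that the file is valid JSON and follows the job -> content -> reader/writer layout of the template above:

```python
import json

# Hypothetical pre-flight check (not part of DataX): parse a job definition
# and verify every content item carries both a reader and a writer.
def check_job(text: str) -> dict:
    job = json.loads(text)  # raises json.JSONDecodeError if the file is malformed
    for item in job["job"]["content"]:
        assert "reader" in item and "writer" in item
    return job

# For the real file: check_job(open("stream2stream.json", encoding="utf-8").read())
minimal = '{"job": {"content": [{"reader": {"name": "streamreader"}, "writer": {"name": "streamwriter"}}], "setting": {"speed": {"channel": 5}}}}'
job = check_job(minimal)
print(job["job"]["content"][0]["reader"]["name"])  # streamreader
```

Catching a malformed file this way is faster than waiting for a DataX startup error.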

Then open cmd in the bin directory and execute

python datax.py  ./stream2stream.json

Wait for the execution to complete; no error is reported, but the Chinese output is garbled.

 

To fix the garbled Chinese characters in the DataX console window, the encoding needs to be set. First enter in cmd:

chcp 65001

Then execute the above command again.

The Chinese output during execution, as well as the final result, is no longer garbled.

 

3. Obtain the JSON template for other data source combinations

The above converts from stream to stream. To get the JSON template for other data sources, DataX provides a command that prints the configuration template for any reader/writer pair.

You can view the configuration template with the command:

python datax.py -r {YOUR_READER} -w {YOUR_WRITER}

The data source names follow a fixed pattern. For example, to read from SQL Server and write to MySQL, the command to get the JSON template is:

python datax.py -r sqlserverreader -w mysqlwriter

A JSON template from sqlserver to mysql is returned.

These reader and writer names work because each corresponds to a plugin directory of the same name in the DataX source code.

Inside each plugin directory there is a doc subdirectory containing an example JSON file, and every configuration parameter is documented there as well.

 

sqlserverreader parameter description

DataX/sqlserverreader.md at master · alibaba/DataX · GitHub

mysqlwriter parameter description

DataX/mysqlwriter.md at master · alibaba/DataX · GitHub

So a new job file for the full synchronization is created, named sqlserver2mysqlALL.json:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:sqlserver://localhost:1433;DatabaseName=your_database"
                                ],
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "username": "your_username",
                        "column": [
                            "checkid", "cardID", "hphm", "startTime", "endTime",
                            "linenumber", "cwgt", "cwgtUL", "cwgtJudge",
                            "cwkc", "cwkcResult", "cwkcUL", "cwkcJudge",
                            "cwkk", "cwkkResult", "cwkkUL", "cwkkJudge",
                            "cwkg", "cwkgResult", "cwkgUL", "cwkgJudge",
                            "wkccJudge"
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "checkid", "cardID", "hphm", "startTime", "endTime",
                            "linenumber", "cwgt", "cwgtUL", "cwgtJudge",
                            "cwkc", "cwkcResult", "cwkcUL", "cwkcJudge",
                            "cwkk", "cwkkResult", "cwkkUL", "cwkkJudge",
                            "cwkg", "cwkgResult", "cwkgUL", "cwkgJudge",
                            "wkccJudge"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/your_database?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "preSql": [
                            "delete from vehicleresult"
                        ],
                        "session": [],
                        "username": "your_username",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "5"
            }
        }
    }
}

Note that the job reads the specified columns from SQL Server; the column arrays above list them.

Before writing to MySQL, the delete statement configured in preSql is executed first:

delete from vehicleresult

where vehicleresult is the table name. The write mode is insert.

Then run the job with:

python datax.py  ./sqlserver2mysqlALL.json

This achieves a full synchronization.

Note that the table structures on both sides, including column types, lengths, and nullability, must be consistent.

For example, if a field in SQL Server is nullable and contains NULL values, but the corresponding MySQL field is declared NOT NULL, those rows will be treated as dirty data during synchronization and the job will fail.
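Since the reader and writer column lists must line up, a small check can catch a mismatch before the job runs. This is a hypothetical helper, not part of DataX; it assumes the job dict follows the layout of the file above:

```python
# Hypothetical pre-flight check (not part of DataX): verify that the reader's
# and writer's column lists name the same columns in the same order.
def columns_match(job: dict) -> bool:
    item = job["job"]["content"][0]
    reader_cols = item["reader"]["parameter"]["column"]
    writer_cols = item["writer"]["parameter"]["column"]
    return reader_cols == writer_cols

# A trimmed-down job dict for illustration:
job = {
    "job": {"content": [{
        "reader": {"parameter": {"column": ["checkid", "cardID", "hphm"]}},
        "writer": {"parameter": {"column": ["checkid", "cardID", "hphm"]}},
    }]}
}
print(columns_match(job))  # True
```

A mismatch here would otherwise surface only as dirty-data errors at run time.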

The result of the full synchronization above:

 

4. Each execution of the above command performs a full synchronization, so a bat script is needed to run the command on a schedule.

Create a new bat file and modify the content to

:: set the console encoding
chcp 65001
@echo off
title "Sync Data"
set INTERVAL=15
timeout %INTERVAL%
 
:Again

python datax.py  ./sqlserver2mysqlALL.json

echo %date% %time:~0,8%
 
timeout %INTERVAL%
 
goto Again

The script above executes the following command every 15 seconds:

python datax.py  ./sqlserver2mysqlALL.json

Put this bat file in the bin directory, at the same level as the json file, and double-click it to run.
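If a bat loop is inconvenient, the same polling loop can be sketched in Python. This is a hypothetical alternative, not part of DataX; the interval and command are parameters, and a real deployment would loop forever rather than a fixed number of times:

```python
import subprocess
import sys
import time

# Hypothetical alternative to the bat loop: run `cmd` every `interval`
# seconds, `max_runs` times, collecting each run's exit code.
def run_every(cmd, interval, max_runs):
    results = []
    for _ in range(max_runs):
        proc = subprocess.run(cmd)       # blocks until the job finishes
        results.append(proc.returncode)  # 0 means the run succeeded
        time.sleep(interval)
    return results

# Real usage would be:
# run_every(["python", "datax.py", "./sqlserver2mysqlALL.json"], 15, 10)
codes = run_every([sys.executable, "-c", "print('sync ok')"], 0, 2)
print(codes)  # [0, 0]
```

Checking the exit code also makes it easy to log or alert on failed synchronization runs, which the bat loop silently ignores.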

5. The above is a full synchronization; how can an incremental update be achieved?

Note that the incremental update here has constraints. First, rows in the source are never deleted, only added or updated, and updates only touch the current day's data.

So first run the full synchronization above to fetch all existing data on the initial hookup; afterwards, the scheduled task only needs to query and replace the current day's data.

In addition, a datetime field is required, so that both reads and writes can use a where condition restricted to the current day.

Also, the primary key here is not an auto-incrementing int; otherwise incremental updates could be driven by the auto-increment id instead. The SQL Server database is provided by a third-party system, so its schema cannot be changed.

 

Modify the above sqlserverreader to add a where condition that selects the current day's data.

Query the current day's data in SQL Server:

where datediff(day,startTime,getdate())=0

where startTime is a datetime field.

Query the current day's data in MySQL:

WHERE DATE(startTime) = CURDATE()

So modify the above json file as follows:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:sqlserver://localhost:1433;DatabaseName=your_database"
                                ],
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "username": "your_username",
                        "where": "datediff(day,startTime,getdate())=0",
                        "column": [
                            "checkid", "cardID", "hphm", "startTime", "endTime",
                            "linenumber", "cwgt", "cwgtUL", "cwgtJudge",
                            "cwkc", "cwkcResult", "cwkcUL", "cwkcJudge",
                            "cwkk", "cwkkResult", "cwkkUL", "cwkkJudge",
                            "cwkg", "cwkgResult", "cwkgUL", "cwkgJudge",
                            "wkccJudge"
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "checkid", "cardID", "hphm", "startTime", "endTime",
                            "linenumber", "cwgt", "cwgtUL", "cwgtJudge",
                            "cwkc", "cwkcResult", "cwkcUL", "cwkcJudge",
                            "cwkk", "cwkkResult", "cwkkUL", "cwkkJudge",
                            "cwkg", "cwkgResult", "cwkgUL", "cwkgJudge",
                            "wkccJudge"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/your_database?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "preSql": [
                            "delete from your_table WHERE DATE(startTime) = CURDATE();"
                        ],
                        "session": [],
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "5"
            }
        }
    }
}
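The incremental job differs from the full one in only two places: the reader gains a where clause, and the writer's preSql deletes only the current day's rows. That transformation can be sketched in Python (a hypothetical helper, not part of DataX; the table name is a parameter):

```python
import copy

# Hypothetical helper: derive the incremental DataX job from the full-sync
# job by adding the reader's `where` and a dated delete in the writer's preSql.
def make_incremental(job: dict, table: str) -> dict:
    inc = copy.deepcopy(job)  # leave the full-sync config untouched
    item = inc["job"]["content"][0]
    item["reader"]["parameter"]["where"] = "datediff(day,startTime,getdate())=0"
    item["writer"]["parameter"]["preSql"] = [
        f"delete from {table} WHERE DATE(startTime) = CURDATE();"
    ]
    return inc

# A trimmed-down full-sync job dict for illustration:
full = {"job": {"content": [{
    "reader": {"parameter": {}},
    "writer": {"parameter": {"preSql": ["delete from vehicleresult"]}},
}]}}
inc = make_incremental(full, "vehicleresult")
print(inc["job"]["content"][0]["reader"]["parameter"]["where"])
```

Keeping both variants derived from one base config avoids the two files drifting apart when columns change.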

Now the bat script can be used to run it on a schedule; adjust the 15-second interval as needed.

Then add another row of today's data to test the synchronization.

The incremental update json above is named sqlserver2mysqlAdd.json.

 


Origin blog.csdn.net/BADAO_LIUMANG_QIZHI/article/details/130330353