Scenario
The use of Kettle - an open-source ETL toolset that synchronizes data from SQL Server to MySQL tables and runs on Windows servers - was covered earlier. What follows is a record of DataX, Alibaba's open-source synchronization tool for heterogeneous data sources.
DataX
DataX is an offline synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between relational databases (MySQL, Oracle, etc.) and heterogeneous stores such as HDFS, Hive, ODPS, HBase, and FTP.
GitHub - alibaba/DataX: DataX is an open source version of Alibaba Cloud DataWorks data integration.
Design concept
To solve the synchronization problem across heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link: DataX acts as the intermediate transport hub that connects every data source. When a new data source needs to be supported, it only has to be connected to DataX to synchronize seamlessly with all existing data sources.
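The benefit of the star topology can be made concrete with a little arithmetic (this is an illustration, not a figure from the DataX docs): with n data sources, point-to-point synchronization needs a dedicated link for every ordered pair, while the star design only needs one reader and one writer plugin per source.

```python
def mesh_links(n: int) -> int:
    """Point-to-point: every source needs a dedicated sync link to every other source."""
    return n * (n - 1)

def star_links(n: int) -> int:
    """Star: every source only needs one reader and one writer plugin for DataX."""
    return 2 * n

# With 10 heterogeneous data sources:
print(mesh_links(10))  # 90 dedicated sync links to build and maintain
print(star_links(10))  # 20 plugins in total
```

Adding an 11th source to the mesh means 20 new links; with the star, it means writing one reader and one writer.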
Current status
DataX is widely used within Alibaba Group, where it handles all offline big-data synchronization and has been running stably for six years.
It currently completes more than 80,000 synchronization jobs per day, transferring over 300 TB of data daily.
Data sources supported by DataX
GitHub - alibaba/DataX: DataX is an open source version of Alibaba Cloud DataWorks data integration.
A concrete example follows: synchronizing data from SQL Server to MySQL, where the two tables have the same structure.
Implementation
1. DataX installation on Windows
Refer to the quick start document on the official website:
DataX/userGuid.md at master · alibaba/DataX · GitHub
Install and configure the required environment dependencies.
There is no need to compile DataX yourself, so only the JDK 1.8 and Python 3 environment variables need to be configured.
Download the DataX toolkit from the address given in the document:
https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202210/datax.tar.gz
After downloading, unzip it.
2. Start and test stream2stream data conversion
After decompression, go to the bin directory and create a new job configuration file named stream2stream.json with the following content:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,你好,世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
This is the official template example, used to self-check whether DataX is configured correctly and starts successfully.
Then open cmd in the bin directory and execute
python datax.py ./stream2stream.json
Wait for the execution to complete. No errors were reported, but the Chinese characters in the output were garbled.
To fix the garbled Chinese in the DataX console window, the encoding must be set first. In cmd, enter:
chcp 65001
Then execute the command above again; the Chinese output during and after execution is no longer garbled.
3. Obtain the JSON template for other data source pairs.
The above converts data from stream to stream; how do we get the JSON template for other data sources?
DataX provides a command that prints the configuration template for any reader/writer pair:
python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
To find the data source names: for example, to read from SQL Server and write to MySQL, the command to get the JSON template is:
python datax.py -r sqlserverreader -w mysqlwriter
This returns a JSON template for SQL Server to MySQL.
The names sqlserverreader and mysqlwriter come from the plugin directory names in the DataX source code; each plugin's doc directory contains an example JSON file, and every configuration parameter has a corresponding description.
sqlserverreader parameter description
DataX/sqlserverreader.md at master · alibaba/DataX · GitHub
mysqlwriter parameter description
DataX/mysqlwriter.md at master · alibaba/DataX · GitHub
So create a new JSON file for the full update, sqlserver2mysqlALL.json:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:sqlserver://localhost:1433;DatabaseName=your_database"
                                ],
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "username": "your_username",
                        "column": [
                            "checkid",
                            "cardID",
                            "hphm",
                            "startTime",
                            "endTime",
                            "linenumber",
                            "cwgt",
                            "cwgtUL",
                            "cwgtJudge",
                            "cwkc",
                            "cwkcResult",
                            "cwkcUL",
                            "cwkcJudge",
                            "cwkk",
                            "cwkkResult",
                            "cwkkUL",
                            "cwkkJudge",
                            "cwkg",
                            "cwkgResult",
                            "cwkgUL",
                            "cwkgJudge",
                            "wkccJudge"
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "checkid",
                            "cardID",
                            "hphm",
                            "startTime",
                            "endTime",
                            "linenumber",
                            "cwgt",
                            "cwgtUL",
                            "cwgtJudge",
                            "cwkc",
                            "cwkcResult",
                            "cwkcUL",
                            "cwkcJudge",
                            "cwkk",
                            "cwkkResult",
                            "cwkkUL",
                            "cwkkJudge",
                            "cwkg",
                            "cwkgResult",
                            "cwkgUL",
                            "cwkgJudge",
                            "wkccJudge"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/your_database?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "preSql": [
                            "delete from vehicleresult"
                        ],
                        "session": [],
                        "username": "your_username",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "5"
            }
        }
    }
}
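Before handing a hand-edited job file to DataX, it is worth a quick structural check: strict JSON does not allow trailing commas, and since DataX maps columns by position, the reader and writer column lists must have the same length. A minimal, hypothetical checker (not part of DataX) might look like this:

```python
import json

def check_job(text: str) -> dict:
    """Structural sanity check for a DataX job file (illustrative only)."""
    # json.loads rejects trailing commas and other syntax errors outright
    job = json.loads(text)
    content = job["job"]["content"][0]
    reader_cols = content["reader"]["parameter"]["column"]
    writer_cols = content["writer"]["parameter"]["column"]
    # Columns are mapped by position, so the two lists must have the same length
    if len(reader_cols) != len(writer_cols):
        raise ValueError(f"column count mismatch: reader has {len(reader_cols)}, "
                         f"writer has {len(writer_cols)}")
    return job
```

Call it as check_job(open("sqlserver2mysqlALL.json", encoding="utf-8").read()) before the first sync; a stray trailing comma shows up as a JSONDecodeError with a line number.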
Note the flow here: the specified columns (the column list configured above) are read from SQL Server.
Before writing to MySQL, the delete statement configured in preSql is executed first:
delete from vehicleresult
where vehicleresult is the table name. The write mode is insert.
Then run the job with the JSON file above:
python datax.py ./sqlserver2mysqlALL.json
This achieves a full update.
Note that the table structures on both sides, including types, lengths, and nullability, must be consistent.
For example, if a field in SQL Server is nullable and contains NULL values, but the corresponding field in MySQL is NOT NULL, those rows will be treated as dirty data during synchronization and the sync will fail.
Results of the full update above:
4. Each execution of the above command performs a full update, so a .bat script is needed to run the command on a schedule.
Create a new .bat file with the following content:
:: Set the console code page to UTF-8
chcp 65001
@echo off
title "Data Sync"
set INTERVAL=15
timeout %INTERVAL%
:Again
python datax.py ./sqlserver2mysqlALL.json
echo %date% %time:~0,8%
timeout %INTERVAL%
goto Again
The script above executes python datax.py ./sqlserver2mysqlALL.json every 15 seconds.
Put this .bat file in the bin directory, at the same level as the JSON file, and double-click it to run.
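For environments where a .bat loop is awkward, the same schedule can be sketched in Python. This is a rough equivalent of the script above, assuming datax.py and the job file sit in the current working directory:

```python
import subprocess
import sys
import time
from datetime import datetime

INTERVAL = 15  # seconds between runs, matching the .bat script

def build_command(job_file: str) -> list:
    # Same invocation as the .bat script: python datax.py <job file>
    return [sys.executable, "datax.py", job_file]

def run_forever(job_file: str = "./sqlserver2mysqlALL.json") -> None:
    """Run the DataX job, print a timestamp, sleep, repeat."""
    while True:
        subprocess.run(build_command(job_file))
        print(datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
        time.sleep(INTERVAL)

# run_forever() blocks indefinitely; invoke it from your own entry point.
```

For production use, the Windows Task Scheduler (or cron on Linux) is usually a more robust choice than a hand-rolled loop, since it survives reboots.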
5. The above is a full update; how can an incremental update be achieved?
Note that the incremental update here has preconditions. First, data at the source is never deleted; rows are only added or updated, and updates only touch the current day's data.
So first run the full update above to capture all existing data on the initial handover, then use the scheduled task for incremental updates, which only need to query and replace the current day's data.
In addition, a date/time field is required so that a where condition can select the current day's rows when reading and writing.
Also, the primary key here is not an auto-incrementing integer; otherwise, incremental updates could instead be driven by the auto-increment id.
The SQL Server database here is provided by a third-party system, so its schema cannot be changed to a more convenient form.
Modify the sqlserverreader above to add a where condition that selects only the current day's data.
In SQL Server, the current day's rows can be selected with:
where datediff(day,startTime,getdate())=0
where startTime is the time field.
In MySQL, the equivalent condition is:
WHERE DATE(startTime) = CURDATE()
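The replace-the-current-day pattern (preSql deletes today's rows, then the writer re-inserts them) can be simulated end to end with an in-memory SQLite table. This only illustrates the logic, not how DataX executes it; the fixed TODAY string stands in for CURDATE():

```python
import sqlite3

TODAY = "2024-06-01"  # stand-in for CURDATE(); any ISO date works for the demo

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicleresult (checkid TEXT, startTime TEXT)")

# Rows already in the target: one historical row, one stale copy of today's data
conn.executemany("INSERT INTO vehicleresult VALUES (?, ?)",
                 [("old-1", "2024-05-31"), ("stale-1", TODAY)])

# preSql step: delete from vehicleresult WHERE DATE(startTime) = CURDATE()
conn.execute("DELETE FROM vehicleresult WHERE DATE(startTime) = DATE(?)", (TODAY,))

# writer step: insert today's rows freshly read from the source
conn.executemany("INSERT INTO vehicleresult VALUES (?, ?)",
                 [("fresh-1", TODAY), ("fresh-2", TODAY)])

rows = sorted(r[0] for r in conn.execute("SELECT checkid FROM vehicleresult"))
print(rows)  # ['fresh-1', 'fresh-2', 'old-1'] - history kept, today replaced
```

The historical row survives untouched while today's stale copy is replaced, which is exactly why this scheme is safe to rerun on a short interval.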
So modify the JSON file above as follows:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:sqlserver://localhost:1433;DatabaseName=your_database"
                                ],
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "username": "your_username",
                        "where": "datediff(day,startTime,getdate())=0",
                        "column": [
                            "checkid",
                            "cardID",
                            "hphm",
                            "startTime",
                            "endTime",
                            "linenumber",
                            "cwgt",
                            "cwgtUL",
                            "cwgtJudge",
                            "cwkc",
                            "cwkcResult",
                            "cwkcUL",
                            "cwkcJudge",
                            "cwkk",
                            "cwkkResult",
                            "cwkkUL",
                            "cwkkJudge",
                            "cwkg",
                            "cwkgResult",
                            "cwkgUL",
                            "cwkgJudge",
                            "wkccJudge"
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "checkid",
                            "cardID",
                            "hphm",
                            "startTime",
                            "endTime",
                            "linenumber",
                            "cwgt",
                            "cwgtUL",
                            "cwgtJudge",
                            "cwkc",
                            "cwkcResult",
                            "cwkcUL",
                            "cwkcJudge",
                            "cwkk",
                            "cwkkResult",
                            "cwkkUL",
                            "cwkkJudge",
                            "cwkg",
                            "cwkgResult",
                            "cwkgUL",
                            "cwkgJudge",
                            "wkccJudge"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/your_database?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "your_table"
                                ]
                            }
                        ],
                        "password": "your_password",
                        "preSql": [
                            "delete from your_table WHERE DATE(startTime) = CURDATE();"
                        ],
                        "session": [],
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "5"
            }
        }
    }
}
Now the .bat script can run this job on a schedule; adjust the 15-second interval above as needed.
Then add another row of today's data to test the synchronization.
Name the incremental-update JSON file above sqlserver2mysqlAdd.json.