DataX Oracle to MySQL synchronization (full and incremental)

This post walks through how DataX performs full and incremental data synchronization. Oracle-to-MySQL is used as the example, but synchronization between other databases works much the same way.

1.Introduction to DataX

DataX is an offline synchronization tool for heterogeneous data sources, providing stable and efficient data synchronization between a wide range of sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more.
Github home page address: https://github.com/alibaba/DataX
DataX tool download address: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins.

  • Reader: the data collection module; it reads data from the source and hands it to the Framework.
  • Writer: the data writing module; it continuously takes data from the Framework and writes it to the destination.
  • Framework: connects Reader and Writer, serves as the data transmission channel between them, and handles core concerns such as buffering, flow control, concurrency, and data conversion.

The plugins supported by DataX 3.0 are listed below; any of these data sources can synchronize directly with any other, driven by a JSON job configuration.

  • RDBMS (relational databases): MySQL (read, write), Oracle (read, write), OceanBase (read, write), SQLServer (read, write), PostgreSQL (read, write), DRDS (read, write), Dameng (read, write), universal RDBMS for any relational database (read, write)
  • Alibaba Cloud data warehouse storage: ODPS (read, write), ADS (write), OSS (read, write), OCS (read, write)
  • NoSQL data storage: OTS (read, write), HBase 0.94 (read, write), HBase 1.1 (read, write), MongoDB (read, write), Hive (read, write)
  • Unstructured data storage: TxtFile (read, write), FTP (read, write), HDFS (read, write), Elasticsearch (write)

2.DataX in action

2.1.DataX basic environment construction
  • 1. Upload the downloaded datax.tar.gz to the Linux server.

  • 2. Unpack it with tar -xzvf datax.tar.gz; this produces a datax directory. Enter it with cd datax.

  • 3. Delete the hidden files in the datax directory first, otherwise running the job scripts will fail:

 find ./ -name '._*' -print0 | xargs -0 rm -rf
  • 4. Run the bundled self-test job: ./bin/datax.py job/job.json. If the job finishes with a summary and no errors, the environment is working.
2.2.DataX full synchronization from Oracle to MySQL

As described above, DataX synchronizes data through plugins, and each data source has a Reader and a Writer. To get a sample configuration for Oracle-to-MySQL, run: ./bin/datax.py -r oraclereader -w mysqlwriter, then edit the parameters in the generated JSON template.

# Create the job file with vi job/oracle_to_mysql.json; the configuration below is the edited, working version
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader", 
                    "parameter": {
                        "column": ["INVESTOR_ID","INVESTOR_NAME","ID_TYPE","ID_NO","CREATE_TIME"], 
                        "splitPk": "INVESTOR_ID",
                        "where" : "INVESTOR_ID is not null",
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@172.17.112.177:1521:helowin"], 
                                "table": ["CXX.CUSTOMER"]
                            }
                        ], 
                        "password": "123456", 
                        "username": "admin"
                    }
                }, 
                "writer": {
                    "name": "mysqlwriter", 
                    "parameter": {
                        "column": [ 
                            "customer_no",
                            "customer_name",
                            "id_type",
                            "id_no",
                            "create_time"
                           
                        ], 
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://172.17.112.176:3306/customer_db?useUnicode=true&characterEncoding=UTF-8", 
                                "table": ["customer_datax"]
                            }
                        ], 
                        "username": "admin", 
                        "password": "123456", 
                        "preSql": [], 
                        "session": ["set session sql_mode='ANSI'"], 
                        "writeMode": "update"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "3"
            }
        }
    }
}

Run the full synchronization: ./bin/datax.py job/oracle_to_mysql.json. The job summary shows that 1045 records were synchronized to MySQL.
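DataX maps reader columns to writer columns strictly by position, so a quick sanity check before running a job can catch length mismatches early. A minimal sketch (the check_job helper is illustrative, not part of DataX):

```python
import json

def check_job(path):
    """Load a DataX job file and verify reader/writer column counts match."""
    with open(path) as f:
        job = json.load(f)
    content = job["job"]["content"][0]
    reader_cols = content["reader"]["parameter"]["column"]
    writer_cols = content["writer"]["parameter"]["column"]
    # DataX maps columns by position: INVESTOR_ID -> customer_no, etc.,
    # so the two lists must have the same length and matching order
    assert len(reader_cols) == len(writer_cols), "reader/writer column count mismatch"
    return list(zip(reader_cols, writer_cols))

# e.g. check_job("job/oracle_to_mysql.json")
```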

2.3.DataX incremental synchronization from Oracle to MySQL

Incremental synchronization relies on a Linux crontab job: a shell script computes a time window and passes it into the where condition of the JSON job:

"where" : "CREATE_TIME > unix_to_oracle(${create_time}) and CREATE_TIME <= unix_to_oracle(${end_time})"

${create_time} and ${end_time} are computed by the shell script shown below.
vi /home/datax/servers/datax/job/oracle_to_mysql.json

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader", 
                    "parameter": {
                        "column": ["INVESTOR_ID","INVESTOR_NAME","ID_TYPE","ID_NO","CREATE_TIME"], 
                        "splitPk": "INVESTOR_ID",
                        "where" : "CREATE_TIME > unix_to_oracle(${create_time}) and CREATE_TIME <= unix_to_oracle(${end_time})",
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@172.17.112.177:1521:helowin"], 
                                "table": ["CXX.CUSTOMER"]
                            }
                        ], 
                        "password": "123456", 
                        "username": "admin"
                    }
                }, 
                "writer": {
                    "name": "mysqlwriter", 
                    "parameter": {
                        "column": [ 
                            "customer_no",
                            "customer_name",
                            "id_type",
                            "id_no",
                            "create_time"
                           
                        ], 
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://172.17.112.176:3306/customer_db?useUnicode=true&characterEncoding=UTF-8", 
                                "table": ["customer_datax"]
                            }
                        ], 
                        "username": "admin", 
                        "password": "123456", 
                        "preSql": [], 
                        "session": ["set session sql_mode='ANSI'"], 
                        "writeMode": "update"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "3"
            }
        }
    }
}

vi /home/datax/servers/datax/job/increment_sync.sh

#!/bin/bash
source /etc/profile
# end of the window: the current Unix timestamp
end_time=$(date +%s)
# start of the window: 300 s earlier
create_time=$(($end_time - 300))
# run the DataX job, passing in the time window
/home/datax/servers/datax/bin/datax.py /home/datax/servers/datax/job/oracle_to_mysql.json -p "-Dcreate_time=$create_time -Dend_time=$end_time" &

Give increment_sync.sh execute permission: chmod -R 777 increment_sync.sh

Then add a crontab entry that runs every 5 minutes, matching the 300 s window in the script above.

crontab -e
 */5 * * * * /home/datax/servers/datax/job/increment_sync.sh >/dev/null 2>&1
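The comparison operators in the where clause are deliberate: a strict > on the lower bound and <= on the upper bound mean that consecutive 300 s windows tile the timeline, so a row whose CREATE_TIME falls exactly on a boundary is picked up by exactly one run. A small sketch of that invariant:

```python
def window(end_ts, period=300):
    """(start, end] window for one incremental run, as in increment_sync.sh."""
    return (end_ts - period, end_ts)

def contains(w, ts):
    # mirrors: CREATE_TIME > start and CREATE_TIME <= end
    return w[0] < ts <= w[1]

run1 = window(1_000_000)   # one cron run
run2 = window(1_000_300)   # the next run, 5 minutes later
boundary = 1_000_000
# the shared boundary belongs to exactly one window: no gap, no duplicate
assert contains(run1, boundary) and not contains(run2, boundary)
```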

Tip: Oracle has no built-in unix_to_oracle function, so you need to create it yourself:

 create or replace function unix_to_oracle(in_number NUMBER) return date is
 begin
   return TO_DATE('19700101', 'yyyymmdd') + in_number / 86400 + TO_NUMBER(SUBSTR(TZ_OFFSET(sessiontimezone), 1, 3)) / 24;
 end unix_to_oracle;
 /
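The function converts a Unix timestamp to an Oracle DATE: the 1970-01-01 epoch plus the timestamp expressed in days (seconds / 86400), shifted by the session time zone offset in days. The same arithmetic in Python, for reference (the tz_offset_hours parameter stands in for Oracle's TZ_OFFSET(sessiontimezone)):

```python
from datetime import datetime, timedelta

def unix_to_datetime(seconds, tz_offset_hours=0):
    """Mirror of unix_to_oracle: epoch + seconds/86400 days + offset/24 days."""
    epoch = datetime(1970, 1, 1)
    return epoch + timedelta(days=seconds / 86400 + tz_offset_hours / 24)

# unix_to_datetime(0)        -> 1970-01-01 00:00:00
# unix_to_datetime(86400, 8) -> 1970-01-02 08:00:00 (e.g. a +08:00 session)
```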

That completes the incremental synchronization setup.

3.DataX synchronization process:

  • 1. On first deployment, run the full synchronization job manually to copy the existing customer data.
  • 2. Then switch to incremental synchronization, driven by Linux crontab and the time-windowed script.
  • 3. When syncing Oracle to MySQL there are several write modes; setting "writeMode" to "update" is recommended:
    • 3.1. "insert": duplicate records are skipped (reported as dirty data), so rows that were modified in Oracle are not propagated.
    • 3.2. "replace": on a duplicate key, the existing MySQL row is deleted first, then the Oracle row is inserted in its place.
    • 3.3. "update": on a duplicate key, the Oracle columns overwrite the corresponding MySQL columns; columns not configured for synchronization are left untouched.
  • 4. The unix_to_oracle() function must be created in the Oracle database.
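The three write modes correspond to three MySQL statement shapes. The sketch below builds the statement templates to make the difference concrete (a simplified illustration of what mysqlwriter generates, not its actual code):

```python
def write_sql(write_mode, table, columns):
    """Sketch of the MySQL statement implied by each mysqlwriter writeMode."""
    cols = ", ".join(columns)
    vals = ", ".join(["?"] * len(columns))
    if write_mode == "insert":
        # plain insert: a duplicate key makes the row fail as dirty data
        return f"INSERT INTO {table} ({cols}) VALUES ({vals})"
    if write_mode == "replace":
        # delete the conflicting row, then insert the new one
        return f"REPLACE INTO {table} ({cols}) VALUES ({vals})"
    if write_mode == "update":
        # overwrite only the configured columns on a duplicate key
        updates = ", ".join(f"{c} = VALUES({c})" for c in columns)
        return (f"INSERT INTO {table} ({cols}) VALUES ({vals}) "
                f"ON DUPLICATE KEY UPDATE {updates}")
    raise ValueError(write_mode)
```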

4. Optimization of incremental synchronization method

The approach above requires creating a unix_to_oracle() function in the Oracle database. Instead, the shell script can convert the Unix timestamp to a yyyy-MM-dd HH:mm:ss string and pass that to oracle_to_mysql.json, so no custom function is needed.

The increment_sync.sh script now computes string-formatted times:

#!/bin/bash
source /etc/profile
# current Unix timestamp
cur_time=$(date +%s)
# end of the window, wrapped in single quotes so the space inside the value survives
end_time="'$(date -d @$cur_time +"%Y-%m-%d %H:%M:%S")'"
# start of the window: 300 s before the end, matching the 5-minute cron interval
create_time="'$(date -d @$(($cur_time-300)) +"%Y-%m-%d %H:%M:%S")'"

# run the DataX job, passing in the time window
/home/datax/servers/datax/bin/datax.py /home/datax/servers/datax/job/oracle_to_mysql.json -p "-Dcreate_time=$create_time -Dend_time=$end_time" &
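For reference, the same window expressed in Python; the format string matches the 'yyyy-mm-dd hh24:mi:ss' mask used by to_date() in the where clause (UTC is used here for determinism, whereas date on the server uses the local time zone):

```python
from datetime import datetime, timedelta, timezone

def sync_window(now_ts, period=300):
    """(create_time, end_time) strings for the to_date() bounds in the job."""
    fmt = "%Y-%m-%d %H:%M:%S"
    end = datetime.fromtimestamp(now_ts, tz=timezone.utc)
    start = end - timedelta(seconds=period)
    return start.strftime(fmt), end.strftime(fmt)

# sync_window(300) -> ("1970-01-01 00:00:00", "1970-01-01 00:05:00")
```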

Modify the where parameter in oracle_to_mysql.json to use Oracle's built-in to_date() instead of the custom unix_to_oracle() function:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader", 
                    "parameter": {
                        "column": ["INVESTOR_ID","INVESTOR_NAME","ID_TYPE","ID_NO","CREATE_TIME"], 
                        "splitPk": "INVESTOR_ID",
                        "where" : "CREATE_TIME >to_date('${create_time}','yyyy-mm-dd hh24:mi:ss')  and CREATE_TIME <= to_date('${end_time}','yyyy-mm-dd hh24:mi:ss')",
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@172.17.112.177:1521:helowin"], 
                                "table": ["CXX.CUSTOMER"]
                            }
                        ], 
                        "password": "123456", 
                        "username": "admin"
                    }
                }, 
                "writer": {
                    "name": "mysqlwriter", 
                    "parameter": {
                        "column": [ 
                            "customer_no",
                            "customer_name",
                            "id_type",
                            "id_no",
                            "create_time"
                           
                        ], 
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://172.17.112.176:3306/customer_db?useUnicode=true&characterEncoding=UTF-8", 
                                "table": ["customer_datax"]
                            }
                        ], 
                        "username": "admin", 
                        "password": "123456", 
                        "preSql": [], 
                        "session": ["set session sql_mode='ANSI'"], 
                        "writeMode": "update"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "3"
            }
        }
    }
}

On Windows, if you want to run DataX with Python 3, you need the Python 3 version of datax.py.


Origin blog.csdn.net/zhuyu19911016520/article/details/124143716