Using DataX for scheduled incremental synchronization of MySQL database data

1. Download DataX (this assumes CentOS already has a JDK and the other runtime prerequisites installed), unzip it to a path of your choice, and run the self-test using the Python that ships with CentOS 7:

wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -zxvf datax.tar.gz && mv datax /usr/local/
cd /usr/local/datax/bin/
python datax.py /usr/local/datax/job/job.json

If an error is reported, refer to this post: https://www.cnblogs.com/juanxincai/p/16258154.html

Here is the solution:

cd /usr/local/datax/plugin/reader
[root@Data1 reader]# ll -a
total 76
drwxr-xr-x 21 502 games 4096 Feb 19 21:05 .
drwxr-xr-x  4 502 games   66 Feb 19 21:05 ..
drwxr-xr-x  3 502 games  224 Feb 19 21:05 cassandrareader
-rwxr-xr-x  1 502 games  212 Oct 12  2019 ._cassandrareader
...

Delete the files with the ._ prefix, then do the same in the writer plugin directory:

rm -f ._*
cd /usr/local/datax/plugin/writer/
rm -f ._*
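Then re-run the self-test from step 1 to confirm the fix:

python /usr/local/datax/bin/datax.py /usr/local/datax/job/job.json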

As you can see, the self-test now succeeds.

Synchronization approach:

Use Python to query the source database table, save the time of the newest row to a txt file, and read it back on the next run; then add a cron job so that the data in each time window is synchronized on a schedule (incremental synchronization).

Using pymysql requires Python 3, while CentOS 7 ships with Python 2; install Python 3 as described here: https://www.cnblogs.com/juanxincai/p/16280031.html
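With Python 3 in place, pymysql can then typically be installed through pip (assuming pip3 was installed alongside Python 3):

pip3 install pymysql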

After the installation completes, run the self-test again; if it reports an error, see: https://www.cnblogs.com/juanxincai/p/16284779.html

Next you need four things:

1. A mysql-to-mysql job file that performs the reading and writing (my file is named new.json). It holds the source table's connection information, the columns to read, and other settings, and it accepts two time parameters passed in from outside (formatted as Unix timestamps). It lives under /usr/local/datax/job. Make sure the file is valid, standard JSON before using it:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "用户名",
                        "password": "密码",
                        "where": "data_time >= FROM_UNIXTIME(${create_time}) and data_time  < FROM_UNIXTIME(${end_time})",
                        "column": [
                            "id","data_time","name","age","insert_time"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "表名"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://数据源IP:3306/数据库名?useUnicode=true&characterEncoding=utf8"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "update",
                        "username": "用户名",
                        "password": "密码",
                        "column": [
                            "id","data_time","name","age","insert_time"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://目标库IP:3306/数据库名?useUnicode=true&characterEncoding=utf8",
                                "table": [
                                    "表名"
                                ]
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 6
            }
        }
    }
}
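The ${create_time} and ${end_time} placeholders in the where clause are filled in at run time through datax.py's -p option, which is exactly what the script in step 2 does. Per the mysqlwriter documentation, writeMode "update" writes with INSERT ... ON DUPLICATE KEY UPDATE, so rows that fall into an already-synced window are overwritten rather than duplicated. To sanity-check the JSON by hand, you can pass in a pair of example timestamps (the values below are arbitrary):

python3 /usr/local/datax/bin/datax.py /usr/local/datax/job/new.json -p "-Dcreate_time=1677168000 -Dend_time=1677171600"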

2. The Python script that the scheduler runs. It fetches the time of the newest row in the source table, writes it to a txt file, and invokes datax.py with the new.json above (file name: mysql2mysqlexecute.py):

#!/usr/bin/env python3
# coding: utf-8
import os
import sys
import time
import pickle
import pymysql

print("going to execute")

configFilePath = sys.argv[1]  # path to the DataX job file (new.json)
logFilePath = sys.argv[2]     # path to the log file DataX output is appended to


def save_variable(v, filename):
    # pickle a variable into a file and return the file name
    with open(filename, 'wb') as f:
        pickle.dump(v, f)
    return filename


def load_variable(filename):
    # load a pickled variable back from the file
    with open(filename, 'rb') as f:
        return pickle.load(f)


# the txt file holding the temporary time variable (mind the encoding);
# its content is the start time of this synchronization window
startTime = load_variable('/usr/local/datax/job/tempTime.txt')


def do_sql(sql):
    # connect to the source database
    db = pymysql.connect(host='source-host', port=3306, user='your_username',
                         passwd='your_password', db='your_database')
    # dict cursor so each row comes back as a dictionary
    cursor = db.cursor(cursor=pymysql.cursors.DictCursor)
    cursor.execute(sql)
    rs = cursor.fetchall()
    db.commit()
    cursor.close()
    db.close()
    return rs[0]['dataTime']


print("startTime=", startTime)  # formatted start time of the sync window
startTimeArray = time.strptime(startTime, "%Y-%m-%d %H:%M:%S")
startTimeStamp = int(time.mktime(startTimeArray))

# fetch the time of the newest row in the source table
sql = 'SELECT CAST(data_time AS CHAR) as dataTime FROM your_table ORDER BY data_time DESC LIMIT 1'

lastDataTime = do_sql(sql)
print("endTime=", lastDataTime)
lastDataTimeArray = time.strptime(lastDataTime, "%Y-%m-%d %H:%M:%S")
lastDataTimeStamp = int(time.mktime(lastDataTimeArray))

try:
    script2execute = "/usr/bin/python3 /usr/local/datax/bin/datax.py %s -p \"-Dcreate_time=%s -Dend_time=%s\" >> %s" % (
        configFilePath, startTimeStamp, lastDataTimeStamp, logFilePath)
    print("script to execute:", script2execute)
    os.system(script2execute)
except IOError as e:
    print(e)

print("script execute ending")

# save the end time as the start time of the next run
save_variable(lastDataTime, '/usr/local/datax/job/tempTime.txt')

print("ending---")

3. The shell script that runs the synchronization task on a schedule (file name: timeMission.sh). Here you can see that the location of the new.json file and the log path are passed in at execution time:

#!/bin/bash
source /etc/profile

/usr/bin/python3 /usr/local/datax/job/mysql2mysqlexecute.py '/usr/local/datax/job/new.json' '/usr/local/datax/job/test_job.log' '/usr/local/datax/job/test_job.record'

4. As you can see above, the file that stores the temporary time variable is called tempTime.txt. Create it yourself, paying attention to the path and the encoding format. Since the script loads it with pickle, seed it with a pickled time string rather than plain text (see the sketch below).
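A minimal one-off seeding sketch; the initial timestamp is only an example, so use whatever lower bound fits your data:

import pickle

# write an initial start time into the watermark file (pickled string)
with open('/usr/local/datax/job/tempTime.txt', 'wb') as f:
    pickle.dump('2023-01-01 00:00:00', f)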

Since a shell script is being used, remember to give it execute permission:

chmod +x ./xxx.sh

Next, write the scheduled task. Here we use crontab; if you have any questions, refer to this article: https://www.cnblogs.com/juanxincai/p/15852374.html

crontab -e
SHELL=/bin/bash
*/5 * * * * /usr/local/datax/job/timeMission.sh

Save and exit with :wq. These two lines mean that timeMission.sh will be executed every 5 minutes, i.e. the data is synchronized every 5 minutes.

crontab -l

As you can see, our scheduled task has been registered.

systemctl reload crond.service
systemctl restart crond.service

Reload and restart the cron service, then view the task execution log output:

tail -f /var/spool/mail/root

Then check the DataX run log:

tail -200f /usr/local/datax/job/test_job.log

You can see that the data has been synchronized. There is still room for optimization; refer to the official DataX documentation. Don't be afraid of problems during setup and runs; troubleshoot patiently. The JSON format, execute permissions, and file encoding are all possible causes of failure.
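As one optimization example, the job's setting block also supports throughput and error-tolerance controls; a sketch based on the documented job settings (the values are illustrative, and availability can vary by DataX version):

"setting": {
    "speed": {
        "channel": 6,
        "byte": 1048576
    },
    "errorLimit": {
        "record": 0,
        "percentage": 0.02
    }
}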

Origin: blog.csdn.net/zxj19880502/article/details/129299405