Use Kettle to synchronize data from database A to database B at regular intervals

1. Background

Due to the project scenario, the data in table a, table b, and table c in database A (MySQL) needs to be synchronized incrementally to database B (MySQL) on a scheduled T+1 basis, that is, each day's data is synchronized the following day. Note that this is neither master-slave replication of the database nor ordinary data synchronization. After some technical research, Kettle turned out to be a good fit for the following reasons:

  1. Kettle is a professional ETL tool (data extraction, cleaning, transformation, loading) written in Java. It runs on Windows, Linux, and Unix and supports multiple data sources and middleware;
  2. It provides a graphical design interface with a wide range of components that can be dragged and dropped, so no additional code needs to be written;
  3. Kettle's flowcharts are essentially configuration files (.ktr/.kjb files). The advantage of this design is that once a transformation has been drawn, it can be copied directly to another environment and run there, for example drawn on a Windows computer and run on a Linux system;
  4. It is free and open source and has many components. It generally handles T+1 data synchronization without problems. For high concurrency, strict real-time requirements, or large data volumes, Flink is recommended instead.

2. How to use

1. Download the installation package

Download address: https://sourceforge.net/projects/pentaho/files/Data%20Integration/

2. Starting method

On Windows, double-click Spoon.bat to start.

The following picture appears to indicate that it is starting. If nothing happens for a long time, run it as administrator.

The main interface is as follows:

3. Connect to the MySQL database

1. Prepare the MySQL connection driver jar package

Kettle itself does not ship with the database driver, so the driver jar must be prepared first; version 5.1.49 is the recommended choice. After downloading the jar, copy it to the lib directory of the Kettle installation (the same on Windows and Linux). If Kettle is already running, it must be shut down and restarted, otherwise the driver will not be loaded.
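
On Linux, the copy step can be sketched as a small shell function. This is only an illustration: the function name install_driver is hypothetical, and the jar file name and installation path in the example call are assumptions that should be adjusted to the actual download and install location.

```shell
#!/bin/bash
# Sketch: copy a JDBC driver jar into the lib directory of a Kettle installation.
# install_driver is a hypothetical helper, not part of Kettle.
install_driver() {
    local jar="$1"          # path to the downloaded driver jar
    local kettle_dir="$2"   # Kettle installation directory (contains lib/)

    if [ ! -f "$jar" ]; then
        echo "driver jar not found: $jar" >&2
        return 1
    fi
    if [ ! -d "$kettle_dir/lib" ]; then
        echo "lib directory not found: $kettle_dir/lib" >&2
        return 1
    fi
    # Kettle only picks up the driver after a restart.
    cp "$jar" "$kettle_dir/lib/"
}

# Example call (file name and path are assumptions):
# install_driver ./mysql-connector-java-5.1.49.jar /home/admin/kettle/data-integration
```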

2. Create a data source

Click in turn: Transformation --> Main Object Tree --> DB Connection --> New --> Test

Repeat the operation above to create two data sources, one for the source database and one for the target database. The goal is to synchronize the table data in the source database to the target database.

3. Configure the transformation

① Add the input node: Transformation --> Input --> Table input

② Double-click the input node to open the configuration page and enter the information

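Here, because I am synchronizing data incrementally on a scheduled T+1 basis, I added the condition WHERE gmt_create >= CURDATE() so that only rows created on or after the start of the current day are queried. Note that when the job runs at 1 am, this condition only matches rows created since midnight; to pick up the whole previous day, a window such as gmt_create >= CURDATE() - INTERVAL 1 DAY AND gmt_create < CURDATE() would be needed.

Click Preview; in my case exactly one row is returned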
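
To make the date boundaries of a T+1 run concrete, the following sketch prints a WHERE clause that covers the full day before a given run date. The function name t1_where is hypothetical, gmt_create is the column from the query above, and date -d assumes GNU date as found on most Linux systems. It only illustrates the window and is not necessarily the condition used in the original job.

```shell
#!/bin/bash
# Sketch: build a WHERE clause for the day before a run date (T+1 window).
# t1_where is a hypothetical helper; it requires GNU date for the -d option.
t1_where() {
    local run_date="$1"   # run date in YYYY-MM-DD format
    local prev_date

    # UTC avoids daylight-saving edge cases in the calendar arithmetic.
    prev_date=$(TZ=UTC date -d "$run_date -1 day" "+%Y-%m-%d") || return 1

    # Lower bound inclusive, upper bound exclusive: exactly one calendar day.
    printf "WHERE gmt_create >= '%s' AND gmt_create < '%s'\n" "$prev_date" "$run_date"
}

# Example call: print the window for a job that runs today.
# t1_where "$(date "+%Y-%m-%d")"
```

The half-open interval keeps consecutive daily runs from overlapping or leaving gaps at midnight.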

③ Add the output node: Transformation --> Output --> Insert/Update, then press and hold Shift and drag from the table input node to the Insert/Update node to connect them

④ Double-click the Insert/Update node to open the configuration page

⑤ Click Run to test the transformation

4. Copy the ktr file to run regularly on Linux

On Linux, the ktr file is run using Kettle's pan.sh script with the following command:
sh /home/admin/kettle/data-integration/pan.sh -file=/home/admin/kettle/ktr/table_transfer.ktr -norep

To execute this command on a schedule, I use crontab, which comes with Linux.

First, I wrote a shell script named cornSql.sh that holds the command for running the ktr file. Its content is as follows:

#!/bin/bash
# cron runs with a minimal environment, so set the Kettle and Java variables explicitly
export KETTLE_HOME=/home/admin/kettle/data-integration
export JAVA_HOME=/usr/java/jdk1.8.0_131
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin:${KETTLE_HOME}
export JRE_HOME=${JAVA_HOME}/jre

# One log file per day, named transfer-YYYYMMDD.log
TIME=$(date "+%Y%m%d")
# Run the transformation and append both stdout and stderr to the daily log
sh /home/admin/kettle/data-integration/pan.sh -file=/home/admin/kettle/ktr/table_transfer.ktr -norep >>/home/admin/kettle/log/transfer-"$TIME".log 2>&1
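
As a variant of the logging step, the run can be wrapped in a function that creates the log directory if needed, captures both stdout and stderr, and records the exit status so that a failed run is visible in the log. The function name run_and_log is hypothetical and the sketch is not tied to Kettle; any command can be passed to it.

```shell
#!/bin/bash
# Sketch: run a command, append its output to a dated log file, and keep its exit status.
# run_and_log is a hypothetical helper.
run_and_log() {
    local log_dir="$1"
    shift
    local log_file
    log_file="$log_dir/transfer-$(date "+%Y%m%d").log"

    # Create the log directory on first use.
    mkdir -p "$log_dir" || return 1

    # Append stdout and stderr of the command to the daily log.
    "$@" >>"$log_file" 2>&1
    local rc=$?

    # Record when the run finished and how it exited.
    echo "$(date "+%Y-%m-%d %H:%M:%S") exit status: $rc" >>"$log_file"
    return "$rc"
}

# Example call with the paths used in this article:
# run_and_log /home/admin/kettle/log \
#     sh /home/admin/kettle/data-integration/pan.sh \
#     -file=/home/admin/kettle/ktr/table_transfer.ktr -norep
```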

Next, copy the ktr file to the specified directory, /home/admin/kettle/ktr, and make the script executable with chmod +x /home/admin/kettle/cornSql.sh. Then run crontab -e to open the editor and add the line 0 1 * * * /home/admin/kettle/cornSql.sh, which executes the cornSql.sh script at 1 am every day.
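
If the entry should be added from a script rather than interactively, one possible approach is to filter the existing crontab listing and append the line only when it is missing. The function add_cron_entry below is a hypothetical helper that works purely on text, so it can be tried out without touching the real crontab.

```shell
#!/bin/bash
# Sketch: read a crontab listing from stdin and print it with one entry added,
# unless an identical line is already present. add_cron_entry is a hypothetical helper.
add_cron_entry() {
    local entry="$1"
    local current
    current=$(cat)

    if printf '%s\n' "$current" | grep -qxF -- "$entry"; then
        # Entry already present: print the listing unchanged.
        printf '%s\n' "$current"
    elif [ -z "$current" ]; then
        # Empty crontab: the entry becomes the only line.
        printf '%s\n' "$entry"
    else
        printf '%s\n%s\n' "$current" "$entry"
    fi
}

# Example call (this would modify the current user's crontab):
# crontab -l 2>/dev/null | add_cron_entry "0 1 * * * /home/admin/kettle/cornSql.sh" | crontab -
```

Running the example call twice leaves a single entry, which avoids the job being executed twice per night.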

To check whether the schedule has taken effect, run crontab -l -u root (replace root with the user whose crontab was edited). If the entry added above is printed, the configuration is in effect.

Finally, the next day, check whether the execution log file has been generated in the /home/admin/kettle/log directory. In my case the log contains the daily execution date.
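
This check can also be scripted. The sketch below assumes the log naming used by cornSql.sh (transfer-YYYYMMDD.log); the function name check_log is hypothetical.

```shell
#!/bin/bash
# Sketch: verify that the daily log file exists and is not empty.
# check_log is a hypothetical helper; the file name pattern follows cornSql.sh.
check_log() {
    local log_dir="$1"
    local day="${2:-$(date "+%Y%m%d")}"   # defaults to today
    local log_file="$log_dir/transfer-$day.log"

    if [ -s "$log_file" ]; then
        echo "found $log_file"
    else
        echo "missing or empty: $log_file" >&2
        return 1
    fi
}

# Example call with the log directory used in this article:
# check_log /home/admin/kettle/log
```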

Origin blog.csdn.net/weixin_33005117/article/details/129998959