1. Background
Due to project requirements, we need a scheduled T+1 incremental synchronization of the data in table a, table b, and table c from database A (MySQL) to database B (MySQL). To be clear, this is neither master-slave replication nor ordinary full-copy data synchronization. After some technical research, Kettle turned out to be a good fit, for the following reasons:
- Kettle (data extraction, cleaning, transformation, and loading) is written in Java and runs on Windows, Linux, and Unix. It is a professional ETL tool that supports many data sources and middleware;
- The visual GUI designer offers a rich set of components with drag-and-drop design, so no extra code needs to be written;
- Kettle's flowcharts are essentially configuration files (.ktr/.kjb). The advantage of this design is that once a transformation flowchart is drawn, it can simply be copied and run in another environment, e.g. drawn on a Windows machine and then run on a Linux system;
- It is free and open source, with many components. It generally handles T+1 data synchronization without problems; for high concurrency, strict real-time requirements, or very large data volumes, Flink is recommended instead.
2. How to use
1. Download the installation package
Official website address: https://sourceforge.net/projects/pentaho/files/Data%20Integration/
2. Starting method
On Windows, double-click Spoon.bat to start.
The picture below indicates that it is starting; if nothing happens for a long time, run it as administrator.
The main interface looks like this:
3. Connect to the MySQL database
1. Prepare the MySQL connection driver jar package
Since Kettle itself ships without any database drivers, we first need to obtain the MySQL driver jar; version 5.1.49 is a good choice. After downloading the jar, copy it into the lib directory (the same on Windows and Linux). If Kettle is already running, it must be shut down and restarted, otherwise the driver will not be loaded.
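As a sketch of the placement step (throwaway directories stand in here for the real install and download paths, so this runs anywhere; substitute your actual data-integration location):

```shell
# Throwaway dirs stand in for the real Kettle install and download paths.
KETTLE_HOME=$(mktemp -d)    # in reality e.g. /home/admin/kettle/data-integration
DOWNLOADS=$(mktemp -d)
mkdir -p "$KETTLE_HOME/lib"
touch "$DOWNLOADS/mysql-connector-java-5.1.49.jar"   # stands in for the downloaded jar
# The actual step: copy the driver into Kettle's lib directory, then restart Kettle.
cp "$DOWNLOADS/mysql-connector-java-5.1.49.jar" "$KETTLE_HOME/lib/"
ls "$KETTLE_HOME/lib"
```

Kettle only scans lib/ at startup, which is why a restart is required after copying.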
2. Create a data source
Click in turn: Transformation --> Main object tree --> DB connections --> New --> Test
Repeat the same steps to create two data sources, the source database and the target database; the goal is to synchronize table data from the source database into the target database.
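The New connection dialog boils down to standard MySQL settings. A hedged sketch of the fields for the source side (host name, database, and credentials here are placeholders of my own, not values from the article):

```shell
# Placeholder values; Kettle's MySQL connection dialog asks for these same fields.
SOURCE_HOST=source-db.example.com   # Host Name
SOURCE_PORT=3306                    # Port Number (MySQL default)
SOURCE_DB=db_a                      # Database Name
SOURCE_USER=sync_user               # User Name
# The JDBC URL Kettle builds from these fields:
echo "jdbc:mysql://$SOURCE_HOST:$SOURCE_PORT/$SOURCE_DB"
```

The target data source is configured the same way, pointing at database B.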
3. Configure the transformation flow
① Add the input node: Transformation --> Input --> Table input
② Double-click the input node to open its configuration page and fill in the details
Here, because this is a scheduled T+1 incremental sync, I added the condition WHERE gmt_create >= CURDATE() so that only rows whose creation time falls on or after the start of the current day are queried.
Click Preview; exactly one row comes back.
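One thing worth noting about the predicate: at a 1 am run, gmt_create >= CURDATE() matches only rows created after midnight of the run day, not the whole previous day. A common T+1 variant bounds the previous-day window explicitly; the sketch below just prints the equivalent date range (GNU date assumed, which is standard on Linux where this job runs):

```shell
# Print the previous-day window an explicit T+1 predicate would use.
# `date -d` is a GNU date extension.
TODAY=$(date "+%Y-%m-%d")
YESTERDAY=$(date -d "yesterday" "+%Y-%m-%d")
echo "WHERE gmt_create >= '$YESTERDAY' AND gmt_create < '$TODAY'"
```

Whichever form you pick, make sure it lines up with when the cron job actually fires.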
③ Add the Insert/Update node: Transformation --> Insert/Update; then hold Shift and drag from the Table input node to create the hop between the two steps
④ Double-click the Insert/Update node to open its configuration page
⑤ Click Run to test the transformation
4. Copy the ktr file to Linux and run it on a schedule
On Linux, the ktr file is executed with Kettle's pan.sh script:
sh /home/admin/kettle/data-integration/pan.sh -file=/home/admin/kettle/ktr/table_transfer.ktr -norep
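pan.sh exits with a non-zero status when the transformation fails, so it is worth checking the return code rather than assuming success. A minimal sketch, with a stub standing in for the real pan.sh call (the wrapper function is my own addition, not part of the article's setup):

```shell
# Stub standing in for the pan.sh invocation above; replace its body with the
# real command. pan.sh returns a non-zero exit code when the transformation fails.
run_transfer() {
  true   # e.g. sh /home/admin/kettle/data-integration/pan.sh -file=... -norep
}
if run_transfer; then
  echo "transfer OK"
else
  echo "transfer failed, check the log" >&2
fi
```

With the check in place, a failed nightly run leaves a visible trace instead of silently producing an empty log.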
To execute this script on a schedule, I use the crontab facility that comes with Linux.
First, I wrote a shell script named cornSql.sh that wraps the ktr execution command. Its content is as follows:
#!/bin/bash
# Environment for Kettle and the JDK
export KETTLE_HOME=/home/admin/kettle/data-integration
export JAVA_HOME=/usr/java/jdk1.8.0_131
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin:${KETTLE_HOME}
export JRE_HOME=${JAVA_HOME}/jre
# Date suffix for the daily log file, e.g. 20240101
TIME=$(date "+%Y%m%d")
# Run the transformation; append stdout and stderr to the dated log
sh /home/admin/kettle/data-integration/pan.sh -file=/home/admin/kettle/ktr/table_transfer.ktr -norep >>/home/admin/kettle/log/transfer-"$TIME".log 2>&1
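The script appends to a dated file under /home/admin/kettle/log, so that directory must exist before the first run, and old logs will otherwise accumulate forever. A hedged housekeeping sketch (a temp directory stands in for the real log path so this is runnable anywhere; the 30-day retention is my choice, not the article's):

```shell
# Temp dir stands in for /home/admin/kettle/log.
LOG_DIR=$(mktemp -d)
TIME=$(date "+%Y%m%d")
mkdir -p "$LOG_DIR"                     # ensure the log directory exists
touch "$LOG_DIR/transfer-$TIME.log"     # today's log, as the script would create it
# Prune logs older than 30 days so the directory does not grow without bound.
find "$LOG_DIR" -name 'transfer-*.log' -mtime +30 -delete
ls "$LOG_DIR"
```

The find line can be appended to cornSql.sh itself so cleanup happens on the same nightly schedule.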
Next, copy the ktr file to the designated directory, /home/admin/kettle/ktr. Then run crontab -e to open the crontab editor and add the line 0 1 * * * /home/admin/kettle/cornSql.sh. This entry executes the cornSql.sh script at 1 am every day.
To check whether the schedule took effect, run crontab -l -u root. If the entry just added is printed, the configuration is in place.
Finally, on the next day, check whether the execution log file has been generated in the /home/admin/kettle/log directory. The daily execution date is printed there, as shown below: