Implementing an incremental data synchronization scheme with Kettle

1. Background

Our current data synchronization between databases uses Oracle GoldenGate (OGG). Characteristics of this scheme:
Advantages:

  • Synchronization is driven by the database change logs (Oracle redo / MySQL binlog), so it is very fast and has very little impact on database performance; well suited to synchronizing large volumes of data

Disadvantages:

  • When a synchronized table's fields change or a new table is added, many configuration files on the database server must be modified; this is complicated, and in many cases the extract, pump, and replicate processes are easy to misoperate;
  • If synchronization of a table fails, rebuilding the configuration is complex;
  • The software must be installed on every database server;
  • There is no GUI, so it is not intuitive and the configuration is scattered...

To address these OGG drawbacks, we researched a new synchronization scheme: Kettle.
Kettle synchronizes data incrementally via SQL, based on primary keys and timestamps. It needs no configuration on the database servers: jobs are simply created and configured on the Kettle server, which provides a simple, intuitive C/S platform.
Going forward, Kettle will be the primary scheme; for very large synchronization volumes (e.g., a single table syncing more than one million records per day), we will still consider OGG.

2. Installation

Version: PDI 8.3
OS: on Linux the GUI renders poorly, which is painful to work with, and starting Spoon takes a long time every time, so we use Windows instead.
Installation is simple:

  • Install JDK 1.8;
  • Unzip pdi.zip;
  • Download the Oracle driver (ojdbc14.jar) and the MySQL drivers (mysql-connector-java-5.1.9.jar, mysql-connector-java-6.0.6.jar) into the directory pdi\data-integration\lib.

3. Configuration

Notes on several key Kettle components:

  • SQL: as the name suggests, executes SQL; one step can contain multiple statements
  • Transformation: contains steps such as Table input, Set variables, Sort, Merge, Switch, Table output, Update, and Insert/Update; it is the essence of Kettle, where the key logic is implemented
  • Job: a complete piece of synchronization logic, made up of SQL steps and transformations; it begins with the 'START' entry (which also provides the scheduling options) and ends with the 'Success' entry
  • DB Connection: a database connection; it can be created inside a job or transformation, in which case it is valid only for that job or transformation by default; 'sharing' is supported, which makes the connection take effect globally

Requirements for synchronized tables (a sample DDL follows the list)

  1. The table has a primary key;
  2. The application must not physically delete data; only logical deletion is allowed, via a flag field (delete_flag tinyint / number(1); 0 = not deleted, 1 = deleted). Rows with delete_flag = 1 can then be physically purged on both source and target by a scheduled task;
  3. A unified timestamp field (update_time; Oracle: date / MySQL: datetime); every data change (including changes to delete_flag) must also update update_time;
  4. When synchronizing Oracle number to MySQL bigint, the maximum supported precision is 18; do not use a bare number (which defaults to precision 38);
  5. Manual data maintenance that bypasses update_time (truncate, delete, update without setting update_time) must be applied to source and target at the same time;
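
A minimal MySQL sketch of a table meeting these requirements (sync_t1 is the table used by the job below; the non-key columns are illustrative):

CREATE TABLE sync_t1 (
  id          BIGINT       NOT NULL,            -- requirement 1: primary key (Oracle side: number(18), see requirement 4)
  name        VARCHAR(100) DEFAULT NULL,        -- illustrative business column
  delete_flag TINYINT      NOT NULL DEFAULT 0,  -- requirement 2: 0 = not deleted, 1 = deleted
  update_time DATETIME     NOT NULL,            -- requirement 3: must be set on every change
  PRIMARY KEY (id)
);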

Rationale

  1. Create a table sync_timestamp (table_name, time_stamp) that records, per table, the timestamp at which synchronization last completed; before the first synchronization, manually set it to the minimum update_time of the source table;
  2. At the start of each synchronization, set the variable TIME_STAMP from sync_timestamp.time_stamp;
  3. Insert/update into the target table every source row whose update_time is greater than or equal to TIME_STAMP;
  4. Update the time_stamp field of sync_timestamp with the maximum update_time found in the target table (the target, not the source, so the watermark never advances past data that actually arrived); see the sketch below.
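
A sketch of the bookkeeping table and of step 4, assuming it lives on the MySQL side; the interval_seconds column is not mentioned here but is inferred from the monitoring SQL in section 5:

CREATE TABLE sync_timestamp (
  table_name       VARCHAR(100) NOT NULL,       -- name of the synchronized table / job
  time_stamp       DATETIME     NOT NULL,       -- watermark: when sync last completed
  interval_seconds INT          DEFAULT NULL,   -- expected run interval, used by monitoring in section 5
  PRIMARY KEY (table_name)
);

-- Step 4: advance the watermark to the newest row that actually arrived in the target
UPDATE sync_timestamp
SET time_stamp = (SELECT MAX(update_time) FROM sync_t1)
WHERE table_name = 'sync_t1';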

Job: job_sync_t1 (Oracle -> MySQL)

Transformation: trans_get_timestamp

Table Input:

select DATE_FORMAT(time_stamp,'%Y-%m-%d %H:%i:%S') time_stamp
from sync_timestamp
where table_name='sync_t1'

Set variables: assign the time_stamp value from the previous step to the variable TIME_STAMP, scoped so it is visible to the rest of the job.

Transformation: trans_sync_data

Table Input:

select *
from sync_t1
where update_time >= to_date('${TIME_STAMP}','yyyy-mm-dd hh24:mi:ss')

Insert/Update: looks up each incoming row in the target table by primary key, updates it if a match exists and inserts it otherwise. Because the Table input uses >=, rows exactly at the TIME_STAMP boundary may be read again; this step makes the re-read harmless.
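
Roughly equivalent SQL for what the step does per row (a MySQL sketch assuming the sync_t1 layout above; the actual step issues a keyed lookup followed by an insert or an update rather than this statement):

INSERT INTO sync_t1 (id, name, delete_flag, update_time)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
  name        = VALUES(name),
  delete_flag = VALUES(delete_flag),
  update_time = VALUES(update_time);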

4. Scheduling

Kettle jobs are executed with the kitchen command. Every JOB execution initializes kitchen with several hundred MB of memory and takes more than 10 seconds to start, and one kitchen process can execute only one JOB. If many kitchen processes start at the same time they consume a great deal of memory, so the OS needs plenty of RAM.

To use resources sensibly, the JOB schedules are split into separate files by execution frequency, and each file contains 3-5 jobs depending on how heavy its synchronization load is:

JOBs that run once per minute:

file1: job_1min_1.bat, containing job1-job3

file2: job_1min_2.bat, containing job4-job6

file3: job_1min_3.bat, containing job7-job9

...

Contents of job_1min_1.bat:

e:
cd e:\pdi\data-integration
kitchen /file:e:\mykettle\job1\job.kjb /level:Error
kitchen /file:e:\mykettle\job2\job.kjb /level:Error
kitchen /file:e:\mykettle\job3\job.kjb /level:Error

Then simply use the Kettle server's task scheduler to run job_1min_1.bat periodically.
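
For instance, the schedule could be registered from the command line with the Windows Task Scheduler (a sketch; the task name and the path to the .bat file are illustrative):

schtasks /create /tn "kettle_job_1min_1" /sc minute /mo 1 /tr "e:\mykettle\job_1min_1.bat"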

Note: Kettle itself has no mechanism to prevent concurrent execution of a JOB (e.g., if a JOB is scheduled once per minute but a single run takes longer than one minute, runs will overlap). When runs overlap, inserting duplicate data raises errors; those JOB errors show up in monitoring, prompting optimization of the JOB or its schedule.
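
One simple guard against overlap, not part of the original scheme, is a lock directory in the .bat file (mkdir is atomic, so it doubles as a lock; the paths are illustrative):

rem Take the lock; if a previous run still holds it, skip this run
mkdir e:\mykettle\job1.lock 2>nul || exit /b
kitchen /file:e:\mykettle\job1\job.kjb /level:Error
rmdir e:\mykettle\job1.lock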

5. Monitoring

Configure a log table for the JOBs to record their execution results;

Create the log table:

CREATE TABLE `sync_log` (
  `ID_JOB` bigint(20) NOT NULL,
  `JOBNAME` varchar(100) COLLATE utf8mb4_bin DEFAULT NULL,
  `STATUS` varchar(20) COLLATE utf8mb4_bin DEFAULT NULL,
  `ERRORS` bigint(20) DEFAULT NULL,
  `LOGDATE` datetime DEFAULT NULL,
  `LOG_FIELD` blob,
  `LINES_READ` bigint(20) DEFAULT NULL,
  `LINES_WRITTEN` bigint(20) DEFAULT NULL,
  `LINES_UPDATED` bigint(20) DEFAULT NULL,
  `LINES_INPUT` bigint(20) DEFAULT NULL,
  `LINES_OUTPUT` bigint(20) DEFAULT NULL,
  `LINES_REJECTED` bigint(20) DEFAULT NULL,
  `STARTDATE` datetime DEFAULT NULL,
  `DEPDATE` datetime DEFAULT NULL,
  `REPLAYDATE` datetime DEFAULT NULL,
  `ENDDATE` datetime DEFAULT NULL,
  PRIMARY KEY (`ID_JOB`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

Specify sync_log as the job log table in each job's settings. The monitoring below joins it with sync_timestamp.

Joining these two tables confirms whether each JOB is executing normally; the check is configured as a Zabbix monitoring item.

Criterion: every table_name recorded in sync_timestamp has a latest successful entry in sync_log (status = 'end', errors = 0) whose logdate falls within the last interval_seconds + 60 seconds.

select (select count(*) from dbcopy.sync_timestamp) - count(*) errorjob
from dbcopy.sync_log a, dbcopy.sync_timestamp b
where (a.jobname, a.logdate) in (select jobname, max(logdate)
                                 from dbcopy.sync_log
                                 where errors = 0 and status = 'end'
                                 group by jobname)
and a.jobname = b.table_name
and a.logdate > DATE_SUB(now(), INTERVAL (b.interval_seconds + 60) second)

A result greater than 0 means some JOB is not executing normally (the value is the number of tables lacking a recent successful run).

over ~
