Incremental synchronization - Spring Batch (6): dynamic parameter binding and incremental synchronization
tags:springbatch
Article Directory
- Incremental synchronization - Spring Batch (6): dynamic parameter binding and incremental synchronization
- 1. Introduction
- 2. Development Environment
- 3. Description of incremental synchronization
- 3.1 Source-data-based `CDC`
- 3.2 Trigger-based `CDC`
- 3.3 Snapshot-based `CDC`
- 3.4 Log-based `CDC`
- 3.5 Incremental synchronization method used in this example
- 4. Spring Batch dynamic parameter binding
- 4.1 Keep the original database and multi-data-source configuration
- 4.2 Create the temporary table
- 4.3 Add/modify DAOs
- 4.3.1 Add the temporary table DAO and service classes
- 4.3.2 Modify the source data DAO
- 4.3.3 Modify the target data DAO
- 4.4 Modify the `sql` statements in `user.md`
- 4.5 Write the Spring Batch components
- 5. Summary
1. Introduction
The previous article, "Easy data reading and writing - Spring Batch (5): reading and writing database data with BeetlSql", used Spring Batch and BeetlSql to build a database-to-database synchronization component, but what it performed was actually a full synchronization. The problem with full synchronization is that every run has to read the entire table; if the table is large, this consumes a lot of resources, and it is not easy to update existing data. In practice, therefore, data synchronization is mostly incremental: based on certain conditions, newly added data is inserted, changed data is updated, and removed data is deleted (of course, data is generally not physically deleted but only tombstoned, so a deletion also becomes an update).
In most cases an incremental update is driven by some update state (e.g. a timestamp, an auto-increment ID, a data position, etc.); each run builds on the state left by the previous one. This state therefore needs to be saved after every update and passed to the next run as a dynamic parameter. Spring Batch supports dynamic run-time job parameters, and with this feature incremental data synchronization can be implemented.
2. Development Environment
- JDK: jdk1.8
- Spring Boot: 2.1.4.RELEASE
- Spring Batch: 4.1.2.RELEASE
- Development IDE: IDEA
- Build tool: Maven 3.3.9
- Logging component: logback 1.2.3
- Lombok: 1.18.6
3. Description of incremental synchronization
Incremental synchronization is defined relative to full synchronization: each run synchronizes only the part of the source database that has changed, which improves synchronization efficiency. It is the common mode in data synchronization today. Extracting the changed data is also known as CDC, i.e. Change Data Capture. The book "Pentaho Kettle Solutions" (on building open source ETL solutions with PDI) explains CDC in detail; here is a brief summary. There are currently four ways to implement incremental synchronization: source-data-based CDC, trigger-based CDC, snapshot-based CDC, and log-based CDC.
3.1 Source-data-based CDC
Source-data-based CDC requires the source data to contain attribute columns from which the incremental data can be identified. The most common such columns are:
- Timestamp: identify data by time. At least one time column is needed, preferably two: one marking the creation time and one marking the update time. This is why, when designing database tables, we usually add `sys_create_time` and `sys_update_time` as default fields, with defaults that set and update them to the current time.
- Auto-increment sequence: an auto-increment sequence field in the table (usually the primary key) identifies newly inserted data. In practice this is used less often.
This method requires a temporary table to hold the last update time or sequence value; in practice, a separate table is usually created to save this data, and the next run compares against the saved time or sequence. This is a relatively common way to do incremental synchronization, and it is the method used in this article.
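As a minimal, self-contained sketch of the idea (plain Java, with a hypothetical `Row` class standing in for a `test_user` row), timestamp-based CDC boils down to filtering rows whose creation or update time is at or after the saved watermark:

```java
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TimestampCdcSketch {

    // Hypothetical stand-in for a test_user row with the two audit columns.
    static class Row {
        final long id;
        final LocalDateTime sysCreateTime;
        final LocalDateTime sysUpdateTime;
        Row(long id, LocalDateTime create, LocalDateTime update) {
            this.id = id; this.sysCreateTime = create; this.sysUpdateTime = update;
        }
    }

    // Rows created or updated at/after the watermark form the incremental set,
    // mirroring: sys_create_time >= :last OR sys_update_time >= :last
    static List<Row> changedSince(List<Row> all, LocalDateTime lastUpdateTime) {
        return all.stream()
                .filter(r -> !r.sysCreateTime.isBefore(lastUpdateTime)
                          || !r.sysUpdateTime.isBefore(lastUpdateTime))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        LocalDateTime watermark = LocalDateTime.of(2019, 5, 1, 0, 0);
        List<Row> rows = Arrays.asList(
                new Row(1, LocalDateTime.of(2019, 4, 1, 0, 0), LocalDateTime.of(2019, 4, 2, 0, 0)),
                new Row(2, LocalDateTime.of(2019, 4, 1, 0, 0), LocalDateTime.of(2019, 5, 2, 0, 0)),
                new Row(3, LocalDateTime.of(2019, 5, 3, 0, 0), LocalDateTime.of(2019, 5, 3, 0, 0)));
        System.out.println(changedSince(rows, watermark).size()); // prints 2
    }
}
```

In a real database this filter is of course pushed into the SQL `WHERE` clause rather than done in memory, which is exactly what the incremental query later in this article does.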
3.2 Trigger-based CDC
Triggers are written into the source database: when the database executes INSERT, UPDATE, DELETE or similar statements, the triggers fire and save the changed data to a staging table; the changed data is then read from the staging table and synchronized to the target database. This approach is the most invasive; most environments do not allow adding triggers to the database, since they affect performance.
3.3 Snapshot-based CDC
This method extracts all of the current data into a buffer area as a snapshot. At the next synchronization, the data read from the source is compared against the snapshot to find the changed data. In short, it finds changes by reading and comparing the full table. Because it requires a full table scan, its performance is poor, so this approach is generally not used.
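The snapshot comparison can be sketched in a few lines of plain Java. For illustration only, each row is reduced to an id → value mapping; rows whose value is new or different from the snapshot are the changes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class SnapshotCdcSketch {

    // Compare the current state against the previous snapshot; any row whose
    // value is new or different counts as changed (deletions omitted for brevity).
    static Map<Long, String> diff(Map<Long, String> snapshot, Map<Long, String> current) {
        Map<Long, String> changed = new HashMap<>();
        for (Map.Entry<Long, String> e : current.entrySet()) {
            if (!Objects.equals(snapshot.get(e.getKey()), e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<Long, String> snapshot = new HashMap<>();
        snapshot.put(1L, "alice");
        snapshot.put(2L, "bob");
        Map<Long, String> current = new HashMap<>();
        current.put(1L, "alice");  // unchanged
        current.put(2L, "bob2");   // updated
        current.put(3L, "carol");  // inserted
        System.out.println(diff(snapshot, current).size()); // prints 2
    }
}
```

The sketch also makes the drawback obvious: every run must hold and walk the entire table, which is why this approach does not scale.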
3.4 Based on logCDC
The most advanced and the least invasive way is to log-based way, the database will insert, update, delete operations remember to log in, as Mysql
there will be binlog
incremental synchronization can read the log files, binary files into understandable manner, and then the inside of the operations in the sequence over again. However, this approach can only be effective for the same kind of database for heterogeneous database can not be achieved. And to implement certain degree of difficulty.
3.5 Incremental synchronization method used in this example
This example still synchronizes the test_user table incrementally. The table has the fields sys_create_time and sys_update_time to identify when data was created and updated. (If in reality there is only one time field, synchronization can be based on that time alone, but it is then harder to tell whether a row was updated or newly inserted.) The incremental synchronization process is as follows:
Description:
- On every synchronization, the temporary table is read first to get the time of the last synchronized data.
- If this is the first synchronization, everything is synchronized; otherwise the saved time is used as a query parameter.
- The data after that time is read and written into the target table.
- The time in the temporary table is then updated, ready for the next synchronization.
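The steps above can be sketched as a single method in plain Java, with hypothetical in-memory stand-ins for the temp table watermark, the source table and the target table (each row reduced to id → last-touched time):

```java
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;

public class IncrementalSyncSketch {

    // One synchronization pass: copy rows touched at/after the watermark into
    // the target, and return the new watermark to be saved for the next run.
    static LocalDateTime syncOnce(LocalDateTime lastUpdateTime,
                                  Map<Long, LocalDateTime> source,
                                  Map<Long, LocalDateTime> target) {
        LocalDateTime newWatermark = lastUpdateTime;
        for (Map.Entry<Long, LocalDateTime> e : source.entrySet()) {
            // First run (null watermark) copies everything; later runs copy
            // only rows touched at/after the saved time.
            if (lastUpdateTime == null || !e.getValue().isBefore(lastUpdateTime)) {
                target.put(e.getKey(), e.getValue()); // upsert into the target
                if (newWatermark == null || e.getValue().isAfter(newWatermark)) {
                    newWatermark = e.getValue(); // remember the newest time seen
                }
            }
        }
        return newWatermark; // saved back to the temp table for the next run
    }

    public static void main(String[] args) {
        Map<Long, LocalDateTime> source = new HashMap<>();
        source.put(1L, LocalDateTime.of(2019, 4, 1, 0, 0));
        source.put(2L, LocalDateTime.of(2019, 5, 2, 0, 0));
        Map<Long, LocalDateTime> target = new HashMap<>();

        LocalDateTime mark = syncOnce(null, source, target);   // full sync
        System.out.println(target.size() + " " + mark);        // 2 2019-05-02T00:00
        source.put(3L, LocalDateTime.of(2019, 6, 1, 0, 0));    // a new row appears
        mark = syncOnce(mark, source, target);                 // incremental sync
        System.out.println(target.size() + " " + mark);        // 3 2019-06-01T00:00
    }
}
```

The rest of the article implements exactly this loop with real components: the temp table is cdc_temp, the read step is a Spring Batch ItemReader with a dynamic parameter, and the write step is an upsert.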
4. Spring Batch dynamic parameter binding
In the incremental synchronization process above, the key point is saving the data time to the temporary table so it can be used as a comparison condition when reading. This time parameter is dynamic and is only passed in when the task runs. Spring Batch supports dynamic parameter binding: with the @StepScope annotation, combined with BeetlSql, incremental synchronization is easy to implement. This example builds on the example from the previous article; you can download the source code to see the complete example.
4.1 Keep the original database and multi-data-source configuration
- Source database: mytest
- Target database: my_test1
- Spring Batch database: my_spring_batch
- Table to synchronize: test_user
4.2 Create the temporary table
Using the example script sql/initCdcTempTable.sql, create the temporary table cdc_temp in the my_spring_batch database and insert 1 record into it, identifying the test_user table to be synchronized. Here we only need to pay attention to last_update_time and current_update_time: the former is the time of the most recent data as of the last synchronization, the latter is the system time at which the last synchronization ran.
4.3 Add/modify DAOs
4.3.1 Add the temporary table DAO and service classes
- Add the class CdcTempRepository
According to the configuration, cdc_temp lives in my_spring_batch and is read through the dao.local package, so the dao.local package needs to be added first, and then the class CdcTempRepository, as follows:
@Repository
public interface CdcTempRepository extends BaseMapper<CdcTemp> {
}
- Add the class CdcTempService
for reading and updating the cdc_temp table. It consists of two functions: one gets the current cdc_temp record by ID, in order to obtain the time of the last synchronized data; the other updates the cdc_temp data after synchronization completes. As follows:
/**
 * Get the cdc_temp record by id
 * @param id record ID
 * @return {@link CdcTemp}
 */
public CdcTemp getCurrentCdcTemp(int id){
    return cdcTempRepository.getSQLManager().single(CdcTemp.class, id);
}
/**
 * Update the cdc_temp record according to the parameters
 * @param cdcTempId cdcTempId
 * @param status job status
 * @param lastUpdateTime last update time
 */
public void updateCdcTempAfterJob(int cdcTempId, BatchStatus status, Date lastUpdateTime){
    // fetch the current record
    CdcTemp cdcTemp = cdcTempRepository.getSQLManager().single(CdcTemp.class, cdcTempId);
    cdcTemp.setCurrentUpdateTime(DateUtil.date());
    // on normal completion, update the data time
    if (status == BatchStatus.COMPLETED){
        cdcTemp.setLastUpdateTime(lastUpdateTime);
    } else {
        log.info(LogConstants.LOG_TAG + "Abnormal sync status: " + status.toString());
    }
    // record the sync status
    cdcTemp.setStatus(status.name());
    cdcTempRepository.updateById(cdcTemp);
}
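The essence of updateCdcTempAfterJob — advance the saved watermark only when the job completed successfully — can be isolated as a tiny testable rule. This is a plain-Java illustration with hypothetical names, not the BeetlSql-backed service itself:

```java
import java.time.LocalDateTime;

public class WatermarkRuleSketch {

    // Only a COMPLETED job may advance the saved last-update time; a failed
    // run keeps the old watermark so the next run re-reads the same window.
    static LocalDateTime nextWatermark(String jobStatus,
                                       LocalDateTime current,
                                       LocalDateTime candidate) {
        return "COMPLETED".equals(jobStatus) ? candidate : current;
    }

    public static void main(String[] args) {
        LocalDateTime oldMark = LocalDateTime.of(2019, 5, 1, 0, 0);
        LocalDateTime newMark = LocalDateTime.of(2019, 6, 1, 0, 0);
        System.out.println(nextWatermark("COMPLETED", oldMark, newMark)); // 2019-06-01T00:00
        System.out.println(nextWatermark("FAILED", oldMark, newMark));    // 2019-05-01T00:00
    }
}
```

Keeping the old watermark on failure is what makes the job safe to rerun: the rows from the failed window are simply read again, and the upsert write (section 4.4.2) makes re-processing them harmless.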
4.3.2 Modify the source data DAO
In the source data DAO class OriginUserRepository, add a function getOriginIncreUser, which corresponds to the sql statement of the same name in user.md.
4.3.3 Modify the target data DAO
In the target data DAO class TargetUserRepository, add the function selectMaxUpdateTime, which queries the time of the most recent data after synchronization. Since this method's sql is simple, the @Sql annotation can be used directly, as follows:
@Sql(value="select max(sys_update_time) from test_user")
Date selectMaxUpdateTime();
4.4 Modify the sql statements in user.md
4.4.1 Add the incremental read sql
In user.md, add the sql statement that reads incremental data, as follows:
getOriginIncreUser
===
* query user data
select * from test_user
WHERE 1=1
@if(!isEmpty(lastUpdateTime)){
AND (sys_create_time >= #lastUpdateTime# OR sys_update_time >= #lastUpdateTime#)
@}
Description:
- Lines beginning with @ are beetl syntax; they can read variables and make logical judgments. Here the meaning is: if the variable lastUpdateTime is not empty, apply these conditions when reading. The lastUpdateTime variable is passed in (as a Map) when the statement is called.
- For details of the beetl syntax, see the official documentation.
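To make the conditional rendering concrete, here is a small plain-Java sketch of the same idea (this is an illustration, not BeetlSql's actual template engine): the time filter is appended only when the parameter map actually carries lastUpdateTime:

```java
import java.util.HashMap;
import java.util.Map;

public class DynamicSqlSketch {

    // Mimics the @if block in user.md: append the time filter only when
    // the lastUpdateTime parameter is present and non-empty.
    static String render(Map<String, Object> params) {
        StringBuilder sql = new StringBuilder("select * from test_user WHERE 1=1");
        Object last = params.get("lastUpdateTime");
        if (last != null && !last.toString().isEmpty()) {
            sql.append(" AND (sys_create_time >= ? OR sys_update_time >= ?)");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        System.out.println(render(params)); // first run: full-table read

        params.put("lastUpdateTime", "2019-05-01 00:00:00");
        System.out.println(render(params)); // later runs: incremental read
    }
}
```

This is also why the statement starts with `WHERE 1=1`: it gives the optional condition a fixed place to attach with `AND`, whether or not the parameter is present.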
4.4.2 Add the incremental insert sql
MySQL supports the insert into ... on duplicate key update ... syntax: based on a unique key (the primary key or a unique index), if the row already exists it is updated, and if not it is inserted. In the user.md file, add the following statement:
insertIncreUser
===
* insert data
insert into test_user(id,name,phone,title,email,gender,date_of_birth,sys_create_time,sys_create_user,sys_update_time,sys_update_user)
values (#id#,#name#,#phone#,#title#,#email#,#gender#,#dateOfBirth#
,#sysCreateTime#,#sysCreateUser#,#sysUpdateTime#,#sysUpdateUser#)
ON DUPLICATE KEY UPDATE
id = VALUES(id),
name = VALUES(name),
phone = VALUES(phone),
title = VALUES(title),
email = VALUES(email),
gender = VALUES(gender),
date_of_birth = VALUES(date_of_birth),
sys_create_time = VALUES(sys_create_time),
sys_create_user = VALUES(sys_create_user),
sys_update_time = VALUES(sys_update_time),
sys_update_user = VALUES(sys_update_user)
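The upsert semantics can be illustrated without a real MySQL connection: a map keyed by the unique key behaves the same way. This is only an analogy for the statement above, not how the job stores data:

```java
import java.util.HashMap;
import java.util.Map;

public class UpsertSketch {

    // Map.put has the "insert or update by unique key" semantics of MySQL's
    // INSERT ... ON DUPLICATE KEY UPDATE, with the id playing the key role.
    static void upsert(Map<Long, String> table, long id, String name) {
        table.put(id, name);
    }

    public static void main(String[] args) {
        Map<Long, String> table = new HashMap<>();
        upsert(table, 1L, "alice");   // key absent  -> insert
        upsert(table, 1L, "alice2");  // key present -> update in place
        System.out.println(table.size() + " " + table.get(1L)); // prints: 1 alice2
    }
}
```

Because re-applying the same row is harmless (the second write just overwrites the first), the incremental job can safely re-read a time window after a failure without creating duplicates.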
4.5 Write the Spring Batch components
The Spring Batch file structure is as follows:
4.5.1 ItemReader
This is consistent with the previous article; just change the getOriginUser function to getOriginIncreUser.
4.5.2 ItemWriter
Also consistent with the previous article; just change the sql ID from user.insertUser to user.insertIncreUser.
4.5.3 Add IncrementJobEndListener
After data synchronization finishes, the last step is to update the last data time in the temporary table. As follows:
@Slf4j
public class IncrementJobEndListener extends JobExecutionListenerSupport {
    @Autowired
    private CdcTempService cdcTempService;
    @Autowired
    private TargetUserRepository targetUserRepository;

    @Override
    public void afterJob(JobExecution jobExecution) {
        BatchStatus status = jobExecution.getStatus();
        Date latestDate = targetUserRepository.selectMaxUpdateTime();
        cdcTempService.updateCdcTempAfterJob(SyncConstants.CDC_TEMP_ID_USER, status, latestDate);
    }
}
Description:
- First query the time of the most recent data currently in the target database (selectMaxUpdateTime).
- Then update last_update_time in the cdc_temp table.
4.5.4 Add task startup parameter initialization
For the first data synchronization, the last-update time in the temporary table needs to be initialized, so before starting the task the job parameters have to be prepared and the time parameter passed to the task for use during execution. As follows:
public JobParameters initJobParam(){
    CdcTemp currentCdcTemp = cdcTempService.getCurrentCdcTemp(getCdcTempId());
    // if not yet initialized (or the last run failed), first query the
    // corresponding last data time from the database
    if(SyncConstants.STR_STATUS_INIT.equals(currentCdcTemp.getStatus())
            || SyncConstants.STR_STATUS_FAILED.equals(currentCdcTemp.getStatus())){
        Date maxUpdateTime = selectMaxUpdateTime();
        // if there is no data, fall back to the initial time
        if(Objects.nonNull(maxUpdateTime)){
            currentCdcTemp.setLastUpdateTime(maxUpdateTime);
        }
    }
    return JobUtil.makeJobParameters(currentCdcTemp);
}
4.5.5 Assemble the complete task
Finally, an IncrementBatchConfig configuration class is needed to assemble the read, process, write and listener components. It is worth mentioning that when configuring the reader component, because dynamic parameters are needed, the @StepScope annotation must be added, and a SpEL expression is used to obtain the parameter content, as follows:
@Bean
@StepScope
public ItemReader incrementItemReader(@Value("#{jobParameters['lastUpdateTime']}") String lastUpdateTime) {
    IncrementUserItemReader userItemReader = new IncrementUserItemReader();
    // set the parameters (optional in this example)
    Map<String,Object> params = CollUtil.newHashMap();
    params.put(SyncConstants.STR_LAST_UPDATE_TIME, lastUpdateTime);
    userItemReader.setParams(params);
    return userItemReader;
}
4.5.6 Testing
Referring to the previous article's BeetlsqlJobTest, write the IncrementJobTest test file. Since incremental synchronization needs to be tested, the test procedure is as follows:
- Add incremental data before the test
Before the test, the source and target tables both contain data. Execute the sql/user-data-new.sql script against the source table to add new users. Note that because sys_create_time and sys_update_time are defined as follows:
`sys_create_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`sys_update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
the creation time is generated automatically when a row is inserted, and the modification time is updated automatically on change.
- Run the test
Run incrementJob in the unit test.
- View the results
After the run completes, the results are as follows:
After incremental synchronization, the data is as follows:
5. Summary
This article first gave a brief introduction to incremental synchronization and listed the commonly used incremental synchronization methods, then used Spring Batch and BeetlSql to implement timestamp-based incremental synchronization. The example has a certain practical value, and hopefully it helps developers working on data synchronization or related batch processing.