Incremental synchronization - Spring Batch (6): dynamic parameter binding and incremental synchronization

Disclaimer: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/masson32/article/details/91430684

tags:springbatch


1 Introduction

The previous article, "Easy data reading and writing - Spring Batch (5): reading and writing data with BeetlSql", used Spring Batch and BeetlSql to build a database-to-database synchronization component, which in fact performs full synchronization. The problem with full synchronization is that each run has to read the entire table; if the table is large, this consumes a lot of resources and makes it hard to update existing data. In practice, therefore, data synchronization is more often incremental: based on certain conditions, new data is inserted, changed data is updated, and data that no longer exists is deleted (in general, data is not physically deleted, only logically deleted, so a deletion also becomes an update).

In most cases an incremental update depends on a stored update state (for example a time, an auto-increment ID, or a data position), and the next update starts from that state. The state left by each update therefore has to be saved as a variable and passed to the next update as a dynamic parameter. Spring Batch supports dynamic parameters at job run time, and this feature can be used to implement incremental data synchronization.

2. Development Environment

  • JDK: jdk1.8
  • Spring Boot: 2.1.4.RELEASE
  • Spring Batch: 4.1.2.RELEASE
  • Development IDE: IDEA
  • Build tools Maven: 3.3.9
  • Logging component logback: 1.2.3
  • lombok: 1.18.6

3. Description of incremental synchronization

Incremental synchronization is the counterpart of full synchronization: each run synchronizes only the part of the source database that has changed, which improves synchronization efficiency. It is currently the usual mode of data synchronization. Extracting changed data is also known as CDC (Change Data Capture). The book "Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration" explains CDC in more detail. Briefly, there are currently four ways to implement incremental synchronization: CDC based on source data, trigger-based CDC, snapshot-based CDC, and log-based CDC.

3.1 CDC based on source data

CDC based on source data requires the source data to contain attribute columns that can be used to identify the incremental data. The most common attribute columns are:

  • Timestamp
    Data is identified by time. At least one timestamp is needed, preferably two: one marking the creation time and one marking the update time. This is why, when designing a database, we usually add sys_create_time and sys_update_time as default fields, with the current time as the default value and automatic updating on modification.

  • Auto-increment sequence
    An auto-increment field in the table (usually the primary key) identifies newly inserted data. In practice this is used less often.

This method needs a temporary table that stores the last update time or the last sequence value. In practice this table is usually created in a separate schema, and the stored value is compared against on the next update. This is a fairly common way to do incremental synchronization, and it is the method used in this article.
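The timestamp approach can be illustrated with a minimal in-memory sketch. It does not use a database; the class TimestampCdcSketch, its Row type, and the method changedSince are hypothetical names chosen for this illustration, and only the two time fields mirror the columns described above.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TimestampCdcSketch {

    // Hypothetical in-memory stand-in for one row of a source table.
    static final class Row {
        final long id;
        final LocalDateTime sysCreateTime;
        final LocalDateTime sysUpdateTime;

        Row(long id, LocalDateTime sysCreateTime, LocalDateTime sysUpdateTime) {
            this.id = id;
            this.sysCreateTime = sysCreateTime;
            this.sysUpdateTime = sysUpdateTime;
        }
    }

    // Returns the rows created or updated at or after the saved time.
    // A null lastUpdateTime means "never synchronized", so every row is returned.
    static List<Row> changedSince(List<Row> source, LocalDateTime lastUpdateTime) {
        List<Row> changed = new ArrayList<>();
        for (Row row : source) {
            boolean isChanged = lastUpdateTime == null
                    || !row.sysCreateTime.isBefore(lastUpdateTime)
                    || !row.sysUpdateTime.isBefore(lastUpdateTime);
            if (isChanged) {
                changed.add(row);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        LocalDateTime t1 = LocalDateTime.of(2019, 6, 1, 0, 0);
        LocalDateTime t2 = LocalDateTime.of(2019, 6, 2, 0, 0);
        LocalDateTime t3 = LocalDateTime.of(2019, 6, 3, 0, 0);
        List<Row> source = Arrays.asList(
                new Row(1, t1, t1),   // untouched since t1
                new Row(2, t1, t3),   // updated at t3
                new Row(3, t3, t3));  // inserted at t3
        for (Row row : changedSince(source, t2)) {
            System.out.println("changed id=" + row.id);
        }
    }
}
```

With the saved time set to t2, only the updated row and the newly inserted row are selected; the untouched row is skipped.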

3.2 Trigger-based CDC

A trigger is written in the database. When the database executes statements such as INSERT, UPDATE, or DELETE, the trigger fires and saves the changed data to a temporary table; the data is then read from the temporary table and synchronized to the target database. This approach is the most invasive, and adding triggers to a database is generally not allowed because it affects performance.

3.3 Snapshot-based CDC

This method extracts all of the current data into a buffer area as a snapshot. On the next synchronization, the data is read from the source again and compared with the snapshot to find what has changed. Put simply, it reads and compares the whole table to find the changes. Because it requires a full table scan, performance is a problem, and this approach is generally not used.
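The comparison step can be sketched as a diff between two maps keyed by row ID. This is only an illustration under simplifying assumptions: each row is reduced to a single String value, and the class and method names (SnapshotCdcSketch, inserted, updated, deleted) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class SnapshotCdcSketch {

    // IDs present now but missing from the snapshot.
    static Set<Long> inserted(Map<Long, String> snapshot, Map<Long, String> current) {
        Set<Long> ids = new TreeSet<>(current.keySet());
        ids.removeAll(snapshot.keySet());
        return ids;
    }

    // IDs present in the snapshot but missing now.
    static Set<Long> deleted(Map<Long, String> snapshot, Map<Long, String> current) {
        Set<Long> ids = new TreeSet<>(snapshot.keySet());
        ids.removeAll(current.keySet());
        return ids;
    }

    // IDs present in both whose content differs.
    static Set<Long> updated(Map<Long, String> snapshot, Map<Long, String> current) {
        Set<Long> ids = new TreeSet<>();
        for (Map.Entry<Long, String> entry : current.entrySet()) {
            Long id = entry.getKey();
            if (snapshot.containsKey(id) && !Objects.equals(snapshot.get(id), entry.getValue())) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<Long, String> snapshot = new HashMap<>();
        snapshot.put(1L, "alice");
        snapshot.put(2L, "bob");
        snapshot.put(4L, "dave");

        Map<Long, String> current = new HashMap<>();
        current.put(1L, "alice");
        current.put(2L, "bobby");  // changed since the snapshot
        current.put(3L, "carol");  // new since the snapshot

        System.out.println("inserted=" + inserted(snapshot, current));
        System.out.println("updated=" + updated(snapshot, current));
        System.out.println("deleted=" + deleted(snapshot, current));
    }
}
```

Every call walks the complete key set of both maps, which is the in-memory equivalent of the full table scan mentioned above.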

3.4 Log-based CDC

The most advanced and least invasive way is log-based. The database records insert, update, and delete operations in a log; MySQL, for example, has the binlog. Incremental synchronization reads the log file, converts the binary content into a readable form, and then replays the operations in order. However, this approach only works between databases of the same kind and cannot be used for heterogeneous databases. It is also fairly difficult to implement.
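The replay step can be pictured with a small sketch that applies an ordered list of change events to an in-memory target. It deliberately leaves out reading and decoding a real log such as the binlog, which is the hard part; ChangeEvent, Op, and replay are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LogReplaySketch {

    enum Op { INSERT, UPDATE, DELETE }

    // Hypothetical, already-decoded log entry: one operation on one row.
    static final class ChangeEvent {
        final Op op;
        final long id;
        final String value;  // ignored for DELETE

        ChangeEvent(Op op, long id, String value) {
            this.op = op;
            this.id = id;
            this.value = value;
        }
    }

    // Applies the events to the target strictly in log order.
    static void replay(List<ChangeEvent> log, Map<Long, String> target) {
        for (ChangeEvent event : log) {
            switch (event.op) {
                case INSERT:
                case UPDATE:
                    target.put(event.id, event.value);
                    break;
                case DELETE:
                    target.remove(event.id);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        Map<Long, String> target = new TreeMap<>();
        target.put(1L, "alice");

        List<ChangeEvent> log = Arrays.asList(
                new ChangeEvent(Op.INSERT, 2L, "bob"),
                new ChangeEvent(Op.UPDATE, 1L, "alicia"),
                new ChangeEvent(Op.DELETE, 2L, null));

        replay(log, target);
        System.out.println(target);
    }
}
```

Order matters here: applying the DELETE before the INSERT for the same ID would leave a different final state, which is why the operations are replayed in sequence.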

3.5 The incremental synchronization method used in this example

This example again performs incremental synchronization on the test_user table, which has the fields sys_create_time and sys_update_time to mark when a row was created and last updated. (If in practice there is only one time field, synchronization can be based on that field alone, but it is then harder to tell whether a row was updated or inserted.) The incremental synchronization process is as follows:

Process

Description:

  • Each synchronization first reads from the temporary table the latest data time recorded after the previous synchronization.
  • On the first synchronization, all data is synchronized; otherwise, the saved time is passed to the query statement as a parameter.
  • The data after that time is read and written to the target table.
  • The time in the temporary table is updated for use by the next synchronization.
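The four steps above can be sketched end to end with in-memory stand-ins for the source table, the target table, and the saved time. This is not the Spring Batch implementation; all names are hypothetical, and for brevity the sketch filters on the update time only.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IncrementalSyncFlowSketch {

    // Hypothetical, reduced version of a test_user row.
    static final class UserRow {
        final long id;
        final String name;
        final LocalDateTime sysUpdateTime;

        UserRow(long id, String name, LocalDateTime sysUpdateTime) {
            this.id = id;
            this.name = name;
            this.sysUpdateTime = sysUpdateTime;
        }
    }

    final List<UserRow> source = new ArrayList<>();           // stands in for the source table
    final Map<Long, UserRow> target = new LinkedHashMap<>();  // stands in for the target table
    LocalDateTime lastUpdateTime;                             // stands in for the saved time; null before the first run

    // Runs one synchronization and returns the number of rows written.
    int syncOnce() {
        // Step 1: read the time saved by the previous synchronization.
        LocalDateTime since = lastUpdateTime;

        // Steps 2 and 3: on the first run take everything, otherwise only rows
        // at or after the saved time, and write them to the target by ID.
        int written = 0;
        for (UserRow row : source) {
            if (since == null || !row.sysUpdateTime.isBefore(since)) {
                target.put(row.id, row);
                written++;
            }
        }

        // Step 4: save the newest time now present in the target.
        for (UserRow row : target.values()) {
            if (lastUpdateTime == null || row.sysUpdateTime.isAfter(lastUpdateTime)) {
                lastUpdateTime = row.sysUpdateTime;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        IncrementalSyncFlowSketch sync = new IncrementalSyncFlowSketch();
        sync.source.add(new UserRow(1, "alice", LocalDateTime.of(2019, 6, 1, 0, 0)));
        sync.source.add(new UserRow(2, "bob", LocalDateTime.of(2019, 6, 2, 0, 0)));
        System.out.println("first run wrote " + sync.syncOnce());

        sync.source.add(new UserRow(3, "carol", LocalDateTime.of(2019, 6, 3, 0, 0)));
        System.out.println("second run wrote " + sync.syncOnce());
    }
}
```

Because the comparison is inclusive, the second run re-reads the row whose time equals the saved time as well as the new row. Writing by ID makes that re-read harmless, since the existing row is simply overwritten.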

4. Spring Batch dynamic parameter binding

In the process above, the key point is to save the data time in the temporary table so that it can be used as the comparison condition when data is read. This time parameter is dynamic and is passed in only when the job runs. Spring Batch supports dynamic parameter binding through the @StepScope annotation; combined with BeetlSql, incremental synchronization can be implemented quickly. This example builds on the example from the previous article; the complete example is available in the downloadable source code.

4.1 Reuse the original database configuration and multiple data sources

  • Source database: mytest
  • Target database: my_test1
  • Spring Batch database: my_spring_batch
  • Synchronized table: test_user

4.2 Create the temporary table

Use sql/initCdcTempTable.sql from the example to create the temporary table cdc_temp in the my_spring_batch database and insert a record with ID 1, which identifies the synchronization of the test_user table. Here we only need to focus on last_update_time and current_update_time: the former is the latest data time after the last synchronization, and the latter is the system time of the last synchronization.

4.3 Add / modify the DAOs

4.3.1 Add the DAO and service classes for the temporary table

  • Add the class CdcTempRepository

According to the configuration, cdc_temp is in my_spring_batch and is read through the dao.local package, so the dao.local package has to be added first, and then the class CdcTempRepository, as follows:

@Repository
public interface CdcTempRepository extends BaseMapper<CdcTemp> {
}
  • Add the class CdcTempService for reading and updating the cdc_temp table
    It contains two functions. One fetches the current cdc_temp record by ID, to obtain the data time of the previous synchronization. The other updates the cdc_temp data once synchronization is complete. As follows:
/**
 * Get the cdc_temp record by id
 * @param id record ID
 * @return {@link CdcTemp}
 */
public CdcTemp getCurrentCdcTemp(int id){
    return cdcTempRepository.getSQLManager().single(CdcTemp.class, id);
}

/**
 * Update the cdc_temp table according to the parameters
 * @param cdcTempId cdcTempId
 * @param status job status
 * @param lastUpdateTime last update time
 */
public void updateCdcTempAfterJob(int cdcTempId,BatchStatus status,Date lastUpdateTime){
    // Fetch the current record
    CdcTemp cdcTemp = cdcTempRepository.getSQLManager().single(CdcTemp.class, cdcTempId);
    cdcTemp.setCurrentUpdateTime(DateUtil.date());
    // Update the data time only if the job completed normally
    if( status == BatchStatus.COMPLETED){
        cdcTemp.setLastUpdateTime(lastUpdateTime);
    }else{
        log.info(LogConstants.LOG_TAG+"Abnormal synchronization status: "+ status.toString());
    }
    // Set the synchronization status
    cdcTemp.setStatus(status.name());
    cdcTempRepository.updateById(cdcTemp);
}

4.3.2 Modify the source data DAO

In the source data DAO class OriginUserRepository, add a function getOriginIncreUser, which corresponds to the SQL statement in user.md.

4.3.3 Modify the target data DAO

In the target data DAO class TargetUserRepository, add a function selectMaxUpdateTime, which queries the latest data time after synchronization. Since this SQL is simple, the @Sql annotation can be used directly, as follows:

@Sql(value="select max(sys_update_time) from test_user")
Date selectMaxUpdateTime();

4.4 Modify the SQL statements in user.md

4.4.1 Add the SQL for reading incremental data

In user.md, add the SQL statement that reads incremental data, as follows:

getOriginIncreUser
===
* Query user data

select * from test_user
WHERE 1=1
@if(!isEmpty(lastUpdateTime)){
AND (sys_create_time >= #lastUpdateTime# OR sys_update_time >= #lastUpdateTime#)
@}

Description:

  • Lines starting with @ are Beetl syntax, which can read variables and perform logical tests. Here it means that the condition is applied only if the variable lastUpdateTime is not empty.
  • lastUpdateTime is a variable passed in (through a Map) when the statement is called.
  • For the full Beetl syntax, see the official documentation.

4.4.2 SQL statement for incremental writes

MySQL supports insert into ... on duplicate key update ..., which works on a unique key (primary key or unique index): if the row already exists it is updated, otherwise it is inserted. In the user.md file, add the following statement:

insertIncreUser
===
* Insert data

insert into test_user(id,name,phone,title,email,gender,date_of_birth,sys_create_time,sys_create_user,sys_update_time,sys_update_user)
values (#id#,#name#,#phone#,#title#,#email#,#gender#,#dateOfBirth#
    ,#sysCreateTime#,#sysCreateUser#,#sysUpdateTime#,#sysUpdateUser#)
ON DUPLICATE KEY UPDATE 
id = VALUES(id),
name = VALUES(name),
phone = VALUES(phone),
title = VALUES(title),
email = VALUES(email),
gender = VALUES(gender),
date_of_birth = VALUES(date_of_birth),
sys_create_time = VALUES(sys_create_time),
sys_create_user = VALUES(sys_create_user),
sys_update_time = VALUES(sys_update_time),
sys_update_user = VALUES(sys_update_user)

4.5 Write the Spring Batch components

The Spring Batch file structure is as follows:

File Structure

4.5.1 ItemReader

This is the same as before; only the function getOriginUser needs to be changed to getOriginIncreUser.

4.5.2 ItemWriter

This is the same as before; only the SQL ID needs to be changed from user.insertUser to user.insertIncreUser.

4.5.3 Add IncrementJobEndListener

After the data has been synchronized, the final step is to update the latest data time in the temporary table, as follows:

@Slf4j
public class IncrementJobEndListener extends JobExecutionListenerSupport {

    @Autowired
    private CdcTempService cdcTempService;

    @Autowired
    private TargetUserRepository targetUserRepository;

    @Override
    public void afterJob(JobExecution jobExecution) {
        BatchStatus status = jobExecution.getStatus();
        Date latestDate  = targetUserRepository.selectMaxUpdateTime();
        cdcTempService.updateCdcTempAfterJob(SyncConstants.CDC_TEMP_ID_USER,status,latestDate);
    }
}

Description:

  • First query the latest data time currently in the target database (selectMaxUpdateTime).
  • Then update last_update_time in the intermediate table cdc_temp.

4.5.4 Add job parameter initialization at startup

The first step of synchronization is to load the last update time from the temporary table. So before the job starts, the job parameters must be initialized so that the time parameter is passed to the job when it runs, as follows:

public JobParameters initJobParam(){
    CdcTemp currentCdcTemp = cdcTempService.getCurrentCdcTemp(getCdcTempId());
    // If not yet initialized (or the last run failed), first query the latest time in the database
    if(SyncConstants.STR_STATUS_INIT.equals(currentCdcTemp.getStatus())
            || SyncConstants.STR_STATUS_FAILED.equals(currentCdcTemp.getStatus())){
        Date maxUpdateTime = selectMaxUpdateTime();
        // If there is no data, keep the initial time
        if(Objects.nonNull(maxUpdateTime)){
            currentCdcTemp.setLastUpdateTime(maxUpdateTime);
        }
    }
    return JobUtil.makeJobParameters(currentCdcTemp);
}
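
JobUtil.makeJobParameters is not shown in this article. As a rough illustration of what has to happen at this point, the sketch below turns the saved time into a String parameter and parses it back, using plain Java rather than Spring Batch types. The class WatermarkParamSketch, the runAt key, and the date pattern are assumptions made for the illustration; only the key lastUpdateTime comes from the example itself.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

final class WatermarkParamSketch {

    // Assumed text format for the time parameter.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Builds the name/value pairs that would be handed to the job.
    // The time is left out when nothing has been synchronized yet.
    static Map<String, String> toParams(LocalDateTime lastUpdateTime, long runAtMillis) {
        Map<String, String> params = new LinkedHashMap<>();
        if (lastUpdateTime != null) {
            params.put("lastUpdateTime", FORMAT.format(lastUpdateTime));
        }
        // Hypothetical extra parameter that differs on every run.
        params.put("runAt", Long.toString(runAtMillis));
        return params;
    }

    // Parses the parameter back; null or empty means "no saved time".
    static LocalDateTime parse(String value) {
        return (value == null || value.isEmpty()) ? null : LocalDateTime.parse(value, FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime saved = LocalDateTime.of(2019, 6, 3, 12, 30, 0);
        Map<String, String> params = toParams(saved, System.currentTimeMillis());
        System.out.println(params);
        System.out.println(parse(params.get("lastUpdateTime")));
    }
}
```

The extra per-run value is included because Spring Batch identifies a job instance by its identifying parameters and will not rerun an instance that has already completed. If the source has no changes between two runs, lastUpdateTime alone would be identical both times.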

4.5.5 Assemble the complete job

Finally, an IncrementBatchConfig configuration assembles the reader, processor, writer, and listener. Note that because the reader needs dynamic parameters, its bean must be annotated with @StepScope and the parameter value obtained with SpEL, as follows:

@Bean
@StepScope
public ItemReader incrementItemReader(@Value("#{jobParameters['lastUpdateTime']}") String lastUpdateTime) {
    IncrementUserItemReader userItemReader = new IncrementUserItemReader();
    // Set the parameters; in the current example they may be left unset
    Map<String,Object> params = CollUtil.newHashMap();
    params.put(SyncConstants.STR_LAST_UPDATE_TIME,lastUpdateTime);
    userItemReader.setParams(params);

    return userItemReader;
}

4.5.6 Testing

Following BeetlsqlJobTest from the previous article, write the test file IncrementJobTest. Because incremental synchronization is being tested, the procedure is as follows:

  • Add incremental data before the test
    Before the test, the source and target tables already contain data. In the source table, run sql/user-data-new.sql to add new users. Note that sys_create_time and sys_update_time are defined as follows:
`sys_create_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`sys_update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

As a result, the creation time is generated automatically when a row is inserted, and the update time is updated automatically when a row is modified.

  • Run the test
    Run incrementJob in the unit test.

  • View the results
    After the run completes, the output is as follows:

    Export

After incremental synchronization, the data is as follows:

result

5. Summary

This article first gave a brief introduction to incremental synchronization and listed the commonly used methods, then used Spring Batch and BeetlSql to implement timestamp-based incremental synchronization. The example is of some practical use and will hopefully help developers working on data synchronization or related batch processing.
