How to implement a simple data synchronization component

There are many open-source data synchronization components, such as Flink CDC, DataX, SeaTunnel, and Kettle, which can be roughly divided into two types: log-based and JDBC-based. When synchronizing an entire database, or when the schema differences are small, these components can meet library-to-library synchronization requirements directly through a visual interface or configuration-file mapping. However, they become awkward to use in scenarios that require heavy customization. To handle such scenarios, the author wrote a small synchronization component.

PS: Our business scenario is quite special: there are many source types, including Oracle, MySQL, files, and API interfaces, while the amount of data to synchronize is not large. Introducing a full-blown synchronization component would bring extra operations and learning costs. Based on these two points, we decided to write a small component ourselves.

Business framework

(figure: business architecture diagram)

Two basic requirements

  • Support dual writing: as external source data passes through the synchronization component, on one hand it is converted into the required thematic data according to a customized, standardized model; on the other hand, the original data must be written unchanged into our own library.
  • Support reverse conversion: thematic-library data can be converted back and written out in the data formats of the different original libraries.

Component model

Taking MySQL as an example, the main modules of the synchronization component are as follows:

(figure: component model diagram)

  • 1. The green background marks a business plugin; each business has its own plugin, which implements the synchronizer Syncer and the converter Convertor.
  • 2. The dotted line is a separate channel for converting target-library data back into the different original-library formats.
  • 3. A queue is used to form a simple producer-consumer model; the queue acts as a buffer, which makes it easy to later add flow control over reading the original library and writing the target library.
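The producer-consumer buffering described in point 3 can be sketched with a bounded BlockingQueue. This is a simplified stand-in for the Disruptor-based queue the component actually uses; the class and method names here are illustrative:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded queue decouples the reader of the original library (producer)
// from the writer of the target library (consumer): when the buffer is full,
// reads are throttled; when it is empty, the writer waits.
public class BufferedPipeline<T> {
    private final BlockingQueue<List<T>> buffer;

    public BufferedPipeline(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: blocks when the buffer is full.
    public void push(List<T> batch) {
        try {
            buffer.put(batch);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    // Consumer side: blocks when the buffer is empty.
    public List<T> poll() {
        try {
            return buffer.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```

The bounded capacity is what gives back-pressure: a fast reader cannot run arbitrarily far ahead of a slow writer.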

The project modules are divided as follows:

(figure: project module layout)

  • api-web exposes the external API: starting, canceling, pausing, and resuming data synchronization tasks, etc.
  • common holds shared utilities, models, and enumerations.
  • connectors contains the concrete connector implementations, such as MysqlConnector, which reads from the original library via the JDBC API.
  • core contains the interface definitions and the standardized processing flow.
  • plugins holds the concrete business plugins, mainly the Syncer synchronizer and the Convertor model converter.

Core problem

The core problem a data synchronization component solves can be abstracted into the following model: A_DB -> A -> B -> B_DB. That is, read data from database A into instances of class A, convert the A instances into instances of class B, and then write the B instances into database B.

From A_DB to A

The A_DB -> A step (and likewise B -> B_DB) is the familiar problem solved by ORMs; whether Hibernate, MyBatis, or Spring Boot JPA, they all revolve around it.

Since this component does not introduce an ORM, mapping a database row to a Java object also has to be implemented by hand. DataX describes this mapping through configuration files; this article does not adopt that approach, and instead uses annotations, which couple more tightly to the language (a choice driven by the business requirements).

Below is the definition of a Java object describing a concrete business entity. The @Table annotation declares which table JmltModel is associated with, and the @Colum annotation declares which column each field is associated with.

@Data
// @Table declares which table JmltModel maps to
@Table(name = "user_info") 
public class JmltModel implements Serializable {
    // @Colum declares which column each field maps to
    @Colum(name = "id")
    private Long id;
    @Colum(name = "email")
    private String email;
    @Colum(name = "name")
    private String name;
    @Colum(name = "create_time")
    private Date create_time;
}

With this mapping in place, the A_DB -> A step can be implemented as a generic template at runtime using generics plus reflection.
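As a hypothetical illustration of that runtime lookup, the annotation-reading step might look like the following. The annotation types are re-declared here so the sketch is self-contained; the real component defines its own @Table/@Colum types:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal re-declarations for illustration only.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Table { String name(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Colum { String name(); }

// Sample annotated model, mirroring the JmltModel above.
@Table(name = "user_info")
class UserInfo {
    @Colum(name = "id") Long id;
    @Colum(name = "email") String email;
}

public class MappingReader {
    // Builds a field-name -> column-name map from the annotations; this is
    // all a SQL template needs to generate SELECT/INSERT statements.
    public static Map<String, String> columnMapping(Class<?> clazz) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (Field f : clazz.getDeclaredFields()) {
            Colum c = f.getAnnotation(Colum.class);
            if (c != null) {
                mapping.put(f.getName(), c.name());
            }
        }
        return mapping;
    }

    public static String tableName(Class<?> clazz) {
        Table t = clazz.getAnnotation(Table.class);
        return t == null ? null : t.name();
    }
}
```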

The following excerpt from MysqlConnector illustrates this (somewhat abridged). The code covers steps 1-6 and belongs to the producer side: it reads data from the original MySQL table page by page, maps each row to the original object, converts it into the target object through the business-defined Convertor, and finally pushes the result onto the queue to await consumption.

// 1. originClass is the original-library class; read the @Table annotation
//    via reflection to obtain the table name
Table table = (Table) originClass.getDeclaredAnnotation(Table.class);
String tableName = table.name();

// 2. Count all rows, then fetch them page by page
SqlTemplate sqlTemplate = new SqlTemplate(this.originDataSource);
int totalCount = sqlTemplate.count();
RowBounds rowBounds = new RowBounds(totalCount);
int totalPage = rowBounds.getTotalPage();
// 3. Pull the data in paged batches
for (int i = 1; i <= totalPage; i++) {
    int offset = rowBounds.getOffset(i);
    String condition = " limit " + offset + "," + rowBounds.getPageSize();
    ResultSet resultSet = sqlTemplate.select(condition);
    // 4. Turn the ResultSet into A — this is the A_DB -> A step
    ResultSetExtractor<R> extractor = (ResultSetExtractor<R>) new ResultSetExtractor<>(originClass);
    List<R> result = extractor.extractData(resultSet);
    // 5. Convert A into B
    List<T> targetResult = convertor.batchConvertFrom(result);
    // 6. Push onto the queue to await consumption
    this.rowObjectManager.pushToQueue(targetResult);
}
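The RowBounds helper used above is not shown in the article; a minimal version consistent with how it is called (a one-argument constructor taking the total row count, and 1-based page numbers) might look like this. The default page size is an assumption:

```java
// Hypothetical RowBounds: computes page count and per-page offsets for
// "LIMIT offset, pageSize" style pagination.
public class RowBounds {
    private static final int DEFAULT_PAGE_SIZE = 1000; // assumed default
    private final int totalCount;
    private final int pageSize;

    public RowBounds(int totalCount) {
        this(totalCount, DEFAULT_PAGE_SIZE);
    }

    public RowBounds(int totalCount, int pageSize) {
        this.totalCount = totalCount;
        this.pageSize = pageSize;
    }

    // Ceiling division: e.g. 2500 rows at 1000 per page -> 3 pages.
    public int getTotalPage() {
        return (totalCount + pageSize - 1) / pageSize;
    }

    // Offset of a 1-based page number.
    public int getOffset(int page) {
        return (page - 1) * pageSize;
    }

    public int getPageSize() {
        return pageSize;
    }
}
```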

Step 4 in the fragment above is the A_DB -> A step: converting a ResultSet into a Java object. When programming directly against the JDBC API, mapping a ResultSet to a known business class is straightforward, roughly like this:

String selectSql = "SELECT * FROM employees";
try (ResultSet resultSet = stmt.executeQuery(selectSql)) {
    List<Employee> employees = new ArrayList<>();
    while (resultSet.next()) {
        Employee emp = new Employee();
        emp.setId(resultSet.getInt("emp_id"));
        emp.setName(resultSet.getString("name"));
        emp.setPosition(resultSet.getString("position"));
        emp.setSalary(resultSet.getDouble("salary"));
        employees.add(emp);
    }
}

This works fine when the target Java class is known up front, but it is not enough for a general-purpose component, which needs a generic way to turn a ResultSet into Java objects. The idea is ResultSet -> Map -> Java object: through the ResultSet's getMetaData you can obtain every column name (key) and its value, and store them in a Map. The code is as follows:

/**
 * Convert the current ResultSet row into a Map
 *
 * @param resultSet the result set positioned on a row
 * @return a column-name -> value map
 * @throws Exception on JDBC errors
 */
private Map<String, Object> resultSetToMap(ResultSet resultSet) throws Exception {
    Map<String, Object> resultMap = new HashMap<>();
    // Get the column count from the ResultSet metadata
    int columnCount = resultSet.getMetaData().getColumnCount();
    // Iterate over each column, storing column name and value in the map
    for (int i = 1; i <= columnCount; i++) {
        String columnName = resultSet.getMetaData().getColumnName(i);
        Object value = resultSet.getObject(i);
        resultMap.put(columnName, value);
    }
    return resultMap;
}

The next step is to convert the Map into a Java object; at the framework level, generics keep this general:

private T mapResultSetToObject(Map<String, Object> resultMap, Class<T> objectType) throws Exception {
    // Instantiate an object of the target type
    T object = objectType.newInstance();
    // Use each map key as the field name and each map value as the field value
    for (Map.Entry<String, Object> entry : resultMap.entrySet()) {
        String fieldName = entry.getKey();
        Object value = entry.getValue();
        try {
            Field declaredField = objectType.getDeclaredField(fieldName);
            declaredField.setAccessible(true);
            declaredField.set(object, value);
        } catch (NoSuchFieldException e) {
            LOGGER.error("ignore exception, fieldName: " + fieldName + ", objectType: " + objectType);
        }
    }
    // Return the populated object
    return object;
}
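Putting the two steps together, a stripped-down mapper that turns a column map into a target object can be exercised without any JDBC plumbing. The class names here are illustrative:

```java
import java.lang.reflect.Field;
import java.util.Map;

// Mirrors the article's map-to-object step: map keys are matched against
// field names; unmatched columns are skipped, as in the original code.
public class MapToObjectMapper {
    public static <T> T map(Map<String, Object> row, Class<T> type) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> e : row.entrySet()) {
                try {
                    Field f = type.getDeclaredField(e.getKey());
                    f.setAccessible(true);
                    f.set(instance, e.getValue());
                } catch (NoSuchFieldException ignored) {
                    // Columns without a matching field are skipped.
                }
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

// Illustrative target type.
class UserRow {
    Long id;
    String name;
}
```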

The A -> B step is defined by the business itself: the Convertor. Below is the Convertor interface definition; a business implements it to perform object-to-object conversion (including batch conversion).

// T is the target object type, R is the original object type
public interface Convertor<T, R> {

    /**
     * Convert R into T
     *
     * @param origin the original object
     * @return the converted target object
     */
    T convertFrom(R origin) throws SQLException;

    /**
     * Convert T back into R
     *
     * @param target the target object
     * @return the converted original object
     */
    R convertTo(T target);

    /**
     * Batch-convert R into T
     *
     * @param origin the original objects
     * @return the converted target objects
     * @throws SQLException on conversion errors
     */
    List<T> batchConvertFrom(List<R> origin) throws SQLException;

    /**
     * Batch-convert T back into R
     *
     * @param target the target objects
     * @return the converted original objects
     */
    List<R> batchConvertTo(List<T> target);
}
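
As a hypothetical business plugin, a Convertor between an original-library user model and a thematic model might look like the following. The interface is repeated here without the SQLException clause and with default batch implementations, purely to keep the sketch self-contained; all model names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified version of the article's Convertor interface.
interface Convertor<T, R> {
    T convertFrom(R origin);
    R convertTo(T target);

    default List<T> batchConvertFrom(List<R> origin) {
        List<T> out = new ArrayList<>(origin.size());
        for (R r : origin) out.add(convertFrom(r));
        return out;
    }

    default List<R> batchConvertTo(List<T> target) {
        List<R> out = new ArrayList<>(target.size());
        for (T t : target) out.add(convertTo(t));
        return out;
    }
}

class OriginUser { Long id; String name; }        // A: original-library model
class TopicUser { Long userId; String userName; } // B: thematic-library model

// The business plugin supplies the field-level mapping in both directions,
// which also covers the reverse-conversion requirement from earlier.
class UserConvertor implements Convertor<TopicUser, OriginUser> {
    public TopicUser convertFrom(OriginUser origin) {
        TopicUser t = new TopicUser();
        t.userId = origin.id;
        t.userName = origin.name;
        return t;
    }

    public OriginUser convertTo(TopicUser target) {
        OriginUser o = new OriginUser();
        o.id = target.userId;
        o.name = target.userName;
        return o;
    }
}
```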

From B to B_DB

So far we have covered A_DB -> A -> B; next is the B -> B_DB step. As the earlier flowchart shows, the component uses a queue, implemented here on top of Disruptor. Once the A -> B logic in the Convertor completes, the list of B objects is published to Disruptor's RingBuffer for consumption. The consumer logic is as follows:

public void onEvent(RowObjectEvent rowObjectEvent, long sequence, boolean endOfBatch) {
    // The batch of target objects published by the producer side
    List<T> targetResult = (List) rowObjectEvent.getRowObject();
    // Write the T objects to the target library
    Connection tc = null;
    PreparedStatement pstm = null;
    try {
        // Obtain a connection, create the PreparedStatement, and execute
        tc = this.targetDataSource.getConnection();
        SqlTemplate<T> sqlTemplate = new SqlTemplate<>(this.targetDataSource);
        sqlTemplate.setObj(targetResult.get(0));
        String sql = sqlTemplate.createBaseSql();
        pstm = tc.prepareStatement(sql);
        for (T item : targetResult) {
            Object[] objects = sqlTemplate.createInsertSql(item);
            for (int i = 1; i <= objects.length; i++) {
                pstm.setObject(i, objects[i - 1]);
            }
            // Queue this row for batch execution
            pstm.addBatch();
        }
        // Execute all queued rows as a single batch
        pstm.executeBatch();
    } catch (Exception e) {
        // ignore some code...
    }

}
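SqlTemplate#createBaseSql is not shown in the article; a minimal builder that produces the parameterized INSERT from a table name and a column list (both recoverable from the @Table/@Colum annotations) might look like this:

```java
import java.util.List;

// Hypothetical sketch of the INSERT-statement generation behind
// SqlTemplate#createBaseSql: one "?" placeholder per column, so values can
// be bound with PreparedStatement#setObject as in the consumer above.
public class InsertSqlBuilder {
    public static String buildInsert(String table, List<String> columns) {
        StringBuilder sql = new StringBuilder("INSERT INTO ")
                .append(table)
                .append(" (")
                .append(String.join(", ", columns))
                .append(") VALUES (");
        for (int i = 0; i < columns.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(")");
        return sql.toString();
    }
}
```

Using placeholders rather than concatenated values keeps the statement reusable across the whole batch and avoids SQL injection.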

The fragments above explain the overall data synchronization flow and the core logic of the component implemented in this article.

Summary

The code snippets in this article are fairly fragmented, and some of the supporting logic is not shown. The main purpose is to explain how to implement synchronization on top of the JDBC API (in fact, the approach is not limited to JDBC or to relational databases). The implementation of each step is presented around the DB -> A -> B -> DB idea, and interested readers can try to implement a data synchronization component themselves. Questions and feedback are welcome.

Origin juejin.im/post/7245922875182055484