The technological evolution of Dewu's self-built DTS platform

0 Preface

DTS is short for Data Transfer Platform.

As user traffic of the Dewu App grows, the databases chosen by the business are becoming more and more diverse, and the demand for data synchronization between heterogeneous data sources keeps increasing. To control costs and better support business development, we decided to build our own DTS platform. This article shares the experience gained while upgrading the DTS platform, from the perspectives of technology selection, capability support, and evolution, and offers some points of reference.

1 Technology selection

The main goal of DTS is to support data interaction between different types of data sources, including relational databases (RDBMS), NoSQL databases, OLAP systems, etc. It also integrates modules such as database configuration management, data subscription, data synchronization, data migration, DRC active-active data synchronization support, data inspection, monitoring and alerting, and unified permissions, to build a secure, scalable, and highly available data architecture platform.

1.1 Capability comparison

1.2 DTS 1.0 - using canal/otter/datax as the execution engine

1.3 Why switch to Flink?

To support multiple read-side and write-side data sources, a unified data processing framework is needed to reduce duplicated components and improve development efficiency. Meanwhile, the maintenance difficulty and complexity grow linearly with the number of data source types and components, and all of the existing components have to be maintained within a single project.

Components such as Canal and Otter have low community activity and have not been maintained or updated for a long time, so a newer, actively maintained framework needs to be selected. In addition, the existing components cannot effectively support full + incremental integrated jobs.

Therefore, we need a unified data processing framework that can support multiple read-side and write-side data sources as well as full + incremental integration, which reduces the difficulty and complexity of component maintenance and improves development efficiency.

With DTS 2.0, we hope to evolve canal/otter/datax into a task execution framework plus a management platform, which can speed up subsequent support for a large number of data sources.

1.4 DTS 2.0 uses Flink as the execution engine

Existing development process:

  • A unified task execution framework that integrates Flink and introduces connectors, assembling specific DTS tasks according to configuration
  • Maintain and develop new connectors

When we need to support a new data source, we first implement the data-source-specific plug-in in the connector, and then introduce the required components into the execution framework, which already contains a large number of reusable functions. In this way both connectors and functional components are reused.

2 DTS Existing Capabilities

3 What did we do?

3.1 DTS Connectors framework - faster data source support

The full/incremental task synchronization framework is implemented on top of Flink CDC. The basic architecture is as follows.

In each Connector we implement the SourceFunction and/or SinkFunction interfaces provided by Flink, which are responsible for reading data from the source side and writing data to the target side respectively. A Connector can therefore act as an upstream source and a downstream sink at the same time.

The startup process of the task:

a. The Main function of the task is shown below. According to the JSON configuration file below, it loads the corresponding Connector's SourceFactory or SinkFactory and constructs the corresponding DataStream.

DataStream is the data stream operation class provided by Flink.
public class Main {
    public static void main(String[] args) throws Exception {
        // Parse arguments
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String[] parsedArgs = parseArgs(parameterTool);
        Options options = new OptionParser(parsedArgs).getOptions();
        options.setJobName(options.getJobName());
        // Build the execution environment and run the job
        StreamExecutionEnvironment environment =
                EnvFactory.createStreamExecutionEnvironment(options);
        exeJob(environment, options);
    }
}

Task Json configuration:

{  "job":{    "content":{      "reader":{        "name":"binlogreader",        "parameter":{          "accessKey":"",          "binlogOssApiUrl":"",          "delayBetweenRestartAttempts":2000,          "fetchSize":1,          "instanceId":"",          "rdsPlatform":"",          "restartAttempts":5,          "secretKey":"",          "serverTimezone":"",          "splitSize":1024,          "startupMode":"LATEST_OFFSET"        }      },      "writer":{        "name":"jdbcwriter",        "parameter":{          "batchSize":10000,          "concurrentWrite":true,          ],          "dryRun":false,          "dumpCommitData":false,          "errorRecord":0,          "flushIntervalMills":30000,          "poolSize":10,          "retries":3,          "smallBatchSize":200        }      }    },
  }}

b. We provide two abstract factory classes, SourceFactory and SinkFactory. The createSource and createSink methods are what sub-factories need to implement, and different data sources have different implementations.

public abstract class SourceFactory<T> {
    public abstract DataStream<T> createSource();
}

public abstract class SinkFactory<T> {
    public abstract void createSink(DataStream<T> rowData) throws Exception;
}

c. Next, we only need to implement the corresponding sub-factory method

public class BinlogSourceFactory extends AbstractJdbcSourceFactory {
    @Override
    public DataStream<TableRowData> createSource() {
        List<String> tables = this.binlogSourceConf.getConnection().getTable();
        Set<String> databaseList = new HashSet<>(2);
        // Build the DataStream with the corresponding Connector
    }
}

d. General capability functions such as RateLimiterMapFunction and BinlogPositionFunction implement cross-cutting task capabilities like rate limiting and task position storage.

public class RateLimiterMapFunction<T> extends RichMapFunction<T, T> {

    private transient FlinkConnectorRateLimiter rateLimiter;
    // Whether rate limiting is enabled for this task; both fields are
    // initialized in open() from the task configuration.
    private boolean rateLimiterEnabled;

    @Override
    public T map(T value) throws Exception {
        if (rateLimiterEnabled) {
            rateLimiter.acquire(1);
        }
        return value;
    }
}
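The position-storage side is not shown above. As a rough illustration only (not the actual DTS implementation; TableRowData#getBinlogPosition() is an assumed accessor), a binlog-position function could persist the most recent position through Flink's checkpointed state:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class BinlogPositionFunction extends RichMapFunction<TableRowData, TableRowData>
        implements CheckpointedFunction {

    private transient ListState<String> positionState;
    private volatile String latestPosition;

    @Override
    public TableRowData map(TableRowData row) {
        latestPosition = row.getBinlogPosition();   // remember the newest position seen
        return row;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        positionState.clear();
        if (latestPosition != null) {
            positionState.add(latestPosition);      // persisted with every checkpoint
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        positionState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("binlog-position", String.class));
        for (String pos : positionState.get()) {
            latestPosition = pos;                   // restore on recovery
        }
    }
}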

Once the functions required by the task have been created, the task actually starts running.

Benefits:

3.2 RDS log acquisition

DTS provides data synchronization for the business through incremental and full synchronization capabilities, but some abnormal situations may be encountered while incremental subscription/synchronization tasks are running. The following three situations require special handling:

  • Binlog availability

The local binlog of a cloud vendor's database instance is only kept for 8 hours, and expired segments are backed up to OSS. During MySQL business peaks or DDL changes, a large number of binlogs are generated; if a DTS task tries to fetch data that has already expired locally, the task is interrupted. DTS therefore supports fetching from, and switching between, the local binlog and the OSS backup binlog to ensure log availability (a simplified sketch of this selection logic follows after this list).

  • Database instance master-slave switchover

RDS frequently switches between primary and standby nodes, and no data may be lost during the switchover. Since the binlog files of the two instances before and after the switch are generally not consistent, and the task records its position in BinlogPosition mode, the task must automatically perform a binlog alignment operation after the switch to guarantee data integrity: it simply rewinds the position lookup timestamp on the new instance by 1-2 minutes.

  • Read instance subscription support

Too many binlog dump connections from DTS tasks put pressure on the primary database and affect DDL changes, so subscribing from read replicas needs to be supported. The cloud vendor's read replicas do not provide OSS backups, so when the replica's logs have expired the task needs to switch to the primary for reading.
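As referenced in the binlog availability item above, here is a simplified, hypothetical sketch of the local/OSS selection logic; the class and constant names are illustrative and not from the DTS codebase:

import java.time.Duration;

/** Illustrative only: decides whether an incremental task should read from the
 *  instance's local binlog or from the OSS backup, based on the 8-hour local
 *  retention described above. */
public class BinlogSourceSelector {

    public enum BinlogSource { LOCAL, OSS_BACKUP }

    private static final Duration LOCAL_RETENTION = Duration.ofHours(8);

    public BinlogSource select(long requestedStartTimestampMs) {
        long oldestLocalTimestampMs = System.currentTimeMillis() - LOCAL_RETENTION.toMillis();
        // If the requested position is older than what the instance still keeps locally,
        // fall back to the OSS backup and switch back to the local binlog once caught up.
        return requestedStartTimestampMs < oldestLocalTimestampMs
                ? BinlogSource.OSS_BACKUP
                : BinlogSource.LOCAL;
    }
}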

3.3 Full incremental integration function

Full + incremental integration means synchronizing the existing (stock) data first, and then starting incremental synchronization once the stock phase is finished; it also covers fetching OSS backup logs during the incremental phase. There are still some problems in the stock phase that need further improvement and optimization.
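For reference, a minimal sketch of the full + incremental idea using the open-source Flink CDC MySqlSource (assuming the flink-connector-mysql-cdc dependency; connection parameters are placeholders, and this is not the DTS production code). StartupOptions.initial() snapshots the stock data first, then switches to reading the binlog for incremental changes:

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FullPlusIncrementalExample {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("mysql-host")
                .port(3306)
                .databaseList("demo_db")
                .tableList("demo_db.demo_table")
                .username("user")
                .password("password")
                // initial(): take a full snapshot of existing data first,
                // then switch to reading the binlog for incremental changes.
                .startupOptions(StartupOptions.initial())
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mysql-cdc-source")
           .print();
        env.execute("full-plus-incremental-demo");
    }
}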

3.4 Data source access - StarRocks, PostgreSQL, etc.

Synchronization from MySQL to StarRocks and PostgreSQL is supported. On top of the task execution framework, we only need to develop a starrocks-connector or postgres-connector to support the corresponding data source. Other capabilities, such as multi-table synchronization and sharded database/table scenarios, are reused directly.
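As a hypothetical sketch of how such a connector plugs into the SinkFactory abstraction shown earlier (the StarRocks write logic is stubbed out; the class is illustrative, not the real connector):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class StarrocksSinkFactory extends SinkFactory<TableRowData> {

    @Override
    public void createSink(DataStream<TableRowData> rowData) throws Exception {
        // Attach a StarRocks-specific sink; everything upstream (source,
        // transforms, rate limiting, position storage) is shared framework code.
        rowData.addSink(new RichSinkFunction<TableRowData>() {
            @Override
            public void invoke(TableRowData row, Context context) {
                // A real implementation would batch rows and write them to
                // StarRocks (e.g. via Stream Load); omitted here.
            }
        }).name("starrocks-sink");
    }
}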

3.5 JDBC Write Transformation

Script expansion and dynamic table name routing:

Data merging and multi-threaded writing:

3.6 Monitoring and alerting

DTS tasks need to collect Flink task metrics, mainly including task delay, the write rate of each operator stage, and operator back pressure and utilization. Since the task delay has to be reported to the alerting service, we introduced Redis to cache each task's delay and then report it to the alerting service, which sends Feishu messages and phone alerts.
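A rough illustration of the delay-reporting idea, assuming a Jedis client and an assumed TableRowData#getEventTimestamp() accessor; the Redis address and key format are placeholders, not the production code:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import redis.clients.jedis.Jedis;

public class DelayReportFunction extends RichMapFunction<TableRowData, TableRowData> {

    private final String jobName;
    private transient Jedis jedis;

    public DelayReportFunction(String jobName) {
        this.jobName = jobName;
    }

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("redis-host", 6379);
    }

    @Override
    public TableRowData map(TableRowData row) {
        // Delay = processing time minus the binlog event time carried on the record.
        long delayMs = System.currentTimeMillis() - row.getEventTimestamp();
        jedis.set("dts:delay:" + jobName, String.valueOf(delayMs));
        return row;
    }

    @Override
    public void close() {
        if (jedis != null) {
            jedis.close();
        }
    }
}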

4 Best Practices

4.1 Problem with 0000-00-00 00:00:00 timestamp

MySQL allows the timestamp value 0000-00-00 00:00:00, which is usually converted to null in Flink tasks and then fails to be written to downstream data sources. It therefore needs to be specially marked and converted differently for different data sources to ensure the rows can be written correctly.
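An illustrative sketch of such per-sink handling (the fallback values and sink type names are examples, not the actual DTS conversion rules):

/** Normalizes MySQL zero-timestamps per target data source, since a plain null
 *  may be rejected downstream. Illustration only. */
public final class ZeroTimestampConverter {

    private static final String ZERO_TS = "0000-00-00 00:00:00";

    public static String convert(String value, String sinkType) {
        if (!ZERO_TS.equals(value)) {
            return value;
        }
        switch (sinkType) {
            case "starrocks":
                return "1970-01-01 00:00:00"; // sinks that reject null timestamps
            case "mysql":
            default:
                return null;                  // sinks that accept null
        }
    }
}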

4.2 Uniqueness of the Flink CDC task serverId

The Flink CDC source pretends to be a MySQL slave node. To ensure data correctness, each slave must have a unique serverId. Therefore, in Flink CDC tasks we assign a unique serverId range to each task (a range rather than a single value, in order to support multiple degrees of parallelism).
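A small sketch of one possible allocation scheme (the base offset and task sequence number are illustrative; the resulting range would be passed to the Flink CDC source builder's serverId option):

/** Each task gets its own contiguous range so every "fake slave" id is unique. */
public final class ServerIdAllocator {

    private static final int BASE_SERVER_ID = 5400;   // assumed starting offset

    /** @param taskSeq     unique sequence number assigned to the DTS task
     *  @param parallelism source parallelism; the range holds one id per subtask */
    public static String allocate(int taskSeq, int parallelism) {
        int start = BASE_SERVER_ID + taskSeq * parallelism;
        int end = start + parallelism - 1;
        // e.g. taskSeq=7, parallelism=4 -> "5428-5431"
        return start + "-" + end;
    }
}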

4.3 Flink task data serialization bottleneck

When a Flink task uses the DataStream API and transmits relatively complex data structures, the serialization cost between operators is high. There are two directions for optimization: one is to use a more efficient data structure for transmission, and the other is to enable Flink object reuse and minimize data transfer between different degrees of parallelism.
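The object-reuse switch is a standard Flink setting; a minimal fragment (safe only when downstream operators neither mutate nor hold on to incoming records):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Reuse objects between chained operators to cut (de)serialization and copying costs.
env.getConfig().enableObjectReuse();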

5 Future Evolution

As a data synchronization platform, the main job of DTS is to provide data source synchronization that is as efficient as possible, to support ever-changing business scenarios.

5.1 ETL task management based on Flink SQL

Besides the existing DataStream API, streaming data processing can also be expressed in SQL. As a general-purpose language, SQL greatly lowers the learning cost for data-oriented business developers. ETL-style streaming processing done with Flink SQL can also handle the processing logic of some complex business scenarios, and the business logic can be turned into a DAG stream-processing graph that is easy to assemble even by drag and drop. The Flink SQL direction can therefore complement the existing Flink DataStream API.
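A minimal illustration of this direction, expressing a streaming ETL job as SQL through the Table API (connector options and table definitions are placeholders, assuming the mysql-cdc and print SQL connectors are on the classpath; not an existing DTS job):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class SqlEtlExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Source table backed by MySQL binlog via the mysql-cdc SQL connector.
        tableEnv.executeSql(
            "CREATE TABLE orders_src (" +
            "  id BIGINT, amount DECIMAL(10, 2), status STRING," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc', 'hostname' = 'mysql-host', 'port' = '3306'," +
            "  'username' = 'user', 'password' = 'password'," +
            "  'database-name' = 'demo_db', 'table-name' = 'orders')");

        // Demo sink; a real job would write to StarRocks, a JDBC database, etc.
        tableEnv.executeSql(
            "CREATE TABLE orders_report (status STRING, total DECIMAL(20, 2)) " +
            "WITH ('connector' = 'print')");

        // Streaming ETL: aggregate and continuously emit the result.
        tableEnv.executeSql(
            "INSERT INTO orders_report " +
            "SELECT status, SUM(amount) AS total FROM orders_src GROUP BY status");
    }
}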

Application scenarios: ETL's powerful streaming data transformation and processing capabilities greatly improve data integration efficiency; it can also be used to build real-time reporting systems to improve analysis efficiency, and applied to some real-time dashboard scenarios.

5.2 Unified technology stack

Migrating all existing DTS capabilities to the Flink platform and maintaining a unified technology stack can greatly reduce maintenance costs. The remaining legacy capabilities, such as bidirectional synchronization and data comparison, still need to be transformed and migrated, in line with the overall trend of technology convergence.

6 Summary

This article has mainly covered the following: the benefits Flink brings compared with the previous technology stack, the iteration direction and the architectural and functional changes after switching to Flink, how new problems were solved, and some future directions. We hope it has been useful.

*Text/Fengzi

This article is an original article of Dewu Technology. For more exciting articles, please see: Dewu Technology Official Website

It is strictly forbidden to reprint without the permission of Dewu Technology, otherwise legal responsibility will be investigated according to law!
