Introduction to the Open Source Offline Synchronization Tool DataX3.0

1. Overview of DataX3.0

DataX is an offline synchronization tool for heterogeneous data sources, designed to provide stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.

Design concept

To solve the synchronization problem among heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX itself acting as the intermediate transmission carrier that connects all data sources. When a new data source needs to be supported, it only has to be connected to DataX once to synchronize seamlessly with every existing data source.
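A rough count shows why the star model scales better: fully connecting n data sources pairwise requires on the order of n × (n − 1) dedicated synchronization links, whereas the star model only needs one Reader and one Writer per source, i.e. 2n plugins. With 10 data sources, that is up to 90 point-to-point links versus 20 plugins.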

Current usage

DataX is widely used within Alibaba Group, where it carries all of the group's offline big data synchronization business and has been running stably for six years. At present it runs roughly 80,000 synchronization tasks per day, transferring more than 300 TB of data daily.

The DataX 1.0 version was open-sourced earlier. This time, Alibaba has open-sourced the new DataX 3.0 version, with more powerful features and a better user experience. GitHub home page: https://github.com/alibaba/DataX .

2. DataX3.0 framework design

As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins, which are incorporated into the overall synchronization framework.

Reader: the data collection module, responsible for collecting data from the data source and sending it to the Framework.

Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination.

Framework: connects the Reader and the Writer, acts as the data transmission channel between them, and handles core technical issues such as buffering, flow control, concurrency, and data conversion (see the sketch below).
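To make the division of responsibilities concrete, here is a minimal sketch of what such a plugin contract could look like. All names below (Record, RecordSender, Reader, Writer, and their methods) are illustrative assumptions and are not the actual DataX plugin SPI; the point is only the separation of concerns: the Reader pulls from the source, the Writer pushes to the destination, and the Framework owns the channel in between.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and signatures are hypothetical, not the real DataX SPI.

// A single row of data, as a list of column values.
interface Record { List<Object> getColumns(); }

// Handed to the Reader by the Framework: the Reader pushes records into it.
interface RecordSender { void send(Record record); }

// Handed to the Writer by the Framework: the Writer drains records from it
// (returns null when there is no more data).
interface RecordReceiver { Record getFromReader(); }

// A Reader plugin: configured from job parameters, then pulls data from the source.
interface Reader {
    void init(Map<String, String> parameters); // e.g. jdbc url, table, columns
    void startRead(RecordSender sender);        // read from source, send to Framework
}

// A Writer plugin: configured from job parameters, then persists data to the target.
interface Writer {
    void init(Map<String, String> parameters);  // e.g. target path, write mode
    void startWrite(RecordReceiver receiver);   // take from Framework, write to target
}
```

With a contract of this shape, adding support for a new data source means implementing one Reader and/or one Writer against it; the Framework and all existing plugins remain unchanged.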

3. DataX3.0 plug-in system

After several years of accumulation, DataX now has a fairly comprehensive plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big data computing systems have all been connected. The full list of currently supported data sources is maintained in the project documentation linked below.

The DataX Framework provides a simple interface for interacting with plugins and a lightweight plugin access mechanism: implementing a new plugin against this interface is all that is needed for a new data source to exchange data seamlessly with every existing one. For details, see the DataX Data Source Guide.

4. Core Architecture of DataX3.0

The open source version of DataX 3.0 runs synchronization jobs in single-machine, multi-threaded mode. This section briefly describes how the DataX modules relate to each other by following the life cycle of a DataX job through the overall architecture.

Core module introduction:

A single data synchronization run in DataX is called a Job. After DataX receives a Job, it starts a process to carry out the entire synchronization. The DataX Job module is the central management node of a single job; it is responsible for data cleaning, subtask splitting (converting a single job into multiple subtasks), TaskGroup management, and related functions.

After the DataX Job starts, it splits the Job into multiple small Tasks (subtasks) according to a source-specific splitting strategy so that they can be executed concurrently. A Task is the smallest unit of a DataX job; each Task is responsible for synchronizing one portion of the data.
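For example, a relational database reader will typically split a large table into Tasks by ranges of a split key (such as the primary key), so that each Task reads a disjoint slice of rows, while a file-based reader may split by file or by block instead.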

Once the Tasks have been split out, the DataX Job calls the Scheduler module, which regroups the Tasks into TaskGroups according to the configured concurrency (the number of channels). Each TaskGroup runs all of its assigned Tasks with a fixed degree of concurrency; by default, a single TaskGroup runs 5 Tasks concurrently.
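For example, suppose a job is split into 100 Tasks and the user configures a total concurrency of 20 channels. With 5 channels per TaskGroup, the Scheduler creates 20 / 5 = 4 TaskGroups, and each TaskGroup runs its share of the 100 Tasks (25 on average) with at most 5 of them in flight at any time.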

Each Task is started by its TaskGroup. Once started, a Task spins up a Reader -> Channel -> Writer thread pipeline to carry out the synchronization work.
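The following is a minimal, self-contained sketch of that per-Task pipeline, assuming a bounded in-memory queue standing in for the DataX Channel; the record type, the queue size, and the end-of-data marker are all illustrative, not taken from DataX itself.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TaskPipelineSketch {
    // A sentinel value used by the reader thread to signal that no more data is coming.
    private static final String EOF = "__END_OF_DATA__";

    public static void main(String[] args) throws InterruptedException {
        // The bounded queue plays the role of the Channel: it buffers records and
        // applies back-pressure when the writer is slower than the reader.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

        // Reader thread: pulls "records" from the source and pushes them into the channel.
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    channel.put("record-" + i);
                }
                channel.put(EOF); // end-of-data marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "task-reader");

        // Writer thread: drains the channel and "writes" each record to the destination.
        Thread writer = new Thread(() -> {
            try {
                String record;
                while (!(record = channel.take()).equals(EOF)) {
                    System.out.println("write: " + record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "task-writer");

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```

In the real framework, the Channel is also where the flow control mentioned above under the Framework's responsibilities is applied.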

Source: Yunqi Community
