DataX heterogeneous data source synchronization product - technology sharing (1)

DataX is an offline data synchronization tool for heterogeneous data sources, open sourced by Alibaba. It is dedicated to providing stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more.

DataX design concept

[Figure: DataX design concept]

DataX itself, as a data synchronization framework, abstracts the synchronization of different data sources into a Reader plug-in that reads data from the source data source and a Writer plug-in that writes data to the destination. In theory, the DataX framework can therefore support data synchronization between any types of data sources. At the same time, the DataX plug-in system acts as an ecosystem: every time a new data source is added, it can immediately interoperate with all existing data sources.

DataX framework design

DataX itself, as an offline data synchronization framework, is built on a Framework + plugin architecture: data source reading and writing are abstracted into Reader/Writer plug-ins, which are incorporated into the overall synchronization framework.
[Figure: DataX framework design (Reader / Framework / Writer)]

  1. Reader: the data collection module, responsible for reading data from the source data source and sending it to the Framework.
  2. Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination.
  3. Framework: connects the Reader and the Writer, serves as the data transmission channel between them, and handles core concerns such as buffering, flow control, concurrency, and data conversion (a simplified sketch of this interaction follows the list).
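To make the plug-in abstraction concrete, below is a minimal Java sketch of the Reader -> Framework (channel) -> Writer interaction, under the assumption that the channel can be modeled as a bounded blocking queue. It is illustrative only and is not the real DataX plug-in SPI; all class and method names here are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model of the Reader -> Framework (channel) -> Writer pipeline; names are hypothetical.
public class MiniDataX {

    // Sentinel record that tells the writer the reader has finished.
    private static final String END = "__END__";

    interface Reader {
        // Collect data from the source and push each record into the framework channel.
        void startRead(BlockingQueue<String> channel) throws InterruptedException;
    }

    interface Writer {
        // Continuously take records from the framework channel and write them to the destination.
        void startWrite(BlockingQueue<String> channel) throws InterruptedException;
    }

    public static void main(String[] args) throws InterruptedException {
        // The framework owns the bounded channel, which provides buffering and back-pressure.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1024);

        // A toy "source": three in-memory records, followed by the end marker.
        Reader reader = ch -> {
            for (String record : new String[]{"r1", "r2", "r3"}) ch.put(record);
            ch.put(END);
        };
        // A toy "destination": print each record until the end marker arrives.
        Writer writer = ch -> {
            for (String record = ch.take(); !END.equals(record); record = ch.take()) {
                System.out.println("wrote " + record);
            }
        };

        Thread readerThread = new Thread(() -> {
            try { reader.startRead(channel); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread writerThread = new Thread(() -> {
            try { writer.startWrite(channel); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        readerThread.start();
        writerThread.start();
        readerThread.join();
        writerThread.join();
    }
}
```

In the real framework the channel also enforces flow control and collects metrics; the bounded queue above only captures the buffering and decoupling aspect.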

DataX core architecture

[Figure: DataX core architecture]

Core module introduction:

  • A single data synchronization job in DataX is called a Job. After DataX receives a Job, it starts a process to carry out the entire synchronization. The DataX Job module is the central management node of a single job, responsible for data cleaning, sub-task splitting (converting a single job into multiple sub-tasks), TaskGroup management, and so on.
  • After the DataX Job starts, it is split into multiple small Tasks (sub-tasks) according to the source-side splitting strategy so that they can run concurrently. A Task is the smallest unit of a DataX job; each Task is responsible for synchronizing a portion of the data.
  • After the Tasks are split out, the DataX Job calls the Scheduler module, which regroups the Tasks into TaskGroups according to the configured amount of concurrency. Each TaskGroup runs all of the Tasks assigned to it with a certain degree of concurrency; by default a single TaskGroup runs 5 Tasks concurrently (a simplified sketch of this grouping arithmetic follows the list).
  • Each Task is started by its TaskGroup. Once started, a Task spins up Reader->Channel->Writer threads to carry out the synchronization.
  • After the DataX job starts, the Job monitors the TaskGroups and waits for all of them to finish; the Job exits successfully only when every TaskGroup has completed. Otherwise, the process exits abnormally with a non-zero exit code.
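As a rough illustration of the grouping described above, the sketch below computes the number of TaskGroups from the job's channel count and assigns task ids to groups round-robin. It is a simplified reconstruction under the assumption of 5 channels per TaskGroup (the default mentioned above); it is not DataX's actual scheduler code, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of how tasks might be grouped into TaskGroups; not DataX source code.
public class TaskGroupingSketch {

    static final int CHANNELS_PER_TASK_GROUP = 5;   // assumed default concurrency of one TaskGroup

    // taskGroupCount = ceil(channelNumber / channelsPerTaskGroup)
    static int taskGroupCount(int channelNumber) {
        return (int) Math.ceil(channelNumber / (double) CHANNELS_PER_TASK_GROUP);
    }

    // Round-robin assignment of task ids to task groups.
    static List<List<Integer>> assign(int taskCount, int groupCount) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < groupCount; g++) groups.add(new ArrayList<>());
        for (int taskId = 0; taskId < taskCount; taskId++) {
            groups.get(taskId % groupCount).add(taskId);
        }
        return groups;
    }

    public static void main(String[] args) {
        int channels = 12;                          // configured concurrency for the whole job
        int groupCount = taskGroupCount(channels);  // ceil(12 / 5) = 3 task groups
        System.out.println(groupCount + " task groups: " + assign(12, groupCount));
    }
}
```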

DataX execution process

1. Parse the configuration, which consists of three parts: job.json, core.json, and plugin.json.
2. Set the jobId into the configuration.
3. Start the Engine; execution enters through Engine.start().
4. Set RUNTIME_MODE into the configuration.
5. Start via the start() method of JobContainer.
6. Execute the job's preHandle(), init(), prepare(), split(), schedule(), post(), and postHandle() methods in sequence.
7. The init() method initializes the reader and writer plug-ins according to the configuration. This involves hot-loading the plug-in jar packages, calling each plug-in's init() method, and setting the reader's and writer's configuration information.
8. The prepare() method prepares the reader and writer plug-ins by calling each plug-in's prepare() method. Each plug-in has its own jarLoader, implemented by extending URLClassLoader.
9. The split() method adjusts the number of channels via adjustChannelNumber() and then performs the finest-grained splitting of the reader and writer. Note that the writer's split must follow the reader's split: only when the two split counts are equal can the 1:1 channel model be satisfied.
10. The channel count is derived mainly from the byte and record speed limits (when the number of channels has not been set explicitly); computing the channel count is the first step of split() (a simplified sketch of this calculation follows the list).
11. In split(), the reader plug-in splits according to the channel value, although some reader plug-ins may not honor it; the writer plug-in then splits 1:1 based on the reader's result.
12. Inside split(), mergeReaderAndWriterTaskConfigs() merges the reader, writer, and transformer configurations into task configurations and rewrites the job.content configuration (see the pairing sketch after the list).
13. The schedule() method allocates and generates taskGroup objects from the task configurations produced by split(): the total number of tasks is divided by the number of tasks a single taskGroup supports to obtain the number of taskGroups.
14. schedule() delegates internally to the schedule() of AbstractScheduler, which then calls startAllTaskGroup() to create all TaskGroupContainers and organize the related tasks; a TaskGroupContainerRunner is responsible for running each TaskGroupContainer and executing the tasks assigned to it.
15. taskGroupContainerExecutorService starts a fixed thread pool to execute the TaskGroupContainerRunner objects. The run() method of TaskGroupContainerRunner calls taskGroupContainer.start(), which creates a TaskExecutor for each channel and starts the task via taskExecutor.doStart().
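A simplified sketch of the channel-count idea from step 10: when a byte/s or record/s limit is configured for the job, the channel count is derived from the stricter of the two limits; otherwise the explicitly configured channel number is used. This is a reconstruction of the intent behind adjustChannelNumber(), not the actual DataX source, and the parameter names are illustrative.

```java
// Simplified reconstruction of the channel-count calculation idea; not DataX source code.
public class ChannelNumberSketch {

    /**
     * @param jobByteSpeedLimit    overall byte/s limit for the job (<= 0 means "not set")
     * @param channelByteSpeed     assumed byte/s throughput of a single channel (must be > 0)
     * @param jobRecordSpeedLimit  overall record/s limit for the job (<= 0 means "not set")
     * @param channelRecordSpeed   assumed record/s throughput of a single channel (must be > 0)
     * @param configuredChannels   explicitly configured channel number (<= 0 means "not set")
     */
    static int channelNumber(long jobByteSpeedLimit, long channelByteSpeed,
                             long jobRecordSpeedLimit, long channelRecordSpeed,
                             int configuredChannels) {
        int byBytes = jobByteSpeedLimit > 0
                ? (int) (jobByteSpeedLimit / channelByteSpeed) : Integer.MAX_VALUE;
        int byRecords = jobRecordSpeedLimit > 0
                ? (int) (jobRecordSpeedLimit / channelRecordSpeed) : Integer.MAX_VALUE;

        // Speed limits win when present: take the stricter (smaller) of the two.
        int bySpeed = Math.min(byBytes, byRecords);
        if (bySpeed != Integer.MAX_VALUE) {
            return Math.max(bySpeed, 1);
        }
        // Otherwise fall back to the explicitly configured channel number.
        if (configuredChannels > 0) {
            return configuredChannels;
        }
        throw new IllegalArgumentException("either a speed limit or a channel number must be configured");
    }

    public static void main(String[] args) {
        // e.g. a 10 MB/s job limit with an assumed 1 MB/s per channel -> 10 channels
        System.out.println(channelNumber(10L * 1024 * 1024, 1024 * 1024, -1, -1, -1));
    }
}
```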
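The 1:1 pairing in step 12 can be pictured as a straightforward zip of the reader and writer split results, as in the hypothetical sketch below. The real mergeReaderAndWriterTaskConfigs() operates on DataX configuration objects rather than plain strings and also attaches the transformer configuration; the TaskConfig type here is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of pairing reader and writer split results 1:1 into task configs.
public class MergeTaskConfigsSketch {

    record TaskConfig(int taskId, String readerConf, String writerConf) { }

    static List<TaskConfig> merge(List<String> readerConfs, List<String> writerConfs) {
        if (readerConfs.size() != writerConfs.size()) {
            // The 1:1 channel model requires equal split counts on both sides.
            throw new IllegalStateException("reader/writer split counts differ");
        }
        List<TaskConfig> tasks = new ArrayList<>();
        for (int i = 0; i < readerConfs.size(); i++) {
            tasks.add(new TaskConfig(i, readerConfs.get(i), writerConfs.get(i)));
        }
        return tasks;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of("read part 0", "read part 1"),
                                 List.of("write part 0", "write part 1")));
    }
}
```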

Source: blog.csdn.net/m0_49447718/article/details/132064480