Get to know DataX and get started



1. Overview of DataX

1.1 What is DataX

DataX is an open-source offline synchronization tool for heterogeneous data sources from Alibaba, dedicated to stable and efficient data synchronization between various heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more.

[Figure: ./1.jpg]

1.2 Design of DataX

In order to solve the synchronization problem of heterogeneous data sources, Alibaba designed DataX as follows:

DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link: DataX sits in the middle as the transmission hub that connects every data source. To support a new data source, you only need to connect that one source to DataX, and it can then synchronize seamlessly with all the existing sources.
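As a back-of-the-envelope illustration of why the star topology scales better (this arithmetic is ours, not part of DataX itself): with n heterogeneous data sources, pairwise synchronization needs a dedicated connector for every ordered pair of sources, while the star topology only needs one reader plugin and one writer plugin per source.

```python
def mesh_links(n: int) -> int:
    # Point-to-point: every source needs a dedicated link to every other source.
    return n * (n - 1)

def star_plugins(n: int) -> int:
    # Star topology: one reader plugin plus one writer plugin per source.
    return 2 * n

print(mesh_links(10), star_plugins(10))  # 90 vs. 20
```

With 10 sources the mesh already needs 90 connectors versus 20 plugins, and the gap widens quadratically as sources are added.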

As shown below:

[Figure: ./2.jpg]

1.3 Supported Data Sources

DataX currently has a fairly comprehensive plugin system: mainstream RDBMS databases, NoSQL stores, and big-data computing systems are all supported.

As shown below:

[Figure: ./3.jpg]

1.4 Framework Design

The framework of DataX is shown in the figure below:

[Figure: ./4.jpg]

  • Reader: the data collection module; it reads data from the source and hands the data to the Framework.
  • Writer: the data writing module; it continuously fetches data from the Framework and writes the data to the destination.
  • Framework: connects the Reader and the Writer, serving as the data transmission channel between the two, and handles core technical concerns such as buffering, flow control, concurrency, and data conversion.
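To make the Reader/Framework/Writer wiring concrete, here is a minimal DataX job configuration using the built-in streamreader and streamwriter plugins (the field values below are illustrative; check the plugin documentation for the authoritative parameter names):

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 1 }
    },
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "column": [ { "type": "string", "value": "hello, DataX" } ],
            "sliceRecordCount": 10
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": { "print": true }
        }
      }
    ]
  }
}
```

Every DataX job has this shape: a `reader` block, a `writer` block, and job-level `setting` options such as the channel (concurrency) count; swapping data sources means swapping the plugin names and their parameters.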

1.5 Operating principle

See below:

[Figure: ./5.jpg]

Explanation of terms:

  • Job: the management node of a single job; responsible for data cleanup, splitting the job into subtasks, and monitoring and managing the TaskGroups.
  • Task: a slice of a Job and the smallest unit of a DataX job; each Task is responsible for synchronizing a portion of the data.
  • Schedule: organizes Tasks into TaskGroups; the concurrency of a single TaskGroup is fixed at 5.
  • TaskGroup: responsible for starting and running its Tasks.

For example, suppose a user submits a DataX job configured with 20 concurrent channels, aiming to synchronize MySQL data from 100 sharded tables to ODPS.

DataX's scheduling decisions are then:

  • The DataX Job is split into 100 Tasks, one per sharded table.

  • With 20 concurrent channels and 5 channels per TaskGroup, DataX allocates 20 / 5 = 4 TaskGroups in total.

  • The 4 TaskGroups divide the 100 Tasks evenly, so each TaskGroup runs 25 Tasks with a concurrency of 5.
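The arithmetic above can be sketched as follows (a simplified model of the split, not DataX's actual scheduler code):

```python
import math

def plan(task_count: int, concurrency: int, channels_per_group: int = 5):
    """Split tasks into TaskGroups as described in the example above."""
    # Number of TaskGroups = ceil(total concurrency / channels per group).
    group_count = math.ceil(concurrency / channels_per_group)
    # Distribute tasks as evenly as possible across the groups.
    base, extra = divmod(task_count, group_count)
    tasks_per_group = [base + (1 if i < extra else 0) for i in range(group_count)]
    return group_count, tasks_per_group

print(plan(100, 20))  # (4, [25, 25, 25, 25])
```

Running `plan(100, 20)` reproduces the example: 4 TaskGroups, each responsible for 25 Tasks at a concurrency of 5.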

1.6 Comparison of DataX and Sqoop

| Function | DataX | Sqoop |
| --- | --- | --- |
| Operating mode | Single-process, multi-threaded | MapReduce |
| MySQL read/write | Heavy load on a single machine; read/write granularity is easy to control | MR mode is heavyweight; handling write errors is troublesome |
| Hive read/write | Heavy load on a single machine | Very good |
| ORC file format | Supported | Not supported (support can be added) |
| Distributed execution | Not supported; can be worked around via a scheduling system | Supported |
| Flow control | Built in | Needs customization |
| Statistics | Some statistics provided; reports need customization | None; collecting statistics from distributed runs is inconvenient |
| Data validation | Validation built into the core module | None; collecting data from distributed runs is inconvenient |
| Monitoring | Needs customization | Needs customization |
| Community | Open-sourced only recently; community not very active | Long active; the core changes little |

2. Getting Started

2.1 Official address

Download link: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

Source address: https://github.com/alibaba/DataX

2.2 Prerequisites

  • Operating system: Linux

  • JDK: 1.8 or above, 1.8 is recommended

  • Python: Python 2.6.x is recommended

2.3 Installation

  • Upload the downloaded datax.tar.gz to the /opt/software directory of node01.

  • Unzip datax.tar.gz into the /opt/module directory:

[whybigdata@node01 software]$ tar -zxvf datax.tar.gz -C /opt/module/

  • Run the self-check job:

[whybigdata@node01 bin]$ cd /opt/module/datax/bin/
[whybigdata@node01 bin]$ python datax.py /opt/module/datax/job/job.json

If the output looks like the following, the self-check passed and the installation succeeded:

[Figure: ./6.jpg]

That's all for this article!

Origin blog.csdn.net/m0_52735414/article/details/128885213