Get to know DataX and get started
1. Overview of DataX
1.1 What is DataX
DataX is an open-source offline synchronization tool for heterogeneous data sources from Alibaba. It is dedicated to providing stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more.
1.2 Design of DataX
To solve the problem of synchronizing heterogeneous data sources, Alibaba designed DataX as follows: the complex mesh of point-to-point synchronization links is turned into a star-shaped data link, with DataX acting as the intermediate transmission carrier that connects the various data sources. When a new data source needs to be integrated, it only has to be connected to DataX to synchronize seamlessly with all existing data sources.
As shown below:
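The payoff of the star design can be seen with simple arithmetic. This is a sketch; the counts assume each source needs one direct link to every other source in the mesh case, versus one reader plugin and one writer plugin per source in the DataX case:

```shell
n=10                      # number of heterogeneous data sources (illustrative)
mesh=$(( n * (n - 1) ))   # direct pairwise sync links: 10 * 9 = 90
star=$(( 2 * n ))         # via DataX: one reader + one writer per source = 20
echo "mesh=$mesh star=$star"
```

As the number of data sources grows, the mesh link count grows quadratically while the star count grows only linearly, which is why adding a new source to DataX requires just one new plugin rather than a connector to every existing source.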
1.3 Supported Data Sources
DataX currently has a relatively comprehensive plug-in system, and mainstream RDBMS databases, NOSQL, and big data computing systems have all been connected.
As shown below:
1.4 Framework Design
The framework of DataX is shown in the figure below:
- Reader: the data collection module; it reads data from the source and sends it to the Framework.
- Writer: the data writing module; it continuously fetches data from the Framework and writes it to the destination.
- Framework: connects the Reader and the Writer, serving as the data transmission channel between the two and handling core technical concerns such as buffering, flow control, concurrency, and data conversion.
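This Reader/Framework/Writer split is mirrored directly in a DataX job configuration file. The sketch below follows the shape of the stock self-test job (`job/job.json`) that ships with DataX, which pairs the bundled `streamreader` and `streamwriter` plugins; treat the specific parameter values as illustrative:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 1 }
    },
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "column": [ { "type": "string", "value": "hello, DataX" } ],
            "sliceRecordCount": 10
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": { "print": true }
        }
      }
    ]
  }
}
```

Swapping in a different source or destination means replacing the `reader` or `writer` block with another plugin's name and parameters, while the `setting` section stays with the Framework.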
1.5 Operating principle
See below:
Explanation:
- Job: the management node of a single data synchronization job, responsible for data cleanup, sub-task splitting, and TaskGroup monitoring and management.
- Task: a slice split from a Job; it is the smallest unit of a DataX job, and each Task is responsible for synchronizing a portion of the data.
- Schedule: organizes Tasks into TaskGroups; a single TaskGroup runs with a concurrency of 5.
- TaskGroup: responsible for starting the Tasks.
For example, suppose a user submits a DataX job configured with a concurrency of 20, aiming to synchronize MySQL data from 100 sub-tables to ODPS.
DataX's scheduling logic is then:

- The DataX Job is split into 100 Tasks according to the sub-databases and sub-tables.
- Given the concurrency of 20 and 5 concurrent Tasks per TaskGroup, DataX calculates that 4 TaskGroups are needed.
- The 4 TaskGroups divide the 100 Tasks equally, so each TaskGroup runs 25 Tasks with a concurrency of 5.
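The TaskGroup math above can be reproduced with a couple of lines of shell arithmetic (a sketch; the per-group concurrency of 5 is the DataX default mentioned above):

```shell
concurrency=20    # total concurrency requested by the user
per_group=5       # concurrent Tasks per TaskGroup (DataX default)
tasks=100         # one Task per sub-table

groups=$(( (concurrency + per_group - 1) / per_group ))  # ceil(20 / 5) = 4
tasks_each=$(( tasks / groups ))                          # 100 / 4 = 25
echo "groups=$groups tasks_each=$tasks_each"
```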
1.6 Comparison of DataX and Sqoop
Function | DataX | Sqoop |
---|---|---|
Operating mode | Single process, multi-threaded | MapReduce |
MySQL read/write | High pressure on a single machine; read/write granularity is easy to control | The MR mode is heavyweight, and handling write errors is troublesome |
Hive read/write | High pressure on a single machine | Very good |
File formats | ORC supported | ORC not supported, but can be added |
Distributed | Not supported; can be worked around via a scheduling system | Supported |
Flow control | Built-in flow control | Needs customization |
Statistics | Some statistics available; reports need customization | None; collecting statistics across distributed tasks is inconvenient |
Data validation | Validation built into the core | None; collecting validation data across distributed tasks is inconvenient |
Monitoring | Needs customization | Needs customization |
Community | Only recently open-sourced; the community is not yet active | Long active, though the core changes little |
2. Quick Start
2.1 Official address
Download link: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source address: https://github.com/alibaba/DataX
2.2 Prerequisites
- Operating system: Linux
- JDK: 1.8 or above (1.8 recommended)
- Python: Python 2.6.x recommended
2.3 Installation
- Upload the downloaded datax.tar.gz to the /opt/software directory on node01
- Extract datax.tar.gz into the /opt/module directory

[whybigdata@node01 software]$ tar -zxvf datax.tar.gz -C /opt/module/
- Run the self-check script
[whybigdata@node01 bin]$ cd /opt/module/datax/bin/
[whybigdata@node01 bin]$ python datax.py /opt/module/datax/job/job.json
If the job runs to completion and prints its final statistics without errors, the installation is successful.
That's all for this article!