Article directory
1. Introduction to DataX
1.1 Overview of DataX
DataX is an offline synchronization tool for heterogeneous data sources, open sourced by Alibaba. It provides stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases, HDFS, Hive, HBase, FTP, and more.
Source address: https://github.com/alibaba/DataX
1.2 Data sources supported by DataX
DataX has a fairly comprehensive plug-in ecosystem: mainstream relational databases, NoSQL stores, and big data computing systems are already supported. The currently supported data sources are shown in the table below.
| Type | Data source | Reader (read) | Writer (write) |
| --- | --- | --- | --- |
| RDBMS (relational databases) | MySQL | √ | √ |
| | Oracle | √ | √ |
| | OceanBase | √ | √ |
| | SQLServer | √ | √ |
| | PostgreSQL | √ | √ |
| | DRDS | √ | √ |
| | Generic RDBMS | √ | √ |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ |
| | ADS | | √ |
| | OSS | √ | √ |
| | OCS | √ | √ |
| NoSQL data storage | OTS | √ | √ |
| | Hbase0.94 | √ | √ |
| | Hbase1.1 | √ | √ |
| | Phoenix4.x | √ | √ |
| | Phoenix5.x | √ | √ |
| | MongoDB | √ | √ |
| | Hive | √ | √ |
| | Cassandra | √ | √ |
| Unstructured data storage | TxtFile | √ | √ |
| | FTP | √ | √ |
| | HDFS | √ | √ |
| | Elasticsearch | | √ |
| Time-series databases | OpenTSDB | √ | |
| | TSDB | √ | √ |
2. Principles of DataX architecture
2.1 DataX Design Concept
To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX itself acting as the intermediate transmission carrier that connects the data sources. To support a new data source, you only need to connect that one source to DataX, and it can then synchronize seamlessly with every existing data source.
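The benefit of the star topology can be made concrete with a quick back-of-the-envelope calculation. This is a sketch; it assumes every source must be able to sync with every other source, and counts directed links for the mesh case:

```python
def mesh_links(n: int) -> int:
    """Point-to-point mesh: every source needs a dedicated sync link to every other source."""
    return n * (n - 1)

def star_links(n: int) -> int:
    """Star topology: each source only needs one Reader and one Writer plug-in for DataX."""
    return 2 * n

# With 10 heterogeneous data sources:
print(mesh_links(10))  # 90 directed point-to-point links to build and maintain
print(star_links(10))  # 20 plug-ins (one Reader + one Writer per source)
```

The mesh cost grows quadratically with the number of sources, while the star cost grows linearly, which is the core argument for the plug-in design.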
2.2 DataX framework design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader and Writer plug-ins, which are incorporated into the overall synchronization framework.
Reader: the data collection module, responsible for reading data from the source and sending it to the Framework.
Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination.
Framework: connects the Reader and Writer, serving as the data transmission channel between them and handling core technical issues such as buffering, flow control, concurrency, and data conversion.
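A DataX job wires one Reader and one Writer together through a JSON configuration file. Below is a minimal sketch modeled on the bundled `job/job.json` self-test, which uses DataX's built-in `streamreader`/`streamwriter` test plug-ins; the parameter values here are illustrative, not authoritative:

```python
import json

# Minimal sketch of a DataX job configuration (reader + writer pair),
# modeled on the bundled self-test job. Parameter values are illustrative.
job = {
    "job": {
        "setting": {"speed": {"channel": 1}},  # number of concurrent channels
        "content": [
            {
                "reader": {
                    "name": "streamreader",  # built-in test plug-in that generates records
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [{"type": "string", "value": "DataX"}],
                    },
                },
                "writer": {
                    "name": "streamwriter",  # built-in test plug-in that prints records
                    "parameter": {"print": True, "encoding": "UTF-8"},
                },
            }
        ],
    }
}

print(json.dumps(job, indent=2))
```

Swapping in a real source or destination only means changing the `reader`/`writer` plug-in names and parameters; the surrounding framework stays the same.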
2.3 DataX operation process
The following sequence diagram of the DataX job life cycle illustrates the DataX operation process, its core concepts, and the relationships among them.
Job: a single data synchronization job is called a Job, and each Job starts one process.
Task: according to the splitting strategy of the data source, a Job is split into multiple Tasks. A Task is the smallest unit of a DataX job, and each Task synchronizes a portion of the data.
TaskGroup: the Scheduler module groups Tasks, and each group is called a TaskGroup. Each TaskGroup runs its allocated Tasks at a certain concurrency; by default, the concurrency of a single TaskGroup is 5.
Reader→Channel→Writer: after each Task starts, it runs a Reader→Channel→Writer thread pipeline to carry out the synchronization.
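The Reader→Channel→Writer pipeline inside one Task can be sketched as a producer/consumer pair connected by a bounded queue standing in for the Channel. This is a simplified model; a real DataX Channel also implements flow control and statistics:

```python
import queue
import threading

SENTINEL = None  # marks end of the record stream

def reader(channel: queue.Queue, records) -> None:
    # Reader thread: pulls records from the source and pushes them into the Channel.
    for rec in records:
        channel.put(rec)
    channel.put(SENTINEL)

def writer(channel: queue.Queue, sink: list) -> None:
    # Writer thread: continuously fetches records from the Channel and writes them out.
    while True:
        rec = channel.get()
        if rec is SENTINEL:
            break
        sink.append(rec)

channel = queue.Queue(maxsize=32)  # bounded queue gives simple back-pressure (buffering)
sink = []
t_reader = threading.Thread(target=reader, args=(channel, range(100)))
t_writer = threading.Thread(target=writer, args=(channel, sink))
t_reader.start(); t_writer.start()
t_reader.join(); t_writer.join()
print(len(sink))  # 100 records synchronized
```

The bounded queue is the key design point: when the Writer is slow, the Reader blocks on `put()`, which is the simplest form of the buffering and flow control the Framework provides.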
2.4 DataX scheduling decision-making ideas
For example, suppose a user submits a DataX job configured with a total concurrency of 20, to synchronize a MySQL data source split into 100 sub-tables. DataX's scheduling proceeds as follows:
1) The DataX Job splits the synchronization work into 100 Tasks according to the sub-database/sub-table splitting strategy.
2) Given the configured total concurrency of 20 and the per-TaskGroup concurrency of 5, DataX calculates that 20 / 5 = 4 TaskGroups must be allocated.
3) The 4 TaskGroups divide the 100 Tasks evenly, so each TaskGroup runs 25 Tasks.
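The arithmetic in the steps above can be sketched as a small function (a simplified model of the scheduler's sizing decision; function names are illustrative):

```python
import math

TASKGROUP_CONCURRENCY = 5  # default concurrency of a single TaskGroup

def plan(total_concurrency: int, num_tasks: int) -> tuple[int, int]:
    # Number of TaskGroups = total concurrency / per-group concurrency, rounded up.
    num_groups = math.ceil(total_concurrency / TASKGROUP_CONCURRENCY)
    # Tasks are divided evenly across the groups.
    tasks_per_group = math.ceil(num_tasks / num_groups)
    return num_groups, tasks_per_group

print(plan(20, 100))  # (4, 25): 4 TaskGroups, each running 25 Tasks
```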
2.5 Comparison between DataX and Sqoop
| Function | DataX | Sqoop |
| --- | --- | --- |
| Operating mode | Single process, multi-threaded | MapReduce |
| Distributed | Not supported; can be worked around via a scheduling system | Supported |
| Flow control | Built-in flow control | Needs customization |
| Statistics | Some statistics; reporting needs customization | None; distributed data collection is inconvenient |
| Data validation | Validation in the core module | None; distributed data collection is inconvenient |
| Monitoring | Needs customization | Needs customization |
3. DataX deployment
3.1 Download the DataX installation package and upload it to /opt/software on hadoop102
Download address: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
3.2 Unzip datax.tar.gz to /opt/module
[summer@hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module/
3.3 Run the self-test by executing the following command
[summer@hadoop102 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json