Big Data Project, E-commerce Data Warehouse (DataX): Introduction to DataX, Data Sources Supported by DataX, Principles of DataX Architecture, and DataX Deployment

1. Introduction to DataX

1.1 Overview of DataX

  DataX is an offline synchronization tool for heterogeneous data sources, open-sourced by Alibaba. It aims to provide stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
Source address: https://github.com/alibaba/DataX

1.2 Data sources supported by DataX

  DataX has a fairly comprehensive plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big data computing systems are already supported. The currently supported data sources are shown in the table below.

| Type | Data Source | Reader (read) | Writer (write) |
| --- | --- | --- | --- |
| RDBMS (relational databases) | MySQL | √ | √ |
| | Oracle | √ | √ |
| | OceanBase | √ | √ |
| | SQLServer | √ | √ |
| | PostgreSQL | √ | √ |
| | DRDS | √ | √ |
| | Generic RDBMS | √ | √ |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ |
| | ADS | | √ |
| | OSS | √ | √ |
| | OCS | | √ |
| NoSQL data storage | OTS | √ | √ |
| | Hbase0.94 | √ | √ |
| | Hbase1.1 | √ | √ |
| | Phoenix4.x | √ | √ |
| | Phoenix5.x | √ | √ |
| | MongoDB | √ | √ |
| | Hive | √ | √ |
| | Cassandra | √ | √ |
| Unstructured data storage | TxtFile | √ | √ |
| | FTP | √ | √ |
| | HDFS | √ | √ |
| | Elasticsearch | | √ |
| Time-series databases | OpenTSDB | √ | |
| | TSDB | √ | √ |
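
Each row above corresponds to a Reader and/or Writer plugin bundled with DataX. Once DataX is deployed (Section 3), you can list the available plugins directly from the install directory; a quick check, assuming the /opt/module/datax path used below:

[summer@hadoop102 ~]$ ls /opt/module/datax/plugin/reader   # one directory per Reader plugin
[summer@hadoop102 ~]$ ls /opt/module/datax/plugin/writer   # one directory per Writer plugin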

2. Principles of DataX architecture

2.1 DataX Design Concept

  To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX itself acting as the intermediate transmission carrier that connects the various data sources. When a new data source needs to be supported, it only has to be connected to DataX to achieve seamless synchronization with all existing data sources.


2.2 DataX framework design

  As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins, which plug into the overall synchronization framework.

Reader: the data collection module, responsible for reading data from the source data store and sending it to the Framework.
Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination.
Framework: connects the Reader and Writer, serving as the data transmission channel between them, and handles core technical concerns such as buffering, flow control, concurrency, and data conversion.
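
To make the Reader/Framework/Writer split concrete, here is a minimal job configuration modeled on the stock self-test job (job/job.json, run again in Section 3.3): a streamreader generates a few in-memory records, the Framework moves them through one channel, and a streamwriter prints them. The column values here are illustrative.

{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            { "type": "long", "value": "10" },
                            { "type": "string", "value": "hello, DataX" }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ]
    }
}

Swapping the reader/writer names and their parameters (for example, mysqlreader and hdfswriter) is all it takes to retarget the same framework at a different source and destination.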

2.3 DataX operation process

  The following sequence-diagram view of the DataX job life cycle illustrates how DataX runs, its core concepts, and how those concepts relate to one another.
Job: a single data synchronization job is called a Job; each Job starts one process.
Task: a Job is split into multiple Tasks according to the splitting strategy of each data source. A Task is the smallest unit of a DataX job, and each Task is responsible for synchronizing a portion of the data.
TaskGroup: the Scheduler module groups Tasks into Task Groups. Each TaskGroup runs the Tasks allocated to it with a fixed degree of concurrency; the concurrency of a single TaskGroup is 5.
Reader→Channel→Writer: after each Task starts, it launches a Reader→Channel→Writer thread pipeline to carry out the synchronization.

2.4 DataX scheduling decision-making ideas

For example, suppose a user submits a DataX job with a total concurrency of 20, in order to synchronize a MySQL data source that has 100 sub-tables. DataX makes its scheduling decision as follows:
1) The DataX Job splits the synchronization work into 100 Tasks according to the sub-database/sub-table splitting strategy.
2) Given the configured total concurrency of 20 and a per-TaskGroup concurrency of 5, DataX calculates that 20 / 5 = 4 TaskGroups need to be allocated.
3) The 4 TaskGroups divide the 100 Tasks equally, so each TaskGroup runs 25 Tasks.
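
In configuration terms, the total concurrency in this example corresponds to the channel count in the job's setting block. The fragment below sits inside the top-level "job" object of a job configuration file (see the full layout in the Section 2.2 example); only the channel value is specific to this scenario.

"setting": {
    "speed": {
        "channel": 20
    }
}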

2.5 Comparison between DataX and Sqoop

| Function | DataX | Sqoop |
| --- | --- | --- |
| Operating mode | Single process, multi-threaded | MapReduce |
| Distributed | Not supported; can be worked around via a scheduling system | Supported |
| Flow control | Built-in flow control | Requires customization |
| Statistics | Provides some statistics; reports require customization | None; statistics are inconvenient to collect in a distributed setting |
| Data validation | Validation built into the core module | None; validation is inconvenient in a distributed setting |
| Monitoring | Requires customization | Requires customization |

3. DataX deployment

3.1 Download the DataX installation package and upload it to /opt/software on hadoop102

Download address: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz


3.2 Unzip datax.tar.gz to /opt/module

[summer@hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module/
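
After extraction, it is worth confirming the layout. Assuming the package unpacked to /opt/module/datax, the directory typically contains bin (the datax.py launcher), conf, job (sample jobs, including the job.json used in the self-test below), lib, and plugin (the Reader/Writer plugins):

[summer@hadoop102 software]$ ls /opt/module/datax   # verify the extracted directory layout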


3.3 Self-test: execute the following command

[summer@hadoop102 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json
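
On success, the self-test prints a job summary including the total records read and the failure count. The same launcher runs any job configuration file; for example, with a job file of your own (my_job.json is a hypothetical name):

[summer@hadoop102 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/my_job.json   # my_job.json is a hypothetical example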

