[Ali] DataX open-source ETL tool -- DataX made simple

I. Overview

  1. What is DataX?

  DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient, configurable data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS, and others.

  Open-source repository: https://github.com/alibaba/DataX

 

II. Introduction

  1. Design architecture

  

 

  DataX acts as an intermediary hub for data exchange: as long as a data source can connect to DataX, it can synchronize with any other data source that is also connected to DataX.

   2. Framework

  

  Core components:

    Reader: the data collection module, responsible for reading data from the source

    Writer: the data writing module, responsible for writing data to the target store

    Framework: the data transmission channel between Reader and Writer, responsible for buffering, flow control, and data transformation

    To support a new data source, you only need to implement a new Reader or Writer plug-in

  Support for mainstream data sources, see: https://github.com/alibaba/DataX/blob/master/introduction.md

  Understanding DataX through its core module, the Job:

    A single data synchronization run in DataX is called a Job; the Job is responsible for pre- and post-run cleanup and for splitting the work into sub-tasks;

    After startup, the Job is split into multiple concurrently executing Tasks according to source-specific splitting strategies; a Task is the smallest unit of execution;

    After splitting, the Scheduler module combines Tasks into TaskGroups; each TaskGroup is responsible for running its assigned Tasks with a certain degree of concurrency
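The Job → Task → TaskGroup splitting described above can be illustrated with a short sketch. This is an illustration of the idea only, not DataX's actual source code; range-based splitting of a numeric primary key is assumed:

```python
# Illustrative sketch (not DataX internals): a Job splits an id range into
# Tasks, and Tasks are then combined into TaskGroups that run concurrently.
def split_job(min_id, max_id, num_tasks):
    """Split [min_id, max_id] into num_tasks contiguous ranges (the Tasks)."""
    step = (max_id - min_id + 1) // num_tasks
    tasks = []
    for i in range(num_tasks):
        lo = min_id + i * step
        hi = max_id if i == num_tasks - 1 else lo + step - 1
        tasks.append((lo, hi))
    return tasks

def group_tasks(tasks, group_size):
    """Combine Tasks into TaskGroups; each group runs its Tasks concurrently."""
    return [tasks[i:i + group_size] for i in range(0, len(tasks), group_size)]

tasks = split_job(1, 100, 4)    # 4 Tasks covering ids 1..100
groups = group_tasks(tasks, 2)  # 2 TaskGroups of 2 Tasks each
```

In real DataX the splitting strategy depends on the reader plug-in (e.g. splitting by primary-key range or by table), but the Job/Task/TaskGroup relationship follows this shape.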

III. Getting Started

  1. Install

    Refer to the official documentation

    Installation notes (a separate post, in Chinese): https://www.cnblogs.com/jiangbei/p/10901201.html

  2. Use

    The core of using DataX is writing the job configuration file (JSON in the current version)

    [Configuration file]:

      Let's look at an example first:

  JSON configuration file

    The entire configuration describes one Job; the root element is job, which has two child elements: setting and content.

  setting describes the task itself (speed, error limits, and so on), while content describes the source (Reader) and the destination (Writer):

  

   Within content, there are two parts, reader and writer, corresponding to the source and the destination:
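A minimal skeleton of that structure can be built and checked programmatically. This is a sketch of the layout only; the plug-in names match the mysqlreader/streamwriter example shown below:

```python
import json

# Minimal DataX job skeleton: root element "job" with "setting" and "content";
# each content item holds a "reader" (source) and a "writer" (destination).
skeleton = {
    "job": {
        "setting": {"speed": {"channel": 1}},
        "content": [
            {
                "reader": {"name": "mysqlreader", "parameter": {}},
                "writer": {"name": "streamwriter", "parameter": {}},
            }
        ],
    }
}
print(json.dumps(skeleton, indent=2))
```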

  

 

 For the documentation of each reader and writer plug-in, open the corresponding plug-in folder in the repository and read its doc directory

    

     3. Example

      A Job that reads from MySQL and prints the rows locally:

      Write the configuration file based on the mysqlreader plug-in:

{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "root",
                        "column": [
                            "id",
                            "age",
                            "name"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "girl"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://192.168.19.129:3306/mysql"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": true
                    }
                }
            }
        ]
    }
}

 

 

  According to the self-check script (the file name shows in light green), the file needs to be executable, so add the permission:

chmod +x mysqltest.json

    Then run it with the following command (from the bin directory):

python datax.py ../job/mysqltest.json

     If you want a more flexible, customized configuration, you can refer to the following JSON config:

 

{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "root",
                        "connection": [
                            {
                                "querySql": [
                                    "select db_id,on_line_flag from db_info where db_id < 10;"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://bad_ip:3306/database",
                                    "jdbc:mysql://127.0.0.1:bad_port/database",
                                    "jdbc:mysql://127.0.0.1:3306/database"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": false,
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}
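The three jdbcUrl entries in this config (two deliberately unreachable, one good) exercise DataX's connection check: it probes the candidate URLs and uses one that is reachable. The idea, in a simplified sketch (not DataX internals; the reachability check here is mocked):

```python
# Illustrative sketch: probe candidate URLs in order and use the first
# one that passes a reachability check, as a connection-failover strategy.
def pick_url(urls, is_reachable):
    """Return the first URL passing the reachability check, else None."""
    for url in urls:
        if is_reachable(url):
            return url
    return None

urls = [
    "jdbc:mysql://bad_ip:3306/database",
    "jdbc:mysql://127.0.0.1:bad_port/database",
    "jdbc:mysql://127.0.0.1:3306/database",
]
# Mock check for illustration: any URL without "bad" in it is "reachable".
chosen = pick_url(urls, lambda u: "bad" not in u)
```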

IV. Extensions

  1. Scheduling

    When there are only a few tasks and the scenario is simple, Linux's built-in crontab is enough; for regular production use, a scheduling platform is recommended

    Airflow is recommended here (other related schedulers such as Azkaban and Oozie are not covered)
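For the simple crontab case, an entry like the following would run the job nightly. The install path and log path here are assumptions for illustration:

```
# Run the DataX job every day at 02:00; /opt/datax and the log path are assumed
0 2 * * * cd /opt/datax/bin && python datax.py ../job/mysqltest.json >> /tmp/datax_mysqltest.log 2>&1
```

Redirecting stdout and stderr to a log file is worthwhile, since DataX prints its synchronization statistics there.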

 

 

 


Origin www.cnblogs.com/qq18361642/p/11874085.html