DataX
DataX is an offline data synchronization tool/platform used widely within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.
System Requirements
- Linux
- JDK (1.8 or above; 1.8 recommended)
- Python (Python 2.6.x recommended)
- Apache Maven 3.x (only needed to compile DataX from source)
Download the DataX toolkit directly from the DataX download page.
After downloading, extract it to a local directory and change into the bin directory; you can then run a synchronization job:
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}
Configuration example: read data from a stream and print it to the console
Step 1: Create the job configuration file (JSON format)
You can view a configuration template with: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py -r streamreader -w streamwriter

DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.

Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
Based on the template, configure stream2stream.json as follows:
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,你好,世界-DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 5
}
}
}
}
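If you prefer to generate job files from a script rather than writing them by hand, the same configuration can be built programmatically. A minimal Python sketch (the output file name is the one used in this walkthrough; everything else mirrors the JSON above):

```python
import json

# Build the same stream2stream job structure as the hand-written config above.
job = {
    "job": {
        "content": [{
            "reader": {
                "name": "streamreader",
                "parameter": {
                    "sliceRecordCount": 10,
                    "column": [
                        {"type": "long", "value": "10"},
                        {"type": "string", "value": "hello,你好,世界-DataX"},
                    ],
                },
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {"encoding": "UTF-8", "print": True},
            },
        }],
        "setting": {"speed": {"channel": 5}},
    }
}

# Write the job file that datax.py will consume.
with open("stream2stream.json", "w") as f:
    json.dump(job, f, ensure_ascii=False, indent=2)
```

Generating configs this way makes it easy to template many similar jobs (e.g. one per table) from a single script.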
Step 2: Start DataX
$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json
When the synchronization finishes, the log looks like this:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2018-06-09 14:48:04.652 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-06-09 14:48:04.682 [main] INFO  Engine - the machine info  =>
    osInfo:  Oracle Corporation 1.8 25.111-b14
    jvmInfo: Linux amd64 3.10.0-693.el7.x86_64
    cpu num: 4
    totalPhysicalMemory:            -0.00G
    freePhysicalMemory:             -0.00G
    maxFileDescriptorCount:         -1
    currentOpenFileDescriptorCount: -1
    GC Names [PS MarkSweep, PS Scavenge]

    MEMORY_NAME            | allocation_size | init_size
    PS Eden Space          | 256.00MB        | 256.00MB
    Code Cache             | 240.00MB        | 2.44MB
    Compressed Class Space | 1,024.00MB      | 0.00MB
    PS Survivor Space      | 42.50MB         | 42.50MB
    PS Old Gen             | 683.00MB        | 683.00MB
    Metaspace              | -0.00MB         | 0.00MB

2018-06-09 14:48:04.792 [main] INFO  Engine -
    ... (the stream2stream.json configuration from Step 1 is echoed here) ...

2018-06-09 14:48:04.949 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2018-06-09 14:48:04.950 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2018-06-09 14:48:04.951 [main] INFO  JobContainer - DataX jobContainer starts job.
2018-06-09 14:48:04.953 [main] INFO  JobContainer - Set jobId = 0
2018-06-09 14:48:05.008 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2018-06-09 14:48:05.008 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
2018-06-09 14:48:05.008 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2018-06-09 14:48:05.009 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2018-06-09 14:48:05.009 [job-0] INFO  JobContainer - Job set Channel-Number to 5 channels.
2018-06-09 14:48:05.010 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2018-06-09 14:48:05.010 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2018-06-09 14:48:05.105 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2018-06-09 14:48:05.145 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2018-06-09 14:48:05.150 [job-0] INFO  JobContainer - Running by standalone Mode.
2018-06-09 14:48:05.239 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2018-06-09 14:48:05.246 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2018-06-09 14:48:05.246 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2018-06-09 14:48:05.263 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2018-06-09 14:48:05.293 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2018-06-09 14:48:05.421 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2018-06-09 14:48:05.537 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2018-06-09 14:48:05.650 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
... (the record "10	hello,你好,世界-DataX" is printed 50 times in total) ...
2018-06-09 14:48:05.752 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[486]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[433]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[492]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[104]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[218]ms
2018-06-09 14:48:05.754 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2018-06-09 14:48:15.245 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2018-06-09 14:48:15.246 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2018-06-09 14:48:15.248 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2018-06-09 14:48:15.249 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
2018-06-09 14:48:15.250 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2018-06-09 14:48:15.254 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /usr/local/datax/hook
2018-06-09 14:48:15.258 [job-0] INFO  JobContainer -
    [total cpu info] =>
        averageCpu | maxDeltaCpu | minDeltaCpu
        -1.00%     | -1.00%      | -1.00%

    [total gc info] =>
        NAME         | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
        PS MarkSweep | 0            | 0               | 0               | 0.000s      | 0.000s         | 0.000s
        PS Scavenge  | 0            | 0               | 0               | 0.000s      | 0.000s         | 0.000s

2018-06-09 14:48:15.258 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-06-09 14:48:15.260 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2018-06-09 14:48:15.261 [job-0] INFO  JobContainer -
Job start time            : 2018-06-09 14:48:04
Job end time              : 2018-06-09 14:48:15
Total elapsed time        : 10s
Average throughput        : 95B/s
Record write speed        : 5rec/s
Total records read        : 50
Total read/write failures : 0
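The summary numbers follow directly from the configuration: 5 channels each run one streamreader task that produces sliceRecordCount = 10 records, and the job took about 10 seconds. A quick sanity check of the arithmetic:

```python
channels = 5             # "speed": {"channel": 5} in stream2stream.json
slice_record_count = 10  # records produced per streamreader task
total_records = channels * slice_record_count   # matches "Total 50 records"

total_bytes = 950        # reported by the job summary
elapsed_seconds = 10     # 14:48:04 -> 14:48:15, as reported in the summary
avg_speed = total_bytes // elapsed_seconds      # matches "Speed 95B/s"
record_speed = total_records // elapsed_seconds # matches "5 records/s"

print(total_records, avg_speed, record_speed)   # 50 95 5
```

So increasing either the channel count or sliceRecordCount scales the total record count proportionally.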
To run the job in the background and write its log to a specified file:
nohup python datax.py /usr/local/dataxjson/stream2stream.json >/usr/local/dataxjson/stream2stream.log 2>&1 &
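The same background launch can be scripted when you need to kick off jobs from another program. A minimal Python sketch using only the standard library (the helper name `launch_datax` and the paths are illustrative, not part of DataX itself):

```python
import subprocess

def launch_datax(datax_py, job_json, log_path, python_bin="python"):
    """Start a DataX job in the background, redirecting stdout/stderr to a log file."""
    cmd = [python_bin, datax_py, job_json]
    log = open(log_path, "a")
    # Popen returns immediately; the job keeps running in the background,
    # much like the nohup command above.
    return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)

# Example (paths from this walkthrough; adjust to your machine):
# proc = launch_datax("/usr/local/datax/bin/datax.py",
#                     "/usr/local/dataxjson/stream2stream.json",
#                     "/usr/local/dataxjson/stream2stream.log")
# proc.wait()  # or poll() periodically; exit code 0 means the job succeeded
```

Checking the process's exit code gives you a programmatic success/failure signal that the fire-and-forget nohup form does not.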
Supported Data Channels
DataX now has a fairly comprehensive plugin ecosystem: mainstream relational databases, NoSQL stores, and big-data computing systems are all supported. The currently supported data sources are listed below; for details, see the DataX Data Source Reference Guide.
| Type | Data Source | Reader | Writer | Docs |
|---|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ | read, write |
| | Oracle | √ | √ | read, write |
| | SQLServer | √ | √ | read, write |
| | PostgreSQL | √ | √ | read, write |
| | DRDS | √ | √ | read, write |
| | Generic RDBMS (any relational database) | √ | √ | read, write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | read, write |
| | ADS | | √ | write |
| | OSS | √ | √ | read, write |
| | OCS | √ | √ | read, write |
| NoSQL data stores | OTS | √ | √ | read, write |
| | Hbase0.94 | √ | √ | read, write |
| | Hbase1.1 | √ | √ | read, write |
| | MongoDB | √ | √ | read, write |
| | Hive | √ | √ | read, write |
| Unstructured data stores | TxtFile | √ | √ | read, write |
| | FTP | √ | √ | read, write |
| | HDFS | √ | √ | read, write |
| | Elasticsearch | | √ | write |