Learning DataX 3.0

DataX

DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQLServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS, and more.

Quick Start

Download the DataX toolkit directly: DataX download page

After downloading, unpack it to a local directory, change into the bin directory, and you can run a synchronization job:

$ cd  {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}


Configuration example: read data from a stream and print it to the console

  • Step 1: create a job configuration file (in JSON format)

    You can view a configuration template with: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}

    $ cd  {YOUR_DATAX_HOME}/bin
    $  python datax.py -r streamreader -w streamwriter
    DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
    Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
    Please refer to the streamreader document:
        https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 
    
    Please refer to the streamwriter document:
         https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
     
    Please save the following configuration as a json file and  use
         python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
    to run the job.
    
    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader", 
                        "parameter": {
                            "column": [], 
                            "sliceRecordCount": ""
                        }
                    }, 
                    "writer": {
                        "name": "streamwriter", 
                        "parameter": {
                            "encoding": "", 
                            "print": true
                        }
                    }
                }
            ], 
            "setting": {
                "speed": {
                    "channel": ""
                }
            }
        }
    }

Based on the template, fill in stream2stream.json as follows:
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
       }
    }
  }
}
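The same job description can also be generated programmatically, which helps avoid hand-editing mistakes in larger configs. A minimal sketch (values taken from the example above):

```python
import json

# Build the same stream2stream job as a Python dict and write it to disk.
job = {
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello,你好,世界-DataX"},
                        ],
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }
        ],
        "setting": {"speed": {"channel": 5}},
    }
}

# ensure_ascii=False keeps the Chinese sample text readable in the file.
with open("stream2stream.json", "w", encoding="utf-8") as f:
    json.dump(job, f, ensure_ascii=False, indent=2)
```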

Step 2: start DataX

$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json 

When the synchronization finishes, the log looks like this:


DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2018-06-09 14:48:04.652 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-06-09 14:48:04.682 [main] INFO  Engine - the machine info  => 

        osInfo: Oracle Corporation 1.8 25.111-b14
        jvmInfo:        Linux amd64 3.10.0-693.el7.x86_64
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size                      
        PS Eden Space                  | 256.00MB                       | 256.00MB                       
        Code Cache                     | 240.00MB                       | 2.44MB                         
        Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
        PS Survivor Space              | 42.50MB                        | 42.50MB                        
        PS Old Gen                     | 683.00MB                       | 683.00MB                       
        Metaspace                      | -0.00MB                        | 0.00MB                         


2018-06-09 14:48:04.792 [main] INFO  Engine - 
{
        "content":[
                {
                        "reader":{
                                "name":"streamreader",
                                "parameter":{
                                        "column":[
                                                {
                                                        "type":"long",
                                                        "value":"10"
                                                },
                                                {
                                                        "type":"string",
                                                        "value":"hello,你好,世界-DataX"
                                                }
                                        ],
                                        "sliceRecordCount":10
                                }
                        },
                        "writer":{
                                "name":"streamwriter",
                                "parameter":{
                                        "encoding":"UTF-8",
                                        "print":true
                                }
                        }
                }
        ],
        "setting":{
                "speed":{
                        "channel":5
                }
        }
}

2018-06-09 14:48:04.949 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2018-06-09 14:48:04.950 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2018-06-09 14:48:04.951 [main] INFO  JobContainer - DataX jobContainer starts job.
2018-06-09 14:48:04.953 [main] INFO  JobContainer - Set jobId = 0
2018-06-09 14:48:05.008 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2018-06-09 14:48:05.008 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
2018-06-09 14:48:05.008 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2018-06-09 14:48:05.009 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2018-06-09 14:48:05.009 [job-0] INFO  JobContainer - Job set Channel-Number to 5 channels.
2018-06-09 14:48:05.010 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2018-06-09 14:48:05.010 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2018-06-09 14:48:05.105 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2018-06-09 14:48:05.145 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2018-06-09 14:48:05.150 [job-0] INFO  JobContainer - Running by standalone Mode.
2018-06-09 14:48:05.239 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2018-06-09 14:48:05.246 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2018-06-09 14:48:05.246 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2018-06-09 14:48:05.263 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2018-06-09 14:48:05.293 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2018-06-09 14:48:05.421 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2018-06-09 14:48:05.537 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2018-06-09 14:48:05.650 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
10      hello,你好,世界-DataX
2018-06-09 14:48:05.752 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[486]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[433]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[492]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[104]ms
2018-06-09 14:48:05.753 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[218]ms
2018-06-09 14:48:05.754 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2018-06-09 14:48:15.245 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2018-06-09 14:48:15.246 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2018-06-09 14:48:15.248 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2018-06-09 14:48:15.249 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
2018-06-09 14:48:15.250 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2018-06-09 14:48:15.254 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /usr/local/datax/hook
2018-06-09 14:48:15.258 [job-0] INFO  JobContainer - 
         [total cpu info] => 
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
                -1.00%                         | -1.00%                         | -1.00%
                        

         [total gc info] => 
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
                 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
                 PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             

2018-06-09 14:48:15.258 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-06-09 14:48:15.260 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2018-06-09 14:48:15.261 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2018-06-09 14:48:04
任务结束时刻                    : 2018-06-09 14:48:15
任务总计耗时                    :                 10s
任务平均流量                    :               95B/s
记录写入速度                    :              5rec/s
读出记录总数                    :                  50
读写失败总数                    :                   0
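The totals in the summary follow directly from the job settings: streamreader emits sliceRecordCount records per task, and the job was split into channel tasks, so the total is 10 × 5 = 50 records. A small Python check (summary lines copied from the log above):

```python
# Two summary lines copied verbatim from the DataX log.
summary = """\
读出记录总数                    :                  50
读写失败总数                    :                   0"""

# Parse "key : value" pairs out of the summary block.
stats = {}
for line in summary.splitlines():
    key, _, value = line.partition(":")
    stats[key.strip()] = value.strip()

# streamreader emits sliceRecordCount records per task, and the job was
# split into `channel` tasks, so the expected total is 10 * 5 = 50.
records_read = int(stats["读出记录总数"])
assert records_read == 10 * 5
assert int(stats["读写失败总数"]) == 0
print(records_read)  # 50
```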

To run the job in the background and write its log to a specified file, use a command like:

nohup python datax.py /usr/local/dataxjson/stream2stream.json >/usr/local/dataxjson/stream2stream.log 2>&1 &
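The same background-run pattern can be driven from Python with subprocess, which is handy when DataX jobs are launched from a scheduler script. This is a sketch, not part of DataX itself; the paths in the commented example are the ones used above and should be adjusted to your installation:

```python
import subprocess
import sys

def run_datax_background(job_json, log_path, datax_py="datax.py"):
    """Start a DataX job in the background, appending all output to log_path.

    Mirrors: nohup python datax.py job.json > log 2>&1 &
    """
    log = open(log_path, "ab")
    return subprocess.Popen(
        [sys.executable, datax_py, job_json],  # same as `python datax.py job.json`
        stdout=log,                # > stream2stream.log
        stderr=subprocess.STDOUT,  # 2>&1
    )

# Example (paths from the tutorial above):
# proc = run_datax_background("/usr/local/dataxjson/stream2stream.json",
#                             "/usr/local/dataxjson/stream2stream.log")
# proc.wait()  # or poll later; exit code 0 means the job succeeded
```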


Support Data Channels

DataX now has a fairly comprehensive plugin ecosystem: mainstream relational databases, NoSQL stores, and big-data computing systems are all supported. The currently supported data sources are listed below; for details, see the DataX data source reference guide.

Type                            Data source                              Reader  Writer
RDBMS (relational databases)    MySQL                                    √       √
                                Oracle                                   √       √
                                SQLServer                                √       √
                                PostgreSQL                               √       √
                                DRDS                                     √       √
                                Generic RDBMS (any relational database)  √       √
Alibaba Cloud data warehousing  ODPS                                     √       √
                                ADS                                              √
                                OSS                                      √       √
                                OCS                                      √       √
NoSQL data stores               OTS                                      √       √
                                Hbase0.94                                √       √
                                Hbase1.1                                 √       √
                                MongoDB                                  √       √
                                Hive                                     √       √
Unstructured data stores        TxtFile                                  √       √
                                FTP                                      √       √
                                HDFS                                     √       √
                                Elasticsearch                                    √
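Which readers and writers a given installation actually ships with can also be checked locally. This sketch assumes the standard DataX layout, where each plugin lives in its own directory under {DATAX_HOME}/plugin/reader and {DATAX_HOME}/plugin/writer:

```python
import os

def list_plugins(datax_home):
    """Return the reader/writer plugins installed under datax_home/plugin/."""
    plugins = {}
    for side in ("reader", "writer"):
        plugin_dir = os.path.join(datax_home, "plugin", side)
        if os.path.isdir(plugin_dir):
            plugins[side] = sorted(os.listdir(plugin_dir))
        else:
            plugins[side] = []
    return plugins

# e.g. list_plugins("/usr/local/datax") might return
# {"reader": ["mysqlreader", "streamreader", ...], "writer": [...]},
# depending on what is installed.
```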



Reprinted from blog.csdn.net/zsj777/article/details/80632959