Big data DataX detailed installation tutorial

Table of contents

1. Environment preparation

2. Installation and deployment 

2.1 Binary installation 

2.2 Python 3 support

3. First experience with DataX

3.1 Configuration example

3.1.1 Generate a configuration template

3.1.2 Create configuration file

3.1.3 Running DataX

3.1.4 Result display

3.2 Dynamic parameter passing

3.2.1 Introduction to dynamic parameter passing

3.2.2 A dynamic parameter passing example

3.3 Speed limit settings

3.3.1 Direct specification

3.3.2 Bps

3.3.3 tps

3.3.4 Priority


Official reference document: https://github.com/alibaba/DataX/blob/master/userGuid.md

1. Environment preparation
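
        DataX itself is lightweight to set up: it needs a Linux host, a JDK (1.8 or later is recommended by the official docs), and Python. A quick sanity check before installing (a minimal sketch; the hostname and shell prompt simply match the environment used in this tutorial):

(base) [root@hadoop03 ~]# java -version
(base) [root@hadoop03 ~]# python -V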

2. Installation and deployment 

2.1 Binary installation 

Extract the downloaded datax.tar.gz package into /usr/local/:

(base) [root@hadoop03 ~]# tar -zxvf datax.tar.gz -C /usr/local/

  • Self-test script

# python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json

# For example:
python /usr/local/datax/bin/datax.py /usr/local/datax/job/job.json

  • Exception resolution

If the following error occurs when executing the self-test program:

[main] WARN  ConfigParser - 插件[streamreader,streamwriter]加载失败,1s后重试... Exception:Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/usr/local/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
[main] ERROR Engine -

经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/usr/local/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
	at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
	at com.alibaba.datax.common.util.Configuration.from(Configuration.java:95)
	at com.alibaba.datax.core.util.ConfigParser.parseOnePluginConfig(ConfigParser.java:153)
	at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:125)
	at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
	at com.alibaba.datax.core.Engine.entry(Engine.java:137)
	at com.alibaba.datax.core.Engine.main(Engine.java:204)

Solution: the error complains that the plugin descriptor /usr/local/datax/plugin/reader/._drdsreader/plugin.json does not exist. These ._* entries are hidden metadata files left in the archive, and DataX mistakenly tries to load them as plugins. Delete all hidden files beginning with ._ in the plugin directory:

cd /usr/local/datax/plugin
find ./* -type f -name ".*er" | xargs rm -rf

2.2 Python 3 support

        DataX's launcher scripts are written in Python 2 and are meant to be run with a Python 2 interpreter. If the machine only has Python 3 installed, running them directly with python3 fails, because the syntax differences between Python 2 and 3 are significant.

        To run synchronization jobs with python3, modify the three .py files in the bin directory (datax.py, dxprof.py, perftrace.py). Only the following parts of these files need to change:

  • print xxx becomes print(xxx)

  • except Exception, e becomes except Exception as e

# Take datax.py as an example and make the changes
(base) [root@hadoop03 ~]# cd /usr/local/datax/bin/
(base) [root@hadoop03 /usr/local/datax/bin]# ls
datax.py  dxprof.py  perftrace.py
(base) [root@hadoop03 /usr/local/datax/bin]# vim datax.py
    print(readerRef)
    print(writerRef)
    jobGuid = 'Please save the following configuration as a json file and  use\n     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json \nto run the job.\n'
    print(jobGuid)
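
        If you prefer not to edit the files by hand, the 2to3 converter that ships with many Python 3 installations can apply the same substitutions automatically. This is only a sketch, assuming 2to3 is available on the machine; it keeps .bak backups of the originals by default, and the converted files should still be reviewed before use:

(base) [root@hadoop03 /usr/local/datax/bin]# 2to3 -w datax.py dxprof.py perftrace.py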

Use the python3 command to execute the self-test script:

(base) [root@hadoop03 /usr/local/datax/bin]# python3 /usr/local/datax/bin/datax.py /usr/local/datax/job/job.json

3. First experience with DataX 

3.1 Configuration example

3.1.1 Generate a configuration template

        DataX synchronization jobs are described by JSON files that configure the reader, the writer, and other settings. The following command generates a JSON template for a given reader/writer pair; editing that template produces the final job file.

python3 /usr/local/datax/bin/datax.py -r {reader} -w {writer}

        Replace {reader} with the name of the reader component you want, and replace {writer} with the name of the writer component you want.

  • Supported readers:

All readers are stored in the plugin/reader directory under the DataX installation directory and can be viewed in this directory:

(base) [root@hadoop03 /usr/local/datax]# ls
bin  conf  job  lib  log  log_perf  plugin  script  tmp
(base) [root@hadoop03 /usr/local/datax]# ls plugin/reader/
cassandrareader   ftpreader        hbase11xsqlreader  loghubreader        odpsreader      otsreader         sqlserverreader  tsdbreader
clickhousereader  gdbreader        hbase20xsqlreader  mongodbreader       opentsdbreader  otsstreamreader   starrocksreader  txtfilereader
datahubreader     hbase094xreader  hdfsreader         mysqlreader         oraclereader    postgresqlreader  streamreader
drdsreader        hbase11xreader   kingbaseesreader   oceanbasev10reader  ossreader       rdbmsreader       tdenginereader

  • Supported writers:

All writers are stored in the plugin/writer directory under the DataX installation directory and can be viewed in this directory:

(base) [root@hadoop03 /usr/local/datax]# ls plugin/writer/
adbpgwriter       datahubwriter        gdbwriter          hdfswriter          mongodbwriter       odpswriter    postgresqlwriter  streamwriter
adswriter         doriswriter          hbase094xwriter    hologresjdbcwriter  mysqlwriter         oraclewriter  rdbmswriter       tdenginewriter
cassandrawriter   drdswriter           hbase11xsqlwriter  kingbaseeswriter    neo4jwriter         oscarwriter   selectdbwriter    tsdbwriter
clickhousewriter  elasticsearchwriter  hbase11xwriter     kuduwriter          oceanbasev10writer  osswriter     sqlserverwriter   txtfilewriter
databendwriter    ftpwriter            hbase20xsqlwriter  loghubwriter        ocswriter           otswriter     starrockswriter

For example, if you need to view the configuration of streamreader and streamwriter, you can use the following operations:

python3 /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

        This command prints the JSON template directly to the console. To save it as a file instead, redirect the output to the desired path:

python3 /usr/local/datax/bin/datax.py -r streamreader -w streamwriter > ~/stream2stream.json

3.1.2 Create configuration file

Create the stream2stream.json file:

(base) [root@hadoop03 ~]# mkdir jobs
(base) [root@hadoop03 ~]# cd jobs/
(base) [root@hadoop03 ~/jobs]# vim stream2stream.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
       }
    }
  }
}

3.1.3 Running DataX

(base) [root@hadoop03 ~/jobs]# python3 /usr/local/datax/bin/datax.py stream2stream.json 

3.1.4 Result display
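
        With the configuration above, streamwriter simply prints each record to the console. sliceRecordCount is the number of records generated per channel, so 5 channels × 10 records should produce 50 lines of the form 10  hello,你好,世界-DataX, followed by DataX's end-of-job statistics (elapsed time, records read, bytes read, and failure count).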

3.2 Dynamic parameter passing

3.2.1 Introduction to dynamic parameter passing

        DataX jobs are driven by configuration files, usually JSON, that define the synchronization plan. In some scenarios, parts of that plan need to change from run to run. For example:

  • Synchronizing MySQL data to HDFS, where only the table name and column list differ between runs.

  • Incrementally synchronizing MySQL data to HDFS or Hive, where each run needs to specify a different synchronization time.

  • ...

        In these cases it would be tedious to write a new JSON file for every run. Instead, we can use dynamic parameter passing.

        Dynamic parameter passing means defining variable-like placeholders in the JSON synchronization plan; concrete values for these parameters are supplied when the job is executed.

3.2.2 A dynamic parameter passing example

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": $TIMES,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 1
       }
    }
  }
}

        When running the job, pass concrete values with the -p option and the -D syntax. In the JSON above we defined a parameter TIMES; the value supplied for TIMES dynamically sets sliceRecordCount:

python3 /usr/local/datax/bin/datax.py -p "-DTIMES=3" stream2stream.json

3.3 Speed limit settings 

        In DataX's processing flow, a Job is split into several Tasks that are executed concurrently and managed by different TaskGroups. Internally, each Task is structured as reader -> channel -> writer, and the number of channels determines the degree of concurrency. The channel count can be specified in three ways:

  • Directly specify the number of channels

  • Calculate the number of channels by Bps

  • Calculate the number of channels by tps

3.3.1 Direct specification

        In the job's JSON file, job.setting.speed.channel sets the number of channels directly, as in the stream2stream.json example above where it is set to 5. This is the most direct way. With this configuration, each channel's Bps is the default of 1 MBps, i.e. 1 MB of data transferred per second.

3.3.2 Bps

        Bps (bytes per second) is a common way to express data transfer rate. In DataX, both the total Bps of a job and the Bps of a single channel can be limited through configuration; this provides rate limiting and also determines the channel count (the job-level limit divided by the per-channel limit), as sketched after the two settings below.

  • Job Bps: The overall speed limit for a Job can be set through job.setting.speed.byte.

  • channel Bps: The speed limit for a single channel can be set through core.transport.channel.speed.byte.
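
        A minimal sketch of the two byte limits together, using the keys from the bullets above (the numeric values are purely illustrative, and the empty content section stands in for the reader/writer definitions shown earlier): with a job-level limit of 5242880 B/s (5 MB/s) and a per-channel limit of 1048576 B/s (1 MB/s), DataX would derive 5242880 / 1048576 = 5 channels.

{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "byte": 1048576
        }
      }
    }
  },
  "job": {
    "setting": {
      "speed": {
        "byte": 5242880
      }
    },
    "content": []
  }
}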

3.3.3 tps

        tps (transactions per second) is another common way to express transfer rate; in DataX it corresponds to records per second. Both the total tps of a job and the tps of a single channel can be limited through configuration, which again provides rate limiting and determines the channel count; a parallel sketch follows the two settings below.

  • Job tps: The overall rate limit for a Job can be set through job.setting.speed.record.

  • channel tps: The rate limit for a single channel can be set through core.transport.channel.speed.record.
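
        The corresponding sketch for record-based limits (values again illustrative, with the content section standing in for the reader/writer definitions): a job-level limit of 500 records/s with a per-channel limit of 100 records/s would yield 500 / 100 = 5 channels.

{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "record": 100
        }
      }
    }
  },
  "job": {
    "setting": {
      "speed": {
        "record": 500
      }
    },
    "content": []
  }
}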

3.3.4 Priority

  • If both Bps and tps limits are configured, the one that results in the smaller channel count prevails. For example, if the byte limits work out to 5 channels but the record limits work out to 3, the job runs with 3 channels.

  • The directly configured channel count (job.setting.speed.channel) only takes effect when neither a Bps nor a tps limit is configured.
