Table of contents
1. Environmental preparation
2. Installation and deployment
3. First experience with DataX
3.1 Configuration example
3.1.1 Generate configuration template
3.1.2 Create configuration file
3.1.3 Running DataX
3.1.4 Result display
3.2 Dynamic parameter transfer
3.2.1 Introduction to dynamic parameter transfer
3.2.2 Case of dynamic parameter transfer
3.3 Concurrency settings
Official reference document: https://github.com/alibaba/DataX/blob/master/userGuid.md
1. Environmental preparation
- Linux operating system
- JDK (1.8 or above; 1.8 is recommended)
- Python (2 or 3 is acceptable)
- Apache Maven 3.x (only required when compiling and installing from source)
2. Installation and deployment
2.1 Binary installation
1. Download the DataX toolkit from: https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202309/datax.tar.gz
2. Upload the downloaded package to the Linux machine.
3. Unzip and install:
(base) [root@hadoop03 ~]# tar -zxvf datax.tar.gz -C /usr/local/
4. Run the self-test script:
# python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
# For example:
python /usr/local/datax/bin/datax.py /usr/local/datax/job/job.json
5. Exception resolution
If the following error occurs when running the self-test job (the Chinese log says the plugin configuration file /usr/local/datax/plugin/reader/._drdsreader/plugin.json does not exist):
[main] WARN ConfigParser - 插件[streamreader,streamwriter]加载失败,1s后重试... Exception:Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/usr/local/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
[main] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/usr/local/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.common.util.Configuration.from(Configuration.java:95)
at com.alibaba.datax.core.util.ConfigParser.parseOnePluginConfig(ConfigParser.java:153)
at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:125)
at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
at com.alibaba.datax.core.Engine.entry(Engine.java:137)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
Solution: delete the hidden files starting with ._ in the plugin directory (these are typically macOS metadata files packaged into the tarball):
cd /usr/local/datax/plugin
find ./* -type f -name "._*" | xargs rm -f
2.2 Python 3 support
The DataX launcher scripts are written in Python 2, so they are meant to be run with a Python 2 interpreter. If the installed Python is version 3, running them directly with python3 fails because of the large syntax differences between the two versions. To run synchronization plans with python3, edit the three .py files in the bin directory and change only the following:
- replace print xxx with print(xxx)
- replace except Exception, e with except Exception as e
# Take datax.py as an example
(base) [root@hadoop03 ~]# cd /usr/local/datax/bin/
(base) [root@hadoop03 /usr/local/datax/bin]# ls
datax.py dxprof.py perftrace.py
(base) [root@hadoop03 /usr/local/datax/bin]# vim datax.py
print(readerRef)
print(writerRef)
jobGuid = 'Please save the following configuration as a json file and use\n python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json \nto run the job.\n'
print(jobGuid)
Use the python3 command to execute the self-test script:
(base) [root@hadoop03 /usr/local/datax/bin]# python3 /usr/local/datax/bin/datax.py /usr/local/datax/job/job.json
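The two edits above are mechanical, so they can also be scripted. A minimal sketch (fix_py3 is a made-up helper name, and it assumes those two patterns are the only Python-2-isms that need changing; test on a backup copy first):

```shell
# fix_py3: apply the two Python 3 fixes described above to one file, in place.
# Assumption: "print xxx" and "except Exception, e" are the only
# incompatibilities present in the scripts.
fix_py3() {
  sed -i -E \
      -e 's/print ([^(].*)/print(\1)/' \
      -e 's/except ([A-Za-z]+), *e/except \1 as e/' \
      "$1"
}
# for f in /usr/local/datax/bin/*.py; do cp "$f" "$f.bak"; fix_py3 "$f"; done
```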
3. First experience with DataX
3.1 Configuration example
3.1.1. Generate configuration template
DataX synchronization jobs are described by json configuration files that define the reader, the writer, and other settings. The following command generates a json template, which you can then edit into the final job file:
python3 /usr/local/datax/bin/datax.py -r {reader} -w {writer}
Replace {reader} with the name of the reader plugin you want, and {writer} with the name of the writer plugin.
- Supported readers:
All readers are stored in the plugin/reader directory under the DataX installation directory and can be listed there:
(base) [root@hadoop03 /usr/local/datax]# ls
bin conf job lib log log_perf plugin script tmp
(base) [root@hadoop03 /usr/local/datax]# ls plugin/reader/
cassandrareader ftpreader hbase11xsqlreader loghubreader odpsreader otsreader sqlserverreader tsdbreader
clickhousereader gdbreader hbase20xsqlreader mongodbreader opentsdbreader otsstreamreader starrocksreader txtfilereader
datahubreader hbase094xreader hdfsreader mysqlreader oraclereader postgresqlreader streamreader
drdsreader hbase11xreader kingbaseesreader oceanbasev10reader ossreader rdbmsreader tdenginereader
- Supported writers:
All writers are stored in the plugin/writer directory under the DataX installation directory and can be listed there:
(base) [root@hadoop03 /usr/local/datax]# ls plugin/writer/
adbpgwriter datahubwriter gdbwriter hdfswriter mongodbwriter odpswriter postgresqlwriter streamwriter
adswriter doriswriter hbase094xwriter hologresjdbcwriter mysqlwriter oraclewriter rdbmswriter tdenginewriter
cassandrawriter drdswriter hbase11xsqlwriter kingbaseeswriter neo4jwriter oscarwriter selectdbwriter tsdbwriter
clickhousewriter elasticsearchwriter hbase11xwriter kuduwriter oceanbasev10writer osswriter sqlserverwriter txtfilewriter
databendwriter ftpwriter hbase20xsqlwriter loghubwriter ocswriter otswriter starrockswriter
For example, to view the configuration template for streamreader and streamwriter, run:
python3 /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
This command prints the json template directly to the console. To save it as a file instead, redirect the output to the desired file:
python3 /usr/local/datax/bin/datax.py -r streamreader -w streamwriter > ~/stream2stream.json
3.1.2 Create configuration file
Create the stream2stream.json file:
(base) [root@hadoop03 ~]# mkdir jobs
(base) [root@hadoop03 ~]# cd jobs/
(base) [root@hadoop03 ~/jobs]# vim stream2stream.json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,你好,世界-DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 5
}
}
}
}
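Before handing the file to DataX, you can optionally check that it parses as valid json. A small convenience sketch (check_job is a hypothetical helper name, not part of DataX):

```shell
# check_job: validate that a DataX job file is well-formed json
# before running it, so malformed files fail fast.
check_job() {
  python3 -m json.tool "$1" > /dev/null && echo "json OK: $1"
}
# check_job ~/jobs/stream2stream.json
```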
3.1.3 Running DataX
(base) [root@hadoop03 ~/jobs]# python3 /usr/local/datax/bin/datax.py stream2stream.json
3.1.4 Result display
The streamwriter prints each synchronized record to the console (10 records of "10 hello,你好,世界-DataX" per channel), followed by DataX's job summary statistics (total records read, throughput, and failure count).
3.2 Dynamic parameter transfer
3.2.1. Introduction to dynamic parameter transfer
When DataX synchronizes data, the synchronization plan is defined in its json configuration file. In some scenarios, the plan needs dynamic values when it is executed. For example:
- Synchronizing MySQL data to HDFS, where only the table names and fields differ between runs.
- Incrementally synchronizing MySQL data to HDFS or Hive, where each run must specify its own synchronization time.
- ...
Writing a new json file for every run would be very troublesome. Instead, we can use dynamic parameter transfer: define variable-like parameters in the json synchronization plan, then supply concrete values for them when the plan is executed.
3.2.2. Case of dynamic parameter transfer
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": $TIMES,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,你好,世界-DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
When running the synchronization plan, use -p "-Dname=value" to supply concrete parameter values. In the json above we defined a parameter TIMES, so specifying a value for TIMES at run time dynamically sets sliceRecordCount.
python3 /usr/local/datax/bin/datax.py -p "-DTIMES=3" stream2stream.json
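Several parameters can be passed in one -p string as space-separated -D pairs. A small sketch (build_params is a hypothetical helper, and the dt parameter is illustrative, not part of the job above):

```shell
# build_params: join KEY=VALUE pairs into the "-DKEY=VALUE ..." string
# expected by datax.py's -p option.
build_params() {
  local out="" kv
  for kv in "$@"; do out="$out -D$kv"; done
  printf '%s' "${out# }"
}
PARAMS=$(build_params TIMES=3 dt=2024-01-01)
echo "$PARAMS"   # prints: -DTIMES=3 -Ddt=2024-01-01
# python3 /usr/local/datax/bin/datax.py -p "$PARAMS" stream2stream.json
```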
3.3 Concurrency settings
In the DataX processing flow, a Job is split into several Tasks that execute concurrently, managed by TaskGroups. Internally, each Task has the structure reader -> channel -> writer, and the number of channels determines the degree of concurrency. The channel count can be specified in three ways:
- Directly specify the number of channels
- Derive the number of channels from a Bps limit
- Derive the number of channels from a tps limit
3.3.1 Direct designation
In the json file of the synchronization plan, set job.setting.speed.channel to choose the number of channels directly. This is the most direct way. With this configuration, each channel's Bps is the default 1 MBps, i.e. 1 MB of data transmitted per second.
3.3.2 Bps
Bps (bytes per second) is a common way to express data transfer rate. In DataX, you can limit both the total Job Bps and the per-channel Bps to impose a speed limit, and DataX derives the channel count from the two values.
- Job Bps: the overall limit for a Job, set through job.setting.speed.byte.
- channel Bps: the limit for a single channel, set through core.transport.channel.speed.byte.
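As a sketch (the values are illustrative, and the job's content section is omitted for brevity), a job-wide limit of 5 MB/s combined with a per-channel limit of 1 MB/s should make DataX derive 5242880 / 1048576 = 5 channels:

```json
{
  "core": {
    "transport": {
      "channel": {
        "speed": { "byte": 1048576 }
      }
    }
  },
  "job": {
    "setting": {
      "speed": { "byte": 5242880 }
    }
  }
}
```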
3.3.3 tps
tps (transactions per second, i.e. records per second in DataX) is another common way to express data transfer rate. You can limit both the total Job tps and the per-channel tps to impose a speed limit, and DataX derives the channel count from the two values.
- Job tps: the overall limit for a Job, set through job.setting.speed.record.
- channel tps: the limit for a single channel, set through core.transport.channel.speed.record.
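The record-based limits work the same way (values again illustrative, content section omitted): a job-wide limit of 10000 records/s with 2000 records/s per channel yields 10000 / 2000 = 5 channels:

```json
{
  "core": {
    "transport": {
      "channel": {
        "speed": { "record": 2000 }
      }
    }
  },
  "job": {
    "setting": {
      "speed": { "record": 10000 }
    }
  }
}
```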
3.3.4 Priority
- If both Bps and tps limits are configured, the smaller resulting channel count prevails.
- The directly configured channel number takes effect only when neither Bps nor tps is configured.