Configuring DataX 3.0.0
1. Configure DataX
1) Download the DataX installation package and upload it to /opt/software on hadoop102
Download URL: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
2) Extract datax.tar.gz to /opt/module
[atguigu@hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module/
3) Self-check: run the following command
[atguigu@hadoop102 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json
If output like the following appears, the installation succeeded
……
2021-10-12 21:51:12.335 [job-0] INFO JobContainer -
任务启动时刻 : 2021-10-12 21:51:02
任务结束时刻 : 2021-10-12 21:51:12
任务总计耗时 : 10s
任务平均流量 : 253.91KB/s
记录写入速度 : 10000rec/s
读出记录总数 : 100000
读写失败总数 : 0
2. DataX Examples
Using DataX is straightforward: choose the Reader and Writer that match the source and destination of the data you want to synchronize, configure both in a JSON file, and then run the following command to submit the data synchronization job.
[atguigu@hadoop102 datax]$ python bin/datax.py path/to/your/job.json
2.1 MySQLReader TableMode
[gpb@hadoop102 datax]$ cd job/
[gpb@hadoop102 job]$ vim base_province.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "where": "id>=3",
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "table": [
                                    "base_province"
                                ]
                            }
                        ],
                        "password": "000000",
                        "splitPk": "",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
Submit the job from the DataX root directory:
[gpb@hadoop102 datax]$ python bin/datax.py job/base_province.json
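In TableMode the reader and the writer each carry their own column list, and the positional layout of the two lists suggests they are expected to line up one to one. The following is a minimal sketch of a pre-submit check under that assumption; the helper name count_matching_columns is hypothetical and not part of DataX, and it only inspects the first entry of "content".

```python
import json


def count_matching_columns(job_text):
    """Return the shared column count, or raise if reader and writer differ.

    Hypothetical helper, not part of DataX. It assumes the job layout used
    in this article: a reader with a flat "column" list of names and a
    writer with a "column" list of {"name", "type"} objects.
    """
    job = json.loads(job_text)
    content = job["job"]["content"][0]
    reader_cols = content["reader"]["parameter"]["column"]
    writer_cols = content["writer"]["parameter"]["column"]
    if len(reader_cols) != len(writer_cols):
        raise ValueError(
            "reader has %d columns, writer has %d"
            % (len(reader_cols), len(writer_cols))
        )
    return len(reader_cols)
```

It could be called as count_matching_columns(open("job/base_province.json").read()) before submitting the job.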
2.2 MySQLReader QuerySQLMode
(1) Modify the configuration file base_province.json
[atguigu@hadoop102 ~]$ vim /opt/module/datax/job/base_province.json
(2) The content of the configuration file is as follows
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
3) Submit the job
(1) Clear the historical data
[atguigu@hadoop102 datax]$ hadoop fs -rm -r -f /base_province/*
(2) Go to the DataX root directory
[atguigu@hadoop102 datax]$ cd /opt/module/datax
(3) Run the following command
[atguigu@hadoop102 datax]$ python bin/datax.py job/base_province.json
4) View the results
(1) Log printed by DataX
2021-10-13 11:13:14.930 [job-0] INFO JobContainer -
任务启动时刻 : 2021-10-13 11:13:03
任务结束时刻 : 2021-10-13 11:13:14
任务总计耗时 : 11s
任务平均流量 : 66B/s
记录写入速度 : 3rec/s
读出记录总数 : 32
读写失败总数 : 0
(2) View the HDFS files
[atguigu@hadoop102 datax]$ hadoop fs -cat /base_province/* | zcat
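The zcat step is needed because the writer was configured with "compress": "gzip", "fileType": "text" and a tab as "fieldDelimiter". As a rough sketch of the same decoding in Python, assuming the file bytes have already been fetched from HDFS by some other means, the helper below (a hypothetical name) decompresses the bytes and splits each line on tabs. The sample row is made up for illustration and is not real table data.

```python
import gzip


def parse_gzip_tsv(raw_bytes, delimiter="\t"):
    """Decompress gzip bytes and split every non-empty line into fields."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return [line.split(delimiter) for line in text.splitlines() if line]


# Hypothetical sample row, compressed in memory to stand in for an HDFS file.
sample = gzip.compress("3\tname_a\tregion_a\tarea_a\tiso_a\tiso2_a\n".encode("utf-8"))
rows = parse_gzip_tsv(sample)
print(rows[0])
```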
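4.2.3 DataX Parameter Passing
Offline data synchronization jobs usually run on a daily schedule, so the target path on HDFS typically contains a date level that separates each day's data. The target path therefore changes from day to day, which means the value of the HDFS Writer path parameter in the DataX configuration file has to be dynamic. DataX parameter passing makes this possible.
It works as follows: reference a parameter in the JSON configuration file as ${param}, and pass its value when submitting the job with -p"-Dparam=value". The steps below walk through an example.
1) Write the configuration file
(1) Modify the configuration file base_province.json
[atguigu@hadoop102 ~]$ vim /opt/module/datax/job/base_province.json
(2) The content of the configuration file is as follows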
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province/${dt}",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
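To make the effect of the ${dt} placeholder in this configuration concrete, here is a small illustration of the substitution. It is only a sketch of the visible behavior and does not reproduce how DataX implements it internally; the function name render_placeholders is hypothetical.

```python
import re


def render_placeholders(text, params):
    """Replace each ${name} in text with params[name].

    Placeholders whose name is missing from params are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return params.get(name, match.group(0))

    return re.sub(r"\$\{(\w+)\}", substitute, text)


print(render_placeholders("/base_province/${dt}", {"dt": "2020-06-14"}))
# prints /base_province/2020-06-14
```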
2) Submit the job
(1) Create the target path
[atguigu@hadoop102 datax]$ hadoop fs -mkdir /base_province/2020-06-14
(2) Go to the DataX root directory
[atguigu@hadoop102 datax]$ cd /opt/module/datax
(3) Run the following command
[atguigu@hadoop102 datax]$ python bin/datax.py -p"-Ddt=2020-06-14" job/base_province.json
3) View the results
[atguigu@hadoop102 datax]$ hadoop fs -ls /base_province
Found 2 items
drwxr-xr-x - atguigu supergroup 0 2021-10-15 21:41 /base_province/2020-06-14
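For a scheduled run, the submit command with its date parameter could be assembled by a script instead of typed by hand. The sketch below builds the same argument list as the command used in this section. The function name build_datax_command is hypothetical, and joining several -D definitions with spaces inside one -p argument is an assumption; only the single-parameter form appears in this article.

```python
import datetime


def build_datax_command(job_file, params):
    """Build the argument list for bin/datax.py with -p"-Dkey=value" parameters."""
    # The shell form -p"-Ddt=2020-06-14" reaches datax.py as one argument,
    # so the same single argument is assembled here.
    # Assumption: several parameters are separated by spaces inside that argument.
    defines = " ".join("-D%s=%s" % (key, value) for key, value in params.items())
    return ["python", "bin/datax.py", "-p" + defines, job_file]


dt = datetime.date(2020, 6, 14).isoformat()
command = build_datax_command("job/base_province.json", {"dt": dt})
print(command)
```

The resulting list can be handed to subprocess.run(command, cwd="/opt/module/datax", check=True) so that the relative paths resolve against the DataX root directory.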
4.3 Example: Synchronizing HDFS Data to MySQL
Requirement: synchronize the data in the /base_province directory on HDFS to the test_province table in the MySQL gmall database.
Analysis: this requires HDFSReader and MySQLWriter.
1) Write the configuration file
(1) Create the configuration file test_province.json
[atguigu@hadoop102 ~]$ vim /opt/module/datax/job/test_province.json
(2) The content of the configuration file is as follows
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "000000",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&characterEncoding=utf-8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
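Hand-edited job files are easy to break with a stray comma or a missing bracket, so it can help to check that the file is valid JSON before submitting it. A minimal sketch using Python's standard json module follows; the helper name find_json_error is hypothetical, and the check only covers JSON syntax, not whether DataX accepts the parameters.

```python
import json


def find_json_error(text):
    """Return None if text is valid JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return "line %d, column %d: %s" % (err.lineno, err.colno, err.msg)
    return None
```

For example, find_json_error(open("job/test_province.json").read()) returns None for a well-formed file and a line and column hint otherwise.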
3) Submit the job
(1) Create the gmall.test_province table in MySQL
DROP TABLE IF EXISTS `test_province`;
CREATE TABLE `test_province` (
`id` bigint(20) NOT NULL,
`name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`region_id` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`area_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`iso_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`iso_3166_2` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
(2) Go to the DataX root directory
[atguigu@hadoop102 datax]$ cd /opt/module/datax
(3) Run the following command
[atguigu@hadoop102 datax]$ python bin/datax.py job/test_province.json