Configuring DataX

1. Configuring DataX

1) Download the DataX package and upload it to /opt/software on hadoop102
Download URL: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
2) Extract datax.tar.gz to /opt/module
[atguigu@hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module/
3) Run the self-check with the following command
[atguigu@hadoop102 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json
If output like the following appears, the installation succeeded
……
2021-10-12 21:51:12.335 [job-0] INFO  JobContainer - 
Job start time                  : 2021-10-12 21:51:02
Job end time                    : 2021-10-12 21:51:12
Total elapsed time              :                 10s
Average traffic                 :          253.91KB/s
Record write speed              :          10000rec/s
Total records read              :              100000
Total read/write failures       :                   0
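The extracted package ships all of its Readers and Writers as plugins under the DataX home directory. A quick way to see which ones are available before writing a job file is to list the plugin directories (a minimal sketch, assuming the /opt/module/datax layout used above):

# List the Reader and Writer plugins bundled with this DataX release
ls /opt/module/datax/plugin/reader
ls /opt/module/datax/plugin/writer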



2. DataX Examples

DataX is straightforward to use: pick the Reader and Writer that match your data source and destination, describe them in a single JSON file, and then submit the sync job with the following command.

[atguigu@hadoop102 datax]$ python bin/datax.py path/to/your/job.json
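If you are unsure how the JSON should be laid out for a given Reader/Writer pair, the stock datax.py can print a configuration template via its -r and -w options (hedged: verify these options against your DataX version); for example:

# Print a job template combining MySQL Reader with HDFS Writer
# (run from the DataX root directory; copy the printed JSON into a job file and fill it in)
python bin/datax.py -r mysqlreader -w hdfswriter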

2.1 MySQLReader: TableMode

[gpb@hadoop102 datax]$ cd job/
[gpb@hadoop102 job]$ vim base_province.json

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "where": "id>=3",
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "table": [
                                    "base_province"
                                ]
                            }
                        ],
                        "password": "000000",
                        "splitPk": "",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

python bin/datax.py job/base_province.json
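Before submitting, it is worth making sure the target HDFS directory exists, and afterwards spot-checking the gzip-compressed output; a minimal sketch (the pre-create step is an assumption about HDFS Writer behaviour in this setup, not something stated above):

# Pre-create the directory referenced by the "path" setting above (assumption: HDFS Writer will not create it)
hadoop fs -mkdir -p /base_province
# After the job finishes, the files are gzip-compressed text, so decompress while viewing
hadoop fs -cat /base_province/* | zcat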

2.2 MySQLReader: QuerySQLMode

1) Write the configuration file
(1) Edit the configuration file base_province.json
[atguigu@hadoop102 ~]$ vim /opt/module/datax/job/base_province.json
(2) The configuration file content is as follows


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}


2) Submit the job
(1) Clear any historical data
[atguigu@hadoop102 datax]$ hadoop fs -rm -r -f /base_province/*
(2) Go to the DataX root directory
[atguigu@hadoop102 datax]$ cd /opt/module/datax 
(3) Run the following command
[atguigu@hadoop102 datax]$ python bin/datax.py job/base_province.json
3) Check the results
(1) DataX log output
2021-10-13 11:13:14.930 [job-0] INFO  JobContainer - 
Job start time                  : 2021-10-13 11:13:03
Job end time                    : 2021-10-13 11:13:14
Total elapsed time              :                 11s
Average traffic                 :               66B/s
Record write speed              :              3rec/s
Total records read              :                  32
Total read/write failures       :                   0
(2) View the HDFS files
[atguigu@hadoop102 datax]$ hadoop fs -cat /base_province/* | zcat
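To cross-check the total records read reported in the log against what actually landed on HDFS, a quick row count can be run (a sketch, assuming the gzip text layout configured above):

# Count the rows written to HDFS and compare with the job log's record total
hadoop fs -cat /base_province/* | zcat | wc -l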


2.3 Passing Parameters to DataX

Offline data sync jobs typically run on a daily schedule, so the target path on HDFS usually includes a date level to separate each day's data. In other words, the target path changes every day, which means the value of the HDFS Writer path parameter in the DataX configuration file must be dynamic. This is what DataX's parameter-passing feature is for.
It works as follows: reference a parameter in the JSON configuration file as ${param}, and pass its value when submitting the job with -p"-Dparam=value". A concrete example follows.
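For a daily schedule this usually means computing the date in a small wrapper script, making sure the dated HDFS directory exists, and handing the date to DataX through -p. The sketch below is illustrative only: the script name datax_daily.sh is hypothetical, and the pre-create step is an assumption rather than part of the tutorial.

#!/bin/bash
# Hypothetical wrapper (e.g. datax_daily.sh) for a daily base_province sync.
# Default to yesterday's date; allow an explicit date as the first argument.
dt=${1:-$(date -d '-1 day' +%F)}

# Pre-create the dated target directory (assumption: HDFS Writer expects it to exist)
hadoop fs -mkdir -p /base_province/"$dt"

# Fill the ${dt} placeholder used by the job JSON below
python /opt/module/datax/bin/datax.py -p"-Ddt=$dt" /opt/module/datax/job/base_province.json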
1) Write the configuration file
(1) Edit the configuration file base_province.json
[atguigu@hadoop102 ~]$ vim /opt/module/datax/job/base_province.json
(2) The configuration file content is as follows
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province/${dt}",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
2) Submit the job
(1) Create the target path
[atguigu@hadoop102 datax]$ hadoop fs -mkdir /base_province/2020-06-14
(2) Go to the DataX root directory
[atguigu@hadoop102 datax]$ cd /opt/module/datax 
(3) Run the following command
[atguigu@hadoop102 datax]$ python bin/datax.py -p"-Ddt=2020-06-14" job/base_province.json
3) Check the results
[atguigu@hadoop102 datax]$ hadoop fs -ls /base_province
Found 2 items
drwxr-xr-x   - atguigu supergroup          0 2021-10-15 21:41 /base_province/2020-06-14
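Once the parameterized job runs correctly by hand, it can be scheduled, for example with cron calling the hypothetical wrapper script sketched in the previous section (the path and time below are illustrative, not from the tutorial):

# crontab entry: run the daily sync at 00:30 and keep a log for troubleshooting
30 0 * * * /opt/module/datax/job/datax_daily.sh >> /tmp/datax_daily.log 2>&1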

3. Syncing HDFS Data to MySQL

Requirement: sync the data under the /base_province directory on HDFS into the test_province table of the MySQL gmall database.
Analysis: this requires HDFS Reader and MySQL Writer.



1) Write the configuration file
(1) Create the configuration file test_province.json
[atguigu@hadoop102 ~]$ vim /opt/module/datax/job/test_province.json
(2) The configuration file content is as follows
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "000000",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&characterEncoding=utf-8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
2) Submit the job
(1) Create the gmall.test_province table in MySQL
DROP TABLE IF EXISTS `test_province`;
CREATE TABLE `test_province`  (
  `id` bigint(20) NOT NULL,
  `name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `region_id` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `area_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_3166_2` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
(2) Go to the DataX root directory
[atguigu@hadoop102 datax]$ cd /opt/module/datax 
(3) Run the following command
[atguigu@hadoop102 datax]$ python bin/datax.py job/test_province.json 
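After the job finishes, the result can be checked directly in MySQL; a minimal sketch, assuming the mysql client is available on hadoop102 and using the credentials from the job file. Since writeMode is set to replace, re-running the job should overwrite rows with the same primary key rather than duplicate them.

# Count and preview the rows synced into gmall.test_province
mysql -hhadoop102 -uroot -p000000 -e "select count(*) from gmall.test_province; select * from gmall.test_province limit 5;"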


Reposted from blog.csdn.net/qq_45972323/article/details/132371187