DataX Data Masking Platform
Development and Experimentation | Mainland China | Reference with Caution | Single Machine | Concurrency
DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.
Data is transferred in a star topology: each source and target connects through the DataX framework rather than point-to-point.
Environment Setup
Software and Environment Overview
DataX host: Ubuntu 16.04, 1 GHz CPU, 1 GB RAM
openjdk version "1.8.0_162"
DataX
Data-receiving host: Ubuntu 16.04, 1 GHz CPU, 1 GB RAM
openjdk version "1.8.0_162"
sudo apt-get install openjdk-8-jdk
apache-hive-1.2.2
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/stable/apache-hive-1.2.2-bin.tar.gz
tar -zxvf apache-hive-1.2.2-bin.tar.gz
hadoop-2.9.0
wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz
tar -zxvf hadoop-2.9.0.tar.gz
MySQL Server 5.7.21
sudo apt-get install mysql-server
sudo apt-get install mysql-client
sudo apt-get install libmysqlclient-dev
For DataX itself, see the official introduction.
Hadoop Configuration
For installation, refer to the official setup guide.
Choose between a full cluster deployment and Single Cluster (pseudo-distributed) mode as you see fit.
Hive Configuration
See this [blog post](https://blog.csdn.net/pucao_cug/article/details/71773665).
This step requires patience; configure each setting carefully.
DataX Configuration
Job file: mysqlToHDFS.json
Please read carefully: https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md
Be sure to change the database name and table name below to match your own machine.
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "你的用户名",
"password": "你的密码",
"column": [
"id",
"name"
],
"splitPk": "id",
"connection": [
{
"table": [
"你的表"
],
"jdbcUrl": [
"jdbc:mysql://127.0.0.1:3306/你的数据库名"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://127.0.0.1:9000",
"fileType": "text",
"path": "/user/hive/warehouse/hdfswriter.db/text_table",
"fileName": "text_table",
"column": [
{
"name": "id",
"type": "int"
},
{
"name": "name",
"type": "string"
}
],
"writeMode": "append",
"fieldDelimiter": "\t",
}
}
}
]
}
}
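With the job file saved, the job is launched through DataX's standard `datax.py` launcher. A minimal invocation, assuming DataX is unpacked under `~/datax` and the job file sits in its `job/` directory (both paths are assumptions for your setup):

```shell
# Run the MySQL -> HDFS sync job with the DataX launcher.
# ~/datax and job/mysqlToHDFS.json are assumed paths; adjust to your machine.
cd ~/datax
python bin/datax.py job/mysqlToHDFS.json
```

On success, DataX prints a job summary including the record count and any dirty-data statistics governed by the `errorLimit` settings.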
Q: After importing data successfully with the method above, a `select` query on the table in Hive returns no rows, yet the file contents are visible in Hadoop. What is going on?
A: `fileName` needs to be consistent with the `text_table` table name.
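More generally, Hive only sees the written rows when the target table exists and its storage format, field delimiter, and warehouse location all match what hdfswriter produced. A hypothetical DDL matching the writer settings above (the `hdfswriter` database and `text_table` names mirror the job file; adjust to your own schema):

```shell
# Create a Hive table whose layout matches the hdfswriter output:
# text files under /user/hive/warehouse/hdfswriter.db/text_table,
# tab-delimited, columns (id INT, name STRING).
hive -e "
CREATE TABLE IF NOT EXISTS hdfswriter.text_table (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;"
```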
Attempting Data Masking with a Transformer
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 4,
"percentage": 0.03
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "账户",
"password": "密码",
"column": [ "id", "name" ],
"splitPk": "id",
"connection": [ { "table": [ "readtable" ], "jdbcUrl": [ "jdbc:mysql://127.0.0.1:3306/datax" ] } ] }
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://127.0.0.1:9000",
"fileType": "text",
"path": "/user/hive/warehouse/hdfswriter.db/text_table",
"fileName": "text_table",
"column": [ { "name": "id", "type": "int" }, { "name": "name", "type": "string" } ],
"writeMode": "append",
"fieldDelimiter": "\t" }
},
"transformer": [
{
"name": "dx_replace",
"parameter":
{
"columnIndex":1,
"paras":["2","2","**"] }
}
]
}
]
}
}
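The `dx_replace` transformer above applies to the column at `columnIndex` 1 (the `name` field), and its `paras` read as: starting at character offset 2, replace 2 characters with the string `**`. The effect can be sketched with plain shell substring expansion (the sample value `alibaba` is just an illustration, not data from the job):

```shell
# Simulate dx_replace with paras ["2","2","**"]:
# keep the first 2 characters, substitute 2 characters with "**",
# then keep the rest of the string.
name="alibaba"
masked="${name:0:2}**${name:4}"
echo "$masked"   # prints al**aba
```

Because the masking happens inside the DataX channel, only the masked value ever reaches HDFS; the source MySQL table is untouched.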