[Technical Research Report] DataX: An Offline Heterogeneous Data Synchronization Framework


DataX as a Data Masking Platform

DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It provides efficient data synchronization between all kinds of heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Data is transferred in a star topology: DataX acts as the hub between reader and writer plugins, so supporting a new data source only requires one plugin rather than a pairwise connector to every other source.

Environment Setup

Software and Environment Overview

DataX host: Ubuntu 16.04, 1 GHz CPU, 1 GB RAM

  • openjdk version "1.8.0_162"

  • DataX

Data receiving host: Ubuntu 16.04, 1 GHz CPU, 1 GB RAM

For DataX itself, see the official introduction.
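
For reference, here is a minimal sketch of getting DataX running from the prebuilt tarball; the download URL is the one published in the official Quick Start and may change, so verify it against the GitHub README:

    # DataX needs a JDK (1.8 here) and Python 2 on the PATH
    wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
    tar -zxvf datax.tar.gz -C /opt
    # smoke test with the self-check job bundled in the package
    python /opt/datax/bin/datax.py /opt/datax/job/job.json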

Hadoop Configuration

For installation, follow a standard Hadoop setup guide.

Whether you deploy a full cluster or a single-node (pseudo-distributed) setup is up to you. Either way, note the fs.defaultFS value you configure; the DataX job below assumes hdfs://127.0.0.1:9000.
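
If you are unsure what your installation ended up with, one quick way to check (assuming the Hadoop client tools are on the PATH):

    # print the NameNode address Hadoop is configured with
    hdfs getconf -confKey fs.defaultFS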

Hive Configuration

Refer to this [blog post](https://blog.csdn.net/pucao_cug/article/details/71773665).

This step takes patience; configure everything carefully.
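
Note that HdfsWriter only writes files; it does not create Hive metadata, so the target database and table must exist before the job runs. A minimal sketch matching the names used in the job file below (hdfswriter.db and text_table; adjust to your setup):

    hive -e "
    CREATE DATABASE IF NOT EXISTS hdfswriter;
    CREATE TABLE IF NOT EXISTS hdfswriter.text_table (
        id   INT,
        name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;"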

DataX Job Configuration

Job file: mysqlToHDFS.json
Read the HdfsWriter documentation carefully: https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md

Be sure to adapt the credentials, database name, and table name below to your own machine:
{
    "job": {
        "setting": {
            "speed": {
                 "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "你的用户名",
                        "password": "你的密码",
                        "column": [
                            "id",
                            "name"
                        ],
                        "splitPk": "id",
                        "connection": [
                            {
                                "table": [
                                    "你的表"
                                ],
                                "jdbcUrl": [
     "jdbc:mysql://127.0.0.1:3306/你的数据库名"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://127.0.0.1:9000",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/hdfswriter.db/text_table",
                        "fileName": "text_table",
                        "column": [
                      {
                         "name": "id",
                         "type": "int"
                     },
                     {
                         "name": "name",
                         "type": "string"
                     } 
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "\t",

                    }
                }
            }
        ]
    }
}
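
With the job file saved, the job is launched through the datax.py entry script. A minimal sketch, assuming DataX was unpacked to /opt/datax and the job file is in the current directory:

    python /opt/datax/bin/datax.py mysqlToHDFS.json

On completion DataX prints a summary with the total records read and the failure count, which is what the errorLimit setting above is checked against.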

Problem: after the import succeeds with the method above, a SELECT on the table in Hive returns no data, yet the file contents are visible in Hadoop. What is going on?

Answer: fileName needs to match the table name, text_table.
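
A quick way to sanity-check where the data actually is, using the paths and names from the job above:

    # the files written by hdfswriter should sit in the table's warehouse directory
    hdfs dfs -ls /user/hive/warehouse/hdfswriter.db/text_table
    # once the file layout and table metadata line up, Hive sees the rows
    hive -e "SELECT * FROM hdfswriter.text_table LIMIT 10;"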

Experimenting with a Transformer for Data Masking

{
    "job": {
        "setting": {
            "speed": {
                 "channel": 3
            },
            "errorLimit": {
                "record": 4,
                "percentage": 0.03
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "账户",
                        "password": "密码",
                        "column": [ "id", "name" ],
                        "splitPk": "id",
                        "connection": [ { "table": [ "readtable" ], "jdbcUrl": [ "jdbc:mysql://127.0.0.1:3306/datax" ] } ] }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://127.0.0.1:9000",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/hdfswriter.db/text_table",
                        "fileName": "text_table",
                        "column": [ { "name": "id", "type": "int" }, { "name": "name", "type": "string" } ],
                        "writeMode": "append",
                        "fieldDelimiter": "\t" }
                },
                "transformer": [
                                    {
                                        "name": "dx_replace",
                                        "parameter": 
                                            {
                                            "columnIndex":1,
                                            "paras":["2","2","**"] }  
                                    }
                                ]
            }
        ]
    }
}
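
Per the transformer documentation (reference 2 below), columnIndex is the zero-based index of the field to transform, and dx_replace takes three paras: the start offset within the value, the length of the slice to replace, and the replacement string. So the job above masks part of each name value (column index 1), replacing two characters starting at offset 2 with "**". It is run the same way as before; mysqlToHDFS_mask.json is just a hypothetical name for this variant of the job file:

    # mysqlToHDFS_mask.json is a hypothetical name for the masking job above
    python /opt/datax/bin/datax.py mysqlToHDFS_mask.json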

References

  1. https://github.com/alibaba/DataX/blob/master/introduction.md
  2. https://github.com/alibaba/DataX/blob/master/transformer/doc/transformer.md
