Using DataX to Transfer Tables Between Hive and MySQL

Copyright notice: this is an original article by the author; when reposting, please include the original URL: https://blog.csdn.net/Dr_Guo/article/details/82222151

1. Introduction to DataX (excerpted from https://github.com/alibaba/DataX)

DataX

DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQLServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS, and more.

Features

As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target. In principle, the DataX framework can synchronize data between arbitrary types of data sources. The plugin system also works as an ecosystem: each newly added data source immediately becomes interoperable with all existing ones.

Detailed Introduction to DataX

See: DataX-Introduction

Quick Start

Download: DataX download page

See: Quick Start
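
For reference, the prebuilt tarball can be fetched and unpacked directly on the server (this download URL is the one given on the Quick Start page and may change over time):

# Download and unpack the prebuilt DataX distribution
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -zxvf datax.tar.gz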

Supported Data Channels

DataX already has a fairly complete plugin ecosystem: mainstream relational databases, NoSQL stores, and big-data computing systems are all supported. The currently supported data sources are listed below; for details, see the DataX data source reference guide.

| Type | Data source |
| --- | --- |
| RDBMS (relational databases) | MySQL, Oracle, SQLServer, PostgreSQL, DRDS, generic RDBMS (supports all relational databases) |
| Alibaba Cloud data warehouse storage | ODPS, ADS, OSS, OCS |
| NoSQL data storage | OTS, Hbase0.94, Hbase1.1, MongoDB, Hive |
| Unstructured data storage | TxtFile, FTP, HDFS, Elasticsearch |

(Per-source Reader/Writer support and plugin documentation are listed in the reference guide linked above.)

2. The Walkthrough

2.1 Importing a Hive Table into MySQL

1. Download DataX and upload it to the server.
2. Unpack it and run the bundled self-check job:
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
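
datax.py can also print a configuration template for a given reader/writer pair, which is a convenient starting point for the next step (the -r/-w options ship with the stock datax.py):

# Print a skeleton job config combining hdfsreader and mysqlwriter
python {YOUR_DATAX_HOME}/bin/datax.py -r hdfsreader -w mysqlwriter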
3. Once the self-check runs cleanly, create a your_job.json configuration file based on the template and your own needs; hive2mysql serves as the example here.
(Note that reading a Hive table really means reading the files Hive stores on HDFS.)
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/apps/hive/warehouse/test_table/*",
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "column": [ { "index": 0, "type": "string" }, { "index": 1, "type": "long" } ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001" }

                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "username",
                        "password": "password",
                        "column": [ "word", "cnt" ],
                        "session": [ "set session sql_mode='ANSI'" ],
                        "preSql": [ "delete from test_table" ],
                        "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.10.116:3306/test_datax?useUnicode=true&characterEncoding=gbk", "table": [ "test_table" ] } ] }
                }
            }
        ]
    }
}
4. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/hive2mysql.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2018-08-29 23:19:46.674 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-29 23:19:46.691 [main] INFO  Engine - the machine info  => 

    osInfo: Oracle Corporation 1.8 25.161-b12
    jvmInfo:    Linux amd64 2.6.32-696.23.1.el6.x86_64
    cpu num:    1

    totalPhysicalMemory:    -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names    [Copy, MarkSweepCompact]

    MEMORY_NAME                    | allocation_size                | init_size                      
    Eden Space                     | 273.06MB                       | 273.06MB                       
    Code Cache                     | 240.00MB                       | 2.44MB                         
    Survivor Space                 | 34.13MB                        | 34.13MB                        
    Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
    Metaspace                      | -0.00MB                        | 0.00MB                         
    Tenured Gen                    | 682.69MB                       | 682.69MB                       


2018-08-29 23:19:46.732 [main] INFO  Engine - 
...
2018-08-29 23:20:00.489 [job-0] INFO  JobContainer - 
     [total cpu info] => 
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%


     [total gc info] => 
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
         MarkSweepCompact     | 1                  | 1                  | 1                  | 0.071s             | 0.071s             | 0.071s             

2018-08-29 23:20:00.490 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-29 23:20:00.499 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 31267 bytes | Speed 3.05KB/s, 93 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.285s | Percentage 100.00%
2018-08-29 23:20:00.500 [job-0] INFO  JobContainer - 
Job start time                  : 2018-08-29 23:19:46
Job end time                    : 2018-08-29 23:20:00
Total elapsed time              :                 13s
Average throughput              :            3.05KB/s
Record write speed              :             93rec/s
Total records read              :                 939
Total read/write failures       :                   0
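
To double-check on the MySQL side, a quick count against the target table should match the 939 records read above (a sketch, using the host and credentials from the job configuration; adjust for your environment):

# Count the rows DataX just wrote into test_datax.test_table
mysql -h 192.168.10.116 -u username -p -e "SELECT COUNT(*) FROM test_table;" test_datax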

**Important: the data Hive stores on HDFS is delimited by '\001' by default. If you open a file in vim, the separator shows up as ^A, but in the job configuration the delimiter must be written as "fieldDelimiter": "\u0001"!**
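
If you are unsure which delimiter a Hive table's files actually use, inspect the raw bytes on HDFS; cat -A makes control characters visible, so \001 shows up as ^A (path as in the reader configuration above):

# Show the first rows with control characters made visible
hdfs dfs -cat /apps/hive/warehouse/test_table/* | head -n 3 | cat -A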


2.2 Importing a MySQL Table into Hive

1. Create the table in Hive (stored as a text file):
hive> create table mysql_table(word string, cnt int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
OK
Time taken: 0.194 seconds
hive> select * from mysql_table limit 10;
OK
Time taken: 0.162 seconds
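
The hdfswriter "path" in the configuration below must point at this table's warehouse directory. You can confirm the location from the table metadata (a sketch, assuming the hive CLI is on the PATH):

# Print the table's HDFS location
hive -e "DESCRIBE FORMATTED mysql_table;" | grep -i location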
2. Write the mysql2hive.json configuration file:
{
    "job": {
        "setting": {
            "speed": {
                 "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "123456",
                        "column": [ "word", "cnt" ],
                        "splitPk": "cnt",
                        "connection": [ { "table": [ "test_table" ], "jdbcUrl": [ "jdbc:mysql://192.168.10.116:3306/test_datax" ] } ] }
                },
               "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "fileType": "text",
                        "path": "/apps/hive/warehouse/mysql_table",
                        "fileName": "mysql_table",
                        "column": [ { "name": "word", "type": "string" }, { "name": "cnt", "type": "int" } ],
                        "writeMode": "append",
                        "fieldDelimiter": ",",
                        "compress":"gzip" }
                }
            }
        ]
    }
}
3. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/mysql2hive.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2018-08-30 01:03:36.399 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-30 01:03:36.415 [main] INFO  Engine - the machine info  => 

    osInfo: Oracle Corporation 1.8 25.161-b12
    jvmInfo:    Linux amd64 2.6.32-696.23.1.el6.x86_64
    cpu num:    1

    totalPhysicalMemory:    -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names    [Copy, MarkSweepCompact]

    MEMORY_NAME                    | allocation_size                | init_size                      
    Eden Space                     | 273.06MB                       | 273.06MB                       
    Code Cache                     | 240.00MB                       | 2.44MB                         
    Survivor Space                 | 34.13MB                        | 34.13MB                        
    Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
    Metaspace                      | -0.00MB                        | 0.00MB                         
    Tenured Gen                    | 682.69MB                       | 682.69MB                       


2018-08-30 01:03:36.469 [main] INFO  Engine - 
...
2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - 
     [total cpu info] => 
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%


     [total gc info] => 
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 1                  | 1                  | 1                  | 0.346s             | 0.346s             | 0.346s             
         MarkSweepCompact     | 1                  | 1                  | 1                  | 0.206s             | 0.206s             | 0.206s             

2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-30 01:03:52.096 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 19462 bytes | Speed 1.90KB/s, 93 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.975s |  All Task WaitReaderTime 0.022s | Percentage 100.00%
2018-08-30 01:03:52.101 [job-0] INFO  JobContainer - 
Job start time                  : 2018-08-30 01:03:36
Job end time                    : 2018-08-30 01:03:52
Total elapsed time              :                 15s
Average throughput              :            1.90KB/s
Record write speed              :             93rec/s
Total records read              :                 939
Total read/write failures       :                   0
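
Because the writer was configured with "compress": "gzip", the files that land in the warehouse directory are gzip-compressed text, which Hive reads transparently. You can confirm on HDFS before querying (a sketch; the exact file names are generated by DataX):

# List the generated files and peek at their decompressed contents
hdfs dfs -ls /apps/hive/warehouse/mysql_table
hdfs dfs -cat /apps/hive/warehouse/mysql_table/* | zcat | head -n 3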
4. Verify that the Hive table now contains data:
hive> select * from mysql_table limit 10;
OK
at-spi-python-1.28.1    1
mesa-dri-drivers-9.2    1
m4-1.4.13-5.el6.x86_    1
libXxf86vm-1.1.2-2.e    1
libuuid-2.17.2-12.14    1
httpd-2.2.15-29.el6.    1
libgweather-2.28.0-5    1
lvm2-2.02.100-8.el6.    1
xmlrpc-c-1.16.24-121    1
parted-2.1-21.el6.x8    1
Time taken: 0.148 seconds, Fetched: 10 row(s)
hive> select * from mysql_table order by cnt desc limit 10;
Query ID = hdfs_20180830010430_68021b2b-9bd8-44a0-92df-dbe0c717053a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (Executing on YARN cluster with App id application_1535363634130_0005)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.63 s    
--------------------------------------------------------------------------------
OK
Installing  918
warning:    2
*** 2
tcp_wrappers-libs-7.    1
m4-1.4.13-5.el6.x86_    1
httpd-2.2.15-29.el6.    1
libgweather-2.28.0-5    1
libXxf86vm-1.1.2-2.e    1
xmlrpc-c-1.16.24-121    1
parted-2.1-21.el6.x8    1
Time taken: 33.713 seconds, Fetched: 10 row(s)

Notes:

1. Errors you may encounter

2018-08-29 22:37:54.327 [job-0] ERROR JobContainer - Exception when job run com.alibaba.datax.common.
exception.DataXException: Code:[DBUtilErrorCode-01], Description:[Failed to obtain table column information.] -
Failed to fetch column metadata for table test_table. Please have a DBA check the database and table. -
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'id' in 'field list'

Cause: a column name defined in "column": [ ] does not match the MySQL table.
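
To see the actual column names on the MySQL side, describe the table and make sure every entry in "column" matches exactly (credentials as in the job configuration):

# List the real column names of the source table
mysql -h 192.168.10.116 -u username -p -e "SHOW COLUMNS FROM test_table;" test_datax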


DataX's intelligent analysis suggests the most likely cause of this failure is: com.alibaba.datax.common.exception.DataXException:
Code:[UnstructuredStorageReader-03], Description:[An illegal parameter value was supplied.] - Only single-character
delimiters are supported; the configured delimiter is: ['01']

Cause: a delimiter problem; see the important note above.


2. MysqlWriter type conversions for MySQL

| DataX internal type | MySQL data type |
| --- | --- |
| Long | int, tinyint, smallint, mediumint, bigint, year |
| Double | float, double, decimal |
| String | varchar, char, tinytext, text, mediumtext, longtext |
| Date | date, datetime, timestamp, time |
| Boolean | bit, bool |
| Bytes | tinyblob, mediumblob, blob, longblob, varbinary |

* The bit type currently has no defined conversion.

3. MysqlReader type conversions for MySQL

| DataX internal type | MySQL data type |
| --- | --- |
| Long | int, tinyint, smallint, mediumint, bigint |
| Double | float, double, decimal |
| String | varchar, char, tinytext, text, mediumtext, longtext, year |
| Date | date, datetime, timestamp, time |
| Boolean | bit, bool |
| Bytes | tinyblob, mediumblob, blob, longblob, varbinary |

Please note:

  • Field types other than those listed above are not supported.
  • tinyint(1) is treated by DataX as an integer.
  • year is treated by DataX as a string.
  • bit handling is undefined behavior in DataX.
