Sqoop从Mysql导出到Hive的几个需要注意的地方

背景：
在此记录下使用 sqoop将数据从mysql导出到 hive过程中需要注意的几个地方，包括 NULL值处理、增量导入、parquet格式的日期类型

注：这种数据传输工具其实没必要深究所有用法，用到哪块去研究下就行了

Sqoop版本：1.4.6 - CDH5.13.0

一、NULL处理

1. 观察测试

先看执行语句：
sqoop import --connect jdbc:mysql://10.3.1.4:3308/anti_fraud --username catchd --password Edrayd8^u --table c_apply_info --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite --create-hive-table --hive-database test --hive-table c_apply_info --delete-target-dir -m 1

注：-m 1 启动 1个map任务来并行导入数据，默认是4个，最好不要将数字设置为高于集群的节点数

这是mysql转hive存储为text类型的基本导出语句

我们注意看mysql中的NULL值在导入成Hive表后是如何存在的：
这里写图片描述

我们测试下这个null到底是个啥：

执行 hive语句：

select * from test.c_apply_info where remark is null

发现 无法显示 null字符串的记录！

select * from test.c_apply_info where remark = 'null'

这种方式才能查找出，说明确实是null字符串，并且在HDFS中也是以null字符串形式存储的！

2.null字符串转为NULL

添加下面两条参数可以实现：

param	含义
–null-string `<null-string>`	如果指定列为字符串类型，使用指定字符串替换值为null的该类列的值
–null-non-string `<null-string>`	如果指定列为非字符串类型，使用指定字符串替换值为null的该类列的值

添加后再测试：

sqoop import --connect jdbc:mysql://10.3.1.4:3308/anti_fraud --username catchd--password Edrayd8^u --table c_apply_info --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite --create-hive-table --hive-database test --hive-table c_apply_info --delete-target-dir -m 1 --null-string '\\N' --null-non-string '\\N'

可以看到，加上–null-string语句后，‘null’ 字符串变成了NULL:
这里写图片描述

执行 hive语句：

select * from test.c_apply_info where remark is null

发现这样可以成功查询出NULL值

select * from test.c_apply_info where remark = 'NULL'

而这种方式并不能，说明确实是NULL，而不是字符串

由于我们 –null-string 后添加的是 '\\N'，那么通过这个能不能查出NULL值呢：

select * from test.c_apply_info where remark = '\N'
select * from test.c_apply_info where remark = '\\N'

发现上面这两种方式并不行

注：经测试，–null-string 后添加 '\\N' 或者 'NULL' 都可以！即都可以在hive表中通过is null 的方式查出NULL值，但不同的是--null-string '\\N'实际在HDFS中是以\N的存储NULL，而--null-string 'NULL'实际在HDFS中是以NULL的存储NULL

3.扩展

我们可以通过 serialization.null.format 属性指定 hive的NULL值识别：

alter table test.c_apply_info SET SERDEPROPERTIES('serialization.null.format' = 'aaa');

通过该语句，我们可以把hive表中的‘aaa’字符串转化为NULL值，我们在查询hive表时，原本显示的aaa字符串就会变为NULL！但实际HDFS存储时还是aaa！

注：该属性value不能为中文！

通过这个也可以实现我们对NULL值处理的某些需求！

二、增量导入

核心参数：

param	含义
–check-column	用来指定一些列，这些列在增量导入时用来检查这些数据是否作为增量数据进行导入，和关系型数据库中的自增字段及时间戳类似. 注意:这些被指定的列的类型不能使任意字符类型（在关系数据库中），如char、varchar等类型都是不可以的，同时–check-column可以去指定多个列
–incremental	用来指定增量导入的模式，两种模式分别为Append和Lastmodified
–last-value	指定上一次导入中检查列指定字段最大值，即会导入比lastvalue指定值大的数据记录

注意：上面三个参数都必须添加！

执行语句：

sqoop import --connect jdbc:mysql://10.3.1.4:3308/anti_fraud --username catchd--password EdrwMd8^u --table c_apply_info --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-database test --hive-table c_apply_info --incremental lastmodified --check-column update_time --last-value '2018-06-27 00:00:00' -m 1 --null-string '\\N' --null-non-string '\\N'

以上语句使用 lastmodified 模式进行增量导入，结果报错：

错误信息：--incremental lastmodified option for hive imports is not supported. Please remove the parameter --incremental lastmodified

错误原因：Sqoop 不支持 mysql转hive时使用 lastmodified 模式进行增量导入，但mysql转HDFS时可以支持该方式！

我们使用append方式导入：

sqoop import --connect jdbc:mysql://10.3.1.4:3308/anti_fraud --username catchd--password EdrwMd8^u --table c_apply_info --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-database test --hive-table c_apply_info --incremental append--check-column update_time --last-value '2018-06-27 00:00:00' -m 1 --null-string '\\N' --null-non-string '\\N'

增量导入成功！

三、指定parquet格式

执行语句：

sqoop import --connect jdbc:mysql://10.3.1.4:3308/anti_fraud --username moercredit --password EdrwM7ayd8^u --table c_apply_info --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite --create-hive-table --hive-database test --hive-table c_apply_info --delete-target-dir -m 1 --null-string '\\N' --null-non-string '\\N' --as-parquetfile --outdir /root/sqoop

在上面的基础上指定了parquet文件格式 和 java中间文件的目录路径outdir

建立完成后查看建表语句：

show CREATE TABLE test.c_apply_info

1   CREATE TABLE `test.c_apply_info`(
2     `id` bigint, 
3     `cust_id` bigint, 
4     `channel_id` bigint, 
5     `product_type` int, 
6     `apply_amount` string, 
7     `apply_period` int, 
8     `repayment_type` int, 
9     `loan_usage` int, 
10    `loan_usage_other` string, 
............
31    `remark` string, 
32    `is_deleted` int)
33  ROW FORMAT SERDE
34    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
35  STORED AS INPUTFORMAT 
36    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
37  OUTPUTFORMAT 
38    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
39  LOCATION
40    'hdfs://cdh1:8020/user/hive/warehouse/test.db/c_apply_info'
41  TBLPROPERTIES (
42    'COLUMN_STATS_ACCURATE'='false', 
43    'avro.schema.url'='hdfs://cdh1:8020/user/hive/warehouse/test.db/c_apply_info/.metadata/schemas/1.avsc', 
44    'kite.compression.type'='snappy', 
45    'numFiles'='0', 
46    'numRows'='-1', 
47    'rawDataSize'='-1', 
48    'totalSize'='0', 
49    'transient_lastDdlTime'='1534771094')

关于：ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
SerDe 是Serialize/Deserialize的简称，用于指定序列化和反序列化
serde包括内置类型：Avro、ORC、RegEx、Thrift、Parquet、CSV、JsonSerde
也可以写一个自定义类继承 AbstractSerde类，重写initialize和deserialize等方法，然后hive语句：add jar MySerde.jar

问题：
发现 mysql 中的 datetime 类型字段create_time和update_time都变成了bigint类型！
原来parquet格式和之前默认Text格式不同，之前是把datetime（mysql）转为string，parquet是把datetime转化为bigint