Copyright notice: This is an original post by the author; do not repost without permission (from the blogger's girlfriend). https://blog.csdn.net/qq_26442553/article/details/85693276
A reader asked me about what looked like a simple query export: after exporting data with INSERT ... DIRECTORY, the file contents display as gibberish whether viewed on HDFS or locally.
insert overwrite directory '/user/finance/hive/warehouse/fdm_sor.db/t_tmp/'
select * from t_tmp;
# View the exported file on HDFS
[robot~]$ hadoop fs -cat /user/finance/hive/warehouse/fdm_sor.db/t_tmp/000000_0.deflate
x1ӕߠ~H~_ᚩ¹ªªūęoRJm컑©ҋi쵤)̽²Y¼x1ӕߠ~H~_ᚩ¹ªªūęoRJm컑©ҋi쵤)̽²Y¼x1ӕߠ~H~_ᚩ¹ªªūęoRJm컑©ҋi쵤)̽²Y¼
Problem analysis: the exported file clearly has a .deflate extension. Suffixes such as .deflate, .gz, .zip, .bz2 and .lzo all indicate compressed files; a file stored without compression normally has no such suffix. So the conclusion is that Hive has output compression enabled and is compressing everything it writes. Check with the commands below: compression is indeed turned on, and the configured codec class implements the DEFLATE algorithm.
SET hive.exec.compress.output;  -- is output compression enabled?
SET mapreduce.output.fileoutputformat.compress.codec;  -- which compression codec is used?
hive (zala.a)> SET hive.exec.compress.output;
hive.exec.compress.output=true
hive (zala.a)> SET mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
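Incidentally, the data is not lost: DefaultCodec writes zlib-format DEFLATE streams, so if you just need to peek at a .deflate export without changing any Hive settings, you can inflate it by hand. A minimal sketch in Python, assuming the file is a single zlib stream (large exports may be split into blocks):

```python
import zlib

# Hadoop's DefaultCodec writes zlib-format DEFLATE data, so a .deflate
# export (assuming one stream, no block boundaries) can be inflated
# with Python's standard zlib module.
def inflate(data: bytes) -> str:
    return zlib.decompress(data).decode("utf-8")

# Round-trip demo: compress two rows the way DefaultCodec would,
# then read them back.
rows = "1111\tfdfsdfrerfwef\n234343\tdfdsfdsaaaa\n"
print(inflate(zlib.compress(rows.encode("utf-8"))) == rows)  # True
```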
Common compression formats in Hive and their codec classes (a later post will cover compression usage in detail):
| Compression format | Codec class |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
So to export readable files, the export has to disable compression. Output compression is off by default in Hive (here it had been turned on); the following setting disables it for the current session:
SET hive.exec.compress.output=false;
insert overwrite directory '/user/finance/hive/warehouse/fdm_sor.db/t_tmp/'
select * from t_tmp;
The result now shows the exported file is plain text:
[robot 3333]$ hadoop fs -cat /user/finance/hive/warehouse/fdm_sor.db/t_tmp/000000_0
1111fdfsdfrerfwef\N\N
234343dfdsfdsaaaa\N\N
33333dfsdfdsabnhh\N\N
4444fdsfsdfaaaaaa\N\N
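About the \N markers: Hive serializes NULL as \N by default, and with no ROW FORMAT clause it separates fields with the invisible \x01 (Ctrl-A) byte, which is why the columns appear glued together under cat. A small sketch of parsing such an export back, assuming the default \x01 delimiter and \N NULL marker:

```python
# Parse one line of a Hive export written without a ROW FORMAT clause:
# fields are \x01-separated and NULL is serialized as \N.
def parse_hive_line(line: str):
    return [None if f == r"\N" else f
            for f in line.rstrip("\n").split("\x01")]

print(parse_hive_line("1111\x01fdfsdfrerfwef\x01\\N\x01\\N\n"))
# ['1111', 'fdfsdfrerfwef', None, None]
```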
2. INSERT ... DIRECTORY query export in detail
Note that an INSERT query export lets you control how the exported files are stored: the file format, the row/column delimiters, and other storage details.
Standard syntax for an INSERT query export:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Multi-insert export syntax (export to several directories from a single scan):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
The row_format clause:
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
Demos:
1. Export an RCFile-stored table to a local directory as a textfile, with '@' as the field delimiter.
SET hive.exec.compress.output=false;
insert overwrite local directory '/home/finance/mytest/3333'
row format delimited fields terminated by '@'
stored as textfile
select * from t_tmp_rc;
Exported result:
[finance@master2-dev 3333]$ cat 000000_0
abc@123@456
abc@123@456
abc@123@456
abc@123@456
abc@123@456
abc@123@456
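Since the export above uses a plain single-character delimiter, any CSV reader can load it back; a minimal sketch with Python's csv module, using the '@' delimiter from the query:

```python
import csv
import io

# The exported file is plain '@'-delimited text, so csv.reader
# handles it once told the delimiter. io.StringIO stands in for
# the exported 000000_0 file here.
data = io.StringIO("abc@123@456\nabc@123@456\n")
rows = list(csv.reader(data, delimiter="@"))
print(rows)  # [['abc', '123', '456'], ['abc', '123', '456']]
```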
2. Export one table to multiple directories while scanning it only once.
SET hive.exec.compress.output=false;
from t_tmp_rc -- good for exporting several extracts from one wide table; the table is scanned only once, which is efficient.
insert overwrite local directory '/home/finance/mytest/3333'
row format delimited fields terminated by '@' -- field delimiter is '@'
stored as textfile
select a ,b+'123',c
insert overwrite local directory '/home/finance/mytest/2222'
row format delimited fields terminated by '*' -- field delimiter is '*'
stored as textfile
select a ,b+'123','标注',c ;
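A note on the b+'123' expression: Hive's arithmetic + coerces string operands to double, so with b = '123' the exported column is 246.0, not the concatenation '123123'. A rough Python sketch of that coercion:

```python
# Hive's arithmetic '+' casts string operands to double, so
# '123' + '123' evaluates to 246.0 rather than concatenating.
def hive_plus(a: str, b: str) -> float:
    return float(a) + float(b)

print(hive_plus("123", "123"))  # 246.0
```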
The results:
[finance@master2-dev mytest]$ cat 2222/000000_0
abc*246.0*标注*456
abc*246.0*标注*456
abc*246.0*标注*456
abc*246.0*标注*456
abc*246.0*标注*456
abc*246.0*标注*456
[finance@master2-dev mytest]$ cat 3333/000000_0
abc@246.0@456
abc@246.0@456
abc@246.0@456
abc@246.0@456
abc@246.0@456
abc@246.0@456