hive 测试存储和压缩

 测试存储和压缩

官网:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

ORC存储方式的压缩:

Key

Default

Notes

orc.compress

ZLIB

high level compression (one of NONE, ZLIB, SNAPPY)

orc.compress.size

262,144

number of bytes in each compression chunk

orc.stripe.size

268,435,456

number of bytes in each stripe

orc.row.index.stride

10,000

number of rows between index entries (must be >= 1000)

orc.create.index

true

whether to create row indexes

orc.bloom.filter.columns

""

comma separated list of column names for which bloom filter should be created

orc.bloom.filter.fpp

0.05

false positive probability for bloom filter (must >0.0 and <1.0)

注意所有关于ORCFile的参数都是在HQL语句的TBLPROPERTIES字段里面出现

1)创建一个非压缩的的ORC存储方式

(1)建表语句

create table log_orc_zlib(

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as orc

tblproperties("orc.compress"="ZLIB");

(2)插入数据

insert into log_orc_zlib select * from log_text;

 

(3)查看插入后数据

hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/ ;

 

7.7 M  /user/hive/warehouse/log_orc_none/000000_0

2)创建一个SNAPPY压缩的ORC存储方式

(1)建表语句

create table log_orc_snappy(

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as orc

tblproperties("orc.compress"="SNAPPY");

 

(2)插入数据

insert into log_orc_snappy select * from log_text;

 

(3)查看插入后数据

hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/ ;

3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0

3)上一节中默认创建的ORC存储方式,导入数据后的大小为

2.8 M  /user/hive/warehouse/log_orc/000000_0

比Snappy压缩的还小。原因是orc存储文件默认采用ZLIB压缩,ZLIB采用的是deflate压缩算法。比snappy压缩的小。

4)存储方式和压缩总结

在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy,lzo。

猜你喜欢

转载自blog.csdn.net/qq_39674417/article/details/113756277