Hive 文件存储格式与数据压缩结合

1 压缩比和查询速度对比

1)TextFile

(1)创建表,存储数据格式为TEXTFILE

create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;

(2)向表中加载数据

load data local inpath '/export/servers/hivedatas/log.data' into table log_text ;

(3)查看表中数据大小

dfs -du -h /user/hive/warehouse/myhive.db/log_text;

2)ORC

(1)创建表,存储数据格式为ORC

create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;

(2)向表中加载数据

insert into table log_orc select * from log_text ;

(3)查看表中数据大小

dfs -du -h /user/hive/warehouse/myhive.db/log_orc;

3)Parquet

(1)创建表,存储数据格式为parquet

create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;

(2)向表中加载数据

insert into table log_parquet select * from log_text ;

(3)查看表中数据大小

dfs -du -h /user/hive/warehouse/myhive.db/log_parquet;

存储文件的压缩比总结:

ORC > Parquet > textFile

4)存储文件的查询速度测试:

1)TextFile

hive (default)> select count(*) from log_text;

Time taken: 21.54 seconds, Fetched: 1 row(s)

2)ORC

hive (default)> select count(*) from log_orc;

Time taken: 20.867 seconds, Fetched: 1 row(s)

3)Parquet

hive (default)> select count(*) from log_parquet;

Time taken: 22.922 seconds, Fetched: 1 row(s)

存储文件的查询速度总结:

ORC > TextFile > Parquet

2 ORC存储指定压缩方式

官网:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

ORC存储方式的压缩:

Key Default Notes
orc.compress ZLIB high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size 262,144 number of bytes in each compression chunk
orc.stripe.size 67,108,864 number of bytes in each stripe
orc.row.index.stride 10,000 number of rows between index entries (must be >= 1000)
orc.create.index true whether to create row indexes
orc.bloom.filter.columns “” comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp 0.05 false positive probability for bloom filter (must >0.0 and <1.0)

1)创建一个非压缩的的ORC存储方式

(1)建表语句

create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

(2)插入数据

insert into table log_orc_none select * from log_text ;

(3)查看插入后数据

dfs -du -h /user/hive/warehouse/myhive.db/log_orc_none;

2)创建一个SNAPPY压缩的ORC存储方式

(1)建表语句

create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

(2)插入数据

insert into table log_orc_snappy select * from log_text ;

(3)查看插入后数据

dfs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy ;

3 存储方式和压缩总结:

​ 在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy

发布了107 篇原创文章 · 获赞 20 · 访问量 2万+

猜你喜欢

转载自blog.csdn.net/beishanyingluo/article/details/105256792