Data Compression in Hive

1. Data File Storage Formats

Below is a brief overview of the storage formats Hive supports:

file_format:
	: SEQUENCEFILE
	| TEXTFILE        -- (Default, depending on hive.default.fileformat configuration)
	| RCFILE          -- (Note: Available in Hive 0.6.0 and later)
	| ORC             -- (Note: Available in Hive 0.11.0 and later)
	| PARQUET         -- (Note: Available in Hive 0.13.0 and later)
	| AVRO            -- (Note: Available in Hive 0.14.0 and later)
	| INPUTFORMAT     input_format_classname OUTPUTFORMAT output_format_classname
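The last form lets you name the underlying Hadoop input/output classes directly. As a sketch, the following is equivalent to `STORED AS TEXTFILE`, spelled out with the standard Hadoop/Hive class names (the table name here is illustrative):

```sql
create table page_views_custom(
track_time string,
url string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```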

Storage formats fall into two families: row-oriented and column-oriented.
(1) ORCFile (Optimized Row Columnar File): supported by Hive/Shark/Spark. Use the ORCFile format for tables with many columns.
(2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is more complex; it was inspired mainly by Dremel. The main highlights of the Parquet storage structure are support for nested data structures and a rich set of efficient encoding and compression algorithms suited to different value distributions.
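To illustrate the nested-structure support mentioned above, here is a hypothetical Hive table (the name and columns are made up for this sketch) whose complex-typed columns Parquet can store natively:

```sql
-- Hypothetical table: array, map, and struct columns map onto
-- Parquet's nested (Dremel-style) storage model
create table page_events_parquet(
session_id string,
tags array<string>,
props map<string,string>,
geo struct<city:string, ip:string>
)
STORED AS PARQUET;
```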
(1) Storing as TEXTFILE

create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;

load data local inpath '/opt/datas/page_views.data' into table page_views ;
dfs -du -h /user/hive/warehouse/page_views/ ;
18.1 M  /user/hive/warehouse/page_views/page_views.data

(2) Storing as ORC

create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;

insert into table page_views_orc select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc/ ;
2.6 M  /user/hive/warehouse/page_views_orc/000000_0

(3) Storing as Parquet

create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;

insert into table page_views_parquet select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet/ ;
13.1 M  /user/hive/warehouse/page_views_parquet/000000_0

(4) Storing as ORC with Snappy compression

create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

insert into table page_views_orc_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;
3.8 M  /user/hive/warehouse/page_views_orc_snappy/000000_0

Why is this larger than the plain ORC table above (3.8 M vs. 2.6 M)? Because ORC compresses with ZLIB by default, which achieves a better compression ratio than Snappy; Snappy trades ratio for faster compression and decompression.
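You can confirm which codec a table is using from within Hive; the file itself can also be inspected from the shell with Hive's orcfiledump tool (available as `hive --orcfiledump` in newer Hive releases; exact output varies by version):

```sql
-- Show table metadata; the TBLPROPERTIES section lists orc.compress
desc formatted page_views_orc_snappy;
-- From the shell, dump the ORC file footer, which records the codec:
-- hive --orcfiledump /user/hive/warehouse/page_views_orc_snappy/000000_0
```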

(5) Storing as ORC without compression

create table page_views_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

insert into table page_views_orc_none select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;
7.6 M  /user/hive/warehouse/page_views_orc_none/000000_0

(6) Storing as Parquet with Snappy compression

set parquet.compression=SNAPPY ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;
insert into table page_views_parquet_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;
2.7 M  /user/hive/warehouse/page_views_parquet_snappy/000000_0
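Note that `set parquet.compression=SNAPPY;` is a session-level setting: it applies to every Parquet write in that session and is lost when the session ends. In newer Hive versions the codec can also be pinned to the table itself via TBLPROPERTIES, mirroring the ORC example above (a sketch; the table name is illustrative, and table-level `parquet.compression` support depends on your Hive version):

```sql
create table page_views_parquet_snappy2(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet tblproperties ("parquet.compression"="SNAPPY");
```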

In real-world projects, Hive table data is typically stored as ORC or Parquet, and compressed with Snappy.
Reposted from https://blog.csdn.net/gongxifacai_believe/article/details/80833480


Reposted from blog.csdn.net/u012957549/article/details/85859311