1 File storage compression ratio test
1.1 Test Data
https://github.com/liufengji/Compression_Format_Data (file log.txt, size 18.1 MB)
1.2 TextFile
-
Create a table that stores data in TextFile format
create table log_text (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
-
Load data into the table
load data local inpath '/home/hadoop/log.txt' into table log_text;
-
View the table's data size
dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+
1.3 Parquet
-
Create a table that stores data in Parquet format
create table log_parquet (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
-
Load data into the table
insert into table log_parquet select * from log_text;
-
View the table's data size
dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+
1.4 ORC
-
Create a table that stores data in ORC format
create table log_orc (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
-
Load data into the table
insert into table log_orc select * from log_text;
-
View the table's data size
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
1.5 Storage compression ratio summary
Compression ratio: ORC > Parquet > TextFile
2 Storage format query speed test
2.1 TextFile
select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)
2.2 Parquet
select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)
2.3 ORC
select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)
2.4 Query speed summary
ORC > TextFile > Parquet (the differences in this single test are small)
3 Combining storage and compression
-
The advantage of compression is that it minimizes the disk space needed to store files and reduces disk and network I/O.
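As an illustration, compression can also be enabled at the session level for MapReduce intermediate and final output, independently of the table's storage format. This is a sketch using standard Hive/Hadoop property names; verify them against your cluster's versions:

```sql
-- Compress intermediate map output between job stages (reduces shuffle I/O)
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output written to HDFS
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

These settings affect job output files; columnar formats such as ORC and Parquet additionally apply their own internal compression, configured per table as shown below.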
-
Official website address
-
ORC supports three compression options: ZLIB, SNAPPY, and NONE (no compression). ORC uses ZLIB compression by default.
3.1 Create an uncompressed ORC storage table
-
1. Create an ORC table without compression
create table log_orc_none (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="NONE");
-
2. Load the data
insert into table log_orc_none select * from log_text;
-
3. View the table's data size
dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+
3.2 Create a Snappy-compressed ORC storage table
-
1. Create an ORC table with Snappy compression
create table log_orc_snappy (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");
-
2. Load the data
insert into table log_orc_snappy select * from log_text;
-
3. View the table's data size
dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+
3.3 The ZLIB-compressed ORC storage table
-
If no compression format is specified, ZLIB compression is used by default.
-
Refer to the log_orc table created above.
-
View the table's data size
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
3.4 Storage and compression summary
-
ORC's default ZLIB compression produces a smaller file (2.8 MB) than Snappy compression (3.8 MB).
-
In real projects, Hive table data is generally stored as ORC or Parquet.
-
Because Snappy compresses and decompresses quickly, Snappy is generally chosen as the compression codec.
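Following this recommendation, Parquet can be combined with Snappy in the same way as the log_orc_snappy example above. This is an illustrative sketch: the table name log_parquet_snappy is hypothetical, and the parquet.compression table property should be checked against your Hive version (on older versions, setting parquet.compression at the session level may be required instead):

```sql
-- Sketch: Parquet storage combined with Snappy compression
-- (table name log_parquet_snappy is hypothetical)
create table log_parquet_snappy (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties("parquet.compression"="SNAPPY");

insert into table log_parquet_snappy select * from log_text;
```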