Hive Basics (Part 5)

Comparison of Hive's major file storage formats

1. File storage compression ratio test

1.1 Test data
https://github.com/liufengji/Compression_Format_Data
log.txt, size 18.1 MB
1.2 TextFile
  • Create a table stored in TextFile format

create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
  • Load data into the table

load data local inpath '/home/hadoop/log.txt' into table log_text;
  • View the table's data size

dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+
1.3 Parquet
  • Create a table stored in Parquet format

create table log_parquet (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
  • Load data into the table

insert into table log_parquet select * from log_text;
  • View the table's data size

dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+
1.4 ORC
  • Create a table stored in ORC format

create table log_orc (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
  • Load data into the table

insert into table log_orc select * from log_text;
  • View the table's data size

dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
1.5 File storage compression ratio summary
ORC > Parquet > TextFile
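As a sanity check on that ordering, the sizes measured in sections 1.2–1.4 can be turned into ratios relative to the raw TextFile. This is a small arithmetic sketch in Hive itself, using the sizes reported by `dfs -du -h` above as literals rather than new measurements:

```sql
-- Each format's size relative to the 18.1 MB TextFile;
-- smaller means stronger compression.
select round(13.1 / 18.1, 2) as parquet_ratio,  -- ~0.72 of original size
       round(2.8  / 18.1, 2) as orc_ratio;      -- ~0.15 of original size
```

So on this dataset, ORC shrinks the data to roughly 15% of its raw size, while Parquet reaches about 72%.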

2. Query speed test by storage format

2.1 TextFile
select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)
2.2 Parquet
select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)
2.3 ORC
select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)
2.4 Query speed summary
ORC > TextFile > Parquet
(The timings above differ by only a few seconds, so on this small dataset the ordering is indicative rather than definitive.)

3. Combining storage with compression

  • The advantage of compression is that it minimizes the disk space required and reduces disk and network I/O.

  • Official website address

  • ORC supports three compression options: ZLIB, SNAPPY, and NONE (no compression). By default, ORC uses ZLIB.

3.1 Create an uncompressed ORC table
  • 1. Create an ORC table with no compression

create table log_orc_none (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="NONE");
  • 2. Load data into the table

insert into table log_orc_none select * from log_text;
  • 3. View the table's data size

dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+
3.2 Create a Snappy-compressed ORC table
  • 1. Create an ORC table with Snappy compression

create table log_orc_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="SNAPPY");
  • 2. Load data into the table

insert into table log_orc_snappy select * from log_text;
  • 3. View the table's data size

dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+
3.3 Create a ZLIB-compressed ORC table
  • If no compression format is specified, ORC uses ZLIB by default.

    • Refer to the log_orc table created in 1.4 above.

  • View the table's data size

dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
3.4 Storage and compression summary
  • With ORC's default ZLIB, the file is smaller than with Snappy compression (2.8 MB vs. 3.8 MB here).

  • In real projects, Hive table data is generally stored as ORC or Parquet.

  • Because Snappy compresses and decompresses more efficiently, Snappy is generally chosen as the compression codec.
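Combining those two recommendations, a Parquet table can be compressed with Snappy in the same way the ORC tables above set `orc.compress`. This is a sketch assuming Hive's `parquet.compression` table property (the Parquet counterpart of `orc.compress`); the table name is illustrative:

```sql
-- Sketch: Parquet storage with Snappy compression, mirroring the
-- ORC examples above. No row format clause is needed here, since
-- Parquet defines its own internal layout.
create table log_parquet_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
stored as parquet tblproperties("parquet.compression"="SNAPPY");

insert into table log_parquet_snappy select * from log_text;
```

The resulting size can then be checked with `dfs -du -h` exactly as in the earlier sections.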

Origin www.cnblogs.com/lojun/p/11396793.html