1. Preparations
Find a test file, log.data, 18.1 MB in size.
2. Comparison of storage formats
2.1 The default TextFile
TextFile is the default format. The data is not compressed, so disk usage is large and parsing overhead is high. It can be combined with Gzip or Bzip2 (Hive detects the compression automatically and decompresses when the query is executed), but data compressed this way cannot be split by Hive, so it cannot be processed in parallel.
Create a table that stores data in TEXTFILE format
create table log_text1 (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Loading data into the table
load data local inpath '/waq/log.data' into table log_text1;
Check the size of the data in the table
hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_text1;
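As a sketch of the Gzip combination mentioned above (the table name log_text_gz and the pre-compressed file /waq/log.data.gz are hypothetical), Hive detects the .gz extension and decompresses automatically at query time, but the compressed file cannot be split and is read by a single task:

```sql
-- Hypothetical example: a TEXTFILE table loaded with a gzip-compressed file.
-- Hive decompresses the .gz file automatically during query execution,
-- but the file cannot be split, so it is processed by a single task.
create table log_text_gz (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

load data local inpath '/waq/log.data.gz' into table log_text_gz;
```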
2.2 ORC
An ORC file is divided into several stripes, and each stripe consists of three parts:
- Index Data: a lightweight index; by default one index entry is created every 10,000 rows. The index records only the offset of each field within the Row Data.
- Row Data: the actual data. Rows are taken in batches and then stored column by column; each column is encoded and split into multiple streams for storage.
- Stripe Footer: metadata for the stripe.
Each file also has a File Footer, which stores the number of rows in each stripe, the data type of each column, and so on. The tail of each file is a PostScript, which records the compression type of the file and the length of the File Footer. When reading a file, the reader seeks to the end, reads the PostScript, parses the File Footer length from it, reads the File Footer, parses the location of each stripe from it, and then reads the stripes; that is, the file is read from back to front.
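The stripe size and index stride described above can be tuned per table. A minimal sketch, assuming the standard ORC table properties orc.stripe.size (in bytes) and orc.row.index.stride (in rows); the table name and values here are illustrative:

```sql
-- Illustrative: tune ORC stripe size and row-index stride at table creation.
create table log_orc_tuned (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc
tblproperties (
  "orc.stripe.size"="67108864",     -- 64 MB stripes
  "orc.row.index.stride"="10000"    -- one index entry per 10,000 rows
);
```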
Create a table that stores data in ORC format
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;
Loading data into the table
insert into table log_orc select * from log_text1;
Check the size of the data in the table
hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_orc;
2.3 Parquet
Parquet files are stored in binary form and are not directly human-readable. A file contains both the data and its metadata, so the Parquet format is self-describing.
Typically, when storing data, the row-group size is set according to the HDFS block size. Since a block is generally the smallest unit of data each Mapper task processes, every row group can then be handled by a single Mapper task, which increases the parallelism of task execution.
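The row-group/block alignment described above can be sketched with a session setting before writing data (parquet.block.size is a Parquet-MR setting in bytes; the 128 MB value is illustrative):

```sql
-- Illustrative: make Parquet row groups match a 128 MB HDFS block,
-- so each row group can be processed by a single Mapper task.
SET parquet.block.size=134217728;  -- 128 MB, in bytes
```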
Create a table that stores data in PARQUET format
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;
Loading data into the table
insert into table log_parquet select * from log_text1;
Check the size of the data in the table
hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_parquet;
2.4 Compression ratio summary for the stored files:
ORC > Parquet > TextFile
2.5 Query speed test of the stored files, first run:
TextFile
select count(*) from log_text1;
ORC
select count(*) from log_orc;
Parquet
select count(*) from log_parquet;
Second run:
TextFile
select count(*) from log_text1;
ORC
select count(*) from log_orc;
Parquet
select count(*) from log_parquet;
Third run:
TextFile
select count(*) from log_text1;
ORC
select count(*) from log_orc;
Parquet
select count(*) from log_parquet;
Query speed summary for the storage formats:
TextFile > ORC > Parquet
3. Combining storage with compression
3.1 ORC without compression
Create an ORC table with compression disabled
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
Insert data
insert into table log_orc_none select * from log_text1;
View the data size after insertion
hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_orc_none;
3.2 ORC with SNAPPY compression
Create an ORC table stored with SNAPPY compression
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
Insert data
insert into table log_orc_snappy select * from log_text1;
View the data size after insertion
hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_orc_snappy;
Compared with the ORC table created with default settings, the imported data under Snappy is larger. The reason is that ORC uses ZLIB compression by default, and ZLIB achieves a somewhat higher compression ratio than Snappy.
Storage and compression summary:
The preferred data storage formats for Hive tables are ORC and Parquet; for compression, Snappy is generally chosen.
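Following this summary, Parquet can be combined with Snappy in the same way as the ORC examples above; a sketch, assuming the parquet.compression table property (the table name is illustrative):

```sql
-- Sketch: Parquet storage with Snappy compression, mirroring the ORC examples.
create table log_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET tblproperties ("parquet.compression"="SNAPPY");
```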