Hive Mainstream Storage Formats and Compression: A Comparative Experiment

1. Preparations

Find a test file, log.data, 18.1 MB in size.

2. Storage format comparison

2.1 TextFile (the default)

TextFile is Hive's default format. Data is stored uncompressed, so both disk usage and data-parsing overhead are large. It can be combined with Gzip or Bzip2 (Hive detects the compression automatically and decompresses the data while executing the query), but with this approach Hive cannot split the data, so it cannot process the data in parallel.
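As a sketch of the Gzip combination described above (the .gz file name is hypothetical; Hive recognizes the suffix and decompresses transparently, but a single gzip file is read by one mapper):

```sql
-- Hypothetical sketch: load a gzip-compressed text file directly.
-- Hive decompresses it at query time, but the file is not splittable.
create table log_text_gz (
  line string
)
STORED AS TEXTFILE;

load data local inpath '/waq/log.data.gz' into table log_text_gz;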

Create a table that stores data in TEXTFILE format

create table log_text1 (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Load data into the table

load data local inpath '/waq/log.data' into table log_text1;

Check the size of the data in the table


 hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_text1;


2.2 ORC

An ORC file can be divided into several Stripes, and each Stripe consists of three parts: Index Data, Row Data, and a Stripe Footer.

  • Index Data: a lightweight index; by default, one index entry is created for every 10,000 rows. The index only records the offset of each field of a row within the Row Data.
  • Row Data: stores the actual data. It takes a batch of rows and stores them column by column; each column is encoded and split into multiple Streams for storage.
  • Stripe Footer: stores the metadata of the Stripe.

Each file has a File Footer, which stores the number of rows in every Stripe, the data type of every column, and so on. The tail of each file is a PostScript, which records the compression type of the file and the length of the File Footer. When reading the file, the reader seeks to the end of the file to read the PostScript, parses the File Footer length from it, reads the File Footer, parses the information of each Stripe from that, and then reads each Stripe; that is, the file is read from back to front.
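The layout described above can be inspected directly with the ORC file dump utility that ships with Hive, which prints the stripes, file footer, and postscript. A sketch, assuming the table's data file carries Hive's usual output name (the file name is an assumption):

```shell
# Dump the stripe layout, column statistics, file footer and postscript
# of the ORC file written for the log_orc table below.
hive --orcfiledump /user/hive/warehouse/testgmall.db/log_orc/000000_0
```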

Create a table that stores data in ORC format

create table log_orc(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;

Load data into the table

insert into table log_orc select * from log_text1;

Check the size of the data in the table


 hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_orc; 


2.3 Parquet

Parquet files are stored in binary, so they cannot be read directly. A Parquet file contains both the data and the file's metadata, so the Parquet format is self-describing.
Usually, when storing Parquet data, the row group size is set according to the HDFS block size. Since one block is generally the smallest unit of data a Mapper task processes, each row group can then be handled by a single Mapper task, which increases the parallelism of task execution.
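The row groups and column chunks described above can be listed with the parquet-tools utility. A sketch, assuming parquet-tools is available on the cluster (the jar version and data file name are assumptions):

```shell
# Dump the row-group and column-chunk metadata of the Parquet file
# written for the log_parquet table below.
hadoop jar parquet-tools-1.11.2.jar meta \
  /user/hive/warehouse/testgmall.db/log_parquet/000000_0
```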

Create a table that stores data in PARQUET format

create table log_parquet(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;

Load data into the table

insert into table log_parquet select * from log_text1;

Check the size of the data in the table


hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_parquet;


2.4 Compression ratio summary for the storage formats:

ORC > Parquet > TextFile

2.5 Query speed test of the storage formats, first run:

TextFile

select count(*) from log_text1;


ORC

select count(*) from log_orc;


Parquet

select count(*) from log_parquet;


2.6 Query speed test of the storage formats, second run:

TextFile


select count(*) from log_text1;

ORC


select count(*) from log_orc;

Parquet


select count(*) from log_parquet;

2.7 Query speed test of the storage formats, third run:

TextFile


select count(*) from log_text1;

ORC


select count(*) from log_orc;

Parquet


select count(*) from log_parquet;

Query speed summary for the storage formats:

TextFile > ORC > Parquet

3. Combining storage and compression

3.1 Uncompressed ORC storage

Create an ORC table with no compression

create table log_orc_none(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

Insert data

insert into table log_orc_none select * from log_text1;

Check the data size after insertion


hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_orc_none;


3.2 SNAPPY-compressed ORC storage

Create an ORC table with SNAPPY compression

create table log_orc_snappy(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

Insert data

insert into table log_orc_snappy select * from log_text1;

Check the data size after insertion


 hadoop fs -du -h /user/hive/warehouse/testgmall.db/log_orc_snappy;


For comparison, the ORC table created with the default settings (section 2.2) held the imported data in a smaller file than the SNAPPY table. The reason is that ORC files use ZLIB compression by default, and Snappy's compression ratio is somewhat lower than ZLIB's.

Storage and compression summary:

In production, Hive tables generally use ORC or Parquet as the storage format, and Snappy as the compression codec.
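Following that recommendation, a Parquet table with Snappy compression can be declared the same way as the ORC examples above, using the `parquet.compression` table property (a sketch mirroring the experiment's schema):

```sql
create table log_parquet_snappy(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet tblproperties ("parquet.compression"="SNAPPY");
```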


Origin blog.csdn.net/weixin_43893397/article/details/104711481