Hive Storage Formats

Storage format

Using the default TextFile:

hive>create table t1(id int, name string) stored as textfile;
hive>create table t1(id int, name string);

Since the default storage format is TextFile, the two statements above have exactly the same effect.
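The default comes from the hive.default.fileformat setting; as a quick check (a small aside, not from the original test run), the current value can be printed with:

hive>set hive.default.fileformat;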

hive>desc formatted t1;
    Output when the format is the default TextFile:
	InputFormat:            org.apache.hadoop.mapred.TextInputFormat 
	OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
	Compressed:             No
hive>create table t2(id int, name string) stored as
	 inputformat 'org.apache.hadoop.mapred.TextInputFormat'         
	 outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; 

This produces the same result as TextFile.

**Note:** the inputformat and outputformat classes can be replaced with your own custom implementations.

Evolution of the supported file formats:

| TEXTFILE – (Default, depending on hive.default.fileformat configuration)
| RCFILE – (Note: Available in Hive 0.6.0 and later)
| ORC – (Note: Available in Hive 0.11.0 and later)
| PARQUET – (Note: Available in Hive 0.13.0 and later)
| AVRO – (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

**Recommended:** ORC and PARQUET are enough in practice.
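For reference, a minimal sketch of creating tables in the two recommended formats (the table and column names here are placeholders, not from the original post):

hive>create table t_orc(id int, name string) stored as orc;
hive>create table t_parquet(id int, name string) stored as parquet;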


Row Storage vs. Columnar Storage

Row storage

(Figure: row-oriented storage layout)

As shown above, in row-oriented storage:
suppose a file has 4 columns A, B, C, D and 5 rows.

  • In HDFS this corresponds to Blocks.
  • So what does the layout inside a Block actually look like?

(Figure: layout of rows inside a Block)

  • Advantage of row storage:

All the columns of a given row are guaranteed to sit in the same Block.

  • Disadvantages of row storage:

A record contains many columns, and each column may have a different data type.
Since the types differ, compression becomes awkward: no single compression scheme achieves the best ratio for every data type.
select a, b from xxx;
with row storage, all the columns still have to be read: c and d are never accessed, yet they are read anyway, which drives up IO.

Columnar storage

(Figure: columnar storage layout)

As shown above, in columnar storage:

The columns of each row are split apart and stored in different Blocks.
Columnar storage cannot guarantee that all the columns of one row end up in the same Block:
columns A and B are stored in Block 1,
column C sits in another Block,
and column D in yet another.

  • This brings a big advantage:
    since data from the same column sits in the same Block, its data type is uniform, so a compression scheme with a much better ratio can be used.
  • select c from xxx;
    only needs to pull the data out of one Block and never touches the other two Blocks,
    so IO drops sharply.
  • It also brings a disadvantage:
    the more columns a query touches, the more Blocks have to be read.
    ==> with select * from xxx; there may be little difference between row and columnar storage,
    but when only some fields are queried, the columnar advantage shows.
    In our big-data scenarios:
  • we may have hundreds of columns yet query only a few of them;
    in that scenario columnar storage is the better choice.

TextFile

The default storage format.
Plain files / JSON / XML ==> stored as TextFile.

**Note:** with TextFile storage, most values end up being handled as plain strings.
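The page_views table used as the data source in the tests below is just such a TextFile table; a sketch of how it would be created and loaded (assumed here, mirroring the column lists that appear in the DDLs later in this post):

hive>create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as textfile;

hive>LOAD DATA LOCAL INPATH '/home/hadoop/data/page_views.dat' OVERWRITE INTO TABLE page_views;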

SequenceFile

(Figure: SequenceFile structure)

As shown above, the Record is where the data actually lives. There are three levels:

  • Uncompressed Record:
    RecordLength KeyLength Key Value
  • Record-compressed Record:
    RecordLength KeyLength Key CompressedValue
  • Block-compressed Record:
    Number of records CompressedKeyLengths CompressedKeys CompressedValueLengths CompressedValues

For example:

record length = 10, key length = 4,
so the value length is 6.
With this SequenceFile layout, when reading a value we already have RecordLength and KeyLength, so we can skip the Key and jump straight to the Value.
==> That can make reads a little faster.
Just be aware that it exists; it is not used much.

Performance test:

hive>create table page_views_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as sequencefile;

hive>LOAD DATA LOCAL INPATH '/home/hadoop/data/page_views.dat' OVERWRITE INTO TABLE page_views_seq;

This fails with: Failed with exception Wrong file format.
Reason: data cannot be loaded directly into a SequenceFile table.
Solution: first create a table stored as TextFile in Hive, then pour that table's data into the SequenceFile table:

hive>insert into table page_views_seq select * from page_views;

The data is imported successfully.

Check the size (before using SequenceFile):

$>hadoop fs -du -h /user/hive/warehouse/page_views
18.1M  18.1M   /user/hive/warehouse/page_views/page_views.dat

Check the size (after using SequenceFile):

$>hadoop fs -du -h /user/hive/warehouse/page_views_seq
19.6M  19.6M   /user/hive/warehouse/page_views_seq/000000_0
The reason it is larger than before: SequenceFile adds a header and other metadata.

SequenceFile is row-oriented.
It is not used much in production nowadays.
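If a compressed SequenceFile were wanted anyway, one common approach (assumed settings, not part of the original test) is to turn on output compression before the insert:

hive>set hive.exec.compress.output=true;      -- compress the files Hive writes out
hive>set io.seqfile.compression.type=BLOCK;   -- SequenceFile block-level compression
hive>insert overwrite table page_views_seq select * from page_views;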

RCFile(Record Columnar File)

(Figure: RCFile layout)

As shown above, RCFile was open-sourced by Facebook.
RCFile mixes row and columnar storage (which the figure also reflects).
One Row Group is 4 MB.

Performance test:

hive>create table page_views_rc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as rcfile;

hive>insert into table page_views_rc select * from page_views;

Check the size (after using RCFile):

$>hadoop fs -du -h /user/hive/warehouse/page_views_rc
17.9M  17.9M   /user/hive/warehouse/page_views_rc/000000_0

ORC

(Figure: ORC file structure)

The ORC storage structure is shown above; it introduces the following concepts:

  • Stripe (one Stripe is 250 MB):
    Index Data + Row Data + Stripe Footer
    Index Data: a lightweight index kept inside each Stripe;
    for numeric types it stores the minimum and maximum values, and for strings it keeps the prefix and suffix. The benefit of keeping this:
  • when your SQL statement has a WHERE clause such as id > 100,
    if the first Stripe stores 0 to 99,
    the second Stripe stores 100 to 199,
    and the third Stripe stores 200 to 299,
    then that query will not read the first Stripe at all,
    and query performance is bound to improve (see the sketch after this list).
  • Stripe Footer: stores the stream types; nothing to worry about here.
  • When querying, the index is checked first; if the data cannot be in a Stripe, that Stripe is not read at all.
  • This improves performance.
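To actually benefit from that stripe-level min/max index, predicate pushdown to the ORC reader should be enabled. A hedged sketch (hive.optimize.index.filter is the relevant switch in common Hive versions; the t_orc table and id column are hypothetical, matching the example above):

hive>set hive.optimize.index.filter=true;       -- push the WHERE predicate down to the ORC reader
hive>select count(1) from t_orc where id > 100;  -- stripes whose max(id) is below 100 can be skipped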

Performance test (with compression; the default is ZLIB):

hive>create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc;

hive>insert into table page_views_orc select * from page_views;

Check the size (after using ORC, with compression):

$>hadoop fs -du -h /user/hive/warehouse/page_views_orc
  2.8M  2.8M   /user/hive/warehouse/page_views_orc/000000_0

Without compression:

hive>create table page_views_orc_none 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
stored as orc TBLPROPERTIES("orc.compress"="NONE");

Error: FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified

Reason: no columns (and no data source) were specified. Fix it with CREATE TABLE ... AS SELECT:

hive>create table page_views_orc_none 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
stored as orc TBLPROPERTIES("orc.compress"="NONE")
as select * from page_views;

Check the size (after using ORC, without compression):

$>hadoop fs -du -h /user/hive/warehouse/page_views_orc_none
  7.7M  7.7M   /user/hive/warehouse/page_views_orc_none/000000_0
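To double-check which codec an ORC table ended up with, its table parameters can be inspected (the same desc formatted used earlier; orc.compress shows up under Table Parameters here because it was set via TBLPROPERTIES):

hive>desc formatted page_views_orc_none;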

Parquet

Performance test.
Without compression (no compression is the default):

hive>create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as parquet;

hive>insert into table page_views_parquet select * from page_views;

Check the size (after using Parquet, without compression):

$>hadoop fs -du -h /user/hive/warehouse/page_views_parquet
  13.1M  13.1M   /user/hive/warehouse/page_views_parquet/000000_0

With compression (set to GZIP):

hive>set parquet.compression=GZIP;   -- set the compression codec
hive>create table page_views_parquet_gzip(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as parquet;

hive>insert into table page_views_parquet_gzip select * from page_views;

Check the size (after using Parquet, with compression):

$>hadoop fs -du -h /user/hive/warehouse/page_views_parquet_gzip
  3.9M  3.9M   /user/hive/warehouse/page_views_parquet_gzip/000000_0
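Depending on the Hive/Parquet version, the codec can also be pinned per table instead of per session; a hedged sketch (older releases may only honor the session-level parquet.compression setting):

hive>create table page_views_parquet_gzip2
stored as parquet TBLPROPERTIES("parquet.compression"="GZIP")
as select * from page_views;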

HDFS Read performance test

hive>select count(1) from page_views where session_id='xxxxxxx';
hive>select count(1) from page_views_seq where session_id='xxxxxxx';
hive>select count(1) from page_views_rc where session_id='xxxxxxx';
hive>select count(1) from page_views_orc where session_id='xxxxxxx';
hive>select count(1) from page_views_orc_none where session_id='xxxxxxx';
hive>select count(1) from page_views_parquet where session_id='xxxxxxx';
hive>select count(1) from page_views_parquet_gzip where session_id='xxxxxxx';

Note: replace 'xxxxxxx' with a real session_id; this is only meant to illustrate the testing approach.

Compare the formats by watching the HDFS Read counter printed on the console for each query:
(Figure: HDFS Read comparison across the formats)
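Besides the HDFS Read counter, the on-disk sizes of all the test tables can be compared in one shot:

$>hadoop fs -du -h /user/hive/warehouse | grep page_views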


Origin blog.csdn.net/aubekpan/article/details/88806237