Hive table file types and supported compression formats

MapReduce data compression
Hive data compression
File formats supported by Hive
Hive log analysis and a comparison of the compression formats

One: MapReduce compression

  • MapReduce compression mainly optimizes the shuffle stage.
    The shuffle stage consists of:

- Partition
- Sort
- Combine (merge)
- Compress
- Group
Essentially, optimizing the shuffle in MapReduce means solving the disk I/O and network I/O problems by reducing the amount of file transfer across the cluster.
Two: Hive compression
Compression and decompression cost CPU. Common compression formats in Hive:
bzip2, gzip, lzo, snappy, and others.
The default compression used by CDH is snappy.

Compression ratio: bzip2 > gzip > lzo. bzip2 saves the most storage space.
Note: snappy does not have the best compression ratio.

Decompression speed: lzo > gzip > bzip2. lzo decompresses the fastest.
Note: snappy pursues the fastest compression speed.
Compression and decompression both consume a relatively large amount of CPU.

Cluster workload types: CPU-intensive (usually compute-heavy jobs);
Hadoop itself is disk I/O and network I/O intensive, so dual-NIC bonding is commonly used.
Three: Checking compression support with the Hadoop command
bin/hadoop checknative
3.1 Install the native snappy library to enable compression support:
tar -zxvf 2.5.0-native-snappy.tar.gz -C /home/hadoop/yangyang/hadoop/lib/native
3.2 Verify with the command:
bin/hadoop checknative
3.3 Compression codecs supported by MapReduce:
Codec names:
zlib: org.apache.hadoop.io.compress.DefaultCodec
gzip: org.apache.hadoop.io.compress.GzipCodec
bzip2: org.apache.hadoop.io.compress.BZip2Codec
lzo: org.apache.hadoop.io.compress.LzoCodec
lz4: org.apache.hadoop.io.compress.Lz4Codec
snappy: org.apache.hadoop.io.compress.SnappyCodec
3.4 Two ways to enable compression for a MapReduce job:
1. Pass the parameters on the command line at run time:
-Dmapreduce.map.output.compress=true
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
For example:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec /input/dept.txt /output1
You can prefix the command with time to measure the job's total run time.

Testing the job:

  1. Measure the total run time of the job.
  2. Check the compression: how often it is applied and the size of the compressed files.
  3. The second method: change the configuration file to make the setting permanent.
     Edit mapred-site.xml and add:

<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

After the change is complete, restart the services.
Four: Hive compression support
4.1 Temporary parameter changes (take effect for the current session only):
hive> set ;                                                   ---> show all parameters
hive> set hive.exec.compress.intermediate=true ;              --- enable intermediate compression
hive> set mapred.map.output.compression.codec=<codec class name> ;
hive> set hive.exec.compress.output=true ;
hive> set mapred.map.output.compression.type=BLOCK/RECORD ;   --- choose BLOCK or RECORD
To make them permanent, add the corresponding properties to hive-site.xml.
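Putting it together, a minimal session sketch (property names as above; SnappyCodec is taken from the codec list in section 3.3; the tables are the ones created later in this post):

set hive.exec.compress.intermediate=true ;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec ;
set hive.exec.compress.output=true ;
-- any statement that launches a MapReduce job now uses compressed map output
insert into table page_views_orc select * from page_views_textfile ;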
4.2 File types supported by Hive:
4.2.1 The difference between row storage and column storage
Unlike a traditional relational database, a column store keeps the data of each column together. One important benefit: because a query's selection rules are defined over columns, the whole database is effectively auto-indexed.
  A column store gathers the data of each field together, so a query that only needs a few fields can greatly reduce the amount of data read; and because each field's data is stored contiguously and has a uniform type, it is easier to design better compression/decompression algorithms for it.
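As an illustration, a query like the following (against the page_views schema created in section 5) touches only two of the seven columns; with a columnar format such as ORC or Parquet, the other five columns are never read from disk:

-- only the url and session_id columns are scanned in a column store
select url, count(session_id)
from page_views_orc
group by url ;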
(figure: row storage vs. column storage layout)

4.2.2 File types supported by Hive:
To change Hive's default file format, set the parameter (a quick check of this default is sketched after the list below):
SET hive.default.fileformat=Orc ;

TextFile: the default type; row storage
rcfile: rows are first split into row groups, then stored column by column within each group
avro: binary
orc: an upgraded rcfile; the default codec is zlib; snappy is supported, lzo is not
parquet
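A minimal sketch of the effect of changing the default (the table name here is hypothetical):

SET hive.default.fileformat=Orc ;
-- no STORED AS clause, so this table is created as ORC rather than TextFile
create table t_default_test (id int, name string) ;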
4.2.3 The ORC format (supported by Hive, Shark, and Spark)
(figure: ORC file structure)

Usage:
create table Adress (
name string,
street string,
city string,
state double,
zip int
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE") ;   ---> specify the compression algorithm
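An optional follow-up check (desc formatted is a standard Hive command):

-- the Storage Information section of the output shows the ORC input/output formats
desc formatted Adress ;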
4.2.4 The Parquet format (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others)
(figure: Parquet file structure)

Usage:
create table Adress (
name string,
street string,
city string,
state double,
zip int
)
row format delimited fields terminated by '\t'
stored as parquet ;   ---> specify the file type
Five: Hive log analysis, comparing the compression formats
5.1 Create tables with the structure below in Hive:
5.1.1 textfile type:
create table page_views_textfile (
track_time string,
url string,
session_id string,
refere string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
STORED AS textfile ;   ---> specify the table's file type

Load the data into the table:
load data local inpath '/home/hadoop/page_views.data' into table page_views_textfile ;
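A quick sanity check after the load (standard HiveQL; the output depends on your data file):

select count(*) from page_views_textfile ;
select * from page_views_textfile limit 5 ;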

5.1.2 orc type:
create table page_views_orc(
track_time string,
url string,
session_id string,
refere string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
STORED AS orc ;

Insert the data:
insert into table page_views_orc select * from page_views_textfile ;
5.1.3 parquet type:
create table page_views_parquet(
track_time string,
url string,
session_id string,
refere string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
STORED AS parquet ;

Insert the data:
insert into table page_views_parquet select * from page_views_textfile ;

Six: Comparison
6.1 File size comparison
hive (yangyang)> dfs -du -h /user/hive/warehouse/yangyang.db/page_views_textfile;

hive (yangyang)> dfs -du -h /user/hive/warehouse/yangyang.db/page_views_orc;

hive (yangyang)> dfs -du -h /user/hive/warehouse/yangyang.db/page_views_parquet ;
(screenshot: file sizes of the three tables)

As the output above shows, the orc table produces the smallest files.
6.2 Query time comparison:
hive (yangyang)> select count(session_id) from page_views_textfile;
hive (yangyang)> select count(session_id) from page_views_orc;

hive (yangyang)> select count(session_id) from page_views_parquet;
6.3 textfile file type:
(screenshots: query output and run time for textfile)

6.4 orc file type:
(screenshots: query output and run time for orc)

6.5 parquet file type:
(screenshots: query output and run time for parquet)

Seven: Creating Hive tables with a specified compression codec
7.1 orc + snappy format:
create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
refere string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
STORED AS orc TBLPROPERTIES("orc.compress"="SNAPPY");
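To confirm that the codec was recorded on the table (show tblproperties is a standard Hive command):

-- should list orc.compress=SNAPPY among the properties
show tblproperties page_views_orc_snappy ;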

Insert the data:
insert into table page_views_orc_snappy select * from page_views_textfile ;

7.2 parquet + snappy format:
set parquet.compression=Snappy ;
set hive.exec.compress.output=true ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
refere string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
STORED AS parquet ;

Insert the data:
insert into table page_views_parquet_snappy select * from page_views_textfile ;

7.3 Comparison test:
7.3.1 File size comparison:
hive (yangyang)> dfs -du -h /user/hive/warehouse/yangyang.db/page_views_orc_snappy;

hive (yangyang)> dfs -du -h /user/hive/warehouse/yangyang.db/page_views_parquet_snappy ;
(screenshot: file sizes of the snappy-compressed tables)

7.3.2 Query comparison:
hive (yangyang)> select count(session_id) from page_views_orc_snappy;

hive (yangyang)> select count(session_id) from page_views_parquet_snappy;
(screenshots: query output and run time for the snappy-compressed tables)

Origin: www.cnblogs.com/kukudetent/p/12168699.html