Hive compression type test

Background:

1) Four tables with different storage types have been created.

2) The data in hxh2, hxh3, and hxh4 has been cleared; the data is kept in hxh1. The hxh1 table data size is 74.1 GB.

3) A table hxh5 was also created with the same storage type as hxh1: TEXTFILE.

4) Original data size: 74.1 GB
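
The original post does not show the table DDL. For context, here is a minimal sketch of what the test tables might have looked like, assuming every column is STRING and createdate is the partition column (the table names, column list, and storage formats come from the statements below; the column types are an assumption):

create table hxh5 (
  pvid string, sgid string, fr string, ffr string,
  mod string, version string, vendor string
)
partitioned by (createdate string)
stored as textfile;     -- hxh1 and hxh5 are stored as TEXTFILE

-- hxh2, hxh3 and hxh4 would differ only in the storage clause:
--   stored as sequencefile;   (hxh2)
--   stored as rcfile;         (hxh3)
--   stored as orc;            (hxh4)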

 

Start testing:

1. TextFile test

  1. TextFile is Hive's default table format. Storage layout: row-oriented.
  2. The Gzip compression algorithm can be used, but the resulting compressed file does not support splitting.
  3. During deserialization, every character must be checked to decide whether it is a field delimiter or a line terminator, so the deserialization overhead is several times higher than that of SequenceFile.

 

Enable compression:

  1. set hive.exec.compress.output=true;    -- enable compressed output
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;    -- set the output compression codec to Gzip
  3. set mapred.output.compress=true;    -- enable MapReduce output compression
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;    -- select Gzip as the compression codec

 

Insert data into table hxh5:

insert into table hxh5 partition(createdate="2019-07-21") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

hxh5 table data size after compression: 23.8 GB, elapsed time: 81.329 seconds
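
The sizes quoted in this post are presumably read from each table's directory in the Hive warehouse. One way to check from the Hive CLI, assuming the default warehouse location (the path below is only an example, not taken from the original):

dfs -du -s -h /user/hive/warehouse/hxh5;
-- or from a shell: hadoop fs -du -s -h /user/hive/warehouse/hxh5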

 

2. SequenceFile test

  1. Compressing data files saves disk space, but one drawback of some native Hadoop compressed files is that they do not support splitting. A splittable file can be processed in parallel by multiple mapper programs; most compressed files are not splittable because they can only be read from the beginning. SequenceFile is a splittable file format and supports Hadoop's block-level compression.
  2. A binary file format provided by the Hadoop API that serializes data into the file as key-value pairs. Storage layout: row-oriented.
  3. SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD has a low compression ratio and is the default; BLOCK usually gives better compression performance than RECORD.
  4. Advantage: the file is compatible with MapFile in the Hadoop API.

 

Enable compression:

  1. set hive.exec.compress.output=true;    -- enable compressed output
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;    -- set the output compression codec to Gzip
  3. set mapred.output.compression.type=BLOCK;    -- set the compression type to BLOCK
  4. set mapred.output.compress=true;
  5. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

 

Insert data into table hxh2:

insert into table hxh2 partition(createdate="2019-07-21") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

 

hxh2 table data size after compression: 80.8 GB, elapsed time: 186.495 seconds  // compression type not set, i.e. the default RECORD

hxh2 table data size after compression: 25.2 GB, elapsed time: 81.67 seconds  // compression type set to BLOCK

 

3. RCFile test

Storage layout: the data is split into row groups, and each row group is stored column by column. It combines the advantages of row storage and column storage:

  1. First, RCFile guarantees that the data of one row is on the same node, so the cost of tuple reconstruction is very low.
  2. Second, like a column store, RCFile can compress data column by column and can skip reading columns that are not needed.
  3. Appending data: RCFile does not support arbitrary writes; it only provides an append interface, because the underlying HDFS currently only supports appending data to the end of a file.
  4. Row group size: a larger row group helps improve compression efficiency, but it may hurt read performance, because it increases the cost of lazy decompression. Larger row groups also occupy more memory, which can affect other concurrently running MR jobs. Weighing storage efficiency against query efficiency, Facebook chose 4 MB as the default row group size; users can of course choose their own value through a configuration parameter, as in the sketch after this list.
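
A sketch of how the row group size mentioned in point 4 could be tuned. The parameter hive.io.rcfile.record.buffer.size (value in bytes, 4 MB by default) is not used anywhere in the original test and should be verified against your Hive version:

-- tuning example only: raise the RCFile row group buffer from the 4 MB default to 8 MB
set hive.io.rcfile.record.buffer.size=8388608;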

 

Enable compression:

  1. set hive.exec.compress.output=true;
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  3. set mapred.output.compress=true;
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

 

Insert data into table hxh3:

insert into table hxh3 partition(createdate="2019-07-01") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

hxh3 table data size after compression: 22.5 GB, elapsed time: 136.659 seconds

 

4. ORC test

Storage layout: the data is split into row groups, and each group is stored column by column.

Compression is fast and column access is fast. It is more efficient than RCFile and is an improved version of RCFile.
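
Worth noting: ORC normally controls its compression through its own table property rather than through the MapReduce output codec, which would explain why the ORC sizes in the comparison tables below stay at 21.9 GB no matter which codec is set. A hedged sketch of specifying the ORC codec at table creation time (orc.compress accepts NONE, ZLIB, or SNAPPY; this table is hypothetical and was not part of the original test):

create table hxh4_snappy (
  pvid string, sgid string, fr string, ffr string,
  mod string, version string, vendor string
)
partitioned by (createdate string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");   -- hypothetical table, not used in this post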

 

Enable compression:

  1. set hive.exec.compress.output=true;
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  3. set mapred.output.compress=true;
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

 

Insert data into table hxh4:

insert into table hxh4 partition(createdate="2019-07-01") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

hxh4 table data size after compression: 21.9 GB, elapsed time: 76.602 seconds

 

5. What does splittable mean

  When considering how the data processed by MapReduce should be compressed, it is important to consider whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB, with an HDFS block size of 64 MB: the file is stored as 16 blocks, and a MapReduce job using this file as input creates 16 input splits (a split is also called a "chunk"; for HDFS blocks we consistently say "block"), each processed as an independent input by its own map task.

  Now assume the file is gzip-compressed and its compressed size is 1 GB. As before, HDFS stores the file as 16 blocks. However, creating a split for each block is useless, because it is impossible to start reading a gzip data stream at an arbitrary point, so a map task cannot read its block's data independently of the other blocks. The gzip format stores the compressed data as DEFLATE, and DEFLATE stores the data as a series of compressed blocks. The problem is that the start of each block is not marked in a way that would allow a reader located at an arbitrary point in the stream to find the beginning of the next block; the blocks can only be found by reading the stream from the start. Therefore, gzip does not support the splitting (chunking) mechanism.

  In this case, MapReduce does not split the gzip-format file, because it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. So a single map task processes all 16 HDFS blocks, most of which are not local to that map. Also, with fewer map tasks, the job granularity is coarser, and the run time is usually longer.
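
Restating the arithmetic of this section: 1 GB / 64 MB per block = 16 HDFS blocks, so a splittable format can yield 16 input splits and up to 16 parallel map tasks, while the same data in a single gzip file yields exactly one split and one map task.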

 

6. Compression mode description

1. Compression mode evaluation

A compression mode can be evaluated by the following three criteria:

  1. Compression ratio: the higher the ratio, the smaller the compressed file, so a higher compression ratio is better.
  2. Compression time: the faster, the better.
  3. Whether the compressed file format can be split: a splittable format allows a single file to be processed by multiple Mapper programs, which parallelizes better.

2. Compression mode comparison

  1. BZip2 has the highest compression ratio but also brings higher CPU overhead; Gzip follows BZip2. If disk utilization and I/O are the main considerations, these two compression algorithms are the more attractive ones.
  2. Snappy and LZO have faster decompression speed; if compression and decompression speed matter most, they are a good choice. Snappy and LZO compress data at roughly the same speed, but Snappy is faster than LZO at decompression.
  3. Hadoop splits large files into fragments of the HDFS block size (64 MB by default), each fragment corresponding to one Mapper program. Among these compression algorithms, BZip2, LZO, and Snappy compression is splittable; Gzip does not support splitting.

 

7. Common compression formats

Compression format   Compressed size   Compression speed   Splittable
GZIP                 medium            medium              no
BZIP2                small             slow                yes
LZO                  large             fast                yes
Snappy               large             fast                yes

Note:

Splittable here means: a local file is compressed with a given algorithm and then uploaded to HDFS; when MapReduce computation runs on it, the mapper stage can split the compressed file, and the resulting pieces are valid.

 

 

The Hadoop encoders/decoders (codecs) are shown in the table below:

Compression format   Encoder/decoder
DEFAULT              org.apache.hadoop.io.compress.DefaultCodec
Gzip                 org.apache.hadoop.io.compress.GzipCodec
BZip2                org.apache.hadoop.io.compress.BZip2Codec
DEFLATE              org.apache.hadoop.io.compress.DeflateCodec
Snappy               org.apache.hadoop.io.compress.SnappyCodec (used for intermediate output)
LZ4                  org.apache.hadoop.io.compress.Lz4Codec (used for intermediate output)
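
Since SnappyCodec and Lz4Codec are noted above as typically used for intermediate output, here is a minimal sketch of enabling intermediate (map-output) compression in Hive, using the older MRv1-style property names to match the rest of this post (check the exact names against your Hadoop/Hive version):

set hive.exec.compress.intermediate=true;    -- compress data passed between the map and reduce stages
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;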

 

8. Result comparison

The size before compression is 74.1 GB. Compressed directory sizes:

 

Compression   TextFile   SequenceFile   RCFile    ORC
GZip          23.8 GB    25.2 GB        22.5 GB   21.9 GB
Snappy        39.5 GB    41.3 GB        39.1 GB   21.9 GB
BZip          17.9 GB    18.9 GB        18.7 GB   21.9 GB
LZO           39.9 GB    41.8 GB        40.8 GB   21.9 GB

Compressed file names:

 

Compression   TextFile    SequenceFile   RCFile        ORC
GZip          *.gz        000000_0       000000_1000   000000_0
Snappy        *.snappy    000000_0       000000_1      000000_0
BZip          *.bz2       000000_0       000000_1000   000000_0
LZO           *.lz4       000000_2       000000_0      000000_0

Data import time:

 

Compression   TextFile   SequenceFile   RCFile   ORC
GZip          81.329s    81.67s         136.6s   76.6s
Snappy        226s       180s           79.8s    75s
BZip          138.2s     134s           145.9s   98.3s
LZO           231.8s     234s           86.1s    248.1s

Query speed, measured with:

select count(1) from table_name

 

Compression   TextFile   SequenceFile   RCFile   ORC
GZip          46.2s      50.4s          44.3s    38.3s
Snappy        46.3s      54.3s          42.2s    40.3s
BZip          114.3s     110.3s         40.3s    38.2s
LZO           60.3s      52.2s          42.2s    50.3s

Summary:

Compression ratio:

BZip > Gzip > Snappy > LZO; however, with the ORC storage type the compression ratio is the same regardless of the codec.

Compression time:

Gzip < BZip < Snappy < LZO; however, for the RCFile and ORC storage types, Snappy gives the shortest compression time.

Query (count) time:

GZip < Snappy < LZO < BZip; however, for the RCFile and ORC storage types, BZip gives the shortest query time.

Recommended compression type:

1) BZip and Gzip have very good compression ratios but cause higher CPU consumption; if disk utilization and I/O are the priority, these two compression algorithms can be considered.

 

2) LZO and Snappy have faster compression and decompression speed; if compression/decompression speed is the concern, they are a good choice. When the Hive table storage type is RCFile or ORC, Snappy and LZO have comparable decompression efficiency, but Snappy does even better than LZO in compression.

 

3) Hadoop splits large files into fragments of the HDFS block size, each fragment corresponding to one Mapper program. Among these compression algorithms, BZip2, LZO, and Snappy compression can be split; Gzip does not support splitting.
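
Putting the recommendations together, one possible session-level setup when write and query speed matter more than size, modeled on the SET statements used earlier in this post (the codec choice is an interpretation of the results above, not something the original author prescribes):

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec;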

 

 

Origin www.cnblogs.com/gentlemanhai/p/11275442.html