Background:
1) Four tables with different storage types were created.
2) The hxh2, hxh3, and hxh4 tables were cleared; the data is retained in hxh1. hxh1 table data size: 74.1 GB.
3) An hxh5 table of the same type as hxh1 was also created; both are stored as TEXTFILE.
4) Original data size: 74.1 GB.
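The test tables can be sketched roughly as below. The column list is taken from the INSERT statements later in this post; everything else (column types, delimiters, exact DDL) is an assumption, not the original setup.

```sql
-- Hypothetical DDL sketch; column names come from the INSERT statements below,
-- all other details are assumed.
create table hxh1 (
  pvid string, sgid string, fr string, ffr string,
  mod string, version string, vendor string
)
partitioned by (createdate string)
stored as textfile;

-- Same schema (elided), different storage formats:
create table hxh2 (...) partitioned by (createdate string) stored as sequencefile;
create table hxh3 (...) partitioned by (createdate string) stored as rcfile;
create table hxh4 (...) partitioned by (createdate string) stored as orc;
create table hxh5 (...) partitioned by (createdate string) stored as textfile;
```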
Testing:
1. TextFile test
- Hive's default table format; storage layout: row-based.
- The Gzip compression algorithm can be used, but the compressed file does not support splitting.
- During deserialization, delimiters and line endings must be checked character by character, so deserialization overhead is several times higher than for SequenceFile.
Enable compression:
- set hive.exec.compress.output=true;  -- enable compressed output
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  -- set the output compression codec to Gzip
- set mapred.output.compress=true;  -- enable mapred output compression
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  -- select Gzip as the codec
Insert data into table hxh5:
insert into table hxh5 partition(createdate="2019-07-21") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
hxh5 table data size after compression: 23.8 G; time taken: 81.329 seconds
2. SequenceFile test
- Compressing data files saves disk space, but one shortcoming of some native Hadoop compressed formats is that they do not support splitting. Splittable files allow multiple mapper programs to process a large data file in parallel; most compressed formats are not splittable because they can only be read from the beginning. SequenceFile is a splittable file format and supports Hadoop's block-level compression.
- A binary file format provided by the Hadoop API that serializes data to the file as key-value pairs. Storage layout: row-based.
- SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD compression has a low compression ratio; RECORD is the default option, and BLOCK usually gives better compression than RECORD.
- Advantage: compatible with MapFile in the Hadoop API.
Enable compression:
- set hive.exec.compress.output=true;  -- enable compressed output
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  -- set the output compression codec to Gzip
- set mapred.output.compression.type=BLOCK;  -- set the compression type to BLOCK
- set mapred.output.compress=true;
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Insert data into hxh2:
insert into table hxh2 partition(createdate="2019-07-21") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
hxh2 table data size after compression: 80.8 G; time taken: 186.495 seconds  // compression type not set, defaults to RECORD
hxh2 table data size after compression: 25.2 G; time taken: 81.67 seconds  // compression type set to BLOCK
3. RCFile test
Storage layout: rows are partitioned into row groups, and within each group the data is stored by column. This combines the advantages of row and column storage:
- First, RCFile guarantees that the data of one row is on the same node, so the overhead of tuple reconstruction is very low.
- Second, like column storage, RCFile can exploit column-level data compression and can skip reading unnecessary columns.
- Appending data: RCFile does not support arbitrary write operations; it only provides an append interface, because the underlying HDFS currently only supports appending data to the end of a file.
- Row group size: a larger row group helps improve data compression efficiency, but may hurt read performance, because it increases the cost of lazy decompression. Row groups also take up more memory, which can affect other concurrently executing MR jobs. Considering both storage and query efficiency, Facebook chose 4 MB as the default row group size; users can of course tune it with configuration parameters.
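As a hedged aside, the row-group buffer can be tuned per session; the parameter name below is Hive's RCFile record buffer setting as I understand it (its default matches the 4 MB mentioned above), so verify it against your Hive version before relying on it.

```sql
-- Assumed parameter name; default is 4 MB (4194304 bytes) per the description above.
set hive.io.rcfile.record.buffer.size=8388608;  -- try 8 MB row groups
```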
Enable compression:
- set hive.exec.compress.output=true;
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
- set mapred.output.compress=true;
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Insert data into hxh3:
insert into table hxh3 partition(createdate="2019-07-01") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
Table size after compression: 22.5 G; time taken: 136.659 seconds
4. ORC test
Storage layout: rows are partitioned into groups, and within each group the data is stored by column.
Fast compression and fast column access. More efficient than RCFile; it is an improved version of RCFile.
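One likely reason the ORC results later are identical across codecs is that ORC applies its own internal compression (ZLIB by default), configured through table properties rather than the mapred codec settings. A minimal sketch, with hypothetical table and column names:

```sql
-- ORC's internal compression is set per table; "orc.compress" accepts
-- NONE, ZLIB (the default), or SNAPPY. Names below are examples only.
create table hxh4_snappy (
  pvid string, sgid string, fr string, ffr string,
  mod string, version string, vendor string
)
partitioned by (createdate string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");
```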
Enable compression:
- set hive.exec.compress.output=true;
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
- set mapred.output.compress=true;
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Insert data into hxh4:
insert into table hxh4 partition(createdate="2019-07-01") select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
Table size after compression: 21.9 G; time taken: 76.602 seconds
5. What does "splittable" mean?
When considering how to compress data that will be processed by MapReduce, it is important to consider whether the compression format supports splitting. Consider an uncompressed file of size 1 GB stored in HDFS with a block size of 64 MB: the file is stored as 16 blocks, and a MapReduce job using this file as input creates 16 input splits, each processed independently as the input of a separate map task.
Now suppose the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS stores the file as 16 blocks. However, creating a split for each block is useless, because it is impossible to start reading a gzip data stream at an arbitrary point, so a map task cannot read the data of its block independently of the other blocks. The gzip format stores the compressed data as DEFLATE, and DEFLATE stores the data as a series of compressed blocks. The problem is that the start of each block is not marked in any way that would let a reader positioned at an arbitrary point in the stream locate the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce does not split the gzip-format file, because it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. So a single map task processes all 16 HDFS blocks, most of which are not local to that map task. Meanwhile, with fewer map tasks, the job is partitioned at too coarse a granularity, and the run time is likely longer.
6. Compression codecs
1. Evaluating a compression codec
A compression codec can be evaluated by the following three criteria:
- Compression ratio: the higher the ratio, the smaller the compressed file, so a higher compression ratio is better.
- Compression time: the faster the better.
- Whether the compressed file format is splittable: a splittable format allows a single file to be processed in parallel by multiple Mapper programs.
2. Codec comparison
- BZip2 has the highest compression ratio, but it also brings higher CPU overhead; Gzip is next after BZip2. If disk utilization and I/O are the main considerations, both algorithms are attractive.
- Snappy and LZO have faster decompression speed; if compression and decompression speed matter more, they are a good choice. Snappy and LZO compress at roughly the same speed, but Snappy decompresses faster than LZO.
- Hadoop splits large files into HDFS-block-sized (default 64 MB) splits, each handled by one Mapper program. Among these codecs, BZip2, LZO, and Snappy compression is splittable; Gzip does not support splitting.
7. Common compression formats

| Compression format | Compressed size | Compression speed | Splittable |
| --- | --- | --- | --- |
| GZIP | medium | medium | no |
| BZIP2 | small | slow | yes |
| LZO | large | fast | yes |
| Snappy | large | fast | yes |

Note: "splittable" here means: a local file is compressed with the given algorithm and then uploaded to HDFS; during the subsequent MapReduce computation, the mapper stage supports splitting the compressed file, and the resulting splits are valid.
The Hadoop encoders/decoders are shown in the table below.

| Compression format | Corresponding encoder/decoder |
| --- | --- |
| DEFAULT | org.apache.hadoop.io.compress.DefaultCodec |
| Gzip | org.apache.hadoop.io.compress.GzipCodec |
| BZip2 | org.apache.hadoop.io.compress.BZip2Codec |
| DEFLATE | org.apache.hadoop.io.compress.DeflateCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec (for intermediate output) |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec (for intermediate output) |
8. Comparison of results
Size before compression: 74.1 G. Compressed directory sizes:

| | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | 23.8 G | 25.2 G | 22.5 G | 21.9 G |
| Snappy | 39.5 G | 41.3 G | 39.1 G | 21.9 G |
| BZip | 17.9 G | 18.9 G | 18.7 G | 21.9 G |
| LZO | 39.9 G | 41.8 G | 40.8 G | 21.9 G |
Compressed file names:

| | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | *.gz | 000000_0 | 000000_1000 | 000000_0 |
| Snappy | *.snappy | 000000_0 | 000000_1 | 000000_0 |
| BZip | *.bz2 | 000000_0 | 000000_1000 | 000000_0 |
| LZO | *.lz4 | 000000_2 | 000000_0 | 000000_0 |
Data load times:

| | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | 81.329s | 81.67s | 136.6s | 76.6s |
| Snappy | 226s | 180s | 79.8s | 75s |
| BZip | 138.2s | 134s | 145.9s | 98.3s |
| LZO | 231.8s | 234s | 86.1s | 248.1s |
Query speed (select count(1) from table_name):

| | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | 46.2s | 50.4s | 44.3s | 38.3s |
| Snappy | 46.3s | 54.3s | 42.2s | 40.3s |
| BZip | 114.3s | 110.3s | 40.3s | 38.2s |
| LZO | 60.3s | 52.2s | 42.2s | 50.3s |
Summary:
- Compression ratio: BZip > Gzip > Snappy > LZO; however, for the ORC storage type the compressed size is the same regardless of the compression type.
- Compression time: Gzip < BZip < Snappy < LZO; however, for the RCFile and ORC storage types, Snappy gives the shortest compression time.
- Query time: GZip < Snappy < LZO < Bzip; however, for the RCFile and ORC storage types, BZip gives the shortest query time.
Recommended compression types:
1) BZip and Gzip have very good compression ratios but incur higher CPU cost; if disk utilization and I/O are the priority, these two algorithms can be considered.
2) LZO and Snappy have faster decompression speed; if compression and decompression speed matter, they are a good choice. When the Hive table storage type is RCFile or ORC, Snappy and LZO have comparable decompression efficiency, but Snappy compresses better than LZO.
3) Hadoop splits large files into HDFS-block-sized splits, each handled by one Mapper program. Among these compression algorithms, Bzip2, LZO, and Snappy can be split; Gzip does not support splitting.
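Following recommendation 1), a minimal session-level sketch for BZip2 output, mirroring the SET pattern used in the tests above (this is an illustration, not a configuration that was benchmarked here):

```sql
-- BZip2: the best compression ratio in these tests, and splittable.
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
set io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec;
```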