Hive compression

Why use compression

As data volumes grow, processing efficiency depends more and more on how the data is stored, so choosing and using the right compression becomes particularly important.
Advantages of compression:
1) reduces file size
2) saves disk space
3) increases transfer speed at a given data rate


Compression Technology

Compression is divided into lossless compression (Lossless Compression) and lossy compression (Lossy Compression).
Lossless compression is generally used for data such as user behavior logs, where the business scenario does not tolerate any data loss.
Lossy compression is generally used for large files such as pictures and videos; its advantage is a relatively high compression ratio, which saves more space.
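To make the "lossless" property concrete, here is a minimal sketch using gzip from the Python standard library. The log lines are made up for illustration and are not from the original article.

```python
import gzip

# Hypothetical user-behavior records; the field layout is invented for this example.
records = [f"user_{i % 50},click,page_{i % 7}\n" for i in range(10_000)]
original = "".join(records).encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Lossless means the decompressed bytes are identical to the input.
is_lossless = restored == original
ratio = len(compressed) / len(original)
print(f"lossless={is_lossless}, compressed to {ratio:.1%} of the original size")
```

A lossy codec (JPEG for pictures, for example) would fail the equality check by design, which is why it is not used for behavior data.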

Take offline data processing as an example. Compression can be applied at three stages: input, intermediate, and output. Collected logs are compressed and written to HDFS as input, Spark / MapReduce decompresses them for computation, and the results are compressed again before being written to the target data store.


Compression comparison

Compression brings the benefits described above, but it also consumes a relatively large amount of CPU, so the best choice is the codec that gives the most cost-effective trade-off for the workload.

Compression format   Tool    Algorithm   File extension   Supports split
gzip                 gzip    DEFLATE     .gz              ×
bzip2                bzip2   bzip2       .bz2             √
LZO                  lzop    LZO         .lzo             √ (if indexed)
LZ4                  LZ4     LZ4         .lz4             ×
Snappy               N/A     Snappy      .snappy          ×
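The "supports split" column matters because each MapReduce task wants to start reading at its own offset in the file. The sketch below is a simplified illustration, not how Hadoop computes splits: it only shows that a gzip stream cannot be decoded when a reader is dropped into the middle of it. The helper name can_start_at and the sample data are assumptions made for this example.

```python
import gzip
import zlib

# Sample data standing in for a large log file.
data = b"".join(b"line %d of a log file\n" % i for i in range(50_000))
blob = gzip.compress(data)

def can_start_at(stream: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(stream[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No valid gzip header or DEFLATE state at this position.
        return False

print("from the beginning:", can_start_at(blob, 0))
print("from the middle:   ", can_start_at(blob, len(blob) // 2))
```

Splittable formats avoid this problem by giving the reader a place to resynchronize, which is what the index provides for LZO.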

Compressed size comparison in Hadoop:
Compression ratio: BZIP2 > GZIP > LZO


Compression time comparison:
Compression ratio and compression speed generally trade off against each other: the harder a codec works to shrink the data, the longer compression takes.
For example, BZIP2 compresses best, but it is the slowest to compress and decompress.
Compression speed: LZO > GZIP > BZIP2
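The trade-off can be measured directly. The following is a rough sketch that times gzip and bzip2 from the Python standard library on synthetic log data. LZO is left out because it is not in the standard library, and the payload is invented for this example, so the numbers will differ from real Hadoop workloads.

```python
import bz2
import gzip
import time

# Synthetic log lines; real data will compress differently.
payload = b"".join(
    b"2019-03-10 12:00:%02d INFO user=%d action=view\n" % (i % 60, i % 1000)
    for i in range(100_000)
)

def measure(name, compress, decompress):
    """Compress once, verify the round trip, and report ratio and time."""
    start = time.perf_counter()
    packed = compress(payload)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == payload  # both codecs are lossless
    return {"codec": name, "ratio": len(packed) / len(payload), "seconds": elapsed}

results = [
    measure("gzip", gzip.compress, gzip.decompress),
    measure("bzip2", bz2.compress, bz2.decompress),
]
for r in results:
    print(f"{r['codec']:6s} ratio={r['ratio']:.3f} time={r['seconds']:.3f}s")
```

A single run on one machine is noisy; repeat the measurement on a sample of the actual data before choosing a codec.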

Advantages and disadvantages

gzip
Advantages:
relatively high compression ratio
fairly fast decompression
supported by Hadoop itself; Linux systems ship with the gzip command
Disadvantages:
does not support split


bzip2
Advantages:
high compression ratio
supports split
supported by Hadoop itself, though without native library support
Disadvantages:
compression / decompression speed is slow


LZO
Advantages:
relatively fast compression / decompression with a reasonable compression ratio
supports split (if indexed); a popular compression format in Hadoop
supported by the Hadoop native libraries
Disadvantages:
compression ratio is lower than gzip
not bundled with Hadoop; it must be installed separately


snappy
Advantages:
fast compression with a reasonable compression ratio
supported by the Hadoop native libraries
Disadvantages:
does not support split
compression ratio is lower than gzip
not bundled with Hadoop; it must be installed separately
no corresponding command on Linux systems


Common codecs

Compression format   Class
Zlib                 org.apache.hadoop.io.compress.DefaultCodec
Gzip                 org.apache.hadoop.io.compress.GzipCodec
Bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  com.hadoop.compression.lzo.LzoCodec
LZ4                  org.apache.hadoop.io.compress.Lz4Codec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
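Hadoop normally picks a codec for an input file from its file name extension. The sketch below imitates that lookup by combining the extension column and the class column from the two tables in this article. The dictionary and the function codec_for are illustrative names, not Hadoop APIs, and the .lzo entry simply follows the tables here.

```python
from typing import Optional

# Extension -> codec class, combined from the two tables in this article.
CODEC_BY_EXTENSION = {
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzoCodec",
    ".lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}

def codec_for(path: str) -> Optional[str]:
    """Return the codec class whose extension matches `path`, or None."""
    for extension, codec_class in CODEC_BY_EXTENSION.items():
        if path.endswith(extension):
            return codec_class
    return None  # no match: treat the file as uncompressed

print(codec_for("/logs/2019/part-00000.gz"))
print(codec_for("/logs/2019/part-00000.txt"))
```

This is why keeping the standard extension on compressed files matters: a renamed file may be read as plain text.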

Configuring compression in Hadoop

core-site.xml
The codec classes here are wrapped onto separate lines only for easy reading. In production, write the value on a single line without spaces or line breaks.

<property>
	<name>io.compression.codecs</name>
	<value>
		org.apache.hadoop.io.compress.GzipCodec,
		org.apache.hadoop.io.compress.DefaultCodec,
		org.apache.hadoop.io.compress.BZip2Codec
	</value>
</property>
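Since the wrapped value above is for reading only, one way to produce the production form is to strip the whitespace programmatically. This is a small sketch using the Python standard library; the function name flatten_codecs is made up for this example.

```python
import xml.etree.ElementTree as ET

# The wrapped property as shown in this article.
WRAPPED = """<property>
	<name>io.compression.codecs</name>
	<value>
		org.apache.hadoop.io.compress.GzipCodec,
		org.apache.hadoop.io.compress.DefaultCodec,
		org.apache.hadoop.io.compress.BZip2Codec
	</value>
</property>"""

def flatten_codecs(property_xml: str) -> str:
    """Rewrite the <value> as one comma-separated line with no whitespace."""
    prop = ET.fromstring(property_xml)
    value = prop.find("value")
    classes = [c.strip() for c in value.text.split(",") if c.strip()]
    value.text = ",".join(classes)
    return ET.tostring(prop, encoding="unicode")

flat = flatten_codecs(WRAPPED)
print(flat)
```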

mapred-site.xml (only the final output is configured here; configure the intermediate output yourself)

<property>   
	<name>mapreduce.output.fileoutputformat.compress</name>
	<value>true</value>
</property>

<property>
	<name>mapreduce.output.fileoutputformat.compress.codec</name>
	<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

Note: the properties above apply to the final (reduce) output. To compress the map output instead, use the corresponding mapreduce.map.output properties.
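For the map output mentioned in the note, the Hadoop 2 property names are mapreduce.map.output.compress and mapreduce.map.output.compress.codec. The sketch below renders them as mapred-site.xml entries. The choice of SnappyCodec is an assumption for illustration (any codec from the table above could be used), and to_property_xml is a hypothetical helper.

```python
from xml.sax.saxutils import escape

# Map (intermediate) output settings; the codec choice is an assumption.
MAP_OUTPUT_SETTINGS = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
}

def to_property_xml(name: str, value: str) -> str:
    """Render one Hadoop <property> element in the style used in this article."""
    return (
        "<property>\n"
        f"\t<name>{escape(name)}</name>\n"
        f"\t<value>{escape(value)}</value>\n"
        "</property>"
    )

snippet = "\n\n".join(to_property_xml(k, v) for k, v in MAP_OUTPUT_SETTINGS.items())
print(snippet)
```

Intermediate data is written and read once within the job, so a fast codec is usually preferred there over one with a high compression ratio.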

Origin blog.csdn.net/aubekpan/article/details/88391684