Hadoop data compression summary

Overview

Compression effectively reduces the number of bytes read from and written to the underlying storage system (HDFS), making better use of network bandwidth and disk space. Data compression is important in Hadoop, especially when the data scale is large and the workload is intensive: in that case I/O operations and network transfers consume a great deal of time, and the Shuffle and Merge phases also come under tremendous I/O pressure.

Given that disk I/O and network bandwidth are precious resources in Hadoop, data compression is very helpful for saving them and minimizing disk I/O and network transfer. However, although compression and decompression operations have low CPU overhead, their performance improvements and resource savings are not without cost.

If disk I/O and network bandwidth limit the performance of a MapReduce job, enabling compression at any MapReduce stage can improve end-to-end processing time and reduce I/O and network traffic.

An optimization strategy for MapReduce: compress the output of the Mapper or Reducer with a compression codec to reduce disk I/O and speed up the MR program (at the cost of additional CPU work for compression and decompression).
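
For illustration, here is a minimal driver sketch (the class name and job wiring are illustrative, not from the original article) showing where compression is switched on for the map output and for the final reducer output:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output to cut shuffle I/O.
        // Snappy trades some compression ratio for speed; it assumes the
        // native Snappy library is installed on the cluster.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");
        job.setJarByClass(CompressionDriver.class);
        // ... set mapper, reducer, and key/value classes here ...

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the final reducer output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```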

Note: Used properly, compression can improve performance; used improperly, it can just as easily reduce it.

The basic principle:

(1) For compute-intensive jobs, use compression sparingly.

(2) For I/O-intensive jobs, use compression.

Compression formats supported by MR

| Compression format | Built into Hadoop? | Algorithm | File extension | Splittable? | Program changes needed after switching to this format? |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | Yes, use directly | DEFLATE | .deflate | No | None; processed the same as plain text |
| gzip | Yes, use directly | DEFLATE | .gz | No | None; processed the same as plain text |
| bzip2 | Yes, use directly | bzip2 | .bz2 | Yes | None; processed the same as plain text |
| LZO | No, must be installed | LZO | .lzo | Yes | An index must be built and the input format specified |
| Snappy | No, must be installed | Snappy | .snappy | No | None; processed the same as plain text |

To support multiple compression/decompression algorithms, Hadoop provides a set of codecs (encoder/decoders), as shown in the following table:

| Compression format | Corresponding codec |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
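
As a small, hedged sketch (paths and class name are illustrative), the codecs above can be resolved at runtime with CompressionCodecFactory, which maps a file extension to its codec; this mirrors how Hadoop picks a codec for job input:

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressByExtension {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]); // e.g. a .gz, .bz2, or .deflate file

        // The factory maps ".gz" -> GzipCodec, ".bz2" -> BZip2Codec, etc.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);
        if (codec == null) {
            System.err.println("No codec found for " + input);
            return;
        }

        // Strip the codec's extension for the decompressed output path.
        Path output = new Path(CompressionCodecFactory.removeSuffix(
                input.toString(), codec.getDefaultExtension()));

        try (InputStream in = codec.createInputStream(fs.open(input));
             OutputStream out = fs.create(output)) {
            IOUtils.copyBytes(in, out, 4096); // stream-copy while decompressing
        }
    }
}
```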

Comparison of compression performance

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
| bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
| LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |

http://google.github.io/snappy/
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Compression method selection

  • Gzip compression

    Advantages: relatively high compression ratio and fast compression/decompression speed; Hadoop supports it natively, so processing gzip files in an application is the same as processing plain text; most Linux systems ship with the gzip command, making it easy to use.

    Disadvantage: splitting is not supported.

    Application scenario: consider gzip when each file compresses to within about 130M (one block size), e.g. compressing one day's or one hour's logs into a single gzip file and running the MapReduce program over multiple gzip files to achieve concurrency. Hive programs, streaming programs, and MapReduce programs written in Java handle them exactly like plain text; the original program needs no modification after compression.

  • Bzip2 compression

    Advantages: supports splitting; very high compression ratio, higher than gzip; supported by Hadoop itself; the bzip2 command ships with Linux systems and is easy to use.

    Disadvantages: slow compression/decompression speed; native libraries (the Java/C interoperability interface) are not supported.

    Application scenarios: when speed requirements are low but a high compression ratio is needed, e.g. as the output format of MapReduce jobs; when the output data is large and the processed data must be compressed and archived to reduce disk space while being accessed rarely afterwards; or when a single large text file must be compressed to save storage space while still supporting splitting and remaining compatible with existing applications (i.e. the applications need no modification).

  • Lzo compression

    Advantages: fairly fast compression/decompression speed with a reasonable compression ratio; supports splitting, making it the most popular compression format for Hadoop; the lzop command can be installed on Linux systems and is easy to use.

    Disadvantages: compression ratio lower than gzip; not built into Hadoop, so it must be installed; LZO files need some special handling in applications (to support splitting, an index must be built and the input format must be set to the LZO format; see the sketch after this list).

    Application scenario: large text files that are still larger than 200M after compression; the larger the single file, the more pronounced LZO's advantage.

  • Snappy compression

    Advantages: fast compression speed and a reasonable compression ratio.

    Disadvantages: splitting is not supported; compression ratio lower than gzip; not built into Hadoop, so it must be installed.

    Application scenario: when the map output of a MapReduce job is large, as the compression format for the intermediate data between Map and Reduce; or as the output of one MapReduce job that serves as the input of another.
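
The indexing step mentioned for LZO above looks roughly like the following sketch. It assumes the third-party hadoop-lzo library (the com.hadoop.* classes, not part of Apache Hadoop) is installed; class and path names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import com.hadoop.compression.lzo.LzoIndexer;   // from hadoop-lzo
import com.hadoop.mapreduce.LzoTextInputFormat; // from hadoop-lzo

public class LzoJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Step 1: build the .index file next to the .lzo file. Without it,
        // the whole file goes to a single map task (no splitting).
        new LzoIndexer(conf).index(new Path(args[0])); // e.g. /logs/big.lzo

        // Step 2: use the LZO-aware input format so splits honor the index.
        Job job = Job.getInstance(conf, "lzo input demo");
        job.setInputFormatClass(LzoTextInputFormat.class);
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}
```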

Compression location selection

Compression can be enabled at any stage of MapReduce: on the input data (Hadoop picks a codec based on the file extension), on the map output (reducing shuffle traffic between Map and Reduce), and on the reducer output (reducing the size of the stored result).

Compression configuration parameters

To enable compression in Hadoop, you can configure the following parameters:

| Parameter | Default | Stage | Recommendation |
| --- | --- | --- | --- |
| io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec | input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress (configured in mapred-site.xml) | false | mapper output | Set to true to enable compression |
| mapreduce.map.output.compress.codec (configured in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | mapper output | Use LZO or Snappy to compress data at this stage |
| mapreduce.output.fileoutputformat.compress (configured in mapred-site.xml) | false | reducer output | Set to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec (configured in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | reducer output | Use a standard codec such as gzip or bzip2 |
| mapreduce.output.fileoutputformat.compress.type (configured in mapred-site.xml) | RECORD | reducer output | Compression type for SequenceFile output: NONE, RECORD, or BLOCK |
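
The last two rows of the table apply to SequenceFile output. Here is a small sketch of the corresponding driver calls (job wiring is elided; the BZip2Codec choice is just one option from the table):

```java
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileOutputConfig {
    static void configure(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // mapreduce.output.fileoutputformat.compress = true
        FileOutputFormat.setCompressOutput(job, true);
        // mapreduce.output.fileoutputformat.compress.codec
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // mapreduce.output.fileoutputformat.compress.type: BLOCK compresses
        // batches of records together and usually beats the default RECORD.
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}
```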

Source: blog.csdn.net/qq_45092505/article/details/105428943