Hadoop data compression
Overview
Compression can substantially reduce the number of bytes read from and written to the underlying storage system (HDFS), improving the utilization of disk space and network bandwidth. In Hadoop, data compression matters most when data volumes are large and workloads are intensive: in that situation, I/O operations and network transfers take up much of the running time, and the Shuffle and Merge phases put heavy pressure on I/O as well.
Since disk I/O and network bandwidth are scarce resources in a Hadoop cluster, compression helps conserve both and minimizes disk I/O and network transfer. However, compression and decompression consume CPU, so the performance gains and resource savings are not free.
When a MapReduce job is bound by disk I/O or network bandwidth, enabling compression at any MapReduce stage can shorten end-to-end processing time and reduce I/O and network traffic.
A common optimization strategy for MapReduce is to compress the output of the Mapper and/or the Reducer with a compression codec. This reduces disk I/O and speeds up the MR program, at the cost of extra CPU work.
Note: used properly, compression improves performance; used improperly, it can just as easily degrade it.
The basic principles:
(1) For compute-intensive jobs, use less compression.
(2) For I/O-intensive jobs, use more compression.
Compression codecs supported by MapReduce
Compression format | Built into Hadoop? | Algorithm | File extension | Splittable? | Application changes needed? |
---|---|---|---|---|---|
DEFLATE | Yes, usable directly | DEFLATE | .deflate | No | None; handled like plain text |
Gzip | Yes, usable directly | DEFLATE | .gz | No | None; handled like plain text |
Bzip2 | Yes, usable directly | bzip2 | .bz2 | Yes | None; handled like plain text |
LZO | No, requires installation | LZO | .lzo | Yes | Requires building an index and specifying the LZO input format |
Snappy | No, requires installation | Snappy | .snappy | No | None; handled like plain text |
To support multiple compression/decompression algorithms, Hadoop provides a set of codecs (encoders/decoders), shown in the following table:
Compression format | Corresponding encoder/decoder |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Comparison of compression performance
Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
---|---|---|---|---|
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
http://google.github.io/snappy/
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
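As a rough, Hadoop-free illustration of the speed/ratio trade-off in the tables above, the following Python sketch compresses the same repetitive sample data with the standard-library gzip and bz2 modules (these are the same underlying algorithms, but the absolute numbers will differ from the benchmark figures):

```python
import bz2
import gzip

# Repetitive sample data stands in for a large log file.
data = b"2023-01-01 INFO request handled in 12ms\n" * 10_000

gz = gzip.compress(data)
bz = bz2.compress(data)

print(f"original: {len(data)} bytes")
print(f"gzip:     {len(gz)} bytes")
print(f"bzip2:    {len(bz)} bytes")

# Both codecs are lossless: decompression must round-trip exactly.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```

On typical log-like input, bzip2 produces the smaller output but takes noticeably longer, matching the pattern in the comparison table.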
Compression method selection
Gzip compression
Advantages: relatively high compression ratio and fast compression/decompression; supported by Hadoop out of the box, so applications handle gzip files exactly as they do plain text; most Linux systems ship with the gzip command, making it easy to use.
Disadvantage: splitting is not supported.
Application scenario: consider gzip when each compressed file is around 130 MB or less (within one HDFS block). For example, compress one day's or one hour's logs into a single gzip file, and run the MapReduce program over many such files to get concurrency. Hive programs, streaming programs, and MapReduce programs written in Java handle gzip exactly like text, so existing programs need no modification after switching to compressed input.
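The "same as processing text" point can be seen outside Hadoop too. In this small Python sketch (standard library only; the file name is hypothetical), a .gz log is written and then read back line by line exactly like an uncompressed text file:

```python
import gzip
import os
import tempfile

# Hypothetical one-hour log, written directly in gzip format.
lines = ["10:00 GET /index\n", "10:01 GET /about\n"]
path = os.path.join(tempfile.mkdtemp(), "access.log.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.writelines(lines)

# Reading back is line-oriented, just like plain text.
with gzip.open(path, "rt", encoding="utf-8") as f:
    read_back = list(f)

assert read_back == lines
```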
Bzip2 compression
Advantages: supports splitting; very high compression ratio, higher than gzip; supported by Hadoop out of the box (though without native-library support); Linux systems ship with the bzip2 command, making it easy to use.
Disadvantages: slow compression/decompression; no native-library support.
Application scenarios: jobs where speed matters less than a high compression ratio, for example as the output format of a MapReduce job; archiving large output data that will rarely be used afterwards, to save disk space; or compressing a single large text file to reduce storage while still needing split support and compatibility with existing applications (i.e. no code changes).
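One reason bzip2 can be split is that a .bz2 file is a sequence of independently decodable compressed blocks, so a reader can start at a block boundary inside the file. The standard-library sketch below is a loose analogy (not Hadoop's actual split logic): two chunks compressed independently concatenate into one valid bzip2 stream, and each part also decompresses on its own.

```python
import bz2

# Two chunks compressed independently, as if by separate writers.
part1 = bz2.compress(b"first half of the data... ")
part2 = bz2.compress(b"second half of the data")

# The concatenation is still a valid multi-stream .bz2 file...
whole = bz2.decompress(part1 + part2)
assert whole == b"first half of the data... second half of the data"

# ...and each part can be decoded without the other.
assert bz2.decompress(part2) == b"second half of the data"
```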
Lzo compression
Advantages: fairly fast compression/decompression with a reasonable compression ratio; supports splitting, making it one of the most popular compression formats in Hadoop; the lzop command can be installed on Linux systems and is easy to use.
Disadvantages: lower compression ratio than gzip; not bundled with Hadoop, so it must be installed; LZO files need special handling in the application (to support splitting, an index must be built and the input format must be set to the LZO format).
Application scenario: large text files that are still bigger than about 200 MB after compression; the larger the individual file, the more pronounced LZO's advantage.
Snappy compression
Advantages: very fast compression and decompression with a reasonable compression ratio.
Disadvantages: splitting is not supported; lower compression ratio than gzip; not bundled with Hadoop, so it must be installed.
Application scenario: compressing the intermediate data between Map and Reduce when a MapReduce job's map output is large, or compressing the output of one MapReduce job that serves as the input of another.
Compression location selection
Compression can be enabled at any of the three stages of a MapReduce job: on the input data (before the map phase), on the intermediate map output (between Map and Reduce), and on the final reducer output.
Compression configuration parameters
To enable compression in Hadoop, you can configure the following parameters:
Parameter | Default | Stage | Recommendation |
---|---|---|---|
io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
mapreduce.map.output.compress (configured in mapred-site.xml) | false | Mapper output | Set to true to enable compression |
mapreduce.map.output.compress.codec (configured in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Use a fast codec such as LZO or Snappy at this stage |
mapreduce.output.fileoutputformat.compress (configured in mapred-site.xml) | false | Reducer output | Set to true to enable compression |
mapreduce.output.fileoutputformat.compress.codec (configured in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard codec such as gzip or bzip2 |
mapreduce.output.fileoutputformat.compress.type (configured in mapred-site.xml) | RECORD | Reducer output | Compression type for SequenceFile output: NONE, RECORD, or BLOCK (BLOCK is recommended) |
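As a concrete sketch, the mapred-site.xml entries below enable Snappy for intermediate map output and gzip for the final reducer output, using the property names from the table above (adjust the codecs to whatever is actually installed on your cluster):

```xml
<configuration>
  <!-- Compress intermediate Map-to-Reduce data with a fast codec. -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <!-- Compress final job output with a standard codec. -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
</configuration>
```

The same properties can also be set per job, for example on the command line with `-D mapreduce.map.output.compress=true`.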