Hadoop Data Compression
Compression can significantly reduce the number of bytes read from and written to the underlying storage system (HDFS), make better use of network bandwidth and disk space, cut disk I/O, and speed up MapReduce jobs. When a MapReduce program runs, I/O operations, network data transfer, Shuffle, and Merge consume a great deal of time, especially when the data scale is large and the workload is intensive, so data compression matters.
Compression reduces disk I/O but increases the CPU load. Used appropriately, it improves performance; used inappropriately, it can hurt performance.
The basic principle: compress less for compute-intensive jobs and more for I/O-intensive jobs.
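The CPU-versus-ratio trade-off can be seen even without Hadoop: Gzip and DEFLATE in Hadoop use the same DEFLATE algorithm that ships with the JDK in `java.util.zip`. A minimal, Hadoop-free sketch comparing compression levels (the class name and sample data are illustrative):

```java
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: plain-JDK DEFLATE round trip showing the
// speed-versus-ratio trade-off between compression levels.
public class DeflateDemo {

    // Compress data with the given level (BEST_SPEED .. BEST_COMPRESSION).
    static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        // Buffer is generous; our sample data is highly compressible.
        byte[] buf = new byte[data.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Decompress back into a buffer of the known original length.
    static byte[] decompress(byte[] compressed, int originalLen) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLen];
        inflater.inflate(out);
        inflater.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hadoop compression demo ".repeat(200).getBytes();
        byte[] fast = compress(data, Deflater.BEST_SPEED);
        byte[] best = compress(data, Deflater.BEST_COMPRESSION);
        System.out.println("original=" + data.length
                + " fast=" + fast.length + " best=" + best.length);
        // Round-trip check: decompression restores the original bytes.
        assert Arrays.equals(data, decompress(best, data.length));
    }
}
```

Higher levels burn more CPU for a smaller output, which is exactly the trade-off behind the rule of thumb above.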
Compression formats supported by MapReduce
Compression format | Bundled with Hadoop? | Algorithm | File extension | Splittable? | Changes needed after switching to the compressed format |
---|---|---|---|---|---|
DEFLATE | Yes, usable directly | DEFLATE | .deflate | No | None; files are processed like plain text |
Gzip | Yes, usable directly | DEFLATE | .gz | No | None; files are processed like plain text |
bzip2 | Yes, usable directly | bzip2 | .bz2 | Yes | None; files are processed like plain text |
LZO | No, must be installed | LZO | .lzo | Yes | An index must be built and the input format specified |
Snappy | No, must be installed | Snappy | .snappy | No | None; files are processed like plain text |
Compression format | Corresponding encoder/decoder |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Gzip compression
Advantages: relatively high compression ratio and fairly fast compression/decompression. Hadoop supports it natively, so Gzip files are processed in applications exactly like plain text. Most Linux systems ship with the gzip command, making it easy to use.
Disadvantages: does not support splitting.
Application scenarios: consider Gzip when each compressed file stays within about 130 MB (one block size). For example, compress one day's or one hour's logs into a single Gzip file.
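The log-archiving scenario can be sketched with the JDK's own Gzip streams (no Hadoop required; the class name and sample log line are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: compress a block of log text to gzip format and back,
// the same .gz format Hadoop's GzipCodec produces.
public class GzipRoundTrip {

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } // closing the stream writes the gzip trailer
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // A repetitive "hourly log" compresses very well.
        byte[] log = "2023-01-01 INFO request handled\n".repeat(1000).getBytes();
        byte[] packed = gzip(log);
        System.out.println(log.length + " bytes -> " + packed.length + " bytes");
    }
}
```

The same bytes written to a file named `something.gz` would be picked up transparently by Hadoop's extension-based codec detection.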
bzip2 compression
Advantages: supports splitting and has a very high compression ratio, higher than Gzip's. Bundled with Hadoop and easy to use.
Disadvantages: slow compression/decompression.
Application scenarios: suitable when speed is not critical but a high compression ratio is required; when large output data must be compressed and archived to save disk space and will rarely be used afterward; or when a single large text file must be compressed to save storage while still supporting splitting and remaining compatible with existing applications.
Lzo compression
Advantages: fast compression/decompression with a reasonable compression ratio; supports splitting; one of the most popular compression formats in Hadoop. The lzop command can be installed on Linux and is easy to use.
Disadvantages: lower compression ratio than Gzip; not bundled with Hadoop, so it must be installed; Lzo files require special handling in applications (to support splitting, an index must be built and the InputFormat set to the Lzo input format).
Application scenario: large text files that remain larger than 200 MB after compression; the larger the single file, the more pronounced Lzo's advantage.
Snappy compression
Advantages: very fast compression and a reasonable compression ratio.
Disadvantages: does not support splitting; lower compression ratio than Gzip; not bundled with Hadoop, so it must be installed.
Application scenarios: as the compression format for the intermediate Map-to-Reduce data when a job's Map output is large, or as the output of one MapReduce job that serves as the input of another.
Compression parameter configuration
Parameter | Default | Stage | Recommendation |
---|---|---|---|
io.compression.codecs (in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
mapreduce.map.output.compress (in mapred-site.xml) | false | Mapper output | Set to true to enable compression |
mapreduce.map.output.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | LZO or Snappy is commonly used to compress data at this stage |
mapreduce.output.fileoutputformat.compress (in mapred-site.xml) | false | Reducer output | Set to true to enable compression |
mapreduce.output.fileoutputformat.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use standard tools or codecs such as gzip and bzip2 |
mapreduce.output.fileoutputformat.compress.type (in mapred-site.xml) | RECORD | Reducer output | Compression type applied to SequenceFile output: NONE, RECORD, or BLOCK |
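The map-side and reduce-side settings above can also be fixed cluster-wide instead of per job. A sketch of the corresponding mapred-site.xml entries, using the property names from the table (the choice of Snappy here is illustrative):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Per-job settings made in a driver class override these site-wide defaults.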
Compress/decompress
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;
public class TestCompress {
public static void main(String[] args) throws Exception {
compress("e:/hello.txt","org.apache.hadoop.io.compress.BZip2Codec");
decompress("e:/hello.txt.bz2");
}
private static void compress(String filename, String method) throws Exception {
// 1. Open the input stream
FileInputStream fis = new FileInputStream(new File(filename));
Class<?> codecClass = Class.forName(method);
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());
// 2. Open the output stream, appending the codec's default extension
FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
CompressionOutputStream cos = codec.createOutputStream(fos);
// 3. Copy the input stream to the compressed output stream
IOUtils.copyBytes(fis, cos, 1024*1024*5, false);
// 4. Close resources
cos.close();
fos.close();
fis.close();
}
private static void decompress(String filename) throws FileNotFoundException, IOException {
// 1. Check whether a codec can be found for this file (by extension)
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodec(new Path(filename));
if (codec == null) {
System.out.println("cannot find codec for file " + filename);
return;
}
// 2. Open a decompressing input stream
CompressionInputStream cis = codec.createInputStream(new FileInputStream(new File(filename)));
// 3. Open the output stream
FileOutputStream fos = new FileOutputStream(new File(filename + ".decoded"));
// 4. Copy streams
IOUtils.copyBytes(cis, fos, 1024*1024*5, false);
// 5. Close resources
cis.close();
fos.close();
}
}
Even when a MapReduce job's input and output files are uncompressed, the intermediate output of the Map tasks can still be compressed, because it must be spilled to disk and transferred over the network to the Reduce nodes; compressing it can improve performance considerably. Just set two properties in the driver class.
Enabling compression for the Map output:
// Enable compression for the map-side output
configuration.setBoolean("mapreduce.map.output.compress", true);
// Set the map-side output compression codec
configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
Enabling compression for the Reduce output:
// Enable compression for the reduce-side output
FileOutputFormat.setCompressOutput(job, true);
// Set the compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);