Hadoop MapReduce: Data Compression

    Compression effectively reduces the number of bytes read from and written to the underlying storage system (HDFS), makes better use of network bandwidth and disk space, reduces disk I/O, and speeds up MapReduce jobs. When a MapReduce program runs, I/O operations, network data transfer, Shuffle, and Merge consume a great deal of time, especially at large data scales under heavy workloads, so using data compression is very important.
    Compression reduces disk I/O but increases the CPU load. Used properly, it improves performance; used improperly, it can just as easily degrade it.
    The basic rule of thumb: compress less for compute-intensive jobs, and compress more for I/O-intensive jobs.

Compression codecs supported by MapReduce

Compression format | Built into Hadoop? | Algorithm | File extension | Splittable? | Application changes after switching to this format
DEFLATE | Yes, usable directly | DEFLATE | .deflate | No | None; processed the same as plain text
Gzip | Yes, usable directly | DEFLATE | .gz | No | None; processed the same as plain text
bzip2 | Yes, usable directly | bzip2 | .bz2 | Yes | None; processed the same as plain text
LZO | No, must be installed | LZO | .lzo | Yes | An index must be built and the input format must be specified
Snappy | No, must be installed | Snappy | .snappy | No | None; processed the same as plain text

Compression format | Corresponding encoder/decoder
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec
gzip | org.apache.hadoop.io.compress.GzipCodec
bzip2 | org.apache.hadoop.io.compress.BZip2Codec
LZO | com.hadoop.compression.lzo.LzopCodec
Snappy | org.apache.hadoop.io.compress.SnappyCodec
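
    As a quick cross-check of the two tables, the built-in codecs can be instantiated by class name and asked for their default file extensions. A minimal sketch, assuming only that the Hadoop client libraries are on the classpath (the class name ListCodecExtensions is a placeholder, not from the original article):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class ListCodecExtensions {
	public static void main(String[] args) throws Exception {
		String[] codecClasses = {
				"org.apache.hadoop.io.compress.DefaultCodec",
				"org.apache.hadoop.io.compress.GzipCodec",
				"org.apache.hadoop.io.compress.BZip2Codec"};
		Configuration conf = new Configuration();
		for (String name : codecClasses) {
			// Instantiate the codec by class name, the same way the compress() example below does
			CompressionCodec codec = (CompressionCodec) ReflectionUtils
					.newInstance(Class.forName(name), conf);
			// Prints ".deflate", ".gz", ".bz2" respectively
			System.out.println(name + " -> " + codec.getDefaultExtension());
		}
	}
}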

Gzip compression

    Advantages: relatively high compression ratio and fairly fast compression/decompression. Hadoop supports it natively, so Gzip files are processed in an application the same way as plain text. Most Linux systems ship with the gzip command, so it is easy to use.
    Disadvantages: does not support splitting.
    Application scenarios: consider Gzip when each file compresses to within about 130 MB (under one block size), for example compressing one day's or one hour's logs into a single Gzip file.

bzip2 compression

    Advantages: supports splitting and has a very high compression ratio, higher than Gzip's. Hadoop supports it natively, so it is easy to use.
    Disadvantages: slow compression/decompression speed.
    Application scenarios: suitable when speed matters less than a high compression ratio; when large job output must be compressed and archived to save disk space and the data will rarely be used afterward; or when a single large text file should be compressed to save storage while still supporting splitting and staying compatible with existing applications.

LZO compression

    Advantages: fast compression/decompression and a reasonable compression ratio. Supports splitting and is one of the most popular compression formats in Hadoop. The lzop command can be installed on Linux systems, so it is easy to use.
    Disadvantages: lower compression ratio than Gzip. Hadoop does not support it natively, so it must be installed, and LZO files need special handling in the application (an index must be built to support splitting, and the InputFormat must be set to the LZO input format); see the sketch below.
    Application scenarios: large text files that are still over 200 MB after compression; the larger the single file, the more pronounced LZO's advantage.
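
    A minimal sketch of that special handling in a driver class, assuming the third-party hadoop-lzo library (which provides LzoTextInputFormat and LzopCodec) is installed, and that an index has already been built for the .lzo input, e.g. with that library's com.hadoop.compression.lzo.DistributedLzoIndexer tool:

// Assumption: hadoop-lzo is on the classpath; LzoTextInputFormat comes from that library, not from Hadoop itself
import com.hadoop.mapreduce.LzoTextInputFormat;

// In the driver, after the Job has been created: use the LZO-aware input format
// so the .lzo.index files are honored and the .lzo files become splittable
job.setInputFormatClass(LzoTextInputFormat.class);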

Snappy compression

    Advantages: very fast compression and decompression with a reasonable compression ratio.
    Disadvantages: does not support splitting, and its compression ratio is lower than Gzip's. Hadoop does not support it natively, so it must be installed.
    Application scenarios: as the compression format for the intermediate data between Map and Reduce when a job's map output is large, or as the output of one MapReduce job that feeds into another.

Compression parameter configuration

Parameter | Default | Stage | Recommendation
io.compression.codecs (in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec | input compression | Hadoop uses the file extension to determine whether a codec is supported
mapreduce.map.output.compress (in mapred-site.xml) | false | mapper output | Set to true to enable compression
mapreduce.map.output.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | mapper output | Companies mostly use LZO or Snappy to compress data at this stage
mapreduce.output.fileoutputformat.compress (in mapred-site.xml) | false | reducer output | Set to true to enable compression
mapreduce.output.fileoutputformat.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | reducer output | Use standard tools or codecs such as gzip and bzip2
mapreduce.output.fileoutputformat.compress.type (in mapred-site.xml) | RECORD | reducer output | Compression type used for SequenceFile output: NONE, RECORD, or BLOCK
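
    For cluster-wide defaults, these properties go into mapred-site.xml. A hedged sketch (the codec choices below are examples, not requirements):

<!-- mapred-site.xml: enable map-side Snappy and reduce-side bzip2 compression cluster-wide -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>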

Compress/decompress

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class TestCompress {

	public static void main(String[] args) throws Exception {
		compress("e:/hello.txt", "org.apache.hadoop.io.compress.BZip2Codec");
		decompress("e:/hello.txt.bz2");
	}

	private static void compress(String filename, String method) throws Exception {
		// 1. Open an input stream on the file to be compressed
		FileInputStream fis = new FileInputStream(new File(filename));
		Class codecClass = Class.forName(method);
		CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());
		// 2. Open an output stream, appending the codec's default extension to the file name
		FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
		CompressionOutputStream cos = codec.createOutputStream(fos);
		// 3. Copy the input stream through the compressing output stream
		IOUtils.copyBytes(fis, cos, 1024 * 1024 * 5, false);
		// 4. Close resources
		cos.close();
		fos.close();
		fis.close();
	}

	private static void decompress(String filename) throws FileNotFoundException, IOException {
		// 1. Check whether a codec can be inferred from the file extension
		CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
		CompressionCodec codec = factory.getCodec(new Path(filename));
		if (codec == null) {
			System.out.println("cannot find codec for file " + filename);
			return;
		}
		// 2. Open a decompressing input stream
		CompressionInputStream cis = codec.createInputStream(new FileInputStream(new File(filename)));
		// 3. Open the output stream for the decoded data
		FileOutputStream fos = new FileOutputStream(new File(filename + ".decoded"));
		// 4. Copy the stream
		IOUtils.copyBytes(cis, fos, 1024 * 1024 * 5, false);
		// 5. Close resources
		cis.close();
		fos.close();
	}
}

    Even when a MapReduce job's input and output files are uncompressed, the intermediate output of the Map tasks can still be compressed, because it is written to disk and transferred over the network to the Reduce nodes; compressing it can bring a significant performance gain. Only two properties need to be set in the driver class.
    Enabling compression for the Map output:

// Enable map-side output compression
configuration.setBoolean("mapreduce.map.output.compress", true);
// Set the map-side output compression codec
configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

    Enabling compression for the Reduce output:

// Enable compression for the reduce-side output
FileOutputFormat.setCompressOutput(job, true);
// Set the compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
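
    Putting both settings together, a minimal driver sketch; the class name CompressionDriver is a placeholder, and with no mapper or reducer set Hadoop falls back to the identity implementations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDriver {
	public static void main(String[] args) throws Exception {
		Configuration configuration = new Configuration();
		// Map side: compress the intermediate data sent to the reducers
		configuration.setBoolean("mapreduce.map.output.compress", true);
		configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

		Job job = Job.getInstance(configuration);
		job.setJarByClass(CompressionDriver.class);
		// The job-specific Mapper/Reducer setup would go here; with none set,
		// Hadoop runs the identity mapper and reducer

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// Reduce side: compress the final job output
		FileOutputFormat.setCompressOutput(job, true);
		FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}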

Origin blog.csdn.net/H_X_P_/article/details/106120476