16. Data compression in Hadoop

       This article mainly discusses data compression in Hadoop. It is also the last article on MapReduce; starting from the next one we will move on to another core module of Hadoop, YARN. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" to view the related series of articles~


Table of Contents

1. Overview of Hadoop compression

2. Compression codecs supported by MapReduce

3. Introduction to compression methods

3.1 Gzip compression

3.2 Bzip2 compression

3.3 Lzo compression

3.4 Snappy compression

4. Compression position selection

5. Compression parameter configuration

6. Compression operation examples

6.1 Compressing and decompressing a data stream

6.2 Compression at the Map output

6.3 Compression at the Reduce output


 

1. Overview of Hadoop compression

       Compression can effectively reduce the number of bytes read from and written to HDFS and improve the utilization of network bandwidth and disk space. When running MapReduce programs, I/O operations, network data transmission, Shuffle, and Merge take a lot of time, especially when the data scale is large and the workload is intensive, so data compression is very important. Compression can be enabled at any stage of a MapReduce job. Note, however, that although compression reduces disk I/O, it also increases the computational burden on the CPU, so using compression appropriately can improve performance, while using it improperly can hurt it. As a rule of thumb: for computation-intensive jobs, use compression sparingly; for I/O-intensive jobs, use it more.

2. Compression codecs supported by MapReduce

       In order to support multiple compression/decompression algorithms, Hadoop provides a codec (encoder/decoder) implementation for each format. The commonly used mappings are:

DEFLATE: org.apache.hadoop.io.compress.DefaultCodec (not splittable)
gzip: org.apache.hadoop.io.compress.GzipCodec (not splittable)
bzip2: org.apache.hadoop.io.compress.BZip2Codec (splittable)
LZO: com.hadoop.compression.lzo.LzopCodec (splittable once indexed; requires the separate hadoop-lzo library)
Snappy: org.apache.hadoop.io.compress.SnappyCodec (not splittable)
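
       To check which codecs are actually available on a given installation, the CompressionCodecFactory class can be queried directly. Below is a minimal sketch (the class name ListCodecs is only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ListCodecs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Print every codec class registered via io.compression.codecs
        for (Class<? extends CompressionCodec> codecClass
                : CompressionCodecFactory.getCodecClasses(conf)) {
            System.out.println(codecClass.getName());
        }

        // Map a file name to its codec by extension (null means no codec matches)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(new Path("nginx_log.bz2"));
        System.out.println(codec == null ? "no matching codec" : codec.getClass().getName());
    }
}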

       In terms of performance, bzip2 typically gives the highest compression ratio but the slowest speed, gzip sits in the middle, and LZO and Snappy compress and decompress much faster at a somewhat lower ratio.

3. Introduction to compression methods

3.1 Gzip compression

       Advantages: a relatively high compression ratio and fast compression/decompression speed; Hadoop supports it out of the box, so processing Gzip files in an application is the same as processing plain text; most Linux systems ship with the gzip command, which makes it convenient to use.

       Disadvantages: Split is not supported.

       Application scenario: when each file compresses to around 130 MB or less (i.e. within one block size), the Gzip format can be considered.

3.2 Bzip2 compression

       Advantages: supports Split, has a high compression ratio, and comes bundled with Hadoop, so it is easy to use.

       Disadvantages: slow compression/decompression speed.

       Application scenario: suitable when speed is not critical but a higher compression ratio is required; when relatively large output data needs to be compressed and archived to save disk space and will rarely be used afterwards; or when a single large text file should be compressed to reduce storage space while still supporting Split and remaining compatible with existing applications.

3.3 Lzo compression

       Advantages: fast compression/decompression speed and a reasonable compression ratio; supports Split and is one of the most popular compression formats in Hadoop; the lzop command can be installed on Linux systems, which makes it convenient to use.

       Disadvantages: the compression ratio is lower than Gzip, and Hadoop does not support it out of the box, so it must be installed separately. Lzo files also need some special handling in applications (to support Split an index must be built, and the InputFormat must be set to the Lzo-specific one), as sketched below.

       Application scenario: large files; the larger a single file is, the more obvious the advantage of Lzo becomes.
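
       As a rough illustration of that special handling, the sketch below assumes the third-party hadoop-lzo library is installed; LzoTextInputFormat and DistributedLzoIndexer come from that library rather than stock Hadoop, and the paths, job name, and class name are placeholders:

// Minimal Driver sketch for reading splittable Lzo input (assumes hadoop-lzo is installed).
// Step 0 (outside this program): build the index so the .lzo file can be split, e.g.
//   hadoop jar hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/big.lzo
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo input example");
        job.setJarByClass(LzoInputDriver.class);

        // Use the Lzo-aware InputFormat instead of the default TextInputFormat,
        // so the index file is honored and one split is generated per block
        job.setInputFormatClass(LzoTextInputFormat.class);

        // Mapper/Reducer settings are omitted here -- reuse the WordCount ones
        FileInputFormat.setInputPaths(job, new Path("/input/big.lzo"));
        FileOutputFormat.setOutputPath(job, new Path("/output/lzo_wc"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}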

3.4 Snappy compression

       Advantages: very fast compression/decompression speed and a reasonable compression ratio.

       Disadvantages: Split is not supported, the compression ratio is lower than Gzip, and Hadoop does not support it out of the box.

       Application scenario: when the Map output of a MapReduce job is relatively large, Snappy can be used as the compression format for the intermediate data between Map and Reduce, or as the output format of one MapReduce job whose result serves as the input of another.

4. Compression position selection

       Compression can be enabled at any of the three stages of a MapReduce job.

       Input stage: when there is a large amount of data and you plan to process it repeatedly, consider compressing the input. There is no need to explicitly specify a codec: Hadoop checks the input file's extension automatically and, if it matches a registered codec, reads the file through that codec; otherwise the file is read as-is.

       Map output stage: when the amount of intermediate data produced by the Map tasks is large, consider compressing at this stage. This can significantly speed up the Shuffle, which is the most resource-consuming part of the processing pipeline, so if you find that the volume of data makes network transfer slow, compression is worth considering. Fast codecs suitable for compressing the Mapper output are LZO and Snappy.

       Reduce output stage: compressing the Reducer output reduces the amount of data to be stored and therefore the disk space required. When MapReduce jobs form a job chain, the compressed output of one job becomes the (already compressed) input of the next, so enabling compression here is effective as well.

5. Compression parameter configuration

       Compression is controlled by the following parameters (shown with their default values):

io.compression.codecs (core-site.xml): the list of codec classes available for decompressing input files; Hadoop picks one based on the file extension.
mapreduce.map.output.compress (mapred-site.xml, default false): whether to compress the Map output.
mapreduce.map.output.compress.codec (mapred-site.xml, default DefaultCodec): the codec used for the Map output.
mapreduce.output.fileoutputformat.compress (mapred-site.xml, default false): whether to compress the job output.
mapreduce.output.fileoutputformat.compress.codec (mapred-site.xml, default DefaultCodec): the codec used for the job output.
mapreduce.output.fileoutputformat.compress.type (mapred-site.xml, default RECORD): for SequenceFile output, whether to compress per RECORD or per BLOCK.
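
       The same parameters can also be set per job in the Driver code instead of in the cluster-wide configuration files. A minimal sketch (the codec choices here are only examples):

Configuration conf = new Configuration();

// Map output (intermediate Shuffle data): a fast codec such as Snappy
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        org.apache.hadoop.io.compress.SnappyCodec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);

// Job output: a higher-ratio codec such as Bzip2
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
        org.apache.hadoop.io.compress.BZip2Codec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);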

6. Compression operation examples

6.1 Compressing and decompressing a data stream

package com.xzw.hadoop.mapreduce.compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.*;

/**
 * Compresses and decompresses a local file through the Hadoop codec API.
 *
 * @author: xzw
 * @create_date: 2020/8/22 14:19
 */
public class TestCompress {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // compress
        compress("e:/input/nginx_log", "org.apache.hadoop.io.compress.BZip2Codec");

        // decompress
//        uncompress("e:/input/nginx_log.bz2");
    }

    /**
     * Compression
     * @param filename name of the file to compress
     * @param method   fully qualified class name of the codec to use
     * @throws IOException
     * @throws ClassNotFoundException
     */
    private static void compress(String filename, String method) throws IOException, ClassNotFoundException {
        // 1. Open the input stream and instantiate the codec by reflection
        FileInputStream fis = new FileInputStream(new File(filename));
        Class<?> codecClass = Class.forName(method);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());

        // 2. Open the output stream and wrap it with the codec's compression stream
        FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
        CompressionOutputStream cos = codec.createOutputStream(fos);

        // 3. Copy the data through the streams
        IOUtils.copyBytes(fis, cos, 1024 * 1024 * 5, false);

        // 4. Close the resources
        cos.close();
        fos.close();
        fis.close();
    }

    /**
     * Decompression
     * @param filename name of the file to decompress
     * @throws IOException
     */
    private static void uncompress(String filename) throws IOException {
        // 1. Check whether a codec can be inferred from the file extension
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(filename));

        if (codec == null) {
            System.out.println("cannot find codec for file " + filename);
            return;
        }

        // 2. Open the input stream wrapped with the codec's decompression stream
        CompressionInputStream cis = codec.createInputStream(new FileInputStream(new File(filename)));

        // 3. Open the output stream
        FileOutputStream fos = new FileOutputStream(new File(filename + ".decoded"));

        // 4. Copy the data through the streams
        IOUtils.copyBytes(cis, fos, 1024 * 1024 * 5, false);

        // 5. Close the resources
        cis.close();
        fos.close();
    }
}

       The commonly used compression methods are as follows:

DEFLATE:org.apache.hadoop.io.compress.DefaultCodec
gzip:org.apache.hadoop.io.compress.GzipCodec
bzip2:org.apache.hadoop.io.compress.BZip2Codec

       Run it to check the result: the program produces nginx_log.bz2 next to the input file, and calling uncompress on that file in turn writes nginx_log.bz2.decoded containing the original content.

6.2 Compression at the Map output

       Here we reuse the WordCount example from "Nine, Hadoop Core Components of MapReduce". To compress the Map output, you only need to add the following two properties in the Driver class:

// Enable compression of the map output
configuration.setBoolean("mapreduce.map.output.compress", true);
// Set the codec used for the map output
configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

6.3 Compression at the Reduce output

       Again using the previous WordCount example, just add the following properties in the Driver class:

// Enable compression of the reduce (job) output
FileOutputFormat.setCompressOutput(job, true);

// Set the compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
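
       These two calls operate on the Job object, so they belong after the Job has been created and before it is submitted. A minimal sketch of the tail of the Driver (the output path is only an example):

// Enable compression of the final job output and choose the codec
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

// Set the output path as usual; each reduce task then writes a part-r-xxxxx.bz2 file
FileOutputFormat.setOutputPath(job, new Path("e:/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);

       Because Hadoop recognizes the .bz2 extension when reading input, this compressed output can be fed directly into a follow-up job.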

 

       This is the end of this article. What problems did you encounter during this process? Feel free to leave a comment and let me see what problems you all encountered~

Origin blog.csdn.net/gdkyxy2013/article/details/108164116