13. OutputFormat in MapReduce

       Since there is an InputFormat, there is naturally also an OutputFormat. This article introduces OutputFormat in MapReduce. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" for the related articles in this series.


Table of Contents

1. The working mechanism of MapReduce

1.1 The working mechanism of MapTask

1.2 The working mechanism of ReduceTask

2. OutputFormat in MapReduce

2.1 Common OutputFormat implementation classes

2.1.1 TextOutputFormat

2.1.2 SequenceFileOutputFormat

2.2 Custom OutputFormat example

2.2.1 Requirements and data

2.2.2 Writing the Bean class

2.2.3 Writing the Mapper class

2.2.4 Writing the RecordWriter class

2.2.5 Writing the OutputFormat class

2.2.6 Writing the Reducer class

2.2.7 Writing the Driver class

2.2.8 Testing


 

1. The working mechanism of MapReduce 

       Before introducing OutputFormat, let's first look at the working mechanisms of MapTask and ReduceTask.

1.1 The working mechanism of MapTask

       (1) Read phase: the MapTask parses key/value pairs from the input InputSplit through the user-written RecordReader.

       (2) Map phase: the parsed key/value pairs are handed to the user-written map() function, which produces a series of new key/value pairs.

       (3) Collect phase: inside the user-written map() function, OutputCollector.collect() is usually called once a record has been processed to output the result. Inside this method, the generated key/value pairs are partitioned (by calling the Partitioner) and written into a ring memory buffer.

       (4) Spill phase ("overflow write"): when the ring buffer is full, MapReduce writes the data to the local disk, generating a temporary file. Note that before the data is written to disk, it must first be sorted locally, and merged and compressed if necessary.

       (5) Combine phase: when all data has been processed, the MapTask merges all temporary files into a single large file, saved as output/file.out, and generates the corresponding index file output/file.out.index. During the merge, the MapTask merges partition by partition. For a given partition, it uses multiple rounds of recursive merging: each round merges io.sort.factor (default 10) files and adds the resulting file back to the list of files to be merged; this is repeated until a single large file remains. Having each MapTask produce only one data file avoids the overhead of opening many files at once and randomly reading a large number of small files.

       The detailed steps of the Spill phase:

       1. Sort the data in the buffer using quick sort. The sort order is by partition number (Partition) first and then by key, so that after sorting the data is grouped by partition and the data within each partition is ordered by key.

       2. Write the data of each partition, in increasing order of partition number, to the temporary file output/spillN.out in the task working directory (N is the current spill count). If the user has configured a Combiner, the data of each partition is aggregated before it is written to the file.

       3. Write the metadata of the partitions to the in-memory index structure SpillRecord; the metadata of each partition includes its offset in the temporary file, the data size before compression, and the data size after compression. If the in-memory index grows beyond 1MB, it is written to the file output/spillN.out.index.

       The spill buffer size and the merge factor used above can be tuned per job, as in the sketch after these steps.
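
       A minimal tuning sketch, assuming the Hadoop 2.x+ property names mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor (the newer names of the older io.sort.mb / io.sort.factor); the class name SpillTuningSketch is only illustrative:

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

public class SpillTuningSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        //Size of the ring buffer in MB (default 100)
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        //Number of spill files merged per round (default 10)
        conf.setInt("mapreduce.task.io.sort.factor", 20);

        Job job = Job.getInstance(conf, "spill-tuning-sketch");
        //......the rest of the job setup (Mapper, Reducer, input/output paths) is unchanged
    }
}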

1.2 The working mechanism of ReduceTask

       (1) Copy phase: the ReduceTask remotely copies a piece of data from each MapTask. If a piece of data exceeds a certain threshold, it is written to disk; otherwise it is kept in memory.

       (2) Merge phase: while copying data remotely, the ReduceTask starts two background threads that merge files in memory and on disk, to prevent excessive memory usage or too many files on disk.

       (3) Sort phase: according to MapReduce semantics, the input of the user-written reduce() function is a set of data grouped by key. To bring data with the same key together, Hadoop uses a sort-based strategy. Since each MapTask has already partially sorted its own output, the ReduceTask only needs to perform one merge sort over all the data.

       (4) Reduce phase: the reduce() function writes its results to HDFS.

       Notes:

       1. ReduceTask = 0 means there is no Reduce phase; the number of output files equals the number of MapTasks.

       2. The default number of ReduceTasks is 1, so by default there is a single output file.

       3. If the data is unevenly distributed, data skew may occur on the Reduce side.

       4. The number of ReduceTasks cannot be set arbitrarily; the specific business scenario must be taken into account.

       5. The number of ReduceTasks should also be chosen according to the performance of the cluster; it is set on the Job, as in the sketch after this list.

       6. If the number of partitions is greater than 1 but the number of ReduceTasks is 1, the partitioning step is not executed. In the MapTask source code, the precondition for partitioning is that ReduceNum is greater than 1; if it is not, partitioning is skipped.
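
       A minimal sketch of setting the number of ReduceTasks on the Job (the class name ReduceTaskCountSketch is only illustrative):

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

public class ReduceTaskCountSketch {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration());
        //0 makes the job map-only (no shuffle, no Reduce phase, one output file per MapTask);
        //the default is 1; a value greater than 1 only makes sense together with a matching Partitioner
        job.setNumReduceTasks(4);
        //......the rest of the job setup is unchanged
    }
}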

2. OutputFormat in MapReduce

2.1 Common OutputFormat implementation classes

       OutputFormat is the base class for MapReduce output; every MapReduce output implementation inherits from OutputFormat. The following are several common OutputFormat implementation classes.

2.1.1 TextOutputFormat

       TextOutputFormat is the default output format. It writes each record as a line of text. Its keys and values can be of any type, because TextOutputFormat calls toString() to convert them into strings.
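
       As a reference point for the custom OutputFormat below, here is a minimal sketch of using TextOutputFormat explicitly. The separator property name mapreduce.output.textoutputformat.separator is assumed from Hadoop 2.x+ (the default separator is a tab), and the class name is only illustrative.

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class TextOutputFormatSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        //Change the key/value separator from the default tab to a comma
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf);
        //TextOutputFormat is already the default, so this call only makes the choice explicit
        job.setOutputFormatClass(TextOutputFormat.class);
        //......the rest of the job setup is unchanged
    }
}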

2.1.2 SequenceFileOutputFormat

       SequenceFileOutputFormat writes output that is meant to be read as the input of subsequent MapReduce jobs. It is a good intermediate format because it is compact and easily compressed.
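
       A minimal sketch of enabling SequenceFileOutputFormat with block compression, assuming the standard compression helpers on FileOutputFormat and SequenceFileOutputFormat (the class name is only illustrative):

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import java.io.IOException;

public class SequenceFileOutputSketch {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration());
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        //Compress the output and use block-level compression for the sequence files
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        //......the rest of the job setup is unchanged
    }
}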

2.2 Custom OutputFormat example

       Customizing OutputFormat is the focus of this article, because sometimes a custom OutputFormat is needed to meet a particular business scenario. Let's look at this through a case study.

2.2.1 Requirements and data

       First look at the data. It is the same Nginx access log used in the previous articles (the sample screenshot is omitted here). Each record contains the following fields: time, version, client IP, access path, status, domain name, server IP, size, and response time. The requirement is to write the records whose access path is "/iclock/getrequest" to one file, and all the other records to another file.

2.2.2 Writing the Bean class

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/10 14:02
 * @desc: time, version, client IP, access path, status, domain name, server IP, size, response time
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class LogBean implements Writable {
    private String date;
    private String version;
    private String clientIP;
    private String url;
    private String status;
    private String domainName;
    private String serverIP;
    private String size;
    private String responseDate;

    public LogBean() {
    }

    public void set(String date, String version, String clientIP, String url, String status, String domainName,
                    String serverIP, String size, String responseDate) {
        this.date = date;
        this.version = version;
        this.clientIP = clientIP;
        this.url = url;
        this.status = status;
        this.domainName = domainName;
        this.serverIP = serverIP;
        this.size = size;
        this.responseDate = responseDate;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getVersion() {
        return version;
    }

    public void setVersion(String version) {
        this.version = version;
    }

    public String getClientIP() {
        return clientIP;
    }

    public void setClientIP(String clientIP) {
        this.clientIP = clientIP;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getDomainName() {
        return domainName;
    }

    public void setDomainName(String domainName) {
        this.domainName = domainName;
    }

    public String getServerIP() {
        return serverIP;
    }

    public void setServerIP(String serverIP) {
        this.serverIP = serverIP;
    }

    public String getSize() {
        return size;
    }

    public void setSize(String size) {
        this.size = size;
    }

    public String getResponseDate() {
        return responseDate;
    }

    public void setResponseDate(String responseDate) {
        this.responseDate = responseDate;
    }

    @Override
    public String toString() {
        return date + '\t' + version + '\t' + clientIP + '\t' + url + '\t' + status + '\t' + domainName + '\t'
                + serverIP + '\t' + size + '\t' + responseDate;
    }

    /**
     * Serialization method
     *
     * @param dataOutput
     * @throws IOException
     */
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(date);
        dataOutput.writeUTF(version);
        dataOutput.writeUTF(clientIP);
        dataOutput.writeUTF(url);
        dataOutput.writeUTF(status);
        dataOutput.writeUTF(domainName);
        dataOutput.writeUTF(serverIP);
        dataOutput.writeUTF(size);
        dataOutput.writeUTF(responseDate);
    }

    /**
     * Deserialization method
     *
     * @param dataInput
     * @throws IOException
     */
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.date = dataInput.readUTF();
        this.version = dataInput.readUTF();
        this.clientIP = dataInput.readUTF();
        this.url = dataInput.readUTF();
        this.status = dataInput.readUTF();
        this.domainName = dataInput.readUTF();
        this.serverIP = dataInput.readUTF();
        this.size = dataInput.readUTF();
        this.responseDate = dataInput.readUTF();
    }
}

2.2.3 Writing the Mapper class

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/10 13:59
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class LogMapper extends Mapper<LongWritable, Text, Text, LogBean> {
    private Text k = new Text();
    private LogBean v = new LogBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //1. Get one line of input
        String line = value.toString();

        //2. Split the line into fields
        String[] fields = line.split("\t");

        //3. Extract the corresponding fields
        String date = fields[0];
        String version = fields[1];
        String clientIP = fields[2];
        String url = fields[3];
        String status = fields[4];
        String domainName = fields[5];
        String serverIP = fields[6];
        String size = fields[7];
        String responseDate = fields[8];

        //4. Wrap the data into the output key and value
        k.set(url);
        v.set(date, version, clientIP, url, status, domainName, serverIP, size, responseDate);

        //5. Write out
        context.write(k, v);
    }
}

2.2.4 Writing the RecordWriter class

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/10 14:23
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class LogRecordWriter extends RecordWriter<Text, LogBean> {

    private FSDataOutputStream getrequest;
    private FSDataOutputStream others;

    public LogRecordWriter(TaskAttemptContext job) throws IOException {
        //1. Get the file system
        FileSystem fs;
        fs = FileSystem.get(job.getConfiguration());

        //2. Create the output streams
        String outDir = job.getConfiguration().get(FileOutputFormat.OUTDIR);
        getrequest = fs.create(new Path(outDir + "/getrequest.txt"));
        others = fs.create(new Path(outDir + "/others.txt"));
    }

    @Override
    public void write(Text key, LogBean value) throws IOException, InterruptedException {
        //Check whether the path is /iclock/getrequest and write the value to the corresponding file
        String k = key.toString() + "\n";
        if (k.contains("getrequest")) {
            getrequest.write(value.toString().getBytes());
            getrequest.write("\n".getBytes());
        } else {
            others.write(value.toString().getBytes());
            others.write("\n".getBytes());
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        //Close the resources
        IOUtils.closeStream(getrequest);
        IOUtils.closeStream(others);
    }
}

2.2.5 Writing the OutputFormat class

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/10 14:44
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class LogOutputFormat extends FileOutputFormat<Text, LogBean> {
    @Override
    public RecordWriter<Text, LogBean> getRecordWriter(TaskAttemptContext job) throws IOException,
            InterruptedException {
        return new LogRecordWriter(job);
    }
}

2.2.6 Writing the Reducer class

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/10 14:20
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class LogReducer extends Reducer<Text, LogBean, Text, LogBean> {
    @Override
    protected void reduce(Text key, Iterable<LogBean> values, Context context) throws IOException,
            InterruptedException {
        for (LogBean value: values) {
            context.write(key, value);
        }
    }
}

2.2.7 Writing the Driver class

package com.xzw.hadoop.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/10 14:54
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class LogDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //Input and output paths (hard-coded here for a local test run)
        args = new String[]{"e:/input/nginx_log", "e:/output"};

        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(LogDriver.class);

        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LogBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LogBean.class);

        job.setOutputFormatClass(LogOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

2.2.8 Testing

       Run the LogDriver class and check the output directory (the result screenshots are omitted here). The file getrequest.txt contains the records whose access path is /iclock/getrequest, and others.txt contains all the other records.

 

 


Origin blog.csdn.net/gdkyxy2013/article/details/107911569