11. InputFormat in MapReduce

       The MapReduce execution process is divided into a Map phase and a Reduce phase. Within the Map phase, InputFormat plays a very important role, and it is the main subject of this article. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" to view the related articles in this series~


Table of Contents

1. Parallelism of slicing and MapTask

2. FileInputFormat slicing

3. CombineTextInputFormat slicing

4. Implementation classes of FileInputFormat

4.1 TextInputFormat

4.2 KeyValueTextInputFormat

4.2.1 KeyValueTextInputFormat example

4.3 NLineInputFormat

4.3.1 NLineInputFormat example

4.4 Custom InputFormat

4.4.1 Example of custom InputFormat


 

1. Parallelism of slicing and MapTask

       (1) The parallelism of the Map phase of a job is determined by the number of splits computed when the client submits the job. (2) Each split is assigned to one MapTask instance and processed in parallel. (3) By default, the split size equals the block size. (4) When splitting, the data set is not considered as a whole; each file is split separately.

2. FileInputFormat slicing

       Slicing mechanism: (1) Files are split simply by their content length (in bytes). (2) The split size equals the block size by default. (3) When splitting, the data set is not considered as a whole; each file is split separately.

       Inside a Mapper, the split information can be obtained with the following code:

// Get the split for the current input (cast to FileSplit, since the input is a file)
FileSplit inputSplit = (FileSplit) context.getInputSplit();
// Get the name of the file this split belongs to
String name = inputSplit.getPath().getName();
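
       For reference, the default split size comes from the formula splitSize = max(minSize, min(maxSize, blockSize)) used by FileInputFormat, which is why the split size equals the block size by default. A minimal sketch of that arithmetic (the property names mapreduce.input.fileinputformat.split.minsize/maxsize and the 128 MB block size are assumptions, not values from this article):

public class SplitSizeSketch {
    public static void main(String[] args) {
        long minSize = 1L;                   // default of mapreduce.input.fileinputformat.split.minsize
        long maxSize = Long.MAX_VALUE;       // default of mapreduce.input.fileinputformat.split.maxsize
        long blockSize = 128L * 1024 * 1024; // e.g. a 128 MB HDFS block

        // splitSize = max(minSize, min(maxSize, blockSize)) => equals the block size with the defaults
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println("split size = " + splitSize);
    }
}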

3. CombineTextInputFormat slicing

       The framework's default TextInputFormat slicing mechanism plans tasks by file: no matter how small a file is, it becomes a separate split and is handed to its own MapTask. With a large number of small files, this produces a large number of MapTasks, which is extremely inefficient. CombineTextInputFormat is intended for scenarios with many small files: it logically packs multiple small files into one split, so that several small files can be processed by a single MapTask, improving efficiency. The maximum virtual-storage split size should be set according to the actual sizes of the small files, as follows:

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // set the maximum virtual split size to 4 MB
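
       To take effect, the input format itself also has to be selected in the driver. A minimal driver sketch, assuming a job set up like the drivers shown later in this article (the class name and the omitted Mapper/Reducer wiring are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(CombineDriverSketch.class);
        // ... set the Mapper, Reducer and output types as in the other drivers ...

        // Replace the default TextInputFormat with CombineTextInputFormat
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Pack small files into virtual splits of at most 4 MB
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}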

       Slicing mechanism: the process consists of two parts, a virtual storage step and a slicing step.

       1. Virtual storage step: compare the size of each file in the input directory with the configured setMaxInputSplitSize value in turn. If a file is not larger than the maximum, it logically forms a single block. If a file is more than twice the maximum, a block of the maximum size is cut off and the remainder is compared again; when the remaining data is larger than the maximum but not more than twice the maximum, it is divided evenly into two virtual storage blocks (to avoid producing overly small splits).

       2. Slicing step: check whether each virtual storage block is greater than or equal to the setMaxInputSplitSize value; if so, it forms a split on its own. If not, it is merged with the next virtual storage block to form a single split.
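
       For example (hypothetical file sizes, not from this article): with a 4 MB maximum, input files of 1.7 MB, 5.1 MB, 3.4 MB and 6.8 MB produce virtual blocks of 1.7 MB, 2.55 MB, 2.55 MB, 3.4 MB, 3.4 MB and 3.4 MB (the 5.1 MB and 6.8 MB files are each divided evenly in two because they exceed 4 MB but not 8 MB). The slicing step then merges these blocks into three splits of about 4.25 MB (1.7 + 2.55), 5.95 MB (2.55 + 3.4) and 6.8 MB (3.4 + 3.4).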

4. Implementation classes of FileInputFormat

       Common implementation classes of FileInputFormat include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormats.

4.1 TextInputFormat

       TextInputFormat is the default FileInputFormat implementation class. It reads records line by line. The key is the starting byte offset of the line within the file, of type LongWritable; the value is the content of the line, excluding any line terminators, of type Text.
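
       As a quick illustration with a hypothetical two-line file (offsets assume single-byte characters and a one-byte line terminator), the records handed to the Mapper would be:

hello hadoop       ->  (0,  "hello hadoop")
hello mapreduce    ->  (13, "hello mapreduce")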

4.2 KeyValueTextInputFormat

       With KeyValueTextInputFormat, each line is a record, split into key and value by a separator. The separator can be set in the driver class with conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");, and the default separator is \t (tab). The key is then the Text content of each line before the first separator, and the value is the rest of the line after it.
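
       As a quick illustration (hypothetical line, not this article's data set), with the default tab separator the line below produces key "xzw" and value "hello hadoop":

xzw\thello hadoop   ->  key = "xzw", value = "hello hadoop"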

4.2.1 KeyValueTextInputFormat example

       1. First, the data and requirements. The data is shown in the figure below. The requirement is to count, for each distinct first word, the number of lines that begin with it.

       2. Write the Mapper class

package com.xzw.hadoop.mapreduce.keyvaluetextinputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/29 14:19
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class KVTextMapper extends Mapper<Text, Text, Text, LongWritable> {
    // 1. The output value: a constant count of 1
    private LongWritable v = new LongWritable(1);

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {

        // Write out the key (the part of the line before the tab) with a count of 1
        context.write(key, v);
    }
}

       3. Write the Reducer class

package com.xzw.hadoop.mapreduce.keyvaluetextinputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/29 14:19
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class KVTextReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    LongWritable v = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0L;

        // 1. Sum the counts for this key
        for (LongWritable value : values) {
            sum += value.get();
        }

        v.set(sum);

        // 2. Write out the total
        context.write(key, v);
    }
}

       4. Write the Driver class

package com.xzw.hadoop.mapreduce.keyvaluetextinputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/29 14:19
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class KVTextDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        args = new String[]{"e:/input/xzw.txt", "e:/output"}; // local test paths

        Configuration configuration = new Configuration();
        // Set the key/value separator (tab is also the default)
        configuration.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
        // Get the job object
        Job job = Job.getInstance(configuration);

        job.setJarByClass(KVTextDriver.class);
        job.setMapperClass(KVTextMapper.class);
        job.setReducerClass(KVTextReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Use KeyValueTextInputFormat instead of the default TextInputFormat
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

       5. Test results

4.3 NLineInputFormat

       If NLineInputFormat is used, the InputSplit processed by each map task is no longer determined by HDFS blocks, but by the number of lines N specified for NLineInputFormat. That is, the number of splits equals the total number of lines in the input file divided by N; if it does not divide evenly, the number of splits is the quotient plus one. The keys and values are the same as with TextInputFormat.
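
       For example (hypothetical line count): with N = 3, an input file of 11 lines gives 11 / 3 = 3 with a remainder of 2, so 3 + 1 = 4 splits are created; the first three splits contain 3 lines each and the last one contains 2.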

4.3.1 NLineInputFormat example

       1. Requirements and data: The data content is shown below. The requirement is to count the number of occurrences of each word, putting three lines of data into each split.

       2. Write the Mapper class

package com.xzw.hadoop.mapreduce.nlineinputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 11:05
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class NLMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private Text k = new Text();
    private LongWritable v = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line of input
        String line = value.toString();

        // 2. Split the line into words by tab
        String[] fields = line.split("\t");

        // 3. Write out each word with a count of 1
        for (int i = 0; i < fields.length; i++) {
            k.set(fields[i]);
            context.write(k, v);
        }
    }
}

       3. Write the Reducer class

package com.xzw.hadoop.mapreduce.nlineinputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 11:12
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class NLReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    LongWritable v = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0L;

        // 1. Sum the counts for this word
        for (LongWritable value : values) {
            sum += value.get();
        }
        v.set(sum);

        // 2. Write out the total
        context.write(key, v);
    }
}

       4. Write the Driver class

package com.xzw.hadoop.mapreduce.nlineinputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 11:12
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class NLDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        args = new String[]{"e:/input/xzw.txt", "e:/output"}; // local test paths

        // 1. Get the job object
        Job job = Job.getInstance(new Configuration());

        // 2. Put three lines into each InputSplit
        NLineInputFormat.setNumLinesPerSplit(job, 3);

        // 3. Use NLineInputFormat to split the input by line count
        job.setInputFormatClass(NLineInputFormat.class);

        // 4. Set the jar and wire up the Mapper and Reducer
        job.setJarByClass(NLDriver.class);
        job.setMapperClass(NLMapper.class);
        job.setReducerClass(NLReducer.class);

        // 5. Set the Mapper and Reducer output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

       5. The test results are as follows

       The number of input splits generated is:

4.4 Custom InputFormat

       In actual development, the InputFormat types that ship with the Hadoop framework cannot cover every business scenario, and sometimes a custom InputFormat is needed. To customize an InputFormat, first define a class that extends FileInputFormat, then implement a RecordReader that reads one complete file at a time and packages it as a key-value pair, and finally use SequenceFileOutputFormat to write out the merged files.

4.4.1 Example of custom InputFormat

       1. Requirements: a SequenceFile is a file format Hadoop uses to store binary key-value pairs. Multiple small files need to be merged into one SequenceFile, with the file path plus name as the key and the file content as the value. The input consists of the following three small files:

       2. Custom RecordReader class

package com.xzw.hadoop.mapreduce.inputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 13:34
 * @desc: Custom RecordReader that reads one whole file as a single key-value pair
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class CombinerRecordReader extends RecordReader<Text, BytesWritable> {
    private Configuration configuration;
    private FileSplit fs;
    private FSDataInputStream inputStream;
    private boolean isProgress = true;
    private BytesWritable value = new BytesWritable();
    private Text key = new Text();

    /**
     * Initialization method; the framework calls it once at the start
     *
     * @param inputSplit
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    public void initialize(InputSplit inputSplit, TaskAttemptContext context) throws IOException,
            InterruptedException {
        // Cast the generic InputSplit to a FileSplit
        fs = (FileSplit) inputSplit;

        // Get the file path from the split
        Path path = fs.getPath();

        // Get the file system from the path
        configuration = context.getConfiguration();
        FileSystem fileSystem = path.getFileSystem(configuration);

        // Open the input stream
        inputStream = fileSystem.open(path);
    }

    /**
     * Read the next key-value pair
     *
     * @return true if a pair was read, false if the input is exhausted
     * @throws IOException
     * @throws InterruptedException
     */
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (isProgress) {
            // Key: the full path of the file
            key.set(fs.getPath().toString());
            // Value: the entire file content (readFully avoids partial reads)
            byte[] buf = new byte[(int) fs.getLength()];
            IOUtils.readFully(inputStream, buf, 0, buf.length);
            value.set(buf, 0, buf.length);

            isProgress = false;
            return true;
        } else {
            return false;
        }
    }

    /**
     * Return the current key
     *
     * @return the current key
     * @throws IOException
     * @throws InterruptedException
     */
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    /**
     * Return the current value
     *
     * @return the current value
     * @throws IOException
     * @throws InterruptedException
     */
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    /**
     * Progress of the current read
     *
     * @return the current progress
     * @throws IOException
     * @throws InterruptedException
     */
    public float getProgress() throws IOException, InterruptedException {
        return isProgress ? 0 : 1;
    }

    /**
     * Close resources
     *
     * @throws IOException
     */
    public void close() throws IOException {
        IOUtils.closeStream(inputStream);
    }
}

       3. Define a class that extends FileInputFormat

package com.xzw.hadoop.mapreduce.inputformat;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 13:24
 * @desc: Custom FileInputFormat subclass; treats every file as unsplittable
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class CombinerFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    /**
     * Return false so that files are never split
     * @param context
     * @param filename
     * @return
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    /**
     * Create the RecordReader object
     * @param inputSplit
     * @param taskAttemptContext
     * @return
     * @throws IOException
     * @throws InterruptedException
     */
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new CombinerRecordReader();
    }
}

       4. Write the Mapper class

package com.xzw.hadoop.mapreduce.inputformat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 13:49
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class SequenceFileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }
}

       5. Write the Reducer class

package com.xzw.hadoop.mapreduce.inputformat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 13:51
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class SequenceFileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException,
            InterruptedException {
        context.write(key, values.iterator().next());
    }
}

       6. Write the Driver class

package com.xzw.hadoop.mapreduce.inputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/1 13:58
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class SequenceFileDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        args = new String[]{"e:/input", "e:/output"}; // local test paths

        // 1. Get the job object
        Job job = Job.getInstance(new Configuration());

        // 2. Set the jar location and wire up the custom Mapper and Reducer
        job.setJarByClass(SequenceFileDriver.class);
        job.setMapperClass(SequenceFileMapper.class);
        job.setReducerClass(SequenceFileReducer.class);

        // 3. Use the custom InputFormat for input
        job.setInputFormatClass(CombinerFileInputFormat.class);
        // 4. Use SequenceFileOutputFormat for output
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // 5. Set the Mapper and Reducer output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

       7. The test results are as follows
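
       To double-check the merged output, the resulting SequenceFile can be dumped with a small utility. A minimal sketch, assuming the standard org.apache.hadoop.io.SequenceFile.Reader API and passing the part file path as the first argument (this utility is not part of the job above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. the part-r-00000 file produced by the job
        try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // Each record is (file path, file bytes); print the path and the content length
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        }
    }
}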

 

       That's all for this article. If you ran into any problems along the way, feel free to leave a message and tell me what you encountered~
