Hadoop Big Data Technologies: MapReduce (3) - CombineTextInputFormat

3.1.5 A hands-on CombineTextInputFormat example
Example: count the number of occurrences of each word
  1. Preparation
    Create an input folder in the root directory of HDFS, then upload four small files of 1.5 MB, 35 MB, 5.5 MB, and 6.5 MB as the input data; a staging sketch follows.
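For reference, here is a minimal staging sketch using the Hadoop FileSystem API. The file names small1.txt through small4.txt are placeholders for whatever four small files you prepared; the same can be done from the shell with hdfs dfs -mkdir /input followed by hdfs dfs -put.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageInputFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);     // connect to HDFS
        fs.mkdirs(new Path("/input"));            // create the input folder at the HDFS root
        // placeholder names; use the four small files you actually prepared
        for (String name : new String[]{"small1.txt", "small2.txt", "small3.txt", "small4.txt"}) {
            fs.copyFromLocalFile(new Path(name), new Path("/input/" + name));
        }
        fs.close();
    }
}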
  2. The code
  • Mapper class
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text mapOutputKey = new Text();
    private IntWritable mapOutputValue = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the framework hands each input record over as (byte offset, line text)
        String lineValue = value.toString();
        StringTokenizer st = new StringTokenizer(lineValue);  // split the line on whitespace
        while (st.hasMoreTokens()) {        // as long as there are words left on the line
            String word = st.nextToken();   // the text up to the next delimiter, i.e. one word
            mapOutputKey.set(word);
            mapOutputValue.set(1);
            context.write(mapOutputKey, mapOutputValue);  // emit (word, 1)
        }
    }
}
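The mapper hinges on StringTokenizer, which splits a line on runs of whitespace. As a quick sanity check, this standalone snippet (plain Java, no Hadoop needed) mirrors what map() emits for a sample line:

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        StringTokenizer st = new StringTokenizer("hello hadoop hello");
        while (st.hasMoreTokens()) {
            // mirrors the mapper: each token becomes a (word, 1) pair
            System.out.println("(" + st.nextToken() + ", 1)");
        }
        // prints (hello, 1) (hadoop, 1) (hello, 1), one per line
    }
}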
  • Reducer class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outputValue = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;    // running total for this word
        for (IntWritable value : values) {
            sum += value.get();
        }
        outputValue.set(sum);
        context.write(key, outputValue);  // emit (word, total count)
    }
}
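Because the reduce logic is a plain associative, commutative sum, the same class could also be registered as a combiner so counts are pre-aggregated on the map side, cutting shuffle traffic. This is an optional tweak, not part of the driver below:

// optional, not in the original driver: reuse the reducer as a combiner
job.setCombinerClass(WordCountReducer.class);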
  • Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 */
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // a core-site.xml file must be provided under resources so the job can reach HDFS
        // input and output paths are hard-coded here for the demo
        args = new String[]{
                "/input/",
                "/output/"
        };

        Configuration cfg = new Configuration();   // load the configuration

        Job job = Job.getInstance(cfg, WordCountDriver.class.getSimpleName());
        job.setJarByClass(WordCountDriver.class);

        // if no InputFormat is set, TextInputFormat is the default
        job.setInputFormatClass(CombineTextInputFormat.class);
        // set the maximum virtual-storage split size to 20 MB
        CombineTextInputFormat.setMaxInputSplitSize(job, 20 * 1024 * 1024);

        // set the mapper and its output key/value types
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // set the reducer and the job's final output key/value types
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // submit the job (to YARN, or the local runner) and wait for completion
        boolean isSuccess = job.waitForCompletion(true);
        System.exit(isSuccess ? 0 : 1);
    }
}
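A quick note on the 20 MB setting, as a rough sketch of the commonly described virtual-storage rule rather than a guarantee: with the default TextInputFormat, the four input files would produce four splits and thus four map tasks for only about 48.5 MB of data. Under CombineTextInputFormat with a 20 MB maximum, the 1.5 MB, 5.5 MB, and 6.5 MB files each stay whole (each is at most 20 MB), while the 35 MB file, being larger than 20 MB but no larger than 40 MB, is halved into two 17.5 MB chunks. The chunks are then packed in order into splits of roughly the maximum size, so the job ends up with about two splits instead of four (the exact grouping depends on the Hadoop version and block placement).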
  3. Run result
    (Screenshot of the run result omitted.)