MapReduce Example: WordCount

Table of Contents

1. MapReduce Overview
2. WordCount: Counting Words

2.1 Preparing the data: test.txt
2.2 The Map program
2.3 The Reduce program
2.4 The Main program

1. MapReduce Overview

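How MapReduce works
MapReduce is a programming model for distributed computation over large data sets. Put simply, a job is split into tasks that run on different machines, and the partial results are then collected and combined.
MapReduce has two core phases, Map and Reduce. Map tasks run independently of one another, and each machine processes, where possible, the data stored on its own HDFS node. Reduce tasks then aggregate the Map results.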
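The two phases can be imitated in plain Java without any Hadoop classes. The sketch below is only an illustration of the idea, not how Hadoop is implemented: the class MiniMapReduce and its methods mapSplit and reduceAll are hypothetical names, and each inner list stands in for the data held by one machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, no Hadoop classes. All names are hypothetical.
final class MiniMapReduce {

  // Map step: runs on one split independently and emits a (word, 1) pair per word.
  static List<Map.Entry<String, Integer>> mapSplit(List<String> lines) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String line : lines) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          pairs.add(Map.entry(word, 1));
        }
      }
    }
    return pairs;
  }

  // Reduce step: collects the pairs from every split and sums the values per key.
  static Map<String, Integer> reduceAll(List<List<Map.Entry<String, Integer>>> mapOutputs) {
    Map<String, Integer> totals = new TreeMap<>();
    for (List<Map.Entry<String, Integer>> output : mapOutputs) {
      for (Map.Entry<String, Integer> pair : output) {
        totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    // Each call to mapSplit stands in for one machine working on its local data.
    List<Map.Entry<String, Integer>> fromMachine1 = mapSplit(List.of("a b a"));
    List<Map.Entry<String, Integer>> fromMachine2 = mapSplit(List.of("b c", "a"));
    System.out.println(reduceAll(List.of(fromMachine1, fromMachine2))); // prints {a=3, b=2, c=1}
  }
}
```

The map calls never look at each other's data, which is what lets a real cluster run them in parallel on different machines.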

2. WordCount: Counting Words

2.1 Preparing the data: test.txt

hello hadoop
wille learn hadoop WordCount
but the hadoop is not easy

2.2 The Map program

package com.ice.hadoop.test.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
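    // Convert the Hadoop Text value to a Java String.
    String line = value.toString();
    // Split on runs of whitespace so repeated spaces or tabs do not produce empty tokens.
    String[] words = line.trim().split("\\s+");
    for (String word : words) {
      if (word.isEmpty()) {
        continue; // a blank line yields a single empty token; skip it
      }
      // Emit (word, 1) for every occurrence.
      context.write(new Text(word), new IntWritable(1));
    }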
  }
}

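This defines a mapper class with a single map method. With the default text input format, the MapReduce framework calls map once for every line it reads.
In Mapper<LongWritable, Text, Text, IntWritable>, the four type parameters are, in order: input key type, input value type, output key type, and output value type.
Each line the framework reads is passed in as a key/value pair. By default, the key is the byte offset at which the line starts in the file (a long, wrapped as LongWritable), and the value is the content of the line (a string, wrapped as Text).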
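To make the input keys concrete, the following sketch computes the starting byte offset of each line of test.txt in plain Java. It assumes UTF-8 text with "\n" line endings; the class LineOffsets is a hypothetical helper and is not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: computes the (offset, line) pairs a line-based reader would hand to map.
final class LineOffsets {

  // Returns starting byte offset -> line text, assuming UTF-8 and '\n' line endings.
  static Map<Long, String> offsets(String content) {
    Map<Long, String> result = new LinkedHashMap<>();
    long offset = 0;
    for (String line : content.split("\n")) {
      result.put(offset, line);
      // Advance past the line's bytes plus one byte for the newline.
      offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
    }
    return result;
  }

  public static void main(String[] args) {
    String content = "hello hadoop\n"
        + "wille learn hadoop WordCount\n"
        + "but the hadoop is not easy\n";
    // Prints the offsets 0, 13 and 42, each followed by its line.
    offsets(content).forEach((offset, line) -> System.out.println(offset + "\t" + line));
  }
}
```

The WordCount mapper ignores this key entirely; only the line content matters for counting words.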
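The output is also a key/value pair. Both parts are produced by the user's own logic: the user decides what serves as the key, and what the value contains and which type it has.
In this example, the output key is the word (a string) and the output value is a count (an integer); the mapper emits 1 for each occurrence.
These types differ from the usual Java ones because the data a MapReduce program emits has to travel between machines, so it must be serializable. Hadoop defines its own serializable types for this: LongWritable for long, Text for String, and IntWritable for int.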
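Hadoop's Writable interface declares two methods, write(DataOutput) and readFields(DataInput). The sketch below imitates that contract with a hypothetical class, IntBox, which does not implement the real interface, to show what "serializable" means here: the value is written to bytes and rebuilt into a fresh object, as it would be on another machine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable type such as IntWritable; not a Hadoop class.
final class IntBox {
  private int value;

  IntBox() {}

  IntBox(int value) {
    this.value = value;
  }

  int get() {
    return value;
  }

  // Serialize the field to a binary stream.
  void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  // Restore the field from a binary stream.
  void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }

  // Serialize, then deserialize into a fresh object, as if it had been sent to another machine.
  static IntBox roundTrip(IntBox original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      IntBox copy = new IntBox();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(new IntBox(42)).get()); // prints 42
  }
}
```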

2.3 The Reduce program

package com.ice.hadoop.test.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int count = 0; // a primitive int avoids boxing on each addition
    for (IntWritable value : values) {
      count += value.get();
    }
    context.write(key, new IntWritable(count));
  }
}

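This defines a Reducer class with a reduce method. In Reducer<Text, IntWritable, Text, IntWritable>, the four type parameters are: input key type, input value type, output key type, and output value type.
Note that reduce receives one string key (a Text) and an iterable collection of values. This is because the output of the map tasks, as read by the reduce task, looks like this:
(good, 1) (good, 1) (good, 1) (good, 1)
By the time it is passed to reduce, it has become:
key: good
value: (1, 1, 1, 1)
So each call to reduce receives all the values that share one key.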
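The grouping step can be sketched in plain Java. This is a simplified illustration of the idea, not Hadoop's actual shuffle; the class ShuffleSketch and its methods are hypothetical names, and the reduce method here only mirrors the shape of the real one (one key plus an Iterable of values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics how map output is grouped by key before reduce is called.
final class ShuffleSketch {

  // Group (key, value) pairs so that each key maps to the list of all its values.
  static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapOutput) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : mapOutput) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    return grouped;
  }

  // Same shape as the reduce method above: one key and an Iterable of its values.
  static int reduce(String key, Iterable<Integer> values) {
    int count = 0;
    for (int value : values) {
      count += value;
    }
    return count;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> mapOutput = List.of(
        Map.entry("good", 1), Map.entry("good", 1),
        Map.entry("good", 1), Map.entry("good", 1));
    Map<String, List<Integer>> grouped = groupByKey(mapOutput);
    System.out.println(grouped); // prints {good=[1, 1, 1, 1]}
    // One reduce call per key.
    grouped.forEach((key, values) -> System.out.println(key + " " + reduce(key, values))); // prints good 4
  }
}
```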

2.4 The Main program

package com.ice.hadoop.test.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapReduce {

  public static void main(String[] args) throws Exception {
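    // Create the configuration object.
    Configuration conf = new Configuration();
    // Create the Job object.
    Job job = Job.getInstance(conf, "wordCount");
    // Set the mapper class.
    job.setMapperClass(WordcountMapper.class);
    // Set the reducer class.
    job.setReducerClass(WordCountReducer.class);

    // Set the class used to locate the job jar.
    job.setJarByClass(WordCountMapReduce.class);

    // Set the key and value types of the map output.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // Set the key and value types of the reduce output.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Set the input and output paths. The output directory must not already exist.
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submit the job and wait for it to finish.
    boolean succeeded = job.waitForCompletion(true);

    if (!succeeded) {
      System.out.println("word count failed!");
    }
    // Report the result to the caller through the exit code.
    System.exit(succeeded ? 0 : 1);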
  }
}

After compiling and packaging, upload the input file to HDFS:

hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put test.txt /wordcount/input

Run the WordCount jar:

hadoop jar mapreduce-wordcount-0.0.1-SNAPSHOT.jar com.ice.hadoop.test.wordcount.WordCountMapReduce /wordcount/input /wordcount/output

After the job completes, verify the output:

hdfs dfs -cat /wordcount/output/*
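For comparison, the counts for the three lines of test.txt can be computed locally. The sketch below is a plain-Java cross-check that does not touch HDFS; the class ExpectedWordCount is a hypothetical name. Assuming the default single reducer and text output format, the job's output should contain the same word and count pairs, one per line, separated by a tab and sorted by key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative local cross-check; it does not read from HDFS.
final class ExpectedWordCount {

  // Counts words and keeps the keys sorted (String order matches byte order for ASCII words).
  static Map<String, Integer> count(List<String> lines) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      for (String word : line.split(" ")) {
        if (!word.isEmpty()) {
          counts.merge(word, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = List.of(
        "hello hadoop",
        "wille learn hadoop WordCount",
        "but the hadoop is not easy");
    // Print one "word<TAB>count" line per key.
    count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
  }
}
```

Here hadoop has a count of 3 and each of the other nine words has a count of 1.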

