A First Look at MapReduce (2): Implementing the WordCount Program

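Implementing the WordCount program is an indispensable step in learning MapReduce. It plays the same role that the HelloWorld program plays in Java, although it is considerably harder than HelloWorld. That is not a problem, because every program follows patterns that can be learned!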

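The WordCount program has three parts. The computation model itself contains only the mapper task and the reduce task; here we add a driver program, the familiar Runner class.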

Here is a ready-made project built with Maven. The code I wrote is on GitHub at:

[email protected]:HeGuanXun/hadoop-hdfs.git

(1) The WorkCountMapper code

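import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WorkCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // key is the byte offset of the line in the input; value is the line's text
        String line = value.toString();

        // Split on runs of whitespace so repeated spaces and tabs do not produce extra tokens
        String[] words = line.split("\\s+");

        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a line with leading whitespace yields one empty token; skip it
            }
            // Emit <word, 1> for every occurrence of a word
            context.write(new Text(word), new LongWritable(1));
        }
    }
}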
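
To see what the map step produces without a cluster, its tokenizing logic can be mimicked in plain Java. This is only a rough sketch: `MapStepSketch` and `mapLine` are hypothetical names that belong neither to Hadoop nor to the project above, a plain list stands in for `context.write`, and splitting on runs of whitespace while dropping empty tokens is a choice made for this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Hadoop or the original project.
final class MapStepSketch {

    // Mimics the map step: one input line in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Long>> mapLine(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace yields one empty token; skip it
            }
            out.add(new SimpleEntry<>(word, 1L)); // stands in for context.write(word, 1)
        }
        return out;
    }

    public static void main(String[] args) {
        // Each word is emitted once per occurrence; no counting happens in the map step.
        System.out.println(mapLine("export PATH export"));
        // prints [export=1, PATH=1, export=1]
    }
}
```

Note that the word "export" appears twice in the output with a value of 1 each time. Adding those values up is left entirely to the reduce step.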

(2) The WorkCountReducer code

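import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WorkCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

        // values holds every 1 that the mappers emitted for this word; add them up
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }

        // Emit <word, total count>
        context.write(key, new LongWritable(count));
    }
}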
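
Between map and reduce, the framework groups the emitted pairs by key and sorts the keys, so that each `reduce` call receives one word together with all of its values. The following plain-Java sketch imitates that grouping and the summation. `ReduceStepSketch`, `reduce`, and `shuffleAndReduce` are hypothetical names used only for illustration, and a `TreeMap` is assumed as a stand-in for the framework's sorted grouping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of Hadoop or the original project.
final class ReduceStepSketch {

    // Mimics one reduce call: sum every value emitted for a single word.
    static long reduce(String word, Iterable<Long> values) {
        long count = 0;
        for (long value : values) {
            count += value;
        }
        return count;
    }

    // Mimics shuffle + reduce: group the mapped words by key, then sum each group.
    static Map<String, Long> shuffleAndReduce(List<String> mappedWords) {
        Map<String, List<Long>> grouped = new TreeMap<>(); // sorted by key
        for (String word : mappedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1L);
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shuffleAndReduce(Arrays.asList("export", "PATH", "export")));
        // prints {PATH=1, export=2}
    }
}
```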

(3) The WorkCountRunner code

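import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WorkCountRunner {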

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: WorkCountRunner <input path> <output path>");
            System.exit(-1);
        }

        // Build a configuration object; it reads the configuration files, and values can also be set on it
        Configuration config = new Configuration();

        // Create the job object that describes this job
        Job job = Job.getInstance(config);

        // The jar used by this job is the one that contains this class
        job.setJarByClass(WorkCountRunner.class);

        // The classes this job uses as its mapper and reducer
        job.setMapperClass(WorkCountMapper.class);
        job.setReducerClass(WorkCountReducer.class);

        // Key type of the mapper's output
        job.setMapOutputKeyClass(Text.class);
        // Value type of the mapper's output
        job.setMapOutputValueClass(LongWritable.class);

        // Key type of the reducer's output
        job.setOutputKeyClass(Text.class);
        // Value type of the reducer's output
        job.setOutputValueClass(LongWritable.class);

        // Input path for this job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output path for this job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish

        // Exit with 0 when the job succeeds and 1 when it fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

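(4) My project is built with Maven, so it only needs to be packaged as a jar. Maven packages as a jar when no packaging is specified, so if the project's pom.xml sets the packaging to war, change that one setting to jar and package the project again.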

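(5) Copy the jar to a machine in the Hadoop cluster (the command below runs it from the local home directory, so the jar itself does not need to be on HDFS), then run the program with a command of the form [hadoop jar xxxx.jar <fully qualified class name> input output].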

a. Prepare the required environment

[hadoop@hadoop01 ~]$ hadoop fs -mkdir /wordcount
[hadoop@hadoop01 ~]$ hadoop fs -mkdir /wordcount/data
[hadoop@hadoop01 ~]$ hadoop fs -put /etc/profile /wordcount/data/

b. Run the program

[hadoop@hadoop01 ~]$ hadoop jar hadoop-hdfs-1.0-SNAPSHOT.jar com.hgx.hadoop.mapper.WorkCountRunner /wordcount/data/ /wordcount/output

c. Messages shown while the program runs and when it completes

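When you see output like this, the program has run to completion. You can also check the job's status in a web browser; the server (the YARN ResourceManager web UI) uses port 8088 by default, as shown below.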

The parts boxed in red in the figure are the important pieces of information.

(6) View the result of the run, that is, the reduce output file. To keep the article tidy, I show only a small excerpt here.
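
With the default text output format, each line of a reduce output file holds a key and a value separated by a tab, which here means a word followed by its count. As a rough sketch of how such lines could be read back in, the following plain Java parses them into a map. `OutputLineSketch` and `parse` are hypothetical names, and the sample lines are made up for illustration; they are not the actual output of the run above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Hadoop or the original project.
final class OutputLineSketch {

    // Parses "word<TAB>count" lines into a map, keeping the order of the lines.
    static Map<String, Long> parse(Iterable<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t'); // the count follows the last tab
            if (tab < 0) {
                continue; // not a key/value line; skip it
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the word<TAB>count layout.
        System.out.println(parse(Arrays.asList("PATH\t1", "export\t2")));
        // prints {PATH=1, export=2}
    }
}
```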

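And that is how simple this program is.

This program is a must for anyone getting started with MapReduce, and understanding it helps a great deal with more advanced MapReduce study!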
