The MR Computation Model, Part 2

Writing MapReduce functions (1)

The map function

Extend Mapper<Object, Object, Object, Object>

Override the method public void map(Object key, Object value, Context context) throws IOException, InterruptedException

The map function is mainly used for data cleaning and initial preprocessing.

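Input and output of the map function

Each call to map processes exactly one record.

For map input, the key is by default the byte offset of the line within the file (not the line number), and the value is the content of that line.

Output is emitted with context.write(Object, Object).

The output of map is the input of reduce.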
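To make the default input contract concrete, here is a minimal plain-Java sketch that does not use Hadoop at all. It imitates how line-oriented input hands each line to map together with its byte offset. The class `OffsetDemo` and the method `toRecords` are hypothetical names invented for this illustration, and the sketch assumes UTF-8 text with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop.
class OffsetDemo {

    // Splits text into (byte offset, line) records.
    // Assumption: UTF-8 encoding and '\n' line endings.
    static List<Map.Entry<Long, String>> toRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<Long, String>(offset, line));
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Each record corresponds to one call of map(key, value, context).
        for (Map.Entry<Long, String> r : toRecords("hello world\nfoo\nbar baz\n")) {
            System.out.println("key=" + r.getKey() + " value=" + r.getValue());
        }
        // prints:
        // key=0 value=hello world
        // key=12 value=foo
        // key=16 value=bar baz
    }
}
```

The keys 0, 12, and 16 are positions in the file, which is why they should not be read as line numbers.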

Writing MapReduce functions (2)

The reduce function

Extend Reducer<Object, Object, Object, Object>

Override the method public void reduce(Object key, Iterable<Object> values, Context context) throws IOException, InterruptedException

The reduce function is where the main business logic and data mining take place.

Input and output of the reduce function

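The input of reduce is the output of map, but it is not passed through directly: reduce receives, for each distinct key, the collection of all values that were emitted under that key.

Output is emitted with context.write(Object, Object), for example context.write(data, new IntWritable(1)).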
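The grouping step between map and reduce can be imitated in a few lines of plain Java. This is only a sketch of the idea, not Hadoop's actual shuffle: `GroupByKeyDemo` and `groupByKey` are hypothetical names, and a `TreeMap` stands in for the fact that keys reach reduce in sorted order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
class GroupByKeyDemo {

    // Collects all values emitted under the same key, with keys kept sorted.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Pairs in the order a mapper might have written them.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new AbstractMap.SimpleEntry<String, Integer>("b", 1));
        mapOutput.add(new AbstractMap.SimpleEntry<String, Integer>("a", 2));
        mapOutput.add(new AbstractMap.SimpleEntry<String, Integer>("b", 3));
        mapOutput.add(new AbstractMap.SimpleEntry<String, Integer>("a", 4));

        // Each entry below corresponds to one call of reduce(key, values, context).
        System.out.println(groupByKey(mapOutput)); // prints {a=[2, 4], b=[1, 3]}
    }
}
```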

Writing MapReduce functions (3)

Writing the job

    logger.warn("HelloHadoopSort started");
    // Load the cluster settings from core-site.xml on the classpath
    Configuration coreSiteConf = new Configuration();
    coreSiteConf.addResource(Resources.getResource("core-site.xml"));

    Job job = Job.getInstance(coreSiteConf, "HelloHadoopSort");
    job.setJarByClass(HelloHadoopSort.class);
    // Set the Map and Reduce classes
    job.setMapperClass(SortMapper.class);
    job.setReducerClass(SortReducer.class);
    // Set the job's final output types; since setMapOutputKeyClass and
    // setMapOutputValueClass are not called, the map output uses the same types
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths on the file system
    FileInputFormat.addInputPath(job, new Path("/sort/input"));
    FileOutputFormat.setOutputPath(job, new Path("/sort/output"));
    // Submit the job and block until it finishes
    boolean flag = job.waitForCompletion(true);
    logger.warn("HelloHadoopSort finished, result: " + flag);

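The WordCountMap class extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the map function's input key type, input value type, output key type, and output value type.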

The WordCountReduce class extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as those of the map class.

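The map output types are the reduce input types. In the common case the map output types are also the same as the reduce output types, and in that case the reduce input types and output types coincide.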
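The way the four type parameters chain together can be sketched with plain Java generics. The interfaces `MapFn` and `ReduceFn` and the method `run` below are hypothetical stand-ins invented for this illustration; they mirror only the type shape of Hadoop's Mapper and Reducer, not their real API. Because `K2` and `V2` appear both as map output types and as reduce input types, the compiler rejects a mapper and reducer that do not line up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical illustration class; not part of Hadoop.
class TypeChainDemo {

    // Type parameters in the same order as described above:
    // input key, input value, output key, output value.
    interface MapFn<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> out);
    }

    interface ReduceFn<KI, VI, KO, VO> {
        void reduce(KI key, Iterable<VI> values, BiConsumer<KO, VO> out);
    }

    // K2/V2 link the mapper's output to the reducer's input.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input, MapFn<K1, V1, K2, V2> mapper, ReduceFn<K2, V2, K3, V3> reducer) {
        // Group map output by key, keys sorted.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        input.forEach((k, v) -> mapper.map(k, v,
                (k2, v2) -> grouped.computeIfAbsent(k2, x -> new ArrayList<>()).add(v2)));
        // Call the reducer once per key and collect what it emits.
        Map<K3, V3> result = new LinkedHashMap<>();
        grouped.forEach((k2, vs) -> reducer.reduce(k2, vs, result::put));
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> input = new LinkedHashMap<>();
        input.put(0L, "3");
        input.put(2L, "1");
        input.put(4L, "3");
        // Map output (Integer, Integer) equals reduce input (Integer, Integer);
        // reduce output is also (Integer, Integer), the common case.
        MapFn<Long, String, Integer, Integer> mapper =
                (offset, line, out) -> out.accept(Integer.parseInt(line.trim()), 1);
        ReduceFn<Integer, Integer, Integer, Integer> reducer = (key, values, out) -> {
            int n = 0;
            for (int ignored : values) n++;
            out.accept(key, n);
        };
        System.out.println(run(input, mapper, reducer)); // prints {1=1, 3=2}
    }
}
```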

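In map, read one line, split it on spaces to obtain each word in the line, and emit each word as the key. The value can be empty (for example a placeholder such as NullWritable) or any arbitrary content, because only the number of values per key matters later.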

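In reduce, each call receives one word together with the collection of all values emitted for it. The size of that collection is the number of times the word occurs, so the word and its count are written out to HDFS.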
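The word count flow described above can be traced end to end in plain Java. This is a sketch under stated assumptions rather than the WordCountMap and WordCountReduce classes themselves: it runs in a single process without Hadoop, the names `WordCountSketch`, `map`, `reduce`, and `run` are hypothetical, and the result is returned as an in-memory map instead of being written to HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
class WordCountSketch {

    // "map": split one line on spaces and emit each non-empty word as a key.
    static List<String> map(String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                keys.add(word);
            }
        }
        return keys;
    }

    // "reduce": the count of a word is the size of its value collection.
    static int reduce(String word, List<Integer> values) {
        return values.size();
    }

    static Map<String, Integer> run(List<String> lines) {
        // Grouping stage: store a placeholder value under each emitted key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // One reduce call per distinct word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```

The placeholder value 1 is never read; only the length of each list is used, which matches the point that the map output value can be anything.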