MapReduce Inverted Index Explained: A Worked Example

1. Requirement: build an inverted index over the following three files.

[word1.txt]
MapReduce is simple
[word2.txt]
MapReduce is powerful is simple
[word3.txt]
Hello MapReduce bye MapReduce
Result:
Hello   word3.txt:1;
MapReduce   word3.txt:2;word1.txt:1;word2.txt:1;
bye   word3.txt:1;
is   word1.txt:1;word2.txt:2;
powerful   word2.txt:1;
simple   word2.txt:1;word1.txt:1;
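
To make the target structure concrete before any MapReduce is involved, here is a minimal in-memory sketch in plain Java with no Hadoop dependency. The class name InvertedIndexSketch and the method build are hypothetical, introduced only for illustration. The sketch keeps words and file names sorted, so the order of the postings can differ from the job output listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Builds word -> (file name -> occurrence count) from in-memory documents.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("word1.txt", "MapReduce is simple");
        docs.put("word2.txt", "MapReduce is powerful is simple");
        docs.put("word3.txt", "Hello MapReduce bye MapReduce");
        build(docs).forEach((word, postings) ->
                System.out.println(word + "\t" + postings));
    }
}
```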
1. MapTask approach:
1) Mapper phase
a. Input <k1,v1> --> <LongWritable, Text> --> <byte offset, "MapReduce is simple">
Get the InputSplit object: FileSplit file = (FileSplit) context.getInputSplit();
Get the file name: file.getPath().getName();
Each token is then paired with its file name and a count of 1:
MapReduce word1.txt 1
MapReduce word2.txt 1
MapReduce word3.txt 1
MapReduce word3.txt 1
b. Output <k2,v2> --> <MapReduce:word3.txt, 1>, <MapReduce:word3.txt, 1>,
<is:word1.txt, 1>, <simple:word1.txt, 1> ...
(in the code below the key is the single string "word-filename", joined with "-")
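
The map step above can be tried out without a cluster. The following is a rough plain-Java sketch of what one map() call emits; MapStepSketch and its map method are hypothetical names, and the "-" separator follows the Mapper code later in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapStepSketch {
    // Mimics one map() call: emits ("word-filename", "1") for every token in the line.
    static List<String[]> map(String fileName, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line); // splits on whitespace
        while (tokens.hasMoreTokens()) {
            out.add(new String[] {tokens.nextToken() + "-" + fileName, "1"});
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : map("word3.txt", "Hello MapReduce bye MapReduce")) {
            System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        }
    }
}
```

Repeated words are emitted repeatedly at this stage; counting them is left to the next step.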
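2) Write a Combiner that extends Reducer:
Input <k2,v2> --> <MapReduce:word3.txt, 1>, <MapReduce:word3.txt, 1>
Reassign the key and the value, much like running a Reducer pass locally:
<_k2,_v2> --> <MapReduce, word3.txt:2>, <is, word1.txt:1>
Note:
The combiner's basic job is local aggregation by key: the framework sorts the map output by key, and the combiner iterates over the values of each key.
The combiner's purpose is to reduce the amount of map output sent over the network to the reducers.
A combiner works on the output of a single map task.
A combiner does much the same work as a reducer.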
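
The local aggregation described above can be sketched in plain Java as follows. CombineStepSketch and combine are hypothetical names. The sketch assumes that one map task reads exactly one file and that file names contain no "-", so each word ends up with a single "filename:count" value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombineStepSketch {
    // Counts identical composite keys from ONE map task, then re-keys each
    // group as word -> "filename:count".
    static Map<String, String> combine(List<String> compositeKeys) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key, like map output
        for (String key : compositeKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String composite = e.getKey();
            int sep = composite.lastIndexOf('-'); // assumption: no '-' in the file name
            out.put(composite.substring(0, sep),
                    composite.substring(sep + 1) + ":" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList(
                "Hello-word3.txt", "MapReduce-word3.txt",
                "bye-word3.txt", "MapReduce-word3.txt");
        combine(mapOutput).forEach((word, posting) ->
                System.out.println("<" + word + ", " + posting + ">"));
    }
}
```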
2. Reducer approach:
Input <_k2,_v2> --> <MapReduce, word3.txt:2>
<MapReduce, word2.txt:1>
<MapReduce, word1.txt:1>
Output <k3,v3> --> <MapReduce, [word3.txt:2, word2.txt:1, word1.txt:1]>
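
As a small illustration of the reduce step, the sketch below joins the "filename:count" values that arrive for one word. ReduceStepSketch and reduce are hypothetical names; the ";" terminator and the tab between key and value mirror the result listing at the top of this article.

```java
import java.util.Arrays;
import java.util.List;

class ReduceStepSketch {
    // Joins all postings for one word into "word<TAB>p1;p2;...;".
    static String reduce(String word, List<String> postings) {
        StringBuilder sb = new StringBuilder();
        for (String posting : postings) {
            sb.append(posting).append(';');
        }
        return word + "\t" + sb;
    }

    public static void main(String[] args) {
        System.out.println(reduce("MapReduce",
                Arrays.asList("word3.txt:2", "word1.txt:1", "word2.txt:1")));
    }
}
```

The postings are appended in whatever order they arrive, which is why the file order differs from word to word in the result listing.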

Code:
1.Mapper.class

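import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReverseMapper extends Mapper<LongWritable,Text,Text,Text> {
	private Text word=new Text();
	private Text one=new Text("1"); // each token counts as one occurrence
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// The input split comes from the context; for file-based input it is a FileSplit
		InputSplit inputsplit=context.getInputSplit();
		String filename=((FileSplit) inputsplit).getPath().getName(); // file name only
		StringTokenizer line=new StringTokenizer(value.toString()); // splits on whitespace
		while(line.hasMoreTokens()){
			word.set(line.nextToken()+"-"+filename); // composite key: word-filename
			context.write(word,one);
		}
	}
}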

2.Combiner.class

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseCombiner extends Reducer<Text, Text, Text, Text> {
	private Text word=new Text();
	private Text v3=new Text();
	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		int count=0;
		for(Text v:values){
			count++; // occurrences of this word in one file
		}
		// The key is "word-filename"; split at the last "-" (assumes file names contain no "-")
		String composite=key.toString();
		int sep=composite.lastIndexOf('-');
		word.set(composite.substring(0,sep));         // the word
		v3.set(composite.substring(sep+1)+":"+count); // filename:count
		context.write(word, v3); // <word, filename:count>, written once per key of each map task
	}
}

3.Reducer.class

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseReducer extends Reducer<Text, Text, Text, Text> {
	private Text v3 =new Text();
	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		StringBuilder sb=new StringBuilder(); // accumulates the postings for this word
		for(Text t:values){
			sb.append(t.toString()).append(";"); // each value is filename:count
		}
		v3.set(sb.toString());
		// Values that share a key come from all map tasks; join them and write once
		context.write(key, v3); // --> <word, filename:count;filename:count;...>
	}
}

4.Driver.class

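import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReverseDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		// Delete the output directory if it already exists; the job refuses to overwrite it
		Path outfile=new Path("file:///D:/serverse");
		FileSystem fs=outfile.getFileSystem(conf);
		if(fs.exists(outfile)){
			fs.delete(outfile,true);
		}

		Job job = Job.getInstance(conf);
		job.setJarByClass(ReverseDriver.class);
		job.setJobName("Inverted Index");
		job.setMapperClass(ReverseMapper.class);     // emits <word-filename, 1>
		job.setCombinerClass(ReverseCombiner.class); // local aggregation per map task
		job.setReducerClass(ReverseReducer.class);   // joins the postings per word

		// Map output and final output are both <Text, Text>
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		FileInputFormat.addInputPath(job, new Path("file:///D:/测试数据/倒叙排序/"));
		FileOutputFormat.setOutputPath(job,outfile);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}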

5.Run results

   Input sample:
             [word1.txt]
                      MapReduce is simple
             [word2.txt]
                      MapReduce is powerful is simple
             [word3.txt]
                      Hello MapReduce bye MapReduce

   Output:
             Hello	word3.txt:1;
             MapReduce	word3.txt:2;word1.txt:1;word2.txt:1;
             bye	word3.txt:1;
             is	word1.txt:1;word2.txt:2;
             powerful	word2.txt:1;
             simple	word2.txt:1;word1.txt:1;

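Notes:
1) The Combiner runs before the Reducer. It extends Reducer, so writing one amounts to overriding the reduce method once more.
2) The driver class must register the custom Combiner class,
i.e.: job.setCombinerClass(ReverseCombiner.class);
3) Hadoop treats the combiner as an optimization and may run it zero, one, or several times per map task. This example depends on it running exactly once, because the Combiner rewrites the key; the output above shows that it did so in this small local run, but that is not guaranteed in general.
4) Summary:
The combiner's basic job is local aggregation by key: the framework sorts the map output by key, and the combiner iterates over the values of each key.
The combiner's purpose is to reduce the amount of map output sent over the network to the reducers.
A combiner does much the same work as reduce. The difference is scope: a combiner merges the output of one map task, while reduce merges the output of many map tasks.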
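
To put a number on the traffic reduction for the sample data, here is a rough plain-Java sketch that counts the records leaving the map side with and without local aggregation. ShuffleCountSketch and its methods are hypothetical names, and the sketch assumes one map task per file.

```java
import java.util.Arrays;
import java.util.HashSet;

class ShuffleCountSketch {
    // Records a map task emits without a combiner: one per token.
    static int withoutCombiner(String content) {
        String trimmed = content.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Records left after local aggregation: one per distinct word in the file.
    static int withCombiner(String content) {
        String trimmed = content.trim();
        return trimmed.isEmpty()
                ? 0
                : new HashSet<>(Arrays.asList(trimmed.split("\\s+"))).size();
    }

    public static void main(String[] args) {
        String[] files = {
            "MapReduce is simple",
            "MapReduce is powerful is simple",
            "Hello MapReduce bye MapReduce"
        };
        int raw = 0;
        int combined = 0;
        for (String content : files) {
            raw += withoutCombiner(content);
            combined += withCombiner(content);
        }
        System.out.println("without combiner: " + raw);
        System.out.println("with combiner: " + combined);
    }
}
```

For the three sample files this comes to 12 records without local aggregation and 10 with it. The saving grows with the number of repeated words per file.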