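Inverted Index with MapReduce

The inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines.

It stores a mapping from a word (or phrase) to its locations in a document or a set of documents, which provides a way to find documents by their content.

Because it looks up documents from the content they contain, rather than looking up content from a document, the structure is called an inverted index.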

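1 Example description
An inverted index typically consists of a word (or phrase) and a list of related documents.
Each entry in the document list is either an ID that identifies the document or the URL where the document is located.
In practice, each document is also given a weight that indicates how relevant it is to the search term.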

Sample input:
1) file1:
MapReduce is simple
2) file2:
MapReduce is powerful is simple
3) file3:
Hello MapReduce bye MapReduce

Output:
Hello    | file3:1 |
MapReduce    | file3:2 | file1:1 | file2:1 |
bye  | file3:1 |
is   | file1:1 | file2:2 |
powerful     | file2:1 |
simple   | file2:1 | file1:1 |
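
To make the target structure concrete before any MapReduce code, here is a minimal in-memory sketch in plain Java that builds the same word to (file, count) mapping for the three sample files. The class name InMemoryInvertedIndex and the method build are hypothetical names chosen for illustration; they are not part of the Hadoop job shown later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class InMemoryInvertedIndex {

    // Builds word -> (file name -> term frequency).
    // TreeMap keeps the words sorted; LinkedHashMap keeps files in first-seen order.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip empty documents
                }
                index.computeIfAbsent(word, w -> new LinkedHashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1", "MapReduce is simple");
        docs.put("file2", "MapReduce is powerful is simple");
        docs.put("file3", "Hello MapReduce bye MapReduce");
        build(docs).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

The order of files inside each posting list differs from the sample output, because a real job does not guarantee the order in which values reach the reducer.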

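2 Implementation steps

a. The map method emits key = word:file name and value = 1, so that the combiner can count term frequencies and regroup the word:file name key.
b. The combiner sums the counts for identical keys and re-keys each record as key = word, value = file name:term frequency, so that reduce only has to append values.
c. The reduce method appends the values that share a key.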

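1) Map phase
   The input files are first processed with the default TextInputFormat class, which yields the byte offset and the content of each line of text.
   The Map phase must then parse each input <key,value> pair to obtain the three pieces of information the inverted index needs: the word, the document URL, and the term frequency.

This raises two problems.
First, a <key,value> pair holds only two values, so unless a custom Hadoop data type is used,
two of the three values have to be merged into one and used as the key or the value.

Second, a single Reduce phase cannot both count term frequencies and build the document list, so a Combine phase is added to do the counting.

Here the word and the URL form the key (for example "MapReduce:file1.txt") and the term frequency is the value.
The benefit is that the framework's built-in map-side sort groups the counts of the same word in the same document into one list
and passes it to the Combine phase, which then works much like WordCount.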
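
The composite key can be illustrated without Hadoop. The following is a minimal sketch, assuming a hypothetical helper map(fileName, line) that returns the records a mapper would emit for one line of one file; in the real job the file name comes from the input split rather than from a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapStageSketch {

    // Returns one {key, value} pair per token: key = word:fileName, value = "1".
    static List<String[]> map(String fileName, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new String[] { itr.nextToken() + ":" + fileName, "1" });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : map("file2", "MapReduce is powerful is simple")) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```

Note that "is" is emitted twice for file2, each time with the value 1; adding the two up is left to the next phase.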

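2) Combine phase
   After the map method has run, the Combine phase adds up the values that share a key, which gives the term frequency of a word in a document.
   If this output were passed directly to the Reduce phase, the Shuffle would run into a problem: all records for the same word
(each made up of a word, a URL, and a term frequency) should go to the same Reducer, but the current key cannot guarantee that, so both the key and the value have to change.

This time the word becomes the key, and the URL and the term frequency form the value (for example "file1.txt:1").
The intent is that the framework's default HashPartitioner class then sends all records for the same word to the same Reducer during the Shuffle.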
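
The summing and re-keying step can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Hadoop combiner itself: the class CombineStageSketch and its combine method are hypothetical, records are modelled as {key, value} string pairs, and a TreeMap stands in for the framework's sort and grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombineStageSketch {

    // Input: {word:file, "1"} pairs from one map task.
    // Output: {word, file:count} pairs, one per distinct word:file key.
    static List<String[]> combine(List<String[]> mapOutput) {
        // Group by the composite key and sum the counts.
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] kv : mapOutput) {
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        // Move the file name from the key into the value.
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            int sep = e.getKey().lastIndexOf(':');
            String word = e.getKey().substring(0, sep);
            String file = e.getKey().substring(sep + 1);
            out.add(new String[] { word, file + ":" + e.getValue() });
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        for (String w : "MapReduce is powerful is simple".split(" ")) {
            mapOutput.add(new String[] { w + ":file2", "1" });
        }
        for (String[] kv : combine(mapOutput)) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```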

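3) Reduce phase
After the two phases above, the Reduce phase only has to combine the values that share a key into the format required by the inverted index file;
the MapReduce framework takes care of the rest.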
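
As a small illustration of this last step, the sketch below joins the "file:count" values of one word into a single posting-list string. The class ReduceStageSketch and its reduce method are hypothetical stand-ins for the reducer, and the exact spacing of the separators is an assumption modelled on the sample output.

```java
import java.util.Arrays;
import java.util.List;

class ReduceStageSketch {

    // Joins the values of one key into "| v1 | v2 | ... |".
    static String reduce(List<String> postings) {
        StringBuilder fileList = new StringBuilder();
        for (String posting : postings) {
            fileList.append("| ").append(posting).append(' ');
        }
        fileList.append('|');
        return fileList.toString();
    }

    public static void main(String[] args) {
        System.out.println("is\t" + reduce(Arrays.asList("file1:1", "file2:2")));
    }
}
```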

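4) Open issues
This inverted index places no limit on the number of files, but a single file should not be too large (the exact threshold depends on the default HDFS block size and related settings), so that each file maps to exactly one split.
Otherwise, because the Reduce phase does not aggregate term frequencies any further, the final result may contain words whose counts are spread over several partial entries.
One fix is to override the InputFormat class so that each file is treated as one split.
Another is to run two MapReduce jobs: the first counts term frequencies and the second builds the inverted index.
Note also that Hadoop treats a combiner as an optional optimization that may run zero, one, or several times, and that map output is partitioned on the map output key before the combiner runs. The single-job design above therefore depends on the combiner running exactly once per key and is safest with a single reducer, which is another reason to prefer the two-job variant.
Beyond that, composite keys and values can be used to build an inverted index that carries more information.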
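
A third option, not described above and offered here only as a sketch, is to let the reduce step itself sum the counts per file, so that partial counts coming from different splits of the same file are merged. The class MergingReduceSketch and its reduce method are hypothetical, and file names are assumed not to contain ':'.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergingReduceSketch {

    // Sums "file:count" values per file before formatting the posting list.
    static String reduce(List<String> postings) {
        Map<String, Integer> perFile = new LinkedHashMap<>();
        for (String posting : postings) {
            int sep = posting.lastIndexOf(':');
            String file = posting.substring(0, sep);
            int count = Integer.parseInt(posting.substring(sep + 1));
            perFile.merge(file, count, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : perFile.entrySet()) {
            sb.append("| ").append(e.getKey()).append(':').append(e.getValue()).append(' ');
        }
        return sb.append('|').toString();
    }

    public static void main(String[] args) {
        // file2 was split in two, so its count for this word arrives as two partial values.
        System.out.println(reduce(Arrays.asList("file2:1", "file1:1", "file2:1")));
    }
}
```

With this variant the two partial "file2:1" values collapse into a single "file2:2" entry instead of appearing twice in the output.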

Implementation code:

package Inverted;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;


public class InvertedIndex {

    static String INPUT_PATH = "hdfs://master:9000/c";
    static String OUTPUT_PATH = "hdfs://master:9000/output";

    static class Map extends Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();
        private Text valueInfo = new Text("1");

        // Emits key = word:file name, value = 1 for every token in the line.
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            // Name of the file this split belongs to, e.g. "file1".
            // getName() avoids searching the full path for the substring "file".
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();

            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // key ---> word:file_name, value ---> 1
                keyInfo.set(itr.nextToken() + ":" + fileName);
                context.write(keyInfo, valueInfo);
            }
        }
    }

    // Sums the counts for one word:file key and re-keys the record as
    // key = word, value = file name:term frequency.
    static class Combine extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // Sum the term frequency.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }

            // Split on the last ':' so that a word containing ':' is not cut short
            // (file names are assumed not to contain ':').
            String composite = key.toString();
            int splitIndex = composite.lastIndexOf(':');
            // key ---> word, value ---> file_name:sum
            info.set(composite.substring(splitIndex + 1) + ":" + sum);
            key.set(composite.substring(0, splitIndex));

            context.write(key, info);
        }
    }

    // Appends all values that share a key into one posting list.
    static class Reduce extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(" | ").append(value.toString());
            }
            fileList.append(" |");
            result.set(fileList.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        // Set the default file system before any FileSystem object is created.
        conf.set("fs.defaultFS", "hdfs://master:9000/");

        // Optional arguments <in> <out> override the hard-coded paths.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 0 && otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex [<in> <out>]");
            System.exit(2);
        }
        String input = otherArgs.length == 2 ? otherArgs[0] : INPUT_PATH;
        Path outputpath = new Path(otherArgs.length == 2 ? otherArgs[1] : OUTPUT_PATH);

        // Remove the output directory left over from a previous run.
        FileSystem fs = outputpath.getFileSystem(conf);
        if (fs.exists(outputpath)) {
            fs.delete(outputpath, true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(InvertedIndex.class);

        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, outputpath);

        job.setMapperClass(Map.class);
        // Set the combiner
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

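TF-IDF
Principle: within a file, the term frequency (TF) is the number of times a given word appears in that file.
This count is usually normalized so that it is not biased toward long files (the same word tends to have a higher raw count in a long file than in a short one, regardless of how important the word is).
The inverse document frequency (IDF) measures how important a word is across the whole collection.
IDF = Math.log10(total number of files / number of files containing the word).

A high term frequency within a particular file, combined with a low document frequency across the whole collection, yields a high TF-IDF weight.
TF-IDF therefore tends to filter out common words and keep the important ones.

Example:
A file contains 100 words in total and mapreduce appears 3 times, so TF = 3/100 = 0.03.
mapreduce appears in 1,000 files out of 10,000,000 in total, so IDF = log10(10,000,000/1,000) = 4.
The final TF-IDF score is 0.03 * 4 = 0.12.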
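
The arithmetic in this example can be checked with a few lines of plain Java. This is only a sketch of the formulas as stated above; the class TfIdfExample and its method names are hypothetical.

```java
class TfIdfExample {

    // Normalized term frequency: occurrences of the word / total words in the file.
    static double tf(long termCount, long totalTerms) {
        return (double) termCount / totalTerms;
    }

    // Inverse document frequency with a base-10 logarithm.
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    static double tfIdf(long termCount, long totalTerms, long totalDocs, long docsWithTerm) {
        return tf(termCount, totalTerms) * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // 3 occurrences in a 100-word file; the word appears in 1,000 of 10,000,000 files.
        System.out.println("TF     = " + tf(3, 100));
        System.out.println("IDF    = " + idf(10_000_000L, 1_000L));
        System.out.println("TF-IDF = " + tfIdf(3, 100, 10_000_000L, 1_000L));
    }
}
```

Because the computation uses floating-point numbers, the printed TF-IDF value is approximately 0.12 rather than necessarily the exact decimal.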

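1 Problem analysis

job1: for every input file in the collection, count the occurrences of each word; the output is
<word w | file name f, number of occurrences of w in f (f-w-count)>

job2: from the output of job1, count the total number of words in file f (f-length, the sum of f-w-count over all words in f);
the output is <word w | file name f, f-w-count | f-length>

job3: first count the number of files in the collection (length);
then, from the output of job2, count the number of files k in which each word appears, and output
    1.  <w, [f1 = f1-w-count | f1-length, f2…  ]>
    2.  <w | f1, (f1-w-count / f1-length) * log(length / k)>
that is, <word w | file name f1, tf-idf-f1-w>, the TF-IDF weight of each word in each file.
In summary:
+ count the number of files in the collection, length
+ IDF = log(length / f-contains-w)
+ TF = f-w-count / f-length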
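
The three jobs can be traced end to end on a small in-memory corpus. The sketch below is plain Java rather than Hadoop code: the class TfIdfPipelineSketch and its compute method are hypothetical, each commented phase stands in for one job, and a base-10 logarithm is assumed for the IDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class TfIdfPipelineSketch {

    // Returns "word|file" -> TF-IDF weight for a corpus given as file name -> text.
    static Map<String, Double> compute(Map<String, String> docs) {
        // job1: f-w-count, the number of occurrences of each word w in each file f.
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> perWord =
                    counts.computeIfAbsent(doc.getKey(), f -> new LinkedHashMap<>());
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    perWord.merge(word, 1, Integer::sum);
                }
            }
        }

        // job2: f-length, the total number of words in each file f.
        Map<String, Integer> lengths = new LinkedHashMap<>();
        counts.forEach((file, perWord) ->
                lengths.put(file, perWord.values().stream().mapToInt(Integer::intValue).sum()));

        // job3: k, the number of files containing each word, then the TF-IDF weight.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        counts.values().forEach(perWord ->
                perWord.keySet().forEach(w -> docFreq.merge(w, 1, Integer::sum)));

        int length = docs.size();
        Map<String, Double> weights = new TreeMap<>();
        counts.forEach((file, perWord) -> perWord.forEach((word, count) -> {
            double tf = (double) count / lengths.get(file);
            double idf = Math.log10((double) length / docFreq.get(word));
            weights.put(word + "|" + file, tf * idf);
        }));
        return weights;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1", "MapReduce is simple");
        docs.put("file2", "MapReduce is powerful is simple");
        docs.put("file3", "Hello MapReduce bye MapReduce");
        compute(docs).forEach((key, weight) -> System.out.println(key + "\t" + weight));
    }
}
```

On the three sample files, MapReduce appears in every file, so its IDF is log10(3/3) = 0 and its weight is 0 everywhere, which shows how TF-IDF suppresses words that are common across the collection.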

2 Sample code
TF = wordCount / sumOfWordsInDoc

double tf = Double.parseDouble(wordFrequencyAndTotalWords[0]) / Double.parseDouble(wordFrequencyAndTotalWords[1]);

double idf = (double) totalDocs / totalDocsForWord;

double tfidf = tf * Math.log10(idf);

context.write(word|filename, tfidf)   // pseudocode: emit the TF-IDF of one word in one file
