倒排索引的简单实现

首先看一个例子:



假设有3篇文章,file1, file2, file3,文件内容如下:



file1 (单词1,单词2,单词3,单词4....)

file2 (单词a,单词b,单词c,单词d....)

file3 (单词1,单词a,单词3,单词d....)


那么建立的倒排索引就是这个样子:



单词1 (file1,file3)

单词2 (file1)

单词3 (file1,file3)

单词a (file2, file3)

....


倒排索引的概念很简单:就是将文件中的单词作为关键字,然后建立单词与文件的映射关系。当然,你还可以添加文件中单词出现的频数等信息。倒排索引是搜索引擎中一个很基本的概念,几乎所有的搜索引擎都会使用到倒排索引。



下面是我对于倒排索引的一个简单的实现。该程序对于输入的一段文字,查找出该词所出现的行号以及出现的次数。



import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class InvertedIndex {

	private Map<String, Map<Integer, Integer>> index;
	private Map<Integer, Integer> subIndex;

	public void createIndex(String filePath) {
		index = new HashMap<String, Map<Integer, Integer>>();

		try {
			File file = new File(filePath);
			InputStream is = new FileInputStream(file);
			BufferedReader read = new BufferedReader(new InputStreamReader(is));

			String temp = null;
			int line = 1;
			while ((temp = read.readLine()) != null) {
				String[] words = temp.split(" ");
				for (String word : words) {
					if (!index.containsKey(word)) {
						subIndex = new HashMap<Integer, Integer>();
						subIndex.put(line, 1);
						index.put(word, subIndex);
					} else {
						subIndex = index.get(word);
						if (subIndex.containsKey(line)) {
							int count = subIndex.get(line);
							subIndex.put(line, count+1);
						} else {
							subIndex.put(line, 1);
						}
					}
				}
				line++;
			}
			read.close();
			is.close();
		} catch (IOException e) {
			System.out.println("error in read file");
		}
	}

	public void find(String str) {
		String[] words = str.split(" ");
		for (String word : words) {
			StringBuilder sb = new StringBuilder();
			if (index.containsKey(word)) {
				sb.append("word: " + word + " in ");
				Map<Integer, Integer> temp = index.get(word);
				for (Map.Entry<Integer, Integer> e : temp.entrySet()) {
					sb.append("line " + e.getKey() + " [" + e.getValue() + "] , ");
				}
			} else {
				sb.append("word: " + word + " not found");
			}
			System.out.println(sb);
		}
	}

	public static void main(String[] args) {
		InvertedIndex index = new InvertedIndex();
		index.createIndex("news.txt");
		index.find("I love Shanghai today");
	}
}
 


其中,输入文件news.txt内容为:



I am eriol
I live in Shanghai and I love Shanghai
I also love travelling
life in Shanghai
is beautiful



输出结果为:



word: I in line 1 [1] , line 2 [2] , line 3 [1] ,
word: love in line 2 [1] , line 3 [1] ,
word: Shanghai in line 2 [2] , line 4 [1] ,
word: today not found

猜你喜欢

转载自wwangcg.iteye.com/blog/1336106
今日推荐