利用lucene构建自己的搜索引擎

   lucene 是什么？
    lucene是一个基于JAVA的全文信息检索工具包，它不是一个完整的搜索引擎，它只为你的程序构建索引，然后在索引上进行搜索。具体lucene能做什么，见下图：

   lucene能做什么？
   任何文本的东西都可以给lucene构建索引，不管是pdf、html只要能转换为文本，lucene就可以构建索引，查询索引。就这么简单

   下面就用一个简单的例子来构建自己的应用程序：

     Lucene 软件包分析

Lucene 软件包的发布形式是一个 JAR 文件，下面我们分析一下这个 JAR 文件里面的主要的 JAVA 包，使读者对之有个初步的了解。

Package: org.apache.lucene.document

这个包提供了一些为封装要索引的文档所需要的类，比如 Document, Field。这样，每一个文档最终被封装成了一个 Document 对象。

Package: org.apache.lucene.analysis

这个包主要功能是对文档进行分词，因为文档在建立索引之前必须要进行分词，所以这个包的作用可以看成是为建立索引做准备工作。

Package: org.apache.lucene.index

这个包提供了一些类来协助创建索引以及对创建好的索引进行更新。这里面有两个基础的类：IndexWriter 和 IndexReader，其中 IndexWriter 是用来创建索引并添加文档到索引中的，IndexReader 是用来删除索引中的文档的。

Package: org.apache.lucene.search

这个包提供了对在建立好的索引上进行搜索所需要的类。比如 IndexSearcher 和 Hits, IndexSearcher 定义了在指定的索引上进行搜索的方法，Hits 用来保存搜索得到的结果。

    首先介绍一下lucene的基本类：

    Document : 用于表示一个文件，document由Field组成，是Field的一个集合，Field表示用于建立索引和查询索引。所以为了能合理的区分多个document，建议对每个文档用到的Field应该是能唯一标识该docuemnt，尽量和别的document的Field内容不同。
    Field： Document的集合内容，用于区分文档和查询文档的基本单位，有两部分组成一个是name，一个是原始的未加工的文本内容。如Field field = new Field("name" ,"value");,在查询索引的时刻可以依据name来查询，也就是用name对应的value来匹配我们的关键字。

     Analyze：建立索引避免不了对文档内容进行分词，Analyze对象正是用于分词的对象。该类是一个抽象内，你可以依据自己的业务规则来分词，还有很多第三方分词器，自己用适配模式+代理模式来适配自己的业务也行。
    Directory： Directory是用来存储index文件的目录，该类有两个实现类FSDirectory/RAMDirectory(FS = FileSystem , RAM...).

   IndexWriter： IndexWriter是用来建立索引的具体类，IndexReader 是用来删除索引的具体类.

   IndexSearcher：IndexSearcher是在已经建立的索引上进行查询的具体类。

建立索引的例子代码：

/**
	 * 构建索引
	 * @param indexDir
	 * @param dataDir
	 * @throws Exception
	 */
	public void buildIndex(String indexDir, String dataDir) throws Exception {
		// 存放索引位置
		File indexFile = new File(indexDir);
		// 数据文件位置
		File dataFile = new File(dataDir);
		 File[] alldataFiles = dataFile.listFiles();
		Version version = Version.LUCENE_34;
		//分词 分析类
		Analyzer luceneAnalyzer = new StandardAnalyzer(version);
		//存放索引的位置
		Directory directory = FSDirectory.open(indexFile);
		//写入索引
		IndexWriter indexWriter = new IndexWriter(directory,
				new IndexWriterConfig(version, luceneAnalyzer));
		for (File tempFile : alldataFiles) {
			if (tempFile.isFile()) {
				//文档抽象对象
				Document document = new Document();
				document.add(new Field("path", tempFile.getCanonicalPath(),
						Field.Store.YES, Field.Index.ANALYZED));
				document.add(new Field("content", readFileContent(tempFile),
						Field.Store.YES, Field.Index.ANALYZED));
				indexWriter.addDocument(document);
			}
		}
		indexWriter.optimize();
		indexWriter.close();
		System.out.println("end.....................");
 
	}

查询内容代码：

private static String readFileContent(File file) throws IOException {
		BufferedReader reader = new BufferedReader(new InputStreamReader(
				new FileInputStream(file), "gbk"));
		StringBuffer content = new StringBuffer();
		for (String line = null; (line = reader.readLine()) != null;) {
			content.append(line).append("\n");
		}
		return content.toString();
	}

基础类：
Query:将用户的输入组织成luence能识别的方式，Query有多个实现类，对应多种情况。
QueryParser：该类用于将索引的Field和关键字转换成luence能理解的Query
IndexSearcher：该类用于在已经建立的索引上执行搜索。
TopDocs：该类用于保存IndexSearcher搜索到的结果。是对结果的一个封装。

查询索引的例子代码：

/**
	 * 搜索内容
	 * @param indexDir
	 * @throws Exception
	 */
	public void searchIndex(String indexDir) throws Exception {
		// 存放索引位置
		File indexFile = new File(indexDir);
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
		String queryString = "爱到万年";
		// 把要搜索的文本解析为Query
		String[] fields = { "path", "content" };
		QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_34,
				fields, analyzer); // 查询解析器
		Query query = queryParser.parse(queryString);
		// 查询
		IndexSearcher indexSearcher = new IndexSearcher(
				FSDirectory.open(indexFile));
		Filter filter = null;
		//获取匹配结果
		TopDocs topDocs = indexSearcher.search(query, filter, 10000);// topDocs
																		// 类似集合
		System.out.println("总共有【" + topDocs.totalHits + "】条匹配结果.");
		// 输出
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
			int docSn = scoreDoc.doc;// 文档内部编号
			Document doc = indexSearcher.doc(docSn);// 根据文档编号取出相应的文档
			System.out.println("path:" + doc.get("path"));
			System.out.println("content:" + doc.get("content"));
		}
		System.out.println("end.............. " + new Date().toLocaleString());
	}

利用lucene构建自己的搜索引擎

猜你喜欢