Lucene Study Notes 02: Basic Indexing

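As mentioned in the previous post, using Lucene involves two steps: indexing and searching. Indexing is the foundation and the prerequisite; searching is the goal. This post covers basic indexing in Lucene.

This post and the ones that follow use files stored on disk as the running example for indexing and searching.

For files on disk, we might have the following search requirements: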

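Search by file name (a very common need)
Search by file path (well... just kidding; we do not actually need to search by path, but we do want to see the path in the search results)
Search by file type
Search by file size
Search by modification time
Search by file content
...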

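Setting Lucene aside for a moment, think about how a person would handle these requirements by hand.

We would probably take a sheet of paper and a pen, and fill in a table like this:

No.	File name	File path	File type	File size	Modified time	File content	...

With such a table we can simply look things up and complete the search tasks above.

Filling in the table is the indexing process. It maps to Lucene as follows: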

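Paper: the place where the index is kept. In Lucene this is a Directory. Lucene ships several Directory implementations; the most commonly used are FSDirectory and RAMDirectory. As the names suggest, FSDirectory stores the index in files on disk, which corresponds to the paper in this example, while RAMDirectory stores the index in memory, which corresponds to keeping everything in your head.
Pen: the tool that writes the index. In Lucene this is IndexWriter. Despite the name, IndexWriter is not a subclass of java.io.Writer; it is a standalone class that implements Closeable, so it must be closed when indexing is finished.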

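Record: in Lucene this is a Document.

Field: in Lucene this is a Field (IndexableField). Each Field instance can specify the field name, whether the field is indexed, whether it is tokenized, and whether it is stored. For a stored field, the value can be read back directly from the Document returned by a search; asking such a Document for a field that was not stored returns null. The most commonly used Field subclasses are: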

Field subclass	Indexed	Tokenized	Stored
StringField	Yes	No	Configurable
DoubleField, FloatField, LongField, IntField	Yes	No	Configurable
TextField	Yes	Yes	Configurable
StoredField	No	No	Yes

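No.: databases have auto-increment IDs, and Lucene has something similar called the docId. You do not assign it; Lucene generates it automatically. At search time, Lucene uses the docId to identify a unique Document. Unlike a database key, a docId is internal to the index and can change when segments are merged, so it should not be kept as a permanent identifier.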
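To make the three switches (indexed, tokenized, stored) and the automatic docId more concrete, here is a toy model in plain Java. It is only a sketch for intuition and is not how Lucene is implemented; ToyIndex, ToyField and every method name are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of indexed / tokenized / stored. NOT Lucene's implementation. */
final class ToyIndex {
    /** Hypothetical field description: name, value and the three switches. */
    static final class ToyField {
        final String name;
        final String value;
        final boolean indexed;
        final boolean tokenized;
        final boolean stored;

        ToyField(String name, String value, boolean indexed, boolean tokenized, boolean stored) {
            this.name = name;
            this.value = value;
            this.indexed = indexed;
            this.tokenized = tokenized;
            this.stored = stored;
        }
    }

    // "field:term" -> ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // position in the list is the docId; each entry holds stored values only
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    /** Adds one document and returns its automatically assigned docId. */
    int addDocument(ToyField... fields) {
        int docId = storedValues.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                // Tokenized: split into lower-cased words. Untokenized: the whole value is one term.
                String[] terms = f.tokenized ? f.value.toLowerCase().split("\\W+") : new String[] { f.value };
                for (String term : terms) {
                    if (!term.isEmpty()) {
                        postings.computeIfAbsent(f.name + ":" + term, k -> new TreeSet<>()).add(docId);
                    }
                }
            }
            if (f.stored) {
                stored.put(f.name, f.value);
            }
        }
        storedValues.add(stored);
        return docId;
    }

    /** Returns the ids of the documents whose field contains exactly this term. */
    List<Integer> search(String field, String term) {
        return new ArrayList<>(postings.getOrDefault(field + ":" + term, new TreeSet<>()));
    }

    /** Returns the stored value, or null when the field was not stored. */
    String storedValue(int docId, String field) {
        return storedValues.get(docId).get(field);
    }
}
```

In this model, a filename added as indexed but not tokenized can only be found by its complete value, a content field added as tokenized but not stored can be found by any single word yet reads back as null, and a pathname that is stored but not indexed can be read back but never searched.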

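Based on the analysis above, the fields we need to record are:

No.: generated automatically by Lucene; nothing to do
File name: StringField, stored. Field name: filename
File path: StoredField, stored. Field name: pathname
File type: StringField, stored. Field name: type
File size: LongField, stored. Field name: size
Modified time: LongField, stored. Field name: lastmodified
File content: TextField, not stored. Field name: content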

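Now for the program design.

The pseudocode is as follows:

    Create a class that indexes files on disk and give it an indexing method that takes two parameters: the first is the path where the index is saved, the second is the path of the files or directories to index (there may be more than one).
    Implementation of this method:
    1. Obtain a Directory object, then use it to obtain an IndexWriter object.
    2. Index each file or directory:
        If it is a file, convert the file information into a Document object and add it to the IndexWriter.
        If it is a directory, repeat step 2 for every file and directory inside it.
    3. Close the IndexWriter object and the Directory object.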
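Step 2 is an ordinary recursive directory traversal. As a sketch of just that step, independent of Lucene, the traversal can also be written with java.nio's Files.walk instead of hand-written recursion. FileCollector and collectFiles are hypothetical names used only for this illustration; the implementation below in this post uses explicit recursion over java.io.File instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch of step 2 of the pseudocode: find every file that should become a Document. */
final class FileCollector {
    /** Returns every regular file under root (or root itself if it is a file), sorted by path. */
    static List<Path> collectFiles(Path root) throws IOException {
        // Files.walk visits root and, if it is a directory, everything below it
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }
}
```

One design difference worth knowing: a recursive version that checks listFiles() for null silently skips directories it cannot read, whereas Files.walk throws an UncheckedIOException when it hits one during the traversal.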

Implementation:

package cn.lym.lucene.quickstart.index;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;

import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import cn.lym.lucene.quickstart.util.FileUtil;
import cn.lym.lucene.quickstart.util.StreamUtil;

/**
 * Builds a Lucene index for files on disk.
 * 
 * @author liuyimin
 *
 */
public class Indexer {
	/**
	 * Logger for this class
	 */
	private static final Logger logger = LogManager.getLogger(Indexer.class);

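	/**
	 * Builds the index.
	 * 
	 * @param indexDir
	 *            path where the index is saved
	 * @param dataDirs
	 *            paths of the files or directories to index
	 * @throws Exception
	 *             if the index cannot be opened or written
	 */
	public void index(String indexDir, String... dataDirs) throws Exception {
		Directory directory = null;
		IndexWriter writer = null;
		try {
			directory = FSDirectory.open(new File(indexDir));
			IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
			writer = new IndexWriter(directory, config);

			for (String dataDir : dataDirs) {
				index(writer, new File(dataDir));

				// Commit after each data directory so its documents are persisted
				writer.commit();
			}
		} finally {
			// Close the writer first, then the directory, even if indexing failed
			StreamUtil.close(writer, directory);
		}
	}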

	/**
	 * Indexes a file, or recursively indexes a directory.
	 * 
	 * @param writer
	 *            the IndexWriter to add documents to
	 * @param file
	 *            a file or directory
	 */
	private void index(IndexWriter writer, File file) {
		if (file.isDirectory()) {// directory: index its contents recursively
			File[] subFiles = file.listFiles();
			if (subFiles != null) {
				for (File subFile : subFiles) {
					index(writer, subFile);
				}
			}
		} else if (file.isFile()) {// regular file: index it
			if (logger.isDebugEnabled()) {
				logger.debug("indexing file: " + file.getAbsolutePath());
			}

			try {
				Document document = file2Document(file);
				writer.addDocument(document);
			} catch (Exception e) {
				logger.error(
						"An error occurred while adding a document to indexwriter. File: " + file.getAbsolutePath(), e);
			}
		}
	}

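	/**
	 * Converts a file into a Lucene {@link Document}.<br/>
	 * The document contains:
	 * <ul>
	 * <li>pathname: absolute path</li>
	 * <li>filename: file name</li>
	 * <li>size: file size in bytes</li>
	 * <li>lastmodified: last modification time in milliseconds since the epoch</li>
	 * <li>type: file type</li>
	 * <li>content: file content (plain-text files only, as decided by
	 * {@link FileUtil#isPlainTextFile(File)})</li>
	 * </ul>
	 * 
	 * @param file
	 *            the file to convert
	 * @return the Document describing the file
	 */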
	private Document file2Document(File file) {
		Document document = new Document();
		document.add(new StoredField("pathname", file.getAbsolutePath()));
		document.add(new StringField("filename", file.getName(), Store.YES));
		document.add(new StringField("type", FileUtil.getFileType(file), Store.YES));
		document.add(new LongField("size", file.length(), Store.YES));
		document.add(new LongField("lastmodified", file.lastModified(), Store.YES));
		if (FileUtil.isPlainTextFile(file)) {// index the content of plain-text files only
			try {
				Reader reader = new FileReader(file);
				document.add(new TextField("content", reader));
			} catch (Exception e) {
				logger.error("An error occurred while indexing " + file.getAbsolutePath(), e);
			}
		}
		return document;
	}
}

Two utility classes are used: FileUtil and StreamUtil.

FileUtil:

package cn.lym.lucene.quickstart.util;

import java.io.File;

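/**
 * File-related utilities
 * 
 * @author liuyimin
 *
 */
public class FileUtil {
	/**
	 * Returns the file type, taken to be the extension after the last dot. If
	 * the name contains no dot, the whole file name is returned.
	 * 
	 * @param file
	 *            the file to inspect
	 * @return the file type
	 */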
	public static String getFileType(File file) {
		String fileName = file.getName();
		int index = fileName.lastIndexOf(".");
		if (index != -1) {
			return fileName.substring(index + 1);
		}
		return fileName;
	}

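	/**
	 * Tells whether the file is a plain-text file.
	 * 
	 * @param file
	 *            the file to inspect
	 * @return true if the file is treated as plain text
	 */
	public static boolean isPlainTextFile(File file) {
		// To keep things simple, only .txt files are treated as plain text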
		String fileType = getFileType(file);
		return "txt".equals(fileType);
	}
}
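Note that getFileType as written is case-sensitive and returns the whole name when there is no dot, so a file called README gets the type "README" and NOTES.TXT is not treated as plain text. If that matters for your data, one possible variant is sketched below. FileTypes and extensionOf are hypothetical names, and returning an empty string for "no extension" is an assumption of this sketch, not something the original project does.

```java
import java.util.Locale;

/** Sketch of a stricter extension helper; names and behavior are illustrative only. */
final class FileTypes {
    /** Returns the lower-cased extension without the dot, or "" when there is none. */
    static String extensionOf(String fileName) {
        int index = fileName.lastIndexOf('.');
        // index <= 0 covers "README" (no dot) and ".gitignore" (leading dot only);
        // the second check covers names that end with a dot
        if (index <= 0 || index == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(index + 1).toLowerCase(Locale.ROOT);
    }
}
```

With this variant a plain-text check could compare against "txt" regardless of how the extension is capitalized on disk.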

StreamUtil:

package cn.lym.lucene.quickstart.util;

import java.io.Closeable;

/**
 * Stream-related utilities
 * 
 * @author liuyimin
 *
 */
public class StreamUtil {
	/**
	 * Closes the given resources in order, ignoring null entries.
	 * 
	 * @param closeables
	 *            resources to close
	 */
	public static void close(Closeable... closeables) {
		if (closeables != null) {
			for (Closeable closeable : closeables) {
				if (closeable != null) {
					try {
						closeable.close();
					} catch (Exception e) {
						// Deliberately ignored so the remaining resources are still closed
					}
				}
			}
		}
	}
}
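Because StreamUtil.close swallows any exception thrown by close(), a failure while closing goes unnoticed. Since both IndexWriter and Directory implement Closeable, Java's try-with-resources is a possible alternative: it closes resources in the reverse order of their declaration and lets exceptions propagate. The sketch below shows only the ordering, using a hypothetical Named class as a stand-in so that it runs without Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Demonstrates the close order of try-with-resources with stand-in resources. */
final class CloseOrderDemo {
    /** Hypothetical stand-in for a resource such as a Directory or an IndexWriter. */
    static final class Named implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Named(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("closed " + name);
        }
    }

    /** Opens two resources, does some work, and returns the recorded sequence of events. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Declared directory first, writer second: they are closed writer first, directory second
        try (Named directory = new Named("directory", log);
             Named writer = new Named("writer", log)) {
            log.add("indexing");
        }
        return log;
    }
}
```

Calling CloseOrderDemo.run() yields "indexing", "closed writer", "closed directory", which is the same writer-then-directory order that the Indexer requests from StreamUtil.close.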

Now let's write a unit test to try it out:

package cn.lym.lucene.quickstart.index;

import org.junit.Before;
import org.junit.Test;

public class IndexerTest {
	private Indexer indexer;

	@Before
	public void init() {
		this.indexer = new Indexer();
	}

	@Test
	public void testIndex() throws Exception {
		String indexDir = "E:\\Documents\\lucene-quickstart\\";
		String dataDir = "D:\\";
		this.indexer.index(indexDir, dataDir);
	}
}

After the program finishes normally, the index directory should contain files like the following:

[Figure: Lucene index files]
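If you would rather check the result programmatically than by eye, one rough heuristic is to look for a commit point: each Lucene commit writes a file whose name starts with segments_ into the index directory. The sketch below does only that check with plain java.nio; IndexDirCheck and hasCommitPoint are hypothetical names, and this is not a real validation of the index (Lucene's own CheckIndex tool is meant for that).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Heuristic check that a directory looks like it holds a committed Lucene index. */
final class IndexDirCheck {
    /** Returns true if dir contains at least one file whose name starts with "segments_". */
    static boolean hasCommitPoint(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        // Files.list returns a lazily populated stream that must be closed
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.anyMatch(p -> p.getFileName().toString().startsWith("segments_"));
        }
    }
}
```

In the unit test above, such a check could be applied to the indexDir path after indexer.index(...) returns.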

The code for this post is available at https://git.oschina.net/coding4j/lucene-quickstart
