Lucene in Action, Chapter 2, Section 2.3.1: Adding Documents to an Index

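I. Creating the index directory
Before adding anything to an index you must specify the directory in which the index is stored. There are two ways to obtain it:
1. Directory dir = FSDirectory.open(new File(indexDir)); // open the Directory on disk
2. Directory dir = new RAMDirectory(FSDirectory.open(new File(indexDir)), IOContext.READ); // copy the on-disk index into a Directory held in memory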

About the FSDirectory class:
public abstract class FSDirectory extends BaseDirectory
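Base class for Directory implementations that store index files in the file system. There are currently three core subclasses.
The class has three subclasses with different performance characteristics. You can call its open method directly to have it pick the subclass best suited to the current environment, as in option 1 above.
In general, the open method:
1. returns MMapDirectory for most Solaris and Windows 64-bit JREs;
2. returns SimpleFSDirectory, the slowest of the three, for other JREs on Windows;
3. returns NIOFSDirectory for other non-Windows JREs.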

About the RAMDirectory class:
public class RAMDirectory extends BaseDirectory
A memory-resident Directory implementation. Locking implementation is by default the SingleInstanceLockFactory but can be changed with BaseDirectory.setLockFactory(org.apache.lucene.store.LockFactory). Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
It is recommended to materialize large indexes on disk and use MMapDirectory, which is a high-performance directory implementation working directly on the file system cache of the operating system, so copying data to Java heap space is not useful.

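In short: RAMDirectory is only suitable for fairly small indexes. Once an index reaches several hundred MB it wastes resources, because its contents are held in a very large number of 1024-byte arrays that the garbage collector has to track. The class also performs poorly under concurrent access.

For large indexes, keep the index on disk and use MMapDirectory.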
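
To get a feel for the numbers behind that warning, here is a rough back-of-the-envelope sketch. It assumes only the 1024-byte buffer size quoted in the Javadoc above; the class and method names are made up for illustration, and it ignores the per-file rounding and object headers a real RAMDirectory would add.

```java
public class RamBufferCount {
    // Bytes per internal buffer, as stated in the Javadoc quoted above.
    static final int BUFFER_SIZE = 1024;

    // Number of 1024-byte arrays needed to hold indexBytes, rounded up.
    static long buffersFor(long indexBytes) {
        return (indexBytes + BUFFER_SIZE - 1) / BUFFER_SIZE;
    }

    public static void main(String[] args) {
        long[] sizesMb = {10, 500, 2048};
        for (long mb : sizesMb) {
            long bytes = mb * 1024 * 1024;
            System.out.println(mb + " MB -> " + buffersFor(bytes) + " byte[1024] arrays");
        }
    }
}
```

Under these assumptions a 500 MB index already needs about half a million small arrays, and a 2 GB index about two million, all of which stay on the Java heap.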


II. Creating the Analyzer and the IndexWriterConfig
Once the index directory exists, the next step is to create an Analyzer and an IndexWriterConfig.

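What an Analyzer does:
An Analyzer builds TokenStreams (including TokenFilters, which are themselves TokenStreams) that analyze text. The text may be the keywords a user searches for, or the fields of a document being indexed. In effect it is a policy for extracting index terms from text (Javadoc: "An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.").
Common ways to create an analyzer:
1. Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46); // standard analyzer, unigram tokenization: English is split on whitespace and punctuation, Chinese into single characters
2. Analyzer analyzer = new IKAnalyzer();   // IK analyzer for Chinese; requires the IK Analyzer jar to be downloaded separately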
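
The "unigram" behaviour described for StandardAnalyzer can be imitated with a few lines of plain Java. This is only a toy approximation, not Lucene's tokenizer: the class name UnigramSketch is made up, and the sketch ignores stop-word removal and the Unicode segmentation rules the real analyzer follows.

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramSketch {
    // Splits Latin text on non-alphanumerics (lowercased) and emits each Han character as its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || !Character.isLetterOrDigit(cp)) {
                // A Han character or a separator ends the word collected so far.
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (han) {
                    tokens.add(new String(Character.toChars(cp)));
                }
            } else {
                word.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Venice has canals 中文分词"));
        // prints [venice, has, canals, 中, 文, 分, 词]
    }
}
```

Single-character tokens are why a dedicated Chinese analyzer such as IK, which segments text into words, usually gives better results for Chinese content.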

There are many more analyzers; see the API documentation for details. You can also write your own, for example:
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new FooTokenizer(reader);
    TokenStream filter = new FooFilter(source);
    filter = new BarFilter(filter);
    return new TokenStreamComponents(source, filter);
  }
};

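What IndexWriterConfig does:
It holds the settings that the IndexWriter will use, for example:
1. indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);     // open mode: create the index if it does not exist, otherwise append documents to the existing index
2. indexWriterConfig.setRAMBufferSizeMB(10.0);     // RAM buffer size; if the index to be built is large, set this as high as is practical

See the API documentation for the other settings. Note that these setters belong to IndexWriterConfig, not to IndexWriter.
Creating an IndexWriterConfig:
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_46, analyzer); // the second argument is the Analyzer created earlier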


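III. Adding documents to the index
Once the directory and the IndexWriterConfig exist, create an IndexWriter from them and call its addDocument method to add documents.
There are two overloads:
1. addDocument(Document): adds the document using the default Analyzer, the one given when the IndexWriterConfig was created.
2. addDocument(Document, Analyzer): adds the document using the Analyzer passed as the argument instead of the default one. Be careful with this: the Analyzer used at search time must match the Analyzer used at index time.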
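
Why the analyzers must match can be shown without Lucene at all. The sketch below is a toy model under stated assumptions: lowercasing and keyword are hypothetical stand-ins for two different analyzers, and the "index" is just the set of terms produced at index time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {
    // Toy analyzer 1: lowercase the text and split it on whitespace.
    static List<String> lowercasing(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Toy analyzer 2: keep the whole input as a single, unchanged term.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // "Indexing" here just records the terms the index-time analyzer emits.
    static Set<String> index(String text) {
        return new HashSet<>(lowercasing(text));
    }

    public static void main(String[] args) {
        Set<String> terms = index("Amsterdam has lots of bridges");
        // Same analysis at search time: the query term becomes "bridges" and is found.
        System.out.println(terms.containsAll(lowercasing("Bridges"))); // prints true
        // Different analysis at search time: the term "Bridges" was never indexed.
        System.out.println(terms.containsAll(keyword("Bridges")));     // prints false
    }
}
```

The same effect applies to queries that bypass analysis entirely: a raw term only matches if it is spelled exactly as the index-time analyzer emitted it.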


Listing 2.1
package com.meiliwan.lucenelearning.learning.lia.ch2;  

import java.io.File;  
import java.io.IOException;  

import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.document.FieldType;  
import org.apache.lucene.index.DirectoryReader;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.IndexWriterConfig;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.index.IndexWriterConfig.OpenMode;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.ScoreDoc;  
import org.apache.lucene.search.TermQuery;  
import org.apache.lucene.search.TopDocs;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.store.IOContext;  
import org.apache.lucene.store.RAMDirectory;  
import org.apache.lucene.util.Version;  

public class IndexingTest {  
	private String[] ids = { "1", "2"};  
	private String[] unindexed = { "Netherlands", "Italy"};  
	private String[] unstored = { "Amsterdam has lots of bridges", "Venice has lots of canals"};  

	private String[] text = { "Amsterdam", "Venice"};  

	private Directory directory;      

	public IndexingTest(String indexDir){  
		try {  
			if(indexDir == null){
				System.out.println("Index directory is null, exiting");
				System.exit(1);
			}

			File indexPath = new File(indexDir);

			if(!indexPath.exists()){
				indexPath.mkdirs();
			}

			if(!indexPath.isDirectory()){
				System.out.println("Not a directory");
				System.exit(2);
			}

			directory = FSDirectory.open(indexPath);

		} catch (IOException e1) {  
			e1.printStackTrace();  
		}      

	}  

	public void buildIndex(){  
		IndexWriter indexWriter = getWriter();  

		if(indexWriter == null){  
			return;  
		}  

		try {  
			for(int i = 0; i < ids.length; i++){
				Document document = new Document();

				//document.add(new StringField("id", ids[i], Store.YES)); // indexed and stored, not tokenized
				FieldType idFieldTypeType = new FieldType();
				idFieldTypeType.setStored(true);      // stored, so the value can be shown to the user
				idFieldTypeType.setIndexed(true);     // indexed, so the id can be used in searches
				idFieldTypeType.setTokenized(false);  // not tokenized: the id value is kept whole as a single term
				document.add(new Field("id", ids[i], idFieldTypeType));

				FieldType contryFieldType = new FieldType();
				contryFieldType.setStored(true);
				contryFieldType.setIndexed(false);
				contryFieldType.setTokenized(false);
				document.add(new Field("contry", unindexed[i], contryFieldType));

				FieldType contentsFieldType = new FieldType();
				contentsFieldType.setStored(false);
				contentsFieldType.setIndexed(true);
				contentsFieldType.setTokenized(true);
				document.add(new Field("contents", unstored[i], contentsFieldType));

				//document.add(new TextField("city", text[i], Store.YES)); // indexed and stored
				//document.add(new TextField("city", new StringReader(text[i]))); // note: this creates a Store.NO field
				FieldType cityFieldType = new FieldType();
				cityFieldType.setStored(true);
				cityFieldType.setIndexed(true);
				cityFieldType.setTokenized(true);
				document.add(new Field("city", text[i], cityFieldType));

				indexWriter.addDocument(document);
			}
			indexWriter.commit();
			System.out.println("Index created successfully");
		} catch (Exception e) {
			e.printStackTrace(); // report the failure instead of silently swallowing it
		}finally{
			closeWriter(indexWriter);  
		}         
	}  

	public void search(String fieldName, String queryText){  
		IndexReader indexReader = getReader();  

		if(indexReader == null){  
			return;  
		}  

		IndexSearcher searcher = new IndexSearcher(indexReader);  

		Term term = new Term(fieldName, queryText);  
		Query query = new TermQuery(term);  

		try {  
			TopDocs topDocs = searcher.search(query, 10);  

			ScoreDoc[] scoreDocs = topDocs.scoreDocs;  

			System.out.println("Hit " + scoreDocs.length + " document(s)");
			for(int i = 0; i < scoreDocs.length; i++){
				int docId = scoreDocs[i].doc;    // internal id of each matching document
				Document document = searcher.doc(docId);
				System.out.println("id: " + document.get("id"));
				System.out.println("contry: " + document.get("contry"));
				System.out.println("contents: " + document.get("contents"));
				System.out.println("city: " + document.get("city"));
				System.out.println("score: " + scoreDocs[i].score);
			}
			}  
		} catch (IOException e) {  
			// TODO Auto-generated catch block  
			e.printStackTrace();  
		}finally{  
			closeReader(indexReader);  
		}      
	}  

	private IndexWriter getWriter(){  
		IndexWriter indexWriter = null;  
		try{  
			IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46,
					new StandardAnalyzer(Version.LUCENE_46));
			config.setOpenMode(OpenMode.CREATE_OR_APPEND);
			config.setRAMBufferSizeMB(5.0);

			indexWriter = new IndexWriter(directory, config);  
		}catch(IOException e){  
			e.printStackTrace();  
		}         
		return indexWriter;  

	}  

	private IndexReader getReader(){  
		IndexReader indexReader = null;  
		try {  
			if(directory != null){  
				Directory dir = new RAMDirectory(directory, IOContext.READ);  

				indexReader = DirectoryReader.open(dir);  
			}             
		} catch (Exception e) {
			e.printStackTrace(); // otherwise a missing or unreadable index fails silently and search() prints nothing
		}

		return indexReader;  
	}  

	private void closeWriter(IndexWriter writer){  
		try {  
			if(writer != null){  
				writer.close();  
			}  
		} catch (Exception e) {  
			// TODO: handle exception  
		}  
	}  

	private void closeReader(IndexReader reader){  
		try {  
			if(reader != null){  
				reader.close();  
			}  
		} catch (Exception e) {  
			// TODO: handle exception  
		}  
	}  

	public void close(IndexWriter indexWriter, IndexReader indexReader){  
		try {  
			if( indexWriter != null){  
				indexWriter.close(); // close() commits all pending updates to the index first
			}  

			if( indexReader != null){  
				indexReader.close();  
			}  

			if( directory != null){  
				directory.close();  
			}  
		} catch (Exception e) {  
			// TODO: handle exception  
		}  
	}  


	public void close(){  
		close(null, null);  
	}  

	public static void main(String[] args){  
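		IndexingTest indexingTest = new IndexingTest("E:" + File.separator + "lia-index" + File.separator + "ch2-demo1");
		//indexingTest.buildIndex();  // uncomment for the first run; with CREATE_OR_APPEND every further run appends the same documents again

		indexingTest.search("contents", "bridges");  // other arguments work too, e.g. "id", "2"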

		indexingTest.close();  
	}  

}  



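Search result:
Hit 1 document(s)
id: 1
contry: Netherlands
contents: null
city: Amsterdam
score: 0.5

Because the contents field was created with setStored(false), its original value is not kept in the index. The field can still be searched, but document.get("contents") returns null.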
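
The interplay of the three flags can be mimicked with a toy in-memory index. This is a simplified model for illustration only, not how Lucene stores data: FieldFlagsSketch and its methods are made-up names, tokenizing is reduced to lowercasing and splitting on whitespace, and scoring is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldFlagsSketch {
    // "field:term" -> ids of the documents containing that term (the searchable part).
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Per document: field name -> original value (the retrievable part).
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    int newDocument() {
        storedFields.add(new HashMap<String, String>());
        return storedFields.size() - 1;
    }

    void addField(int doc, String name, String value,
                  boolean stored, boolean indexed, boolean tokenized) {
        if (stored) {
            storedFields.get(doc).put(name, value);
        }
        if (!indexed) {
            return;
        }
        String[] terms = tokenized
                ? value.toLowerCase(Locale.ROOT).split("\\s+")
                : new String[] { value };
        for (String term : terms) {
            postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<Integer>()).add(doc);
        }
    }

    Set<Integer> search(String field, String term) {
        Set<Integer> hits = postings.get(field + ":" + term);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    String get(int doc, String field) {
        return storedFields.get(doc).get(field);
    }

    public static void main(String[] args) {
        FieldFlagsSketch index = new FieldFlagsSketch();
        int doc = index.newDocument();
        index.addField(doc, "id", "1", true, true, false);
        index.addField(doc, "contry", "Netherlands", true, false, false);
        index.addField(doc, "contents", "Amsterdam has lots of bridges", false, true, true);

        for (int hit : index.search("contents", "bridges")) {
            System.out.println("contry: " + index.get(hit, "contry"));     // stored, so the value comes back
            System.out.println("contents: " + index.get(hit, "contents")); // not stored, so this is null
        }
        // Stored but not indexed: the value can be shown, yet never matched by a search.
        System.out.println(index.search("contry", "Netherlands").isEmpty());
    }
}
```

In this model "stored" and "indexed" are two separate structures, which is why a field can be searchable without being retrievable and the other way round.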

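Luke is recommended for inspecting the index you have created. The tool is larger than 10 MB, so it could not be uploaded here; search for it and download it yourself.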


Reposted from omglion.iteye.com/blog/2010458