Lucene Basic API Components

An understanding of Lucene's basic API components.

To make comparison and learning easier, the table below lists the API components used during indexing next to those used during retrieval.

API components used during indexing    API components used during retrieval
IndexWriter                            IndexReader
IndexWriterConfig                      IndexSearcher
Directory                              Directory
Analyzer                               QueryParser or Query subclass
Document                               TopDocs
Field                                  ScoreDoc
--                                     Term



The author will go through the components above one by one, starting with the classes used during the indexing process.

1. IndexWriter is the core class in the indexing process. It is mainly responsible for creating an index or opening an existing one, and for providing operations such as adding, deleting, and updating documents.
2. IndexWriterConfig is a configuration class that does not exist in older versions of Lucene, and it is an important one. Its constructor takes two parameters: the first is the current Lucene version number, and the second is the analyzer used for indexing. (From Lucene 5.0 on, the constructor takes only the analyzer.) Beyond this most common use, it provides a large number of settings, such as the size of the in-memory buffer, the number of documents buffered before a flush, thread-related settings, the open (create or append) mode, and whether to use the compound file format, so you can do some basic tuning of the index through it.
3. Directory represents the storage location of a Lucene index. It is an abstract class with a series of subclasses for handling index storage. The choice of subclass can have a great impact on system performance, but in essence improving performance comes down to trading space for time or time for space. In practice, we use one of its subclasses to open the path where the index is stored and then pass it to the IndexWriter constructor.
4. Analyzer is the base class of all analyzers. Before text is indexed, it must be processed by an analyzer, which breaks it into lexical units (tokens) in a unified format, extracting the useful information and filtering out stop words. Lucene ships with several analyzers, but most of them target English or other European languages. For Chinese, you can use the SmartCN analyzer that comes with Lucene, or open-source tokenizers such as IK and mmseg4j. Choosing an analyzer is an important step in the indexing process, and the right choice depends on your business needs.
5. Document represents a single document, similar to a row in a database table. We add the fields we want to the document and then index the documents one by one so that they can be retrieved.
6. Field is a field stored in a document. Each field has a name and a value, similar to a column name and value in a database. We can use Field to control precisely how each value is handled. The two most commonly used are StringField, which does not tokenize its value, and TextField, which does. There are other Field types as well, which are not covered here.
7. IndexReader reads the index opened through a Directory subclass and is then passed to the IndexSearcher constructor to initialize the search component. In recent versions it is usually obtained with DirectoryReader.open; DirectoryReader does not exist in older versions of Lucene and was added in Lucene 4.0.
8. IndexSearcher is the core class at search time and the bridge to the index. It opens the index in read-only mode and provides a large number of functions for retrieval, sorting, filtering, and so on.
9. Both QueryParser and the Query classes can be used for retrieval. The difference is that QueryParser, which parses a query string into a Query object, is more powerful and convenient for developing custom retrieval schemes, while Query and its subclasses are the query APIs that come with Lucene. These APIs are enough for basic retrieval in most cases; if you need to customize your own retrieval scheme, use QueryParser. In practice, the TermQuery subclass of Query is the one used most often, and there are a number of other Query subclasses for specific purposes.
10. TopDocs is a simple container of pointers. It generally holds the top N retrieval results, and for each one it stores only the document's docid and score. By default, the top N results are sorted by score, from highest to lowest.
11. ScoreDoc is usually used as an array, each element of which contains only the docid of a document and its score. Unlike TopDocs, we can use this array to perform database-like paging. Of course, you must make sure you have enough memory: when paging through massive amounts of data, this approach can easily cause a memory overflow, and other methods need to be considered.
12. Term is the most basic unit of search. Similar to a Field, it takes a field name and the string to search for. It is a small but indispensable class.
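
As a minimal sketch of how the twelve classes above fit together, the example below builds a tiny index and then searches it. It assumes Lucene 8.x or 9.x on the classpath (newer releases replace IndexSearcher.doc with a stored-fields API). The class name LuceneBasicsSketch, the field names "id" and "body", the sample texts, and the temporary index directory are all made up for illustration and do not come from the original article.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical demo class; assumes Lucene 8.x or 9.x.
public class LuceneBasicsSketch {

    // Indexing side: Analyzer, IndexWriterConfig, IndexWriter, Document, Field.
    static void buildIndex(Directory directory) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        // CREATE overwrites any index that already exists in the directory.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            // Made-up sample rows: {id, body}.
            String[][] rows = {
                {"1", "Lucene is a search library"},
                {"2", "A database stores rows"}
            };
            for (String[] row : rows) {
                Document doc = new Document();
                // StringField: indexed as a single token, not tokenized.
                doc.add(new StringField("id", row[0], Field.Store.YES));
                // TextField: run through the analyzer and tokenized.
                doc.add(new TextField("body", row[1], Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

    // Retrieval side: IndexReader, IndexSearcher, Query, Term, TopDocs, ScoreDoc.
    static List<String> search(Directory directory, String word) throws IOException {
        try (IndexReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("body", word));
            TopDocs topDocs = searcher.search(query, 10); // top 10 hits by score
            List<String> ids = new ArrayList<>();
            for (ScoreDoc hit : topDocs.scoreDocs) {
                // hit.doc is the docid; load the stored "id" field for it.
                ids.add(searcher.doc(hit.doc).get("id"));
            }
            return ids;
        }
    }

    public static void main(String[] args) throws IOException {
        Path indexPath = Files.createTempDirectory("lucene-sketch");
        try (Directory directory = FSDirectory.open(indexPath)) {
            buildIndex(directory);
            System.out.println(search(directory, "lucene"));
        }
    }
}
```

Because TermQuery does not run the analyzer on its term, the search word has to match the lowercased token that StandardAnalyzer produced at index time, which is why the sketch searches for "lucene" rather than "Lucene".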



So far, the author has briefly gone through the basic, commonly used classes of Lucene. In most cases we may know how to use them without knowing the concepts behind them. The author believes that really understanding these concepts makes development, and discussions with colleagues, much easier.
