Lucene: Understanding the indexing process

INDEX SEGMENTS

Every Lucene index consists of one or more segments. Each segment is a standalone index, holding a subset of all indexed documents. A new segment is created whenever the writer flushes buffered added documents and pending deletions into the directory. At search time, each segment is visited separately and the results are combined.

Each segment, in turn, consists of multiple files of the form _X.<ext>, where X is the segment’s name and <ext> is the extension that identifies which part of the index that file corresponds to. There are separate files to hold the different parts of the index (term vectors, stored fields, inverted index, and so on).

There’s one special file, referred to as the segments file and named segments_<N>, that references all live segments. This file is important: Lucene first opens it, and then opens each segment it references. The value <N>, called “the generation,” is an integer that increases by one every time a change is committed to the index.

Naturally, over time the index will accumulate many segments, especially if you open and close your writer frequently. This is fine. Periodically, IndexWriter selects segments and coalesces them by merging them into a single new segment and then removing the old segments. The selection of segments to merge is governed by a separate MergePolicy. Once merges are selected, the MergeScheduler executes them.
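
The segment model described above can be sketched with a small toy in plain Java. This is not Lucene code: the class ToySegmentedIndex and all of its methods are hypothetical, and a "segment" here is just a list of document strings. The sketch only illustrates three points from the text: each flush creates a new segment, a search visits every segment and combines the results, and a merge reduces the number of segments without changing what a search finds.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a hypothetical class, not part of Lucene.
class ToySegmentedIndex {
    // Each inner list stands in for one segment holding a subset of the documents.
    private final List<List<String>> segments = new ArrayList<>();
    // Documents added since the last flush.
    private List<String> buffer = new ArrayList<>();

    void addDocument(String text) {
        buffer.add(text);
    }

    // Flushing turns the buffered documents into a new segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(buffer);
            buffer = new ArrayList<>();
        }
    }

    // Visit each segment separately and combine the matches.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(word)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    // Coalesce all segments into a single new segment and drop the old ones.
    void mergeAll() {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(merged);
        }
    }

    int segmentCount() {
        return segments.size();
    }
}
```

In the real library, which segments get merged and when is decided by the MergePolicy and MergeScheduler; the toy merges everything at once purely for brevity.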

Adding documents to an index

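import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// getWriter() and the ids, unindexed, unstored, and text arrays are assumed to be defined elsewhere.
IndexWriter writer = getWriter();
for (int i = 0; i < ids.length; i++) {
  Document doc = new Document();

  // Stored and indexed as a single token, so it can be matched exactly.
  doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
  // Stored only; not searchable.
  doc.add(new Field("country", unindexed[i], Field.Store.YES, Field.Index.NO));
  // Searchable through the analyzer, but the original text is not stored.
  doc.add(new Field("contents", unstored[i], Field.Store.NO, Field.Index.ANALYZED));
  // Both stored and analyzed.
  doc.add(new Field("city", text[i], Field.Store.YES, Field.Index.ANALYZED));

  writer.addDocument(doc);
}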

Deleting documents from an index

writer.deleteDocuments(new Term("id", "1"));

Updating documents in the index

updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.
updateDocument(Term, Document, Analyzer) does the same but uses the provided analyzer instead of the writer’s default analyzer.
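
The delete-then-add behavior can be illustrated with a small stand-in written in plain Java. ToyWriter below is a hypothetical class, not Lucene's IndexWriter: documents are simple field-to-value maps, a "term" is a field name plus an exact value, and analyzers are left out entirely. It is only a sketch of the semantics described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy model only: a hypothetical class, not Lucene's IndexWriter.
class ToyWriter {
    // Each document is a map from field name to value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Remove every document whose field holds exactly the given value.
    void deleteDocuments(String field, String value) {
        Iterator<Map<String, String>> it = docs.iterator();
        while (it.hasNext()) {
            if (value.equals(it.next().get(field))) {
                it.remove();
            }
        }
    }

    // Update = delete all documents matching the term, then add the new one.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    // Return all documents whose field holds exactly the given value.
    List<Map<String, String>> find(String field, String value) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(field))) {
                result.add(doc);
            }
        }
        return result;
    }

    int numDocs() {
        return docs.size();
    }
}
```

Because the delete matches on a term, an update replaces a single document only if that term identifies exactly one document, which is why a unique field such as "id" is the usual choice.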

------------------------------------------------------------------------------------------------------------------------------------

Field options

Field is perhaps the most important class when indexing documents: it’s the actual class that holds each value to be indexed. When you create a field, you can specify numerous options to control what Lucene should do with that field once you add the document to the index.

Field options for indexing

The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index.

Index.ANALYZED: Use the analyzer to break the field’s value into a stream of separate tokens and make each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).
Index.NOT_ANALYZED: Do index the field, but don’t analyze the String value. Instead, treat the Field’s entire value as a single token and make that token searchable. This option is useful for fields that you’d like to search on but that shouldn’t be broken up, such as URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers. This option is especially useful for enabling “exact match” searching.
Index.ANALYZED_NO_NORMS: A variant of Index.ANALYZED that doesn’t store norms information in the index. Norms record index-time boost information in the index but can be memory consuming when you’re searching.
Index.NOT_ANALYZED_NO_NORMS: Just like Index.NOT_ANALYZED, but also doesn’t store norms. This option is frequently used to save index space and memory usage during searching, because single-token fields don’t need the norms information unless they’re boosted.
Index.NO: Don’t make this field’s value available for searching.
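
To make the difference between the analyzed and non-analyzed options concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: ToyInvertedIndex and its Mode enum are hypothetical names, and the "analyzer" is just lowercasing plus splitting on non-alphanumeric characters, which is far simpler than any real Lucene analyzer. Norms are not modeled at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not Lucene's Field.Index.
class ToyInvertedIndex {
    enum Mode { ANALYZED, NOT_ANALYZED, NO }

    // Maps "field:token" to the ids of the documents containing that token.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Produce the tokens that would be made searchable for a value.
    static List<String> tokens(String value, Mode mode) {
        List<String> result = new ArrayList<>();
        if (mode == Mode.NOT_ANALYZED) {
            // The entire value becomes a single token.
            result.add(value);
        } else if (mode == Mode.ANALYZED) {
            // Stand-in analyzer: lowercase, then split on non-alphanumerics.
            for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) {
                    result.add(t);
                }
            }
        }
        // Mode.NO yields no tokens, so nothing is searchable.
        return result;
    }

    void add(int docId, String field, String value, Mode mode) {
        for (String token : tokens(value, mode)) {
            String key = field + ":" + token;
            Set<Integer> ids = postings.get(key);
            if (ids == null) {
                ids = new TreeSet<>();
                postings.put(key, ids);
            }
            ids.add(docId);
        }
    }

    Set<Integer> search(String field, String term) {
        Set<Integer> ids = postings.get(field + ":" + term);
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }
}
```

With this toy, a URL added as NOT_ANALYZED is found only by its exact full value, while a title added as ANALYZED is found by any of its individual lowercased words but not by the full original string.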

Field options for storing fields

The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:

Store.YES: Stores the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader. This option is useful for fields that you’d like to use when displaying the search results (such as a URL, title, or database primary key). If index size is a concern, try not to store very large fields, because stored fields consume space in the index.
Store.NO: Doesn’t store the value. This option is often used along with Index.ANALYZED to index a large text field that doesn’t need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.
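
Storing and indexing are independent decisions, which the following plain-Java sketch tries to show. ToyStoredFields is a hypothetical class, not a Lucene API: it keeps one structure for searchable values and a separate one for retrievable values, and for brevity it indexes each value as a single token without any analysis.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a hypothetical class, not part of Lucene.
class ToyStoredFields {
    // Searchable side: maps "field:value" to matching document ids.
    private final Map<String, Set<Integer>> indexed = new HashMap<>();
    // Retrievable side: maps document id to its stored field values.
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String field, String value, boolean store, boolean index) {
        if (store) {
            Map<String, String> fields = stored.get(docId);
            if (fields == null) {
                fields = new HashMap<>();
                stored.put(docId, fields);
            }
            fields.put(field, value);
        }
        if (index) {
            String key = field + ":" + value;
            Set<Integer> ids = indexed.get(key);
            if (ids == null) {
                ids = new HashSet<>();
                indexed.put(key, ids);
            }
            ids.add(docId);
        }
    }

    // Returns the original value, or null if the field was not stored.
    String storedValue(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    // True if the document can be found by searching this field for the value.
    boolean matches(int docId, String field, String value) {
        Set<Integer> ids = indexed.get(field + ":" + value);
        return ids != null && ids.contains(docId);
    }
}
```

In terms of the earlier example, "country" (stored, not indexed) can be displayed but not searched, while "contents" (indexed, not stored) can be searched but its original text cannot be read back.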

Field options for term vectors

Sometimes when you index a document you’d like to retrieve all its unique terms at search time.

One common use is to speed up highlighting the matched tokens in stored fields.
Another use is to enable a link, “Find similar documents,” that when clicked runs a new search using the salient terms in an original document.
Yet another example is automatic categorization of documents.

The term vector options (Field.TermVector.*) control how much information is recorded for each term:

TermVector.YES: Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information.
TermVector.WITH_POSITIONS: Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS: Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS: Stores unique terms and their counts, along with positions and offsets.
TermVector.NO: Doesn’t store any term vector information.
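
The kinds of information named in these options (unique terms, counts, positions, offsets) can be computed for a single piece of text with a few lines of plain Java. The sketch below is an illustration under assumptions, not Lucene's implementation: ToyTermVector is a hypothetical class, tokens are runs of letters and digits lowercased, positions are zero-based token indexes, and each offset is a start (inclusive) and end (exclusive) character index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: a hypothetical class, not Lucene's term vector format.
class ToyTermVector {
    static class Entry {
        int count;                                          // occurrences of the term
        final List<Integer> positions = new ArrayList<>();  // token index of each occurrence
        final List<int[]> offsets = new ArrayList<>();      // {start, end} character offsets
    }

    // Build the full "positions and offsets" view of one document's text.
    static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new TreeMap<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            Entry e = vector.get(term);
            if (e == null) {
                e = new Entry();
                vector.put(term, e);
            }
            e.count++;
            e.positions.add(position++);
            e.offsets.add(new int[] { m.start(), m.end() });
        }
        return vector;
    }
}
```

Read against the options above: the keys and counts alone correspond to what TermVector.YES records, the positions list is what WITH_POSITIONS adds, the offsets list is what WITH_OFFSETS adds, and WITH_POSITIONS_OFFSETS keeps both. Offsets are the part a highlighter needs in order to find the matched characters in the stored text.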

Field options for sorting

When returning documents that match a search, Lucene orders them by their score by default. Sometimes, you need to order results using other criteria. For instance, if you’re searching email messages, you may want to order results by sent or received date, or perhaps by message size or sender.

Fields used for sorting must be indexed and must contain one token per document. Typically this means using Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS (if you’re not boosting documents or fields). If your analyzer always produces exactly one token, as KeywordAnalyzer does, Field.Index.ANALYZED or Field.Index.ANALYZED_NO_NORMS will work as well.
