Lucene's inverted index

Lucene's inverted index: find records based on the value of an attribute (the keyword).
The index is built in the following steps, using two articles as an example:
  Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
  Article 2: He once lived in Shanghai.
1. Extract keywords
  1.1 Word segmentation: split on spaces
  1.2 Filter out words with no real meaning, and filter out punctuation
    "in", "once" and "too" carry no specific meaning, so they are filtered out along with the punctuation
  1.3 Unify case and unify word forms (past tense "-ed", present tense, future tense)
    "lived" and "lives" are converted to "live"
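  Final result (applying the rules above):
    Article 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]
    Article 2: [he] [live] [shanghai]
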
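The keyword extraction steps above could be sketched roughly as follows. This is a minimal illustration, not Lucene's own analyzer: the names `STOP_WORDS`, `BASE_FORMS` and `extract_keywords` are hypothetical, the stop-word list contains only the three words the notes mention, and a tiny lookup table stands in for real stemming.

```python
import re

# Assumption: only the three words named in the notes are treated as stop words.
STOP_WORDS = {"in", "once", "too"}
# Assumption: a toy lookup table standing in for a real stemmer.
BASE_FORMS = {"lives": "live", "lived": "live"}

def extract_keywords(text):
    """Split into words, lowercase, drop stop words, unify word forms."""
    # Keeping only runs of letters also discards punctuation.
    words = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for word in words:
        word = word.lower()
        if word in STOP_WORDS:
            continue
        keywords.append(BASE_FORMS.get(word, word))
    return keywords

article_1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article_2 = "He once lived in Shanghai."
print(extract_keywords(article_1))
print(extract_keywords(article_2))
```
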
2. Build the inverted index
  2.1 Keywords (sorted alphabetically, so that a keyword can be located quickly with binary search)
  2.2 Frequency of occurrence
  2.3 Position of occurrence (character position or keyword position; Lucene uses the keyword position)
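As a rough sketch of step 2, the snippet below builds an in-memory inverted index that records, per keyword and per article, the frequency and the keyword positions. The function name `build_inverted_index` is hypothetical, the keyword lists are assumed to have already gone through step 1, and positions are counted from 1 as an arbitrary choice for the illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps article number -> list of keywords (already normalized).

    Returns {keyword: {article number: (frequency, [keyword positions])}},
    with the keywords in alphabetical order.
    """
    postings = defaultdict(dict)
    for doc_id, keywords in docs.items():
        for position, keyword in enumerate(keywords, start=1):
            postings[keyword].setdefault(doc_id, []).append(position)
    # Sort the keywords so that they can later be located by binary search.
    return {
        keyword: {doc_id: (len(pos), pos) for doc_id, pos in by_doc.items()}
        for keyword, by_doc in sorted(postings.items())
    }

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)
for keyword, entry in index.items():
    print(keyword, entry)
```

In this sketch, "live" ends up with two occurrences in article 1 (keyword positions 2 and 5) and one occurrence in article 2 (keyword position 2).
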

  
3. Lucene stores the above as three separate files:
  3.1 Dictionary file (it also holds pointers into the frequency file and the position file)
  3.2 Frequency file
  3.3 Position file

  These correspond to the keyword, the frequency and the position of occurrence, respectively.

  3.4 Lucene also uses the concept of a field to express where a piece of information appears (for example title, article body, URL). Field information is recorded in the dictionary file: every keyword carries field information, because every keyword must belong to one or more fields.
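The relationship between the three files can be pictured with the toy model below. It is only an illustration under loud assumptions: plain Python lists stand in for the files, the entry layout is invented for readability and is not Lucene's on-disk format, and the "title" entry is made-up sample data. The names `add_term` and `read_term` are hypothetical.

```python
# Assumption: flat lists stand in for the frequency file and the position file.
frequency_file = []  # entries: (article number, frequency)
position_file = []   # keyword positions, concatenated term by term

# Each dictionary entry: (field, keyword, freq_offset, doc_count, pos_offset),
# where the two offsets play the role of the pointers into the other files.
dictionary = []

def add_term(field, keyword, postings):
    """postings maps article number -> list of keyword positions."""
    freq_offset = len(frequency_file)
    pos_offset = len(position_file)
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        frequency_file.append((doc_id, len(positions)))
        position_file.extend(positions)
    dictionary.append((field, keyword, freq_offset, len(postings), pos_offset))

def read_term(field, keyword):
    """Follow the dictionary pointers back into the other two 'files'."""
    for f, k, freq_offset, doc_count, pos_offset in dictionary:
        if (f, k) == (field, keyword):
            result = {}
            entries = frequency_file[freq_offset:freq_offset + doc_count]
            for doc_id, freq in entries:
                result[doc_id] = position_file[pos_offset:pos_offset + freq]
                pos_offset += freq
            return result
    return None

add_term("content", "guangzhou", {1: [3, 6]})
add_term("content", "live", {1: [2, 5], 2: [2]})
add_term("title", "live", {2: [1]})  # hypothetical title data
print(read_term("content", "live"))
```

The same keyword appears twice in the dictionary here, once per field, which is one simple way to model "every keyword belongs to one or more fields".
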

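4. Compression algorithms
  4.1 Keyword compression
  Each keyword is stored as <shared prefix length, suffix> relative to the previous keyword. For example:
    the previous word is "阿拉伯" (Arabia)
    the current word is "阿拉伯语" (Arabic)
    so the current word can be compressed into <3, 语> (3 shared characters plus the suffix "语", "language")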
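The prefix compression idea can be sketched as below. The function names are hypothetical, and the example assumes the two words in question are the Chinese "阿拉伯" (Arabia) and "阿拉伯语" (Arabic), whose shared prefix is three characters long, which is consistent with the pair <3, ...> given above.

```python
def compress_terms(sorted_terms):
    """Encode each term as (length of prefix shared with the previous term, suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded

def decompress_terms(encoded):
    """Rebuild the full terms from the (shared length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms

terms = ["阿拉伯", "阿拉伯语"]
encoded = compress_terms(terms)
print(encoded)
print(decompress_terms(encoded))
```

The first term has no predecessor, so it is stored in full with a shared length of 0; the scheme pays off because alphabetically sorted keywords tend to share long prefixes with their neighbours.
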

  4.2 Number compression
  A number is stored as the difference between it and the previous value. For example:
    previous article number: 16382
    current article number: 16389
    after compression, only 7 is stored (a single byte)
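A minimal sketch of this difference (delta) encoding, using the two article numbers from the example. The helper names are hypothetical, and `bytes_needed` simply counts whole bytes of the plain binary value; it is an assumption for illustration and not Lucene's actual integer encoding.

```python
def delta_encode(doc_ids):
    """Keep the first number as-is, then store each difference from the previous one."""
    deltas = []
    previous = 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas

def delta_decode(deltas):
    """Rebuild the original numbers by accumulating the differences."""
    doc_ids = []
    current = 0
    for delta in deltas:
        current += delta
        doc_ids.append(current)
    return doc_ids

def bytes_needed(n):
    """Whole bytes needed for the plain binary value of n (illustrative only)."""
    return max(1, (n.bit_length() + 7) // 8)

doc_ids = [16382, 16389]
deltas = delta_encode(doc_ids)
print(deltas)
print(bytes_needed(16389), bytes_needed(7))
```

Because article numbers in a posting list are stored in increasing order, the differences stay small even when the absolute numbers grow large.
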

5. Query scenario:
  Lucene first runs a binary search over the dictionary to find the word, then follows the pointers to the frequency and position files to read out the documents in which it occurs. The dictionary is very small, so the whole process takes only milliseconds. An ordinary sequential matching algorithm, which builds no index, is considerably slower.
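To make the contrast concrete, here is a small sketch of both lookups. It assumes a sorted in-memory keyword list with a parallel list of article numbers; `lookup` and `sequential_scan` are hypothetical names, and the scan is a naive substring match rather than anything Lucene does.

```python
from bisect import bisect_left

# Assumption: the sorted dictionary and a parallel list of article numbers.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]
postings = [[1], [2], [1], [1, 2], [2], [1]]

def lookup(keyword):
    """Binary search over the sorted dictionary: O(log n) comparisons."""
    i = bisect_left(terms, keyword)
    if i < len(terms) and terms[i] == keyword:
        return postings[i]
    return []

def sequential_scan(articles, word):
    """No index: string-match against the full text of every article."""
    return [doc_id for doc_id, text in articles.items() if word in text.lower()]

articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}
print(lookup("live"))
print(sequential_scan(articles, "live"))
```

With two articles both approaches are instantaneous; the difference only shows as the collection grows, since the scan touches every character of every article while the dictionary search touches a handful of keywords.
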

 

Origin www.cnblogs.com/glblog/p/11897794.html