Lucene dictionary implementation principles (repost)

Original: https://www.cnblogs.com/LBSer/p/4119841.html

1 Lucene dictionary

      Querying with Lucene inevitably relies on the dictionary function it provides: given a term, find the corresponding inverted list of document ids. In the Lucene index, the files with the .tip and .tim suffixes implement this dictionary function.

      How might such a dictionary be implemented? The first idea that comes to mind is a sorted array: the dictionary is an array of terms sorted alphabetically, and each entry stores a term together with its inverted list of document ids. When the index is loaded, the term array is read into memory and lookups use binary search. Query time is O(log N), where N is the number of terms, and space is O(N * len(term)). The drawback of the sorted array is memory consumption: every term must be stored in full, so once the number of terms reaches the millions the memory footprint becomes unacceptable.
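The sorted-array idea can be sketched in a few lines of Java. This is only an illustration of the approach described above, not Lucene code; the class name SortedTermDictionary and the sample document ids are made up for the example.

```java
import java.util.Arrays;

public class SortedTermDictionary {
    private final String[] terms;    // terms, sorted alphabetically
    private final int[][] postings;  // postings[i] holds the document ids for terms[i]

    public SortedTermDictionary(String[] sortedTerms, int[][] postings) {
        this.terms = sortedTerms;
        this.postings = postings;
    }

    /** Returns the document ids for the term, or null if the term is absent. */
    public int[] lookup(String term) {
        // Binary search: O(log N) string comparisons over the sorted term array.
        int idx = Arrays.binarySearch(terms, term);
        return idx >= 0 ? postings[idx] : null;
    }

    public static void main(String[] args) {
        String[] terms = {"cat", "deep", "do", "dog", "dogs"};
        int[][] postings = {{1, 4}, {2}, {3, 5}, {1}, {2, 6}}; // hypothetical doc ids
        SortedTermDictionary dict = new SortedTermDictionary(terms, postings);
        System.out.println(Arrays.toString(dict.lookup("dog"))); // [1]
        System.out.println(dict.lookup("door") == null);         // true
    }
}
```

Every term is held as a full String here, which is exactly the memory cost the paragraph above complains about.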

2 Common dictionary data structures

Many data structures can serve as a dictionary; they are summarized below.

Data structure | Advantages and disadvantages
Sorted list (Array / List) | Looked up with binary search; unbalanced
HashMap / TreeMap | High performance, but large memory consumption, almost three times the original data
Skip List | Finds words quickly; implemented in Lucene, Redis, HBase and others. Compared with TreeMap-like structures it is particularly suitable for high-concurrency scenarios ( Skip List description )
Trie | Suitable for English dictionaries; if the system holds many strings that share essentially no common prefix, the trie consumes a lot of memory ( a trie data structures )
Double Array Trie | Suitable for Chinese dictionaries; small memory footprint; many word segmentation tools use this algorithm ( deep double array Trie )
Ternary Search Tree | Each node has three children; combines space savings with fast queries ( Ternary Search Tree )
Finite State Transducers (FST) | A finite state transducer; Lucene 4 includes an open-source implementation and uses it extensively
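As a small illustration of two rows of the table, the sketch below uses the JDK's TreeMap (a red-black tree, not thread-safe) and ConcurrentSkipListMap (a skip list designed for concurrent access) as term dictionaries. The class name MapDictionaries and the integer values are invented for the example; it says nothing about how Lucene, Redis or HBase implement their own skip lists.

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class MapDictionaries {
    /** Fills any sorted map with the sample terms; the values are hypothetical ordinals. */
    public static void fill(NavigableMap<String, Integer> dict) {
        String[] terms = {"cat", "deep", "do", "dog", "dogs"};
        for (int i = 0; i < terms.length; i++) {
            dict.put(terms[i], i);
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> tree = new TreeMap<>();               // balanced tree
        NavigableMap<String, Integer> skip = new ConcurrentSkipListMap<>(); // skip list
        fill(tree);
        fill(skip);
        System.out.println(tree.get("dog")); // 3
        System.out.println(skip.get("dog")); // 3
        // Both keep keys sorted, so a prefix range query is also possible.
        System.out.println(skip.subMap("do", true, "do\uffff", true).keySet()); // [do, dog, dogs]
    }
}
```

Both maps store every key as a full object plus per-entry node overhead, which is where the memory cost noted in the table comes from.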

 

3 A brief introduction to FST principles

     Starting with version 4, Lucene makes extensive use of the FST (Finite State Transducer) data structure. An FST has two advantages: 1) small footprint: it compresses storage by sharing the prefixes and suffixes of the words in the dictionary; 2) fast queries: lookup time is O(len(str)).

     The following briefly describes how an FST is constructed (demo tool: http://examples.mikemccandless.com/fst.py?terms=&cmd=Build+it%21 ). We insert the five words "cat", "deep", "do", "dog" and "dogs" to build the FST (note: the input must be sorted).

1) Insert "cat"

     Insert "cat": each letter forms an edge, and the t edge points to the end state.

2) Insert "deep"

     Find the longest common prefix with the previous word "cat". There is none, so the word is inserted directly, and the p edge points to the end state.

3) Insert "do"

     Find the longest common prefix with the previous word "deep". It is "d", so a new edge o is added after the d edge, and the o edge points to the end state.

4) Insert "dog"

     Find the longest common prefix with the previous word "do". It is "do", so a new edge g is added after the o edge, and the g edge points to the end state.

5) Insert "dogs"

     Find the longest common prefix with the previous word "dog". It is "dog", so a new edge s is added after the g edge, and the s edge points to the end state.

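     In the end we obtain a directed acyclic graph. This structure makes lookups easy: given a term such as "dog", we can readily check whether it exists. During construction we can even associate each word with a number or another word, which gives us a key-value mapping.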
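The insertion steps above can be sketched in Java as follows. This is a deliberately simplified illustration, not Lucene's implementation: it only shares prefixes (so it is effectively a trie built from sorted input), whereas a real FST also shares suffixes and distributes outputs along the edges. The class name PrefixSharingSketch and its methods are hypothetical; the values are the ones used in the Lucene example later in this article.

```java
import java.util.TreeMap;

public class PrefixSharingSketch {
    /** One state; each outgoing edge is labelled with a character. */
    static final class Node {
        final TreeMap<Character, Node> edges = new TreeMap<>();
        boolean isFinal; // true if a word ends at this state
        long output;     // value associated with that word
    }

    private final Node root = new Node();
    private String previous = "";

    /** Inserts a word; words must arrive in sorted order, as in the walkthrough. */
    public void add(String word, long value) {
        if (word.compareTo(previous) < 0) {
            throw new IllegalArgumentException("input must be sorted: " + word);
        }
        // Find the longest common prefix with the previously inserted word.
        int common = 0;
        int limit = Math.min(word.length(), previous.length());
        while (common < limit && word.charAt(common) == previous.charAt(common)) {
            common++;
        }
        // Walk the edges that already exist for the shared prefix.
        Node node = root;
        for (int i = 0; i < common; i++) {
            node = node.edges.get(word.charAt(i));
        }
        // Add one new edge per remaining letter; the last edge leads to an end state.
        for (int i = common; i < word.length(); i++) {
            Node next = new Node();
            node.edges.put(word.charAt(i), next);
            node = next;
        }
        node.isFinal = true;
        node.output = value;
        previous = word;
    }

    /** Follows one edge per character, so lookup takes O(len(word)) steps. */
    public Long get(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.edges.get(word.charAt(i));
            if (node == null) {
                return null;
            }
        }
        if (!node.isFinal) {
            return null;
        }
        return node.output;
    }

    public static void main(String[] args) {
        PrefixSharingSketch dict = new PrefixSharingSketch();
        String[] words = {"cat", "deep", "do", "dog", "dogs"};
        long[] values = {5, 7, 17, 18, 21};
        for (int i = 0; i < words.length; i++) {
            dict.add(words[i], values[i]);
        }
        System.out.println(dict.get("dog")); // 18
        System.out.println(dict.get("de"));  // null: "de" is only a prefix, not a word
    }
}
```

Requiring sorted input is what lets each insertion look only at the previous word: everything before the shared prefix is already in place and never needs to be revisited.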

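4 FST usage and performance evaluation

      We can use an FST as a key-value data structure, especially in scenarios where memory overhead must be kept low. Lucene already provides an open-source FST implementation; the code below shows how to use it.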

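import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Written against the Lucene 4.x API used in the original post;
// newer Lucene releases have changed some of these signatures.
public class FstDemo {
    public static void main(String[] args) {
        try {
            // Keys must be added in sorted order; each key maps to one long value.
            String[] inputValues = {"cat", "deep", "do", "dog", "dogs"};
            long[] outputValues = {5, 7, 17, 18, 21};
            PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
            Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
            BytesRef scratchBytes = new BytesRef();
            IntsRef scratchInts = new IntsRef();
            for (int i = 0; i < inputValues.length; i++) {
                scratchBytes.copyChars(inputValues[i]);
                builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputValues[i]);
            }
            FST<Long> fst = builder.finish();
            // Look up the value stored for "dog".
            Long value = Util.get(fst, new BytesRef("dog"));
            System.out.println(value); // 18
        } catch (Exception e) {
            // Report failures instead of silently swallowing them.
            e.printStackTrace();
        }
    }
}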

   

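      FST compression ratios are typically between 3x and 20x. Compared with TreeMap/HashMap, which inflate the data to about 3 times its original size, the memory saving is 9x to 60x! (Quoted from: "Using automata as key-value storage".) But can an FST really meet performance requirements?

      I ran a simple test on an Apple laptop (i7 processor). Performance is not as good as TreeMap or HashMap, but it is still decent and sufficient for most applications.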

 

References

http://sbp810050504.blog.51cto.com/2799422/1361551

http://blog.sina.com.cn/s/blog_4bec92980101hvdd.html

http://blog.mikemccandless.com/2013/06/build-your-own-finite-state-transducer.html

http://examples.mikemccandless.com/fst.py?terms=mop%2F0%0D%0Amoth%2F1%0D%0Apop%2F2%0D%0Astar%2F3%0D%0Astop%2F4%0D%0Atop%2F5%0D%0Atqqq%2F6&cmd=Build+it%21


Origin www.cnblogs.com/ajianbeyourself/p/11260042.html