Lucene: Underlying Implementation Principles and Index Structure

Original author: SessionBest

Original address: Lucene underlying implementation principle, its index structure

1. Introduction to Lucene and Indexing Principles

  This part covers three aspects: an introduction to Lucene, indexing principles, and Lucene's index implementation.

1.1 Introduction to Lucene

  Lucene was originally developed by the famous Doug Cutting and open-sourced in 2000; it is now the first choice among open-source full-text retrieval solutions. Its characteristics can be summarized as: pure Java implementation, open source, high performance, feature-complete, and easy to extend. Feature-completeness shows in its support for word segmentation (analysis), many query types (prefix, fuzzy, regular expression, etc.), scoring and highlighting, columnar storage (DocValues), and more.
  Although Lucene has been evolving for more than ten years, it remains under active development to meet ever-growing data-analysis needs. The latest version, 6.0, introduces block k-d trees, comprehensively improving retrieval performance for numeric and geo-location data. Meanwhile, the Lucene-based distributed retrieval and analysis systems Solr and Elasticsearch are also developing in full swing; Elasticsearch is used in our projects as well.
  The overall use of Lucene is shown in the figure: 
(Figure: Lucene's role)

  The four steps, illustrated in code:

// directory, analyzer, parser and index (a path string) are assumed to be defined elsewhere
// 1. Create an IndexWriter and build an index document
IndexWriter iw = new IndexWriter(directory, new IndexWriterConfig(analyzer));
Document doc = new Document();
doc.add(new StringField("name", "Donald Trump", Field.Store.YES));
iw.addDocument(doc);                                 // 2. add it to the index

// 3. Open the index and parse the query
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Query query = parser.parse("name:trump");

// 4. Search and retrieve the top 100 document ids
TopDocs results = searcher.search(query, 100);
for (ScoreDoc hit : results.scoreDocs) {
    Document d = searcher.doc(hit.doc);              // actually fetch the document
}

Lucene is simple to use, but only by understanding the principles behind it can we use it well. General retrieval principles and Lucene's implementation details are introduced below.

1.2 Index Principle

  Full-text retrieval technology has a long history, and most of it is based on the inverted index; there have been other solutions, such as file fingerprints. The inverted index, as the name suggests, inverts the relationship "which words does this article contain": it starts from the word and records which documents the word appears in. It consists of two parts: a dictionary and inverted lists.

(Figure: dictionary and inverted lists)
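To make the idea concrete, here is a toy inverted index (a deliberately minimal sketch, nothing like Lucene's actual on-disk format):

```java
import java.util.*;

// Toy inverted index: a dictionary mapping each term to the sorted list of
// doc ids that contain it.
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);            // doc ids arrive in increasing order
            }
        }
    }

    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.addDocument(1, "lucene is a search library");
        idx.addDocument(2, "lucene uses an inverted index");
        System.out.println(idx.search("lucene"));    // [1, 2]
    }
}
```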

  Among these, the dictionary structure is particularly important. There are many kinds of dictionary structures, each with its own trade-offs. The simplest is a sorted array, searched by binary search; a hash table is faster; for disk lookup there are the B-tree and B+ tree. But an inverted-index structure that supports terabytes of data must balance time and space. The figure below lists the pros and cons of some common dictionaries:
(Figure: pros and cons of common dictionary structures)

  The viable options are: the B+ tree, the skip list, and the FST.
B+ tree:  

(Figure: MySQL InnoDB's B+ tree structure)
  • Theoretical basis: balanced multi-way search tree
  • Advantages: external storage index, updatable
  • Disadvantages: large space, not fast enough

Skip list:
(Figure: skip list structure)

  • Advantages: simple structure, with tunable skip interval and number of levels. Lucene used a skip-list dictionary before version 3.0, later replaced by the FST, but skip lists still have other uses in Lucene, such as inverted-list merging (see the sketch after this list) and document-number indexing.
  • Disadvantages: poor support for fuzzy queries
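As a concrete illustration of inverted-list merging, here is a minimal sketch of intersecting two sorted doc-id lists; where this sketch advances one id at a time, a skip list lets the search leap over whole blocks:

```java
// Toy AND-merge of two sorted inverted lists (doc-id arrays).
// Lucene accelerates the advance steps with its skip table; this sketch
// advances linearly for clarity.
static java.util.List<Integer> intersect(int[] a, int[] b) {
    java.util.List<Integer> out = new java.util.ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
        else if (a[i] < b[j]) i++;    // a skip list would jump ahead here
        else j++;
    }
    return out;
}
// intersect({2, 5, 9, 14}, {5, 8, 14, 20}) -> [5, 14]
```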

FST:

The index structure currently used by Lucene 

(Figure: FST structure)

Theoretical basis: the paper "Direct Construction of Minimal Acyclic Subsequential Transducers", which builds a minimal directed acyclic graph from sorted string input.

Advantages: low memory usage, with a compression ratio generally between 3x and 20x; good fuzzy-query support; fast lookup

Disadvantages: complex structure, input must be sorted, and updating is difficult. Lucene ships an FST implementation whose external interface resembles a Map, supporting lookup and iteration:

// Simplified sketch of the Map-like interface; Lucene's real FST is built
// through org.apache.lucene.util.fst.Builder (see below).
String[] inputs = {"abc", "abd", "acf", "acg"};  // keys (must be sorted)
long[] outputs = {1, 3, 5, 7};                   // values
FST<Long> fst = new FST<>();
for (int i = 0; i < inputs.length; i++) {
    fst.add(inputs[i], outputs[i]);
}
// lookup
Long value = fst.get("abd");                     // returns 3
// iteration
BytesRefFSTEnum<Long> iterator = new BytesRefFSTEnum<>(fst);
while (iterator.next() != null) { /* ... */ }
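For reference, here is roughly what the same example looks like against Lucene's actual FST API (org.apache.lucene.util.fst, as of the Lucene 5.x/6.x era; treat the exact signatures as approximate):

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.*;

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRefBuilder scratch = new IntsRefBuilder();
String[] keys = {"abc", "abd", "acf", "acg"};    // must be added in sorted order
long[] vals = {1, 3, 5, 7};
for (int i = 0; i < keys.length; i++) {
    builder.add(Util.toIntsRef(new BytesRef(keys[i]), scratch), vals[i]);
}
FST<Long> fst = builder.finish();

Long value = Util.get(fst, new BytesRef("abd")); // 3
BytesRefFSTEnum<Long> it = new BytesRefFSTEnum<>(fst);
BytesRefFSTEnum.InputOutput<Long> entry;
while ((entry = it.next()) != null) {
    // entry.input is the key bytes, entry.output is the value
}
```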

Performance test over 1 million entries:

Data structure        HashMap   TreeMap   FST
Build time (ms)       185       500       1512
Query all keys (ms)   106       218       890

It can be seen that FST's performance is in the same ballpark as HashMap's, but FST has one incomparable advantage: it uses very little memory, only about one tenth of a HashMap. That is essential for large-scale retrieval over big data; after all, no matter how fast a structure is, it is useless if it cannot fit in memory. A qualified dictionary structure therefore requires:

  1. Query speed. 
  2. Memory usage. 
  3. Combination of memory + disk. 

Later, when we analyze Lucene's index structure, we will focus on how the features of Lucene's FST implementation address these three points.

1.3 Lucene index implementation

After years of evolution and optimization, Lucene's index file structure is as shown in the figure. It can basically be divided into four parts: the dictionary, inverted lists, the forward file, and columnar storage (DocValues).

(Figure: Lucene index file structure)

  The following details the structure of each part:

1. Index structure

  The data structure currently used by Lucene is FST, and its characteristics are: 

  • The word search complexity is O(len(str)) 
  • Shared prefix, save space 
  • Memory stores prefix index, disk stores suffix word block 

  This matches the three elements of a dictionary structure mentioned earlier: 1. query speed; 2. memory usage; 3. combining memory and disk. Let's insert four words (abd, abe, acf, acg) into an index and look at the resulting index file contents.

(Figure: tip and tim file contents after indexing abd, abe, acf, acg)

In the tip file, each field has its own FST index, so there are multiple FSTs. Each FST stores prefixes and pointers to suffix blocks; here the prefixes are a, ab, and ac. The suffix blocks and other per-word information (inverted-list pointer, TF/DF, etc.) are stored in the tim file, and the doc file holds each word's inverted list. Retrieval therefore takes three steps:

  • Load the tip file into memory and use the FST to match the prefix, locating the suffix block's position.
  • With that position, read the suffix block from the tim file on disk and find the suffix and its corresponding inverted-list position.
  • Load the inverted list from the doc file according to that position.

Two questions arise here: first, how is the prefix computed; second, how are the suffixes written to disk and located through the FST? The following describes how Lucene builds the FST. Since the FST requires sorted input, Lucene sorts the parsed document words in advance and then constructs the FST. Assume the input is abd, abe, acf, acg; the whole construction process is as follows:

(Figure: FST construction process)

  1. When abd is inserted, there is no output.
  2. When abe is inserted, the common prefix ab is computed, but at this point we cannot know whether other words prefixed with ab will follow, so nothing is output yet.
  3. When acf is inserted, because the input is sorted, we know there will be no more words prefixed with ab, so tip and tim can be written: the suffix blocks d and e and their inverted-list positions ip_d and ip_e are written to tim, while a, b and the position of the ab-prefixed suffix block are written to tip (in reality more information, such as word frequency, is also written).
  4. When acg is inserted, the shared prefix ac with acf is computed. The input then ends and all remaining data is written to disk: the suffix blocks f and g and their corresponding inverted-list positions are written to tim, and c and the position of the ac-prefixed suffix block are written to tip.
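The following toy sketch shows the flush-when-the-prefix-shrinks rule in action (a hypothetical helper, not Lucene's actual BlockTreeTermsWriter logic):

```java
// With sorted input, once the prefix shared with the next term gets shorter,
// no later term can extend the deeper prefix, so its suffix block can be
// flushed to tim and indexed in tip.
static int commonPrefixLen(String a, String b) {
    int n = Math.min(a.length(), b.length()), i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) i++;
    return i;
}

static void simulate(String[] sortedTerms) {
    for (int i = 1; i < sortedTerms.length; i++) {
        int prevShared = i >= 2 ? commonPrefixLen(sortedTerms[i - 2], sortedTerms[i - 1]) : 0;
        int curShared = commonPrefixLen(sortedTerms[i - 1], sortedTerms[i]);
        if (curShared < prevShared) {
            // e.g. for {abd, abe, acf, acg}: at acf the shared prefix drops
            // from "ab" to "a", so the "ab" block (suffixes d, e) is flushed
            System.out.println("flush block for prefix \""
                    + sortedTerms[i - 1].substring(0, prevShared) + "\"");
        }
    }
    // at end of input, flush the remaining open blocks (the "ac" block here)
}
```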

This is a simplified description of the process. The main optimization strategies in Lucene's FST implementation are:

  1. A minimum number of suffixes. Lucene requires a minimum number of suffixes (default 25) before a prefix is written to tip, in order to further reduce memory usage. With fewer than 25 suffixes, the ab and ac prefixes above would not exist; there would be only the root node, and abd, abe, acf, acg would all be stored as suffixes in the tim file. For one of our 10 GB indexes, the in-memory index consumes only about 20 MB.
  2. Prefixes are computed on bytes rather than chars, which reduces the number of distinct suffixes and prevents too many suffixes from hurting performance. For example, building an FST from the three Chinese characters 鸢 (e9 b8 a2), 鸣 (e9 b8 a3), and 鸤 (e9 b8 a4) does not produce just a root node with the three whole characters as suffixes; instead, starting from the byte encoding, e9 and b8 become prefixes while a2, a3, and a4 become suffixes, as shown in the figure below:

(Figure: byte-level FST; e9 and b8 become prefixes, a2, a3, a4 become suffixes)

2. Inverted list structure

  An inverted list is a collection of document ids, but how it is stored and retrieved involves many details. The inverted-list structure Lucene currently uses is called Frame of Reference, and it has two main features:
  1) Data compression: the figure below shows how six numbers are compressed from the original 24 bytes down to 7 bytes.

(Figure: Frame of Reference compression)
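A minimal sketch of the delta-plus-bit-packing idea behind this figure (illustrative numbers, not Lucene's PackedInts code):

```java
// Toy Frame of Reference: delta-encode sorted doc ids, then note how many
// bits per value the block really needs.
int[] docs = {73, 300, 302, 332, 343, 372};  // 6 ids * 4 bytes = 24 bytes raw
int[] deltas = new int[docs.length];
int prev = 0, maxDelta = 0;
for (int i = 0; i < docs.length; i++) {
    deltas[i] = docs[i] - prev;              // gaps: 73, 227, 2, 30, 11, 29
    prev = docs[i];
    if (deltas[i] > maxDelta) maxDelta = deltas[i];
}
int bitsPerValue = 32 - Integer.numberOfLeadingZeros(maxDelta); // 8 bits for 227
// 6 deltas * 8 bits = 6 bytes of payload; with a small header recording the
// bit width, the block lands in the neighborhood of the 7 bytes shown above.
```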

2) Skip lists to speed up merging: in Boolean queries, both AND and OR operations must merge inverted lists and therefore need to locate shared document ids quickly, so skip lists are used to seek to a given document id.
  This part is also covered, with some performance tests, in an Elasticsearch blog post:
  ElasticSearch inverted table

3. Forward file

  The forward file holds the original documents; Lucene also provides storage for them, and its storage strategy is blocks plus compression. The fdt file stores the original documents and occupies about 90% of the index's disk space. The fdx file is its index, used to quickly locate a document's position from its document number (an auto-incremented id). Their file structure is as follows:
(Figure: fdt and fdx file structure)
  fnm stores meta information: field types, field names, storage options, and so on.
  fdt stores the document values. A chunk inside it is one block: when Lucene indexes documents it caches them first, and once the cache exceeds 16 KB the documents are compressed and stored. A chunk records its starting document id, how many documents it holds, and the compressed document content.
  fdx is the document-number index. After reading an inverted list, fdx is used to quickly map a document number to its chunk position. Its index structure is simple, a skip list: every 1024 chunks are grouped into a block, each block records its starting document value, and the blocks form a one-level skip list.
  Finding a document therefore takes three steps:
  First, binary-search the blocks to locate which block the document belongs to.
  Second, use each chunk's starting document number within the block to find the right chunk and its position.
  Third, load that chunk from fdt and extract the document. One more detail: the chunks' starting document values and fdt positions are not stored as plain arrays but with average-based compression, so the starting document value of the Nth chunk is reconstructed as DocBase + AvgChunkDocs * N + DocBaseDeltas[N], and the Nth chunk's position in fdt as StartPointerBase + AvgChunkSize * N + StartPointerDeltas[N].
  From this analysis we can see that when Lucene stores the original documents, it compresses multiple documents together to improve space utilization, so fetching one document requires reading and decompressing neighboring documents as well; fetching is thus a fairly expensive random-IO operation. And although Lucene allows fetching only specific fields, the storage structure shows that this does not reduce fetch time.
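To make the average-based compression above concrete, here is a small sketch of the lookup arithmetic (all values hypothetical, not read from a real fdx file):

```java
// Reconstruct chunk metadata from base + average * N + stored delta.
long docBase = 0, avgChunkDocs = 100;             // per-block header values
long startPointerBase = 0, avgChunkSize = 16_384;
long[] docBaseDeltas = {0, 12, -5, 3};            // small signed corrections
long[] startPointerDeltas = {0, 210, -90, 45};

// Starting doc id of chunk N: DocBase + AvgChunkDocs * N + DocBaseDeltas[N]
long startDoc = docBase + avgChunkDocs * 2 + docBaseDeltas[2];                    // 195
// fdt offset of chunk N: StartPointerBase + AvgChunkSize * N + StartPointerDeltas[N]
long startPointer = startPointerBase + avgChunkSize * 2 + startPointerDeltas[2];  // 32678
// Only the small deltas are stored per chunk, which is what saves space.
```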

4. Columnar storage DocValues

  We know the inverted index can solve the fast mapping from words to documents, but aggregation operations over the search results, such as grouping, sorting, and mathematical calculations, need a fast mapping from document numbers to values, and neither the inverted index nor the row-stored documents can satisfy that.
  Before version 4.0, Lucene met this need with FieldCache. Its principle is to un-invert the inverted list column by column, turning the (field value -> doc) mapping into a (doc -> field value) mapping. This implementation has two significant problems:
  1. Long build time.
  2. Large memory footprint, prone to OutOfMemoryError, and bad for garbage collection.
  Therefore, from version 4.0 Lucene introduced DocValues to solve this. Like FieldCache it is columnar storage, but it has these advantages:
  1. Pre-built at index time and written to files.
  2. Served from memory-mapped files outside the JVM heap, paged in by the operating system.
  DocValues is only about 10-25% slower than the in-memory FieldCache, but far more stable.
  Lucene currently has five DocValues types: NUMERIC, BINARY, SORTED, SORTED_SET, and SORTED_NUMERIC, and Lucene applies specific compression methods to each type.
  For example, NUMERIC, the number type, can be compressed in many ways, such as delta encoding, table compression, and greatest-common-divisor compression, with the method chosen according to the characteristics of the data; a small sketch of the GCD variant follows.
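A hedged sketch of greatest-common-divisor compression (hypothetical values, not Lucene's actual encoder):

```java
// Timestamps at day granularity share a large common divisor.
long[] values = {86_400_000L, 172_800_000L, 345_600_000L}; // day-aligned millis
long gcd = 0;
for (long v : values) {        // gcd of all values; gcd(0, v) == v
    long a = gcd, b = v;
    while (b != 0) { long t = a % b; a = b; b = t; }
    gcd = a;
}
// Store gcd (86_400_000) once, then only the quotients {1, 2, 4} per document,
// which fit in a handful of bits instead of 8 bytes each.
```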
  SORTED is the string type. Its compression method is table compression: the strings are dictionary-sorted and assigned numeric ids in advance, so only the string table and a numeric array need to be stored, and the numeric array can then be compressed again the same way as NUMERIC, as shown below:
(Figure: SORTED DocValues table compression)
  In this way the original string array becomes a numeric array: space shrinks and file mapping becomes more efficient, and variable-length access becomes fixed-length access.
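A toy sketch of this table compression (a sketch only, not Lucene's code; assumes java.util.* imports):

```java
// doc -> string becomes a sorted term table plus a doc -> ordinal array.
String[] byDoc = {"male", "female", "female", "male"};   // hypothetical column
SortedSet<String> table = new TreeSet<>(Arrays.asList(byDoc));
Map<String, Integer> ordOf = new HashMap<>();
int ord = 0;
for (String term : table) ordOf.put(term, ord++);        // female=0, male=1
int[] ords = new int[byDoc.length];
for (int i = 0; i < byDoc.length; i++) ords[i] = ordOf.get(byDoc[i]);
// Store: term table {female, male} + ordinals {1, 0, 0, 1}; the ordinal
// array is fixed-length and can be compressed again like NUMERIC.
```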
  As for applications of DocValues, Elasticsearch implements them most systematically and completely in its Aggregations feature. Its aggregation functions fall into three categories:
  1. Metric -> statistics
   Typical functions: sum, min, max, avg, cardinality, percentiles, etc.
  2. Bucket -> grouping
   Typical functions: date histogram, group by, geo-location partitioning
  3. Pipeline -> aggregation over aggregation results
   Typical function: finding the maximum value among per-group averages
Based on these aggregation functions, Elasticsearch is no longer limited to retrieval and can answer SQL-style questions such as the following:

select gender,count(*),avg(age) from employee where dept='sales' group by gender

How many men and women are there in the sales department, and what is their average age?

  Let's see how Elasticsearch implements the above SQL on top of the inverted index and DocValues.
(Figure: aggregation flow over the inverted index and DocValues)
  1. Find the inverted list for the sales department in the inverted index.
  2. Following that list, read each person's gender from the gender DocValues and bucket them into Female and Male.
  3. For each bucket, compute the head count and the average age from the age DocValues.
  4. Because Elasticsearch is sharded, merging the results returned by each shard gives the final result.
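Here is a toy sketch of steps 1 to 3, using in-memory stand-ins for the inverted index and DocValues (hypothetical data, not Elasticsearch code; assumes java.util.* imports):

```java
int[] salesDocs = {0, 2, 3};                                  // inverted list for dept=sales
String[] genderByDoc = {"male", "female", "male", "female"};  // SORTED DocValues
int[] ageByDoc = {34, 28, 41, 30};                            // NUMERIC DocValues

Map<String, long[]> buckets = new HashMap<>();                // gender -> {count, age sum}
for (int doc : salesDocs) {
    long[] acc = buckets.computeIfAbsent(genderByDoc[doc], g -> new long[2]);
    acc[0]++;                                                 // count(*)
    acc[1] += ageByDoc[doc];                                  // sum(age)
}
buckets.forEach((g, acc) -> System.out.println(
        g + ": count=" + acc[0] + ", avg(age)=" + (double) acc[1] / acc[0]));
```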
 The above is the overall flow of Elasticsearch's aggregation. It also exposes a bottleneck: the final merge can only happen on a single node, so some statistics carry errors. For example, with count(*) group by product limit 5, the final totals are not exact: because of single-node in-memory aggregation, each shard cannot return all of its group statistics, only a subset, so the merged summary can be wrong, as follows:
 Original data:

Shard 1 Shard 2 Shard 3
Product A (25) Product A (30) Product A (45)
Product B (18) Product B (25) Product C (44)
Product C (6) Product F (17) Product Z (36)
Product D (3) Product Z (16) Product G (30)
Product E (2) Product G (15) Product E (29)
Product F (2) Product H (14) Product H (28)
Product G (2) Product I (10) Product Q (2)
Product H (2) Product Q (6) Product D (1)
Product I (1) Product J (8)  
Product J (1) Product C (4)  

 count(*) group by product limit 5. The data returned by each shard is as follows:

Shard 1 Shard 2 Shard 3
Product A (25) Product A (30) Product A (45)
Product B (18) Product B (25) Product C (44)
Product C (6) Product F (17) Product Z (36)
Product D (3) Product Z (16) Product G (30)
Product E (2) Product G (15) Product E (29)

 After the merge:

Merged
Product A (100)
Product Z (52)
Product C (50)
Product G (45)
Product B (43)

 Product A's total is correct because every shard returned it, but Product C's is wrong: Shard 2 did not return it because it was not in Shard 2's top 5, so its total is understated.

Summary

  The above introduced Lucene and analyzed its underlying principles, focusing on Lucene's implementation strategies and characteristics. The next article will describe how to optimize a full-text retrieval system based on these underlying principles.

Copyright statement: This article is the blogger's original work and may not be reproduced without the blogger's permission. https://blog.csdn.net/njpjsoftdev/article/details/54015485
