Elasticsearch from beginner to proficient - Elasticsearch core inverted index data structure

Introduction to Elasticsearch

Elasticsearch is a distributed, highly scalable, near real-time search and data analysis engine. It is built on the full-text search library Apache Lucene™ and uses Lucene's inverted index technology to filter data faster than relational databases can, which makes it easy to search, analyze, and explore large amounts of data.

There is no doubt that the inverted index is the core of Elasticsearch. Elasticsearch scales out across a cluster of servers and stores data as documents, distributed and in near real time. It also builds an inverted index for every field of a document. Together with techniques such as FST compression, skip lists, and bitsets, the inverted index is what gives Elasticsearch its near real-time distributed analysis and fast search capabilities.

Elasticsearch data storage

Let's start with how Elasticsearch stores data. Elasticsearch is a document-oriented database: one record is one document, and JSON is the document serialization format. For example, here is the data for one student:

{
    "stuName":"Rose",
    "age":18,
    "gender":"Male",
    "resume":"I am good at studying",
    "tuition":26800.00,
    "hobbies":["sleep","games"],
    "address":{
        "province":"JiangSu",
        "city":"NanJing",
        "district":"YuHua"
    }
}

If we stored the above data in a traditional relational database such as MySQL, the natural approach would be to create a student table with fields such as stuName, age, tuition, hobbies, and address. In Elasticsearch, the record is a document of type student, and every field of the document is indexed. Here is a simple comparison of Elasticsearch and relational database terminology:

Roughly speaking, an index corresponds to a database, a type to a table, a document to a row, and a field to a column. As in a traditional database, these levels are nested one-to-many: an index holds many types, a type holds many documents, and a document holds many fields. You can interact with Elasticsearch data directly through RESTful HTTP requests or through the Java API.

Note that Elasticsearch is used mainly for querying, and because every field has its own inverted index, it needs a lot of storage space. When I first started inserting, deleting, and updating data in Elasticsearch, I found that without shell scripts the records had to be submitted one at a time, and these write operations were far less efficient than in a traditional database. In practice, the mainstream approach today is to use Elasticsearch together with a traditional database: Elasticsearch handles queries, and the traditional database handles inserts, deletes, and updates. Elasticsearch thus serves as a bridge between the database and client-facing search.

Inverted index

The essence of Elasticsearch inverted index:

The inverted index compresses storage space and reduces the number of disk reads; its tightly organized storage structure saves search time.

Put simply, Elasticsearch moves as much of the on-disk content into memory as possible, reduces the number of random disk reads (while also taking advantage of the disk's sequential read performance), combines various ingenious compression algorithms, and is extremely frugal in how it uses memory.

How does Elasticsearch achieve real-time, efficient search through the inverted index? Next, I will explain the principle of the inverted index from the perspectives of space and time.

What is the spatial structure of the inverted index?

First, take the stuName, age, and gender fields from the example above:

The school that the student documents belong to corresponds to an index:

PUT http://192.168.1.1:9200/school

The documents of type student stored in that index:

| ID | stuName| age| gender |
| -- |--------|----| -------| 
| 1  | Rose   | 24 | Male   |
| 2  | John   | 24 | Female |
| 3  | Bill   | 29 | Female |

stuName:

| Term   | Posting List|
|--------|-------------| 
| Rose   |     1       | 
| John   |     2       |  
| Bill   |     3       | 

age:

| Term | Posting List |
| ---- |------------- |
| 24   |     [1,2]   |
| 29   |       3     |

gender:

|  Term  | Posting List |
| ------ |--------------|
| Female |    [2,3]     |
|  Male  |      1       |

As shown above, within the school index every field of the student documents is indexed: each field has its own dictionary of terms, and each term has its own Posting List. This can be pictured as follows:

Posting List

Posting Lists are part of the index that Elasticsearch automatically builds for each field. In the tables above, values such as 24 and 29 are called terms, and [1, 2] is a posting list. A posting list is an array of ints that stores the IDs of all documents matching a given term.
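To make this concrete, here is a minimal Python sketch of how posting lists could be derived from the three student documents in the table above. The names `STUDENTS` and `build_inverted_index` are made up for this illustration and are not Elasticsearch APIs; a real index stores this data on disk in compressed form.

```python
from collections import defaultdict

# The three student documents from the table above, keyed by document ID.
STUDENTS = {
    1: {"stuName": "Rose", "age": 24, "gender": "Male"},
    2: {"stuName": "John", "age": 24, "gender": "Female"},
    3: {"stuName": "Bill", "age": 29, "gender": "Female"},
}


def build_inverted_index(docs):
    """Return {field: {term: [doc IDs in ascending order]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, term in docs[doc_id].items():
            # Every field value becomes a term pointing back at its document.
            index[field][term].append(doc_id)
    return index


index = build_inverted_index(STUDENTS)
print(index["age"][24])           # [1, 2]
print(index["gender"]["Female"])  # [2, 3]
print(index["stuName"]["Bill"])   # [3]
```

Each posting list comes out sorted by document ID simply because the documents are visited in ID order.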

Term Dictionary

To find a term quickly, which is what happens whenever we query on a field, Elasticsearch uses the Term Dictionary. Terms in the Term Dictionary are kept in sorted order (the underlying idea is the same as a B+Tree), so a term can be located by binary search in about log N disk lookups. It works just like looking up a word in a dictionary: find the first letter, then the second letter, and so on until the term is retrieved.
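The lookup idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the whole dictionary sits in a sorted in-memory list, and `TERM_DICTIONARY` and `postings_for` are hypothetical names, not part of Elasticsearch or Lucene.

```python
from bisect import bisect_left

# Sorted (term, posting list) pairs for the stuName field of the example.
TERM_DICTIONARY = [("Bill", [3]), ("John", [2]), ("Rose", [1])]
TERMS = [term for term, _ in TERM_DICTIONARY]


def postings_for(term):
    """Binary-search the sorted terms; return the posting list or []."""
    pos = bisect_left(TERMS, term)
    if pos < len(TERMS) and TERMS[pos] == term:
        return TERM_DICTIONARY[pos][1]
    return []


print(postings_for("John"))  # [2]
print(postings_for("Anna"))  # []
```

Because the terms are sorted, each probe halves the remaining candidates, which is where the log N cost comes from.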

Term Index

Random disk reads are slow, so part of the data has to be cached in memory. However, the Term Dictionary is huge and cannot be held in memory in its entirety. This is why the Term Index exists. It is like the major chapters of a dictionary: each major chapter covers many smaller sections of the Term Dictionary, so a given term can be located quickly.

Term Index, Term Dictionary and Posting List relationships

As shown in the figure below: how are the words "A", "to", "tea", "ted", "ten", "i", "in", and "inn" stored in Elasticsearch? The Term Index is like an upside-down tree whose nodes store the prefixes of the words, such as "t", "A", and "i". Each prefix leads to the part of the Term Dictionary that holds the words (terms) sharing that prefix, which includes the term we want to retrieve, and each term in turn points to its Posting List. Through the Term Index we can therefore quickly and accurately locate the term we need.

As shown in the figure below, relational databases such as MySQL have only the term dictionary layer, which is stored on disk in sorted order as a B-Tree. Retrieving a term requires several random disk accesses. Elasticsearch adds a term index on top of the term dictionary to speed up retrieval. The term index is cached in memory as a tree. After locating the corresponding block of the term dictionary through the term index, Elasticsearch goes to disk to find the term, which greatly reduces the number of random disk accesses.
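The two-level lookup can be sketched as follows, using the example words "A", "to", "tea", "ted", "ten", "i", "in", and "inn". This is a deliberately simplified model and not how Lucene is implemented: a plain dict keyed by the first character stands in for the in-memory tree, a Python list stands in for each on-disk block, and `BLOCKS`, `TERM_INDEX`, and `find_term` are hypothetical names.

```python
from bisect import bisect_left
from itertools import groupby

TERMS = sorted(["A", "to", "tea", "ted", "ten", "i", "in", "inn"])

# "Disk": one sorted block of terms per one-character prefix.
BLOCKS = [list(group) for _, group in groupby(TERMS, key=lambda t: t[0])]

# "Memory": the term index maps each prefix to the number of its block.
TERM_INDEX = {block[0][0]: number for number, block in enumerate(BLOCKS)}


def find_term(term):
    """Return (block number, offset in block) for term, or None if absent."""
    if not term:
        return None
    block_number = TERM_INDEX.get(term[0])
    if block_number is None:
        return None  # rejected in memory, no block is read at all
    block = BLOCKS[block_number]  # the only simulated disk read
    pos = bisect_left(block, term)
    if pos < len(block) and block[pos] == term:
        return block_number, pos
    return None


print(find_term("ten"))  # (2, 2)
print(find_term("tex"))  # None
print(find_term("zoo"))  # None
```

In this model a lookup touches at most one block, and a term whose prefix is unknown is rejected without touching any block.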

How does the inverted index save time?

B-Tree + whole-block partitioning for fast search

As mentioned above, the term dictionary in Elasticsearch differs from the B-Tree structure that a relational database uses for its term dictionary. In a B-Tree, the data is sorted according to certain rules and stored on disk, and a term is then found by binary search, which gives a query efficiency of log N. In Elasticsearch, the term dictionary is first divided into blocks of equal size, and each block is then recursively divided into blocks of equal size for fast search.

For example, to quickly find a value in the set of data 1-16:

B-Tree binary search:

As shown in the figure, a B-Tree binary search needs 4 steps to find the number 7. Searching for 1 or 16 likewise takes 4 steps.
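The count of 4 can be reproduced with a short Python sketch that models each step as halving the remaining range until a single value is left. The function name `halving_steps` is hypothetical, and the model is a simplification: a real B-Tree compares against keys stored in its nodes rather than against range midpoints.

```python
def halving_steps(lo, hi, target):
    """Count how many halvings narrow the range [lo, hi] down to target."""
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if target <= mid:
            hi = mid          # keep the lower half
        else:
            lo = mid + 1      # keep the upper half
        steps += 1
    return steps


for value in (7, 1, 16):
    print(value, halving_steps(1, 16, value))
# 7 4
# 1 4
# 16 4
```

With 16 values every target needs exactly log2(16) = 4 halvings, which matches the figure.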

B-Tree + whole-block partitioning:

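As shown in the figure, B-Tree + whole-block partitioning finds 7 in only 3 steps. It behaves like a "three-way search" and is more efficient than binary search. However, if the value falls into the middle block at every three-way split, the number of comparisons quietly increases (in this approach, the search value 7 is compared against both ends of the middle block, 6 and 11). This affects only a small share of the data, though, and against a massive data set it is negligible, so by the 80/20 rule the method largely meets the need for faster search. The idea is to treat a block or region as a whole in order to search quickly. It is somewhat like Shell sort: the retrieval efficiency is not stable, but it solves the efficiency problem for most of the data, which amounts to solving it for the data as a whole.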
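The following Python sketch shows one way such a three-way split could work. The split rule is an assumption chosen so that the first split of 1-16 yields a middle block of 6 to 11, as in the example; the author's figure may partition differently, and `three_way_search` is a hypothetical name.

```python
def three_way_search(lo, hi, target):
    """Narrow [lo, hi] by three-way splits; return (steps, comparisons).

    Assumed rule: the outer blocks each get a third of the range
    (rounded down) and the middle block gets the rest.
    """
    steps = 0
    comparisons = 0
    while hi - lo + 1 >= 3:
        third = (hi - lo + 1) // 3
        mid_start = lo + third   # first value of the middle block
        mid_end = hi - third     # last value of the middle block
        steps += 1
        comparisons += 1
        if target < mid_start:
            hi = mid_start - 1   # left block, one comparison was enough
        else:
            comparisons += 1     # a second comparison is needed
            if target <= mid_end:
                lo, hi = mid_start, mid_end
            else:
                lo = mid_end + 1
    steps += 1                   # final lookup inside a block of 1 or 2 values
    return steps, comparisons


print(three_way_search(1, 16, 7))  # (3, 3)
print(three_way_search(1, 16, 1))  # (3, 2)
```

Searching for 7 takes 3 steps but 3 boundary comparisons, because the first split needs both ends of the middle block (6 and 11). Searching for 1 also takes 3 steps but only 2 comparisons, since the left block is identified with a single comparison each time.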

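More importantly, in Elasticsearch it is not only the term dictionary that is organized this way: the term index uses the same kind of structure. This is equivalent to wrapping another layer of B-Tree + whole-block partitioning around the first, which raises efficiency by another notch.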


Origin blog.csdn.net/wangguoqing_it/article/details/128775605