Why are Elasticsearch queries so fast?

I've recently been maintaining our product's search feature, and every time I saw how quickly Elasticsearch returned results in the management console, I wondered how it did it.

It was even faster than a local MySQL primary-key lookup.

So I went looking for relevant material.

There are many answers to this kind of question online, roughly along these lines: ES is a full-text search engine built on Lucene; it tokenizes data and stores it in indexes, and it excels at managing large volumes of index data. Compared with MySQL, it is less suited to frequent updates and relational (join) queries.

None of these answers go very deep or analyze the underlying principles; but since indexes keep coming up, let's compare the two from the perspective of indexing.

MySQL index

Let's start with MySQL. Everyone is familiar with the term index: it usually appears in query scenarios and is a classic case of trading space for time. The following uses the InnoDB engine as an example.

Common data structure

Suppose we had to design MySQL's index ourselves — what are the options?

① Hash table

The first thing that comes to mind is a hash table, a very common data structure with efficient reads and writes; the Java equivalent is HashMap.

This structure needs little introduction: both writes and point lookups are O(1). For example, to query the row with id=3, we hash 3 and then read the corresponding slot in the array.

But for an interval query such as 1 ≤ id ≤ 6, a hash table falls short: because it is unordered, we would have to traverse all the data to find which rows fall in the range.
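A minimal Python sketch of both cases, using a dict as the hash table (the ids and values here are made-up examples):

```python
# A hash table gives O(1) point lookups, but a range query must
# scan every key because the table keeps no order.
table = {1: "a", 3: "b", 5: "c", 9: "d"}  # hypothetical id -> row

point = table.get(3)                                 # O(1) lookup
in_range = sorted(k for k in table if 1 <= k <= 6)   # O(n) full scan
```

The point lookup touches one slot; the range query has no choice but to inspect every key.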

② Ordered array

An ordered array also queries efficiently: to find the row with id=4, binary search locates it in O(log n).

And because the data is ordered, range queries come naturally. So is an ordered array a good fit for an index?

Unfortunately not — it has one major problem: suppose we insert a row with id=2.5; every subsequent entry must shift one position, so writes become very slow.
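Python's `bisect` module makes both behaviors easy to see (the ids are example data):

```python
import bisect

ids = [1, 2, 3, 5, 8]                 # the ordered array

# point lookup: O(log n) binary search
i = bisect.bisect_left(ids, 4)
found = i < len(ids) and ids[i] == 4  # False: 4 is absent

# range query 1 <= id <= 6: two binary searches bound the slice
lo, hi = bisect.bisect_left(ids, 1), bisect.bisect_right(ids, 6)
in_range = ids[lo:hi]                 # [1, 2, 3, 5]

# insertion keeps the order but shifts elements: O(n) per write
bisect.insort(ids, 4)                 # ids becomes [1, 2, 3, 4, 5, 8]
```

Reads are logarithmic, but each `insort` pays the linear shifting cost that disqualifies the ordered array as a write-heavy index.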

③ Balanced binary tree

Since an ordered array writes slowly, let's look for better write performance; a binary tree naturally comes to mind.

Here we take a balanced binary tree as an example:

A balanced binary search tree has the property that each left child is smaller than its parent and each right child is larger.

So to query the row with id=11 we follow 10 → 12 → 11. Lookups are O(log n), and so are writes.

But it still does not support range queries well. To fetch 5 ≤ id ≤ 20 we would first search the left subtree of node 10 and then its right subtree before collecting all the data, which is not efficient.
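A small sketch of the lookup path; the tree below is a made-up example consistent with the 10 → 12 → 11 walk described above:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

# a small balanced BST: root 10, with 12 and 11 on the right side
root = Node(10,
            Node(5, Node(3), Node(7)),
            Node(12, Node(11), Node(15)))

def search(node, key):
    """Return the list of keys visited on the way to `key`."""
    path = []
    while node:
        path.append(node.key)
        if key == node.key:
            return path
        node = node.left if key < node.key else node.right
    return path  # key absent; path shows where the search stopped
```

`search(root, 11)` visits 10, then 12, then 11 — three nodes for seven keys, i.e. O(log n).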

④ Skip list

Skip lists may be less familiar than the hash tables, ordered arrays, and binary trees above, but the sorted set in Redis is in fact implemented with one. Here is a brief look at what a skip list offers.

We know that even an ordered linked list is slow to query: without array subscripts there is no binary search, so lookups are O(n).

But we can cleverly layer indexes on top of the list to get something close to binary search:

We extract a primary index and a secondary index above the base data; depending on the data volume we can build N levels of index. A query then walks these levels, approximating binary search.

To find id=13, we traverse only four nodes: 1 → 7 → 10 → 13. The more data there is, the more pronounced the speedup.

Range queries are also supported: as with a single-key lookup, find the start node, then walk the (ordered) base list forward until the end of the range.

And since the index levels store only pointers rather than the real data, their space overhead relative to the base list is negligible.
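A toy sketch of the idea, using arrays in place of linked nodes and promoting every 2nd node to the next level (the node values are made up):

```python
data = [1, 4, 7, 10, 13, 16, 19, 22]  # ordered base list
level1 = data[::2]                    # primary index: every 2nd node
level2 = data[::4]                    # secondary index: every 4th node

def skip_search(target):
    """Return the nodes visited while descending through the levels."""
    path, start = [], 0
    for level in (level2, level1, data):
        i = start
        # walk right while the next key does not overshoot
        while i + 1 < len(level) and level[i + 1] <= target:
            i += 1
        path.append(level[i])
        start = i * 2   # each finer level has twice the resolution
    return path
```

Searching for 10 visits 1 → 7 → 10: each level roughly halves the remaining window, which is exactly the binary-search behavior the index levels buy us.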

Optimization of balanced binary tree

In practice, however, InnoDB does not use a skip list; it uses a data structure called the B+ tree.

Unlike the binary tree, this is not the kind of structure university lecturers usually present as fundamental, because structures like it evolve from the basics to fit the demand scenarios of real projects.

The B+ tree can be seen as an evolution of the balanced binary tree. We just noted that a binary tree handles range queries poorly, and that is exactly the point it optimizes:

Starting from the original binary tree: non-leaf nodes store no data and serve only as an index over the leaves; all data lives in the leaf nodes.

All leaf data is thus stored in order, which supports range queries well: find the position of the start key, then walk forward along the leaf nodes.

When the data volume is huge, the index file obviously cannot be kept entirely in memory — that would be fast, but the resource cost is considerable — so MySQL stores the index file directly on disk.

This differs slightly from the Elasticsearch index discussed later. Since the index lives on disk, we must reduce disk I/O as much as possible (disk I/O and memory access are not in the same order of magnitude).

For a binary tree of this depth, querying a single row takes at least four I/Os. The number of I/Os is clearly tied to the height of the tree: the lower the tree, the fewer the I/Os and the better the performance.

So how do we reduce the height of the tree?

We can widen the binary tree into a three-way tree — and ultimately an M-way tree — so the height drops sharply, queries need far fewer I/Os, and efficiency improves greatly. This is essentially where the B+ tree comes from.
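A quick back-of-the-envelope calculation makes the point. The fan-out of ~1200 keys per 16 KB page is a commonly cited InnoDB estimate, not an exact figure:

```python
import math

rows = 100_000_000                                # 100 million rows

# binary tree: one key per node, so height ~ log2(N)
binary_height = math.ceil(math.log2(rows))        # ~27 levels

# B+ tree: ~1200 keys per page, so height ~ log_1200(N)
bplus_height = math.ceil(math.log(rows, 1200))    # ~3 levels
```

Three page reads instead of twenty-seven is why a wide, shallow tree wins once the index lives on disk.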

A few tips for using indexes

Understanding the B+ tree also explains some small details of daily work — for example, why are sequentially increasing primary keys preferred?

If the primary keys we write arrive out of order, a later row's id may be smaller than an earlier one's, and maintaining the B+ tree may then require moving already-written data around.

With monotonically increasing keys there is no such concern; every write is a simple sequential append. That is why we want database primary keys to be trend-increasing, and an auto-increment primary key is the most natural choice when sharding is not a consideration.

Overall the idea resembles the skip list, adjusted for the usage scenario (for example, all data is stored in the leaf nodes).

ES index

With MySQL covered, let's see how Elasticsearch uses indexes.

Forward index

ES uses a data structure called an inverted index; before discussing it properly, let's talk about its opposite, the forward index.

Querying a specific document by its doc_id is called using a forward index; you can also think of it as a hash table.

The essence is finding a value by key. For example, doc_id=4 quickly yields the document with name=jetty wang and age=20.

Inverted index

But what if I want to efficiently find all documents whose name contains li?

The forward index alone clearly won't help: we could only scan every document and check whether its name contains li, which is very inefficient.

But suppose we build a different index structure:

Now, to find names containing li, we look up li in this structure to get its posting list, then map those doc_ids back to the actual documents.

This index structure is the inverted index.
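A tiny sketch of building and querying one; the documents are made-up examples (with jetty wang borrowed from the forward-index example above), and the tokenizer is a naive whitespace split:

```python
# forward index: doc_id -> document
docs = {
    1: "zhang san",
    2: "li si",
    3: "li lei",
    4: "jetty wang",
}

# inverted index: term -> posting list of doc_ids
inverted = {}
for doc_id, name in docs.items():
    for term in name.split():          # naive whitespace tokenizer
        inverted.setdefault(term, []).append(doc_id)

posting = inverted["li"]                       # [2, 3]
matches = [docs[i] for i in posting]           # back through the forward index
```

The query never scans the documents: it jumps from term to posting list, then from doc_ids back to the rows.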

Term Dictionary

But how do we look up li efficiently in this structure? Based on our earlier experience, as long as the terms are kept sorted, binary search (or a search tree) finds them in O(log n).

The process of splitting text into independent terms is what we usually call tokenization (word segmentation).

All the terms together form the Term Dictionary.

Tokenizing English is relatively simple: just split the text on spaces and punctuation. Chinese is more complex, but many open-source tokenizers support it (segmentation isn't the focus of this article; interested readers can search on their own).

When the text volume is huge, segmentation produces a vast number of terms. Such an inverted index clearly won't fit in memory, but storing it on disk like MySQL would not be efficient enough.

Term Index

So we take a compromise: since the whole Term Dictionary can't fit in memory, we build an index over the Term Dictionary — the Term Index — and keep that in memory.

That lets us locate entries in the Term Dictionary efficiently, and from the Term Dictionary reach the posting list.

Compared with MySQL's B+ tree, this saves several more disk I/Os.

The Term Index can be stored as a trie — the "dictionary tree" we often talk about.

To search for terms starting with j, the first step is to use the in-memory Term Index to find where the j-prefixed terms sit in the Term Dictionary file (this position may be a file pointer or an offset range).

We then read all the terms in that region; since they are already sorted, binary search quickly locates the exact term, which gives us the posting list.

Finally, the position information in the posting list lets us retrieve the target documents from the original files.
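We can simulate the narrowing step with a sorted term list and two binary searches standing in for the trie lookup (the terms are example data; in ES the dictionary lives on disk):

```python
import bisect

# sorted Term Dictionary (example terms)
terms = ["apple", "jetty", "jim", "john", "li", "lucy"]

# the Term Index narrows "terms starting with j" to a region of the
# dictionary; binary search over the sorted terms plays that role here
lo = bisect.bisect_left(terms, "j")
hi = bisect.bisect_left(terms, "k")   # "k" is the next possible prefix
j_terms = terms[lo:hi]                # ['jetty', 'jim', 'john']
```

Only the small `[lo:hi]` region needs to be read and searched, rather than the whole dictionary.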

More optimizations

Of course, Elasticsearch applies many targeted optimizations as well. When we filter on two fields, it can use bitmaps.

For example, to query name=li and age=18, we first retrieve each field's posting list.

The naive approach is to traverse both sets and keep the common elements, but that is obviously inefficient.

Instead we can store each posting list as a bitmap (which also saves space) and compute the intersection with a native bitwise AND.

[1, 3, 5] ⇒ 10101 

[1, 2, 4, 5] ⇒ 11011 

ANDing the two bit arrays yields the result:

10001 ⇒ [1, 5] 

The decoded posting list is [1, 5], which is naturally much faster. MySQL has no special optimization for the same kind of query: it filters on one field first (ideally the more selective one) and then applies the second condition to that smaller result set, so its efficiency is naturally not as high as ES's.
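The bit trick above can be sketched with plain Python integers as bitmaps (bit i set means doc i matches):

```python
def to_bitmap(posting):
    """Encode a posting list as an integer bitmap (bit i = doc i)."""
    bm = 0
    for doc_id in posting:
        bm |= 1 << doc_id
    return bm

def from_bitmap(bm):
    """Decode a bitmap back into a sorted posting list."""
    out, i = [], 0
    while bm:
        if bm & 1:
            out.append(i)
        bm >>= 1
        i += 1
    return out

name_li = to_bitmap([1, 3, 5])        # docs matching name=li
age_18 = to_bitmap([1, 2, 4, 5])      # docs matching age=18
both = from_bitmap(name_li & age_18)  # bitwise AND -> [1, 5]
```

The intersection costs one machine-word AND per chunk of bits instead of a set traversal, which is where the speedup comes from.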

Recent versions of ES also compress the posting list; the specific compression schemes are described in the official documentation and won't be covered here.

Summary

Finally, let's summarize.

From all the above, we can see that however complex a product is, it is ultimately built from basic data structures, each optimized for its application scenario. With a solid grounding in data structures and algorithms, you can pick up a new technology or middleware quickly — and even anticipate the directions in which it might be optimized.

Finally, a promise for the future: I will try to build a standalone search engine based on the idea of ES's inverted index — only by writing one myself can I really deepen my understanding.

Origin: blog.csdn.net/qwe123147369/article/details/109095927