ElasticSearch index VS MySQL index

Foreword
Recently I have been maintaining the search function of our product. Every time I saw how fast Elasticsearch queries ran in the management console, I was curious how it did it.

This is even faster than using MySQL locally to query through the primary key.

For this reason I searched for relevant information:

There are many answers to this kind of questions on the Internet. The approximate meaning is as follows:

  • ES is a full-text search engine based on Lucene. It tokenizes the data and stores an index, and it is good at managing large amounts of indexed data. Compared with MySQL, it is not good at frequent updates or join queries.

These answers are not very thorough and do not analyze the underlying principles. But since the index is mentioned again and again, let's compare the two from the perspective of the index.

MySQL index
Let's start with MySQL. The term "index" will be familiar to everyone. It is typically used to speed up queries and is a classic case of trading space for time.

The following uses the InnoDB engine as an example.

Common data structures
Suppose we were to design the MySQL index ourselves. What are the options?

Hash table
The first thing that comes to mind is the hash table, a very common data structure that is efficient for both queries and writes. Its counterpart in Java is HashMap.

This data structure needs little introduction. Both writes and point lookups are O(1) on average. For example, to query the data with id=3, we hash 3 and then go to the corresponding slot in the array.

But if we want to query a range such as 1≤id≤6, the hash table does not serve well. Because it is unordered, we have to traverse all the data to find out which entries fall in the range.
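The contrast between the two kinds of query can be sketched in a few lines. This is only an illustration: a Python dict stands in for the hash table (it plays the role that HashMap plays in Java), and the row values and the helper names point_lookup and range_scan are made up for the example.

```python
# Hypothetical rows keyed by id; a Python dict is a hash table.
rows = {id_: f"row-{id_}" for id_ in (5, 1, 9, 3, 6, 12)}

def point_lookup(table, key):
    # A single hash computation finds the slot: O(1) on average.
    return table.get(key)

def range_scan(table, low, high):
    # Hashing keeps no ordering information, so every key is checked: O(n).
    # sorted() is only here to make the result order stable.
    return sorted(k for k in table if low <= k <= high)

print(point_lookup(rows, 3))
print(range_scan(rows, 1, 6))
```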

Ordered array
The query efficiency of an ordered array is also very high. When we want to query the data with id=4, binary search locates it in O(log n).

At the same time, because the data is also ordered, it can naturally support interval queries; it seems that an ordered array is suitable for indexing?

Naturally not. It has another major problem: suppose we insert data with id=2.5; we have to shift all subsequent elements by one position, so write efficiency becomes very low.
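A minimal sketch of the ordered array, using Python's standard bisect module. The sample ids and the helper names are assumptions for illustration; insert returns how many elements had to be shifted, which is the cost described above.

```python
import bisect

ids = [1, 2, 3, 4, 5, 6, 8]  # hypothetical sorted ids

def contains(sorted_ids, key):
    # Binary search: O(log n).
    i = bisect.bisect_left(sorted_ids, key)
    return i < len(sorted_ids) and sorted_ids[i] == key

def range_query(sorted_ids, low, high):
    # Two binary searches find both ends; everything in between is the answer.
    lo = bisect.bisect_left(sorted_ids, low)
    hi = bisect.bisect_right(sorted_ids, high)
    return sorted_ids[lo:hi]

def insert(sorted_ids, key):
    # Finding the position is cheap, but every later element moves one slot.
    pos = bisect.bisect_left(sorted_ids, key)
    shifted = len(sorted_ids) - pos
    sorted_ids.insert(pos, key)
    return shifted

print(contains(ids, 4))
print(range_query(ids, 1, 6))
print(insert(ids, 2.5))  # number of elements that had to move
```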


Balanced binary tree
Since the write efficiency of the ordered array is not high, let's look at structures that write efficiently. The binary tree comes to mind; here we take the balanced binary tree as an example:

Due to the characteristics of the balanced binary tree:

The left node is smaller than the parent node, and the right node is larger than the parent node.

So to query the data with id=11, we only need to visit 10 -> 12 -> 11. The time complexity is O(log n), and writing data is likewise O(log n).

But it still does not support range queries well. Suppose we want to query data with 5≤id≤20: we need to search the left subtree of node 10 first and then its right subtree to collect all the data.

The efficiency of such a query is not high.
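The lookup path and the range query can be sketched as follows. This is a plain binary search tree with no rebalancing (a real balanced tree would rotate nodes on insert); the keys are hypothetical and are inserted in an order that happens to produce a balanced shape with 10 at the root.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # Plain BST insert; rebalancing is deliberately left out of this sketch.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search_path(root, key):
    # Returns the keys visited on the way to `key`, or None if it is absent.
    path, node = [], root
    while node is not None:
        path.append(node.key)
        if key == node.key:
            return path
        node = node.left if key < node.key else node.right
    return None

def range_query(node, low, high, out):
    # In-order walk that has to descend into both subtrees of many nodes.
    if node is None:
        return out
    if low < node.key:
        range_query(node.left, low, high, out)
    if low <= node.key <= high:
        out.append(node.key)
    if node.key < high:
        range_query(node.right, low, high, out)
    return out

root = None
for k in (10, 5, 12, 3, 7, 11, 20):
    root = insert(root, k)

print(search_path(root, 11))
print(range_query(root, 5, 20, []))
```

Unlike the ordered array, the matching keys are scattered across the tree, so the range query has to hop between subtrees instead of reading one contiguous run.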

Skip list
Skip lists may not be as common as the hash tables, ordered arrays, and binary trees mentioned above, but the sorted set in Redis is in fact implemented with a skip list.

Here we briefly introduce the advantages of the skip list.

We all know that even an ordered linked list is slow to query: it cannot use array subscripts for binary search, so the time complexity is O(n).

But we can also ingeniously optimize the linked list to realize binary search in disguise, as shown in the following figure:

We can extract a first-level index and a second-level index from the bottom-level data. Depending on the amount of data, we can extract N levels of index.

When we query, we can use these indexes to realize binary search in disguise.

Suppose you want to query the data with id=13: you only need to traverse the four nodes 1 -> 7 -> 10 -> 13. The more data there is, the more obvious the efficiency gain.

Range queries are also supported. As with the single-node query just now, you only need to locate the starting node and then traverse forward (the linked list is ordered) to the end node to read the whole range.

Also, since the index levels store only pointers rather than the real data, the extra space they occupy is negligible compared with the bottom-level linked list.
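The layered lookup can be sketched like this. It is a simplified, static version: Python lists stand in for the linked levels, and every second key is promoted to the level above, whereas a real skip list (such as the one in Redis) chooses levels randomly and links nodes with pointers. The keys are hypothetical, the list is assumed to be non-empty, and the levels will not match the figure exactly.

```python
STEP = 2  # every second node is promoted to the level above (illustrative choice)

def build_levels(sorted_keys):
    """Return [level0, level1, ...]; level0 holds every key, higher levels are sparser."""
    levels = [list(sorted_keys)]
    while len(levels[-1]) > 2:
        levels.append(levels[-1][::STEP])
    return levels

def find_floor(levels, target):
    """Index in level 0 of the largest key <= target (0 if target is below every key)."""
    pos = 0
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        while pos + 1 < len(level) and level[pos + 1] <= target:
            pos += 1      # move right along the current level
        if depth > 0:
            pos *= STEP   # drop down: same key, one level below
    return pos

def contains(levels, target):
    return levels[0][find_floor(levels, target)] == target

def range_query(levels, low, high):
    base = levels[0]
    i = find_floor(levels, low)
    if base[i] < low:
        i += 1            # the floor key is below the range, start at the next one
    out = []
    while i < len(base) and base[i] <= high:
        out.append(base[i])  # walk forward along the ordered bottom level
        i += 1
    return out

levels = build_levels([1, 3, 4, 5, 7, 8, 10, 13, 15])
print(contains(levels, 13))
print(range_query(levels, 4, 10))
```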

Optimization of balanced binary tree
But in fact, InnoDB in MySQL does not use a skip list; it uses a data structure called the B+ tree.

Unlike the binary tree, this is not one of the basic data structures that college teachers usually cover; structures of this kind evolve from the basic ones to fit the needs of real projects.

For example, the B+ tree here can be considered to have evolved from a balanced binary tree.

Just now we mentioned that the interval query efficiency of the binary tree is not high, and it can be optimized for this point:

After optimization on the basis of the original binary tree, non-leaf nodes no longer store data; they serve only as an index for the leaf nodes, and all data is stored in the leaf nodes.

In this way, the data of all leaf nodes are stored in an orderly manner, which can well support interval query.

It only needs to query the position of the starting node first, and then traverse backwards in the leaf nodes.
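The "find the start, then walk the leaves" idea can be sketched with a heavily simplified two-level structure. This is not InnoDB's actual layout: there is a single index level (a sorted list of separator keys), leaves hold only keys, and the leaf size of 3 is an arbitrary assumption. The key list is assumed to be non-empty.

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys; a real leaf would also hold the rows
        self.next = None   # link to the next leaf, which keeps range scans sequential

def build(sorted_keys, leaf_size=3):
    leaves = [Leaf(sorted_keys[i:i + leaf_size])
              for i in range(0, len(sorted_keys), leaf_size)]
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    # The index level stores no data, only the smallest key of each later leaf.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return leaves, separators

def range_query(leaves, separators, low, high):
    # Step 1: use the index to find the leaf where the range starts.
    leaf = leaves[bisect.bisect_right(separators, low)]
    # Step 2: walk forward through the linked leaves until we pass `high`.
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return out
            if k >= low:
                out.append(k)
        leaf = leaf.next
    return out

leaves, separators = build([1, 3, 5, 7, 9, 11, 13, 15, 17, 20])
print(separators)
print(range_query(leaves, separators, 5, 15))
```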

When the amount of data is huge, the index obviously cannot be kept entirely in memory; memory is fast, but it would consume too many resources. MySQL therefore stores the index file on disk.

This point is slightly different from the Elasticsearch index discussed later.

Since the index is stored on disk, we need to reduce disk IO as much as possible (disk IO is orders of magnitude slower than memory access).

As can be seen from the above figure, we have to perform at least 4 IOs to query a piece of data. Clearly, the number of IOs is closely tied to the height of the tree: the lower the tree, the fewer the IOs and the better the performance.

How can we reduce the height of the tree?
We can try to change the binary tree into a ternary tree (three children per node), so that the height of the tree drops considerably; the number of IOs per query naturally falls, and query efficiency improves a lot.

This is actually the origin of the B+ tree.
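How much the height drops as each node gets more children can be checked with a little arithmetic. The function below counts the levels needed so that a tree with the given fan-out can address a given number of keys; the key count and the fan-out values are illustrative assumptions, not InnoDB's real figures.

```python
def min_height(num_keys, fanout):
    """Smallest number of levels h such that fanout**h >= num_keys."""
    height, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height

for fanout in (2, 3, 100, 1000):
    print(fanout, min_height(1_000_000, fanout))
```

For one million keys this gives 20 levels with two children per node, 13 with three, and only 3 with a fan-out of 100, which is why each level of the tree, and therefore each disk IO, is worth widening.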

Some suggestions for using indexes
Understanding the B+ tree above can actually help with some small details of daily work; for example, why is it best for primary keys to increase in order?

Assuming that the primary key data we write is out of order, it is possible that the id of the data written later is smaller than the id written before, so that it may be necessary to move the data that has been written when maintaining the B+ tree index.

If the data is written in increments, there is no such consideration, and each time it is only necessary to write sequentially.

That's why we want the database primary key to be monotonically increasing wherever possible; setting aside sharded tables, an auto-increment primary key is the most reasonable choice.

On the whole, the idea is similar to the skip list, but adjusted for the usage scenario (for example, all data is stored in the leaf nodes).

ES index
That covers MySQL; now let's see how Elasticsearch uses indexes.

Forward index
ES uses a data structure called the inverted index. Before formally discussing the inverted index, let's look at its opposite, the forward index.

The above figure is an example. Querying a specific object by doc_id is called using a forward index, which can also be understood as a hash table.

The essence is to find value by key.

For example, through doc_id=4, the data name=jetty wang, age=20 can be quickly found.

Inverted index
What if I want to query the data whose name contains li? How can this be queried efficiently?

Obviously, the forward index alone is of no help here. You can only traverse all the data and check whether each name contains li, which is very inefficient.

But if we rebuild an index structure:

When you want to query the data containing li in the name, you only need to look up the Posting List for li in this index structure and then fetch the final data through the mapping.

This index structure is actually an inverted index.
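A minimal sketch of building an inverted index from a forward index. Only the document with doc_id=4 (jetty wang) comes from the example above; the other names are made up, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import defaultdict

# Forward index: doc_id -> name. Entries other than doc 4 are hypothetical.
docs = {
    1: "jack li",
    2: "tom li",
    3: "jetty li",
    4: "jetty wang",
}

def build_inverted_index(forward):
    # Term -> Posting List (the sorted doc_ids whose name contains the term).
    index = defaultdict(list)
    for doc_id in sorted(forward):
        for term in set(forward[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

def search(index, forward, term):
    # Look up the Posting List, then map each doc_id back to the stored data.
    return [forward[doc_id] for doc_id in index.get(term, [])]

index = build_inverted_index(docs)
print(index["li"])
print(search(index, docs, "jetty"))
```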

Term Dictionary
But how do we find li efficiently in this index structure? Drawing on the earlier discussion, as long as the Terms are kept in order, a binary search tree lets us query in O(log n).

The process of splitting a text into individual Terms is what we usually call word segmentation (tokenization).

All the Terms together form the Term Dictionary.

English word segmentation is relatively simple: you only need to split the text on spaces and punctuation. Chinese is more complicated, but many open source tools support it (since this is not the focus of this article, those interested can look it up themselves).

When the amount of text is huge, word segmentation produces a great many Terms. An inverted index of that size will not fit in memory, but if it is stored on disk as in MySQL, queries are not so efficient.

Term Index

So we can choose a compromise method. Since the entire Term Dictionary cannot be put in memory, we can create an index for Term Dictionary and put it in memory.

In this way, you can query Term Dictionary efficiently, and finally query the Posting List through Term Dictionary.

Compared with the B+ tree in MySQL, this also saves several disk IOs.

We can store the Term Index in a Trie, which is what we often call a dictionary tree.


If we search for a Term beginning with j, the first step is to use the in-memory Term Index to find where the Terms beginning with j are located in the Term Dictionary file (this location can be a file pointer, or perhaps a range).

Next, we read all the Terms in that range; because they are already sorted, binary search quickly locates the exact Term, and from it the Posting List.

Finally, the target data can be retrieved from the original file through the location information in the Posting List.
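The three-step lookup (Term Index, then Term Dictionary, then Posting List) can be sketched as below. Everything here is a simplification made for illustration: a table keyed by the first letter stands in for the Trie, a sorted Python list stands in for the on-disk Term Dictionary, and the terms and Posting Lists are made up.

```python
import bisect

# Hypothetical Posting Lists; in a real engine these live on disk.
postings = {
    "jack": [1], "jetty": [3, 4], "john": [5],
    "li": [1, 2, 3], "tom": [2], "wang": [4],
}
term_dictionary = sorted(postings)  # sorted Terms, standing in for the dictionary file

def build_term_index(terms):
    # First letter -> (start, end) range in the sorted Term Dictionary.
    index = {}
    for pos, term in enumerate(terms):
        start, _ = index.get(term[0], (pos, pos))
        index[term[0]] = (start, pos + 1)
    return index

def lookup(term_index, terms, term):
    if not term or term[0] not in term_index:
        return []
    # Step 1: the in-memory index narrows the search to one range.
    start, end = term_index[term[0]]
    # Step 2: binary search inside that sorted range.
    i = bisect.bisect_left(terms, term, start, end)
    # Step 3: the Term leads to its Posting List.
    if i < end and terms[i] == term:
        return postings[term]
    return []

term_index = build_term_index(term_dictionary)
print(term_index["j"])
print(lookup(term_index, term_dictionary, "jetty"))
```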

More optimization
Of course, ElasticSearch has also made many targeted optimizations. When we search for two fields, we can use bitmap to optimize.

For example, now we need to query the data of name=li and age=18. At this time, we need to use these two fields to retrieve the respective results from the Posting List.
The easiest way is to traverse the two sets separately and pick out the ids that appear in both, but this is obviously inefficient.

At this point we can store the lists as bitmaps (which also saves storage space) and use the native bitwise AND to get the result.


[1, 3, 5]       ⇒ 10101

[1, 2, 4, 5] ⇒ 11011

The result is obtained by a bitwise AND of the two bitmaps:


10001 ⇒ [1, 5]

In the end, the Posting List comes out as [1, 5], which is naturally much more efficient.
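The same calculation as a small sketch, using a Python integer as the bitmap. The helper names are made up, and mapping doc id n to bit n-1 is an assumption of this example rather than how ES lays out its bitmaps.

```python
def to_bitmap(ids):
    # Set bit (doc_id - 1) for every id in the Posting List.
    bitmap = 0
    for doc_id in ids:
        bitmap |= 1 << (doc_id - 1)
    return bitmap

def to_ids(bitmap):
    # Read the set bits back out as doc ids.
    ids, doc_id = [], 1
    while bitmap:
        if bitmap & 1:
            ids.append(doc_id)
        bitmap >>= 1
        doc_id += 1
    return ids

name_li = to_bitmap([1, 3, 5])
age_18 = to_bitmap([1, 2, 4, 5])
both = name_li & age_18  # one bitwise AND instead of comparing two lists

print(format(name_li, "05b"), format(age_18, "05b"), format(both, "05b"))
print(to_ids(both))
```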

MySQL has no special optimization for this kind of query: it first filters on the field that yields the smaller result set and then filters those rows on the second field, so it is naturally not as efficient as ES.

Of course, the Posting List will also be compressed in the latest version of ES. The specific compression rules can be found in the official documents, which will not be introduced here.

Summary
Finally, we come to sum up:

From the above, it can be seen that no matter how complex a product is, it is ultimately built from basic data structures, only optimized for different application scenarios. So with a solid foundation in data structures and algorithms, you can get started with a new technology or piece of middleware quickly, and even see where it could be optimized.

Finally, a promise: I will try to build a stand-alone search engine based on the idea of the ES inverted index. Only by writing it myself can I deepen my understanding.

For a better reading experience, please visit here: https://www.notion.so/ElasticSearch-VS-MySQL-54bddcc092c64c26b2127f1fb9772a23

Your likes and shares are my greatest support.


Origin blog.51cto.com/15049794/2562888