ElasticSearch - the underlying implementation of a distributed search engine - the inverted index

Table of contents

1. ElasticSearch

1.1. What is ElasticSearch?

1.2. What is ElasticStack?

1.3. Forward index and inverted index

1.3.1. Forward index

1.3.2. Inverted index

a) The creation process of inverted index:

b) Query process of inverted index: 

c) Analysis summary:

1.3.3. Applicable scenarios of inverted index


1. ElasticSearch


1.1. What is ElasticSearch?

ElasticSearch is a very powerful search engine that can help us quickly find the content we need from massive amounts of data.

For example, when you shop on Taobao and type in a product keyword, the site immediately returns information related to it. If you enter "iPhone", you will see all kinds of results, such as "iPhone 13 special offer" and "iPhone 10 second-hand sale". You may also notice that the keyword "iPhone" is shown in red in the results. This is called highlighting, and it makes the matched text stand out.

1.2. What is ElasticStack?

Several components are commonly used together with ElasticSearch: Kibana, Logstash, and Beats. Combined with ElasticSearch, they form the ElasticStack technology stack.

This set of tools is widely used in microservice log data analysis and real-time monitoring.

Log data analysis: A running project generates a large amount of log information, which we use to locate problems in the system. Suppose your system reports an error while it is running online. You often cannot stop it and debug it, so you typically use Logstash to collect the log data, ElasticSearch to store, compute on, and search the data, and Kibana to visualize the results. This makes log analysis much more convenient.

Real-time monitoring: While the project is running, its runtime status is also data, such as CPU usage, memory usage, and access frequency. This information can also be managed by ElasticSearch and then presented through visualization, so that you can clearly see the running status of the project.

In fact, the three components Kibana, Logstash, and Beats are all replaceable. They are provided officially, and you can use them or not as you like. For example, a site like Taobao displays search results on its own web pages and does not need the reports generated by Kibana. What is irreplaceable is the core, ElasticSearch itself (which is also the focus of the rest of this article).

1.3. Forward index and inverted index

1.3.1. Forward index

Traditional databases (such as MySQL) use forward indexes.

Suppose I have a database table. I will usually create an index on the id column, which forms a B+ tree, so lookups by id are very fast. This is a forward index. But if the field being searched is not the id but an ordinary title field (whose content is generally relatively long), you would normally not index it.

Even if you do add an index to it, what happens when I want to search not for an exact title but for only part of it? You end up writing something like select * from table where title like ... with a wildcard in front of the keyword. With this kind of fuzzy matching, the index on the field cannot be used, so the database scans the rows one by one and checks whether each title contains the keyword: if not, the row is discarded; if so, it is put into the result set. With 1 billion rows, that means 1 billion checks! You can imagine the performance. This is still a forward-index lookup.
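To make the cost of that row-by-row scan concrete, here is a minimal Python sketch. It only imitates what a substring match over a title column has to do; the table contents and the function name scan_titles are made up for illustration and are not how MySQL is implemented internally.

```python
# Hypothetical product table: each row is (id, title).
ROWS = [
    (1, "Xiaomi mobile phone"),
    (2, "Huawei mobile phone"),
    (3, "Huawei charger"),
    (4, "Xiaomi wristband"),
]


def scan_titles(rows, keyword):
    """Imitate a substring match on title: test every row, keep the matches."""
    matches = []
    rows_examined = 0
    for row_id, title in rows:
        rows_examined += 1  # every row is touched, matching or not
        if keyword.lower() in title.lower():
            matches.append(row_id)
    return matches, rows_examined


matches, rows_examined = scan_titles(ROWS, "mobile phone")
print(matches, rows_examined)  # [1, 2] 4
```

Only two rows match, but all four are examined. The work grows with the size of the table, not with the number of results.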

1.3.2. Inverted index

Under the hood, ElasticSearch uses an inverted index, which involves two concepts:

  • Document: Each piece of data is a document (for example, a row in a MySQL table: in the product table, each product is a document, and in the user table, each user is a document).
  • Term (entry): The words obtained by segmenting a document's text according to its meaning. For example, "Huawei mobile phone" can be divided into the two terms "Huawei" and "mobile phone".

a) The creation process of inverted index:

Suppose there is a product table:

1. When an inverted index is created for the product name in the product table, the content of each product name is split into terms for storage. For example, the product name "Xiaomi mobile phone" is segmented into the two terms "Xiaomi" and "mobile phone".

2. Each term is stored as an entry in the inverted index together with the ID of the document it came from. Here "Xiaomi" is stored as one entry with this product's ID, and "mobile phone" is stored as another new entry, also with this product's ID.

3. If the product name of the next row is segmented and also yields the term "mobile phone", its ID is added to the existing entry for that term instead of creating a new one.

4. And so on. Once all the data has been processed, an index is created over all the terms obtained, so that lookups by term are fast.
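The creation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample product table, the PHRASES vocabulary, and the tokenize function are hypothetical stand-ins for a real word segmenter, and a plain dict plays the role of the term index.

```python
from collections import defaultdict

# Hypothetical vocabulary of two-word terms that should be kept whole.
PHRASES = {"mobile phone"}


def tokenize(text):
    """Split text into lowercase terms, keeping known two-word phrases together."""
    words = text.lower().split()
    terms = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in PHRASES:
            terms.append(pair)
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms


def build_inverted_index(documents):
    """Map each term to the sorted IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, title in documents.items():
        for term in tokenize(title):
            postings[term].add(doc_id)  # new term: new entry; known term: add the ID
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical product table: id -> product name.
PRODUCTS = {
    1: "Xiaomi mobile phone",
    2: "Huawei mobile phone",
    3: "Huawei charger",
    4: "Xiaomi wristband",
}

index = build_inverted_index(PRODUCTS)
print(index["mobile phone"])  # [1, 2]
print(index["huawei"])        # [2, 3]
```

Looking up a term is now a single dictionary access, no matter how many products the table holds.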

b) Query process of inverted index: 

1. For example, if we search for "Huawei mobile phone", the query is first segmented into two terms, "Huawei" and "mobile phone".

2. Next, each term is looked up in the inverted index. Because the index was built on terms, you can immediately find the IDs of the documents containing "Huawei" (say 2 and 3) and those containing "mobile phone" (say 1 and 2). You now know all the documents that match some part of "Huawei mobile phone": 1, 2, and 3.

3. The product with id = 2 appears under both terms, which means its name matches your search better, so the results are sorted by this degree of match.

4. Finally, you use these IDs to look up the documents in the forward index. The forward index is indexed by ID, so each document is located quickly, and the documents are put into the result set in the sorted order.
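The query steps can be sketched the same way. In this minimal Python example the sample data, the two dictionaries, and the search function are assumptions for illustration. The query is passed in already segmented, and ranking by the number of matched terms is a deliberate simplification: ElasticSearch's real relevance scoring is more elaborate.

```python
from collections import Counter

# Forward index: document ID -> document (hypothetical sample data).
FORWARD_INDEX = {
    1: "Xiaomi mobile phone",
    2: "Huawei mobile phone",
    3: "Huawei charger",
    4: "Xiaomi wristband",
}

# Inverted index: term -> IDs of the documents containing it (assumed already built).
INVERTED_INDEX = {
    "xiaomi": [1, 4],
    "huawei": [2, 3],
    "mobile phone": [1, 2],
    "charger": [3],
    "wristband": [4],
}


def search(query_terms):
    """Return (doc_id, title, matched_term_count) tuples, best match first."""
    hits = Counter()
    for term in query_terms:
        # First lookup: term -> document IDs.
        for doc_id in INVERTED_INDEX.get(term, []):
            hits[doc_id] += 1
    # More matched terms first; ties are broken by the smaller ID.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    # Second lookup: document ID -> document.
    return [(doc_id, FORWARD_INDEX[doc_id], count) for doc_id, count in ranked]


for row in search(["huawei", "mobile phone"]):
    print(row)
# (2, 'Huawei mobile phone', 2)
# (1, 'Xiaomi mobile phone', 1)
# (3, 'Huawei charger', 1)
```

Document 2 comes first because it is the only one listed under both terms, and document 4 is never touched at all.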

c) Analysis summary:

As the process above shows, a search involves two lookups:

  • The first finds the matching document IDs from the terms of the content entered by the user.
  • The second finds the documents from those document IDs.

This is much faster than scanning the rows one by one, as in the forward-index case, to find those containing the keyword "mobile phone".

We can also see why this index is called inverted. With a forward index you start from the document: you go through the rows one by one, check what each one contains, and put the matches into the result set. The inverted index does the opposite: the index is built on the terms, so a search starts from a term and finds the corresponding documents (a forward index finds the terms from the document).

1.3.3. Applicable scenarios of inverted index

The inverted index is better suited to searching by part of a document's content, for example searching for some text in a browser or searching for product information.

This is why ElasticSearch is a search engine built on the inverted index.

Origin blog.csdn.net/CYK_byte/article/details/133210064