How Elasticsearch retrieves data

We all know that Elasticsearch is a full-text retrieval engine, so how does it achieve fast retrieval?

Traditional databases store each field as a single value, which is inefficient for full-text retrieval. Suppose I have a large text field, which the database can only store as one value. If I want to find any word inside that text, how does the database do it? It can only fall back on a LIKE wildcard query, which generally has to scan the text of every row. The performance is poor, and it falls far short of what a search engine needs.

To address these shortcomings, Lucene, a full-text retrieval library, emerged. Its core is a data structure called the inverted index. Unlike a database row, which stores a field as one opaque value, Lucene splits the field's text into individual terms and records, for each term, which documents contain it. A single field can therefore be searched by any of the many terms it contains. This mapping from terms to documents is the inverted index.

Term  | Doc 1 | Doc 2 | Doc 3 | ...
------------------------------------
brown |   X   |       |  X    | ...
fox   |   X   |   X   |  X    | ...
quick |   X   |   X   |       | ...
the   |   X   |       |  X    | ...

As shown in the table above, the inverted index for a field consists of a sorted list of terms, each of which is unique. Each term maps to a list of the IDs of all documents that contain it.
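The table can be reproduced with a minimal Python sketch. The three sample documents and the helper name `build_inverted_index` are assumptions made for illustration; Lucene's real analysis chain and on-disk format are far more involved.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis (an assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Terms are unique and kept in sorted order, as in the table.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical documents chosen to match the table.
docs = {
    1: "the quick brown fox",
    2: "quick fox",
    3: "the brown fox",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(f"{term:<6}| {doc_ids}")
```

Looking up a word is then a single dictionary access, for example `index["quick"]`, rather than a scan over every document.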

Why talk about Lucene? Because Lucene itself is only a full-text retrieval toolkit. It lacks enterprise-level features such as distribution, replication, and scaling. Elasticsearch and Solr are both enterprise-level frameworks built on top of Lucene, so understanding Lucene is a great help when learning Elasticsearch or Solr.

In Elasticsearch, each document is a JSON object, and each indexed field in that JSON has its own inverted index.
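As a rough sketch of the per-field idea, the following Python builds a separate inverted index for every field of some JSON-like documents. The documents, the field names `title` and `body`, and the function `index_documents` are all hypothetical; this is not how Elasticsearch is implemented internally.

```python
from collections import defaultdict

def index_documents(docs):
    """Build one inverted index per field: field -> term -> sorted doc IDs."""
    per_field = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            # Naive analysis (an assumption): lowercase and split on whitespace.
            for term in str(value).lower().split():
                per_field[field][term].add(doc_id)
    return {
        field: {term: sorted(ids) for term, ids in sorted(terms.items())}
        for field, terms in per_field.items()
    }

# Hypothetical JSON-like documents.
json_docs = {
    1: {"title": "quick fox", "body": "the quick brown fox jumps"},
    2: {"title": "lazy dog", "body": "the fox and the lazy dog"},
}
field_indexes = index_documents(json_docs)
print(field_indexes["title"].get("fox"))  # documents whose title contains "fox"
print(field_indexes["body"].get("fox"))   # documents whose body contains "fox"
```

Because every field has its own term dictionary, the same word can match in one field of a document and not in another.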

Of course, the inverted index stores more than a list of document IDs for each term. It also keeps statistics such as the number of documents the term appears in, the number of times it appears in a specific document, the length of each document, and the average length of all documents. This information is used to calculate the relevance of search results. We all know that results from Google or Baidu come back ranked, with the most relevant results generally at the top. So which factors determine the ranking? It is the relevance described above. Calculating relevance is one of Lucene's core functions. At present, Lucene has two main ranking algorithms:

(1) The classic TF/IDF algorithm, based on the vector space model (VSM)

(2) The newer BM25 algorithm, based on probability theory

Interested readers can find the details on Wikipedia; they will not be expanded on here.
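For a rough feel of how the statistics mentioned above combine into a score, here is a minimal sketch of the textbook BM25 formula. The function name `bm25_scores`, the sample documents, and the whitespace tokenizer are assumptions; the values `k1=1.2` and `b=0.75` are the commonly used defaults, and Lucene's actual implementation differs in its details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(terms) for terms in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for terms in tokenized.values() for t in set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        term_freq = Counter(terms)  # occurrences of each term in this document
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Hypothetical sample documents.
sample_docs = {
    1: "the quick brown fox",
    2: "quick fox",
    3: "the brown fox",
}
scores = bm25_scores("quick", sample_docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this sample, documents 1 and 2 both contain "quick" once, but document 2 is shorter, so it ranks first; document 3 does not contain the term and scores zero.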

In early full-text search, all the data was built into one large inverted index. Once a new index was ready, it replaced the old one, and only then could the most recent changes be retrieved.

One of the defining features of this large inverted index is immutability. Once the index has been written to disk, it never changes.

Advantages:

(1) Because the index is immutable, no locks are required: you never have to worry about multiple threads modifying the data at the same time.

(2) The index can be loaded into the filesystem cache and stay there, because it is never modified. As long as the filesystem cache has enough space, most queries are served from memory instead of from disk, which greatly improves search performance.

(3) Other caches, such as the filter cache, remain valid for the whole life of the index. They never need to be rebuilt, because the index is immutable.

(4) A large immutable index can be compressed at a higher ratio, which saves I/O and memory.

Disadvantage:

The advantage of the inverted index is also its disadvantage. Because it is immutable, making new data searchable requires rebuilding the entire index, which severely limits how much data a single index can hold and how often it can be updated.
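The rebuild-and-replace cycle can be illustrated with a small Python sketch. A read-only `MappingProxyType` stands in for the on-disk index here; the name `build_frozen_index` and the sample documents are hypothetical.

```python
from types import MappingProxyType

def build_frozen_index(docs):
    """Build a read-only inverted index: term -> tuple of sorted doc IDs."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    frozen = {term: tuple(sorted(ids)) for term, ids in postings.items()}
    return MappingProxyType(frozen)

all_docs = {1: "the quick brown fox", 2: "quick fox"}
live_index = build_frozen_index(all_docs)

# A new document arrives: the existing index cannot be changed in place...
try:
    live_index["dog"] = (3,)
except TypeError:
    print("index is read-only")

# ...so the whole index is rebuilt from all documents and swapped in.
all_docs[3] = "lazy dog"
live_index = build_frozen_index(all_docs)
print(live_index["dog"])
```

Every added document costs a full pass over all existing documents, which is why this approach stops scaling as the index grows or updates become frequent.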

To solve this problem, Elasticsearch uses multiple indexes that can be updated dynamically, which will be introduced in the next article.

Reference links:

https://www.elastic.co/guide/en/elasticsearch/guide/master/inverted-index.html

https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-intro.html

https://www.elastic.co/guide/en/elasticsearch/guide/master/making-text-searchable.html
