How Elasticsearch dynamically maintains an immutable inverted index



The previous article introduced how to search text in Elasticsearch, and also briefly described the immutability of the index data structure in es.


Immutability has a cost: it limits both the maximum amount of data a single index can hold and how frequently the index can be updated. So the problem es faces is how to make an inverted index updatable while still keeping the benefits of immutability.


The answer is to use multiple indexes instead of rewriting the entire index each time.

The way es does this is to add a new index that reflects the latest changes, then query all the inverted indexes in turn at search time, from the earliest to the latest, and return the merged result.
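
The idea of querying several immutable indexes and merging the hits can be sketched in a few lines of Python. This is only a toy model, not how Lucene stores data: the names `build_segment` and `search` are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import defaultdict


def build_segment(docs):
    """Build one immutable inverted index: term -> sorted tuple of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Freeze the postings so the segment is never modified after creation.
    return {term: tuple(sorted(ids)) for term, ids in postings.items()}


def search(segments, term):
    """Query every segment from the earliest to the latest and merge the hits."""
    hits = []
    for segment in segments:
        hits.extend(segment.get(term.lower(), ()))
    return hits


# Two batches of documents become two separate segments.
segments = [
    build_segment({1: "quick brown fox", 2: "lazy dog"}),
    build_segment({3: "quick red fox"}),  # later changes go into a new segment
]

print(search(segments, "quick"))  # [1, 3]
```

Adding document 3 did not touch the first segment at all; the new data simply lives in a second, smaller index that is consulted at query time.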

In Lucene, an index is composed of multiple segments plus a commit point file. Each segment is an inverted index, and the commit point file lists all known segment files, as shown in the following figure:





Note that what Lucene calls an index is called a shard in es, and an index in es can contain multiple shards. When an es index is queried, the request is sent to all of its underlying shards, and the results from all shards are then assembled and returned.
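
The relationship between an es index, its shards, and the segments inside each shard can be illustrated roughly as follows. The `Shard` class and `search_index` function are hypothetical names for this sketch, and the segment names are invented; real shards also live on different nodes and are queried in parallel, which is left out here.

```python
class Shard:
    """Toy Lucene index: a set of segments plus a commit point listing them."""

    def __init__(self):
        self.segments = {}      # segment name -> {term: [doc ids]}
        self.commit_point = []  # names of all known segments, oldest first

    def add_segment(self, name, postings):
        self.segments[name] = postings
        self.commit_point.append(name)

    def search(self, term):
        hits = []
        # Only segments listed in the commit point are consulted.
        for name in self.commit_point:
            hits.extend(self.segments[name].get(term, []))
        return hits


def search_index(shards, term):
    """Send the query to every shard, then assemble the partial results."""
    results = []
    for shard in shards:
        results.extend(shard.search(term))
    return sorted(results)


shard0, shard1 = Shard(), Shard()
shard0.add_segment("segment_0", {"fox": [1]})
shard0.add_segment("segment_1", {"fox": [4], "dog": [5]})
shard1.add_segment("segment_0", {"fox": [2], "cat": [3]})

print(search_index([shard0, shard1], "fox"))  # [1, 2, 4]
```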


Going back to the question at the beginning of the article: how does es use multiple indexes to solve the update problem? Let's take a look at how data is written to es:


(1) When es receives a write or update request, it first collects the data in the in-memory indexing buffer.

(2) After a certain interval, or when triggered by an external command, a new segment is generated from the contents of the memory buffer.

(3) The segment is first written to the filesystem cache, and at this point it can already be searched.

(4) After a period of time, the segment in the filesystem cache is fsynced to disk and the new segment's file name is recorded in the commit point file. The new segment is opened so that it stays visible to search.

(5) Finally, the in-memory buffer is emptied and waits for new documents to be collected.
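
The write path above can be modeled with a small Python class. This is a simplified sketch under several assumptions: `ToyShard`, `refresh`, and `flush` are illustrative names (they loosely mirror what Elasticsearch calls the refresh and flush operations), nothing is really written to disk, and the buffer is cleared as soon as its contents become a segment.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []        # step 1: in-memory indexing buffer
        self.searchable = []    # segments opened for search: (name, postings)
        self.commit_point = []  # names of segments considered durable on disk

    def index(self, doc_id, text):
        # New documents are only collected; they are not searchable yet.
        self.buffer.append((doc_id, text))

    def refresh(self):
        """Steps 2, 3 and 5: turn the buffer into a new searchable segment."""
        if not self.buffer:
            return
        postings = {}
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                postings.setdefault(term, set()).add(doc_id)
        name = f"segment_{len(self.searchable)}"
        self.searchable.append((name, postings))
        self.buffer = []  # the buffer is emptied for new documents

    def flush(self):
        """Step 4: pretend to fsync and record all segments in the commit point."""
        self.refresh()
        self.commit_point = [name for name, _ in self.searchable]

    def search(self, term):
        hits = set()
        for _, postings in self.searchable:
            hits |= postings.get(term.lower(), set())
        return sorted(hits)


shard = ToyShard()
shard.index(1, "quick brown fox")
print(shard.search("fox"))  # [] because the document is still in the buffer
shard.refresh()
print(shard.search("fox"))  # [1] searchable before any commit
print(shard.commit_point)   # []
shard.flush()
print(shard.commit_point)   # ['segment_0']
```

The point of the split is that making data searchable (cheap) is decoupled from making it durable (expensive, because of the fsync).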


As shown in the figure below:







When a query request is received, all segments, both those still in memory and those on disk, are queried in turn. The results from all segments are then aggregated and the relevance of each document is calculated accurately. With this approach, new documents can be added relatively cheaply.


The above describes the processing of new data. Next, let's see how es handles deletion and update requests.


First of all, we know that a segment is immutable, so a document can neither be removed from an old segment nor updated in place. So how does es handle delete and update requests?


At each commit point, es generates a file with the .del suffix, which records all the documents that have been deleted. When a document is deleted, es only marks it as deleted in the .del file. A document marked for deletion can still be matched by a query, but it is removed before the final result is returned. This is how deletion is implemented in es.
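
A minimal sketch of this mark-then-filter behaviour is shown below. The `deleted` set merely stands in for the .del file, and `delete` and `search` are hypothetical helper names; the real on-disk format is different.

```python
# An immutable segment: term -> doc ids. It is never modified below.
segment = {"fox": (1, 2), "dog": (2, 3)}

# Stands in for the .del file: ids of documents marked as deleted.
deleted = set()


def delete(doc_id):
    """Deleting only records a mark; the segment itself is untouched."""
    deleted.add(doc_id)


def search(term):
    """Match against the segment, then drop marked documents from the result."""
    matches = segment.get(term, ())
    return [doc_id for doc_id in matches if doc_id not in deleted]


delete(2)
print(search("fox"))   # [1]
print(segment["fox"])  # (1, 2) the postings still contain document 2
```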


The update logic is similar. When a document is updated, the old version is marked as deleted in the .del file, and the new version is indexed into a new segment. A query may match both versions, but the old version marked for deletion is removed before the final result is returned.
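
An update can be modeled as a delete mark plus a new segment. The following is a toy illustration only: segments are plain dictionaries of document id to text, `deleted` again stands in for the .del file, and the `update` and `search` functions are invented for this example.

```python
# Each segment maps doc id -> text and is never modified once added.
segments = [{1: "quick brown fox", 2: "lazy dog"}]

# Stands in for the .del file: (segment number, doc id) pairs marked deleted.
deleted = set()


def update(doc_id, new_text):
    """Mark every old version as deleted, then index the new one in a new segment."""
    for seg_no, segment in enumerate(segments):
        if doc_id in segment:
            deleted.add((seg_no, doc_id))
    segments.append({doc_id: new_text})


def search(term):
    """Match in all segments, skipping versions that carry a delete mark."""
    hits = []
    for seg_no, segment in enumerate(segments):
        for doc_id, text in segment.items():
            if term in text.split() and (seg_no, doc_id) not in deleted:
                hits.append((doc_id, text))
    return hits


update(1, "quick red fox")
print(search("fox"))    # [(1, 'quick red fox')]
print(search("brown"))  # [] the old version is filtered out
```

Both versions of document 1 exist physically after the update; only the delete mark decides which one the query returns.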


That is how es updates the index dynamically. As we can see, updates and deletes in es follow a pseudo-delete (soft delete) strategy. At this point you may wonder when the data marked for deletion is actually removed from the file system; after all, a large amount of it will have some impact on performance. This will be covered in a later article on segment merging.

If you have any questions, you can scan the QR code and follow the WeChat public account I am the siege division (woshigcs), then leave a message there for consultation. Neither technical debt nor health debt should be owed. On the road of seeking the Tao, I walk with you.
