Elasticsearch inverted index and document indexing principles (1)


1. Inverted index

You may already be very familiar with inverted indexes, but they are worth revisiting.

Consider this: when we search for a keyword with a search engine, how does the engine find, among all the documents it has crawled, the ones that contain this keyword?

Going through every document one by one is obviously unrealistic, which is why a dedicated data structure, the inverted index, was created.

term           data
hello          doc1-3-2-0-5, doc2-1-3-1-6
world          doc1-3-2-0-5, doc2-1-3-1-6
Elasticsearch  doc5-3-2-0-5, doc9-1-3-1-6

The table above can be regarded as a simplified inverted index. The data for each term contains the document ID, the term frequency, the position, and the offsets.

For example, for the word hello the inverted index above records that:
it appears 3 times in doc1, at position 2 (the second word), starting at byte 0 and ending at byte 5 of the document;
it appears once in doc2, at position 3 (the third word), starting at byte 1 and ending at byte 6 of the document.

The same applies to world, Elasticsearch, and other keywords.

In this way, when we search for the word hello, we can easily find which documents contain it. Of course, the real situation is much more complicated, because an index is spread across multiple nodes, each holding multiple inverted indexes.
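To make this concrete, here is a minimal, self-contained Python sketch that builds such an inverted index from a few in-memory documents, recording document ID, term frequency, positions, and character offsets. The encoding is purely illustrative (positions here are 0-based), not Lucene's actual on-disk format:

    from collections import defaultdict

    def build_inverted_index(docs):
        # term -> {doc_id: {"freq": ..., "positions": [...], "offsets": [...]}}
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            offset = 0
            for position, word in enumerate(text.split()):
                start = text.index(word, offset)  # character offset of this occurrence
                end = start + len(word)
                offset = end
                posting = index[word].setdefault(
                    doc_id, {"freq": 0, "positions": [], "offsets": []})
                posting["freq"] += 1
                posting["positions"].append(position)
                posting["offsets"].append((start, end))
        return index

    docs = {"doc1": "hello world hello", "doc2": "a hello world"}
    index = build_inverted_index(docs)
    print(index["hello"]["doc1"])
    # {'freq': 2, 'positions': [0, 2], 'offsets': [(0, 5), (12, 17)]}

A query for hello is then a single dictionary lookup instead of a scan over every document.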

In ES, the index_options parameter of a field's mapping controls which of these attributes (document ID, term frequency, position, offsets) are recorded in the inverted index.
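For example, the following sketch sets index_options when creating a mapping. It assumes the official elasticsearch Python client with a 7.x-style API, a cluster at localhost:9200, and a made-up index name articles; the valid values are docs, freqs, positions, and offsets, each including everything before it:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(index="articles", body={
        "mappings": {
            "properties": {
                # record everything up to start/end offsets for this field
                "title": {"type": "text", "index_options": "offsets"},
                # "docs" keeps only document IDs: the smallest index,
                # but no phrase or proximity queries on this field
                "body": {"type": "text", "index_options": "docs"},
            }
        }
    })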

2. Segment

A segment is the physical form of ES's inverted index. Its special property is immutability: a segment is never modified in place; segments are only merged and deleted.

As we will see later, each refresh generates a new segment, and these segments are eventually saved to disk as files.

Each segment occupies a file handle. More importantly, every search request must visit every segment, which means that the more segments there are, the more memory and CPU are consumed and the slower search requests become.

Therefore, ES runs a background task that merges segments. The following parameters control segment merging:

parameter                                      description
index.merge.policy.floor_segment               Default 2mb; segments smaller than this are merged preferentially
index.merge.policy.max_merge_at_once           Default 10; the maximum number of segments merged in one normal merge
index.merge.policy.max_merged_segment          Default 5gb; segments larger than this are no longer merged
index.merge.policy.max_merge_at_once_explicit  Default 30; the maximum number of segments merged at once in an explicit (force) merge
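These are dynamic index settings, so they can be adjusted per index at runtime. A sketch under the same client assumptions as above, using the hypothetical articles index:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Tune segment merging for one index (values shown are the defaults).
    es.indices.put_settings(index="articles", body={
        "index.merge.policy.floor_segment": "2mb",
        "index.merge.policy.max_merge_at_once": 10,
        "index.merge.policy.max_merged_segment": "5gb",
    })

    # An explicit (force) merge can also be requested manually, e.g. to
    # compact a read-only index down to a single segment:
    es.indices.forcemerge(index="articles", max_num_segments=1)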

3. The process of adding a document to ES

[Figure: flow of adding a document]

When a document is added, it is not parsed into a segment immediately. Instead it is first appended to the index buffer; by default a refresh runs once per second, parsing the buffered documents into a segment that is placed in the filesystem cache.

However, the segment is not written to disk right away; it is persisted according to the flush policy. Refresh and flush are described in detail below.

Another important step when adding a document is writing the translog. Because there is a time gap between a document arriving and its segment reaching disk, the translog is written first so that no data is lost if the node fails in between.

Of course, if the translog is written asynchronously, some data may still be lost.

The translog is an append-only file; since writes to it are sequential, writing it is fast.
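From the client's point of view, the write path looks like the sketch below (same client assumptions as earlier; the document is made up). It also shows the index.translog.durability setting that switches to the asynchronous translog mode mentioned above:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The document is buffered and appended to the translog right away,
    # but only becomes searchable after the next refresh (default: 1s).
    es.index(index="articles", id="1", body={"title": "hello world"})

    # Trade durability for throughput: fsync the translog every 5s
    # instead of on every request; data in that window can be lost.
    es.indices.put_settings(index="articles", body={
        "index.translog.durability": "async",
        "index.translog.sync_interval": "5s",
    })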

4. Creating an inverted index (refresh)

Refresh is the process of parsing documents into a segment; in ES terms, it moves data from the index buffer to the filesystem cache.

[Figure: document parsing during refresh]

The refresh process:

  1. Write the documents in the index buffer into a new segment
  2. Open the segment so that its documents become searchable
  3. Clear the index buffer
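Refresh can be triggered or tuned from the client side; a sketch under the same assumptions as the earlier snippets:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Force a refresh now, making buffered documents searchable immediately.
    es.indices.refresh(index="articles")

    # For bulk loading, a longer interval (or -1 to disable refresh entirely)
    # reduces the number of small segments produced.
    es.indices.put_settings(index="articles", body={
        "index": {"refresh_interval": "30s"}
    })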

5. Flush

The flush operation mainly persists the segments held in the filesystem cache (i.e., in memory) to disk.

The flush operation proceeds as follows:

  1. Write the documents in the index buffer into a new segment
  2. Clear the index buffer
  3. Write a commit point to disk
  4. Use fsync to write the segments in the filesystem cache to disk
  5. Delete the old translog file

The following are some parameters that control the flush operation:

parameter                              description
index.translog.flush_threshold_ops     Flush after this many operations; default: unlimited
index.translog.flush_threshold_size    Flush when the translog reaches this size; default: 512mb
index.translog.flush_threshold_period  Flush at least once within this period; default: 30m
index.translog.interval                How often the translog size is checked; default: 5s
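A flush can also be requested explicitly, and the size threshold can be changed at runtime. A sketch under the same client assumptions as earlier; note that most of these settings come from older ES versions, and recent releases keep only the size-based threshold:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Explicitly fsync segments to disk and trim the translog.
    es.indices.flush(index="articles")

    # Trigger flushes more eagerly by lowering the translog size threshold.
    es.indices.put_settings(index="articles", body={
        "index.translog.flush_threshold_size": "256mb"
    })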

