Why ElasticSearch is so fast

Think about a few questions:

  • Why is search near real-time?
  • Why are CRUD (create, read, update, delete) operations on documents real-time?

1. Overall structure from top to bottom

The overall structure, from top to bottom, can be pictured very vividly:

At the top is the cluster (Cluster).

A node (Node) is simply one machine in the cluster.

One or more nodes together hold the shards (the small green squares in the original figure) that make up an Elasticsearch index.

Within an index, those small green squares distributed across the nodes are the shards (Shard).

Each shard is a complete Lucene index.

A Lucene index in turn consists of many small segments, the smallest unit of storage management.

Let's explain why Elasticsearch is so fast from the Node dimension, Shard dimension, and Segment dimension.

2. Node dimension

The multi-node cluster solution improves the concurrent processing capability of the entire system.

1. Multi-node cluster solution

Route a document to a shard: When indexing a document, the document is stored in a primary shard. How does Elasticsearch know which shard a document should be stored in? In fact, this process is determined according to the following formula:

shard = hash(routing) % number_of_primary_shards

routing is a variable value; by default it is the document's id, but it can also be set to a custom value. This explains why the number of primary shards must be fixed when the index is created and can never be changed afterwards: if the number changed, all previous routing values would become invalid and documents could no longer be found.

Once the shard is determined, the node it lives on is determined as well.
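
As a rough sketch of this formula (the hash function and shard count here are illustrative assumptions; Elasticsearch actually hashes the routing value with Murmur3):

# Minimal sketch of document-to-shard routing (illustrative only).
import hashlib

NUMBER_OF_PRIMARY_SHARDS = 5  # fixed at index creation time

def route_to_shard(routing_value: str, num_primary_shards: int = NUMBER_OF_PRIMARY_SHARDS) -> int:
    """Return the primary shard a document with this routing value lands on."""
    # shard = hash(routing) % number_of_primary_shards
    h = int.from_bytes(hashlib.md5(routing_value.encode("utf-8")).digest()[:4], "big")
    return h % num_primary_shards

# By default the routing value is the document id.
print(route_to_shard("doc-1"))
print(route_to_shard("doc-2"))

If num_primary_shards changed, the same routing value would map to a different shard, which is exactly why the primary shard count cannot be changed after index creation.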

2. Coordinating node

Nodes are divided into master nodes, data nodes, and client nodes (which only distribute and aggregate requests). Every node can accept client requests, and every node knows where any document in the cluster lives, so it can forward a request directly to the node that holds the data. The node that accepts a request acts as the "coordinating node" for it. From this perspective, the system as a whole can handle more concurrent requests, and of course searches become faster.

Take updating a document as an example:

  • The client sends an update request to Node 1.
  • Node 1 forwards the request to Node 3, where the primary shard resides.
  • Node 3 retrieves the document from the primary shard, modifies the JSON in the _source field, and tries to re-index the document on the primary shard. If the document has already been modified by another process, it retries step 3, giving up after retry_on_conflict attempts.
  • If Node 3 successfully updates the document, it forwards the new version of the document in parallel to the replica shards on Node 1 and Node 2 to be re-indexed. Once all replica shards report success, Node 3 reports success to the coordinating node, which in turn reports success to the client.

3. Optimistic concurrency control

The approach used by Elasticsearch assumes that conflicts are unlikely and does not block the operation being attempted. Because there is no blocking, indexing is faster, and correctness under concurrency is still guaranteed through the _version field:

PUT /website/blog/1?version=1 
{
  "title": "My first blog entry",
  "text":  "Starting to get the hang of this..."
}

The document in our index must now have _version equal to 1 for this update to succeed.
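
A minimal sketch of the compare-and-set idea behind version-based optimistic concurrency control (the store and method names are made up for illustration; this is not the Elasticsearch implementation):

# Sketch of optimistic concurrency control with a _version field.
# The update succeeds only if the expected version still matches.
class VersionConflictError(Exception):
    pass

class TinyStore:
    def __init__(self):
        self._docs = {}  # doc id -> (version, source)

    def index(self, doc_id, source):
        version, _ = self._docs.get(doc_id, (0, None))
        self._docs[doc_id] = (version + 1, source)
        return version + 1

    def update(self, doc_id, source, expected_version):
        current_version, _ = self._docs[doc_id]
        if current_version != expected_version:
            # Another writer got there first; the caller may re-read and retry.
            raise VersionConflictError(f"expected {expected_version}, found {current_version}")
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1

store = TinyStore()
v = store.index("blog/1", {"title": "My first blog entry"})                              # v == 1
store.update("blog/1", {"title": "My first external blog entry"}, expected_version=v)   # ok, version becomes 2
# store.update("blog/1", {...}, expected_version=1)  # would now raise VersionConflictError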

3. Shard dimension

1. Replica shards

You can increase the number of shard replicas to improve search throughput in high-concurrency scenarios, but this also reduces indexing efficiency.

2. Immutability of segments

At the bottom layer, data is stored in segments, which almost completely avoids locking during reads and writes and greatly improves read/write performance.

  • No lock is required. If you never update the index, you don't need to worry about multiple processes modifying data at the same time.
  • Once an index is read into the kernel's filesystem cache, it stays there due to its immutability. As long as there is enough space in the file system cache, most read requests will go directly to memory without hitting disk. This provides a big performance boost.
  • Other caches (like filter caches) are always valid for the lifetime of the index. They don't need to be rebuilt every time the data changes because the data doesn't change.
  • Writing to a single large inverted index allows the data to be compressed, reducing disk I/O and usage of the index that needs to be cached in memory.

How is the inverted index updated while preserving immutability? By creating more versions of the indexed document, as with the _version field mentioned above. In effect, an UPDATE consists of a DELETE (the record is only marked as deleted; it is physically removed when segments are merged) plus a CREATE.
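
A toy sketch of this idea under the assumption of immutable segments: an update only marks the old document as deleted and writes the new version into a fresh segment, and the marked documents physically disappear during segment merge (all names here are illustrative):

# Toy model of immutable segments: an update = mark-delete + create.
class Segment:
    def __init__(self, docs):
        self.docs = dict(docs)        # doc_id -> source; never modified after creation
        self.deleted = set()          # only this per-segment deletion mark is mutable

class ToyIndex:
    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(Segment(docs))

    def update(self, doc_id, new_source):
        for seg in self.segments:
            if doc_id in seg.docs and doc_id not in seg.deleted:
                seg.deleted.add(doc_id)          # DELETE: just a mark
        self.add_segment({doc_id: new_source})   # CREATE: new version in a new segment

    def merge(self):
        live = {}
        for seg in self.segments:
            for doc_id, src in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = src
        self.segments = [Segment(live)]          # deleted docs physically disappear here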

3. Improving write speed

To improve indexing speed while still ensuring reliability, Elasticsearch adds a translog (transaction log) on top of segments; every operation on Elasticsearch is recorded in it.

After a document is indexed, it is added to the in-memory buffer and appended to the translog.


Shards are refreshed every second:

  • The documents in the in-memory buffer are written to a new segment, without an fsync.
  • The segment is opened so that it can be searched.
  • The memory buffer is emptied.

The process continues: more documents are added to the in-memory buffer and appended to the translog.

Every so often (for example, when the translog has grown large) the index is flushed: a new translog is created and a full commit is performed, as the following steps and the sketch after them show:

  • All documents in the memory buffer are written to a new segment.
  • The buffer is emptied.
  • A commit point is written to disk.
  • The filesystem cache is flushed via fsync.
  • Old translogs are deleted.
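
Below is a highly simplified sketch of the write path just described (in-memory buffer, translog, refresh, flush). All class and method names here are illustrative assumptions, not Lucene's or Elasticsearch's actual code:

# Simplified write path: buffer + translog; refresh makes data searchable,
# flush commits segments and starts a new translog.
class ToyShard:
    def __init__(self):
        self.buffer = []        # in-memory indexing buffer (not yet searchable)
        self.translog = []      # append-only log for durability
        self.segments = []      # searchable segments (filesystem cache / disk)
        self.committed = []     # segments covered by the last commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(("index", doc))   # every operation is logged

    def refresh(self):
        # Runs every second: buffer -> new searchable segment, no fsync.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Full commit: write everything out, fsync, drop the old translog.
        self.refresh()
        self.committed = list(self.segments)   # commit point
        self.translog = []                     # old translog deleted, new one started

    def search(self, predicate):
        # Only refreshed segments are visible -> "near real-time" search.
        return [d for seg in self.segments for d in seg if predicate(d)]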

Before a refresh, newly written data lives only in the in-memory buffer and cannot be searched; this is why Lucene is said to provide near real-time rather than real-time queries.

This mechanism avoids random writes: data is written in batches and by appending, which yields high throughput. At the same time, the filesystem cache and memory are used to speed up writes, and the translog protects against data loss.

4. Comparison with LSM trees

Lucene's write path follows the same idea as an LSM-Tree (Log-Structured Merge Tree): batched, append-only writes to memory plus a log, periodically flushed into immutable on-disk segments that are merged in the background.

4. Segment dimension

1. Inverted index

Finally we come to the inverted index. It is often said that the inverted index is what makes search fast, so what architecture and data structures does it use to achieve this?

Lucene's actual index structure consists of a term index, a term dictionary, and posting lists. Let's illustrate these three concepts with an example:

Suppose each document has an ID (the document id) plus Name, Age, and Sex fields. A separate inverted index is then built for the Name, Age, and Sex fields (the original figures showing these per-field indexes are omitted).

2. Posting List

As you can see, an inverted index is built for each field. The posting list is an array of ints storing the ids of all documents that match a given term. In addition, it also records the document count, the number of times the term appears in each document, the positions of those occurrences, the length of each document, the average length of all documents, and so on; these are used when computing relevance.
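
A minimal sketch of building such per-field posting lists (sorted arrays of document ids; term frequencies, positions and the other statistics are omitted, and the sample data is made up):

# Build a tiny per-field inverted index: field -> term -> sorted posting list of doc ids.
from collections import defaultdict

docs = {
    1: {"name": "Alice", "age": "18", "sex": "Female"},
    2: {"name": "Bob",   "age": "18", "sex": "Male"},
    3: {"name": "Carla", "age": "24", "sex": "Female"},
}

inverted = defaultdict(lambda: defaultdict(list))
for doc_id in sorted(docs):
    for field, term in docs[doc_id].items():
        inverted[field][term].append(doc_id)   # doc ids are appended in sorted order

print(inverted["age"]["18"])      # [1, 2]  posting list for age=18
print(inverted["sex"]["Female"])  # [1, 3]  posting list for sex=Female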

3. Term Dictionary

Suppose we have many terms, such as:

Carla,Sara,Elin,Ada,Patty,Kate,Selena

If they are kept in this order, finding a particular term is slow: the terms are unsorted, so every one of them must be scanned. After sorting they become:

Ada,Carla,Elin,Kate,Patty,Sara,Selena

Now binary search can find the target term much faster than a full scan. This is the term dictionary: with it, a term can be located in about log N disk seeks.
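
A small sketch of that lookup, with Python's bisect standing in for the log N probes over the sorted term dictionary:

# Binary search over the sorted term dictionary: O(log N) probes instead of a full scan.
import bisect

term_dictionary = ["Ada", "Carla", "Elin", "Kate", "Patty", "Sara", "Selena"]  # sorted terms

def lookup(term: str):
    i = bisect.bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return i            # position of the term (it would point at its posting list)
    return None             # term not present

print(lookup("Patty"))   # 4
print(lookup("Zoe"))     # None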

4. Term Index

But random disk reads are still very expensive (one random access takes roughly 10 ms), so we want to cache data in memory and touch the disk as little as possible. The term dictionary as a whole, however, is too large to fit entirely in memory; hence the term index. The term index is something like the chapter table at the front of a dictionary. For example:

Terms beginning with A …………. page Xxx

Terms beginning with C ……………. page Yyy

Terms beginning with E …………. page Zzz

If every term were made of English letters, the term index might really just be the 26 letters. But in practice terms are not necessarily English text; a term can be any byte array. Nor are terms evenly distributed across the 26 letters: there may be no term starting with "x" at all, while very many start with "s". The actual term index is a trie:

For example, a trie containing "A", "to", "tea", "ted", "ten", "i", "in", and "inn". The trie does not contain every term; it contains only prefixes of terms. The term index is used to quickly locate an offset within the term dictionary, from which a sequential search then proceeds.

Now we can answer why Elasticsearch/Lucene retrieval can be faster than MySQL. MySQL has only the term-dictionary layer, stored on disk sorted as a B-tree, so retrieving a term requires several random disk accesses. Lucene adds a term index on top of the term dictionary, and the term index is cached in memory as a tree. Only after the block position in the term dictionary has been found via the term index does the lookup go to disk for the term, which greatly reduces the number of random disk accesses.
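
A rough sketch of this two-level lookup, assuming for illustration that the in-memory term index simply stores the first term of each on-disk dictionary block:

# Two-level lookup: in-memory term index (prefix/first term -> block),
# then a short scan inside one block of the on-disk term dictionary.
import bisect

term_dictionary_blocks = [            # imagine each block is one disk read
    ["Ada", "Carla"],
    ["Elin", "Kate"],
    ["Patty", "Sara", "Selena"],
]
block_starts = ["Ada", "Elin", "Patty"]   # kept in memory

def find_term(term: str):
    block_no = bisect.bisect_right(block_starts, term) - 1   # which block could hold it
    if block_no < 0:
        return None
    block = term_dictionary_blocks[block_no]                 # the single "disk" access
    return (block_no, block.index(term)) if term in block else None

print(find_term("Sara"))    # (2, 1): block 2, second entry
print(find_term("Bob"))     # None: block 0 is read once, the term is absent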

5. FST (finite-state transducer)

In fact, the term index inside Lucene uses a "variant" of the trie, the FST. How is an FST better than a trie? A trie only shares prefixes, while an FST shares both prefixes and suffixes, saving even more space.

A FST is a 6-tuple (Q, I, O, S, E, f):

  • Q is a finite set of states
  • I is a finite set of input symbols
  • O is a finite set of output symbols
  • S is a state in Q called the initial state
  • E is a subset of Q, called the terminal state set
  • f is the transition function, f ⊆ Q × (I∪{ε}) × (O∪{ε}) × Q, where ε denotes the empty string.
    That is, from a state q1, on an input symbol i, the machine can move to another state q2 and produce an output o.

For example, there is the following set of mapping relationships:

cat -> 5
deep -> 10
do -> 15
dog -> 2
dogs -> 8

It can be represented by FST in the following figure:

[figure: FST representing the mapping above]
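
As a toy illustration of the prefix sharing only: the sketch below is just a trie with an output value at each final state. A real FST additionally shares suffixes and spreads outputs along its transitions, which is what saves the extra space:

# Toy prefix-sharing map for the example above (cat->5, deep->10, do->15, dog->2, dogs->8).
# NOTE: this is NOT a real FST; it is a plain trie with outputs at final states.
mapping = {"cat": 5, "deep": 10, "do": 15, "dog": 2, "dogs": 8}

def build_trie(pairs):
    root = {}
    for word, value in pairs.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse the same nodes
        node["$out"] = value                 # output attached at the final state
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$out")

trie = build_trie(mapping)
print(lookup(trie, "dog"))    # 2
print(lookup(trie, "dogs"))   # 8
print(lookup(trie, "cats"))   # None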

This article is very good: In-depth analysis of Lucene's dictionary FST

Why not just use a HashMap, or some other map structure? Memory consumption! A little lookup performance is sacrificed to save memory, the goal being to fit the entire term index in memory, and the net effect is greater speed. As shown above, the FST is a graph structure that compresses the suffixes of the trie. It keeps the trie's efficient lookup while being very compact, so when we search, the whole FST can be loaded into memory.

To sum up, the FST offers both a high compression ratio and high query efficiency. Because the dictionary must stay resident in memory and the FST compresses so well, the FST is used in many places in recent versions of Lucene and is the default dictionary data structure.

The complete structure of the dictionary

Lucene's .tip file holds the term index and the .tim file holds the term dictionary. As the figure shows, the .tip file stores multiple FSTs, and each FST stores pairs of <term prefix, on-disk location of the compressed block containing all terms with that prefix>. In other words, the term index is used to find the block position in the term dictionary, and only then does the lookup go to disk for the term, greatly reducing the number of random disk accesses.

[figure: structure of the .tip (term index) and .tim (term dictionary) files]

Intuitively, the term dictionary is like the body of the Xinhua Dictionary, which contains all the words, while the term index is like the index pages at the front, which tell you on which page a word can be found.

Note, however, that the FST cannot tell you the exact location of a term within the dictionary (.tim) file, nor can it tell you for certain whether the term exists at all. It can only say that the queried term may lie within certain blocks; it gives no definitive answer, because the FST is built from the prefixes of the blocks in the dictionary. Through the FST you can only find a block's file pointer in the .tim file; you cannot find the terms themselves directly.

How are indexes queried jointly?

Going back to the example above, given the filter condition age=18, the process is: first use the term index to find the approximate position of 18 in the term dictionary, then find the term 18 exactly in the term dictionary, and thereby obtain a posting list or a pointer to the posting list's location. Querying sex=Female works the same way. Finally, age=18 AND sex=Female means merging the two posting lists with an "AND".

This AND operation is not as easy as it sounds. In MySQL, for instance, if you index both the age and gender columns, only the more selective index is actually used for the query, and the other condition is applied in memory while the matching rows are scanned. So how can two indexes be used jointly? There are two ways:

  • Use the skip list data structure: traverse the gender and age posting lists simultaneously, each helping the other skip ahead;
  • Use the bitset data structure: compute a bitset for each of the gender and age filters and perform an AND operation on the two bitsets.

Elasticsearch supports both of these combined-index methods. If the query filter is cached in memory (as a bitset), the merge is an AND of the two bitsets. If the filter is not cached, the skip-list method is used to traverse the two posting lists on disk.

Merge with Skip List

Here is an example of how the skip-list idea is used for merging (see "Lucene Learning Summary Seven: Lucene Search Process Analysis (5)"):

  1. Start with the posting lists; each posting list is already sorted.

  2. Arrange the posting lists by the document number of their first entry, from smallest to largest.

  3. The pointer into the posting list with the smallest document number is called first, and the first document number of the last posting list is called doc (documents before it can obviously be skipped during the intersection). That is, doc = 8 and first points to item 0; advance it to the first document greater than 8, which is document 10, then set doc = 10 and let first point to item 1.

  4. doc = 10 and first points to item 1; advance to document 11, set doc = 11, and first points to item 2.

  5. doc = 11 and first points to item 3; advance to document 11, set doc = 11, and first points to item 4.

  6. And so on until first points to the last item: doc = 11 and first points to item 7; advance to document 11, set doc = 11, and first wraps around to item 0.

  7. doc = 11 and first points to item 0; advance to document 11, set doc = 11, and first points to item 1.

  8. doc = 11 and first points to item 1. Since 11 < 11 is false, the loop ends and doc = 11 is returned. Notice that when the loop exits, the first document of every posting list is 11, so 11 is a document common to all the posting lists.

  9. Repeating the outer loop in the same way yields the remaining common documents.

What is the Advance operation? It is the fast jump feature provided by the skip list.
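
A compact sketch of the conjunction loop built on the advance pattern. Here advance scans linearly; Lucene's skip lists let the same call jump forward in large strides:

# Conjunction (AND) of sorted posting lists via the "advance" pattern.
def advance(postings, pos, target):
    """Move pos forward until postings[pos] >= target; return the new pos (or len if exhausted)."""
    while pos < len(postings) and postings[pos] < target:
        pos += 1
    return pos

def intersect(lists):
    positions = [0] * len(lists)
    result = []
    while all(p < len(l) for p, l in zip(positions, lists)):
        target = lists[0][positions[0]]        # candidate doc from the first list
        agreed = True
        for i in range(1, len(lists)):
            positions[i] = advance(lists[i], positions[i], target)
            if positions[i] == len(lists[i]):
                return result                  # one list is exhausted: done
            if lists[i][positions[i]] != target:
                # Someone overshot: restart with the larger doc as the new target.
                positions[0] = advance(lists[0], positions[0], lists[i][positions[i]])
                agreed = False
                break
        if agreed:
            result.append(target)              # target appears in every list
            positions[0] += 1
    return result

print(intersect([[2, 8, 10, 11, 15], [3, 8, 11, 12], [1, 8, 11, 20]]))  # [8, 11]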

On the other hand, for a very long posting list, such as:

[1,3,13,101,105,108,255,256,257]

We can divide this list into three blocks:

[1,3,13] [101,105,108] [255,256,257]

Then the second layer of the skip list can be constructed:

[1,101,255]

1, 101, and 255 each point to their corresponding block, so a jump across blocks can quickly land at the right position.

Naturally, Lucene also compresses each block. The compression scheme is called Frame of Reference encoding. For example:

[figure: Frame of Reference encoding example]
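
A minimal sketch of the Frame of Reference idea: delta-encode the sorted doc ids and pack each delta with just enough bits. Lucene's real encoding works on fixed-size blocks and has further refinements:

# Frame of Reference sketch: delta-encode a sorted posting list, then bit-pack the deltas.
def for_encode(postings):
    deltas = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    bits = max(d.bit_length() for d in deltas) or 1     # bits needed per delta
    packed = 0
    for d in deltas:
        packed = (packed << bits) | d                   # pack all deltas into one big int
    return bits, len(deltas), packed

def for_decode(bits, count, packed):
    deltas = []
    for _ in range(count):
        deltas.append(packed & ((1 << bits) - 1))
        packed >>= bits
    deltas.reverse()
    postings, acc = [], 0
    for d in deltas:
        acc += d
        postings.append(acc)
    return postings

plist = [1, 3, 13, 101, 105, 108, 255, 256, 257]
encoded = for_encode(plist)
print(encoded[0], "bits per value")      # 8 bits per value instead of 32
print(for_decode(*encoded) == plist)     # True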

Consider frequently occurring terms (so-called low-cardinality values), such as male or female for a gender field. With 1 million documents, the posting list for gender = male would contain roughly 500,000 int values. Frame of Reference encoding greatly reduces the disk space they take, an optimization that matters a lot for index size. MySQL's B-tree also contains a similar posting-list-like structure, but it is not compressed this way.

Frame of Reference encoding does have a decompression cost, however. Besides skipping part of the traversal, the skip list also lets us skip decompressing the compressed blocks we never visit, which saves CPU.

It can also be seen that Lucene has really achieved the ultimate in saving memory.

Merge with bitset

A bitset is a very intuitive data structure. For a posting list such as:

[1,3,4,7,10]

The corresponding bitset is:

[1,0,1,1,0,0,1,0,0,1]

Each document corresponds to one bit, ordered by document id. Bitsets compress well: one byte can represent 8 documents, so 1 million documents need only 125,000 bytes. But with potentially billions of documents, keeping bitsets in memory is still a luxury, and each filter consumes its own bitset: caching age=18 takes one bitset, and another filter such as 18 <= age < 25 takes another.
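
A tiny sketch of turning posting lists into bitsets and AND-ing them, using a Python int as the bit array (the two filters here are made-up examples):

# Posting list -> bitset (one bit per doc id), then an AND merge of two filters.
def to_bitset(posting_list):
    bits = 0
    for doc_id in posting_list:
        bits |= 1 << doc_id
    return bits

def from_bitset(bits):
    docs, doc_id = [], 0
    while bits:
        if bits & 1:
            docs.append(doc_id)
        bits >>= 1
        doc_id += 1
    return docs

age_18 = to_bitset([1, 3, 4, 7, 10])
female = to_bitset([1, 4, 8, 10])
print(from_bitset(age_18 & female))   # [1, 4, 10] -> documents matching both filters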

So the trick is a data structure that:

  • can represent, in compressed form, hundreds of millions of bits indicating whether each document matches the filter;
  • can still perform AND and OR logical operations quickly while compressed.

The data structure Lucene uses is called a Roaring Bitmap.

[figure: Roaring Bitmap block layout]

The idea behind the compression is actually very simple: rather than spending 100 bits to store 100 zeros, store the 0 once and declare that it repeats 100 times.

Why is 65535 the limit? In the programmer's world, besides 1024, 65535 is another classic value: it equals 2^16 - 1, the largest number that fits in 2 bytes (a short). Note the last line in the figure above: "If a block has more than 4096 values, encode as a bit set, and otherwise as a simple array using 2 bytes per value." A dense block is stored as a bitset, while a small, sparse block is stored as a plain array of 2-byte values, which conveniently fits a short[].
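
A rough sketch of that container idea (illustrative only): split each doc id into a high 16-bit container key and a low 16-bit value, and store each container either as a sorted array of 2-byte values or, when it holds more than 4096 values, as a bitmap:

# Roaring Bitmap sketch: doc id = (high 16 bits -> container key, low 16 bits -> value).
from collections import defaultdict

ARRAY_LIMIT = 4096   # above this, a plain bitmap (65536 bits = 8 KB) is cheaper

def build_roaring(doc_ids):
    containers = defaultdict(list)
    for doc in sorted(doc_ids):
        containers[doc >> 16].append(doc & 0xFFFF)    # key = high bits, value = low bits
    result = {}
    for key, values in containers.items():
        if len(values) > ARRAY_LIMIT:
            bitmap = bytearray(65536 // 8)            # dense: one bit per possible value
            for v in values:
                bitmap[v >> 3] |= 1 << (v & 7)
            result[key] = ("bitmap", bitmap)
        else:
            result[key] = ("array", values)           # sparse: 2 bytes per value
    return result

def contains(roaring, doc):
    entry = roaring.get(doc >> 16)
    if entry is None:
        return False
    kind, data = entry
    v = doc & 0xFFFF
    if kind == "array":
        return v in data
    return bool(data[v >> 3] & (1 << (v & 7)))

r = build_roaring([1, 65535, 65536, 131074])
print(contains(r, 65536), contains(r, 2))   # True False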

Since Lucene 7.0, Lucene chooses the storage format according to how dense the bitset is: when it is sparse, the doc IDs are stored directly; when it is dense, the raw bits of the bitset are stored. Choosing the structure according to the data distribution improves both space utilization and traversal efficiency.

Summary

To make indexing and searching efficient, Elasticsearch/Lucene applies ingenious data structures and designs from the top layer to the bottom, relying on solid theory and relentless optimization to push query performance to the limit.


Reposted from: https://www.jianshu.com/p/b50d7fdbe544

Origin: blog.csdn.net/my8688/article/details/100095694