ElasticSearch high performance design

1. Introduction

2. Database search vs. ES search

2.1 Three problems of database search

Capacity problem: when an e-commerce site has hundreds of millions of products, a single table becomes too large and must be split into multiple tables, and when a single database's disk fills up, the data must be sharded across multiple databases (for example with mycat).

Performance problem: MySQL has to use the like keyword for fuzzy queries. Only a trailing-wildcard (suffix) match can use an index; leading-wildcard and full-wildcard matches cannot. So when searching for a keyword such as "laptop", the product-name field of hundreds of millions of rows has to be scanned row by row, and performance cannot keep up.

No word segmentation: the database can only match data that contains the keyword exactly as entered. When a user searches for "laptop", should records that only mention "computer" also be returned? Without word segmentation the database cannot handle this.

Precisely because of these defects, database search is only suitable for in-site/vertical search over relatively small amounts of data; for Internet search over PB-level data, database search is definitely not the tool.

The birth of Lucene solved these three database problems: capacity is effectively unlimited, so there is no need to split tables and databases to store large amounts of data; fuzzy queries are very fast; and queries go through word segmentation. Lucene is really just a jar package that encapsulates a full-text retrieval engine and search algorithms. During development, you import the lucene jar and build search features against its API; under the hood it builds an index library on disk.

But Lucene has one big flaw: it is a single-instance search engine, which does not meet the needs of the distributed Internet. Hence Elasticsearch, a multi-instance, distributed, clustered search engine. Each Elasticsearch instance is essentially a Lucene instance; all nodes are peers, with no master-slave relationship between them.

Elasticsearch is written in Java, so it is very convenient to integrate with a Spring Boot project.

2.2 Elasticsearch

Elasticsearch official website: https://www.elastic.co/cn/products/elasticsearch1.3.2

2.2.1 Two functions of Elasticsearch

Two functions of Elasticsearch: distributed search engine and data analysis engine

Search: Internet search, e-commerce website search, OA system query

Data analysis: an e-commerce site asking which book categories were in the top ten sellers over the past week; a news site asking for the ten most-read keywords of the last three days; public opinion analysis.

The difference between Lucene and Elasticsearch:
Lucene: the most advanced and powerful search library; developing directly against Lucene is very complex, and so is its API.
Elasticsearch: built on Lucene, it hides that complexity behind RESTful APIs and language clients, such as Java's high-level client (Java High Level REST Client) and low-level client (Java Low Level REST Client).

2.2.2 Two Features of Elasticsearch

Distributed: ES can automatically distribute massive data to multiple servers for storage and retrieval, perform parallel queries, and improve search efficiency. In contrast, Lucene is a stand-alone application.

Near real-time: querying hundreds of millions of rows in a database can take hours, which is batch processing. ES can query massive data in seconds, which is why it is described as near real-time.

2.3 Elasticsearch core concepts vs. database core concepts

Relational database (e.g., MySQL)    Non-relational database (Elasticsearch)
Database                             Index
Table                                Index (formerly Type)
Row                                  Document
Column                               Field
Schema (constraints)                 Mapping

In addition, ES has counterparts to MySQL operations such as order by and group by, plus a series of aggregation functions (avg, sum, max, min). Let's now walk through the ES concepts from the table above.

2.3.1 Index: Index

An index contains a collection of documents with a similar structure.

Index creation rules:
(1) Only lowercase letters
(2) Cannot contain special symbols such as \, /, *, ?, ", <, >, |, # or space characters
(3) Cannot contain a colon (:) since version 7.0
(4) Cannot start with -, _ or +
(5) Cannot exceed 255 bytes (note that the limit is in bytes, so multi-byte characters count more toward the 255 limit)
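The rules above can be summarized in a small check like the following Java sketch (illustrative only, not Elasticsearch's own validation code):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Illustrative check of the index-naming rules listed above.
public class IndexNameRules {

    // Characters forbidden in index names (colon treated as forbidden per the 7.0+ rule).
    private static final Pattern FORBIDDEN = Pattern.compile("[\\\\/*?\"<>|#:\\s]");

    public static boolean isValid(String name) {
        if (name == null || name.isEmpty()) return false;
        if (!name.equals(name.toLowerCase())) return false;               // lowercase only
        if (FORBIDDEN.matcher(name).find()) return false;                 // no special chars or spaces
        if (name.startsWith("-") || name.startsWith("_") || name.startsWith("+")) return false;
        return name.getBytes(StandardCharsets.UTF_8).length <= 255;       // byte length, not char length
    }

    public static void main(String[] args) {
        System.out.println(isValid("product_index_2022")); // true
        System.out.println(isValid("MyIndex"));             // false: contains uppercase
        System.out.println(isValid("_hidden"));             // false: starts with '_'
    }
}
```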

2.3.2 Type: Type

Each index can have one or more types. A type is a logical data classification within an index; documents under a type share the same fields. Type was officially removed in version 7.x.

Question: Why does ES introduce Type?
Answer: Because the concept of the relational database was proposed earlier than that of non-relational databases, and it is very mature and widely used, many NoSQL systems (including MongoDB, Elasticsearch, etc.) later borrowed and extended the basic concepts of traditional relational databases. Since a concept corresponding to a relational database table was needed, type emerged to fill that role.

Question: How has type changed across ES versions?
Answer: In version 5.X, multiple types can be created under one index;
in version 6.X, only one type can exist under one index;
in version 7.X, the concept of type is directly removed, that is to say index no longer has a type.

Question: Why does the 7.X version remove Type?
Answer: Because Elasticsearch's early design directly borrowed the relational-database model, the concept of type (data table) existed. However, its search engine is based on Lucene, and that "gene" makes type redundant: Lucene's full-text search is fast because of the inverted index, and that inverted index is built per index, not per type. Multiple types can actually slow searches down. In keeping with Elasticsearch's "everything for search" purpose, a change such as removing type is understandable and worthwhile.

Question: Why not just remove the type directly at the beginning of version 6.X, but gradually remove the type?
Answer: For historical reasons. In the early days Elasticsearch supported multiple types under one index, and many projects were using Elasticsearch as a database. Removing the concept of type outright would not only force major business, feature and code changes on projects using Elasticsearch, but would also be a huge challenge for the Elasticsearch team itself (this is major surgery, since a lot of type-related source code would have to be modified). Weighing the pros and cons, they adopted a gradual transition and postponed the revolutionary "remove type" change until version 7.X.

2.3.3 Document: Document

The smallest unit of data in ES. A document is like a row in a database table and is usually represented in JSON format. An index (Index) stores many documents.

2.3.4 Field: Field

Just like columns in a database, fields define the attributes each document has.

2.3.5 Shard: Sharding

When the index data is too large, the data in the index is divided into multiple shards and stored on each server in a distributed manner. It can support massive data and high concurrency, improve performance and throughput, and make full use of the cpu of multiple machines.

2.3.6 Replica: Replica

In a distributed environment, any machine can go down at any time. If it does, one of the index's shards becomes unavailable and the index can no longer be searched completely. To keep data safe, each shard of an index is backed up and stored on a different machine, so the ES cluster can still serve searches even if a few machines go down. The shards that normally serve queries and writes are called primary shards; their backups are called replica shards.
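To make shards and replicas concrete, here is a minimal sketch using the 7.x Java High Level REST Client to create an index with 3 primary shards, each with 1 replica; the index name "products", the local address and the counts are arbitrary example values:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // 3 primary shards spread the data across nodes; 1 replica per primary
            // keeps the index searchable if one node holding a primary goes down.
            CreateIndexRequest request = new CreateIndexRequest("products");
            request.settings(Settings.builder()
                    .put("index.number_of_shards", 3)
                    .put("index.number_of_replicas", 1));

            client.indices().create(request, RequestOptions.DEFAULT);
        }
    }
}
```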

2.4 Elasticsearch document storage

Let's first talk about how Elasticsearch stores documents. Elasticsearch is a document-oriented database: a piece of data here is a document, and JSON is used as the document serialization format. For example, the following user data:

{
    "name": "carl",
    "sex": "Male",
    "age": 18,
    "birthDate": "1990/05/10",
    "interests": [ "sports", "music" ]
}

With a database such as MySQL, you would naturally create a User table with fields like name and sex. In Elasticsearch this is a document, the document belongs to a User type, and various types live inside an index. Here is a simple comparison of Elasticsearch and relational-database terms:

Relational database ⇒ Database ⇒ Table ⇒ Row ⇒ Column (Columns)

Elasticsearch ⇒ Index ⇒ Type (officially removed in 7.x) ⇒ Document (Documents) ⇒ Field (Fields)

An Elasticsearch cluster can contain multiple indexes (databases), which in turn contain many types (tables). These types hold many documents (rows), and each document has many fields (columns). You can interact with Elasticsearch through the Java API or directly through the HTTP RESTful API.
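As a concrete example, here is a minimal sketch that stores the user document above, assuming a local node at localhost:9200 and the 7.x Java High Level REST Client (the index name "user" and document id "1" are arbitrary choices):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class IndexUserDocument {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // The user document from the section above, serialized as JSON.
            String userJson = "{"
                    + "\"name\": \"carl\","
                    + "\"sex\": \"Male\","
                    + "\"age\": 18,"
                    + "\"birthDate\": \"1990/05/10\","
                    + "\"interests\": [\"sports\", \"music\"]"
                    + "}";

            // Store the document in the "user" index with id 1.
            IndexRequest request = new IndexRequest("user")
                    .id("1")
                    .source(userJson, XContentType.JSON);

            IndexResponse response = client.index(request, RequestOptions.DEFAULT);
            System.out.println(response.getResult()); // CREATED or UPDATED
        }
    }
}
```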

3. ElasticSearch high-performance design

Building on Lucene, ElasticSearch solves the relational database's problems of single-table capacity, query performance, and lack of word-segmented queries; in other words, it stores more and searches faster and smarter. So how does ElasticSearch do it? It comes down to a series of high-performance designs inside ElasticSearch; let's get acquainted with them.

3.1 The design of inverted index makes ElasticSearch query faster

Why is ElasticSearch faster than a traditional database query? Because ElasticSearch is based on an inverted index, while a traditional database is based on B-tree/B+ tree indexes.

Traditional database: binary-tree search runs in O(log n), and inserting a new node does not require moving all existing nodes, so tree structures are used to store indexes, balancing insert and query performance (AVL trees). On that basis, combined with the disk's read characteristics (sequential vs. random reads), the multi-way search tree (B-tree) appeared. Traditional relational databases therefore use B-Tree/B+Tree structures: to improve query efficiency and reduce the number of disk seeks, multiple values are stored in one contiguous node, so a single seek reads more than one value and the height of the tree is reduced as well.

ElasticSearch: ES's search data model is built on an inverted index. An inverted index means that during data storage, terms produced by word segmentation are used to build a term index library. The inverted index arises from the practical need to look up records by attribute value: each entry in the index holds an attribute value and the addresses of all records that have that value. Because records are located via attribute values, rather than attribute values being located via records, it is called an inverted index. A file storing an inverted index is called an inverted index file, or inverted file for short.

Question: How does ElasticSearch store structured data?
Answer:

We get three pieces of structured data:

ID Name Age Sex
1 ding dong 18 Female
2 Tom 50 Male
3 carl 18 Male

ID is the document id assigned by Elasticsearch; the indexes Elasticsearch builds are as follows:
Name:

Term Posting List
ding dong 1
Tom 2
carl 3

Age:

Term Posting List
50 2
18 [1,3]

Sex:

Term Posting List
Female 1
Male [2,3]

Elasticsearch establishes a mapping from each field value to document IDs, and this mapping is called an inverted index. Whatever the field type, the IDs in it are stored in a structure called a Posting List, the inverted list.

In the three tables above, the left column (ding dong, Tom, carl, 18, Female, ...) holds the terms, and the right column (e.g., [2,3]) is the Posting List (inverted list). The posting list is an array of ints storing all document IDs that match a term.
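To make the idea concrete, here is a toy sketch (not how Lucene actually stores it) of an inverted index for the Age field: each term maps to its posting list of document ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: each distinct term maps to the posting list of document ids containing it.
public class TinyInvertedIndex {
    // TreeMap keeps the terms sorted, which the term dictionary relies on later.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    public void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    public List<Integer> search(String term) {
        return postings.getOrDefault(term, List.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex age = new TinyInvertedIndex();
        age.add("18", 1);   // ding dong
        age.add("50", 2);   // Tom
        age.add("18", 3);   // carl

        System.out.println(age.search("18")); // [1, 3]
    }
}
```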

With the posting-list index you can search quickly. For example, to find the entries with age = 18, the answer is documents 1 and 3. But what if there are tens of millions of records? The answer is the Term Dictionary.

Term Dictionary

To find a term quickly, Elasticsearch sorts all terms and locates a term by binary search, with log(N) search efficiency, just like looking something up in a dictionary. This is the Term Dictionary. So far this looks much like a traditional B-tree; where is the progress? The answer: keep a much smaller dictionary, the Term Index, in memory.
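A minimal sketch of that idea (illustrative only): once the terms are sorted, an ordinary binary search locates any term in log(N) comparisons.

```java
import java.util.Arrays;

// Sketch: the term dictionary is just the sorted list of all terms,
// so a term can be located in O(log N) comparisons with binary search.
public class TermDictionaryLookup {
    public static void main(String[] args) {
        String[] termDictionary = {"carl", "ding dong", "Female", "Male", "Tom"};
        Arrays.sort(termDictionary);                       // terms must be sorted first

        int pos = Arrays.binarySearch(termDictionary, "Male");
        System.out.println(pos >= 0 ? "found at slot " + pos : "not found");
    }
}
```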

Term Index

B-Tree improves query performance by reducing the number of disk seeks. Elasticsearch adopts the same idea: look terms up in memory rather than reading the disk. But if there are too many terms, the term dictionary becomes too large to hold in memory, hence the Term Index. Like the index page of a dictionary ("words starting with A are on these pages"), the term index can be understood as a tree that does not contain all terms, only some of their prefixes. Through the term index you can quickly locate an offset into the term dictionary and then scan sequentially from that position.

Therefore, the term index does not store every term, only the mapping from some prefixes to blocks of the Term Dictionary, which makes it small enough to cache in memory. After locating the corresponding term-dictionary block via the term index, ES goes to disk to find the term itself, which greatly reduces random disk reads.
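A rough sketch of the idea (the prefixes, block boundaries and offsets below are invented for illustration): keep only the first term of each on-disk block in a sorted in-memory map, and use a floor lookup to decide which block to scan.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a term index: only the first term of each on-disk block of the
// term dictionary is kept in memory, mapped to that block's offset.
public class TermIndexSketch {
    public static void main(String[] args) {
        TreeMap<String, Long> termIndex = new TreeMap<>();
        termIndex.put("a", 0L);        // block starting with "a..."  lives at offset 0
        termIndex.put("mo", 4096L);    // block starting with "mo..." lives at offset 4096
        termIndex.put("st", 8192L);    // block starting with "st..." lives at offset 8192

        // To find "stop": the greatest indexed prefix <= "stop" is "st",
        // so only the block at offset 8192 has to be read and scanned on disk.
        Map.Entry<String, Long> block = termIndex.floorEntry("stop");
        System.out.println("scan block at offset " + block.getValue());
    }
}
```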

[Figure: the term index held in memory points to blocks of the term dictionary on disk, which in turn point to posting lists]

Block: a file system does not read data sector by sector (too slow); instead it reads block by block, and a block is the smallest unit of file access.

3.2 The design of FST allows ElasticSearch to store Term Index with minimal memory

Finite State Transducers, abbreviated FST and usually translated as finite state transducer, is a technique that maps a byte sequence (here, a term prefix) to an output value (here, a block position).

Suppose we want to map mop, moth, pop, star, stop and top (term prefixes in the term index) to the sequence numbers 0, 1, 2, 3, 4, 5 (block positions in the term dictionary). The simplest way is to define a Map<String, Integer> and let every word find its own seat, but from the perspective of memory usage, is there a better way? The answer is: FST.

[Figure: FST mapping mop, moth, pop, star, stop, top to 0–5]
Looking at the picture, we can see:
mop = 0 + 0 + 0 = 0
moth = 0 + 0 + 1 + 0 = 1
pop = 2 + 0 + 0 = 2
star = 3 + 0 + 0 + 0 = 3
stop = 3 + 0 + 1 + 0 = 4
top = 5 + 0 + 0 = 5

⭕ represents a state, --> represents a state transition, and the letter/number above an edge is the input label and its weight. Words are split into individual letters represented by ⭕ and -->; weights of 0 are not shown. If a branch occurs after a state, the edge is given a weight, and the weights along an entire path add up to the word's sequence number. While traversing each edge, its output is added: for example, the input stop passes through s/3 and o/1, so the accumulated order is 4; for mop the resulting order is 0.
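The following toy sketch reproduces the example: it hard-codes the edges of the little transducer above (the state numbers are arbitrary) and sums the edge weights along a word's path to recover its ordinal. It only illustrates the idea; Lucene's real FST implementation is far more compact.

```java
import java.util.HashMap;
import java.util.Map;

// Toy walk-through of the FST example: summing the output weights along a word's
// path yields that word's ordinal (i.e. the block position in the term dictionary).
public class FstWalkthrough {
    // transitions.get(state).get(letter) = {nextState, outputWeight}
    private static final Map<Integer, Map<Character, int[]>> transitions = new HashMap<>();

    private static void edge(int from, char c, int to, int weight) {
        transitions.computeIfAbsent(from, s -> new HashMap<>()).put(c, new int[]{to, weight});
    }

    static {
        edge(0, 'm', 1, 0); edge(0, 'p', 4, 2); edge(0, 's', 6, 3); edge(0, 't', 10, 5);
        edge(1, 'o', 2, 0);
        edge(2, 'p', 99, 0); edge(2, 't', 3, 1);   // "mop" ends here; "mot..." pays +1
        edge(3, 'h', 99, 0);
        edge(4, 'o', 5, 0); edge(5, 'p', 99, 0);
        edge(6, 't', 7, 0);
        edge(7, 'a', 8, 0); edge(7, 'o', 9, 1);    // "sta..." vs "sto..." (+1)
        edge(8, 'r', 99, 0); edge(9, 'p', 99, 0);
        edge(10, 'o', 11, 0); edge(11, 'p', 99, 0);
    }

    static int ordinal(String word) {
        int state = 0, sum = 0;
        for (char c : word.toCharArray()) {
            int[] next = transitions.get(state).get(c);
            sum += next[1];      // accumulate the edge weight
            state = next[0];     // move to the next state
        }
        return sum;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"mop", "moth", "pop", "star", "stop", "top"}) {
            System.out.println(w + " -> " + ordinal(w));   // prints 0, 1, 2, 3, 4, 5
        }
    }
}
```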

However, this tree does not contain all terms, only many term prefixes. Through these prefixes you can quickly locate the disk block the prefix belongs to and then find the document list in that block. To further compress the dictionary, each block actually stores only the parts that differ: for example, if mop and moth are in the same block starting with mo, the dictionary for that block stores only p and th, which roughly doubles space utilization.

Using finite state transducers consumes far less memory than a SortedMap, at the cost of more CPU during queries. Wikipedia's index uses FST and takes only 69MB of space; building an index for nearly 10 million entries took about 8 seconds and used less than 256MB of heap.

In ES there is a query type called fuzzy query, which matches based on the edit distance between the search term and field values. Before Lucene 4.0 (on which ES is built), a fuzzy query computed the edit distance between the search term and every term to filter out those within the distance threshold; from 4.0 onward, an approach developed by Robert lets the finite state transducer be searched directly for words within a given edit distance, making fuzzy queries more than 100 times faster.

Now the dictionary has been compressed into a term index small enough to fit in memory, and the document list can be found quickly through it. The next question: will storing all the document ids on disk take too much space? With 100 million documents and 10 fields each, saving the posting lists needs on the order of a billion integers, and the disk-space consumption is huge. ES uses a more ingenious way to store all those ids: the Frame Of Reference compression technique.

3.3 The design of Frame Of Reference allows ElasticSearch to store the Posting List with the smallest disk

Frame Of Reference can be translated into indexed frames

Besides compressing the term index with FST as described above, Elasticsearch also compresses the posting list. To make compression easy, Elasticsearch requires the posting list to be ordered (a requirement imposed purely to improve search performance). And to reduce storage space, all ids are delta-encoded (incremental encoding).

For example, take the id list [73, 300, 302, 332, 343, 372]. It is converted into a list of each id's increment over the previous id (the id before the first one is taken as 0, so the first increment is the id itself): [73, 227, 2, 30, 11, 29]. In this new list all values are less than 256, so each one needs only one byte of storage.

Explanation: Elasticsearch requires the posting list to be ordered. Except for the first id, every id is stored as its increment over the previous id. For the id list [73, 300, 302, 332, 343, 372], Elasticsearch does not store the ids directly but computes the increments:
the first id, 73, is stored directly;
the second id is 300, but instead of 300 the increment 300 - 73 = 227 is stored;
the third id is 302, but instead of 302 the increment 302 - (73 + 227) = 2 is stored;
and so on. In the end the posting list Elasticsearch actually stores is [73, 227, 2, 30, 11, 29]: the stored numbers become smaller, each value fits in a smaller data type, and disk space is saved.
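A minimal sketch of this delta encoding and its inverse (illustrative only):

```java
import java.util.Arrays;

// Sketch of the delta (incremental) encoding described above:
// a sorted posting list is rewritten as each id's difference from the previous id.
public class DeltaEncoding {
    static int[] encode(int[] sortedIds) {
        int[] deltas = new int[sortedIds.length];
        int previous = 0;                        // the id "before" the first id is taken as 0
        for (int i = 0; i < sortedIds.length; i++) {
            deltas[i] = sortedIds[i] - previous;
            previous = sortedIds[i];
        }
        return deltas;
    }

    static int[] decode(int[] deltas) {
        int[] ids = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            ids[i] = running;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] postingList = {73, 300, 302, 332, 343, 372};
        System.out.println(Arrays.toString(encode(postingList)));         // [73, 227, 2, 30, 11, 29]
        System.out.println(Arrays.toString(decode(encode(postingList)))); // original list back
    }
}
```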

In fact ES goes further: it splits all the documents into blocks, each containing exactly 256 document ids, delta-encodes each block separately, computes the maximum number of bits needed to store any id in that block, and puts that bit count as header information in front of the block. This technique is called Frame of Reference, which translates into indexed frames.

For example, to compress the list above (assuming for simplicity that each block holds only 3 ids instead of 256), the process is as follows:

[Figure: delta-encoding [73, 300, 302, 332, 343, 372] and packing the increments into blocks with bit-width headers]
If you stored the six values directly as ints, at 4 bytes each you would need 24 bytes. With incremental encoding the list is split into blocks and each block gets a one-byte header recording how many bits each value needs. For the first block the largest value is 227; since 2^7 = 128 and 2^8 = 256, each value needs 8 bits, so the first block takes 4 bytes (one header byte plus three values of one byte each). For the second block the largest value is 30; since 2^5 = 32, 5 bits per value are enough, so the second block needs 3 bytes (one header byte plus 3 × 5 = 15 bits of values, which fit in two bytes). The two blocks together take only 7 bytes, versus 24 bytes for plain ints: less than 1/3 of the space.

The essence of incremental encoding: because the posting list is sorted, the stored increments are smaller than the original ids, so each number can be represented with fewer bits.

Eight bits make a byte. The principle of this compression algorithm is to turn originally large numbers into small ones by taking increments, store only the increments, then carefully pack the values bit by bit and finally store them by bytes, instead of carelessly spending an int (4 bytes) even on a value like 2.
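The sizing arithmetic from the example above can be sketched as follows (the block size of 3 is kept from the example; real blocks hold 256 ids, and this is not Lucene's actual bit-packing code):

```java
// Sketch of Frame of Reference sizing: split the delta-encoded list into blocks,
// find the widest value per block, and store one header byte plus that many bits per value.
public class FrameOfReferenceSizing {
    static int bitsNeeded(int maxValue) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxValue));
    }

    static int blockBytes(int[] block) {
        int max = 0;
        for (int v : block) max = Math.max(max, v);
        int bits = bitsNeeded(max);
        int payloadBytes = (bits * block.length + 7) / 8;   // round up to whole bytes
        return 1 + payloadBytes;                            // 1 header byte + packed values
    }

    public static void main(String[] args) {
        int[] block1 = {73, 227, 2};   // widest value 227 -> 8 bits each -> 1 + 3 = 4 bytes
        int[] block2 = {30, 11, 29};   // widest value 30  -> 5 bits each -> 1 + 2 = 3 bytes
        System.out.println(blockBytes(block1) + blockBytes(block2) + " bytes in total"); // 7 bytes
    }
}
```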

When returning results, there is no need to decompress everything and return it all at once; an iterator can be returned instead, and the compressed ids are extracted one by one through the iterator's next method, which also saves a lot of computation and memory overhead.

The above approach greatly reduces the posting list's space consumption and improves query performance. To improve the performance of filter queries even further, ES does more work: caching.

3.4 The design of Roaring Bitmaps allows ElasticSearch to cache filters with minimal disk storage

ES caches filter queries that are used relatively frequently. The principle is simple: generate a mapping from (filter, segment data space) to an id list. Unlike the inverted index, only commonly used filters are cached, while the inverted index is stored in full. The filter cache must be fast enough, otherwise querying directly would be just as good. ES keeps the cached filter in memory and puts the mapped posting list on disk.

The compression method ES uses for the filter cache differs from that of the inverted index: the filter cache uses the roaring bitmap data structure. Compared with the Frame of Reference approach above, it consumes less CPU at query time and is more efficient to query, at the cost of more storage space (disk).

Roaring Bitmap is an improvement on two data structures, the int array and the bitmap: an int array is fast but space-hungry, while a bitmap has relatively small space consumption yet needs 12.5MB no matter how many documents it covers; even a single document costs 12.5MB, which is not cost-effective. After weighing the trade-offs, the Roaring Bitmap described below was devised.

3.4.1 Understanding the Roaring Bitmaps data structure

Bitmap is a data structure. Suppose there is a posting list: [1, 3, 4, 7, 10].
The corresponding bitmap is: [1, 0, 1, 1, 0, 0, 1, 0, 0, 1].
It is very intuitive: 0/1 indicates whether a value is present. For example, the value 10 corresponds to the 10th bit, whose value is 1. This way one byte can represent 8 document ids. Older versions of Lucene (before 5.0) compressed posting lists this way, but this compression is still not efficient enough: with 100 million documents, it needs 12.5MB of storage space, and that is for just one indexed field (and we often have many indexed fields). So someone came up with a more efficient data structure: Roaring bitmaps.
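A tiny sketch of the bitmap idea using java.util.BitSet (illustrative only):

```java
import java.util.BitSet;

// Sketch of the bitmap idea: one bit per possible document id,
// set to 1 if that id is present in the posting list.
public class PostingListBitmap {
    public static void main(String[] args) {
        int[] postingList = {1, 3, 4, 7, 10};

        BitSet bitmap = new BitSet();
        for (int id : postingList) bitmap.set(id);

        System.out.println(bitmap);          // {1, 3, 4, 7, 10}
        System.out.println(bitmap.get(7));   // true  -> document 7 matches
        System.out.println(bitmap.get(8));   // false -> document 8 does not
    }
}
```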

Explanation: why 100 million documents need 12.5MB. A bitmap stores one bit per document, so the space is 1/8 of a byte per document: 100 million = 1×10^8 bits; divided by 10^6 (1M) that is 100 Mbit, and divided by 8 that is 12.5 MB.

The disadvantage of a bitmap is that its storage space grows linearly with the number of documents. Roaring bitmaps must exploit some characteristics of the data to break this curse:
(1) Roaring Bitmap first assigns each id to a block according to its high 16 bits. For example, the ids of the first block fall in 0–65535 and the ids of the second block in 65536–131071.
(2) Within each block, the ids are stored in one of two ways depending on how many there are:
a. If the count is less than 4096, a short array is used.
b. If the count is greater than or equal to 4096, a bitmap is used.

3.4.2 Using Roaring Bitmaps data structure in ES

Within each block, a value actually needs only 2 bytes to store, because all ids in a block share the same high 16 bits; the high 16 bits serve as the block id. Both the block id and the low 16 bits of the document id are saved as shorts.

[Figure: Roaring bitmap layout — ids split by high 16 bits into blocks, each block stored as a short array or a bitmap]

As for the dividing line of 4096: when the count is below 4096, a bitmap would still need 8KB of space, while a 2-byte-per-value array consumes less. For example, with only 2048 values at 2 bytes each, the array needs just 4KB in total, whereas the bitmap needs 8KB.

Explanation: within each block, a value needs only 2 bytes to store, because N % 65536 ranges over 0–65535, which requires 16 bits, so two bytes are enough; Java's short type can be used directly.

Explanation: for the data in each block, if the count is less than 4096 it is stored in a short array (each element a short); if the count is greater than or equal to 4096 it is stored in a bitmap.
At exactly 4096 values (2^12 = 4 × 2^10 = 4K), a bitmap needs 8KB and a short array also needs 8KB:
short array: 4096 values = 2^12 = 4K, so 4K × 2B = 8KB
bitmap: N % 65536 ranges over 0–65535, so each block spans at most 2^16 values; a bitmap for them needs 2^16 bits = 8KB (8KB = 8 × 1K × 1B(8b) = 2^3 × 2^10 × 2^3 bits = 2^16 bits)

Explanation: with only 2048 values at 2 bytes each, a short array takes 4KB in total, but the bitmap still needs 8KB:
short array: 2048 values = 2^11 = 2K, so 2K × 2B = 4KB
bitmap: as above, each block spans at most 2^16 values, so the bitmap always needs 2^16 bits = 8KB

Summary: whether you store 1 value or 65536 values, a bitmap costs 8KB. When fewer than 4096 values are stored, a short array occupies less space and is the better choice.
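A simplified sketch of that container choice (the 4096 threshold and the high/low 16-bit split follow the description above; real implementations such as the RoaringBitmap library are far more involved):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch: document ids are grouped by their high 16 bits, and each group
// keeps only the low 16 bits, either in a short[] (fewer than 4096 values)
// or in a 65536-bit bitmap (4096 or more values).
public class RoaringSketch {
    private static final int THRESHOLD = 4096;

    static Map<Integer, Object> build(int[] sortedDocIds) {
        // first, bucket the low 16 bits of each id by its high 16 bits
        Map<Integer, List<Integer>> buckets = new TreeMap<>();
        for (int id : sortedDocIds) {
            buckets.computeIfAbsent(id >>> 16, k -> new ArrayList<>()).add(id & 0xFFFF);
        }
        // then pick the cheaper container for each bucket
        Map<Integer, Object> containers = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : buckets.entrySet()) {
            List<Integer> lows = e.getValue();
            if (lows.size() < THRESHOLD) {
                short[] array = new short[lows.size()];          // 2 bytes per value
                for (int i = 0; i < array.length; i++) array[i] = (short) lows.get(i).intValue();
                containers.put(e.getKey(), array);
            } else {
                BitSet bitmap = new BitSet(1 << 16);             // fixed 8 KB per container
                for (int low : lows) bitmap.set(low);
                containers.put(e.getKey(), bitmap);
            }
        }
        return containers;
    }

    public static void main(String[] args) {
        int[] ids = {100, 70000, 131000};        // high-16-bit keys: 0, 1, 1
        build(ids).forEach((key, c) ->
                System.out.println("block " + key + " -> " + c.getClass().getSimpleName()));
    }
}
```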

It can be seen that the inverted index used by Elasticsearch is indeed faster than the B-Tree index of the relational database.

3.5 The skip list lets the inverted index serve joint queries over multiple fields

Everything above concerned an index on a single field. When a query combines the indexes of multiple fields, how can the inverted index still answer quickly? The answer is to use the skip list data structure to perform the "AND" operation quickly, or to use the bitset mentioned above and AND the bits together.

3.5.1 The skip list data structure

First, look at the skip list data structure:

[Figure: a skip list with levels level0, level1 and level2]

Take an ordered linked list as level0 and promote some of its elements to level1 and level2; the higher the level, the fewer elements it contains. To search, start at the highest level and work down. For example, to find 45, first find 25 at level2, then descend until 45 is found. The search efficiency is comparable to that of a binary tree, obtained by trading a certain amount of space redundancy.

3.5.2 Using the skip list for a joint index over multiple fields

Suppose there are the following three posting lists that need a joint index:

[Figure: three posting lists to be intersected]

With a skip list, for each id in the shortest posting list, look it up in the other two posting lists to see whether it exists; the ids present in all of them form the intersection.

With a bitset (based on a bitmap), it is even more intuitive: AND the bitmaps directly, and the result is the intersection.
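A small sketch of intersecting sorted posting lists for an AND query (a plain linear advance is shown where a skip list would jump in larger strides; the example ids are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of intersecting sorted posting lists for a multi-field AND query:
// walk the shortest list and probe the other, advancing each cursor forward only.
public class PostingListIntersection {
    static List<Integer> intersect(int[] shortest, int[] other) {
        List<Integer> result = new ArrayList<>();
        int j = 0;
        for (int id : shortest) {
            while (j < other.length && other[j] < id) j++;   // a skip list would leap here
            if (j < other.length && other[j] == id) result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ageIs18   = {1, 3, 7, 12};
        int[] sexIsMale = {2, 3, 5, 7, 9, 12, 15};
        System.out.println(intersect(ageIs18, sexIsMale));   // [3, 7, 12]
    }
}
```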

Note: this is how an inverted index implements a joint index in general, not necessarily exactly what ES does internally.

3.6 Summary and reflections

Elasticsearch's indexing philosophy is to move as much of the on-disk content into memory as possible, reduce random disk reads (while also exploiting the disk's sequential-read characteristics), combine various ingenious compression algorithms, and use memory with extreme frugality.

Therefore, when indexing with Elasticsearch, pay attention to the following:
(1) Fields that do not need to be indexed must be declared as such, because indexes are built automatically by default.
(2) For String fields that do not need analysis (word segmentation), declare that explicitly, because they are analyzed by default.
(3) Choosing regular, orderly IDs is very important. Highly random IDs (such as Java's UUID) are bad for querying, because the compression algorithms work on large runs of IDs in the posting list; if the IDs are sequential, or share a common prefix or other regularity, the compression ratio is higher.

4. Conclusion

ElasticSearch high-performance design, completed.

Code every day, progress every day!

Origin blog.csdn.net/qq_36963950/article/details/123598217