Elasticsearch indexing: single-field indexes and multi-field joint indexes

  Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene. Elasticsearch can be used for: distributed real-time document storage, where every field is indexed and searchable; distributed search with real-time analytics; and scaling out to hundreds of servers to handle petabytes of structured or unstructured data.
Elasticsearch's document storage
  Elasticsearch is a document-oriented database: a piece of data is a document here, and JSON is the document serialization format. For example, take the following piece of user data:

{
    "name" :     "XiaoMing",
    "sex" :      "Male",
    "age" :      18,
    "birthDate": "2000/05/01",
    "about" :    "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

In a MySQL database you would store this by creating a User table with fields such as name and sex; in Elasticsearch it is a document, and the document belongs to a User type, with various types living inside an index. The following table compares Elasticsearch and relational database terms:

Relational database ⇒ Database ⇒ Table ⇒ Row      ⇒ Column
Elasticsearch       ⇒ Index    ⇒ Type  ⇒ Document ⇒ Field

An Elasticsearch cluster can contain multiple indexes (databases); each index can contain multiple types (tables); each type contains many documents (rows); and each document contains many fields (columns). To interact with Elasticsearch you can use the Java API, or the HTTP RESTful API directly. Inserting a record, for example, is just one HTTP request:

PUT /megacorp/employee/1  
{
    "name" :     "XiaoMing",
    "sex" :      "Male",
    "age" :      18,
    "birthDate": "2000/05/01",
    "about" :    "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
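
Searching works through the same interface. As a minimal illustrative sketch (reusing the megacorp index and employee type from the PUT above, with a standard match query), a full-text search on the about field looks like this:

GET /megacorp/employee/_search
{
    "query": {
        "match": {
            "about": "rock climbing"
        }
    }
}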

Indexing
  Elasticsearch provides powerful indexing capabilities. Inserting a record into Elasticsearch means directly PUTting a JSON object. The object has multiple fields, such as name, sex, age, about, and interests in the example above, and while inserting the data, Elasticsearch also builds an inverted index over these fields, because search is Elasticsearch's core function.

How does Elasticsearch index quickly?
  The inverted index used by Elasticsearch is faster than the B-Tree index of relational databases.
  To improve query efficiency and reduce the number of disk seeks, a B-tree stores multiple values contiguously as arrays inside its nodes, reads several entries per disk access, and keeps the height of the tree low. The inverted index takes a different approach.
Suppose you have the following data:

ID Name Age Sex
1 Kate 24 Female
2 John 24 Male
3 Bill 29 Male

Here ID is the document id assigned by Elasticsearch. The indexes Elasticsearch builds then look like this:
Name:

Term Posting List
Kate 1
John 2
Bill 3

Age:

Term Posting List
24 [1,2]
29 3

Sex:

Term Posting List
Female 1
Male [2,3]

Elasticsearch builds an inverted index for every field. Kate, John, 24, and Female are called terms, and [1,2] is a posting list: an int array that stores the ids of all documents matching a given term.
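To make the structure concrete, here is a minimal Java sketch (not Lucene's actual implementation) of a per-field inverted index that maps each term to a sorted posting list of document ids:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class InvertedIndex {
    // term -> sorted posting list of document ids
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    // Index one field value of one document (doc ids must arrive in ascending order).
    public void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    public List<Integer> postingList(String term) {
        return postings.getOrDefault(term, List.of());
    }

    public static void main(String[] args) {
        InvertedIndex age = new InvertedIndex();
        age.add("24", 1);
        age.add("24", 2);
        age.add("29", 3);
        System.out.println(age.postingList("24")); // [1, 2]
    }
}
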
  Through this posting-list structure it already seems possible to search quickly: to find the students with age = 24, return documents 1 and 2. But what if there are tens of millions of records? And what if you want to search by name?


Term Dictionary
  To find a term quickly, Elasticsearch keeps all terms sorted and locates a term by binary search, with O(log N) time complexity; this sorted list of terms is the term dictionary. So far this seems similar to the way traditional databases use a B-tree. Why, then, is the query faster than a B-tree?
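A minimal sketch of the lookup idea (assuming the terms are already sorted):

import java.util.Arrays;

public class TermDictionaryLookup {
    public static void main(String[] args) {
        String[] terms = {"Bill", "John", "Kate"};    // sorted term dictionary
        int pos = Arrays.binarySearch(terms, "John"); // O(log N) binary search
        System.out.println(pos >= 0 ? "found at " + pos : "not found"); // found at 1
    }
}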

Term Index
  A B-tree improves query performance by reducing the number of disk seeks. Elasticsearch applies the same idea: look the term up directly in memory instead of reading the disk. But when there are too many terms, the term dictionary becomes too large to fit in memory. Hence the term index, which works like the index pages at the front of a dictionary: which terms start with 'A', and on which page to find them. The term index can be understood as a tree that does not contain all the terms, only some prefixes of terms; through the term index you quickly locate an offset into the term dictionary and then scan sequentially from that position.
Therefore, the term index need not store every term, only a mapping from prefixes to blocks of the term dictionary. Combined with the compression of FSTs (Finite State Transducers), the term index becomes small enough to cache in memory. Once the term index pinpoints the right block of the term dictionary, Elasticsearch reads the disk to find the term itself, which greatly reduces random disk reads. FSTs are finite-state machines that map a term (a byte sequence) to an arbitrary output. An FST stores all terms as shared byte sequences; this compression effectively shrinks the storage so that the term index fits in memory, though decoding costs more CPU during search.
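Here is a minimal sketch of the prefix-to-block idea in Java (a plain TreeMap stands in for the real FST; the terms and prefixes are made up for illustration): the in-memory term index maps a prefix to the offset of a term-dictionary block, and the lookup scans sequentially from there.

import java.util.TreeMap;

public class TermIndexSketch {
    public static void main(String[] args) {
        // Sorted term dictionary, conceptually stored on disk.
        String[] termDictionary = {"abbot", "abide", "banana", "bandit", "candle", "candy"};

        // Term index: prefix -> offset of the first term of that block (kept in memory).
        TreeMap<String, Integer> termIndex = new TreeMap<>();
        termIndex.put("ab", 0);
        termIndex.put("ba", 2);
        termIndex.put("ca", 4);

        String target = "bandit";
        // floorEntry finds the block whose prefix sorts at or before the target.
        int start = termIndex.floorEntry(target.substring(0, 2)).getValue();
        for (int i = start; i < termDictionary.length; i++) { // sequential scan of the block
            if (termDictionary[i].equals(target)) {
                System.out.println("found at offset " + i); // found at offset 3
                break;
            }
        }
    }
}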

Compression Tips
  Besides compressing the term index with FSTs, Elasticsearch also has compression tricks for posting lists. A posting list stores only document ids, but suppose Elasticsearch needs to index students' sex: with tens of millions of students and only two values (male/female) in the world, each posting list holds millions of document ids. Elasticsearch compresses the ids effectively with Frame Of Reference encoding: delta-encode the ids so that large numbers become small ones, then store them in as few bytes as possible. As a precondition, Elasticsearch requires posting lists to be sorted, as shown in the following figure:
[Figure: Frame Of Reference, in which a sorted posting list is delta-encoded, split into blocks, and bit-packed]
Through delta encoding, the original large ids become small increments; the increments are grouped into blocks, each block is packed with just enough bits for its largest value, and the result is stored in bytes.
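A minimal sketch of the delta-encoding step (leaving out the block splitting and bit-packing of Lucene's actual implementation):

public class FrameOfReferenceSketch {
    public static void main(String[] args) {
        int[] postingList = {73, 300, 302, 332, 343, 372}; // must be sorted

        // Delta-encode: store only the gap to the previous id.
        int[] deltas = new int[postingList.length];
        int prev = 0;
        for (int i = 0; i < postingList.length; i++) {
            deltas[i] = postingList[i] - prev;
            prev = postingList[i];
        }
        // deltas = [73, 227, 2, 30, 11, 29]: the largest gap fits in 8 bits,
        // so each entry can be stored in 1 byte instead of a 4-byte int.
        int maxDelta = 0;
        for (int d : deltas) maxDelta = Math.max(maxDelta, d);
        int bitsPerValue = 32 - Integer.numberOfLeadingZeros(maxDelta);
        System.out.println("bits per value after delta encoding: " + bitsPerValue); // 8
    }
}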

Roaring bitmaps
  A bitmap is a data structure. Suppose a posting list is [1,3,4,7,10]; the corresponding bitmap is [1,0,1,1,0,0,1,0,0,1]. Each 0/1 flag indicates whether a value exists; for example, the value 10 corresponds to the 10th bit, whose value is 1. One byte can thus represent 8 document ids. But this encoding is still not efficient: with 100 million documents it takes 12.5 MB of storage, and that is for a single indexed field (there are often many indexed fields). Hence a more efficient data structure is used: Roaring bitmaps.
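A minimal sketch of the plain-bitmap representation using java.util.BitSet:

import java.util.BitSet;

public class BitmapSketch {
    public static void main(String[] args) {
        int[] postingList = {1, 3, 4, 7, 10};
        BitSet bitmap = new BitSet();
        for (int id : postingList) bitmap.set(id); // one bit per possible doc id
        System.out.println(bitmap.get(7));  // true  -> doc 7 is in the list
        System.out.println(bitmap.get(8));  // false -> doc 8 is not
        // Note: memory grows with the largest id, not with how many ids are present.
    }
}
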
  The drawback of a bitmap is that its storage grows linearly with the number of documents, regardless of how many ids are actually present. Roaring bitmaps solve this by splitting the posting list into blocks of 65536: the first block covers document ids 0~65535, the second block 65536~131071, and so on. Each id is then represented as a <quotient, remainder> pair (id / 65536 and id % 65536), so the ids within each block always fall in the range 0~65535.

[Figure: Roaring bitmap blocks of <quotient, remainder> pairs. "If a block has more than 4096 values, encode as a bit set, and otherwise as a simple array using 2 bytes per value"]
Like 1024, 65535 is a classic value: 65535 = 2^16 - 1 is exactly the largest number that fits in two bytes, one short storage unit. Note the rule quoted in the figure above: a large block is stored as a bitset, while a small block simply uses a short[] at 2 bytes per value.
  Why does 4096 separate large blocks from small ones? A bitset covering one block always occupies 65536 / 8 = 8192 bytes, while the array encoding costs 2 bytes per value; at 4096 values the array also reaches 4096 * 2 = 8192 bytes, so beyond 4096 values the fixed-size bitset becomes the more compact choice.
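A minimal sketch of the <quotient, remainder> split and the container decision (a hypothetical standalone class, not the actual RoaringBitmap library):

import java.util.BitSet;

public class RoaringSketch {
    static final int THRESHOLD = 4096;

    static int quotient(int id)  { return id >>> 16; }   // which 65536-wide block
    static int remainder(int id) { return id & 0xFFFF; } // position inside the block

    public static void main(String[] args) {
        int id = 200_000;
        System.out.println(quotient(id) + ", " + remainder(id)); // 3, 3392

        // Container choice for one block holding n values:
        int n = 5000;
        if (n > THRESHOLD) {
            BitSet container = new BitSet(65536); // fixed-size bitset
            System.out.println("bitset container: " + 65536 / 8 + " bytes");
        } else {
            short[] container = new short[n];     // 2 bytes per value
            System.out.println("array container: " + 2 * n + " bytes");
        }
    }
}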


Joint index
  The above covers single-field indexes. When a query joins multiple field indexes together, how does the inverted index still meet the demand for fast queries? Two techniques are used:

  • Use the skip list data structure to compute the AND (intersection) quickly
  • Use the bitset AND operation mentioned above

The skip list data structure is shown in the figure:
[Figure: a skip list with levels level0, level1, level2]
Starting from the ordered list at level0, some elements are promoted to level1 and fewer still to level2; the higher the level, the sparser the pointers. To search for 55, say: first reach 31 at level2, then 47 at level1, and finally 55 at level0, three lookups in total. The search efficiency is comparable to that of a binary tree, bought in exchange for some redundant space.
Suppose a joint query needs to intersect the following three posting lists:

[Figure: three posting lists to be intersected]
With skip lists, take each id in the shortest posting list and probe the other two lists one by one to check whether it exists; the ids found in all three form the intersection. With a bitset it is even more intuitive: AND the bitmaps directly, bit by bit, and the result is the final intersection.
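A minimal sketch of both approaches (sorted arrays and java.util.BitSet stand in for Lucene's skip lists and compressed bitmaps):

import java.util.Arrays;
import java.util.BitSet;

public class PostingIntersection {
    public static void main(String[] args) {
        int[] a = {2, 3, 5, 7, 11};
        int[] b = {1, 3, 5, 7, 9};
        int[] c = {3, 5, 6, 7, 8};

        // Approach 1: drive from the shortest list and probe the others
        // (binarySearch stands in for a skip-list lookup).
        for (int id : a) {
            if (Arrays.binarySearch(b, id) >= 0 && Arrays.binarySearch(c, id) >= 0) {
                System.out.println("in intersection: " + id); // 3, 5, 7
            }
        }

        // Approach 2: bitwise AND of bitsets.
        BitSet result = toBitSet(a);
        result.and(toBitSet(b));
        result.and(toBitSet(c));
        System.out.println(result); // {3, 5, 7}
    }

    static BitSet toBitSet(int[] ids) {
        BitSet bs = new BitSet();
        for (int id : ids) bs.set(id);
        return bs;
    }
}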

Elasticsearch indexing in practice

  The overall idea: move as much of the on-disk data into memory as possible to reduce random disk reads (while also exploiting sequential disk reads where it can), and spend that memory frugally with the help of compression. When using Elasticsearch, therefore, pay attention to the following indexing points:

  • Explicitly mark the fields that do not need to be indexed, because by default every field is indexed automatically
  • Likewise, explicitly declare String fields that need no analysis, because analysis is also performed by default (see the mapping sketch after this list)
  • Choose regular document IDs; IDs with too much randomness (such as Java's UUID) are bad for queries
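
A minimal mapping sketch illustrating the first two points (the index and field names are hypothetical; the syntax follows Elasticsearch 7.x, where keyword is the non-analyzed string type and "index": false disables indexing for a field):

PUT /my_index
{
    "mappings": {
        "properties": {
            "internal_note": { "type": "text", "index": false },
            "status":        { "type": "keyword" }
        }
    }
}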

With regard to the last point, there are multiple factors:

  1. The compression algorithms work by compressing the large sets of IDs in posting lists; if the IDs are sequential, or follow some pattern such as a common prefix, the compression ratio is higher.
  2. Another factor, probably the one that influences query performance the most, is the final step of fetching the document from disk through the ID in the posting list. Elasticsearch stores data in segments, and how efficiently a segment can be located for an ID drawn from this very large term range directly determines the performance of that last step. If the IDs are regular, segments that cannot contain the ID can be skipped quickly, reducing unnecessary disk reads.