A few notes on Elasticsearch (ES).

What is ES? Why choose ES?

ES is a high-performance non-relational document database built for fast retrieval. Reasons to choose it:

        Naturally distributed: high performance, high availability, easy to scale and maintain

        Cross-platform and cross-language: supports all mainstream programming languages

        Supports structured storage and geo-location data

        Full-text search over massive data sets

        Well suited to storing and searching logs

What is ES's inverted index? What does its data structure look like?

The inverted index turns the usual layout around: the content extracted from the documents (the terms) comes first, and the documents those terms point to come after.

For example, suppose we index the poem "Quiet Night Thoughts" ("bright moonlight before my bed ...").

The storage form of the traditional forward index is:

The document comes first: each document id maps to its content, so the whole poem is stored in the document, with words such as "moonlight" inside it.

When a user now searches for "moonlight" or "bed", the engine obviously has to scan every document to find the matches.

If there are tens of thousands of poems, query efficiency is poor.

The storage structure of the inverted index is:

        When a poem is entered, it is assigned a document id, and every other poem likewise gets its own id. After entry, the index engine extracts the keywords of each poem to form a term dictionary (keyword dictionary). The term dictionary is kept in sorted order, organized much like the B+ tree that MySQL uses for its indexes.

        This way, when a user searches by keyword, the keyword is located first, and the keyword then leads to the documents that contain it.

        But when there are many keywords, looking each one up with a disk IO operation is expensive.

        The whole dictionary also cannot simply be kept in memory, because it is too large, so the concept of a term index (keyword index) is introduced.

        The term index is held in memory. It is a condensed version of the term dictionary, mainly storing keyword prefixes, and its underlying structure is a prefix tree (a trie; Lucene's actual implementation uses an FST, a compressed trie-like structure). Words are combined into a tree, and a user's search is first matched against this tree. Each matched prefix yields a value: the position of the keyword's block in the term dictionary. The dictionary is then searched, B+-tree style, from that position; once the term is found, the corresponding document ids follow, and the ids identify which poems contain the keyword.
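As a toy illustration (all documents and terms here are invented, and the "prefix index" is a crude stand-in for the real in-memory structure), the two-level lookup — a small in-memory prefix index narrowing the search, then a term dictionary mapping terms to document ids — can be sketched in Python:

```python
from collections import defaultdict

# Toy corpus: document id -> text.
docs = {
    1: "quiet night thoughts bright moonlight before my bed",
    2: "bright moon over the mountain pass",
    3: "the river flows quietly at night",
}

# Inverted index: term -> set of document ids (the posting list).
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

# Toy term index: 2-character prefix -> terms, standing in for the
# in-memory prefix tree that narrows the on-disk dictionary lookup.
term_index = defaultdict(list)
for term in inverted:
    term_index[term[:2]].append(term)

def search(term):
    # First consult the small in-memory prefix index...
    if term not in term_index.get(term[:2], []):
        return set()
    # ...then fetch the posting list from the term dictionary.
    return inverted[term]

print(sorted(search("moonlight")))  # documents containing "moonlight"
print(sorted(search("bright")))     # a term shared by two documents
```

A term query touches only the relevant posting list rather than scanning every document, which is the whole point of the inverted layout.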

What are the core concepts in ES?

1. Index: a collection of documents, equivalent to a database in MySQL. It has a unique name, and that name is used whenever you operate on the index.

2. Document: equivalent to a row in a traditional database, represented in ES as JSON. Inserting a row in MySQL corresponds to inserting a document in ES.

3. Field: equivalent to a column in MySQL; it can be understood as a key of the JSON document in ES.

4. Mapping: defines the type and configuration of each field.

5. Shard: if an index grows too large it consumes space and hurts query efficiency, so the index's data is split across shards.

6. Replica: if the node holding a shard fails, the shard becomes unavailable; to cope with this, each shard has one or more replicas.

7. Term: a large text is split into many small words (terms), which are stored in the inverted index.

8. Settings: index configuration, such as the number of shards and replicas.

9. Analyzer: ES analyzes documents as it stores them; an analyzer consists of character filters, a tokenizer, and token filters.

10. Node and cluster: a node is an ES process; starting ES starts a node, and many nodes form a cluster.

The process of indexing documents?

       Step 1: The client sends a write request to a node in the cluster (say node 1, which acts as the coordinating node).

        Step 2: Node 1 uses the document id to determine that the document belongs to shard 0. The primary of shard 0 lives on node 3, so the request is forwarded there.

        Step 3: Node 3 executes the operation on the primary shard. If it succeeds, it forwards the request in parallel to the replica shards on nodes 1 and 2. Once the replicas report success, node 3 responds to node 1, and node 1 returns the result to the client.
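The routing decision in Step 2 follows the documented formula shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document id. Real ES hashes with murmur3; the byte-sum below is only a toy stand-in to keep the sketch self-contained:

```python
NUM_PRIMARY_SHARDS = 3  # fixed at index creation time

def route(doc_id: str) -> int:
    # shard = hash(_routing) % number_of_primary_shards
    # (toy hash; ES actually uses murmur3 on the _routing value)
    return sum(doc_id.encode()) % NUM_PRIMARY_SHARDS

# The same id always routes to the same shard, which is why the
# number of primary shards cannot change after index creation.
print(route("poem-42"))
```

Because the modulus is the primary shard count, changing that count would re-route existing ids, so it is fixed at index creation.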

How to ensure data consistency under concurrency?

        1. Optimistic concurrency control via version numbers ensures that a newer version is not overwritten by an older one; the application layer handles the actual conflicts.

        2. For write operations, the consistency level supports quorum/one/all, with quorum as the default: a write is only allowed when a majority of shard copies are available. Even then, a write to a replica may still fail (e.g. because of the network); the replica is then marked as faulty and the shard is rebuilt on a different node.

        3. For read operations, replication can be set to sync (the default), so the operation returns only after both the primary and replica shards have completed; with replication set to async, you can also set the search request parameter _preference to primary to query the primary shard and ensure the document is the latest version.
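A minimal sketch of point 1, optimistic concurrency control with version numbers. The Store class and its method names are invented for illustration; they mimic the behavior of ES's versioned writes, not its API:

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale expected version."""

class Store:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, body, if_version=None):
        current = self._docs.get(doc_id)
        version = current[0] if current else 0
        if if_version is not None and if_version != version:
            # A newer write got in first: reject instead of overwriting,
            # and let the application decide how to resolve the conflict.
            raise VersionConflict(f"expected {if_version}, found {version}")
        self._docs[doc_id] = (version + 1, body)
        return version + 1

store = Store()
v1 = store.index("1", {"title": "quiet night"})                      # -> 1
v2 = store.index("1", {"title": "quiet night, revised"}, if_version=v1)  # -> 2
print(v1, v2)
```

A concurrent writer still holding v1 would now get a VersionConflict instead of silently clobbering the v2 document.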

How does ES implement elections?

Step 1: Confirm that the number of candidate master nodes reaches the quorum configured in elasticsearch.yml via discovery.zen.minimum_master_nodes.

Step 2: Compare the nodes: master eligibility is checked first, and master-eligible candidates rank ahead of the rest. If both nodes are master-eligible, the one with the smaller node id becomes master.
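The comparison in Step 2 can be sketched as follows. This is a toy model of the tie-breaking rule only; real Zen discovery also exchanges cluster state, handles joins, and retries on timeouts:

```python
def elect(nodes, minimum_master_nodes):
    """Pick a master: eligible nodes only, smallest node id wins."""
    eligible = [n for n in nodes if n["master_eligible"]]
    if len(eligible) < minimum_master_nodes:
        return None  # quorum not reached; no election happens
    return min(eligible, key=lambda n: n["id"])

nodes = [
    {"id": 3, "master_eligible": True},
    {"id": 1, "master_eligible": False},  # skipped despite smallest id
    {"id": 2, "master_eligible": True},
]
print(elect(nodes, minimum_master_nodes=2))  # node 2 becomes master
```

Note that node 1 has the smallest id but is not master-eligible, so node 2 wins; the quorum check guards against split-brain elections.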

What tokenizers are there in ES?

        standard: the default analyzer; splits on word boundaries and lowercases tokens

        keyword: no segmentation; the input is emitted unchanged as a single token

        pattern: splits according to a regular expression

        language: language-specific analyzers for more than 30 common languages

        custom: a user-defined analyzer
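A rough approximation of what the standard and keyword analyzers produce. This is simplified for illustration; the real standard tokenizer implements Unicode text segmentation, not a regex:

```python
import re

def standard_like(text):
    # Roughly the standard analyzer: split on non-word characters,
    # then lowercase each token.
    return [t.lower() for t in re.findall(r"\w+", text)]

def keyword_like(text):
    # keyword: the whole input becomes one untouched token.
    return [text]

print(standard_like("Bright Moonlight, before my bed"))
print(keyword_like("Bright Moonlight, before my bed"))
```

The contrast explains the text vs. keyword field types below: text fields go through an analyzer like the first function, keyword fields are matched exactly like the second.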

What is mapping in ES?

   Mapping is similar to a table's schema definition in a database, and it has the following functions:

        It defines the names of the fields in the index; it defines each field's data type, such as string, number, or boolean; and it configures the inverted index for a field, such as marking the field as not indexed or recording positions. In early versions of ES, one index could contain multiple Types; since 7.0 an index has only one Type, so one can say that a Type has one Mapping definition.
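For example, a mapping body in standard ES 7.x request syntax, built here as a plain Python dict (the field names are illustrative):

```python
import json

# Body one would PUT when creating an index; "type" and "index" are
# real mapping parameters, the field names are made up.
mapping = {
    "mappings": {
        "properties": {
            "title":  {"type": "text"},                  # analyzed, full-text
            "author": {"type": "keyword"},               # exact-match only
            "views":  {"type": "integer"},
            "draft":  {"type": "text", "index": False},  # stored, not searchable
        }
    }
}
print(json.dumps(mapping, indent=2))
```

Setting "index": False on a field is exactly the "set a field to not be indexed" knob described above.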

 

What are the aggregation queries of ES?

        Bucket aggregation: similar to GROUP BY

        Metric aggregation: generally used for calculations, such as averages and sums

        Pipeline aggregation: aggregates over the results of other aggregations, analogous to a subquery over query results
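The bucket/metric distinction can be shown with plain Python (toy documents; this mimics a terms bucket aggregation followed by an avg metric per bucket, not the ES query DSL):

```python
from collections import defaultdict

docs = [
    {"author": "li bai", "views": 100},
    {"author": "li bai", "views": 300},
    {"author": "du fu",  "views": 50},
]

# Bucket aggregation: group documents by a field (like GROUP BY).
buckets = defaultdict(list)
for d in docs:
    buckets[d["author"]].append(d["views"])

# Metric aggregation: compute a value (here, the average) per bucket.
avg_views = {author: sum(v) / len(v) for author, v in buckets.items()}
print(avg_views)
```

A pipeline aggregation would then run over avg_views itself, e.g. picking the bucket with the highest average.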

What data types does ES support?

        String: text, keyword

        Numbers: integer (also byte, short, long), float, double

        Date

        Boolean

        Binary

        Range (interval)

        Geographic location

Write performance tuning?

        Reduce resource contention between reads and writes; separate the two where possible

        Schedule large batch writes into windows with low search traffic, and process bulk writes together

        Increase the flush interval to reduce the frequency of disk IO

        Increase refresh_interval to create fewer segments and reduce full GC pressure
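As a concrete example, index settings one might apply during a bulk-load window. The setting names (index.refresh_interval, index.number_of_replicas, index.translog.*) are real ES settings, but the values are illustrative and should be restored after the load:

```python
# Settings body to PUT on the index before a bulk load; revert afterwards.
write_tuning = {
    "index": {
        "refresh_interval": "30s",   # default is 1s; fewer, larger segments
        "number_of_replicas": 0,     # re-enable replicas after the load
        "translog": {
            "durability": "async",   # fsync in the background...
            "sync_interval": "60s",  # ...at most once per minute
        },
    }
}
print(write_tuning)
```

Dropping replicas to 0 and relaxing translog durability both trade safety for ingest throughput, so they only make sense for reloadable bulk data.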

Read performance tuning?

       Disable swap

        Use filter context instead of query context where scoring is not needed; filter results are cacheable

        Avoid deep paging; for large data sets use scroll or search_after

        Avoid coupling between indexes

Node type in ES?

       master node

        master-eligible (candidate) node

        voting-only node

        data node

        cold node

        hot node

What is deep paging?

 For example: suppose 40 documents are spread across 4 shards, and we request the 10 documents of page 1.

With the default from + size paging, the request first hits a coordinating node, which fetches candidates from all 4 shards, collects and sorts the 40 documents, and returns the first 10. With little data this is no problem; for a deep page over a large data set, every shard must return from + size documents, and the process becomes very expensive.
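The cost asymmetry can be made concrete with a little arithmetic, under a simplified model where page p with page size s forces each shard to return up to p × s candidates to the coordinating node:

```python
def from_size_cost(num_shards, page, size):
    """Documents pulled to the coordinator vs. documents returned."""
    fetched = num_shards * (page * size)  # each shard returns from+size docs
    returned = size
    return fetched, returned

print(from_size_cost(num_shards=4, page=1, size=10))     # shallow page: cheap
print(from_size_cost(num_shards=4, page=1000, size=10))  # deep page: 40,000 fetched
```

For page 1 the coordinator sorts 40 documents to return 10; for page 1000 it sorts 40,000 to return the same 10, which is why deep from + size paging degrades.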

Workaround for deep pagination:

Scroll query: a scrolling cursor over the result set, similar to using the mouse wheel to scroll to the next page while browsing a website.

search_after: each query pages forward from the sort values of the previous page's last hit; paging backward is not supported.

Taobao's approach: remove the page-jump feature and expose only the first 100 pages.
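A toy model of how search_after works: instead of skipping `from` documents, each request resumes after the sort key of the previous page's last hit (the data and sort keys here are invented for illustration):

```python
# 40 documents, sorted by (descending score, ascending id) as a tiebreaker.
docs = sorted(({"id": i, "score": 100 - i} for i in range(1, 41)),
              key=lambda d: (-d["score"], d["id"]))

def search_after(last_sort, size=10):
    """Return the next page after sort key `last_sort`, plus a new cursor."""
    hits = [d for d in docs
            if last_sort is None or (-d["score"], d["id"]) > last_sort]
    page = hits[:size]
    next_sort = (-page[-1]["score"], page[-1]["id"]) if page else None
    return page, next_sort

page1, cursor = search_after(None)      # first page, no cursor yet
page2, _ = search_after(cursor)         # resumes right after page 1's last hit
print([d["id"] for d in page2])
```

Each request filters from the cursor forward rather than materializing and discarding earlier pages, which is why it stays cheap at any depth but cannot jump to an arbitrary page.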

ES fragmentation strategy?

Before 7.x the default was 5 primary shards, each with 1 replica; from 7.x onward the default is 1 primary shard with 1 replica.

A shard and its own replica are never placed on the same node.

Shards are spread across multiple nodes as far as possible, though not necessarily evenly.

Several shards may coexist on the same node.

The number of primary shards is fixed when the index is created; replicas can be added or removed at any time.

 

 


Origin blog.csdn.net/weixin_43195884/article/details/129129497