Some things you should know about Elasticsearch


1. What is it?

Elasticsearch is a real-time, distributed storage, search, and analytics engine with powerful fuzzy and relevance query capabilities.
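As a quick illustration of the fuzzy queries mentioned above, here is a minimal sketch in Python. It assumes a local node on localhost:9200 and a hypothetical index user with a name field (neither comes from the article):

```python
import requests

# Fuzzy query against a hypothetical index "user" and field "name".
resp = requests.post(
    "http://localhost:9200/user/_search",
    json={"query": {"fuzzy": {"name": {"value": "ada", "fuzziness": "AUTO"}}}},
)
print(resp.json()["hits"]["hits"])  # docs matching "ada" within a small edit distance
```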

2. Data structure?


Term Dictionary: when we index a piece of text, Elasticsearch splits it into terms according to the tokenizer (e.g. Ada / Allen / Sara). The collection of these terms is called the Term Dictionary. Because the Term Dictionary can contain a huge number of terms, the terms are kept sorted, so a lookup can use binary search instead of scanning the whole Term Dictionary.

Term Index: because the Term Dictionary holds too many terms to fit entirely in memory, Elasticsearch adds another layer on top of it called the Term Index. This layer stores only some of the term prefixes and is kept in memory, so it can be searched extremely quickly. The Term Index is stored in memory as an FST (Finite State Transducer), which is very memory-efficient.
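To make the Term Index / Term Dictionary relationship concrete, here is a toy sketch in Python. It is not Lucene's real FST or on-disk term blocks; a plain prefix dict and a sorted list stand in for them:

```python
import bisect

# Toy stand-ins: a sorted term dictionary and an in-memory prefix "term index".
term_dictionary = sorted(["ada", "allen", "sara", "saw", "see", "soa"])
term_index = {}  # prefix -> position of the first term with that prefix
for pos, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], pos)

def has_term(term: str) -> bool:
    start = term_index.get(term[0], 0)                          # prefix index narrows the range
    pos = bisect.bisect_left(term_dictionary, term, lo=start)   # binary search the sorted dictionary
    return pos < len(term_dictionary) and term_dictionary[pos] == term

print(has_term("sara"))  # True
print(has_term("bob"))   # False
```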

Posting List: the document IDs are stored in the Posting List, and the data inside the Posting List is compressed with the Frame Of Reference (FOR) encoding technique to save disk space.
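A minimal sketch of why Frame Of Reference saves space, using a made-up posting list (real Lucene additionally splits the list into blocks of 128 entries and bit-packs each block):

```python
def for_encode(doc_ids):
    """Delta-encode a sorted posting list and report the bit width needed per entry."""
    assert doc_ids == sorted(doc_ids), "posting lists are stored in sorted order"
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    bits_per_entry = max(d.bit_length() for d in deltas)  # width of the largest delta
    return deltas, bits_per_entry

deltas, bits = for_encode([73, 300, 302, 332, 343, 372])
print(deltas)  # [73, 227, 2, 30, 11, 29]
print(bits)    # 8 bits per doc id instead of a full 32-bit integer
```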

Finite State Transducers: (to be added)

Frame Of Reference: (to be added)

3. Common terms in ES?

Index : The Index of Elasticsearch is equivalent to the Table of the database

Type : this has been removed in newer Elasticsearch versions (earlier versions supported multiple Types under one Index, somewhat like multiple groups under one topic in a message queue)

Document : Document is equivalent to a row of records in the database

Field : the concept equivalent to the Column of the database

Mapping : The concept equivalent to the Schema of the database

DSL : equivalent to the database's SQL (the API we use to query Elasticsearch data)
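As a hedged illustration of the DSL-vs-SQL analogy, the request below is roughly the DSL equivalent of SELECT * FROM user WHERE name = 'Ada' AND age > 30 LIMIT 10; the index user and fields name and age are hypothetical:

```python
import requests

# Roughly: SELECT * FROM user WHERE name = 'Ada' AND age > 30 LIMIT 10
dsl = {
    "query": {
        "bool": {
            "must": [
                {"match": {"name": "Ada"}},
                {"range": {"age": {"gt": 30}}},
            ]
        }
    },
    "size": 10,
}
resp = requests.post("http://localhost:9200/user/_search", json=dsl)  # assumes a local ES node
print(resp.json()["hits"]["hits"])
```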

4. ES overall architecture

ES architecture (distributed, highly available)


ES write


Every node in the cluster is a coordinating node, which means every node can do routing. For example, node 1 receives a request but finds that the data for this request should be handled by node 2 (because the primary shard is on node 2), so it forwards the request to node 2.

The coordinating node can work out which primary shard a document belongs to with a hash algorithm and then route the request to the corresponding node: shard = hash(document_id) % (num_of_primary_shards)
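A minimal sketch of that routing formula; Elasticsearch actually hashes the routing value (the document _id by default) with Murmur3, so Python's built-in hash() is only a stand-in here:

```python
def route(document_id: str, num_of_primary_shards: int) -> int:
    # shard = hash(document_id) % (num_of_primary_shards)
    # Python's hash() is randomized per process; ES uses Murmur3 so that every
    # node computes the same target shard for the same _id.
    return hash(document_id) % num_of_primary_shards

print(route("doc-42", 3))  # one of 0, 1, 2
```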

Primary shard write process


  1. Write the data into the in-memory buffer

  2. At the same time, write the data into the translog buffer

  3. Every 1s, data is refreshed from the buffer into the FileSystemCache, generating a segment file; once the segment file exists, the data can be searched through the index

  4. After the refresh, the in-memory buffer is emptied

  5. Every 5s, the translog is flushed from its buffer to disk

  6. Periodically (by time or by translog size), the index data in the FileSystemCache is flushed to disk together with the translog contents (a commit)

  • Elasticsearch first writes data into the in-memory buffer and then refreshes it to the filesystem cache every 1s (data only becomes searchable after it has been refreshed to the filesystem cache). Therefore: data written to Elasticsearch takes about 1s before it can be queried.
  • To avoid losing the in-memory data if a node goes down, Elasticsearch also writes a copy of the data to the translog; the translog buffer is flushed to disk every 5s. Therefore: if an Elasticsearch node goes down, up to 5s of data may be lost.
  • When the translog file on disk grows large enough, or roughly every 30 minutes, a commit is triggered and the segment files in memory are asynchronously flushed to disk, completing persistence. (These intervals correspond to index settings; see the sketch below.)
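The intervals above correspond to index settings. The sketch below uses commonly cited names and defaults (refresh_interval of 1s, translog sync_interval of 5s); exact names and defaults vary by Elasticsearch version, and the 5s data-loss window only applies when translog durability is set to async, so treat this as illustrative rather than authoritative:

```python
import requests

# Illustrative settings for a hypothetical index "my_index" on a local node.
settings = {
    "settings": {
        "index": {
            "refresh_interval": "1s",             # buffer -> filesystem cache (becomes searchable)
            "translog": {
                "durability": "async",            # fsync on an interval instead of per request
                "sync_interval": "5s",            # how often the translog is fsynced to disk
                "flush_threshold_size": "512mb",  # translog size that triggers a flush (commit)
            },
        }
    }
}
print(requests.put("http://localhost:9200/my_index", json=settings).json())
```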

After the primary shard has been written, the data is sent to the replica shards in parallel. When all of them have written successfully, an ack is returned to the coordinating node, and the coordinating node returns the ack to the client, completing one write.

ES update/delete

Elasticsearch update and delete flow: the corresponding doc is marked with a .del flag. For a delete operation, the doc is simply marked as deleted; for an update operation, the original doc is marked as deleted and a new doc is written.

As mentioned earlier, a new segment file is generated every 1s, so segment files keep accumulating. Elasticsearch therefore runs a merge task that merges multiple segment files into one.

During the merge, docs that are in the deleted state are physically removed.
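A toy sketch of what a merge does with deleted docs; real Lucene merges are far more involved, this only illustrates the cleanup of docs marked as deleted:

```python
def merge_segments(segments, deleted_doc_ids):
    """Combine several small segments into one, dropping docs marked .del."""
    merged = [doc_id
              for segment in segments
              for doc_id in segment
              if doc_id not in deleted_doc_ids]   # physically drop deleted docs
    return sorted(merged)

segments = [[1, 2, 3], [4, 5], [6]]       # three small segment files
deleted = {2, 5}                           # marked .del by update/delete
print(merge_segments(segments, deleted))   # [1, 3, 4, 6]
```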

ES query

The simplest queries can be divided into two types:

  • Query a doc by ID
  • Query matching docs by a search term (query)

The process of querying a specific doc by ID (see the API example after this list) is:

  • Check the translog in memory
  • Check the translog on disk
  • Check the segment files on disk
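A minimal example of the GET-by-ID API this path serves, assuming a local node and a hypothetical index user containing a document whose _id is 1:

```python
import requests

# Real-time GET of one document by its _id.
resp = requests.get("http://localhost:9200/user/_doc/1")
body = resp.json()
if body.get("found"):
    print(body["_source"])   # the original document body
else:
    print("document not found")
```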

The process of matching docs by a query (search term) is:

  • Search the segment files in memory and on disk at the same time

From the write process described above we know that Get (querying a doc by ID) is real-time, while Query (matching docs by a search term) is near real-time, because a new segment file is only generated every second.
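If a caller needs a document to be searchable immediately after writing, the index request can ask for a refresh. A hedged example (hypothetical index user, document id 1):

```python
import requests

# "refresh=wait_for" blocks until the next refresh has made the doc searchable,
# trading write latency for immediate visibility in search.
resp = requests.put(
    "http://localhost:9200/user/_doc/1",
    params={"refresh": "wait_for"},
    json={"name": "Ada"},
)
print(resp.json()["result"])  # "created" or "updated"
```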

Elasticsearch queries can be divided into three types:

  • QUERY_AND_FETCH (return the entire Doc content as soon as the query completes)
  • QUERY_THEN_FETCH (first query for the matching Doc ids, then fetch the corresponding documents by those Doc ids)
  • DFS_QUERY_THEN_FETCH (first gather the statistics used for scoring, then run the query)
  • "Scoring here refers to term frequency and document frequency (Term Frequency, Document Frequency); as is well known, the more often a term appears, the stronger the relevance"


Generally, the one we use most is QUERY_THEN_FETCH. The first type, which returns the entire Doc content directly (QUERY_AND_FETCH), is only suitable for requests that need to hit a single shard.

The overall process flow of QUERY_THEN_FETCH is roughly:

  • The client sends a request to a node in the cluster; every node in the cluster is a coordinating node
  • The coordinating node then forwards the search request to all shards (either the primary shard or a replica shard of each)
  • Each shard returns its own results (doc ids) to the coordinating node, and the coordinating node merges, sorts, and paginates the data to produce the final result
  • The coordinating node then uses those doc ids to pull the actual document data from the corresponding nodes and finally returns it to the client (see the example after this list)
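A hedged sketch of such a search: by default it runs as QUERY_THEN_FETCH, and the from/size paging happens at the coordinating node after the per-shard results are merged. The index user and field name are hypothetical; the commented search_type parameter is only needed if you explicitly want DFS_QUERY_THEN_FETCH's cross-shard term statistics:

```python
import requests

body = {
    "query": {"match": {"name": "Ada"}},
    "from": 0,    # the coordinating node merges and sorts per-shard doc ids,
    "size": 10,   # then fetches only these 10 documents in the fetch phase
}
resp = requests.post(
    "http://localhost:9200/user/_search",
    # params={"search_type": "dfs_query_then_fetch"},  # optional: score with global term stats
    json=body,
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"])
```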

What the node does during the Query Phase :

  • The coordinating node sends the query command to the target shards (forwarding the request to either the primary shard or a replica shard)
  • The data nodes (each shard does its own filtering, sorting, etc.) return the doc ids to the coordinating node

What the nodes do during the Fetch Phase is:

  • The coordinating node takes the doc ids returned by the data nodes, aggregates them, and then sends a fetch command to the target data shards (asking for the full Doc records)
  • Based on the doc ids sent by the coordinating node, the data nodes pull the data that is actually needed and return it to the coordinating node

Original address: https://mp.weixin.qq.com/s?__biz=MzI4Njg5MDA5NA==&mid=2247486522&idx=1&sn=7b6080756d0711c646fb47d5db49fc97&chksm=ebd74d3bdca0c42d35f7e2097e4fbaca0c42d35f7e2097e4fca0c42d35f7e2097e4fbaca0c42d35e


Origin: blog.csdn.net/weixin_41237676/article/details/113374585