Things you should know about Elasticsearch
1. What is it?
Elasticsearch is a real-time, distributed storage, search, and analytics engine with powerful fuzzy and relevance-based query capabilities.
2. Data structure?
Term Dictionary: when we feed in a piece of text, Elasticsearch segments it with a tokenizer (producing the Ada/Allen/Sara... terms seen in the picture). These terms are collectively called the Term Dictionary. Because the Term Dictionary is very large, the terms in it are kept sorted, so a lookup can use binary search instead of traversing the entire Term Dictionary.
Term Index: since the Term Dictionary holds far too many terms to keep entirely in memory, Elasticsearch adds another layer called the Term Index. This layer stores only term prefixes, and it lives in memory (so retrieval through it is particularly fast). The Term Index is stored in memory as an FST (Finite State Transducer), whose key characteristic is that it is extremely memory-efficient.
Posting List: document IDs are stored in the Posting List, and the data inside it is compressed with Frame Of Reference (FOR) encoding to save disk space.
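A minimal sketch (in Python, with made-up terms and doc IDs) of why keeping the Term Dictionary sorted pays off: a sorted term list supports binary search via `bisect`, avoiding a scan of every term:

```python
import bisect

# Toy inverted index: each term maps to a posting list of doc IDs.
postings = {
    "ada":   [1, 5, 9],
    "allen": [2, 5],
    "sara":  [3, 9],
}

# The Term Dictionary is kept sorted so lookups can use binary search
# (O(log n)) instead of traversing every term (O(n)).
term_dictionary = sorted(postings)

def lookup(term):
    """Binary-search the sorted term dictionary; return the posting list."""
    i = bisect.bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return postings[term]
    return []
```

The real Term Dictionary lives mostly on disk; the in-memory Term Index (an FST over term prefixes) first narrows the search to the right region before a lookup like this runs.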
Finite State Transducers: (to be added)
Frame Of Reference: (to be added)
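A toy sketch of the Frame Of Reference idea: because a posting list is sorted, its doc IDs can be delta-encoded so each gap needs only a few bits (the actual bit packing into fixed-width frames is omitted here):

```python
def for_encode(doc_ids):
    """Delta-encode a sorted posting list (toy Frame Of Reference).

    Real FOR then packs each delta using the minimum bit width for the
    block; here we just report the deltas and that bit width.
    """
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    bits_needed = max(d.bit_length() for d in deltas)
    return deltas, bits_needed

def for_decode(deltas):
    """Rebuild the original doc IDs by cumulatively summing the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Storing the small deltas instead of the raw (potentially very large) doc IDs is what saves the disk space mentioned above.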
3. Common terms in ES?
Index : The Index of Elasticsearch is equivalent to the Table of the database
Type : removed in newer Elasticsearch versions (older versions supported multiple Types under one Index, somewhat like multiple consumer groups under one topic in a message queue)
Document : Document is equivalent to a row of records in the database
Field : the concept equivalent to the Column of the database
Mapping : The concept equivalent to the Schema of the database
DSL : equivalent to the database's SQL (the API through which we query Elasticsearch data)
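For illustration, a hypothetical DSL query (the index and field name `user` are made up) built as a Python dict; it plays roughly the role SQL's SELECT ... WHERE plays in a database:

```python
import json

# Hypothetical example: roughly the DSL analogue of
#   SELECT * FROM users WHERE user = 'ada' LIMIT 10
query = {
    "query": {
        "match": {"user": "ada"}   # full-text match on the "user" field
    },
    "size": 10,
}

# This JSON body would be sent to a search endpoint such as GET /users/_search
body = json.dumps(query)
```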
4. ES overall architecture
ES architecture (distributed, highly available)
ES write
Every node in the cluster is a coordinating node, which means every node can do routing. For example, node 1 receives a request but finds that the data for this request should be handled by node 2 (because the primary shard is on node 2), so it forwards the request to node 2.
The coordinating node uses a hash to work out which primary shard a document lives on, and then routes to the corresponding node: shard = hash(document_id) % (num_of_primary_shards)
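The routing formula can be sketched directly. Elasticsearch actually hashes the routing value with murmur3; `crc32` stands in here only to keep the sketch deterministic and dependency-free:

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # fixed at index creation; changing it would remap every doc

def route(document_id: str) -> int:
    """shard = hash(document_id) % num_of_primary_shards.

    crc32 is a stand-in for the murmur3 hash Elasticsearch really uses.
    """
    return zlib.crc32(document_id.encode()) % NUM_PRIMARY_SHARDS
```

Because the shard count appears in the modulus, the number of primary shards cannot be changed after index creation without rerouting every document, which is why it is fixed up front.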
Primary shard write process
- Write the data to the in-memory buffer
- At the same time, write the data to the translog buffer
- Every 1s, data is refreshed from the buffer into the FileSystemCache, generating a segment file; once the segment file is generated, it can be queried through the index
- After the refresh, the memory buffer is emptied
- Every 5s, the translog is flushed from its buffer to disk
- Periodically, or once size thresholds are reached, the segments in the FileSystemCache are flushed to disk together with the translog contents (the flush that commits the index)
- Elasticsearch writes data to the memory buffer first, then refreshes it to the file system cache every 1s (data only becomes searchable once it reaches the file system cache). Therefore: data written to Elasticsearch takes about 1s before it can be queried.
- To avoid losing in-memory data when a node goes down, Elasticsearch writes a second copy of the data to a log file (the translog); but this data also starts in a memory buffer, which is flushed to disk every 5s. Therefore: if an Elasticsearch node goes down, up to 5s of data may be lost.
- When the translog file on disk grows large enough, or 30 minutes have passed, a commit is triggered: the segment files in memory are asynchronously flushed to disk, completing the persistence.
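The write path above can be sketched as a toy state machine (a deliberately simplified model; the real timers, fsync behavior, and commit logic are far richer):

```python
class ToyShard:
    """Toy model of the primary-shard write path:
    memory buffer -> (refresh, ~1s) searchable segment in the FS cache,
    translog buffer -> (flush, ~5s) translog on disk."""

    def __init__(self):
        self.memory_buffer = []
        self.translog_buffer = []
        self.translog_on_disk = []
        self.segments = []          # searchable only after a refresh

    def write(self, doc):
        # Step 1 and 2: write to the memory buffer and the translog buffer.
        self.memory_buffer.append(doc)
        self.translog_buffer.append(doc)

    def refresh(self):
        """Runs every ~1s: turn the buffer into a searchable segment."""
        if self.memory_buffer:
            self.segments.append(list(self.memory_buffer))
            self.memory_buffer.clear()   # buffer emptied after refresh

    def flush_translog(self):
        """Runs every ~5s: persist the translog buffer to disk."""
        self.translog_on_disk.extend(self.translog_buffer)
        self.translog_buffer.clear()

    def search(self, doc):
        # Only refreshed segments are searchable: writes are near real-time.
        return any(doc in seg for seg in self.segments)
```

The model makes the two trade-offs above concrete: a doc is invisible to search until the next refresh, and a crash before `flush_translog` loses whatever is still in the translog buffer.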
After the primary shard has written the data, it is sent to the replica shard nodes in parallel; once all of them have written successfully, an ack is returned to the coordinating node, and the coordinating node returns an ack to the client, completing one write.
ES update/delete
Elasticsearch implements update and delete by marking docs in a .del file: for a delete operation, the corresponding doc is marked as deleted; for an update operation, the original doc is marked as deleted and a new doc is written.
As mentioned earlier, a segment file is generated every 1s, so segment files keep accumulating. Elasticsearch therefore runs a merge task that merges multiple segment files into one. During the merge, docs in the deleted state are physically removed.
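The merge step can be sketched in a few lines (a toy model: real Lucene merges also re-sort, re-compress, and rewrite the index structures):

```python
def merge_segments(segments, deleted_ids):
    """Merge several segment files into one, physically dropping the docs
    that were marked deleted in the .del file."""
    merged = []
    for seg in segments:
        merged.extend(doc for doc in seg if doc not in deleted_ids)
    return merged
```

This is why deletes are cheap at write time (just a mark) and the disk space is only reclaimed later, when a merge runs.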
ES query
At the simplest level, queries can be divided into two types:
- Query a doc by its ID
- Query matching docs by a search query (query string)
The process of querying a specific doc by ID is:
- Search the translog in memory
- Search the translog files on disk
- Search the segment files on disk
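The three-step ID lookup above can be sketched as a fall-through search (dicts stand in for the real translog and segment files):

```python
def get_by_id(doc_id, translog_memory, translog_disk, segments_disk):
    """Look for a doc in the order described above:
    in-memory translog, then on-disk translog, then segment files."""
    for store in (translog_memory, translog_disk, segments_disk):
        if doc_id in store:
            return store[doc_id]
    return None
```

Because the translog is consulted first, a doc that was just written (and not yet refreshed into a segment) is still found, which is why Get by ID is real-time.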
The process of matching doc according to query is:
- Query the segment files in memory and on disk simultaneously
From the write process described above, we know that Get (fetching a doc by ID) is real-time, while Query (matching docs against a search query) is near real-time, because a new segment file is only generated every second.
Elasticsearch query can be divided into three stages:
- QUERY_AND_FETCH (return the entire Doc content after the query is completed)
- QUERY_THEN_FETCH (first query the corresponding Doc id, and then match the corresponding document according to the Doc id)
- DFS_QUERY_THEN_FETCH (compute the scores first, then query)
- "The scores here refer to the familiar Term Frequency and Document Frequency: roughly, the more frequently a term occurs, the stronger the relevance"
Generally, QUERY_THEN_FETCH is the one we use most. QUERY_AND_FETCH, which returns the full Doc content directly after the query, is only suitable for requests that need to hit a single shard.
The overall flow of QUERY_THEN_FETCH is roughly:
- The client sends a request to some node in the cluster; every node in the cluster is a coordinating node
- The coordinating node forwards the search request to all shards (each request going to either the primary shard or a replica shard)
- Each shard returns its search results (doc ids) to the coordinating node, and the coordinating node merges, sorts, and paginates the data to produce the final result
- The coordinating node then uses the doc ids to pull the actual document data from each node, and finally returns it to the client
What the nodes do during the Query Phase:
- The coordinating node sends the query command to the target shards (forwarding the request to either the primary shard or a replica shard)
- Each data node filters and sorts within its shard, and returns the doc ids to the coordinating node
What the nodes do during the Fetch Phase:
- The coordinating node takes the doc ids returned by the data nodes, aggregates them, and then sends fetch commands to the target data shards (asking for the full Doc records)
- Based on the doc ids sent by the coordinating node, each data node pulls the data actually needed and returns it to the coordinating node
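The two phases can be sketched end to end (a toy model: real Elasticsearch scores and sorts by relevance per shard, and the fetch commands go only to the shards that own each doc id):

```python
def query_then_fetch(shards, matches, size):
    """Toy two-phase search.

    shards:  {shard_name: {doc_id: document}}
    matches: predicate deciding whether a document matches the query
    size:    page size applied by the coordinating node
    """
    # Query phase: each shard returns doc ids only (no document bodies).
    candidate_ids = []
    for shard in shards.values():
        candidate_ids.extend(i for i, doc in shard.items() if matches(doc))

    # Coordinating node merges, sorts, and paginates the ids.
    top_ids = sorted(candidate_ids)[:size]

    # Fetch phase: pull the full documents for the chosen ids.
    docs = {}
    for shard in shards.values():
        for i in top_ids:
            if i in shard:
                docs[i] = shard[i]
    return [docs[i] for i in top_ids]
```

Shipping only doc ids in the first phase keeps the network traffic small while the coordinating node decides which few documents are actually worth fetching in full.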
Original address: https://mp.weixin.qq.com/s?__biz=MzI4Njg5MDA5NA==&mid=2247486522&idx=1&sn=7b6080756d0711c646fb47d5db49fc97&chksm=ebd74d3bdca0c42d35f7e2097e4fbaca0c42d35f7e2097e4fca0c42d35f7e2097e4fbaca0c42d35e