[Search Engine] Principle level: Elasticsearch

I. Introduction

ES serves two purposes:
1. Combined into the ELK stack, it collects and processes logs: nginx, apache, message, secure, and MySQL logs;
2. On its own, it serves as a search engine, especially for large data volumes where you do not want to query the database directly. (Solr can also power search on e-commerce websites.)

II. Introduction to Elasticsearch

2.1 ElasticSearch definition

Definition: Elasticsearch is a real-time distributed storage, search, and analysis engine.
Five keywords: real-time, distributed, storage, search, analysis
Question 1: How does Elasticsearch achieve real-time behavior?
Question 2: How is Elasticsearch's architecture distributed?
Question 3: How does Elasticsearch store, search, and analyze data?

2.2 Why Elasticsearch replaces database LIKE search (three reasons)

Why use Elasticsearch at all? A MySQL database can also do real-time work (high availability, read/write splitting), storage (persisting data), search (LIKE lookups), and analysis (data analysis, the specifics varying from case to case).

The advantage of ES lies in efficient fuzzy query (note the two words: efficient, fuzzy). MySQL can also do fuzzy queries with the LIKE keyword and %, but that approach has three defects:
First, low efficiency: a LIKE + % query with a leading wildcard cannot use an index, which means that once your table is large (say 100 million rows), the query will certainly take seconds;
Second, you cannot control the amount of data returned: even when LIKE + % finds the matching records, it often returns a huge result set, while you may only need, say, 50 records;
Third, it cannot tolerate typos: what users type is often imprecise. For example, if I type "ElastcSeach" into Google (a typo), Google can still infer that I meant Elasticsearch.

Elasticsearch specializes in search and solves all three problems of MySQL search:
First, search is very fast: Elasticsearch is very good at fuzzy search.
Second, it natively supports relevance ranking: results from Elasticsearch carry a score, so only the highest-scoring part needs to be returned to the user.
Third, it matches related records: users can find relevant results even without precise keywords.

III. Characteristics of Elasticsearch

3.1 Why can Elasticsearch do fast "fuzzy matching" and fast "relevance queries"? Standard answer: inverted index + tokenizer

3.1.1 Inverted Index

Question: Why can Elasticsearch do fast "fuzzy matching" and fast "relevance queries"?
Standard answer: the ES tokenizer + the inverted index. Explanation: the tokenizer segments text into terms when data is written to Elasticsearch; the inverted index then maps each term (an incomplete condition) to its 0-n positions, from which the full records are found.

Forward index definition: finding a record from a "complete condition" is called a forward index.
As a function: a given x determines exactly one value of y.
In practice: a book's table of contents is a forward index; a chapter title leads to a unique page number.
Inverted index definition: finding the corresponding records from a single word (an incomplete condition) is called an inverted index.
As a function: a given y value has 0-n corresponding x values; ES searches via the inverted index.
In practice: imagine opening a book to any page. ES records every word and each position where that word appears, then finds the locations from the word, e.g. algorithm -> 2, 13, 42, 56, meaning the word "algorithm" appears on pages 2, 13, 42, and 56.
ES uses tokenizer + inverted index.
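The page-number analogy above can be sketched in a few lines. This is a hypothetical toy, not ES's real implementation: each token maps to the sorted list of document IDs where it appears.

```python
# Minimal sketch of an inverted index (illustrative only, not ES internals):
# map each token to the sorted list of document IDs where it appears.
def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {token: sorted list of doc_ids}."""
    index = {}
    for doc_id, text in docs.items():
        for token in set(text.lower().split()):
            index.setdefault(token, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

# Mirrors the "algorithm -> 2, 13, 42, 56" example from the text.
docs = {2: "the algorithm book", 13: "algorithm design",
        42: "a greedy algorithm", 56: "algorithm notes"}
index = build_inverted_index(docs)
print(index["algorithm"])  # -> [2, 13, 42, 56]
```

Given the word (the incomplete condition), the index returns all 0-n matching positions without scanning every document.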

3.1.2 tokenizer

Question: How does ES segment human language (Chinese, English, Russian, etc.)?
Standard answer: be careful here. ES does not implement word segmentation itself. ES is an index library and search engine that stores the terms and their positions; the segmentation work is handed to a tokenizer (ES ships with built-in ones). The commonly used tokenizers are:
Standard Analyzer: splits on word boundaries and lowercases the terms;
Simple Analyzer: splits on non-letter characters (symbols are filtered out) and lowercases;
Whitespace Analyzer: splits on whitespace, no lowercasing.
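The behavior of the three analyzers can be approximated in a few lines. These are toy stand-ins for illustration; the real Elasticsearch analyzers are more sophisticated:

```python
import re

# Toy sketches of the three built-in analyzers (illustrative approximations):
def standard_analyzer(text):
    # split on word boundaries, lowercase the terms
    return [t.lower() for t in re.findall(r"\w+", text)]

def simple_analyzer(text):
    # split on non-letter characters (symbols filtered out), lowercase
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

def whitespace_analyzer(text):
    # split on whitespace only, keep the original case
    return text.split()

print(standard_analyzer("Hello, World-2"))    # -> ['hello', 'world', '2']
print(simple_analyzer("Hello, World-2"))      # -> ['hello', 'world']
print(whitespace_analyzer("Hello World"))     # -> ['Hello', 'World']
```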

ps: Elasticsearch's built-in tokenizers are all oriented toward English, while our users often search in Chinese. The most widely used Chinese tokenizer today is the IK tokenizer.

How a tokenizer internally segments human-language text algorithmically is not discussed here.
What you need to know is that the search engine itself is only an index library: it stores the terms and term positions, while the actual segmentation is done by the tokenizer. This is true for both ES and Solr.

An Elasticsearch analyzer is mainly composed of three parts:
Character Filters (text filtering, e.g. stripping HTML)
Tokenizer (segmentation rules, e.g. split on words, split on non-letters, split on whitespace)
Token Filters (post-processing of the segmented terms, e.g. whether or not to lowercase them)

3.2 Elasticsearch data structure (Term Index, Term Dictionary, PostingList)

Elasticsearch data structures:
Different data structures take different amounts of time to search; fast lookup needs support from the underlying data structure:
First, searching a linked list is generally O(n);
Second, searching a tree is generally O(log n), better than a linked list;
Third, searching a hash table is generally O(1), better than a tree.
Elasticsearch's fuzzy queries are very fast because its underlying data structures are: Term Index, Term Dictionary, and PostingList.

The three concepts of the Elasticsearch data structure (Term Index, Term Dictionary, PostingList) are explained below.
Term Dictionary definition: stores the terms. When we write a piece of text, Elasticsearch segments it with the tokenizer (producing terms such as Ada/Allen/Sara...); these terms collectively form the Term Dictionary. Because the Term Dictionary contains very, very many terms, it is kept sorted, so a lookup can use binary search instead of traversing the entire Term Dictionary.
PostingList definition: stores the term positions. We find the matching records through the terms; the corresponding document IDs are stored in the PostingList.
Term Index definition: stores term prefixes, held in memory. The Term Dictionary has far too many terms to fit entirely in memory, so Elasticsearch adds a layer called the Term Index, which stores only some prefixes of the terms. The Term Index is kept in memory, making retrieval very fast.
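The lookup path through these structures can be sketched as follows. This is an assumed simplification (the term names are the Ada/Allen/Sara example from the text): a sorted term dictionary is searched by binary division, and each term points to its posting list of document IDs.

```python
from bisect import bisect_left

# Simplified sketch of Term Dictionary + PostingList lookup (not ES internals):
term_dictionary = ["ada", "allen", "sara"]                  # sorted terms
posting_lists = {"ada": [1, 3], "allen": [2], "sara": [1, 2, 3]}  # doc IDs per term

def lookup(term):
    # binary search in the sorted dictionary, then fetch the posting list
    i = bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return posting_lists[term]
    return []

print(lookup("allen"))  # -> [2]
print(lookup("bob"))    # -> []
```

The real Term Index layer would sit in front of this, narrowing the binary search to a small on-disk block of the dictionary.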

3.3 Optimization of Term Index + Optimization of Term Dictionary + Optimization of PostingList

3.3.1 Optimization of Term Index

Term Index optimization: the FST storage format
The Term Index is stored in memory as an FST (Finite State Transducer), which is extremely memory-efficient. FST has two advantages:
1) Small footprint: by sharing term prefixes and suffixes across the dictionary, it compresses the storage space;
2) Fast queries: O(len(str)) query time complexity.

Optimizations around the Term Index (three):
First, the Term Index is kept in memory;
Second, the Term Index is stored as an FST (Finite State Transducer), saving memory;
Third, the Term Dictionary is kept sorted, which makes it convenient to search.

Cheat sheet: to keep the Term Index fast enough, it is stored in memory; and since it lives in memory, its footprint must be kept small. So the Term Index optimization is mainly the FST storage format: first, sharing term prefixes and suffixes compresses the storage space; second, FST gives O(len(str)) query time complexity.

3.3.2 Optimization of Term Dictionary

Cheat sheet: the Term Dictionary optimization is mainly about search performance: it is kept sorted so binary search can be used.

3.3.3 Optimization of PostingList

Optimization of PostingList: FOR encoding to compress the stored data + Roaring Bitmaps to intersect document IDs.
Compressing the stored data with FOR encoding:
The PostingList uses Frame Of Reference (FOR) encoding to compress its contents and save disk space.
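The idea behind FOR can be sketched briefly. This is an illustrative simplification (real Lucene splits postings into fixed-size frames): store the deltas between the sorted doc IDs, then note the minimum bit width needed per delta instead of a full 32 bits each.

```python
# Sketch of Frame Of Reference (FOR) compression (illustrative simplification):
def for_encode(doc_ids):
    """doc_ids must be sorted ascending. Returns (deltas, bits needed per delta)."""
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    bits_per_value = max(d.bit_length() for d in deltas)
    return deltas, bits_per_value

deltas, bits = for_encode([73, 300, 302, 332, 343, 372])
print(deltas, bits)  # -> [73, 227, 2, 30, 11, 29] 8
# Each delta fits in 8 bits instead of a 32-bit integer per raw doc ID.
```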
The PostingList uses Roaring Bitmaps to intersect document IDs.
The PostingList stores document IDs. At query time we often need to intersect and union these ID sets (for example in multi-condition queries). The PostingList uses Roaring Bitmaps for the intersection, which both saves space and produces the intersection result quickly.
Roaring Bitmaps works in three steps:
Step 1: Divide each number by 65536, record the quotient, and take each number modulo 65536, recording the remainder:
1000 becomes (0, 1000): 1000 divided by 65536 gives 0, and the remainder is 1000;
62101 becomes (0, 62101): 62101 divided by 65536 gives 0, and the remainder is 62101;
and so on. After Step 1, the 6 numbers become 6 (key, value) pairs, which by itself saves no storage; look at Step 2.
Step 2: Group the 6 key-value pairs by the quotient (the result of dividing by 65536):
1000 and 62101 form one group: their quotient is 0, and we store the remainders, so we store 1000 and 62101;
131385, 132052, and 191173 form one group: their quotient is 2, and we store the remainders 313, 980, and 60101;
196658 forms its own group: its quotient is 3, and we store the remainder 50.
From the stored results, the storage becomes smaller: every stored value now fits in 16 bits.
Step 3: Use the smallest possible data type to store each group.
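Steps 1 and 2 above can be sketched directly. This is an illustrative toy, not a real Roaring Bitmaps implementation: each doc ID n becomes (n // 65536, n % 65536), grouped by the high 16 bits so every stored value fits in 16 bits.

```python
# Sketch of the Roaring Bitmaps bucketing described in Steps 1-2 (toy version):
def roaring_buckets(doc_ids):
    """Group doc IDs by their high 16 bits; store only the low 16 bits."""
    buckets = {}
    for n in sorted(doc_ids):
        buckets.setdefault(n // 65536, []).append(n % 65536)
    return buckets

ids = [1000, 62101, 131385, 132052, 191173, 196658]
print(roaring_buckets(ids))
# -> {0: [1000, 62101], 2: [313, 980, 60101], 3: [50]}
```

Step 3 then picks the cheapest container per bucket (real Roaring uses arrays for sparse buckets and bitmaps for dense ones).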

Cheat sheet: the PostingList stores the document IDs for the terms in the Term Dictionary; it has two optimizations:
First, since it stores document IDs, the occupied space should be as small as possible, so FOR encoding compresses the stored data;
Second, since it stores document IDs, ID sets must be combined quickly, so the PostingList uses Roaring Bitmaps to intersect document IDs.

3.3.4 Summary: Understand the optimization of ES data structure (why is the ES index database search fast? It is determined by the ES data structure)

Why is the ES index library fast to search? It is determined by the ES data structures, and each data structure has its own optimization:
Term Index optimization + Term Dictionary optimization + PostingList optimization.

Cheat sheet: to keep the Term Index fast enough, it is stored in memory; and since it lives in memory, its footprint must be kept small. So the Term Index optimization is mainly the FST storage format: first, sharing term prefixes and suffixes compresses the storage space; second, FST gives O(len(str)) query time complexity.

Cheat sheet: the Term Dictionary optimization is mainly about search performance: it is kept sorted so binary search can be used.

Cheat sheet: the PostingList stores the document IDs for the terms in the Term Dictionary; it has two optimizations:
First, since it stores document IDs, the occupied space should be as small as possible, so FOR encoding compresses the stored data;
Second, since it stores document IDs, ID sets must be combined quickly, so the PostingList uses Roaring Bitmaps to intersect document IDs.

3.4 The concept and architecture of Elasticsearch (terminology and architecture are linked together)

3.4.1 Concepts in ES

Elasticsearch concepts:
Index: equivalent to a database Table.
Type: abolished in newer Elasticsearch versions (in earlier versions, one Index supported multiple Types, a bit like multiple groups under one topic in a message queue).
Document: equivalent to a row of records in the database.
Field: equivalent to a database Column.
Mapping: equivalent to a database Schema.
DSL: equivalent to SQL for the database (the API through which we read Elasticsearch data).
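The DSL-to-SQL analogy can be made concrete with a minimal search body. The index and field names here are made up for illustration:

```python
import json

# A minimal Query DSL request body next to its rough SQL equivalent
# (hypothetical "products" index with a "title" field):
#   SQL:  SELECT * FROM products WHERE title LIKE '%phone%' LIMIT 10
query = {
    "query": {"match": {"title": "phone"}},  # full-text match on an analyzed field
    "size": 10,                              # cap the number of returned docs
}
body = json.dumps(query)  # what would be POSTed to /products/_search
print(body)
```

Unlike LIKE '%phone%', the match query goes through the analyzer and the inverted index, and the hits come back ranked by relevance score.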

3.4.2 ES Architecture

ES architecture: Elasticsearch is first of all a distributed store.
First, distributed nodes (Master + Slave): an Elasticsearch cluster has multiple nodes, where a node is simply a machine running an Elasticsearch process. Among them one node is the Master Node, mainly responsible for maintaining index metadata and switching the roles of primary and replica shards (sharding is covered next). If the master node goes down, a new master node is elected.
Second, sharding: the outermost unit in Elasticsearch is the Index (comparable to a database table). We can spread an Index's data across different nodes for storage; this operation is called sharding. For example, with 4 nodes in the cluster, an Index can be set to 4 shards, and the data of those 4 shards together forms the index data.
Third, the reasons for sharded storage:
(1) If an Index holds too much data in a single shard, it lives on one node only, and as the data grows, a single node may no longer be able to hold the Index.
(2) Multiple shards can be read and written in parallel (spreading reads and writes across nodes improves throughput).
Fourth, sharding enables high availability: if a node fails, is that part of the data lost?
In Elasticsearch, shards are divided into primary shards and replica shards (precisely to achieve high availability). Writes go to the primary shard, the replica shard copies the primary's data, and both primary and replica can serve reads.
How many primary and replica shards an index gets can be set by configuration.
If a node goes down, the master node mentioned above promotes the corresponding replica shard to primary, so even when a node dies, no data is lost.

IV. ES read and write

4.1 Elasticsearch write process (mainly the internal structure of ES)

Elasticsearch writing process:
As noted above, when we write data to Elasticsearch, it is written to the primary shard. Let's look at the details.
A client writes a piece of data; in the Elasticsearch cluster, a node handles the request:
Every node in the cluster is a coordinating node, meaning every node can do routing. For example, node 1 receives the request but finds that the data should be handled by node 2 (because the primary shard is on node 2), so it forwards the request to node 2.

The coordinating node computes, via a hash, which primary shard the document belongs to, and routes to the corresponding node: shard = hash(document_id) % (num_of_primary_shards)
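The routing rule above can be sketched as a function. Note that ES actually uses the murmur3 hash of the routing value; crc32 stands in here as a deterministic placeholder:

```python
import zlib

# Sketch of the routing rule: shard = hash(document_id) % num_of_primary_shards
# (ES uses murmur3; crc32 is a stand-in so the sketch is deterministic).
def route(document_id, num_of_primary_shards):
    return zlib.crc32(str(document_id).encode()) % num_of_primary_shards

# Any coordinating node computes the same shard for a given ID,
# so any node can forward the request to the right primary shard.
print(route("doc-1", 4))
```

This formula is also why the number of primary shards cannot be changed after index creation: a different divisor would route existing IDs to different shards.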

Once routed to the node holding the corresponding primary shard, the following happens (the text corresponds to the write-path diagram):
(1) The data is written into the memory buffer.
(2) The data is also written into the translog buffer.
(3) Every 1 second, data is refreshed from the memory buffer into the FileSystemCache, generating a segment file; once the segment file exists, it can be queried through the index.
(4) After the refresh completes, the memory buffer is emptied.
(5) Every 5 seconds, the translog is flushed from its buffer to disk.
(6) Periodically (by time or by size), a flush combines the FileSystemCache contents with the translog and commits the index to disk.
Explanation of the steps above:
(1) Elasticsearch first writes data into the memory buffer, then refreshes it into the file system cache every 1 second (once data reaches the file system cache, it can be searched). Therefore: data written to Elasticsearch takes about 1 second to become searchable.
(2) To avoid losing in-memory data when a node goes down, Elasticsearch also writes the data into a log file (the translog). The translog is itself first buffered in memory and flushed to disk every 5 seconds. Therefore: if an Elasticsearch node goes down, up to 5 seconds of data may be lost.
(3) When the translog file on disk grows large enough, or every 30 minutes, a commit is triggered and the in-memory segment files are asynchronously flushed to disk, completing persistence.
Summary: writes go to the memory buffer (segments and the translog are generated periodically) so data becomes searchable and durable, and a commit finally persists everything. After the primary shard is written, the data is sent to the replica nodes in parallel; once all nodes have written successfully, an ack is returned to the coordinating node, which returns an ack to the client, completing one write.
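The write path above can be simulated with a toy model. The timings and structures here are simplified stand-ins, not ES internals: data lands in the memory buffer and translog first, becomes searchable only after a refresh, and becomes durable after a flush.

```python
# Toy simulation of the write path (simplified stand-in, not ES internals):
class ToyShard:
    def __init__(self):
        self.buffer, self.translog, self.segments, self.disk = [], [], [], []

    def write(self, doc):
        self.buffer.append(doc)      # (1) memory buffer
        self.translog.append(doc)    # (2) translog

    def refresh(self):               # (3)/(4) every ~1s: buffer -> searchable segment
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):                 # (6) commit: persist segments, clear translog
        self.disk.extend(self.segments)
        self.translog.clear()

    def search(self, doc):           # only refreshed segments are searchable
        return any(doc in seg for seg in self.segments)

shard = ToyShard()
shard.write("doc1")
print(shard.search("doc1"))  # False: not yet refreshed (hence "near real-time")
shard.refresh()
print(shard.search("doc1"))  # True
```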

4.2 Elasticsearch update and delete (write operation)

Elasticsearch update and delete operation flow:

Step 1: Mark the corresponding doc record as .del;
Step 2: For a delete operation, mark the doc with the delete state;
Step 3: For an update operation, mark the original doc as deleted, then write a new piece of data (update = delete first, then insert).
Common point: both update and delete actually just mark the doc's status as deleted.

As mentioned earlier, ES generates a segment file every second, so segment files keep accumulating. Elasticsearch runs a merge task that merges multiple segment files into one; during the merge, docs in the deleted state are physically removed, completing the physical deletion.
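The mark-then-merge pattern can be sketched as follows. This is an illustrative toy (the dict fields are made up): docs are never modified in place; the old version is marked deleted and a new one appended, and a later merge physically drops the marked docs.

```python
# Toy sketch of update = delete-then-insert plus segment merge (illustrative):
def update(segment, doc_id, new_value):
    for doc in segment:
        if doc["id"] == doc_id:
            doc["deleted"] = True  # mark the old version as .del
    segment.append({"id": doc_id, "value": new_value, "deleted": False})

def merge(segments):
    # merging segments physically drops docs marked deleted
    return [doc for seg in segments for doc in seg if not doc["deleted"]]

seg = [{"id": 1, "value": "old", "deleted": False}]
update(seg, 1, "new")
print(merge([seg]))  # -> [{'id': 1, 'value': 'new', 'deleted': False}]
```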

4.3 Two query methods of Elasticsearch (read operations: Get, Query/Search)

Elasticsearch queries come in two simplest forms:
(1) Query a doc by ID; code: public Document doc(int docID);
(2) Query matching docs by a query (search term); code: public TopDocs search(Query query, int n); public Document doc(int docID);

The first query type, Get: querying a specific doc by ID proceeds as:
Step 1: search the translog in memory;
Step 2: search the translog on disk;
Step 3: search the segment files on disk.
The second query type, Search/Query: matching docs against a query proceeds as:
searching the segment files in memory and on disk at the same time.

Two query types: Get (fetching a doc by ID is real-time) and Query/Search (matching docs by query is near real-time).
Question: Why is Get real-time while Query/Search is only near real-time?
Reason: because a segment file is only generated once per second.

An Elasticsearch query can run in one of three modes:

Mode 1: QUERY_AND_FETCH (return the full doc content as soon as the query completes);
Mode 2: QUERY_THEN_FETCH (first query for the matching doc IDs, then fetch the corresponding documents by doc ID);
Mode 3: DFS_QUERY_THEN_FETCH (first gather the scoring statistics, then run the query).

The "statistics" here refer to term frequency and document frequency (Term Frequency, Document Frequency). As we all know, the higher the frequency, the stronger the relevance.
ps: QUERY_THEN_FETCH is used the most; returning the full doc content directly (QUERY_AND_FETCH) is only suitable for requests that need to hit a single shard.

The overall flow of QUERY_THEN_FETCH is roughly:

(1) The client request reaches some node in the cluster; every node in the cluster is a coordinating node.
(2) The coordinating node forwards the search request to all shards (either the primary shard or a replica shard of each).
(3) Each shard returns its own search results (doc IDs) to the coordinating node, which merges, sorts, and paginates the data to produce the final result.
(4) The coordinating node then pulls the actual document data from each node by doc ID and finally returns it to the client.
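Step (3), the coordinating node's merge, can be sketched briefly. This is a simplified stand-in: each shard returns its top results as (score, doc_id) pairs sorted by score descending, and the coordinating node merges them into the global top-n before the fetch phase pulls the actual docs.

```python
import heapq

# Sketch of the coordinating node's QUERY-phase merge (simplified stand-in):
def merge_shard_results(shard_results, n):
    """shard_results: lists of (score, doc_id), each sorted by score desc.
    Returns the doc IDs of the global top-n hits."""
    merged = heapq.merge(*shard_results, key=lambda pair: -pair[0])
    return [doc_id for score, doc_id in list(merged)[:n]]

shard1 = [(0.9, "a"), (0.4, "b")]
shard2 = [(0.8, "c"), (0.7, "d")]
print(merge_shard_results([shard1, shard2], 3))  # -> ['a', 'c', 'd']
```

Only these top-n doc IDs move on to the FETCH phase, which is why QUERY_THEN_FETCH transfers far less data than fetching full docs from every shard up front.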

What the nodes do during the Query Phase:

(1) The coordinating node sends the query command to the target shards (forwarding the request to a primary or replica shard).
(2) Each data node filters, sorts, etc. within its shard, and returns the doc IDs to the coordinating node.

What the nodes do during the Fetch Phase:

(1) The coordinating node takes the doc IDs returned by the data nodes, aggregates them, and sends fetch commands to the target shards (asking for the full doc records).
(2) The data nodes pull the actually needed data according to the doc IDs sent by the coordinating node, and return it to the coordinating node.

Summary of the ES internal flow: because Elasticsearch is distributed, it must pull the relevant data from each node and finally combine it for the client (ps: Elasticsearch does all of this itself; as users we simply don't perceive it).

V. Interview cheat sheet

Details omitted; the whole article is meant to be memorized.

To recap the full text: ES has two uses. First, as part of the ELK stack to collect and search logs; second, on its own to search specific business data.

First, the Elasticsearch definition + why Elasticsearch replaces database LIKE search (three reasons).

Second, why can Elasticsearch do fast "fuzzy matching" and fast "relevance queries"? Standard answer: inverted index + tokenizer.

Third, the Elasticsearch data structures (Term Index, Term Dictionary, PostingList) + the internal optimizations of each (Term Index optimization + Term Dictionary optimization + PostingList optimization).
3.1 Why is the ES index library fast to search? It is determined by the ES data structures, and each data structure has its own optimization:
Term Index optimization + Term Dictionary optimization + PostingList optimization.
3.2 Cheat sheet: to keep the Term Index fast enough, it is stored in memory; and since it lives in memory, its footprint must be kept small. So the Term Index optimization is mainly the FST storage format: first, sharing term prefixes and suffixes compresses the storage space; second, FST gives O(len(str)) query time complexity.
3.3 Cheat sheet: the Term Dictionary optimization is mainly about search performance: it is kept sorted so binary search can be used.
3.4 Cheat sheet: the PostingList stores the document IDs for the terms in the Term Dictionary; it has two optimizations:

3.4.1 Since it stores document IDs, the occupied space should be as small as possible, so FOR encoding compresses the stored data;
3.4.2 Since it stores document IDs, ID sets must be combined quickly, so the PostingList uses Roaring Bitmaps to intersect document IDs.

Fourth, the concept and architecture of Elasticsearch (terminology and architecture are linked together)

Fifth, ES internal structure + the ES read and write flows (reads are queries, with two methods; writes include update and delete).
5.1 Elasticsearch write process (mainly ES internal structure)
5.2 Elasticsearch update and delete (write operation)
5.3 Two query methods of Elasticsearch (read operations: Get, Query/Search)

VI. Summary

Principle level: Elasticsearch, done.

Code every day, make progress every day! ! !


Origin blog.csdn.net/qq_36963950/article/details/108952827