ElasticSearch Study Notes - Chapter 4: ES Sharding Principles and a Detailed Explanation of the Read and Write Processes

4. Detailed explanation of ES sharding principle and reading and writing process

Before learning the principles of ES sharding and the reading and writing process, you need to first learn some core concepts of ES and related knowledge of the ES cluster environment.

4.1 ES core concepts

4.1.1 Index

Index is equivalent to the database in MySQL. An index is a collection of documents with somewhat similar characteristics.

4.1.2 Type

Type is equivalent to a table in MySQL. A type is a logical classification/partition of an index, and its semantics are completely determined by the user. Typically, a type is defined for documents that have a common set of fields. Different ES versions use types differently.

  • 5.x: multiple types are supported
  • 6.x: an index can contain only one type
  • 7.x: custom types are no longer supported by default; the default type is _doc

4.1.3 Document

Document is equivalent to a row in MySQL. A document is a basic information unit that can be indexed, that is, a piece of data.

4.1.4 Field

Field is equivalent to a column in MySQL; that is, a field is an attribute of a document.

4.1.5 Mapping

Mapping is used to specify how ES handles data, such as the data type of a field, default value, analyzer, whether it can be indexed, etc.

Designing appropriate mappings is one of the key points of using ES well.
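As a minimal sketch (the index name and field names below are made up for illustration, not taken from these notes), a mapping can be declared when the index is created:

PUT /<index name>
{
  "mappings": {
    "properties": {
      "title":      { "type": "text" },
      "created_at": { "type": "date" },
      "views":      { "type": "integer", "index": false }
    }
  }
}

A field declared with "index": false is still stored and returned, but cannot be searched on.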

4.1.6 Sharding

When the data stored in an index grows too large, it may exceed the disk capacity of a single node, or searches against it may become too slow. To solve this problem, Elasticsearch can divide an index into multiple pieces, each of which is called a shard. The desired number of shards can be specified when the index is created. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster.

The main functions of sharding:

  • Allows splitting/expanding content capacity horizontally.
  • Allows distributed, parallel operations on shards, thereby improving performance/throughput.

How a shard is distributed and how its documents are aggregated for search requests is managed entirely by Elasticsearch; all of this is transparent to users and does not need to be a concern.

Note: We mentioned earlier that Elasticsearch is developed based on Lucene, and Lucene also has the concept of indexing. In ES, a Lucene index is called a shard, and an Elasticsearch index is a collection of several shards.

When ES searches in the index, it sends query requests to each shard that belongs to the index (Lucene index), and then merges the results of each shard into a global result set.

4.1.7 Replicas

ES allows users to create one or more copies of a shard. These copies are called replica shards (replicas), and the shards being copied are called primary shards.

The main functions of replicas:

  • Provides high availability in the event of shard/node failure .

    To achieve high availability, ES never places a replica shard on the same node as its primary shard.

  • Scales search volume/throughput because searches can run in parallel on all replicas.

In summary, each index can be divided into multiple shards, and an index can be replicated zero times (no replicas) or more. Once replicated, an index has primary shards (the originals used as the source of replication) and replica shards (copies of the primaries). The number of shards and the number of replicas can be specified when the index is created. After the index is created, the number of replicas can be changed dynamically at any time, but the number of shards cannot be changed. By default, each index in ES gets 1 primary shard and 1 replica shard, which means that if the cluster has at least two nodes, the index will have 1 primary shard plus 1 replica shard (one full copy), for a total of 2 shards. The number of shards should be decided according to the needs of the index.

4.1.8 Allocation

Allocation is the process of assigning shards (primary or replica) to nodes; for a replica it also includes copying data from the primary shard. This process is carried out by the master node.

The cluster environment is explained in the next section.

4.2 Cluster environment

4.2.1 System architecture

(Figure: ES cluster system architecture)

  • node

    Each running ElasticSearch instance is called a node

  • cluster

    A cluster consists of one or more nodes with the same cluster.name, which together share the data and the load. When a node is added to or removed from the cluster, the cluster redistributes all data evenly.

  • master node

    When a node is elected as the master node, it becomes responsible for cluster-wide changes such as creating or deleting indices and adding or removing nodes. The master node does not need to be involved in document-level changes and searches, so even if the cluster has only one master node, growing traffic will not make it a bottleneck. Any eligible node can become the master node.

Users can send requests to any node in the cluster, including the master node. Every node knows where any document is located and can forward a request directly to the node that stores the document. Whichever node receives the request is responsible for gathering the data from the nodes that hold the needed documents and returning the final result to the client. Elasticsearch manages all of this transparently.

4.2.2 Deploy cluster

This example is to deploy a single-machine cluster in a Windows environment.

Specific steps are as follows:

  1. In a suitable disk space, create the ElasticSearch-cluster folder

  2. In the ElasticSearch-cluster folder, place three copies of the unpacked es-7.8.0 directory

(Figure: the three copied ES directories)

  3. Change the configuration of each ES instance. The configuration file path is: your es installation path/config/elasticsearch.yml

    Key configuration of node-9201

    #Cluster name; this setting must be identical on every node in the same cluster
    cluster.name: my-es
    #Node name; this setting must be unique within the cluster
    node.name: node-9201
    #Whether the node is eligible to be the master node
    node.master: true
    #Whether the node is a data node; false means it stores no data
    node.data: true
    #IP address; localhost is used because this is a local test
    network.host: localhost
    #HTTP port
    http.port: 9201
    #TCP transport port; nodes in the same cluster communicate with each other over TCP
    transport.tcp.port: 9301
    #TCP endpoints of the other discoverable nodes in the cluster
    discovery.seed_hosts: ["localhost:9302","localhost:9303"]
    discovery.zen.fd.ping_timeout: 1m
    discovery.zen.fd.ping_retries: 5
    #Master node designated at cluster initialization
    cluster.initial_master_nodes: ["node-9201"]
    #Cross-origin (CORS) configuration
    http.cors.enabled: true
    http.cors.allow-origin: "*"
    

    Key configuration of node-9202

    #Cluster name; this setting must be identical on every node in the same cluster
    cluster.name: my-es
    #Node name; this setting must be unique within the cluster
    node.name: node-9202
    #Whether the node is eligible to be the master node
    node.master: true
    #Whether the node is a data node; false means it stores no data
    node.data: true
    #IP address; localhost is used because this is a local test
    network.host: localhost
    #HTTP port
    http.port: 9202
    #TCP transport port; nodes in the same cluster communicate with each other over TCP
    transport.tcp.port: 9302
    #TCP endpoints of the other discoverable nodes in the cluster
    discovery.seed_hosts: ["localhost:9301", "localhost:9303"]
    discovery.zen.fd.ping_timeout: 1m
    discovery.zen.fd.ping_retries: 5
    #Master node designated at cluster initialization
    cluster.initial_master_nodes: ["node-9201"]
    #Cross-origin (CORS) configuration
    http.cors.enabled: true
    http.cors.allow-origin: "*"
    

    Key configuration of node-9203

    #Cluster name; this setting must be identical on every node in the same cluster
    cluster.name: my-es
    #Node name; this setting must be unique within the cluster
    node.name: node-9203
    #Whether the node is eligible to be the master node
    node.master: true
    #Whether the node is a data node; false means it stores no data
    node.data: true
    #IP address; localhost is used because this is a local test
    network.host: localhost
    #HTTP port
    http.port: 9203
    #TCP transport port; nodes in the same cluster communicate with each other over TCP
    transport.tcp.port: 9303
    #TCP endpoints of the other discoverable nodes in the cluster
    discovery.seed_hosts: ["localhost:9301", "localhost:9302"]
    discovery.zen.fd.ping_timeout: 1m
    discovery.zen.fd.ping_retries: 5
    #Master node designated at cluster initialization
    cluster.initial_master_nodes: ["node-9201"]
    #Cross-origin (CORS) configuration
    http.cors.enabled: true
    http.cors.allow-origin: "*"
    

4.2.3 Start the cluster

  • start up

    Refer to Section 2.1 and start node-9201, node-9202, and node-9203 in sequence.

  • Test (HTTP)

(Figure: cluster health queried in the browser)
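The response shown below can be reproduced by querying the cluster health endpoint of any node; the two requests below are examples against the local test cluster configured above:

GET http://localhost:9201/_cluster/health
GET http://localhost:9201/_cat/nodes?v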

{
    "cluster_name": "my-es",              // cluster name
    "status": "green",                    // current cluster health status
    "timed_out": false,                   // whether the request timed out
    "number_of_nodes": 3,                 // total number of nodes
    "number_of_data_nodes": 3,            // total number of data nodes
    "active_primary_shards": 0,
    "active_shards": 0,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

4.2.4 Failover

  • Create the following index in the cluster to facilitate subsequent learning

    Create a users index with three primary shards and one replica per primary, for example by sending a PUT request to http://localhost:9201/users with the following body:

    {
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }
    
  • Use the browser plug-in to view the overall status of the cluster

    Only nodes node-9201 and node-9202 are started for now.

    Use the elasticsearch-head plug-in to view the status of the specified cluster directly through the browser.

    (Figure: cluster state shown in elasticsearch-head)

    As the figure shows, the cluster health is green, meaning the three primary shards and their three replicas have all been allocated correctly to different nodes (recall from earlier that it is unsafe for a primary shard and its replica to sit on the same node).

    This means that if any node in the cluster fails, our data remains intact. Every newly indexed document is stored on a primary shard first and then copied in parallel to the corresponding replica shard, so documents can be served from either the primary or the replica shard.

    In other words, if node node-9201 goes down, ES queries are not affected, and we can still obtain document data from the replicas.

4.2.5 Horizontal expansion

When the application has been running for a period of time and the amount of data in ES gradually increases, we can achieve horizontal expansion by adding new nodes to the ES cluster. When the third node is started, the cluster will have three nodes and it will reallocate the shards to spread the load.

At this time, start node node-9203 and check the shard allocation of the cluster again.

(Figure: shard allocation after node-9203 joins the cluster)

As you can see from the figure above, when a new node joins the cluster, the cluster redistributes the shards. At this time, the hardware resources (CPU, RAM, I/O) of each node will be shared by fewer shards, and the performance of each shard will be improved.

Each shard is a fully functional search engine capable of using all the resources of the node it lives on. Our index with 6 shards (3 primary shards and 3 replica shards) can therefore be scaled out to at most 6 nodes, with one shard per node, each shard having the full resources of its node.

So, what if we want to expand beyond 6 nodes?

The number of primary shards is determined when the index is created. In effect, this number defines the maximum amount of data the index can store (the actual size also depends on your data, hardware, and usage scenario). However, read operations (searching and retrieving data) can be handled by either primary or replica shards, so the more replica shards you have, the higher the throughput you can achieve.

Therefore, ES allows users to dynamically adjust the number of replica shards on a running cluster. Users can adjust the number of replicas to scale the cluster on demand.

  • Adjust the number of replicas to 2 (two replica shards for each primary shard)

    Send a PUT request to the ES cluster at http://localhost:9201/<index name>/_settings with the following request body:

    {
        "number_of_replicas": 2
    }
    

    (Figure: response to the settings update)

  • View cluster status

    (Figure: cluster state with 3 primary and 6 replica shards)

    The users index now has 9 shards: 3 primary shards and 6 replica shards. This means the cluster could be scaled out to 9 nodes, one shard per node, which would triple search performance compared with the original 3 nodes. Of course, merely adding replica shards to a cluster with the same number of nodes does not improve performance, because each shard then gets a smaller share of each node's resources; more hardware is needed to raise throughput. What the extra replicas do add is data redundancy: with the configuration above, we can lose 2 nodes without losing any data.

4.2.6 Dealing with failures

When the master node of the ES cluster goes down (node-9201 is shut down), the cluster elects a new master from the master-eligible nodes (those whose node.master configuration item is true). Since all three nodes were configured with node.master: true earlier, the cluster will elect one of node-9202 and node-9203 as the new master node.

You can see the ES cluster environment status at this time through the plug-in as follows:

(Figure: cluster state after node-9201 goes down)

As the figure shows, node-9203 is now the master node. Shards 0 and 1, whose primaries were on node-9201, have lost their primary shards, so the new master immediately promotes the corresponding replica shards on node-9202 to primaries; the cluster status turns yellow during this time. This promotion happens instantly, like flipping a switch. Although all three primary shards are present again, we configured two replica shards per primary and only one replica currently exists for each, so the cluster stays yellow. Even in the yellow state, all ES cluster functions can be used as usual.

If we restart node-9201, the cluster can allocate the missing replica shards again, and the cluster returns to a state similar to before. If node-9201 still has its previous shards, it will try to reuse them and only copy the modified data files from the primary shards. Compared with the earlier cluster, only the master node has changed.

(Figure: cluster state after node-9201 rejoins)

The above figure shows the cluster status after starting node-9201 again.

Note: Readers are encouraged to carry out the steps in these sections on failover, horizontal expansion, and failure handling by hand to get a feel for the design ideas behind the ES cluster.

4.2.7 Route calculation

  • routing algorithm

    When a document is added to an ES cluster, it is stored on one primary shard. Since a cluster usually has multiple primary shards, how does Elasticsearch know which primary shard the document should be placed on?

    In order to solve the above problems, Elasticsearch uses an algorithm to perform routing calculations to determine which shard the document should be placed into, and also uses this algorithm to obtain the shard that stores the document when querying.

    The algorithm is as follows:

    shard = hash(routing) % number_of_primary_shards
    

    routing is a variable value, the default is the _id of the document, and can also be defined by the user.

    The meaning of this formula: take the hash of the routing value, divide it by the number of primary shards, and take the remainder. The result is a number between 0 and number_of_primary_shards - 1 (counting starts from 0; with 3 primary shards the range is 0 to 2), which is the shard where the document is stored.

    This explains why the number of primary shards must be decided when the index is created and can never be changed afterwards: if the number changed, all previously computed routing values would become invalid and documents could no longer be found.

  • Custom routing

    All document APIs (get, index, delete, bulk, update and mget) accept a routing parameter, through which we can customize the document-to-shard mapping. A custom routing value can be used to ensure that related documents (for example, all documents belonging to the same user) are stored on the same shard (see the sketch after this list).

    This part will be explained in detail in subsequent chapters.
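    As an illustration of the routing parameter (the index name, document id, field, and routing value below are made up), the same routing value must be supplied on both the write and the read:

    #Index a document with a custom routing value (illustrative)
    PUT http://localhost:9201/users/_doc/1001?routing=user-42
    { "name": "zhangsan" }

    #Read it back with the same routing value
    GET http://localhost:9201/users/_doc/1001?routing=user-42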

4.3 Sharding control

Throughout this section we use the following cluster-test index in the cluster environment; it has 3 primary shards and 1 replica per primary.

(Figure: shard distribution of the cluster-test index)

In this cluster, each node has the ability to handle any request. Each node knows the location of any document in the cluster, so requests can be forwarded directly to the required node. In the following examples, I will send all requests to node-9201, which is called the coordinating node.

Note: At work, in order to achieve load balancing, all nodes in the cluster are generally polled so that they can jointly carry requests, instead of sending all requests to the same node.

4.3.1 Writing process (rough process)

In Elasticsearch, requests that add, delete, or fully update documents belong to the write process. A write must first complete on the primary shard before it is synchronized to the replica shards.

For example request:

PUT/DELETE http://localhost:9201/cluster-test/_doc/1001

Add, fully update or delete a specific document

If no document id is specified when adding a document, ES generates one automatically and performs the route calculation based on that generated id.
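As a concrete sketch of such a write (the document body below is made up for illustration), a full-document index request sent to the coordinating node node-9201 looks like this:

PUT http://localhost:9201/cluster-test/_doc/1001
{
  "title": "es study notes",
  "chapter": 4
}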

Next, we use an example to learn how the writing process in Elasticsearch is performed in the ES cluster.

Assume that we send requests to add, delete, and modify (full) documents to node-9201, then ES will do the following processing:

  1. node-9201 uses the document's _id (or the routing parameter, if specified) to perform the route calculation and determine which shard the document belongs to. Suppose it belongs to shard 0; node-9201 then forwards the request to node-9202 (because primary shard 0 is on node-9202).
  2. node-9202 handles the request on primary shard 0. If it succeeds, it forwards the request to node-9203 so that replica shard 0 performs the same operation. When replica shard 0 succeeds, it notifies node-9202, and node-9202 then reports the successful result back to node-9201.
  3. node-9201 returns the result to the client.

The specific diagram is as follows:

(Figure: write process in the cluster)

When the client receives a successful response, the document change has been executed on the primary shard and all replica shards, so the change is safe. Elasticsearch also provides some parameters that let users intervene in this process (trading data safety for more performance), although they are rarely needed because Elasticsearch is already fast enough.

The following table lists some parameters and their meanings that can intervene in this process.

  • consistency — i.e. consistency. It can be set to one (the write is performed as long as the primary shard is in a normal state), all (the write is performed only when all primary and replica shards are in a normal state), or quorum (the write is allowed when a majority of shard copies are in a normal state). The default is quorum, which means that under the default settings, even before attempting a write, the primary shard requires a specified number (quorum) of shard copies to be active and available. This avoids writes during network failures that could lead to inconsistent data. The quorum is calculated as int((primary + number_of_replicas) / 2) + 1, where number_of_replicas is the number of replicas configured in the index settings, not the number of replicas currently active.
  • timeout — What happens if not enough shard copies are available? Elasticsearch waits for more shards to appear, by default for up to 1 minute. The waiting time can be adjusted with the timeout parameter.

Note: a newly created index has number_of_replicas = 1 by default, which would mean the quorum requires two active shard copies. Such a requirement would prevent any writes on a single-node cluster, so ES stipulates that the quorum rule only takes effect when number_of_replicas is greater than 1.
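Note that the consistency parameter described above comes from older Elasticsearch versions; in 7.x the closest equivalent is the wait_for_active_shards request parameter, shown together with timeout in the sketch below (the values and document body are illustrative):

PUT http://localhost:9201/cluster-test/_doc/1001?wait_for_active_shards=2&timeout=1m
{ "title": "a write that waits for two active shard copies" }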

4.3.2 Reading process (rough process)

The read process discussed here is retrieving a specified document by routing value or ID (be careful to distinguish it from the search process).

For example, the request is GET http://localhost:9201/cluster-test/_doc/1001

Next, we use an example to learn how the reading process in Elasticsearch is performed in the ES cluster.

Assume that we send a request to read data to node-9201, then ES will do the following processing:

  1. node-9201 performs the route calculation for the document and determines which shard it belongs to, for example shard 0. It then uses a round-robin random polling algorithm to pick one copy among primary shard 0 and all of its replicas, so that read requests are load-balanced, and forwards the request to the chosen node, for example node-9203.

    Use the routing parameter or directly use the id to calculate the route and read the data.

  2. Node-9203 processes the request on replica shard 0 and returns the query results to node-9201.

  3. node-9201 returns the query results to the client.

The specific diagram is as follows:

(Figure: read process in the cluster)

4.3.3 Update process (rough process)

Partially updating a document combines the reading and writing processes explained earlier.

Next, we use an example to learn how the update process in Elasticsearch is performed in the ES cluster.

Assume that we send a request to update data to node-9201, then ES will do the following processing:

  1. The node-9201 node performs route calculation on the document and obtains the shard to which the document belongs, for example, it belongs to shard 0.

  2. The node-9201 node forwards the request to the node-9202 node (primary shard 0 is on this node).

  3. The node-9202 node processes the update request: it reads the document, modifies the JSON in the _source field, and tries to re-index the document. If another process is modifying the document at that moment, step 3 is retried, and it gives up after retry_on_conflict attempts (a request sketch follows after these steps).

    Re-indexing the document here means marking the old version as deleted (recording it in the .del file), generating a new version of the document, and writing that new version.

  4. If the node-9202 node successfully updates the document, it will forward the new version of the document to the node-9203 node, and the node-9203 node rebuilds the index (inverted index) for the new version of the document.

    In this step, the node holding the replica shard does the same thing as the primary node (marks the old version as deleted and writes the new version of the document).

  5. After the node-9203 node is successfully updated, a response is returned to the node-9202 node.

  6. The node-9202 node will return the successful update result to the node-9201 node.

  7. The node-9201 node returns the results to the client.
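A request matching this flow might look like the sketch below (the field value and retry count are made up; _update and retry_on_conflict are standard API features):

POST http://localhost:9201/cluster-test/_update/1001?retry_on_conflict=3
{
  "doc": { "title": "updated title" }
}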

The specific diagram is as follows:

(Figure: update process in the cluster)

Note :

When the primary shard forwards changes to the replica shards, it does not forward the update request itself; instead it forwards the new version of the complete document. Keep in mind that these changes are forwarded to the replica shards asynchronously, and there is no guarantee that they arrive in the same order they were sent. If Elasticsearch forwarded only the change request, changes could be applied in the wrong order, resulting in corrupted documents.

4.3.4 Multi-document operation process (rough process)

The multi-document operation process here refers to mget and bulk requests.

Processing flow of mget request:

  1. The client sends an mget request to the node-9201 node.

  2. The node-9201 node builds a multi-document get request for each node that holds the required documents and forwards these requests in parallel, for example to node-9202 and node-9203.

  3. After the node-9202 node and the node-9203 node process the request, the result will be responded to the node-9201 node.

  4. The node-9201 node responds to the client with the request result.

    The entire process is equivalent to a batch of get requests. For the processing of requests by each node, refer to the reading process introduced earlier.

Bulk request processing flow:

  1. The client sends a bulk request to the node-9201 node.

  2. The node-9201 node creates batch requests for each node and forwards these requests in parallel to each node containing the primary shard.

  3. After all nodes have processed the request, the results will be responded to the node-9201 node.

  4. The node-9201 node responds to the client with the request result.

    The entire process is equivalent to a batch of new, deleted, and updated requests. For the processing of requests by each node, refer to the writing process introduced earlier.
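For reference, minimal sketches of the two request types discussed above (index names and document ids are illustrative):

#Multi-get: fetch several documents in one request
GET http://localhost:9201/_mget
{
  "docs": [
    { "_index": "cluster-test", "_id": "1001" },
    { "_index": "cluster-test", "_id": "1002" }
  ]
}

#Bulk: mix index and delete actions in one request (NDJSON body; each action and source on its own line)
POST http://localhost:9201/_bulk
{ "index": { "_index": "cluster-test", "_id": "1003" } }
{ "title": "a new document" }
{ "delete": { "_index": "cluster-test", "_id": "1002" } }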

4.4 Sharding principle

4.4.1 Document search (per-segment search)
  • Immutable inverted index

    Early full-text retrieval built one large inverted index for the entire document collection and wrote it to disk. Whenever new documents needed to be indexed, the whole inverted index had to be rebuilt and replaced: once written to disk, an inverted index was never changed, only replaced in its entirety.

    The advantages of this are:

    1. No lock required. If you never update the index, you don't need to worry about multiple processes modifying data at the same time.

    2. Once the index has been read into the kernel's file system cache, it stays there thanks to its immutability. As long as there is enough space in the file system cache, most read requests go straight to memory instead of hitting the disk, which gives a big performance boost.

    3. Writing a single large inverted index also allows the data to be compressed, reducing disk I/O and the amount of index that needs to be held in memory.

    The disadvantage is equally obvious: to make a new document searchable, the entire inverted index has to be rebuilt, which puts a severe limit either on how much data one inverted index can contain or on how frequently the index can be updated.

  • Dynamically update the inverted index

    To update the inverted index while keeping its immutability, Elasticsearch supplements it: recent changes are reflected by adding new, supplementary inverted indexes instead of rewriting the whole inverted index. At query time, each inverted index is queried in turn, from the oldest to the newest, and the results are merged. This avoids the performance cost of frequently rebuilding the inverted index.

  • Search by segment

    Elasticsearch is developed on top of Lucene, which searches per segment; each segment is itself an inverted index. Besides segments there is also the concept of a commit point, which records all currently available segments. The "new supplementary inverted index" mentioned above is in fact a new segment.

    The relationship between segments and commit points is shown in the figure below.

(Figure: relationship between segments and the commit point)

  • Writing data under segmentation thinking

    When a new document is added to the index, it goes through the following process (here we focus only on how segments and commit points are used):

    1. New documents are added to the memory cache.

      The schematic diagram of the commit point, segments and memory cache at this moment is as follows

      (Figure: commit point, segments, and memory cache before the commit)

    2. From time to time (by default after a certain period of time, or when the amount of data in memory reaches a certain size, the data is committed to disk in batches; the specific details are explained later in the write process internals), the cache is committed:

      • A new segment (supplementary inverted index) is written to disk

      • A new commit point containing the new segment is generated and written to disk

        Once a segment has a commit point it is read-only and can no longer be written to; conversely, while the data is still only in the in-memory buffer it can be written to but not read, so it cannot yet be searched.

      • Disk is synchronized - all writes waiting in the file system cache are flushed to disk to ensure they are written to physical files

    3. A new segment is opened so that the documents it contains can be retrieved

    4. The memory cache is cleared, waiting to receive new documents

      The schematic diagram of the commit point, segments and memory cache at this moment is as follows

      (Figure: commit point, segments, and memory cache after the commit)

  • Deleting/updating data under segmentation thinking

    When data needs to be deleted, the segment containing it is immutable, so the document cannot be removed from the old segment directly. Instead, each commit point has an associated .del file that records the IDs of deleted documents (a logical delete).

    When data needs to be updated, a delete is performed first and then an index: the old document is recorded in the .del file, and an updated version of the document is written to a new segment.

  • Query data based on segmentation thinking

    A query is run against every segment, the per-segment result sets are merged into one large result set, documents recorded as deleted in the .del files are filtered out, and the result is returned to the client.

4.4.2 Near real-time search

As the write flow under the per-segment model in the previous section shows, a newly written document first sits in the memory cache, where it cannot yet be queried; this is why Elasticsearch search is near real-time rather than real-time.

When committing a new segment to disk, the fsync system call is needed to make sure the data is physically written, so it will not be lost on power failure. But fsync is expensive: calling it for every newly added or modified piece of data would cause a huge performance loss.

Elasticsearch therefore uses a lighter-weight way of making a document searchable: fsync is removed from the path between writing a document and being able to search it. To achieve this, the operating system's file system cache (OS Cache) sits between Elasticsearch and the disk.

Between Elasticsearch's memory cache (Memory) and hard disk (Disk) is the operating system's file system cache (OS Cache)

(Figure: memory buffer, OS cache, and disk)

As described in the previous section, documents in the in-memory index buffer are written to a new segment, but here the new segment is first written to the file system cache (which is cheap) and only later flushed to disk in batches (which is expensive). As soon as a file is in the file system cache it can be opened and read like any other file, which means it can be searched.

This process of writing data from the memory buffer to the file system cache is called refresh. By default it is performed every second, or when the data in the memory buffer reaches a certain size (as described in section 4.4.1). This is why Elasticsearch is described as near real-time search: changes to a document become visible within about one second, so it is not real-time, but close to it.

Of course, Elasticsearch also provides a refresh API so that users can trigger a refresh manually, for example by sending a request to /<index name>/_refresh.

Although a refresh is a much lighter operation than a full commit, it still has a performance cost. Manual refreshes are useful when writing tests, but don't refresh after every indexed document in production; instead, applications should be aware of Elasticsearch's near-real-time nature and accept this trade-off.

Some scenarios do not need a refresh every second (for example, bulk-loading large numbers of log files into ES). How can such scenarios be accommodated?

We can adjust the time interval for performing refresh operations by setting the refresh_interval of the index.

{
    "settings": {
        "refresh_interval": "30s"
    }
}

refresh_interval can be updated dynamically on an existing index. In production, when building a large new index, you can turn off automatic refresh and turn it back on once you start using the index.

# Disable automatic refresh
PUT /<index name>/_settings
{ "refresh_interval": -1 }

# Refresh every second
PUT /<index name>/_settings
{ "refresh_interval": "1s" }

At this point the Elasticsearch write process looks like the figure below. (This flow chart is meant to build up the design of the write process step by step; it is not yet complete.)

(Figure: write process with refresh)

4.4.3 Persistent changes

The previous section introduced Elasticsearch's lightweight mechanism of making documents searchable by refreshing them from the memory cache into the file system cache, which removes the fsync system call from that path. But if fsync is never used to write the data from the file system cache to the hard disk (we call writing the file system cache to disk a flush), there is no guarantee the data will survive a power failure or even a normal process exit; in other words, it is not yet persisted.

To guarantee reliability, Elasticsearch needs to make sure data changes are persisted to disk. For this purpose it adds a translog (transaction log) as a compensation mechanism against data loss: all changes that have not yet been persisted to disk are recorded in the translog.

Regarding translog , you need to understand the following three questions.

  • When is data written to translog?

    When data is written to the memory cache, a copy of the data will be appended to the translog. ( This part will be introduced in detail later )

  • When to use data from translog?

    When Elasticsearch starts, besides loading the persisted segments according to the latest commit point, it also replays the translog to re-persist any data that had not yet been persisted to disk.

  • When should the data in translog be cleared?

    When the data in the file system cache is flushed to disk, the old translog is deleted and a new blank translog is generated.

    By default, a flush is performed every 30 minutes, or when the translog becomes too large (512MB by default). Usually this automatic flushing is sufficient. When Elasticsearch tries to recover or reopen an index, it has to replay all the operations in the translog, so recovery is faster when the log is shorter.

    The maximum capacity of translog can be specified through the index.translog.flush_threshold_size configuration parameter.

After adding translog, the writing process of Elasticsearch is shown in the figure below (the detailed process will be explained in detail in the section on the underlying principles of the writing process)

(Figure: write process with translog)

Although the translog is used to prevent data loss, it too carries a risk of losing data.

  • Detailed explanation of writing translog

    As the write flow chart above shows, the translog has a copy in the memory cache and one on disk; it is only reliable once the in-memory translog has been written to disk through the fsync system call.

    The translog can be fsynced in two modes, asynchronous and synchronous, with synchronous being the default. The mode is controlled by the index.translog.durability setting, and the interval of the automatic fsync is controlled by index.translog.sync_interval.

    # asynchronous mode
    index.translog.durability=async
    # synchronous mode
    index.translog.durability=request
    

    In synchronous mode, an fsync is performed after every write request by default, on both the primary shard and the replica shards. The client does not receive a 200 response until the translog has been fsynced to disk on the primary and replica shards. In other words, in this mode a successful write request means the data is already in the on-disk translog, which guarantees its reliability.

    In asynchronous mode, the fsync is performed every 5 seconds by default, and it happens asynchronously. This means that even if a write request gets a 200 response, the data may not yet have been written to the on-disk translog, so this mode is not reliable (if the power goes out within those five seconds, that data is lost).

    Note:

    Elasticsearch's translog fsync defaults to synchronous mode. Although the translog is fsynced on every change, the cost is far lower than performing a full segment flush on every change, so the translog compensation mechanism is a compromise that balances performance and data safety. Unless there are special requirements, the default synchronous mode should be used.
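    If an index really can tolerate losing a few seconds of data, the translog behaviour can be adjusted per index. A sketch using the dynamic index settings API (the values are illustrative; index.translog.sync_interval can also be tuned, but may need to be set when the index is created depending on the version):

    PUT http://localhost:9201/<index name>/_settings
    {
      "index.translog.durability": "async",
      "index.translog.flush_threshold_size": "1gb"
    }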

4.4.4 Segment merging
  • Introduction and process

    Section 4.4.2 explained that the refresh performed every second creates a new segment. Over time an index accumulates a large number of segments, and too many segments not only consume excessive server resources but also hurt retrieval performance.

    As mentioned earlier, every search has to query all segments for matching data and then merge the per-segment result sets, so the more segments there are, the slower retrieval becomes.

    Elasticsearch uses segment merging to solve the problem of too many segments. There is a background process in Elasticsearch that is specifically responsible for segment merging, and it will perform segment merging operations regularly.

    The operation process of segment merging is as follows:

    1. Merge several small segments into a new, larger segment. During the merge, deleted documents (those whose IDs are recorded in the .del file) and old versions of updated documents are not written into the new segment.
    2. Flush the new segment to disk.
    3. Write a new commit point that includes the new segment and excludes the old segments that were merged.
    4. Open the new segment so it can be used for search.
    5. Once all search requests have moved from the small segments to the large segment, delete the old segment files.

    The whole process is transparent to users; Elasticsearch performs it automatically while indexing and searching documents. The segments being merged can be segments already committed to disk or uncommitted segments still in memory, and merging does not interrupt indexing or search.

  • Performance impact of segment merging

    As the description of the merge process shows, segment merging involves reading existing segments, generating new segments, and flushing them to disk. If merging were not throttled, it would consume a lot of I/O and CPU resources and would also hurt search performance.

    By default, Elasticsearch merges at most ten segments at a time, segments larger than 5GB do not take part in merging, and the merge throttle defaults to 20MB/s.

    We can adjust the segment merging rules through the following parameters.

    # Change the merge throttle to 100MB/s
    {
        "persistent": {
            "indices.store.throttle.max_bytes_per_sec": "100mb"
        }
    }
    # Size below which segments are merged preferentially; defaults to 2MB
    index.merge.policy.floor_segment
    # Maximum number of segments merged at once; defaults to 10
    index.merge.policy.max_merge_at_once
    # Maximum size of a segment eligible for merging; defaults to 5GB
    index.merge.policy.max_merged_segment
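    Besides automatic background merging, Elasticsearch also exposes a force-merge API that can be triggered manually, typically only on indexes that are no longer being written to. A minimal example:

    #Merge the index's segments down to (at most) one segment
    POST http://localhost:9201/<index name>/_forcemerge?max_num_segments=1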
    
4.4.5 Detailed explanation of writing process

In section 4.3.1 we learned the rough write process, i.e. how the Elasticsearch cluster as a whole handles a write request from a client. In this section the blogger summarizes the write process again from the perspective of what each node does after it receives the write request.

The writing process is summarized as follows:

  1. The client sends a write request to the coordinating node
  2. The coordinating node performs routing calculation (see section 4.2.7 for details) based on the routing parameter (if not specified, the default is the document's ID) , and calculates the location of the primary shard to which the document belongs.
  3. The coordinating node forwards the write request to the node where the primary shard is located.
  4. After the node where the primary shard is located receives the write request, it enters the write process of a single node.
  5. After the node where the primary shard is located has processed the write request, it will forward the write request in parallel to all the nodes where its replica shards are located. After these nodes receive the request, they will do the same processing.
  6. After the nodes where all replica shards are located process the write request, the processing results will be returned to the node where the primary shard is located, and the node where the primary shard is located will return the processing results to the coordinating node.
  7. The coordinating node returns the results to the client.

The detailed write process of a single node in Elasticsearch is shown in the figure below.

(Figure: detailed single-node write process)

The text description is as follows:

  1. Write the data into the memory cache (index buffer).

  2. Append data to the transaction log ( translog )

  3. By default, a refresh is performed once per second, moving the data in the memory cache into the file system cache (OS Cache), generating a segment and opening it for search; the memory cache (index buffer) is cleared at the same time.

  4. The in-memory translog is fsynced to disk according to the configured durability mode:

    Synchronous mode is to fsync to disk every time data is written.

    Asynchronous mode is fsync to disk every 5 seconds

  5. By default, every 30 minutes, or when the translog exceeds 512MB, a flush is executed to write the data in the file system cache to disk:

    1. Generate new segments and write to disk
    2. Generate a new commit point containing the new segment and write it to disk
    3. Delete the old translog and generate a new translog
  6. Elasticsearch runs a merge process that merges small and medium segments in the background to reduce the number of segments in the index. This happens in the file system cache and on disk.

4.4.6 Detailed explanation of reading process

The reading process is summarized as follows:

  1. The client sends a read request to the coordination node
  2. The coordinating node performs routing calculation (see section 4.2.7 for details) based on the routing parameter (if not specified, the default is the document's ID) , and calculates the shard location to which the document belongs.
  3. Use the round-robin random polling algorithm to randomly select one of the shards to which the document belongs (primary shard or replica shard), and forward the request to the node where the shard is located.
  4. After the node receives the request, it enters the reading process of a single node.
  5. This node returns the query results to the coordinating node.
  6. The coordination node returns the query results to the client.

The reading process of a single node of Elasticsearch is shown in the figure below

(Figure: single-node read process)

The text description is as follows:

  1. The node receives the read data request
  2. Look the data up in the in-memory translog by the doc id in the request; if found, return the result directly.
  3. If step 2 finds nothing, look the data up in the on-disk translog; if found, return the result directly.
  4. If step 3 finds nothing, query each segment on disk; if found, return the result directly.
  5. If none of the previous steps find the data, return null.

Note that when Elasticsearch reads a document, it first tries the translog and then the segments. As mentioned earlier, every document write/update/delete is recorded in the translog first and only later written into segments through refresh and flush, so the translog holds the most recent version of a document. If the target data is found in the translog it is returned directly; otherwise ES falls back to the segments.
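For reference, the document GET API has a realtime parameter (true by default); setting it to false skips this translog lookup and reads only from already-refreshed segments. An illustrative request:

GET http://localhost:9201/cluster-test/_doc/1001?realtime=false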

4.4.7 Detailed explanation of search process

The search process here refers to search. Be careful to distinguish it from the read process introduced above: the read process uses a doc ID to look data up through the forward index, whereas the search process depends on the searchType.

The default value of searchType is Query then Fetch

It can be briefly understood as: first obtain the doc IDs through the inverted index, then fetch the documents through the forward index using those IDs.

There are four searchTypes, as follows:

  • Query And Fetch

    The query is issued to all shards of the index, and each shard returns the matching documents together with the computed ranking information.

    This is the fastest mode, because compared with the other modes each shard needs to be queried only once. However, the total number of results returned by all shards may be n times the size requested by the user.

  • Query Then Fetch(default)

    This search mode is a two-step process.

    1. Send the request to all shards; each shard returns only enough information for ranking and sorting (scores and sort values, but not the documents themselves). The coordinating node then re-sorts the combined results by score and takes the first size entries.
    2. Fetch the actual documents from the relevant shards. The number of documents returned this way equals the size requested by the user.
  • DFS Query And Fetch

    This mode adds an initial scatter phase before the first mode; with this extra step, scoring can be more accurate.

  • DFS Query Then Fetch

    This mode adds an initial scatter phase before the second mode.

Generally, the default mode can be used.
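In recent versions only the two "Then Fetch" modes are accepted by the REST API; query_then_fetch is the default, and DFS scoring can be requested explicitly via the search_type parameter. The query below is illustrative:

GET http://localhost:9201/cluster-test/_search?search_type=dfs_query_then_fetch
{
  "query": { "match": { "title": "es" } },
  "size": 10
}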

Next, we will learn the search process using Query Then Fetch mode.

The search process is divided into two stages, Query (query stage) and Fetch (get stage) .

  • Query

    1. After the coordination node receives the search request, it broadcasts the request to all shards (including primary shards and replica shards).

    2. Each shard executes the search locally, using its inverted index to match documents, and builds a priority queue of results sorted by relevance (containing the document IDs and the values of all fields involved in sorting, such as _score).

      At this stage the query runs against the segments in the OS Cache; some data may still be only in memory, which is why Elasticsearch is a near-real-time search (see the earlier introduction).

    3. Each shard returns its queue of prioritized results to the coordinating node.

    4. The coordinating node creates a new priority queue, sorts the results globally, and obtains a sorted result list (containing the sort field values and document IDs).

    5. Enter the Fetch stage.

  • Fetch

    1. The coordinating node submits multiple GET requests to the relevant shards based on the sorted result list.
    2. After each shard receives the GET request, it executes the reading process described above, obtains detailed document information based on the document ID, and returns it to the coordinating node.
    3. The coordinating node returns the results to the client.

reference

[Silicon Valley] ElasticSearch tutorial from getting started to mastering (based on the new features of ELK technology stack elasticsearch 7.x+8.x)


Source: blog.csdn.net/weixin_42584100/article/details/131904555