Elasticsearch distributed document storage

The shard for a document is determined by the formula `shard = hash(routing) % number_of_primary_shards`. Here `routing` is a variable value: it defaults to the document's `_id`, but can be set to a custom value. The routing value is passed through a hash function, and that number is divided by `number_of_primary_shards` (the number of primary shards); the remainder, a value between `0` and `number_of_primary_shards - 1`, is the shard where the document lives. This is why the number of primary shards must be fixed when the index is created and never changed afterwards: if the number changed, all previously computed routing values would become invalid, and the documents could never be found again.
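The routing formula above can be sketched in a few lines. This is a minimal illustration, not Elasticsearch's actual implementation (which hashes the routing string with Murmur3 internally); `zlib.crc32` is used here only as a stand-in hash, and the routing value and shard count are made up.

```python
import zlib

def route(routing_value: str, number_of_primary_shards: int) -> int:
    """Return the shard number that owns a document.

    Mirrors shard = hash(routing) % number_of_primary_shards.
    crc32 is a stand-in; real Elasticsearch uses Murmur3.
    """
    h = zlib.crc32(routing_value.encode("utf-8"))
    return h % number_of_primary_shards

# The default routing value is the document _id:
shard = route("my-doc-id", 5)
# The result always falls between 0 and number_of_primary_shards - 1:
assert 0 <= shard <= 4
```

Because the shard number depends on `number_of_primary_shards`, changing the shard count would send the same `_id` to a different shard, which is exactly why the count is fixed at index creation.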

Replicas of the same shard are never placed on the same node. We can send requests to any node in the cluster: every node is capable of handling any request, and every node knows the location of every document in the cluster, so it can forward a request directly to the node that holds the data. The node that handles a request is called the coordinating node. When sending requests, it is better to round-robin across all nodes in the cluster to spread the load.
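Round-robining requests across nodes can be done entirely on the client side. A minimal sketch, assuming three hypothetical node addresses (real clients such as the official Elasticsearch clients handle this rotation for you):

```python
from itertools import cycle

# Hypothetical cluster nodes; any of them can act as coordinating node.
nodes = ["node1:9200", "node2:9200", "node3:9200"]
_rotation = cycle(nodes)

def pick_coordinating_node() -> str:
    """Pick the next node in round-robin order for the next request."""
    return next(_rotation)

# Four consecutive requests wrap around after the last node:
order = [pick_coordinating_node() for _ in range(4)]
```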

For create, index, and delete operations, the node that receives the request locates the primary shard from the document ID. After the primary shard executes the operation successfully, it forwards the request to the nodes holding the replica shards. For partial updates, the primary shard does not forward the update request itself when replicating the change; instead, it forwards the full new version of the document. Keep in mind that these changes are forwarded to replica shards asynchronously, and there is no guarantee that they arrive in the same order they were sent. If Elasticsearch forwarded only the change requests, changes could be applied in the wrong order, resulting in corrupted documents.

Create, index, and delete requests are all write operations, which must complete on the primary shard before being copied to the relevant replica shards. The `consistency` parameter can be set to `one` (a write is allowed as long as the primary shard is available), `all` (a write is allowed only when the primary and all replica shards are available), or `quorum` (the default: a write is allowed when a majority of shard copies are available). The quorum size is `int((primary + number_of_replicas) / 2) + 1`. If there are not enough shard copies, Elasticsearch will wait, hoping more will appear; by default it waits up to 1 minute. If needed, the `timeout` parameter makes it give up sooner: `100` means 100 ms, `30s` means 30 seconds. A new index has 1 replica shard by default, which means two active shard copies would be needed to satisfy the quorum. However, these defaults would prevent us from doing anything on a single-node cluster. To avoid this problem, the quorum requirement is only enforced when `number_of_replicas` is greater than 1.
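The quorum formula from the paragraph above is simple enough to compute directly. A small sketch (in Elasticsearch there is always exactly one primary per shard, so `primary` is 1):

```python
def quorum(primary: int, number_of_replicas: int) -> int:
    """Shard copies that must be active for a write when consistency=quorum:
    int((primary + number_of_replicas) / 2) + 1
    """
    return (primary + number_of_replicas) // 2 + 1

# Default index settings: 1 primary copy + 1 replica -> both must be active.
assert quorum(1, 1) == 2
# With 3 replicas, a majority of the 4 copies is 3.
assert quorum(1, 3) == 3
```

Note that `quorum(1, 1) == 2` is exactly why a single-node cluster could never satisfy the default, and why the check is skipped unless `number_of_replicas` exceeds 1.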

When retrieving a document, the node that receives the request locates the shard containing the document from its ID, fetches the document, and returns it to the coordinating node. When handling read requests, the coordinating node round-robins across all copies of the shard on each request, so that reads are load-balanced.
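The coordinating node's rotation over a shard's copies can be sketched the same way. The copy names and placement below are hypothetical, purely for illustration:

```python
from itertools import cycle

# Hypothetical placement of one shard's copies (1 primary + 2 replicas).
_copies = cycle(["primary@node1", "replica@node2", "replica@node3"])

def next_read_copy() -> str:
    """On each read, the coordinating node picks the next copy in rotation,
    spreading read load across the primary and its replicas."""
    return next(_copies)

reads = [next_read_copy() for _ in range(3)]
```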

For `bulk` requests, Elasticsearch reads the raw data directly from the network buffer. It uses newline characters to identify and parse the small action/metadata lines, deciding which shard should handle each request, and forwards those original request lines directly to the correct shards. There is no redundant copying of data and no wasted data structures; the entire request is handled in the smallest amount of memory possible.
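The newline-delimited layout that makes this possible is easy to build by hand. A minimal sketch of serializing actions into the bulk body format (the `bulk_body` helper and the index/document names are made up for illustration; the line layout itself follows the bulk API):

```python
import json

def bulk_body(actions):
    """Serialize (op, metadata, source) tuples into the bulk format:
    one action/metadata line per operation, followed by one source line
    for operations that carry a document. The body ends with a newline."""
    lines = []
    for op, meta, source in actions:
        lines.append(json.dumps({op: meta}))
        if source is not None:  # delete actions have no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("index", {"_index": "website", "_id": "1"}, {"title": "hello"}),
    ("delete", {"_index": "website", "_id": "2"}, None),
])
```

Because each action/metadata line is self-describing, the coordinating node can split the body on newlines and forward each operation's original bytes straight to the right shard without re-serializing anything.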
