Elasticsearch: Routing

Have you ever wondered how Elasticsearch knows where to store a document? How does it know where to look for that document when it has to retrieve, update, or delete it? It all boils down to the concept of routing.

Routing Introduction

Routing is the process of determining which shard a document belongs to, both when storing it and when retrieving it. When Elasticsearch indexes a document, it hashes the document's routing value and uses the result to pick the shard. This is done with the following formula:

shard_num = hash(_routing) % num_primary_shards

By default, "_routing" is equal to the ID of the document, so Elasticsearch hashes the document's ID to determine which shard the document belongs to. The same calculation is used when we update or delete a document.

So when we ask Elasticsearch to retrieve a document by its ID, Elasticsearch uses that ID to locate the shard where the document is stored. If the document exists and was indexed with the default routing, it is on the shard that the routing formula yields.
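The formula can be sketched in a few lines of Python. This is only an illustration: the function name shard_for is made up for this example, and SHA-256 is a stand-in for whatever hash Elasticsearch uses internally.

```python
import hashlib

def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number.

    SHA-256 is only a stand-in hash for this sketch; it is not the
    hash function Elasticsearch uses internally.
    """
    digest = hashlib.sha256(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# By default the routing value is the document ID.
doc_id = "3"
shard_at_index_time = shard_for(doc_id, 5)
shard_at_get_time = shard_for(doc_id, 5)

# The same ID and the same shard count always give the same shard, so a
# get, update, or delete by ID can go straight to that shard.
print(shard_at_index_time == shard_at_get_time)  # prints True
```

Because the result depends only on the routing value and the shard count, no lookup table of "document ID to shard" is needed.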

Things are different when you want to find documents based on characteristics other than their ID, but we'll get to that later.

Default Routing Policy

The beauty of routing is that it is completely invisible to Elasticsearch users. Elasticsearch makes our lives a lot easier by providing a default routing strategy, which saves us from having to deal with all these routing issues ourselves.

You might be wondering if it is possible to change the default routing policy. The answer is yes; you can modify it if you want. However, this is a complex topic, which we will discuss later.

In addition to ensuring that documents are assigned to a shard and that we can fetch them by ID, the default routing strategy also spreads documents evenly across all shards in the index. This keeps any one shard from holding significantly more documents than the others.
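A small simulation makes the even spread visible. It is a sketch under assumptions: the document IDs are invented, and SHA-256 stands in for the real hash, so the exact counts will differ from what a real cluster produces.

```python
import hashlib
from collections import Counter

def shard_for(routing: str, num_primary_shards: int) -> int:
    # SHA-256 is a stand-in hash, not what Elasticsearch uses internally.
    digest = hashlib.sha256(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

NUM_SHARDS = 5
# Hypothetical document IDs; the default routing value is the ID itself.
counts = Counter(shard_for(f"doc-{i}", NUM_SHARDS) for i in range(10_000))

for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {counts[shard]} documents")
```

Each shard should end up with roughly a fifth of the documents, because a good hash scatters the IDs uniformly.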

If we decide to modify the way documents are routed, we must either ensure that they are still evenly distributed, or accept that one shard may end up having more documents than another.
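The risk of uneven shards can be illustrated the same way. In this sketch the tenant names and document counts are hypothetical, and SHA-256 again stands in for the real hash.

```python
import hashlib
from collections import Counter

def shard_for(routing: str, num_primary_shards: int) -> int:
    # SHA-256 is a stand-in hash, not what Elasticsearch uses internally.
    digest = hashlib.sha256(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

NUM_SHARDS = 5
# Hypothetical workload: documents are routed by tenant name, and one
# tenant owns 9,000 of the 10,000 documents.
routing_values = ["tenant-a"] * 9_000 + [f"tenant-{i}" for i in range(1_000)]
counts = Counter(shard_for(value, NUM_SHARDS) for value in routing_values)

for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {counts[shard]} documents")
```

Every document routed with "tenant-a" lands on one shard, so that shard holds at least 9,000 documents while the others share the rest.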

Elasticsearch meta fields and custom routing

Elasticsearch keeps some additional information in the documents it indexes. Elasticsearch includes metafields like _id and _source in addition to the data we provide (such as the JSON we use to add documents). The _id field has the document's unique identifier, while the _source field contains the JSON payload used to index it.

There is also a metafield named _routing. This field is used to customize how a document is routed to a shard.

By default, Elasticsearch hashes the document ID to determine on which shard a document should be placed. However, if we provide a custom routing value when indexing a document, Elasticsearch hashes that value instead of the ID to determine the shard number. For example, we can set the routing value like this:

# routing=1 places each child document on the same shard as its parent (ID 1)
PUT my_index/_doc/3?routing=1&refresh
{
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

PUT my_index/_doc/4?routing=1&refresh
{
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

As shown above, if we set routing to 1, then all documents with the same routing value are written to the same shard. This is sometimes necessary, for example for documents that use the join data type, where parent and child documents must be on the same shard. For details, please refer to the article "Elasticsearch: Join data types".
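The effect of the routing parameter can be sketched as follows. The helper route is hypothetical and SHA-256 is a stand-in hash; the point is only that a custom routing value replaces the document ID as the input to the formula.

```python
import hashlib
from typing import Optional

def route(doc_id: str, num_primary_shards: int,
          routing: Optional[str] = None) -> int:
    """Pick a shard from the custom routing value, or from the ID if none."""
    value = doc_id if routing is None else routing
    # SHA-256 is a stand-in hash, not what Elasticsearch uses internally.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

NUM_SHARDS = 5
parent_shard = route("1", NUM_SHARDS)                # parent uses its own ID
child_3_shard = route("3", NUM_SHARDS, routing="1")  # ?routing=1
child_4_shard = route("4", NUM_SHARDS, routing="1")  # ?routing=1

# Both children share the parent's shard because they share its routing value.
print(parent_shard == child_3_shard == child_4_shard)  # prints True
```

Without the routing parameter, documents 3 and 4 would be hashed by their own IDs and could end up on shards other than the parent's.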

It is critical to understand that the number of primary shards in an index is fixed when the index is created and cannot be changed in place afterwards. This is because Elasticsearch's routing formula depends on that number. Specifically, the formula it uses to determine the shard number is shard_num = hash(_routing) % num_primary_shards.

The routing formula gives different results if the number of shards in the index changes. This is not a big deal for newly indexed documents, but it is for existing ones. Suppose we have an index with two shards and we index a document, which the routing formula places on the second shard. If the number of shards were later increased to four, the formula could yield a different shard for the same document, so existing documents would have to be moved to other shards, and that takes time. So, when creating an index in Elasticsearch, we have to think carefully about the number of shards we want.

Managing Elasticsearch Shards: Considerations and Best Practices

Let's pretend we could simply raise the number of shards in an existing index to, say, five, leaving the existing documents where they are. We could keep adding documents without any noticeable issues. However, when we try to find certain existing documents by ID, Elasticsearch sometimes cannot find them. The ID is run through the routing formula again, and because one of its inputs has changed, the result may differ from the one computed at index time.

This means that Elasticsearch looks for the document on the wrong shard and returns nothing, although the document is in the index. That would be a serious problem, especially when dealing with time-sensitive information. Avoiding it would require a routing mechanism that tolerates changes in the number of shards, which the simple modulo formula does not.
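This thought experiment can be simulated with an in-memory toy model. It is a sketch under assumptions: the shards are plain Python sets, the document IDs are invented, and SHA-256 stands in for the real hash.

```python
import hashlib

def shard_for(routing: str, num_primary_shards: int) -> int:
    # SHA-256 is a stand-in hash, not what Elasticsearch uses internally.
    digest = hashlib.sha256(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# Index 1,000 hypothetical documents into an index with two shards.
doc_ids = [f"doc-{i}" for i in range(1_000)]
shards = [set() for _ in range(2)]
for doc_id in doc_ids:
    shards[shard_for(doc_id, 2)].add(doc_id)

# With the original shard count, every lookup by ID hits the right shard.
misses_before = sum(1 for d in doc_ids if d not in shards[shard_for(d, 2)])

# Pretend the shard count is raised to five without moving any data:
# three empty shards are appended and the formula now uses "% 5".
shards += [set() for _ in range(3)]
misses_after = sum(1 for d in doc_ids if d not in shards[shard_for(d, 5)])

print(f"misses with 2 shards: {misses_before}")
print(f"misses with 5 shards: {misses_after} of {len(doc_ids)}")
```

In this model most lookups miss after the change, even though every document is still stored in the index.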

It's also worth noting that how documents are distributed across the shards plays a role here. Therefore, it is critical to carefully evaluate your indexing strategy and make sure it is optimized for efficiency and accuracy.

Keep this principle in mind when deciding how many shards your index needs. You have to make sure that the documents in the index are evenly distributed; otherwise you might run into performance issues. Balanced shards are much better for search time and in other respects.

However, if you want to change the number of shards, you must create a new index and reindex all documents into it. You can use the reindex API for this; see the article "Elasticsearch: Reindex interface".
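As a rough sketch, the two requests involved can be built like this. The index name my_index_v2 and the helper functions are hypothetical; the script only prints the request bodies and does not contact a cluster.

```python
import json

def create_index_body(number_of_shards: int) -> dict:
    """Settings body for creating the new index with the desired shard count."""
    return {"settings": {"index": {"number_of_shards": number_of_shards}}}

def reindex_body(source_index: str, dest_index: str) -> dict:
    """Body for copying all documents from the old index into the new one."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}

# "my_index_v2" is a hypothetical name for the new index.
print("PUT my_index_v2")
print(json.dumps(create_index_body(5), indent=2))
print("POST _reindex")
print(json.dumps(reindex_body("my_index", "my_index_v2"), indent=2))
```

The printed requests can be pasted into a console: the first creates the new index with five primary shards, and the second copies the documents, which are routed again using the new shard count.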

This sounds bad, but there are APIs that can help. If you're interested, check out the shrink and split APIs, which reduce or increase the number of shards. Each of them creates a new index with a different number of primary shards from an existing index, which spares you a full manual reindex.

In general, you should exercise caution when building your index and consider the number of shards you will need now as well as any future changes. With proper preparation and the right tools, managing shards becomes much easier, and a well-chosen shard count helps search performance.

Origin blog.csdn.net/UbuntuTouch/article/details/131031191