A review of an urgent Elasticsearch production incident in the early hours of the 27th day of the twelfth lunar month

The questions and solutions below come out of discussions held over WeChat and a Tencent Meeting call.

1. Background of the production problem

1.1 Cluster environment of a foreign enterprise

  • 1. About 10 TB of cluster data (the exact figure was not confirmed);

  • 2. The cluster has 2 nodes; resource utilization is shown in the figure below;

[screenshot: node resource utilization]
  • 3. The largest single index is 600 GB

  • 4. Elasticsearch version: 7.17.4

  • 5. The cluster has a total of 200 shards.

1.2 Core Issues

  • Symptom: the cluster fails to come back after a restart; more than 20 hours after startup began, it still has not fully returned to a normal state.

Feedback from the team: previously the longest startup took about 8 hours; now, with the Chinese New Year holiday approaching, the cluster simply will not come up.

1.3 Core error reporting

Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-kibana-7-2023.01.17][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-7-2023.01.17][0]] containing [2] requests]
 ... 11 more
[2023-01-17T06:19:46,326][WARN ][o.e.x.m.e.l.LocalExporter] [datanodeprod5.synnex.org] unexpected error while indexing monitoring document

2. Proposals from the enterprise's technical team

  • (1) Delete (cancel) the shards?

  • (2) Move the shards to other nodes?

  • (3) Reallocate the shards to a node?

  • (4) Update 'number_of_replicas' to 2?

Or something else entirely?

3. Problems found through communication and troubleshooting

Major premise: the solutions proposed by the enterprise's technical team all rely on modifying settings, but the cluster was stuck allocating primary shards and had been red the whole time, so many operations simply could not be performed.

After several rounds of communication and troubleshooting, we found the following problems:

  • First: the cluster is poorly planned; the data volume is 10 TB+ but there are only 2 nodes, and all node roles are left at their default values.

  • Second: there are several extremely large single indices and extremely large single shards, for example one index of 600 GB.

  • Third: neither development nor operations knows how many shards and replicas are configured (strange, but nobody had paid attention).

  • Fourth: an earlier restart took 8 hours, yet the cause was never investigated and nobody cared; it only received real attention once a serious problem hit right before the Spring Festival holiday.

The problem in a nutshell:

  • Because the cluster was poorly planned up front (unreasonable shard and replica counts, unreasonable shard sizes, and so on), restarting it takes a very long time (it either fails to start, or startup stalls at roughly 30% recovery and crawls). During the restart (recovery) period, primary shards remain unassigned or not yet successfully recovered, so the cluster stays red throughout.

  • Neither Kibana nor the Head plug-in can connect (note: with the cluster red, Kibana cannot connect to Elasticsearch); only a handful of commands can be issued through Postman, and the cluster responds extremely slowly or not at all. In that state, about the only thing worth attempting is a lightweight diagnostic call such as the one sketched below.
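When a red cluster still answers the occasional request, one of the cheapest diagnostics is the cluster allocation explain API, which reports why an unassigned shard cannot be allocated. A minimal sketch (the index name is taken from the error log above and is only an example; with no request body the API picks an arbitrary unassigned shard instead):

GET _cluster/allocation/explain
{
  "index": ".monitoring-kibana-7-2023.01.17",
  "shard": 0,
  "primary": true
}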

4. Discussion on solutions

Elasticsearch performs shard recovery automatically in the following five situations:

  • 1. Node startup (this type of recovery is called local store recovery);

  • 2. Replicating a primary shard to a replica shard;

  • 3. Relocating a shard to a different node in the same cluster;

  • 4. Restoring a snapshot (the snapshot restore operation);

  • 5. Clone, shrink, or split operations.

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html

While the cluster is still responsive, the measures in 4.0-4.3 below can be tried.

4.0 Make good use of the recovery API

GET _cat/recovery?v=true&h=i,s,t,ty,st,shost,thost,f,fp,b,bp&s=index:desc
  • Purpose: Returns information about shard recovery, both in progress and completed.

  • Note: After shard recovery is complete, the recovered shards are available for search and indexing.
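When only the in-flight recoveries are of interest, the same endpoint accepts an active_only flag. A sketch (the column list is carried over from the command above; active_only is a standard _cat/recovery parameter):

GET _cat/recovery?v=true&active_only=true&h=i,s,t,ty,st,shost,thost,f,fp,b,bp&s=index:desc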

4.1 Set the number of concurrent shard recoveries per node

Setting the value below is, in essence, a shortcut for setting node_concurrent_incoming_recoveries and node_concurrent_outgoing_recoveries to the same value at the same time.

incoming_recoveries can be loosely understood as recoveries of replica shards onto a node, and outgoing_recoveries as recoveries fed from a node's primary shards.

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 3
  }
}

The default value is 2. In theory, raising it increases recovery concurrency.
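If only one direction needs more concurrency, the two underlying settings can also be set individually rather than through the combined shortcut. A sketch (the values are illustrative only):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 3,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 3
  }
}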

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html

4.2 Delayed shard allocation strategy

When a node leaves the cluster for any reason (human action or a system fault), the master node reacts as follows (numbered as steps to make the later discussion easier):

  • Step 1: Promote a replica shard to primary to replace each primary shard that was on the departed node.

  • Step 2: Allocate replica shards to replace lost replicas (provided there are enough nodes).

  • Step 3: Rebalance the shards evenly across the remaining nodes.

The benefits of the above operations are: avoiding cluster data loss and ensuring high availability of the cluster.

However, the possible side effects are just as obvious: first, these steps add extra load to the cluster (shard allocation consumes system resources); second, if the departed node comes back quickly, whether the whole mechanism was necessary at all is debatable.

In that situation, delaying shard allocation is well worth doing. The setting looks like this:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "6m"
  }
}

The essence of the delayed shard allocation strategy, in plain language: when a node leaves the cluster and you are confident it will be back online within a few minutes (a window you set yourself), only step 1 is triggered while it is away, i.e. the replicas of the shards that lived on the departed node are promoted to primaries. The cluster is then at worst yellow rather than red, steps 2 and 3 do not happen, and the cluster stays usable. Once the node rejoins within the window, its shard copies are assigned as replicas again and the cluster returns to green.

This effectively avoids the shard allocation of steps 2 and 3 and restores full cluster availability in the shortest overall time.
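How many shards are currently being held back by this delay can be read from the cluster health API, which exposes a delayed_unassigned_shards counter. A sketch (filter_path merely trims the response and is optional):

GET _cluster/health?filter_path=status,delayed_unassigned_shards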

https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html

4.3 Limit recovery speed to avoid cluster overload

Elasticsearch caps the bandwidth allotted to recovery so that recovery does not overload the cluster.

The cap can be raised or lowered to make recovery faster or slower, depending on business needs: raise it when resources allow, lower it when they do not.

However, blindly chasing fast recovery by setting the value below too high lets in-flight recoveries consume too much bandwidth and other resources, which can undermine cluster stability.

PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

Precautions:

  • This is a dynamic setting at the cluster level. Once set, it takes effect on every node in the cluster.

  • To limit only a specific node, set the same key statically in that node's elasticsearch.yml configuration file, for example as sketched below.
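A minimal sketch of the per-node static form, assuming the key can also be applied statically on a single node (the value mirrors the dynamic example above and is illustrative only):

# elasticsearch.yml on the node to be limited (illustrative value)
indices.recovery.max_bytes_per_sec: 100mb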

https://www.elastic.co/guide/en/elasticsearch/reference/current/recovery.html

However, none of the above could be applied to the cluster in trouble, mainly because the cluster would not respond.

What to do? We had to find another way.

Could unneeded indices be physically deleted to speed up cluster startup? That is how this bold idea came about. Let's try it!

5. Discussion of a scheme to speed up index recovery

The core of the speed-up: delete the historical "baggage" (large indices that will never be needed again) to indirectly speed up cluster recovery.

Disclaimer:

  • The verification below was done only on a single-node cluster, but the principle is the same for multi-node clusters.

  • This approach inevitably involves operating on files directly; unless there is absolutely no other option, do not manipulate Elasticsearch's files by hand.

5.1 Step 1: Find the uuid of the large index to be deleted.

A few commands were still executable, including _cat/indices:

GET _cat/indices?v&s=docs.count:desc

See below:

[screenshot: _cat/indices output]

The corresponding uuid: "znUfwfE3Rt22GMMqANMbQQ"

The purpose of this step is to identify indices that the business has completely abandoned and that can be deleted, so that physically removing their files reduces the recovery pressure when the cluster restarts.
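Since what is needed is the uuid of the largest indices, it can help to ask _cat/indices for exactly those columns and sort by on-disk size. A sketch (h and s are standard cat parameters; the column names are the usual ones):

GET _cat/indices?v&h=index,uuid,pri,rep,docs.count,store.size&s=store.size:desc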

5.2 Step 2: Find the corresponding directory under the Elasticsearch data path.

[root@VM-0-14-centos data]# find ./ -name "znUfwfE3Rt22GMMqANMbQQ"

The location found:

./indices/znUfwfE3Rt22GMMqANMbQQ

Suggestion: back up first, then perform the physical deletion, for example as sketched below.
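A minimal sketch of that backup step, assuming the paths from the example above and a hypothetical /backup destination:

# copy the index directory to a hypothetical /backup location before removing it from the data path
cp -a ./indices/znUfwfE3Rt22GMMqANMbQQ /backup/znUfwfE3Rt22GMMqANMbQQ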

5.3 Step 3: Restart the cluster again.

In theory the cluster should now start normally; verification on my own small-scale cluster showed no problems.
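After the restart, recovery progress can be watched with the cluster health API; the call below blocks until the cluster reaches at least yellow status or the timeout expires (the 60s value is illustrative):

GET _cluster/health?wait_for_status=yellow&timeout=60s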

Screenshot of an intermediate verification step: [screenshot]

To repeat: unless absolutely necessary, do not manipulate files directly!!

6. Summary

Because a production environment and 10 TB+ of business data are involved, the development team has not immediately adopted the file-deletion plan; the decision will be made after further discussion and verification by the team.


This incident is a warning to the rest of us and to fellow practitioners:

  • First: as the saying goes, "mend the fold after the sheep is lost"; once a problem surfaces, fix it promptly!

  • Second: the team must have someone reasonably familiar with Elasticsearch, otherwise running into problems (especially during holidays) becomes very painful.

  • Third: scheduled snapshots are extremely important.

  • Fourth: whether deploying a single node on a 128 GB machine is necessary remains to be verified.

  • Fifth: randomly twiddling tuning parameters is like rushing to any doctor in a panic; a parameter should only be applied boldly after each one has been verified to work.

  • Sixth: there must be a small test cluster available; experimenting directly in the production environment can turn into a big problem.

  • Seventh: large shards (hundreds of gigabytes) do nothing but harm; in particular they make cluster restarts and failure recovery painful. Don't wait until shards have grown too big to act; ILM must be used. Clearing out historical data and moving it to cold storage is the right way!

——2023-01-18 0:29 first draft, 2023-01-30 23:20 reorganization
