1. Overview

2. Recovery flow

2.1 How is recovery triggered after a restart?

Recovery after a restart is triggered in GatewayService, once the conditions defined by the following settings are met: gateway.expected_nodes, gateway.expected_data_nodes, gateway.expected_master_nodes, gateway.recover_after_time.

Once these conditions are met, allocationService.reroute is executed to allocate shards, which in turn triggers recovery.
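The gate described above can be pictured with a small model. This is a minimal sketch, not the GatewayService code: the class name, the method signature, and the rule "recover at once when every expected count is reached, otherwise wait for recover_after_time" are simplifying assumptions made for illustration.

```java
import java.time.Duration;

/** Simplified, hypothetical model of the recovery gate; not the real GatewayService code. */
final class RecoveryGateSketch {

    /** Returns true when state recovery may start. A threshold of -1 means "not configured". */
    public static boolean shouldRecover(int nodes, int expectedNodes,
                                        int dataNodes, int expectedDataNodes,
                                        int masterNodes, int expectedMasterNodes,
                                        Duration waited, Duration recoverAfterTime) {
        boolean allExpectedPresent =
                reached(nodes, expectedNodes)
                && reached(dataNodes, expectedDataNodes)
                && reached(masterNodes, expectedMasterNodes);
        // Assumption: if every expected_* count is met, recover at once;
        // otherwise recover only after recover_after_time has elapsed.
        return allExpectedPresent || waited.compareTo(recoverAfterTime) >= 0;
    }

    private static boolean reached(int actual, int expected) {
        return expected == -1 || actual >= expected;
    }

    public static void main(String[] args) {
        Duration fiveMinutes = Duration.ofMinutes(5);
        // All expected nodes have joined: no need to wait.
        System.out.println(shouldRecover(3, 3, 2, 2, 3, 3, Duration.ZERO, fiveMinutes));
        // One node is missing and the wait time has not elapsed yet.
        System.out.println(shouldRecover(2, 3, 2, 2, 2, 3, Duration.ofMinutes(1), fiveMinutes));
    }
}
```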

Shard allocation is the process of deciding on which node each shard will be recovered.

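In this phase the elected master decides which shard copies are primaries, which are replicas, and which shards to initialize in this round. The shards chosen for this round move from the unassigned state to the initializing state.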

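Which shards get initialized is determined jointly by index priority and a dozen or so allocation deciders. Only a subset is initialized in each round because the throttling decider limits how many recoveries may run at once.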

public void allocateUnassigned(RoutingAllocation allocation) {
    // ...
    if (allocateUnassignedDecision.getAllocationDecision() == AllocationDecision.YES) {
        // ...
        unassignedIterator.initialize();
    }
}
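The interplay of priority ordering and throttling can be illustrated with a toy model. The sketch below is not Elasticsearch code: the class, the method, and the single maxConcurrent limit are hypothetical stand-ins for the real set of deciders, and ties are broken by shard name only to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical model: pick which unassigned shards to initialize in one reroute round. */
final class ThrottledAllocationSketch {

    /**
     * Orders shards by index priority (highest first) and selects at most
     * maxConcurrent of them; the rest stay unassigned until a later round.
     */
    public static List<String> selectForInitialization(Map<String, Integer> priorityByShard,
                                                       int maxConcurrent) {
        Comparator<Map.Entry<String, Integer>> byPriorityDesc =
                Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey());
        List<String> selected = new ArrayList<>();
        priorityByShard.entrySet().stream()
                .sorted(byPriorityDesc)
                .limit(Math.max(0, maxConcurrent)) // the throttle: only this many per round
                .forEach(entry -> selected.add(entry.getKey()));
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> unassigned = Map.of("logs[0]", 1, "orders[0]", 10, "metrics[0]", 5);
        // With a limit of 2, the two highest-priority shards are initialized first.
        System.out.println(selectForInitialization(unassigned, 2));
    }
}
```

Shards that are not selected are picked up by a later reroute, which matches the batch-by-batch behaviour described above.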

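After marking shards as initializing, the master must make every node aware of the change. It calls routingChangesObserver.shardInitialized to update RoutingNodes, which records whether each shard copy is a primary or a replica and which node it has been assigned to. The master publishes the RoutingNodes to all nodes as part of the new cluster state, and each node performs the actual recovery based on it.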

routingChangesObserver.shardInitialized(unassignedShard, initializedShard);

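reroute returns a new clusterState, which is then applied by the various modules on every node. If a node does not apply the cluster state within cluster.follower_lag.timeout, the master considers the node faulty and removes it from the cluster. The default value of cluster.follower_lag.timeout is 90s; in the version used here it has been raised to 300s.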

 [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@64f16fcf]] took [111457ms], [running applier [org.elasticsearch.script.ScriptService@6577341]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@2db1ce91]] took [135ms], [running applier [org.elasticsearch.snapshots.RestoreService@384aedc9]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@572a6218]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@6a5adec7]] took [0ms], [running applier [org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$2018/653557958@49bc405b]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@2677d800]] took [0ms],
 [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@354c2f04]] took [0ms], [notifying listener [com.huawei.es.security.cluster.ClusterStateManager@2d634675]] took [7ms], [notifying listener [org.elasticsearch.cluster.metadata.TemplateUpgradeService@4561dd49]] took [0ms], [notifying listener [org.elasticsearch.node.ResponseCollectorService@133ea613]] took [0ms], [notifying listener [org.elasticsearch.snapshots.SnapshotShardsService@475e9f7f]] took [0ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksClusterService@70b9290f]] took [0ms], [notifying listener [org.elasticsearch.cluster.routing.DelayedAllocationService@3bf929c4]] took [0ms], [notifying listener [org.elasticsearch.indices.store.IndicesStore@5d111e96]] took [62890ms], [notifying listener [org.elasticsearch.gateway.DanglingIndicesState@6e07c72c]] took [224666ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksNodeService@57cf6ab2]] took [0ms], [notifying listener [org.elasticsearch.gateway.GatewayService@258ade86]] took [0ms], [notifying listener [org.elasticsearch.cluster.service.ClusterApplierService$LocalNodeMasterListeners@703a810f]] took [0ms]

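The clusterState counts as applied only after all of the appliers and listeners above have finished. Shard recovery is started while the IndicesClusterStateService module applies the clusterState, so when too many shards recover concurrently, IndicesClusterStateService takes a long time, and the node may exceed the apply timeout and be removed by the master. Related setting: cluster.routing.allocation.node_concurrent_recoveries: 20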
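To find out which applier or listener dominates the apply time, the log line can be broken down mechanically. The following is a minimal sketch: the class and method names are hypothetical, the regular expression is written only against the log format shown above, and summing the entries assumes they run one after another.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract per-applier and per-listener timings from a cluster state apply log line. */
final class ApplyTimingSketch {

    // Matches e.g. "[running applier [org.example.Foo@1a2b]] took [135ms]".
    private static final Pattern ENTRY = Pattern.compile(
            "\\[(?:running applier|notifying listener) \\[([^\\]]+)\\]\\] took \\[(\\d+)ms\\]");

    /** Returns component name -> milliseconds, in the order the entries appear in the log. */
    public static Map<String, Long> parse(String logLine) {
        Map<String, Long> timings = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(logLine);
        while (matcher.find()) {
            timings.put(matcher.group(1), Long.parseLong(matcher.group(2)));
        }
        return timings;
    }

    /** Total time spent in all appliers and listeners found in the line. */
    public static long totalMillis(Map<String, Long> timings) {
        return timings.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        String line = "[running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@64f16fcf]] took [111457ms], "
                + "[running applier [org.elasticsearch.repositories.RepositoriesService@2db1ce91]] took [135ms]";
        Map<String, Long> timings = parse(line);
        long timeoutMillis = 90_000; // default cluster.follower_lag.timeout mentioned in the text
        timings.forEach((name, millis) -> System.out.println(millis + "ms  " + name));
        System.out.println("exceeds timeout: " + (totalMillis(timings) > timeoutMillis));
    }
}
```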

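After each shard finishes recovery, it sends a shard-started task to the master, which is visible in _cat/pending_tasks. Once the resulting cluster state has been applied, this task triggers the next reroute, which starts the next batch of recoveries.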

@Override
public void clusterStatePublished(ClusterChangedEvent clusterChangedEvent) {
    // ...
    rerouteService.reroute("reroute after starting shards", prioritySupplier.get(), ActionListener.wrap(
        r -> logger.trace("reroute after starting shards succeeded"),
        e -> logger.debug("reroute after starting shards failed", e)));
}

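Every shard-started task triggers a reroute, and that reroute runs at the highest priority, URGENT. In large clusters a reroute takes a long time, so even a small number of shard-started events can cause lower-priority tasks to pile up in pending_tasks. A reroute does not need to run immediately when a shard starts, as long as one eventually runs. 7.x optimizes this: instead of triggering a reroute for every shard-started task, a reroute is triggered once per cluster state update. See upstream issue #44433.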

 274 21ms URGENT shard-started StartedShardEntry{shardId [[myindex11][0]], allocationId [YsP575y9S7aWu9kKOGeAug], primary term [7], message [after peer recovery]}
 273 28ms URGENT shard-started StartedShardEntry{shardId [[myindex33][0]], allocationId [KrSJ-2NNSdeViQEpQ8VjvQ], primary term [6], message [after peer recovery]}
 272 32ms NORMAL cluster_reroute(reroute after starting shards)
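The batching idea behind that optimization can be shown with a toy model. This is a sketch under simplifying assumptions, not the actual change: the class and its methods are hypothetical, and a counter stands in for the real reroute call.

```java
/** Hypothetical model of batching: many shard-started events, one reroute per published cluster state. */
final class RerouteBatchingSketch {
    private boolean reroutePending = false;
    private int rerouteCount = 0;

    /** A shard finished recovery; remember that a reroute is needed, but do not run it yet. */
    public void onShardStarted() {
        reroutePending = true;
    }

    /** Called once per published cluster state; runs at most one reroute for the whole batch. */
    public void onClusterStatePublished() {
        if (reroutePending) {
            reroutePending = false;
            rerouteCount++; // stands in for the actual reroute call
        }
    }

    public int rerouteCount() {
        return rerouteCount;
    }

    public static void main(String[] args) {
        RerouteBatchingSketch sketch = new RerouteBatchingSketch();
        for (int i = 0; i < 100; i++) {
            sketch.onShardStarted();
        }
        sketch.onClusterStatePublished();
        // 100 shard-started events result in a single reroute.
        System.out.println(sketch.rerouteCount());
    }
}
```

The design point is that the reroute is deferred rather than dropped: a publication with no started shards does nothing, and any number of started shards in one publication costs one reroute.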

The recovery of a single shard is carried out by createOrUpdateShards(state); in IndicesClusterStateService.
The recovery itself runs in IndexShard#startRecovery.

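Stage	Description	Analysis
INIT	Recovery has not started yet
INDEX	Restore the Lucene files and copy index data between nodes; primary shard recovery does not copy data between nodes	Slow when replica recovery is file-based; faster when it is operation-based or translog-based
VERIFY_INDEX	Verify the shard; disabled by default	Negligible time
TRANSLOG	Start the Engine, replay the translog, and build the Lucene index
FINALIZE	Cleanup; performs a refresh	Negligible time
DONE	Finished	Negligible time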
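The stage order in the table can be written down as a small state machine. This is an illustrative sketch only: the enum name and its helper methods are hypothetical, and strictly sequential transitions are an assumption taken from the order of the table.

```java
/** Sketch of the recovery stages listed in the table, in the order they run. */
enum RecoveryStageSketch {
    INIT, INDEX, VERIFY_INDEX, TRANSLOG, FINALIZE, DONE;

    /** Returns the stage that follows this one; DONE is terminal and returns itself. */
    public RecoveryStageSketch next() {
        return this == DONE ? DONE : values()[ordinal() + 1];
    }

    /** Assumption: a transition is valid only to the immediately following stage. */
    public boolean canMoveTo(RecoveryStageSketch target) {
        return this != DONE && target == next();
    }

    public static void main(String[] args) {
        RecoveryStageSketch stage = INIT;
        while (stage != DONE) {
            System.out.println(stage + " -> " + stage.next());
            stage = stage.next();
        }
    }
}
```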

Soft-delete

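Hard delete: before 7.4, deletes in Elasticsearch were mark-deletes. Executing the DELETE API marks the document as deleted, and the marked documents are physically removed when segments are merged. Soft delete: to track the full operation history on the primary shard through Lucene segments, deleted documents must survive segment merges. They survive only for a limited retention period, not indefinitely.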

Leases

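ES uses history retention leases to track the operations it expects to need to replay in the future. The lease of each shard copy tracks the sequence number of the first operation that copy has not yet received. As the copy receives new operations, it advances the sequence number in its lease to indicate that those operations will not need to be replayed. Once a soft-deleted operation is no longer held by any lease, Elasticsearch discards it. A lease is valid for 12h by default; if recovery runs after the lease has expired, the full set of segment files must be copied.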
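The consequence for recovery can be sketched as a simple decision. This is a hypothetical model, not Elasticsearch code: the class, the enum, and the parameters are illustrative names, and the two checks (lease still live, history still retained) are a simplification of the rule described above.

```java
import java.time.Duration;

/** Hypothetical model of choosing between operation-based and file-based replica recovery. */
final class LeaseRecoverySketch {

    enum Mode { OPERATION_BASED, FILE_BASED }

    /**
     * @param leaseRetainingSeqNo first sequence number that the lease still retains
     * @param replicaNextSeqNo    first sequence number that the recovering copy is missing
     * @param sinceLeaseRenewal   time since the lease was last renewed
     * @param leasePeriod         lease lifetime (12h by default according to the text)
     */
    public static Mode choose(long leaseRetainingSeqNo, long replicaNextSeqNo,
                              Duration sinceLeaseRenewal, Duration leasePeriod) {
        boolean leaseExpired = sinceLeaseRenewal.compareTo(leasePeriod) > 0;
        boolean historyRetained = replicaNextSeqNo >= leaseRetainingSeqNo;
        // Replay is possible only if the lease is still live and every missing operation is retained.
        return (!leaseExpired && historyRetained) ? Mode.OPERATION_BASED : Mode.FILE_BASED;
    }

    public static void main(String[] args) {
        Duration period = Duration.ofHours(12);
        System.out.println(choose(100, 100, Duration.ofHours(1), period));  // lease live, history retained
        System.out.println(choose(100, 100, Duration.ofHours(13), period)); // lease expired
        System.out.println(choose(100, 50, Duration.ofHours(1), period));   // needed history already discarded
    }
}
```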

Primary shard recovery

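A primary shard is always recovered on a node that already holds a copy of the shard. Besides loading the files, primary recovery still has to replay the translog, because there may be recent operations that the files do not yet contain; without replaying the translog, that data would be lost.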
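Which operations have to be replayed can be illustrated as a filter over sequence numbers. This is a minimal sketch under an assumption: it treats every translog operation with a sequence number above the checkpoint stored in the loaded files as missing from them. The class, method, and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of selecting translog operations to replay after loading the shard files. */
final class TranslogReplaySketch {

    /**
     * Returns the sequence numbers that must be replayed: everything in the translog
     * that is newer than the checkpoint recorded in the loaded files.
     */
    public static List<Long> operationsToReplay(List<Long> translogSeqNos, long committedCheckpoint) {
        List<Long> replay = new ArrayList<>();
        for (long seqNo : translogSeqNos) {
            if (seqNo > committedCheckpoint) {
                replay.add(seqNo);
            }
        }
        return replay;
    }

    public static void main(String[] args) {
        // The files contain operations up to 4; operations 5 and 6 exist only in the translog.
        System.out.println(operationsToReplay(List.of(3L, 4L, 5L, 6L), 4));
    }
}
```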

Replica recovery

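The bottlenecks of replica recovery are:
- The allocation phase assigned the replica to a node that does not hold the segment files of that replica.
- Recovery based on the translog or on leases failed, so a file-based recovery has to be performed instead.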

[Elasticsearch] Elasticsearch startup and index recovery flow
