Exploring ES High Availability: A Detailed Look at Didi's Self-Developed Cross-Data Center Replication Technology

Elasticsearch is an open-source, distributed, full-text search engine built on Lucene and exposed through a RESTful interface. Every field can be indexed, and a cluster can scale horizontally to hundreds of servers, storing, searching, and analyzing terabytes of data in near real time.

Since Didi ES was launched, it has taken on most of the company's end-user-facing retrieval and log scenarios, including map POI retrieval, order retrieval, customer service, internal search, and ELK log scenarios.

In recent years, we have continued to explore improvements in stability, cost, efficiency, and data security:

  • Didi ES serves many online P0-level retrieval scenarios. To improve cluster stability, we developed our own cross-data center replication capability to achieve strongly consistent writes across multiple data centers; combined with the management and control platform, this gives ES multi-site active-active capability;

  • To improve query performance and eliminate query latency spikes, we upgraded version 7.6 in place to run on JDK 17;

  • In the ES log scenario, daily write volume is on the order of 5-10 PB, which puts heavy pressure on both write throughput and business cost. To improve ES write performance, we added support for the ZSTD compression algorithm;

  • Because ES indexes contain a lot of sensitive data, we strengthened ES's security authentication capabilities.

Based on these explorations we have accumulated some experience, which we now present in detail across four articles. This article introduces how Didi ES implements cross-data center replication of indexes to keep them highly available.

Didi Cross Datacenter Replication (DCDR) is a capability self-developed by Didi that natively replicates data from one Elasticsearch cluster to another. As shown in the figure below, DCDR works at the index-template or index level and adopts a master-slave index design: the leader index actively pushes data to the follower index, which guarantees strong consistency between the master and slave index data.

[Figure: DCDR cross-data center replication capability diagram]

The main production applications of DCDR at Didi are as follows:

  • Disaster Recovery (DR) / High Availability (HA): If the primary cluster fails, service can be restored quickly by switching the primary and secondary clusters, enabling multi-site active-active deployment

  • Index Migration: Indexes can be migrated between clusters to keep data balanced across clusters, and to provide cluster-level tiered guarantees for indexes

  • Master-Slave Query Isolation: Thanks to the strong consistency between master and slave indexes and the self-developed ES Admin management and control platform, different business parties can query different clusters without impacting each other

Background and Goals

Native Elasticsearch provides high availability and data reliability within a single cluster. For users with stricter reliability requirements, however, this is not enough. Native Elasticsearch has the following main pain points:

  • It cannot recover quickly from data-center-level failures

  • Moving data between clusters is costly and requires external tools and multiple complex operations

Initially, Didi handled cross-data-center high availability with an external synchronization platform that double-wrote data to different clusters. This approach carried heavy external dependencies, did not support synchronizing historical data, and could not guarantee strong consistency between the master and slave indexes. As the external platform was retired, double-writing was no longer an option. Elasticsearch officially introduced a cross-cluster replication feature in version 6.7.0, but it requires a paid license and only guarantees eventual consistency between the master and slave index data. Didi's core businesses, such as POI retrieval (the pick-up and drop-off location search service in the Didi app) and order retrieval, all require strong consistency of master-slave index data.

To solve these problems and meet business needs, the Didi ES team decided to build its own cross-data center replication capability, the DCDR mentioned above.

DCDR was designed with the following main goals:

  • Ensure strong consistency of master-slave data

  • Guarantee high availability and rapid disaster recovery

  • Support non-stop cross-cluster index migration

  • Provide reliable version upgrades (Elasticsearch's rolling upgrade and full-cluster-restart upgrade paths cannot be rolled back once the upgrade is done)

Technical Foundation

DCDR replicates indexes in a remote cluster to the local cluster. Two things must be handled during replication: synchronizing real-time data and synchronizing historical data. Real-time synchronization relies on the ES write mechanism, while historical synchronization relies on the ES replica recovery mechanism. Before introducing the design and implementation details of DCDR, here is a brief overview of these two processes.

Basic Write Mechanism

An ES write first goes to the primary shard; once the primary shard has applied it, the request is forwarded to the replicas in parallel, and the primary shard returns the write result after the replicas have processed it. The flow is shown below. (Note: in this article Si denotes a specific ES shard, P denotes a primary shard, and R denotes a replica.)

[Figure: ES basic write flow]
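
To make the write flow concrete, here is a minimal, self-contained Java sketch of the primary-then-replicas pattern described above; the Shard and ShardGroup types are hypothetical stand-ins for illustration, not actual Elasticsearch classes.

// Minimal sketch of the primary-then-replica write flow described above.
// All class and method names are hypothetical, for illustration only.
import java.util.List;
import java.util.concurrent.CompletableFuture;

class WriteFlowSketch {

    interface Shard {
        void apply(String document); // index the document locally
    }

    static class ShardGroup {
        final Shard primary;          // P: primary shard
        final List<Shard> replicas;   // R: replica shards

        ShardGroup(Shard primary, List<Shard> replicas) {
            this.primary = primary;
            this.replicas = replicas;
        }

        // 1. Write the primary shard first.
        // 2. Forward the request to all replicas in parallel.
        // 3. Acknowledge the write only after every replica has processed it.
        void write(String document) {
            primary.apply(document);
            CompletableFuture<?>[] forwards = replicas.stream()
                    .map(r -> CompletableFuture.runAsync(() -> r.apply(document)))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(forwards).join(); // wait for all replicas
        }
    }

    public static void main(String[] args) {
        Shard primary = doc -> System.out.println("P indexed: " + doc);
        Shard replica = doc -> System.out.println("R indexed: " + doc);
        new ShardGroup(primary, List.of(replica, replica)).write("{\"order\": 1}");
    }
}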

Replica Recovery Process

To keep data replicas consistent, a replica's data must first be recovered to match the primary shard before it can serve requests normally. ES replica recovery happens at the shard level and is divided into primary shard recovery and replica shard recovery. Since the full recovery process is quite complex, and DCDR's data recovery only depends on the replica shard recovery flow, only the replica shard recovery process is briefly introduced here.

The goal of replica recovery is to bring the local data in line with the primary shard. The process has two stages:

  • In the first stage, the primary shard sends its segment files (parsed data that has already been flushed to disk) to the replica

  • In the second stage, the primary shard sends its translog (data not yet flushed to disk, similar to MySQL's WAL) to the replica. Once both stages finish, the replica's recovery is complete.

The specific process is as follows:     

[Figure: ES replica shard recovery flow]
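
As a rough illustration of the two stages, the following sketch models phase 1 (copying segment files) and phase 2 (replaying translog operations) with simplified, hypothetical types; it is not the real ES recovery code.

// Compact sketch of two-phase replica recovery with hypothetical types.
import java.util.ArrayList;
import java.util.List;

class ReplicaRecoverySketch {

    static class PrimaryShard {
        final List<String> segments = new ArrayList<>();   // data already flushed to disk
        final List<String> translog = new ArrayList<>();   // operations not yet flushed
    }

    static class ReplicaShard {
        final List<String> segments = new ArrayList<>();
        final List<String> appliedOps = new ArrayList<>();
    }

    static void recover(PrimaryShard primary, ReplicaShard replica) {
        // Phase 1: ship segment files to the replica.
        replica.segments.addAll(primary.segments);
        // Phase 2: replay translog operations so the replica catches up with
        // writes that happened during phase 1; afterwards the replica can join
        // the in-sync replica group and serve requests.
        for (String op : primary.translog) {
            replica.appliedOps.add(op);
        }
    }

    public static void main(String[] args) {
        PrimaryShard p = new PrimaryShard();
        p.segments.add("segment_0");
        p.translog.add("index {\"poi\": \"airport\"}");
        ReplicaShard r = new ReplicaShard();
        recover(p, r);
        System.out.println("segments=" + r.segments + ", ops=" + r.appliedOps);
    }
}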

Design

Design Idea

The core idea of DCDR is to treat the shard of the slave index as a remote replica of the corresponding shard of the master index. As shown in the figure below, the primary shard of shard0 of the slave index is treated as a remote replica of the primary shard of shard0 of the master index.

[Figure: the slave index's primary shard acting as a remote replica of the master index's primary shard]

To make this idea clearer, here is a brief introduction to the remote replica: it is an extension of the ES data replication model, and the metadata related to remote replicas is held by the primary shard of the master index. The implementation is based on Microsoft's PacificA algorithm, whose design fits the ES data replication model; this lets us reuse a large amount of ES replica logic, lowers development difficulty, and limits intrusion into the open-source ES kernel.

The following shows how some core terms of the algorithm map to the ES data replication model:

[Figure: mapping between PacificA terms and the ES data replication model]

Detailed Scheme Design

DCDR provides cross-cluster data replication. The first step is to specify which index templates or indexes should replicate data across clusters, i.e., establishing a DCDR link. Next, the slave index, acting as a remote replica, must recover the same data as the master index before it can serve normally, i.e., historical data recovery. Once the slave index has caught up with the master index, newly written data must also reach the slave index, i.e., real-time data synchronization. After these steps the slave index can serve normally, but how do we ensure the data stays reliable? That is the job of master-slave index data quality verification.

Based on these considerations, the overall DCDR scheme consists of four main processes:

[Figure: the four main processes of the DCDR scheme]

1. DCDR link construction

The ES cluster is driven by its cluster state, so building a DCDR link is essentially a matter of changing the cluster state and applying the new state on the relevant nodes. Inside Didi, ES is used through index templates (a set of indexes sharing the same prefix), so the link design needs to support both template links and index links. DCDR link metadata is stored as custom metadata in the ES cluster state. Links follow a unified naming rule and distinguish templates from indexes. The main information is shown below:

Template link:
{
   "templates": {
       "templateA_to_ClusterA": {
           "name": "templateA_to_ClusterA",   // DCDR template link name
           "template": "templateA",           // index template name
           "replica_cluster": "ClusterA"      // slave cluster name
       }
   }
}
Index link:
{
   "Index_202206/Index_202206(ClusterA)": {
       "primary_index": "Index_202206",   // master index name
       "replica_index": "Index_202206",   // slave index name
       "replica_cluster": "ClusterA",     // slave cluster name
       "replication_state": true          // link state
   }
}

The ES cluster exposes a DCDR link creation API; through it the link metadata is written into the cluster state, and the DCDR-related modules enter the data synchronization flow by subscribing to cluster state change events. As shown below:

[Figure: DCDR link construction flow]
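
The following minimal sketch shows the general shape of this "subscribe to cluster state changes and start replication for new links" idea; all types here (ClusterState, ClusterStateListener, DcdrReplicationService) are simplified stand-ins, not the actual ES or DCDR classes.

// Hypothetical sketch of reacting to new DCDR link metadata in the cluster state.
import java.util.HashMap;
import java.util.Map;

class DcdrLinkListenerSketch {

    record DcdrLink(String primaryIndex, String replicaIndex, String replicaCluster) {}

    // Custom metadata carried in the cluster state, keyed by link name.
    static class ClusterState {
        final Map<String, DcdrLink> dcdrLinks = new HashMap<>();
    }

    interface ClusterStateListener {
        void clusterChanged(ClusterState previous, ClusterState current);
    }

    // Reacts to newly added DCDR links by kicking off historical data recovery.
    static class DcdrReplicationService implements ClusterStateListener {
        @Override
        public void clusterChanged(ClusterState previous, ClusterState current) {
            current.dcdrLinks.forEach((name, link) -> {
                if (!previous.dcdrLinks.containsKey(name)) {
                    startReplication(link);
                }
            });
        }

        void startReplication(DcdrLink link) {
            System.out.printf("start recovery: %s -> %s@%s%n",
                    link.primaryIndex(), link.replicaIndex(), link.replicaCluster());
        }
    }

    public static void main(String[] args) {
        ClusterState before = new ClusterState();
        ClusterState after = new ClusterState();
        after.dcdrLinks.put("Index_202206/Index_202206(ClusterA)",
                new DcdrLink("Index_202206", "Index_202206", "ClusterA"));
        new DcdrReplicationService().clusterChanged(before, after);
    }
}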

One design detail is worth noting:

Q: The master and slave indexes share the same name, so how is the index UUID (a random string generated automatically when an index is created) handled?

  • To limit development effort and source-code intrusion, the master and slave indexes keep both the same name and the same UUID

  • When the slave index is created, the master index's UUID is passed through to the slave cluster; the slave index no longer generates its own UUID, which avoids UUID mismatch between master and slave

  • Since the ES index graveyard temporarily keeps tombstones of deleted indexes, the slave cluster scans the graveyard when creating the slave index and removes any entry with the same UUID; this solves the problem that a deleted slave index could not be rebuilt. A sketch of this handling follows below.
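
A compact sketch of this UUID handling, using hypothetical types and a made-up UUID value, might look like this:

// Hypothetical sketch: reuse the master index UUID and clear matching tombstones.
import java.util.HashSet;
import java.util.Set;

class SlaveIndexUuidSketch {

    record IndexMeta(String name, String uuid) {}

    static class SlaveCluster {
        final Set<String> graveyardUuids = new HashSet<>(); // UUIDs of recently deleted indexes

        IndexMeta createSlaveIndex(String indexName, String masterUuid) {
            // Drop the tombstone so an index deleted on the slave side can be rebuilt.
            graveyardUuids.remove(masterUuid);
            // Reuse the master index UUID instead of generating a random one.
            return new IndexMeta(indexName, masterUuid);
        }
    }

    public static void main(String[] args) {
        SlaveCluster slave = new SlaveCluster();
        slave.graveyardUuids.add("aG9sYV9kY2Ry"); // hypothetical UUID of a deleted slave index
        IndexMeta meta = slave.createSlaveIndex("Index_202206", "aG9sYV9kY2Ry");
        System.out.println(meta);
    }
}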

2. Historical data recovery

The historical data recovery scheme borrows from the ES replica recovery strategy: recovery of the slave index, as a remote replica, also happens at the shard level and also involves copying segments and the translog. Historical data recovery is triggered when:

  • A new DCDR link is created and the slave index must recover historical data from the master index

  • A write to a slave index shard fails and the master index's scheduled task rebuilds the DCDR link

[Figure: DCDR historical data recovery flow]

The historical data recovery of the slave index, as a remote replica, is basically the same as ES replica recovery. The main differences (marked in green in the figure) are the trigger condition for recovery in step 1 and the replica group joined in step 6. A few more design details are worth noting:

Q: How is historical data recovery triggered?

  • ES replica recovery is driven by cluster state change events, but recovery of the slave index is cross-cluster, so the only option is for the master cluster to trigger the slave cluster's DCDR historical data recovery through an RPC call.

Q: Shard recovery in ES is very time-consuming. How do we speed up recovery of the slave index's shards so that the slave index can serve quickly?

  • The slave index only needs to recover its own primary shards; once the DCDR historical data recovery of those primary shards completes, the slave index can receive write requests from the master index. Recovery of the slave index's own replicas is left to the ES replica mechanism inside the slave cluster. This greatly shortens the historical data recovery time of a DCDR link.

Q: When can the slave index start receiving write requests from the master index?

  • An ES replica joins the primary shard's replica group and starts receiving its write requests once phase 1 of recovery ends and the replica's engine has started. The slave index recovers in a similar way: its primary shard acts as a remote replica of the corresponding primary shard of the master index, and once phase 1 against the master index's primary shard ends and its own engine starts, it joins that primary shard's remote replica group and begins receiving write requests.

  • The remote replica group is implemented by adding a remote prepared list to ES's ReplicationGroup class, as sketched below.
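
The following is a simplified illustration of a replica group extended with a list of remote replicas; the class is a hypothetical stand-in for ES's ReplicationGroup rather than the actual implementation:

// Hypothetical sketch of a replica group that also tracks remote replicas.
import java.util.ArrayList;
import java.util.List;

class RemoteReplicationGroupSketch {

    static class ReplicationGroup {
        final List<String> inSyncReplicas = new ArrayList<>();   // local replicas of the primary shard
        final List<String> remoteReplicas = new ArrayList<>();   // DCDR: primary shards of the slave index

        // Called once phase 1 of the slave-index primary shard has finished and
        // its engine has started: from now on writes are also forwarded to it.
        void addRemoteReplica(String slaveShardId) {
            remoteReplicas.add(slaveShardId);
        }

        List<String> allWriteTargets() {
            List<String> targets = new ArrayList<>(inSyncReplicas);
            targets.addAll(remoteReplicas);
            return targets;
        }
    }

    public static void main(String[] args) {
        ReplicationGroup group = new ReplicationGroup();
        group.inSyncReplicas.add("local R0");
        group.addRemoteReplica("ClusterA:Index_202206[0][P]");
        System.out.println(group.allWriteTargets());
    }
}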

Q: During DCDR historical data recovery, can the primary shard of the master index be relocated?

  • Shard relocation is a cluster balancing mechanism. Because DCDR recovery is cross-cluster, shard migration cannot be perceived and handled quickly through cluster state changes, so the primary shard must not be relocated. During DCDR data recovery, primary shard relocation is prevented by a lock.

3. Real-time data synchronization

Real-time data synchronization covers how incremental data reaches the slave index after historical data synchronization is complete. As described in the ES write flow above, ES writes the primary shard first and then forwards the write request synchronously to the replicas. Within Didi, the businesses that need multi-site active-active generally do not write huge volumes of data, far from ES's write bottleneck, while some core businesses depend strongly on data consistency. Therefore DCDR forwards data synchronously to both the replicas and the remote replica once the primary shard write succeeds: some write performance is sacrificed to guarantee strong data consistency.

[Figure: DCDR real-time data synchronization]
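
A minimal sketch of this synchronous forwarding, assuming hypothetical ShardTarget handles for the local replicas and the remote replica (the slave index's primary shard):

// Hypothetical sketch of the DCDR write path: primary first, then local replicas
// and the remote replica in parallel, acknowledging only when all succeed.
import java.util.List;
import java.util.concurrent.CompletableFuture;

class DcdrWritePathSketch {

    interface ShardTarget {
        void apply(String doc);
    }

    static void write(ShardTarget primary, List<ShardTarget> localReplicas,
                      ShardTarget remoteReplica, String doc) {
        primary.apply(doc);                          // 1. write the master-index primary shard
        CompletableFuture<?>[] forwards = new CompletableFuture[localReplicas.size() + 1];
        for (int i = 0; i < localReplicas.size(); i++) {
            ShardTarget r = localReplicas.get(i);
            forwards[i] = CompletableFuture.runAsync(() -> r.apply(doc));   // 2a. local replicas
        }
        forwards[localReplicas.size()] =
                CompletableFuture.runAsync(() -> remoteReplica.apply(doc)); // 2b. remote replica
        CompletableFuture.allOf(forwards).join();    // 3. acknowledge only after all succeed
    }

    public static void main(String[] args) {
        ShardTarget masterPrimary = d -> System.out.println("master P: " + d);
        ShardTarget masterReplica = d -> System.out.println("master R: " + d);
        ShardTarget slavePrimary  = d -> System.out.println("slave P (remote): " + d);
        write(masterPrimary, List.of(masterReplica), slavePrimary, "{\"order_id\": 42}");
    }
}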

The real-time synchronization strategy is implemented by forwarding write requests to the remote replica. Several details still need to be considered:

Q: What happens if a write to the remote replica fails?

  • When an ES replica fails a write, it is removed from the in-sync replica group and recovery is run again. The remote replica is handled similarly: it is removed from the remote replica group of the master index's primary shard, the master index stops forwarding write requests to the slave index, and the master index's scheduled check re-runs the data recovery process.

Q: How does the slave index keep its seq_num (a monotonically increasing ID assigned to each operation, used to speed up replica recovery) consistent with the master index?

  • The slave index's shards use a custom engine that directly accepts the seq_num sent from the master index and never generates seq_num values of its own, as illustrated in the sketch below.
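
As a rough illustration, the sketch below contrasts a normal engine that assigns seq_num values with a follower-style engine that only applies the seq_num it receives; both classes are hypothetical and only model the idea:

// Hypothetical sketch: the follower engine reuses the master's seq_num.
import java.util.concurrent.atomic.AtomicLong;

class FollowerEngineSketch {

    record IndexOp(String doc, long seqNo) {}

    // Normal primary-shard engine: assigns a fresh, monotonically increasing seq_num.
    static class PrimaryEngine {
        private final AtomicLong seqNoGenerator = new AtomicLong(-1);

        IndexOp index(String doc) {
            return new IndexOp(doc, seqNoGenerator.incrementAndGet());
        }
    }

    // Slave-index engine: never generates seq_num, only applies the one it receives,
    // so master and slave stay consistent and replica recovery can rely on it.
    static class FollowerEngine {
        private long maxSeqNo = -1;

        void index(IndexOp opFromMaster) {
            maxSeqNo = Math.max(maxSeqNo, opFromMaster.seqNo());
            System.out.println("applied seq_no=" + opFromMaster.seqNo() + " doc=" + opFromMaster.doc());
        }
    }

    public static void main(String[] args) {
        PrimaryEngine master = new PrimaryEngine();
        FollowerEngine slave = new FollowerEngine();
        slave.index(master.index("{\"poi\": \"station\"}"));
        slave.index(master.index("{\"poi\": \"airport\"}"));
    }
}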

Q: How is the master-slave mapping kept consistent? How are mapping updates handled?

  • When a new DCDR link is created, the master index's mapping is copied to the slave cluster and the new slave index is created from it, so the mappings of the master and slave indexes are consistent at link creation time.

  • DCDR follows the remote-replica design and forwards write requests directly to the slave index. So if a field needs to be added later, the master nodes of the master and slave clusters each run their own master task to update the mapping (the mapping update handling on the two sides is identical).

4. Master-slave index data quality verification

The data quality verification step guarantees the reliability of the index data. It periodically checks whether the DCDR metadata in the cluster state matches the current state of the link and acts on the link according to the result. When the data gap between master and slave indexes grows too large, or the link is abnormal, the master cluster proactively disconnects the link and notifies the slave index to recover the missing data. In an ES cluster the master node manages the cluster metadata, so in the verification task it mainly creates link metadata and checks whether the slave index exists; the data nodes store the data, so they decide whether a master-slave shard pair needs to run data recovery.

[Figure: master-slave index data quality verification]
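
A condensed sketch of the split between the master node check and the data node check might look like the following; the threshold and helper names are assumptions for illustration only:

// Hypothetical sketch of the periodic DCDR verification task.
import java.util.Map;
import java.util.Set;

class DcdrCheckTaskSketch {

    static final long MAX_ALLOWED_GAP = 10_000; // hypothetical threshold of pending operations

    // Runs on the master node of the primary cluster: link metadata vs. slave index existence.
    static void masterNodeCheck(Map<String, String> linkToSlaveIndex, Set<String> slaveIndices) {
        linkToSlaveIndex.forEach((link, slaveIndex) -> {
            if (!slaveIndices.contains(slaveIndex)) {
                System.out.println("link " + link + ": slave index missing, recreate it");
            }
        });
    }

    // Runs on the data node holding the primary shard: decide whether recovery is needed.
    static void dataNodeCheck(String link, long primaryCheckpoint, long remoteCheckpoint,
                              boolean remoteReplicaInGroup) {
        long gap = primaryCheckpoint - remoteCheckpoint;
        if (!remoteReplicaInGroup || gap > MAX_ALLOWED_GAP) {
            System.out.println("link " + link + ": disconnect and trigger recovery (gap=" + gap + ")");
        }
    }

    public static void main(String[] args) {
        masterNodeCheck(Map.of("Index_202206/Index_202206(ClusterA)", "Index_202206"),
                Set.of("Index_202206"));
        dataNodeCheck("Index_202206/Index_202206(ClusterA)", 120_000, 100_000, true);
    }
}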

5. Others

After the four steps above, data can be replicated natively from one Elasticsearch cluster to another. Combined with a master-slave switching strategy, cross-cluster high availability is achieved while keeping data strongly consistent. For non-stop cross-cluster index migration, we replicate the data to the destination cluster through DCDR, wait for the historical data recovery to finish, and then perform a master-slave switch. For reliable version upgrades, we copy the data of the version to be upgraded to a standby cluster through DCDR and can quickly switch clusters if the upgrade goes wrong. A rough sketch of this flow follows.
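
The sketch below outlines the migration / upgrade flow built on DCDR, using hypothetical helper methods rather than real platform APIs:

// Hypothetical sketch of non-stop migration or upgrade on top of DCDR.
class DcdrMigrationFlowSketch {

    static void migrate(String index, String sourceCluster, String targetCluster) {
        createDcdrLink(index, sourceCluster, targetCluster);        // 1. start replication
        waitForHistoricalRecovery(index, targetCluster);            // 2. stock data catches up
        switchMasterAndSlave(index, sourceCluster, targetCluster);  // 3. writes/queries move over
        // For version upgrades, keep the link so traffic can be switched back
        // quickly if the upgrade misbehaves.
    }

    static void createDcdrLink(String index, String from, String to) {
        System.out.printf("create DCDR link for %s: %s -> %s%n", index, from, to);
    }

    static void waitForHistoricalRecovery(String index, String cluster) {
        System.out.printf("wait until %s on %s has recovered historical data%n", index, cluster);
    }

    static void switchMasterAndSlave(String index, String from, String to) {
        System.out.printf("switch master/slave for %s: new master is %s%n", index, to);
    }

    public static void main(String[] args) {
        migrate("Index_202206", "ClusterMain", "ClusterA");
    }
}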

Summary

At present, Didi ES runs 6 DCDR slave clusters, with more than 400 DCDR template links and more than 2,000 DCDR index links, covering core Didi businesses such as POI, dos_order, and soda. ES still has room for improvement in query latency spikes, query interaction, shard recovery, and write performance, and we will keep working on these areas to better support business growth.
