Meituan Database Disaster Recovery System Construction Practice

This article describes the construction of Meituan's database disaster recovery system. It covers the business disaster recovery architecture, the capability building of the database disaster recovery platform, the construction of the drill system, and the results achieved so far, and closes with thoughts on future disaster recovery work. We hope it is helpful or inspiring.

  • 1 Introduction to Disaster Recovery

  • 2 Business Disaster Recovery Architecture

    • 2.1 Evolution of Disaster Recovery Architecture

    • 2.2 Meituan Disaster Recovery Architecture

  • 3 Database Disaster Recovery Construction

    • 3.1 Challenges

    • 3.2 Basic High Availability

    • 3.3 Disaster Recovery Construction Path

    • 3.4 Platform Capability Building

    • 3.5 Construction of the Drill System

  • 4 Future Outlook

    • 4.1 Making Up for Shortcomings

    • 4.2 Iterating the Architecture

 1 Introduction to Disaster Recovery 

We usually divide faults into three categories: host faults, data center (computer room) faults, and regional faults. Each type has its own triggering factors, and moving from host to data center to region, the probability of failure decreases while the impact of a failure grows.

The goal of building disaster recovery capability is clear: to be able to cope with and handle large-scale failures at the data center and regional level and keep the business running. In recent years the industry has seen a number of data-center-level failures that caused serious damage to the business and brand of the companies involved, so disaster recovery capability has become a must-have when IT companies build their information systems.

[Figure: fault categories (host, data center, region) and their probability versus impact]

2 Business Disaster Recovery Architecture 

2.1 Evolution of Disaster Recovery Architecture

The disaster recovery architecture has evolved from the earliest single-active mode (intra-city active-standby) to intra-city multi-active, and then to remote (cross-region) multi-active. Along this path, disaster recovery can be divided into three stages: disaster recovery 1.0, 2.0, and 3.0.

  • Disaster recovery 1.0: the disaster recovery system is built around the data and is mostly deployed in active-standby mode; the standby data center carries no traffic, so the structure is essentially single-active.

  • Disaster recovery 2.0: the perspective shifts from the data to the application system. The business is active-active or multi-active within one city, deployed either as intra-city active-active or as intra-city active-active plus a remote cold backup (the "two sites, three centers" pattern). Every data center except the cold backup can handle traffic.

  • Disaster recovery 3.0: centered on the business, it mostly adopts a unitized architecture. Disaster recovery is realized through pairwise mutual backup between units, and depending on where the units are deployed it supports intra-city or cross-region multi-active. Applications built on a unitized architecture have strong disaster recovery and scaling capabilities.

[Figure: the three stages of disaster recovery architecture evolution]

Because companies are at different stages of development, the solutions they adopt also differ. Most of Meituan's business is at the 2.0 stage (intra-city active-active or multi-active), while large-scale businesses with high requirements for regional disaster recovery, regional expansion, or security are at the 3.0 stage. The following introduces Meituan's disaster recovery architecture.

2.2 Meituan Disaster Recovery Architecture

Meituan's disaster recovery architecture mainly includes two types: the N+1 disaster recovery architecture and the SET (unitized) architecture.

N+1 architecture: also known in the industry as a scatter or multi-AZ deployment. A system with capacity C is deployed across N+1 data centers, each providing at least C/N of capacity, so when any one data center goes down the remaining ones can still carry the full capacity C. The core of this solution is to push disaster recovery down into the PaaS components: when a data-center-level or region-level failure occurs, each PaaS component completes its own disaster recovery switchover independently and the business recovers. The overall architecture is shown in the figure below. At the business layer this appears as multi-data-center multi-active; the database uses a master-slave architecture in which a single data center handles write traffic and read traffic is load-balanced across data centers. The database disaster recovery construction described in the rest of this article is oriented toward the N+1 architecture. A small sketch of the underlying capacity rule follows the figure.

[Figure: N+1 (multi-AZ) deployment architecture]
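The capacity rule behind N+1 is easy to make concrete. Below is a minimal sketch (an illustration with assumed numbers, not Meituan's actual tooling) that checks whether a deployment can still carry peak traffic C after losing any single AZ.

```python
# A minimal sketch of the N+1 capacity rule described above (an illustration,
# not Meituan's actual tooling): with peak traffic C spread across N+1 AZs,
# the surviving AZs must still be able to carry C after any single AZ is lost.

def survives_single_az_loss(az_capacity: dict, peak_traffic: float) -> bool:
    """True if losing any one AZ still leaves at least `peak_traffic` of capacity."""
    total = sum(az_capacity.values())
    worst_remaining = total - max(az_capacity.values())  # worst case: lose the biggest AZ
    return worst_remaining >= peak_traffic

# Example: peak traffic of 90k QPS served by 3 AZs (N = 2, so N + 1 = 3).
azs = {"az-a": 45_000, "az-b": 45_000, "az-c": 45_000}
print(survives_single_az_loss(azs, peak_traffic=90_000))                  # True: any 2 AZs give 90k
print(survives_single_az_loss({"az-a": 60_000, "az-b": 30_000}, 60_000))  # False: losing az-a leaves 30k
```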

Unitized architecture: also called the SET-based architecture, this is a disaster recovery architecture oriented toward the application layer. Applications, data, and basic components are partitioned into multiple units along a unified dimension, and each unit serves a slice of traffic as a closed loop. The unit is the unit of deployment, and disaster recovery within a city or across regions is achieved through mutual backup between units. Financial businesses and very large-scale businesses usually choose this architecture. Its advantages are closed-loop traffic, resource isolation, and strong disaster recovery and cross-region scaling capabilities; the drawbacks are that adopting it requires substantial transformation of business systems, and operation and maintenance become more complex. The simplified diagram is shown below, followed by a toy sketch of closed-loop unit routing.

[Figure: simplified unitized (SET) architecture]
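As a toy illustration of the closed-loop routing just described (the routing dimension, unit names, and backup mapping are assumptions made up for this sketch), traffic is partitioned by a single dimension so that each request stays inside its home unit, and each unit has a designated backup unit that takes over during a disaster recovery switch.

```python
# A toy sketch of closed-loop unit (SET) routing; unit names, the routing
# dimension (user_id), and the backup mapping are illustrative assumptions.

UNITS = ["set-bj-01", "set-sh-01", "set-sh-02"]

# Pairwise mutual backup between units: if a unit fails, its slice of users
# is re-routed to its designated backup unit.
BACKUP_UNIT = {"set-bj-01": "set-sh-01",
               "set-sh-01": "set-bj-01",
               "set-sh-02": "set-bj-01"}

def route(user_id: int, failed_units: set = frozenset()) -> str:
    """All reads and writes for a user stay inside one unit (closed loop)."""
    home = UNITS[user_id % len(UNITS)]
    return BACKUP_UNIT[home] if home in failed_units else home

print(route(1001))                              # normal routing to the home unit
print(route(1001, failed_units={"set-sh-02"}))  # 1001 % 3 == 2 -> rerouted to the backup unit
```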

Most of Meituan's internal businesses use the N+1 architecture, while businesses such as food delivery and finance have adopted the unitized architecture. Overall, Meituan has both intra-city multi-active and cross-region multi-active, and the two disaster recovery solutions coexist.

3 Database Disaster Recovery Construction

3.1 Challenges

Challenges from ultra-large-scale clusters: the business is growing rapidly, the server fleet is expanding exponentially, and the data centers keep getting larger. A single large data center now hosts thousands of database clusters and tens of thousands of instances, which brings the following problems:

  • Performance: the high-availability system hits an obvious bottleneck in concurrent failure handling.

  • Risk of disaster recovery failure: the management and control link grows more complex as the number of clusters increases, and a problem in any single link can invalidate the overall disaster recovery capability.

  • More frequent failures: with more and larger clusters, large-scale failures that used to be low-probability events become common, and their frequency keeps rising.

High drill cost and low drill frequency: core capabilities are insufficiently verified, so the ability to handle large-scale failures is largely unknown, and even verified capabilities are hard to "keep fresh". Of the capabilities needed to handle data-center-level failures, a large part is either unverified or exists only in paper analysis, and as the architecture evolves and iterates, keeping the already-verified capabilities up to date is also very difficult.

As a stateful service, the database faces greater difficulty and challenge than most components in building the ability to handle large-scale failures.

3.2 Basic High Availability

Meituan mainly runs two database architectures: the master-slave architecture and the MGR (MySQL Group Replication) architecture.

[Figure: master-slave and MGR database architectures]

  • Master-slave architecture: applications access the database through the database middleware. When a fault occurs, the high-availability system performs fault detection, topology adjustment, and configuration delivery, and the application recovers.

  • MGR architecture: the application also accesses the database through the middleware, but the middleware (internally, a Zebra variant adapted to MGR) performs topology detection and awareness itself. Once MGR switches over, the middleware detects the new topology, adjusts the data source, and the business resumes (a sketch of this kind of topology discovery appears after the figure below).

  • Meituan's high-availability architecture: the high availability of Meituan's master-slave clusters is based on a secondary development of Orchestrator and is essentially a centralized management and control architecture. As shown in the figure below, there are multiple high-availability groups, each hosting a subset of database clusters, deployed across the two regions of Beijing and Shanghai. The underlying core components, such as the core CMDB and WorkflowDB, are deployed only in Beijing, so once the cross-region dedicated line to Beijing has a problem, high availability on the Shanghai side fails and becomes unavailable.

[Figure: Meituan's centralized high-availability architecture based on Orchestrator]
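As a rough sketch of the topology awareness that MGR-adapted middleware performs (an illustration, not Zebra's actual implementation), the current primary can be read from MySQL's performance_schema; the MEMBER_ROLE column is available from MySQL 8.0.

```python
# A rough sketch of MGR primary discovery (illustrative; not Zebra's actual
# implementation). MySQL exposes group membership in performance_schema, and
# MEMBER_ROLE identifies the primary (available in MySQL 8.0).
import pymysql

def discover_mgr_primary(seed_host: str, port: int, user: str, password: str):
    """Return (host, port) of the ONLINE primary of the group, as seen from a seed node."""
    conn = pymysql.connect(host=seed_host, port=port, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT MEMBER_HOST, MEMBER_PORT "
                "FROM performance_schema.replication_group_members "
                "WHERE MEMBER_STATE = 'ONLINE' AND MEMBER_ROLE = 'PRIMARY'"
            )
            row = cur.fetchone()
            if row is None:
                raise RuntimeError("no ONLINE primary visible from this node")
            return row[0], int(row[1])
    finally:
        conn.close()

# The middleware would poll this view (or react to connection errors), and once
# a role change is observed, repoint the write data source to the new primary
# so that the business recovers without manual intervention.
```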

3.3 Disaster Recovery Construction Path

  • Disaster recovery construction path: determine the disaster recovery goals, formulate the disaster recovery standard, build the disaster recovery platform, consolidate basic capabilities, verify through drills, and operate on residual risks.

  • Flywheel of disaster recovery construction: the inner loop is platform capability building, iterating continuously from the proposal of disaster recovery requirements through development and launch, experience improvement, and user adoption, to new requirements raised when problems are found. The outer loop is the drill platform: high-frequency drills (or real faults) expose problems, improvements are proposed, and the platform's capabilities keep iterating.

[Figure: the disaster recovery construction flywheel]

3.4 Platform Capability Building

To build and improve the disaster recovery capability of the database service, an internal project called DDTP (Database Disaster Tolerance Platform) was established, focused on improving the database's ability to handle large-scale failures. It delivers two platform products: a disaster recovery management and control platform and a database drill platform.

The disaster recovery management and control platform focuses on defense. Its core functions are escape before a failure, observation and stop-loss during a failure, and recovery after a failure. The database drill platform focuses on attack: it supports multiple fault types and fault-injection methods and provides core capabilities such as fault orchestration and fault recovery. The second article in this series, "Database Attack and Defense Drill Construction Practice", covers the drill platform in detail. The rest of this section focuses on the disaster recovery management and control platform, starting from the overall picture:

[Figure: panorama of the disaster recovery management and control platform]

  • Database services: basic database services including MySQL, Blade, MGR, and others.

  • Basic capability layer: mainly backup and recovery, resource management, elastic scaling, master-slave high availability, and metrics monitoring. These capabilities are the foundation of stability assurance, but they need to be further hardened for disaster recovery so that they can handle large-scale failure scenarios.

  • Management and control orchestration layer: its core is OOS (Operation Orchestration Service), which orchestrates basic capabilities on demand into corresponding procedures, also called service-oriented plans. Each plan corresponds to one or more specific operation and maintenance scenarios; disaster recovery plans fall into this category (a simplified sketch of such a plan follows this list).

  • Platform service layer: the capability layer of the disaster recovery management and control platform, including: 1) disaster recovery management and control: disaster recovery level calculation and evaluation, risk (hidden danger) management, pre-failure escape, last-resort switching during a failure, removal of faulty nodes, and so on; 2) disaster recovery observation: clarifying the scope of a fault and supporting disaster recovery decisions while it is ongoing; 3) disaster recovery restoration: after a failure, quickly restoring the cluster's disaster recovery capability through functions such as instance repair and cluster expansion; 4) plan service: management and execution of emergency plans for common faults.
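To make the idea of a "service-oriented plan" concrete, here is a simplified sketch (the step names and the confirmation hook are assumptions, not the real OOS step catalogue): a plan is an ordered list of basic capabilities with per-step execution control.

```python
# A simplified sketch of a service-oriented plan: basic capabilities are
# orchestrated into one executable procedure with per-step execution control.
# Step names and the confirmation hook are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    name: str
    run: Callable[[Dict], None]          # executes against a shared context
    needs_confirmation: bool = False     # execution-control hook for risky steps

@dataclass
class Plan:
    name: str
    steps: List[Step] = field(default_factory=list)

    def execute(self, context: Dict) -> None:
        for step in self.steps:
            if step.needs_confirmation:
                input(f"[{self.name}] confirm step '{step.name}', then press Enter ")
            print(f"[{self.name}] running: {step.name}")
            step.run(context)

# Example: an AZ-failure stop-loss plan composed from basic capabilities.
az_stop_loss = Plan(
    name="az-failure-stop-loss",
    steps=[
        Step("collect-affected-clusters", lambda ctx: ctx.update(clusters=["order_db", "pay_db"])),
        Step("batch-switch-primaries",
             lambda ctx: print("switching primaries of", ctx["clusters"]),
             needs_confirmation=True),
        Step("verify-and-notify", lambda ctx: print("switch verified, on-call notified")),
    ],
)
# az_stop_loss.execute({})  # run interactively when the plan is triggered
```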

3.4.1 Meeting the Disaster Recovery Standard

The database team has established an N+1 disaster recovery rating standard with 6 levels. A cluster whose disaster recovery level is ≥ 4 meets the standard; otherwise it does not.

[Figure: the 6-level N+1 disaster recovery rating standard]

As the standard shows, clusters are deployed across multiple data centers starting from level 3. The difference between level 3 and levels 4 and 5 is that level 3 does not satisfy N+1: if all nodes in one data center fail, the remaining nodes cannot carry peak traffic. Levels 4 and 5 both satisfy N+1, and level 5 additionally requires capacity parity between regions. Beyond the basic standard, SET-based clusters have extra rules, such as closed-loop routing policies, a unified data center binding for the SET cluster, equal capacity between mutually backing SETs, and unified machine models within a cluster. These rules are also factored into the calculation that determines a cluster's final disaster recovery level.

When building the basic disaster recovery data, the rules above are codified and the calculation runs as a pipeline, keeping the data fresh in near real time (a simplified sketch of the level calculation follows). This disaster recovery data is the basis the management and control platform uses for escape switching and stop-loss during an incident. On top of it, risks (clusters that fail to meet the disaster recovery standard) are identified and then eliminated through operational governance.
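The sketch below codifies the rating rules just described for levels 3 to 5; the exact criteria for levels 1, 2, and 6 are not spelled out in this article, so those branches are placeholder assumptions.

```python
# A simplified sketch of the N+1 rating logic for levels 3-5 as described
# above; the exact criteria for levels 1, 2 and 6 are not covered here, so
# those branches are placeholder assumptions.

def disaster_recovery_level(az_capacity: dict, peak_traffic: float, region_of: dict) -> int:
    if len(az_capacity) < 2:
        return 1  # placeholder: single-AZ deployments sit at the bottom levels
    total = sum(az_capacity.values())
    meets_n_plus_1 = total - max(az_capacity.values()) >= peak_traffic
    if not meets_n_plus_1:
        return 3  # multi-AZ, but losing one AZ cannot carry peak traffic
    # Level 5 additionally requires capacity parity between regions.
    per_region = {}
    for az, capacity in az_capacity.items():
        per_region[region_of[az]] = per_region.get(region_of[az], 0.0) + capacity
    if len(per_region) > 1 and max(per_region.values()) == min(per_region.values()):
        return 5
    return 4

def meets_standard(level: int) -> bool:
    return level >= 4   # per the standard: level >= 4 is compliant

azs = {"bj-a": 40_000, "bj-b": 40_000, "sh-a": 80_000}
regions = {"bj-a": "beijing", "bj-b": "beijing", "sh-a": "shanghai"}
print(disaster_recovery_level(azs, peak_traffic=80_000, region_of=regions))  # 5
```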

3.4.2 Escape Before Failure

Pre-failure escape consists of batch switchover of primaries and batch removal of replicas. It is used when an early warning is received before a failure: the disaster is perceived in advance, and all database write traffic in a data center is switched away, or replica traffic is taken offline, quickly, to reduce the impact of the real failure.

For a master-slave cluster, a failover triggered by a power or network outage is likely to lose data, and once data is lost the business has to confirm and clean up afterwards, which can be very tedious. Escaping in advance avoids these risks. Besides switching the primary away, replica traffic can also be removed in advance, so that the business side barely notices the failure. A sketch of such a batch escape follows the figures below.

[Figures: batch primary switchover and replica traffic removal for pre-failure escape]
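A minimal sketch of a batch escape follows; the switchover call is a hypothetical placeholder for the real HA/switchover service. The idea is to move all primaries off the at-risk AZ with bounded concurrency and report which clusters still need manual handling.

```python
# A minimal sketch of batch pre-failure escape: switch primaries away from an
# at-risk AZ with bounded concurrency. `graceful_switch_primary` is a
# hypothetical stand-in for the real switchover service.
from concurrent.futures import ThreadPoolExecutor, as_completed

def graceful_switch_primary(cluster: str, avoid_az: str) -> bool:
    """Placeholder: ask the HA service to move the primary of `cluster` off `avoid_az`."""
    print(f"switching primary of {cluster} away from {avoid_az}")
    return True

def escape_az(clusters: list, az: str, max_concurrency: int = 20) -> dict:
    """Return per-cluster success so failures can be escalated to manual handling."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(graceful_switch_primary, c, az): c for c in clusters}
        for fut in as_completed(futures):
            cluster = futures[fut]
            try:
                results[cluster] = fut.result()
            except Exception:
                results[cluster] = False
    return results

print(escape_az(["order_db", "pay_db", "user_db"], az="bj-az-3"))
```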

3.4.3 Observation During Faults

When a large-scale failure occurs, there is usually a flood of alerts and a flood of phone calls. Without an overall view of the fault, handling becomes chaotic, takes longer, and the impact of the fault is amplified. We therefore built a disaster recovery observation dashboard that observes faults in real time, accurately and reliably, so that the on-call engineers always know the current state of the fault.

As shown in the figure below, when a fault occurs you can quickly obtain the list of faulty clusters or instances and initiate a switchover from the corresponding page, achieving a quick stop-loss. The core requirements for the dashboard are that it be real-time, accurate, and reliable; its own availability is improved by reducing its service dependencies. A sketch of how such a fault list might be derived appears after the figure.

[Figure: the disaster recovery observation dashboard]
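As a rough sketch of how the dashboard's fault list could be derived (the probe data model and the 50% threshold are assumptions), instance-level health probes are aggregated per cluster within the affected AZ:

```python
# A rough sketch of deriving the faulty-cluster list from instance health
# probes; the probe data model and the 50% threshold are assumptions.
from collections import defaultdict

def faulty_clusters(probes: list, az: str, threshold: float = 0.5) -> list:
    """Clusters where more than `threshold` of the instances in `az` are unreachable."""
    total = defaultdict(int)
    down = defaultdict(int)
    for p in probes:                      # p = {"cluster": str, "az": str, "alive": bool}
        if p["az"] != az:
            continue
        total[p["cluster"]] += 1
        if not p["alive"]:
            down[p["cluster"]] += 1
    return [c for c in total if down[c] / total[c] > threshold]

probes = [
    {"cluster": "order_db", "az": "bj-az-3", "alive": False},
    {"cluster": "order_db", "az": "bj-az-3", "alive": False},
    {"cluster": "pay_db",   "az": "bj-az-3", "alive": True},
]
print(faulty_clusters(probes, az="bj-az-3"))   # ['order_db']
```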

3.4.4 Stop Loss During Failure

Before introducing stop-loss during a failure, let us first look at the plan service. Its core function is to manage common faults and the corresponding handling plans, and to provide execution control so that plans can run in a convenient and controllable way.

[Figure: the contingency plan service]

Stop-loss during a failure: with plans in place, we can fall back level by level to stop the loss. As shown in the figure below, when a large-scale failure occurs, HA handles it automatically first. If a cluster's switchover fails or is missed, we enter the fallback stage: start from the DDTP platform; if the platform itself is unavailable because of the fault, fall back to the operation and maintenance orchestration layer; if the orchestration service also fails, handle it manually with the CLI tool. The CLI is the DBA's last-resort tool and is an independent link from high availability. It performs cluster topology detection, primary election, topology adjustment, configuration modification, and configuration distribution, which together are the equivalent of a manual cluster switchover.

[Figures: the tiered fallback path for stop-loss during a failure]

The overall principle is: first, improve the success rate of automatic high-availability switchover and reduce the number of clusters that fall through to the fallback stage; second, improve the reliability of the plans, prefer the console (GUI) path, and degrade step by step toward the CLI, trading some ease of use for reliability. A sketch of the CLI's election logic follows.
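As an illustration of the last-resort path the CLI automates (written against MySQL GTID-based replication; connection setup, hosts, credentials, and the later repointing steps are omitted or assumed), the most advanced surviving replica can be elected by comparing executed GTID sets:

```python
# An illustration of the CLI's last-resort logic: elect the most advanced
# surviving replica by GTID before promoting it. Assumes MySQL GTID-based
# replication; connection setup and the later repointing/config-push steps
# are omitted.
import pymysql

def executed_gtid_set(conn) -> str:
    with conn.cursor() as cur:
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        return cur.fetchone()[0]

def contains_all(conn, candidate_set: str, other_sets: list) -> bool:
    """True if every other replica's GTID set is a subset of the candidate's."""
    with conn.cursor() as cur:
        for other in other_sets:
            cur.execute("SELECT GTID_SUBSET(%s, %s)", (other, candidate_set))
            if cur.fetchone()[0] != 1:
                return False
    return True

def elect_new_primary(replica_conns: dict) -> str:
    """replica_conns maps host -> an open pymysql connection to that replica."""
    gtids = {host: executed_gtid_set(c) for host, c in replica_conns.items()}
    for host, conn in replica_conns.items():
        others = [g for h, g in gtids.items() if h != host]
        if contains_all(conn, gtids[host], others):
            return host   # promote this one, then repoint the rest and push config
    raise RuntimeError("no replica contains all transactions; reconcile manually")
```

After the election, the remaining replicas are repointed to the new primary (CHANGE REPLICATION SOURCE TO on recent MySQL 8.0, CHANGE MASTER TO on older versions) and the new topology is pushed to the middleware configuration, which is what the automated links do when they are healthy.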

3.4.5 Recovery After Failure

Even if a cluster has N+1 capability, once one data center fails the remaining nodes can carry peak traffic but can no longer tolerate another AZ failure. So after a failure, based on the resource situation of each data center, the disaster recovery decision center rapidly expands core clusters to replenish their disaster recovery capacity and make them AZ-fault-tolerant again.

The big disadvantage of this approach is that it requires enough spare resources for the expansion, which is hard to guarantee. We are currently building faster recovery capabilities, such as in-place instance repair and in-place cluster expansion, so that once the failed AZ recovers, its machine resources can be quickly reused and disaster recovery capability restored rapidly.

[Figure: restoring disaster recovery capacity after a failure]

3.5 Construction of the Drill System

The various basic disaster recovery capabilities cannot exist only at the level of architecture design and theoretical evaluation; they must actually work, and that requires verification through drills. At the start of the disaster recovery project, a drill-driven strategy was adopted to verify and drive the improvement of the basic capabilities. So far, a multi-environment, high-frequency, large-scale, long-link drill system has been established.

[Figure: overview of the drill system]

  • Multiple environments: we built several drill environments to cover the various disaster recovery drill needs of each PaaS component: first, the long-term stable environment of the disaster recovery management and control platform; second, an offline isolated environment dedicated to drills; third, the production environment, which contains both a drill zone and the normal production deployment.

  • High frequency: drills currently run at daily and weekly cadence. Daily drills are normalized drills, initiated automatically in the long-term stable environment and covering hundreds of clusters each time; weekly drills are mainly real network-disconnection and power-off drills in the isolated environment and the drill zone.

  • Large scale: drills carried out in the production environment to verify the large-scale, highly concurrent processing capability of basic high availability, emergency plans, escape plans, and disaster recovery functions, and to determine the serving capacity of the management and control system.

  • Long link: the full disaster recovery link involves many components, including the CMDB database, the workflow database, high-availability components, the configuration center, the plan service, and so on. We will gradually bring these components into the drills, letting one or more of them fail at the same time, to discover hidden problems and verify how simultaneous failures of multiple nodes across multiple services affect the overall fault-handling capability.

3.5.1 Isolated Environment Drills

As the name suggests, the isolated environment drill runs in an environment completely isolated from the production data centers, with its own independent ToR switches and cabinets, so risks are fully contained and independent network-disconnection or power-off operations can be performed. The PaaS components and business services that participate in drills are deployed independently in this environment. Besides regularly running various disaster recovery drills to uncover problems, the isolated environment also verifies each PaaS component's ability to be deployed independently, which provides a basis for supporting international business.

3.5.2 Production Environment Drills

  • Normalized, large-scale fault drills: this type of drill runs continuously on a daily basis. Faults are injected into database clusters through the drill platform, and high availability performs the failover; different drill scales verify the concurrent switching capability of high availability. In addition, on the disaster recovery management and control platform, the same drills verify escape capability, stop-loss plans, and observation of large-scale failures. In short, the combination of "attack" and "defense" is used to verify, accept, and optimize these capabilities.

This type of drill has three main features. First, the participating clusters are hosted by the high-availability groups of the production environment, so the drill verifies production high-availability capability. Second, the large number of clusters participating in the drill are non-business clusters, created specifically for the drill before each run, so the scale can be very large; currently 1,500+ clusters can be drilled at the same time. Third, the drill has a degree of realism: to make it more faithful and to evaluate RTO accurately, traffic is added to the drill clusters (a sketch of an RTO measurement follows the figure).

[Figure: normalized large-scale fault drill]
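A minimal sketch of how a drill could measure RTO follows; the fault-injection hook is a placeholder for the drill platform's API, and the probe here is a simple read where a real drill would use a canary write.

```python
# A minimal sketch of RTO measurement in a drill: inject a fault, then probe
# the cluster entry point until it serves again and record the elapsed time.
# `inject_fault` is a placeholder for the drill platform's injection API.
import time
import pymysql

def measure_rto(host: str, port: int, user: str, password: str,
                inject_fault, timeout_s: int = 600) -> float:
    inject_fault()                          # e.g. kill/isolate the primary
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            conn = pymysql.connect(host=host, port=port, user=user,
                                   password=password, connect_timeout=2)
            with conn.cursor() as cur:
                cur.execute("SELECT 1")     # a real drill would use a canary write
            conn.close()
            return time.monotonic() - start  # service is back: this is the RTO
        except Exception:
            time.sleep(1)                    # still failing over, keep probing
    raise TimeoutError("service did not recover within the timeout")
```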

  • Real zone drills: the isolated environment drills and large-scale drills above are all simulations and still differ considerably from real fault scenarios. To close the gap with real faults, we built a dedicated drill AZ on the public cloud, which can be regarded as an independent data center. Participating businesses and PaaS components deploy some of the service nodes that carry real business traffic into the drill AZ, and during a drill the network is genuinely disconnected. Businesses and components can then observe and evaluate their own disaster recovery behavior under a real disconnection. Verifying actual disaster recovery with a real data center, real component clusters, and real business traffic is far more realistic.

[Figure: the dedicated drill AZ built on the public cloud]

  • Game Day: in addition, we are evaluating the feasibility of running drills in real production data centers. As isolated environment drills and drill-zone drills become routine and each component's basic disaster recovery capability grows stronger, the ultimate goal of normalized drills in real data centers will also be reached.

4 Future Outlook

After more than two years of construction, we have achieved solid results in automatic high-availability switchover, operation and management of disaster recovery capacity, large-scale fault observation, fault stop-loss plans, and disaster recovery restoration. However, there are still capability gaps to fill, and new business development brings new requirements and challenges. Going forward, we will continue to improve in two directions: making up for shortcomings and iterating the technical architecture.

4.1 Making Up for Shortcomings

  • Large-scale escape and stop-loss capability is still insufficient: as our self-built data centers come online, the self-built AZs will become larger, which raises the bar for these capabilities. We will improve them step by step, mainly through platform iteration and drill verification.

  • Cross-region leased-line failures can break region-level high availability: next, we will explore unitized or independently deployed solutions to achieve region-level or finer-grained closed loops.

  • New challenges from business going overseas: overseas expansion brings new requirements and challenges to the disaster recovery architecture. Whether to adopt "long-arm jurisdiction" or independent deployment, and whether to reuse the existing technology stack or build a new architecture, still require further exploration and demonstration.

  • Disaster recovery efficiency: the platform's basic functions are fairly complete, but disaster recovery decision-making and coordination are still manual and relatively slow, so disaster recovery management and control, emergency stop-loss, and related capabilities will gradually be automated. Multi-environment drills are also costly, so drills must gradually be automated as well: core drill scenarios will be moved into the long-term stable environment and run automatically on a schedule or by policy, while we only need to watch the core metrics.

4.2 Iterating the Architecture

Database-related technologies are developing rapidly. New technologies such as Database Mesh and Serverless will gradually land, and when they do, the middleware, high availability, and kernel layers will change considerably; the introduction of storage-compute separated products will also change disaster recovery capabilities significantly. Disaster recovery construction will iterate along with these confirmed product evolutions.

Disaster recovery construction is very challenging, and it is something every company must face once its business grows. You are welcome to leave a comment at the end of the article and discuss with us.

 5 Authors 

Ruichao, from the Basic Technology Department of Meituan's Basic R&D Platform.
