Ultra-Large-Scale Database Cluster Stability Series (Part 1): The High-Availability System

Drawing on years of hands-on experience in keeping large-scale database clusters stable, we hope to exchange ideas with the industry. The Meituan technical team held its 75th technical salon, inviting Zhao Yinggang, a researcher at Meituan, as the producer, together with four database experts (Zhang Hong, Wang Zhanquan, Lin Ruichao, and Shen Yufeng) to share topics covering attack, defense, self-healing, and drills.

  • 00 Producer says

  • 01 Introduction to High Availability

    • 1.1 Challenges

    • 1.2 Development History

  • 02 High availability deployment

    • 2.1 High availability architecture (data flow, control flow)

    • 2.2 High availability deployment (HA Core, microservices, data layer)

  • 03 Key module design

    • 3.1 Fault discovery (avoiding missed detections, reducing false positives)

    • 3.2 Failover election (election factors, election strategy)

    • 3.3 Data Consistency (Four Risks and Solutions)

    • 3.4 Multi-data-center high availability

    • 3.5 Configuration Delivery (Dual Region Delivery)

  • 04 Future thinking

This article is compiled from the talk "The High-Availability System of the Meituan Database" and is the first in the series on ultra-large-scale database cluster stability. For a database, the core question is how to ensure high availability. The article covers four aspects: an introduction to high availability, high-availability deployment, the design of key modules, and thoughts on the future. We hope it is helpful or inspiring.

| Video on Bilibili: Meituan Database High Availability System

 00 Producer says 

With database clusters expanding rapidly, quickly restoring the data and services of hundreds or even thousands of clusters after a failure is an important challenge for many large Internet companies. Hundreds of thousands of microservices run online, and the database structure and topology change all the time. System refactoring, kernel upgrades, hardware replacement, data center relocation, and so on also affect the stable operation of the database. As the most important and lowest-level service in the entire IT stack, the database can cause enormous damage even when hit by an extremely low-probability event. For the Meituan database team, the low-hanging fruit has been picked, so we began to focus on the impact of these small-probability events on the business.

There are two paths to database stability: on one hand, increase the mean time to failure (MTTF); on the other, improve emergency response capability, that is, shorten the mean time to repair (MTTR). Guided by these two goals, the Meituan database team built a closed-loop stability assurance system along two dimensions: capability-driven and fault-driven.

From the capability-driven perspective, we borrowed from Google's stability assurance framework. The bottom three layers consider how to shorten fault handling time through fault drills and contingency plans, post-incident reviews, and observability; the middle four layers focus on R&D requirements, design, release, and change control to reduce the probability of faults; the top layer is product operation, that is, operating for internal users, guiding businesses to select and use databases appropriately, continuously improving the usability of products and the platform, and providing solutions tailored to business characteristics.

From the fault-driven perspective, the loop covers prevention and detection beforehand, fault localization during the incident, recovery afterward, and review and improvement. It engages the whole lifecycle before, during, and after an incident, and every stage of software development, to comprehensively improve management, control, and emergency response capabilities.

Figure 1 The path to database stability assurance

Based on our stability practice in recent years, this salon focuses on how to improve offensive and defensive capabilities, how to improve rapid recovery, and, once offense, defense, and recovery form a closed loop, how to better coordinate people, systems, and processes when dealing with large-scale failures. Four topics are covered: the database high-availability system, the practice of database attack-and-defense drills, the construction of the database disaster recovery system, and the construction of the database autonomous service platform. We hope to bring inspiration and help to database practitioners and business R&D engineers.

 01 Introduction to High Availability 

| 1.1 Challenges

First, let's look at the problems and challenges that Meituan's database high availability faces, mainly at three levels:

The first challenge is that the number of instances is growing faster and faster. Figure 2 below shows data from January 2019 to January 2022; the instance count has clearly grown very rapidly. At this scale, keeping every instance highly available is a major challenge: the complexity of keeping a few machines running stably is nowhere near that of keeping tens of thousands, or even hundreds of thousands, of machines running stably.

Figure 2 Instance growth curve

The second challenge is that availability (RTO) requirements are becoming increasingly stringent. Meituan's business is online, real-time transactions, especially instant delivery, which demand very high system availability. In the early stage of a business, with low volume and concurrency, 99.9% availability may be enough. As business volume grows rapidly, however, the availability requirement keeps rising, especially for a low-level system like the database: from 99.9% to 99.99% or even higher.

Figure 3 Downtime corresponding to system availability
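To make the availability levels in Figure 3 concrete, here is a minimal back-of-the-envelope sketch (plain arithmetic in Python, not tooling from the talk) that converts an availability target into a yearly downtime budget:

```python
# Yearly downtime budget implied by an availability target.
# Standard arithmetic only; the targets below are illustrative.

def yearly_downtime_minutes(availability: float) -> float:
    minutes_per_year = 365 * 24 * 60
    return (1 - availability) * minutes_per_year

for target in (0.999, 0.9999, 0.99999):
    print(f"{target * 100:g}% -> {yearly_downtime_minutes(target):.1f} min/year")
# 99.9%   -> ~525.6 min/year (about 8.8 hours)
# 99.99%  -> ~52.6 min/year
# 99.999% -> ~5.3 min/year
```

Moving from three nines to four nines cuts the yearly budget from roughly nine hours to under an hour, which is why failover speed becomes a hard requirement.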

The third challenge is the complexity of disaster recovery scenarios, which fall into three levels. The first is conventional disaster recovery, such as everyday software, hardware, or network failures; the second is AZ disaster recovery at the data-center level, such as a data center losing its network or going down; the third is Region disaster recovery over a larger area, typically city-level disaster recovery. We are currently still working through AZ-level disaster recovery, which we divide into the following five stages:

Figure 4 AZ disaster recovery capability levels

As Figure 4 shows, we divide AZ disaster recovery into five stages, from 0 to 4, referred to as L0-L4. As the level rises, the scenarios become more complex and the scale grows. In terms of disaster recovery scale (a single point -> a single cluster -> the clusters a business depends on -> all clusters in an AZ), the capabilities required at different scales are completely different. Besides scale, the disaster recovery scenarios themselves also change.

  • The levels L0-L1 focus on conventional, instance-level disaster recovery.

  • The levels L2-L3 focus on AZ disaster recovery, a big leap compared with L1: besides the conventional problems faced at L0-L1, a core question must be answered: can the high-availability system itself recover quickly, and do the downstream services it depends on have disaster recovery switching capability? Since high availability is itself a system, with a data plane, a control plane, and upstream and downstream dependencies, it must first ensure that it is available before it can guarantee the RTO and RPO of the databases.

  • L4 is another big leap from L3: the scale of L3 is relatively controllable, while L4 directly cuts off the network of an entire AZ. AZs vary in size, but the scale is larger and much closer to a real AZ failure.

| 1.2 Development history

Next, let's go through the development history of Meituan's high-availability system. In general, it evolved by addressing the contradictions and challenges of each stage with corresponding solutions. So far there have been three major architecture iterations:

  • The first-generation architecture, used before 2015, was MMM (Multi-Master Replication Manager for MySQL). It consisted of an access-layer VIP, an Agent, and a Manager. The architecture itself had many problems: VIP-based access could not cross data centers or network segments; the Agent was tied to its instance and had no high availability of its own, making it hard to maintain; and the Manager was a single point.

  • The second-generation architecture, used from 2015 to 2019, was MMHA (Meituan Master High Availability), a custom architecture based on MHA and adapted to Meituan's database ecosystem. It solved the VIP access and Agent misjudgment problems of the first generation, but after 2019, with the instance scale already large and growing rapidly, managing it became extremely complicated, and the Manager was still a single point with no high availability of its own. In addition, during a 2019 data-center drill the entire PaaS failed, and the architecture gradually exposed various stability problems.

  • The third-generation architecture started in 2019: a new high-availability system customized on top of Orchestrator. Work actually began in 2018, with grayscale rollout starting in 2019. At that time MGR was already used by some companies in the industry; MGR uses a distributed protocol to solve the high-availability problem. This led us to think: could we introduce a similar distributed protocol to solve the problems our high-availability system faced, essentially taking the distributed protocol built into MGR and building it into the high-availability system instead?

The architecture later implemented along this line is the current high-availability system: a Raft Group deployed across multiple nodes and multiple data centers. The old systems focused mainly on master-slave switching; the new system works more like an octopus, managing all nodes in a cluster uniformly, including the master, the slaves, Ripple (explained later), and so on. It is also a highly available, high-performance architecture for large-scale concurrent processing: we deploy multiple Raft Groups to host database clusters in different regions and at different service levels. We are also thinking about a new generation of decentralized architecture, which is introduced in the last chapter of this article.

Figure 5 The development history of Meituan's high-availability architecture

02 High availability deployment 

| 2.1 High availability architecture (data flow, control flow)

This part is mainly divided into two lines: control flow and data flow.

  • Data flow: when a business application accesses the database, how does it get its data, and how does a SQL statement return results? The application sees the MySQL topology through the access middleware. The data flow is relatively simple and is a common approach in the industry.

  • Control flow: for business applications to always see a healthy database topology through the access middleware, high-availability components are needed to handle failover.

As Figure 6 below shows, the high-availability components are divided into four parts. HA Core is a classic 3-node Raft Group deployment (it can also be deployed with more nodes). HA Platform is the management and control system, handling manual switchover, state machine inspection, last-resort intervention, and so on. HA Service is the API service of HA, responsible for interaction with peripheral systems. The scheduling system (Scheduler and Worker) is the process scheduling service that drives the state machine. Metadata storage is a core peripheral dependency responsible for managing basic configuration; when the configuration changes, it pushes the change to all middleware nodes so that the business always sees a correct database topology.

Figure 6 High-availability architecture diagram

| 2.2 High availability deployment (HA Core, microservices, data layer)

Next, let's introduce how the four high-availability components are deployed, and the deployment strategies used to handle AZ-level and Region-level disaster recovery.

Figure 7 High-availability deployment architecture

The first part is the HA Core deployment, which has three characteristics. First, multiple Regions: in Figure 7, the red line is the boundary, with Region 1 on the left and Region 2 on the right. Second, multiple AZs: each HA Core is deployed across 3 AZs. Third, multiple clusters: each 3-node HA Core cluster hosts MySQL instances at the thousand scale.

The second part is the microservices, including the synchronization service, the task scheduling service, and the configuration center.

  • Synchronization service: simply put, whenever an engineer applies for a cluster or a DB in RDS, all the relevant information is synchronously registered into the HA Core service; this is what makes HA Core an "octopus" that knows about every node.

  • Task scheduling: this includes the API Service, Scheduler, and Worker, mainly for executing state machine tasks. These services are deployed in multiple Regions and multiple data centers, and are themselves stateless.

  • Configuration center: the data synchronization link between the access middleware used by business applications and the high-availability system, and the core component for MySQL node discovery and change propagation. It is deployed in dual Regions and has its own components, such as the API, Config, and Consistency services.

The third part is the data layer. Apart from the configuration center, the microservices are stateless, because the state lives in the data layer, and the data layer services are all MGR clusters with different functions. MGR is single-Region write and multi-Region read, which brings us back to the Region disaster recovery mentioned at the beginning: this data needs to be unitized or isolated, but for now these clusters still write in a single Region.

The fourth part is the management and control layer, which is an ordinary platform service with routine deployment.

 03 Key module design 

| 3.1 Fault discovery (avoiding missed detections, reducing false positives)

Fault discovery has two core metrics:

  • The first is to avoid missed detections: if a fault is never detected, the RTO does not even start to apply.

  • The second is to reduce false positives. Since RTO cannot be zero, every failover has some impact on the business; triggering false failovers several times a day would be unacceptable and harmful.

Let's start with missed detections. Each of the detection channels shown in Figure 8 below is completely independent, such as the general detection channel, the heartbeat detection channel, and the slave-library detection channel; these are all actively and concurrently initiated from the server side. On the other side is the business perspective: the access middleware reports errors seen by the business, and these are also fed into fault detection and decision-making. This is a fault seen from the business point of view: whatever the underlying problem is, once business damage reaches the threshold, a decision can be made (this part is still under construction). By combining multiple channels, and both client-side and server-side judgments, whichever channel finds a problem can directly trigger decision-making without being blocked by the other channels. This is the strategy for avoiding missed detections.

Figure 8 Fault discovery

As for reducing false positives, we introduced a multi-node negotiation mechanism, essentially "majority decision": when a problem is suspected, every node gets a say and a conclusion is drawn collectively. HA Core has multiple nodes, and each node probes and analyzes independently. If only one Follower node believes there is a fault, that alone is not enough; the Leader analyzes it and initiates a negotiation with the other Follower nodes for a joint decision. Only when a majority of nodes consider it a fault is the fault confirmed, registered, and handed to the fault-handling process. The Backend DB in the figure is each Raft Group node's independent storage, where the topology and decision information detected by that node is stored locally.
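A minimal sketch of these two ideas follows, with hypothetical channel names and a plain majority vote; the real HA Core negotiates through its Raft Group rather than a simple function call:

```python
from collections import Counter

def node_suspects_failure(channel_results: dict[str, bool]) -> bool:
    """Each HA Core node probes every channel independently; a single failing
    channel is enough to raise a suspicion and trigger negotiation."""
    return any(not ok for ok in channel_results.values())

def leader_confirms(votes: list[bool]) -> bool:
    """The Leader only registers the failure when a majority of Raft Group
    nodes agree, which filters out single-node false positives."""
    return Counter(votes)[True] > len(votes) // 2

# Example: a 3-node HA Core; channel names are assumptions for illustration.
votes = [
    node_suspects_failure({"general_probe": False, "heartbeat": True,  "replica_view": True}),
    node_suspects_failure({"general_probe": False, "heartbeat": False, "replica_view": True}),
    node_suspects_failure({"general_probe": True,  "heartbeat": True,  "replica_view": True}),
]
print(leader_confirms(votes))  # True: two of three nodes suspect a failure -> start failover
```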

| 3.2 Failover election (election factors, election strategy)

A failover election only makes sense with multiple slave libraries; with one master and one slave there is nothing to elect. Meituan's MySQL clusters run one master with multiple slaves, so the election is quite complicated: we must guarantee N+1 disaster recovery with multi-AZ and even multi-Region deployment, and choosing which node becomes the new master is critical. Two things determine the result: election factors and the election strategy. Election factors + election strategy = the new master.

(1) Election factors are the core elements that influence the election; in a single failover, more than 20 factors jointly determine the ranking.

Figure 9 Election factors

Two examples. The first is the version: a cluster with one master and five slaves may run three different versions. Picking the node with the newest version as the new master causes problems, so we try to avoid it: if the new master runs the newest version while some slaves run older ones, compatibility issues can appear. The second is the weight: with all other factors equal, an instance may have had its weight lowered manually during operations or routine maintenance, or may be marked as not allowed to become the new master; an instance with weight 100 then has a better chance of becoming the new master than one with weight 50. Each election factor carries a weight, and a comprehensive ranking is computed over factors such as version, binlog format, weight, and server configuration. The result of ranking by election factors alone is called 1M (the first master candidate).

But 1M is not necessarily the best choice. Some businesses are deployed across Regions, for example between Beijing and Shanghai. If the old master was in Beijing but the 1M chosen by the election factors is in Shanghai, the business usually cannot accept it: the old master being in Beijing indicates that most of the business is deployed in Beijing, and writing across Regions to Shanghai would significantly increase RT. This is why the election strategy is introduced.

(2) The election strategy is: same data center first > same central region first > same Region first, which is a graceful fallback strategy. A new master candidate, called 2M, is chosen again according to this strategy. If 2M and 1M are the same node, then 1M is considered to meet the business requirements and becomes the final new master. Sometimes, however, 1M and 2M rank very differently. In that case we take the factors that cannot be changed as the baseline and align the factors that can be changed; for example, the location cannot be changed, but the configuration can. After this final balancing, the new master is elected.

Figure 10 Election strategy

Example:

  • As shown in Figure 11 (master M and four slave instances S1, S2, S3, and S4), S2's promotion rule is Must Not, so it can never become the master, even if no other candidate is available.

  • S1 is in the same AZ as the old master (both in AZ1), so it has higher priority than S3 and S4 in AZ2, that is, a greater chance of becoming the new master.

  • S4 has a weight of 100 and S3 a weight of 90, so S4 ranks higher; even as an exclusive (dedicated) container, the higher weight gives it greater priority.

Figure 11 Election example

This example shows that the final new master is the result of a comprehensive ranking influenced by many factors.
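The sketch below roughly mirrors the "election factors + election strategy" flow with made-up weights and only a handful of factors (the real system combines more than 20, and reconciles 1M and 2M more carefully); treat it as an illustration under those assumptions, not the production ranking logic:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    az: str
    region: str
    promotion_rule: str   # "prefer" / "neutral" / "must_not"
    weight: int           # operator-assigned weight, e.g. 100 vs 50
    version_rank: int     # lower = older version, preferred as new master

def eligible(c: Candidate) -> bool:
    return c.promotion_rule != "must_not"

def factor_score(c: Candidate) -> tuple:
    # Election factors: a comprehensive ranking (version, weight, ...).
    return (c.version_rank, -c.weight)

def strategy_score(c: Candidate, old_master: Candidate) -> int:
    # Election strategy: same data center/AZ > same Region > cross Region.
    if c.az == old_master.az:
        return 0
    if c.region == old_master.region:
        return 1
    return 2

def elect(old_master: Candidate, replicas: list[Candidate]) -> Candidate:
    pool = [c for c in replicas if eligible(c)]
    # 1M: best candidate by election factors alone.
    first_master = min(pool, key=factor_score)
    # 2M: re-rank with the locality strategy taking precedence.
    second_master = min(pool, key=lambda c: (strategy_score(c, old_master), factor_score(c)))
    # If 1M and 2M agree, 1M wins; here we simply prefer 2M otherwise.
    return first_master if first_master is second_master else second_master

old = Candidate("M",  "AZ1", "beijing", "neutral", 100, 0)
s1  = Candidate("S1", "AZ1", "beijing", "neutral",  50, 0)
s2  = Candidate("S2", "AZ1", "beijing", "must_not", 100, 0)
s3  = Candidate("S3", "AZ2", "beijing", "neutral",  90, 0)
s4  = Candidate("S4", "AZ2", "beijing", "neutral", 100, 0)
print(elect(old, [s1, s2, s3, s4]).name)  # S1: same AZ as the old master, not excluded
```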

| 3.3 Data Consistency (Four Risks and Solutions)

Why care about data consistency? At our current business scale, an availability-first strategy can no longer cover every scenario, yet under a master-slave architecture the risk of data loss is everywhere. Taking Figure 12 as an example, there are several points where data can be lost:

  • Risk 1: the binlog may not be flushed to disk in real time.

  • Risk 2: the IO thread may not have fetched the latest data.

  • Risk 3: the SQL thread lags behind the IO thread, and binlog positions also differ between slave libraries.

  • Risk 4: incomplete transactions on the slave library. A transaction is complete when it commits on the master, but the IO thread synchronizes it at event granularity rather than transaction granularity; if a transaction is only partially received, data loss or inconsistency can also occur.

Figure 12 The four risks

  • For risks 3 and 4, as long as the data has reached some slave library, aligned or not, we have a strategy to guarantee consistency. This strategy is called S1: whatever has been synchronized can be made consistent through S1.

  • For risk 2, the events exist only on the master and none of the slave libraries has fetched them. In this case the old master's binlog must be parsed, the binlog positions computed, and the missing data backfilled during the switchover; that is, before write traffic is opened to the business, the data is completed onto the new master to guarantee consistency. This strategy is called S2.

  • However, S1 + S2 still cannot guarantee RPO = 0. When the server itself is down, the old master's binlog cannot be fetched, so there is nothing to parse or compute; such cases account for roughly 20% to 30%.

Next, the solution for when the old master is down and its data cannot be fetched; this is called strategy S3. Here we again borrow from the MySQL master-slave architecture, solving consistency with an IO-thread-like component and a SQL-thread-like component. The first difficulty is that the IO-thread side must fetch as much data as possible, which is critical and hard. The second difficulty is that the SQL-thread side must be able to apply that data quickly. Only when both are achieved can RPO = 0 be guaranteed.
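A rough decision sketch over the three strategies S1/S2/S3 described above; the inputs are simplified to booleans here, whereas the real system works with binlog/GTID positions:

```python
def pick_strategy(old_master_alive: bool, replicas_have_all_events: bool) -> str:
    if replicas_have_all_events:
        # Risks 3/4: the events already reached some replica and are merely
        # misaligned; align and replay them before opening writes.
        return "S1: align and replay events already fetched by replicas"
    if old_master_alive:
        # Risk 2: events exist only in the old master's binlog; parse it,
        # compute the missing range and backfill during the switchover.
        return "S2: backfill from the old master's binlog during switchover"
    # The old master is down and its binlog is unreachable: only the
    # high-performance binlog subscriber (Ripple, described below) still
    # holds the missing events.
    return "S3: recover the missing events from the binlog subscriber"

print(pick_strategy(old_master_alive=False, replicas_have_all_events=False))
```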

The solution to the first difficulty is shown in Figure 13. The core is Ripple, a high-performance binlog subscription server that plays the role of the IO thread and fetches binlogs quickly. Its first characteristic is extremely high performance: an ordinary slave's IO thread not only receives binlogs but also writes relay logs, handles lock logic, and coordinates consistency with the SQL thread, whereas Ripple's only job is to fetch all the data quickly without any other logic; its binlog synchronization speed is more than three times that of an ordinary MySQL IO thread. The second characteristic is configurable strong consistency with support for a semi-synchronous mechanism: if a business can sacrifice some RTO (say, 3 minutes) but cannot lose data, it can configure a multi-replica semi-sync timeout policy (such as 3 minutes); thanks to Ripple's performance, the actual time is far below this value and the probability of degradation is very low. The third characteristic is a storage-compute-separated architecture: Ripple itself is a lightweight container plus an EBS cloud disk, which ensures data integrity.

Figure 13 Ripple-based consistency scheme

The second difficulty is backfilling data from Ripple. Ripple is not an ordinary MySQL, so HA needs to be compatible with it; but HA has much more freedom than a SQL thread when processing this data, because a SQL thread must apply data in real time to keep consistency, whereas HA processes data only during a failure, so it can stretch the RTO a little and do a lot of customization. As for the RPO ≈ 0 mentioned here: it is approximate because RPO = 0 is achievable, but only by accepting some loss of RTO. If a business wants a specific balance of RTO and RPO, it can configure this flexibly through the policy.

At the current business scale there is no way to have both a minimal RTO and RPO = 0: optimizing one sacrifices part of the other. Some uncontrollable factors also affect these core metrics, such as transaction size, delay, and transaction integrity. We expose these choices to the business through policies and let the business decide. For example, if a business has many large transactions but wants RPO = 0, it must accept a longer RTO budget.

  • Availability priority: customizable per business; RTO is promised, RPO is not guaranteed.

  • Consistency priority: customizable per business; RPO is guaranteed, but not unconditionally; RTO has a controlled upper limit.

  • Uncontrollable factors: transaction size, latency, transaction integrity, etc.

Figure 14 Availability priority vs. consistency priority
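As an illustration of how this trade-off might be expressed as a per-business policy, here is a hypothetical configuration object; the field names are invented and do not reflect the platform's real schema:

```python
from dataclasses import dataclass

@dataclass
class FailoverPolicy:
    mode: str                       # "availability_first" or "consistency_first"
    max_rto_seconds: int            # upper bound accepted for switchover time
    semi_sync_timeout_seconds: int  # how long semi-sync may block writes before degrading

# Availability first: promise RTO, accept possible (bounded) data loss.
latency_sensitive = FailoverPolicy("availability_first",
                                   max_rto_seconds=30,
                                   semi_sync_timeout_seconds=1)

# Consistency first: aim for RPO = 0, accept a longer RTO ceiling.
money_related = FailoverPolicy("consistency_first",
                               max_rto_seconds=180,
                               semi_sync_timeout_seconds=180)

print(latency_sensitive)
print(money_related)
```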

| 3.4 Multi-data-center high availability

In normal scenarios, after the Leader fails, a new Leader is elected within about 4 seconds and work continues. But there are scenarios like the one in Figure 15 below, where the MySQL master and the HA Leader are both in AZ1. What happens if AZ1 goes down? The picture on the right appears: while the old Leader is handling the failure of the MySQL node in AZ1, the Leader itself is also in AZ1, so its state machine processing is interrupted and a new Leader is elected. The new Leader, however, does not know how far the old Leader's state machine got, which directly causes the switchover to fail.

Figure 15 Multi-data-center high availability

The core of the solution is to guarantee state machine continuity across nodes: the Leader synchronizes its state machine to all Follower nodes in real time, so that once a Follower becomes the new Leader, it can continue the old Leader's state machine, with a critical-state judgment on the current action. As shown in Figure 16 below, the new Leader decides, based on execution cost, whether to roll back or continue. If it rolls back, it discards the old Leader's entire state machine and starts processing from scratch; if the cluster's topology and state have already been disturbed, making a rollback troublesome or impossible, it continues executing the unfinished actions of the old Leader's state machine.

Ensure multi-node state machine continuity:

Figure 16 Ensuring state machine continuity across nodes

  • State synchronization: the Leader synchronizes the state machine to all Follower nodes through Raft in real time.

  • Critical state: the critical point of the state machine, determined by execution cost.

  • Rollback: the new Leader rolls the state machine back, including the corresponding operations.

  • Continue: the new Leader continues executing the old Leader's state machine.
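A minimal sketch of the rollback-or-continue decision described above, with invented step names and an assumed critical point; the real state machine is replicated through Raft and the critical point is decided by execution cost:

```python
FAILOVER_STEPS = [
    "register_failure",
    "elect_new_master",
    "cut_off_old_master",       # assumed critical point: topology starts changing here
    "promote_new_master",
    "redirect_middleware_traffic",
]
CRITICAL_STEP = FAILOVER_STEPS.index("cut_off_old_master")

def resume_plan(last_completed_step: int) -> list[str]:
    """Decide what the new Leader should run after taking over."""
    if last_completed_step < CRITICAL_STEP:
        # Cheap to undo: roll back everything and restart from scratch.
        return FAILOVER_STEPS
    # Topology already disturbed: continue the old Leader's unfinished steps.
    return FAILOVER_STEPS[last_completed_step + 1:]

print(resume_plan(last_completed_step=1))  # rollback -> run the whole flow again
print(resume_plan(last_completed_step=2))  # continue -> only promote and redirect remain
```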

| 3.5 Configuration Delivery (Dual Region Delivery)

Since the configuration service is deployed in two Regions, delivery is split into same-Region delivery and cross-Region delivery. For same-Region delivery, the change is written to the storage layer, and config-server picks up the latest configuration and pushes it to the clients. For cross-Region delivery, for example when Beijing, Shanghai, and Shenzhen all have business service nodes, Consistency-Server is also involved: once one Region is updated, the data is pushed to the other Region so that the data in both Regions stays fully consistent; if the other Region has business service nodes, it then follows the same same-Region delivery process.

Figure 17 Dual-Region delivery
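A rough sketch of this dual-Region delivery path, with invented function names standing in for config-server and Consistency-Server; it only illustrates the ordering of the steps:

```python
def deliver(config: dict, source_region: str, regions: dict[str, list[str]]) -> None:
    """regions maps a region name to the middleware clients in that region."""
    # 1. Same-Region delivery: persist, then push to local clients.
    write_to_storage(source_region, config)
    for client in regions[source_region]:
        push_to_client(client, config)
    # 2. Cross-Region delivery: replicate to every other Region, then let that
    #    Region repeat the same-Region flow for its own clients.
    for region, clients in regions.items():
        if region == source_region:
            continue
        replicate_cross_region(source_region, region, config)
        for client in clients:
            push_to_client(client, config)

# Placeholder transports (assumptions, not the real services).
def write_to_storage(region, config): print(f"[{region}] stored {config}")
def push_to_client(client, config): print(f"push -> {client}")
def replicate_cross_region(src, dst, config): print(f"replicate {src} -> {dst}")

deliver({"cluster_a.master": "10.0.0.2:3306"}, "beijing",
        {"beijing": ["mw-bj-1"], "shanghai": ["mw-sh-1"]})
```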

04 Future thinking 

Finally, some thoughts on the future of high availability, mainly in three aspects:

  • Improving disaster recovery capability, mainly AZ-level and Region-level disaster recovery. Both are still under construction. AZ-level disaster recovery requires reducing dependencies; whatever cannot be removed needs AZ-level closed-loop independent deployment, plus better large-scale concurrent processing capability. For Region-level disaster recovery, we are thinking about and planning unitization, trying to make all services, including the data layer, closed-loop within a Region, with no cross-Region access.

  • A decentralized architecture. Some databases in the industry build high availability into MySQL itself, such as MGR. Building it into MySQL has many advantages, but Meituan's current premise is the one-master-multiple-slaves architecture. In addition, MGR tolerates network jitter poorly and adds some request latency, which most of our business scenarios cannot accept. So we are exploring another direction: building high availability into the Proxy process, giving the Proxy its own database high-availability capability, an idea similar to building it into MySQL.

  • Removing dependencies and clustering. We want to remove downstream dependencies such as HA's Service, Scheduler, Worker, and the configuration center. The hope is that, once high availability is built into the Proxy process, internal data will be synchronized via the Raft/Gossip protocol, no longer relying on centralized services, so the whole system becomes fully clustered. This is our strategic thinking for 2023.

----------  END  ----------

 recommended reading 

  |  Intelligent analysis and diagnosis of database exceptions

  |  Database full SQL analysis and audit system performance optimization journey

  |  Design and implementation of database anomaly monitoring system based on AI algorithm
