RocketMQ multi-scenario disaster recovery practice at Xiaomi

Authors: Deng Zhiwen, Wang Fan

01 Why disaster recovery?

Inside Xiaomi, we use RocketMQ to provide message queuing services for a variety of online businesses, such as e-commerce orders and SMS notifications, and even to collect data reported by IoT devices. It is fair to say that RocketMQ's availability is the lifeline of these online services. As software developers, we always hope a service can run in an ideal state: free of bugs and providing normal service capabilities.

However, real-world operations experience tells us that this is impossible. Hardware failures are very common: memory failures, disk failures, and even data-center-level failures (leased line outages, data center power-downs, etc.). Therefore, we need to back up the data and use multiple replicas to keep the service highly available. Apache RocketMQ is designed with multi-replica, multi-node disaster recovery in mind, for example the Master-Slave architecture and the DLedger deployment mode.

Inside Xiaomi, because RocketMQ serves online businesses, recovery speed matters a great deal. The DLedger mode, based on the Raft protocol, can achieve a second-level RTO, so we chose the DLedger architecture as our basic deployment mode in early 2020 (since 5.0, Master-Slave mode can also fail over automatically). Supporting data-center-level disaster recovery, however, comes at additional cost. Below, I will use three practical disaster recovery deployment cases to explain how Xiaomi balances cost and availability.

02 How to do disaster recovery?

High availability within a single data center

In practice, many businesses do not need data-center-level disaster recovery; high availability within a single data center is enough. Apache RocketMQ itself is a distributed message queuing service and can provide high availability across multiple nodes in the same data center. The following mainly shares how Xiaomi upgraded and optimized the deployment architecture while weighing cost against availability.

In the Raft protocol, a group is generally configured with three nodes, and high availability is achieved through machine redundancy plus automatic leader election and failover. Therefore, when Xiaomi introduced RocketMQ, each Broker group was deployed with three Broker nodes. To ensure that the cluster always has Master nodes available, we generally deploy at least two Broker groups. A simple deployment architecture is shown below:

[Figure: single-data-center deployment with two three-node DLedger Broker groups]
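For reference, here is a minimal sketch of the broker.conf of one replica in such a three-node DLedger Broker group, following the standard DLedger deployment settings; host names, ports, and paths are placeholders, not Xiaomi's actual configuration:

```properties
# One replica (n0) of DLedger Broker group broker-a -- a minimal sketch
brokerClusterName = RaftCluster
brokerName = broker-a
listenPort = 30911
namesrvAddr = namesrv-1.example.com:9876;namesrv-2.example.com:9876
storePathRootDir = /data/rmqstore/broker-a-n0

# DLedger (Raft-based) commit log: the three peers form one Broker group
# and elect the Master among themselves
enableDLegerCommitLog = true
dLegerGroup = broker-a
dLegerPeers = n0-host-a0.example.com:40911;n1-host-a1.example.com:40911;n2-host-a2.example.com:40911
dLegerSelfId = n0
```

The other two replicas use the same `dLegerPeers` list and differ only in `dLegerSelfId` (n1, n2) and their local paths.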

This is a very basic deployment architecture: within a single data center, multiple replicas and multiple Broker groups provide the needed fault tolerance. But it is not hard to spot a serious problem: wasted resources. A RocketMQ slave node only serves reads when a client is catching up on older data; the rest of the time it simply acts as a replica. Machine utilization is only 33%, which is hard to accept.

Out of cost considerations, we had to rethink the existing deployment architecture. How can the slave nodes be put to work? A very simple idea is to co-locate nodes: deploy additional Broker processes on the slave machines so that they can also act as Masters. Coincidentally, the community proposed the Broker Container concept at about the same time. The idea is to abstract a Container role above the RocketMQ Broker: the Container manages adding, removing, modifying, and querying Brokers, so that a single process can host multiple Brokers. The architecture is as follows:

[Figure: Broker Container deployment architecture]

As the figure shows, the Container runs as the process, and the original Broker is abstracted into a component managed by the Container. We can run nine Broker nodes on the same three machines, forming three Broker groups with one Master node on each host. With Containers deploying Brokers peer-to-peer, every host is fully used, and the same number of machines can in theory deliver three times the performance.

Container is a good deployment idea: master and slave nodes are deployed peer-to-peer so that all machines are fully utilized. We tried to adopt the solution directly, but ran into some problems:

  1. A Container is ultimately a single process. No matter how many Brokers run inside it, restarting it affects every Broker group that has a Broker in that Container, so upgrades have a much larger impact;

  2. The Container manages the Brokers' going online and offline, which could not be integrated with Xiaomi's internal deployment tooling.

Therefore, Container was not a good fit inside Xiaomi. Inspired by Broker Container, however, we proposed a similar deployment scheme: multiple instances per machine. That is, multiple Broker instances are deployed on a single host; the host plays the role of the Container, and each Broker runs as its own process. In this way the Brokers do not affect one another, and the scheme integrates cleanly with our internal deployment tools. A simple deployment architecture looks like this:

[Figure: multiple Broker instances deployed on each host]
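As a minimal sketch of the idea (broker names, ports, and paths are placeholders, not Xiaomi's actual values), two Broker instances co-located on one host only need distinct listening ports and storage directories; each is then started with its own configuration file via `sh bin/mqbroker -c <conf>`:

```properties
# broker-a.conf -- first instance on this host (e.g. Master replica of group broker-a)
brokerName = broker-a
listenPort = 30911
storePathRootDir = /data/rmqstore/broker-a

# broker-b.conf -- second instance on the same host (e.g. a slave replica of group broker-b)
brokerName = broker-b
listenPort = 31911
storePathRootDir = /data/rmqstore/broker-b
```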

With this, Xiaomi completed the first upgrade of its internal RocketMQ deployment architecture: the number of machines in the cluster dropped by two thirds, while still providing a 99.95% availability guarantee.

Multi-Data-Center Disaster Recovery - Ⅰ

As more and more businesses came on board, some of them raised the requirement for data-center-level disaster recovery. Although the probability of a data center failure is extremely low, its impact, once it happens, is enormous. For example, if a data center failure makes RocketMQ unavailable, then, as a traffic entry point, it drags down every service that depends on it.

For multi-data-center disaster recovery, we first proposed a multi-cluster, multi-active approach based on the deployment experience of other internal services: deploy one cluster in each availability zone and give the business multiple clusters for disaster recovery. The deployment architecture is as follows:

[Figure: multi-cluster multi-active deployment, one independent cluster per availability zone]

What the user sees is three independent clusters, and clients should be deployed in the same availability zone as the RocketMQ cluster they read from and write to. For example, a client in Availability Zone 1 normally accesses Cluster-1 in Availability Zone 1. If Cluster-1 fails, the user has to change the client's connection address manually to switch clusters and move the traffic to a cluster in another data center. The new address can be pushed as a hot configuration update, or the configuration can be changed and the client restarted. Either way, the prerequisite is that the business notices the RocketMQ cluster failure and triggers the switch manually.
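To make the manual switch concrete, here is a minimal sketch in Java of what the client-side change amounts to; the group name and Namesrv addresses are placeholders, and how the new address reaches the client (hot config push or restart) is whatever mechanism the business already uses:

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class OrderProducer {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("order-producer-group");

        // Normal case: point at the Namesrv of the cluster in the local availability zone.
        producer.setNamesrvAddr("namesrv.az1.example.com:9876");

        // Disaster recovery: after noticing that Cluster-1 is down, operations push a new
        // address (or edit the config and restart the client), e.g.:
        // producer.setNamesrvAddr("namesrv.az2.example.com:9876");

        producer.start();
        // ... send messages as usual ...
    }
}
```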

▷ Advantages

  • No cross-region data replication, so latency is low (P99 write latency of 10 ms) and throughput is high (a single Broker group can reach a write TPS of 100K)

  • Simple deployment architecture and high stability

▷ Disadvantages

  • Each cluster must reserve disaster recovery capacity to ensure that, when a failure occurs, the surviving cluster can carry all of the failed cluster's traffic

  • The business has to switch clusters manually, which is not flexible enough

  • If messages have accumulated, the messages in the failed cluster cannot be consumed until it recovers

▷ Production latency

[Figure: production latency of the multi-cluster multi-active solution]

Multi-Data-Center Disaster Recovery - Ⅱ

As you can see, accessing the service this way requires a certain amount of adaptation work, so the solution is suitable for services with large traffic. However, some businesses want low-cost access: no adaptation, just use the SDK directly. Combining this need with DLedger's automatic failover capability, we experimentally deployed a mode that fails over automatically when a data center goes down. The deployment architecture is as follows:

[Figure: a single RocketMQ cluster with each DLedger group's replicas spread across availability zones]

What the user sees is a single independent RocketMQ cluster that can be accessed normally with the SDK, without any adaptation. When a data center fails, DLedger's automatic leader election switches the traffic over.
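At the deployment level, the only difference from the single-data-center DLedger setup is that the three replicas of each Broker group sit in three different availability zones, which in broker.conf comes down to where `dLegerPeers` point (host names below are placeholders):

```properties
# Group broker-a: one replica per availability zone, so a majority survives a single-zone failure
dLegerGroup = broker-a
dLegerPeers = n0-host-az1.example.com:40911;n1-host-az2.example.com:40911;n2-host-az3.example.com:40911
dLegerSelfId = n0   # n1 / n2 on the replicas in the other two zones
```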

▷ Advantages

  • Easy to deploy, making full use of RocketMQ's native capabilities

  • Automatic leader election; services connect easily and no manual traffic switching is needed

▷ Disadvantages

  • Because deployment spans data centers, the cluster is sensitive to network fluctuations and more prone to jitter

  • Cross-data-center deployment increases write latency, which in turn reduces cluster throughput

▷ Production latency

[Figure: production latency of the cross-data-center DLedger solution]

Multi-Data-Center Disaster Recovery - PLUS

By this point, RocketMQ seemed to be running well inside Xiaomi, with a daily message volume on the order of 100 billion. However, a closer look at the two solutions above shows that while both can fail over across data centers, each has certain shortcomings, briefly summarized as follows:

  • Multi-Data-Center Disaster Recovery - Ⅰ: requests stay within the local data center, so latency is low, but clusters must be switched manually

  • Multi-Data-Center Disaster Recovery - Ⅱ: traffic switches over automatically and historical data can still be consumed, but the leased lines carry a heavy load and three regions are required for deployment

No solution is perfect, but whether as service developers or as business users, we hope disaster recovery can be achieved while meeting the following goals:

1) Low cost: deployable with only two regions;

2) Low latency: requests stay within the local data center as much as possible to reduce network overhead;

3) Automatic traffic switching: when a data center fails, traffic switches automatically to the healthy data center.

To meet these requirements, we started from RocketMQ's own architecture, hoping to support disaster recovery with the smallest possible amount of change. We noticed that clients produce and consume based on the metadata returned by Namesrv; as long as that metadata reflects a data center failure, clients can switch traffic automatically, and disaster recovery can be supported on the client side.

Every RocketMQ Broker registers itself with Namesrv. Once a Broker group fails, its information is removed from Namesrv, and clients can no longer send messages to or pull messages from that Broker group. Based on this, as long as we deploy Broker groups across different data centers, we can achieve data-center-level disaster recovery. The deployment architecture is as follows:

[Figure: one cluster with Broker groups deployed across two availability zones; clients carry a region attribute]

Let's use a concrete example to explain why this works: Topic-A has partitions in both availability zones, and the SDK is configured with its own region.

For producers, a client only sends messages to partitions in its own availability zone. For example, a client in Availability Zone 1 only sends to partitions in Availability Zone 1. When Availability Zone 1 fails, there are no writable partitions left in Availability Zone 1, so the client starts sending to Availability Zone 2, achieving automatic cut-over on the production side. Consumers also configure a region: all consumer instances are first rebalanced by availability zone, so partitions are preferentially assigned to consumers in the same availability zone. When Availability Zone 1 fails, since producers have already cut their traffic over, consumers automatically follow without any special handling.

[Figure: region-aware production and consumption before and after an availability-zone failure]
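As an illustration of the producer-side idea only (this is not Xiaomi's actual SDK code; the zone prefix in the broker name and the `localRegion` parameter are assumed conventions), a custom MessageQueueSelector could prefer queues in the local availability zone and fall back to the remaining queues once the local Broker groups disappear from the Namesrv metadata:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageQueue;

public class RegionAwareSend {
    public static void main(String[] args) throws Exception {
        String localRegion = "az1"; // assumed convention: broker names are prefixed with their zone
        DefaultMQProducer producer = new DefaultMQProducer("demo-producer-group");
        producer.setNamesrvAddr("namesrv.example.com:9876");
        producer.start();

        Message msg = new Message("Topic-A", "hello from az1".getBytes());

        SendResult result = producer.send(msg, (mqs, message, arg) -> {
            String region = (String) arg;
            // Queues belonging to Broker groups in the local availability zone.
            List<MessageQueue> local = mqs.stream()
                    .filter(q -> q.getBrokerName().startsWith(region))
                    .collect(Collectors.toList());
            // If the local Broker groups have been removed from Namesrv (zone failure),
            // fall back to the queues of the other availability zone.
            List<MessageQueue> candidates = local.isEmpty() ? mqs : local;
            return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
        }, localRegion);

        System.out.printf("sent to %s%n", result.getMessageQueue());
        producer.shutdown();
    }
}
```

On the consumer side, the "rebalance by availability zone first" behavior could likewise be expressed by plugging a zone-aware implementation of the AllocateMessageQueueStrategy interface into the consumer.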

This solution is opt-in: each business decides whether to enable disaster recovery mode, which makes it more flexible. It combines the advantages of the previous two data-center disaster recovery solutions, but it still has shortcomings; for example, historical messages in the failed cluster cannot be consumed during the failure. We will continue to optimize the solution.

03 Summary

This article has introduced four deployment modes, each serving different business needs, summarized below:

[Table: comparison of the four deployment modes]

At present, the solutions above are applied in specific business scenarios inside Xiaomi, covering about 90% of the total message volume. Going forward, the clusters carrying the remaining traffic will be gradually upgraded to data-center disaster recovery clusters, providing a 99.99% availability guarantee.

