Disaster recovery construction and practice of vivo push system

Author: vivo Internet Server Team - Yu Quan

This article introduces the disaster recovery construction of the push system and its key technical solutions, as well as the thinking and challenges encountered along the way.

1. Introduction to the Push System

The vivo push platform is the message push service that vivo provides to developers. By establishing a stable and reliable long connection between the cloud and the client, it offers developers real-time message push to their client applications, supporting tens of billions of notification/message pushes that reach mobile users within seconds.

The push system is mainly composed of the access gateway, the logical push nodes, and the long connection layer. The long connection layer is responsible for establishing connections with users' mobile devices and delivering messages to them in time.

The push system is characterized by high concurrency, large message volume, and high delivery timeliness.

At present, the vivo push system has a peak push speed of 1.4 million messages per second (140w/s), a maximum single-day message volume of 20 billion, and an end-to-end second-level online delivery rate of 99.9%. At the same time, the push system sees sudden large traffic that cannot be predicted in advance. Given its high concurrency, high timeliness, and bursty traffic, how do we ensure system availability? This article describes how the push system implements disaster recovery from three aspects: system architecture, storage, and traffic.

2. System Architecture Disaster Recovery Solution

2.1 Long connection layer disaster recovery

The long connection layer is the most important part of the push system. Its stability directly determines the push quality and performance of the system, so the long connection layer needs disaster recovery and real-time scheduling capabilities.

In the original architecture, the long connection layer was deployed only in East China: all logical nodes in the vivo IDC connected to the East China Brokers through a VPC, and mobile devices communicated with the East China Brokers through long connections. This deployment had the following problems.

  • Problem 1: Mobile phones in North China and South China have to connect to the Brokers in East China. The geographical span is large, so the stability and timeliness of the long connections are relatively poor.

  • Problem 2: The logic layer connects to the East China Brokers through a single VPC. As the business grows and push traffic increases, bandwidth becomes a bottleneck, with a risk of heavy packet loss. Moreover, if the VPC fails, no messages can be delivered across the entire network.

Note: a node in the long connection layer is called a Broker.

Original long connection architecture diagram:

To address these problems, we optimized the architecture and deployed Brokers in three regions: North China, East China, and South China.

Users in North China, East China, and South China connect to the nearest Broker.

The optimized architecture not only guarantees the stability and timeliness of the long connection network but also provides strong disaster recovery capability. The East China and South China Brokers connect to the North China Brokers through the cloud network, and the North China Brokers connect to the vivo IDC through a VPC. If a Broker cluster or the public network fails in any one of the three regions, sending and receiving of messages by devices across the entire network is not affected. However, one problem remains: if a Broker cluster or the public network fails in a region, some devices in that region will be unable to receive push messages.

Architecture diagram after the three-region deployment:

To address the problem that such a single-region failure leaves some devices in that region unable to receive push messages, we designed a traffic scheduling system capable of real-time traffic scheduling and switching. A global scheduler node is responsible for policy scheduling and management.

When a vivo phone registers, the dispatcher delivers Broker IP addresses for multiple regions. By default the device connects to the nearest Broker; after repeated connection failures it tries the IPs of other regions. When a region's Brokers hit a long connection bottleneck or the VPC fails, the global scheduler node can issue a policy that lets devices in the faulty region fetch a new IP set from the dispatcher and establish long connections with Brokers in other regions, and the logical push nodes then send messages to the reconnected Brokers. After the region recovers, the policy can be reissued to switch the devices back.
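To make the fallback behavior concrete, here is a minimal Java sketch of nearest-first connection with multi-region fallback. The class and method names (RegionEndpoint, tryConnect) and the retry count are hypothetical stand-ins, not vivo's actual client logic.

```java
import java.util.List;

/**
 * Minimal sketch of nearest-first Broker selection with fallback.
 * RegionEndpoint and tryConnect() are hypothetical names for illustration.
 */
public class BrokerSelector {

    public static class RegionEndpoint {
        final String region;   // e.g. "north", "east", "south"
        final String ip;
        final int port;
        RegionEndpoint(String region, String ip, int port) {
            this.region = region; this.ip = ip; this.port = port;
        }
    }

    private static final int MAX_RETRIES_PER_ENDPOINT = 3;  // illustrative retry count

    /**
     * Endpoints are assumed to be delivered by the dispatcher at registration time,
     * already ordered nearest-first; on repeated failures we fall back to other regions.
     */
    public RegionEndpoint connect(List<RegionEndpoint> endpointsNearestFirst) {
        for (RegionEndpoint endpoint : endpointsNearestFirst) {
            for (int attempt = 0; attempt < MAX_RETRIES_PER_ENDPOINT; attempt++) {
                if (tryConnect(endpoint)) {
                    return endpoint;          // long connection established
                }
            }
            // All retries for this region failed; try the next region's Broker.
        }
        return null;                          // caller re-queries the dispatcher for a fresh IP set
    }

    private boolean tryConnect(RegionEndpoint endpoint) {
        // Placeholder for the real long-connection handshake (TCP/TLS + protocol handshake).
        return false;
    }
}
```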

Traffic scheduling system diagram:

2.2 Logical layer disaster recovery

With disaster recovery in place at the long connection layer, the logic layer needs corresponding disaster recovery as well. Previously our logic layer was deployed in a single data center and had no data-center-level disaster recovery: if that data center was at risk of a power outage, the whole service would become unavailable. We therefore moved to an intra-city active-active deployment.

Logic layer single-active architecture:

The logic layer is now deployed in both vivo IDC1 and vivo IDC2, and the gateway layer distributes traffic to the two IDCs at a configured ratio according to routing rules, achieving intra-city active-active at the logic layer. Note that there is still only one data center, deployed in vivo IDC1: considering the cost, benefit, and data synchronization delay of multiple data centers, we keep a single data center for the time being.
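As a rough illustration, the gateway-side ratio routing could look like the following Java sketch; the 50/50 split and the IDC identifiers are illustrative assumptions rather than the actual routing rules.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Minimal sketch of the gateway distributing traffic to two IDCs by a configured ratio.
 * The default 50/50 split is an illustrative assumption.
 */
public class IdcRouter {

    public enum Idc { VIVO_IDC1, VIVO_IDC2 }

    // Percentage of traffic routed to IDC1; adjustable at runtime, e.g. set to 100
    // to drain IDC2 during maintenance or a failover.
    private volatile int idc1Percent = 50;

    public Idc route() {
        int dice = ThreadLocalRandom.current().nextInt(100);
        return dice < idc1Percent ? Idc.VIVO_IDC1 : Idc.VIVO_IDC2;
    }

    public void setIdc1Percent(int percent) {
        this.idc1Percent = Math.max(0, Math.min(100, percent));
    }
}
```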

Logic layer active-active architecture:

3. Traffic Disaster Recovery Solution

With disaster recovery built into the system architecture, the gateway layer of the push system still needs measures against sudden traffic: it must apply traffic control to keep the system stable. Historically, hot topics and breaking news events have generated huge concurrent push traffic, causing service anomalies and reduced availability.

How do we handle sudden large traffic, keep system availability unchanged under bursts, and at the same time balance performance and cost? To answer this, we designed and compared the following two schemes.

The conventional solution is to deploy a large number of redundant machines, sized from historical estimates, to absorb burst traffic. This alone is expensive: a burst may last only five minutes or less, yet covering those five minutes requires keeping a lot of redundant capacity. And once traffic exceeds what the deployed machines can bear, capacity cannot be expanded in time, which may reduce availability or even trigger an avalanche effect.

Push architecture under the traditional solution:

So how do we design a solution that controls cost, scales elastically under sudden large traffic, ensures no message is dropped, and still maintains push performance?

Optimized solution: on top of the original architecture, a buffer channel is added at the access layer. When a traffic peak arrives, the traffic beyond the upper limit the system can handle is put into a buffer queue in the form of a message queue, and a bypass access layer is added to consume the queue at a limited rate. After the peak passes, the bypass consumption speed is increased to work through the backlogged messages. The bypass access layer is deployed with Docker and supports dynamic scaling, running as a minimal cluster by default. When the message queue has a backlog and the downstream has spare capacity, the consumption speed is raised: the bypass layer scales out according to CPU load, consumes the queue quickly, and scales back in once the backlog is processed.

Message queue: Kafka is chosen for its high throughput, and the cluster is shared with the offline-computing Kafka cluster to make full use of resources.

Bypass access layer: deployed with Docker, scaling in and out dynamically according to CPU load and time of day, with a minimal cluster by default. For known peak hours the service can be scaled out in advance to ensure traffic is processed quickly; for peaks at unknown times the bypass access layer scales dynamically according to CPU load.
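A minimal Java sketch of how a bypass node might drain the buffer queue at a controllable rate is shown below. It uses the standard Kafka consumer API together with Guava's RateLimiter purely for illustration; the topic name, consumer group, broker address, and rate values are assumptions, not the actual configuration.

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

/** Minimal sketch: a bypass node draining the buffer topic at a controllable rate. */
public class BypassConsumer {

    // Permits per second; raised after the peak passes or when downstream has spare capacity.
    private final RateLimiter limiter = RateLimiter.create(1_000);

    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");                 // illustrative address
        props.put("group.id", "push-bypass");                          // illustrative group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("push-buffer")); // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    limiter.acquire();              // throttle consumption
                    pushDownstream(record.value()); // hand the message to the push nodes
                }
            }
        }
    }

    /** Called by the speed-control component when downstream capacity changes. */
    public void setPermitsPerSecond(double permitsPerSecond) {
        limiter.setRate(permitsPerSecond);
    }

    private void pushDownstream(String message) {
        // Placeholder for forwarding the message to the logical push nodes.
    }
}
```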

The push architecture after adding the cache queue:

After this transformation one problem remains: how to implement global speed control at the access layer. Our approach is to collect the push traffic of the downstream push nodes. For example, when traffic reaches 80% of the upper limit the system can withstand, a rate-limit command is issued to lower the push speed of the access layer, letting messages back up in the message queue first; once downstream traffic drops, a command lifting the rate limit is issued, and the bypass access layer speeds up consumption of the message queue and pushes the messages out.
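The decision logic of such global speed control might look like the following sketch. The 80% limit threshold comes from the description above; the release threshold and the way commands reach the access layer are illustrative assumptions.

```java
/**
 * Minimal sketch of the global speed-control decision: compare the aggregated downstream
 * push throughput against a threshold and decide whether to issue a rate-limit or a
 * release command to the access layer. The command transport is intentionally omitted.
 */
public class GlobalSpeedController {

    private final long capacityPerSecond;        // upper limit the downstream can withstand
    private final double limitThreshold = 0.8;   // start limiting at 80% of capacity (from the text)
    private final double releaseThreshold = 0.5; // illustrative: release once load falls back

    private volatile boolean limiting = false;

    public GlobalSpeedController(long capacityPerSecond) {
        this.capacityPerSecond = capacityPerSecond;
    }

    /** Called periodically with the aggregated throughput reported by downstream push nodes. */
    public void onDownstreamThroughput(long observedPerSecond) {
        double load = (double) observedPerSecond / capacityPerSecond;
        if (!limiting && load >= limitThreshold) {
            limiting = true;
            issueCommand("LIMIT");    // access layer slows down; messages back up in the queue
        } else if (limiting && load <= releaseThreshold) {
            limiting = false;
            issueCommand("RELEASE");  // bypass access layer speeds up queue consumption
        }
    }

    private void issueCommand(String command) {
        // Placeholder: broadcast the command to the access-layer / bypass nodes.
    }
}
```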

The push architecture after adding speed control:

Comparison between the optimized scheme and the traditional scheme:

4. Storage Disaster Recovery Solution

With concurrent traffic control in place, hotspot bursts can be absorbed well in advance. Inside the push system, message bodies are cached in a Redis cluster, and a Redis cluster failure once caused messages to fail to be delivered in time. We therefore designed a disaster recovery solution for the Redis cluster, so that the system can keep pushing messages in time, and without loss, while a Redis cluster failure lasts.

The push message body is cached in the Redis cluster and fetched from Redis at push time. If the Redis cluster crashes or its memory fails, offline message bodies are lost.

Original message flow:

Solution 1: introduce a second Redis cluster of equal scale and have the push system double-write every message to both clusters. This requires redundantly deploying a standby Redis cluster of the same scale and modifying the push system to perform double writes.

Solution 2: synchronize the original Redis cluster to a standby Redis cluster using RDB+AOF. This avoids the double-write modification of the push system, since the original cluster's data is replicated directly to the standby cluster, but it still requires a redundant standby cluster of equal scale, and replication lag may cause some pushes to fail.

Solution 3: use another distributed storage system, a disk-based KV store that is compatible with the Redis protocol and offers persistence, which guarantees that message bodies are not lost. To save cost, we do not deploy disk KV resources equal in scale to the Redis cluster. Instead, based on push characteristics, pushes are divided into single push and group push: a single push is one-to-one, with one message body per user; a group push is one-to-many, with one message body shared by many users, and group pushes are usually task-level pushes. We therefore use a relatively small disk KV cluster, mainly for redundant storage of group push message bodies, i.e. task-level messages; single push message bodies are saved only in Redis, without redundant storage.

If the Redis cluster fails, single push messages can carry the message body downstream with the push request so that delivery continues. For group push messages, since the message body is redundantly stored in the disk KV, reads can be degraded to the disk KV when the Redis cluster fails.
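A minimal sketch of this degraded read path is shown below; the store interfaces, request shape, and field names are hypothetical and only illustrate the fallback order (Redis first, then the disk KV for group push, then the inline body for single push).

```java
/**
 * Minimal sketch of the degraded read path when the Redis cluster is abnormal.
 * KvStore, PushRequest, and the field names are hypothetical illustrations.
 */
public class MessageBodyReader {

    /** Tiny abstraction over a key-value store; real clients would differ. */
    public interface KvStore {
        String get(String key);
    }

    private final KvStore redis;   // primary cache for all message bodies
    private final KvStore diskKv;  // redundant store for group-push (task-level) bodies only

    public MessageBodyReader(KvStore redis, KvStore diskKv) {
        this.redis = redis;
        this.diskKv = diskKv;
    }

    public String loadBody(PushRequest request) {
        try {
            return redis.get(request.messageId);
        } catch (RuntimeException redisUnavailable) {
            if (request.isGroupPush) {
                // Group push: the body is redundantly stored in the disk KV, so degrade the read.
                return diskKv.get(request.messageId);
            }
            // Single push: not stored in the disk KV, but the push request carries the body itself.
            return request.inlineBody;
        }
    }

    /** Hypothetical request shape; single-push requests carry the body downstream. */
    public static class PushRequest {
        String messageId;
        boolean isGroupPush;
        String inlineBody;
    }
}
```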

Solution 3 still has one problem: the write performance of the disk KV is not in the same order of magnitude as that of the Redis cluster, especially in latency, averaging about 5 ms versus 0.5 ms for Redis. If the push system double-wrote the group push message body synchronously, this delay would be unacceptable, so the disk KV can only be written asynchronously: the group push message body is backed up by first writing it to Kafka, and a bypass node consumes Kafka and writes it to the disk KV asynchronously. In this way, with only a small amount of disk KV resources for disaster recovery, the high-concurrency capability of the push system is preserved and group push message bodies are not lost. When Redis is abnormal, single push messages carry the message body with the push, and group push message bodies are read from the disk KV.
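The producer side of this asynchronous backup might look like the following sketch: the group push message body is sent to a Kafka topic with an asynchronous callback so the write latency stays off the push path. The topic name and broker address are illustrative assumptions, and the consuming bypass node (which writes to the disk KV) is omitted.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

/**
 * Minimal sketch of the asynchronous backup path for group-push message bodies:
 * the body is written to Redis synchronously (not shown) and sent to Kafka
 * asynchronously; a bypass node consumes the topic and writes to the disk KV.
 */
public class GroupPushBodyBackup {

    private final KafkaProducer<String, String> producer;

    public GroupPushBodyBackup() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // illustrative address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Fire-and-forget from the push system's point of view; latency stays off the push path. */
    public void backup(String taskId, String messageBody) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("group-push-body-backup", taskId, messageBody); // illustrative topic
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // Backup failed; log and rely on retries/alerts so the body is not silently lost.
            }
        });
    }
}
```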

Comparison of storage disaster recovery solutions:

5. Summary

This article has described the disaster recovery construction of the push system from three aspects: system architecture, traffic, and storage. System disaster recovery should be weighed against business development, cost-benefit, and implementation difficulty.

At present, our long connection layer is deployed in three regions, the logic layer is active-active within the same city, and the data center is still a single one. Going forward, we will continue to study and plan dual data centers and a "three centers in two locations" deployment to gradually strengthen the disaster recovery capability of the push system.
