Vivo's high-availability architecture practice based on native RabbitMQ

1. Background description

Vivo introduced RabbitMQ in 2016, extending open-source RabbitMQ to provide messaging middleware services to the business.

From 2016 to 2018, all businesses shared a single cluster. As the business scale grew, the cluster load became heavier and failures occurred frequently.

In 2019, RabbitMQ entered the high-availability construction stage, during which the high-availability component MQ name service (MQ-NameServer) and same-city active-active RabbitMQ clusters were built.

At the same time, the shared cluster was physically split, and clusters are now allocated to businesses, and adjusted dynamically, strictly according to cluster load and business traffic.

Since the high-availability construction in 2019, business traffic has grown tenfold and no serious cluster failures have occurred.

RabbitMQ is open-source message broker software that implements AMQP, a protocol that originated in financial systems.

It has a rich set of features:

  1. Message reliability guarantees: RabbitMQ ensures reliable publishing through publisher confirms, reliable storage within the cluster through clustering, message persistence, and mirrored queues, and reliable consumption through consumer acknowledgements (see the sketch after this list).

  2. RabbitMQ provides clients in multiple languages.

  3. It provides multiple exchange types; after messages are sent to the cluster, the exchange routes them to specific queues.

  4. RabbitMQ provides a complete management console and management API, which makes it easy to integrate with a self-built monitoring system.
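As a brief illustration of the reliability features in point 1 above, here is a minimal sketch using the plain RabbitMQ Java client (not vivo's SDK); the broker address, queue name, and message content are illustrative placeholders.

import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class ReliabilityDemo {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                     // placeholder broker address
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.confirmSelect();                          // enable publisher confirms
        channel.queueDeclare("demo.queue", true, false, false, null);   // durable queue
        channel.basicPublish("", "demo.queue",
                MessageProperties.PERSISTENT_TEXT_PLAIN,  // mark the message as persistent
                "hello".getBytes(StandardCharsets.UTF_8));
        channel.waitForConfirmsOrDie(5_000);              // wait until the broker confirms the publish

        // Manual acknowledgement: the broker removes the message only after basicAck.
        channel.basicConsume("demo.queue", false,
                (consumerTag, delivery) ->
                        channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false),
                consumerTag -> { });
    }
}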

Problems encountered with RabbitMQ in practice:

  1. To ensure high availability for the business, multiple physically isolated clusters are used, but there is no unified platform to manage them.

  2. The native RabbitMQ client connects using cluster addresses. With multiple clusters, the business has to care about cluster addresses, which makes usage confusing.

  3. Native RabbitMQ only performs simple username/password verification and does not authenticate the applications that use it, so exchanges/queues of different businesses are easily mixed up, causing application anomalies.

  4. Many applications are in use, and there is no platform maintaining the relationship between message producers and consumers; after several version iterations the counterpart can no longer be determined.

  5. The client has no rate limiting, so sudden abnormal traffic can overwhelm or even bring down a cluster.

  6. The client has no retransmission strategy for failed messages, which users have to implement themselves.

  7. When a cluster is blocked by memory overflow or similar issues, traffic cannot be quickly and automatically transferred to another available cluster.

  8. With mirrored queues, a queue's master resides on a specific node; when the cluster has many queues, node load easily becomes unbalanced.

  9. RabbitMQ has no automatic queue rebalancing capability, so with many queues the load across cluster nodes easily becomes uneven.

2. Overall architecture

(Figure: overall architecture)

1. MQ-Portal -- supporting application usage requests

In the past, when a business team applied to use RabbitMQ, the requested traffic and the information about the connected applications were recorded in offline spreadsheets, which were fragmented and not updated in time, making it impossible to know accurately how the business was actually using RabbitMQ. Therefore, the application process was made visual and platform-based, establishing metadata about each application's usage.

(Figure: MQ-Portal application process)

Through the MQ-Portal application process (shown in the figure above), the producing application, the consuming application, the exchange/queue to be used, the expected sending traffic, and other information are captured; after submission, the request enters vivo's internal work order process for approval.


After the work order is approved, a callback is made through the work order interface, the specific cluster used by the application is assigned, and the exchange/queue and its bindings are created on that cluster.

Since multiple physically isolated clusters are used in the production environment to ensure business high availability, the cluster in use cannot be located simply from an exchange/queue name.

Each exchange/queue is therefore associated with its cluster through a unique pair of rmq.topic.key and rmq.secret.key, so that the SDK can locate the specific cluster during startup.

rmq.topic.key and rmq.secret.key are allocated in the work order callback interface.


2. Overview of client SDK capabilities

The client SDK is built on spring-messaging and spring-rabbit and, on top of them, provides capabilities such as application authentication, cluster addressing, client-side rate limiting, production/consumption reset, and blocking transfer.

2.1. Application usage authentication

Open-source RabbitMQ only uses a username and password to decide whether a connection to the cluster is allowed; it does not verify whether an application is permitted to use a given exchange/queue.

To prevent different businesses from mixing up each other's exchanges/queues, applications need to be authenticated.

Application authentication is implemented jointly by the SDK and MQ-NameServer.

When an application starts, it first reports its configured rmq.topic.key information to MQ-NameServer, which checks whether the application using the key matches the application that applied for it; a second check is performed when the SDK sends a message, as shown in the code below.

/**
 * Validate before sending and resolve the actual producer factory. This lets the business
 * declare multiple factories yet send all messages through any one bean without errors.
 * @param exchange the exchange to validate
 * @return the producer factory to send with
 */
public AbstractMessageProducerFactory beforeSend(String exchange) {
    if (closed || stopped) {
        // The context has been closed: throw to stop further sends and reduce messages
        // sent in a transitional state.
        throw new RmqRuntimeException(String.format("producer sending message to exchange %s has closed, can't send message", this.getExchange()));
    }
    if (exchange.equals(this.exchange)) {
        return this;
    }
    if (!VIVO_RMQ_AUTH.isAuth(exchange)) {
        throw new VivoRmqUnAuthException(String.format("Topic authentication failed: not authorized to send to exchange %s, send aborted", exchange));
    }
    // Resolve the actual producer bean for this exchange to avoid sending through the wrong one.
    return PRODUCERS.get(exchange);
}

2.2. Cluster addressing

As mentioned earlier, clusters are allocated to applications strictly according to cluster load and business traffic, so the different exchanges/queues used by one application may be allocated to different clusters.

To keep business development efficient, the existence of multiple clusters must be transparent to the business, so the SDK addresses the cluster automatically based on the rmq.topic.key information configured by the application.
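A simplified sketch of how such addressing could work is shown below: the SDK submits the configured key pair to MQ-NameServer, receives the assigned cluster's connection information, and builds a spring-rabbit connection factory from it. NameServerClient, ClusterInfo, and their fields are hypothetical illustrations, not the real MQ-NameServer API.

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

public class ClusterAddressing {

    /** Hypothetical response from MQ-NameServer describing the assigned cluster. */
    public record ClusterInfo(String addresses, String username, String password, String vhost) { }

    /** Hypothetical client that asks MQ-NameServer which cluster a key pair was assigned to. */
    public interface NameServerClient {
        ClusterInfo locateCluster(String topicKey, String secretKey);
    }

    /** Resolve the assigned cluster and build a spring-rabbit connection factory for it. */
    public static CachingConnectionFactory connectionFactoryFor(NameServerClient nameServer,
                                                                String topicKey, String secretKey) {
        ClusterInfo cluster = nameServer.locateCluster(topicKey, secretKey);
        CachingConnectionFactory factory = new CachingConnectionFactory();
        factory.setAddresses(cluster.addresses());   // e.g. "host1:5672,host2:5672"
        factory.setUsername(cluster.username());
        factory.setPassword(cluster.password());
        factory.setVirtualHost(cluster.vhost());
        return factory;
    }
}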

2.3. Client-side rate limiting

The native SDK does not limit sending traffic. When an application misbehaves and keeps sending messages to MQ, it can overwhelm the cluster; and since a cluster is shared by multiple applications, the impact caused by a single application affects every application on that cluster.

Therefore, the SDK needs to provide client-side rate limiting so that, when necessary, an application can be restricted from sending messages to the cluster and cluster stability is preserved.
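A minimal sketch of client-side rate limiting is shown below, using Guava's RateLimiter to cap the publish rate. This is an illustrative approach rather than the vivo SDK's actual implementation, and the permitted rate is assumed to come from the traffic approved in the MQ-Portal work order.

import com.google.common.util.concurrent.RateLimiter;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class RateLimitedProducer {

    private final RabbitTemplate rabbitTemplate;
    private final RateLimiter rateLimiter;

    public RateLimitedProducer(RabbitTemplate rabbitTemplate, double permitsPerSecond) {
        this.rabbitTemplate = rabbitTemplate;
        this.rateLimiter = RateLimiter.create(permitsPerSecond);  // approved messages per second
    }

    public void send(String exchange, String routingKey, Object payload) {
        // Block briefly until a permit is available; tryAcquire() could be used instead
        // to fail fast or fall back to a local buffer when the quota is exhausted.
        rateLimiter.acquire();
        rabbitTemplate.convertAndSend(exchange, routingKey, payload);
    }
}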

2.4. Production and consumption reset

(1) As the business grows, cluster load keeps increasing and the cluster has to be split. To avoid restarting businesses during the split, a production/consumption reset capability is needed.

(2) Cluster anomalies may take consumers offline; a production/consumption reset can then quickly restore business consumption.

Implementing production/consumption reset requires the following steps:

  • Reset connection factory connection parameters

  • Reset connection

  • Establish a new connection

  • Restart production and consumption
// Reset the connection factory's parameters to the newly assigned cluster address,
// drop the existing connection, and rebuild the admin/template so that production
// resumes on the new connection.
CachingConnectionFactory connectionFactory = new CachingConnectionFactory();
connectionFactory.setAddresses(address);
connectionFactory.resetConnection();
rabbitAdmin = new RabbitAdmin(connectionFactory);
rabbitTemplate = new RabbitTemplate(connectionFactory);
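For the consumer side, the sketch below shows one way, under the same spring-rabbit assumptions and not necessarily the SDK's actual implementation, to point an existing listener container at the reset connection factory and restart consumption.

import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

public class ConsumerReset {

    /** Stop the listener container, switch it to the reset connection factory, and start it again. */
    public static void resetConsumer(SimpleMessageListenerContainer container,
                                     ConnectionFactory newConnectionFactory) {
        container.stop();                                      // stop consuming on the old connection
        container.setConnectionFactory(newConnectionFactory);  // point at the new/reset factory
        container.start();                                     // resume consumption on the new connection
    }
}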

In addition, the MQ-SDK has a retransmission strategy for messages that fail to send, which avoids send failures during the production reset.

2.5. Blocking transfer

RabbitMQ blocks publishers when memory usage exceeds the high watermark (40% of available memory by default) or free disk space falls below the configured limit.

Since the vivo middleware team has completed same-city active-active construction for RabbitMQ, when a cluster becomes blocked, production and consumption can be reset to the peer active-active cluster to transfer away from the blockage quickly.
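The sketch below shows one way such blocking could be detected on the client, using the Java client's BlockedListener; the failoverToPeerCluster callback is a hypothetical stand-in for the production/consumption reset flow described in section 2.4.

import com.rabbitmq.client.BlockedListener;
import com.rabbitmq.client.Connection;

public class BlockingFailover {

    public static void watchForBlocking(Connection connection, Runnable failoverToPeerCluster) {
        connection.addBlockedListener(new BlockedListener() {
            @Override
            public void handleBlocked(String reason) {
                // The broker has stopped accepting publishes (memory/disk alarm); switch clusters.
                failoverToPeerCluster.run();
            }

            @Override
            public void handleUnblocked() {
                // The original cluster is accepting publishes again; traffic could be switched back.
            }
        });
    }
}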

2.6. Multi-cluster scheduling

As applications grow, a single cluster may no longer meet an application's traffic demand, and because cluster queues are all mirrored queues, simply adding nodes to a single cluster does not horizontally scale the traffic it can support.

Therefore, the SDK needs to support multi-cluster scheduling, distributing traffic across multiple clusters to meet the needs of high-traffic businesses.
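A minimal sketch of one possible scheduling strategy is shown below: the same exchange/queue is provisioned on several clusters and sends are spread across them round-robin. This is an assumed illustration, not necessarily the strategy used by the vivo SDK.

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class MultiClusterProducer {

    private final List<RabbitTemplate> clusterTemplates;  // one template per physical cluster
    private final AtomicLong counter = new AtomicLong();

    public MultiClusterProducer(List<RabbitTemplate> clusterTemplates) {
        this.clusterTemplates = clusterTemplates;
    }

    public void send(String exchange, String routingKey, Object payload) {
        // Pick the next cluster round-robin; a weighted choice based on cluster load is also possible.
        int index = (int) (counter.getAndIncrement() % clusterTemplates.size());
        clusterTemplates.get(index).convertAndSend(exchange, routingKey, payload);
    }
}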

3. MQ-NameServer -- supporting fast failover for the MQ-SDK

MQ-NameServer is a stateless service whose high availability is ensured by cluster deployment. It mainly solves the following problems:

  • Authenticating the MQ-SDK at startup and locating the cluster an application uses.

  • Processing the periodic metrics reported by the MQ-SDK (messages sent, messages consumed) and returning the currently available cluster addresses, so the SDK reconnects to the correct address when a cluster is abnormal (see the sketch below).

  • Instructing the MQ-SDK to reset production and consumption.
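The sketch below illustrates the SDK-side heartbeat loop described above. The types, fields, and 30-second interval are hypothetical illustrations rather than the real MQ-NameServer protocol.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NameServerHeartbeat {

    /** Hypothetical heartbeat payload: counters collected by the SDK since the last report. */
    public record HeartbeatReport(String topicKey, long sentCount, long consumedCount) { }

    /** Hypothetical heartbeat response: the cluster address the SDK should currently be using. */
    public record HeartbeatResponse(String availableClusterAddress, boolean resetRequired) { }

    public interface NameServerApi {
        HeartbeatResponse report(HeartbeatReport report);
    }

    public interface MetricsCollector { long sent(); long consumed(); }

    public interface Reconnector { void resetTo(String clusterAddress); }

    public static void start(NameServerApi nameServer, String topicKey,
                             MetricsCollector metrics, Reconnector reconnector) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            // Report send/consume counters and receive the currently available cluster address.
            HeartbeatResponse resp = nameServer.report(
                    new HeartbeatReport(topicKey, metrics.sent(), metrics.consumed()));
            // If the NameServer asks for a reset (e.g. the current cluster is abnormal),
            // reconnect to the address it returned.
            if (resp.resetRequired()) {
                reconnector.resetTo(resp.availableClusterAddress());
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}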

4. MQ-Server high availability deployment practice

(Figure: MQ-Server high-availability deployment architecture)

The RabbitMQ clusters all use a same-city active-active deployment architecture and rely on the cluster addressing and failover capabilities provided by the MQ-SDK and MQ-NameServer to ensure cluster availability.

4.1. Handling cluster split brain problems

RabbitMQ officially provides three cluster partition (split-brain) handling strategies.

(1)ignore

Ignore the split brain and do nothing; human intervention is required to recover when it occurs. Because recovery is manual, some messages may be lost. This can be used when the network is very reliable.

(2)pause_minority

When a node loses connectivity with more than half of the cluster nodes, it automatically pauses until it detects that communication with the majority has been restored. In extreme cases all nodes in the cluster pause, making the cluster unavailable.

(3) autoheal

Nodes in the losing (minority) partition restart automatically. This strategy prioritizes service availability over data reliability, because messages on the restarting nodes may be lost.

Because the RabbitMQ clusters are all deployed same-city active-active, even if a single cluster becomes abnormal its business traffic can automatically migrate to the cluster in the peer data center, so the pause_minority strategy (cluster_partition_handling = pause_minority) is chosen to handle split brain.

In 2018, network jitter caused cluster split brain several times; after the partition handling strategy was changed, the split-brain problem no longer occurred.

4.2. Cluster high-availability solution

RabbitMQ is deployed as clusters, and because the pause_minority partition handling strategy is used, each cluster requires at least 3 nodes.

It is recommended to deploy highly available clusters with 5 or 7 nodes and to keep the number of queues per cluster under control.

Cluster queues are mirrored queues, so messages are replicated and are not lost when a node fails.

Exchanges, queues, and messages are all set to be durable/persistent to avoid message loss when a node restarts abnormally.

All queues are configured as lazy queues to reduce fluctuations in node memory usage.
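A short spring-amqp sketch matching these settings is shown below (the exchange, queue, and routing-key names are illustrative): a durable exchange and durable queue, with the queue declared lazy via the x-queue-mode argument. Mirroring itself is configured on the broker through an HA policy rather than in client code, and in Spring AMQP persistent delivery is the default for messages sent with RabbitTemplate.

import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.ExchangeBuilder;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;

public class DurableLazyDeclarations {

    // Durable exchange: survives broker restarts.
    public static final DirectExchange EXCHANGE =
            ExchangeBuilder.directExchange("demo.exchange").durable(true).build();

    // Durable lazy queue: messages are kept on disk, keeping node memory usage steadier.
    public static final Queue QUEUE = QueueBuilder.durable("demo.queue")
            .withArgument("x-queue-mode", "lazy")
            .build();

    public static final Binding BINDING =
            BindingBuilder.bind(QUEUE).to(EXCHANGE).with("demo.key");
}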

4.3. Same-city active-active construction

Equivalent clusters are deployed in two data centers, and the two clusters are joined into a federated cluster through the Federation plugin.

Application machines connect preferentially to the MQ cluster in their own data center, to avoid application anomalies caused by jitter on the inter-datacenter dedicated line.

The latest available cluster information is obtained through the MQ-NameServer heartbeat, and when an exception occurs the SDK reconnects to the active-active peer cluster so that application functionality recovers quickly.

3. Future challenges and prospects

At present, enhancements around RabbitMQ are mainly made on the MQ-SDK and MQ-NameServer side, and the SDK implementation is relatively complex. In the future, we hope to build a proxy layer for the messaging middleware, which would simplify the SDK and allow finer-grained management of business traffic.

Author: derek

Source: blog.51cto.com/14291117/2544083