RabbitMQ in Action: availability analysis and implementation

This series is a summary note of the book "RabbitMQ in Action: Efficient Deployment of Distributed Message Queues".

The previous article introduced best practices for various scenarios. Most scenarios can use the "fire and forget" model, which does not require a response. If a response is required, the RabbitMQ RPC model can be used.

RabbitMQ decouples systems from one another asynchronously. The caller sends a business request to the Rabbit server and can return immediately. Rabbit then ensures that the request is processed correctly, even in the face of network failures, Rabbit server crashes, or a power outage across the entire data center. For these special scenarios, Rabbit provides various mechanisms to ensure availability.

This article summarizes the failure scenarios that may occur, analyzes the availability guarantees Rabbit provides for them, and looks at how they are implemented. It covers:

Summary of abnormal scenarios
Clustering and handling failure
Connection loss and failover
Primary/standby mode
Replication across data centers

Abnormal scenarios

In day-to-day work, a large part of the time is spent handling abnormal situations, such as validating user input, dealing with the exception classes provided by the JDK, and handling network errors. These are relatively easy to solve.

As the bridge between the caller and the processor, the Rabbit service is critical. If it becomes unavailable because of a network failure, a single server crash, or a data center outage, every business system that depends on it is affected.

Network anomalies

The processor and the server interact over a long-lived connection so that messages can be pushed in real time. A network failure may break this connection. If the client does not notice, the processor will stop receiving messages. This situation is called "connection lost".

This problem can be solved by catching the connection exception and reconnecting. The Rabbit client libraries also encapsulate the connection, which makes this kind of problem easy to handle.
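
The catch-and-reconnect idea can be sketched roughly as follows. This is a minimal illustration, not code from the book or from any client library: ConnectionLost, consume_with_reconnect, and the connect/handle callables are hypothetical names, and the exponential backoff is an assumption added for the example.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for the exception a client raises on a dropped connection."""


def consume_with_reconnect(connect, handle, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Consume until the stream ends cleanly, reconnecting after each loss.

    connect() must return an iterable of messages; iterating may raise ConnectionLost.
    Returns the number of reconnects that were needed.
    """
    reconnects = 0
    failures = 0  # consecutive failed attempts
    while True:
        try:
            for message in connect():
                failures = 0           # a delivered message proves the link works
                handle(message)
            return reconnects          # stream finished without an error
        except ConnectionLost:
            failures += 1
            if failures >= max_attempts:
                raise                  # give up and let the caller decide
            reconnects += 1
            sleep(base_delay * 2 ** (failures - 1))  # exponential backoff
```

With a real client, connect would open the connection and channel and return the stream of deliveries, and the except clause would name that client's own connection exceptions.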

Server crashes

With only one server, a crash makes the service unavailable. The usual answer is a cluster, which presents multiple servers to the outside world as a single service, so that the crash of one server does not take down the service as a whole.

After using a cluster, there are a few issues to consider:

A client may connect to any server in the cluster, but a queue lives on only one of them, so every server must store queue metadata (similar to an index) and be able to fetch the actual queue data from the server that holds it;
A server crash destroys its non-persistent queues and exchanges. After the client reconnects, they must be created again, but the unconsumed messages are not recovered;
If queues, exchanges, and messages are persistent, how are they restored? Rabbit provides several ways to deal with this, described in detail later;
Subscribers also need to re-establish their connections and resume listening.

Data center outage

To survive the outage of an entire data center, you need to build multiple data centers. RabbitMQ provides a mechanism for easily copying messages between Rabbits in different data centers.

Clustering and handling failure

One of the best features of RabbitMQ is its built-in clustering, which serves two main goals:

Allow consumers and producers to continue running in the event of a Rabbit node crash;
Linearly scale the throughput of message communication by adding more nodes;

Cluster Architecture

RabbitMQ always records four types of internal metadata (like indexes):

Queue metadata: the queue name and its properties;
Exchange metadata: Exchange name, type and properties;
Binding metadata: a simple table showing how to route messages to queues;
vhost metadata: provides namespace and security attributes for queues, exchanges and bindings within a vhost;

When a cluster is introduced, new types of metadata need to be tracked: cluster node locations, and how nodes relate to other types of metadata that have been recorded.

Not every node has a full copy of every queue. When you create a queue in a cluster, the complete queue (metadata, state, contents) is created on a single node only. All other nodes know only the queue's metadata and a pointer to the node where the queue lives.
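
A toy model may make the split between replicated metadata and single-node contents easier to picture. The Node and Cluster classes and the node names below are invented for illustration and do not correspond to RabbitMQ's internal data structures.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.metadata = {}   # queue name -> owner node name (replicated everywhere)
        self.contents = {}   # queue name -> list of messages (owner node only)


class Cluster:
    def __init__(self, node_names):
        self.nodes = {name: Node(name) for name in node_names}

    def declare_queue(self, queue, owner):
        # The full queue (its contents) is created on exactly one node...
        self.nodes[owner].contents[queue] = []
        # ...while every node records the metadata plus a pointer to the owner.
        for node in self.nodes.values():
            node.metadata[queue] = owner

    def publish(self, via, queue, message):
        # A client may be connected to any node; that node forwards to the owner.
        owner = self.nodes[self.nodes[via].metadata[queue]]
        if not owner.alive:
            raise RuntimeError(f"owner node {owner.name} is down")
        owner.contents[queue].append(message)
```

In this model a message published through rabbit3 to a queue owned by rabbit1 ends up only in rabbit1's contents, and publishing fails once rabbit1 is marked as down, even though every node still knows the queue exists.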

If that node crashes, the consumers attached to the queue can no longer receive new messages. Consumers can reconnect to the cluster and recreate the queue, but only if the queue is not durable. The restriction ensures that when the failed node recovers and rejoins the cluster, the messages in the queue on that node are not lost.

Why not copy queue contents and state to all nodes? First, storage space: if every cluster node held a full copy of all queues, adding nodes would not add storage capacity. Second, performance: publishing a message would require replicating it to every cluster node, and for persistent messages the network and disk load would grow with every node added.

An exchange is just a lookup table, not an actual message router, so replicating exchanges across the cluster is simple.

Think of each queue as a process running on a node, each process has its own process ID, and an exchange is just a list of routing patterns and a list of queue process IDs to which matching messages should be sent.
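
The "exchange as a lookup table" view can be sketched in a few lines. This is an illustrative model only: the Exchange class and queue ids are made up, and the matcher follows the usual topic-exchange convention in which "*" stands for exactly one word and "#" for zero or more words.

```python
def topic_matches(pattern, routing_key):
    """Topic-style match: '*' is exactly one word, '#' is zero or more words."""
    p, k = pattern.split("."), routing_key.split(".")

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may swallow any number of the remaining words, including none.
            return any(match(i + 1, nxt) for nxt in range(j, len(k) + 1))
        if j == len(k):
            return False
        return p[i] in ("*", k[j]) and match(i + 1, j + 1)

    return match(0, 0)


class Exchange:
    """A lookup table of (routing pattern, queue id) rows and nothing more."""

    def __init__(self):
        self.bindings = []

    def bind(self, pattern, queue_id):
        self.bindings.append((pattern, queue_id))

    def route(self, routing_key):
        # Return the ids of all queues whose pattern matches the routing key.
        return [q for pattern, q in self.bindings if topic_matches(pattern, routing_key)]
```

Because the table holds only patterns and queue ids, never message contents, copying it to every node is cheap compared with copying the queues themselves.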

Exchanges and Queues in a Cluster Architecture

Each Rabbit node is either a RAM node or a disk node. A single-node system can only run as a disk node. In a cluster, you can choose to configure some nodes as RAM nodes.

When declaring queues, exchanges, or bindings in a cluster, these operations do not return until all cluster nodes have successfully committed metadata changes.

RabbitMQ requires at least one disk node in the cluster. If there is only one disk node and it crashes, the cluster can continue to route messages but cannot create queues, exchanges, or bindings, add users, or change permissions. It is therefore recommended to set up two disk nodes. When a memory node restarts, it connects to its pre-configured disk nodes and downloads a copy of the current cluster metadata, so each memory node must be told about all of the disk nodes.

Mirrored queues

As mentioned earlier, a queue lives on only one node in the cluster, so when that node crashes, the queue's messages go with it. Since RabbitMQ 2.6, mirrored queues have been available: once the master copy of a queue becomes unavailable, one of the slave copies is elected as the new master.

For a mirrored queue, in addition to being delivered to the appropriate queue according to the routing bindings, each message is also delivered to the slave copies of the mirrored queue.

With publisher confirms, Rabbit does not notify the publisher until all queues and all slave copies of those queues have safely received the message.

Mirrored queues also raise a problem: if the node holding the master copy fails, a slave copy is elected as the new master, and all consumers of the queue need to re-attach and listen to the new master copy. Consumers connected through the failed node can detect the failure from the lost TCP connection, but consumers attached to the mirrored queue through a node that is still running normally cannot.

For those consumers, Rabbit sends a consumer cancellation notification, telling them that they are no longer attached to the master copy of the queue and need to re-subscribe.
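
The interplay between replication, promotion, and the cancellation notification can be modeled with a small toy class. Everything here is hypothetical (MirroredQueue, fail_node, the callback shape), and promoting the oldest remaining copy is an assumption of the sketch rather than something stated in this article.

```python
class MirroredQueue:
    def __init__(self, nodes):
        # The first node holds the master copy; the rest hold slave copies.
        self.copies = {node: [] for node in nodes}
        self.order = list(nodes)     # oldest first; used for promotion
        self.consumers = {}          # consumer tag -> on_cancel callback

    @property
    def master(self):
        return self.order[0]

    def publish(self, message):
        # Deliver to the master and to every slave copy; only then confirm.
        for node in self.order:
            self.copies[node].append(message)
        return True                  # stands in for the publisher confirm

    def subscribe(self, tag, on_cancel):
        self.consumers[tag] = on_cancel

    def fail_node(self, node):
        was_master = node == self.master
        self.order.remove(node)
        del self.copies[node]
        if not self.order:
            raise RuntimeError("no copies left")
        if was_master:
            # A slave has been promoted; tell every consumer to re-subscribe.
            cancelled, self.consumers = self.consumers, {}
            for tag, on_cancel in cancelled.items():
                on_cancel(tag)
```

Losing a slave node changes nothing for consumers in this model, while losing the master promotes the next copy, which already holds the messages, and triggers each consumer's cancel callback so that it can subscribe again.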

Connection loss and failover

This section mainly discusses how consumers detect connection loss and reconnect.

There are several strategies for reconnecting to the cluster. A good one is to use a load balancer, which both reduces the complexity of the application's node-failure handling code and ensures an even distribution of connections across the cluster.

Load balancing is widely covered elsewhere, so it is not discussed in depth here. The focus is on how to detect failures and reconnect.

Detecting a failure is relatively simple: when the long-lived connection drops, an exception is thrown and can be caught.

When a cluster node fails, the application has to decide where to connect next. That job is handed over to the load balancer.

Regarding reconnection handling, consider:

If you reconnect to a new server, the channels and all the consume loops on them are invalid and must be rebuilt;
After reconnecting, queues and bindings may no longer exist and must be declared again.
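
One way to handle both points is to keep all declarations in a single function and call it on every (re)connect. The sketch below is an assumption-laden illustration: the method and keyword names mirror the pika 1.x channel API, but rebuild_topology, RecordingChannel, and the durable direct exchange are choices made for this example only.

```python
def rebuild_topology(channel, exchange, queue, routing_key, on_message):
    """Re-declare everything a consumer needs on a freshly opened channel.

    Declaring an object that already exists with the same arguments is
    harmless, so this can run on every (re)connect.
    """
    channel.exchange_declare(exchange=exchange, exchange_type="direct", durable=True)
    channel.queue_declare(queue=queue, durable=True)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)
    # The old consume loop died with the old channel, so subscribe again.
    channel.basic_consume(queue=queue, on_message_callback=on_message)


class RecordingChannel:
    """Test double that records calls instead of talking to a broker."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the channel methods.
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record
```

Passing a RecordingChannel shows the order of operations without a running server; with a real channel the same function would be called right after each successful reconnect.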

Primary/standby mode

When availability requirements are particularly high and message loss is not acceptable, queues, exchanges, and messages need to be made persistent. If a node crashes, however, its queues cannot deliver messages until it recovers, because the default cluster architecture does not allow other nodes in the cluster to recreate those queues. This restriction prevents historical messages from being lost after the failed node recovers.

This problem can be solved by running independent RabbitMQ servers as a primary/standby pair, known as the warren pattern. A warren is a pair of independent primary and standby servers with a load balancer in front to handle failover.

The primary and standby servers do not cooperate with each other; the standby server processes messages only when the primary server has crashed. This guarantees that after the primary node fails, the queues and exchanges are re-created on the standby node and service continues. After the failed node recovers, the messages on the primary node that had not been consumed can still be consumed.
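
The failover decision made by the load balancer in a warren amounts to "use the primary unless it is down". A minimal sketch, with choose_server and the health-check callable as hypothetical names:

```python
def choose_server(servers, is_healthy):
    """Return the first healthy server; servers are listed primary first."""
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("no RabbitMQ server available")
```

Because the list is ordered, traffic returns to the primary as soon as its health check passes again, which is when the messages left on it become consumable.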

Replication across data centers

Within a single data center, a RabbitMQ cluster is a great way to improve messaging performance. Routing messages to a program in another city is more troublesome, and this is the problem Shovel solves.

Shovel is a RabbitMQ plugin that lets you define a replication relationship between a queue on one RabbitMQ and an exchange on another. Put simply, it is meant for cases where producers and consumers are far apart.

You create a new queue in data center 1 to receive the messages published by the website, then let Shovel consume these messages and re-publish them over the WAN connection to the exchange in data center 2.

In this way, a user request can return as soon as its message has been published to the queue in data center 1, which reduces response time. Data center 1 then continues forwarding the messages to data center 2.
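
The consume-then-republish flow can be sketched as a loop that removes a message from the local queue only after the remote publish has succeeded. This is a toy model, not the plugin's implementation or configuration: shovel, WanDown, and publish_remote are hypothetical names, and a deque stands in for the local queue.

```python
from collections import deque


class WanDown(Exception):
    """Hypothetical stand-in for a failed publish over the WAN link."""


def shovel(local_queue, publish_remote):
    """Move messages from a local queue to a remote exchange, oldest first.

    A message leaves the local queue only after the remote publish succeeds,
    so a WAN failure leaves the backlog intact. Returns the number forwarded.
    """
    forwarded = 0
    while local_queue:
        message = local_queue[0]      # peek; do not remove yet
        try:
            publish_remote(message)
        except WanDown:
            break                     # retry later; the message is still queued
        local_queue.popleft()         # the "ack" after a successful publish
        forwarded += 1
    return forwarded
```

The website only ever appends to the local queue, so its latency does not depend on the WAN link, and a second call to shovel after the link recovers picks up where the first one stopped.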

Shovel process

As the sections above show, ensuring high availability takes a good deal of work, and the architecture should be chosen according to the level of availability the business requires.

The next article focuses on the RabbitMQ management interface and monitoring.
