RabbitMQ cluster failure queue recovery

This article comes from the Internet and I do not know its original author. I am recording it here for learning purposes, adding my own understanding and supplements.


RabbitMQ's mirrored queue mechanism is the simplest queue HA solution. On top of an existing cluster, queues can be mirrored to multiple nodes as required by adding policy options such as ha-mode and ha-params, achieving high availability and removing the single point of failure that queue contents have in plain cluster mode.

Before using mirrored queues, there are several precautions to keep in mind (below, "master node" and "master", and "slave node" and "slave", are used interchangeably):

1. Mirrored queues cannot be used for load balancing, because every operation is repeated on every node.

2. The ha-mode parameter and a durable declaration have no effect on exclusive queues, because an exclusive queue is tied to its connection and is deleted automatically when that connection closes. These two parameters are therefore meaningless for exclusive queues.

3. When a new node is added to an existing mirrored queue, the default is ha-sync-mode=manual: messages in the mirrored queue are not synchronised to the new node unless the synchronisation command is called explicitly. While a sync command (via rabbitmqctl or the web UI) runs, the queue blocks and cannot be operated on until synchronisation completes. With ha-sync-mode=automatic, the queue is synchronised automatically when a new node joins (see the policy sketch after this list). Because synchronisation blocks the queue, it is not recommended to run it against an active production queue that is still publishing and consuming messages.

4. Whenever a node joins or rejoins a mirrored queue (for example after recovering from a network partition), its previously saved queue contents are cleared.

5. A mirrored queue consists of one master node (master) and zero or more slave nodes (slaves). When the master goes down, a new master is elected from the slaves; the election picks the slave that started first (the oldest).

6. When all slaves are unsynchronised with the master and the ha-promote-on-shutdown policy is set to when-synced (the default): if the master is stopped deliberately, for example via the rabbitmqctl stop command or a graceful OS shutdown, no slave takes over and the mirrored queue becomes unavailable; if the master stops for a passive reason, such as a VM or OS crash, a slave does take over. The implicit value judgment of this setting is to prioritise message reliability (not losing messages) over availability. If ha-promote-on-shutdown is set to always, a slave takes over no matter how the master stops, prioritising availability (the sketch after this list shows how the key is set).

7. The last node to stop in a mirrored queue becomes the master, and it must be started first. If a slave starts first, it waits 30 seconds for the master to come up before joining the cluster. When all nodes go offline at the same time for some reason (a power failure, for example), every node believes it was not the last to stop; to restore the mirrored queue, try to start all nodes within 30 seconds of each other.

8. For a mirrored queue, a client's basic.publish operation is synchronised to all nodes; all other operations are relayed through the master, which then applies them to the slaves. For example, with basic.get, if the client holds a TCP connection to a slave, the slave forwards the basic.get request to the master; the master prepares the data and returns it to the slave, which delivers it to the consumer.

9. It follows from point 8 that when a slave goes down, the only impact is that client connections to that slave are dropped. When the master goes down, the following chain reaction occurs: 1) All client connections to the master are dropped. 2) The oldest slave is elected master; if all slaves are unsynchronised at that point, the unsynchronised messages are lost. 3) The new master requeues all unacked messages, because it cannot tell whether those messages reached the client, whether the acks were lost on the way to the old master, or whether they were lost while the old master multicast the acks to the slaves. For the sake of message reliability it therefore requeues everything unacked, and clients may receive duplicate messages. 4) If a client is connected to a slave and the x-cancel-on-ha-failover argument was specified on basic.consume, the client receives a Consumer Cancellation Notification; the Java SDK invokes the handleCancel() callback, so that method needs to be overridden. If x-cancel-on-ha-failover is not specified, the consumer never learns that the master went down and waits forever.
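As a concrete illustration of items 3 and 6, here is a minimal sketch of setting the relevant policy keys with rabbitmqctl. The policy name ha-mirror, the queue-name pattern ^ha\. and the queue ha.orders are hypothetical placeholders, not part of the original article.

```bash
# Mirror matching queues to all nodes, synchronise new mirrors automatically
# (item 3), and prefer availability on a controlled master shutdown (item 6).
rabbitmqctl set_policy --apply-to queues ha-mirror "^ha\." \
  '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'

# With the default ha-sync-mode=manual, an added mirror stays unsynchronised
# until a sync is triggered explicitly; the queue blocks while the sync runs.
rabbitmqctl list_queues name slave_pids synchronised_slave_pids   # check mirror state
rabbitmqctl sync_queue ha.orders                                  # hypothetical queue name
```

Leaving ha-promote-on-shutdown at its default of when-synced instead keeps the reliability-first behaviour described in item 6.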

The considerations listed above are compiled from the official HA documentation.

The following mirrored-queue recovery scenarios are the focus of this article:

* Premise: two nodes (A and B) form a mirrored queue.

* Scenario 1: A stops first, then B stops.

In this scenario B is the master. Start B first and then A, or start A first and then start B within 30 seconds, to restore the mirrored queue.

* Scenario 2: A and B stop at the same time.

This scenario may be caused by a power failure or similar. Simply start A and B within 30 seconds of each other to restore the mirrored queue.
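A minimal sketch of such a restart, assuming both machines are reachable over SSH and RabbitMQ runs as a systemd service (the hostnames A and B are the article's placeholders):

```bash
# Start both nodes within the 30-second window.
ssh A 'systemctl start rabbitmq-server' &
ssh B 'systemctl start rabbitmq-server' &
wait

# Once both are up, confirm the cluster re-formed (run on either node).
rabbitmqctl cluster_status
```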

* Scenario 3: A stops first, then B stops, and A cannot be recovered.

This scenario is an enhanced version of scenario 1. Since B is the master, once B is up, run rabbitmqctl forget_cluster_node A on node B to remove A from the cluster, then add a new slave node to B to restore the mirrored queue.
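A sketch of the commands involved, assuming node names rabbit@A and rabbit@B and a replacement slave rabbit@C (all placeholders):

```bash
# On B, the surviving master: remove the unrecoverable node A from the cluster.
rabbitmqctl forget_cluster_node rabbit@A

# On the fresh replacement node C: join it to B as a new slave.
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@B
rabbitmqctl start_app
```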

* Scenario 4: A stops first, then B stops, and B cannot be recovered.

This scenario is an enhanced version of scenario 3 and is harder to handle. There seemed to be no good solution back in the 3.1.x era (perhaps I simply did not know of one), but there is one now, and it works in version 3.4.2. Because B is the master, A cannot simply be started; and since A fails to start, rabbitmqctl forget_cluster_node B cannot be run on node A in the normal way. In newer versions, forget_cluster_node supports an --offline parameter, which lets rabbitmqctl execute the command against an offline node, forcing RabbitMQ to promote one of the not-yet-started slave nodes to master. When rabbitmqctl forget_cluster_node --offline B is executed on node A, RabbitMQ spins up a temporary node on A's behalf, executes the forget_cluster_node command to remove B from the cluster, and A can then start normally. Finally, add a new slave node to A to restore the mirrored queue.
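A sketch under the same placeholder node names; it assumes the command is run on A's machine while A's RabbitMQ node is still stopped:

```bash
# Remove the unrecoverable master B from A's local database. rabbitmqctl
# starts a temporary helper node to do this, since A itself cannot start yet.
rabbitmqctl forget_cluster_node --offline rabbit@B

# A should now be able to start; afterwards add a new slave as in scenario 3.
rabbitmq-server -detached
rabbitmqctl cluster_status
```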

* Scenario 5: A stops first, then B stops, neither A nor B can be recovered, but the disk files of A or B can be obtained.

This scenario is an enhanced version of scenario 4 and is even harder to handle. Copy the database files of A or B (by default under the $RABBIT_HOME/var/lib directory) to the same directory on a new node C, and change C's hostname to the hostname of A or B. If the copied files are from node A, proceed as in scenario 4; if they are from node B, proceed as in scenario 3. Finally, add a new slave node to C to restore the mirrored queue.
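A sketch of the file transplant, assuming A's data was salvaged onto a fresh machine C. The source path is a placeholder and the target follows the article's $RABBIT_HOME/var/lib convention; actual data directories differ between installations.

```bash
# On C, with RabbitMQ stopped: drop in the salvaged database files.
cp -a /path/to/salvaged/A/var/lib/rabbitmq/. "$RABBIT_HOME/var/lib/rabbitmq/"

# C must answer to the old node's hostname before RabbitMQ is started.
hostnamectl set-hostname A

# Then start RabbitMQ on C and continue as in scenario 4 (A's files)
# or scenario 3 (B's files).
```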

* Scenario 6: A stops first, then B stops, neither A nor B can be recovered, and neither node's disk files can be obtained.

In this scenario, the contents of the queues on A and B cannot be recovered.
