Message queue reliability (high availability)

RabbitMQ and Kafka are used as examples here:

1. RabbitMQ reliability (no message loss)

Sender confirmation mode

The producer puts data into the message queue; once the queue has data, it actively pushes it to the consumer (commonly known as push).

Set the channel to confirm mode (sender confirmation mode); every message published on that channel is assigned a unique ID.

Once a message has been delivered to its destination queue, or written to disk (for a persistent message), the channel sends an acknowledgment to the producer containing the message's unique ID.

If an internal RabbitMQ error causes the message to be lost, a nack (not acknowledged) is sent instead.

Sender confirmation mode is asynchronous: the producer application can keep sending messages while waiting for confirmations. When a confirmation reaches the producer application, its registered callback method is invoked to handle it.
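The flow above can be sketched as a toy, stdlib-only simulation; the `Broker` and `Producer` classes and all names here are illustrative, not the real RabbitMQ/pika API:

```python
import itertools

class Broker:
    """Toy stand-in for a RabbitMQ channel in confirm mode: every published
    message gets a unique ID, and confirmations arrive asynchronously."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.queue = []       # messages the broker actually accepted
        self._pending = []    # (msg_id, ok) confirmations not yet delivered

    def publish(self, body, fail=False):
        msg_id = next(self._ids)          # unique ID per message on the channel
        if not fail:
            self.queue.append((msg_id, body))
        self._pending.append((msg_id, not fail))  # ack, or nack on internal error
        return msg_id

    def deliver_confirms(self, producer):
        pending, self._pending = self._pending, []
        for msg_id, ok in pending:        # confirms arrive later, via a callback
            producer.on_confirm(msg_id, ok)

class Producer:
    def __init__(self, broker):
        self.broker = broker
        self.unconfirmed = {}             # msg_id -> body, awaiting confirm

    def send(self, body, fail=False):
        msg_id = self.broker.publish(body, fail=fail)
        self.unconfirmed[msg_id] = body   # keep publishing; don't block on acks

    def on_confirm(self, msg_id, ok):
        body = self.unconfirmed.pop(msg_id)
        if not ok:                        # nack: the broker lost it, resend
            self.send(body)

broker = Broker()
producer = Producer(broker)
producer.send("order-1")
producer.send("order-2", fail=True)       # simulate an internal broker error
producer.send("order-3")                  # sending continues before any confirm
broker.deliver_confirms(producer)         # acks 1 and 3, nacks 2 -> resent
broker.deliver_confirms(producer)         # ack for the resent "order-2"
```

Note how the producer never waits: it tracks unconfirmed messages in a map and reacts to acks/nacks in the callback, which is exactly what makes the mode asynchronous.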

RabbitMQ persistence

To prevent RabbitMQ itself from losing data, you must turn on RabbitMQ persistence. After a message is written it is persisted to disk, so even if RabbitMQ goes down, it automatically reloads the previously stored data on recovery and nothing is lost. The exception is the rare case where RabbitMQ crashes before it has persisted a message, which can lose a small amount of data, but the probability of this is small.

Setting up persistence takes two steps:

The first is to declare the queue as durable when creating it; this ensures RabbitMQ persists the queue's metadata, but not the messages in it.

The second is to set the message's deliveryMode to 2 when sending it; this marks the message itself as persistent, and RabbitMQ will then write it to disk.

Both settings must be enabled together. Then, even if RabbitMQ goes down and restarts, it will restore the queue from disk together with the messages in it.
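The two-step requirement can be modeled with a toy, stdlib-only broker; the class and file layout are invented for illustration and are not how RabbitMQ stores data on disk:

```python
import json
import os
import tempfile

class ToyBroker:
    """Toy model of the two persistence settings: a durable queue persists
    its metadata, and a message reaches disk only when the queue is durable
    AND the message's delivery_mode is 2."""
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.queues = {}
        # Simulate crash recovery: reload every queue that was persisted.
        for fname in os.listdir(data_dir):
            with open(os.path.join(data_dir, fname)) as f:
                saved = json.load(f)
            self.queues[saved["name"]] = {"durable": True,
                                          "messages": list(saved["persisted"]),
                                          "persisted": list(saved["persisted"])}

    def queue_declare(self, name, durable=False):
        if name in self.queues:
            return
        self.queues[name] = {"durable": durable, "messages": [], "persisted": []}
        if durable:                       # step 1: persist the queue metadata
            self._flush(name)

    def publish(self, queue, body, delivery_mode=1):
        q = self.queues[queue]
        q["messages"].append(body)
        if q["durable"] and delivery_mode == 2:
            q["persisted"].append(body)   # step 2: persist the message itself
            self._flush(queue)

    def _flush(self, name):
        with open(os.path.join(self.data_dir, name + ".json"), "w") as f:
            json.dump({"name": name,
                       "persisted": self.queues[name]["persisted"]}, f)

data_dir = tempfile.mkdtemp()
broker = ToyBroker(data_dir)
broker.queue_declare("orders", durable=True)
broker.publish("orders", "kept", delivery_mode=2)   # survives a restart
broker.publish("orders", "lost", delivery_mode=1)   # held in memory only

restarted = ToyBroker(data_dir)   # "crash" and restart: reload from disk
```

After the simulated restart, only the `delivery_mode=2` message is recovered, which is the point of setting both the durable queue and the persistent delivery mode.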

Persistence can also be combined with the confirm mechanism on the producer side: the ack is sent to the producer only after the message has been persisted to disk. So even if RabbitMQ crashes before the message reaches disk and the data is lost, the producer will not receive the ack and can resend the message itself.

Even with the persistence mechanism turned on, there is still a window in which a message has been written to RabbitMQ but not yet persisted to disk; if RabbitMQ happens to crash at that moment, the small amount of data still in memory is lost.

Receiver confirmation mechanism

Consumers continuously poll the message queue for new data and consume it when present (commonly known as pull).

A consumer must acknowledge each message after receiving it (receiving a message and acknowledging it are two separate operations). Only after the consumer acknowledges a message can RabbitMQ safely delete it from the queue.

No timeout mechanism is used here: RabbitMQ decides whether a message needs to be redelivered solely by whether the consumer's connection is interrupted. In other words, as long as the connection stays open, RabbitMQ gives the consumer as much time as it needs to process the message, ensuring eventual consistency of the data.

Several special cases are listed below:

(1) If the consumer disconnects or cancels its subscription after receiving a message but before acknowledging it, RabbitMQ assumes the message was not delivered and redistributes it to the next subscribed consumer. (This carries a risk of duplicate consumption, which needs to be deduplicated.)

(2) If the consumer receives a message but does not acknowledge it, and the connection stays open, RabbitMQ considers the consumer busy and will not distribute further messages to it.
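Case (1) can be sketched with a toy, stdlib-only queue; the names mimic AMQP concepts (delivery tags, `basic_ack`) but this is an illustration, not the real client API:

```python
import itertools

class ToyQueue:
    """Toy model of receiver-side acks: a delivered message is deleted only
    after the consumer acks it; if the connection drops first, the message
    is requeued and redelivered."""
    def __init__(self, messages):
        self.messages = list(messages)
        self.unacked = {}                 # delivery_tag -> message
        self._tags = itertools.count(1)

    def deliver(self):
        msg = self.messages.pop(0)        # hand out, but do NOT delete yet
        tag = next(self._tags)
        self.unacked[tag] = msg
        return tag, msg

    def basic_ack(self, tag):
        del self.unacked[tag]             # only an ack makes deletion safe

    def connection_lost(self):
        # Consumer disconnected before acking: requeue everything unacked.
        self.messages = list(self.unacked.values()) + self.messages
        self.unacked.clear()

q = ToyQueue(["m1", "m2"])
tag, msg = q.deliver()        # consumer receives "m1" ...
q.connection_lost()           # ... but dies before acking
tag, msg = q.deliver()        # "m1" is redelivered: a possible duplicate
q.basic_ack(tag)              # acked, now it is really gone
```

The redelivery of "m1" is exactly the duplicate-consumption hazard noted above, which the consumer must handle idempotently.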

2. Kafka's reliability

The consumer loses data

The only situation in which the consumer can lose data is this: you receive a message, and the consumer automatically commits the offset, so Kafka believes you have already consumed it; in fact you were only about to process it, and if you crash before doing so, that message is lost.

Kafka commits offsets automatically by default, so simply turn off auto-commit and commit the offset manually after processing; this ensures the data is not lost. It does, however, introduce duplicate consumption: for example, if you crash after processing a message but before committing its offset, you will certainly consume that message again on restart, so you just have to guarantee idempotence.
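The commit-after-processing pattern plus an idempotence guard can be simulated with plain Python; the message log, IDs, and "crash" hook are all invented for illustration (in practice the dedup set might be a database unique key):

```python
class IdempotentConsumer:
    """Manual-commit consumer: commit the offset only AFTER processing,
    and deduplicate by message ID so re-consuming after a crash is harmless."""
    def __init__(self):
        self.committed_offset = 0
        self.processed_ids = set()   # idempotence guard
        self.balance = 0

    def process(self, msg_id, amount):
        if msg_id in self.processed_ids:   # duplicate from a re-consume: skip
            return
        self.processed_ids.add(msg_id)
        self.balance += amount

    def poll_and_handle(self, log, crash_before_commit_at=None):
        offset = self.committed_offset     # resume from the last commit
        while offset < len(log):
            msg_id, amount = log[offset]
            self.process(msg_id, amount)
            if offset == crash_before_commit_at:
                return                     # crashed: offset NOT committed
            self.committed_offset = offset + 1   # manual commit after processing
            offset += 1

log = [("pay-1", 10), ("pay-2", 20), ("pay-3", 30)]
c = IdempotentConsumer()
c.poll_and_handle(log, crash_before_commit_at=1)  # crash after processing pay-2
c.poll_and_handle(log)   # restart: pay-2 is re-consumed but not re-applied
```

Because the crash happens after processing "pay-2" but before its commit, the restart re-reads it; the `processed_ids` check is what keeps the balance correct.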

One problem we encountered in production: after our Kafka consumers received data, they first wrote it to an in-memory queue to buffer it. Sometimes a message had only just been written to the memory queue when the consumer auto-committed its offset.

If we then restarted the system, the unprocessed data still in the memory queue was lost.

Kafka lost data

A common scenario here is a Kafka broker going down, followed by re-election of the partition leader. Think about it: if some followers have not yet finished syncing data when the leader crashes, and one of those followers is then elected leader, some data is lost.

We have run into this in production as well: a Kafka leader machine went down, and after a follower was switched over to leader, we found that some data had been lost.
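The failure can be shown in a few lines of plain Python; the log contents and lag of three messages are arbitrary illustration values:

```python
# Five messages are written to the leader, but the follower has only
# replicated the first three when the leader's broker crashes.
leader_log = [f"msg-{i}" for i in range(5)]
follower_log = leader_log[:3]          # replication lag

# The lagging follower is elected as the new leader: the unreplicated
# tail of the old leader's log is simply gone.
new_leader_log = follower_log
lost = [m for m in leader_log if m not in new_leader_log]
```

Here `msg-3` and `msg-4` vanish with the old leader, which is precisely what the four parameters below are meant to prevent.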

So in this situation, it is generally required to set at least the following four parameters:

  • Set the replication.factor parameter for the topic: this value must be greater than 1, so that each partition has at least 2 replicas.
  • Set the min.insync.replicas parameter on the Kafka broker: this value must be greater than 1. It requires the leader to see at least one follower still in sync with it, ensuring there is a caught-up follower to take over if the leader goes down.
  • Set acks=all on the producer side: this requires each write to be acknowledged by all in-sync replicas before it is considered successful.
  • Set retries=MAX on the producer side (a very large value, meaning effectively unlimited retries): this makes the producer retry indefinitely once a write fails, blocking there until it succeeds.
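A minimal sketch of the four settings in properties style; note the replication factor is normally supplied when the topic is created (e.g. `--replication-factor 3` on the CLI rather than a key in a file), and 2147483647 (Integer.MAX_VALUE) stands in for "effectively unlimited" retries:

```properties
# Topic: each partition keeps multiple replicas (given at topic creation)
replication.factor=3

# Broker/topic: a write needs at least this many in-sync replicas,
# so the leader always has at least one caught-up follower
min.insync.replicas=2

# Producer: success only after all in-sync replicas have the message
acks=all

# Producer: retry (effectively) forever instead of dropping the message
retries=2147483647
```

With replication.factor=3 and min.insync.replicas=2, one broker can fail without either losing acknowledged data or blocking writes.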

Will the producer lose data?

If you configure acks=all as described above, data will not be lost: a write is considered successful only once the leader has received the message and all in-sync followers have replicated it. If this condition is not met, the producer automatically retries an unlimited number of times.


Origin blog.csdn.net/weixin_42272869/article/details/112147914