Detailed explanation of MQ message queue

The 3W principles: What, Why, When

1. What is MQ?

MQ stands for Message Queue. Message queuing is a method of application-to-application communication: applications communicate by writing messages (application-specific data) to queues and reading them back, without needing a dedicated connection linking them. Message passing means programs communicate by sending data in messages rather than by calling each other directly, the way techniques such as remote procedure calls do. Queuing means applications communicate through queues, which removes the requirement that the sending and receiving applications execute at the same time. --Baidu Encyclopedia

2. Why use MQ?

1. Decoupling

The more tightly coupled systems are, the more painful they are to maintain. To reduce the coupling you need some way to decouple them, and MQ serves exactly this purpose: with MQ in between, the dependencies among multiple systems change from strong coupling to weak coupling, with the coupling point shifted onto MQ.
Summary: through MQ's Pub/Sub publish-subscribe model, system A is completely decoupled from the other systems.

2. Asynchronous

As the name suggests, going asynchronous shortens the system's response time. The opposite of asynchronous is synchronous: in synchronous processing the request is handled serially, each step starting only after the previous one succeeds, and success is returned only after every intermediate step has completed. In asynchronous processing, the response is returned to the front end first, and the remaining work continues in the background until it succeeds. For instance, if a request fans out to three downstream writes of a couple hundred milliseconds each, handling them synchronously takes the better part of a second, while writing one message to MQ and returning takes only a few milliseconds.

3. Peak shaving

Suppose MySQL can generally handle about 2k requests per second. If the request rate reaches 5k per second, MySQL may simply be overwhelmed, the system crashes, and users can no longer use it.
With MQ, the 5k requests per second are written into MQ, and system A pulls from MQ at a rate it can handle: at most 2k requests per second, because that is MySQL's limit. As long as the pull rate stays within its capacity, system A will never go down, even at peak time. But since 5k requests per second flow in and only 2k flow out, a one-hour lunchtime peak can leave a backlog of roughly (5000 − 2000) × 3600 ≈ 10.8 million requests piled up in MQ, which system A drains gradually once the peak passes. A toy sketch of this pull model is shown below.
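
To make the mechanism concrete, here is a toy sketch in Java; all names such as pullFromMq and writeToMySql are hypothetical placeholders, not a real MQ client API. The consumer pulls from MQ at the rate the database can sustain, no matter how fast producers write.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeakShavingConsumer {
    private static final int DB_CAPACITY_PER_SECOND = 2000; // assumed MySQL limit

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Once per second, pull at most 2k requests; anything beyond that simply
        // stays queued in MQ as backlog and is drained after the peak passes.
        scheduler.scheduleAtFixedRate(() -> {
            for (int i = 0; i < DB_CAPACITY_PER_SECOND; i++) {
                String request = pullFromMq(); // hypothetical MQ client call
                if (request == null) break;    // queue is drained
                writeToMySql(request);         // hypothetical DB write
            }
        }, 0, 1, TimeUnit.SECONDS);
    }

    private static String pullFromMq() { return null; }  // placeholder
    private static void writeToMySql(String request) { } // placeholder
}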

3. When to use MQ?

If the coupling between systems is high and maintenance has become painful, or you need the asynchrony or peak shaving described above, it is worth introducing MQ.

What message queues are there?

1. ActiveMQ

2. RabbitMQ

3. RocketMQ

4. Kafka

Message queue comparison

ActiveMQ

Single-machine throughput: 10k level, an order of magnitude lower than RocketMQ and Kafka
Effect of topic count on throughput: —
Timeliness: ms level
Availability: High, based on a master-slave architecture
Message reliability: Low probability of losing data
Feature support: Extremely complete feature set in the MQ domain

RabbitMQ

Single-machine throughput: 10k level, an order of magnitude lower than RocketMQ and Kafka (same as ActiveMQ)
Effect of topic count on throughput: —
Timeliness: Microsecond level; this is a major feature of RabbitMQ, with the lowest latency
Availability: High, based on a master-slave architecture (same as ActiveMQ)
Message reliability: Basically never lost
Feature support: Developed in Erlang, with strong concurrency, excellent performance, and very low latency

RocketMQ

Single-machine throughput: 100k level, supporting very high throughput
Effect of topic count on throughput: Topics can reach the hundreds or thousands with only a slight drop in throughput; this is a major advantage of RocketMQ, which can support a large number of topics on the same hardware
Timeliness: ms level
Availability: Very high, distributed architecture
Message reliability: With tuned parameters and configuration, zero loss can be achieved
Feature support: Fairly complete MQ feature set; distributed, with good scalability

Kafka

Single-machine throughput: 100k level, high throughput; typically paired with big-data systems for real-time computation, log collection, and similar scenarios
Effect of topic count on throughput: When topics grow from dozens into the hundreds, throughput drops sharply, so on the same hardware Kafka tries to keep the topic count small; supporting large numbers of topics requires adding more machines
Timeliness: Latency within the ms level
Availability: Very high; distributed, with multiple replicas of each piece of data, so a few machines going down causes neither data loss nor unavailability
Message reliability: Same as RocketMQ (zero loss achievable after tuning)
Feature support: Relatively simple functionality, mainly basic MQ features; used at massive scale for real-time computation and log collection in the big-data field

In summary, after comparing the options, our suggestions are as follows:

  1. General business systems can benefit from introducing MQ. Everyone used ActiveMQ at first, but few do now: it has not been proven in large-throughput scenarios, and its community is not very active.

  2. RabbitMQ came next, but being written in Erlang keeps the average Java engineer from studying and controlling its internals in depth, making it nearly a black box. On the other hand, it is open source, offers fairly stable support, and has a highly active community.

  3. More and more companies now use RocketMQ, which is indeed very good; it comes from Alibaba, though the community carries some risk of maintenance stopping (RocketMQ has been donated to Apache, but its activity on GitHub is in fact not high).

  4. For small and medium companies with average technical strength and no especially hard technical challenges, RabbitMQ is a good choice; for large companies with strong infrastructure R&D capability, RocketMQ is a good choice.

  5. For real-time computation, log collection, and similar big-data scenarios, Kafka is the industry standard and absolutely safe; its community is very active, and it is almost the de facto global standard in this field.

What are the advantages and disadvantages of message queues?

The advantages are precisely the decoupling, asynchrony, and peak shaving described above. The disadvantages are as follows:

1. Reduced system availability

The more external dependencies a system has, the more ways it can fail. Introducing an MQ component adds uncertainty to the original system: the MQ component itself can suddenly fail, affecting every system that depends on it.

2. Increased system complexity

Introducing MQ raises the system's complexity by an order of magnitude, and a number of problems must be handled at the architecture level.
Common problems are as follows:

1. The repeated consumption problem

First of all, RabbitMQ, RocketMQ, and Kafka can all deliver a message more than once, and this is normal: the guarantee is usually not provided by the MQ itself but left to our own code. Let's take Kafka as an example of how repeated consumption happens.

Kafka has the concept of an offset: each message written is assigned an offset representing its sequence number. After consuming data, the consumer periodically commits the offsets of the messages it has consumed, which says in effect: "I have already consumed up to here; if I restart, let me continue from the offset I committed last time."
But things do not always go smoothly. What we often ran into in production: when restarting a system in a hurry, people just kill the process and start it again. The consumer may then have processed some messages without having had time to commit their offsets, and after the restart a small number of messages are consumed again.
Here's an example.
Suppose data items 1, 2, and 3 enter Kafka in order, and Kafka assigns them offsets 152, 153, and 154. The consumer consumes them in the same order. Say the consumer has just consumed the item at offset 153 and is about to commit the offset to Zookeeper when its process is restarted. The offsets of items 1 and 2 were never committed, so Kafka does not know the item at offset 153 has been consumed. After restarting, the consumer asks Kafka to hand over data from where it last left off; since the earlier offset commit never succeeded, items 1 and 2 are delivered again, and if the consumer does not deduplicate, they are consumed twice.

Note: newer Kafka versions have moved offset storage from Zookeeper to the Kafka brokers themselves, using the internal topic __consumer_offsets.

How to ensure the idempotence of message queue consumption?

For example, when writing to a database, first look the record up by its primary key; if it already exists, update it instead of inserting it again.
If you are writing to Redis, there is no problem at all: every write is a SET, which is naturally idempotent.
If neither of the above fits, it gets slightly more complicated: have the producer attach a globally unique ID, such as an order ID, to each piece of data. When consuming, first check in Redis whether that ID has already been consumed. If not, process the message and then write the ID to Redis; if it has, skip it, so the same message is never processed twice. (A sketch of this approach follows below.)
Alternatively, rely on a database unique key to prevent duplicate rows from being inserted: with the unique constraint in place, a duplicate insert merely raises an error and never leaves dirty data in the database.
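
A minimal sketch of the Redis-ID approach using the Jedis client; the key prefix, TTL, and method names are assumptions for illustration, not from the original article:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class IdempotentConsumer {
    private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address

    // messageId is the globally unique ID the producer attached to the message.
    public void handle(String messageId, String payload) {
        // SET ... NX EX: succeeds only if the key does not already exist.
        String ok = jedis.set("mq:consumed:" + messageId, "1",
                SetParams.setParams().nx().ex(24 * 3600));
        if (ok == null) {
            return; // key already present: this message was processed before, skip it
        }
        process(payload); // business logic
    }

    private void process(String payload) { /* ... */ }
}

Note that this sketch marks the message as consumed before processing it; if processing can fail, record the ID in the same transaction as the business write, or use the database unique-key approach instead.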

2. The message loss problem

A basic principle of using MQ is that there must be no extra data and no missing data. "No extra" is the repeated-consumption and idempotence issue discussed above; "no missing" means the data must not be lost, which is the issue we consider now.

Data can be lost at the producer, inside the MQ, or at the consumer. Let's analyze RabbitMQ and Kafka separately.

2.1. RabbitMQ
2.1.1. The producer loses data

When the producer sends data to RabbitMQ, the data may be lost in transit because of network problems or the like.
One option is RabbitMQ's transaction support: open a transaction with channel.txSelect() before sending, then send the message. If RabbitMQ fails to receive it, the producer gets an exception, can roll back the transaction with channel.txRollback(), and retry sending; if the message was received, commit with channel.txCommit().

Connection connection = null;
Channel channel = null;
try {
    // Create the connection from the factory
    connection = factory.newConnection();
    // Obtain a channel
    channel = connection.createChannel();
    // Open a transaction
    channel.txSelect();

    // Publish the message
    channel.basicPublish(exchange, routingKey, MessageProperties.PERSISTENT_TEXT_PLAIN, msg.getBytes());

    // Simulate an exception
    int result = 1 / 0;

    // Commit the transaction
    channel.txCommit();
} catch (IOException | TimeoutException e) {
    // Caught an exception: roll back the transaction
    channel.txRollback();
}

But the problem is that the RabbitMQ transaction mechanism is synchronous, so throughput drops sharply: it simply costs too much performance.

So in general, to ensure that messages written to RabbitMQ are not lost, you enable confirm mode instead. Once confirm mode is set on the producer, every message you write is assigned a unique ID. If the write reaches RabbitMQ, RabbitMQ sends back an ack to tell you the message arrived; if RabbitMQ fails to handle the message, it invokes your nack callback to tell you reception failed so you can retry. You can combine this mechanism with keeping each message ID's status in memory: if no callback arrives for a message within a certain time, you resend it.
The biggest difference between the transaction mechanism and the confirm mechanism is that the transaction mechanism is synchronous (after you commit a transaction, you block until it completes), while the confirm mechanism is asynchronous: after sending one message you can immediately send the next, and once RabbitMQ has received a message it asynchronously invokes one of your callbacks to notify you.

Therefore, the confirm mechanism is generally used to avoid data loss on the producer side.

Note that a channel already in transaction mode cannot be set to confirm mode, and vice versa: the two modes cannot coexist.

There are three ways for the client to implement producer confirm:

  1. Ordinary confirm mode: after each message is sent, call waitForConfirms() and wait for the server's confirmation; if the server returns false or does not respond within a timeout, the client can resend the message.
channel.confirmSelect(); // the channel must be put into confirm mode first
channel.basicPublish(ConfirmConfig.exchangeName, ConfirmConfig.routingKey, MessageProperties.PERSISTENT_TEXT_PLAIN, ConfirmConfig.msg_10B.getBytes());
if (!channel.waitForConfirms()) {
    // The message failed to send
    // ...
}
  2. Batch confirm mode: after a batch of messages is sent, call waitForConfirms() and wait for the server's confirmation.
channel.confirmSelect();
for (int i = 0; i < batchCount; ++i) {
    channel.basicPublish(ConfirmConfig.exchangeName, ConfirmConfig.routingKey, MessageProperties.PERSISTENT_TEXT_PLAIN, ConfirmConfig.msg_10B.getBytes());
}
if (!channel.waitForConfirms()) {
    // The batch failed to send
    // ...
}
  3. Asynchronous confirm mode: register a callback; the client invokes it after the server confirms one or more messages.
SortedSet<Long> confirmSet = Collections.synchronizedSortedSet(new TreeSet<Long>());
channel.confirmSelect();
channel.addConfirmListener(new ConfirmListener() {

    public void handleAck(long deliveryTag, boolean multiple) throws IOException {
        // The server confirmed one message (multiple=false), or all messages up to
        // deliveryTag (multiple=true); drop them from the outstanding set.
        if (multiple) {
            confirmSet.headSet(deliveryTag + 1).clear();
        } else {
            confirmSet.remove(deliveryTag);
        }
    }

    public void handleNack(long deliveryTag, boolean multiple) throws IOException {
        // The server rejected the message(s); a real client would re-publish them
        // here instead of just removing them from the set.
        System.out.println("Nack, SeqNo: " + deliveryTag + ", multiple: " + multiple);
        if (multiple) {
            confirmSet.headSet(deliveryTag + 1).clear();
        } else {
            confirmSet.remove(deliveryTag);
        }
    }
});

while (true) {
    // Record each message's sequence number before publishing it
    long nextSeqNo = channel.getNextPublishSeqNo();
    channel.basicPublish(ConfirmConfig.exchangeName, ConfirmConfig.routingKey, MessageProperties.PERSISTENT_TEXT_PLAIN, ConfirmConfig.msg_10B.getBytes());
    confirmSet.add(nextSeqNo);
}
2.1.2. RabbitMQ loses data

This is the case where RabbitMQ itself loses the data. To prevent it, you must enable RabbitMQ's persistence: messages are persisted to disk after being written, so even if RabbitMQ itself goes down, it reads the stored data back after recovering, and data is normally not lost. The rare exception is RabbitMQ dying before a message has been persisted, which can lose a small amount of data, but that probability is small.

There are two steps to setting up persistence:

  1. When creating the queue, declare it as durable. This makes RabbitMQ persist the queue's metadata, but not the data inside the queue.

  2. When sending a message, set its deliveryMode to 2, marking the message itself as persistent; RabbitMQ will then persist the message to disk.

Both settings must be enabled together: only then will RabbitMQ, after going down and restarting, restore the queue from disk along with the data in it. A minimal sketch of both settings follows.
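
Assuming the standard RabbitMQ Java client (5.x); the queue name and message body here are just examples:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class PersistentProducer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker address
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Step 1: declare the queue as durable, so its metadata survives a restart
            boolean durable = true;
            channel.queueDeclare("task_queue", durable, false, false, null);
            // Step 2: mark the message itself as persistent (deliveryMode = 2)
            channel.basicPublish("", "task_queue",
                    MessageProperties.PERSISTENT_TEXT_PLAIN, "hello".getBytes());
        }
    }
}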

Note that even with persistence enabled there is a window in which a message has been written to RabbitMQ but not yet flushed to disk; if RabbitMQ dies at that moment, the small amount of data still in memory is lost.

Persistence can therefore be combined with the producer-side confirm mechanism: the producer is sent the ack only after the message has been persisted to disk. Then even if RabbitMQ dies before persisting the message and the data is lost, the producer never receives the ack and can simply resend.

2.1.3. The consumer loses data

When RabbitMQ loses data on the consumer side, it is mainly because the process died right after consuming a message but before processing it, for example during a restart. RabbitMQ believes you have consumed the message, and the data is lost.

The answer is the ack mechanism RabbitMQ provides. In short, turn off RabbitMQ's automatic ack (this can be done through the API) and ack explicitly from your own code only after processing completes. Then, if you die before finishing, no ack is sent; RabbitMQ considers the message unprocessed and redelivers it to another consumer, so the message is not lost.

To ensure messages from the queue reliably reach consumers, RabbitMQ provides this message-confirmation mechanism: the consumer specifies the noAck parameter (autoAck in newer clients) when subscribing to the queue. With noAck=false, RabbitMQ waits for the consumer to explicitly send back an ack before removing the message from memory (and from disk, for persistent messages); otherwise, RabbitMQ deletes the message from the queue as soon as it has been delivered. A minimal sketch follows.
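
A minimal sketch of manual acking with the RabbitMQ Java client (5.x); the queue name and the process method are assumptions:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ManualAckConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker address
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        boolean autoAck = false; // turn automatic ack off
        channel.basicConsume("task_queue", autoAck, (consumerTag, delivery) -> {
            long deliveryTag = delivery.getEnvelope().getDeliveryTag();
            try {
                process(new String(delivery.getBody())); // business logic
                // Ack only after processing succeeded; multiple=false acks this delivery only
                channel.basicAck(deliveryTag, false);
            } catch (Exception e) {
                // No ack: reject and requeue so the message can be redelivered
                channel.basicNack(deliveryTag, false, true);
            }
        }, consumerTag -> { });
    }

    private static void process(String body) { /* ... */ }
}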

2.2. Kafka

2.2.1. The consumer loses data

The only situation in which the consumer loses data is this: you pull a message, the consumer automatically commits the offset so Kafka believes you have consumed it, but in reality you were only about to start processing it and died first. That message is then lost.

Isn't this similar to RabbitMQ? Kafka commits offsets automatically by default, so simply turn off automatic offset commits and commit manually after processing is done, and the data will not be lost. You may indeed still get repeated consumption: for example, if you die after processing but before committing the offset, the message will certainly be consumed again, so ensure idempotence as discussed earlier. A minimal sketch follows.
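
A minimal sketch with the Kafka Java client; the topic, group ID, and broker address are assumptions. Auto commit is disabled and the offset is committed only after the records have been processed.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // the crucial setting: no auto commit
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // business logic first...
                }
                // ...then commit. Dying before this line means redelivery, never loss,
                // which is why the processing must be idempotent.
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) { /* ... */ }
}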

One problem we hit in production: our Kafka consumer wrote consumed data into an in-memory queue to buffer it before processing. Occasionally a message was written to the memory queue and its offset auto-committed; if the system restarted at that point, the not-yet-processed data in the memory queue was lost.

2.2.2. Kafka loses data

A fairly common scenario here: a Kafka broker goes down and a partition leader is re-elected. If some followers had not yet finished replicating all the data and one of them is elected the new leader, the unreplicated data is simply missing: some data is lost.

We have also run into this in production: a Kafka leader machine went down, a follower was switched over to leader, and we found that data had been lost.

Therefore, it is generally required to set at least the following four parameters (a configuration sketch follows the list):

  1. Set replication.factor on the topic: this value must be greater than 1, so that every partition has at least 2 replicas.
  2. Set min.insync.replicas on the Kafka server: this value must be greater than 1, requiring a leader to perceive at least one follower still in contact with it and keeping up; this guarantees a follower remains if the leader dies.
  3. Set acks=all on the producer: this requires every piece of data to be written to all in-sync replicas before the write is considered successful.
  4. Set retries=MAX on the producer (a very, very large value, meaning effectively unlimited retries): once a write fails, it is retried indefinitely until it succeeds.
    Our production environment is configured according to these requirements. With this configuration, at least on the Kafka broker side, data will not be lost when the broker hosting a partition leader fails and leadership switches over.
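
A sketch of the producer side of this configuration with the Kafka Java client; the broker address and topic name are assumptions, and replication.factor / min.insync.replicas are set when the topic is created (for example with kafka-topics.sh, shown in the comment):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class NoLossProducerConfig {
    public static void main(String[] args) {
        // Topic side (run once when creating the topic), e.g.:
        //   kafka-topics.sh --create --bootstrap-server localhost:9092 \
        //     --topic demo-topic --partitions 3 --replication-factor 3 \
        //     --config min.insync.replicas=2
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry "forever" on failure
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... producer.send(...) as usual
    }
}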
2.2.3. Will the producer lose data?

If you set acks=all following the idea above, the producer will definitely not lose data: a write is considered successful only once the leader has received the message and all in-sync followers have replicated it. If that condition is not met, the producer automatically retries, indefinitely.

3. Is there a message ordering problem?

3. Consistency issues

System A may return success after processing a piece of data while, among the downstream systems B, C, and D, only B and C succeed and D fails, leaving the data inconsistent in the end.

So a message queue is in fact a very complex piece of architecture. Introducing one brings many benefits, but it also demands all sorts of additional technical measures and designs to offset the drawbacks it brings. Once everything is in place, you will find the system's complexity has grown by an order of magnitude, perhaps ten times more complex than before.

Source: blog.csdn.net/weixin_38717886/article/details/125710328