How to troubleshoot message loss?
When we use MQ, we often run into messages that are not consumed as expected. There are many possible causes, for example:
- the producer failed to send the message
- the consumer threw an exception while consuming
- the consumer never received the message at all
So how do we investigate?
In fact, RocketMQ-Dashboard makes this kind of investigation very efficient, and it has more features than you might expect.
Message not found?
This usually means the producer failed to send, but the message may also simply have expired: RocketMQ keeps messages for 72 hours by default. At this point the producer-side logs can confirm which it is.
Message found!
Then look at the message's consumption status. As shown in the figure below, the status here is NOT_ONLINE.
What does NOT_ONLINE mean?
Don't worry, let's analyze it step by step. First, let's see what states TrackType can take:
```java
public enum TrackType {
    CONSUMED,
    CONSUMED_BUT_FILTERED,
    PULL,
    NOT_CONSUME_YET,
    NOT_ONLINE,
    UNKNOWN
}
```
Each type is explained below:
Type | Description |
---|---|
CONSUMED | the message has been consumed |
CONSUMED_BUT_FILTERED | the message was delivered but filtered out |
PULL | the message is consumed in pull mode |
NOT_CONSUME_YET | the message has not been consumed yet |
NOT_ONLINE | the consumer is offline |
UNKNOWN | unknown error |
How do we determine that a message has been consumed?
As mentioned in the previous section, the broker keeps a map recording the consumption progress (offset) of each queue. If the queue's committed offset is greater than the offset of the queried message, the message has been consumed; otherwise it has not (NOT_CONSUME_YET).
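As a rough sketch, the check boils down to a single offset comparison. The method name `queryTrackType` and the simplification are mine, not RocketMQ's actual broker code:

```java
// Simplified sketch of the broker-side status check described above.
// `queryTrackType` is a hypothetical helper, not RocketMQ's real API.
public class TrackTypeSketch {
    /** The committed consumer offset points at the NEXT message to consume,
     *  so any message strictly below it has already been consumed. */
    public static String queryTrackType(long committedOffset, long messageOffset) {
        return committedOffset > messageOffset ? "CONSUMED" : "NOT_CONSUME_YET";
    }

    public static void main(String[] args) {
        System.out.println(queryTrackType(100, 42));  // message 42: CONSUMED
        System.out.println(queryTrackType(100, 100)); // message 100: NOT_CONSUME_YET
    }
}
```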
CONSUMED_BUT_FILTERED means the message was delivered but filtered out. For example, the producer sends to topicA with tagA, but the consumer subscribes to topicA with tagB.
On RocketMQ-Dashboard we can see each queue's broker offset (how far the broker has written) and consumer offset (how far consumption has progressed). The difference between the two is the number of unconsumed messages. When all messages have been consumed, the difference is 0, as shown in the figure below.
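The arithmetic can be sketched as follows; the offsets here are made-up numbers, and `totalBacklog` is an illustrative helper, not a Dashboard API:

```java
import java.util.Map;

// Sketch: per-queue backlog = broker offset - consumer offset;
// the total backlog is the sum across all queues of the topic.
public class BacklogSketch {
    /** offsets: queueId -> {brokerOffset, consumerOffset} */
    public static long totalBacklog(Map<Integer, long[]> offsets) {
        return offsets.values().stream()
                .mapToLong(o -> o[0] - o[1])
                .sum();
    }

    public static void main(String[] args) {
        Map<Integer, long[]> offsets = Map.of(
                0, new long[]{5000, 5000},  // fully caught up
                1, new long[]{5000, 4990}); // 10 messages waiting
        System.out.println(totalBacklog(offsets)); // 10
    }
}
```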
How does CONSUMED_BUT_FILTERED (message delivered but filtered) happen?
This brings up an important concept in RocketMQ: subscription consistency. All consumers in a consumer group must subscribe to exactly the same topics and tags; otherwise messages will be lost.
Consider the following scenario: 4 messages are sent, consumer1 subscribes to topicA-tagA, and consumer2 subscribes to topicA-tagB. consumer1 consumes queue q0 and consumer2 consumes queue q1.
Of msg-1 and msg-3 delivered to q0, only msg-1 is consumed normally; msg-3 ends up CONSUMED_BUT_FILTERED, because it is delivered to q0 but consumer1 does not consume tagB messages, so the message is filtered out and lost.
Similarly, msg-2 is also lost.
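The scenario above can be simulated in a few lines. Queue ownership and tag filtering are modeled directly here; this is a toy illustration, not RocketMQ's filtering code:

```java
import java.util.Map;

// Toy simulation of an inconsistent subscription in one consumer group:
// consumer1 owns q0 and subscribes tagA; consumer2 owns q1 and subscribes tagB.
public class TagFilterDemo {
    // Which tag the owner of each queue subscribes to.
    static final Map<Integer, String> SUBSCRIBED_TAG = Map.of(
            0, "tagA",   // consumer1
            1, "tagB");  // consumer2

    /** CONSUMED if the queue owner's subscription matches the message tag,
     *  otherwise CONSUMED_BUT_FILTERED (the message is lost). */
    public static String deliver(int queueId, String tag) {
        return SUBSCRIBED_TAG.get(queueId).equals(tag)
                ? "CONSUMED" : "CONSUMED_BUT_FILTERED";
    }

    public static void main(String[] args) {
        System.out.println("msg-1 " + deliver(0, "tagA")); // CONSUMED
        System.out.println("msg-2 " + deliver(1, "tagA")); // CONSUMED_BUT_FILTERED
        System.out.println("msg-3 " + deliver(0, "tagB")); // CONSUMED_BUT_FILTERED
        System.out.println("msg-4 " + deliver(1, "tagB")); // CONSUMED
    }
}
```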
Note another very important point:
even when message consumption fails, the offset is still committed normally. In other words, a message whose consumption failed will still show the status CONSUMED.
So where do the failed messages go?
When consumption fails, the message is placed in a retry queue, whose topic name is %RETRY% + consumerGroup.
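The naming convention is literal string concatenation; the helper below is just an illustration of it:

```java
// Sketch of the retry-topic naming convention: the literal prefix
// "%RETRY%" plus the consumer group name.
public class RetryTopicName {
    public static String retryTopic(String consumerGroup) {
        return "%RETRY%" + consumerGroup;
    }

    public static void main(String[] args) {
        // "order-consumer-group" is a made-up group name.
        System.out.println(retryTopic("order-consumer-group"));
        // %RETRY%order-consumer-group
    }
}
```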
The consumer never subscribed to this topic, so how can it consume retry messages?
In fact, when the consumer starts, the framework subscribes to this topic on your behalf, which is why retry messages can be consumed.
In addition, a failed message is not retried continuously, but at increasing intervals:
Retry attempt | Interval since last retry | Retry attempt | Interval since last retry |
---|---|---|---|
1 | 10 seconds | 9 | 7 minutes |
2 | 30 seconds | 10 | 8 minutes |
3 | 1 minute | 11 | 9 minutes |
4 | 2 minutes | 12 | 10 minutes |
5 | 3 minutes | 13 | 20 minutes |
6 | 4 minutes | 14 | 30 minutes |
7 | 5 minutes | 15 | 1 hour |
8 | 6 minutes | 16 | 2 hours |
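The schedule above is just a fixed lookup table, which can be written down directly (the class and method names are illustrative):

```java
// The retry schedule from the table above as a 1-indexed lookup table.
public class RetryDelay {
    static final String[] DELAYS = {
        "10s", "30s", "1m", "2m", "3m", "4m", "5m", "6m", "7m", "8m",
        "9m", "10m", "20m", "30m", "1h", "2h"
    };

    /** Delay before the given retry attempt (1..16). */
    public static String delayForRetry(int attempt) {
        return DELAYS[attempt - 1];
    }

    public static void main(String[] args) {
        System.out.println(delayForRetry(1));  // 10s
        System.out.println(delayForRetry(16)); // 2h
    }
}
```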
When a message exceeds the maximum of 16 consumption attempts, it is delivered to the dead letter queue, whose topic name is %DLQ% + consumerGroup.
So when you find a message whose status is CONSUMED but whose consumption actually failed, look for it in the retry queue and the dead letter queue.
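Putting the last few paragraphs together, the routing of a failed message can be sketched as a single decision; this is a simplified model of the behavior described above, not broker source code:

```java
// Sketch: a failed message is re-delivered via the retry topic until the
// default maximum of 16 reconsume attempts is exhausted, then it is routed
// to the dead letter topic. Naming follows the conventions described above.
public class FailedMessageRouting {
    static final int MAX_RECONSUME_TIMES = 16;

    public static String nextTopic(String consumerGroup, int reconsumeTimes) {
        return reconsumeTimes < MAX_RECONSUME_TIMES
                ? "%RETRY%" + consumerGroup
                : "%DLQ%" + consumerGroup;
    }

    public static void main(String[] args) {
        // "order-consumer-group" is a made-up group name.
        System.out.println(nextTopic("order-consumer-group", 3));  // still retrying
        System.out.println(nextTopic("order-consumer-group", 16)); // dead letter
    }
}
```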
Troubleshooting a message consumption anomaly
The background of this problem: we have two systems whose data consistency is guaranteed by MQ sitting between them. One day the data became inconsistent, so either the consumer failed to consume a message, or the producer failed to send one.
First we located the message by time range and confirmed that sending was fine. The message's status was NOT_CONSUME_YET, meaning the consumer was online but had not consumed it.
NOT_CONSUME_YET means the message has not been consumed yet, but the message had been sent long before, so the consumer should have processed it by then. The consumer's logs confirmed it had indeed never consumed the message.
Comparing the broker offsets with the consumer offsets on RocketMQ-Dashboard, queue 0 was being consumed normally while the other queues were not being consumed at all.
The load balancing strategy looked suspicious: why were there so many messages in queue 0 and none in the other queues? So I asked the middleware team: had they changed the load balancing strategy again?
They had! In the test environment, queues are used as the dimension to separate multiple environments, and queue 0 is the benchmark environment. Our team was not yet using multiple environments, so our messages are sent and received only on queue 0, and the other queues go unused (you can simply assume that in the test environment, both sending and consuming use only queue 0).
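The two selection behaviors at play here can be modeled in a toy sketch. Both selectors are illustrative stand-ins for the real routing, which I don't have the source of:

```java
// Toy model of the two queue-selection behaviors in this story: the test
// environment pins benchmark traffic to queue 0, while a plain local client
// spreads sends across all queues round-robin.
public class QueueSelection {
    /** Multi-environment routing in the test env: benchmark traffic -> queue 0. */
    public static int benchmarkEnvQueue(int queueCount) {
        return 0;
    }

    /** Default local behavior: the Nth send goes to queue N mod queueCount. */
    public static int localRoundRobinQueue(int sendCount, int queueCount) {
        return sendCount % queueCount;
    }

    public static void main(String[] args) {
        // With 16 queues, the test env always picks queue 0...
        System.out.println(benchmarkEnvQueue(16)); // 0
        // ...but a locally started producer soon lands outside queue 0,
        // where the test-environment consumer never looks.
        System.out.println(localRoundRobinQueue(14, 16)); // 14
    }
}
```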
And here lies the problem!
The message's status was NOT_CONSUME_YET, so my guess was that it had been delivered to a queue other than queue 0, even though the middleware team said messages would only ever be delivered to queue 0.
To verify this idea, we first needed to prove that the unconsumed messages had indeed been delivered to queues other than queue 0.
I'll skip the detours along the way. Eventually, reading the RocketMQ-Dashboard source code, I found that the Dashboard actually returns far more information about a message than it displays on the page, so I looked at the raw interface response directly.
Wow, a whole new world: all of the message's properties are there. The queueId was 14, which confirmed my suspicion.
Better yet, the bornHost was actually a network segment of our office.
Could the load balancing strategy of a locally started client differ from that of the test environment?
A round of local debugging confirmed it: a locally started producer sends messages to all queues, and a locally started consumer consumes from all queues.
With that, the problem was found!
Someone had started the producer service locally and registered it with the test environment's ZooKeeper. Some test-environment requests were routed to that local instance, which sent messages to queues other than queue 0; but the test environment's consumers only consume queue 0, so those messages sat unconsumed indefinitely.