RocketMQ source code analysis: how to troubleshoot message loss?


How to troubleshoot message loss?

When we use MQ, we often run into abnormal message consumption. There are many possible causes, such as:

  1. The producer failed to send the message
  2. The consumer threw an exception while consuming
  3. The consumer never received the message at all

So how do we investigate?


In fact, RocketMQ-Dashboard lets you troubleshoot all of this efficiently; it has more features than you might expect.

Message not found?

This means the producer failed to send the message, or the message may simply have expired: RocketMQ keeps messages for 72 hours by default. The producer-side logs can confirm which case it is.

Message found!

Then check the consumption status of the message. In our case, the status was NOT_ONLINE.

What does NOT_ONLINE mean?

Don't worry, let's analyze it step by step. First, let's see which states TrackType defines.

public enum TrackType {
    // The message has been consumed
    CONSUMED,
    // The message was delivered but filtered out by the subscription
    CONSUMED_BUT_FILTERED,
    // The message is consumed in pull mode
    PULL,
    // The message has not been consumed yet
    NOT_CONSUME_YET,
    // The consumer is offline
    NOT_ONLINE,
    // Unknown error
    UNKNOWN
}

Each type is explained below

Type                     Explanation
CONSUMED                 The message has been consumed
CONSUMED_BUT_FILTERED    The message was delivered but filtered out
PULL                     The message is consumed in pull mode
NOT_CONSUME_YET          The message has not been consumed yet
NOT_ONLINE               The consumer is offline
UNKNOWN                  Unknown error

How do we determine that a message has been consumed?

As mentioned in the previous section, the broker maintains a map that records the consumption progress (offset) of each queue. If the queue's consumer offset is greater than the offset of the queried message, the message is considered consumed (CONSUMED); otherwise it is not (NOT_CONSUME_YET).
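Below is a minimal sketch of that comparison. It is not RocketMQ's actual source; the class and method names are illustrative.

// Illustrative sketch of the CONSUMED vs NOT_CONSUME_YET decision.
// Not RocketMQ's actual source; names are made up for clarity.
public class ConsumeStatusChecker {

    /**
     * @param consumerOffset the queue's committed consumer offset on the broker
     * @param messageOffset  the queue offset of the message being queried
     */
    public static String trackType(long consumerOffset, long messageOffset) {
        // The committed offset having moved past the message's offset
        // means the message has already been handed to the consumer.
        return consumerOffset > messageOffset ? "CONSUMED" : "NOT_CONSUME_YET";
    }

    public static void main(String[] args) {
        System.out.println(trackType(100L, 42L));  // CONSUMED
        System.out.println(trackType(100L, 150L)); // NOT_CONSUME_YET
    }
}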

So can we immediately tell where the problem is just from the status?

CONSUMED_BUT_FILTERED indicates that the message was delivered but filtered out. For example, the producer sends topicA with tagA, but the consumer subscribes to topicA with tagB.

On RocketMQ-Dashboard we can see each queue's broker offset and its consumer offset; the difference between them is the number of unconsumed messages.
When all messages have been consumed, the difference is 0.

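If you prefer to check these offsets programmatically rather than in the Dashboard, the tools module exposes them through DefaultMQAdminExt. A minimal sketch, assuming a RocketMQ 4.x client, a name server at 127.0.0.1:9876, and an illustrative group name demo_consumer_group:

import java.util.Map;

import org.apache.rocketmq.common.admin.ConsumeStats;
import org.apache.rocketmq.common.admin.OffsetWrapper;
import org.apache.rocketmq.common.message.MessageQueue;
import org.apache.rocketmq.tools.admin.DefaultMQAdminExt;

public class BacklogChecker {
    public static void main(String[] args) throws Exception {
        DefaultMQAdminExt admin = new DefaultMQAdminExt();
        admin.setNamesrvAddr("127.0.0.1:9876"); // assumed address
        admin.start();
        try {
            // The same numbers the Dashboard shows: broker offset vs consumer offset.
            ConsumeStats stats = admin.examineConsumeStats("demo_consumer_group");
            for (Map.Entry<MessageQueue, OffsetWrapper> e : stats.getOffsetTable().entrySet()) {
                OffsetWrapper w = e.getValue();
                long backlog = w.getBrokerOffset() - w.getConsumerOffset();
                System.out.printf("%s queueId=%d backlog=%d%n",
                        e.getKey().getTopic(), e.getKey().getQueueId(), backlog);
            }
        } finally {
            admin.shutdown();
        }
    }
}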
How does CONSUMED_BUT_FILTERED (message delivered but filtered) happen?

This brings up an important concept in RocketMQ: subscription consistency. All consumers in a consumerGroup must subscribe to the same topics and tags; otherwise messages will be lost.

Consider the following scenario: 4 messages are sent; consumer1 subscribes to topica with taga, and consumer2 subscribes to topica with tagb. consumer1 consumes queue q0 and consumer2 consumes queue q1.

Of msg-1 and msg-3, both delivered to q0, only msg-1 can be consumed normally, while msg-3 becomes CONSUMED_BUT_FILTERED: msg-3 was delivered to q0, but consumer1 does not consume tagb messages, so the message is filtered out and lost.

Similarly, msg-2 will also be lost. A code sketch of this inconsistent setup follows.
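The sketch below reproduces the scenario with the standard push-consumer API. The group, topic, and tag names are the illustrative ones from the example above; the name server address is assumed.

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;

public class InconsistentSubscriptionDemo {
    public static void main(String[] args) throws Exception {
        // consumer1 and consumer2 share a group but subscribe to different tags,
        // which violates subscription consistency and can drop messages.
        DefaultMQPushConsumer consumer1 = new DefaultMQPushConsumer("demo_group");
        consumer1.setNamesrvAddr("127.0.0.1:9876"); // assumed address
        consumer1.subscribe("topica", "taga");      // only taga
        consumer1.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            msgs.forEach(m -> System.out.println("consumer1 got " + m.getMsgId()));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer1.start();

        DefaultMQPushConsumer consumer2 = new DefaultMQPushConsumer("demo_group");
        consumer2.setNamesrvAddr("127.0.0.1:9876");
        consumer2.subscribe("topica", "tagb");      // only tagb: inconsistent!
        consumer2.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            msgs.forEach(m -> System.out.println("consumer2 got " + m.getMsgId()));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer2.start();
    }
}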

Note another very important point:

Even when message consumption fails, the message's offset is still committed normally. In other words, a message whose consumption failed will still show the status CONSUMED.

So where do messages that failed consumption go?

When message consumption fails, the message is placed into a retry queue, whose topic name is %RETRY% + consumerGroup.
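Here is a minimal sketch of how a push consumer triggers that retry path: returning RECONSUME_LATER tells the framework to re-deliver the message via the retry topic. The group name, topic, and name server address are illustrative.

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;

public class RetryDemo {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_group");
        consumer.setNamesrvAddr("127.0.0.1:9876"); // assumed address
        consumer.subscribe("topica", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            try {
                process(msgs); // business logic that may throw
                return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
            } catch (Exception e) {
                // The framework re-delivers the message via %RETRY%demo_group
                return ConsumeConcurrentlyStatus.RECONSUME_LATER;
            }
        });
        consumer.start();
    }

    private static void process(Object msgs) { /* illustrative placeholder */ }
}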

But the consumer never subscribes to this topic, so how can it consume the retry messages?
In fact, when the consumer starts, the framework subscribes to this retry topic on your behalf, so the retry messages can be consumed.
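You can see the naming convention in the client's MixAll helper class (shown here for a 4.x client); it derives both the retry topic and the dead-letter topic mentioned below from the consumer group name:

import org.apache.rocketmq.common.MixAll;

public class RetryTopicNames {
    public static void main(String[] args) {
        // Derived from the consumer group name by the framework itself.
        System.out.println(MixAll.getRetryTopic("demo_group")); // %RETRY%demo_group
        System.out.println(MixAll.getDLQTopic("demo_group"));   // %DLQ%demo_group
    }
}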

In addition, a failed message is not retried continuously; each retry happens after an increasing interval.

Retry count    Interval since last retry
1              10 seconds
2              30 seconds
3              1 minute
4              2 minutes
5              3 minutes
6              4 minutes
7              5 minutes
8              6 minutes
9              7 minutes
10             8 minutes
11             9 minutes
12             10 minutes
13             20 minutes
14             30 minutes
15             1 hour
16             2 hours

When a message exceeds the maximum of 16 consumption attempts, it is delivered to the dead-letter queue, whose topic name is %DLQ% + consumerGroup.

Therefore, if you find that a message's status is CONSUMED but consumption actually failed, look for it in the retry queue and the dead-letter queue.

Message consumption exception troubleshooting

The background: we have two systems whose data consistency is maintained through MQ. One day the data became inconsistent, so either the consumer had a problem consuming messages or the producer had a problem sending them.

First we located the message by time range and confirmed that sending was fine. Then we saw that the message's status was NOT_CONSUME_YET, meaning the consumer was online but had not consumed the message.

NOT_CONSUME_YET means the message has not been consumed yet. But the message had been sent long before, so the consumer should have consumed it by then; the consumer's logs confirmed it never did.

Comparing the broker offsets and consumer offsets in RocketMQ-Dashboard showed that queue 0 was consumed normally while the other queues were not consumed at all.
This load-balancing behavior looked suspicious. Why were there so many messages in queue 0 and none in the other queues? I asked the middleware team: had they changed the load-balancing strategy again?

They had! In the test environment, the queue dimension is used to separate multiple environments: queue 0 is the baseline environment. Our team had not adopted the multi-environment setup yet, so all messages sent and received go through queue 0 and the other queues are unused (you can simply assume that in the test environment, sending and consuming only touch queue 0).

But that raises a contradiction!

The message's status was NOT_CONSUME_YET, and queue 0 was consumed normally, which suggested the message had been delivered to a queue other than queue 0; yet the middleware team said messages would only ever be delivered to queue 0.

To verify this, we first had to prove that the unconsumed messages really were delivered to queues other than queue 0.

I'll skip the detours along the way. Eventually, reading the RocketMQ-Dashboard source code, I found that the Dashboard actually returns far more information about a message than it displays on the page. Looking at the raw interface response opened up a whole new world: all of the message's properties are there. The queueId was 14, which confirmed my suspicion.

We could also see that bornHost was an address on our office network segment.
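You can pull the same raw properties through the admin API instead of the Dashboard's HTTP response. A sketch assuming a 4.x client, with the topic, message ID, and name server address as illustrative placeholders:

import org.apache.rocketmq.common.message.MessageExt;
import org.apache.rocketmq.tools.admin.DefaultMQAdminExt;

public class MessageInspector {
    public static void main(String[] args) throws Exception {
        DefaultMQAdminExt admin = new DefaultMQAdminExt();
        admin.setNamesrvAddr("127.0.0.1:9876"); // assumed address
        admin.start();
        try {
            // Topic and message id are placeholders; copy the real id from the Dashboard.
            MessageExt msg = admin.viewMessage("topica", "MSG_ID_FROM_DASHBOARD");
            System.out.println("queueId  = " + msg.getQueueId());
            System.out.println("bornHost = " + msg.getBornHostString());
        } finally {
            admin.shutdown();
        }
    }
}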

Could the load-balancing strategy of a locally started service differ from that of the test environment?

A round of local debugging confirmed it: a locally started producer sends messages to all queues, and a locally started consumer consumes from all queues.

At this point the problem was found!

Someone had started the producer service locally and registered it with the test environment's zk. Some test-environment requests were routed to that local instance, which sent messages to queues other than queue 0; the test-environment consumer, however, only consumes queue 0, so those messages sat unconsumed for a long time.

