Common RocketMQ message sending errors and solutions


Based on my own experience with RocketMQ, this article shares the common problems of message sending, following a problem, analysis, solution structure.

1、No route info of this topic


This error means the routing information could not be found; its complete error stack looks like the following:

Moreover, many readers report that the above problem also occurs even when automatic topic creation is enabled on the Broker side.

The routing finding process of RocketMQ is shown in the following figure:

The key points above are as follows:
  • If the Broker enables automatic topic creation, it creates a default topic named TBW102 at startup and reports it to the Nameserver in the heartbeat packets the Broker sends, so the Nameserver can then return routing information for it when queried.

  • When sending a message, the producer first checks its local cache; if the route exists there, it is returned directly.

  • If the route is not in the cache, the producer queries the Nameserver; if the Nameserver has the routing information, it is returned directly.

  • If the Nameserver has no routing information for this topic and automatic topic creation is not enabled, No route info of this topic is thrown.

  • If automatic topic creation is enabled, the producer queries the Nameserver for the default topic's routing information and uses it as its own, so No route info of this topic is not thrown.
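The lookup order above can be sketched in a few lines of Java. This is an illustrative simulation (the maps, names, and route strings are invented for the example), not RocketMQ source code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative simulation of the lookup order (local cache -> Nameserver ->
// default topic TBW102); the maps and route strings are invented for the
// example and this is NOT RocketMQ source code.
class RouteLookupSketch {
    static final String DEFAULT_TOPIC = "TBW102";
    final Map<String, String> localCache = new HashMap<>();
    final Map<String, String> nameserver = new HashMap<>();
    final boolean autoCreateTopicEnable;

    RouteLookupSketch(boolean autoCreateTopicEnable) {
        this.autoCreateTopicEnable = autoCreateTopicEnable;
        if (autoCreateTopicEnable) {
            // The Broker registers the default topic with the Nameserver at startup.
            nameserver.put(DEFAULT_TOPIC, "route-of-TBW102");
        }
    }

    String findRoute(String topic) {
        String route = localCache.get(topic);
        if (route != null) return route;               // 1. local cache hit
        route = nameserver.get(topic);                 // 2. ask the Nameserver
        if (route == null && autoCreateTopicEnable) {
            route = nameserver.get(DEFAULT_TOPIC);     // 3. fall back to TBW102's route
        }
        if (route == null) {
            throw new IllegalStateException("No route info of this topic: " + topic);
        }
        localCache.put(topic, route);                  // cache for subsequent sends
        return route;
    }
}
```

With auto-creation off and an unknown topic, findRoute fails exactly the way the producer does; with it on, the default topic's route is borrowed.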

Under normal circumstances, No route info of this topic is usually encountered when a RocketMQ cluster has just been set up; it frequently appears right after starting RocketMQ. The usual troubleshooting steps are as follows:

  • Check whether the routing information exists via rocketmq-console, or query it with the following command:

    cd ${ROCKETMQ_HOME}/bin
    sh ./mqadmin topicRoute -n 127.0.0.1:9876 -t dw_test_0003

    The output looks like this:

  • If the command cannot find the routing information, check whether the Broker has automatic topic creation enabled. The parameter is autoCreateTopicEnable and it defaults to true, but enabling it in production is not recommended.

  • If automatic topic creation is enabled but this error is still thrown, check whether the Nameserver address the client (Producer) connects to is consistent with the Nameserver address configured on the Broker.

After the above steps, the error can basically be resolved.

2、Message sending timeouts


When a message send times out, the client log usually looks like this:

When the client reports a send timeout, the first suspect is usually the RocketMQ server: is the Broker's performance jittering, unable to withstand the current volume?

So how do we check whether RocketMQ currently has a performance bottleneck?

First, run the following commands to look at the distribution of RocketMQ message write latencies:

cd /${USER.HOME}/logs/rocketmqlogs/
grep -n 'PAGECACHERT' store.log | more

The output looks like this:

Every minute, RocketMQ logs the latency distribution of the messages written during the previous minute, from which we can see whether message writing has an obvious performance bottleneck. The intervals are:
  • [<=0ms] less than or equal to 0ms, i.e., completed at the microsecond level.

  • [0~10ms] the number of writes taking less than 10ms.

  • [10~50ms] the number of writes taking between 10ms and 50ms.

The remaining intervals follow the same pattern. The vast majority of writes complete at the microsecond level. In my experience, if the intervals of 100-200ms and above hold more than 20 entries, the Broker really does have a bottleneck; if there are only a few, it is just memory or pagecache jitter and not a big problem.
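As a rough illustration, the interval labels above can be reproduced with a tiny classifier. The bucket boundaries beyond [10~50ms] are assumed here to continue the same pattern; treat them as illustrative rather than as RocketMQ's exact store statistics:

```java
// Hedged sketch of the PAGECACHERT-style latency buckets; boundaries past
// [10~50ms] are assumed for illustration, not copied from RocketMQ source.
class PutLatencyBuckets {
    static final String[] LABELS = {
        "[<=0ms]", "[0~10ms]", "[10~50ms]", "[50~100ms]",
        "[100~200ms]", "[200~500ms]", "[500ms~]"
    };
    static final long[] UPPER = {0, 10, 50, 100, 200, 500};

    // Return the label of the bucket a write latency (ms) falls into.
    static String classify(long costMs) {
        for (int i = 0; i < UPPER.length; i++) {
            if (costMs <= UPPER[i]) return LABELS[i];
        }
        return LABELS[LABELS.length - 1];
    }
}
```

Counting how many writes land in each label over a minute gives exactly the kind of distribution the store.log line reports.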

Usually a timeout has little to do with the Broker's processing capacity. There is a further piece of corroboration: the RocketMQ broker also has a fast-fail mechanism. When the Broker receives a client request, the message is first put into a queue and then processed sequentially; if a message waits in the queue for more than 200ms, fast failure kicks in and [TIMEOUT_CLEAN_QUEUE]broker busy is returned to the client. Part 3 of this article covers this in detail.

When the RocketMQ client hits a network timeout, also consider the application's own garbage collection: GC pause times can cause send timeouts. I have run into this while stress-testing in a test environment, though not yet in production, so keep an eye on it.

Network timeouts in RocketMQ are usually related to network jitter. Since networking is not my strong suit, I cannot provide direct evidence for now, but there is indirect evidence: in one application connected to both a Kafka cluster and a RocketMQ cluster, at the moment a timeout occurred, connections to every Broker in the RocketMQ cluster and to the Kafka cluster timed out at the same time.

Still, when network timeouts occur we have to solve them somehow. Are there any solutions?

Our minimum expectation of message middleware is high concurrency and low latency, and the latency distribution above shows that RocketMQ does meet this expectation, with the vast majority of requests completing at the microsecond level. Hence my recommendation: reduce the send timeout, increase the retry count, and increase the maximum wait time before fast failure. Concretely:

  • Increase the Broker-side fast-fail wait time; 1000 is recommended. Add the following to the broker configuration file:

    maxWaitTimeMillsInQueue=1000

    The main reason is that in the current RocketMQ version, a fast-fail rejection produces a SYSTEM_BUSY error, which does not trigger a retry. Increasing this value appropriately avoids triggering the mechanism as much as possible. See part 3 of this article, which focuses on system_busy and broker_busy, for details.

  • If the RocketMQ client version is below 4.3.0 (exclusive)
    Set the send timeout to 500ms and the retry count to 5 (up to 6 total attempts; adjust as appropriate, preferably keeping it above 3). The philosophy behind this is to time out quickly and retry: LAN jitter is momentary, so the next retry is likely to succeed, and RocketMQ's fault-avoidance mechanism tries to pick a different Broker on retry. The relevant code is as follows:

    DefaultMQProducer producer = new DefaultMQProducer("dw_test_producer_group");
    producer.setNamesrvAddr("127.0.0.1:9876");
    producer.setRetryTimesWhenSendFailed(5);      // retry count for synchronous sends
    producer.setRetryTimesWhenSendAsyncFailed(5); // retry count for asynchronous sends
    producer.start();
    producer.send(msg, 500);                      // send timeout in milliseconds
  • If the RocketMQ client version is 4.3.0 or above

    Since the send timeout for 4.3.0+ clients covers the total time of all internal retries, you cannot simply shorten the timeout passed to RocketMQ's send API. Instead, wrap the API and perform the retries in an outer layer, for example:

    public static SendResult send(DefaultMQProducer producer, Message msg, int retryCount) {
        Throwable e = null;
        for (int i = 0; i < retryCount; i++) {
            try {
                return producer.send(msg, 500); // 500ms timeout; the API still has its own internal retry mechanism
            } catch (Throwable e2) {
                e = e2;
            }
        }
        throw new RuntimeException("message send failed", e);
    }

3、System busy、Broker busy


When a RocketMQ cluster runs at a load level of around 10,000 TPS, System busy and Broker busy are problems that people frequently encounter, for example the exception stack shown in the figure below.

Searching RocketMQ for error keywords related to system busy and broker busy turns up five in total:
  • [REJECTREQUEST]system busy

  • too many requests and system thread pool busy

  • [PC_SYNCHRONIZED]broker busy

  • [PCBUSY_CLEAN_QUEUE]broker busy

  • [TIMEOUT_CLEAN_QUEUE]broker busy

3.1 Principle analysis

Let's first use a diagram to illustrate where, in the whole lifecycle of sending a message, each of the above errors is thrown.

According to the above five types of error logs, the root triggers can be classified into the following three categories.
  • High pagecache pressure

    The following three errors belong to this case:

  • [REJECTREQUEST]system busy

  • [PC_SYNCHRONIZED]broker busy

  • [PCBUSY_CLEAN_QUEUE]broker busy

    The criterion for deciding that the pagecache is busy is the time spent holding the lock while appending a message to memory. By default, if the lock is held for more than 1s, the pagecache is considered under high pressure and the corresponding error is returned to the client.

  • Rejection policy when the send thread pool backs up
    RocketMQ processes message sending with a thread pool that has only one thread, backed by a bounded internal queue with a default length of 10,000. If the backlog in the queue exceeds 10,000, the thread pool's rejection policy kicks in, throwing the [too many requests and system thread pool busy] error.

  • Broker-side fast failure

    By default the Broker enables a fast-fail mechanism: even when the pagecache is not yet busy (append lock held for more than 1s), if some requests have been waiting in the send queue for more than 200ms, RocketMQ stops queuing them and directly returns system busy to the client. Since the RocketMQ client currently does not retry on this error, extra handling is needed to solve this class of problems.
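The fast-fail sweep described above can be sketched as follows. The class and field names are invented for the example; only the 200ms threshold and the [TIMEOUT_CLEAN_QUEUE]broker busy response mirror the text:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch (assumed names, NOT broker source) of the fast-fail sweep:
// requests waiting in the send queue longer than the threshold are rejected with
// a "[TIMEOUT_CLEAN_QUEUE]broker busy" style response instead of queuing further.
class FastFailSketch {
    static final long WAIT_TIME_MILLS_IN_SEND_QUEUE = 200; // broker default per the text

    static class Request {
        final long enqueueTime;
        String response;
        Request(long enqueueTime) { this.enqueueTime = enqueueTime; }
    }

    // Scan from the head: the oldest request is evicted if it has waited past
    // the threshold; once a young-enough request is found, the rest are newer.
    static void cleanExpiredRequests(Deque<Request> queue, long now) {
        while (!queue.isEmpty()) {
            Request head = queue.peekFirst();
            if (now - head.enqueueTime >= WAIT_TIME_MILLS_IN_SEND_QUEUE) {
                queue.pollFirst();
                head.response = "[TIMEOUT_CLEAN_QUEUE]broker busy";
            } else {
                break;
            }
        }
    }
}
```

The head-first scan is what makes the sweep cheap: requests are queued in arrival order, so eviction stops at the first request still within budget.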

3.2 Solutions for a busy PageCache

Once the message server shows a large amount of pagecache busy (the append-to-memory lock held for more than 1s), it is a serious problem that requires human intervention. The approach is as follows:

  • transientStorePoolEnable

    Enable the transientStorePoolEnable mechanism, i.e., add the following to the broker configuration file:

    transientStorePoolEnable=true

    The principle behind transientStorePoolEnable is shown in the figure below:

 The keys to why transientStorePoolEnable relieves pagecache pressure are:
  • Messages are first written to off-heap memory. Because the memory-locking mechanism is enabled, writing a message is close to operating on memory directly, so performance is guaranteed.

  • After messages land in off-heap memory, a background thread commits them to the pagecache batch by batch, turning single-message pagecache writes into batched writes and reducing the pressure on the pagecache.

    Enabling transientStorePoolEnable increases the possibility of message loss: if the Broker JVM process exits abnormally, messages already committed to the PageCache are not lost, but messages still in off-heap memory (DirectByteBuffer) that have not yet been committed to the PageCache will be lost. Such an exit is usually unlikely; even so, when transientStorePoolEnable is enabled, the message sender should have a re-push (compensation) mechanism.

  • Scale out

    If the pagecache is still busy after transientStorePoolEnable is enabled, the cluster needs to be scaled out, or topics in the cluster should be split, i.e., some topics migrated to other clusters to reduce the cluster's load.
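One simple reading of the re-push (compensation) idea mentioned above: park messages whose send did not succeed and re-send them later from a scheduled task. The Sender interface below is a hypothetical stand-in for the real producer call, not a RocketMQ API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal compensation sketch (assumed design, not a RocketMQ API): failed sends
// are parked in a local retry queue and re-pushed later, limiting the impact of
// the rare loss window that transientStorePoolEnable opens.
class RepushSketch {
    interface Sender { boolean trySend(String msg); } // hypothetical stand-in

    final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    final Sender sender;

    RepushSketch(Sender sender) { this.sender = sender; }

    void send(String msg) {
        if (!sender.trySend(msg)) {
            pending.add(msg); // park for later compensation
        }
    }

    // Called periodically (e.g. from a scheduled task) to re-push parked messages;
    // returns the messages that are still failing after this pass.
    List<String> repush() {
        List<String> stillFailing = new ArrayList<>();
        String msg;
        while ((msg = pending.poll()) != null) {
            if (!sender.trySend(msg)) stillFailing.add(msg);
        }
        pending.addAll(stillFailing);
        return stillFailing;
    }
}
```

In a real system the parked messages would be persisted (e.g. to a local file or database) so they survive a producer restart as well.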


Reminder: when broker busy is caused by a busy pagecache in RocketMQ, the RocketMQ client does have a retry mechanism.

3.3 TIMEOUT_CLEAN_QUEUE solution

Because the client does not currently retry on a TIMEOUT_CLEAN_QUEUE error, the suggestion at this stage is to appropriately raise the fast-fail threshold, i.e., add the following to the broker configuration file:

# defaults to 200, meaning 200ms
waitTimeMillsInSendQueue=1000
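Besides raising the threshold, since the client does not retry this error itself, an application-level retry restricted to busy-style errors is a reasonable complement. A minimal sketch with an assumed Sender interface standing in for DefaultMQProducer#send:

```java
// Application-level retry limited to "broker busy" / "system busy" style errors,
// which the client does not retry on its own. Sender is a hypothetical stand-in
// for the real producer send call.
class BusyRetrySketch {
    interface Sender { String send(String msg); }

    static String sendWithRetry(Sender sender, String msg, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return sender.send(msg);
            } catch (RuntimeException e) {
                last = e;
                String m = String.valueOf(e.getMessage());
                // Only busy-style errors are transient enough to be worth retrying here.
                if (!m.contains("broker busy") && !m.contains("system busy")) throw e;
            }
        }
        throw last; // still busy after all attempts
    }
}
```

Non-busy errors propagate immediately, so this wrapper does not mask genuine failures.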

This article is taken from the author's column "RocketMQ Actual Combat and Advancement". Starting from usage scenarios, the column introduces how to use RocketMQ, the problems encountered during use, how to solve them, and why those solutions work, interleaving principle explanations and diagrams with practice. The column emphasizes practice: its aim is to let a RocketMQ beginner quickly level up by combining theory with hands-on work and become a go-to person in this area.

This article is shared from the WeChat public account Middleware Interest Circle (dingwpmz_zjj).
