RocketMQ failure retry mechanism for concurrent and sequential consumption

Questions

  1. When a batch of messages is consumed and one of them fails, is the whole batch retried?
  2. Can users control the number of retries and the retry interval?
  3. When consuming messages in batches, can I control where the retry starts? For example, with 10 messages where the 5th fails, can only the 5th message and the ones after it be retried?
  4. How are retried messages consumed again?
  5. If the Broker's write permission is turned off, does that affect consumption retries?
  6. What happens if different consumers in the same ConsumerGroup consume a Topic sequentially and concurrently?

For more details, please see: RocketMQ concurrent consumption failure retry mechanism

Concurrent consumption

When it is triggered

After a batch has been consumed, the consumer has to process the consumption result, whether it is a success or a failure.

ConsumeMessageConcurrentlyService#processConsumeResult

    public void processConsumeResult(
        final ConsumeConcurrentlyStatus status,
        final ConsumeConcurrentlyContext context,
        final ConsumeRequest consumeRequest
    ) {
        int ackIndex = context.getAckIndex();

        if (consumeRequest.getMsgs().isEmpty())
            return;

        switch (status) {
            case CONSUME_SUCCESS:
                if (ackIndex >= consumeRequest.getMsgs().size()) {
                    ackIndex = consumeRequest.getMsgs().size() - 1;
                }
                // Even if success is returned, ackIndex can still mark the index from which
                // consumption failed; those messages are counted in the failed-consumption TPS metric.
                int ok = ackIndex + 1;
                int failed = consumeRequest.getMsgs().size() - ok;
                this.getConsumerStatsManager().incConsumeOKTPS(consumerGroup, consumeRequest.getMessageQueue().getTopic(), ok);
                this.getConsumerStatsManager().incConsumeFailedTPS(consumerGroup, consumeRequest.getMessageQueue().getTopic(), failed);
                break;
            case RECONSUME_LATER:
                ackIndex = -1;
                this.getConsumerStatsManager().incConsumeFailedTPS(consumerGroup, consumeRequest.getMessageQueue().getTopic(),
                    consumeRequest.getMsgs().size());
                break;
            default:
                break;
        }

        List<MessageExt> msgBackFailed = new ArrayList<>(consumeRequest.getMsgs().size());
        for (int i = ackIndex + 1; i < consumeRequest.getMsgs().size(); i++) {
            MessageExt msg = consumeRequest.getMsgs().get(i);
            // Maybe message is expired and cleaned, just ignore it.
            if (!consumeRequest.getProcessQueue().containsMessage(msg)) {
                log.info("Message is not found in its process queue; skip send-back-procedure, topic={}, "
                        + "brokerName={}, queueId={}, queueOffset={}", msg.getTopic(), msg.getBrokerName(),
                    msg.getQueueId(), msg.getQueueOffset());
                continue;
            }
            boolean result = this.sendMessageBack(msg, context);
            if (!result) {
                msg.setReconsumeTimes(msg.getReconsumeTimes() + 1);
                msgBackFailed.add(msg);
            }
        }

        if (!msgBackFailed.isEmpty()) {
            consumeRequest.getMsgs().removeAll(msgBackFailed);

            this.submitConsumeRequestLater(msgBackFailed, consumeRequest.getProcessQueue(), consumeRequest.getMessageQueue());
        }

        // ... part of the code is omitted ...
    }

Part of the code is omitted. What is shown above mainly sends failed messages back to the Broker.

Read on its own, the code does the following:

  1. If the result is CONSUME_SUCCESS, the TPS metrics for successful and failed consumption are recorded. The user can set the ACK index with context.setAckIndex(). For example, with a batch of 10 messages and ackIndex set to 4, the first 5 messages (indexes 0 to 4) count as successful and the last 5 as failed. In the code shown, the send-back loop starts at ackIndex + 1, so those last 5 are the ones sent back to the Broker; if ackIndex is left untouched, nothing is sent back.
  2. If the result is RECONSUME_LATER, ackIndex is forced to -1, so every message in the batch is traversed and sent back to the Broker synchronously. A message whose send-back request fails is recorded and consumed again on the local client after a delay.
  3. The messages are removed from the TreeMap of messages waiting to be consumed (except those whose send-back to the Broker failed), and the smallest key left in the TreeMap is obtained.
  4. The consumed offset in the local cache is updated with that value so that it can be committed.
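To make the role of ackIndex concrete, here is a minimal sketch that models only the index arithmetic of the code above. The class and method names are made up for illustration and are not part of RocketMQ; the sketch assumes the clustering-mode path shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the ackIndex arithmetic; not RocketMQ code.
class AckIndexSketch {

    /** Returns the batch indexes that the send-back loop would visit. */
    static List<Integer> indexesToSendBack(boolean consumeSuccess, int ackIndex, int batchSize) {
        if (consumeSuccess) {
            // CONSUME_SUCCESS: clamp ackIndex to the last valid index.
            if (ackIndex >= batchSize) {
                ackIndex = batchSize - 1;
            }
        } else {
            // RECONSUME_LATER: any user-set ackIndex is ignored.
            ackIndex = -1;
        }
        List<Integer> sendBack = new ArrayList<>();
        for (int i = ackIndex + 1; i < batchSize; i++) {
            sendBack.add(i);
        }
        return sendBack;
    }

    public static void main(String[] args) {
        System.out.println(indexesToSendBack(true, 4, 10));
        System.out.println(indexesToSendBack(true, Integer.MAX_VALUE, 10));
        System.out.println(indexesToSendBack(false, 4, 10));
    }
}
```

With a batch of 10, success plus ackIndex 4 selects indexes 5 to 9, success with a very large ackIndex selects nothing, and failure selects the whole batch regardless of ackIndex.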

A few key points:

  1. A message that needs a retry is first sent back to the retry queue, and once that send succeeds it is treated as consumed. This prevents a single failed message from holding back the commit of the whole consumption offset. For example, of four messages 1, 2, 3 and 4, suppose the first fails and the others succeed. Because the smallest offset, 1, has not succeeded, the later messages could not be marked as committed. Treating 1 as consumed as well removes the blocking point; the message is of course still sent to the retry queue to wait for its retry.

  2. The committable consumption offset is always the smallest key in the TreeMap. The TreeMap holds every message fetched by pullMessage that is still waiting to be consumed, and a message is deleted from it once it has been consumed. For example, with four messages 1, 2, 3 and 4: if 1 and 2 are consumed and deleted, the smallest offset is 3, and everything before it can be committed. If 2, 3 and 4 are all consumed and deleted but 1 is still there, the committable offset is still the current minimum, 1.
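The "smallest key in the TreeMap" rule can be sketched in a few lines. This is a simplified model, not the real ProcessQueue: the class name, the String payload, and the "max offset + 1 when nothing is pending" rule for an empty map are assumptions made for the illustration.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified model of the pending-message TreeMap.
class ProcessQueueSketch {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long maxOffset = -1;

    /** Records a pulled message that is waiting to be consumed. */
    void put(long offset, String body) {
        pending.put(offset, body);
        maxOffset = Math.max(maxOffset, offset);
    }

    /** Removes finished messages and returns the offset that may be committed. */
    long remove(List<Long> offsets) {
        for (long offset : offsets) {
            pending.remove(offset);
        }
        // Assumption: with nothing pending, everything up to maxOffset is done.
        return pending.isEmpty() ? maxOffset + 1 : pending.firstKey();
    }

    public static void main(String[] args) {
        ProcessQueueSketch queue = new ProcessQueueSketch();
        for (long offset = 1; offset <= 4; offset++) {
            queue.put(offset, "msg-" + offset);
        }
        // 2, 3 and 4 are done, but 1 is still pending and blocks the commit.
        System.out.println(queue.remove(List.of(2L, 3L, 4L)));
        // Once 1 is removed as well, the offset can move past all four.
        System.out.println(queue.remove(List.of(1L)));
    }
}
```

This is why a message that can neither be consumed nor sent back keeps the committed offset pinned at its own offset.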

Users can decide which message to start retrying from

As mentioned above, users can control where the retry starts by calling setAckIndex on the ConsumeConcurrentlyContext parameter:

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            System.out.printf(" ----- %s consumed messages: %s  batch size: %s   ------ ", Thread.currentThread().getName(), msgs, msgs.size());

            for (int i = 0; i < msgs.size(); i++) {
                System.out.println("Message " + i + ", MSG: " + msgs.get(i));
                try {
                    // consumption logic
                } catch (Exception e) {
                    // This message failed: it and all messages after it need to be retried
                    context.setAckIndex(i - 1);
                    return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
                }
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });

PS: The version I am reading is 5.1.3. The ackIndex design has always felt off to me:

  1. ackIndex only takes effect when success is returned. But returning success says that no retry is needed, so setting this value there feels awkward.
  2. When failure is returned, ackIndex is forced to -1, meaning every message is retried. One would expect that when one message in a batch fails, the retry starts from that message's index: it and the messages after it are retried, and the earlier, successfully consumed ones are not.

I am inclined to consider this a bug, or at least a design flaw.

Optimization suggestions:

  1. For compatibility with existing behavior, leave the logic of the success case unchanged.
  2. In the failure case, do not force ackIndex to -1, which retries everything. Users could then use ackIndex to retry only part of the batch.

The client sends a CONSUMER_SEND_MSG_BACK request

If consumption of a batch fails, the batch is retried: the client tries to send the messages back to the Broker one by one.

DefaultMQPushConsumerImpl#sendMessageBack

Request header ConsumerSendMsgBackRequestHeader

| Attribute | Description |
| --- | --- |
| group | Consumer group name |
| originTopic | Topic |
| offset | The offset of the message in the CommitLog |
| delayLevel | Delayed retry level, which is also the retry policy: -1 means no retry (the message goes straight to the dead letter queue), 0 means the Broker controls the retry frequency, and >0 means the client controls it. If it is greater than 0, the retry is delayed by the corresponding delay level (as a delayed message). If it is 0, the delay level is the number of retries + 3, so the delay goes up one level with each retry. The levels are the 18 levels of delayed messages. |
| originMsgId | Message ID |
| maxReconsumeTimes | The maximum number of retries. Defaults to 16 in concurrent mode and to Integer.MAX_VALUE in ordered mode. |
| bname | BrokerName |

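Target address

The address of the Broker where the message is stored. As the code below shows, it is looked up from brokerName when one is given, and otherwise taken from the message itself:

msg.getStoreHost()

Request method

Synchronous request

Request process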

    private void sendMessageBack(MessageExt msg, int delayLevel, final String brokerName, final MessageQueue mq)
        throws RemotingException, MQBrokerException, InterruptedException, MQClientException {
        boolean needRetry = true;
        try {
            // ... part of the code is omitted ...
            String brokerAddr = (null != brokerName) ? this.mQClientFactory.findBrokerAddressInPublish(brokerName)
                : RemotingHelper.parseSocketAddressAddr(msg.getStoreHost());
            this.mQClientFactory.getMQClientAPIImpl().consumerSendMessageBack(brokerAddr, brokerName, msg,
                this.defaultMQPushConsumer.getConsumerGroup(), delayLevel, 5000, getMaxReconsumeTimes());
        } catch (Throwable t) {
            log.error("Failed to send message back, consumerGroup={}, brokerName={}, mq={}, message={}",
                this.defaultMQPushConsumer.getConsumerGroup(), brokerName, mq, msg, t);
            if (needRetry) {
                // Send the retry message as an ordinary message
                sendMessageBackAsNormalMessage(msg);
            }
        } finally {
            msg.setTopic(NamespaceUtil.withoutNamespace(msg.getTopic(), this.defaultMQPushConsumer.getNamespace()));
        }
    }

  1. The first attempt sends a RequestCode.CONSUMER_SEND_MSG_BACK request with a timeout of 5000 ms.
  2. If that request fails, the fallback is to send the message directly as an ordinary message, but to the Topic %RETRY%{consumerGroup} and with delay level 3 + msg.getReconsumeTimes(). The Producer used here is the built-in Producer that the Consumer creates when it builds its client instance, under the name CLIENT_INNER_PRODUCER. This send is also synchronous, with a timeout of 3000 ms.
  3. If the fallback also fails and throws an exception, the local client retries consumption after a delay of 5 seconds.
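The three steps form a simple fallback chain, which the following sketch models with plain Java. Everything here (the class, the enum, the two BooleanSupplier parameters standing in for the real network calls) is hypothetical and only illustrates the order of the attempts.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the send-back fallback order; not RocketMQ code.
class SendBackFallbackSketch {
    enum Outcome { SENT_BACK_TO_BROKER, SENT_AS_NORMAL_MESSAGE, RETRY_LOCALLY_AFTER_5S }

    static Outcome sendBack(BooleanSupplier sendBackRequest, BooleanSupplier sendAsNormalMessage) {
        // Step 1: the CONSUMER_SEND_MSG_BACK request.
        if (attempt(sendBackRequest)) {
            return Outcome.SENT_BACK_TO_BROKER;
        }
        // Step 2: fall back to an ordinary message on the retry Topic.
        if (attempt(sendAsNormalMessage)) {
            return Outcome.SENT_AS_NORMAL_MESSAGE;
        }
        // Step 3: both sends failed, so the client re-consumes locally later.
        return Outcome.RETRY_LOCALLY_AFTER_5S;
    }

    /** Treats an exception the same as a failed send. */
    private static boolean attempt(BooleanSupplier step) {
        try {
            return step.getAsBoolean();
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulate a Broker that rejects every write.
        BooleanSupplier rejected = () -> { throw new IllegalStateException("no write permission"); };
        System.out.println(sendBack(rejected, rejected));
    }
}
```

The main method simulates a Broker that rejects both sends, which is the situation where the message ends up being retried on the local client.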

Does the local client retry forever, or is the number of local retries limited?

There is no limit. As long as sending back keeps failing, the client keeps retrying locally, delaying consumption by 5 seconds each time. The message then becomes a blocking point for the consumption offset, and the messages after it may be consumed again (for example, after the client restarts).

Broker handles CONSUMER_SEND_MSG_BACK request

AbstractSendMessageProcessor#consumerSendMsgBack


  1. If the current Broker is not the Master, a system error code is returned.
  2. If the subscription configuration of the consumer Group does not exist, an error code is returned.
  3. If brokerPermission does not include write permission, a no-permission error code is returned.
  4. If the Group's number of retry queues is retryQueueNums <= 0, a no-permission error code is returned.
  5. If the Group's retry Topic does not exist, it is created with the name %RETRY%GroupName and with read and write permissions.
  6. The message is looked up by the offset given in the request. If it is not found, a system error code is returned.
  7. If the message has exceeded the maximum number of retries, or the retry policy is "no retry", the message is sent to the dead letter queue. Dead letter queue Topic: %DLQ%GroupName
  8. If the maximum number of retries has not been exceeded, the message is sent to the retry Topic: %RETRY%GroupName
  9. If there is a ConsumeMessageHook list, the consumeMessageAfter method is executed.
  10. The Response is returned.
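Steps 7 and 8 boil down to one routing decision, sketched below. The class and method are hypothetical, and treating "exceeded" as reconsumeTimes >= maxReconsumeTimes is my assumption about the exact boundary.

```java
// Hypothetical model of the Broker's retry/DLQ routing; not RocketMQ code.
class SendBackRoutingSketch {

    /** Returns the Topic a sent-back message would be written to. */
    static String targetTopic(String group, int reconsumeTimes, int maxReconsumeTimes, int delayLevel) {
        // Assumption: the limit counts as reached at reconsumeTimes == maxReconsumeTimes.
        boolean exhausted = reconsumeTimes >= maxReconsumeTimes;
        // A negative delayLevel means "no retry".
        boolean noRetry = delayLevel < 0;
        return (exhausted || noRetry) ? "%DLQ%" + group : "%RETRY%" + group;
    }

    public static void main(String[] args) {
        System.out.println(targetTopic("orderGroup", 3, 16, 0));
        System.out.println(targetTopic("orderGroup", 16, 16, 0));
        System.out.println(targetTopic("orderGroup", 0, 16, -1));
    }
}
```

For a hypothetical group named orderGroup, the three calls print %RETRY%orderGroup, %DLQ%orderGroup and %DLQ%orderGroup.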

Note: No matter how many times a message is retried, the Message IDs of these retried messages will not change. Therefore, we need to do idempotent consumption operations on the consumer side.
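Since the Message ID stays the same across retries, it can serve as a deduplication key. The sketch below shows the idea with an in-memory set; the class and method names are hypothetical, and a real consumer would more likely need a durable store (a database unique key, for instance) so that the record survives restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory deduplication by Message ID; illustration only.
class IdempotentConsumerSketch {
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    /**
     * Runs the business logic at most once per Message ID.
     * Returns true if the logic ran, false if the ID was already handled.
     */
    boolean consumeOnce(String msgId, Runnable businessLogic) {
        if (!processedIds.add(msgId)) {
            return false; // duplicate delivery, e.g. a retry of a message already handled
        }
        try {
            businessLogic.run();
            return true;
        } catch (RuntimeException e) {
            // Forget the ID so that a later retry can run the logic again.
            processedIds.remove(msgId);
            throw e;
        }
    }
}
```

Calling consumeOnce twice with the same ID runs the business logic only the first time, while a call that throws leaves the ID eligible for the next retry.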

Sequential consumption

How the consumption result is processed after sequential consumption

ConsumeMessageOrderlyService#processConsumeResult

Several important points

  1. In sequential consumption, only one consumption task (ConsumeRequest) runs for a given ProcessQueue at a time.
  2. If the user returns SUSPEND_CURRENT_QUEUE_A_MOMENT, the messages are retried. Whether a message is sent back to the retry queue depends on whether it has exceeded the maximum number of retries.
  3. That send-back goes directly to the retry Topic %RETRY%{consumerGroup} through the Consumer's built-in Producer instance.
  4. The maximum number of retries is normally Integer.MAX_VALUE, so it is generally never exceeded. The message is then retried locally over and over, with a delay of 1s before each retry, and is not written back to the Broker.
  5. If a message keeps failing, consumption of the entire queue is blocked.
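The blocking behavior in points 4 and 5 can be simulated without any RocketMQ classes. In this sketch the names are hypothetical, the 1s delay is left out, and the handler is a plain Predicate that returns false where a real listener would return SUSPEND_CURRENT_QUEUE_A_MOMENT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simulation of ordered retry; not RocketMQ code.
class OrderlyRetrySketch {

    /** Consumes the queue strictly in order and returns a log of what happened. */
    static List<String> consumeInOrder(List<String> queue, Predicate<String> handler, int maxReconsumeTimes) {
        List<String> log = new ArrayList<>();
        for (String msg : queue) {
            int reconsumeTimes = 0;
            boolean done = false;
            while (!done) {
                if (handler.test(msg)) {
                    log.add(msg + ":ok");
                    done = true;
                } else if (reconsumeTimes >= maxReconsumeTimes) {
                    // Limit reached: hand the message to the retry Topic and move on.
                    log.add(msg + ":sent-to-retry-topic");
                    done = true;
                } else {
                    // Retry locally; later messages wait behind this one.
                    reconsumeTimes++;
                    log.add(msg + ":retry-locally");
                }
            }
        }
        return log;
    }
}
```

With a handler that fails message "b" twice, the log for the queue a, b, c reads a:ok, b:retry-locally, b:retry-locally, b:ok, c:ok, so "c" is not touched until "b" is settled.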

Q&A

When consuming, there is a batch of messages. If one of the messages fails to be consumed, will all the messages be retried?

If the listener returns ConsumeConcurrentlyStatus#RECONSUME_LATER, consumption is treated as failed and must be retried, and every message handed to the listener in that call is retried.
The number of messages handed over per call is set with consumer.setConsumeMessageBatchMaxSize(1). The default is 1, which means one message is consumed at a time.

Can users control the number of retries and the retry interval?

Yes.
Number of retries: before 3.4.9, it came from retryMaxTimes in the consumer group configuration (subscriptionGroupConfig). Since 3.4.9, it is specified by the client (requestHeader.getMaxReconsumeTimes()) and can be set with Consumer#setMaxReconsumeTimes(maxTimes). In concurrent mode the default is 16.

Retry interval:
By default the retry interval is controlled by the Broker and implemented with delayed messages. The Broker uses delay level 3 + number of retries, so the first retry uses level 3, which corresponds to an interval of 10s.
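The mapping from retry count to interval can be worked out with a short sketch. It assumes the Broker keeps the default 18-level delay table ("1s 5s 10s ... 1h 2h"); the class and method names are hypothetical, and capping at the highest level is an assumption of this sketch.

```java
// Hypothetical helper for computing retry delays; not RocketMQ code.
class RetryDelaySketch {
    // Assumed default Broker delay table, levels 1 to 18.
    static final String DEFAULT_LEVELS = "1s 5s 10s 30s 1m 2m 3m 4m 5m 6m 7m 8m 9m 10m 20m 30m 1h 2h";

    /** delayLevel 0 means Broker-controlled: 3 + number of retries so far. */
    static int effectiveLevel(int delayLevel, int reconsumeTimes) {
        int level = (delayLevel == 0) ? 3 + reconsumeTimes : delayLevel;
        int maxLevel = DEFAULT_LEVELS.split(" ").length;
        return Math.min(level, maxLevel); // assumption: cap at the highest level
    }

    /** Converts a level (1-based, must be >= 1) to a delay in seconds. */
    static long delaySeconds(int level) {
        String token = DEFAULT_LEVELS.split(" ")[level - 1];
        long value = Long.parseLong(token.substring(0, token.length() - 1));
        switch (token.charAt(token.length() - 1)) {
            case 'm': return value * 60;
            case 'h': return value * 3600;
            default:  return value; // 's'
        }
    }

    public static void main(String[] args) {
        for (int retries = 0; retries < 16; retries++) {
            System.out.println("retry " + (retries + 1) + ": " + delaySeconds(effectiveLevel(0, retries)) + "s");
        }
    }
}
```

Under these assumptions the first retry waits 10 seconds, and the 16th (level 18) waits 2 hours.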

If you want to customize the retry interval, you need to set it yourself in the listener, for example:

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            System.out.printf(" ----- %s consumed messages: %s  batch size: %s   ------ ", Thread.currentThread().getName(), msgs, msgs.size());

            for (int i = 0; i < msgs.size(); i++) {
                System.out.println("Message " + i + ", MSG: " + msgs.get(i));
                try {
                    // consumption logic
                } catch (Exception e) {
                    // Delay level 5 = a delay of 1 minute
                    context.setDelayLevelWhenNextConsume(5);

                    // Alternatively, raise the delay level with the number of retries:
                    // context.setDelayLevelWhenNextConsume(3 + msgs.get(i).getReconsumeTimes());

                    // A retry is needed
                    return ConsumeConcurrentlyStatus.RECONSUME_LATER;
                }
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });

When consuming messages in batches, can I control the starting offset of retry? For example, if there are 10 messages and the 5th message fails, then only the 5th message and all subsequent messages will be retried.

Yes, but currently only when ConsumeConcurrentlyStatus.CONSUME_SUCCESS is returned. If ConsumeConcurrentlyStatus.RECONSUME_LATER is returned, the entire batch of messages is retried.
For details, see the code below:

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            System.out.printf(" ----- %s consumed messages: %s  batch size: %s   ------ ", Thread.currentThread().getName(), msgs, msgs.size());

            for (int i = 0; i < msgs.size(); i++) {
                System.out.println("Message " + i + ", MSG: " + msgs.get(i));
                try {
                    // consumption logic
                } catch (Exception e) {
                    // This message failed: it and all messages after it need to be retried
                    context.setAckIndex(i - 1);
                    return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
                }
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });

The author believes that consumption failure ( RECONSUME_LATER ) should also be supported here to allow users to control which message needs to be retried.

How are retried messages re-consumed?

Messages that need to be retried are written to the retry Topic %RETRY%{consumerGroup}. When the delay expires, the client consumes them again.
Once the maximum number of retries is exceeded, the message is placed in the dead letter queue %DLQ%{consumerGroup} and is not retried again.

If the broker's write permission is turned off, will it have any impact on the retry of message consumption?

Answer: Yes, it does.

Consumption retry works by first sending the message back to the Broker. If write permission is turned off, that send-back fails, and the message is retried on the local client instead, an unlimited number of times, with consumption delayed by 5 seconds each time.
While it is being retried locally, this message is a blocking point: the offsets of all the messages after it cannot be committed, even if those messages have been consumed.
So be careful when turning off write permission on a Broker.

Origin blog.csdn.net/u010634066/article/details/132969177