How does RocketMQ use MQFaultStrategy to avoid latency faults?

Background

In a RocketMQ cluster, the queues of a topic are distributed across different broker servers. When a message is sent to one of these queues and the send fails or takes too long, RocketMQ retries. Consider why the first attempt failed or was slow: the likely causes are a network fluctuation or a broker that has stopped. If the retry goes to the same broker shortly afterwards, it will very likely hit the same problem.

RocketMQ therefore provides latency fault tolerance: after a slow or failed send it automatically switches to queues on other brokers, estimates from the observed latency level how long the affected broker should be avoided, and makes the broker eligible again once that period has elapsed. The feature is optional and disabled by default. It can be enabled with the following configuration.

DefaultMQProducer producer = new DefaultMQProducer("producerGroup");
producer.setSendLatencyFaultEnable(true);

Note: this feature only takes effect when the caller does not specify a queue.

Source code interpretation

The queue selection and retry logic lives in org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl#sendDefaultImpl:

private SendResult sendDefaultImpl(
    Message msg,
    final CommunicationMode communicationMode,
    final SendCallback sendCallback,
    final long timeout
) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
    this.makeSureStateOK();
    // Validate the message
    Validators.checkMessage(msg, this.defaultMQProducer);
    final long invokeID = random.nextLong();
    long beginTimestampFirst = System.currentTimeMillis();
    long beginTimestampPrev = beginTimestampFirst;
    long endTimestamp = beginTimestampFirst;
    // Look up the publish (route) info of the topic
    TopicPublishInfo topicPublishInfo = this.tryToFindTopicPublishInfo(msg.getTopic());
    // ok() checks whether any queue is available; the message can only be delivered when the queue list is not empty
    if (topicPublishInfo != null && topicPublishInfo.ok()) {
        boolean callTimeout = false;
        MessageQueue mq = null;
        Exception exception = null;
        SendResult sendResult = null;
        // Synchronous sends get a retry budget; the other modes are attempted once
        int timesTotal = communicationMode == CommunicationMode.SYNC ? 1 + this.defaultMQProducer.getRetryTimesWhenSendFailed() : 1;
        int times = 0;
        String[] brokersSent = new String[timesTotal];
        for (; times < timesTotal; times++) {
            String lastBrokerName = null == mq ? null : mq.getBrokerName();
            // Select the queue to send to; brokers that recently misbehaved are avoided
            MessageQueue mqSelected = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);
            if (mqSelected != null) {
                mq = mqSelected;
                brokersSent[times] = mq.getBrokerName();
                try {
                    beginTimestampPrev = System.currentTimeMillis();
                    if (times > 0) {
                        // On retry, re-resolve the topic name with the namespace in case the namespace state has changed
                        msg.setTopic(this.defaultMQProducer.withNamespace(msg.getTopic()));
                    }
                    long costTime = beginTimestampPrev - beginTimestampFirst;
                    if (timeout < costTime) {
                        // Stop sending once the overall timeout has been exceeded
                        callTimeout = true;
                        break;
                    }

                    // Send the message
                    sendResult = this.sendKernelImpl(msg, mq, communicationMode, sendCallback, topicPublishInfo, timeout - costTime);
                    endTimestamp = System.currentTimeMillis();
                    // Record the send latency so that slow brokers can be avoided later
                    this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);
                    switch (communicationMode) {
                        case ASYNC:
                            return null;
                        case ONEWAY:
                            return null;
                        case SYNC:
                            if (sendResult.getSendStatus() != SendStatus.SEND_OK) {
                                if (this.defaultMQProducer.isRetryAnotherBrokerWhenNotStoreOK()) {
                                    // Not stored successfully: try another broker
                                    continue;
                                }
                            }
                            return sendResult;
                        default:
                            break;
                    }
                } catch (RemotingException e) {
                    endTimestamp = System.currentTimeMillis();
                    this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, true);
                    log.warn(String.format("sendKernelImpl exception, resend at once, InvokeID: %s, RT: %sms, Broker: %s", invokeID, endTimestamp - beginTimestampPrev, mq), e);
                    log.warn(msg.toString());
                    exception = e;
                    continue;
                } catch (MQClientException e) {
                    endTimestamp = System.currentTimeMillis();
                    this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, true);
                    log.warn(String.format("sendKernelImpl exception, resend at once, InvokeID: %s, RT: %sms, Broker: %s", invokeID, endTimestamp - beginTimestampPrev, mq), e);
                    log.warn(msg.toString());
                    exception = e;
                    continue;
                } catch (MQBrokerException e) {
                    endTimestamp = System.currentTimeMillis();
                    this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, true);
                    log.warn(String.format("sendKernelImpl exception, resend at once, InvokeID: %s, RT: %sms, Broker: %s", invokeID, endTimestamp - beginTimestampPrev, mq), e);
                    log.warn(msg.toString());
                    exception = e;
                    if (this.defaultMQProducer.getRetryResponseCodes().contains(e.getResponseCode())) {
                        continue;
                    } else {
                        if (sendResult != null) {
                            return sendResult;
                        }

                        throw e;
                    }
                } catch (InterruptedException e) {
                    endTimestamp = System.currentTimeMillis();
                    this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);
                    log.warn(String.format("sendKernelImpl exception, throw exception, InvokeID: %s, RT: %sms, Broker: %s", invokeID, endTimestamp - beginTimestampPrev, mq), e);
                    log.warn(msg.toString());

                    log.warn("sendKernelImpl exception", e);
                    log.warn(msg.toString());
                    throw e;
                }
            } else {
                break;
            }
        }

        // Return the result if there is one
        if (sendResult != null) {
            return sendResult;
        }
        // Otherwise wrap the failure in an exception and throw it

        String info = String.format("Send [%d] times, still failed, cost [%d]ms, Topic: %s, BrokersSent: %s",
            times,
            System.currentTimeMillis() - beginTimestampFirst,
            msg.getTopic(),
            Arrays.toString(brokersSent));

        info += FAQUrl.suggestTodo(FAQUrl.SEND_MSG_FAILED);

        MQClientException mqClientException = new MQClientException(info, exception);
        if (callTimeout) {
            throw new RemotingTooMuchRequestException("sendDefaultImpl call timeout");
        }

        if (exception instanceof MQBrokerException) {
            mqClientException.setResponseCode(((MQBrokerException) exception).getResponseCode());
        } else if (exception instanceof RemotingConnectException) {
            mqClientException.setResponseCode(ClientErrorCode.CONNECT_BROKER_EXCEPTION);
        } else if (exception instanceof RemotingTimeoutException) {
            mqClientException.setResponseCode(ClientErrorCode.ACCESS_BROKER_TIMEOUT);
        } else if (exception instanceof MQClientException) {
            mqClientException.setResponseCode(ClientErrorCode.BROKER_NOT_EXIST_EXCEPTION);
        }

        // Throw the wrapped exception
        throw mqClientException;
    }

    // Validate the NameServer settings
    validateNameServerSetting();

    // No route info for the topic: throw
    throw new MQClientException("No route info of this topic: " + msg.getTopic() + FAQUrl.suggestTodo(FAQUrl.NO_TOPIC_ROUTE_INFO),
        null).setResponseCode(ClientErrorCode.NOT_FOUND_TOPIC_EXCEPTION);
}
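Stripped of logging and error wrapping, the retry loop in sendDefaultImpl reduces to a small pattern: try up to 1 + retryTimesWhenSendFailed times, and report every attempt to the fault table. The following is only a rough sketch of that pattern. RetryLoopDemo, the Attempt interface and the recorded strings are hypothetical names introduced here, the fixed broker order stands in for selectOneMessageQueue, and the timeout, communication mode and response code handling are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: RetryLoopDemo and Attempt are hypothetical names,
// not RocketMQ classes. They mirror the control flow of the SYNC retry loop.
class RetryLoopDemo {
    interface Attempt {
        // Returns the elapsed time in milliseconds, or throws when the send fails.
        long send(String broker) throws Exception;
    }

    // Tries at most 1 + retryTimes brokers in list order and records one
    // fault-table update per attempt (a stand-in for updateFaultItem).
    static List<String> run(List<String> brokers, int retryTimes, Attempt attempt) {
        List<String> updates = new ArrayList<>();
        int timesTotal = 1 + retryTimes;
        for (int times = 0; times < timesTotal && times < brokers.size(); times++) {
            String broker = brokers.get(times);
            try {
                long latency = attempt.send(broker);
                // Successful send: report the measured latency, no isolation.
                updates.add(broker + " latency=" + latency + " isolation=false");
                return updates;
            } catch (Exception e) {
                // Failed send: report the broker with the isolation flag set.
                updates.add(broker + " isolation=true");
            }
        }
        return updates;
    }

    public static void main(String[] args) {
        List<String> brokers = Arrays.asList("broker-a", "broker-b", "broker-c");
        List<String> updates = run(brokers, 2, broker -> {
            if (broker.equals("broker-a")) {
                throw new IllegalStateException("simulated send failure");
            }
            return 120L;
        });
        updates.forEach(System.out::println);
    }
}
```

In this run the first attempt fails and is reported with isolation set, the second attempt succeeds on another broker, and the third broker is never contacted.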

In sendDefaultImpl, note the call this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);. It runs immediately after the message has been sent to the broker. The same method is also called when an exception is caught, with the last argument set to true in the RemotingException, MQClientException and MQBrokerException handlers. As mentioned earlier, a broker is temporarily marked as unavailable when a send takes too long or fails. Let's see how this is implemented underneath.

Following the call leads to org.apache.rocketmq.client.latency.MQFaultStrategy#updateFaultItem:

    public void updateFaultItem(final String brokerName, final long currentLatency, boolean isolation) {
        // Only takes effect when the feature is enabled
        if (this.sendLatencyFaultEnable) {
            // For an isolation request, treat the latency as 30 seconds, then map the latency to an unavailable duration
            long duration = computeNotAvailableDuration(isolation ? 30000 : currentLatency);
            // Record how long the broker should be considered unavailable
            this.latencyFaultTolerance.updateFaultItem(brokerName, currentLatency, duration);
        }
    }

    private long computeNotAvailableDuration(final long currentLatency) {
        for (int i = latencyMax.length - 1; i >= 0; i--) {
            if (currentLatency >= latencyMax[i])
                return this.notAvailableDuration[i];
        }

        return 0;
    }

The updateFaultItem method takes three parameters:

  • brokerName: the name of the broker
  • currentLatency: the time the send just took, in milliseconds
  • isolation: whether the broker should be isolated. When true, the measured latency is ignored and a latency of 30 seconds is used to compute the unavailable duration

computeNotAvailableDuration uses two arrays. Let's look at them:

private long[] latencyMax = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
private long[] notAvailableDuration = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

latencyMax holds latency thresholds and notAvailableDuration holds the period for which the broker is considered unavailable, both in milliseconds. Entries at the same index correspond to each other. The method scans the indexes in reverse and stops at the first threshold that the latency reaches. For example, if a send takes 600 ms, the first matching threshold is latencyMax[2] (550), so the result is notAvailableDuration[2] and the broker is unavailable for 30000 ms.

Now let's see how unavailable brokers are tracked, in org.apache.rocketmq.client.latency.LatencyFaultToleranceImpl#updateFaultItem:
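The lookup can be reproduced in isolation. The following minimal sketch copies the two arrays and the reverse scan into a hypothetical class named LatencyTableDemo (the class name, constant names and main method are illustrative, not part of RocketMQ) and prints the result for a few sample latencies, including 30000 ms, the value used for an isolation request.

```java
// Illustrative sketch: LatencyTableDemo is a hypothetical class, not RocketMQ code.
class LatencyTableDemo {
    // Same values as the latencyMax / notAvailableDuration arrays in MQFaultStrategy.
    static final long[] LATENCY_MAX = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
    static final long[] NOT_AVAILABLE_DURATION = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

    // Scan from the largest threshold down and return the duration that belongs
    // to the first threshold the latency reaches; 0 if it reaches none.
    static long computeNotAvailableDuration(long currentLatency) {
        for (int i = LATENCY_MAX.length - 1; i >= 0; i--) {
            if (currentLatency >= LATENCY_MAX[i]) {
                return NOT_AVAILABLE_DURATION[i];
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        long[] samples = {40L, 120L, 600L, 2500L, 30000L};
        for (long latency : samples) {
            System.out.println(latency + " ms -> unavailable for "
                + computeNotAvailableDuration(latency) + " ms");
        }
    }
}
```

With these tables, a send that takes less than 550 ms never causes the broker to be avoided, while an isolation request maps to the last entry, 600000 ms, which is ten minutes.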

@Override
public void updateFaultItem(final String name, final long currentLatency, final long notAvailableDuration) {
    // Check whether the broker already has an entry
    FaultItem old = this.faultItemTable.get(name);
    if (null == old) {
        // No entry yet: create one
        final FaultItem faultItem = new FaultItem(name);
        // Record the latency of this send
        faultItem.setCurrentLatency(currentLatency);
        // Current time + unavailable duration = time at which the broker becomes available again
        faultItem.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);

        // Use putIfAbsent because another thread may have inserted the key in the meantime
        old = this.faultItemTable.putIfAbsent(name, faultItem);
        if (old != null) {
            // The key already existed: update the existing entry
            old.setCurrentLatency(currentLatency);
            old.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
        }
    } else {
        // The broker already has an entry: update it directly
        old.setCurrentLatency(currentLatency);
        old.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
    }
}

Marked brokers are tracked in a ConcurrentHashMap of FaultItem objects. Each FaultItem records the latency of the latest send and the timestamp until which the broker is considered unavailable.

At this point we understand how a broker becomes unavailable. Next, let's look at the code that selects a queue automatically. Go back to org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl#sendDefaultImpl.

Note the line MessageQueue mqSelected = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);. Because of the retry logic there is a lastBrokerName, the broker used by the previous attempt. topicPublishInfo holds the routing information of the topic being sent to.

This leads to the core queue selection method, org.apache.rocketmq.client.latency.MQFaultStrategy#selectOneMessageQueue:
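To make the bookkeeping concrete, here is a minimal sketch of such a table. FaultTableDemo and Entry are hypothetical stand-ins for LatencyFaultToleranceImpl and FaultItem. The sketch uses computeIfAbsent in place of the get/putIfAbsent sequence, takes the current time as a parameter so the behaviour is deterministic, and assumes that a broker without an entry counts as available.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: FaultTableDemo and Entry are hypothetical stand-ins for
// LatencyFaultToleranceImpl and FaultItem, not the RocketMQ classes themselves.
class FaultTableDemo {
    static final class Entry {
        volatile long currentLatency;
        volatile long availableAt; // plays the role of startTimestamp
    }

    private final ConcurrentHashMap<String, Entry> table = new ConcurrentHashMap<>();

    // Record the latest latency and move the "available again" deadline.
    // The clock value is passed in instead of calling System.currentTimeMillis().
    void update(String brokerName, long currentLatency, long notAvailableDuration, long now) {
        Entry entry = table.computeIfAbsent(brokerName, k -> new Entry());
        entry.currentLatency = currentLatency;
        entry.availableAt = now + notAvailableDuration;
    }

    // Assumption: a broker that was never recorded is treated as available.
    boolean isAvailable(String brokerName, long now) {
        Entry entry = table.get(brokerName);
        return entry == null || now - entry.availableAt >= 0;
    }

    public static void main(String[] args) {
        FaultTableDemo faults = new FaultTableDemo();
        faults.update("broker-a", 600L, 30000L, 1_000L);
        System.out.println(faults.isAvailable("broker-a", 2_000L));  // inside the unavailable window
        System.out.println(faults.isAvailable("broker-a", 31_000L)); // window has elapsed
        System.out.println(faults.isAvailable("broker-b", 2_000L));  // never recorded
    }
}
```

Storing a deadline rather than a flag means recovery needs no background task: the entry simply stops blocking the broker once the clock passes the stored timestamp.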

public MessageQueue selectOneMessageQueue(final TopicPublishInfo tpInfo, final String lastBrokerName) {
    // Is latency fault tolerance enabled?
    if (this.sendLatencyFaultEnable) {
        try {
            // The index is kept in a ThreadLocal, so each thread advances its own position
            int index = tpInfo.getSendWhichQueue().incrementAndGet();
            // Walk through all queues
            for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
                // Take the index modulo the number of queues to spread sends as evenly as possible
                int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
                if (pos < 0)
                    pos = 0;
                MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
                // Check whether the broker is available
                if (latencyFaultTolerance.isAvailable(mq.getBrokerName()))
                    return mq;
            }
            // No available broker was found:
            // force-pick one of the suspect brokers
            final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
            // Number of write queues on that broker, used to pick a queue on it
            int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
            if (writeQueueNums > 0) {
                final MessageQueue mq = tpInfo.selectOneMessageQueue();
                if (notBestBroker != null) {
                    mq.setBrokerName(notBestBroker);
                    mq.setQueueId(tpInfo.getSendWhichQueue().incrementAndGet() % writeQueueNums);
                }
                return mq;
            } else {
                latencyFaultTolerance.remove(notBestBroker);
            }
        } catch (Exception e) {
            log.error("Error occurred when selecting message queue", e);
        }

        return tpInfo.selectOneMessageQueue();
    }
    // Feature disabled: pick the next queue, taking the broker used by the previous attempt into account
    return tpInfo.selectOneMessageQueue(lastBrokerName);
}

In selectOneMessageQueue, latencyFaultTolerance.isAvailable(mq.getBrokerName()) looks up the FaultItem stored in the ConcurrentHashMap described earlier and decides whether the broker is available by comparing its timestamp with the current time:

public boolean isAvailable() {
    // startTimestamp = time of the last recorded fault + unavailable duration
    return (System.currentTimeMillis() - startTimestamp) >= 0;
}
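The round-robin scan combined with this availability check can be sketched on its own. In the example below, QueuePickDemo is a hypothetical class, queues are modelled as "brokerName:queueId" strings rather than MessageQueue objects, and the availability check is passed in as a predicate. It also shows why the pos < 0 guard exists: Math.abs(Integer.MIN_VALUE) is still negative, so the remainder can be negative too.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative sketch: QueuePickDemo is hypothetical; queues are modelled as
// "brokerName:queueId" strings instead of MessageQueue objects.
class QueuePickDemo {
    // Walk the queue list once, starting at startIndex, and return the first
    // queue whose broker passes the availability check, or null if none does.
    static String pick(List<String> queues, int startIndex, Predicate<String> brokerAvailable) {
        int index = startIndex;
        for (int i = 0; i < queues.size(); i++) {
            int pos = Math.abs(index++) % queues.size();
            if (pos < 0) {
                pos = 0; // Math.abs(Integer.MIN_VALUE) is negative, so pos can be too
            }
            String queue = queues.get(pos);
            String broker = queue.substring(0, queue.indexOf(':'));
            if (brokerAvailable.test(broker)) {
                return queue;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> queues = Arrays.asList("broker-a:0", "broker-a:1", "broker-b:0", "broker-b:1");
        Set<String> faulty = Collections.singleton("broker-a");
        // Queues on broker-a are skipped while it is marked as faulty.
        System.out.println(pick(queues, 0, b -> !faulty.contains(b)));
        System.out.println(pick(queues, 3, b -> !faulty.contains(b)));
        // When no broker is available the scan finds nothing.
        System.out.println(pick(queues, 0, b -> false));
    }
}
```

A null result here corresponds to the point where selectOneMessageQueue falls through to pickOneAtLeast.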

What if every broker has been marked as faulty and none has reached its recovery time yet?

In that case the line final String notBestBroker = latencyFaultTolerance.pickOneAtLeast(); in selectOneMessageQueue forcibly picks one of the faulty brokers. Its implementation is org.apache.rocketmq.client.latency.LatencyFaultToleranceImpl#pickOneAtLeast:

public String pickOneAtLeast() {
    final Enumeration<FaultItem> elements = this.faultItemTable.elements();
    List<FaultItem> tmpList = new LinkedList<FaultItem>();
    while (elements.hasMoreElements()) {
        final FaultItem faultItem = elements.nextElement();
        tmpList.add(faultItem);
    }

    if (!tmpList.isEmpty()) {
        // The shuffle has no effect because of the sort that follows; ignore it. Fixed upstream in the latest version
        Collections.shuffle(tmpList);

        // Sort the items
        Collections.sort(tmpList);

        // Pick only from the half of the brokers with the lowest latency
        final int half = tmpList.size() / 2;
        if (half <= 0) {
            return tmpList.get(0).getName();
        } else {
            final int i = this.whichItemWorst.incrementAndGet() % half;
            return tmpList.get(i).getName();
        }
    }

    return null;
}

Note that this logic does not pick a broker at random. The items are first sorted by the latency recorded for the most recent send, the list is then halved, and a broker is chosen by modulo from the faster half only.

The Collections.shuffle(tmpList); call above can be ignored. The maintainers have stated that it will be fixed: https://github.com/apache/rocketmq/pull/3945

As this shows, RocketMQ puts considerable care into circuit breaking and failover on the sending side alone. Keep in mind that the code described in this article only applies when no queue is specified manually and the send latency fault tolerance feature is enabled.
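The "sort, halve, rotate" selection can be illustrated with a small self-contained sketch. PickOneAtLeastDemo and Item are hypothetical names, the sketch orders items by latency only (the real FaultItem ordering may consider more fields), and it uses Math.floorMod instead of the plain % operator so that the index stays non-negative for any counter value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: PickOneAtLeastDemo and Item are hypothetical and sort by
// latency only; the real FaultItem ordering may take more fields into account.
class PickOneAtLeastDemo {
    static final class Item {
        final String name;
        final long latency;

        Item(String name, long latency) {
            this.name = name;
            this.latency = latency;
        }
    }

    // Sort ascending by latency, then rotate through the faster half using counter.
    static String pick(List<Item> items, int counter) {
        if (items.isEmpty()) {
            return null;
        }
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingLong((Item it) -> it.latency));
        int half = sorted.size() / 2;
        if (half <= 0) {
            return sorted.get(0).name;
        }
        return sorted.get(Math.floorMod(counter, half)).name;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
            new Item("broker-a", 3000L),
            new Item("broker-b", 600L),
            new Item("broker-c", 1200L),
            new Item("broker-d", 16000L));
        // Only broker-b and broker-c (the faster half) are ever returned.
        for (int counter = 1; counter <= 4; counter++) {
            System.out.println(pick(items, counter));
        }
    }
}
```

Because the list is fully sorted before the pick, the order in which the items arrive does not change which broker is returned for a given counter when latencies are distinct, which is why a shuffle ahead of the sort adds nothing.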

Origin blog.csdn.net/qq_21046665/article/details/125892156