Smooth upgrade procedure for a RocketMQ cluster in DLedger mode

Preconditions

This procedure is not universal; for other deployment models, treat it as a reference only.

  • Cluster mode: multiple masters and multiple slaves (for example, 2 masters and 4 slaves)
  • Master-slave switchover supported: DLedger. If your cluster does not support switchover, you can still use this as a reference, provided the previous requirement is met
  • While a broker is being upgraded (restarted), neither the producer side nor the consumer side may emit ERROR logs

Scenario

The MQ cluster nodes need to be upgraded in rotation, or a broker's configuration must be changed and the broker restarted. While a broker restarts, clients must not emit ERROR-level logs.

A broker restart may cause the client to report errors, mostly in the following situations:

1. The producer fails to send a message (the routing information is not updated in time, and connecting to the stopped broker fails)

2. The consumer fails to update the consumption offset (a scheduled task persists the consumption offset to the broker master every 5 s, and the connection fails at that moment)

In case 1, the error log is printed directly in the business system.

Case 2 appears in the consumer's own log. By default, the consumer writes a separate log/rocketmq_client.log under the home directory. However, the client can be configured to log through the business system's slf4j instead, so the logs are collected by the business system. Our business systems have alerting on top of that: any ERROR-level log triggers an alert (even though some ERROR logs have no actual business impact). To avoid triggering such alerts on the business side while a broker master restarts (alerts with no business impact are unwanted noise), the operation must be smooth and stable. Note that a heartbeat failure is only an INFO-level log.
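As a minimal illustration of routing the client log into the business system's slf4j: the 4.x Java client checks a system property before initializing its private logger, so the property must be set before any RocketMQ client class is loaded. The property name below is from the 4.x client's ClientLogger; verify it against your client version.

```java
public class ClientLogConfig {
    public static void main(String[] args) {
        // Must be set before the first RocketMQ client class loads; otherwise the
        // client falls back to its private rocketmq_client.log under the home directory.
        System.setProperty("rocketmq.client.logUseSlf4j", "true");
        System.out.println(System.getProperty("rocketmq.client.logUseSlf4j")); // true
    }
}
```

Equivalently, `-Drocketmq.client.logUseSlf4j=true` can be passed on the JVM command line.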

Suppose I want to upgrade the online MQ cluster from 4.7.1 to 4.8.0. The steps below focus on the broker; the name server is ignored, since upgrading it is straightforward.

Steps

The cluster layout is as follows (diagram omitted):

Upgrade the master and slave nodes of broker1 and the master and slave nodes of broker2 in turn.

1. First close the write permission of the broker1 master to stop producers from sending messages to broker1. At this point, make sure broker2 can bear the full load: once write permission is closed, all producer traffic (including what originally went to broker1) shifts to broker2, which is why resource redundancy matters.

sh mqadmin updateBrokerConfig -n 'nameserver:9876' -k brokerPermission -v 4 -b broker1master:10911

2. Using the console, the command line, or your monitoring platform (whatever tools you have; with nothing else, use the clusterList command), observe that both inTps and outTps of the broker1 master are 0, then remove the read permission (making sure all messages on this node have been consumed and there is no backlog).

sh mqadmin updateBrokerConfig -n 'nameserver:9876' -k brokerPermission -v 1 -b broker1master:10911

At this point the consumer will log some WARN-level messages saying pulling is forbidden, but since all messages have already been consumed, there is no impact.
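The brokerPermission values used above are bit flags. A minimal sketch mirroring RocketMQ's PermName constants (PERM_READ = 4, PERM_WRITE = 2, PERM_INHERIT = 1) shows why 6 means read+write, 4 means read-only (write closed), and 1 closes both:

```java
public class BrokerPerm {
    // Mirrors org.apache.rocketmq.common.constant.PermName (4.x)
    static final int PERM_INHERIT = 0x1;      // 1
    static final int PERM_WRITE   = 0x1 << 1; // 2
    static final int PERM_READ    = 0x1 << 2; // 4

    static boolean isReadable(int perm)  { return (perm & PERM_READ) == PERM_READ; }
    static boolean isWriteable(int perm) { return (perm & PERM_WRITE) == PERM_WRITE; }

    public static void main(String[] args) {
        // -v 4 (step 1): producers blocked, consumers may still pull
        System.out.println(isReadable(4) + " " + isWriteable(4)); // true false
        // -v 1 (step 2): neither readable nor writeable
        System.out.println(isReadable(1) + " " + isWriteable(1)); // false false
        // -v 6: normal read+write, restored before restarting the node as a slave
        System.out.println(isReadable(6) + " " + isWriteable(6)); // true true
    }
}
```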

3. Check that the inTps of slave1_1 and slave1_2 are both 0 to confirm there is no more message synchronization, then stop and restart slave1_1 and slave1_2 in turn to upgrade them.

4. Stop the master1 node, making sure at least 2 minutes have passed since step 2 (why the interval needs to be this long is explained later). After it stops, one of the slave nodes is automatically elected as the new master, and producers and consumers connect to the new master normally to send and consume messages. Then start the node you just stopped, now on the new version, as a slave (if this is not a version upgrade, just restart it). Before starting it, change its brokerPermission configuration item back to 6: the earlier steps closed this node's read and write permissions, so the value is currently 1.

During the upgrade, when a slave node is switched to become the new master, the consumer may report a WARN log; don't worry about it. It comes from a gap in the consumer-side rebalancing/queue-allocation implementation that has not been fixed yet, and it is a normal phenomenon. I will explain the detailed reasons separately when I have time.

In short, the entire restart/upgrade of broker1 has no impact on business-side clients: a small number of WARN-level logs appear, but no ERROR-level logs.
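The whole broker1 sequence can be condensed into a dry-run script. The mqadmin flags follow the commands above; the name server address, broker address, and shutdown commands are placeholders for your environment, and here every command is only echoed, not executed:

```shell
#!/bin/sh
# Dry-run sketch: echo each command instead of executing it.
run() { echo "+ $*"; }

NS='nameserver:9876'          # placeholder name server address
MASTER='broker1master:10911'  # placeholder broker1 master address

# Step 1: close write permission (brokerPermission=4, read-only)
run sh mqadmin updateBrokerConfig -n "$NS" -k brokerPermission -v 4 -b "$MASTER"

# Step 2: once inTps/outTps drop to 0, close read permission too (brokerPermission=1)
run sh mqadmin updateBrokerConfig -n "$NS" -k brokerPermission -v 1 -b "$MASTER"

# Step 3: upgrade the slaves once their inTps is 0 (run on slave1_1, then slave1_2)
run sh bin/mqshutdown broker

# Step 4: wait at least 2 minutes after step 2, then stop the master
run sleep 120
run sh bin/mqshutdown broker   # a slave is elected as the new master
# Restore brokerPermission=6 before restarting the old master as a slave
run sh mqadmin updateBrokerConfig -n "$NS" -k brokerPermission -v 6 -b "$MASTER"
```

Removing the `run` wrapper turns the sketch into the real sequence, but the observation steps (inTps/outTps checks) still have to be done by hand between commands.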

5. After confirming broker1 is fine, repeat steps 1-4 on the broker2 nodes, until the whole cluster is upgraded.

Why wait at least 2 minutes between closing the master's read permission and stopping it

As mentioned in step 2, after closing the read permission you should wait at least 2 minutes before stopping the broker. This avoids the consumer-side connection-exception ERROR: failing to persist the consumption offset. Taking stopping broker1 as the scenario, let's analyze the relevant scheduled tasks one by one:

Does the periodic heartbeat matter? Look at the code in the startScheduledTask() method of the MQClientInstance class, paying attention to the comments I added:

        this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

            @Override
            public void run() {
                try {
                    // Default interval: 30 seconds.
                    // Iterate the cached broker addresses and remove any broker that no longer
                    // appears in the topic route info (but topicRouteInfo always still has it).
                    // Without write perm the pub info table has no queue data for the topic;
                    // without read perm the sub info table has none; topicRouteInfo always does.
                    MQClientInstance.this.cleanOfflineBroker();
                    // The broker address table holds the addresses of all brokers. With no
                    // consumer instance, heartbeats go only to masters; otherwise to all brokers.
                    // Each heartbeat makes the broker (re)create the retry topic, so even if the
                    // retry topic is deleted, it reappears one heartbeat later as long as a
                    // consumer is running; the retry topic is only created after a consumer starts.
                    MQClientInstance.this.sendHeartbeatToAllBrokerWithLock();
                } catch (Exception e) {
                    log.error("ScheduledTask sendHeartbeatToAllBroker exception", e);
                }
            }
        }, 1000, this.clientConfig.getHeartbeatBrokerInterval(), TimeUnit.MILLISECONDS);

This heartbeat does not affect updating the consumption offset. The main point is in the comments above: without the read permission, the subscription info will not contain the topic's queue data.

 

There is also a scheduled task that updates topic routing information every 30 s (code not posted, it is too long). When routing information is updated, the cached subscription info is updated too (this is what relates to updating the offset; explanation follows):

                            // Update sub info. Without read permission there is no queue info.
                            {
                                Set<MessageQueue> subscribeInfo = topicRouteData2TopicSubscribeInfo(topic, topicRouteData);
                                Iterator<Entry<String, MQConsumerInner>> it = this.consumerTable.entrySet().iterator();
                                while (it.hasNext()) {
                                    Entry<String, MQConsumerInner> entry = it.next();
                                    MQConsumerInner impl = entry.getValue();
                                    if (impl != null) {
                                        impl.updateTopicSubscribeInfo(topic, subscribeInfo);
                                    }
                                }
                            }
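To illustrate what the update above does, here is a self-contained sketch (simplified stand-ins, not the real client classes) of how queue data is filtered by the read-permission bit when building topic subscribe info, in the spirit of topicRouteData2TopicSubscribeInfo:

```java
import java.util.ArrayList;
import java.util.List;

public class SubscribeInfoSketch {
    static final int PERM_READ = 4; // mirrors PermName.PERM_READ

    // Simplified stand-in for the route info's per-broker QueueData
    static class QueueData {
        final String brokerName; final int readQueueNums; final int perm;
        QueueData(String b, int n, int p) { brokerName = b; readQueueNums = n; perm = p; }
    }

    // Keep only queues on brokers whose permission still has the read bit set
    static List<String> subscribeQueues(String topic, List<QueueData> route) {
        List<String> mqs = new ArrayList<>();
        for (QueueData qd : route) {
            if ((qd.perm & PERM_READ) == PERM_READ) {
                for (int i = 0; i < qd.readQueueNums; i++) {
                    mqs.add(topic + "@" + qd.brokerName + "#" + i);
                }
            }
        }
        return mqs;
    }

    public static void main(String[] args) {
        List<QueueData> route = new ArrayList<>();
        route.add(new QueueData("broker1", 4, 1)); // brokerPermission=1: read bit cleared
        route.add(new QueueData("broker2", 4, 6)); // normal read+write
        // broker1's queues drop out, so offsets for broker1 are no longer persisted
        System.out.println(subscribeQueues("TopicTest", route).size()); // 4
    }
}
```

Once broker1's queues disappear from the subscribe info, the 5 s offset-persist task below simply has nothing left to report to broker1.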

By default, the consumption offset is persisted every 5 s. As long as broker1's address is no longer present when the offset is persisted, stopping broker1 is safe.

        this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

            @Override
            public void run() {
                try {
                    // Reports consumer offsets every 5 seconds by default.
                    // If the stopped broker's address were still cached when the offsets
                    // are reported, this is where the error would be logged.
                    MQClientInstance.this.persistAllConsumerOffset();
                } catch (Exception e) {
                    log.error("ScheduledTask persistAllConsumerOffset exception", e);
                }
            }
        }, 1000 * 10, this.clientConfig.getPersistConsumerOffsetInterval(), TimeUnit.MILLISECONDS);

(Too much code to paste.) In short, persisting the offset updates the offsets of all locally subscribed message queues, and the subscribed queue info comes from the topic route updates. Without read permission, broker1 contributes no queue info, so persisting the offset no longer touches that broker. The maximum time for this on the client side is 30 + 5 = 35 s. On the broker side, after the read permission is closed, how long until the updated routing information for the topic is reported? Another 30 s:

        this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

            @Override
            public void run() {
            try { // brokerConfig.isForceRegister() defaults to true
                    BrokerController.this.registerBrokerAll(true, false, brokerConfig.isForceRegister());
                } catch (Throwable e) {
                    log.error("registerBrokerAll Exception", e);
                }
            } // brokerConfig.getRegisterNameServerPeriod() defaults to 30s (clamped to 10-60s):
              // the broker registers with the name server, i.e. reports topic info, every 30s
        }, 1000 * 10, Math.max(10000, Math.min(brokerConfig.getRegisterNameServerPeriod(), 60000)), TimeUnit.MILLISECONDS);

The topic routing information is reported to the name server every 30 s, and only after that can the consumer perceive the change.

So the total worst case is 65 s. To guarantee the consumer side is unaffected, wait at least 65 s between closing the read permission and stopping the broker. In practice these schedules overlap (and some of this information is also updated through paths other than the scheduled tasks), so ten to twenty seconds may be enough. I recommend 2 minutes because it is easy to remember: no need to care about the details, and 2 minutes is absolutely safe.
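The 65 s bound is just the sum of the three default intervals cited above (broker registration, client route refresh, and offset persistence), assuming each is missed by a full period:

```java
public class SafeStopWindow {
    public static void main(String[] args) {
        int brokerRegisterPeriodSec = 30; // broker -> name server, registerNameServerPeriod default
        int routeRefreshPeriodSec   = 30; // client polls the name server, pollNameServerInterval default
        int offsetPersistPeriodSec  = 5;  // persistConsumerOffsetInterval default
        // Worst case: the perm change just misses a broker registration, the client
        // just misses a route refresh, and one more offset persist fires in between.
        int worstCaseSec = brokerRegisterPeriodSec + routeRefreshPeriodSec + offsetPersistPeriodSec;
        System.out.println(worstCaseSec);        // 65
        System.out.println(120 >= worstCaseSec); // the 2-minute rule leaves ample margin
    }
}
```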

Origin: blog.csdn.net/x763795151/article/details/112385106