Gracefully upgrading a large-scale production RocketMQ cluster without downtime (architecture scheme)

At the security department's mandate, hundreds of RocketMQ machines in the production environment had to be upgraded within half a month, and the upgraded cluster had to support ACL to eliminate security risks.

The upgrade plan and its implementation naturally fell to me. This article describes how I carried out the upgrade, and it also aims to show the methodology an architect uses to approach such problems, offering a glimpse of the daily work of an architect at a large company.

>Reminder: ACL-related content will be covered in follow-up articles, which will share the twists and turns of upgrading from version 4.1.0 to 4.8.0 and enabling ACL.

1. The urgency of the version upgrade

I am ashamed to say that, although I am an excellent evangelist in the RocketMQ community, my company's RocketMQ servers were still on version 4.1.0. RocketMQ does not support ACL (access control) before version 4.4.0, so any machine in the production environment could subscribe to any topic. Anyone could also install rocketmq-console on any production application server and thereby control the entire cluster, with permission to delete topics and consumer groups. Does that send a chill down your spine?

2. Upgrade plan

2.1 Determine the version to upgrade to

According to the RocketMQ release notes, the ACL mechanism was officially introduced in version 4.4.0, so the target version had to be at least 4.4.0. There is an unwritten industry rule for adopting open source software: do not use the latest version, and do not be the guinea pig.

But RocketMQ can be regarded as a special case.

A careful reading of the RocketMQ change log shows that very few changes touch the RocketMQ client. The message sending and consumption code that users depend on most is very stable, so in theory there should be essentially no compatibility problems. In addition, each version fixes some major bugs and brings noticeable performance improvements. So this time I decided to take the risk and upgrade straight to the latest version, 4.8.0.

Let me digress briefly and introduce several milestone versions of RocketMQ.

  • RocketMQ 4.3.0 officially introduced transaction messages. If you want to use them, the minimum recommended version is 4.6.1.
  • RocketMQ 4.4.0 introduced ACL and message trace. If you need these features, the minimum recommended version is 4.7.0.
  • RocketMQ 4.5.0 introduced multiple replicas (master-slave switching). Version 4.7.0 is recommended for this feature.
  • RocketMQ 4.6.0 introduced the request-response model.

2.2 Upgrade ideas

Basic requirement for the version upgrade: the business cannot be interrupted, that is, the upgrade must be imperceptible to the business.

If enough spare machines are available, the best migration approach is to expand first and then shrink.

[Figure: example diagram of the expand-then-shrink approach]

The main idea is to expand first: add two higher-version Broker servers to the cluster, then close the write permission of the lower-version Brokers, remove the lower-version Brokers after their messages expire, and finally upgrade the NameServer to complete the online migration without downtime.

Because this upgrade required every node of the RocketMQ cluster to be upgraded within about half a month, that many cold standby machines could not be provided, so the expand-then-shrink approach was not an option. The upgrade had to be performed in place on the existing machines.

Could the Broker software be upgraded in place, with the higher-version Broker directly reusing the storage directory of the lower-version Broker?

[Figure: example diagram of the in-place software upgrade]

The core idea is to stop the old-version Broker first and then start the new-version Broker, still using the old configuration file.

With the idea in hand, the next step is to verify its feasibility.

2.3 Scheme verification

Theory is only theory. Any change to the production environment must be preceded by sufficient testing and verification, and for a version upgrade the focus is compatibility.

2.3.1 Server version compatibility verification

[Figure: test MQ cluster used for verification]

Build an MQ cluster as shown above. The key points to verify are as follows:

  • Can a higher-version Broker register routes with a lower-version NameServer?
  • Can a lower-version Broker register routes with a higher-version NameServer?

Create multiple topics through rocketmq-console and check whether their routing information is correct. This was verified and behaved as expected.

2.3.2 Client and server compatibility verification

The RocketMQ client API is relatively simple: it is little more than message sending, batch sending, and message consumption. Since version 4.1.0 does not support transaction messages, this upgrade does not even need to verify them. The main points of verification:

  • Can a lower-version client send messages to and consume messages from a higher-version Broker normally?
  • Can a higher-version client send messages to and consume messages from a lower-version Broker normally?

Where do the test cases come from? We do not need to write them ourselves. The official demos can be used directly.

[Figure: screenshot of the official demo code]

In practice, client-side verification is much more complicated than server-side verification. The project teams use different client versions, and some even use non-Java clients such as C++ and Python. Accurately finding the connection information (client version, language type) of every client in the cluster is therefore very important.

The official version exposes connection information for consumer groups well. We can write a script that first queries all consumer groups in the system and then iterates over each one to query its clients' IP addresses, client versions, languages, and other information. The open source version is less friendly for producers, however: there is no interface that returns all senders.
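The script described above might be sketched as follows. This is only an illustration under stated assumptions: it assumes that `mqadmin topicList` prints one topic name per line and that every consumer group owns a retry topic named `%RETRY%<group>`. The helper names `groups_from_topics` and `connection_cmds` are hypothetical, and the second helper only prints the `mqadmin consumerConnection` commands instead of running them.

```shell
# Derive consumer group names from a topic listing read on stdin.
# Assumption: each consumer group has a retry topic named %RETRY%<group>.
groups_from_topics() {
  sed -n 's/^%RETRY%//p'
}

# Print (without running) the connection query for each group read on stdin.
connection_cmds() {  # usage: connection_cmds NAMESRV_ADDR
  while IFS= read -r group; do
    echo "sh ./mqadmin consumerConnection -n $1 -g $group"
  done
}

# Intended use against a real cluster (not executed here):
#   sh ./mqadmin topicList -n 192.168.xx.xx:9876 \
#     | groups_from_topics | connection_cmds 192.168.xx.xx:9876
```

Printing the commands first makes it easy to review the list of groups before querying a production cluster. Groups that have never connected may be missing from such a listing, so the result should be cross-checked against rocketmq-console.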

The way to obtain the client connections of a consumer group is shown in the figure below:

[Figure: querying the consumer connections of a consumer group]

Therefore, the method we adopted relies mainly on the client types reported for the consumer groups. During this upgrade I also did some custom development on RocketMQ that makes it easy to obtain the connection information of all senders. I plan to contribute it upstream by submitting a PR.

2.3.3 Broker-side storage format verification

Since there are no spare resources, this upgrade replaces the software in place, with the new and old versions sharing the storage directory. RocketMQ's message storage protocol has not changed since version 4.0.0. The key points to verify are as follows:

  • Can version 4.8.0 directly use the storage files (commitlog and other files) generated by 4.1.0?
  • Can version 4.1.0 directly use the storage files generated by 4.8.0?

Why verify that version 4.1.0 can read the files written by 4.8.0? Because if the upgrade fails, it must be rolled back. If 4.1.0 cannot read the files written by 4.8.0, there is no way back, which is absolutely unacceptable in architecture design.

Verification showed that the storage files are compatible in both directions.

2.3.4 Test environment verification

After the three verification steps above, the upgrade could already proceed, but first the upgraded test environment needs to run stably for a day. The test environment can be upgraded to the following architecture:

[Figure: test environment running a mix of versions]

That is, a mixed deployment of different versions, verified against all application servers in the test environment. If the test environment runs without problems, the upgrade can proceed in the production environment.

2.4 Implementation plan

With the upgrade plan fully verified, it can be executed in the production environment. Before execution, the theoretical design must be turned into a concrete implementation plan. The implementation plan must include a rollback procedure, and that procedure must be easy to perform. Otherwise the solution is not reliable.

Next, we focus on the key steps of the implementation. The whole upgrade is a rolling upgrade, that is, nodes are upgraded one by one.

1. Close the write permission of a Broker

Turning off the Broker's write permission lets applications smoothly shift traffic to other nodes, which avoids any impact on the business when the Broker is restarted.

sh ./mqadmin updateBrokerConfig -b 192.168.x.x:10911 -n 192.168.xx.xx:9876 -k brokerPermission -v 4

2. When the Broker's write and consume TPS is close to 0, stop the Broker

ps -ef | grep java
kill pid

3. Start the Broker with the new version

Note that the configuration file used here is the old version's, so write permission is still disabled at this point and starting the Broker does not affect clients writing messages.

4. Enable write permission

After the new version has started successfully, enable write permission.

sh ./mqadmin updateBrokerConfig -b 192.168.xx.xx:10911 -n 192.168.xx.xx:9876 -k brokerPermission -v 6

Watch the traffic.

Repeat the steps above for each Broker to complete the Broker upgrade.
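For reference, the values 4 and 6 passed to `brokerPermission` in the commands above are bit masks. In RocketMQ's `PermName` class the read bit is 4 and the write bit is 2, so 4 means read-only and 6 means read and write. A minimal sketch of decoding them follows; the function name `describe_perm` is hypothetical and not part of RocketMQ.

```shell
# Decode a brokerPermission value into R/W flags.
# Bit values follow RocketMQ's PermName constants: read = 4, write = 2.
describe_perm() {  # usage: describe_perm PERM
  perm=$1
  flags=""
  if [ $(( $perm & 4 )) -ne 0 ]; then flags="${flags}R"; fi
  if [ $(( $perm & 2 )) -ne 0 ]; then flags="${flags}W"; fi
  echo "${flags:--}"   # prints "-" when neither bit is set
}

describe_perm 4   # prints R  (read-only: the Broker is being drained)
describe_perm 6   # prints RW (normal operation)
```

Keeping the read bit set while the write bit is cleared is what allows consumers to finish draining the Broker before it is stopped.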

Upgrading the NameServer is easier. Use a rolling upgrade as well: kill the old-version NameServer and start the new-version NameServer on the same machine.
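The per-Broker steps in this section could be sketched as a dry-run script along the following lines. It only prints the plan and executes nothing. The Broker addresses, the `NAMESRV` value, and the helper names `is_drained`, `set_perm_cmd`, and `plan_broker` are all placeholders or hypothetical, and the TPS figures are assumed to come from whatever monitoring is already in place.

```shell
# Dry run of the per-Broker upgrade steps: prints commands, executes nothing.
# NAMESRV and the Broker addresses below are placeholders.
NAMESRV="192.168.xx.xx:9876"

# Succeeds when both write and consume TPS are below the threshold.
is_drained() {  # usage: is_drained WRITE_TPS CONSUME_TPS THRESHOLD
  awk -v i="$1" -v o="$2" -v t="$3" 'BEGIN { exit !(i < t && o < t) }'
}

# Print the command that sets brokerPermission on one Broker.
set_perm_cmd() {  # usage: set_perm_cmd BROKER_ADDR PERM
  echo "sh ./mqadmin updateBrokerConfig -b $1 -n $NAMESRV -k brokerPermission -v $2"
}

# Print the ordered plan for one Broker.
plan_broker() {  # usage: plan_broker BROKER_ADDR
  set_perm_cmd "$1" 4
  echo "# wait until write and consume TPS on $1 are close to 0"
  echo "# stop the old Broker, then start the new version with the old configuration file"
  set_perm_cmd "$1" 6
}

# Rolling upgrade: one Broker at a time.
for broker in 192.168.x.1:10911 192.168.x.2:10911; do
  plan_broker "$broker"
done
```

A call such as `is_drained 0.2 0.0 1.0` succeeds, while `is_drained 350 0.0 1.0` fails. The threshold of 1.0 is an arbitrary example of "close to 0" and should be chosen to suit the cluster.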

3. Tidbits

Finally, a small episode. Although the plan above was very detailed and had been tested repeatedly, MQ is so important in our company that the operations colleague did not dare to begin and wanted me to watch over the operation. At moments like this, an architect must dare to take responsibility. I told him clearly: as long as you perform the operations correctly, I will bear responsibility for any failure that results. I personally consider this a very important soft skill for an architect: having command of the technology you are responsible for, and daring to take responsibility.


That is all for this article. Your likes, favorites, and shares are the biggest encouragement to me. You are also welcome to add the author on WeChat (dingwpmz, with the note CSDN) to discuss further.

Finally, I am sharing my hard-core RocketMQ e-book, which distills operations experience with message flows at the 100-billion level. How to get it: search for [Middleware Interest Circle] on WeChat and reply RMQPDF.
