RocketMQ 5.0 Architecture Analysis: How to Support Diversified Scenarios Based on Cloud Native Architecture

This article examines RocketMQ's cloud-native architecture from a technical perspective and explains how RocketMQ supports diversified scenarios on a unified architecture.

The article has three parts. It first introduces the core concepts and architecture of RocketMQ 5.0. It then takes a cluster-level, macro view of how RocketMQ's control path, data path, clients, and servers interact. Finally, it covers the most important module of a message queue, the storage system: how RocketMQ stores data and keeps it highly available, and how it uses cloud-native storage to further strengthen its competitiveness.

01 Overview

Before introducing RocketMQ's architecture, let's look at its key concepts and domain model from the user's perspective. As shown in the figure below, the concepts are introduced in the order in which messages flow.

In RocketMQ, the message producer usually corresponds to an upstream application of the business system and sends a message to a Broker after some business action is triggered. The Broker is the core of the messaging data path: it receives and stores messages and maintains message state and consumer state. Multiple Brokers form a message service cluster that jointly serves one or more topics.

Producers produce messages and send them to Brokers. A message is the carrier of business communication; each message includes a message ID, topic, body, attributes, business key, and so on. Every message belongs to one topic, and a topic groups messages with the same business semantics.

Inside Alibaba, for example, transaction messages are sent to a topic named Trade and shopping-cart messages to a topic named Cart; a producer application sends each message to the corresponding topic. A topic is further divided into MessageQueues, which serve as the unit of load balancing and of data sharding for the message service. Each topic contains one or more MessageQueues, distributed across different Brokers.

Producers send messages, Brokers store them, and consumers consume them. Consumers usually correspond to downstream applications of the business system, and all instances of the same consumer application share one Consumer Group. A consumer has a subscription relationship with a topic, defined by the triplet Consumer Group + Topic + filter expression. Messages that match the subscription are consumed by the corresponding consumer cluster.
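To make the subscription triplet concrete, here is a minimal Python sketch of how a Consumer Group + Topic + filter expression might be matched against a message. It assumes only simple tag expressions (`*` or `TagA || TagB`); the `Message` and `Subscription` classes and the group name are hypothetical and do not mirror the RocketMQ client API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    """Hypothetical, simplified message: topic, body, a single tag, and business keys."""
    topic: str
    body: bytes
    tag: str = ""
    keys: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Subscription:
    """The subscription triplet: Consumer Group + Topic + filter expression."""
    consumer_group: str
    topic: str
    filter_expression: str = "*"

    def matches(self, msg: Message) -> bool:
        # A message must belong to the subscribed topic.
        if msg.topic != self.topic:
            return False
        expr = self.filter_expression.strip()
        # "*" (or an empty expression) accepts every tag.
        if expr in ("", "*"):
            return True
        # "TagA || TagB" accepts any of the listed tags.
        allowed = {t.strip() for t in expr.split("||")}
        return msg.tag in allowed


sub = Subscription("CartConsumerGroup", "Cart", "Add || Remove")
print(sub.matches(Message("Cart", b"{...}", tag="Add")))       # True
print(sub.matches(Message("Cart", b"{...}", tag="Checkout")))  # False
```

Only messages that pass `matches` would be delivered to the consumer cluster of `CartConsumerGroup`; every instance in that group shares the same subscription.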

Next, let's look at RocketMQ more closely from the perspective of its technical implementation.

02 Architecture overview

The following figure shows the RocketMQ 5.0 architecture, which can be divided, from top to bottom, into the SDK, NameServer, Proxy, and Store layers.

The SDK layer includes RocketMQ's own SDK, which users program against using RocketMQ's domain model. It also includes industry-standard SDKs for specific domains: for event-driven scenarios, RocketMQ 5.0 supports the CloudEvents SDK; for IoT scenarios, it supports the MQTT protocol SDK; and because many traditional applications are migrating to RocketMQ, it also supports the AMQP protocol, which will be open-sourced into the community version in the future.

The NameServer is responsible for service discovery and load balancing. Through the NameServer, a client obtains a topic's data shards and the addresses of the servers hosting them, and then connects to those message servers to send and receive messages.

The message service consists of the compute layer, Proxy, and the storage layer, RocketMQ Store. RocketMQ 5.0 uses a storage-compute separated architecture, where "separation" mainly refers to separating modules and responsibilities: Proxy and RocketMQ Store can be deployed together or separately depending on the business scenario.

The Proxy compute layer carries the upper-layer business logic of messaging, in particular multi-scenario and multi-protocol support, such as implementing the domain models of CloudEvents, MQTT, and AMQP and converting between protocols. For different workloads, Proxy can also be deployed on its own and scaled independently. In IoT scenarios, for instance, a separately deployed Proxy layer can scale elastically with the massive number of device connections, decoupled from the scaling of storage traffic.

The RocketMQ Store layer handles core message storage, including the Commitlog-based storage engine, multiple indexes, multi-replica technology, and cloud storage integration. All state of the messaging system is pushed down into RocketMQ Store, so the other components are stateless.

03 Service Discovery

Let's take a closer look at RocketMQ's service discovery, as shown in the figure below. Its core is the NameServer. The figure shows Proxy and Broker deployed together, which is also the most common deployment mode of RocketMQ.

Each Broker cluster serves a set of topics. Every Broker registers the topics it serves with each node of the NameServer (hereinafter NS) cluster and periodically renews its lease with the NS through heartbeats. The registration data consists of topics and their shards. In the example, broker1 and broker2 each host one shard of topicA, and the NS nodes maintain the global view: topicA has two shards, located on broker1 and broker2.
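The registration and lease mechanism can be sketched as a small in-memory route table held by one NameServer node. This is a hypothetical model, not RocketMQ's actual code: the class name, the 120-second lease, and the injectable clock are assumptions chosen for illustration. Because Brokers register with every NS node, each node would hold its own independent copy of such a table.

```python
import time
from typing import Callable, Dict, List


class NameServerRegistry:
    """Hypothetical in-memory sketch of one NameServer node's route table.

    Brokers register the topics (and queue counts) they serve and renew a lease
    with periodic heartbeats; brokers whose lease lapses are dropped.
    """

    def __init__(self, lease_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.last_heartbeat: Dict[str, float] = {}         # broker -> last heartbeat time
        self.topic_routes: Dict[str, Dict[str, int]] = {}  # topic -> {broker: queue count}

    def register(self, broker: str, topics: Dict[str, int]) -> None:
        """Initial registration and periodic heartbeat share the same call."""
        self.last_heartbeat[broker] = self.clock()
        for topic, queue_count in topics.items():
            self.topic_routes.setdefault(topic, {})[broker] = queue_count

    def expire(self) -> List[str]:
        """Remove brokers whose lease has expired, together with their routes."""
        now = self.clock()
        dead = [b for b, ts in self.last_heartbeat.items()
                if now - ts > self.lease_seconds]
        for broker in dead:
            del self.last_heartbeat[broker]
            for routes in self.topic_routes.values():
                routes.pop(broker, None)
        return dead

    def route(self, topic: str) -> Dict[str, int]:
        """The global view of a topic: which brokers host it and how many queues each."""
        return dict(self.topic_routes.get(topic, {}))


# Simulated clock so the example is deterministic.
now = [0.0]
ns = NameServerRegistry(lease_seconds=120, clock=lambda: now[0])
ns.register("broker1", {"topicA": 1})
ns.register("broker2", {"topicA": 1})
print(ns.route("topicA"))   # {'broker1': 1, 'broker2': 1}
now[0] = 100.0
ns.register("broker1", {"topicA": 1})  # only broker1 keeps heartbeating
now[0] = 150.0
print(ns.expire())          # ['broker2']
print(ns.route("topicA"))   # {'broker1': 1}
```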

Before sending or receiving messages on topicA, the RocketMQ SDK queries a randomly chosen NameServer node to learn which shards topicA has and which Broker hosts each shard. It then establishes long-lived connections to those Brokers and sends and receives messages over them.

In most projects, service discovery relies on a strongly consistent distributed coordination component such as ZooKeeper or etcd acting as the registry. RocketMQ takes a different approach: in CAP terms its registry works in AP mode, and NameServer nodes are stateless and shared-nothing, which gives higher availability.

As shown in the figure below, RocketMQ's compute and storage layers can be deployed either separately or together. In the separated deployment mode, the RocketMQ SDK accesses a stateless Proxy cluster directly. This mode copes better with complex network environments, supports access over multiple network types such as the public internet, and enables better security control.

Throughout the service discovery mechanism, NameServer and Proxy are stateless, so nodes can be added or removed at any time. Stateful Brokers join and leave through NS registration, and clients detect these changes dynamically in real time. When scaling in, RocketMQ can also use service-discovery read/write permissions to first stop writes to a Broker that is being removed and then stop reads once all of its unread messages have been consumed, so that the node goes offline smoothly and without message loss.
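The scale-in sequence can be sketched as a small state machine over read/write permission bits. The bit values and the `BrokerNode`/`drain_step` names are hypothetical, chosen for illustration only; in a real cluster each permission change would reach clients through NameServer registration rather than a local function call.

```python
# Hypothetical permission bits for this sketch.
PERM_READ = 0b100
PERM_WRITE = 0b010


class BrokerNode:
    def __init__(self, name: str, unread: int):
        self.name = name
        self.perm = PERM_READ | PERM_WRITE  # serving reads and writes
        self.unread = unread                # messages not yet consumed


def drain_step(node: BrokerNode) -> str:
    """Advance one step of a lossless scale-in and report the node's phase."""
    if node.perm & PERM_WRITE:
        node.perm &= ~PERM_WRITE   # producers stop routing new messages here
        return "write-disabled"
    if node.unread > 0:
        return "draining"          # consumers keep reading the remaining messages
    if node.perm & PERM_READ:
        node.perm &= ~PERM_READ    # nothing left to read: stop serving reads
        return "read-disabled"
    return "offline"               # safe to remove the node


node = BrokerNode("broker3", unread=2)
print(drain_step(node))  # write-disabled
print(drain_step(node))  # draining
node.unread = 0          # consumers have caught up
print(drain_step(node))  # read-disabled
print(drain_step(node))  # offline
```

Revoking write before read is what makes the removal lossless: no new messages arrive while the existing ones are still being consumed.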

04 Load balancing

The previous sections showed how the SDK uses the NameServer to discover a topic's MessageQueues (its shards) and the Broker addresses. Building on this discovered metadata, this section explains how message traffic is load-balanced between producers, RocketMQ Brokers, and the consumer cluster.

Load balancing on the production path is shown in the figure below. The producer obtains the topic's data shards and the corresponding Broker addresses through service discovery. The load-balancing logic here is relatively simple: by default, the producer sends to the topic's queues in round-robin order, which keeps traffic balanced across the Broker cluster. Ordered (sequential) messages are slightly different: each message is sent to a queue chosen by hashing its business key, so a hot business key can also create a hot spot in the Broker cluster. In addition, more load-balancing algorithms can be built on the metadata to fit business needs, such as a same-data-center-first algorithm that reduces latency and improves performance in multi-data-center deployments.
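A minimal sketch of the two producer-side strategies just described: round-robin for normal messages and key hashing for ordered messages. `QueueSelector` is a hypothetical helper, not the RocketMQ producer API, and `zlib.crc32` is an assumed choice of hash; it is used instead of Python's built-in `hash()` because string hashing in Python is randomized per process, which would break the "same key, same queue" guarantee across restarts.

```python
import itertools
import zlib
from typing import List, Optional, Tuple

Queue = Tuple[str, int]  # (broker name, queue id)


class QueueSelector:
    """Hypothetical producer-side queue selection."""

    def __init__(self, queues: List[Queue]):
        self.queues = list(queues)
        self._next = itertools.cycle(range(len(self.queues)))

    def select(self, sharding_key: Optional[str] = None) -> Queue:
        if sharding_key is None:
            # Normal messages: round-robin across all queues of the topic.
            return self.queues[next(self._next)]
        # Ordered messages: a stable hash of the business key, so every message with
        # the same key lands in the same queue (and a hot key makes a hot queue).
        index = zlib.crc32(sharding_key.encode("utf-8")) % len(self.queues)
        return self.queues[index]


selector = QueueSelector([("broker1", 0), ("broker2", 0)])
print([selector.select() for _ in range(4)])
# [('broker1', 0), ('broker2', 0), ('broker1', 0), ('broker2', 0)]
print(selector.select("order-1001") == selector.select("order-1001"))  # True
```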

Consumer load balancing comes in two forms: queue-level load balancing and message-level load balancing.

The classic mode is queue-level load balancing. Each consumer knows the total number of queues in the topic and the number of instances in its Consumer Group, and uses a common allocation algorithm (for example, one similar to consistent hashing) to bind each consumer instance to specific queues. Each instance consumes only the messages in its bound queues, and each queue's messages are consumed by exactly one instance. The main drawbacks of this mode are load imbalance and the transient state created by binding consumers to queues. With three queues and two consumer instances, one consumer must handle 2/3 of the data; with four consumers, the fourth one sits idle.

RocketMQ 5.0 therefore introduces message-level load balancing, which does not bind queues to consumers: messages are distributed randomly across the consumer cluster, keeping its load balanced. More importantly, this mode fits the future serverless trend better, because the number of Broker machines, the number of topic queues, and the number of consumer instances are fully decoupled and can each be scaled in or out independently.
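The imbalance described above can be reproduced with a small sketch. It assumes a simple average (contiguous block) allocation rather than the consistent-hash-like schemes the text mentions, and it models message-level balancing as plain round-robin even though the text describes random distribution; both function names are hypothetical.

```python
from collections import Counter
from typing import List


def allocate_queues(queues: List[int], consumers: List[str], me: str) -> List[int]:
    """Queue-level balancing: every consumer runs the same deterministic
    algorithm on the same sorted inputs, so all instances agree on the
    assignment without coordinating. Uses an average (contiguous block) split."""
    queues, consumers = sorted(queues), sorted(consumers)
    idx = consumers.index(me)
    base, extra = divmod(len(queues), len(consumers))
    start = idx * base + min(idx, extra)
    size = base + (1 if idx < extra else 0)
    return queues[start:start + size]


def dispatch_per_message(message_count: int, consumers: List[str]) -> Counter:
    """Message-level balancing (simplified as round-robin here): any consumer
    may receive any message, independent of how many queues exist."""
    return Counter(consumers[i % len(consumers)] for i in range(message_count))


queues = [0, 1, 2]
print({c: allocate_queues(queues, ["c1", "c2"], c) for c in ["c1", "c2"]})
# {'c1': [0, 1], 'c2': [2]}  -> c1 handles 2/3 of the data
print(allocate_queues(queues, ["c1", "c2", "c3", "c4"], "c4"))  # [] -> c4 sits idle
print(dispatch_per_message(300, ["c1", "c2", "c3", "c4"]))
```

With per-message dispatch, adding a fourth consumer immediately gives it a quarter of the traffic, even though the topic still has only three queues.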

05 Storage system

With the architecture overview and the service discovery mechanism covered, we now have a fairly global understanding of RocketMQ. Next we go deeper into its storage system, which plays a decisive role in RocketMQ's performance, cost, and availability. The core of RocketMQ storage consists of the commitlog, ConsumeQueue, and index files.

A message is first written to the commitlog, then flushed to disk and replicated to the slave node for durability. The commitlog is the source of truth for RocketMQ storage, and the complete set of message indexes can be built from it.

Unlike Kafka, which keeps a separate log for each partition, RocketMQ writes the data of all topics into the commitlog, maximizing sequential I/O. This is why a single RocketMQ machine can support tens of thousands of topics.

After a message is written to the commitlog, RocketMQ asynchronously dispatches it to several indexes. The first is the ConsumeQueue index, which corresponds to a MessageQueue. It enables precise positioning: a message can be located by topic, queue ID, and position in the queue. Message backtracking (re-consuming from an earlier position) is also built on this capability.
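The commitlog/ConsumeQueue split can be modeled with a toy store: one append-only log shared by all topics, plus a fixed-width index per (topic, queue ID). The 20-byte entry layout below (8-byte commitlog offset, 4-byte size, 8-byte tag hash) follows RocketMQ's ConsumeQueue format, but everything else is a simplification and the class name is hypothetical: only message bodies are stored, and the index dispatch is synchronous instead of asynchronous.

```python
import struct
import zlib
from typing import Dict, Tuple

# One ConsumeQueue entry: commitlog offset (8 bytes), message size (4 bytes),
# tag hash code (8 bytes) -> 20 bytes, big-endian, no padding.
CQ_ENTRY = struct.Struct(">qiq")


def tag_hash(tag: str) -> int:
    return zlib.crc32(tag.encode("utf-8"))


class MiniStore:
    """Toy model: one append-only commitlog shared by all topics,
    plus one fixed-width ConsumeQueue index per (topic, queue id)."""

    def __init__(self) -> None:
        self.commitlog = bytearray()
        self.consume_queues: Dict[Tuple[str, int], bytearray] = {}

    def put(self, topic: str, queue_id: int, tag: str, body: bytes) -> int:
        # 1. Sequential append to the single commitlog (the source of truth).
        phys_offset = len(self.commitlog)
        self.commitlog += body
        # 2. Dispatch an index entry. RocketMQ does this asynchronously;
        #    it is done inline here for brevity.
        cq = self.consume_queues.setdefault((topic, queue_id), bytearray())
        cq += CQ_ENTRY.pack(phys_offset, len(body), tag_hash(tag))
        return len(cq) // CQ_ENTRY.size - 1  # logical offset within the queue

    def get(self, topic: str, queue_id: int, queue_offset: int) -> bytes:
        # Locate by (topic, queue id, queue offset): fixed-width entries make
        # this a direct seek; reading from an older offset is backtracking.
        cq = self.consume_queues[(topic, queue_id)]
        phys_offset, size, _ = CQ_ENTRY.unpack_from(cq, queue_offset * CQ_ENTRY.size)
        return bytes(self.commitlog[phys_offset:phys_offset + size])


store = MiniStore()
store.put("Trade", 0, "paid", b"order-1 paid")
store.put("Cart", 0, "add", b"sku-9 added")
store.put("Trade", 0, "paid", b"order-2 paid")
print(store.get("Trade", 0, 1))  # b'order-2 paid'
```

Because writes from every topic interleave in one file, the disk sees a single sequential write stream no matter how many topics exist; the per-queue index is small and cheap to rebuild from the commitlog.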

Another very important index is the hash index, which is the basis of message observability. A persistent hash table provides queries by the message's business key, and the message trace feature is mainly built on this capability.
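A persistent hash index of this kind can be built from a fixed slot table plus an append-only list of entries, where each entry links to the previous entry that landed in the same slot. The sketch below is loosely modeled on that layout and is only an in-memory illustration; the `HashIndex` name, slot count, and CRC32 key hash are assumptions.

```python
import zlib
from typing import List, Tuple


class HashIndex:
    """Toy persistent-style hash index: a fixed slot table plus an append-only
    entry list; each entry links to the previous entry in the same slot."""

    def __init__(self, slot_count: int = 8) -> None:
        self.slots = [-1] * slot_count                 # slot -> newest entry index, -1 = empty
        self.entries: List[Tuple[int, int, int]] = []  # (key hash, commitlog offset, prev entry)

    @staticmethod
    def _hash(key: str) -> int:
        return zlib.crc32(key.encode("utf-8"))

    def put(self, key: str, commitlog_offset: int) -> None:
        h = self._hash(key)
        slot = h % len(self.slots)
        self.entries.append((h, commitlog_offset, self.slots[slot]))
        self.slots[slot] = len(self.entries) - 1

    def query(self, key: str) -> List[int]:
        """Return candidate commitlog offsets, newest first. Different keys can share
        a hash, so a caller would re-check the key on the message it reads back."""
        h = self._hash(key)
        i = self.slots[h % len(self.slots)]
        offsets = []
        while i != -1:
            entry_hash, offset, prev = self.entries[i]
            if entry_hash == h:
                offsets.append(offset)
            i = prev
        return offsets


index = HashIndex()
index.put("order-1", 0)
index.put("order-2", 120)
index.put("order-1", 240)  # same business key sent twice (e.g. a retry)
print(index.query("order-1"))  # [240, 0]
```

Append-only entries with backward links suit a file-based index: writes never rewrite old entries, only the slot pointer changes.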

Besides the messages themselves, the Broker also stores message metadata. The topics file records which topics the Broker serves and, for each topic, its number of queues, read/write permissions, ordering attributes, and so on. The subscription and consumer offset files record each topic's subscription relationships and each consumer's consumption progress. The abort and checkpoint files are used for recovery after a restart to ensure data integrity.

06 Topic high availability

So far we have looked at the RocketMQ storage engine, including the commitlog and indexes, from the functional perspective of a single machine. Now let's step back and examine RocketMQ's high availability from the cluster perspective.

High availability in RocketMQ means that when some NameServer or Broker nodes in the cluster are unavailable, a given topic remains readable and writable.

RocketMQ can deal with three types of failure scenarios.

Scenario 1: A single broker pair is unavailable

For example, when the primary node of Broker 2 is down but its standby node is available, TopicA is still readable and writable: Shard 1 is readable and writable, Shard 2 is readable but not writable, and TopicA's unread messages in Shard 2 can still be consumed. In short, as long as a Broker group retains at least one live node, the topic's read/write availability is not affected. If both the primary and the standby of a group are down, reading and writing new data on the topic is still unaffected, but the unread messages in that group are delayed and can be consumed only after either its primary or standby Broker restarts.
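The availability rules of Scenario 1 fit in a few lines: a shard is writable only while its primary is up and readable while either node is up, and the topic is writable while any shard is. This is a hypothetical model of the rules stated above (function names and the shard map are assumptions), not RocketMQ code.

```python
from typing import Dict, Tuple


def shard_state(primary_up: bool, standby_up: bool) -> Tuple[bool, bool]:
    """(readable, writable) for one shard served by a primary/standby pair:
    writes need the primary; reads can be served by either node."""
    return primary_up or standby_up, primary_up


def topic_state(shards: Dict[str, Tuple[bool, bool]]) -> Dict[str, object]:
    states = {name: shard_state(*nodes) for name, nodes in shards.items()}
    return {
        "topic_writable": any(w for _, w in states.values()),
        "writable_shards": sorted(n for n, (_, w) in states.items() if w),
        "unreadable_shards": sorted(n for n, (r, _) in states.items() if not r),
    }


# Broker 2's primary is down, its standby is alive.
print(topic_state({"shard1": (True, True), "shard2": (False, True)}))
# {'topic_writable': True, 'writable_shards': ['shard1'], 'unreadable_shards': []}
# Both nodes of Broker 2 are down: new writes still succeed on shard1,
# but shard2's unread messages wait until one of its nodes restarts.
print(topic_state({"shard1": (True, True), "shard2": (False, False)}))
# {'topic_writable': True, 'writable_shards': ['shard1'], 'unreadable_shards': ['shard2']}
```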

Scenario 2: Part of the NameServer cluster is unavailable

Because the NameServer uses a shared-nothing architecture, each node is stateless and the registry works in AP mode without relying on a majority (quorum) algorithm. As long as one NameServer survives, service discovery works normally and the topic's read/write availability is not affected.

Scenario 3: All NameServers are unavailable

Because the RocketMQ SDK caches service-discovery metadata, it can keep sending and receiving messages with the current topic metadata as long as the SDK is not restarted.
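Scenarios 2 and 3 together suggest a simple client-side lookup strategy: ask a randomly ordered list of NameServers until one answers, and fall back to the cached route only when none does. The sketch below is hypothetical (`RouteClient` and the injected `fetch` callable are assumptions); the cache lives in memory, so, as the text notes, it is lost if the client restarts.

```python
import random
from typing import Callable, Dict, List, Optional


class RouteClient:
    """Hypothetical client-side route lookup: ask a random NameServer, fall back to
    the next one on failure, and serve cached routes if every NameServer is down."""

    def __init__(self, nameservers: List[str],
                 fetch: Callable[[str, str], dict],
                 rng: Optional[random.Random] = None) -> None:
        self.nameservers = list(nameservers)
        self.fetch = fetch              # fetch(nameserver, topic) -> route; raises ConnectionError
        self.cache: Dict[str, dict] = {}
        self.rng = rng or random.Random()

    def get_route(self, topic: str) -> dict:
        candidates = self.nameservers[:]
        self.rng.shuffle(candidates)    # NameServers are peers: any one will do
        for ns in candidates:
            try:
                route = self.fetch(ns, topic)
            except ConnectionError:
                continue                # try another surviving NameServer
            self.cache[topic] = route
            return route
        if topic in self.cache:         # all NameServers down: use the cached metadata
            return self.cache[topic]
        raise LookupError(f"no route available for {topic}")


up = {"ns1", "ns2"}
routes = {"topicA": {"broker1": 1, "broker2": 1}}


def fake_fetch(ns: str, topic: str) -> dict:
    if ns not in up:
        raise ConnectionError(ns)
    return routes[topic]


client = RouteClient(["ns1", "ns2", "ns3"], fake_fetch, random.Random(7))
print(client.get_route("topicA"))  # served by ns1 or ns2
up.clear()                         # every NameServer becomes unreachable
print(client.get_route("topicA"))  # still served, from the local cache
```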

07 MessageQueue high availability: basic concepts

The previous section explained how topic-level high availability works. Its implementation shows that although a topic stays readable and writable, the number of its readable and writable queues changes. A changing queue count affects some data integration services. In binlog synchronization between heterogeneous databases, for example, binlog changes for the same record may be written to different queues, so replaying them may happen out of order and produce dirty data. The existing high availability therefore needs to be strengthened: when some nodes are unavailable, not only must the topic remain readable and writable, but its number of readable and writable queues must stay unchanged, and each specific queue must remain readable and writable as well.
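The binlog problem can be reproduced deterministically. The sketch assumes that records are routed by `record_id % writable_queue_count`, a simple key-based scheme chosen for illustration; the function names are hypothetical. When one queue becomes unwritable between two updates of the same record, the updates land in different queues, and independent replay can apply them out of order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (record id, new value)


def queue_for(record_id: int, queue_count: int) -> int:
    # Key-based routing: same record -> same queue, *only while queue_count is fixed*.
    return record_id % queue_count


def route(events: List[Tuple[int, str, int]]) -> Dict[int, List[Event]]:
    """events: (record id, new value, writable queue count at send time)."""
    queues: Dict[int, List[Event]] = defaultdict(list)
    for record_id, value, queue_count in events:
        queues[queue_for(record_id, queue_count)].append((record_id, value))
    return dict(queues)


# Two binlog updates of record 5; one queue becomes unwritable between them (2 -> 1).
queues = route([(5, "v1", 2), (5, "v2", 1)])
print(queues)  # {1: [(5, 'v1')], 0: [(5, 'v2')]}

# Queues are consumed independently; if queue 0 happens to be replayed first,
# the stale value v1 is applied last and overwrites v2.
replayed = queues[0] + queues[1]
print(replayed[-1])  # (5, 'v1')
```

Keeping the queue count constant during failures, which is the goal of the mechanism described next, keeps `queue_for` stable and preserves per-record ordering.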

As shown in the figure below, if any single NameServer or Broker node is unavailable, Topic A still keeps two queues, and each queue remains readable and writable.

5.0 HA Features

To address the scenarios above, RocketMQ 5.0 introduces a new high-availability mechanism. Its core concepts are as follows:

DLedger Controller: a strongly consistent metadata component based on the Raft protocol; it executes master-election commands and maintains state-machine information.
SyncStateSet: the set of replicas that are in sync. Nodes in this set have complete data, and when the master goes down, the new master is elected from this set.
Replication: handles data replication between replicas, data verification, truncation and alignment, and so on.
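The role of the SyncStateSet in master election can be sketched as follows. This is a hypothetical controller-side model, not the DLedger Controller's implementation: the alphabetical tie-break and the choice to refuse election when no in-sync replica is alive are assumptions made for illustration. Each election also increments an epoch counter, the concept that the list of advantages below relies on for data verification and truncation.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ReplicaGroup:
    """Hypothetical controller-side view of one Broker replica group."""
    master: str
    epoch: int
    sync_state_set: Set[str]   # replicas known to hold complete data
    alive: Set[str]            # replicas currently reachable

    def replica_caught_up(self, replica: str) -> None:
        self.sync_state_set.add(replica)

    def replica_fell_behind(self, replica: str) -> None:
        if replica != self.master:
            self.sync_state_set.discard(replica)

    def elect(self) -> Optional[str]:
        """Called when the master is lost: only live members of the SyncStateSet
        are eligible, so the new master is guaranteed to have complete data."""
        old = self.master
        candidates = sorted((self.sync_state_set & self.alive) - {old})
        if not candidates:
            return None               # no in-sync replica alive: refuse rather than lose data
        self.master = candidates[0]   # tie-break by name: an arbitrary choice for this sketch
        self.epoch += 1               # a new epoch marks the change of leadership
        self.sync_state_set.discard(old)
        return self.master


group = ReplicaGroup("broker-a-0", 1, {"broker-a-0", "broker-a-1"},
                     alive={"broker-a-1", "broker-a-2"})
print(group.elect(), group.epoch)  # broker-a-1 2
```

`broker-a-2` is alive but not in the SyncStateSet, so it is never considered: promoting it could silently drop messages it had not yet replicated.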

The following figure gives a panorama of the 5.0 HA architecture. The new high-availability architecture has several advantages:

Epoch and start-offset information is introduced into message storage. Data verification and truncation alignment are performed based on these two values, which simplifies the data consistency logic when building replica groups.
Because it is built on the DLedger Controller, there is no need to introduce an external distributed consensus system such as ZooKeeper or etcd, and the DLedger Controller can also be co-deployed with the NameServer to simplify operations and save machine resources.
RocketMQ depends only weakly on the DLedger Controller. Even if the controller is entirely unavailable, only master election is affected; normal message sending and receiving continue.
It is customizable: users can balance data reliability, performance, and cost according to their business. For example, the number of replicas can be 2, 3, or 4, and replication between replicas can be synchronous or asynchronous. The 2-2 mode means 2 replicas with data replicated synchronously between them; the 2-3 mode means 3 replicas, where a message is considered persisted as soon as 2 replicas have written it successfully. Users can also place one replica in a remote data center with asynchronous replication for disaster recovery. As shown below:

08 Cloud Native Storage - Object Storage

The storage systems described above are all RocketMQ implementations on local file systems. In the cloud-native era, deploying RocketMQ in a cloud environment lets it use cloud-native infrastructure, such as cloud storage, to further enhance its storage capabilities. RocketMQ 5.0 provides multi-level (tiered) storage, a kernel-level storage extension that extends the Commitlog, ConsumeQueue, and IndexFile onto object storage. Thanks to its plug-in design, multi-level storage can have multiple implementations: on Alibaba Cloud it can be built on the OSS object storage service, and on AWS on the S3 interface.
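The plug-in design can be illustrated with a tiny backend interface and an in-memory stand-in for an object store. `TieredBackend`, `InMemoryObjectStore`, `upload_segment`, and the per-topic key layout are all hypothetical; a real OSS or S3 plug-in would implement the same two operations against the provider's SDK.

```python
from abc import ABC, abstractmethod
from typing import Dict


class TieredBackend(ABC):
    """Hypothetical plug-in interface for the remote tier (e.g. OSS or S3)."""

    @abstractmethod
    def put_object(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get_range(self, key: str, start: int, length: int) -> bytes: ...


class InMemoryObjectStore(TieredBackend):
    """Stand-in for an object store, useful for tests."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put_object(self, key: str, data: bytes) -> None:
        self.objects[key] = bytes(data)

    def get_range(self, key: str, start: int, length: int) -> bytes:
        return self.objects[key][start:start + length]


def upload_segment(backend: TieredBackend, topic: str, queue_id: int,
                   base_offset: int, segment: bytes) -> str:
    """Upload one sealed commitlog segment of a topic queue under a per-topic key."""
    key = f"{topic}/{queue_id}/commitlog/{base_offset:020d}"
    backend.put_object(key, segment)
    return key


store = InMemoryObjectStore()
key = upload_segment(store, "Trade", 0, 0, b"order-1 paid|order-2 paid")
print(key)                            # Trade/0/commitlog/00000000000000000000
print(store.get_range(key, 13, 12))   # b'order-2 paid'
```

Swapping storage providers then means supplying another `TieredBackend` subclass; the code that decides what to upload does not change.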

Introducing cloud-native storage brings RocketMQ several benefits.

The first is unlimited storage capacity: message storage is no longer limited by local disk space. Messages that could previously be retained for only a few days can now be kept for months or even a year. Object storage is also the lowest-cost storage system in the industry, which makes it especially suitable for cold data.

The second is per-topic TTL. Previously, the lifecycles of multiple topics were bound to the shared Commitlog and had one uniform retention time. Now each topic stores its Commitlog data in independent objects, so each topic can have its own TTL.
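Once each topic owns its own objects, expiry becomes a per-topic decision. The sketch below assumes each object is tagged with its topic and last write time, and that topics without an explicit TTL fall back to a default; the function name, object keys, and TTL values are hypothetical.

```python
from typing import Dict, List, Tuple

DAY = 24 * 3600


def expired_objects(objects: List[Tuple[str, str, float]],
                    ttl_by_topic: Dict[str, float],
                    now: float, default_ttl: float) -> List[str]:
    """objects: (topic, object key, time of last write). Because each topic owns
    its own objects, each can be expired against that topic's own TTL."""
    return [key for topic, key, written_at in objects
            if now - written_at > ttl_by_topic.get(topic, default_ttl)]


objects = [("Trade", "Trade/0/commitlog/a", 0.0),
           ("Cart", "Cart/0/commitlog/a", 0.0)]
ttl = {"Trade": 365 * DAY, "Cart": 3 * DAY}
print(expired_objects(objects, ttl, now=10 * DAY, default_ttl=7 * DAY))
# ['Cart/0/commitlog/a']
```

With a shared commitlog this would not be possible, since one file mixes data from topics with different retention needs.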

The third is a further separation of storage and compute within the storage system itself, which decouples scaling storage throughput from scaling storage capacity.

The fourth is hot/cold data isolation: separate read paths for hot and cold data greatly improve cold-read performance without affecting online services.
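A minimal sketch of the read-path split, assuming that the oldest offset still held on local disk marks the boundary between hot and cold data. `TieredReader` and `local_min_offset` are hypothetical names, and the sketch shows only the routing decision, not how each path is provisioned.

```python
from typing import Callable


class TieredReader:
    """Hypothetical read routing for hot/cold isolation: recent (hot) offsets are
    served from local storage, older (cold) offsets from object storage, so the
    two kinds of reads go through separate paths."""

    def __init__(self, local_min_offset: int,
                 read_local: Callable[[int], bytes],
                 read_remote: Callable[[int], bytes]) -> None:
        self.local_min_offset = local_min_offset  # oldest offset still on local disk
        self.read_local = read_local
        self.read_remote = read_remote

    def read(self, offset: int) -> bytes:
        if offset >= self.local_min_offset:
            return self.read_local(offset)
        return self.read_remote(offset)


reader = TieredReader(1000,
                      read_local=lambda off: b"hot:%d" % off,
                      read_remote=lambda off: b"cold:%d" % off)
print(reader.read(1500))  # b'hot:1500'
print(reader.read(10))    # b'cold:10'
```

A consumer replaying months-old messages then hits only the remote path and does not evict hot data from the local page cache that online consumers depend on.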

09 Summary

RocketMQ overall architecture:

RocketMQ load balancing: AP priority, split mode, horizontal expansion, load granularity;
RocketMQ storage design: storage engine, high availability, cloud storage.

Author: Longji



This article is the original content of Alibaba Cloud, and shall not be reproduced without permission