Message queues: basic principles and a selection comparison

Message queue usage scenarios

Message queue middleware is an important component in distributed systems. It mainly addresses application coupling, asynchronous messaging, and peak shaving and valley filling (smoothing traffic spikes), and it helps build an architecture that is high-performance, highly available, scalable, and eventually consistent.

Decoupling: Multiple services listen to and process the same message, which spares the publisher from making multiple RPC calls.

Asynchronous messages: The message publisher does not have to wait for the result of message processing.

Peak shaving and valley filling: In high-traffic, write-heavy scenarios, the queue absorbs traffic on behalf of downstream I/O services. Of course, under very heavy traffic, other solutions are needed as well.

Message-driven framework: On an event bus, services drive one another by listening for event messages and performing the corresponding actions.

Message queue patterns

Point-to-point mode, no repeated consumption

Multiple producers can send messages to the same message queue. Once a message has been successfully consumed by one consumer, it is removed from the queue and no other consumer can process it. If the consumer fails to process a message, the message is made available for consumption again.
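
The point-to-point behaviour above can be illustrated with a minimal in-memory sketch. The class name PointToPointQueue and its send/receive/ack/nack methods are hypothetical names chosen for this example; they are not the API of any particular message queue product.

```python
from collections import deque


class PointToPointQueue:
    """In-memory sketch: each message goes to one consumer; failed messages are redelivered."""

    def __init__(self):
        self._ready = deque()      # messages waiting for delivery
        self._in_flight = {}       # delivery_id -> message awaiting acknowledgement
        self._next_id = 0

    def send(self, message):
        self._ready.append(message)

    def receive(self):
        """Hand the next message to exactly one consumer, or return None if empty."""
        if not self._ready:
            return None
        self._next_id += 1
        message = self._ready.popleft()
        self._in_flight[self._next_id] = message
        return self._next_id, message

    def ack(self, delivery_id):
        """Processing succeeded: remove the message for good."""
        del self._in_flight[delivery_id]

    def nack(self, delivery_id):
        """Processing failed: put the message back so it is consumed again."""
        self._ready.appendleft(self._in_flight.pop(delivery_id))


queue = PointToPointQueue()
queue.send("order-1")
delivery_id, message = queue.receive()   # one consumer takes the message
queue.nack(delivery_id)                  # it fails, so the message is available again
delivery_id, message = queue.receive()   # the message is delivered a second time
queue.ack(delivery_id)                   # success: no other consumer will see it
```

While a message is in flight it is invisible to other consumers, which is what prevents repeated consumption in the success case.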

Publish/subscribe mode

In the publish/subscribe model, consumers must register a subscription, and they consume the messages that match it. Multiple producers can write messages to the same Topic, and a single consumer can consume messages from multiple producers. Messages from one producer can also be consumed by multiple consumers, as long as each of them has subscribed to the Topic.
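
As a rough illustration of the fan-out described above, the following sketch delivers each published message to every subscriber of a topic. TopicBroker and its methods are hypothetical names used only for this example.

```python
from collections import defaultdict


class TopicBroker:
    """In-memory sketch of publish/subscribe: every subscriber gets its own copy."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every subscriber; return how many received it."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = TopicBroker()
billing_inbox, email_inbox = [], []
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", email_inbox.append)
broker.publish("orders", "order-1 created")    # both inboxes receive the message
broker.publish("refunds", "refund-9 issued")   # no subscribers, so nobody receives it
```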

Selection reference

Message order: Whether messages sent to the queue are guaranteed to be consumed in the order in which they were sent;
Scaling: When the queue has a performance problem, such as consumption that is too slow, whether it can be scaled out quickly; when there are too many consumer queues and system resources are wasted, whether it can be scaled in;
Message retention: Whether a message is still retained in the queue after it has been consumed successfully;
Fault tolerance: When a message fails to be consumed, whether there is a mechanism to ensure that it is eventually consumed successfully. For example, for an asynchronous third-party refund, the refund to the user can only be considered successful once the message has been consumed, so successful consumption of that message must be guaranteed;
Message reliability: Whether messages can be lost. For example, of two messages A and B, only B is consumed and A is lost;
Message timing: Mainly covers "message survival time" (TTL) and "delayed messages";
Throughput: The number of messages that can be processed per unit of time, and the level of concurrency supported;
Message routing: Consumers subscribe only to messages that match their routing rules. For example, if messages carry either rule A or rule B, a consumer that subscribes only to A receives A messages and never consumes B messages.
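
To make the "message timing" criterion concrete, here is a small sketch of a queue that supports both a delivery delay and a survival time. TimedQueue is a hypothetical class, the clock is passed in explicitly so the behaviour is deterministic, and counting the TTL from the moment a message becomes visible is an assumption of this sketch rather than a rule of any specific product.

```python
import heapq
import itertools


class TimedQueue:
    """Sketch of 'delayed message' and 'message survival time' with an explicit clock."""

    def __init__(self):
        self._heap = []                 # (visible_at, seq, expires_at, message)
        self._seq = itertools.count()   # tie-breaker keeps FIFO order for equal times

    def send(self, message, now, delay=0.0, ttl=float("inf")):
        visible_at = now + delay
        # Assumption: the TTL starts counting when the message becomes visible.
        heapq.heappush(self._heap, (visible_at, next(self._seq), visible_at + ttl, message))

    def poll(self, now):
        """Return the next visible, unexpired message, or None."""
        while self._heap and self._heap[0][0] <= now:
            _, _, expires_at, message = heapq.heappop(self._heap)
            if now < expires_at:
                return message
            # Expired messages are dropped and the loop tries the next one.
        return None


timed = TimedQueue()
timed.send("send reminder", now=0, delay=10)   # not visible before t=10
timed.send("flash offer", now=0, ttl=5)        # discarded if not read before t=5
```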

Kafka

Kafka is an open source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of this project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "large-scale publish/subscribe message queue based on a distributed transaction log architecture," making it valuable as an enterprise-grade infrastructure for processing streaming data. (Wikipedia)

Basic terminology

Producer : message producer. A producer usually sends a message to a specific topic. By default, messages are distributed across the topic's partitions in round-robin fashion. A producer can also direct messages to a specific partition by setting a message key. The more evenly data is spread across partitions, the better Kafka performs.

Topic : A Topic is a logical (virtual) concept. A cluster can have multiple topics, each identifying one category of messages. Producers send messages to a topic, and consumers read the messages in its partitions by subscribing to the topic.

Partition : A Partition is a physical concept; a Topic corresponds to one or more Partitions. New messages are appended to a partition, and messages within the same Partition are ordered. Partitioning gives Kafka message redundancy and scalability and allows physically concurrent reads and writes, which greatly improves throughput.

Replicas : A Partition has multiple Replicas, which are stored on brokers. Each broker may store hundreds or thousands of replicas belonging to different topics and partitions. There are two types of replica. The leader replica: each Partition has one leader, and all writes and reads go through it. Follower replicas: they do not handle client requests and only replicate the leader's content. If the leader fails, one of the followers quickly becomes the new leader.

Consumer : message reader. Consumers subscribe to topics and read the messages of each partition in order. Within a consumer group, Kafka guarantees that each partition is consumed by only one consumer.

Offset : An offset is a piece of metadata, a monotonically increasing integer that Kafka attaches to each message as it is written. Offsets are unique within a partition. During consumption, the last-read offset is stored in Kafka, so it is not lost when the consumer shuts down; after a restart, consumption continues from the last position.

Broker : An independent Kafka server. If a Topic has N Partitions and the cluster has N Brokers, each Broker stores one Partition of the Topic. If the cluster has (N+M) Brokers, N of them store one Partition each and the remaining M store no partition data for that Topic. If the cluster has fewer than N Brokers, some Brokers store more than one Partition of the Topic. Try to avoid this last situation in production, because it easily leads to unbalanced data across the Kafka cluster.
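
The partition choice described under Producer can be sketched as follows. This is only an illustration: the Partitioner class is hypothetical, and zlib.crc32 stands in for whatever hash function a real Kafka client uses.

```python
import itertools
import zlib


class Partitioner:
    """Sketch: keyless messages rotate over partitions, keyed messages are hashed."""

    def __init__(self, num_partitions):
        self._num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key=None):
        if key is None:
            # No key: spread messages evenly across partitions.
            return next(self._round_robin)
        # Same key -> same partition, so ordering per key is preserved.
        # crc32 is a stand-in hash chosen for this sketch.
        return zlib.crc32(key.encode("utf-8")) % self._num_partitions


partitioner = Partitioner(num_partitions=3)
keyless = [partitioner.partition_for() for _ in range(6)]   # cycles 0, 1, 2, 0, 1, 2
keyed = partitioner.partition_for("user-42")                # stable for this key
```

Hashing by key trades even distribution for per-key ordering: a very popular key concentrates its traffic on one partition.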

System framework

The first topic has two producers. New messages are written to partition 1 or partition 2, and both partitions have replicas on broker1 and broker2. After new messages are written, the two follower replicas synchronize the changes from the two leader replicas. The corresponding consumer fetches messages from the two leader partitions based on its current offset and then updates the offset. The second topic has only one producer; it also has two partitions, distributed across the two brokers of the Kafka cluster. When new messages are written, the two follower replicas synchronize the leaders' changes, and the two consumers fetch messages from different leader partitions.
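
The "fetch from the current offset, then update the offset" step can be sketched with a single in-memory partition. PartitionLog and its methods are hypothetical names for this example and do not correspond to the Kafka client API.

```python
class PartitionLog:
    """Append-only log for one partition; the offset is the index of a message."""

    def __init__(self):
        self._messages = []
        self._committed = {}   # consumer group -> next offset to read

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1    # offset assigned at write time

    def poll(self, group, max_messages=10):
        """Return (offset, message) pairs starting at the group's committed offset."""
        start = self._committed.get(group, 0)
        return list(enumerate(self._messages[start:start + max_messages], start))

    def commit(self, group, next_offset):
        self._committed[group] = next_offset


log = PartitionLog()
for text in ["a", "b", "c"]:
    log.append(text)

batch = log.poll("billing", max_messages=2)   # reads offsets 0 and 1
log.commit("billing", batch[-1][0] + 1)       # remember that offset 2 is next
# A restarted consumer in the same group continues from the committed position.
resumed = log.poll("billing")
```

Because the committed position is stored per group, a different group reading the same partition starts from its own offset and is unaffected.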

Advantages

High throughput, low latency : Kafka can process hundreds of thousands of messages per second, and its latency is as low as a few milliseconds;

Scalability : a Kafka cluster can be expanded online, without downtime;

Durability and reliability : Messages are persisted to the local disk, and data backup is supported to prevent data loss;

Fault tolerance : nodes in the cluster are allowed to fail; because each piece of data has multiple replicas, a few machines can go down without data being lost;

High concurrency : supports thousands of clients reading and writing at the same time.

Disadvantages

Partition ordering : ordering is only guaranteed within the same partition, and global ordering cannot be achieved;

No delayed messages : messages are consumed in the order in which they were written, and delayed messages are not supported;

Repeated consumption : if the consuming system crashes or restarts before its offset is committed, the messages processed since the last commit are consumed again;

Rebalance : During the Rebalance process, all consumer instances under the consumer group will stop working and wait for the Rebalance process to complete.
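
The repeated-consumption problem above is commonly handled by making the consumer idempotent. The following sketch simulates a crash before the offset commit and shows that remembering processed message ids prevents the side effect from running twice. The function consume and its parameters are hypothetical; in practice the set of processed ids would have to live in durable storage.

```python
def consume(messages, committed_offset, processed_ids, crash_before_commit=False):
    """Process messages from committed_offset; return (new committed offset, side effects)."""
    effects = []
    for message_id, payload in messages[committed_offset:]:
        if message_id not in processed_ids:     # idempotency check
            effects.append(payload)             # the real side effect would happen here
            processed_ids.add(message_id)
    if crash_before_commit:
        return committed_offset, effects        # the offset was never submitted
    return len(messages), effects


messages = [("m1", "charge order 1"), ("m2", "charge order 2")]
seen = set()
offset, first_run = consume(messages, 0, seen, crash_before_commit=True)
# After the restart both messages are delivered again, but neither is re-applied.
offset, second_run = consume(messages, offset, seen)
```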

Usage scenarios

Log collection : A large number of log messages are first written to Kafka, and the data service consumes Kafka messages to store the data;

Message system : decoupling producers and consumers, caching messages, etc.;

User activity tracking : Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. Servers publish this activity information to Kafka topics, and consumers subscribe to those topics for real-time monitoring and analysis or to save the data to a database;

Operational metrics : recording operational and monitoring data, including collecting data from distributed applications and producing centralized feedback on operations, such as alarms and reports;

Stream processing : for example, with Spark Streaming.

RabbitMQ

RabbitMQ is an open source message broker software (also known as message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP). The RabbitMQ server is written in Erlang, and its clustering and failover are built on the Open Telecom Platform (OTP) framework. Client libraries for communicating with the broker are available for all major programming languages. (Wikipedia)

Basic terminology

Broker : The entity that accepts client connections and implements AMQP message queuing and routing;

Virtual Host : It is a virtual concept and the smallest unit of permission control. A Virtual Host contains multiple Exchanges and Queues;

Exchange : Receives messages from message producers and forwards messages to queues. When sending messages, routing rules are determined according to different ExchangeTypes. Commonly used ExchangeTypes are: direct, fanout and topic;

Message Queue : Stores messages that have not yet been consumed;

Message : Consists of a Header and a Body. The Header holds attributes added by the producer, such as whether the Message is persistent, which MessageQueue should receive it, and its priority. The Body is the actual message content;

Binding : A Binding connects an Exchange to a Message Queue. While the server is running, it maintains a routing table that records each MessageQueue together with its BindingKey. When an Exchange receives a message, it reads the RoutingKey from the message and, based on the routing table and the ExchangeType, delivers the message to the matching MessageQueues. The ExchangeType determines how the RoutingKey is matched against the BindingKey;

Connection : TCP connection between Broker and client;

Channel : A TCP connection alone is not enough for the Broker and the client to exchange messages; a Channel must be created on top of it. The AMQP protocol stipulates that AMQP commands can only be executed through a Channel, and one Connection can contain multiple Channels. Channels exist because TCP connections are expensive: if every client thread that talks to the Broker kept its own TCP connection, the machine would waste resources. It is therefore generally recommended to share the Connection. RabbitMQ recommends that client threads not share a Channel; at the very least, threads that do share a Channel must send their messages serially;

Command : AMQP command. The client uses Command to complete the interaction with the AMQP server.

System framework

A Message reaches the corresponding Exchange through a Channel. After receiving the message, the Exchange reads the RoutingKey from the message Header and forwards the message to the matching MessageQueues based on the Bindings and the ExchangeType. The message is finally delivered to the consuming client over its Connection.

ExchangeType

Direct: exact match

A message queue receives the message only when the RoutingKey exactly matches the BindingKey;
The Broker provides a default Exchange of type Direct whose name is an empty string. Every Queue is bound to it, with the Queue name used as the BindingKey.

Fanout: Subscription, broadcast

This mode forwards the message to every Queue bound to the Exchange, ignoring the RoutingKey.
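
The Direct and Fanout rules can be sketched as a lookup over a binding table. The Exchange class below is a hypothetical model written for this article, not RabbitMQ client code, and it only decides which queue names should receive a message.

```python
class Exchange:
    """Sketch of routing for the direct and fanout exchange types."""

    def __init__(self, exchange_type):
        if exchange_type not in ("direct", "fanout"):
            raise ValueError("this sketch only models direct and fanout")
        self.exchange_type = exchange_type
        self._bindings = []            # list of (binding_key, queue_name)

    def bind(self, queue_name, binding_key=""):
        self._bindings.append((binding_key, queue_name))

    def route(self, routing_key):
        """Return the names of the queues that should receive the message."""
        if self.exchange_type == "fanout":
            # Fanout ignores the routing key and broadcasts to all bound queues.
            return [queue for _, queue in self._bindings]
        # Direct requires an exact match between routing key and binding key.
        return [queue for key, queue in self._bindings if key == routing_key]


logs = Exchange("direct")
logs.bind("errors_queue", "error")
logs.bind("all_logs_queue", "error")
logs.bind("all_logs_queue", "info")
error_targets = logs.route("error")    # both queues are bound with "error"
info_targets = logs.route("info")      # only all_logs_queue is bound with "info"
```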

Topic: wildcard pattern

RoutingKey is a string of words separated by periods "." (each period-delimited substring is called a word), such as "quick.orange.rabbit". BindingKey has the same format;
Two special characters in the BindingKey are used for fuzzy matching: "#" matches zero or more words, and "*" matches exactly one word.
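
A minimal sketch of this wildcard matching follows, assuming the AMQP semantics in which "*" stands for exactly one word and "#" for zero or more words. The function name topic_matches is hypothetical, and the recursive approach is chosen for readability rather than speed.

```python
def topic_matches(binding_key, routing_key):
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more words."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p, w):
        if p == len(pattern):
            return w == len(words)          # pattern used up: all words must be too
        if pattern[p] == "#":
            # '#' may absorb zero words (skip it) or one more word (stay on it).
            return match(p + 1, w) or (w < len(words) and match(p, w + 1))
        if w == len(words):
            return False                    # pattern still expects a word
        return pattern[p] in ("*", words[w]) and match(p + 1, w + 1)

    return match(0, 0)


matched = topic_matches("*.orange.*", "quick.orange.rabbit")   # three words, middle one fits
too_short = topic_matches("*.orange.*", "quick.orange")        # '*' needs a third word
```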

Advantages

Based on the AMQP protocol: alongside Qpid, RabbitMQ is one of the few message servers that implement the AMQP standard;
Robust, stable and easy to use;
Active community and complete documentation;
Support scheduled messages;
Pluggable authentication and authorization, with support for TLS and LDAP;
It supports querying messages based on message identifiers and querying messages based on message content.

Disadvantages

The server is written in Erlang, so the source code is hard for most teams to understand, which hinders secondary development and maintenance;
The interfaces and protocols are complex, and the learning and maintenance costs are high.

Summary

Erlang has concurrency advantages and better performance. Although the source code is complex, the community is highly active and can solve problems encountered during development;
If the business traffic is not large, you can choose RabbitMQ, which has relatively complete functions.

Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function computing. It separates computing from storage and supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers, with streaming data storage characteristics such as strong consistency, high throughput, low latency, and high scalability. It is promoted as the best solution for real-time message streaming, storage, and computing in the cloud-native era. Pulsar is a message queue system based on the pub-sub (publish-subscribe) model. (encyclopedia)

Basic terminology

Property : represents the tenant. Each property can represent a team, a function, or a product line. A property can contain multiple namespaces. Multi-tenancy is a resource isolation method that can improve resource utilization;

Namespace : The basic management unit of Pulsar. Permissions, message TTL, retention policies, and so on can be set at the namespace level, and all topics in a namespace inherit the same settings. There are two types of namespace: a local namespace, which is visible only within its own cluster, and a global namespace, which is visible across multiple clusters;

Producer : Data producer, responsible for creating messages and delivering them to Pulsar;

Consumer : Data consumer, connected to Pulsar to receive messages and process them accordingly;

Broker : A stateless proxy service, responsible for receiving messages, delivering messages, and cluster load balancing. It shields the client from the complexity of the server-side read and write path and plays an important role in ensuring data consistency and load balancing. The Broker does not persist metadata. It can be scaled out but not scaled in;

BookKeeper : Stateful, responsible for persistent storage of messages. When the cluster is expanded, Pulsar adds BookKeeper nodes (Bookies) and Segments (that is, BookKeeper Ledgers), and unlike Kafka it does not need to rebalance during expansion. After expansion, Fragments are striped across multiple Bookies; because Fragments of the same Ledger live on several Bookies, reads and writes jump between Bookies;

ZooKeeper : stores metadata of Pulsar and BookKeeper, cluster configuration and other information, and is responsible for coordination between clusters, service discovery, etc.;

Topic : used to transmit messages from producer to consumer. Pulsar has a leader Broker at the Topic level, which is said to have ownership of the Topic. All reads and writes for the Topic go through this Broker. Metadata such as the mapping between a Topic's Ledgers and Fragments is stored in ZooKeeper, and the Pulsar Broker needs to track these relationships in real time for its read and write paths;

Ledger : That is, a Segment. Pulsar's underlying data is stored on BookKeeper in the form of Ledgers. A Ledger is the smallest unit of deletion in Pulsar;

Fragment : Each Ledger consists of several Fragments.
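
The striping mentioned in the BookKeeper entry can be pictured with a small sketch. It assumes a round-robin placement in which each entry is written to write_quorum consecutive bookies of the ensemble, starting at a position that rotates with the entry id. The function name, the bookie names, and the parameter values are hypothetical and chosen for illustration only.

```python
def bookies_for_entry(entry_id, ensemble, write_quorum):
    """Pick the bookies that store one entry, rotating the start position per entry."""
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


ensemble = ["bookie-1", "bookie-2", "bookie-3"]   # hypothetical bookie names
placement = {
    entry_id: bookies_for_entry(entry_id, ensemble, write_quorum=2)
    for entry_id in range(4)
}
# Consecutive entries land on different bookie pairs, so a sequential read
# of the ledger moves from bookie to bookie.
```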

System framework

The framework diagram above illustrates two situations, capacity expansion and failover. Expansion: as business volume grows, a new Bookie N is added, and the subsequently written segments x and y are written to the newly added Bookie so that capacity stays balanced; the result is shown in the green module in the figure. Failover: if segment 4 on Bookie 2 fails, Pulsar's Topic immediately reselects Bookie 1 to handle reads and writes.

Broker is a stateless service that only serves data calculation and not storage, so Pulsar can be considered a distributed system based on Proxy.

Advantages

Flexible expansion
Seamless failure recovery
Support delayed messages
Built-in replication function for cross-regional replication such as disaster recovery
Supports two consumption models: stream (exclusive mode) and queue (shared mode)
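
The difference between the two consumption models can be sketched as a dispatch function. This is a simplified model with hypothetical names: in the exclusive case the extra consumers are simply left idle so that the two results can be compared side by side, and the shared case uses plain round-robin.

```python
import itertools


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for an 'exclusive' or 'shared' subscription."""
    received = {consumer: [] for consumer in consumers}
    if mode == "exclusive":
        # Stream model: a single consumer gets every message, in order.
        received[consumers[0]].extend(messages)
    elif mode == "shared":
        # Queue model: messages are spread round-robin, so ordering
        # across consumers is no longer guaranteed.
        for message, consumer in zip(messages, itertools.cycle(consumers)):
            received[consumer].append(message)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return received


stream_view = dispatch(["m1", "m2", "m3"], ["c1", "c2"], "exclusive")
queue_view = dispatch(["m1", "m2", "m3"], ["c1", "c2"], "shared")
```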

RocketMQ

RocketMQ is a distributed messaging and streaming data platform with low latency, high performance, high reliability, trillion-level capacity and flexible scalability. RocketMQ is the third generation distributed messaging middleware open sourced by Alibaba in 2012. (Wikipedia)

Basic terminology

Topic : A Topic can have 0, 1, or multiple producers sending messages to it. A producer can also send messages to different Topics at the same time. A Topic can also be subscribed by 0, 1, or multiple consumers;

Tag : a secondary classification of messages within a Topic, which gives users additional flexibility. A message may have no tag;

Producer : message producer;

Broker : Stores messages, as a lightweight queue organized by Topic, and forwards messages. Each Broker node maintains long-lived connections and heartbeats with all NameServer nodes and periodically registers its Topic information with the NameServer;

Consumer : Message consumer, responsible for receiving and consuming messages;

MessageQueue : The physical management unit of messages. A Topic can have multiple Queues. The introduction of Queues enables horizontal expansion capabilities;

NameServer : Responsible for managing metadata, including Topic and routing information. NameServer nodes do not communicate with one another;

Group : A group can subscribe to multiple topics. A ProducerGroup is a set of producers of the same kind, and a ConsumerGroup is a set of consumers of the same kind;

Offset : Storage units are accessed by Offset. All messages in RocketMQ are persistent, and each storage unit has a fixed length. Offset is a Java long, which in theory will not overflow within 100 years, so a MessageQueue can be treated as an infinitely long array with Offset as the subscript;

Consumer : supports two consumption modes: PUSH and PULL, and supports cluster consumption and broadcast consumption.
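
The Offset entry above treats a queue as an array of fixed-length units. The sketch below shows why a fixed length makes the offset work as a subscript: the byte position is just the offset multiplied by the unit size. The 20-byte layout used here (8-byte commit-log position, 4-byte message size, 8-byte tag hash) is an assumption modeled on RocketMQ's consume queue, and the function names are hypothetical.

```python
import struct

# Assumed layout of one fixed-length unit, big-endian, 20 bytes in total:
# 8-byte commit-log position, 4-byte message size, 8-byte tag hash.
UNIT = struct.Struct(">qiq")


def append_unit(index_file, commit_log_position, size, tag_hash):
    """Append one unit and return its logical offset (0-based subscript)."""
    index_file.extend(UNIT.pack(commit_log_position, size, tag_hash))
    return len(index_file) // UNIT.size - 1


def read_unit(index_file, offset):
    """Fixed-length units let the offset be used like an array subscript."""
    return UNIT.unpack_from(index_file, offset * UNIT.size)


index_file = bytearray()
append_unit(index_file, 0, 128, 11)
second = append_unit(index_file, 128, 256, 22)
entry = read_unit(index_file, second)    # jumps straight to byte 20
```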

System framework

Advantages

Supports publish/subscribe (Pub/Sub) and point-to-point (P2P) messaging models:

Sequential queue: reliable first-in-first-out (FIFO) and strict sequential delivery in a queue; supports pull and push message modes;
The ability to accumulate millions of messages in a single queue;
Supports multiple messaging protocols, such as JMS, MQTT, etc.;
Distributed scale-out architecture;
Satisfies at-least-once message delivery semantics;
Provides a rich Dashboard, including configuration, metrics, and monitoring;
Clients are currently available for Java, C++, and Go.

Disadvantages

Community activity is average;
Delayed messages: the open source version supports only a fixed set of delay levels, not delays of arbitrary precision.

Usage scenarios

RocketMQ was designed for Internet finance and other scenarios that require high reliability.
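
With level-based delays, the caller has to map a desired delay onto the nearest available level. The sketch below rounds up to the smallest level that is at least as long as the requested delay. The level table is assumed to match the default configuration of the open source broker (1s 5s 10s 30s 1m 2m 3m 4m 5m 6m 7m 8m 9m 10m 20m 30m 1h 2h), and delay_level_for is a hypothetical helper, not part of the RocketMQ client.

```python
import bisect

# Assumed default delay levels, in seconds; level numbers are 1-based.
DELAY_LEVEL_SECONDS = [1, 5, 10, 30, 60, 120, 180, 240, 300, 360,
                       420, 480, 540, 600, 1200, 1800, 3600, 7200]


def delay_level_for(seconds):
    """Return the 1-based level whose delay is the smallest one >= the requested delay."""
    index = bisect.bisect_left(DELAY_LEVEL_SECONDS, seconds)
    if index == len(DELAY_LEVEL_SECONDS):
        raise ValueError("requested delay exceeds the largest level")
    return index + 1


exact = delay_level_for(30)        # 30 seconds is itself a level
rounded_up = delay_level_for(45)   # no 45-second level, so the next one (1 minute) is used
```

Rounding up means a message is never delivered earlier than requested, at the cost of sometimes arriving later than asked.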

Original author: Geek Rebirth