Introduction to and comparison of message middleware: Kafka, RabbitMQ, and RocketMQ

Preface

In distributed systems, message middleware is widely used to exchange data between systems and to decouple them asynchronously. There are many open source message middleware products. Some time ago RocketMQ (the kernel of MetaQ) was also successfully open sourced and attracted a lot of attention.

Concepts

Introduction to MQ

MQ stands for message queue: a container for storing messages. In that respect it resembles a database or a cache, which also store data, but it has some characteristics of its own, which are introduced in detail below.
Commonly used MQ components include ActiveMQ, RabbitMQ, RocketMQ, ZeroMQ, and MetaQ. Kafka, which has become popular in recent years, also acts as an MQ in some scenarios, although it can do more than that. Each MQ has its own characteristics and advantages, but all of them share some features that belong to MQ itself. These features are introduced below.

MQ features

1. First in, first out
If it is not first in, first out, it cannot be called a queue. The order of messages is basically fixed when they enter the queue, and manual intervention is generally not required. Most importantly, only one copy of each piece of data is in use at a time. This is one of the reasons MQ is used in so many scenarios.
2. Publish and subscribe
Publish and subscribe is a very efficient way of processing. When nothing blocks, it can almost be treated as a synchronous operation. It improves server utilization very effectively, and its application scenarios are very broad.
3. Persistence
Persistence means MQ is not just an auxiliary tool for a few scenarios: it can store core data, as a database does.
4. Distributed
In today's high-traffic, big-data scenarios, server software that only supports single-node deployment is basically unusable. Only software that supports distributed deployment can be widely used, and MQ is positioned as high-performance middleware.
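
The first two features can be illustrated with a minimal in-memory sketch. It is not how any of the products below are implemented; the class names PointToPointQueue and Topic are made up for this example, and persistence and distribution are left out entirely.

```python
from collections import deque


class PointToPointQueue:
    """FIFO queue in which each message is handed out exactly once."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)  # enqueue at the tail

    def receive(self):
        # Dequeue from the head; returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Publish-subscribe: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._inboxes = {}  # subscriber name -> that subscriber's inbox

    def subscribe(self, name):
        self._inboxes[name] = deque()

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)

    def inbox(self, name):
        return list(self._inboxes[name])


queue = PointToPointQueue()
for m in ["m1", "m2", "m3"]:
    queue.send(m)
first, second = queue.receive(), queue.receive()  # arrival order is kept

topic = Topic()
topic.subscribe("billing")
topic.subscribe("audit")
topic.publish("order-created")  # both subscribers see this message
```

The queue hands each message to a single receiver in arrival order, while the topic copies each message to every subscriber.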
Application scenarios

So which message middleware performs best?

With this question in mind, our middleware testing team made a performance comparison of three common messaging products (Kafka, RabbitMQ, RocketMQ).

Kafka

Kafka is a distributed publish-subscribe messaging system open sourced by LinkedIn, and is now a top-level Apache project. Its main characteristics are pull-based message consumption and the pursuit of high throughput. It was originally built for log collection and transport. Replication has been supported since version 0.8. Kafka at that time did not support transactions (they were added later, in version 0.11) and made no strict guarantees about message duplication, loss, or errors. It is suitable for data collection in Internet services that generate large amounts of data.

RabbitMQ

RabbitMQ is an open source message queuing system written in Erlang and based on the AMQP protocol. The main characteristics of AMQP are message orientation, queuing, routing (both point-to-point and publish/subscribe), reliability, and security. AMQP is used mostly in enterprise systems, where data consistency, stability, and reliability are the primary requirements and performance and throughput come second.

RocketMQ

RocketMQ is Alibaba's open source message middleware. It is written in pure Java, offers high throughput and high availability, and is suitable for large-scale distributed systems. RocketMQ's design was inspired by Kafka, but it is not a copy of Kafka: it improves reliable message delivery and transactional support. It is widely used within Alibaba Group in scenarios such as transactions, recharge, stream computing, message push, log streaming, and binlog distribution.

Test purpose

Compare the performance of Kafka, RabbitMQ, and RocketMQ when sending small messages (124 bytes). This stress test focuses only on server-side performance indicators, so the criterion for the test is:

Increase the load on the sending side until system throughput no longer rises and response time grows. At that point the server has hit a performance bottleneck, and the corresponding throughput is the best throughput of the system.
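
The stopping rule above can be sketched as a small search loop. This is only an illustration of the criterion, not the team's actual test script: simulated_server is a made-up broker model with an arbitrary capacity of 100,000 messages per second, and find_peak_throughput, min_gain, and max_rate are names invented for this example.

```python
def simulated_server(offered_rate, capacity=100_000, base_latency_ms=2.0):
    """Hypothetical broker model: throughput is capped at `capacity`,
    and latency grows once the offered load exceeds it."""
    throughput = min(offered_rate, capacity)
    overload = max(0.0, offered_rate / capacity - 1.0)
    latency_ms = base_latency_ms * (1.0 + 10.0 * overload)
    return throughput, latency_ms


def find_peak_throughput(measure, start_rate, step, max_rate, min_gain=0.01):
    """Raise the offered rate until throughput stops rising (relative gain
    below `min_gain`) while response time gets longer."""
    rate = start_rate
    best_tp, prev_latency = measure(rate)
    while rate + step <= max_rate:
        rate += step
        tp, latency = measure(rate)
        gain = (tp - best_tp) / best_tp
        if gain < min_gain and latency > prev_latency:
            break  # bottleneck reached: no more throughput, only more latency
        best_tp, prev_latency = max(best_tp, tp), latency
    return best_tp


peak = find_peak_throughput(
    simulated_server, start_rate=20_000, step=20_000, max_rate=500_000
)
```

In a real test, measure would drive the actual producers at the given rate and return the observed throughput and response time.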

Test scenarios

In the synchronous sending scenario, the performance of the three products differs clearly:

Kafka

Kafka's throughput reaches 173,000 messages per second (17.3w/s, where w is the Chinese unit wan, 10,000), which lives up to its reputation as the industry leader in high-throughput message middleware. This is mainly because its queue model keeps disk writes as sequential I/O. At this point disk I/O on the broker has become the bottleneck.

RocketMQ

RocketMQ also performs well, with a throughput of 116,000 messages per second (11.6w/s) and disk I/O %util close to 100%. RocketMQ returns an ack once a message has been written to memory, and a dedicated thread flushes data to disk. All messages are written to the file sequentially.

RabbitMQ

RabbitMQ's throughput is 59,500 messages per second (5.95w/s), with high CPU consumption. It supports the AMQP protocol, which is very heavyweight, and it trades throughput for message reliability. We also tested RabbitMQ with message persistence enabled, and throughput was around 26,000 messages per second (2.6w/s).

Test conclusion

For server-side performance with synchronous sending: Kafka > RocketMQ > RabbitMQ.
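
As a quick check of how far apart the three results are, the reported figures can be compared directly. The snippet only restates the numbers from the test above (the "w" in the source is the Chinese unit wan, 10,000) and uses the non-persistent RabbitMQ result as the baseline.

```python
# Throughput figures reported in the test, in messages per second.
throughput = {"Kafka": 173_000, "RocketMQ": 116_000, "RabbitMQ": 59_500}

ranking = sorted(throughput, key=throughput.get, reverse=True)
slowest = ranking[-1]
relative = {
    name: round(throughput[name] / throughput[slowest], 2) for name in ranking
}

for name in ranking:
    print(f"{name}: {throughput[name]:,} msg/s ({relative[name]}x {slowest})")
```

On these figures Kafka delivers roughly 2.9 times, and RocketMQ roughly 1.9 times, the throughput of RabbitMQ.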
Appendix:

Test environment

The server is deployed on a single machine, and the machine configuration is as follows:

[Figure: machine configuration]

Application version:

[Figure: application versions]

Test script

[Figure: test script]

Comparison of the advantages of message queues

Above we compared the simplest scenario, sending small messages, and Kafka came out ahead for now. However, RocketMQ, which has been through the baptism of Double Eleven, has its own advantages in Internet application scenarios.

RabbitMQ

RabbitMQ is an open source message queue written in Erlang. It supports many protocols: AMQP, XMPP, SMTP, and STOMP. This makes it very heavyweight and better suited to enterprise-level development. It implements a broker architecture, meaning that messages are queued at a central node before being delivered to clients. It has good support for routing, load balancing, and data persistence.

Redis

Redis is a key-value NoSQL database that is actively developed and maintained. Although it is a key-value storage system, it supports MQ functionality and can be used as a lightweight queue service. In one test, enqueue and dequeue operations were each executed 1 million times against RabbitMQ and Redis, with the execution time recorded every 100,000 operations. The test data came in four sizes: 128 bytes, 512 bytes, 1 KB, and 10 KB. The results show that for enqueue, Redis outperforms RabbitMQ when the data is small, but becomes unbearably slow once the data size exceeds 10 KB. For dequeue, Redis performs very well regardless of data size, while RabbitMQ's dequeue performance is much lower than that of Redis.

ZeroMQ

ZeroMQ is known as the fastest message queuing system, especially for scenarios that demand high throughput. ZeroMQ (ZMQ) can implement advanced and complex queues that RabbitMQ is not good at, but developers have to combine multiple technical frameworks themselves, and this technical complexity is a challenge to applying it successfully. ZeroMQ has a unique non-middleware model: you do not need to install and run a message server or middleware, because your application plays that role itself. You simply reference the ZeroMQ library (in .NET it can be installed with NuGet) and can then send messages between applications. However, ZeroMQ only provides non-persistent queues, so data is lost if the machine goes down. Twitter's Storm, for example, has used ZeroMQ for data stream transport.

ActiveMQ

Apache ActiveMQ is one of the most popular and powerful open source messaging and Integration Patterns servers.
Apache ActiveMQ is fast, supports many cross-language clients and protocols, comes with easy-to-use Enterprise Integration Patterns and many advanced features, and fully supports JMS 1.1 and J2EE 1.4. It is released under the Apache 2.0 license.
Features:
Support for Java Message Service (JMS) 1.1
Spring Framework integration
Clustering
Supported programming languages include C, C++, C#, Delphi, Erlang, Adobe Flash, Haskell, Java, JavaScript, Perl, PHP, Pike, Python, and Ruby
Supported protocols include OpenWire, REST, STOMP, WS-Notification, MQTT, XMPP, and AMQP [1]

Jafka / Kafka

Kafka is an Apache project: a high-performance, cross-language, distributed publish/subscribe message queue system. Jafka was derived from Kafka and is described as an upgraded version of it. It has the following characteristics:

Fast persistence: messages can be persisted with O(1) overhead.
High throughput: a throughput of 100,000 messages per second (10W/s) can be reached on an ordinary server.
Fully distributed: brokers, producers, and consumers all natively support distribution and balance load automatically.
Parallel data loading into Hadoop: for log data and offline analysis systems such as Hadoop that are nevertheless subject to real-time processing requirements, this is a feasible solution.

Kafka unifies online and offline message processing through Hadoop's parallel loading mechanism, which is also why it is of interest here. Compared with ActiveMQ, Apache Kafka is a very lightweight messaging system. Besides very good performance, it is also a distributed system that works well.

Other comparisons

RabbitMQ is more reliable than Kafka, while Kafka is better suited to high-throughput I/O workloads such as ELK log collection.

Kafka, like RabbitMQ, is a general-purpose message broker, and both are designed for distributed deployment. But their assumptions about the messaging semantic model are very different. I am skeptical of the argument that "AMQP is more mature". Let the facts show which solution fits your problem.
  a) Kafka is the better fit in the following scenario: you have a large volume of events (more than 100,000 per second); you need to deliver them at least once, partitioned and in order, to a mix of online and batch consumers; you want to be able to re-read messages; and you can accept that high availability is currently limited at the node level, or you do not mind getting support for software still in its infancy through forums or IRC.
  b) RabbitMQ is the better fit in the following scenario: you have fewer messages (more than 20,000 per second) that need to reach consumers through complex routing logic; you want reliable message delivery; you do not care about the order of message delivery; and you need cluster-level high availability now, or you need 24x7 paid support (of course, you can also use forums or IRC).

Redis message push (based on distributed pub/sub) is mostly used for real-time message push and does not guarantee reliability. Other MQs and Kafka guarantee reliability but introduce some delay (non-real-time systems make no latency guarantees). Redis pub/sub data is gone after a power failure. Using a Redis list as the message channel does provide persistence, but it is too simplistic and still not fully reliable against message loss.

In addition, Redis publish/subscribe has no grouping beyond separate topics. In Kafka, by contrast, the subscribers of a topic can be divided into groups, and only one subscriber in each group receives a given message, which can be used for load balancing. For example, suppose Kafka publishes the message topic="Post", data="Article 1", and there are one hundred servers behind it, each one a subscriber to this topic, divided into three groups. The 50 servers in group A actually publish articles. Because they are in the same group, the message (topic="Post", data="Article 1") is received by only one currently idle machine in group A. The 25 servers in group B do statistics, and the 25 servers in group C do archiving and backup. Only one server in each group receives the message.

The groups determine how many copies of each message are made, and within a group, which subscribers are busy and which are idle determines which server the message is dispatched to for processing: this is the producer-consumer model. Redis has no such mechanism at all. These two points are the biggest differences.
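
The grouping behaviour in the example above can be sketched in a few lines. This is a simplified in-memory model, not Kafka's API: GroupedTopic is a made-up class, and round-robin selection stands in for "a currently idle machine" (in real Kafka, delivery within a group follows partition assignment).

```python
from collections import defaultdict
from itertools import cycle


class GroupedTopic:
    """Each message goes to every group, but to only one member per group."""

    def __init__(self):
        self._members = defaultdict(list)   # group -> member names
        self._next = {}                     # group -> round-robin iterator
        self.delivered = defaultdict(list)  # member -> messages received

    def subscribe(self, group, member):
        self._members[group].append(member)
        # Rebuild the iterator so that the new member is included.
        self._next[group] = cycle(self._members[group])

    def publish(self, message):
        for group in self._members:
            member = next(self._next[group])
            self.delivered[member].append(message)


topic = GroupedTopic()
for group, size in {"A": 50, "B": 25, "C": 25}.items():
    for i in range(size):
        topic.subscribe(group, f"{group}-{i}")

topic.publish("Article 1")
receivers = sorted(topic.delivered)  # one server from each of the three groups
```

With one hundred subscribers in three groups, a single published message is processed by exactly three servers, one per group.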

Redis is mainly used as an in-memory database

The author of Redis added pub/sub messaging on top of the in-memory database. MQs generally adopt a publish-subscribe model. As far as performance is concerned, the main question is whether the consumption model is pull or push, and the most influential factor is probably the storage structure.

Kafka only performs at its best when the number of topics is below 64, which is determined by its partition design. In extreme cases messages can be lost, for example when the master machine goes down and its hard disk is damaged right after the master has written a message. This was found when reviewing the code.

I do not know about RabbitMQ, but RocketMQ's throughput is measured in tens of thousands of messages per second, and it can scale horizontally without limit. With 256 topics on a single machine, the performance loss is small. RocketMQ can be said to be a variant of Kafka: it started as MetaQ, which Alibaba developed after a thorough review of the Kafka code. After continuous updates and patches, Alibaba renamed MetaQ 3.0 to RocketMQ. RocketMQ is written in Java, which makes it easy to maintain.

In addition, RocketMQ and Kafka have similar capabilities for practically unlimited message accumulation. Think about it: no messages are lost on power failure, and a backlog of 200 million messages causes no pressure. With Kafka and RocketMQ, performance is simply not something you need to worry about.

In terms of application scenarios

RabbitMQ

RabbitMQ follows the AMQP protocol and is developed in Erlang, a language with inherent support for high concurrency. It is used for real-time messaging with high reliability requirements. It is suitable for enterprise-level message publishing and subscription, and it is also widely used.

Kafka

Kafka is a message publish-subscribe system that LinkedIn open sourced in December 2010. It is mainly used for processing active streaming data and large volumes of data. Common uses are log collection and data collection.

ActiveMQ

Asynchronous calls
One-to-many communication
Integration of multiple systems, homogeneous or heterogeneous
An alternative to RPC
Decoupling of multiple applications from each other
The backbone of an event-driven architecture
Improving system scalability

In terms of architecture model

RabbitMQ

RabbitMQ follows the AMQP protocol. A RabbitMQ broker consists of exchanges, bindings, and queues. Exchanges and bindings together determine how a message is routed according to its routing key. The client producer communicates with the server over a channel on a connection, and the consumer takes messages from a queue for consumption (over a long-lived connection: when a message arrives in the queue it is pushed to the consumer, and the consumer reads data from the input stream in a loop). RabbitMQ is broker-centric and has a message acknowledgment mechanism.
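
A toy model may make the exchange, binding, and queue relationship more concrete. It is a sketch of the routing idea only and does not use RabbitMQ or any client library: DirectExchange and Consumer are made-up classes, only exact-match (direct) routing is shown, and appending to acked merely stands in for an acknowledgment.

```python
from collections import defaultdict, deque


class DirectExchange:
    """Toy direct exchange: routes by exact match on the routing key."""

    def __init__(self):
        self._bindings = defaultdict(list)  # routing key -> bound queues

    def bind(self, queue, routing_key):
        self._bindings[routing_key].append(queue)

    def publish(self, routing_key, body):
        # Copy the message into every queue bound with this key and
        # return how many queues received it (0 means unroutable).
        queues = self._bindings.get(routing_key, [])
        for queue in queues:
            queue.append(body)
        return len(queues)


class Consumer:
    """Takes messages from one queue and acknowledges each of them."""

    def __init__(self, queue):
        self._queue = queue
        self.acked = []

    def consume_all(self):
        while self._queue:
            body = self._queue.popleft()
            self.acked.append(body)  # stands in for an acknowledgment


orders, audit = deque(), deque()
exchange = DirectExchange()
exchange.bind(orders, "order.created")
exchange.bind(audit, "order.created")
exchange.bind(audit, "order.cancelled")

routed_created = exchange.publish("order.created", "#1")      # two queues
routed_cancelled = exchange.publish("order.cancelled", "#2")  # one queue
routed_unknown = exchange.publish("order.unknown", "#3")      # no queue

auditor = Consumer(audit)
auditor.consume_all()
```

The producer only knows the exchange and the routing key; which queues receive the message is decided entirely by the bindings on the broker side.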

Kafka

Kafka follows the general MQ structure of producer, broker, and consumer, and is consumer-centric: the consumption state of messages is kept on the consumer client, and the consumer pulls data from the broker in batches according to its consumption position (offset). There is no message acknowledgment mechanism.
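
The pull model described above can be sketched as follows. This is an in-memory illustration under the assumption that the consumer alone tracks its position; PartitionLog and PullConsumer are made-up names and not part of any Kafka client.

```python
class PartitionLog:
    """Append-only list of messages, addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def fetch(self, offset, max_records):
        # The log keeps no per-consumer state; it just serves a slice.
        return self._records[offset:offset + max_records]


class PullConsumer:
    """Keeps its own position and pulls batches from the log."""

    def __init__(self, log, offset=0):
        self._log = log
        self.offset = offset

    def poll(self, max_records=2):
        batch = self._log.fetch(self.offset, max_records)
        self.offset += len(batch)  # advance only by what was received
        return batch

    def seek(self, offset):
        self.offset = offset  # rewind to re-read older messages


log = PartitionLog()
for record in ["a", "b", "c"]:
    log.append(record)

consumer = PullConsumer(log)
first_batch = consumer.poll()   # ["a", "b"]
second_batch = consumer.poll()  # ["c"]
consumer.seek(0)
replayed = consumer.poll(max_records=3)  # all three records again
```

Because the position lives with the consumer, re-reading messages is just a matter of moving the offset back.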

In terms of throughput

Kafka

Kafka has high throughput. Internally it uses message batching and a zero-copy mechanism, and data is stored and fetched as sequential batch operations on local disk with O(1) complexity, so message processing is very efficient.

RabbitMQ

RabbitMQ is somewhat inferior to Kafka in throughput, because their starting points differ. RabbitMQ supports reliable message delivery and transactions, but does not support batch operations. Depending on reliability requirements, messages can be stored in memory or on disk.

In terms of availability

RabbitMQ

RabbitMQ supports mirrored queues: when the master queue fails, a mirror queue takes over.

Kafka

Kafka's brokers support an active/standby mode.

In terms of cluster load balancing

Kafka

Kafka uses ZooKeeper to manage the brokers and consumers in the cluster, and topics can be registered in ZooKeeper. Through ZooKeeper's coordination mechanism, the producer keeps the broker information for each topic and can send messages to brokers randomly or in round-robin fashion. The producer can also specify a partition based on message semantics, so that a message is sent to a particular partition on a broker.
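
The two producer-side strategies mentioned above, round-robin and semantic (key-based) partitioning, can be sketched like this. The function names are made up for the example, and zlib.crc32 is used only as a convenient deterministic hash; it is an assumption of this sketch, not the hash Kafka uses.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Polling: cycle through the partitions in order."""
    counter = itertools.count()
    return lambda key=None: next(counter) % num_partitions


def key_partitioner(num_partitions):
    """Semantic partitioning: the same key always maps to the same partition."""
    return lambda key: zlib.crc32(key.encode("utf-8")) % num_partitions


rr = round_robin_partitioner(3)
spread = [rr() for _ in range(5)]  # 0, 1, 2, 0, 1

by_key = key_partitioner(3)
same_partition = by_key("user-42") == by_key("user-42")
```

Round-robin spreads load evenly, while key-based partitioning keeps all messages for one key in one partition so that their order is preserved.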

RabbitMQ

RabbitMQ's load balancing requires a separate load balancer.

Other

Kafka is a reliable distributed log storage service. Put simply, you can think of Kafka as a large tape that can be written sequentially, rewound at any time, and fast-forwarded to a given point in time for playback.

First, the definition of a log: the log is the core of a database. It is a strict, ordered record of all changes to the database, and a "table" is the result of those changes. Other names for the log are changelog, write-ahead log, commit log, redo log, and journal.

The characteristics of Kafka are as follows:

High write speed: Kafka can write to this tape faster than a 1 Gbps NIC can deliver (in practice up to SATA 3 speed; see "Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)"). It makes full use of the physical characteristics of the disk, namely slow random writes (the head keeps seeking) and fast sequential writes (the head stays on track).
High reliability: distributed consistency is achieved through ZooKeeper, data is synchronized to any number of disks, and on failure a new leader is elected automatically and the system heals itself.
High capacity: through horizontal scaling, LinkedIn stores up to 175 TB of new data and 800 billion messages per day in Kafka. Capacity can be extended without limit, much like splicing two tapes together.

The fundamental flaws of traditional business databases are:

1. They are too slow and reads and writes are too expensive, because random addressing is unavoidable. (The fastest disk seek is about 5 ms, and solid-state storage is too expensive.)
2. They cannot keep up with a continuously generated data stream: the more data they hold, the slower they get. (This is a problem of indexing efficiency.)
3. They cannot scale horizontally. (Mostly they rely on read/write separation with one master and multiple replicas. As an aside, NewSQL achieves multiple masters through a consensus algorithm.)

In response to these problems, Kafka proposes a "log-centric approach".
A traditional database is split into two independent systems, the log system and the index system. "Persistence and indexing are separated: the log is written to disk as fast as possible, and the index catches up at its own pace." With data reliability guaranteed by Kafka's fast, tape-like sequential recording, the presentation and use of data become very flexible: the data stream can be fed, as needed, into a search system, an RDBMS, a data warehouse, a graph database, a log analysis system, and other database systems at the same time. Each of these systems is just one interpretation of the Kafka tape: one facet, one index, one snapshot. If its data is lost, that is fine: just replay the tape. More often, maintaining these database systems only requires taking a snapshot regularly and copying it to secure object storage (such as S3). In one sentence: "The log is the same log; the indexes each differ."

About stream computing: under a storage model with the stream as its basic abstraction, streams can be combined with other streams, streams with state, and state with state in JOIN processing. This is the functionality provided by Kafka Streams. A simple example: after a user triggers an event, the event is joined with the user table to augment the data, which then enters the data warehouse for correlation analysis. Simple window statistics and real-time analytics are also easy to implement. For example, when a user login message is received the number of online users is incremented by 1, and when a logout message is received it is decremented by 1, so the count reflects the total number of online users currently in the system.
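
The "same log, different indexes" idea and the online-user count can be illustrated with a plain list standing in for the log. This is a sketch in ordinary Python, not the Kafka Streams API; the event format and the function names are assumptions made for the example.

```python
# The log: an ordered, append-only record of what happened.
events = [
    ("login", "alice"),
    ("login", "bob"),
    ("logout", "alice"),
    ("login", "carol"),
]


def online_count(log):
    """Running count: +1 on login, -1 on logout."""
    count = 0
    for kind, _user in log:
        count += 1 if kind == "login" else -1
    return count


def online_users(log):
    """A different view built from the same log: who is online right now."""
    users = set()
    for kind, user in log:
        if kind == "login":
            users.add(user)
        else:
            users.discard(user)
    return users


count_now = online_count(events)
users_now = online_users(events)
count_earlier = online_count(events[:2])  # replay only the first two events
```

Both views are derived purely by replaying the log, so either one can be thrown away and rebuilt, or recomputed as of an earlier point in the log.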

Origin blog.csdn.net/weixin_44081894/article/details/114890602