A rising star among open source message queues: Pulsar!

Pulsar is a cloud-native streaming data platform that combines messaging, message storage, and lightweight function-based computing. It provides both data storage and data consumption capabilities. Thanks to its architecture and strong scalability, it is widely used for message queuing, stream processing, and related workloads.

Advantages of Pulsar

The open source world already has many excellent message queues, such as RabbitMQ, Apache RocketMQ, Apache ActiveMQ, and Apache Kafka.

Building on its predecessors, Pulsar implements many features that the previous generation of messaging and streaming systems lacked, such as a cloud-native design, multi-tenancy, separation of storage and compute, and tiered storage. It offers targeted solutions to the pain points of earlier message queue systems.

What is Pulsar?

(1) Pulsar is a distributed messaging platform that handles both streaming data and message exchange between heterogeneous systems. Its messaging model is very flexible. To support richer consumption patterns, Pulsar introduces the concept of a subscription: a rule for consuming data that determines how messages are delivered to consumers. Different subscription modes determine how multiple consumers on the same subscription behave.

(2) Pulsar is a streaming data platform that integrates messaging, message storage, and lightweight function-based computing. Beyond storing and serving data, it also provides some stream processing capability.

(3) Pulsar is a distributed, scalable streaming storage system that builds a unified model for message queues and streaming on top of its storage layer. As a result, Pulsar works both as a message queue (similar to how RabbitMQ and RocketMQ are used in business systems) and as a stream processing backbone (similar to Kafka's role in big data systems).
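To make the subscription concept from item (1) more concrete, here is a toy, standard-library-only simulation of how a subscription mode can change which consumer receives each message. The mode names mirror Pulsar's Exclusive, Failover, Shared, and Key_Shared subscription types, but the `dispatch` function and its routing rules are a simplified, hypothetical model, not Pulsar's actual dispatcher.

```python
import zlib
from collections import defaultdict

def dispatch(messages, consumers, mode):
    """Assign (key, payload) messages to consumers under a subscription mode.

    Toy model only: a real broker also tracks acknowledgments, flow control,
    and consumer failures.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    if mode == "exclusive" and len(consumers) > 1:
        # Exclusive: a second consumer is not allowed on the subscription.
        raise ValueError("exclusive subscription allows only one consumer")

    delivered = defaultdict(list)
    for i, (key, payload) in enumerate(messages):
        if mode in ("exclusive", "failover"):
            # One active consumer receives everything; in failover the others
            # stand by and would only take over if the active one disconnected.
            target = consumers[0]
        elif mode == "shared":
            # Shared: messages are spread round-robin across consumers.
            target = consumers[i % len(consumers)]
        elif mode == "key_shared":
            # Key_Shared: the same key always maps to the same consumer.
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        else:
            raise ValueError(f"unknown mode: {mode}")
        delivered[target].append(payload)
    return dict(delivered)

messages = [("user-1", "a"), ("user-2", "b"), ("user-1", "c"), ("user-2", "d")]
shared = dispatch(messages, ["c1", "c2"], "shared")
failover = dispatch(messages, ["c1", "c2"], "failover")
key_shared = dispatch(messages, ["c1", "c2"], "key_shared")
```

In this model, the shared mode trades per-key ordering for throughput, while key_shared keeps every message with the same key on one consumer.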

Cloud-native architecture

Cloud native is an approach to building and running applications in the cloud computing era that takes full advantage of the elasticity and automation of cloud platforms. Cloud-native applications are designed to run well on the cloud, which greatly improves the availability, agility, and scalability of business systems.

Pulsar is a message queue designed around cloud-native infrastructure. It has many characteristics of a cloud-native application, such as a stateless compute layer and separation of compute and storage. It can make good use of the cloud's elasticity, remains scalable and fault-tolerant, and runs well in containerized environments.

Pulsar's separation of storage and compute is even more valuable in cloud-native environments. The storage nodes of a Pulsar instance can be managed as one group of Bookie containers, and the compute nodes as another group of Broker containers, so storage and compute can be scaled independently. With a container orchestration tool such as Kubernetes, teams can quickly build a cloud-native message queue that scales elastically.

Separation of storage and computing

Pulsar is a message queue that separates storage from compute. The component that provides compute is called the Broker, and the component that provides storage is called the Bookie. The Broker is a stateless server-side component with two main responsibilities: handling message production and consumption, and Pulsar administration. The actual storage work is done by Apache BookKeeper, whose storage nodes are the Bookies.

Because the Broker service is stateless, it can be scaled out on its own when compute resources run short. The Bookie is a stateful storage service, and data in Pulsar is distributed across Bookie nodes in units of data blocks (segments). When storage runs short, capacity is expanded by adding Bookie nodes. Pulsar detects the change in the Bookie cluster and starts using the new nodes for storage at an appropriate time, which avoids manual data migration. Because Brokers and Bookies scale independently, resources are not wasted and Pulsar is easier to maintain.
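The scaling behavior described above can be sketched with a small model in which each new segment is placed on the least-loaded storage nodes. The `BookieCluster` class, the `replicas` parameter, and the least-loaded placement rule are all assumptions made for illustration; BookKeeper's real placement policies are more involved.

```python
class BookieCluster:
    """Toy storage layer: each new segment goes to the least-loaded bookies."""

    def __init__(self, bookies, replicas=2):
        self.replicas = replicas
        self.load = {b: 0 for b in bookies}   # bookie -> number of segments held
        self.placement = {}                   # segment id -> list of bookies

    def add_bookie(self, name):
        # Scaling out storage: no existing segment is moved.
        self.load[name] = 0

    def place_segment(self, segment_id):
        # Pick the bookies currently holding the fewest segments
        # (ties broken by name so the result is deterministic).
        chosen = sorted(self.load, key=lambda b: (self.load[b], b))[: self.replicas]
        for b in chosen:
            self.load[b] += 1
        self.placement[segment_id] = chosen
        return chosen

cluster = BookieCluster(["bookie-1", "bookie-2"])
cluster.place_segment("seg-0")
cluster.place_segment("seg-1")
before = dict(cluster.placement)

cluster.add_bookie("bookie-3")
new_home = cluster.place_segment("seg-2")
```

The new node starts receiving data as soon as the next segment is created, while the placement of older segments stays exactly as it was.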

Tiered storage

BookKeeper clusters can use inexpensive mechanical hard disks as storage media. In practice, however, clusters are often deployed on machines with solid-state drives (SSDs) to maximize read and write performance, and SSDs are relatively expensive.

Messages in Pulsar are initially stored in the BookKeeper cluster and managed through an internal abstraction layer. The data segment is the basic unit of data management, and also the basic unit of data creation and deletion. The Pulsar community provides tiered storage, with a server-side offload function that can move each logical data segment from BookKeeper to another type of storage.

Pulsar's tiered storage feature allows older backlog data to be offloaded to long-term storage such as Hadoop HDFS or Amazon S3 storage, freeing up space in BookKeeper and reducing storage costs. The cold data and hot data in the message queue can be separated through tiered storage, making the cost more controllable.
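As a rough illustration of segment-level offloading, the sketch below moves the oldest closed segments to a cold tier until the bytes kept in the hot tier fall to a threshold. The `Segment` fields, the `offload_oldest` function, and the size-based threshold are hypothetical names and rules chosen for this example, not Pulsar's offloader interface.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    segment_id: int
    size_bytes: int
    sealed: bool = True        # only closed segments are eligible for offload
    tier: str = "bookkeeper"   # "bookkeeper" (hot) or "long_term" (cold)

def offload_oldest(segments, threshold_bytes):
    """Move the oldest sealed segments to long-term storage until the bytes
    kept in BookKeeper drop to the threshold. Returns the offloaded ids."""
    hot_bytes = sum(s.size_bytes for s in segments if s.tier == "bookkeeper")
    offloaded = []
    for seg in sorted(segments, key=lambda s: s.segment_id):  # oldest first
        if hot_bytes <= threshold_bytes:
            break
        if seg.sealed and seg.tier == "bookkeeper":
            seg.tier = "long_term"
            hot_bytes -= seg.size_bytes
            offloaded.append(seg.segment_id)
    return offloaded

segments = [
    Segment(0, 100),
    Segment(1, 100),
    Segment(2, 100),
    Segment(3, 50, sealed=False),  # segment still being written
]
moved = offload_oldest(segments, threshold_bytes=150)
```

Only whole, closed segments move, which is why the segment works well as the unit of offloading: the segment still being written is never touched.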

Pluggable Protocol Handling

Pulsar supports a pluggable protocol handling mechanism: it can dynamically load additional protocol handlers at runtime to support other messaging protocols. Through this protocol layer, Pulsar currently supports protocols such as Kafka, RocketMQ, AMQP, and MQTT, and can bring its own features, including its cloud-native design, tiered storage, and automatic load management, to more message queue ecosystems.

The protocol handler for Kafka is Kafka on Pulsar (KoP). By deploying the KoP protocol handler in an existing Pulsar cluster, users can keep using the native Kafka protocol while taking advantage of Pulsar's features to improve the experience of existing Kafka applications.

Data reliability

Pulsar uses idempotent producers to write data reliably to a single partition. Through a client-side auto-incrementing sequence ID, a retry mechanism, and server-side deduplication, an idempotent producer ensures that each message sent to a single partition is persisted exactly once, with no data loss.

In addition, all produce and consume operations within a Pulsar transaction are committed as a unit: they either all succeed or all fail. Pulsar guarantees that each message is written or processed exactly once, without loss or duplication, even in the event of a failure. If a transaction is aborted, all writes and acknowledgments within it are automatically rolled back. In short, the transaction features of both Kafka and Pulsar exist to support exactly-once semantics.
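The interplay of sequence IDs, retries, and server-side deduplication can be sketched in a few lines. `DedupBroker` and `RetryingProducer` are hypothetical classes for this example, and the `lost_acks` parameter simply simulates acknowledgments that never reach the client; real deduplication state is also persisted so that it survives restarts.

```python
class DedupBroker:
    """Toy broker that persists a message only if its sequence ID is newer
    than the last one recorded for that producer."""

    def __init__(self):
        self.log = []              # persisted payloads for one partition
        self.last_sequence = {}    # producer name -> highest persisted sequence ID

    def publish(self, producer, sequence_id, payload):
        if sequence_id <= self.last_sequence.get(producer, -1):
            return "duplicate"     # already persisted: do not store it again
        self.log.append(payload)
        self.last_sequence[producer] = sequence_id
        return "persisted"

class RetryingProducer:
    """Toy producer that numbers messages and resends when no ack arrives."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.next_sequence = 0

    def send(self, payload, lost_acks=0):
        seq = self.next_sequence
        self.next_sequence += 1
        results = []
        # Each lost ack triggers a resend with the SAME sequence ID.
        for _ in range(lost_acks + 1):
            results.append(self.broker.publish(self.name, seq, payload))
        return results

broker = DedupBroker()
producer = RetryingProducer("producer-a", broker)
first = producer.send("order-1", lost_acks=2)   # ack lost twice, so sent three times
second = producer.send("order-2")
```

Retries protect against loss and the sequence check protects against duplication, so the log ends up with each message exactly once.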
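The all-or-nothing behavior of a transaction can be modeled by buffering writes and acknowledgments until commit. `ToyTransaction` is a hypothetical in-memory model written for this article; it says nothing about how Pulsar's transaction coordinator is implemented.

```python
class ToyTransaction:
    """Buffers writes and acknowledgments, then applies all or none of them."""

    def __init__(self, topics, acked):
        self.topics = topics        # topic name -> list of visible messages
        self.acked = acked          # set of acknowledged message ids
        self.pending_writes = []    # (topic, message) pairs not yet visible
        self.pending_acks = []
        self.state = "open"

    def produce(self, topic, message):
        self._require_open()
        self.pending_writes.append((topic, message))

    def acknowledge(self, message_id):
        self._require_open()
        self.pending_acks.append(message_id)

    def commit(self):
        self._require_open()
        for topic, message in self.pending_writes:
            self.topics.setdefault(topic, []).append(message)
        self.acked.update(self.pending_acks)
        self.state = "committed"

    def abort(self):
        self._require_open()
        # Nothing was made visible, so dropping the buffers is the rollback.
        self.pending_writes.clear()
        self.pending_acks.clear()
        self.state = "aborted"

    def _require_open(self):
        if self.state != "open":
            raise RuntimeError(f"transaction is {self.state}")

topics, acked = {}, set()

txn = ToyTransaction(topics, acked)
txn.acknowledge("in-1")
txn.produce("out", "result-1")
txn.abort()
after_abort = (dict(topics), set(acked))

txn = ToyTransaction(topics, acked)
txn.acknowledge("in-1")
txn.produce("out", "result-1")
txn.commit()
```

After the abort, the input message is still unacknowledged and no output exists, so the same consume-process-produce step can safely be retried.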

Rich ecosystem support

When using message queues or streaming services, users sometimes only need to move messages from one place to another, or to perform simple operations such as counting, filtering, and aggregation. Pulsar Functions supports these natively. The project also provides a variety of official connectors for importing and exporting data. Through simple configuration, Pulsar I/O can connect Pulsar with external systems such as relational databases, non-relational databases (such as MongoDB), data lakes, and the Hadoop ecosystem.
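A simple filtering step of the kind mentioned above might look like the sketch below. It is written as a plain module-level `process(input)` function, which is the shape the Pulsar documentation shows for Python "native" functions; the filtering rule itself is made up for illustration, and treating a `None` return value as "publish nothing" is an assumption that should be checked against the Pulsar Functions documentation.

```python
def process(input):
    """Keep only error lines and normalise them; return None to drop a message."""
    text = input.strip()
    if "ERROR" not in text.upper():
        return None
    return text.upper()

# Local check of the logic, without any Pulsar cluster involved.
outputs = [process(line) for line in ["disk ok", " error: disk full ", "Error: timeout"]]
```

Because the function is plain Python with no framework imports, its logic can be unit-tested locally before it is deployed anywhere.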

Kafka is still excellent

In the author's experience, Kafka remains very stable when configured and used sensibly, even under the load of top-tier domestic game businesses.

In which scenarios can you keep using Kafka? In most cases, you can choose Kafka without hesitation.

Kafka is still a good choice when there are not many topics in the cluster or the growth rate of topics is not particularly fast.

Kafka also remains the first choice when complex enterprise-level capabilities are not required: for example, when you do not need multi-tenancy or cloud-native features, do not face particularly demanding throughput challenges, and do not need tiered storage.

Kafka's native cluster mode is easy to use and meets the needs of most businesses. The Kafka ecosystem is more mature, with more documentation and more experienced practitioners at home and abroad, so solutions to Kafka problems are easier to find.

Admittedly, a more complex architecture brings new advantages, but it also adds complexity, which raises the probability of problems. Early versions of Pulsar sometimes exhibited strange bugs, which demands more background knowledge and stronger troubleshooting skills from developers and operations staff.

"Apache Pulsar Principle Analysis and Application Practice"

The book is available on major e-commerce platforms.

Features of this book

This book starts from application practice and emphasizes the combination of theory and practice, so that readers can get productive quickly and then understand the principles behind what they are using. While introducing the basic theory, it focuses on how to quickly build a stable Pulsar service and, relying on the rich Pulsar ecosystem, a series of data services with Pulsar at the core.

Main content

This is a professional guide that explains Apache Pulsar's components, working principles, and implementation practices from a practical perspective. It is aimed mainly at beginner and intermediate readers, starting from basic concepts and gradually expanding to basic operations, core technologies, common tools, and typical applications.

The book is divided into 10 chapters:

Chapter 1 mainly introduces the basic knowledge related to Apache Pulsar, such as development history, applicable scenarios, advantages and disadvantages, and knowledge related to the message queue framework.

Chapter 2 mainly introduces the core concepts and architecture of Apache Pulsar.

Chapters 3 to 9 mainly introduce the necessary content for practical operations, such as the installation and deployment method of Apache Pulsar, basic operations, core components, advanced features, input/output, Pulsar SQL, operation and maintenance methods, etc.

Chapter 10 covers hands-on practice, including Pulsar's application patterns, working with Flink for real-time processing, and building a real-time message pipeline.

About the Author

Yang Guodong

Software engineer at Tencent, core contributor to Apache Pulsar, Apache Flink, and other projects, open source enthusiast in the Apache Pulsar community, and holder of a master's degree from Hangzhou Dianzi University.

Book giveaway: first come, first served!

This giveaway offers 5 copies of "Apache Pulsar Principle Analysis and Application Practice".

Giveaway rules: no lottery; redeem directly with community points!

Exchange address: http://spring4all.com/3076.html

Come and participate in the construction of community content, learn and grow together!
