[Pulsar series] Pulsar messaging system concepts in 10 minutes

Apache Pulsar

Pulsar is a multi-tenant, high-performance solution for server-to-server messaging. It was originally developed by Yahoo and is now managed by the Apache Software Foundation.
The main characteristics of Pulsar are as follows:

  • Native support for multiple clusters in a Pulsar instance, with seamless geo-replication of messages across clusters
  • Very low publish and end-to-end latency
  • Seamless scalability to more than a million topics
  • Client libraries for Java, Go, Python and C++
  • Multiple subscription modes for topics: exclusive, shared and failover
  • Guaranteed message delivery through the persistent message storage provided by Apache BookKeeper
  • Pulsar Functions, a serverless lightweight computing framework, provides native stream data processing
  • Pulsar IO, a serverless connector framework built on Pulsar Functions, makes it easy to move data into and out of Pulsar
  • Tiered storage offloads data from hot storage to cold storage (e.g., S3 and GCS) as the data ages

Contents are as follows:

  • Message System Concepts
  • Architecture Overview
  • Pulsar client
  • Geo-replication
  • Multi-tenancy
  • Authentication and Authorization
  • Topic compaction
  • Tiered Storage
  • Schema Management Services

1. Message System Concepts

Pulsar is built on the publish-subscribe pattern, also known as pub-sub. In this pattern, producers publish messages to topics; consumers subscribe to those topics, process the incoming messages, and send an acknowledgment once processing is complete.
Once a subscription has been created, Pulsar retains all messages, even if the consumer disconnects. A retained message is discarded only after a consumer acknowledges that it has been processed successfully.

1.1 Messages

Messages are the basic unit of Pulsar. A message is what a producer publishes to a topic and what a consumer consumes from a topic (sending an acknowledgment when processing is complete). Messages are analogous to letters in a postal system.
Each message carries the following attributes:

  • Value (data): the payload carried by the message
  • Key: an optional tag, useful for things like topic compaction
  • Properties: an optional map of user-defined key/value pairs
  • Producer name: the name of the producer that produced the message (a default name is generated, but one can be specified explicitly)
  • Sequence ID: the sequence ID of the message
  • Publish time: the time at which the message was published (added automatically by the producer)
  • Event time: an optional timestamp attached to the message

1.2 Producers

A producer is a process that attaches to a topic and publishes messages to a Pulsar broker.
Send modes: a producer can send messages to the broker either synchronously (sync) or asynchronously (async).

  • Sync send: the producer waits for an acknowledgment from the broker after sending each message. If no acknowledgment is received, the producer treats the send operation as a failure.
  • Async send: the producer puts the message into a blocking queue and returns immediately. The client library then sends the message to the broker in the background. If the queue is full (its maximum size is configurable), the producer either blocks or fails immediately, depending on the arguments passed to it.
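
The two behaviours of a full queue described above can be sketched with a bounded queue from the Java standard library. This is only an illustrative model, not the Pulsar client: the class name PendingQueueSketch and its methods are hypothetical, and the background sender is reduced to a drainOne() call.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified model of a producer's pending-message queue (hypothetical, not the Pulsar client API).
class PendingQueueSketch {
    private final BlockingQueue<String> pending;
    private final boolean blockIfFull;

    PendingQueueSketch(int maxPending, boolean blockIfFull) {
        this.pending = new ArrayBlockingQueue<>(maxPending);
        this.blockIfFull = blockIfFull;
    }

    // Enqueue a message for background sending.
    // Returns false when the queue is full and blocking is disabled,
    // or when the waiting thread is interrupted.
    boolean sendAsync(String message) {
        if (blockIfFull) {
            try {
                pending.put(message); // waits until space is available
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return pending.offer(message); // fails fast when the queue is full
    }

    // Stand-in for the background sender: removes the oldest pending message, or returns null.
    String drainOne() {
        return pending.poll();
    }
}
```

With a capacity of 2 and blocking disabled, a third sendAsync call is rejected until drainOne() frees a slot. In the Java client the corresponding producer settings are, to the best of my knowledge, maxPendingMessages and blockIfQueueFull on ProducerBuilder.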

Compression: messages can be compressed while being sent in order to save bandwidth. Pulsar supports the LZ4, ZLIB, ZSTD and SNAPPY compression types.
Batching: if batching is enabled, the producer accumulates messages and sends a batch in a single request. The batch size is determined by the maximum number of messages and the maximum publish latency.

1.3 Consumers

A consumer is a process that attaches to a topic through a subscription and then receives messages.
Receive modes: messages can be received from the broker either synchronously (sync) or asynchronously (async).

  • Sync receive: a sync receive blocks until a message is available.
  • Async receive: an async receive returns immediately with a future value, e.g. CompletableFuture in Java, which completes once a new message is available.

Listeners: the client libraries provide listener implementations for consumers. For example, the Java client provides a MessageListener interface; its received method is called whenever a new message is received.

void received(Consumer<T> consumer,Message<T> msg);

Acknowledgment: when a consumer has consumed a message successfully, it sends an acknowledgment request to the broker so that the broker can discard the message. Otherwise, the broker keeps storing the message.
Messages can be acknowledged one by one or cumulatively. With cumulative acknowledgment, the consumer only needs to acknowledge the last message it received. All messages up to and including that message will not be redelivered to the consumer.

Cumulative acknowledgment cannot be used with the shared subscription mode, because a shared subscription involves multiple consumers.
In shared mode, messages are acknowledged individually.
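
To make the difference between individual and cumulative acknowledgment concrete, here is a minimal sketch that tracks unacknowledged messages in a sorted set. It is a toy model under the assumption that messages carry increasing numeric IDs; the class AckTrackerSketch is hypothetical and is not part of the Pulsar client.

```java
import java.util.TreeSet;

// Toy model of a subscription's unacknowledged messages, keyed by an increasing numeric ID.
class AckTrackerSketch {
    private final TreeSet<Long> unacked = new TreeSet<>();

    // Record that a message was delivered and is awaiting acknowledgment.
    void delivered(long messageId) {
        unacked.add(messageId);
    }

    // Individual acknowledgment: removes exactly one message.
    void acknowledge(long messageId) {
        unacked.remove(messageId);
    }

    // Cumulative acknowledgment: removes every message up to and including messageId.
    void acknowledgeCumulative(long messageId) {
        unacked.headSet(messageId, true).clear();
    }

    // Snapshot of the messages still waiting for an acknowledgment.
    TreeSet<Long> pending() {
        return new TreeSet<>(unacked);
    }
}
```

After delivering messages 0 to 4, acknowledge(1) removes only message 1, whereas acknowledgeCumulative(3) removes everything up to and including message 3, leaving only message 4 pending.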

Negative acknowledgment: when a consumer fails to consume a message within a certain time and wants to consume it again, it can send a negative acknowledgment to the broker, and the broker then redelivers the message. Messages are negatively acknowledged either individually or cumulatively, depending on the subscription mode. In exclusive and failover modes, a consumer can only negatively acknowledge the last message it received. In shared mode, a consumer can negatively acknowledge messages individually.
Acknowledgment timeout: when a message is not consumed successfully and you want the broker to redeliver it automatically, you can use the automatic redelivery mechanism for unacknowledged messages. The client tracks unacknowledged messages over the whole acknowledgment timeout (ackTimeout) period and, once the timeout expires, automatically sends the broker a redelivery request for them.

Prefer negative acknowledgment over acknowledgment timeout. Negative acknowledgment controls the redelivery of individual messages more precisely, and it avoids invalid redeliveries when the message processing time exceeds the acknowledgment timeout.

Dead letter topic: a dead letter topic lets you keep consuming new messages when some messages cannot be consumed successfully. With this mechanism, messages that fail to be consumed are stored in a separate topic, called the dead letter topic. You decide how to handle the messages in the dead letter topic.
The following example shows how to enable a dead letter topic in the Java client:

Consumer<byte[]> consumer = pulsarClient.newConsumer(Schema.BYTES)
          .topic(topic)
          .subscriptionName("my-subscription")
          .subscriptionType(SubscriptionType.Shared)
          .deadLetterPolicy(DeadLetterPolicy.builder()
                .maxRedeliverCount(maxRedeliveryCount)
                .build())
          .subscribe();

The dead letter topic depends on message redelivery. Make sure redelivery is triggered in one of two ways: negative acknowledgment or acknowledgment timeout. Prefer negative acknowledgment over acknowledgment timeout.

Currently, the dead letter topic is supported only in shared mode.
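
The interplay between redelivery and the dead letter topic can be pictured as a per-message failure counter. The sketch below is a toy model and not the client's implementation: the class DeadLetterSketch is hypothetical, the dead letter topic is represented by an in-memory list, and the exact point at which the real client gives up on a message may differ from the threshold used here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of redelivery counting with a dead letter list (hypothetical, not the Pulsar client).
class DeadLetterSketch {
    private final int maxRedeliverCount;
    private final Map<String, Integer> failures = new HashMap<>();
    private final List<String> deadLetters = new ArrayList<>();

    DeadLetterSketch(int maxRedeliverCount) {
        this.maxRedeliverCount = maxRedeliverCount;
    }

    // Call when a message is negatively acknowledged or its acknowledgment times out.
    // Returns true if the message should be redelivered, false if it was dead-lettered.
    boolean onFailure(String messageId) {
        int count = failures.merge(messageId, 1, Integer::sum);
        if (count > maxRedeliverCount) {
            failures.remove(messageId);
            deadLetters.add(messageId);
            return false;
        }
        return true;
    }

    // Snapshot of the messages that were moved to the dead letter list.
    List<String> deadLetters() {
        return new ArrayList<>(deadLetters);
    }
}
```

With maxRedeliverCount set to 2, a message is redelivered after its first and second failure and moved to the dead letter list on the third.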

1.4 Topics

As in other pub-sub systems, topics in Pulsar are named channels for transmitting messages from producers to consumers. Topic names are URLs with a well-defined structure:

{persistent|non-persistent}://tenant/namespace/topic

persistent / non-persistent: the type of topic. Pulsar supports two kinds of topics, persistent and non-persistent (persistent is the default). With persistent topics, all messages are durably persisted on disk (on multiple disks, unless the broker runs in standalone mode). Data for non-persistent topics is not persisted to disk.
tenant: the tenant that the topic belongs to within the instance. Tenants are essential to multi-tenancy in Pulsar and can be spread across clusters.
namespace: the administrative unit of the topic, which acts as a grouping mechanism for related topics. Most topic configuration is performed at the namespace level. Each tenant can have multiple namespaces.
topic: the final part of the name. Topic names are freeform and have no special meaning in a Pulsar instance.
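
As an illustration of this structure, the following sketch splits a fully qualified topic name into its four parts. It is a minimal, hypothetical parser (TopicNameSketch is not a Pulsar class) and deliberately rejects anything that is not in the full {persistent|non-persistent}://tenant/namespace/topic form.

```java
// Minimal parser for fully qualified topic names (illustrative, not Pulsar's own class).
class TopicNameSketch {
    final String domain;    // "persistent" or "non-persistent"
    final String tenant;
    final String namespace; // local namespace name, without the tenant prefix
    final String topic;

    private TopicNameSketch(String domain, String tenant, String namespace, String topic) {
        this.domain = domain;
        this.tenant = tenant;
        this.namespace = namespace;
        this.topic = topic;
    }

    static TopicNameSketch parse(String name) {
        int sep = name.indexOf("://");
        if (sep < 0) {
            throw new IllegalArgumentException("missing domain prefix: " + name);
        }
        String domain = name.substring(0, sep);
        if (!domain.equals("persistent") && !domain.equals("non-persistent")) {
            throw new IllegalArgumentException("unknown domain: " + domain);
        }
        // A limit of 3 keeps any further '/' characters inside the topic part.
        String[] parts = name.substring(sep + 3).split("/", 3);
        if (parts.length != 3 || parts[0].isEmpty() || parts[1].isEmpty() || parts[2].isEmpty()) {
            throw new IllegalArgumentException("expected tenant/namespace/topic: " + name);
        }
        return new TopicNameSketch(domain, parts[0], parts[1], parts[2]);
    }

    boolean isPersistent() {
        return domain.equals("persistent");
    }
}
```

For example, parsing persistent://public/default/my-topic yields the tenant public, the namespace default and the topic my-topic.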

1.4.1 Namespaces

A namespace is a logical grouping within a tenant. A tenant can create multiple namespaces through the admin API. For example, a tenant with several applications can create a separate namespace for each application. Namespaces let applications create and manage topics hierarchically. For example, my-tenant/app1 is the namespace for the application app1 of the tenant my-tenant. You can create any number of topics under a namespace.

1.4.2 Subscription modes

A subscription is a named configuration rule that determines how messages are delivered to consumers. Pulsar has three subscription modes: exclusive, shared and failover. The figure below illustrates them:

1.4.2.1 Exclusive

In exclusive mode, only a single consumer is allowed to attach to the subscription. If more than one consumer attempts to subscribe to a topic using the same subscription, the consumer receives an error.
In the figure above, only Consumer A is allowed to consume messages.

Exclusive mode is the default subscription mode.

1.4.2.2 Failover

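In failover mode, multiple consumers can attach to the same subscription. The consumers are sorted lexicographically by name, and the first consumer is initially the only one that receives messages. This consumer is called the master consumer.
When the master consumer disconnects, all messages (both unacknowledged and subsequent ones) are delivered to the next consumer in line. In the figure below, Consumer B-0 is the master consumer; when Consumer B-0 disconnects, Consumer B-1, being next in line, starts receiving messages.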
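
The selection rule described above, sort by name and take the first, can be sketched in a few lines. This is an illustrative model that follows the description in this article, not broker code; the class FailoverSketch and its method names are hypothetical.

```java
import java.util.TreeSet;

// Toy model of master-consumer selection in failover mode, as described in this article.
class FailoverSketch {
    // A TreeSet keeps the consumer names in lexicographic order.
    private final TreeSet<String> consumers = new TreeSet<>();

    void connect(String consumerName) {
        consumers.add(consumerName);
    }

    void disconnect(String consumerName) {
        consumers.remove(consumerName);
    }

    // The master consumer is the first name in sorted order, or null if nobody is connected.
    String master() {
        return consumers.isEmpty() ? null : consumers.first();
    }
}
```

With Consumer B-0 and Consumer B-1 connected, master() returns Consumer B-0; once Consumer B-0 disconnects it returns Consumer B-1.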

1.4.2.3 Shared

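In shared mode, multiple consumers can attach to the same subscription. Messages are delivered to consumers in a round-robin distribution, and any given message is delivered to only one consumer. When a consumer disconnects, all messages that were sent to it but not acknowledged are rescheduled for delivery to the remaining consumers.
In the figure below, the topic has five messages, m0 to m4, and three consumers, C1, C2 and C3. In the end, m0 and m3 go to C1, m1 goes to C2, and m2 and m4 go to C3, which shows that each message is delivered to only one consumer.

Limitations of shared mode: (1) message ordering is not guaranteed; (2) cumulative acknowledgment cannot be used.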

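Key_shared:
In Key_shared mode, multiple consumers can attach to the same subscription. Messages are distributed across the consumers, and messages with the same key or orderingKey are delivered to only one consumer. No matter how many times a message is redelivered, it goes to the same consumer. When a consumer connects or disconnects, the consumer that serves some message keys changes.

Limitations of this mode: messages must specify a key or orderingKey; cumulative acknowledgment cannot be used; the mode is currently in beta and can be disabled in broker.conf.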

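1.5 Multi-topic subscriptions

When a consumer subscribes to a Pulsar topic, by default it subscribes to one specific topic, for example persistent://public/default/my-topic. Since Pulsar version 1.23.0-incubating, consumers can subscribe to multiple topics simultaneously. You can define the list of topics in two ways:

  • With a basic regular expression (regex), for example persistent://public/default/finance-.*
  • By explicitly specifying a list of topics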

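When subscribing to multiple topics by regex, all topics must be in the same namespace.

When subscribing to multiple topics, the Pulsar client automatically calls the Pulsar API to discover all topics that match the regex pattern or list, and then subscribes to all of them. If any of the topics does not exist yet, the consumer subscribes to it automatically once it is created.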
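
The discovery step can be pictured as filtering the known topic names with the pattern. The sketch below is only a local illustration using java.util.regex; the class TopicPatternSketch is hypothetical, and the real client obtains the candidate topics from the broker rather than from a list passed in by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative filter: which of the known topics would a pattern subscription cover?
class TopicPatternSketch {
    static List<String> matching(Pattern pattern, List<String> knownTopics) {
        List<String> result = new ArrayList<>();
        for (String topic : knownTopics) {
            // matches() requires the whole topic name to match the pattern.
            if (pattern.matcher(topic).matches()) {
                result.add(topic);
            }
        }
        return result;
    }
}
```

With the pattern persistent://public/default/finance-.*, a topic such as persistent://public/default/finance-eu is selected, while persistent://public/default/orders is not.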

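No ordering guarantees: when a consumer subscribes to multiple topics, the ordering guarantees that Pulsar provides for a single topic no longer hold. If your use case requires ordering, it is strongly recommended that you do not use this feature.

The following is a Java example of multi-topic subscription: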

import java.util.regex.Pattern;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;

// Instantiate the Pulsar client
PulsarClient pulsarClient = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

// Subscribe to all topics in a namespace
Pattern allTopicsInNamespace = Pattern.compile("persistent://public/default/.*");
Consumer<byte[]> allTopicsConsumer = pulsarClient.newConsumer()
                .topicsPattern(allTopicsInNamespace)
                .subscriptionName("subscription-1")
                .subscribe();

// Subscribe to a subset of the topics in a namespace, based on a regex
Pattern someTopicsInNamespace = Pattern.compile("persistent://public/default/foo.*");
Consumer<byte[]> someTopicsConsumer = pulsarClient.newConsumer()
                .topicsPattern(someTopicsInNamespace)
                .subscriptionName("subscription-1")
                .subscribe();

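1.6 Partitioned topics

Normally a topic is served by a single broker, which limits the maximum throughput of the topic. Partitioned topics are a special type of topic that can be handled by multiple brokers, which allows higher throughput.
Behind the scenes, a partitioned topic is implemented as N internal topics, where N is the number of partitions. When messages are published to a partitioned topic, each message is routed to one of several brokers. Pulsar handles the distribution of partitions across brokers automatically.
The figure below illustrates this:
In the figure, Topic1 has five partitions (P0 to P4) spread across three brokers. Because there are more partitions than brokers, two brokers handle two partitions each and the third handles only one (again, Pulsar handles this distribution automatically).
Messages for this topic are broadcast to two consumers. The routing mode determines which partition each message is published to, while the subscription mode determines which messages go to which consumer.
In most cases, routing and subscription modes can be decided separately. In general, throughput requirements should drive the partitioning and routing decisions, while the subscription mode should be determined by the application.
Partitioned topics and normal topics do not differ in how subscription modes work. Partitioning only determines what happens between a message being published by a producer and being processed and acknowledged by a consumer.
Partitioned topics need to be created explicitly through the admin API. The number of partitions can be specified when the topic is created.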

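1.6.1 Routing modes

When publishing to a partitioned topic, you must specify a routing mode. The routing mode determines which partition, that is, which internal topic, each message is published to. The three routing modes are:

  • RoundRobinPartition: if no key is provided, the producer publishes messages across all partitions in round-robin fashion to achieve maximum throughput. Note that round-robin is not applied per individual message but at the boundary of the batching delay, so that batching stays effective. If a key is specified on the message, the partitioned producer hashes the key and assigns the message to the corresponding partition. This is the default mode.
  • SinglePartition: if no key is provided, the producer randomly picks a single partition and publishes all messages to it. If a key is specified on the message, the partitioned producer hashes the key and assigns the message to the corresponding partition.
  • CustomPartition: uses a custom message router implementation that decides which partition a particular message goes to. You can create a custom routing mode with the Java client by implementing the MessageRouter interface.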
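
The two built-in behaviours, hashing keyed messages and cycling through partitions for unkeyed ones, can be sketched as follows. This is a simplified model and not the client's router: RoutingSketch is a hypothetical class, the hash is plain String.hashCode() made non-negative, and the batching boundary mentioned above is ignored, so round-robin here advances on every message.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of partition selection (hypothetical, ignores batching boundaries).
class RoutingSketch {
    private final int numPartitions;
    private final AtomicInteger counter = new AtomicInteger();

    RoutingSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Keyed messages: hash the key so that equal keys always land in the same partition.
    int partitionForKey(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Unkeyed messages in round-robin mode: cycle through the partitions in order.
    int nextRoundRobin() {
        return (counter.getAndIncrement() & Integer.MAX_VALUE) % numPartitions;
    }

    // A null key selects round-robin, any other key selects hashing.
    int choose(String key) {
        return key != null ? partitionForKey(key) : nextRoundRobin();
    }
}
```

With five partitions, successive unkeyed messages go to partitions 0, 1, 2, 3, 4 and then 0 again, while every message with the key order-42 is routed to one fixed partition.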

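1.7 Ordering guarantees

Message ordering is related to the routing mode and the message key. Usually, users want ordering to be guaranteed per key and partition.
With either SinglePartition or RoundRobinPartition mode, a message that has a key is routed to the matching partition, based on the hashing scheme specified by HashingScheme in ProducerBuilder.
There are two ways to guarantee ordering:

  • Per-key partition: all messages with the same key are in order and are placed in the same partition. Use either SinglePartition or RoundRobinPartition mode, and provide a key for every message.
  • Per-producer: all messages from the same producer are in order. Use SinglePartition mode, and provide no key for any message.

1.7.1 HashingScheme

HashingScheme is an enum that represents the set of standard hashing functions available when choosing the partition for a particular message. Two hashing functions are available: JavaStringHash and Murmur3_32Hash. The default hashing function for producers is JavaStringHash. Note that JavaStringHash is not useful when producers can come from clients in different languages; in that case, Murmur3_32Hash is recommended.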

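1.8 Non-persistent topics

By default, Pulsar persistently stores all unacknowledged messages on multiple BookKeeper bookies (storage nodes). Message data on persistent topics can therefore survive broker restarts and subscriber failover.
Pulsar also supports non-persistent topics, whose messages are never persisted to disk and live only in memory. With non-persistent delivery, killing a Pulsar broker or disconnecting a subscriber means that all in-transit messages on that non-persistent topic are lost, so clients may see message loss.
Non-persistent topic names have this form (note the non-persistent in the name):

non-persistent://tenant/namespace/topic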

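With non-persistent topics, brokers immediately deliver messages to all connected subscribers without persisting them in BookKeeper. If a subscriber is disconnected, the broker cannot redeliver those in-transit messages, and the subscriber will never receive them. Eliminating the persistent storage step makes messaging on non-persistent topics slightly faster than on persistent topics in some cases, but at the cost of losing some of Pulsar's core benefits.

With non-persistent topics, message data lives only in memory. If a broker fails or the message data otherwise cannot be retrieved from memory, your message data may be lost. Use non-persistent topics only if you are certain that your use case calls for them and can tolerate the loss.

Non-persistent topics are enabled on brokers by default. You can disable them in the broker configuration. You can manage non-persistent topics with the pulsar-admin topics interface.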

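1.8.1 Performance

Non-persistent messaging is usually faster than persistent messaging because brokers do not persist messages and send the ack back to the producer as soon as the message has been delivered to all connected subscribers. Producers therefore see lower publish latency with non-persistent topics.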

1.8.2 Client API

Producers and consumers connect to non-persistent topics in the same way as to persistent topics. The only difference is that the topic name must start with non-persistent. All three subscription modes (exclusive, shared and failover) are supported for non-persistent topics.
Here is an example Java consumer for a non-persistent topic:

PulsarClient client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")
    .build();
// The non-persistent:// prefix marks the topic as non-persistent
String npTopic = "non-persistent://public/default/my-topic";
String subscriptionName = "my-subscription-name";

Consumer<byte[]> consumer = client.newConsumer()
    .topic(npTopic)
    .subscriptionName(subscriptionName)
    .subscribe();

Here is an example Java producer for the same non-persistent topic:

Producer<byte[]> producer = client.newProducer()
            .topic(npTopic)
            .create();

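1.9 Message retention and expiry

By default, Pulsar brokers:

  • immediately delete all messages that have been acknowledged by a consumer
  • persistently store all unacknowledged messages in a message backlog

Pulsar has two features that let you override this default behavior:

  • Message retention lets you store messages that have already been acknowledged by a consumer
  • Message expiry lets you set a time to live (TTL) for messages that have not yet been acknowledged

All message retention and expiry is managed at the namespace level. For a how-to, see the Message retention and expiry cookbook.
The figure below illustrates both concepts. The first part shows message retention: the retention rules apply to all topics in a namespace and specify which messages are durably stored even after they have been acknowledged. Messages that are not covered by the retention rules are deleted. Without a retention policy, all acknowledged messages are deleted.
The second part shows message expiry: some messages are deleted even though they have not been acknowledged, because they have expired according to the TTL set on the namespace (for example, with a TTL of 5 minutes, a message that is still unacknowledged after 10 minutes).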
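
The two rules can be expressed as simple predicates on a message's age. The sketch below is a time-only simplification with hypothetical names (RetentionSketch, isExpired, isRetained); real retention policies can also be limited by size, and treating a zero TTL as "no expiry" is an assumption of this sketch.

```java
import java.time.Duration;

// Time-only model of message expiry and retention (hypothetical simplification).
class RetentionSketch {
    // Unacknowledged messages are dropped once they are older than the TTL.
    // In this sketch a zero TTL means that no expiry is configured.
    static boolean isExpired(Duration age, Duration ttl) {
        return !ttl.isZero() && age.compareTo(ttl) > 0;
    }

    // Acknowledged messages are kept while they are younger than the retention period.
    // A zero retention period reproduces the default: delete once acknowledged.
    static boolean isRetained(Duration age, Duration retention) {
        return age.compareTo(retention) < 0;
    }
}
```

With a TTL of 5 minutes, a message that is still unacknowledged after 10 minutes is reported as expired, matching the example above.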

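1.10 Message deduplication

Message duplication occurs when a message is persisted by Pulsar more than once. Message deduplication is an optional Pulsar feature that prevents unnecessary duplication by processing each message only once, even if the message is received more than once.
The figure below illustrates what happens with deduplication disabled and enabled:
In the first scenario in the figure, deduplication is disabled. A producer publishes message 1 on a topic; the message reaches a broker and is persisted to BookKeeper. The producer then sends message 1 again (perhaps because of some retry logic), and the message is received and persisted to BookKeeper again, which means that duplication has occurred.
In the second scenario, the producer publishes message 1, which is received by the broker and persisted, as in the first scenario. When the producer sends the message again, however, the broker knows that it has already seen message 1 and does not persist it a second time.

Message deduplication is handled at the namespace level.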
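
One plausible way to picture the broker-side check is a map from producer name to the highest sequence ID seen so far, using the producer name and sequence ID attributes listed in section 1.1. The sketch below is a toy model under that assumption; DedupSketch is a hypothetical class, and persistence is reduced to a counter.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of broker-side deduplication keyed by producer name and sequence ID.
class DedupSketch {
    private final Map<String, Long> lastSequenceId = new HashMap<>();
    private int persisted = 0;

    // Returns true if the message was persisted, false if it was dropped as a duplicate.
    boolean publish(String producerName, long sequenceId) {
        Long last = lastSequenceId.get(producerName);
        if (last != null && sequenceId <= last) {
            return false; // already seen: a retry of an earlier message
        }
        lastSequenceId.put(producerName, sequenceId);
        persisted++;
        return true;
    }

    int persistedCount() {
        return persisted;
    }
}
```

Publishing sequence ID 1 twice from the same producer persists it only once, which corresponds to the second scenario above.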

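1.10.1 Producer idempotency

The other approach to message deduplication is to ensure that each message is produced only once. This approach is typically called producer idempotency. Its drawback is that it pushes the work of deduplication onto the application. In Pulsar, deduplication is handled by the broker, which means that you do not need to modify your client code. You only need to make some administrative changes (see Managing message deduplication).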

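1.10.2 Deduplication and effectively-once semantics

Message deduplication makes Pulsar an ideal messaging system to use with stream processing engines (SPEs) and other systems that seek effectively-once processing semantics. Messaging systems that do not offer automatic deduplication require the SPE or other system to guarantee deduplication, which means that strict message ordering comes at the cost of burdening the application with deduplication work. With Pulsar, strict ordering guarantees come at no application-level cost.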

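Conclusion

Due to limited space, this article covers only the basic concepts of the Pulsar messaging system. The next article will focus on Pulsar's architecture and on a tutorial for the client libraries.
Reference documentation (http://pulsar.apache.org/en/)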

 


Origin www.cnblogs.com/iceblow/p/11318650.html