Message deduplication: the deduplication here may not be what you think | Apache Pulsar Technology Series

Introduction

Apache Pulsar is a high-performance messaging solution for inter-service communication that supports multi-tenancy, low latency, read-write separation, cross-region replication, rapid scaling, and flexible fault tolerance. The Pulsar working group within Tencent Cloud has studied Pulsar in depth and made many performance and stability optimizations, and Pulsar is now running in production in TDBank, an internal Tencent business. This article is part of the Pulsar technology series. It introduces Pulsar's Message Deduplication feature, to help you avoid pitfalls when using it.

Background of Message Deduplication

In the design of message middleware products, message delivery is generally described using the three delivery semantics proposed in Kafka:

At most once (at-most-once)

At least once (at-least-once)

Exactly once (exactly-once)

Note that each of these is a qualified description of delivery behavior, with a limited scope.

At most once: When the client produces a message, it delivers the message only once. There is no guarantee that the message is produced successfully.

At least once: When the client produces a message, it may deliver it multiple times before receiving a successful response. In this scenario, there may be multiple duplicate messages on the server side.

Exactly once: When the client produces a message, the server guarantees that exactly one copy of the message exists for this production. "This production" here generally refers to a single call to "SendMessage" by the client. In this sense, the server generally does not handle the case where the client calls produce multiple times with the same message body, which results in duplicate messages. Simply put, "exactly once" does not mean message deduplication.

Many systems claim to provide "exactly-once" delivery semantics, but a careful reading of their claims reveals that some of them may be somewhat misleading. We need to consider how the semantics are guaranteed in scenarios such as production timeouts, writes that succeed on only some replicas, and partial failures.

At present, the vast majority of message middleware products in the industry, such as Kafka, RocketMQ, Pulsar, InLong-Tube, RabbitMQ, and ActiveMQ, support at-least-once delivery semantics: for a successfully produced message, the server guarantees that at least one copy is stored, and the consumer can consume the message at least once. However, relatively few products support exactly-once semantics.

Below, we focus on Pulsar's Message Deduplication (equivalent to an implementation of exactly-once), which may not be what you think.

Message deduplication in Pulsar

Functional configuration

The Message Deduplication function provided by Pulsar is disabled by default. To turn it on, the broker configuration needs to be modified, and the client also needs a little configuration. (For details, please refer to Pulsar's official website.)

To enable Message Deduplication, first, the broker needs to change the following configuration:

`# Whether to enable the message deduplication function
brokerDeduplicationEnabled
# With deduplication enabled, the limit on the number of producers
brokerDeduplicationMaxNumberOfProducers
# The interval for generating deduplication snapshot information on the broker side
brokerDeduplicationEntriesInterval
# After a producer disconnects, how long the broker keeps its deduplication information
brokerDeduplicationProducerInactivityTimeoutMinutes`

Second, the producer client needs to make the following changes:

  1. Specify a name for the producer.

  2. Configure the message production timeout to 0 (the default is 30s).

The code example is as follows:

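`import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

PulsarClient pulsarClient = PulsarClient.builder()
         .serviceUrl("pulsar://localhost:6650")
         .build();

// The broker uses the producer name as the key for deduplication
Producer<byte[]> producer = pulsarClient.newProducer()
         .producerName("producer-1")
         .topic("persistent://public/default/topic-1")
         // 0 disables the send timeout (the default is 30 seconds)
         .sendTimeout(0, TimeUnit.SECONDS)
         .create();`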

Functional principle

For each message the client sends, a unique, incrementally generated Sequence ID is placed in the message metadata and transmitted to the broker. The client producer also maintains a queue of sent PendingMessages. When it receives the send ack from the broker, it removes the entry with the same Sequence ID from PendingMessages and considers the message successfully produced. When Message Deduplication is enabled on the broker, the broker checks whether each received message request is a repeat.
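The producer-side bookkeeping described above can be sketched roughly as follows. This is a simplified illustration, not Pulsar's client code: the class `PendingSendSketch`, its `Pending` record, and the method names are hypothetical, and a real client sends the Sequence ID over the network inside the message metadata.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of producer-side bookkeeping (not Pulsar's actual classes). */
class PendingSendSketch {
    /** A message that has been sent but not yet acknowledged by the broker. */
    static final class Pending {
        final long sequenceId;
        final byte[] payload;

        Pending(long sequenceId, byte[] payload) {
            this.sequenceId = sequenceId;
            this.payload = payload;
        }
    }

    private long nextSequenceId = 0; // incremented for every send call
    private final Deque<Pending> pending = new ArrayDeque<>();

    /** Assigns the next Sequence ID and keeps the message until it is acked. */
    long send(byte[] payload) {
        long id = nextSequenceId++;
        pending.addLast(new Pending(id, payload));
        return id;
    }

    /** Handles a broker ack; returns true if a pending entry with that ID was removed. */
    boolean onAck(long sequenceId) {
        return pending.removeIf(p -> p.sequenceId == sequenceId);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Note that in this sketch two `send` calls with an identical payload still receive two different Sequence IDs, because the ID identifies the send request, not the message body.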

The logic of the judgment is as follows:

  1. For each producer, the broker uses the producer name as the key and stores the maximum Sequence ID of produced messages in two dimensions, currently received and already persisted:

/* highest Sequence ID received so far */ ConcurrentOpenHashMap<String, Long> highestSequencedPushed
/* highest Sequence ID already stored and processed */ ConcurrentOpenHashMap<String, Long> highestSequencedPersisted

  2. Each time the broker receives a request to produce a message, it checks whether the request is a repeat, that is, whether the newly received Sequence ID is greater than the Sequence IDs saved under the same producer name in both dimensions. If it is greater, the message is not a repeat. If it is less than or equal, the message is a repeat. For a repeated message, the broker returns directly and skips the subsequent storage processing.
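The check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the broker's actual implementation: the class `DedupCheckSketch` and its methods are hypothetical, the standard `ConcurrentHashMap` stands in for Pulsar's `ConcurrentOpenHashMap`, and the real broker handles more cases than this two-way decision.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the per-producer Sequence ID check described in the text. */
class DedupCheckSketch {
    // Highest Sequence ID received so far, keyed by producer name.
    private final Map<String, Long> highestSequencedPushed = new ConcurrentHashMap<>();
    // Highest Sequence ID already written to storage, keyed by producer name.
    private final Map<String, Long> highestSequencedPersisted = new ConcurrentHashMap<>();

    /**
     * Returns true if the request repeats an earlier one and should be dropped.
     * Otherwise records the Sequence ID as received and returns false.
     * The read-then-write here is not atomic; that is acceptable for a sketch only.
     */
    boolean isDuplicate(String producerName, long sequenceId) {
        Long pushed = highestSequencedPushed.get(producerName);
        Long persisted = highestSequencedPersisted.get(producerName);
        boolean repeated = (pushed != null && sequenceId <= pushed)
                || (persisted != null && sequenceId <= persisted);
        if (!repeated) {
            highestSequencedPushed.put(producerName, sequenceId);
        }
        return repeated;
    }

    /** Called once the message has been stored successfully. */
    void markPersisted(String producerName, long sequenceId) {
        highestSequencedPersisted.merge(producerName, sequenceId, Long::max);
    }
}
```

The message body never enters the decision in this sketch: only the producer name and the Sequence ID do, which is why the same payload sent under a new Sequence ID is treated as a new message.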

The above introduces the configuration and implementation principles of Pulsar's Message Deduplication feature. As can be seen, Message Deduplication on the Pulsar broker side does not deduplicate by message body. Rather, provided that the client does not configure a send timeout, the broker guarantees, within a certain time range, the uniqueness of messages delivered by the client under the same producer name with the same Sequence ID.

Summary

Since version 0.11.0.0, Kafka has provided idempotent processing and transaction-like message handling to support exactly-once semantics in two scenarios: within a single topic and across multiple topics. Interested readers can refer to Kafka's source code and the introduction on its official website.

Pulsar's Message Deduplication feature is similar in implementation to Kafka's single-topic guarantee for exactly-once semantics, and can also be considered an implementation of exactly-once semantics.

It is important to note here that exactly-once does not equal message deduplication. In actual development, both the producing side and the consuming side may generate duplicate messages.

Until the producer receives a clear confirmation that a message was produced successfully, the storage state of that message on the server side is indeterminate.

For example, if the producer does not receive a response within a certain period of time and chooses to resend, the server may end up with two or more copies of the message.
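The resend scenario can be illustrated with a small simulation. Everything here is hypothetical: `StoreEverythingBroker` models a server without deduplication whose first ack is lost, and `sendWithRetry` models a producer that retries until it sees an ack.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of a lost ack followed by a resend. */
class LostAckSketch {
    /** A broker without deduplication: every request it receives is stored. */
    static final class StoreEverythingBroker {
        final List<String> stored = new ArrayList<>();
        private boolean dropNextAck;

        StoreEverythingBroker(boolean dropFirstAck) {
            this.dropNextAck = dropFirstAck;
        }

        /** Stores the message, then reports whether the ack reached the producer. */
        boolean publish(String message) {
            stored.add(message);
            if (dropNextAck) {
                dropNextAck = false;
                return false; // stored, but the producer never learns about it
            }
            return true;
        }
    }

    /** Retries until an ack arrives; returns the attempt count, or -1 if none succeeded. */
    static int sendWithRetry(StoreEverythingBroker broker, String message, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (broker.publish(message)) {
                return attempt;
            }
        }
        return -1;
    }
}
```

With a dropped first ack, the producer succeeds on its second attempt, and the simulated broker holds two copies of the same message.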

In addition, the consumer may also receive repeatedly pushed messages in the following scenarios:

  1. The consumer restarts after it has already consumed a message, but the broker has not received the ack or the consumer has not yet sent it;

  2. The broker restarts. Because consumers' ack information is not persisted in real time, a small number of already consumed messages may be pushed again after the restart;

  3. Consumption fails and the client responds with reconsumeLater or negativeAcknowledge, so the broker pushes the message again.

Therefore, when choosing the features of message middleware, you need to pay attention to the relevant scenarios and limitations. Avoid unnecessary business impact due to duplicate messages.
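One common way to limit the business impact of redelivered messages is to make the consumer's handling idempotent. The sketch below is only an illustration under stated assumptions: the class `IdempotentHandlerSketch` is hypothetical, the business key (for example an order ID carried in the message) is assumed to exist, and the in-memory set would have to be replaced by durable storage to survive a consumer restart.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical consumer-side guard that runs an action at most once per business key. */
class IdempotentHandlerSketch {
    // Keys of messages that were already processed (in memory only in this sketch).
    private final Set<String> processedKeys = new HashSet<>();

    /**
     * Runs the action only the first time a key is seen.
     * Returns true if the action ran, false if the message was a repeat.
     */
    boolean handleOnce(String businessKey, Runnable action) {
        if (!processedKeys.add(businessKey)) {
            return false; // already handled, skip the side effect
        }
        action.run();
        return true;
    }
}
```

A redelivered message with the same key is then acknowledged without repeating the side effect.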

One more thing

TDMQ Pulsar version, the message middleware that Tencent Cloud has developed on top of Apache Pulsar, has excellent cloud-native and serverless features, is compatible with Pulsar's components and concepts, and has the underlying advantages of separated compute and storage and flexible scaling. The TDMQ Pulsar version is now commercially available. Readers interested in Pulsar can visit the official website for details.

