Apache Pulsar Technology Series - Principles of Pulsar Transaction Implementation

Introduction

Apache Pulsar is a multi-tenant, high-performance inter-service messaging solution that supports multi-tenancy, low latency, read-write separation, cross-region replication, rapid scaling, and flexible fault tolerance. The Tencent Cloud MQ Oteam Pulsar working group has studied Pulsar in depth and made a large number of performance and stability optimizations; it is now in production in TDBank and Tencent Cloud TDMQ. This article briefly introduces the concepts and principles behind Pulsar transactions.

About the Author

Lin Lin

Tencent Cloud Middleware Expert Engineer

Apache Pulsar PMC member and author of "In-depth Analysis of Apache Pulsar". He currently focuses on middleware and has extensive experience with message queues and microservices. He is responsible for the design and development of TDMQ and is committed to building stable, efficient, and scalable foundational components and services.

Foreword

Before transaction messages were introduced, the strongest delivery guarantee Pulsar supported was exactly-once persistence of a producer's messages on a single partition, achieved through the broker's message deduplication mechanism. When the Producer fails to send a message and retries, the Broker ensures the message is persisted only once. However, with a partitioned topic, the Producer has no way to guarantee the atomicity of messages across multiple partitions.

When a Broker goes down, the Producer may fail to send a message; if the Producer does not retry or exhausts its retries, the message is never written to Pulsar. On the consumer side, message acknowledgement is currently a best-effort operation and is not guaranteed to succeed. If an acknowledgement fails, the message is redelivered and the consumer receives duplicates, so Pulsar can only guarantee that Consumers consume a message at least once.

Similarly, Pulsar Functions only guarantees exactly-once processing of a single message on an idempotent function, i.e. the business logic must itself be idempotent. It cannot guarantee that processing multiple messages, or producing multiple outputs, happens exactly once.

For example, suppose a Function executes these steps: consume messages from Topic-A1 and Topic-A2, aggregate them inside the Function (say, a time-window aggregation), write the result to Topic-B, and finally acknowledge (ACK) the messages in Topic-A1 and Topic-A2. The Function may fail between "write the result to Topic-B" and "acknowledge the messages", or even while acknowledging a single message. All (or some) of the messages from Topic-A1 and Topic-A2 would then be redelivered and reprocessed, a new result would be generated, and the calculation for the whole time window would be wrong.

Therefore, Pulsar needs a transaction mechanism to guarantee exactly-once semantics: production and consumption each happen exactly once, with no duplication and no data loss, even when a Broker goes down or a Function fails mid-processing.

Transaction Overview

Pulsar transaction messages were originally designed to guarantee exactly-once semantics for Pulsar Functions. Transactions ensure that when a Producer sends multiple messages to different partitions, they all succeed or all fail together; likewise, when a Consumer acknowledges multiple messages, the acknowledgements all succeed or all fail together. Production and consumption can also be combined in the same transaction, which again either succeeds or fails as a whole.

Let's take the Function scenario from the beginning of this section as an example and demonstrate production and consumption within the same transaction:

First, we need to enable transactions in broker.conf.

transactionCoordinatorEnabled=true
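
Note that, depending on the Pulsar version, the transaction buffer snapshot relies on system topics, so system topics may also need to be enabled. This is an assumption for 2.8/2.9-era brokers; newer releases enable system topics by default:

systemTopicEnabled=true
transactionCoordinatorEnabled=true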

Then we create the PulsarClient and the transaction object. Both the producer and the consumer APIs must carry this transaction object to ensure they participate in the same transaction.

// Create the client and enable transactions
PulsarClient pulsarClient = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .enableTransaction(true)
        .build();

// Create the transaction
Transaction txn = pulsarClient
        .newTransaction()
        .withTransactionTimeout(1, TimeUnit.MINUTES)
        .build()
        .get();

String sourceTopic = "public/default/source-topic";
String sinkTopic = "public/default/sink-topic";

// Create the consumer and the producer
Consumer<String> sourceConsumer = pulsarClient.newConsumer(Schema.STRING)
        .topic(sourceTopic)
        .subscriptionName("my-sub")
        .subscribe();

Producer<String> sinkProducer = pulsarClient.newProducer(Schema.STRING)
        .topic(sinkTopic)
        .create();

// Consume one message from the source topic and send a message to another topic,
// both within the same transaction
Message<String> message = sourceConsumer.receive();
sinkProducer.newMessage(txn).value("sink data").sendAsync();
sourceConsumer.acknowledgeAsync(message.getMessageId(), txn);

// Commit the transaction
txn.commit().get();

Let's return to the Function example from the beginning of this section:

Without transactions, if the Function first writes the result to the SinkTopic but the message acknowledgement fails (Step-4 in the figure below), the message is redelivered (Step-1 in the figure below), the Function recalculates a result and sends it to the SinkTopic again, and the same piece of data is computed and delivered twice.

Still without transactions, if the Function instead acknowledges the message first and then writes the data to the SinkTopic (Step-4 before Step-3), and the write to the SinkTopic fails while the SourceTopic message has already been acknowledged, the data is lost and the final result is inaccurate.

With a transaction, as long as the final commit has not happened, all preceding steps can be rolled back: both the produced messages and the acknowledgements are rolled back, so the whole flow can be repeated without duplicate computation or data loss. The full sequence diagram is shown below:

[Figure: sequence diagram of a Function consuming, producing, and acknowledging within one transaction]
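
To make the rollback path concrete, here is a minimal sketch that aborts the transaction whenever any step fails, reusing the pulsarClient, sourceConsumer, and sinkProducer created above; the process() aggregation helper is a hypothetical stand-in for the Function's own logic:

Transaction txn = pulsarClient.newTransaction()
        .withTransactionTimeout(1, TimeUnit.MINUTES)
        .build()
        .get();
try {
    // Step-1/2: receive a message from the source topic
    Message<String> msg = sourceConsumer.receive();
    // Step-3: write the (hypothetical) aggregated result to the sink topic inside the transaction
    sinkProducer.newMessage(txn).value(process(msg.getValue())).send();
    // Step-4: acknowledge the source message in the same transaction
    sourceConsumer.acknowledgeAsync(msg.getMessageId(), txn).get();
    // Step-5: commit; only now do the produced message and the acknowledgement take effect
    txn.commit().get();
} catch (Exception e) {
    // Any failure before the commit rolls back both the produced message and the acknowledgement,
    // so the source message is simply redelivered and processed again
    txn.abort();
}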

Once we understand what each of these steps does, we understand how the whole transaction is implemented. The following subsections walk through them one by one.

Transaction Flow

Before walking through the whole flow, let's first introduce the components involved in a Pulsar transaction. A typical distributed transaction involves components such as the TM, RM, and TC:

  1. TM: the transaction initiator. It defines the transaction boundary and is responsible for telling the TC when a distributed transaction starts, commits, or rolls back. In a Pulsar transaction, each PulsarClient plays this role.

  2. RM: the resource manager on each node. It manages the resources of each branch transaction; each RM is registered with the TC as a branch of the transaction. In Pulsar, TopicTransactionBuffer and PendingAckHandle are defined to manage production-side and consumption-side resources respectively.

  3. TC: the transaction coordinator. The TC handles transaction requests from the Pulsar Client and tracks their transaction status. Each TC is identified by a unique ID (TCID) and maintains its own transaction metadata store independently. The TCID is used to generate transaction IDs and to broadcast commit and rollback notifications to the different nodes.

Below we use a Producer to walk through the entire transaction flow. The gray parts in the figure represent storage; there are two storage implementations, in-memory and BookKeeper:

[Figure: transaction flow between the Producer, the TC, and the RMs, with storage components shown in gray]

  1. Select a TC. A Pulsar cluster may have multiple TCs (16 by default). When creating a transaction, PulsarClient needs to choose which TC to use, and all subsequent operations on that transaction (create, commit, rollback, and so on) are sent to this TC. The selection rule is simple: because the TC topic name is fixed, the client first looks up the Brokers that own all of its partitions (each partition is a TC), and then, each time a new transaction is created, picks the next TC in round-robin order (a simplified selection sketch follows this list).

  2. Open the transaction. In code, a transaction is opened through pulsarClient.newTransaction(); the Client sends a newTxn command to the chosen TC, and the TC generates and returns the ID object of a new transaction. This object stores the TC's ID (so later requests can find the right node) and the transaction ID itself, which increments monotonically, so IDs generated by the same TC never repeat.

  3. Register the partitions. The topic may be a partitioned topic, so messages may be sent to different Broker nodes. To let the TC know which nodes the messages will reach (the TC must notify them when the transaction is later committed or rolled back), the Producer registers the partition information with the TC before sending any message. The TC then knows which nodes' RMs to notify on commit and rollback.

  4. Send the messages. This step is not very different from ordinary message sending, except that the messages first pass through the RM on each Broker; in Pulsar the RM is defined as the TopicTransactionBuffer. The RM records some metadata, and the message is still ultimately written to the original topic. At this point, although the message has been written to the original topic, it is invisible to consumers: the transaction isolation level in Pulsar is Read Committed.

  5. Commit the transaction. After the Producer has sent all its messages, it commits the transaction. When the TC receives the commit request, it broadcasts to the RM nodes, telling them to commit the transaction and update the corresponding metadata so the messages become consumable.
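
As a rough illustration of step 1, the sketch below shows round-robin selection over the TC partitions; the class and method names here are hypothetical, not Pulsar's actual TransactionCoordinatorClient internals:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: pick the next TC (one per partition of the TC topic) in round-robin order.
class TcSelector {
    private final int numCoordinators;              // 16 by default
    private final AtomicLong counter = new AtomicLong();

    TcSelector(int numCoordinators) {
        this.numCoordinators = numCoordinators;
    }

    // Each new transaction is assigned to the next TC in turn.
    long nextTcId() {
        return counter.getAndIncrement() % numCoordinators;
    }
}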

How is the message in Step-4 persisted to the topic yet kept invisible?

Each topic stores a maxReadPosition attribute that marks the maximum position consumers are currently allowed to read. Before the transaction commits, the data has already been persisted to the topic, but maxReadPosition does not advance, so Consumers cannot consume the uncommitted data.
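
A minimal sketch of this gating, with simplified types standing in for Pulsar's managed-ledger entries and positions (this is not the real dispatcher code):

import java.util.ArrayList;
import java.util.List;

// Simplified sketch: only entries at or before maxReadPosition are deliverable.
class ReadGate {
    record Position(long ledgerId, long entryId) implements Comparable<Position> {
        public int compareTo(Position o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(entryId, o.entryId);
        }
    }
    record Entry(Position position, byte[] payload) {}

    List<Entry> readable(List<Entry> entries, Position maxReadPosition) {
        List<Entry> visible = new ArrayList<>();
        for (Entry e : entries) {
            if (e.position().compareTo(maxReadPosition) > 0) {
                break;               // past maxReadPosition: uncommitted data, stop here
            }
            visible.add(e);          // at or before maxReadPosition: deliverable
        }
        return visible;
    }
}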

The message has already been persisted, but the transaction ends up being rolled back. How is that data handled?

If the transaction is rolled back, the RM records it as aborted. The metadata of each message carries the transaction ID and related information, and the Dispatcher uses that ID to decide whether the message should be delivered to the Consumer. If the transaction turns out to have been aborted, the message is simply filtered out (and acknowledged internally).
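
A simplified sketch of this filtering, assuming the set of aborted transaction IDs has already been rebuilt in memory; the types here are illustrative stand-ins, not Pulsar's actual classes:

import java.util.Set;

// Illustrative sketch: before dispatching, drop messages whose transaction was aborted.
// TxnId is a simplified stand-in for Pulsar's transaction ID (most/least significant bits).
class AbortFilter {
    record TxnId(long mostBits, long leastBits) {}

    boolean shouldDeliver(TxnId txnIdOfMessage, Set<TxnId> abortedTxns) {
        if (txnIdOfMessage == null) {
            return true;                          // not a transactional message
        }
        // Aborted transaction: filter the message out and acknowledge it internally.
        return !abortedTxns.contains(txnIdOfMessage);
    }
}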

What happens if, at final commit time, part of the transaction succeeds and part of it fails?

The TC has a timer object called TransactionOpRetryTimer. Every transaction whose broadcast did not fully succeed is handed to it for retries, until all nodes eventually succeed or the retry limit is exceeded. Does this create a consistency problem? Consider when this situation arises: usually some Broker nodes are down, or network jitter makes them temporarily unreachable. In Pulsar, if a Broker goes down, ownership of its topics is transferred; unless the entire cluster is unavailable, a new Broker can always be found and the problem is resolved by retrying. While topic ownership is being transferred, maxReadPosition does not change and consumers cannot consume the messages. Even if the entire cluster becomes unavailable, once it recovers the Timer will still commit the transaction through retries.

Will the consumption of ordinary messages be blocked by an unfinished transaction?

Yes. Suppose we open a transaction, send a few transactional messages, but neither commit nor roll back the transaction, and then keep sending ordinary messages to the topic. Since the transactional messages have not been committed, maxReadPosition does not advance, consumers cannot consume the newer messages, and consumption of the ordinary messages is blocked. This is expected behavior, to guarantee message ordering. Different topics do not affect each other, because each topic has its own maxReadPosition.

Transaction Implementation

We can divide the implementation of transactions into five parts: environment, TC, the producer-side RM, the consumer-side RM, and the client. Since production and consumption resources are managed separately, we introduce them separately.

Environment settings

The setup of the transaction coordinator starts with the initialization of the Pulsar cluster. We introduced how to build the cluster in Chapter 1. When a cluster is created for the first time, you need to execute a command to initialize the cluster metadata in ZooKeeper. At that point, Pulsar automatically creates a SystemNamespace and creates a topic in it; the complete topic name is as follows:

persistent://pulsar/system/transaction_coordinator_assign

This is a partitioned topic with 16 partitions by default, and each partition is an independent TC. The number of TCs can be set via the --initial-num-transaction-coordinators parameter.

TC and RM

Next, let's look at the transaction components on the server side, shown in the figure below:

[Figure: server-side transaction components on the Broker]

  • TransactionMetadataStoreService is the overall transaction coordinator on the Broker; we can think of it as the TC.

  • TransactionMetadataStore is used by the TC to store transaction metadata, such as newly created transactions and the partitions registered by the Producer. The interface has two implementations: one persists the data to BookKeeper, the other keeps it directly in memory.

  • TransactionTimeoutTracker is used by the server to track transactions that have timed out.

  • The various Providers are factory classes and need no special attention.

  • TopicTransactionBuffer is the Producer's RM. When a transactional message is sent to the Broker, the RM acts as a proxy, records some metadata, and stores the message in the original topic. It contains TopicTransactionBufferRecover and TransactionBufferSnapshotService: the RM's metadata is organized into snapshots and flushed periodically, and these two objects are responsible for snapshot recovery and snapshot persistence respectively. Since produced messages live in topics, there is one RM per topic/partition.

  • PendingAckHandle is the Consumer's RM. Since consumption is tracked per subscription, there is one per subscription.

Since production environments usually use persistent transactions, the following principles are described in terms of the persistent implementation.

All transaction-related services are initialized when the BrokerService starts. Within the TC topic, each partition is itself a topic; when TransactionMetadataStoreService is initialized, it restores the previously persisted metadata from BookKeeper for the TC partitions managed by the current Broker. Each TC supports the following operations (a simplified interface sketch follows this list):

  • newTransaction. Create a new transaction and return the ID object of the new transaction.

  • addProducedPartitionToTxn. Register the partitions the producer will send messages to; used later by the TC to notify the corresponding nodes' RMs to commit/roll back the transaction.

  • addAckedPartitionToTxn. Register the partitions whose messages the consumer will acknowledge; used later by the TC to notify the corresponding nodes' RMs to commit/roll back the transaction.

  • endTransaction. End a transaction, which can be a commit, a rollback, or a timeout.
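
The sketch below mirrors these operations as a hypothetical Java interface; the real server-side interface is TransactionMetadataStore, and the names and signatures here are simplified assumptions rather than the actual API:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified view of what a TC exposes (not Pulsar's real interface).
interface TransactionCoordinatorSketch {
    // Create a new transaction and return its ID (TC ID plus a locally incrementing ID).
    CompletableFuture<Long> newTransaction(long timeoutMillis);

    // Record which partitions the producer will write to within this transaction.
    CompletableFuture<Void> addProducedPartitionToTxn(long txnId, List<String> partitions);

    // Record which subscriptions the consumer will acknowledge on within this transaction.
    CompletableFuture<Void> addAckedPartitionToTxn(long txnId, List<String> subscriptions);

    // End the transaction: commit on success, abort on rollback or timeout.
    CompletableFuture<Void> endTransaction(long txnId, boolean commit);
}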

When we initialize a PulsarClient with enableTransaction=true, the client additionally initializes a TransactionCoordinatorClient. Since the tenant, namespace, and topic name of the TC topic are fixed, the TC client can look up all of its partition information and cache it locally. When the client later creates a transaction, it round-robins over this cached list to pick the TC to use.

Producer transaction management

Next we will start a transaction:

// Create the transaction
Transaction txn = pulsarClient
        .newTransaction()
        .withTransactionTimeout(1, TimeUnit.MINUTES)
        .build()
        .get();

In the code above, a newTxn command is sent to a TC and a Transaction object is returned.

When opening the transaction, TransactionCoordinatorClient picks a TC from its cache and sends a newTxn command to the Broker that owns the selected TC. The command is defined as follows:

message CommandNewTxn {
    required uint64 request_id = 1;
    optional uint64 txn_ttl_seconds = 2 [default = 0];
    optional uint64 tc_id = 3 [default = 0];
}

Since the command carries the TCID, there is no problem even if multiple TCs are managed by the same Broker: the Broker finds the corresponding TC by TCID and processes the request.

Before the Producer sends a message, it first sends an AddPartitionToTxn command to the Broker, and only after that succeeds does it send the real messages. When a transactional message arrives at the Broker, it is handed to the TransactionBuffer for processing. The Broker still deduplicates the message as usual; after this check, the data is saved by the TransactionBuffer. The TransactionBuffer is only a proxy (it keeps some metadata) and eventually calls the original topic to store the message; its constructor takes the original topic object when it is initialized. We can think of the TransactionBuffer as the RM on the Producer side.

The TransactionBuffer keeps two kinds of information. One is the original messages, which are stored directly in the topic. The other is a snapshot, which stores the topic name, the maximum readable position (to prevent Consumers from reading uncommitted data), and the list of aborted transactions in this topic.

Aborted transactions are broadcast by the TC to the other Broker nodes. When a TransactionBuffer receives this notification, it writes an abort marker directly into the original topic to mark the transaction as aborted, and then updates the list in memory. The abort marker is an ordinary message whose header metadata differs from normal messages. This data is kept in snapshots mainly so the Broker can recover quickly after a restart. If the snapshot data is lost, TopicTransactionBufferRecover reads all the data in the topic from the end back to the beginning and updates the in-memory abort list every time it encounters an abort marker; with a snapshot, it only needs to read from the position recorded in the snapshot to restore the data.
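
A minimal sketch of this recovery scan, shown here as a simple forward scan using the client Reader API purely for illustration (the isAbortMarker and txnIdOf helpers are hypothetical; the real recovery runs inside the Broker against the managed ledger):

import java.util.Set;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Reader;

// Illustrative sketch: rebuild the in-memory aborted-transaction list by scanning
// the topic from the snapshot's recorded position (or the whole topic when no snapshot exists).
class AbortListRecoverySketch {
    void recover(Reader<byte[]> reader, Set<Long> abortedTxns) throws Exception {
        while (reader.hasMessageAvailable()) {
            Message<byte[]> msg = reader.readNext();
            if (isAbortMarker(msg)) {          // hypothetical: checks the marker type in the header metadata
                abortedTxns.add(txnIdOf(msg)); // hypothetical: extracts the transaction ID from the header
            }
        }
    }

    boolean isAbortMarker(Message<byte[]> msg) { /* inspect header metadata */ return false; }
    long txnIdOf(Message<byte[]> msg) { /* read the txn ID from the header */ return 0L; }
}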

Consumer transaction management

The consumer needs to carry the transaction object when acknowledging a message, which marks the acknowledgement as a transactional ack:

consumer.acknowledge(message, txn);

Each subscription on the server has a PendingAckHandle object that manages transactional ack information; we can think of it as the RM that manages consumer-side data. When the Broker sees that an acknowledgement request carries transaction information, it forwards the request to the corresponding PendingAckHandle.

Acknowledgements made within a transaction do not directly move the mark-delete position on the cursor. Instead, they are first persisted to an additional ledger, and a copy is also cached in the Broker's memory. This ledger is managed by the pendingAckStore; we can think of it as the Consumer RM's log.

When the transaction is committed, the RM asks the consumer's Subscription to perform all of the acknowledgements recorded earlier, and at the same time writes a special marker into the log ledger to indicate that the transaction is to be committed. When the transaction is rolled back, an abort marker is first recorded in the log, and then the messages are redelivered.

The log kept in the pendingAckStore is a redo log. When the component is initialized, it first reads all redo log entries from the log ledger and rebuilds the earlier acknowledgement state in memory. Because message acknowledgement is an idempotent operation, if the Broker crashes unexpectedly it only needs to re-execute the operations in the redo log. Once the messages in the subscription are actually acknowledged, the corresponding redo log entries in the pendingAckStore can be cleaned up; cleanup is simply a matter of moving the mark-delete position of the pendingAckStore's ledger forward.
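
A minimal, self-contained sketch of this replay, with all type and method names invented for illustration (they are not Pulsar's real pendingAckStore classes):

import java.util.*;

// Replay the pending-ack redo log on startup; acknowledging is idempotent, so re-applying is safe.
class PendingAckReplaySketch {
    enum EntryType { INDIVIDUAL_ACK, COMMIT_MARKER, ABORT_MARKER }
    record LogEntry(EntryType type, long txnId, long position) {}

    final Map<Long, Set<Long>> pendingAcks = new HashMap<>();

    void replay(List<LogEntry> redoLog) {
        for (LogEntry e : redoLog) {
            switch (e.type()) {
                case INDIVIDUAL_ACK ->
                    // Rebuild the in-memory ack state for this transaction.
                    pendingAcks.computeIfAbsent(e.txnId(), id -> new HashSet<>()).add(e.position());
                case COMMIT_MARKER ->
                    // Committed: the acks move onto the subscription cursor.
                    commitToCursor(pendingAcks.remove(e.txnId()));
                case ABORT_MARKER ->
                    // Aborted: drop the acks so the messages are redelivered.
                    pendingAcks.remove(e.txnId());
            }
        }
    }

    void commitToCursor(Set<Long> positions) { /* mark-delete these positions on the cursor */ }
}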

Let's talk about TC again

Transactions are committed or rolled back either because the Client notifies the TC or because the TC detects a timeout. The TC's log stores which partitions the Producer's messages were sent to and which partitions the Consumer will acknowledge on. The RMs are spread across the Brokers and record the messages sent and the messages to be acknowledged over the course of the transaction. When the transaction ends, the TC uses the transaction ID as the key to find all of this metadata, learns from it which RMs on which Brokers must be notified, and finally broadcasts to those Brokers telling their RMs to commit or roll back the transaction.
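
As a rough sketch of that final step (all names here are hypothetical stand-ins for the TC's internal logic, not real Pulsar classes):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: on end-of-transaction, look up the recorded metadata and
// broadcast commit/abort to every registered producer partition and acked subscription.
class EndTxnSketch {
    record TxnMeta(List<String> producedPartitions, List<String> ackedSubscriptions) {}

    final Map<Long, TxnMeta> txnMetaStore = Map.of(); // stand-in for the TC's metadata store

    CompletableFuture<Void> endTransaction(long txnId, boolean commit) {
        TxnMeta meta = txnMetaStore.get(txnId);
        List<CompletableFuture<Void>> ops = new ArrayList<>();
        for (String partition : meta.producedPartitions()) {
            ops.add(notifyTransactionBuffer(partition, txnId, commit));   // producer-side RM
        }
        for (String subscription : meta.ackedSubscriptions()) {
            ops.add(notifyPendingAckHandle(subscription, txnId, commit)); // consumer-side RM
        }
        return CompletableFuture.allOf(ops.toArray(new CompletableFuture[0]));
    }

    CompletableFuture<Void> notifyTransactionBuffer(String partition, long txnId, boolean commit) {
        return CompletableFuture.completedFuture(null); // a network call in reality, retried on failure
    }

    CompletableFuture<Void> notifyPendingAckHandle(String subscription, long txnId, boolean commit) {
        return CompletableFuture.completedFuture(null); // a network call in reality, retried on failure
    }
}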

Conclusion

There are many more design details in Pulsar. Due to limited space, the author will organize a series of articles for further technical sharing, so stay tuned. If you want to learn Pulsar systematically, you can buy the author's new book, "In-depth Analysis of Apache Pulsar".


One more thing

Tencent Cloud Message Queue TDMQ for Pulsar (TDMQ Pulsar edition) is now officially commercially available. It is a message middleware product built on Apache Pulsar, with strong cloud-native and serverless characteristics. It is compatible with Pulsar's components and concepts and inherits the underlying advantages of compute-storage separation and flexible scaling.

If you want to know more, please visit the official website.
