A Detailed Look at the Apache Pulsar Message Life Cycle

Article Summary

This article is based on the talk "Deep Dive into Apache Pulsar Lifecycle" given by Ran Xiaolong, senior R&D engineer at Tencent Cloud, at Pulsar Summit Asia 2022. In Apache Pulsar, the Topic is the abstraction that carries the messages sent by users. After a message is sent to a Topic, the Broker processes it and stores it in Bookies. This article explains in detail how messages are sent to the Broker and, after the Broker's computation and metadata handling, are finally stored in Bookies; how Bookie's garbage collection mechanism reclaims the data of a Topic; and how the TTL and Retention policies in the Broker act on the BookKeeper client to trigger garbage collection.

About the Author

Ran Xiaolong is a senior R&D engineer at Tencent Cloud, an Apache Pulsar Committer, a RoP maintainer, and the author and main maintainer of the Apache Pulsar Go Client, Pulsarctl, and Go Functions.

Outline

This article is divided into the following sections:

  1. Looking at the message sending and receiving process from the user's perspective
  2. TTL and Retention policies (closely related to the message life cycle)
  3. Looking at the message storage model from the perspective of a Topic
  4. The Bookie GC reclamation mechanism
  5. How dirty data such as orphaned Ledgers is generated
  6. How to clean up dirty data

Sections 1, 2, and 3 mainly analyze the principles at the Broker level, while sections 5 and 6 analyze how dirty data is generated and cleaned up based on problems encountered in production environments.

Message sending and receiving process from the user's perspective

From the user's perspective, a message queue can be understood as a Pub-Sub model. The Broker abstracts the concept of a Topic: producers send messages to the Topic, and consumers then read messages from it for consumption.

(figure)

First, you need to understand two concepts: the Pending Queue and the Receive Queue.

  • Pending Queue: a concept on the sending side. Messages are not delivered to the Broker one by one as they are sent; instead, a Pending Queue is maintained locally on the client. All outgoing data first enters the Pending Queue and is then sent on to the Broker.

  • Receive Queue: a concept on the receiving side. The principle is the same as the Pending Queue: the consumer does not request data from the Broker one message at a time; instead, a Receive Queue is maintained locally. Data enters the Receive Queue in batches, and Pulsar's message push-pull mechanism keeps refilling the Receive Queue to drive the overall flow. (Both queue sizes can be tuned on the client, as sketched below.)
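
As an illustration, the sizes of these two queues can be configured on the Pulsar Java client. The following is a minimal sketch; the service URL, topic name, and queue sizes are placeholders.

```java
import org.apache.pulsar.client.api.*;

public class QueueSizesExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder address
                .build();

        // maxPendingMessages controls the local Pending Queue on the producer side
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/demo-topic")
                .maxPendingMessages(1000)
                .create();

        // receiverQueueSize controls the local Receive Queue on the consumer side
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/demo-topic")
                .subscriptionName("demo-sub")
                .receiverQueueSize(1000)
                .subscribe();

        producer.send("hello".getBytes());
        Message<byte[]> msg = consumer.receive();
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}
```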

In Pulsar, the Broker does not parse batch messages, so the Broker cannot tell whether a given message is a batch. This is why the concept of an Entry is introduced: an Entry may contain a batch of messages or a single non-batched message.

The figure below is a more in-depth architecture diagram from the user's perspective. Producers and consumers can be understood as the Client side, and the Client sends messages to the Broker. The Broker, in turn, can be understood as a BookKeeper client, and the BookKeeper client writes, reads, and deletes data on the Bookies. Both BookKeeper and the Broker have a metadata store; ZooKeeper is currently the most widely used choice, and it holds all node information, such as node scheduling information.

(figure)

Let's analyze the overall flow of data from the Client to the Broker and then to BookKeeper. The BookKeeper storage layer has a relatively simple and focused role: as a distributed log storage system, it exposes only add, read, and delete operations to the upper-level system. With these operations in mind, the operation chain from the Client to BookKeeper looks as follows:

  • Send -> Broker -> addEntry -> Bookie: the Client sends a Send command to the Broker, and the Broker calls BookKeeper's addEntry interface to write the message to a Bookie.
  • Receive -> Broker -> readEntry -> Bookie: the Client sends a Receive command to the Broker, and the Broker calls BookKeeper's readEntry interface to read messages from a Bookie.
  • Ack -> Broker (TTL) -> move cursor (markDeletePosition) -> Bookie: the Client sends an Ack command to the Broker, and the Broker moves the cursor. In the Broker's abstraction of a Topic, messages are laid out one after another, and an Ack is essentially a cursor operation: the pointer advances with each Ack. The markDeletePosition pointer is abstracted here; all messages up to markDeletePosition have been consumed and acknowledged correctly.
  • Retention -> delete Ledger -> Bookie: once the Retention threshold is triggered, the Broker calls BookKeeper's delete Ledger interface to remove data from BookKeeper. Deletion is the key topic of this article; later we will describe in detail how data is deleted from BookKeeper after the Retention policy is triggered.

TTL and Retention Policy

First, the concepts of TTL policy and Retention policy need to be clarified.

TTL policy

The TTL policy means that if a message is not acknowledged by the consumer within the specified time, the Broker will acknowledge it on the consumer's behalf.

The Client exposes two interfaces on the consumer side: Receive and Ack. When the consumer receives a message, the Broker does not yet know that the user has processed it correctly; the user must call Ack to tell the Broker that the current message has been consumed successfully, so the Client initiates a one-way Ack request to notify the Broker to proceed. Regardless of whether a message has been delivered to a consumer, every message a producer sends to a Topic is subject to a TTL (time to live). All messages are controlled by the TTL, and once this time has elapsed, the Broker acknowledges the message in place of the user.

Note that none of the above involves deletion, because TTL does not perform any delete operation. The role of TTL is only to acknowledge, within the TTL window, the messages that should have been acknowledged. The actual deletion is governed by the Retention policy abstracted in Pulsar.
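
For reference, the message TTL can be configured per namespace through the Pulsar admin API. Below is a minimal sketch using the Java admin client; the admin URL, namespace, and 180-second TTL are placeholder values.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class SetMessageTtlExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();

        // Messages not acknowledged within 180 seconds are acknowledged by the Broker
        admin.namespaces().setNamespaceMessageTTL("public/default", 180);

        admin.close();
    }
}
```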

Retention policy

The Retention policy governs how long a message continues to be retained on the Bookie side after it has been acknowledged (by the consumer or by TTL), with the Ledger as the smallest unit of operation.

After a message has been acknowledged (by the consumer or by TTL), it falls under the Retention policy, that is, it is kept in BookKeeper for a certain period of time. For example, in offline messaging scenarios, data is often retained for a while for auditing and similar operations. Retention uses the Ledger as the smallest unit of operation, so deletion means deleting an entire Ledger.
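
Retention can likewise be configured per namespace through the admin API. A minimal sketch, with placeholder values of 180 minutes and 1024 MB:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class SetRetentionExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();

        // Keep acknowledged messages for up to 180 minutes and up to 1024 MB per topic
        admin.namespaces().setRetention("public/default",
                new RetentionPolicies(180, 1024));

        admin.close();
    }
}
```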

The figure below shows how messages are acknowledged within the TTL. In time period T1 there are 10 messages: m1 - m5 are acknowledged, m6 - m10 are not. In time period T2, assuming the TTL threshold of 3 minutes has been reached and the messages are still unacknowledged, m6 - m8 are picked up by the TTL check and actively acknowledged by the Broker. By time period T3, m6 - m8 have been acknowledged by the Broker. This is how the TTL policy behaves and the scope it covers.

(figure)

All policies in Pulsar are driven by a thread pool abstracted in the Broker and are executed periodically; for example, the TTL and Retention checks run every 5 minutes by default. The TTL policy periodically checks and advances the cursor position according to the configured time (equivalent to calling the Ack interface exposed on the consumer side), thereby expiring messages; the Retention policy checks the creation time of each Ledger and the size of its Entries to decide whether to delete that Ledger.

The TTL and Retention policies combine as follows to determine the message life cycle:

  • TTL time < Retention time: the life cycle of a message equals TTL time + Retention time. For example, with a TTL of 3 minutes and a Retention of 5 minutes, an unacknowledged message is removed roughly 8 minutes after it is written.

  • TTL time ≥ Retention time: the life cycle of a message equals the TTL time. One of the criteria in the TTL check is whether the Ledger has rolled over; if it has rolled over and the TTL time has been reached, the Ledger enters the deletion path of the Retention policy. So when the TTL time ≥ Retention time, the message life cycle is the TTL time.

The message storage model from the perspective of a Topic

(figure)

When it comes to the message storage model, the first concept you encounter is the Topic: producers send messages to a Topic, and consumers consume messages from it. The Topic abstracts the concept of Partitions. Multiple Partitions can be created within one Topic to increase concurrency; that is, the messages of one Topic can be distributed across multiple Partitions, which jointly carry the Topic's traffic.

In the Bookie storage layer, a Partition consists of multiple Ledgers; as shown in the figure, there are 5 Ledgers under Partition 3. A Ledger stores multiple Entries. As mentioned in the Entry concept above, depending on whether messages are batched, an Entry can be a batch Entry or a non-batch Entry. If the messages are batched, one Entry contains multiple Messages; if not, one Entry equals one Message. This is the storage model from the perspective of a Topic.
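
To make this hierarchy concrete, the sketch below creates a partitioned topic and then lists the Ledgers backing one of its partitions via the admin API; the topic name, partition count, and admin URL are placeholders.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistentTopicInternalStats;

public class TopicStorageExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();

        // One Topic -> multiple Partitions
        admin.topics().createPartitionedTopic("persistent://public/default/demo-topic", 4);

        // One Partition -> multiple Ledgers, visible in the internal stats
        PersistentTopicInternalStats stats =
                admin.topics().getInternalStats("persistent://public/default/demo-topic-partition-0");
        stats.ledgers.forEach(l ->
                System.out.println("ledgerId=" + l.ledgerId + " entries=" + l.entries + " size=" + l.size));

        admin.close();
    }
}
```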

The Bookie GC reclamation mechanism

The first three parts revolve around the Broker layer. As the compute layer, the Broker is essentially a Bookie client: it calls the add, delete, and read interfaces exposed by the Bookies, and its operation logic is simple. The following focuses on how the BookKeeper layer compacts and reclaims data.

Bookie compaction types

There are two types of compaction:

  • Automatic compaction: Bookie has a GC / Compaction thread that runs periodically. GC is divided into Minor GC and Major GC; the differences between the two are introduced in detail later.

  • Manual compaction: a GC request is triggered through the HTTP Admin REST API exposed by BookKeeper. This operation is very common in day-to-day emergency operations. For example, if a Bookie's disk usage suddenly spikes and you want to reclaim data urgently, you can skip the Minor GC and Major GC check intervals and trigger GC manually to free disk space (see the sketch after this list).
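
As a sketch of the manual path, the Bookie's HTTP admin endpoint can be called directly. This assumes the Bookie has its HTTP server enabled (httpServerEnabled=true) and listens on port 8080; the host, port, and exact endpoint behavior depend on your BookKeeper version and configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TriggerGcExample {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // PUT /api/v1/bookie/gc asks the bookie to run a garbage collection pass
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://bookie-host:8080/api/v1/bookie/gc")) // placeholder host/port
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```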

Bookie compaction modes

There are two ways to throttle Bookie compaction:

  • By Entry size (bytes)

    • isThrottleByBytes
    • compactionRateByBytes
  • By number of Entries (default)

    • compactionRateByEntries

In production environments it is recommended to throttle compaction by Entry size. Experience from real production environments shows that compacting about 100 MB at a time gives a relatively stable curve. Why is throttling by the number of Entries not recommended? First, as explained in the Entry concept above, an Entry may be a single message or a batch containing many Messages, so when throttling by count, the number of Messages compacted each round is not fixed. In addition, the Payload of each message differs, and the inconsistent message sizes mean the amount of data compacted each round varies, so the GC compaction curve is not stable. Bookie GC consumes disk IO, and each machine's disk IO is a fixed resource; in extreme cases, unsteady compaction spills over into the Bookie's main read/write path and affects stability. Throttling by Entry size yields a smooth compaction curve with little impact on stability. A configuration sketch follows.
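
A minimal sketch of these settings using BookKeeper's ServerConfiguration (the same keys can also be set in bookkeeper.conf); the rate values below are illustrative only.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

public class CompactionThrottleConfig {

    // Build a bookie configuration that throttles compaction by bytes
    static ServerConfiguration throttleByBytes() {
        ServerConfiguration conf = new ServerConfiguration();

        // Throttle compaction by bytes rather than by number of entries
        conf.setIsThrottleByBytes(true);

        // Compact roughly 100 MB worth of entries per throttling window (illustrative value)
        conf.setCompactionRateByBytes(100 * 1024 * 1024);

        // Only used when throttling by entry count instead (the default mode)
        conf.setCompactionRateByEntries(1000);

        return conf;
    }
}
```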

Minor GC and Major GC

In terms of code and implementation logic, Minor GC and Major GC are exactly the same; the difference between the two lies in when they are triggered and the thresholds they use.

|  | Minor GC | Major GC |
| --- | --- | --- |
| Compaction interval | 1 hour | 24 hours |
| Compaction threshold ratio | 20% (minorCompactionThreshold) | 80% (majorCompactionThreshold) |
| Maximum GC execution time | minorCompactionMaxTimeMillis | majorCompactionMaxTimeMillis |
  • The Minor GC interval is 1 hour; the Major GC interval is 24 hours.
  • The compaction threshold ratio refers to the proportion of useful data in an entry log: 20% for Minor GC and 80% for Major GC. When the proportion of useful data exceeds the threshold, the data is not reclaimed. Entry log files have a fixed size of 1.1 GB. If, for Major GC, more than 80% of an entry log is useful data, it can be understood that most of the data is useful and cannot be deleted, so the whole entry log is retained; the remaining 20% or less of data is not worth the disk IO needed to reclaim it, and accepting that space overhead reduces the disk IO cost.
  • To prevent a single GC run from taking too long, a maximum GC execution time is configured; if a run exceeds this limit, the GC is forcibly suspended. (A configuration sketch follows the notes below.)

Notice:

  • The compaction threshold ratio cannot exceed 100%.
  • The Minor GC threshold must be less than the Major GC threshold.
  • During compaction, the disk must still have a certain amount of free space.
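
A sketch of the corresponding Bookie settings via ServerConfiguration, using the default intervals and thresholds described above. The two maxTimeMillis keys are set through the generic property setter because their dedicated setters may differ across BookKeeper versions; the time values are illustrative.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

public class GcScheduleConfig {

    static ServerConfiguration gcSchedule() {
        ServerConfiguration conf = new ServerConfiguration();

        // Minor GC: every hour, reclaim entry logs with less than 20% useful data
        conf.setMinorCompactionInterval(3600);      // seconds
        conf.setMinorCompactionThreshold(0.2);

        // Major GC: every 24 hours, reclaim entry logs with less than 80% useful data
        conf.setMajorCompactionInterval(24 * 3600); // seconds
        conf.setMajorCompactionThreshold(0.8);

        // Cap how long a single compaction run may take (illustrative values)
        conf.setProperty("minorCompactionMaxTimeMillis", 30 * 60 * 1000L);
        conf.setProperty("majorCompactionMaxTimeMillis", 60 * 60 * 1000L);

        return conf;
    }
}
```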

The Bookie compaction process

Before looking at how Bookie compaction works, you first need to understand the following concepts. (The production environment is configured with DbLedgerStorage, which is what most of the community uses today; all of the GC processes and BookKeeper-related content below are based on this default configuration.)

  • Metadata Store: the metadata store, ZooKeeper by default. Using the ZK-Web tool provided by the community, you can see that many Ledgers are stored under the ledgers path.

    (figure)

  • LedgerIndex: the set of Ledger indexes stored in RocksDB. Using DbLedgerStorage means RocksDB serves as the index store for the entry logs. When reading data, the Bookie first reads RocksDB to find the index, and then goes to the entry log to read the value; in other words, it looks up the key to get the value.

    (figure)

  • LedgersMap: the set of Ledgers stored in the current single entry log.

  • EntryLogMetaMap: all the entry logs under the current Bookie; the key is the entry log ID and the value is the entry log metadata. EntryLogMetaMap is a collection of EntryLogMetadata objects, and each of them contains a LedgersMap.

With the abstractions above, the relationship is clear: the key of EntryLogMetaMap is the entry log ID, and it maps to that entry log's LedgersMap.

(figure)

Throughout the compaction process, there are three core functions and pieces of processing logic:

  1. doGcLedgers(): processes the LedgerIndex set (RocksDB) and uses it to determine which Ledgers can be deleted.

  2. doGcEntryLogs(): processes the LedgersMap and EntryLogMetaMap sets and, based on the result of doGcLedgers(), determines which Ledgers in each LedgersMap can be deleted and which entry logs in the EntryLogMetaMap can be deleted.

  3. doCompactionEntryLogs(): after the two steps above, the actual deletion can be carried out. doCompactionEntryLogs() decides whether the entry log file itself can be deleted, and how to delete is itself a subtle problem for a key-value store. Entries cannot simply be removed in place from the key-value set, because that would leave many holes (gaps in the message sequence). Instead, BookKeeper's delete operation reads the data that must be kept from the old entry log file and writes it into a new entry log file; once the surviving data has been backed up into the new file, the old entry log file can be deleted in one go. (A simplified sketch of how these three steps fit together follows this list.)
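
The following is a rough, simplified sketch of how these three steps fit together; it is illustrative only and does not use the actual BookKeeper classes. The threshold parameter corresponds to 0.2 for Minor GC and 0.8 for Major GC.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public class GcPassSketch {

    // Simplified view of one entry log: ledgerId -> bytes still stored in this log
    static class EntryLogMeta {
        long totalSize;
        long remainingSize;
        Map<Long, Long> ledgersMap;
    }

    // One simplified GC pass; the threshold is 0.2 for Minor GC and 0.8 for Major GC
    static void gcPass(Set<Long> ledgersInMetadataStore,        // ledger list from ZooKeeper
                       Set<Long> ledgersInIndex,                // ledger list from RocksDB (LedgerIndex)
                       Map<Long, EntryLogMeta> entryLogMetaMap, // entryLogId -> metadata
                       double threshold) {

        // doGcLedgers(): a ledger present in the index but missing from the metadata store is deletable
        Set<Long> deletableLedgers = new HashSet<>(ledgersInIndex);
        deletableLedgers.removeAll(ledgersInMetadataStore);

        for (EntryLogMeta meta : entryLogMetaMap.values()) {
            // doGcEntryLogs(): drop deleted ledgers from this entry log's LedgersMap
            Iterator<Map.Entry<Long, Long>> it = meta.ledgersMap.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Long> e = it.next();
                if (deletableLedgers.contains(e.getKey())) {
                    meta.remainingSize -= e.getValue();
                    it.remove();
                }
            }

            // doCompactionEntryLogs(): if the remaining useful data is below the threshold,
            // copy the surviving ledgers into a new entry log, flush its index, then delete the old file
            double usage = meta.totalSize == 0 ? 0.0 : (double) meta.remainingSize / meta.totalSize;
            if (usage < threshold) {
                System.out.println("entry log eligible for compaction");
            }
        }
    }
}
```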

Entry logs have been mentioned many times above; the following introduces what is stored in an entry log in BookKeeper and how. An entry log consists of three parts, from top to bottom. The figure below gives a general picture of the structure; for a precise understanding, read the relevant source code.

  • Header: contains the fingerprint (BKLO, which identifies an entry log file and is used for validation), the BookKeeper version, the Ledgers Map Offset (the offset of the ledgers map and how to read it), and the Ledgers Count (the number of Ledgers in the entry log).

  • LedgerEntry list: LedgerEntry objects, each including the Entry Size, Ledger ID, Entry ID, and Count.

  • Ledgers Map: contains the Ledgers Map Size, the Ledgers Count, and the Ledgers Map Entries. Each Ledgers Map Entry is a key-value pair mapping a Ledger ID to its size.

    (figure)

The complete data reclamation process

With the basic concepts introduced above, we can string together the whole reclamation process of data from the Broker to BookKeeper.

(figure)

First, the Client triggers the process. It is recommended to set the Retention policy when creating a topic; if it is not set, the default behavior is to delete messages once consumption is complete. Once the Retention policy is set, the Broker runs a periodic check thread that applies the Retention policy to each Topic and calls the exposed Delete Ledger interface on expired, deletable Ledgers. As shown in the figure, Ledger 0 can be deleted, so Delete Ledger is called to delete Ledger 0. After Ledger 0 is deleted, its path is removed from ZooKeeper. This is the complete deletion flow; the figure above does not include the return path.

From the moment Delete Ledger is called until it returns success, no data on the BookKeeper disks is actually touched. Users are often confused about why calling the delete interface does not free disk space. The reason is that the delete operation and BookKeeper's disk reclamation are completely asynchronous: disk reclamation is always handled by the GC / Compaction thread.

So how does the periodic GC / Compaction thread work? Its periodic runs are the Minor GC and Major GC. During a run, the list of all Ledgers is first obtained from ZooKeeper. Because creating a Ledger registers a corresponding path in ZooKeeper and deleting a Ledger removes that path, the Ledger paths in ZooKeeper are the most complete and accurate, so the Metadata Store (ZooKeeper) is used as the baseline for the full Ledger list. Next, doGcLedgers() compares the full Ledger set in RocksDB against the Ledger set obtained from ZooKeeper to find the Ledgers that can be deleted. After that, doGcEntryLogs() processes the LedgersMap and EntryLogMetaMap sets to determine which Ledgers in each entry log can be deleted. Finally, doCompactionEntryLogs() runs: ideally, if every Ledger in an entry log can be deleted, the entry log can be removed outright. In most cases, however, only part of the data in an entry log is deletable, so how do we decide whether to keep the entry log? That is determined by the compaction threshold ratios of Minor GC and Major GC.

The figure below shows how the results of doGcEntryLogs() feed into doCompactionEntryLogs(). Suppose doCompactionEntryLogs() determines, using the Major GC threshold, that an entry log holds enough reclaimable data. The GC / Compaction thread first checks which Ledgers can be deleted from the old entry log; assume Ledger 0 and Ledger 2 can be deleted while Ledger 1 and Ledger 3 cannot. After checking the useful-data ratio against the threshold and concluding that the entry log can be compacted away, the thread writes the surviving data of Ledger 1 and Ledger 3 into a new entry log file; once the useful data has been backed up, the old entry log file can be deleted.

(figure)

One detail worth adding: when the new entry log file is created, there is also a flush action. The old entry log file produced index information when it was created; when a Bookie reads an Entry, for example the data of Entry 0 of Ledger 1, it traces the corresponding entry log through that index. After the old entry log file is deleted and the new one is created, the index information for the new entry log must be updated in RocksDB, so that upper-layer read requests are directed to the new entry log file (identified by its newly generated hexadecimal ID) to read the data of Entry 0, Ledger 1.

The above is the complete life cycle of a message, covering the whole journey from the TTL and Retention policies to the Bookie GC reclamation mechanism.

How dirty data is generated

The following introduces problems encountered in actual production. In the figure below, we monitored the entry log files on each Bookie and found that, although the Retention period was set to 1 day or 5 days, some entry log files had existed for more than 200 days without being deleted. This is abnormal: as long as such files are not deleted, they keep occupying disk space. After analysis, the following situations can lead to dirty data:

(figure)

  • Incorrect Ledger deletion logic, resulting in orphaned Ledgers: looking back at the reclamation process, deleting a Ledger consists of two parts, removing its path from ZooKeeper and having the GC / Compaction thread clean up the entry log. The community has initiated a PIP[1] for two-phase deletion to ensure that orphaned Ledgers are not produced during deletion.
  • The Broker does not load inactive Topics, so the Retention policy does not take effect for them: the community is currently improving this logic. The only deletion path, BookKeeper's Delete Ledger operation, is triggered by the Retention policy, so if the Retention policy never takes effect, the Ledgers of an inactive Topic cannot be deleted.
  • Unreasonable GC reclamation thresholds, so some data can never be removed from the entry logs: this is the main reason for the 200-plus-day-old entry logs in the figure above. Looking at the distribution of user data, we found that the reclamation threshold had not been set at 80% useful data but had been adjusted to 50%, so entry logs in which about half of the data was still useful could never be deleted.
  • There are inactive Cursors (inactive meaning there is no corresponding consumer under the subscription), and the Ledgers held by these Cursors cannot be deleted: the currently proposed solution is to add a validation step that deletes a Cursor if it has not been updated for a period of time. This solution is still being discussed and verified.

In any of these cases, the dirty data in the Ledgers cannot be deleted, so let's look at how to clean it up. Before that, you need to understand a concept called Custom Metadata. When the Broker creates a Ledger, it can set some metadata on it, that is, custom metadata properties of the Ledger. The figure below shows the Custom Metadata that Pulsar provides by default, together with the Pulsar managed-ledger information (Base64-encoded) obtained through the BookKeeper admin tool. This set of properties decodes back into the information of a Topic; only with the Topic information can the following operations be performed.

(figure)

Through the Ledger metadata we can obtain the Topic information, that is, the Ledger's owner Topic. With that, we can start cleaning up the dirty data.
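
A rough sketch of reading a Ledger's custom metadata with the BookKeeper client API is shown below. The ZooKeeper address and ledger ID are placeholders, and it is assumed that the ledger was written by Pulsar's managed ledger with the default digest type (CRC32C) and an empty password; the exact metadata keys are Pulsar-internal and may differ across versions.

```java
import java.util.Map;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerOwnerTopicExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper address of the BookKeeper metadata store
        BookKeeper bk = new BookKeeper("localhost:2181");

        long ledgerId = 12345L; // placeholder: a suspicious ledger found in the ZooKeeper snapshot

        // Assumes the ledger was created with Pulsar's default digest type and empty password
        LedgerHandle lh = bk.openLedgerNoRecovery(ledgerId, DigestType.CRC32C, new byte[0]);

        // Custom metadata written by the Broker when the ledger was created
        Map<String, byte[]> customMetadata = lh.getCustomMetadata();
        customMetadata.forEach((k, v) -> System.out.println(k + " = " + new String(v)));

        lh.close();
        bk.close();
    }
}
```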

Cleaning up orphaned Ledgers

Orphaned Ledgers are cleaned up with a removal tool. The process is as follows:

  • Obtain the full Ledger list from a ZooKeeper snapshot (if your online environment is not under pressure, you can also read directly from ZooKeeper without a snapshot). After obtaining the full Ledger list, use the BookKeeper admin tool to fetch each Ledger's Custom Metadata.

  • From the Custom Metadata, find the Ledger's owner Topic and check whether that Topic still exists on the Broker.

    • If the Topic no longer exists on the Broker, the Client cannot reach it through the Broker in the first place, so keeping the data in BookKeeper is meaningless and the Ledger can be deleted directly.

    • If the Topic exists on the Broker, further check whether the Ledger still belongs to it. The Topic's stats-internal output lists all the Ledgers in the Topic, which can be used to confirm whether the Ledger is still referenced by the Topic. Note that the stats-internal command sometimes returns the Ledger list and sometimes does not.

      (figure)

All Topic attributes, stats-internal output, and other such information are read by the Broker from ZooKeeper. Once the checks above have passed, the Ledger can be deleted from BookKeeper. The deletion logic is the same as the reclamation process described earlier: first the Ledger's ZooKeeper path is removed, and then the disk space occupied by the Ledger is reclaimed asynchronously by the GC / Compaction thread.
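
A sketch of the existence check, assuming the owner topic name has already been decoded from the custom metadata; the admin URL, topic name, and ledger ID are placeholders.

```java
import java.util.List;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class OrphanLedgerCheckExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();

        String ownerTopic = "persistent://public/default/demo-topic-partition-0"; // decoded from custom metadata
        long ledgerId = 12345L;                                                    // the suspicious ledger

        List<String> topics = admin.topics().getList("public/default");
        if (!topics.contains(ownerTopic)) {
            System.out.println("Owner topic no longer exists; the ledger is a candidate for deletion");
        } else {
            boolean referenced = admin.topics().getInternalStats(ownerTopic)
                    .ledgers.stream().anyMatch(l -> l.ledgerId == ledgerId);
            System.out.println(referenced
                    ? "Ledger is still referenced by the topic; do not delete"
                    : "Ledger not in the topic's ledger list; check schema and cursor ledgers before deleting");
        }

        admin.close();
    }
}
```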

In addition, Schema and Cursor information is also stored in Ledgers. One of the entries in the figure below is the Pulsar Schema ID: if the user specifies a Schema such as String or JSON, a Schema Ledger corresponding to the Topic is created, and the Schema information is also stored under ZooKeeper. Schema Ledgers and Cursor Ledgers can both be seen in the stats-internal output, so check carefully.

(figure)

Note: be sure to make a backup when cleaning up dirty data. A ZooKeeper snapshot backup allows data to be recovered after an accidental deletion.

Summary

From the user's perspective, this article has described how messages are stored in Bookies, explained Bookie's garbage collection mechanism, and shown how the TTL and Retention policies act on the BookKeeper client to trigger garbage collection. We hope it can serve as a reference for operating Pulsar in production environments.

Reference links

[1] PIP: https://github.com/apache/pulsar/issues/16569
