RocketMQ multi-level storage design and implementation

Author: Zhang Senze

With the official release of RocketMQ 5.1.0, multi-level storage has reached the Technical Preview milestone as a new, independent module of RocketMQ: it allows users to offload messages from local disks to other, cheaper storage media and to extend message retention time at a lower cost. This article introduces the design and implementation of RocketMQ multi-level storage in detail.

Design overview

RocketMQ multi-level storage is designed to offload data to other storage media without affecting hot-data reads and writes, and is suitable for two scenarios:

  1. Separation of hot and cold data: newly generated messages in RocketMQ are cached in the page cache, and we call them hot data; when the cache exceeds the memory capacity, hot data is swapped out and becomes cold data. If a few consumers try to consume cold data, the cold data is reloaded from the hard disk into the page cache, which causes read/write IO contention and squeezes the page cache. Switching the read link of cold data to multi-level storage avoids this problem;
  2. Extend message retention time: By offloading messages to larger and cheaper storage media, a longer message retention time can be achieved at a lower cost. At the same time, multi-level storage supports specifying different message retention times for topics, and message TTL can be flexibly configured according to business needs.

The biggest difference between RocketMQ multi-level storage and the tiered storage of Kafka and Pulsar is that we upload messages in a quasi-real-time manner instead of waiting for a CommitLog file to be full before uploading, based mainly on the following considerations:

  1. Amortized cost: RocketMQ multi-level storage needs to convert the global CommitLog into a topic dimension and rebuild the message index. Processing the entire CommitLog file at one time will cause performance glitches;
  2. More friendly to small-scale instances: small-scale instances are often configured with smaller memory, which means hot data is swapped out faster and becomes cold data. Waiting for the CommitLog to be full before uploading carries a cold-read risk in itself. The quasi-real-time upload method not only avoids the cold-read risk during upload, but also allows cold data to be read from multi-level storage as soon as possible.

Quick Start

Multi-level storage is designed to minimize the mental burden on users: the hot/cold data read and write links are switched transparently without any client change, and multi-level storage can be enabled simply by modifying the server-side configuration. Only the following two steps are needed:

  1. Modify the Broker configuration and specify org.apache.rocketmq.tieredstore.TieredMessageStore as messageStorePlugIn;
  2. Configure the storage medium you want to use. Taking offloading messages to another hard disk as an example: set tieredBackendServiceProvider to org.apache.rocketmq.tieredstore.provider.posix.PosixFileSegment and specify the new storage file path with tieredStoreFilepath.

Optional: tieredMetadataServiceProvider can be modified to switch the metadata storage implementation; the default is JSON-based file storage.
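Taken together, a broker.conf enabling multi-level storage with the POSIX provider might look like the following sketch (only the options named above are shown; the storage path is just an example, and key names and defaults should be verified against the README for your version):

# broker.conf: enable multi-level (tiered) storage
messageStorePlugIn=org.apache.rocketmq.tieredstore.TieredMessageStore

# offload messages to another disk through the POSIX provider
tieredBackendServiceProvider=org.apache.rocketmq.tieredstore.provider.posix.PosixFileSegment
tieredStoreFilepath=/mnt/tiered_store

# optional: switch the metadata storage implementation (default is JSON-based file storage)
# tieredMetadataServiceProvider=...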

For more usage instructions and configuration items, see the multi-level storage README on GitHub [1].

Technology Architecture

(Figure: architecture)

Access layer: TieredMessageStore/TieredDispatcher/TieredMessageFetcher

The access layer implements some of the read and write interfaces in MessageStore and adds asynchronous semantics to them. TieredDispatcher and TieredMessageFetcher implement the upload and download logic of multi-level storage respectively. Compared with the lower layers, more performance optimization is done here, including using an independent thread pool to prevent slow IO from blocking access to hot data, and using a read-ahead cache to optimize performance.

Container layer: TieredCommitLog/TieredConsumeQueue/TieredIndexFile/TieredFileQueue

The container layer implements a logical file abstraction similar to that of DefaultMessageStore, also dividing files into CommitLog, ConsumeQueue, and IndexFile; each logical file type holds references to the underlying physical files through a FileQueue. The difference is that the CommitLog of multi-level storage is organized in the queue dimension.

Driver layer: TieredFileSegment

The driver layer is responsible for maintaining the mapping from logical files to physical files, and connects to the read and write interfaces of the underlying file systems (POSIX, S3, OSS, MinIO, etc.) by implementing TieredStoreProvider. Currently a PosixFileSegment implementation is provided, which can transfer data to another hard disk or to object storage mounted through FUSE.

Message upload

The message upload of RocketMQ multi-level storage is triggered by the dispatch mechanism: when multi-level storage is initialized, TieredDispatcher is registered as a dispatcher of the CommitLog. In this way, whenever a message is sent to the Broker, TieredDispatcher is called to dispatch it, and it returns success immediately after writing the message to the upload buffer. There is no blocking logic in the entire dispatch process, ensuring that the construction of the local ConsumeQueue is not affected.

(Figure: TieredDispatcher)

What TieredDispatcher writes into the upload buffer is only a reference to the message; the message body is not read into memory. Because multi-level storage constructs the CommitLog in the queue dimension, the commit log offset field needs to be regenerated at this point.

(Figure: upload buffer)

When an upload of the buffer is triggered, the regenerated commit log offset of each message is spliced into the original message as it is read.
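As an illustration of this flow, the sketch below shows a per-queue upload buffer that enqueues message references on dispatch and splices the regenerated, queue-dimension commit log offset into each message only when an upload is triggered. All class and method names here are hypothetical, not the actual TieredDispatcher code:

import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch (not the real TieredDispatcher): a per-queue upload buffer that stores
// message references on dispatch and splices a queue-dimension commit log offset into each
// message body only when the upload is actually triggered.
public class QueueUploadBufferSketch {

    /** A reference to a message already persisted in the local CommitLog. */
    public interface MessageRef {
        ByteBuffer readBody();            // load the stored bytes on demand, ready to read from position 0
        int commitLogOffsetFieldPos();    // position of the commit log offset field inside the body
    }

    /** Abstraction over the tiered CommitLog file being written. */
    public interface TieredAppender {
        void append(ByteBuffer body);
    }

    private final Queue<MessageRef> buffer = new ConcurrentLinkedQueue<>();
    private long tieredCommitLogOffset;   // next offset in the queue-dimension CommitLog

    /** Called from CommitLog dispatch; only enqueues a reference, never blocks on IO. */
    public void dispatch(MessageRef ref) {
        buffer.offer(ref);
    }

    /** Called asynchronously when an upload is triggered. */
    public void upload(TieredAppender appender) {
        MessageRef ref;
        while ((ref = buffer.poll()) != null) {
            ByteBuffer body = ref.readBody();
            // regenerate the commit log offset for the queue-dimension CommitLog
            body.putLong(ref.commitLogOffsetFieldPos(), tieredCommitLogOffset);
            tieredCommitLogOffset += body.remaining();
            appender.append(body);
        }
    }
}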

Upload progress control

Each queue has two key positions that control the upload progress:

  1. dispatch offset: the position of messages that have been written to the upload buffer but not yet uploaded;
  2. commit offset: the position of messages that have been uploaded.

(Figure: upload progress)

By analogy with a consumer, the dispatch offset is like the pull position and the commit offset is like the acknowledged consumption position; the part between the commit offset and the dispatch offset corresponds to messages that have been pulled but not yet consumed.
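A minimal sketch of these two progress markers (field and method names are illustrative, not the real implementation):

// Hypothetical sketch of the two per-queue progress markers: dispatchOffset advances when a
// message reference is written to the upload buffer, commitOffset advances only after the
// corresponding data has been uploaded.
public class UploadProgressSketch {

    private volatile long dispatchOffset; // written to the upload buffer, not yet uploaded
    private volatile long commitOffset;   // durably uploaded to tiered storage

    public void onDispatch(long queueOffset)      { dispatchOffset = queueOffset + 1; }
    public void onUploadSuccess(long queueOffset) { commitOffset = queueOffset + 1; }

    /** Messages in [commitOffset, dispatchOffset) are buffered but not yet uploaded. */
    public long pending() { return dispatchOffset - commitOffset; }
}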

Message read

TieredMessageStore implements the message-reading interfaces in MessageStore and decides whether to read messages from multi-level storage based on the logical position (queue offset) in the request. According to the configuration (tieredStorageLevel), there are four strategies:

  • DISABLE: Disable reading messages from multi-level storage;
  • NOT_IN_DISK: Messages not in DefaultMessageStore are read from multi-level storage;
  • NOT_IN_MEM: Messages not in the page cache, that is, cold data, are read from multi-level storage;
  • FORCE: Force all messages to be read from multi-level storage, currently only for testing.
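A hedged sketch of how this routing decision could be expressed, assuming the four levels above (the surrounding class and method names are illustrative, not the actual TieredMessageStore code):

// Hypothetical sketch of the read-routing decision driven by tieredStorageLevel.
public class ReadRoutingSketch {

    public enum TieredStorageLevel { DISABLE, NOT_IN_DISK, NOT_IN_MEM, FORCE }

    private final TieredStorageLevel level;

    public ReadRoutingSketch(TieredStorageLevel level) { this.level = level; }

    /**
     * Decide whether a fetch at the given queue offset should go to multi-level storage.
     * @param inLocalDisk whether the message still exists in DefaultMessageStore
     * @param inPageCache whether the message is still resident in the page cache
     */
    public boolean readFromTieredStore(boolean inLocalDisk, boolean inPageCache) {
        switch (level) {
            case DISABLE:     return false;
            case NOT_IN_DISK: return !inLocalDisk;
            case NOT_IN_MEM:  return !inPageCache;  // cold data goes to tiered storage
            case FORCE:       return true;          // testing only
            default:          return false;
        }
    }
}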
/**
 * Asynchronous get message
 * @see #getMessage(String, String, int, long, int, MessageFilter) getMessage
 *
 * @param group Consumer group that launches this query.
 * @param topic Topic to query.
 * @param queueId Queue ID to query.
 * @param offset Logical offset to start from.
 * @param maxMsgNums Maximum count of messages to query.
 * @param messageFilter Message filter used to screen desired messages.
 * @return Matched messages.
 */
CompletableFuture<GetMessageResult> getMessageAsync(final String group, final String topic, final int queueId,
    final long offset, final int maxMsgNums, final MessageFilter messageFilter);

Messages that need to be read from multi-level storage are handled by TieredMessageFetcher: it first verifies that the parameters are valid, then initiates a pull request according to the logical position (queue offset). TieredConsumeQueue/TieredCommitLog converts the logical position to the physical position in the corresponding file, and the messages are read from TieredFileSegment.

// TieredMessageFetcher#getMessageAsync is similar to TieredMessageStore#getMessageAsync
public CompletableFuture<GetMessageResult> getMessageAsync(String group, String topic, int queueId,
    long queueOffset, int maxMsgNums, final MessageFilter messageFilter)

TieredFileSegment maintains the position of each physical file stored in the file system and reads the required data through the interfaces implemented for the different storage media.

/**
 * Get data from backend file system
 *
 * @param position the index from where the file will be read
 * @param length the data size will be read
 * @return data to be read
 */
CompletableFuture<ByteBuffer> read0(long position, int length);
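For illustration, a POSIX-style positional read behind such an interface could be sketched as follows. This is a simplified, hypothetical example rather than the actual PosixFileSegment; among other things it ignores short reads, buffer pooling, and retries:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a read0-style positional read against a local file.
public class PosixReadSketch {

    private final AsynchronousFileChannel channel;

    public PosixReadSketch(Path file) throws IOException {
        this.channel = AsynchronousFileChannel.open(file, StandardOpenOption.READ);
    }

    /** Read `length` bytes starting at `position` and complete the future with the data. */
    public CompletableFuture<ByteBuffer> read0(long position, int length) {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        CompletableFuture<ByteBuffer> future = new CompletableFuture<>();
        // note: a single read may return fewer bytes than requested; a real implementation
        // would loop until the buffer is full or end-of-file is reached
        channel.read(buffer, position, buffer, new CompletionHandler<Integer, ByteBuffer>() {
            @Override public void completed(Integer bytesRead, ByteBuffer buf) {
                buf.flip();
                future.complete(buf);
            }
            @Override public void failed(Throwable cause, ByteBuffer buf) {
                future.completeExceptionally(cause);
            }
        });
        return future;
    }
}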

Read-ahead cache

When TieredMessageFetcher reads messages, it reads ahead some additional messages for the next request, and these messages are temporarily stored in the read-ahead cache.

protected final Cache<MessageCacheKey /* topic, queue id and queue offset */,
    SelectMappedBufferResultWrapper /* message data */> readAheadCache;

The design of the read-ahead cache borrows from the TCP Tahoe congestion control algorithm: the amount of messages read ahead each time is analogous to the congestion window and is controlled by an additive-increase/multiplicative-decrease mechanism, as sketched after the list below:

  • Additive increase: starting from the minimum window, each time the read-ahead amount is increased by the client's batchSize;
  • Multiplicative decrease: when cached messages exceed the cache expiration time without having been fully fetched, the next read-ahead amount is halved and the cache is cleaned.
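A minimal sketch of this window adjustment (the thresholds and bounds are assumptions, not the actual TieredMessageFetcher logic):

// Hypothetical sketch of the additive-increase / multiplicative-decrease read-ahead window.
public class ReadAheadWindowSketch {

    private final int minWindow;   // minimum read-ahead size
    private final int maxWindow;   // assumed upper bound to cap memory usage
    private int window;            // current read-ahead message count (the "congestion window")

    public ReadAheadWindowSketch(int minWindow, int maxWindow) {
        this.minWindow = minWindow;
        this.maxWindow = maxWindow;
        this.window = minWindow;
    }

    /** Additive increase: cached messages were consumed in time, grow by the client batch size. */
    public void onCacheFullyUsed(int clientBatchSize) {
        window = Math.min(window + clientBatchSize, maxWindow);
    }

    /** Multiplicative decrease: cache expired before being fully read, halve the next read-ahead. */
    public void onCacheExpiredUnread() {
        window = Math.max(window / 2, minWindow);
    }

    public int nextReadAheadSize() { return window; }
}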

When a large amount of messages is read, the read-ahead cache supports sharding the read into concurrent requests to achieve higher bandwidth and lower latency.

The read-ahead cache for a topic's messages is shared by all groups that consume this topic. The cache invalidation strategy is:

  1. All groups subscribed to this topic have accessed the cached messages;
  2. The cache expiration time is reached.

Recovery

As described above, the upload progress is controlled by the commit offset and dispatch offset. Multi-level storage creates metadata for each topic, queue, and fileSegment and persists these two positions. When the Broker restarts, it recovers from the metadata and continues uploading messages from the commit offset; messages that were only in the upload buffer before the restart are re-uploaded and are not lost.
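A simplified sketch of that restart behavior, assuming a metadata record that persists the two offsets per queue (the real metadata model also covers topics and file segments):

// Hypothetical sketch of resuming upload from the persisted commit offset after a restart.
public class RecoverySketch {

    /** Persisted per-queue metadata; only the fields relevant to upload progress are shown. */
    public static class QueueMetadata {
        public long commitOffset;    // last queue offset durably uploaded
        public long dispatchOffset;  // last queue offset written to the upload buffer
    }

    /** On restart the in-memory upload buffer is gone, so uploading resumes from commitOffset. */
    public long resumeFrom(QueueMetadata persisted) {
        // everything in [commitOffset, dispatchOffset) before the restart is re-dispatched
        // and re-uploaded; already-committed data is not uploaded twice
        return persisted.commitOffset;
    }
}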


Development Plan

Cloud-native storage systems need to maximize the value of cloud storage, and object storage is the embodiment of cloud computing dividends. RocketMQ multi-level storage hopes, on the one hand, to take advantage of the low cost of object storage to extend message storage time and expand the value of data, and on the other hand to use object storage to drive the evolution of its storage architecture.

Tag filtering

Multi-level storage currently does not check whether a message's tag matches when pulling messages; tag filtering is handed over to the client. This brings additional network overhead, and server-side tag filtering is planned for the future.

Broadcast consumption and multiple consumers with different consumption progress

Invalidation of the read-ahead cache requires that all groups subscribing to this topic have accessed the cache, which is difficult to trigger when the consumption progress of multiple groups is inconsistent, resulting in the accumulation of useless messages in the cache.

It is necessary to calculate the consumption QPS of each group to estimate whether a group can use the cached messages before the cache expires. If a cached message is not expected to be accessed again before it is invalidated, it should be expired immediately. Conversely, for broadcast consumption, the message expiration strategy should be optimized so that all clients can read the message before it is invalidated.

Integration with High Availability Architecture

At present, we mainly face the following three problems:

  1. Metadata synchronization: how to reliably synchronize metadata among multiple nodes, and how to calibrate and complete missing metadata when a slave is promoted;
  2. Preventing uploads beyond the confirm offset: to avoid message rollback, the maximum uploaded offset cannot exceed the confirm offset;
  3. Quickly starting multi-level storage when a slave is promoted: only the master node has write permission, so after a slave node is promoted it needs to quickly start multi-level storage to continue uploading.

Related Links:

[1] README: https://github.com/apache/rocketmq/blob/develop/tieredstore/README.md
