Detailed explanation of the random index module for hot and cold separation in RocketMQ

Characteristics of random indexing in messaging systems

RocketMQ is widely used in a variety of business scenarios. In production, users typically query by message ID or by a specific business key (such as a student ID or an order number) to locate a specific batch of messages, and then use those messages to troubleshoot complex issues in a distributed system. In traditional solutions, the message index is stored in a database system or a local file system, which is limited by disk capacity and struggles to meet the write requirements of massive data volumes.

In cloud-native scenarios, object storage offers elasticity and pay-as-you-go pricing, which effectively reduces storage costs, but it handles random reads and writes poorly. The data written in RocketMQ's queue model is approximately ordered by time. Building on this property, the random index module implements non-stop write for hot data, supports hot and cold separation, and uses asynchronous reorganization to move cold data to a cheaper storage system.

Index structure built by RocketMQ on disk

An index is a data structure that trades space for time and supports fast storage and lookup. Let's take a look at the structural design of RocketMQ's index file. The index file is a hash table that resolves conflicts by head insertion, laid out in three segments: IndexHeader, Index Slots, and IndexItems. This storage structure offers fast queries, a small footprint, and easy maintenance. However, as the amount of data grows, the number of local index files keeps growing as well.

Index file structure

Index items whose keys hash to the same slot are linked through a singly linked list, and new index items are appended to the end of the file to improve write performance:

1. The index header (IndexHeader) holds the metadata of the index file. It contains the MagicCode, which marks the start of the file; the start timestamp (startTimeStamp) and end timestamp (endTimeStamp), which give the time range covered by the index; the number of slots already in use (hashSlotCount); and the number of index items already stored (indexCount).

2. The index slots (Slots) are fixed in number. Each slot stores the position of the head node of the chain of index items that hash to it. The key is hashed, and the hash value modulo the number of slots gives the position of the slot, which can be regarded as the head pointer of a linked list.

3. The index items (IndexItems) store the data of each index entry. Messages sent to a queue of a given topic are ultimately stored in a file called the CommitLog, so each index item contains the topicId, queueId, offset, size, and other information needed to locate the message in the CommitLog.
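The three-segment layout above can be sketched in a few lines of Python. This is a minimal illustration rather than RocketMQ's actual implementation: the field widths, the `MAGIC` value, the CRC32 hash, and the class name `HashIndexFile` are all assumptions made for the example, and an in-memory `bytearray` stands in for the file.

```python
import struct
import zlib

# Hypothetical field widths; the real on-disk layout may differ.
HEADER = struct.Struct(">iqqii")  # magic, start ts, end ts, used slots, item count
SLOT = struct.Struct(">i")        # 1-based number of the chain's head item, 0 = empty
ITEM = struct.Struct(">iiiqii")   # key hash, topic id, queue id, offset, size, previous item
MAGIC = 0x12345678                # placeholder value, not the real MagicCode


class HashIndexFile:
    def __init__(self, slot_count=8, max_items=32):
        self.slot_count = slot_count
        self.max_items = max_items
        self.items_base = HEADER.size + slot_count * SLOT.size
        self.buf = bytearray(self.items_base + max_items * ITEM.size)
        self.start_ts = self.end_ts = 0
        self.used_slots = self.item_count = 0
        self._write_header()

    def _write_header(self):
        HEADER.pack_into(self.buf, 0, MAGIC, self.start_ts, self.end_ts,
                         self.used_slots, self.item_count)

    @staticmethod
    def key_hash(key):
        return zlib.crc32(key.encode()) & 0x7FFFFFFF  # keep it non-negative

    def _slot_pos(self, h):
        return HEADER.size + (h % self.slot_count) * SLOT.size

    def put(self, key, topic_id, queue_id, offset, size, ts):
        if self.item_count >= self.max_items:
            return False  # file is full, the caller must roll to a new file
        h = self.key_hash(key)
        slot_pos = self._slot_pos(h)
        (head,) = SLOT.unpack_from(self.buf, slot_pos)
        self.item_count += 1
        item_pos = self.items_base + (self.item_count - 1) * ITEM.size
        # Append the item at the tail of the file; it points at the old head.
        ITEM.pack_into(self.buf, item_pos, h, topic_id, queue_id, offset, size, head)
        # Head insertion: the slot now points at the item just written.
        SLOT.pack_into(self.buf, slot_pos, self.item_count)
        if head == 0:
            self.used_slots += 1
        if self.item_count == 1:
            self.start_ts = ts
        self.end_ts = ts
        self._write_header()
        return True

    def get(self, key):
        h = self.key_hash(key)
        (cur,) = SLOT.unpack_from(self.buf, self._slot_pos(h))
        found = []
        while cur != 0:  # walk the chain, newest item first
            kh, topic, queue, off, size, prev = ITEM.unpack_from(
                self.buf, self.items_base + (cur - 1) * ITEM.size)
            if kh == h:  # matches on the hash only, so the caller should re-check the key
                found.append((topic, queue, off, size))
            cur = prev
        return found
```

Because every `put` appends at the tail and only rewrites one slot and the header, writes stay sequential. Every hop in `get`, by contrast, jumps to an unrelated position in the file.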

IndexItem

Index file format conversion (Compact)

In RocketMQ, the index module is written often, read rarely, and never updated in place. To lower the average cost of operations across the whole system, some read amplification on an individual read is acceptable. Suppose writing a message index takes t1, and on average each message index is queried t2 after it was written. Format conversion takes t_compact. Usually t_compact is much smaller than t2, so the conversion can be completed asynchronously within t2. Let t_before be the query time of a message index before format conversion and t_after the average query time after format conversion. Since t_after < t_before, the total storage and query cost without format conversion is greater than the cost with format conversion:

t1 + t2 + t_before > t1 + t2 + t_after

timeline

The RocketMQ index file is a separate-chaining hash table with head insertion, so index items can be written sequentially. A query for a given key, however, hashes the key to a slot, reads the head node of the linked list from that slot, and then walks the singly linked list from the head node. Every step of that walk is a random IO. Object storage behaves much like a mechanical hard disk in this respect: reading 20 bytes takes almost as long as reading several KB. Many random IOs add up to a large time overhead, so read amplification can become severe when there are many hash conflicts.

To reduce the number of random accesses to files in object storage, tiered storage converts the index file format asynchronously. A converted index file allows a large chunk of data to be fetched in a single request, which greatly reduces the number of IO accesses to object storage.

Specifically, the asynchronous reordering of the random index works as follows:

1. Group the index items of the local index file by the slot they map to. Each group contains a certain number of index items.

2. Write the groups to a new index file in order, so that the index items of the group belonging to one slot form a contiguous array in the physical address space.

3. To answer a query, hash the key to its slot. The slot stores the first address of the array, and scanning that array finds the index items being queried.

This greatly reduces random reads against object storage, which improves query efficiency and reduces time overhead. The trade-off is that converting and regrouping the local index files consumes some computing and storage resources.

Before format conversion

After format conversion

The rearranged index file turns a linked list scattered across non-contiguous physical addresses into an array at contiguous physical addresses. Each SlotItem is 8 bytes: the first 4 bytes record the first address of the array, and the last 4 bytes record the length of the array. This format conversion has the following benefits.

  • Subsequent index reads change from random IO over a linked list to sequential IO over an array, which reduces the time overhead caused by random IO.
  • Spatial locality can be exploited to raise the hit rate of the in-memory pageCache.
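The conversion can be illustrated with a short Python sketch. It assumes the index items have already been read from the local file as tuples in write order, measures the array length in bytes, and uses a made-up 20-byte item layout. The function names `compact` and `query` are hypothetical, and the header segment is omitted for brevity.

```python
import struct
from collections import defaultdict

SLOT_ITEM = struct.Struct(">ii")  # 8 bytes: first address of the array, length in bytes
ITEM = struct.Struct(">iiqi")     # assumed layout: key hash, queue id, offset, size


def compact(items, slot_count):
    """Regroup items so that each slot owns one contiguous array."""
    groups = defaultdict(list)
    for item in items:
        groups[item[0] % slot_count].append(item)  # item[0] is the key hash

    out = bytearray(slot_count * SLOT_ITEM.size)   # slot table comes first
    for slot in range(slot_count):
        group = groups.get(slot, [])
        start = len(out)
        for item in group:
            out += ITEM.pack(*item)
        # An empty slot is recorded as address 0 with length 0.
        SLOT_ITEM.pack_into(out, slot * SLOT_ITEM.size,
                            start if group else 0, len(out) - start)
    return bytes(out)


def query(data, slot_count, key_hash):
    """Two reads at most: the 8-byte slot, then the whole array in one go."""
    start, length = SLOT_ITEM.unpack_from(
        data, (key_hash % slot_count) * SLOT_ITEM.size)
    block = data[start:start + length]
    return [item[1:] for item in ITEM.iter_unpack(block) if item[0] == key_hash]
```

Against object storage, the slice in `query` would correspond to a single ranged read, however many items share the slot. The linked-list format would need one request per item in the chain.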

Status changes of a single index file

Single index file life cycle

The capacity of a single index file is limited. When many indexes are being written and a file reaches the maximum number of index items it can hold, a new index file must be created to continue writing. From creation to destruction, a file therefore goes through several stages: it is created, compacted, uploaded to object storage, and destroyed on expiration.

When an index file in the "writing" state is full, it is compacted and marked as a "Compact file". This state means the file no longer accepts writes and has already been compacted, but must still be kept locally until it is uploaded to object storage. Once uploaded, the file is marked as an "object storage file". These stages correspond to three file states: unsealed, compacted, and uploaded.
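The life cycle is a one-way state machine, which can be sketched as follows. The three state names come from the text; the class names and the `advance` method are illustrative assumptions.

```python
from enum import Enum


class FileState(Enum):
    UNSEALED = "unsealed"    # still accepting writes
    COMPACTED = "compacted"  # full and converted, kept locally until upload
    UPLOADED = "uploaded"    # stored in object storage


# A file can only move forward through the life cycle.
_NEXT = {
    FileState.UNSEALED: FileState.COMPACTED,
    FileState.COMPACTED: FileState.UPLOADED,
}


class IndexFileHandle:
    def __init__(self, name):
        self.name = name
        self.state = FileState.UNSEALED

    def advance(self):
        """Move to the next state, rejecting transitions past the final one."""
        nxt = _NEXT.get(self.state)
        if nxt is None:
            raise ValueError(f"{self.name} is already {self.state.value}")
        self.state = nxt
        return nxt
```

Keeping the transitions in a single table makes it hard to skip a stage by accident, for example uploading a file that has not been compacted.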

Multiple index file storage model

To implement the non-stop write feature and improve index write performance, the design uses three kinds of cooperating threads: the writing thread, the index query thread, and the background scheduled task thread. Each is responsible for a different task, and read-write locks ensure correctness under concurrency. A message queue is a storage system that is approximately ordered by time, and different index files hold the indexes of different time periods, so the files can be managed in approximate time order. A skip list is used for this, which makes fast point lookups and range queries straightforward.

1.  The writing thread is non-blocking, and its responsibility is to write the index to the file in the writing state at the end of the queue. When a file is full, the thread will automatically create a new file at the end of the queue and switch to the next file for writing. In order to improve writing efficiency, this thread is also responsible for caching the index in the memory when writing the index to the file. When the cache reaches a certain amount, it writes them to the file in batches to reduce the number of disk IOs.

2.  The index query thread can query index files in different states. The specific query strategy is as follows:

    • For a file that is being written, the query thread waits for the writing thread to finish writing the current index before querying. Files that are full can be queried directly, and files that have been compacted are also queried directly from the local file.
    • For files uploaded to object storage, the data can be read directly from the object storage and the index file in the Compact format can be queried.

3.  The background scheduled task thread is mainly responsible for compacting files that were in the writing state and are now full. To compact a file, the thread first acquires the file's read-write lock to keep other threads from accessing the file concurrently. When compaction finishes, it switches the file's state to Compact Complete. It then uploads the compacted file to object storage and, once the upload is done, switches the file's state to Uploaded. During the upload, the thread needs to release the read-write lock on the file.
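The time-ordered management of multiple files can be sketched as below. Two simplifications are assumed: a sorted list with `bisect` stands in for the skip list, and a plain `threading.Lock` stands in for the read-write lock, since the Python standard library provides neither. The class name `IndexFileTable` is hypothetical.

```python
import bisect
import threading


class IndexFileTable:
    """Index files ordered by the start timestamp of the period they cover."""

    def __init__(self):
        self._lock = threading.Lock()
        self._starts = []  # sorted start timestamps
        self._files = []   # file names, parallel to _starts

    def append(self, start_ts, name):
        # Called by the writing thread when it rolls over to a new file.
        with self._lock:
            pos = bisect.bisect_right(self._starts, start_ts)
            self._starts.insert(pos, start_ts)
            self._files.insert(pos, name)

    def files_for_range(self, begin_ts, end_ts):
        # Called by query threads: pick only files that can overlap the range.
        with self._lock:
            # The file covering begin_ts is the last one that starts at or before it.
            lo = max(bisect.bisect_right(self._starts, begin_ts) - 1, 0)
            hi = bisect.bisect_right(self._starts, end_ts)
            return self._files[lo:hi]
```

A query for a time window then touches only the files returned by `files_for_range`, instead of scanning every index file on the broker.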

Layered system design

To improve the scalability of the system and make unit tests easier to write, the index service follows a layered design. From top to bottom, the layers are the index service layer, the index file parsing layer, and the data storage layer. Each layer handles its own tasks, the layers are decoupled, and an upper layer depends only on the services provided by the layer below it.

  • Index service layer: This layer provides message indexing services for RocketMQ. It is responsible for the storage and query of message indexes, and is responsible for the life cycle management of index files, including creating index files, compact files, uploading files, destroying files, etc.
  • Index file parsing layer: This layer mainly performs format parsing on a single index file in different states, and also provides KV query and storage services for a single file. Specifically, this layer is responsible for reading the data in the index file and parsing it into a readable format for the upper layer to call.
  • Data storage layer: This layer is responsible for writing and reading binary stream data, and supports different types of storage methods, including object storage, local disk files, or database files. Specifically, this layer stores data on local disk or object storage or database files. When reading data, this layer is responsible for getting the data from the local disk or object storage and converting it into binary stream data and returning it to the caller.

With this layered design, the index service is split into three levels, which makes the system scalable and maintainable and simplifies later upgrades. Because the levels are decoupled and each has clear responsibilities, they are also easy to unit test.
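A rough sketch of the three layers follows, to show why the decoupling helps testing. All class and method names here are hypothetical, and the record layout (a key hash plus an offset) is deliberately simplified.

```python
import struct
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Data storage layer: moves raw bytes and knows nothing about indexes."""

    @abstractmethod
    def write(self, path, data): ...

    @abstractmethod
    def read(self, path): ...


class MemoryStore(DataStore):
    """In-memory backend, handy for unit tests; a disk or object
    storage backend would implement the same two methods."""

    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = bytes(data)

    def read(self, path):
        return self._blobs[path]


class IndexFileParser:
    """Parsing layer: encodes and decodes a single index file."""

    RECORD = struct.Struct(">iq")  # assumed record: key hash, offset

    def __init__(self, store):
        self._store = store

    def save(self, path, entries):
        self._store.write(path, b"".join(self.RECORD.pack(h, o) for h, o in entries))

    def lookup(self, path, key_hash):
        data = self._store.read(path)
        return [o for h, o in self.RECORD.iter_unpack(data) if h == key_hash]


class IndexService:
    """Service layer: owns the set of files and fans queries out to them."""

    def __init__(self, parser):
        self._parser = parser
        self._paths = []

    def add_file(self, path, entries):
        self._parser.save(path, entries)
        self._paths.append(path)

    def query(self, key_hash):
        return [o for p in self._paths for o in self._parser.lookup(p, key_hash)]
```

Because `IndexService` only talks to the parser and the parser only talks to `DataStore`, a test can wire in `MemoryStore` and exercise the upper layers without touching a disk or a bucket.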

Crash recovery design for high availability

Index files exist in different states and are managed through a skip list. After the system goes down, index files in each state have to be restored. To make this possible, files are sorted into folders by state, and the folder name records the state of the index files inside it.

Crash recovery follows this process:

1. After the system restarts, read the list of folder names stored on the system. The list contains the folder names for index files in every state.

2. Go through the folder name list, read the index files under each folder in turn, load them into memory, and rebuild the skip list.

3. Restore each file's current state from the name of the folder it was found in. For example, if the folder name is "writing", the index files under it are in the writing state and are handled accordingly.
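The recovery steps can be sketched as a directory scan. Only the folder name "writing" is taken from the text; the other folder names, the convention of naming each file after its start timestamp, and the function names are assumptions for the example.

```python
import os
import tempfile

# Folder names are assumptions; only "writing" appears in the text above.
STATE_FOLDERS = ("writing", "compacted", "uploaded")


def recover(root):
    """Rebuild the time-ordered file table from the per-state folders."""
    table = []
    for state in STATE_FOLDERS:
        folder = os.path.join(root, state)
        if not os.path.isdir(folder):
            continue  # no file was in this state when the system stopped
        for name in os.listdir(folder):
            # Assumption: each file is named after its start timestamp.
            table.append((int(name), state, os.path.join(folder, name)))
    table.sort()  # order by start timestamp, as the skip list would
    return table


def demo():
    """Create a throwaway directory tree and recover from it."""
    with tempfile.TemporaryDirectory() as root:
        for state, name in [("uploaded", "1000"), ("compacted", "2000"),
                            ("writing", "3000")]:
            os.makedirs(os.path.join(root, state), exist_ok=True)
            with open(os.path.join(root, state, name), "wb"):
                pass
        return [(ts, state) for ts, state, _ in recover(root)]
```

Since the state lives in the directory layout rather than in a separate metadata file, a state change can be done by moving the file from one folder to another, and recovery needs nothing beyond a listing.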

Comparison with other systems

RocksDB is a high-performance persistent KV storage engine developed on the basis of Google's LevelDB. RocksDB uses a Log-Structured Merge (LSM) tree as its basic storage structure. When data is written to RocksDB, it is first written to the MemTable in memory and persisted to the Write-Ahead Log (WAL) file on disk.

Whenever the amount of data cached in the MemTable reaches a preset value, the MemTable and WAL become immutable, and a new MemTable and WAL are allocated for subsequent writes. Entries with the same key in the immutable MemTable are then merged. The LSM tree has multiple levels, and each level consists of multiple SSTables. The newest SSTables are placed in level 0, and the SSTables in deeper levels are created by an asynchronous compaction operation.

The total SSTable size of each level is set by configuration parameters. When the data in level L exceeds its preset size, the SSTables in level L that overlap with SSTables in level L+1 are selected and merged, and repeating this process optimizes read performance. However, compaction brings considerable read and write amplification.

MySQL InnoDB is a transactional storage engine that provides high performance, high reliability, and high concurrency. Underneath, it is implemented with B+ trees, and the data files themselves are index files. To avoid losing data when the system goes down, InnoDB uses the RedoLog to record writes synchronously. Because the RedoLog is written sequentially, writes are very efficient: data is first written to the cache and the RedoLog, and is later written asynchronously to the B+ tree. Due to the hierarchical structure of the B+ tree, there is a practical upper limit on the number of index entries it can support. For example, when a single table exceeds hundreds of millions of records, performance degrades significantly. Splitting and merging B+ tree leaf nodes also adds read and write overhead.

RocketMQ itself is a storage system that writes more, reads less, and has zero updates and is approximately ordered by time. Therefore, RocketMQ can perform hot and cold separation storage simply and efficiently according to time. It also supports asynchronous file format conversion to reduce the overall system time overhead.

There is still room for improvement

The current index design is simple and reliable, but it still has some shortcomings. For example, when messages are queried by key, the query also takes a maxCount parameter. Because different index files are queried concurrently, the current implementation may have to query every index file, merge the results, and only then check whether the number of indexes specified by maxCount has been reached.

When there are many index files, this can trigger a large number of unnecessary queries and waste time. A reasonable solution is a global counter shared across threads: once maxCount is reached, queries against the remaining index files can be skipped. This requires attention to thread safety under concurrent access.
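One possible shape for such a counter is sketched below. It is an illustration of the proposal, not existing RocketMQ code: `BoundedCollector` and `query_files` are hypothetical names, and each index file is modeled as a list of (key, value) pairs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedCollector:
    """Thread-safe result list that refuses items once max_count is reached."""

    def __init__(self, max_count):
        self._max = max_count
        self._lock = threading.Lock()
        self._results = []

    def is_full(self):
        with self._lock:
            return len(self._results) >= self._max

    def offer(self, item):
        # The check and the append happen under one lock, so two threads
        # can never push the total past max_count.
        with self._lock:
            if len(self._results) >= self._max:
                return False
            self._results.append(item)
            return True

    def results(self):
        with self._lock:
            return list(self._results)


def query_files(files, key, max_count, workers=4):
    collector = BoundedCollector(max_count)

    def scan(entries):
        if collector.is_full():
            return  # skip this file entirely, enough results already
        for k, v in entries:
            if k == key and not collector.offer(v):
                return  # limit reached while scanning, stop early

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(scan, files))  # drain the iterator to wait for all scans
    return collector.results()
```

Which matches end up in the result depends on thread scheduling, so callers that need the newest or oldest matches would still have to order the files and results themselves.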

The tiered-storage index module of the message queue provides KV query and storage. By redesigning the index item (indexItem), it can be migrated to other systems and provide index services for them. It is enough to add a new class that inherits from indexItem, override the relevant functions, and add custom fields.

Author: Su Changsheng

This article is original content from Alibaba Cloud and may not be reproduced without permission.