RocketMQ-Message Storage

1. Disk file structure

1.1 Introduction to the file

The file storage structure on a RocketMQ Broker's disk consists of the following files:

  1. CommitLog: Stores message bodies and their metadata, that is, the messages written by Producers. Each file is 1 GB by default (1 GB = 1073741824 bytes). The file name is 20 characters long: the file's starting offset, left-padded with zeros. For example, 00000000000000000000 is the first file, with starting offset 0. When it is full, the second file is 00000000001073741824, with starting offset 1073741824, and so on. Messages are appended to the file sequentially, and when a file is full, writing continues in the next file.
  2. ConsumeQueue: The logical message consumption queue, which acts as an index for consuming messages. For each queue under a Topic, it stores the starting physical offset in the CommitLog, the message size, and the hash code of the message Tag. In physical storage, there is one set of ConsumeQueue files per Topic and QueueId. Each file holds 300,000 entries, so its default size is 6,000,000 bytes, about 5.72 MB. When a ConsumeQueue file is full, writing continues in the next file. (The IndexFile, by contrast, only supports querying messages by key or time range; looking up messages through the IndexFile does not affect the main send and consume path.)
  3. IndexFile: Because all messages are stored together in the CommitLog, querying messages by key directly would be very expensive. The IndexFile exists to meet this need: it is an index that makes it possible to find the actual message content by message Key. In physical storage, each file is named after the timestamp at which it was created. A single IndexFile has a fixed size of about 400 MB and can hold 20 million index entries.
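The file naming and sizing rules above can be checked with a little arithmetic. The following is a minimal sketch, not RocketMQ code; the function names `commitlog_file_name` and `locate` are hypothetical and exist only for illustration.

```python
# Defaults taken from the description above.
COMMITLOG_FILE_SIZE = 1024 * 1024 * 1024   # 1 GB per CommitLog file
CQ_ENTRY_SIZE = 20                         # bytes per ConsumeQueue entry
CQ_ENTRIES_PER_FILE = 300_000              # entries per ConsumeQueue file


def commitlog_file_name(start_offset: int) -> str:
    """File name = starting offset, left-padded with zeros to 20 digits."""
    return f"{start_offset:020d}"


def locate(physical_offset: int, file_size: int = COMMITLOG_FILE_SIZE) -> tuple:
    """Map a global CommitLog offset to (file name, position inside that file)."""
    start = physical_offset - physical_offset % file_size
    return commitlog_file_name(start), physical_offset - start


print(commitlog_file_name(0))                    # 00000000000000000000
print(commitlog_file_name(COMMITLOG_FILE_SIZE))  # 00000000001073741824
print(locate(COMMITLOG_FILE_SIZE + 100))         # second file, position 100
print(CQ_ENTRY_SIZE * CQ_ENTRIES_PER_FILE)       # 6000000 bytes, about 5.72 MB
```

Because the file name is the starting offset, locating the file that holds a given physical offset needs no extra lookup table.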

1.2 File format


1.2.1 CommitLog

The CommitLog is the physical file where messages are stored: it holds the message bodies and metadata written by Producers. The CommitLog on each Broker is shared by all queues on that machine, with no distinction between them.

1.2.2 ConsumeQueue

The message consumption queue. After a message arrives in the CommitLog file, it is asynchronously forwarded to the message consumption queue, from which consumers consume it. Each ConsumeQueue entry is 20 bytes long: an 8-byte CommitLog offset, a 4-byte message size, and an 8-byte hash code of the message Tag.
A single ConsumeQueue file contains 300,000 entries by default, so its length is 300,000 × 20 bytes. A ConsumeQueue file can be viewed as an array of entries whose index is the logical offset within the ConsumeQueue; the offset stored as message consumption progress is this logical offset.
The ConsumeQueue is an index over the CommitLog. It is built as follows: when a message reaches the CommitLog, a dedicated thread generates a dispatch task, which builds both the ConsumeQueue file and the IndexFile described below.
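Since every entry has the same length, a logical offset maps to a byte position by simple multiplication. The sketch below illustrates this with Python's `struct` module. It assumes the 20 bytes are an 8-byte CommitLog offset, a 4-byte size, and an 8-byte Tag hash code in big-endian order; the helper names and sample values are made up for illustration.

```python
import struct

# Assumed entry layout: 8-byte CommitLog offset, 4-byte size, 8-byte tag hash.
CQ_ENTRY = struct.Struct(">qiq")
ENTRIES_PER_FILE = 300_000


def pack_entry(commitlog_offset: int, size: int, tag_hash: int) -> bytes:
    """Encode one ConsumeQueue entry as 20 bytes."""
    return CQ_ENTRY.pack(commitlog_offset, size, tag_hash)


def entry_position(logical_offset: int) -> tuple:
    """Map a logical offset to (file index, byte position inside that file)."""
    file_index, slot = divmod(logical_offset, ENTRIES_PER_FILE)
    return file_index, slot * CQ_ENTRY.size


def read_entry(queue_bytes: bytes, logical_offset: int) -> tuple:
    """Treat the queue as an array of entries indexed by logical offset."""
    return CQ_ENTRY.unpack_from(queue_bytes, logical_offset * CQ_ENTRY.size)


# Two hypothetical entries pointing at messages in the CommitLog.
queue = pack_entry(0, 120, 2598) + pack_entry(120, 95, 2599)
print(CQ_ENTRY.size)             # 20
print(read_entry(queue, 1))      # (120, 95, 2599)
print(entry_position(300_001))   # (1, 20): second file, second entry
```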

1.2.3 IndexFile

The message index file. If a message carries a key, the IndexFile stores an index entry for it, which mainly records the mapping between the message Key and its offset. The ConsumeQueue is an index built specifically for message subscription, to speed up retrieval by topic and message queue. In addition, RocketMQ uses a hash index to index messages by key. Like a HashMap, its design rests on two elements: hash slots and a linked list for hash collisions.
An IndexFile consists of three parts: the IndexHeader, the hash slots, and the index entries. The index file is mainly used to query messages by key, and the process is as follows:

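  1. Compute hashcode(key) % slotNum for the queried key to get the slot position (slotNum is the maximum number of slots in one index file, for example slotNum = 5000000).
  2. Use slotValue (the value stored at the slot position) to find the last item of the index entry list (the list is in reverse order; slotValue always points to the newest index entry).
  3. Traverse the index entry list and return the results that fall within the queried time range (by default, at most 32 records are returned per query).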
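The three steps above can be sketched in memory as follows. This is an illustration under stated assumptions, not RocketMQ's on-disk format: `IndexSketch` is a hypothetical class, `zlib.crc32` stands in for Java's `String.hashCode`, -1 marks an empty slot, and a small `slot_num` replaces the real 5,000,000.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    key_hash: int
    commitlog_offset: int
    timestamp: int
    prev_index: int   # previous entry in the same slot, -1 if none


class IndexSketch:
    def __init__(self, slot_num: int = 5):
        self.slot_num = slot_num
        self.slots = [-1] * slot_num          # slot -> newest entry index
        self.entries: List[IndexEntry] = []

    @staticmethod
    def key_hash(key: str) -> int:
        # Stand-in hash; the real Broker uses the Java hash code of the key.
        return zlib.crc32(key.encode("utf-8"))

    def put(self, key: str, commitlog_offset: int, timestamp: int) -> None:
        h = self.key_hash(key)
        slot = h % self.slot_num
        # The new entry links back to the previous head of the slot.
        self.entries.append(IndexEntry(h, commitlog_offset, timestamp, self.slots[slot]))
        self.slots[slot] = len(self.entries) - 1

    def query(self, key: str, begin: int, end: int, max_num: int = 32) -> List[int]:
        h = self.key_hash(key)
        idx = self.slots[h % self.slot_num]   # step 1 and 2: slot -> newest entry
        result = []
        while idx != -1 and len(result) < max_num:   # step 3: walk the chain
            entry = self.entries[idx]
            if entry.key_hash == h and begin <= entry.timestamp <= end:
                result.append(entry.commitlog_offset)
            idx = entry.prev_index
        return result


index = IndexSketch()
index.put("order-1", 100, timestamp=10)
index.put("order-2", 200, timestamp=20)
index.put("order-1", 300, timestamp=30)
print(index.query("order-1", begin=0, end=100))   # newest first: [300, 100]
```

Because each slot points to its newest entry and every entry links to the one before it, results come back newest first.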

2. Message reading and writing process

2.1 The overall architecture of message storage

Picture: Overall architecture of RocketMQ message storage
The figure above shows the overall architecture of RocketMQ's message storage. RocketMQ uses a hybrid storage structure: all queues on a single Broker instance share one log data file (the CommitLog). The drawback of this structure is that reading produces more random reads, which lowers read efficiency. In addition, consuming messages depends on the ConsumeQueue, and building this logical consumption queue has its own overhead.

As the overall architecture diagram shows, RocketMQ's hybrid storage structure separates data from indexes for the Producer and the Consumer. The Producer sends a message to the Broker, and the Broker persists it to the CommitLog by flushing to disk either synchronously or asynchronously. Once a message has been flushed to the CommitLog file on disk, the message sent by the Producer will not be lost. Consumers are therefore guaranteed a chance to consume it, even if consumption is slightly delayed. And even if a Consumer's pull request finds no message to consume, the Broker can use the long polling mechanism to hold the request, wait for a while, and then try the pull again.
On the Broker side, the background service thread ReputMessageService continuously dispatches requests and asynchronously builds the ConsumeQueue and IndexFile data. The Consumer can then find the messages to consume through the ConsumeQueue. The ConsumeQueue serves as the index for consuming messages: for each queue under a Topic, it stores the starting physical offset in the CommitLog, the message size, and the hash code of the message Tag. The IndexFile only supports querying messages by key or time range.

2.2 Send message

When sending, the Producer does not deal with the Consume Queue directly. As mentioned above, all RocketMQ (RMQ) messages are stored in the Commit Log. To keep messages from being interleaved, concurrent writes to the Commit Log from multiple threads are serialized with a lock.

Picture: Commit Log Sequential Write
Once writes have been serialized by the lock, messages are written to the Commit Log sequentially, which is the familiar append operation. Thanks to the Page Cache, RMQ writes the Commit Log very efficiently. Because these writes are so fast, RMQ can afford to offer a spin lock here to improve performance.
After a message is written to the Commit Log, the Broker asynchronously dispatches the messages one by one to the corresponding Consume Queue file.
Picture: Consume Queue sequential writing
Each Consume Queue represents a logical queue and is appended to by ReputMessageService in a single-threaded loop. As the figure above shows, each Consume Queue is written sequentially, from left to right, which is efficient.
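The send path described in this section (locked append to a shared log, then dispatch to per-queue indexes) can be sketched as below. Everything here is an assumption made for illustration: `BrokerSketch` is a hypothetical class, `threading.Lock` stands in for the Broker's put-message lock, and the frame layout (total size, queue id, topic length, topic, body) is invented and is not RocketMQ's real message format.

```python
import struct
import threading
from collections import defaultdict

HEADER = struct.Struct(">IIH")   # total size, queue id, topic length (10 bytes)


class BrokerSketch:
    def __init__(self):
        self.commit_log = bytearray()             # one log shared by all queues
        self.put_lock = threading.Lock()          # serializes appends
        self.consume_queues = defaultdict(list)   # (topic, queue id) -> [(offset, size)]
        self.reput_from = 0                       # next offset to dispatch

    def put_message(self, topic: str, queue_id: int, body: bytes) -> int:
        """Append one message frame and return its physical offset."""
        topic_bytes = topic.encode("utf-8")
        total = HEADER.size + len(topic_bytes) + len(body)
        frame = HEADER.pack(total, queue_id, len(topic_bytes)) + topic_bytes + body
        with self.put_lock:
            offset = len(self.commit_log)
            self.commit_log += frame
        return offset

    def reput_once(self) -> None:
        """Dispatch every message appended since the last call (single thread)."""
        while self.reput_from < len(self.commit_log):
            offset = self.reput_from
            total, queue_id, topic_len = HEADER.unpack_from(self.commit_log, offset)
            start = offset + HEADER.size
            topic = bytes(self.commit_log[start:start + topic_len]).decode("utf-8")
            self.consume_queues[(topic, queue_id)].append((offset, total))
            self.reput_from = offset + total


broker = BrokerSketch()
broker.put_message("TopicA", 0, b"hello")
broker.put_message("TopicB", 1, b"world!")
broker.put_message("TopicA", 0, b"x")
broker.reput_once()
print(broker.consume_queues[("TopicA", 0)])   # [(0, 21), (43, 17)]
```

Messages from different topics are interleaved in the one log, while each queue's index only grows at its tail, so both kinds of file are written sequentially.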

2.3 Consuming messages

When consuming, the Consumer does not deal with the Commit Log directly; it pulls data from the Consume Queue.
Picture: Consume Queue sequential reading
As the figure above shows, messages are pulled from oldest to newest, and each Consume Queue is read sequentially.
A Consume Queue entry does not contain the actual message content, only a reference to the message's offset in the Commit Log, so a second lookup into the Commit Log at that offset is needed to get the actual message data.
Picture: Random read of Commit Log
This raises a problem: as the figure above shows, the Commit Log is read randomly.
Picture: Commit Log Overall Ordered Random Read
Although these are random reads, overall the reads still progress from oldest to newest. As long as the region being read stays within the hot part of the Page Cache, the Page Cache can still be used to full effect.
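The two-step read described in this section (sequential scan of the Consume Queue, then a jump into the Commit Log) can be sketched as follows. It assumes 20-byte big-endian entries holding an 8-byte offset, a 4-byte size, and an 8-byte Tag hash; the `pull` function and the sample data are hypothetical.

```python
import struct

CQ_ENTRY = struct.Struct(">qiq")   # assumed: CommitLog offset, size, tag hash


def pull(consume_queue: bytes, commit_log: bytes, logical_offset: int, max_msgs: int):
    """Read up to max_msgs messages starting at logical_offset.

    Returns the message bodies and the next logical offset to pull from.
    """
    total_entries = len(consume_queue) // CQ_ENTRY.size
    last = min(logical_offset + max_msgs, total_entries)
    messages = []
    for i in range(logical_offset, last):
        # Step 1: sequential read of the small index entry.
        phys_offset, size, _tag_hash = CQ_ENTRY.unpack_from(consume_queue, i * CQ_ENTRY.size)
        # Step 2: jump into the shared CommitLog at the recorded offset.
        messages.append(commit_log[phys_offset:phys_offset + size])
    return messages, logical_offset + len(messages)


# A CommitLog holding messages of two queues interleaved: AAAA | bb | CCC.
commit_log = b"AAAAbbCCC"
# The index of the first queue only references AAAA and CCC.
queue_0 = CQ_ENTRY.pack(0, 4, 0) + CQ_ENTRY.pack(6, 3, 0)

print(pull(queue_0, commit_log, logical_offset=0, max_msgs=32))
# ([b'AAAA', b'CCC'], 2)
```

The offsets visited in the Commit Log skip over other queues' messages but never move backwards, which is the "overall ordered" random read pattern described above.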

Random file reads and writes are usually slow, but sequential file reads and writes can be almost as fast as random access to memory. They are this fast because the OS optimizes file IO: when the OS finds that the system has plenty of free physical memory, it uses part of that memory as Page Cache to improve IO performance.

When the OS reads from disk, it reads ahead in file order and loads the following content into the cache, so that the next read can hit the cache. When writing, the data is written to the cache and the call returns immediately; a kernel flusher thread (pdflush on older kernels) later writes the data from the cache back to disk according to its policy.

During sequential file IO, the regions being read and written are hot spots that the OS has already cached, so they do not trigger large numbers of page faults that go back to the disk. File IO becomes almost equivalent to memory IO.

When a message is sent, it is written to the Page Cache rather than directly to disk, and an asynchronous thread flushes it to disk. When a message is received, it is read directly from the Page Cache rather than from disk through a page fault. Because the cache itself is managed by the kernel, the data does not need to be copied from the program to the kernel and can be transmitted directly through the Socket.
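The difference between writing to the Page Cache and forcing data to disk can be seen with plain file calls. The sketch below is generic OS-level Python, not RocketMQ code; the `append` helper and its `sync_flush` flag are hypothetical names that mirror the synchronous and asynchronous flush modes mentioned earlier.

```python
import os
import tempfile


def append(fd: int, data: bytes, sync_flush: bool) -> int:
    """Append data to an open file descriptor.

    os.write returns once the data is in the kernel's page cache.
    os.fsync additionally blocks until the kernel has pushed it to the device.
    """
    written = os.write(fd, data)
    if sync_flush:
        os.fsync(fd)
    return written


def demo() -> bytes:
    fd, path = tempfile.mkstemp()
    try:
        append(fd, b"msg-1", sync_flush=False)   # fast: page cache only
        append(fd, b"msg-2", sync_flush=True)    # slower: waits for the disk
        # A reader sees both messages either way, served from the page cache.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.close(fd)
        os.remove(path)


print(demo())   # b'msg-1msg-2'
```

Without the `fsync`, the write is acknowledged sooner, but data that is still only in the cache can be lost if the machine loses power before the kernel flushes it.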

2.4 PageCache and Mmap memory mapping

The page cache deserves a brief introduction here. The operating system serves all file I/O requests through the page cache mechanism. To the operating system, a disk file is a sequence of data blocks whose size is determined by the operating system itself. The standard page size on x86 Linux is 4 KB.

When the kernel processes a file I/O request, it first looks in the page cache (each data block in the page cache is tagged with its file and offset). On a miss, it starts a disk I/O, loads the data block from the disk file into a free block in the page cache, and then copies it to the user buffer.

The page cache also reads data files ahead. On the first read request for a file, the system reads the requested page and a few pages that follow it. To improve the page cache hit rate (that is, to keep accessed pages in physical memory as much as possible), more physical memory is better from a hardware point of view. From the operating system's point of view, even if only a 1 KB message is accessed, the system reads more data ahead, so the next message read is likely to hit memory.

In RocketMQ, the ConsumeQueue stores little data and is read sequentially. With page cache read-ahead, reading the ConsumeQueue performs almost as well as reading memory, and even a message backlog does not hurt it. For the CommitLog, the log data file that stores the messages, reading message content produces more random reads, which seriously affects performance. Choosing a suitable IO scheduling algorithm, for example setting the scheduler to "Noop" (when the block storage is SSD), also improves random read performance.
Picture: OS memory usage, including buff/cache
In the figure above, the OS has 3.7 GB of physical memory in total and 2.7 GB is used, so about 1 GB should be free, yet the OS reports only 175 MB free. This is because, when the OS finds that the system has plenty of unused physical memory, it uses the spare memory as file cache to improve IO performance; this is the buff/cache value in the figure. Broadly speaking, the Page Cache discussed here is a subset of that memory.

In addition, RocketMQ mainly reads and writes files through MappedByteBuffer. It uses the FileChannel model in NIO to map the physical file on disk directly into a user-space memory address. Mmap removes the overhead in traditional IO of copying file data back and forth between the buffer in the kernel address space and the buffer in the application's address space. File operations become direct operations on memory addresses, which greatly improves file read and write efficiency. Note that memory mapping through MappedByteBuffer has several limitations, one of which is that only 1.5 to 2 GB of a file can be mapped into user-space virtual memory at a time; this is why RocketMQ sets the default size of a single CommitLog data file to 1 GB.
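Memory mapping itself is easy to try out. The sketch below uses Python's `mmap` module as a stand-in for Java's MappedByteBuffer, so it shows the technique rather than RocketMQ's implementation; the function name, the 4096-byte file size, and the payload are arbitrary choices for illustration.

```python
import mmap
import os
import tempfile


def mapped_write_demo(file_size: int = 4096) -> bytes:
    fd, path = tempfile.mkstemp()
    try:
        # Preallocate a fixed-size file, similar to a fixed-size log segment.
        os.ftruncate(fd, file_size)
        with mmap.mmap(fd, file_size) as mapped:
            payload = b"hello rocketmq"
            # Writing to the mapping is a plain memory write; no write() call.
            mapped[0:len(payload)] = payload
            # Ask the OS to write the dirty pages back to the file.
            mapped.flush()
        # Reading the file through normal IO sees the mapped write.
        with open(path, "rb") as f:
            return f.read(len(payload))
    finally:
        os.close(fd)
        os.remove(path)


print(mapped_write_demo())   # b'hello rocketmq'
```

The mapping has a fixed length, which is one reason a design based on memory mapping works with fixed-size files that are allocated up front.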



Origin blog.csdn.net/lihuayong/article/details/108560436