A full analysis of the low-level optimizations in RocketMQ and Kafka

Hello everyone, I am yes.

We all know that RocketMQ and Kafka store messages on disk, so why can message storage and retrieval be so fast? What optimizations do they make? How do their implementations differ, and what are the advantages and disadvantages of each?

Today we will find out.

Storage medium: the disk

Generally speaking, message middleware stores messages in local files, because from an efficiency standpoint writing directly to local files is fastest and most stable. After all, putting messages in third-party storage such as a database adds a dependency (and thus one more thing that can fail), plus network overhead.

For a process that stores messages in disk files, the bottleneck is disk reading and writing. We know that disks are relatively slow to read and write, so how do these systems achieve high throughput with the disk as the storage medium?

Sequential read and write

The answer is sequential reading and writing.

Let's first understand the page cache. The page cache is a cache the operating system keeps in front of the disk, used to reduce disk I/O operations.

When writing to disk, data is actually written to the page cache first, which turns a disk write into a memory write. The written page becomes a dirty page, and the operating system flushes dirty pages to disk at an appropriate time.

When reading, if the page cache hits, the data is returned directly. If the page cache misses, a page fault occurs, the data is loaded from disk into the page cache, and then returned.

Reads also benefit from read-ahead: by the principle of locality, adjacent disk blocks are loaded into the page cache along with the requested one. Writes benefit from write-behind: data first lands only in the page cache, so many small writes can be merged into one large write before the disk is flushed.

Moreover, given the physical structure of a disk, during sequential I/O the head hardly needs to seek, or the seek time is very short.

According to some benchmark results published online, sequential writes to disk can even be faster than random writes to memory.

Of course, writing this way risks data loss. For example, if the machine suddenly loses power, dirty pages that have not yet been flushed are lost. You can call fsync to force a flush, but that costs considerable performance.

Therefore, it is generally recommended to rely on a multi-replica mechanism for message reliability, rather than flushing the disk synchronously.

You can see that sequential I/O suits the structure of the disk, and benefits from read-ahead and write-behind. RocketMQ and Kafka both write sequentially and read approximately sequentially. Both append messages to files: new messages can only be written at the end of a log file, and old messages cannot be modified.
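The append-only pattern both systems rely on can be sketched in a few lines of Java. The class and method names below are illustrative, not taken from either codebase:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// A minimal append-only log: new messages are only ever written at the
// end of the file, so the disk only ever sees sequential writes.
public class AppendOnlyLog {
    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        // APPEND guarantees each write lands at the current end of file.
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Returns the offset at which the message was written.
    public long append(byte[] message) throws IOException {
        long offset = channel.size();
        channel.write(ByteBuffer.wrap(message));
        return offset;
    }

    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("commitlog", ".log");
        AppendOnlyLog log = new AppendOnlyLog(file);
        long first = log.append("msg-1".getBytes(StandardCharsets.UTF_8));
        long second = log.append("msg-2".getBytes(StandardCharsets.UTF_8));
        log.close();
        System.out.println(first + " " + second); // prints "0 5"
    }
}
```

Because the channel is opened in append mode, offsets only ever grow, and the operating system can merge these writes in the page cache before flushing.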

mmap: memory-mapped files

As shown above, accessing a disk file loads data into the page cache, but the page cache belongs to kernel space and cannot be accessed directly from user space, so the data must be copied into a user-space buffer.

In other words, reading the data requires one more copy out of the page cache, so mmap can be used as an optimization: memory-mapped files avoid that copy.

Simply put, file mapping maps the program's virtual pages directly onto the page cache, so no kernel-to-user copy is needed and no duplicate data is created. There is also no need to go through read or write system calls: the file can be manipulated directly through the mapped address plus an offset.
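A minimal Java sketch of the same idea, using NIO's MappedByteBuffer (the JDK's interface to mmap); MmapDemo and roundTrip are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map a file into memory and read/write it through the mapping,
// with no read()/write() system calls on the hot path.
public class MmapDemo {
    static String roundTrip(Path file, String text) throws IOException {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, payload.length);
            map.put(payload);      // write via the mapping: plain memory stores
            map.flip();
            byte[] back = new byte[payload.length];
            map.get(back);         // read via the mapping: plain memory loads
            return new String(back, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap", ".dat");
        System.out.println(roundTrip(file, "hello mmap")); // prints "hello mmap"
    }
}
```

Once mapped, the buffer is backed by the same physical pages as the page cache, which is exactly why the extra copy disappears.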

sendfile: zero copy

Since messages are stored on disk, when a consumer pulls a message it has to be fetched from disk. Let's first look at the general process of sending a file.

Let's briefly talk about DMA. Its full name is Direct Memory Access, and it can read and write system memory directly and independently, without CPU intervention. Devices such as graphics cards and network cards can use DMA.

You can see that the data is copied redundantly, so let's look at the file-sending process when mmap is used.

The number of context switches has not changed, but one data copy is saved, which matches the mmap effect we discussed above.

But there is still a redundant copy. Can't the data be copied directly from the page cache to the network card? sendfile does exactly that. Let's first look at sendfile as of Linux 2.1.

Because sending is done with a single system call, it certainly needs fewer context switches than read + write or mmap + write, but the data still looks redundant. Right, so Linux 2.4 combined sendfile with scatter-gather DMA, achieving truly zero redundant copies.

This is what we usually call zero copy; in Java, FileChannel.transferTo() uses sendfile under the hood.
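A small sketch of transferTo in Java. For simplicity the destination here is another file channel rather than a socket (a real broker would transfer into a socket channel); all names are illustrative:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// FileChannel.transferTo hands the copy to the kernel (sendfile on Linux):
// data moves from the page cache toward the destination without passing
// through a user-space buffer.
public class ZeroCopyDemo {
    static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long transferred = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".log");
        Files.write(src, "pulled message batch".getBytes());
        Path dst = Files.createTempFile("dst", ".log");
        System.out.println(copy(src, dst)); // prints 20
    }
}
```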

Next, let's see how the points mentioned above are applied in RocketMQ and Kafka.

Application of RocketMQ and Kafka

RocketMQ

RocketMQ adopts a mixed-topic append approach: a single CommitLog file contains all messages delivered to this Broker, regardless of which Topic or Queue a message belongs to.

Therefore, all messages are appended sequentially to the CommitLog, and a corresponding ConsumeQueue entry is created for each message. Consumers obtain a message's real physical offset from the ConsumeQueue and then fetch the message from the CommitLog. The ConsumeQueue can be understood as an index over the messages.
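The relationship between the CommitLog and the ConsumeQueue can be modeled with a toy in-memory sketch. The class below is purely illustrative, not RocketMQ's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// A toy model of RocketMQ's layout: one mixed CommitLog holding every
// message, plus a per-queue index of (offset, size) entries, which is
// the role the ConsumeQueue plays.
public class ConsumeQueueModel {
    static final class IndexEntry {
        final int offset, size;
        IndexEntry(int offset, int size) { this.offset = offset; this.size = size; }
    }

    private final StringBuilder commitLog = new StringBuilder();
    private final List<List<IndexEntry>> queues = new ArrayList<>();

    public ConsumeQueueModel(int queueCount) {
        for (int i = 0; i < queueCount; i++) queues.add(new ArrayList<>());
    }

    // Every message, whatever its queue, is appended to the single CommitLog.
    public void append(int queueId, String message) {
        queues.get(queueId).add(new IndexEntry(commitLog.length(), message.length()));
        commitLog.append(message);
    }

    // A consumer resolves the physical offset via the index, then reads the log.
    public String read(int queueId, int logicalIndex) {
        IndexEntry e = queues.get(queueId).get(logicalIndex);
        return commitLog.substring(e.offset, e.offset + e.size);
    }

    public static void main(String[] args) {
        ConsumeQueueModel broker = new ConsumeQueueModel(2);
        broker.append(0, "orderCreated");
        broker.append(1, "userLoggedIn");
        broker.append(0, "orderPaid");
        System.out.println(broker.read(0, 1)); // prints "orderPaid"
    }
}
```

Note how the log itself stays append-only and mixed, while per-queue order lives entirely in the index.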

In RocketMQ, both the CommitLog and the ConsumeQueue are accessed via mmap.

When delivering messages to consumers, RocketMQ by default first copies the data into heap memory and then sends it. Let's look at the code.

From this configuration you can see that transferMsgByHeap defaults to true. With that in mind, let's look at the code where the consumer pulls messages.

You can see that RocketMQ by default copies the message into a heap buffer and then stuffs it into the response body to send. It can be configured not to go through the heap, but even then it does not use true zero copy: the data is sent to the socket buffer via the mappedBuffer.

So RocketMQ uses sequential disk writes plus mmap. It does not use sendfile, and there is still a copy from the page cache to the socket buffer.

Then, when messages are pulled, reads of the CommitLog are strictly speaking random, because the CommitLog stores messages from all topics mixed together. But viewed as a whole, messages are still read from the CommitLog in order, from old data to new. And generally speaking, a message is consumed shortly after it is stored, so it should still be in the page cache at that point, and no disk read is needed.

And, as mentioned above, the page cache is flushed periodically, which is not under our control; memory is limited, so swapping can occur; and mmap is really just a mapping: data is only actually loaded into memory when a page fault occurs on first access. All of this can cause latency spikes for a message queue.

Therefore, RocketMQ makes some optimizations, namely file pre-allocation and file warm-up.

File pre-allocation

The default size of a CommitLog file is 1 GB. When a file reaches this limit, a new file must be prepared. RocketMQ starts a background thread, AllocateMappedFileService, that continuously processes AllocateRequests. An AllocateRequest is a pre-allocation request: the next file is prepared in advance, so that allocating a file during message writing does not cause jitter.
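A rough sketch of the idea in Java, with a small segment size so the example runs quickly. Preallocator and its method are hypothetical names, though naming each file after its starting offset mirrors what RocketMQ does with CommitLog files:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of pre-allocation: before the current log segment fills up, a
// background step creates the next segment at full size, so the write
// path never pauses to grow a file. The CommitLog default is 1 GB; a
// tiny size is used here for the demo.
public class Preallocator {
    static Path preallocate(Path dir, long fileIndex, long sizeBytes) throws IOException {
        // Name each segment after its starting global offset, zero-padded.
        Path next = dir.resolve(String.format("%020d", fileIndex * sizeBytes));
        try (RandomAccessFile raf = new RandomAccessFile(next.toFile(), "rw")) {
            raf.setLength(sizeBytes); // reserve the full segment up front
        }
        return next;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commitlog");
        Path seg = preallocate(dir, 1, 4096); // next segment, 4 KB for the demo
        System.out.println(Files.size(seg)); // prints 4096
    }
}
```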

File warm-up

There is a warmMappedFile method that traverses every page of the currently mapped file, writes a single 0 byte into each, and then calls this.mlock().

Looking at this.mlock, it internally calls mlock and madvise(MADV_WILLNEED).

mlock: locks part or all of the address space used by the process into physical memory, preventing it from being swapped out to swap space.

madvise: Advise the operating system that this file will be accessed in the near future, so it might be a good idea to read a few pages in advance.
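The page-touching part of the warm-up can be sketched in plain Java. Note that mlock and madvise are native calls that pure Java cannot make (RocketMQ reaches them through JNA), so they are only mentioned in comments; FileWarmup and its method are illustrative names:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the warm-up idea: touch every OS page of a freshly mapped
// file by writing a 0 byte, forcing the page faults to happen now rather
// than on the first real message write. A real implementation would then
// call mlock/madvise via native code, which is omitted here.
public class FileWarmup {
    static final int PAGE_SIZE = 4096;

    static int warm(Path file, int length) throws IOException {
        int touched = 0;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, length);
            for (int pos = 0; pos < length; pos += PAGE_SIZE) {
                map.put(pos, (byte) 0); // one store per page triggers the fault
                touched++;
            }
        }
        return touched;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("warm", ".log");
        Files.write(file, new byte[16 * PAGE_SIZE]); // pretend 64 KB segment
        System.out.println(warm(file, 16 * PAGE_SIZE)); // prints 16
    }
}
```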

RocketMQ summary

RocketMQ writes the disk sequentially and, viewed as a whole, reads it sequentially too; it uses mmap rather than true zero copy. And because of the unpredictability of the page cache and mmap's lazy loading (data is only actually loaded when a page fault occurs on access), it adds file pre-allocation and file warm-up, that is, writing a 0 byte into every page and then calling mlock and madvise(MADV_WILLNEED).

Kafka

Kafka's log storage differs from RocketMQ's: it uses one file per partition.

Kafka's message writes are also sequential for a single partition; if there are not many partitions, writes are roughly sequential overall. Its log files do not use mmap, but its index files do. For sending messages, however, Kafka uses zero copy.

Mmap is of little use for message writes anyway, because messages arrive from the network. And for sending messages, sendfile is more efficient than mmap + write, because it saves one copy from the page cache to the socket buffer.

Looking at Kafka's message-sending source code, the final call is FileChannel.transferTo, whose underlying implementation is sendfile.

In Kafka's source code I did not find operations like RocketMQ's mlock. First, mmap is not used for the log anyway; second, swapping can actually be tuned with the Linux system parameter vm.swappiness, and the suggestion here is to set it to 1 rather than 0.

Suppose memory really is insufficient. With the value set to 0, when memory is exhausted and nothing can be swapped out, some process will be killed abruptly. With it set to 1, things can at least limp along; with good monitoring in place you get a chance to notice the problem, and nothing stops suddenly.

RocketMQ & Kafka comparison

First, both write sequentially, but RocketMQ stores all messages in a single file, while Kafka uses one file per partition.

One file per partition is more flexible in terms of migration or data replication .

But with many partitions, writes frequently switch back and forth among multiple files. Each individual file is written sequentially, yet from a global perspective the writes are effectively random, and the same is true for reads. RocketMQ, with its single file, does not have this problem.

From the message-sending perspective, RocketMQ uses mmap + write, with warm-up to reduce the page-fault cost of mmapping large files, while Kafka uses sendfile. Comparatively, I think Kafka's send path is more efficient, because it saves one copy from the page cache to the socket buffer.

And the swap issue can also be addressed by tuning system parameters.

Finally

If there are any errors in the article, please contact me as soon as possible, thanks! And feel free to leave a comment~


I am yes, from a little bit to a little bit, see you in the next article .

Origin blog.csdn.net/yessimida/article/details/108973712