Revisiting efficient file reading and writing with Apache Kafka

0. Overview

Kafka says: don't be afraid of the filesystem.

It simply writes ordinary files sequentially and leverages the Page Cache of the Linux kernel. It does not (explicitly) use memory, yet it gets all the benefit of memory. It also avoids the annoyance other systems have of maintaining data in memory while persisting it at the same time: as long as there is enough memory and the producer and consumer run at roughly the same speed, all reads and writes happen in the Page Cache, with no synchronous disk access at all.

From top to bottom, the whole IO stack is divided into the filesystem layer (VFS + ext3), the Page Cache layer, the generic block layer, the IO scheduling layer, and the block device driver layer. Using Apache Kafka as the guide, this article revisits the Page Cache layer and the IO scheduling layer, as a popular-science write-up based on Linux kernel 2.6.

1. Page Cache

1.1 The read/write alley-oop

Linux always turns memory that applications are not using into Page Cache. Run free on the command line, or cat /proc/meminfo: the "Cached" figure is the Page Cache.

In the Page Cache, each file is a radix tree whose nodes are 4 KB pages, so the page holding any given file offset can be located quickly.
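As a rough illustration (this is not the kernel's actual code), the radix-tree key is simply the page number derived from the file offset:

```c
#include <stdio.h>

#define PAGE_SHIFT 12  /* 4 KB pages: 2^12 = 4096 */

int main(void) {
    long long offset = 1234567;                 /* byte offset into the file */
    unsigned long index = offset >> PAGE_SHIFT; /* radix-tree key: page number */
    unsigned int in_page = offset & 0xFFF;      /* position inside that page */
    printf("offset %lld -> page %lu, byte %u within the page\n",
           offset, index, in_page);
    return 0;
}
```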

When a write occurs, the kernel just copies the data into the Page Cache and marks the page dirty.

When a read occurs, the kernel first looks in the Page Cache: on a hit it returns immediately; on a miss it reads the file from disk and also stores the data in the Page Cache.

So as long as producer and consumer run at roughly the same speed, the consumer directly reads the data the producer just wrote into the Page Cache: the whole relay happens in memory, with no disk access at all.

Compared with the traditional approach of also maintaining a copy of the messages in process memory, this does not waste memory on a second copy, the Page Cache never needs GC (feel free to use 60 GB of memory), and even if the Kafka process restarts, the Page Cache is still there.

1.2 Background asynchronous flush strategy

This is the part everyone should care about most, because if pages are not flushed in time, an OS crash (not an application crash) can lose data, and the Page Cache instantly turns from friend into devil.

Of course, Kafka itself is not afraid of losing a little data, because its durability is guaranteed by replication: after a restart, the missing data is pulled back from the replica followers.

The pdflush kernel threads are responsible for sending pages marked dirty down to the IO scheduling layer. The kernel starts one pdflush thread per disk, wakes it every 5 seconds (/proc/sys/vm/dirty_writeback_centisecs), and decides its behavior from the three parameters below (a snippet for inspecting them follows the list):

1. If a page has been dirty for more than 30 seconds (/proc/sys/vm/dirty_expire_centisecs, in hundredths of a second), it is flushed to disk, so a crash loses at most about 30 seconds of data.

2. If dirty pages already exceed 10% (/proc/sys/vm/dirty_background_ratio) of reclaimable memory (MemFree + Cached - Mapped in /proc/meminfo), pdflush starts writing them to disk in the background without affecting current write(2) calls. Raising or lowering this value is the most important knob when tuning the flush strategy.

3. If write(2) comes in faster than pdflush can drain it and dirty pages climb to 20% (/proc/sys/vm/dirty_ratio) of total memory (MemTotal in /proc/meminfo), then the write operations of all applications are blocked and each performs the flush in its own time slice, because the OS has decided that writeback cannot keep up and a crash would lose too much data, so everyone has to calm down. This is expensive and should be avoided as much as possible. Before Redis 2.8, AOF rewrite often triggered this large-scale stall; it has since been changed so that Redis actively flushes every 32 MB.
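A minimal sketch for reading these tunables on a given machine (paths as documented above; the values vary by distribution and kernel):

```c
#include <stdio.h>

/* Print one /proc/sys/vm tunable, e.g. dirty_ratio. */
static void show(const char *name) {
    char path[128], buf[64];
    snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%-28s = %s", name, buf);
    if (f) fclose(f);
}

int main(void) {
    show("dirty_writeback_centisecs"); /* wakeup interval, 1/100 s */
    show("dirty_expire_centisecs");    /* max dirty age, 1/100 s   */
    show("dirty_background_ratio");    /* background flush trigger */
    show("dirty_ratio");               /* blocking flush trigger   */
    return 0;
}
```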

For a detailed article, see: The Linux Page Cache and pdflush.

1.3 Active flush methods

For important data, the application must trigger the flush itself to guarantee the write reaches disk.

1. System calls fsync() and fdatasync()

fsync(fd) sends write requests for all dirty pages belonging to that file descriptor down to the IO scheduling layer.

fsync() always flushes both the file content and the file metadata, while fdatasync() flushes the content plus only the metadata needed for subsequent operations. Metadata includes timestamps, size, and so on: the size may be needed by later reads, but the timestamps are not. Because a file's metadata is stored in a different place on disk, fsync() always triggers two IOs and performs slightly worse.
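A minimal sketch of the difference (error handling trimmed; the file name is made up for illustration):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char msg[] = "important record\n";
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;

    write(fd, msg, strlen(msg));  /* lands in the Page Cache, page marked dirty */

    fdatasync(fd);  /* flush content + only the metadata needed to read it back */
    /* fsync(fd);      would also flush timestamps etc.: usually one extra IO   */

    close(fd);
    return 0;
}
```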

2. Opening the file with the O_SYNC, O_DSYNC, or O_DIRECT flag

The O_SYNC and O_DSYNC flags make every write wait for the flush to complete before returning, equivalent to following every write() with an fsync() or fdatasync(). But according to the tests in APUE, because the OS optimizes this path, it performs better than calling write() + fsync() yourself, though still much slower than a plain write().

The O_DIRECT flag means direct IO, skipping the Page Cache entirely. But this also gives up the cache when reading, so every read must hit the disk, and it requires the length and offset of every IO request to be an integer multiple of the underlying sector size. An application using direct IO must therefore do its own caching.
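A sketch of direct IO with an aligned buffer, assuming a 512-byte sector size (real code should query the device; alignment rules vary by filesystem and kernel):

```c
#define _GNU_SOURCE     /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 512       /* assumed sector size; query the device in real code */

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    /* O_DIRECT needs an aligned buffer, and an aligned length and offset. */
    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) return 1;
    memset(buf, 'x', ALIGN);

    pwrite(fd, buf, ALIGN, 0);  /* goes straight past the Page Cache */

    free(buf);
    close(fd);
    return 0;
}
```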
 

1.4 Page Cache cleaning strategy

When memory fills up, the kernel must either clear pages from the Page Cache or swap application memory out to the swap file. The swappiness parameter (/proc/sys/vm/swappiness), valued 0 to 100, decides the balance: setting it to 0 means avoid swap as much as possible, which is what many tuning guides tell you to do, because the default value is actually 60 and Linux considers the Page Cache more important.

The Page Cache cleaning strategy is an upgraded LRU. With a plain LRU, newly read data that may only ever be used once would flood the head of the queue. So the original LRU queue is split into two: one for new pages, and one for pages that have been accessed several times. A page enters the new queue on its first access and is promoted to the old queue after several more accesses (think of the young and old generations of the JVM heap). Cleaning starts from the tail of the new queue until enough memory is freed.
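A toy sketch of that two-list flow (nothing like the kernel's real implementation; it only shows first access, promotion, and tail eviction):

```c
#include <stdio.h>
#include <string.h>

/* Toy two-list LRU, loosely modelled on the active/inactive list idea.
   Pages are just ints; the real kernel tracks far more state. */
#define CAP 4  /* capacity of each list, tiny for demonstration */

typedef struct { int page[CAP]; int n; } List;

static void evict_tail(List *l) { if (l->n) l->n--; }  /* drop the oldest */

static int find(List *l, int p) {
    for (int i = 0; i < l->n; i++) if (l->page[i] == p) return i;
    return -1;
}

static void push_front(List *l, int p) {
    if (l->n == CAP) evict_tail(l);
    memmove(&l->page[1], &l->page[0], l->n * sizeof(int));
    l->page[0] = p;
    l->n++;
}

static void access_page(List *new_q, List *old_q, int p) {
    int i = find(new_q, p);
    if (i >= 0) {               /* accessed again: promote to the old list */
        memmove(&new_q->page[i], &new_q->page[i + 1],
                (new_q->n - i - 1) * sizeof(int));
        new_q->n--;
        push_front(old_q, p);
    } else if (find(old_q, p) >= 0) {
        /* already promoted; a full LRU would also move it to the front */
    } else {
        push_front(new_q, p);   /* first access: enters the new list */
    }
}

int main(void) {
    List new_q = {{0}, 0}, old_q = {{0}, 0};
    int trace[] = {1, 2, 3, 1, 4, 5, 6, 7};  /* page 1 is read twice */
    for (unsigned i = 0; i < sizeof trace / sizeof *trace; i++)
        access_page(&new_q, &old_q, trace[i]);
    printf("new list: %d entries, old list: %d entries\n", new_q.n, old_q.n);
    printf("page 1 survived in the old list: %s\n",
           find(&old_q, 1) >= 0 ? "yes" : "no");
    return 0;
}
```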

 

1.5 Read-ahead strategy

Following from the cleaning strategy: if a Kafka consumer is too slow and tens of gigabytes of content pile up, those cached pages will still be evicted, and the consumer will have to read from disk.

The kernel also has a dynamically adaptive read-ahead strategy: every read request tries to prefetch some extra content (it is performing a read anyway). If the kernel sees that a process keeps consuming the prefetched data, it grows the read-ahead window (minimum 16 KB, maximum 128 KB); otherwise it shuts the window off. A file being read sequentially is an obvious fit for read-ahead.
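An application can also declare its access pattern up front instead of waiting for the kernel to detect it; a minimal sketch using posix_fadvise (the file name is made up):

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("segment.log", O_RDONLY);
    if (fd < 0) return 1;

    /* Hint that we will read sequentially, so the kernel may enlarge
       its read-ahead window right away. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;  /* consume the data; later reads often hit prefetched pages */

    close(fd);
    return 0;
}
```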

 

2. IO scheduling layer

Sending every read and write request straight to the hard disk is too cruel for a traditional disk. The IO scheduling layer mainly does two things: merging and sorting. Merging combines operations on the same or adjacent sectors (512 bytes each) into one: if I want to read sectors 1, 2, and 3, that can be merged into a single read of sectors 1-3. Sorting arranges all pending operations into a queue in sector order, so the disk head can move in one direction, cutting down on seeks, the slowest thing a mechanical disk does.

Sorting looks beautiful, but it can cause serious unfairness: one application madly writing to adjacent sectors can leave everyone else waiting. That is fine for pdflush, which works asynchronously, but read requests are synchronous, and a reader stuck waiting there would be miserable.

Various algorithms exist to solve this. The default in kernel 2.6 is CFQ (Completely Fair Queuing): it splits the single sorted queue into one sorted queue per process doing IO, then schedules the queues round-robin by time slice, taking a few requests (4 by default) from each process's queue in turn.
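To check which scheduler a block device is using, read its queue/scheduler file; a small sketch, assuming a device named sda (the active scheduler appears in brackets):

```c
#include <stdio.h>

int main(void) {
    /* "sda" is an assumption: substitute your own block device. */
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    char line[128];
    if (f && fgets(line, sizeof(line), f))
        printf("%s", line);   /* e.g. "noop anticipatory deadline [cfq]" */
    if (f) fclose(f);
    return 0;
}
```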

In Apache Kafka, message reads and writes all take place in memory; the real disk writing is done by the pdflush kernel threads. Because all writes are sequential, even the multiple Partition files on one server can be merged and sorted into good performance. In other words, the number of Partition files does not hurt performance, and a large number of files does not degrade into random reads and writes.

On an SSD, with no seek cost, sorting seems unnecessary, but merging still helps a great deal, which is why there is also the NOOP algorithm: merging only, no sorting.

Off topic

Also, the hard disk itself carries a cache of tens of megabytes; the gap between the external transfer rate (bus to cache) and the internal transfer rate (cache to platter) on a disk's spec sheet comes from exactly here. The IO scheduling layer may think the data has reached the disk when it is actually still sitting in that cache; if the power is cut, the battery or large capacitor on the disk is what saves the data.

 

