How is Kafka so fast? How does it read and write data so efficiently?

Whether Kafka is used as an MQ or as a storage layer, it really has only two jobs (that simple): one is storing the data produced by the Producer on the broker, and the other is letting the Consumer read data from the broker. Kafka's speed therefore comes down to reading and writing. Let's look at the reasons Kafka is fast in both.


1. Use Partition to achieve parallel processing

We all know that Kafka is a Pub-Sub messaging system, and whether publishing or subscribing, a Topic must be specified.

A Topic is only a logical concept. Each Topic contains one or more Partitions, and different Partitions can be located on different nodes.

On the one hand, because different Partitions can sit on different machines, the cluster can be fully exploited for parallel processing across machines. On the other hand, because a Partition physically corresponds to a folder, even when multiple Partitions sit on the same node, they can be configured to live on different disks of that node, achieving parallel processing across disks and taking full advantage of multiple disks.
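The key-to-partition routing that enables this parallelism can be sketched in a few lines. This is a simplified stand-in for Kafka's default partitioner (which actually uses a murmur2 hash, not MD5), but the idea is the same: a keyed record always lands in the same partition, while different keys spread across partitions and hence across machines and disks.

```python
# Sketch: mapping records to partitions so that different partitions
# (potentially on different brokers/disks) can be written in parallel.
# Illustrative only; Kafka's real partitioner uses murmur2, not MD5.
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Same key -> same partition: preserves per-key ordering while
    spreading overall load across partitions."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

records = [b"user-1", b"user-2", b"user-3", b"user-1"]
parts = [choose_partition(k, 3) for k in records]
assert parts[0] == parts[3]   # identical keys hit the same partition
```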

If work can be processed in parallel, speed certainly improves: multiple workers are faster than one.

So we can write to different disks in parallel; but can the speed of disk reads and writes themselves be controlled?

Let's first talk briefly about disks and I/O.


What factors limit hard disk performance? How should a system be designed around disk I/O characteristics?

The main components inside a hard disk are the platters, the actuator arm, the read-write heads, and the spindle motor. The actual data is written on the platters, and reading and writing are done mainly by the heads on the actuator arm: in operation, the spindle rotates the platters while the arm extends to position the head over the platter for reads and writes. The physical structure of the disk is shown in the figure below:

(Figure: physical structure of a hard disk)

Because a single platter has limited capacity, a typical hard disk has two or more platters. Each platter has two recordable sides, so each platter corresponds to two heads. A platter is divided into many wedge-shaped areas, each called a sector. The concentric circles of different radii around the platter's center are called tracks, and the set of tracks with the same radius across the different platters forms a cylinder. Tracks and cylinders both describe circles of different radii, so in many contexts the two terms are used interchangeably. A top-down view of a platter is shown below:

(Figure: top-down view of a disk platter, showing sectors, tracks, and cylinders)

The key factor affecting disk performance is disk service time, that is, the time the disk takes to complete one I/O request. It consists of three parts: seek time, rotational latency, and data transfer time.

Mechanical hard disks have very good sequential read/write performance but very poor random read/write performance, mainly because it takes time for the head to move to the correct track. Under random access the head must move constantly, and time is wasted on seeking, so performance is low. The key metrics for measuring disks are IOPS and throughput.
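The three components of service time can be put into rough numbers. The figures below are illustrative assumptions for a generic 7200 rpm drive, not any specific model's spec, but they show why random IOPS on a mechanical disk lands in the low hundreds at best:

```python
# Back-of-the-envelope service time for one random 4 KB I/O on a
# 7200 rpm disk (illustrative numbers, not a real drive's datasheet).
avg_seek_ms = 9.0                        # assumed average seek time
rpm = 7200
avg_rotation_ms = 0.5 * 60_000 / rpm     # half a revolution on average: ~4.17 ms
transfer_ms = 4 / (100 * 1024) * 1000    # 4 KB at ~100 MB/s sequential rate

service_ms = avg_seek_ms + avg_rotation_ms + transfer_ms
iops = 1000 / service_ms
print(f"service time ~= {service_ms:.2f} ms, random IOPS ~= {iops:.0f}")
```

Note that seek plus rotation dominates: the actual data transfer is a rounding error. Sequential access pays the seek/rotation cost once and then streams, which is exactly the property Kafka exploits below.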

In Kafka, HBase, and many other open-source frameworks, random I/O is converted into sequential I/O as far as possible by using append-only writes, which reduces seek time and rotational latency and thereby maximizes effective IOPS.

Interested readers can dig further into how disk I/O works under the hood.

How fast a disk reads and writes depends on how you use it, that is, whether access is sequential or random.

2. Write to disk sequentially


Each partition in Kafka is an ordered, immutable sequence of messages; new messages are continuously appended to the end of the partition. This is sequential writing.

A benchmark was run long ago: "2 million writes per second (on three cheap machines)" http://ifeve.com/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines/

Since disk space is finite, not all data can be kept forever; in fact, as a messaging system, Kafka does not need to keep all data, so old data must be deleted. Thanks to sequential writing, Kafka's deletion strategies do not modify files in a read-then-rewrite fashion. Instead, each Partition is split into multiple segments, each corresponding to one physical file, and data is deleted by removing entire segment files. This way of purging old data also avoids random writes to files.
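The segment scheme above can be sketched in a few lines. This is an illustrative toy, not Kafka's actual log implementation (real segments are named by base offset and carry index files, among other things), but it shows the two properties that matter: writes only ever append to the end of the active segment, and retention deletes whole files rather than rewriting data in place.

```python
# Minimal sketch of a segmented, append-only log.
import os
import tempfile

class SegmentedLog:
    def __init__(self, directory, segment_bytes=64):
        self.dir = directory
        self.segment_bytes = segment_bytes
        self.segments = []                    # ordered segment file paths
        self._roll()

    def _roll(self):
        path = os.path.join(self.dir, f"{len(self.segments):08d}.log")
        self.segments.append(path)
        open(path, "wb").close()

    def append(self, record: bytes):
        active = self.segments[-1]
        if os.path.getsize(active) + len(record) > self.segment_bytes:
            self._roll()                      # start a fresh segment
            active = self.segments[-1]
        with open(active, "ab") as f:         # sequential append only
            f.write(record)

    def delete_oldest(self):
        if len(self.segments) > 1:
            os.remove(self.segments.pop(0))   # drop a whole file, no rewrite

log = SegmentedLog(tempfile.mkdtemp())
for _ in range(20):
    log.append(b"x" * 16)    # 20 records roll across 5 segments of 64 bytes
log.delete_oldest()          # retention = unlink the oldest segment file
```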

3. Make the most of Page Cache

The purpose of the Cache layer is to improve the Linux operating system's disk-access performance. The Cache layer keeps part of the disk's data in memory; when a data request arrives, if the data is present in the Cache and up to date, it is handed directly to the user program, skipping the underlying disk operation and improving performance. The Cache layer is also one of the main reasons a disk's apparent IOPS can exceed 200.

In the Linux implementation, the file cache is divided into two levels: Page Cache and Buffer Cache, where each Page Cache contains several Buffer Caches. The Page Cache mainly serves as a cache for file data in the file system, especially when a process performs read/write operations on files. The Buffer Cache is mainly designed to cache blocks when the system reads and writes block devices.

Benefits of using Page Cache:

  • The I/O scheduler assembles consecutive small writes into large physical writes, improving performance
  • The I/O scheduler tries to reorder some write operations to reduce disk head movement
  • All free memory (non-JVM memory) is put to use; an application-level cache (i.e., JVM heap memory) would increase the GC burden
  • Read operations can be served directly from the Page Cache; if consumption keeps pace with production, data need not pass through the physical disk at all (it is exchanged directly through the Page Cache)
  • If the process restarts, any cache inside the JVM is lost, but the Page Cache remains available

After the Broker receives data, "writing to disk" only writes it into the Page Cache; there is no guarantee the data has fully reached the disk. Seen this way, data sitting in the Page Cache may be lost if the machine goes down before it is flushed. But this kind of loss occurs only when the operating system itself stops working, for example on power failure, and that scenario can be fully handled by Kafka's Replication mechanism. Forcibly flushing the Page Cache to disk to guard against it would only reduce performance. For this reason, Kafka provides the flush.messages and flush.ms parameters to force flushing the Page Cache to disk, but does not recommend using them.
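The trade-off described above comes down to two system calls. A minimal sketch: a plain write() lands in the OS page cache and returns quickly, while fsync() forces the data onto stable storage, which is effectively what a flush-per-message policy would do for every record (and why it hurts throughput).

```python
# Sketch of the flush trade-off: page-cache write vs. forced sync.
import os
import tempfile

path = tempfile.mkstemp()[1]
fd = os.open(path, os.O_WRONLY)

os.write(fd, b"message-1")   # lands in the page cache: fast, but lost on power failure
os.fsync(fd)                 # forces the dirty pages to disk: durable, but slow
os.close(fd)
```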

4. Zero Copy Technology

In Kafka, a large amount of network data is persisted to disk (Producer to Broker) and disk files are sent over the network (Broker to Consumer). The performance of this process directly affects the overall throughput of Kafka.

The core of the operating system is the kernel. It is independent of ordinary applications and can access both protected memory space and the underlying hardware devices.

To prevent user processes from directly operating on the kernel and to keep the kernel secure, the operating system divides virtual memory into two parts: kernel space (Kernel-space) and user space (User-space).

In traditional Linux systems, the standard I/O interfaces (such as read and write) are based on data-copy operations: an I/O operation causes data to be copied between buffers in the kernel address space and buffers in the user address space, which is why standard I/O is also called cached I/O. The advantage is that when the requested data is already in the kernel's cache, the actual I/O operation can be skipped; the disadvantage is that the data copying itself costs CPU time.

We simplify the production and consumption of Kafka into the following two processes [2]:

  1. Persist network data to disk (Producer to Broker)
  2. The disk file is sent over the network (Broker to Consumer)

4.1 Persistence of network data to disk (Producer to Broker)

In the traditional mode, transferring data from the network to a file requires 4 data copies, 4 context switches, and 2 system calls.

 

data = socket.read()   // read data from the network
File file = new File()
file.write(data)       // persist to disk
file.flush()

This process actually involves four data copies:

  1. First, the network data is copied into the kernel-mode Socket Buffer via DMA copy
  2. Then the application reads the kernel-mode Buffer data into user mode (CPU copy)
  3. Then the user program copies the user-mode Buffer back into kernel mode (CPU copy)
  4. Finally, the data is copied to the disk file via DMA copy

DMA (Direct Memory Access) is a hardware mechanism that allows bidirectional data transfer between peripherals and system memory without CPU involvement. Using DMA frees the CPU from the actual I/O data transfer, greatly improving system throughput.

At the same time, four context switches occur, as shown in the following figure:

(Figure: the four data copies and four context switches of the traditional write path)

Flushing data to disk is usually not real-time, and Kafka's producer-side persistence is no exception: Kafka does not write data to the hard disk in real time. Instead, it takes full advantage of modern operating systems' paged storage, using memory to improve I/O efficiency. This is the Page Cache discussed in the previous section.

For Kafka, data produced by the Producer is stored on the broker. This process reads network data from the socket buffer, and that data could in principle stay in kernel space: there is no real need to read the socket buffer's network data into an application-process buffer. Here the application process is the broker, which receives the producer's data and persists it.

In this special scenario, where network data received from the socket buffer needs no intermediate processing before being persisted, mmap memory file mapping can be used.

Memory Mapped Files (mmap for short, also called MMFile): the purpose of mmap is to map the address of the kernel's read buffer into a user-space buffer, so that the kernel buffer and application memory are shared and the copy from the kernel read buffer to the user buffer is eliminated. It works by using the operating system's pages to map a file directly into physical memory; once the mapping is complete, operations on that memory are synchronized to the hard disk by the OS.

This yields a large I/O improvement, since it eliminates the overhead of copying from user space to kernel space.
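The mechanism can be demonstrated with Python's stdlib mmap module. This is only an illustration of the OS facility described above (Kafka's broker is JVM code, not Python): the process writes into memory that is shared with the kernel's page cache, and flush() asks the kernel to write the dirty pages to disk.

```python
# Sketch of writing through a memory-mapped file.
import mmap
import tempfile

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.truncate(4096)                 # mmap needs a non-empty file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    mm[0:5] = b"kafka"               # plain memory write, no write() syscall per record
    mm.flush()                       # ask the OS to sync the dirty pages to disk
    mm.close()

with open(path, "rb") as f:
    assert f.read(5) == b"kafka"     # the bytes landed in the file
```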

mmap also has an obvious drawback: it is unreliable. Data written through mmap has not actually been written to the hard disk; the operating system writes it out only when the program calls flush. Kafka provides a parameter, producer.type, to control whether flush is called proactively: flushing immediately after writing to mmap and only then returning to the Producer is called synchronous (sync); returning to the Producer immediately after writing to mmap, without calling flush, is called asynchronous (async). The default is sync.

(Figure: the mmap-based write path)

Zero-copy technology means that the CPU does not need to first copy data from one memory area to another when the computer performs an operation, which reduces context switches and CPU copy time.

Its purpose is to reduce the number of data copies and system calls so that the CPU participates as little as possible (ideally not at all) while a datagram travels from the network device to user program space, eliminating the CPU load on this path.

There are currently three types of zero-copy technology [3]:

  • Direct I/O: data bypasses the kernel and moves directly between the user address space and the I/O device; the kernel only performs auxiliary work such as the necessary virtual memory configuration
  • Avoiding copies between kernel and user space: when the application does not need to touch the data itself, the copy from kernel space to user space can be skipped; mechanisms in this family include mmap, sendfile, splice && tee, and sockmap
  • Copy-on-write: data is not copied up front; only the portion that needs to be modified is copied at the time of modification

4.2 The disk file is sent over the network (Broker to Consumer)

Traditional approach: read from disk first, then send through the socket; this actually involves four copies:

 

buffer = File.read()   // read file data into a user-space buffer
Socket.send(buffer)    // send the buffer over the network

This process is analogous to the message-production path above:

  1. First, the file data is read into a kernel-mode Buffer via a system call (DMA copy)
  2. Then the application reads the kernel-mode Buffer data into a user-mode Buffer (CPU copy)
  3. Then, when sending through the Socket, the user program copies the user-mode Buffer into the kernel-mode Socket Buffer (CPU copy)
  4. Finally, the data is copied to the NIC Buffer via DMA copy

The Linux 2.4+ kernel provides zero copy through the sendfile system call: after the data is copied into the kernel-mode buffer via DMA, it is copied directly to the NIC buffer via DMA, with no CPU copy. This is the origin of the term "zero copy". Besides reducing data copies, the entire read-file-and-send-over-network operation is completed by a single sendfile call, so the whole process involves only two context switches, greatly improving performance.

(Figure: the sendfile zero-copy path)

Kafka's solution here is to use Java NIO's transferTo/transferFrom, which invokes the operating system's sendfile, to achieve zero copy. In total only 2 kernel data copies, 2 context switches, and one system call occur, and the CPU data copies are eliminated.
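The same kernel facility is exposed in many languages; here is a self-contained sketch using Python's stdlib os.sendfile over a Unix socketpair (Linux). The file's bytes travel from the page cache to the socket entirely inside the kernel, with no user-space buffer in between; this only illustrates the syscall Kafka's transferTo ultimately relies on.

```python
# Sketch of zero-copy file-to-socket transfer via sendfile (Linux).
import os
import socket
import tempfile

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"log-segment-bytes")        # 17 bytes standing in for a log segment

server, client = socket.socketpair()     # self-contained "network" connection
with open(path, "rb") as f:
    # Kernel copies page-cache pages straight to the socket: no CPU copy
    # into a user-space buffer, unlike read()+send().
    sent = os.sendfile(server.fileno(), f.fileno(), 0, 17)
server.close()

assert client.recv(64) == b"log-segment-bytes"
client.close()
```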

5. Batch processing

In many cases, the bottleneck of the system is not CPU or disk, but network IO.

Therefore, in addition to the low-level batching provided by the operating system, Kafka clients and brokers accumulate multiple records (for both reads and writes) into a batch before sending data over the network. Batching records amortizes the network round-trip overhead and uses larger packets to improve bandwidth utilization.
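The accumulate-then-send idea can be sketched as follows. This is illustrative only: the max_bytes limit plays the role of a batch-size threshold, and the sent_batches list stands in for actual network sends; it is not the Kafka client's code.

```python
# Sketch of record batching: ship many records per network round trip.
class BatchingSender:
    def __init__(self, max_bytes=1024):
        self.max_bytes = max_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.sent_batches = []             # stands in for network sends

    def send(self, record: bytes):
        if self.buffered_bytes + len(record) > self.max_bytes:
            self.flush()                   # batch full: one "round trip"
        self.buffer.append(record)
        self.buffered_bytes += len(record)

    def flush(self):
        if self.buffer:
            self.sent_batches.append(b"".join(self.buffer))
            self.buffer, self.buffered_bytes = [], 0

sender = BatchingSender(max_bytes=100)
for _ in range(25):
    sender.send(b"0123456789")             # 25 ten-byte records
sender.flush()
# 250 bytes of records cross the "network" in 3 sends instead of 25.
assert len(sender.sent_batches) == 3
```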

6. Data Compression

The Producer can compress data before sending it to the broker, reducing the cost of network transmission. The compression algorithms currently supported are Snappy, Gzip, and LZ4. Data compression is generally used together with batching as an optimization.
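A quick stdlib illustration of why compression pairs with batching: compressing a whole batch of similar records shrinks far better than compressing each record individually, because the codec can exploit redundancy across records. Gzip here stands in for whichever codec the client is configured to use; the record format is made up for the example.

```python
# Batch compression vs. per-record compression (illustrative records).
import gzip

records = [b'{"user":"u%03d","event":"click"}' % i for i in range(100)]
batch = b"".join(records)

per_record = sum(len(gzip.compress(r)) for r in records)   # codec overhead per record
whole_batch = len(gzip.compress(batch))                    # redundancy shared across records

assert whole_batch < per_record                            # batch compression wins
assert gzip.decompress(gzip.compress(batch)) == batch      # lossless round trip
```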

Summary

If an interviewer asks me again, "Why is Kafka so fast?", I would answer:

  • Partition-based parallel processing
  • Sequential disk writes, making full use of the disk's characteristics
  • Use of modern operating systems' paged storage (Page Cache), leveraging memory to improve I/O efficiency
  • Zero-copy technology:
  • Data produced by the Producer is persisted to the broker using mmap file mapping, achieving fast sequential writes
  • The Consumer reads data from the broker using sendfile, which reads the disk file into the OS kernel buffer and then transfers it directly to the NIC buffer for network transmission, reducing CPU consumption

 

Origin: blog.csdn.net/yunduo1/article/details/108714939