The Implementation Principles of Kafka’s High Throughput, Low Latency, and High Performance

Author: Source Code Times-Teacher Raymon

Kafka is a ubiquitous messaging middleware in the big data field. It is widely used inside enterprises to build real-time data pipelines and to power their stream-computing applications. Although Kafka stores its data on disk, it still delivers high performance, high throughput, and low latency: its throughput can easily reach tens of thousands, hundreds of thousands, or even millions of messages per second. The reasons behind this are well worth exploring, so let’s master Kafka’s various ingenious designs together.

Throughput: Throughput is the amount of data transferred, or the number of transactions processed, by a system, network, or device within a given period of time. It is one of the key indicators of system performance and efficiency.

  • For networks, throughput can refer to the data transfer rate of a network connection and can be measured in bytes/second or bits/second.
  • For a server or database system, throughput can represent the number of transactions or requests that the system can handle per second.
  • In a storage system, throughput can represent the amount of data the system reads or writes per second.

Higher throughput means the system can move data through faster, while lower throughput leads to transfer delays and a congested system.

For example, if Kafka processes one message per millisecond, it processes 1,000 messages per second; the number of messages handled per unit of time is the throughput. If each of those 1,000 messages is 10 KB, the throughput is roughly 1,000 messages, or about 10 MB, per second.


Low latency: Low latency means that the time it takes for a system or application to process a request or transmit data is as short as possible.

If you process each message as soon as it arrives, each one might take 1 millisecond to handle, so you can process 1,000 messages per second: that is your per-second throughput. But what if you use micro-batch processing? Suppose you collect 1,000 messages over a 10-millisecond window and then hand them to the engine all at once, and the engine processes that batch of 1,000 in a single millisecond. You can then handle roughly 100,000 messages per second, increasing throughput by a factor of 100.


This is the so-called micro-batch processing technique used in stream computing. If you process messages one by one, every message has to acquire its own computing resources and pay its own network, and even disk, overhead. But if you process 1,000 messages at a time, that one-time overhead is roughly the same as the overhead of processing a single message, so it is amortized across the whole batch.
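Kafka’s own producer client exposes this batching idea through configuration. Below is a minimal sketch, where the broker address localhost:9092, the topic name demo-topic, and the specific batch.size / linger.ms values are purely illustrative placeholders, of a producer that accumulates records in memory and ships them in batched requests rather than one network round trip per message.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "65536"); // accumulate up to 64 KB per partition batch
        props.put("linger.ms", "10");     // wait up to 10 ms for more records to join a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // send() is asynchronous: the record is appended to an in-memory batch
                // and shipped together with its neighbors in one network request.
                producer.send(new ProducerRecord<>("demo-topic",
                        Integer.toString(i), "message-" + i));
            }
            producer.flush(); // push out any partially filled batches
        }
    }
}
```

Raising linger.ms trades a few milliseconds of latency for larger batches and therefore higher throughput, which is exactly the micro-batching trade-off described above.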

Next, let’s take a look at the underlying principles of Kafka’s high throughput and low latency implementation.

1. Page cache technology + sequential disk writes

First of all, Kafka will write to the disk every time it receives data, as shown in the figure below:

If data lives on disk and is constantly being written to disk files, won’t the performance be poor? Most people’s instinct is that disk write performance is terrible.

But in fact, Kafka has an extremely clever design here precisely to guarantee write performance. First of all, Kafka writes files through the operating system’s page cache. (The operating system itself maintains a layer of cache in memory, called the page cache; we can also call it the os cache, meaning the cache managed by the operating system itself.)

When writing to a disk file, the data can be written directly into the os cache, that is, only into memory; the operating system then decides when to actually flush the data from the os cache into the file on disk.
This step alone improves disk write performance enormously, because it is effectively a write to memory, not to disk. Look at the picture below:

Page caching alone already buys a lot of performance, but Kafka additionally uses a sequential disk write technique here. In other words, data is only ever appended to the end of the file, never modified at random positions within the file.
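To make this concrete, here is a small sketch, not Kafka’s actual storage code, of an append-only log file written through Java NIO (the file name demo-segment.log is a placeholder). With a normal buffered open, write() returns as soon as the bytes have been copied into the OS page cache; only an explicit force() call (an fsync) waits for the physical disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLogSketch {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("demo-segment.log"); // placeholder segment file
        try (FileChannel channel = FileChannel.open(segment,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) { // append-only: purely sequential writes
            for (int i = 0; i < 10; i++) {
                ByteBuffer record = ByteBuffer.wrap(
                        ("record-" + i + "\n").getBytes(StandardCharsets.UTF_8));
                // Returns once the bytes sit in the OS page cache;
                // the kernel decides later when to flush them to disk.
                channel.write(record);
            }
            // Optional: block until the data really reaches the disk (fsync).
            channel.force(true);
        }
    }
}
```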

A hard disk has several important characteristics, chief among them its sequential read/write performance and its random read/write performance. If the hard disk is a warehouse, the data on the disk is the goods stored in it, and reading and writing is a worker fetching or shelving those goods. If the worker needs to pick up one large refrigerator, then even though the refrigerator is big and heavy, the worker can finish quickly because it is a single trip. That is sequential reading and writing.

But if the job requires fetching a bottle of water, a pencil case, a bag of bread, a mouse, and a tube of toothpaste all at once, then even though the items are small and light, the worker has to run all over the warehouse to collect them, so the work goes much more slowly. That is random reading and writing.

Summary:
Page cache plus sequential disk writes gives Kafka extremely high write performance. It minimizes the time spent on each message and thereby greatly increases the number of messages that can be processed per second. Kafka is generally deployed on physical machines, and a single machine can comfortably write tens of thousands to hundreds of thousands of messages per second.
This approach also satisfies both requirements, low latency and high throughput, at once: by squeezing the write cost of each message as far down as possible, writes stay low-latency and the per-second throughput rises naturally. This is one of Kafka’s core underlying mechanisms.

2. Zero-copy technology for high-performance reads

Then, when consuming data, the data has to be read from the disk file and sent over the network. How can performance be improved at this stage?
The first technique is, again, the page cache. As mentioned above, when Kafka writes data to a disk file it actually writes into the page cache, so no disk IO happens immediately and most of the freshly written data stays in the page cache at the os layer (this is essentially similar to how elasticsearch works).

Then, when reading, the normal path is as follows: first try to read the data from the page cache; if it is not there, read it from disk via disk IO, after which it is placed into the page cache at the os layer. Then a context switch into the application occurs, and the data in the os read cache is copied into the application buffer.

Then another context switch back to the os layer occurs, the data in the application buffer is copied into the os socket buffer, and finally the data is handed to the network card: [non-zero-copy implementation]

During this process there are several context switches and several data copies. Even leaving the interaction with the hardware aside, at least two of those copies, from the os cache to the user buffer and from the user buffer to the socket buffer, are completely unnecessary.
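For contrast, a rough sketch of this traditional path is shown below (the file name demo-segment.log and the localhost:9000 endpoint are placeholders): the application reads the file into a user-space buffer and then writes that buffer into the socket, so every byte is copied out of the kernel into user space and then back in again.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class CopyingSendSketch {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("demo-segment.log"); // placeholder file
             Socket socket = new Socket("localhost", 9000);                // placeholder endpoint
             OutputStream out = socket.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            // Each iteration copies kernel page cache -> user buffer (read),
            // then user buffer -> kernel socket buffer (write).
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```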

But with zero-copy technology, namely Linux’s sendfile, the operation can be handed directly to the os. The os checks whether the data is in the page cache, reads it from disk if it is not, and then copies the data from the os cache straight to the network card, so all of those extra steps are skipped: [zero-copy technology]

Compared with the earlier figure, the zero-copy path is clearly much faster. By using zero-copy technology to read data from disk, with the page cache helping out, read performance becomes very high as well.
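In the JVM this zero-copy path is exposed through FileChannel.transferTo(), which is what Kafka’s broker relies on and which is backed by sendfile on Linux. A minimal sketch, using the same placeholder file and endpoint as above, looks like this: the bytes flow from the page cache to the socket inside the kernel without ever entering the application’s buffers.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySendSketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("demo-segment.log"), // placeholder file
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9000))) {             // placeholder endpoint
            long position = 0;
            long remaining = file.size();
            // transferTo() delegates to sendfile on Linux: the data moves from
            // the page cache to the socket buffer without entering user space.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```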

Origin blog.csdn.net/u014494148/article/details/134882943