How Kafka achieves high throughput

 

Kafka is a distributed messaging system that has to process a very large volume of messages. Its design writes every message to slow, high-capacity hard disks in exchange for greater storage capacity, yet in practice relying on disks does not impose an excessive performance penalty.
 
Kafka mainly uses the following methods to achieve this very high throughput:
 
Sequential read and write
 
Kafka's messages are continuously appended to the end of its log files. This design lets Kafka take full advantage of the disk's sequential read/write performance.
 
Sequential reads and writes avoid the seek time of the disk head and incur only a small amount of rotational latency, so they are much faster than random reads and writes.
 
The Kafka documentation quotes benchmark figures for such a setup (RAID-5, 7200 rpm disks):

Sequential I/O: 600 MB/s

Random I/O: 100 KB/s
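
To make the append-only pattern concrete, here is a minimal Java sketch (not Kafka's actual log code; the file name is made up) that writes records by appending them to the end of a single file, which is exactly the access pattern that keeps the disk working sequentially:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class AppendOnlyLog {
        public static void main(String[] args) throws IOException {
            // Open the log in append mode: every write lands at the end of the file,
            // so the disk sees a purely sequential write pattern.
            try (FileChannel log = FileChannel.open(Path.of("demo-partition.log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                for (int i = 0; i < 1000; i++) {
                    ByteBuffer record = ByteBuffer.wrap(
                            ("message-" + i + "\n").getBytes(StandardCharsets.UTF_8));
                    log.write(record); // appended, never seeking backwards
                }
                log.force(true); // flush file content and metadata to disk
            }
        }
    }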
 
Zero copy
First, consider briefly how the operating system normally handles this kind of task: for example, a program needs to send a file's contents over the network.
 
The program runs in user space, while files and network sockets are hardware resources managed by the kernel; kernel space sits between the two.
 
Inside the operating system, the traditional process is: the kernel reads the file from disk into a kernel buffer, copies it into the program's user-space buffer, the program writes it back into the kernel's socket buffer, and the data is finally copied to the network card. This involves four data copies and four context switches between user mode and kernel mode.

Since Linux kernel 2.2, a system call mechanism known as "zero copy" (sendfile) has been available. It skips the copy into the user buffer: data is transferred directly from the kernel's read buffer (the page cache) to the socket, and is never copied into a user-mode buffer.
 
Context switches drop from four to two, which can roughly double performance.
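
Kafka uses this mechanism through Java NIO's FileChannel.transferTo, which on Linux is backed by sendfile. The sketch below is illustrative only (the host, port and file name are placeholders, not Kafka code):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999));
                 FileChannel file = FileChannel.open(Path.of("demo-partition.log"), StandardOpenOption.READ)) {
                long position = 0;
                long remaining = file.size();
                while (remaining > 0) {
                    // transferTo maps to sendfile(2) on Linux: bytes flow from the
                    // kernel page cache straight to the socket, never touching user space.
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }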

File segmentation
A Kafka topic is divided into multiple partitions, and each partition is further divided into multiple segments, so the messages of one queue are actually stored across many segment files.

Thanks to this segmentation, each file operation works on a relatively small file, which keeps individual operations lightweight and increases the capacity for parallel processing.
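
On disk, each partition is a directory and each segment is a pair of files (a .log file plus its .index) named after the offset of the first message they contain. The sketch below simply lists such segment files; the directory path and topic name are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class ListSegments {
        public static void main(String[] args) throws IOException {
            // Hypothetical log directory for partition 0 of topic "orders".
            Path partitionDir = Path.of("/var/kafka-logs/orders-0");
            try (Stream<Path> files = Files.list(partitionDir)) {
                files.filter(p -> p.toString().endsWith(".log"))
                     .sorted()
                     // Segment files are named after their base offset,
                     // e.g. 00000000000000000000.log, 00000000000000170210.log, ...
                     .forEach(p -> System.out.println(p.getFileName()));
            }
        }
    }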
 
Bulk send
Kafka can send messages in batches: the producer first buffers messages in memory and then sends them together in a single request.
 
For example, you can specify that buffered messages are flushed once they reach a certain amount, or after they have been buffered for a fixed length of time.
 
For instance: send once 100 messages have accumulated, or every 5 seconds, whichever comes first.
 
This strategy greatly reduces the number of I/O operations on the broker side.
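
With the official Java producer, this batching is controlled by the batch.size setting (bytes per batch, not a message count) and linger.ms (how long to wait for more records). A minimal sketch, with a placeholder broker address and topic name:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class BatchingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("batch.size", "16384"); // flush a batch once it holds ~16 KB of records
            props.put("linger.ms", "5");      // or after waiting at most 5 ms for more records

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100; i++) {
                    // send() only hands the record to an in-memory buffer;
                    // the client ships whole batches to the broker in the background.
                    producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
                }
            } // close() flushes any records still sitting in the buffer
        }
    }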
 
Data compression
Kafka also supports compressing sets of messages: the Producer can compress a message set using the GZIP or Snappy format.

The benefit of compression is that it reduces the amount of data transferred and relieves pressure on the network.

After the Producer compresses, the Consumer has to decompress, which adds CPU work; but for large-scale data processing the bottleneck is the network rather than the CPU, so this cost is well worth it.
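
In the Java producer this is a single setting, compression.type, which can be set to gzip or snappy (newer clients also offer lz4 and zstd); consumers decompress transparently. A minimal sketch with a placeholder broker and topic:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CompressingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            // Whole batches are compressed before leaving the producer, trading a
            // little CPU for far less data on the network.
            props.put("compression.type", "snappy"); // or "gzip"

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("demo-topic", "a large, repetitive payload ..."));
            }
        }
    }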

 

Source: http://it.dataguru.cn/article-9855-1.html
