Kafka's high-throughput design

High throughput is one of Kafka's design goals. Beyond the most basic technique of dividing a topic into multiple partitions, Kafka also optimizes the following aspects.

On the Kafka broker side, to improve throughput, sequential disk reads and writes go directly through the page cache, with file data mapped into memory; when data is sent over the network socket, sendfile transmits the in-memory region directly (reducing operating system context switches and speeding things up with zero-copy).

On the producer side, messages are buffered and, once a threshold is reached (a message count or a time interval), sent to the Broker in a batch.

On the consumer side, batch fetching is configured so that once a threshold is reached (a message count or a time interval), a batch of messages is pulled from the broker (a minimal consumer sketch follows this overview).

For the producer, consumer, and broker alike, CPU cost is usually not significant, so message compression can be enabled to reduce the amount of data transmitted over the network; compression spends a small amount of CPU so that every message on the wire is compressed. Kafka supports gzip, snappy, and other compression codecs.
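To make the consumer-side batch fetching above concrete, here is a minimal sketch, assuming a local broker and a hypothetical topic and group id; fetch.min.bytes and fetch.max.wait.ms are the thresholds that make the broker return messages in batches.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchFetchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");               // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("fetch.min.bytes", "65536"); // broker waits until at least 64 KB is available...
        props.put("fetch.max.wait.ms", "500"); // ...or 500 ms have passed, then returns a batch
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}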

Sequential disk writes

The " some scenarios sequential write random disk write memory faster than " the, disk write process becomes sequential write, can greatly improve the utilization of the disk.

In Kafka's overall design, a Partition corresponds to a very long array, and all messages received by the Broker are written to this large array in sequence. Consumers consume the data in Offset order, and data that has been consumed is not deleted right away, so random disk writes are avoided.

Because disk space is limited, not all data can be kept; in fact, as a messaging system Kafka does not need to keep all data, so old data must be deleted. This deletion is not done by modifying files in a "read-modify-write" fashion. Instead, a Partition is split into multiple Segments, each corresponding to a physical file, and data is deleted from a Partition by deleting whole files. This way of clearing old data also avoids random writes to files.

As the following code shows, Kafka deletes a Segment by deleting the Segment's entire log file and index files, rather than deleting part of a file.

/**
 * Delete this log segment from the filesystem.
 *
 * @throws KafkaStorageException if the delete fails.
 */
def delete() {
  val deletedLog = log.delete()
  val deletedIndex = index.delete()
  val deletedTimeIndex = timeIndex.delete()
  if (!deletedLog && log.file.exists)
    throw new KafkaStorageException("Delete of log " + log.file.getName + " failed.")
  if (!deletedIndex && index.file.exists)
    throw new KafkaStorageException("Delete of index " + index.file.getName + " failed.")
  if (!deletedTimeIndex && timeIndex.file.exists)
    throw new KafkaStorageException("Delete of time index " + timeIndex.file.getName + " failed.")
}

Making full use of the Page Cache

Using the Page Cache brings the following benefits:

  • The I/O Scheduler assembles consecutive small writes into large physical writes, improving performance
  • The I/O Scheduler tries to reorder some writes into sequential order, reducing disk head seek time
  • All free memory (non-JVM memory) is put to use; an application-level cache (i.e. JVM heap memory) would instead increase the GC burden
  • Reads can be served directly from the Page Cache; if the consumption rate keeps up with the production rate, data is exchanged directly through the Page Cache without going through the physical disk
  • If the process restarts, a cache inside the JVM is lost, while the Page Cache remains available

After the Broker receives data, writing to disk only means writing it into the Page Cache; there is no guarantee that the data has actually reached the disk. From this point of view, if the machine goes down, data still in the Page Cache that has not been written to disk is lost. However, this kind of loss only happens when the operating system itself stops working, for example after a power failure, and that scenario can be handled by Kafka's Replication mechanism. Forcing the data in the Page Cache to be flushed to disk in order to guarantee no loss in this case would instead reduce performance. That is exactly why Kafka, although it provides the flush.messages and flush.ms parameters to force a flush of the Page Cache to disk, does not recommend using them.
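If forced flushing were nonetheless desired, a minimal sketch of the relevant topic-level settings might look like this (the values are only illustrative):

# Topic-level overrides (illustrative values): force a flush to disk after
# every 10000 messages or at least every 1000 ms; by default Kafka leaves
# flushing to the operating system.
flush.messages=10000
flush.ms=1000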

If the consumption rate keeps up with the production rate, data does not even need to be exchanged through the physical disk; it is exchanged directly through the Page Cache. Likewise, when a Follower Fetches data from the Leader, the data can also be served from the Page Cache. The figure below shows the network and disk read/write activity on a Partition Leader node.

Figure: Kafka I/O through the page cache

As the figure shows, this Broker receives roughly 35 MB of data per second over the network from Producers, and although Followers Fetch data from it, it performs essentially no disk reads. That is because the Broker returns the data to the Followers straight from the Page Cache.

Multi-Disk Drive Support

The Broker's log.dirs configuration item allows multiple directories to be configured. If the machine has several Disk Drives, each Disk can be mounted under a different directory and all of these directories listed in log.dirs. Kafka then tries to assign different Partitions to different directories, that is, to different Disks, making full use of the multiple Disks.
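A sketch of such a configuration, assuming two disks mounted at hypothetical paths:

# server.properties - the mount points below are hypothetical, one directory
# per physical disk so that Partitions are spread across both drives.
log.dirs=/mnt/disk1/kafka-logs,/mnt/disk2/kafka-logs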

Zero-copy

Kafka involves a great deal of persisting network data to disk (Producer to Broker) and sending disk files out over the network (Broker to Consumer). The performance of this path directly affects Kafka's overall throughput.

Conventional mode: four copies and four context switches

Take sending a disk file over the network as an example. In the conventional mode, the approach shown in the pseudocode below is used: the file data is first read into memory, then the in-memory data is sent out through a Socket.

buffer = File.read
Socket.send(buffer)

This process actually involves four data copies. First, a system call reads the file data into a kernel-space Buffer (DMA copy); then the application reads that kernel Buffer into a user-space Buffer (CPU copy); next, when the user program sends the data through a Socket, the user-space Buffer is copied back into a kernel-space Buffer (CPU copy); finally a DMA copy moves the data into the NIC Buffer. The process is also accompanied by four context switches, as shown in the figure below.
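For concreteness, a minimal Java sketch of this conventional path (with a hypothetical file path and an already connected socket) is shown below; the comments mark where the four copies happen.

import java.io.FileInputStream;
import java.io.OutputStream;
import java.net.Socket;

public class TraditionalSend {
    public static void send(String path, Socket socket) throws Exception {
        byte[] buffer = new byte[8192];
        try (FileInputStream in = new FileInputStream(path); // hypothetical file path
             OutputStream out = socket.getOutputStream()) {
            int n;
            // read(): DMA copy disk -> kernel buffer, then CPU copy kernel -> user buffer
            while ((n = in.read(buffer)) != -1) {
                // write(): CPU copy user buffer -> kernel socket buffer,
                // then DMA copy kernel socket buffer -> NIC buffer
                out.write(buffer, 0, n);
            }
        }
    }
}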

Figure: conventional I/O, four copies and four context switches

Zero-copy with sendfile and transferTo

Linux kernels 2.4 and later provide zero-copy through the sendfile system call: after the data has been DMA-copied into a kernel-space Buffer, it is DMA-copied directly into the NIC Buffer, with no CPU copy in between. This is where the term zero-copy comes from. Besides reducing data copies, because the entire read-file-then-send operation is handled by a single sendfile call, the whole process involves only two context switches, which greatly improves performance. The zero-copy process is shown in the figure below.

Figure: zero-copy, only two context switches

In terms of the concrete implementation, Kafka's data transfer is handled by TransportLayer; its subclass PlaintextTransportLayer achieves zero-copy through the transferTo and transferFrom methods of Java NIO's FileChannel, as shown below.

@Override
public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
    return fileChannel.transferTo(position, count, socketChannel);
}

Note: transferTo and transferFrom do not guarantee that zero-copy is used. Whether zero-copy actually happens depends on the operating system: if the OS provides a zero-copy system call such as sendfile, these two methods take full advantage of it through that call; otherwise the two methods by themselves cannot achieve zero-copy.
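Outside of Kafka, the same zero-copy path can be used directly from application code. Below is a minimal sketch, assuming a hypothetical file name and server address; transferTo delegates to sendfile where the operating system supports it, so the data never enters user space.

import java.io.File;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        File file = new File("demo.log"); // hypothetical file
        try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
             SocketChannel socketChannel = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = fileChannel.size();
            while (remaining > 0) {
                // Kernel copies file data straight to the socket; no user-space buffer involved.
                long sent = fileChannel.transferTo(position, remaining, socketChannel);
                position += sent;
                remaining -= sent;
            }
        }
    }
}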

Reducing network overhead

Batching

Batching is a common way to improve I/O performance. For Kafka, batching both reduces network transmission overhead and improves disk write efficiency.

In Kafka 0.8.1 and earlier, the Producer distinguishes between a synchronous Producer and an asynchronous Producer. The synchronous Producer's send method comes in two main forms: one accepts a single KeyedMessage and sends one message at a time; the other accepts a batch of KeyedMessages and sends multiple messages at once. For asynchronous sending, whichever send method is used, the implementation does not forward the messages to the Broker immediately; it stores them in an internal queue until the message count reaches a threshold or a specified Timeout elapses, and only then actually sends them, thereby achieving batch sending.

Starting with Kafka 0.8.2, the new Producer API merges the synchronous and asynchronous Producers. Although, judging from the send interface, only one ProducerRecord can be passed at a time rather than a list of messages as in earlier versions, send does not transmit the message immediately; the actual sending frequency is controlled by batch.size and linger.ms, which is how batch sending is achieved.
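A minimal sketch of batching with the new Producer API follows; the broker address and topic name are assumptions, and the batch.size/linger.ms values are only illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "16384"); // ship a partition's batch once it reaches 16 KB...
        props.put("linger.ms", "10");     // ...or after waiting at most 10 ms for more records
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // send() only appends the record to an in-memory batch; a background
                // I/O thread ships whole batches to the Broker.
                producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "message-" + i));
            }
        } // closing the producer flushes any remaining batches
    }
}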

Every network transmission carries, besides the messages themselves, a fair amount of content belonging to the network protocol itself (called Overhead). Combining multiple messages into one transmission therefore effectively reduces the network Overhead and improves transmission efficiency.

The figure in the zero-copy section shows that although the Broker continuously receives data from the network, disk writes do not happen every second; the disk is written at intervals, and each write is very large (up to 718 MB/s at peak).

Reducing network load with data compression

Since version 0.7, Kafka has supported compressing data before transmitting it to the Broker. Besides compressing each message individually, Kafka can also compress an entire Batch of messages together when sending in batches. A basic principle of data compression is that the more repetition there is in the data, the better it compresses, so compressing a whole Batch together shrinks the data more and improves network transmission efficiency further.

After the Broker receives the messages, it does not decompress them; it persists them to disk in compressed form, and the Consumer decompresses them after Fetching. Kafka's compression therefore reduces not only the Producer-to-Broker network load, but also the Broker's disk I/O load and the Broker-to-Consumer network traffic, greatly improving transmission efficiency and throughput.
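On the producer side, batch compression amounts to one additional setting in the Properties of the batching sketch above (the codec choice is illustrative):

props.put("compression.type", "snappy"); // compress each record batch; gzip, lz4 etc. are also supported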

Efficient serialization

The types of a Kafka message's Key and Payload (or Value) can be customized; all that is needed is to supply the corresponding serializer and deserializer. Users can therefore adopt a fast and compact serialization/deserialization scheme (for example Avro or Protocol Buffers) to reduce the actual amount of data transmitted over the network and stored on disk, thereby increasing throughput. Note, however, that if the serialization method is too slow, the final efficiency is not necessarily high even with a very good compression ratio.
Reference: Kafka Design Analysis (VI) - The Way of Kafka's High-Performance Architecture
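A sketch of plugging in a compact serializer, assuming a hypothetical com.example.AvroUserSerializer class that implements org.apache.kafka.common.serialization.Serializer:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class CompactSerializationExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Hypothetical custom serializer that encodes the value with Avro,
        // producing a compact binary payload instead of, say, JSON text.
        props.put("value.serializer", "com.example.AvroUserSerializer");
        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // build and send records as usual; only the on-the-wire encoding changes
        }
    }
}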

 


Origin www.cnblogs.com/doit8791/p/11329440.html