In Pursuit of Ultimate Performance: 11 Kafka Essentials You Should Master

Many readers have messaged me privately asking what measures Kafka takes to optimize performance. I have actually answered these questions before, just never in one systematically organized article, so I recently spent a little time putting it together; the next time someone asks me about this, I can casually toss out a link. This is also a frequent Kafka interview question, and an interviewer who asks it is not trying to give you a hard time. There are plenty of articles on the Internet that discuss it, such as "Why is Kafka so fast?", which was reposted by some big public accounts a while ago. I have read those articles and they are well written; the problem is that they only list part of the essentials rather than covering them all. The points listed here should be more complete than anything you will find by searching online, so if you read this article and later meet the question in an interview, I believe your answer will make the interviewer's eyes light up.

PS: Every point in this article is described in detail in the book "In-depth Understanding of Kafka"; if any of them is unfamiliar to you, you can go take a look there.

Batch processing

Traditional message middleware sends and consumes messages one at a time. For the producer, sending a message and having the broker return an ACK to confirm receipt costs two RPCs; for the consumer, requesting a message, having the broker return it, and finally sending an ACK to confirm consumption costs three RPCs (some message middleware optimizes this a little by returning multiple messages per request). Kafka instead works in batches: the producer aggregates a number of messages and then ships them to the broker, completing in two RPCs what would otherwise take many. Suppose you want to send 1000 messages of 1KB each: traditional message middleware needs 2000 RPCs, while Kafka can package those 1000 messages into one 1MB payload and finish the job with 2 RPCs. This was once regarded as a kind of "cheating", but now that the micro-batch concept is everywhere, other message middleware has begun to follow suit.
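As a concrete illustration, batching on the Java producer is controlled mainly by the standard batch.size and linger.ms settings. The sketch below is minimal and uses placeholder broker and topic names; the values mirror the 1MB example above rather than being tuning advice.

```java
// A minimal sketch (placeholder broker/topic names) of the producer settings that
// control batching: records are grouped per partition until a batch is full
// (batch.size) or has waited long enough (linger.ms), then sent in one request.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 1048576);  // up to ~1MB of records per partition batch
        props.put("linger.ms", 10);        // wait up to 10ms to let a batch fill up

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // 1000 sends do not mean 1000 RPCs; they are shipped as a few batched requests.
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        }
    }
}
```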

Client optimization

Continuing with the idea of batching: the new producer client abandons the old single-threaded design in favor of two threads, a main thread and a Sender thread. The main thread is responsible for putting messages into the client-side cache, and the Sender thread is responsible for sending messages from that cache, which aggregates multiple messages into a batch. Some message middleware simply throws each message straight at the broker.
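From the caller's point of view this means that send() only appends the record to the in-memory cache on the main thread and returns, while the background Sender thread does the network I/O and the callback fires once the broker responds. A small fragment, reusing the producer variable from the batching sketch above:

```java
// send() returns as soon as the record is put into the client cache; the Sender
// thread ships the batch later, and the callback runs when the broker's response arrives.
producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
    if (exception != null) {
        exception.printStackTrace();
    } else {
        System.out.printf("appended to %s-%d at offset %d%n",
                metadata.topic(), metadata.partition(), metadata.offset());
    }
});
```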

Log Format

Kafka's log format has gone through three versions since 0.8: v0, v1 and v2. An earlier article, "Understanding the Evolution of Kafka Message Formats in One Article", describes the Kafka log format in detail; the newer format is more friendly to handling messages in bulk, and interested readers can read that article to learn more.
[Figure: Kafka message format versions v0, v1 and v2]

Log encoding

If you understand Kafka's log format in detail (refer to the figure above), you will know that a log record (a Record, also called a message) contains other fields besides the basic key and value. These extra fields originally took up a fixed amount of space (see the left side of the figure), whereas the latest Kafka versions encode them with Varints and ZigZag, which effectively shrinks the space they occupy. Smaller logs (messages) mean more efficient network transfer and more efficient log storage, so overall performance improves.
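To make the encoding concrete, here is a minimal, self-contained sketch of ZigZag plus varint encoding, the general scheme the v2 record format uses for its variable-length fields; the class and method names are illustrative, not Kafka internals.

```java
// A toy implementation of ZigZag + varint encoding: small values (including small
// negative values) occupy one or two bytes instead of a fixed 4-byte int.
import java.io.ByteArrayOutputStream;

public class VarintSketch {

    // ZigZag maps signed ints to unsigned ones so small negatives stay short:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4 ...
    static int zigZagEncode(int value) {
        return (value << 1) ^ (value >> 31);
    }

    // Varint: 7 bits of payload per byte; the high bit marks "more bytes follow".
    static byte[] writeVarint(int value) {
        int v = zigZagEncode(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & 0xFFFFFF80) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeVarint(1).length);    // 1 byte
        System.out.println(writeVarint(-1).length);   // 1 byte
        System.out.println(writeVarint(300).length);  // 2 bytes
    }
}
```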

Message compression

Kafka supports several message compression codecs (gzip, snappy, lz4). Compressing messages can greatly reduce network traffic and network I/O, and thereby improve overall performance. Message compression is a classic trade of time for space; if you have strict latency requirements, enabling it is not recommended.
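Enabling compression on the producer is a single standard setting; the fragment below extends the configuration from the batching sketch earlier.

```java
// Whole batches are compressed before they go on the wire, trading CPU time
// (and some latency) for less network I/O and smaller logs.
props.put("compression.type", "lz4");   // alternatives: "gzip", "snappy", or "none" (default)
```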

Indexing for fast lookups

Each log segment file has two corresponding index files (the offset index and the timestamp index), which are mainly used to improve the efficiency of looking up messages; this is another way of improving performance. (The details are covered at length in Chapter 5 of the book; it seems I forgot to publish that part on the public account, as I looked around and could not find it.)
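The idea behind the offset index can be shown in a few lines: sparse (relative offset, file position) entries are binary-searched for the largest indexed offset not exceeding the target, and the segment is then scanned from that position. The sketch below is a simplified in-memory model, not Kafka's actual index code.

```java
// A toy offset-index lookup: binary search over sparse index entries, then
// return the byte position from which the segment file would be scanned.
public class OffsetIndexSketch {
    // Each entry: index[i][0] = relative offset, index[i][1] = byte position in the segment file.
    static long lookupPosition(long[][] index, long targetRelativeOffset) {
        int lo = 0, hi = index.length - 1, found = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index[mid][0] <= targetRelativeOffset) { found = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return index[found][1];
    }

    public static void main(String[] args) {
        long[][] index = { {0, 0}, {40, 4096}, {80, 8192}, {120, 12288} };
        System.out.println(lookupPosition(index, 100)); // 8192: scan from the entry for offset 80
    }
}
```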

Partition

Many people overlook this factor, but partitioning is in fact a very effective way to improve performance, and its effect is far more noticeable than the log encoding and message compression discussed above. Partitioning is also used heavily in other distributed components; why partitioning improves performance is basic knowledge that I will not repeat here. Note, though, that blindly adding partitions does not keep improving performance forever; interested readers can take a look at "Do More Partitions in a Kafka Topic Always Mean Higher Throughput?".
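For reference, partitions are fixed when a topic is created (or later increased); a minimal sketch with placeholder broker and topic names:

```java
// Create a topic with several partitions so producers and a consumer group can
// work in parallel; as noted above, more partitions is not always faster.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3.
            admin.createTopics(Collections.singletonList(new NewTopic("demo-topic", 6, (short) 3)))
                 .all().get();
        }
    }
}
```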

Consistency

The vast majority of materials on Kafka performance optimization never mention anything about consistency. The general-purpose consistency protocols we know are the likes of Paxos, Raft and Gossip, while Kafka takes its own path with a PacificA-like approach. That choice was not made on a whim; this model improves overall efficiency. I will write up the details in a separate article later, something along the lines of "A Feasibility Analysis of Replacing PacificA with Raft in Kafka, with Pros and Cons".
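The practical face of this replication model is the in-sync replica (ISR) set: a producer can ask to wait for the whole ISR before a write counts as successful. A one-line fragment, extending the producer configuration sketched earlier:

```java
// Wait for acknowledgement from all in-sync replicas (the ISR), not just the leader.
// Combined with the broker/topic setting min.insync.replicas, this trades some
// latency for stronger guarantees under the ISR-based replication model.
props.put("acks", "all");
```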

Sequential disk writes

The operating system can apply deep optimizations to linear reads and writes, such as read-ahead (prefetching a fairly large disk block into memory ahead of time) and write-behind (merging many small logical writes into one large physical write). Kafka was designed to write messages by appending to files: new messages can only be appended to the tail of a log file, and messages already written can never be modified. This is a textbook sequential-write pattern, so even though Kafka uses disks as its storage medium, the throughput it can sustain should not be underestimated.
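A minimal sketch of the append-only pattern (the file name is a placeholder in the style of a Kafka segment file): records are only ever added at the tail, which the OS can serve as sequential I/O and optimize with read-ahead and write-behind.

```java
// Append-only writing: existing bytes are never rewritten; every record goes at the end.
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLogSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel log = FileChannel.open(Paths.get("00000000000000000000.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < 3; i++) {
                log.write(ByteBuffer.wrap(("record-" + i + "\n").getBytes(StandardCharsets.UTF_8)));
            }
        }
    }
}
```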

Page cache

Why is Kafka so fast? When faced with this question, many people think of the sequential disk writes described above. In fact, ahead of sequential writes there is another layer of optimization: the page cache (PageCache).

The page cache is the main disk cache implemented by the operating system, used to reduce disk I/O. Concretely, data on disk is cached in memory so that accesses to the disk become accesses to memory. To make up for the performance gap, modern operating systems use memory as disk cache ever more "aggressively"; they will happily use all available memory as disk cache, with almost no performance penalty when that memory is reclaimed, and all reads and writes to disk go through this unified cache.

When a process is about to read a file from disk, the operating system first checks whether the page containing the data is already in the page cache. If it is (a hit), the data is returned directly, avoiding I/O against the physical disk; if not, the operating system issues a read request to the disk, stores the page it reads into the page cache, and then returns the data to the process. Likewise, when a process needs to write data to disk, the operating system checks whether the corresponding page is in the page cache; if it is not, the page is first added to the page cache, and the data is then written into that page. A modified page becomes a dirty page, and the operating system writes the data in dirty pages back to disk at an appropriate time to keep the data consistent.
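The read path just described can be modeled in a few lines. The sketch below is a toy in-memory model for illustration only; it is not a real operating-system API.

```java
// A conceptual model of the page-cache read path: serve from memory on a hit,
// otherwise read from "disk" and populate the cache for subsequent reads.
import java.util.HashMap;
import java.util.Map;

public class PageCacheSketch {
    private final Map<Long, byte[]> pageCache = new HashMap<>();

    byte[] read(long pageNo) {
        byte[] page = pageCache.get(pageNo);
        if (page != null) {
            return page;                      // cache hit: no physical disk I/O
        }
        page = readFromDisk(pageNo);          // cache miss: issue a disk read
        pageCache.put(pageNo, page);          // keep the page for later reads
        return page;
    }

    private byte[] readFromDisk(long pageNo) {
        return new byte[4096];                // stand-in for a real 4KB disk page
    }
}
```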

For a process, the data it needs is cached inside the process itself, yet the same data may also be cached in the operating system's page cache, so one piece of data can end up cached twice. Moreover, unless Direct I/O is used, the page cache is hard to bypass. In addition, anyone who has used Java knows two facts: the memory overhead of objects is very high, often several times the size of the real data or more, so space utilization is poor; and Java garbage collection becomes slower and slower as the amount of data on the heap grows. Given these factors, using the file system and relying on the page cache is clearly better than maintaining an in-process cache or some other structure: at the very least we save one in-process copy of the cache, and we can also represent data as compact byte sequences instead of objects to save even more space. This way we can use 28GB to 30GB of memory on a 32GB machine without worrying about GC-induced performance problems. Furthermore, the page cache stays warm even if the Kafka service restarts, whereas an in-process cache would have to be rebuilt. It also greatly simplifies the code, because keeping the page cache and the files consistent is handled by the operating system, which is safer and more effective than doing it inside the process.

Kafka makes heavy use of the page cache, and this is one of the important factors behind its high throughput: messages are first written into the page cache, and the operating system is then responsible for the actual flush to disk.

Zero copy

A long time ago I wrote an article called "What is Zero Copy?"; if you are not familiar with zero copy, you can read that first. Kafka uses zero-copy technology to improve the efficiency of consumption. As mentioned earlier, messages are first written into the page cache; if a consumer's read can be served from the page cache, the message is read directly from there, which also saves the cost of copying the page from disk into the cache. Beyond that, you can look into the related concepts of read amplification and write amplification.
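The zero-copy primitive available to a Java program is FileChannel.transferTo, which on Linux maps to sendfile: data moves from the file (page cache) to the socket without passing through user-space buffers. A minimal sketch with a placeholder file name and destination:

```java
// Transfer a log file to a socket without copying it through user space.
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel log = FileChannel.open(Paths.get("00000000000000000000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0, remaining = log.size();
            while (remaining > 0) {
                // The kernel moves bytes from the page cache straight to the socket buffer.
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```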

Appendix

The disk I/O flow can be seen in the figure below:
[Figure: Linux disk I/O flow]
For a detailed analysis, refer to my earlier note summarizing Linux disk I/O.

Final words

The above lists some of the essentials of Kafka performance optimization. Everything in this article is explained in the book "In-depth Understanding of Kafka", just scattered across different chapters; the book follows its own arrangement and tries hard to go from easy to difficult. Devoting book-style space to a similar topic again would make the explanations redundant, so the book does not re-organize it this way; instead the content comes out on the public account, where quite a lot has already been put together along other dimensions. If you would like content along a new dimension, leave a message on the public account; if enough people ask for it, I will put together an article.


Please support my new books, "In-depth Understanding of Kafka: Core Design and Practice Principles" and "RabbitMQ Practical Guide", and feel free to follow the author's WeChat public account: 朱小厮的博客 (Zhu Xiaosi's Blog).
