The Secret of Kafka's High-Performance Throughput

Source:

https://www.cnblogs.com/Dhouse/p/8513004.html

 

Kafka, one of the most popular open-source messaging systems, is widely used for data buffering, asynchronous communication, log collection, decoupling, and so on. Compared with similar messaging systems such as RocketMQ, Kafka preserves most of the common feature set while also delivering superior read and write performance.

This article briefly analyzes Kafka's performance. First, a few words about Kafka's architecture and the terminology involved:

1. Topic: a logical concept used to categorize Messages; a single Topic can be distributed across multiple Brokers.

2. Partition: the basis of all of Kafka's horizontal scaling and parallelism; every Topic is split into at least one Partition.

3. Offset: the sequence number of a Message; ordering holds within a Partition but not across Partitions.

4. Consumer: fetches and consumes Messages from Brokers.

5. Producer: produces and sends Messages to Brokers.

6. Replication: Kafka supports redundant backup of Messages at the Partition level; each Partition is configured with at least one Replication (with only one Replication, the Partition has just itself).

7. Leader: each Partition's Replications elect a unique Leader, which handles all read and write requests. The other Replicas pull updated data from the Leader and synchronize it locally, a process similar to the familiar Binlog replication in MySQL.

8. Broker: a Kafka server; it accepts requests from Producers and Consumers and persists Messages to local disk. Each Cluster elects one Broker to serve as the Controller, which is responsible for Partition Leader election and for coordinating Partition migration.

9. ISR (In-Sync Replica): the subset of Replicas that are currently alive and able to "catch up" with the Leader. Since reads and writes land on the Leader first and the other Replicas pull data from it for synchronization, replication lags the Leader somewhat (measured in both elapsed time and message count); any Replica exceeding either threshold is evicted from the ISR. Each Partition maintains its own independent ISR.

We are likely to encounter almost all of the terms above when using Kafka, and they are all core concepts or components; from the design alone you can feel that Kafka keeps things simple enough. This article focuses on Kafka's excellent throughput performance, explaining one by one the "black technologies" it designs and uses to achieve it.

 

Broker


Unlike in-memory stores and message queues such as Redis and MemcacheQ, Kafka is designed to write all Messages to large-capacity hard disks with lower write speed, in exchange for greater storage capacity. In fact, using hard disks does not cost Kafka much performance: it "plays by the rules" while taking a "shortcut."

First, Kafka is "well-behaved" in that it performs only sequential I/O on disk; given the particular read/write patterns of a messaging system, this causes no problems. On disk I/O performance, here is a set of reference test data given officially by Kafka (RAID-5, 7200 rpm):

Sequence I/O: 600MB/s 
Random I/O: 100KB/s

So by restricting itself to sequential I/O, Kafka avoids the performance hit that slow disk access speed might otherwise cause.

 

Principle (diagram omitted: sequential vs. random disk access):

 

Next, let's talk about how Kafka takes its "shortcut."

First, Kafka relies heavily on the PageCache functionality provided by the underlying operating system. When an upper-layer write occurs, the OS merely writes the data into PageCache and marks the Page as Dirty. When a read occurs, PageCache is searched first, and disk I/O is scheduled only on a page fault before the required data is finally returned. In effect, PageCache uses as much free memory as possible as a disk cache; and because PageCache is cheap to reclaim when other processes request memory, modern operating systems all support it.

Using PageCache also avoids caching data inside the JVM. The JVM's GC gives us powerful capabilities, but it also introduces some problems that do not suit Kafka's design:
• If the cache is managed in the Heap, GC threads frequently scan the Heap space, causing unnecessary overhead. If the Heap is too large, a single Full GC becomes a serious challenge to the system's availability.
• Every object inside the JVM carries an unavoidable Object Overhead (which cannot be dismissed at the scale of tens of millions of objects), so the effective utilization of memory drops.
• Every In-Process Cache holds a copy of the same data that already sits in the OS PageCache. Caching only in PageCache therefore at least doubles the available cache space.
• If Kafka restarts, all In-Process Caches are lost, while the OS-managed PageCache remains usable.

PageCache is only the first step. To optimize performance further, Kafka also employs the Sendfile technique. Before explaining Sendfile, let's first review the conventional network I/O flow, which generally breaks down into the following four steps.

1. The OS reads data from the hard disk into PageCache in the kernel region.

2. The user process copies the data from the kernel region into the User region.

3. The user process then writes the data to the Socket, and the data flows into the Socket Buffer in the kernel region.

4. The OS then copies the data from the Socket Buffer to the NIC Buffer, completing one send.

The whole process involves two Context Switches and four System Calls. The same data is copied back and forth between the kernel Buffer and the user Buffer, which is inefficient. Steps 2 and 3 are unnecessary; the copy can be completed directly within the kernel region. This is exactly the problem Sendfile solves: after the Sendfile optimization, data moves from PageCache to the Socket Buffer inside the kernel, and the detour through user space disappears.


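In Java (the language Kafka is written in), the kernel's sendfile path is exposed as FileChannel.transferTo, which Kafka uses to push log data toward the socket. Below is a minimal self-contained sketch of the idea; it transfers file-to-file so it runs anywhere, but from the caller's side the socket case looks the same:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("kafka-segment", ".log");
        Files.write(src, "hello, page cache".getBytes());
        Path dst = Files.createTempFile("sink", ".bin");

        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            // On Linux, transferTo() can delegate to sendfile(2): bytes move
            // kernel-to-kernel without ever surfacing in a user-space buffer.
            long transferred = in.transferTo(0, in.size(), out);
            System.out.println(transferred); // number of bytes moved
        }
    }
}
```

The return value is the byte count actually transferred, so callers loop until the requested range is done; the key point is that the JVM heap never holds the payload.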

 

From the description above it is easy to see that Kafka's design strives to complete data exchange in memory, whether between the messaging system and the outside world or internally between Kafka and the underlying OS. If the rates of production and consumption between Producer and Consumer are well matched, data exchange can even achieve zero disk I/O. That is why I say Kafka's use of "hard disks" brings little performance loss. Below are some metrics I picked from a production environment.


(20 Brokers, 75 Partitions per Broker, 110k msg/s) 

 

At this point the cluster is write-only, with no reads. The roughly 10 MB/s of Send traffic is generated by Replication between Partitions. Comparing the write and recv rates shows that disk writes use an asynchronous, batched approach, and the underlying OS may additionally optimize the ordering of disk writes. Incoming Read Requests then fall into two cases; the first is data exchange purely in memory.


 

Send traffic rose from an average of 10 MB/s to an average of 60 MB/s, while disk Read stayed below 50 KB/s. PageCache's reduction of disk I/O is very obvious here.

The next case is reading data that was received some time ago: old data that has already been flushed from memory to disk and swapped out.

 
 

The other metrics stay the same, while disk Read has already shot up to 40+ MB/s. At this point all of the data is being fetched from the hard disk (the OS layer can optimize sequential disk reads by prefilling PageCache). There are still no performance problems.

Tips

1. Kafka officially discourages forcing data to disk via the Broker-side log.flush.interval.messages and log.flush.interval.ms settings. The position is that data reliability should be guaranteed through Replicas, and forcing a Flush to disk hurts overall performance.

2. Performance can be tuned by adjusting /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio.

3. When the dirty-page ratio exceeds the first threshold, pdflush starts flushing Dirty PageCache.

4. When the dirty-page ratio exceeds the second threshold, all write operations block until the Flush completes.

5. Depending on business requirements, it can be appropriate to lower dirty_background_ratio and raise dirty_ratio.

 

Partition


Partition is the foundation on which Kafka scales horizontally, serves highly concurrent workloads, and implements Replication.

On scalability: first, Kafka allows Partitions to be moved freely between Brokers within the cluster, which balances out any data skew. Second, Partitions support custom partitioning algorithms, for example routing all Messages with the same Key to the same Partition. Meanwhile, the Leader can also migrate within the In-Sync Replicas. Because all read and write requests for any given Partition are handled only by its Leader, Kafka tries to spread Leaders evenly across the cluster's nodes to avoid over-concentrating network traffic.
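Key-based routing can be sketched as follows. This is a simplified illustration, not Kafka's built-in partitioner (which hashes the serialized key bytes); the only property that matters is that the function is deterministic in the key:

```java
public class KeyPartitioner {
    // Route a message key to a partition index; the same key always
    // lands on the same partition, preserving per-key ordering.
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // All messages for one user are routed to one partition.
        System.out.println(partitionFor("user-42", partitions)
                == partitionFor("user-42", partitions));
    }
}
```

Note that this determinism is exactly what makes adding Partitions later risky: changing numPartitions changes where existing keys route.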

On concurrency: at any moment, a Partition can be consumed by only one Consumer within a Consumer Group (conversely, one Consumer may consume multiple Partitions at the same time). Kafka's very lean Offset mechanism minimizes the interaction between Broker and Consumer, so unlike comparable message queues, Kafka's performance does not degrade proportionally as the number of downstream Consumers grows. Moreover, if multiple Consumers happen to consume data that is close together in time, the PageCache hit rate can be very high, so Kafka supports highly concurrent reads very efficiently; in practice it can essentially saturate a single machine's NIC.

However, more Partitions is not always better: the more Partitions there are, the more land on each Broker on average. When a Broker goes down (Network Failure, Full GC), the Controller must re-elect a Leader for every Partition on that Broker. Assuming each election takes 10 ms, a Broker holding 500 Partitions spends roughly 5 s on elections, during which any read or write on those Partitions raises a LeaderNotAvailableException.

Going further, if the downed Broker happens to be the cluster's Controller, the first step is to appoint another Broker as Controller. The new Controller must fetch the Meta information for every Partition from ZooKeeper, at roughly 3-5 ms per entry, so with 10,000 Partitions this alone takes 30-50 s. And don't forget that this is only the time to bring up a new Controller; on top of it you must still add the Leader-election time described above.

Furthermore, on the Broker side, a Buffer mechanism is used for both Producers and Consumers. The Buffer size is configured uniformly, and the number of Buffers equals the number of Partitions. If there are too many Partitions, these Producer and Consumer Buffers consume too much memory.

Tips

1. Pre-allocate the number of Partitions as early as possible. Although Partitions can be added dynamically later, doing so risks breaking the mapping between Message Keys and Partitions.

2. Do not configure too many Replicas, and where conditions allow, place the Partitions within a Replica set on different Racks.

3. Make every effort to ensure a Clean Shutdown whenever a Broker is stopped. Otherwise, the problem is not merely a long recovery time; data corruption or other bizarre issues may also appear.

 

Producer


Kafka's development team says that in version 0.8 the entire Producer was rewritten in Java, reportedly with a large performance gain. I have not tried and compared it myself, so I won't present numbers here; the further reading mentioned at the end of this article includes a comparison setup I consider quite good, for anyone interested to try.

In fact, most messaging systems optimize the Producer side in fairly uniform ways: essentially consolidating many small writes into big ones and turning synchronous sends into asynchronous ones.

Kafka supports MessageSet by default: multiple Messages are automatically packed into a group and sent out together, which amortizes and lowers the RTT of each communication. And while assembling a MessageSet, the data can also be re-sorted, turning bursty random writes into smoother linear writes.
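The MessageSet idea can be sketched as a small batching layer. The names here are hypothetical, and the real Producer also flushes on a time threshold and sorts records by Partition, but the RTT-amortization mechanism is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of producer-side batching (the MessageSet idea):
// accumulate records and hand them to the transport in one call.
public class BatchingSender {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> transport; // one round trip per batch

    BatchingSender(int batchSize, Consumer<List<String>> transport) {
        this.batchSize = batchSize;
        this.transport = transport;
    }

    void send(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        transport.accept(new ArrayList<>(buffer)); // amortizes RTT over the batch
        buffer.clear();
    }

    public static void main(String[] args) {
        int[] trips = {0};
        BatchingSender s = new BatchingSender(100, batch -> trips[0]++);
        for (int i = 0; i < 1000; i++) s.send("msg-" + i);
        s.flush();
        // 1000 records cost only 10 network round trips.
        System.out.println(trips[0]);
    }
}
```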

In addition, one point deserves emphasis: the Producer supports End-to-End compression. Data is compressed locally before going onto the network; the Broker generally does not decompress it (unless Deep-Iteration is specified), and it is decompressed on the client side only after the message is consumed.

Of course, users can also choose to compress and decompress at the application layer themselves (after all, Kafka currently supports a limited set of compression algorithms, only GZIP and Snappy), but doing so actually backfires and reduces efficiency. Kafka's End-to-End compression works best in concert with MessageSet, and the do-it-yourself approach severs that connection. The reason is simple, resting on a basic principle of compression: "the more repeated data, the higher the compression ratio." Regardless of the content or number of the message bodies, a larger input usually achieves a better compression ratio.
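The "bigger input, better ratio" principle is easy to demonstrate with GZIP from the JDK. The message below is a made-up sample; compressing 100 such messages as one batch beats compressing each one individually, because repetition across messages only becomes visible to the codec when they share one stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompressionDemo {
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) { gz.write(data); }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        String msg = "{\"event\":\"click\",\"page\":\"/home\",\"user\":1234}";
        int n = 100;

        int perMessage = 0;                 // compress each message on its own
        StringBuilder batch = new StringBuilder();
        for (int i = 0; i < n; i++) {
            perMessage += gzipSize(msg.getBytes());
            batch.append(msg);              // accumulate a MessageSet-style batch
        }
        int batched = gzipSize(batch.toString().getBytes());

        // Cross-message repetition makes the batched stream far smaller.
        System.out.println(batched < perMessage);
    }
}
```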

However, Kafka's adoption of MessageSet also compromises availability to some degree. Each time the Producer sends data, it considers the data sent as soon as send() returns, yet in most cases the message is still sitting in the in-memory MessageSet and has not reached the network. If the Producer crashes at that point, the data is lost.

To solve this problem, Kafka's 0.8 design borrowed the ack mechanism from networking. If performance requirements are high and a certain amount of Message loss is acceptable, you can set request.required.acks=0 to turn acks off and send at full speed. If sent messages must be confirmed, set request.required.acks to 1 or -1. So what is the difference between 1 and -1? This brings us back to the earlier discussion of Replica counts. A value of 1 means the message only needs to be received and acknowledged by the Leader; the other Replicas can pull it asynchronously without immediate confirmation, preserving reliability without dragging efficiency down too far. A value of -1 means the message must be committed to all Replicas in the Partition's ISR before the ack is returned; sending becomes safer, but the latency of the whole process grows in proportion to the number of Replicas, so optimize according to your specific requirements.
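As a configuration sketch, the 0.8-era Producer settings discussed above look like this. The property names follow the old 0.8 producer configuration; treat the broker list as a placeholder:

```java
import java.util.Properties;

public class AckConfig {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("metadata.broker.list", "broker1:9092,broker2:9092");
        //  0 -> fire-and-forget: full speed, loss possible if producer dies
        //  1 -> leader must confirm before the ack returns
        // -1 -> every replica in the ISR must commit before the ack returns
        p.setProperty("request.required.acks", "1");
        System.out.println(p.getProperty("request.required.acks"));
    }
}
```

Later client versions renamed this setting to simply "acks" with the same three values.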

Tips

1. Do not configure too many Producer threads, especially when used for Mirroring or Migration; more threads aggravate out-of-order Messages across the target cluster's Partitions (if your use case is sensitive to message order).

2. In version 0.8, request.required.acks defaults to 0 (same as 0.7).

 

Consumer


The Consumer-side design is, on the whole, fairly conventional.

• Through Consumer Groups, both the producer-consumer pattern and queue-style access are supported.
• The Consumer API comes in two flavors, High-level and Low-level. The former depends heavily on ZooKeeper, so its performance is worse and it is less flexible, but it is extremely low-maintenance. The latter does not rely on the ZooKeeper service and does better on both flexibility and performance, but all exceptions (Leader migration, Offset out of range, Broker crashes, and so on) and all Offset maintenance must be handled by your own code.
• Keep an eye on the forthcoming 0.9 release: the developers have rewritten the Consumer in Java, merged the two APIs into one, and removed the ZooKeeper dependency, reportedly with a substantial performance boost.
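The one-Consumer-per-Partition rule within a Group can be sketched with a simple round-robin assignment. This is a hypothetical illustration of the invariant, not Kafka's actual rebalancing algorithm:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAssignment {
    // Each partition is owned by exactly one consumer in the group,
    // while one consumer may own several partitions.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String c : consumers) owned.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            // Round-robin: partition p goes to exactly one consumer.
            owned.get(consumers.get(p % consumers.size())).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> a = assign(List.of("c0", "c1"), 5);
        System.out.println(a.get("c0"));
    }
}
```

Because each Partition has a single owner, the Consumer only needs to track one Offset per Partition, which is what keeps Broker-Consumer interaction so light.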

Tips

I strongly recommend the Low-level API. It is somewhat more tedious, but currently it is the only API that lets you apply custom handling to Error data, especially Broker exceptions or Corrupted Data caused by an Unclean Shutdown. Otherwise there is no way to skip the bad data; you can only wait for the "bad messages" to be rotated off the Broker, during which time the Replica remains unavailable.

 


 


Origin: blog.csdn.net/mxw2552261/article/details/90631662