A Brief Look at the Messaging System Design of Kafka, the Real-Time Streaming Platform

What problems does Kafka solve?

The English excerpts below come from http://kafka.apache.org.
Kafka was built to sustain real-time throughput over large volumes of data.
We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.
It would have to have high-throughput to support high volume event streams such as real-time log aggregation.
It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.
It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.
Kafka introduced a new messaging mechanism (including partitions) and a new consumption model.
We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.
It also had to provide fault tolerance; compared with other messaging systems, Kafka behaves more like a database log.

Stream processing in Kafka

Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.

How does Kafka implement load balancing?

The producer sends data directly to the broker, with no intermediate routing tier. Every node can answer metadata requests: which servers are alive, and which broker is the leader for each partition of a topic.
Load can be balanced either by assigning messages to partitions at random or by a partitioning function the user specifies.

What is Kafka's unit of data, and how is it transmitted?

Kafka's unit of data is the message, an array of bytes. A message may carry a piece of metadata called the key, which is used to route it to a particular partition. For efficiency, messages are written in batches; all messages in a batch go to the same topic and partition.
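The key-to-partition routing can be sketched as a hash over the key bytes, modulo the partition count. This is a minimal illustration; the real Kafka producer's default partitioner hashes with murmur2, not MD5 as used here.

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition (toy version of the default partitioner)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messages with the same key always land in the same partition,
# preserving per-key ordering.
p1 = choose_partition(b"user-42", 6)
p2 = choose_partition(b"user-42", 6)
assert p1 == p2
```

Because all messages sharing a key map to the same partition, consumers see that key's messages in write order.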

How does the consumer interpret the producer's messages?

Choose a schema appropriate to the business scenario, such as JSON or XML.
Serialization framework: Avro provides a compact serialization format and keeps the message schema separate from the message payload.
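As a simple stand-in for a schema-based serializer, a JSON codec shows the producer/consumer contract: the value travels as bytes, and both sides agree on how to decode them. (With Avro, the schema itself would be stored separately and only referenced from the payload; this sketch inlines everything in JSON for brevity.)

```python
import json

def serialize(record: dict) -> bytes:
    """Encode a record as compact JSON bytes for the message value."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")

def deserialize(payload: bytes) -> dict:
    """Decode the message value back into a record on the consumer side."""
    return json.loads(payload.decode("utf-8"))

msg = serialize({"user": "alice", "action": "click"})
assert deserialize(msg) == {"user": "alice", "action": "click"}
```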

How is it guaranteed that a partition's messages are consumed by only one consumer?

A consumer group guarantees that each partition is consumed by only one member of the group.
A consumer group can contain multiple consumers; the members read from different partitions independently, without interfering with one another.
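The exclusive-ownership property can be sketched as a round-robin assignment of partitions to group members. This assumes static membership for simplicity; real Kafka rebalances assignments dynamically through the group coordinator.

```python
def assign(partitions: list, consumers: list) -> dict:
    """Round-robin partitions across consumer group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

result = assign([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"])
# Every partition is owned by exactly one consumer in the group.
owned = sorted(p for ps in result.values() for p in ps)
assert owned == [0, 1, 2, 3, 4, 5]
```

Note the corollary: a group cannot usefully contain more consumers than there are partitions, since the extras would own nothing.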

How is asynchronous send implemented? (Asynchronous send)

Kafka buffers data in memory and sends it in batches: it accumulates requests up to a fixed count, and waits no longer than a fixed time bound before sending the accumulated batch.
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
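The two flush triggers described above (a size cap and a linger bound) can be sketched with a toy batcher. The names `batch_size` and `linger_ms` echo the real producer configs `batch.size` and `linger.ms`, but this buffer is an illustration, not the actual client implementation.

```python
import time

class Batcher:
    """Accumulate messages; flush on a count cap or a linger-time bound."""

    def __init__(self, batch_size: int, linger_ms: float, send):
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000.0
        self.send = send            # callback that performs the real I/O
        self.buffer = []
        self.first_append = None

    def append(self, msg: bytes) -> None:
        if not self.buffer:
            self.first_append = time.monotonic()
        self.buffer.append(msg)
        # Flush when either bound is hit: enough messages, or waited too long.
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.first_append >= self.linger_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

sent = []
b = Batcher(batch_size=3, linger_ms=10, send=sent.append)
for m in (b"a", b"b", b"c"):
    b.append(m)
assert sent == [[b"a", b"b", b"c"]]  # three appends, one network send
```

Raising `linger_ms` trades a little latency for larger batches and fewer, bigger I/O operations, exactly the trade-off the quoted passage describes.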

Push vs. pull: how are messages consumed?

Messages can be consumed in two ways: the consumer pulls data from the broker, or the broker pushes data to the consumer. In Kafka, the producer pushes data to the broker and the consumer pulls data from the broker; this avoids overwhelming a consumer whose processing rate falls behind the producer's.
The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a “long poll” waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).
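The long-poll idea can be sketched with a condition variable: the fetch blocks up to a timeout waiting for data, instead of spinning in a busy loop. This is a single-process toy, not the broker's actual fetch protocol.

```python
import collections
import threading
import time

class Broker:
    """Toy broker whose fetch long-polls instead of busy-waiting."""

    def __init__(self):
        self.queue = collections.deque()
        self.cond = threading.Condition()

    def produce(self, msg: bytes) -> None:
        with self.cond:
            self.queue.append(msg)
            self.cond.notify_all()      # wake any consumer blocked in fetch

    def fetch(self, timeout_s: float):
        deadline = time.monotonic() + timeout_s
        with self.cond:
            while not self.queue:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None         # timed out: no data arrived
                self.cond.wait(remaining)
            return self.queue.popleft()

broker = Broker()
broker.produce(b"event-1")
assert broker.fetch(timeout_s=0.1) == b"event-1"
assert broker.fetch(timeout_s=0.05) is None  # blocks briefly, then times out
```

The real protocol adds the second knob the passage mentions: a minimum byte count, so the broker can also hold the response until a usefully large transfer is ready.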

How is it determined which messages have been consumed?

Many messaging systems keep metadata recording which messages have been consumed. If the broker marks a message as consumed as soon as it is sent out and the consumer then fails to process it, the message is lost. Many systems therefore add an acknowledgement mechanism: messages are only marked as sent, not consumed, when they go out; the broker waits for a specific acknowledgement from the consumer before recording the message as consumed. This creates new problems: if the consumer processes a message successfully but the acknowledgement is lost, the message is consumed twice; and the broker must now track multiple states per message.
Kafka partitions its data, and each partition is consumed by only one consumer within a consumer group. Per partition, the consumption position is just a single integer, the offset. The consumer can also seek back to an older offset to re-consume data.
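Offset-based consumption can be sketched with an append-only log per partition: the consumer advances an integer after processing, and rewinds it to replay. This is a conceptual sketch, not the broker's log implementation.

```python
class PartitionLog:
    """Toy append-only log; consumption state is just an integer offset."""

    def __init__(self):
        self.log = []

    def append(self, msg: bytes) -> None:
        self.log.append(msg)

    def read(self, offset: int):
        return self.log[offset] if offset < len(self.log) else None

log = PartitionLog()
for m in (b"m0", b"m1", b"m2"):
    log.append(m)

# Normal consumption: advance the offset after each processed message.
offset = 0
while log.read(offset) is not None:
    offset += 1
assert offset == 3

# Re-consumption: simply seek back to an older position.
offset = 1
assert log.read(offset) == b"m1"
```

Since the broker never mutates per-message state, tracking one integer per (group, partition) pair is dramatically cheaper than the sent/consumed bookkeeping described above.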


Reposted from blog.csdn.net/weixin_42628594/article/details/83619976