3. Kafka Series: Design Philosophy (Part 1)

I have been reading the official Kafka documentation recently, and Chapter 4, DESIGN, is well worth a read, so I'm sharing it here.

4.1 Motivation

We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.


It would have to have high-throughput to support high volume event streams such as real-time log aggregation.


It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.


It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.


We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.


Finally in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault-tolerance in the presence of machine failures.


Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.


4.2 Persistence

Don’t fear the filesystem!

Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that “disks are slow” which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.


The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec but the performance of random writes is only about 100k/sec—a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!

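To get a feel for this divergence on your own machine, here is a rough, self-contained sketch (the class name, file location, block size, and iteration count are all arbitrary choices for illustration) that writes the same volume of data once sequentially and once at random offsets:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

// Unscientific sketch contrasting sequential appends with random-offset writes
// on the same file. Absolute numbers vary wildly with hardware and OS caching;
// on a spinning disk the gap is dramatic, on an SSD much smaller.
public class SeqVsRandomWrites {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("io-demo", ".dat");
        final int blockSize = 4096;
        final int blocks = 25_000; // ~100 MB in total
        ByteBuffer block = ByteBuffer.allocate(blockSize);

        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
            long t0 = System.nanoTime();
            for (int i = 0; i < blocks; i++) {
                block.rewind();
                ch.write(block); // sequential: each write lands right after the previous one
            }
            ch.force(true); // flush so the pagecache does not hide the disk cost
            long seqMs = (System.nanoTime() - t0) / 1_000_000;

            Random rnd = new Random(42);
            long t1 = System.nanoTime();
            for (int i = 0; i < blocks; i++) {
                block.rewind();
                long pos = (long) rnd.nextInt(blocks) * blockSize;
                ch.write(block, pos); // random offsets: each write may require a seek
            }
            ch.force(true);
            long rndMs = (System.nanoTime() - t1) / 1_000_000;

            System.out.printf("sequential: %d ms, random: %d ms%n", seqMs, rndMs);
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```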

To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.


Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:


1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse); the sketch after this list gives a feel for this.

2. Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.
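As a rough illustration of the first point, the toy program below (my own sketch, not anything from the Kafka docs) stores the same 10 MB of payload once as a flat byte[] and once as boxed values behind a List, and compares approximate heap usage:

```java
import java.util.ArrayList;
import java.util.List;

// Crude demonstration of object overhead: the same 10 MB of payload stored as a
// flat byte[] versus as boxed values behind a List. The List's cost is dominated
// by its Object[] of references (4-8 bytes each) on top of the payload itself;
// for element types without a boxing cache each entry would also pay for a full
// object header. Memory is measured via Runtime deltas, so treat the output as
// a ballpark indication only.
public class ObjectOverheadDemo {
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // best-effort hint; good enough for a rough reading
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        final int n = 10_000_000; // 10 MB of payload

        long before = usedBytes();
        byte[] flat = new byte[n];
        long flatCost = usedBytes() - before;

        before = usedBytes();
        List<Byte> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) boxed.add((byte) (i & 0x7f));
        long boxedCost = usedBytes() - before;

        System.out.printf("byte[]: ~%d MB, List<Byte>: ~%d MB%n",
                flatCost >> 20, boxedCost >> 20);
        // Touch both structures so they stay reachable until after the measurement.
        System.out.println(flat.length + boxed.size());
    }
}
```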

As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure—we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads then read-ahead is effectively pre-populating this cache with useful data on each disk read.


This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel’s pagecache.

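In JVM terms the distinction looks roughly like the sketch below. This is a minimal illustration of the idea, assuming a FileChannel opened in append mode; it is not Kafka's actual code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of "write immediately, flush lazily": write() hands bytes to the kernel's
// pagecache and returns; only force() insists that they actually reach the disk.
public class LazyFlushLog {
    private final FileChannel log;

    LazyFlushLog(Path path) throws IOException {
        this.log = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    void append(byte[] record) throws IOException {
        log.write(ByteBuffer.wrap(record)); // into pagecache; the OS writes it back later
    }

    void flush() throws IOException {
        log.force(false); // explicit fsync, only when durability must be guaranteed now
    }

    public static void main(String[] args) throws IOException {
        LazyFlushLog demo = new LazyFlushLog(Path.of("demo.log"));
        demo.append("hello, pagecache\n".getBytes());
        demo.flush();
    }
}
```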

This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).


Constant Time Suffices

The persistent data structures used in messaging systems are often per-consumer queues with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: BTree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache, i.e. doubling your data makes things much worse than twice as slow.

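A back-of-envelope calculation makes the point concrete. The ~10 ms seek figure is the one quoted above; the four-level uncached tree depth is an illustrative assumption of mine:

```text
one disk seek                    ≈ 10 ms            → at most ~100 random I/Os per second per disk
BTree lookup, 4 uncached levels  ≈ 4 × 10 ms = 40 ms → ~25 lookups per second per disk
log append (no seek)             → bounded by sequential bandwidth, not by seek rate
```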

Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size—one server can now take full advantage of a number of cheap, low-rotational speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.

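A toy version of such a log might look like the sketch below (my own illustration; the class and method names are made up, not Kafka's). Appends go to the tail and return the starting byte position; reads are positional, so they block neither appends nor each other:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy append-only log: records are length-prefixed and appended at the tail;
// readers fetch by byte position. Appends never rewrite old data, so both
// operations are O(1) regardless of how large the log grows.
public class AppendOnlyLog {
    private final FileChannel ch;

    public AppendOnlyLog(Path path) throws IOException {
        this.ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Appends a record and returns the byte position where it starts. */
    public synchronized long append(byte[] record) throws IOException {
        long pos = ch.size(); // tail of the log
        ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
        buf.putInt(record.length).put(record).flip();
        ch.write(buf, pos); // O(1): one sequential write at the tail
        return pos;
    }

    /** Reads the record starting at the given position; does not block appends. */
    public byte[] read(long pos) throws IOException {
        ByteBuffer len = ByteBuffer.allocate(4);
        ch.read(len, pos);
        len.flip();
        byte[] record = new byte[len.getInt()];
        ch.read(ByteBuffer.wrap(record), pos + 4);
        return record;
    }
}
```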

Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.

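In Kafka this retention window is plain configuration. For example, in the broker's server.properties, a seven-day window matching the "say a week" above could look like this:

```properties
# Keep log segments for 7 days before they become eligible for deletion
log.retention.hours=168
# No size-based cap (-1 disables it); set a byte limit instead if disk is the constraint
log.retention.bytes=-1
```

Per-topic overrides such as retention.ms exist as well, so different feeds can keep different windows.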
