3. Kafka Series: Design Ideas (1)

I have recently been reading the official Kafka documentation, and Chapter 4, Design, is excellent, so I would like to share it here.

4.1 Motivation

We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.

It would have to have high-throughput to support high volume event streams such as real-time log aggregation.

It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.

It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.

We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.

Finally in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault-tolerance in the presence of machine failures.

Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.

4.2 Persistence

Don't fear the filesystem!

Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that “disks are slow” which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.

The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec but the performance of random writes is only about 100k/sec—a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!

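
The linear-versus-random gap described above can be illustrated with a toy benchmark. This is my own sketch, not from the Kafka docs; absolute numbers are machine-dependent, and on a file this small the page cache can absorb both patterns, so the gap only becomes dramatic at scales beyond RAM.

```python
import os
import random
import tempfile
import time

CHUNK = 4096   # one 4 KiB block per write
COUNT = 2048   # 8 MiB total

def sequential_write(path):
    """Append COUNT blocks back to back; the write position only advances."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(COUNT):
            f.write(b"x" * CHUNK)
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start

def random_write(path):
    """Write the same COUNT blocks, but at shuffled offsets."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.truncate(CHUNK * COUNT)      # preallocate so every seek target is valid
        offsets = list(range(COUNT))
        random.shuffle(offsets)
        for i in offsets:
            f.seek(i * CHUNK)          # jump to a random block
            f.write(b"x" * CHUNK)
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    seq = sequential_write(os.path.join(d, "seq.dat"))
    rnd = random_write(os.path.join(d, "rnd.dat"))
    print(f"sequential: {seq:.3f}s  random: {rnd:.3f}s")
```

On spinning disks at real scale the sequential variant wins by orders of magnitude, which is exactly the read-ahead and write-behind territory the paragraph describes.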
To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:

1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse).

2. Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.

As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure—we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads then read-ahead is effectively pre-populating this cache with useful data on each disk read.

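
The JVM overhead point above has a direct analog in any managed runtime. Here is a quick CPython illustration of the "compact byte structure vs. individual objects" trade-off (my own sketch; exact numbers vary by interpreter version):

```python
import sys
from array import array

# Cost of n values stored as individual boxed objects vs. one contiguous
# buffer.  Per-object headers and list pointers multiply the footprint
# of the raw data, mirroring the JVM point above.

n = 100_000
boxed = [float(i) for i in range(n)]   # n separate float objects
packed = array("d", range(n))          # one flat 8-bytes-per-value buffer

boxed_cost = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
packed_cost = sys.getsizeof(packed)

print(f"boxed: {boxed_cost} bytes, packed: {packed_cost} bytes")
```

On CPython the boxed version typically costs several times the packed one, which is the same multiple the paragraph attributes to JVM object overhead.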
This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel’s pagecache.

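
A minimal sketch of this inversion in Python (illustrative only, not Kafka's actual I/O path): `write()` returns once the bytes are in the kernel's page cache, and pushing them to the physical disk is a separate, optional `fsync()` step.

```python
import os
import tempfile

def append_record(path, payload: bytes, durable: bool = False) -> int:
    """Append payload to the log file; return the log end offset."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()                 # user-space buffer -> kernel page cache
        if durable:
            os.fsync(f.fileno())  # page cache -> physical disk
        return f.tell()

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "segment.log")
    append_record(log, b"hello")                       # sits in the page cache
    end = append_record(log, b"world", durable=True)   # forced to stable storage
    print(end)  # 10
```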
This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).

Constant Time Suffices

The persistent data structures used in messaging systems are often a per-consumer queue with an associated BTree or other general-purpose random access data structure to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: BTree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time, so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache, i.e. doubling your data makes things much worse than twice as slow.

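
To make that seek arithmetic concrete, here is a back-of-the-envelope sketch with assumed numbers (10 ms per seek, a fanout-128 tree, every level uncached):

```python
import math

SEEK_MS = 10.0   # assumed cost of one disk seek

def btree_lookup_ms(n_messages, fanout=128):
    """Worst-case lookup cost: one seek per tree level."""
    height = math.ceil(math.log(n_messages, fanout))
    return height * SEEK_MS

print(btree_lookup_ms(1_000_000_000))         # 50.0 -> ~50 ms per lookup
print(1000 / btree_lookup_ms(1_000_000_000))  # 20.0 -> ~20 lookups/sec per disk
```

Since each disk serves one seek at a time, a fixed cache plus growing data means more uncached levels per operation, which is the superlinear degradation the paragraph describes.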
Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size—one server can now take full advantage of a number of cheap, low-rotational speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.

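
A minimal sketch of such a log-structured queue (the 4-byte length-prefix framing is an illustrative choice of mine, not Kafka's actual on-disk format): appends and offset-based reads are both O(1) in the size of the log, and readers never block the writer.

```python
import os
import struct
import tempfile

class AppendOnlyLog:
    """Persistent queue as simple appends to, and reads from, a file."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()   # ensure the segment file exists

    def append(self, payload: bytes) -> int:
        """Append one length-prefixed record; return its byte offset."""
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(struct.pack(">I", len(payload)) + payload)
            return offset

    def read(self, offset: int) -> tuple[bytes, int]:
        """Read the record at offset; return (payload, next_offset)."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            (size,) = struct.unpack(">I", f.read(4))
            return f.read(size), offset + 4 + size

with tempfile.TemporaryDirectory() as d:
    log = AppendOnlyLog(os.path.join(d, "00000000.log"))
    o1 = log.append(b"first")
    o2 = log.append(b"second")
    msg, nxt = log.read(o1)
    print(msg, nxt == o2)   # b'first' True
```

Consumers simply remember the next offset to read, so the cost of a read or append never depends on how much data the log already holds.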
Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.

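
Retention of this kind is exposed as broker configuration; a sketch of the relevant entries in Kafka's `server.properties` (the one-week value matches the example in the text; check the broker documentation of your Kafka version for defaults):

```properties
# Keep messages for a week, whether or not they have been consumed
log.retention.hours=168

# Optionally also cap retention by size per partition (-1 = unbounded)
log.retention.bytes=-1
```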

Origin blog.csdn.net/SJshenjian/article/details/129938222