In pursuit of ultimate performance: the 11 essentials Kafka has mastered

Many readers have messaged me privately asking what Kafka does to optimize performance. I have in fact already written answers to these questions, just never organized them systematically; I recently found a little time to sort them out, so the next time someone asks I can simply send over a link. This is also a frequent Kafka interview question, and an interviewer who asks it is not trying to make things hard for you. There are many articles on the Internet covering this topic, such as "Why is Kafka so fast?", which several major public accounts reposted a while ago. I have read these articles; they are well written, but each lists only part of the essentials rather than covering them all in detail. The list here should be more complete than anything you will find by searching, so if you read this article and later meet the question in an interview, I believe your answer will make the interviewer's eyes light up.

Batch processing

Traditional messaging middleware sends and consumes messages one at a time. For the producer, sending one message and waiting for the broker's ACK takes two RPCs; for the consumer, requesting a message, having the broker return it, and sending back an ACK to confirm consumption takes three RPCs (some middleware optimizes this by returning multiple messages per request). Kafka instead uses batching: the producer aggregates a number of messages and ships them to the broker in a single request, collapsing what would otherwise take many RPCs. Suppose you want to send 1000 messages of 1KB each. Traditional middleware needs about 2000 RPCs, while Kafka can pack those 1000 messages into one roughly 1MB batch and finish the job with 2 RPCs. This tactic was once considered a kind of "cheating", but now that the micro-batch concept is everywhere, other messaging middleware has begun to follow suit.
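The arithmetic above can be checked with a quick back-of-the-envelope sketch (illustrative only, not Kafka code):

```python
# Back-of-the-envelope: RPC count with and without batching.
MESSAGES = 1000
MSG_SIZE_KB = 1

# Traditional middleware: one send RPC + one ACK RPC per message.
traditional_rpcs = MESSAGES * 2

# Kafka-style batching: all 1000 messages packed into one ~1MB batch,
# so a single send RPC + a single ACK RPC covers the whole set.
batch_size_kb = MESSAGES * MSG_SIZE_KB  # ~1MB payload
batched_rpcs = 2

print(traditional_rpcs, batched_rpcs)  # 2000 vs 2
```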

Client optimization

Continuing with the batching theme: the new-version producer client abandons the old single-threaded design in favor of two threads, a main thread and a Sender thread. The main thread is responsible for placing messages into a client-side cache; the Sender thread is responsible for sending messages from that cache, which aggregates multiple messages into a batch. Some messaging middleware, by contrast, throws each message straight at the broker.
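As a sketch of how this client-side batching is tuned, here are the producer settings that govern the cache and the Sender thread. The parameter names are standard Kafka producer configs; the values are examples only, not recommendations:

```python
# Illustrative Kafka producer settings that govern client-side batching.
producer_config = {
    # Bytes of memory for the client-side cache the main thread writes into.
    "buffer.memory": 33_554_432,  # 32 MB
    # Target size in bytes of one batch the Sender thread ships per partition.
    "batch.size": 16_384,         # 16 KB
    # How long the Sender waits for a batch to fill before sending anyway.
    "linger.ms": 5,
}
print(producer_config["batch.size"])
```

Raising `linger.ms` trades a little latency for larger, more efficient batches.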

Log Format

Since version 0.8, Kafka's log format has gone through three versions: v0, v1, and v2. An earlier article, "Understanding the Evolution of Kafka's Message Format in One Read", describes the log formats in detail. Kafka's log format is well suited to handling messages in bulk; interested readers can look that article up for background.

[Figure: Kafka log format]
Log encoding

If you understand Kafka's concrete log format (see the figure above), you will know that a log record (also called a message) carries more than just its key and value: there are several additional fields, which originally each occupied a fixed-length slot (left side of the figure). The latest Kafka versions instead encode these fields with Varints and ZigZag, which effectively shrinks the space they occupy. Smaller log records mean more efficient network transfer and more efficient log storage, so overall performance improves.
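To make the encoding concrete, here is a minimal sketch of ZigZag plus varint encoding, the same scheme Protocol Buffers uses; this is an illustration of the technique, not Kafka's actual serializer:

```python
def zigzag_encode(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... (32-bit variant)."""
    return (n << 1) ^ (n >> 31)

def varint_encode(n: int) -> bytes:
    """Encode an unsigned int in 7-bit groups, high bit = continuation flag."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)
            return bytes(out)

# A small delta like -1 costs one byte instead of a fixed 4-byte field.
print(len(varint_encode(zigzag_encode(-1))))   # 1
print(len(varint_encode(zigzag_encode(300))))  # 2
```

Because offsets and timestamps inside a batch are stored as small deltas, most of these fields end up one or two bytes long.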

Message compression

Kafka supports several message compression codecs (gzip, snappy, lz4). Compression can greatly reduce network traffic and network I/O, improving overall performance. It is a classic trade of CPU time for space; if you have strict latency requirements, compression is not recommended.
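The space savings are easy to demonstrate with Python's standard-library gzip on a repetitive batch of records (a stand-in for Kafka's codec, not its actual wire format):

```python
import gzip

# A batch of repetitive JSON-ish log lines compresses very well.
batch = b'{"level":"INFO","msg":"order placed"}\n' * 1000
compressed = gzip.compress(batch)

print(len(batch), len(compressed))  # compressed is far smaller

# The trade-off: compress/decompress CPU time is added to the hot path,
# which is why latency-sensitive deployments may skip compression.
```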

Indexing for fast lookups

Each log segment file has two corresponding index files, used mainly to speed up message lookup; this is another performance lever. (The details are explained at length in chapter 5 of my book; I thought I had published that chapter on the public account, but after searching around I could not find it.)
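The lookup idea behind the offset index can be sketched as a binary search over sparse (offset, position) pairs. The data and layout below are invented for illustration; Kafka's real on-disk index format differs:

```python
import bisect

# Sparse offset index: (relative_offset, physical_position) pairs,
# sorted by offset. Kafka keeps something similar per log segment;
# this sketch shows only the lookup idea, not the on-disk format.
index = [(0, 0), (50, 4200), (100, 8500), (150, 12900)]

def locate(target_offset: int) -> int:
    """Return the byte position to start scanning from for target_offset."""
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i][1]

print(locate(120))  # scan starts at 8500, the entry for offset 100
```

Because the index is sparse, a lookup finds the nearest preceding entry and then scans forward a short distance in the log file.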

Partitioning

Many people overlook this factor. Increasing the number of partitions is in fact a very effective way to improve performance, with an effect far more visible than the log encoding and message compression discussed above. Partitioning is heavily used in other distributed systems as well, and the basics of why it improves performance will not be repeated here. Remember, though, that blindly adding partitions does not always bring a performance gain.

Consistency

The vast majority of material on Kafka performance optimization never mentions consistency. The well-known consistency protocols include Paxos, Raft, Gossip, and so on, but Kafka takes a different route, with a replication model similar to PacificA. That choice was not made on a whim: this model also contributes to Kafka's overall efficiency.

Sequential disk writes

Operating systems apply deep optimizations to linear reads and writes, such as read-ahead (prefetching a large disk block into memory ahead of time) and write-behind (merging many small logical writes into one large physical write). Kafka's design writes messages by appending to files: new messages are only ever added at the tail of the log file, and messages already written are never modified. This is a textbook sequential-write workload, so even with disks as its storage medium, Kafka sustains throughput that should not be underestimated.
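The append-only pattern looks like this in miniature (the segment file name mimics Kafka's zero-padded convention, but the example is purely illustrative):

```python
import os
import tempfile

# Append-only writes: new records only ever go at the tail of the
# segment file, keeping the disk access pattern strictly sequential.
path = os.path.join(tempfile.mkdtemp(), "00000000000000000000.log")

for record in (b"msg-1\n", b"msg-2\n", b"msg-3\n"):
    with open(path, "ab") as segment:  # "ab": append, never overwrite
        segment.write(record)

with open(path, "rb") as segment:
    print(segment.read())  # records appear in write order
```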

Page caching

Why is Kafka's performance so high? Faced with this question, many people think of the sequential disk writes described above. In fact, before sequential writes even happen there is another layer of optimization: the page cache (PageCache).

The page cache is the operating system's main disk cache, used to reduce disk I/O. Concretely, it caches disk data in memory, turning disk accesses into memory accesses. To compensate for the performance gap, modern operating systems use memory as disk cache ever more "aggressively", happily spending all available memory on it: there is almost no performance penalty when that memory is reclaimed, and all disk reads and writes go through this unified cache.

When a process wants to read a file on disk, the operating system first checks whether the page containing the data is already in the page cache. If it is (a hit), the data is returned directly, avoiding physical disk I/O; if not, the system issues a read request to the disk, stores the returned page in the page cache, and then hands the data to the process. Likewise, when a process writes data to disk, the operating system checks whether the corresponding page is in the page cache; if not, it first adds the page, then writes the data into it. A modified page becomes a dirty page, and the operating system flushes dirty pages to disk at an appropriate time to keep the data consistent.

A process usually also maintains an internal cache of the data it works with, yet that same data may be cached in the operating system's page cache, so the data can end up cached twice. Moreover, unless Direct I/O is used, the page cache is hard to disable. In addition, anyone who has used Java knows two facts: the memory overhead of objects is very high, often several times the size of the actual data, so space utilization is low; and Java garbage collection gets slower and slower as heap data grows. Given these factors, relying on the file system's page cache is clearly better than maintaining an in-process cache or similar structure: it eliminates the in-process cache's memory consumption, and by storing compact byte representations instead of objects it saves even more space. We can then use 28GB to 30GB of a 32GB machine without worrying about GC-induced performance problems. Furthermore, the page cache stays warm even if the Kafka service restarts, whereas an in-process cache would have to be rebuilt. This also greatly simplifies the code: keeping the cache consistent with the files on disk becomes the operating system's responsibility, and it does so more safely and efficiently than an in-process implementation could.

Kafka makes extensive use of the page cache, and this is one of the important factors behind its high throughput. Messages are first written to the page cache, and the operating system then takes care of the actual flush to disk.

Zero-copy

Kafka also uses zero-copy to make consumption more efficient. As described above, messages are first written to the page cache; if a consumer's read hits the page cache, the data can be served straight from there, saving the copy from disk into the page cache. Beyond this, the concepts of read amplification and write amplification are also worth learning about.
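On Linux, zero-copy is exposed as the `sendfile(2)` system call (Kafka's Java broker reaches it through `FileChannel.transferTo`). Here is a minimal sketch using `os.sendfile`; it assumes Linux, and copies file-to-file for demonstration, whereas Kafka's real destination is a consumer's socket:

```python
import os
import tempfile

# Zero-copy sketch: os.sendfile moves bytes kernel-side, avoiding the
# read()-into-user-space / write()-back-out round trip. On Linux the
# destination may be a regular file; in Kafka it is the consumer socket.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"x" * 4096)
src.close()

dst = tempfile.NamedTemporaryFile(delete=False)
dst.close()

with open(src.name, "rb") as fin, open(dst.name, "wb") as fout:
    sent = os.sendfile(fout.fileno(), fin.fileno(), 0, 4096)

print(sent)  # 4096 bytes moved without a user-space copy
```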

Appendix

The disk I/O flow is illustrated in the chart below:

[Figure: disk I/O flow]


Origin: www.cnblogs.com/CQqf2019/p/11124777.html