I dug into Kafka's underlying principles and found three truths behind its universe-wide popularity!

The market is full of middleware these days, and choosing among the options for a specific job inevitably causes some agonizing. I offer only a superficial view here. Every piece of middleware has its own distinctive features and optimizations built into its design, and these are exactly what we should care about: knowing them lets us make the best use of the tool and play to its strengths, and knowing its weaknesses helps us avoid the pitfalls. Middleware is like building blocks. What we can do is pick the blocks with the right shape and build the house we need.

Kafka, it must be said, is one such building block. It serves as messaging middleware for peak shaving and decoupling, and it can also do real-time stream processing. It handles both data and business workloads, equally at home in the living room and in the kitchen. So this first article in the Kafka series starts from its application scenarios and discusses which techniques and principles support its technical characteristics.

Kafka's core idea in a nutshell

All messages are stored as an "ordered log": producers publish messages to the end of the log (think of it as appending), and consumers read sequentially from a logical position.
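The core idea above can be sketched in a few lines of Python. This is only a toy in-memory model, not Kafka's storage code: the `Log` class and its `append` and `read` methods are hypothetical names introduced for illustration.

```python
class Log:
    """A toy in-memory append-only log (a hypothetical stand-in for one partition)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Producer side: add the message at the end and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Consumer side: read sequentially starting at a logical offset."""
        return self._records[offset:offset + max_records]


log = Log()
for text in ("a", "b", "c"):
    log.append(text)

# Two independent consumers read the same log from their own positions.
fast_consumer = log.read(0)      # reads everything written so far
slow_consumer = log.read(0, 1)   # reads only the first record
```

Because records are only ever appended, an offset keeps pointing at the same record, which is what lets each consumer keep its own position.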

[Scenario 1] Messaging middleware

When choosing messaging middleware, our main concerns are performance, message reliability, and ordering.

1. Performance

Kafka's high performance comes mainly from the way its implementation exploits optimization techniques of the underlying operating system. Even programmers who mostly write business code need to understand this low-level knowledge.


[Optimization 1] Zero-copy

This optimizes how Kafka serves data to consumers. Let's compare the traditional approach with the zero-copy approach:

  • The traditional way:
  [Figure: data path of the traditional copy]
  • Zero-copy mode:
  • The ultimate goal: how can the data be sent without passing through user space?
  • As the figure shows, zero-copy omits the step of copying the data into the user buffer; using the file descriptor, the data is copied directly from kernel space to the network interface.
  [Figure: data path of zero-copy]
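The difference between the two paths can be tried out with Python's standard library. This is a minimal sketch, not Kafka's implementation (Kafka runs on the JVM and uses `FileChannel.transferTo` for this). The function names and the local socket pair standing in for a broker-to-consumer connection are assumptions made for the example. `socket.sendfile()` uses `os.sendfile()` where the platform supports it and falls back to plain `send()` otherwise.

```python
import os
import socket
import tempfile


def send_with_user_copy(sock, path, chunk_size=64 * 1024):
    """Traditional path: read() copies into a user-space buffer, send() copies it back."""
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sock.sendall(chunk)
            sent += len(chunk)
    return sent


def send_zero_copy(sock, path):
    """Zero-copy path: the kernel moves file data to the socket without a user buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Demo: write a small "log file", then ship it over a local socket pair.
payload = b"kafka-message " * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    log_path = tmp.name

sender, receiver = socket.socketpair()
sent_bytes = send_zero_copy(sender, log_path)
sender.close()

received = b""
while data := receiver.recv(4096):
    received += data
receiver.close()
os.remove(log_path)
```

Both functions deliver the same bytes; the zero-copy version simply never holds the payload in a Python buffer.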

[Optimization 2] Sequential disk writes

  • Messages are written by appending to the file, and messages that have already been written are never modified, so the writes reach the disk sequentially. The common belief that disk reads and writes are slow really refers to random disk access; sequential disk reads and writes come close to the performance of random memory access, as the following performance comparison chart shows:
  [Figure: performance comparison of sequential and random access on disk and in memory]
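An append-only write pattern looks roughly like the sketch below. The length-prefixed record layout and the file name `segment.log` are assumptions chosen for the example, not Kafka's on-disk format.

```python
import os
import struct
import tempfile


def append_record(f, payload):
    """Append one length-prefixed record at the end; return its starting byte position."""
    position = f.seek(0, os.SEEK_END)
    f.write(struct.pack(">I", len(payload)) + payload)
    return position


def read_record(f, position):
    """Read the record that starts at the given byte position."""
    f.seek(position)
    (length,) = struct.unpack(">I", f.read(4))
    return f.read(length)


segment_path = os.path.join(tempfile.mkdtemp(), "segment.log")  # hypothetical name
with open(segment_path, "a+b") as segment:  # append mode: every write goes to the end
    positions = [append_record(segment, m) for m in (b"first", b"second", b"third")]
    segment.flush()
    second = read_record(segment, positions[1])
```

Existing bytes are never rewritten, so the disk only ever sees writes at the tail of the file.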

[Optimization 3] Memory mapping

  • Summary: a region of user-space memory is mapped to kernel space, so that a modification made to this region from either user space or kernel space is directly visible on the other side.
  • Advantage: when a large amount of data has to move between user mode and kernel mode, this is very efficient.
  • Why it improves efficiency: in short, the traditional approach uses the read() system call, which copies the data twice; memory mapping uses the mmap() system call, which copies the data only once.
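Python's `mmap` module wraps the same mmap() system call, so the idea can be tried directly. A minimal sketch, assuming a small scratch file (the name `example.index` is hypothetical):

```python
import mmap
import os
import tempfile

index_path = os.path.join(tempfile.mkdtemp(), "example.index")  # hypothetical name
with open(index_path, "wb") as f:
    f.write(b"\x00" * 16)  # a mapping needs a non-empty file

with open(index_path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:
        # Writing to the mapping is an ordinary memory write; no write() call is made.
        mapped[0:5] = b"kafka"
        mapped.flush()  # ask the OS to write the dirty pages back to the file

with open(index_path, "rb") as f:
    on_disk = f.read()
```

The bytes written through the mapping show up in the file without the program ever calling read() or write() on that region.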

[Optimization 4] Batching and compression

  • Producer: sends message sets in batches
  • Consumer: actively pulls data, likewise in batches
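Why batching and compression go well together can be seen with a small experiment. The JSON-plus-gzip framing below is an assumption for illustration only; Kafka uses its own binary batch format and supports several codecs.

```python
import gzip
import json


def encode_batch(messages):
    """Producer side: serialize a whole batch and compress it in one pass."""
    return gzip.compress(json.dumps(messages).encode("utf-8"))


def decode_batch(blob):
    """Consumer side: decompress a fetched batch back into individual messages."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


messages = [{"user": f"user-{i % 5}", "event": "page_view"} for i in range(200)]
batch = encode_batch(messages)

# For comparison: total size when every message is compressed on its own.
one_by_one = sum(len(gzip.compress(json.dumps(m).encode("utf-8"))) for m in messages)
```

Messages in one batch tend to repeat field names and values, so compressing them together yields a much smaller payload than compressing each message separately, and it takes one network round trip instead of many.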

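2. Reliability

Kafka's replica mechanism is the core of its reliability guarantee.

I understand the replica mechanism as a Leader-Follower mechanism: multiple servers hold multiple replicas of the same data, and the unit of replication is the partition. Clearly, such a strategy has to solve the following problems:

  • How are the replicas kept in sync?
  • ISR mechanism: the leader dynamically maintains an ISR (In-Sync Replica) list.
  • If the leader fails, how is a new leader elected?
  • Solving this problem brings in ZooKeeper, which is the prerequisite for Kafka's replica mechanism. Its principles will be covered in a later article; this one keeps to Kafka's point of view. Here we only need to know that some metadata about brokers, topics, and partitions is stored in ZooKeeper, and that when the leader fails, a new leader is elected from the ISR set.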
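The two questions above can be played through with a toy model. Everything here is a simplification made up for illustration: the `Partition` class, the broker ids, the message-count lag threshold, and the "lowest id wins" election are assumptions. Real brokers apply their own lag criteria, and the election is coordinated through ZooKeeper.

```python
class Partition:
    """Toy model of one partition's replicas (hypothetical, not Kafka's code)."""

    def __init__(self, leader, followers, max_lag=2):
        self.leader = leader
        self.log_end = {broker: 0 for broker in [leader, *followers]}
        self.isr = set(self.log_end)
        self.max_lag = max_lag

    def append(self, count=1):
        """The leader appends new messages to its own log."""
        self.log_end[self.leader] += count

    def fetch(self, follower):
        """A follower catches up with the leader."""
        self.log_end[follower] = self.log_end[self.leader]

    def update_isr(self):
        """Keep only the replicas whose lag is within the threshold."""
        leader_end = self.log_end[self.leader]
        self.isr = {b for b, end in self.log_end.items()
                    if leader_end - end <= self.max_lag}

    def fail_leader(self):
        """Remove the failed leader and promote a replica from the ISR."""
        failed = self.leader
        del self.log_end[failed]
        self.isr.discard(failed)
        if not self.isr:
            raise RuntimeError("no in-sync replica left to elect")
        self.leader = min(self.isr)  # deterministic pick, for the sketch only
        return self.leader


p = Partition(leader=1, followers=[2, 3])
p.append(5)
p.fetch(2)        # broker 2 keeps up, broker 3 falls behind
p.update_isr()
isr_before_failure = set(p.isr)
new_leader = p.fail_leader()
```

Broker 3 dropped out of the ISR because it lagged, so only broker 2 is eligible when the leader fails. Electing from the ISR means the new leader already has the messages the old leader had acknowledged.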

request.required.acks sets the level of data reliability:

[Figure: meaning of each request.required.acks value]
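As a rough guide to the setting, the three classic values are 0 (do not wait for any acknowledgment), 1 (wait for the leader), and -1 (wait for all in-sync replicas). Newer clients call the setting `acks`. The function below is a hypothetical decision helper written for this article, not a Kafka API; it only encodes those three rules.

```python
def acknowledged(acks, leader_written, isr_written):
    """Return True if the producer would treat the send as successful.

    acks=0:  do not wait for any acknowledgment.
    acks=1:  wait until the leader has written the message.
    acks=-1: wait until every in-sync replica has written it.
    """
    if acks == 0:
        return True
    if acks == 1:
        return leader_written
    if acks == -1:
        return leader_written and all(isr_written)
    raise ValueError(f"unsupported acks value: {acks}")


# Scenario: the leader has the message, one of two ISR followers does not yet.
results = {
    acks: acknowledged(acks, leader_written=True, isr_written=[True, False])
    for acks in (0, 1, -1)
}
```

The stricter the setting, the longer the producer waits, and the less likely a message is lost when a broker fails.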

Key points of the partition mechanism and the replica mechanism:

[Figure: summary of the partition and replica mechanisms]

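3. Ordering

The ordering guarantee relies mainly on the partition mechanism plus offsets.

Speaking of partitions, I first need to explain the related concepts and how they relate to one another. My own summary:

Server (Broker): an independent server

Topic: a logical classification of messages; can span brokers

Partition: a physical classification of messages; the basic unit of storage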

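Here is a borrowed diagram that illustrates how these concepts relate:

[Figure: relationships among brokers, topics, and partitions]
  • Why can the partition mechanism guarantee message ordering?
  • Kafka guarantees that the messages within one partition are ordered and immutable.
  • Producer: a Kafka message is a key-value pair; by setting the key, we direct the message to a specific partition of a specific topic.
  • By setting the key so that messages of the same type go to the same partition, we guarantee their ordering.
  • Consumer: a consumer records how far it has consumed by saving an offset. In early versions the offset was stored in ZooKeeper; newer versions (starting with the new consumer in 0.9) store it in the internal __consumer_offsets topic.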
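Both halves of the ordering guarantee, key-based partitioning on the producer side and offsets on the consumer side, fit into a short simulation. This is a sketch under stated assumptions: the partition count, the `send` and `poll` helpers, and the CRC32 hash are made up for the example (Kafka's default Java partitioner hashes keys with murmur2).

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(key):
    """Map a key to a partition with a stable hash (CRC32 here, for simplicity)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)  # partition id -> ordered list of (key, value)
committed = defaultdict(int)    # partition id -> next offset to consume


def send(key, value):
    """Producer side: append to the partition chosen by the key."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


def poll(partition):
    """Consumer side: read everything past the committed offset, then commit."""
    records = partitions[partition][committed[partition]:]
    committed[partition] += len(records)
    return records


for step in ("created", "paid", "shipped"):
    send("order-42", step)
send("order-7", "created")

order_partition = partition_for("order-42")
events = [value for key, value in poll(order_partition) if key == "order-42"]
```

All events for `order-42` land in one partition and come back in the order they were sent. Nothing is guaranteed about their order relative to keys that hash to other partitions.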

[Scenario 2] Stream processing

Since version 0.10, Kafka has shipped with a built-in stream processing API, Kafka Streams, a stream processing library built on Kafka that makes use of the features described above. With it, Kafka has become a centralized streaming platform that comprises a messaging system, a storage system, and a stream processing system.

Kafka Streams differs from existing platforms such as Spark Streaming: Spark Streaming and Flink are system frameworks, while Kafka Streams is a library. Kafka Streams sticks to a simple design, and the advantage shows in operations and maintenance. It also retains all of the features mentioned above.

As for which scenarios suit each of them, the experts have already given their conclusion, so I won't force a summary of my own.

  • Kafka Streams: for "Kafka -> Kafka" scenarios
  • Spark Streaming: for "Kafka -> Database" or "Kafka -> Data Science Model" scenarios



Origin www.cnblogs.com/CQqf2019/p/11262321.html