Kafka Architecture Principles in Plain Language

Introduction

The era of big data is here, and Kafka is hard to avoid. By some counts, a third of the world's Fortune 500 companies use Kafka, including all of the top 10 travel companies, 7 of the top 10 banks, 8 of the top 10 insurance companies, and 9 of the top 10 telecom companies. LinkedIn, Microsoft, and Netflix use Kafka to process trillions of messages every day. In this article, let's walk through Kafka's architecture principles together.


Kafka official website: http://kafka.apache.org/

1. Introduction to Kafka

Kafka was originally developed at LinkedIn. It is a distributed, partitioned, multi-replica, multi-subscriber log system coordinated by ZooKeeper (it can also serve as an MQ system), commonly used for web/nginx logs, access logs, messaging services, and so on. LinkedIn contributed it to the Apache Foundation in 2010, and it became a top-level open source project.

2. Features of Kafka

  • High throughput, low latency: Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds;

  • Scalability: a Kafka cluster supports hot expansion;

  • Persistence and reliability: messages are persisted to local disk, and data replication prevents loss;

  • Fault tolerance: nodes in the cluster are allowed to fail (with a replication factor of n, up to n-1 nodes can fail);

  • High concurrency: a single machine can support thousands of clients reading and writing simultaneously.

3. Application scenarios of kafka

  • Log collection: a company can use Kafka to collect logs from all of its services and expose them through a unified interface to various consumers, such as Hadoop, HBase, Solr, etc.

  • Messaging: decoupling producers from consumers, buffering messages, etc.

  • User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searches, and clicks. Servers publish this activity information to Kafka topics, and subscribers consume those topics for real-time monitoring and analysis, or load the data into Hadoop or a data warehouse for offline analysis and mining.

  • Operational metrics: Kafka is also often used to record operational monitoring data.

  • Stream processing

4. Kafka architecture (the highlight!)

The following is an architecture diagram of Kafka.

On the whole, the Kafka architecture contains four major components: producers, consumers, the Kafka cluster, and the ZooKeeper cluster. With the diagram as a reference, let's first clarify a few important terms.

1. Broker

A Kafka cluster contains one or more servers; each server node is called a broker.

2. Topic

Every message published to a Kafka cluster belongs to a category called a topic; in effect, messages are classified by topic. A topic is a logical grouping: data for the same topic may sit on a single broker or be spread across different broker nodes.

3. Partition

Each topic is physically divided into one or more partitions, and each partition corresponds to a folder on disk that stores all of that partition's messages and index files. You can specify the number of partitions when creating a topic. When a producer sends a message to the topic, the message is appended to the end of one partition's file according to a partitioning strategy. Because this is sequential disk writing, it is very efficient (sequential disk writes have been shown to outperform random disk writes, and in some cases even random memory access, which is an important foundation of Kafka's high throughput).

The partitioning strategy mentioned above is the algorithm that decides which partition the producer sends a message to. Kafka provides a default partitioning strategy and also supports custom strategies. Kafka lets you set a key on each message; once a message has a key, all messages with the same key are guaranteed to enter the same partition. This is known as the "partition by message key" strategy, or key-ordering strategy.
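To make the strategy concrete, here is a minimal Python sketch of the idea (not Kafka's real implementation: the actual default partitioner hashes keys with murmur2, and since Kafka 2.4 keyless messages use a "sticky" strategy; crc32 and round-robin here are simplified stand-ins):

```python
import zlib
from itertools import count

class SimplePartitioner:
    """Toy partitioner: hash the key if present so the same key always
    lands in the same partition; spread keyless messages round-robin."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()

    def partition(self, key=None):
        if key is not None:
            # Same key -> same partition, which preserves per-key ordering.
            return zlib.crc32(key.encode()) % self.num_partitions
        # Keyless messages are spread evenly across partitions.
        return next(self._counter) % self.num_partitions
```

Messages with the same key always map to the same partition, which is exactly what makes the key-ordering strategy work.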

Multiple partitions of the same topic can be deployed across multiple machines, which is what makes Kafka scalable. Data within a single partition is ordered, but ordering is not guaranteed across the partitions of a topic. In scenarios where the order of consumption must be strictly guaranteed, the number of partitions can be set to 1, at the cost of throughput. In general, it is enough to guarantee ordering within each partition and set a message key so that messages with the same key land in the same partition; this satisfies most applications.

4. Offset

Each message in a partition is tagged with a sequence number that represents its position in the partition, called the offset. Every message has a unique offset within its partition, and a consumer specifies an offset to choose which message to consume.

Normally, a consumer increments the offset after consuming a message, ready to consume the next one, but it can also set the offset to a smaller value and re-consume messages it has already seen. In other words, the offset is controlled by the consumer: the consumer decides which message to consume. The Kafka broker is therefore stateless; it does not need to track which messages have been consumed.
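The division of labor described above, a stateless broker that only stores and serves the log while each consumer tracks its own offset, can be sketched as follows (a toy in-memory model, not real Kafka client code):

```python
class TopicPartitionLog:
    """Toy stand-in for a partition on a stateless broker: it only stores
    messages and serves reads from any offset; it never tracks consumers."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset, max_count=10):
        return self.messages[offset:offset + max_count]

class Consumer:
    """The consumer owns its offset, so it can rewind and re-consume."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_count=10):
        batch = self.log.read(self.offset, max_count)
        self.offset += len(batch)   # advance past what was consumed
        return batch

    def seek(self, offset):
        self.offset = offset        # rewind (or skip ahead) at will
```

Calling `seek` with a smaller offset re-reads already-consumed messages, with no cooperation needed from the broker side.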

5. Producer

The producer sends messages to a specified topic, and each message is appended to the end of one partition according to the partitioning rules.

6. Consumer

Consumers read data from topics.

7. Consumer group

Each consumer belongs to a specific consumer group. A group can be specified for each consumer; if none is specified, the consumer belongs to the default group.

A message on a topic can be consumed by only one consumer within a given consumer group, but multiple consumer groups can consume the message simultaneously. This is how Kafka implements both broadcast and unicast of topic messages: for broadcast, put each consumer in its own consumer group; for unicast, put all consumers in the same consumer group.

With consumer groups, consumers can be grouped freely without having to send the same messages to multiple topics.

8. Leader

Each partition has multiple replicas, exactly one of which is the leader; the leader handles all client reads and writes.

9. Follower

A follower does not serve clients; it only keeps its data in sync with the leader. If the leader fails, one of the followers is elected as the new leader. If a follower crashes, gets stuck, or syncs too slowly, the leader removes it from the ISR (in-sync replicas) list and a new follower is created.

10. Rebalance

Multiple consumers in the same consumer group coordinate their consumption with one another. Think of it this way: a topic is divided into multiple partitions, and all the consumers in a group cooperate to consume every partition of the subscribed topic (each consumer handling a subset of the partitions). Kafka distributes the topic's partitions evenly across the consumers in the group, as shown below.

Rebalance means "redistributing the balance": when a consumer in the group dies, the remaining consumers automatically redistribute the subscribed topic's partitions among themselves. This is an important mechanism behind the high availability of Kafka consumers. As shown in the figure below, when C2 in Consumer Group A dies, C1 takes over P1 and P2 to restore the balance. Likewise, a new consumer joining the group also triggers a rebalance.
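The even distribution and redistribution of partitions can be sketched with a simple assignment function (an illustrative round-robin-style assignor, simpler than Kafka's actual range/round-robin/sticky assignors); calling it again with the new membership models a rebalance:

```python
def assign_partitions(partitions, consumers):
    """Spread partitions as evenly as possible over the sorted consumer
    list. Re-running this with updated membership models a rebalance."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        # Deal partitions out like cards, one consumer at a time.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

For example, with partitions P0-P3 and consumers C1 and C2, each consumer gets two partitions; if C2 leaves, re-running the function hands everything to C1.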

5. Notes on the Kafka architecture

  • A typical Kafka cluster contains several producers, several brokers (Kafka scales horizontally; generally, the more brokers, the higher the cluster throughput), several consumer groups, and a ZooKeeper cluster. Kafka uses ZooKeeper to manage the cluster, elect partition leaders, and rebalance when consumer group membership changes.

  • A Kafka topic is divided into one or more partitions, which can be distributed across one or more broker nodes. For fault tolerance, each partition is replicated, with the copies placed on different broker nodes. Among these partition replicas (leader and followers are both called replicas), one is the leader and the rest are followers. The leader handles all client reads and writes; followers do not serve clients and only sync data from the leader. When the leader fails, a follower takes over as leader and continues to serve clients.

  • In a traditional MQ, a message is deleted from the queue once consumed, but Kafka does not delete consumed messages immediately. The retention period is defined in Kafka's server.properties configuration file, and log files are deleted only after the configured retention time expires:

    # Data retention time (in hours; default is 7 days)
    log.retention.hours=168

    Because Kafka reads messages in O(1) time regardless of file size, deleting expired files does not improve Kafka's performance; the deletion policy should therefore be chosen based on disk capacity and specific requirements.

  • Point-to-point model vs. publish-subscribe model

    In traditional messaging systems, there are two main messaging modes: point-to-point and publish-subscribe.

    ① Point-to-point mode

    The producer sends messages to a queue. A queue can have multiple consumers, but each message can be consumed by only one of them; moreover, in the point-to-point mode a message is deleted from the queue once consumed and is no longer stored.

    ② Publish-subscribe mode

    Producers publish messages to topics; a topic can be subscribed to by multiple consumers, and a message published to a topic is consumed by all subscribers. Kafka uses the publish-subscribe model.

  • Consumer pull vs. push

    ① Push: the message middleware actively pushes messages to the consumer.

    Advantages: consumers do not need a dedicated thread to poll the middleware, which saves overhead.

    Disadvantages: it cannot adapt to consumers with different consumption rates. The broker determines the send rate while consumers process at different speeds, so some consumers sit idle while messages pile up at others, causing delays and buffer overflows.

    ② Pull: the consumer actively pulls messages from the message middleware.

    Advantages: the consumer can pull according to its own processing capacity.

    Disadvantages: the consumer needs a separate thread to poll the middleware, which carries some performance overhead.

    For Kafka, the pull mode is the better fit. It simplifies the broker's design and lets consumers control the rate and style of consumption: they can consume in batches or one message at a time, and they can choose different commit strategies to achieve different delivery semantics.
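A skeleton of such a pull loop (hypothetical `fetch`/`handle` callbacks, not the Kafka consumer API) shows how the consumer, not the broker, controls batch size and pacing:

```python
import time

def pull_loop(fetch, handle, max_batch=100, idle_backoff=0.01, max_polls=None):
    """Skeleton of a pull-style consumer loop: the consumer decides how
    much to fetch and when, so a slow handler simply polls less often
    instead of being flooded by the broker."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        batch = fetch(max_batch)      # consumer-controlled batch size
        if not batch:
            time.sleep(idle_backoff)  # back off when there is nothing new
            continue
        for msg in batch:
            handle(msg)
```

`max_batch` mirrors the idea behind fetch-size settings in real clients: the consumer tunes it to its own processing capacity.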

6. Comparison of Kafka and RabbitMQ

 

| | RabbitMQ | Kafka |
| --- | --- | --- |
| Development language | Erlang | Scala, Java |
| Architecture model | ① Follows AMQP; ② producers, consumers, brokers; ③ a broker is composed of exchanges, bindings, and queues; ④ the broker tracks the consumer's position via an acknowledgment mechanism | ① Does not follow AMQP; ② producers, consumers, Kafka cluster, ZooKeeper cluster; ③ the Kafka cluster consists of multiple broker nodes, messages are classified by topic, and each topic is divided into multiple partitions; ④ the broker is stateless, and the offset is specified by the consumer |
| Reliability | Better reliability: supports transactions and a message acknowledgment mechanism | |
| High availability | Mirrored queues (master-slave mode) with data replicated to the slaves; a message is written to master and slaves before the ack is returned, ensuring consistency | Each partition has one or more replicas stored on different brokers; exactly one replica is the leader and the rest are followers. When the leader becomes unavailable, a follower is elected as the new leader and continues to serve. Only the leader serves reads and writes; followers pull data from the leader for backup |
| Throughput | | Kafka is higher |
| Transaction support | Supported | Not supported |
| Load balancing | Requires external support (e.g. a load balancer) | Achieved through ZooKeeper and the partition mechanism |
| Consumer push | Supported | Not supported |
| Consumer pull | Supported | Supported |
| Typical scenarios | Greater rigor, lower chance of data loss, better real-time guarantees; used for message delivery that demands high real-time performance and reliability | Its advantage is throughput; mainly used in high-throughput scenarios such as log collection |

7. Why is Kafka throughput so high?

1. Sequential disk reads and writes

Kafka persists messages to local disk. Most people assume disk reads and writes are slow and question Kafka's performance on that basis, but in fact, for both memory and disk, the key factor is the access pattern. Disks can be read and written sequentially or randomly, and so can memory. Random disk I/O is indeed slow, but sequential disk I/O is very fast, generally about three orders of magnitude faster than random disk I/O, and in some cases even faster than random memory access. Here is a performance comparison chart from the well-known journal ACM Queue:

2. Page Cache

To optimize read and write performance, Kafka uses the operating system's own Page Cache, that is, the OS's memory rather than JVM heap memory. This is because:

> In the JVM everything is an object, and storing objects brings extra memory overhead;

> Using the JVM means being subject to GC; as data grows, garbage collection becomes slower and more complex, reducing throughput;

In addition, the operating system itself heavily optimizes the page cache. By going through the OS Page Cache, Kafka's reads and writes essentially operate on system memory, which greatly improves performance.

3. Zero copy

Zero copy refers to Kafka's use of the Linux "zero-copy" mechanism on the consumer side. First, consider the full path data takes from the broker's disk over the network to the consumer when it is read the conventional way:

> The operating system reads data from the disk to the page cache of the kernel space (kernel space);

> The application program reads the page cache data to the user space buffer;

> The application writes the data in the user space buffer back to the socket buffer in the kernel space;

> The operating system copies data from the socket buffer to the hardware (such as network card) buffer;

The whole process is shown in the figure above: it involves 4 copy operations and 2 system context switches. Context switching is CPU-intensive and data copying is I/O-intensive, so the overall path is quite inefficient.

Zero copy uses the sendfile() system call to send data directly from the page cache to the socket buffer, avoiding the context switches and eliminating the round-trip copies between kernel space and user space. As the figure shows, "zero copy" does not mean no copying at all; rather, from the kernel's perspective, it avoids copying between kernel space and user space.
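On Linux, Python exposes the same system call as os.sendfile, which makes for a small demonstration (Kafka itself invokes it from the JVM via FileChannel.transferTo; this sketch just shows the kernel-to-socket path):

```python
import os

def send_via_zero_copy(path, sock):
    """Send a file over a socket with sendfile(): bytes move from the
    page cache to the socket buffer inside the kernel, never surfacing
    in a user-space buffer of this process. Linux-only."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # os.sendfile(out_fd, in_fd, offset, count) -> bytes sent
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent
```

Compare this with a read()/send() loop, which would copy each chunk into a Python bytes object (user space) and back down into the kernel.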

4. Partitioning

Kafka's messages are stored and classified by topic, and a topic's data is divided into partitions stored on different broker nodes. Each partition corresponds to a folder in the operating system, and within a partition the data is further stored as segments. This matches the partition-and-bucket design idea common in distributed systems.

Thanks to this partitioned design, Kafka's messages are actually stored in many small segment files, and every file operation works directly on a segment. For further query optimization, Kafka by default creates an index file for each segment data file: the .index file on the file system. This partition-plus-index design improves both read efficiency and the parallelism of data operations.
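The segment-plus-sparse-index lookup can be sketched in a few lines (a toy in-memory model; real Kafka index entries map relative offsets to byte positions in the .log file, and the index itself is sparse so it stays small):

```python
import bisect

class Segment:
    """Toy model of a log segment with a sparse index: only every
    `interval`-th message gets an index entry, and a lookup
    binary-searches the index, then scans forward from there."""
    def __init__(self, base_offset, interval=4):
        self.base_offset = base_offset
        self.interval = interval
        self.messages = []   # stands in for the .log file
        self.index = []      # sparse index of relative offsets

    def append(self, msg):
        pos = len(self.messages)
        if pos % self.interval == 0:
            self.index.append(pos)   # index only every interval-th entry
        self.messages.append(msg)

    def read(self, offset):
        rel = offset - self.base_offset
        if rel < 0 or rel >= len(self.messages):
            return None
        # Binary-search for the nearest index entry at or before rel...
        i = bisect.bisect_right(self.index, rel) - 1
        # ...then scan forward from that position to the target.
        for pos in range(self.index[i], len(self.messages)):
            if pos == rel:
                return self.messages[pos]
        return None
```

Because each segment file is named after its base offset, a broker can also binary-search the segment list first, then apply this per-segment lookup.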

In short, sequential reads and writes, the Page Cache, zero copy, the partition-and-segment design, plus index optimizations and batched (rather than per-message) reads and writes, are what give Kafka its high performance, high throughput, and low latency.

Original link: blog.csdn.net/a1036645146/article/details/109049569