A brief analysis of Kafka architecture and basic principles

Introduction to Kafka

Kafka is an enterprise-grade message publish/subscribe system written in Scala and Java. It was originally developed at LinkedIn and later open-sourced as a project of the Apache Software Foundation. Kafka is a distributed, high-throughput messaging system that supports partitioning, replication, and multiple subscribers. It is widely used for application decoupling, asynchronous processing, rate limiting and peak shaving, and message-driven scenarios. This article gives a brief introduction to Kafka's architecture and related components. Before introducing the architecture, let us first go over the core concepts of Kafka.

Kafka core concepts

Before introducing the architecture and basic components of Kafka in detail, you need to first understand some core concepts of Kafka.
Producer: The producer of messages, responsible for sending messages to the Kafka cluster.
Consumer: The consumer of messages, which actively pulls messages from the Kafka cluster.
Consumer Group: Each Consumer belongs to a specific Consumer Group. When creating a new Consumer, you need to specify the corresponding Consumer Group ID.
Broker: A service instance in a Kafka cluster, also called a node. Each Kafka cluster contains one or more Brokers (a Broker is a server or node).
Message: The entity transmitted through the Kafka cluster; it carries the information that needs to be delivered.
Topic: The category of a message, used to distinguish messages logically. Every message sent to the Kafka cluster must specify a Topic, and consumers consume messages by Topic.
Partition: A partition of a Topic's messages. A Partition is a physical concept, equivalent to a folder: Kafka creates one folder for each Partition of each Topic. The messages of a Topic are stored in one or more Partitions.
Segment: A Partition consists of multiple Segment file segments (segmented storage). Each Segment has two parts, a .log file and a .index file. The .index file is an index used to quickly locate the offset position of data in the .log file.
.log file: The data file that stores Messages; in Kafka it is called a log file. A Partition holds some number of .log files (segmented storage), each 1 GB by default. Messages are continuously appended to the current .log file, and when its size exceeds 1 GB a new .log file is created automatically.
.index file: Stores the index data for a .log file. Each .index file corresponds to a .log file of the same name.

We will introduce some of the above core concepts in more depth later. After introducing the core concepts of Kafka, let's take a look at the basic functions, components and architectural design of Kafka.

Kafka API

Kafka provides four main API components:
1. The Producer API
An application sends messages for one or more Topics to the Kafka cluster through the Producer API.

2. The Consumer API
An application subscribes to one or more Topics in the Kafka cluster through the Consumer API and processes the messages received under those Topics.

3. The Streams API
An application uses the Streams API to act as a stream processor (Stream Processor): it consumes input streams from one or more Topics and produces output streams to one or more Topics, effectively transforming input streams into output streams within the Kafka cluster (a minimal sketch follows this list).

4. The Connect API
The Connect API lets applications build and run reusable producers or consumers that connect Kafka Topics to existing applications or data systems. Connect really does two things: a Source Connector reads data from a data source (such as a DB) and writes it to a Topic, and a Sink Connector reads data from a Topic and outputs it to the other end (such as a DB), enabling message data to move between external storage and the Kafka cluster.
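
To make the Streams API concrete, here is a minimal sketch of a stream processor in Java: it reads from one Topic, transforms each value, and writes the result to another Topic. The broker address localhost:9092 and the Topic names input-topic and output-topic are placeholders for illustration, and the sketch assumes the kafka-streams client library is on the classpath.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // doubles as the Consumer Group ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Input stream from one Topic, transformed, then written to another Topic.
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}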

Kafka architecture

Next, starting from Kafka's architecture, we focus on Kafka's main components and implementation principles. Kafka supports message persistence. Consumers consume messages by actively pulling them, and the client is responsible for maintaining subscription state and subscription relationships. Messages are not deleted immediately after being consumed; historical messages are retained, by default for 7 days (controlled by log.retention.hours, default 168). Therefore, when supporting multiple subscribers, a message does not need to be copied multiple times: only one copy needs to be stored. The implementation principles of each component are introduced in detail below.
1. Producer
The Producer is the message producer in Kafka, used to produce messages for specific Topics. The messages produced by the Producer are classified by Topic and stored on the Brokers of the Kafka cluster; concretely, they are stored in the directory of the assigned Partition, in Segment form (.log files and .index files). A minimal sketch follows.
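
Below is a minimal, hedged Producer sketch using the Java client. The broker address localhost:9092, the Topic name demo-topic, and the key user-42 are assumptions for this example, not values from the original article.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record's key determines which Partition (and hence which
            // partition directory on a Broker) the message is appended to.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("demo-topic", "user-42", "hello kafka");
            RecordMetadata meta = producer.send(record).get(); // block for the ack
            System.out.printf("written to partition %d at offset %d%n",
                meta.partition(), meta.offset());
        }
    }
}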

2. Consumer
The Consumer is the message consumer in Kafka, used to consume messages of specified Topics. A Consumer consumes messages from the Kafka cluster by actively pulling them, and every Consumer must belong to a specific Consumer Group. A minimal sketch follows.
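
Below is a matching minimal Consumer sketch with the Java client; the broker address, Topic name, and Group ID demo-group are again placeholder assumptions. Note how the Group ID is specified when the Consumer is created, and how poll() implements the active pull.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // the Consumer Group ID
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // poll() is the "active pull": the client fetches batches from the Brokers.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}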

3. Topic
Messages in Kafka are classified by Topic. A Topic supports multiple subscriptions: one Topic can have multiple Consumers consuming its messages. There is no limit to the number of Topics in a Kafka cluster. On disk, each Partition of a Topic gets its own directory (named <topic>-<partition>). A Topic can contain one or more Partitions, and the messages of all its Partitions taken together are all the messages of that Topic.

4. Partition
In Kafka, to increase message consumption speed, each Topic can be assigned multiple Partitions; this is the multi-partition support mentioned earlier. By default, a Topic's messages are stored in a single Partition. Each Partition is numbered starting from 0. Data within a single Partition is ordered, but ordering across different Partitions is not guaranteed, because different Partitions may be consumed by different Consumers. Within a Consumer Group, each Partition can be assigned to only one Consumer, but one Consumer can consume multiple Partitions of a Topic at the same time. A sketch of how a key maps to a Partition follows.
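
To make the key-to-Partition mapping concrete, here is a simplified sketch of the idea behind the default partitioner in the Java client. The real client hashes the serialized key with murmur2; String.hashCode() below is only a stand-in to show the principle.

public class PartitionSketch {
    // Simplified illustration of key-based partition selection; not Kafka's code.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            // Identical keys always land in the same Partition, which is why
            // per-key ordering holds even though the Topic as a whole is unordered.
            System.out.printf("key=%s -> partition %d%n",
                key, partitionFor(key, numPartitions));
        }
    }
}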

5. Consumer Group
Every Consumer in Kafka belongs to a specific Consumer Group, identified by a unique Group ID (also called the Group Name) specified when the Consumer is created. A Consumer Group consists of one or more Consumers, and within the same Consumer Group each message is consumed only once. The Consumers in a Consumer Group coordinate to subscribe to all Partitions of a Topic: each Partition can be consumed by only one Consumer within a given Consumer Group, but the same Partition can also be consumed by Consumers in other Consumer Groups.

In terms of hierarchy, a Consumer Group corresponds to a Topic, and a Consumer corresponds to Partitions under that Topic. The number of Consumers in the Consumer Group and the number of Partitions under the Topic together determine the concurrency of message consumption, and the number of Partitions caps the effective concurrency, because a Partition can be consumed by only one Consumer in the group. When a Consumer Group has more Consumers than the subscribed Topic has Partitions, Kafka assigns one Consumer to each Partition and the extra Consumers sit idle. When the group has fewer Consumers than Partitions, a single Consumer takes on the consumption work of multiple Partitions. For example, with 4 Partitions, a group of 2 Consumers would consume 2 Partitions each, while a group of 5 Consumers would leave one Consumer idle. To sum up: the more Partitions under a Topic, the more Consumers can consume simultaneously, the faster the consumption, and the higher the throughput. At the same time, the number of Consumers in a group should be kept less than or equal to the number of Partitions, and ideally the Partition count is an integer multiple of the Consumer count (e.g., 1, 2, or 4 Consumers for 4 Partitions). A small sketch of the assignment arithmetic follows.
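
The assignment arithmetic described above can be sketched in a few lines. This toy example only reproduces how the partition counts per Consumer work out; the real assignment is performed by Kafka's group coordinator and the configured assignor.

public class AssignmentSketch {
    // Roughly how P partitions spread over C consumers in one group:
    // each consumer gets P / C partitions, and the first P % C consumers
    // get one extra. With C > P, the surplus consumers receive nothing.
    static void show(int partitions, int consumers) {
        System.out.printf("%d partitions, %d consumers:%n", partitions, consumers);
        for (int c = 0; c < consumers; c++) {
            int base = partitions / consumers;
            int extra = (c < partitions % consumers) ? 1 : 0;
            System.out.printf("  consumer %d -> %d partition(s)%n", c, base + extra);
        }
    }

    public static void main(String[] args) {
        show(4, 2); // 2 partitions each
        show(4, 5); // one consumer idle
    }
}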

6. Segment
For message-consumption performance, the messages in each Partition are stored in segments: a new Segment is created whenever the current one reaches 1 GB of messages, and each Segment contains two files, a .log file and a .index file. As noted before, the .log file is where Kafka actually stores the messages produced by the Producer, while the .index file uses a sparse index to store the logical offsets of messages together with their physical positions in the .log file, in order to speed up queries. The .log and .index files correspond one-to-one and appear in pairs.

Every message in Kafka has a logical offset (a relative offset) and an actual physical position on disk; in other words, a Kafka message has two locations: offset (relative offset) and position (physical disk offset address). In Kafka's design, the message offset is used as part of the Segment file name. The naming rule is: the first Segment of a Partition starts from 0, and each subsequent Segment file is named after the offset of the first message it contains, i.e., one greater than the last offset of the previous Segment (this is the Message offset, not a physical address; how an offset is mapped to a position in the .log file is introduced in detail below). Offsets are 64-bit long values, and the file name renders the offset as a 20-digit number, zero-padded at the front.
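
The naming rule is easy to reproduce: the base offset rendered as a 20-digit, zero-padded number. The offsets below are invented for illustration.

public class SegmentName {
    public static void main(String[] args) {
        // A segment is named after the offset of its first message,
        // zero-padded to 20 digits. Offset values here are invented examples.
        for (long baseOffset : new long[] {0L, 170410L, 239430L}) {
            System.out.println(String.format("%020d", baseOffset) + ".log");
        }
        // Prints: 00000000000000000000.log, 00000000000000170410.log, ...
    }
}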

The .index file maps message offsets to physical positions in the .log file. With that mapping, the process by which Kafka finds a Message inside a Partition can be briefly described as follows:
  1. Given the offset of the next message to consume (assume 7 here), binary-search the Segment files by name to find the Segment whose base offset is the largest value less than or equal to the target offset; here that is naturally 00000000000000000000.index.
  2. In that .index file, binary-search for the largest indexed offset that is less than or equal to the target offset (7 here). Suppose that is 6; its index entry gives the Position (physical offset address) of offset 6, which is 258.
  3. In the .log file, scan sequentially starting from disk position 258 until the Message with offset 7 is found.
At this point, we have briefly introduced the storage and query principles of the .index and .log files that make up a Segment. But one thing stands out: the offsets in the .index file are not stored contiguously. Why did Kafka design the index file to be non-contiguous? This non-contiguous design is called a sparse index. Kafka reads the index sparsely: by default, each time about 4 KB of data is written to the .log file, one index record is appended to the .index file. The main reasons for using a sparse index are as follows:
  (1) Sparse storage significantly reduces the disk space occupied by the .index file.
  (2) The sparse index file is small enough to be read entirely into memory, which avoids frequent disk IO when reading the index, so that Messages in the .log file can be located quickly through it.
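
The in-Segment part of the lookup described above can be sketched with a sorted map standing in for the .index file. All offsets and positions below are the invented example values from the walkthrough.

import java.util.Map;
import java.util.TreeMap;

public class SparseIndexLookup {
    public static void main(String[] args) {
        // Sparse index: only some offsets are indexed (about one entry per 4 KB
        // of log). Maps message offset -> physical position in the .log file.
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(0L, 0L);
        index.put(3L, 130L);
        index.put(6L, 258L);

        long target = 7L;
        // Step 2: largest indexed offset <= target. TreeMap.floorEntry performs
        // the binary search over the sorted keys.
        Map.Entry<Long, Long> entry = index.floorEntry(target);
        System.out.printf("scan .log from position %d (indexed offset %d)%n",
            entry.getValue(), entry.getKey());
        // Step 3: scan the .log sequentially from that position until the
        // message with offset 7 is reached.
    }
}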

7. Message
The Message is the actual carrier of the information being produced and consumed. Every message sent by a Producer to the Kafka cluster is wrapped by Kafka into a Message structure and then persisted to disk, rather than being stored directly. The physical structure of a Message on disk is as follows.

On-disk format of a message

offset         : 8 bytes 
message length : 4 bytes (value: 4 + 1 + 1 + 8(if magic value > 0) + 4 + K + 4 + V)
crc            : 4 bytes
magic value    : 1 byte
attributes     : 1 byte
timestamp      : 8 bytes (Only exists when magic value is greater than zero)
key length     : 4 bytes
key            : K bytes
value length   : 4 bytes
value          : V bytes

Among these fields, key and value store the actual Message content and have variable length, while the other fields are descriptions and statistics about the Message and have fixed length. Therefore, when searching for a Message, the disk pointer can compute how many bytes to move based on the offset and message length fields, which speeds up the search. This acceleration is possible because Kafka's .log files are written sequentially: data is appended to the file, and there are no random writes.
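
To show how the fixed-length fields make it cheap to jump from one Message to the next, here is a hedged sketch that walks one message of the legacy on-disk format above from a ByteBuffer. It is a direct reading of the layout table, not Kafka's own parsing code.

import java.nio.ByteBuffer;

public class MessageHeaderSketch {
    // Parses one message laid out as in the table above and leaves the buffer
    // positioned at the next message. Illustrative only.
    static void readOne(ByteBuffer buf) {
        long offset = buf.getLong();        // 8 bytes: logical offset
        int messageLength = buf.getInt();   // 4 bytes: length of everything that follows
        int next = buf.position() + messageLength; // fixed fields let us pre-compute the jump

        int crc = buf.getInt();             // 4 bytes
        byte magic = buf.get();             // 1 byte
        byte attributes = buf.get();        // 1 byte
        long timestamp = (magic > 0) ? buf.getLong() : -1; // 8 bytes only when magic > 0
        int keyLength = buf.getInt();       // 4 bytes (-1 means null key)
        if (keyLength >= 0) buf.position(buf.position() + keyLength); // skip K key bytes
        int valueLength = buf.getInt();     // 4 bytes

        System.out.printf("offset=%d length=%d crc=%d timestamp=%d value bytes=%d%n",
            offset, messageLength, crc, timestamp, valueLength);
        buf.position(next); // jump straight to the next message without reading the value
    }
}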

8. Partition Replicas
Finally, let's briefly cover the Partition Replicas (partition replica) mechanism in Kafka. Kafka did not have a replica mechanism before version 0.8. When creating a Topic, you can specify the number of Partitions and the number of replicas for the Topic.

Kafka uses the replication-factor to control how many Brokers (servers) store copies of each Partition's messages. The replication factor cannot exceed the number of Brokers, because two replicas of the same Partition are never placed on the same Broker. Replication is per Partition, and replicas have distinct roles: the master replica is called the Leader (there is exactly one at any time), the slave replicas are called Followers (there can be several), and the replicas in a synchronized state form the ISR (in-sync replicas). The Leader handles all reads and writes of data; a Follower does not serve reads or writes externally and only synchronizes data from the Leader. Consumers and Producers both read and write through the Leader and do not interact with Followers, so Kafka does not use read/write splitting. The advantage of using the Leader for both reading and writing is that it avoids the read latency that data synchronization would cause, since a Follower could serve reads only after it has synchronized the data from the Leader. Creating a replicated Topic is sketched below.
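
Partition count and replication factor are fixed when a Topic is created. A minimal sketch with the Java AdminClient follows; the Topic name and the counts are example values, and a replication factor of 3 assumes a cluster with at least 3 Brokers.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: requires at least 3 Brokers,
            // since replicas of the same partition never share a Broker.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}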

If a Partition has a replication factor of 3, then even if one replica fails, a Leader is simply elected from the remaining two. Kafka will not start another replica on a different Broker, because doing so would require copying and transferring data, occupying network IO for a long time; Kafka is a high-throughput messaging system and does not allow that to happen. If all replicas of a Partition are down, a Producer sending data to that Partition will fail to write. A message sent by a Producer to a Partition is first written to the Leader replica; it then has to be written to the other replicas in the ISR list, and only after that is the write considered committed and the offset advanced.
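
On the Producer side, this commit behavior corresponds to the acks setting (acks=all makes the Leader wait for the ISR), usually combined with the broker/topic-level min.insync.replicas setting. A sketch of the relevant configuration, with the same placeholder broker address:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class DurableProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the Leader acknowledges only after all replicas in the ISR
        // have the message, matching the commit rule described above.
        props.put("acks", "all");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}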

At this point, we have finished a brief introduction to Kafka's architecture and basic principles. To achieve high throughput and fault tolerance, Kafka also incorporates many other excellent design ideas, such as zero copy, a high-concurrency network design, and sequential disk storage; we will cover those another time.

Source

Original link: https://www.cnblogs.com/msjhw/p/15774122.html
