A Detailed Look at Kafka Partitions

The Partition is a core concept in Kafka. It shapes both Kafka's storage structure and the way messages are produced and consumed.

Once you understand Partitions well, the rest of Kafka becomes much easier to follow. This article explains the concept, structure, and behavior of Partitions.

1. Events, Streams, Topics
Before going deep into Partition, let's look at several higher-level concepts and their connection with Partition.

An Event represents a fact that happened in the past. Put simply, it is a message, a record.

Events are immutable, but they are also in motion: they often flow from one place to another.

A Stream (event stream) is a sequence of related events in motion.

When an event stream enters Kafka, it is stored as a Topic.


Therefore, a Topic is a specific event stream; it can also be thought of as a Stream at rest.

A Topic groups related Events together and stores them. A Topic is comparable to a table in a database.

2. Partitions

A Topic in Kafka is divided into multiple Partitions.

A Topic is a logical concept, while a Partition is the smallest storage unit and holds a subset of the Topic's data.

Each Partition is a separate append-only log: new records are always written at the end.


Record and Message refer to the same concept.

3. Offsets and message order
Each record in a Partition is assigned a unique sequence number, called its Offset.

Offsets increase monotonically and never change once assigned; Kafka maintains them automatically.

When a record is written to a Partition, it is appended to the end of the log and given the next sequence number as its Offset.
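The append-and-assign behavior described above can be illustrated with a small in-memory model. This is only a sketch: `PartitionLog` and its methods are hypothetical names made up for illustration, not part of any Kafka client, and a real Partition lives on disk rather than in a Python list.

```python
class PartitionLog:
    """Toy in-memory model of one partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record to the end of the log and return its offset."""
        offset = len(self._records)  # the next offset equals the current log length
        self._records.append(record)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]


log = PartitionLog()
first = log.append("order-created")
second = log.append("order-paid")
print(first, second)  # offsets are handed out in append order
print(log.read(1))
```

Because existing entries are never rewritten, an offset always refers to the same record once it has been assigned.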

Consider a Topic with 3 Partitions. A message sent to the Topic is actually written to exactly one of those Partitions and assigned an Offset there.

Message ordering deserves attention. If a Topic has multiple Partitions, Kafka gives no ordering guarantee across the Topic as a whole.

Within a single Partition, however, messages are strictly ordered by Offset.

In short, messages are ordered within a Partition but unordered across the Partitions of a Topic.

If the Topic as a whole must be ordered, it can have only one Partition.
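A rough way to see the difference between per-Partition order and Topic-wide order is to spread a numbered sequence of messages over several lists and read them back one list at a time. The sketch below is a toy simulation; the partition count and the helper name `write_round_robin` are assumptions chosen for illustration.

```python
NUM_PARTITIONS = 3  # assumed partition count for this toy topic


def write_round_robin(messages, num_partitions=NUM_PARTITIONS):
    """Distribute messages over partitions in turn, keeping arrival order per partition."""
    partitions = [[] for _ in range(num_partitions)]
    for i, message in enumerate(messages):
        partitions[i % num_partitions].append(message)
    return partitions


sent = [f"m{i}" for i in range(6)]
partitions = write_round_robin(sent)

# Reading partition by partition keeps each partition's internal order,
# but the combined sequence no longer matches the order of sending.
read_back = [message for partition in partitions for message in partition]
print(partitions)
print(read_back)
```

Every message is still present after reading, and each individual list is in sending order; only the interleaving between lists is lost.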

4. Partition provides Kafka with scalability

A Kafka cluster consists of multiple Brokers (that is, Servers), and each Broker contains part of the data of the cluster.

Kafka distributes the Partitions of a Topic across multiple Brokers.

This has multiple benefits:

If all Partitions of a Topic sat on one Broker, the Topic's throughput would be capped by that Broker's I/O capacity. Spreading the Partitions out lets the Topic scale horizontally.

A Topic can be consumed by multiple Consumers in parallel. With all Partitions on one Broker, that Broker limits how many Consumers can be served; with Partitions spread out, more Consumers can be supported.

A Consumer can run as multiple instances. When Partitions are distributed among Brokers, those instances can connect to different Brokers, which greatly improves message processing capacity. Each instance can be responsible for one Partition, which keeps processing clear and efficient.
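To make the idea of spreading Partitions over Brokers concrete, here is a minimal sketch that places partition ids on brokers in turn. It is not Kafka's actual placement algorithm; the function name `assign_partitions` and the broker names are hypothetical.

```python
def assign_partitions(num_partitions, brokers):
    """Spread partition ids over brokers in round-robin fashion (toy placement)."""
    placement = {broker: [] for broker in brokers}
    for partition in range(num_partitions):
        broker = brokers[partition % len(brokers)]
        placement[broker].append(partition)
    return placement


placement = assign_partitions(6, ["broker-0", "broker-1", "broker-2"])
for broker, partitions in placement.items():
    print(broker, partitions)
```

With six Partitions and three Brokers, each Broker ends up holding two Partitions, so reads and writes for the Topic are shared among three machines instead of one.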
5. Partitions provide data redundancy for Kafka
Kafka keeps multiple replicas of each Partition and places them on different Brokers.

If a Broker fails, a replica of the Partition on another Broker takes over, so Consumers can continue to get messages.
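The failover idea can be sketched as picking the first replica whose Broker is still alive. This is a deliberate simplification under stated assumptions: `elect_leader` and the broker names are hypothetical, and real Kafka applies additional rules (for example, which replicas are sufficiently up to date) that are not modeled here.

```python
def elect_leader(replicas, live_brokers):
    """Return the first replica hosted on a live broker, or None if none is left."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


# Hypothetical replica placement for a single partition.
replicas = ["broker-0", "broker-1", "broker-2"]
live = {"broker-0", "broker-1", "broker-2"}

leader_before = elect_leader(replicas, live)
live.discard("broker-0")  # simulate a broker failure
leader_after = elect_leader(replicas, live)
print(leader_before, leader_after)
```

As long as at least one replica survives, the Partition remains readable; only when every Broker holding a replica is down does the function return `None`.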

6. Writing to a Partition
A Topic has multiple Partitions, so when a message is sent to a Topic, which Partition should it be written to? There are 3 ways to decide.

  1. Use Partition Key to write to a specific Partition

When the Producer sends a message, it can specify a Partition Key so that the message is written to a specific Partition.

A Partition Key can be any value, such as a device ID or a User ID.

The Partition Key is passed to a hash function, and the result determines which Partition the message is written to.

Therefore, messages with the same Partition Key are placed in the same Partition.

For example, if the User ID is used as the Partition Key, all messages for a given ID land in the same Partition, which preserves their order.

This method requires attention to Partition hotspots.

For example, with the User ID as the Partition Key, a very active top user who generates a large share of the messages sends all of them to the same Partition. That Partition becomes a hotspot and ends up far busier than the others.
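Both the key-to-Partition mapping and the hotspot effect can be illustrated with a few lines of Python. This is a sketch under assumptions: `zlib.crc32` is used here only as a convenient deterministic stand-in for whatever hash function the Producer really applies, and the function name, partition count, and key values are made up for the example.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 3  # assumed partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition by hashing its bytes (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same key always maps to the same partition.
same = partition_for("user-42") == partition_for("user-42")

# One very active user produces 90 of 100 messages; ten other users send one each.
keys = ["user-1"] * 90 + [f"user-{i}" for i in range(2, 12)]
load = Counter(partition_for(key) for key in keys)
hot = partition_for("user-1")
print(same, hot, dict(load))
```

Whichever Partition `user-1` hashes to receives at least 90 of the 100 messages, which is the hotspot described above.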

  2. Let Kafka decide
    If no Partition Key is used, Kafka chooses the Partition itself, classically by round-robin.

In this way, messages are spread evenly across the Partitions.

But this does not preserve the order of the messages.
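Round-robin selection is simple enough to sketch directly. The class below is a toy model with a hypothetical name; it does not reproduce the exact strategy of any particular Kafka client version.

```python
import itertools
from collections import Counter


class RoundRobinPartitioner:
    """Toy partitioner that hands out partition ids in a repeating cycle."""

    def __init__(self, num_partitions):
        self._cycle = itertools.cycle(range(num_partitions))

    def next_partition(self):
        """Return the partition for the next keyless message."""
        return next(self._cycle)


partitioner = RoundRobinPartitioner(3)
chosen = [partitioner.next_partition() for _ in range(9)]
counts = Counter(chosen)
print(chosen)
print(dict(counts))
```

Nine messages over three Partitions give each Partition exactly three messages, but consecutive messages land in different Partitions, which is why order is not preserved.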

  3. Custom rules
    Kafka supports custom rules: a Producer can plug in its own partitioner to decide which Partition each message goes to.
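As an illustration of what a custom rule might look like, the sketch below reserves one Partition for a special class of keys and hashes everything else over the remaining Partitions. The rule itself, the `vip-` prefix, and the function name are all hypothetical, and `zlib.crc32` is again only a stand-in hash; a real custom partitioner would be written against the Producer client's own partitioner interface.

```python
import zlib


def custom_partition(key, num_partitions):
    """Hypothetical rule: 'vip-' keys go to partition 0, other keys are hashed elsewhere."""
    if num_partitions < 2:
        return 0  # nothing to separate when there is only one partition
    if key.startswith("vip-"):
        return 0
    # Hash the remaining keys over partitions 1 .. num_partitions - 1.
    return 1 + zlib.crc32(key.encode("utf-8")) % (num_partitions - 1)


print(custom_partition("vip-7", 4))
print(custom_partition("user-1", 4))
```

A rule like this trades even distribution for isolation: the reserved Partition carries only the special traffic, while ordinary keys still keep their per-key ordering.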

7. Reading from a Partition
Unlike ordinary message queues with a publish/subscribe function, Kafka does not push messages to Consumers.

A Consumer must pull messages from a Topic's Partitions itself.

A Consumer connects to the Broker hosting a Partition and reads messages from it in order.

The Offset of a message serves as the Consumer's cursor: consumption progress is recorded as an Offset.

After reading a message, the Consumer advances to the next Offset in the Partition and continues reading.

Advancing and recording the Offset is the Consumer's responsibility; Kafka does not advance it on the Consumer's behalf.
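The pull model with a Consumer-owned cursor can be sketched as follows. `ToyConsumer` and its `poll` method are hypothetical names for this illustration, and a plain Python list stands in for the Partition.

```python
class ToyConsumer:
    """Toy consumer that pulls records from a list and tracks its own position."""

    def __init__(self, log, start_offset=0):
        self._log = log               # a list standing in for one partition
        self.position = start_offset  # offset of the next record to read

    def poll(self, max_records=2):
        """Pull up to max_records starting at the current position, then advance."""
        batch = self._log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


partition = ["a", "b", "c", "d", "e"]
consumer = ToyConsumer(partition)
first_batch = consumer.poll()
second_batch = consumer.poll()

# A new consumer can resume from a previously recorded offset.
resumed = ToyConsumer(partition, start_offset=consumer.position)
rest = resumed.poll(max_records=10)
print(first_batch, second_batch, rest)
```

Because the position belongs to the Consumer, resuming after a restart only requires remembering the last Offset reached.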


Kafka has the concept of a Consumer Group: multiple Consumers form a group to consume a Topic together.

Consumers in the same group share the same Group ID.

The Consumer Group mechanism assigns each Partition to exactly one Consumer in the group, so a message is handled by only one Consumer in the group rather than by several.

A Consumer Group allows multiple Partitions to be consumed in parallel, which greatly improves consumption throughput. The maximum parallelism equals the number of Partitions of the Topic.

For example, if a Topic has 3 Partitions and 4 Consumers in the group are responsible for it, only 3 Consumers will do work, and the fourth will stand by as a backup. When a working Consumer fails, the standby takes over its Partition, which is a good fault tolerance mechanism.
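The 3-Partition, 4-Consumer example can be reproduced with a toy assignment function. This is a sketch only: `assign` and the consumer names are hypothetical, and Kafka's real assignment strategies are more elaborate than a simple round-robin.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer; surplus consumers get nothing."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


consumers = ["c1", "c2", "c3", "c4"]
assignment = assign([0, 1, 2], consumers)
idle = [consumer for consumer, owned in assignment.items() if not owned]
print(assignment, idle)

# If c1 fails, reassigning over the survivors puts the standby to work.
after_failure = assign([0, 1, 2], ["c2", "c3", "c4"])
print(after_failure)
```

With more Consumers than Partitions, the extra Consumer stays idle until a reassignment gives it a Partition.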

References
————————————————
Original link: https://blog.csdn.net/duysh/article/details/116481414
