Distributed - Message Queue Kafka: Kafka consumer partition rebalancing (Rebalance)

01. What is Kafka consumer partition rebalancing?

Consumers in a consumer group share ownership of the partitions of the topics they subscribe to. When a new consumer joins the group, it starts reading some of the partitions previously read by other consumers. When a consumer is shut down or crashes, it leaves the group, and the partitions it was reading are taken over by other consumers in the group.

The act of transferring ownership of a partition from one consumer to another is called rebalancing. Rebalancing is important because it brings high availability and scalability to consumer groups (you can add or remove consumers with confidence). Under normal circumstances, however, we do not want rebalancing to occur.

Rebalance is essentially a set of protocols that stipulates how a consumer group agrees to distribute all partitions of a subscribed topic. Suppose there are 20 consumer instances under a certain group, and the group subscribes to a topic with 100 partitions. Under normal circumstances, Kafka will allocate 5 partitions to each consumer on average. This allocation process is called Rebalance.
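The 100-partition, 20-consumer example above can be sketched in a few lines. This is only a toy illustration of an even split, not Kafka's actual assignor code; the function name assign_evenly and the consumer names are made up for the example.

```python
def assign_evenly(partitions, consumers):
    """Split partitions into contiguous, near-equal chunks, one per consumer."""
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        # The first `extra` consumers take one additional partition
        # when the split is not exact.
        size = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment


partitions = list(range(100))                          # partition ids 0..99
consumers = [f"consumer-{i:02d}" for i in range(20)]   # hypothetical member names
assignment = assign_evenly(partitions, consumers)
print(assignment["consumer-00"])  # [0, 1, 2, 3, 4]
```

When the number of partitions is not a multiple of the number of consumers, some consumers simply receive one partition more than others.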

Rebalance means that if the number of consumers in the consumer group changes, or the number of partitions being consumed changes, Kafka reassigns the partitions among the consumers.

02. What are the triggering conditions for Kafka consumer partition rebalancing?

Changes to the topic (for example, an administrator adds a new partition) will cause the partitions to be reallocated. The Rebalance operation on the Kafka consumer side will occur under the following circumstances:

① Consumers join or leave the consumer group;

② The number of partitions of the topic subscribed by the consumer changes;

③ The number of topics subscribed by consumers changes;

The latter two are usually deliberate operations and maintenance actions, so the rebalances they cause are mostly unavoidable. In practice, most partition rebalances are caused by a change in the number of consumer group members.

03. What is the process of Kafka consumer partition rebalancing?

Rebalance is performed through a consumer client called the "group leader" in the consumer group.

① Select the group leader: When a consumer wants to join the consumer group, it will send a JoinGroup request to the group coordinator. The first consumer to join the group becomes the group leader.

② Consumers maintain their affiliation with the group and their ownership of the partition by periodically sending heartbeats to the Broker assigned as the group coordinator (Coordinator).

③ The group leader obtains the group member list from the group coordinator (the list contains all consumers that have recently sent heartbeats, and they are considered "alive"), and is responsible for allocating partitions to each consumer. It uses a class that implements the PartitionAssignor interface to decide which partitions should be assigned to which consumers.

④ After the partition allocation is completed, the group leader will send the partition allocation information to the group coordinator;

⑤ The group coordinator then sends this information to all consumers. Each consumer can only see its own allocation information, and only the group leader holds information about all consumers and their partition ownership.

04. How does Kafka determine that a consumer is dead?

Consumers will send heartbeats to the broker designated as the group coordinator (the coordinator may be different for different consumer groups) to maintain group membership and ownership of partitions. Heartbeats are sent by a background thread of the consumer, and as long as the consumer can send heartbeats at normal intervals, it is considered "alive".

If a consumer does not send a heartbeat for long enough, its session times out, and the group coordinator considers it "dead" and triggers a rebalance. For example, if a consumer crashes and stops reading messages, the group coordinator receives no heartbeat for a few seconds before it declares the consumer dead. During those few seconds, nobody reads the messages in the partitions owned by the "dead" consumer. When a consumer is closed cleanly, it notifies the coordinator that it is leaving, and the coordinator triggers a rebalance immediately to minimize the processing delay.
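The timeout rule described above can be sketched as a small simulation. This is a toy model, not broker code: the class name HeartbeatTracker, the 10-second timeout, and the member names are assumptions chosen for illustration.

```python
class HeartbeatTracker:
    """Toy model of how a coordinator decides that a member is dead."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_heartbeat = {}  # member id -> time of last heartbeat (ms)

    def heartbeat(self, member, now_ms):
        self.last_heartbeat[member] = now_ms

    def leave(self, member):
        # Clean shutdown: the member is removed at once, no timeout needed.
        self.last_heartbeat.pop(member, None)

    def expire(self, now_ms):
        """Remove and return members whose session has timed out."""
        dead = sorted(
            m for m, t in self.last_heartbeat.items()
            if now_ms - t > self.session_timeout_ms
        )
        for m in dead:
            del self.last_heartbeat[m]
        return dead


coordinator = HeartbeatTracker(session_timeout_ms=10_000)  # assumed value
coordinator.heartbeat("consumer-a", now_ms=0)  # crashes right after this
coordinator.heartbeat("consumer-b", now_ms=0)
for t in (3_000, 6_000, 9_000):
    coordinator.heartbeat("consumer-b", now_ms=t)

early = coordinator.expire(now_ms=9_000)    # nobody has timed out yet
late = coordinator.expire(now_ms=10_001)    # consumer-a is now considered dead
print(early, late)
```

Between the crash and the expiry, the partitions of consumer-a are not read by anyone, which is the gap the text refers to.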

05. How does Kafka avoid consumer partition rebalancing?

In real applications, the most common cause of a rebalance is a consumer joining or leaving the group, especially when a consumer "crashes". A crash here does not necessarily mean that the consumer process has hung or that its machine is down. In the following two situations the consumer is also considered dead, even though the process is still running. Our goal is to avoid these two kinds of unnecessary rebalance.

① Failure to send heartbeat in time

If a consumer fails to send heartbeats in time, it is removed from the consumer group, which causes a rebalance. The values of session.timeout.ms and heartbeat.interval.ms therefore need to be set carefully. Here are some recommended values that can be used in a production environment without much thought.

(1) Set session.timeout.ms = 6000 (6 s).
(2) Set heartbeat.interval.ms = 2000 (2 s).

Make sure that a consumer instance can send at least 3 rounds of heartbeat requests before it is declared dead, that is, session.timeout.ms >= 3 * heartbeat.interval.ms. Setting session.timeout.ms to 6 s mainly lets the coordinator detect a dead consumer faster. After all, we want to find consumers that hold partitions without doing any work and kick them out of the group as soon as possible.
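The 3-heartbeat rule above is easy to check mechanically. The sketch below keeps the two settings in a plain dictionary, as many Python Kafka clients accept configuration; the helper names are hypothetical and not part of any client library.

```python
# Recommended values from the text, in milliseconds.
consumer_config = {
    "session.timeout.ms": 6000,
    "heartbeat.interval.ms": 2000,
}


def heartbeats_before_timeout(config):
    """How many full heartbeat rounds fit into one session timeout."""
    return config["session.timeout.ms"] // config["heartbeat.interval.ms"]


def is_heartbeat_config_safe(config, min_rounds=3):
    """True if session.timeout.ms >= min_rounds * heartbeat.interval.ms."""
    return heartbeats_before_timeout(config) >= min_rounds


print(heartbeats_before_timeout(consumer_config))  # 3
print(is_heartbeat_config_safe(consumer_config))   # True
```

A check like this could run at application startup so that a mistyped value fails fast instead of causing spurious rebalances later.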

② Message processing takes too long and does not finish within the allowed time

We once had a customer whose consumers had to process each message and then write it to MongoDB. This is clearly heavy consumption logic, and any instability in MongoDB increases the processing time of the consumer program. In this situation the value of max.poll.interval.ms is critical. To avoid unexpected rebalances, set this parameter somewhat longer than your maximum downstream processing time. Taking MongoDB as an example, if the longest write takes 7 minutes, you can set this parameter to about 8 minutes.
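The sizing rule above amounts to "worst-case processing time plus some headroom". A minimal sketch, assuming a one-minute headroom as in the MongoDB example; both helper names are invented for illustration.

```python
def recommended_max_poll_interval_ms(worst_case_processing_ms, headroom_ms=60_000):
    """Worst-case time to process one polled batch, plus a safety margin."""
    return worst_case_processing_ms + headroom_ms


def exceeds_poll_interval(processing_ms, max_poll_interval_ms):
    """True if the gap between two poll() calls would get the consumer evicted."""
    return processing_ms > max_poll_interval_ms


seven_minutes_ms = 7 * 60 * 1000
max_poll_interval_ms = recommended_max_poll_interval_ms(seven_minutes_ms)
print(max_poll_interval_ms)  # 480000, i.e. 8 minutes

# With Kafka's default of 300000 ms (5 minutes) the same batch would be too slow.
print(exceeds_poll_interval(seven_minutes_ms, 300_000))               # True
print(exceeds_poll_interval(seven_minutes_ms, max_poll_interval_ms))  # False
```

Reducing the number of records fetched per poll is another way to shorten the gap between polls, but that is a separate tuning decision.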

06. What is the impact of Kafka consumer partition rebalancing?

① It affects the consumption speed and throughput of the consumer group: while partitions are being reassigned, consumers may stop consuming until the reassignment is complete.

② Repeated consumption of messages may occur:

Because a consumer does not commit offsets in real time, there is always a gap between what has been consumed and what has been committed. With automatic commits, the commit interval is controlled by the parameter auto.commit.interval.ms. The default is 5000, which means offsets are committed about once every 5 seconds. Imagine the following scenario: a rebalance occurs 3 seconds after an offset commit. After the rebalance, all consumers resume from the last committed offset, but that offset is already 3 seconds old, so all the data consumed in the 3 seconds before the rebalance must be consumed again. You can commit more often by reducing auto.commit.interval.ms, but this only narrows the time window for repeated consumption; it cannot eliminate it.
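The size of the repeated-consumption window can be estimated with simple arithmetic. The sketch below is a simplification that assumes a constant consumption rate and commits landing exactly on interval boundaries; the function name and the rate of 100 messages per second are illustrative assumptions.

```python
def duplicates_after_rebalance(rate_per_sec, commit_interval_ms, rebalance_at_ms):
    """Messages consumed since the last commit, which will be read again."""
    # Time of the most recent commit before the rebalance.
    last_commit_ms = (rebalance_at_ms // commit_interval_ms) * commit_interval_ms
    committed = rate_per_sec * last_commit_ms // 1000
    processed = rate_per_sec * rebalance_at_ms // 1000
    return processed - committed


# Scenario from the text: rebalance 3 seconds after the commit at t = 5 s.
print(duplicates_after_rebalance(100, 5000, 8000))  # 300

# A shorter commit interval narrows the window but does not remove it.
print(duplicates_after_rebalance(100, 5000, 8500))  # 350
print(duplicates_after_rebalance(100, 1000, 8500))  # 50
```

This is why consumers that must not act twice on the same message usually need idempotent processing in addition to frequent commits.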

Unfortunately, the Kafka community currently has no complete solution to the impact of Rebalance. The impact can only be reduced by avoiding unnecessary rebalances.

07. Two mechanisms for Kafka consumer partition rebalancing?

Depending on the partition allocation strategy used by the consumer group, rebalancing can be divided into two types.

① Eager rebalancing, sometimes translated as "active" rebalancing (range, round-robin, sticky partition allocation strategy)

During eager rebalancing, all consumers stop reading messages and relinquish ownership of all their currently assigned partitions. They then rejoin the group, receive a new partition assignment, and resume reading. Reassigning everything from scratch lets each consumer in the group end up with roughly the same number of partitions, which achieves load balancing. But this process makes the entire consumer group unavailable for a short time window. The length of this window depends on the size of the consumer group and several configuration parameters.

② Cooperative rebalancing (cooperative sticky partition allocation strategy)

Kafka cooperative rebalancing (also known as incremental rebalancing) is used to redistribute partitions when consumer group membership changes. The cooperative mechanism only reassigns the partitions that need to move, not all partitions (for example, after a consumer exits the consumer group, only the partitions it was consuming are reassigned to other consumers).

Cooperative rebalancing usually means reassigning some of one consumer's partitions to another consumer while the other consumers keep reading the partitions that are not being reassigned. In a cooperative rebalance, the consumer group leader first notifies the consumers that they will lose ownership of some partitions; the consumers then stop reading from these partitions and relinquish ownership of them. Next, the group leader assigns these now unowned partitions to their new consumers, which completes the redistribution. This incremental approach may need several iterations to reach a steady state, but it avoids the "stop the world" pause that occurs when every consumer gives up all of its partitions at once. This is particularly important for large consumer groups, whose rebalances can take a long time.
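The difference between the two mechanisms comes down to which partitions each consumer must give up. The toy comparison below takes a current assignment and a target assignment as given; it does not compute the target itself, and the function names and the A/B/C consumers are invented for illustration.

```python
def revoked_eager(current):
    """Eager: every consumer gives up everything it currently owns."""
    return {consumer: list(parts) for consumer, parts in current.items()}


def revoked_cooperative(current, target):
    """Cooperative: a consumer gives up only partitions that move elsewhere."""
    return {
        consumer: [p for p in parts if p not in target.get(consumer, [])]
        for consumer, parts in current.items()
    }


# Consumer C joins a group in which A and B own three partitions each.
current = {"A": [0, 1, 2], "B": [3, 4, 5]}
target = {"A": [0, 1], "B": [3, 4], "C": [2, 5]}

print(revoked_eager(current))                # {'A': [0, 1, 2], 'B': [3, 4, 5]}
print(revoked_cooperative(current, target))  # {'A': [2], 'B': [5]}
```

In the cooperative case partitions 0, 1, 3 and 4 are read without interruption, and only partitions 2 and 5 pause while ownership moves to C.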

08. Kafka consumer partition rebalancing protocol

Rebalance is essentially a set of protocols. The group and the coordinator use these protocols together to complete a group rebalance. Kafka provides the following five request types to handle rebalance-related matters.

① JoinGroup request: A consumer requests to join the group.
② SyncGroup request: The group leader synchronizes the allocation plan to all members of the group.
③ Heartbeat request: A consumer regularly reports heartbeats to the coordinator to indicate that it is still alive.
④ LeaveGroup request: A consumer actively notifies the coordinator that it is about to leave the group.
⑤ DescribeGroups request: Shows all information about the group, including member information, protocol information, allocation plan, subscription information, and so on. This request type is primarily intended for administrators. The coordinator does not use it to perform a rebalance.

During the rebalance process, the coordinator mainly handles JoinGroup and SyncGroup requests sent from the consumer. When the consumer actively leaves the group, it will send a LeaveGroup request to the coordinator.

After a successful rebalance, all consumers in the group need to send Heartbeat requests to the coordinator regularly. Each consumer also determines whether the current group has started a new round of rebalance based on whether the response to the Heartbeat request contains REBALANCE_IN_PROGRESS.

09. Kafka consumer partition rebalancing process

At present, a rebalance is mainly divided into two steps: joining the group and synchronizing the allocation plan.

① Join the group: All consumers in the group send JoinGroup requests to the coordinator. After collecting all JoinGroup requests, the coordinator selects a consumer to serve as the leader of the group and sends all member information and their subscription information to the leader. It is particularly important to note that the leader and coordinator of a group are not the same concept. The leader is a consumer instance, and the coordinator is usually a broker in the Kafka cluster. In addition, the leader rather than the coordinator is responsible for formulating the distribution plan for all members of the entire group.

② Synchronize the allocation plan: The leader of the group draws up the allocation plan, that is, based on the allocation strategy mentioned earlier, it decides which partitions of which topics each consumer is responsible for. Once the allocation is complete, the leader wraps the plan in a SyncGroup request and sends it to the coordinator. Interestingly, every member of the group sends a SyncGroup request, but only the one sent by the leader contains the allocation plan. After receiving the plan, the coordinator extracts the part that belongs to each consumer and returns it to that consumer as the response to its SyncGroup request.
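The two steps above can be sketched as an in-memory simulation. It is a toy model of the message flow only: the class ToyCoordinator, the round-robin leader_assign function, the "orders" topic, and the member ids are all hypothetical, and the leader is taken to be the first member that joins.

```python
class ToyCoordinator:
    """In-memory stand-in for the broker that acts as group coordinator."""

    def __init__(self):
        self.members = {}   # member id -> subscribed topics
        self.leader = None
        self.plan = {}      # member id -> assigned (topic, partition) pairs

    def join_group(self, member_id, topics):
        self.members[member_id] = topics
        if self.leader is None:
            self.leader = member_id  # first member to join leads the group

    def join_response(self, member_id):
        # Only the leader receives the full member list.
        members = dict(self.members) if member_id == self.leader else {}
        return {"leader": self.leader, "members": members}

    def sync_group(self, member_id, plan=None):
        # Every member sends SyncGroup; only the leader's request carries a plan.
        if member_id == self.leader and plan is not None:
            self.plan = plan

    def sync_response(self, member_id):
        # Each member sees only its own slice of the plan.
        return self.plan.get(member_id, [])


def leader_assign(members, partitions):
    """Runs on the leader: a simple round-robin allocation."""
    ids = sorted(members)
    plan = {m: [] for m in ids}
    for i, partition in enumerate(partitions):
        plan[ids[i % len(ids)]].append(partition)
    return plan


coordinator = ToyCoordinator()
for member in ("c1", "c2", "c3"):
    coordinator.join_group(member, ["orders"])

leader_view = coordinator.join_response("c1")
plan = leader_assign(leader_view["members"], [("orders", p) for p in range(6)])

coordinator.sync_group("c1", plan)   # leader: request contains the plan
coordinator.sync_group("c2")         # followers: empty requests
coordinator.sync_group("c3")

print(coordinator.sync_response("c2"))  # [('orders', 1), ('orders', 4)]
```

The point of the design is that allocation logic lives in the consumer client (the leader), so new allocation strategies can be introduced without changing the broker.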

10. What are the static (fixed) members of a Kafka consumer group?

By default, a consumer's group membership is temporary. When a consumer leaves the group, the partitions assigned to it are revoked; when it rejoins, it is given a new member ID and a new set of partitions through the rebalancing protocol.

A consumer can be given a unique group.instance.id, which makes it a static (fixed) member of the group. When such a consumer first joins the group, it is assigned a portion of the partitions according to the partition allocation strategy, as usual. When this consumer is closed, however, it does not automatically leave the group; it remains a member until its session times out. When the consumer rejoins the group, it keeps its previous identity and is assigned the partitions it held before. The group coordinator caches the partition allocation of each member, so it only needs to send the cached information to the rejoining static member, without a rebalance.

If two consumers join the same group with the same group.instance.id, the second consumer will get an error telling it that a consumer with the same ID already exists.

Static (fixed) group membership is useful if the application maintains local state or a cache tied to the partitions a consumer owns. If rebuilding the local cache is time-consuming, you do not want to go through that process every time the consumer restarts. On the other hand, the partitions owned by the consumer are not reassigned while it restarts, so nobody reads them during that time. When the consumer comes back it will be slightly behind, and you have to be confident that it can catch up.

Note that static (fixed) members do not proactively leave the group when they are closed. When they "really disappear" depends on the session.timeout.ms parameter. Set it high enough that a simple application restart does not trigger a rebalance, but low enough that partitions are automatically reassigned after a serious outage, so that those partitions do not build up a large lag.
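The trade-off around session.timeout.ms for static members can be illustrated with a toy coordinator that remembers assignments by group.instance.id. The class name StaticGroup, the 30-second timeout, and the instance id "worker-1" are assumptions made for the example, and the join time stands in for the last heartbeat.

```python
class StaticGroup:
    """Toy coordinator that caches assignments by group.instance.id."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.cached = {}      # group.instance.id -> (partitions, last_seen_ms)
        self.rebalances = 0

    def join(self, instance_id, now_ms, fresh_assignment):
        entry = self.cached.get(instance_id)
        if entry is not None and now_ms - entry[1] <= self.session_timeout_ms:
            # Session still alive: hand back the cached partitions, no rebalance.
            partitions = entry[0]
        else:
            # Unknown or expired member: a rebalance produces a new assignment.
            partitions = fresh_assignment
            self.rebalances += 1
        self.cached[instance_id] = (partitions, now_ms)
        return partitions


group = StaticGroup(session_timeout_ms=30_000)  # assumed value

first = group.join("worker-1", now_ms=0, fresh_assignment=[0, 1])
# Quick restart, back after 10 s: same partitions, still one rebalance.
restart = group.join("worker-1", now_ms=10_000, fresh_assignment=[2, 3])
# Serious outage, back after another 45 s: the session expired in between.
outage = group.join("worker-1", now_ms=55_000, fresh_assignment=[2, 3])

print(first, restart, outage, group.rebalances)
```

The timeout therefore needs to be longer than a typical restart but shorter than the lag you are willing to accumulate on the partitions of a member that never returns.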

11. Four scenarios for Kafka consumer partition rebalancing

① New members join the group:

② Group member crashes:

A group member crashing and a group member voluntarily leaving are two different scenarios. A crashed member does not notify the coordinator, so the coordinator may need a full session.timeout.ms period to detect the crash, which inevitably causes consumer lag. In other words, leaving the group initiates a rebalance actively, while crashing initiates one passively.

③ Group members voluntarily leave the group:

④ Offset commit:

Origin blog.csdn.net/qq_42764468/article/details/132254921