Kafka rebalance

1. What is Rebalance?

The transfer of ownership of a partition from one consumer to another is called Rebalance.

Rebalance achieves high availability and scalability for consumer groups.

Consumers maintain their membership in the group and their ownership of partitions by sending heartbeats to the broker designated as the group coordinator (Coordinator).

The so-called coordinator (in Kafka terms, the Coordinator) serves the Consumer Group exclusively: it performs Rebalance for the group and provides offset management and group membership management. Specifically, when a Consumer application commits offsets, it actually commits them to the Broker where the Coordinator resides. Likewise, when a Consumer application starts, it sends its various requests to the Broker where the Coordinator resides, and the Coordinator then performs metadata operations such as consumer group registration and member bookkeeping.

When consumers are added to or removed from the group, or broker nodes are added to or removed from the Kafka cluster, the group coordinator broker triggers a rebalance and reassigns a consumer to each partition. During the rebalance, consumers cannot read messages, so the whole consumer group is briefly unavailable.

Rebalance is essentially a protocol that specifies how all Consumers under a Consumer Group agree on allocating every partition of the subscribed Topics. For example, suppose a Group has 20 Consumer instances and subscribes to a Topic with 100 partitions. Under normal circumstances, Kafka assigns 5 partitions to each Consumer on average. This allocation process is called Rebalance.

2. When Rebalance is triggered

The number of group members changes. For example, a new Consumer instance joins or leaves the group, or a Consumer instance crashes and is "kicked out" of the group.

Adding a consumer. After a consumer subscribes to a topic, it joins the group the first time its poll method is executed.
Removing a consumer. If consumer.close() is executed, or the consumer client goes down, heartbeats are no longer sent to the group coordinator. When the group coordinator detects that the consumer has stopped heartbeating, it triggers a rebalance.
The number of subscribed topics changes. A Consumer Group can subscribe to topics with a regular expression. For example, consumer.subscribe(Pattern.compile("t.*c")) means the Group subscribes to all topics that start with the letter t and end with the letter c. If, while the Consumer Group is running, you create a new topic that matches this pattern, the Group will undergo a Rebalance.
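As a small sketch of such a regex subscription, assuming an already configured KafkaConsumer named consumer (the single-Pattern subscribe overload is available in recent client versions):

import java.util.regex.Pattern;

// Subscribe to every topic whose name starts with "t" and ends with "c",
// e.g. "topic" or "traffic"; topics created later that match the pattern
// will cause the group to Rebalance.
consumer.subscribe(Pattern.compile("t.*c"));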

The number of partitions of a subscribed topic changes. Kafka currently only allows increasing the number of partitions of a topic. When the partition count increases, every group subscribed to that topic is triggered to Rebalance.

Adding a broker, for example restarting a broker node.
Removing a broker, for example killing a broker node.
3. The Rebalance process

Rebalance is carried out by a consumer client in the group known as the "group leader". What is the group leader? It is the first consumer to join the group. When a consumer joins a group, it sends a JoinGroup request to the group coordinator; if it is the first to do so, that consumer is designated the group leader (much like the owner of a QQ group: simply the first person who joined).


The group leader obtains the list of group members from the group coordinator and then assigns partitions to each consumer. There are two allocation strategies: Range and RoundRobin.
The Range strategy assigns several consecutive partitions to each consumer. Given partitions 1-5 and 3 consumers, consumer 1 is responsible for partitions 1-2, consumer 2 for partitions 3-4, and consumer 3 for partition 5.
The RoundRobin strategy deals all partitions out to the consumers one at a time. Given partitions 1-5 and 3 consumers, partition 1 goes to consumer 1, partition 2 to consumer 2, partition 3 to consumer 3, partition 4 to consumer 1, and partition 5 to consumer 2.
After the group leader finishes the assignment, it sends the result to the group coordinator.
The group coordinator then forwards this information to each consumer. Every consumer sees only its own assignment; only the group leader knows the assignments of all consumers.
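For reference, the strategy is chosen through the consumer's partition.assignment.strategy property. A minimal sketch, with placeholder broker address and group id:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder group id
// Range is the default; switch to round-robin assignment like this:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        RoundRobinAssignor.class.getName());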
Finding the coordinator

When a Broker starts, the corresponding Coordinator component is created and started with it. In other words, every Broker has its own Coordinator component. So how does a Consumer Group determine which Broker hosts the Coordinator that serves it? The answer lies in Kafka's internal offsets topic, __consumer_offsets, which we mentioned before.

Currently, Kafka's algorithm for determining the Broker where a Consumer Group's Coordinator resides has two steps.

Step 1: Determine which partition of the offsets topic stores the Group's data: partitionId = Math.abs(groupId.hashCode() % offsetsTopicPartitionCount).

Step 2: Find the Broker where the leader replica of that partition resides; that Broker is the corresponding Coordinator.
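A small sketch of the arithmetic in step 1; the group id is a made-up example, and 50 is the broker default for offsets.topic.num.partitions:

String groupId = "demo-group";        // example group id
int offsetsTopicPartitionCount = 50;  // default offsets.topic.num.partitions
int partitionId = Math.abs(groupId.hashCode() % offsetsTopicPartitionCount);
// The Broker hosting the leader replica of this __consumer_offsets
// partition is the group's Coordinator.
System.out.println("__consumer_offsets partition: " + partitionId);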

4. The problems with Rebalance
First, during a Rebalance, all Consumer instances stop consuming and wait for the Rebalance to complete.
Second, the current design of Rebalance has all Consumer instances participate together, with all partitions reassigned. It would actually be more efficient to change the assignment as little as possible. For example, if instance A was responsible for consuming partitions 1, 2, and 3 before the Rebalance, then afterwards, where possible, it should continue to consume partitions 1, 2, and 3 rather than be reassigned other partitions. That way, instance A's TCP connections to the brokers hosting those partitions can keep being used, without re-creating Socket connections to other brokers.
Finally, Rebalance is too slow. One user abroad had several hundred Consumer instances in a group, and a single rebalance once took hours to complete! That is totally unbearable. Worst of all, the community currently has little to offer here; at least there is no particularly good solution yet.
As the saying goes, a cure is not as good as prevention: perhaps the best solution is to avoid Rebalance happening at all.
Throughout a Rebalance, no instance can consume any messages, so Rebalance has a big impact on the Consumer's TPS.

5. Avoiding Rebalance

Knowing the problems with Rebalance, we can see that reducing Rebalances improves the Consumer's overall TPS.

As mentioned earlier, Rebalance has three kinds of triggers. Of these, adding Consumer instances is a planned operation, perhaps driven by the need to increase TPS or improve scalability, so it is not the kind of Rebalance we need to avoid.

Heartbeats not sent in time

The first type of unnecessary Rebalance is caused by heartbeats not being sent in time, so that the Consumer gets "kicked out" of the group. You therefore need to set the values of session.timeout.ms and heartbeat.interval.ms carefully. Here are some recommended values that you can apply to your production environment without a second thought.

Set session.timeout.ms = 6s.
Set heartbeat.interval.ms = 2s.
Make sure the Consumer instance can send at least 3 rounds of heartbeat requests before being judged "dead", that is, session.timeout.ms >= 3 * heartbeat.interval.ms.
The main purpose of setting session.timeout.ms to 6s is to let the Coordinator locate a crashed Consumer faster. After all, we want to identify Consumers that hold a slot without doing any work and kick them out of the Group as soon as possible. Hopefully this configuration helps you avoid the first type of "unnecessary" Rebalance.
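Expressed as consumer properties, a minimal sketch of the two recommendations above (values in milliseconds):

import java.util.Properties;

Properties props = new Properties();
props.put("session.timeout.ms", "6000");     // judge a consumer dead after 6s of silence
props.put("heartbeat.interval.ms", "2000");  // 3 heartbeat rounds fit in one timeout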

Consumer processing takes too long

The second type of unnecessary Rebalance is caused by the Consumer taking too long to process messages. I once had a customer whose Consumers had to write each message to MongoDB after processing it. This is clearly heavy consumption logic: the slightest instability in MongoDB increases the Consumer program's processing time. Here the setting of the max.poll.interval.ms parameter is critical. To avoid unexpected Rebalances, you had better set it to a value somewhat longer than your downstream's maximum processing time. In the MongoDB example, if the longest write takes 7 minutes, you can set the parameter to about 8 minutes.
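Continuing the same sketch for the MongoDB scenario above, with 8 minutes expressed in milliseconds:

// Allow up to 8 minutes between poll() calls before the consumer is
// considered failed and its partitions are reassigned.
props.put("max.poll.interval.ms", String.valueOf(8 * 60 * 1000));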

If you have set these parameters according to the recommendations above and Rebalances still occur, I suggest checking GC behavior on the Consumer side, for example whether frequent Full GCs cause long pauses that then trigger a Rebalance. Why single out GC? Because in real scenarios I have seen far too many unexpected Rebalances caused by unreasonable GC settings driving the program into frequent Full GCs.

6. Handling Rebalance in applications

If Kafka triggers a rebalance, we need to commit the offset of the last processed record before the consumer loses ownership of a partition. If the consumer keeps a buffer for handling occasional events, it needs to process the records accumulated in the buffer before losing ownership of the partition. You may also need to close file handles, database connections, and so on.

The consumer API lets you run application code when partitions are assigned to or revoked from a consumer: pass a ConsumerRebalanceListener instance when calling the subscribe() method. ConsumerRebalanceListener has two methods that need to be implemented.

public void onPartitionsRevoked(Collection&lt;TopicPartition&gt; partitions) is called before the rebalance starts and after the consumer has stopped reading messages. If offsets are committed here, the next consumer to take over the partition knows where to start reading.
public void onPartitionsAssigned(Collection&lt;TopicPartition&gt; partitions) is called after partitions have been reassigned and before the consumer starts reading messages.
// Fragment: assumes consumer, topics and log are defined elsewhere.
private Map<TopicPartition, OffsetAndMetadata> currentOffsets =
    new HashMap<>();

private class HandleRebalance implements ConsumerRebalanceListener {
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Nothing to do when new partitions are assigned.
    }

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        System.out.println("Lost partitions in rebalance. " +
            "Committing current offsets: " + currentOffsets);
        // Commit synchronously so the next owner of these partitions
        // knows exactly where to resume reading.
        consumer.commitSync(currentOffsets);
    }
}

try {
    consumer.subscribe(topics, new HandleRebalance());

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("topic = %s, partition = %s, offset = %d, " +
                "customer = %s, country = %s%n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            // Track the next offset to read for each partition.
            currentOffsets.put(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset() + 1, "no metadata"));
        }
        consumer.commitAsync(currentOffsets, null);
    }
} catch (WakeupException e) {
    // Ignore: the consumer is being shut down.
} catch (Exception e) {
    log.error("Unexpected error", e);
} finally {
    try {
        consumer.commitSync(currentOffsets);
    } finally {
        consumer.close();
        System.out.println("Closed consumer and we are done");
    }
}