Consumer rebalance failure: problem diagnosis and solution ideas

Background

Recently we have run into some problems using Kafka's Consumer high-level API at the company. The problem is described as follows:
In the Elephant push queue, during the peak message-sending period (roughly 8:00 to 10:00 every day), the load on the consumer nodes is high and JVM memory usage exceeds 80%. The JVM then keeps running frequent Full GCs (stop-the-world), so asynchronous threads such as the heartbeat thread stall and cannot send or receive packets normally. The expired ZK session then triggers a Consumer rebalance, and because ZK sessions keep expiring, rebalances are triggered repeatedly. During the rebalance, conflicts when writing ephemeral nodes cause the consumer rebalance to fail, and the shard data cannot be allocated (that is, the Consumer goes offline).

Official explanation of rebalance failures:
consumer rebalancing fails (you will see ConsumerRebalanceFailedException): This is due to conflicts when two consumers are trying to own the same topic partition. The log will show you what caused the conflict (search for "conflict in ").
If your consumer subscribes to many topics and your ZK server is busy, this could be caused by consumers not having enough time to see a consistent view of all consumers in the same group. If this is the case, try increasing rebalance.max.retries and rebalance.backoff.ms.
Another reason could be that one of the consumers is hard killed. Other consumers during rebalancing won't realize that consumer is gone after zookeeper.session.timeout.ms time. In this case, make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.

The following configuration will cause rebalance to fail:
rebalance.max.retries * rebalance.backoff.ms < zookeeper.session.timeout.ms
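For illustration, a hedged Java example of consumer properties that satisfy this guidance; the ZK addresses, group id, and the specific values are assumptions, not taken from this article:

```java
import java.util.Properties;

// Hedged example: consumer properties that satisfy
// rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.
// The ZK quorum, group id and concrete values below are assumptions.
public class RebalanceConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // assumed ZK quorum
        props.put("group.id", "elephant-push-group");                 // assumed group id
        props.put("zookeeper.session.timeout.ms", "6000");
        props.put("rebalance.max.retries", "10");
        props.put("rebalance.backoff.ms", "2000");  // 10 * 2000 ms = 20000 ms > 6000 ms

        long retries = Long.parseLong(props.getProperty("rebalance.max.retries"));
        long backoff = Long.parseLong(props.getProperty("rebalance.backoff.ms"));
        long session = Long.parseLong(props.getProperty("zookeeper.session.timeout.ms"));
        System.out.println("safe configuration: " + (retries * backoff > session)); // true
    }
}
```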

Why does the Elephant push queue JVM use so much memory?

The consumer pushes messages to the downstream system asynchronously, and each outstanding request is kept in an in-memory map. When a downstream ack is received, the request is removed from the map; if no ack arrives within 5 s, the request is removed automatically. The current flow control is triggered when the memory map exceeds 2000 requests.
Reasons for the Elephant JVM high-memory problem:
The Elephant mafka push client uses a lot of memory and the mafka client goes offline. The high memory usage was initially attributed to the push rate exceeding the downstream processing capacity, which causes push requests to accumulate in Netty's memory buffers. This is confirmed by the large gap between the request push rate (17k/s) and the ack rate (10k/s) during peak periods.

Target

Ensure that active consumer nodes can take over the data shards of failed consumer nodes, and that failed consumer nodes release the shard resources they hold.
Improve the alerting mechanism to monitor and collect abnormal rebalance conditions.

Analysis

Create a topic topicXXX with 6 partitions as shown in the figure below. The partitions assigned to each consumer node in the consumer group are as follows.
[Figure: partition assignment of each consumer node in the group]
The structure consumers store on ZK:
[Figure: consumer registration structure in ZooKeeper]

Consumer control logic

At present, the Consumer rebalance control strategy is carried out by each Consumer itself through ZooKeeper:
1. When each Consumer client starts, it registers itself under its own Consumer group in ZK.
2. It watches changes of the Consumers under the Consumer group.
3. It watches changes of the Topic.
4. It watches for ZK session timeout.
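A minimal ZooKeeper-client sketch of this registration-and-watch logic, assuming the standard znode layout of the old high-level consumer (/consumers/<group>/ids and /brokers/topics/<topic>/partitions) and placeholder connection values; it is illustrative only, not the actual Kafka client code:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of steps 1-3 above using the plain ZooKeeper client
// (not the actual Kafka consumer code). It assumes the parent znodes
// already exist and uses placeholder connection values.
public class ConsumerZkRegistration {
    public static void main(String[] args) throws Exception {
        String group = "group1";            // assumed group name
        String consumerId = "consumer1-0";  // assumed consumer id
        ZooKeeper zk = new ZooKeeper("localhost:2181", 6000,
                event -> System.out.println("session event: " + event)); // step 4: session events

        // step 1: register itself under /consumers/<group>/ids as an ephemeral node
        zk.create("/consumers/" + group + "/ids/" + consumerId, "topicXXX".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // step 2: watch membership changes of the consumer group
        zk.getChildren("/consumers/" + group + "/ids",
                event -> System.out.println("group membership changed -> rebalance"));

        // step 3: watch partition changes of the topic
        zk.getChildren("/brokers/topics/topicXXX/partitions",
                event -> System.out.println("topic partitions changed -> rebalance"));

        Thread.sleep(Long.MAX_VALUE); // keep the session (and ephemeral node) alive
    }
}
```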

Consumer logic description

a. For Consumers in the same Consumer Group, Kafka delivers each message in the corresponding Topic to only one of the Consumers.
b. Each Consumer in the Consumer Group reads one or more Partitions of the Topic and is the only Consumer reading those Partitions.
c. The threads of all Consumers in a Consumer Group consume all Partitions of a topic in order; if the total number of threads across all Consumers in the group is greater than the number of Partitions, some threads will not be assigned any Partition and will stay idle.
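A hedged sketch of the old high-level (ZooKeeper-based) consumer API these rules apply to; the connection values, topic name, and thread count are assumptions. With 6 partitions and 8 requested threads, 8 streams are returned but only 6 ever receive data, so 2 threads stay idle (rule c):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

// Hedged sketch of the old high-level (ZooKeeper-based) consumer API.
// Connection values, topic name and thread count are assumptions.
public class HighLevelConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // assumed ZK address
        props.put("group.id", "group1");                  // assumed group id
        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // request 8 consumer threads for topicXXX, which has only 6 partitions;
        // 8 streams come back, but only 6 are backed by a partition (rule c)
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("topicXXX", 8));
        System.out.println("streams: " + streams.get("topicXXX").size()); // 8
    }
}
```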

Consumer rebalance algorithm

When a consumer joins or leaves a group, a partition rebalance is triggered. The ultimate goal of the rebalance is to improve the topic's parallel consumption capability.
1) Suppose topic1 has the following partitions: P0, P1, P2, P3
2) The group has the following consumers: C0, C1
3) First sort the partitions by partition index: P0, P1, P2, P3
4) Sort the consumers by (consumer.id + '-' + thread number): C0, C1
5) Compute the multiple: M = [P0,P1,P2,P3].size / [C0,C1].size; in this example M = 2 (rounded up)
6) Then assign the partitions in order: C0 = [P0,P1], C1 = [P2,P3], i.e. Ci = [P(i*M), P((i+1)*M - 1)]
Under this strategy, any addition or removal of a consumer or broker triggers a consumer rebalance. Because each consumer only adjusts the partitions it consumes itself, to keep the whole consumer group consistent, when one consumer triggers a rebalance, all the other consumers in the same consumer group should trigger a rebalance at the same time.
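The arithmetic of steps 3) to 6) can be made explicit with a small standalone sketch; this is not the Kafka source code, just the ceil-based range rule described above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch of the range-assignment rule described above:
// Ci = [P(i*M), P((i+1)*M - 1)], with M = ceil(#partitions / #consumer threads).
public class RangeAssignmentDemo {
    public static void main(String[] args) {
        List<Integer> partitions = new ArrayList<>(List.of(0, 1, 2, 3)); // P0..P3
        List<String> consumers = new ArrayList<>(List.of("C0", "C1"));   // consumer.id + "-" + thread no.

        Collections.sort(partitions); // step 3: sort partitions by index
        Collections.sort(consumers);  // step 4: sort consumer threads

        int m = (int) Math.ceil((double) partitions.size() / consumers.size()); // step 5: M = 2 here

        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (int i = 0; i < consumers.size(); i++) { // step 6: assign ranges in order
            int from = i * m;
            int to = Math.min(from + m, partitions.size());
            assignment.put(consumers.get(i), partitions.subList(from, to));
        }
        System.out.println(assignment); // {C0=[0, 1], C1=[2, 3]}
    }
}
```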

Solution ideas

Why can't active consumer nodes be assigned the shard data of offline consumer nodes?

Let's analyze the reasons. Currently a rebalance is triggered in three ways: first, the Topic-to-partition mapping changes; second, the number of consumer nodes in the same subscription group changes; third, the ZK session expires.
In the first two cases the rebalance operation is performed directly. In the last case, the consumer node re-registers its ephemeral znode on ZK and then performs the rebalance. A failed rebalance causes the consumer node to go offline, and the other active consumer nodes also cannot be assigned its shard data. The root cause is that when a consumer node goes offline because the rebalance failed, it does not delete its own ephemeral node under the ZK Consumer Group; since each consumer node's shard data is assigned by the calculation rule according to the number of nodes under the Consumer Group, the active consumer nodes cannot be assigned the shard data.
Simple calculation rule for assigning shard data to consumer nodes: shards per consumer node = total shards / number of consumer nodes

Re-registering the consumer node after session expiry: the consumer retries indefinitely until registration succeeds, sleeping for a certain amount of time between retries.

Rebalance: if the shards cannot be assigned within a limited number of retries (a configured threshold), the shard assignment fails.

After many rounds of testing and verification, we found that a rebalance fails when multiple consumer nodes conflict while registering ephemeral nodes on ZK, or when the ZK data synchronization delay becomes much longer than (rebalance.max.retries * rebalance.backoff.ms). If a rebalance is triggered again and succeeds, the shard data is assigned and consumed, so a rebalance failure is actually an intermediate state.

Solution ideas - Option 1

1. Whenever a consumer node goes offline because a rebalance failed, delete its own registered ephemeral node, then trigger the rebalance again until the shard data is successfully assigned; if the assignment still fails, raise an alert (see the sketch after this list).
2. When the first two kinds of Watcher trigger a rebalance, check and re-register the consumer node under the Consumer Group so that it can obtain shard data again.
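A hypothetical sketch of idea 1, assuming the consumer's own id path and a rebalance callback are available; the names and the alerting call are illustrative, not the actual mafka/Kafka client code:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of idea 1: on rebalance failure, delete the consumer's own
// ephemeral node so active nodes compute shards-per-node against the real member
// count, then keep re-triggering the rebalance and alert if it never succeeds.
public class RebalanceFailureHandler {
    private final ZooKeeper zk;
    private final String ownIdPath; // e.g. /consumers/<group>/ids/<consumerId>

    public RebalanceFailureHandler(ZooKeeper zk, String ownIdPath) {
        this.zk = zk;
        this.ownIdPath = ownIdPath;
    }

    public void onRebalanceFailed(Runnable rebalance, int maxRetries, long backoffMs)
            throws InterruptedException, KeeperException {
        // step 1: remove this node's own ephemeral registration
        zk.delete(ownIdPath, -1);

        // step 2: re-trigger the rebalance until the shards are assigned
        for (int i = 0; i < maxRetries; i++) {
            try {
                rebalance.run();
                return; // shard data assigned successfully
            } catch (RuntimeException e) {
                Thread.sleep(backoffMs); // back off before the next attempt
            }
        }
        // step 3: still failing after the retry budget -> raise an alert
        System.err.println("ALERT: shard assignment failed after " + maxRetries + " retries");
    }
}
```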

Note

Coordinating the rebalance through ZK like this is not an optimal solution, and split-brain situations can occur. According to the official Kafka documentation, the Kafka authors are developing a central coordinator for version 0.9.x. The general idea is to elect a broker as the coordinator, which watches ZooKeeper to detect increases or decreases of partitions or consumers and then issues rebalance commands, checking whether the rebalance is executed successfully by all related consumers. If not, it retries; if so, the rebalance is considered successful.
