kafka.common.ConsumerRebalanceFailedException

kafka.common.ConsumerRebalanceFailedException: log-push-record-consumer-group_mobile-pushremind02.lf.xxx.com-1399456594831-99f15e63 can't rebalance after 3 retries
    at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(Unknown Source)
    at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(Unknown Source)
    at kafka.consumer.ZookeeperConsumerConnector.consume(Unknown Source)
    at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreams(Unknown Source)
    at com.xxx.mafka.client.consumer.DefaultConsumerProcessor.getKafkaStreams(DefaultConsumerProcessor.java:149)
    at com.xxx.mafka.client.consumer.DefaultConsumerProcessor.recvMessage(DefaultConsumerProcessor.java:63)
    at com.xxx.service.mobile.push.kafka.MafkaPushRecordConsumer.main(MafkaPushRecordConsumer.java:22)
    at com.xxx.service.mobile.push.Bootstrap.main(Bootstrap.java:34)

 

Analysis of the cause of the above problem:

Multiple consumers in the same consumer group were started one after another, i.e. several consumers in one group consume data from the topic's partitions at the same time, and each newly started consumer triggers another rebalance.

Solution:

1. Adjust the ZooKeeper-related settings in the Kafka consumer configuration:

zookeeper.session.timeout.ms=5000

zookeeper.connection.timeout.ms=10000

rebalance.backoff.ms=2000

rebalance.max.retries=10
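
As a reference, here is a minimal sketch (not the author's original code) of how these settings could be passed to the old high-level, ZooKeeper-based consumer. The ZooKeeper address and topic name are placeholders; the group id is taken from the error message above.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class RebalanceTunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // placeholder ZK address
        props.put("group.id", "log-push-record-consumer-group");      // group id from the error above
        // Settings from the solution above.
        props.put("zookeeper.session.timeout.ms", "5000");
        props.put("zookeeper.connection.timeout.ms", "10000");
        props.put("rebalance.backoff.ms", "2000");
        props.put("rebalance.max.retries", "10");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // createMessageStreams registers the consumer in ZooKeeper and triggers the rebalance.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("your-topic", 1)); // placeholder topic
    }
}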

 

 

When using the high-level consumer API, this problem usually occurs because the zookeeper.sync.time.ms interval is configured too short (rebalance.backoff.ms defaults to the value of zookeeper.sync.time.ms when it is not set explicitly). Other causes are not ruled out, but this is the one the author ran into.

Let me explain why. In a consumer group where the number of consumers is smaller than the number of partitions, a rebalance is triggered whenever the set of consumers changes. The first step is to release the consumer's resources: the ConsumerFetcherThread is shut down, all connections to the Kafka brokers are closed, and ownership of the currently consumed partitions is given up, which in practice means deleting the ephemeral znodes /consumers/[group]/owners/[topic]/[0-n]. Then every consumer in the group computes which partitions it should now consume and registers the corresponding ephemeral node to claim them; holding that node means "I own the consumption of this partition", and no other consumer may use it.
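
To see the ownership znodes described above, one can list them with the plain ZooKeeper Java client. This is only an illustrative sketch: the ZooKeeper address, group and topic are placeholders, and the path assumes the default layout without a chroot.

import org.apache.zookeeper.ZooKeeper;

public class OwnerNodeDump {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, event -> { }); // placeholder ZK address
        String ownersPath = "/consumers/log-push-record-consumer-group/owners/your-topic"; // placeholder group/topic

        // Each child znode is a partition id; its data is the id of the consumer
        // thread that currently owns that partition.
        for (String partition : zk.getChildren(ownersPath, false)) {
            byte[] data = zk.getData(ownersPath + "/" + partition, false, null);
            System.out.println("partition " + partition + " owned by " + new String(data));
        }
        zk.close();
    }
}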

 

If you understand the explanation above, the rest is easy. When a consumer performs a rebalance, it follows a fail-and-retry policy governed by a retry interval and a maximum number of retries: whenever it fails to claim its partitions, it waits and tries again. As an analogy, suppose department B has booked a meeting room for a certain time, but when it arrives, department A is still using it. Department B can only wait and check back every so often. If it checks too frequently, the room always appears occupied; if the interval between checks is a bit longer, department A will likely have vacated the room by the second visit.

 

Similarly, when a new consumer joins and triggers another rebalance, the existing (old) consumers recalculate their assignments and release the partitions they occupy, but this takes a certain amount of processing time. During that window the new consumer is likely to fail when it tries to claim those partitions. We can assume that if the old consumers are given enough time to release their resources, this problem does not occur.

 

Official explanation:

consumer rebalancing fails (you will see ConsumerRebalanceFailedException): This is due to conflicts when two consumers are trying to own the same topic partition. The log will show you what caused the conflict (search for "conflict in ").

  • If your consumer subscribes to many topics and your ZK server is busy, this could be caused by consumers not having enough time to see a consistent view of all consumers in the same group. If this is the case, try increasing rebalance.max.retries and rebalance.backoff.ms.
  • Another reason could be that one of the consumers is hard killed. Other consumers during rebalancing won't realize that consumer is gone after zookeeper.session.timeout.ms time. In this case, make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.

 

 

If rebalance.backoff.ms is set too short, the old consumer will not have had time to release its resources, and the new consumer will keep failing its retries until it reaches the threshold and exits with the exception above.

 

Make sure rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms
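
With the values suggested above this condition holds: rebalance.max.retries * rebalance.backoff.ms = 10 * 2000 ms = 20000 ms, which is well above the zookeeper.session.timeout.ms of 5000 ms, so the ephemeral nodes of a hard-killed consumer will have expired before the retries are exhausted.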

 

For details of how Kafka lays out its nodes in ZooKeeper, refer to: kafka storage structure in zookeeper

 
