[Troubleshooting] Notes on a kafka.common.ConsumerRebalanceFailedException

I have been using Kafka recently and ran into quite a few exceptions while sending and receiving data. One of them was particularly puzzling:

Exception in thread "main" kafka.common.ConsumerRebalanceFailedException: groupB_ip-10-38-19-230-1414174925481-97fa3f2a can't rebalance after 4 retries
        at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
        at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
        at kafka.consumer.ZookeeperConsumerConnector.consume(ZookeeperConsumerConnector.scala:212)
        at kafka.javaapi.consumer.Zookeeper……

Debugging showed that on the consumer side, execution reached this line:

Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =   this.consumer
               .createMessageStreams(topicCountMap);

and then simply hung there before throwing the exception above. Searching online turned up a suggestion to increase the consumer's zookeeper.sync.time.ms property, but after trying that the problem remained.
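For context, the consumer in question uses the old ZooKeeper-based high-level consumer API. The sketch below is a minimal, hypothetical reconstruction of that setup (the ZooKeeper address, group id, and topic name are placeholders, not the original values); the final call is the one that hung:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");    // placeholder ZK address
        props.put("group.id", "groupB");                      // placeholder group id
        props.put("zookeeper.session.timeout.ms", "5000");
        props.put("zookeeper.sync.time.ms", "200");
        props.put("auto.commit.interval.ms", "1000");

        ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for one stream for the topic; this is the call that hung and
        // eventually threw ConsumerRebalanceFailedException.
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("my-topic", 1);                     // placeholder topic name

        Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
                consumer.createMessageStreams(topicCountMap);
    }
}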

Eventually I found a more reliable solution at the following address:

https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Myconsumerseemstohavestopped,why?

Quoting the original English text:

  • consumer rebalancing fails (you will see ConsumerRebalanceFailedException): This is due to conflicts when two consumers are trying to own the same topic partition. The log will show you what caused the conflict (search for "conflict in ").
    • If your consumer subscribes to many topics and your ZK server is busy, this could be caused by consumers not having enough time to see a consistent view of all consumers in the same group. If this is the case, try Increasing rebalance.max.retries and rebalance.backoff.ms.
    • Another reason could be that one of the consumers is hard killed. Other consumers during rebalancing won't realize that consumer is gone after zookeeper.session.timeout.ms time. In the case, make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.

I then tried the suggestion about rebalance.max.retries and rebalance.backoff.ms, setting the two properties on the consumer side as follows:

props.put("rebalance.max.retries", "5");
props.put("rebalance.backoff.ms", "1200");

and made sure that 5 * 1200 = 6000 is greater than the value of zookeeper.session.timeout.ms (5000 in my case). After starting the producer and the consumer again, the problem was indeed gone. The relationship between these settings is sketched below.
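To make that arithmetic explicit, here is a small, hypothetical sketch that fails fast if the FAQ's rule of thumb is violated (the values are the ones used in this post):

import java.util.Properties;

public class RebalanceConfigCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.session.timeout.ms", "5000");
        props.put("rebalance.max.retries", "5");
        props.put("rebalance.backoff.ms", "1200");

        long retries = Long.parseLong(props.getProperty("rebalance.max.retries"));
        long backoff = Long.parseLong(props.getProperty("rebalance.backoff.ms"));
        long sessionTimeout = Long.parseLong(props.getProperty("zookeeper.session.timeout.ms"));

        // Rule of thumb from the Kafka FAQ: the total time spent retrying a rebalance
        // (5 * 1200 = 6000 ms here) must exceed the ZooKeeper session timeout (5000 ms),
        // so that a hard-killed consumer's registration has time to expire.
        if (retries * backoff <= sessionTimeout) {
            throw new IllegalStateException(
                    "rebalance.max.retries * rebalance.backoff.ms must be greater than "
                            + "zookeeper.session.timeout.ms");
        }
        System.out.println("rebalance settings OK: " + (retries * backoff) + " > " + sessionTimeout);
    }
}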

Note: the producer's metadata.broker.list property should preferably list more than one broker, which also means the load has to be spread across those brokers (see the sketch below).
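As an illustration only, here is a hypothetical old-style producer configured with two brokers in metadata.broker.list (hostnames and topic are placeholders):

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class MultiBrokerProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // List more than one broker so metadata can still be fetched
        // when one of them is unavailable (hostnames are placeholders).
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("my-topic", "key", "hello"));
        producer.close();
    }
}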

PS: Making sense of these Kafka exceptions really requires a clear understanding of how Kafka works internally, but I did not have that much time, so this was a quick fix to get things running again.


Reposted from raising.iteye.com/blog/2240277