Summary of Kafka Platform Issues

A summary of the problems we ran into while operating our Kafka platform.

Network rate limiting

      Some newly added brokers had, for unrelated reasons, been rate-limited on the switch, so their outbound traffic topped out at a little over 300 Mb/s. The problem is well hidden and not easy to notice. The fix is just as simple: remove the rate limit on the switch.
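One quick way to confirm a cap like this is to stream a known volume of data to another host and compute the achieved rate. The sketch below is only an illustration, not something from the original post; the target host, port, and the suggestion to run something like nc -l 9999 > /dev/null as the sink on the far side are all assumptions.

import java.io.OutputStream;
import java.net.Socket;

public class ThroughputProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder target -- point this at another machine running a TCP sink.
        String host = args.length > 0 ? args[0] : "broker-host";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9999;

        byte[] chunk = new byte[1 << 20];        // 1 MiB of zeros per write
        long totalBytes = 1024L * chunk.length;  // send 1 GiB in total

        long start = System.nanoTime();
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            for (long sent = 0; sent < totalBytes; sent += chunk.length) {
                out.write(chunk);
            }
            out.flush();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double megabits = totalBytes * 8 / 1e6;
        // On a healthy 1 GbE link this should report far more than 300 Mb/s.
        System.out.printf("sent %.0f Mb in %.1f s -> %.1f Mb/s%n",
                megabits, seconds, megabits / seconds);
    }
}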

ZooKeeper connection limit

      We run a thin wrapper around the Kafka client, and some versions of that wrapper have a bug that leaks ZooKeeper connections. The leak quickly exceeds the per-IP limit of 60 connections (ZooKeeper's maxClientCnxns), after which new connections to ZooKeeper start to fail. The per-client-IP connection counts on one ZooKeeper node looked like this:

a01.zookeeper.kafka.javagc
     60 192.168.200.36
     60 192.168.200.193
     60 192.168.200.19
     59 192.168.200.35
     59 192.168.200.194
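The leak itself was in our wrapper, so the real fix belongs there, but the general pattern is simply to make sure every ZooKeeper client that gets created is also closed. A minimal sketch, assuming the commonly used org.I0Itec.zkclient.ZkClient and a hypothetical holder class of our own:

import org.I0Itec.zkclient.ZkClient;

// Hypothetical wrapper: hold a single shared ZkClient and always close it,
// instead of creating a new client per call and dropping the reference.
public class ZkClientHolder implements AutoCloseable {
    private final ZkClient zkClient;

    public ZkClientHolder(String zkConnect) {
        // Session and connection timeouts of 6000 ms are illustrative values.
        this.zkClient = new ZkClient(zkConnect, 6000, 6000);
    }

    public ZkClient client() {
        return zkClient;
    }

    @Override
    public void close() {
        // Releases the ZooKeeper session and its underlying TCP connection.
        zkClient.close();
    }
}

Each live ZkClient holds one TCP connection to the ensemble, so a handful of leaking instances per host is enough to reach the 60-connection cap (maxClientCnxns in zoo.cfg).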

ConsumerRebalanceFailedException

    Consumer rebalancing fails (you will see ConsumerRebalanceFailedException): this is due to conflicts when two consumers try to own the same topic partition. The log will show you what caused the conflict (search for "conflict in ").
    If your consumer subscribes to many topics and your ZK server is busy, this can be caused by consumers not having enough time to see a consistent view of all consumers in the same group. If this is the case, try increasing rebalance.max.retries and rebalance.backoff.ms.
    Another reason can be that one of the consumers was hard killed. The other consumers will not notice during rebalancing that it is gone until zookeeper.session.timeout.ms has elapsed. In that case, make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.
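A minimal sketch of how these three settings relate in the old (pre-0.9) high-level consumer; the ZooKeeper addresses, group id, and the concrete values are only illustrative, not taken from our cluster:

import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class RebalanceTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // placeholder ensemble
        props.put("group.id", "demo-group");                          // placeholder group

        // Illustrative values: 10 retries * 2000 ms backoff = 20 s of retrying,
        // which is larger than the 6 s ZooKeeper session timeout, so a hard-killed
        // consumer's ephemeral nodes expire before the group gives up rebalancing.
        props.put("zookeeper.session.timeout.ms", "6000");
        props.put("rebalance.max.retries", "10");
        props.put("rebalance.backoff.ms", "2000");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // ... create message streams and consume; call connector.shutdown() when done.
    }
}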

https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Myconsumerseemstohavestopped,why?

Analysis of the problem: http://blog.csdn.net/lizhitao/article/details/49589825

The consumer client in Kafka versions before 0.9 is not well designed; we recommend upgrading to 0.9+, where this part was redesigned.

Consumer Client Re-Design

https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
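For reference, here is a minimal sketch of the redesigned 0.9+ consumer API (the broker addresses, group id, and topic name are placeholders). Group membership and partition assignment move from ZooKeeper into a broker-side group coordinator, which removes the ZooKeeper-based rebalancing problems described above:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers
        props.put("group.id", "demo-group");                         // placeholder group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // The 0.9+ consumer talks only to the brokers; it no longer connects
        // to ZooKeeper at all.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000); // timeout in ms (0.9 API)
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}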

NotLeaderForPartitionException

kafka.common.NotLeaderForPartitionException: null
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.7.0_76]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) ~[na:1.7.0_76]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.7.0_76]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ~[na:1.7.0_76]
        at java.lang.Class.newInstance(Class.java:379) ~[na:1.7.0_76]
        at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:70) ~[stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4$$anonfun$apply$5.apply(AbstractFetcherThread.scala:157) ~[stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4$$anonfun$apply$5.apply(AbstractFetcherThread.scala:157) ~[stormjar.jar:na]
        at kafka.utils.Logging$class.warn(Logging.scala:88) [stormjar.jar:na]
        at kafka.utils.ShutdownableThread.warn(ShutdownableThread.scala:23) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:156) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:112) [stormjar.jar:na]
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:105) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:112) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88) [stormjar.jar:na]
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51) [stormjar.jar:na]

 The Kafka server hangs; check every path configured in the log.dirs parameter. A failed or damaged disk can cause this.
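A small sketch of that check (the path to server.properties is a placeholder): read log.dirs from the broker configuration and flag any directory that is missing or no longer writable.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class LogDirsCheck {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Placeholder path -- point this at the broker's server.properties.
        try (FileInputStream in = new FileInputStream("/opt/kafka/config/server.properties")) {
            props.load(in);
        }

        // Fall back to log.dir if log.dirs is not set, mirroring the broker's behavior.
        String logDirs = props.getProperty("log.dirs", props.getProperty("log.dir", ""));
        for (String dir : logDirs.split(",")) {
            File f = new File(dir.trim());
            // A broken disk usually shows up as a path that is missing,
            // not a directory, or no longer readable/writable.
            boolean ok = f.isDirectory() && f.canRead() && f.canWrite();
            System.out.printf("%-40s %s%n", f.getPath(), ok ? "OK" : "CHECK THIS DISK");
        }
    }
}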

Periodic network interruptions

      One Kafka machine lost network connectivity for about two minutes on a roughly two-hour cycle, causing a leader switch each time. We never found the root cause; most likely a switch or some other device in the path was imposing a limit, so we simply took the machine out of service.


Reposted from woodding2008.iteye.com/blog/2257175