Kafka platform problem summary

A summary of the problems encountered while operating the Kafka platform.

Network limitation

      The newly expanded machines had traffic restrictions on their switches for unrelated reasons, so their outbound throughput could only reach 300+ Mb. This problem is well hidden and hard to discover. The fix is also simple: remove the rate limit on the switch.

ZooKeeper connection limit

      A thin wrapper was built on top of the Kafka client. Some versions of this wrapper have bugs that leak ZooKeeper connections, making it easy to exceed ZooKeeper's default limit of 60 connections per client IP (maxClientCnxns), which eventually causes access to ZooKeeper to fail. A per-IP count of connections on the ZooKeeper host showed several clients at or right below that limit:

a01.zookeeper.kafka.javagc
     60 192.168.200.36
     60 192.168.200.193
     60 192.168.200.19
     59 192.168.200.35
     59 192.168.200.194
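
To avoid this kind of leak, every client instance must be closed once it is no longer needed. Below is a minimal Java sketch of the correct pattern using the plain org.apache.zookeeper.ZooKeeper client (the wrapper's internals are not reproduced here; the ensemble address is the host from the listing above and is only an example):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConnectionExample {
        public static void main(String[] args) throws Exception {
            // Each ZooKeeper client instance holds one TCP connection to the ensemble.
            // The leak pattern is creating new clients without closing the old ones,
            // which eventually hits the per-IP maxClientCnxns limit (60 by default).
            ZooKeeper zk = new ZooKeeper("a01.zookeeper.kafka.javagc:2181", 30000,
                    new Watcher() {
                        @Override
                        public void process(WatchedEvent event) {
                            // connection state events arrive here; ignored in this sketch
                        }
                    });
            try {
                System.out.println("znodes under /: " + zk.getChildren("/", false));
            } finally {
                zk.close(); // always close, otherwise the connection leaks
            }
        }
    }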

 

 

 

ConsumerRebalanceFailedException

     Consumer rebalancing fails (you will see ConsumerRebalanceFailedException): this is due to conflicts when two consumers try to own the same topic partition. The log will show what caused the conflict (search for "conflict in ").
     If your consumer subscribes to many topics and your ZK server is busy, this could be caused by consumers not having enough time to see a consistent view of all consumers in the same group. If this is the case, try increasing rebalance.max.retries and rebalance.backoff.ms.
     Another reason could be that one of the consumers was hard-killed. The other consumers will not realize during rebalancing that this consumer is gone until zookeeper.session.timeout.ms has passed. In that case, make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.
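
A sketch of how these settings fit together on the old (pre-0.9) high-level consumer. The ZooKeeper address, group id, and concrete values below are placeholders; the point is keeping the product of the retry settings above the session timeout:

    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class RebalanceTuning {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "a01.zookeeper.kafka.javagc:2181"); // placeholder
            props.put("group.id", "example-group");

            // Keep rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms
            // so a hard-killed consumer's session can expire before the retries run out.
            props.put("zookeeper.session.timeout.ms", "6000");
            props.put("rebalance.backoff.ms", "2000");
            props.put("rebalance.max.retries", "10");   // 10 * 2000 ms = 20 s > 6 s

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            // ... create message streams and consume; call connector.shutdown() when done
        }
    }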

https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Myconsumerseemstohavestopped,why?

Problem analysis: http://blog.csdn.net/lizhitao/article/details/49589825

 

The design of the consumer client before Kafka 0.9 is not very good; it is recommended to upgrade to 0.9+, where this part has been redesigned.

Consumer Client Re-Design

https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
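
For comparison, a minimal sketch of the redesigned consumer API available in 0.9+. The broker address, group id, and topic name are placeholders:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class NewConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
            props.put("group.id", "example-group");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            // The 0.9+ consumer talks to the brokers directly; group membership and
            // rebalancing are coordinated by a broker rather than by ZooKeeper logic
            // inside each client.
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            try {
                consumer.subscribe(Collections.singletonList("example-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s-%d offset=%d value=%s%n",
                                record.topic(), record.partition(), record.offset(), record.value());
                    }
                }
            } finally {
                consumer.close();
            }
        }
    }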

 

 

 

NotLeaderForPartitionException

 

kafka.common.NotLeaderForPartitionException: null
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.7.0_76]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) ~[na:1.7.0_76]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.7.0_76]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ~[na:1.7.0_76]
        at java.lang.Class.newInstance(Class.java:379) ~[na:1.7.0_76]
        at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:70) ~[stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4$$anonfun$apply$5.apply(AbstractFetcherThread.scala:157) ~[stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4$$anonfun$apply$5.apply(AbstractFetcherThread.scala:157) ~[stormjar.jar:na]
        at kafka.utils.Logging$class.warn(Logging.scala:88) [stormjar.jar:na]
        at kafka.utils.ShutdownableThread.warn(ShutdownableThread.scala:23) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:156) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$4.apply(AbstractFetcherThread.scala:112) [stormjar.jar:na]
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:105) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:112) [stormjar.jar:na]
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88) [stormjar.jar:na]
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51) [stormjar.jar:na]

      The Kafka server hangs: check all paths configured in the log.dirs parameter; this can happen when a disk is damaged.
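
A small sketch of that check, assuming the broker configuration lives at /etc/kafka/server.properties (adjust the path to your deployment):

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;

    public class LogDirsCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            try (FileInputStream in = new FileInputStream("/etc/kafka/server.properties")) {
                props.load(in);
            }
            // log.dirs is a comma-separated list; fall back to log.dir if it is absent.
            String logDirs = props.getProperty("log.dirs", props.getProperty("log.dir", ""));
            for (String dir : logDirs.split(",")) {
                File f = new File(dir.trim());
                // A damaged disk usually shows up as a path that is missing,
                // not a directory, or no longer writable.
                boolean ok = f.isDirectory() && f.canWrite();
                System.out.println(dir.trim() + " -> " + (ok ? "OK" : "CHECK THIS PATH"));
            }
        }
    }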

 

 

 

Periodic network interruptions

      One Kafka machine lost communication for 2 minutes on a roughly 2-hour cycle, causing leader switchovers. The root cause could not be found; it was most likely a restriction imposed by a switch or some other device. The machine was simply taken offline to resolve the issue.

 

 

 
