An Adventure Upgrading a Kafka Cluster on Alibaba Cloud

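       I needed to run Debezium in production. The production Kafka was at version 1.0.0 while the latest Kafka release was already 2.0, so I decided to upgrade. I followed the upgrade procedure on the official Kafka website, but once the upgrade finished the cluster was unusable.
       All three brokers started without errors, and listeners and advertised.listeners in server.properties were the same as before. Still, the consumer applications kept reporting a "Broker not available" error, and producing and consuming from the console tools failed as well.
       Kafka's server.log showed: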

[2018-08-15 13:00:00,277] WARN [Controller id=2, targetBrokerId=0] Connection to node 0 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-08-15 13:00:00,277] WARN [Controller id=2, targetBrokerId=1] Connection to node 1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-08-15 13:00:00,289] WARN [Controller id=2, targetBrokerId=2] Connection to node 2 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-08-15 13:00:00,379] WARN [Controller id=2, targetBrokerId=0] Connection to node 0 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

       That pointed to a problem with the controller, so I checked the Kafka controller log:

[2018-08-15 13:00:00,074] WARN [RequestSendThread controllerId=2] Controller 2's connection to broker 172.16.6.18:9092 (id: 0 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 172.16.6.18:9092 (id: 0 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-08-15 13:00:00,076] WARN [RequestSendThread controllerId=2] Controller 2's connection to broker 172.16.6.36:9092 (id: 1 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 172.16.6.36:9092 (id: 1 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-08-15 13:00:00,088] WARN [RequestSendThread controllerId=2] Controller 2's connection to broker 172.16.6.37:9092 (id: 2 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 172.16.6.37:9092 (id: 2 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-08-15 13:00:00,177] WARN [RequestSendThread controllerId=2] Controller 2's connection to broker 172.16.6.18:9092 (id: 0 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 172.16.6.18:9092 (id: 0 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:279)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

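       Strangely, the controller broker could not connect to any of the three machines, not even to itself (it is one of the three).
Running telnet against port 9092 of all three brokers from one of the brokers gave "connection refused" for both the internal IPs and the hostnames.
(Screenshot: telnet results)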
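The telnet check above can also be scripted. The following is a minimal sketch, not part of the original troubleshooting; the function name can_connect and the 3-second timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This is roughly what a successful telnet to the port demonstrates.
    Refused connections, timeouts and name resolution failures all
    count as "not reachable" here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling can_connect("172.16.6.37", 9092) and can_connect("ali-37", 9092) on each broker would reproduce the telnet test for both the internal IP and the hostname.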

       My /etc/hosts looked like this (note that 127.0.0.1 is also bound to ali-37 here; this pitfall is explained later):

127.0.0.1 ali-37  localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.6.18 ali-18
172.16.6.36 ali-36
172.16.6.37 ali-37

       Out of options, I changed listeners in server.properties on all three brokers to the internal IP:

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://ali-37:9092
listeners=PLAINTEXT://172.16.6.37:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
advertised.listeners=PLAINTEXT://172.16.6.37:9092
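To see at a glance which host each listener setting points to, the two properties can be extracted from server.properties. This is a rough sketch under stated assumptions: parse_listeners is a hypothetical helper, it only handles the simple name://host:port form shown above, and SAMPLE just repeats the relevant lines of the excerpt.

```python
def parse_listeners(properties_text):
    """Map 'listeners' / 'advertised.listeners' to (name, host, port) tuples."""
    result = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key not in ("listeners", "advertised.listeners"):
            continue
        endpoints = []
        for item in value.split(","):
            name, _, address = item.strip().partition("://")
            host, _, port = address.rpartition(":")
            endpoints.append((name, host, int(port)))
        result[key] = endpoints
    return result


# The relevant lines from the excerpt above.
SAMPLE = """\
#listeners=PLAINTEXT://ali-37:9092
listeners=PLAINTEXT://172.16.6.37:9092
advertised.listeners=PLAINTEXT://172.16.6.37:9092
"""

for key, endpoints in parse_listeners(SAMPLE).items():
    print(key, endpoints)
```

The commented-out hostname line is ignored, so only the internal IP shows up for both keys.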

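       After a rolling restart of the three brokers, they could telnet to each other's internal IP on port 9092, and the cluster worked normally again.
       The odd part was that my previous configuration had been working, stopped working after the upgrade, and then also failed when I started the old Kafka version again. Only changing listeners in server.properties fixed it.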

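       Lao Yang, a senior ops engineer at our company, explained what had happened. When the old Kafka cluster was started, /etc/hosts did not yet bind the hostname to 127.0.0.1; that binding was added later, while the cluster was already running. The hostname was still bound to the internal IP as well, but the 127.0.0.1 entry came first, so the name always resolved to 127.0.0.1 instead of the internal IP. The running brokers were presumably unaffected because they had bound their listener before the change, so the problem only surfaced at the restart for the upgrade, regardless of the Kafka version. The fix is to remove the hostname from the 127.0.0.1 line: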

#127.0.0.1 ali-37  localhost localhost.localdomain localhost4 localhost4.localdomain4
127.0.0.1  localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.6.18 ali-18
172.16.6.36 ali-36
172.16.6.37 ali-37

With this change, Kafka works correctly even with the hostname still used for listeners in server.properties.
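The "first entry wins" behavior described above can be illustrated with a small model of a hosts-file lookup. This is a simplified sketch, not the system resolver: first_match is a hypothetical helper, and real resolution also depends on the resolver configuration of the machine.

```python
def first_match(hosts_text, name):
    """Return the address of the first hosts line that lists name, else None."""
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


# /etc/hosts as it was when the brokers failed: ali-37 also sits on the loopback line.
BEFORE = """\
127.0.0.1 ali-37  localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.6.18 ali-18
172.16.6.36 ali-36
172.16.6.37 ali-37
"""

# /etc/hosts after removing the hostname from the loopback line.
AFTER = """\
127.0.0.1  localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.6.18 ali-18
172.16.6.36 ali-36
172.16.6.37 ali-37
"""

print(first_match(BEFORE, "ali-37"))  # 127.0.0.1
print(first_match(AFTER, "ali-37"))   # 172.16.6.37
```

On a real broker, socket.gethostbyname("ali-37") from Python's standard library shows what the hostname resolves to on that machine.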

Reposted from blog.csdn.net/lzufeng/article/details/81704616