ZooKeeper fails to start: "Unable to load database on disk"

Recently, while looking into an abnormal cluster status, we found that ZooKeeper (zk) kept failing to start. The logs were painful to read: the service died about 5 seconds after every start, and because it was being started in the background the failure was reported every minute. At first I really could not understand why zk would start and then fail so consistently at around the 5 second mark. Judging by the timing, the process had not fully started yet; in that window it was at most still in its initialization state.

[root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-03-23 11:29:21 CST; 4s ago
     Docs: http://zookeeper.apache.org
  Process: 31011 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 31138 (java)
   CGroup: /system.slice/zookeeper.service
           └─31138 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo...

Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service...
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: ZooKeeper JMX enabled by default
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
[root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-03-23 11:29:21 CST; 5s ago
     Docs: http://zookeeper.apache.org
  Process: 31011 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 31138 (java)
   CGroup: /system.slice/zookeeper.service
           └─31138 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo...

Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service...
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: ZooKeeper JMX enabled by default
Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
[root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-03-23 11:29:26 CST; 706ms ago
     Docs: http://zookeeper.apache.org
  Process: 31225 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 31138 (code=exited, status=1/FAILURE)

Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED
Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service: main process exited, code=exited, status=1/FAILURE
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: ZooKeeper JMX enabled by default
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: Stopping zookeeper ... /opt/zookeeper/zookeeper-prod/bin/zkServer.sh: 第 182 行:kill: (31138) - 没有那个进程
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: STOPPED
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Unit zookeeper.service entered failed state.
Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service failed.
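
Since systemd reports "Started" before the JVM has finished initializing, a single status check right after the start proves little. Here is a minimal sketch of how I would watch for this kind of delayed failure. It assumes the unit is called zookeeper as above; wait_stable and is_active are hypothetical helper names, not part of ZooKeeper or systemd.

```shell
#!/bin/sh
# Hypothetical helper: succeeds while the systemd unit is in the "active" state.
is_active() {
    systemctl is-active --quiet "$1"
}

# Poll the unit once per second for $2 seconds.
# Returns non-zero as soon as the unit is no longer active.
wait_stable() {
    unit=$1
    seconds=$2
    i=0
    while [ "$i" -lt "$seconds" ]; do
        if ! is_active "$unit"; then
            echo "$unit stopped being active after about ${i}s" >&2
            return 1
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "$unit still active after ${seconds}s"
}

# Usage (as root on the affected host):
#   systemctl restart zookeeper && wait_stable zookeeper 10
```

With a window longer than the roughly 5 seconds seen above, the check would fail instead of reporting a misleading "active (running)".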

The status output mentions /opt/zookeeper/zookeeper-prod/conf/zoo.cfg, so I went to that directory tree to see whether there was anything worth digging into. conf is obviously the configuration folder, so there ought to be a log or output folder somewhere nearby that holds the runtime log, especially the error log, and the investigation can follow that lead.
I then found a zookeeper.out file in the /opt/zookeeper/zookeeper-prod/bin directory, which holds the details of each run. One look with cat, and the problem was very clear.
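
Instead of reading the whole file with cat, the error lines can be pulled out directly. This is only a sketch: show_errors is a hypothetical helper, and the pattern assumes the " - ERROR " log layout shown in the output that follows.

```shell
#!/bin/sh
# Hypothetical helper: print ERROR lines with their line numbers,
# followed by a few lines of context (default 5) to catch the stack trace.
show_errors() {
    file=$1
    context=${2:-5}
    grep -n -A "$context" -e ' - ERROR ' "$file"
}

# Usage:
#   show_errors /opt/zookeeper/zookeeper-prod/bin/zookeeper.out
#   show_errors /opt/zookeeper/zookeeper-prod/bin/zookeeper.out 10
```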

2020-03-23 11:36:58,799 [myid:] - INFO  [main:QuorumPeerConfig@136] - Reading configuration from: /opt/zookeeper/zookeeper-prod/bin/../conf/zoo.cfg
2020-03-23 11:36:58,814 [myid:] - INFO  [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.26 to address: /10.153.115.26
2020-03-23 11:36:58,815 [myid:] - INFO  [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.25 to address: /10.153.115.25
2020-03-23 11:36:58,816 [myid:] - INFO  [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.24 to address: /10.153.115.24
2020-03-23 11:36:58,816 [myid:] - INFO  [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.29 to address: /10.153.115.29
2020-03-23 11:36:58,816 [myid:] - INFO  [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.28 to address: /10.153.115.28
2020-03-23 11:36:58,816 [myid:] - INFO  [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.27 to address: /10.153.115.27
2020-03-23 11:36:58,816 [myid:] - WARN  [main:QuorumPeerConfig@354] - Non-optimial configuration, consider an odd number of servers.
2020-03-23 11:36:58,816 [myid:] - INFO  [main:QuorumPeerConfig@398] - Defaulting to majority quorums
2020-03-23 11:36:58,821 [myid:5] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2020-03-23 11:36:58,821 [myid:5] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2020-03-23 11:36:58,822 [myid:5] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2020-03-23 11:36:58,837 [myid:5] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2020-03-23 11:36:58,839 [myid:5] - INFO  [main:QuorumPeerMain@130] - Starting quorum peer
2020-03-23 11:36:58,849 [myid:5] - INFO  [main:ServerCnxnFactory@117] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
2020-03-23 11:36:58,856 [myid:5] - INFO  [main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2181
2020-03-23 11:36:58,861 [myid:5] - INFO  [main:QuorumPeer@1158] - tickTime set to 2000
2020-03-23 11:36:58,861 [myid:5] - INFO  [main:QuorumPeer@1204] - initLimit set to 10
2020-03-23 11:36:58,861 [myid:5] - INFO  [main:QuorumPeer@1178] - minSessionTimeout set to -1
2020-03-23 11:36:58,862 [myid:5] - INFO  [main:QuorumPeer@1189] - maxSessionTimeout set to -1
2020-03-23 11:36:58,871 [myid:5] - INFO  [main:QuorumPeer@1467] - QuorumPeer communication is not secured!
2020-03-23 11:36:58,871 [myid:5] - INFO  [main:QuorumPeer@1496] - quorum.cnxn.threads.size set to 20
2020-03-23 11:36:58,872 [myid:5] - INFO  [main:FileSnap@86] - Reading snapshot /data/zookeeper/data/version-2/snapshot.b91d0000003c
2020-03-23 11:36:59,290 [myid:5] - ERROR [main:QuorumPeer@692] - Unable to load database on disk
java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:689)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2020-03-23 11:36:59,292 [myid:5] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:693)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:689)
    ... 4 more

The log shows zk reading a snapshot and then failing right away with "Unable to load database on disk". Fine then! I deleted the snapshot snapshot.b91d0000003c so that zk would regenerate the snapshot file by itself, and that settled it.
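
The exception text compares two epochs, so it can help to look at those values before removing anything. The sketch below makes two assumptions that should be verified on your own version: that the epoch is the upper 32 bits of the hex zxid in the snapshot file name, and that the data directory contains acceptedEpoch and currentEpoch files holding decimal numbers. snapshot_epoch and check_epochs are hypothetical helper names.

```shell
#!/bin/sh
# Print the epoch (upper 32 bits, in hex) of the zxid in a snapshot file name.
snapshot_epoch() {
    hex=${1##*snapshot.}                 # keep only the hex zxid suffix
    printf '%x\n' $(( 0x$hex >> 32 ))
}

# Print both epoch files of a version-2 directory in hex.
# Assumption: the files store plain decimal numbers.
# Returns 0 if acceptedEpoch >= currentEpoch, 1 if it is behind,
# 2 if either file cannot be read.
check_epochs() {
    accepted=$(cat "$1/acceptedEpoch") || return 2
    current=$(cat "$1/currentEpoch") || return 2
    printf 'acceptedEpoch=%x currentEpoch=%x\n' "$accepted" "$current"
    [ "$accepted" -ge "$current" ]
}

# Usage:
#   snapshot_epoch /data/zookeeper/data/version-2/snapshot.b91d0000003c
#   check_epochs /data/zookeeper/data/version-2
```

If check_epochs fails, the two files disagree in the same way the exception describes, which is worth knowing before deciding what to delete.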

2020-03-23 11:36:58,872 [myid:5] - INFO  [main:FileSnap@86] - Reading snapshot /data/zookeeper/data/version-2/snapshot.b91d0000003c
2020-03-23 11:36:59,290 [myid:5] - ERROR [main:QuorumPeer@692] - Unable to load database on disk

What a relief:

[root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper
● zookeeper.service - ZooKeeper Service
   Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-03-23 12:12:08 CST; 5min ago
     Docs: http://zookeeper.apache.org
  Process: 25348 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
  Process: 25658 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS)
 Main PID: 25667 (java)
   CGroup: /system.slice/zookeeper.service
           └─25667 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo...

Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service...
Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: ZooKeeper JMX enabled by default
Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg
Mar 23 12:12:08 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: Starting zookeeper ... STARTED
Mar 23 12:12:08 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
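
"active (running)" only says that the process is alive, not that the node has rejoined the ensemble. One way to check this, sketched here under the assumption that nc is installed and that the four-letter commands (srvr) are enabled on the client port 2181 seen in the log, is to ask each server for its mode. zk_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the value of the "Mode:" line (leader, follower, standalone)
# from four-letter-command output read on stdin.
zk_mode() {
    sed -n 's/^Mode: *//p'
}

# Usage (the addresses are the six servers listed in the startup log):
#   for h in 10.153.115.24 10.153.115.25 10.153.115.26 \
#            10.153.115.27 10.153.115.28 10.153.115.29; do
#       printf '%s: ' "$h"; echo srvr | nc "$h" 2181 | zk_mode
#   done
```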

Every zk run produces snapshot files, which are used to recover state. This host's disk had filled up earlier, so zk could not write its data in time, and we then rebooted the machine; most likely the snapshot write failed at that moment. This is not a universal fix, though, and each case has to be analyzed on its own: once a snapshot file is deleted from a cluster node, restoring the database state later can be difficult. Fortunately we have 6 zk nodes and the other five were healthy, so this operation was acceptable here.
That's all for today :)
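
Since deleting a snapshot is hard to undo, a more cautious variant is to move the file aside instead. This is a sketch of that alternative, not what was actually run on this host; set_aside is a hypothetical helper and /data/zookeeper/backup is a made-up backup location.

```shell
#!/bin/sh
# Move a file into a timestamped directory under a backup root
# instead of deleting it, and print where it ended up.
set_aside() {
    src=$1
    backup_root=$2
    dest="$backup_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" && mv -- "$src" "$dest/" && echo "$dest/$(basename "$src")"
}

# Usage (stop the service first; the backup path is hypothetical):
#   systemctl stop zookeeper
#   df -h /data/zookeeper        # confirm the disk is no longer full
#   set_aside /data/zookeeper/data/version-2/snapshot.b91d0000003c /data/zookeeper/backup
#   systemctl start zookeeper
```

If the node comes back and syncs from the healthy peers, the backup can be removed later; if not, the original file is still there to put back.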


Origin blog.51cto.com/yerikyu/2481123