1 Issue:
Recently a Swarm cluster in a test environment stopped responding. The cluster has two manager nodes, and running docker node ls on either of them reports:
The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online
Yet both manager nodes were clearly online.
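The "more than half" wording in the message explains why a two-manager cluster is fragile. Below is a small sketch of the arithmetic, assuming the usual Raft majority rule that Swarm managers follow. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not Docker commands.

# Number of managers that must agree for the cluster to have a leader.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of managers that can fail before quorum is lost.
fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

quorum 2            # prints 2
fault_tolerance 2   # prints 0
```

With two managers the quorum is two, so a single manager that cannot authenticate to its peer is enough to leave the cluster without a leader, even though both machines are up.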
2 Analysis:
Running docker info shows the same error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Going through the logs of the two nodes one by one shows that each of them prints an error periodically.
The first manager node reports:
Mar 4 09:30:05 manager1 dockerd: time="2020-03-04T09:30:05.663865244+08:00" level=error
msg="error sending message to peer" error="rpc error: code = Internal desc = connection error: desc = \"transport: x509: certificate has expired or is not yet valid\""
The second manager node reports:
Mar 4 09:08:01 manager2 dockerd: time="2020-03-04T09:08:01.446858105+08:00" level=warning
msg="error renewing TLS certificate: rpc error: code = Internal desc = connection error: desc = \"transport: remote error: tls: bad certificate\""
Preliminary conclusion: the certificate of the second manager node is at fault, and it has very likely expired.
Reading the messages literally, this looks like a bug: to renew its local certificate the node has to send a request to a remote node, but the remote node rejects that request because of the bad certificate, so the renewal can never succeed.
The clocks on both machines were checked, and both showed the correct time.
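The log check above can be repeated with a small filter. This is only a sketch: swarm_cert_errors is a hypothetical helper, and the patterns are simply the three messages quoted in the logs above, not a complete list of what dockerd can print.

```shell
#!/bin/sh
# Hypothetical helper: read dockerd log lines on stdin and print only
# those that point to Swarm TLS certificate trouble.
swarm_cert_errors() {
    grep -E 'x509: certificate has expired or is not yet valid|tls: bad certificate|error renewing TLS certificate'
}

# Possible usage (assumes a systemd host, or syslog in /var/log/messages):
#   journalctl -u docker --no-pager | swarm_cert_errors
#   swarm_cert_errors < /var/log/messages
```

Running it on each manager shows quickly which side reports the expired certificate and which side fails to renew.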
3 Verification:
Running the command
docker swarm ca | openssl x509 -noout -text
on the second manager node to inspect its certificate fails with an error, and no certificate information is displayed.
Next, open port 2377 of both nodes directly in Google Chrome at https://xxxx:2377
Clicking through to view the certificate shows that the current time is outside its validity period, so the certificate has to be renewed before it is valid again.
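The same validity check can be done from the command line instead of the browser. This is a minimal sketch that assumes openssl is installed and that Docker uses its default data root, in which case the node certificate is at /var/lib/docker/swarm/certificates/swarm-node.crt. cert_window is a hypothetical helper, and x.x.x.x stands for a manager address.

```shell
#!/bin/sh
# Hypothetical helper: print the validity window of a PEM certificate
# file and report whether it has already expired.
cert_window() {
    cert_file="$1"
    # Prints the notBefore= and notAfter= lines.
    openssl x509 -in "$cert_file" -noout -dates || return 2
    # -checkend 0 exits non-zero if the certificate is already expired.
    # It does not detect a certificate that is not yet valid.
    if openssl x509 -in "$cert_file" -noout -checkend 0 >/dev/null; then
        echo "status: not expired"
    else
        echo "status: expired"
        return 1
    fi
}

# Possible usage on a manager node (default data root assumed):
#   cert_window /var/lib/docker/swarm/certificates/swarm-node.crt
# Checking the certificate served on the Swarm port of a remote manager:
#   echo | openssl s_client -connect x.x.x.x:2377 2>/dev/null | openssl x509 -noout -dates
```

Comparing the notAfter date with the output of date on the same machine confirms whether the certificate has expired or the clock is wrong.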
That raises two questions: where is the certificate stored, and how is it renewed? The following was used as a reference:
Certificate-related discussions on GitHub
4 Final resolution:
Because the certificate of manager node 2 had failed, force that node to leave the cluster directly:
docker swarm leave --force
Manager node 1 was still not working properly, so run the following command on manager node 1:
docker swarm init --force-new-cluster --advertise-addr x.x.x.x
The command could not be executed, so restart the docker process normally:
systemctl restart docker
Wait for quite a while, then run the command again:
docker swarm init --force-new-cluster --advertise-addr x.x.x.x
The original cluster returned to normal, its deployments and configuration were still in place, and the problem was solved.
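The wait-and-try-again step can also be scripted. The sketch below uses a hypothetical retry helper, and the attempt count and delay in the usage line are arbitrary assumptions, not values from the incident.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to N times, sleeping between
# attempts, and stop at the first success.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i of $attempts failed" >&2
        # Sleep only when another attempt remains.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Possible usage after restarting docker:
#   retry 10 30 docker swarm init --force-new-cluster --advertise-addr x.x.x.x
```

The helper returns as soon as the command succeeds, so the cluster is not re-initialized a second time once it is back.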
(Completed)