k8s sailing log: upgrading to a luxury cruise ship (high-availability cluster) and running into a strange fault (DNS resolution anomalies)

Our k8s cluster used to have only one master, so availability was not high. A couple of days ago we started building a high-availability cluster, but because the existing cluster had been created with kubeadm without the --control-plane-endpoint parameter, it could not be upgraded in place; the only option was to create a new high-availability (HA) cluster from scratch.

The principle of a high-availability cluster is very simple: there are multiple masters, each holding a copy of the cluster data (etcd), and all nodes reach the api server through a dedicated load-balancing layer, so that if some of the masters go down the cluster keeps running normally.

We used three masters, but while creating the highly available cluster on the first master we ran into a problem we had never dreamed of.

kubeadm init \
    --kubernetes-version v1.16.3 \
    --control-plane-endpoint "k8s-api:6443" --upload-certs \
    --image-repository registry.aliyuncs.com/google_containers \
    --pod-network-cidr=192.168.0.0/16 --v=6
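
For reference, the other two masters were to be joined with the join command that kubeadm init prints at the end, roughly like the following (the token, discovery hash and certificate key below are placeholders, not our real values):

kubeadm join k8s-api:6443 \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>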

To save time we did not deploy our own load balancer; we used Alibaba Cloud's network load balancer (layer-4 TCP forwarding) directly, and in /etc/hosts on the masters above we resolved k8s-api to the IP of the Alibaba Cloud load balancer.

But creating the cluster kept failing, with the following error:

[kubelet-check] Initial timeout of 40s passed.
I1217 08:39:21.852678   20972 round_trippers.go:443] GET https://k8s-api:6443/healthz?timeout=32s  in 30000 milliseconds

After investigation we found the cause: Alibaba Cloud's layer-4 load balancing does not support forwarding a request back to the server it came from, i.e. a request sent by a backend server cannot be forwarded to that same server.
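
A quick way to see this symptom (a sketch under our assumptions; the SLB address 10.0.0.100 below is a placeholder) is to probe the healthz endpoint from the master itself, once through the load balancer and once directly:

# On the master: going through the SLB hangs, because the SLB
# will not forward the request back to the server that sent it
curl -k --max-time 10 https://10.0.0.100:6443/healthz

# Hitting the local api server directly responds immediately
curl -k https://127.0.0.1:6443/healthz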

Later we used a workaround: instead of resolving k8s-api to the load balancer IP on the master servers, we resolved it to each master's own IP, and only resolved it to the load balancer IP on the nodes, as in the sketch below.
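
Concretely, the /etc/hosts entries ended up looking roughly like this (all IPs here are placeholders, not the real addresses):

# /etc/hosts on each master: k8s-api points at the master's own IP
172.16.0.11  k8s-api

# /etc/hosts on the worker nodes: k8s-api points at the SLB IP
10.0.0.100   k8s-api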

Once the high-availability cluster was built, before we even had time to enjoy the view from the big luxury cruise ship, we ran into a strange DNS resolution problem. Resolving hostnames inside containers was very slow, sometimes succeeding and sometimes failing, and it made no difference whether the name was a k8s service name, a manually added DNS record, or an Alibaba Cloud redis service. The DNS service was coredns and the pod network was calico.

The cluster had three masters and one node. At first we suspected the k8s network: when the cluster was first built the network was flannel, and we later switched to calico, but after fiddling for a long time we got nowhere. Last night we were too exhausted to keep going and, before going to bed, angrily shut down every server in the cluster.
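
For the record, the kind of check we kept repeating while troubleshooting looks roughly like this (a sketch; busybox:1.28 is just a convenient image for nslookup, not necessarily what we used). With the fault present, these lookups would sometimes succeed and sometimes hang or fail:

# Resolve an in-cluster service name from a throwaway pod
kubectl run dns-test -it --rm --restart=Never --image=busybox:1.28 -- \
    nslookup kubernetes.default

# Resolve an external name the same way
kubectl run dns-test2 -it --rm --restart=Never --image=busybox:1.28 -- \
    nslookup www.aliyun.com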

When we booted everything up today, we ran into something else we had never dreamed of: the problem had magically disappeared. We thought it was just a small episode during the upgrade to the cruise ship.

Then at the end of the day we ran into yet another thing we had never dreamed of: on the old single-master, non-HA cluster that had been serving production before, some nodes (using the flannel network) showed the same DNS resolution problem. Following the time-honored trick we had just learned, restart when in trouble, the problem immediately disappeared.

Two different clusters, using different pod networks and different network address ranges (192.168.0.0/16 and 10.244.0.0/16), hit the same DNS resolution problem, and both were fixed simply by restarting. This strange problem left our minds adrift just as we were setting sail.

But going from frugality to luxury is easy, and going back is hard: having moved up from economy class and readied a luxury cruise ship, we have no desire to go back to sailing a fishing boat (docker swarm). Either way, this ship has to keep sailing.
