[Translation] The Ultimate Guide to Disaster Recovery for Your Kubernetes Clusters

Original: The Ultimate Guide to Disaster Recovery for Your Kubernetes Clusters

Preface

Kubernetes lets us run containerized applications at scale without having to pay too much attention to details such as load balancing. You can ensure that your applications are highly available by running multiple replicas (pods) of them on Kubernetes. All the intricacies of container orchestration are safely hidden away, so you can focus on developing your application rather than on how to deploy it. Here you can learn more about highly available Kubernetes clusters and how to achieve high availability in Kubernetes with kubeadm (Use kubeadm for high availability in Kubernetes).

But using Kubernetes has its own challenges, and getting Kubernetes up and running takes some effort. If you are not familiar with how to run Kubernetes, you may want to take a look at this.

Kubernetes allows us to achieve zero-downtime deployments, but service interruptions and incidents can still happen at any time. Your network may go down, your latest application image may introduce a serious bug, or, in the rarest of cases, you may face a natural disaster.

When you are using Kubernetes, sooner or later you will need to set up backups. To keep your cluster from ending up in an unrecoverable state, you need a backup that can return the cluster to an earlier stable state.

Why back up and restore?

There are three reasons why you need to prepare a backup and recovery scheme for your Kubernetes cluster:

  1. To recover after a disaster: for example, when someone accidentally deletes the namespace your deployments live in.
  2. To replicate the environment: you want to copy your production environment to a staging environment in order to do some testing before a major upgrade.
  3. To migrate a Kubernetes cluster: say you want to migrate your Kubernetes cluster from one environment to another.

What do you need to back up?

Now that you know why, let's look at exactly what to back up. You need to back up two things:

  1. Your Kubernetes control plane (usually the master nodes) stores its data in etcd, so you need to back up etcd in order to capture the state of all Kubernetes resources.
  2. If you have "stateful" containers (which you usually do in practice), you also need to back up the persistent volumes; the commands after this list show how to see both.
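
As a quick way to see what falls into each of these buckets, the commands below list the control-plane etcd pods and the persistent volumes in the cluster. This is only a sketch: it assumes a kubeadm-style cluster where the etcd static pods carry the component=etcd label.

# etcd pods of the control plane (internal/stacked etcd)
kubectl -n kube-system get pods -l component=etcd
# persistent volumes and claims used by stateful workloads
kubectl get pv
kubectl get pvc --all-namespaces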

How to back up?

There are various tools, such as Heptio Ark and kube-backup, that support backing up and restoring Kubernetes clusters hosted on cloud providers. But what if you are not using a managed (cloud-provider) Kubernetes cluster? You may have to take matters into your own hands if you run Kubernetes on bare metal, just like we do. We run a Kubernetes cluster with three masters, with three etcd members running at the same time, one on each master. If we lose one master, we can recover it, because the number of running etcd members still satisfies the quorum. Now, if we lose two masters in a production environment, we need a mechanism to restore the cluster to operation.
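
To check whether the surviving etcd members still form a quorum, you can ask etcd directly. This is a sketch: the <master-*> endpoints are placeholders, and the certificate paths assume a kubeadm-provisioned etcd.

# replace the <master-*> placeholders with your masters' addresses
ETCDCTL_API=3 etcdctl \
  --endpoints=https://<master-0>:2379,https://<master-1>:2379,https://<master-2>:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health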

Want to know how to build a multi-master Kubernetes cluster? Keep reading!

Taking an etcd backup:

The etcd backup mechanism differs depending on how the etcd cluster was set up in your Kubernetes environment. There are two ways to set up an etcd cluster in a Kubernetes environment:

  1. Internal etcd cluster: the etcd cluster runs as containers/pods inside the Kubernetes cluster, and Kubernetes is responsible for managing those pods.
  2. External etcd cluster: in most cases the etcd cluster runs as a Linux service outside the Kubernetes cluster and provides its endpoints to the Kubernetes cluster so that Kubernetes can read from and write to it.

Backup strategy for an internal etcd cluster:

To take a backup of the internal etcd pod, we use the Kubernetes CronJob feature; this approach does not require installing the etcdctl client on any host (node). The following is the definition of a Kubernetes CronJob that takes an etcd backup every minute:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: backup
  namespace: kube-system
spec:
  # activeDeadlineSeconds: 100
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # Same image as in /etc/kubernetes/manifests/etcd.yaml
            image: k8s.gcr.io/etcd:3.2.24
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /data/backup
              type: DirectoryOrCreate
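
Once this manifest is applied, you can check that the jobs are firing and that snapshots are landing in /data/backup on the master node. The manifest filename below is just an assumed name for illustration.

kubectl apply -f etcd-backup-cronjob.yaml
kubectl -n kube-system get cronjob backup
kubectl -n kube-system get jobs
# on the master node itself:
ls -lh /data/backup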

Backup strategy for an external etcd cluster:

If you run the etcd cluster as a service on Linux hosts, you should set up a Linux cron job to back up your cluster. Run the following command to back up etcd:

ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save /path/for/backup/snapshot.db
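
As a minimal sketch of automating this with cron (the endpoint, certificate paths, and backup directory below are assumptions and will differ in your environment), you can wrap the command in a small script and schedule it:

#!/bin/sh
# /usr/local/bin/etcd-backup.sh - take a timestamped etcd snapshot
ENDPOINT="https://127.0.0.1:2379"   # adjust to your etcd client URL
BACKUP_DIR="/data/backup"
mkdir -p "$BACKUP_DIR"
ETCDCTL_API=3 etcdctl --endpoints "$ENDPOINT" \
  --cacert /etc/etcd/pki/ca.crt \
  --cert /etc/etcd/pki/client.crt \
  --key /etc/etcd/pki/client.key \
  snapshot save "$BACKUP_DIR/snapshot-$(date +%Y-%m-%d_%H:%M:%S).db"

Make the script executable (chmod +x /usr/local/bin/etcd-backup.sh) and add a crontab entry (crontab -e) such as 0 * * * * /usr/local/bin/etcd-backup.sh to take a snapshot every hour.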

Disaster recovery (DR)

Now, suppose the Kubernetes cluster has gone down completely and we need to restore it from the etcd snapshot. In general, we need to bring up the etcd cluster and then run kubeadm init on the master node with the etcd endpoint(s) of these hosts. Make sure you put the backed-up certificates into the /etc/kubernetes/pki directory (the default directory where kubeadm init stores the certificates it creates for the cluster) before running kubeadm init.
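
A minimal sketch of putting the backed-up certificates back in place before running kubeadm init; the backup location /data/backup/pki is only an assumed example path:

# copy the previously backed-up cluster certificates back into place
mkdir -p /etc/kubernetes/pki
cp -r /data/backup/pki/* /etc/kubernetes/pki/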

Recovery strategy for an internal etcd cluster:

docker run --rm \
-v '/data/backup:/backup' \
-v '/var/lib/etcd:/var/lib/etcd' \
--env ETCDCTL_API=3 \
'k8s.gcr.io/etcd:3.2.24' \
/bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-2018-12-09_11:12:05_UTC.db' ; mv /default.etcd/member/ /var/lib/etcd/"
kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd
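
After kubeadm init completes, you can verify that the control plane came back with the restored state; this sketch assumes the default admin kubeconfig path written by kubeadm:

export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
kubectl get pods --all-namespaces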

Recovery strategy for an external etcd cluster:

Restore the three etcd nodes with the following commands:

ETCDCTL_API=3 etcdctl snapshot restore snapshot-188.db \
--name master-0 \
--initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
--initial-cluster-token my-etcd-token \
--initial-advertise-peer-urls http://10.0.1.188:2380
ETCDCTL_API=3 etcdctl snapshot restore snapshot-136.db \
--name master-1 \
--initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
--initial-cluster-token my-etcd-token \
--initial-advertise-peer-urls http://10.0.1.136:2380
ETCDCTL_API=3 etcdctl snapshot restore snapshot-155.db \
--name master-2 \
--initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
--initial-cluster-token my-etcd-token \
--initial-advertise-peer-urls http://10.0.1.155:2380

The above three commands will give you three restored folders on the three nodes, named master-0.etcd, master-1.etcd and master-2.etcd.
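
The next step is to swap each restored directory into place on its node. The following is a per-node sketch for master-0 (use the matching folder on the other two nodes); the service name etcd and the data directory /var/lib/etcd are assumptions that depend on how the service was installed:

systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd.old      # keep the broken data as a safety copy
mv ./master-0.etcd /var/lib/etcd        # master-1.etcd / master-2.etcd on the other nodes
systemctl start etcd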

Now stop the etcd service on all nodes, replace the etcd data directory on every node with its restored folder (as sketched above), and then start the etcd service again. You should now be able to see all the nodes, but most likely only one master will be in the Ready state; you need to join the other two nodes back into the cluster using the existing ca.crt file (which you should have backed up). Run the following command on the master:

kubeadm token create --print-join-command

This will print a kubeadm join command. Add the --ignore-preflight-errors parameter and run the command on the other two nodes so that they reach the Ready state, as in the sketch below.
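
The printed join command, with the extra flag appended, will look roughly like the following; the address, token, and hash are placeholders rather than real values, and passing all to --ignore-preflight-errors is just one possible choice:

kubeadm join <master-ip>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --ignore-preflight-errors=all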

Conclusion

Building a Kubernetes cluster with more than one master is one way to handle master failures, but even that does not completely eliminate the need for etcd backup and recovery, because you might also accidentally destroy data in an HA environment.


Origin: blog.csdn.net/weixin_34364135/article/details/91387541