K8s in Action reading notes - [10] StatefulSets: deploying replicated stateful applications

10.1 Replicating stateful pods

A ReplicaSet creates multiple Pod replicas from a single Pod template. These replicas differ only in their names and IP addresses. If the Pod template contains a volume that references a specific PersistentVolumeClaim, all replicas of the ReplicaSet will use that same PersistentVolumeClaim and therefore the same PersistentVolume bound by the claim (as shown in Figure 10.1).

Since the reference to the claim is in the Pod template used to stamp out multiple Pod replicas, there is no way to make each replica use its own independent PersistentVolumeClaim. At least with a single ReplicaSet, you cannot use this approach to run a distributed data store, because each instance requires its own independent storage. A different solution is needed.

[Figure 10.1: All Pods of a ReplicaSet use the same PersistentVolumeClaim and PersistentVolume]

10.1.1 Running multiple replicas with separate storage for each

How can you run multiple replicas of a Pod and have each Pod use its own storage volume? A ReplicaSet creates exact copies of a Pod, so it can't be used for such Pods. What can you use instead?

Create Pod manually

You could create the Pods manually and have each of them use its own PersistentVolumeClaim, but because no ReplicaSet looks after them, you would need to manage them manually and recreate them when they disappear (for example, after a node failure). This is therefore not a viable option.

Use one ReplicaSet per Pod instance

You could create multiple ReplicaSets, one per Pod, with each ReplicaSet's desired replica count set to 1 and each ReplicaSet's Pod template referencing a dedicated PersistentVolumeClaim (as shown in Figure 10.2).

[Figure 10.2: One ReplicaSet per Pod instance, each with its own PersistentVolumeClaim]

While this allows Pods to be rescheduled automatically after a node failure or an accidental deletion, it is much more cumbersome than a single ReplicaSet. For example, think about how you would scale the Pods: you can't change the desired replica count; instead, you have to create additional ReplicaSets.

Therefore, using multiple ReplicaSets is not the best solution.

Use multiple directories on the same storage volume

Another trick is to have all the instances use the same PersistentVolume but give each Pod its own file directory inside the volume (as shown in Figure 10.3).

Since you can't configure the replicas differently from a single Pod template, there is no way to tell each instance which directory it should use. You could, however, have each instance automatically choose (and possibly create) a data directory that no other instance is currently using. This solution requires coordination between the instances and is not easy to implement correctly.

[Figure 10.3: Each Pod instance uses a different directory inside the same PersistentVolume]

10.1.2 Providing a stable identity for each pod

Some applications require a stable network identity for each instance in addition to storage. From time to time, a Pod may be terminated and replaced with a new Pod. When a ReplicaSet replaces a Pod, the new Pod is a completely new Pod with a new hostname and IP address, although the data in its storage volume may be the data of the terminated Pod. For some applications, starting with data from an old instance but with a completely new network identity can cause problems.

Why do some applications require stable network identity? This requirement is very common in distributed stateful applications. Some applications require administrators to list all other cluster members and their IP addresses (or hostnames) in each member's configuration file. But in Kubernetes, every time a Pod is rescheduled, the new Pod gets a new hostname and a new IP address, so the entire application cluster needs to be reconfigured every time a member is rescheduled.

Use a dedicated Service for each Pod instance

To solve this problem, you can use a trick to provide stable network addresses for cluster members by creating a dedicated Kubernetes Service for each individual member. Because the service IP is stable, you can point to each member in the configuration by its service IP (rather than the Pod IP).

This is similar to creating one ReplicaSet per member to provide each with its own storage, as described earlier. Combining the two techniques results in the setup shown in Figure 10.4 (an additional Service covering all cluster members is also shown, because clients of the cluster usually need one).

[Figure 10.4: One Service and one ReplicaSet per Pod instance, plus a Service covering all cluster members]

Not only is this solution ugly, it still doesn't solve everything. Individual Pods have no way of knowing which Service they are exposed through (and thus what their stable IP is), so they cannot self-register with the other Pods using that IP.

Fortunately, Kubernetes provides us with StatefulSet .

10.2 Understanding StatefulSets

10.2.1 Comparing StatefulSets with ReplicaSets

Using the “pet” vs. “cattle” analogy to understand stateful Pods

You may have heard the analogy between “pets” and “cattle.” If not, let me explain. We can think of instances of an application as pets or cattle.

Note: StatefulSets were originally called PetSets. The name comes from the analogy between "pets" and "cattle" explained here.

We tend to treat our application instances as pets, giving each one a name and caring for it individually. But it is usually better to treat instances as cattle and pay no special attention to any individual one. That makes it easy to replace unhealthy instances without a second thought, much like a farmer replaces unhealthy cattle.

An instance of a stateless application behaves much like a cow in a herd. If an instance dies, you can create a new one and people won't notice any difference.

With stateful applications, however, an instance is more like a pet. You can't just buy a new pet when the old one dies and expect people not to notice. To replace a lost pet, you need to find a new one that looks and behaves exactly like the old one. For the application, this means the new instance needs to have the same state and identity as the old one.

Compare StatefulSets to ReplicaSets or ReplicationControllers

Pod replicas managed by a ReplicaSet or ReplicationController are like a herd of cattle. Because they are mostly stateless, they can be replaced at any time with a completely new Pod replica. Stateful Pods require a different approach. When a stateful Pod instance dies (or the node it is running on fails), the Pod instance needs to be resurrected on another node, but the new instance needs to get the same name, network identity, and state as the one it replaces. This is exactly what happens when Pods are managed by a StatefulSet.

A StatefulSet ensures that Pods are rescheduled in a way that preserves their identity and state. It also lets you easily scale the number of pets. Like a ReplicaSet, a StatefulSet has a desired replica count field that determines how many pets you want running at a given time. Also like ReplicaSets, Pods are created from a Pod template specified as part of the StatefulSet. But unlike Pods created by a ReplicaSet, Pods created by a StatefulSet are not exact copies of each other. Each can have its own set of volumes, making it unique from its peers. Pet Pods also have a predictable (and stable) identity, rather than each new Pod instance getting a completely random one.

10.2.2 Providing a stable network identity

GOVERNING SERVICE

Each Pod created by a StatefulSet is assigned an ordinal index (zero-based), which is used to derive the Pod's name and hostname and to attach stable storage to the Pod. Pod names are therefore predictable, because each name is derived from the StatefulSet's name and the ordinal index of the instance. Rather than having random names, the Pods are named in an orderly fashion, as shown in the figure below.

[Figure 10.5: Unlike Pods created by a ReplicaSet, Pods created by a StatefulSet have predictable names and hostnames]

Unlike regular Pods, stateful Pods sometimes need to be addressed by their hostname, while stateless Pods usually do not. After all, every stateless Pod is just like any other Pod. You can choose any of them when you need one. But with stateful Pods, you usually want to operate on a specific Pod.

Therefore, a StatefulSet requires you to create a corresponding governing headless Service that provides the actual network identity for each Pod. Through this Service, each Pod gets its own DNS entry, so its peers (and possibly other clients in the cluster) can address the Pod by its hostname. For example, if the governing Service belongs to the default namespace and is named foo, and one of the Pods is named a-0, you can reach that Pod through its fully qualified domain name a-0.foo.default.svc.cluster.local. This is not possible with Pods managed by a ReplicaSet.

Additionally, you can use DNS to look up the names of all the Pods in a StatefulSet through the SRV records of the headless Service.

Replacing lost Pods

When a Pod instance managed by a StatefulSet disappears (because the node it was running on failed, it was evicted from the node, or someone deleted the Pod object manually), the StatefulSet makes sure it is replaced with a new instance, similar to how a ReplicaSet behaves. But unlike with a ReplicaSet, the replacement Pod gets the same name and hostname as the Pod that disappeared (the difference between a ReplicaSet and a StatefulSet is shown in Figure 10.6).

[Figure 10.6: A StatefulSet replaces a lost Pod with a new one that has the same identity; a ReplicaSet gives it a brand-new identity]

The new Pod is not necessarily scheduled to the same node, but as you learned earlier, it doesn't matter which node a Pod runs on. Even if the Pod is scheduled to a different node, it is still available and reachable under the previous hostname.

Scaling a StatefulSet

Scaling up a StatefulSet creates a new Pod instance with the next unused ordinal index. If you scale up from two instances to three, the new instance gets index 2 (the existing instances obviously use indexes 0 and 1).

A nice aspect of scaling down a StatefulSet is that you always know which Pod will be removed. Again, this contrasts with scaling down a ReplicaSet, where you don't know which instance will be deleted and can't even specify which one you want removed first (though that feature may be introduced in the future). Scaling down a StatefulSet always removes the instance with the highest ordinal index first (as shown in Figure 10.7). This makes the effects of a scale-down predictable.

[Figure 10.7: Scaling down a StatefulSet always removes the Pod with the highest ordinal index first]

Because some stateful applications don't handle rapid scale-downs well, StatefulSets scale down only one Pod instance at a time. A distributed data store, for example, may lose data if multiple nodes go offline at the same time: if the store keeps two copies of each data entry and two nodes holding the same entry are shut down at once, that entry is lost. If the scale-down is performed sequentially, the distributed data store has time to create an additional replica of the entry elsewhere to replace the (single) lost copy.

So, for this exact reason, StatefulSets do not allow scaling down operations when any instance is unhealthy . If an instance is unhealthy and you scale down an instance at the same time, you effectively lose two cluster members at once.

10.2.3 Providing stable dedicated storage to each stateful instance

You've seen how StatefulSets ensure that stateful Pods have stable identities, but how is storage handled? Each stateful Pod instance requires its own storage, and if a stateful Pod is rescheduled (replaced with a new instance with the same identity as before), the new instance must come with the same storage. How do StatefulSets achieve this?

Obviously, a stateful Pod's storage needs to be persistent and decoupled from the Pod itself. In Chapter 6 you learned about PersistentVolumes and PersistentVolumeClaims, which allow persistent storage to be attached to a Pod by having the Pod reference the PersistentVolumeClaim. Because PersistentVolumeClaims and PersistentVolumes map to each other one-to-one, each Pod instance of a StatefulSet needs to reference a different PersistentVolumeClaim to have its own separate PersistentVolume. Since all the Pod instances are stamped out from the same Pod template, how can each of them reference a different PersistentVolumeClaim? And who creates these claims?

Teaming up Pod templates with volume claim templates

A StatefulSet has to create the PersistentVolumeClaims the same way it creates the Pods. For that reason, a StatefulSet can also have one or more volume claim templates, which are used to stamp out a PersistentVolumeClaim along with each Pod instance (see Figure 10.8).

[Figure 10.8: A StatefulSet creates both Pods (from the Pod template) and PersistentVolumeClaims (from the volume claim templates)]

The PersistentVolumes for the claims can either be provisioned up front by an administrator or created just in time through dynamic provisioning of PersistentVolumes.
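
For example, with dynamic provisioning the volume claim template only needs to name a StorageClass, and the PersistentVolumes are created on demand. Here is a minimal sketch; the StorageClass name standard is an assumption and depends on what your cluster offers:

  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: standard   # assumed StorageClass name; use one that exists in your cluster
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Mi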

Creation and deletion of PersistentVolumeClaims

Scaling up a StatefulSet by one creates two or more API objects (the Pod plus one or more PersistentVolumeClaims referenced by it). Scaling down, however, deletes only the Pod and leaves the claims alone. The reason becomes clear when you consider what would happen otherwise: when a PersistentVolumeClaim is deleted, the PersistentVolume it is bound to gets recycled or deleted, and the data in it is lost.

PersistentVolumeClaims need to be removed manually to release the underlying PersistentVolume.
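
For example, after scaling down the kubia StatefulSet used later in this chapter (whose claim template is named data), you could free the claim of the removed instance once you're sure its data is no longer needed:

$ kubectl delete pvc data-kubia-2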

Because the PersistentVolumeClaim remains after a scale-down, a subsequent scale-up can reattach the same claim, along with the bound PersistentVolume and its contents, to the new Pod instance (as shown in Figure 10.9). If you scale down a StatefulSet by mistake, you can undo it by scaling up again, and the new Pod instance gets the same persisted state (and the same name) as before.

[Figure 10.9: Scaling down deletes only the Pod; scaling back up reattaches the retained claim and its PersistentVolume]

10.2.4 Understanding StatefulSet guarantees

Regular stateless Pods are interchangeable, while stateful Pods are not. We have seen that a stateful Pod is always replaced with an identical Pod (with the same name and hostname, using the same persistence store, etc.). When Kubernetes discovers that an old Pod no longer exists (such as manually deleting the Pod), it replaces it.

But what if Kubernetes cannot be sure about the state of the Pod? If it creates a replacement Pod with the same identity, two instances of the application might be running in the system with the same identity. The two would also be bound to the same storage, so two processes with the same identity would be writing to the same files. This isn't a problem with Pods managed by a ReplicaSet, because those apps obviously aren't meant to work on the same files. Besides, ReplicaSets create Pods with randomly generated identities, so there is no way for two processes with the same identity to run at the same time.

Therefore, Kubernetes must take great care to ensure that two stateful Pod instances with the same identity, bound to the same PersistentVolumeClaim, are never running at the same time. A StatefulSet must provide at-most-once semantics for its stateful Pod instances.

This means that the StatefulSet must ensure that the old Pod has stopped running before creating a replacement Pod. This has significant implications for handling node failures, which we will demonstrate later in this chapter.

10.3 Using a StatefulSet

10.3.1 Creating the app and container image

You'll extend the previous kubia app so each Pod instance can store and retrieve a single data entry. The source code is shown below:

const http = require('http');
const os = require('os');
const fs = require('fs');

const dataFile = "/var/data/kubia.txt";

// Returns true if the data file already exists.
function fileExists(file) {
  try {
    fs.statSync(file);
    return true;
  } catch (e) {
    return false;
  }
}

var handler = function(request, response) {
  if (request.method == 'POST') {
    // Store the request body in the data file.
    var file = fs.createWriteStream(dataFile);
    file.on('open', function (fd) {
      request.pipe(file);
      console.log("New data has been received and stored.");
      response.writeHead(200);
      response.end("Data stored on pod " + os.hostname() + "\n");
    });
  } else {
    // Return the hostname and the stored data (if any).
    var data = fileExists(dataFile) ? fs.readFileSync(dataFile, 'utf8') : "No data posted yet";
    response.writeHead(200);
    response.write("You've hit " + os.hostname() + "\n");
    response.end("Data stored on this pod: " + data + "\n");
  }
};

var www = http.createServer(handler);
www.listen(8080);

Whenever the application receives a POST request, it writes the data received in the request body to a file /var/data/kubia.txt. On receiving a GET request, it returns the hostname and the stored data (file contents).

The Dockerfile file is as follows:

FROM node:7
ADD app.js /app.js
ENTRYPOINT ["node", "app.js"]
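
If you want to build and push the image yourself rather than use the book's prebuilt luksa/kubia-pet image, something like the following should work (the repository placeholder is yours to fill in):

$ docker build -t <your-docker-id>/kubia-pet .
$ docker push <your-docker-id>/kubia-pet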

10.3.2 Deploying the app through a StatefulSet

To deploy your application, you need to create two (or three) different types of objects:

  1. The PersistentVolumes used to store the data files (if the cluster does not support dynamic provisioning of PersistentVolumes, you need to create them manually).
  2. The governing Service required by the StatefulSet.
  3. The StatefulSet itself.

For each Pod instance, the StatefulSet creates a PersistentVolumeClaim that is then bound to a PersistentVolume. If your cluster supports dynamic provisioning, you don't need to create any PersistentVolumes manually (and you can skip the next section). If it doesn't, create them as described in the next section.

Create the persistent volumes

You will need three PersistentVolumes because you will be scaling the StatefulSet up to three replicas. If you plan to scale it to more replicas, you need to create correspondingly more PersistentVolumes.

Create the PersistentVolumes using the following file:

# mongodb-pv.yaml
# 1. Create the PersistentVolumes (PVs)
apiVersion: v1
kind: PersistentVolume
metadata:
  # PV name
  name: mongodb-pv1
spec:
  # capacity
  capacity:
    # storage size: 100Mi
    storage: 100Mi
  # access modes supported by this volume
  accessModes:
    - ReadWriteOnce # RWO: can be mounted read-write by a single node
    - ReadOnlyMany  # ROX: can be mounted read-only by many nodes
  # reclaim policy: retain the volume after the claim is released
  persistentVolumeReclaimPolicy: Retain
  # actual storage backing this PV: a hostPath volume
  hostPath:
    path: /tmp/mongodb1
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongodb-pv2
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteOnce # RWO
    - ReadOnlyMany  # ROX
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/mongodb2
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongodb-pv3
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteOnce # RWO
    - ReadOnlyMany  # ROX
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/mongodb3

$ kubectl apply -f mongodb-pv.yaml
persistentvolume/mongodb-pv1 created
persistentvolume/mongodb-pv2 created
persistentvolume/mongodb-pv3 created

$ kubectl get pv
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
mongodb-pv1   100Mi      RWO,ROX        Retain           Available                                   5s
mongodb-pv2   100Mi      RWO,ROX        Retain           Available                                   5s
mongodb-pv3   100Mi      RWO,ROX        Retain           Available                                   5s

Create the governing Service

As mentioned earlier, before deploying a StatefulSet you need to create the headless Service that provides the network identity for your stateful Pods. The following listing shows the Service definition:

# kubia-service-headless.yaml
apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  clusterIP: None
  selector:
    app: kubia
  ports:
  - name: http
    port: 80
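
Create the Service before (or together with) the StatefulSet:

$ kubectl create -f kubia-service-headless.yaml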

Create the StatefulSet manifest

The configuration file is as follows:

# kubia-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kubia
spec:
  serviceName: kubia
  replicas: 2
  selector:
    matchLabels:
      app: kubia # has to match .spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia-pet
        ports:
        - name: http
          containerPort: 8080
        volumeMounts:
        - name: data
          mountPath: /var/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      resources:
        requests:
          storage: 1Mi
      accessModes:
      - ReadWriteOnce

The StatefulSet manifest isn't that different from the ReplicaSet or Deployment manifests you created earlier. What's new is the volumeClaimTemplates list. In it, you define one volume claim template named data, which is used to create a PersistentVolumeClaim for each Pod. Note that the Pod template itself contains no persistentVolumeClaim volume; the StatefulSet adds that volume to each Pod automatically and points it at the claim created for that Pod.

Create the StatefulSet

$ kubectl create -f kubia-statefulset.yaml
statefulset.apps/kubia created

$ kubectl get pod
NAME      READY   STATUS    RESTARTS   AGE
kubia-0   1/1     Running   0          57s
kubia-1   1/1     Running   0          39s

Notice that the second Pod was created only after the first one was up and running: a StatefulSet creates its Pods one at a time, waiting for the previous Pod to be ready before creating the next one.
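
If you want to observe this yourself, you can watch the Pods as the StatefulSet creates them (the app=kubia label selector matches the labels in the Pod template above):

$ kubectl get po -l app=kubia --watch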

Examine the generated Pod

Let’s take a closer look at the spec of the first pod in the listing below to understand how StatefulSet builds the pod based on the pod template and the PersistentVolumeClaim template.

$ kubectl get po kubia-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-06-04T08:23:10Z"
  generateName: kubia-
  labels:
    app: kubia
    controller-revision-hash: kubia-c94bcb69b
    statefulset.kubernetes.io/pod-name: kubia-0
  name: kubia-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: kubia
    uid: fe22c872-ed23-4497-aa63-489b28eb4ae3
  resourceVersion: "3898989"
  uid: 4d66ebd8-014c-4cb2-ae78-9e94d925b133
spec:
  containers:
  - image: luksa/kubia-pet
    imagePullPolicy: Always
    name: kubia
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zm4s7
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: kubia-0
  nodeName: yjq-k8s2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: kubia
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: data
    persistentVolumeClaim: # volume created by the StatefulSet
      claimName: data-kubia-0
  - name: kube-api-access-zm4s7
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

The PersistentVolumeClaim template was used to create the PersistentVolumeClaim, and the Pod contains a volume that references that claim.

$ kubectl get pvc
NAME           STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-kubia-0   Bound    mongodb-pv1   100Mi      RWO,ROX                     6m15s
data-kubia-1   Bound    mongodb-pv2   100Mi      RWO,ROX                     5m57s

You can see that the PersistentVolumeClaim is indeed created.

10.3.3 Playing with your pods

Communicate with Pods using an API server

A useful feature of the API server is the ability to proxy connections directly to individual pods. If you want to perform a request to the kubia-0 pod, you can use the following URL:

<apiServerHost>:<port>/api/v1/namespaces/default/pods/kubia-0/proxy/<path>

It is more convenient to run kubectl proxy and talk to the API server through it:

$ kubectl proxy
Starting to serve on 127.0.0.1:8001

Since you will be communicating with the API server through the kubectl proxy, localhost:8001 will be used instead of the actual API server host and port. You would send a request to the kubia-0 pod like this:

$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: No data posted yet

The response shows that the request was indeed received and processed by the application running in the kubia-0 pod.

Because you are communicating with the pod through the API server, and you are connecting to the API server through kubectl proxy, the request goes through two different proxies (the first is the kubectl proxy, the second is the API server, which proxies the request to the pod). For a clearer picture, see Figure 10.10.

[Figure 10.10: The request passes through the kubectl proxy and the API server proxy before reaching the pod]

The request you send to the pod is a GET request, but you can also send a POST request through the API server. This is accomplished by sending the POST request to the same proxy URL as the GET request.

When your application receives a POST request, it stores the contents of the request body in a local file. Send a POST request to the kubia-0 pod:

$ curl -X POST -d "Hey there! This greeting was submitted to kubia-0." localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
Data stored on pod kubia-0

The data you sent should now be stored in the pod. Let's perform the GET request again and see if it returns the stored data:

$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.

Okay, so far so good. Now let's see what the other cluster node (kubia-1 pod) returns:

$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-1/proxy/
You've hit kubia-1
Data stored on this pod: No data posted yet

As expected, each node has its own state. But is this state persisted? Let's verify this.

Check whether the Pod's state is persistent

Delete a Pod and check:

$ kubectl delete pod kubia-0
pod "kubia-0" deleted

$ kubectl get po
NAME      READY   STATUS        RESTARTS   AGE
kubia-0   1/1     Terminating   0          20m
kubia-1   1/1     Running       0          19m

$ kubectl get pod
NAME      READY   STATUS              RESTARTS   AGE
kubia-0   0/1     ContainerCreating   0          14s
kubia-1   1/1     Running             0          20m

You can see that a Pod with the same name is created.

This new pod may be scheduled to any node in the cluster, not necessarily the node where the previous pod is located. The old pod's entire identity (name, hostname, and storage) is effectively moved to the new node (as shown in Figure 10.11).

[Figure 10.11: A rescheduled stateful pod keeps its name, hostname, and storage even when it moves to a different node]

Now that the new pod is running, let's check if it has the same identity as before:

$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.

You can see that the previously stored data is also persisted.

Scaling a StatefulSet

Scaling a StatefulSet is not much different from deleting a Pod and having the StatefulSet recreate it immediately. Remember that scaling down only deletes the Pods, leaving the PersistentVolumeClaims untouched.

It is important to remember that scaling (down or up) is performed gradually, just as Pods are created one by one when the StatefulSet is first created. When scaling down by more than one instance, the Pod with the highest ordinal index is deleted first, and the Pod with the next highest ordinal is deleted only after the first one has fully terminated.
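
For example, scaling the kubia StatefulSet up to three replicas (or back down) can be done with kubectl scale; the Pods are added or removed one at a time, in ordinal order:

$ kubectl scale statefulset kubia --replicas=3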

Access services within the cluster

Instead of running a throwaway Pod to access the Service from inside the cluster, you can use the API server's proxy feature to access the Service the same way you accessed individual Pods.

First create the following Service:

# kubia-service-public.yaml
apiVersion: v1
kind: Service
metadata:
  name: kubia-public
spec:
  selector:
    app: kubia
  ports:
  - port: 80
    targetPort: 8080

The form of the URI path from the proxy request to the Service is as follows:

/api/v1/namespaces/<namespace>/services/<service name>/proxy/<path>

Therefore, you can run curl on your local machine and access the Service through kubectl proxy as follows (you ran kubectl proxy before and it should still be running):

$ curl localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.

Likewise, clients inside the cluster can use the kubia-public Service to store data in and read data from the clustered data store. Of course, each request lands on a random cluster node, so you get the data of a random node every time. You'll improve this next.

10.4 Discovering peers in a StatefulSet

An important requirement of clustered applications is peer discovery: each StatefulSet member needs to be able to easily find all the other members. It could do that by talking to the API server, but one of Kubernetes' goals is to expose features that help applications stay completely Kubernetes-agnostic, so having applications talk to the Kubernetes API is undesirable.

So, how does a Pod discover its peers without communicating with the API? Is there an existing, well-known technology that can achieve this? What about the Domain Name System (DNS)? Depending on how much you know about DNS, you may know the purpose of A, CNAME, or MX records. In addition, there are other less well-known types of DNS records, one of which is the SRV record.

A, CNAME, and MX records are common DNS record types used to resolve domain names to IP addresses or other names. An A record resolves a domain name to an IPv4 address, a CNAME record creates an alias pointing to another domain name, and an MX record specifies the mail server that accepts email for the domain.

SRV records are used to point to the hostname and port of a server that provides a specific service. Kubernetes creates SRV records to point to the hostname of the Pod behind the headless service.

You can list the SRV records for your stateful Pods by running the dig DNS lookup tool inside a temporary new Pod. Use the following command:

$ kubectl run -it srvlookup --image=tutum/dnsutils --rm --restart=Never -- dig SRV kubia.default.svc.cluster.local

This command runs a one-off Pod (--restart=Never) named srvlookup, attaches it to the console (-it), and deletes it as soon as it terminates (--rm). The Pod runs a single container from the tutum/dnsutils image and executes dig SRV kubia.default.svc.cluster.local. The relevant parts of the output are shown below.

......

;; ANSWER SECTION:
kubia.default.svc.cluster.local. 30 IN  SRV     0 50 80 kubia-1.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 30 IN  SRV     0 50 80 kubia-0.kubia.default.svc.cluster.local.

;; ADDITIONAL SECTION:
kubia-0.kubia.default.svc.cluster.local. 30 IN A 10.244.1.126
kubia-1.kubia.default.svc.cluster.local. 30 IN A 10.244.1.125

;; Query time: 2 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Sun Jun 04 12:50:25 UTC 2023
;; MSG SIZE  rcvd: 350

pod "srvlookup" deleted

The ANSWER SECTION shows two SRV records pointing to the hostnames of the two Pods backing the headless Service. Each Pod also has its own A record, shown in the ADDITIONAL SECTION.

To get a list of all other Pods of a StatefulSet, just perform an SRV DNS lookup. For example, here's how to perform this lookup in Node.js:

dns.resolveSrv("kubia.default.svc.cluster.local", callBackFunction);

You will use this command in your application to enable each Pod to discover its peers.

Note: The returned SRV records are in random order because they all have the same priority. Don't expect to always see kubia-0 listed before kubia-1.

10.4.1 Implementing peer discovery through DNS

Your Stone Age data store isn't clustered yet. Each data store node runs completely independently of the others; no communication happens between them. Next, you'll get them talking to each other.

Data posted by clients through the kubia-public Service lands on a random cluster node. The cluster can store multiple data entries, but clients currently have no good way to see all of them. Because the Service forwards requests to Pods randomly, a client that wants the data from all the Pods would have to keep making requests until it hit every Pod.

You can improve this issue by having the node return data for all cluster nodes. To do this, the node needs to find all peers. You will use your learning about StatefulSets and SRV records to achieve this purpose.

You will modify the application's source code, as shown in the following example:

const http = require('http');
const os = require('os');
const fs = require('fs');
const dns = require('dns');

const dataFile = "/var/data/kubia.txt";
const serviceName = "kubia.default.svc.cluster.local";
const port = 8080;

// Returns true if the data file already exists.
function fileExists(file) {
  try {
    fs.statSync(file);
    return true;
  } catch (e) {
    return false;
  }
}

// Performs an HTTP GET request and passes the response body to the callback.
function httpGet(reqOptions, callback) {
  return http.get(reqOptions, function(response) {
    var body = '';
    response.on('data', function(d) { body += d; });
    response.on('end', function() { callback(body); });
  }).on('error', function(e) {
    callback("Error: " + e.message);
  });
}

var handler = function(request, response) {
  if (request.method == 'POST') {
    // Store the request body in the data file.
    var file = fs.createWriteStream(dataFile);
    file.on('open', function (fd) {
      request.pipe(file);
      response.writeHead(200);
      response.end("Data stored on pod " + os.hostname() + "\n");
    });
  } else {
    response.writeHead(200);
    if (request.url == '/data') {
      // Peers call /data to retrieve only this node's local data.
      var data = fileExists(dataFile) ? fs.readFileSync(dataFile, 'utf8') : "No data posted yet";
      response.end(data);
    } else {
      // Look up all peers through the headless service's SRV records
      // and collect the data stored on each of them.
      response.write("You've hit " + os.hostname() + "\n");
      response.write("Data stored in the cluster:\n");
      dns.resolveSrv(serviceName, function (err, addresses) {
        if (err) {
          response.end("Could not look up DNS SRV records: " + err);
          return;
        }
        var numResponses = 0;
        if (addresses.length == 0) {
          response.end("No peers discovered.");
        } else {
          // Query each peer's /data endpoint and append its response.
          addresses.forEach(function (item) {
            var requestOptions = {
              host: item.name,
              port: port,
              path: '/data'
            };
            httpGet(requestOptions, function (returnedData) {
              numResponses++;
              response.write("- " + item.name + ": " + returnedData + "\n");
              if (numResponses == addresses.length) {
                response.end();
              }
            });
          });
        }
      });
    }
  }
};

var www = http.createServer(handler);
www.listen(port);

Figure 10.12 shows what happens when the application receives a GET request. The server that receives the request first performs an SRV record lookup for the headless kubia Service, then sends a GET request to every Pod backing the Service (even to itself, which is obviously unnecessary, but I wanted to keep the code as simple as possible). It then returns a list of all the nodes along with the data stored on each of them.

[Figure 10.12: Handling a GET request: an SRV lookup for the headless Service, then a GET to each peer's /data endpoint]

The container image containing the new version of this application is located at docker.io/luksa/kubia-pet-peers.

10.4.2 Updating a StatefulSet

Your StatefulSet is already running, so let's see how to update its Pod template so the Pods use the new image. While doing that, you'll also set the replica count to 3. To update the StatefulSet, use the kubectl edit command (kubectl patch would be another option).
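
For example (the StatefulSet created earlier is named kubia):

$ kubectl edit statefulset kubia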

This opens the StatefulSet definition in your default editor. In the definition, change spec.replicas to 3 and modify the spec.template.spec.containers.image attribute so it points to the new image (luksa/kubia-pet-peers instead of luksa/kubia-pet). Save the file and exit the editor to update the StatefulSet. Two replicas were running previously, so you should now see an additional replica named kubia-2 starting up. List the Pods to confirm:

$ kubectl get pod
NAME      READY   STATUS              RESTARTS   AGE
kubia-0   1/1     Running             0          4h16m
kubia-1   1/1     Running             0          4h36m
kubia-2   0/1     ContainerCreating   0          12s

The new Pod instance is running the new image. But what about the two existing replicas? Judging from their age, they don't seem to have been updated. This is expected: initially, StatefulSets behaved more like ReplicaSets than Deployments and did not perform a rolling update when the template was modified. You need to delete the old replicas manually, and the StatefulSet will recreate them based on the new template.
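
For example, deleting the two old replicas:

$ kubectl delete po kubia-0 kubia-1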

Note: Starting with Kubernetes 1.7, StatefulSets support rolling updates the same way Deployments and DaemonSets do. For more information, use kubectl explain to see the documentation of the StatefulSet's spec.updateStrategy field.
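
As a rough sketch, a rolling update strategy is requested in the StatefulSet spec like this (RollingUpdate is the default in recent Kubernetes versions; the optional partition field shown here keeps Pods with an ordinal lower than the given value on the old revision):

# excerpt of a StatefulSet spec
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0   # 0 = update all Pods; a higher value stages a partial rollout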

10.4.3 Trying out your clustered data store

Once all your Pods are running the new image, you can see whether your brand-new Stone Age data store works as expected. Post a few requests to the cluster, as shown below.

$ curl -X POST -d "The sun is shining" localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
Data stored on pod kubia-1

$ curl -X POST -d "The weather is sweet" localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
Data stored on pod kubia-2

Now, follow the example below to read the stored data.

$ curl localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
You've hit kubia-0
Data stored in the cluster:
- kubia-0.kubia.default.svc.cluster.local: Hey there! This greeting was submitted to kubia-0.
- kubia-1.kubia.default.svc.cluster.local: The sun is shining
- kubia-2.kubia.default.svc.cluster.local: The weather is sweet

When a client request reaches one of your cluster nodes, that node discovers all its peers, gathers their data, and sends all of it back to the client. Even if you scale the StatefulSet up or down, the Pod serving the client's request can always find all the peers running at that time.

10.5 Understanding how StatefulSets deal with node failures

In Section 10.2.4, we mentioned that Kubernetes must ensure that a stateful pod has stopped running before creating its replacement. When a node fails suddenly, Kubernetes has no way of knowing the state of the node or of its pods. It cannot tell whether the pods have stopped running, or whether they are still running and perhaps even still reachable, with only the Kubelet having stopped reporting the node's state to the master.

Because a StatefulSet guarantees that two pods with the same identity and storage will never run at the same time, when a node fails, a StatefulSet cannot and should not create replacement pods until it is confident that the previous pod has stopped running.

A StatefulSet can be sure about the state of a pod only when the cluster administrator tells it. The administrator can do that by deleting the pod, or by deleting the whole node, which causes all the pods on that node to be deleted.
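
For example, if you're certain a failed node (and the pod on it) is really down, you can force-delete the stateful pod so the StatefulSet can recreate it elsewhere. Use this with care, because it skips the normal confirmation that the pod has actually stopped:

$ kubectl delete po kubia-0 --force --grace-period 0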
