K8s in Action reading notes - [10] StatefulSets: deploying replicated stateful applications
10.1 Replicating stateful pods
A ReplicaSet creates multiple Pod replicas from a single Pod template, and the replicas differ only in name and IP address. If the Pod template contains a volume that references a specific PersistentVolumeClaim, every replica of the ReplicaSet will use that exact same PersistentVolumeClaim, and therefore the same PersistentVolume bound by the claim (as shown in Figure 10.1).
Because the reference to the claim lives in the Pod template that is used to stamp out all the replicas, there is no way to make each replica use its own independent PersistentVolumeClaim. At least with a single ReplicaSet, you therefore cannot run a distributed data store in which each instance needs its own independent storage. Other solutions are needed.
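The shared-claim problem is easy to see in a manifest. Below is a minimal illustrative sketch (the names data-rs and shared-claim are made up for this example): a ReplicaSet whose Pod template mounts one specific PersistentVolumeClaim, so all three replicas bind the very same claim and volume.

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: data-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: data
  template:
    metadata:
      labels:
        app: data
    spec:
      containers:
      - name: data
        image: busybox
        volumeMounts:
        - name: data
          mountPath: /var/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: shared-claim   # all 3 replicas reference this single PVC
```

Because claimName is fixed in the template, scaling the ReplicaSet can never give a replica its own storage.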
10.1.1 Running multiple replicas with separate storage for each
How can you run multiple replicas of a Pod and have each one use its own storage volume? A ReplicaSet stamps out exact copies of a Pod, so it cannot be used for such Pods. What can be used instead?
Create Pod manually
You could create the Pods manually and have each one use its own PersistentVolumeClaim, but since no ReplicaSet looks after them, you would need to manage them yourself and recreate them by hand when they disappear (as on a node failure). Therefore, this is not a viable option.
Use one ReplicaSet per Pod instance
You can create multiple ReplicaSets, one per Pod instance, with each ReplicaSet's desired replica count set to 1 and each ReplicaSet's Pod template referencing a dedicated PersistentVolumeClaim (as shown in Figure 10.2).
While this provides automatic rescheduling in the event of a node failure or an accidental Pod deletion, it is much more cumbersome than a single ReplicaSet. Consider, for example, how you would scale the Pods in this setup: you cannot simply change a desired replica count; instead, you have to create additional ReplicaSets.
Therefore, using multiple ReplicaSets is not the best solution.
Use multiple directories on the same storage volume
Another trick is to use the same PersistentVolume for all Pods but create a separate file directory for each Pod inside that volume (shown in Figure 10.3).
Since replicas cannot be configured differently from a single Pod template, you cannot tell each instance which directory to use. What you can do is have each instance automatically choose (and possibly create) a data directory that no other instance is currently using. This solution requires coordination between the instances and is not easy to implement correctly.
10.1.2 Providing a stable identity for each pod
Some applications require a stable network identity for each instance in addition to storage. From time to time, a Pod may be terminated and replaced with a new Pod. When a ReplicaSet replaces a Pod, the new Pod is a completely new Pod with a new hostname and IP address, although the data in its storage volume may be the data of the terminated Pod. For some applications, starting with data from an old instance but with a completely new network identity can cause problems.
Why do some applications require stable network identity? This requirement is very common in distributed stateful applications. Some applications require administrators to list all other cluster members and their IP addresses (or hostnames) in each member's configuration file. But in Kubernetes, every time a Pod is rescheduled, the new Pod gets a new hostname and a new IP address, so the entire application cluster needs to be reconfigured every time a member is rescheduled.
Use a dedicated Service for each Pod instance
To solve this problem, you can use a trick to provide stable network addresses for cluster members by creating a dedicated Kubernetes Service for each individual member. Because the service IP is stable, you can point to each member in the configuration by its service IP (rather than the Pod IP).
This is similar to creating a ReplicaSet for each member to provide them with independent storage, as described earlier. Combining these two techniques results in the setup shown in Figure 10.4 (additional services covering the entire cluster membership are also shown, since a client for the cluster is usually required).
Not only is this solution ugly, it still doesn't solve everything. The individual Pods have no way of knowing which Service they are exposed through (and thus what their stable IP is), so they cannot use that IP to self-register with the other Pods.
Fortunately, Kubernetes provides us with StatefulSet .
10.2 Understanding StatefulSets
10.2.1 Comparing StatefulSets with ReplicaSets
Using the “pet” vs. “cattle” analogy to understand stateful Pods
You may have heard the analogy between “pets” and “cattle.” If not, let me explain. We can think of instances of an application as pets or cattle.
Note: StatefulSets were originally called PetSets. The name comes from the analogy between "pets" and "cattle" explained here.
We tend to treat application instances as pets, giving each one a name and individual attention. But it is usually better to treat instances as cattle and pay no special attention to any individual one. That makes it easy to replace unhealthy instances without a second thought, much like a farmer replaces unhealthy cattle.
An instance of a stateless application behaves much like a cow in a herd. If an instance dies, you can create a new one and people won't notice any difference.
With stateful applications, however, an application instance is more like a pet. When a pet dies, you can't simply buy a new one and expect people not to notice. To replace a lost pet, you need to find a new one that looks and behaves exactly like the old one. For an application, this means the new instance needs to have the same state and identity as the old one.
Compare StatefulSets to ReplicaSets or ReplicationControllers
Pod replicas managed by a ReplicaSet or ReplicationController are like a herd of cattle. Because they are mostly stateless, they can be replaced at any time with a completely fresh Pod. Stateful Pods need a different approach: when a stateful Pod instance dies (or the node it is running on fails), the instance needs to be resurrected on another node, and the new instance must get the same name, network identity, and state as the one it replaces. This is exactly what a StatefulSet does when it manages Pods.
A StatefulSet makes sure Pods are rescheduled in a way that preserves their identity and state. It also lets you easily scale the number of pets: like a ReplicaSet, a StatefulSet has a desired replica count field that determines how many pets you want running at a given time, and Pods are created from a Pod template specified as part of the StatefulSet. But unlike Pods created by a ReplicaSet, Pods created by a StatefulSet are not exact copies of each other. Each can have its own set of volumes, making it distinct from its peers, and each pet Pod gets a predictable (and stable) identity instead of a completely random one.
10.2.2 Providing a stable network identity
GOVERNING SERVICE
Each Pod created by a StatefulSet is assigned an ordinal index (zero-based), which is used to derive the Pod's name and hostname and to attach stable storage to the Pod. Pod names are therefore predictable, because each Pod's name comes from the StatefulSet's name plus the ordinal index of the instance. Unlike Pods with random names, these Pods are named in a well-organized way, as shown in the figure below.
Unlike regular Pods, stateful Pods sometimes need to be addressed by their hostname, while stateless Pods usually do not. After all, every stateless Pod is just like any other Pod. You can choose any of them when you need one. But with stateful Pods, you usually want to operate on a specific Pod.
Therefore, a StatefulSet requires you to create a corresponding governing headless Service that provides the actual network identity for each Pod. Through this Service, each Pod gets its own DNS entry, so its peers (and possibly other clients in the cluster) can address the Pod by hostname. For example, if the governing Service lives in the default namespace and is named foo, and one of the Pods is named a-0, you can reach that Pod through its fully qualified domain name: a-0.foo.default.svc.cluster.local. This is not possible with Pods managed by a ReplicaSet.
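The naming scheme is mechanical enough to sketch in a few lines of JavaScript. This is not a Kubernetes API, just string composition mirroring the rules described above:

```javascript
// Pod name = <StatefulSet name>-<ordinal>; the FQDN appends the governing
// Service name, the namespace, and the cluster domain suffix.
function podName(setName, ordinal) {
  return setName + "-" + ordinal;
}

function podFqdn(setName, ordinal, serviceName, namespace) {
  return podName(setName, ordinal) + "." + serviceName + "." +
         namespace + ".svc.cluster.local";
}

console.log(podFqdn("a", 0, "foo", "default"));
// a-0.foo.default.svc.cluster.local
```

This matches the foo/a-0 example given above.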
Additionally, you can use DNS to find the names of all Pods in the StatefulSet.
Replace Pod
When a Pod instance managed by a StatefulSet disappears (because the node where the Pod is located fails, the Pod is evicted from the node, or someone manually deletes the Pod object), the StatefulSet will ensure that it is replaced with a new instance, similar to what ReplicaSet does. But unlike ReplicaSet, the new Pod will get the same name and host name as the disappeared Pod (the difference between ReplicaSet and StatefulSet is shown in Figure 10.6).
New Pods are not necessarily scheduled to the same nodes, but as you learned earlier, it doesn't matter which node the Pod runs on. Even if the Pod is scheduled on a different node, it will still be available and accessible under the previous hostname.
Expansion and contraction of StatefulSet
Scaling up a StatefulSet creates a new Pod instance with the next unused ordinal index. If you scale up from two instances to three, the new instance gets index 2 (the existing instances obviously use indexes 0 and 1).
A nice thing about scaling down a StatefulSet is that you always know which Pod will be removed. Again, this contrasts with scaling down a ReplicaSet, where you have no idea which instance will be deleted and cannot even specify which one should be removed first (though that feature may be introduced in the future). Scaling down a StatefulSet always removes the instance with the highest ordinal index first (as shown in Figure 10.7), which makes the effects of a scale-down predictable.
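The highest-ordinal-first rule can be stated in a few lines of code. This is purely illustrative; the controller, not your application, makes this decision:

```javascript
// Given current and desired replica counts, return the ordinals a
// scale-down removes, in removal order (highest index first, one at a time).
function ordinalsRemoved(current, desired) {
  var removed = [];
  for (var i = current - 1; i >= desired; i--) {
    removed.push(i);
  }
  return removed;
}

console.log(ordinalsRemoved(3, 1)); // [ 2, 1 ]
```

Scaling from 3 to 1 removes kubia-2 first, then kubia-1, leaving kubia-0 running.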
Because certain stateful applications do not handle rapid scale-downs well, StatefulSets scale down only one Pod instance at a time. A distributed data store, for example, may lose data if multiple nodes go offline at once: if the store keeps two copies of each data entry and two nodes shut down simultaneously, any entry whose two copies happen to live on exactly those nodes is lost. With a sequential scale-down, the data store has time to create an additional replica of the affected entries elsewhere, replacing the (single) lost copy.
For this exact reason, StatefulSets also do not permit scale-down operations while any of the instances is unhealthy. If an instance is unhealthy and you scaled down by one at the same time, you would effectively lose two cluster members at once.
10.2.3 Providing stable dedicated storage to each stateful instance
You've seen how StatefulSets ensure that stateful Pods have stable identities, but how is storage handled? Each stateful Pod instance requires its own storage, and if a stateful Pod is rescheduled (replaced with a new instance with the same identity as before), the new instance must come with the same storage. How do StatefulSets achieve this?
Obviously, a stateful Pod's storage needs to be persistent and decoupled from the Pod itself. In Chapter 6, you learned about PersistentVolumes and PersistentVolumeClaims, which allow persistent storage to be attached to a Pod by having the Pod reference a PersistentVolumeClaim. Because a PersistentVolumeClaim maps to a PersistentVolume one-to-one, each Pod instance in a StatefulSet needs to reference a different PersistentVolumeClaim to have its own independent PersistentVolume. But since all Pod instances are stamped from the same Pod template, how can they reference different PersistentVolumeClaims? And who creates those claims?
Teaming the Pod template up with volume claim templates
A StatefulSet not only creates Pods, it creates PersistentVolumeClaims too, in the same way it creates Pods. For this, a StatefulSet can have one or more volume claim templates, which are used to stamp out a PersistentVolumeClaim along with each Pod instance (see Figure 10.8).
The PersistentVolumes backing those claims can either be provisioned up front by an administrator or created just in time through dynamic provisioning of PersistentVolumes.
Creation and deletion of PersistentVolumeClaims
When you scale a StatefulSet up by one instance, it creates two or more API objects: the Pod, plus one or more PersistentVolumeClaims referenced by the Pod. Scaling down, however, deletes only the Pod and leaves the claims untouched. The reason is that the consequences of deleting a claim are serious: once a PersistentVolumeClaim is deleted, the PersistentVolume it was bound to gets recycled or deleted, and the data in it is lost.
PersistentVolumeClaims need to be removed manually to release the underlying PersistentVolume.
Because the PersistentVolumeClaim survives a scale-down, a subsequent scale-up can reattach the same claim, together with the bound PersistentVolume and its contents, to the new Pod instance (as shown in Figure 10.9). If you scale a StatefulSet down by mistake, you can undo the error by scaling up again, and the new Pod gets the same persisted state (and the same name) as before.
10.2.4 Understanding StatefulSet guarantees
Regular stateless Pods are interchangeable, while stateful Pods are not. We have seen that a stateful Pod is always replaced with an identical Pod (with the same name and hostname, using the same persistence store, etc.). When Kubernetes discovers that an old Pod no longer exists (such as manually deleting the Pod), it replaces it.
But what if Kubernetes cannot determine the state of the Pod? If it created a replacement Pod with the same identity, two instances of the application with the same identity might end up running in the system. The two would also be bound to the same storage, so two processes with the same identity would be writing to the same files at the same time. With Pods managed by a ReplicaSet this is not a problem, since such applications are clearly meant to work on the same files anyway, and ReplicaSets create Pods with randomly generated identities, so there is no chance of two processes running with the same identity.
Therefore, Kubernetes must take great care to ensure that two stateful Pod instances with the same identity, bound to the same PersistentVolumeClaim, never run at the same time. A StatefulSet must provide at-most-once semantics for stateful Pod instances.
This means that the StatefulSet must ensure that the old Pod has stopped running before creating a replacement Pod. This has significant implications for handling node failures, which we will demonstrate later in this chapter.
10.3 Using a StatefulSet
10.3.1 Creating the app and container image
Extend the previous kubia app so that each Pod instance can store and retrieve a single data entry. The app looks like this:
const http = require('http');
const os = require('os');
const fs = require('fs');
const dataFile = "/var/data/kubia.txt";
function fileExists(file) {
try {
fs.statSync(file);
return true;
} catch (e) {
return false;
}
}
var handler = function(request, response) {
if (request.method == 'POST') {
var file = fs.createWriteStream(dataFile);
file.on('open', function (fd) {
request.pipe(file);
console.log("New data has been received and stored.");
response.writeHead(200);
response.end("Data stored on pod " + os.hostname() + "\n");
});
} else {
var data = fileExists(dataFile) ? fs.readFileSync(dataFile, 'utf8') : "No data posted yet";
response.writeHead(200);
response.write("You've hit " + os.hostname() + "\n");
response.end("Data stored on this pod: " + data + "\n");
}
};
var www = http.createServer(handler);
www.listen(8080);
Whenever the application receives a POST request, it writes the data from the request body to the file /var/data/kubia.txt. On a GET request, it returns the hostname and the stored data (the contents of the file).
The Dockerfile file is as follows:
FROM node:7
ADD app.js /app.js
ENTRYPOINT ["node", "app.js"]
10.3.2 Deploying the app through a StatefulSet
To deploy your application, you need to create two (or three) different types of objects:
- PersistentVolumes used to store data files (if the cluster does not support dynamic provision of PersistentVolumes, you need to create them manually).
- Controlling Service required by StatefulSet.
- StatefulSet itself.
For each Pod instance, the StatefulSet will create a PersistentVolumeClaim that will be bound to a PersistentVolume. If your cluster supports dynamic provisioning, you do not need to manually create any PersistentVolumes (you can skip the next section). If it is not supported, you will need to create it as described in the next section.
Create a persistent volume
You will need three PersistentVolumes, because you will later scale the StatefulSet up to three replicas. If you plan to scale it to more replicas, you need to create correspondingly more PersistentVolumes.
Create a PersistentVolume using the following file:
# mongodb-pv.yaml
# Create the PersistentVolumes (PVs)
apiVersion: v1
kind: PersistentVolume
metadata:
  # PV name
  name: mongodb-pv1
spec:
  # capacity
  capacity:
    # storage size: 100MiB
    storage: 100Mi
  # access modes this volume supports
  accessModes:
  - ReadWriteOnce # RWO: can be mounted read-write by a single node
  - ReadOnlyMany  # ROX: can be mounted read-only by many nodes
  # reclaim policy: retain the volume
  persistentVolumeReclaimPolicy: Retain
  # actual storage backing this PV: a hostPath volume here
  hostPath:
    path: /tmp/mongodb1
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongodb-pv2
spec:
  capacity:
    storage: 100Mi
  accessModes:
  - ReadWriteOnce
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/mongodb2
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongodb-pv3
spec:
  capacity:
    storage: 100Mi
  accessModes:
  - ReadWriteOnce
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/mongodb3
$ kubectl apply -f mongodb-pv.yaml
persistentvolume/mongodb-pv1 created
persistentvolume/mongodb-pv2 created
persistentvolume/mongodb-pv3 created
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
mongodb-pv1 100Mi RWO,ROX Retain Available 5s
mongodb-pv2 100Mi RWO,ROX Retain Available 5s
mongodb-pv3 100Mi RWO,ROX Retain Available 5s
Create the governing Service
As mentioned earlier, before deploying a StatefulSet you first need to create a headless Service that will provide the network identity for your stateful Pods. The following listing shows the Service definition:
# kubia-service-headless.yaml
apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  clusterIP: None
  selector:
    app: kubia
  ports:
  - name: http
    port: 80
Create a StatefulSet configuration file
The configuration file is as follows:
# kubia-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kubia
spec:
  serviceName: kubia
  replicas: 2
  selector:
    matchLabels:
      app: kubia # has to match .spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia-pet
        ports:
        - name: http
          containerPort: 8080
        volumeMounts:
        - name: data
          mountPath: /var/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      resources:
        requests:
          storage: 1Mi
      accessModes:
      - ReadWriteOnce
The StatefulSet manifest is not much different from the ReplicaSet and Deployment manifests you created earlier. What's new is the volumeClaimTemplates list. In it, you define a volume claim template named data, which will be used to create a PersistentVolumeClaim for each Pod. Normally, a Pod claims storage by including a persistentVolumeClaim volume in its manifest, but you will not find such a volume in this Pod template: the StatefulSet adds it to each Pod's spec automatically.
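To make this concrete, here is a sketch of roughly what the claim generated for the first Pod looks like, derived from the template above (the real object also carries metadata set by the controller):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-kubia-0   # <template name>-<pod name>
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Mi
```

The name follows the `<template name>-<pod name>` pattern, which is why the claims you will see later are called data-kubia-0 and data-kubia-1.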
Create a StatefulSet
$ kubectl create -f kubia-statefulset.yaml
statefulset.apps/kubia created
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Running 0 57s
kubia-1 1/1 Running 0 39s
The Pods are created one at a time: each Pod is created only after the previous one is up and running.
Test the generated Pod
Let’s take a closer look at the spec of the first pod in the listing below to understand how StatefulSet builds the pod based on the pod template and the PersistentVolumeClaim template.
$ kubectl get po kubia-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-06-04T08:23:10Z"
  generateName: kubia-
  labels:
    app: kubia
    controller-revision-hash: kubia-c94bcb69b
    statefulset.kubernetes.io/pod-name: kubia-0
  name: kubia-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: kubia
    uid: fe22c872-ed23-4497-aa63-489b28eb4ae3
  resourceVersion: "3898989"
  uid: 4d66ebd8-014c-4cb2-ae78-9e94d925b133
spec:
  containers:
  - image: luksa/kubia-pet
    imagePullPolicy: Always
    name: kubia
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zm4s7
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: kubia-0
  nodeName: yjq-k8s2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: kubia
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: data
    persistentVolumeClaim: # volume created by the StatefulSet
      claimName: data-kubia-0
  - name: kube-api-access-zm4s7
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
The PersistentVolumeClaim template is used to create a PersistentVolumeClaim and a volume within a pod that references the created PersistentVolumeClaim.
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-kubia-0 Bound mongodb-pv1 100Mi RWO,ROX 6m15s
data-kubia-1 Bound mongodb-pv2 100Mi RWO,ROX 5m57s
You can see that the PersistentVolumeClaim is indeed created.
10.3.3 Playing with your pods
Communicate with Pods using an API server
A useful feature of the API server is the ability to proxy connections directly to individual pods. If you want to perform a request to the kubia-0 pod, you can use the following URL:
<apiServerHost>:<port>/api/v1/namespaces/default/pods/kubia-0/proxy/<path>
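Assembling that path is simple string work. A small helper, shown only to illustrate the URL structure (the function name is made up):

```javascript
// Build the API-server proxy path for a pod,
// e.g. for use with a local kubectl proxy.
function podProxyUrl(apiServer, namespace, pod, path) {
  return apiServer + "/api/v1/namespaces/" + namespace +
         "/pods/" + pod + "/proxy/" + path;
}

console.log(podProxyUrl("localhost:8001", "default", "kubia-0", ""));
// localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
```

The same shape is used for the curl requests that follow.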
It is more convenient to open the proxy directly:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
Since you will be communicating with the API server through the kubectl proxy, localhost:8001 will be used instead of the actual API server host and port. You would send a request to the kubia-0 pod like this:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: No data posted yet
The response shows that the request was indeed received and processed by the application running in the kubia-0 pod.
Because you are communicating with the Pod through the API server, and you are connecting to the API server through kubectl proxy, the request goes through two different proxies (the first is the kubectl proxy, the second is the API server, which proxies the request on to the Pod). For a clearer picture, refer to Figure 10.10.
The request you send to the pod is a GET request, but you can also send a POST request through the API server. This is accomplished by sending the POST request to the same proxy URL as the GET request.
When your application receives a POST request, it stores the contents of the request body in a local file. Send a POST request to the kubia-0 pod:
$ curl -X POST -d "Hey there! This greeting was submitted to kubia-0." localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
Data stored on pod kubia-0
The data you sent should now be stored in the pod. Let's perform the GET request again and see if it returns the stored data:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.
Okay, so far so good. Now let's see what the other cluster node (kubia-1 pod) returns:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-1/proxy/
You've hit kubia-1
Data stored on this pod: No data posted yet
As expected, each node has its own state. But is this state persisted? Let's verify this.
Check whether the Pod's state is persistent
Delete a Pod and check:
$ kubectl delete pod kubia-0
pod "kubia-0" deleted
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Terminating 0 20m
kubia-1 1/1 Running 0 19m
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
kubia-0 0/1 ContainerCreating 0 14s
kubia-1 1/1 Running 0 20m
You can see that a Pod with the same name is created.
This new pod may be scheduled to any node in the cluster, not necessarily the node where the previous pod is located. The old pod's entire identity (name, hostname, and storage) is effectively moved to the new node (as shown in Figure 10.11).
Now that the new pod is running, let's check if it has the same identity as before:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.
You can see that the previously stored data is also persisted.
Scale a StatefulSet
The way to scale a StatefulSet is not much different from deleting a Pod and immediately recreating it from the StatefulSet. Remember that scaling a StatefulSet only removes Pods and does not affect PersistentVolumeClaims.
It is important to remember that scaling (down or up) is performed step by step, just as Pods are created one by one when the StatefulSet itself is first created. When scaling down by more than one instance, the Pod with the highest ordinal is deleted first; only after it has terminated completely is the Pod with the next highest ordinal deleted.
Access services within the cluster
Instead of using a piggyback Pod to access the Service from within the cluster, you can use the proxy feature of the API server to access the Service, in the same way you accessed the individual Pods.
First create the following Service:
# kubia-service-public.yaml
apiVersion: v1
kind: Service
metadata:
name: kubia-public
spec:
selector:
app: kubia
ports:
- port: 80
targetPort: 8080
The URI path for proxying requests to a Service has the following form:
/api/v1/namespaces/<namespace>/services/<service name>/proxy/<path>
Therefore, you can run curl on your local machine and access the Service through kubectl proxy as follows (you ran kubectl proxy before and it should still be running):
$ curl localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.
Likewise, clients (within the cluster) can use the kubia-public Service to store data into and read data from the clustered data store. Of course, each request will land on a random cluster node , so you'll get data from a random node every time. You'll make improvements in the next step.
10.4 Discovering peers in a StatefulSet
An important requirement of clustered applications is peer discovery: each StatefulSet member needs to be able to easily find all the other members. A member could do this by talking to the API server, but one of Kubernetes' goals is to expose features that keep applications completely Kubernetes-agnostic, so having applications talk to the Kubernetes API is not advisable.
So, how does a Pod discover its peers without communicating with the API? Is there an existing, well-known technology that can achieve this? What about the Domain Name System (DNS)? Depending on how much you know about DNS, you may know the purpose of A, CNAME, or MX records. In addition, there are other less well-known types of DNS records, one of which is the SRV record.
A, CNAME, and MX records are the common DNS record types, used to resolve domain names to IP addresses or to other domain names. An A record resolves a domain name to an IPv4 address, a CNAME record creates an alias that points to another domain name, and an MX record specifies the mail server that accepts email for the domain.
SRV records are used to point to the hostname and port of a server that provides a specific service. Kubernetes creates SRV records to point to the hostname of the Pod behind the headless service.
You can list the SRV records for your stateful Pods by running the dig DNS lookup tool inside a temporary, throwaway Pod. Use the following command:
$ kubectl run -it srvlookup --image=tutum/dnsutils --rm --restart=Never -- dig SRV kubia.default.svc.cluster.local
This command runs a one-off Pod (--restart=Never) named srvlookup that is attached to the console (-it) and deleted as soon as it terminates (--rm). The Pod runs a single container from the tutum/dnsutils image, which executes the command dig SRV kubia.default.svc.cluster.local.
$ kubectl run -it srvlookup --image=tutum/dnsutils --rm --restart=Never -- dig SRV kubia.default.svc.cluster.local
......
;; ANSWER SECTION:
kubia.default.svc.cluster.local. 30 IN SRV 0 50 80 kubia-1.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 30 IN SRV 0 50 80 kubia-0.kubia.default.svc.cluster.local.
;; ADDITIONAL SECTION:
kubia-0.kubia.default.svc.cluster.local. 30 IN A 10.244.1.126
kubia-1.kubia.default.svc.cluster.local. 30 IN A 10.244.1.125
;; Query time: 2 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Sun Jun 04 12:50:25 UTC 2023
;; MSG SIZE rcvd: 350
pod "srvlookup" deleted
The ANSWER SECTION section shows two SRV records pointing to the hostnames of the two Pods that support the headless service. Each Pod also has its own A record, as shown in ADDITIONAL SECTION.
To get a list of all other Pods of a StatefulSet, just perform an SRV DNS lookup. For example, here's how to perform this lookup in Node.js:
dns.resolveSrv("kubia.default.svc.cluster.local", callBackFunction);
You will use this command in your application to enable each Pod to discover its peers.
Note: the order of the returned SRV records is random, because they all have the same priority. Don't expect to always see kubia-0 listed before kubia-1.
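If your application needs a deterministic ordering, it can sort the results itself before using them. A small sketch, assuming `addresses` has the shape that dns.resolveSrv passes to its callback (objects with name, port, priority, and weight fields):

```javascript
// Sorting by hostname yields kubia-0 before kubia-1 regardless of the
// order in which DNS returned the SRV records.
function sortPeers(addresses) {
  return addresses.slice().sort(function (a, b) {
    return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
  });
}

var peers = sortPeers([
  { name: "kubia-1.kubia.default.svc.cluster.local", port: 80 },
  { name: "kubia-0.kubia.default.svc.cluster.local", port: 80 }
]);
console.log(peers[0].name); // kubia-0.kubia.default.svc.cluster.local
```

Using slice() before sort() leaves the original array untouched.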
10.4.1 Implementing peer discovery through DNS
Your simple data store is not yet clustered: each data store node runs completely independently of the others, with no communication between them. Next, you'll get them talking to each other.
Data posted by clients through the kubia-public Service lands on a random cluster node. The cluster can store multiple data entries, but clients currently have no good way to see all of them. Because the Service forwards requests to the Pods randomly, a client that wants the data from all the Pods would have to keep making requests until it has hit every one of them.
You can improve this by having each node return the data from all the cluster nodes. To do that, a node needs to find all of its peers, and you will use what you learned about StatefulSets and SRV records to achieve this.
You will modify the application's source code, as shown in the following example:
const http = require('http');
const os = require('os');
const fs = require('fs');
const dns = require('dns');

const dataFile = "/var/data/kubia.txt";
const serviceName = "kubia.default.svc.cluster.local";
const port = 8080;

function fileExists(file) {
  try {
    fs.statSync(file);
    return true;
  } catch (e) {
    return false;
  }
}

function httpGet(reqOptions, callback) {
  return http.get(reqOptions, function (response) {
    var body = '';
    response.on('data', function (d) { body += d; });
    response.on('end', function () { callback(body); });
  }).on('error', function (e) {
    callback("Error: " + e.message);
  });
}

var handler = function (request, response) {
  if (request.method == 'POST') {
    var file = fs.createWriteStream(dataFile);
    file.on('open', function (fd) {
      request.pipe(file);
      response.writeHead(200);
      response.end("Data stored on pod " + os.hostname() + "\n");
    });
  } else {
    response.writeHead(200);
    if (request.url == '/data') {
      var data = fileExists(dataFile) ? fs.readFileSync(dataFile, 'utf8') : "No data posted yet";
      response.end(data);
    } else {
      response.write("You've hit " + os.hostname() + "\n");
      response.write("Data stored in the cluster:\n");
      dns.resolveSrv(serviceName, function (err, addresses) {
        if (err) {
          response.end("Could not look up DNS SRV records: " + err);
          return;
        }
        var numResponses = 0;
        if (addresses.length == 0) {
          response.end("No peers discovered.");
        } else {
          addresses.forEach(function (item) {
            var requestOptions = {
              host: item.name,
              port: port,
              path: '/data'
            };
            httpGet(requestOptions, function (returnedData) {
              numResponses++;
              response.write("- " + item.name + ": " + returnedData + "\n");
              if (numResponses == addresses.length) {
                response.end();
              }
            });
          });
        }
      });
    }
  }
};

var www = http.createServer(handler);
www.listen(port);
Figure 10.12 shows what happens when the application receives a GET request. The server that receives the request first performs an SRV record lookup for the headless kubia service, then sends a GET request to every Pod backing the service (even to itself, which is obviously unnecessary, but it keeps the code as simple as possible). It then returns a list of all nodes and the data stored on each of them.
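The response-counting logic in the handler above is a common fan-in pattern: fire off N asynchronous requests and finish only after the last callback arrives. Isolated as a small helper, it can be sketched like this (collectAll is an illustrative name, not from the book):

```javascript
// Runs every task (a function taking a callback) and invokes done(results)
// exactly once, after the last task has called back. Results keep the order
// of the tasks array, regardless of completion order.
function collectAll(tasks, done) {
  var results = new Array(tasks.length);
  var remaining = tasks.length;
  if (remaining === 0) return done(results);
  tasks.forEach(function (task, i) {
    task(function (value) {
      results[i] = value;
      if (--remaining === 0) done(results);
    });
  });
}

// Usage with immediate callbacks (real peer HTTP calls would be asynchronous):
collectAll(
  [function (cb) { cb('a'); }, function (cb) { cb('b'); }],
  function (results) { console.log(results.join(',')); } // logs "a,b"
);
```

In the handler, the equivalent of `done` is the final `response.end()` call, which must not run until every peer's data has been written.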
The container image containing the new version of this application is located at docker.io/luksa/kubia-pet-peers.
10.4.2 Updating a StatefulSet
Your StatefulSet is already running, so let's see how to update its Pod template so that the Pods use the new image. At the same time, you will also set the number of replicas to 3. To update the StatefulSet, use the kubectl edit command (kubectl patch is another option):
$ kubectl edit statefulset kubia
This opens the StatefulSet definition in your default editor. In the definition, set spec.replicas to 3 and modify the spec.template.spec.containers.image property so that it points to the new image (luksa/kubia-pet-peers instead of luksa/kubia-pet). Save the file and exit the editor to update the StatefulSet. Two replicas were running before; now you should see an additional replica named kubia-2 starting up. List the Pods to confirm:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Running 0 4h16m
kubia-1 1/1 Running 0 4h36m
kubia-2 0/1 ContainerCreating 0 12s
The new Pod instance is running the new image. But what about the two existing replicas? Judging by their age, they don't appear to have been updated. This is expected: initially, StatefulSets behaved more like ReplicaSets than Deployments, so they don't perform a rolling update when the template is modified. You need to delete the replicas manually, and the StatefulSet will recreate them based on the new template:
$ kubectl delete po kubia-0 kubia-1
Starting with Kubernetes version 1.7, StatefulSets support the same rolling updates as Deployments and DaemonSets. For more information, inspect the StatefulSet's spec.updateStrategy field with the kubectl explain command.
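As a sketch, the relevant manifest fields (which you can verify with kubectl explain sts.spec.updateStrategy) look like this:

```yaml
# Fragment of a StatefulSet manifest. RollingUpdate is the default strategy
# since Kubernetes 1.7; OnDelete reproduces the older behavior, where Pods
# are only replaced with the new template after you delete them manually.
spec:
  updateStrategy:
    type: RollingUpdate
```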
10.4.3 Trying out your clustered data store
Once both recreated Pods are running, you can see whether your brand-new Stone Age data store works as expected. Make a few requests to the cluster, as shown below.
$ curl -X POST -d "The sun is shining" localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
Data stored on pod kubia-1
$ curl -X POST -d "The weather is sweet" localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
Data stored on pod kubia-2
Now, follow the example below to read the stored data.
$ curl localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
You've hit kubia-0
Data stored in the cluster:
- kubia-0.kubia.default.svc.cluster.local: Hey there! This greeting was submitted to kubia-0.
- kubia-1.kubia.default.svc.cluster.local: The sun is shining
- kubia-2.kubia.default.svc.cluster.local: The weather is sweet
When a client request reaches one of your cluster nodes, it discovers all its peers, collects their data, and sends all of the data back to the client. Even if you scale the StatefulSet up or down, the Pod serving a client request can always find all the peers running at that moment.
10.5 Understanding how StatefulSets deal with node failures
In Section 10.2.4, we mentioned that Kubernetes must ensure that a stateful Pod has stopped running before creating its replacement. When a node fails abruptly, Kubernetes cannot know the state of the node or its Pods. It cannot tell whether the Pods have stopped running, or whether they are still running and perhaps even still reachable, and it is only the Kubelet that has stopped reporting the node's status to the master.
Because a StatefulSet guarantees that two Pods with the same identity and storage never run at the same time, when a node fails, the StatefulSet cannot and should not create a replacement Pod until it knows for certain that the previous Pod has stopped running.
A StatefulSet can only know this when told by the cluster administrator. The administrator can do so by deleting the specific Pod, or by deleting the entire node, which causes all Pods scheduled to that node to be deleted.