Kubernetes study notes - best practices for developing applications (1) 20230311

1. All resources of kubernetes

The following describes the various Kubernetes resources used in a typical application.

A typical application manifest contains one or more Deployment and StatefulSet objects. These contain pod templates for one or more containers; each container has a liveness probe, plus readiness probes for the services it provides (if any). Pods that provide services expose themselves through one or more Service objects. When these services need to be reachable from outside the cluster, they are either configured as LoadBalancer or NodePort services, or exposed through Ingress resources.

Pod templates (the configuration from which pods are created) typically reference two types of Secrets: one used when pulling images from a private image registry, and another used directly by the processes running inside the pods. The Secrets themselves are usually not part of the application manifest, because they are configured by the operations team rather than by the application developer. Secrets are usually assigned to a ServiceAccount, which is then assigned to individual pods.
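
For illustration, here is a minimal sketch of how such image-pull Secrets are typically wired up: a ServiceAccount referencing a registry Secret, and a pod-spec fragment that uses the ServiceAccount. The names (myapp-sa, registry-credentials, the image) are hypothetical and not taken from the text.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa                        # hypothetical name
imagePullSecrets:
- name: registry-credentials            # Secret created in advance by the operations team
---
# fragment of a pod spec using the ServiceAccount
spec:
  serviceAccountName: myapp-sa
  containers:
  - name: myapp
    image: registry.example.com/myapp:1.0   # hypothetical private-registry image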

An application also contains one or more ConfigMap objects, which are either used to initialize environment variables or mounted as configMap volumes in the pods. Some pods use additional volumes, such as emptyDir or gitRepo volumes, and pods that require persistent storage use persistentVolumeClaim volumes. The PersistentVolumeClaims are also part of the application manifest, while the StorageClasses they reference are created in advance by the system administrator.
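
As a sketch of these two uses, the example below defines a ConfigMap and a pod-spec fragment that consumes it both as an environment variable and as a configMap volume. All names (myapp-config, LOG_LEVEL, the mount path, the image) are made up for the illustration.

apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config                    # hypothetical name
data:
  log-level: info
  app.properties: |
    cache.size=100
---
# fragment of the pod spec consuming the ConfigMap both ways
spec:
  containers:
  - name: myapp
    image: myapp:1.0                    # hypothetical image
    env:
    - name: LOG_LEVEL                   # environment variable initialized from one key
      valueFrom:
        configMapKeyRef:
          name: myapp-config
          key: log-level
    volumeMounts:
    - name: config
      mountPath: /etc/myapp             # every key appears as a file in this directory
  volumes:
  - name: config
    configMap:
      name: myapp-config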

In some cases, an application also needs Jobs or CronJobs. DaemonSets are usually not part of an application deployment; they are typically created by system administrators to run system services on all or a subset of nodes. HorizontalPodAutoscalers are either included in the manifest by the developer or added to the system later by the operations team. The cluster administrator also creates LimitRange and ResourceQuota objects to keep the compute resource usage of individual pods and of all pods as a whole under control.
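
For reference, this is roughly what such LimitRange and ResourceQuota objects might look like; the names and the concrete limits are illustrative assumptions only.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits                  # hypothetical name
spec:
  limits:
  - type: Container
    defaultRequest:                     # requests applied to containers that specify none
      cpu: 100m
      memory: 64Mi
    default:                            # limits applied to containers that specify none
      cpu: 200m
      memory: 128Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota                 # hypothetical name
spec:
  hard:                                 # caps the total resources of all pods in the namespace
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi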

After the application is deployed, various Kubernetes controllers automatically create additional objects. These include the Endpoints objects created by the Endpoint controller, the ReplicaSets created by the Deployment controller, and the actual pods created by the ReplicaSets (or by Jobs, CronJobs, StatefulSets, and DaemonSets).

Resources are usually organized with one or more labels. This applies not only to pods but to all other resources as well. In addition to labels, most resources also carry annotations that describe the resource, list contact information for the person or team responsible for it, or provide additional metadata for management and other tools.

The pod is at the center of all these resources and is undoubtedly the most important resource in Kubernetes. After all, each of your applications runs in a pod.

2. Understand the pod lifecycle

A pod can be compared to a virtual machine running a single application, but there are still significant differences. One example is that an application running in a pod may be killed at any time, because Kubernetes needs to reschedule the pod to another node or because the cluster is being scaled down.

1. Applications running in a pod may be killed or rescheduled

Applications running in virtual machines are rarely migrated from one machine to another. When an operator migrates an application, they can reconfigure it and manually verify that it works correctly in the new location.

With Kubernetes, applications are migrated automatically and far more frequently, without manual intervention, which means no one is there to reconfigure them and make sure they still run correctly after the migration.

Expect the local IP address and hostname to change

When a pod is killed and a new pod replaces it elsewhere (the old pod is not migrated), the replacement not only has a new IP address but also a new name and hostname. Most stateless applications can handle this without adverse effects, but stateful applications usually cannot. A stateful application can be run through a StatefulSet, which ensures that the application keeps a stable name and hostname when it is rescheduled to a new node and started, although the application must still cope with other changes such as a new IP address. Therefore, application developers should never rely on members' IP addresses to build relationships within a cluster, and if hostnames are used to build such relationships, a StatefulSet must be used.

Expect data written to disk to disappear

If an application writes data to disk, that data may be lost when the application is started in a new pod, unless persistent storage is attached to the path the application writes to. Data loss is guaranteed when a pod is rescheduled, but files written to disk can also be lost without any rescheduling: even during the lifetime of a single pod, files written to disk by the application can disappear.

Suppose an application's startup is relatively time-consuming and requires a lot of computation. To make subsequent startups faster, developers often cache some of the results computed during startup on disk. Since applications in Kubernetes run in containers by default, these files are written to the container's filesystem. If the container is then restarted, these files are lost, because a new container starts with a fresh writable layer.

A single container may be restarted for various reasons, such as a process crash, a failing liveness probe, or the node running out of memory and the process being killed by the OOM killer. When this happens, the pod stays the same, but the container is brand new. The kubelet never reruns the same container; it always creates a new one.

Use volumes to preserve data across container restarts

To avoid losing data when a pod's container is restarted, a pod-level volume is needed. Because volumes are created and destroyed together with the pod, a new container can reuse the data written to the volume by the previous container.
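
A minimal sketch of this pattern is shown below: an emptyDir volume mounted at the path the application writes to, so a restarted container finds the previous container's files. The pod name, image, and mount path are assumptions for the example.

apiVersion: v1
kind: Pod
metadata:
  name: cache-demo                      # hypothetical name
spec:
  containers:
  - name: app
    image: myapp:1.0                    # hypothetical image
    volumeMounts:
    - name: cache
      mountPath: /var/cache/myapp       # the application writes its cached results here
  volumes:
  - name: cache
    emptyDir: {}                        # lives as long as the pod, so it survives container restarts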

Using volumes to preserve data across container restarts is sometimes a good idea, but not always. What if the newly created process crashes again because the data has become corrupted? The result is a continuous crash loop (the pod shows the CrashLoopBackOff status). If no volume were used, the new container would start from scratch and would most likely not crash. Using volumes to preserve data across container restarts is therefore a double-edged sword.

2. Reschedule dead or partially dead pods

If a pod's containers keep crashing, the kubelet will keep restarting them, with the interval between restarts increasing exponentially until it reaches five minutes. During these five-minute intervals the pod is essentially dead, because its container processes are not running.

To be fair, in a pod with multiple containers some of them may still be running normally, so the pod is only partially dead. But if the pod contains only one container, it is completely dead and useless, because no process is running in it anymore.

One may wonder why such pods are not automatically removed or rescheduled, even though they are part of a ReplicaSet or a similar controller. You might expect the dead pod to be deleted and replaced by one that runs successfully on another node; after all, the container may be crashing because of a node-related issue that does not occur on other nodes. Unfortunately, that is not the case: the ReplicaSet controller does not care whether a pod is dead, it only cares whether the number of pods matches the desired replica count.

3. Start pods in a fixed order

Another difference between applications running in pods and applications deployed manually is that, in the manual case, the operator knows the dependencies between the applications and can start them in the right order.

How pods are started

When Kubernetes runs an application consisting of multiple pods, it has no built-in way of running certain pods first and waiting for them to be up before running the rest. Of course, you could post the configuration of one application first and post the configuration of the second one only after the first pod is up. But the whole system is usually defined in a single YAML or JSON file containing the definitions of multiple pods, services, and other objects.

The Kubernetes API server does process the objects in the order they are defined in the YAML or JSON file, but that only means they are written to etcd in that order. There is no guarantee that the pods will also start in that order.

However, you can prevent a pod's main container from starting until its preconditions are met. This is done through a special type of container included in the pod, called an init container.

Introduction to init containers

Init containers can be used to initialize the pod.

A pod can have any number of init containers. They are executed sequentially, and the main container starts only after the last init container has finished. This means init containers can also be used to delay the start of the pod's main container until a certain condition is met. For example, an init container can wait until a service the main container depends on is up and able to serve requests. Once the service is up, the init container exits and the main container can start. This way the main container never uses the service before it is ready.

Add an init container to a pod

Like main containers, init containers are defined in the pod spec, but through the spec.initContainers field.

fortune-client.yaml

spec:
  initContainers:
  - name: init
    image: busybox
    command:
    - sh
    - -c
    - 'while true; do echo "waiting for fortune service to come up ...";
      wget http://fortune -q -T 1 -O /dev/null >/dev/null 2>/dev/null && break;
      sleep 1; done; echo "Service is up! Starting main container."'

When you deploy this pod, only its init container starts. This can be seen in the pod's status shown by the kubectl get po command:

$ kubectl get po

You can view the log of the init container with kubectl logs:

$ kubectl logs fortune-client -c init

Best practices for handling inter-pod dependencies

Consider using a readiness probe instead. If an application cannot do its job when one of its dependencies is missing, it should signal this through its readiness probe, so that Kubernetes also knows the application is not ready. The reason for doing this is not only that the readiness signal prevents the application from becoming a service endpoint, but also that the Deployment controller uses the readiness probe during rolling updates, thereby preventing a rollout of a bad version.
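
As a sketch of this practice, the fragment below adds a readiness probe that hits an endpoint which is expected to also verify the application's dependencies. The image, path, and timing values are illustrative assumptions.

containers:
- name: myapp
  image: myapp:1.0                      # hypothetical image
  readinessProbe:
    httpGet:
      path: /healthz/ready              # hypothetical endpoint that also checks the app's dependencies
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 3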

4. Add lifecycle hooks

Pods also allow two types of lifecycle hooks to be defined:

  • Post-start hook

  • Pre-stop hook

These lifecycle hooks are specified on a per-container basis, unlike init containers, which are applied to the entire pod. These hooks, as their name suggests, are executed after the container is started and before it is stopped.

Lifecycle hooks are similar to liveness and readiness probes in that they can both

  • Execute a command inside the container

  • Send an HTTP GET request to a URL

Use post-start lifecycle hooks

Post-start hooks are executed immediately after the container's main process starts. They can be used to perform additional work when the application starts. Developers can of course add these operations to the application code itself, but when running an application developed by someone else, you usually don't want to modify its source code. The post-start hook lets you run additional commands without touching the application. These commands might signal to an external listener that the application is starting, or initialize the application so that it can run properly.

The hook runs in parallel with the main process. Its name may be misleading, because it does not wait for the main process to fully start before executing.

Even though the hook runs asynchronously, it does affect the container in two ways. Until the hook completes, the container stays in the Waiting state with the reason ContainerCreating, so the pod's status is Pending instead of Running. If the hook fails to run or returns a non-zero exit code, the main container is killed.

A pod manifest containing a post-start hook: post-start-hook.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-poststart-hook
spec:
  containers:
  - image: luksa/kubia
    name: kubia
    lifecycle:
      postStart:
        exec:
          command:
          - sh
          - -c
          - "echo 'hook will fail with exit code 15'; sleep 5; exit 15"

As in the example above, the echo, sleep, and exit commands are executed together with the container's main process when the container is created. Typically, you would not run commands like this directly, but through a shell script or a binary executable stored in the container image.
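
For example, a sketch of a post-start hook that runs a script shipped in the image instead of an inline shell command might look like this (the script path /bin/post-start.sh is hypothetical):

lifecycle:
  postStart:
    exec:
      command: ["/bin/post-start.sh"]   # hypothetical script baked into the container image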

Use pre-stop lifecycle hooks

Pre-stop hooks are executed immediately before a container is terminated. When a container needs to be terminated and a pre-stop hook is configured, the kubelet runs the hook first and sends the SIGTERM signal to the container's process only after the hook has finished (if the process still does not terminate gracefully, it is killed).

A pre-stop hook can be used to trigger a graceful shutdown for containers that do not shut down gracefully on their own after receiving SIGTERM. Such hooks can also perform arbitrary operations before the container terminates, without requiring the application to implement them internally.

Configuring a pre-stop hook in the pod manifest is similar to adding a post-start hook. The following is a pre-stop hook that performs an HTTP GET request.

pre-stop-hook-httpget.yaml code snippet

lifecycle:
  preStop:
    httpGet:
      port: 8080
      path: shutdown

As defined above, when the kubelet starts terminating the container, the pre-stop hook performs an HTTP GET request to http://POD_IP:8080/shutdown. You can also set the host and the scheme (HTTP or HTTPS) in the hook definition.

Note: By default, the value of host is the pod's IP address. Make sure the request is not sent to localhost, because localhost would refer to the node, not the pod.

Unlike the post-start hook, the container is terminated regardless of whether the hook succeeds. Neither an HTTP error status code nor a non-zero exit code from a command-based hook prevents the container from terminating. If the pre-stop hook fails, you will see a FailedPreStopHook warning among the pod's events.

Lifecycle hooks are for containers not pods

Lifecycle hooks target containers, not pods. A pre-stop hook should not be used to run operations that need to be performed when the pod terminates, because the pre-stop hook is invoked whenever the container is about to be terminated (possibly due to a failed liveness probe). This can happen multiple times during a pod's lifetime, not only when the pod is being shut down.

5. Understand pod shutdown

The shutdown of a pod is triggered by the deletion of the pod object through the API server. On receiving the HTTP DELETE request, the API server does not delete the object immediately; instead, it sets the pod's deletionTimestamp. Pods that have a deletionTimestamp begin shutting down.

When the kubelet notices that a pod needs to be terminated, it starts terminating each of the pod's containers. It gives each container a certain amount of time to stop gracefully. This time is called the termination grace period and can be configured per pod. As soon as the termination process starts, the timer starts ticking, and the following events occur in order:

1) Execute the pre-stop hook (if configured), and then wait for it to complete

2) Send a SIGTERM signal to the main process of the container

3) Wait for the container to shut down gracefully or wait for the termination grace period to expire

4) If the container's main process does not shut down gracefully, it is forcibly terminated with the SIGKILL signal

Specify a termination grace period

The grace period is set through the spec.terminationGracePeriodSeconds field of the pod spec. It defaults to 30, which means the containers are given 30 seconds to terminate gracefully before they are killed.
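
A minimal pod-spec fragment that sets a longer grace period might look like the following; the 60-second value and the image are illustrative assumptions.

spec:
  terminationGracePeriodSeconds: 60     # give the process up to 60 seconds to clean up
  containers:
  - name: myapp
    image: myapp:1.0                    # hypothetical image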

Tip: Set the termination grace period long enough that your container's process can finish its cleanup within that time.

When deleting a pod, the termination grace period specified in the pod spec can also be overridden by:

$ kubectl delete po mypod --grace-period=5

The above command makes kubectl wait 5 seconds for the pod to shut down on its own. When all of the pod's containers have stopped, the kubelet notifies the API server and the pod resource is finally deleted. You can also force the API server to delete the pod resource immediately, without waiting for confirmation, by setting the grace period to 0 and adding the --force option:

$ kubectl delete po mypod --grace-period=0 --force

Be careful when using these options, especially with StatefulSet pods. The StatefulSet controller goes to great lengths to avoid running two instances of the same pod at the same time (two pods with the same ordinal index and name, mounting the same PersistentVolume). Force-deleting a pod causes the controller to create a replacement pod without waiting for the containers of the deleted pod to finish shutting down. In other words, two instances of the same pod could end up running at the same time, which can make the stateful clustered service misbehave. Force-delete a stateful pod only when you are certain the pod is no longer running or cannot communicate with the other members of the cluster (for example, when the node hosting the pod has lost network connectivity and cannot be reconnected).

Properly handle container shutdown operations in the application

Applications should respond to the SIGTERM signal by starting their shutdown procedure and terminating when it completes. Besides handling SIGTERM, an application can also be notified of the impending shutdown through a pre-stop hook. In both cases, it only has a fixed amount of time to terminate cleanly.

But what if it is impossible to predict how long the application will need to terminate cleanly? Suppose the application is a distributed data store and one of its pod instances is deleted during a scale-down. During the shutdown, the pod needs to migrate its data to the surviving pods so that it is not lost. Should the pod start migrating the data when it receives the termination signal (whether through SIGTERM or a pre-stop hook)?

The answer is no. This approach is not recommended, for the following reasons:

  • Termination of a container does not necessarily mean that the entire pod is terminated

  • There is no guarantee that this shutdown process can be executed before the process is killed

Replace critical shutdown procedures with pods dedicated to them

One solution is for the application, on receiving the termination signal, to create a new Job resource that runs a new pod whose only job is to migrate the data from the deleted pod to the surviving pods. But you cannot guarantee that the application will succeed in creating the Job object every time; what if the node fails just as the application is about to create it?

A reasonable solution is to have a dedicated, continuously running pod that keeps checking for orphaned data. When it finds orphaned data, it migrates it to the surviving pods. It doesn't have to be a continuously running pod, either; you can use a CronJob resource to run such a pod periodically.
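
A hedged sketch of such a periodic check using a CronJob might look like this; the schedule, names, and image are assumptions, and the actual migration logic would live in the container image.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: orphaned-data-migrator          # hypothetical name
spec:
  schedule: "*/30 * * * *"              # run every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: migrator
            image: myapp-data-migrator:1.0   # hypothetical image that finds orphaned data and migrates it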

StatefulSets have their own issues here. For example, scaling down a StatefulSet leaves a PersistentVolumeClaim orphaned, which leaves the data stored in it stranded. On a subsequent scale-up, the PersistentVolume is reattached to the new pod instance, but what if that scale-up never happens (or happens only after a long time)? For this reason, when using StatefulSets you may also want to run a data-migration pod. To avoid the migration kicking in during an application upgrade, the dedicated data-migration pod can be configured to wait for a while before migrating, giving the stateful pod time to come back up.
