K8s Series (9): Observability and Application Health

1. Where the requirement comes from

Let's first look at where the requirement comes from: after migrating an application to Kubernetes, how do we ensure its health and stability? The answer is actually quite simple; it can be enhanced in two ways:

  1. The first is to improve the observability of the application;
  2. The second is to improve the recoverability of the application.

From the perspective of observability, it can be enhanced in three aspects:

  1. The first is the health status of the application, which can be observed in real time;
  2. The second is to obtain the resource usage of the application;
  3. The third is to get the real-time logs of the application for problem diagnosis and analysis.

When a problem occurs, the first thing to do is to reduce its scope of impact, then debug and diagnose it. Ideally, the application can recover completely through the self-healing mechanisms integrated with Kubernetes.

2. Liveness and Readiness

This section introduces the Liveness probe and the Readiness probe.

Application Health Status - A First Look at Liveness and Readiness

The Readiness probe, also called the readiness check, is used to judge whether a pod is in the ready state. When a pod is ready, it can serve traffic externally; that is, traffic from the access layer can reach the pod. When a pod is not ready, the access layer removes the pod from its traffic targets.

Let's look at a simple Readiness example.

When the pod's readiness check keeps failing, traffic from the access layer does not reach the pod.


When the pod's check changes from the failed state to the success state, the pod can carry traffic again.

The Liveness probe is similar: it is a liveness check used to determine whether a pod is alive. What happens when a pod is not alive?

At this point, the upper-layer judgment mechanism decides whether the pod needs to be restarted. If the restart policy configured for the pod is Always, the pod will be restarted directly.

Application Health Status - Usage

Next, let's look at how the Liveness probe and the Readiness probe are used in detail.

Detection methods

The Liveness probe and the Readiness probe both support three detection methods:

  1. The first is httpGet, which judges health by sending an HTTP GET request. A status code between 200 and 399 indicates that the application is healthy;
  2. The second is exec, which judges whether the service is normal by executing a command inside the container. An exit code of 0 indicates that the container is healthy;
  3. The third is tcpSocket, which performs a TCP health check against the container's IP and port. If the TCP connection can be established normally, the container is considered healthy.

Detection results

A detection can produce three results:

  • The first is Success, which means the container passed the health check; that is, the Liveness probe or Readiness probe is in a normal state;
  • The second is Failure, which means the container did not pass the health check, and a corresponding action is taken. For Readiness, the action happens at the Service layer: the Service removes the pod that failed the readiness check from its endpoints. For Liveness, the container is restarted or the pod is deleted;
  • The third is Unknown, which means the check itself did not run to completion, perhaps because of a timeout or a script that did not return in time. In this case the Readiness probe or Liveness probe takes no action and waits for the next check cycle.

Inside the kubelet there is a component called ProbeManager, which contains the Liveness probe and Readiness probe workers. They apply the corresponding liveness and readiness diagnostics to the pod to reach a concrete judgment.

Application Health Status - Pod Probe Spec

The following shows how each of the three detection methods is configured in a YAML file.

First, exec. Its use is very simple: a Liveness probe is configured with an exec diagnostic, which contains a command field. The command here cats a specific file to judge the current Liveness status; when the command exits with code 0, the pod is considered healthy.

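A minimal sketch of such an exec probe (the file path and timing values here are illustrative, not from the original):

```yaml
livenessProbe:
  exec:
    # the probe succeeds when this command exits with code 0,
    # i.e. as long as the file exists and can be read
    command:
    - cat
    - /tmp/healthy        # illustrative path
  initialDelaySeconds: 5
  periodSeconds: 5
```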

Next, httpGet. It has three fields: path, port, and httpHeaders. The headers field is needed only when the health judgment relies on specific request headers; usually path and port are enough.

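A minimal httpGet sketch (path, port, and header values are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # illustrative health endpoint
    port: 8080
    httpHeaders:          # only needed when the check relies on headers
    - name: X-Custom-Header
      value: Awesome
  initialDelaySeconds: 3
  periodSeconds: 3
```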

The third is tcpSocket. Its usage is also simple: you only need to set the port to detect, for example port 8080 in this example. When a TCP connection to port 8080 can be established normally, the tcpSocket probe considers the container healthy.

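A minimal tcpSocket sketch using the port 8080 mentioned above (timing values are illustrative):

```yaml
livenessProbe:
  tcpSocket:
    port: 8080            # healthy if a TCP connection to 8080 succeeds
  initialDelaySeconds: 10
  periodSeconds: 10
```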

In addition, there are five parameters that apply globally to all detection methods.

  • The first parameter is initialDelaySeconds, which indicates how long to wait after the pod starts before checking. For example, a Java application may take a long time to start, because the JVM needs to start up and Java's own jars need to load. In that early period the application cannot be probed successfully, and since this time is predictable, initialDelaySeconds should be set accordingly;

  • The second is periodSeconds, which represents the detection interval; the default value is 10 seconds;

  • The third is timeoutSeconds, the timeout for a single detection. If the detection does not succeed within this time, it is treated as a failure;

  • The fourth is successThreshold, the number of consecutive successful checks required for a probe to be considered successful again after it has failed. The default is 1, meaning that after a failure, one successful detection is enough to consider the pod's probe state normal again;

  • The last is failureThreshold, the number of failed detections tolerated before the probe is marked as failed. The default is 3, meaning that after 3 consecutive failures from a healthy state, the pod's probe is judged to be in a failed state.
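The five parameters above can be combined in a single probe; a sketch with illustrative values (only the defaults noted in the comments come from the text):

```yaml
livenessProbe:
  httpGet:
    path: /healthz          # illustrative endpoint
    port: 8080
  initialDelaySeconds: 30   # e.g. a JVM that needs time to start
  periodSeconds: 10         # detection interval, default 10s
  timeoutSeconds: 5         # fail if no result within 5s
  successThreshold: 1       # default 1: one success flips failed back to healthy
  failureThreshold: 3       # default 3: three straight failures mark it failed
```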

Application Health Status - Summary of Liveness and Readiness

Next, a brief summary of the Liveness probe and the Readiness probe.


The Liveness probe is a liveness check used to determine whether the container is alive, that is, whether the pod is running. If the Liveness probe judges the container unhealthy, the kubelet kills the corresponding container and decides whether to restart it according to the restart policy. If no Liveness probe is configured, the probe is considered successful by default.

The Readiness probe is used to judge whether the container is ready to serve, that is, whether the pod's condition is Ready. If a detection result is unsuccessful, the pod is removed from the Service's Endpoints; in other words, the pod is taken out of the access layer. It is not attached to the corresponding Endpoints again until a later detection succeeds.
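To illustrate the different consequences, here is a sketch of a pod configuring both probes (the name, image, and paths are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo            # illustrative name
spec:
  restartPolicy: Always
  containers:
  - name: app
    image: nginx:1.25         # illustrative image
    livenessProbe:            # on failure: kubelet kills the container,
      httpGet:                # and the restart policy decides what happens next
        path: /healthz
        port: 80
    readinessProbe:           # on failure: pod removed from Service Endpoints,
      httpGet:                # added back once the check succeeds again
        path: /ready
        port: 80
```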

There are some points to note when using the Liveness probe and the Readiness probe, because both need a suitable detection method configured to avoid false judgments.

  • The first is to increase the timeout threshold. Executing a shell script in a container takes noticeably longer than on an ECS instance or a VM: a script that returns in 3 seconds there may take 30 seconds inside the container. So this time needs to be measured inside the container in advance. Increasing the timeout threshold can prevent occasional timeouts when the container is under heavy load;

  • The second is to adjust the number of judgments. The default of 3 failures applies to a relatively short check period and is not necessarily best practice; adjusting the failure count appropriately is often the better approach;

  • The third concerns exec. Shell scripts take a long time to invoke for this judgment; it is recommended to use compiled programs, such as Golang binaries or binaries compiled from C or C++, which are usually 30% to 50% more efficient than shell scripts;

  • The fourth: when using the tcpSocket method against a TLS service, the probe may leave many invalid TCP connections on the TLS side. You need to judge, for your own business scenario, whether these connections affect the service.


Origin blog.csdn.net/lin819747263/article/details/125728413