Notes on a problem caused by a k8s health check

Problem background

My phone received an alert text message: an interface in the production environment was abnormal! The alert said that the API of one of our externally facing services was returning an abnormal status code, 500. My first reaction was to check the PaaS platform (KubeSphere), where I found that one pod of the service was restarting, and that it restarted again as soon as each restart finished. Then an alert-recovery message arrived (about one minute after the alert), reporting status code 200. This article looks into what happened.

k8s health check

k8s probe

k8s probes are periodic diagnostics that the kubelet performs on containers. To run a diagnostic, the kubelet calls a handler implemented by the container. There are three types of handler:

ExecAction: Executes the specified command inside the container. The diagnostic is considered successful if the command exits with a return code of 0.
TCPSocketAction: Performs a TCP check against the container's IP address on the specified port. The diagnostic is considered successful if the port is open.
HTTPGetAction: Performs an HTTP GET request against the container's IP address on the specified port and path. The diagnostic is considered successful if the response has a status code greater than or equal to 200 and less than 400.
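The HTTPGetAction success rule above can be written down as a one-line check. This is only an illustrative sketch: the class name HttpProbeRule and its method are made up for this article and are not kubelet code.

```java
/** Illustrative model of the HTTPGetAction success rule; hypothetical names, not kubelet code. */
final class HttpProbeRule {
    private HttpProbeRule() {}

    /** Returns true when the status code counts as a successful HTTP probe (200 <= code < 400). */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    public static void main(String[] args) {
        // Print the verdict for a few representative status codes.
        for (int code : new int[] {200, 302, 400, 500, 503}) {
            System.out.println(code + " -> " + (isSuccess(code) ? "success" : "failure"));
        }
    }
}
```

Note that redirects (3xx) count as success under this rule, while any 4xx or 5xx response counts as a failure.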

k8s health check probes

livenessProbe (liveness probe): Determines whether the container is alive (Running state). If the liveness probe finds the container unhealthy, the kubelet kills the container and then handles it according to the container's restart policy. If a container has no livenessProbe, the kubelet treats its liveness result as always Success.
readinessProbe (readiness probe): Determines whether the container's service is available (Ready state); only a Pod in the Ready state can receive requests. For Pods managed by a Service, the association between the Service and the Pod's Endpoint is also set according to whether the Pod is Ready. If the Ready state becomes False while the Pod is running, the system automatically removes it from the Service's backend Endpoint list, and adds it back once the Pod returns to the Ready state.
startupProbe (startup probe): If a startupProbe is configured, the other probes are disabled until it succeeds, and it is not run again after it has succeeded. It suits containers that take a long time to start. Requires Kubernetes v1.18 or above.

Identifying the problem

The health checks we configured are as follows:

livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 90
  timeoutSeconds: 3
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 90
  timeoutSeconds: 3
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3

The fields have the following meanings:

Field	Meaning
httpGet.path	Path of the GET request.
httpGet.port	Port of the GET request.
httpGet.scheme	Protocol of the GET request.
initialDelaySeconds	Initial delay (seconds): how long to wait after the container starts before checking its health.
timeoutSeconds	Timeout (seconds): how long to wait for the probe to complete. If the time is exceeded, the probe is considered to have failed. The default is 1. The minimum value is 1.
periodSeconds	Probe period (seconds): how often to perform the probe. The default is 10. The minimum value is 1.
successThreshold	Success threshold: the minimum number of consecutive successes required for the probe to be considered successful after it has failed. The default is 1. The minimum value is 1. Must be 1 for liveness and startup probes.
failureThreshold	Failure threshold: the minimum number of consecutive failures required for the probe to be considered failed. The default is 3. The minimum value is 1.
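To illustrate how successThreshold and failureThreshold turn individual probe results into a state, here is a minimal sketch. It only models the "consecutive results" rule described in the table; the class ProbeThresholds is hypothetical and is not how the kubelet is actually implemented.

```java
/** Illustrative model of successThreshold / failureThreshold; hypothetical, not kubelet code. */
final class ProbeThresholds {
    private final int successThreshold;
    private final int failureThreshold;
    private int consecutiveSuccesses;
    private int consecutiveFailures;
    private boolean healthy;

    ProbeThresholds(int successThreshold, int failureThreshold, boolean initiallyHealthy) {
        this.successThreshold = successThreshold;
        this.failureThreshold = failureThreshold;
        this.healthy = initiallyHealthy;
    }

    /** Records one probe result and returns the state after applying the thresholds. */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            consecutiveFailures = 0; // a success breaks the failure streak
            consecutiveSuccesses++;
            if (consecutiveSuccesses >= successThreshold) {
                healthy = true;
            }
        } else {
            consecutiveSuccesses = 0; // a failure breaks the success streak
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                healthy = false;
            }
        }
        return healthy;
    }

    public static void main(String[] args) {
        // Same thresholds as the configuration above: successThreshold 1, failureThreshold 3.
        ProbeThresholds state = new ProbeThresholds(1, 3, true);
        boolean[] results = {false, false, true, false, false, false};
        for (boolean result : results) {
            System.out.println("probe " + (result ? "ok  " : "fail") + " -> healthy=" + state.record(result));
        }
    }
}
```

In this model a single success resets the failure streak, so only three failures in a row flip the state to unhealthy.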

The configuration template provided by the company does not include a startup probe. My guess is that the deployed version of k8s does not support it.

From the problem background we can extract a few key points:

A pod was restarting.
After the pod finished restarting, it kept restarting.
The alert recovered about one minute after the alert text message.

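From this we can infer the following. The liveness probe's GET request either received a response whose status code was not between 200 and 400 or timed out outright, so the container was restarted, which directly affected the service and triggered the alert. The configured initial delay of 90 seconds was too short, which caused the container to keep restarting. Meanwhile, the readiness probe determined that the Ready state had become False, so the system automatically removed the pod from the Service's backend Endpoint list; once the faulty pod had been excluded, the alert recovered.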
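To make the timing argument concrete, here is a rough sketch of when a container that never answers healthily would be killed. It assumes the first probe runs at about initialDelaySeconds and ignores kubelet scheduling jitter and timeoutSeconds, so the result is only approximate. The class RestartTiming is hypothetical, and the 120-second startup time is an invented example, not a value measured during the incident.

```java
/** Rough timing model for a liveness probe that never succeeds; hypothetical, not kubelet code. */
final class RestartTiming {
    private RestartTiming() {}

    /** Approximate seconds after container start at which the failure threshold is reached. */
    static int secondsUntilKill(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        // Failures land at initialDelay, initialDelay + period, ...; the last one triggers the kill.
        return initialDelaySeconds + (failureThreshold - 1) * periodSeconds;
    }

    /** True when the application needs longer to start than the probe is willing to wait. */
    static boolean restartLoopLikely(int startupSeconds, int initialDelaySeconds,
                                     int periodSeconds, int failureThreshold) {
        return startupSeconds > secondsUntilKill(initialDelaySeconds, periodSeconds, failureThreshold);
    }

    public static void main(String[] args) {
        // Values taken from the livenessProbe configuration in this article.
        int deadline = secondsUntilKill(90, 10, 3);
        System.out.println("Killed about " + deadline + "s after start if never healthy");
        // 120s is a made-up startup time used only for illustration.
        System.out.println("Restart loop with a 120s startup? " + restartLoopLikely(120, 90, 10, 3));
    }
}
```

With the configuration above the deadline comes out at about 110 seconds, so under these assumptions any startup that takes longer than that would be killed before it finishes and would start over.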

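What caused the health checks of the running container to fail in the first place? The PaaS platform keeps event logs, but I forgot to take a screenshot at the time (and the events could no longer be found during the review)... I remember that there were HTTP timeout events as well as events with status code 503. The timeouts may have been caused by network fluctuations or, more likely, by the initial delay being too short. But where did the 503 status code come from?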

Actuator

Actuator is a Spring Boot module that provides all of Spring Boot's production-ready features.

Endpoints

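Actuator endpoints let you monitor and interact with your application. Spring Boot includes a number of built-in endpoints and lets you add your own. For example, the health endpoint provides basic application health information.

Actuator's health endpoint

The URL we configured for the health checks is the health endpoint provided by Actuator. Dependencies that we pull in, such as the DB and Nacos dependencies, implement Actuator's health strategy interface, HealthIndicator. When the health endpoint is requested, these strategy implementations are called to check the health of each component.

The health endpoint's response contains a status. Setting management.endpoint.health.show-details=always makes it return detailed information as well.

Below is an example response with detailed information.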
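The strategy idea can be sketched without Spring: every component contributes a check, the endpoint runs them all, and the overall status is the most severe one found. The class HealthModel below is a simplified, hypothetical model written for this article, not Spring Boot source. It assumes the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, and the component names db and mail are borrowed from the sample response in this article.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Simplified, hypothetical model of a health endpoint combining component checks. */
final class HealthModel {
    // Assumed severity order: the first entry present among the components wins.
    private static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    private HealthModel() {}

    /** Runs every registered check and returns each component's status keyed by name. */
    static Map<String, String> runChecks(Map<String, Supplier<String>> indicators) {
        Map<String, String> results = new LinkedHashMap<>();
        indicators.forEach((name, check) -> results.put(name, check.get()));
        return results;
    }

    /** Returns the most severe status present, or UNKNOWN when nothing matches. */
    static String aggregate(Collection<String> statuses) {
        for (String candidate : ORDER) {
            if (statuses.contains(candidate)) {
                return candidate;
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        Map<String, Supplier<String>> indicators = new LinkedHashMap<>();
        indicators.put("db", () -> "UP");     // stand-in for a database check
        indicators.put("mail", () -> "DOWN"); // stand-in for a failing mail server check
        Map<String, String> components = runChecks(indicators);
        System.out.println("components=" + components + " overall=" + aggregate(components.values()));
    }
}
```

The point of the model is that a single failing dependency is enough to drag the overall status down, even when the application itself is running normally.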

{
    "components":{
        "db":{
            "components":{
                "dataSource":{
                    "details":{
                        "database":"MySQL",
                        "result":1,
                        "validationQuery":"/* ping */ SELECT 1"
                    },
                    "status":{
                        "code":"UP",
                        "description":""
                    }
                },
                "dataSource2":{
                    "details":{
                        "database":"MySQL",
                        "result":1,
                        "validationQuery":"/* ping */ SELECT 1"
                    },
                    "status":{
                        "code":"UP",
                        "description":""
                    }
                }
            },
            "status":{
                "code":"UP",
                "description":""
            }
        },
        "discoveryComposite":{
            "components":{
                "discoveryClient":{
                    "details":{
                        "services":[
                            "***",
                            "***",
                            "***-gateway"
                        ]
                    },
                    "status":{
                        "code":"UP",
                        "description":""
                    }
                }
            },
            "status":{
                "code":"UP",
                "description":""
            }
        },
        "diskSpace":{
            "details":{
                "total":"528309530624",
                "free":"463192977408",
                "threshold":10485760
            },
            "status":{
                "code":"UP",
                "description":""
            }
        },
        "mail":{
            "details":{
                "location":"10.************:25"
            },
            "status":{
                "code":"UP",
                "description":""
            }
        },
        "ping":{
            "details":{

            },
            "status":{
                "code":"UP",
                "description":""
            }
        },
        "refreshScope":{
            "details":{

            },
            "status":{
                "code":"UP",
                "description":""
            }
        }
    },
    "groups":[

    ],
    "status":{
        "code":"UP",
        "description":""
    }
}

The code of the returned Status takes one of four values.

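/**
 * Indicates that the component or subsystem is in an unknown state.
 */
public static final Status UNKNOWN = new Status("UNKNOWN");

/**
 * Indicates that the component or subsystem is functioning as expected.
 */
public static final Status UP = new Status("UP");

/**
 * Indicates that the component or subsystem has suffered an unexpected failure.
 */
public static final Status DOWN = new Status("DOWN");

/**
 * Indicates that the component or subsystem has been taken out of service and should not be used.
 */
public static final Status OUT_OF_SERVICE = new Status("OUT_OF_SERVICE");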

Reading the source code shows how these four codes map to HTTP status codes: DOWN and OUT_OF_SERVICE return HTTP status 503, and the others return 200.
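Combining this mapping with the probe's success rule (a status code of at least 200 and below 400) explains the 503 events. The sketch below restates both rules; the class HealthStatusHttp is hypothetical and statuses are plain strings rather than Spring's Status type.

```java
/** Illustrative mapping from health status to HTTP status code and probe result; hypothetical names. */
final class HealthStatusHttp {
    private HealthStatusHttp() {}

    /** DOWN and OUT_OF_SERVICE map to 503; every other status maps to 200. */
    static int httpStatusFor(String healthStatus) {
        switch (healthStatus) {
            case "DOWN":
            case "OUT_OF_SERVICE":
                return 503;
            default:
                return 200;
        }
    }

    /** Applies the HTTP probe rule (200 <= code < 400) to the mapped status code. */
    static boolean probePasses(String healthStatus) {
        int code = httpStatusFor(healthStatus);
        return code >= 200 && code < 400;
    }

    public static void main(String[] args) {
        for (String status : new String[] {"UP", "UNKNOWN", "DOWN", "OUT_OF_SERVICE"}) {
            System.out.println(status + " -> HTTP " + httpStatusFor(status)
                    + ", probe passes: " + probePasses(status));
        }
    }
}
```

So whenever the overall health status is DOWN or OUT_OF_SERVICE, both the liveness and the readiness probe configured above see a 503 and count a failure.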

If the health endpoint reports a problem, the detailed information lets you pinpoint the failing component!