Pod health check (probe) mechanisms in Kubernetes

Part One: Where the need comes from

First, consider where the whole need comes from: when migrating applications to Kubernetes, how do we safeguard their health and stability? It is actually quite simple; an application can be strengthened in two ways:

1. First, improve the application's observability;
2. Second, improve the application's recoverability.

On the observability side, there are three areas to improve:

1. First, the application's health status can be observed in real time;
2. Second, the application's resource usage can be obtained;
3. Third, the application's logs can be retrieved in real time for diagnosing and analyzing problems.

When a problem occurs, the first priority is to reduce its impact, then debug and diagnose it. Ideally, when something goes wrong, the application can recover completely through self-healing mechanisms integrated with Kubernetes.

Part Two: The two probe types: livenessProbe and readinessProbe

  • livenessProbe (liveness probe): determines whether a container is healthy according to user-defined rules. If the liveness probe judges the container unhealthy, the kubelet kills the pod's container, and whether it is restarted is decided by the restart policy. If no liveness probe is configured, it is considered successful by default.

  • readinessProbe (readiness probe): determines whether the container has finished starting up, i.e. whether the pod has reached its expected Ready state. If a probe fails, the pod is removed from the Service's Endpoints (the pod is marked unavailable), so the layers above no longer route traffic to it; once a later probe succeeds, the pod is added back to the corresponding Endpoints.

What is an Endpoint?
An Endpoint is a Kubernetes cluster resource object, stored in etcd, that records the access addresses of all pods backing a Service.
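For example, once a Service exists (such as the web-svc Service created later in this article), the pod addresses it currently fronts can be inspected with:

[root@sqm-master yaml]# kubectl get endpoints web-svc
[root@sqm-master yaml]# kubectl describe endpoints web-svc

When a readiness probe fails, the pod's address disappears from this list; once the probe succeeds again, it reappears.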

2. Usage scenarios for the liveness and readiness mechanisms:
The liveness probe suits applications that can recover simply by being restarted, while the readiness probe mainly handles applications that cannot serve traffic immediately after starting.

3. How the liveness and readiness mechanisms are alike, and how they differ:
Both probe an application or file inside the pod to check the pod's health. The difference: if a liveness probe fails, the pod's container is restarted; if a readiness probe fails three times in a row, the pod is merely marked unavailable and is not restarted.

4. Liveness and readiness probes both support three detection methods:

  • 1. httpGet: probes by sending an HTTP GET request; if the returned status code is between 200 and 399, the application is considered healthy;
  • 2. exec: probes by running a command inside the container; if the command's exit code is 0, the container is considered healthy;
  • 3. tcpSocket: performs a TCP health check against the container's IP and port; if a TCP connection can be established, the container is considered healthy.

The first and third methods are very similar; the first and second are the most commonly used.
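The examples in Part Three demonstrate only the first two methods. For completeness, a minimal sketch of a tcpSocket liveness probe against an nginx container (not part of the original examples) would look like this:

apiVersion: v1
kind: Pod
metadata:
  name: tcp-liveness
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 80          # healthy if a TCP connection to port 80 can be established
      initialDelaySeconds: 10
      periodSeconds: 5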

Part Three: Probe examples:

1. livenessProbe:

Method 1: the exec method. It checks whether a specified file exists inside the pod: if it exists, the pod is considered healthy; otherwise the pod is restarted according to its restart policy.

### pod configuration file:

[root@sqm-master yaml]# vim livenss.yaml
kind: Pod
apiVersion: v1
metadata:
  name: liveness
  labels:
    name: liveness
spec:
  restartPolicy: OnFailure    ## restart policy: restart only when the pod exits with an error
  containers:
  - name: liveness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300  # create the file, then delete it after 30 seconds
    livenessProbe:      # perform a liveness probe
      exec:
        command:
        - cat              # probe for /tmp/test: if it exists the container is healthy; otherwise the restart policy kicks in
        - /tmp/test
      initialDelaySeconds: 10        # how long after the container starts before probing begins (seconds)
      periodSeconds: 5     # probe interval (seconds): probe every 5 seconds

Other optional probe fields:

  • initialDelaySeconds: how many seconds to wait after the container starts before the first probe runs.
  • periodSeconds: how often to probe. Defaults to 10 seconds; minimum 1 second.
  • timeoutSeconds: the probe timeout. Defaults to 1 second; minimum 1 second.
  • successThreshold: after a failure, the minimum number of consecutive successes for the probe to count as successful again. Defaults to 1; must be 1 for liveness; minimum 1.
  • failureThreshold: after a success, the minimum number of consecutive failures for the probe to count as failed. Defaults to 3; minimum 1.
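Put together, a probe that sets all of these tuning fields (with illustrative values, not the ones used in the example above) would look like this:

    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/test
      initialDelaySeconds: 10   # wait 10s after the container starts
      periodSeconds: 5          # probe every 5s
      timeoutSeconds: 2         # each probe attempt times out after 2s
      successThreshold: 1       # must be 1 for liveness
      failureThreshold: 3       # 3 consecutive failures count as a failure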
// Now run the pod defined above to test it:
[root@sqm-master yaml]# kubectl  apply -f  livenss.yaml 
pod/liveness created

// Monitor the pod's state:
Probing begins 10 seconds after the container starts and repeats every 5 seconds.
[screenshot: kubectl get pod output]
We can see that the pod keeps being restarted; the RESTARTS count in the screenshot has already reached 7, because the pod runs this command at startup:

/bin/sh -c "touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300"

During the first 30 seconds of the container's life, the /tmp/test file exists, so "cat /tmp/test" returns a success exit code. After 30 seconds, "cat /tmp/test" returns a failure exit code, which triggers the pod's restart policy.

// Let's look at the pod's Events:
[root@sqm-master ~]# kubectl  describe  pod liveness 

[screenshot: kubectl describe pod output]
The events above show that when the probe fails, the container is restarted, because the file cannot be found in the specified directory.

Method 2: the httpGet method. For a running web service, it probes whether a specified file exists under the web root; this is equivalent to running "curl -I <container-ip>/healthy". (Here / refers to the home directory of the web service inside the container.)

// The pod's YAML file:

[root@sqm-master yaml]# vim http-livenss.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    name: mynginx
spec:
  restartPolicy: OnFailure      # define the pod restart policy
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:       # define the probe
      httpGet:               # probe method: httpGet
        scheme: HTTP    # protocol
        path: /healthy       # file at this path; if it does not exist, the probe fails
        port: 80
      initialDelaySeconds: 10       # how long after the container starts before probing begins (seconds)
      periodSeconds: 5       # probe interval (seconds): probe every 5 seconds
---
apiVersion: v1       # an associated Service object
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    name: mynginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

The httpGet method supports the following optional fields:

  • host: the host name to connect to; defaults to the pod's IP. You may want to set the "Host" field in the HTTP headers instead of using the IP.
  • scheme: the scheme used for the connection; defaults to HTTP.
  • path: the path to access on the HTTP server.
  • httpHeaders: custom request headers. HTTP allows repeated headers.
  • port: the name or number of the container port to access. The port number must be between 1 and 65535.
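As an illustration of the host and httpHeaders fields, a hypothetical probe that sends a custom Host header instead of relying on the pod IP might look like this (the host name is made up):

    livenessProbe:
      httpGet:
        scheme: HTTP
        path: /healthy
        port: 80
        httpHeaders:
        - name: Host                  # overrides the default Host header (the pod IP)
          value: myapp.example.com    # hypothetical host name
      initialDelaySeconds: 10
      periodSeconds: 5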
// Now run the pod defined above:
[root@sqm-master yaml]# kubectl  apply -f  http-livenss.yaml 
pod/web created
service/web-svc created

## Check the pod's status during the first 10 seconds:
[screenshot: kubectl get pod output]
## Within the first 10 seconds the container is alive and returns status code 200.
[screenshot: probe output showing HTTP 200]
### After 10 seconds, once probing has started, check the pod again:
[screenshot: kubectl get pod output]

// Check the pod's events:
[root@sqm-master yaml]# kubectl describe  pod web 

[screenshot: kubectl describe pod output]

The status code 404 shows that the specified file was not found under the web root, i.e. the probe failed; the pod was restarted four times and reached the Completed state, indicating that it is considered faulty.

2) Next, we adjust the probe so that it finally succeeds.
Modify the pod configuration file:
[root@sqm-master yaml]# vim http-livenss.yaml
[screenshot: the modified YAML]
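The modified file only appears in the screenshot; a minimal sketch of the likely change, assuming the probe path is simply pointed at a page nginx actually serves, is:

    livenessProbe:
      httpGet:
        scheme: HTTP
        path: /index.html    # assumed change: probe a file that exists under the web root
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5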

// Re-run the pod:
[root@sqm-master yaml]# kubectl  delete -f  http-livenss.yaml 
pod "web" deleted
service "web-svc" deleted
[root@sqm-master yaml]# kubectl apply -f  http-livenss.yaml 
pod/web created
service/web-svc created
// Finally, check the pod's status and Events:
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
web    1/1     Running   0          5s    10.244.1.11   node01   <none>           <none>
[root@sqm-master yaml]# kubectl  describe  pod web 
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  71s   default-scheduler  Successfully assigned default/web to node01
  Normal  Pulling    71s   kubelet, node01    Pulling image "nginx"
  Normal  Pulled     70s   kubelet, node01    Successfully pulled image "nginx"
  Normal  Created    70s   kubelet, node01    Created container nginx
  Normal  Started    70s   kubelet, node01    Started container nginx

You can see that the pod is now running normally.

## Test the page's response headers:
[root@sqm-master yaml]# curl -I 10.244.1.11

[screenshot: curl -I output]
A status code of 200 is returned, representing a healthy pod.

readinessProbe examples:

Method 1: the exec method; the same as with liveness, it checks whether a file exists.
// The pod configuration file is as follows:

[root@sqm-master yaml]# vim readiness.yaml
kind: Pod
apiVersion: v1
metadata:
  name: readiness
  labels:
    name: readiness
spec:
  restartPolicy: OnFailure
  containers:
  - name: readiness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/test; sleep 30; rm -rf /tmp/test; sleep 300;
    readinessProbe:   # define the readiness probe
      exec:
        command:
        - cat
        - /tmp/test
      initialDelaySeconds: 10
      periodSeconds: 5
// Run the pod:
[root@sqm-master yaml]# kubectl apply -f  readiness.yaml 
pod/readiness created

// Check the state of the pod:
[screenshot: kubectl get pod output]

// Check the pod's Events:
[root@sqm-master yaml]# kubectl  describe  pod readiness 

[screenshot: kubectl describe pod output]
We can see that the file is not found, so the probe fails. But the readiness mechanism differs from the liveness mechanism: it does not restart the pod; instead, after three consecutive probe failures, the container is marked unavailable.

Method 2: the httpGet method.

[root@sqm-master yaml]# vim http-readiness.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web2
  labels:
    name: web2
spec:
  containers:
  - name: web2
    image: nginx
    ports:
    - containerPort: 81
    readinessProbe:
      httpGet:
        scheme: HTTP    # protocol
        path: /healthy    # path to probe; the file must exist (create it first), otherwise the probe fails
        port: 81
      initialDelaySeconds: 10
      periodSeconds: 5
---     
apiVersion: v1       
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    name: web2
  ports:
  - protocol: TCP
    port: 81             
    targetPort: 81
// Run the pod:
[root@sqm-master yaml]# kubectl apply -f  http-readiness.yaml 
pod/web2 created
service/web-svc created
// Check the pod's status:
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME        READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
readiness   0/1     Completed   0          37m     10.244.2.12   node02   <none>           <none>
web         1/1     Running     0          50m     10.244.1.11   node01   <none>           <none>
web2        0/1     Running     0          2m31s   10.244.1.14   node01   <none>           <none>

[screenshot: kubectl describe pod output]
From the pod's Events we can tell that the HTTP access failed, so the probe failed and the pod is unhealthy.
The pod is not restarted; it is simply put into an unavailable state.

Health checks during the rolling update process:

First, let's look at the rolling-update fields using the explain tool:

[root@sqm-master ~]#  kubectl  explain deploy.spec.strategy.rollingUpdate

[screenshot: kubectl explain output]
We can see two parameters that can be used during a rolling update:

  • maxSurge: controls how far the total number of replicas may exceed the desired pod count during a rolling update. It can be a percentage or an absolute value and defaults to 1. If set to 3, three extra pods are created right away during the update (each must still pass the probe before the update counts as successful). The larger the value, the faster the upgrade, but the more system resources it consumes.
  • maxUnavailable: controls how many pods may be unavailable during a rolling update. Note that the reduction applies to the original pod count and does not include the maxSurge pods. If set to 3 and the probe fails, 3 pods may be unavailable during the upgrade. The larger the value, the faster the upgrade, but the more system resources it consumes.

When to use maxSurge and maxUnavailable:
1. If you want to upgrade as quickly as possible while keeping the system available and stable, set maxUnavailable to 0 and give maxSurge a larger value.
2. If system resources are tight and pod load is low, set maxSurge to 0 and give maxUnavailable a larger value to speed up the upgrade. Note that if maxSurge is 0 and maxUnavailable equals DESIRED, the whole service may become unavailable; the rolling update then degenerates into a stop-the-world release.
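To make the arithmetic concrete, the strategy used in the updates below, combined with replicas: 10, works out like this:

spec:
  replicas: 10
  strategy:
    rollingUpdate:
      maxSurge: 3        # up to 10 + 3 = 13 pods may exist at once during the rollout
      maxUnavailable: 3  # at least 10 - 3 = 7 pods must remain available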
1) First, we create a Deployment resource object:

[root@sqm-master ~]# vim app.v1.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  replicas: 10              # define 10 replicas
  template:
    metadata:
      labels:
        name: my-web
    spec:
       containers:
        - name: my-web
          image: nginx
          args:
          - /bin/sh
          - -c
          - touch /usr/share/nginx/html/test.html; sleep 300000; # create the file so the probe keeps reporting the pod healthy
          ports:
          - containerPort: 80
          readinessProbe:       # use the readiness mechanism
            exec:
              command:
              - cat
              - /usr/share/nginx/html/test.html
            initialDelaySeconds: 10
            periodSeconds: 10
// After running it, check the number of pods (should be 10):
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME                      READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7bbd55db99-2g6tp   1/1     Running     0          2m11s   10.244.2.44   node02   <none>           <none>
my-web-7bbd55db99-2jdbz   1/1     Running     0          118s    10.244.2.45   node02   <none>           <none>
my-web-7bbd55db99-5mhcv   1/1     Running     0          2m53s   10.244.1.40   node01   <none>           <none>
my-web-7bbd55db99-77b4v   1/1     Running     0          2m      10.244.1.44   node01   <none>           <none>
my-web-7bbd55db99-h888n   1/1     Running     0          2m53s   10.244.2.41   node02   <none>           <none>
my-web-7bbd55db99-j5tgz   1/1     Running     0          2m38s   10.244.2.42   node02   <none>           <none>
my-web-7bbd55db99-kjgm2   1/1     Running     0          2m25s   10.244.1.42   node01   <none>           <none>
my-web-7bbd55db99-kkmh2   1/1     Running     0          2m38s   10.244.1.41   node01   <none>           <none>
my-web-7bbd55db99-lr896   1/1     Running     0          2m13s   10.244.1.43   node01   <none>           <none>
my-web-7bbd55db99-rpd8v   1/1     Running     0          2m23s   10.244.2.43   node02   <none>       

The probes succeed and all 10 replicas are running.

2) First update:
Update the nginx image version and set a rolling-update strategy:

[root@sqm-master yaml]# vim app.v1.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec: 
  strategy:       # rolling-update strategy, configured via the rollingUpdate sub-fields
    rollingUpdate:
      maxSurge: 3                 # at most 3 extra pods may be created during the rolling update
      maxUnavailable: 3     # at most 3 pods may be unavailable during the rolling update
  replicas: 10              
  template:
    metadata:
      labels:
        name: my-web
    spec:
       containers:
        - name: my-web
          image: 172.16.1.30:5000/nginx:v2.0    # update to the nginx:v2.0 image from the private registry
          args:
          - /bin/sh
          - -c
          - touch /usr/share/nginx/html/test.html; sleep 300000; 
          ports:
          - containerPort: 80
          readinessProbe:       
            exec:
              command:
              - cat
              - /usr/share/nginx/html/test.html
            initialDelaySeconds: 10
            periodSeconds: 10
// After applying the YAML, check the number of pods:
[root@sqm-master yaml]# kubectl  get pod -o wide
NAME                      READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7db8b88b94-468zv   1/1     Running     0          3m38s   10.244.2.57   node02   <none>           <none>
my-web-7db8b88b94-bvszs   1/1     Running     0          3m24s   10.244.1.60   node01   <none>           <none>
my-web-7db8b88b94-c4xvv   1/1     Running     0          3m38s   10.244.2.55   node02   <none>           <none>
my-web-7db8b88b94-d5fvc   1/1     Running     0          3m38s   10.244.1.58   node01   <none>           <none>
my-web-7db8b88b94-lw6nh   1/1     Running     0          3m21s   10.244.2.59   node02   <none>           <none>
my-web-7db8b88b94-m9gbh   1/1     Running     0          3m38s   10.244.1.57   node01   <none>           <none>
my-web-7db8b88b94-q5dqc   1/1     Running     0          3m38s   10.244.1.59   node01   <none>           <none>
my-web-7db8b88b94-tsbmm   1/1     Running     0          3m38s   10.244.2.56   node02   <none>           <none>
my-web-7db8b88b94-v5q2s   1/1     Running     0          3m21s   10.244.1.61   node01   <none>           <none>
my-web-7db8b88b94-wlgwb   1/1     Running     0          3m25s   10.244.2.58   node02   <none>           <none>
// Check the pods' image version:
[root@sqm-master yaml]# kubectl  get deployments. -o wide
NAME     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                        SELECTOR
my-web   10/10   10           10          49m   my-web       172.16.1.30:5000/nginx:v2.0   name=my-web

The probes succeed and all 10 pods are updated to the new version.

3) Second update:
Update the image to version 3.0, keeping the same rolling-update strategy. (This time the probe will fail.)
The pod configuration file is as follows:

[root@sqm-master yaml]# vim app.v1.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-web
spec:
  strategy:
    rollingUpdate:
      maxSurge: 3                  # update strategy: both values stay at 3
      maxUnavailable: 3
  replicas: 10                       # still 10 pods
  template:
    metadata:
      labels:
        name: my-web
    spec:
       containers:
        - name: my-web
          image: 172.16.1.30:5000/nginx:v3.0   # update the test image to version 3.0
          args:
          - /bin/sh
          - -c
          - sleep 300000;        # the specified file is no longer created, so the probe fails
          ports:
          - containerPort: 80
          readinessProbe:       
            exec:
              command:
              - cat
              - /usr/share/nginx/html/test.html
            initialDelaySeconds: 10
            periodSeconds: 5
// Re-apply the pod configuration file:
[root@sqm-master yaml]# kubectl apply -f  app.v1.yaml 
deployment.extensions/my-web configured
// Check the number of pods after the update:
[root@sqm-master yaml]# kubectl  get pod  -o wide
NAME                      READY   STATUS      RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
my-web-7db8b88b94-468zv   1/1     Running     0          12m    10.244.2.57   node02   <none>           <none>
my-web-7db8b88b94-c4xvv   1/1     Running     0          12m    10.244.2.55   node02   <none>           <none>
my-web-7db8b88b94-d5fvc   1/1     Running     0          12m    10.244.1.58   node01   <none>           <none>
my-web-7db8b88b94-m9gbh   1/1     Running     0          12m    10.244.1.57   node01   <none>           <none>
my-web-7db8b88b94-q5dqc   1/1     Running     0          12m    10.244.1.59   node01   <none>           <none>
my-web-7db8b88b94-tsbmm   1/1     Running     0          12m    10.244.2.56   node02   <none>           <none>
my-web-7db8b88b94-wlgwb   1/1     Running     0          12m    10.244.2.58   node02   <none>           <none>
my-web-849cc47979-2g59w   0/1     Running     0          3m9s   10.244.1.63   node01   <none>           <none>
my-web-849cc47979-2lkb6   0/1     Running     0          3m9s   10.244.1.64   node01   <none>           <none>
my-web-849cc47979-762vb   0/1     Running     0          3m9s   10.244.1.62   node01   <none>           <none>
my-web-849cc47979-dv7x8   0/1     Running     0          3m9s   10.244.2.61   node02   <none>           <none>
my-web-849cc47979-j6nwz   0/1     Running     0          3m9s   10.244.2.60   node02   <none>           <none>
my-web-849cc47979-v5h7h   0/1     Running     0          3m9s   10.244.2.62   node02   <none>           <none>

We can see that there are now 13 pods in total (including the maxSurge surplus). Because the probe fails, the 6 newly created pods (including the 3 extra ones) are stuck in an unavailable state, while 7 pods remain available (since maxUnavailable is 3, at least 10 - 3 = 7 must stay up). Note, however, that those 7 pods were not updated; they still run the previous version.

// Check the pods' version information after the update:
[root@sqm-master yaml]# kubectl  get deployments. -o wide
NAME     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                        SELECTOR
my-web   7/10    6            7           58m   my-web       172.16.1.30:5000/nginx:v3.0   name=my-web

What the columns mean:
READY: available pods versus the user's desired count
UP-TO-DATE: the number of pods that have been updated
AVAILABLE: the number of pods that are available

We can see that 6 pods (including the 3 extra ones) were updated to the new image version but are unavailable; meanwhile 7 pods are kept available, though their version was not updated.

Summary:
What do the probes accomplish during a rolling update?

Suppose you need to update an application's pods. Without a probe mechanism, the rollout cannot tell whether the new pods are actually ready to serve, so every pod gets replaced regardless, which can have serious consequences: after the update each pod may look normal, its READY column showing 1/1 to satisfy the controller manager, yet all the pods have been regenerated from a broken version and the service is effectively lost.
With a probe mechanism, the probe checks whether a specified file (or some other condition of the application) exists in the container. Only when that condition is met, i.e. the probe succeeds, does the rollout proceed; if the probe fails, the new pods (containers) are marked unavailable, but at least some pods of the previous version remain available, keeping the service running normally. That shows just how important a probe mechanism is.
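As a closing sketch (not from the original examples), the two mechanisms are often combined in a single pod, so that a hung process gets restarted while a merely unready one is only taken out of rotation:

apiVersion: v1
kind: Pod
metadata:
  name: web-both-probes
  labels:
    name: web-both-probes
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:           # a failure here makes the kubelet restart the container
      httpGet:
        path: /index.html
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5
    readinessProbe:          # a failure here removes the pod from the Service's endpoints, without a restart
      httpGet:
        path: /index.html
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5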

-------- End of this article. Thanks for reading. --------

Origin blog.51cto.com/13972012/2456952