OpenKruise v1.1: feature enhancements aligned with upstream, performance optimization for large-scale clusters

Author: Wine Toast (Wang Siyu)

OpenKruise, a cloud-native application automation suite and CNCF Sandbox project, has recently released v1.1.

OpenKruise [1] is an extended capability suite for Kubernetes, focusing on deployment, upgrade, operations, and stability protection for cloud-native applications. All features are implemented through standard extension mechanisms such as CRDs and can be used on any Kubernetes cluster of version 1.16 or above. Kruise can be deployed with a single helm command, with no further configuration required.

Release overview

In v1.1, OpenKruise extends and enhances many existing features and optimizes runtime performance in large-scale clusters. Below is a brief introduction to some of the v1.1 features.

It is worth noting that OpenKruise v1.1 upgrades its Kubernetes code dependency to v1.22, which means users can use new fields up to v1.22 in the pod templates of workloads such as CloneSet. However, the Kubernetes cluster versions that OpenKruise installations remain compatible with are still >= v1.16.

In-place upgrades support container priority ordering

In v1.0, released at the end of last year, OpenKruise introduced the container launch priority feature [2], which supports defining different priorities for multiple containers in a Pod and controlling their startup order according to those priorities when the Pod is created.

In v1.0, this feature only took effect during the creation stage of each Pod. Once creation was complete, if multiple containers in the Pod were upgraded in place, they were all upgraded simultaneously.

Recently, the community has exchanged ideas with companies such as LinkedIn and gathered more input from user scenarios. In some scenarios, multiple containers in a Pod are coupled: for example, when the business container is upgraded, some other containers in the Pod also need to update their configuration to match the new version; or containers need to avoid parallel upgrades so that a log-collecting sidecar container does not lose logs from the business container.

Therefore, OpenKruise v1.1 supports in-place upgrades ordered by container priority. Users do not need to configure any additional parameters: as long as the Pod is created with container launch priorities, higher-priority containers are not only guaranteed to start before lower-priority ones during Pod creation, but also, within a single in-place upgrade, if multiple containers are upgraded at the same time, the higher-priority containers are upgraded first, and the lower-priority containers are upgraded only after the higher-priority ones have finished upgrading and started successfully.

The in-place upgrade here includes both image upgrades (modifying the container image) and env-from-metadata upgrades (modifying environment variables that reference labels/annotations). For details, please refer to the in-place update introduction [3]. In summary:

  • For Pods without container launch priorities, there is no ordering guarantee during a multi-container in-place upgrade.

  • For Pods with container launch priorities:

    • If the containers being upgraded in place have different priorities, the in-place upgrade order follows the priority order.

    • If the containers being upgraded in place share the same priority, there is no ordering guarantee during the in-place upgrade.

For example, a CloneSet containing two containers with different launch priorities would look like this:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... config v1 ..."
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_CONTAINER_PRIORITY
          value: "10"
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
      - name: main
        image: main-image:v1
  updateStrategy:
    type: InPlaceIfPossible

When we update the CloneSet, modifying both the app-config annotation and the image of the main container, both the sidecar and the main container need to be updated. Kruise will upgrade the Pod in place, rebuilding the sidecar container first so that the new env from the annotation takes effect.

Next, we can see the apps.kruise.io/inplace-update-state annotation and its value on the Pod being upgraded:

{
  "revision": "{CLONESET_NAME}-{HASH}",         // target revision of this in-place update
  "updateTimestamp": "2022-03-22T09:06:55Z",    // start time of the whole in-place update
  "nextContainerImages": {"main": "main-image:v2"},                // container images still to be upgraded in later batches
  // "nextContainerRefMetadata": {...},                            // container env from labels/annotations still to be upgraded in later batches
  "preCheckBeforeNext": {"containersRequiredReady": ["sidecar"]},  // pre-check items that must pass before the next batch of containers can be upgraded in place
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]}  // the first batch of containers already updated (this only means the container spec has been updated, e.g. the image or labels/annotations in pod.spec.containers; it does not mean the real container on the node has finished upgrading)
  ]
}

After the sidecar container is upgraded successfully, Kruise will proceed to upgrade the main container. Eventually you will see the following apps.kruise.io/inplace-update-state annotation on the Pod:

{
  "revision": "{CLONESET_NAME}-{HASH}",
  "updateTimestamp": "2022-03-22T09:06:55Z",
  "lastContainerStatuses":{"main":{"imageID":"THE IMAGE ID OF OLD MAIN CONTAINER"}},
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]},
    {"timestamp":"2022-03-22T09:07:20Z","containers":["main"]}
  ]
}

In general, users only need to pay attention to containerBatchesRecord to confirm that the containers are upgraded in multiple batches. If the Pod gets stuck during an in-place upgrade, you can check the nextContainerImages/nextContainerRefMetadata fields, and check whether the containers from the previous batch, listed in preCheckBeforeNext, have been upgraded successfully and become ready.

StatefulSetAutoDeletePVC feature

Starting from Kubernetes v1.23, the native StatefulSet has added the StatefulSetAutoDeletePVC feature, which retains or automatically deletes the PVC objects created by a StatefulSet according to a given policy; see the reference documentation [4].

Therefore, Advanced StatefulSet in v1.1 has synced this feature from upstream, allowing users to specify the auto-cleanup policy via the .spec.persistentVolumeClaimRetentionPolicy field. This requires enabling the StatefulSetAutoDeletePVC feature-gate when installing or upgrading Kruise.
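The feature-gate mentioned above can be turned on at install/upgrade time. The `featureGates` chart value below follows the Kruise installation convention; the chart repo/name and version number are illustrative, so adjust them to your environment:

```shell
# Enable the StatefulSetAutoDeletePVC feature-gate when installing/upgrading Kruise.
helm upgrade --install kruise openkruise/kruise \
  --version 1.1.0 \
  --set featureGates="StatefulSetAutoDeletePVC=true"
```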

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  ...
  persistentVolumeClaimRetentionPolicy:  # optional
    whenDeleted: Retain | Delete
    whenScaled: Retain | Delete

The two policy fields are:

  • whenDeleted: the retention/deletion policy for PVCs when the Advanced StatefulSet is deleted.
  • whenScaled: the retention/deletion policy for PVCs associated with Pods removed when the Advanced StatefulSet is scaled down.

Each policy can be set to one of two values:

  • Retain (default): the same behavior as StatefulSet in the past; the PVCs associated with a Pod are retained when the Pod is deleted.
  • Delete: the PVC objects associated with a Pod are automatically deleted when the Pod is deleted.

In addition, there are a few points to note:

  1. StatefulSetAutoDeletePVC only cleans up PVCs defined and created from volumeClaimTemplates; it does not clean up PVCs that users created themselves or otherwise associated with StatefulSet Pods.
  2. The cleanup above only happens when the Advanced StatefulSet is deleted or actively scaled down. If a Pod is evicted and recreated due to, for example, a node failure, it still reuses the existing PVC.
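The two notes above can be condensed into a small decision function. This is a hypothetical sketch of the retention logic, not Kruise's actual code:

```go
package main

import "fmt"

// Policy values, matching the whenDeleted/whenScaled fields above.
const (
	Retain = "Retain"
	Delete = "Delete"
)

type retentionPolicy struct {
	WhenDeleted string
	WhenScaled  string
}

// shouldDeletePVC sketches the decision: only PVCs created from
// volumeClaimTemplates are considered, and only on StatefulSet deletion or
// scale-down. Pod eviction/recreation (e.g. node failure) reuses the PVC.
func shouldDeletePVC(p retentionPolicy, event string, fromTemplate bool) bool {
	if !fromTemplate {
		return false // user-created PVCs are never cleaned up
	}
	switch event {
	case "statefulset-deleted":
		return p.WhenDeleted == Delete
	case "scaled-down":
		return p.WhenScaled == Delete
	default: // e.g. pod evicted and recreated on another node
		return false
	}
}

func main() {
	p := retentionPolicy{WhenDeleted: Delete, WhenScaled: Retain}
	fmt.Println(shouldDeletePVC(p, "statefulset-deleted", true)) // true
	fmt.Println(shouldDeletePVC(p, "scaled-down", true))         // false
}
```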

Advanced DaemonSet refactored, with lifecycle hook support

The earlier implementation of Advanced DaemonSet diverged significantly from the upstream controller; for example, extra configuration fields were needed to decide whether to handle not-ready and unschedulable nodes, which added usage cost and burden for our users.

In v1.1, we did a small refactor of Advanced DaemonSet to realign it with the upstream controller. As a result, all default behaviors of Advanced DaemonSet are essentially consistent with the native DaemonSet, and, just as with Advanced StatefulSet, users can conveniently convert a native DaemonSet into an Advanced DaemonSet simply by changing the apiVersion.

In addition, we added lifecycle hooks to Advanced DaemonSet, starting with the preDelete hook, which allows users to execute custom logic before a daemon Pod is deleted.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  ...
  # define with label
  lifecycle:
    preDelete:
      labelsHandler:
        example.io/block-deleting: "true"

When the DaemonSet deletes a Pod (during both scale-down and recreate upgrades):

  • If no lifecycle hook is defined, or the Pod does not match the preDelete conditions, it is deleted directly.
  • Otherwise, the Pod is first updated to the PreparingDelete state, and Kruise waits for a user-defined controller to remove the associated label/finalizer from the Pod before actually deleting it.
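The labelsHandler match above works like a label check. The following is a hypothetical sketch of that decision, not Kruise's actual implementation:

```go
package main

import "fmt"

// preDeleteBlocked sketches the Advanced DaemonSet preDelete check: if the
// pod still carries every label in the hook's labelsHandler, deletion is
// deferred and the pod moves to PreparingDelete until a user controller
// removes those labels.
func preDeleteBlocked(podLabels, hookLabels map[string]string) bool {
	for k, v := range hookLabels {
		if podLabels[k] != v {
			return false // label absent or changed: safe to delete directly
		}
	}
	return len(hookLabels) > 0 // empty hook means no blocking at all
}

func main() {
	hook := map[string]string{"example.io/block-deleting": "true"}
	// Pod still labeled: deletion is deferred (PreparingDelete).
	fmt.Println(preDeleteBlocked(map[string]string{"example.io/block-deleting": "true"}, hook)) // true
	// Label removed by the user controller: pod can be deleted directly.
	fmt.Println(preDeleteBlocked(map[string]string{}, hook)) // false
}
```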

Disable DeepCopy performance optimization

By default, when we write an Operator/Controller with controller-runtime and use its sigs.k8s.io/controller-runtime/pkg/client Client to get/list (typed) objects, the results are fetched from the in-memory Informer and returned; this is well known.

What many people do not know, however, is that behind these get/list operations, controller-runtime makes a deep copy of every object found in the Informer before returning it.

The original intent of this design is to prevent developers from mistakenly mutating the objects in the Informer directly. After the deep copy, no matter how the developer modifies the objects returned by get/list, the objects in the Informer are unaffected; the latter are only synced from the kube-apiserver ListWatch requests.

But in very large clusters, all of OpenKruise's controllers run simultaneously, and each controller also runs multiple workers executing Reconcile, which can produce a huge number of deep copy operations. For example, when the cluster contains CloneSets for a large number of applications, and some of those CloneSets manage very large numbers of Pods, each worker lists all Pod objects under a CloneSet during Reconcile; with multiple workers operating in parallel, this can cause a sharp transient spike in kruise-manager CPU and memory usage, and even a risk of OOM when the memory quota is insufficient.

In upstream controller-runtime, I submitted and merged the DisableDeepCopy feature [5] last year, included in controller-runtime v0.10 and above. It allows developers to specify certain resource types for which get/list queries skip the deep copy and instead return the object pointers from the Informer directly.

For example, in the following code, adding cache options when initializing the Manager in main.go configures resource types such as Pod to skip the deep copy.

mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    ...
    NewCache: cache.BuilderWithOptions(cache.Options{
        UnsafeDisableDeepCopyByObject: map[client.Object]bool{
            &v1.Pod{}: true,
        },
    }),
})

In Kruise v1.1, however, we chose not to use this feature directly. Instead, we re-wrapped the Delegating Client [6] so that developers can use the DisableDeepCopy ListOption at any list call site to skip the deep copy for that single list operation.

if err := r.List(context.TODO(), &podList, client.InNamespace("default"), utilclient.DisableDeepCopy); err != nil {
    return nil, nil, err
}

The advantage of this approach is greater flexibility: it avoids turning off deep copy for an entire resource type, in which case the many community contributors participating in development might, without noticing, mistakenly modify the objects in the Informer.

Other changes

You can view more changes, along with their authors and commit records, on the GitHub release [7] page.

Community participation

You are very welcome to join the OpenKruise open-source community via GitHub/Slack/DingTalk/WeChat. Do you have something you would like to discuss with our community? Share it at our biweekly community meeting [8], or join the discussion through the following channels:

  • Join the community Slack channel [9] (English)
  • Join the community DingTalk group: search for group number 23330762 (Chinese)
  • Join the (new) community WeChat group: add the user openkruise and let the bot invite you (Chinese)

Related links

[1] OpenKruise: https://openkruise.io/
[2] Container launch priority: https://openkruise.io/zh/docs/user-manuals/containerlaunchpriority/
[3] In-place update introduction: https://openkruise.io/zh/docs/core-concepts/inplace-update
[4] Reference documentation: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#persistentvolumeclaim-retention
[5] DisableDeepCopy feature: https://github.com/kubernetes-sigs/controller-runtime/pull/1274
[6] Delegating Client: https://github.com/openkruise/kruise/blob/master/pkg/util/client/delegating_client.go
[7] GitHub release: https://github.com/openkruise/kruise/releases
[8] Biweekly community meeting: https://shimo.im/docs/gXqmeQOYBehZ4vqo
[9] Slack channel: https://kubernetes.slack.com/?redir=%2Farchives%2Fopenkruise

Visit the official OpenKruise project homepage and documentation at https://openkruise.io/ !


Origin juejin.im/post/7082848971948818463