ByteDance's open-source KubeAdmiral: a new-generation multi-cluster orchestration and scheduling engine based on Kubernetes

Source: KubeAdmiral open source community

Project address: https://github.com/kubewharf/kubeadmiral

Since it was open-sourced in 2014, Kubernetes has become the de facto standard for orchestration and scheduling systems, bringing great convenience to developers. As more and more enterprises embrace cloud native, the scale of global cloud infrastructure keeps accelerating. The 5,000-node single-cluster limit of the community version of Kubernetes can no longer satisfy enterprise-scale application scenarios. At the same time, more companies are choosing multi-cloud architectures to reduce costs and improve efficiency, achieve cross-region disaster recovery, and isolate environments, making multi-cluster management increasingly necessary.

Background

With the rapid development of the business, the number of Kubernetes clusters within ByteDance has kept growing: there are now more than 500 clusters, the number of replicas per application ranges from 0 to 20,000, and the largest application consumes more than one million CPU cores.

In the early days, each ByteDance business line had exclusive clusters for isolation and security reasons. These exclusive clusters created resource islands, which ultimately hurt resource elasticity. First, every business line had to maintain its own independent buffer. Second, businesses were deeply bound to clusters: they had to be aware of a large number of clusters and allocate resources to applications across them, and SREs managing resources likewise needed deep knowledge of both the businesses and the clusters. As a result, resources turned over slowly between business lines, automation efficiency was low, and deployment rates were suboptimal. We therefore needed to introduce federation to decouple applications from clusters, pool the resources of each business line, shrink the buffers, and improve the automation efficiency of resources.

As multi-cloud and hybrid cloud increasingly become mainstream in the industry, and Kubernetes becomes the cloud-native operating system, infrastructure of all kinds is further abstracted and standardized to provide applications with more unified, standard interfaces. On this basis, we hope to introduce federation as the foundation of the cloud-native system in distributed cloud scenarios: providing a unified platform entry point for applications, improving the ability to distribute applications across clusters, handling cross-cluster distribution and scheduling of applications well, and managing infrastructure in multi-cloud, cloud-native scenarios.

KubeFed V2 practice at ByteDance

Facing the challenges of multi-cluster management, the infrastructure team began building cluster federation on top of the community's KubeFed V2 in 2019. KubeFed V2 distinguishes between a host cluster and member clusters: users create "federated objects" in the host cluster, and the KubeFed controllers distribute resources among member clusters based on those objects. A federated object has three fields that declare how the object should be deployed: Template (the object template), Placement (the target clusters), and Overrides (per-cluster differentiation). For example, a FederatedDeployment like the one below can be created in the host cluster to distribute a Deployment:

apiVersion: types.kubefed.k8s.io/v1beta1
kind: FederatedDeployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  template: # Defines the full content of the Deployment, analogous to the relationship between a Deployment and its Pod template.
    metadata:
      labels:
        app: nginx
    spec:
      ...
  placement:
    # Propagate to the two specified clusters
    clusters:
    - name: cluster1
    - name: cluster2
  overrides:
  # Override the replica count to 5 in cluster2
  - clusterName: cluster2
    clusterOverrides:
    - path: spec.replicas
      value: 5

For Deployments and ReplicaSets, KubeFed also allows more advanced replica distribution strategies to be specified through ReplicaSchedulingPreference (RSP). Users can configure a weight and a minimum and maximum number of replicas for each cluster on the RSP, and the RSP controller automatically computes the placement and overrides fields and updates the FederatedDeployment or FederatedReplicaSet.
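
For reference, an RSP looks roughly like the following (a minimal sketch based on the KubeFed user guide; KubeFed matches the RSP to the FederatedDeployment with the same name and namespace):

apiVersion: scheduling.kubefed.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  # Matched to the FederatedDeployment of the same name and namespace
  name: test-deployment
  namespace: test-namespace
spec:
  targetKind: FederatedDeployment
  totalReplicas: 9
  clusters:
    cluster1:
      minReplicas: 2
      maxReplicas: 6
      weight: 1
    cluster2:
      minReplicas: 2
      maxReplicas: 8
      weight: 2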

[Figure omitted. Image source: https://www.kubernetes.org.cn/5702.html]

However, during implementation, we found that KubeFed could not meet the requirements of the production environment:

  1. Low resource utilization - KubeFed's replica scheduling policy, RSP, can only assign static weights to member clusters and cannot respond flexibly to changes in cluster resources, resulting in uneven deployment rates across member clusters.
  2. Changes are not smooth - scaling up or down often leaves instances unevenly distributed, reducing disaster recovery capability.
  3. Limited scheduling semantics - only stateless resources are well supported; support for diverse resources such as stateful services and jobs is insufficient, and scheduling extensibility is poor.
  4. High access cost - resources must be distributed by creating federated objects, which is not compatible with the native API, so users and upper-layer platforms have to completely change their usage habits.

As ByteDance's infrastructure evolved, we placed higher demands on efficiency, scale, performance, and cost. At the same time, as business scenarios such as stateful services, storage, offline jobs, and machine learning further embraced cloud native, the ability to orchestrate and schedule diverse workloads across clusters became increasingly important. Therefore, at the end of 2021, we developed KubeAdmiral, a new generation of cluster federation system based on KubeFed V2.


KubeAdmiral architecture evolution

The name KubeAdmiral derives from admiral (pronounced [ˈædm(ə)rəl]), originally meaning fleet commander. The Kube(rnetes) prefix signifies the tool's powerful Kubernetes multi-cluster orchestration and scheduling capabilities.

KubeAdmiral supports the native Kubernetes API, provides a rich and extensible scheduling framework, and has carefully refined its scheduling algorithms and distribution process. Some notable features are described in detail below:


1. Rich multi-cluster scheduling capabilities

The scheduler is the core component of a federated system. It is responsible for allocating resources to member clusters and, in replica scheduling scenarios, for calculating the number of replicas each cluster deserves. Its scheduling logic directly affects key properties of the federation such as multi-cluster disaster recovery, resource efficiency, and stability.

KubeFed provides the RSP scheduler for replica scheduling, but its customizability and extensibility are very limited and its logic is poorly abstracted: changing its behavior requires modifying code. It also lacks support for stateful services, job resources, and the like.

KubeAdmiral introduces richer scheduling semantics, supports more flexible cluster selection through labels, taints, and other mechanisms, provides scheduling capabilities for stateful and job-type resources, and introduces optimizations such as dependency-following scheduling. Scheduling semantics can be configured through a PropagationPolicy object, as shown below:

apiVersion: core.kubeadmiral.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: mypolicy
  namespace: default
spec:
  # Multiple cluster selection methods are provided; the final result is their intersection
  placement: # Manually specify clusters and weights
    - cluster: Cluster-01
      preferences:
        weight: 40
    - cluster: Cluster-02
      preferences:
        weight: 30
    - cluster: Cluster-03
      preferences:
        weight: 40
  clusterSelector: # Similar to Pod.Spec.NodeSelector; filters clusters by label
    IPv6: "true"
  clusterAffinity: # Similar to Pod.Spec.NodeAffinity; filters clusters by label with more flexible syntax than clusterSelector
    - matchExpressions:
        - key: region
          operator: In
          values:
            - beijing
  tolerations: # Filter clusters by taints
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
  schedulingMode: Divide # Whether to divide replicas across clusters (replica scheduling)
  stickyCluster: false # Schedule only on the first pass; suitable for stateful or job-type services
  maxClusters: 1 # Maximum number of member clusters to propagate to; suitable for stateful or job-type services
  disableFollowerScheduling: false # Whether to disable dependency-following scheduling

At the same time, for resources scheduled to different clusters, an OverridePolicy can apply per-cluster differentiation based on cluster names or labels:

apiVersion: core.kubeadmiral.io/v1alpha1
kind: OverridePolicy
metadata:
  name: example
  namespace: default
spec:
  # The finally matched clusters are the intersection of the clusters matched by all rules
  overrideRules:
    - targetClusters:
        # Match clusters by name
        clusters:
          - member1
          - member2
        # Match clusters by label selector
        clusterSelector:
          region: beijing
          az: zone1
        # Match clusters by label-based affinity
        clusterAffinity:
          - matchExpressions:
            - key: region
              operator: In
              values:
              - beijing
            - key: provider
              operator: In
              values:
                - volcengine
      # In the matched clusters, use JSON patch syntax to change the image of the first container
      overriders:
        jsonpatch:
          - path: "/spec/template/spec/containers/0/image"
            operator: replace
            value: "nginx:test"

2. Extensible scheduling capabilities

KubeAdmiral follows the design of kube-scheduler and provides an extensible scheduling framework. It abstracts the scheduling logic into four steps: Filter, Score, Select, and Replica, and implements each step with multiple relatively independent plugins. Almost every field in the PropagationPolicy shown above is implemented by an independent built-in scheduling plugin. The plugins do not interfere with one another, and the scheduler calls the required plugins to perform global orchestration.

In addition, the KubeAdmiral scheduler supports interacting with external plugins over HTTP. Users can write and deploy customized scheduling logic, for example to consult internal company systems during scheduling. The built-in plugins implement the more general capabilities, and external plugins complement them, so users can extend the scheduling logic at minimal cost without changing the federation control plane, while relying on KubeAdmiral's powerful multi-cluster distribution capability to make the scheduling results take effect.
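
The article does not spell out the registration API for external plugins. Purely as an illustrative sketch (the kind and all field names below are hypothetical, not taken from the KubeAdmiral API; consult the repository for the real schema), registering an HTTP plugin with the scheduler might look like:

apiVersion: core.kubeadmiral.io/v1alpha1
kind: SchedulerPluginWebhookConfiguration   # hypothetical kind, for illustration only
metadata:
  name: my-company-scheduler-plugin
spec:
  urlPrefix: "https://scheduler-plugin.internal.example.com"  # hypothetical: base URL of the user-deployed plugin service
  payloadVersions: ["v1alpha1"]  # hypothetical: webhook payload version
  filterPath: "/filter"          # hypothetical: endpoint implementing the Filter step
  scorePath: "/score"            # hypothetical: endpoint implementing the Score step
  httpTimeout: 5s                # hypothetical: per-request timeout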


3. Automatic migration when application scheduling fails

For replica-scheduled resources, KubeAdmiral calculates the number of replicas each member cluster deserves, rewrites the replica count field, and delivers the resource to each member cluster; this process is called federated scheduling. After the resources are delivered, the kube-scheduler in each member cluster assigns the resource's pods to nodes; this process is called single-cluster scheduling.

After resources are delivered, single-cluster scheduling may occasionally fail because nodes go offline, resources are insufficient, node affinity cannot be satisfied, and so on. If left unhandled, the number of available business instances stays below expectations. KubeAdmiral provides automatic migration on scheduling failure: when enabled, it identifies unschedulable replicas in member clusters and migrates them to clusters that can accommodate the surplus replicas, enabling resource turnover across clusters.

For example, suppose three clusters A, B, and C with equal weights are allocated 6 replicas. After initial federated scheduling, each cluster is assigned 2 replicas. If the two replicas in cluster C fail single-cluster scheduling, KubeAdmiral automatically migrates them to A and B:

| Cluster | A | B | C |
| --- | --- | --- | --- |
| Weight | 1 | 1 | 1 |
| Replicas after initial federated scheduling | 2 | 2 | 2 |
| Replicas failing single-cluster scheduling | 0 | 0 | 2 |
| Replicas after automatic migration | 3 | 3 | 0 |
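
Automatic migration is opted into per PropagationPolicy. A minimal sketch, assuming an autoMigration stanza of roughly the following shape (the field names are our assumption and should be checked against the KubeAdmiral API):

apiVersion: core.kubeadmiral.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: policy-with-auto-migration
  namespace: default
spec:
  schedulingMode: Divide      # replica scheduling, as described above
  autoMigration:              # assumed field: enables migration of unschedulable replicas
    when:
      podUnschedulableFor: 1m # assumed field: how long pods must stay unschedulable before migrating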

4. Dynamic resource scheduling based on cluster utilization

In a multi-cluster environment, the resource levels of clusters change dynamically as machines come online and go offline. Relying solely on the static weights provided by KubeFed RSP to schedule replicas easily leads to uneven cluster utilization: clusters with an excessively high deployment rate see pods stuck pending during service upgrades, while clusters with a low deployment rate leave resources idle. KubeAdmiral therefore introduces dynamic weight scheduling based on cluster utilization: it collects each cluster's total resource capacity and usage to compute the available amount, and uses the available amount as the weight for replica scheduling, ultimately load-balancing the member clusters and keeping the deployment rate of all member clusters above 95%.
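
As a concrete illustration: if member clusters A, B, and C currently have 600, 300, and 100 schedulable cores available, the effective replica scheduling weights become 6:3:1, so a 10-replica application is placed as 6/3/1 rather than evenly; as machines come and go and the available amounts change, later scheduling decisions follow the updated weights.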

5. Improved replica allocation algorithm

KubeFed's replica allocation algorithm often causes instance counts to deviate from expectations when scaling up or down. For example:

Suppose 30 instances are distributed across three member clusters A, B, and C, and with rsp.rebalance = false the user scales down to 15 instances:

Before scaling down:

| Cluster | A | B | C |
| --- | --- | --- | --- |
| Weight | 10 | 10 | 10 |
| Instances | 15 | 15 | 0 |

After scaling down:

| Cluster | A | B | C |
| --- | --- | --- | --- |
| Weight | 10 | 10 | 10 |
| Instances | 15 | 0 | 0 |

This happens because KubeFed's replica algorithm first pre-allocates to each cluster the instances it currently has, and only then distributes the remaining instances according to cluster weights. If a cluster already holds too many replicas, the final instance distribution deviates severely from the weights.

KubeAdmiral optimizes KubeFed's replica algorithm so that the final distribution is as close to the weight distribution as possible, while guaranteeing that no unexpected migrations occur during scaling. Taking the scale-down from 30 instances to 15 as an example, the simplified algorithm proceeds as follows:

  1. Current distribution = [15, 15, 0], total replicas: 30
  2. Desired distribution = [5, 5, 5], total replicas: 15
  3. distance = desired - current = [-10, -10, 5], total distance: 15
  4. Since this is a scale-down, remove the positive terms: distance = [-10, -10, 0]
  5. Using the distances as weights, redistribute the difference of 15: [-7, -8, 0]
  6. Final scheduling result: [15, 15, 0] + [-7, -8, 0] = [8, 7, 0]

| Cluster | A | B | C |
| --- | --- | --- | --- |
| Weight | 10 | 10 | 10 |
| Instances | 8 | 7 | 0 |

6. Support for native resources

Unlike KubeFed, which forces users onto a completely incompatible new API, KubeAdmiral caters to the habits of single-cluster Kubernetes users and supports the native Kubernetes API. After a user creates a native resource (such as a Deployment), the Federate Controller automatically converts it into a federated internal object for other controllers to consume. Users can quickly migrate from a single-cluster to a multi-cluster architecture and enjoy the convenience of multiple clusters with a low barrier to entry.

KubeAdmiral does not stop there. In a single cluster, Kubernetes' native controllers update the status of some resources to reflect their current state, and users or upper-layer systems often rely on status to check deployment progress, health, and other information. With multiple clusters, resource status is scattered across clusters, and to see the global picture users must inspect the status in each cluster one by one, leading to fragmented views and low operational efficiency. To solve this problem and seamlessly support native resources, KubeAdmiral provides status aggregation: the Status Aggregator merges the status of a resource across multiple member clusters and writes the result back to the native resource, so users can observe the resource's status across the entire federation at a glance without being aware of the multi-cluster topology.
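
For example, if a Deployment with 6 replicas is spread across three member clusters and all of its pods become ready, the Status Aggregator writes the merged result back so that the native Deployment's status reports readyReplicas: 6, just as it would in a single cluster (exactly which status fields are aggregated depends on the resource type).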

Present and future

KubeAdmiral has been incubated within ByteDance for several years. It provides strong support for the ByteDance Group's internal TCE platform, managing 210,000+ machines and 10,000,000+ pods, and it has been polished by large-scale businesses such as Douyin and Toutiao, accumulating a wealth of valuable practical experience. To give back to the community, KubeAdmiral has been officially open-sourced on GitHub. Meanwhile, Volcano Engine is building a new enterprise-grade multi-cloud, multi-cluster management model based on KubeAdmiral: the distributed cloud native platform (DCP). Stay tuned.


Looking to the future, we plan to continue to evolve in the following aspects:

  • Continue to improve orchestration and scheduling for stateful, job-type, and other resources, and develop advanced capabilities such as automatic migration and price-comparison scheduling, to embrace the multi-cloud, multi-cluster era of batch computing.
  • Improve user experience and provide out-of-the-box solutions to further reduce users’ cognitive load.
  • Improve observability, refine logs and monitoring metrics, and improve the interpretability of the scheduler.
  • Explore functions such as one-click federation and multi-cluster migration to fully unleash the potential of multi-cluster architecture.

Multi-cluster orchestration and scheduling is inherently complex, and a general, complete multi-cluster federation system must be polished across a wide variety of scenarios. We look forward to more friends following and joining the KubeAdmiral community, and we welcome everyone to try KubeAdmiral and send us all kinds of suggestions!

GitHub :https://github.com/kubewharf/kubeadmiral
