Alibaba Cloud in practice: managing tens of thousands of Kubernetes clusters at scale

Overview:

Alibaba Cloud Container Service launched in 2015 and has accompanied and supported the growth of Double 11 (the November 11 shopping festival) ever since. During Double 11 in 2019, Container Service ACK not only supported the containerized cloud migration of the Group's internal core systems and Alibaba Cloud's own cloud products, but also delivered Alibaba's years of large-scale container technology as a product to the many ecosystem companies and ISVs that take part in Double 11. By supporting container clouds for all kinds of industries around the world, Container Service has built up a cloud-native application hosting platform that supports unitized, globalized, and flexible architectures, and it manages more than 10,000 container clusters. This article shares ACK's experience in managing k8s clusters at massive scale.

Introduction

What does managing k8s clusters at massive scale mean? You may have seen earlier talks on how Alibaba or Ant Financial manage a single cluster of 10,000 nodes; managing nodes at that scale is an interesting challenge. This article, however, focuses on a different problem: how to manage more than 10,000 k8s clusters of different specifications. From our conversations with peers, a single company usually manages only a few to a few dozen k8s clusters internally, so why do we need to manage such a large number? First, Container Service ACK is a cloud product on Alibaba Cloud that provides Kubernetes as a Service to customers worldwide, and it is currently available in 20 regions around the world. Second, thanks to the growth of the cloud-native era, more and more companies are embracing k8s. K8s has gradually become the infrastructure of the cloud-native era, the platform of platforms.

Background

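First, let us look at the pain points of hosting these k8s clusters:
1. Different cluster types: standard, serverless, AI, bare-metal, edge, Windows, and other k8s clusters. Each type has its own parameters, components, and hosting requirements, and we need to support more k8s offerings aimed at vertical scenarios.
2. Different cluster sizes: clusters range from a few nodes to more than ten thousand nodes, and from a few services to several thousand services. We need to sustain a several-fold increase in the number of clusters every year.
3. Security and compliance: k8s clusters in different regions and environments must meet different compliance requirements. For example, k8s clusters in Europe must comply with the EU's GDPR, while financial and government clouds in China must meet additional requirements such as classified protection (等级保护).
4. Continuous evolution: we need to keep supporting new k8s versions and new features as they arrive.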

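Design goals:

1. Support unitized tier management, capacity planning, and water-level management
2. Support globalized deployment, release, disaster recovery, and observability
3. Support a flexible architecture that is pluggable, customizable, and evolves continuously like building blocks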

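1. Support unitized tier management, capacity planning, and water-level management

Unitization:
Unitization usually brings to mind scenarios such as a single data center running out of capacity, or disaster recovery with three centers across two locations. So what does unitization have to do with k8s management? For us, a single region (for example, Hangzhou) may manage several thousand k8s clusters, and the lifecycle of all of them must be maintained in a unified way. For a team of k8s specialists, a natural idea is to use multiple k8s meta-clusters to manage these guest k8s masters. The boundary of one k8s meta-cluster is one unit.
We have all heard of service outages caused by a cut fiber or a power failure at some data center. Container Service ACK was designed from the start to support an active-active architecture across data centers in the same city: the master components of every user k8s cluster are automatically spread across multiple data centers and run in primary-primary mode, so a problem in a single data center does not affect cluster stability. At the same time, communication between the master components must remain stable, so when spreading the masters, the ACK scheduling policy also tries to keep the communication latency between them at the millisecond level.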

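Tiering:

The load on the master components of a k8s cluster depends mainly on the number of nodes in the cluster and on the number and call frequency of the components that interact with kube-apiserver, such as controllers and workloads on the worker side. Across tens of thousands of k8s clusters, the size and workload profile of each user cluster vary enormously, so a single standard configuration cannot serve all of them. With cost efficiency also in mind, we provide a more flexible and more intelligent hosting capability. Because different resource types put different load on the master, we assign a separate factor to each resource type and derive a formula from these factors; the formula yields the tier that suits the master of each user k8s cluster. We also use real-time metrics from our unified k8s monitoring platform to keep tuning the factors and the formula, which makes smooth, intelligent switching between tiers possible.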
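The tiering idea can be sketched as a weighted sum of resource counts mapped onto tier boundaries. This is only an illustration: the article does not publish its factors or formula, so the factor values, tier names, and boundaries below are all hypothetical.

```python
# Hypothetical per-resource load factors; the article does not publish real values.
LOAD_FACTORS = {"nodes": 1.0, "pods": 0.05, "services": 0.2, "controllers": 0.5}

# Hypothetical tier boundaries: (upper bound of the score, tier name), ascending.
TIERS = [(100.0, "small"), (1000.0, "medium"), (5000.0, "large")]


def load_score(resource_counts: dict[str, int]) -> float:
    """Weighted sum of resource counts, one factor per resource type."""
    return sum(
        LOAD_FACTORS.get(kind, 0.0) * count
        for kind, count in resource_counts.items()
    )


def pick_tier(resource_counts: dict[str, int]) -> str:
    """Return the first tier whose upper bound covers the load score."""
    score = load_score(resource_counts)
    for upper_bound, name in TIERS:
        if score <= upper_bound:
            return name
    # Anything above the last boundary falls into an open-ended top tier.
    return "xlarge"
```

In a setup like the one described, the factors would be re-tuned from monitoring data, and a cluster would be moved to a new tier when its score crosses a boundary.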

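Capacity planning: Next, consider the capacity model of a k8s meta-cluster: how many user k8s cluster masters can a single meta-cluster host? The first step is to settle the container network plan. We chose Terway, the high-performance container network developed by Alibaba Cloud. On the one hand, elastic network interfaces (ENIs) are needed to connect the user VPC with the network of the hosted masters; on the other hand, Terway provides high performance and rich security policies. Next, we plan network segments based on the IP resources in the VPC and allocate them to nodes, pods, and services. Finally, drawing on statistical patterns and weighing cost, density, performance, resource quotas, tier mix, and other factors, we decide how many guest k8s clusters of each tier to deploy in each meta-cluster unit, and we reserve 40% of capacity as headroom.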
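As a rough sketch of the two calculations involved, the snippet below carves node, pod, and service ranges out of a VPC range and computes how many guest clusters fit once 40% headroom is reserved. The function names, the example CIDR, and the abstract "capacity units" are assumptions made for illustration; real planning also weighs the cost, density, and quota factors mentioned above.

```python
import ipaddress


def plan_cidrs(vpc_cidr: str, new_prefix: int) -> dict[str, str]:
    """Carve three non-overlapping blocks out of the VPC range.

    Simplification: the first three equally sized subnets go to nodes,
    pods, and services, in that order.
    """
    subnets = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {role: str(next(subnets)) for role in ("node", "pod", "service")}


def max_guest_clusters(capacity_units: int, units_per_cluster: int,
                       headroom_percent: int = 40) -> int:
    """Number of guest clusters that fit after reserving the headroom.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    usable = capacity_units * (100 - headroom_percent) // 100
    return usable // units_per_cluster
```

For example, `plan_cidrs("10.0.0.0/16", 18)` assigns 10.0.0.0/18 to nodes, 10.0.64.0/18 to pods, and 10.0.128.0/18 to services, and `max_guest_clusters(1000, 4)` leaves room for 150 clusters.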

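2. Support globalized deployment, release, disaster recovery, and observability

Container Service is available in 20 regions worldwide, and we provide fully automated deployment, release, disaster recovery, and observability. This section focuses on global observability across data centers.

Global observability across data centers
Observability for large, globally distributed clusters is essential to the day-to-day operation of k8s clusters. The core question of the observability design is how to collect real-time status metrics from the target clusters in each data center efficiently, sensibly, securely, and scalably across a complex network environment. We need to cover data collection within regional data centers and unitized clusters, and also provide a global view and visualization. Given these goals and constraints, global observability has to use a multi-level federated approach: the edge layer is pushed down into the clusters being observed, the middle layer aggregates monitoring data across several regions, and the central layer performs the final aggregation and produces the global view and alerts. The benefit of this design is that each layer can be scaled and adjusted flexibly, which suits a continuously growing number of clusters, while the other layers only need parameter changes. The hierarchy is clear and the network structure is simple, so data on internal networks can be passed through to the public network and aggregated.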

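The design of the monitoring system for these globally distributed clusters is critical to keeping them running efficiently. Our design philosophy is to collect and aggregate data from every data center worldwide in real time, providing a global view and data visualization as well as fault localization and alert notification. In the cloud-native era, Prometheus, the second project to graduate from the CNCF, is a natural fit for container scenarios. Combined with Kubernetes, it provides service discovery and monitoring of dynamically scheduled services, which gives it a clear advantage over other monitoring solutions, and it has become the de facto standard for container monitoring. We therefore chose Prometheus as the foundation of our solution.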

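For each cluster, the main categories of metrics to collect are:

OS metrics, such as node resource (CPU, memory, disk, and so on) utilization and network throughput;
K8s master metrics for meta-clusters and user clusters, such as metrics from kube-apiserver, kube-controller-manager, and kube-scheduler;
K8s cluster state collected by K8s components (kube-state-metrics, cAdvisor);
etcd metrics, such as disk write latency, DB size, and throughput between peers.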

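Once the global data has been aggregated, Alertmanager connects to the central Prometheus and drives the various alert notification channels, such as DingTalk, email, and SMS.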

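Monitoring and alerting architecture
To spread the monitoring load sensibly across multiple levels of Prometheus and still achieve global aggregation, we use Prometheus federation. In a federated setup, each data center runs its own Prometheus to collect that data center's monitoring data, and a central Prometheus aggregates the data from multiple data centers. Based on federation, we designed the global monitoring architecture shown in Figure 2-1, which consists of three parts: the monitoring system, the alerting system, and the display system.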

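Viewed from meta-cluster monitoring up to central monitoring, the monitoring system forms a tree with three layers:

Edge Prometheus

To monitor the metrics of meta-cluster K8s and user-cluster K8s effectively while avoiding complex network configuration, Prometheus is pushed down into each meta-cluster.

Cascade Prometheus

The cascade Prometheus aggregates monitoring data from multiple regions. One exists in each large area, such as China, Europe and the Americas, and Asia. Each large area contains several specific regions, such as Beijing, Shanghai, and Tokyo. As the number of clusters in a large area grows, the area can be split into several new large areas, each of which always keeps one cascade Prometheus. This strategy lets the architecture scale and evolve flexibly.

Central Prometheus

The central Prometheus connects to all cascade Prometheus instances and provides the final data aggregation, the global view, and alerting. For reliability, it uses an active-active architecture: two central Prometheus nodes are deployed in different availability zones, and both connect to the same lower-level Prometheus instances.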

Figure 2-1 Global multi-level monitoring architecture based on Prometheus Federation
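The three-layer tree can be sketched by generating the federation scrape jobs for the cascade and central layers. The keys `honor_labels`, `metrics_path`, `params`, and `static_configs` follow the Prometheus federation scrape configuration; the topology, host names, and job names are hypothetical and do not come from the article.

```python
def federation_job(job_name: str, lower_targets: list[str],
                   selectors: list[str]) -> dict:
    """Scrape job that pulls selected series from lower-level Prometheus servers."""
    return {
        "job_name": job_name,
        "honor_labels": True,          # keep labels set by the lower level
        "metrics_path": "/federate",   # Prometheus federation endpoint
        "params": {"match[]": list(selectors)},
        "static_configs": [{"targets": list(lower_targets)}],
    }


# Hypothetical topology: large area -> its cascade server and edge servers.
TOPOLOGY = {
    "china": {
        "cascade": "cascade-china.example:9090",
        "edges": ["edge-hangzhou-1.example:9090", "edge-beijing-1.example:9090"],
    },
    "asia": {
        "cascade": "cascade-asia.example:9090",
        "edges": ["edge-tokyo-1.example:9090"],
    },
}


def build_federation(topology: dict, selectors: list[str]) -> tuple[dict, dict]:
    """Return (per-area cascade jobs, central job) for a three-layer tree."""
    cascade_jobs = {
        area: federation_job(f"federate-{area}", spec["edges"], selectors)
        for area, spec in topology.items()
    }
    central_job = federation_job(
        "federate-central",
        [spec["cascade"] for spec in topology.values()],
        selectors,
    )
    return cascade_jobs, central_job
```

Splitting a large area, as described above, then amounts to adding a new entry to the topology; the central job picks up the new cascade target and nothing else changes. For the active-active central layer, the same central job would be loaded by both central nodes.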

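Optimization strategies

Separating monitoring traffic from API server traffic

The API server's proxy feature lets clients outside a K8s cluster reach Pods, Nodes, or Services inside the cluster through the API server.

Figure 3-1 Accessing Pod resources inside a K8s cluster through API server proxy mode

A common way to expose Prometheus metrics from inside a K8s cluster to the outside is the API server proxy. The advantage is that it reuses the API server's port 6443 to expose the data, which is simple to manage. The drawback is equally clear: it adds load to the API server. With API server proxy mode, as customer clusters and nodes keep growing with sales, the pressure on the API server keeps rising and introduces potential risk. To address this, we added a LoadBalancer-type Service for the edge Prometheus so that monitoring traffic goes entirely through the LoadBalancer, which separates the two kinds of traffic. Even as the number of monitored objects keeps growing, the API server incurs no additional proxy overhead.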
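To make the two access paths concrete, the sketch below builds the URL used in API server proxy mode and a Service manifest of type LoadBalancer for the alternative. The URL layout and the Service fields follow the Kubernetes API; the service name, namespace, selector, and port are hypothetical.

```python
def apiserver_proxy_url(apiserver: str, namespace: str, service: str,
                        port: int, path: str = "/federate") -> str:
    """URL that reaches a Service through the API server proxy.

    Every scrape on this path is handled by kube-apiserver.
    """
    return (f"{apiserver}/api/v1/namespaces/{namespace}"
            f"/services/{service}:{port}/proxy{path}")


def loadbalancer_service(name: str, namespace: str,
                         selector: dict[str, str], port: int = 9090) -> dict:
    """Service manifest that exposes Prometheus directly via a load balancer.

    Scrapes through this Service bypass kube-apiserver entirely.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "LoadBalancer",
            "selector": dict(selector),
            "ports": [{"name": "web", "port": port,
                       "targetPort": port, "protocol": "TCP"}],
        },
    }
```

The manifest returned by `loadbalancer_service` can be serialized to YAML or JSON and applied to the meta-cluster; the upper-level Prometheus then scrapes the load balancer address instead of the proxy URL.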

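Collecting only specified metrics

The central Prometheus collects only the metrics that are actually needed. It must never scrape everything; otherwise, the network transfer load becomes too heavy and data is lost.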
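With federation, restricting collection comes down to the `match[]` selectors sent to the `/federate` endpoint. The sketch below builds such a request URL and refuses to build one without selectors. The selector list is illustrative only; the article does not say which metrics it keeps.

```python
from urllib.parse import urlencode

# Illustrative selectors only; not the article's actual allow-list.
SELECTORS = [
    '{__name__=~"apiserver_request_.*"}',
    '{__name__="etcd_mvcc_db_total_size_in_bytes"}',
]


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a /federate URL that requests only the selected series."""
    if not selectors:
        # Guard against an accidental "scrape everything" configuration.
        raise ValueError("at least one match[] selector is required")
    # doseq=True emits one match[] parameter per selector.
    query = urlencode({"match[]": selectors}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"
```

Keeping the allow-list in one place makes it easy to review what crosses the wide-area network before a new metric is added.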

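Label management

Labels are used on the cascade Prometheus to mark the region and the meta-cluster, so that data aggregated at the central Prometheus can be traced down to the meta-cluster level.
At the same time, unnecessary labels are kept to a minimum to save data volume.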
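One way to express both ideas in Prometheus terms is `external_labels` for the region and meta-cluster markers, and a `labeldrop` relabel rule for labels that are not needed upstream. The sketch below generates both and simulates the drop rule on a label set. The label names `region` and `meta_cluster` and the dropped label are assumptions, not names taken from the article.

```python
import re


def external_labels_section(region: str, meta_cluster: str) -> dict:
    """`global` section that stamps every federated series with its origin."""
    return {"global": {"external_labels": {"region": region,
                                           "meta_cluster": meta_cluster}}}


def labeldrop_rule(label_names: list[str]) -> dict:
    """Relabel rule (for a job's metric_relabel_configs) that drops labels."""
    return {"action": "labeldrop", "regex": "|".join(sorted(label_names))}


def apply_labeldrop(labels: dict[str, str], rule: dict) -> dict[str, str]:
    """Simulate the rule: remove labels whose name fully matches the regex."""
    pattern = re.compile(rule["regex"])
    return {name: value for name, value in labels.items()
            if not pattern.fullmatch(name)}
```

The simulation uses a full match because Prometheus anchors relabel regular expressions at both ends.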

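3. Support a flexible architecture that is pluggable, customizable, and evolves continuously like building blocks

The first two parts briefly described some of our thinking on managing massive numbers of k8s clusters, but globalized, unitized management alone is far from enough. The success of k8s rests on factors such as its declarative definitions, a highly active community, and sound architectural abstractions; k8s has become the Linux of the cloud-native era. We must account for the continuous iteration of k8s versions and fixes for CVE vulnerabilities, and for continuous updates to k8s-related components, whether CSI, CNI, Device Plugin, Scheduler Plugin, or others. To this end, we provide complete capabilities for continuous upgrade, gray release, and pausing of clusters and components.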

In June 2019, Alibaba open-sourced OpenKruise, its internal cloud-native application automation engine. Here we highlight its BroadcastJob feature, which is well suited to upgrading components on every worker machine or to running checks on every node. (A BroadcastJob runs one pod on every node in the cluster until the pod finishes. It is similar to the community DaemonSet, except that a DaemonSet keeps a long-running pod on every node, whereas the pods of a BroadcastJob eventually terminate.)
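A BroadcastJob is declared like any other Kubernetes object. The sketch below builds a minimal manifest as a Python dictionary. The `apps.kruise.io/v1alpha1` API version and the `completionPolicy` fields reflect the OpenKruise documentation as I understand it and should be checked against the installed version; the job name, image, and command are hypothetical.

```python
def broadcast_job(name: str, image: str, command: list[str],
                  ttl_seconds: int = 30) -> dict:
    """Minimal BroadcastJob manifest: run one pod to completion on every node."""
    return {
        "apiVersion": "apps.kruise.io/v1alpha1",
        "kind": "BroadcastJob",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image,
                                    "command": list(command)}],
                    # Pods must terminate, so they are never restarted.
                    "restartPolicy": "Never",
                }
            },
            # Mark the job complete once every pod has finished, then
            # clean it up after the TTL expires.
            "completionPolicy": {"type": "Always",
                                 "ttlSecondsAfterFinished": ttl_seconds},
        },
    }
```

For a node check, the command would be a script that inspects the host and exits; for a component upgrade, it would install the new version and exit.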

In addition, to cover different k8s usage scenarios, we offer a variety of k8s cluster profiles that make it easier for users to choose the right cluster. Drawing on our experience with a large number of clusters, we will continue to provide more and better cluster templates.

Summary

As cloud computing develops, cloud-native technology built on Kubernetes continues to drive the digital transformation of industry. Container Service ACK provides a secure, stable, high-performance managed Kubernetes service and has become the best vehicle for running Kubernetes on the cloud. During this Double 11, ACK contributed in many scenarios: it supported the containerized cloud migration of Alibaba's internal core systems; it supported cloud products such as Alibaba Cloud Microservices Engine (MSE), Video Cloud, and CDN; and it supported Double 11 ecosystem companies and ISVs, including the Jushita e-commerce cloud, the Cainiao logistics cloud, and payment systems in Southeast Asia. ACK will keep moving forward and continue to provide better cloud-native container networking, storage, scheduling, and elasticity, along with end-to-end security across the full chain and capabilities such as serverless and service mesh. Interested developers can visit the Alibaba Cloud console at https://cn.aliyun.com/product/kubernetes and create a Kubernetes cluster to try it out. Partners in the container ecosystem are also welcome to join the Alibaba Cloud container application marketplace and build the cloud-native era together with us.