KubeNode: Alibaba's Cloud-Native Container Infrastructure Operations Practice

Challenges of node operations at Alibaba

At Alibaba, the challenges of node operations come mainly from three aspects: scale, complexity, and stability.

The first challenge is scale. Since the first cluster was set up in 2018, hundreds of ASI clusters and hundreds of thousands of nodes have come online, with a single cluster exceeding 10,000 nodes at most. On top of these run tens of thousands of different Alibaba Group applications, including well-known ones such as Taobao and Tmall, with container instances totaling in the millions. ASI stands for Alibaba Serverless Infrastructure, and the concept covers both the cluster control plane and the nodes. Each ASI cluster is a managed ACK cluster created through Alibaba Cloud's standard OpenAPI; on top of it, Alibaba has developed its own scheduler, built and deployed many addons, enhanced functionality, optimized performance, and integrated the Group's internal systems. Nodes are fully hosted, so application developers and operators do not need to care about the underlying container infrastructure.

The second challenge is a highly complex environment. The IaaS layer runs many heterogeneous machine types, including x86 servers and domestically produced ARM models, plus GPU and FPGA machines serving new computing and AI workloads. Many kernel versions run online at once: 4.19 began rolling out at scale last year, while node problems on the 3.10/4.9 kernels still need ongoing support, and evolving across kernel versions requires large-scale rolling operations capability. The tens of thousands of online applications span services such as Taobao, Tmall, Cainiao, Gaode, Ele.me, Koala, and Hema. Alongside them, secure-container services and workloads such as big data, offline computing, real-time computing, and search run colocated on the same hosts as the online businesses.

Finally, stability requirements are high. Online businesses are very sensitive to latency and jitter; jitter, hangs, downtime, or other failures on a single node can affect a user's order or payment on Taobao and trigger complaints, so the overall stability bar is very high, demanding timely and effective handling of single-node failures.

KubeNode: Introduction to the cloud-native node operations base

KubeNode is a foundational project developed by Alibaba to manage and operate nodes in a cloud-native way. Compared with traditional procedural operations methods, KubeNode extends Kubernetes with CRDs and a corresponding set of Operators, providing full lifecycle management of nodes and node components. Through a declarative, final-state-oriented approach, managing nodes and node components becomes as simple as managing an application in Kubernetes, achieving a high degree of node consistency and self-healing capability.
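KubeNode itself has not been open-sourced, so its exact CRD schema is internal; purely as a sketch of this declarative style, a node under such a system might be described by an object along these lines (the apiVersion, kind, and all field names below are hypothetical illustrations, not the real KubeNode API):

```yaml
# Hypothetical sketch of a KubeNode-style node object; field names are
# illustrative, not the actual (internal) KubeNode CRD schema.
apiVersion: kubenode.example.com/v1alpha1
kind: Machine
metadata:
  name: node-10-0-1-23
spec:
  provider: alibabacloud          # which Infra Provider manages this node
  instanceID: i-xxxxxxxxxxxx      # IaaS instance backing the node
  labels:                         # synced to the Node object once ready
    workload-type: online
  taints: []
status:
  phase: Ready                    # e.g. Importing -> Initializing -> Ready
  conditions:
    - type: ComponentsHealthy
      status: "True"
```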

On the right side of the figure above is a simplified architecture of KubeNode, which consists of the following parts:

At the center are a Machine Operator, responsible for managing nodes and node components, and a Remedy Operator, responsible for node fault self-healing and repair. On the node side runs the Kube Node Agent, a per-node agent that watches the CRD object instances produced by the Machine Operator and Remedy Operator in the control plane and performs the corresponding operations, such as installing node components and executing fault self-healing tasks.

Alongside KubeNode, Alibaba also uses NPD (node-problem-detector) for on-node fault detection, and connects to Kube Defender (an Alibaba self-developed component) for unified risk control. The fault detection items provided by the community version of NPD are relatively limited; Alibaba has extended it based on years of node and container operations practice, adding many node fault detection items that greatly enrich single-node fault detection capability.

1. Relationship between KubeNode and community projects

  • http://github.com/kube-node : not related; that project was discontinued in early 2018.
  • ClusterAPI: KubeNode can serve as a complement to ClusterAPI for node final-state management.

Function comparison:

To explain the relationship between Alibaba's self-developed KubeNode and community projects: the name kube-node may look familiar, since there is a project of the same name on GitHub at http://github.com/kube-node , but that project was discontinued in early 2018. The two share only the name and are otherwise unrelated.

In addition, the community's ClusterAPI is a project for creating and managing Kubernetes clusters and nodes. Here is how the two projects compare:

  • Cluster creation: ClusterAPI is responsible for creating clusters; KubeNode does not provide this function.
  • Node creation: both ClusterAPI and KubeNode can create nodes.
  • Node component management and final-state maintenance: ClusterAPI does not provide corresponding functions; KubeNode can manage node components and maintain their final state.
  • Node failure self-healing: ClusterAPI mainly provides self-healing by rebuilding nodes based on node health; KubeNode provides richer node-component self-healing that can repair various hardware and software failures on a node.

In general, KubeNode can work together with ClusterAPI and is a good complement to it.

The node components mentioned here are the software running on the node, such as kubelet and Docker. Alibaba internally uses Pouch as its container runtime. Besides kubelet and Pouch, which are required for scheduling, there are a dozen or more components for distributed container storage, monitoring collection, secure containers, and fault detection.

Installing and upgrading kubelet and Docker is usually done through one-off, process-oriented actions, for example with Ansible. Over long-running operation, it is common for software versions to be accidentally modified or for components to stop working because of bugs. Meanwhile, these components iterate very quickly inside Alibaba, often requiring a fleet-wide version rollout within a week or two. To meet the needs of fast component iteration, safe upgrades, and version consistency, Alibaba developed KubeNode, which describes nodes and node components through Kubernetes CRDs and manages them toward a declared final state, ensuring version consistency, configuration consistency, and correct running state.

2. KubeNode - Machine Operator

The figure above shows the architecture of the Machine Operator, a standard Operator design: a set of extended CRDs plus central controllers.
The CRD definitions include Machine and MachineSet, for nodes, and MachineComponent and MachineComponentSet, for node components.

The central controllers include the Machine Controller, MachineSet Controller, and MachineComponentSet Controller, which respectively control the creation and import of nodes and the installation and upgrade of node components.
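As a hedged illustration of how a component's desired state might be declared across a set of machines (again with hypothetical names, since the real schema is internal):

```yaml
# Hypothetical sketch: desired version of one node component across a set
# of machines. Names and fields are illustrative, not KubeNode's real API.
apiVersion: kubenode.example.com/v1alpha1
kind: MachineComponentSet
metadata:
  name: kubelet
spec:
  selector:
    matchLabels:
      workload-type: online        # which Machines this applies to
  component:
    name: kubelet
    version: v1.20.4-alibaba.1     # desired version; agents converge to it
    configChecksum: "sha256:..."   # desired config, also kept consistent
```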

The Infra Provider is extensible and can connect to different cloud vendors. Currently only Alibaba Cloud is connected, but AWS, Azure, and other vendors could be supported by implementing the corresponding Provider.

The Kube Node Agent on each machine watches these CRD resources; when it finds a new object instance, it installs or upgrades the node components, periodically checks whether each component is running normally, and reports the components' running status.

1) Use Case: node import

Here is the import process for an existing node with KubeNode.

First, the user submits an import operation for an existing node on our multi-cluster management system. The system then proceeds as follows:

  1. Issue a certificate and install the Kube Node Agent on the node.
  2. Once the agent is running normally, submit the Machine CRD object.
  3. The Machine Controller moves the Machine's status into the importing phase and, after the Node is ready, syncs the labels/taints from the Machine to the Node.
  4. The MachineComponentSet Controller determines which node components should be installed based on the Machine's information and syncs them to the Machine.
  5. The Kube Node Agent watches the Machine and MachineComponent objects and installs the node components.

After all components are running normally, the node import is complete. The whole process is similar to a user submitting a Deployment and finally getting a running business Pod.

Final-state consistency of node components mainly covers the correctness and consistency of the software version, software configuration, and running state.
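A per-component status report from the agent might conceptually look like the following sketch, covering the three aspects above (all fields hypothetical):

```yaml
# Hypothetical sketch of per-component state as the agent might report it;
# the real KubeNode fields are internal and may differ.
apiVersion: kubenode.example.com/v1alpha1
kind: Machine
metadata:
  name: node-10-0-1-23
status:
  components:
    - name: kubelet
      desiredVersion: v1.20.4-alibaba.1
      currentVersion: v1.20.4-alibaba.1   # version consistency
      configInSync: true                  # configuration consistency
      healthy: true                       # running-state correctness
```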

2) Use Case: component upgrade

This section introduces the component upgrade process, which relies mainly on the batch upgrade capability provided by the MachineComponentSet Controller.

First, the user submits a component upgrade operation on the multi-cluster management system, which then enters a batch-by-batch upgrade loop: the system updates the MachineComponentSet with the number of machines to upgrade in the current batch, and the MachineComponentSet Controller updates the component's version information on the corresponding number of nodes. The Kube Node Agent then watches the component change, installs the new version, checks that it is healthy, and reports the component status as normal. After every component in the batch has upgraded successfully, the next batch of upgrades can begin.
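Conceptually, one upgrade batch might look like the sketch below, where an operations system raises the batch size each round; all field names are hypothetical:

```yaml
# Hypothetical sketch of one upgrade batch: the ops system bumps
# `batchSize` each round; the controller then rewrites the component
# version on that many more Machines, and agents converge node by node.
apiVersion: kubenode.example.com/v1alpha1
kind: MachineComponentSet
metadata:
  name: kubelet
spec:
  component:
    name: kubelet
    version: v1.20.5-alibaba.1   # new target version
  rollout:
    batchSize: 50                # machines allowed to upgrade this round
status:
  updated: 50                    # agents reported the new version healthy
  total: 10000
```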

The single-cluster, single-component upgrade process described above is relatively simple, but leveling the versions of more than ten components across hundreds of clusters is not. For that we use ASIOps, our unified cluster operations platform. In ASIOps, hundreds of clusters are assigned to a limited number of release pipelines, each ordered: test -> pre-release -> production. A normal release selects a pipeline and releases cluster by cluster in its preset order; within each cluster, the release proceeds automatically in batches of 1/5/10/50/100/... nodes. After each batch completes, a health inspection is triggered: if there is a problem, the automated release pauses; if not, the next batch starts when the observation period ends. In this way, a new component version is released safely and efficiently.
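ASIOps is an internal platform, so the following is only an illustrative sketch of what such a pipeline definition could look like:

```yaml
# Hypothetical sketch of a release pipeline in the spirit of ASIOps;
# the format and field names are illustrative only.
pipeline: kubelet-release
clusters:                        # preset cluster order
  - stage: test
    names: [asi-test-1]
  - stage: pre-release
    names: [asi-pre-1]
  - stage: production
    names: [asi-prod-1, asi-prod-2]
batches: [1, 5, 10, 50, 100]     # per-cluster node batch sizes
postBatch:
  healthCheck: true              # a failure pauses the release
  observationPeriod: 30m         # auto-continue when it elapses cleanly
```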

3. KubeNode - Remedy Operator

Next is the Remedy Operator in KubeNode, also a standard Operator, used for fault self-healing.

The Remedy Operator likewise consists of a set of CRDs and corresponding controllers. The CRD definitions include NodeRemedier and RemedyOperationJob; the controllers include the Remedy Controller and RemedyJob Controller; and there is also a registry of fault self-healing rules. On the node side sit NPD and the Kube Node Agent.

Host Doctor is an independent fault diagnosis system on the central side. It integrates with cloud vendors to obtain active operation and maintenance events and converts them into fault conditions on the node. On Alibaba Cloud's public cloud, hardware failures or planned O&M operations on the physical machine hosting an ECS instance are exposed through standard OpenAPI; after integration, node problems can be detected ahead of time and workloads automatically migrated off the node in advance, avoiding failures.
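As an illustration, such a vendor event might surface as a condition on the Node object (conditions on Node status are standard Kubernetes; the condition type and reason here are hypothetical):

```yaml
# Illustrative only: an active O&M event from the cloud vendor, surfaced
# by a Host Doctor-style system as a condition on the Node.
apiVersion: v1
kind: Node
metadata:
  name: node-10-0-1-23
status:
  conditions:
    - type: UnderlyingHostMaintenance      # hypothetical condition type
      status: "True"
      reason: ScheduledRedeploy
      message: "ECS system event: host hardware failure, redeploy planned"
```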

Use Case: hung-node self-healing

Here, a typical self-healing process is introduced using the case of a hung node.

First, we configure self-healing rules, described by CRDs, on the multi-cluster management system ASI Captain. These rules can be added dynamically and flexibly, and a corresponding repair operation can be configured for each Node condition.
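A self-healing rule conceptually maps a Node condition to a repair action. The sketch below uses hypothetical fields; only the KernelDeadlock condition type comes from NPD's default kernel monitor:

```yaml
# Hypothetical sketch of a self-healing rule: map a Node condition to a
# remedy action. NodeRemedier is internal; fields are illustrative.
apiVersion: kubenode.example.com/v1alpha1
kind: NodeRemedier
metadata:
  name: kernel-hang-restart
spec:
  condition:
    type: KernelDeadlock        # condition reported by NPD
    status: "True"
  remedy:
    action: RestartNode         # repair operation to run via the agent
    evictPodsFirst: true
  riskControl:
    defender: kube-defender     # every run must pass risk control first
```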

Next, NPD on the node periodically checks for various types of failures. When it finds an abnormal kernel log such as "task xxx blocked for more than 120 seconds", it concludes the node is hung and reports the fault as a condition on the Node. When the Remedy Controller sees the change through its watch, it triggers the self-healing flow: it first calls the Kube Defender risk control center to check whether the current self-healing operation is allowed to run, and once approved, it generates a RemedyOperationJob self-healing task, which the Kube Node Agent watches and then executes.
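The generated task might conceptually look like this sketch (the real RemedyOperationJob schema is internal, so names and fields are hypothetical):

```yaml
# Hypothetical sketch of the generated self-healing task; the agent
# watches objects like this and reports progress back into status.
apiVersion: kubenode.example.com/v1alpha1
kind: RemedyOperationJob
metadata:
  name: restart-node-10-0-1-23
spec:
  nodeName: node-10-0-1-23
  remedier: kernel-hang-restart   # rule that triggered this job
  action: RestartNode
status:
  phase: Succeeded                # agent reports progress/result here
```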

As can be seen, the entire self-healing process does not depend on external third-party systems: fault detection is done by NPD and repair by the Remedy Operator, all in a cloud-native way, so a fault can be found and fixed within minutes at best. Through enhanced NPD detection rules, the coverage extends to full-link repair, from hardware faults and OS kernel faults to component faults. It is worth emphasizing that every self-healing operation goes through the unified Kube Defender risk control center, which applies flow control at the minute, hour, and day level. This prevents a Region/Zone-level network outage, a large-scale IO hang, or some other large-scale software bug from triggering self-healing on every node in a Region and causing a more serious secondary failure.
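Kube Defender is internal, so the following sketch only expresses the idea of layered rate limits on a self-healing action:

```yaml
# Illustrative sketch of minute/hour/day flow control on one remedy
# action; the real Kube Defender policy format is not public.
defenderPolicy:
  action: RestartNode
  limits:
    - window: 1m
      maxOperations: 2
    - window: 1h
      maxOperations: 20
    - window: 24h
      maxOperations: 100
  onExceeded: reject    # stop self-healing to avoid secondary failures
```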

KubeNode data system

Building out the KubeNode data system plays a very important role in measuring and improving the overall SLO.

On the node side, NPD detects faults and reports them to the event center. Meanwhile, walle, the per-node metrics collection component, gathers metric data for nodes and containers, including common metrics such as CPU/memory/IO/network as well as many others covering the kernel, secure containers, and more. On the central side, Prometheus (the ARMS product on Alibaba Cloud's public cloud) collects and stores the metrics of all nodes, and also scrapes the extended Kube State Metrics data to obtain the key metrics of the Machine Operator and Remedy Operator. On top of this data, user-facing capabilities such as monitoring dashboards, fault alerting, and full-link diagnosis are configured.
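For example, such data could drive a standard Prometheus alerting rule like the sketch below; the metric name kubenode_machine_component_in_sync is hypothetical:

```yaml
# A sketch of alerting on final-state drift; standard Prometheus rule
# format, but the metric name is a hypothetical stand-in.
groups:
  - name: kubenode
    rules:
      - alert: NodeComponentOutOfSync
        expr: |
          min by (node, component) (kubenode_machine_component_in_sync) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Node component drifted from its desired state"
```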

With the data system in place, we can analyze resource utilization statistics, provide real-time monitoring and alerting, run fault analysis and statistics, and also measure KubeNode's overall coverage of nodes and node components, the consistency rate, and the efficiency of node self-healing, while offering full-link diagnosis for nodes. When troubleshooting a node, one can view all events in the node's history, helping users quickly locate the cause.

Future outlook

At present, KubeNode covers all of Alibaba Group's ASI clusters. Next, along with Alibaba Group's "unified resource pool" project, KubeNode will be extended to a larger scope and more scenarios, so that the cloud-native container infrastructure operations architecture can deliver even greater value.

About the Author

Zhou Tao, an Alibaba Cloud technical expert, joined Alibaba in 2017. For the past few years he has been responsible for developing the management and control system for Alibaba's hundreds of thousands of cluster nodes, and has taken part in the Double Eleven promotion each year. With the Group's move-to-cloud project that started at the end of 2018, the managed nodes expanded from off-cloud physical machines to Shenlong bare-metal servers on Alibaba Cloud's public cloud, supporting a Double 11 in which the core transaction systems completed a comprehensive cloud-native transformation.

This article is the original content of Alibaba Cloud and may not be reproduced without permission.
