[Reposted] Tencent's self-developed business goes to the cloud: a technical solution to optimize Kubernetes cluster load


https://my.oschina.net/jxcdwangtao/blog/3105341

 

Author: [email protected]

Abstract: Kubernetes resource scheduling is static: the scheduler compares a Pod's requested resources with a Node's allocatable resources to decide whether the Node has room for the Pod. The problem with static scheduling is that cluster resources get allocated to business containers quickly while the cluster's overall load stays very low, and load is also distributed unevenly across nodes. This article introduces several technical solutions for improving Kubernetes cluster load.

Why Kubernetes uses static scheduling

Static scheduling means bin-packing Pods onto nodes based on the resources the containers request, without considering the nodes' actual load. Its biggest advantage is that scheduling is simple and efficient and cluster resource management is convenient. Its biggest disadvantage is just as obvious: because it ignores the actual load of nodes, it easily leaves the cluster running at low load.

Why does Kubernetes use static scheduling? Because general-purpose dynamic scheduling is nearly impossible to get right: it is very hard for a generic dynamic scheduler to satisfy the demands of different enterprises and different businesses, and the result can easily be counterproductive. Does that mean we should not attempt dynamic scheduling at all? Not necessarily. A platform can extend the Kubernetes scheduler through a scheduler extender and, based on the attributes of the businesses it hosts, give actual node load a certain weight in scheduling decisions.
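
As a rough illustration of that extension point, here is a minimal sketch of a scheduler-extender "prioritize" endpoint that favors nodes with lower actual load. The request/response structs are simplified versions of the extender payload, and getNodeCPUUsage is a hypothetical helper standing in for a query to the platform's container monitoring system; none of this is the article's actual implementation.

```go
// Minimal sketch of a scheduler-extender "prioritize" endpoint that scores
// nodes by their (assumed) real-time CPU load.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// prioritizeArgs mirrors the part of the extender request we need here.
type prioritizeArgs struct {
	NodeNames *[]string `json:"nodenames,omitempty"`
}

// hostPriority mirrors one entry of the extender's priority response.
type hostPriority struct {
	Host  string `json:"host"`
	Score int64  `json:"score"`
}

// getNodeCPUUsage would query recent CPU utilization (0.0-1.0) for the node
// from the monitoring system; a constant stands in here.
func getNodeCPUUsage(nodeName string) float64 {
	return 0.5
}

func prioritize(w http.ResponseWriter, r *http.Request) {
	var args prioritizeArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := []hostPriority{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			// Lower actual load => higher score (extender scores range 0-10).
			score := int64((1 - getNodeCPUUsage(name)) * 10)
			result = append(result, hostPriority{Host: name, Score: score})
		}
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/prioritize", prioritize)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

Such an extender is registered in the scheduler's extender configuration with a weight, so the load-based score only nudges, rather than overrides, the default static scoring.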

Cluster resource composition

Taking CPU resources as an example, the resources of a large-scale Kubernetes cluster are roughly composed of the following parts:

  • The reserved resources on each node, corresponding to the sum of the kubelet's system-reserved, kube-reserved, and eviction-hard settings. When Kubernetes calculates a Node's Allocatable resources, this reserved portion is subtracted from Capacity (a worked example follows this list).
  • Resource fragments. At present our clusters average about 5% to 10% fragmentation, varying slightly with the CVM instance type. These fragments are scattered across the nodes, mostly in sizes such as 1c1g, 2c2g, and 3cxg, and the container specifications offered by the platform can rarely make use of them. A common situation is that, while a Pod is being scheduled, a node turns out to have enough CPU but not enough memory, or vice versa.
  • The remainder, which is what business Pods can actually use. Businesses choose container specifications with a degree of subjectivity and blindness, so business containers often run at low load; when a large share of the business behaves this way, cluster load ends up low, yet under Kubernetes' static scheduling policy the cluster can no longer accommodate more business containers. As shown in the figure above, the cluster's allocated-CPU watermark is very high while actual CPU utilization is not.
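
As a purely illustrative example of the reservation described in the first bullet, the arithmetic below derives a node's allocatable memory from its capacity and some assumed kubelet reservation settings; the numbers are not recommendations.

```go
// Worked example of Allocatable = Capacity - kube-reserved - system-reserved
// - eviction-hard, using illustrative memory values in MiB.
package main

import "fmt"

func main() {
	capacityMiB := int64(64 * 1024)      // node memory capacity: 64 GiB
	kubeReservedMiB := int64(2 * 1024)   // e.g. kubelet --kube-reserved=memory=2Gi
	systemReservedMiB := int64(1 * 1024) // e.g. kubelet --system-reserved=memory=1Gi
	evictionHardMiB := int64(100)        // e.g. kubelet --eviction-hard=memory.available<100Mi

	allocatableMiB := capacityMiB - kubeReservedMiB - systemReservedMiB - evictionHardMiB
	fmt.Printf("allocatable memory: %d MiB\n", allocatableMiB) // 62364 MiB
}
```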

Solutions for increasing cluster load

Besides using rich container monitoring data to make dynamic scheduling decisions with appropriate weights, are there other ways to address the low cluster load caused by static scheduling? Below I present a set of technical solutions that try to raise Kubernetes cluster load along several dimensions.

Compressing Pod request resources

As mentioned earlier, developers choose container resource specifications with a certain blindness when deploying services, and native Kubernetes does not support resizing a running container in real time and without disruption (although a static VPA approach can work around this), so business containers often run at low load. To address this, we can compress a Pod's Request resources by a certain ratio (the Pod's Limit resources are not compressed). Note that compressing Pod Request resources happens only when the Pod is created or rebuilt, for example when the business publishes a change; it must not be applied to Pods that are running normally, otherwise the corresponding Workload controller may rebuild the Pods (depending on the Workload's UpdateStrategy) and affect the business.

Points to note:

  • Each Workload has its own load fluctuation pattern, so the compression ratio for its Pods' allocated resources also differs; per-Workload configuration must be supported, and it must be transparent to users. We store the compression ratio in a Workload annotation; for example, CPU compression corresponds to the annotation stke.platform/cpu-requests-ratio.

  • Who sets the compression ratio? A self-developed component (Pod-Resource-Compress-Ratio-Reconciler) adjusts it dynamically and periodically based on the Workload's historical monitoring data. For example, if a Workload's load stays very low for 7 days or a month, its compression ratio can be raised so that the cluster can allocate more resources and host more business containers. In practice the adjustment strategy is not this simple and needs more monitoring data to support it.

  • The Pod request compression feature must be possible to turn off and to roll back. It can be disabled per Workload through the annotation stke.platform/enable-resource-compress: "n", and compression can be undone by setting the compression ratio back to 1.

  • When is the Request resource in the Pod Spec adjusted by the compression ratio? At this stage of Kubernetes' development, modifying Kubernetes source code directly is the clumsiest approach; we should make full use of Kubernetes' extension mechanisms. Here we intercept the Pod Create event at kube-apiserver through a Mutating Admission Webhook: a self-developed webhook (pod-resource-compress-webhook) checks whether compression is enabled in the Pod's annotations and whether a compression ratio is configured; if so, it recalculates the Pod's request resources according to the ratio and patches them back to the APIServer (a sketch follows below).
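
Below is a minimal sketch of the compression step such a webhook could perform. It assumes the ratio acts as a divisor (1 = no compression, 2 = requests halved), which is consistent with "restore by setting the ratio back to 1"; the AdmissionReview plumbing is omitted and only the CPU request is handled.

```go
// Sketch of the Pod CPU-request compression a mutating webhook could apply on
// Pod Create. Annotation names follow the article; everything else is illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// patchOp is one JSON-Patch operation returned to the API server.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// compressCPURequests divides each container's CPU request by the ratio in
// stke.platform/cpu-requests-ratio, unless compression is disabled.
func compressCPURequests(pod *corev1.Pod) ([]patchOp, error) {
	if pod.Annotations["stke.platform/enable-resource-compress"] == "n" {
		return nil, nil
	}
	ratioStr, ok := pod.Annotations["stke.platform/cpu-requests-ratio"]
	if !ok {
		return nil, nil
	}
	ratio, err := strconv.ParseFloat(ratioStr, 64)
	if err != nil || ratio < 1 {
		return nil, fmt.Errorf("invalid compression ratio %q", ratioStr)
	}
	var patches []patchOp
	for i, c := range pod.Spec.Containers {
		req, ok := c.Resources.Requests[corev1.ResourceCPU]
		if !ok {
			continue
		}
		compressed := resource.NewMilliQuantity(int64(float64(req.MilliValue())/ratio), resource.DecimalSI)
		patches = append(patches, patchOp{
			Op:    "replace",
			Path:  fmt.Sprintf("/spec/containers/%d/resources/requests/cpu", i),
			Value: compressed.String(),
		})
	}
	return patches, nil
}

func main() {
	// Illustrative Pod: 2-core request, compression ratio 2 => patched to 1 core.
	pod := &corev1.Pod{}
	pod.Annotations = map[string]string{"stke.platform/cpu-requests-ratio": "2"}
	pod.Spec.Containers = []corev1.Container{{
		Name: "app",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("2")},
		},
	}}
	patches, _ := compressCPURequests(pod)
	out, _ := json.Marshal(patches)
	fmt.Println(string(out))
}
```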

Node resource overselling

The Pod resource compression scheme adjusts resources dynamically at the Workload level. Its advantage is that it can target each Workload individually; its disadvantage is that it has no effect until the business makes a change and its Pods are rebuilt.

The node resource overselling scheme adjusts resources dynamically at the Node level, overselling each node's resources by a different ratio according to its actual historical load.

  • We store each node's resource oversell ratio in a Node annotation; for example, CPU overselling corresponds to the annotation stke.platform/cpu-oversale-ratio.

  • Who sets each node's oversell ratio? A self-developed component (Node-Resource-Oversale-Ratio-Reconciler) adjusts it dynamically and periodically based on the node's historical monitoring data. For example, if a node's load stays low for 7 days or a month while its allocated-resource level is high, the oversell ratio can be raised appropriately so that the node can host more business Pods.

  • The node overselling feature must be possible to turn off and to roll back. It is disabled through the Node annotation stke.platform/mutate: "false", and the node's resources are restored at its next heartbeat.

  • When are Allocatable & Capacity in the Node Status adjusted by the oversell ratio? Similarly, we intercept the Node Create and Status Update events at kube-apiserver through a Mutating Admission Webhook: a self-developed webhook (node-resource-oversale-webhook) checks whether overselling is enabled in the Node's annotations and whether an oversell ratio is configured; if so, it recalculates the Node's Allocatable & Capacity resources according to the ratio and patches them back to the APIServer (a sketch follows below).
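
Here is a similarly minimal sketch of the recalculation such a node webhook could perform; annotation names follow the article, the AdmissionReview plumbing is omitted, and only the CPU quantity is shown.

```go
// Sketch of the CPU overselling a mutating webhook could apply when it
// intercepts Node create / status-update requests.
package main

import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// oversoldCPU scales a CPU quantity by stke.platform/cpu-oversale-ratio, or
// returns it unchanged if overselling is disabled or not configured.
func oversoldCPU(node *corev1.Node, original resource.Quantity) resource.Quantity {
	if node.Annotations["stke.platform/mutate"] == "false" {
		return original
	}
	ratioStr, ok := node.Annotations["stke.platform/cpu-oversale-ratio"]
	if !ok {
		return original
	}
	ratio, err := strconv.ParseFloat(ratioStr, 64)
	if err != nil || ratio < 1 {
		return original
	}
	return *resource.NewMilliQuantity(int64(float64(original.MilliValue())*ratio), resource.DecimalSI)
}

func main() {
	node := &corev1.Node{}
	node.Annotations = map[string]string{"stke.platform/cpu-oversale-ratio": "1.5"}
	reported := resource.MustParse("15") // allocatable CPU reported by the kubelet
	fmt.Println(oversoldCPU(node, reported).String()) // 22500m
}
```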

Overselling node resources looks simple on the surface, but there are many details to consider:

  • What exactly happens when the Kubelet registers a Node with the APIServer, and is it feasible to patch the Node Status directly from a webhook?

  • When node resources are oversold, does Kubernetes' corresponding dynamic cgroup adjustment mechanism continue to work correctly?

  • Node status updates are very frequent, and every status update triggers the webhook; in a large cluster this can easily cause performance problems for the apiserver. How do we solve that?

  • Does overselling node resources distort the Kubelet eviction configuration, or does eviction still act according to the node's actual configuration and load? If eviction is affected, how should that be resolved?

  • When the oversell ratio is reduced, a node may end up with Sum(pods' request resource) > node's allocatable. Is there a risk here, and how should it be handled?

  • The monitoring system's view of a node depends on the node's Allocatable & Capacity resources. After overselling, that view is no longer accurate and needs to be corrected to some extent. How should the monitoring view be corrected?

  • How exactly should Node Allocatable and Capacity be oversold? What impact does overselling have on the node's reserved resources?

There are many Kubernetes technical details involved here, which I will introduce in detail in the next article.

Optimizing the autoscaling capability

When it comes to elastic scaling in Kubernetes, most people are familiar with HPA and HNA: one scales a Workload's Pods, the other scales the Nodes in the cluster. There is also the community VPA project for adjusting Pod resources, but it requires rebuilding the Pod to take effect. The point of VPA is to add capacity quickly; if, like HPA, it has to rebuild the Pod and restart the application in order to scale, much of its value is lost.

The HPA controller built into kube-controller-manager has the following problems:

  • Performance: a single goroutine loops over all HPA objects in the cluster, fetching the Pod monitoring data for each HPA object and computing new replica counts; for a large business this is time-consuming.

  • Core settings cannot be customized per Workload: the desired HPA response time differs per business; some expect a response within 5 s, while others consider 60 s sufficient. The built-in HPA controller only offers the global startup parameter horizontal-pod-autoscaler-sync-period for response-time control. Likewise, each business tolerates load jitter differently, but in the built-in HPA controller this can only be configured globally through horizontal-pod-autoscaler-tolerance, with no per-service customization.

  • Kubernetes' current custom-metrics support allows only one backend monitoring service to be registered. If some services in the cluster expose custom application metrics through Prometheus while others do so through Monitor, they cannot all be covered, and this scenario definitely exists when self-developed businesses move to the cloud.

To address these problems, we developed our own HPAPlus-Controller component:

  • Each HPA object gets its own goroutine dedicated to managing and computing that object, and these goroutines run in parallel, which greatly improves performance. HPAPlus-Controller is deployed independently, and its resource requirements can be sized to the cluster and the number of HPA objects, giving it more flexibility than the original built-in HPA controller.

  • HPAPlus-Controller supports a custom scaling response time per HPA object, can automatically detect whether a service is in the middle of a release and decide whether to disable HPA for it (some businesses require that elastic scaling not be triggered during an upgrade), and computes Pod resource utilization against the Pod's resource limit as the base, from which it derives the expected replica count after scaling out or in (see the sketch after this list). This is very important in a cluster with oversold nodes and compressed Pod resources.

  • Supports per-service customization of the tolerance for load jitter.

  • Supports scaling decisions based on monitoring data of more dimensions, such as a Pod's CPU load over the past 7 days or month.

  • Supports CronHPA to meet businesses' need for scheduled scale-out and scale-in.

  • Connects to the company's Monitor system through an extension APIServer, while retaining the Prometheus-Adapter path for Prometheus-based application monitoring, so that custom-metrics HPA can be driven by multiple application monitoring systems.
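
To make the limit-based calculation and per-HPA tolerance concrete, here is a minimal sketch of the replica computation using the standard HPA formula replicas = ceil(current x usage / target); all names and numbers are illustrative rather than HPAPlus-Controller's actual code.

```go
// Sketch of a limit-based replica calculation with a per-HPA tolerance band.
package main

import (
	"fmt"
	"math"
)

// desiredReplicas computes ceil(current * utilization / target), where
// utilization is measured against the container limit (not the possibly
// compressed request), and skips scaling when inside the tolerance band.
func desiredReplicas(current int32, avgUsageMilli, limitMilli int64, target, tolerance float64) int32 {
	utilization := float64(avgUsageMilli) / float64(limitMilli)
	ratio := utilization / target
	if math.Abs(ratio-1.0) <= tolerance {
		return current // within the per-HPA jitter tolerance: do nothing
	}
	return int32(math.Ceil(float64(current) * ratio))
}

func main() {
	// 10 replicas averaging 1.6 cores of a 2-core limit, 60% target, 10% tolerance.
	fmt.Println(desiredReplicas(10, 1600, 2000, 0.6, 0.1)) // 14
}
```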

Note: HPAPlus-Controller conflicts functionally with the Kubernetes built-in HPA controller, so the HPA controller in kube-controller-manager must be disabled before HPAPlus-Controller goes live.

Besides these HPA optimizations and enhancements, we are also developing Dynamic VPA technology, which will be introduced in a separate article later.

Other technical solutions

In addition, we are working on a dynamic scheduler built on the scheduler extender, a component for dynamically managing business-level quotas, mixed deployment of online and offline workloads based on business priority and quota management, and proactive detection of node resource fragments that are reported to a controller which re-schedules Pods to manage fragmentation. These directions are under active practice; their solutions and implementations are more complex and will be introduced in separate articles later.

Summary

This article introduced technical solutions to the problem, caused by Kubernetes' static scheduling, of a high cluster resource allocation watermark combined with low actual cluster load, describing in detail dynamic Pod resource compression, dynamic node resource overselling, and optimized autoscaling. Dynamic scheduling, dynamic business quota management, and online/offline workload mixing will be covered in later articles. To be dynamic, all of these cluster load improvement schemes depend heavily on a powerful container monitoring system. We are working closely with the Tencent Cloud monitoring product team to better serve Tencent's self-developed businesses as they move to the cloud.
