Dynamically Adjusting Pod Resource Limits in Web-Scale Clusters

Authors: Wang Cheng and ROCKETS, technical experts at Alibaba Cloud Container Platform
## Introduction

Have you ever run into this situation: you have a Kubernetes cluster, you start deploying applications, and then you have to decide how much resource to allocate to each container. It is hard to say. Because of the way Kubernetes works, container resources are essentially a static configuration. If a container turns out to be short of resources, we have to rebuild the Pod to give it more; if we over-allocate, the worker node ends up hosting far fewer containers than it could. So how can we give containers resources on demand? That is the question this article discusses.

First, let us lay out the challenges of our actual production environment. You may remember that the total GMV of Tmall's 2018 Double 11 reached 213.5 billion RMB over two days. From that single number you can imagine the type and number of applications behind a system able to support transactions at such a scale. At this scale, the container-scheduling terms we hear every day, such as container orchestration, load balancing, cluster scaling, cluster upgrades, application releases and canary rollouts, all acquire the qualifier "ultra-large-scale cluster", and none of them is easy any more. The scale itself is our biggest challenge. Operating and managing such a huge system well, while following the dev-ops practices the industry advocates, is like asking an elephant to dance. As Jack Ma put it, though, an elephant should do what an elephant does; why should it go dancing?

## Kubernetes to the rescue

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231251051-1186991920.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/9cbadf9a62b4eb35ad8b347dbad87644.png)

Can the elephant dance? With this question in mind, we need to talk about the systems behind Taobao, Tmall and the other apps. The deployment of these Internet applications has gone through three stages: traditional deployment, virtual-machine deployment and container deployment. Compared with traditional deployment, virtual-machine deployment offers better isolation and security, but inevitably sacrifices a fair amount of performance. Container deployment provides the isolation and security of the virtual-machine setting in a much more lightweight way, and our systems have been moving along this same path. If the underlying system is a mighty ship, then facing a massive number of containers we need a good captain to schedule and arrange them, so that the ship can steer around layer after layer of obstacles, reduce the difficulty of operation, gain more agility, and ultimately reach its destination.

## The ideal and the reality

At the very beginning, with all the beautiful scenarios of containers and Kubernetes in mind, the ideal picture of container orchestration should have looked like this:

- Calm: our engineers face complex challenges with composure, frowning less and smiling more confidently.
- Elegant: every change in production can be executed by pressing the Enter key as gracefully as sipping red wine.
- Orderly: from development to testing to canary to release, everything flows in one smooth motion.
- Stable: the system is robust; whatever wind blows from east or west, it stays up, with availability of many nines per year.
- Efficient: less manpower is needed, so that we can "work happily and live seriously".

However, the ideal is plump and the reality is skinny. What actually greeted us was a mess, and troubles of every shape. Messy, because as a rapidly rising new technology stack, much of the surrounding tooling and workflow was still in its infancy. Tools that worked perfectly in demos exposed one hidden problem after another once rolled out at scale in real scenarios, and everyone, from development to operations, was passively worn out. Besides, "rolling out at scale" also means facing production environments of every form directly: heterogeneous machine configurations, complex requirements, and even users' old habits that have to be accommodated.

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231251998-1304247189.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/b729a2a5593fb3083ba9fd6555639301.png)
Besides the mess around it, the system also ran into all kinds of container failures: insufficient memory causing OOM kills, CPU quotas set too low causing processes to be throttled, insufficient bandwidth, response latency rising sharply, and even transaction volume falling off a cliff at peak hours simply because the system could not keep up. All of this earned us a lot of experience in running Kubernetes at large scale in production.

## Facing the problems

### Stability

Problems always have to be faced. As an expert once said: if you feel something is not quite right, then something is definitely wrong somewhere. So we had to analyse where exactly the problem was. From the OOM kills on memory and the throttling on CPU, our initial inference was that we had allocated too few resources to the containers (a small sketch of how these signals can be read on the node is given below).

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231253314-1538169660.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/5f71fc9f7162cfe83666660274e59629.png)

Insufficient resources inevitably drag down the stability of the whole application. Take the scenario in the figure: even though these are replicas of the same application, perhaps because load balancing is not even enough, or because of the application itself, or simply because the machines are heterogeneous, replicas that are allocated numerically identical resources do not carry identical load. They look the same on paper, but under real workload it is very likely that some are stuffed while others starve.

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231254567-5357624.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/d007ed231f2e301123c6a6efff9a4764.png)
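As a rough illustration of the signals we are talking about, the sketch below reads CPU throttling counters and memory-limit hits from the cgroup filesystem of one container. It is a minimal sketch, assuming cgroup v1 and a hypothetical cgroup path; the real paths depend on the container runtime and QoS class, and a production collector does far more than this.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readThrottleRatio parses cpu.stat (cgroup v1) and returns the fraction of
// CFS periods in which the container was throttled.
func readThrottleRatio(cpuCgroup string) (float64, error) {
	data, err := os.ReadFile(filepath.Join(cpuCgroup, "cpu.stat"))
	if err != nil {
		return 0, err
	}
	stats := map[string]int64{}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseInt(fields[1], 10, 64)
		stats[fields[0]] = v
	}
	if stats["nr_periods"] == 0 {
		return 0, nil
	}
	return float64(stats["nr_throttled"]) / float64(stats["nr_periods"]), nil
}

// readMemFailCnt returns memory.failcnt: how many times allocations hit the
// memory limit. A steadily growing value hints at an under-sized limit.
func readMemFailCnt(memCgroup string) (int64, error) {
	data, err := os.ReadFile(filepath.Join(memCgroup, "memory.failcnt"))
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// Hypothetical cgroup paths for one container; real paths depend on the runtime.
	cpuPath := "/sys/fs/cgroup/cpu/kubepods/pod-example/container-example"
	memPath := "/sys/fs/cgroup/memory/kubepods/pod-example/container-example"

	if ratio, err := readThrottleRatio(cpuPath); err == nil && ratio > 0.2 {
		fmt.Printf("throttled in %.0f%% of CFS periods: CPU limit is likely too small\n", ratio*100)
	}
	if fails, err := readMemFailCnt(memPath); err == nil && fails > 0 {
		fmt.Printf("memory.failcnt=%d: the memory limit has been hit, OOM risk\n", fails)
	}
}
```

In our setup such signals are not read ad hoc like this; they flow through the data collector described later. The point is only that under-provisioning is directly observable on the node.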
In resource-overcommit scenarios, when the applications together want more than the whole node has, or when the shared CPU pool they sit in runs short, severe competition for resources breaks out between them. Resource competition is one of the biggest threats to application stability, so we have to do our best to remove such threats from the production environment.
Everyone knows that stability matters, especially to the front-line developers who hold the reins of millions of containers: one careless operation can easily cause an accident with a huge blast radius. So, following the usual playbook, we worked on both prevention and mitigation. On the prevention side, we run full-link stress tests and use them to predict, in advance and with some science behind it, how many replicas an application needs and how much resource each one needs; where resources cannot be budgeted accurately, the only fallback is to allocate them redundantly (a simplified sketch of this kind of capacity arithmetic follows below). On the mitigation side, when a large traffic wave arrives we degrade non-critical services and temporarily scale out the main applications. But for a sudden spike that lasts only a few minutes, paying such a high price for that short peak hardly seems worthwhile. Perhaps we can come up with solutions that better match our expectations.
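To make the prevention-side arithmetic concrete, here is a deliberately simplified sketch of how a replica count and total CPU reservation might be derived from stress-test results. All the numbers and names (peakQPS, qpsPerReplica, the redundancy factor) are hypothetical; our real capacity planning is driven by full-link stress testing, not a formula this naive.

```go
package main

import (
	"fmt"
	"math"
)

// planCapacity derives a replica count and total CPU reservation from
// stress-test observations, padding with a redundancy factor for whatever
// cannot be budgeted accurately.
func planCapacity(peakQPS, qpsPerReplica, cpuPerReplica, redundancy float64) (int, float64) {
	replicas := int(math.Ceil(peakQPS / qpsPerReplica * redundancy))
	return replicas, float64(replicas) * cpuPerReplica
}

func main() {
	// Hypothetical stress-test result: one replica sustains 500 QPS within its
	// latency SLO while using 2 CPU cores.
	replicas, totalCPU := planCapacity(100000, 500, 2.0, 1.3)
	fmt.Printf("replicas=%d, total CPU reserved=%.0f cores\n", replicas, totalCPU)
	// The 30% redundancy here is exactly the waste we would rather not pay
	// for a peak that only lasts a few minutes.
}
```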
### Resource utilization

Look at how our applications are actually deployed: the containers on a node generally belong to a variety of applications, and these applications do not necessarily, in fact generally do not, hit their peaks at the same time. For hosts with this kind of mixed deployment, it would be much more scientific if the resource allocation of the containers running on them could stagger their peaks.

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231255505-328218126.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/55119fae9afa8638b061b8b4501ef892.png)

An application's resource demand may wax and wane like the moon; it has cycles. Online business, especially trading, shows clear periodicity in its resource usage: in the early hours of the morning usage is low, and in the afternoon it climbs. By analogy: at a moment when application A is important and application B is less so, squeezing B a little and lending its resources to A is a good choice. It feels a bit like time-division multiplexing. But if we allocate resources according to each application's peak traffic, a great deal is wasted (the sketch after the next figure makes this concrete).

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231256546-317825822.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/9b9de1f670476432b07909c71a7cd744.png)
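Here is a minimal sketch of that waste, using made-up hourly CPU profiles for three applications whose peaks do not coincide. Provisioning each application for its own peak reserves the sum of the peaks; if the applications can time-share the node, only the peak of the sum is needed, which matches the 2 + 2 + 1 = 5 versus 3 observation discussed next.

```go
package main

import "fmt"

func main() {
	// Made-up hourly CPU usage (cores) for three applications whose peaks do
	// not coincide: an online app, an offline batch job and a real-time job.
	usage := map[string][]float64{
		"online":    {0.5, 0.5, 1.0, 2.0, 2.0, 1.0},
		"offline":   {2.0, 2.0, 0.5, 0.5, 0.5, 0.5},
		"real-time": {0.5, 0.5, 1.0, 0.5, 0.5, 0.5},
	}

	hours := len(usage["online"])
	combined := make([]float64, hours)
	sumOfPeaks := 0.0
	for _, series := range usage {
		peak := 0.0
		for h, v := range series {
			combined[h] += v
			if v > peak {
				peak = v
			}
		}
		sumOfPeaks += peak
	}
	peakOfSum := 0.0
	for _, v := range combined {
		if v > peakOfSum {
			peakOfSum = v
		}
	}

	// Static, per-peak allocation must reserve sumOfPeaks cores, while a
	// time-shared node only ever needs peakOfSum cores at any single moment.
	fmt.Printf("sum of peaks: %.1f cores, peak of sum: %.1f cores\n", sumOfPeaks, peakOfSum)
}
```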
Besides online applications with strict real-time requirements, we also have offline applications and real-time computing applications. Offline computation is not very sensitive to CPU, memory, network or timing, so it can run at almost any time; real-time computation, on the other hand, can be extremely time-sensitive. In the early days we deployed these different types of business on separate nodes. Looking at the figure above, if they can time-share resources, the amount actually needed is not 2 + 2 + 1 = 5 but the maximum demand of the most important and urgent application at any given moment, which is 3. If we can monitor the real usage of each application and assign it a reasonable value, resource utilization genuinely improves.

For e-commerce applications, which are web applications built on heavyweight Java frameworks and the related technology stack, neither HPA nor VPA is an easy fit on short timescales. Take HPA first: we may be able to pull up a Pod and create a new container in seconds, but whether that container is actually ready to serve is another matter. From creation to being usable can take quite a while, and for big promotions and flash sales, where the traffic "peak" may last only a few minutes or ten-odd minutes, the campaign may already be over by the time all the HPA replicas become available. As for the current community VPA, its logic of deleting the old Pod and creating a new one is even harder to accept in these scenarios. So we needed a more practical solution to fill the gap left by HPA and VPA in single-node resource scheduling.

## Our solution

### Delivery standard

First we set a standard the solution has to meet when delivered: stability, plus utilization, plus automated execution, and, if it can be intelligent as well, so much the better.
Refining that delivery standard:

- Safe and stable: the tool itself must be highly available, and the algorithms and execution methods it uses must be controllable.
- Container resources on demand: based on the business's real-time resource consumption, forecast its resource consumption for the near future, so that users understand the real upcoming resource needs of their business.
- Small footprint: the tool's own resource consumption must be as small as possible; it must not become a burden on operations.
- Easy to operate and extensible: people should be able to use it without special training, and it should offer good extension points for users to build on.
- Fast detection and timely response: real-time behaviour is the most important quality, and it is exactly where this approach differs from VPA or HPA as a way of solving resource scheduling.

### Design and implementation

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231257764-1522698968.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/3b4d8997642fe71e7e8ca8debc66821e.png)

The figure above is the initial process design of our tool. When an application faces high business demand, reflected in rising demand for CPU, memory or other resources, we take the real-time basic data gathered by the data collector, let the data aggregator generate a portrait of the container or of the whole application, and feed that portrait back to the policy engine. The policy engine then immediately modifies the parameters in the container's cgroup file directory (a small illustrative sketch of such a cgroup write follows below).

Our earliest architecture was as straightforward as our idea: we made intrusive modifications to the kubelet. Although we only added a few interfaces, that really was not elegant, and every Kubernetes upgrade brought its own challenges for upgrading the policy-engine-related components.

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231259102-1245740822.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/d335945284a49152c8ed95d6208f3285.png)
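To give a feel for what "modifying the cgroup parameters" means on the node, here is a minimal sketch that rewrites a container's CFS quota in place, without touching the Pod spec or restarting anything. It is a sketch under assumptions: cgroup v1, a hypothetical path, and a log line standing in for the revision record; the real executor described below persists every revision so that it can roll back on failure.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// setCPULimit rewrites cpu.cfs_quota_us for one container cgroup (cgroup v1),
// changing its effective CPU limit without recreating the Pod.
// cpuLimitCores is the new limit in cores; periodUs is the CFS period.
func setCPULimit(cpuCgroup string, cpuLimitCores float64, periodUs int64) error {
	quota := int64(cpuLimitCores * float64(periodUs))
	path := filepath.Join(cpuCgroup, "cpu.cfs_quota_us")

	old, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	// Persist the previous value somewhere durable before changing it, so a
	// failed adjustment can be rolled back (here: just a log line).
	fmt.Printf("%s revision: %s -> %d at %s\n",
		path, strings.TrimSpace(string(old)), quota, time.Now().Format(time.RFC3339))

	return os.WriteFile(path, []byte(fmt.Sprintf("%d", quota)), 0644)
}

func main() {
	// Hypothetical cgroup path; real paths depend on the runtime and QoS class.
	cgroupPath := "/sys/fs/cgroup/cpu/kubepods/pod-example/container-example"
	if err := setCPULimit(cgroupPath, 2.5, 100000); err != nil {
		fmt.Fprintln(os.Stderr, "adjust failed:", err)
	}
}
```

A change made this way bypasses the Pod spec entirely, which is why recording every revision and being able to roll it back is part of the executor's job rather than an afterthought.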
To iterate quickly and stay decoupled from the kubelet, we evolved the implementation again: the resource-adjustment logic was pulled out of the kubelet and runs as its own containerized component on the node. This gave us the following properties:

- No intrusive modification of Kubernetes core components.
- Easy iteration and release.
- By leaning on the Kubernetes QoS class mechanism, both the resource allocation and the resource overhead of the containers stay controllable.

Of course, in later evolution we also tried to connect it with HPA and VPA, since these are complementary to the policy engine. Our architecture therefore evolved further into the form below: when the policy engine struggles with some of the more complex scenarios, it reports events upwards so that the central, global side can make broader decisions, such as scaling extra resources horizontally or vertically.

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231301220-1118355670.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/1840c344921b59210c77e1631dc88e8d.png)

Now let us look at the design of the policy engine itself. The policy engine is the core component on each node for intelligent scheduling and for executing Pod resource adjustments. It consists of an api server, a command center and an executor. The api server serves external requests that query or set the policy engine's operating state. The command center makes Pod resource-adjustment decisions in real time based on the container portraits and on the load and resource usage of the physical machine itself. The executor then adjusts the containers' resource limits according to the command center's decisions; it also persists the details of every revision, so that adjustments can be rolled back when a fault occurs.

The command center periodically pulls real-time container portraits from the data aggregator, including aggregated statistics and prediction data. It first judges the node's state: if, for example, the node's disk is abnormal or the network is down, the node is considered abnormal, the scene must be preserved, and no further Pod resource adjustments are made, to avoid shaking the system and interfering with operations and debugging. If the node state is normal, the command center filters the container data again through the policy rules, for example whether a container's CPU usage has shot up, or whether a container's response time has exceeded its safety threshold. If a rule is satisfied, a recommended resource adjustment is produced for the containers that meet the conditions and sent to the executor.

In the framework design we followed these principles:

- Plugin-oriented: all rules and policies are designed to be changed through configuration files and are decoupled as far as possible from the core control logic; they are also decoupled from updates and releases of the data collector, data aggregator and other components, which improves extensibility.
- Stable, which covers several aspects:
  - Controller stability. The command center's decisions are made on the precondition of not harming single-node or even global stability, which includes both performance stability and stability of the containers' resource allocation. For example, each controller is currently responsible for only one cgroup resource; that is, within the same time window the policy engine does not adjust several resources at once, to avoid oscillation in resource allocation interfering with the adjustment effect.
  - Trigger-rule stability. For example, the original trigger condition was that a container's performance metric exceeds its safety threshold, but to avoid a sudden spike setting off a control action and causing oscillation, we customized the rule so that it fires only when a low percentile of the performance metric over a time window exceeds the safety threshold; if that rule is satisfied, most of the values in the window have already exceeded the threshold, and a control action is genuinely needed (a small sketch of such a windowed-percentile trigger follows below).
  - In addition, unlike the community Vertical Pod Autoscaler, the policy engine does not proactively evict and rebuild containers; it modifies the containers' cgroup files directly.
- Self-healing: actions such as adjusting resources may run into exceptions, so we added a self-healing rollback mechanism to every controller to keep the whole system stable.
- No reliance on prior knowledge of applications: stress-testing every application separately, customizing policies per application, or stress-testing applications together per deployment unit in advance would all be hugely expensive and would hurt scalability. Our strategies are designed to be as generic as possible and to depend as little as possible on any specific platform, operating system, application metric or control strategy.

In terms of which resources can be adjusted, we support each container's cgroup CPU, memory, network and disk I/O bandwidth. Currently we mainly adjust containers' CPU resources, and in test environments we are exploring dynamic adjustment of the memory limit and swap usage in time-multiplexing scenarios to avoid OOM; in the future we will also support dynamic adjustment of container network and disk I/O.
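As an illustration of the trigger-rule idea, the sketch below fires only when a low percentile (here the 20th) of a container's latency samples inside a sliding window exceeds the safety threshold, i.e. when most of the window is already over the line rather than just one spike. The window length, percentile and threshold are hypothetical tuning knobs, not the values we use in production.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0-100) of the samples.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	return s[int(p/100*float64(len(s)-1))]
}

// shouldTrigger implements the stabilized rule: act only if even a *low*
// percentile of the window exceeds the threshold, meaning most samples in
// the window are already above it, not just a momentary spike.
func shouldTrigger(windowLatenciesMs []float64, thresholdMs float64) bool {
	if len(windowLatenciesMs) == 0 {
		return false
	}
	return percentile(windowLatenciesMs, 20) > thresholdMs
}

func main() {
	const sloMs = 250.0

	spike := []float64{80, 90, 85, 900, 95, 88, 92, 90, 87, 91}              // one outlier only
	sustained := []float64{260, 270, 300, 280, 255, 290, 310, 265, 275, 285} // genuine overload

	fmt.Println("single spike triggers:", shouldTrigger(spike, sloMs))          // false
	fmt.Println("sustained overload triggers:", shouldTrigger(sustained, sloMs)) // true
}
```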
### The effect of the adjustments

[![](https://img2018.cnblogs.com/blog/1411156/201908/1411156-20190829231301945-416978928.png)](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/589f7c59ec33c58bd3760264a9ae5a8e.png)

The figure above shows some experimental results from our test cluster, in which a high-priority online application and a low-priority offline application are deployed together. The SLO is 250 ms: we want the 95th percentile of the online application's latency to stay below that threshold. As the results show, before roughly the 90-second mark the online application's load is low, and both its mean and percentile latency stay below 250 ms. At about 90 s we put pressure on the online application; traffic and load rise, and the 95th-percentile latency exceeds the SLO. At around 150 s our small-step control policy is triggered and progressively throttles the offline application that is competing with the online one for resources. By roughly 200 s the online application's performance is back to normal, with the 95th-percentile latency under the SLO again. This demonstrates the effectiveness of our control strategy.

## Lessons learned

Below we summarize some of the experiences and lessons we collected over the course of the project; we hope they help people facing similar problems and scenarios.

- Avoid hard-coding, and split components into microservices. This not only enables fast evolution and iteration, it also makes it easier to fence off a misbehaving service.
- Try not to call interfaces or library features that are still in alpha or beta. For example, we used to call the CRI interface directly to read container information or perform updates, but as fields or methods of the interface changed, some of our features simply stopped working; sometimes calling an unstable interface is less reliable than directly collecting the information an application prints out.
- QoS-driven dynamic resource adjustment: as we said before, there are tens of thousands of applications inside Alibaba Group and the call chains between them are quite complex. When a container of application A shows a performance anomaly, it is not necessarily caused by resource shortage or resource competition on that single node; it may very well be caused by latency in a downstream application B or C, or in a database or cache it accesses. Because of this limitation of single-node information, resource adjustment based on a single node alone can only be "best effort". In the future we plan to connect the single-node resource-control link with the central side: the central side will combine the performance information and resource-adjustment requests reported by the nodes, reallocate resources in a unified way, re-orchestrate containers, or trigger HPA, forming a cluster-level closed loop of intelligent resource control that will greatly improve cluster-level stability and overall resource utilization.
- Resource-versus-performance model: some readers may have noticed that our adjustment strategy never explicitly builds a per-container "resource versus performance" model. Such models are very common in academia: several applications are stress-tested offline or online, the resource allocation is varied, application performance is measured, a performance-over-resource curve is obtained, and that curve is then used in a real-time resource-control algorithm.
  When the number of applications is small, the call chains are simple and the physical machines in the cluster come in only a few hardware configurations, this stress-test-based method can enumerate all possible cases, find the optimal or near-optimal resource adjustment plan, and thus obtain good performance. But in Alibaba Group's scenario we have tens of thousands of applications, key applications release new versions very frequently, and after a new release the old stress-test data, that is, the old resource-performance model, often no longer applies. In addition, many of our clusters are heterogeneous: performance data measured on one machine model will not reproduce on a physical machine of a different model. All of this gets in the way of applying academic resource-control algorithms directly in our environment. So for Alibaba's internal scenarios we adopted a different strategy: instead of building a resource-performance model through offline stress testing, we build real-time, dynamic container portraits, use the statistics of resource usage over a sliding time window as the prediction for the next short period, and keep them dynamically updated; finally, based on these dynamic container portraits, we execute small-step resource-adjustment policies, feeling our way forward and doing the best we can (a sketch of such a sliding-window portrait follows below).
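A minimal sketch of what such a sliding-window portrait can look like: keep the recent usage samples of a container and use a high percentile over the window as the predicted demand for the next short interval. The window length and percentile are again hypothetical knobs; the real aggregator keeps much richer statistics and prediction data than this.

```go
package main

import (
	"fmt"
	"sort"
)

// Portrait keeps a sliding window of recent resource-usage samples for one
// container and derives a short-term demand prediction from it.
type Portrait struct {
	window  int
	samples []float64
}

// Observe appends a new sample and drops anything older than the window.
func (p *Portrait) Observe(v float64) {
	p.samples = append(p.samples, v)
	if len(p.samples) > p.window {
		p.samples = p.samples[len(p.samples)-p.window:]
	}
}

// Predict returns a high percentile of the window as the expected demand for
// the next short period; it is updated dynamically as new samples arrive.
func (p *Portrait) Predict(pct float64) float64 {
	if len(p.samples) == 0 {
		return 0
	}
	s := append([]float64(nil), p.samples...)
	sort.Float64s(s)
	return s[int(pct/100*float64(len(s)-1))]
}

func main() {
	cpu := &Portrait{window: 60} // e.g. the last 60 one-second CPU samples (cores)
	for _, v := range []float64{1.2, 1.4, 1.3, 2.8, 1.5, 1.6, 2.9, 1.4, 1.3, 1.5} {
		cpu.Observe(v)
	}
	fmt.Printf("predicted CPU demand for the next interval: %.1f cores\n", cpu.Predict(90))
}
```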
## Summary and outlook

To sum up, our work mainly delivered the following:

- By time-division multiplexing and mixed deployment of containers with different priorities (that is, online and offline tasks), and by dynamically adjusting the containers' resource limits, we make sure that online applications get sufficient resources under different load conditions, and thus improve the overall utilization of cluster resources.
- By intelligently and dynamically adjusting container resources on individual nodes, we reduce performance interference between applications and safeguard the performance of high-priority applications.
- The various resource-adjustment strategies are deployed as a DaemonSet and run automatically and intelligently on the nodes, reducing manual intervention and operations labour cost.

Looking ahead, we hope to strengthen and extend the work in the following directions:

- Closed-loop control link: as mentioned above, resource adjustment on a single node lacks global information and has its limits; it can only be best-effort. In the future we hope to connect it with the HPA and VPA paths, so that the nodes and the central side adjust resources in concert and elastic scaling yields the maximum benefit.
- Container re-orchestration: even for the same application, the load and the physical environment of different containers are dynamic, and adjusting Pod resources on a single node may not satisfy those dynamic needs. We hope the real-time container portraits on each node can provide more useful information to the central side and help the central scheduler make smarter container re-orchestration decisions.
- Smarter strategies: our current resource-adjustment strategies are still rather coarse-grained, and the resources that can be adjusted are limited; we want follow-up strategies to be more intelligent and to take more resources into account, such as disk and network I/O bandwidth, to make the adjustments more effective.
- Finer-grained container portraits: the current portraits are relatively rough, relying only on statistics and linear prediction, and the kinds of performance metrics used to characterize a container are also limited. We want to find more accurate and more general metrics that reflect container performance, so as to characterize a container's current state and its demand for different resources more precisely.
- Locating interference sources: we hope to find effective ways to precisely locate the source of interference on a node when application performance degrades; this would also be very meaningful for smarter strategies.

## Open-source plan

If you are interested in the project code: around September 2019 our work is expected to appear in Alibaba's open-source project [OpenKruise](https://github.com/openkruise) ([https://github.com/openkruise](https://github.com/openkruise)), so stay tuned!


Source: www.cnblogs.com/alisystemsoftware/p/11432596.html