Supporting Double Eleven elasticity: ECI stability construction

1. About ECI

Background

Since its official release in 2018, ECI (Elastic Container Instance) has been refined over four years and has grown rapidly into Alibaba Cloud's serverless container infrastructure. It serves many public cloud customers and cloud products both inside and outside Alibaba, handling millions of elastic container creations every day.

However, ECI had not taken part in the Group's Double Eleven promotion in previous years. Double Eleven is something of a parade for Alibaba's engineers, and whether a product can handle Double Eleven traffic has become an important criterion for judging whether it is stable and reliable. This year things fell into place naturally: ASI began integrating with ECI, aiming to have ECI carry 300,000 cores of elastic computing power for the Double Eleven promotion. We all know what Double Eleven means to the Alibaba Group, so with the mission approaching we devoted ourselves fully to the integration, stress testing, and on-call work. After more than two months of business adaptation, stress testing, and preparation, the elastic containers for Double Eleven were successfully delivered. Behind this result are the down-to-earth, diligent efforts of ASI, ECI, and every colleague who took part.

This year was the first time ECI served as the Group's elastic infrastructure for a major promotion. According to online statistics, about 4 million cores of ECI elastic resources were used in total during the promotion, a huge test for a cloud native system in terms of instantaneous resource elasticity, retained scale, and system stability. As the underlying compute unit, ECI successfully withstood the Double Eleven elastic traffic peak. While we marvel at the rapid development of serverless and container technologies, the event also posed a great challenge to the stability of the new system architecture.

Now, looking back at ECI's first Double Eleven, it is worth making a comprehensive summary: what work we did to guarantee the Group's elasticity, what can be reused in the future, what can be offered to other teams as technical and practical reference, and what could have been done better, so that we are prepared for the next big promotion.

In this article, we introduce the stability work ECI has done over the years and how it safeguarded the Group's Double Eleven.

2. Challenges encountered

Stability challenges brought by large-scale concurrency

The biggest challenge comes from large-scale concurrency. As the number of containers grows, producing container instances becomes a severe test for the cloud management and control system. Elastic scenarios in particular require producing instances and pulling images at large scale within a very short time, and then ensuring that the containers start successfully.

How to ensure that instances are produced successfully at scale, how to detect problems online in advance, and how to stop the bleeding and recover quickly even when a problem does occur, are particularly important for safeguarding the Group's business during Double Eleven. In addition, in the public cloud environment it is equally important to ensure that other public cloud customers are not affected. A complete stability guarantee system and fault response plans are therefore required so that the business can run smoothly throughout the Double Eleven period.

Instance production system stability

ECI and ECS share the same resource scheduling system. Compared with ECS workloads, which can tolerate minute-level latency, the frequent creation and deletion of ECI instances places far more stringent requirements on the capacity and stability of the scheduling system.

Service availability guarantee

An ECI security sandbox may become unhealthy for various reasons (OOM, physical machine downtime, kernel panic). In that case, if the ECI Pod is not removed from the endpoints at the Kubernetes level, requests will still be routed to the unhealthy ECI through load balancing, lowering the success rate of business requests. Guaranteeing service availability is therefore especially important for the Group's business.

3. ECI stability technology construction

Stability guarantee starts at the requirement-gathering and preparation stage, and the Double Eleven campaign lasts for two months. To cooperate with the Group's full-link acceptance, ECI's own stability guarantee work was also carried out intensively.

Stability guarantee runs through the entire promotion. Before the promotion, system changes should be carefully reduced to eliminate interference from human factors; releases should be treated with respect, and multiple stress test drills should be conducted to verify system stability and continuously improve the system's resistance to pressure and its ability to recover. During the promotion, stability is reinforced to keep things running smoothly; afterwards, problem reviews distill a replicable key-customer assurance strategy, which is valuable for teams that have not been through an actual Double Eleven drill.

We have therefore sorted out the main stability work during the promotion, covering risk control, sorting out key business dependencies, technical support, the stress testing plan, runtime guarantee, fault operation and maintenance capabilities, and the final review and optimization, hoping this can guide future promotions and accumulate experience in stability governance. Next, we introduce the main stability guarantee methods used in this promotion and how to apply them.

Instance production guarantee: VM reuse technology

Guaranteeing instance production is the top priority when the Group uses ECI elastically. A typical instance production flow is shown in the figure. On the control plane, ECS and ECI share the same management and control system. After the ECI control side calls the resource scheduling system, it allocates computing resources and then calls pync (Alibaba Cloud's single-machine management and control component), which in turn calls avs (the single-machine network component) and tdc (the single-machine storage component) to produce network cards and disks respectively. In this process, the OpenAPI interfaces that the ECS control path depends on are heavyweight, and they quickly become a system bottleneck in large-scale creation and deletion scenarios.

We previously developed a VM reuse function specifically for the high-frequency creation and deletion pattern of container instances: when a container instance is deleted, recycling of its VM is delayed, and the VM's network card, image, and computing resources are reused for subsequent instances. This reduces the load on the overall management and control system and thus keeps the instance production system stable. Judging from this Double Eleven, VM reuse worked well: the overall capacity of the management and control system stayed at a normal level, guaranteeing stable elasticity for the Group's Double Eleven instances.
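To make the delayed-recycling idea concrete, the sketch below shows one possible shape of a VM reuse pool. It is only an illustration under assumed names (VmReusePool, IdleVm, the five-minute reuse window); the article does not describe ECI's actual implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of delayed VM recycling and reuse (all names are hypothetical).
public class VmReusePool {

    record IdleVm(String vmId, String instanceSpec, Instant releasedAt) {}

    private final ConcurrentLinkedQueue<IdleVm> idleVms = new ConcurrentLinkedQueue<>();
    private final Duration reuseWindow = Duration.ofMinutes(5); // keep a deleted VM around for a while

    // Called when a container instance is deleted: instead of tearing the VM down
    // immediately, park it in the pool so its network card/image/compute can be reused.
    public void release(String vmId, String instanceSpec) {
        idleVms.add(new IdleVm(vmId, instanceSpec, Instant.now()));
    }

    // Called on a new create request: try to match an idle VM with the same spec,
    // which skips the expensive scheduling + network/disk production path.
    public Optional<String> acquire(String instanceSpec) {
        for (IdleVm vm : idleVms) {
            boolean fresh = Duration.between(vm.releasedAt(), Instant.now()).compareTo(reuseWindow) < 0;
            if (fresh && vm.instanceSpec().equals(instanceSpec) && idleVms.remove(vm)) {
                return Optional.of(vm.vmId());
            }
        }
        return Optional.empty();
    }

    // A background task would periodically evict VMs that exceeded the reuse window
    // and hand them back to the real recycling flow.
    public void evictExpired(java.util.function.Consumer<String> realRecycler) {
        Instant now = Instant.now();
        idleVms.removeIf(vm -> {
            if (Duration.between(vm.releasedAt(), now).compareTo(reuseWindow) >= 0) {
                realRecycler.accept(vm.vmId());
                return true;
            }
            return false;
        });
    }
}
```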

The rescheduling mechanism handles situations such as insufficient inventory or timeouts when calling remote services. To guarantee the eventual consistency of instance production, we designed fault handling strategies for ECI instance production. The possible values are:

fail-back: recover automatically on failure, i.e. after a Pod creation fails it is automatically retried.
fail-over: failure transfer; in this context the effect is equivalent to fail-back.
fail-fast: fail fast, i.e. report an error directly after the Pod creation fails.

The fault handling strategy is essentially a rescheduling strategy. Native Kubernetes scheduling supports rescheduling: after a scheduling failure, the Pod is put back into the scheduling queue to wait for the next scheduling attempt. Analogous to Kubernetes rescheduling, when the ECI management and control system receives a creation request, the request first enters a queue, and an asynchronous scheduled task then pulls it from the queue and submits it to an asynchronous workflow for the actual resource production, container startup, and so on. Even with optimizations such as multiple availability zones and multiple instance specifications, the asynchronous workflow may still fail due to resource contention, insufficient intranet IP addresses, startup failures, and so on. In that case the creation request needs to be returned to the queue to wait for production to be rescheduled.

Our current fault handling strategy:

1. Failed tasks are always retried, but we compute an execution interval for each task: the more retries, the longer the interval, which produces a backoff effect.

2. The priority policy considers factors such as user level, task type, and the reason for the last failure; high-priority tasks are submitted for execution first.

3. The reason for each scheduling failure is reported to the Kubernetes cluster as standard events.

The state machine for a task's entire lifecycle in the queue is as follows:

Any task that fails to execute re-enters the queue and waits to be scheduled again. Since a task can fail at any step, all resources produced so far are rolled back; once the rollback completes, the task returns to the initial state. Tasks in the initial state are picked up for execution and submitted to asynchronous production. If production fails, the task goes back to the waiting-for-scheduling state; if production succeeds, the task ends and reaches its final state. With this rescheduling mechanism we can greatly reduce instance production failures caused by jitter in the production system: for scenarios that require a high container startup success rate, we guarantee the eventual consistency of instance production, while scenarios with looser requirements can fail fast and let the upper-layer business handle the failure.
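To illustrate the flow, here is a minimal sketch of such a task state machine with retry backoff. All names (CreateTask, the state and strategy enums) and the backoff formula are assumptions for illustration, not the actual ECI implementation.

```java
import java.time.Duration;

// Minimal sketch of the task state machine described above (names are hypothetical).
public class CreateTask {

    enum State { WAITING, PRODUCING, ROLLING_BACK, SUCCEEDED, FAILED_FAST }

    enum Strategy { FAIL_BACK, FAIL_FAST } // fail-over behaves like fail-back here

    private State state = State.WAITING;
    private int attempts = 0;
    private final Strategy strategy;

    CreateTask(Strategy strategy) {
        this.strategy = strategy;
    }

    // Interval grows with the number of retries to get a backoff effect.
    Duration nextDelay() {
        return Duration.ofSeconds(Math.min(300, (long) Math.pow(2, attempts)));
    }

    // Called by the asynchronous scheduled task that drains the queue.
    void onPicked() {
        attempts++;
        state = State.PRODUCING;
    }

    // Production failed at some step: roll back produced resources, then either
    // go back to the queue (fail-back) or surface the error directly (fail-fast).
    void onProductionFailed() {
        state = State.ROLLING_BACK;
        rollbackProducedResources();
        state = (strategy == Strategy.FAIL_BACK) ? State.WAITING : State.FAILED_FAST;
    }

    void onProductionSucceeded() {
        state = State.SUCCEEDED; // terminal state
    }

    private void rollbackProducedResources() {
        // release partially produced network cards, disks, compute resources, etc.
    }

    State state() {
        return state;
    }
}
```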

Service fault tolerance and degradation

Handling fault scenarios and degrading the services the system depends on are equally important. Most rate-limiting and degradation schemes focus on keeping the service stable: when a resource dependency in the call chain behaves abnormally, for example by timing out, and the abnormal ratio rises, calls to that resource are restricted and requests are allowed to fail fast or return a preset static value, so that other resources are not affected and an avalanche is avoided. ECI currently implements a three-level degradation framework: lossless degradation based on self-learning from historical logs, local cache degradation, and flow control degradation. The ECS/ECI OpenAPI layer depends internally on more than 200 interfaces; for each interface we analyze its call frequency, RT distribution, and timeout settings, select an appropriate degradation strategy, and set a reasonable threshold, so that when the system has a problem it can degrade intelligently and protect itself. A typical implementation of the degradation mechanism is shown in the figure:

When a new request arrives for a non-resource core API, if the historical cache data has not expired, the cached data is returned directly and the business logic ends. Otherwise the remote interface is called. If the call succeeds, the data is returned, written to the cache, and also written to the SLS cache log for future degradation, and the business logic ends. If the remote call fails, the degradation strategy is triggered: if the failure indicator (for example, the exception ratio within the configured time window) has not reached the configured threshold, the corresponding business exception is thrown directly and the business logic ends. If the threshold has been reached, degradation is applied in the following order: first, look up historical data in the SLS cache log, use it as the degraded return value, and write it back into the cache; if there is no corresponding entry in the SLS cache log, return a preset static value or a null value. For services and interfaces that are unrelated to user resources, are rarely updated, and return global parameters, the general strategy above may not work well because it is hard to define degradation rule thresholds for them.
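A minimal sketch of this general flow (fresh cache, then remote call, then SLS historical data, then a preset static value) might look as follows; CacheStore, SlsCacheLog, and the threshold check are hypothetical abstractions, not ECI's actual code.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Minimal sketch of the cache -> remote -> SLS-history -> static-value degradation flow.
public class DegradableApiClient<T> {

    interface CacheStore<T>  { Optional<T> getFresh(String key); void put(String key, T value); }
    interface SlsCacheLog<T> { Optional<T> findLatest(String key); void append(String key, T value); }

    private final CacheStore<T> cache;
    private final SlsCacheLog<T> slsLog;
    private final T staticFallback;

    DegradableApiClient(CacheStore<T> cache, SlsCacheLog<T> slsLog, T staticFallback) {
        this.cache = cache;
        this.slsLog = slsLog;
        this.staticFallback = staticFallback;
    }

    T call(String key, Supplier<T> remoteCall, Supplier<Boolean> degradeThresholdReached) {
        // 1. Fresh cache hit: return directly.
        Optional<T> cached = cache.getFresh(key);
        if (cached.isPresent()) {
            return cached.get();
        }
        try {
            // 2. Call the remote interface; on success, cache the result and
            //    persist it to the SLS cache log for future degradation.
            T value = remoteCall.get();
            cache.put(key, value);
            slsLog.append(key, value);
            return value;
        } catch (RuntimeException e) {
            // 3. Below the configured failure threshold: surface the error as-is.
            if (!degradeThresholdReached.get()) {
                throw e;
            }
            // 4. Degrade: historical SLS data first, then a preset static value.
            Optional<T> historical = slsLog.findLatest(key);
            if (historical.isPresent()) {
                cache.put(key, historical.get());
                return historical.get();
            }
            return staticFallback;
        }
    }
}
```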

For the interfaces mentioned above that are unrelated to user resources, the strategy is to degrade directly on Dubbo exceptions. The conditions for degradation or circuit breaking are:

1. Automatic degradation (Sentinel can optionally be used): triggered by timeout exceptions or by rate limiting (a sketch follows after this list).

2. Manual switch: supported for core non-resource APIs, toggled directly in the OpenAPI layer.

3. Local degradation cache: for serious system failures, several core describe APIs can be cached locally in the OpenAPI layer. When a failure or avalanche occurs, all traffic is switched to this local cache, which limits the impact of the degradation while also reducing the pressure on lower-layer services and buying recovery time.
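As an illustration of the automatic degradation option in item 1, the sketch below configures an exception-ratio circuit-breaking rule with Alibaba Sentinel. The resource name and thresholds are placeholders, and whether ECI uses exactly this rule type is an assumption.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

import java.util.Collections;

public class SentinelDegradeExample {

    public static void main(String[] args) {
        // Circuit-break the (hypothetical) resource when its exception ratio
        // exceeds 50% within the statistics window; stay open for 30 seconds.
        DegradeRule rule = new DegradeRule("DescribeSomethingApi")
                .setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO)
                .setCount(0.5)
                .setTimeWindow(30);
        DegradeRuleManager.loadRules(Collections.singletonList(rule));

        Entry entry = null;
        try {
            entry = SphU.entry("DescribeSomethingApi");
            // the normal call to the downstream Dubbo/HTTP service would go here
        } catch (BlockException ex) {
            // the circuit is open or the request is blocked: fall back to cached/static data
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```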

Local mocking of dependent services on the create link: for Dubbo or HTTP dependencies on the create link that rarely change, key-value data is stored through daily SLS analysis; when a failure occurs, the call is degraded to this standby data, so the impact of degradation approaches zero.

Other service degradation mechanisms include: flow control and caching for large paging queries; degrading the Dubbo or HTTP services that create-class APIs depend on, with asynchronous compensation; degrading links for operation-class APIs; removing unnecessary database dependencies; degrading traffic to the read-only database; isolating traffic by user level; and switching traffic at the API level to grayscale or to independent thread pools.

Log debugging and call-link tracking: an API context is used to implement detailed debug logging and full call-link tracking. Core APIs support enabling debug logs per user, the requestId is passed all the way through to the DAO layer, and sampling can be turned on at any time, so abnormal DAO calls are discovered promptly (a generic sketch follows below).

With this service dependency degradation and fault tolerance mechanism, the historical cache data of the relevant interfaces kept in SLS logs can be used for lossless degradation while keeping the service stable; when SLS has no data, local static data can still be used to construct a valid return value. Once the flow control and circuit-breaking degradation is triggered, most users will not perceive any service abnormality.
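The article's API context itself is not shown; as a generic illustration of passing a requestId through to the DAO layer for per-user debug logging, here is a sketch using SLF4J's MDC. This is a stand-in technique, not the actual ECI implementation.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Generic illustration of per-request debug logging with a requestId that stays
// visible in every layer, down to the DAO.
public class RequestIdLogging {

    private static final Logger log = LoggerFactory.getLogger(RequestIdLogging.class);

    public void handleApiCall(String requestId, boolean debugEnabledForUser) {
        MDC.put("requestId", requestId);
        try {
            if (debugEnabledForUser) {
                log.debug("entering core api"); // requestId is attached via the log pattern, e.g. %X{requestId}
            }
            daoQuery();
        } finally {
            MDC.clear(); // always clean up so the id does not leak into other requests
        }
    }

    private void daoQuery() {
        long start = System.currentTimeMillis();
        // ... actual database access ...
        long costMs = System.currentTimeMillis() - start;
        if (costMs > 100) {
            // slow/abnormal DAO calls carry the same requestId, making them easy to trace
            log.warn("slow dao call, cost={}ms", costMs);
        }
    }
}
```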

In multiple internal fault drills, the service degradation mechanism effectively protected the system from being paralyzed by faults.

Service availability guarantee

In a traditional Kubernetes cluster, if a Node becomes unavailable for longer than a time threshold, the Pods on that Node are evicted and re-launched on other Nodes. In the serverless scenario, ECI management and control detects unhealthy ECIs through an asynchronous detection mechanism, marks their status as unavailable, and attaches events describing the cause so that ECI users are informed. ECI then cures the unhealthy ECI through active operation and maintenance, which triggers the control plane to restore the ECI to the Ready state. The main process is shown in the figure:

Process for handling unhealthy ECI:
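As a rough sketch of this detect-and-remediate loop, the code below uses hypothetical component names (EciStore, HealthProbe, Remediator); the real control-plane flow is what the figures describe.

```java
import java.util.List;

// Rough sketch of the asynchronous detect-and-remediate loop described above.
public class UnhealthyEciController {

    enum EciStatus { READY, UNAVAILABLE }

    record Eci(String id, EciStatus status) {}

    interface EciStore    { List<Eci> listAll(); void markUnavailable(String id, String reason); void markReady(String id); }
    interface HealthProbe { boolean isHealthy(String id); String unhealthyReason(String id); }
    interface Remediator  { boolean cure(String id); } // active O&M: restart sandbox, migrate host, etc.

    private final EciStore store;
    private final HealthProbe probe;
    private final Remediator remediator;

    UnhealthyEciController(EciStore store, HealthProbe probe, Remediator remediator) {
        this.store = store;
        this.probe = probe;
        this.remediator = remediator;
    }

    // Invoked periodically by an asynchronous detection task.
    public void reconcileOnce() {
        for (Eci eci : store.listAll()) {
            if (eci.status() == EciStatus.READY && !probe.isHealthy(eci.id())) {
                // 1. Mark the instance unavailable and record the cause as an event,
                //    so the Kubernetes side can take it out of the Service endpoints.
                store.markUnavailable(eci.id(), probe.unhealthyReason(eci.id()));
            } else if (eci.status() == EciStatus.UNAVAILABLE && remediator.cure(eci.id())) {
                // 2. Active O&M succeeded: the control plane restores the ECI to Ready.
                store.markReady(eci.id());
            }
        }
    }
}
```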

Process for restoring an ECI to Ready:

Plans & stress testing

Beyond the technical guarantees, fault injection, emergency plans, and stress test drills are also particularly important in stability construction. During Double Eleven we conducted multiple internal stress test drills, injecting faults into the system's common performance bottlenecks to simulate failures, and we prepared emergency plans for scenarios where a fault has already occurred. Repeated stress testing lets us evaluate the upper limit of system capacity on the one hand, and run large-scale drills that verify the system's degradation plans and assess its stability on the other.

Early warning & monitoring

At runtime during the promotion, early warning and monitoring are important measures for keeping the system stable. Through monitoring and alerting, system faults can be discovered in time and recovered from quickly.

4. System Robustness

Reflections and takeaways

A robust system not only needs to reduce the occurrence of problems, it also needs the ability to detect faults and recover from them quickly. Besides early warning and monitoring, building operation and maintenance capabilities is also very important.

The robustness of a system is reflected in its capacity, its fault tolerance, and the SLAs of the various resources it depends on. In the complex resource environment on the cloud in particular, because of the "bucket effect", a single dependency can easily render the entire system unavailable. Therefore, as the system keeps improving, we need to use chaos engineering and similar methods to find the current "weak points" and then optimize them specifically, improving the robustness of the whole system. Secondly, the system's ability to recover from faults and to degrade is also very important. Historically, ECS/ECI control has more than once suffered full-link avalanches, and ultimately P1/P2 failures, because a single user or a single slowed-down link dragged down the system. ECS/ECI control is Alibaba Cloud's most complex management and control system, with complex business logic and internal system dependencies; a problem in any one of many links can cause an avalanche across an application's entire call chain and global unavailability. Therefore, when a failure has already occurred, the ability to degrade dependencies can very effectively protect the system; this is also a very important direction for stability construction.

5. Summary

Future outlook

With the last wave of the Double Eleven traffic peak behind us, ECI has successfully passed the most stringent technical test for Alibaba engineers: Double Eleven. This article summarizes our experience of taking part in the event, in the hope of accumulating experience for the future construction of ECI stability. Of course, this is only a touchstone for ECI. As an infrastructure of the cloud native era, ECI still has a long way to go. Let's encourage each other!

This article was produced by, and with acknowledgments to: Liu Mi, Yu Yun, Jing Qi, Cun Cheng, Yu Feng, Jing Zhi, Hao Yu, Yue Xuan, Sa Jing, Shang Zhe, Yong Quan, Shi Dao, Mu Ming, Bing Chen, Yi Guan, Dong Dao, Buwu, Xiaoluo, Huaihuan, Changjun, Hanting, Boyan.

Original link

This article is original content from Alibaba Cloud and may not be reproduced without permission.

