Stability Assurance in Six Steps: A High-Availability Guide for Big Promotions

1. Introduction

Every year brings big promotion events, and everyone is familiar with the term "stability assurance". Although business scenarios differ, the "routines" largely converge on the same set of moves: full-link stress testing, capacity evaluation, rate limiting, contingency plans, and so on.

Stepping out of these "routines" and returning to the essence of the problem: why do we follow these strategies?

Beyond experience passed down by word of mouth, what else can we rely on? What is the theoretical basis?

2. What kind of system is stable?

Let's first answer another question: what kind of system counts as stable?

Google's SRE books (the SRE trilogy [1]) describe a layered model of the foundations and higher-level requirements of system reliability, Dickerson's Hierarchy of Service Reliability, shown in the figure below:

This model was proposed by Google SRE engineer Mikey Dickerson in 2013. It organizes the requirements for system stability into levels, from the most basic upward, forming a pyramid of stability standards.

The base of the pyramid is Monitoring, the most basic stability requirement of any system. A system without monitoring is like a horse galloping blindfolded: there is no controllability to speak of, let alone stability.

The next level up is Incident Response. From the moment a problem is discovered through monitoring to its final resolution, the time consumed depends directly on the maturity of the incident-response mechanism. A sound response strategy ensures that when a failure occurs, every problem is handled in an orderly and proper manner, rather than descending into panic and chaos.

Postmortem & Root Cause Analysis is what we usually call the "review". Although many people dislike this activity, we have to admit it is the most effective way to avoid making the same mistake again: only when we understand the root cause of a failure and the underlying defect can we prescribe the right remedy and reasonably avoid a repeat.

If a system were never updated or iterated after its initial release, doing the three things above would basically cover all of its stability needs. Unfortunately, such systems barely exist today; applications of every size depend on constant change and frequent releases. To keep the system stable through these iterations, testing and release procedures are indispensable: an effective testing and release strategy ensures that every new variable introduced into the system stays within a controllable, stable range, so that the overall service remains stable in its end state. Besides changes in code logic, iteration may also bring changes in business scale and traffic; Capacity Planning is the guarantee strategy for this kind of change. Whether the existing system capacity can support the new traffic demand, and whether there are under-provisioned weak nodes anywhere on the overall link, are the questions capacity planning must answer.

At the top of the pyramid are product design (Product) and software development (Development): making the system more reliable through excellent product and software design, building a highly available product architecture, and improving the user experience.

3. How to ensure stability for a big promotion

The pyramid model shows the areas of work needed to build and maintain a highly available service. The question then returns to the big promotion: how do we systematically ensure the stability of the system during a big promotion?

Big-promotion assurance is, in essence, concentrated stability work aimed at a specific business scenario. Compared with day-to-day assurance, it is characterized by high concurrent traffic and a short assurance window, with explicit requirements on system performance and on the time available (usually about two months).

Given these characteristics, how can we, in a short time, optimize and consolidate the system's stability for the high-traffic scenario of a big promotion?

Since time is limited, casting a wide net blindly is certainly not the best strategy; the work must target key points and weak points. The first step is therefore to obtain the current state of the overall system link, including key external dependencies and key business impacts, to find the core focus of the whole effort. Next, we analyze the big-promotion business data to identify the variable factors beyond the system itself. Based on these two inputs, we carry out targeted, concentrated construction around the monitoring, capacity planning, incident response, testing, and review requirements of the pyramid model, and obtain the final assurance result.

At this point we have the complete strategic outline for big-promotion stability assurance, executed in the following order:

  1. System link & business strategy review (System & Biz Profiling)
  2. Monitoring
  3. Capacity Planning
  4. Incident Response
  5. Testing
  6. Postmortem (after-the-fact review)

1 System & Biz Profiling: System Link Analysis

System link analysis is the foundation of all the assurance work. It is like a comprehensive physical examination of the whole application system: starting from the traffic entrances and following the trajectory of each link, nodes are examined layer by layer to obtain an overall picture of the system and its core assurance points.

Traffic entrance inventory

A system often has a dozen or more traffic entrances, covering HTTP, RPC, messaging, and other sources. If you cannot cover every link, start the inventory from the following three types of entrances (a small illustrative sketch follows the list):

  • Core heavily-guaranteed traffic entrances
    • Entrances whose service SLI committed to users is high, with explicit requirements on data accuracy, service response time, and reliability.
    • Entrances serving enterprise users.
  • Entrances associated with asset-loss events
    • Related to the company's or customers' funds or income.
    • Billing and charging.
  • High-traffic entrances
    • The system's top 5-10 entrances by TPS & QPS.
    • These entrances may not carry higher SLI or asset-loss requirements, but their traffic is large and has a strong impact on overall system load.
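
As a rough illustration (not something prescribed by the article), the entrance inventory can be kept as structured data so the three categories above can be filtered mechanically; all field names and thresholds below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TrafficEntrance:
    name: str              # e.g. an HTTP endpoint, RPC service, or message topic
    protocol: str          # "HTTP" / "RPC" / "MQ"
    committed_sli: float   # availability committed to users, e.g. 0.9995
    asset_loss_risk: bool  # does a failure touch money (billing, settlement)?
    peak_qps: float

def core_entrances(entrances, sli_floor=0.999, top_n=10):
    """Pick the entrances worth focusing on: high committed SLI,
    asset-loss risk, or among the top-N traffic sources."""
    by_traffic = sorted(entrances, key=lambda e: e.peak_qps, reverse=True)[:top_n]
    picked = {e.name for e in by_traffic}
    picked |= {e.name for e in entrances if e.committed_sli >= sli_floor or e.asset_loss_risk}
    return [e for e in entrances if e.name in picked]
```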

Layered node assessment

A traffic entrance is like the loose end of a tangled ball of thread: once you pull it, the nodes along the link (HSF\DB\Tair\HBase and all other external dependencies) can be classified layer by layer, following the traffic's path, according to their degree of dependence, availability, and risk (a consolidated sketch follows the three judgment lists below).

(1) Judging strong and weak dependencies

  • If the node being unavailable interrupts the link's business logic or damages it severely (beyond a certain tolerance threshold), it is a strong business dependency; otherwise it is a weak one.
  • If the node being unavailable interrupts the link's execution logic (an error is returned), it is a strong system dependency; otherwise it is a weak one.
  • If the node being unavailable degrades system performance (e.g. synchronous waiting), it is a strong system dependency; otherwise it is a weak one.
    • Under a fail-fast design this kind of node should not exist, but if one appears and the application code cannot be changed, treat it as a strong dependency.
  • If the node's unavailability is not perceived by the link, or it can be degraded, or there is a substitute with only slight business loss, it is a weak dependency.

(2) Identifying low-availability dependency nodes

  • The node's service suffers from serious timeouts in daily operation.
  • The node's system resources are insufficient.

(3) Identifying high-risk nodes

  • The node's system has undergone a major version change or rebuild since the last big promotion.
  • The node is newly launched and has never been through a big promotion.
  • The node's system has previously suffered a severe failure.
  • A failure of the node carries asset-loss risk.
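
To make the three judgments above repeatable, they can be recorded as a simple checklist per node. The sketch below is illustrative only; the field names are hypothetical and it simply encodes the strong/weak, low-availability, and high-risk rules listed above.

```python
from dataclasses import dataclass

@dataclass
class NodeAssessment:
    name: str
    breaks_business_logic: bool     # business logic interrupted or badly damaged if node is down
    breaks_execution: bool          # link execution returns an error if node is down
    degrades_performance: bool      # synchronous waits / thread exhaustion if node is down
    frequent_timeouts: bool         # daily timeouts are already serious
    resources_tight: bool           # node's system resources are insufficient
    changed_since_last_promo: bool  # major rebuild since the last big promotion
    never_seen_promo: bool          # newly launched, never been through a big promotion
    past_severe_failure: bool       # has had a high-level failure before
    asset_loss_on_failure: bool     # failure carries asset-loss risk

def classify(n: NodeAssessment) -> dict:
    strong = n.breaks_business_logic or n.breaks_execution or n.degrades_performance
    return {
        "dependency": "strong" if strong else "weak",
        "low_availability": n.frequent_timeouts or n.resources_tight,
        "high_risk": (n.changed_since_last_promo or n.never_seen_promo
                      or n.past_severe_failure or n.asset_loss_on_failure),
    }
```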

Expected outputs

After this analysis, we should produce the following data: an analysis of every core link in the business domain, with strong and weak technical & business dependencies, core upstream and downstream systems, and asset-loss risks clearly marked.

The figure below is an example of a single link analysis:

2 System & Biz Profiling: Business Strategy Synchronization

Unlike routine high-availability construction, big-promotion stability assurance is targeted work for a specific business activity. Business strategies and business data are therefore indispensable inputs before the assurance work begins.

Big-promotion business data can generally be divided into two categories: global business-scale assessment, and emergency strategies & campaign mechanics.

Global assessment

This kind of data helps us carry out accurate traffic estimation, peak forecasting, and staffing for the promotion. It generally includes the following items:

  • Big-promotion time window (day XX to day XX)
  • Estimated business volume (X times the daily volume)
  • Estimated peak dates
  • Estimated traffic distribution across business scenarios

Emergency strategies & campaign mechanics

This kind of data captures the business variables of this promotion compared with previous ones; it feeds into emergency plans and high-risk node assessment. It generally includes two categories:

  • Special campaign mechanics for this promotion
  • Emergency campaign strategies

3 Monitoring: Monitoring & Alerting Review

The industry generally uses two modes of monitoring: black-box monitoring and white-box monitoring. Black-box monitoring is symptom-oriented: it observes anomalies that are already occurring (not about to occur), that is, existing failures of the system. White-box monitoring relies on the system's internal metrics; it is both symptom-oriented and cause-oriented: it can give early warning of anomalies the system is about to face, and when an anomaly does occur it can drill into lower-level internal metrics to locate the root cause. For big-promotion stability assurance, we therefore generally choose white-box monitoring.

From a monitoring perspective, a system can generally be divided into three layers, top to bottom: business (Biz), application (Application), and system (System). The system layer is the bottom layer and reflects the state of the operating system; the application layer is the JVM layer, covering the state of the main application process and its middleware; the business layer is the top layer and reflects the externally visible behavior of the service from the business point of view.

When reviewing monitoring for big-promotion stability, it therefore pays to first set aside the existing monitoring and start from the core and asset-loss links: work out which monitoring is needed at the business, application (middleware, JVM, DB), and system layers, then look for the corresponding monitors and alerts. If one does not exist, add it; if it does, check whether its threshold, duration, and alert recipients are reasonable.

Monitoring

A monitoring system generally has four golden signals: Latency, Errors, Traffic, and Saturation. The key monitors at each layer can also be organized around these four signals, as follows:

Table 1
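
Table 1 itself appears as an image in the original. As a rough, hypothetical illustration of how the four golden signals can map onto the three layers, the checklist might look like the sketch below; the concrete metric names are assumptions, not the article's table.

```python
# Illustrative only: which golden-signal metrics to watch at each layer.
MONITORING_MODEL = {
    "Biz": {
        "Latency":    ["end-to-end order RT"],
        "Errors":     ["order failure rate", "payment failure rate"],
        "Traffic":    ["orders per second", "core-API QPS"],
        "Saturation": ["backlog of pending business tasks"],
    },
    "Application": {
        "Latency":    ["service RT", "middleware call RT (DB/cache/MQ)"],
        "Errors":     ["exception count", "HTTP/RPC error codes", "GC time"],
        "Traffic":    ["QPS/TPS per service", "MQ consume rate"],
        "Saturation": ["thread-pool usage", "connection-pool usage", "JVM heap"],
    },
    "System": {
        "Latency":    ["disk I/O wait"],
        "Errors":     ["kernel / OOM errors"],
        "Traffic":    ["network in/out", "disk throughput"],
        "Saturation": ["CPU", "load", "memory", "disk usage", "file descriptors"],
    },
}
```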

Alerting

Does every monitor need an alert? Of course not. It is recommended to prioritize alerts at the Biz layer, because the Biz layer most directly reflects how our service appears externally and is closest to the user experience. Application & System layer metrics are mainly used for monitoring and troubleshooting; alerts can be configured for a few key & high-risk metrics to help locate faults and detect them early.

For an alert, we generally care about its level, its threshold, and who gets notified.

1) Level

That is, how severe the problem is when the alert fires. Common yardsticks include:

  • Whether it is associated with a GOC incident
  • Whether it has a serious business impact
  • Whether it causes asset loss

2) Threshold

That is, the conditions & duration that trigger an alert. These need to be set sensibly for the specific scenario, generally following these principles (a small illustrative rule follows the list):

  • Not too sluggish. In a sound monitoring system, any anomaly should trigger the relevant alert.
  • Not too sensitive. Over-sensitive thresholds cause frequent alerts, fatigue the responders, and make it impossible to pick out real anomalies. If an alert fires frequently, there are usually two causes: poor system design or a poorly chosen threshold.
  • If a single metric cannot cover the whole business scenario, combine multiple metrics into one alert condition.
  • Thresholds should follow the business fluctuation curve; different conditions & notification strategies can be set for different time periods.
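
As a hedged illustration of combining multiple metrics and following the business curve, an alert condition might be evaluated like the sketch below; the metric names, thresholds, and peak window are all made-up assumptions, and in practice such rules would live in the monitoring platform rather than in application code.

```python
from datetime import datetime

def order_drop_alert(order_qps: float, error_rate: float, now: datetime) -> bool:
    """Fire only when a traffic drop AND an error rise agree,
    to avoid noisy single-metric alerts."""
    peak_hours = 20 <= now.hour or now.hour < 2   # assumed evening peak window
    qps_floor = 3000 if peak_hours else 200       # expected minimum QPS for the time of day
    return order_qps < qps_floor and error_rate > 0.01
```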

3) Recipients & channel

For a business-metric anomaly (Biz-layer alert), the recipients should include both the problem handlers (development and operations staff) and the business stakeholders (TL, business staff), and the channel should be near real time, such as a phone call.

For Application & System layer alerts, which are mainly used to locate the cause of an anomaly, the recipients can be the troubleshooting staff, and low-interruption channels such as DingTalk messages or SMS can be used.

Beyond the level itself, the recipient list can be widened appropriately for higher-level alerts. In particular, alerts on metrics tied to GOC incidents should have a wider recipient list and a more immediate, direct channel.

Expected outputs

After completing this review, we should produce the following data:

  • A system monitoring model, in the same format as Table 1
    • What needs to be monitored at the Biz, Application, and System layers respectively
    • Whether monitors already exist for all of these points, and which still need to be added
  • A list of system alerts, each entry containing at least the following data
    • The associated monitoring metric (link)
    • The alert severity level
    • Whether it is pushed to GOC
    • Whether it involves asset loss
    • Whether it is associated with a failure record
    • Whether it is associated with a plan
  • A business metrics dashboard, containing the key Biz-layer monitoring metrics.
  • A system & application metrics dashboard, containing the key system metrics of the core systems; it can be used with white-box monitoring to locate problems.

4 Capacity Planning

The essence of capacity planning is to strike a balance between minimizing computing risk and minimizing computing cost; pursuing either extreme alone is unreasonable. To find the best balance, the system's peak traffic must be estimated as accurately as possible and then converted into capacity based on the load limits of individual resource nodes, yielding the final capacity-planning model.

Traffic model estimation

1) Ingress traffic

For a big promotion, the system's peak ingress traffic is generally a superposition of regular business traffic and unconventional increments (for example, changes in the traffic mix caused by disaster-recovery plans or business marketing strategies).

(A) Regular business traffic can generally be estimated in two ways:

Historical traffic algorithm: this assumes that this promotion's year-on-year growth follows the historical traffic pattern exactly. From current and previous daily traffic it builds a year-on-year growth model of overall business volume; from last year's promotion-versus-daily ratio it builds a promotion uplift model; the two are then combined to produce the final estimate.

Because this calculation requires no business input, it can be used in the early stage of the work, before any business-volume estimate is available, to obtain an initial traffic estimate (a small sketch follows).
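
A minimal sketch of this fitting, under the algorithm's own assumption that the promotion-to-daily ratio repeats year over year; the function and its inputs are hypothetical.

```python
def historical_estimate(daily_peak_now: float, daily_peak_last_year: float,
                        promo_peak_last_year: float) -> float:
    """Assume this year's promo-to-daily ratio matches last year's,
    scaled by the year-on-year growth in daily traffic."""
    yoy_growth = daily_peak_now / daily_peak_last_year
    # Equivalently: daily_peak_now * (promo_peak_last_year / daily_peak_last_year)
    return promo_peak_last_year * yoy_growth
```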

Business-volume-to-traffic conversion algorithm (GMV\DAU\order volume): this takes the estimated total business volume (GMV\DAU\order volume) as input and converts it into per-domain traffic estimates using the historical promotion & daily business-volume-to-traffic conversion model (for example, the classic funnel model).

This method depends heavily on the estimate of total business volume. It can be used in the middle and later stages of the work to fold the business assessment into the initial traffic estimate (a hypothetical funnel sketch follows).
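
A minimal sketch of what such a funnel conversion might look like; the funnel stages, parameter names, and numbers are hypothetical and stand in for whatever conversion model the business actually uses.

```python
def estimate_page_qps(gmv_peak_hour: float, avg_order_value: float,
                      detail_to_order_conversion: float, peak_second_ratio: float) -> float:
    """Hypothetical funnel: GMV -> orders -> detail-page views -> peak QPS.
    All parameters are assumptions supplied by the business side."""
    orders_per_hour = gmv_peak_hour / avg_order_value
    detail_views_per_hour = orders_per_hour / detail_to_order_conversion
    return detail_views_per_hour / 3600 * peak_second_ratio   # spike factor within the hour

# e.g. 360M GMV in the peak hour, 200 per order, 5% conversion, 3x spike factor
print(estimate_page_qps(360_000_000, 200, 0.05, 3))   # -> 30,000 QPS
```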

(B) Unconventional increments generally refer to extra traffic caused by changes to front-end marketing strategy, or by shifts in the traffic model after an emergency plan is executed. For example, if the NA61 data center fails and 100% of its traffic is switched to NA62, the traffic model changes accordingly.

In the interest of cost minimization, an unconventional increment P generally does not need to be added in full, alongside the regular business traffic W, into the superimposed ingress traffic K. Instead, the probability λ that the unconventional strategy is actually triggered is used as a weight.
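
Written out with the symbols above (the original article shows the formula as an image, so this is a reconstruction from those definitions rather than a quote), the superimposed ingress traffic is roughly

K = W + Σᵢ (λᵢ × Pᵢ)

where the sum runs over the unconventional scenarios i being considered.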

2) Node traffic

Node traffic is derived from the ingress traffic in proportion, according to the traffic branch (fan-out) model. The branch traffic model is built on top of the system link and follows these principles (a small fan-out sketch follows the list):

  • Under the same entrance, different links are calculated independently.
  • If the same node is called multiple times on one link, its traffic must be amplified by the call multiple (e.g. DB\Tair).
  • Pay particular attention to DB write traffic: hot spots may cause the DB to hang.
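
A minimal sketch of the fan-out calculation under these principles, assuming a hypothetical link description in which each node carries a branch ratio and a per-request call count:

```python
# Hypothetical link model: for each entrance, a list of (node, branch_ratio, calls_per_request).
LINKS = {
    "placeOrder": [
        ("inventory-service", 1.0, 1),
        ("order-db-write",    1.0, 2),   # called twice per request -> amplify by 2
        ("tair-cache",        0.8, 3),   # only 80% of requests take this branch
    ],
}

def node_traffic(ingress_qps: dict) -> dict:
    """Convert ingress QPS into per-node QPS, link by link (links are summed independently)."""
    totals: dict = {}
    for entrance, qps in ingress_qps.items():
        for node, ratio, calls in LINKS.get(entrance, []):
            totals[node] = totals.get(node, 0.0) + qps * ratio * calls
    return totals

print(node_traffic({"placeOrder": 50_000}))
```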

Capacity conversion

1) Little's Law derivation

Different types of resource nodes (application containers, Tair, DB, HBase, etc.) have different traffic-to-capacity conversion ratios, but they all follow the same rule derived from Little's Law: concurrency = arrival rate × response time. Roughly, a single node's throughput ceiling is its number of concurrent workers divided by the average RT, and the required number of nodes is the peak traffic divided by that single-node ceiling.

2) N + X redundancy principle

  • On top of the minimum capacity needed for the target traffic, reserve X units of redundant capacity.
  • X is positively correlated with the target cost and the failure probability of the resource node: the higher the probability of unavailability, the larger X should be.
  • For ordinary application container clusters, X = 0.2N is a reasonable default.

The rules above can only give an initial capacity estimate (before stress testing, or for new dependencies); the final, accurate system capacity must still be obtained through regular stress tests of the system (a small conversion sketch follows).
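
A hedged sketch of the conversion for an application container cluster, combining the Little's Law derivation above with the N + X redundancy rule; the thread counts, RT, and 0.2 redundancy default are illustrative and must still be validated by stress tests.

```python
import math

def required_nodes(peak_qps: float, avg_rt_sec: float, threads_per_node: int,
                   redundancy: float = 0.2) -> int:
    """Initial capacity estimate for an application container cluster.

    Little's Law: concurrency = arrival_rate * RT, so one node with
    `threads_per_node` workers sustains at most threads_per_node / RT QPS.
    """
    single_node_qps = threads_per_node / avg_rt_sec
    n = math.ceil(peak_qps / single_node_qps)   # minimum N for the target traffic
    return math.ceil(n * (1 + redundancy))      # N + X, with X = 0.2N by default

# e.g. 50,000 QPS peak, 100 ms average RT, 200 worker threads per container
print(required_nodes(50_000, 0.1, 200))   # -> 25 base nodes, 30 with redundancy
```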

Expected outputs

  • The ingress traffic model based on the traffic estimation & the capacity conversion result for the cluster itself (for non-ingress applications, organized by rate-limiting points).
  • The branch traffic model based on the link analysis & the capacity conversion results for external dependencies.

5 Incident Response: Emergency Plans & Pre-plans

To respond quickly to online emergencies under the high concurrent traffic of a big promotion, relying on the on-the-spot improvisation of the on-duty staff is not enough. In a race against time there is no room for the handlers to ponder strategy, and wrong decisions often lead to even more uncontrollable and serious business & system impact. Therefore, to respond quickly and correctly on the day, the on-duty staff should be answering multiple-choice questions (Which), not open-ended ones (What), and the choices are our business & technical plans.

By execution timing and the nature of the problem they address, plans fall into four categories: technical emergency plans, technical pre-plans, business emergency plans, and business pre-plans. Combining the earlier link analysis and service assessment, we can quickly work out which plans a link needs, following these principles:

  • Technical emergency plans: used to handle the unavailability of some node on the system link, for example the unavailability of strongly-depended technical/business nodes, low-stability nodes, high-risk nodes, and similar abnormal scenarios.
  • Technical pre-plans: used to balance overall system risk against the availability of individual services, securing global reliability through circuit breaking and similar strategies; for example, degrading low-stability & weakly-depended services in advance, or pausing the scheduling of offline tasks that would collide with the traffic peak.
  • Business emergency plans: used to handle problems that need urgent intervention but are caused by non-system anomalies such as business changes, for example incorrect business data (on nodes sensitive to data correctness) or business strategy adjustments (in coordination with the business's emergency strategy).
  • Business pre-plans: used for service adjustments made in advance on the business side to support the overall business strategy (non-system requirements).

Expected outputs

After this work, each plan we produce should contain the following information (a small illustrative record follows the list):

  • Execution & roll-back time (pre-plans)
  • Trigger threshold (emergency plans; the related alert must be linked)
  • Associated impact (system & business)
  • Decision-maker, executor, and verifier
  • Activation verification
  • Deactivation threshold (emergency plans)
  • Deactivation verification
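
As a rough illustration, each plan can be kept as a structured record carrying the fields listed above, so that nothing is missing when plans are reviewed; the field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Plan:
    name: str
    kind: str                                 # "tech-emergency" / "tech-pre" / "biz-emergency" / "biz-pre"
    execute_window: Optional[str] = None      # pre-plans: when to execute and when to roll back
    trigger_threshold: Optional[str] = None   # emergency plans: must reference a concrete alert
    linked_alerts: List[str] = field(default_factory=list)
    impact: str = ""                          # associated system & business impact
    owners: dict = field(default_factory=dict)   # {"decision": ..., "execute": ..., "verify": ...}
    enable_check: str = ""                    # how to verify the plan took effect
    disable_threshold: Optional[str] = None   # emergency plans: when to switch it off
    disable_check: str = ""                   # how to verify recovery after switching off
```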

Phased output: the full-link combat map

After completing the work above, we can assemble a global link combat map containing the branch traffic model, strong and weak dependency nodes, asset-loss assessments, the corresponding plans & handling strategies, and so on. During the promotion, this map lets us quickly assess the impact of an incident from a global perspective, and it can also be used to evaluate whether the plans and capacity are complete and reasonable.

6 Incident Response: Combat Manual

The combat manual is the basis for action throughout the promotion and runs through its entire life cycle. It can be organized around three phases: before, during, and after the event.

It should be compiled precisely and in detail; ideally, even an on-duty engineer unfamiliar with the business and the system could respond quickly to an online problem with the help of the manual.

Before the event

1) Pre-inspection checklist

The checklist must be executed before the promotion and usually contains the following items:

  • Restart the cluster machines or trigger a manual full GC
  • Clean up shadow-table data
  • Check upstream and downstream machine permissions
  • Check rate-limit values
  • Check the consistency of machine switches
  • Check database configuration
  • Check middleware capacity and configuration (DB\Cache\NoSQL, etc.)
  • Check that monitoring is working (business dashboards, technical dashboards, core alerts)
  • Each item should record three columns: the specific executor, the inspection plan, and the inspection result

2) Pre-plans

All business & technical pre-plans in the domain.

During the event

1) Emergency technical & business plans

Their contents are basically the same as the pre-plans, with the following differences:

  • Execution conditions & recovery conditions: concrete trigger thresholds, each corresponding to a monitoring alert.
  • The decision-maker to notify.

2) Emergency tools & scripts

Common troubleshooting methods, stop-the-bleeding actions for core alerts (unavailability of strong or weak dependencies, etc.), business-related log retrieval scripts, and so on.

3) Alerts & dashboards

This section should contain the review results for business, system cluster, and middleware alerts and monitors, the core business and system dashboards, and the details of the corresponding log data sources:

  • Log data source details: data source name, file location, sample, and split format.
  • Business, system cluster, and middleware alert & monitoring review results: the associated monitoring metric (link), alert severity level, whether it is pushed to GOC, whether it involves asset loss, whether it is associated with a failure record, and whether it is associated with a plan.
  • Core business & system dashboards: dashboard URL, with metric details (meaning, whether an alert is associated, and the corresponding logs).

4) Upstream and downstream machine grouping

This should cover the core system and its upstream and downstream systems, grouped by data center, unit cluster, and application name. It can be used for machine permission checks beforehand and for emergency troubleshooting during the event.

5) On-duty notes

Including the must-dos for each on-duty engineer, the emergency change process, links to the core dashboards, and so on.

6) Core metrics to broadcast

Including core system & service metrics (CPU\LOAD\RT), business metrics of interest, and so on; each metric should specify the concrete monitoring URL and how it is collected.

7) Contact list & duty roster for the domain and related domains

Including the duty roster and contact details (phone) of the technical staff, TLs, and business parties in the domain, as well as the duty arrangements of related upstream and downstream teams (DB, middleware, etc.).

8) On-duty problem log

A battle log recording tickets, business problems, and plans executed (pre-plans and emergency plans), containing at least: time, problem description (screenshot), impact analysis, and the decision & resolution process. On-duty staff should complete the log before handing over their shift.

After the event

1) System recovery checklist (rate limits, scale-in)

This generally mirrors the pre-inspection checklist and covers the main post-event recovery operations, such as adjusting rate-limit thresholds back and scaling clusters in.

2) Big-promotion issue review record

A review summary of the core incidents encountered during the promotion.

7 Incident Response: Sand-Table Exercise

The sand-table exercise is the last piece of assurance work in incident response. Real historical failure cases are used as inputs to simulate emergencies during the promotion, with the aim of testing how the on-duty staff handle urgent problems.

Generally, an online problem goes from discovery to resolution through location, investigation, diagnosis, and repair, and the following principles usually apply:

  • Restore service first whenever possible, while preserving the scene (machines, logs, load and metric records) for root-cause investigation.
  • Avoid blind searching; use white-box monitoring to diagnose in a targeted way.
  • Divide the work in an orderly way so that everyone plays their own role, instead of piling in and losing control.
  • Continuously assess the scope of impact based on the live situation; for problems that technical means cannot rescue (such as an unavailable strong dependency), switch to thinking about them as business problems (scope and degree of impact, whether there is asset loss, how to coordinate with the business side).
  • The sand-table exercise tests the on-duty staff's failure-handling ability, focusing on three aspects: the stop-the-bleeding strategy, the division of labor, and problem location.

Example: the Double 11 buyer-domain exercise of the internationalized middle platform.

Depending on the type of failure, common stop-the-bleeding strategies include the following (a small lookup sketch follows the list):

  • Ingress rate limiting: lower the rate-limit value at the source of the corresponding Provider service
    • Used when a burst of excessive traffic overwhelms the load of the system itself or of strongly-depended downstream services.
  • Downstream degradation: degrade the corresponding downstream service
    • When a weakly-depended downstream service is unavailable.
    • When a strongly-depended downstream business service fails, degrade after business approval (accepting partial business loss).
  • Single-point removal: take unavailable nodes out of service
    • When a single machine's load is too high or its service is unavailable, take it out of rotation first (the machine does not need to be decommissioned, so the scene is preserved).
    • Used for single-point unavailability or poor performance within a cluster.
  • Switchover: switch units or switch to the standby
    • Used when a single database or a unit's dependency, due to its own problems (host or network), causes a drop in the success rate of part of the traffic.
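
As an illustration only (not tooling from the article), the mapping from failure symptom to first stop-the-bleeding action can be written down in advance so that on-duty staff choose from fixed options; the symptom names are hypothetical.

```python
# Hypothetical symptom -> first stop-the-bleeding action, mirroring the list above.
STOP_THE_BLEEDING = {
    "burst_traffic_overload": "Lower the rate limit at the Provider entrance",
    "weak_dependency_down":   "Degrade the downstream service immediately",
    "strong_dependency_down": "Degrade after business sign-off (accepting partial business loss)",
    "single_node_unhealthy":  "Take the node out of rotation; keep it online to preserve the scene",
    "db_or_unit_degraded":    "Switch the unit / fail over to the standby",
}

def first_action(symptom: str) -> str:
    return STOP_THE_BLEEDING.get(symptom, "Escalate to the incident commander")
```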

Google's SRE book lists the following elements of incident management:

  • Recursive (nested) separation of responsibilities, i.e. a clear division of roles
  • A recognized command post / war room
  • A live incident state document
  • Clear, public handoff of responsibilities

Among these, the recursive separation of responsibilities means a clear division of roles and duties, so that everyone performs their own part and the response stays orderly. It generally breaks down into the following roles:

  • Incident commander: coordinates the division of labor and any unassigned tasks, and holds the overall summary information; usually the PM/TL.
  • Operations team: the people actually handling the incident; they can be split into sub-teams by business scenario & system characteristics, each with a lead who communicates with the incident commander.
  • Communications lead: the incident's external spokesperson, responsible for periodically syncing information between those handling the incident and external stakeholders, and for keeping the incident document updated in real time.
  • Planning lead: provides ongoing support, for example organizing responsibility-handoff records when a large incident spans multiple shifts.


This article is the original content of Alibaba Cloud and may not be reproduced without permission.
