A New Paradigm for Continuous Testing: Integrating Dial Testing and Stress Testing

Author: Fu Yi

Recently, at the TiD2023 Quality Competitiveness Conference, Wu Yao from Alibaba Cloud's Cloud Native Observability Team gave a talk titled "A New Paradigm for Continuous Testing: Integrating Dial Testing and Stress Testing". The talk covered three parts:

  • Business continuity requires a stability platform
  • Evolution and trend analysis of Alibaba's stability platform
  • The concept and best practices of integrated dial testing and stress testing

How to ensure business continuity

Before getting to today's topic, let's talk about business continuity. With the rapid development and widespread adoption of information technology, business innovation and day-to-day operations, especially in the Internet and financial industries, increasingly depend on information systems running safely and stably. How to ensure that the key business functions supported by those systems can be restored in time and keep operating after a failure or disaster, so as to reduce the resulting losses, has become a key consideration in both building and operating technology.

Enterprises and government agencies alike have long attached great importance to disaster recovery and business continuity, and have issued a number of standards and guidelines, such as the "Information Security Technology Information System Disaster Recovery Specification" (GB/T 20988-2007) and the "Banking Information System Disaster Recovery Management Specifications" (JR/T 0044-2008). The industry also offers many business continuity models to guide enterprises, the best known of which is the 6R model. The 6R model describes the complete life cycle of a fault from occurrence to resolution. Across that cycle, business continuity is protected by three lines of defense: prevention and control before the incident, response during the incident, and reconstruction after the incident.

Before a business interruption occurs, the work is mainly prevention and control. This is the Reduce (risk reduction) stage, in which the team carries out day-to-day risk management, IT operation and maintenance management, business continuity management, and similar work. After a business interruption occurs, the in-incident response is divided into the Respond (emergency response) stage and the Recover & Resume (recovery and restart) stage. The Respond stage covers convening personnel, understanding and reporting the situation, assessing damage, and troubleshooting. The Recover stage mainly executes the recovery plans, which cover the IT side, the business side, and supporting functions, and which are initiated once a failure or disaster is declared. After the recovery plans have been executed and the situation has stabilized, the process enters the Restore and Return stages, and the business returns to its normal state.

In continuously summarizing and reviewing our stability work, we found that the more we invest in pre-incident prevention and control and in in-incident response, the fewer faults occur over the course of the year. So, as we keep investing in stability, we have identified two core needs, namely reinforcing two lines of defense:

Reinforcing the first line of defense: run stress tests that simulate real traffic to verify system capacity, and run fault drills to verify the system's disaster tolerance.

Reinforcing the second line of defense: shift the point at which an interruption is perceived to the left so that business failures are detected promptly, and establish a switch plan mechanism to degrade quickly and stop losses.

The first is reinforcing the first line of defense, that is, pre-incident prevention and control that intercepts as many faults as possible. On the one hand, sufficient functional testing is carried out before the business goes live, together with capacity testing of key core businesses under simulated real traffic, which is stress testing. On the other hand, fault drills are conducted in the pre-production or grayscale environment before the whole system goes online, for example by injecting faults into the infrastructure layer and the application layer and observing whether the system's self-healing ability meets expectations. The goal of reinforcing the first line of defense is to eliminate such faults before the system goes online.

The second is reinforcing the second line of defense, that is, shortening the time spent responding to an incident. That time has two parts: perception time and recovery time. Perception time places new requirements on the monitoring and stability platform: move the perception point as far left as possible instead of waiting until customers have noticed the fault and reported it, so that faults are sensed proactively and losses are stopped quickly once a fault is detected. Meanwhile, as more and more unexpected failures occur in production, we need a complete contingency plan mechanism. This is essentially the construction of an SRE system, which handles failures through a complete set of response mechanisms. In Alibaba's practice, we designed a switch plan mechanism that abstracts historical failures into plans, with dynamically configurable feature degradation switches for use during fault handling. During a major promotion, if the capacity of some services reaches the water-level threshold and threatens stability, the corresponding features can be degraded directly through dynamic switches to keep the user experience smooth.
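To make the idea of a dynamically configurable degradation switch more concrete, here is a minimal sketch. It is not Alibaba's implementation: the `SwitchPlan` class, the `recommendations` feature name, and the 0.8 water-level threshold are all hypothetical, and a real system would read its switches from a configuration center rather than from memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SwitchPlan:
    """Hypothetical in-memory registry of feature degradation switches."""
    # Water-level thresholds (0.0 to 1.0) per feature; values are illustrative.
    thresholds: Dict[str, float]
    degraded: Dict[str, bool] = field(default_factory=dict)

    def report_water_level(self, feature: str, level: float) -> None:
        """Flip the switch on when utilization reaches the threshold, off otherwise."""
        self.degraded[feature] = level >= self.thresholds[feature]

    def is_degraded(self, feature: str) -> bool:
        return self.degraded.get(feature, False)


def render_product_page(plan: SwitchPlan) -> List[str]:
    """Core sections are always served; optional ones obey their switch."""
    sections = ["price", "buy_button"]
    if not plan.is_degraded("recommendations"):
        sections.append("recommendations")
    return sections


plan = SwitchPlan(thresholds={"recommendations": 0.8})

plan.report_water_level("recommendations", 0.65)  # below the threshold
normal_page = render_product_page(plan)

plan.report_water_level("recommendations", 0.92)  # above the threshold
degraded_page = render_product_page(plan)

print(normal_page)
print(degraded_page)
```

The point of the design is that the core path (price and purchase) never depends on the switch, so degrading an optional feature cannot break the transaction itself.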

Alibaba’s best practices for ensuring business continuity

Next, let’s take a look at the evolution of Alibaba and Alibaba Cloud’s stability system construction, and how to measure the benefits of system construction.

The evolution of the entire stability platform is closely related to the evolution of the technical architecture, and is mainly divided into three major stages.

First, when Taobao had just started, its technical architecture was mainly a monolithic application. As business volume grew, the PHP monolith was replaced by a Java monolith. By 2008, the Java monolith had also hit bottlenecks: its business logic was very complex, it had many developers, and iteration was very slow. From then on, Alibaba began splitting it into a distributed application architecture and, after Alibaba Cloud emerged, gradually migrated the core e-commerce transaction system to the cloud to cope with ever-growing business scale. In 2018, Alibaba ran essentially all of its business on the cloud and began exploring cloud-native approaches such as containerization and serverless.

Along with this architectural evolution, Alibaba has built stability around fault tolerance, geo-distributed multi-active disaster recovery, and capacity planning. Stability has been improved by introducing technical means such as a call link analysis platform, fault drill capabilities, and a stress testing system, and projects such as ChaosBlade have been open-sourced and accepted into the CNCF Sandbox.

In supporting stress testing both internally and externally, we also found that the responsibilities and authority of the testing role have been moving to the right on the DevOps loop diagram. More and more testing teams are no longer responsible only for functional and performance testing before release, which is the Test stage. After a feature goes online, they must also actively monitor site and online business availability through dial testing, which is the Monitor stage. Based on these trends and needs, the Cloud Native Observability Team proposed the concept of integrating dial testing and stress testing to help operations and testing teams build stability more effectively. The benefits to the team are clear:

  • Improved business stability: stress testing verifies system throughput to ensure capacity stability, while dial testing monitors online business availability in real time, discovers problems before the business side does, and reduces the blast radius.
  • Improved organizational efficiency: the testing team takes unified charge of dial testing and stress testing, so business test scripts no longer have to be prepared twice by the testing and operations teams; the operations team focuses on resource monitoring and leaves online business monitoring to the testing team.
  • Improved tool efficiency: dial testing and stress testing share one platform, one script syntax, and one set of test data, which makes engineers happier.

In terms of business benefits, these include fewer faults, shorter fault recovery time, longer fault-free time and intervals between failures, and less human effort spent on fault handling.
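Benefits of this kind are usually tracked with a few simple metrics. The sketch below shows one common way to compute mean time to recovery (MTTR), mean time between failures (MTBF), and availability. The 30-day window and the three incident durations are made-up illustrative numbers, not data from Alibaba, and the function names are hypothetical.

```python
from typing import List


def mttr_hours(downtimes: List[float]) -> float:
    """Mean time to recovery: average downtime per incident, in hours."""
    return sum(downtimes) / len(downtimes)


def mtbf_hours(window_hours: float, downtimes: List[float]) -> float:
    """Mean time between failures: total uptime divided by the number of incidents."""
    return (window_hours - sum(downtimes)) / len(downtimes)


def availability(window_hours: float, downtimes: List[float]) -> float:
    """Fraction of the observation window during which the business was up."""
    return (window_hours - sum(downtimes)) / window_hours


# Illustrative data only: a 30-day window with three incidents.
WINDOW_HOURS = 30 * 24            # 720 hours
incident_downtimes = [1.0, 0.5, 1.5]

mttr = mttr_hours(incident_downtimes)
mtbf = mtbf_hours(WINDOW_HOURS, incident_downtimes)
avail = availability(WINDOW_HOURS, incident_downtimes)

print(f"MTTR: {mttr:.2f} h, MTBF: {mtbf:.2f} h, availability: {avail:.4%}")
```

Fewer faults raise MTBF, and faster perception and recovery lower MTTR, so both lines of defense show up directly in these numbers.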

What is integrated dial testing and stress testing?

Business traffic often has peaks and valleys. During peak periods, business interruptions can be detected in time through server-side application monitoring and alarms. During low-traffic periods, however, detecting business interruptions becomes a problem. If the alarm threshold is configured based on monitoring metrics from peak periods, alarms will not be triggered during low-traffic periods and interruptions will go unnoticed. If the alarm threshold is configured too low, a large number of false alarms will be received during peak periods. To address these problems and the right shift of the testing role mentioned above, dial testing and stress testing are combined.
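The threshold dilemma can be illustrated with a small calculation. The traffic volumes, background error rate, and thresholds below are invented for illustration, and `count_alert` and `probe_alert` are hypothetical helper names. The sketch assumes a passive alarm that fires on the number of failed requests per minute, and an active probe alarm that fires after several consecutive failed probes.

```python
from typing import List


def count_alert(errors_per_minute: int, threshold: int) -> bool:
    """Passive alarm: fires when failed requests per minute reach the threshold."""
    return errors_per_minute >= threshold


def probe_alert(probe_results: List[bool], consecutive_failures: int = 3) -> bool:
    """Active alarm: fires when the last N probes all failed, regardless of traffic."""
    recent = probe_results[-consecutive_failures:]
    return len(recent) == consecutive_failures and not any(recent)


# Illustrative numbers only.
PEAK_RPM = 10_000                 # requests per minute at peak
VALLEY_RPM = 50                   # requests per minute in the valley
BACKGROUND_ERRORS_PER_1000 = 5    # normal failures unrelated to any outage

peak_normal_errors = PEAK_RPM * BACKGROUND_ERRORS_PER_1000 // 1000
valley_outage_errors = VALLEY_RPM  # total outage: every request fails

HIGH_THRESHOLD = 100  # tuned for peak traffic
LOW_THRESHOLD = 20    # lowered to catch valley outages

# Threshold tuned for peak: a total outage in the valley is missed.
missed_outage = not count_alert(valley_outage_errors, HIGH_THRESHOLD)

# Threshold lowered: healthy peak traffic now triggers a false alarm.
false_alarm = count_alert(peak_normal_errors, LOW_THRESHOLD)

# Probes run on a fixed schedule, so the outage is caught even with no real users.
outage_caught_by_probe = probe_alert([True, True, False, False, False])

print(missed_outage, false_alarm, outage_caught_by_probe)
```

Because the probe generates its own requests, its signal does not depend on how many real users happen to be online.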

(1) What is dial testing?

Dial testing is a zero-intrusion, out-of-the-box, proactive tool for monitoring service availability and performance. Using monitoring points deployed around the world, it simulates the business behavior of real users, initiates tests against the site at regular intervals, continuously monitors business continuity and network performance, and measures user experience. As an active monitoring service, it is unaffected by business peaks and valleys and protects business continuity throughout the entire cycle. The core capabilities and application scenarios of cloud dial testing are as follows:

(2) What is stress testing?

Stress testing is an indispensable tool in capacity planning, and I believe everyone is very familiar with it. According to different verification scenarios, stress testing can be divided into the following test types:

(3) Integrated dial testing and stress testing platform

As we can see, dial testing and stress testing both test the capacity, availability, and performance of a system by simulating the behavior of real users. In terms of business scenarios and system architecture, dial testing platforms and stress testing platforms are highly similar. We therefore merged the two into one integrated dial testing and stress testing platform with unified management and control of scripts, task scheduling, and traffic.

Business scripts must be prepared before stress testing, and dial testing requires the same kind of scripts. When dial testing and stress testing live on two separate platforms, the same business process has to be configured twice, once in each platform's syntax. With an integrated script, the work of configuring scripts is halved: one set of scripts can initiate both dial tests and stress tests.
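The idea of one script driving both modes can be sketched as follows. This is an illustration of the concept only, not the platform's actual script format or API: `Step`, `run_script`, `dial_test`, `stress_test`, and `fake_transport` are hypothetical names, and the transport is a stand-in that always returns HTTP 200 so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One request in a business script (hypothetical format)."""
    name: str
    url: str
    expected_status: int = 200


# A transport takes a URL and returns an HTTP status code.
Transport = Callable[[str], int]


def run_script(steps: List[Step], transport: Transport) -> bool:
    """Run every step once; the pass succeeds only if all statuses match."""
    return all(transport(step.url) == step.expected_status for step in steps)


def dial_test(steps: List[Step], transport: Transport,
              rounds: int = 3, interval_s: float = 0.0) -> List[bool]:
    """Dial test mode: one pass at a fixed interval, recording availability."""
    results = []
    for _ in range(rounds):
        results.append(run_script(steps, transport))
        time.sleep(interval_s)
    return results


def stress_test(steps: List[Step], transport: Transport,
                virtual_users: int = 20, iterations: int = 5) -> float:
    """Stress test mode: many concurrent virtual users; returns the success rate."""
    def user(user_id: int) -> List[bool]:
        return [run_script(steps, transport) for _ in range(iterations)]

    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        outcomes = [ok for batch in pool.map(user, range(virtual_users)) for ok in batch]
    return sum(outcomes) / len(outcomes)


def fake_transport(url: str) -> int:
    """Stand-in for a real HTTP client so the sketch runs offline."""
    return 200


# The same script is reused unchanged in both modes.
script = [
    Step("home", "https://www.example.com/"),
    Step("api", "https://www.example.com/api/test"),
]
availability = dial_test(script, fake_transport)
success_rate = stress_test(script, fake_transport)
print(availability, success_rate)
```

Only the scheduling differs between the two modes: low-frequency single passes for dial testing, high-concurrency repeated passes for stress testing. The script itself is written once.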

Best Practices for Integrated Dial Testing and Stress Testing

The Alibaba Cloud website speed testing platform supports dial testing for PING, TCP, DNS, website speed testing, HTTP interfaces, file downloads, and other scenarios, and supports stress testing of HTTP interfaces. You can also initiate a comparison test on the platform to understand the performance difference between two websites. The following describes how to verify site availability and interface performance through the Alibaba Cloud website speed testing platform.

(1) Initiate a website speed test task

Here, we simulate access to the Alibaba Cloud official website through the China Telecom, China Mobile, and China Unicom networks from 34 provincial capital cities across the country, to demonstrate how to use the Alibaba Cloud website speed testing platform to test a website's speed.

  1. Log in to the Alibaba Cloud website speed testing platform [1].

  2. Select the dial test type. Here, select website speed test.

  3. Click the drop-down box below the dial test type and select the monitoring points. Here, the selected operators are China Telecom, China Mobile, and China Unicom, and the selected regions are 34 provincial capital cities across the country.

  4. To the right of the drop-down box, enter the address of the web application to be dial tested. For example: www.aliyun.com.

  5. Click Start now.

  6. In the dial test results area, check the website's availability, first packet time, first screen time, full load time, and other indicators, as well as the detailed data list for each monitoring point.

  7. In the detailed data list, click the details on the right side of a monitoring point to view its detailed performance indicators and page elements.

Performance

page elements

(2) Initiate a comparison test

You can also initiate a comparison test through the Alibaba Cloud website speed testing platform to understand the performance difference between two websites.

Here, we simulate access to Alibaba Cloud and another cloud vendor through the China Telecom, China Mobile, and China Unicom networks from 34 provincial capital cities across the country, to demonstrate how to use the Alibaba Cloud website speed testing platform to compare the performance of two websites.

  1. Log in to the Alibaba Cloud website speed testing platform.

  2. Select the dial test type. Here, select website speed test.

  3. Click the drop-down box below the dial test type and select the monitoring points. Here, the selected operators are China Telecom, China Mobile, and China Unicom, and the selected regions are 34 provincial capital cities across the country.

  4. Click Compare and Dial Test, and then enter the addresses of the web applications to be compared. For example: www.aliyun.com and www.XXcloud.com.

  5. Click Start now.

  6. In the dial test results area, check the two websites' availability, first packet time, first screen time, full load time, and other indicators, as well as the detailed data list for each monitoring point.

  7. In the detailed data list, click the details on the right side of a monitoring point to view its detailed performance indicators and page elements.

(3) Initiate a performance test

  1. Log in to the Alibaba Cloud website speed testing platform.

  2. Select performance stress test.

  3. Enter the address of the web application to be stress tested. For example: www.example.com/api/test
    Note: Make sure that you have permission to stress test this URL. You bear all legal consequences of stress testing URLs for which you do not have permission.

  4. Check the interface's performance indicators during the stress test.

Latest events & free trials

Performance Test PTS Practice Training Camp is in full swing!

Join the training camp to receive a free quota of 5,000 VUM and quickly get started with Performance Test PTS: simulate real users to run high-traffic, high-concurrency stress tests against business systems, verify cloud product specification choices, and locate application performance bottlenecks.

Participate now: https://developer.aliyun.com/trainingcamp/f8400b45d23c4bdf86af0c9d6711de7b

Free trial

Cloud Dial Test provides 3,000 free dial tests every month. Click the link to claim the free quota, see website performance in real time, and quickly start a website speed test.

https://free.aliyun.com/?product=9760242,9602838&spm=5176.28055625.J_5831864660.9.1649154aJ7iiyZ

Related Links:

[1] Alibaba Cloud website speed testing platform

https://cesu.pts.aliyun.com/


Origin my.oschina.net/u/3874284/blog/10117270